Third Post: Statistical Inference and Regression Models
Before reaching three months without any posts, I thought to myself: I'd better write my third post soonish, otherwise it would seem that I haven't been productive since the last post. (Well, maybe that's partly true, since the coronavirus lockdown has been relaxed in Germany :D.)
During this time, I went through two courses from the Data Science Specialization: Statistical Inference and Regression Models.
Statistical Inference
The Statistical Inference course covers topics like probability, expectations, variance, asymptotics, hypothesis testing, p-values, multiple testing, and resampling. You can find a more detailed description of the course at https://www.coursera.org/learn/statistical-inference#syllabus. The course is taught by Brian Caffo and includes an online script of the lectures. Alternatively, you can buy a PDF version of the script. The script is helpful and not helpful at the same time: it is helpful because, instead of watching all the videos again to review the material, you can just read the script. At other times, it is not very helpful, since you need the videos to understand the script. So I would recommend watching the videos first and not using the script as if it were a standalone book that you could understand without further material. Besides this, all the code used in the lectures is on GitHub.
I think the course provides a good refresher of all the relevant material needed for data science for someone like me who had formal statistics and econometrics lectures at university. For the most part, it doesn't go very deep into the technical details, and in some parts of the course you can actively choose to skip them if you are not so much into math. I really like this practical approach to statistics, because sometimes you need to grasp the basic idea before going into the details. That's the opposite of my formal stats class, which got too technical before I had the chance to understand the idea. (That was one of the reasons I lost interest in statistics at university :D)
One topic worth mentioning in this short post, and one that was more than a review for me, is resampling and bootstrapping. The question bootstrapping tries to answer is: the sampling distribution of a statistic is not always available in closed form, so what should we do in that case? The basic idea of bootstrapping is that we can approximate the sampling distribution of a statistic by sampling with replacement from the observed data and then calculating the statistic of interest. By repeating this many times, we get a good approximation of the sampling distribution of the statistic of interest.
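To make this concrete, here is a minimal sketch in Python (the course itself works in R) that bootstraps the median of a made-up sample; the data and the number of resamples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up observed sample, just for illustration
data = rng.exponential(scale=2.0, size=50)

n_resamples = 10_000
boot_medians = np.empty(n_resamples)

# Resample with replacement from the observed data and
# recompute the statistic of interest each time
for i in range(n_resamples):
    resample = rng.choice(data, size=data.size, replace=True)
    boot_medians[i] = np.median(resample)

# The empirical distribution of boot_medians approximates the
# sampling distribution of the median, e.g. a 95% percentile interval:
lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"sample median: {np.median(data):.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```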
So, if you’re looking to review some statistical inference knowledge and maybe learn something new, this is a good course!
Regression Models
The Regression Models course covers ordinary least squares (OLS), multivariable regression, and generalized linear models (GLM), and it has some great chapters about residual diagnostics. You can find the full syllabus of the course at https://www.coursera.org/learn/regression-models#syllabus. The course is taught by Brian Caffo as well, and the learning material comes in the same format as in Statistical Inference. For me personally, the course was more than a review, because I learned about GLM, a very helpful method for solving classification problems. Nevertheless, in my opinion, the GLM module was kept relatively short, some concepts were covered at too high a level, and I needed to google additional information to understand the whole GLM process. Given the importance of GLM for machine learning, I think it is best to cover GLM in a separate blog post. Besides this, the course extensively covers residual diagnostics, and some concepts taught here were also new to me (or at least I don't recall learning them at university).
When analyzing residuals, it is important to investigate whether an outlier has influence or not. An outlier can arise for several reasons: measurement errors, data entry errors, a genuine extreme observation, among others. Excluding or including an outlier in the regression analysis can have huge (unintended) effects on the results; for instance, it could flip the sign of a regressor's coefficient or shrink its estimated effect. To assess whether an outlier should be included in the regression analysis or not, we have to examine its leverage and influence. With leverage, we measure how far an observation's x value is from the "cloud" of the other x values. Influence, on the other hand, measures how much impact a point has on the regression fit. An influential point will typically have high leverage, but a high-leverage point is not necessarily influential. These are the two concepts to keep in mind when analyzing residuals.
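As a rough sketch of how you might compute these two measures in Python (the course itself uses R's built-in influence measures): leverage corresponds to the diagonal of the hat matrix, and Cook's distance is a common summary of influence. The simulated data and the injected outlier below are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Made-up data: a simple linear relationship plus noise
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)

# Inject one point far from the x "cloud" (high leverage)
# that also sits far off the line (likely influential)
x = np.append(x, 25.0)
y = np.append(y, 0.0)

X = sm.add_constant(x)  # design matrix with an intercept column
fit = sm.OLS(y, X).fit()

influence = fit.get_influence()
leverage = influence.hat_matrix_diag   # hat values: distance from the x cloud
cooks_d = influence.cooks_distance[0]  # Cook's distance: impact on the fit

# The injected point (last index) should stand out on both measures
print(f"leverage of the outlier: {leverage[-1]:.3f} (average: {leverage.mean():.3f})")
print(f"Cook's distance of the outlier: {cooks_d[-1]:.3f}")
```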
In this entry, I have briefly summarized my progress since my last post. Now I have started the Practical Machine Learning course and hope to finish it soon!