Statistical modelling for Data Science

This book is a compilation of STAT301 class notes written by Professor Gabriela Cohen Freue with contributions from Professors Rodolfo Lourenzutti and Alexi Rodriguez-Arelis.

Overview

This course introduces the main concepts of explanatory and predictive data analysis with multiple explanatory variables. By the end of the course, students are expected to be able to choose appropriate methods based on the statistical question and the data at hand. Special emphasis is placed on case studies and real data sets, as well as on reproducible and transparent workflows when writing computer scripts for analyses and reports.

Learning Outcomes

By the end of the course, students are expected to be able to:

  • Describe real-world examples of explanatory modelling problems (e.g., A/B testing optimization and regression with variable selection) and of predictive modelling problems.

  • Explain the trade-offs between model-based and non-model-based approaches, and describe situations where each might be the preferred approach.

  • Explain the difference between building models for explanation and building models for prediction, in terms of how you choose and evaluate models as well as how you interpret the results.

  • Choose and apply a suitable method (e.g., regression, GLMs, sample size estimation, controlling for multiple testing, peeking, bandit algorithms, variable selection, model diagnostics) based on the statistical question and the data at hand. Discuss the advantages and disadvantages of different methods that may be suitable for a given problem.

  • Correctly interpret computer output from the statistical analyses presented in this course, in the context of the statistical question being asked and the audience receiving the report.

  • Identify the assumptions and conditions required for each method to produce reliable results. Choose techniques to check (or at least attempt to falsify) those assumptions. Discuss the consequences of applying the wrong method to the question or data type.