00 Review and bonus clickers

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 04 December 2024


Big picture

  • What is a model?
  • How do we evaluate models?
  • How do we decide which models to use?
  • How do we improve models?

General stuff

  • Linear algebra (SVD, matrix multiplication, matrix properties, etc.; see the sketch after this list)
  • Optimization (take the derivative and set to 0, gradient descent, Newton’s method, etc.)
  • Probability (conditional probability, Bayes’ rule, etc.)
  • Statistics (likelihood, MLE, confidence intervals, etc.)
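As a quick refresher tying the linear algebra and statistics threads together, here is a minimal sketch (Python with numpy, on simulated data; the coefficients are made up for illustration) of solving least squares through the SVD:

```python
import numpy as np

rng = np.random.default_rng(406)
n, p = 100, 3
X = rng.normal(size=(n, p))            # simulated design matrix
beta = np.array([1.0, -2.0, 0.5])      # "true" coefficients, made up for illustration
y = X @ beta + rng.normal(scale=0.1, size=n)

# Thin SVD of X: X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# OLS solution via the SVD: beta_hat = V D^{-1} U^T y
beta_hat = Vt.T @ ((U.T @ y) / d)
print(beta_hat)                        # should be close to beta
```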

1. Model selection

  • What is a statistical model?
  • What is the difference between training error, test error, and risk?
  • What is the (theoretical) predictor with the lowest risk for regression? For classification?
    • Why can we not obtain these predictors in practice?
  • What is the bias-variance tradeoff?
  • What is the goal of model selection?
  • What is the difference between AIC / BIC / CV / held-out validation? (see the sketch after this list)
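As a sketch of the cross-validation idea referenced above (scikit-learn on simulated data; the fold count and model are arbitrary choices), K-fold CV estimates risk by averaging held-out errors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=100)

# 5-fold CV: fit on 4 folds, measure squared error on the held-out fold
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[test_idx] - model.predict(X[test_idx])
    fold_errors.append(np.mean(resid**2))

print(np.mean(fold_errors))   # CV estimate of the risk
```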

2. Regression

  • What do we mean by regression?
  • What is regularization?
    • What is the goal of regularization?
    • What is the difference between L1 and L2 regularization? (see the sketch after this list)
  • How do we do non-linear regression?
    • What are splines?
    • What are kernel smoothers?
    • What is k-nearest neighbours (KNN)?
    • What are decision trees?
  • What is the curse of dimensionality?
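On the L1-vs-L2 item, a minimal sketch (scikit-learn, simulated data, arbitrary penalty strengths) of the qualitative difference: ridge shrinks all coefficients toward zero, while lasso sets many exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)   # only 2 signal features

ridge = Ridge(alpha=10.0).fit(X, y)   # L2: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.5).fit(X, y)    # L1: zeros out many coefficients entirely
print(np.round(ridge.coef_, 2))       # all entries small but nonzero
print(np.round(lasso.coef_, 2))       # most entries exactly 0.0
```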

3. Classification

  • What is classification?
    • What is the difference between generative and discriminative classification models?
  • What is a decision boundary? When is it linear?
  • Compare logistic regression to discriminant analysis.
    • What are the assumptions made by each method?
    • What are the shapes of the decision boundaries?
  • What are the positives and negatives of trees?
  • How do we measure performance of classification beyond 0-1 loss?
    • What is a probabilistic notion of classification performance? (see the sketch after this list)
    • How do we measure the goodness of uncertainty estimates?
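For the last two items, a sketch (scikit-learn, simulated data, evaluated in-sample for brevity) of performance measures that use the predicted probabilities rather than just the predicted labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, zero_one_loss

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
p_hat = clf.predict_proba(X)[:, 1]          # estimated P(Y=1 | X)

print(zero_one_loss(y, clf.predict(X)))     # 0-1 loss: ignores confidence
print(log_loss(y, p_hat))                   # negative log-likelihood: punishes overconfidence
print(brier_score_loss(y, p_hat))           # squared error on the probabilities
```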

4. Modern methods

  • What is the bootstrap?
  • What is the difference between bagging and boosting? (see the sketch after this list)
    • When do we prefer one over the other (think bias-variance tradeoff)?
  • What is the difference between random forests and bagging?
  • How do we understand neural networks?
    • What is the difference between neural networks and other non-linear methods?
    • What is the difference between increasing width and increasing depth? (Number of parameters, expressivity)
    • How do we train neural networks? What is backpropagation?
    • Why are we surprised that neural networks “work”?
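A sketch of the bagging/boosting contrast (scikit-learn; the data and hyperparameters are arbitrary): bagging averages deep, high-variance trees to reduce variance, while boosting adds shallow, high-bias trees sequentially to reduce bias:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=4)

# Bagging: deep (low-bias, high-variance) trees, averaged to cut variance
bag = BaggingRegressor(DecisionTreeRegressor(max_depth=None), n_estimators=100)

# Boosting: shallow (high-bias, low-variance) trees, stacked to cut bias
boost = GradientBoostingRegressor(max_depth=2, n_estimators=100)

for model in (bag, boost):
    print(model.fit(X, y).score(X, y))  # in-sample R^2; use CV for a risk estimate
```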

5. Unsupervised learning

  • What is unsupervised learning?
  • What is dimensionality reduction?
    • What is the difference between PCA and KPCA? (see the sketch after this list)
    • What do the principal components represent?
  • What is clustering?
    • What is the difference between k-means and hierarchical clustering?
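A minimal sketch of PCA via the SVD (numpy, simulated data): center the data, decompose, and project onto the leading principal components:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features

Xc = X - X.mean(axis=0)                 # PCA works on centered data
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt[:2].T                  # project onto the first 2 principal components
var_explained = d**2 / np.sum(d**2)     # proportion of variance per component
print(scores.shape, np.round(var_explained, 3))
```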

Pause for course evals

Currently at 51/144.

A few clicker questions

The singular value decomposition applies to any matrix.


  1. True
  2. False
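A quick numpy check of this statement (the matrix here is arbitrary, and not even square):

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)                 # an arbitrary non-square matrix
U, d, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD
print(np.allclose(A, U @ np.diag(d) @ Vt))        # reconstructs A exactly
```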

Which of the following is true about the training error?


  1. It will decrease as we add interaction terms
  2. It will decrease as we add more training data
  3. It will decrease as we add more regularization
  4. It will decrease as we remove useless predictors

(Multiple answer)

Which of the following is an advantage of using LOO-CV over k-fold CV?


  1. The bias of LOO-CV, as a risk estimator, is lower than that of k-fold CV.
  2. The variance of LOO-CV, as a risk estimator, is lower than that of k-fold CV.
  3. It can be computed more quickly than k-fold CV for kernel smoothers.
  4. It can be computed more quickly than k-fold CV for ridge regression.

(Multiple answer)
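For reference, the fact behind the computational options: ridge regression and kernel smoothers are both linear smoothers, \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\) (for ridge, \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\)), and for a linear smoother LOO-CV is available from a single fit:

\[ \text{LOO-CV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2, \]

where \(h_{ii}\) is the \(i\)th diagonal entry of \(\mathbf{H}\), so no refitting across \(n\) folds is required.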

Which of the following reduce the bias of linear regression?


  1. Adding a ridge penalty
  2. Adding a lasso penalty
  3. Adding interaction terms / nonlinear basis functions
  4. Adding more training data

(multiple answer)

The decision boundary for classification problems…


  1. Is the set of points where \(P(Y=1|X) = P(Y=0|X)\)
  2. Is the set of points where \(P(Y=1|X) / P(Y=0|X) = P(Y=1) / P(Y=0)\)
  3. Is the set of points where \(P(Y=1|X) / P(Y=0|X) = P(Y=0) / P(Y=1)\)
  4. Is linear for all discriminant analysis predictors

(multiple answer)
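For reference, the boundary is defined by equality of the posteriors; Bayes' rule rewrites that condition in terms of the class-conditional densities and the prior class probabilities:

\[ P(Y=1 \mid X=x) = P(Y=0 \mid X=x) \quad\Longleftrightarrow\quad \frac{p(x \mid Y=1)}{p(x \mid Y=0)} = \frac{P(Y=0)}{P(Y=1)}. \]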

Which of the following properties of boosting are true?


  1. The risk can be estimated without a holdout set
  2. The component predictors can be trained in parallel
  3. The predictive uncertainty can be estimated by the variance of the predictors
  4. The bias of the ensemble is lower than the bias of the component predictors
  5. The variance of the ensemble is lower than the variance of the component predictors

(multiple answer)

Which of the following statements are true about PCA and KPCA?


  1. PCA requires specifying the number of principal components, while KPCA does not
  2. KPCA requires the data to be centered, while PCA does not
  3. PCA is a linear method, while KPCA is a non-linear method
  4. After performing KPCA, the principal components can be used to reduce the dimensionality of new (previously unseen) test data
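A sketch of the out-of-sample behaviour in question (scikit-learn's PCA and KernelPCA on simulated data; the kernel and component count are arbitrary): both fit components on training data and then project new points:

```python
import numpy as np
from sklearn.decomposition import KernelPCA, PCA

rng = np.random.default_rng(6)
X_train, X_test = rng.normal(size=(80, 5)), rng.normal(size=(20, 5))

for model in (PCA(n_components=2), KernelPCA(n_components=2, kernel="rbf")):
    model.fit(X_train)
    Z = model.transform(X_test)   # project unseen data onto the learned components
    print(type(model).__name__, Z.shape)
```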