This function \(h\) is called a hypothesis.
The accuracy of our hypothesis function \(h\) is measured using a cost (loss) function. One common choice for linear regression is the squared error function (also known as "mean squared error").
\[J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2\]
The mean is halved (the \(\frac{1}{2}\) factor) to simplify the gradient computation for gradient descent: differentiating the squared term produces a factor of 2 that cancels the \(\frac{1}{2}\).
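As a quick illustration, the cost can be computed directly in R (a minimal sketch; the data and parameter values below are made up):

```r
# Squared error cost J(theta0, theta1) for simple linear regression
cost <- function(theta0, theta1, x, y) {
  m <- length(y)
  h <- theta0 + theta1 * x      # hypothesis h_theta(x)
  sum((h - y)^2) / (2 * m)      # halved mean of the squared errors
}

# Toy data (hypothetical)
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 8.1)
cost(0, 2, x, y)
```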
Why squared loss (and not absolute loss)?
The absolute value is not convenient, because it doesn't have a continuous derivative, which makes the function not smooth. Functions that are not smooth create unnecessary difficulties when employing linear algebra to find closed form solutions to optimization problems. Closed form solutions to finding an optimum of a function are simple algebraic expressions and are often preferable to using complex numerical optimization methods, such as gradient descent (used, among others, to train neural networks). — The Hundred-Page Machine Learning Book
To have an inverse, a matrix must be square (\(N_{row} = N_{col}\)); being square is necessary but not sufficient.
Inverse of a value vs. inverse of a matrix:

* When we multiply a number by its reciprocal, we get 1.
* When we multiply a matrix by its inverse, we get the identity matrix: \(A \times A^{-1} = A^{-1} \times A = I\).
For a \(2 \times 2\) matrix \(A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\), the inverse is:

\[A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
According to the invertible matrix theorem, the inverse does not exist when the determinant (\(ad - bc\) in the \(2 \times 2\) case) is zero; such a matrix is called "singular".
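In R, a matrix inverse can be checked with `solve()` and `det()` (a small sketch with made-up matrices):

```r
A <- matrix(c(4, 7, 2, 6), nrow = 2)  # 2x2 matrix with det(A) = 10
solve(A)              # the inverse A^{-1}
A %*% solve(A)        # the 2x2 identity matrix (up to rounding)

B <- matrix(c(1, 2, 2, 4), nrow = 2)  # columns are linearly dependent
det(B)                # 0, so B is singular and solve(B) throws an error
```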
# Packages for data science: statistical analysis for high-dimensional data
install.packages("e1071")
# Multiclass Logistic Regression
install.packages("glmnet")
install.packages(c("lar","RandomForest","rpart","SIS","tilting"))
# Packages for data science: survival analysis case study
install.packages(c("survival","mstate","p3state.msm","msm"))
Slides and Resources are here.
Slides and Resources are here.
Comparing, say, a 75-year-old with a 70-year-old whose other covariates are identical, every term except age cancels:

\[HR = e^{\beta_1(75-70) + \beta_2(Sex-Sex) + \beta_3(pstat-pstat) + \beta_4(mspike-mspike)} = e^{5\beta_1}\]
Given \(\beta_1 = 0.06\), \(HR = \exp(5 \times 0.06) = 1.35\): the hazard of the event increases by about 35% per 5 units (years) of age, controlling for the other factors.
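A one-line check of that arithmetic in R:

```r
exp(5 * 0.06)  # 1.3499: a 5-year age increase multiplies the hazard by ~1.35
```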
Even though \(h_0(t)\) is left unspecified, we can still estimate the \(\beta\)s, and \(S(t,x)\) can be estimated under minimal assumptions. There are two common techniques for adjusting the partial likelihood for tied lifetimes: Breslow and Efron.
The code can be found in the Google Drive folder.
library(survival)
mgus.data<-read.csv("C:/Users/ytian/Google Drive/PhD/Data Science Bootcamp June 10-21 2019/D3 Code/mgus.data.csv",header = TRUE)
Slides and R Markdown code can be found here.
install.packages(c("here","olsrr","modelr","broom","caret","neuralnet","DescTools","PredictABEL"))
library(magrittr)
library(here)
library(olsrr)
library(modelr)
library(neuralnet)
library(dplyr)
library(PredictABEL)
library(ggplot2)
library(caret)
library(ROCR)
library(broom)
library(DescTools)
diabetes<-read.csv("C:/Users/ytian/Google Drive/PhD/Data Science Bootcamp June 10-21 2019/C3 Case Study_Diabetes/diabetes_data_full.csv")
diabetes_data <- read.csv("C:/Users/ytian/Google Drive/PhD/Data Science Bootcamp June 10-21 2019/C3 Case Study_Diabetes/diabetes_analytic_data.csv")