class: center, middle

# Introduction to Machine Learning

Hugo RICHARD

.affiliations[
  Criteo
]

.footnote.tiny[Credits: Most of the slides are from Mathieu Blondel (see [here](https://data-psl.github.io/lectures2021/slides/02_intro_to_machine_learning/#1)). Much of the content and many figures are borrowed from the scikit-learn [mooc](https://github.com/inria/scikit-learn-mooc/) and scipy [lecture notes](https://scipy-lectures.org/packages/scikit-learn/index.html).]

---

# Scope of this lecture

- A 1h30 tutorial for absolute beginners in machine learning

- High-level concepts with almost zero maths

--

What will be covered later this week but not today:

- Model-specific explanations (linear models, trees, neural networks)

- scikit-learn

- Optimization

- Deep learning

---

# Outline

- Supervised learning
  - Classification
  - Regression
  - Generalization

- Model selection
  - Model complexity
  - Overfitting
  - Bias / variance trade-off

---

class: center, middle

# Supervised learning

---

# Supervised learning

- Given some input, **predict** some output

- Input domain $\mathcal{X}$

- Output domain $\mathcal{Y}$

- Prediction function $f \colon \mathcal{X} \to \mathcal{Y}$

- Calling the function: $y = f(\mathbf{x})$

---

# Supervised learning

Example 1: Classifying cats and dogs

- $\mathcal{X} = \text{ Set of all grayscale images of size 64 x 64 }$

- $\mathcal{Y} = \\{0, 1 \\}$ where $0$ means "dog" and $1$ means "cat"

- Given $f \colon \mathcal{X} \to \mathcal{Y}$, $f(\mathbf{x}) = 1$ means the image is a cat according to the model

---

# Supervised learning

Example 2: Predicting salary given age

- $\mathcal{X} = \mathbb{R}$ (Set of all possible ages)

- $\mathcal{Y} = \mathbb{R}$ (Set of all possible salaries)

- Given the model $f \colon \mathcal{X} \to \mathcal{Y}$

- If $\mathbf{x}$ is the age of a person, $f(\mathbf{x})$ is their predicted salary according to the model

---

# Supervised learning

- Learning from examples (inductive learning)

_Training phase_

- $X_{train} = (x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$

  Ex: $n$ images of cats or dogs

- $f$ is chosen based on $X_{train}$ via a *training algorithm*

_Inference phase_

- Use $f$ on unseen data $x_{test} \in \mathcal{X}$

---

# Supervised learning

Feature extraction or feature engineering

- Samples are often in a raw format

- It is necessary to extract useful features (descriptors) for the prediction task

- Traditional ML: human-designed extraction rules

- Deep learning: learned feature representations

---

# Supervised learning

Feature extraction or feature engineering

- Predictions as downstream features

  Example: "is there a pedestrian?" can be used as a feature for automatic braking

Example 2: Input data = age $a$ of the person

Input feature is whether the person is older than $50$

The function $a \mapsto 1_{a > 50}$ is called a *feature extractor*

We then have $\mathcal{X} = \\{0, 1\\}$ instead of $\mathcal{X} = \mathbb{R}$

---

# Supervised learning

- Training phase

.center[
]

- Inference (prediction) phase

.center[
]

---

# Classification

- Given an input, predict the class (category) it belongs to

- $y \in \mathcal{Y}$ is a categorical variable, i.e. $\mathcal{Y}$ is discrete

- Examples

  - Given the picture of an astronomical object, determine whether it is a star, a galaxy, etc.

  - Given the picture of a person, identify the person

  - Given an email, detect whether it is spam or not

---

# Types of classification

- Binary classification: $\mathcal{Y} = \\{0, 1\\}$

- Multiclass classification: $\mathcal{Y} = [K] = \\{1, \dots, K\\}$

- Multi-label classification: $\mathcal{Y} = \\{ 0, 1 \\}^K$

Example: The model should predict whether a dog, a lion, or a cat is in the image

We choose $\mathcal{Y} = \\{ 0, 1 \\}^3$

$f(\mathbf{x}) = (1, 1, 0)$ means the image $\mathbf{x}$ contains a dog and a lion but no cat according to the model.

---

# Regression

- Given an input, predict a real (continuous) variable associated with it

- $y \in \mathcal{Y} = \mathbb{R}$ is a real variable

- Examples

  - Given the profile of a person, predict his/her salary

  - Given house characteristics, predict its price

  - Given a patient's DNA information, predict his/her life expectancy

  - Given camera input, predict steering wheel position (autonomous driving)

---

# The iris dataset

- A famous dataset in statistics and machine learning

.center[
Setosa
Versicolor
Virginica
]

.small[

| Sepal length | Sepal width | Petal length | Petal width | | Iris type  |
| ------------ | ----------- | ------------ | ----------- |-| ---------- |
| 6cm          | 3.4cm       | 4.5cm        | 1.6cm       | | versicolor |
| 5.7cm        | 3.8cm       | 1.7cm        | 0.3cm       | | setosa     |
| 6.5cm        | 3.2cm       | 5.1cm        | 2cm         | | virginica  |
| 5cm          | 3cm         | 1.6cm        | 0.2cm       | | setosa     |

]

---

# Data matrix and target vector

$$
\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1 \\\\
\mathbf{x}_2 \\\\
\mathbf{x}_3 \\\\
\mathbf{x}_4 \\\\
\vdots
\end{bmatrix} =
\begin{bmatrix}
6 & 3.4 & 4.5 & 1.6 \\\\
5.7 & 3.8 & 1.7 & 0.3 \\\\
6.5 & 3.2 & 5.1 & 2 \\\\
5 & 3 & 1.6 & 0.2 \\\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}
\quad
\mathbf{y} =
\begin{bmatrix}
y_1 \\\\
y_2 \\\\
y_3 \\\\
y_4 \\\\
\vdots
\end{bmatrix} =
\begin{bmatrix}
1 \\\\
0 \\\\
2 \\\\
0 \\\\
\vdots
\end{bmatrix}
$$

$\mathbf{X}$ is called the data matrix / design matrix and consists of $n$ samples (observations) with $d$ features (attributes)

$\mathbf{x}_i$ is called a feature vector

$\mathbf{y}$ is called the target or label vector

$y_i$ is called a target or label

---

# Samples as points in a space

.center[
]

Samples from the iris dataset live in a 4-dimensional space; we only use 2 dimensions for visualization purposes.

---

# Inferring rules from data
Figure panels: Sepal length, Sepal width, Petal length, Petal width
We can *infer* from the data that setosa irises have small petals.

Learning algorithms automate this process!

---

# Learning a model from examples

- A model can be seen as a function $f$ that maps a feature vector $\mathbf{x}$ (input) to a target $y$ (output).

- Given input-output pairs (training examples) $(\mathbf{x}_1, y_1)$, $(\mathbf{x}_2, y_2)$, ..., $(\mathbf{x}_n, y_n)$, we want to find an $f$ such that

$$
f(\mathbf{x}_i) \approx y_i
$$

---

# Example: the nearest neighbor algorithm

$f(\mathbf{x})$ searches for the training example $\mathbf{x}_i$ with smallest distance $d(\mathbf{x}, \mathbf{x}_i)$ and returns the label $y_i$ associated with this example.

.center[
]

---

# US Census data

.super-tiny[

| Age | Workclass | Education    | Marital-status     | Occupation         | Relationship | Race  | Sex  | Capital-gain | Hours-per-week | Native-country | Class |
| --- | --------- | ------------ | ------------------ | ------------------ | ------------ | ----- | ---- | ------------ | -------------- | -------------- | ----- |
| 25  | Private   | 11th         | Never-married      | Machine-op-inspct  | Own-child    | Black | Male | 0            | 40             | United-States  | <=50K |
| 38  | Private   | HS-grad      | Married-civ-spouse | Farming-fishing    | Husband      | White | Male | 0            | 50             | United-States  | <=50K |
| 28  | Local-gov | Assoc-acdm   | Married-civ-spouse | Protective-serv    | Husband      | White | Male | 0            | 40             | United-States  | >50K  |
| 44  | Private   | Some-college | Married-civ-spouse | Machine-op-inspct  | Husband      | Black | Male | 7688         | 40             | United-States  | >50K  |

]

Does this person make more than 50K per year?

---

# Generalizing

- In machine learning, we want to predict on new data instances. That is, for a feature vector $\mathbf{x}$ not seen in the training set, we want to compute $f(\mathbf{x})$.

- In the census example, we want to predict the income of new individuals, with a combination of jobs and demographics that we have never seen.

- Many sources of variability: age, workclass, education, marital-status, occupation, ...
*+* Noise: unexplainable variance

---

# Memorizing

* Store all known individuals (the census)

* Given a new individual, predict the income of its closest match in our database (nearest neighbor predictor)

Trying out this strategy on the data we have, the census, *what error rate do we expect?*

--

> **0 errors**

--

.red[Yet, we will make errors on **new** data]

---

# Generalizing ≠ Memorizing

.center["test" data ≠ "train" data]

.pull-left[Data on which the predictive model is applied]

.pull-right[Data used to learn the predictive model]

* Different sampling of noise

* Unobserved combination of features

---

# Other types of learning

- Unsupervised: learning without human annotations

  - Clustering: discovers groups (clusters) in the data

  - Dimensionality reduction

  - Anomaly detection

- Semi-supervised learning: learning from both annotated and non-annotated data

- Active learning: the algorithm can choose what samples need to be annotated

- And many others...

---

class: center, middle

# Model selection

---

# Model complexity

- There exists a wealth of machine learning models, some more complex and expressive than others.

- What are the implications of model complexity?

- How do we pick one model rather than another?

.center[
Linear
Non-linear
]

---

## Varying the degree of a polynomial

.center[
]

.footnote.tiny[Reminder: a univariate polynomial of degree $D$ is given by $f(x) \triangleq w_0 + w_1 x + w_2 x^2 + \dots + w_D x^D$.]

---

# Which model do you prefer?

.pull-left[
]

.pull-right[
]

---

# Which model do you prefer? (new data)

.pull-left[
]

.pull-right[
]

---

# Bias / Variance trade-off

.center[
]

- High bias: the model is too simple / constrained (underfit)

- High variance: the model is too expressive / flexible (overfit)

- Need to find a good trade-off

---

# Underfit versus overfit

.center[
High bias
High variance
]

---

# Train vs. test error

- Train set error: how well did the learning algorithm tune the model parameters

- Test set error: how well does the model generalize to new data

---

# Train vs. test error: effect of model complexity

.pull-left[
]

.pull-right[
]

---

# Train vs. test error: effect of sample size

.pull-left[
]

.pull-right[
]

---

# Train vs. test error: effect of sample size

.center[
]

.center[Diminishing returns phenomenon]

---

# Hyperparameters

- Parameters not directly tuned by the training algorithm

- Need to be decided before running the training algorithm

- Typically control complexity

- Examples

  - the degree parameter (polynomial regression)

  - the regularization parameter (linear models)

  - the number of layers (deep learning)

  - the step size / learning rate (SGD)

---

# Three types of data sets

- Training set: used for fitting the model parameters

- Validation set: used for picking the "best" hyperparameters

- Test set: used for estimating the final model's quality (as a proxy for generalization accuracy), once the hyperparameters are decided

---

# Train vs. validation accuracy

- Test data should never be used for choosing hyperparameters

- Use validation data instead (aka held-out data)

.center[
]

---

# Holding out data

- During the development phase, we often have a single dataset

- We need to create validation and test sets one way or another

- Simplest strategy: set aside some data once and for all

- Drawbacks: we lose training data, and we could make a "lucky" split

- Typically fine if the original dataset is very large

---

# Cross-validation strategies

- Repeatedly split the data into training and validation sets and average the accuracy scores

- More robust estimation of the validation accuracy, but requires training the model several times

- The same can also be done for the test accuracy

- Possibility to use nested cross-validation (costly)

---

# K-fold strategy

- Set aside $1/K$ of the data for validation

- Repeat $K$ times, each time holding out a different fold

.center[
]

- Leave-one-out: $K = n$, where $n$ is the dataset size

---

# Take home messages

* Machine learning is about extracting from *data* models that *generalize* to new observations.

* Bias (underfit) vs. variance (overfit) trade-off.

* Never tune hyperparameters / model complexity against the test data; use validation data or cross-validation instead.

---

# This afternoon: lab work

- Introduction to Python and NumPy

- Either use Google Colab (needs a Gmail account)

- Or download the notebook locally (needs a Python environment)

- Link available at https://github.com/data-psl/lectures2024
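
---

# Appendix: nearest neighbor in a few lines

The nearest neighbor rule from the earlier slide can be sketched in plain Python. This is only a minimal illustration, assuming Euclidean distance for $d$ (the slides do not fix a distance). The name `nearest_neighbor_predict` and the `new_flower` values are made up for the example; the four training rows are the iris rows from the table.

```python
import math


def nearest_neighbor_predict(x, train_X, train_y):
    """Return the label of the training example closest to x (1-NN)."""
    # Euclidean distance d(x, x_i) to every training example (an assumption).
    distances = [math.dist(x, x_i) for x_i in train_X]
    # Index of the smallest distance; ties go to the earliest example.
    best = min(range(len(train_X)), key=distances.__getitem__)
    return train_y[best]


# The four iris rows from the slides: sepal length, sepal width,
# petal length, petal width (cm); 0 = setosa, 1 = versicolor, 2 = virginica.
train_X = [
    (6.0, 3.4, 4.5, 1.6),
    (5.7, 3.8, 1.7, 0.3),
    (6.5, 3.2, 5.1, 2.0),
    (5.0, 3.0, 1.6, 0.2),
]
train_y = [1, 0, 2, 0]

# Memorizing: each training point is at distance 0 from itself,
# so predicting on the training set makes no errors.
train_errors = sum(
    nearest_neighbor_predict(x, train_X, train_y) != y
    for x, y in zip(train_X, train_y)
)

# A hypothetical unseen flower (values invented for illustration).
new_flower = (5.1, 3.5, 1.4, 0.2)
prediction = nearest_neighbor_predict(new_flower, train_X, train_y)
```

Here `train_errors` is zero by construction, which is the "memorizing" effect: it says nothing about how often the rule is wrong on new flowers.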
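
---

# Appendix: K-fold splits as index lists

The K-fold strategy can be sketched as a function that produces, for each fold, the indices used for training and the indices held out for validation. This is an illustrative sketch, not the scikit-learn implementation; `k_fold_indices` is a hypothetical helper, and it assumes the samples were already shuffled if their order matters.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index lists for each of the k folds."""
    # Fold sizes differ by at most one when k does not divide n.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # Hold out one contiguous block for validation...
        val_idx = list(range(start, start + size))
        # ...and train on everything else.
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


# 10 samples, K = 5: each fold holds out 2 samples and trains on 8.
folds = list(k_fold_indices(n=10, k=5))

# Leave-one-out is the special case K = n.
leave_one_out = list(k_fold_indices(n=4, k=4))
```

To estimate validation accuracy, one would train on `train_idx`, score on `val_idx` for each fold, and average the $K$ scores.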
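
---

# Appendix: holding out validation and test data

The simplest hold-out strategy (set aside some data once and for all) might look as follows. This is a sketch under assumptions: the helper name `train_val_test_split`, the 60/20/20 proportions and the fixed seed are illustrative choices, not values prescribed by the lecture.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut the data into train / validation / test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]                # touched only for the final estimate
    val = shuffled[n_test:n_test + n_val]   # used to pick hyperparameters
    train = shuffled[n_test + n_val:]       # used to fit model parameters
    return train, val, test


# 100 hypothetical sample indices split 60 / 20 / 20.
train, val, test = train_val_test_split(range(100))
```

With a different `seed` the three parts change, which is the "lucky split" risk mentioned on the hold-out slide and one motivation for cross-validation.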