class: center, middle

## Machine learning with scikit-learn

Hugo RICHARD

.affiliations[
  ENSAE, CREST
]

.footnote.tiny[Credits: Based on the [slides](https://data-psl.github.io/lectures2021/slides/04_scikit_learn/) of Mathieu Blondel and the [slides](https://data-psl.github.io/lectures2020/slides/04_scikit_learn/) of Arthur Mensch]

---
# Previously

Formal definition of supervised learning

- $x \in \mathbb{R}^p \to \hat y = f_\theta(x) \in \lbrace 0, 1 \rbrace$: binary classification
  - Is this email spam? Is this a dog?

- $x \in \mathbb{R}^p \to \hat y = f_\theta(x) \in \mathbb{R}$: regression
  - What is the height of this flower based on its age and species?

---
# This lecture

- Loading data
- Splitting data
- Defining and fitting estimators
- Grid search
- Clustering
- Companion libraries

---
# Training a supervised model

- We are provided with $n$ samples of "raw" data with labels $\mathbf{y} \in \mathbb{R}^n$

- General steps
  1. Transform the data into an $n \times p$ matrix $\mathbf{X}$ of $n$ samples with $p$ features
  2. Split the data into training and test sets
  3. Fit an estimator to the training data
  4. Evaluate the estimator on the test data

---
# A bit more formally

- Define $f_\theta$ with various parameters (**Instantiate**)
- Learn $\hat \theta$ with $(\mathbf{X}\_{train}, \mathbf{y}\_{train})$ (**Fit**)
- $\hat y\_{test} = f\_{\hat \theta}(X\_{test})$ (**Predict**)
- Compare $\hat y\_{test}$ and $y\_{test}$ (**Score**)
- Select the model $f$ with the best generalization (**Model selection**)

Scikit-learn provides an API to do all this easily

---
# Example

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris['data'], iris['target']  # type: np.ndarray

estimator = LinearSVC(C=1.0)  # optional hyperparameters go here
X_train, X_test, y_train, y_test = train_test_split(X, y)
estimator.fit(X_train, y_train)  # learns theta, stored inside the estimator
y_pred = estimator.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
```

---
class: center, middle

# Loading, splitting, preprocessing the data

---
# Loading data

Scikit-learn gives access to many *benchmark datasets*

- `load_XXX` loads toy datasets published with the library

```python
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris['data'], iris['target']  # type: np.ndarray
```

- `fetch_XXX` downloads bigger datasets from the internet
- `make_XXX` generates synthetic datasets (see the sketch below)
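For instance (a sketch; `fetch_openml` downloads and caches data from openml.org, `make_regression` draws a random synthetic problem):

```python
from sklearn.datasets import fetch_openml, make_regression

# Downloaded once from openml.org, then cached locally
mnist = fetch_openml('mnist_784', version=1)

# Synthetic regression problem: X is (100, 5), y is (100,)
X, y = make_regression(n_samples=100, n_features=5)
```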
---
# Getting the data into a numerical matrix

- scikit-learn works with numpy arrays: $X$ and $y$ must be converted to this format

.centering[
]

- For non-numeric fields (strings, categorical variables), a transformation is necessary

---
# Splitting the data

- Evaluate a model $f\_{\theta}$ on data it has never seen

- Split $\mathbf{X}$, $\mathbf{y}$ into training and evaluation examples
$$(\mathbf{X}\_{train}, \mathbf{y}\_{train}), (\mathbf{X}\_{test}, \mathbf{y}\_{test})$$

- Learn $\theta$ with $(\mathbf{X}\_{train}, \mathbf{y}\_{train})$, evaluate it on $(\mathbf{X}\_{test}, \mathbf{y}\_{test})$

---
# Splitting the data

- Utilities to separate training and validation data

```python
from sklearn.model_selection import train_test_split

# Hold out 40% of the data for test purposes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
```

- Other utilities to do it repeatedly: `KFold`, `ShuffleSplit`

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
# 2-fold cross-validation (n_splits cannot exceed the number of samples)
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

- `StratifiedKFold` additionally preserves the percentage of samples of each class in every fold

---
# Loading from dictionaries

```python
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
```

- Estimators with a `transform` method are called transformers
- `fit_transform` is equivalent to `fit` followed by `transform`

```python
vec.fit(measurements)
vec.transform(measurements)
```

---
# Categorical variables

- Categorical variables must be converted to a one-hot encoding
- Example

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
X = enc.fit_transform(X).toarray()
```

outputs

```
array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])
```

---
# Bag-of-word representations

```python
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
>>> vectorizer.transform(['Something completely new.']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
```

---
# Useful preprocessing

- Scale the columns: `$x_{i, j} = \frac{x_{i, j} - \text{Mean}(x_{:,j})}{\sqrt{\text{Var}(x_{:,j})}}$`
- So that each column has zero mean and unit standard deviation

```python
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
standard_scaler.fit(X)
X_t = standard_scaler.transform(X)
print(X_t.mean(axis=0))  # [0, 0, ..., 0]
print(X_t.std(axis=0))  # [1, 1, ..., 1]
```

- Impute missing values

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(X)  # X contains `np.nan`
X_imp = imputer.transform(X)  # X_imp no longer contains `np.nan`
```

- Always fit these transformers on the training set only, after the train/test split (see the sketch below)
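A minimal sketch of the leak-free pattern, reusing `X_train` and `X_test` from the split above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # statistics computed on the training data only
X_train_t = scaler.transform(X_train)
X_test_t = scaler.transform(X_test)  # same transformation, no test leakage
```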
---
# Dimensionality reduction

- Reduce the dimension of the $X$ matrix
- Principal component analysis: project $X$ onto the few orthogonal directions that capture the most variance

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
pca.fit(X)
X_t = pca.transform(X)  # X_t.shape = (n_samples, 10)
```

---
# Feature selection

- Perform a $\chi^2$ test between each feature and the target, and keep only the two best features

```python
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> X, y = load_iris(return_X_y=True)
>>> X.shape
(150, 4)
>>> fs = SelectKBest(chi2, k=2)
>>> X_new = fs.fit_transform(X, y)
>>> X_new.shape
(150, 2)
```

---
class: center, middle

# Defining, fitting, using and evaluating estimators

---
# Defining an estimator

```python
from sklearn.svm import LinearSVC
estimator = LinearSVC(C=1.0)
```

- Many different models available (see the documentation)
- Examples:
  - Linear models (`Ridge`, `LinearSVC`, `LogisticRegression`)
  - Tree-based classifiers/regressors (`DecisionTreeClassifier`, `RandomForestClassifier`, ...)
  - Neural networks (multi-layer perceptrons `MLPClassifier`/`MLPRegressor`)

---
# Defining an estimator

```python
from sklearn.svm import LinearSVC
estimator = LinearSVC(C=1.0)
```

- All estimators have a `fit` method

```python
estimator.fit(X, y)
```

- Hyperparameters are estimator attributes

```python
print(estimator.C)
```

- Classifiers and regressors have a `predict` method

```python
estimator.predict(X)
```

---
# Fitting an estimator

```python
estimator.fit(X_train, y_train)
```

Fitted parameters $\hat \theta$ are stored as attributes

```python
estimator.coef_, estimator.intercept_
```

- Linear models: hold the weights $W$ and bias $b$ of the linear model
- Neural networks: many weights and biases
- Random forests: hold the learned decision rules
- These attributes are used by `predict`

---
# Using an estimator

```python
y_pred = estimator.predict(X_test)
```

- Predicts the output: $\hat y = f\_\theta(X\_{\textrm{test}})$
  - Linear models: $\hat y = \textrm{argmax}\_{\textrm{class } i} (W x + b)\_i$
  - Similar in neural networks
  - Output of the decision rules in decision trees
  - Majority vote/mean over trees in random forests

- When a probabilistic model is available (classifiers)

```python
# y_proba.shape = (n_samples, n_classes)
y_proba = estimator.predict_proba(X_test)
```

---
# Evaluating an estimator

- Estimators can evaluate their own performance directly on test data

```python
estimator.score(X_test, y_test)
```

with a default metric adapted to the estimator (accuracy for classifiers, $R^2$ for regressors)

- Custom metrics from `sklearn.metrics` can also be used

```python
from sklearn.metrics import accuracy_score
y_pred = estimator.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
```

- Many metrics available
  - mean squared error, mean absolute error for regression
  - precision and recall for binary classification (sketch on the next slide)
  - F1 score, balanced accuracy for multi-class classification

---
# Confusion matrices

```python
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(estimator, X_test, y_test)
```

.center[
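]

These metrics are computed from the same predictions (a minimal sketch, reusing `y_test` and `y_pred` from above and assuming binary labels):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)                # harmonic mean of the two
```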
---
# Pipelines

- A way to chain multiple transformations with a final estimator

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
```

---
# Hyperparameter selection

- `scikit-learn` provides meta-estimators for hyperparameter selection

- `GridSearchCV` exhaustively tries all hyperparameter combinations

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

estimator = LinearSVC(C=1)
# 10 log-spaced values between 10^-7 and 1.0
C_values = np.logspace(-7, 0, 10)
estimator_cv = GridSearchCV(estimator, param_grid=dict(C=C_values))
# estimator_cv is itself a valid estimator (fit and predict work on it)
estimator_cv.fit(X, y)
```

- `RandomizedSearchCV` randomly samples hyperparameter combinations

---
# Implement your own model

- When implementing your own model, it is a good idea to follow the scikit-learn API

```python
import sklearn.base

class MyModel(sklearn.base.BaseEstimator):
    def __init__(self, hyperparam=1.0):
        self.hyperparam = hyperparam

    def fit(self, X, y):
        # learn the parameters and store them, e.g. self.coef_ = ...
        return self  # convention: fit returns the estimator itself

    def predict(self, X):
        # return the predictions, e.g. X @ self.coef_
        ...
```

- Benefits
  - A familiar, very readable interface
  - Compatible with `Pipeline`, `GridSearchCV`, ... (see the sketch below)
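Because `MyModel` follows the API, meta-estimators accept it out of the box (a sketch; the `hyperparam` grid is made up, and `fit`/`predict` above must be filled in first):

```python
from sklearn.model_selection import GridSearchCV

# scoring is given explicitly since MyModel defines no score method
search = GridSearchCV(MyModel(), param_grid={'hyperparam': [0.1, 1.0, 10.0]},
                      scoring='accuracy')
search.fit(X_train, y_train)
```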
---
class: center, middle

# Clustering

---
# Clustering

What can we do if we are only provided data without labels?

.center[
]

---
# Clustering

Cluster the points into compact groups

.center[
]

---
# Clustering

What happened?

.center[
]

We found centroids $(c\_j)\_{1\leq j\leq K}$ minimizing the total distance from each point to its closest centroid:
$
\sum\_{i=1}^n \min\_{1\leq j\leq K} \Vert x\_i - c\_j \Vert^2
$

---
# Clustering

Overfitting

.center[
]

---
# Clustering

Underfitting

.center[
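]

Choosing the number of clusters is a model-selection problem. A possible sketch: compare candidate values of $K$ with the silhouette score (higher is better), assuming a data matrix `X`:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare several numbers of clusters on the same data
for k in range(2, 8):
    labels = KMeans(n_clusters=k).fit_predict(X)
    print(k, silhouette_score(X, labels))
```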
---
# Clustering

.center[
]

Why is it useful?

- Extract groups in the data
- Vector quantization (compression)
- The cluster index becomes a new feature!
- Propagate a few known labels (semi-supervised learning)

---
# Clustering with K-means

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Synthetic 2D dataset for visualization
X, y = make_classification(n_samples=200, n_features=2,
                           n_redundant=0, class_sep=2)
km = KMeans(n_clusters=4)
km.fit(X)  # learned centroids are stored in km.cluster_centers_
# Cluster assignments
y_pred = km.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
```

.center[
]

---
class: center, middle

# Companion libraries

---
# `matplotlib`

.center[
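]

A minimal plotting sketch on synthetic data (not tied to the lecture datasets):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label='sin')  # line plot
plt.xlabel('x')
plt.legend()
plt.show()
```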
---
# `Pandas`

- Data structures and operations for manipulating numerical tables and time series
- Features (columns) have labels and can have different types

```python
import pandas as pd

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})
```

---
# Reduction to a numpy array

- `ColumnTransformer` can be used to apply different transformations per column

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(), 'city'),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')

X = column_trans.fit_transform(X)
```

---
# `Seaborn`

- Quickly visualize `pandas` DataFrames

.center[
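]

A minimal sketch with seaborn's `tips` example dataset (downloaded on first use):

```python
import seaborn as sns

tips = sns.load_dataset('tips')  # a small pandas DataFrame
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day')
```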
[Gallery of plots](https://seaborn.pydata.org/examples/index.html)

---
# This afternoon: lab work

- A tutorial on `scikit-learn`
- Link available at https://github.com/data-psl/lectures2022

.center[
]