
Machine learning with scikit-learn

Arthur Mensch

École Normale Supérieure
1 / 36

Previously today

Formal definition of supervised learning

  • $x \in \mathbb{R}^p \mapsto \hat{y} = f_\theta(x) \in \{0, 1\}$: classification

    • Is this email spam? Is this a dog?
  • $x \in \mathbb{R}^p \mapsto \hat{y} = f_\theta(x) \in \mathbb{R}$: regression

    • What is the height of this flower, given its age and species?
2 / 36

This lecture

  • What if there are no labels? A short introduction to unsupervised learning

  • How do I code all of this? A short introduction to scikit-learn

3 / 36

Unsupervised learning

What can we do if we are only given data, without labels?

4 / 36

Unsupervised learning

Cluster them around compact areas

5 / 36

Unsupervised learning

What happened?

We found centroids $(y_j)_{1 \le j \le c}$ minimizing the total distance $\sum_{i=1}^{n} \mathrm{Dist}\big(x_i, (y_j)_{1 \le j \le c}\big)$
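For concreteness, here is a minimal NumPy sketch of this objective (X and the centroids Y are made-up placeholders, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # hypothetical samples (n=100, p=2)
Y = rng.normal(size=(4, 2))    # hypothetical centroids (c=4)

# Pairwise distances between samples and centroids, shape (n, c)
dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
# Each sample counts its distance to the nearest centroid
objective = dists.min(axis=1).sum()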

6 / 36

Unsupervised learning

Overfitting

7 / 36

Unsupervised learning

Underfitting

8 / 36

Unsupervised learning

Why is it useful?

  • The cluster index is a new feature (see the sketch below)!
  • Propagate a few known labels (semi-supervised learning)
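As an illustration of the first point, a minimal sketch (on synthetic data, not from the slides) that appends the k-means cluster index to the design matrix:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
km = KMeans(n_clusters=4, random_state=0).fit(X)
# The cluster index becomes an extra feature column
X_augmented = np.hstack([X, km.labels_[:, None]])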
9 / 36

What should we do to train a model ?

  • We are given an $n \times p$ matrix $X$: $n$ samples with $p$ features

  • With labels $y \in \mathbb{R}^n$: $n$ associated labels (in a supervised setting)

  1. Transform data into a numerical matrix

  2. Split the data to retain a test set

  3. Fit a certain estimator to the data

  4. Evaluate the estimator on the test data

10 / 36

Getting the data into a numerical format

In order to train a model, we need the data to look like a 2D matrix:

  • Each sample holds a certain number of numerical features
  • These features must be built from non-numeric fields:
  • Categorical fields are made into binary columns (one-hot encoding)

  • Handle timestamps / ordinal data (ordered labels) / etc.

  • Remove the column mean, divide by the column standard deviation

11 / 36

Splitting the data into a train/validation set

  • Goal: evaluate a model $f_\theta$ on new data

  • Split $X$, $y$ into training and evaluation examples:

$(X_{\text{train}}, y_{\text{train}}),\ (X_{\text{test}}, y_{\text{test}})$

  • Learn $\theta$ with $(X_{\text{train}}, y_{\text{train}})$, evaluate it on $(X_{\text{test}}, y_{\text{test}})$
12 / 36

Fitting the data and evaluating prediction

  • Define $f_\theta$ with various parameters (Instantiate)

  • Learn $\hat{\theta}$ with $(X_{\text{train}}, y_{\text{train}})$ (Fit)

  • $\hat{y}_{\text{test}} = f_{\hat{\theta}}(X_{\text{test}})$ (Predict)

  • Compare $\hat{y}_{\text{test}}$ and $y_{\text{test}}$ (Score)

  • Select the model $f$ with the best generalization (Model selection)

Scikit-learn provides the API to do it easily

13 / 36

How to implement it

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris['data'], iris['target']  # type: np.ndarray
estimator = LinearSVC(C=1.0)  # provide optional hyper-parameters here
X_train, X_test, y_train, y_test = train_test_split(X, y)
estimator.fit(X_train, y_train)  # changes the internals of the estimator (theta)
y_pred = estimator.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
14 / 36

Loading data

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris['data'], iris['target'] # type: np.ndarray

Scikit-learn gives access to many benchmark datasets:

  • load_XXX loads toy datasets shipped with the library

  • fetch_XXX downloads bigger datasets from the internet

  • Create your own!

We want X and y to be numeric matrices, i.e. numpy.ndarray

  • scikit-learn provides Transformer objects for this (studied later)
15 / 36

Splitting data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
  • Utilities to separate training and validation data
  • Other utilities to split repeatedly: KFold, ShuffleSplit (see the sketch below)
  • Stratify according to a group (e.g. keep the same male/female proportion in the train and test sets)
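A minimal sketch of these utilities (iris is used only as a placeholder dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)

# Repeated splits: 5-fold cross-validation indices
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]

# Stratified split: class proportions preserved in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)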
16 / 36

Defining an estimator

from sklearn.svm import LinearSVC
estimator = LinearSVC(C=1.0)

Many different models (see documentation)

  • Classification (multi-class/binary)
  • Regularization
  • Multi-target (several targets to predict)

In particular:

  • All linear models (Ridge, SVM, LogisticRegression)
  • Tree-based classifiers/regressors and random forests (DecisionTreeClassifier, RandomForestClassifier, ...)
  • Neural networks (multi-layer perceptrons: MLPClassifier/MLPRegressor)
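All of these expose the same interface; a short sketch (iris is just a placeholder dataset):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
for estimator in [LogisticRegression(max_iter=1000),
                  RandomForestClassifier(),
                  MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)]:
    estimator.fit(X, y)  # identical fit/predict API for every model family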
17 / 36

Defining an estimator

from sklearn.svm import LinearSVC
estimator = LinearSVC(C=1.0)

Many hyper-parameters, which change (see the sketch below):

  • The model itself (example: width/depth for neural networks)
  • The loss function comparing y_pred and y_true
  • The regularization over the parameters
  • The method used to fit the model
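A sketch of each kind of hyper-parameter (the values are arbitrary):

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100, 100))  # model: width/depth
sgd = SGDClassifier(loss='hinge', penalty='l2')     # loss and regularization
logreg = LogisticRegression(C=0.1, solver='lbfgs')  # regularization strength and fitting method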
18 / 36

Fitting an estimator

estimator.fit(X_train, y_train)

The estimator is a stateful Python object.

The fitted parameter $\hat{\theta}$ is stored as attributes:

estimator.coef_, estimator.intercept_
  • Linear models: hold the weights $W$ and biases $b$ of the linear model

  • Neural networks: many weights and biases

  • Random forests: hold the various decision rules

  • These attributes allow using the test-time functions (see the sketch below)
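For instance (a small sketch, with iris as a placeholder dataset):

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
estimator = LinearSVC().fit(X, y)
print(estimator.coef_.shape)       # (n_classes, n_features): the weights W
print(estimator.intercept_.shape)  # (n_classes,): the biases b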

19 / 36

Using an estimator

y_pred = estimator.predict(X_test)
  • Predicts the output: $\hat{y} = f_{\hat{\theta}}(X_{\text{test}})$

    • Linear models: $\hat{y} = \operatorname{argmax}_{\text{class } i}\,(W x + b)_i$

    • Similar in neural networks

    • Output of the decision rules in decision trees

    • Majority vote / mean in random forests

  • When a probabilistic model is available (classifiers):

y_pred = estimator.predict_proba(X_test)
# y_pred.shape = (n_samples, n_classes)
20 / 36

Evaluating an estimator

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
  • Many metrics in sklearn.metrics:

    - mean squared error, mean absolute error for regression
    - precision and recall for binary classification
    - F1 score, balanced accuracy for multi-class classification
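A short sketch of some of these metrics on made-up labels:

from sklearn.metrics import f1_score, mean_squared_error, precision_score, recall_score

y_true, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]  # hypothetical labels
print(precision_score(y_true, y_pred))  # correct positives / predicted positives
print(recall_score(y_true, y_pred))     # correct positives / actual positives
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(mean_squared_error([1.0, 2.0], [1.1, 1.9]))  # regression metric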
21 / 36

Evaluating an estimator

For classification, all metrics derive from the confusion matrix

from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(estimator, X_test, y_test)

Estimators can evaluate their own performance directly on test data:

estimator.score(X_test, y_test)

with a default metric adapted to the estimator.

  • Often you want to choose the metric yourself!
22 / 36

Unsupervised learning

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Synthetic 2D dataset for visualization
X, y = make_classification(n_samples=200, n_features=2,
                           n_redundant=0, class_sep=2)
km = KMeans(n_clusters=4)
km.fit(X)
y_pred = km.predict(X)  # cluster index for each sample
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
23 / 36

Choosing your estimator

24 / 36

Preprocessing

  • Data may come in an Excel-like format:

    • numpy array with column names
    • Object of the pandas library
  • We need to transform it into a true numpy array

  • Use Transformer objects

    # Assume X is a dataframe with string columns
    from sklearn.preprocessing import OneHotEncoder
    transformer = OneHotEncoder(sparse=False)
    transformer.fit(X) # Notice the absence of y
    X = transformer.transform(X)
25 / 36

Useful preprocessing

  • Scale the columns: $x_{i,j} \leftarrow \frac{x_{i,j} - \mathrm{Mean}(x_{\cdot,j})}{\mathrm{Std}(x_{\cdot,j})}$

    • So that they are zero-mean and unit-variance

from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
standard_scaler.fit(X)
X_t = standard_scaler.transform(X)
print(X_t.mean(axis=0))  # [0, 0, ..., 0]
print(X_t.std(axis=0))   # [1, 1, ..., 1]

  • Impute missing values

    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='median')
    imputer.fit(X)  # X contains np.nan
    X_imp = imputer.transform(X)  # X_imp no longer contains np.nan
26 / 36

Useful preprocessing

  • Reduce the dimension of the X matrix
    • Random projection: multiply X by a random matrix $P \in \mathbb{R}^{p \times p'}$, with $p' < p$
    • Principal component analysis: multiply X by the matrix $P$ that maximizes the variance of the projected data $XP$

from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(X)
X_t = pca.transform(X)
# X_t.shape = (n_samples, p'=10)
27 / 36

Useful preprocessing

  • Select relevant features only
from sklearn.feature_selection import SelectKBest, f_regression
anova_filter = SelectKBest(score_func=f_regression, k=3)
# Selects the 3 features that are
# the most correlated with the target y
anova_filter.fit(X, y)
X_t = anova_filter.transform(X)
# X_t.shape = (n_samples, 3)
28 / 36

Model selection

scikit-learn provides meta-estimators:

from sklearn.model_selection import GridSearchCV
import numpy as np

estimator = LinearSVC(C=1.0)
estimator_cv = GridSearchCV(
    estimator, param_grid=dict(C=np.logspace(-7, 0, 8)))
# Still an estimator!
estimator_cv.fit(X, y)  # Tests validation performance for every C in the grid
  • So that you may forget about hyper-parameters

  • While knowing what you are doing (too big a grid overfits the validation set!)
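After fitting, the selected configuration can be inspected through the standard GridSearchCV attributes:

print(estimator_cv.best_params_)     # hyper-parameters with the best validation score
best = estimator_cv.best_estimator_  # refit on the whole training data (refit=True by default)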

29 / 36

And also

  • The Pipeline object to chain transformers and estimators

    • Example: piping a dimension reduction into a logistic regression (see the sketch below)
  • Many other preprocessing utilities:

    • Label pre-processing
    • Dimension reduction
    • Feature selection
  • Standard routines: cross_val_score, etc.

  • Plotting functions:

    - Precision/recall curves / training curves / feature importances / data visualization
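A minimal sketch of the dimension reduction + logistic regression example mentioned above (iris as a placeholder dataset):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('reduce_dim', PCA(n_components=2)),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)   # fits the PCA, transforms X, then fits the classifier
pipe.predict(X)  # the whole pipeline behaves as a single estimator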
30 / 36

Implement your own model

  • The API is well defined and follows known conventions (fit/predict/transform)

  • You may create your own estimator object and plug it into Pipeline and GridSearchCV (see the sketch below)
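A toy illustration of these conventions (not from the slides): fit stores fitted attributes with a trailing underscore and returns self, which is what Pipeline and GridSearchCV rely on.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator that always predicts the most frequent training class."""

    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        self.majority_ = classes[np.argmax(counts)]  # fitted attribute: trailing _
        return self  # fit must return self

    def predict(self, X):
        return np.full(len(X), self.majority_)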

31 / 36

Companion lib: matplotlib

32 / 36

Companion lib: Pandas

  • Pandas to handle columnar data + time-series
33 / 36

Handling pandas.DataFrame

  • Pandas to numpy with ColumnTransformer and Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
categorical_columns = ['pclass', 'sex', 'embarked']
numerical_columns = ['age', 'sibsp', 'parch', 'fare']
X = X[categorical_columns + numerical_columns]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))])
preprocessing = ColumnTransformer(
    [('cat', categorical_pipe, categorical_columns),
     ('num', numerical_pipe, numerical_columns)])
preprocessing.fit(X_train)
X_train = preprocessing.transform(X_train)  # Now a plain array, no longer a DataFrame
34 / 36

Companion lib: Seaborn

  • Quickly visualize pandas DataFrames (see the sketch below)

Gallery of plots
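For instance (a tiny sketch; seaborn can load the iris dataset as a DataFrame):

import seaborn as sns

iris = sns.load_dataset('iris')    # a pandas DataFrame
sns.pairplot(iris, hue='species')  # scatter-plot matrix colored by class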

35 / 36

This afternoon

  • A notebook tutorial on scikit-learn

  • For those who go fast:

    • Pipeline
    • GridSearchCV
  • Take your own data and explore it!

36 / 36
