Arthur Mensch
Formal definition of supervised learning
x ∈ ℝ^p → ŷ = f_θ(x) ∈ {0, 1}: classification
x ∈ ℝ^p → ŷ = f_θ(x) ∈ ℝ: regression
What if there isn't any label? A short introduction to unsupervised learning
How do I code all of this? A short introduction to scikit-learn
What can we do if we are only provided data without labels?
Cluster them around compact areas
What happened ?
We found centroids (y_j)_{1 ≤ j ≤ c} with minimal total distance ∑_{i=1}^{n} Dist(x_i, (y_j)_{1 ≤ j ≤ c}), i.e. the sum over samples of the distance to the closest centroid
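As a minimal NumPy sketch of this objective (the data and centroids below are random placeholders, for illustration only):

import numpy as np

# Hypothetical data and candidate centroids
X = np.random.randn(100, 2)        # n=100 samples, p=2 features
centroids = np.random.randn(4, 2)  # c=4 centroids

# Distance of each sample to each centroid: shape (n, c)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# k-means objective: sum over samples of the distance to the closest centroid
objective = dists.min(axis=1).sum()
print(objective)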
Overfitting
Underfitting
Why is it useful?
We are provided an n × p matrix X of n samples with p features
With labels y ∈ ℝ^n: n associated labels (in a supervised setting)
Transform data into a numerical matrix
Split the data to retain a test set
Fit a certain estimator to the data
Evaluate the estimator on the held-out test set
In order to train a model, we need the data to look like a 2D matrix:
Categorical fields made into binary columns (one-hot encoding)
Handle timestamps / ordinal data (ordered labels) / etc.
Remove the column mean, divide by the column standard deviation
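For instance, a minimal sketch of these steps with pandas and scikit-learn (the toy columns below are made up for illustration):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical Excel-like data with mixed column types
df = pd.DataFrame({'city': ['Paris', 'Lyon', 'Paris'],
                   'size': ['S', 'M', 'L'],
                   'price': [10.0, 12.5, 9.0]})

# Categorical field made into binary columns
X = pd.get_dummies(df, columns=['city'])

# Ordered labels (ordinal data) mapped to integers
X['size'] = OrdinalEncoder(categories=[['S', 'M', 'L']]).fit_transform(
    df[['size']]).ravel()

# Remove the column mean, divide by the column standard deviation
X_scaled = StandardScaler().fit_transform(X)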
Evaluate a certain model fθ over new data
Split X, y into training and evaluation examples
(X_train, y_train), (X_test, y_test)
Define fθ with various parameters (Instantiate)
Learn θ̂ with (X_train, y_train) (Fit)
ŷ_test = f_θ̂(X_test) (Predict)
Compare ŷ_test and y_test (Score)
Select the model f with the best generalization (Model selection)
Scikit-learn provides the API to do it easily
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris['data'], iris['target']  # type: np.ndarray

estimator = LinearSVC(C=1.0)  # provide optional parameters here
X_train, X_test, y_train, y_test = train_test_split(X, y)
estimator.fit(X_train, y_train)  # changes the internals of the estimator (theta)
y_pred = estimator.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # evaluate the misclassification
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris['data'], iris['target']  # type: np.ndarray
Scikit-learn gives access to many benchmark datasets
load_XXX
loads toy datasets published with the library
fetch_XXX
downloads bigger datasets from the internet
Create your own!
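For example, one of each (both loaders ship with scikit-learn):

from sklearn.datasets import load_digits, fetch_california_housing

# Toy dataset published with the library
X, y = load_digits(return_X_y=True)

# Larger dataset, downloaded from the internet on first use
X, y = fetch_california_housing(return_X_y=True)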
We want X and y to be numeric matrices, i.e. numpy.ndarray
scikit-learn provides Transformer objects for this (studied later)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
Other splitters: KFold, ShuffleSplit
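A quick sketch of the splitter API, assuming the X, y arrays loaded above (splitters generate indices, not data):

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate one estimator per fold here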
from sklearn.svm import LinearSVC

estimator = LinearSVC(C=1.0)
Many different models (see documentation)
In particular:
- linear models (Ridge, SVM, LogisticRegression)
- tree-based models (DecisionTreeClassifier, RandomForestClassifier, ...)
- neural networks (MLPClassifier/MLPRegressor)

from sklearn.svm import LinearSVC

estimator = LinearSVC(C=1.0)
Many hyper-parameters, which change the fitted model: e.g., the regularization strength C for SVMs, or the width/depth for neural networks
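For instance (the parameter values below are arbitrary, for illustration):

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

# Regularization strength for a linear SVM
estimator = LinearSVC(C=0.1)

# Width/depth for a neural network: two hidden layers of 64 units
estimator = MLPClassifier(hidden_layer_sizes=(64, 64))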
estimator.fit(X_train, y_train)
The estimator is a stateful Python object.
The fitted parameter θ̂ is stored as attributes:
estimator.coef_, estimator.intercept_
Linear: Holds the weight W and bias b of the linear model
For neural-networks: many weights and biases
Random forests: holds the various decision rules
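For instance, for the fitted linear model above (a sketch):

# After estimator.fit(X_train, y_train), the fitted parameters
# are exposed as attributes with a trailing underscore:
print(estimator.coef_.shape)       # W, here (n_classes, n_features)
print(estimator.intercept_.shape)  # b, here (n_classes,)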
Those attributes allow the use of test-time functions
y_pred = estimator.predict(X_test)
Predicts the output: ŷ = f_θ̂(X_test)
Linear models: ŷ = argmax_{class i} (W x + b)_i
Similar for neural networks
Output of the decision rules in decision trees
Majority voting (classification) or mean (regression) in random forests
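For a multiclass linear model such as the LinearSVC above, this argmax can be checked directly (a sketch):

import numpy as np

# Per-class scores (W x + b)_i, shape (n_samples, n_classes)
scores = estimator.decision_function(X_test)

# predict() returns the class with the highest score
y_pred = estimator.classes_[np.argmax(scores, axis=1)]
assert np.array_equal(y_pred, estimator.predict(X_test))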
When a probabilistic model is available (classifier)
y_pred = estimator.predict_proba(X_test)
# y_pred.shape = (n_samples, n_classes)
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)  # evaluate the misclassification
Many metrics in sklearn.metrics:
- mean squared error, average error for regression
- precision and recall for binary classification
- F1 score, balanced accuracy for multi-class classification
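For instance, for a binary classification task (assuming binary y_test and y_pred):

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)                # harmonic mean of both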
For classification, all metrics derive from the confusion matrix
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(estimator, X_test, y_test)
Estimators can evaluate their performance directly on test data
estimator.score(X_test, y_test)
with a default metric adapted to the estimator (accuracy for classifiers, R² for regressors).
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Synthetic 2D dataset for visualization
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           class_sep=2)
km = KMeans(n_clusters=4)
km.fit(X)
y_pred = km.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
Data may come in an Excel-like format: a numpy array with column names, handled by the pandas library
We need to transform it into a true numpy array
Use Transformer objects
from sklearn.preprocessing import OneHotEncoder

# Assume X is a dataframe with strings
transformer = OneHotEncoder(sparse=False)
transformer.fit(X)  # notice the absence of y
X = transformer.transform(X)
Scale the columns: x_{i,j} = (x_{i,j} − Mean(x_j)) / √Var(x_j)
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
standard_scaler.fit(X)
X_t = standard_scaler.transform(X)
print(X_t.mean(axis=0))  # [0, 0, ..., 0]
print(X_t.std(axis=0))   # [1, 1, ..., 1]
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(X)  # X contains np.nan
X_imp = imputer.transform(X)  # X_imp no longer contains np.nan
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
pca.fit(X)
X_t = pca.transform(X)
# X_t.shape = (n_samples, 10)
from sklearn.feature_selection import SelectKBest, f_regression

# Select the 3 features most correlated with the target y
anova_filter = SelectKBest(score_func=f_regression, k=3)
anova_filter.fit(X, y)
X_t = anova_filter.transform(X)
# X_t.shape = (n_samples, 3)
scikit-learn provides meta-estimators:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

estimator = LinearSVC(C=1)
estimator_cv = GridSearchCV(estimator,
                            param_grid=dict(C=np.logspace(-7, 0, 8)))
# Still an estimator!
estimator_cv.fit(X, y)  # tests validation performance for every C in the grid
So that you may forget about hyper-parameters,
while knowing what you are doing (too big a grid overfits!)
The Pipeline object connects transformers and estimators, as sketched below
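A minimal sketch chaining a scaler and a classifier (the step names are arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),  # transformer
    ('svc', LinearSVC(C=1.0)),     # final estimator
])
pipeline.fit(X_train, y_train)     # fits the scaler, then the SVC
y_pred = pipeline.predict(X_test)  # applies the scaler, then predicts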
Many other utilities:
Standard routines: cross_val_score, etc. (see the sketch after this list)
Plotting functions:
- precision/recall curves
- training curves
- feature importance
- data visualization
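As referenced above, cross_val_score runs the whole split/fit/score loop in one call (a minimal sketch):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated scores for a single estimator
scores = cross_val_score(estimator, X, y, cv=5)
print(scores.mean(), scores.std())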
The API is well defined and follows consistent rules
You may create your own estimator object and plug it into Pipeline and GridSearchCV, as sketched below
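A minimal sketch, assuming a custom transformer that only centers the columns (the MeanCenterer name is made up):

from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: removes the column mean."""

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)  # learned state gets a trailing underscore
        return self                  # fit must return self

    def transform(self, X):
        return X - self.mean_

# Behaves like any built-in transformer, e.g. inside a Pipeline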
matplotlib for plotting
Pandas to handle columnar data + time-series: pandas.DataFrame
Pandas to numpy with ColumnTransformer and Pipeline
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
categorical_columns = ['pclass', 'sex', 'embarked']
numerical_columns = ['age', 'sibsp', 'parch', 'fare']
X = X[categorical_columns + numerical_columns]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
])
preprocessing = ColumnTransformer([
    ('cat', categorical_pipe, categorical_columns),
    ('num', numerical_pipe, numerical_columns),
])
preprocessing.fit(X_train)
X_train = preprocessing.transform(X_train)  # now a numpy array
A notebook tutorial on scikit-learn
For those who go fast:
Pipeline
GridSearchCV
Take your own data and explore it!