Slides are available here, made from the same org file that this Hugo blogpost was generated from.

scikit-learn in a nutshell πŸ₯œ πŸ”—

14 Nov 2018

Presented by Christabella Irwanto

Machine learning? πŸ”—

β€œLearn” from known data and generalize to unknown data

What about scikit-learn? πŸ”—

Input data matrix πŸ”—
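scikit-learn's convention is a 2-D data matrix `X` of shape `(n_samples, n_features)` paired with a 1-D target `y`. A minimal sketch using the bundled Iris dataset:

```python
from sklearn import datasets

# sklearn expects X as a 2-D array of shape (n_samples, n_features)
# and y as a 1-D array of shape (n_samples,)
iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape)  # (150, 4): 150 samples, 4 features each
print(y.shape)  # (150,): one class label per sample
```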

Datasets πŸ”—

Swiss roll πŸ”—

from sklearn import datasets
X, y = datasets.make_moons(n_samples=100, noise=.1)
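The snippet above generates the two-moons dataset; for the swiss roll named in the heading, sklearn provides `make_swiss_roll`, which returns 3-D points plus each point's position along the roll:

```python
from sklearn import datasets

# 3-D swiss roll: X are the points, t is the position along the roll
X, t = datasets.make_swiss_roll(n_samples=100, noise=0.1)
print(X.shape)  # (100, 3)
```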

Feature scaling πŸ”—
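A minimal sketch of feature scaling with `StandardScaler`, which standardizes each column to zero mean and unit variance (the toy matrix here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_scaled = StandardScaler().fit_transform(X)
# Each column now has zero mean and unit variance
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```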

Feature selection πŸ”—
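One common approach, sketched here on Iris: `SelectKBest` keeps only the features that score highest against the target (ANOVA F-test by default for classification):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# Keep the 2 features most associated with the target
X_new = SelectKBest(f_classif, k=2).fit_transform(X, y)
print(X_new.shape)  # (150, 2): down from 4 features to 2
```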

Hyperparameter optimization πŸ”—

from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20

# Generic: exhaustive search over a parameter grid with 3-fold CV
model = Ridge()
gs = GridSearchCV(model, {'alpha': [0.01, 0.05, 0.1]}, cv=3).fit(X, y)
gs.best_params_['alpha']  # best alpha

# Shortcut: RidgeCV runs the same search with built-in cross-validation
model = RidgeCV(alphas=[0.01, 0.05, 0.1], cv=3).fit(X, y)
model.alpha_  # best alpha

Dimensionality reduction πŸ”—
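A minimal sketch with PCA, projecting the 4-D Iris features onto their first two principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
# Project 4-D features down to 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```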

Model selection πŸ”—
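One way to select between candidate models, sketched here on Iris, is to compare their mean cross-validated scores:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Compare candidate models by mean cross-validated accuracy
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())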

All the ML algorithms πŸ”—

Find them (and more) in the User Guide!


API’s of sklearn objects πŸ”—
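sklearn objects share a small, consistent API: every estimator learns its parameters in `fit()`, transformers add `transform()`, and predictors add `predict()` and `score()`. A sketch of all three on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Estimator API: parameters are learned in fit()
scaler = StandardScaler().fit(X)
# Transformer API: transform() maps data to a new representation
X_scaled = scaler.transform(X)
# Predictor API: predict() and score() work on any fitted model
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
print(clf.predict(X_scaled[:3]))
print(clf.score(X_scaled, y))  # mean accuracy
```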

Ensembles πŸ”—

Aggregate the individual predictions of multiple predictors

from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# dummy_classifier, lr and cost_accuracy are defined earlier in the case study
ensemble_clf = VotingClassifier(estimators=[
                    ('dummy', dummy_classifier),
                    ('logistic', lr),
                    ('rf', RandomForestClassifier())],
                    voting='soft').fit(X_train, y_train)
ensemble_clf_accuracy_ = cost_accuracy(y_test, ensemble_clf.predict(X_test))

Metrics πŸ”—

>>> from sklearn import metrics
>>> print(metrics.classification_report(y, y_pred))
	      precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

weighted avg       0.70      0.60      0.61         5

Splitting data πŸ”—

from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
					test_size=0.25, random_state=0)

Cross validation πŸ”—

from sklearn.model_selection import cross_val_score
cross_val_score(estimator, X, y, cv=5)  # 5 folds

Assumes that each observation is independent.
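When that independence assumption fails because observations come in groups (e.g., several clips extracted from the same song), plain K-fold leaks information between train and test. `GroupKFold` keeps each group entirely on one side of the split. A sketch with hypothetical group labels:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 1])
# Hypothetical group labels: rows from the same source share a label
groups = np.array([0, 0, 1, 1, 2, 2])

# GroupKFold never splits a group across train and test
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    print(groups[train_idx], groups[test_idx])
```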

Case study πŸ”—

Music genre classification 🌚

End-to-end ML workflow:

  1. Automated model selection with TPOT
  2. Preprocessing
  3. Visualizations
  4. Splitting, evaluating
  5. Multiple models, using same fit
  6. GridSearchCV
  7. Ensemble (VotingClassifier)
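Several of the steps above (preprocessing, splitting, GridSearchCV) chain naturally with a `Pipeline`. A minimal sketch on a toy dataset standing in for the genre features, which are not included here:

```python
from sklearn.datasets import make_moons  # stand-in for the genre features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing + model in one pipeline; grid search tunes the model's C
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
gs = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=3).fit(X_train, y_train)
print(gs.best_params_)
print(gs.score(X_test, y_test))  # accuracy on held-out data
```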

Optimizing performance πŸ”—
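One common performance knob (an assumption about what the slide covered): `n_jobs=-1` parallelizes cross-validation and grid search across all CPU cores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
# n_jobs=-1 fits every parameter/fold combination in parallel on all cores
gs = GridSearchCV(RandomForestClassifier(random_state=0),
                  {'n_estimators': [10, 50]}, cv=3, n_jobs=-1).fit(X, y)
print(gs.best_params_)
```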

Conclusion πŸ”—

Resources πŸ”—