14 Nov 2018
Presented by Christabella Irwanto
“Learn” from known data and generalize to unknown data
Data is an np.array (or scipy.sparse matrix) of shape [n_samples, n_features]
n_samples: no. of items, e.g. documents/images/rows in a CSV
n_features: no. of traits describing each item
scipy.sparse matrices are more memory-efficient
from sklearn import datasets
X, y = datasets.make_moons(n_samples=100, noise=.1)
Depending on the data, different scaling methods will be more appropriate.
StandardScaler, RobustScaler, MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
CountVectorizer computes the count of each word.
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['Roses are red.',
... 'Violets are red too.']
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> pd.DataFrame(X.toarray(),
...              columns=vectorizer.get_feature_names())
are red roses too violets
0 1 1 1 0 0
1 1 1 0 1 1
The feature_selection module
feature_selection.RFECV, SelectKBest, SelectFromModel
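As a minimal sketch of univariate selection (using toy data generated for illustration), SelectKBest keeps only the features with the highest scores under a statistical test:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 100 samples, 10 features, only 3 of them informative
X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=3, random_state=0)

# Keep the 3 features with the highest ANOVA F-score
selector = SelectKBest(f_classif, k=3)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (100, 3)
```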
sklearn.model_selection.GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import GridSearchCV

model = Ridge()
gs = GridSearchCV(model, {'alpha': [0.01, 0.05, 0.1]}, cv=3).fit(X, y)
gs.best_params_

model = RidgeCV(alphas=[0.01, 0.05, 0.1], cv=3).fit(X, y)
model.alpha_  # Best alpha
PCA: deterministic, inductive (learns a model that can be applied to unseen data)
TSNE: stochastic, transductive (models data directly)
SparseRandomProjection: more memory-efficient, faster computation
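A quick sketch of the inductive vs. projection trade-off on random toy data (shapes chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import SparseRandomProjection

X = np.random.RandomState(0).rand(100, 50)

# PCA is inductive: fit once, then transform unseen data later
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

# SparseRandomProjection trades some accuracy for speed/memory
srp = SparseRandomProjection(n_components=10, random_state=0)
X_proj = srp.fit_transform(X)
print(X_2d.shape, X_proj.shape)  # (100, 2) (100, 10)
```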
Find them (and more) in the User Guide!
DBSCAN
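A minimal DBSCAN sketch on the two-moons toy data from earlier (eps and min_samples are illustrative values, not recommendations):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# DBSCAN needs no preset number of clusters: it finds
# density-connected regions and labels outliers as -1
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))
```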
sklearn objects
Estimator: the most important object in sklearn; provides a consistent interface for every machine learning algorithm
estimator.fit(data[, targets])
predict (and sometimes predict_proba), e.g. classifier, regressor, clusterer
score to judge quality of the fit/prediction on new data
coef_ (estimated model parameters)
transform
All other modules support the estimator, e.g. datasets, metrics, feature_selection, model_selection, ensemble…
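The consistent interface means every estimator is used the same way. A minimal sketch on the iris dataset (the choice of LogisticRegression is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=200)
clf.fit(X, y)                          # every estimator implements fit
print(clf.predict(X[:2]))              # predictors add predict
print(clf.predict_proba(X[:2]).shape)  # (2, 3): one probability per class
print(clf.score(X, y))                 # mean accuracy on given data
print(clf.coef_.shape)                 # learned attributes end with "_"
```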
transform data, or predict on test data
FeatureUnion for estimators in parallel
from sklearn import decomposition, svm
from sklearn.pipeline import Pipeline

clf = Pipeline([('pca', decomposition.PCA(n_components=150)),
                ('svm', svm.LinearSVC(C=1.0))])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
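FeatureUnion, mentioned above, runs estimators in parallel rather than in sequence. A minimal sketch on iris (component counts chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Run PCA and univariate selection in parallel, then
# concatenate their outputs into one feature matrix
union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('kbest', SelectKBest(k=1))])
X_combined = union.fit_transform(X, y)
print(X_combined.shape)  # (150, 3): 2 PCA components + 1 selected feature
```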
Aggregate the individual predictions of multiple predictors
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

ensemble_clf = VotingClassifier(estimators=[
    ('dummy', dummy_classifier),
    ('logistic', lr),
    ('rf', RandomForestClassifier())],
    voting='soft')
ensemble_clf.fit(X_train, y_train)
ensemble_clf_accuracy_ = cost_accuracy(y_test,
                                       ensemble_clf.predict(X_test))
metrics.confusion_matrix, metrics.classification_report (combines metrics.f1_score etc.)
>>> from sklearn import metrics
>>> print(metrics.classification_report(y, y_pred))
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
weighted avg 0.70 0.60 0.61 5
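metrics.confusion_matrix, mentioned above, gives the raw per-class breakdown. A minimal sketch with hand-made labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]
```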
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
test_size=0.25, random_state=0)
from sklearn.model_selection import cross_val_score
cross_val_score(estimator, X, y, cv=5)  # 5 folds
Assumes that each observation is independent.
Music genre classification 🌚
github.com/christabella/music-genre-classification/
End-to-end ML workflow: fit
n_jobs parameter in most estimators to specify the number of subprocesses
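A minimal sketch of n_jobs on toy data (n_jobs=-1 means "use all available cores"; the estimator choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# n_jobs=-1 fits the forest's trees across all available cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                             random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```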