Meta Machine Learning

Meta machine learning contains specific ML algorithms that take a classification/regression model as input.

There is also a Factory, which makes it very simple to train a set of models and compare them.

Factory

Factory provides a convenient way to train several classifiers on the same dataset. These classifiers can be trained one by one in a single thread, or simultaneously using an IPython cluster or several threads.

Factory also allows comparison of several classifiers (whose predictions can again be computed in parallel).

class rep.metaml.factory.ClassifiersFactory(*args, **kwds)[source]

Bases: rep.metaml.factory.AbstractFactory

Factory provides training of several classifiers in parallel. Quality of trained classifiers can be compared.

Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.

add_classifier(name, classifier)[source]

Add a classifier to the factory; a plain sklearn classifier is automatically wrapped with SklearnClassifier.

Parameters:
  • name (str) – unique name for the classifier. If the name coincides with one already used, the old classifier is replaced by the one passed.
  • classifier (sklearn.base.BaseEstimator or estimators.interface.Classifier) –

    classifier object

    Note

    if the classifier is a plain sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the classifier, wrap it with SklearnClassifier first, as in the sketch below
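
    A minimal sketch (the factory object and feature names here are illustrative):

    >>> from rep.estimators import SklearnClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> # wrap the sklearn estimator to restrict it to the selected features
    >>> factory.add_classifier('gb', SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b']))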

predict(X, parallel_profile=None)[source]

Predict labels for all events in dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • parallel_profile (None or str) – profile for IPython cluster
Return type:

OrderedDict[numpy.array of shape [n_samples] with integer labels]

predict_proba(X, parallel_profile=None)[source]

Predict probabilities for all events in dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • parallel_profile (None or str) – profile for IPython cluster
Return type:

OrderedDict[numpy.array of shape [n_samples] with float predictions]

staged_predict_proba(X)[source]

Predict probabilities on each stage (attention: returns dictionary of generators)

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Return type:dict[iterator]
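
Since a dictionary of generators is returned, each generator has to be iterated explicitly. A minimal sketch (assuming a fitted factory and test_data as in the examples below):

>>> staged = factory.staged_predict_proba(test_data)
>>> for name, stage_probabilities in staged.items():
>>>     for proba in stage_probabilities:
>>>         pass  # probabilities of shape [n_samples, n_classes] after each stage
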
test_on_lds(lds)[source]

Prepare report for factory of estimators

Parameters:lds (LabeledDataStorage) – data
Return type:rep.report.classification.ClassificationReport
class rep.metaml.factory.RegressorsFactory(*args, **kwds)[source]

Bases: rep.metaml.factory.AbstractFactory

Factory provides training of several regressors in parallel. Quality of trained regressors can be compared.

Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.

add_regressor(name, regressor)[source]

Add regressor to factory

Parameters:
  • name (str) – unique name for the regressor. If the name coincides with one already used, the old regressor is replaced by the one passed.
  • regressor (sklearn.base.BaseEstimator or estimators.interface.Regressor) –

    regressor object

    Note

    if the regressor is a plain sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the regressor, wrap it first with SklearnRegressor

predict(X, parallel_profile=None)[source]

Predict values for all events in dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • parallel_profile (None or str) – name of profile used to parallelize computations
Return type:

OrderedDict[numpy.array of shape [n_samples] with float values]

staged_predict(X)[source]

Predict values on each stage (attention: returns dictionary of generators)

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Return type:dict[iterator]
test_on_lds(lds)[source]

Report for factory of estimators

Parameters:lds (LabeledDataStorage) – data
Return type:rep.report.regression.RegressionReport
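
A minimal usage sketch (the dataset names are illustrative; RegressorsFactory is assumed to be importable from rep.metaml in the same way as ClassifiersFactory):

>>> from rep.metaml import RegressorsFactory
>>> from rep.estimators import SklearnRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
>>> regressors = RegressorsFactory()
>>> regressors.add_regressor('gb', SklearnRegressor(GradientBoostingRegressor()))
>>> regressors.add_regressor('forest', SklearnRegressor(RandomForestRegressor()))
>>> regressors.fit(train_data, train_values)
>>> predictions = regressors.predict(test_data)  # OrderedDict: name -> numpy.array of shape [n_samples]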

Factory Examples

  • Prepare dataset
    >>> from sklearn import datasets
    >>> import pandas, numpy
    >>> from rep.utils import train_test_split
    >>> from sklearn.metrics import roc_auc_score
    >>> # iris data
    >>> iris = datasets.load_iris()
    >>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
    >>> labels = iris.target
    >>> # Take just two classes instead of three
    >>> data = data[labels != 2]
    >>> labels = labels[labels != 2]
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Train factory of classifiers
    >>> from rep.metaml import ClassifiersFactory
    >>> from rep.estimators import TMVAClassifier, SklearnClassifier, XGBoostClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> factory = ClassifiersFactory()
    >>> factory.add_classifier('tmva', TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b']))
    >>> factory.add_classifier('ada', GradientBoostingClassifier())
    >>> factory['xgb'] = XGBoostClassifier(features=['a', 'b'])
    >>> factory.fit(train_data, train_labels)
    model ef           was trained in 0.22 seconds
    model tmva         was trained in 2.47 seconds
    model ada          was trained in 0.02 seconds
    model xgb          was trained in 0.01 seconds
    Totally spent 2.71 seconds on training
    >>> pred = factory.predict_proba(test_data)
    data was predicted by tmva         in 0.02 seconds
    data was predicted by ada          in 0.00 seconds
    data was predicted by xgb          in 0.00 seconds
    Totally spent 0.05 seconds on prediction
    >>> print pred
    OrderedDict([('tmva', array([[  9.98732217e-01,   1.26778255e-03], [  9.99649503e-01,   3.50497149e-04], ..])),
                 ('ada', array([[  9.99705117e-01,   2.94883265e-04], [  9.99705117e-01,   2.94883265e-04], ..])),
                 ('xgb', array([[  9.91589248e-01,   8.41078255e-03], ..], dtype=float32))])
    >>> for key in pred:
    >>>    print key, roc_auc_score(test_labels, pred[key][:, 1])
    tmva 0.933035714286
    ada 1.0
    xgb 0.995535714286
    
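  • Compare the trained classifiers via a report (a minimal sketch; it is assumed that LabeledDataStorage can be imported from rep.data)
    >>> from rep.data import LabeledDataStorage
    >>> lds = LabeledDataStorage(test_data, test_labels)
    >>> report = factory.test_on_lds(lds)  # rep.report.classification.ClassificationReport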

Grid Search

This module performs hyperparameter optimization: it finds the best parameters for an estimator using different optimization models. Components of optimization:

  • estimator (for which optimal parameters are searched, any REP classifier will work, see rep.estimators)
  • target metric function (which is maximized, anything meeting REP metric interface, see rep.report.metrics)
  • optimization algorithm (introduced in this module)
  • cross-validation technique (kFolding, introduced in this module)

During optimization, many cycles of estimating quality on different sets of parameters are performed. To speed up the process, threads or an IPython cluster can be used.

GridOptimalSearchCV

The main class linking the whole process is GridOptimalSearchCV, which takes as parameters:

  • estimator to be optimized
  • scorer (which trains classifier and estimates quality using cross-validation)
  • parameter generator (which draws next set of parameters to be checked)
class rep.metaml.gridsearch.GridOptimalSearchCV(estimator, params_generator, scorer, parallel_profile=None)[source]

Bases: object

Optimal search over specified parameter values for an estimator. Uses different optimization techniques to make do with a limited number of evaluations instead of exhaustive grid scanning.

GridOptimalSearchCV implements a “fit” method and a “fit_best_estimator” method to train models.

Parameters:
  • estimator (BaseEstimator) – object of a type that implements the “fit” and “fit_best_estimator” methods. A new object of that type is cloned for each point.
  • params_generator (AbstractParameterGenerator) – generator of grid search algorithm
  • scorer (object) – object which implements the method __call__ with kwargs: “base_estimator”, “params”, “X”, “y”, “sample_weight”
  • parallel_profile (None or str) – name of profile

Attributes:

generator: return grid parameter generator

fit(X, y, sample_weight=None)[source]

Run fit with all sets of parameters.

Parameters:
  • X – array-like, shape = [n_samples, n_features] Training vector, where n_samples is the number of samples and n_features is the number of features.
  • y – array-like, shape = [n_samples] or [n_samples, n_output], optional
  • sample_weight – array-like, shape = [n_samples], weight
fit_best_estimator(X, y, sample_weight=None)[source]

Train estimator with the best parameters

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • y – labels of events - array-like of shape [n_samples]
  • sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
Returns:

the best estimator

generator

Property for params_generator
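
A minimal end-to-end sketch of the optimization loop (the parameter grid, the score function and the data are illustrative; it is assumed that parameter names set by the generator are forwarded to the wrapped sklearn estimator):

>>> from collections import OrderedDict
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.metrics import roc_auc_score
>>> from rep.estimators import SklearnClassifier
>>> from rep.metaml.gridsearch import GridOptimalSearchCV, RandomParameterOptimizer, ClassificationFoldingScorer
>>>
>>> def roc_auc(y_true, proba, sample_weight=None):
>>>     # quality function with the signature expected by the folding scorer
>>>     return roc_auc_score(y_true, proba[:, 1], sample_weight=sample_weight)
>>>
>>> param_grid = OrderedDict([('learning_rate', [0.05, 0.1, 0.2]), ('n_estimators', [50, 100, 200])])
>>> generator = RandomParameterOptimizer(param_grid, n_evaluations=10)
>>> scorer = ClassificationFoldingScorer(roc_auc, folds=3, fold_checks=2)
>>> grid_finder = GridOptimalSearchCV(SklearnClassifier(GradientBoostingClassifier()), generator, scorer)
>>> grid_finder.fit(train_data, train_labels)
>>> grid_finder.generator.print_results()
>>> best_estimator = grid_finder.fit_best_estimator(train_data, train_labels)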

Folding Scorer

Folding cross validation can be used in grid search optimization.

class rep.metaml.gridsearch.ClassificationFoldingScorer(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]

Bases: rep.metaml.gridsearch.FoldingScorerBase

Scorer, which implements logic of data folding and scoring for classification models. This is a function-like object

Parameters:
  • folds (int) – ‘k’ used in k-folding while validating
  • fold_checks (int) – not greater than folds, the number of checks we do by cross-validating
  • score_function (function) – quality function; if fold_checks > 1, the average is computed over the checks.

Example:

>>> def new_score_function(y_true, proba, sample_weight=None):
>>>     '''
>>>     y_true: [n_samples]
>>>     proba: [n_samples, n_classes]
>>>     sample_weight: [n_samples] or None
>>>     '''
>>>     ...
>>>
>>> f_scorer = FoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5

class rep.metaml.gridsearch.RegressionFoldingScorer(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]

Bases: rep.metaml.gridsearch.FoldingScorerBase

Scorer, which implements logic of data folding and scoring for regression models. This is a function-like object

Parameters:
  • folds (int) – ‘k’ used in k-folding while validating
  • fold_checks (int) – not greater than folds, the number of checks we do by cross-validating
  • score_function (function) – quality function; if fold_checks > 1, the average is computed over the checks.

Example:

>>> def new_score_function(y_true, pred, sample_weight=None):
>>>     '''
>>>     y_true: [n_samples]
>>>     pred: [n_samples]
>>>     sample_weight: [n_samples] or None
>>>     '''
>>>     ...
>>>
>>> f_scorer = RegressionFoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5


Available optimization algorithms

class rep.metaml.gridsearch.RandomParameterOptimizer(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

Works in the same way as sklearn.grid_search.RandomizedSearchCV. Each next point is generated independently.

Param_grid: dict with distributions used to sample each parameter:
  • name -> list of possible values (in this case sampled uniformly from the options)
  • name -> distribution (should implement ‘.rvs()’, as scipy distributions do)
Parameters:maximize (bool) – ignored parameter, added for uniformity

NB: this is the only optimizer which supports passing distributions for parameters.
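
For instance, a minimal sketch (the parameter names are illustrative):

>>> from scipy import stats
>>> # a list is sampled uniformly from the options; a distribution is sampled via .rvs()
>>> param_grid = {'learning_rate': stats.uniform(0.01, 0.2), 'subsample': [0.5, 0.7, 1.0]}
>>> generator = RandomParameterOptimizer(param_grid, n_evaluations=20)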

generate_next_point()[source]
class rep.metaml.gridsearch.AnnealingParameterOptimizer(param_grid, n_evaluations=10, temperature=0.2, random_state=None, maximize=True)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

Implementation of the annealing algorithm

Parameters:
  • param_grid – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations
  • temperature (float) – how tolerant we are of worse results. If the temperature is very small, the algorithm never steps to a point with worse predictions.

Doesn’t support parallel execution, so it cannot be used for optimization on a cluster.

generate_batch_points(size)[source]
generate_next_point()[source]

Generating next random point in parameters space

class rep.metaml.gridsearch.SubgridParameterOptimizer(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, subgrid_size=3, maximize=True)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

Uses Metropolis-like optimization. If the parameter grid is large, optimization is first performed on a subgrid.

Parameters:
  • param_grid (OrderedDict) – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations to do
  • random_state (int or RandomState or None) – random generator
  • start_evaluations (int) – count of random point generation on start
  • subgrid_size (int) – if the size of the grid is too large, we first optimize on a subgrid with no more than subgrid_size possible values for each parameter.
add_result(state_indices, value)[source]
generate_next_point()[source]

Generating next point in parameters space

class rep.metaml.gridsearch.RegressionParameterOptimizer(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, n_attempts=10, regressor=None, maximize=True)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

This general method relies on regression. A regressor tries to predict the best point based on the results already known for different parameters.

Parameters:
  • param_grid (OrderedDict) – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations to do
  • random_state (int or RandomState or None) – random generator
  • start_evaluations (int) – count of random point generation on start
  • n_attempts (int) – this number of points is compared on each iteration; the regressor chooses the optimal one among them.
  • regressor – regressor used to choose the next point with the potentially best score (the score is estimated by the regressor); if None, a RandomForest algorithm is used.
generate_next_point()[source]

Generating next random point in parameters space

Interface of parameter optimizer

Each of the parameter optimizers has the following interface.

class rep.metaml.gridsearch.AbstractParameterGenerator(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]

Bases: object

Abstract class for grid search algorithms. The aim of this class is to generate new points at which the function (estimator) will be evaluated. You can define your own algorithm for choosing the next point on the parameter grid.

Parameters:
  • param_grid (OrderedDict) – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations to do
  • random_state (int or RandomState or None) – random generator
  • maximize – whether the algorithm should maximize or minimize the target function.
add_result(state_indices, value)[source]

After the model has been trained and evaluated for a specific set of parameters, this function is used to store the result.

Parameters:
  • state_indices – tuple, which represents the point in the parameter space
  • value – quality at this point

best_params_

Property, returns the point of the parameter grid with the best score

best_score_

Property, returns the best score of the optimization

generate_batch_points(size)[source]

Generate several points in parameter space at once (needed when using parallel computations)

Parameters:size – how many points we shall generate
Returns:tuple of arrays (state_indices, state_parameters)
generate_next_point()[source]

Generating next random point in parameters space

Returns:tuple (indices, parameters)

print_results(reorder=True)[source]

Prints the results of training

Parameters:reorder (bool) – if reorder==True, best results go earlier, otherwise the results are printed in the order of computation

Folding

FoldingClassifier and FoldingRegressor provide an easy way to run k-fold cross-validation. They are also a nice way to combine predictions of the trained classifiers.

class rep.metaml.folding.FoldingClassifier(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]

Bases: rep.metaml.folding.FoldingBase, rep.estimators.interface.Classifier

This meta-classifier implements folding algorithm:

  • split training data into n equal parts;
  • train n classifiers, each one is trained using n-1 folds

To get unbiased predictions for data, pass the same dataset (with same order of events) as in training to prediction methods, in which case each event is predicted with base classifier which didn’t use that event during training.

To use information from several estimators (not just one) during prediction, provide an appropriate voting function. Examples of voting functions:

>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters:
  • base_estimator (sklearn.BaseEstimator) – base classifier, which will be used for training
  • n_folds (int) – count of folds
  • features (None or list[str]) – features used in training
  • parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
  • random_state (None or int or RandomState) – random state for reproducibility
feature_importances_

Sklearn-style way of returning feature importances. They are returned as a numpy.array, assuming that train_features=None was initially passed.

fit(X, y, sample_weight=None)

Train the model; several base classifiers will be trained on overlapping subsets of the training dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • y – labels of events - array-like of shape [n_samples]
  • sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
get_feature_importances()

Get feature importances

Return type:pandas.DataFrame with column effect and index=features
predict(X, vote_function=None)

Predict labels. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type:

numpy.array of shape [n_samples]

predict_proba(X, vote_function=None)

Predict probabilities. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type:

numpy.array of shape [n_samples, n_classes]

staged_predict_proba(X, vote_function=None)

Predict probabilities after each stage of base_estimator. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type:

sequence of numpy.arrays of shape [n_samples, n_classes]
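
A minimal usage sketch (the dataset and feature names are illustrative):

>>> import numpy
>>> from rep.metaml import FoldingClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> folder = FoldingClassifier(GradientBoostingClassifier(), n_folds=3, features=['a', 'b'])
>>> folder.fit(train_data, train_labels)
>>> # same dataset, same order of events: each event is predicted by the classifier that did not see it
>>> unbiased_proba = folder.predict_proba(train_data)
>>> # new data: combine the predictions of all fold classifiers with a voting function
>>> test_proba = folder.predict_proba(test_data, vote_function=lambda x: numpy.mean(x, axis=0))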

class rep.metaml.folding.FoldingRegressor(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]

Bases: rep.metaml.folding.FoldingBase, rep.estimators.interface.Regressor

This meta-regressor implements folding algorithm:

  • split training data into n equal parts;
  • train n regressors, each one is trained using n-1 folds

To get unbiased predictions for data, pass the same dataset (with same order of events) as in training to prediction methods, in which case each event is predicted with base regressor which didn’t use that event during training.

To use information from several estimators (not just one) during prediction, provide an appropriate voting function. Examples of voting functions:

>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters:
  • base_estimator (sklearn.BaseEstimator) – base regressor, which will be used for training
  • n_folds (int) – count of folds
  • features (None or list[str]) – features used in training
  • parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
  • random_state (None or int or RandomState) – random state for reproducibility
feature_importances_

Sklearn-style way of returning feature importances. They are returned as a numpy.array, assuming that train_features=None was initially passed.

fit(X, y, sample_weight=None)

Train the model; several base regressors will be trained on overlapping subsets of the training dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • y – labels of events - array-like of shape [n_samples]
  • sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
get_feature_importances()

Get feature importances

Return type:pandas.DataFrame with column effect and index=features
predict(X, vote_function=None)

Get predictions. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine predictions of the folds’ estimators; if None, the folding scheme is used. The function receives a numpy.ndarray of shape [n_classifiers, n_samples].
Return type:

numpy.array of shape [n_samples, n_outputs]

staged_predict(X, vote_function=None)

Get predictions after each iteration of base estimator. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine predictions of the folds’ estimators; if None, the folding scheme is used. The function receives a numpy.ndarray of shape [n_classifiers, n_samples].
Return type:

sequence of numpy.array of shape [n_samples, n_outputs]

Cache

In many cases training a classification/regression model takes hours. To avoid retraining at each step, one can store the trained classifier in a file and load the trained model later.

However, in this case the user has to check manually whether something in the pipeline has changed (for instance, the train/test splitting).

Cache estimators are a lazy way to store a trained model. After training, the classifier/regressor is stored in a file under a specific name (which was passed in the constructor).

On the next runs the following conditions are checked:

  • the model has the same name
  • the model was trained with exactly the same parameters
  • the model was trained on exactly the same data
  • the stored copy is not too old (10 days by default)

If all the conditions are satisfied, the stored copy is loaded; otherwise the classifier/regressor is fitted anew.

Example of usage

CacheClassifier and CacheRegressor work as meta-estimators:

>>> from rep.estimators import XGBoostClassifier
>>> from rep.metaml import FoldingClassifier
>>> from rep.metaml.cache import CacheClassifier
>>> clf = CacheClassifier('xgboost folding', FoldingClassifier(XGBoostClassifier(), n_folds=3))
>>> # this works normally
>>> clf.fit(X, y, sample_weight)
>>> clf.predict_proba(testX)

However in the following situation:

>>> clf = FoldingClassifier(CacheClassifier('xgboost', XGBoostClassifier()))

the cache is not going to work, because a copy of the classifier is created for each fold. Each time the cache is checked, a version with the same parameters but different data will be found.

So, every time the stored copy will be erased and a new one saved.

By default, the cache is stored in the ‘.cache/rep’ subfolder of the project directory (where the IPython notebook is placed). To change the caching parameters, use:

>>> import rep.metaml.cache
>>> from rep.metaml._cache import CacheHelper
>>> rep.metaml.cache.cache_helper = CacheHelper(folder, expiration_in_seconds)
>>> # to delete all cached items, use:
>>> rep.metaml.cache.cache_helper.clear_cache()
class rep.metaml.cache.CacheClassifier(name, clf, features=None)[source]

Bases: rep.metaml.cache.CacheBase, rep.estimators.sklearn.SklearnClassifier

Cache classifier allows saving trained models in a lazy way. Useful when training the classifier takes a lot of time.

On the next run, stored model in cache will be used instead of fitting again.

Parameters:
  • name – unique name of classifier (to be used in storing)
  • clf (sklearn.BaseEstimator) – your estimator, which will be used for training
  • features – features to use in training.
class rep.metaml.cache.CacheRegressor(name, clf, features=None)[source]

Bases: rep.metaml.cache.CacheBase, rep.estimators.sklearn.SklearnRegressor

Cache regressor allows saving trained models in a lazy way. Useful when training the regressor takes a lot of time.

On the next run, stored model in cache will be used instead of fitting again.

Parameters:
  • name – unique name of the regressor (to be used in storing)
  • clf (sklearn.BaseEstimator) – your estimator, which will be used for training
  • features – features to use in training.

Stacking

FeatureSplitter is defined in this module.

This meta-algorithm is handy for training different models on subsets of the data without manually splitting the data into parts.

class rep.metaml.stacking.FeatureSplitter(split_feature, base_estimator, train_features=None)[source]

Bases: rep.estimators.interface.Classifier

The dataset is split by the values of split_feature; for each value of the feature, a new classifier is trained.

When building predictions, each classifier predicts the events with the same value of split_feature that it was trained on.

Parameters:
  • split_feature (str) – the name of key feature
  • base_estimator – the classifier whose copies are trained on the parts of the dataset
  • train_features (list[str]) – list of columns classifier uses in training
fit(X, y, sample_weight=None)[source]

Fit dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features] with features
  • y – array-like of shape [n_samples] with targets
  • sample_weight – array-like of shape [n_samples] with events weights or None.
Returns:

self

predict_proba(X)[source]

Predict probabilities. Each event is predicted by the classifier trained on corresponding value of split_feature

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Returns:probabilities of shape [n_samples, n_classes]
staged_predict_proba(X)[source]

Predict probabilities after each stage of base classifier. Each event is predicted by the classifier trained on corresponding value of split_feature

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Returns:iterable sequence of numpy.arrays of shape [n_samples, n_classes]
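
A minimal usage sketch (the split feature and column names are illustrative):

>>> from rep.metaml.stacking import FeatureSplitter
>>> from rep.estimators import SklearnClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> # a separate classifier is trained for each value of the 'category' column
>>> splitter = FeatureSplitter('category', SklearnClassifier(GradientBoostingClassifier()), train_features=['a', 'b'])
>>> splitter.fit(train_data, train_labels)
>>> proba = splitter.predict_proba(test_data)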