Meta Machine Learning

Meta machine learning contains specific ML algorithms that take a classification/regression model as input.

There is also a Factory, which makes it very simple to train a set of models and compare them.

Factory

Factory provides a convenient way to train several classifiers on the same dataset. These classifiers can be trained one by one in a single thread, or simultaneously using an IPython cluster or several threads.

Factory also allows comparison of several classifiers (whose predictions can again be computed in parallel).

class rep.metaml.factory.ClassifiersFactory(*args, **kwds)[source]

Bases: rep.metaml.factory.AbstractFactory

Factory provides training of several classifiers in parallel. Quality of trained classifiers can be compared.

Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.

add_classifier(name, classifier)[source]

Add a classifier to the factory; a plain sklearn classifier is automatically wrapped with SklearnClassifier.

Parameters:
  • name (str) – unique name for the classifier. If the name coincides with one already used, the old classifier is replaced by the one passed.
  • classifier (sklearn.base.BaseEstimator or estimators.interface.Classifier) –

    classifier object

    Note

    if the classifier is a plain sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the classifier, wrap it with SklearnClassifier first, as in the sketch below
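
    A minimal sketch (the factory object and feature names here are illustrative):

    >>> from rep.estimators import SklearnClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> # wrap the sklearn estimator to restrict it to the selected features
    >>> factory.add_classifier('gb', SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b']))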

predict(X, parallel_profile=None)[source]

Predict labels for all events in dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • parallel_profile (None or str) – profile for IPython cluster
Return type:

OrderedDict[numpy.array of shape [n_samples] with integer labels]

predict_proba(X, parallel_profile=None)[source]

Predict probabilities for all events in dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • parallel_profile (None or str) – profile for IPython cluster
Return type:

OrderedDict[numpy.array of shape [n_samples] with float predictions]

staged_predict_proba(X)[source]

Predict probabilities on each stage (attention: returns dictionary of generators)

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Return type:dict[iterator]
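
Since a dictionary of generators is returned, each generator has to be iterated explicitly. A minimal sketch (assuming a fitted factory and test_data as in the examples below):

>>> staged = factory.staged_predict_proba(test_data)
>>> for name, stage_probabilities in staged.items():
>>>     for proba in stage_probabilities:
>>>         pass  # probabilities of shape [n_samples, n_classes] after each stage
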
test_on_lds(lds)[source]

Prepare report for factory of estimators

Parameters:lds (LabeledDataStorage) – data
Return type:rep.report.classification.ClassificationReport
class rep.metaml.factory.RegressorsFactory(*args, **kwds)[source]

Bases: rep.metaml.factory.AbstractFactory

Factory provides training of several regressors in parallel. Quality of trained regressors can be compared.

Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.

add_regressor(name, regressor)[source]

Add regressor to factory

Parameters:
  • name (str) – unique name for the regressor. If the name coincides with one already used, the old regressor is replaced by the one passed.
  • regressor (sklearn.base.BaseEstimator or estimators.interface.Regressor) –

    regressor object

    Note

    if the regressor is a plain sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the regressor, wrap it first with SklearnRegressor

predict(X, parallel_profile=None)[source]

Predict values for all events in dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • parallel_profile (None or str) – name of profile used to parallelize computations
Return type:

OrderedDict[numpy.array of shape [n_samples] with float values]

staged_predict(X)[source]

Predict values on each stage (attention: returns dictionary of generators)

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Return type:dict[iterator]
test_on_lds(lds)[source]

Report for factory of estimators

Parameters:lds (LabeledDataStorage) – data
Return type:rep.report.regression.RegressionReport
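
A minimal usage sketch (the dataset names are illustrative; RegressorsFactory is assumed to be importable from rep.metaml in the same way as ClassifiersFactory):

>>> from rep.metaml import RegressorsFactory
>>> from rep.estimators import SklearnRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
>>> regressors = RegressorsFactory()
>>> regressors.add_regressor('gb', SklearnRegressor(GradientBoostingRegressor()))
>>> regressors.add_regressor('forest', SklearnRegressor(RandomForestRegressor()))
>>> regressors.fit(train_data, train_values)
>>> predictions = regressors.predict(test_data)  # OrderedDict: name -> numpy.array of shape [n_samples]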

Factory Examples

  • Prepare dataset
    >>> from sklearn import datasets
    >>> import pandas, numpy
    >>> from rep.utils import train_test_split
    >>> from sklearn.metrics import roc_auc_score
    >>> # iris data
    >>> iris = datasets.load_iris()
    >>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
    >>> labels = iris.target
    >>> # Take just two classes instead of three
    >>> data = data[labels != 2]
    >>> labels = labels[labels != 2]
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Train factory of classifiers
    >>> from rep.metaml import ClassifiersFactory
    >>> from rep.estimators import TMVAClassifier, SklearnClassifier, XGBoostClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> factory = ClassifiersFactory()
    >>> factory.add_classifier('tmva', TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b']))
    >>> factory.add_classifier('ada', GradientBoostingClassifier())
    >>> factory['xgb'] = XGBoostClassifier(features=['a', 'b'])
    >>> factory.fit(train_data, train_labels)
    model ef           was trained in 0.22 seconds
    model tmva         was trained in 2.47 seconds
    model ada          was trained in 0.02 seconds
    model xgb          was trained in 0.01 seconds
    Totally spent 2.71 seconds on training
    >>> pred = factory.predict_proba(test_data)
    data was predicted by tmva         in 0.02 seconds
    data was predicted by ada          in 0.00 seconds
    data was predicted by xgb          in 0.00 seconds
    Totally spent 0.05 seconds on prediction
    >>> print pred
    OrderedDict([('tmva', array([[  9.98732217e-01,   1.26778255e-03], [  9.99649503e-01,   3.50497149e-04], ..])),
                 ('ada', array([[  9.99705117e-01,   2.94883265e-04], [  9.99705117e-01,   2.94883265e-04], ..])),
                 ('xgb', array([[  9.91589248e-01,   8.41078255e-03], ..], dtype=float32))])
    >>> for key in pred:
    >>>    print key, roc_auc_score(test_labels, pred[key][:, 1])
    tmva 0.933035714286
    ada 1.0
    xgb 0.995535714286
    
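  • Compare the trained classifiers via a report (a minimal sketch; it is assumed that LabeledDataStorage can be imported from rep.data)
    >>> from rep.data import LabeledDataStorage
    >>> lds = LabeledDataStorage(test_data, test_labels)
    >>> report = factory.test_on_lds(lds)  # rep.report.classification.ClassificationReport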

Grid Search

This module performs hyperparameter optimization: it finds the best parameters for an estimator using different optimization models. Components of optimization:

  • estimator (for which optimal parameters are searched, any REP classifier will work, see rep.estimators)
  • target metric function (which is maximized, anything meeting REP metric interface, see rep.report.metrics)
  • optimization algorithm (introduced in this module)
  • cross-validation technique (kFolding, introduced in this module)

During optimization, many cycles of estimating quality on different sets of parameters are performed. To speed up the process, threads or an IPython cluster can be used.

GridOptimalSearchCV

The main class linking the whole process is GridOptimalSearchCV, which takes as parameters:

  • estimator to be optimized
  • scorer (which trains classifier and estimates quality using cross-validation)
  • parameter generator (which draws next set of parameters to be checked)
class rep.metaml.gridsearch.GridOptimalSearchCV(estimator, params_generator, scorer, parallel_profile=None)[source]

Bases: object

Optimal search over specified parameter values for an estimator. Uses different optimization techniques to make do with a limited number of evaluations instead of exhaustive grid scanning.

GridOptimalSearchCV implements a “fit” method and a “fit_best_estimator” method to train models.

Parameters:
  • estimator (BaseEstimator) – object of a type that implements the “fit” and “fit_best_estimator” methods. A new object of that type is cloned for each point.
  • params_generator (AbstractParameterGenerator) – generator of grid search algorithm
  • scorer (object) – object which implements the method __call__ with kwargs: “base_estimator”, “params”, “X”, “y”, “sample_weight”
  • parallel_profile (None or str) – name of profile

Attributes:

generator: return grid parameter generator

fit(X, y, sample_weight=None)[source]

Run fit with all sets of parameters.

Parameters:
  • X – array-like, shape = [n_samples, n_features] Training vector, where n_samples is the number of samples and n_features is the number of features.
  • y – array-like, shape = [n_samples] or [n_samples, n_output], optional
  • sample_weight – array-like, shape = [n_samples], weight
fit_best_estimator(X, y, sample_weight=None)[source]

Train estimator with the best parameters

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • y – labels of events - array-like of shape [n_samples]
  • sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
Returns:

the best estimator

generator

Property for params_generator
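
A minimal end-to-end sketch of the optimization loop (the parameter grid, the score function and the data are illustrative; it is assumed that parameter names set by the generator are forwarded to the wrapped sklearn estimator):

>>> from collections import OrderedDict
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.metrics import roc_auc_score
>>> from rep.estimators import SklearnClassifier
>>> from rep.metaml.gridsearch import GridOptimalSearchCV, RandomParameterOptimizer, ClassificationFoldingScorer
>>>
>>> def roc_auc(y_true, proba, sample_weight=None):
>>>     # quality function with the signature expected by the folding scorer
>>>     return roc_auc_score(y_true, proba[:, 1], sample_weight=sample_weight)
>>>
>>> param_grid = OrderedDict([('learning_rate', [0.05, 0.1, 0.2]), ('n_estimators', [50, 100, 200])])
>>> generator = RandomParameterOptimizer(param_grid, n_evaluations=10)
>>> scorer = ClassificationFoldingScorer(roc_auc, folds=3, fold_checks=2)
>>> grid_finder = GridOptimalSearchCV(SklearnClassifier(GradientBoostingClassifier()), generator, scorer)
>>> grid_finder.fit(train_data, train_labels)
>>> grid_finder.generator.print_results()
>>> best_estimator = grid_finder.fit_best_estimator(train_data, train_labels)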

Folding Scorer

Folding cross validation can be used in grid search optimization.

class rep.metaml.gridsearch.ClassificationFoldingScorer(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]

Bases: rep.metaml.gridsearch.FoldingScorerBase

Scorer, which implements logic of data folding and scoring for classification models. This is a function-like object

Parameters:
  • folds (int) – ‘k’ used in k-folding while validating
  • fold_checks (int) – not greater than folds, the number of checks we do by cross-validating
  • score_function (function) – quality function; if fold_checks > 1, the average is computed over the checks.

Example:

>>> def new_score_function(y_true, proba, sample_weight=None):
>>>     '''
>>>     y_true: [n_samples]
>>>     proba: [n_samples, n_classes]
>>>     sample_weight: [n_samples] or None
>>>     '''
>>>     ...
>>>
>>> f_scorer = FoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5

class rep.metaml.gridsearch.RegressionFoldingScorer(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]

Bases: rep.metaml.gridsearch.FoldingScorerBase

Scorer, which implements logic of data folding and scoring for regression models. This is a function-like object

Parameters:
  • folds (int) – ‘k’ used in k-folding while validating
  • fold_checks (int) – not greater than folds, the number of checks we do by cross-validating
  • score_function (function) – quality function; if fold_checks > 1, the average is computed over the checks.

Example:

>>> def new_score_function(y_true, pred, sample_weight=None):
>>>     '''
>>>     y_true: [n_samples]
>>>     pred: [n_samples]
>>>     sample_weight: [n_samples] or None
>>>     '''
>>>     ...
>>>
>>> f_scorer = RegressionFoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5


Available optimization algorithms

class rep.metaml.gridsearch.RandomParameterOptimizer(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

Works in the same way as sklearn.grid_search.RandomizedSearchCV. Each next point is generated independently.

Param_grid: dict with distributions used to sample each parameter:
  • name -> list of possible values (in this case sampled uniformly from the options)
  • name -> distribution (should implement ‘.rvs()’, as scipy distributions do)
Parameters:maximize (bool) – ignored parameter, added for uniformity

NB: this is the only optimizer which supports passing distributions for parameters.
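
For instance, a minimal sketch (the parameter names are illustrative):

>>> from scipy import stats
>>> # a list is sampled uniformly from the options; a distribution is sampled via .rvs()
>>> param_grid = {'learning_rate': stats.uniform(0.01, 0.2), 'subsample': [0.5, 0.7, 1.0]}
>>> generator = RandomParameterOptimizer(param_grid, n_evaluations=20)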

generate_next_point()[source]
class rep.metaml.gridsearch.AnnealingParameterOptimizer(param_grid, n_evaluations=10, temperature=0.2, random_state=None, maximize=True)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

Implementation of the annealing algorithm

Parameters:
  • param_grid – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations
  • temperature (float) – how tolerant we are of worse results. If the temperature is very small, the algorithm never steps to a point with worse predictions.

Doesn’t support parallel execution, so it cannot be used for optimization on a cluster.

generate_batch_points(size)[source]
generate_next_point()[source]

Generating next random point in parameters space

class rep.metaml.gridsearch.SubgridParameterOptimizer(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, subgrid_size=3, maximize=True)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

Uses Metropolis-like optimization. If the parameter grid is large, optimization is first performed on a subgrid.

Parameters:
  • param_grid (OrderedDict) – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations to do
  • random_state (int or RandomState or None) – random generator
  • start_evaluations (int) – count of random point generation on start
  • subgrid_size (int) – if the size of the grid is too large, we first optimize on a subgrid with no more than subgrid_size possible values for each parameter.
add_result(state_indices, value)[source]
generate_next_point()[source]

Generating next point in parameters space

class rep.metaml.gridsearch.RegressionParameterOptimizer(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, n_attempts=10, regressor=None, maximize=True)[source]

Bases: rep.metaml.gridsearch.AbstractParameterGenerator

This general method relies on regression. A regressor tries to predict the best point based on the results already known for different parameters.

Parameters:
  • param_grid (OrderedDict) – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations to do
  • random_state (int or RandomState or None) – random generator
  • start_evaluations (int) – count of random point generation on start
  • n_attempts (int) – this number of points is compared on each iteration; the regressor chooses the optimal one among them.
  • regressor – regressor used to choose the next point with the potentially best score (the score is estimated by the regressor); if None, a RandomForest algorithm is used.
generate_next_point()[source]

Generating next random point in parameters space

Interface of parameter optimizer

Each of the parameter optimizers has the following interface.

class rep.metaml.gridsearch.AbstractParameterGenerator(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]

Bases: object

Abstract class for grid search algorithms. The aim of this class is to generate new points at which the function (estimator) will be evaluated. You can define your own algorithm for choosing the next point on the parameter grid.

Parameters:
  • param_grid (OrderedDict) – the grid with parameters to optimize on
  • n_evaluations (int) – the number of evaluations to do
  • random_state (int or RandomState or None) – random generator
  • maximize – whether the algorithm should maximize or minimize the target function.
add_result(state_indices, value)[source]

After the model has been trained and evaluated for a specific set of parameters, this function is used to store the result.

Parameters:
  • state_indices – tuple, which represents the point in the parameter space
  • value – quality at this point

best_params_

Property, returns the point of the parameter grid with the best score

best_score_

Property, returns the best score of the optimization

generate_batch_points(size)[source]

Generate several points in parameter space at once (needed when using parallel computations)

Parameters:size – how many points we shall generate
Returns:tuple of arrays (state_indices, state_parameters)
generate_next_point()[source]

Generating next random point in parameters space

Returns:tuple (indices, parameters)

print_results(reorder=True)[source]

Prints the results of training

Parameters:reorder (bool) – if reorder==True, best results go earlier, otherwise the results are printed in the order of computation

Folding

FoldingClassifier and FoldingRegressor provide an easy way to run k-fold cross-validation. They are also a nice way to combine predictions of the trained classifiers.

class rep.metaml.folding.FoldingClassifier(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]

Bases: rep.metaml.folding.FoldingBase, rep.estimators.interface.Classifier

This meta-classifier implements folding algorithm:

  • split training data into n equal parts;
  • train n classifiers, each one is trained using n-1 folds

To get unbiased predictions for data, pass the same dataset (with same order of events) as in training to prediction methods, in which case each event is predicted with base classifier which didn’t use that event during training.

To use information from several estimators (not just one) during prediction, provide an appropriate voting function. Examples of voting functions:

>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters:
  • base_estimator (sklearn.BaseEstimator) – base classifier, which will be used for training
  • n_folds (int) – count of folds
  • features (None or list[str]) – features used in training
  • parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
  • random_state (None or int or RandomState) – random state for reproducibility
feature_importances_

Sklearn-style way of returning feature importances. They are returned as a numpy.array, assuming that train_features=None was initially passed.

fit(X, y, sample_weight=None)

Train the model; several base classifiers will be trained on overlapping subsets of the training dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • y – labels of events - array-like of shape [n_samples]
  • sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
get_feature_importances()

Get feature importances

Return type:pandas.DataFrame with column effect and index=features
predict(X, vote_function=None)

Predict labels. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type:

numpy.array of shape [n_samples]

predict_proba(X, vote_function=None)

Predict probabilities. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type:

numpy.array of shape [n_samples, n_classes]

staged_predict_proba(X, vote_function=None)

Predict probabilities after each stage of base_estimator. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type:

sequence of numpy.arrays of shape [n_samples, n_classes]
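
A minimal usage sketch (the dataset and feature names are illustrative):

>>> import numpy
>>> from rep.metaml import FoldingClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> folder = FoldingClassifier(GradientBoostingClassifier(), n_folds=3, features=['a', 'b'])
>>> folder.fit(train_data, train_labels)
>>> # same dataset, same order of events: each event is predicted by the classifier that did not see it
>>> unbiased_proba = folder.predict_proba(train_data)
>>> # new data: combine the predictions of all fold classifiers with a voting function
>>> test_proba = folder.predict_proba(test_data, vote_function=lambda x: numpy.mean(x, axis=0))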

class rep.metaml.folding.FoldingRegressor(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]

Bases: rep.metaml.folding.FoldingBase, rep.estimators.interface.Regressor

This meta-regressor implements folding algorithm:

  • split training data into n equal parts;
  • train n regressors, each one is trained using n-1 folds

To get unbiased predictions for data, pass the same dataset (with same order of events) as in training to prediction methods, in which case each event is predicted with base regressor which didn’t use that event during training.

To use information from several estimators (not just one) during prediction, provide an appropriate voting function. Examples of voting functions:

>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters:
  • base_estimator (sklearn.BaseEstimator) – base regressor, which will be used for training
  • n_folds (int) – count of folds
  • features (None or list[str]) – features used in training
  • parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
  • random_state (None or int or RandomState) – random state for reproducibility
feature_importances_

Sklearn-style way of returning feature importances. They are returned as a numpy.array, assuming that train_features=None was initially passed.

fit(X, y, sample_weight=None)

Train the model; several base regressors will be trained on overlapping subsets of the training dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • y – labels of events - array-like of shape [n_samples]
  • sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
get_feature_importances()

Get feature importances

Return type:pandas.DataFrame with column effect and index=features
predict(X, vote_function=None)

Get predictions. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine predictions of the folds’ estimators; if None, the folding scheme is used. The function receives a numpy.ndarray of shape [n_classifiers, n_samples].
Return type:

numpy.array of shape [n_samples, n_outputs]

staged_predict(X, vote_function=None)

Get predictions after each iteration of base estimator. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features]
  • vote_function (None or function) – function to combine predictions of the folds’ estimators; if None, the folding scheme is used. The function receives a numpy.ndarray of shape [n_classifiers, n_samples].
Return type:

sequence of numpy.array of shape [n_samples, n_outputs]

Cache

In many cases training a classification/regression model takes hours. To avoid retraining at each step, one can store the trained classifier in a file and load the trained model later.

However, in this case the user has to check manually whether something in the pipeline has changed (for instance, the train/test splitting).

Cache estimators are a lazy way to store a trained model. After training, the classifier/regressor is stored in a file under a specific name (which was passed in the constructor).

On the next runs the following conditions are checked:

  • the model has the same name
  • the model was trained with exactly the same parameters
  • the model was trained on exactly the same data
  • the stored copy is not too old (10 days by default)

If all the conditions are satisfied, the stored copy is loaded; otherwise the classifier/regressor is fitted anew.

Example of usage

CacheClassifier and CacheRegressor work as meta-estimators:

>>> from rep.estimators import XGBoostClassifier
>>> from rep.metaml import FoldingClassifier
>>> from rep.metaml.cache import CacheClassifier
>>> clf = CacheClassifier('xgboost folding', FoldingClassifier(XGBoostClassifier(), n_folds=3))
>>> # this works normally
>>> clf.fit(X, y, sample_weight)
>>> clf.predict_proba(testX)

However in the following situation:

>>> clf = FoldingClassifier(CacheClassifier('xgboost', XGBoostClassifier()))

the cache is not going to work, because a copy of the classifier is created for each fold. Each time the cache is checked, a version with the same parameters but different data will be found.

So, every time the stored copy will be erased and a new one saved.

By default, the cache is stored in the ‘.cache/rep’ subfolder of the project directory (where the IPython notebook is placed). To change the caching parameters, use:

>>> import rep.metaml.cache
>>> from rep.metaml._cache import CacheHelper
>>> rep.metaml.cache.cache_helper = CacheHelper(folder, expiration_in_seconds)
>>> # to delete all cached items, use:
>>> rep.metaml.cache.cache_helper.clear_cache()
class rep.metaml.cache.CacheClassifier(name, clf, features=None)[source]

Bases: rep.metaml.cache.CacheBase, rep.estimators.sklearn.SklearnClassifier

Cache classifier allows saving trained models in a lazy way. Useful when training the classifier takes a lot of time.

On the next run, stored model in cache will be used instead of fitting again.

Parameters:
  • name – unique name of classifier (to be used in storing)
  • clf (sklearn.BaseEstimator) – your estimator, which will be used for training
  • features – features to use in training.
class rep.metaml.cache.CacheRegressor(name, clf, features=None)[source]

Bases: rep.metaml.cache.CacheBase, rep.estimators.sklearn.SklearnRegressor

Cache regressor allows saving trained models in a lazy way. Useful when training the regressor takes a lot of time.

On the next run, stored model in cache will be used instead of fitting again.

Parameters:
  • name – unique name of the regressor (to be used in storing)
  • clf (sklearn.BaseEstimator) – your estimator, which will be used for training
  • features – features to use in training.

Stacking

FeatureSplitter is defined in this module.

This meta-algorithm is handy for training different models on subsets of the data without manually splitting the data into parts.

class rep.metaml.stacking.FeatureSplitter(split_feature, base_estimator, train_features=None)[source]

Bases: rep.estimators.interface.Classifier

The dataset is split by the values of split_feature; for each value of the feature, a new classifier is trained.

When building predictions, each classifier predicts the events with the same value of split_feature that it was trained on.

Parameters:
  • split_feature (str) – the name of key feature
  • base_estimator – the classifier whose copies are trained on the parts of the dataset
  • train_features (list[str]) – list of columns classifier uses in training
fit(X, y, sample_weight=None)[source]

Fit dataset.

Parameters:
  • X – pandas.DataFrame of shape [n_samples, n_features] with features
  • y – array-like of shape [n_samples] with targets
  • sample_weight – array-like of shape [n_samples] with events weights or None.
Returns:

self

predict_proba(X)[source]

Predict probabilities. Each event is predicted by the classifier trained on corresponding value of split_feature

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Returns:probabilities of shape [n_samples, n_classes]
staged_predict_proba(X)[source]

Predict probabilities after each stage of base classifier. Each event is predicted by the classifier trained on corresponding value of split_feature

Parameters:X – pandas.DataFrame of shape [n_samples, n_features]
Returns:iterable sequence of numpy.arrays of shape [n_samples, n_classes]
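
A minimal usage sketch (the split feature and column names are illustrative):

>>> from rep.metaml.stacking import FeatureSplitter
>>> from rep.estimators import SklearnClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> # a separate classifier is trained for each value of the 'category' column
>>> splitter = FeatureSplitter('category', SklearnClassifier(GradientBoostingClassifier()), train_features=['a', 'b'])
>>> splitter.fit(train_data, train_labels)
>>> proba = splitter.predict_proba(test_data)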