Meta Machine Learning¶
Meta machine learning contains specific ML algorithms that take a classification or regression model as input.
There is also a Factory, which makes it simple to train a set of models and compare them.
Factory¶
Factory provides a convenient way to train several classifiers on the same dataset. These classifiers can be trained one-by-one in a single thread, or simultaneously with an IPython cluster or in several threads.
Factory also allows comparison of several classifiers (whose predictions can again be computed in parallel).
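The idea behind the factory can be sketched without REP itself. The snippet below is only an illustration under stated assumptions, not the library's implementation: `ToyFactory`, `MajorityClassifier` and `ThresholdClassifier` are hypothetical names. Models are kept in an ordered dictionary, all of them are fitted on the same data, and predictions come back keyed by model name.

```python
from collections import OrderedDict


class MajorityClassifier(object):
    """Hypothetical toy model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdClassifier(object):
    """Hypothetical toy model: predicts 1 when the first feature exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [int(row[0] > self.threshold) for row in X]


class ToyFactory(OrderedDict):
    """Ordered dictionary of named models that are trained on the same dataset."""

    def fit(self, X, y):
        for model in self.values():
            model.fit(X, y)
        return self

    def predict(self, X):
        return OrderedDict((name, model.predict(X)) for name, model in self.items())


X = [[0.1], [0.4], [0.6], [0.9], [0.8]]
y = [0, 0, 1, 1, 1]

factory = ToyFactory()
factory['majority'] = MajorityClassifier()
factory['threshold'] = ThresholdClassifier(threshold=0.5)
factory.fit(X, y)
predictions = factory.predict([[0.2], [0.7]])
```

Because the container is an ordered dictionary, the predictions are returned in the order in which the models were added.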

class
rep.metaml.factory.
ClassifiersFactory
(*args, **kwds)[source]¶ Bases:
rep.metaml.factory.AbstractFactory
Factory provides training of several classifiers in parallel. Quality of trained classifiers can be compared.
Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.

add_classifier
(name, classifier)[source]¶ Add classifier to factory. Automatically wraps classifier with
SklearnClassifier
Parameters:  name (str) – unique name for classifier. If name coincides with one already used, the old classifier will be replaced by one passed.
 classifier (sklearn.base.BaseEstimator or estimators.interface.Classifier) –
classifier object
Note
If the classifier is an instance of sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the classifier, wrap it first with SklearnClassifier.

predict
(X, parallel_profile=None)[source]¶ Predict labels for all events in dataset.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 parallel_profile (None or str) – profile for IPython cluster
Return type: OrderedDict[numpy.array of shape [n_samples] with integer labels]

predict_proba
(X, parallel_profile=None)[source]¶ Predict probabilities for all events in dataset.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 parallel_profile (None or str) – profile for IPython cluster
Return type: OrderedDict[numpy.array of shape [n_samples] with float predictions]

staged_predict_proba
(X)[source]¶ Predict probabilities on each stage (attention: returns dictionary of generators)
Parameters: X – pandas.DataFrame of shape [n_samples, n_features] Return type: dict[iterator]

test_on_lds
(lds)[source]¶ Prepare report for factory of estimators
Parameters: lds (LabeledDataStorage) – data Return type: rep.report.classification.ClassificationReport


class
rep.metaml.factory.
RegressorsFactory
(*args, **kwds)[source]¶ Bases:
rep.metaml.factory.AbstractFactory
Factory provides training of several regressors in parallel. Quality of trained regressors can be compared.
Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.

add_regressor
(name, regressor)[source]¶ Add regressor to factory
Parameters:  name (str) – unique name for regressor. If name coincides with one already used, the old regressor will be replaced by one passed.
 regressor (sklearn.base.BaseEstimator or estimators.interface.Regressor) –
regressor object
Note
If the regressor is an instance of sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the regressor, wrap it first with SklearnRegressor.

predict
(X, parallel_profile=None)[source]¶ Predict values for all events in dataset.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 parallel_profile (None or str) – name of the profile used to parallelize computations
Return type: OrderedDict[numpy.array of shape [n_samples] with float values]

staged_predict
(X)[source]¶ Predict values on each stage
Parameters: X – pandas.DataFrame of shape [n_samples, n_features] Return type: dict[iterator]

test_on_lds
(lds)[source]¶ Report for factory of estimators
Parameters: lds (LabeledDataStorage) – data Return type: rep.report.regression.RegressionReport

Factory Examples¶
 Prepare dataset
>>> from sklearn import datasets
>>> import pandas, numpy
>>> from rep.utils import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> # iris data
>>> iris = datasets.load_iris()
>>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
>>> labels = iris.target
>>> # Take just two classes instead of three
>>> data = data[labels != 2]
>>> labels = labels[labels != 2]
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
 Train factory of classifiers
>>> from rep.metaml import ClassifiersFactory
>>> from rep.estimators import TMVAClassifier, SklearnClassifier, XGBoostClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> factory = ClassifiersFactory()
>>> # estimators
>>> factory.add_classifier('tmva', TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=1, BoostType='Grad', features=['a', 'b']))
>>> factory.add_classifier('ada', GradientBoostingClassifier())
>>> factory['xgb'] = XGBoostClassifier(features=['a', 'b'])
>>> factory.fit(train_data, train_labels)
model ef was trained in 0.22 seconds
model tmva was trained in 2.47 seconds
model ada was trained in 0.02 seconds
model xgb was trained in 0.01 seconds
Totally spent 2.71 seconds on training
>>> pred = factory.predict_proba(test_data)
data was predicted by tmva in 0.02 seconds
data was predicted by ada in 0.00 seconds
data was predicted by xgb in 0.00 seconds
Totally spent 0.05 seconds on prediction
>>> print pred
OrderedDict([('tmva', array([[ 9.98732217e-01, 1.26778255e-03], [ 9.99649503e-01, 3.50497149e-04], ..])),
('ada', array([[ 9.99705117e-01, 2.94883265e-04], [ 9.99705117e-01, 2.94883265e-04], ..])),
('xgb', array([[ 9.91589248e-01, 8.41078255e-03], ..], dtype=float32))])
>>> for key in pred:
...     print key, roc_auc_score(test_labels, pred[key][:, 1])
tmva 0.933035714286
ada 1.0
xgb 0.995535714286
Grid Search¶
This module performs hyperparameter optimization: it finds the best parameters for an estimator using different optimization models. Components of optimization:
 estimator (for which optimal parameters are searched; any REP classifier will work, see rep.estimators)
 target metric function (which is maximized; anything meeting the REP metric interface, see rep.report.metrics)
 optimization algorithm (introduced in this module)
 cross-validation technique (k-folding, introduced in this module)
During optimization, quality is estimated many times on different sets of parameters. To speed up the process, threads or an IPython cluster can be used.
GridOptimalSearchCV¶
The main class linking the whole process is GridOptimalSearchCV, which takes as parameters:
 estimator to be optimized
 scorer (which trains the classifier and estimates quality using cross-validation)
 parameter generator (which draws the next set of parameters to be checked)

class
rep.metaml.gridsearch.
GridOptimalSearchCV
(estimator, params_generator, scorer, parallel_profile=None)[source]¶ Bases:
object
Optimal search over specified parameter values for an estimator. Uses different optimization techniques to make do with a limited number of evaluations instead of exhaustive grid scanning.
GridOptimalSearchCV implements a “fit” method and a “fit_best_estimator” method to train models.
Parameters:  estimator (BaseEstimator) – object of a type that implements the “fit” and “fit_best_estimator” methods. A new object of that type is cloned for each point.
 params_generator (AbstractParameterGenerator) – generator of the grid search algorithm
 scorer (object) – callable that implements __call__ with kwargs: “base_estimator”, “params”, “X”, “y”, “sample_weight”
 parallel_profile (None or str) – name of profile
Attributes:
generator: return grid parameter generator

fit
(X, y, sample_weight=None)[source]¶ Run fit with all sets of parameters.
Parameters:  X – array-like, shape = [n_samples, n_features]. Training vector, where n_samples is the number of samples and n_features is the number of features.
 y – array-like, shape = [n_samples] or [n_samples, n_output], optional
 sample_weight – array-like, shape = [n_samples], weight

fit_best_estimator
(X, y, sample_weight=None)[source]¶ Train estimator with the best parameters
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 y – labels of events, array-like of shape [n_samples]
 sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
Returns: the best estimator

generator
¶ Property for params_generator
Folding Scorer¶
Folding cross validation can be used in grid search optimization.

class
rep.metaml.gridsearch.
ClassificationFoldingScorer
(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]¶ Bases:
rep.metaml.gridsearch.FoldingScorerBase
Scorer that implements the logic of data folding and scoring for classification models. This is a function-like object.
Parameters:  folds (int) – ‘k’ used in k-folding while validating
 fold_checks (int) – not greater than folds, the number of checks performed during cross-validation
 score_function (function) – function computing quality; if fold_checks > 1, the average over checks is computed.
Example:
>>> def new_score_function(y_true, proba, sample_weight=None):
...     '''
...     y_true: [n_samples]
...     proba: [n_samples, n_classes]
...     sample_weight: [n_samples] or None
...     '''
...     ...
...
>>> f_scorer = ClassificationFoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5
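The interplay of folds and fold_checks can be illustrated with a small sketch. It is not REP's implementation: `folding_score` and the `train_and_score` callback are hypothetical names, shuffling is omitted, and the fold assignment rule (event i goes to fold i % folds) is an assumption made for brevity.

```python
def folding_score(train_and_score, n_samples, folds=3, fold_checks=1):
    """Average a quality score over `fold_checks` of the `folds` train/test splits.

    train_and_score(train_idx, test_idx) is a hypothetical callback that trains
    a model on train_idx and returns its quality on test_idx.
    """
    if not 1 <= fold_checks <= folds:
        raise ValueError("fold_checks must be between 1 and folds")
    # Event i belongs to fold i % folds (no shuffling in this sketch).
    fold_of = [i % folds for i in range(n_samples)]
    scores = []
    for check in range(fold_checks):
        test_idx = [i for i in range(n_samples) if fold_of[i] == check]
        train_idx = [i for i in range(n_samples) if fold_of[i] != check]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / float(len(scores))


# Toy callback: the "quality" is simply the fraction of events used for training.
def toy_quality(train_idx, test_idx):
    return len(train_idx) / float(len(train_idx) + len(test_idx))


score = folding_score(toy_quality, n_samples=9, folds=3, fold_checks=2)
```

With fold_checks smaller than folds only some of the possible splits are evaluated, which trades precision of the estimate for speed.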

class
rep.metaml.gridsearch.
RegressionFoldingScorer
(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]¶ Bases:
rep.metaml.gridsearch.FoldingScorerBase
Scorer that implements the logic of data folding and scoring for regression models. This is a function-like object.
Parameters:  folds (int) – ‘k’ used in k-folding while validating
 fold_checks (int) – not greater than folds, the number of checks performed during cross-validation
 score_function (function) – function computing quality; if fold_checks > 1, the average over checks is computed.
Example:
>>> def new_score_function(y_true, pred, sample_weight=None):
...     '''
...     y_true: [n_samples]
...     pred: [n_samples]
...     sample_weight: [n_samples] or None
...     '''
...     ...
...
>>> f_scorer = RegressionFoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5
Available optimization algorithms¶

class
rep.metaml.gridsearch.
RandomParameterOptimizer
(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]¶ Bases:
rep.metaml.gridsearch.AbstractParameterGenerator
Works in the same way as sklearn.grid_search.RandomizedSearchCV. Each next point is generated independently.
Parameters:  param_grid – dict with distributions used to sample each parameter. Each entry is either name -> list of possible values (sampled uniformly from the options) or name -> distribution (should implement ‘.rvs()’, as scipy distributions do)
 maximize (bool) – ignored parameter, added for uniformity
NB: this is the only optimizer that supports passing distributions for parameters.
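A sketch of how such a mixed grid might be sampled is given below. It only illustrates the two kinds of entries (lists and objects with `.rvs()`); `UniformFloat` and `sample_point` are hypothetical helpers, not part of REP or scipy.

```python
import random


class UniformFloat(object):
    """Hypothetical distribution exposing `.rvs()`, as scipy.stats distributions do."""

    def __init__(self, low, high, rng):
        self.low, self.high, self.rng = low, high, rng

    def rvs(self):
        return self.rng.uniform(self.low, self.high)


def sample_point(param_grid, rng):
    """Draw one parameter set: lists are sampled uniformly, other values via .rvs()."""
    point = {}
    for name in sorted(param_grid):
        options = param_grid[name]
        if isinstance(options, (list, tuple)):
            point[name] = rng.choice(options)
        else:
            point[name] = options.rvs()
    return point


rng = random.Random(42)
param_grid = {
    'n_estimators': [10, 50, 100],
    'learning_rate': UniformFloat(0.01, 0.3, rng),
}
# Every point is drawn independently of the previous ones.
points = [sample_point(param_grid, rng) for _ in range(5)]
```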

class
rep.metaml.gridsearch.
AnnealingParameterOptimizer
(param_grid, n_evaluations=10, temperature=0.2, random_state=None, maximize=True)[source]¶ Bases:
rep.metaml.gridsearch.AbstractParameterGenerator
Implementation of the annealing algorithm
Parameters:  param_grid – the grid with parameters to optimize on
 n_evaluations (int) – the number of evaluations
 temperature – float, how tolerant we are to worse results. If temperature is very small, the algorithm never steps to a point with worse predictions.
Doesn’t support parallel execution, so it cannot be used in optimization on a cluster.
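The role of temperature can be illustrated with the standard Metropolis acceptance rule. Whether REP uses exactly this formula is an assumption; the `accept` function below is a hypothetical sketch for a maximization problem.

```python
import math
import random


def accept(current_score, new_score, temperature, rng):
    """Metropolis-style acceptance rule for a maximization problem."""
    if new_score >= current_score:
        return True
    # A worse point is accepted with probability exp(-(loss) / temperature).
    probability = math.exp((new_score - current_score) / temperature)
    return rng.random() < probability


rng = random.Random(0)
# With a tiny temperature a worse point is practically never accepted.
cold = sum(accept(0.9, 0.8, 1e-3, rng) for _ in range(1000))
# With a high temperature worse points are accepted most of the time.
hot = sum(accept(0.9, 0.8, 10.0, rng) for _ in range(1000))
```

Each decision depends on the score of the current point, which is consistent with the note that this optimizer cannot evaluate points in parallel.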

class
rep.metaml.gridsearch.
SubgridParameterOptimizer
(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, subgrid_size=3, maximize=True)[source]¶ Bases:
rep.metaml.gridsearch.AbstractParameterGenerator
Uses Metropolis-like optimization. If the parameter grid is large, first performs optimization on a subgrid.
Parameters:  param_grid (OrderedDict) – the grid with parameters to optimize on
 n_evaluations (int) – the number of evaluations to do
 random_state (int or RandomState or None) – random generator
 start_evaluations (int) – count of random point generation on start
 subgrid_size (int) – if the mesh is too large, optimization is first performed on a subgrid with no more than subgrid_size possible values for each parameter.

class
rep.metaml.gridsearch.
RegressionParameterOptimizer
(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, n_attempts=10, regressor=None, maximize=True)[source]¶ Bases:
rep.metaml.gridsearch.AbstractParameterGenerator
This general method relies on regression. The regressor tries to predict the best point based on the already known results for different parameters.
Parameters:  param_grid (OrderedDict) – the grid with parameters to optimize on
 n_evaluations (int) – the number of evaluations to do
 random_state (int or RandomState or None) – random generator
 start_evaluations (int) – count of random point generation on start
 n_attempts (int) – this number of points will be compared on each iteration. The regressor chooses the optimal one among them.
 regressor – regressor used to choose the next point with the potentially best score (the score is estimated by the regressor). If None, then the RandomForest algorithm will be used.
Interface of parameter optimizer¶
Each of parameter optimizers has the following interface.

class
rep.metaml.gridsearch.
AbstractParameterGenerator
(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]¶ Bases:
object
Abstract class for a grid search algorithm. The aim of this class is to generate new points at which the function (estimator) will be evaluated. You can define your own algorithm for choosing the next step in the parameter grid.
Parameters:  param_grid (OrderedDict) – the grid with parameters to optimize on
 n_evaluations (int) – the number of evaluations to do
 random_state (int or RandomState or None) – random generator
 maximize – whether algorithm should maximize or minimize target function.

add_result
(state_indices, value)[source]¶ After the model has been trained and evaluated for a specific set of parameters, this function is used to store the result.
Parameters:  state_indices – tuple, which represents the point in the parameter space
 value – quality at this point

best_params_
¶ Property, return point of parameters grid with the best score

best_score_
¶ Property, return best score of optimization

generate_batch_points
(size)[source]¶ Generate several points in parameter space at once (needed when using parallel computations)
Parameters: size – how many points we shall generate Returns: tuple of arrays (state_indices, state_parameters)
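The interface can be illustrated with a self-contained stand-in. `ToyRandomGenerator` below is hypothetical: it does not inherit from the REP base class and only mimics the documented members (generate_batch_points, add_result, best_params_, best_score_), choosing unvisited grid points at random.

```python
import itertools
import random
from collections import OrderedDict


class ToyRandomGenerator(object):
    """Hypothetical stand-in that mimics the interface described above."""

    def __init__(self, param_grid, n_evaluations=10, maximize=True, random_state=None):
        self.param_grid = OrderedDict(param_grid)
        self.n_evaluations = n_evaluations
        self.maximize = maximize
        self.rng = random.Random(random_state)
        self.results = OrderedDict()  # state_indices -> value

    def _indices_to_parameters(self, state_indices):
        names = list(self.param_grid.keys())
        return OrderedDict((name, self.param_grid[name][index])
                           for name, index in zip(names, state_indices))

    def generate_batch_points(self, size):
        """Return (state_indices, state_parameters) for up to `size` unvisited points."""
        all_states = itertools.product(*[range(len(v)) for v in self.param_grid.values()])
        unvisited = [s for s in all_states if s not in self.results]
        chosen = self.rng.sample(unvisited, min(size, len(unvisited)))
        return chosen, [self._indices_to_parameters(s) for s in chosen]

    def add_result(self, state_indices, value):
        self.results[tuple(state_indices)] = value

    @property
    def best_score_(self):
        pick = max if self.maximize else min
        return pick(self.results.values())

    @property
    def best_params_(self):
        pick = max if self.maximize else min
        best_state = pick(self.results, key=self.results.get)
        return self._indices_to_parameters(best_state)


grid = OrderedDict([('depth', [2, 4, 6]), ('rate', [0.1, 0.3])])
generator = ToyRandomGenerator(grid, n_evaluations=6, random_state=1)
while len(generator.results) < generator.n_evaluations:
    states, parameters = generator.generate_batch_points(size=2)
    for state, params in zip(states, parameters):
        # Toy objective standing in for a cross-validated score.
        generator.add_result(state, params['depth'] * params['rate'])
```

Generating points in batches is what makes parallel evaluation possible: all points of a batch can be scored at the same time before their results are reported back.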
Folding¶
FoldingClassifier and FoldingRegressor provide an easy way to run k-folding cross-validation. They are also a convenient way to combine predictions of trained classifiers.

class
rep.metaml.folding.
FoldingClassifier
(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]¶ Bases:
rep.metaml.folding.FoldingBase
,rep.estimators.interface.Classifier
This meta-classifier implements the folding algorithm:
 split training data into n equal parts;
 train n classifiers, each one trained using n-1 folds
To get unbiased predictions for data, pass the same dataset (with the same order of events) as in training to the prediction methods; in that case each event is predicted by the base classifier that didn’t use that event during training.
To use information from not one but several estimators during predictions, provide an appropriate voting function. Examples of voting functions:
>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters:  base_estimator (sklearn.BaseEstimator) – base classifier, which will be used for training
 n_folds (int) – count of folds
 features (None or list[str]) – features used in training
 parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
 random_state (None or int or RandomState) – random state for reproducibility
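The folding scheme itself can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than REP's code: `MeanRegressor`, `fit_folds` and `predict_folding` are hypothetical names, and a trivial mean-predicting model stands in for the base estimator.

```python
import random


class MeanRegressor(object):
    """Hypothetical toy model: predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_folds(X, y, n_folds, random_state=None):
    """Assign every event to a fold and train one model per fold on the other folds."""
    rng = random.Random(random_state)
    fold_of = [i % n_folds for i in range(len(X))]
    rng.shuffle(fold_of)
    models = []
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if fold_of[i] != fold]
        models.append(MeanRegressor().fit([X[i] for i in train], [y[i] for i in train]))
    return models, fold_of


def predict_folding(models, fold_of, X, vote_function=None):
    """Without vote_function each event is predicted by the model that never saw it."""
    per_model = [model.predict(X) for model in models]  # [n_models][n_samples]
    if vote_function is None:
        return [per_model[fold_of[i]][i] for i in range(len(X))]
    return vote_function(per_model)


X = [[v] for v in range(6)]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
models, fold_of = fit_folds(X, y, n_folds=3, random_state=0)
unbiased = predict_folding(models, fold_of, X)
# Pure-Python analogue of numpy.mean(x, axis=0): average over the models.
mean_vote = lambda columns: [sum(c) / float(len(c)) for c in zip(*columns)]
averaged = predict_folding(models, fold_of, X, vote_function=mean_vote)
```

The unbiased predictions are only meaningful when the data is passed in the same order as during training, because the fold assignment is looked up by position.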

feature_importances_
¶ Sklearn-way of returning feature importance. It is returned as numpy.array, assuming that train_features=None was initially passed

fit
(X, y, sample_weight=None)¶ Train the model, will train several base classifiers on overlapping subsets of training dataset.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 y – labels of events, array-like of shape [n_samples]
 sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal

get_feature_importances
()¶ Get features importance
Return type: pandas.DataFrame with column effect and index=features

predict
(X, vote_function=None)¶ Predict labels. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type: numpy.array of shape [n_samples]

predict_proba
(X, vote_function=None)¶ Predict probabilities. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type: numpy.array of shape [n_samples, n_classes]

staged_predict_proba
(X, vote_function=None)¶ Predict probabilities after each stage of base_estimator. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type: sequence of numpy.arrays of shape [n_samples, n_classes]

class
rep.metaml.folding.
FoldingRegressor
(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]¶ Bases:
rep.metaml.folding.FoldingBase
,rep.estimators.interface.Regressor
This meta-regressor implements the folding algorithm:
 split training data into n equal parts;
 train n regressors, each one trained using n-1 folds
To get unbiased predictions for data, pass the same dataset (with the same order of events) as in training to the prediction methods; in that case each event is predicted by the base regressor that didn’t use that event during training.
To use information from not one but several estimators during predictions, provide an appropriate voting function. Examples of voting functions:
>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters:  base_estimator (sklearn.BaseEstimator) – base regressor, which will be used for training
 n_folds (int) – count of folds
 features (None or list[str]) – features used in training
 parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
 random_state (None or int or RandomState) – random state for reproducibility

feature_importances_
¶ Sklearn-way of returning feature importance. It is returned as numpy.array, assuming that train_features=None was initially passed

fit
(X, y, sample_weight=None)¶ Train the model, will train several base regressors on overlapping subsets of training dataset.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 y – target values of events, array-like of shape [n_samples]
 sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal

get_feature_importances
()¶ Get features importance
Return type: pandas.DataFrame with column effect and index=features

predict
(X, vote_function=None)¶ Get predictions. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 vote_function (None or function) – function to combine predictions of the folds’ estimators; it takes a numpy.ndarray of shape [n_classifiers, n_samples]. If None, the folding scheme is used.
Return type: numpy.array of shape [n_samples, n_outputs]

staged_predict
(X, vote_function=None)¶ Get predictions after each iteration of base estimator. To get unbiased predictions on training dataset, pass training data (with same order of events) and vote_function=None.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features]
 vote_function (None or function) – function to combine predictions of the folds’ estimators; it takes a numpy.ndarray of shape [n_classifiers, n_samples]. If None, the folding scheme is used.
Return type: sequence of numpy.array of shape [n_samples, n_outputs]
Cache¶
In many cases training a classification/regression model takes hours. To avoid retraining at each step, one can store the trained classifier in a file and load the trained model later.
However, in this case the user has to manually take care of situations when something in the pipeline has changed (for instance, train/test splitting).
Cache estimators are a lazy way to store a trained model. After training, the classifier/regressor is stored in a file under a specific name (which was passed to the constructor).
On subsequent runs the following conditions are checked:
 model has the same name
 model has exactly the same parameters
 model is trained using exactly the same data
 stored copy is not too old (10 days by default)
If all the conditions are satisfied, the stored copy is loaded; otherwise the classifier/regressor is fitted.
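The checks listed above can be sketched as follows. This is a simplified in-memory illustration, not the REP cache: `ToyCache`, `fingerprint` and `fit_or_load` are hypothetical names, and hashing pickled objects is just one possible way to detect changed parameters or data.

```python
import hashlib
import pickle
import time


def fingerprint(obj):
    """Hash of a picklable object; used to detect changed parameters or data."""
    return hashlib.md5(pickle.dumps(obj, protocol=2)).hexdigest()


class ToyCache(object):
    """Hypothetical in-memory cache keyed by model name."""

    def __init__(self, expiration_in_seconds=10 * 24 * 60 * 60):
        self.expiration = expiration_in_seconds
        self.storage = {}

    def fit_or_load(self, name, params, data, train, now=None):
        now = time.time() if now is None else now
        key = (fingerprint(params), fingerprint(data))
        entry = self.storage.get(name)
        if entry is not None:
            stored_key, stored_at, model = entry
            if stored_key == key and now - stored_at <= self.expiration:
                return model, True  # every condition holds: reuse the stored copy
        model = train(params, data)  # otherwise retrain and overwrite the entry
        self.storage[name] = (key, now, model)
        return model, False


calls = []


def train(params, data):
    """Toy training function that records how often it is called."""
    calls.append(params)
    return sum(data) * params['scale']


cache = ToyCache(expiration_in_seconds=100)
first = cache.fit_or_load('toy', {'scale': 2}, [1, 2, 3], train, now=0)
second = cache.fit_or_load('toy', {'scale': 2}, [1, 2, 3], train, now=50)
changed = cache.fit_or_load('toy', {'scale': 2}, [1, 2, 4], train, now=60)
expired = cache.fit_or_load('toy', {'scale': 2}, [1, 2, 4], train, now=500)
```

Only the second call reuses the stored model; the third sees different data and the fourth finds an expired entry, so both retrain.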
Example of usage¶
CacheClassifier and CacheRegressor work as meta-estimators:
>>> from rep.estimators import XGBoostClassifier
>>> from rep.metaml import FoldingClassifier
>>> from rep.metaml.cache import CacheClassifier
>>> clf = CacheClassifier('xgboost folding', FoldingClassifier(XGBoostClassifier(), n_folds=3))
>>> # this works normally
>>> clf.fit(X, y, sample_weight)
>>> clf.predict_proba(testX)
However in the following situation:
>>> clf = FoldingClassifier(CacheClassifier('xgboost', XGBoostClassifier()))
the cache is not going to work, because a copy of the classifier is created for each fold. Each time the cache is checked, a version with the same parameters but different data will be found.
So every time the stored copy will be erased and a new one saved.
By default, the cache is stored in the ‘.cache/rep’ subfolder of the project directory (where the IPython notebook is placed). To change the caching parameters, use:
>>> import rep.metaml.cache
>>> from rep.metaml._cache import CacheHelper
>>> rep.metaml.cache.cache_helper = CacheHelper(folder, expiration_in_seconds)
>>> # to delete all cached items, use:
>>> rep.metaml.cache.cache_helper.clear_cache()

class
rep.metaml.cache.
CacheClassifier
(name, clf, features=None)[source]¶ Bases:
rep.metaml.cache.CacheBase
,rep.estimators.sklearn.SklearnClassifier
Cache classifier allows saving trained models in a lazy way. Useful when training a classifier takes a long time.
On the next run, the model stored in the cache will be used instead of fitting again.
Parameters:  name – unique name of classifier (to be used in storing)
 clf (sklearn.BaseEstimator) – your estimator, which will be used for training
 features – features to use in training.

class
rep.metaml.cache.
CacheRegressor
(name, clf, features=None)[source]¶ Bases:
rep.metaml.cache.CacheBase
,rep.estimators.sklearn.SklearnRegressor
Cache regressor allows saving trained models in a lazy way. Useful when training a regressor takes a long time.
On the next run, the model stored in the cache will be used instead of fitting again.
Parameters:  name – unique name of regressor (to be used in storing)
 clf (sklearn.BaseEstimator) – your estimator, which will be used for training
 features – features to use in training.
Stacking¶
FeatureSplitter is defined in this module.
This meta-algorithm is handy for training different models on subsets of the data without manually splitting the data into parts.

class
rep.metaml.stacking.
FeatureSplitter
(split_feature, base_estimator, train_features=None)[source]¶ Bases:
rep.estimators.interface.Classifier
Dataset is split by values of split_feature; for each value of the feature, a new classifier is trained.
When building predictions, each classifier predicts the events with the same value of split_feature it was trained on.
Parameters:  split_feature (str) – the name of key feature
 base_estimator – the classifier; its copies are trained on parts of the dataset
 train_features (list[str]) – list of columns classifier uses in training
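The splitting logic can be sketched with plain Python containers. The code below is an assumption-laden illustration, not REP's implementation: `ToyFeatureSplitter` and `MajorityClassifier` are hypothetical names, events are represented as dictionaries, and label prediction is used instead of probabilities to keep the sketch short.

```python
from collections import Counter


class MajorityClassifier(object):
    """Hypothetical toy base estimator: predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyFeatureSplitter(object):
    """Trains one copy of the base estimator per value of `split_feature`."""

    def __init__(self, split_feature, make_estimator):
        self.split_feature = split_feature
        self.make_estimator = make_estimator  # callable returning a fresh estimator

    def fit(self, rows, y):
        groups = {}
        for row, label in zip(rows, y):
            part, labels = groups.setdefault(row[self.split_feature], ([], []))
            part.append(row)
            labels.append(label)
        self.estimators_ = {value: self.make_estimator().fit(part, labels)
                            for value, (part, labels) in groups.items()}
        return self

    def predict(self, rows):
        # Each event goes to the estimator trained on its value of split_feature.
        return [self.estimators_[row[self.split_feature]].predict([row])[0]
                for row in rows]


rows = [{'detector': 'A', 'x': 0.1}, {'detector': 'A', 'x': 0.2},
        {'detector': 'A', 'x': 0.3}, {'detector': 'B', 'x': 0.1},
        {'detector': 'B', 'x': 0.5}, {'detector': 'B', 'x': 0.2}]
y = [1, 1, 0, 0, 0, 1]

splitter = ToyFeatureSplitter('detector', MajorityClassifier).fit(rows, y)
predicted = splitter.predict([{'detector': 'B', 'x': 0.9}, {'detector': 'A', 'x': 0.9}])
```

In this sketch an event whose value of the split feature was never seen in training raises a KeyError; how REP handles that case is not described here.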

fit
(X, y, sample_weight=None)[source]¶ Fit dataset.
Parameters:  X – pandas.DataFrame of shape [n_samples, n_features] with features
 y – array-like of shape [n_samples] with targets
 sample_weight – array-like of shape [n_samples] with events weights or None.
Returns: self

predict_proba
(X)[source]¶ Predict probabilities. Each event is predicted by the classifier trained on corresponding value of split_feature
Parameters: X – pandas.DataFrame of shape [n_samples, n_features] Returns: probabilities of shape [n_samples, n_classes]

staged_predict_proba
(X)[source]¶ Predict probabilities after each stage of base classifier. Each event is predicted by the classifier trained on corresponding value of split_feature
Parameters: X – pandas.DataFrame of shape [n_samples, n_features] Returns: iterable sequence of numpy.arrays of shape [n_samples, n_classes]