Meta Machine Learning¶
Meta machine learning contains ML algorithms that take another classification/regression model as input.
It also provides a Factory, which makes it simple to train a set of models and compare them.
Factory¶
Factory provides a convenient way to train several classifiers on the same dataset. These classifiers can be trained one by one in a single thread, in several threads, or simultaneously on an IPython cluster.
Factory also allows comparing several classifiers (their predictions can likewise be computed in parallel).
class rep.metaml.factory.ClassifiersFactory(*args, **kwds)[source]¶
Bases: rep.metaml.factory.AbstractFactory
Factory provides training of several classifiers in parallel. Quality of trained classifiers can be compared.
Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.
add_classifier(name, classifier)[source]¶
Add a classifier to the factory. Automatically wraps the classifier with SklearnClassifier.
Parameters: - name (str) – unique name for the classifier. If the name coincides with one already used, the old classifier is replaced by the one passed.
- classifier (sklearn.base.BaseEstimator or estimators.interface.Classifier) – classifier object
Note
If type == sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the classifier, wrap it with SklearnClassifier.
predict(X, parallel_profile=None)[source]¶
Predict labels for all events in the dataset.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- parallel_profile (None or str) – profile for IPython cluster
Return type: OrderedDict[numpy.array of shape [n_samples] with integer labels]
predict_proba(X, parallel_profile=None)[source]¶
Predict probabilities for all events in the dataset.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- parallel_profile (None or str) – profile for IPython cluster
Return type: OrderedDict[numpy.array of shape [n_samples] with float predictions]
staged_predict_proba(X)[source]¶
Predict probabilities on each stage (attention: returns a dictionary of generators).
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
Return type: dict[iterator]
test_on_lds(lds)[source]¶
Prepare a report for the factory of estimators.
Parameters: - lds (LabeledDataStorage) – data
Return type: rep.report.classification.ClassificationReport
class rep.metaml.factory.RegressorsFactory(*args, **kwds)[source]¶
Bases: rep.metaml.factory.AbstractFactory
Factory provides training of several regressors in parallel. Quality of trained regressors can be compared.
Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.
add_regressor(name, regressor)[source]¶
Add a regressor to the factory.
Parameters: - name (str) – unique name for the regressor. If the name coincides with one already used, the old regressor is replaced by the one passed.
- regressor (sklearn.base.BaseEstimator or estimators.interface.Regressor) – regressor object
Note
If type == sklearn.base.BaseEstimator, then features=None is used; to specify the features used by the regressor, wrap it first with SklearnRegressor.
predict(X, parallel_profile=None)[source]¶
Predict values for all events in the dataset.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- parallel_profile (None or str) – name of the profile used to parallelize computations
Return type: OrderedDict[numpy.array of shape [n_samples] with float values]
staged_predict(X)[source]¶
Predict values on each stage (returns a dictionary of generators).
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
Return type: dict[iterator]
test_on_lds(lds)[source]¶
Prepare a report for the factory of estimators.
Parameters: - lds (LabeledDataStorage) – data
Return type: rep.report.regression.RegressionReport
Factory Examples¶
- Prepare dataset
>>> from sklearn import datasets
>>> import pandas, numpy
>>> from rep.utils import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> # iris data
>>> iris = datasets.load_iris()
>>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
>>> labels = iris.target
>>> # Take just two classes instead of three
>>> data = data[labels != 2]
>>> labels = labels[labels != 2]
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
- Train factory of classifiers
>>> from rep.metaml import ClassifiersFactory
>>> from rep.estimators import TMVAClassifier, SklearnClassifier, XGBoostClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> factory = ClassifiersFactory()
>>> factory.add_classifier('tmva', TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b']))
>>> factory.add_classifier('ada', GradientBoostingClassifier())
>>> factory['xgb'] = XGBoostClassifier(features=['a', 'b'])
>>> factory.fit(train_data, train_labels)
model ef was trained in 0.22 seconds
model tmva was trained in 2.47 seconds
model ada was trained in 0.02 seconds
model xgb was trained in 0.01 seconds
Totally spent 2.71 seconds on training
>>> pred = factory.predict_proba(test_data)
data was predicted by tmva in 0.02 seconds
data was predicted by ada in 0.00 seconds
data was predicted by xgb in 0.00 seconds
Totally spent 0.05 seconds on prediction
>>> print(pred)
OrderedDict([('tmva', array([[ 9.98732217e-01, 1.26778255e-03], [ 9.99649503e-01, 3.50497149e-04], ..])), ('ada', array([[ 9.99705117e-01, 2.94883265e-04], [ 9.99705117e-01, 2.94883265e-04], ..])), ('xgb', array([[ 9.91589248e-01, 8.41078255e-03], ..], dtype=float32))])
>>> for key in pred:
...     print(key, roc_auc_score(test_labels, pred[key][:, 1]))
tmva 0.933035714286
ada 1.0
xgb 0.995535714286
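- Compare the trained classifiers with a report. This is a minimal sketch: it assumes that LabeledDataStorage can be imported from rep.data and that the returned ClassificationReport exposes a roc() plot, neither of which is documented in this section.
>>> from rep.data import LabeledDataStorage
>>> # wrap the held-out data into a LabeledDataStorage, as expected by test_on_lds
>>> lds = LabeledDataStorage(test_data, test_labels)
>>> report = factory.test_on_lds(lds)
>>> # compare the models, e.g. via ROC curves (assumed ClassificationReport.roc() API)
>>> report.roc().plot()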
Grid Search¶
This module performs hyperparameter optimization: it finds the best parameters for an estimator using different optimization models. Components of the optimization:
- estimator (for which optimal parameters are searched; any REP classifier will work, see rep.estimators)
- target metric function (which is maximized; anything meeting the REP metric interface, see rep.report.metrics)
- optimization algorithm (introduced in this module)
- cross-validation technique (k-folding, introduced in this module)
During optimization, many cycles of estimating quality on different sets of parameters are performed. To speed up the process, threads or an IPython cluster can be used.
GridOptimalSearchCV¶
The main class linking the whole process is GridOptimalSearchCV, which takes as parameters:
- estimator to be optimized
- scorer (which trains classifier and estimates quality using cross-validation)
- parameter generator (which draws next set of parameters to be checked)
class rep.metaml.gridsearch.GridOptimalSearchCV(estimator, params_generator, scorer, parallel_profile=None)[source]¶
Bases: object
Optimal search over specified parameter values for an estimator. Uses different optimization techniques that require only a limited number of evaluations, avoiding exhaustive grid scanning.
GridOptimalSearchCV implements “fit” and “fit_best_estimator” methods to train models.
Parameters: - estimator (BaseEstimator) – object of a type that implements the “fit” method. A new object of that type is cloned for each point.
- params_generator (AbstractParameterGenerator) – generator of the grid search algorithm
- scorer (object) – an object that implements the method __call__ with kwargs: “base_estimator”, “params”, “X”, “y”, “sample_weight”
- parallel_profile (None or str) – name of the profile for parallel computation
Attributes:
generator: return grid parameter generator
fit(X, y, sample_weight=None)[source]¶
Run fit with all sets of parameters.
Parameters: - X – array-like, shape = [n_samples, n_features] Training vector, where n_samples is the number of samples and n_features is the number of features.
- y – array-like, shape = [n_samples] or [n_samples, n_output], optional
- sample_weight – array-like, shape = [n_samples], weight
fit_best_estimator(X, y, sample_weight=None)[source]¶
Train the estimator with the best parameters.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- y – labels of events - array-like of shape [n_samples]
- sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
Returns: the best estimator
generator¶
Property for params_generator.
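Example: a minimal usage sketch. It reuses the iris data prepared in the Factory example above, together with RandomParameterOptimizer and FoldingScorer (both described below) and the RocAuc metric from rep.report.metrics; the grid values and the imports from the rep.metaml namespace are assumptions, not part of the API description above.
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from rep.estimators import SklearnClassifier
>>> from rep.report.metrics import RocAuc
>>> from rep.metaml import GridOptimalSearchCV, RandomParameterOptimizer, FoldingScorer
>>> # grid over two GradientBoosting parameters (illustrative values)
>>> param_grid = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.05, 0.1]}
>>> generator = RandomParameterOptimizer(param_grid, n_evaluations=8)
>>> # k-folding scorer: quality is averaged over 2 checks out of 3 folds
>>> scorer = FoldingScorer(RocAuc(), folds=3, fold_checks=2)
>>> grid = GridOptimalSearchCV(SklearnClassifier(GradientBoostingClassifier()), generator, scorer)
>>> grid.fit(train_data, train_labels)
>>> print(grid.generator.best_params_)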
Folding Scorer¶
Folding cross validation can be used in grid search optimization.
class rep.metaml.gridsearch.ClassificationFoldingScorer(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]¶
Bases: rep.metaml.gridsearch.FoldingScorerBase
Scorer which implements the logic of data folding and scoring for classification models. This is a function-like object.
Parameters: - folds (int) – ‘k’ used in k-folding while validating
- fold_checks (int) – not greater than folds, the number of checks we do by cross-validating
- score_function (function) – quality metric; if fold_checks > 1, the average over the checks is computed.
Example:
>>> def new_score_function(y_true, proba, sample_weight=None):
...     '''
...     y_true: [n_samples]
...     proba: [n_samples, n_classes]
...     sample_weight: [n_samples] or None
...     '''
...     ...
>>>
>>> f_scorer = FoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5
class rep.metaml.gridsearch.RegressionFoldingScorer(score_function, folds=3, fold_checks=1, shuffle=False, random_state=None)[source]¶
Bases: rep.metaml.gridsearch.FoldingScorerBase
Scorer which implements the logic of data folding and scoring for regression models. This is a function-like object.
Parameters: - folds (int) – ‘k’ used in k-folding while validating
- fold_checks (int) – not greater than folds, the number of checks we do by cross-validating
- score_function (function) – quality metric; if fold_checks > 1, the average over the checks is computed.
Example:
>>> def new_score_function(y_true, pred, sample_weight=None):
...     '''
...     y_true: [n_samples]
...     pred: [n_samples]
...     sample_weight: [n_samples] or None
...     '''
...     ...
>>>
>>> f_scorer = RegressionFoldingScorer(new_score_function)
>>> f_scorer(base_estimator, params, X, y, sample_weight=None)
0.5
Available optimization algorithms¶
class rep.metaml.gridsearch.RandomParameterOptimizer(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]¶
Bases: rep.metaml.gridsearch.AbstractParameterGenerator
Works in the same way as sklearn.grid_search.RandomizedSearchCV. Each next point is generated independently.
Parameters: - param_grid (dict) – distributions used to sample each parameter: name -> list of possible values (sampled uniformly from the options) or name -> distribution (should implement ‘.rvs()’, as scipy distributions do)
- maximize (bool) – ignored parameter, added for uniformity
NB: this is the only optimizer which supports passing distributions for parameters.
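A short sketch of such a param_grid, mixing a plain list of values with a scipy distribution (the parameter names are illustrative, chosen for a gradient boosting model):
>>> from scipy.stats import uniform
>>> from rep.metaml.gridsearch import RandomParameterOptimizer
>>> # 'n_estimators' is sampled uniformly from the list of options,
>>> # 'learning_rate' is drawn from a scipy distribution via .rvs()
>>> param_grid = {'n_estimators': [50, 100, 200, 400],
...               'learning_rate': uniform(loc=0.01, scale=0.19)}
>>> generator = RandomParameterOptimizer(param_grid, n_evaluations=10)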
class rep.metaml.gridsearch.AnnealingParameterOptimizer(param_grid, n_evaluations=10, temperature=0.2, random_state=None, maximize=True)[source]¶
Bases: rep.metaml.gridsearch.AbstractParameterGenerator
Implementation of the annealing algorithm.
Parameters: - param_grid – the grid with parameters to optimize on
- n_evaluations (int) – the number of evaluations
- temperature (float) – how tolerant we are to worse results. If the temperature is very small, the algorithm never steps to a point with worse predictions.
Doesn’t support parallel execution, so it cannot be used for optimization on a cluster.
class rep.metaml.gridsearch.SubgridParameterOptimizer(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, subgrid_size=3, maximize=True)[source]¶
Bases: rep.metaml.gridsearch.AbstractParameterGenerator
Uses Metropolis-like optimization. If the parameter grid is large, optimization is first performed on a subgrid.
Parameters: - param_grid (OrderedDict) – the grid with parameters to optimize on
- n_evaluations (int) – the number of evaluations to do
- random_state (int or RandomState or None) – random generator
- start_evaluations (int) – number of random points generated at the start
- subgrid_size (int) – if the size of the mesh is too large, we first optimize on a subgrid with no more than subgrid_size possible values for each parameter.
class rep.metaml.gridsearch.RegressionParameterOptimizer(param_grid, n_evaluations=10, random_state=None, start_evaluations=3, n_attempts=10, regressor=None, maximize=True)[source]¶
Bases: rep.metaml.gridsearch.AbstractParameterGenerator
This general method relies on regression: the regressor tries to predict the best point based on the results already known for different parameters.
Parameters: - param_grid (OrderedDict) – the grid with parameters to optimize on
- n_evaluations (int) – the number of evaluations to do
- random_state (int or RandomState or None) – random generator
- start_evaluations (int) – number of random points generated at the start
- n_attempts (int) – this number of points is compared on each iteration; the regressor chooses the optimal one among them.
- regressor – regressor used to choose the next point with the potentially best score (the score is estimated by the regressor); if None, a RandomForest algorithm is used.
Interface of parameter optimizer¶
Each of the parameter optimizers has the following interface.
class rep.metaml.gridsearch.AbstractParameterGenerator(param_grid, n_evaluations=10, maximize=True, random_state=None)[source]¶
Bases: object
Abstract class for a grid search algorithm. The aim of this class is to generate new points at which the function (estimator) will be evaluated. You can define your own algorithm for stepping through the parameter grid.
Parameters: - param_grid (OrderedDict) – the grid with parameters to optimize on
- n_evaluations (int) – the number of evaluations to do
- random_state (int or RandomState or None) – random generator
- maximize – whether algorithm should maximize or minimize target function.
add_result(state_indices, value)[source]¶
After the model has been trained and evaluated for a specific set of parameters, this function is used to store the result.
Parameters: - state_indices – tuple which represents the point in the parameter grid
- value – quality at this point
best_params_¶
Property, returns the point of the parameter grid with the best score.
best_score_¶
Property, returns the best score of the optimization.
generate_batch_points(size)[source]¶
Generate several points in the parameter space at once (needed when using parallel computations).
Parameters: - size – how many points to generate
Returns: tuple of arrays (state_indices, state_parameters)
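The sketch below shows how an optimization loop can interact with a generator using only the methods documented above; evaluate_quality is a hypothetical placeholder for training and scoring a model with the given parameters.
>>> # ask for a batch of candidate points (useful for parallel evaluation)
>>> state_indices, state_parameters = generator.generate_batch_points(size=4)
>>> for indices, params in zip(state_indices, state_parameters):
...     quality = evaluate_quality(params)  # hypothetical: train a model with params and score it
...     generator.add_result(indices, quality)
>>> print(generator.best_params_, generator.best_score_)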
Folding¶
FoldingClassifier and FoldingRegressor provide an easy way to run k-folding cross-validation. They are also a nice way to combine the predictions of trained classifiers.
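A minimal sketch of the typical workflow (it reuses the iris data and the numpy import from the Factory example above; the base estimator is illustrative):
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from rep.metaml import FoldingClassifier
>>> folder = FoldingClassifier(GradientBoostingClassifier(), n_folds=3, random_state=11)
>>> folder.fit(train_data, train_labels)
>>> # passing the same training data returns out-of-fold (unbiased) predictions
>>> train_proba = folder.predict_proba(train_data)
>>> # on new data, combine the fold classifiers explicitly with a voting function
>>> test_proba = folder.predict_proba(test_data, vote_function=lambda x: numpy.mean(x, axis=0))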
class rep.metaml.folding.FoldingClassifier(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]¶
Bases: rep.metaml.folding.FoldingBase, rep.estimators.interface.Classifier
This meta-classifier implements the folding algorithm:
- split training data into n equal parts;
- train n classifiers, each one is trained using n-1 folds
To get unbiased predictions for the data, pass the same dataset (with the same order of events) as in training to the prediction methods; in this case each event is predicted by the base classifier that did not use that event during training.
To use information from several estimators (rather than one) during prediction, provide an appropriate voting function. Examples of voting functions:
>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters: - base_estimator (sklearn.BaseEstimator) – base classifier, which will be used for training
- n_folds (int) – count of folds
- features (None or list[str]) – features used in training
- parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
- random_state (None or int or RandomState) – random state for reproducibility
feature_importances_¶
Sklearn way of returning feature importances. This is returned as a numpy.array, assuming that train_features=None was initially passed.
fit(X, y, sample_weight=None)¶
Train the model; several base classifiers are trained on overlapping subsets of the training dataset.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- y – labels of events - array-like of shape [n_samples]
- sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
get_feature_importances()¶
Get feature importances.
Return type: pandas.DataFrame with column effect and index=features
predict(X, vote_function=None)¶
Predict labels. To get unbiased predictions on the training dataset, pass the training data (with the same order of events) and vote_function=None.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type: numpy.array of shape [n_samples]
predict_proba(X, vote_function=None)¶
Predict probabilities. To get unbiased predictions on the training dataset, pass the training data (with the same order of events) and vote_function=None.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type: numpy.array of shape [n_samples, n_classes]
staged_predict_proba(X, vote_function=None)¶
Predict probabilities after each stage of base_estimator. To get unbiased predictions on the training dataset, pass the training data (with the same order of events) and vote_function=None.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- vote_function (None or function) – function to combine prediction of folds’ estimators. If None then folding scheme is used.
Return type: sequence of numpy.arrays of shape [n_samples, n_classes]
class rep.metaml.folding.FoldingRegressor(base_estimator, n_folds=2, random_state=None, features=None, parallel_profile=None)[source]¶
Bases: rep.metaml.folding.FoldingBase, rep.estimators.interface.Regressor
This meta-regressor implements the folding algorithm:
- split training data into n equal parts;
- train n regressors, each one is trained using n-1 folds
To get unbiased predictions for the data, pass the same dataset (with the same order of events) as in training to the prediction methods; in this case each event is predicted by the base regressor that did not use that event during training.
To use information from several estimators (rather than one) during prediction, provide an appropriate voting function. Examples of voting functions:
>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)
Parameters: - base_estimator (sklearn.BaseEstimator) – base regressor, which will be used for training
- n_folds (int) – count of folds
- features (None or list[str]) – features used in training
- parallel_profile (None or str) – profile for IPython cluster, None to compute locally.
- random_state (None or int or RandomState) – random state for reproducibility
feature_importances_¶
Sklearn way of returning feature importances. This is returned as a numpy.array, assuming that train_features=None was initially passed.
fit(X, y, sample_weight=None)¶
Train the model; several base regressors are trained on overlapping subsets of the training dataset.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- y – labels of events - array-like of shape [n_samples]
- sample_weight – weight of events, array-like of shape [n_samples] or None if all weights are equal
get_feature_importances()¶
Get feature importances.
Return type: pandas.DataFrame with column effect and index=features
predict(X, vote_function=None)¶
Get predictions. To get unbiased predictions on the training dataset, pass the training data (with the same order of events) and vote_function=None.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- vote_function (None or function) – function used to combine the predictions of the folds’ estimators (takes a numpy.ndarray of shape [n_classifiers, n_samples]). If None, the folding scheme is used.
Return type: numpy.array of shape [n_samples, n_outputs]
staged_predict(X, vote_function=None)¶
Get predictions after each iteration of the base estimator. To get unbiased predictions on the training dataset, pass the training data (with the same order of events) and vote_function=None.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- vote_function (None or function) – function used to combine the predictions of the folds’ estimators (takes a numpy.ndarray of shape [n_classifiers, n_samples]). If None, the folding scheme is used.
Return type: sequence of numpy.array of shape [n_samples, n_outputs]
Cache¶
In many cases training a classification/regression model takes hours. To avoid retraining at each step, one can store the trained classifier in a file and later load the trained model.
However, in this case the user has to track manually whether something in the pipeline has changed (for instance, the train/test splitting).
Cache estimators are a lazy way to store a trained model. After training, the classifier/regressor is stored in a file under a specific name (which was passed in the constructor).
On the next runs the following conditions are checked:
- the model has the same name
- the trained model has exactly the same parameters
- the model is trained using exactly the same data
- the stored copy is not too old (10 days by default)
If all the conditions are satisfied, the stored copy is loaded; otherwise the classifier/regressor is fitted.
Example of usage¶
CacheClassifier and CacheRegressor work as meta-estimators:
>>> from rep.estimators import XGBoostClassifier
>>> from rep.metaml import FoldingClassifier
>>> from rep.metaml.cache import CacheClassifier
>>> clf = CacheClassifier('xgboost folding', FoldingClassifier(XGBoostClassifier(), n_folds=3))
>>> # this works normally
>>> clf.fit(X, y, sample_weight)
>>> clf.predict_proba(testX)
However in the following situation:
>>> clf = FoldingClassifier(CacheClassifier('xgboost', XGBoostClassifier()))
the cache is not going to work, because a copy of the classifier is created for each fold. Each time the cache is checked, a version with the same parameters but different data will be found, so the stored copy will be erased and a new one saved every time.
By default, the cache is stored in the ‘.cache/rep’ subfolder of the project directory (where the IPython notebook is placed). To change the caching parameters, use:
>>> import rep.metaml.cache
>>> from rep.metaml._cache import CacheHelper
>>> rep.metaml.cache.cache_helper = CacheHelper(folder, expiration_in_seconds)
>>> # to delete all cached items, use:
>>> rep.metaml.cache.cache_helper.clear_cache()
class rep.metaml.cache.CacheClassifier(name, clf, features=None)[source]¶
Bases: rep.metaml.cache.CacheBase, rep.estimators.sklearn.SklearnClassifier
Cache classifier allows saving trained models in a lazy way. Useful when training the classifier takes a long time.
On the next run, the model stored in the cache will be used instead of fitting again.
Parameters: - name – unique name of classifier (to be used in storing)
- clf (sklearn.BaseEstimator) – your estimator, which will be used for training
- features – features to use in training.
class rep.metaml.cache.CacheRegressor(name, clf, features=None)[source]¶
Bases: rep.metaml.cache.CacheBase, rep.estimators.sklearn.SklearnRegressor
Cache regressor allows saving trained models in a lazy way. Useful when training the regressor takes a long time.
On the next run, the model stored in the cache will be used instead of fitting again.
Parameters: - name – unique name of the regressor (to be used in storing)
- clf (sklearn.BaseEstimator) – your estimator, which will be used for training
- features – features to use in training.
Stacking¶
FeatureSplitter is defined in this module. This meta-algorithm is handy for training different models on subsets of the data without manually splitting the data into parts.
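A minimal sketch, assuming a DataFrame data_with_group that contains a categorical column 'group' used for splitting (the column name, base estimator and the import from rep.metaml are illustrative assumptions):
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from rep.estimators import SklearnClassifier
>>> from rep.metaml import FeatureSplitter
>>> # one copy of the base classifier is trained per distinct value of the 'group' column
>>> splitter = FeatureSplitter('group', SklearnClassifier(GradientBoostingClassifier()), train_features=['a', 'b', 'c', 'd'])
>>> splitter.fit(data_with_group, labels)
>>> proba = splitter.predict_proba(data_with_group)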
class rep.metaml.stacking.FeatureSplitter(split_feature, base_estimator, train_features=None)[source]¶
Bases: rep.estimators.interface.Classifier
The dataset is split by the values of split_feature; for each value of the feature, a new classifier is trained.
When building predictions, each classifier predicts the events with the same value of split_feature it was trained on.
Parameters: - split_feature (str) – the name of key feature
- base_estimator – the classifier whose copies are trained on parts of the dataset
- train_features (list[str]) – list of columns classifier uses in training
fit(X, y, sample_weight=None)[source]¶
Fit the dataset.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features] with features
- y – array-like of shape [n_samples] with targets
- sample_weight – array-like of shape [n_samples] with events weights or None.
Returns: self
predict_proba(X)[source]¶
Predict probabilities. Each event is predicted by the classifier trained on the corresponding value of split_feature.
Parameters: X – pandas.DataFrame of shape [n_samples, n_features] Returns: probabilities of shape [n_samples, n_classes]
staged_predict_proba(X)[source]¶
Predict probabilities after each stage of the base classifier. Each event is predicted by the classifier trained on the corresponding value of split_feature.
Parameters: X – pandas.DataFrame of shape [n_samples, n_features] Returns: iterable sequence of numpy.arrays of shape [n_samples, n_classes]