Estimators (classification and regression)

This module contains wrappers with a scikit-learn (sklearn) interface for several machine learning libraries:

  • scikit-learn
  • TMVA
  • XGBoost
  • pybrain
  • neurolab
  • theanets

REP defines a common interface for classifier and regressor wrappers, so new wrappers for other libraries can be added following the same interface. Notably, the interface is backward compatible with the scikit-learn library.

Estimator interfaces (for classification and regression)

REP wrappers are derived from Classifier and Regressor depending on the problem of interest.

Below you can see the standard methods available in the wrappers.

class rep.estimators.interface.Classifier(features=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Interface to train different classification models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...

Parameters:features (list[str] or None) – features used to train a model

Note

  • if features aren't set (None), then all features in the training dataset will be used
  • Datasets should be pandas.DataFrame, not numpy.array. This lets you choose the features used in training by setting, e.g., features=['mass', 'momentum'] in the constructor.
  • It works fine with numpy.array as well, but in this case all the features will be used.
  • Class values must be from 0 to n_classes-1!
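A minimal usage sketch of this interface (SklearnClassifier is used here purely as an illustration; the toy DataFrame and labels are made up for the example):

    import numpy, pandas
    from sklearn.ensemble import GradientBoostingClassifier
    from rep.estimators import SklearnClassifier

    # toy dataset: a DataFrame allows selecting training features by name
    data = pandas.DataFrame(numpy.random.normal(size=(100, 3)),
                            columns=['mass', 'momentum', 'charge'])
    labels = numpy.random.randint(0, 2, size=100)  # class values in {0, 1}

    # only 'mass' and 'momentum' are used in training
    clf = SklearnClassifier(GradientBoostingClassifier(), features=['mass', 'momentum'])
    clf.fit(data, labels)
    probabilities = clf.predict_proba(data)  # shape [n_samples, n_classes]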
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

fit_lds(lds)[source]

Train a classifier on the specific type of dataset.

Parameters:lds (LabeledDataStorage) – data
Returns:self
get_feature_importances()[source]

Return feature importances.

Return type:pandas.DataFrame with index=self.features
get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators
Returns:params (mapping of string to any) – parameter names mapped to their values
predict(X)[source]

Predict labels for all samples in the dataset.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with integer labels
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric, since it requires that each label set be correctly predicted for each sample.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – test samples
  • y (array-like of shape [n_samples] or [n_samples, n_outputs]) – true labels for X
  • sample_weight (array-like of shape [n_samples], optional) – sample weights
Returns:

score (float) – mean accuracy of self.predict(X) with respect to y
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:

self

staged_predict_proba(X)[source]

Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator
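For boosting-like models this allows building a learning curve without refitting. A sketch (assuming a fitted classifier clf and held-out test_data, test_labels):

    from sklearn.metrics import roc_auc_score

    # one probability array is yielded per training stage
    stage_aucs = [roc_auc_score(test_labels, proba[:, 1])
                  for proba in clf.staged_predict_proba(test_data)]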
test_on(X, y, sample_weight=None)[source]

Prepare a classification report for a single classifier.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples — array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

ClassificationReport

test_on_lds(lds)[source]

Prepare a classification report for a single classifier.

Parameters:lds (LabeledDataStorage) – data
Returns:ClassificationReport
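A sketch of both report entry points (assuming a fitted classifier clf; the import path for LabeledDataStorage is assumed to be rep.data):

    from rep.data import LabeledDataStorage

    # directly from arrays/DataFrames ...
    report = clf.test_on(test_data, test_labels)

    # ... or from a LabeledDataStorage holding the same data
    lds = LabeledDataStorage(test_data, test_labels)
    report = clf.test_on_lds(lds)  # ClassificationReport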
class rep.estimators.interface.Regressor(features=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Interface to train different regression models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...

Parameters:features (list[str] or None) – features used to train a model

Note

  • if features aren't set (None), then all features in the training dataset will be used
  • Datasets should be pandas.DataFrame, not numpy.array. This lets you choose the features used in training by setting, e.g., features=['mass', 'momentum'] in the constructor.
  • It works fine with numpy.array as well, but in this case all the features will be used.
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

fit_lds(lds)[source]

Train a regression model on the specific type of dataset.

Parameters:lds (LabeledDataStorage) – data
Returns:self
get_feature_importances()[source]

Get feature importances.

Return type:pandas.DataFrame with index=self.features
get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators
Returns:params (mapping of string to any) – parameter names mapped to their values
predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
score(X, y, sample_weight=None)

Return the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – test samples
  • y (array-like of shape [n_samples] or [n_samples, n_outputs]) – true values for X
  • sample_weight (array-like of shape [n_samples], optional) – sample weights
Returns:

score (float) – R^2 of self.predict(X) with respect to y
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:

self

staged_predict(X)[source]

Predict values for data on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator
test_on(X, y, sample_weight=None)[source]

Prepare a regression report for a single regressor.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values of samples — array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

RegressionReport

test_on_lds(lds)[source]

Prepare a regression report for a single regressor.

Parameters:lds (LabeledDataStorage) – data
Returns:RegressionReport

Sklearn classifier and regressor

SklearnClassifier and SklearnRegressor are wrappers for algorithms from scikit-learn.

From the user's perspective, a wrapped sklearn model behaves in the same way as an unwrapped one, but has one additional parameter, features, for choosing the columns to use in training.

Typically, models from REP are used with pandas.DataFrames, which makes it possible to refer to the needed variables by name or to give some variables a specific role in the training.

If the data is a numpy.array, the behaviour is the same as in sklearn. For a complete list of the available algorithms, see the sklearn API.

class rep.estimators.sklearn.SklearnClassifier(clf, features=None)[source]

Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Classifier

SklearnClassifier is a wrapper over sklearn-like classifiers.

Parameters:
  • clf (sklearn.BaseEstimator) – classifier to train. Should be sklearn-compatible.
  • features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]

Train the classifier.

Parameters:
  • X (pandas.DataFrame) – data shape [n_samples, n_features]
  • y – target of training, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

Note

If the sklearn classifier doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.

predict(X)[source]

Predict labels for all samples in the dataset.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with integer labels
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator
class rep.estimators.sklearn.SklearnRegressor(clf, features=None)[source]

Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Regressor

SklearnRegressor is a wrapper over sklearn-like regressors.

Parameters:
  • clf (sklearn.BaseEstimator) – regressor to train. Should be sklearn-compatible.
  • features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]

Train the regressor.

Parameters:
  • X (pandas.DataFrame) – data shape [n_samples, n_features]
  • y – target of training, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

Note

If the sklearn regressor doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Predict values for data on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator

TMVA classifier and regressor

These classes are wrappers for TMVA, a C++ machine learning library used in high energy physics that works with files in the .root format. With these wrappers you can simply use it from Python. TMVA contains classification and regression algorithms, including neural networks. See the TMVA guide for the list of the available algorithms and parameters.

class rep.estimators.tmva.TMVAClassifier(method='kBDT', features=None, factory_options='', sigmoid_function='bdt', **method_parameters)[source]

Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Classifier

Implements classification models from TMVA library: CERN library for machine learning.

Parameters:
  • method (str) – algorithm method (default=’kBDT’)
  • features (list[str] or None) – features used in training
  • factory_options (str) –

    system options, including data transformations before training, for example:

    "!V:!Silent:Color:Transformations=I;D;P;G,D"
    
  • sigmoid_function (str) –

    function which is used to convert TMVA output to probabilities;

    • identity (use for svm, mlp) – do not transform the output; use this value for methods that return class probabilities
    • sigmoid – sigmoid transformation; use it if the output varies in the range [-infinity, +infinity]
    • bdt – for BDT algorithms (the output varies in the range [-1, 1])
    • sig_eff=0.4 – for the rectangular cut optimization methods; here, for instance, 0.4 will be used as the signal efficiency to evaluate the MVA (use any float from [0, 1])
  • method_parameters (dict) – classifier options, for example: NTrees=100, BoostType='Grad'

Warning

TMVA doesn't support staged_predict_proba() and feature_importances_.

TMVA supports only two-class classification, not multiclass classification.

TMVA guide.
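A construction sketch (method parameters such as NTrees and BoostType are passed straight through to TMVA; train_data and train_labels are assumed):

    from rep.estimators import TMVAClassifier

    tmva = TMVAClassifier(method='kBDT', features=['a', 'b'],
                          NTrees=100, Shrinkage=0.1, BoostType='Grad',
                          sigmoid_function='bdt')  # BDT output lies in [-1, 1]
    tmva.fit(train_data, train_labels)
    proba = tmva.predict_proba(test_data)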

fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

get_params(deep=True)[source]

Get parameters for this estimator.

Returns:dict, parameter names mapped to their values.
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
set_params(**params)[source]

Set the parameters of this estimator.

Parameters:params (dict) – parameters to set in the model
staged_predict_proba(X)[source]

Warning

This function is not supported for the TMVA library (AttributeError will be thrown)

class rep.estimators.tmva.TMVARegressor(method='kBDT', features=None, factory_options='', **method_parameters)[source]

Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Regressor

Implements regression models from TMVA library: CERN library for machine learning.

Parameters:
  • method (str) – algorithm method (default=’kBDT’)
  • features (list[str] or None) – features used in training
  • factory_options (str) –

    system options, including data transformations before training, for example:

    "!V:!Silent:Color:Transformations=I;D;P;G,D"
    
  • method_parameters (dict) – regressor options, for example: NTrees=100, BoostType='Grad'

Warning

TMVA doesn't support staged_predict() and feature_importances_.

TMVA guide

fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

get_params(deep=True)[source]

Get parameters for this estimator.

Returns:dict, parameter names mapped to their values.
predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
set_params(**params)[source]

Set the parameters of this estimator.

Parameters:params (dict) – parameters to set in the model
staged_predict(X)[source]

Warning

This function is not supported for the TMVA library (AttributeError will be thrown)

XGBoost classifier and regressor

These classes are wrappers for XGBoost library.

class rep.estimators.xgboost.XGBoostBase(n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Bases: object

A base class for the XGBoostClassifier and XGBoostRegressor. XGBoost tree booster is used.

Parameters:
  • n_estimators (int) – number of trees built.
  • nthreads (int) – number of parallel threads used to run XGBoost.
  • num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
  • gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
  • eta (float) – (also known as the learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth (int) – maximum depth of a tree.
  • scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
  • min_child_weight (float) –

    minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.

    Note

weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.

  • subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which prevents overfitting.
  • colsample (float) – subsample ratio of columns when constructing each tree.
  • base_score (float) – the initial prediction score of all instances, global bias.
  • random_state (None or int or RandomState) – state for a pseudo random generator
  • verbose (bool) – if 1, messages will be printed during training
  • missing (float) – the number considered by XGBoost as missing value.
feature_importances_

Sklearn-style feature importances, returned as a numpy.array (assuming that features=None was passed initially).

get_feature_importances()[source]

Get feature importances.

Return type:pandas.DataFrame with index=self.features
class rep.estimators.xgboost.XGBoostClassifier(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Bases: rep.estimators.xgboost.XGBoostBase, rep.estimators.interface.Classifier

Implements a classification model from the XGBoost library. The XGBoost tree booster is used.

Parameters:
  • n_estimators (int) – number of trees built.
  • nthreads (int) – number of parallel threads used to run XGBoost.
  • num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
  • gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
  • eta (float) – (also known as the learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth (int) – maximum depth of a tree.
  • scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
  • min_child_weight (float) –

    minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.

    Note

weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.

  • subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which prevents overfitting.
  • colsample (float) – subsample ratio of columns when constructing each tree.
  • base_score (float) – the initial prediction score of all instances, global bias.
  • random_state (None or int or RandomState) – state for a pseudo random generator
  • verbose (bool) – if 1, messages will be printed during training
  • missing (float) – the number considered by XGBoost as missing value.
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X, step=None)[source]

Predict probabilities for data for each class label on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (None by default). XGBoost does not implement staged predictions natively, so REP has to predict from the beginning each time. When None is passed, the step is chosen so that the learning curve has 10 points.
Returns:

iterator
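Since every call restarts prediction from the first tree, a larger step is cheaper. A sketch (assuming a fitted classifier xgb and held-out test data):

    from sklearn.metrics import roc_auc_score

    # evaluate quality every 20 trees
    for proba in xgb.staged_predict_proba(test_data, step=20):
        print(roc_auc_score(test_labels, proba[:, 1]))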

class rep.estimators.xgboost.XGBoostRegressor(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, min_child_weight=1.0, subsample=1.0, colsample=1.0, objective_type='linear', base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Bases: rep.estimators.xgboost.XGBoostBase, rep.estimators.interface.Regressor

Implements a regression model from the XGBoost library. The XGBoost tree booster is used.

Parameters:
  • n_estimators (int) – number of trees built.
  • nthreads (int) – number of parallel threads used to run XGBoost.
  • num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
  • gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
  • eta (float) – (also known as the learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth (int) – maximum depth of a tree.
  • scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
  • min_child_weight (float) –

    minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.

    Note

weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.

  • subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which prevents overfitting.
  • colsample (float) – subsample ratio of columns when constructing each tree.
  • base_score (float) – the initial prediction score of all instances, global bias.
  • random_state (None or int or RandomState) – state for a pseudo random generator
  • verbose (bool) – if 1, messages will be printed during training
  • missing (float) – the number considered by XGBoost as missing value.
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X, step=None)[source]

Predict values for data on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (None by default). XGBoost does not implement staged predictions natively, so REP has to predict from the beginning each time. When None is passed, the step is chosen so that the learning curve has 10 points.
Returns:

iterator

Theanets classifier and regressor

These classes are wrappers for the theanets library, a neural network Python library.

class rep.estimators.theanets.TheanetsBase(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Bases: object

A base class for the estimators from Theanets library.

Parameters:
  • features (None or list(str)) – list of features to train model
  • layers (sequence of int, tuple, dict) – a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
  • input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
  • output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
  • hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
  • output_activation (str) – the name of an activation function to use on the output layer by default
  • input_noise (float) – standard deviation of desired noise to inject into input
  • hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
  • input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
  • hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
  • decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • trainers (list[dict] or None) –

    parameters to specify training algorithm(s), for example:

    trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]

  • random_state (None or int or RandomState) – state for a pseudo random generator

For more information on the available trainers and their parameters see this page.

fit(X, y, sample_weight=None)[source]

Train a classification/regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
  • keep_trainer (bool) – if True, the trainer is added to the list of trainers (self.trainers) kept by the estimator
  • trainer (dict) – parameters of the training algorithm we want to use now
Returns:

self

set_params(**params)[source]

Set the parameters of the estimator. Deep parameters of trainers and scaler can be accessed, for instance:

trainers__0 = {'algo': 'sgd', 'learning_rate': 0.3}
trainers__0_algo = 'sgd'
layers__1 = 14
scaler__use_std = True
Parameters:params (dict) – parameters to set in the model
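A sketch of how these deep parameters are used in practice (a hypothetical configuration):

    from rep.estimators import TheanetsClassifier

    clf = TheanetsClassifier(layers=[10, 10],
                             trainers=[{'algo': 'sgd', 'learning_rate': 0.3}])
    clf.set_params(trainers__0={'algo': 'rmsprop'})  # replace the whole first trainer
    clf.set_params(trainers__0_algo='adadelta')      # or only its 'algo' entry
    clf.set_params(layers__1=14)                     # resize the second layer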
class rep.estimators.theanets.TheanetsClassifier(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Bases: rep.estimators.theanets.TheanetsBase, rep.estimators.interface.Classifier

Implements a classification model from the Theanets library.

Parameters:
  • features (None or list(str)) – list of features to train model
  • layers (sequence of int, tuple, dict) –

    a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.

  • input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
  • output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
  • hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
  • output_activation (str) – the name of an activation function to use on the output layer by default
  • input_noise (float) – standard deviation of desired noise to inject into input
  • hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
  • input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
  • hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
  • decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • trainers (list[dict] or None) –

    parameters to specify training algorithm(s), for example:

    trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]

  • random_state (None or int or RandomState) – state for a pseudo random generator

For more information on the available trainers and their parameters see this page.

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
  • keep_trainer (bool) – if True, the trainer is added to the list of trainers (self.trainers) kept by the estimator
  • trainer (dict) – parameters of the training algorithm we want to use now
Returns:

self
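This makes staged training with different algorithms possible. A sketch (train_data and train_labels are assumed; the extra keyword arguments are the trainer parameters):

    from rep.estimators import TheanetsClassifier

    clf = TheanetsClassifier(layers=[20], trainers=[{'algo': 'nag', 'learning_rate': 0.1}])
    clf.fit(train_data, train_labels)  # first stage: 'nag'
    # refine the same network with a second algorithm, without restarting
    clf.partial_fit(train_data, train_labels, algo='sgd', learning_rate=0.01)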

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This function is not supported by theanets (NotImplementedError will be thrown)

class rep.estimators.theanets.TheanetsRegressor(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Bases: rep.estimators.theanets.TheanetsBase, rep.estimators.interface.Regressor

Implements a regression model from the Theanets library.

Parameters:
  • features (None or list(str)) – list of features to train model
  • layers (sequence of int, tuple, dict) –

    a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.

  • input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
  • output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
  • hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
  • output_activation (str) – the name of an activation function to use on the output layer by default
  • input_noise (float) – standard deviation of desired noise to inject into input
  • hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
  • input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
  • hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
  • decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • trainers (list[dict] or None) –

    parameters to specify training algorithm(s), for example:

    trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]

  • random_state (None or int or RandomState) – state for a pseudo random generator

For more information on the available trainers and their parameters see this page.

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
  • keep_trainer (bool) – if True, the trainer is added to the list of trainers (self.trainers) kept by the estimator
  • trainer (dict) – parameters of the training algorithm we want to use now
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This function is not supported by theanets (NotImplementedError will be thrown)

Neurolab classifier and regressor

These classes are wrappers for the Neurolab library — a neural network python library.

Warning

To make neurolab reproducible, we fix the global random seed:

numpy.random.seed(42)
class rep.estimators.neurolab.NeurolabClassifier(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]

Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Classifier

Implements a classification model from the Neurolab library.

Parameters:
  • features (list[str] or None) – features used in training
  • layers (list[int]) – sequence of the numbers of units in each hidden layer.
  • net_type (string) –

    type of the network; possible values are:

    • feed-forward
    • competing-layer
    • learning-vector
    • elman-recurrent
    • hemming-recurrent
  • initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
  • trainf – net training function; the default value depends on the type of network
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • random_state – this parameter is ignored and is added for uniformity.
  • kwargs (dict) – additional arguments to net __init__, varies with different net_types
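A construction sketch (train_data with columns 'a' and 'b' is assumed):

    from rep.estimators import NeurolabClassifier

    # feed-forward net with a single hidden layer of 10 units
    clf = NeurolabClassifier(features=['a', 'b'], layers=[10], net_type='feed-forward')
    clf.fit(train_data, train_labels)
    proba = clf.predict_proba(test_data)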
fit(X, y)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y)[source]

Additional training of the classifier.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
Returns:

self

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This is not supported by Neurolab (AttributeError will be thrown)

class rep.estimators.neurolab.NeurolabRegressor(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]

Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Regressor

Implements a regression model from the Neurolab library.

Parameters:
  • features (list[str] or None) – features used in training
  • layers (list[int]) – sequence of the numbers of units in each hidden layer.
  • net_type (string) –

    type of the network; possible values are:

    • feed-forward
    • competing-layer
    • learning-vector
    • elman-recurrent
    • hemming-recurrent
  • initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
  • trainf – net training function; the default value depends on the type of network
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • random_state – this parameter is ignored and is added for uniformity.
  • kwargs (dict) – additional arguments to net __init__, varies with different net_types
fit(X, y)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y)[source]

Additional training of the regressor.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This is not supported by Neurolab (AttributeError will be thrown)

Pybrain classifier and regressor

These classes are wrappers for the PyBrain library — a neural network python library.

Warning

pybrain training isn't reproducible (training with the same parameters produces a different neural network each time)

class rep.estimators.pybrain.PyBrainBase(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Bases: object

A base class for the estimators from the PyBrain library.

Parameters:
  • features (list[str] or None) – features used in training.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
  • verbose (bool) – print train/validation errors.
  • random_state – this parameter is ignored, since pybrain training is not reproducible

Net parameters:

Parameters:
  • layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
  • hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
  • params (dict) –

    other net parameters:

    • bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True;
    • peepholes (boolean);
    • recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Gradient descent trainer parameters:

Parameters:
  • learningrate (float) – the rate at which parameters are changed in the direction of the gradient
  • lrdecay (float) – the learning rate decay: the learning rate is multiplied by lrdecay after each training step
  • momentum (float) – the ratio by which the gradient of the last time step is used
  • batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
  • weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters:
  • etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
  • etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
  • delta (float) – step width for each weight
  • deltamin (float) – minimum step width (default=1e-6)
  • deltamax (float) – maximum step width (default=5.0)
  • delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters:
  • epochs (int) – number of iterations in training; if < 0, the estimator trains until convergence
  • max_epochs (int) – maximum number of epochs the trainer should train if it is given
  • continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
  • validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

Details about parameters here.

fit(X, y)[source]

Train a classification/regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y)[source]

Additional training of the classification/regression model.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
Returns:

self

set_params(**params)[source]

Set the parameters of the estimator.

Names of the parameters are the same as in the constructor.

class rep.estimators.pybrain.PyBrainClassifier(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Bases: rep.estimators.pybrain.PyBrainBase, rep.estimators.interface.Classifier

Implements a classification model from the PyBrain library.

Parameters:
  • features (list[str] or None) – features used in training.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
  • verbose (bool) – print train/validation errors.
  • random_state – this parameter is ignored, since pybrain training is not reproducible

Net parameters:

Parameters:
  • layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
  • hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
  • params (dict) –

    other net parameters:

    • bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True;
    • peepholes (boolean);
    • recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Gradient descent trainer parameters:

Parameters:
  • learningrate (float) – the rate at which parameters are changed in the direction of the gradient
  • lrdecay (float) – the learning rate decay: the learning rate is multiplied by lrdecay after each training step
  • momentum (float) – the ratio by which the gradient of the last time step is used
  • batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
  • weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters:
  • etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
  • etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
  • delta (float) – step width for each weight
  • deltamin (float) – minimum step width (default=1e-6)
  • deltamax (float) – maximum step width (default=5.0)
  • delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters:
  • epochs (int) – number of iterations in training; if < 0, the estimator trains until convergence
  • max_epochs (int) – maximum number of epochs the trainer should train if it is given
  • continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
  • validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

Details about parameters here.
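A construction sketch (train_data with columns 'a' and 'b' is assumed):

    from rep.estimators import PyBrainClassifier

    # two hidden layers; use_rprop=True would switch from gradient descent to Rprop
    clf = PyBrainClassifier(features=['a', 'b'], layers=[10, 5],
                            epochs=10, learningrate=0.01, use_rprop=False)
    clf.fit(train_data, train_labels)
    proba = clf.predict_proba(test_data)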

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This function is not supported for PyBrain (AttributeError will be thrown).

class rep.estimators.pybrain.PyBrainRegressor(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Bases: rep.estimators.pybrain.PyBrainBase, rep.estimators.interface.Regressor

Implements a regression model from the PyBrain library.

Parameters:
  • features (list[str] or None) – features used in training.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
  • verbose (bool) – print train/validation errors.
  • random_state – this parameter is ignored, since pybrain training is not reproducible

Net parameters:

Parameters:
  • layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
  • hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
  • params (dict) –

    other net parameters:

    • bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True;
    • peepholes (boolean);
    • recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Gradient descent trainer parameters:

Parameters:
  • learningrate (float) – the rate at which parameters are changed in the direction of the gradient
  • lrdecay (float) – the learning rate decay: the learning rate is multiplied by lrdecay after each training step
  • momentum (float) – the ratio by which the gradient of the last time step is used
  • batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
  • weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters:
  • etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
  • etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
  • delta (float) – step width for each weight
  • deltamin (float) – minimum step width (default=1e-6)
  • deltamax (float) – maximum step width (default=5.0)
  • delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters:
  • epochs (int) – number of iterations in training; if < 0, the estimator trains until convergence
  • max_epochs (int) – maximum number of epochs the trainer should train if it is given
  • continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
  • validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

Details about parameters here.

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This function is not supported for PyBrain (AttributeError will be thrown).

MatrixNet classifier and regressor

MatrixNetClassifier and MatrixNetRegressor are wrappers for the MatrixNet web service, a proprietary BDT developed at Yandex. Think of it as a specific boosted decision tree algorithm that is available as a service. At the moment MatrixNet is available only for CERN users.

To use MatrixNet, first acquire a token:
  • Go to https://yandex-apps.cern.ch/ (login with your CERN-account)

  • Click Add token at the left panel

  • Choose service MatrixNet and click Create token

  • Create ~/.rep-matrixnet.config.json file with the following content (custom path to the config file can be specified when creating a wrapper object):

    {
        "url": "https://ml.cern.yandex.net/v1",
    
        "token": "<your_token>"
    }
    
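With the config file in place, a construction sketch (train_data and train_labels are assumed; training itself runs on the remote server):

    from rep.estimators import MatrixNetClassifier

    # the token is read from ~/.rep-matrixnet.config.json by default
    mn = MatrixNetClassifier(features=['a', 'b'], iterations=100, regularization=0.01)
    mn.fit(train_data, train_labels)
    proba = mn.predict_proba(test_data)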
class rep.estimators.matrixnet.MatrixNetBase(api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

Bases: object

Base class for MatrixNetClassifier and MatrixNetRegressor.

This is a wrapper around MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters:
  • features (list[str] or None) – features used in training
  • api_config_file (str) –

    path to the file with remote api configuration in the json format:

    {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
    
  • iterations (int) – number of constructed trees (default=100)
  • regularization (float) – regularization number (default=0.01)
  • intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
  • max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
  • features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
  • training_fraction (float) – training rows bagging (default=0.5)
  • auto_stop (None or float) – error value for training pre-stopping
  • sync (bool) – synchronous or asynchronous training on the server
  • random_state (None or int or RandomState) – state for a pseudo random generator
feature_importances_

Sklearn-style feature importances, returned as a numpy.array; the 'effect' column of the MatrixNet importances is used.

get_feature_importances()[source]

Get feature importances: effect, efficiency and information characteristics.

Return type:pandas.DataFrame with index=self.features
get_iterations()[source]

Return the number of trees constructed so far during training.

Returns:int or None
resubmit()[source]

Resubmit the training process on the server in case of a failed job.

synchronize()[source]

Synchronize asynchronous training: wait until the training process finishes on the server.

training_status()[source]

Check whether training has finished on the server.

Return type:bool
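A sketch of the asynchronous workflow these methods enable (MatrixNetClassifier stands in for either estimator; train_data and train_labels are assumed):

    mn = MatrixNetClassifier(sync=False)  # fit() submits the job and returns
    mn.fit(train_data, train_labels)
    if not mn.training_status():          # True once the server has finished
        mn.synchronize()                  # block until training completes
    print(mn.get_iterations())            # number of trees constructed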
class rep.estimators.matrixnet.MatrixNetClassifier(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

Bases: rep.estimators.matrixnet.MatrixNetBase, rep.estimators.interface.Classifier

MatrixNet classification model.

This is a wrapper around MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters:
  • features (list[str] or None) – features used in training
  • api_config_file (str) –

    path to the file with remote api configuration in the json format:

    {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
    
  • iterations (int) – number of constructed trees (default=100)
  • regularization (float) – regularization number (default=0.01)
  • intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
  • max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
  • features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
  • training_fraction (float) – training rows bagging (default=0.5)
  • auto_stop (None or float) – error value for training pre-stopping
  • sync (bool) – synchronous or asynchronous training on the server
  • random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X, step=10)[source]

Predict probabilities for data for each class label on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (10 by default).
Returns:

iterator

class rep.estimators.matrixnet.MatrixNetRegressor(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

Bases: rep.estimators.matrixnet.MatrixNetBase, rep.estimators.interface.Regressor

MatrixNet for regression model.

This is a wrapper around MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters:
  • features (list[str] or None) – features used in training
  • api_config_file (str) –

    path to the file with remote api configuration in the json format:

    {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
    
  • iterations (int) – number of constructed trees (default=100)
  • regularization (float) – regularization number (default=0.01)
  • intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
  • max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
  • features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
  • training_fraction (float) – training rows bagging (default=0.5)
  • auto_stop (None or float) – error value for training pre-stopping
  • sync (bool) – synchronous or asynchronous training on the server
  • random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X, step=10)[source]

Predict values for data on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (10 by default).
Returns:

iterator

Examples

Classification

  • Prepare dataset
    >>> from sklearn import datasets
    >>> import pandas, numpy
    >>> from rep.utils import train_test_split
    >>> from sklearn.metrics import roc_auc_score
    >>> # iris data
    >>> iris = datasets.load_iris()
    >>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
    >>> labels = iris.target
    >>> # Take just two classes instead of three
    >>> data = data[labels != 2]
    >>> labels = labels[labels != 2]
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Sklearn classification
    >>> from rep.estimators import SklearnClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> # Using gradient boosting with default settings
    >>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
    >>> # Training classifier
    >>> sk.fit(train_data, train_labels)
    >>> pred = sk.predict_proba(test_data)
    >>> print pred
    [[  9.99842983e-01   1.57016893e-04]
     [  1.45163843e-04   9.99854836e-01]
     [  9.99842983e-01   1.57016893e-04]
     [  9.99827693e-01   1.72306607e-04], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99768518518518523
    
  • TMVA classification
    >>> from rep.estimators import TMVAClassifier
    >>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
    >>> tmva.fit(train_data, train_labels)
    >>> pred = tmva.predict_proba(test_data)
    >>> print pred
    [[  9.99991025e-01   8.97546346e-06]
     [  1.14084636e-04   9.99885915e-01]
     [  9.99991009e-01   8.99060302e-06]
     [  9.99798700e-01   2.01300452e-04], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99999999999999989
    
  • XGBoost classification
    >>> from rep.estimators import XGBoostClassifier
    >>> # XGBoost with default parameters
    >>> xgb = XGBoostClassifier(features=['a', 'b'])
    >>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
    >>> pred = xgb.predict_proba(test_data)
    >>> print pred
    [[ 0.9983651   0.00163494]
     [ 0.00170585  0.99829417]
     [ 0.99845636  0.00154361]
     [ 0.96618336  0.03381656], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99768518518518512
    

Regression

  • Prepare dataset
    >>> from sklearn import datasets
    >>> from sklearn.metrics import mean_squared_error
    >>> from rep.utils import train_test_split
    >>> import pandas, numpy
    >>> # diabetes data
    >>> diabetes = datasets.load_diabetes()
    >>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
    >>> data = pandas.DataFrame(diabetes.data, columns=features)
    >>> labels = diabetes.target
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Sklearn regression
    >>> from rep.estimators import SklearnRegressor
    >>> from sklearn.ensemble import GradientBoostingRegressor
    >>> # Using gradient boosting with default settings
    >>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
    >>> # Training regressor
    >>> sk.fit(train_data, train_labels)
    >>> pred = sk.predict(train_data)
    >>> numpy.sqrt(mean_squared_error(train_labels, pred))
    60.666009962879265
    
  • TMVA regression
    >>> from rep.estimators import TMVARegressor
    >>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
    >>> tmva.fit(train_data, train_labels)
    >>> pred = tmva.predict(test_data)
    >>> numpy.sqrt(mean_squared_error(test_labels, pred))
    73.74191838418254
    
  • XGBoost regression
    >>> from rep.estimators import XGBoostRegressor
    >>> # XGBoost with default parameters
    >>> xgb = XGBoostRegressor(features=features[:8])
    >>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
    >>> pred = xgb.predict(test_data)
    >>> numpy.sqrt(mean_squared_error(test_labels, pred))
    65.557743652940133
    

Compatible libraries

REP can deal with any library that supports the scikit-learn interface.

Examples of compatible libraries: nolearn, skflow, gplearn and hep_ml.
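For instance, a sketch of wrapping an estimator from hep_ml (import paths follow hep_ml's documentation; treat them as assumptions here):

    from rep.estimators import SklearnClassifier
    from hep_ml.losses import LogLossFunction
    from hep_ml.gradientboosting import UGradientBoostingClassifier

    # any sklearn-compatible estimator can be wrapped the same way
    base = UGradientBoostingClassifier(loss=LogLossFunction(), n_estimators=50)
    clf = SklearnClassifier(base, features=['a', 'b'])
    clf.fit(train_data, train_labels)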