# Estimators (classification and regression)

This module contains wrappers with sklearn interface for different machine learning libraries:

• scikit-learn
• TMVA
• XGBoost
• pybrain
• neurolab
• theanets.

REP defines an interface for classifier and regressor wrappers, so wrappers for other libraries can be added by following the same interface. Notably, the interface is backward compatible with the scikit-learn library.

## Estimators interfaces (for classification and regression)

REP wrappers are derived from Classifier and Regressor depending on the problem of interest.

Below you can see the standard methods available in the wrappers.

class rep.estimators.interface.Classifier(features=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Interface to train different classification models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...

Parameters: features (list[str] or None) – features used to train a model

Note

• If features is not set (None), all features in the training dataset are used.
• Datasets should be passed as pandas.DataFrame rather than numpy.array. This lets you choose the features used in training by setting, e.g., features=['mass', 'momentum'] in the constructor.
• numpy.array input works as well, but in this case all the features are used.
• Class values must be integers from 0 to n_classes-1.
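The requirement that class values run from 0 to n_classes-1 can be met by remapping arbitrary labels to consecutive integers before training. The following is a minimal sketch in plain Python; the helper name `encode_labels` is hypothetical and not part of REP.

```python
def encode_labels(labels):
    """Map arbitrary class labels to integers 0..n_classes-1.

    Returns the encoded labels and the sorted list of original classes,
    so that classes[i] is the original label of encoded class i.
    """
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


encoded, classes = encode_labels([-1, 1, 1, -1, 5])
print(encoded)   # [0, 1, 1, 0, 2]
print(classes)   # [-1, 1, 5]
```

Keeping the `classes` list makes it possible to translate predicted integer labels back to the original values.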
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – labels of samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self
fit_lds(lds)[source]

Train a classifier on the specific type of dataset.

Parameters: lds (LabeledDataStorage) – data

Returns: self
get_feature_importances()[source]

Return feature importances.

Return type: pandas.DataFrame with index=self.features
get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators

Returns: params (mapping of string to any) – parameter names mapped to their values
predict(X)[source]

Predict labels for all samples in the dataset.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples] with integer labels
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples, n_classes] with probabilities
score(X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
• X (array-like, shape = (n_samples, n_features)) – test samples
• y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – true labels for X
• sample_weight (array-like, shape = [n_samples], optional) – sample weights

Returns: score (float) – mean accuracy of self.predict(X) with respect to y
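To make the role of sample_weight in the mean accuracy concrete, here is a minimal sketch in plain Python. The function `weighted_accuracy` is an illustrative stand-in, not the code used by scikit-learn or REP.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of the total sample weight whose predicted label is correct."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


print(weighted_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))                 # 0.75
print(weighted_accuracy([0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 2, 1]))   # 0.6
```

In the second call the misclassified sample carries weight 2, so it costs more than a sample of weight 1.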
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self
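The `<component>__<parameter>` naming can be illustrated with a small sketch that groups parameter names by component. This is plain Python written for illustration; `split_nested_params` is a hypothetical helper and the parameter names in the example are assumptions, not a list of what any particular wrapper accepts.

```python
def split_nested_params(params):
    """Group parameters by component using the <component>__<parameter> form.

    Names without a double underscore belong to the estimator itself and
    are collected under the key None.
    """
    grouped = {}
    for name, value in params.items():
        component, sep, sub_name = name.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_name] = value
        else:
            grouped.setdefault(None, {})[name] = value
    return grouped


grouped = split_nested_params(
    {"features": ["mass"], "clf__n_estimators": 50, "clf__max_depth": 3}
)
print(grouped)
# {None: {'features': ['mass']}, 'clf': {'n_estimators': 50, 'max_depth': 3}}
```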

staged_predict_proba(X)[source]

Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: iterator
test_on(X, y, sample_weight=None)[source]

Prepare a classification report for a single classifier.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – labels of samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: ClassificationReport
test_on_lds(lds)[source]

Prepare a classification report for a single classifier.

Parameters: lds (LabeledDataStorage) – data

Returns: ClassificationReport
class rep.estimators.interface.Regressor(features=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Interface to train different regression models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...

Parameters: features (list[str] or None) – features used to train a model

Note

• If features is not set (None), all features in the training dataset are used.
• Datasets should be passed as pandas.DataFrame rather than numpy.array. This lets you choose the features used in training by setting, e.g., features=['mass', 'momentum'] in the constructor.
• numpy.array input works as well, but in this case all the features are used.
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – values for samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self
fit_lds(lds)[source]

Train a regression model on the specific type of dataset.

Parameters: lds (LabeledDataStorage) – data

Returns: self
get_feature_importances()[source]

Get feature importances.

Return type: pandas.DataFrame with index=self.features
get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators

Returns: params (mapping of string to any) – parameter names mapped to their values
predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples] with predicted values
score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.

Parameters:
• X (array-like, shape = (n_samples, n_features)) – test samples
• y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – true values for X
• sample_weight (array-like, shape = [n_samples], optional) – sample weights

Returns: score (float) – R^2 of self.predict(X) with respect to y
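The formula 1 - u/v can be checked by hand with a few lines of plain Python. This is an unweighted sketch written for illustration, not the scikit-learn implementation; the function name `r2_score` is chosen here only for readability.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - u / v."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)                # total sum of squares
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y_true, y_true))                  # 1.0, perfect prediction
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))    # 0.0, constant mean prediction
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))    # -3.0, worse than the mean
```

The three calls cover the cases named in the description: the best score, the constant model, and a negative score.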
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self

staged_predict(X)[source]

Predicts values for data on each stage (i.e. for boosting algorithms).

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: iterator
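Staged prediction returns an iterator with one prediction per boosting stage. The toy sketch below shows the idea for an additive ensemble in plain Python; the function and its `stage_outputs` argument are hypothetical and only mimic the shape of the result, they are not how any REP wrapper is implemented.

```python
def staged_predict(stage_outputs, learning_rate=1.0):
    """Yield the ensemble prediction after each boosting stage.

    stage_outputs: per-stage contributions, one list of values per stage.
    """
    total = [0.0] * len(stage_outputs[0])
    for contribution in stage_outputs:
        total = [t + learning_rate * c for t, c in zip(total, contribution)]
        yield list(total)


# Two samples, three stages (toy numbers).
stages = [[1.0, 2.0], [0.5, -1.0], [0.25, 0.5]]
for i, prediction in enumerate(staged_predict(stages), start=1):
    print(i, prediction)
# 1 [1.0, 2.0]
# 2 [1.5, 1.0]
# 3 [1.75, 1.5]
```

Because the result is an iterator, a learning curve can be computed stage by stage without storing all intermediate predictions.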
test_on(X, y, sample_weight=None)[source]

Prepare a regression report for a single regressor.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – values of samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: RegressionReport
test_on_lds(lds)[source]

Prepare a regression report for a single regressor.

Parameters: lds (LabeledDataStorage) – data

Returns: RegressionReport

## Sklearn classifier and regressor

SklearnClassifier and SklearnRegressor are wrappers for algorithms from scikit-learn.

From the user's perspective, a wrapped sklearn model behaves the same way as a non-wrapped one, but has one additional parameter, features, to choose the columns used in training.

Typically, models from REP are used with pandas.DataFrame, which makes it possible to name the needed variables or give some variables a specific role in the training.

If the data has numpy.array type, the behaviour is the same as in sklearn. For the complete list of available algorithms, see the sklearn API.
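The idea of a wrapper that adds a features parameter can be sketched without any dependencies. In the example below a dict of columns stands in for a pandas.DataFrame, and both classes (`MeanRegressor`, `FeatureSelectingWrapper`) are hypothetical illustrations, not REP or scikit-learn classes.

```python
class MeanRegressor:
    """Toy model: predicts the mean target seen during fit."""

    def fit(self, rows, y):
        self.mean_ = sum(y) / len(y)
        self.n_features_ = len(rows[0])
        return self

    def predict(self, rows):
        return [self.mean_] * len(rows)


class FeatureSelectingWrapper:
    """Delegates to a wrapped model after selecting the named columns."""

    def __init__(self, model, features=None):
        self.model = model
        self.features = features

    def _select(self, table):
        # table is a dict: column name -> list of values (DataFrame stand-in)
        names = self.features if self.features is not None else list(table)
        return [list(row) for row in zip(*(table[name] for name in names))]

    def fit(self, table, y):
        self.model.fit(self._select(table), y)
        return self

    def predict(self, table):
        return self.model.predict(self._select(table))


data = {"mass": [1.0, 2.0, 3.0], "momentum": [0.1, 0.2, 0.3], "event_id": [7, 8, 9]}
wrapped = FeatureSelectingWrapper(MeanRegressor(), features=["mass", "momentum"])
wrapped.fit(data, [10.0, 20.0, 30.0])
print(wrapped.model.n_features_)   # 2, event_id was not passed to the model
print(wrapped.predict(data))       # [20.0, 20.0, 20.0]
```

With features=None the sketch passes every column through, which mirrors the documented behaviour for unnamed features.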

class rep.estimators.sklearn.SklearnClassifier(clf, features=None)[source]

Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Classifier

SklearnClassifier is a wrapper over sklearn-like classifiers.

Parameters:
• clf (sklearn.BaseEstimator) – classifier to train. Should be sklearn-compatible.
• features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]

Train the classifier.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – target of training, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self

Note

If the sklearn classifier doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.
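Whether an estimator accepts sample_weight can be checked before calling fit. The sketch below uses only the standard library; `supports_sample_weight` and the two toy classes are hypothetical names introduced for illustration.

```python
import inspect


def supports_sample_weight(estimator):
    """Return True if estimator.fit declares a sample_weight argument."""
    parameters = inspect.signature(estimator.fit).parameters
    return "sample_weight" in parameters


class WithWeights:
    def fit(self, X, y, sample_weight=None):
        return self


class WithoutWeights:
    def fit(self, X, y):
        return self


print(supports_sample_weight(WithWeights()))      # True
print(supports_sample_weight(WithoutWeights()))   # False
```

This check only inspects the declared signature, so an estimator that takes weights through `**kwargs` would not be detected. scikit-learn offers a similar helper, `sklearn.utils.validation.has_fit_parameter`.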

predict(X)[source]

Predict labels for all samples in the dataset.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples] with integer labels
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: iterator
class rep.estimators.sklearn.SklearnRegressor(clf, features=None)[source]

Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Regressor

SklearnRegressor is a wrapper over sklearn-like regressors.

Parameters:
• clf (sklearn.BaseEstimator) – regressor to train. Should be sklearn-compatible.
• features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]

Train the regressor.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – target of training, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self

Note

If the sklearn regressor doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.

predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Predicts values for data on each stage (i.e. for boosting algorithms).

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: iterator

## TMVA classifier and regressor

These classes are wrappers for TMVA, a C++ machine learning library from the physics community that works with .root format files. The wrappers make it usable directly from Python. TMVA contains classification and regression algorithms, including neural networks. See the TMVA guide for the list of available algorithms and parameters.

class rep.estimators.tmva.TMVAClassifier(method='kBDT', features=None, factory_options='', sigmoid_function='bdt', **method_parameters)[source]

Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Classifier

Implements classification models from the TMVA library, the CERN library for machine learning.

Parameters:
• method (str) – algorithm method (default='kBDT')
• features (list[str] or None) – features used in training
• factory_options (str) – system options, including data transformations before training, for example: "!V:!Silent:Color:Transformations=I;D;P;G,D"
• sigmoid_function (str) – function used to convert TMVA output to probabilities:
  • identity (use for svm, mlp) – do not transform the output; use this value for methods returning class probabilities
  • sigmoid – sigmoid transformation; use it if the output varies in the range [-infinity, +infinity]
  • bdt – for the BDT algorithms, whose output varies in the range [-1, 1]
  • sig_eff=0.4 – for the rectangular cut optimization methods; here, for instance, 0.4 is used as the signal efficiency to evaluate MVA (any float from [0, 1] can be used)
• method_parameters (dict) – classifier options, for example: NTrees=100, BoostType='Grad'
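The sigmoid_function options describe how a raw output is turned into a probability. The sketch below shows one plausible mapping for three of the options in plain Python. The exact formulas are an assumption inferred from the stated output ranges, not taken from the REP source, and the sig_eff option is not covered.

```python
import math


def to_signal_probability(output, sigmoid_function="bdt"):
    """Convert a raw classifier output to a signal probability in [0, 1]."""
    if sigmoid_function == "identity":
        return output                                # already a probability
    if sigmoid_function == "sigmoid":
        return 1.0 / (1.0 + math.exp(-output))       # (-inf, +inf) -> (0, 1)
    if sigmoid_function == "bdt":
        return (output + 1.0) / 2.0                  # assumed linear map [-1, 1] -> [0, 1]
    raise ValueError("unknown sigmoid_function: %r" % sigmoid_function)


print(to_signal_probability(0.5, "bdt"))        # 0.75
print(to_signal_probability(0.0, "sigmoid"))    # 0.5
print(to_signal_probability(0.3, "identity"))   # 0.3
```

For two-class output, the background probability would then be one minus the signal probability.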

Warning

TMVA doesn't support staged_predict_proba() and feature_importances_.

TMVA doesn't support multiclass classification, only two-class classification.

fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – labels of samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self
get_params(deep=True)[source]

Get parameters for this estimator.

Returns: dict, parameter names mapped to their values.
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples, n_classes] with probabilities
set_params(**params)[source]

Set the parameters of this estimator.

Parameters: params (dict) – parameters to set in the model
staged_predict_proba(X)[source]

Warning

This function is not supported for the TMVA library (AttributeError will be thrown)

class rep.estimators.tmva.TMVARegressor(method='kBDT', features=None, factory_options='', **method_parameters)[source]

Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Regressor

Implements regression models from the TMVA library, the CERN library for machine learning.

Parameters:
• method (str) – algorithm method (default='kBDT')
• features (list[str] or None) – features used in training
• factory_options (str) – system options, including data transformations before training, for example: "!V:!Silent:Color:Transformations=I;D;P;G,D"
• method_parameters (dict) – regressor options, for example: NTrees=100, BoostType='Grad'

Warning

TMVA doesn't support staged_predict() and feature_importances_.

TMVA guide

fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – values for samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self
get_params(deep=True)[source]

Get parameters for this estimator.

Returns: dict, parameter names mapped to their values.
predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples] with predicted values
set_params(**params)[source]

Set the parameters of this estimator.

Parameters: params (dict) – parameters to set in the model
staged_predict(X)[source]

Warning

This function is not supported for the TMVA library (AttributeError will be thrown)

## XGBoost classifier and regressor

These classes are wrappers for the XGBoost library.

class rep.estimators.xgboost.XGBoostBase(n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Bases: object

A base class for the XGBoostClassifier and XGBoostRegressor. XGBoost tree booster is used.

Parameters:
• n_estimators (int) – number of trees built.
• nthreads (int) – number of parallel threads used to run XGBoost.
• num_feature (None or int) – feature dimension used in boosting, set to the maximum dimension of the feature (set automatically by XGBoost, no need to be set by the user).
• gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.
• eta (float) – step size shrinkage (learning rate) used in the update to prevent overfitting. After each boosting step, the weights of new features are available directly, and eta shrinks them to make the boosting process more conservative.
• max_depth (int) – maximum depth of a tree.
• scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
• min_child_weight (float) – minimum sum of instance weights (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weights less than min_child_weight, the building process gives up further partitioning. Note that weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.
• subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees, which helps prevent overfitting.
• colsample (float) – subsample ratio of columns when constructing each tree.
• base_score (float) – the initial prediction score of all instances (global bias).
• random_state (None or int or RandomState) – state for a pseudo-random generator.
• verbose (bool) – if 1, messages are printed during training.
• missing (float) – the value treated by XGBoost as missing.
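The note on min_child_weight relies on weights being normalized to mean 1 before fitting. A minimal sketch of that normalization in plain Python follows; `normalize_weights` is a hypothetical helper, not the function used inside the wrapper.

```python
def normalize_weights(sample_weight):
    """Rescale weights so that their mean equals 1."""
    mean = sum(sample_weight) / len(sample_weight)
    return [w / mean for w in sample_weight]


weights = normalize_weights([2.0, 4.0, 6.0])
print(weights)        # [0.5, 1.0, 1.5]
print(sum(weights))   # 3.0, equal to the number of samples
```

After this rescaling the total weight equals the number of samples, which is why a threshold on the sum of weights in a leaf can be read roughly as a threshold on the number of events.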
feature_importances_

Sklearn-style access to feature importances. They are returned as a numpy.array, assuming that train_features=None was passed initially.

get_feature_importances()[source]

Get feature importances.

Return type: pandas.DataFrame with index=self.features
class rep.estimators.xgboost.XGBoostClassifier(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Implements a classification model from the XGBoost library. The XGBoost tree booster is used.

Parameters:
• n_estimators (int) – number of trees built.
• nthreads (int) – number of parallel threads used to run XGBoost.
• num_feature (None or int) – feature dimension used in boosting, set to the maximum dimension of the feature (set automatically by XGBoost, no need to be set by the user).
• gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.
• eta (float) – step size shrinkage (learning rate) used in the update to prevent overfitting. After each boosting step, the weights of new features are available directly, and eta shrinks them to make the boosting process more conservative.
• max_depth (int) – maximum depth of a tree.
• scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
• min_child_weight (float) – minimum sum of instance weights (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weights less than min_child_weight, the building process gives up further partitioning. Note that weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.
• subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees, which helps prevent overfitting.
• colsample (float) – subsample ratio of columns when constructing each tree.
• base_score (float) – the initial prediction score of all instances (global bias).
• random_state (None or int or RandomState) – state for a pseudo-random generator.
• verbose (bool) – if 1, messages are printed during training.
• missing (float) – the value treated by XGBoost as missing.
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – labels of samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X, step=None)[source]

Predict probabilities for data for each class label on each stage.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• step (int) – step for returned iterations (None by default). XGBoost does not implement staged prediction, so predictions are recomputed from the beginning each time. If None is passed, the step is chosen to give 10 points in the learning curve.

Returns: iterator
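The default choice of step can be sketched as follows. This is an assumption about how "10 points in the learning curve" could be obtained, written in plain Python for illustration; `staged_iterations` is a hypothetical helper and the actual rule in the wrapper may differ.

```python
def staged_iterations(n_estimators, step=None):
    """Return the numbers of trees at which staged predictions are evaluated."""
    if step is None:
        step = max(n_estimators // 10, 1)   # aim for about 10 points
    return list(range(step, n_estimators + 1, step))


print(staged_iterations(100))            # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(staged_iterations(100, step=25))   # [25, 50, 75, 100]
print(staged_iterations(5))              # [1, 2, 3, 4, 5]
```

Since each point requires predicting with the first k trees from scratch, a larger step trades resolution of the learning curve for speed.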
class rep.estimators.xgboost.XGBoostRegressor(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, min_child_weight=1.0, subsample=1.0, colsample=1.0, objective_type='linear', base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Implements a regression model from the XGBoost library. The XGBoost tree booster is used.

Parameters:
• n_estimators (int) – number of trees built.
• nthreads (int) – number of parallel threads used to run XGBoost.
• num_feature (None or int) – feature dimension used in boosting, set to the maximum dimension of the feature (set automatically by XGBoost, no need to be set by the user).
• gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.
• eta (float) – step size shrinkage (learning rate) used in the update to prevent overfitting. After each boosting step, the weights of new features are available directly, and eta shrinks them to make the boosting process more conservative.
• max_depth (int) – maximum depth of a tree.
• scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
• min_child_weight (float) – minimum sum of instance weights (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weights less than min_child_weight, the building process gives up further partitioning. Note that weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.
• subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees, which helps prevent overfitting.
• colsample (float) – subsample ratio of columns when constructing each tree.
• base_score (float) – the initial prediction score of all instances (global bias).
• random_state (None or int or RandomState) – state for a pseudo-random generator.
• verbose (bool) – if 1, messages are printed during training.
• missing (float) – the value treated by XGBoost as missing.
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – values for samples, array-like of shape [n_samples]
• sample_weight – weights of samples, array-like of shape [n_samples], or None if all weights are equal

Returns: self
predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples] with predicted values
staged_predict(X, step=None)[source]

Predicts values for data on each stage.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• step (int) – step for returned iterations (None by default). XGBoost does not implement staged prediction, so predictions are recomputed from the beginning each time. If None is passed, the step is chosen to give 10 points in the learning curve.

Returns: iterator

## Theanets classifier and regressor

These classes are wrappers for the theanets library, a neural network library for Python.

class rep.estimators.theanets.TheanetsBase(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Bases: object

A base class for the estimators from Theanets library.

Parameters:
• features (None or list(str)) – list of features used to train the model
• layers (sequence of int, tuple, dict) – a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that the theanets layers parameter includes the input and output layers in the sequence as well.
• input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset.
• output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset.
• hidden_activation (str) – the name of the activation function used by default on the hidden network layers
• output_activation (str) – the name of the activation function used by default on the output layer
• input_noise (float) – standard deviation of the noise to inject into the input
• hidden_noise (float) – standard deviation of the noise to inject into the hidden unit activation output
• input_dropout (float) – proportion of the input units to randomly set to 0; must be in the range [0, 1]
• hidden_dropout (float) – proportion of the hidden unit activations to randomly set to 0; must be in the range [0, 1]
• decode_from (int) – any of the hidden layers can be tapped at the output. Specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
• scaler (str or sklearn-like transformer or False) – transformer applied to the input samples. If it is False, scaling is not used.
• trainers (list[dict] or None) – parameters specifying the training algorithm(s), for example: trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
• random_state (None or int or RandomState) – state for a pseudo-random generator
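The interplay between layers, input_layer and output_layer can be pictured as building one full list of layer sizes, where -1 means "take the size from the data". The sketch below is an assumption about that behaviour written in plain Python; `resolve_layers` and its `n_features` and `n_outputs` arguments are hypothetical names.

```python
def resolve_layers(hidden_layers, n_features, n_outputs, input_layer=-1, output_layer=-1):
    """Build the full layer-size list: input, hidden layers, output."""
    first = n_features if input_layer == -1 else input_layer
    last = n_outputs if output_layer == -1 else output_layer
    return [first] + list(hidden_layers) + [last]


print(resolve_layers((10,), n_features=4, n_outputs=2))   # [4, 10, 2]
print(resolve_layers((20, 5), 4, 2, output_layer=1))      # [4, 20, 5, 1]
```

Here the wrapper-level layers value lists only the hidden layers, while the resulting list has the form that the theanets layers parameter expects.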

fit(X, y, sample_weight=None)[source]

Train a classification/regression model on the data.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – values for samples, array-like of shape [n_samples]
• sample_weight – weights for samples, array-like of shape [n_samples]

Returns: self
partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – values for samples, array-like of shape [n_samples]
• sample_weight – weights for samples, array-like of shape [n_samples]
• keep_trainer (bool) – if True, the trainer is added to the list self.trainers; use this when the trainer is not already stored there
• trainer (dict) – parameters of the training algorithm to use now

Returns: self
set_params(**params)[source]

Set the parameters of the estimator. Deep parameters of trainers and scaler can be accessed, for instance:

trainers__0 = {'algo': 'sgd', 'learning_rate': 0.3}
trainers__0_algo = 'sgd'
layers__1 = 14
scaler__use_std = True

Parameters: params (dict) – parameters to set in the model
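An indexed deep parameter such as layers__1 = 14 replaces a single item of a sequence-valued parameter. A minimal sketch of that idea in plain Python is shown below; `set_indexed_param` is a hypothetical helper and handles only the `<name>__<index>` form, not the trainers or scaler keys.

```python
def set_indexed_param(values, key, new_value):
    """Return the parameter name and a copy of values with one item replaced.

    key has the form '<name>__<index>', for example 'layers__1'.
    """
    name, _, index = key.partition("__")
    updated = list(values)
    updated[int(index)] = new_value
    return name, tuple(updated)


print(set_indexed_param((10, 7, 3), "layers__1", 14))   # ('layers', (10, 14, 3))
```

Returning a new tuple instead of mutating the input keeps the original parameter value intact, which is convenient for grid searches that try several values in turn.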
class rep.estimators.theanets.TheanetsClassifier(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Implements a classification model from the Theanets library.

Parameters:
• features (None or list(str)) – list of features used to train the model
• layers (sequence of int, tuple, dict) – a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that the theanets layers parameter includes the input and output layers in the sequence as well.
• input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset.
• output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset.
• hidden_activation (str) – the name of the activation function used by default on the hidden network layers
• output_activation (str) – the name of the activation function used by default on the output layer
• input_noise (float) – standard deviation of the noise to inject into the input
• hidden_noise (float) – standard deviation of the noise to inject into the hidden unit activation output
• input_dropout (float) – proportion of the input units to randomly set to 0; must be in the range [0, 1]
• hidden_dropout (float) – proportion of the hidden unit activations to randomly set to 0; must be in the range [0, 1]
• decode_from (int) – any of the hidden layers can be tapped at the output. Specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
• scaler (str or sklearn-like transformer or False) – transformer applied to the input samples. If it is False, scaling is not used.
• trainers (list[dict] or None) – parameters specifying the training algorithm(s), for example: trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
• random_state (None or int or RandomState) – state for a pseudo-random generator

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
• X (pandas.DataFrame) – data of shape [n_samples, n_features]
• y – values for samples, array-like of shape [n_samples]
• sample_weight – weights for samples, array-like of shape [n_samples]
• keep_trainer (bool) – if True, the trainer is added to the list self.trainers; use this when the trainer is not already stored there
• trainer (dict) – parameters of the training algorithm to use now

Returns: self
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features]

Returns: numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This function is not supported by Theanets (NotImplementedError will be thrown).

class rep.estimators.theanets.TheanetsRegressor(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Implements a regression model from the Theanets library.

Parameters:
• features (None or list(str)) – list of features used to train the model
• layers (sequence of int, tuple, dict) – a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that the theanets layers parameter includes the input and output layers in the sequence as well.
• input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset.
• output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset.
• hidden_activation (str) – the name of the activation function used by default on the hidden network layers
• output_activation (str) – the name of the activation function used by default on the output layer
• input_noise (float) – standard deviation of the noise to inject into the input
• hidden_noise (float) – standard deviation of the noise to inject into the hidden unit activation output
• input_dropout (float) – proportion of the input units to randomly set to 0; must be in the range [0, 1]
• hidden_dropout (float) – proportion of the hidden unit activations to randomly set to 0; must be in the range [0, 1]
• decode_from (int) – any of the hidden layers can be tapped at the output. Specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
• scaler (str or sklearn-like transformer or False) – transformer applied to the input samples. If it is False, scaling is not used.
• trainers (list[dict] or None) – parameters specifying the training algorithm(s), for example: trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
• random_state (None or int or RandomState) – state for a pseudo-random generator

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Additionally train the existing estimator.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – values for samples, array-like of shape [n_samples] sample_weight – weights for samples, array-like of shape [n_samples] keep_trainer (bool) – set to True if the trainer is not already stored in self.trainers; if True, the trainer is added to that list trainer (dict) – parameters of the training algorithm we want to use now Returns: self
predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Returns: numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This function is not supported for Theanets (NotImplementedError will be raised).

## Neurolab classifier and regressor¶

These classes are wrappers for the Neurolab library — a neural network python library.

Warning

To make neurolab reproducible, we change the global random seed:

numpy.random.seed(42)

class rep.estimators.neurolab.NeurolabClassifier(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]

Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Classifier

Implements a classification model from the Neurolab library.

Parameters: features (list[str] or None) – features used in training layers (list[int]) – sequence, number of units inside each hidden layer. net_type (string) – type of the network; possible values are: feed-forward, competing-layer, learning-vector, elman-recurrent, hemming-recurrent initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers trainf – net training function; default value depends on the type of a network scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used random_state – this parameter is ignored and is added for uniformity. other_params (dict) – additional arguments to net __init__, varies with different net_types
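
The scaler parameter names a transformer applied to the input samples before training. The sketch below assumes that scaler='standard' means standardization to zero mean and unit variance, as scikit-learn's StandardScaler does; that mapping is an assumption and is not stated in this section. The function `standardize` is a hypothetical pure-Python stand-in, shown for a single feature column.

```python
import math


def standardize(column):
    """Scale one feature column to zero mean and unit (population) variance.

    Hypothetical helper; mirrors the usual definition of standard scaling.
    """
    n = float(len(column))
    mean = sum(column) / n
    variance = sum((value - mean) ** 2 for value in column) / n
    # A constant column has zero spread; divide by 1 instead of 0.
    std = math.sqrt(variance) or 1.0
    return [(value - mean) / std for value in column]


scaled = standardize([1.0, 2.0, 3.0, 4.0])
```

Neural networks are usually sensitive to the scale of their inputs, which is presumably why scaling is enabled by default and has to be switched off explicitly with scaler=False.
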
fit(X, y)[source]

Train a classification model on the data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – labels of samples, array-like of shape [n_samples] Returns: self
partial_fit(X, y)[source]

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – labels of samples, array-like of shape [n_samples] Returns: self
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Returns: numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This function is not supported for Neurolab (AttributeError will be raised).

class rep.estimators.neurolab.NeurolabRegressor(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]

Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Regressor

Implements a regression model from the Neurolab library.

Parameters: features (list[str] or None) – features used in training layers (list[int]) – sequence, number of units inside each hidden layer. net_type (string) – type of the network; possible values are: feed-forward, competing-layer, learning-vector, elman-recurrent, hemming-recurrent initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers trainf – net training function; default value depends on the type of a network scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used random_state – this parameter is ignored and is added for uniformity. other_params (dict) – additional arguments to net __init__, varies with different net_types
fit(X, y)[source]

Train a regression model on the data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – values for samples, array-like of shape [n_samples] Returns: self
partial_fit(X, y)[source]

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – values for samples, array-like of shape [n_samples] Returns: self
predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Returns: numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This function is not supported for Neurolab (AttributeError will be raised).

## Pybrain classifier and regressor¶

These classes are wrappers for the PyBrain library — a neural network python library.

Warning

pybrain training isn’t reproducible (training with the same parameters produces a different neural network each time)

class rep.estimators.pybrain.PyBrainBase(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Bases: object

A base class for the estimator from the PyBrain.

Parameters: features (list[str] or None) – features used in training. scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer verbose (bool) – print train/validation errors. random_state – this parameter is ignored; pybrain training is not reproducible

Net parameters:

Parameters: layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’ params (dict) – other net parameters: bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True; peepholes (boolean); recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Parameters: learningrate (float) – the ratio by which parameters are changed in the direction of the gradient lrdecay (float) – factor by which the learning rate is multiplied after each training step momentum (float) – the ratio by which the gradient of the last time step is used batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False weightdecay (float) – the weight decay rate, where 0 is no weight decay at all
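
Since lrdecay multiplies the learning rate after each training step, the rate used at step t is learningrate * lrdecay**t. The sketch below only works through that arithmetic; `learning_rate_schedule` is a hypothetical helper, not a PyBrain or REP function.

```python
def learning_rate_schedule(learningrate, lrdecay, n_steps):
    """Return the learning rate used at each of the first `n_steps` steps."""
    rates = []
    rate = learningrate
    for _ in range(n_steps):
        rates.append(rate)
        rate *= lrdecay      # decay is applied after the step has been taken
    return rates


# Halving the rate after every step, starting from the default 0.01.
rates = learning_rate_schedule(learningrate=0.01, lrdecay=0.5, n_steps=4)

# With the default lrdecay=1.0 the rate never changes.
constant = learning_rate_schedule(learningrate=0.01, lrdecay=1.0, n_steps=3)
```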

Rprop trainer parameters:

Parameters: etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5) etaplus (float) – factor by which a step width is increased when following gradient (default=1.2) delta (float) – step width for each weight deltamin (float) – minimum step width (default=1e-6) deltamax (float) – maximum step width (default=0.5) delta0 (float) – initial step width (default=0.1)
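
To show how these parameters interact, here is a minimal sketch of the textbook Rprop step-width rule for a single weight. It is not PyBrain's actual trainer code, and the function name `rprop_step_width` is hypothetical. The default values are the ones from the constructor signature above.

```python
def rprop_step_width(step, grad, prev_grad,
                     etaminus=0.5, etaplus=1.2, deltamin=1e-6, deltamax=0.5):
    """Update the step width of one weight from two consecutive gradients."""
    if grad * prev_grad > 0:
        # Same sign: we are still following the gradient, so widen the step.
        return min(step * etaplus, deltamax)
    if grad * prev_grad < 0:
        # Sign flipped: we overstepped a minimum, so shrink the step.
        return max(step * etaminus, deltamin)
    # One of the gradients is zero: leave the step width unchanged.
    return step


step = 0.1                                                 # delta0
grown = rprop_step_width(step, grad=1.0, prev_grad=2.0)    # widened
shrunk = rprop_step_width(step, grad=1.0, prev_grad=-2.0)  # narrowed
```

Only the sign of the gradient is used, which is why Rprop needs no learning rate; deltamin and deltamax simply clamp the step width.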

Training termination parameters

Parameters: epochs (int) – number of iterations in training; if < 0 then estimator trains until convergence max_epochs (int) – maximum number of epochs the trainer should train, if given continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one validation_proportion (float) – the ratio of the dataset that is used for the validation dataset
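
The termination parameters describe a form of early stopping on the validation error. The sketch below is one plausible reading of continue_epochs and max_epochs, written against a precomputed list of validation errors. It is an illustration under that assumption, not PyBrain's training loop, and `epochs_until_stop` is a hypothetical name.

```python
def epochs_until_stop(validation_errors, continue_epochs=3, max_epochs=None):
    """Return (epochs actually run, best validation error seen)."""
    best_error = float("inf")
    epochs_since_best = 0
    epochs_run = 0
    for error in validation_errors:
        if max_epochs is not None and epochs_run >= max_epochs:
            break                      # hard cap on the number of epochs
        epochs_run += 1
        if error < best_error:
            best_error = error         # improvement: reset the patience counter
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= continue_epochs:
                break                  # no improvement for continue_epochs epochs
    return epochs_run, best_error


errors = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
stopped = epochs_until_stop(errors, continue_epochs=3)
capped = epochs_until_stop(errors, continue_epochs=3, max_epochs=2)
```

In this run the last error (0.5) is never reached, because three epochs in a row fail to beat 0.6; a larger continue_epochs trades training time for a chance to escape such plateaus.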

Note

fit(X, y)[source]

Train a classification/regression model on the data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – values for samples, array-like of shape [n_samples] Returns: self
partial_fit(X, y)[source]

Additional training of the classification/regression model.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – values for samples, array-like of shape [n_samples] Returns: self
set_params(**params)[source]

Set the parameters of the estimator.

Names of the parameters are the same as in the constructor.

class rep.estimators.pybrain.PyBrainClassifier(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Implements a classification model from the PyBrain library.

Parameters: features (list[str] or None) – features used in training. scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer verbose (bool) – print train/validation errors. random_state – this parameter is ignored; pybrain training is not reproducible

Net parameters:

Parameters: layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’ params (dict) – other net parameters: bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True; peepholes (boolean); recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Parameters: learningrate (float) – the ratio by which parameters are changed in the direction of the gradient lrdecay (float) – factor by which the learning rate is multiplied after each training step momentum (float) – the ratio by which the gradient of the last time step is used batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False weightdecay (float) – the weight decay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters: etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5) etaplus (float) – factor by which a step width is increased when following gradient (default=1.2) delta (float) – step width for each weight deltamin (float) – minimum step width (default=1e-6) deltamax (float) – maximum step width (default=0.5) delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters: epochs (int) – number of iterations in training; if < 0 then estimator trains until convergence max_epochs (int) – maximum number of epochs the trainer should train, if given continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Returns: numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This function is not supported for PyBrain (AttributeError will be thrown).

class rep.estimators.pybrain.PyBrainRegressor(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Implements a regression model from the PyBrain library.

Parameters: features (list[str] or None) – features used in training. scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer verbose (bool) – print train/validation errors. random_state – this parameter is ignored; pybrain training is not reproducible

Net parameters:

Parameters: layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’ params (dict) – other net parameters: bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True; peepholes (boolean); recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Parameters: learningrate (float) – the ratio by which parameters are changed in the direction of the gradient lrdecay (float) – factor by which the learning rate is multiplied after each training step momentum (float) – the ratio by which the gradient of the last time step is used batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False weightdecay (float) – the weight decay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters: etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5) etaplus (float) – factor by which a step width is increased when following gradient (default=1.2) delta (float) – step width for each weight deltamin (float) – minimum step width (default=1e-6) deltamax (float) – maximum step width (default=0.5) delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters: epochs (int) – number of iterations in training; if < 0 then estimator trains until convergence max_epochs (int) – maximum number of epochs the trainer should train, if given continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Returns: numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This function is not supported for PyBrain (AttributeError will be thrown).

## MatrixNet classifier and regressor¶

MatrixNetClassifier and MatrixNetRegressor are wrappers for the MatrixNet web service, a proprietary BDT developed at Yandex. Think of it as a specific Boosted Decision Tree algorithm that is available as a service. At the moment MatrixNet is available only to CERN users.

To use MatrixNet, first acquire a token:

• Click Add token at the left panel

• Choose service MatrixNet and click Create token

• Create ~/.rep-matrixnet.config.json file with the following content (custom path to the config file can be specified when creating a wrapper object):

{
    "url": "https://ml.cern.yandex.net/v1",
    "token": "<your_token>"
}
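
If you prefer to create the config file from Python, a minimal sketch using only the standard library could look like this. The function name `write_matrixnet_config` is hypothetical, and the demo writes to a temporary directory so that nothing in your home directory is touched.

```python
import json
import os
import tempfile


def write_matrixnet_config(path, token, url="https://ml.cern.yandex.net/v1"):
    """Write a MatrixNet API config with the two keys shown above."""
    config = {"url": url, "token": token}
    with open(path, "w") as config_file:
        json.dump(config, config_file, indent=4)
    return path


# Demo: write to a temporary location and read the file back.
demo_path = os.path.join(tempfile.mkdtemp(), "rep-matrixnet.config.json")
write_matrixnet_config(demo_path, token="<your_token>")
with open(demo_path) as config_file:
    loaded = json.load(config_file)
```

For real use, pass `os.path.expanduser("~/.rep-matrixnet.config.json")` as the path and your own token, or point the wrapper at another location through its api_config_file argument.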

class rep.estimators.matrixnet.MatrixNetBase(api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

Bases: object

Base class for MatrixNetClassifier and MatrixNetRegressor.

This is a wrapper around MatrixNet (specific BDT) technology developed at Yandex, which is available to CERN users with authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters: features (list[str] or None) – features used in training api_config_file (str) – path to the file with remote api configuration in the json format: {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"} iterations (int) – number of constructed trees (default=100) regularization (float) – regularization number (default=0.01) intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8) max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6) features_sample_rate_per_iteration (float) – training features sampling (default=1.0) training_fraction (float) – training rows bagging (default=0.5) auto_stop (None or float) – error value for training pre-stopping sync (bool) – synchronous or asynchronous training on the server random_state (None or int or RandomState) – state for a pseudo random generator
feature_importances_

Sklearn-way of returning feature importance. It is returned as numpy.array; the ‘effect’ column is used among MatrixNet importances.

get_feature_importances()[source]

Get features importance: effect, efficiency, information characteristics

Return type: pandas.DataFrame with index=self.features
get_iterations()[source]

Return number of already constructed trees during training

Returns: int or None
resubmit()[source]

Resubmit the training process on the server in case of a failed job.

synchronize()[source]

Synchronize asynchronous training: wait until the training process has finished on the server

training_status()[source]

Check if training has finished on the server

Return type: bool

class rep.estimators.matrixnet.MatrixNetClassifier(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

MatrixNet classification model.

This is a wrapper around MatrixNet (specific BDT) technology developed at Yandex, which is available to CERN users with authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters: features (list[str] or None) – features used in training api_config_file (str) – path to the file with remote api configuration in the json format: {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"} iterations (int) – number of constructed trees (default=100) regularization (float) – regularization number (default=0.01) intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8) max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6) features_sample_rate_per_iteration (float) – training features sampling (default=1.0) training_fraction (float) – training rows bagging (default=0.5) auto_stop (None or float) – error value for training pre-stopping sync (bool) – synchronous or asynchronous training on the server random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – labels of samples, array-like of shape [n_samples] sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal Returns: self
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Returns: numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X, step=10)[source]

Predict probabilities for data for each class label on each stage.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] step (int) – step for returned iterations (10 by default). Returns: iterator
class rep.estimators.matrixnet.MatrixNetRegressor(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

MatrixNet regression model.

This is a wrapper around MatrixNet (specific BDT) technology developed at Yandex, which is available to CERN users with authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters: features (list[str] or None) – features used in training api_config_file (str) – path to the file with remote api configuration in the json format: {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"} iterations (int) – number of constructed trees (default=100) regularization (float) – regularization number (default=0.01) intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8) max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6) features_sample_rate_per_iteration (float) – training features sampling (default=1.0) training_fraction (float) – training rows bagging (default=0.5) auto_stop (None or float) – error value for training pre-stopping sync (bool) – synchronous or asynchronous training on the server random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] y – values for samples, array-like of shape [n_samples] sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal Returns: self
predict(X)[source]

Predict values for data.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Returns: numpy.array of shape [n_samples] with predicted values
staged_predict(X, step=10)[source]

Predict values for data on each stage.

Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] step (int) – step for returned iterations (10 by default). Returns: iterator

## Examples¶

### Classification¶

• Prepare dataset
>>> from sklearn import datasets
>>> import pandas, numpy
>>> from rep.utils import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> # iris data
>>> iris = datasets.load_iris()
>>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
>>> labels = iris.target
>>> # Take just two classes instead of three
>>> data = data[labels != 2]
>>> labels = labels[labels != 2]
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)

• Sklearn classification
>>> from rep.estimators import SklearnClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> # Using gradient boosting with default settings
>>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
>>> # Training classifier
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict_proba(test_data)
>>> print(pred)
[[  9.99842983e-01   1.57016893e-04]
[  1.45163843e-04   9.99854836e-01]
[  9.99842983e-01   1.57016893e-04]
[  9.99827693e-01   1.72306607e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518523

• TMVA classification
>>> from rep.estimators import TMVAClassifier
>>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict_proba(test_data)
>>> print(pred)
[[  9.99991025e-01   8.97546346e-06]
[  1.14084636e-04   9.99885915e-01]
[  9.99991009e-01   8.99060302e-06]
[  9.99798700e-01   2.01300452e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99999999999999989

• XGBoost classification
>>> from rep.estimators import XGBoostClassifier
>>> # XGBoost with default parameters
>>> xgb = XGBoostClassifier(features=['a', 'b'])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict_proba(test_data)
>>> print(pred)
[[ 0.9983651   0.00163494]
[ 0.00170585  0.99829417]
[ 0.99845636  0.00154361]
[ 0.96618336  0.03381656], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518512
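
The examples above score the second column of predict_proba with roc_auc_score. For readers unfamiliar with the metric, the following pure-Python sketch computes the same quantity for binary labels from its pairwise definition: the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as one half. The function `roc_auc` is a hypothetical stand-in for illustration; in practice use sklearn.metrics.roc_auc_score as shown above.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC for binary labels (0 = negative, 1 = positive)."""
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0       # correctly ordered pair
            elif positive_score == negative_score:
                wins += 0.5       # tie counts as half
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```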


### Regression¶

• Prepare dataset
>>> from sklearn import datasets
>>> from sklearn.metrics import mean_squared_error
>>> from rep.utils import train_test_split
>>> import pandas, numpy
>>> # diabetes data
>>> diabetes = datasets.load_diabetes()
>>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
>>> data = pandas.DataFrame(diabetes.data, columns=features)
>>> labels = diabetes.target
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)

• Sklearn regression
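>>> from rep.estimators import SklearnRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> # Using gradient boosting with default settings
>>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
>>> # Training regressor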
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict(train_data)
>>> numpy.sqrt(mean_squared_error(train_labels, pred))
60.666009962879265

• TMVA regression
>>> from rep.estimators import TMVARegressor
>>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
73.74191838418254

• XGBoost regression
>>> from rep.estimators import XGBoostRegressor
>>> # XGBoost with default parameters
>>> xgb = XGBoostRegressor(features=features[:8])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
65.557743652940133
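
The regression examples report numpy.sqrt(mean_squared_error(...)), that is, the root mean squared error in the units of the target. As a small worked illustration of that formula, here is a pure-Python equivalent; `root_mean_squared_error` is a hypothetical helper written for this sketch, not a REP function.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference between two sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# One prediction is off by 4, the rest are exact: sqrt(16 / 4) = 2.
rmse = root_mean_squared_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
```

Note that the Sklearn regression example above evaluates on train_data, while the TMVA and XGBoost examples evaluate on test_data, so the three numbers are not directly comparable.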


## Compatible libraries¶

REP can deal with any library which supports the scikit-learn interface.

Examples of compatible libraries: nolearn, skflow, gplearn and hep_ml.
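
As a rough picture of what "supports the scikit-learn interface" usually means, the toy estimator below implements the conventional methods: fit returning self, predict, get_params and set_params. It is a dependency-free sketch with a hypothetical class name; whether REP's wrappers accept this exact class has not been verified here, and real estimators normally inherit from sklearn.base.BaseEstimator and the matching mixin instead of writing these methods by hand.

```python
class MeanRegressor(object):
    """Toy regressor that predicts the (weighted) mean of the training targets."""

    def __init__(self, shift=0.0):
        self.shift = shift                      # constructor args are the params

    def get_params(self, deep=True):
        return {"shift": self.shift}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            sample_weight = [1.0] * len(y)
        total = float(sum(sample_weight))
        # Learned attributes get a trailing underscore by convention.
        self.mean_ = sum(w * v for w, v in zip(sample_weight, y)) / total
        return self

    def predict(self, X):
        return [self.mean_ + self.shift for _ in X]


model = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
predictions = model.predict([[5], [6]])
shifted = model.set_params(shift=1.0).predict([[5]])
```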