Estimators (classification and regression)

This module contains wrappers with a scikit-learn (sklearn) interface for several machine learning libraries:

  • scikit-learn
  • TMVA
  • XGBoost
  • pybrain
  • neurolab
  • theanets

REP defines a common interface for classifier and regressor wrappers, so new wrappers for other libraries can be added following the same interface. Notably, the interface is backward compatible with the scikit-learn library.

Estimator interfaces (for classification and regression)

REP wrappers are derived from Classifier and Regressor depending on the problem of interest.

Below you can see the standard methods available in the wrappers.

class rep.estimators.interface.Classifier(features=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Interface to train different classification models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...

Parameters:features (list[str] or None) – features used to train a model

Note

  • if features aren't set (None), then all features in the training dataset will be used
  • Datasets should be pandas.DataFrame, not numpy.array. This lets you choose the features used in training by setting, e.g., features=['mass', 'momentum'] in the constructor.
  • It works fine with numpy.array as well, but in this case all the features will be used.
  • Class values must be from 0 to n_classes-1!
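A minimal usage sketch of this interface (SklearnClassifier is used here purely as an illustration; the toy DataFrame and labels are made up for the example):

    import numpy, pandas
    from sklearn.ensemble import GradientBoostingClassifier
    from rep.estimators import SklearnClassifier

    # toy dataset: a DataFrame allows selecting training features by name
    data = pandas.DataFrame(numpy.random.normal(size=(100, 3)),
                            columns=['mass', 'momentum', 'charge'])
    labels = numpy.random.randint(0, 2, size=100)  # class values in {0, 1}

    # only 'mass' and 'momentum' are used in training
    clf = SklearnClassifier(GradientBoostingClassifier(), features=['mass', 'momentum'])
    clf.fit(data, labels)
    probabilities = clf.predict_proba(data)  # shape [n_samples, n_classes]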
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

fit_lds(lds)[source]

Train a classifier on the specific type of dataset.

Parameters:lds (LabeledDataStorage) – data
Returns:self
get_feature_importances()[source]

Return feature importances.

Return type:pandas.DataFrame with index=self.features
get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators
Returns:params (mapping of string to any) – parameter names mapped to their values
predict(X)[source]

Predict labels for all samples in the dataset.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with integer labels
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric, since it requires that each label set be correctly predicted for each sample.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – test samples
  • y (array-like of shape [n_samples] or [n_samples, n_outputs]) – true labels for X
  • sample_weight (array-like of shape [n_samples], optional) – sample weights
Returns:

score (float) – mean accuracy of self.predict(X) with respect to y
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:

self

staged_predict_proba(X)[source]

Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator
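For boosting-like models this allows building a learning curve without refitting. A sketch (assuming a fitted classifier clf and held-out test_data, test_labels):

    from sklearn.metrics import roc_auc_score

    # one probability array is yielded per training stage
    stage_aucs = [roc_auc_score(test_labels, proba[:, 1])
                  for proba in clf.staged_predict_proba(test_data)]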
test_on(X, y, sample_weight=None)[source]

Prepare a classification report for a single classifier.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples — array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

ClassificationReport

test_on_lds(lds)[source]

Prepare a classification report for a single classifier.

Parameters:lds (LabeledDataStorage) – data
Returns:ClassificationReport
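A sketch of both report entry points (assuming a fitted classifier clf; the import path for LabeledDataStorage is assumed to be rep.data):

    from rep.data import LabeledDataStorage

    # directly from arrays/DataFrames ...
    report = clf.test_on(test_data, test_labels)

    # ... or from a LabeledDataStorage holding the same data
    lds = LabeledDataStorage(test_data, test_labels)
    report = clf.test_on_lds(lds)  # ClassificationReport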
class rep.estimators.interface.Regressor(features=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

Interface to train different regression models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...

Parameters:features (list[str] or None) – features used to train a model

Note

  • if features aren't set (None), then all features in the training dataset will be used
  • Datasets should be pandas.DataFrame, not numpy.array. This lets you choose the features used in training by setting, e.g., features=['mass', 'momentum'] in the constructor.
  • It works fine with numpy.array as well, but in this case all the features will be used.
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

fit_lds(lds)[source]

Train a regression model on the specific type of dataset.

Parameters:lds (LabeledDataStorage) – data
Returns:self
get_feature_importances()[source]

Get feature importances.

Return type:pandas.DataFrame with index=self.features
get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators
Returns:params (mapping of string to any) – parameter names mapped to their values
predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
score(X, y, sample_weight=None)

Return the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.

Parameters:
  • X (array-like of shape [n_samples, n_features]) – test samples
  • y (array-like of shape [n_samples] or [n_samples, n_outputs]) – true values for X
  • sample_weight (array-like of shape [n_samples], optional) – sample weights
Returns:

score (float) – R^2 of self.predict(X) with respect to y
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:

self

staged_predict(X)[source]

Predict values for data on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator
test_on(X, y, sample_weight=None)[source]

Prepare a regression report for a single regressor.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values of samples — array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

RegressionReport

test_on_lds(lds)[source]

Prepare a regression report for a single regressor.

Parameters:lds (LabeledDataStorage) – data
Returns:RegressionReport

Sklearn classifier and regressor

SklearnClassifier and SklearnRegressor are wrappers for algorithms from scikit-learn.

From the user's perspective, a wrapped sklearn model behaves in the same way as an unwrapped one, but has one additional parameter, features, for choosing the columns to use in training.

Typically, models from REP are used with pandas.DataFrames, which makes it possible to refer to the needed variables by name or to give some variables a specific role in the training.

If the data is a numpy.array, the behaviour is the same as in sklearn. For a complete list of the available algorithms, see the sklearn API.

class rep.estimators.sklearn.SklearnClassifier(clf, features=None)[source]

Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Classifier

SklearnClassifier is a wrapper over sklearn-like classifiers.

Parameters:
  • clf (sklearn.BaseEstimator) – classifier to train. Should be sklearn-compatible.
  • features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]

Train the classifier.

Parameters:
  • X (pandas.DataFrame) – data shape [n_samples, n_features]
  • y – target of training, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

Note

If the sklearn classifier doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.

predict(X)[source]

Predict labels for all samples in the dataset.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with integer labels
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator
class rep.estimators.sklearn.SklearnRegressor(clf, features=None)[source]

Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Regressor

SklearnRegressor is a wrapper over sklearn-like regressors.

Parameters:
  • clf (sklearn.BaseEstimator) – regressor to train. Should be sklearn-compatible.
  • features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]

Train the regressor.

Parameters:
  • X (pandas.DataFrame) – data shape [n_samples, n_features]
  • y – target of training, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

Note

If the sklearn regressor doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Predict values for data on each stage (i.e. for boosting algorithms).

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:iterator

TMVA classifier and regressor

These classes are wrappers for TMVA, a C++ machine learning library used in high energy physics that works with files in the .root format. With these wrappers you can simply use it from Python. TMVA contains classification and regression algorithms, including neural networks. See the TMVA guide for the list of the available algorithms and parameters.

class rep.estimators.tmva.TMVAClassifier(method='kBDT', features=None, factory_options='', sigmoid_function='bdt', **method_parameters)[source]

Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Classifier

Implements classification models from TMVA library: CERN library for machine learning.

Parameters:
  • method (str) – algorithm method (default=’kBDT’)
  • features (list[str] or None) – features used in training
  • factory_options (str) –

    system options, including data transformations before training, for example:

    "!V:!Silent:Color:Transformations=I;D;P;G,D"
    
  • sigmoid_function (str) –

    function which is used to convert TMVA output to probabilities;

    • identity (use for svm, mlp) – do not transform the output; use this value for methods that return class probabilities
    • sigmoid – sigmoid transformation; use it if the output varies in the range [-infinity, +infinity]
    • bdt – for BDT algorithms (the output varies in the range [-1, 1])
    • sig_eff=0.4 – for the rectangular cut optimization methods; here, for instance, 0.4 will be used as the signal efficiency to evaluate the MVA (use any float from [0, 1])
  • method_parameters (dict) – classifier options, for example: NTrees=100, BoostType='Grad'

Warning

TMVA doesn't support staged_predict_proba() and feature_importances_.

TMVA supports only two-class classification, not multiclass classification.

TMVA guide.
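A construction sketch (method parameters such as NTrees and BoostType are passed straight through to TMVA; train_data and train_labels are assumed):

    from rep.estimators import TMVAClassifier

    tmva = TMVAClassifier(method='kBDT', features=['a', 'b'],
                          NTrees=100, Shrinkage=0.1, BoostType='Grad',
                          sigmoid_function='bdt')  # BDT output lies in [-1, 1]
    tmva.fit(train_data, train_labels)
    proba = tmva.predict_proba(test_data)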

fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

get_params(deep=True)[source]

Get parameters for this estimator.

Returns:dict, parameter names mapped to their values.
predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
set_params(**params)[source]

Set the parameters of this estimator.

Parameters:params (dict) – parameters to set in the model
staged_predict_proba(X)[source]

Warning

This function is not supported for the TMVA library (AttributeError will be thrown)

class rep.estimators.tmva.TMVARegressor(method='kBDT', features=None, factory_options='', **method_parameters)[source]

Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Regressor

Implements regression models from TMVA library: CERN library for machine learning.

Parameters:
  • method (str) – algorithm method (default=’kBDT’)
  • features (list[str] or None) – features used in training
  • factory_options (str) –

    system options, including data transformations before training, for example:

    "!V:!Silent:Color:Transformations=I;D;P;G,D"
    
  • method_parameters (dict) – regressor options, for example: NTrees=100, BoostType='Grad'

Warning

TMVA doesn't support staged_predict() and feature_importances_.

TMVA guide

fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

get_params(deep=True)[source]

Get parameters for this estimator.

Returns:dict, parameter names mapped to their values.
predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
set_params(**params)[source]

Set the parameters of this estimator.

Parameters:params (dict) – parameters to set in the model
staged_predict(X)[source]

Warning

This function is not supported for the TMVA library (AttributeError will be thrown)

XGBoost classifier and regressor

These classes are wrappers for XGBoost library.

class rep.estimators.xgboost.XGBoostBase(n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Bases: object

A base class for the XGBoostClassifier and XGBoostRegressor. XGBoost tree booster is used.

Parameters:
  • n_estimators (int) – number of trees built.
  • nthreads (int) – number of parallel threads used to run XGBoost.
  • num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
  • gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
  • eta (float) – (also known as the learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth (int) – maximum depth of a tree.
  • scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
  • min_child_weight (float) –

    minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.

    Note

weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.

  • subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which prevents overfitting.
  • colsample (float) – subsample ratio of columns when constructing each tree.
  • base_score (float) – the initial prediction score of all instances, global bias.
  • random_state (None or int or RandomState) – state for a pseudo random generator
  • verbose (bool) – if 1, messages will be printed during training
  • missing (float) – the number considered by XGBoost as missing value.
feature_importances_

Sklearn-style feature importances, returned as a numpy.array (assuming that features=None was passed initially).

get_feature_importances()[source]

Get feature importances.

Return type:pandas.DataFrame with index=self.features
class rep.estimators.xgboost.XGBoostClassifier(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Bases: rep.estimators.xgboost.XGBoostBase, rep.estimators.interface.Classifier

Implements a classification model from the XGBoost library. The XGBoost tree booster is used.

Parameters:
  • n_estimators (int) – number of trees built.
  • nthreads (int) – number of parallel threads used to run XGBoost.
  • num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
  • gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
  • eta (float) – (also known as the learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth (int) – maximum depth of a tree.
  • scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
  • min_child_weight (float) –

    minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.

    Note

weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.

  • subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which prevents overfitting.
  • colsample (float) – subsample ratio of columns when constructing each tree.
  • base_score (float) – the initial prediction score of all instances, global bias.
  • random_state (None or int or RandomState) – state for a pseudo random generator
  • verbose (bool) – if 1, messages will be printed during training
  • missing (float) – the number considered by XGBoost as missing value.
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X, step=None)[source]

Predict probabilities for data for each class label on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (None by default). XGBoost does not implement staged predictions natively, so REP has to predict from the beginning each time. When None is passed, the step is chosen so that the learning curve has 10 points.
Returns:

iterator
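Since every call restarts prediction from the first tree, a larger step is cheaper. A sketch (assuming a fitted classifier xgb and held-out test data):

    from sklearn.metrics import roc_auc_score

    # evaluate quality every 20 trees
    for proba in xgb.staged_predict_proba(test_data, step=20):
        print(roc_auc_score(test_labels, proba[:, 1]))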

class rep.estimators.xgboost.XGBoostRegressor(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, min_child_weight=1.0, subsample=1.0, colsample=1.0, objective_type='linear', base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]

Bases: rep.estimators.xgboost.XGBoostBase, rep.estimators.interface.Regressor

Implements a regression model from the XGBoost library. The XGBoost tree booster is used.

Parameters:
  • n_estimators (int) – number of trees built.
  • nthreads (int) – number of parallel threads used to run XGBoost.
  • num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
  • gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
  • eta (float) – (also known as the learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth (int) – maximum depth of a tree.
  • scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
  • min_child_weight (float) –

    minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.

    Note

weights are normalized so that mean=1 before fitting, so min_child_weight roughly corresponds to a number of events.

  • subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which prevents overfitting.
  • colsample (float) – subsample ratio of columns when constructing each tree.
  • base_score (float) – the initial prediction score of all instances, global bias.
  • random_state (None or int or RandomState) – state for a pseudo random generator
  • verbose (bool) – if 1, messages will be printed during training
  • missing (float) – the number considered by XGBoost as missing value.
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X, step=None)[source]

Predict values for data on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (None by default). XGBoost does not implement staged predictions natively, so REP has to predict from the beginning each time. When None is passed, the step is chosen so that the learning curve has 10 points.
Returns:

iterator

Theanets classifier and regressor

These classes are wrappers for the theanets library, a neural network Python library.

class rep.estimators.theanets.TheanetsBase(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Bases: object

A base class for the estimators from Theanets library.

Parameters:
  • features (None or list(str)) – list of features to train model
  • layers (sequence of int, tuple, dict) – a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
  • input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
  • output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
  • hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
  • output_activation (str) – the name of an activation function to use on the output layer by default
  • input_noise (float) – standard deviation of desired noise to inject into input
  • hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
  • input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
  • hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
  • decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • trainers (list[dict] or None) –

    parameters to specify training algorithm(s), for example:

    trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]

  • random_state (None or int or RandomState) – state for a pseudo random generator

For more information on the available trainers and their parameters see this page.

fit(X, y, sample_weight=None)[source]

Train a classification/regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
  • keep_trainer (bool) – if True, the trainer is added to the list of trainers (self.trainers) kept by the estimator
  • trainer (dict) – parameters of the training algorithm we want to use now
Returns:

self

set_params(**params)[source]

Set the parameters of the estimator. Deep parameters of trainers and scaler can be accessed, for instance:

trainers__0 = {'algo': 'sgd', 'learning_rate': 0.3}
trainers__0_algo = 'sgd'
layers__1 = 14
scaler__use_std = True
Parameters:params (dict) – parameters to set in the model
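A sketch of how these deep parameters are used in practice (a hypothetical configuration):

    from rep.estimators import TheanetsClassifier

    clf = TheanetsClassifier(layers=[10, 10],
                             trainers=[{'algo': 'sgd', 'learning_rate': 0.3}])
    clf.set_params(trainers__0={'algo': 'rmsprop'})  # replace the whole first trainer
    clf.set_params(trainers__0_algo='adadelta')      # or only its 'algo' entry
    clf.set_params(layers__1=14)                     # resize the second layer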
class rep.estimators.theanets.TheanetsClassifier(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Bases: rep.estimators.theanets.TheanetsBase, rep.estimators.interface.Classifier

Implements a classification model from the Theanets library.

Parameters:
  • features (None or list(str)) – list of features to train model
  • layers (sequence of int, tuple, dict) –

    a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.

  • input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
  • output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
  • hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
  • output_activation (str) – the name of an activation function to use on the output layer by default
  • input_noise (float) – standard deviation of desired noise to inject into input
  • hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
  • input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
  • hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
  • decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • trainers (list[dict] or None) –

    parameters to specify training algorithm(s), for example:

    trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]

  • random_state (None or int or RandomState) – state for a pseudo random generator

For more information on the available trainers and their parameters see this page.

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
  • keep_trainer (bool) – if True, the trainer is added to the list of trainers (self.trainers) kept by the estimator
  • trainer (dict) – parameters of the training algorithm we want to use now
Returns:

self
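This makes staged training with different algorithms possible. A sketch (train_data and train_labels are assumed; the extra keyword arguments are the trainer parameters):

    from rep.estimators import TheanetsClassifier

    clf = TheanetsClassifier(layers=[20], trainers=[{'algo': 'nag', 'learning_rate': 0.1}])
    clf.fit(train_data, train_labels)  # first stage: 'nag'
    # refine the same network with a second algorithm, without restarting
    clf.partial_fit(train_data, train_labels, algo='sgd', learning_rate=0.01)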

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This function is not supported by theanets (NotImplementedError will be thrown)

class rep.estimators.theanets.TheanetsRegressor(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]

Bases: rep.estimators.theanets.TheanetsBase, rep.estimators.interface.Regressor

Implements a regression model from the Theanets library.

Parameters:
  • features (None or list(str)) – list of features to train model
  • layers (sequence of int, tuple, dict) –

    a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.

  • input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
  • output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
  • hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
  • output_activation (str) – the name of an activation function to use on the output layer by default
  • input_noise (float) – standard deviation of desired noise to inject into input
  • hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
  • input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
  • hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
  • decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • trainers (list[dict] or None) –

    parameters to specify training algorithm(s), for example:

    trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]

  • random_state (None or int or RandomState) – state for a pseudo random generator

For more information on the available trainers and their parameters see this page.

partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]

Train the estimator by training the existing estimator again.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
  • sample_weight – weights for samples — array-like of shape [n_samples]
  • keep_trainer (bool) – if True, the trainer is added to the list of trainers (self.trainers) kept by the estimator
  • trainer (dict) – parameters of the training algorithm we want to use now
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This function is not supported by theanets (NotImplementedError will be thrown)

Neurolab classifier and regressor

These classes are wrappers for the Neurolab library — a neural network python library.

Warning

To make neurolab reproducible, we fix the global random seed:

numpy.random.seed(42)
class rep.estimators.neurolab.NeurolabClassifier(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]

Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Classifier

Implements a classification model from the Neurolab library.

Parameters:
  • features (list[str] or None) – features used in training
  • layers (list[int]) – sequence of the numbers of units in each hidden layer.
  • net_type (string) –

    type of the network; possible values are:

    • feed-forward
    • competing-layer
    • learning-vector
    • elman-recurrent
    • hemming-recurrent
  • initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
  • trainf – net training function; the default value depends on the type of network
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • random_state – this parameter is ignored and is added for uniformity.
  • kwargs (dict) – additional arguments to net __init__, varies with different net_types
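A construction sketch (train_data with columns 'a' and 'b' is assumed):

    from rep.estimators import NeurolabClassifier

    # feed-forward net with a single hidden layer of 10 units
    clf = NeurolabClassifier(features=['a', 'b'], layers=[10], net_type='feed-forward')
    clf.fit(train_data, train_labels)
    proba = clf.predict_proba(test_data)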
fit(X, y)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y)[source]

Additional training of the classifier.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
Returns:

self

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This is not supported by Neurolab (AttributeError will be thrown)

class rep.estimators.neurolab.NeurolabRegressor(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]

Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Regressor

Implements a regression model from the Neurolab library.

Parameters:
  • features (list[str] or None) – features used in training
  • layers (list[int]) – sequence of the numbers of units in each hidden layer.
  • net_type (string) –

    type of the network; possible values are:

    • feed-forward
    • competing-layer
    • learning-vector
    • elman-recurrent
    • hemming-recurrent
  • initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
  • trainf – net training function; the default value depends on the type of network
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • random_state – this parameter is ignored and is added for uniformity.
  • kwargs (dict) – additional arguments to net __init__, varies with different net_types
fit(X, y)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y)[source]

Additional training of the regressor.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This is not supported by Neurolab (AttributeError will be thrown)

Pybrain classifier and regressor

These classes are wrappers for the PyBrain library — a neural network python library.

Warning

pybrain training isn't reproducible (training with the same parameters produces a different neural network each time)

class rep.estimators.pybrain.PyBrainBase(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Bases: object

A base class for the estimators from the PyBrain library.

Parameters:
  • features (list[str] or None) – features used in training.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
  • verbose (bool) – print train/validation errors.
  • random_state – this parameter is ignored, since pybrain training is not reproducible

Net parameters:

Parameters:
  • layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
  • hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
  • params (dict) –

    other net parameters:

    • bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True;
    • peepholes (boolean);
    • recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Gradient descent trainer parameters:

Parameters:
  • learningrate (float) – the rate at which parameters are changed in the direction of the gradient
  • lrdecay (float) – the learning rate decay: the learning rate is multiplied by lrdecay after each training step
  • momentum (float) – the ratio by which the gradient of the last time step is used
  • batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
  • weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters:
  • etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
  • etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
  • delta (float) – step width for each weight
  • deltamin (float) – minimum step width (default=1e-6)
  • deltamax (float) – maximum step width (default=5.0)
  • delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters:
  • epochs (int) – number of iterations in training; if < 0, the estimator trains until convergence
  • max_epochs (int) – maximum number of epochs the trainer should train if it is given
  • continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
  • validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

Details about parameters here.

fit(X, y)[source]

Train a classification/regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples — array-like of shape [n_samples]
Returns:

self

partial_fit(X, y)[source]

Additional training of the classification/regression model.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
Returns:

self

set_params(**params)[source]

Set the parameters of the estimator.

Names of the parameters are the same as in the constructor.

class rep.estimators.pybrain.PyBrainClassifier(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Bases: rep.estimators.pybrain.PyBrainBase, rep.estimators.interface.Classifier

Implements a classification model from the PyBrain library.

Parameters:
  • features (list[str] or None) – features used in training.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
  • verbose (bool) – print train/validation errors.
  • random_state – this parameter is ignored, since pybrain training is not reproducible

Net parameters:

Parameters:
  • layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
  • hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
  • params (dict) –

    other net parameters:

    • bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True;
    • peepholes (boolean);
    • recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Gradient descent trainer parameters:

Parameters:
  • learningrate (float) – the rate at which parameters are changed in the direction of the gradient
  • lrdecay (float) – the learning rate decay: the learning rate is multiplied by lrdecay after each training step
  • momentum (float) – the ratio by which the gradient of the last time step is used
  • batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
  • weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters:
  • etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
  • etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
  • delta (float) – step width for each weight
  • deltamin (float) – minimum step width (default=1e-6)
  • deltamax (float) – maximum step width (default=5.0)
  • delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters:
  • epochs (int) – number of iterations in training; if < 0, the estimator trains until convergence
  • max_epochs (int) – maximum number of epochs the trainer should train if it is given
  • continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
  • validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

Details about parameters here.
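A construction sketch (train_data with columns 'a' and 'b' is assumed):

    from rep.estimators import PyBrainClassifier

    # two hidden layers; use_rprop=True would switch from gradient descent to Rprop
    clf = PyBrainClassifier(features=['a', 'b'], layers=[10, 5],
                            epochs=10, learningrate=0.01, use_rprop=False)
    clf.fit(train_data, train_labels)
    proba = clf.predict_proba(test_data)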

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X)[source]

Warning

This function is not supported for PyBrain (AttributeError will be thrown).

class rep.estimators.pybrain.PyBrainRegressor(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]

Bases: rep.estimators.pybrain.PyBrainBase, rep.estimators.interface.Regressor

Implements a regression model from the PyBrain library.

Parameters:
  • features (list[str] or None) – features used in training.
  • scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
  • use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
  • verbose (bool) – print train/validation errors.
  • random_state – this parameter is ignored, since pybrain training is not reproducible

Net parameters:

Parameters:
  • layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
  • hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
  • params (dict) –

    other net parameters:

    • bias and outputbias (boolean) flags to indicate whether the network should have the corresponding biases, both default to True;
    • peepholes (boolean);
    • recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork

Gradient descent trainer parameters:

Parameters:
  • learningrate (float) – the rate at which parameters are changed in the direction of the gradient
  • lrdecay (float) – the learning rate decay: the learning rate is multiplied by lrdecay after each training step
  • momentum (float) – the ratio by which the gradient of the last time step is used
  • batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
  • weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all

Rprop trainer parameters:

Parameters:
  • etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
  • etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
  • delta (float) – step width for each weight
  • deltamin (float) – minimum step width (default=1e-6)
  • deltamax (float) – maximum step width (default=5.0)
  • delta0 (float) – initial step width (default=0.1)

Training termination parameters

Parameters:
  • epochs (int) – number of iterations in training; if < 0, the estimator trains until convergence
  • max_epochs (int) – maximum number of epochs the trainer should train if it is given
  • continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
  • validation_proportion (float) – the ratio of the dataset that is used for the validation dataset

Note

Details about parameters here.

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X)[source]

Warning

This function is not supported for PyBrain (AttributeError will be thrown).

MatrixNet classifier and regressor

MatrixNetClassifier and MatrixNetRegressor are wrappers for the MatrixNet web service, a proprietary BDT developed at Yandex. Think of it as a specific boosted decision tree algorithm that is available as a service. At the moment MatrixNet is available only for CERN users.

To use MatrixNet, first acquire a token:
  • Go to https://yandex-apps.cern.ch/ (login with your CERN-account)

  • Click Add token at the left panel

  • Choose service MatrixNet and click Create token

  • Create ~/.rep-matrixnet.config.json file with the following content (custom path to the config file can be specified when creating a wrapper object):

    {
        "url": "https://ml.cern.yandex.net/v1",
    
        "token": "<your_token>"
    }
    
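With the config file in place, a construction sketch (train_data and train_labels are assumed; training itself runs on the remote server):

    from rep.estimators import MatrixNetClassifier

    # the token is read from ~/.rep-matrixnet.config.json by default
    mn = MatrixNetClassifier(features=['a', 'b'], iterations=100, regularization=0.01)
    mn.fit(train_data, train_labels)
    proba = mn.predict_proba(test_data)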
class rep.estimators.matrixnet.MatrixNetBase(api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

Bases: object

Base class for MatrixNetClassifier and MatrixNetRegressor.

This is a wrapper around MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters:
  • features (list[str] or None) – features used in training
  • api_config_file (str) –

    path to the file with remote api configuration in the json format:

    {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
    
  • iterations (int) – number of constructed trees (default=100)
  • regularization (float) – regularization number (default=0.01)
  • intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
  • max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
  • features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
  • training_fraction (float) – training rows bagging (default=0.5)
  • auto_stop (None or float) – error value for training pre-stopping
  • sync (bool) – synchronous or asynchronous training on the server
  • random_state (None or int or RandomState) – state for a pseudo random generator
feature_importances_

Sklearn-style feature importances, returned as a numpy.array; the 'effect' column of the MatrixNet importances is used.

get_feature_importances()[source]

Get feature importances: effect, efficiency and information characteristics.

Return type:pandas.DataFrame with index=self.features
get_iterations()[source]

Return the number of trees constructed so far during training.

Returns:int or None
resubmit()[source]

Resubmit the training process on the server in case of a failed job.

synchronize()[source]

Synchronize asynchronous training: wait until the training process finishes on the server.

training_status()[source]

Check whether training has finished on the server.

Return type:bool
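A sketch of the asynchronous workflow these methods enable (MatrixNetClassifier stands in for either estimator; train_data and train_labels are assumed):

    mn = MatrixNetClassifier(sync=False)  # fit() submits the job and returns
    mn.fit(train_data, train_labels)
    if not mn.training_status():          # True once the server has finished
        mn.synchronize()                  # block until training completes
    print(mn.get_iterations())            # number of trees constructed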
class rep.estimators.matrixnet.MatrixNetClassifier(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

Bases: rep.estimators.matrixnet.MatrixNetBase, rep.estimators.interface.Classifier

MatrixNet classification model.

This is a wrapper around MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters:
  • features (list[str] or None) – features used in training
  • api_config_file (str) –

    path to the file with remote api configuration in the json format:

    {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
    
  • iterations (int) – number of constructed trees (default=100)
  • regularization (float) – regularization number (default=0.01)
  • intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
  • max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
  • features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
  • training_fraction (float) – training rows bagging (default=0.5)
  • auto_stop (None or float) – error value for training pre-stopping
  • sync (bool) – synchronous or asynchronous training on the server
  • random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]

Train a classification model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – labels of samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict_proba(X)[source]

Predict probabilities for each class label for samples.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X, step=10)[source]

Predict probabilities for data for each class label on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (10 by default).
Returns:

iterator

class rep.estimators.matrixnet.MatrixNetRegressor(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]

Bases: rep.estimators.matrixnet.MatrixNetBase, rep.estimators.interface.Regressor

MatrixNet for regression model.

This is a wrapper around MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.

Parameters:
  • features (list[str] or None) – features used in training
  • api_config_file (str) –

    path to the file with remote api configuration in the json format:

    {"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
    
  • iterations (int) – number of constructed trees (default=100)
  • regularization (float) – regularization number (default=0.01)
  • intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
  • max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
  • features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
  • training_fraction (float) – training rows bagging (default=0.5)
  • auto_stop (None or float) – error value for training pre-stopping
  • sync (bool) – synchronous or asynchronous training on the server
  • random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]

Train a regression model on the data.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • y – values for samples, array-like of shape [n_samples]
  • sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns:

self

predict(X)[source]

Predict values for data.

Parameters:X (pandas.DataFrame) – data of shape [n_samples, n_features]
Return type:numpy.array of shape [n_samples] with predicted values
staged_predict(X, step=10)[source]

Predict values for data on each stage.

Parameters:
  • X (pandas.DataFrame) – data of shape [n_samples, n_features]
  • step (int) – step for returned iterations (10 by default).
Returns:

iterator

Examples

Classification

  • Prepare dataset
    >>> from sklearn import datasets
    >>> import pandas, numpy
    >>> from rep.utils import train_test_split
    >>> from sklearn.metrics import roc_auc_score
    >>> # iris data
    >>> iris = datasets.load_iris()
    >>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
    >>> labels = iris.target
    >>> # Take just two classes instead of three
    >>> data = data[labels != 2]
    >>> labels = labels[labels != 2]
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Sklearn classification
    >>> from rep.estimators import SklearnClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> # Using gradient boosting with default settings
    >>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
    >>> # Training classifier
    >>> sk.fit(train_data, train_labels)
    >>> pred = sk.predict_proba(test_data)
    >>> print pred
    [[  9.99842983e-01   1.57016893e-04]
     [  1.45163843e-04   9.99854836e-01]
     [  9.99842983e-01   1.57016893e-04]
     [  9.99827693e-01   1.72306607e-04], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99768518518518523
    
  • TMVA classification
    >>> from rep.estimators import TMVAClassifier
    >>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
    >>> tmva.fit(train_data, train_labels)
    >>> pred = tmva.predict_proba(test_data)
    >>> print pred
    [[  9.99991025e-01   8.97546346e-06]
     [  1.14084636e-04   9.99885915e-01]
     [  9.99991009e-01   8.99060302e-06]
     [  9.99798700e-01   2.01300452e-04], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99999999999999989
    
  • XGBoost classification
    >>> from rep.estimators import XGBoostClassifier
    >>> # XGBoost with default parameters
    >>> xgb = XGBoostClassifier(features=['a', 'b'])
    >>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
    >>> pred = xgb.predict_proba(test_data)
    >>> print pred
    [[ 0.9983651   0.00163494]
     [ 0.00170585  0.99829417]
     [ 0.99845636  0.00154361]
     [ 0.96618336  0.03381656], ..]
    >>> roc_auc_score(test_labels, pred[:, 1])
    0.99768518518518512
    

Regression

  • Prepare dataset
    >>> from sklearn import datasets
    >>> from sklearn.metrics import mean_squared_error
    >>> from rep.utils import train_test_split
    >>> import pandas, numpy
    >>> # diabetes data
    >>> diabetes = datasets.load_diabetes()
    >>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
    >>> data = pandas.DataFrame(diabetes.data, columns=features)
    >>> labels = diabetes.target
    >>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
    
  • Sklearn regression
    >>> from rep.estimators import SklearnRegressor
    >>> from sklearn.ensemble import GradientBoostingRegressor
    >>> # Using gradient boosting with default settings
    >>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
    >>> # Training regressor
    >>> sk.fit(train_data, train_labels)
    >>> pred = sk.predict(train_data)
    >>> numpy.sqrt(mean_squared_error(train_labels, pred))
    60.666009962879265
    
  • TMVA regression
    >>> from rep.estimators import TMVARegressor
    >>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
    >>> tmva.fit(train_data, train_labels)
    >>> pred = tmva.predict(test_data)
    >>> numpy.sqrt(mean_squared_error(test_labels, pred))
    73.74191838418254
    
  • XGBoost regression
    >>> from rep.estimators import XGBoostRegressor
    >>> # XGBoost with default parameters
    >>> xgb = XGBoostRegressor(features=features[:8])
    >>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
    >>> pred = xgb.predict(test_data)
    >>> numpy.sqrt(mean_squared_error(test_labels, pred))
    65.557743652940133
    

Compatible libraries

REP can deal with any library that supports the scikit-learn interface.

Examples of compatible libraries: nolearn, skflow, gplearn and hep_ml.
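For instance, a sketch of wrapping an estimator from hep_ml (import paths follow hep_ml's documentation; treat them as assumptions here):

    from rep.estimators import SklearnClassifier
    from hep_ml.losses import LogLossFunction
    from hep_ml.gradientboosting import UGradientBoostingClassifier

    # any sklearn-compatible estimator can be wrapped the same way
    base = UGradientBoostingClassifier(loss=LogLossFunction(), n_estimators=50)
    clf = SklearnClassifier(base, features=['a', 'b'])
    clf.fit(train_data, train_labels)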