Estimators (classification and regression)
This module contains wrappers with an sklearn interface for different machine learning libraries:
 scikit-learn
 TMVA
 XGBoost
 pybrain
 neurolab
 theanets
REP defines a common interface for classifier and regressor wrappers, so wrappers for other libraries can be added by following the same interface. Notably, the interface is backward compatible with the scikit-learn library.
Estimators interfaces (for classification and regression)
REP wrappers are derived from Classifier or Regressor, depending on the problem of interest.
Below you can see the standard methods available in the wrappers.

class
rep.estimators.interface.
Classifier
(features=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.ClassifierMixin
Interface to train different classification models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...
Parameters: features (list[str] or None) – features used to train a model
Note
 If features is not set (None), all features in the training dataset are used.
 Datasets should be pandas.DataFrame rather than numpy.array. With a DataFrame you can choose the features used in training by setting e.g. features=['mass', 'momentum'] in the constructor.
 numpy.array input works as well, but in this case all features are used.
 Class labels must be integers from 0 to n_classes - 1.
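The label requirement can be met by remapping arbitrary class values before training. The following is a minimal sketch in plain Python; the helper name encode_labels is hypothetical and is not part of REP.

```python
def encode_labels(labels):
    """Map arbitrary class values to integers 0 .. n_classes - 1.

    Returns the encoded labels and the sorted list of original classes,
    so that classes[code] recovers the original value.
    """
    classes = sorted(set(labels))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in labels], classes


encoded, classes = encode_labels(["signal", "bkg", "signal", "other"])
print(encoded)   # integer codes in the range 0 .. len(classes) - 1
print(classes)   # original class values, indexed by code
```

The returned classes list can be kept alongside the model to translate predicted codes back to the original values.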

fit
(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self

fit_lds
(lds)[source]¶ Train a classifier on the specific type of dataset.
Parameters: lds (LabeledDataStorage) – data Returns: self

get_feature_importances
()[source]¶ Return feature importances.
Return type: pandas.DataFrame with index=self.features

get_params
(deep=True)¶ Get parameters for this estimator.
Parameters: deep (bool, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params (mapping of string to any) – parameter names mapped to their values.

predict
(X)[source]¶ Predict labels for all samples in the dataset.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with integer labels

predict_proba
(X)[source]¶ Predict probabilities for each class label for samples.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples, n_classes] with probabilities
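Predicted labels are commonly obtained from predicted probabilities by taking the most probable class for each sample. The sketch below shows that relation in plain Python, assuming class labels are the column indices 0 .. n_classes - 1; labels_from_proba is a hypothetical helper, not a REP function, and REP's own predict may differ in details.

```python
def labels_from_proba(proba):
    """Pick, for every sample, the index of the most probable class.

    proba is a list of rows, one per sample, each of length n_classes.
    Ties are resolved in favour of the lowest class index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in proba]


proba = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = labels_from_proba(proba)
print(labels)
```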

score
(X, y, sample_weight=None)¶ Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that the entire label set of each sample be correctly predicted.
Parameters:  X (array-like of shape (n_samples, n_features)) – test samples
 y (array-like of shape (n_samples) or (n_samples, n_outputs)) – true labels for X
 sample_weight (array-like of shape [n_samples], optional) – sample weights
Returns: score (float) – mean accuracy of self.predict(X) with respect to y
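As an illustration of what the mean accuracy measures, the sketch below computes it in plain Python, with optional sample weights. The function name weighted_accuracy is hypothetical; it mirrors the usual definition (weighted fraction of correct predictions) rather than reproducing the library code.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correctly predicted samples, optionally weighted."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    # total weight of the samples whose prediction matches the true label
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
acc = weighted_accuracy(y_true, y_pred)
acc_w = weighted_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 1])
print(acc, acc_w)
```

Giving the misclassified sample a larger weight lowers the score, which is the intended effect of sample_weight.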

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object.
Returns: self
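The <component>__<parameter> convention can be pictured as splitting each key at the first double underscore and forwarding the remainder to the named component. The toy class below is a sketch of that routing only; Component is a hypothetical name and this is not the scikit-learn implementation.

```python
class Component:
    """Toy object with sklearn-style set_params (illustration only)."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        nested = {}
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if sep:
                # "<component>__<parameter>": collect for the named component
                nested.setdefault(name, {})[sub_key] = value
            else:
                # plain parameter of this object
                self.params[name] = value
        # forward the collected parameters to each nested component
        for name, sub_params in nested.items():
            self.params[name].set_params(**sub_params)
        return self


pipeline = Component(scaler=Component(with_mean=True), clf=Component(depth=3))
pipeline.set_params(clf__depth=5, scaler__with_mean=False)
print(pipeline.params["clf"].params)
```

Because set_params returns self, calls can be chained, for example pipeline.set_params(clf__depth=5).set_params(clf__depth=7).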

staged_predict_proba
(X)[source]¶ Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: iterator

test_on
(X, y, sample_weight=None)[source]¶ Prepare classification report for a single classifier.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: ClassificationReport

test_on_lds
(lds)[source]¶ Prepare a classification report for a single classifier.
Parameters: lds (LabeledDataStorage) – data Returns: ClassificationReport

class
rep.estimators.interface.
Regressor
(features=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.RegressorMixin
Interface to train different regression models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...
Parameters: features (list[str] or None) – features used to train a model
Note
 If features is not set (None), all features in the training dataset are used.
 Datasets should be pandas.DataFrame rather than numpy.array. With a DataFrame you can choose the features used in training by setting e.g. features=['mass', 'momentum'] in the constructor.
 numpy.array input works as well, but in this case all features are used.

fit
(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self

fit_lds
(lds)[source]¶ Train a regression model on the specific type of dataset.
Parameters: lds (LabeledDataStorage) – data Returns: self

get_feature_importances
()[source]¶ Get feature importances.
Return type: pandas.DataFrame with index=self.features

get_params
(deep=True)¶ Get parameters for this estimator.
Parameters: deep (bool, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params (mapping of string to any) – parameter names mapped to their values.

predict
(X)[source]¶ Predict values for data.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with predicted values

score
(X, y, sample_weight=None)¶ Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
Parameters:  X (array-like of shape (n_samples, n_features)) – test samples
 y (array-like of shape (n_samples) or (n_samples, n_outputs)) – true values for X
 sample_weight (array-like of shape [n_samples], optional) – sample weights
Returns: score (float) – R^2 of self.predict(X) with respect to y
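The R^2 definition can be checked numerically. In the plain-Python sketch below, u is the sum of squared differences between true and predicted values and v is the sum of squared deviations of the true values from their mean; r2_score here is a local illustrative function (unweighted, single output), not the library routine.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - u / v."""
    mean = sum(y_true) / len(y_true)
    # u: squared errors of the prediction
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: squared deviations of the true values from their mean
    v = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)                 # exact predictions
constant = r2_score(y_true, [2.5] * 4)             # always predicts the mean
worse = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])     # reversed predictions
print(perfect, constant, worse)
```

The three cases correspond to the statements in the text: a perfect model scores 1.0, the constant mean predictor scores 0.0, and a model worse than the constant predictor scores below zero.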

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object.
Returns: self

staged_predict
(X)[source]¶ Predicts values for data on each stage (i.e. for boosting algorithms).
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: iterator

test_on
(X, y, sample_weight=None)[source]¶ Prepare a regression report for a single regressor.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values of samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: RegressionReport

test_on_lds
(lds)[source]¶ Prepare a regression report for a single regressor.
Parameters: lds (LabeledDataStorage) – data Returns: RegressionReport
Sklearn classifier and regressor
SklearnClassifier and SklearnRegressor are wrappers for algorithms from scikit-learn.
From the user's perspective, a wrapped sklearn model behaves in the same way as a non-wrapped one, but has one additional parameter, features, which selects the columns used in training.
Typically, models from REP are used with pandas.DataFrame, which makes it possible to select the needed variables by name or to give some variables a specific role in the training.
If the data is a numpy.array, the behaviour is the same as in sklearn.
For the complete list of available algorithms, see the sklearn API.
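The idea of a wrapper that selects named columns and then delegates to the wrapped model can be sketched without any dependencies. Everything below is hypothetical illustration (ColumnSelectingWrapper, RecordingModel, and the dict-of-columns data layout standing in for a DataFrame); it is not how REP implements SklearnClassifier.

```python
class ColumnSelectingWrapper:
    """Toy wrapper: keeps only the named columns, then delegates to a model.

    Data is a dict mapping column name -> list of values.
    """

    def __init__(self, model, features=None):
        self.model = model
        self.features = features

    def _select(self, data):
        # with features=None every column is used, as described above
        names = self.features if self.features is not None else list(data)
        # build rows from the selected columns, in the requested order
        return [list(row) for row in zip(*(data[name] for name in names))]

    def fit(self, data, y):
        self.model.fit(self._select(data), y)
        return self


class RecordingModel:
    """Stand-in for a real estimator: remembers what it was trained on."""

    def fit(self, rows, y):
        self.rows_ = rows
        return self


data = {"mass": [1.0, 2.0], "momentum": [10.0, 20.0], "event_id": [7, 8]}
wrapped = ColumnSelectingWrapper(RecordingModel(), features=["mass", "momentum"])
wrapped.fit(data, [0, 1])
print(wrapped.model.rows_)
```

The event_id column stays in the dataset but never reaches the model, which is the practical benefit of naming features in the constructor.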

class
rep.estimators.sklearn.
SklearnClassifier
(clf, features=None)[source]¶ Bases:
rep.estimators.sklearn.SklearnBase
,rep.estimators.interface.Classifier
SklearnClassifier is a wrapper over sklearn-like classifiers.
Parameters:  clf (sklearn.BaseEstimator) – classifier to train. Should be sklearn-compatible.
 features (list[str] or None) – features used in training

fit
(X, y, sample_weight=None, **kwargs)[source]¶ Train the classifier.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – target of training, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
Note
If the sklearn classifier doesn’t support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.

predict
(X)[source]¶ Predict labels for all samples in the dataset.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with integer labels

class
rep.estimators.sklearn.
SklearnRegressor
(clf, features=None)[source]¶ Bases:
rep.estimators.sklearn.SklearnBase
,rep.estimators.interface.Regressor
SklearnRegressor is a wrapper over sklearn-like regressors.
Parameters:  clf (sklearn.BaseEstimator) – regressor to train. Should be sklearn-compatible.
 features (list[str] or None) – features used in training

fit
(X, y, sample_weight=None, **kwargs)[source]¶ Train the regressor.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – target of training, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
Note
If the sklearn regressor doesn’t support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.
TMVA classifier and regressor
These classes are wrappers for TMVA, a machine learning library from the physics community. TMVA is a C++ library that works with .root format files; with these wrappers it can be used directly from Python. TMVA contains classification and regression algorithms, including neural networks. See the TMVA guide for the list of available algorithms and parameters.

class
rep.estimators.tmva.
TMVAClassifier
(method='kBDT', features=None, factory_options='', sigmoid_function='bdt', **method_parameters)[source]¶ Bases:
rep.estimators.tmva.TMVABase
,rep.estimators.interface.Classifier
Implements classification models from the TMVA library, the CERN library for machine learning.
Parameters:  method (str) – algorithm method (default='kBDT')
 features (list[str] or None) – features used in training
 factory_options (str) –
system options, including data transformations before training, for example:
"!V:!Silent:Color:Transformations=I;D;P;G,D"
 sigmoid_function (str) –
function used to convert TMVA output to probabilities:
 identity (use for svm, mlp): do not transform the output; use this value for methods returning class probabilities
 sigmoid: sigmoid transformation; use it if the output varies in the range [-infinity, +infinity]
 bdt: for the BDT algorithms, whose output varies in the range [-1, 1]
 sig_eff=0.4: for the rectangular cut optimization methods; here, for instance, 0.4 is used as the signal efficiency to evaluate the MVA (any float from [0, 1] can be used)
 method_parameters (dict) – classifier options, for example: NTrees=100, BoostType='Grad'.
Warning
TMVA doesn’t support staged_predict_proba() and feature_importances_.
TMVA doesn’t support multi-class classification, only two-class classification.
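To make the role of sigmoid_function concrete, the sketch below maps a raw output value to a signal probability for three of the options. The formulas are assumptions chosen to match the stated output ranges (a logistic function for unbounded output, a linear rescaling from [-1, 1] to [0, 1] for BDT output); the text does not give the exact expressions used by the wrapper, and to_probability is a hypothetical name.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def to_probability(output, sigmoid_function="bdt"):
    """Map a raw classifier output to a signal probability (assumed formulas)."""
    if sigmoid_function == "identity":
        return output                      # output is already a probability
    if sigmoid_function == "sigmoid":
        return _sigmoid(output)            # unbounded output -> (0, 1)
    if sigmoid_function == "bdt":
        return (output + 1.0) / 2.0        # [-1, 1] -> [0, 1] (assumption)
    raise ValueError("unsupported sigmoid_function: %r" % (sigmoid_function,))


print(to_probability(0.0, "sigmoid"))
print(to_probability(-1.0, "bdt"), to_probability(1.0, "bdt"))
```

The sig_eff option is left out of the sketch because its conversion depends on the rectangular-cut method itself.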

fit
(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self

get_params
(deep=True)[source]¶ Get parameters for this estimator.
Returns: dict, parameter names mapped to their values.

predict_proba
(X)[source]¶ Predict probabilities for each class label for samples.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples, n_classes] with probabilities

class
rep.estimators.tmva.
TMVARegressor
(method='kBDT', features=None, factory_options='', **method_parameters)[source]¶ Bases:
rep.estimators.tmva.TMVABase
,rep.estimators.interface.Regressor
Implements regression models from the TMVA library, the CERN library for machine learning.
Parameters:  method (str) – algorithm method (default='kBDT')
 features (list[str] or None) – features used in training
 factory_options (str) –
system options, including data transformations before training, for example:
"!V:!Silent:Color:Transformations=I;D;P;G,D"
 method_parameters (dict) – regressor options, for example: NTrees=100, BoostType='Grad'
Warning
TMVA doesn’t support staged_predict() and feature_importances_.

fit
(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self

get_params
(deep=True)[source]¶ Get parameters for this estimator.
Returns: dict, parameter names mapped to their values.

predict
(X)[source]¶ Predict values for data.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with predicted values
XGBoost classifier and regressor
These classes are wrappers for the XGBoost library.

class
rep.estimators.xgboost.
XGBoostBase
(n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]¶ Bases:
object
A base class for the XGBoostClassifier and XGBoostRegressor. XGBoost tree booster is used.
Parameters:  n_estimators (int) – number of trees built.
 nthreads (int) – number of parallel threads used to run XGBoost.
 num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
 gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
 eta (float) – (or learning rate) step size shrinkage used in the update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
 max_depth (int) – maximum depth of a tree.
 scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
 min_child_weight (float) –
minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.
Note
Weights are normalized so that their mean is 1 before fitting, so min_child_weight roughly corresponds to a number of events.
 subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting.
 colsample (float) – subsample ratio of columns when constructing each tree.
 base_score (float) – the initial prediction score of all instances, global bias.
 random_state (None or int or RandomState) – state for a pseudo random generator
 verbose (bool) – if 1, will print messages during training
 missing (float) – the number considered by XGBoost as missing value.
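The note on weight normalization can be illustrated with a short sketch: after rescaling to mean 1, the sum of the weights equals the number of samples, which is why min_child_weight can be read roughly as an event count. The helper normalize_weights is a hypothetical name for illustration, not a REP function.

```python
def normalize_weights(sample_weight):
    """Rescale weights so that their mean equals 1."""
    mean = sum(sample_weight) / len(sample_weight)
    return [w / mean for w in sample_weight]


weights = normalize_weights([2.0, 4.0, 6.0])
print(weights)
print(sum(weights))  # equals the number of samples after normalization
```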

feature_importances_
¶ Sklearn-style way of returning feature importances. They are returned as a numpy.array, assuming that train_features=None was passed initially.

class
rep.estimators.xgboost.
XGBoostClassifier
(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]¶ Bases:
rep.estimators.xgboost.XGBoostBase
,rep.estimators.interface.Classifier
Implements a classification model from the XGBoost library. The XGBoost tree booster is used.
Parameters:  n_estimators (int) – number of trees built.
 nthreads (int) – number of parallel threads used to run XGBoost.
 num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
 gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
 eta (float) – (or learning rate) step size shrinkage used in the update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
 max_depth (int) – maximum depth of a tree.
 scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
 min_child_weight (float) –
minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.
Note
Weights are normalized so that their mean is 1 before fitting, so min_child_weight roughly corresponds to a number of events.
 subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting.
 colsample (float) – subsample ratio of columns when constructing each tree.
 base_score (float) – the initial prediction score of all instances, global bias.
 random_state (None or int or RandomState) – state for a pseudo random generator
 verbose (bool) – if 1, will print messages during training
 missing (float) – the number considered by XGBoost as missing value.

fit
(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self

predict_proba
(X)[source]¶ Predict probabilities for each class label for samples.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples, n_classes] with probabilities

staged_predict_proba
(X, step=None)[source]¶ Predict probabilities for data for each class label on each stage.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 step (int) – step for returned iterations (None by default). XGBoost does not implement staged prediction, so predictions are recomputed from the beginning for each stage. If None is passed, step is chosen to give 10 points in the learning curve.
Returns: iterator
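Since every stage is predicted from scratch, the step controls how many full predictions are made. The sketch below shows one way a default step could be derived to give about 10 points; the rule max(n_estimators // 10, 1) is an assumption for illustration (the text only states the target of 10 points), and staged_iterations is a hypothetical helper.

```python
def staged_iterations(n_estimators, step=None, n_points=10):
    """Numbers of trees at which staged predictions would be evaluated."""
    if step is None:
        # assumed rule: aim for about n_points evaluations
        step = max(n_estimators // n_points, 1)
    return list(range(step, n_estimators + 1, step))


stages = staged_iterations(100)
print(stages)
print(staged_iterations(100, step=25))
```

A larger step gives a coarser learning curve but fewer repeated predictions.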

class
rep.estimators.xgboost.
XGBoostRegressor
(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, min_child_weight=1.0, subsample=1.0, colsample=1.0, objective_type='linear', base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]¶ Bases:
rep.estimators.xgboost.XGBoostBase
,rep.estimators.interface.Regressor
Implements a regression model from the XGBoost library. The XGBoost tree booster is used.
Parameters:  n_estimators (int) – number of trees built.
 nthreads (int) – number of parallel threads used to run XGBoost.
 num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
 gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
 eta (float) – (or learning rate) step size shrinkage used in the update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
 max_depth (int) – maximum depth of a tree.
 scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
 min_child_weight (float) –
minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.
Note
Weights are normalized so that their mean is 1 before fitting, so min_child_weight roughly corresponds to a number of events.
 subsample (float) – subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting.
 colsample (float) – subsample ratio of columns when constructing each tree.
 base_score (float) – the initial prediction score of all instances, global bias.
 random_state (None or int or RandomState) – state for a pseudo random generator
 verbose (bool) – if 1, will print messages during training
 missing (float) – the number considered by XGBoost as missing value.

fit
(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self

predict
(X)[source]¶ Predict values for data.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with predicted values

staged_predict
(X, step=None)[source]¶ Predicts values for data on each stage.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 step (int) – step for returned iterations (None by default). XGBoost does not implement staged prediction, so predictions are recomputed from the beginning for each stage. If None is passed, step is chosen to give 10 points in the learning curve.
Returns: iterator
Theanets classifier and regressor
These classes are wrappers for the theanets library, a neural network library for Python.

class
rep.estimators.theanets.
TheanetsBase
(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]¶ Bases:
object
A base class for the estimators from Theanets library.
Parameters:  features (None or list(str)) – list of features to train model
 layers (sequence of int, tuple, dict) – a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
 input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
 output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
 hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
 output_activation (str) – the name of an activation function to use on the output layer by default
 input_noise (float) – standard deviation of desired noise to inject into input
 hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
 input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
 hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
 decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 trainers (list[dict] or None) –
parameters to specify training algorithm(s), for example:
trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
 random_state (None or int or RandomState) – state for a pseudo random generator
For more information on the available trainers and their parameters see this page.
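The relation between layers, input_layer and output_layer can be sketched as assembling one complete layer list, since theanets expects the input and output sizes in the same sequence as the hidden layers. The sketch assumes that the sentinel value -1 means "take the size from the training data"; full_layer_sizes and its arguments n_features and n_outputs are hypothetical names, and the wrapper's real logic may differ.

```python
AUTO = -1  # assumed sentinel: size is inferred from the training dataset


def full_layer_sizes(hidden_layers, n_features, n_outputs,
                     input_layer=AUTO, output_layer=AUTO):
    """Build the complete layer list: input, hidden layers, output."""
    first = n_features if input_layer == AUTO else input_layer
    last = n_outputs if output_layer == AUTO else output_layer
    return [first] + list(hidden_layers) + [last]


sizes = full_layer_sizes((10,), n_features=4, n_outputs=2)
print(sizes)
```

With the default layers=(10, ), a dataset of 4 features and 2 classes would give a network with one hidden layer of 10 units between a 4-unit input and a 2-unit output.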

fit
(X, y, sample_weight=None)[source]¶ Train a classification/regression model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weights for samples, array-like of shape [n_samples]
Returns: self

partial_fit
(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]¶ Train the estimator by training the existing estimator again.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weights for samples, array-like of shape [n_samples]
 keep_trainer (bool) – set to True if the trainer is not already stored in self.trainers; if True, the trainer is added to that list.
 trainer (dict) – parameters of the training algorithm we want to use now
Returns: self

set_params
(**params)[source]¶ Set the parameters of the estimator. Deep parameters of trainers and scaler can be accessed, for instance:
 trainers__0 = {'algo': 'sgd', 'learning_rate': 0.3}
 trainers__0_algo = 'sgd'
 layers__1 = 14
 scaler__use_std = True
Parameters: params (dict) – parameters to set in the model

class
rep.estimators.theanets.
TheanetsClassifier
(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]¶ Bases:
rep.estimators.theanets.TheanetsBase
,rep.estimators.interface.Classifier
Implements a classification model from the Theanets library.
Parameters:  features (None or list(str)) – list of features to train model
 layers (sequence of int, tuple, dict) –
a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
 input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
 output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
 hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
 output_activation (str) – the name of an activation function to use on the output layer by default
 input_noise (float) – standard deviation of desired noise to inject into input
 hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
 input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
 hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
 decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 trainers (list[dict] or None) –
parameters to specify training algorithm(s), for example:
trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
 random_state (None or int or RandomState) – state for a pseudo random generator
For more information on the available trainers and their parameters see this page.

partial_fit
(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]¶ Train the estimator by training the existing estimator again.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weights for samples, array-like of shape [n_samples]
 keep_trainer (bool) – set to True if the trainer is not already stored in self.trainers; if True, the trainer is added to that list.
 trainer (dict) – parameters of the training algorithm we want to use now
Returns: self

class
rep.estimators.theanets.
TheanetsRegressor
(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]¶ Bases:
rep.estimators.theanets.TheanetsBase
,rep.estimators.interface.Regressor
Implements a regression model from the Theanets library.
Parameters:  features (None or list(str)) – list of features to train model
 layers (sequence of int, tuple, dict) –
a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
 input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
 output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
 hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
 output_activation (str) – the name of an activation function to use on the output layer by default
 input_noise (float) – standard deviation of desired noise to inject into input
 hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
 input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
 hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
 decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 trainers (list[dict] or None) –
parameters to specify training algorithm(s), for example:
trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
 random_state (None or int or RandomState) – state for a pseudo random generator
For more information on the available trainers and their parameters see this page.

partial_fit
(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]¶ Train the estimator by training the existing estimator again.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weights for samples, array-like of shape [n_samples]
 keep_trainer (bool) – set to True if the trainer is not already stored in self.trainers; if True, the trainer is added to that list.
 trainer (dict) – parameters of the training algorithm we want to use now
Returns: self
Neurolab classifier and regressor¶
These classes are wrappers for Neurolab, a neural network library for Python.
Warning
To make neurolab training reproducible, the wrapper changes the global random seed:
numpy.random.seed(42)
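Because the seed is global, other code that relies on numpy's global generator may be affected. A minimal sketch of one way to shield surrounding code follows, assuming numpy is installed; the context manager name preserved_numpy_random_state is hypothetical and is not part of REP.

```python
import contextlib

import numpy


@contextlib.contextmanager
def preserved_numpy_random_state():
    """Save numpy's global random state and restore it when the block exits."""
    state = numpy.random.get_state()
    try:
        yield
    finally:
        numpy.random.set_state(state)


numpy.random.seed(7)
state_before = numpy.random.get_state()

with preserved_numpy_random_state():
    # Stand-in for code that reseeds the global generator, as the wrapper does.
    numpy.random.seed(42)
    numpy.random.rand(3)

# The global generator continues as if the block above had never run.
value_after_block = numpy.random.rand()
```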

class
rep.estimators.neurolab.
NeurolabClassifier
(features=None, layers=(10, ), net_type='feedforward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]¶ Bases:
rep.estimators.neurolab.NeurolabBase
,rep.estimators.interface.Classifier
Implements a classification model from the Neurolab library.
Parameters:  features (list[str] or None) – features used in training
 layers (list[int]) – sequence, number of units inside each hidden layer.
 net_type (string) –
type of the network; possible values are:
 feedforward
 competinglayer
 learningvector
 elmanrecurrent
 hemmingrecurrent
 initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
 trainf – net training function; default value depends on the type of a network
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 random_state – this parameter is ignored and is added for uniformity.
 kwargs (dict) – additional arguments to net __init__, varies with different net_types

fit
(X, y)[source]¶ Train a classification model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of samples, array-like of shape [n_samples]
Returns: self

partial_fit
(X, y)[source]¶ Additional training of the classifier.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of samples, array-like of shape [n_samples]
Returns: self

class
rep.estimators.neurolab.
NeurolabRegressor
(features=None, layers=(10, ), net_type='feedforward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]¶ Bases:
rep.estimators.neurolab.NeurolabBase
,rep.estimators.interface.Regressor
Implements a regression model from the Neurolab library.
Parameters:  features (list[str] or None) – features used in training
 layers (list[int]) – sequence, number of units inside each hidden layer.
 net_type (string) –
type of the network; possible values are:
 feedforward
 competinglayer
 learningvector
 elmanrecurrent
 hemmingrecurrent
 initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
 trainf – net training function; default value depends on the type of a network
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 random_state – this parameter is ignored and is added for uniformity.
 kwargs (dict) – additional arguments to net __init__, varies with different net_types

fit
(X, y)[source]¶ Train a regression model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
Returns: self

partial_fit
(X, y)[source]¶ Additional training of the regressor.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
Returns: self
Pybrain classifier and regressor¶
These classes are wrappers for PyBrain, a neural network library for Python.
Warning
PyBrain training isn’t reproducible (training with the same parameters produces a different neural network each time)

class
rep.estimators.pybrain.
PyBrainBase
(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]¶ Bases:
object
A base class for estimators from the PyBrain library.
Parameters:  features (list[str] or None) – features used in training.
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
 verbose (bool) – print train/validation errors.
 random_state – this parameter is ignored; PyBrain training is not reproducible
Net parameters:
Parameters:  layers (list[int]) – number of neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
 hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
 params (dict) –
other net parameters:
 bias and outputbias (boolean): flags to indicate whether the network should have the corresponding biases, both default to True;
 peepholes (boolean);
 recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork
Gradient descent trainer parameters:
Parameters:  learningrate (float) – the ratio by which parameters are changed in the direction of the gradient
 lrdecay (float) – factor by which the learning rate is multiplied after each training step
 momentum (float) – the ratio by which the gradient of the last time step is used
 batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
 weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all
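How learningrate, lrdecay and momentum interact can be sketched on a one-dimensional problem. This is a generic gradient descent loop written for illustration, not PyBrain's trainer; the function name sgd_with_momentum and the quadratic objective are assumptions, and weightdecay and batchlearning are left out.

```python
def sgd_with_momentum(grad, w0, learningrate=0.01, lrdecay=1.0,
                      momentum=0.0, steps=100):
    """Minimise a 1-D function given its gradient `grad`."""
    w = w0
    velocity = 0.0
    rate = learningrate
    for _ in range(steps):
        # Keep a fraction `momentum` of the previous update,
        # then step against the current gradient.
        velocity = momentum * velocity - rate * grad(w)
        w += velocity
        # The learning rate is multiplied by lrdecay after every step.
        rate *= lrdecay
    return w


def quadratic_gradient(w):
    """Gradient of f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_final = sgd_with_momentum(quadratic_gradient, w0=0.0, learningrate=0.1,
                            lrdecay=0.99, momentum=0.5, steps=200)
```

With lrdecay=1.0 the learning rate stays constant, which matches the default shown in the signature.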
Rprop trainer parameters:
Parameters:  etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
 etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
 delta (float) – step width for each weight
 deltamin (float) – minimum step width (default=1e-6)
 deltamax (float) – maximum step width (default=0.5)
 delta0 (float) – initial step width (default=0.1)
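The role of the Rprop parameters can be illustrated with a simplified sign-based update for a single weight. This sketch follows one common variant (Rprop without weight backtracking) and may differ in detail from the trainer PyBrain uses; the function name rprop_minimize and the quadratic objective are assumptions.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def rprop_minimize(grad, w0, etaminus=0.5, etaplus=1.2,
                   deltamin=1e-6, deltamax=0.5, delta0=0.1, steps=100):
    """Minimise a 1-D function using only the sign of its gradient."""
    w = w0
    delta = delta0          # current step width
    prev_grad = 0.0
    for _ in range(steps):
        g = grad(w)
        if g * prev_grad > 0:
            # Same direction as before: widen the step, capped at deltamax.
            delta = min(delta * etaplus, deltamax)
        elif g * prev_grad < 0:
            # Sign flipped, so the minimum was overstepped: shrink the step.
            delta = max(delta * etaminus, deltamin)
        w -= sign(g) * delta
        prev_grad = g
    return w


def quadratic_gradient(w):
    """Gradient of f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_final = rprop_minimize(quadratic_gradient, w0=0.0)
```

Only the sign of the gradient is used, so the step width is controlled entirely by delta0, the two eta factors and the deltamin/deltamax bounds.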
Training termination parameters
Parameters:  epochs (int) – number of iterations in training; if < 0 then the estimator trains until convergence
 max_epochs (int) – maximum number of epochs the trainer should train if it is given
 continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
 validation_proportion (float) – the ratio of the dataset that is used for the validation dataset
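The interplay of max_epochs and continue_epochs can be sketched as a patience-style stopping rule. The loop below works on a list of simulated validation errors rather than a real network, and it is an illustration of the idea, not PyBrain's code; the function name train_until_convergence and the error values are assumptions.

```python
def train_until_convergence(validation_errors, max_epochs=None,
                            continue_epochs=3):
    """Return (best_epoch, epochs_run) for a sequence of validation errors."""
    best_error = float("inf")
    best_epoch = None
    epochs_run = 0
    for epoch, error in enumerate(validation_errors):
        if max_epochs is not None and epoch >= max_epochs:
            break
        epochs_run += 1
        if error < best_error:
            # New minimum of the validation error: remember it.
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= continue_epochs:
            # No improvement for continue_epochs epochs: stop training.
            break
    return best_epoch, epochs_run


# Simulated validation errors, one per epoch (hypothetical numbers).
errors = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5, 0.4]
best_epoch, epochs_run = train_until_convergence(errors, continue_epochs=3)
```

With continue_epochs=3 the loop gives up before reaching the later, better epochs; a larger value would let it continue, at the cost of more training time.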
Note
Details about parameters here.

fit
(X, y)[source]¶ Train a classification/regression model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
Returns: self

class
rep.estimators.pybrain.
PyBrainClassifier
(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]¶ Bases:
rep.estimators.pybrain.PyBrainBase
,rep.estimators.interface.Classifier
Implements a classification model from the PyBrain library.
Parameters:  features (list[str] or None) – features used in training.
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
 verbose (bool) – print train/validation errors.
 random_state – this parameter is ignored; PyBrain training is not reproducible
Net parameters:
Parameters:  layers (list[int]) – number of neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
 hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
 params (dict) –
other net parameters:
 bias and outputbias (boolean): flags to indicate whether the network should have the corresponding biases, both default to True;
 peepholes (boolean);
 recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork
Gradient descent trainer parameters:
Parameters:  learningrate (float) – the ratio by which parameters are changed in the direction of the gradient
 lrdecay (float) – factor by which the learning rate is multiplied after each training step
 momentum (float) – the ratio by which the gradient of the last time step is used
 batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
 weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all
Rprop trainer parameters:
Parameters:  etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
 etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
 delta (float) – step width for each weight
 deltamin (float) – minimum step width (default=1e-6)
 deltamax (float) – maximum step width (default=0.5)
 delta0 (float) – initial step width (default=0.1)
Training termination parameters
Parameters:  epochs (int) – number of iterations in training; if < 0 then the estimator trains until convergence
 max_epochs (int) – maximum number of epochs the trainer should train if it is given
 continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
 validation_proportion (float) – the ratio of the dataset that is used for the validation dataset
Note
Details about parameters here.

class
rep.estimators.pybrain.
PyBrainRegressor
(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]¶ Bases:
rep.estimators.pybrain.PyBrainBase
,rep.estimators.interface.Regressor
Implements a regression model from the PyBrain library.
Parameters:  features (list[str] or None) – features used in training.
 scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
 use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
 verbose (bool) – print train/validation errors.
 random_state – this parameter is ignored; PyBrain training is not reproducible
Net parameters:
Parameters:  layers (list[int]) – number of neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
 hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
 params (dict) –
other net parameters:
 bias and outputbias (boolean): flags to indicate whether the network should have the corresponding biases, both default to True;
 peepholes (boolean);
 recurrent (boolean): if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork
Gradient descent trainer parameters:
Parameters:  learningrate (float) – the ratio by which parameters are changed in the direction of the gradient
 lrdecay (float) – factor by which the learning rate is multiplied after each training step
 momentum (float) – the ratio by which the gradient of the last time step is used
 batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
 weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all
Rprop trainer parameters:
Parameters:  etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
 etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
 delta (float) – step width for each weight
 deltamin (float) – minimum step width (default=1e-6)
 deltamax (float) – maximum step width (default=0.5)
 delta0 (float) – initial step width (default=0.1)
Training termination parameters
Parameters:  epochs (int) – number of iterations in training; if < 0 then the estimator trains until convergence
 max_epochs (int) – maximum number of epochs the trainer should train if it is given
 continue_epochs (int) – each time validation error decreases, try for continue_epochs epochs to find a better one
 validation_proportion (float) – the ratio of the dataset that is used for the validation dataset
Note
Details about parameters here.
MatrixNet classifier and regressor¶
MatrixNetClassifier
and MatrixNetRegressor
are wrappers for the MatrixNet web service, a proprietary BDT
developed at Yandex. Think of it as a specific Boosted Decision Tree algorithm that is available as a service.
At the moment MatrixNet is available only to CERN users.
To use MatrixNet, first acquire a token:
Go to https://yandexapps.cern.ch/ (log in with your CERN account)
Click Add token in the left panel
Choose the MatrixNet service and click Create token
Create a ~/.repmatrixnet.config.json file with the following content (a custom path to the config file can be specified when creating a wrapper object):
{ "url": "https://ml.cern.yandex.net/v1", "token": "<your_token>" }
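The config file can also be written from Python. The sketch below uses only the standard library; the helper name write_matrixnet_config is hypothetical, the file is written to a temporary directory for demonstration, and restricting the file permissions is an optional precaution added here, not something the wrapper requires.

```python
import json
import os
import tempfile


def write_matrixnet_config(path, token, url="https://ml.cern.yandex.net/v1"):
    """Write a MatrixNet API config file in the JSON format shown above."""
    config = {"url": url, "token": token}
    with open(path, "w") as config_file:
        json.dump(config, config_file, indent=2)
    # The token is a secret, so make the file readable by its owner only.
    os.chmod(path, 0o600)
    return path


# Demonstration path; in real use this would be the path given to the wrapper.
demo_path = os.path.join(tempfile.mkdtemp(), "matrixnet.config.json")
write_matrixnet_config(demo_path, token="<your_token>")

with open(demo_path) as config_file:
    loaded = json.load(config_file)
```

The resulting path can then be passed as api_config_file when creating a wrapper object.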

class
rep.estimators.matrixnet.
MatrixNetBase
(api_config_file='$HOME/.repmatrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]¶ Bases:
object
Base class for MatrixNetClassifier and MatrixNetRegressor.
This is a wrapper around the MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.
Parameters:  features (list[str] or None) – features used in training
 api_config_file (str) –
path to the file with remote api configuration in the json format:
{"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
 iterations (int) – number of constructed trees (default=100)
 regularization (float) – regularization number (default=0.01)
 intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
 max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
 features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
 training_fraction (float) – training rows bagging (default=0.5)
 auto_stop (None or float) – error value for training pre-stopping
 sync (bool) – synchronous or asynchronous training on the server
 random_state (None or int or RandomState) – state for a pseudo random generator
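The dict form of intervals maps each feature name to a list of borders. The general idea of discretizing a value by such borders can be sketched with the standard library. How MatrixNet itself treats values that fall exactly on a border is not documented here, so the convention below is an assumption, as are the helper name discretize and the border values.

```python
import bisect


def discretize(value, borders):
    """Return the index of the bin that `value` falls into.

    `borders` must be sorted; n borders define n + 1 bins.
    """
    return bisect.bisect_right(borders, value)


# Hypothetical borders for two features.
intervals = {"mass": [1.0, 2.5, 4.0], "momentum": [10.0, 50.0]}
sample = {"mass": 3.1, "momentum": 5.0}

bins = {name: discretize(sample[name], borders)
        for name, borders in intervals.items()}
```

Passing an int instead of a dict asks for that many bins per feature and leaves the choice of borders to the service.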

feature_importances_
¶ Sklearn way of returning feature importances. They are returned as a numpy.array; the ‘effect’ column of the MatrixNet importances is used.

get_feature_importances
()[source]¶ Get features importance: effect, efficiency, information characteristics
Return type: pandas.DataFrame with index=self.features

get_iterations
()[source]¶ Return number of already constructed trees during training
Returns: int or None

class
rep.estimators.matrixnet.
MatrixNetClassifier
(features=None, api_config_file='$HOME/.repmatrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]¶ Bases:
rep.estimators.matrixnet.MatrixNetBase
,rep.estimators.interface.Classifier
MatrixNet classification model.
This is a wrapper around the MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.
Parameters:  features (list[str] or None) – features used in training
 api_config_file (str) –
path to the file with remote api configuration in the json format:
{"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
 iterations (int) – number of constructed trees (default=100)
 regularization (float) – regularization number (default=0.01)
 intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
 max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
 features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
 training_fraction (float) – training rows bagging (default=0.5)
 auto_stop (None or float) – error value for training pre-stopping
 sync (bool) – synchronous or asynchronous training on the server
 random_state (None or int or RandomState) – state for a pseudo random generator

fit
(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self

class
rep.estimators.matrixnet.
MatrixNetRegressor
(features=None, api_config_file='$HOME/.repmatrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]¶ Bases:
rep.estimators.matrixnet.MatrixNetBase
,rep.estimators.interface.Regressor
MatrixNet regression model.
This is a wrapper around the MatrixNet (a specific BDT) technology developed at Yandex, which is available to CERN users after authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.
Parameters:  features (list[str] or None) – features used in training
 api_config_file (str) –
path to the file with remote api configuration in the json format:
{"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
 iterations (int) – number of constructed trees (default=100)
 regularization (float) – regularization number (default=0.01)
 intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
 max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
 features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
 training_fraction (float) – training rows bagging (default=0.5)
 auto_stop (None or float) – error value for training pre-stopping
 sync (bool) – synchronous or asynchronous training on the server
 random_state (None or int or RandomState) – state for a pseudo random generator

fit
(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – values for samples, array-like of shape [n_samples]
 sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
Examples¶
Classification¶
 Prepare dataset
>>> from sklearn import datasets
>>> import pandas, numpy
>>> from rep.utils import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> # iris data
>>> iris = datasets.load_iris()
>>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
>>> labels = iris.target
>>> # Take just two classes instead of three
>>> data = data[labels != 2]
>>> labels = labels[labels != 2]
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
 Sklearn classification
>>> from rep.estimators import SklearnClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> # Using gradient boosting with default settings
>>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
>>> # Training classifier
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict_proba(test_data)
>>> print(pred)
[[  9.99842983e-01   1.57016893e-04]
 [  1.45163843e-04   9.99854836e-01]
 [  9.99842983e-01   1.57016893e-04]
 [  9.99827693e-01   1.72306607e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518523
 TMVA classification
>>> from rep.estimators import TMVAClassifier
>>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict_proba(test_data)
>>> print(pred)
[[  9.99991025e-01   8.97546346e-06]
 [  1.14084636e-04   9.99885915e-01]
 [  9.99991009e-01   8.99060302e-06]
 [  9.99798700e-01   2.01300452e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99999999999999989
 XGBoost classification
>>> from rep.estimators import XGBoostClassifier
>>> # XGBoost with default parameters
>>> xgb = XGBoostClassifier(features=['a', 'b'])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict_proba(test_data)
>>> print(pred)
[[ 0.9983651   0.00163494]
 [ 0.00170585  0.99829417]
 [ 0.99845636  0.00154361]
 [ 0.96618336  0.03381656], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518512
Regression¶
 Prepare dataset
>>> from sklearn import datasets
>>> from sklearn.metrics import mean_squared_error
>>> from rep.utils import train_test_split
>>> import pandas, numpy
>>> # diabetes data
>>> diabetes = datasets.load_diabetes()
>>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
>>> data = pandas.DataFrame(diabetes.data, columns=features)
>>> labels = diabetes.target
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
 Sklearn regression
>>> from rep.estimators import SklearnRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> # Using gradient boosting with default settings
>>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
>>> # Training regressor
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict(train_data)
>>> numpy.sqrt(mean_squared_error(train_labels, pred))
60.666009962879265
 TMVA regression
>>> from rep.estimators import TMVARegressor
>>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
73.74191838418254
 XGBoost regression
>>> from rep.estimators import XGBoostRegressor
>>> # XGBoost with default parameters
>>> xgb = XGBoostRegressor(features=features[:8])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
65.557743652940133