Estimators (classification and regression)¶
This module contains wrappers with a scikit-learn-compatible interface for different machine learning libraries:
- scikit-learn
- TMVA
- XGBoost
- pybrain
- neurolab
- theanets
- MatrixNet
REP defines an interface for classifier and regressor wrappers, so new wrappers for other libraries can be added following the same interface. Notably, the interface is backward-compatible with the scikit-learn library.
Estimators interfaces (for classification and regression)¶
REP wrappers are derived from Classifier and Regressor, depending on the problem of interest.
Below you can see the standard methods available in the wrappers.
class rep.estimators.interface.Classifier(features=None)[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Interface to train different classification models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...
Parameters: features (list[str] or None) – features used to train the model
Note
- if features aren't set (None), all features in the training dataset will be used
- Datasets should be pandas.DataFrame, not numpy.array. This lets you choose the features used in training, e.g. by setting features=['mass', 'momentum'] in the constructor.
- numpy.array works fine as well, but in that case all the features will be used.
- Class values must be from 0 to n_classes-1!
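For example (a minimal sketch; the column names, data, and classifier are illustrative), training on two chosen columns of a DataFrame:
>>> import pandas
>>> from sklearn.ensemble import RandomForestClassifier
>>> from rep.estimators import SklearnClassifier
>>> data = pandas.DataFrame({'mass': [1.0, 2.0, 3.0, 4.0],
...                          'momentum': [0.1, 0.2, 0.3, 0.4],
...                          'charge': [1, -1, 1, -1]})
>>> labels = [0, 0, 1, 1]
>>> # only the 'mass' and 'momentum' columns are used in training
>>> clf = SklearnClassifier(RandomForestClassifier(), features=['mass', 'momentum'])
>>> clf = clf.fit(data, labels)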
fit(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – labels of samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
fit_lds(lds)[source]¶ Train a classifier on a LabeledDataStorage dataset.
Parameters: lds (LabeledDataStorage) – data Returns: self
get_feature_importances()[source]¶ Return feature importances.
Return type: pandas.DataFrame with index=self.features
get_params(deep=True)¶ Get parameters for this estimator.
Parameters: deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – mapping of string to any; parameter names mapped to their values.
predict(X)[source]¶ Predict labels for all samples in the dataset.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with integer labels
predict_proba(X)[source]¶ Predict probabilities for each class label for samples.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples, n_classes] with probabilities
score(X, y, sample_weight=None)¶ Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Parameters: - X (array-like, shape = (n_samples, n_features)) – test samples
- y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – true labels for X
- sample_weight (array-like, shape = [n_samples], optional) – sample weights
Returns: score (float) – mean accuracy of self.predict(X) wrt. y.
set_params(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns: self
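For instance (a sketch using the SklearnClassifier wrapper described below; the parameter values are illustrative), a nested parameter of the wrapped estimator can be updated through the wrapper:
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from rep.estimators import SklearnClassifier
>>> clf = SklearnClassifier(GradientBoostingClassifier())
>>> # 'clf' is the wrapper's constructor parameter holding the sklearn estimator
>>> clf.set_params(clf__n_estimators=200, features=['a', 'b'])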
staged_predict_proba(X)[source]¶ Predict probabilities for data for each class label on each stage (i.e. for boosting algorithms).
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: iterator
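A typical use (a minimal sketch, assuming train/test DataFrames like those in the Examples section below) is tracking quality at each boosting stage:
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.metrics import roc_auc_score
>>> from rep.estimators import SklearnClassifier
>>> clf = SklearnClassifier(GradientBoostingClassifier(n_estimators=50))
>>> clf.fit(train_data, train_labels)
>>> for stage, proba in enumerate(clf.staged_predict_proba(test_data)):
...     print stage, roc_auc_score(test_labels, proba[:, 1])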
test_on(X, y, sample_weight=None)[source]¶ Prepare a classification report for a single classifier.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – labels of samples — array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: ClassificationReport
test_on_lds(lds)[source]¶ Prepare a classification report for a single classifier.
Parameters: lds (LabeledDataStorage) – data Returns: ClassificationReport
class rep.estimators.interface.Regressor(features=None)[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin
Interface to train different regression models from different machine learning libraries, like sklearn, TMVA, XGBoost, ...
Parameters: features (list[str] or None) – features used to train the model
Note
- if features aren't set (None), all features in the training dataset will be used
- Datasets should be pandas.DataFrame, not numpy.array. This lets you choose the features used in training, e.g. by setting features=['mass', 'momentum'] in the constructor.
- numpy.array works fine as well, but in that case all the features will be used.
fit(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
fit_lds(lds)[source]¶ Train a regression model on a LabeledDataStorage dataset.
Parameters: lds (LabeledDataStorage) – data Returns: self
get_feature_importances()[source]¶ Get feature importances.
Return type: pandas.DataFrame with index=self.features
get_params(deep=True)¶ Get parameters for this estimator.
Parameters: deep (boolean, optional) – if True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – mapping of string to any; parameter names mapped to their values.
predict(X)[source]¶ Predict values for data.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with predicted values
score(X, y, sample_weight=None)¶ Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
Parameters: - X (array-like, shape = (n_samples, n_features)) – test samples
- y (array-like, shape = (n_samples) or (n_samples, n_outputs)) – true values for X
- sample_weight (array-like, shape = [n_samples], optional) – sample weights
Returns: score (float) – R^2 of self.predict(X) wrt. y.
set_params(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns: self
staged_predict(X)[source]¶ Predict values for data on each stage (i.e. for boosting algorithms).
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: iterator
test_on(X, y, sample_weight=None)[source]¶ Prepare a regression report for a single regressor.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values of samples — array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: RegressionReport
test_on_lds(lds)[source]¶ Prepare a regression report for a single regressor.
Parameters: lds (LabeledDataStorage) – data Returns: RegressionReport
Sklearn classifier and regressor¶
SklearnClassifier and SklearnRegressor are wrappers for algorithms from scikit-learn.
From the user's perspective, a wrapped sklearn model behaves in the same way as a non-wrapped one, but has one additional parameter, features, for choosing the columns to use in training.
Typically, models from REP are used with pandas.DataFrames, which makes it possible to refer to the needed variables by name or give specific variables a dedicated role in training.
If the data is a numpy.array, the behaviour will be the same as in sklearn.
For a complete list of the available algorithms, see the sklearn API.
class rep.estimators.sklearn.SklearnClassifier(clf, features=None)[source]¶
Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Classifier
SklearnClassifier is a wrapper over sklearn-like classifiers.
Parameters: - clf (sklearn.BaseEstimator) – classifier to train. Should be sklearn-compatible.
- features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]¶ Train the classifier.
Parameters: - X (pandas.DataFrame) – data shape [n_samples, n_features]
- y – target of training, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
Note
if the sklearn classifier doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.
predict(X)[source]¶ Predict labels for all samples in the dataset.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with integer labels
class rep.estimators.sklearn.SklearnRegressor(clf, features=None)[source]¶
Bases: rep.estimators.sklearn.SklearnBase, rep.estimators.interface.Regressor
SklearnRegressor is a wrapper over sklearn-like regressors.
Parameters: - clf (sklearn.BaseEstimator) – regressor to train. Should be sklearn-compatible.
- features (list[str] or None) – features used in training
fit(X, y, sample_weight=None, **kwargs)[source]¶ Train the regressor.
Parameters: - X (pandas.DataFrame) – data shape [n_samples, n_features]
- y – target of training, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
Note
if the sklearn regressor doesn't support sample_weight, pass sample_weight=None; otherwise an exception will be thrown.
TMVA classifier and regressor¶
These classes are wrappers for TMVA, a C++ machine learning library used in physics, which works with .root files. With REP you can simply use it from Python. TMVA contains classification and regression algorithms, including neural networks. See the TMVA guide for the list of the available algorithms and parameters.
class rep.estimators.tmva.TMVAClassifier(method='kBDT', features=None, factory_options='', sigmoid_function='bdt', **method_parameters)[source]¶
Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Classifier
Implements classification models from the TMVA library, a CERN library for machine learning.
Parameters: - method (str) – algorithm method (default=’kBDT’)
- features (list[str] or None) – features used in training
- factory_options (str) –
system options, including data transformations before training, for example:
"!V:!Silent:Color:Transformations=I;D;P;G,D"
- sigmoid_function (str) – function used to convert TMVA output to probabilities:
- identity (use for svm, mlp) — do not transform the output; use this value for methods returning class probabilities
- sigmoid — sigmoid transformation; use it if the output varies in the range [-infinity, +infinity]
- bdt (for BDT algorithms, whose output varies in the range [-1, 1])
- sig_eff=0.4 — for rectangular cut optimization methods; here, for instance, 0.4 will be used as the signal efficiency to evaluate the MVA (put any float from [0, 1])
- method_parameters (dict) – classifier options, example: NTrees=100, BoostType=’Grad’.
Warning
TMVA doesn't support staged_predict_proba() and feature_importances_.
TMVA doesn't support multiclass classification, only two-class classification.
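For example (a sketch; the method names follow the TMVA guide and the parameter values are illustrative):
>>> from rep.estimators import TMVAClassifier
>>> # BDT output lies in [-1, 1], so the default 'bdt' transform applies
>>> bdt = TMVAClassifier(method='kBDT', NTrees=50, features=['a', 'b'])
>>> # a method that already outputs class probabilities should use 'identity'
>>> mlp = TMVAClassifier(method='kMLP', sigmoid_function='identity', features=['a', 'b'])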
fit(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – labels of samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
get_params(deep=True)[source]¶ Get parameters for this estimator.
Returns: dict, parameter names mapped to their values.
predict_proba(X)[source]¶ Predict probabilities for each class label for samples.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples, n_classes] with probabilities
class rep.estimators.tmva.TMVARegressor(method='kBDT', features=None, factory_options='', **method_parameters)[source]¶
Bases: rep.estimators.tmva.TMVABase, rep.estimators.interface.Regressor
Implements regression models from the TMVA library, a CERN library for machine learning.
Parameters: - method (str) – algorithm method (default=’kBDT’)
- features (list[str] or None) – features used in training
- factory_options (str) –
system options, including data transformations before training, for example:
"!V:!Silent:Color:Transformations=I;D;P;G,D"
- method_parameters (dict) – regressor options, for example: NTrees=100, BoostType=’Grad’
Warning
TMVA doesn't support staged_predict() and feature_importances_.
fit(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
get_params(deep=True)[source]¶ Get parameters for this estimator.
Returns: dict, parameter names mapped to their values.
predict(X)[source]¶ Predict values for data.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with predicted values
XGBoost classifier and regressor¶
These classes are wrappers for the XGBoost library.
class rep.estimators.xgboost.XGBoostBase(n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]¶
Bases: object
A base class for the XGBoostClassifier and XGBoostRegressor. XGBoost tree booster is used.
Parameters: - n_estimators (int) – number of trees built.
- nthreads (int) – number of parallel threads used to run XGBoost.
- num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
- gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
- eta (float) – (or learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
- max_depth (int) – maximum depth of a tree.
- scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
- min_child_weight (float) –
minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.
Note
weights are normalized so that mean=1 before fitting. Roughly min_child_weight is equal to the number of events.
- subsample (float) – subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.
- colsample (float) – subsample ratio of columns when constructing each tree.
- base_score (float) – the initial prediction score of all instances, global bias.
- random_state (None or int or RandomState) – state for a pseudo random generator
- verbose (bool) – if 1, will print messages during training
- missing (float) – the number considered by XGBoost as missing value.
feature_importances_¶ Sklearn way of returning feature importance. Returned as a numpy.array, assuming that train_features=None was passed initially.
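A sketch of the two ways of retrieving importances (assuming a fitted classifier and the train/test split from the Examples section below):
>>> from rep.estimators import XGBoostClassifier
>>> xgb = XGBoostClassifier(n_estimators=50)
>>> xgb.fit(train_data, train_labels)
>>> print xgb.get_feature_importances()   # pandas.DataFrame indexed by feature names
>>> print xgb.feature_importances_        # plain numpy.array in the sklearn convention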
class rep.estimators.xgboost.XGBoostClassifier(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, scale_pos_weight=1.0, min_child_weight=1.0, subsample=1.0, colsample=1.0, base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]¶
Bases: rep.estimators.xgboost.XGBoostBase, rep.estimators.interface.Classifier
Implements a classification model from the XGBoost library. The XGBoost tree booster is used.
Parameters: - n_estimators (int) – number of trees built.
- nthreads (int) – number of parallel threads used to run XGBoost.
- num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
- gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
- eta (float) – (or learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
- max_depth (int) – maximum depth of a tree.
- scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
- min_child_weight (float) –
minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.
Note
weights are normalized so that mean=1 before fitting. Roughly min_child_weight is equal to the number of events.
- subsample (float) – subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.
- colsample (float) – subsample ratio of columns when constructing each tree.
- base_score (float) – the initial prediction score of all instances, global bias.
- random_state (None or int or RandomState) – state for a pseudo random generator
- verbose (bool) – if 1, will print messages during training
- missing (float) – the number considered by XGBoost as missing value.
fit(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – labels of samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
predict_proba(X)[source]¶ Predict probabilities for each class label for samples.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples, n_classes] with probabilities
staged_predict_proba(X, step=None)[source]¶ Predict probabilities for data for each class label on each stage.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- step (int) – step between returned iterations (None by default). XGBoost does not implement this functionality natively, so prediction restarts from the beginning each time; with None, the step is chosen to give 10 points on the learning curve.
Returns: iterator
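For instance (a sketch with illustrative values, assuming the train/test split from the Examples section below):
>>> from rep.estimators import XGBoostClassifier
>>> from sklearn.metrics import roc_auc_score
>>> xgb = XGBoostClassifier(n_estimators=100)
>>> xgb.fit(train_data, train_labels)
>>> # evaluate probabilities after every 20 trees
>>> for proba in xgb.staged_predict_proba(test_data, step=20):
...     print roc_auc_score(test_labels, proba[:, 1])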
class rep.estimators.xgboost.XGBoostRegressor(features=None, n_estimators=100, nthreads=16, num_feature=None, gamma=None, eta=0.3, max_depth=6, min_child_weight=1.0, subsample=1.0, colsample=1.0, objective_type='linear', base_score=0.5, verbose=0, missing=-999.0, random_state=0)[source]¶
Bases: rep.estimators.xgboost.XGBoostBase, rep.estimators.interface.Regressor
Implements a regression model from the XGBoost library. The XGBoost tree booster is used.
Parameters: - n_estimators (int) – number of trees built.
- nthreads (int) – number of parallel threads used to run XGBoost.
- num_feature (None or int) – feature dimension used in boosting, set to maximum dimension of the feature (set automatically by XGBoost, no need to be set by user).
- gamma (None or float) – minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
- eta (float) – (or learning rate) step size shrinkage used in updates to prevent overfitting. After each boosting step we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
- max_depth (int) – maximum depth of a tree.
- scale_pos_weight (float) – ratio of the weights of class 1 to the weights of class 0.
- min_child_weight (float) –
minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.
Note
weights are normalized so that mean=1 before fitting. Roughly min_child_weight is equal to the number of events.
- subsample (float) – subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.
- colsample (float) – subsample ratio of columns when constructing each tree.
- base_score (float) – the initial prediction score of all instances, global bias.
- random_state (None or int or RandomState) – state for a pseudo random generator
- verbose (bool) – if 1, will print messages during training
- missing (float) – the number considered by XGBoost as missing value.
-
fit
(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
predict(X)[source]¶ Predict values for data.
Parameters: X (pandas.DataFrame) – data of shape [n_samples, n_features] Return type: numpy.array of shape [n_samples] with predicted values
staged_predict(X, step=None)[source]¶ Predict values for data on each stage.
Parameters: - X – pandas.DataFrame of shape [n_samples, n_features]
- step (int) – step between returned iterations (None by default). XGBoost does not implement this functionality natively, so prediction restarts from the beginning each time; with None, the step is chosen to give 10 points on the learning curve.
Returns: iterator
Theanets classifier and regressor¶
These classes are wrappers for the theanets library, a Python library for neural networks.
class rep.estimators.theanets.TheanetsBase(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]¶
Bases: object
A base class for the estimators from Theanets library.
Parameters: - features (None or list(str)) – list of features used to train the model
- layers (sequence of int, tuple, dict) – a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
- input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
- output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
- hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
- output_activation (str) – the name of an activation function to use on the output layer by default
- input_noise (float) – standard deviation of desired noise to inject into input
- hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
- input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
- hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
- decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- trainers (list[dict] or None) –
parameters to specify training algorithm(s), for example:
trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
- random_state (None or int or RandomState) – state for a pseudo random generator
For more information on the available trainers and their parameters see this page.
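For example (a sketch; 'nag' and 'rprop' are algorithm names from the theanets documentation, the other values are illustrative):
>>> from rep.estimators import TheanetsClassifier
>>> tn = TheanetsClassifier(layers=[20, 10],
...                         trainers=[{'algo': 'nag', 'learning_rate': 0.1},
...                                   {'algo': 'rprop'}])
>>> tn.fit(train_data, train_labels)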
fit(X, y, sample_weight=None)[source]¶ Train a classification/regression model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples — array-like of shape [n_samples]
- sample_weight – weights for samples — array-like of shape [n_samples]
Returns: self
partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]¶ Continue training of the existing estimator.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples — array-like of shape [n_samples]
- sample_weight – weights for samples — array-like of shape [n_samples]
- keep_trainer (bool) – if True, the trainer will be added to the list of trainers in self.trainers
- trainer (dict) – parameters of the training algorithm we want to use now
Returns: self
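A sketch of staged training (the algorithm names are illustrative): fit with the configured trainers first, then continue with a different algorithm:
>>> from rep.estimators import TheanetsClassifier
>>> tn = TheanetsClassifier(layers=[10], trainers=[{'algo': 'nag'}])
>>> tn.fit(train_data, train_labels)
>>> # continue training the same network with Rprop
>>> tn.partial_fit(train_data, train_labels, algo='rprop')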
set_params(**params)[source]¶ Set the parameters of the estimator. Deep parameters of trainers and scaler can be accessed, for instance:
trainers__0 = {'algo': 'sgd', 'learning_rate': 0.3}
trainers__0_algo = 'sgd'
layers__1 = 14
scaler__use_std = True
Parameters: params (dict) – parameters to set in the model
class rep.estimators.theanets.TheanetsClassifier(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]¶
Bases: rep.estimators.theanets.TheanetsBase, rep.estimators.interface.Classifier
Implements a classification model from the Theanets library.
Parameters: - features (None or list(str)) – list of features used to train the model
- layers (sequence of int, tuple, dict) –
a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
- input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
- output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
- hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
- output_activation (str) – the name of an activation function to use on the output layer by default
- input_noise (float) – standard deviation of desired noise to inject into input
- hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
- input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
- hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
- decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- trainers (list[dict] or None) –
parameters to specify training algorithm(s), for example:
trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
- random_state (None or int or RandomState) – state for a pseudo random generator
For more information on the available trainers and their parameters see this page.
partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]¶ Continue training of the existing estimator.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples — array-like of shape [n_samples]
- sample_weight – weights for samples — array-like of shape [n_samples]
- keep_trainer (bool) – if True, the trainer will be added to the list of trainers in self.trainers
- trainer (dict) – parameters of the training algorithm we want to use now
Returns: self
class rep.estimators.theanets.TheanetsRegressor(features=None, layers=(10, ), input_layer=-1, output_layer=-1, hidden_activation='logistic', output_activation='linear', input_noise=0, hidden_noise=0, input_dropout=0, hidden_dropout=0, decode_from=1, weight_l1=0.01, weight_l2=0.01, scaler='standard', trainers=None, random_state=42)[source]¶
Bases: rep.estimators.theanets.TheanetsBase, rep.estimators.interface.Regressor
Implements a regression model from the Theanets library.
Parameters: - features (None or list(str)) – list of features used to train the model
- layers (sequence of int, tuple, dict) –
a sequence of values specifying the hidden layer configuration for the network. For more information see Specifying layers in the theanets documentation. Note that theanets layers parameter includes input and output layers in the sequence as well.
- input_layer (int) – size of the input layer. If it equals -1, the size is taken from the training dataset
- output_layer (int) – size of the output layer. If it equals -1, the size is taken from the training dataset
- hidden_activation (str) – the name of an activation function to use on the hidden network layers by default
- output_activation (str) – the name of an activation function to use on the output layer by default
- input_noise (float) – standard deviation of desired noise to inject into input
- hidden_noise (float) – standard deviation of desired noise to inject into hidden unit activation output
- input_dropout (float) – proportion of the input units to randomly set to 0; it ranges over [0, 1]
- hidden_dropout (float) – proportion of hidden unit activations to randomly set to 0; it ranges over [0, 1]
- decode_from (int) – any of the hidden layers can be tapped at the output. Just specify a value greater than 1 to tap the last N hidden layers. The default is 1, which decodes from just the last layer.
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- trainers (list[dict] or None) –
parameters to specify training algorithm(s), for example:
trainers=[{'algo': 'sgd', 'momentum': 0.2}, {'algo': 'nag'}]
- random_state (None or int or RandomState) – state for a pseudo random generator
For more information on the available trainers and their parameters see this page.
partial_fit(X, y, sample_weight=None, keep_trainer=True, **trainer)[source]¶ Continue training of the existing estimator.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples — array-like of shape [n_samples]
- sample_weight – weights for samples — array-like of shape [n_samples]
- keep_trainer (bool) – if True, the trainer will be added to the list of trainers in self.trainers
- trainer (dict) – parameters of the training algorithm we want to use now
Returns: self
Neurolab classifier and regressor¶
These classes are wrappers for the Neurolab library, a Python library for neural networks.
Warning
To make neurolab reproducible, we set the global random seed with numpy.random.seed(42).
class rep.estimators.neurolab.NeurolabClassifier(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]¶
Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Classifier
Implements a classification model from the Neurolab library.
Parameters: - features (list[str] or None) – features used in training
- layers (list[int]) – number of units in each hidden layer.
- net_type (string) –
type of the network; possible values are:
- feed-forward
- competing-layer
- learning-vector
- elman-recurrent
- hemming-recurrent
- initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
- trainf – net training function; default value depends on the type of a network
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- random_state – this parameter is ignored and is added for uniformity.
- kwargs (dict) – additional arguments to net __init__, varies with different net_types
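A minimal construction sketch (the layer sizes and feature names are illustrative):
>>> from rep.estimators import NeurolabClassifier
>>> nl = NeurolabClassifier(layers=[15, 5], net_type='feed-forward', features=['a', 'b'])
>>> nl.fit(train_data, train_labels)
>>> proba = nl.predict_proba(test_data)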
fit(X, y)[source]¶ Train a classification model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – labels of samples — array-like of shape [n_samples]
Returns: self
partial_fit(X, y)[source]¶ Additional training of the classifier.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – labels of samples, array-like of shape [n_samples]
Returns: self
class rep.estimators.neurolab.NeurolabRegressor(features=None, layers=(10, ), net_type='feed-forward', initf=<function init_rand>, trainf=None, scaler='standard', random_state=None, **other_params)[source]¶
Bases: rep.estimators.neurolab.NeurolabBase, rep.estimators.interface.Regressor
Implements a regression model from the Neurolab library.
Parameters: - features (list[str] or None) – features used in training
- layers (list[int]) – number of units in each hidden layer.
- net_type (string) –
type of the network; possible values are:
- feed-forward
- competing-layer
- learning-vector
- elman-recurrent
- hemming-recurrent
- initf (anything implementing call(layer), e.g. neurolab.init.* or list[neurolab.init.*] of shape [n_layers]) – layer initializers
- trainf – net training function; default value depends on the type of a network
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- random_state – this parameter is ignored and is added for uniformity.
- kwargs (dict) – additional arguments to net __init__, varies with different net_types
fit(X, y)[source]¶ Train a regression model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples — array-like of shape [n_samples]
Returns: self
partial_fit(X, y)[source]¶ Additional training of the regressor.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples, array-like of shape [n_samples]
Returns: self
Pybrain classifier and regressor¶
These classes are wrappers for the PyBrain library, a Python library for neural networks.
Warning
pybrain training isn't reproducible (training with the same parameters produces a different neural network each time)
class rep.estimators.pybrain.PyBrainBase(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]¶
Bases: object
A base class for the estimators from the PyBrain library.
Parameters: - features (list[str] or None) – features used in training.
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
- verbose (bool) – print train/validation errors.
- random_state – this parameter is ignored; pybrain training is not reproducible
Net parameters:
Parameters: - layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
- hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
- params (dict) –
other net parameters:
- bias and outputbias (boolean) – flags to indicate whether the network should have the corresponding biases; both default to True
- peepholes (boolean)
- recurrent (boolean) – if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork
Gradient descent trainer parameters:
Parameters: - learningrate (float) – gives the ratio of which parameters are changed into the direction of the gradient
- lrdecay (float) – the learning rate decreases by lrdecay, which is used to multiply the learning rate after each training step
- momentum (float) – the ratio by which the gradient of the last time step is used
- batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
- weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all
Rprop trainer parameters:
Parameters: - etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
- etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
- delta (float) – step width for each weight
- deltamin (float) – minimum step width (default=1e-6)
- deltamax (float) – maximum step width (default=5.0)
- delta0 (float) – initial step width (default=0.1)
Training termination parameters
Parameters: - epochs (int) – number of iterations in training; if < 0 then estimator trains until converge
- max_epochs (int) – maximum number of epochs the trainer should train if it is given
- continue_epochs (int) – each time the validation error decreases, try for continue_epochs more epochs to find a better one
- validation_proportion (float) – the ratio of the dataset that is used for the validation dataset
Note
Details about parameters here.
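A construction sketch (the values are illustrative; see the parameter notes above):
>>> from rep.estimators import PyBrainClassifier
>>> pb = PyBrainClassifier(layers=[10, 5], hiddenclass=['SigmoidLayer', 'SigmoidLayer'],
...                        epochs=20, use_rprop=True)
>>> pb.fit(train_data, train_labels)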
fit(X, y)[source]¶ Train a classification/regression model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples — array-like of shape [n_samples]
Returns: self
class rep.estimators.pybrain.PyBrainClassifier(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]¶
Bases: rep.estimators.pybrain.PyBrainBase, rep.estimators.interface.Classifier
Implements a classification model from the PyBrain library.
Parameters: - features (list[str] or None) – features used in training.
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
- verbose (bool) – print train/validation errors.
- random_state – this parameter is ignored; pybrain training is not reproducible
Net parameters:
Parameters: - layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
- hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
- params (dict) –
other net parameters:
- bias and outputbias (boolean) – flags to indicate whether the network should have the corresponding biases; both default to True
- peepholes (boolean)
- recurrent (boolean) – if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork
Gradient descent trainer parameters:
Parameters: - learningrate (float) – gives the ratio of which parameters are changed into the direction of the gradient
- lrdecay (float) – the learning rate decreases by lrdecay, which is used to multiply the learning rate after each training step
- momentum (float) – the ratio by which the gradient of the last time step is used
- batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
- weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all
Rprop trainer parameters:
Parameters: - etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
- etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
- delta (float) – step width for each weight
- deltamin (float) – minimum step width (default=1e-6)
- deltamax (float) – maximum step width (default=5.0)
- delta0 (float) – initial step width (default=0.1)
Training termination parameters
Parameters: - epochs (int) – number of iterations in training; if < 0 then estimator trains until converge
- max_epochs (int) – maximum number of epochs the trainer should train if it is given
- continue_epochs (int) – each time the validation error decreases, try for continue_epochs more epochs to find a better one
- validation_proportion (float) – the ratio of the dataset that is used for the validation dataset
Note
Details about parameters here.
class rep.estimators.pybrain.PyBrainRegressor(features=None, layers=(10, ), hiddenclass=None, epochs=10, scaler='standard', use_rprop=False, learningrate=0.01, lrdecay=1.0, momentum=0.0, verbose=False, batchlearning=False, weightdecay=0.0, etaminus=0.5, etaplus=1.2, deltamin=1e-06, deltamax=0.5, delta0=0.1, max_epochs=None, continue_epochs=3, validation_proportion=0.25, random_state=None, **params)[source]¶
Bases: rep.estimators.pybrain.PyBrainBase, rep.estimators.interface.Regressor
Implements a regression model from the PyBrain library.
Parameters: - features (list[str] or None) – features used in training.
- scaler (str or sklearn-like transformer or False) – transformer which is applied to the input samples. If it is False, scaling will not be used
- use_rprop (bool) – flag to indicate whether we should use Rprop or SGD trainer
- verbose (bool) – print train/validation errors.
- random_state – this parameter is ignored; pybrain training is not reproducible
Net parameters:
Parameters: - layers (list[int]) – indicate how many neurons in each hidden(!) layer; default is 1 hidden layer with 10 neurons
- hiddenclass (list[str]) – classes of the hidden layers; default is ‘SigmoidLayer’
- params (dict) –
other net parameters:
- bias and outputbias (boolean) – flags to indicate whether the network should have the corresponding biases; both default to True
- peepholes (boolean)
- recurrent (boolean) – if the recurrent flag is set, a RecurrentNetwork will be created, otherwise a FeedForwardNetwork
Gradient descent trainer parameters:
Parameters: - learningrate (float) – gives the ratio of which parameters are changed into the direction of the gradient
- lrdecay (float) – the learning rate decreases by lrdecay, which is used to multiply the learning rate after each training step
- momentum (float) – the ratio by which the gradient of the last time step is used
- batchlearning (boolean) – if set, the parameters are updated only at the end of each epoch. Default is False
- weightdecay (float) – corresponds to the weightdecay rate, where 0 is no weight decay at all
Rprop trainer parameters:
Parameters: - etaminus (float) – factor by which a step width is decreased when overstepping (default=0.5)
- etaplus (float) – factor by which a step width is increased when following gradient (default=1.2)
- delta (float) – step width for each weight
- deltamin (float) – minimum step width (default=1e-6)
- deltamax (float) – maximum step width (default=5.0)
- delta0 (float) – initial step width (default=0.1)
Training termination parameters
Parameters: - epochs (int) – number of iterations in training; if < 0 then estimator trains until converge
- max_epochs (int) – maximum number of epochs the trainer should train if it is given
- continue_epochs (int) – each time the validation error decreases, try for continue_epochs more epochs to find a better one
- validation_proportion (float) – the ratio of the dataset that is used for the validation dataset
Note
Details about parameters here.
MatrixNet classifier and regressor¶
MatrixNetClassifier and MatrixNetRegressor are wrappers for the MatrixNet web service, a proprietary BDT developed at Yandex. Think of it as a specific boosted decision tree algorithm available as a service.
At the moment MatrixNet is available only for CERN users.
To use MatrixNet, first acquire a token:
- Go to https://yandex-apps.cern.ch/ (login with your CERN account)
- Click Add token in the left panel
- Choose the MatrixNet service and click Create token
- Create a ~/.rep-matrixnet.config.json file with the following content (a custom path to the config file can be specified when creating a wrapper object):
{ "url": "https://ml.cern.yandex.net/v1", "token": "<your_token>" }
class rep.estimators.matrixnet.MatrixNetBase(api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]¶
Bases: object
Base class for MatrixNetClassifier and MatrixNetRegressor.
This is a wrapper around the MatrixNet (a specific BDT) technology developed at Yandex, available to CERN users with authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.
Parameters: - features (list[str] or None) – features used in training
- api_config_file (str) –
path to the file with remote api configuration in the json format:
{"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
- iterations (int) – number of constructed trees (default=100)
- regularization (float) – regularization number (default=0.01)
- intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
- max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
- features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
- training_fraction (float) – training rows bagging (default=0.5)
- auto_stop (None or float) – error value for training pre-stopping
- sync (bool) – synchronous or asynchronous training on the server
- random_state (None or int or RandomState) – state for a pseudo random generator
feature_importances_¶ Sklearn way of returning feature importance. Returned as a numpy.array; the 'effect' column of the MatrixNet importances is used.
get_feature_importances()[source]¶ Get feature importances: effect, efficiency and information characteristics.
Return type: pandas.DataFrame with index=self.features
get_iterations()[source]¶ Return the number of trees constructed so far during training.
Returns: int or None
class rep.estimators.matrixnet.MatrixNetClassifier(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]¶
Bases: rep.estimators.matrixnet.MatrixNetBase, rep.estimators.interface.Classifier
MatrixNet classification model.
This is a wrapper around the MatrixNet (a specific BDT) technology developed at Yandex, available to CERN users with authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.
Parameters: - features (list[str] or None) – features used in training
- api_config_file (str) –
path to the file with remote api configuration in the json format:
{"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
- iterations (int) – number of constructed trees (default=100)
- regularization (float) – regularization number (default=0.01)
- intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
- max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
- features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
- training_fraction (float) – training rows bagging (default=0.5)
- auto_stop (None or float) – error value for training pre-stopping
- sync (bool) – synchronous or asynchronous training on the server
- random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]¶ Train a classification model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – labels of samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
class rep.estimators.matrixnet.MatrixNetRegressor(features=None, api_config_file='$HOME/.rep-matrixnet.config.json', iterations=100, regularization=0.01, intervals=8, max_features_per_iteration=6, features_sample_rate_per_iteration=1.0, training_fraction=0.5, auto_stop=None, sync=True, random_state=42)[source]¶
Bases: rep.estimators.matrixnet.MatrixNetBase, rep.estimators.interface.Regressor
MatrixNet regression model.
This is a wrapper around the MatrixNet (a specific BDT) technology developed at Yandex, available to CERN users with authorization. The trained estimator is downloaded and stored on your computer, so you can use it at any time.
Parameters: - features (list[str] or None) – features used in training
- api_config_file (str) –
path to the file with remote api configuration in the json format:
{"url": "https://ml.cern.yandex.net/v1", "token": "<your_token>"}
- iterations (int) – number of constructed trees (default=100)
- regularization (float) – regularization number (default=0.01)
- intervals (int or dict(str, list)) – number of bins for features discretization or dict with borders list for each feature for its discretization (default=8)
- max_features_per_iteration (int) – depth (default=6, supports 1 <= .. <= 6)
- features_sample_rate_per_iteration (float) – training features sampling (default=1.0)
- training_fraction (float) – training rows bagging (default=0.5)
- auto_stop (None or float) – error value for training pre-stopping
- sync (bool) – synchronous or asynchronous training on the server
- random_state (None or int or RandomState) – state for a pseudo random generator
fit(X, y, sample_weight=None)[source]¶ Train a regression model on the data.
Parameters: - X (pandas.DataFrame) – data of shape [n_samples, n_features]
- y – values for samples, array-like of shape [n_samples]
- sample_weight – weight of samples, array-like of shape [n_samples] or None if all weights are equal
Returns: self
Examples¶
Classification¶
- Prepare dataset
>>> from sklearn import datasets
>>> import pandas, numpy
>>> from rep.utils import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> # iris data
>>> iris = datasets.load_iris()
>>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
>>> labels = iris.target
>>> # Take just two classes instead of three
>>> data = data[labels != 2]
>>> labels = labels[labels != 2]
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
- Sklearn classification
>>> from rep.estimators import SklearnClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> # Using gradient boosting with default settings
>>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
>>> # Training classifier
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict_proba(test_data)
>>> print pred
[[ 9.99842983e-01 1.57016893e-04]
 [ 1.45163843e-04 9.99854836e-01]
 [ 9.99842983e-01 1.57016893e-04]
 [ 9.99827693e-01 1.72306607e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518523
- TMVA classification
>>> from rep.estimators import TMVAClassifier
>>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict_proba(test_data)
>>> print pred
[[ 9.99991025e-01 8.97546346e-06]
 [ 1.14084636e-04 9.99885915e-01]
 [ 9.99991009e-01 8.99060302e-06]
 [ 9.99798700e-01 2.01300452e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99999999999999989
- XGBoost classification
>>> from rep.estimators import XGBoostClassifier
>>> # XGBoost with default parameters
>>> xgb = XGBoostClassifier(features=['a', 'b'])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict_proba(test_data)
>>> print pred
[[ 0.9983651 0.00163494]
 [ 0.00170585 0.99829417]
 [ 0.99845636 0.00154361]
 [ 0.96618336 0.03381656], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518512
Regression¶
- Prepare dataset
>>> from sklearn import datasets
>>> from sklearn.metrics import mean_squared_error
>>> from rep.utils import train_test_split
>>> import pandas, numpy
>>> # diabetes data
>>> diabetes = datasets.load_diabetes()
>>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
>>> data = pandas.DataFrame(diabetes.data, columns=features)
>>> labels = diabetes.target
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
- Sklearn regression
>>> from rep.estimators import SklearnRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> # Using gradient boosting with default settings
>>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
>>> # Training regressor
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict(train_data)
>>> numpy.sqrt(mean_squared_error(train_labels, pred))
60.666009962879265
- TMVA regression
>>> from rep.estimators import TMVARegressor
>>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
73.74191838418254
- XGBoost regression
>>> from rep.estimators import XGBoostRegressor
>>> # XGBoost with default parameters
>>> xgb = XGBoostRegressor(features=features[:8])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
65.557743652940133