.. _estimators:
Estimators (classification and regression)
==========================================
This module contains wrappers with a scikit-learn-compatible interface for several machine learning libraries:

* scikit-learn
* TMVA
* XGBoost
* pybrain
* neurolab
* theanets
**REP** defines a common interface for classifier and regressor wrappers, so new wrappers for other libraries can be added by
following the same interface. Notably, the interface is backward compatible with the scikit-learn library.
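The wrapper pattern behind this interface can be illustrated with a minimal, hypothetical sketch (this is not REP's actual implementation): each wrapper keeps a ``features`` list, selects those columns from the input, and delegates ``fit``/``predict`` to the wrapped estimator.

```python
# Hypothetical sketch of the wrapper pattern: restrict input to the
# declared `features`, then delegate to any estimator that follows the
# scikit-learn fit/predict convention.

class EstimatorWrapper:
    """Wrap an estimator, feeding it only the columns listed in `features`."""

    def __init__(self, base_estimator, features=None):
        self.base_estimator = base_estimator
        self.features = features

    def _select(self, rows):
        # rows: a list of dicts {feature_name: value}; keep only `features`
        names = self.features or sorted(rows[0])
        return [[row[name] for name in names] for row in rows]

    def fit(self, rows, labels):
        # delegate training on the selected columns
        self.base_estimator.fit(self._select(rows), labels)
        return self

    def predict(self, rows):
        # delegate prediction on the same selected columns
        return self.base_estimator.predict(self._select(rows))
```

Because the wrapper itself exposes ``fit``/``predict``, it can be dropped anywhere a scikit-learn estimator is expected.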
Estimators interfaces (for classification and regression)
---------------------------------------------------------
.. automodule:: rep.estimators.interface
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
Sklearn classifier and regressor
--------------------------------
.. automodule:: rep.estimators.sklearn
:members:
:show-inheritance:
:undoc-members:
TMVA classifier and regressor
-----------------------------
.. automodule:: rep.estimators.tmva
:members:
:show-inheritance:
:undoc-members:
XGBoost classifier and regressor
--------------------------------
.. automodule:: rep.estimators.xgboost
:members:
:show-inheritance:
:undoc-members:
Theanets classifier and regressor
---------------------------------
.. automodule:: rep.estimators.theanets
:members:
:show-inheritance:
:undoc-members:
Neurolab classifier and regressor
---------------------------------
.. automodule:: rep.estimators.neurolab
:members:
:show-inheritance:
:undoc-members:
Pybrain classifier and regressor
--------------------------------
.. automodule:: rep.estimators.pybrain
:members:
:show-inheritance:
:undoc-members:
MatrixNet classifier and regressor
----------------------------------
.. automodule:: rep.estimators.matrixnet
:members:
:show-inheritance:
:undoc-members:
Examples
--------
Classification
**************
* Prepare dataset
>>> from sklearn import datasets
>>> import pandas, numpy
>>> from rep.utils import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> # iris data
>>> iris = datasets.load_iris()
>>> data = pandas.DataFrame(iris.data, columns=['a', 'b', 'c', 'd'])
>>> labels = iris.target
>>> # Take just two classes instead of three
>>> data = data[labels != 2]
>>> labels = labels[labels != 2]
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
* Sklearn classification
>>> from rep.estimators import SklearnClassifier
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> # Using gradient boosting with default settings
>>> sk = SklearnClassifier(GradientBoostingClassifier(), features=['a', 'b'])
>>> # Training classifier
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict_proba(test_data)
>>> print(pred)
[[ 9.99842983e-01 1.57016893e-04]
[ 1.45163843e-04 9.99854836e-01]
[ 9.99842983e-01 1.57016893e-04]
[ 9.99827693e-01 1.72306607e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518523
* TMVA classification
>>> from rep.estimators import TMVAClassifier
>>> tmva = TMVAClassifier(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=['a', 'b'])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict_proba(test_data)
>>> print(pred)
[[ 9.99991025e-01 8.97546346e-06]
[ 1.14084636e-04 9.99885915e-01]
[ 9.99991009e-01 8.99060302e-06]
[ 9.99798700e-01 2.01300452e-04], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99999999999999989
* XGBoost classification
>>> from rep.estimators import XGBoostClassifier
>>> # XGBoost with default parameters
>>> xgb = XGBoostClassifier(features=['a', 'b'])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict_proba(test_data)
>>> print(pred)
[[ 0.9983651 0.00163494]
[ 0.00170585 0.99829417]
[ 0.99845636 0.00154361]
[ 0.96618336 0.03381656], ..]
>>> roc_auc_score(test_labels, pred[:, 1])
0.99768518518518512
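All three examples above evaluate quality with ``roc_auc_score``. ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count as half); a minimal sketch of this equivalent pairwise formulation, for illustration only:

```python
# ROC AUC as the pairwise ranking probability: the fraction of
# (positive, negative) pairs where the positive example scores higher,
# counting ties as 0.5. Illustrative O(n^2) sketch, not the sklearn code.

def roc_auc(labels, scores):
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0 and random scores give about 0.5, which is why the values near 1.0 above indicate the two iris classes are almost perfectly separated.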
Regression
**********
* Prepare dataset
>>> from sklearn import datasets
>>> from sklearn.metrics import mean_squared_error
>>> from rep.utils import train_test_split
>>> import pandas, numpy
>>> # diabetes data
>>> diabetes = datasets.load_diabetes()
>>> features = ['feature_%d' % number for number in range(diabetes.data.shape[1])]
>>> data = pandas.DataFrame(diabetes.data, columns=features)
>>> labels = diabetes.target
>>> train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.7)
* Sklearn regression
>>> from rep.estimators import SklearnRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> # Using gradient boosting with default settings
>>> sk = SklearnRegressor(GradientBoostingRegressor(), features=features[:8])
>>> # Training regressor
>>> sk.fit(train_data, train_labels)
>>> pred = sk.predict(train_data)
>>> numpy.sqrt(mean_squared_error(train_labels, pred))
60.666009962879265
* TMVA regression
>>> from rep.estimators import TMVARegressor
>>> tmva = TMVARegressor(method='kBDT', NTrees=100, Shrinkage=0.1, nCuts=-1, BoostType='Grad', features=features[:8])
>>> tmva.fit(train_data, train_labels)
>>> pred = tmva.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
73.74191838418254
* XGBoost regression
>>> from rep.estimators import XGBoostRegressor
>>> # XGBoost with default parameters
>>> xgb = XGBoostRegressor(features=features[:8])
>>> xgb.fit(train_data, train_labels, sample_weight=numpy.ones(len(train_labels)))
>>> pred = xgb.predict(test_data)
>>> numpy.sqrt(mean_squared_error(test_labels, pred))
65.557743652940133
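The regression examples report ``numpy.sqrt(mean_squared_error(...))``, i.e. the root mean squared error. For reference, a stdlib-only sketch of that metric:

```python
import math

# RMSE: square root of the average squared difference between targets
# and predictions. Illustrative sketch of the metric used above.
def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

RMSE is expressed in the same units as the target, which makes the values above directly comparable to the spread of the diabetes targets.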
Compatible libraries
--------------------
REP can work with any library that supports the scikit-learn interface.
Examples of compatible libraries: ``nolearn``, ``skflow``, ``gplearn`` and ``hep_ml``.
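"Supports the scikit-learn interface" means exposing ``fit``, ``predict`` and, for classifiers, ``predict_proba``. A minimal, hypothetical classifier following that convention (any such object could be wrapped the same way as the estimators above):

```python
# A toy classifier following the scikit-learn convention: fit() learns
# state and returns self; predict_proba() returns per-class probabilities.
# Hypothetical, stdlib-only sketch for illustration.

class PriorClassifier:
    """Predicts class probabilities equal to the training class frequencies."""

    def fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.classes_ = sorted(counts)
        self.prior_ = [counts[c] / len(y) for c in self.classes_]
        return self

    def predict_proba(self, X):
        # same prior probabilities for every sample, ignoring features
        return [list(self.prior_) for _ in X]

    def predict(self, X):
        # most frequent training class for every sample
        best = max(self.classes_, key=lambda c: self.prior_[self.classes_.index(c)])
        return [best] * len(X)
```

Any estimator structured like this duck-types as a scikit-learn estimator, which is the only requirement for compatibility with REP.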