Report for models

This module contains helper classes that build reports for estimators (feature distributions, prediction distributions, ROC curves, learning curves, and others) and compare estimators with one another.

Classification Report

This module contains the report class for classification estimators. The report includes:

features scatter plots, distributions, correlations

learning curve

ROC curve

efficiencies

metric vs cut

feature importance

feature importance by shuffling the feature column

All methods return objects that have a plot method (see rep.plotting for details); these objects contain the raw information about what is to be plotted.

class rep.report.classification.ClassificationReport(classifiers, lds)

Test estimators on any data. Supports ROC curves, prediction distributions, feature information (correlation matrix, distributions, scatter plots for pairs of features), efficiencies for thresholds (to evaluate the flatness of predictions along an important feature), correlation of the prediction with a chosen feature, and arbitrary quality metrics.

Parameters:

classifiers (dict[str, Classifier]) – estimators
lds (LabeledDataStorage) – data

compute_metric(metric, mask=None)

Compute metric value

Parameters:

metric – function-like object with: __call__(self, y_true, prob, sample_weight=None)
mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the points to use

Returns:

metric value for each estimator
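As an illustration of the metric interface named above, here is a minimal sketch of a function-like object with __call__(self, y_true, prob, sample_weight=None). The class name WeightedAccuracy is hypothetical and not part of rep, and the sketch assumes that prob holds one row of per-class probabilities for each event.

```python
class WeightedAccuracy(object):
    """Hypothetical metric: weighted share of events whose most probable
    class equals the true label."""

    def __call__(self, y_true, prob, sample_weight=None):
        # Treat a missing sample_weight as unit weights.
        if sample_weight is None:
            sample_weight = [1.0] * len(y_true)
        correct = 0.0
        total = 0.0
        for label, row, weight in zip(y_true, prob, sample_weight):
            # Index of the largest probability in this event's row.
            predicted = max(range(len(row)), key=lambda k: row[k])
            if predicted == label:
                correct += weight
            total += weight
        return correct / total


metric = WeightedAccuracy()
y_true = [0, 1, 1, 0]
prob = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
unweighted = metric(y_true, prob)
weighted = metric(y_true, prob, sample_weight=[2.0, 2.0, 1.0, 0.0])
print(unweighted, weighted)
```

An object of this shape could presumably be passed as the metric argument; the plain lists above stand in for the arrays a report would supply.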

efficiencies(features, thresholds=None, mask=None, bins=30, labels_dict=None, ignored_sideband=0.0, errors=False, grid_columns=2)

Efficiencies for spectators

Parameters:

features (None or list[str]) – features to use (if None, the classifier’s spectators are used)
bins (int or array-like) – bins for histogram
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
thresholds (list[float]) – thresholds on prediction
errors (bool) – if True, draw error bars; otherwise draw an interpolated function
labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: ‘bck’, 1: ‘signal’} is used
grid_columns (int) – number of columns in grid
ignored_sideband (float) – float from (0, 1), fraction of the plotted data that is ignored

Return type:

plotting.GridPlot
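To make the idea of an efficiency along a spectator concrete, the following standalone sketch computes, for one threshold, the fraction of events in each bin of a feature whose prediction exceeds that threshold. It does not call rep; the function name binned_efficiency and the toy data are hypothetical, and the definition of efficiency used here (share of events passing the cut) is an assumption about what is plotted.

```python
def binned_efficiency(feature, prediction, threshold, edges):
    """Fraction of events with prediction > threshold in each bin of `feature`.

    `edges` are increasing bin borders; bin i covers [edges[i], edges[i + 1]).
    Empty bins yield None.
    """
    n_bins = len(edges) - 1
    passed = [0] * n_bins
    total = [0] * n_bins
    for x, p in zip(feature, prediction):
        for i in range(n_bins):
            if edges[i] <= x < edges[i + 1]:
                total[i] += 1
                if p > threshold:
                    passed[i] += 1
                break
    return [passed[i] / float(total[i]) if total[i] else None
            for i in range(n_bins)]


mass = [0.1, 0.4, 0.6, 0.9, 1.2, 1.7]
prediction = [0.2, 0.8, 0.7, 0.9, 0.3, 0.6]
result = binned_efficiency(mass, prediction, threshold=0.5, edges=[0.0, 1.0, 2.0])
print(result)  # [0.75, 0.5]
```

Roughly equal values across bins would indicate predictions that are flat along the chosen feature.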

efficiencies_2d(features, efficiency, mask=None, n_bins=20, ignored_sideband=0.0, labels_dict=None, grid_columns=2, signal_label=1, cmap='RdBu')

For binary classification, plots the dependence of efficiency on two columns

Parameters:

features – tuple or list with the names of two features
efficiency (float) – efficiency value
n_bins (int or array-like) – bins for histogram
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: ‘bck’, 1: ‘signal’} is used
grid_columns (int) – number of columns in grid
ignored_sideband (float) – float from (0, 1), fraction of the plotted data that is ignored
signal_label (int) – label used to calculate the efficiency threshold
cmap (str) – name of colormap used

Return type:

plotting.GridPlot

feature_importance(grid_columns=2)

Get feature importances

Parameters:	grid_columns (int) – count of columns in grid
Return type:	plotting.GridPlot

feature_importance_shuffling(metric=LogLoss(regularization=1e-15), mask=None, grid_columns=2)

Get feature importances using the shuffling method (apply a random permutation to one particular column)

Parameters:

metric – function to measure quality, function(y_true, proba, sample_weight=None)
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data points to use
grid_columns (int) – number of columns in grid

Return type:

plotting.GridPlot
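The shuffling method can be sketched in a few lines of standalone Python: permute one column, predict again, and compare the metric with its value on the untouched data. Everything named here (shuffling_importance, error_rate, the toy predict function and data) is hypothetical and is not rep code; the error-rate metric is swapped in for simplicity in place of the LogLoss default.

```python
import random


def error_rate(y_true, y_pred, sample_weight=None):
    """Hypothetical quality metric: weighted fraction of wrong predictions."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    wrong = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t != p)
    return wrong / float(sum(sample_weight))


def shuffling_importance(predict, rows, y_true, metric, column, seed=0):
    """Change of `metric` after a random permutation of one feature column."""
    baseline = metric(y_true, [predict(row) for row in rows])
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    # Rebuild every row with the permuted value in place of the original one.
    shuffled = [row[:column] + [value] + row[column + 1:]
                for row, value in zip(rows, values)]
    return metric(y_true, [predict(row) for row in shuffled]) - baseline


def predict(row):
    """Toy model that looks only at the first feature."""
    return int(row[0] > 0.5)


rows = [[0.1, 5.0], [0.9, 3.0], [0.8, 7.0], [0.2, 1.0], [0.7, 2.0], [0.3, 9.0]]
y_true = [predict(row) for row in rows]
importances = [shuffling_importance(predict, rows, y_true, error_rate, column)
               for column in range(2)]
print(importances)
```

Because the toy model ignores the second column, shuffling it leaves the metric unchanged, so its importance is exactly zero; shuffling the first column can only keep or increase the error.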

features_correlation_matrix(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, cmap='Reds')

Correlation between features

Parameters:

features (None or list[str]) – features to use (if None, the estimator’s features are used)
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
tick_labels (None or array-like) – names for features in matrix
vmin (int) – value mapped to the minimum color
vmax (int) – value mapped to the maximum color
cmap (str) – color map name

Return type:

plotting.ColorMap

features_correlation_matrix_by_class(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, labels_dict=None, grid_columns=2)

Correlation between features (built separately for each class)

Parameters:

features (None or list[str]) – features to use (if None, the classifier’s features are used)
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: ‘bck’, 1: ‘signal’} is used
tick_labels (None or array-like) – names for features in matrix
vmin (int) – value mapped to the minimum color
vmax (int) – value mapped to the maximum color
grid_columns (int) – number of columns in grid

Return type:

plotting.GridPlot

features_pdf(features=None, mask=None, bins=30, ignored_sideband=0.0, labels_dict=None, grid_columns=2)

Features distribution (with errors)

Parameters:

features (None or list[str]) – features to use (if None, the classifier’s features are used)
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
bins (int or array-like) – number of bins or array of bin borders
labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: ‘bck’, 1: ‘signal’} is used
grid_columns (int) – number of columns in grid
ignored_sideband (float) – float from (0, 1), fraction of events ignored on the left and on the right

Return type:

plotting.GridPlot

learning_curve(metric, mask=None, steps=10, metric_label='metric', predict_only_masked=True)

Get learning curves

Parameters:

metric (function) – function with the signature function(y_true, y_pred, sample_weight=None)
steps (int or dict) – if int, the same step is used in all learning curves; otherwise a dict with steps for each estimator
metric_label (str) – name for metric on plot
predict_only_masked (bool) – if True, will predict only for needed events. When you build learning curves for FoldingClassifier/FoldingRegressor on the same dataset, set this to False to get unbiased predictions.

Return type:

plotting.FunctionsPlot

metrics_vs_cut(metric, mask=None, metric_label='metric')

Draw values of binary metric depending on the threshold on predictions.

Parameters:

metric – binary metric (AMS, f1 or similar; it shall use only tpr and fpr)
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask for data used in comparison
metric_label (str) – name for metric on plot

Return type:

plotting.FunctionsPlot
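The dependence of a binary metric on the cut can be sketched as follows: for every threshold, compute tpr and fpr, then evaluate the metric on them. This is standalone Python that does not call rep. The helper names and the choice of Youden's J statistic (tpr - fpr) are illustrative assumptions, as is the convention that the metric takes (tpr, fpr) as arguments.

```python
def rates_at_cut(y_true, signal_prob, cut):
    """Return (tpr, fpr) when events with signal_prob >= cut are called signal."""
    n_signal = sum(1 for y in y_true if y == 1)
    n_background = len(y_true) - n_signal
    tp = sum(1 for y, p in zip(y_true, signal_prob) if y == 1 and p >= cut)
    fp = sum(1 for y, p in zip(y_true, signal_prob) if y == 0 and p >= cut)
    return tp / float(n_signal), fp / float(n_background)


def metric_vs_cut(y_true, signal_prob, metric, cuts):
    """List of (cut, metric value) pairs, one per threshold."""
    return [(cut, metric(*rates_at_cut(y_true, signal_prob, cut)))
            for cut in cuts]


def youden_j(tpr, fpr):
    """Youden's J statistic, a metric that depends only on tpr and fpr."""
    return tpr - fpr


y_true = [0, 0, 1, 1, 0, 1]
signal_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
curve = metric_vs_cut(y_true, signal_prob, youden_j, cuts=[0.0, 0.5, 1.0])
print(curve)
```

At the extreme cuts either every event or no event passes, so tpr equals fpr and the statistic is zero; the interesting values lie in between.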

prediction_pdf(mask=None, target_class=1, bins=30, size=2, log=False, plot_type='error_bar', normed=True, labels_dict=None)

Distribution of predictions for signal and background separately, with errors

Parameters:

mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
target_class (int or None) – draw probabilities of being classified as target_class (default 1, will draw signal probabilities). If None, will draw the probability of the correct class for each event.
bins (int or array-like) – number of bins in histogram
size (int) – size of points on plots
log (bool) – use logarithmic scale
normed (bool) – draw normed pdf or not (normed by default)
plot_type (str) – ‘error_bar’ for an error-bar plot, ‘bar’ for a histogram
labels_dict (None or OrderedDict(int: str)) – names for class labels as a dictionary; if None, {0: ‘bck’, 1: ‘signal’} is used

Return type:

plotting.ErrorPlot or plotting.BarPlot
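The effect of target_class can be illustrated with a short standalone sketch that picks, for each event, the probability that would enter the distribution. The function name plotted_probabilities is hypothetical, and the sketch assumes predictions are given as one row of per-class probabilities per event.

```python
def plotted_probabilities(y_true, prob, target_class=1):
    """Select one probability per event.

    With an integer target_class, take that class's probability for every
    event; with None, take the probability of each event's true class.
    """
    if target_class is None:
        return [row[label] for label, row in zip(y_true, prob)]
    return [row[target_class] for row in prob]


y_true = [0, 1, 1]
prob = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
print(plotted_probabilities(y_true, prob))                     # [0.3, 0.6, 0.9]
print(plotted_probabilities(y_true, prob, target_class=None))  # [0.7, 0.6, 0.9]
```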

roc(mask=None, signal_label=1, physics_notion=False)

Calculate ROC curves for the data and return a ROC plot object

Parameters:

mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
physics_notion (bool) – if set to True, will show signal efficiency vs background rejection, otherwise TPR vs FPR.

Return type:

plotting.FunctionsPlot
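The two notations differ only in how the same points are expressed: signal efficiency is the TPR, and background rejection is 1 - FPR. The standalone sketch below computes ROC points by cutting at every distinct prediction value; it does not call rep, the function name roc_points is hypothetical, and the order of the values within each returned pair is an assumption rather than a statement about the axes of the actual plot.

```python
def roc_points(y_true, signal_prob, physics_notion=False):
    """ROC points obtained by cutting at every distinct prediction value.

    Returns (fpr, tpr) pairs by default, or
    (signal efficiency, background rejection) pairs if physics_notion is True.
    """
    n_signal = sum(1 for y in y_true if y == 1)
    n_background = len(y_true) - n_signal
    points = []
    for cut in sorted(set(signal_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, signal_prob) if y == 1 and p >= cut)
        fp = sum(1 for y, p in zip(y_true, signal_prob) if y == 0 and p >= cut)
        tpr = tp / float(n_signal)
        fpr = fp / float(n_background)
        if physics_notion:
            points.append((tpr, 1.0 - fpr))
        else:
            points.append((fpr, tpr))
    return points


y_true = [0, 0, 1, 1]
signal_prob = [0.1, 0.4, 0.35, 0.8]
standard = roc_points(y_true, signal_prob)
physics = roc_points(y_true, signal_prob, physics_notion=True)
print(standard)
print(physics)
```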

scatter(correlation_pairs, mask=None, marker_size=20, alpha=0.1, labels_dict=None, grid_columns=2)

Correlation between pairs of features

Parameters:

correlation_pairs (list[tuple]) – pairs of features for which scatter plots will be built
mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
marker_size (int) – size of marker for each event on the plot
alpha (float) – blending parameter for scatter
labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: ‘bck’, 1: ‘signal’} is used
grid_columns (int) – number of columns in grid

Return type:

plotting.GridPlot

Regression Report

This module contains the report class for regression estimators. The report includes:

features scatter plots, correlations

learning curve

feature importance

feature importance by shuffling the feature column

All methods return objects that have a plot method (see rep.plotting for details).

class rep.report.regression.RegressionReport(regressors, lds)

Report simplifies comparison of regressors on the same dataset.

Parameters:

regressors (dict[str, Regressor]) – OrderedDict with regressors (RegressionFactory)
lds (LabeledDataStorage) – data

compute_metric(metric, mask=None)

Compute metric value

Parameters:

metric – function-like object with: __call__(self, y_true, prob, sample_weight=None)
mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the points to use

Returns:

metric value for each estimator

feature_importance(grid_columns=2)

Get feature importances

Parameters:	grid_columns (int) – count of columns in grid
Return type:	plotting.GridPlot

feature_importance_shuffling(metric=<function mean_squared_error>, mask=None, grid_columns=2)

Get feature importances using the shuffling method (apply a random permutation to one particular column)

Parameters:

metric – function to measure quality, function(y_true, y_predicted, sample_weight=None)
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the points to compare on
grid_columns (int) – number of columns in grid

Return type:

plotting.GridPlot

features_correlation_matrix(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, cmap='Reds')

Correlation between features

Parameters:

features (None or list[str]) – features to use (if None, the estimator’s features are used)
mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
tick_labels (None or array-like) – names for features in matrix
vmin (int) – value mapped to the minimum color
vmax (int) – value mapped to the maximum color
cmap (str) – color map name

Return type:

plotting.ColorMap

learning_curve(metric, mask=None, steps=10, metric_label='metric', predict_only_masked=True)

Get learning curves

Parameters:

metric (function) – function with the signature function(y_true, y_pred, sample_weight=None)
steps (int or dict) – if int, the same step is used in all learning curves; otherwise a dict with steps for each estimator
metric_label (str) – name for metric on plot
predict_only_masked (bool) – if True, will predict only for needed events. When you build learning curves for FoldingClassifier/FoldingRegressor on the same dataset, set this to False to get unbiased predictions.

Return type:

plotting.FunctionsPlot
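A metric for a regression learning curve only needs the signature function(y_true, y_pred, sample_weight=None). The following is a minimal sketch of such a function, a weighted mean squared error written in plain Python; the name weighted_mse is hypothetical and the function is not taken from rep.

```python
def weighted_mse(y_true, y_pred, sample_weight=None):
    """Weighted mean squared error; unit weights if sample_weight is None."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    squared = sum(w * (t - p) ** 2
                  for t, p, w in zip(y_true, y_pred, sample_weight))
    return squared / float(sum(sample_weight))


y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 3.0, 2.0]
plain = weighted_mse(y_true, y_pred)
reweighted = weighted_mse(y_true, y_pred, sample_weight=[1.0, 1.0, 0.0])
print(plain, reweighted)
```

Giving the third event zero weight removes its large error from the average, which is why the second value is smaller than the first.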

predictions_scatter(features=None, mask=None, marker_size=20, alpha=0.1, grid_columns=2)

Correlation between predictions and features

Parameters:

features (None or list[str]) – features to use (if None, the estimator’s features are used)
mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
marker_size (int) – size of marker for each event on the plot
alpha (float) – blending parameter for scatter
grid_columns (int) – number of columns in grid

Return type:

plotting.GridPlot

scatter(correlation_pairs, mask=None, marker_size=20, alpha=0.1, grid_columns=2)

Correlation between pairs of features

Parameters:

correlation_pairs (list[tuple]) – pairs of features for which scatter plots will be built
mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
marker_size (int) – size of marker for each event on the plot
alpha (float) – blending parameter for scatter
grid_columns (int) – number of columns in grid

Return type:

plotting.GridPlot