Report for models

This module contains helper classes that build reports for estimators (feature distributions, prediction distributions, ROC curves, learning curves, and others) and make it possible to compare estimators.

Classification Report

This module contains the report class for classification estimators. The report includes:

  • features scatter plots, distributions, correlations
  • learning curve
  • ROC curve
  • efficiencies
  • metric vs cut
  • feature importance
  • feature importance by shuffling the feature column

All methods return objects that have a plot method (see rep.plotting for details). These objects contain the raw information about what is to be plotted.

class rep.report.classification.ClassificationReport(classifiers, lds)

Tests estimators on any data. Supports ROC curves, prediction distributions, feature information (correlation matrix, distributions, scatter plots for pairs of features), efficiencies for thresholds (to evaluate the flatness of predictions along an important feature), correlation of predictions with a chosen feature, and arbitrary quality metrics.

Parameters:
  • classifiers – classifiers to test
  • lds (LabeledDataStorage) – data

compute_metric(metric, mask=None)

Compute metric value

Parameters:
  • metric – function-like object with the signature

    __call__(self, y_true, prob, sample_weight=None)

  • mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the points to use
Returns:

metric value for each estimator
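
The metric contract described above can be illustrated with a small callable class. This is only a sketch: the class name WeightedAccuracy is hypothetical, and it assumes that prob holds one row of per-class probabilities per event, which the text above does not state explicitly.

```python
class WeightedAccuracy:
    """Hypothetical metric: weighted share of events whose most probable class is correct."""

    def __call__(self, y_true, prob, sample_weight=None):
        # Assumption: prob[i][k] is the predicted probability of class k for event i.
        if sample_weight is None:
            sample_weight = [1.0] * len(y_true)
        correct = 0.0
        total = 0.0
        for label, row, weight in zip(y_true, prob, sample_weight):
            predicted = max(range(len(row)), key=lambda k: row[k])
            if predicted == label:
                correct += weight
            total += weight
        return correct / total


metric = WeightedAccuracy()
y_true = [0, 1, 1, 0]
prob = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
unweighted = metric(y_true, prob)
weighted = metric(y_true, prob, sample_weight=[1.0, 1.0, 2.0, 1.0])
```

An object with this call signature is the kind of value the metric argument expects; a plain function with the same three arguments would follow the same contract.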

efficiencies(features, thresholds=None, mask=None, bins=30, labels_dict=None, ignored_sideband=0.0, errors=False, grid_columns=2)

Efficiencies for spectators

Parameters:
  • features (None or list[str]) – features to use (if None, the classifier's spectators are used)
  • bins (int or array-like) – bins for the histogram
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • thresholds (list[float]) – thresholds on prediction
  • errors (bool) – if True, draw error bars; otherwise draw an interpolated function
  • labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: 'bck', 1: 'signal'} is used
  • grid_columns (int) – number of columns in the grid
  • ignored_sideband (float) – float from (0, 1), fraction of events ignored from the left and from the right
Return type:

plotting.GridPlot
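
The quantity plotted by efficiencies can be pictured as the fraction of events passing a prediction threshold in each bin of a spectator. The sketch below is an illustration under assumptions, not the library's implementation: the function name, the `>` comparison, and the binning convention are all hypothetical.

```python
def efficiency_per_bin(spectator, prediction, threshold, edges):
    """Fraction of events with prediction > threshold in each bin of the spectator."""
    n_bins = len(edges) - 1
    passed = [0] * n_bins
    total = [0] * n_bins
    for value, score in zip(spectator, prediction):
        for i in range(n_bins):
            # Half-open bins [edges[i], edges[i + 1]); the last bin also includes its right edge.
            last = i == n_bins - 1
            if edges[i] <= value < edges[i + 1] or (last and value == edges[-1]):
                total[i] += 1
                passed[i] += score > threshold
                break
    # Empty bins have no defined efficiency; None marks them.
    return [p / t if t else None for p, t in zip(passed, total)]


mass = [0.1, 0.4, 0.6, 0.9, 1.0, 0.2]
score = [0.9, 0.3, 0.8, 0.7, 0.2, 0.6]
eff = efficiency_per_bin(mass, score, threshold=0.5, edges=[0.0, 0.5, 1.0])
```

Equal values across bins, as in this toy data, indicate that the selection does not depend on the spectator, which is the flatness the report is meant to expose.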

efficiencies_2d(features, efficiency, mask=None, n_bins=20, ignored_sideband=0.0, labels_dict=None, grid_columns=2, signal_label=1, cmap='RdBu')

For binary classification, plots the dependence of efficiency on two columns.

Parameters:
  • features – tuple or list with the names of two features
  • efficiency (float) – efficiency value
  • n_bins (int or array-like) – bins for the histogram
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: 'bck', 1: 'signal'} is used
  • grid_columns (int) – number of columns in the grid
  • ignored_sideband (float) – float from (0, 1), fraction of events ignored from the left and from the right
  • signal_label (int) – label to calculate efficiency threshold
  • cmap (str) – name of colormap used
Return type:

plotting.GridPlot

feature_importance(grid_columns=2)

Get feature importances

Parameters:
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot

feature_importance_shuffling(metric=LogLoss(regularization=1e-15), mask=None, grid_columns=2)

Get feature importances using the shuffling method (a random permutation is applied to one particular column)

Parameters:
  • metric – function to measure quality, with the signature function(y_true, proba, sample_weight=None)
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the points to use
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot
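
The shuffling method can be sketched in a few lines of plain Python: permute one column, recompute the metric, and compare with the unshuffled value. Everything below is illustrative. The helper names are hypothetical, the toy model returns hard labels rather than probabilities, and the sign convention (increase of a loss) is an assumption rather than the library's definition.

```python
import random


def shuffling_importance(predict, rows, y_true, metric, seed=0):
    """Increase of `metric` (treated as a loss) after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = metric(y_true, [predict(row) for row in rows])
    importances = []
    for column in range(len(rows[0])):
        values = [row[column] for row in rows]
        rng.shuffle(values)  # random permutation of a single column
        shuffled_rows = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, values)
        ]
        score = metric(y_true, [predict(row) for row in shuffled_rows])
        importances.append(score - baseline)
    return importances


def error_rate(y_true, y_pred, sample_weight=None):
    # Toy loss; sample_weight is accepted to mirror the documented signature but is ignored.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_predict(row):
    # Toy model that looks only at the first column.
    return int(row[0] > 0.5)


rows = [[i / 10, (7 * i) % 10] for i in range(10)]
labels = [toy_predict(row) for row in rows]
importances = shuffling_importance(toy_predict, rows, labels, error_rate)
```

The second column is never read by the toy model, so shuffling it leaves the loss unchanged and its importance is exactly zero; the first column can only make the loss worse.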

features_correlation_matrix(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, cmap='Reds')

Correlation between features

Parameters:
  • features (None or list[str]) – features to use (if None, the estimator's features are used)
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • tick_labels (None or array-like) – names for features in the matrix
  • vmin (int) – value mapped to the minimum color
  • vmax (int) – value mapped to the maximum color
  • cmap (str) – color map name
Return type:

plotting.ColorMap

features_correlation_matrix_by_class(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, labels_dict=None, grid_columns=2)

Correlation between features (built separately for each class)

Parameters:
  • features (None or list[str]) – features to use (if None, the classifier's features are used)
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: 'bck', 1: 'signal'} is used
  • tick_labels (None or array-like) – names for features in the matrix
  • vmin (int) – value mapped to the minimum color
  • vmax (int) – value mapped to the maximum color
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot

features_pdf(features=None, mask=None, bins=30, ignored_sideband=0.0, labels_dict=None, grid_columns=2)

Feature distributions (with errors)

Parameters:
  • features (None or list[str]) – features to use (if None, the classifier's features are used)
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • bins (int or array-like) – number of bins or array with bin borders
  • labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: 'bck', 1: 'signal'} is used
  • grid_columns (int) – number of columns in the grid
  • ignored_sideband (float) – float from (0, 1), fraction of events ignored from the left and from the right
Return type:

plotting.GridPlot
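
One way to read the ignored_sideband description is as a symmetric trim of the sorted values before histogramming. The helper below is a hypothetical illustration of that reading; the library may instead derive the plotting range from percentiles, so treat the details as assumptions.

```python
def trim_sidebands(values, ignored_sideband):
    """Drop the given fraction of events from each end of the sorted values."""
    ordered = sorted(values)
    n_drop = int(len(ordered) * ignored_sideband)
    return ordered[n_drop:len(ordered) - n_drop]


values = [100.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, -50.0]
kept = trim_sidebands(values, ignored_sideband=0.1)
```

With ten events and ignored_sideband=0.1, one event is dropped on each side, so the two outliers no longer stretch the histogram range.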

learning_curve(metric, mask=None, steps=10, metric_label='metric', predict_only_masked=True)

Get learning curves

Parameters:
  • metric (function) – function with the signature function(y_true, y_pred, sample_weight=None)
  • steps (int or dict) – if int, the same step is used in all learning curves; otherwise a dict with the step for each estimator
  • metric_label (str) – name of the metric on the plot
  • predict_only_masked (bool) – if True, predictions are made only for the events selected by the mask. When you build learning curves for FoldingClassifier/FoldingRegressor on the same dataset, set this to False to get unbiased predictions.
Return type:

plotting.FunctionsPlot

metrics_vs_cut(metric, mask=None, metric_label='metric')

Draw values of binary metric depending on the threshold on predictions.

Parameters:
  • metric – binary metric (AMS, F1, or similar; it must depend only on TPR and FPR)
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask for data used in comparison
  • metric_label (str) – name for metric on plot
Return type:

plotting.FunctionsPlot
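
The idea of a metric drawn against the cut can be sketched as follows: for each threshold, compute TPR and FPR and feed them to a binary metric. The names rates_at_cut and youden_j are hypothetical, and passing (tpr, fpr) directly to the metric is an assumed calling convention chosen for illustration.

```python
def rates_at_cut(y_true, prediction, cut):
    """Return (tpr, fpr) when events with prediction > cut are called signal."""
    n_signal = sum(1 for y in y_true if y == 1)
    n_background = len(y_true) - n_signal
    true_pos = sum(1 for y, p in zip(y_true, prediction) if y == 1 and p > cut)
    false_pos = sum(1 for y, p in zip(y_true, prediction) if y != 1 and p > cut)
    return true_pos / n_signal, false_pos / n_background


def youden_j(tpr, fpr):
    # Example binary metric that depends only on TPR and FPR.
    return tpr - fpr


y_true = [0, 0, 1, 1, 0, 1]
prediction = [0.1, 0.25, 0.35, 0.8, 0.2, 0.9]
cuts = [0.0, 0.3, 0.5]
curve = [(cut, youden_j(*rates_at_cut(y_true, prediction, cut))) for cut in cuts]
best_cut = max(curve, key=lambda point: point[1])[0]
```

Plotting the second element of each pair in curve against the first gives the kind of dependence this method draws; the maximum of the curve suggests a working point.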

prediction_pdf(mask=None, target_class=1, bins=30, size=2, log=False, plot_type='error_bar', normed=True, labels_dict=None)

Distribution of predictions for signal and background separately, with errors

Parameters:
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • target_class (int or None) – draw probabilities of being classified as target_class (default 1, which draws signal probabilities). If None, draw the probability assigned to the correct class of each event.
  • bins (int or array-like) – number of bins in the histogram
  • size (int) – point size on plots
  • log (bool) – use logarithmic scale
  • normed (bool) – whether to draw a normed pdf (normed by default)
  • plot_type (str) – 'error_bar' for an error-bar plot, 'bar' for a histogram
  • labels_dict (None or OrderedDict(int: str)) – names for class labels as a dictionary; if None, {0: 'bck', 1: 'signal'} is used
Return type:

plotting.ErrorPlot or plotting.BarPlot
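
The two modes of target_class can be illustrated with a tiny helper. This is a sketch only: probability_to_plot is a hypothetical name, and it assumes class labels are integers that index the columns of the probability rows.

```python
def probability_to_plot(y_true, proba, target_class=1):
    """Pick, for every event, the probability that would be histogrammed."""
    if target_class is None:
        # Probability assigned to the true class of each event.
        return [row[label] for label, row in zip(y_true, proba)]
    # Probability of being classified as target_class.
    return [row[target_class] for row in proba]


y_true = [0, 1, 1]
proba = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
signal_proba = probability_to_plot(y_true, proba)
true_class_proba = probability_to_plot(y_true, proba, target_class=None)
```

With target_class=None the first event contributes 0.8 (its probability of being background, its true class) instead of 0.2.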

roc(mask=None, signal_label=1, physics_notion=False)

Calculate ROC curves for the data and return a ROC plot object

Parameters:
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • physics_notion (bool) – if set to True, will show signal efficiency vs background rejection, otherwise TPR vs FPR.
Return type:

plotting.FunctionsPlot
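
The difference between the two notions can be shown on a toy example. The sketch below assumes that background rejection means 1 - FPR and that signal efficiency is the same quantity as TPR; the function name and the order of the coordinates in each point are assumptions made for illustration.

```python
def roc_points(y_true, prediction, signal_label=1, physics_notion=False):
    """Return curve points, one per distinct threshold, highest threshold first."""
    n_signal = sum(1 for y in y_true if y == signal_label)
    n_background = len(y_true) - n_signal
    points = []
    for cut in sorted(set(prediction), reverse=True):
        tpr = sum(1 for y, p in zip(y_true, prediction)
                  if y == signal_label and p >= cut) / n_signal
        fpr = sum(1 for y, p in zip(y_true, prediction)
                  if y != signal_label and p >= cut) / n_background
        if physics_notion:
            # (signal efficiency, background rejection)
            points.append((tpr, 1.0 - fpr))
        else:
            # (FPR, TPR)
            points.append((fpr, tpr))
    return points


y_true = [0, 0, 1, 1]
prediction = [0.1, 0.6, 0.4, 0.9]
standard = roc_points(y_true, prediction)
physics = roc_points(y_true, prediction, physics_notion=True)
```

Both lists describe the same classifier; only the axes change, so a curve that bends toward the top-left corner in the standard view bends toward the top-right corner in the physics view.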

scatter(correlation_pairs, mask=None, marker_size=20, alpha=0.1, labels_dict=None, grid_columns=2)

Correlation between pairs of features

Parameters:
  • correlation_pairs (list[tuple]) – pairs of features for which scatter plots will be built
  • mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • marker_size (int) – size of the marker for each event on the plot
  • alpha (float) – blending parameter for scatter
  • labels_dict (None or OrderedDict(int: str)) – names for class labels; if None, {0: 'bck', 1: 'signal'} is used
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot

Regression Report

This module contains the report class for regression estimators. The report includes:

  • features scatter plots, correlations
  • learning curve
  • feature importance
  • feature importance by shuffling the feature column

All methods return objects that have a plot method (see rep.plotting for details).

class rep.report.regression.RegressionReport(regressors, lds)

Report simplifies comparison of regressors on the same dataset.

Parameters:
  • regressors (dict[str, Regressor]) – OrderedDict with regressors (RegressionFactory)
  • lds (LabeledDataStorage) – data

compute_metric(metric, mask=None)

Compute metric value

Parameters:
  • metric – function-like object with the signature

    __call__(self, y_true, prob, sample_weight=None)

  • mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the points to use
Returns:

metric value for each estimator

feature_importance(grid_columns=2)

Get feature importances

Parameters:
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot

feature_importance_shuffling(metric=<function mean_squared_error>, mask=None, grid_columns=2)

Get feature importances using the shuffling method (a random permutation is applied to one particular column)

Parameters:
  • metric – function to measure quality, with the signature function(y_true, y_predicted, sample_weight=None)
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the points to compare on
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot
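
A regression metric with the signature described above can be as simple as a weighted mean squared error. The function below is a hypothetical stand-in written in plain Python to show the expected shape of such a callable; it is not the default metric itself.

```python
def weighted_mse(y_true, y_predicted, sample_weight=None):
    """Mean squared error with optional per-event weights."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    weighted_sum = sum(w * (t - p) ** 2
                       for t, p, w in zip(y_true, y_predicted, sample_weight))
    return weighted_sum / sum(sample_weight)


y_true = [1.0, 2.0, 4.0]
y_predicted = [1.0, 3.0, 2.0]
plain = weighted_mse(y_true, y_predicted)
weighted = weighted_mse(y_true, y_predicted, sample_weight=[1.0, 2.0, 0.0])
```

Because lower values are better for a loss like this one, a feature whose shuffling raises the value the most would be the one the regressor relies on most.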

features_correlation_matrix(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, cmap='Reds')

Correlation between features

Parameters:
  • features (None or list[str]) – features to use (if None, the estimator's features are used)
  • mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • tick_labels (None or array-like) – names for features in the matrix
  • vmin (int) – value mapped to the minimum color
  • vmax (int) – value mapped to the maximum color
  • cmap (str) – color map name
Return type:

plotting.ColorMap

learning_curve(metric, mask=None, steps=10, metric_label='metric', predict_only_masked=True)

Get learning curves

Parameters:
  • metric (function) – function with the signature function(y_true, y_pred, sample_weight=None)
  • steps (int or dict) – if int, the same step is used in all learning curves; otherwise a dict with the step for each estimator
  • metric_label (str) – name of the metric on the plot
  • predict_only_masked (bool) – if True, predictions are made only for the events selected by the mask. When you build learning curves for FoldingClassifier/FoldingRegressor on the same dataset, set this to False to get unbiased predictions.
Return type:

plotting.FunctionsPlot

predictions_scatter(features=None, mask=None, marker_size=20, alpha=0.1, grid_columns=2)

Correlation between predictions and features

Parameters:
  • features (None or list[str]) – features to use (if None, the estimator's features are used)
  • mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • marker_size (int) – size of the marker for each event on the plot
  • alpha (float) – blending parameter for scatter
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot

scatter(correlation_pairs, mask=None, marker_size=20, alpha=0.1, grid_columns=2)

Correlation between pairs of features

Parameters:
  • correlation_pairs (list[tuple]) – pairs of features for which scatter plots will be built
  • mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
  • marker_size (int) – size of the marker for each event on the plot
  • alpha (float) – blending parameter for scatter
  • grid_columns (int) – number of columns in the grid
Return type:

plotting.GridPlot