Report for models
This module contains helper classes that build reports for estimators (feature distributions, prediction distributions, ROC curves, learning curves, and others) and make it possible to compare estimators.
Classification Report
This module contains the report class for classification estimators. The report includes:
- feature scatter plots, distributions, correlations
- learning curve
- ROC curve
- efficiencies
- metric vs cut
- feature importance
- feature importance by shuffling the feature column
All methods return objects that have a plot method (see rep.plotting for details); these objects hold the raw information about what will be plotted.
- class rep.report.classification.ClassificationReport(classifiers, lds)
Tests estimators on any data. Supports ROC curves, prediction distributions, feature information (correlation matrix, distributions, scatter plots for pairs of features), efficiencies for thresholds (to evaluate the flatness of predictions along an important feature), correlation of the prediction with a given feature, and arbitrary quality metrics.
Parameters: - classifiers (dict[str, Classifier]) – estimators
- lds (LabeledDataStorage) – data
- compute_metric(metric, mask=None)
Compute the metric value.
Parameters: - metric – function-like object with the signature __call__(self, y_true, prob, sample_weight=None)
- mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the points to use
Returns: metric value for each estimator
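A minimal sketch of an object that satisfies this interface is given below. The class name WeightedAccuracy is hypothetical, and the layout of prob (one row of class probabilities per event, indexed by class label) is an assumption made for the example rather than something stated above.

```python
class WeightedAccuracy(object):
    """Hypothetical metric: weighted fraction of events whose most probable class is correct."""

    def __call__(self, y_true, prob, sample_weight=None):
        if sample_weight is None:
            sample_weight = [1.0] * len(y_true)
        correct = 0.0
        total = 0.0
        for label, row, weight in zip(y_true, prob, sample_weight):
            # index of the largest probability in the row = predicted class
            predicted = max(range(len(row)), key=lambda k: row[k])
            if predicted == label:
                correct += weight
            total += weight
        return correct / total


metric = WeightedAccuracy()
y_true = [0, 1, 1, 0]
prob = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
accuracy = metric(y_true, prob)
weighted = metric(y_true, prob, sample_weight=[2.0, 1.0, 1.0, 0.0])
```

Any callable with the same three arguments, including a plain function, would follow the same pattern.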
- efficiencies(features, thresholds=None, mask=None, bins=30, labels_dict=None, ignored_sideband=0.0, errors=False, grid_columns=2)
Efficiencies for spectators.
Parameters: - features (None or list[str]) – features to use (if None, the classifier's spectators are used)
- bins (int or array-like) – bins for the histogram
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- thresholds (list[float]) – thresholds on the prediction
- errors (bool) – if True, draw error bars; otherwise interpolate the function
- labels_dict (None or OrderedDict(int: str)) – names for the class labels; if None, {0: 'bck', 1: 'signal'} is used
- grid_columns (int) – number of columns in the grid
- ignored_sideband (float) – value in (0, 1), the fraction of the plotted data that is ignored as sideband
Return type:
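To make the notion of an efficiency along a spectator concrete, the sketch below computes, for one threshold, the fraction of events in each bin of a feature whose prediction passes the threshold. It is an illustration only: the function name efficiency_per_bin and the binning convention are assumptions, and REP's own computation may differ (for example in how classes are separated).

```python
def efficiency_per_bin(feature, prediction, threshold, bin_edges):
    """Fraction of events in each bin whose prediction is above the threshold."""
    n_bins = len(bin_edges) - 1
    passed = [0] * n_bins
    total = [0] * n_bins
    for value, score in zip(feature, prediction):
        for i in range(n_bins):
            # half-open bins [edge_i, edge_{i+1}); the last bin also includes its right edge
            last = (i == n_bins - 1)
            if bin_edges[i] <= value < bin_edges[i + 1] or (last and value == bin_edges[-1]):
                total[i] += 1
                passed[i] += score > threshold
                break
    # None marks an empty bin, where the efficiency is undefined
    return [p / float(t) if t else None for p, t in zip(passed, total)]


mass = [0.1, 0.2, 0.6, 0.7, 0.9, 1.0]
pred = [0.9, 0.3, 0.8, 0.7, 0.2, 0.6]
eff = efficiency_per_bin(mass, pred, threshold=0.5, bin_edges=[0.0, 0.5, 1.0])
```

If the efficiency is roughly the same in every bin, the cut on the prediction does not depend strongly on the spectator.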
- efficiencies_2d(features, efficiency, mask=None, n_bins=20, ignored_sideband=0.0, labels_dict=None, grid_columns=2, signal_label=1, cmap='RdBu')
For binary classification, plots the dependence of efficiency on two columns.
Parameters: - features – tuple or list with the names of two features
- efficiency (float) – efficiency value
- n_bins (int or array-like) – bins for the histogram
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- labels_dict (None or OrderedDict(int: str)) – names for the class labels; if None, {0: 'bck', 1: 'signal'} is used
- grid_columns (int) – number of columns in the grid
- ignored_sideband (float) – value in (0, 1), the fraction of the plotted data that is ignored as sideband
- signal_label (int) – label used to calculate the efficiency threshold
- cmap (str) – name of the colormap to use
Return type:
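One plausible reading of the efficiency and signal_label parameters is that a threshold on the prediction is chosen so that the requested fraction of signal events passes it. The sketch below illustrates that reading only; the function name threshold_for_efficiency is hypothetical and nothing here is taken from REP's implementation.

```python
import math


def threshold_for_efficiency(labels, prediction, efficiency, signal_label=1):
    """Smallest signal score that still keeps at least `efficiency` of the signal events."""
    signal_scores = sorted(
        (score for label, score in zip(labels, prediction) if label == signal_label),
        reverse=True,
    )
    # number of signal events that must pass the cut, clamped to a valid range
    n_keep = int(math.ceil(efficiency * len(signal_scores)))
    n_keep = max(1, min(n_keep, len(signal_scores)))
    return signal_scores[n_keep - 1]


labels = [1, 1, 1, 1, 0, 0]
pred = [0.9, 0.8, 0.6, 0.4, 0.7, 0.1]
cut = threshold_for_efficiency(labels, pred, efficiency=0.5)
kept = sum(1 for label, score in zip(labels, pred) if label == 1 and score >= cut)
```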
- feature_importance(grid_columns=2)
Get features importance
Parameters: grid_columns (int) – count of columns in grid
Return type: plotting.GridPlot
- feature_importance_shuffling(metric=LogLoss(regularization=1e-15), mask=None, grid_columns=2)
Get feature importances using the shuffling method (a random permutation is applied to one particular column at a time).
Parameters: - metric – function that measures quality, with the signature function(y_true, proba, sample_weight=None)
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data points the importance is evaluated on
- grid_columns (int) – number of columns in the grid
Return type:
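The shuffling idea can be sketched independently of REP: permute one column, re-evaluate the metric, and record how much it changes. In the sketch below, log_loss, shuffling_importance and toy_predict_proba are hypothetical names written for this example, and proba is assumed to hold one row of class probabilities per event.

```python
import math
import random


def log_loss(y_true, proba, sample_weight=None, eps=1e-15):
    """Weighted mean of -log(probability assigned to the true class)."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    losses = [
        -weight * math.log(min(max(row[label], eps), 1.0 - eps))
        for label, row, weight in zip(y_true, proba, sample_weight)
    ]
    return sum(losses) / sum(sample_weight)


def shuffling_importance(predict_proba, rows, y_true, metric, seed=0):
    """Change of the metric after permuting each column in turn."""
    rng = random.Random(seed)
    baseline = metric(y_true, predict_proba(rows))
    importances = []
    for column in range(len(rows[0])):
        values = [row[column] for row in rows]
        rng.shuffle(values)
        # copy of the data with only this column permuted
        shuffled = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, values)]
        importances.append(metric(y_true, predict_proba(shuffled)) - baseline)
    return importances


def toy_predict_proba(rows):
    """Hypothetical classifier that looks only at the first column."""
    return [[0.1, 0.9] if row[0] > 0.5 else [0.9, 0.1] for row in rows]


rows = [[0.9, 3.0], [0.8, 1.0], [0.7, 2.0], [0.6, 5.0],
        [0.4, 4.0], [0.3, 0.0], [0.2, 6.0], [0.1, 7.0]]
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
importances = shuffling_importance(toy_predict_proba, rows, y_true, log_loss)
```

Because the toy classifier ignores the second column, shuffling it leaves the loss unchanged, so its importance is zero. For a loss such as log loss, a larger increase after shuffling indicates a more important feature.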
- features_correlation_matrix(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, cmap='Reds')
Correlation between features.
Parameters: - features (None or list[str]) – features to use (if None, the estimator's features are used)
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- tick_labels (None or array-like) – names for the features in the matrix
- vmin (int) – value mapped to the minimum color
- vmax (int) – value mapped to the maximum color
- cmap (str) – color map name
Return type:
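The default range vmin=-1, vmax=1 matches the range of a correlation coefficient. As an illustration, the sketch below builds a correlation matrix from named columns, assuming the Pearson coefficient (the text above does not say which coefficient is used). The helper names pearson and correlation_matrix and the feature names are hypothetical.

```python
import math
from collections import OrderedDict


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Square matrix of pairwise correlations, in the order of the given columns."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


columns = OrderedDict([
    ("pt", [1.0, 2.0, 3.0, 4.0]),
    ("eta", [2.0, 4.0, 6.0, 8.0]),   # perfectly correlated with pt
    ("phi", [4.0, 3.0, 2.0, 1.0]),   # perfectly anti-correlated with pt
])
matrix = correlation_matrix(columns)
```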
- features_correlation_matrix_by_class(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, labels_dict=None, grid_columns=2)
Correlation between features (built separately for each class).
Parameters: - features (None or list[str]) – features to use (if None, the classifier's features are used)
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- labels_dict (None or OrderedDict(int: str)) – names for the class labels; if None, {0: 'bck', 1: 'signal'} is used
- tick_labels (None or array-like) – names for the features in the matrix
- vmin (int) – value mapped to the minimum color
- vmax (int) – value mapped to the maximum color
- grid_columns (int) – number of columns in the grid
Return type:
- features_pdf(features=None, mask=None, bins=30, ignored_sideband=0.0, labels_dict=None, grid_columns=2)
Feature distributions (with errors).
Parameters: - features (None or list[str]) – features to use (if None, the classifier's features are used)
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- bins (int or array-like) – number of bins, or an array with the bin borders
- labels_dict (None or OrderedDict(int: str)) – names for the class labels; if None, {0: 'bck', 1: 'signal'} is used
- grid_columns (int) – number of columns in the grid
- ignored_sideband (float) – float from (0, 1), the fraction of events ignored from the left and from the right
Return type:
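The effect of ignored_sideband can be pictured as trimming outliers before the histogram is built. The sketch below assumes the fraction is dropped from each side separately, which is one reading of the description; the function name trim_sidebands is hypothetical.

```python
def trim_sidebands(values, ignored_sideband):
    """Drop the given fraction of events from each end of the sorted sample."""
    ordered = sorted(values)
    n_drop = int(ignored_sideband * len(ordered))
    return ordered[n_drop:len(ordered) - n_drop]


# two extreme values on each side would otherwise stretch the plotted range
values = [-50.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 90.0]
trimmed = trim_sidebands(values, ignored_sideband=0.25)
```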
- learning_curve(metric, mask=None, steps=10, metric_label='metric', predict_only_masked=True)
Get learning curves.
Parameters: - metric (function) – function with the signature function(y_true, y_pred, sample_weight=None)
- steps (int or dict) – if int, the same step is used in all learning curves; otherwise a dict with the steps for each estimator
- metric_label (str) – name of the metric on the plot
- predict_only_masked (bool) – if True, predictions are made only for the masked events. When you build learning curves for FoldingClassifier/FoldingRegressor on the same dataset, set this to False to get unbiased predictions.
Return type:
- metrics_vs_cut(metric, mask=None, metric_label='metric')
Draw the values of a binary metric as a function of the threshold on predictions.
Parameters: - metric – binary metric (AMS, F1 or similar; it should use only TPR and FPR)
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data used in the comparison
- metric_label (str) – name of the metric on the plot
Return type:
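As a sketch of what "metric vs cut" means, the code below scans a few thresholds, computes TPR and FPR at each, and evaluates a metric that depends only on those two rates. The metric youden_j (TPR minus FPR) and the helper names are hypothetical choices for this example; the exact metric interface expected by REP is not specified above.

```python
def rates_at_cut(y_true, score, cut):
    """True-positive and false-positive rates when events with score >= cut are selected."""
    n_signal = sum(1 for y in y_true if y == 1)
    n_background = len(y_true) - n_signal
    tp = sum(1 for y, s in zip(y_true, score) if y == 1 and s >= cut)
    fp = sum(1 for y, s in zip(y_true, score) if y != 1 and s >= cut)
    return tp / float(n_signal), fp / float(n_background)


def metric_vs_cut(y_true, score, metric, cuts):
    """List of (cut, metric value) pairs."""
    return [(cut, metric(*rates_at_cut(y_true, score, cut))) for cut in cuts]


def youden_j(tpr, fpr):
    """Hypothetical binary metric that uses only TPR and FPR."""
    return tpr - fpr


y_true = [1, 1, 1, 0, 0, 0]
score = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
curve = metric_vs_cut(y_true, score, youden_j, cuts=[0.2, 0.5, 0.8])
best = max(curve, key=lambda point: point[1])
```

Plotting the second element of each pair against the first gives the curve; the maximum marks the threshold that is optimal for the chosen metric.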
- prediction_pdf(mask=None, target_class=1, bins=30, size=2, log=False, plot_type='error_bar', normed=True, labels_dict=None)
Distribution of predictions, drawn separately for signal and background, with errors.
Parameters: - mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- target_class (int or None) – draw the probabilities of being classified as target_class (default 1, which draws the signal probabilities). If None, the probability corresponding to the correct class of each event is drawn.
- bins (int or array-like) – number of bins in the histogram
- size (int) – point size on the plots
- log (bool) – use a logarithmic scale
- normed (bool) – whether to draw a normalized pdf (normalized by default)
- plot_type (str) – 'error_bar' for an error-bar plot, 'bar' for a histogram-style plot
- labels_dict (None or OrderedDict(int: str)) – names for the class labels as a dictionary; if None, {0: 'bck', 1: 'signal'} is used
Return type: plotting.ErrorPlot or plotting.BarPlot
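The two modes of target_class can be illustrated with a small helper. The name probability_to_plot is hypothetical, and the sketch assumes that each row of proba is indexed by class label.

```python
def probability_to_plot(y_true, proba, target_class=1):
    """Select which probability is histogrammed for each event."""
    if target_class is None:
        # probability assigned to the correct class of each event
        return [row[label] for label, row in zip(y_true, proba)]
    # probability of being classified as target_class
    return [row[target_class] for row in proba]


y_true = [0, 1, 1]
proba = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
signal_probability = probability_to_plot(y_true, proba)
correct_class_probability = probability_to_plot(y_true, proba, target_class=None)
```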
- roc(mask=None, signal_label=1, physics_notion=False)
Compute the ROC curves for the data and return a ROC plot object.
Parameters: - mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- physics_notion (bool) – if True, signal efficiency vs background rejection is shown; otherwise TPR vs FPR.
Return type:
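The two notations describe the same curve: signal efficiency is the TPR, and background rejection is 1 - FPR. The sketch below computes the points in either notation; the function name roc_points and the order of the coordinates in each pair are choices made for this example, not REP's behaviour.

```python
def roc_points(y_true, score, signal_label=1, physics_notion=False):
    """ROC points obtained by cutting at every distinct score, from the highest down."""
    n_signal = sum(1 for y in y_true if y == signal_label)
    n_background = len(y_true) - n_signal
    points = []
    for cut in sorted(set(score), reverse=True):
        tpr = sum(1 for y, s in zip(y_true, score)
                  if y == signal_label and s >= cut) / float(n_signal)
        fpr = sum(1 for y, s in zip(y_true, score)
                  if y != signal_label and s >= cut) / float(n_background)
        if physics_notion:
            # (signal efficiency, background rejection)
            points.append((tpr, 1.0 - fpr))
        else:
            # (FPR, TPR)
            points.append((fpr, tpr))
    return points


y_true = [1, 1, 0, 0]
score = [0.9, 0.6, 0.7, 0.2]
standard = roc_points(y_true, score)
physics = roc_points(y_true, score, physics_notion=True)
```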
- scatter(correlation_pairs, mask=None, marker_size=20, alpha=0.1, labels_dict=None, grid_columns=2)
Correlation between pairs of features.
Parameters: - correlation_pairs (list[tuple]) – pairs of features for which scatter plots will be built
- mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- marker_size (int) – size of the marker for each event on the plot
- alpha (float) – blending parameter for the scatter plot
- labels_dict (None or OrderedDict(int: str)) – names for the class labels; if None, {0: 'bck', 1: 'signal'} is used
- grid_columns (int) – number of columns in the grid
Return type:
Regression Report
This module contains the report class for regression estimators. The report includes:
- feature scatter plots, correlations
- learning curve
- feature importance
- feature importance by shuffling the feature column
All methods return objects that can have a plot method (see rep.plotting for details).
- class rep.report.regression.RegressionReport(regressors, lds)
Report simplifies comparison of regressors on the same dataset.
Parameters: - regressors (dict[str, Regressor]) – OrderedDict with regressors (RegressionFactory)
- lds (LabeledDataStorage) – data
- compute_metric(metric, mask=None)
Compute the metric value.
Parameters: - metric – function-like object with the signature __call__(self, y_true, prob, sample_weight=None)
- mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the points to use
Returns: metric value for each estimator
- feature_importance(grid_columns=2)
Get features importance
Parameters: grid_columns (int) – count of columns in grid
Return type: plotting.GridPlot
- feature_importance_shuffling(metric=<function mean_squared_error>, mask=None, grid_columns=2)
Get feature importances using the shuffling method (a random permutation is applied to one particular column at a time).
Parameters: - metric – function that measures quality, with the signature function(y_true, y_predicted, sample_weight=None)
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the points to compare on
- grid_columns (int) – number of columns in the grid
Return type:
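A regression metric with the signature function(y_true, y_predicted, sample_weight=None) might look like the sketch below. The function weighted_mse is a hypothetical stand-in written for this example; it is not the mean_squared_error function shown as the default above.

```python
def weighted_mse(y_true, y_predicted, sample_weight=None):
    """Weighted mean of squared differences between targets and predictions."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    squared = sum(w * (t - p) ** 2
                  for t, p, w in zip(y_true, y_predicted, sample_weight))
    return squared / sum(sample_weight)


y_true = [1.0, 2.0, 3.0]
y_predicted = [1.0, 2.0, 5.0]
plain = weighted_mse(y_true, y_predicted)
# a zero weight removes the only mispredicted event from the average
masked = weighted_mse(y_true, y_predicted, sample_weight=[1.0, 1.0, 0.0])
```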
- features_correlation_matrix(features=None, mask=None, tick_labels=None, vmin=-1, vmax=1, cmap='Reds')
Correlation between features.
Parameters: - features (None or list[str]) – features to use (if None, the estimator's features are used)
- mask (None or numbers.Number or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- tick_labels (None or array-like) – names for the features in the matrix
- vmin (int) – value mapped to the minimum color
- vmax (int) – value mapped to the maximum color
- cmap (str) – color map name
Return type:
- learning_curve(metric, mask=None, steps=10, metric_label='metric', predict_only_masked=True)
Get learning curves.
Parameters: - metric (function) – function with the signature function(y_true, y_pred, sample_weight=None)
- steps (int or dict) – if int, the same step is used in all learning curves; otherwise a dict with the steps for each estimator
- metric_label (str) – name of the metric on the plot
- predict_only_masked (bool) – if True, predictions are made only for the masked events. When you build learning curves for FoldingClassifier/FoldingRegressor on the same dataset, set this to False to get unbiased predictions.
Return type:
- predictions_scatter(features=None, mask=None, marker_size=20, alpha=0.1, grid_columns=2)
Correlation between predictions and features.
Parameters: - features (None or list[str]) – features to use (if None, the estimator's features are used)
- mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- marker_size (int) – size of the marker for each event on the plot
- alpha (float) – blending parameter for the scatter plot
- grid_columns (int) – number of columns in the grid
Return type:
- scatter(correlation_pairs, mask=None, marker_size=20, alpha=0.1, grid_columns=2)
Correlation between pairs of features.
Parameters: - correlation_pairs (list[tuple]) – pairs of features for which scatter plots will be built
- mask (None or array-like or str or function(pandas.DataFrame)) – mask selecting the data to use
- marker_size (int) – size of the marker for each event on the plot
- alpha (float) – blending parameter for the scatter plot
- grid_columns (int) – number of columns in the grid
Return type: