Metrics¶
Metric object API¶
REP introduces metric functions in a specific format. Metric functions following this format can be used in grid search and reports.
In the general case, metrics follow the standard sklearn convention for estimators and provide:
 a constructor (you should create an instance of the metric!); all fine-tuning is done at this step:
>>> metric = RocAuc(positive_label=2)
 fitting, where checks and heavy computations are performed (this step is needed for ranking metrics and uniformity metrics):
>>> metric.fit(X, y, sample_weight=None)
 computation of the metric from probabilities (important: the metric should be computed on exactly the same dataset as was used at the previous step):
>>> # in case of classification
>>> proba = classifier.predict_proba(X)
>>> metric(y, proba, sample_weight=None)
>>> # in case of regression
>>> prediction = regressor.predict(X)
>>> metric(y, prediction, sample_weight=None)
This way metrics can be used in learning curves, for instance. Once a metric is fitted (and the heavy computations are done during fitting), every subsequent evaluation is fast.
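The three-step protocol above can be sketched with a minimal metric class. The `HypotheticalRocAuc` class below is illustrative, not REP's actual implementation; it assumes binary labels and sklearn-style probability arrays.

```python
# A minimal sketch of the metric-object protocol: constructor -> fit -> call.
# HypotheticalRocAuc is a made-up name for illustration, not REP's class.
import numpy as np
from sklearn.metrics import roc_auc_score


class HypotheticalRocAuc:
    def __init__(self, positive_label=1):
        # step 1: all fine-tuning happens in the constructor
        self.positive_label = positive_label

    def fit(self, X, y, sample_weight=None):
        # step 2: checks and heavy precomputation; here only a cheap check
        assert len(np.unique(y)) >= 2, "need at least two classes"
        return self

    def __call__(self, y, proba, sample_weight=None):
        # step 3: fast evaluation on the same dataset that was passed to fit
        return roc_auc_score(np.asarray(y) == self.positive_label,
                             proba[:, self.positive_label],
                             sample_weight=sample_weight)


y = np.array([0, 0, 1, 1])
proba = np.array([[0.9, 0.1], [0.6, 0.4], [0.35, 0.65], [0.2, 0.8]])
metric = HypotheticalRocAuc(positive_label=1).fit(None, y)
print(metric(y, proba))  # classes perfectly separated -> 1.0
```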
Metric function (convenience) API¶
Many metric functions do not require complex settings or separate precomputation, so REP also works with functions having the following API:
>>> # for classification
>>> metric(y, probabilities, sample_weight=None)
>>> # for regression
>>> metric(y, predictions, sample_weight=None)
As an example, mean_squared_error and mean_absolute_error from sklearn can be used in REP.
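The sklearn regression metrics mentioned above already match this plain-function signature, so they can be passed to REP directly. A quick check:

```python
# sklearn's mean_squared_error / mean_absolute_error follow the
# metric(y, predictions, sample_weight=None) convention described above.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y = np.array([1.0, 2.0, 3.0])
predictions = np.array([1.5, 2.0, 2.5])

print(mean_squared_error(y, predictions))   # (0.25 + 0 + 0.25) / 3
# sample weights are supported out of the box
print(mean_absolute_error(y, predictions, sample_weight=[1, 1, 2]))
```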
See also
API of metrics for details and explanations.
Correspondence between physics terms and ML terms¶
Some notation used below:
 IsSignal (IsS) — is really signal
 AsSignal (AsS) — classified as signal
 IsBackgroundAsSignal (IsBAsS) — background, but classified as signal
... and so on. Cute, right?
There are many ways to denote these things:
 tpr = s = IsSAsS / IsS
 fpr = b = IsBAsS / IsB
Here we use normalized s and b, while physicists usually normalize them to particular expected amounts of signal and background.
 signal efficiency = s = tpr
The following term is used only in HEP:
 background efficiency = b = fpr
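The correspondence above can be checked numerically. A small sketch (the array names mirror the IsS/AsS notation introduced here; they are illustrative, not REP identifiers):

```python
# Computing tpr (signal efficiency) and fpr (background efficiency)
# directly from the IsS / AsS counts defined above.
import numpy as np

is_signal = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # IsS: true labels
as_signal = np.array([1, 1, 0, 1, 0, 0, 0, 1])  # AsS: classifier decisions

IsSAsS = np.sum((is_signal == 1) & (as_signal == 1))  # signal kept as signal
IsBAsS = np.sum((is_signal == 0) & (as_signal == 1))  # background leaking in
IsS = np.sum(is_signal == 1)
IsB = np.sum(is_signal == 0)

tpr = IsSAsS / IsS  # signal efficiency, s
fpr = IsBAsS / IsB  # background efficiency, b
print(tpr, fpr)     # 0.75 0.25
```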
Available Metric functions¶

class rep.report.metrics.RocAuc(positive_label=1)[source]¶
Bases: sklearn.base.BaseEstimator, rep.report.metrics.MetricMixin
Computes area under the ROC curve. General-purpose quality measure for binary classification.
Parameters: positive_label (int) – label of the class; in case of more than two classes, ROC AUC will be computed for this specific class vs. the others

class rep.report.metrics.LogLoss(regularization=1e-15)[source]¶
Bases: sklearn.base.BaseEstimator, rep.report.metrics.MetricMixin
Log loss, which is the same as minus log-likelihood, logistic loss, and cross-entropy loss.
An appropriate metric if the algorithm is optimizing log-likelihood.
Parameters: regularization – minimal value for probability, to avoid a high (or infinite) penalty for zero probabilities.
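The role of the regularization parameter can be seen in a short sketch. The helper below is hypothetical (not REP's implementation of LogLoss); it shows how clipping probabilities keeps the penalty finite for a confidently wrong prediction:

```python
# Sketch of a regularized log loss: probabilities are clipped away from 0 and 1
# so that log(0) never occurs. Hypothetical helper, not REP's exact code.
import numpy as np


def regularized_log_loss(y, proba, regularization=1e-15, sample_weight=None):
    proba = np.clip(proba, regularization, 1 - regularization)
    # probability the model assigned to the true class of each event
    p_true = proba[np.arange(len(y)), y]
    return -np.average(np.log(p_true), weights=sample_weight)


y = np.array([0, 1])
proba = np.array([[1.0, 0.0],   # confident and correct
                  [1.0, 0.0]])  # confident and wrong -> clipped, finite penalty
print(regularized_log_loss(y, proba))
```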

class rep.report.metrics.OptimalAccuracy(sb_ratio=None)[source]¶
Bases: sklearn.base.BaseEstimator, rep.report.metrics.MetricMixin
Estimation of binary classification accuracy at the optimal threshold.
Parameters: sb_ratio – ratio of signal (class 1) and background (class 0). If None, the parameter is estimated from test data.

class rep.report.metrics.OptimalAMS(expected_s=691.988607712, expected_b=410999.847)[source]¶
Bases: rep.report.metrics.OptimalMetric
Optimal values of AMS (approximate median significance).
Default values of expected_s and expected_b are from the HiggsML challenge.
Parameters:  expected_s (float) – expected amount of signal
 expected_b (float) – expected amount of background

class rep.report.metrics.OptimalSignificance(expected_s=1.0, expected_b=1.0)[source]¶
Bases: rep.report.metrics.OptimalMetric
Optimal values of significance: s / sqrt(b)
Parameters:  expected_s (float) – expected amount of signal
 expected_b (float) – expected amount of background

class rep.report.metrics.TPRatFPR(fpr)[source]¶
Bases: sklearn.base.BaseEstimator, rep.report.metrics.MetricMixin
Fix an FPR value on the ROC curve and return the corresponding TPR value.
Parameters: fpr (float) – target value of false positive rate, in range (0, 1)
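The idea behind TPRatFPR can be sketched with sklearn's roc_curve: among all ROC points whose FPR does not exceed the target, take the highest TPR. The `tpr_at_fpr` helper is hypothetical, not REP's exact code:

```python
# Sketch of the TPRatFPR idea using sklearn's roc_curve.
# tpr_at_fpr is a hypothetical helper for illustration.
import numpy as np
from sklearn.metrics import roc_curve


def tpr_at_fpr(y, proba, target_fpr):
    fpr, tpr, _ = roc_curve(y, proba[:, 1])
    # highest TPR among ROC points not exceeding the target FPR
    return tpr[fpr <= target_fpr].max()


y = np.array([0, 0, 1, 1])
proba = np.array([[0.8, 0.2], [0.3, 0.7], [0.4, 0.6], [0.1, 0.9]])
print(tpr_at_fpr(y, proba, target_fpr=0.5))
```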

class rep.report.metrics.FPRatTPR(tpr)[source]¶
Bases: sklearn.base.BaseEstimator, rep.report.metrics.MetricMixin
Fix a TPR value on the ROC curve and return the corresponding FPR value.
Parameters: tpr (float) – target value of true positive rate, in range (0, 1)
Supplementary functions¶
Building blocks that may be useful for creating new metrics.

class rep.report.metrics.MetricMixin[source]¶
Bases: object
Class with helpful methods for metrics; metrics are expected (but not obliged) to be derived from this mixin.

fit(X, y, sample_weight=None)[source]¶
Prepare the metric for usage; preprocessing is done in this function.
Parameters:  X (pandas.DataFrame) – data of shape [n_samples, n_features]
 y – labels of events, array-like of shape [n_samples]
 sample_weight – weight of events, array-like of shape [n_samples], or None if all weights are equal
Returns: self


class rep.report.metrics.OptimalMetric(metric, expected_s=1.0, expected_b=1.0, signal_label=1)[source]¶
Bases: sklearn.base.BaseEstimator, rep.report.metrics.MetricMixin
Class to calculate the optimal threshold on predictions for some binary metric.
Parameters:  metric (function) – metric(s, b) -> float
 expected_s – float, total weight of signal
 expected_b – float, total weight of background
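A threshold scan of this kind can be sketched in a few lines: try every prediction value as a cut, compute the surviving signal and background scaled to the expected amounts, and keep the cut maximizing metric(s, b). The `optimal_threshold` helper is hypothetical and only illustrates the idea behind OptimalMetric:

```python
# Sketch of a brute-force threshold scan maximizing metric(s, b).
# optimal_threshold is a hypothetical helper, not REP's implementation.
import numpy as np


def optimal_threshold(y, proba, metric, expected_s=1.0, expected_b=1.0):
    proba = np.asarray(proba)
    best = (-np.inf, None)
    for thr in np.sort(np.unique(proba)):
        passed = proba >= thr
        # efficiencies scaled to the expected amounts of signal / background
        s = expected_s * np.sum(passed & (y == 1)) / max(np.sum(y == 1), 1)
        b = expected_b * np.sum(passed & (y == 0)) / max(np.sum(y == 0), 1)
        best = max(best, (metric(s, b), thr))
    return best


y = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.4, 0.6, 0.9])
# toy metric s - b: rewards kept signal, penalizes kept background
value, threshold = optimal_threshold(y, proba, metric=lambda s, b: s - b)
print(value, threshold)  # best cut keeps all signal, no background
```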

rep.report.metrics.ams(s, b, br=10.0)[source]¶
Regularized approximate median significance
Parameters:  s – amount of signal passed
 b – amount of background passed
 br – regularization
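The standard regularized AMS formula from the HiggsML challenge is AMS = sqrt(2 ((s + b + br) ln(1 + s / (b + br)) − s)); assuming REP's ams computes the same quantity, a sketch:

```python
# Regularized approximate median significance, the standard HiggsML formula.
# Assumed to match REP's ams; written here as a self-contained sketch.
import math


def ams(s, b, br=10.0):
    # for b >> s this reduces to the familiar s / sqrt(b + br)
    return math.sqrt(2 * ((s + b + br) * math.log(1 + s / (b + br)) - s))


print(ams(s=100.0, b=1000.0))
```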

rep.report.metrics.significance(s, b)[source]¶
Approximate significance of discovery: s / sqrt(b). Here we use normalization, so maximal s and b are equal to 1.
Parameters:  s – amount of signal passed
 b – amount of background passed
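The formula itself is a one-liner; a minimal sketch (the function body here is just the quoted formula, without REP's normalization machinery):

```python
# Approximate discovery significance s / sqrt(b), as quoted above.
import math


def significance(s, b):
    return s / math.sqrt(b)


print(significance(s=5.0, b=4.0))  # 2.5
```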

class rep.report.metrics.OptimalMetricNdim(metric, expected_s=1.0, expected_b=1.0, step=10)[source]¶
Bases: sklearn.base.BaseEstimator
Class to calculate optimal thresholds on the predictions of several classifiers (prediction_1, prediction_2, ..., prediction_n) simultaneously to maximize some binary metric.
This metric differs from OptimalMetric, which optimizes a threshold for a single classifier's predictions.
Parameters:  metric (function) – metric(s, b) -> float, binary metric
 expected_s – float, total weight of signal
 expected_b – float, total weight of background
 step (int) – step in the sorted array of predictions for each dimension, used to choose thresholds
>>> proba1 = classifier1.predict_proba(X)[:, 1]
>>> proba2 = classifier2.predict_proba(X)[:, 1]
>>> optimal_ndim = OptimalMetricNdim(ams)
>>> optimal_ndim(y, sample_weight, proba1, proba2)
>>> # returns the optimal metric value and thresholds for proba1 and proba2
0.99, (0.88, 0.45)