Various helper functions, classes, and methods are collected here.

class rep.utils.Binner(values, bins_number)[source]

Binner is a class that helps split values into several bins. Initially an array of values is given, which is then split into ‘bins_number’ equal-frequency parts; in the process the limits (boundaries of bins) are computed.

Returns: number of bins

Given the values of a feature, compute the index of the bin for each value.

Parameters: values – array of shape [n_samples]
Returns: array of shape [n_samples] with bin indices

Change the thresholds (limits) inside the bins.

Parameters: arrays – data to be split; the first array corresponds to the binning variable
Returns: sequence of length [n_bins] with the values corresponding to each bin
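The idea behind Binner can be sketched with plain numpy (a hypothetical illustration, not the class itself): the bin limits are the inner quantiles of the initial values, and bin indices come from searchsorted.

```python
import numpy as np

def compute_limits(values, bins_number):
    """Inner boundaries that split `values` into `bins_number` equal-frequency parts."""
    percentiles = np.linspace(0, 100, bins_number + 1)[1:-1]
    return np.percentile(values, percentiles)

def compute_bin_indices(values, limits):
    """Index of the bin each value falls into (0 .. len(limits))."""
    return np.searchsorted(limits, values)

values = np.arange(100)
limits = compute_limits(values, bins_number=4)     # 3 inner boundaries
indices = compute_bin_indices(values, limits)      # each of the 4 bins gets ~a quarter of the values
```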
class rep.utils.Flattener(data, sample_weight=None)[source]

Prepares a normalization function for a given set of values: the function transforms these values so that they follow a uniform distribution on [0, 1].

Parameters:
  • data (list or numpy.array) – predictions
  • sample_weight (None or list or numpy.array) – weights

Returns: normalization function

Example of usage:

>>> normalizer = Flattener(signal)
>>> hist(normalizer(background))
>>> hist(normalizer(signal))
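The transformation can be sketched as a weighted empirical CDF (a hypothetical numpy re-implementation of the idea, not the actual class):

```python
import numpy as np

def make_flattener(data, sample_weight=None):
    """Return a function mapping values to [0, 1] via the weighted empirical CDF of `data`."""
    data = np.asarray(data, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(data))
    order = np.argsort(data)
    sorted_data = data[order]
    cdf = np.cumsum(np.asarray(sample_weight, dtype=float)[order])
    cdf /= cdf[-1]
    return lambda values: np.interp(values, sorted_data, cdf)

signal = np.random.normal(size=10000)
normalizer = make_flattener(signal)
flat = normalizer(signal)  # approximately uniform on [0, 1]
```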
class rep.utils.Stopwatch[source]

Simple tool to measure the time spent inside a with block; in an IPython notebook, the %time magic serves a similar purpose.

>>> with Stopwatch() as timer:
>>>     # do something here
>>>     classifier.fit(X, y)
>>> # print how much time was spent
>>> print(timer)
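A minimal sketch of how such a context manager can be implemented (hypothetical, not the actual REP source):

```python
import time

class StopwatchSketch:
    """Context manager that records wall-clock time spent inside the `with` block."""
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.time() - self.start
        return False  # do not suppress exceptions

    def __str__(self):
        return 'spent {:.3f} seconds'.format(self.elapsed)

with StopwatchSketch() as timer:
    time.sleep(0.05)
```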
rep.utils.calc_ROC(prediction, signal, sample_weight=None, max_points=10000)[source]

Calculate a ROC curve and return a limited number of points. This is needed for interactive plots, which become sluggish when drawn with too many points.

Parameters:
  • prediction (numpy.ndarray or list) – predictions
  • signal (array or list) – true labels
  • sample_weight (None or array or list) – weights
  • max_points (int) – maximum number of points used on the ROC curve

Returns: (tpr, tnr), (err_tnr, err_tpr), thresholds
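The (tpr, tnr) computation and the point limiting can be sketched in numpy (a hypothetical illustration; weights and the error terms of the real function are omitted):

```python
import numpy as np

def roc_tpr_tnr(prediction, signal, max_points=10000):
    """ROC as (tpr, tnr) pairs, downsampled to at most `max_points` points."""
    prediction = np.asarray(prediction, dtype=float)
    signal = np.asarray(signal, dtype=bool)
    order = np.argsort(prediction)[::-1]          # descending score
    sig = signal[order]
    tpr = np.cumsum(sig) / signal.sum()           # signal efficiency above each threshold
    fpr = np.cumsum(~sig) / (~signal).sum()
    tnr = 1.0 - fpr                               # background rejection
    if len(tpr) > max_points:
        keep = np.linspace(0, len(tpr) - 1, max_points).astype(int)
        tpr, tnr = tpr[keep], tnr[keep]
    return tpr, tnr

labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
tpr, tnr = roc_tpr_tnr(scores, labels)
```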

rep.utils.calc_feature_correlation_matrix(df, weights=None)[source]

Calculate the feature correlation matrix.

Parameters:
  • df (pandas.DataFrame) – data of shape [n_samples, n_features]
  • weights – weights of shape [n_samples] (optional)

Returns: correlation matrix of shape [n_features, n_features]



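A weighted correlation matrix can be sketched with numpy (a hypothetical illustration; whether this matches REP's exact weighting conventions is an assumption):

```python
import numpy as np

def weighted_correlation(X, weights=None):
    """Correlation matrix [n_features, n_features] for data X of shape [n_samples, n_features]."""
    X = np.asarray(X, dtype=float)
    if weights is None:
        weights = np.ones(len(X))
    weights = np.asarray(weights, dtype=float)
    mean = np.average(X, axis=0, weights=weights)
    centered = X - mean
    cov = (centered * weights[:, None]).T @ centered / weights.sum()
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
corr = weighted_correlation(X)  # with unit weights, equals np.corrcoef(X, rowvar=False)
```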
rep.utils.calc_hist_with_errors(x, weight=None, bins=60, normed=True, x_range=None, ignored_sideband=0.0)[source]

Calculate the data for an error-bar plot (for plotting a pdf with errors).

Parameters:
  • x (list or numpy.array) – data
  • weight (None or list or numpy.array) – weights

Returns: tuple (x points (list), y points (list), y point errors (list), x point errors (list))
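The computation can be sketched with numpy.histogram (hypothetical; the normed and ignored_sideband handling of the real function is omitted):

```python
import numpy as np

def hist_points_with_errors(x, weight=None, bins=60, x_range=None):
    """Bin centers, weighted counts, Poisson-style y errors, and half-bin-width x errors."""
    x = np.asarray(x, dtype=float)
    if weight is None:
        weight = np.ones(len(x))
    weight = np.asarray(weight, dtype=float)
    counts, edges = np.histogram(x, bins=bins, range=x_range, weights=weight)
    counts_sq, _ = np.histogram(x, bins=bins, range=x_range, weights=weight ** 2)
    centers = 0.5 * (edges[:-1] + edges[1:])
    y_err = np.sqrt(counts_sq)   # sqrt of the sum of squared weights per bin
    x_err = 0.5 * np.diff(edges)
    return centers, counts, y_err, x_err

data = np.random.RandomState(1).normal(size=1000)
centers, counts, y_err, x_err = hist_points_with_errors(data, bins=20)
```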


rep.utils.check_arrays(*arrays)[source]

Left for consistency, a version of sklearn.validation.check_arrays

Parameters: arrays (list[iterable]) – arrays with the same length along the first dimension
rep.utils.check_sample_weight(y_true, sample_weight)[source]

Checks the weights: if sample_weight is None, returns an array of ones.

Parameters:
  • y_true – labels (or any array of length [n_samples])
  • sample_weight – None or array of length [n_samples]

Returns: numpy.array of shape [n_samples]
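The behavior can be sketched as (a hypothetical minimal version):

```python
import numpy as np

def check_sample_weight_sketch(y_true, sample_weight):
    """Return an array of ones when sample_weight is None, otherwise validate its length."""
    if sample_weight is None:
        return np.ones(len(y_true), dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    assert len(sample_weight) == len(y_true), 'lengths of y_true and sample_weight differ'
    return sample_weight

y = [0, 1, 1, 0]
w = check_sample_weight_sketch(y, None)  # array of four ones
```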

rep.utils.fit_metric(metric, *args, **kargs)[source]

A metric can implement one of two interfaces (function or object). This function fits the metric if required (detected simply by checking for the presence of a fit method).

Parameters: metric – metric function, following REP conventions

Get a (new column: old column) dict of expressions. This function is used to process names of features, which can contain expressions.

Parameters: columns (list[str]) – column names
Return type:dict
rep.utils.get_columns_in_df(df, columns)[source]

Get columns from a data frame using numexpr evaluation.

Parameters:
  • df (pandas.DataFrame) – data
  • columns (None or list[str]) – the columns needed

Returns: data frame with the requested columns
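Expression evaluation of this kind can be sketched with pandas.DataFrame.eval, which uses numexpr when available (a hypothetical illustration, not the actual implementation):

```python
import pandas as pd

def columns_with_expressions(df, columns):
    """Evaluate each requested column name as a pandas expression."""
    if columns is None:
        return df
    return pd.DataFrame({name: df.eval(name) for name in columns})

df = pd.DataFrame({'pt': [1.0, 2.0], 'p': [2.0, 8.0]})
result = columns_with_expressions(df, ['pt', 'pt / p'])  # plain column plus a derived one
```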

rep.utils.get_efficiencies(prediction, spectator, sample_weight=None, bins_number=20, thresholds=None, errors=False, ignored_sideband=0.0)[source]

Construct efficiency as a function of the spectator for each threshold.

Different score functions are available: Efficiency, Precision, Recall, F1Score, and others from sklearn.metrics.

Parameters:
  • prediction – list of probabilities
  • spectator – list of spectator values
  • bins_number (int) – number of bins for the plot
  • thresholds – list of prediction thresholds (default: the prediction cuts at which efficiency equals [0.2, 0.4, 0.5, 0.6, 0.8])

Returns:
if errors=False, OrderedDict threshold -> (x_values, y_values)

if errors=True, OrderedDict threshold -> (x_values, y_values, y_err, x_err)

All parts (x_values, y_values, y_err, x_err) are numpy.arrays of the same length.
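For a single threshold, the per-bin efficiency can be sketched as follows (hypothetical; the real function also handles weights, errors, and multiple thresholds):

```python
import numpy as np

def efficiency_vs_spectator(prediction, spectator, threshold, bins_number=5):
    """Fraction of predictions above `threshold`, in equal-frequency bins of `spectator`."""
    prediction = np.asarray(prediction, dtype=float)
    spectator = np.asarray(spectator, dtype=float)
    edges = np.percentile(spectator, np.linspace(0, 100, bins_number + 1))[1:-1]
    bin_index = np.searchsorted(edges, spectator)
    x_values, y_values = [], []
    for b in range(bins_number):
        mask = bin_index == b
        if mask.any():
            x_values.append(spectator[mask].mean())
            y_values.append((prediction[mask] > threshold).mean())
    return np.array(x_values), np.array(y_values)

rng = np.random.RandomState(2)
pred = rng.uniform(size=2000)     # predictions independent of the spectator
spec = rng.normal(size=2000)
x, y = efficiency_vs_spectator(pred, spec, threshold=0.5)  # flat efficiency near 0.5
```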


Applies the same permutation to all passed arrays; the permutation sorts the first passed array.


Returns the last element of a sequence, or raises an error if the sequence is empty.

rep.utils.train_test_split(*arrays, **kw_args)[source]

Does the same thing as sklearn.cross_validation.train_test_split, but additionally supports an ‘allow_none’ parameter.

Parameters:
  • arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split, all with the same first dimension
  • allow_none (bool) – default False; if set to True, allows non-first arguments to be None (in that case, both the resulting train and test parts are None)
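The allow_none behavior can be sketched with a plain numpy splitter (hypothetical; the real function delegates to sklearn and handles more options):

```python
import numpy as np

def split_allow_none(*arrays, train_size=0.5, random_state=0):
    """Split arrays into train/test parts; None arrays stay None on both sides."""
    length = len(next(a for a in arrays if a is not None))
    rng = np.random.RandomState(random_state)
    permutation = rng.permutation(length)
    n_train = int(train_size * length)
    train_idx, test_idx = permutation[:n_train], permutation[n_train:]
    result = []
    for a in arrays:
        if a is None:
            result += [None, None]
        else:
            a = np.asarray(a)
            result += [a[train_idx], a[test_idx]]
    return result

X = np.arange(10)
X_train, X_test, w_train, w_test = split_allow_none(X, None, train_size=0.5)
```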
rep.utils.train_test_split_group(group_column, *arrays, **kw_args)[source]

A modification of train_test_split that alters the splitting rule.

Parameters:
  • group_column – array-like of shape [n_samples] with indices of groups; events from one group are kept together (all events in train, or all in test). When group_column is used, train_size and test_size refer to the number of groups, not events
  • arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split
  • allow_none (bool) – default False; useful for sample_weight (after splitting, the train and test parts of a None argument are again None)
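The group-wise splitting rule can be sketched as follows (a hypothetical illustration: split the unique groups, then mask events by group membership):

```python
import numpy as np

def group_split(group_column, *arrays, train_size=0.5, random_state=0):
    """Train/test split in which whole groups are assigned to one side."""
    group_column = np.asarray(group_column)
    groups = np.unique(group_column)
    rng = np.random.RandomState(random_state)
    rng.shuffle(groups)
    n_train = int(train_size * len(groups))  # train_size counts groups, not events
    train_groups = set(groups[:n_train].tolist())
    mask = np.array([g in train_groups for g in group_column])
    result = []
    for a in arrays:
        a = np.asarray(a)
        result += [a[mask], a[~mask]]
    return result

groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
X = np.arange(8)
X_train, X_test = group_split(groups, X, train_size=0.5)  # events of a group never split
```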
rep.utils.weighted_quantile(array, quantiles, sample_weight=None, array_sorted=False, old_style=False)[source]

Compute quantiles of the array. Unlike numpy.percentile, this function supports weights, but it is inefficient and performs a complete sort.

Parameters:
  • array – distribution, array of shape [n_samples]
  • quantiles – floats in the range [0, 1]; array of shape [n_quantiles]
  • sample_weight – optional sample weights, array of shape [n_samples]
  • array_sorted – if True, the sorting step is skipped
  • old_style – if True, corrects the output to be consistent with numpy.percentile

Returns: array of shape [n_quantiles]


>>> weighted_quantile([1, 2, 3, 4, 5], [0.5])
Out: array([ 3.])
>>> weighted_quantile([1, 2, 3, 4, 5], [0.5], sample_weight=[3, 1, 1, 1, 1])
Out: array([ 2.])
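The idea can be sketched in numpy (hypothetical; old_style and array_sorted handling omitted). The sketch reproduces the documented example outputs above:

```python
import numpy as np

def weighted_quantile_sketch(array, quantiles, sample_weight=None):
    """Weighted quantiles via interpolation on the weighted empirical CDF."""
    array = np.asarray(array, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(array))
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(array)
    array, sample_weight = array[order], sample_weight[order]
    # position of each sample on the weighted CDF (midpoint convention)
    positions = np.cumsum(sample_weight) - 0.5 * sample_weight
    positions /= np.sum(sample_weight)
    return np.interp(quantiles, positions, array)

weighted_quantile_sketch([1, 2, 3, 4, 5], [0.5])                                 # → array([3.])
weighted_quantile_sketch([1, 2, 3, 4, 5], [0.5], sample_weight=[3, 1, 1, 1, 1])  # → array([2.])
```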