Utilities

Various helper functions, classes, and methods are collected here.

class rep.utils.Binner(values, bins_number)[source]

Binner is a class that helps to split values into several bins. An array of values is given at construction; it is split into ‘bins_number’ parts of equal population, from which the limits (boundaries of the bins) are computed.

bins_number
Returns:number of bins
get_bins(values)[source]

Given the values of a feature, compute the index of the bin for each value

Parameters:values – array of shape [n_samples]
Returns:array of shape [n_samples]
set_limits(limits)[source]

Change the thresholds inside bins.

split_into_bins(*arrays)[source]
Parameters:arrays – data to be split; the first array determines the binning
Returns:sequence of length [n_bins] with values corresponding to each bin.
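The mechanics of Binner can be sketched with plain NumPy: equal-population limits come from percentiles, and bin assignment is a binary search over those limits. This is an illustration of the idea, not REP’s actual implementation:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
bins_number = 4

# inner boundaries splitting the sorted values into equal-population parts;
# for bins_number bins there are bins_number - 1 limits
limits = np.percentile(values, np.linspace(0, 100, bins_number + 1)[1:-1])

# the bin index of each value is a binary search over the limits
bin_indices = np.searchsorted(limits, values)
```

Here every bin receives two of the eight values, and `bin_indices` runs from 0 to bins_number - 1.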
class rep.utils.Flattener(data, sample_weight=None)[source]

Prepares a normalization function for a set of values; the function transforms the data to a uniform distribution on [0, 1].

Parameters:
  • data (list or numpy.array) – predictions
  • sample_weight (None or list or numpy.array) – weights
Return func:

normalization function

Example of usage:

>>> normalizer = Flattener(signal)
>>> hist(normalizer(background))
>>> hist(normalizer(signal))
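The idea behind such a flattening can be sketched with a weighted empirical CDF: fit on the signal values, and applying the resulting function makes the signal distribution approximately uniform on [0, 1]. This is a numpy-only illustration; `make_flattener` is a hypothetical helper, not REP’s API:

```python
import numpy as np

def make_flattener(data, sample_weight=None):
    # build a function mapping values to ~Uniform[0, 1] with respect to `data`
    data = np.asarray(data, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(data))
    order = np.argsort(data)
    sorted_data = data[order]
    # weighted empirical CDF evaluated at the sorted points
    cdf = np.cumsum(np.asarray(sample_weight, dtype=float)[order])
    cdf /= cdf[-1]
    return lambda x: np.interp(x, sorted_data, cdf)

signal = np.random.RandomState(42).normal(size=1000)
normalizer = make_flattener(signal)
flat = normalizer(signal)  # approximately uniform on [0, 1]
```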
class rep.utils.Stopwatch[source]

Simple tool to measure elapsed time. (In an IPython session, the %time magic is an alternative.)

>>> with Stopwatch() as timer:
...     # do something here
...     classifier.fit(X, y)
>>> # print how much time was spent
>>> print(timer)
elapsed
Returns:time passed (in seconds)
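A context-manager timer in this spirit can be sketched in a few lines (an illustration of the pattern, not REP’s exact implementation):

```python
import time

class SimpleStopwatch:
    # minimal context-manager timer in the spirit of rep.utils.Stopwatch
    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self._stop = time.perf_counter()

    @property
    def elapsed(self):
        return self._stop - self._start

    def __str__(self):
        return "interval: {:.2f} sec".format(self.elapsed)

with SimpleStopwatch() as timer:
    time.sleep(0.05)
```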
rep.utils.calc_ROC(prediction, signal, sample_weight=None, max_points=10000)[source]

Calculate the ROC curve, returning a limited number of points. This is needed for interactive plots, which become slow when too many points are drawn.

Parameters:
  • prediction (numpy.ndarray or list) – predictions
  • signal (array or list) – true labels
  • sample_weight (None or array or list) – weights
  • max_points (int) – maximum of used points on roc curve
Returns:

(tpr, tnr), (err_tnr, err_tpr), thresholds
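The (tpr, tnr) parametrisation and the point limiting can be sketched with NumPy alone (`roc_points` is a hypothetical helper; the error computation and weights are omitted):

```python
import numpy as np

def roc_points(prediction, signal, max_points=10000):
    order = np.argsort(prediction)
    signal = np.asarray(signal)[order]
    n_sig = signal.sum()
    n_bck = len(signal) - n_sig
    # sweep the threshold through the sorted predictions:
    # tnr = fraction of background below threshold, tpr = fraction of signal above
    tnr = np.concatenate([[0], np.cumsum(1 - signal)]) / n_bck
    tpr = 1 - np.concatenate([[0], np.cumsum(signal)]) / n_sig
    # keep at most max_points roughly evenly spaced points
    idx = np.linspace(0, len(tpr) - 1, min(max_points, len(tpr))).astype(int)
    return tpr[idx], tnr[idx]

tpr, tnr = roc_points([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```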

rep.utils.calc_feature_correlation_matrix(df, weights=None)[source]

Calculate correlation matrix

Parameters:
  • df (pandas.DataFrame) – data of shape [n_samples, n_features]
  • weights – weights of shape [n_samples] (optional)
Returns:

correlation matrix for dataFrame of shape [n_features, n_features]

Return type:

numpy.ndarray

rep.utils.calc_hist_with_errors(x, weight=None, bins=60, normed=True, x_range=None, ignored_sideband=0.0)[source]

Calculate the data needed for an error-bar plot (a PDF with errors)

Parameters:
  • x (list or numpy.array) – data
  • weight (None or list or numpy.array) – weights
Returns:

tuple (x-points (list), y-points (list), y points errors (list), x points errors (list))
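The returned quantities can be sketched for the unweighted case with numpy.histogram: bin centres as x-points, Poisson (sqrt(N)) errors on the counts, and half bin widths as x errors. This is an illustration; REP’s version also handles weights, normalisation, x_range, and sidebands:

```python
import numpy as np

x = np.random.RandomState(0).normal(size=1000)
counts, edges = np.histogram(x, bins=60)

centers = 0.5 * (edges[:-1] + edges[1:])  # x-points
y_err = np.sqrt(counts)                   # Poisson error on each bin count
x_err = 0.5 * np.diff(edges)              # half bin width
```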

rep.utils.check_arrays(*arrays)[source]

Kept for consistency; a version of sklearn.validation.check_arrays

Parameters:arrays (list[iterable]) – arrays with same length of first dimension.
rep.utils.check_sample_weight(y_true, sample_weight)[source]

Checks the sample weights; if sample_weight is None, an array of ones of matching length is returned.

Parameters:
  • y_true – labels (or any array of length [n_samples])
  • sample_weight – None or array of length [n_samples]
Returns:

numpy.array of shape [n_samples]
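The behaviour can be sketched as follows, assuming uniform unit weights are substituted when sample_weight is None:

```python
import numpy as np

def check_sample_weight_sketch(y_true, sample_weight):
    # substitute unit weights when none are given, else validate the length
    if sample_weight is None:
        return np.ones(len(y_true), dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    assert len(sample_weight) == len(y_true), "lengths differ"
    return sample_weight
```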

rep.utils.fit_metric(metric, *args, **kargs)[source]

A metric can implement one of two interfaces (a function or an object). This function fits the metric if required (by simply checking for the presence of a fit method).

Parameters:metric – metric function, following REP conventions
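Such duck-typed dispatch can be sketched in a few lines (a sketch of the pattern, not REP’s exact code):

```python
def fit_metric_sketch(metric, *args, **kwargs):
    # fit the metric only if it exposes a fit method; plain functions pass through
    if hasattr(metric, 'fit'):
        metric.fit(*args, **kwargs)
    return metric

class DummyMetric:
    # stand-in for a metric object following the fit interface
    def fit(self, X, y):
        self.fitted = True
        return self

def plain_metric(y_true, proba):
    # stand-in for a metric implemented as a plain function
    return 0.0

m = fit_metric_sketch(DummyMetric(), [[0.0]], [1])
f = fit_metric_sketch(plain_metric, [[0.0]], [1])
```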
rep.utils.get_columns_dict(columns)[source]

Build a {new column: old column expression} dict. This function is used to process names of features, which may contain expressions.

Parameters:columns (list[str]) – columns names
Return type:dict
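Assuming feature names use a ‘new_name: expression’ convention (an assumption about the naming scheme, not confirmed by this page), the mapping could be built like this:

```python
def columns_dict_sketch(columns):
    # hypothetical parser assuming a "new_name: expression" naming convention
    result = {}
    for column in columns:
        if ':' in column:
            new_name, expression = column.split(':', 1)
            result[new_name.strip()] = expression.strip()
        else:
            result[column] = column  # a plain name maps to itself
    return result
```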
rep.utils.get_columns_in_df(df, columns)[source]

Get columns from a data frame, evaluating expressions with numexpr

Parameters:
  • df (pandas.DataFrame) – data
  • columns (None or list[str]) – necessary columns
Returns:

data frame with the requested columns

rep.utils.get_efficiencies(prediction, spectator, sample_weight=None, bins_number=20, thresholds=None, errors=False, ignored_sideband=0.0)[source]

Construct efficiency function dependent on spectator for each threshold

Different score functions available: Efficiency, Precision, Recall, F1Score, and other things from sklearn.metrics

Parameters:
  • prediction – list of probabilities
  • spectator – list of spectator’s values
  • bins_number (int) – count of bins for the plot
  • thresholds (list) – list of prediction thresholds (default: the prediction cuts for which efficiency will be [0.2, 0.4, 0.5, 0.6, 0.8])

Returns:

if errors=False: OrderedDict threshold -> (x_values, y_values)

if errors=True: OrderedDict threshold -> (x_values, y_values, y_err, x_err)

All the parts: x_values, y_values, y_err, x_err are numpy.arrays of the same length.
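The core of such a construction, for a single threshold and without errors, can be sketched with NumPy (`efficiencies_sketch` is a hypothetical helper: equal-population spectator bins, then the pass rate of the prediction cut in each bin):

```python
import numpy as np

def efficiencies_sketch(prediction, spectator, threshold, bins_number=20):
    spectator = np.asarray(spectator, dtype=float)
    # equal-population bins over the spectator variable
    limits = np.percentile(spectator, np.linspace(0, 100, bins_number + 1)[1:-1])
    bins = np.searchsorted(limits, spectator)
    passed = np.asarray(prediction) > threshold
    x_values, y_values = [], []
    for b in range(bins_number):
        mask = bins == b
        if mask.any():
            x_values.append(spectator[mask].mean())  # bin position
            y_values.append(passed[mask].mean())     # efficiency in the bin
    return np.array(x_values), np.array(y_values)

spectator = np.arange(100.0)
prediction = np.linspace(0, 1, 100)
x_values, y_values = efficiencies_sketch(prediction, spectator, threshold=0.5)
```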

rep.utils.reorder_by_first(*arrays)[source]

Applies the same permutation to all passed arrays; the permutation sorts the first passed array
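The behaviour can be sketched in two lines of NumPy (a sketch, not REP’s code):

```python
import numpy as np

def reorder_by_first_sketch(*arrays):
    # one argsort of the first array, applied to every array
    order = np.argsort(arrays[0])
    return [np.asarray(array)[order] for array in arrays]

a, b = reorder_by_first_sketch([3, 1, 2], ['c', 'a', 'b'])
```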

rep.utils.take_last(sequence)[source]

Returns the last element of a sequence, or raises an error if the sequence is empty

rep.utils.train_test_split(*arrays, **kw_args)[source]

Does the same thing as sklearn.cross_validation.train_test_split, but additionally supports an ‘allow_none’ parameter.

Parameters:
  • arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split with same first dimension
  • allow_none (bool) – default False; if set to True, allows non-first arguments to be None (in this case, both the resulting train and test parts are None).
rep.utils.train_test_split_group(group_column, *arrays, **kw_args)[source]

Modification of train_test_split which alters splitting rule.

Parameters:
  • group_column – array-like of shape [n_samples] with indices of groups, events from one group will be kept together (all events in train or all events in test). If group_column is used, train_size and test_size will refer to number of groups, not events
  • arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split
  • allow_none (bool) – default False; if True, None is allowed among the arrays (useful for sample_weight: after splitting, a None array yields None for both train and test)
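The group-preserving rule can be sketched as follows (a simplified illustration using a `test_size` fraction of groups; `group_split_sketch` is hypothetical and REP’s version mirrors the full train_test_split interface):

```python
import numpy as np

def group_split_sketch(group_column, *arrays, test_size=0.5, random_state=0):
    # split whole groups: every event of a group lands on the same side
    groups = np.unique(group_column)
    rng = np.random.RandomState(random_state)
    rng.shuffle(groups)
    test_groups = set(groups[:int(round(len(groups) * test_size))])
    in_test = np.array([g in test_groups for g in group_column])
    result = []
    for array in arrays:
        array = np.asarray(array)
        result += [array[~in_test], array[in_test]]
    return result

groups = [0, 0, 1, 1, 2, 2]
x = [10, 11, 20, 21, 30, 31]
x_train, x_test = group_split_sketch(groups, x, test_size=0.5)
```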
rep.utils.weighted_quantile(array, quantiles, sample_weight=None, array_sorted=False, old_style=False)[source]

Computes quantiles of an array. Unlike numpy.percentile, this function supports weights; however, it is inefficient and performs a complete sort.

Parameters:
  • array – distribution, array of shape [n_samples]
  • quantiles – floats from range [0, 1] with quantiles of shape [n_quantiles]
  • sample_weight – optional weights of samples, array of shape [n_samples]
  • array_sorted – if True, the sorting step will be skipped
  • old_style – if True, will correct output to be consistent with numpy.percentile.
Returns:

array of shape [n_quantiles]

Example:

>>> weighted_quantile([1, 2, 3, 4, 5], [0.5])
array([ 3.])
>>> weighted_quantile([1, 2, 3, 4, 5], [0.5], sample_weight=[3, 1, 1, 1, 1])
array([ 2.])
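A numpy-only sketch that reproduces both examples above (plain sorting plus interpolation over the weighted CDF; the array_sorted and old_style options are omitted):

```python
import numpy as np

def weighted_quantile_sketch(array, quantiles, sample_weight=None):
    array = np.asarray(array, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(array))
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(array)
    array, sample_weight = array[order], sample_weight[order]
    # position of each sample on [0, 1]: the centre of its own weight mass
    cdf = (np.cumsum(sample_weight) - 0.5 * sample_weight) / sample_weight.sum()
    return np.interp(quantiles, cdf, array)
```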