Utilities

Helpful functions, classes, and methods are collected here.
class rep.utils.Binner(values, bins_number)

Binner is a class that helps split values into several bins. Initially an array of values is given, which is then split into bins_number equal parts, thereby computing the limits (the boundaries of the bins).

bins_number

Returns: number of bins
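The limits Binner computes are simply equally-spaced quantile boundaries. A minimal sketch of that idea using numpy (not REP's implementation, and `compute_bin_limits` is a hypothetical name):

```python
import numpy as np

def compute_bin_limits(values, bins_number):
    """Split sorted values into `bins_number` equal-sized parts and
    return the inner boundaries (bins_number - 1 limits).
    Hypothetical helper illustrating what Binner computes."""
    values = np.sort(values)
    # indices of the boundaries between consecutive equal parts
    positions = [len(values) * i // bins_number for i in range(1, bins_number)]
    return [values[pos] for pos in positions]

limits = compute_bin_limits(np.arange(100), bins_number=4)
# four equal parts of 0..99 are bounded at 25, 50 and 75
```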
class rep.utils.Flattener(data, sample_weight=None)

Prepares a normalization function for a set of values; the function transforms values to a uniform distribution on [0, 1].

Parameters: - data (list or numpy.array) – predictions
- sample_weight (None or list or numpy.array) – weights
Return func: normalization function
Example of usage:

>>> normalizer = Flattener(signal)
>>> hist(normalizer(background))
>>> hist(normalizer(signal))
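Such a normalizer amounts to mapping each value through the weighted empirical CDF of the reference sample. A self-contained sketch of that idea (not the library's code; `make_flattener` is a hypothetical name):

```python
import numpy as np

def make_flattener(data, sample_weight=None):
    """Return a function mapping values through the empirical CDF of `data`,
    so that `data` itself becomes uniform on [0, 1]."""
    data = np.asarray(data, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(data))
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(data)
    sorted_data = data[order]
    # cumulative weight, normalized to end at 1
    cdf = np.cumsum(sample_weight[order])
    cdf = cdf / cdf[-1]

    def flatten(values):
        # np.interp maps each value to its weighted CDF position
        return np.interp(values, sorted_data, cdf)

    return flatten

signal = np.random.RandomState(0).normal(size=1000)
normalizer = make_flattener(signal)
flattened = normalizer(np.random.RandomState(1).normal(size=1000))
```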
class rep.utils.Stopwatch

Simple tool to measure time. In an IPython session, the %time magic is an alternative.

>>> with Stopwatch() as timer:
>>>     # do something here
>>>     classifier.fit(X, y)
>>> # print how much time was spent
>>> print(timer)
elapsed
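A minimal context-manager stopwatch of this kind can be sketched with the standard library alone (a simplified stand-in, not REP's implementation):

```python
import time

class SimpleStopwatch:
    """Minimal context-manager stopwatch, similar in spirit to
    rep.utils.Stopwatch (a sketch, not the library class)."""

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc_info):
        self.stop = time.time()

    @property
    def elapsed(self):
        # seconds between entering and leaving the with-block
        return self.stop - self.start

    def __str__(self):
        return 'interval: {:.2f} seconds'.format(self.elapsed)

with SimpleStopwatch() as timer:
    sum(range(100000))  # stand-in for classifier.fit(X, y)
```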
rep.utils.calc_ROC(prediction, signal, sample_weight=None, max_points=10000)

Calculate the ROC curve, returning a limited number of points. This is needed for interactive plots, which suffer when the number of points is too large.

Parameters: - prediction (numpy.ndarray or list) – predictions
- signal (array or list) – true labels
- sample_weight (None or array or list) – weights
- max_points (int) – maximum number of points used on the ROC curve
Returns: (tpr, tnr), (err_tnr, err_tpr), thresholds
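The point limiting can be done by thinning the full curve, keeping at most every k-th point. A sketch of a (tpr, tnr) curve with such thinning, implemented directly in numpy (`roc_limited` is a hypothetical name, without the error estimates the real function returns):

```python
import numpy as np

def roc_limited(prediction, signal, max_points=10000):
    """Compute (tpr, tnr) pairs for descending thresholds on `prediction`,
    keeping at most roughly `max_points` of them."""
    prediction = np.asarray(prediction, dtype=float)
    signal = np.asarray(signal, dtype=bool)
    order = np.argsort(prediction)[::-1]              # descending by score
    is_signal = signal[order]
    tpr = np.cumsum(is_signal) / is_signal.sum()      # true positive rate
    fpr = np.cumsum(~is_signal) / (~is_signal).sum()  # false positive rate
    tnr = 1.0 - fpr                                   # true negative rate
    # thin the curve to at most ~max_points points
    step = max(1, len(tpr) // max_points)
    return tpr[::step], tnr[::step]

tpr, tnr = roc_limited([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], max_points=10)
```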
rep.utils.calc_feature_correlation_matrix(df, weights=None)

Calculate the correlation matrix.

Parameters: - df (pandas.DataFrame) – data of shape [n_samples, n_features]
- weights – weights of shape [n_samples] (optional)
Returns: correlation matrix for the DataFrame, of shape [n_features, n_features]
Return type: numpy.ndarray
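A weighted correlation matrix can be obtained from the weighted covariance: numpy.corrcoef has no weight support, but numpy.cov accepts `aweights`. A sketch under that approach (`weighted_correlation_matrix` is a hypothetical name):

```python
import numpy as np

def weighted_correlation_matrix(X, weights=None):
    """Correlation matrix of the columns of X with optional sample weights,
    built by normalizing the weighted covariance matrix."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False, aweights=weights)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

rng = np.random.RandomState(0)
data = rng.normal(size=(200, 3))
corr = weighted_correlation_matrix(data, weights=np.ones(200))
```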
rep.utils.calc_hist_with_errors(x, weight=None, bins=60, normed=True, x_range=None, ignored_sideband=0.0)

Calculate data for an error-bar plot (to plot a pdf with errors).

Parameters: - x (list or numpy.array) – data
- weight (None or list or numpy.array) – weights
Returns: tuple (x-points (list), y-points (list), y-point errors (list), x-point errors (list))
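The usual recipe behind such a plot is sqrt(N) (Poisson) errors on the bin counts and half-bin-width errors on x. A simplified unweighted sketch, assuming that recipe (`hist_with_errors` is a hypothetical name):

```python
import numpy as np

def hist_with_errors(x, bins=60, normed=True):
    """Histogram points with Poisson (sqrt(N)) errors on y and
    half-bin-width errors on x, suitable for an error-bar pdf plot."""
    counts, edges = np.histogram(x, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    x_err = 0.5 * np.diff(edges)
    y = counts.astype(float)
    y_err = np.sqrt(counts)                   # Poisson error on raw counts
    if normed:
        norm = counts.sum() * np.diff(edges)  # turn counts into a density
        y, y_err = y / norm, y_err / norm
    return centers, y, y_err, x_err

centers, y, y_err, x_err = hist_with_errors(
    np.random.RandomState(0).normal(size=1000), bins=30)
```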
rep.utils.check_arrays(*arrays)

Kept for consistency; a version of sklearn.validation.check_arrays.

Parameters: arrays (list[iterable]) – arrays with the same length of first dimension.
rep.utils.check_sample_weight(y_true, sample_weight)

Checks the weights; if None is passed, returns an array.

Parameters: - y_true – labels (or any array of length [n_samples])
- sample_weight – None or array of length [n_samples]
Returns: numpy.array of shape [n_samples]
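A minimal sketch of this pattern, assuming unit weights are the sensible default when None is passed:

```python
import numpy as np

def check_sample_weight(y_true, sample_weight):
    """Return weights as a float array of length [n_samples]; assume
    unit weights when None is passed (a sketch, not the library code)."""
    if sample_weight is None:
        return np.ones(len(y_true), dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    assert len(y_true) == len(sample_weight), 'lengths are different'
    return sample_weight

w = check_sample_weight([0, 1, 1], None)
```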
rep.utils.fit_metric(metric, *args, **kwargs)

A metric can implement one of two interfaces (function or object). This function fits the metric if required (by simply checking for the presence of a fit method).

Parameters: metric – metric function, following REP conventions
rep.utils.get_columns_dict(columns)

Get a {new column: old column} dict of expressions. This function is used to process names of features, which can contain expressions.

Parameters: columns (list[str]) – column names
Return type: dict
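A sketch of such name processing, assuming a 'new_name: expression' naming convention (the exact convention is an assumption; `columns_dict` is a hypothetical name):

```python
def columns_dict(columns):
    """Parse column names of the form 'new_name: expression' into
    {new_name: expression}; plain names map to themselves.
    A sketch of the idea, not the library's exact rules."""
    result = {}
    for column in columns:
        if ':' in column:
            new_name, expression = column.split(':', 1)
            result[new_name.strip()] = expression.strip()
        else:
            result[column] = column
    return result

parsed = columns_dict(['pt', 'p2: px**2 + py**2'])
```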
rep.utils.get_columns_in_df(df, columns)

Get columns of a data frame using numexpr evaluation.

Parameters: - df (pandas.DataFrame) – data
- columns (None or list[str]) – the required columns
Returns: data frame with the requested columns
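Expression columns can be served with pandas' own DataFrame.eval, which uses numexpr when it is available. A sketch of that approach (`get_columns` is a hypothetical name, not the library function):

```python
import pandas as pd

def get_columns(df, columns):
    """Fetch columns from a DataFrame, evaluating expression strings
    (e.g. 'px**2 + py**2') with DataFrame.eval."""
    if columns is None:
        return df
    result = pd.DataFrame()
    for column in columns:
        # plain names are taken directly; anything else is evaluated
        result[column] = df[column] if column in df.columns else df.eval(column)
    return result

df = pd.DataFrame({'px': [3.0, 0.0], 'py': [4.0, 1.0]})
out = get_columns(df, ['px', 'px**2 + py**2'])
```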
rep.utils.get_efficiencies(prediction, spectator, sample_weight=None, bins_number=20, thresholds=None, errors=False, ignored_sideband=0.0)

Construct an efficiency function dependent on the spectator for each threshold.

Different score functions are available: Efficiency, Precision, Recall, F1Score, and other things from sklearn.metrics.

Parameters: - prediction – list of probabilities
- spectator – list of spectator values
- bins_number (int) – count of bins for the plot
- thresholds – list of prediction thresholds (default: the prediction cuts for which efficiency will be [0.2, 0.4, 0.5, 0.6, 0.8])
Returns: if errors=False, OrderedDict threshold -> (x_values, y_values);
if errors=True, OrderedDict threshold -> (x_values, y_values, y_err, x_err).
All the parts x_values, y_values, y_err, x_err are numpy.arrays of the same length.
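The core computation is: bin the spectator, then take the pass fraction of each cut in each bin. A simplified sketch without weights or errors (`efficiencies_vs_spectator` is a hypothetical name):

```python
import numpy as np

def efficiencies_vs_spectator(prediction, spectator, thresholds, bins_number=20):
    """For each threshold, the fraction of events passing the cut
    within each bin of the spectator variable."""
    prediction = np.asarray(prediction, dtype=float)
    spectator = np.asarray(spectator, dtype=float)
    edges = np.linspace(spectator.min(), spectator.max(), bins_number + 1)
    # bin index per event (clip so the right edge lands in the last bin)
    bin_index = np.clip(np.searchsorted(edges, spectator, side='right') - 1,
                        0, bins_number - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    result = {}
    for threshold in thresholds:
        passed = prediction > threshold
        efficiency = np.array(
            [passed[bin_index == i].mean() if np.any(bin_index == i) else np.nan
             for i in range(bins_number)])
        result[threshold] = (centers, efficiency)
    return result

rng = np.random.RandomState(0)
effs = efficiencies_vs_spectator(rng.uniform(size=500), rng.uniform(size=500),
                                 thresholds=[0.5], bins_number=5)
```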
rep.utils.reorder_by_first(*arrays)

Applies the same permutation to all passed arrays; the permutation sorts the first passed array.
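The behaviour described above can be sketched in a few lines with numpy.argsort (a sketch, not the library's code):

```python
import numpy as np

def reorder_by_first(*arrays):
    """Sort the first array and apply the same permutation to all arrays."""
    order = np.argsort(arrays[0])
    return [np.asarray(array)[order] for array in arrays]

values, labels = reorder_by_first(np.array([3, 1, 2]), np.array(['c', 'a', 'b']))
```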
rep.utils.train_test_split(*arrays, **kw_args)

Does the same thing as sklearn.cross_validation.train_test_split, but additionally has an 'allow_none' parameter.

Parameters: - arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split, with the same first dimension
- allow_none (bool) – default False; if set to True, allows non-first arguments to be None (in this case, both the resulting train and test parts are None).
rep.utils.train_test_split_group(group_column, *arrays, **kw_args)

Modification of train_test_split which alters the splitting rule.

Parameters: - group_column – array-like of shape [n_samples] with indices of groups; events from one group are kept together (all events in train or all events in test). When group_column is used, train_size and test_size refer to the number of groups, not events.
- arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split
- allow_none (bool) – default False (useful for sample_weight: after splitting, train and test parts of None are again None)
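Group-wise splitting means splitting the set of unique groups first and then expanding back to events. A simplified sketch of that rule (`group_train_test_split` is a hypothetical name, not REP's implementation):

```python
import numpy as np

def group_train_test_split(group_column, *arrays, **kw_args):
    """Split whole groups into train/test, then expand back to events,
    so no group is divided between the two parts."""
    test_size = kw_args.get('test_size', 0.5)     # fraction of GROUPS
    rng = np.random.RandomState(kw_args.get('random_state'))
    groups = np.unique(group_column)
    shuffled = rng.permutation(groups)
    n_test = int(round(len(groups) * test_size))
    test_groups = set(shuffled[:n_test])
    is_test = np.array([g in test_groups for g in group_column])
    result = []
    for array in arrays:
        array = np.asarray(array)
        result += [array[~is_test], array[is_test]]
    return result

groups = np.array([0, 0, 1, 1, 2, 2])
x = np.arange(6)
x_train, x_test = group_train_test_split(groups, x, test_size=1 / 3.,
                                         random_state=0)
```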
rep.utils.weighted_quantile(array, quantiles, sample_weight=None, array_sorted=False, old_style=False)

Computes quantiles of an array. Unlike numpy.percentile, this function supports weights, but it is inefficient and performs a complete sort.

Parameters: - array – distribution, array of shape [n_samples]
- quantiles – floats from range [0, 1] with quantiles of shape [n_quantiles]
- sample_weight – optional weights of samples, array of shape [n_samples]
- array_sorted – if True, the sorting step will be skipped
- old_style – if True, will correct the output to be consistent with numpy.percentile.
Returns: array of shape [n_quantiles]
Example:

>>> weighted_quantile([1, 2, 3, 4, 5], [0.5])
Out: array([ 3.])
>>> weighted_quantile([1, 2, 3, 4, 5], [0.5], sample_weight=[3, 1, 1, 1, 1])
Out: array([ 2.])
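The sort-and-cumulate approach behind weighted quantiles can be sketched as follows; this simplified version (no `array_sorted` or `old_style` handling, and not the library's exact code) reproduces the example above:

```python
import numpy as np

def weighted_quantile(array, quantiles, sample_weight=None):
    """Weighted quantiles via interpolation on the cumulative
    weight distribution (a simplified sketch)."""
    array = np.asarray(array, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(array))
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(array)                    # the complete sort
    array, sample_weight = array[order], sample_weight[order]
    # position of each sample in the cumulative weight, centred in its own weight
    positions = np.cumsum(sample_weight) - 0.5 * sample_weight
    positions /= np.sum(sample_weight)
    return np.interp(quantiles, positions, array)

q = weighted_quantile([1, 2, 3, 4, 5], [0.5], sample_weight=[3, 1, 1, 1, 1])
# the heavy weight on the value 1 pulls the median down to 2
```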