Utilities

Helpful functions, classes, and methods are collected here.
class rep.utils.Binner(values, bins_number)

Binner is a class that helps split values into several bins. Initially an array of values is given, which is then split into bins_number equal parts, thereby computing the limits (the boundaries of the bins).

bins_number

Returns: number of bins
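The limits Binner computes are simply equally-spaced quantile boundaries. A minimal sketch of that idea using numpy (not REP's implementation, and `compute_bin_limits` is a hypothetical name):

```python
import numpy as np

def compute_bin_limits(values, bins_number):
    """Split sorted values into `bins_number` equal-sized parts and
    return the inner boundaries (bins_number - 1 limits).
    Hypothetical helper illustrating what Binner computes."""
    values = np.sort(values)
    # indices of the boundaries between consecutive equal parts
    positions = [len(values) * i // bins_number for i in range(1, bins_number)]
    return [values[pos] for pos in positions]

limits = compute_bin_limits(np.arange(100), bins_number=4)
# four equal parts of 0..99 are bounded at 25, 50 and 75
```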
class rep.utils.Flattener(data, sample_weight=None)

Prepares a normalization function for a set of values; the function transforms values to a uniform distribution on [0, 1].

Parameters: - data (list or numpy.array) – predictions
- sample_weight (None or list or numpy.array) – weights
Return func: normalization function
Example of usage:

>>> normalizer = Flattener(signal)
>>> hist(normalizer(background))
>>> hist(normalizer(signal))
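Such a normalizer amounts to mapping each value through the weighted empirical CDF of the reference sample. A self-contained sketch of that idea (not the library's code; `make_flattener` is a hypothetical name):

```python
import numpy as np

def make_flattener(data, sample_weight=None):
    """Return a function mapping values through the empirical CDF of `data`,
    so that `data` itself becomes uniform on [0, 1]."""
    data = np.asarray(data, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(data))
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(data)
    sorted_data = data[order]
    # cumulative weight, normalized to end at 1
    cdf = np.cumsum(sample_weight[order])
    cdf = cdf / cdf[-1]

    def flatten(values):
        # np.interp maps each value to its weighted CDF position
        return np.interp(values, sorted_data, cdf)

    return flatten

signal = np.random.RandomState(0).normal(size=1000)
normalizer = make_flattener(signal)
flattened = normalizer(np.random.RandomState(1).normal(size=1000))
```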
class rep.utils.Stopwatch

Simple tool to measure time. In an IPython session, the %time magic is an alternative.

>>> with Stopwatch() as timer:
>>>     # do something here
>>>     classifier.fit(X, y)
>>> # print how much time was spent
>>> print(timer)
elapsed
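A minimal context-manager stopwatch of this kind can be sketched with the standard library alone (a simplified stand-in, not REP's implementation):

```python
import time

class SimpleStopwatch:
    """Minimal context-manager stopwatch, similar in spirit to
    rep.utils.Stopwatch (a sketch, not the library class)."""

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc_info):
        self.stop = time.time()

    @property
    def elapsed(self):
        # seconds between entering and leaving the with-block
        return self.stop - self.start

    def __str__(self):
        return 'interval: {:.2f} seconds'.format(self.elapsed)

with SimpleStopwatch() as timer:
    sum(range(100000))  # stand-in for classifier.fit(X, y)
```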
rep.utils.calc_ROC(prediction, signal, sample_weight=None, max_points=10000)

Calculate the ROC curve, returning a limited number of points. This is needed for interactive plots, which suffer when the number of points is too large.

Parameters: - prediction (numpy.ndarray or list) – predictions
- signal (array or list) – true labels
- sample_weight (None or array or list) – weights
- max_points (int) – maximum number of points used on the ROC curve
Returns: (tpr, tnr), (err_tnr, err_tpr), thresholds
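The point limiting can be done by thinning the full curve, keeping at most every k-th point. A sketch of a (tpr, tnr) curve with such thinning, implemented directly in numpy (`roc_limited` is a hypothetical name, without the error estimates the real function returns):

```python
import numpy as np

def roc_limited(prediction, signal, max_points=10000):
    """Compute (tpr, tnr) pairs for descending thresholds on `prediction`,
    keeping at most roughly `max_points` of them."""
    prediction = np.asarray(prediction, dtype=float)
    signal = np.asarray(signal, dtype=bool)
    order = np.argsort(prediction)[::-1]              # descending by score
    is_signal = signal[order]
    tpr = np.cumsum(is_signal) / is_signal.sum()      # true positive rate
    fpr = np.cumsum(~is_signal) / (~is_signal).sum()  # false positive rate
    tnr = 1.0 - fpr                                   # true negative rate
    # thin the curve to at most ~max_points points
    step = max(1, len(tpr) // max_points)
    return tpr[::step], tnr[::step]

tpr, tnr = roc_limited([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], max_points=10)
```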
rep.utils.calc_feature_correlation_matrix(df, weights=None)

Calculate the correlation matrix.

Parameters: - df (pandas.DataFrame) – data of shape [n_samples, n_features]
- weights – weights of shape [n_samples] (optional)
Returns: correlation matrix for the DataFrame, of shape [n_features, n_features]
Return type: numpy.ndarray
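A weighted correlation matrix can be obtained from the weighted covariance: numpy.corrcoef has no weight support, but numpy.cov accepts `aweights`. A sketch under that approach (`weighted_correlation_matrix` is a hypothetical name):

```python
import numpy as np

def weighted_correlation_matrix(X, weights=None):
    """Correlation matrix of the columns of X with optional sample weights,
    built by normalizing the weighted covariance matrix."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False, aweights=weights)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

rng = np.random.RandomState(0)
data = rng.normal(size=(200, 3))
corr = weighted_correlation_matrix(data, weights=np.ones(200))
```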
rep.utils.calc_hist_with_errors(x, weight=None, bins=60, normed=True, x_range=None, ignored_sideband=0.0)

Calculate data for an error-bar plot (to plot a pdf with errors).

Parameters: - x (list or numpy.array) – data
- weight (None or list or numpy.array) – weights
Returns: tuple (x-points (list), y-points (list), y-point errors (list), x-point errors (list))
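The usual recipe behind such a plot is sqrt(N) (Poisson) errors on the bin counts and half-bin-width errors on x. A simplified unweighted sketch, assuming that recipe (`hist_with_errors` is a hypothetical name):

```python
import numpy as np

def hist_with_errors(x, bins=60, normed=True):
    """Histogram points with Poisson (sqrt(N)) errors on y and
    half-bin-width errors on x, suitable for an error-bar pdf plot."""
    counts, edges = np.histogram(x, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    x_err = 0.5 * np.diff(edges)
    y = counts.astype(float)
    y_err = np.sqrt(counts)                   # Poisson error on raw counts
    if normed:
        norm = counts.sum() * np.diff(edges)  # turn counts into a density
        y, y_err = y / norm, y_err / norm
    return centers, y, y_err, x_err

centers, y, y_err, x_err = hist_with_errors(
    np.random.RandomState(0).normal(size=1000), bins=30)
```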
rep.utils.check_arrays(*arrays)

Kept for consistency; a version of sklearn.validation.check_arrays.

Parameters: arrays (list[iterable]) – arrays with the same length of first dimension.
rep.utils.check_sample_weight(y_true, sample_weight)

Checks the weights; if None is passed, returns an array.

Parameters: - y_true – labels (or any array of length [n_samples])
- sample_weight – None or array of length [n_samples]
Returns: numpy.array of shape [n_samples]
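A minimal sketch of this pattern, assuming unit weights are the sensible default when None is passed:

```python
import numpy as np

def check_sample_weight(y_true, sample_weight):
    """Return weights as a float array of length [n_samples]; assume
    unit weights when None is passed (a sketch, not the library code)."""
    if sample_weight is None:
        return np.ones(len(y_true), dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    assert len(y_true) == len(sample_weight), 'lengths are different'
    return sample_weight

w = check_sample_weight([0, 1, 1], None)
```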
rep.utils.fit_metric(metric, *args, **kwargs)

A metric can implement one of two interfaces (function or object). This function fits the metric if required (by simply checking for the presence of a fit method).

Parameters: metric – metric function, following REP conventions
rep.utils.get_columns_dict(columns)

Get a {new column: old column} dict of expressions. This function is used to process names of features, which can contain expressions.

Parameters: columns (list[str]) – column names
Return type: dict
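A sketch of such name processing, assuming a 'new_name: expression' naming convention (the exact convention is an assumption; `columns_dict` is a hypothetical name):

```python
def columns_dict(columns):
    """Parse column names of the form 'new_name: expression' into
    {new_name: expression}; plain names map to themselves.
    A sketch of the idea, not the library's exact rules."""
    result = {}
    for column in columns:
        if ':' in column:
            new_name, expression = column.split(':', 1)
            result[new_name.strip()] = expression.strip()
        else:
            result[column] = column
    return result

parsed = columns_dict(['pt', 'p2: px**2 + py**2'])
```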
rep.utils.get_columns_in_df(df, columns)

Get columns of a data frame using numexpr evaluation.

Parameters: - df (pandas.DataFrame) – data
- columns (None or list[str]) – the required columns
Returns: data frame with the requested columns
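Expression columns can be served with pandas' own DataFrame.eval, which uses numexpr when it is available. A sketch of that approach (`get_columns` is a hypothetical name, not the library function):

```python
import pandas as pd

def get_columns(df, columns):
    """Fetch columns from a DataFrame, evaluating expression strings
    (e.g. 'px**2 + py**2') with DataFrame.eval."""
    if columns is None:
        return df
    result = pd.DataFrame()
    for column in columns:
        # plain names are taken directly; anything else is evaluated
        result[column] = df[column] if column in df.columns else df.eval(column)
    return result

df = pd.DataFrame({'px': [3.0, 0.0], 'py': [4.0, 1.0]})
out = get_columns(df, ['px', 'px**2 + py**2'])
```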
rep.utils.get_efficiencies(prediction, spectator, sample_weight=None, bins_number=20, thresholds=None, errors=False, ignored_sideband=0.0)

Construct an efficiency function dependent on the spectator for each threshold.

Different score functions are available: Efficiency, Precision, Recall, F1Score, and other things from sklearn.metrics.

Parameters: - prediction – list of probabilities
- spectator – list of spectator values
- bins_number (int) – count of bins for the plot
- thresholds – list of prediction thresholds (default: the prediction cuts for which efficiency will be [0.2, 0.4, 0.5, 0.6, 0.8])
Returns: if errors=False, OrderedDict threshold -> (x_values, y_values);
if errors=True, OrderedDict threshold -> (x_values, y_values, y_err, x_err).
All the parts x_values, y_values, y_err, x_err are numpy.arrays of the same length.
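The core computation is: bin the spectator, then take the pass fraction of each cut in each bin. A simplified sketch without weights or errors (`efficiencies_vs_spectator` is a hypothetical name):

```python
import numpy as np

def efficiencies_vs_spectator(prediction, spectator, thresholds, bins_number=20):
    """For each threshold, the fraction of events passing the cut
    within each bin of the spectator variable."""
    prediction = np.asarray(prediction, dtype=float)
    spectator = np.asarray(spectator, dtype=float)
    edges = np.linspace(spectator.min(), spectator.max(), bins_number + 1)
    # bin index per event (clip so the right edge lands in the last bin)
    bin_index = np.clip(np.searchsorted(edges, spectator, side='right') - 1,
                        0, bins_number - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    result = {}
    for threshold in thresholds:
        passed = prediction > threshold
        efficiency = np.array(
            [passed[bin_index == i].mean() if np.any(bin_index == i) else np.nan
             for i in range(bins_number)])
        result[threshold] = (centers, efficiency)
    return result

rng = np.random.RandomState(0)
effs = efficiencies_vs_spectator(rng.uniform(size=500), rng.uniform(size=500),
                                 thresholds=[0.5], bins_number=5)
```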
rep.utils.reorder_by_first(*arrays)

Applies the same permutation to all passed arrays; the permutation sorts the first passed array.
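The behaviour described above can be sketched in a few lines with numpy.argsort (a sketch, not the library's code):

```python
import numpy as np

def reorder_by_first(*arrays):
    """Sort the first array and apply the same permutation to all arrays."""
    order = np.argsort(arrays[0])
    return [np.asarray(array)[order] for array in arrays]

values, labels = reorder_by_first(np.array([3, 1, 2]), np.array(['c', 'a', 'b']))
```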
rep.utils.train_test_split(*arrays, **kw_args)

Does the same thing as sklearn.cross_validation.train_test_split, but additionally has an 'allow_none' parameter.

Parameters: - arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split, with the same first dimension
- allow_none (bool) – default False; if set to True, allows non-first arguments to be None (in this case, both the resulting train and test parts are None).
rep.utils.train_test_split_group(group_column, *arrays, **kw_args)

Modification of train_test_split which alters the splitting rule.

Parameters: - group_column – array-like of shape [n_samples] with indices of groups; events from one group are kept together (all events in train or all events in test). When group_column is used, train_size and test_size refer to the number of groups, not events.
- arrays (list[numpy.array] or list[pandas.DataFrame]) – arrays to split
- allow_none (bool) – default False (useful for sample_weight: after splitting, train and test parts of None are again None)
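Group-wise splitting means splitting the set of unique groups first and then expanding back to events. A simplified sketch of that rule (`group_train_test_split` is a hypothetical name, not REP's implementation):

```python
import numpy as np

def group_train_test_split(group_column, *arrays, **kw_args):
    """Split whole groups into train/test, then expand back to events,
    so no group is divided between the two parts."""
    test_size = kw_args.get('test_size', 0.5)     # fraction of GROUPS
    rng = np.random.RandomState(kw_args.get('random_state'))
    groups = np.unique(group_column)
    shuffled = rng.permutation(groups)
    n_test = int(round(len(groups) * test_size))
    test_groups = set(shuffled[:n_test])
    is_test = np.array([g in test_groups for g in group_column])
    result = []
    for array in arrays:
        array = np.asarray(array)
        result += [array[~is_test], array[is_test]]
    return result

groups = np.array([0, 0, 1, 1, 2, 2])
x = np.arange(6)
x_train, x_test = group_train_test_split(groups, x, test_size=1 / 3.,
                                         random_state=0)
```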
rep.utils.weighted_quantile(array, quantiles, sample_weight=None, array_sorted=False, old_style=False)

Computes quantiles of an array. Unlike numpy.percentile, this function supports weights, but it is inefficient and performs a complete sort.

Parameters: - array – distribution, array of shape [n_samples]
- quantiles – floats from range [0, 1] with quantiles of shape [n_quantiles]
- sample_weight – optional weights of samples, array of shape [n_samples]
- array_sorted – if True, the sorting step will be skipped
- old_style – if True, will correct the output to be consistent with numpy.percentile.
Returns: array of shape [n_quantiles]
Example:

>>> weighted_quantile([1, 2, 3, 4, 5], [0.5])
Out: array([ 3.])
>>> weighted_quantile([1, 2, 3, 4, 5], [0.5], sample_weight=[3, 1, 1, 1, 1])
Out: array([ 2.])
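The sort-and-cumulate approach behind weighted quantiles can be sketched as follows; this simplified version (no `array_sorted` or `old_style` handling, and not the library's exact code) reproduces the example above:

```python
import numpy as np

def weighted_quantile(array, quantiles, sample_weight=None):
    """Weighted quantiles via interpolation on the cumulative
    weight distribution (a simplified sketch)."""
    array = np.asarray(array, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(array))
    sample_weight = np.asarray(sample_weight, dtype=float)
    order = np.argsort(array)                    # the complete sort
    array, sample_weight = array[order], sample_weight[order]
    # position of each sample in the cumulative weight, centred in its own weight
    positions = np.cumsum(sample_weight) - 0.5 * sample_weight
    positions /= np.sum(sample_weight)
    return np.interp(quantiles, positions, array)

q = weighted_quantile([1, 2, 3, 4, 5], [0.5], sample_weight=[3, 1, 1, 1, 1])
# the heavy weight on the value 1 pulls the median down to 2
```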