dpemu package¶

Subpackages¶

Submodules¶

dpemu.dataset_utils module¶

dpemu.dataset_utils.load_coco_val_2017(n=5000, is_shuffled=False)[source]¶

Fetches the COCO dataset and returns its desired subset.

Parameters

n (int, optional) – The size of the wanted subset. Defaults to 5000.
is_shuffled (bool, optional) – If true, then the chosen subset of the data will be shuffled. Defaults to False.

Returns

The dataset, the labels of elements, the names of categories and the name of the dataset.

Return type

tuple

dpemu.dataset_utils.load_digits_(n_data=1797)[source]¶

Fetches the digits dataset and returns its desired subset.

Parameters: n_data (int, optional) – The size of the wanted subset. Defaults to 1797.
Returns: The dataset, the labels of data points, the names of categories and the name of the dataset.
Return type: tuple

dpemu.dataset_utils.load_fashion(n_data=70000)[source]¶

Fetches the fashion MNIST dataset and returns its desired subset.

Parameters: n_data (int, optional) – The size of the wanted subset. Defaults to 70000.
Returns: The dataset, the labels of elements, the names of categories and the name of the dataset.
Return type: tuple

dpemu.dataset_utils.load_mnist(reshape_to_28x28=False, integer_values=False)[source]¶

Fetches the MNIST dataset and returns its desired subset.

Parameters

reshape_to_28x28 (bool, optional) – The data is reshaped to 28x28 images if true. Defaults to False.
integer_values (bool, optional) – The data is typecast to integers if true. Defaults to False.

Returns

Training pixel data, training labels, test pixel data, test labels.

Return type

tuple

dpemu.dataset_utils.load_mnist_unsplit(n_data=70000)[source]¶

Fetches the MNIST dataset and returns its desired subset.

Parameters: n_data (int, optional) – The size of the wanted subset. Defaults to 70000.
Returns: The dataset, the labels of data points, the names of categories and the name of the dataset.
Return type: tuple

dpemu.dataset_utils.load_newsgroups(subset='all', n_categories=20)[source]¶

Fetches the 20 newsgroups dataset and returns its desired subset.

Parameters

subset (str, optional) – If “test” then a smaller dataset is used instead of the full one. Defaults to “all”.
n_categories (int, optional) – The number of categories to be included. Defaults to 20.

Returns

The dataset, categories as integers, category names and the name of the dataset.

Return type

tuple

dpemu.ml_utils module¶

dpemu.ml_utils.load_yolov3()[source]¶

Loads the custom weights and cfg for the YOLOv3 model.

Returns: Paths to YOLOv3 weights and cfg file.

dpemu.ml_utils.reduce_dimensions(data, random_state, target_dim=2)[source]¶

Reduces the dimensionality of the data using UMAP for lower dimensions, PCA for higher dimensions and possibly even random projections if the number of dimensions is over the limit given by the Johnson–Lindenstrauss lemma. Works for NumPy arrays.

Parameters

data – The input data.
random_state – Random state to generate reproducible results.
target_dim – The targeted dimension.

Returns

Lower dimension representation of the data.
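The limit given by the Johnson–Lindenstrauss lemma can be computed directly. The sketch below uses the standard form of the bound, truncated to an integer; the function name jl_min_dim and the distortion tolerance eps are assumptions for illustration, since the value used by reduce_dimensions is not stated here.

```python
import math


def jl_min_dim(n_samples, eps):
    """Johnson-Lindenstrauss bound on the target dimension, truncated to int.

    With this many dimensions a random projection distorts pairwise
    distances between n_samples points by a factor of at most (1 +/- eps).
    """
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * math.log(n_samples) / denominator)


# One million points with a 50 % distortion tolerance.
min_dim = jl_min_dim(10 ** 6, eps=0.5)
print(min_dim)
```

The bound depends only on the number of samples and the tolerance, not on the original dimensionality of the data.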

dpemu.ml_utils.reduce_dimensions_sparse(data, random_state, target_dim=2)[source]¶

Reduces the dimensionality of the data using UMAP for lower dimensions and TruncatedSVD for higher dimensions. Works for SciPy sparse matrices.

Parameters

data – The input data.
random_state – Random state to generate reproducible results.
target_dim – The targeted dimension.

Returns

Lower dimension representation of the data.

dpemu.ml_utils.run_ml_module_using_cli(cline, show_stdout=True)[source]¶

Runs an external ML model using its CLI.

Parameters

cline – Command line used to call the external ML model.
show_stdout – True to print the stdout of the external ML model.

Returns

A string containing the stdout of the external ML model.
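A minimal sketch of the behaviour described above, not the library's implementation: it runs a command line, optionally echoes the output and returns stdout as a string. The name run_cli_sketch and the use of shlex.split are assumptions; the current Python interpreter stands in for an external ML model.

```python
import shlex
import subprocess
import sys


def run_cli_sketch(cline, show_stdout=True):
    """Run a command line and return everything it wrote to stdout."""
    # Split the command line into arguments instead of going through a shell.
    proc = subprocess.run(shlex.split(cline), capture_output=True, text=True)
    if show_stdout:
        print(proc.stdout, end="")
    return proc.stdout


# The interpreter itself plays the role of the external program here.
output = run_cli_sketch(f'"{sys.executable}" -c "print(42)"', show_stdout=False)
```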

dpemu.pg_utils module¶

dpemu.pg_utils.first_dimension_length(array)[source]¶

Returns the length of the first dimension of the provided array or list.

Parameters: array (list or numpy.ndarray) – An array.
Returns: The length of the first dimension of the array.
Return type: int

dpemu.pg_utils.generate_random_dict_key(dct, prefix)[source]¶

Generates a random string that is not already in the dict.

Parameters

dct (dict) – A Python dictionary.
prefix (str) – A prefix for the random key.

Returns

A randomly generated key.

Return type

str
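One way to obtain such a key is to draw random suffixes until one is unused. The sketch below assumes a suffix of eight lowercase letters and an optional rng argument; neither detail comes from the documentation above.

```python
import random
import string


def random_dict_key_sketch(dct, prefix, n_chars=8, rng=None):
    """Return prefix + random suffix such that the result is not a key of dct."""
    rng = rng or random.Random()
    while True:
        suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(n_chars))
        key = prefix + suffix
        if key not in dct:  # retry on the (unlikely) collision
            return key


existing = {"tmp_a": 1}
key = random_dict_key_sketch(existing, "tmp_", rng=random.Random(0))
```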

dpemu.pg_utils.load_ocr_error_params(path_to_error_params)[source]¶

Loads error parameters from a JSON file.

Parameters: path_to_error_params (str) – A string containing the relative or absolute path to the file.
Returns: A Python dictionary.
Return type: dict

dpemu.pg_utils.normalize_ocr_error_params(params)[source]¶

Normalises numerical weights associated with a character’s OCR-error likelihoods.

For every character found in the dict, the value associated with it is a list containing numerical weights. These weights are normalised so that they sum to 1, and can thus be used as probabilities. Every probability is then attached to the event of a character changing to another character specified in the .json file which was loaded using the load_ocr_error_params function.

Parameters: params (dict) – A dict containing character-list pairs.
Returns: A dict containing normalised probabilities for every character.
Return type: dict
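The normalisation can be sketched as follows. The layout of the dict is an assumption made for illustration: each character maps to a pair holding the list of replacement characters and the list of weights. The actual file format may differ.

```python
def normalize_ocr_params_sketch(params):
    """Turn the weights of every character into probabilities."""
    normalized = {}
    for char, (replacements, weights) in params.items():
        total = sum(weights)
        normalized[char] = [replacements, [w / total for w in weights]]
    return normalized


# Hypothetical parameters: "e" stays "e" most of the time,
# but is sometimes misread as "c" or "o".
params = {"e": [["e", "c", "o"], [6, 3, 1]]}
probs = normalize_ocr_params_sketch(params)["e"][1]
```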

dpemu.pg_utils.normalize_weights(weights)[source]¶

Normalises a list of numerical values (weights) into probabilities.

Every weight in the list is assigned a probability equal to its value divided by the sum of all values.

Parameters: weights (list) – A list of numerical values.
Returns: A list containing values which sum to 1.
Return type: list
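As a small worked example of the rule, a sketch in plain Python (the function name is illustrative, not the library function itself):

```python
def normalize_weights_sketch(weights):
    """Divide every weight by the sum of all weights."""
    total = sum(weights)
    return [w / total for w in weights]


probabilities = normalize_weights_sketch([2, 1, 1])
print(probabilities)  # [0.5, 0.25, 0.25]
```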

dpemu.pg_utils.to_time_series_x_y(data, x_length)[source]¶

Convert time series data to pairs of x, y where x is a vector of x_length consecutive observations and y is the observation immediately following x.

Parameters

data – The time series data.
x_length (int) – Length of the x vector.

Returns

The x, y pair.
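The windowing can be illustrated with plain lists. This is a sketch of the described transformation under the assumption that data is a one-dimensional sequence; the library function may return a different container type.

```python
def to_time_series_x_y_sketch(data, x_length):
    """Slide a window of x_length observations over data.

    Each x is a window of consecutive observations and the matching y
    is the observation that immediately follows the window.
    """
    n_pairs = len(data) - x_length
    x = [data[i:i + x_length] for i in range(n_pairs)]
    y = [data[i + x_length] for i in range(n_pairs)]
    return x, y


x, y = to_time_series_x_y_sketch([1, 2, 3, 4, 5], x_length=3)
print(x)  # [[1, 2, 3], [2, 3, 4]]
print(y)  # [4, 5]
```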

dpemu.plotting_utils module¶

dpemu.plotting_utils.get_lims(data)[source]¶

Returns the limits of the plot.

Parameters: data (list) – A list of 2-dimensional data points.
Returns: minimum x, maximum x, minimum y, maximum y.
Return type: float, float, float, float
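A minimal sketch of computing such limits from a list of points. It returns the raw extremes; whether the library function adds a margin around them is not stated above.

```python
def get_lims_sketch(data):
    """Return (min x, max x, min y, max y) for a list of 2-dimensional points."""
    xs = [point[0] for point in data]
    ys = [point[1] for point in data]
    return min(xs), max(xs), min(ys), max(ys)


lims = get_lims_sketch([(0.5, 2.0), (-1.0, 3.5), (2.0, 0.0)])
print(lims)  # (-1.0, 2.0, 0.0, 3.5)
```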

dpemu.plotting_utils.get_n_rows_cols(n_plots, max_n_cols)[source]¶

dpemu.plotting_utils.print_results_by_model(df, dropped_columns=[])[source]¶

Prints the dataframe row by row excluding the unwanted columns.

Parameters

df (pandas.DataFrame) – The dataframe returned by the runner.
dropped_columns (list, optional) – List of the column names we do not want to be printed. Defaults to [].

dpemu.plotting_utils.visualize_best_model_params(df, model_name, model_params, score_names, is_higher_score_better, err_param_name, title, x_log=False, y_log=False, max_n_cols=2)[source]¶

Plots the best model parameters for distinct error values.

Parameters

df (pandas.DataFrame) – The dataframe returned by the runner.
model_name (str) – The name of the model for which we want to plot the best parameters.
model_params (list) – A list of strings which are the names of the params of the model we want to plot.
score_names (list) – A list of strings which are the names of the scores for which we want to create a plot.
is_higher_score_better (list) – A list of booleans for each score type: True means that a higher score is better and False means a lower score is better.
err_param_name (str) – The error whose distinct values are going to be used on the x-axis.
title (str) – The title of the plot.
x_log (bool, optional) – A bool telling whether a logarithmic scale should be used on x-axis or not. Defaults to False.
y_log (bool, optional) – A bool telling whether a logarithmic scale should be used on y-axis or not. Defaults to False.
max_n_cols (int, optional) – The maximum number of columns in the grid of subplots. Defaults to 2.

dpemu.plotting_utils.visualize_classes(df, label_names, err_param_name, reduced_data_column, labels_column, cmap, title, max_n_cols=4)[source]¶

Visualizes the classes as 2-dimensional plots for different error parameter values.

Parameters

df (pandas.DataFrame) – The dataframe returned by the runner.
label_names (list) – A list containing the names of the labels.
err_param_name (str) – The name of the error parameter whose different values are used for plots.
reduced_data_column (str) – The name of the column that contains the reduced data.
labels_column (str) – The name of the column that contains the labels for each element.
cmap (str) – The name of the color map used for coloring the plot.
title (str) – The title of the plot.
max_n_cols (int, optional) – The maximum number of columns in the grid of subplots. Defaults to 4.

dpemu.plotting_utils.visualize_confusion_matrices(df, label_names, score_name, is_higher_score_better, err_param_name, labels_col, predictions_col, on_click=None)[source]¶

Generates confusion matrices for each error parameter combination and model.

Parameters

df (pandas.DataFrame) – The dataframe returned by the runner.
label_names (list) – A list containing the names of the labels.
score_name (str) – The name of the score type used for filtering the best results.
is_higher_score_better (bool) – If true, then a higher value of score is better and vice versa.
err_param_name (str) – The name of the error parameter whose different values the matrices use.
labels_col (str) – The name of the column containing the real labels.
predictions_col (str) – The name of the column containing the predicted labels.
on_click (function, optional) – If given, interactive mode is turned on and clicking an element causes the event listener to call this function. The function should take three parameters: an element, a real label and a predicted label. Defaults to None.
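A callback with the three-parameter shape described for on_click might look like the sketch below. The element type (a list of floats) and the label values are assumptions; the direct call at the end only simulates a click.

```python
clicked = []


def on_click(element, real_label, predicted_label):
    """Record and report the clicked element together with both labels."""
    clicked.append((real_label, predicted_label))
    print(f"real={real_label}, predicted={predicted_label}, element={element!r}")


# Simulate what the event listener would do when an element is clicked.
on_click([0.0, 0.5, 1.0], "cat", "dog")
```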

dpemu.plotting_utils.visualize_confusion_matrix(df_, cm, row, label_names, title, labels_column, predicted_labels_column, on_click=None)[source]¶

Creates a confusion matrix which can be made interactive if wanted.

Parameters

df_ (DataFrame) – The original dataframe returned by the runner.
cm (list) – An integer matrix describing the number of elements in each category of the confusion matrix.
row (int) – The row of the dataframe used for this matrix.
label_names (list) – A list of strings containing the names of the labels.
title (str) – The title of the confusion matrix visualization.
labels_column (str) – The name of the column containing the real labels.
predicted_labels_column (str) – The name of the column containing the predicted labels.
on_click (function, optional) – If given, interactive mode is turned on and clicking an element causes the event listener to call this function. The function should take three parameters: an element, a real label and a predicted label. Defaults to None.

dpemu.plotting_utils.visualize_error_generator(root_node, view=True)[source]¶

Generates a directed graph describing the error generation tree and filters.

root_node.generate_error() needs to be called before calling this function, because otherwise Filters may have incorrect or missing parameter values in the graph.

Parameters

root_node (Node) – The root node of the error generation tree.
view (bool, optional) – If True, the error generation tree graph is displayed to the user in addition to being saved to a file. If False, it is only saved to a file in the DOT graph description language. Defaults to True.

Returns

File path to the saved DOT graph description file.

Return type

str

dpemu.plotting_utils.visualize_interactive_plot(df, err_param_name, data, scatter_cmap, reduced_data_column, on_click)[source]¶

Creates an interactive plot for each different value of the given error type.

The data points in the plots can be clicked to activate a given function.

Parameters

df (pandas.DataFrame) – The dataframe returned by the runner.
err_param_name (str) – The name of the error parameter by which the data is grouped.
data (obj) – The original data that was given to the runner module.
scatter_cmap (str) – The color map for the scatter plot.
reduced_data_column (str) – The name of the column containing the reduced data.
on_click (function) – A function used for interactive plotting. When a data point is clicked, the function is given the original and modified elements as its parameters.

dpemu.plotting_utils.visualize_scores(df, score_names, is_higher_score_better, err_param_name, title, x_log=False, y_log=False, max_n_cols=2)[source]¶

Plots the wanted scores for all distinct models that were used.

Parameters

df (pandas.DataFrame) – The dataframe returned by the runner.
score_names (list) – A list of strings which are the names of the scores for which we want to create a plot.
is_higher_score_better (list) – A list of booleans for each score type: True means that a higher score is better and False means a lower score is better.
err_param_name (str) – The error whose distinct values are going to be used on the x-axis.
title (str) – The title of the plot.
x_log (bool, optional) – A bool telling whether a logarithmic scale should be used on x-axis or not. Defaults to False.
y_log (bool, optional) – A bool telling whether a logarithmic scale should be used on y-axis or not. Defaults to False.
max_n_cols (int, optional) – The maximum number of columns in the grid of subplots. Defaults to 2.

dpemu.plotting_utils.visualize_time_series_prediction(df, data, score_name, is_higher_score_better, err_param_name, model_name, test_pred_column, err_train_column, title, max_n_cols=4)[source]¶

dpemu.radius_generators module¶

class dpemu.radius_generators.GaussianRadiusGenerator(mean, std)[source]¶

Bases: dpemu.radius_generators.RadiusGenerator

GaussianRadiusGenerator generates radii from a normal distribution with given parameters.

__init__(mean, std)[source]¶

Parameters

mean (float) – The mean of the normal distribution.
std (float) – The standard deviation of the normal distribution.

generate(random_state)[source]¶

Generates a single integer to be used as a radius in some of the filters.

Parameters: random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.
Returns: An integer describing the generated radius.
Return type: int

class dpemu.radius_generators.ProbabilityArrayRadiusGenerator(probability_array)[source]¶

Bases: dpemu.radius_generators.RadiusGenerator

ProbabilityArrayRadiusGenerator generates radii based on the probabilities in the array given as a parameter.

__init__(probability_array)[source]¶

Parameters: probability_array (list) – A list where the value of an element describes the probability of using its index as a radius.

generate(random_state)[source]¶

Generates a single integer to be used as a radius in some of the filters.

Parameters: random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.
Returns: An integer describing the generated radius.
Return type: int
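The idea of using an index as a radius with the probability stored at that index can be sketched with the standard library. Here random.Random replaces the NumPy random state purely to keep the example dependency-free, and the function name is illustrative.

```python
import random


def radius_from_probability_array(probability_array, rng):
    """Pick an index of probability_array, weighted by the value stored there."""
    indices = range(len(probability_array))
    return rng.choices(indices, weights=probability_array, k=1)[0]


# Radius 0 is never used; radii 1 and 2 are equally likely.
rng = random.Random(0)
radii = [radius_from_probability_array([0.0, 0.5, 0.5], rng) for _ in range(100)]
```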

class dpemu.radius_generators.RadiusGenerator[source]¶

Bases: abc.ABC

Radius generators are used by some filters for generating radii for their effects.

abstract generate(random_state)[source]¶

Generates a single integer to be used as a radius in some of the filters.

Parameters: random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.
Returns: An integer describing the generated radius.
Return type: int
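A custom radius generator only needs to implement generate. The sketch below defines a local stand-in for the abstract base class so that it runs without dpemu installed, and UniformRadiusGenerator is a hypothetical subclass. It uses random.Random instead of a NumPy random state; note that random.Random.randint includes the upper bound, whereas RandomState.randint excludes it.

```python
import random
from abc import ABC, abstractmethod


class RadiusGenerator(ABC):
    """Local stand-in for dpemu.radius_generators.RadiusGenerator."""

    @abstractmethod
    def generate(self, random_state):
        """Return a single integer radius."""


class UniformRadiusGenerator(RadiusGenerator):
    """Hypothetical generator drawing radii uniformly from [low, high]."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def generate(self, random_state):
        # Both bounds are inclusive with random.Random.randint.
        return random_state.randint(self.low, self.high)


radius = UniformRadiusGenerator(1, 5).generate(random.Random(42))
```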

dpemu.runner module¶

dpemu.runner.add_more_stuff_to_results(result, err_params, model_name, i_data, time_pre, time_err, use_i_mode)[source]¶

Adds extra information such as error parameters, model parameters and run times to a result dict.

Parameters

result – A result dict.
err_params – Error parameters.
model_name – Name of the model.
i_data – The interactive data.
time_pre – Time used in the preprocessing phase.
time_err – Time used in the error generation phase.
use_i_mode – True if interactive mode is used.

dpemu.runner.errorify_data(train_data, test_data, err_root_node, err_params)[source]¶

Applies the error to the data using the error source defined.

Parameters

train_data – The train data.
test_data – The test data.
err_root_node – Error root node.
err_params – Error parameters.

Returns

Erroneous data and time used in error generation.

dpemu.runner.get_df_columns_base(err_params_list, model_params_dict_list)[source]¶

Generates the base for a list of Dataframe column names.

Parameters

err_params_list – List of all error parameter combinations.
model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.

Returns

Base list for Dataframe column names.

dpemu.runner.get_model_name(model, use_clean_train_data, same_model_counter)[source]¶

Returns the name of the model class. If the name ends with the word Model, that suffix is removed. If clean train data is used, the word Clean is appended to the name. A number is added to the end to distinguish multiple instances of the same model.

Parameters

model – The ML model class used.
use_clean_train_data – True if clean train data is used.
same_model_counter – Counter used to separate same models.

Returns

The model name.
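The naming rule can be sketched as below. The exact output format, in particular the " #" separator before the counter, is an assumption, and KMeansModel is a hypothetical class used only for illustration.

```python
def get_model_name_sketch(model, use_clean_train_data, same_model_counter):
    """Build a display name from a model class following the rule above."""
    name = model.__name__
    if name.endswith("Model"):
        name = name[:-len("Model")]  # drop the trailing word "Model"
    if use_clean_train_data:
        name += "Clean"
    # The separator before the counter is an assumed format.
    return f"{name} #{same_model_counter}"


class KMeansModel:
    """Hypothetical model class."""


name = get_model_name_sketch(KMeansModel, True, 1)
print(name)  # KMeansClean #1
```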

dpemu.runner.get_result_with_model_params(model, model_params, train_data, test_data, result_base)[source]¶

Gets the results from a model using specified model parameters.

Parameters

model – The ML model class used.
model_params – The model parameters used.
train_data – The train data.
test_data – The test data.
result_base – Base results from the preprocessor.

Returns

The results in a dict.

dpemu.runner.get_results_from_model(model, model_params_list, train_data, test_data, result_base)[source]¶

Gets all results from a model using different hyperparameter combinations.

Parameters

model – The ML model class used.
model_params_list – A list of different hyperparameter combinations for this model.
train_data – The train data.
test_data – The test data.
result_base – Base results from the preprocessor.

Returns

A list of result dicts from the model.

dpemu.runner.get_total_results_from_workers(pool_inputs, n_err_params, n_processes)[source]¶

Gathers the results from different workers to a list.

Parameters

pool_inputs – List of inputs for different workers.
n_err_params – Number of error parameter combinations.
n_processes – Max number of active subprocesses.

Returns

List of all result dicts from different workers.

dpemu.runner.order_df_columns(df, err_params_list, model_params_dict_list)[source]¶

Defines the final order for Dataframe column names.

Parameters

df – A Dataframe containing the results.
err_params_list – List of all error parameter combinations.
model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.

Returns

The reindexed Dataframe.

dpemu.runner.pickle_data(train_data, test_data)[source]¶

Saves the data to disk to be read by the workers.

Parameters

train_data – The train data.
test_data – The test data.

Returns

Paths to train and test data.

dpemu.runner.preproc_data(train_data, err_train_data, err_test_data, preproc, preproc_params)[source]¶

Preprocesses clean train data, errorified train data and errorified test data using the given preprocessor and parameters.

Parameters

train_data – The train data.
err_train_data – Errorified train data.
err_test_data – Errorified test data.
preproc – The preprocessor class.
preproc_params – The preprocessor parameters.

Returns

Preprocessed clean train data, preprocessed errorified test data when using clean train data, the result dict base when using clean train data, preprocessed errorified train data, preprocessed errorified test data when using errorified train data, the result dict base when using errorified train data, and the time used in preprocessing.

dpemu.runner.run(train_data, test_data, preproc, preproc_params, err_root_node, err_params_list, model_params_dict_list, n_processes=None, use_interactive_mode=False)[source]¶

The runner system is called with the run function. It creates a Pandas Dataframe from all of the results it gets from different workers.

Parameters

train_data – The train data.
test_data – The test data.
preproc – The preprocessor class.
preproc_params – The preprocessor parameters.
err_root_node – Error root node.
err_params_list – List of all error parameter combinations.
model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.
n_processes – Max number of active subprocesses.
use_interactive_mode – True if interactive mode is used. The resulting Dataframe contains the errorified data.

Returns

A Dataframe containing the results.

dpemu.runner.unpickle_data(path_to_train_data, path_to_test_data)[source]¶

Loads the data to memory in a subprocess.

Parameters

path_to_train_data – Path to the train data.
path_to_test_data – Path to the test data.

Returns

The train and test data.
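The round trip between pickle_data and unpickle_data can be sketched with the standard library: the data is written to disk once and each worker reads it back from the paths. The file names, the folder argument and the function names below are assumptions for illustration.

```python
import os
import pickle
import tempfile


def pickle_data_sketch(train_data, test_data, folder):
    """Save both datasets to folder and return their paths."""
    paths = []
    for name, data in (("train", train_data), ("test", test_data)):
        path = os.path.join(folder, f"{name}.pkl")
        with open(path, "wb") as file:
            pickle.dump(data, file)
        paths.append(path)
    return tuple(paths)


def unpickle_data_sketch(path_to_train_data, path_to_test_data):
    """Load both datasets back into memory."""
    with open(path_to_train_data, "rb") as file:
        train_data = pickle.load(file)
    with open(path_to_test_data, "rb") as file:
        test_data = pickle.load(file)
    return train_data, test_data


with tempfile.TemporaryDirectory() as folder:
    train_path, test_path = pickle_data_sketch([1, 2, 3], [4, 5], folder)
    restored = unpickle_data_sketch(train_path, test_path)
```

Passing paths instead of the data itself keeps the inputs sent to each worker small.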

dpemu.runner.worker(inputs)[source]¶

One of the workers in the multiprocessing pool. A subprocess is created for every error parameter combination. In every worker, the data is first errorified, then preprocessed and finally run through the models.

Parameters: inputs – Tuple containing the worker inputs.
Returns: List of all result dicts from different models.

dpemu.utils module¶

dpemu.utils.filter_optimized_results(df, err_param_name, score_name, is_higher_score_better)[source]¶

Removes suboptimal rows from the dataframe, returning only the best ones.

Parameters

df (pandas.DataFrame) – A dataframe containing all the data returned by the runner.
err_param_name (str) – The name of the error parameter by which the data will be grouped.
score_name (str) – The name of the score type we want to optimize.
is_higher_score_better (bool) – If true, then only the highest results are returned. Otherwise the lowest results are returned.

Returns

A dataframe containing the optimized results.

Return type

pandas.DataFrame
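The filtering logic can be illustrated without pandas by treating each row as a dict. This is a stand-in sketch of the idea (keep the best-scoring row for each error parameter value), not the library's DataFrame-based implementation; the column names are hypothetical.

```python
def filter_best_rows(rows, err_param_name, score_name, is_higher_score_better):
    """Keep, for every error parameter value, the row with the best score."""
    best = {}
    for row in rows:
        key = row[err_param_name]
        current = best.get(key)
        if current is None:
            best[key] = row
        elif is_higher_score_better and row[score_name] > current[score_name]:
            best[key] = row
        elif not is_higher_score_better and row[score_name] < current[score_name]:
            best[key] = row
    return list(best.values())


rows = [
    {"std": 0.0, "k": 3, "accuracy": 0.91},
    {"std": 0.0, "k": 5, "accuracy": 0.95},
    {"std": 0.5, "k": 3, "accuracy": 0.80},
    {"std": 0.5, "k": 5, "accuracy": 0.72},
]
best_rows = filter_best_rows(rows, "std", "accuracy", True)
```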

dpemu.utils.generate_unique_path(folder_name, extension, prefix=None)[source]¶

Generates a unique path name with desired folder name, extension and prefix.

Parameters

folder_name (str) – The name of the folder which the file path should contain.
extension (str) – The file extension of the file.
prefix (str, optional) – The optional prefix of the filename. Defaults to None.

Returns

The path generated by the function.

Return type

str
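One possible way to build such a path is sketched below. The use of a UUID for uniqueness and the underscore between prefix and identifier are assumptions; the library may use a different scheme, such as a timestamp.

```python
import uuid
from pathlib import Path


def generate_unique_path_sketch(folder_name, extension, prefix=None):
    """Return folder_name/[prefix_]<unique id>.extension as a string."""
    stem = uuid.uuid4().hex
    if prefix is not None:
        stem = f"{prefix}_{stem}"
    return str(Path(folder_name) / f"{stem}.{extension}")


first = generate_unique_path_sketch("out", "png", prefix="plot")
second = generate_unique_path_sketch("out", "png", prefix="plot")
```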

dpemu.utils.get_data_dir()[source]¶

Returns a path to the directory where datasets are saved.

Returns: The path to the directory where datasets are saved.
Return type: pathlib.PosixPath

dpemu.utils.get_project_root()[source]¶

Returns a path to the root of the project.

Returns: The path to the root of the project.
Return type: pathlib.PosixPath

dpemu.utils.split_df_by_model(df)[source]¶

Splits a dataframe such that each model gets its own dataframe.

Parameters: df (pandas.DataFrame) – A dataframe containing all the data returned by the runner.
Returns: A list of dataframes.
Return type: list
