dpemu package

Submodules

dpemu.dataset_utils module

dpemu.dataset_utils.load_coco_val_2017(n=5000, is_shuffled=False)[source]

Fetches the COCO dataset and returns its desired subset.

Parameters
  • n (int, optional) – The size of the wanted subset. Defaults to 5000.

  • is_shuffled (bool, optional) – If true, then the chosen subset of the data will be shuffled. Defaults to False.

Returns

The dataset, the labels of elements, the names of categories and the name of the dataset.

Return type

tuple

dpemu.dataset_utils.load_digits_(n_data=1797)[source]

Fetches the digits dataset and returns its desired subset.

Parameters

n_data (int, optional) – The size of the wanted subset. Defaults to 1797.

Returns

The dataset, the labels of data points, the names of categories and the name of the dataset.

Return type

tuple

dpemu.dataset_utils.load_fashion(n_data=70000)[source]

Fetches the fashion MNIST dataset and returns its desired subset.

Parameters

n_data (int, optional) – The size of the wanted subset. Defaults to 70000.

Returns

The dataset, the labels of elements, the names of categories and the name of the dataset.

Return type

tuple

dpemu.dataset_utils.load_mnist(reshape_to_28x28=False, integer_values=False)[source]

Fetches the MNIST dataset and returns it split into training and test sets.

Parameters
  • reshape_to_28x28 (bool, optional) – The data is reshaped to 28x28 images if true. Defaults to False.

  • integer_values (bool, optional) – The data is typecast to integers if true. Defaults to False.

Returns

Training pixel data, training labels, test pixel data, test labels.

Return type

tuple

dpemu.dataset_utils.load_mnist_unsplit(n_data=70000)[source]

Fetches the MNIST dataset and returns its subset.

Parameters

n_data (int, optional) – The size of the wanted subset. Defaults to 70000.

Returns

The dataset, the labels of data points, the names of categories and the name of the dataset.

Return type

tuple

dpemu.dataset_utils.load_newsgroups(subset='all', n_categories=20)[source]

Fetches the 20 newsgroups dataset and returns its desired subset.

Parameters
  • subset (str, optional) – If “test” then a smaller dataset is used instead of the full one. Defaults to “all”.

  • n_categories (int, optional) – The number of categories to be included. Defaults to 20.

Returns

The dataset, categories as integers, category names and the name of the dataset.

Return type

tuple

dpemu.ml_utils module

dpemu.ml_utils.load_yolov3()[source]

Loads the custom weights and cfg for the YOLOv3 model.

Returns

Paths to YOLOv3 weights and cfg file.

dpemu.ml_utils.reduce_dimensions(data, random_state, target_dim=2)[source]

Reduces the dimensionality of the data using UMAP for lower dimensions, PCA for higher dimensions and possibly even random projections if the number of dimensions exceeds the limit given by the Johnson–Lindenstrauss lemma. Works for NumPy arrays.

Parameters
  • data – The input data.

  • random_state – Random state to generate reproducible results.

  • target_dim – The targeted dimension.

Returns

Lower dimension representation of the data.
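
As a rough illustration of the Johnson–Lindenstrauss limit mentioned above, the sketch below evaluates one common form of the bound on the number of dimensions needed to preserve pairwise distances within a tolerance eps. The helper name jl_min_dim and the default eps are hypothetical; this reference does not state which tolerance or exact formula dpemu uses.

```python
import math


def jl_min_dim(n_samples, eps=0.1):
    """Smallest integer k with k >= 4 * ln(n) / (eps^2 / 2 - eps^3 / 3)."""
    # eps is an assumed distortion tolerance, not a documented dpemu setting.
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return math.ceil(4 * math.log(n_samples) / denominator)


# The bound depends only on the number of samples and eps,
# not on the original dimensionality of the data.
bound_small = jl_min_dim(1_000)
bound_large = jl_min_dim(10_000)
```

The bound grows only logarithmically with the number of samples, which is why random projections become attractive for very high-dimensional inputs.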

dpemu.ml_utils.reduce_dimensions_sparse(data, random_state, target_dim=2)[source]

Reduces the dimensionality of the data using UMAP for lower dimensions and TruncatedSVD for higher dimensions. Works for SciPy sparse matrices.

Parameters
  • data – The input data.

  • random_state – Random state to generate reproducible results.

  • target_dim – The targeted dimension.

Returns

Lower dimension representation of the data.

dpemu.ml_utils.run_ml_module_using_cli(cline, show_stdout=True)[source]

Runs an external ML model using its CLI.

Parameters
  • cline – Command line used to call the external ML model.

  • show_stdout – True to print the stdout of the external ML model.

Returns

A string containing the stdout of the external ML model.
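
A minimal sketch of the pattern described above, assuming the command line is given as a single string. The helper name run_cli is hypothetical and the real function may handle errors and stderr differently.

```python
import shlex
import subprocess
import sys


def run_cli(cline, show_stdout=True):
    """Run a command line, optionally echo its stdout, and return it."""
    # shlex.split avoids going through a shell (an assumption of this sketch).
    completed = subprocess.run(
        shlex.split(cline),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    if show_stdout:
        print(completed.stdout, end="")
    return completed.stdout


# A Python one-liner stands in for the external ML model here.
command = f'{shlex.quote(sys.executable)} -c "print(42)"'
output = run_cli(command, show_stdout=False)
```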

dpemu.pg_utils module

dpemu.pg_utils.first_dimension_length(array)[source]

Returns the length of the first dimension of the provided array or list.

Parameters

array (list or numpy.ndarray) – An array.

Returns

The length of the first dimension of the array.

Return type

int

dpemu.pg_utils.generate_random_dict_key(dct, prefix)[source]

Generates a random string that is not already in the dict.

Parameters
  • dct (dict) – A Python dictionary.

  • prefix (str) – A prefix for the random key.

Returns

A randomly generated key.

Return type

str
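
One way to get the behaviour described above is to draw random suffixes until the key is unused. This is a sketch only: the helper name, the alphabet and the suffix length are assumptions, not taken from dpemu.

```python
import random
import string


def random_dict_key(dct, prefix, length=8):
    """Return prefix plus random characters, retrying until the key is unused."""
    alphabet = string.ascii_lowercase + string.digits
    while True:
        key = prefix + "".join(random.choices(alphabet, k=length))
        if key not in dct:
            return key


params = {"mean": 0.0}
key = random_dict_key(params, "tmp_")
params[key] = 1.0  # safe: the key cannot collide with an existing entry
```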

dpemu.pg_utils.load_ocr_error_params(path_to_error_params)[source]

Loads error parameters from a JSON file.

Parameters

path_to_error_params (str) – A string containing the relative or absolute path to the file.

Returns

A Python dictionary.

Return type

dict

dpemu.pg_utils.normalize_ocr_error_params(params)[source]

Normalises numerical weights associated with a character’s OCR-error likelihoods.

For every character found in the dict, the value associated with it is a list containing numerical weights. These weights are normalised so that they sum to 1, and can thus be used as probabilities. Every probability is then attached to the event of a character changing to another character specified in the .json file which was loaded using the load_ocr_error_params function.

Parameters

params (dict) – A dict containing character-list pairs.

Returns

A dict containing normalised probabilities for every character.

Return type

dict
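
The normalisation can be sketched as follows. The layout assumed here, where each character maps to a pair of replacement characters and weights, is an assumption made for illustration; the function name is hypothetical as well.

```python
import random


def normalize_ocr_params_sketch(params):
    """Turn each character's weights into probabilities that sum to 1."""
    normalized = {}
    for char, (replacements, weights) in params.items():
        total = sum(weights)
        normalized[char] = [replacements, [w / total for w in weights]]
    return normalized


# Assumed layout: character -> [replacement characters, weights].
raw_params = {"a": [["a", "o", "e"], [8, 1, 1]], "l": [["l", "1"], [3, 1]]}
probabilities = normalize_ocr_params_sketch(raw_params)

# Draw a replacement for "a" according to its probabilities.
rng = random.Random(0)
replacements, probs = probabilities["a"]
sampled = rng.choices(replacements, weights=probs)[0]
```

Once normalised, the weights can be passed directly to any sampler that expects probabilities.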

dpemu.pg_utils.normalize_weights(weights)[source]

Normalises a list of numerical values (weights) into probabilities.

Each weight in the list is converted to a probability equal to its value divided by the sum of all values.

Parameters

weights (list) – A list of numerical values.

Returns

A list containing values which sum to 1.

Return type

list

dpemu.pg_utils.to_time_series_x_y(data, x_length)[source]

Converts time series data to pairs of x, y where x is a vector of x_length consecutive observations and y is the observation immediately following x.

Parameters
  • data ([type]) – The data used.

  • x_length (int) – Length of the x vector.

Returns

The x, y pair.
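
The windowing described above can be sketched with plain lists. The helper name is hypothetical, and the real function presumably works on arrays rather than lists.

```python
def to_x_y_sketch(data, x_length):
    """Build (window, next observation) pairs from a sequence."""
    n_pairs = len(data) - x_length
    xs = [data[i:i + x_length] for i in range(n_pairs)]
    ys = [data[i + x_length] for i in range(n_pairs)]
    return xs, ys


xs, ys = to_x_y_sketch([1, 2, 3, 4, 5], x_length=3)
# xs == [[1, 2, 3], [2, 3, 4]] and ys == [4, 5]
```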

dpemu.plotting_utils module

dpemu.plotting_utils.get_lims(data)[source]

Returns the limits of the plot.

Parameters

data (list) – A list of 2-dimensional data points.

Returns

minimum x, maximum x, minimum y, maximum y.

Return type

float, float, float, float
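
A minimal sketch of computing such limits, assuming each data point is an (x, y) pair. The helper name is hypothetical, and the real function may add margins around the extremes.

```python
def limits_sketch(data):
    """Return (min x, max x, min y, max y) for 2-dimensional points."""
    xs = [point[0] for point in data]
    ys = [point[1] for point in data]
    return min(xs), max(xs), min(ys), max(ys)


points = [(0.5, 2.0), (-1.0, 3.5), (2.0, -0.5)]
x_min, x_max, y_min, y_max = limits_sketch(points)
```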

dpemu.plotting_utils.get_n_rows_cols(n_plots, max_n_cols)[source]

dpemu.plotting_utils.print_results_by_model(df, dropped_columns=[])[source]

Prints the dataframe row by row excluding the unwanted columns.

Parameters
  • df (pandas.DataFrame) – The dataframe returned by the runner.

  • dropped_columns (list, optional) – List of the column names we do not want to be printed. Defaults to [].

dpemu.plotting_utils.visualize_best_model_params(df, model_name, model_params, score_names, is_higher_score_better, err_param_name, title, x_log=False, y_log=False, max_n_cols=2)[source]

Plots the best model parameters for distinct error values.

Parameters
  • df (pandas.DataFrame) – The dataframe returned by the runner.

  • model_name (str) – The name of the model for which we want to plot the best parameters.

  • model_params (list) – A list of strings which are the names of the params of the model we want to plot.

  • score_names (list) – A list of strings which are the names of the scores for which we want to create a plot.

  • is_higher_score_better (list) – A list of booleans for each score type: True means that a higher score is better and False means a lower score is better.

  • err_param_name (str) – The error whose distinct values are going to be used on the x-axis.

  • title (str) – The title of the plot.

  • x_log (bool, optional) – A bool telling whether a logarithmic scale should be used on x-axis or not. Defaults to False.

  • y_log (bool, optional) – A bool telling whether a logarithmic scale should be used on y-axis or not. Defaults to False.

  • max_n_cols (int, optional) – The maximum number of columns in the plot grid. Defaults to 2.

dpemu.plotting_utils.visualize_classes(df, label_names, err_param_name, reduced_data_column, labels_column, cmap, title, max_n_cols=4)[source]

This function visualizes the classes as 2-dimensional plots for different error parameter values.

Parameters
  • df (pandas.DataFrame) – The dataframe returned by the runner.

  • label_names (list) – A list containing the names of the labels.

  • err_param_name (str) – The name of the error parameter whose different values are used for plots.

  • reduced_data_column (str) – The name of the column that contains the reduced data.

  • labels_column (str) – The name of the column that contains the labels for each element.

  • cmap (str) – The name of the color map used for coloring the plot.

  • title (str) – The title of the plot.

  • max_n_cols (int, optional) – The maximum number of columns in the plot grid. Defaults to 4.

dpemu.plotting_utils.visualize_confusion_matrices(df, label_names, score_name, is_higher_score_better, err_param_name, labels_col, predictions_col, on_click=None)[source]

Generates confusion matrices for each error parameter combination and model.

Parameters
  • df (pandas.DataFrame) – The dataframe returned by the runner.

  • label_names (list) – A list containing the names of the labels.

  • score_name (str) – The name of the score type used for filtering the best results.

  • is_higher_score_better (bool) – If true, then a higher value of score is better and vice versa.

  • err_param_name (str) – The name of the error parameter whose different values the matrices use.

  • labels_col (str) – The name of the column containing the real labels.

  • predictions_col (str) – The name of the column containing the predicted labels.

  • on_click (function, optional) – If this parameter is given, interactive mode is enabled and clicking an element causes the event listener to call this function. The function should take three parameters: an element, a real label and a predicted label. Defaults to None.

dpemu.plotting_utils.visualize_confusion_matrix(df_, cm, row, label_names, title, labels_column, predicted_labels_column, on_click=None)[source]

Creates a confusion matrix which can be made interactive if wanted.

Parameters
  • df_ (DataFrame) – The original dataframe returned by the runner.

  • cm (list) – An integer matrix describing the number of elements in each category of the confusion matrix.

  • row (int) – The row of the dataframe used for this matrix.

  • label_names (list) – A list of strings containing the names of the labels.

  • title (str) – The title of the confusion matrix visualization.

  • labels_column (str) – The name of the column containing the real labels.

  • predicted_labels_column (str) – The name of the column containing the predicted labels.

  • on_click (function, optional) – If this parameter is given, interactive mode is enabled and clicking an element causes the event listener to call this function. The function should take three parameters: an element, a real label and a predicted label. Defaults to None.

dpemu.plotting_utils.visualize_error_generator(root_node, view=True)[source]

Generates a directed graph describing the error generation tree and filters.

root_node.generate_error() needs to be called before calling this function, because otherwise Filters may have incorrect or missing parameter values in the graph.

Parameters
  • root_node (Node) – The root node of the error generation tree.

  • view (bool, optional) – If True, the error generation tree graph is displayed to the user in addition to being saved to a file. If False, it is only saved to a file in the DOT graph description language. Defaults to True.

Returns

File path to the saved DOT graph description file.

Return type

str

dpemu.plotting_utils.visualize_interactive_plot(df, err_param_name, data, scatter_cmap, reduced_data_column, on_click)[source]

Creates an interactive plot for each different value of the given error type.

The data points in the plots can be clicked to activate a given function.

Parameters
  • df (pandas.DataFrame) – The dataframe returned by the runner.

  • err_param_name (str) – The name of the error parameter by which the data is grouped.

  • data (obj) – The original data that was given to the runner module.

  • scatter_cmap (str) – The color map for the scatter plot.

  • reduced_data_column (str) – The name of the column containing the reduced data.

  • on_click (function) – A function used for interactive plotting. When a data point is clicked, the function is given the original and modified elements as its parameters.

dpemu.plotting_utils.visualize_scores(df, score_names, is_higher_score_better, err_param_name, title, x_log=False, y_log=False, max_n_cols=2)[source]

Plots the wanted scores for all distinct models that were used.

Parameters
  • df (pandas.DataFrame) – The dataframe returned by the runner.

  • score_names (list) – A list of strings which are the names of the scores for which we want to create a plot.

  • is_higher_score_better (list) – A list of booleans for each score type: True means that a higher score is better and False means a lower score is better.

  • err_param_name (str) – The error whose distinct values are going to be used on the x-axis.

  • title (str) – The title of the plot.

  • x_log (bool, optional) – A bool telling whether a logarithmic scale should be used on x-axis or not. Defaults to False.

  • y_log (bool, optional) – A bool telling whether a logarithmic scale should be used on y-axis or not. Defaults to False.

  • max_n_cols (int, optional) – The maximum number of columns in the plot grid. Defaults to 2.

dpemu.plotting_utils.visualize_time_series_prediction(df, data, score_name, is_higher_score_better, err_param_name, model_name, test_pred_column, err_train_column, title, max_n_cols=4)[source]

dpemu.radius_generators module

class dpemu.radius_generators.GaussianRadiusGenerator(mean, std)[source]

Bases: dpemu.radius_generators.RadiusGenerator

GaussianRadiusGenerator generates radii from a normal distribution with given parameters.

__init__(mean, std)[source]

Parameters
  • mean (float) – The mean of the normal distribution.

  • std (float) – The standard deviation of the normal distribution.

generate(random_state)[source]

Generates a single integer to be used as a radius in some of the filters.

Parameters

random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.

Returns

An integer describing the generated radius.

Return type

int

class dpemu.radius_generators.ProbabilityArrayRadiusGenerator(probability_array)[source]

Bases: dpemu.radius_generators.RadiusGenerator

ProbabilityArrayRadiusGenerator generates radii based on the probabilities in the array given as a parameter.

__init__(probability_array)[source]

Parameters

probability_array (list) – A list where the value of an element describes the probability of using its index as a radius.

generate(random_state)[source]

Generates a single integer to be used as a radius in some of the filters.

Parameters

random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.

Returns

An integer describing the generated radius.

Return type

int

class dpemu.radius_generators.RadiusGenerator[source]

Bases: abc.ABC

Radius generators are used by some filters for generating radii for their effects.

abstract generate(random_state)[source]

Generates a single integer to be used as a radius in some of the filters.

Parameters

random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.

Returns

An integer describing the generated radius.

Return type

int
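
The class hierarchy above can be sketched as follows. To stay self-contained, the sketch uses the standard library's random.Random as a stand-in for the NumPy random state. The class names are hypothetical, and rounding and clamping the Gaussian sample to a non-negative integer is an assumption, not documented behaviour.

```python
import random
from abc import ABC, abstractmethod


class RadiusGeneratorSketch(ABC):
    """Interface: produce one integer radius per call."""

    @abstractmethod
    def generate(self, random_state):
        """Return a single integer radius."""


class GaussianSketch(RadiusGeneratorSketch):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def generate(self, random_state):
        # Assumed conversion: round to the nearest integer, never below zero.
        return max(0, round(random_state.gauss(self.mean, self.std)))


class ProbabilityArraySketch(RadiusGeneratorSketch):
    def __init__(self, probability_array):
        self.probability_array = probability_array

    def generate(self, random_state):
        # The index i is chosen with probability probability_array[i].
        indices = list(range(len(self.probability_array)))
        return random_state.choices(indices, weights=self.probability_array)[0]


rng = random.Random(42)  # stand-in for mtrand.RandomState
gaussian_radius = GaussianSketch(5.0, 0.0).generate(rng)
array_radius = ProbabilityArraySketch([0, 0, 1]).generate(rng)
```

Passing the random state into generate, rather than storing it on the generator, keeps every draw tied to the caller's seed.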

dpemu.runner module

dpemu.runner.add_more_stuff_to_results(result, err_params, model_name, i_data, time_pre, time_err, use_i_mode)[source]

Adds metadata such as error parameters, model parameters and run times to a result dict.

Parameters
  • result – A result dict.

  • err_params – Error parameters.

  • model_name – Name of the model.

  • i_data – The interactive data.

  • time_pre – Time used in the preprocessing phase.

  • time_err – Time used in the error generation phase.

  • use_i_mode – True if interactive mode is used.

dpemu.runner.errorify_data(train_data, test_data, err_root_node, err_params)[source]

Applies the error to the data using the error source defined.

Parameters
  • train_data – The train data.

  • test_data – The test data.

  • err_root_node – Error root node.

  • err_params – Error parameters.

Returns

Erroneous data and time used in error generation.

dpemu.runner.get_df_columns_base(err_params_list, model_params_dict_list)[source]

Generates the base for a list of DataFrame column names.

Parameters
  • err_params_list – List of all error parameter combinations.

  • model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.

Returns

Base list for DataFrame column names.

dpemu.runner.get_model_name(model, use_clean_train_data, same_model_counter)[source]

Returns the name of the model class. If the name ends with the word Model, that suffix is removed. If clean train data is used, the word Clean is appended to the name. A number is added to the end to distinguish multiple instances of the same model.

Parameters
  • model – The ML model class used.

  • use_clean_train_data – True if clean train data is used.

  • same_model_counter – Counter used to distinguish multiple instances of the same model.

Returns

The model name.
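
The naming rule can be sketched as below. The exact output format, in particular the " #" separator before the counter, is an assumption of this sketch, as are the helper and class names.

```python
def model_name_sketch(model, use_clean_train_data, same_model_counter):
    """Derive a display name from a model class."""
    name = model.__name__
    if name.endswith("Model"):
        name = name[: -len("Model")]  # drop the trailing word "Model"
    if use_clean_train_data:
        name += "Clean"
    # Assumed format for the trailing counter.
    return f"{name} #{same_model_counter}"


class KMeansModel:  # hypothetical model class
    pass


plain_name = model_name_sketch(KMeansModel, False, 1)
clean_name = model_name_sketch(KMeansModel, True, 2)
```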

dpemu.runner.get_result_with_model_params(model, model_params, train_data, test_data, result_base)[source]

Gets the results from a model using specified model parameters.

Parameters
  • model – The ML model class used.

  • model_params – The model parameters used.

  • train_data – The train data.

  • test_data – The test data.

  • result_base – Base results from the preprocessor.

Returns

The results in a dict.

dpemu.runner.get_results_from_model(model, model_params_list, train_data, test_data, result_base)[source]

Gets all results from a model using different hyperparameter combinations.

Parameters
  • model – The ML model class used.

  • model_params_list – A list of different hyperparameter combinations for this model.

  • train_data – The train data.

  • test_data – The test data.

  • result_base – Base results from the preprocessor.

Returns

A list of result dicts from the model.

dpemu.runner.get_total_results_from_workers(pool_inputs, n_err_params, n_processes)[source]

Gathers the results from different workers to a list.

Parameters
  • pool_inputs – List of inputs for different workers.

  • n_err_params – Number of error parameter combinations.

  • n_processes – Max number of active subprocesses.

Returns

List of all result dicts from different workers.

dpemu.runner.order_df_columns(df, err_params_list, model_params_dict_list)[source]

Defines the final order for DataFrame column names.

Parameters
  • df – A DataFrame containing the results.

  • err_params_list – List of all error parameter combinations.

  • model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.

Returns

The reindexed DataFrame.

dpemu.runner.pickle_data(train_data, test_data)[source]

Saves the data to disk to be read by the workers.

Parameters
  • train_data – The train data.

  • test_data – The test data.

Returns

Paths to train and test data.

dpemu.runner.preproc_data(train_data, err_train_data, err_test_data, preproc, preproc_params)[source]

Preprocesses clean train data, errorified train data and errorified test data using the given preprocessor and parameters.

Parameters
  • train_data – The train data.

  • err_train_data – Errorified train data.

  • err_test_data – Errorified test data.

  • preproc – The preprocessor class.

  • preproc_params – The preprocessor parameters.

Returns

Preprocessed clean train data, preprocessed errorified test data when using clean train data, result dict base when using clean train data, preprocessed errorified train data, preprocessed errorified test data when using errorified train data, result dict base when using errorified train data, and the time used in preprocessing.

dpemu.runner.run(train_data, test_data, preproc, preproc_params, err_root_node, err_params_list, model_params_dict_list, n_processes=None, use_interactive_mode=False)[source]

The entry point of the runner system. It creates a pandas DataFrame from all of the results it gets from the different workers.

Parameters
  • train_data – The train data.

  • test_data – The test data.

  • preproc – The preprocessor class.

  • preproc_params – The preprocessor parameters.

  • err_root_node – Error root node.

  • err_params_list – List of all error parameter combinations.

  • model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.

  • n_processes – Max number of active subprocesses.

  • use_interactive_mode – True if interactive mode is used. The resulting DataFrame contains the errorified data.

Returns

A DataFrame containing the results.
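
The shape of the two list arguments can be sketched as follows. The dictionary keys "model" and "params_list", the parameter names and the dummy model class are assumptions made for illustration; only the general structure (a list of error parameter combinations, and a list of dicts pairing a model class with hyperparameter combinations) comes from the descriptions above.

```python
from itertools import product


class DummyModel:  # hypothetical stand-in for an ML model class
    pass


# One dict per error parameter combination.
err_params_list = [{"std": std} for std in (0.0, 0.5, 1.0)]

# One dict per model: the model class plus its hyperparameter combinations.
model_params_dict_list = [
    {"model": DummyModel, "params_list": [{"k": k} for k in (3, 5)]},
]

# Every (error parameters, model, hyperparameters) combination is one job.
jobs = [
    (err_params, entry["model"].__name__, model_params)
    for err_params, entry in product(err_params_list, model_params_dict_list)
    for model_params in entry["params_list"]
]
```

Enumerating the combinations this way gives a quick estimate of how much work a call will schedule before starting it.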

dpemu.runner.unpickle_data(path_to_train_data, path_to_test_data)[source]

Loads the data to memory in a subprocess.

Parameters
  • path_to_train_data – Path to the train data.

  • path_to_test_data – Path to the test data.

Returns

The train and test data.
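
The hand-off between pickle_data and unpickle_data can be sketched as a round trip through temporary files. The helper names and the use of tempfile are assumptions; where the real functions store their files is not stated here.

```python
import os
import pickle
import tempfile


def pickle_to_disk(train_data, test_data):
    """Write both datasets to temporary files and return their paths."""
    paths = []
    for data in (train_data, test_data):
        fd, path = tempfile.mkstemp(suffix=".pkl")
        with os.fdopen(fd, "wb") as file:
            pickle.dump(data, file)
        paths.append(path)
    return tuple(paths)


def unpickle_from_disk(path_to_train_data, path_to_test_data):
    """Read both datasets back, as a worker subprocess would."""
    loaded = []
    for path in (path_to_train_data, path_to_test_data):
        with open(path, "rb") as file:
            loaded.append(pickle.load(file))
    return tuple(loaded)


train_path, test_path = pickle_to_disk([1, 2, 3], [4, 5])
train, test = unpickle_from_disk(train_path, test_path)

# Clean up the temporary files once the workers are done.
os.remove(train_path)
os.remove(test_path)
```

Passing file paths instead of the data itself keeps the inputs sent to each worker small.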

dpemu.runner.worker(inputs)[source]

One of the workers in the multiprocessing pool. A subprocess is created for every error parameter combination. In every worker, data is first errorified, preprocessed and then run through the models.

Parameters

inputs – Tuple containing the worker inputs.

Returns

List of all result dicts from different models.

dpemu.utils module

dpemu.utils.filter_optimized_results(df, err_param_name, score_name, is_higher_score_better)[source]

Removes suboptimal rows from the dataframe, returning only the best ones.

Parameters
  • df (pandas.DataFrame) – A dataframe containing all the data returned by the runner.

  • err_param_name (str) – The name of the error parameter by which the data will be grouped.

  • score_name (str) – The name of the score type we want to optimize.

  • is_higher_score_better (bool) – If true, then only the highest results are returned. Otherwise the lowest results are returned.

Returns

A dataframe containing the optimized results.

Return type

pandas.DataFrame
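
The filtering idea can be sketched without pandas by letting a list of dicts stand in for the rows of the dataframe. The helper name, column names and values are made up for illustration, and tie-breaking in the real function may differ.

```python
def best_rows_sketch(rows, err_param_name, score_name, is_higher_score_better):
    """Keep, for each error parameter value, the row with the best score."""
    best = {}
    for row in rows:
        key = row[err_param_name]
        current = best.get(key)
        if current is None:
            best[key] = row
        elif is_higher_score_better and row[score_name] > current[score_name]:
            best[key] = row
        elif not is_higher_score_better and row[score_name] < current[score_name]:
            best[key] = row
    return list(best.values())


rows = [
    {"std": 0.0, "k": 3, "accuracy": 0.91},
    {"std": 0.0, "k": 5, "accuracy": 0.94},
    {"std": 1.0, "k": 3, "accuracy": 0.70},
    {"std": 1.0, "k": 5, "accuracy": 0.65},
]
best = best_rows_sketch(rows, "std", "accuracy", is_higher_score_better=True)
```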

dpemu.utils.generate_unique_path(folder_name, extension, prefix=None)[source]

Generates a unique path name with desired folder name, extension and prefix.

Parameters
  • folder_name (str) – The name of the folder which the file path should contain.

  • extension (str) – The file extension of the file.

  • prefix (str, optional) – The optional prefix of the filename. Defaults to None.

Returns

The path generated by the function.

Return type

str
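
One possible way to build such a path is sketched below. Using a UUID for uniqueness and joining the prefix with an underscore are assumptions of this sketch; the real function may use a different naming scheme and may resolve the folder relative to the project root.

```python
import uuid
from pathlib import Path


def unique_path_sketch(folder_name, extension, prefix=None):
    """Return folder_name/[prefix_]<random hex>.extension as a string."""
    stem = uuid.uuid4().hex  # assumed source of uniqueness
    if prefix is not None:
        stem = f"{prefix}_{stem}"
    return str(Path(folder_name) / f"{stem}.{extension}")


first = unique_path_sketch("tmp", "png", prefix="model")
second = unique_path_sketch("tmp", "png", prefix="model")
```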

dpemu.utils.get_data_dir()[source]

Returns a path to the directory where datasets are saved.

Returns

The path to the directory where datasets are saved.

Return type

pathlib.PosixPath

dpemu.utils.get_project_root()[source]

Returns a path to the root of the project.

Returns

The path to the root of the project.

Return type

pathlib.PosixPath

dpemu.utils.split_df_by_model(df)[source]

Splits a dataframe such that each model gets its own dataframe.

Parameters

df (pandas.DataFrame) – A dataframe containing all the data returned by the runner.

Returns

A list of dataframes.

Return type

list
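
The split can be sketched with a list of dicts standing in for the rows of the dataframe. The column name "model_name" is an assumption, as is the helper name.

```python
def split_rows_by_model(rows, model_column="model_name"):
    """Group rows by model, preserving first-seen order of the models."""
    groups = {}
    for row in rows:
        groups.setdefault(row[model_column], []).append(row)
    return list(groups.values())


rows = [
    {"model_name": "KMeans #1", "std": 0.0, "score": 0.9},
    {"model_name": "KMeansClean #1", "std": 0.0, "score": 0.8},
    {"model_name": "KMeans #1", "std": 1.0, "score": 0.6},
]
per_model = split_rows_by_model(rows)
```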

Module contents