dpemu package¶
Subpackages¶
Submodules¶
dpemu.dataset_utils module¶
-
dpemu.dataset_utils.
load_coco_val_2017
(n=5000, is_shuffled=False)[source]¶ Fetches the COCO dataset and returns its desired subset.
- Parameters
n (int, optional) – The size of the wanted subset. Defaults to 5000.
is_shuffled (bool, optional) – If true, then the chosen subset of the data will be shuffled. Defaults to False.
- Returns
The dataset, the labels of elements, the names of categories and the name of the dataset.
- Return type
tuple
-
dpemu.dataset_utils.
load_digits_
(n_data=1797)[source]¶ Fetches the digits dataset and returns its desired subset.
- Parameters
n_data (int, optional) – The size of the wanted subset. Defaults to 1797.
- Returns
The dataset, the labels of data points, the names of categories and the name of the dataset.
- Return type
tuple
-
dpemu.dataset_utils.
load_fashion
(n_data=70000)[source]¶ Fetches the fashion MNIST dataset and returns its desired subset.
- Parameters
n_data (int, optional) – The size of the wanted subset. Defaults to 70000.
- Returns
The dataset, the labels of elements, the names of categories and the name of the dataset.
- Return type
tuple
-
dpemu.dataset_utils.
load_mnist
(reshape_to_28x28=False, integer_values=False)[source]¶ Fetches the MNIST dataset and returns its desired subset.
- Parameters
reshape_to_28x28 (bool, optional) – The data is reshaped to 28x28 images if true. Defaults to False.
integer_values (bool, optional) – The data is typecast to integers if true. Defaults to False.
- Returns
Training pixel data, training labels, test pixel data, test labels.
- Return type
tuple
-
dpemu.dataset_utils.
load_mnist_unsplit
(n_data=70000)[source]¶ Fetches the MNIST dataset and returns its subset.
- Parameters
n_data (int, optional) – The size of the wanted subset. Defaults to 70000.
- Returns
The dataset, the labels of data points, the names of categories and the name of the dataset.
- Return type
tuple
-
dpemu.dataset_utils.
load_newsgroups
(subset='all', n_categories=20)[source]¶ Fetches the 20 newsgroups dataset and returns its desired subset.
- Parameters
subset (str, optional) – If “test” then a smaller dataset is used instead of the full one. Defaults to “all”.
n_categories (int, optional) – The number of categories to be included. Defaults to 20.
- Returns
The dataset, categories as integers, category names and the name of the dataset.
- Return type
tuple
dpemu.ml_utils module¶
-
dpemu.ml_utils.
load_yolov3
()[source]¶ Loads the custom weights and cfg for the YOLOv3 model.
- Returns
Paths to YOLOv3 weights and cfg file.
-
dpemu.ml_utils.
reduce_dimensions
(data, random_state, target_dim=2)[source]¶ Reduces the dimensionality of the data. UMAP is used for lower dimensions, PCA for higher dimensions, and possibly random projections if the number of dimensions exceeds the limit given by the Johnson–Lindenstrauss lemma. Works for NumPy arrays.
- Parameters
data – The input data.
random_state – Random state to generate reproducible results.
target_dim – The targeted dimension.
- Returns
Lower dimension representation of the data.
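To give a feel for the Johnson–Lindenstrauss limit mentioned in the description, the sketch below computes a commonly used form of the bound: the number of dimensions a random projection needs to preserve pairwise distances within a relative distortion eps. The helper name jl_min_dim and the eps values are assumptions made for illustration; the documentation does not state which bound or tolerance reduce_dimensions actually applies.

```python
import math


def jl_min_dim(n_samples, eps=0.1):
    """Target dimension suggested by the Johnson-Lindenstrauss bound.

    Hypothetical helper for illustration only; it is not part of dpemu.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    # Standard form of the bound: 4 * ln(n) / (eps^2 / 2 - eps^3 / 3).
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * math.log(n_samples) / denominator)


# The bound grows only logarithmically with the number of samples.
small = jl_min_dim(1_000, eps=0.5)
large = jl_min_dim(1_000_000, eps=0.5)
```

The bound depends on the number of samples and the tolerance, not on the original dimensionality, which is why random projections only pay off for very high-dimensional inputs.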
-
dpemu.ml_utils.
reduce_dimensions_sparse
(data, random_state, target_dim=2)[source]¶ Reduces the dimensionality of the data using UMAP for lower dimensions and TruncatedSVD for higher dimensions. Works for SciPy sparse matrices.
- Parameters
data – The input data.
random_state – Random state to generate reproducible results.
target_dim – The targeted dimension.
- Returns
Lower dimension representation of the data.
-
dpemu.ml_utils.
run_ml_module_using_cli
(cline, show_stdout=True)[source]¶ Runs an external ML model using its CLI.
- Parameters
cline – Command line used to call the external ML model.
show_stdout – True to print the stdout of the external ML model.
- Returns
A string containing the stdout of the external ML model.
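The behaviour described above can be approximated with the standard library. This is a minimal sketch, not the dpemu implementation: the name run_cli is hypothetical, and capturing all output before printing it (rather than streaming it) is an assumption.

```python
import shlex
import subprocess
import sys


def run_cli(cline, show_stdout=True):
    """Run a command line and return everything it wrote to stdout."""
    completed = subprocess.run(
        shlex.split(cline),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    if show_stdout:
        print(completed.stdout, end="")
    return completed.stdout


# A harmless stand-in for an external ML model: a one-line Python script.
cline = f"{shlex.quote(sys.executable)} -c {shlex.quote('print(42)')}"
output = run_cli(cline, show_stdout=False)
```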
dpemu.pg_utils module¶
-
dpemu.pg_utils.
first_dimension_length
(array)[source]¶ Returns the length of the first dimension of the provided array or list.
- Parameters
array (list or numpy.ndarray) – An array.
- Returns
The length of the first dimension of the array.
- Return type
int
-
dpemu.pg_utils.
generate_random_dict_key
(dct, prefix)[source]¶ Generates a random string that is not already in the dict.
- Parameters
dct (dict) – A Python dictionary.
prefix (str) – A prefix for the random key.
- Returns
A randomly generated key.
- Return type
str
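One simple way to obtain such a key is to keep drawing random suffixes until the candidate is not in the dict. The sketch below assumes a fixed suffix length and lowercase letters; the name random_dict_key, the length parameter and the rng parameter are hypothetical and not taken from dpemu.

```python
import random
import string


def random_dict_key(dct, prefix, length=8, rng=None):
    """Return prefix + random suffix such that the result is not a key of dct."""
    rng = rng or random.Random()
    while True:
        suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        key = prefix + suffix
        if key not in dct:
            return key


existing = {"tmp_a": 1}
new_key = random_dict_key(existing, "tmp_")
```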
-
dpemu.pg_utils.
load_ocr_error_params
(path_to_error_params)[source]¶ Loads error parameters from a JSON file.
- Parameters
path_to_error_params (str) – A string containing the relative or absolute path to the file.
- Returns
A Python dictionary.
- Return type
dict
-
dpemu.pg_utils.
normalize_ocr_error_params
(params)[source]¶ Normalises numerical weights associated with a character’s OCR-error likelihoods.
For every character found in the dict, the value associated with it is a list containing numerical weights. These weights are normalised so that they sum to 1, and can thus be used as probabilities. Every probability is then attached to the event of a character changing to another character specified in the .json file which was loaded using the load_ocr_error_params function.
- Parameters
params (dict) – A dict containing character-list pairs.
- Returns
A dict containing normalised probabilities for every character.
- Return type
dict
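The normalisation can be sketched as follows. The exact layout of the values is an assumption here: each character is taken to map to a pair of (replacement characters, weights). The name normalize_ocr_params_sketch is hypothetical.

```python
def normalize_ocr_params_sketch(params):
    """Turn per-character replacement weights into probabilities."""
    normalized = {}
    for char, (replacements, weights) in params.items():
        total = sum(weights)
        normalized[char] = (replacements, [w / total for w in weights])
    return normalized


# Assumed layout: character -> (possible replacements, raw weights).
params = {"a": (["a", "o", "e"], [8, 1, 1]), "l": (["l", "1"], [3, 1])}
probs = normalize_ocr_params_sketch(params)
```

With these inputs the character "a" stays unchanged with probability 0.8 and turns into "o" or "e" with probability 0.1 each.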
-
dpemu.pg_utils.
normalize_weights
(weights)[source]¶ Normalises a list of numerical values (weights) into probabilities.
Every weight in the list is converted to a probability equal to its value divided by the sum of all values.
- Parameters
weights (list) – A list of numerical values.
- Returns
A list containing values which sum to 1.
- Return type
list
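A minimal sketch of this normalisation is shown below. The name normalize_weights_sketch is hypothetical, and raising an error for a zero sum is an assumption; the documentation does not say how that case is handled.

```python
def normalize_weights_sketch(weights):
    """Divide every weight by the total so that the results sum to 1."""
    total = sum(weights)
    if total == 0:
        # Assumption: a zero total cannot be turned into probabilities.
        raise ValueError("weights must not sum to zero")
    return [w / total for w in weights]


probabilities = normalize_weights_sketch([1, 1, 2])
```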
-
dpemu.pg_utils.
to_time_series_x_y
(data, x_length)[source]¶ Converts time series data to pairs of x, y where x is a vector of x_length consecutive observations and y is the observation immediately following x.
- Parameters
data – The time series data.
x_length (int) – Length of the x vector.
- Returns
The x, y pair.
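The windowing can be illustrated with plain lists. This sketch assumes the function returns all windows together with all targets; the name to_xy_windows is hypothetical, and the real function may operate on arrays rather than lists.

```python
def to_xy_windows(data, x_length):
    """Split a sequence into sliding windows and the value following each window."""
    xs = [data[i - x_length:i] for i in range(x_length, len(data))]
    ys = [data[i] for i in range(x_length, len(data))]
    return xs, ys


xs, ys = to_xy_windows([1, 2, 3, 4, 5], 2)
# xs holds [1, 2], [2, 3], [3, 4]; ys holds the values 3, 4, 5 that follow them.
```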
dpemu.plotting_utils module¶
-
dpemu.plotting_utils.
get_lims
(data)[source]¶ Returns the limits of the plot.
- Parameters
data (list) – A list of 2-dimensional data points.
- Returns
minimum x, maximum x, minimum y, maximum y.
- Return type
float, float, float, float
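For a list of 2-dimensional points the limits are simply the extremes of each coordinate. The sketch below uses plain Python and a hypothetical name, get_lims_sketch; the library function may rely on NumPy instead.

```python
def get_lims_sketch(points):
    """Return (min x, max x, min y, max y) for a list of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), max(xs), min(ys), max(ys)


lims = get_lims_sketch([(0.0, 2.0), (3.5, -1.0), (1.0, 4.0)])
```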
-
dpemu.plotting_utils.
print_results_by_model
(df, dropped_columns=[])[source]¶ Prints the dataframe row by row excluding the unwanted columns.
- Parameters
df (pandas.DataFrame) – The dataframe returned by the runner.
dropped_columns (list, optional) – List of the column names we do not want to be printed. Defaults to [].
-
dpemu.plotting_utils.
visualize_best_model_params
(df, model_name, model_params, score_names, is_higher_score_better, err_param_name, title, x_log=False, y_log=False, max_n_cols=2)[source]¶ Plots the best model parameters for distinct error values.
- Parameters
df (pandas.DataFrame) – The dataframe returned by the runner.
model_name (str) – The name of the model for which we want to plot the best parameters.
model_params (list) – A list of strings which are the names of the params of the model we want to plot.
score_names (list) – A list of strings which are the names of the scores for which we want to create a plot.
is_higher_score_better (list) – A list of booleans for each score type: True means that a higher score is better and False means a lower score is better.
err_param_name (str) – The error whose distinct values are going to be used on the x-axis.
title (str) – The title of the plot.
x_log (bool, optional) – A bool telling whether a logarithmic scale should be used on x-axis or not. Defaults to False.
y_log (bool, optional) – A bool telling whether a logarithmic scale should be used on y-axis or not. Defaults to False.
max_n_cols (int, optional) – The maximum number of columns in the grid of subplots. Defaults to 2.
-
dpemu.plotting_utils.
visualize_classes
(df, label_names, err_param_name, reduced_data_column, labels_column, cmap, title, max_n_cols=4)[source]¶ This function visualizes the classes as 2-dimensional plots for different error parameter values.
- Parameters
df (pandas.DataFrame) – The dataframe returned by the runner.
label_names (list) – A list containing the names of the labels.
err_param_name (str) – The name of the error parameter whose different values are used for plots.
reduced_data_column (str) – The name of the column that contains the reduced data.
labels_column (str) – The name of the column that contains the labels for each element.
cmap (str) – The name of the color map used for coloring the plot.
title (str) – The title of the plot.
max_n_cols (int, optional) – The maximum number of columns in the grid of subplots. Defaults to 4.
-
dpemu.plotting_utils.
visualize_confusion_matrices
(df, label_names, score_name, is_higher_score_better, err_param_name, labels_col, predictions_col, on_click=None)[source]¶ Generates confusion matrices for each error parameter combination and model.
- Parameters
df (pandas.DataFrame) – The dataframe returned by the runner.
label_names (list) – A list containing the names of the labels.
score_name (str) – The name of the score type used for filtering the best results.
is_higher_score_better (bool) – If true, then a higher value of score is better and vice versa.
err_param_name (str) – The name of the error parameter whose different values the matrices use.
labels_col (str) – The name of the column containing the real labels.
predictions_col (str) – The name of the column containing the predicted labels.
on_click (function, optional) – If this parameter is passed to the function, interactive mode is turned on, and clicking an element causes the event listener to call this function. The function should take three parameters: an element, a real label and a predicted label. Defaults to None.
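A callback with the three-parameter shape described for on_click could look like the sketch below. The body is purely illustrative: what a real callback does with the element (for example, plotting it) is up to the caller, and the simulated call at the end stands in for the event listener.

```python
clicked = []


def on_click(element, real_label, predicted_label):
    """Record which cell of the confusion matrix the clicked element belongs to."""
    clicked.append((real_label, predicted_label))
    print(f"real={real_label}, predicted={predicted_label}, element={element!r}")


# Simulate the call the event listener would make for one clicked element.
on_click([0.0, 1.0], "cat", "dog")
```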
-
dpemu.plotting_utils.
visualize_confusion_matrix
(df_, cm, row, label_names, title, labels_column, predicted_labels_column, on_click=None)[source]¶ Creates a confusion matrix which can be made interactive if wanted.
- Parameters
df_ (DataFrame) – The original dataframe returned by the runner.
cm (list) – An integer matrix describing the number of elements in each category of the confusion matrix.
row (int) – The row of the dataframe used for this matrix.
label_names (list) – A list of strings containing the names of the labels.
title (str) – The title of the confusion matrix visualization.
labels_column (str) – The name of the column containing the real labels.
predicted_labels_column (str) – The name of the column containing the predicted labels.
on_click (function, optional) – If this parameter is passed to the function, interactive mode is turned on, and clicking an element causes the event listener to call this function. The function should take three parameters: an element, a real label and a predicted label. Defaults to None.
-
dpemu.plotting_utils.
visualize_error_generator
(root_node, view=True)[source]¶ Generates a directed graph describing the error generation tree and filters.
root_node.generate_error() needs to be called before calling this function, because otherwise Filters may have incorrect or missing parameter values in the graph.
- Parameters
root_node (Node) – The root node of the error generation tree.
view (bool, optional) – If True, the error generation tree graph is displayed to the user in addition to being saved to a file. If False, it is only saved to a file in the DOT graph description language. Defaults to True.
- Returns
File path to the saved DOT graph description file.
- Return type
str
-
dpemu.plotting_utils.
visualize_interactive_plot
(df, err_param_name, data, scatter_cmap, reduced_data_column, on_click)[source]¶ Creates an interactive plot for each different value of the given error type.
The data points in the plots can be clicked to activate a given function.
- Parameters
df (pandas.DataFrame) – The dataframe returned by the runner.
err_param_name (str) – The name of the error parameter by which the data is grouped.
data (obj) – The original data that was given to the runner module.
scatter_cmap (str) – The color map for the scatter plot.
reduced_data_column (str) – The name of the column containing the reduced data.
on_click (function) – A function used for interactive plotting. When a data point is clicked, the function is given the original and modified elements as its parameters.
-
dpemu.plotting_utils.
visualize_scores
(df, score_names, is_higher_score_better, err_param_name, title, x_log=False, y_log=False, max_n_cols=2)[source]¶ Plots the wanted scores for all distinct models that were used.
- Parameters
df (pandas.DataFrame) – The dataframe returned by the runner.
score_names (list) – A list of strings which are the names of the scores for which we want to create a plot.
is_higher_score_better (list) – A list of booleans for each score type: True means that a higher score is better and False means a lower score is better.
err_param_name (str) – The error whose distinct values are going to be used on the x-axis.
title (str) – The title of the plot.
x_log (bool, optional) – A bool telling whether a logarithmic scale should be used on x-axis or not. Defaults to False.
y_log (bool, optional) – A bool telling whether a logarithmic scale should be used on y-axis or not. Defaults to False.
max_n_cols (int, optional) – The maximum number of columns in the grid of subplots. Defaults to 2.
dpemu.radius_generators module¶
-
class
dpemu.radius_generators.
GaussianRadiusGenerator
(mean, std)[source]¶ Bases:
dpemu.radius_generators.RadiusGenerator
GaussianRadiusGenerator generates radii from a normal distribution with given parameters.
-
__init__
(mean, std)[source]¶ - Parameters
mean (float) – The mean of the normal distribution.
std (float) – The standard deviation of the normal distribution.
-
generate
(random_state)[source]¶ Generates a single integer to be used as a radius in some of the filters.
- Parameters
random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.
- Returns
An integer describing the generated radius.
- Return type
int
-
-
class
dpemu.radius_generators.
ProbabilityArrayRadiusGenerator
(probability_array)[source]¶ Bases:
dpemu.radius_generators.RadiusGenerator
ProbabilityArrayRadiusGenerator generates radii based on the probabilities in the array given as a parameter.
-
__init__
(probability_array)[source]¶ - Parameters
probability_array (list) – A list where the value of an element describes the probability of using its index as a radius.
-
generate
(random_state)[source]¶ Generates a single integer to be used as a radius in some of the filters.
- Parameters
random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.
- Returns
An integer describing the generated radius.
- Return type
int
-
-
class
dpemu.radius_generators.
RadiusGenerator
[source]¶ Bases:
abc.ABC
Radius generators are used by some filters for generating radii for their effects.
-
abstract
generate
(random_state)[source]¶ Generates a single integer to be used as a radius in some of the filters.
- Parameters
random_state (mtrand.RandomState) – A random state object used for everything related to randomness, to ensure repeatability.
- Returns
An integer describing the generated radius.
- Return type
int
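The relationship between the abstract base class and its two implementations can be sketched with the standard library. Everything here is a stand-in: the class names carry a Sketch suffix to mark them as hypothetical, random.Random replaces the mtrand.RandomState object the real classes expect, and rounding and clamping the Gaussian sample are assumptions.

```python
import random
from abc import ABC, abstractmethod


class RadiusGeneratorSketch(ABC):
    """Base class: subclasses decide how a radius is drawn."""

    @abstractmethod
    def generate(self, rng):
        """Return a single integer radius."""


class GaussianRadiusSketch(RadiusGeneratorSketch):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def generate(self, rng):
        # Assumption: the normal sample is rounded and clamped at zero.
        return max(0, round(rng.gauss(self.mean, self.std)))


class ProbabilityArrayRadiusSketch(RadiusGeneratorSketch):
    def __init__(self, probability_array):
        self.probability_array = probability_array

    def generate(self, rng):
        # Index i is chosen with probability probability_array[i].
        indices = range(len(self.probability_array))
        return rng.choices(indices, weights=self.probability_array, k=1)[0]


rng = random.Random(42)  # stand-in for a seeded random state
fixed_radius = ProbabilityArrayRadiusSketch([0.0, 0.0, 1.0]).generate(rng)
gaussian_radius = GaussianRadiusSketch(5, 1).generate(rng)
```

Passing the random state into generate, rather than storing it in the generator, lets the caller control reproducibility from one place.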
dpemu.runner module¶
-
dpemu.runner.
add_more_stuff_to_results
(result, err_params, model_name, i_data, time_pre, time_err, use_i_mode)[source]¶ Adds metadata such as error parameters, model parameters and run times to a result dict.
- Parameters
result – A result dict.
err_params – Error parameters.
model_name – Name of the model.
i_data – The interactive data.
time_pre – Time used in the preprocessing phase.
time_err – Time used in the error generation phase.
use_i_mode – True if interactive mode is used.
-
dpemu.runner.
errorify_data
(train_data, test_data, err_root_node, err_params)[source]¶ Applies the error to the data using the error source defined.
- Parameters
train_data – The train data.
test_data – The test data.
err_root_node – Error root node.
err_params – Error parameters.
- Returns
Erroneous data and time used in error generation.
-
dpemu.runner.
get_df_columns_base
(err_params_list, model_params_dict_list)[source]¶ Generates the base for a list of DataFrame column names.
- Parameters
err_params_list – List of all error parameter combinations.
model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.
- Returns
Base list for DataFrame column names.
-
dpemu.runner.
get_model_name
(model, use_clean_train_data, same_model_counter)[source]¶ Returns the name of the model class. If the name ends with the word Model, that suffix is removed. If clean train data is used, the word Clean is added to the name. A number is appended to distinguish multiple instances of the same model.
- Parameters
model – The ML model class used.
use_clean_train_data – True if clean train data is used.
same_model_counter – Counter used to separate same models.
- Returns
The model name.
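The naming rules can be sketched as below. The separator and exact layout of the resulting string are assumptions, since the documentation only lists which parts are added; model_name_sketch and KMeansModel are hypothetical names.

```python
def model_name_sketch(model, use_clean_train_data, same_model_counter):
    """Build a display name from a model class following the rules above."""
    name = model.__name__
    if name.endswith("Model"):
        name = name[: -len("Model")]
    if use_clean_train_data:
        name += "Clean"
    # Assumption: the counter is appended after a " #" separator.
    return f"{name} #{same_model_counter}"


class KMeansModel:
    """Hypothetical model class used only to demonstrate the naming."""


clean_name = model_name_sketch(KMeansModel, True, 1)
plain_name = model_name_sketch(KMeansModel, False, 2)
```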
-
dpemu.runner.
get_result_with_model_params
(model, model_params, train_data, test_data, result_base)[source]¶ Gets the results from a model using specified model parameters.
- Parameters
model – The ML model class used.
model_params – The model parameters used.
train_data – The train data.
test_data – The test data.
result_base – Base results from the preprocessor.
- Returns
The results in a dict.
-
dpemu.runner.
get_results_from_model
(model, model_params_list, train_data, test_data, result_base)[source]¶ Gets all results from a model using different hyperparameter combinations.
- Parameters
model – The ML model class used.
model_params_list – A list of different hyperparameter combinations for this model.
train_data – The train data.
test_data – The test data.
result_base – Base results from the preprocessor.
- Returns
A list of result dicts from the model.
-
dpemu.runner.
get_total_results_from_workers
(pool_inputs, n_err_params, n_processes)[source]¶ Gathers the results from different workers to a list.
- Parameters
pool_inputs – List of inputs for different workers.
n_err_params – Number of error parameter combinations.
n_processes – Max number of active subprocesses.
- Returns
List of all result dicts from different workers.
-
dpemu.runner.
order_df_columns
(df, err_params_list, model_params_dict_list)[source]¶ Defines the final order for DataFrame column names.
- Parameters
df – A DataFrame containing the results.
err_params_list – List of all error parameter combinations.
model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.
- Returns
The reindexed DataFrame.
-
dpemu.runner.
pickle_data
(train_data, test_data)[source]¶ Saves the data to disk to be read by the workers.
- Parameters
train_data – The train data.
test_data – The test data.
- Returns
Paths to train and test data.
-
dpemu.runner.
preproc_data
(train_data, err_train_data, err_test_data, preproc, preproc_params)[source]¶ Preprocesses clean train data, errorified train data and errorified test data using the given preprocessor and parameters.
- Parameters
train_data – The train data.
err_train_data – Errorified train data.
err_test_data – Errorified test data.
preproc – The preprocessor class.
preproc_params – The preprocessor parameters.
- Returns
Preprocessed clean train data, preprocessed errorified test data (using clean train data), the result dict base when using clean train data, preprocessed errorified train data, preprocessed errorified test data (using errorified train data), the result dict base when using errorified train data, and the time used in preprocessing.
-
dpemu.runner.
run
(train_data, test_data, preproc, preproc_params, err_root_node, err_params_list, model_params_dict_list, n_processes=None, use_interactive_mode=False)[source]¶ The runner system is invoked through the run function. It builds a pandas DataFrame from all the results returned by the workers.
- Parameters
train_data – The train data.
test_data – The test data.
preproc – The preprocessor class.
preproc_params – The preprocessor parameters.
err_root_node – Error root node.
err_params_list – List of all error parameter combinations.
model_params_dict_list – List of dicts where each dict includes the class of the model and a list of different hyperparameter combinations.
n_processes – Max number of active subprocesses.
use_interactive_mode – True if interactive mode is used. The resulting DataFrame contains the errorified data.
- Returns
A DataFrame containing the results.
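The two list arguments are easiest to understand from their shape. The sketch below only builds such lists and counts the combinations they describe; it does not call the runner. The dict keys "model" and "params_list", the parameter names "std" and "n_clusters", and the DummyModel class are all assumptions made for illustration.

```python
class DummyModel:
    """Hypothetical stand-in for an ML model class."""


# One dict per error parameter combination.
err_params_list = [{"std": std} for std in (0.0, 0.5, 1.0)]

# One dict per model: the model class plus its hyperparameter combinations.
model_params_dict_list = [
    {"model": DummyModel, "params_list": [{"n_clusters": 3}, {"n_clusters": 5}]},
]

# Every error combination is paired with every hyperparameter combination.
n_hyperparam_combinations = sum(len(d["params_list"]) for d in model_params_dict_list)
n_combinations = len(err_params_list) * n_hyperparam_combinations
```

Under these assumptions the run would cover six combinations: three error settings times two hyperparameter settings.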
-
dpemu.runner.
unpickle_data
(path_to_train_data, path_to_test_data)[source]¶ Loads the data to memory in a subprocess.
- Parameters
path_to_train_data – Path to the train data.
path_to_test_data – Path to the test data.
- Returns
The train and test data.
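The save-then-reload round trip that pickle_data and unpickle_data describe can be sketched with the standard library. The helper names and the use of temporary files are assumptions; the documentation does not say where dpemu writes the files.

```python
import os
import pickle
import tempfile


def pickle_to_temp(obj):
    """Write obj to a temporary file and return the file's path."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def unpickle_from(path):
    """Load a previously pickled object back into memory."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


train, test = [[1, 2], [3, 4]], [[5, 6]]
path_to_train, path_to_test = pickle_to_temp(train), pickle_to_temp(test)
loaded_train, loaded_test = unpickle_from(path_to_train), unpickle_from(path_to_test)

# Clean up the temporary files once every worker has read them.
os.remove(path_to_train)
os.remove(path_to_test)
```

Writing the data once and passing only paths to the workers avoids sending large datasets through the multiprocessing machinery for every subprocess.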
-
dpemu.runner.
worker
(inputs)[source]¶ One of the workers in the multiprocessing pool. A subprocess is created for every error parameter combination. In every worker, the data is first errorified, then preprocessed, and finally run through the models.
- Parameters
inputs – Tuple containing the worker inputs.
- Returns
List of all result dicts from different models.
dpemu.utils module¶
-
dpemu.utils.
filter_optimized_results
(df, err_param_name, score_name, is_higher_score_better)[source]¶ Removes suboptimal rows from the dataframe, returning only the best ones.
- Parameters
df (pandas.DataFrame) – A dataframe containing all the data returned by the runner.
err_param_name (str) – The name of the error parameter by which the data will be grouped.
score_name (str) – The name of the score type we want to optimize.
is_higher_score_better (bool) – If true, then only the highest results are returned. Otherwise the lowest results are returned.
- Returns
A dataframe containing the optimized results.
- Return type
pandas.DataFrame
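The filtering idea can be shown without pandas by treating each row as a dict. This is a sketch under the assumption that exactly one best row is kept per distinct error parameter value; the name filter_best_rows and the column names are hypothetical.

```python
def filter_best_rows(rows, err_param_name, score_name, is_higher_score_better):
    """Keep the best-scoring row for every distinct error parameter value."""
    best = {}
    for row in rows:
        key = row[err_param_name]
        current = best.get(key)
        if current is None:
            best[key] = row
        elif is_higher_score_better and row[score_name] > current[score_name]:
            best[key] = row
        elif not is_higher_score_better and row[score_name] < current[score_name]:
            best[key] = row
    return list(best.values())


rows = [
    {"std": 0.0, "k": 3, "accuracy": 0.91},
    {"std": 0.0, "k": 5, "accuracy": 0.95},
    {"std": 1.0, "k": 3, "accuracy": 0.72},
    {"std": 1.0, "k": 5, "accuracy": 0.64},
]
best_rows = filter_best_rows(rows, "std", "accuracy", True)
```

Here the row with k=5 wins for std=0.0 and the row with k=3 wins for std=1.0.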
-
dpemu.utils.
generate_unique_path
(folder_name, extension, prefix=None)[source]¶ Generates a unique path name with desired folder name, extension and prefix.
- Parameters
folder_name (str) – The name of the folder which the file path should contain.
extension (str) – The file extension of the file.
prefix (str, optional) – The optional prefix of the filename. Defaults to None.
- Returns
The path generated by the function.
- Return type
str
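One way to build such a path is to combine the folder, an optional prefix and a random identifier. The sketch below is an assumption throughout: it uses a UUID for uniqueness, expects the extension without a leading dot, and joins the prefix with an underscore. The name unique_path_sketch is hypothetical.

```python
import uuid
from pathlib import Path


def unique_path_sketch(folder_name, extension, prefix=None):
    """Return a path inside folder_name whose file name is unlikely to collide."""
    stem = uuid.uuid4().hex
    if prefix:
        stem = f"{prefix}_{stem}"
    return str(Path(folder_name) / f"{stem}.{extension}")


first = unique_path_sketch("out", "png", prefix="plot")
second = unique_path_sketch("out", "png", prefix="plot")
```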
-
dpemu.utils.
get_data_dir
()[source]¶ Returns a path to the directory where datasets are saved.
- Returns
The path to the directory where datasets are saved.
- Return type
pathlib.PosixPath