Classes and Functions#

This documentation describes the classes and functions of the Hygia library, with explanations and usage examples for each, so that both new and experienced users can apply them to data analysis and modeling tasks.

class AnnotateData#

A class for the data annotation phase. Given a set of thresholds (e.g., for count sequence squared vowels or count sequence squared consonants), it determines whether a text is a key smash (ksmash).

Examples

Use this class like this:

import hygia as hg

# df and concatened_column_name are assumed to be defined by earlier pre-processing steps
annotate_data = hg.AnnotateData()
key_smash_thresholds = {
'count_sequence_squared_vowels': 1.00,
'count_sequence_squared_consonants': 1.999,
'count_sequence_squared_special_characters': 2.2499,
'ratio_of_numeric_digits_squared': 2.9,
'average_of_char_count_squared': 2.78,
}

df = annotate_data.annotate_data(df, concatened_column_name, key_smash_thresholds)
print(df)

Public Functions

annotate_data(self, df, concatened_column_name, ks_thresholds)#

Annotate data function.

Parameters

df – (Type: DataFrame) Dataframe to extract features from.
concatened_column_name – (Type: List) List of columns to be used
ks_thresholds – (Type: Dict) Mapping from key smash metric name to its threshold value

Returns

(Type: DataFrame) The input dataframe with additional columns for key smashing and word embedding features.
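How thresholds turn metric values into a key smash decision can be sketched in plain Python. This is a minimal illustration and not Hygia's implementation: the helper `is_key_smash` is hypothetical, and it assumes a text is flagged when any metric meets or exceeds its threshold, which the documentation above does not confirm.

```python
# Hypothetical helper, not part of Hygia: flags a text as a key smash
# when ANY metric reaches its threshold (an assumed rule).
def is_key_smash(metrics: dict, thresholds: dict) -> bool:
    return any(
        metrics.get(name, 0.0) >= limit
        for name, limit in thresholds.items()
    )


thresholds = {
    "count_sequence_squared_consonants": 1.999,
    "average_of_char_count_squared": 2.78,
}

# Metric values for a key-smash-like text and for an ordinary address.
smash_like = {
    "count_sequence_squared_consonants": 2.18,
    "average_of_char_count_squared": 3.0,
}
ordinary = {
    "count_sequence_squared_consonants": 0.0,
    "average_of_char_count_squared": 1.12,
}

print(is_key_smash(smash_like, thresholds))  # True
print(is_key_smash(ordinary, thresholds))    # False
```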

class AugmentData#

This class presents validations based on zipcode data from the Listen Data website.

We obtained data from several continents and filtered it using the ‘country’ code. To avoid overwhelming the Git history, we saved the data in pickle files.

Public Functions

__init__(self, str country)#

Initialize the AugmentData class.

Parameters: country – (Type: str) Zipcode list of the region or country used.

validate_zipcode(self, str text)#

Check if a zipcode is valid.

Parameters: text – (Type: str) Zipcode to be validated.
Returns: (Type: bool) Whether the zipcode is valid.
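Conceptually, this kind of validation is a membership test against the known zipcodes of a country. The sketch below is only an illustration under that assumption: the sample zipcodes, the `zipcode_is_valid` key, and the use of plain dictionaries instead of a DataFrame are all hypothetical.

```python
# Hypothetical sample of valid zipcodes for one country.
VALID_ZIPCODES = {"01000", "44100", "64000"}


def validate_zipcode(text: str) -> bool:
    """Return True when the zipcode is in the known set."""
    return text.strip() in VALID_ZIPCODES


# Row-wise validation that stores the result under a new key,
# mirroring the idea of adding a new column to a dataframe.
rows = [{"zipcode": "44100"}, {"zipcode": "99999"}]
for row in rows:
    row["zipcode_is_valid"] = validate_zipcode(row["zipcode"])

print(rows)
```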

validate_zipcodes(self, pd.DataFrame df, str zipcode_column_name)#

Check whether every zipcode in a dataframe column is valid.

Parameters

df – (Type: DataFrame) Dataframe to extract features.
zipcode_column_name – (Type: str) Zipcode column name

Returns

(Type: DataFrame) A dataframe with a new column.

augment_data(self, pd.DataFrame df, str zipcode_column_name)#

Runs the validate_zipcodes function and concatenates the result to the dataframe.

Parameters

df – (Type: DataFrame) Dataframe to extract features from.
zipcode_column_name – (Type: str) Zipcode column name

Returns

(Type: DataFrame) A dataframe with a new column.

Public Members

country_zipcode_df#

class FeatureEngineering#

A class for extracting key smashing and word embedding features from text data.

This class combines the functionality of the KeySmash and WordEmbedding classes to extract key smashing and word embedding features from a given text column in a dataframe.

Examples - Use this class like this:

feature_engineer = FeatureEngineering()
df = feature_engineer.extract_features(df, "text_column")
print(df)

Public Functions

__init__(self, str lang='es', int dimensions=25, str model='bytepair', str country=None, str context_words_file=None)#

Initialize the FeatureEngineering class.

Parameters

lang – (Type: str) The language of the text to be processed. (default ‘es’)
dimensions – (Type: int) The number of dimensions of the word embedding vectors. (default 25)
model – (Type: str) The word embedding model to be used. (default ‘bytepair’)

extract_features(self, pd.DataFrame df, str text_column)#

Extract key smashing and word embedding features from a given dataframe and column.

Examples - Use this function like this:

fe = FeatureEngineering()
df = fe.extract_features(df, "text_column")
print(df)

Parameters

df – (Type: DataFrame) Dataframe to extract features from.
text_column – (Type: str) Name of the column in the dataframe that contains the text data to extract features from.
normalize – (Type: bool, optional) Indicates whether to normalize the feature columns. Default is True.

Returns

(Type: DataFrame) The input dataframe with additional columns for key smashing and word embedding features.

Public Members

key_smash#

word_embedding#

regex#

class KeySmash#

A class for calculating metrics to indicate key smashing behavior in a text.

Key smashing is the act of typing on a keyboard in a rapid and uncontrolled manner, often resulting in a series of random characters being entered into a document or text field.

Examples - Use this class like this:

key_smash = KeySmash()
df = key_smash.extract_key_smash_features(df, "text_column")
print(df)

Public Functions

__init__(self)#

Initialize the KeySmash class.

average_of_char_count_squared(self, str text)#

The function takes a string text as input and splits it into words.

For each word, it counts the number of occurrences of each character in the word, squares those counts, and then sums them. It then divides the sum by the length of the word and appends the result to a list words_results. Finally, it returns the mean of the words_results list, if the list is not empty, otherwise it returns 0.

Examples - Use this function like this:

key_smash = KeySmash()

res = key_smash.average_of_char_count_squared("PUENTECILLA KM. 1.7")
print(res)
# Output: 1.121212121212121

res = key_smash.average_of_char_count_squared("ASDASD XXXX")
print(res)
# Output: 3.0

Parameters: text – (Type: str) The text to use for the calculation.
Returns: (Type: float) The calculated Char Frequency Metric.
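The calculation described above can be sketched in a few lines. This is a standalone reimplementation written from the description, not Hygia's source; it reproduces the two documented outputs.

```python
from collections import Counter


def average_of_char_count_squared(text: str) -> float:
    """Mean over words of (sum of squared character counts) / word length."""
    words_results = []
    for word in text.split():
        counts = Counter(word)  # occurrences of each character in the word
        squared_sum = sum(c * c for c in counts.values())
        words_results.append(squared_sum / len(word))
    # An empty or whitespace-only text has no words, so return 0.
    return sum(words_results) / len(words_results) if words_results else 0.0


print(average_of_char_count_squared("ASDASD XXXX"))  # 3.0
```

For "ASDASD XXXX", the first word contributes (4 + 4 + 4) / 6 = 2 and the second 16 / 4 = 4, so the mean is 3.0. Repeated characters raise the score, which is why key smashes tend to score higher than ordinary words.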

count_sequence_squared(self, str text, str opt)#

This function takes a text and an opt argument as input.

It selects the character set named by opt, converts the text to lowercase, and iterates through its characters. While consecutive characters belong to the set, it increments a counter; when the sequence ends, it appends the square of the counter to a list and resets the counter to 1. After iterating, it returns the sum of the list divided by the length of the text.

Examples - Use this function like this:

key_smash = KeySmash()

res = key_smash.count_sequence_squared("PUENTECILLA KM. 1.7", "vowels")
print(res)
# Output: 0.21052631578947367

res = key_smash.count_sequence_squared("ASDASD XXXX", "consonants")
print(res)
# Output: 2.1818181818181817

res = key_smash.count_sequence_squared("!@#$% ASDFGHJKL", "special_characters")
print(res)
# Output: 1.5625

Parameters

text – (Type: str) The text to use for the calculation.
opt – (Type: str) The type of characters to consider for the calculation, can be one of ‘vowels’, ‘consonants’, or ‘special_characters’.

Returns

(Type: float) The calculated Irregular Sequence Metric.
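A possible reading of this metric is sketched below. It is not Hygia's source: the `CHAR_SETS` contents are assumptions, the special-character set is omitted because its exact membership is not documented here, and the rule that only runs of two or more characters contribute is inferred from the documented vowel and consonant outputs.

```python
# Assumed character sets; the library's char_sets member may differ.
CHAR_SETS = {
    "vowels": set("aeiou"),
    "consonants": set("bcdfghjklmnpqrstvwxyz"),
}


def count_sequence_squared(text: str, opt: str) -> float:
    """Sum of squared run lengths (runs of 2+ chars in the set) / text length."""
    chars = CHAR_SETS[opt]
    text = text.lower()
    if not text:
        return 0.0
    squares = []
    run = 0  # length of the current run of characters in the set
    for ch in text:
        if ch in chars:
            run += 1
        else:
            if run > 1:  # assumption: single characters are not sequences
                squares.append(run * run)
            run = 0
    if run > 1:  # flush a run that reaches the end of the text
        squares.append(run * run)
    return sum(squares) / len(text)


print(count_sequence_squared("PUENTECILLA KM. 1.7", "vowels"))  # 4 / 19
print(count_sequence_squared("ASDASD XXXX", "consonants"))      # 24 / 11
```

In the first call only the run "ue" counts (2 squared, divided by 19 characters); in the second, the runs "sd", "sd", and "xxxx" give 4 + 4 + 16 over 11 characters.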

ratio_of_numeric_digits_squared(self, str text)#

This function takes a text as input, splits it into a list of words, and initializes an accumulator to 0.

It iterates through the words, checking whether each one contains both numeric digits and non-numeric characters. If so, it counts the numeric digits in the word, squares the count, and adds it to the accumulator. It returns the accumulator divided by the length of the original text, or 0 if the list of words is empty.

Examples - Use this function like this:

key_smash = KeySmash()

res = key_smash.ratio_of_numeric_digits_squared("ABC 123 !@#")
print(res)
# Output: 0.0

res = key_smash.ratio_of_numeric_digits_squared("ABC123 !@#")
print(res)
# Output: 0.9

Parameters: text – (Type: str) The text to extract the metric from.
Returns: (Type: float) The calculated Number Count Metric.
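The steps above can be sketched as follows. This is a standalone reimplementation from the description rather than Hygia's source; it reproduces the two documented outputs.

```python
def ratio_of_numeric_digits_squared(text: str) -> float:
    """Squared digit counts of mixed words, divided by the text length."""
    words = text.split()
    if not words:
        return 0.0
    total = 0
    for word in words:
        digits = sum(ch.isdigit() for ch in word)
        # Only words mixing digits with other characters contribute.
        if 0 < digits < len(word):
            total += digits ** 2
    return total / len(text)


print(ratio_of_numeric_digits_squared("ABC 123 !@#"))  # 0.0
print(ratio_of_numeric_digits_squared("ABC123 !@#"))   # 0.9
```

In the second call, "ABC123" mixes three digits with letters, so it contributes 9, and the text has 10 characters. In the first call "123" is purely numeric and is therefore ignored.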

extract_key_smash_features(self, pd.DataFrame df, str column_name)#

Extract key smash features from a given dataframe and column.

Examples - Use this function like this:

import pandas as pd
key_smash = KeySmash()
df = pd.DataFrame({"text_column": ["abcdefgh", "ijklmnop", "qrstuvwxyz"]})
df = key_smash.extract_key_smash_features(df, "text_column", normalize=False)
print(df.head())

Parameters

df – (Type: DataFrame) Dataframe to extract key smash features from.
column_name – (Type: str) Name of the column in the dataframe that contains the text data to extract features from.
normalize – (Type: bool, optional) Indicates whether to normalize the key smash feature columns. Default is True.

Returns

(Type: DataFrame) The input dataframe with additional columns for key smash features: ‘irregular_sequence_vowels’, ‘irregular_sequence_consonants’, ‘irregular_sequence_special_characters’, ‘number_count_metric’, ‘char_frequency_metric’

Public Members

char_sets#

Protected Functions

_normalize_column(self, pd.DataFrame df, str column)#

Normalize a given column in a dataframe.

Parameters

df – (Type: DataFrame) Dataframe to normalize the column in.
column – (Type: str) Name of the column to be normalized.

Returns

(Type: DataFrame) The input dataframe with the normalized column.
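The documentation does not state which normalization is applied. One common choice, shown here purely as an assumption, is to divide each value by the column's largest absolute value so that results fall in [-1, 1]. The helper name `normalize_by_absolute_max` is hypothetical, and a plain list stands in for a dataframe column.

```python
def normalize_by_absolute_max(values: list) -> list:
    """Scale values by the largest absolute value (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0:
        # An empty or all-zero column is left as zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [v / max_abs for v in values]


print(normalize_by_absolute_max([0.5, 2.0, -4.0]))  # [0.125, 0.5, -1.0]
```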

class PreProcessData#

This class provides a series of functions that help with data pre-processing, such as concatenating columns and replacing abbreviations.

Examples - Use this class like this:

import hygia as hg

# df and concatened_column_name are assumed to be defined beforehand
pre_process_data = hg.PreProcessData()
df = pre_process_data.pre_process_data(df, ['COLUMN_1', 'COLUMN_2'], concatened_column_name)
print(df)

Public Functions

__init__(self, str country=None, str abbreviations_file=None)#

Initialize the PreProcessData class.

Parameters: country – (Type: str) Zipcode list of the region or country used.

concatenate_columns(self, df, columns, concatenated_column_name)#

Concatenates the given columns and saves the result in a new column whose name is supplied by the user.

Parameters

df – (Type: DataFrame) Dataframe.
columns – (Type: List) List of columns
concatenated_column_name – (Type: str) Name of the new column

Returns

The concatenated columns.

handle_nulls(self, df, column_name)#

Handle null values.

Parameters

df – (Type: DataFrame) Dataframe
column_name – (Type: str) Column name to check

handle_extra_spaces(self, df, str column_name)#

handle_abreviations(self, df, column_name)#

Handles abbreviations in the dataframe.

Parameters

df – (Type: DataFrame) Dataframe
column_name – (Type: str) Column name to check

pre_process_data(self, df, columns_to_concat=None, column_name=None)#

Runs all implemented preprocessing steps (column concatenation, null handling, and abbreviation replacement).

Parameters

df – (Type: DataFrame) Dataframe
columns_to_concat – (Type: List) List of columns
column_name – (Type: str) Column name to check

Returns

(Type: DataFrame) The input dataframe with additional columns

Public Members

abbreviations_dict#

Protected Functions

_replace_abbreviation(self, str text)#

Identifies abbreviations in the text and replaces them according to the abbreviations dictionary.

Parameters: text – (Type: str) Text to be analyzed
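Dictionary-based abbreviation replacement, combined with the extra-space handling listed above, might look like the sketch below. The `ABBREVIATIONS` entries, the function names, and the matching rule (whole words, optional trailing period, case-insensitive) are illustrative assumptions and not the library's actual behavior.

```python
import re

# Hypothetical abbreviation dictionary; the real one is loaded by the class.
ABBREVIATIONS = {"av": "avenida", "km": "kilometro"}


def collapse_spaces(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def replace_abbreviations(text: str) -> str:
    """Expand whole-word abbreviations, with or without a trailing period."""
    def expand(match: re.Match) -> str:
        word = match.group(0)
        # Unknown words are returned unchanged, including any period.
        return ABBREVIATIONS.get(word.lower().rstrip("."), word)

    return re.sub(r"\b\w+\.?", expand, text)


cleaned = replace_abbreviations(collapse_spaces("  Av.   Juarez  KM. 1.7 "))
print(cleaned)  # avenida Juarez kilometro 1.7
```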

class RandomForestModel#

This class wraps a Random Forest model, supporting both training and prediction.

Examples - Use this class like this:

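import hygia as hg

# df, concatened_column_name and all_features_columns are assumed to be defined beforehand
new_rf_model = hg.RandomForestModel()
clf, scores = new_rf_model.train_and_get_scores(df, concatened_column_name, all_features_columns)
print(scores)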

Public Functions

__init__(self, model_file=None, normalization_absolutes_file=None, n_estimators=100, max_depth=None, random_state=0, normalize=True)#

Initialize the RandomForestModel class.

Parameters: model_file – (Type: path) Path to the model file

train_and_get_scores(self, df, concatened_column_name, all_features_columns, test_size=0.3)#

Train and get scores for the model execution.

Parameters

df – (Type: DataFrame) Dataframe with the data.
concatened_column_name – (Type: str) Column name
all_features_columns – (Type: List) List of all feature column names

predict(self, X, concatened_column_name)#

export_model(self, str export_path, str normalization_absolutes_file_path)#

Public Members

normalize#

model#

normalization_absolutes#

pre_trained#

n_estimators#

max_depth#

random_state#

Protected Functions

_get_absolute_maximums(self, df, features_columns_to_normalize, concatened_column_name)#

_normalization(self, df, features_columns_to_normalize, concatened_column_name)#

class Regex#

This class provides a set of functions for verifying the content of a text field, such as checking whether the field is empty, has only one word, or contains a specific character or pattern.

Public Functions

__init__(self, str country=None, str context_words_file=None)#

contains_context_invalid_words(self, str text)#

contains_exactly_the_word_dell(self, str text)#

Check if it contains the word DELL.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

contains_exactly_the_word_test(self, str text)#

Check if it contains the word test.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

only_numbers(self, str text)#

Check if it contains only numbers.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

only_special_characters(self, str text)#

Check if it contains only special characters.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

contains_email(self, str text)#

Check if it contains an email address.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

contains_url(self, str text)#

Check if it contains a URL.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

contains_date(self, str text)#

Check if it contains a date.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

contains_exactly_invalid_words(self, str text)#

Check if it contains invalid words.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

is_substring_of_column_name(self, str text, str column_name)#

Check if the text is a substring of the column name.

Parameters

text – (Type: str) Text to be verified
column_name – (Type: str) Column name to compare the text against

Returns

(Type: bool) True or False

only_one_char(self, str text)#

Check if it contains only one char.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

only_one_word(self, str text)#

Check if it contains only one word.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

only_white_spaces(self, str text)#

Check if it contains only white spaces.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

empty(self, str text)#

Check if the text is empty.

Parameters: text – (Type: str) Text to be verified
Returns: (Type: bool) True or False

extract_regex_features(self, pd.DataFrame df, str column_name)#

Function to extract all regex features.

Parameters

df – (Type: DataFrame) Dataframe with the data.
column_name – (Type: str) Column name

Returns

(Type: DataFrame) The input dataframe with additional columns for the regex features.
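Checks of this kind are typically thin wrappers around regular expressions. The sketch below uses Python's standard `re` module; the patterns are illustrative assumptions (the email pattern in particular is deliberately simple) and are not the expressions Hygia uses.

```python
import re


def only_numbers(text: str) -> bool:
    return re.fullmatch(r"\d+", text) is not None


def only_special_characters(text: str) -> bool:
    # Anything that is neither a word character nor whitespace.
    return re.fullmatch(r"[^\w\s]+", text) is not None


def only_white_spaces(text: str) -> bool:
    return re.fullmatch(r"\s+", text) is not None


def only_one_word(text: str) -> bool:
    return len(text.split()) == 1


def contains_email(text: str) -> bool:
    # Simplified pattern for illustration, not a full address validator.
    return re.search(r"[\w.+-]+@[\w-]+(\.[\w-]+)+", text) is not None


# Collect every check into one feature dictionary for a single text.
CHECKS = [only_numbers, only_special_characters, only_white_spaces,
          only_one_word, contains_email]


def extract_features(text: str) -> dict:
    return {check.__name__: check(text) for check in CHECKS}


print(extract_features("12345"))
```

Applying `extract_features` to each value of a text column yields one boolean feature per check, which is the general shape of output a feature extraction step like this would produce.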

Public Members

context_invalid_words#

class WordEmbedding#

A class for generating word embeddings from text data.

Word embeddings are numerical representations of text data that capture the context and meaning of words within a sentence or document.

Examples - Use this class like this:

word_embedding = WordEmbedding()
df = word_embedding.extract_word_embedding_features(df, "text_column")
print(df)

Public Functions

__init__(self, str lang='es', int dimensions=25, str model='bytepair')#

Initialize the WordEmbedding class.

Parameters

lang – (Type: str) The language of the text to be processed. (default ‘es’)
dimensions – (Type: int) The number of dimensions of the word embedding vectors. (default 25)
model – (Type: str) The word embedding model to be used. (default ‘bytepair’)

get_embedding(self, str text)#

Get the word embedding vector for a given text.

Examples - Use this function like this:

word_embedding = WordEmbedding()
embedding = word_embedding.get_embedding("This is a sample text.")
print(embedding)
# Output: [0.1, 0.2, ..., 0.3] (a list of float values representing the word embedding vector)

embedding = word_embedding.get_embedding("Another sample text.")
print(embedding)
# Output: [0.5, 0.6, ..., 0.7] (a list of float values representing the word embedding vector)

Parameters: text – (Type: str) The text to be processed.
Returns: (Type: array) A word embedding vector for the given text.

extract_word_embedding_features(self, pd.DataFrame df, str column_name, bool normalize=False)#

Extract word embedding features from a given dataframe and column.

Examples - Use this function like this:

import pandas as pd
word_embedding = WordEmbedding()
df = pd.DataFrame({"text_column": ["abcdefgh", "ijklmnop", "qrstuvwxyz"]})
df = word_embedding.extract_word_embedding_features(df, "text_column", normalize=False)
print(df.head())

Parameters

df – (Type: DataFrame) Dataframe to extract word embedding features from.
column_name – (Type: str) Name of the column in the dataframe that contains the text data to extract features from.
normalize – (Type: bool, optional) Indicates whether to normalize the word embedding feature columns. Default is False.

Returns

(Type: DataFrame) The input dataframe with additional columns for word embedding features.

Public Members

lang#

dimensions#

model#

word_embedding_model#

Protected Functions

_load_model(self)#

Load the word embedding model.

Returns: (Type: Any) The loaded word embedding model.

_pre_embedding(self, str text)#