Classes and Functions#
This documentation describes the classes and functions of the Hygia library. Each entry explains what the class or function does, lists its parameters and return values, and, where available, shows a usage example.
- class AnnotateData#
A class that implements the data annotation phase. Given a set of thresholds (e.g., count sequence squared vowels, count sequence squared consonants), it flags whether a text is a key smash.
Examples
Use this class like this:
annotate_data = hg.AnnotateData()
key_smash_thresholds = {
    'count_sequence_squared_vowels': 1.00,
    'count_sequence_squared_consonants': 1.999,
    'count_sequence_squared_special_characters': 2.2499,
    'ratio_of_numeric_digits_squared': 2.9,
    'average_of_char_count_squared': 2.78,
}
df = annotate_data.annotate_data(df, concatened_column_name, key_smash_thresholds)
print(df)
Public Functions
- annotate_data(self, df, concatened_column_name, ks_thresholds)#
Annotate data function.
- Parameters
df – (Type: DataFrame) Dataframe to extract features from.
concatened_column_name – (Type: List) List of columns to be used
ks_thresholds – (Type: dict) Mapping from key smash feature name to its threshold value
- Returns
(Type: DataFrame) The input dataframe with additional columns for key smashing and word embedding features.
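The thresholding idea can be sketched in plain Python. This is only an illustration under stated assumptions: the helper `is_key_smash` is hypothetical, and the rule "flag the text when any feature reaches its threshold" is a guess at the comparison, not the library's confirmed behavior.

```python
def is_key_smash(features: dict, thresholds: dict) -> bool:
    """Hypothetical rule: flag a text when any feature reaches its threshold."""
    return any(
        features.get(name, 0.0) >= limit
        for name, limit in thresholds.items()
    )


# Threshold names and values taken from the example above.
thresholds = {
    "count_sequence_squared_vowels": 1.00,
    "count_sequence_squared_consonants": 1.999,
}

# Hypothetical feature values for two texts.
regular_text = {"count_sequence_squared_vowels": 0.2, "count_sequence_squared_consonants": 0.5}
smashed_text = {"count_sequence_squared_vowels": 0.0, "count_sequence_squared_consonants": 2.2}

print(is_key_smash(regular_text, thresholds))  # False
print(is_key_smash(smashed_text, thresholds))  # True
```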
- class AugmentData#
This class provides validations based on zipcode data from this website: Listen Data.
We obtained data from several continents and filtered it using the ‘country’ code. To avoid overwhelming the Git history, we saved the data in pickle files.
Public Functions
- __init__(self, str country)#
Initialize the AugmentData class.
- Parameters
country – (Type: str) Zipcode list of the region or country used.
- validate_zipcode(self, str text)#
Check if a zipcode is valid.
- Parameters
text – (Type: str) Zipcode to be validated.
- Returns
(Type: bool) Whether the zipcode is valid.
- validate_zipcodes(self, pd.DataFrame df, str zipcode_column_name)#
Check whether every zipcode in a dataframe column is valid.
- Parameters
df – (Type: DataFrame) Dataframe to extract features.
zipcode_column_name – (Type: str) Zipcode column name
- Returns
(Type: DataFrame) A dataframe with a new column.
- augment_data(self, pd.DataFrame df, str zipcode_column_name)#
Function that uses the validate_zipcodes function and concatenates the result to the dataframe.
- Parameters
df – (Type: DataFrame) Dataframe to extract features from.
zipcode_column_name – (Type: str) Zipcode column name
- Returns
(Type: DataFrame) Return a dataframe with a new column.
Public Members
- country_zipcode_df#
- class FeatureEngineering#
A class for extracting key smashing and word embedding features from text data.
This class combines the functionality of the KeySmash and WordEmbedding classes to extract key smashing and word embedding features from a given text column in a dataframe.
Examples - Use this class like this:
feature_engineer = FeatureEngineering()
df = feature_engineer.extract_features(df, "text_column")
print(df)
Public Functions
- __init__(self, str lang='es', int dimensions=25, str model='bytepair', str country=None, str context_words_file=None)#
Initialize the FeatureEngineering class.
- Parameters
lang – (Type: str) The language of the text to be processed. (default ‘es’)
dimensions – (Type: int) The number of dimensions of the word embedding vectors. (default 25)
model – (Type: str) The word embedding model to be used. (default ‘bytepair’)
- extract_features(self, pd.DataFrame df, str text_column)#
Extract key smashing and word embedding features from a given dataframe and column.
Examples - Use this function like this:
fe = FeatureEngineering()
df = fe.extract_features(df, "text_column")
print(df)
- Parameters
df – (Type: DataFrame) Dataframe to extract features from.
text_column – (Type: str) Name of the column in the dataframe that contains the text data to extract features from.
normalize – (Type: bool, optional) Indicates whether to normalize the feature columns. Default is True.
- Returns
(Type: DataFrame) The input dataframe with additional columns for key smashing and word embedding features.
- class KeySmash#
A class for calculating metrics to indicate key smashing behavior in a text.
Key smashing is the act of typing on a keyboard in a rapid and uncontrolled manner, often resulting in a series of random characters being entered into a document or text field.
Examples - Use this class like this:
key_smash = KeySmash()
df = key_smash.extract_key_smash_features(df, "text_column")
print(df)
Public Functions
- __init__(self)#
Initialize the KeySmash class.
- average_of_char_count_squared(self, str text)#
The function takes a string text as input and splits it into words.
For each word, it counts the number of occurrences of each character in the word, squares those counts, and then sums them. It then divides the sum by the length of the word and appends the result to a list words_results. Finally, it returns the mean of the words_results list, if the list is not empty, otherwise it returns 0.
Examples - Use this function like this:
key_smash = KeySmash()
res = key_smash.average_of_char_count_squared("PUENTECILLA KM. 1.7")
print(res)
# Output: 1.121212121212121
res = key_smash.average_of_char_count_squared("ASDASD XXXX")
print(res)
# Output: 3.0
- Parameters
text – (Type: str) The text to use for the calculation.
- Returns
(Type: float) The calculated Char Frequency Metric.
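The computation described above can be sketched as a standalone function. This is a minimal re-implementation written from the description, not the library's source; it assumes case-sensitive counting and whitespace splitting, which is enough to reproduce both documented outputs.

```python
from collections import Counter


def average_of_char_count_squared(text: str) -> float:
    """Mean, over words, of (sum of squared character counts) / (word length)."""
    words_results = []
    for word in text.split():
        # Count how many times each character occurs in this word.
        counts = Counter(word)
        squared_sum = sum(count ** 2 for count in counts.values())
        words_results.append(squared_sum / len(word))
    # A text without words yields an empty list, so return 0.
    return sum(words_results) / len(words_results) if words_results else 0


# "ASDASD" gives (4 + 4 + 4) / 6 = 2.0 and "XXXX" gives 16 / 4 = 4.0.
print(average_of_char_count_squared("ASDASD XXXX"))  # 3.0
```

Repeated characters inside a word raise the score quadratically, which is why "XXXX" scores higher than a word of the same length with distinct letters.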
- count_sequence_squared(self, str text, str opt)#
This function takes a text and opt as input.
It selects the character set named by opt, converts the text to lowercase, and iterates through the characters. While consecutive characters belong to the set, it increments a counter; when the sequence ends, it adds the square of the counter to a list and resets the counter to 1. After iterating, it returns the sum of the list divided by the length of the text.
Examples - Use this function like this:
key_smash = KeySmash()
res = key_smash.count_sequence_squared("PUENTECILLA KM. 1.7", "vowels")
print(res)
# Output: 0.21052631578947367
res = key_smash.count_sequence_squared("ASDASD XXXX", "consonants")
print(res)
# Output: 2.1818181818181817
res = key_smash.count_sequence_squared("!@#$% ASDFGHJKL", "special_characters")
print(res)
# Output: 1.5625
- Parameters
text – (Type: str) The text to use for the calculation.
opt – (Type: str) The type of characters to consider for the calculation, can be one of ‘vowels’, ‘consonants’, or ‘special_characters’.
- Returns
(Type: float) The calculated Irregular Sequence Metric.
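A rough standalone sketch of this metric follows. It is written from the description and checked only against the documented vowel and consonant outputs, which imply that runs of a single character do not contribute. The `CHAR_SETS` table is an assumption; the library's own `char_sets` member may differ, so results for `special_characters` in particular may not match the library.

```python
import string

# Hypothetical character sets; the library's char_sets may contain other characters.
CHAR_SETS = {
    "vowels": set("aeiou"),
    "consonants": set("bcdfghjklmnpqrstvwxyz"),
    "special_characters": set(string.punctuation),
}


def count_sequence_squared(text: str, opt: str) -> float:
    """Sum of squared lengths of runs (2+ chars) from one set, over len(text)."""
    char_set = CHAR_SETS[opt]
    text = text.lower()
    if not text:
        return 0.0
    squares = []
    run = 0  # length of the current run of in-set characters
    for char in text:
        if char in char_set:
            run += 1
        else:
            if run > 1:
                squares.append(run ** 2)
            run = 0
    if run > 1:  # a run that reaches the end of the text
        squares.append(run ** 2)
    return sum(squares) / len(text)


# "ue" is the only vowel run in the first text: 2**2 / 19 characters.
print(count_sequence_squared("PUENTECILLA KM. 1.7", "vowels"))
# "sd", "sd" and "xxxx" are the consonant runs: (4 + 4 + 16) / 11 characters.
print(count_sequence_squared("ASDASD XXXX", "consonants"))
```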
- ratio_of_numeric_digits_squared(self, str text)#
This function takes text as input, splits it into a list of words, and initializes an accumulator to 0.
It iterates through the list of words, checking whether each word contains both numeric digits and non-numeric characters. If so, it counts the numeric digits in the word, squares the count, and adds it to the accumulator. It returns the accumulator divided by the length of the original text; if the list is empty, it returns 0.
Examples - Use this function like this:
key_smash = KeySmash()
res = key_smash.ratio_of_numeric_digits_squared("ABC 123 !@#")
print(res)
# Output: 0.0
res = key_smash.ratio_of_numeric_digits_squared("ABC123 !@#")
print(res)
# Output: 0.9
- Parameters
text – (Type: str) The text to extract the metric from.
- Returns
(Type: float) The calculated Number Count Metric.
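The steps above can be sketched as a standalone function. This is an illustrative re-implementation based on the description, assuming whitespace splitting and `str.isdigit` for digit detection; it reproduces both documented outputs but is not the library's source.

```python
def ratio_of_numeric_digits_squared(text: str) -> float:
    """Squared digit counts of mixed words, summed and divided by len(text)."""
    words = text.split()
    if not words:
        return 0
    total = 0
    for word in words:
        digits = sum(char.isdigit() for char in word)
        # Only words that mix digits with non-digit characters contribute.
        if 0 < digits < len(word):
            total += digits ** 2
    return total / len(text)


# "123" is purely numeric, so nothing is counted.
print(ratio_of_numeric_digits_squared("ABC 123 !@#"))  # 0.0
# "ABC123" mixes letters with 3 digits: 3**2 / 10 characters.
print(ratio_of_numeric_digits_squared("ABC123 !@#"))  # 0.9
```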
- extract_key_smash_features(self, pd.DataFrame df, str column_name)#
Extract key smash features from a given dataframe and column.
Examples - Use this function like this:
import pandas as pd
key_smash = KeySmash()
df = pd.DataFrame({"text_column": ["abcdefgh", "ijklmnop", "qrstuvwxyz"]})
df = key_smash.extract_key_smash_features(df, "text_column", normalize=False)
print(df.head())
- Parameters
df – (Type: DataFrame) Dataframe to extract key smash features from.
column_name – (Type: str) Name of the column in the dataframe that contains the text data to extract features from.
normalize – (Type: bool, optional) Indicates whether to normalize the key smash feature columns. Default is True.
- Returns
(Type: DataFrame) The input dataframe with additional columns for key smash features: ‘irregular_sequence_vowels’, ‘irregular_sequence_consonants’, ‘irregular_sequence_special_characters’, ‘number_count_metric’, ‘char_frequency_metric’
Public Members
- char_sets#
Protected Functions
- _normalize_column(self, pd.DataFrame df, str column)#
Normalize a given column in a dataframe.
- Parameters
df – (Type: DataFrame) Dataframe to normalize the column in.
column – (Type: str) Name of the column to be normalized.
- Returns
(Type: DataFrame) The input dataframe with the normalized column.
- class PreProcessData#
This class provides a series of functions that help with data pre-processing, such as concatenating columns and replacing abbreviations.
Examples - Use this class like this:
pre_process_data = hg.PreProcessData()
df = pre_process_data.pre_process_data(df, ['COLUMN_1', 'COLUMN_2'], concatened_column_name)
print(df)
Public Functions
- __init__(self, str country=None, str abbreviations_file=None)#
Initialize the PreProcessData class.
- Parameters
country – (Type: str) Zipcode list of the region or country used.
- concatenate_columns(self, df, columns, concatenated_column_name)#
Function that concatenates the given columns and saves the result in a new column, whose name is provided by the user.
- Parameters
df – (Type: DataFrame) Dataframe.
columns – (Type: List) List of columns
concatenated_column_name – (Type: str) Name of the new column
- Returns
Return the columns concatenated
- handle_nulls(self, df, column_name)#
Handle null values.
- Parameters
df – (Type: DataFrame) Dataframe
column_name – (Type: str) Column name to check
- handle_extra_spaces(self, df, str column_name)#
- handle_abreviations(self, df, column_name)#
Handles abbreviations in the dataframe.
- Parameters
df – (Type: DataFrame) Dataframe
column_name – (Type: str) Column name to check
- pre_process_data(self, df, columns_to_concat=None, column_name=None)#
Function that applies all implemented pre-processing steps (column concatenation, null handling, and abbreviation handling).
- Parameters
df – (Type: DataFrame) Dataframe
columns_to_concat – (Type: List) List of columns
column_name – (Type: str) Column name to check
- Returns
(Type: DataFrame) The input dataframe with additional columns
Public Members
- abbreviations_dict#
Protected Functions
- _replace_abbreviation(self, str text)#
Function that identifies abbreviations and replaces them according to the abbreviations dictionary.
- Parameters
text – (Type: str) Text to be analyzed
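Dictionary-based abbreviation replacement can be sketched as follows. The entries in `ABBREVIATIONS`, the function name `replace_abbreviation`, and the choice to match whole word tokens case-insensitively are all assumptions made for illustration; the library's actual dictionary and matching rules are not documented here.

```python
import re

# Hypothetical entries, for illustration only.
ABBREVIATIONS = {"av": "avenida", "km": "kilometro"}


def replace_abbreviation(text: str, abbreviations: dict) -> str:
    """Replace each word token found in the dictionary with its expansion."""

    def expand(match: re.Match) -> str:
        token = match.group(0)
        # Look the token up case-insensitively; keep it unchanged if unknown.
        return abbreviations.get(token.lower(), token)

    return re.sub(r"\w+", expand, text)


print(replace_abbreviation("Av Central km 5", ABBREVIATIONS))
# avenida Central kilometro 5
```

Matching whole tokens avoids rewriting substrings of longer words, for example the "av" inside "gravel".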
- class RandomForestModel#
This class wraps a Random Forest model, allowing it to be trained and used for prediction.
Examples - Use this class like this:
new_rf_model = hg.RandomForestModel()
clf, scores = new_rf_model.train_and_get_scores(df, concatened_column_name, all_features_columns)
print(scores)
Public Functions
- __init__(self, model_file=None, normalization_absolutes_file=None, n_estimators=100, max_depth=None, random_state=0, normalize=True)#
Initialize the RandomForestModel class.
- Parameters
model_file – (Type: path) Path to the model file
- train_and_get_scores(self, df, concatened_column_name, all_features_columns, test_size=0.3)#
Train and get scores for the model execution.
- Parameters
df – (Type: DataFrame) Dataframe with the data.
concatened_column_name – (Type: str) Column name
all_features_columns – (Type: List) List of all feature column names
- predict(self, X, concatened_column_name)#
- export_model(self, str export_path, str normalization_absolutes_file_path)#
- class Regex#
It provides a set of functions that help you verify the content of a text field, such as checking if the field is empty, if it has only one word, if it contains a specific character or pattern, and more.
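A few of these checks can be approximated with the standard library. The functions below are a hypothetical sketch that borrows method names from this class; the patterns are assumptions and the library's own expressions may be stricter or looser.

```python
import re
import string


def empty(text: str) -> bool:
    # True only for the empty string.
    return text == ""


def only_numbers(text: str) -> bool:
    # One or more digits and nothing else.
    return re.fullmatch(r"\d+", text) is not None


def only_white_spaces(text: str) -> bool:
    # One or more whitespace characters and nothing else.
    return re.fullmatch(r"\s+", text) is not None


def only_special_characters(text: str) -> bool:
    # Non-empty and made up entirely of ASCII punctuation (an assumption).
    return bool(text) and all(char in string.punctuation for char in text)


def only_one_word(text: str) -> bool:
    # Exactly one whitespace-separated token.
    return len(text.split()) == 1


for sample in ["12345", "   ", "!@#", "hello", "hello world"]:
    print(repr(sample), only_numbers(sample), only_one_word(sample))
```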
Public Functions
- __init__(self, str country=None, str context_words_file=None)#
- contains_context_invalid_words(self, str text)#
- contains_exactly_the_word_dell(self, str text)#
Check if it contains the word DELL.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- contains_exactly_the_word_test(self, str text)#
Check if it contains the word test.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- only_numbers(self, str text)#
Check if it contains only numbers.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- only_special_characters(self, str text)#
Check if it contains only special characters.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- contains_email(self, str text)#
Check if it contains email.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- contains_url(self, str text)#
Check if it contains url.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- contains_date(self, str text)#
Check if it contains date.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- contains_exactly_invalid_words(self, str text)#
Check if it contains invalid words.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- is_substring_of_column_name(self, str text, str column_name)#
Check if the text is a substring of the column name.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- only_one_char(self, str text)#
Check if it contains only one char.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- only_one_word(self, str text)#
Check if it contains only one word.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- only_white_spaces(self, str text)#
Check if it contains only white spaces.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- empty(self, str text)#
Check if the text is empty.
- Parameters
text – (Type: str) Text to be verified
- Returns
(Type: bool) True or False
- extract_regex_features(self, pd.DataFrame df, str column_name)#
Function to extract all regex features.
- Parameters
df – (Type: DataFrame) Dataframe with the data.
column_name – (Type: str) Column name
- Returns
(Type: DataFrame) The input dataframe with additional columns for regex features.
Public Members
- context_invalid_words#
- class WordEmbedding#
A class for generating word embeddings from text data.
Word embeddings are numerical representations of text data that capture the context and meaning of words within a sentence or document.
Examples - Use this class like this:
word_embedding = WordEmbedding()
df = word_embedding.extract_word_embedding_features(df, "text_column")
print(df)
Public Functions
- __init__(self, str lang='es', int dimensions=25, str model='bytepair')#
Initialize the WordEmbedding class.
- Parameters
lang – (Type: str) The language of the text to be processed. (default ‘es’)
dimensions – (Type: int) The number of dimensions of the word embedding vectors. (default 25)
model – (Type: str) The word embedding model to be used. (default ‘bytepair’)
- get_embedding(self, str text)#
Get the word embedding vector for a given text.
Examples - Use this function like this:
word_embedding = WordEmbedding()
embedding = word_embedding.get_embedding("This is a sample text.")
print(embedding)
# Output: [0.1, 0.2, ..., 0.3] (a list of float values representing the word embedding vector)
embedding = word_embedding.get_embedding("Another sample text.")
print(embedding)
# Output: [0.5, 0.6, ..., 0.7] (a list of float values representing the word embedding vector)
- Parameters
text – (Type: str) The text to be processed.
- Returns
(Type: array) A word embedding vector for the given text.
- extract_word_embedding_features(self, pd.DataFrame df, str column_name, bool normalize=False)#
Extract word embedding features from a given dataframe and column.
Examples - Use this function like this:
word_embedding = WordEmbedding()
df = pd.DataFrame({"text_column": ["abcdefgh", "ijklmnop", "qrstuvwxyz"]})
df = word_embedding.extract_word_embedding_features(df, "text_column", normalize=False)
print(df.head())
- Parameters
df – (Type: DataFrame) Dataframe to extract word embedding features from.
column_name – (Type: str) Name of the column in the dataframe that contains the text data to extract features from.
normalize – (Type: bool, optional) Indicates whether to normalize the word embedding feature columns. Default is False.
- Returns
(Type: DataFrame) The input dataframe with additional columns for word embedding features.