User Guide#
A user guide is a page that provides instructions and information on how to use Hylia Library. It may include information on features, settings, troubleshooting, and step-by-step procedures for performing specific tasks. The goal is to help users understand and effectively use the software.
Note
In the Hygia repository there are some boilerplates to guide learning and understanding the use Hygia features.
The Hygia library offers two options for usage: (1) Utilizing the available functions directly in your development environment, such as a Jupyter Notebook. (2) Automating processes for different databases through a customizable .yaml file. This file allows you to define your pipeline and at the end, it provides a visualization of the processed data.
Using YAML file#
Update the yaml file with your needed configs and run the notebook#
import hygia as hg
config_file = '../config/default_config.yaml'
result = hg.run_with_config(config_file)
result
Predict Example#
Imports and classes instanciations#
import pandas as pd
import hygia as hg
pre_process_data = hg.PreProcessData()
feature_engineering = hg.FeatureEngineering()
rf_model = hg.RandomForestModel('../data/models/RandomForest_Ksmash_WordEmbedding_Regex.pkl')
Load Data#
file_path = '../data/tmp/AI_LATA_ADDRESS_MEX_modificado.csv'
df = pd.read_csv(file_path, sep='ยจ', nrows=500_000, engine='python')
Add new columns#
Concatenate address
All features columns
Key Smash
Regex
Word Embedding
concatened_column_name = 'concat_STREET_ADDRESS_1_STREET_ADDRESS_2'
df = pre_process_data.pre_process_data(df, ['STREET_ADDRESS_1', 'STREET_ADDRESS_2'], concatened_column_name)
df = feature_engineering.extract_features(df, concatened_column_name)
Check new columns names#
ks_we_and_re_colummns = [col for col in df if col.startswith('feature_ks') or col.startswith('feature_we') or col.startswith('feature_re')]
ks_we_and_re_colummns
Predict using pre-trained model#
df['prediction'] = rf_model.predict(df[ks_we_and_re_colummns].values)
df['prediction'].value_counts()
Save predicted data#
df[['concat_STREET_ADDRESS_1_STREET_ADDRESS_2', 'prediction']].to_csv('data/tmp/prediction.csv')