autoPP Module

Description:
  • This module is used for data preprocessing operations:
    • Imputing missing values
    • Winsorizing outliers
    • Scaling with popular scaler approaches
    • Encoding category features with popular encoder approaches
    • Generating datasets for all strategy combinations for further modeling and evaluation
    • Calculating sparsity as the criterion for filtering output datasets
    • Customizing the initial parameter settings to add or remove winsorization, scaling, or encoding strategies
  • Class:
    • dynaPreprocessing : Focuses on preprocessing for classification/regression problems
      • fit() - fit & transform method for preprocessing
  • Current available strategies:
    • Scaling : Numeric features scaling, default settings
      (NOTE: Selecting “None” might cause overfitting, with a too-high R-squared score in regression problems)
      • “None” : No scaling is applied
      • “standard” : StandardScaler() approach
      • “minmax” : MinMaxScaler() approach
      • “maxabs” : MaxAbsScaler() approach
      • “robust” : RobustScaler() approach
    • Encoding : Category features encoding, default settings
      • “onehot” : OneHotEncoder() approach, with the dummy variable trap handled in regression problems
      • “label” : LabelEncoder() approach
      • “frequency” : Frequency calculation approach
      • “mean” : Mean calculation approach
    • Winsorization : Default limits settings
      • (0.01,0.01) : Top 1% and bottom 1% will be excluded
      • (0.05,0.05) : Top 5% and bottom 5% will be excluded

dynapipePreprocessing

class optimalflow.autoPP.dynaPreprocessing(custom_parameters=None, label_col=None, model_type='reg', export_output_files=False)[source]

Automated feature preprocessing that combines imputation, winsorization, encoding, and scaling algorithms to generate permuted input datasets for downstream pipeline components.

Parameters:
  • custom_parameters (dictionary, default = None) –

    Custom parameter settings input.

    NOTE: default_parameters = {
      "scaler" : ["standard", "minmax", "maxabs", "robust"],
      "encode_band" : [10],
      "low_encode" : ["onehot", "label"],
      "high_encode" : ["frequency", "mean"],
      "winsorizer" : [(0.01, 0.01), (0.05, 0.05)],
      "sparsity" : [0.50],
      "cols" : [100]
    }
  • label_col (str, default = None) – Name of label column.
  • model_type (str, default = "reg") – “reg” for a regression problem or “cls” for a classification problem.
  • export_output_files (bool, default = False) – Export qualified permuted datasets to ./df_folder.
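
To get a feel for how the parameter lists drive the number of candidate datasets, here is a minimal sketch. It assumes the strategies are combined as a plain Cartesian product, which this page does not confirm; the real number of output datasets may differ, for example after sparsity filtering. The helper `count_combinations` and the narrowed-down `custom_parameters` values are hypothetical and not part of the library.

```python
from itertools import product

# Copied from the default_parameters note above.
default_parameters = {
    "scaler": ["standard", "minmax", "maxabs", "robust"],
    "encode_band": [10],
    "low_encode": ["onehot", "label"],
    "high_encode": ["frequency", "mean"],
    "winsorizer": [(0.01, 0.01), (0.05, 0.05)],
    "sparsity": [0.50],
    "cols": [100],
}


def count_combinations(parameters):
    """Count strategy combinations, ASSUMING a plain Cartesian product."""
    return sum(1 for _ in product(*parameters.values()))


# Hypothetical custom settings: fewer scalers and one winsorization limit.
custom_parameters = dict(
    default_parameters,
    scaler=["standard", "robust"],
    winsorizer=[(0.05, 0.05)],
)

n_default = count_combinations(default_parameters)
n_custom = count_combinations(custom_parameters)
print(n_default, n_custom)
```

A dictionary shaped like `custom_parameters` is what the `custom_parameters` argument in the class signature above expects; trimming the lists is the simplest way to keep the permutation count manageable.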

Example

[Example]https://Optimal-Flow.readthedocs.io/en/latest/demos.html#feature-preprocessing-for-a-regression-problem-using-autopp

fit(input_data=None)[source]

Fits and transforms a pandas dataframe into datasets with missing values imputed, outliers excluded, and categories encoded and scaled, one for each permutation of the algorithms.

Parameters: input_data (pandas dataframe, shape = [n_samples, n_features]) – NOTE: input_data should be the dataset after basic data cleaning and feature reduction; the more features it contains, the more column permutation outputs are generated.
Returns:
  • DICT_PREP_DF (dictionary) – Each key is the number of an output preprocessed dataset; each value stores the dataset.
  • DICT_PREP_INFO (dictionary) – Dictionary for reference. Each key is the number of an output preprocessed dataset; each value stores the column names of that dataset.
  • NOTE - Log records are generated and saved to the ./logs folder automatically.

PPtools

class optimalflow.funcPP.PPtools(data=None, label_col=None, model_type='reg')[source]

This class stores feature preprocessing transform tools.

Parameters:
  • data (df, default = None) – Pre-cleaned dataset for feature preprocessing.
  • label_col (str, default = None) – Name of label column.
  • model_type (str, default = "reg") – Value in [“reg”, “cls”]: “reg” for a regression problem, “cls” for a classification problem.

Example

[Example]https://optimal-flow.readthedocs.io/en/latest/demos.html#build-pipeline-cluster-traveral-experiments-using-autopipe

encode_tool(en_type=None, category_col=None)[source]

Category features encoding. Supported algorithms:
  • “onehot” - OneHot algorithm
  • “label” - LabelEncoder algorithm
  • “frequency” - Frequency Encoding algorithm
  • “mean” - Mean Encoding algorithm

Parameters:
  • en_type (str, default = None) – Value in [“onehot”, “label”, “frequency”, “mean”]. Selects the encoding algorithm. With “onehot”, the first encoded column is dropped to cope with the dummy variable trap when model_type is “reg”.
  • category_col (default = None) – Category feature column to encode.
Returns:
Return type: Encoded column/dataset for each category feature
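
The encoding strategies named above can be illustrated with a small standard-library sketch. This is not the library's implementation: the function names are hypothetical, frequency encoding is assumed to use relative frequencies (raw counts are an equally common convention), and one-hot levels are assumed to be sorted alphabetically before the first one is dropped.

```python
from collections import Counter, defaultdict


def frequency_encode(categories):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(categories)
    total = len(categories)
    return [counts[c] / total for c in categories]


def mean_encode(categories, labels):
    """Replace each category with the mean label value observed for it."""
    sums = defaultdict(float)
    counts = Counter(categories)
    for c, y in zip(categories, labels):
        sums[c] += y
    return [sums[c] / counts[c] for c in categories]


def onehot_encode(categories, drop_first=False):
    """One indicator column per level; drop_first avoids the dummy variable trap."""
    levels = sorted(set(categories))
    if drop_first:
        levels = levels[1:]
    return {lvl: [1 if c == lvl else 0 for c in categories] for lvl in levels}


colors = ["red", "blue", "red", "green"]
prices = [10.0, 20.0, 30.0, 40.0]

freq = frequency_encode(colors)
mean = mean_encode(colors, prices)
onehot = onehot_encode(colors, drop_first=True)
print(freq, mean, onehot)
```

Dropping one indicator column matters for linear regression because the full set of indicators always sums to one, which makes it collinear with the intercept.
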
impute_tool()[source]

Imputes the missing values.

Parameters:None
Returns:
Return type:None
remove_feature(feature_name)[source]

Remove feature.

Parameters: feature_name (str/list, default = None) – Column name, or list of column names, to remove.
Returns:
Return type:None
remove_zero_col_tool(data=None)[source]

Remove the columns whose values are all zero.

Parameters: data (pandas dataset, default = None) – Dataset from which all-zero columns are removed.
Returns:
Return type: Dataset with the all-zero columns removed
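
As an illustration of what removing all-zero columns means, here is a minimal sketch on a column-oriented dictionary. The helper name `remove_zero_columns` and the sample table are hypothetical; the library method operates on a pandas dataset instead.

```python
def remove_zero_columns(table):
    """Return a copy of a column-oriented table without all-zero columns."""
    return {
        name: values
        for name, values in table.items()
        if any(v != 0 for v in values)  # keep columns with a non-zero cell
    }


table = {"a": [0, 0, 0], "b": [1, 0, 2], "c": [0.0, 0.0, 0.0]}
cleaned = remove_zero_columns(table)
print(cleaned)
```
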
scale_tool(df=None, sc_type=None)[source]

Feature scaling.

Parameters:
  • df (df, default = None) – Dataset to be scaled.
  • sc_type (str, default = None) – Value in [“None”, “standard”, “minmax”, “maxabs”, “robust”]. Selects the scaling algorithm: “None” - no scaling algorithm applied; “standard” - StandardScaler algorithm; “minmax” - MinMaxScaler algorithm; “maxabs” - MaxAbsScaler algorithm; “robust” - RobustScaler algorithm.
Returns:
Return type: Scaled dataset
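
The formulas behind three of the scaling options can be sketched with the standard library. This is an illustration of the standard definitions of standardization, min-max scaling, and max-abs scaling, not the library's code; the function names are hypothetical, the sketch does not handle constant columns (which would divide by zero), and the “robust” option is left out.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax_scale(values):
    """Map the column linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def maxabs_scale(values):
    """Divide by the largest absolute value, keeping zeros at zero."""
    m = max(abs(v) for v in values)
    return [v / m for v in values]


column = [1.0, 2.0, 3.0, 4.0, 5.0]
standardized = standard_scale(column)
minmaxed = minmax_scale(column)
maxabsed = maxabs_scale(column)
print(standardized, minmaxed, maxabsed)
```
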

sparsity_tool(data=None)[source]

Calculate the sparsity of the dataset.

Parameters: data (df, default = None) – Dataset whose sparsity is calculated.
Returns:
Return type: Value of sparsity
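
A sparsity value can be computed as sketched below. The definition used here, the fraction of cells equal to zero, is an assumption: this page does not state how sparsity_tool defines sparsity. The helper name `sparsity` and the sample table are hypothetical.

```python
def sparsity(table):
    """Fraction of cells equal to zero (ASSUMED definition of sparsity)."""
    total = sum(len(col) for col in table.values())
    zeros = sum(1 for col in table.values() for v in col if v == 0)
    return zeros / total


table = {"a": [0, 1, 0, 0], "b": [0, 0, 2, 3]}
ratio = sparsity(table)  # 5 zero cells out of 8
print(ratio)
```
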
split_category_cols()[source]

Split the input dataset into a numeric dataset and a category dataset.

Parameters:None
Returns:
Return type:None
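
The idea of splitting a dataset by column type can be sketched as follows. The rule used here, a column is numeric only if every value is a non-boolean number, is an assumption made for illustration; the library method works on the pandas dataset held by the class and may classify columns differently. The helper name `split_columns` is hypothetical.

```python
from numbers import Number


def split_columns(table):
    """Split a column-oriented table into numeric and category columns."""
    numeric, category = {}, {}
    for name, values in table.items():
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        (numeric if is_numeric else category)[name] = values
    return numeric, category


table = {
    "age": [31, 45],
    "city": ["Oslo", "Lima"],
    "income": [52.5, 61.0],
}
numeric_part, category_part = split_columns(table)
print(list(numeric_part), list(category_part))
```
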
winsorize_tool(lower_ban=None, upper_ban=None)[source]

Exclude feature outliers with winsorization.

Parameters:
  • lower_ban (float, default = None) – Bottom percentage of the data to exclude.
  • upper_ban (float, default = None) – Top percentage of the data to exclude.
Returns:
Return type: None
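
In its usual statistical sense, winsorization caps extreme values at the nearest retained value instead of dropping rows. The sketch below assumes that meaning; whether winsorize_tool behaves exactly this way is not stated on this page. The function name `winsorize` and the sample data are hypothetical.

```python
def winsorize(values, lower=0.05, upper=0.05):
    """Cap the lowest and highest fractions at the nearest retained value."""
    n = len(values)
    ordered = sorted(values)
    k_low = int(lower * n)    # number of values capped at the bottom
    k_high = int(upper * n)   # number of values capped at the top
    low_cap = ordered[k_low]
    high_cap = ordered[n - k_high - 1]
    return [min(max(v, low_cap), high_cap) for v in values]


data = list(range(1, 20)) + [1000]  # 20 values, one extreme outlier
capped = winsorize(data, lower=0.05, upper=0.05)
print(capped)
```

With 20 values and limits of (0.05, 0.05), one value at each end is capped: the outlier 1000 becomes 19 and the minimum 1 becomes 2, while the length of the column is unchanged.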