autoPP Module

Description:

This module handles data preprocessing operations:
- Imputation of missing values
- Winsorization of outliers
- Scaling with popular scaler approaches
- Encoding of category features with popular encoder approaches
- Generation of all combination datasets for further modeling and evaluation
- Sparsity calculation as the criterion for filtering output datasets
- Custom initial parameter settings to add or remove winsorization, scaling, or encoding strategies

Class:
- dynaPreprocessing : Focuses on preprocessing for classification/regression problems
  
  fit() - fit & transform method for preprocessing
Current available strategies:
- Scaling : Numeric features scaling, default settings
  
  (NOTE: Selecting “None” might cause overfitting, with a too-high R-Squared score, in regression problems)
  
  “None” : No scaling is applied in the scaling step
  
  “standard” : StandardScaler() approach
  
  “minmax” : MinMaxScaler() approach
  
  “maxabs” : MaxAbsScaler() approach
  
  “robust” : RobustScaler() approach
- Encoding : Category features encoding, default settings
  
  “onehot” : OneHotEncoder() approach, with the dummy variable trap handled in regression problems
  
  “label” : LabelEncoder() approach
  
  “frequency” : Frequency calculation approach
  
  “mean” : Mean calculation approach
- Winsorization : Default limits settings
  
  (0.01,0.01) : The top 1% and bottom 1% of values are treated as outliers
  
  (0.05,0.05) : The top 5% and bottom 5% of values are treated as outliers

dynaPreprocessing

class optimalflow.autoPP.dynaPreprocessing(custom_parameters=None, label_col=None, model_type='reg', export_output_files=False)

Automated feature preprocessing, including imputation, winsorization, encoding, and scaling, applied as an ensemble of algorithms to generate permuted input datasets for downstream pipeline components.

Parameters:

custom_parameters (dictionary, default = None) –
Custom parameter settings input.

NOTE: default_parameters = {
    "scaler" : ["standard", "minmax", "maxabs", "robust"],
    "encode_band" : [10],
    "low_encode" : ["onehot", "label"],
    "high_encode" : ["frequency", "mean"],
    "winsorizer" : [(0.01,0.01), (0.05,0.05)],
    "sparsity" : [0.50],
    "cols" : [100]
}

label_col (str, default = None) – Name of the label column.
model_type (str, default = "reg") – “reg” for a regression problem or “cls” for a classification problem.
export_output_files (bool, default = False) – Export the qualified permuted datasets to ./df_folder.
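
To get a feel for how many strategy combinations the default settings imply, the sketch below takes the Cartesian product of the option lists in default_parameters. It is only an illustration: how dynaPreprocessing enumerates and filters combinations internally is not described here, so the number of datasets it actually produces may differ. The helper name `list_strategy_combinations` is hypothetical and not part of optimalflow.

```python
from itertools import product

# Default settings copied from the NOTE above.
default_parameters = {
    "scaler": ["standard", "minmax", "maxabs", "robust"],
    "encode_band": [10],
    "low_encode": ["onehot", "label"],
    "high_encode": ["frequency", "mean"],
    "winsorizer": [(0.01, 0.01), (0.05, 0.05)],
    "sparsity": [0.50],
    "cols": [100],
}


def list_strategy_combinations(parameters):
    """Hypothetical helper: build one dict per combination of options.

    Assumes every key contributes one independent choice; the real module
    may combine or filter the options differently.
    """
    keys = list(parameters)
    return [dict(zip(keys, combo)) for combo in product(*parameters.values())]


combinations = list_strategy_combinations(default_parameters)
print(len(combinations))   # size of the option grid
print(combinations[0])     # first combination in the grid
```

Trimming a list in custom_parameters (for example, keeping a single scaler) shrinks this grid proportionally, which is the main lever for keeping the number of candidate datasets manageable.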

Example

[Example]

https://Optimal-Flow.readthedocs.io/en/latest/demos.html#feature-preprocessing-for-a-regression-problem-using-autopp

References

None

fit(input_data=None)

Fits and transforms a pandas dataframe into datasets with missing values imputed, outliers winsorized, and categories encoded and scaled, one for each permutation of the algorithms.

Parameters:
input_data (pandas dataframe, shape = [n_samples, n_features]) – NOTE: input_data should be the dataset after basic data cleaning and feature reduction; the more features it contains, the more column permutation outputs are generated.

Returns:
DICT_PREP_DF (dictionary) – Each key is the number of an output preprocessed dataset; each value stores that dataset.
DICT_PREP_INFO (dictionary) – Dictionary for reference. Each key is the number of an output preprocessed dataset; each value stores the column names of that dataset.

NOTE: Log records are generated and saved to the ./logs folder automatically.
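
A minimal usage sketch, built only from the constructor and fit() signatures documented above. The wrapper `run_auto_preprocessing`, its `csv_path` argument, and the reduced custom_parameters values are assumptions made for illustration; the linked demo remains the authoritative example.

```python
# Custom settings: same keys as default_parameters, with fewer options
# so that fewer candidate datasets are generated.
custom_parameters = {
    "scaler": ["standard", "robust"],
    "encode_band": [10],
    "low_encode": ["onehot"],
    "high_encode": ["frequency"],
    "winsorizer": [(0.05, 0.05)],
    "sparsity": [0.50],
    "cols": [100],
}


def run_auto_preprocessing(csv_path, label_col):
    """Hypothetical wrapper around dynaPreprocessing.fit().

    csv_path and label_col are placeholders for a pre-cleaned dataset and
    the name of its label column. The imports are local so this file still
    loads when pandas or optimalflow is not installed.
    """
    import pandas as pd
    from optimalflow.autoPP import dynaPreprocessing

    df = pd.read_csv(csv_path)
    pp = dynaPreprocessing(
        custom_parameters=custom_parameters,
        label_col=label_col,
        model_type="reg",
        export_output_files=False,
    )
    # Returned unchanged; see the Returns section for what fit() yields.
    return pp.fit(input_data=df)
```

The result is returned without unpacking because the order in which fit() hands back its dictionaries is not spelled out in this section.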

PPtools

class optimalflow.funcPP.PPtools(data=None, label_col=None, model_type='reg')

This class stores feature preprocessing transform tools.

Parameters:
data (df, default = None) – Pre-cleaned dataset for feature preprocessing.
label_col (str, default = None) – Name of the label column.
model_type (str, default = "reg") – Value in [“reg”,“cls”]. “reg” for a regression problem, “cls” for a classification problem.

Example

[Example]

https://optimal-flow.readthedocs.io/en/latest/demos.html#build-pipeline-cluster-traveral-experiments-using-autopipe

References

None

encode_tool(en_type=None, category_col=None)

Category features encoding. Supported types: “onehot” - OneHot algorithm; “label” - LabelEncoder algorithm; “frequency” - Frequency Encoding algorithm; “mean” - Mean Encoding algorithm.

Parameters:
en_type (str, default = None) – Value in [“onehot”,“label”,“frequency”,“mean”]. Selects the encoding algorithm. With “onehot”, the first encoded column is dropped to cope with the dummy trap issue when model_type is “reg”.
category_col (default = None) – Category feature column to encode.
Returns:
Return type:	Encoded column/dataset for each category feature
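
The “frequency” and “mean” options can be illustrated with a small dependency-free sketch. This is not the optimalflow implementation: it assumes frequency encoding uses relative frequencies (the library may use raw counts) and that mean encoding replaces each category with the average label of its rows. The function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def frequency_encode(categories):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(categories)
    total = len(categories)
    return [counts[c] / total for c in categories]


def mean_encode(categories, labels):
    """Replace each category with the mean label value of its rows."""
    sums = defaultdict(float)
    counts = Counter(categories)
    for c, y in zip(categories, labels):
        sums[c] += y
    return [sums[c] / counts[c] for c in categories]


city = ["NY", "LA", "NY", "SF", "NY", "LA"]
price = [10.0, 20.0, 14.0, 30.0, 12.0, 24.0]

print(frequency_encode(city))
print(mean_encode(city, price))
```

Both encoders produce a single numeric column regardless of how many distinct categories exist, which is presumably why they appear under "high_encode" in the default parameters, while “onehot” and “label” appear under "low_encode".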

impute_tool()

Imputes the missing values.

Parameters:	None –
Returns:
Return type:	None

remove_feature(feature_name)

Remove feature.

Parameters:	feature_name (str/list, default = None) – Column name, or list of column names, to remove.
Returns:
Return type:	None

remove_zero_col_tool(data=None)

Remove the columns whose values are all zero.

Parameters:	data (pandas dataset, default = None) – Dataset from which all-zero columns are removed.
Returns:
Return type:	Dataset with the all-zero columns removed
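
As a rough picture of what removing all-zero columns means, the sketch below works on a plain mapping of column names to value lists instead of a pandas dataset. The helper `drop_all_zero_columns` and the sample columns are hypothetical.

```python
def drop_all_zero_columns(table):
    """Return a copy of `table` without the columns whose values are all zero.

    `table` maps column names to lists of numeric values.
    """
    return {
        name: list(values)
        for name, values in table.items()
        if any(v != 0 for v in values)
    }


encoded = {
    "color_red": [1, 0, 1],
    "color_blue": [0, 0, 0],  # never observed, so it carries no information
    "size": [2.5, 0.0, 1.0],
}

print(list(drop_all_zero_columns(encoded)))
```

A column is kept as soon as it contains one non-zero value, so sparse but informative columns survive this step.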

scale_tool(df=None, sc_type=None)

Feature scaling.

Parameters:
df (df, default = None) – Dataset to be scaled.
sc_type (str, default = None) – Value in [“None”,“standard”,“minmax”,“maxabs”,“robust”]. Selects the scaling algorithm: “None” - no scaling algorithm applied; “standard” - StandardScaler algorithm; “minmax” - MinMaxScaler algorithm; “maxabs” - MaxAbsScaler algorithm; “robust” - RobustScaler algorithm.
Returns:
Return type:	Scaled dataset
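
The four scaling options correspond to simple per-column formulas. The sketch below writes them out with the standard library only, following the default behaviour of the scikit-learn scalers named above (population standard deviation, 25th to 75th percentile range). It is an illustration of the formulas, not the code scale_tool runs, and it ignores edge cases such as constant columns.

```python
import statistics


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def maxabs_scale(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one extreme value
print(minmax_scale(values))
print(robust_scale(values))
```

With the extreme value present, min-max scaling squeezes the first four points close to 0, while the robust variant keeps them spread around the median, which is the usual reason to include “robust” among the candidates.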

sparsity_tool(data=None)

Calculate the sparsity of the dataset.

Parameters:	data (df, default = None) – Dataset whose sparsity is calculated.
Returns:
Return type:	Value of sparsity
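
The exact sparsity formula is not given here. A common definition, assumed in the sketch below, is the share of cells that are zero; the function name and the sample one-hot table are hypothetical.

```python
def sparsity(rows):
    """Share of cells equal to zero in a table given as a list of rows."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for cell in row if cell == 0)
    return zeros / total


one_hot = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
print(sparsity(one_hot))  # 9 zero cells out of 12
```

One-hot encoding of high-cardinality features drives this value up quickly, which is why a sparsity threshold is a natural filter for the generated datasets.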

split_category_cols()

Split the input dataset into a numeric dataset and a category dataset.

Parameters:	None –
Returns:
Return type:	None
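
The idea of splitting by column type can be sketched without pandas as follows. The rule used here (a column is numeric when every value is a non-boolean number) is an assumption made for the example; the criterion split_category_cols actually applies is not stated in this section.

```python
from numbers import Number


def split_columns(table):
    """Split a column-name -> values mapping into numeric and category parts."""
    numeric, category = {}, {}
    for name, values in table.items():
        # bool is a Number subclass in Python; treat it as a category here.
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        (numeric if is_numeric else category)[name] = values
    return numeric, category


data = {"age": [31, 45, 27], "fare": [7.25, 71.3, 8.05], "port": ["S", "C", "S"]}
numeric, category = split_columns(data)
print(list(numeric), list(category))
```

Keeping the two groups apart lets scaling run on the numeric part only while encoding runs on the category part.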

winsorize_tool(lower_ban=None, upper_ban=None)

Handles feature outliers with winsorization.

Parameters:
lower_ban (float, default = None) – Bottom fraction of the data to treat as outliers, e.g. 0.01 for the bottom 1%.
upper_ban (float, default = None) – Top fraction of the data to treat as outliers, e.g. 0.01 for the top 1%.
Returns:
Return type:	None
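
In its textbook form, winsorization replaces the most extreme values with the nearest value that is kept, so the row count stays the same. The sketch below implements that textbook form with the standard library; whether winsorize_tool clips values this way or handles them differently is not stated here, so treat the helper as an assumption. Only the parameter names lower_ban and upper_ban come from the signature above.

```python
def winsorize(values, lower_ban, upper_ban):
    """Clip the lowest and highest fractions of `values` to the nearest kept value."""
    ordered = sorted(values)
    n = len(ordered)
    k_low = int(lower_ban * n)    # number of points clipped at the bottom
    k_high = int(upper_ban * n)   # number of points clipped at the top
    floor = ordered[k_low]
    ceiling = ordered[n - 1 - k_high]
    return [min(max(v, floor), ceiling) for v in values]


data = [-100] + list(range(2, 20)) + [1000]  # 20 points, two extreme values
clipped = winsorize(data, 0.05, 0.05)
print(min(clipped), max(clipped))
```

With 20 points and limits of (0.05, 0.05), exactly one value is clipped at each end, and the remaining 18 values are left unchanged.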
