autoPP Module¶
- Description :
- This module is used for data preprocessing operations:
- Imputation of missing values
- Winsorization of outliers
- Scaling using popular scaler approaches
- Encoding category features using popular encoder approaches
- Generation of all combination datasets for further modeling and evaluation
- Sparsity calculation as the criterion for filtering output datasets
- Custom initial parameter settings, with the ability to add/remove winsorization, scaling, or encoding strategies.
- Class:
- dynaPreprocessing : Focuses on classification/regression preprocessing problems
- fit() - fit & transform method for preprocessing
- Current available strategies:
- Scaling : Numeric features scaling, default settings
- (NOTE: Selecting “None” may cause overfitting, with an unrealistically high R-squared score in regression problems)
- “None” : No scaling applied in the scaling step
- “standard” : StandardScaler() approach
- “minmax” - MinMaxScaler() approach
- “maxabs” - MaxAbsScaler() approach
- “robust” - RobustScaler() approach
- Encoding : Category features encoding, default settings
- “onehot” : OneHotEncoder() approach, with dummy-trap consideration in regression problems
- “label” : LabelEncoder() approach
- “frequency” : Frequency calculation approach
- “mean” : Mean calculation approach
- Winsorization : Default limit settings
- (0.01,0.01) : Top 1% and bottom 1% will be excluded
- (0.05,0.05) : Top 5% and bottom 5% will be excluded
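The strategy lists above multiply out into the permutation datasets that autoPP generates. A minimal sketch of the combination count for low-cardinality features, using the documented defaults (illustrative only, not a call into the library):

```python
from itertools import product

# Default strategy lists from the documentation above
scalers = ["standard", "minmax", "maxabs", "robust"]
low_encoders = ["onehot", "label"]               # low-cardinality encoders
winsorizer_limits = [(0.01, 0.01), (0.05, 0.05)]

# Each (scaler, encoder, winsorizer) combination yields one candidate dataset
combos = list(product(scalers, low_encoders, winsorizer_limits))
print(len(combos))  # 4 * 2 * 2 = 16 candidate preprocessing pipelines
```

Datasets whose sparsity exceeds the configured threshold are then filtered out of the final output set.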
dynaPreprocessing¶
-
class optimalflow.autoPP.dynaPreprocessing(custom_parameters=None, label_col=None, model_type='reg', export_output_files=False)[source]¶
Automated feature preprocessing, including imputation, winsorization, encoding, and scaling with ensemble algorithms, to generate permutation input datasets for further pipeline components.
Parameters: - custom_parameters (dictionary, default = None) –
Custom parameter settings input.
- NOTE: default_parameters = { "scaler" : ["standard", "minmax", "maxabs", "robust"], "encode_band" : [10], "low_encode" : ["onehot", "label"], "high_encode" : ["frequency", "mean"], "winsorizer" : [(0.01,0.01), (0.05,0.05)], "sparsity" : [0.50], "cols" : [100] }
- label_col (str, default = None) – Name of label column.
- model_type (str, default = "reg") – “reg” for regression problems or “cls” for classification problems.
- export_output_files (bool, default = False) – Export qualified permuted datasets to the ./df_folder directory.
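Per the parameter description above, custom_parameters uses the same keys as default_parameters, so narrowing the search space is a matter of shortening the strategy lists. A hypothetical configuration restricting the run to one scaler and one winsorization setting:

```python
# Hypothetical custom_parameters: same keys as the documented
# default_parameters, narrowed to a smaller search space
custom_parameters = {
    "scaler": ["standard"],              # only StandardScaler
    "encode_band": [10],                 # cardinality threshold for low vs. high encoding
    "low_encode": ["onehot", "label"],
    "high_encode": ["frequency", "mean"],
    "winsorizer": [(0.05, 0.05)],        # only the 5%/5% limits
    "sparsity": [0.50],
    "cols": [100],
}
```

This dictionary would then be passed as the custom_parameters argument of dynaPreprocessing.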
Example
[Example] https://Optimal-Flow.readthedocs.io/en/latest/demos.html#feature-preprocessing-for-a-regression-problem-using-autopp
References
None
-
fit(input_data=None)[source]¶
Fits and transforms a pandas dataframe into datasets with missing values imputed, outliers excluded, categorical features encoded, and numeric features scaled, across all algorithm permutations.
Parameters: input_data (pandas dataframe, shape = [n_samples, n_features]) – NOTE: input_data should be a dataset that has already had basic data cleaning and sound feature deduction; the more features involved, the more column-permutation outputs will result. Returns: - DICT_PREP_DF (dictionary) – Each key is the number of an output preprocessed dataset; each value stores the dataset
- DICT_PREP_INFO (dictionary) – Dictionary for reference. Each key is the number of an output preprocessed dataset; each value stores the column names of the dataset
- NOTE - Log records are generated and saved to the ./logs folder automatically.
PPtools¶
-
class optimalflow.funcPP.PPtools(data=None, label_col=None, model_type='reg')[source]¶
This class stores feature preprocessing transform tools.
Parameters: - data (df, default = None) – Pre-cleaned dataset for feature preprocessing.
- label_col (str, default = None) – Name of label column.
- model_type (str, default = "reg") – Value in [“reg”,”cls”]: “reg” for regression problems, “cls” for classification problems.
Example
[Example] https://optimal-flow.readthedocs.io/en/latest/demos.html#build-pipeline-cluster-traveral-experiments-using-autopipe
References
None
-
encode_tool(en_type=None, category_col=None)[source]¶
- Category feature encoding, including:
- “onehot” - OneHot algorithm; “label” - LabelEncoder algorithm; “frequency” - frequency encoding algorithm; “mean” - mean encoding algorithm.
Parameters: - en_type (str, default = None) – Value in [“reg”,”cls”]. Drops the first encoded column to avoid the dummy-trap issue when the value is “reg”.
- category_col (str, default = None) – Name of the category feature column to encode.
Returns: Return type: Encoded column/dataset for each category feature
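The frequency and mean strategies can be sketched in plain Python (illustrative data and variable names; PPtools operates on pandas dataframes and its exact output format may differ):

```python
from collections import Counter
from statistics import mean

# Illustrative category column and numeric label
colors = ["red", "blue", "red", "green", "red", "blue"]
target = [10, 20, 30, 40, 50, 60]

# Frequency encoding: each category is replaced by its occurrence count
freq = Counter(colors)
freq_encoded = [freq[c] for c in colors]          # [3, 2, 3, 1, 3, 2]

# Mean encoding: each category is replaced by the mean label within it
by_cat = {}
for c, y in zip(colors, target):
    by_cat.setdefault(c, []).append(y)
cat_mean = {c: mean(ys) for c, ys in by_cat.items()}
mean_encoded = [cat_mean[c] for c in colors]      # [30, 40, 30, 40, 30, 40]
```

Frequency and mean encoding are typically reserved for high-cardinality features (above the encode_band threshold), since one-hot encoding such features would explode the column count.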
-
impute_tool()[source]¶
Imputes missing values.
Parameters: None – Returns: Return type: None
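The documentation does not state which imputation strategy impute_tool applies; one common choice, mean imputation of a numeric column, can be sketched as follows (illustrative data, not the library's implementation):

```python
from statistics import mean

# A numeric column with missing entries represented as None
values = [4.0, None, 10.0, None, 7.0]

# Mean imputation: fill each missing entry with the mean of observed values
fill = mean(v for v in values if v is not None)   # (4 + 10 + 7) / 3 = 7.0
imputed = [fill if v is None else v for v in values]
```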
-
remove_feature(feature_name)[source]¶
Removes a feature.
Parameters: feature_name (str/list, default = None) – Column name, or list of column names, to remove. Returns: Return type: None
-
remove_zero_col_tool(data=None)[source]¶
Removes columns whose values are all zero.
Parameters: data (pandas dataframe, default = None) – Dataset from which all-zero columns should be removed. Returns: Return type: Dataset with all-zero columns removed
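The idea can be sketched with a plain dict-of-columns stand-in for a dataframe (illustrative only; the library works on pandas objects):

```python
# Columns "a" and "c" are all zero and should be dropped
data = {"a": [0, 0, 0], "b": [1, 0, 2], "c": [0, 0, 0]}

# Keep only columns that contain at least one non-zero value
cleaned = {col: vals for col, vals in data.items()
           if any(v != 0 for v in vals)}          # {"b": [1, 0, 2]}
```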
-
scale_tool(df=None, sc_type=None)[source]¶
Feature scaling.
Parameters: - df (df, default = None) – Dataset to be scaled
- sc_type (str, default = None) – Value in [“None”,”standard”,”minmax”,”maxabs”,”robust”]. Selects the scaling algorithm: “None” - no scaling applied; “standard” - StandardScaler algorithm; “minmax” - MinMaxScaler algorithm; “maxabs” - MaxAbsScaler algorithm; “robust” - RobustScaler algorithm
Returns: Return type: Scaled dataset
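As an illustration of one of the listed options, min-max scaling maps a column onto [0, 1]; scikit-learn's MinMaxScaler applies the same formula per column (sketch with illustrative data):

```python
# One numeric column, scaled to [0, 1] via (x - min) / (max - min)
xs = [2.0, 4.0, 6.0, 10.0]
lo, hi = min(xs), max(xs)
minmax_scaled = [(x - lo) / (hi - lo) for x in xs]  # [0.0, 0.25, 0.5, 1.0]
```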
-
sparsity_tool(data=None)[source]¶
Calculates the sparsity of the dataset.
Parameters: data (df, default = None) – Dataset whose sparsity should be calculated. Returns: Return type: Value of sparsity
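Assuming sparsity here means the fraction of zero entries (the usual definition; check the source for the exact formula used against the "sparsity" threshold), a minimal sketch:

```python
# Sparsity = zero entries / total entries, on a list-of-rows stand-in
rows = [[0, 1, 0],
        [0, 0, 2]]
total = sum(len(r) for r in rows)                 # 6 entries
zeros = sum(v == 0 for r in rows for v in r)      # 4 zero entries
sparsity = zeros / total                          # 4/6 ≈ 0.667
```

One-hot encoding in particular inflates sparsity, which is why the sparsity threshold is used to filter the permutation outputs.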