Preprocessing Functions Reference
Functions for preparing and transforming data before training machine learning models.
ML.PREPROCESSING Namespace
ML.PREPROCESSING.TRAIN_TEST_SPLIT()
Splits dataset into training and testing sets for model validation.
Syntax:
=ML.PREPROCESSING.TRAIN_TEST_SPLIT(data, test_size, random_state, dataset_type)
Parameters:
data
(Object, Required): DataFrame to split
test_size
(Number, Required): Fraction for test set (e.g., 0.2 = 20%)
random_state
(Integer, Required): Seed for reproducibility
dataset_type
(Integer, Required): Which split to return
- 0 = Training set
- 1 = Test set
Returns: DataFrame subset (train or test)
Use Case: Create separate train/test sets for model evaluation
Example:
# Split features
Cell A1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X_data, 0.2, 42, 0) # Train (80%)
Cell A2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X_data, 0.2, 42, 1) # Test (20%)
# Split target
Cell B1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y_data, 0.2, 42, 0) # Train
Cell B2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y_data, 0.2, 42, 1) # Test
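The pattern above mirrors scikit-learn's train_test_split. A minimal Python sketch of the likely underlying behavior (the wrapper's exact implementation may differ); note how a shared random_state keeps feature and target rows aligned:

```python
# Illustrative sketch, assuming the function wraps sklearn's train_test_split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # matching target

# Same test_size and random_state => X and y splits stay row-aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With test_size=0.2 and 10 samples, 8 rows go to training and 2 to test.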
ML.PREPROCESSING.STANDARD_SCALER()
Standardizes features by removing mean and scaling to unit variance.
Syntax:
=ML.PREPROCESSING.STANDARD_SCALER()
Parameters: None
Returns: StandardScaler transformer object
Use Case: Scale features to have mean=0 and std=1, required for many ML algorithms
Example:
# Create scaler
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Result: <StandardScaler>
# Fit and transform training data
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
# Transform test data (using training statistics)
Cell C1: =ML.TRANSFORM(A1, X_test)
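The fit-on-train, transform-on-test discipline above corresponds to scikit-learn's StandardScaler. A short sketch under that assumption (toy numbers, illustrative only):

```python
# Illustrative sketch, assuming StandardScaler semantics.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_s = scaler.transform(X_test)        # reuse train statistics
```

After fitting, each training column has mean 0 and standard deviation 1; test data is shifted and scaled with the same training statistics.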
ML.PREPROCESSING.MIN_MAX_SCALER()
Scales features to a given range, typically [0, 1].
Syntax:
=ML.PREPROCESSING.MIN_MAX_SCALER()
Parameters: None
Returns: MinMaxScaler transformer object
Use Case: Scale features to fixed range, preserving shape of distribution
Example:
# Create MinMax scaler
Cell A1: =ML.PREPROCESSING.MIN_MAX_SCALER()
Result: <MinMaxScaler>
# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
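Assuming MinMaxScaler semantics, the default output range is [0, 1]: each feature's training minimum maps to 0 and maximum to 1, with values in between scaled linearly (a toy sketch):

```python
# Illustrative sketch, assuming MinMaxScaler semantics.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [5.0], [10.0]])
scaler = MinMaxScaler()              # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X_train)
# 0.0 -> 0.0, 5.0 -> 0.5, 10.0 -> 1.0
```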
ML.PREPROCESSING.ROBUST_SCALER()
Scales features using statistics robust to outliers.
Syntax:
=ML.PREPROCESSING.ROBUST_SCALER()
Parameters: None
Returns: RobustScaler transformer object
Use Case: Scale features when data contains outliers
Example:
# Create Robust scaler
Cell A1: =ML.PREPROCESSING.ROBUST_SCALER()
Result: <RobustScaler>
# Use with outlier-prone data
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
ML.PREPROCESSING.ONE_HOT_ENCODER()
Converts categorical variables into binary columns (one-hot encoding).
Syntax:
=ML.PREPROCESSING.ONE_HOT_ENCODER(handle_unknown)
Parameters:
handle_unknown
(String, Optional): How to handle unknown categories (default: "error")
- "error": Raise an error for unknown categories
- "ignore": Ignore unknown categories during transform
Returns: OneHotEncoder transformer object
Use Case: Convert categorical features to numerical format for ML algorithms
Example:
# Create encoder
Cell A1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Result: <OneHotEncoder>
# Fit on categories
Cell B1: =ML.FIT(A1, categories_train)
# Transform to one-hot
Cell C1: =ML.TRANSFORM(B1, categories_train)
Cell D1: =ML.TRANSFORM(B1, categories_test)
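Assuming OneHotEncoder semantics, handle_unknown="ignore" encodes any category unseen during fit as an all-zero row instead of raising an error (toy categories for illustration):

```python
# Illustrative sketch, assuming OneHotEncoder semantics.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["blue"]])
test = np.array([["red"], ["purple"]])   # "purple" was not seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
out = enc.transform(test).toarray()
# Columns follow sorted category order: blue, green, red.
# "red"    -> [0, 0, 1]
# "purple" -> [0, 0, 0]  (unknown category, ignored)
```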
ML.PREPROCESSING.ORDINAL_ENCODER()
Converts categorical variables to integer codes.
Syntax:
=ML.PREPROCESSING.ORDINAL_ENCODER(handle_unknown)
Parameters:
handle_unknown
(String, Optional): How to handle unknown categories (default: "error")
- "error": Raise an error for unknown categories
- "ignore": Ignore unknown categories
Returns: OrdinalEncoder transformer object
Use Case: Convert categorical features to integers, preserving ordinality
Example:
# Create ordinal encoder
Cell A1: =ML.PREPROCESSING.ORDINAL_ENCODER("error")
Result: <OrdinalEncoder>
# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, categories_train)
Cell C1: =ML.TRANSFORM(A1, categories_test)
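Assuming OrdinalEncoder semantics, each category maps to a single integer code. To preserve a meaningful order rather than the default alphabetical one, the category order can be given explicitly (a toy sketch):

```python
# Illustrative sketch, assuming OrdinalEncoder semantics.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([["small"], ["medium"], ["large"]])
# Explicit order preserves ordinality; the default is alphabetical.
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = enc.fit_transform(sizes)   # small=0, medium=1, large=2
```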
Common Patterns
Complete Train/Test Split Workflow
# Load data
Cell A1: =ML.DATASETS.DIABETES()
# Separate features and target
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9}) # X
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 10) # y
# Split into train/test (80/20)
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0) # X_train
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1) # X_test
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0) # y_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1) # y_test
Feature Scaling Pipeline
# Create scaler and model
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()
# Create pipeline
Cell B1: =ML.PIPELINE(A1, A2)
# Fit pipeline (automatically scales then trains)
Cell C1: =ML.FIT(B1, X_train, y_train)
# Predict (automatically scales test data)
Cell D1: =ML.PREDICT(C1, X_test)
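The pipeline pattern above matches scikit-learn's Pipeline: fitting runs each transformer in order before training the final estimator, and predicting applies the same fitted transformers to new data automatically. A minimal sketch under that assumption:

```python
# Illustrative sketch, assuming sklearn Pipeline semantics.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_train, y_train)                 # scales, then trains
pred = pipe.predict(np.array([[5.0]]))     # input scaled with train stats
# pred[0] ≈ 10.0
```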
Comparing Scalers
# Create different scalers
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.PREPROCESSING.MIN_MAX_SCALER()
Cell A3: =ML.PREPROCESSING.ROBUST_SCALER()
# Transform same data
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell B2: =ML.FIT_TRANSFORM(A2, X_train)
Cell B3: =ML.FIT_TRANSFORM(A3, X_train)
# Compare statistics
Cell C1: =ML.DATA.DESCRIBE(B1) # Standard scaled
Cell C2: =ML.DATA.DESCRIBE(B2) # MinMax scaled
Cell C3: =ML.DATA.DESCRIBE(B3) # Robust scaled
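The statistics you would expect to see in the comparison above, sketched in Python under the assumption that the three functions wrap the corresponding scikit-learn scalers:

```python
# Illustrative comparison of the three scalers on the same column.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

std = StandardScaler().fit_transform(X)   # mean 0, std 1
mm = MinMaxScaler().fit_transform(X)      # range [0, 1]
rob = RobustScaler().fit_transform(X)     # median 0, unit IQR
```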
Encoding Categorical Features
# Assume categorical data in column
Cell A1: =ML.DATA.SELECT_COLUMNS(data, {"category"})
# One-hot encoding (creates multiple binary columns)
Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell C1: =ML.FIT_TRANSFORM(B1, A1)
# Ordinal encoding (creates single integer column)
Cell B2: =ML.PREPROCESSING.ORDINAL_ENCODER("error")
Cell C2: =ML.FIT_TRANSFORM(B2, A1)
# Compare results
Cell D1: =ML.DATA.SAMPLE(C1, 10) # One-hot result
Cell D2: =ML.DATA.SAMPLE(C2, 10) # Ordinal result
Mixed Feature Types Pipeline
# Assume data with numeric and categorical columns
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:F1000, TRUE)
# Split features by type
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2}) # Numeric features
Cell B2: =ML.DATA.SELECT_COLUMNS(A1, {3,4,5}) # Categorical features
# Scale numeric features
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Encode categorical features
Cell C2: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell D2: =ML.FIT_TRANSFORM(C2, B2)
# Note: In practice, use ML.COMPOSE.DATA_TRANSFORMER for this
Handling Outliers with Robust Scaler
# Data with outliers
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)
# Robust scaler (uses median and IQR)
Cell B1: =ML.PREPROCESSING.ROBUST_SCALER()
Cell C1: =ML.FIT_TRANSFORM(B1, A1)
# Compare with standard scaler
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C2: =ML.FIT_TRANSFORM(B2, A1)
# Robust scaler less affected by outliers
Cell D1: =ML.DATA.DESCRIBE(C1) # Robust
Cell D2: =ML.DATA.DESCRIBE(C2) # Standard
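Why the robust result differs: one extreme value inflates the mean and standard deviation that StandardScaler uses, squashing the ordinary points together, while the median and IQR that RobustScaler uses are barely affected. A toy sketch, assuming the functions wrap the sklearn scalers:

```python
# Illustrative sketch of outlier sensitivity (toy data).
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# Spread between the ordinary values 1.0 and 4.0 after scaling:
std_spread = std[3, 0] - std[0, 0]   # tiny: outlier inflated the std-dev
rob_spread = rob[3, 0] - rob[0, 0]   # intact: median/IQR ignore the outlier
```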
Complete Preprocessing Workflow
# Load and explore data
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.DESCRIBE(A1)
# Handle missing values
Cell B1: =ML.DATA.DROP_MISSING_ROWS(A1)
# Separate features and target
Cell C1: =ML.DATA.SELECT_COLUMNS(B1, {0,1,2,3})
Cell C2: =ML.DATA.SELECT_COLUMNS(B1, 4)
# Split train/test
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.3, 42, 0) # X_train
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.3, 42, 1) # X_test
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C2, 0.3, 42, 0) # y_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C2, 0.3, 42, 1) # y_test
# Scale features
Cell F1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell G1: =ML.FIT_TRANSFORM(F1, D1) # Fit on train, transform train
Cell G2: =ML.TRANSFORM(F1, D2) # Transform test with train stats
# Now ready for modeling
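The same workflow end-to-end in Python, under the assumption that these functions wrap the equivalent scikit-learn calls (iris via load_iris, a 70/30 split, and train-only scaling):

```python
# End-to-end sketch of the workflow above, assuming sklearn equivalents.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

# 70/30 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on train only
X_test_s = scaler.transform(X_test)        # reuse train statistics
```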
Tips and Best Practices
Train/Test Splitting
- Always split before any preprocessing
- Use same random_state for features and target
- Common splits: 80/20, 70/30, 90/10
- Larger datasets can use smaller test sets
Scaling Strategy
- StandardScaler: Most common; works best when features are roughly Gaussian
- MinMaxScaler: When you need specific range [0,1]
- RobustScaler: When data has outliers
- Scale before: SVM, Neural Networks, K-Means, PCA
- Don’t scale: Tree-based models (Random Forest, Decision Trees)
Fit vs Transform
- Fit on training data only
- Transform both train and test using train statistics
- Never fit on test data (causes data leakage)
- Use FIT_TRANSFORM on train, TRANSFORM on test
Encoding Categorical Variables
- OneHotEncoder: For nominal categories (no order)
- OrdinalEncoder: For ordinal categories (has order)
- Handle unknown categories appropriately
- Consider category cardinality
Pipeline Integration
- Use ML.PIPELINE to combine preprocessing and models
- Ensures correct order of operations
- Prevents data leakage
- Easier to deploy and reproduce
Common Mistakes to Avoid
- ❌ Scaling before splitting data
- ❌ Fitting scaler on test data
- ❌ Using different random_state for X and y splits
- ❌ Not handling missing values before scaling
- ❌ Scaling features for tree-based models (unnecessary)
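The leakage rule from the tips above, illustrated with toy numbers under the assumption of StandardScaler semantics: fit on training data only, then transform test data with those statistics.

```python
# Correct vs. leaky scaling (illustrative toy data).
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# Correct: statistics come from the training data only.
scaler = StandardScaler().fit(X_train)
X_test_s = scaler.transform(X_test)   # large value, far from train mean

# Wrong: fitting on test data lets its statistics leak into preprocessing.
leaky = StandardScaler().fit_transform(X_test)
```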
Related Functions
- ML.FIT() - Fit transformer on data
- ML.TRANSFORM() - Transform data
- ML.FIT_TRANSFORM() - Fit and transform
- ML.PIPELINE() - Combine transformers and models
- ML.COMPOSE Functions - Advanced transformation pipelines
- ML.IMPUTE Functions - Handle missing values