Preprocessing Functions Reference

Complete reference for FormulaML preprocessing functions including scaling, encoding, and data splitting.

Functions for preparing and transforming data before training machine learning models.

ML.PREPROCESSING Namespace

ML.PREPROCESSING.TRAIN_TEST_SPLIT()

Splits a dataset into training and testing sets for model validation.

Syntax:

=ML.PREPROCESSING.TRAIN_TEST_SPLIT(data, test_size, random_state, dataset_type)

Parameters:

  • data (Object, Required): DataFrame to split
  • test_size (Number, Required): Fraction for test set (e.g., 0.2 = 20%)
  • random_state (Integer, Required): Seed for reproducibility
  • dataset_type (Integer, Required): Which split to return
    • 0 = Training set
    • 1 = Test set

Returns: DataFrame subset (train or test)

Use Case: Create separate train/test sets for model evaluation

Example:

# Split features
Cell A1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X_data, 0.2, 42, 0)  # Train (80%)
Cell A2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X_data, 0.2, 42, 1)  # Test (20%)

# Split target
Cell B1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y_data, 0.2, 42, 0)  # Train
Cell B2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y_data, 0.2, 42, 1)  # Test
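
Note: all four calls must use the same test_size and random_state (0.2 and 42 here) so the feature and target splits stay row-aligned; each call presumably re-runs the same deterministic split and returns only the requested half.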

ML.PREPROCESSING.STANDARD_SCALER()

Standardizes features by removing the mean and scaling to unit variance.

Syntax:

=ML.PREPROCESSING.STANDARD_SCALER()

Parameters: None

Returns: StandardScaler transformer object

Use Case: Scale features to have mean=0 and std=1, as many ML algorithms expect

Example:

# Create scaler
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Result: <StandardScaler>

# Fit and transform training data
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)

# Transform test data (using training statistics)
Cell C1: =ML.TRANSFORM(A1, X_test)
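
This is the standard z-score transformation: during fitting, the scaler learns each feature's training mean and standard deviation, then applies

z = (x - mean) / std

so transformed training data has mean 0 and standard deviation 1. Test data transformed with the same scaler reuses the training statistics, so its mean and std will be close to, but not exactly, 0 and 1.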

ML.PREPROCESSING.MIN_MAX_SCALER()

Scales features to a given range, typically [0, 1].

Syntax:

=ML.PREPROCESSING.MIN_MAX_SCALER()

Parameters: None

Returns: MinMaxScaler transformer object

Use Case: Scale features to a fixed range, preserving the shape of the distribution

Example:

# Create MinMax scaler
Cell A1: =ML.PREPROCESSING.MIN_MAX_SCALER()
Result: <MinMaxScaler>

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
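
For the [0, 1] range, each feature is rescaled using its training minimum and maximum:

x_scaled = (x - min) / (max - min)

Note that test values outside the training min/max will land outside [0, 1].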

ML.PREPROCESSING.ROBUST_SCALER()

Scales features using statistics that are robust to outliers (median and interquartile range).

Syntax:

=ML.PREPROCESSING.ROBUST_SCALER()

Parameters: None

Returns: RobustScaler transformer object

Use Case: Scale features when data contains outliers

Example:

# Create Robust scaler
Cell A1: =ML.PREPROCESSING.ROBUST_SCALER()
Result: <RobustScaler>

# Use with outlier-prone data
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
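
The robust statistics are the per-feature median and interquartile range (IQR) learned from the training data:

x_scaled = (x - median) / IQR, where IQR = Q3 - Q1

Because a few extreme values barely move the median or the IQR, outliers do not distort the scale of the remaining points the way they do with mean/std scaling.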

ML.PREPROCESSING.ONE_HOT_ENCODER()

Converts categorical variables into binary columns (one-hot encoding).

Syntax:

=ML.PREPROCESSING.ONE_HOT_ENCODER(handle_unknown)

Parameters:

  • handle_unknown (String, Optional): How to handle unknown categories (default: "error")
    • "error": Raise an error for unknown categories
    • "ignore": Ignore unknown categories during transform

Returns: OneHotEncoder transformer object

Use Case: Convert categorical features to numerical format for ML algorithms

Example:

# Create encoder
Cell A1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Result: <OneHotEncoder>

# Fit on categories
Cell B1: =ML.FIT(A1, categories_train)

# Transform to one-hot
Cell C1: =ML.TRANSFORM(B1, categories_train)
Cell D1: =ML.TRANSFORM(B1, categories_test)
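
As a concrete sketch with hypothetical data: if categories_train holds a single column containing red, green, and blue, the fitted encoder produces one binary column per category:

Input    blue  green  red
red      0     0      1
green    0     1      0
blue     1     0      0

With handle_unknown set to "ignore", an unseen category in categories_test (say, purple) is typically encoded as an all-zero row rather than raising an error.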

ML.PREPROCESSING.ORDINAL_ENCODER()

Converts categorical variables to integer codes.

Syntax:

=ML.PREPROCESSING.ORDINAL_ENCODER(handle_unknown)

Parameters:

  • handle_unknown (String, Optional): How to handle unknown categories (default: "error")
    • "error": Raise an error for unknown categories
    • "ignore": Ignore unknown categories

Returns: OrdinalEncoder transformer object

Use Case: Convert categorical features to integers, preserving ordinality

Example:

# Create ordinal encoder
Cell A1: =ML.PREPROCESSING.ORDINAL_ENCODER("error")
Result: <OrdinalEncoder>

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, categories_train)
Cell C1: =ML.TRANSFORM(A1, categories_test)
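
As a hypothetical sketch: if the fitted column contains low, medium, and high, each category maps to a single integer code, e.g. high → 0, low → 1, medium → 2 if codes follow the alphabetical order of the fitted categories (an assumption; verify the actual ordering before treating the codes as meaningful ranks).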

Common Patterns

Complete Train/Test Split Workflow

# Load data
Cell A1: =ML.DATASETS.DIABETES()

# Separate features and target
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})  # X
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 10)                      # y

# Split into train/test (80/20)
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0)  # X_train
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1)  # X_test
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0)  # y_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1)  # y_test

Feature Scaling Pipeline

# Create scaler and model
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.REGRESSION.LINEAR()

# Create pipeline
Cell B1: =ML.PIPELINE(A1, A2)

# Fit pipeline (automatically scales then trains)
Cell C1: =ML.FIT(B1, X_train, y_train)

# Predict (automatically scales test data)
Cell D1: =ML.PREDICT(C1, X_test)

Comparing Scalers

# Create different scalers
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.PREPROCESSING.MIN_MAX_SCALER()
Cell A3: =ML.PREPROCESSING.ROBUST_SCALER()

# Transform same data
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell B2: =ML.FIT_TRANSFORM(A2, X_train)
Cell B3: =ML.FIT_TRANSFORM(A3, X_train)

# Compare statistics
Cell C1: =ML.DATA.DESCRIBE(B1)  # Standard scaled
Cell C2: =ML.DATA.DESCRIBE(B2)  # MinMax scaled
Cell C3: =ML.DATA.DESCRIBE(B3)  # Robust scaled
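
If the scalers behave as described above, the three DESCRIBE outputs should differ in predictable ways: standard-scaled columns show mean ≈ 0 and std ≈ 1, MinMax-scaled columns show min = 0 and max = 1, and robust-scaled columns show median ≈ 0.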

Encoding Categorical Features

# Assume categorical data in column
Cell A1: =ML.DATA.SELECT_COLUMNS(data, {"category"})

# One-hot encoding (creates multiple binary columns)
Cell B1: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell C1: =ML.FIT_TRANSFORM(B1, A1)

# Ordinal encoding (creates single integer column)
Cell B2: =ML.PREPROCESSING.ORDINAL_ENCODER("error")
Cell C2: =ML.FIT_TRANSFORM(B2, A1)

# Compare results
Cell D1: =ML.DATA.SAMPLE(C1, 10)  # One-hot result
Cell D2: =ML.DATA.SAMPLE(C2, 10)  # Ordinal result

Mixed Feature Types Pipeline

# Assume data with numeric and categorical columns
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:F1000, TRUE)

# Split features by type
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2})      # Numeric features
Cell B2: =ML.DATA.SELECT_COLUMNS(A1, {3,4,5})      # Categorical features

# Scale numeric features
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Encode categorical features
Cell C2: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
Cell D2: =ML.FIT_TRANSFORM(C2, B2)

# Note: In practice, use ML.COMPOSE.DATA_TRANSFORMER for this

Handling Outliers with Robust Scaler

# Data with outliers
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)

# Robust scaler (uses median and IQR)
Cell B1: =ML.PREPROCESSING.ROBUST_SCALER()
Cell C1: =ML.FIT_TRANSFORM(B1, A1)

# Compare with standard scaler
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C2: =ML.FIT_TRANSFORM(B2, A1)

# Robust scaler less affected by outliers
Cell D1: =ML.DATA.DESCRIBE(C1)  # Robust
Cell D2: =ML.DATA.DESCRIBE(C2)  # Standard
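
The reason: StandardScaler divides by the standard deviation, and a single extreme value can inflate the std enough to compress all the other points toward zero. RobustScaler's median and IQR are largely unchanged by a handful of outliers, so the bulk of the data keeps a usable spread; comparing the two DESCRIBE outputs on an outlier-heavy column makes this visible.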

Complete Preprocessing Workflow

# Load and explore data
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.DESCRIBE(A1)

# Handle missing values
Cell B1: =ML.DATA.DROP_MISSING_ROWS(A1)

# Separate features and target
Cell C1: =ML.DATA.SELECT_COLUMNS(B1, {0,1,2,3})
Cell C2: =ML.DATA.SELECT_COLUMNS(B1, 4)

# Split train/test
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.3, 42, 0)  # X_train
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.3, 42, 1)  # X_test
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C2, 0.3, 42, 0)  # y_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C2, 0.3, 42, 1)  # y_test

# Scale features
Cell F1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell G1: =ML.FIT_TRANSFORM(F1, D1)  # Fit on train, transform train
Cell G2: =ML.TRANSFORM(F1, D2)       # Transform test with train stats

# Now ready for modeling

Tips and Best Practices

  1. Train/Test Splitting

    • Always split before any preprocessing
    • Use same random_state for features and target
    • Common splits: 80/20, 70/30, 90/10
    • Larger datasets can afford proportionally smaller test sets
  2. Scaling Strategy

    • StandardScaler: Most common; works best when features are roughly normally distributed
    • MinMaxScaler: When you need specific range [0,1]
    • RobustScaler: When data has outliers
    • Scale before: SVM, Neural Networks, K-Means, PCA
    • Don’t scale: Tree-based models (Random Forest, Decision Trees)
  3. Fit vs Transform

    • Fit on training data only
    • Transform both train and test using train statistics
    • Never fit on test data (causes data leakage)
    • Use FIT_TRANSFORM on train, TRANSFORM on test (see the sketch after this list)
  4. Encoding Categorical Variables

    • OneHotEncoder: For nominal categories (no order)
    • OrdinalEncoder: For ordinal categories (has order)
    • Handle unknown categories appropriately
    • Consider category cardinality (one-hot encoding a high-cardinality column creates one column per category)
  5. Pipeline Integration

    • Use ML.PIPELINE to combine preprocessing and models
    • Ensures correct order of operations
    • Prevents data leakage
    • Easier to deploy and reproduce
  6. Common Mistakes to Avoid

    • ❌ Scaling before splitting data
    • ❌ Fitting scaler on test data
    • ❌ Using different random_state for X and y splits
    • ❌ Not handling missing values before scaling
    • ❌ Scaling features for tree-based models (unnecessary)
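
Data Leakage: Wrong vs. Right

A minimal sketch of the fit/transform rule from tip 3, reusing the X_train and X_test names from the examples above:

# ❌ Wrong: fitting on test data leaks test statistics into preprocessing
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell B1: =ML.FIT_TRANSFORM(A1, X_test)   # scaler learns from test data

# ✅ Right: fit on train only, reuse training statistics for test
Cell A2: =ML.PREPROCESSING.STANDARD_SCALER()
Cell B2: =ML.FIT_TRANSFORM(A2, X_train)  # learns mean/std from train only
Cell C2: =ML.TRANSFORM(A2, X_test)       # applies train statistics to test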