Impute & Feature Selection Reference
Functions for handling missing values and selecting important features.
ML.IMPUTE Namespace
ML.IMPUTE.SIMPLE_IMPUTER()
Imputes missing values using basic strategies.
Syntax:
=ML.IMPUTE.SIMPLE_IMPUTER(strategy, fill_value)
Parameters:
strategy
(String, Required): Imputation strategy:
- "mean": Replace with column mean (numeric only)
- "median": Replace with column median (numeric only)
- "most_frequent": Replace with mode (works for all types)
- "constant": Replace with a specific value (use fill_value)
fill_value
(Any, Optional): Fill value for the "constant" strategy
Returns: SimpleImputer transformer object
Use Case: Handle missing data before training models
Example:
# Mean imputation for numeric columns
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Result: <SimpleImputer>
# Constant imputation
Cell A2: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 0)
# Most frequent for categorical
Cell A3: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")
# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
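The names above mirror scikit-learn; assuming SIMPLE_IMPUTER wraps sklearn's SimpleImputer, a minimal Python sketch of the same fit-on-train, transform-test flow:

```python
# Minimal sketch, assuming SIMPLE_IMPUTER wraps sklearn's SimpleImputer.
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])
X_test = np.array([[np.nan, 5.0]])

imputer = SimpleImputer(strategy="mean")         # =ML.IMPUTE.SIMPLE_IMPUTER("mean")
X_train_filled = imputer.fit_transform(X_train)  # =ML.FIT_TRANSFORM(A1, X_train)
X_test_filled = imputer.transform(X_test)        # =ML.TRANSFORM(A1, X_test)

print(X_train_filled)  # NaNs become the column means (2.0, 5.0)
print(X_test_filled)   # test NaNs filled with means learned from train
```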
ML.FEATURE_SELECTION Namespace
ML.FEATURE_SELECTION.SELECT_PERCENTILE()
Selects features based on percentile of highest scores.
Syntax:
=ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile)
Parameters:
score_func
(String, Required): Scoring function to use:
- "chi2": Chi-squared test (classification; non-negative features only)
- See the "Feature Selection Scoring Functions" table below for other options
percentile
(Integer, Required): Percentile of features to keep:
- Integer between 1 and 100
- 10 = keep the top 10% of features
Returns: SelectPercentile transformer object
Use Case: Reduce features by statistical significance
Example:
# Keep top 25% of features
Cell A1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 25)
Result: <SelectPercentile>
# Fit on training data
Cell B1: =ML.FIT(A1, X_train, y_train)
# Transform both train and test
Cell C1: =ML.TRANSFORM(B1, X_train)
Cell D1: =ML.TRANSFORM(B1, X_test)
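Assuming SELECT_PERCENTILE wraps sklearn's SelectPercentile, the same fit-on-train, transform-both flow looks roughly like this (digits stands in for X/y):

```python
# Sketch only: the selector is fit on training data, then applied to both splits.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # 64 non-negative pixel features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

selector = SelectPercentile(chi2, percentile=25)  # keep top 25% of features
selector.fit(X_train, y_train)                    # =ML.FIT(A1, X_train, y_train)
X_train_sel = selector.transform(X_train)         # =ML.TRANSFORM(B1, X_train)
X_test_sel = selector.transform(X_test)           # =ML.TRANSFORM(B1, X_test)
print(X_train_sel.shape)                          # 16 of 64 features kept
```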
Common Patterns
Basic Missing Value Imputation
# Load data with missing values
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:F1000, TRUE)
# Check for missing values
Cell B1: =ML.DATA.INFO(A1) # Shows null counts
# Impute with mean
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell D1: =ML.FIT_TRANSFORM(C1, A1)
# Verify no missing values
Cell E1: =ML.DATA.INFO(D1)
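In pandas/scikit-learn terms the check-then-impute loop above is roughly the following (toy columns, illustrative only):

```python
# Toy sketch of check -> impute -> re-check; column names are illustrative.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, None, 40], "income": [50000, 62000, None]})
print(df.isna().sum())  # per-column null counts, like ML.DATA.INFO

imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled.isna().sum())  # all zeros: no missing values remain
```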
Different Strategies for Different Columns
# Load data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:G1000, TRUE)
# Mean for numeric columns
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell C1: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {"age", "income", "score"})
# Most frequent for categorical
Cell B2: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(B2, {"category", "region"})
# Combine imputers
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(C1, C2)
# Use in pipeline
Cell E1: =ML.CLASSIFICATION.LOGISTIC()
Cell F1: =ML.PIPELINE(D1, E1)
Cell G1: =ML.FIT(F1, X_train, y_train)
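Assuming COMPOSE.COLUMN_TRANSFORMER and COMPOSE.DATA_TRANSFORMER map onto sklearn's ColumnTransformer, the column-specific imputation above is roughly:

```python
# Sketch: different imputers per column group, combined in one transformer.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

impute = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age", "income", "score"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["category", "region"]),
])

pipe = Pipeline([("impute", impute), ("model", LogisticRegression())])
# pipe.fit(X_train, y_train)
# Note: string categories would still need encoding (e.g. one-hot) between
# imputation and a linear model; see "Preprocessing Order" below.
```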
Imputation in Complete Pipeline
# Load data
Cell A1: =ML.DATASETS.DIABETES()
# (The diabetes dataset has no missing values; imagine some for this demo)
# Step 1: Impute missing values
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
# Step 2: Scale features
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()
# Step 3: Select features
Cell B3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("f_regression", 50) # chi2 is classification-only and needs non-negative inputs
# Step 4: Train model
Cell B4: =ML.REGRESSION.RIDGE(1.0)
# Create pipeline
Cell C1: =ML.PIPELINE(B1, B2, B3, B4)
# Fit and predict
Cell D1: =ML.FIT(C1, X_train, y_train)
Cell E1: =ML.PREDICT(D1, X_test)
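An equivalent sklearn pipeline sketch for this example (f_regression rather than chi2, since diabetes is a regression task and scaled features go negative):

```python
# Sketch of the four-step pipeline on the diabetes regression dataset.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),                # step 1
    ("scale", StandardScaler()),                                 # step 2
    ("select", SelectPercentile(f_regression, percentile=50)),   # step 3
    ("model", Ridge(alpha=1.0)),                                 # step 4
])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
```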
Constant Value Imputation
# Load data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)
# Impute age with median age (40)
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 40)
Cell C1: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {"age"})
# Impute categorical with "Unknown"
Cell B2: =ML.IMPUTE.SIMPLE_IMPUTER("constant", "Unknown")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(B2, {"category"})
# Combine
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(C1, C2)
# Process data
Cell E1: =ML.FIT_TRANSFORM(D1, A1)
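Assuming the add-in's fill_value maps onto sklearn's fill_value parameter, the constant-fill pattern is roughly:

```python
# Sketch: constant fills per column; remainder="passthrough" keeps the rest.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

impute = ColumnTransformer([
    ("age", SimpleImputer(strategy="constant", fill_value=40), ["age"]),
    ("cat", SimpleImputer(strategy="constant", fill_value="Unknown"), ["category"]),
], remainder="passthrough")
# processed = impute.fit_transform(df)  # like =ML.FIT_TRANSFORM(D1, A1)
```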
Feature Selection Workflow
# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63") # 64 features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64) # Target
# Select top 50% most important features
Cell D1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)
# Fit selector on training data
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0) # X_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1) # X_test
Cell F1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0) # y_train
Cell F2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1) # y_test
# Fit feature selector
Cell G1: =ML.FIT(D1, E1, F1)
# Transform to selected features
Cell H1: =ML.TRANSFORM(G1, E1) # Train with top features
Cell H2: =ML.TRANSFORM(G1, E2) # Test with top features
# Train model on selected features
Cell I1: =ML.CLASSIFICATION.SVM()
Cell J1: =ML.FIT(I1, H1, F1)
# Evaluate
Cell K1: =ML.EVAL.SCORE(J1, H2, F2)
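The whole workflow condenses to a few lines of sklearn, assuming the add-in follows the same semantics:

```python
# Sketch: fit the selector on the training split only, reuse it for test.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

selector = SelectPercentile(chi2, percentile=50).fit(X_train, y_train)
model = SVC().fit(selector.transform(X_train), y_train)
print(model.score(selector.transform(X_test), y_test))  # held-out accuracy
```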
Comparing Feature Selection Percentiles
# Load data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)
# Try different percentiles
Cell D1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 10) # Top 10%
Cell D2: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 25) # Top 25%
Cell D3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50) # Top 50%
# Create pipelines with each
Cell E1: =ML.CLASSIFICATION.SVM()
Cell F1: =ML.PIPELINE(D1, E1)
Cell F2: =ML.PIPELINE(D2, E1)
Cell F3: =ML.PIPELINE(D3, E1)
# Cross-validate each
Cell G1: =ML.EVAL.CV_SCORE(F1, B1, C1, 5, "accuracy")
Cell G2: =ML.EVAL.CV_SCORE(F2, B1, C1, 5, "accuracy")
Cell G3: =ML.EVAL.CV_SCORE(F3, B1, C1, 5, "accuracy")
# Compare average scores
Cell H1: =AVERAGE(G1#) # 10% features
Cell H2: =AVERAGE(G2#) # 25% features
Cell H3: =AVERAGE(G3#) # 50% features
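In sklearn this sweep is a short loop; putting the selector inside the pipeline means it is refit within every CV fold, avoiding leakage:

```python
# Sketch: compare percentiles by 5-fold cross-validated accuracy.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
for pct in (10, 25, 50):
    pipe = Pipeline([("select", SelectPercentile(chi2, percentile=pct)),
                     ("model", SVC())])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(pct, scores.mean())  # average accuracy per percentile
```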
Complete Preprocessing Pipeline
# Load messy data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:J1000, TRUE)
# 1. Impute missing values
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")
# 2. Scale features
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()
# 3. Select best features
Cell B3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("f_classif", 70) # chi2 would fail on scaled (negative) features
# 4. Train model
Cell B4: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)
# Build full pipeline
Cell C1: =ML.PIPELINE(B1, B2, B3, B4)
# Fit and evaluate
Cell D1: =ML.FIT(C1, X_train, y_train)
Cell E1: =ML.EVAL.SCORE(D1, X_test, y_test)
# Make predictions
Cell F1: =ML.PREDICT(D1, X_test)
Tips and Best Practices
Imputation Strategies
Choosing Imputation Strategy
- Mean: Symmetric distributions, no outliers
- Median: Skewed distributions, has outliers
- Most Frequent: Categorical data, discrete values
- Constant: Domain knowledge suggests specific value
When to Impute
- Before any other preprocessing step
- Preferably: fit the imputer on the training split, then transform both splits
- Simpler alternative: impute before splitting (accepts a small leakage risk)
Imputation Best Practices
- ✅ Understand why data is missing
- ✅ Use median for outlier-prone data
- ✅ Document imputation choices
- ✅ Check imputation impact on results
- ❌ Don’t ignore missing data patterns
- ❌ Don’t use mean with outliers
- ❌ Don’t impute without checking distribution
Imputation Order
Best: Load → Split → Impute (fit on train) → Scale → Model
Also OK: Load → Impute → Split → Scale → Model (small leakage risk)
Wrong: Load → Split → Scale → Impute (scalers fail on missing values)
Feature Selection
Choosing Percentile
- Start with 50% (half features)
- Use cross-validation to optimize
- Balance: fewer features vs accuracy
- Consider computational cost
Feature Selection Strategies
- 10-25%: Very high dimensionality
- 25-50%: Medium dimensionality
- 50-75%: Mild reduction
- 75-90%: Conservative approach
When to Use Feature Selection
- High-dimensional data (many features)
- Reduce overfitting
- Speed up training
- Improve interpretability
- Remove redundant features
Feature Selection Best Practices
- ✅ Always fit on training data only
- ✅ Use same selector for train and test
- ✅ Try different percentiles
- ✅ Validate with cross-validation
- ❌ Don’t select features on full dataset
- ❌ Don’t use different selections for train/test
Combined Strategies
Preprocessing Order
1. Handle missing values (impute)
2. Encode categorical features
3. Scale numerical features
4. Select features (if needed)
5. Train model
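A hedged sketch of this order in sklearn, with illustrative column names (the encoding step is the one the imputation-only examples above omit):

```python
# Sketch: impute -> encode -> scale -> select -> model, per column type.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]  # illustrative names
categorical_cols = ["region"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

pipe = Pipeline([("prep", preprocess),
                 ("select", SelectPercentile(f_classif, percentile=50)),
                 ("model", RandomForestClassifier(n_estimators=100))])
# pipe.fit(X_train, y_train)
```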
Pipeline Integration
- Put imputation first in pipeline
- Feature selection before model
- Use compose for column-specific imputation
- Validate entire pipeline with CV
Performance Tips
- Impute early (reduces missing value errors)
- Select features to speed up training
- Cache imputer/selector for production
- Monitor feature importance after selection
Common Patterns
Quick fix: IMPUTE(mean) → MODEL
Better: IMPUTE → SCALE → MODEL
Best: IMPUTE → SCALE → SELECT → MODEL
Column-wise: COMPOSE(imputers) → COMPOSE(scalers) → MODEL
Imputation Strategy Guide
| Data Type   | Distribution | Has Outliers | Strategy                     |
|-------------|--------------|--------------|------------------------------|
| Numeric     | Normal       | No           | Mean                         |
| Numeric     | Normal       | Yes          | Median                       |
| Numeric     | Skewed       | Any          | Median                       |
| Categorical | Any          | N/A          | Most Frequent                |
| Any         | Any          | N/A          | Constant (with domain value) |
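A two-line check of the table's mean-vs-median advice: one outlier drags the mean far from typical values while the median stays put.

```python
import numpy as np

ages = np.array([22, 25, 27, 30, 950])  # 950 is a data-entry outlier
print(np.mean(ages))    # 210.8, a misleading fill value
print(np.median(ages))  # 27.0, robust to the outlier
```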
Feature Selection Scoring Functions
| Function     | Use Case       | Data Requirements      |
|--------------|----------------|------------------------|
| chi2         | Classification | Non-negative features  |
| f_classif    | Classification | Any features           |
| f_regression | Regression     | Any features           |
| mutual_info  | Any            | Any features           |
Related Functions
- ML.DATA.DROP_MISSING_ROWS() - Alternative to imputation
- ML.COMPOSE Functions - Column-specific imputation
- ML.PREPROCESSING Functions - Feature scaling
- ML.PIPELINE() - Combine preprocessing steps