Classification Models Reference

Complete reference for FormulaML classification models, including Logistic Regression, SVM, and Random Forest classifiers.

Functions for creating and training classification models to predict categorical outcomes.

ML.CLASSIFICATION Namespace

ML.CLASSIFICATION.LOGISTIC()

Creates a Logistic Regression classifier for binary and multi-class classification.

Syntax:

=ML.CLASSIFICATION.LOGISTIC(C, penalty, fit_intercept, max_iter, tol)

Parameters:

  • C (Number, Optional): Inverse regularization strength (default: 1.0)
    • Smaller values = stronger regularization
    • Must be positive
  • penalty (String, Optional): Regularization type (default: "l2")
    • "l1": Lasso regularization
    • "l2": Ridge regularization
    • "elasticnet": Combination of L1 and L2
    • "none": No regularization
  • fit_intercept (Boolean, Optional): Add intercept to decision function (default: TRUE)
  • max_iter (Integer, Optional): Maximum iterations for convergence (default: 100)
  • tol (Number, Optional): Tolerance for stopping criteria (default: 0.0001)

Returns: Logistic Regression classifier object

Use Case: Binary or multi-class classification with linear decision boundaries

Example:

# Basic logistic regression
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Result: <LogisticRegression>

# With L1 regularization
Cell A2: =ML.CLASSIFICATION.LOGISTIC(0.5, "l1")

# Train model
Cell B1: =ML.FIT(A1, X_train, y_train)

# Make predictions
Cell C1: =ML.PREDICT(B1, X_test)
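
The result tag <LogisticRegression> and the matching parameter names and defaults suggest this function wraps scikit-learn's LogisticRegression. As a point of reference, here is a minimal Python sketch of the A2 example under that assumption (the dataset and solver choice are illustrative, not part of FormulaML):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Rough equivalent of =ML.CLASSIFICATION.LOGISTIC(0.5, "l1"); in scikit-learn
# the "l1" penalty needs a compatible solver such as liblinear or saga.
clf = LogisticRegression(C=0.5, penalty="l1", solver="liblinear",
                         fit_intercept=True, max_iter=100, tol=1e-4)
clf.fit(X_train, y_train)          # plays the role of =ML.FIT(A2, ...)
print(clf.predict(X_test)[:5])     # plays the role of =ML.PREDICT(...)
```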

ML.CLASSIFICATION.SVM()

Creates a Support Vector Machine (SVM) classifier with various kernel options.

Syntax:

=ML.CLASSIFICATION.SVM(C, kernel, degree, gamma, coef0)

Parameters:

  • C (Number, Optional): Regularization parameter (default: 1.0)
    • Larger values = less regularization
    • Must be positive
  • kernel (String, Optional): Kernel type (default: "rbf")
    • "linear": Linear kernel (for linearly separable data)
    • "poly": Polynomial kernel
    • "rbf": Radial basis function (most common)
    • "sigmoid": Sigmoid kernel
  • degree (Integer, Optional): Polynomial degree for "poly" kernel (default: 3)
  • gamma (String, Optional): Kernel coefficient (default: "scale")
    • "scale": 1 / (n_features * X.var())
    • "auto": 1 / n_features
  • coef0 (Number, Optional): Independent term for "poly"/"sigmoid" kernels (default: 0.0)

Returns: SVM classifier object

Use Case: Complex decision boundaries, high-dimensional data, kernel methods

Example:

# RBF kernel SVM (default)
Cell A1: =ML.CLASSIFICATION.SVM()
Result: <SVC>

# Linear SVM
Cell A2: =ML.CLASSIFICATION.SVM(1.0, "linear")

# Polynomial SVM
Cell A3: =ML.CLASSIFICATION.SVM(1.0, "poly", 3, "scale", 1.0)

# Train model
Cell B1: =ML.FIT(A1, X_train, y_train)
Cell C1: =ML.PREDICT(B1, X_test)
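
Like the logistic model, the <SVC> result tag and the parameter set mirror scikit-learn's SVC. A minimal sketch of the A3 polynomial example under that assumption (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Rough equivalent of =ML.CLASSIFICATION.SVM(1.0, "poly", 3, "scale", 1.0)
clf = SVC(C=1.0, kernel="poly", degree=3, gamma="scale", coef0=1.0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy, as ML.EVAL.SCORE reports
```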

ML.CLASSIFICATION.RANDOM_FOREST_CLF() ⭐

Creates a Random Forest Classifier (Premium feature).

Syntax:

=ML.CLASSIFICATION.RANDOM_FOREST_CLF(n_estimators, criterion, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, random_state)

Parameters:

  • n_estimators (Integer, Optional): Number of trees in forest (default: 100)
  • criterion (String, Optional): Split quality measure (default: "gini")
    • "gini": Gini impurity
    • "entropy": Information gain
    • "log_loss": Cross-entropy loss
  • max_depth (Integer, Optional): Maximum tree depth (default: None = unlimited)
  • min_samples_split (Integer, Optional): Min samples to split node (default: 2)
  • min_samples_leaf (Integer, Optional): Min samples at leaf (default: 1)
  • max_features (Number/String, Optional): Features per split (default: 1.0)
    • Integer: Exact number of features
    • Float: Fraction of features
    • "sqrt": Square root of total features
    • "log2": Log base 2 of total features
  • bootstrap (Boolean, Optional): Use bootstrap samples (default: TRUE)
  • random_state (Integer, Optional): Random seed for reproducibility

Returns: Random Forest Classifier object

Use Case: Complex patterns, feature importance, robust multi-class classification

Example:

# Basic Random Forest
Cell A1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF()
Result: <RandomForestClassifier>

# Optimized forest
Cell A2: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(200, "entropy", 15, 5, 2, "sqrt", TRUE, 42)

# Train and predict
Cell B1: =ML.FIT(A1, X_train, y_train)
Cell C1: =ML.PREDICT(B1, X_test)
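
Again assuming a scikit-learn backend (the result tag is <RandomForestClassifier>), the "optimized forest" in A2 corresponds to the sketch below; feature_importances_ illustrates the feature-importance benefit cited in the use case:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Rough equivalent of
# =ML.CLASSIFICATION.RANDOM_FOREST_CLF(200, "entropy", 15, 5, 2, "sqrt", TRUE, 42)
clf = RandomForestClassifier(n_estimators=200, criterion="entropy",
                             max_depth=15, min_samples_split=5,
                             min_samples_leaf=2, max_features="sqrt",
                             bootstrap=True, random_state=42)
clf.fit(X_train, y_train)
print(clf.feature_importances_)    # per-feature importance scores
```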

Common Patterns

Binary Classification

# Load Iris dataset (for a strictly binary problem, first filter to two of its three classes)
Cell A1: =ML.DATASETS.IRIS()

# Separate features and target
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4)

# Split train/test
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.3, 42, 0)  # Train X
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.3, 42, 1)  # Test X
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.3, 42, 0)  # Train y
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.3, 42, 1)  # Test y

# Create and train model
Cell F1: =ML.CLASSIFICATION.LOGISTIC(1.0, "l2")
Cell G1: =ML.FIT(F1, D1, E1)

# Predict and evaluate
Cell H1: =ML.PREDICT(G1, D2)
Cell I1: =ML.EVAL.SCORE(G1, D2, E2)
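
Note that the same seed (42) and split ratio are passed to all four TRAIN_TEST_SPLIT calls so the X and y partitions line up; the final argument selects which partition each cell returns. In scikit-learn terms (a sketch, not FormulaML's implementation), the whole D/E block collapses to one call:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# One call yields all four partitions; the sheet's repeated seed (42)
# is what keeps its separate X and y splits aligned with each other.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(C=1.0, penalty="l2",
                         max_iter=1000)   # raised from 100 so lbfgs converges on raw features
clf.fit(X_train, y_train)                 # G1
print(clf.score(X_test, y_test))          # I1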

SVM with Preprocessing Pipeline

# Create preprocessing and model
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.CLASSIFICATION.SVM(1.0, "rbf")

# Create pipeline
Cell B1: =ML.PIPELINE(A1, A2)

# Train pipeline
Cell C1: =ML.FIT(B1, X_train, y_train)

# Predict
Cell D1: =ML.PREDICT(C1, X_test)

# Get accuracy
Cell E1: =ML.EVAL.SCORE(C1, X_test, y_test)
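
If ML.PIPELINE behaves like a scikit-learn Pipeline, which the scaler-then-model ordering suggests, the key property is that the scaler is fit on the training data only and then reused at prediction time. A sketch under that assumption:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaler + SVM chained; fit() learns the scaling from X_train only,
# and score()/predict() apply that same scaling to X_test.
pipe = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```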

Multi-Class Classification

# Load digits dataset (10 classes)
Cell A1: =ML.DATASETS.DIGITS()

# Prepare data
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)      # Target

# Split data
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0)
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1)
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0)
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1)

# Create Random Forest for multi-class
Cell F1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100, "entropy", , , , "sqrt", TRUE, 42)
Cell G1: =ML.FIT(F1, D1, E1)

# Predict and evaluate
Cell H1: =ML.PREDICT(G1, D2)
Cell I1: =ML.EVAL.SCORE(G1, D2, E2)
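
For multi-class problems a single accuracy number can hide weak classes. Here is a hedged scikit-learn sketch of the same digits run, extended with a per-class report (classification_report is a scikit-learn utility, not a documented FormulaML function):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                             max_features="sqrt", random_state=42)
clf.fit(X_train, y_train)
# Precision/recall/F1 for each of the 10 digit classes.
print(classification_report(y_test, clf.predict(X_test)))
```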

Comparing Classification Models

# Create multiple classifiers
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Cell A2: =ML.CLASSIFICATION.SVM(1.0, "linear")
Cell A3: =ML.CLASSIFICATION.SVM(1.0, "rbf")
Cell A4: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)

# Train all models
Cell B1: =ML.FIT(A1, X_train, y_train)
Cell B2: =ML.FIT(A2, X_train, y_train)
Cell B3: =ML.FIT(A3, X_train, y_train)
Cell B4: =ML.FIT(A4, X_train, y_train)

# Compare accuracy scores
Cell C1: =ML.EVAL.SCORE(B1, X_test, y_test)  # Logistic
Cell C2: =ML.EVAL.SCORE(B2, X_test, y_test)  # Linear SVM
Cell C3: =ML.EVAL.SCORE(B3, X_test, y_test)  # RBF SVM
Cell C4: =ML.EVAL.SCORE(B4, X_test, y_test)  # Random Forest
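
The same comparison in Python form, assuming the scikit-learn mapping above; the point is that every model is scored on the identical train/test split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic":      LogisticRegression(max_iter=1000),
    "Linear SVM":    SVC(C=1.0, kernel="linear"),
    "RBF SVM":       SVC(C=1.0, kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    # Identical split for every model, mirroring cells B1:C4.
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
```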

Decision Boundary Visualization

# Train a classifier
Cell A1: =ML.CLASSIFICATION.SVM(1.0, "rbf")
Cell B1: =ML.FIT(A1, X_train, y_train)

# Extract decision boundary for first two features
Cell C1: =ML.INSPECT.DECISION_BOUNDARY(B1, X_train, "predict", 0.05, {0,1}, {0,1})

# Result is DataFrame with boundary coordinates
# Can be plotted in Excel scatter chart
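
The exact layout of ML.INSPECT.DECISION_BOUNDARY's DataFrame is not shown here, but the standard way to produce such data is to predict over a regular grid of the two chosen features. A sketch of that technique (the 0.05 step mirrors the formula's grid-resolution argument):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Train on the first two features only, so the boundary lives in 2-D.
X, y = load_iris(return_X_y=True)
X2 = X[:, :2]
clf = SVC(C=1.0, kernel="rbf").fit(X2, y)

# Predict the class at every point of a 0.05-spaced grid.
xx, yy = np.meshgrid(np.arange(X2[:, 0].min(), X2[:, 0].max(), 0.05),
                     np.arange(X2[:, 1].min(), X2[:, 1].max(), 0.05))
labels = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# (xx, yy, labels) are the coordinates a scatter or contour chart plots.
```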

Grid Search for Best Classifier

# Create SVM model
Cell A1: =ML.CLASSIFICATION.SVM()

# Parameter grid
# Model | Parameter | Value1 | Value2 | Value3
Cell B1: "model" | "C" | 0.1 | 1 | 10
Cell B2: "model" | "kernel" | "linear" | "rbf" | "poly"
Cell B3: "model" | "gamma" | "scale" | "auto" |

# Grid search with accuracy scoring
Cell C1: =ML.EVAL.GRID_SEARCH(A1, B1:F3, "accuracy", 5, TRUE)
Cell D1: =ML.FIT(C1, X_train, y_train)

# Get best parameters and score
Cell E1: =ML.EVAL.BEST_PARAMS(D1)
Cell F1: =ML.EVAL.BEST_SCORE(D1)

# Get detailed results
Cell G1: =ML.EVAL.SEARCH_RESULTS(D1)
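
Assuming ML.EVAL.GRID_SEARCH wraps scikit-learn's GridSearchCV, the B1:F3 range corresponds to the param_grid below: 3 C values x 3 kernels x 2 gamma settings = 18 candidates, each cross-validated 5-fold:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {"C": [0.1, 1, 10],
              "kernel": ["linear", "rbf", "poly"],
              "gamma": ["scale", "auto"]}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # cf. =ML.EVAL.BEST_PARAMS(D1)
print(search.best_score_)    # cf. =ML.EVAL.BEST_SCORE(D1)
```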

Tips and Best Practices

  1. Model Selection

    • Logistic Regression: Linear boundaries, interpretable
    • Linear SVM: Similar boundaries to logistic regression, but trained with a margin-based (hinge) loss
    • RBF SVM: Complex non-linear boundaries
    • Random Forest: Feature importance, robust to outliers
  2. Feature Scaling

    • Always scale for Logistic Regression and SVM (see the sketch after this list)
    • Not required for Random Forest
    • Use StandardScaler or MinMaxScaler
  3. SVM Kernel Selection

    • Start with RBF kernel (most versatile)
    • Use linear kernel for high-dimensional data
    • Polynomial for specific polynomial relationships
    • Tune C and gamma for RBF kernel
  4. Random Forest Optimization

    • More trees = better performance but slower
    • Limit max_depth to prevent overfitting
    • Use bootstrap=TRUE for better generalization
    • Set random_state for reproducibility
  5. Regularization

    • Higher C (Logistic/SVM) = less regularization
    • Lower C = more regularization, simpler model
    • Use cross-validation to find optimal C
  6. Evaluation Metrics

    • Use accuracy for balanced datasets
    • Consider precision/recall for imbalanced data
    • Compare multiple models on same test set
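
To make the scaling advice in tip 2 concrete, here is a small scikit-learn demonstration (the wine dataset is illustrative; its features span very different ranges, which is exactly when an unscaled SVM suffers):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same RBF SVM, with and without standardization of the features.
raw = SVC().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
print("unscaled:", raw.score(X_test, y_test))
print("scaled:  ", scaled.score(X_test, y_test))
```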