# FormulaML Custom Functions Reference for LLM Context ## Overview FormulaML is an Excel add-in that provides machine learning capabilities directly in Excel through custom functions. It uses scikit-learn under the hood, so all ML concepts, algorithms, and best practices from scikit-learn apply here. --- ## CRITICAL RULES - READ FIRST ### 1. NO NESTED FUNCTIONS WITH OBJECT HANDLES **This is the most important rule in FormulaML.** You CANNOT nest custom functions that accept or return object handles. Excel needs to return and cache the object first before it can be used by another function. **WRONG - This will FAIL:** ```excel =ML.FIT(ML.CLASSIFICATION.LOGISTIC(), A1:C100, D1:D100) =ML.PREDICT(ML.FIT(ML.CLASSIFICATION.LOGISTIC(), X, y), X_test) =ML.PIPELINE(ML.PREPROCESSING.STANDARD_SCALER(), ML.CLASSIFICATION.LOGISTIC()) ``` **CORRECT - Use separate cells and references:** ``` Cell A1: =ML.CLASSIFICATION.LOGISTIC() <- Creates model object Cell A2: =ML.FIT(A1, B1:D100, E1:E100) <- References A1 to fit Cell A3: =ML.PREDICT(A2, F1:H20) <- References fitted model in A2 ``` **CORRECT Pipeline Example:** ``` Cell B35: =ML.PREPROCESSING.STANDARD_SCALER() <- Create scaler object Cell B39: =ML.DIM_REDUCTION.PCA(15) <- Create PCA object Cell B43: =ML.CLASSIFICATION.LOGISTIC() <- Create classifier object Cell B48: =ML.PIPELINE(B35, B39, B43) <- Reference all three cells Cell B50: =ML.FIT(B48, B27, B29) <- Fit the pipeline ``` ### 2. DATAFRAME OUTPUT SPILLS INTO EXCEL CELLS Functions that return DataFrames (decorated with `@ret(pd.DataFrame)`) **SPILL** their values directly into Excel cells. They do NOT return object handles. The values occupy a range of cells starting from the formula cell. **Functions that SPILL data (output values to cells):** - `ML.DATA.SAMPLE()` - Spills DataFrame rows - `ML.DATA.INFO()` - Spills column information - `ML.DATA.DESCRIBE()` - Spills descriptive statistics - `ML.INSPECT.GET_PARAMS()` - Spills parameter table - `ML.INSPECT.DECISION_BOUNDARY()` - Spills boundary points - `ML.EVAL.BEST_PARAMS()` - Spills best parameters - `ML.EVAL.SEARCH_RESULTS()` - Spills all grid search results - `ML.DIM_REDUCTION.PCA.RESULTS()` - Spills PCA variance data **Functions that return OBJECT HANDLES (single cell with icon):** - `ML.DATASETS.*` - Returns Dataset object handle - `ML.DATA.CONVERT_TO_DF()` - Returns DataFrame object handle - `ML.DATA.SELECT_COLUMNS()` - Returns DataFrame object handle - `ML.PREPROCESSING.TRAIN_TEST_SPLIT()` - Returns DataFrame object handle - `ML.PREPROCESSING.*_SCALER()` - Returns transformer object handle - `ML.CLASSIFICATION.*`, `ML.REGRESSION.*`, `ML.CLUSTERING.*` - Returns model object handles - `ML.FIT()` - Returns fitted object handle - `ML.TRANSFORM()` - Returns transformed data object handle - `ML.PIPELINE()` - Returns Pipeline object handle - `ML.EVAL.GRID_SEARCH()` - Returns GridSearchCV object handle ### 3. NEVER RE-REFERENCE OBJECT HANDLES **Object handles are bound to their original cell.** You CANNOT copy or re-reference an object handle to another cell and use that reference as input to functions that accept object handles. This is because object handles are stored in Redis with a unique key tied to the original cell address. When you reference another cell (e.g., `=A30`), Excel copies the display value, NOT the underlying object reference. **WRONG - This will FAIL:** ``` Cell A30: =ML.CLASSIFICATION.LOGISTIC() <- Creates model object handle Cell A36: =A30 <- WRONG! 
This just copies display text, not the object
Cell A40: =ML.FIT(A36, X, y) <- FAILS! A36 is not a valid object handle
```

**CORRECT - Always reference the original cell:**
```
Cell A30: =ML.CLASSIFICATION.LOGISTIC() <- Creates model object handle
Cell A40: =ML.FIT(A30, X, y) <- CORRECT! References original cell A30 directly
```

**Key principle:** Always trace back to the cell where the object was originally created. Never use intermediate cell references for object handles.

---

### 4. USE ML.DATA.SHAPE() TO PLAN SPACE FOR SPILLING

Before using a function that spills data, use `ML.DATA.SHAPE()` to determine how much space the output will need:

```
Cell A1: =ML.DATASETS.IRIS()      <- Dataset object handle
Cell B1: =ML.DATA.SHAPE(A1)       <- Returns [150, 5] meaning 150 rows, 5 columns
Cell A3: =ML.DATA.SAMPLE(A1, -1)  <- Now you know this will spill 151 rows x 5 columns (1 header row + 150 data rows)
Leave rows 3-153, columns A-E clear!
```

**Always leave adequate space below and to the right of spilling functions to avoid #SPILL! errors.**

---

## Key Concept: Object Handles

FormulaML uses **object handles** to manage complex Python objects (DataFrames, ML models, transformers, etc.) within Excel. When a function returns an `object` type (not decorated to return DataFrame values), Excel displays a small identifier (e.g., a cell showing an icon with text like "Pipeline" or "Dataset") that represents the actual Python object stored on the server.

### How Object Handles Work:
1. **Creation**: Functions that return `object` create a handle stored in the object cache (Redis)
2. **Usage**: Pass the cell containing the handle to other functions that accept `object` parameters
3. **Inspection**: Use `ML.INSPECT.GET_PARAMS()` to see the parameters of any model/transformer object
4. **Viewing Data**: Use `ML.DATA.SAMPLE()` to view the contents of DataFrame objects

### Supported Object Types (what can be stored as handles):
- **DataFrames**: pandas DataFrame objects (from datasets, transformations, etc.)
- **Estimators** (Models): LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, SVC, RandomForestClassifier, RandomForestRegressor, KMeans, DecisionTreeClassifier, DecisionTreeRegressor
- **Transformers**: StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, OrdinalEncoder, SimpleImputer, SelectPercentile
- **Dimensionality Reducers**: PCA, KernelPCA
- **Pipelines**: sklearn Pipeline objects (chains of transformers and estimators)
- **GridSearchCV**: Hyperparameter search results
- **Column Transformers**: ColumnTransformer objects for applying different transformations to different columns
- **NumPy arrays**: np.ndarray objects

---

## Function Naming Convention

All functions follow this pattern:
```
ML.[SUB_NAMESPACE].[FUNCTION_NAME](arguments)
```

Examples:
- `ML.FIT(model, X, y)` - No sub-namespace
- `ML.CLASSIFICATION.LOGISTIC()` - Sub-namespace is CLASSIFICATION
- `ML.DIM_REDUCTION.PCA.RESULTS(pca_obj)` - Nested sub-namespace

---

## Sequential Cell Reference Pattern (CRITICAL)

Because objects must be cached first, you MUST use sequential cell references.
Here's the correct workflow pattern:

### Dataset Loading Flow:
```
Step 1: Cell B4  = =ML.DATASETS.IRIS()                  <- Returns Dataset handle
Step 2: Cell A7  = =ML.DATA.SAMPLE(B4, 5)               <- SPILLS header row + 5 rows of data
Step 3: Cell B19 = =ML.DATA.SELECT_COLUMNS(B4, B16:B17) <- Returns features DataFrame handle
                                                           (B16:B17 hold feature names copied from the spilled headers)
Step 4: Cell B20 = =ML.DATA.SELECT_COLUMNS(B4, E7)      <- Returns target DataFrame handle (E7 is the spilled target header)
```

### Model Training Flow (continuing from the loading flow above):
```
Step 1: Cell B22 = =ML.CLASSIFICATION.LOGISTIC()        <- Returns unfitted model handle
Step 2: Cell B23 = =ML.FIT(B22, B19, B20)               <- Returns fitted model handle (B19 = X, B20 = y)
Step 3: Cell B24 = =ML.PREDICT(B23, B19)                <- Returns predictions handle (any X with matching columns works)
Step 4: Cell B25 = =ML.EVAL.SCORE(B23, B19, B20)        <- Returns float score
```

### Preprocessing Flow:
```
Step 1: Cell B35 = =ML.PREPROCESSING.STANDARD_SCALER()  <- Returns unfitted scaler handle
Step 2: Cell B36 = =ML.FIT(B35, B27)                    <- Returns fitted scaler handle (B27 = a features DataFrame handle)
Step 3: Cell B37 = =ML.TRANSFORM(B36, B27)              <- Returns scaled data handle
```

### Pipeline Flow:
```
Step 1: Cell B35 = =ML.PREPROCESSING.STANDARD_SCALER()  <- Create scaler
Step 2: Cell B39 = =ML.DIM_REDUCTION.PCA(15)            <- Create PCA (15 components)
Step 3: Cell B43 = =ML.CLASSIFICATION.LOGISTIC()        <- Create classifier
Step 4: Cell B48 = =ML.PIPELINE(B35, B39, B43)          <- Combine into Pipeline
Step 5: Cell B50 = =ML.FIT(B48, B27, B29)               <- Fit entire pipeline
Step 6: Cell B52 = =ML.EVAL.SCORE(B50, B28, B30)        <- Evaluate on test set
```

---

## Complete Function Reference

### Core ML Methods (ML.*)

These functions work with any scikit-learn compatible estimator or transformer.

#### ML.FIT(model, X, [y])

Fits an estimator or transformer to the data.

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** An unfitted model/transformer object (e.g., from ML.CLASSIFICATION.LOGISTIC, ML.PREPROCESSING.STANDARD_SCALER, etc.) |
| X | object | **Required.** Feature data - typically a DataFrame or array from ML.DATA.SELECT_COLUMNS or ML.DATA.CONVERT_TO_DF |
| y | object | **Optional.** Target variable for supervised learning models. Not needed for transformers like scalers. |

**Returns:** object (fitted model/transformer)

**Accepts as `model`:**
- Classification models: LogisticRegression, SVC, RandomForestClassifier
- Regression models: LinearRegression, Ridge, Lasso, ElasticNet, RandomForestRegressor
- Clustering models: KMeans
- Transformers: StandardScaler, MinMaxScaler, RobustScaler, PCA, KernelPCA, OneHotEncoder, OrdinalEncoder, SimpleImputer
- Pipelines: Created via ML.PIPELINE
- GridSearchCV: Created via ML.EVAL.GRID_SEARCH

**REMEMBER:** Cannot nest! Create the model in a separate cell first.

---

#### ML.TRANSFORM(model, X, [y])

Transforms data using a fitted transformer.

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** A fitted transformer (e.g., fitted StandardScaler, PCA, OneHotEncoder) |
| X | object | **Required.** Data to transform |
| y | object | **Optional.** Target variable (rarely used, only for certain transformers) |

**Returns:** object (transformed data, typically a DataFrame or array)

**Accepts as `model`:**
- Fitted transformers: StandardScaler, MinMaxScaler, RobustScaler, PCA, KernelPCA, OneHotEncoder, OrdinalEncoder, SimpleImputer, SelectPercentile
- Fitted pipelines containing transformers
- Fitted ColumnTransformer

---

#### ML.FIT_TRANSFORM(model, X, [y])

Fits and transforms data in one step. Equivalent to calling FIT then TRANSFORM.
| Parameter | Type | Description | |-----------|------|-------------| | model | object | **Required.** An unfitted transformer | | X | object | **Required.** Data to fit and transform | | y | object | **Optional.** Target variable | **Returns:** object (transformed data) --- #### ML.PREDICT(model, X) Generates predictions using a fitted model. | Parameter | Type | Description | |-----------|------|-------------| | model | object | **Required.** A fitted estimator (classifier, regressor, or clustering model) | | X | object | **Required.** Feature data for prediction | **Returns:** object (predictions - array or Series) **Accepts as `model`:** - Fitted classifiers: LogisticRegression, SVC, RandomForestClassifier - Fitted regressors: LinearRegression, Ridge, Lasso, ElasticNet, RandomForestRegressor - Fitted clustering: KMeans - Fitted pipelines ending with an estimator - Fitted GridSearchCV (uses best estimator) --- #### ML.PIPELINE(step1, step2, ..., stepN) Creates a scikit-learn Pipeline from multiple transformers and/or a final estimator. | Parameter | Type | Description | |-----------|------|-------------| | *args | object | One or more transformer/estimator objects. The last step can be an estimator. | **Returns:** object (Pipeline) **CRITICAL:** Each step must be a cell reference to an already-created object. Cannot nest! **Typical Usage Pattern:** ``` Cell A1: =ML.PREPROCESSING.STANDARD_SCALER() Cell A2: =ML.CLASSIFICATION.LOGISTIC() Cell A3: =ML.PIPELINE(A1, A2) <- References A1 and A2 ``` **Important:** Pipeline steps are executed in order. Typically: 1. First steps: Transformers (scalers, encoders, imputers, PCA) 2. Last step: Estimator (classifier, regressor, clustering model) --- #### ML.OBJECT_INFO(obj) Returns string representation of an object's internal state. | Parameter | Type | Description | |-----------|------|-------------| | obj | object | Any object handle | **Returns:** string --- ### Classification Models (ML.CLASSIFICATION.*) #### ML.CLASSIFICATION.LOGISTIC([C], [penalty], [fit_intercept], [max_iter], [tol]) Creates a Logistic Regression classifier. Uses scikit-learn's LogisticRegression. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | C | float | 1.0 | Inverse of regularization strength. Smaller values = stronger regularization. | | penalty | string | "l2" | Norm for penalization: 'l1', 'l2', 'elasticnet', or 'none' | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | | max_iter | integer | 100 | Maximum iterations for solver convergence | | tol | float | 0.0001 | Tolerance for stopping criteria | **Returns:** object (unfitted LogisticRegression) --- #### ML.CLASSIFICATION.SVM([C], [kernel], [degree], [gamma], [coef0]) Creates a Support Vector Machine classifier. Uses scikit-learn's SVC. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | C | float | 1.0 | Regularization parameter. Larger values = less regularization. 
| | kernel | string | "rbf" | Kernel type: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' | | degree | integer | 3 | Degree for 'poly' kernel (ignored by others) | | gamma | string | "scale" | Kernel coefficient: 'scale' or 'auto' | | coef0 | float | 0.0 | Independent term (significant in 'poly' and 'sigmoid') | **Returns:** object (unfitted SVC) --- #### ML.CLASSIFICATION.RANDOM_FOREST_CLF([n_estimators], [criterion], [max_depth], [min_samples_split], [min_samples_leaf], [max_features], [bootstrap], [random_state]) - Premium Creates a Random Forest classifier. Uses scikit-learn's RandomForestClassifier. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_estimators | integer | 100 | Number of trees in the forest | | criterion | string | "gini" | Split quality measure: 'gini', 'entropy', 'log_loss' | | max_depth | integer | None | Maximum tree depth. None = nodes expand until all leaves are pure. | | min_samples_split | integer | 2 | Minimum samples required to split a node | | min_samples_leaf | integer | 1 | Minimum samples required at a leaf node | | max_features | float/int/string | 1.0 | Features to consider for best split | | bootstrap | boolean | TRUE | Whether to use bootstrap samples | | random_state | integer | None | Random seed for reproducibility | **Returns:** object (unfitted RandomForestClassifier) --- ### Regression Models (ML.REGRESSION.*) #### ML.REGRESSION.LINEAR([fit_intercept]) Creates a Linear Regression model. Uses scikit-learn's LinearRegression. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted LinearRegression) --- #### ML.REGRESSION.RIDGE([alpha], [fit_intercept]) Creates a Ridge Regression model (L2 regularization). Uses scikit-learn's Ridge. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | alpha | float | 1.0 | Regularization strength. Larger = stronger regularization. | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted Ridge) --- #### ML.REGRESSION.LASSO([alpha], [fit_intercept]) Creates a Lasso Regression model (L1 regularization). Uses scikit-learn's Lasso. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | alpha | float | 1.0 | Regularization strength. Larger = stronger regularization. | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted Lasso) --- #### ML.REGRESSION.ELASTIC_NET([alpha], [l1_ratio], [fit_intercept]) Creates an Elastic Net model (combined L1 and L2). Uses scikit-learn's ElasticNet. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | alpha | float | 1.0 | Regularization strength | | l1_ratio | float | 0.5 | Mix ratio: 0 = L2 only, 1 = L1 only | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted ElasticNet) --- #### ML.REGRESSION.RANDOM_FOREST_REG([n_estimators], [criterion], [max_depth], [min_samples_split], [min_samples_leaf], [max_features], [bootstrap], [random_state]) - Premium Creates a Random Forest regressor. Uses scikit-learn's RandomForestRegressor. 
| Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_estimators | integer | 100 | Number of trees | | criterion | string | "squared_error" | Split quality: 'squared_error', 'absolute_error', 'friedman_mse', 'poisson' | | max_depth | integer | None | Maximum tree depth | | min_samples_split | integer | 2 | Minimum samples to split | | min_samples_leaf | integer | 1 | Minimum samples at leaf | | max_features | float/int/string | 1.0 | Features for best split | | bootstrap | boolean | TRUE | Use bootstrap samples | | random_state | integer | None | Random seed | **Returns:** object (unfitted RandomForestRegressor) --- ### Clustering Models (ML.CLUSTERING.*) #### ML.CLUSTERING.KMEANS([n_clusters], [init], [n_init], [max_iter], [tol], [random_state], [algorithm]) Creates a K-Means clustering model. Uses scikit-learn's KMeans. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_clusters | integer | 8 | Number of clusters to create | | init | string | "k-means++" | Initialization: 'k-means++' (recommended) or 'random' | | n_init | integer/string | "auto" | Number of runs with different initializations | | max_iter | integer | 300 | Maximum iterations per run | | tol | float | 0.0001 | Convergence tolerance | | random_state | integer | None | Random seed for reproducibility | | algorithm | string | "lloyd" | Algorithm: 'lloyd' or 'elkan' | **Returns:** object (unfitted KMeans) **Usage Note:** After fitting, use ML.PREDICT to get cluster labels, or ML.TRANSFORM to get distances to cluster centers. --- ### Preprocessing (ML.PREPROCESSING.*) #### ML.PREPROCESSING.TRAIN_TEST_SPLIT(data, test_size, random_state, dataset_type) Splits data into training and testing sets. | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame to split | | test_size | float | **Required.** Fraction for test set (e.g., 0.2 = 20% test) | | random_state | integer | **Required.** Random seed for reproducibility | | dataset_type | integer | **Required.** 0 = training set, 1 = test set | **Returns:** object (DataFrame - either train or test split) **Important:** Call this function FOUR TIMES to get all splits: ``` X_train: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, 0.2, 42, 0) X_test: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, 0.2, 42, 1) y_train: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y, 0.2, 42, 0) y_test: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y, 0.2, 42, 1) ``` --- #### ML.PREPROCESSING.STANDARD_SCALER() Creates a StandardScaler that standardizes features by removing mean and scaling to unit variance. **Returns:** object (unfitted StandardScaler) **When to use:** When features have different scales and you want zero mean and unit variance. Good for SVM, Logistic Regression, Neural Networks. --- #### ML.PREPROCESSING.MIN_MAX_SCALER() Creates a MinMaxScaler that scales features to [0, 1] range. **Returns:** object (unfitted MinMaxScaler) **When to use:** When you need bounded values, especially for algorithms that don't assume any distribution. --- #### ML.PREPROCESSING.ROBUST_SCALER() Creates a RobustScaler that uses median and IQR, robust to outliers. **Returns:** object (unfitted RobustScaler) **When to use:** When your data has outliers that shouldn't dominate the scaling. --- #### ML.PREPROCESSING.ONE_HOT_ENCODER([handle_unknown]) Creates a OneHotEncoder for categorical variables. 
| Parameter | Type | Default | Description | |-----------|------|---------|-------------| | handle_unknown | string | "error" | How to handle unknown categories: 'error' or 'ignore' | **Returns:** object (unfitted OneHotEncoder) **When to use:** For nominal categorical variables (no inherent order) like colors, countries, etc. --- #### ML.PREPROCESSING.ORDINAL_ENCODER([handle_unknown]) Creates an OrdinalEncoder for categorical variables. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | handle_unknown | string | "error" | How to handle unknown categories: 'error' or 'ignore' | **Returns:** object (unfitted OrdinalEncoder) **When to use:** For ordinal categorical variables (with inherent order) like education level, size (S/M/L), etc. --- ### Dimensionality Reduction (ML.DIM_REDUCTION.*) #### ML.DIM_REDUCTION.PCA([n_components], [whiten], [svd_solver], [tol], [iterated_power], [n_oversamples], [power_iteration_normalizer], [random_state]) Creates a Principal Component Analysis transformer. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_components | integer | None | Components to keep. None = all. Use 'mle' for automatic. | | whiten | boolean | FALSE | Scale to uncorrelated outputs with unit variance | | svd_solver | string | "auto" | Solver: 'auto', 'full', 'arpack', 'randomized' | | tol | float | 0.0 | Tolerance for 'arpack' solver | | iterated_power | string | "auto" | Power iterations for 'randomized' | | n_oversamples | integer | 10 | Extra random vectors for 'randomized' | | power_iteration_normalizer | string | "auto" | Normalizer: 'auto', 'QR', 'LU', 'none' | | random_state | integer | None | Random seed | **Returns:** object (unfitted PCA) --- #### ML.DIM_REDUCTION.PCA.RESULTS(pca_obj) Returns PCA results after fitting. **This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | pca_obj | object | **Required.** A fitted PCA object | **Returns:** DataFrame (SPILLS) with columns: Component, Explained Variance, Explained Variance Ratio, Singular Values **Space needed:** Rows = number of components + 1 header, Columns = 4 --- #### ML.DIM_REDUCTION.KERNEL_PCA([n_components], [kernel], [degree], [gamma], [coef0]) - Premium Creates a Kernel PCA transformer for nonlinear dimensionality reduction. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_components | integer | None | Components to keep | | kernel | string | "linear" | Kernel: 'linear', 'poly', 'rbf', 'sigmoid', 'cosine' | | degree | integer | 3 | Degree for 'poly' kernel | | gamma | float | 1.0 | Kernel coefficient for 'rbf', 'poly', 'sigmoid' | | coef0 | float | 0.0 | Independent term for 'poly' and 'sigmoid' | **Returns:** object (unfitted KernelPCA) --- ### Data Handling (ML.DATA.*) #### ML.DATA.CONVERT_TO_DF(table) Converts an Excel table or range to a DataFrame object handle. | Parameter | Type | Description | |-----------|------|-------------| | table | range | **Required.** Excel range with headers in the first row | **Returns:** object (DataFrame handle) --- #### ML.DATA.INFO(data) Returns information about a DataFrame (column names, non-null counts, data types). 
**This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** A DataFrame object | **Returns:** DataFrame (SPILLS to cells showing column info) **Space needed:** Rows = number of columns + 1 header, Columns = 3 --- #### ML.DATA.SAMPLE(data, [n], [head], [tail], [seed], [missing_values]) Returns a sample of a DataFrame or views it completely. **This function SPILLS data.** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | data | object | **Required.** DataFrame or array object | | n | integer | 10 | Number of rows. Use -1 for complete data. | | head | boolean | TRUE | Return first n rows | | tail | boolean | FALSE | Return last n rows | | seed | integer | None | Random seed for random sampling | | missing_values | string | "#N/A" | Excel error value for missing data | **Returns:** DataFrame (SPILLS to cells) **Space needed:** Use ML.DATA.SHAPE first to determine rows and columns! **Usage:** This is how you "view" the contents of a DataFrame object handle. Use n=-1 to get ALL data for charting. --- #### ML.DATA.SELECT_COLUMNS(data, columns) Selects specific columns from a DataFrame. | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame object | | columns | range | **Required.** Column names to select (can be a range with column headers) | **Returns:** object (DataFrame with selected columns - object handle, not spilled) **Usage:** Essential for separating features (X) from target (y). Reference the header cells from your spilled data preview. **Example:** ``` Cell A7: =ML.DATA.SAMPLE(B4, 5) <- Spills data with headers in A7:E7 Cell B19: =ML.DATA.SELECT_COLUMNS(B4, A7:D7) <- Selects columns A-D as features Cell B20: =ML.DATA.SELECT_COLUMNS(B4, E7) <- Selects column E as target ``` --- #### ML.DATA.DESCRIBE(data) Returns descriptive statistics of a DataFrame. **This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame object | **Returns:** DataFrame (SPILLS: count, mean, std, min, quartiles, max) **Space needed:** 9 rows x (number of numeric columns + 1) --- #### ML.DATA.SHAPE(data) Returns the shape of a DataFrame as [rows, columns]. **Use this to plan space for spilling functions.** | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame object | **Returns:** tuple [n_rows, n_columns] (spills to 2 cells) **Critical Usage:** Call this BEFORE using ML.DATA.SAMPLE with n=-1 to know how much space to leave! --- #### ML.DATA.DROP_MISSING_ROWS(data, [how], [thresh], [subset]) Removes rows with missing values. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | data | object | **Required.** DataFrame object | | how | string | "any" | 'any' = drop if any NA, 'all' = drop only if all NA | | thresh | integer | None | Keep rows with at least this many non-NA values | | subset | string | None | Column to check for missing values | **Returns:** object (DataFrame without missing rows) --- ### Datasets (ML.DATASETS.*) Built-in datasets for learning and testing. All return object handles (not spilled data). #### ML.DATASETS.IRIS() Loads the Iris dataset (150 samples, 4 features + 1 target, 3 classes). 
**Returns:** object (Dataset/DataFrame handle)
**Features:** sepal length, sepal width, petal length, petal width
**Target:** target (0=setosa, 1=versicolor, 2=virginica)
**Use case:** Classification

---

#### ML.DATASETS.DIGITS()

Loads the Digits dataset (1797 samples, 64 features + 1 target, 10 classes).

**Returns:** object (Dataset/DataFrame handle)
**Features:** 8x8 pixel values (64 features named pixel_0_0 through pixel_7_7)
**Target:** target (0-9)
**Use case:** Classification, dimensionality reduction

---

#### ML.DATASETS.DIABETES()

Loads the Diabetes dataset (442 samples, 10 features + 1 target).

**Returns:** object (Dataset/DataFrame handle)
**Features:** age, sex, bmi, bp, s1, s2, s3, s4, s5, s6
**Target:** target (disease progression measure)
**Use case:** Regression

---

#### ML.DATASETS.OPENML(name, [version]) - Premium

Loads any dataset from the OpenML repository.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | string | - | **Required.** Dataset name on OpenML |
| version | string | "active" | Dataset version |

**Returns:** object (Dataset/DataFrame handle)

---

### Evaluation (ML.EVAL.*)

#### ML.EVAL.SCORE(model, X, y)

Returns the default score for a fitted model (R² for regression, accuracy for classification).

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** Fitted model |
| X | object | **Required.** Test features |
| y | object | **Required.** True target values |

**Returns:** float (single cell value)

---

#### ML.EVAL.CV_SCORE(model, X, y, cv, scoring) - Premium

Performs cross-validation and returns scores.

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** Unfitted model |
| X | object | **Required.** Features |
| y | object | **Required.** Target |
| cv | integer | **Required.** Number of folds |
| scoring | string | **Required.** Scoring metric (e.g., 'accuracy', 'r2', 'neg_mean_squared_error') |

**Returns:** array of scores (one per fold)

---

#### ML.EVAL.GRID_SEARCH(model, param_grid, [scoring], [cv], [refit]) - Premium

Creates a GridSearchCV for hyperparameter tuning.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | object | - | **Required.** Unfitted model or pipeline |
| param_grid | range | - | **Required.** Parameter grid (see format below) |
| scoring | string/list | None | Scoring metric(s) |
| cv | integer | None | Cross-validation folds |
| refit | boolean | TRUE | Refit best model on full data |

**Returns:** object (unfitted GridSearchCV - must be fitted with ML.FIT)

**param_grid format:**
- For regular models: columns are [Model, Parameter, Value1, Value2, ...]
- For pipelines: columns are [StepName, Parameter, Value1, Value2, ...]

**Example param_grid for pipeline:**
```
| Object             | Parameter    | Value 1 | Value 2 | Value 3 |
|--------------------|--------------|---------|---------|---------|
| pca                | n_components | 5       | 15      | 30      |
| logisticregression | C            | 0.01    | 1.0     | 100.0   |
```

---

#### ML.EVAL.BEST_PARAMS(model) - Premium

Returns the best parameters from a fitted GridSearchCV. **This function SPILLS data.**

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** Fitted GridSearchCV object |

**Returns:** DataFrame (SPILLS) with columns [Model, Parameter, Value]

---

#### ML.EVAL.BEST_SCORE(model) - Premium

Returns the best score from a fitted GridSearchCV.
| Parameter | Type | Description | |-----------|------|-------------| | model | object | **Required.** Fitted GridSearchCV object | **Returns:** float (single cell value) --- #### ML.EVAL.SEARCH_RESULTS(grid_search) - Premium Returns complete cross-validation results from GridSearchCV. **This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | grid_search | object | **Required.** Fitted GridSearchCV object | **Returns:** DataFrame (SPILLS) with all parameter combinations and scores **Space needed:** Rows = (all parameter combinations) + 1 header, many columns --- ### Inspection (ML.INSPECT.*) #### ML.INSPECT.GET_PARAMS(obj, [transpose]) Gets parameters of any scikit-learn compatible object. **This function SPILLS data.** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | obj | object | **Required.** Any model, transformer, or pipeline | | transpose | boolean | FALSE | Return wide format (parameters as columns) | **Returns:** DataFrame (SPILLS) with parameters **For pipelines:** Returns [Model, Parameter, Value] columns showing each step's parameters. **For single objects:** Returns [Parameter, Value] columns. **This is the key function for inspecting what's inside an object handle.** --- #### ML.INSPECT.DECISION_BOUNDARY(model, X, [response_method], [mesh_step], [class_pair], [feature_indices], [other_features_values], [margin]) Extracts decision boundary contour points for visualization. **This function SPILLS data.** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | model | object | **Required.** Fitted classifier | | X | object | **Required.** Feature data (used for axis ranges) | | response_method | string | "predict" | 'predict' or 'decision_function' | | mesh_step | float | 0.05 | Grid resolution | | class_pair | list | [0, 1] | Class pair for multiclass decision_function | | feature_indices | tuple | (0, 1) | Which two features to use | | other_features_values | dict | None | Values for other features (uses mean if None) | | margin | float | 0.1 | Margin around data range | **Returns:** DataFrame (SPILLS) with boundary point coordinates --- ### Imputation (ML.IMPUTE.*) #### ML.IMPUTE.SIMPLE_IMPUTER(strategy, [fill_value]) Creates a SimpleImputer for handling missing values. | Parameter | Type | Description | |-----------|------|-------------| | strategy | string | **Required.** Strategy: 'mean', 'median', 'most_frequent', 'constant' | | fill_value | any | Value to use when strategy='constant' | **Returns:** object (unfitted SimpleImputer) --- ### Feature Selection (ML.FEATURE_SELECTION.*) #### ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile) Creates a SelectPercentile transformer. | Parameter | Type | Description | |-----------|------|-------------| | score_func | string | **Required.** Scoring function: 'chi2' | | percentile | integer | **Required.** Percentile of features to keep (0-100) | **Returns:** object (unfitted SelectPercentile) --- ### Compose (ML.COMPOSE.*) For applying different transformations to different columns. #### ML.COMPOSE.COLUMN_TRANSFORMER(transformer, cols) Pairs a transformer with specific columns. 
| Parameter | Type | Description | |-----------|------|-------------| | transformer | object | **Required.** A transformer object or ML.COMPOSE.TRANSFORMERS.DROP/PASSTHROUGH | | cols | range/string | **Required.** Column name(s) to apply transformer to | **Returns:** object (ColumnTransformer specification) --- #### ML.COMPOSE.DATA_TRANSFORMER(col_transformer1, col_transformer2, ...) Combines multiple column transformers into one. | Parameter | Type | Description | |-----------|------|-------------| | *args | object | One or more ML.COMPOSE.COLUMN_TRANSFORMER results | **Returns:** object (ColumnTransformer) --- #### ML.COMPOSE.COLUMN_SELECTOR(pattern, dtypes) Selects columns by pattern or data type. | Parameter | Type | Description | |-----------|------|-------------| | pattern | string | Regex pattern for column names | | dtypes | list | Data types to match | **Returns:** object (column selector) --- #### ML.COMPOSE.TRANSFORMERS.DROP() Returns a "drop" transformer that removes columns. **Returns:** object (drop specification) --- #### ML.COMPOSE.TRANSFORMERS.PASSTHROUGH() Returns a "passthrough" transformer that keeps columns unchanged. **Returns:** object (passthrough specification) --- ## Common Workflows with Cell References ### Basic Classification Workflow ``` Cell B4: =ML.DATASETS.IRIS() <- Load dataset (object handle) Cell A7: =ML.DATA.SAMPLE(B4, 5) <- Preview data (SPILLS 6 rows x 5 cols) Cell B14: =ML.DATA.SHAPE(B4) <- Check size: [150, 5] Cell B16: =A7 <- Reference feature column header Cell B17: =D7 <- Reference another feature column header Cell B19: =ML.DATA.SELECT_COLUMNS(B4, B16:B17) <- Select features (object handle) Cell B20: =ML.DATA.SELECT_COLUMNS(B4, E7) <- Select target (object handle) Cell A30: =ML.CLASSIFICATION.LOGISTIC() <- Create model (object handle) Cell A31: =ML.FIT(A30, B19, B20) <- Fit model (object handle) Cell A32: =ML.EVAL.SCORE(A31, B19, B20) <- Evaluate (float value) ``` ### Pipeline with Train/Test Split ``` Cell B4: =ML.DATASETS.DIGITS() <- Load dataset Cell A7: =ML.DATA.SAMPLE(B4, 5) <- Preview (SPILLS) Cell B21: =ML.DATA.SELECT_COLUMNS(B4, A7:BL7) <- Select 64 features Cell B22: =ML.DATA.SELECT_COLUMNS(B4, BM7) <- Select target column Cell B24: 0.2 <- Test size parameter Cell B25: 42 <- Random seed parameter Cell B27: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B21, B24, B25, 0) <- X_train Cell B28: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B21, B24, B25, 1) <- X_test Cell B29: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B22, B24, B25, 0) <- y_train Cell B30: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B22, B24, B25, 1) <- y_test Cell B35: =ML.PREPROCESSING.STANDARD_SCALER() <- Create scaler Cell B39: =ML.DIM_REDUCTION.PCA(15) <- Create PCA (15 components) Cell B43: =ML.CLASSIFICATION.LOGISTIC() <- Create classifier Cell B48: =ML.PIPELINE(B35, B39, B43) <- Create pipeline (references!) 
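Cell D48: =ML.INSPECT.GET_PARAMS(B48)          <- Optional, illustrative extra step (not required): SPILLS the pipeline's parameters, showing the step/parameter names later used in grid search; D48 is an arbitrary free cell, leave rows below it clear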
Cell B50: =ML.FIT(B48, B27, B29)                <- Fit pipeline
Cell B52: =ML.EVAL.SCORE(B50, B28, B30)         <- Test score
```

### Grid Search Hyperparameter Optimization
```
Cell B100: =ML.PREPROCESSING.STANDARD_SCALER()  <- Scaler for grid search
Cell B101: =ML.DIM_REDUCTION.PCA()              <- PCA (no fixed components)
Cell B102: =ML.CLASSIFICATION.LOGISTIC()        <- Classifier (no fixed C)
Cell B104: =ML.PIPELINE(B100, B101, B102)       <- Pipeline for grid search

Parameter grid in A97:G98:
| pca                | n_components | 5    | 15  | 30  | 45   | 60    |
| logisticregression | C            | 0.01 | 0.1 | 1.0 | 10.0 | 100.0 |

Cell B106: =ML.EVAL.GRID_SEARCH(B104, A97:G98, "accuracy", 5) <- Create grid search
Cell B108: =ML.FIT(B106, B27, B29)              <- Fit grid search
Cell A113: =ML.EVAL.SEARCH_RESULTS(B108)        <- View all results (SPILLS many rows!)
Cell F141: =ML.EVAL.BEST_PARAMS(B108)           <- Best params (SPILLS 3-4 rows)
Cell G146: =ML.EVAL.BEST_SCORE(B108)            <- Best CV score (float)
Cell G148: =ML.EVAL.SCORE(B108, B28, B30)       <- Test score (float)
```

### Visualizing Data for Charts
```
Cell A36: Header row for data section
Cell A37: =ML.DATA.SAMPLE(B19, -1)  <- ALL X data (SPILLS 151 rows: header + 150 values!)
Cell C37: =ML.DATA.SAMPLE(B20, -1)  <- ALL y data (SPILLS 151 rows: header + 150 values!)

Now you can create charts using ranges A37:B187 for X and C37:C187 for y
```

---

### Testing Predictions with Custom Input Values

When users want to predict outcomes for specific input values (e.g., "What would be the predicted price for a house with 3 bedrooms and 2000 sqft?"), you need to create a properly formatted input DataFrame.

**CRITICAL:** The custom input data MUST:
1. Have the **same number of columns** as the original X data used for training
2. Use the **same column names/headers** as the original X data (in the same order)
3. Be **converted to a DataFrame object handle** using `ML.DATA.CONVERT_TO_DF()`

**Step-by-step workflow:**
```
Assume the model was trained on X with columns: [bedrooms, sqft, bathrooms, age]

Step 1: Create custom input table in Excel with headers matching original X:
Cell F1: bedrooms
Cell G1: sqft
Cell H1: bathrooms
Cell I1: age
Cell F2: 3
Cell G2: 2000
Cell H2: 2
Cell I2: 10
Cell F3: 4
Cell G3: 2500
Cell H3: 3
Cell I3: 5

Step 2: Convert Excel range to DataFrame object handle:
Cell K1: =ML.DATA.CONVERT_TO_DF(F1:I3)       <- Includes headers! Returns object handle

Step 3: Use the converted DataFrame with the fitted model:
Cell K2: =ML.PREDICT(fitted_model_cell, K1)  <- Returns predictions object handle

Step 4: View the predictions:
Cell K3: =ML.DATA.SAMPLE(K2, -1)             <- SPILLS prediction values
```

**Complete Example with Iris Classification:**
```
Assume model was trained on Iris features:
[sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)]

Step 1: Original training workflow (already done):
Cell B4: =ML.DATASETS.IRIS()
Cell B19: =ML.DATA.SELECT_COLUMNS(B4, A7:D7)  <- X features
Cell B20: =ML.DATA.SELECT_COLUMNS(B4, E7)     <- y target
Cell A30: =ML.CLASSIFICATION.LOGISTIC()
Cell A31: =ML.FIT(A30, B19, B20)              <- Fitted model

Step 2: Create custom test data in Excel (must match X structure):
Cell M1: sepal length (cm)  <- Header must match exactly!
Cell N1: sepal width (cm) Cell O1: petal length (cm) Cell P1: petal width (cm) Cell M2: 5.1 <- Test sample 1 Cell N2: 3.5 Cell O2: 1.4 Cell P2: 0.2 Cell M3: 6.7 <- Test sample 2 Cell N3: 3.0 Cell O3: 5.2 Cell P3: 2.3 Step 3: Convert and predict: Cell R1: =ML.DATA.CONVERT_TO_DF(M1:P3) <- Convert to DataFrame handle Cell R2: =ML.PREDICT(A31, R1) <- Get predictions handle Cell R3: =ML.DATA.SAMPLE(R2, -1) <- View predictions (SPILLS) ``` **Important Notes:** - The column headers in your custom input MUST match the original training data exactly (case-sensitive) - The number of columns MUST be identical to the original X - The number of rows can be 1 or many, depending on how many predictions you need - If the model was trained through a pipeline with preprocessing (e.g., StandardScaler), the pipeline will automatically apply the same transformations to your custom input - Always use `ML.DATA.CONVERT_TO_DF()` to convert the Excel range - do NOT pass raw Excel ranges directly to `ML.PREDICT()` --- ## Object Type Reference ### What Each Function Returns: | Function | Returns | Spills? | scikit-learn Type | |----------|---------|---------|-------------------| | ML.CLASSIFICATION.* | object handle | No | Unfitted classifier | | ML.REGRESSION.* | object handle | No | Unfitted regressor | | ML.CLUSTERING.* | object handle | No | Unfitted clusterer | | ML.PREPROCESSING.*_SCALER() | object handle | No | Unfitted transformer | | ML.PREPROCESSING.*_ENCODER() | object handle | No | Unfitted transformer | | ML.PREPROCESSING.TRAIN_TEST_SPLIT | object handle | No | DataFrame subset | | ML.DIM_REDUCTION.PCA | object handle | No | Unfitted PCA | | ML.DIM_REDUCTION.PCA.RESULTS | DataFrame | **Yes** | - | | ML.IMPUTE.SIMPLE_IMPUTER | object handle | No | Unfitted imputer | | ML.FEATURE_SELECTION.SELECT_PERCENTILE | object handle | No | Unfitted selector | | ML.PIPELINE | object handle | No | Pipeline | | ML.EVAL.GRID_SEARCH | object handle | No | GridSearchCV | | ML.FIT | object handle | No | Fitted version of input | | ML.TRANSFORM | object handle | No | Transformed data | | ML.FIT_TRANSFORM | object handle | No | Transformed data | | ML.PREDICT | object handle | No | Predictions | | ML.DATA.CONVERT_TO_DF | object handle | No | DataFrame | | ML.DATA.SELECT_COLUMNS | object handle | No | DataFrame subset | | ML.DATA.SAMPLE | DataFrame | **Yes** | - | | ML.DATA.INFO | DataFrame | **Yes** | - | | ML.DATA.DESCRIBE | DataFrame | **Yes** | - | | ML.DATA.SHAPE | tuple | **Yes** (2 cells) | - | | ML.DATA.DROP_MISSING_ROWS | object handle | No | DataFrame | | ML.DATASETS.* | object handle | No | DataFrame | | ML.EVAL.SCORE | float | No (1 cell) | - | | ML.EVAL.CV_SCORE | array | **Yes** | - | | ML.EVAL.BEST_PARAMS | DataFrame | **Yes** | - | | ML.EVAL.BEST_SCORE | float | No (1 cell) | - | | ML.EVAL.SEARCH_RESULTS | DataFrame | **Yes** | - | | ML.INSPECT.GET_PARAMS | DataFrame | **Yes** | - | | ML.INSPECT.DECISION_BOUNDARY | DataFrame | **Yes** | - | ### What Each Function Accepts as `model` Parameter: | Function | Accepts | |----------|---------| | ML.FIT | Any unfitted estimator, transformer, pipeline, or GridSearchCV | | ML.TRANSFORM | Any fitted transformer, pipeline with transformers, or ColumnTransformer | | ML.FIT_TRANSFORM | Any unfitted transformer | | ML.PREDICT | Any fitted estimator (classifier, regressor, clusterer) or pipeline ending in estimator | | ML.EVAL.SCORE | Any fitted estimator | | ML.EVAL.CV_SCORE | Any unfitted estimator | | ML.EVAL.GRID_SEARCH | Any unfitted estimator or pipeline | | 
ML.EVAL.BEST_PARAMS | Fitted GridSearchCV only |
| ML.EVAL.BEST_SCORE | Fitted GridSearchCV only |
| ML.EVAL.SEARCH_RESULTS | Fitted GridSearchCV only |
| ML.INSPECT.GET_PARAMS | Any scikit-learn object (fitted or unfitted) |
| ML.DIM_REDUCTION.PCA.RESULTS | Fitted PCA only |

---

## Best Practices

### 1. Always Use Separate Cells for Object Creation
```
WRONG: =ML.FIT(ML.CLASSIFICATION.LOGISTIC(), X, y)

RIGHT:
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Cell A2: =ML.FIT(A1, X, y)
```

### 2. Never Re-Reference Object Handles
```
WRONG:
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Cell A5: =A1                    <- Just copies text, NOT the object!
Cell A6: =ML.FIT(A5, X, y)      <- FAILS!

RIGHT:
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Cell A6: =ML.FIT(A1, X, y)      <- Reference original cell directly
```

### 3. Check Shape Before Spilling Large Data
```
Cell A1: =ML.DATA.SHAPE(dataset)      <- Returns [150, 5]
Cell A3: =ML.DATA.SAMPLE(dataset, -1) <- Now you know to leave 151 rows clear (header + 150 data rows)!
```

### 4. Use Cell References for Column Selection
```
Cell A7: =ML.DATA.SAMPLE(B4, 5)                   <- Headers spill to A7:E7
Cell B16: =A7                                     <- Reference first column header
Cell B17: =D7                                     <- Reference fourth column header
Cell B19: =ML.DATA.SELECT_COLUMNS(B4, B16:B17)    <- Use the references
```

### 5. Keep Parameters in Dedicated Cells
```
Cell B24: 0.2   <- Test size
Cell B25: 42    <- Random seed
Cell B27: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, B24, B25, 0)
```
This makes it easy to experiment with different values!

### 6. Leave Buffer Space Between Sections
When functions spill data, leave extra rows before the next section to avoid #SPILL! errors.

### 7. Use GET_PARAMS to Understand Objects
```
Cell A65: =ML.INSPECT.GET_PARAMS(B50)  <- See all pipeline parameters
```
This shows exactly what parameters are available for grid search.

### 8. Premium Functions (marked with "Premium")
These require a premium subscription:
- ML.CLASSIFICATION.RANDOM_FOREST_CLF
- ML.REGRESSION.RANDOM_FOREST_REG
- ML.DIM_REDUCTION.KERNEL_PCA
- ML.EVAL.CV_SCORE
- ML.EVAL.GRID_SEARCH
- ML.EVAL.BEST_PARAMS
- ML.EVAL.BEST_SCORE
- ML.EVAL.SEARCH_RESULTS
- ML.DATASETS.OPENML

### 9. Testing Predictions with Custom Values
When predicting for specific input values:
```
Step 1: Create Excel table with SAME columns as original X (headers must match exactly)
Step 2: =ML.DATA.CONVERT_TO_DF(your_table_range)  <- Convert to DataFrame handle
Step 3: =ML.PREDICT(fitted_model, converted_df)   <- Use with fitted model
Step 4: =ML.DATA.SAMPLE(predictions, -1)          <- View results
```

### 10. Scikit-learn Documentation Applies
Since FormulaML uses scikit-learn under the hood, all scikit-learn concepts apply:
- Fit/transform pattern for transformers
- Fit/predict pattern for estimators
- Cross-validation for robust evaluation
- Pipelines for reproducible workflows
- GridSearchCV for hyperparameter tuning
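---

## Worked Example: End-to-End Regression (Illustrative)

The following sketch ties the rules above together in one continuous regression workflow. It is illustrative, not canonical: it uses only functions documented in this reference, every cell address is an arbitrary choice, and the Ridge alpha of 1.0 simply makes the documented default explicit.

```
Cell B2:  =ML.DATASETS.DIABETES()                               <- Dataset handle (442 rows, 10 features + target)
Cell D2:  =ML.DATA.SHAPE(B2)                                    <- Spills [442, 11] (2 cells)
Cell A5:  =ML.DATA.SAMPLE(B2, 5)                                <- SPILLS 6 rows x 11 cols (headers in A5:K5)
Cell B13: =ML.DATA.SELECT_COLUMNS(B2, A5:J5)                    <- X: the 10 feature columns (object handle)
Cell B14: =ML.DATA.SELECT_COLUMNS(B2, K5)                       <- y: the target column (object handle)
Cell B16: 0.2                                                   <- test_size in its own cell
Cell B17: 42                                                    <- random_state in its own cell
Cell B19: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B13, B16, B17, 0)  <- X_train
Cell B20: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B13, B16, B17, 1)  <- X_test
Cell B21: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B14, B16, B17, 0)  <- y_train
Cell B22: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B14, B16, B17, 1)  <- y_test
Cell B24: =ML.REGRESSION.RIDGE(1.0)                             <- Unfitted Ridge (object handle)
Cell B25: =ML.FIT(B24, B19, B21)                                <- Fitted model (references B24 directly)
Cell B26: =ML.EVAL.SCORE(B25, B20, B22)                         <- R² on the test set (float)
Cell B28: =ML.PREDICT(B25, B20)                                 <- Predictions (object handle)
Cell A30: =ML.DATA.SAMPLE(B28, 10)                              <- SPILLS the first 10 predictions
```

Note how every function that consumes a handle points back to the cell where that handle was created, and every spilling formula (D2, A5, A30) has clear space reserved below and to the right.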
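---

## Sketch: Mixed-Type Preprocessing with ML.COMPOSE (Illustrative)

The ML.COMPOSE functions are the only family above without a worked example, so here is a minimal, hedged sketch. Everything below is an assumption made for illustration: a raw table in A1:C21 with numeric columns `age` and `income` and a categorical column `city` (headers in row 1), plus helper cells E1:G1 holding those column names as plain text.

```
Cell E1:  age                                          <- Numeric column name (typed as text)
Cell F1:  income                                       <- Numeric column name (typed as text)
Cell G1:  city                                         <- Categorical column name (typed as text)
Cell B23: =ML.DATA.CONVERT_TO_DF(A1:C21)               <- Raw table with headers, as DataFrame handle
Cell B24: =ML.PREPROCESSING.STANDARD_SCALER()          <- Scaler for the numeric columns
Cell B25: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")  <- Encoder for the categorical column
Cell B27: =ML.COMPOSE.COLUMN_TRANSFORMER(B24, E1:F1)   <- Pair scaler with the numeric column names
Cell B28: =ML.COMPOSE.COLUMN_TRANSFORMER(B25, G1)      <- Pair encoder with the categorical column name
Cell B30: =ML.COMPOSE.DATA_TRANSFORMER(B27, B28)       <- Combined ColumnTransformer (object handle)
Cell B31: =ML.FIT_TRANSFORM(B30, B23)                  <- Fitted + transformed data (object handle)
```

As with ML.PIPELINE, each piece lives in its own cell and is combined by reference; view the result with ML.DATA.SAMPLE on B31 if needed.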