# FormulaML Custom Functions Reference for LLM Context ## Overview FormulaML is an Excel add-in that provides machine learning capabilities directly in Excel through custom functions. It uses scikit-learn under the hood, so all ML concepts, algorithms, and best practices from scikit-learn apply here. --- ## CRITICAL RULES - READ FIRST ### 1. NO NESTED FUNCTIONS WITH OBJECT HANDLES **This is the most important rule in FormulaML.** You CANNOT nest custom functions that accept or return object handles. Excel needs to return and cache the object first before it can be used by another function. **WRONG - This will FAIL:** ```excel =ML.FIT(ML.CLASSIFICATION.LOGISTIC(), A1:C100, D1:D100) =ML.PREDICT(ML.FIT(ML.CLASSIFICATION.LOGISTIC(), X, y), X_test) =ML.PIPELINE(ML.PREPROCESSING.STANDARD_SCALER(), ML.CLASSIFICATION.LOGISTIC()) ``` **CORRECT - Use separate cells and references:** ``` Cell A1: =ML.CLASSIFICATION.LOGISTIC() <- Creates model object Cell A2: =ML.FIT(A1, B1:D100, E1:E100) <- References A1 to fit Cell A3: =ML.PREDICT(A2, F1:H20) <- References fitted model in A2 ``` **CORRECT Pipeline Example:** ``` Cell B35: =ML.PREPROCESSING.STANDARD_SCALER() <- Create scaler object Cell B39: =ML.DIM_REDUCTION.PCA(15) <- Create PCA object Cell B43: =ML.CLASSIFICATION.LOGISTIC() <- Create classifier object Cell B48: =ML.PIPELINE(B35, B39, B43) <- Reference all three cells Cell B50: =ML.FIT(B48, B27, B29) <- Fit the pipeline ``` ### 2. DATAFRAME OUTPUT SPILLS INTO EXCEL CELLS Functions that return DataFrames (decorated with `@ret(pd.DataFrame)`) **SPILL** their values directly into Excel cells. They do NOT return object handles. The values occupy a range of cells starting from the formula cell. **Functions that SPILL data (output values to cells):** - `ML.DATA.SAMPLE()` - Spills DataFrame rows - `ML.DATA.INFO()` - Spills column information - `ML.DATA.DESCRIBE()` - Spills descriptive statistics - `ML.INSPECT.GET_PARAMS()` - Spills parameter table - `ML.INSPECT.DECISION_BOUNDARY()` - Spills boundary points - `ML.EVAL.BEST_PARAMS()` - Spills best parameters - `ML.EVAL.SEARCH_RESULTS()` - Spills all grid search results - `ML.DIM_REDUCTION.PCA.RESULTS()` - Spills PCA variance data **Functions that return OBJECT HANDLES (single cell with icon):** - `ML.DATASETS.*` - Returns Dataset object handle - `ML.DATA.CONVERT_TO_DF()` - Returns DataFrame object handle - `ML.DATA.SELECT_COLUMNS()` - Returns DataFrame object handle - `ML.PREPROCESSING.TRAIN_TEST_SPLIT()` - Returns DataFrame object handle - `ML.PREPROCESSING.*_SCALER()` - Returns transformer object handle - `ML.CLASSIFICATION.*`, `ML.REGRESSION.*`, `ML.CLUSTERING.*` - Returns model object handles - `ML.FIT()` - Returns fitted object handle - `ML.TRANSFORM()` - Returns transformed data object handle - `ML.PIPELINE()` - Returns Pipeline object handle - `ML.EVAL.GRID_SEARCH()` - Returns GridSearchCV object handle ### 3. NEVER RE-REFERENCE OBJECT HANDLES **Object handles are bound to their original cell.** You CANNOT copy or re-reference an object handle to another cell and use that reference as input to functions that accept object handles. This is because object handles are stored in Redis with a unique key tied to the original cell address. When you reference another cell (e.g., `=A30`), Excel copies the display value, NOT the underlying object reference. **WRONG - This will FAIL:** ``` Cell A30: =ML.CLASSIFICATION.LOGISTIC() <- Creates model object handle Cell A36: =A30 <- WRONG! 
This just copies display text, not the object
Cell A40: =ML.FIT(A36, X, y) <- FAILS! A36 is not a valid object handle
```

**CORRECT - Always reference the original cell:**
```
Cell A30: =ML.CLASSIFICATION.LOGISTIC() <- Creates model object handle
Cell A40: =ML.FIT(A30, X, y) <- CORRECT! References original cell A30 directly
```

**Key principle:** Always trace back to the cell where the object was originally created. Never use intermediate cell references for object handles.

---

### 4. USE ML.DATA.SHAPE() TO PLAN SPACE FOR SPILLING

Before using a function that spills data, use `ML.DATA.SHAPE()` to determine how much space the output will need:

```
Cell A1: =ML.DATASETS.IRIS()      <- Dataset object handle
Cell B1: =ML.DATA.SHAPE(A1)       <- Returns [150, 5] meaning 150 rows, 5 columns
Cell A3: =ML.DATA.SAMPLE(A1, -1)  <- Now you know this will spill 151 rows x 5 columns (1 header row + 150 data rows)
Leave rows 3-153, columns A-E clear!
```

**Always leave adequate space below and to the right of spilling functions to avoid #SPILL! errors.**

---

## Key Concept: Object Handles

FormulaML uses **object handles** to manage complex Python objects (DataFrames, ML models, transformers, etc.) within Excel. When a function returns an `object` type (not decorated to return DataFrame values), Excel displays a small identifier (e.g., a cell showing an icon with text like "Pipeline" or "Dataset") that represents the actual Python object stored on the server.

### How Object Handles Work:
1. **Creation**: Functions that return `object` create a handle stored in the object cache (Redis)
2. **Usage**: Pass the cell containing the handle to other functions that accept `object` parameters
3. **Inspection**: Use `ML.INSPECT.GET_PARAMS()` to see the parameters of any model/transformer object
4. **Viewing Data**: Use `ML.DATA.SAMPLE()` to view the contents of DataFrame objects

### Supported Object Types (what can be stored as handles):
- **DataFrames**: pandas DataFrame objects (from datasets, transformations, etc.)
- **Estimators** (Models): LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, SVC, RandomForestClassifier, RandomForestRegressor, KMeans, DecisionTreeClassifier, DecisionTreeRegressor
- **Transformers**: StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, OrdinalEncoder, SimpleImputer, SelectPercentile
- **Dimensionality Reducers**: PCA, KernelPCA
- **Pipelines**: sklearn Pipeline objects (chains of transformers and estimators)
- **GridSearchCV**: Hyperparameter search results
- **Column Transformers**: ColumnTransformer objects for applying different transformations to different columns
- **NumPy arrays**: np.ndarray objects

---

## Function Naming Convention

All functions follow this pattern:
```
ML.[SUB_NAMESPACE].[FUNCTION_NAME](arguments)
```

Examples:
- `ML.FIT(model, X, y)` - No sub-namespace
- `ML.CLASSIFICATION.LOGISTIC()` - Sub-namespace is CLASSIFICATION
- `ML.DIM_REDUCTION.PCA.RESULTS(pca_obj)` - Nested sub-namespace

---

## Sequential Cell Reference Pattern (CRITICAL)

Because objects must be cached first, you MUST use sequential cell references.
Here's the correct workflow pattern:

### Dataset Loading Flow:
```
Step 1: Cell B4  = =ML.DATASETS.IRIS()                  <- Returns Dataset handle
Step 2: Cell A7  = =ML.DATA.SAMPLE(B4, 5)               <- SPILLS header row + 5 rows of data
Step 3: Cell B19 = =ML.DATA.SELECT_COLUMNS(B4, B16:B17) <- Returns features DataFrame handle
                                                           (B16:B17 hold feature names copied from the spilled headers)
Step 4: Cell B20 = =ML.DATA.SELECT_COLUMNS(B4, E7)      <- Returns target DataFrame handle (E7 is the spilled target header)
```

### Model Training Flow (continuing from the loading flow above):
```
Step 1: Cell B22 = =ML.CLASSIFICATION.LOGISTIC()        <- Returns unfitted model handle
Step 2: Cell B23 = =ML.FIT(B22, B19, B20)               <- Returns fitted model handle (B19 = X, B20 = y)
Step 3: Cell B24 = =ML.PREDICT(B23, B19)                <- Returns predictions handle (any X with matching columns works)
Step 4: Cell B25 = =ML.EVAL.SCORE(B23, B19, B20)        <- Returns float score
```

### Preprocessing Flow:
```
Step 1: Cell B35 = =ML.PREPROCESSING.STANDARD_SCALER()  <- Returns unfitted scaler handle
Step 2: Cell B36 = =ML.FIT(B35, B27)                    <- Returns fitted scaler handle (B27 = a features DataFrame handle)
Step 3: Cell B37 = =ML.TRANSFORM(B36, B27)              <- Returns scaled data handle
```

### Pipeline Flow:
```
Step 1: Cell B35 = =ML.PREPROCESSING.STANDARD_SCALER()  <- Create scaler
Step 2: Cell B39 = =ML.DIM_REDUCTION.PCA(15)            <- Create PCA (15 components)
Step 3: Cell B43 = =ML.CLASSIFICATION.LOGISTIC()        <- Create classifier
Step 4: Cell B48 = =ML.PIPELINE(B35, B39, B43)          <- Combine into Pipeline
Step 5: Cell B50 = =ML.FIT(B48, B27, B29)               <- Fit entire pipeline
Step 6: Cell B52 = =ML.EVAL.SCORE(B50, B28, B30)        <- Evaluate on test set
```

---

## Complete Function Reference

### Core ML Methods (ML.*)

These functions work with any scikit-learn compatible estimator or transformer.

#### ML.FIT(model, X, [y])

Fits an estimator or transformer to the data.

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** An unfitted model/transformer object (e.g., from ML.CLASSIFICATION.LOGISTIC, ML.PREPROCESSING.STANDARD_SCALER, etc.) |
| X | object | **Required.** Feature data - typically a DataFrame or array from ML.DATA.SELECT_COLUMNS or ML.DATA.CONVERT_TO_DF |
| y | object | **Optional.** Target variable for supervised learning models. Not needed for transformers like scalers. |

**Returns:** object (fitted model/transformer)

**Accepts as `model`:**
- Classification models: LogisticRegression, SVC, RandomForestClassifier
- Regression models: LinearRegression, Ridge, Lasso, ElasticNet, RandomForestRegressor
- Clustering models: KMeans
- Transformers: StandardScaler, MinMaxScaler, RobustScaler, PCA, KernelPCA, OneHotEncoder, OrdinalEncoder, SimpleImputer
- Pipelines: Created via ML.PIPELINE
- GridSearchCV: Created via ML.EVAL.GRID_SEARCH

**REMEMBER:** Cannot nest! Create the model in a separate cell first.

---

#### ML.TRANSFORM(model, X, [y])

Transforms data using a fitted transformer.

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** A fitted transformer (e.g., fitted StandardScaler, PCA, OneHotEncoder) |
| X | object | **Required.** Data to transform |
| y | object | **Optional.** Target variable (rarely used, only for certain transformers) |

**Returns:** object (transformed data, typically a DataFrame or array)

**Accepts as `model`:**
- Fitted transformers: StandardScaler, MinMaxScaler, RobustScaler, PCA, KernelPCA, OneHotEncoder, OrdinalEncoder, SimpleImputer, SelectPercentile
- Fitted pipelines containing transformers
- Fitted ColumnTransformer

---

#### ML.FIT_TRANSFORM(model, X, [y])

Fits and transforms data in one step. Equivalent to calling FIT then TRANSFORM.
| Parameter | Type | Description | |-----------|------|-------------| | model | object | **Required.** An unfitted transformer | | X | object | **Required.** Data to fit and transform | | y | object | **Optional.** Target variable | **Returns:** object (transformed data) --- #### ML.PREDICT(model, X) Generates predictions using a fitted model. | Parameter | Type | Description | |-----------|------|-------------| | model | object | **Required.** A fitted estimator (classifier, regressor, or clustering model) | | X | object | **Required.** Feature data for prediction | **Returns:** object (predictions - array or Series) **Accepts as `model`:** - Fitted classifiers: LogisticRegression, SVC, RandomForestClassifier - Fitted regressors: LinearRegression, Ridge, Lasso, ElasticNet, RandomForestRegressor - Fitted clustering: KMeans - Fitted pipelines ending with an estimator - Fitted GridSearchCV (uses best estimator) --- #### ML.PIPELINE(step1, step2, ..., stepN) Creates a scikit-learn Pipeline from multiple transformers and/or a final estimator. | Parameter | Type | Description | |-----------|------|-------------| | *args | object | One or more transformer/estimator objects. The last step can be an estimator. | **Returns:** object (Pipeline) **CRITICAL:** Each step must be a cell reference to an already-created object. Cannot nest! **Typical Usage Pattern:** ``` Cell A1: =ML.PREPROCESSING.STANDARD_SCALER() Cell A2: =ML.CLASSIFICATION.LOGISTIC() Cell A3: =ML.PIPELINE(A1, A2) <- References A1 and A2 ``` **Important:** Pipeline steps are executed in order. Typically: 1. First steps: Transformers (scalers, encoders, imputers, PCA) 2. Last step: Estimator (classifier, regressor, clustering model) --- #### ML.OBJECT_INFO(obj) Returns string representation of an object's internal state. | Parameter | Type | Description | |-----------|------|-------------| | obj | object | Any object handle | **Returns:** string --- ### Classification Models (ML.CLASSIFICATION.*) #### ML.CLASSIFICATION.LOGISTIC([C], [penalty], [fit_intercept], [max_iter], [tol]) Creates a Logistic Regression classifier. Uses scikit-learn's LogisticRegression. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | C | float | 1.0 | Inverse of regularization strength. Smaller values = stronger regularization. | | penalty | string | "l2" | Norm for penalization: 'l1', 'l2', 'elasticnet', or 'none' | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | | max_iter | integer | 100 | Maximum iterations for solver convergence | | tol | float | 0.0001 | Tolerance for stopping criteria | **Returns:** object (unfitted LogisticRegression) --- #### ML.CLASSIFICATION.SVM([C], [kernel], [degree], [gamma], [coef0]) Creates a Support Vector Machine classifier. Uses scikit-learn's SVC. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | C | float | 1.0 | Regularization parameter. Larger values = less regularization. 
| | kernel | string | "rbf" | Kernel type: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' | | degree | integer | 3 | Degree for 'poly' kernel (ignored by others) | | gamma | string | "scale" | Kernel coefficient: 'scale' or 'auto' | | coef0 | float | 0.0 | Independent term (significant in 'poly' and 'sigmoid') | **Returns:** object (unfitted SVC) --- #### ML.CLASSIFICATION.RANDOM_FOREST_CLF([n_estimators], [criterion], [max_depth], [min_samples_split], [min_samples_leaf], [max_features], [bootstrap], [random_state]) - Premium Creates a Random Forest classifier. Uses scikit-learn's RandomForestClassifier. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_estimators | integer | 100 | Number of trees in the forest | | criterion | string | "gini" | Split quality measure: 'gini', 'entropy', 'log_loss' | | max_depth | integer | None | Maximum tree depth. None = nodes expand until all leaves are pure. | | min_samples_split | integer | 2 | Minimum samples required to split a node | | min_samples_leaf | integer | 1 | Minimum samples required at a leaf node | | max_features | float/int/string | 1.0 | Features to consider for best split | | bootstrap | boolean | TRUE | Whether to use bootstrap samples | | random_state | integer | None | Random seed for reproducibility | **Returns:** object (unfitted RandomForestClassifier) --- ### Regression Models (ML.REGRESSION.*) #### ML.REGRESSION.LINEAR([fit_intercept]) Creates a Linear Regression model. Uses scikit-learn's LinearRegression. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted LinearRegression) --- #### ML.REGRESSION.RIDGE([alpha], [fit_intercept]) Creates a Ridge Regression model (L2 regularization). Uses scikit-learn's Ridge. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | alpha | float | 1.0 | Regularization strength. Larger = stronger regularization. | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted Ridge) --- #### ML.REGRESSION.LASSO([alpha], [fit_intercept]) Creates a Lasso Regression model (L1 regularization). Uses scikit-learn's Lasso. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | alpha | float | 1.0 | Regularization strength. Larger = stronger regularization. | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted Lasso) --- #### ML.REGRESSION.ELASTIC_NET([alpha], [l1_ratio], [fit_intercept]) Creates an Elastic Net model (combined L1 and L2). Uses scikit-learn's ElasticNet. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | alpha | float | 1.0 | Regularization strength | | l1_ratio | float | 0.5 | Mix ratio: 0 = L2 only, 1 = L1 only | | fit_intercept | boolean | TRUE | Whether to calculate the intercept | **Returns:** object (unfitted ElasticNet) --- #### ML.REGRESSION.RANDOM_FOREST_REG([n_estimators], [criterion], [max_depth], [min_samples_split], [min_samples_leaf], [max_features], [bootstrap], [random_state]) - Premium Creates a Random Forest regressor. Uses scikit-learn's RandomForestRegressor. 
| Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_estimators | integer | 100 | Number of trees | | criterion | string | "squared_error" | Split quality: 'squared_error', 'absolute_error', 'friedman_mse', 'poisson' | | max_depth | integer | None | Maximum tree depth | | min_samples_split | integer | 2 | Minimum samples to split | | min_samples_leaf | integer | 1 | Minimum samples at leaf | | max_features | float/int/string | 1.0 | Features for best split | | bootstrap | boolean | TRUE | Use bootstrap samples | | random_state | integer | None | Random seed | **Returns:** object (unfitted RandomForestRegressor) --- ### Clustering Models (ML.CLUSTERING.*) #### ML.CLUSTERING.KMEANS([n_clusters], [init], [n_init], [max_iter], [tol], [random_state], [algorithm]) Creates a K-Means clustering model. Uses scikit-learn's KMeans. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_clusters | integer | 8 | Number of clusters to create | | init | string | "k-means++" | Initialization: 'k-means++' (recommended) or 'random' | | n_init | integer/string | "auto" | Number of runs with different initializations | | max_iter | integer | 300 | Maximum iterations per run | | tol | float | 0.0001 | Convergence tolerance | | random_state | integer | None | Random seed for reproducibility | | algorithm | string | "lloyd" | Algorithm: 'lloyd' or 'elkan' | **Returns:** object (unfitted KMeans) **Usage Note:** After fitting, use ML.PREDICT to get cluster labels, or ML.TRANSFORM to get distances to cluster centers. --- ### Preprocessing (ML.PREPROCESSING.*) #### ML.PREPROCESSING.TRAIN_TEST_SPLIT(data, test_size, random_state, dataset_type) Splits data into training and testing sets. | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame to split | | test_size | float | **Required.** Fraction for test set (e.g., 0.2 = 20% test) | | random_state | integer | **Required.** Random seed for reproducibility | | dataset_type | integer | **Required.** 0 = training set, 1 = test set | **Returns:** object (DataFrame - either train or test split) **Important:** Call this function FOUR TIMES to get all splits: ``` X_train: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, 0.2, 42, 0) X_test: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, 0.2, 42, 1) y_train: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y, 0.2, 42, 0) y_test: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(y, 0.2, 42, 1) ``` --- #### ML.PREPROCESSING.STANDARD_SCALER() Creates a StandardScaler that standardizes features by removing mean and scaling to unit variance. **Returns:** object (unfitted StandardScaler) **When to use:** When features have different scales and you want zero mean and unit variance. Good for SVM, Logistic Regression, Neural Networks. --- #### ML.PREPROCESSING.MIN_MAX_SCALER() Creates a MinMaxScaler that scales features to [0, 1] range. **Returns:** object (unfitted MinMaxScaler) **When to use:** When you need bounded values, especially for algorithms that don't assume any distribution. --- #### ML.PREPROCESSING.ROBUST_SCALER() Creates a RobustScaler that uses median and IQR, robust to outliers. **Returns:** object (unfitted RobustScaler) **When to use:** When your data has outliers that shouldn't dominate the scaling. --- #### ML.PREPROCESSING.ONE_HOT_ENCODER([handle_unknown]) Creates a OneHotEncoder for categorical variables. 
| Parameter | Type | Default | Description | |-----------|------|---------|-------------| | handle_unknown | string | "error" | How to handle unknown categories: 'error' or 'ignore' | **Returns:** object (unfitted OneHotEncoder) **When to use:** For nominal categorical variables (no inherent order) like colors, countries, etc. --- #### ML.PREPROCESSING.ORDINAL_ENCODER([handle_unknown]) Creates an OrdinalEncoder for categorical variables. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | handle_unknown | string | "error" | How to handle unknown categories: 'error' or 'ignore' | **Returns:** object (unfitted OrdinalEncoder) **When to use:** For ordinal categorical variables (with inherent order) like education level, size (S/M/L), etc. --- ### Dimensionality Reduction (ML.DIM_REDUCTION.*) #### ML.DIM_REDUCTION.PCA([n_components], [whiten], [svd_solver], [tol], [iterated_power], [n_oversamples], [power_iteration_normalizer], [random_state]) Creates a Principal Component Analysis transformer. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_components | integer | None | Components to keep. None = all. Use 'mle' for automatic. | | whiten | boolean | FALSE | Scale to uncorrelated outputs with unit variance | | svd_solver | string | "auto" | Solver: 'auto', 'full', 'arpack', 'randomized' | | tol | float | 0.0 | Tolerance for 'arpack' solver | | iterated_power | string | "auto" | Power iterations for 'randomized' | | n_oversamples | integer | 10 | Extra random vectors for 'randomized' | | power_iteration_normalizer | string | "auto" | Normalizer: 'auto', 'QR', 'LU', 'none' | | random_state | integer | None | Random seed | **Returns:** object (unfitted PCA) --- #### ML.DIM_REDUCTION.PCA.RESULTS(pca_obj) Returns PCA results after fitting. **This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | pca_obj | object | **Required.** A fitted PCA object | **Returns:** DataFrame (SPILLS) with columns: Component, Explained Variance, Explained Variance Ratio, Singular Values **Space needed:** Rows = number of components + 1 header, Columns = 4 --- #### ML.DIM_REDUCTION.KERNEL_PCA([n_components], [kernel], [degree], [gamma], [coef0]) - Premium Creates a Kernel PCA transformer for nonlinear dimensionality reduction. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | n_components | integer | None | Components to keep | | kernel | string | "linear" | Kernel: 'linear', 'poly', 'rbf', 'sigmoid', 'cosine' | | degree | integer | 3 | Degree for 'poly' kernel | | gamma | float | 1.0 | Kernel coefficient for 'rbf', 'poly', 'sigmoid' | | coef0 | float | 0.0 | Independent term for 'poly' and 'sigmoid' | **Returns:** object (unfitted KernelPCA) --- ### Data Handling (ML.DATA.*) #### ML.DATA.CONVERT_TO_DF(table) Converts an Excel table or range to a DataFrame object handle. | Parameter | Type | Description | |-----------|------|-------------| | table | range | **Required.** Excel range with headers in the first row | **Returns:** object (DataFrame handle) --- #### ML.DATA.INFO(data) Returns information about a DataFrame (column names, non-null counts, data types). 
**This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** A DataFrame object | **Returns:** DataFrame (SPILLS to cells showing column info) **Space needed:** Rows = number of columns + 1 header, Columns = 3 --- #### ML.DATA.SAMPLE(data, [n], [head], [tail], [seed], [missing_values]) Returns a sample of a DataFrame or views it completely. **This function SPILLS data.** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | data | object | **Required.** DataFrame or array object | | n | integer | 10 | Number of rows. Use -1 for complete data. | | head | boolean | TRUE | Return first n rows | | tail | boolean | FALSE | Return last n rows | | seed | integer | None | Random seed for random sampling | | missing_values | string | "#N/A" | Excel error value for missing data | **Returns:** DataFrame (SPILLS to cells) **Space needed:** Use ML.DATA.SHAPE first to determine rows and columns! **Usage:** This is how you "view" the contents of a DataFrame object handle. Use n=-1 to get ALL data for charting. --- #### ML.DATA.SELECT_COLUMNS(data, columns) Selects specific columns from a DataFrame. | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame object | | columns | range | **Required.** Column names to select (can be a range with column headers) | **Returns:** object (DataFrame with selected columns - object handle, not spilled) **Usage:** Essential for separating features (X) from target (y). Reference the header cells from your spilled data preview. **Example:** ``` Cell A7: =ML.DATA.SAMPLE(B4, 5) <- Spills data with headers in A7:E7 Cell B19: =ML.DATA.SELECT_COLUMNS(B4, A7:D7) <- Selects columns A-D as features Cell B20: =ML.DATA.SELECT_COLUMNS(B4, E7) <- Selects column E as target ``` --- #### ML.DATA.DESCRIBE(data) Returns descriptive statistics of a DataFrame. **This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame object | **Returns:** DataFrame (SPILLS: count, mean, std, min, quartiles, max) **Space needed:** 9 rows x (number of numeric columns + 1) --- #### ML.DATA.SHAPE(data) Returns the shape of a DataFrame as [rows, columns]. **Use this to plan space for spilling functions.** | Parameter | Type | Description | |-----------|------|-------------| | data | object | **Required.** DataFrame object | **Returns:** tuple [n_rows, n_columns] (spills to 2 cells) **Critical Usage:** Call this BEFORE using ML.DATA.SAMPLE with n=-1 to know how much space to leave! --- #### ML.DATA.DROP_MISSING_ROWS(data, [how], [thresh], [subset]) Removes rows with missing values. | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | data | object | **Required.** DataFrame object | | how | string | "any" | 'any' = drop if any NA, 'all' = drop only if all NA | | thresh | integer | None | Keep rows with at least this many non-NA values | | subset | string | None | Column to check for missing values | **Returns:** object (DataFrame without missing rows) --- ### Datasets (ML.DATASETS.*) Built-in datasets for learning and testing. All return object handles (not spilled data). #### ML.DATASETS.IRIS() Loads the Iris dataset (150 samples, 4 features + 1 target, 3 classes). 
**Returns:** object (Dataset/DataFrame handle)
**Features:** sepal length, sepal width, petal length, petal width
**Target:** target (0=setosa, 1=versicolor, 2=virginica)
**Use case:** Classification

---

#### ML.DATASETS.DIGITS()

Loads the Digits dataset (1797 samples, 64 features + 1 target, 10 classes).

**Returns:** object (Dataset/DataFrame handle)
**Features:** 8x8 pixel values (64 features named pixel_0_0 through pixel_7_7)
**Target:** target (0-9)
**Use case:** Classification, dimensionality reduction

---

#### ML.DATASETS.DIABETES()

Loads the Diabetes dataset (442 samples, 10 features + 1 target).

**Returns:** object (Dataset/DataFrame handle)
**Features:** age, sex, bmi, bp, s1, s2, s3, s4, s5, s6
**Target:** target (disease progression measure)
**Use case:** Regression

---

#### ML.DATASETS.OPENML(name, [version]) - Premium

Loads any dataset from the OpenML repository.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| name | string | - | **Required.** Dataset name on OpenML |
| version | string | "active" | Dataset version |

**Returns:** object (Dataset/DataFrame handle)

---

### Evaluation (ML.EVAL.*)

#### ML.EVAL.SCORE(model, X, y)

Returns the default score for a fitted model (R² for regression, accuracy for classification).

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** Fitted model |
| X | object | **Required.** Test features |
| y | object | **Required.** True target values |

**Returns:** float (single cell value)

---

#### ML.EVAL.CV_SCORE(model, X, y, cv, scoring) - Premium

Performs cross-validation and returns scores.

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** Unfitted model |
| X | object | **Required.** Features |
| y | object | **Required.** Target |
| cv | integer | **Required.** Number of folds |
| scoring | string | **Required.** Scoring metric (e.g., 'accuracy', 'r2', 'neg_mean_squared_error') |

**Returns:** array of scores (one per fold)

---

#### ML.EVAL.GRID_SEARCH(model, param_grid, [scoring], [cv], [refit]) - Premium

Creates a GridSearchCV for hyperparameter tuning.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | object | - | **Required.** Unfitted model or pipeline |
| param_grid | range | - | **Required.** Parameter grid (see format below) |
| scoring | string/list | None | Scoring metric(s) |
| cv | integer | None | Cross-validation folds |
| refit | boolean | TRUE | Refit best model on full data |

**Returns:** object (unfitted GridSearchCV - must be fitted with ML.FIT)

**param_grid format:**
- For regular models: columns are [Model, Parameter, Value1, Value2, ...]
- For pipelines: columns are [StepName, Parameter, Value1, Value2, ...]

**Example param_grid for pipeline:**
```
| Object             | Parameter    | Value 1 | Value 2 | Value 3 |
|--------------------|--------------|---------|---------|---------|
| pca                | n_components | 5       | 15      | 30      |
| logisticregression | C            | 0.01    | 1.0     | 100.0   |
```

---

#### ML.EVAL.BEST_PARAMS(model) - Premium

Returns the best parameters from a fitted GridSearchCV. **This function SPILLS data.**

| Parameter | Type | Description |
|-----------|------|-------------|
| model | object | **Required.** Fitted GridSearchCV object |

**Returns:** DataFrame (SPILLS) with columns [Model, Parameter, Value]

---

#### ML.EVAL.BEST_SCORE(model) - Premium

Returns the best score from a fitted GridSearchCV.
| Parameter | Type | Description | |-----------|------|-------------| | model | object | **Required.** Fitted GridSearchCV object | **Returns:** float (single cell value) --- #### ML.EVAL.SEARCH_RESULTS(grid_search) - Premium Returns complete cross-validation results from GridSearchCV. **This function SPILLS data.** | Parameter | Type | Description | |-----------|------|-------------| | grid_search | object | **Required.** Fitted GridSearchCV object | **Returns:** DataFrame (SPILLS) with all parameter combinations and scores **Space needed:** Rows = (all parameter combinations) + 1 header, many columns --- ### Inspection (ML.INSPECT.*) #### ML.INSPECT.GET_PARAMS(obj, [transpose]) Gets parameters of any scikit-learn compatible object. **This function SPILLS data.** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | obj | object | **Required.** Any model, transformer, or pipeline | | transpose | boolean | FALSE | Return wide format (parameters as columns) | **Returns:** DataFrame (SPILLS) with parameters **For pipelines:** Returns [Model, Parameter, Value] columns showing each step's parameters. **For single objects:** Returns [Parameter, Value] columns. **This is the key function for inspecting what's inside an object handle.** --- #### ML.INSPECT.DECISION_BOUNDARY(model, X, [response_method], [mesh_step], [class_pair], [feature_indices], [other_features_values], [margin]) Extracts decision boundary contour points for visualization. **This function SPILLS data.** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | model | object | **Required.** Fitted classifier | | X | object | **Required.** Feature data (used for axis ranges) | | response_method | string | "predict" | 'predict' or 'decision_function' | | mesh_step | float | 0.05 | Grid resolution | | class_pair | list | [0, 1] | Class pair for multiclass decision_function | | feature_indices | tuple | (0, 1) | Which two features to use | | other_features_values | dict | None | Values for other features (uses mean if None) | | margin | float | 0.1 | Margin around data range | **Returns:** DataFrame (SPILLS) with boundary point coordinates --- ### Imputation (ML.IMPUTE.*) #### ML.IMPUTE.SIMPLE_IMPUTER(strategy, [fill_value]) Creates a SimpleImputer for handling missing values. | Parameter | Type | Description | |-----------|------|-------------| | strategy | string | **Required.** Strategy: 'mean', 'median', 'most_frequent', 'constant' | | fill_value | any | Value to use when strategy='constant' | **Returns:** object (unfitted SimpleImputer) --- ### Feature Selection (ML.FEATURE_SELECTION.*) #### ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile) Creates a SelectPercentile transformer. | Parameter | Type | Description | |-----------|------|-------------| | score_func | string | **Required.** Scoring function: 'chi2' | | percentile | integer | **Required.** Percentile of features to keep (0-100) | **Returns:** object (unfitted SelectPercentile) --- ### Compose (ML.COMPOSE.*) For applying different transformations to different columns. #### ML.COMPOSE.COLUMN_TRANSFORMER(transformer, cols) Pairs a transformer with specific columns. 
| Parameter | Type | Description | |-----------|------|-------------| | transformer | object | **Required.** A transformer object or ML.COMPOSE.TRANSFORMERS.DROP/PASSTHROUGH | | cols | range/string | **Required.** Column name(s) to apply transformer to | **Returns:** object (ColumnTransformer specification) --- #### ML.COMPOSE.DATA_TRANSFORMER(col_transformer1, col_transformer2, ...) Combines multiple column transformers into one. | Parameter | Type | Description | |-----------|------|-------------| | *args | object | One or more ML.COMPOSE.COLUMN_TRANSFORMER results | **Returns:** object (ColumnTransformer) --- #### ML.COMPOSE.COLUMN_SELECTOR(pattern, dtypes) Selects columns by pattern or data type. | Parameter | Type | Description | |-----------|------|-------------| | pattern | string | Regex pattern for column names | | dtypes | list | Data types to match | **Returns:** object (column selector) --- #### ML.COMPOSE.TRANSFORMERS.DROP() Returns a "drop" transformer that removes columns. **Returns:** object (drop specification) --- #### ML.COMPOSE.TRANSFORMERS.PASSTHROUGH() Returns a "passthrough" transformer that keeps columns unchanged. **Returns:** object (passthrough specification) --- ## Common Workflows with Cell References ### Basic Classification Workflow ``` Cell B4: =ML.DATASETS.IRIS() <- Load dataset (object handle) Cell A7: =ML.DATA.SAMPLE(B4, 5) <- Preview data (SPILLS 6 rows x 5 cols) Cell B14: =ML.DATA.SHAPE(B4) <- Check size: [150, 5] Cell B16: =A7 <- Reference feature column header Cell B17: =D7 <- Reference another feature column header Cell B19: =ML.DATA.SELECT_COLUMNS(B4, B16:B17) <- Select features (object handle) Cell B20: =ML.DATA.SELECT_COLUMNS(B4, E7) <- Select target (object handle) Cell A30: =ML.CLASSIFICATION.LOGISTIC() <- Create model (object handle) Cell A31: =ML.FIT(A30, B19, B20) <- Fit model (object handle) Cell A32: =ML.EVAL.SCORE(A31, B19, B20) <- Evaluate (float value) ``` ### Pipeline with Train/Test Split ``` Cell B4: =ML.DATASETS.DIGITS() <- Load dataset Cell A7: =ML.DATA.SAMPLE(B4, 5) <- Preview (SPILLS) Cell B21: =ML.DATA.SELECT_COLUMNS(B4, A7:BL7) <- Select 64 features Cell B22: =ML.DATA.SELECT_COLUMNS(B4, BM7) <- Select target column Cell B24: 0.2 <- Test size parameter Cell B25: 42 <- Random seed parameter Cell B27: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B21, B24, B25, 0) <- X_train Cell B28: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B21, B24, B25, 1) <- X_test Cell B29: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B22, B24, B25, 0) <- y_train Cell B30: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B22, B24, B25, 1) <- y_test Cell B35: =ML.PREPROCESSING.STANDARD_SCALER() <- Create scaler Cell B39: =ML.DIM_REDUCTION.PCA(15) <- Create PCA (15 components) Cell B43: =ML.CLASSIFICATION.LOGISTIC() <- Create classifier Cell B48: =ML.PIPELINE(B35, B39, B43) <- Create pipeline (references!) 
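Cell D48: =ML.INSPECT.GET_PARAMS(B48)          <- Optional, illustrative extra step (not required): SPILLS the pipeline's parameters, showing the step/parameter names later used in grid search; D48 is an arbitrary free cell, leave rows below it clear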
Cell B50: =ML.FIT(B48, B27, B29)                <- Fit pipeline
Cell B52: =ML.EVAL.SCORE(B50, B28, B30)         <- Test score
```

### Grid Search Hyperparameter Optimization
```
Cell B100: =ML.PREPROCESSING.STANDARD_SCALER()  <- Scaler for grid search
Cell B101: =ML.DIM_REDUCTION.PCA()              <- PCA (no fixed components)
Cell B102: =ML.CLASSIFICATION.LOGISTIC()        <- Classifier (no fixed C)
Cell B104: =ML.PIPELINE(B100, B101, B102)       <- Pipeline for grid search

Parameter grid in A97:G98:
| pca                | n_components | 5    | 15  | 30  | 45   | 60    |
| logisticregression | C            | 0.01 | 0.1 | 1.0 | 10.0 | 100.0 |

Cell B106: =ML.EVAL.GRID_SEARCH(B104, A97:G98, "accuracy", 5) <- Create grid search
Cell B108: =ML.FIT(B106, B27, B29)              <- Fit grid search
Cell A113: =ML.EVAL.SEARCH_RESULTS(B108)        <- View all results (SPILLS many rows!)
Cell F141: =ML.EVAL.BEST_PARAMS(B108)           <- Best params (SPILLS 3-4 rows)
Cell G146: =ML.EVAL.BEST_SCORE(B108)            <- Best CV score (float)
Cell G148: =ML.EVAL.SCORE(B108, B28, B30)       <- Test score (float)
```

### Visualizing Data for Charts
```
Cell A36: Header row for data section
Cell A37: =ML.DATA.SAMPLE(B19, -1)  <- ALL X data (SPILLS 151 rows: header + 150 values!)
Cell C37: =ML.DATA.SAMPLE(B20, -1)  <- ALL y data (SPILLS 151 rows: header + 150 values!)

Now you can create charts using ranges A37:B187 for X and C37:C187 for y
```

---

### Testing Predictions with Custom Input Values

When users want to predict outcomes for specific input values (e.g., "What would be the predicted price for a house with 3 bedrooms and 2000 sqft?"), you need to create a properly formatted input DataFrame.

**CRITICAL:** The custom input data MUST:
1. Have the **same number of columns** as the original X data used for training
2. Use the **same column names/headers** as the original X data (in the same order)
3. Be **converted to a DataFrame object handle** using `ML.DATA.CONVERT_TO_DF()`

**Step-by-step workflow:**
```
Assume the model was trained on X with columns: [bedrooms, sqft, bathrooms, age]

Step 1: Create custom input table in Excel with headers matching original X:
Cell F1: bedrooms
Cell G1: sqft
Cell H1: bathrooms
Cell I1: age
Cell F2: 3
Cell G2: 2000
Cell H2: 2
Cell I2: 10
Cell F3: 4
Cell G3: 2500
Cell H3: 3
Cell I3: 5

Step 2: Convert Excel range to DataFrame object handle:
Cell K1: =ML.DATA.CONVERT_TO_DF(F1:I3)       <- Includes headers! Returns object handle

Step 3: Use the converted DataFrame with the fitted model:
Cell K2: =ML.PREDICT(fitted_model_cell, K1)  <- Returns predictions object handle

Step 4: View the predictions:
Cell K3: =ML.DATA.SAMPLE(K2, -1)             <- SPILLS prediction values
```

**Complete Example with Iris Classification:**
```
Assume model was trained on Iris features:
[sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)]

Step 1: Original training workflow (already done):
Cell B4: =ML.DATASETS.IRIS()
Cell B19: =ML.DATA.SELECT_COLUMNS(B4, A7:D7)  <- X features
Cell B20: =ML.DATA.SELECT_COLUMNS(B4, E7)     <- y target
Cell A30: =ML.CLASSIFICATION.LOGISTIC()
Cell A31: =ML.FIT(A30, B19, B20)              <- Fitted model

Step 2: Create custom test data in Excel (must match X structure):
Cell M1: sepal length (cm)  <- Header must match exactly!
Cell N1: sepal width (cm) Cell O1: petal length (cm) Cell P1: petal width (cm) Cell M2: 5.1 <- Test sample 1 Cell N2: 3.5 Cell O2: 1.4 Cell P2: 0.2 Cell M3: 6.7 <- Test sample 2 Cell N3: 3.0 Cell O3: 5.2 Cell P3: 2.3 Step 3: Convert and predict: Cell R1: =ML.DATA.CONVERT_TO_DF(M1:P3) <- Convert to DataFrame handle Cell R2: =ML.PREDICT(A31, R1) <- Get predictions handle Cell R3: =ML.DATA.SAMPLE(R2, -1) <- View predictions (SPILLS) ``` **Important Notes:** - The column headers in your custom input MUST match the original training data exactly (case-sensitive) - The number of columns MUST be identical to the original X - The number of rows can be 1 or many, depending on how many predictions you need - If the model was trained through a pipeline with preprocessing (e.g., StandardScaler), the pipeline will automatically apply the same transformations to your custom input - Always use `ML.DATA.CONVERT_TO_DF()` to convert the Excel range - do NOT pass raw Excel ranges directly to `ML.PREDICT()` --- ## Object Type Reference ### What Each Function Returns: | Function | Returns | Spills? | scikit-learn Type | |----------|---------|---------|-------------------| | ML.CLASSIFICATION.* | object handle | No | Unfitted classifier | | ML.REGRESSION.* | object handle | No | Unfitted regressor | | ML.CLUSTERING.* | object handle | No | Unfitted clusterer | | ML.PREPROCESSING.*_SCALER() | object handle | No | Unfitted transformer | | ML.PREPROCESSING.*_ENCODER() | object handle | No | Unfitted transformer | | ML.PREPROCESSING.TRAIN_TEST_SPLIT | object handle | No | DataFrame subset | | ML.DIM_REDUCTION.PCA | object handle | No | Unfitted PCA | | ML.DIM_REDUCTION.PCA.RESULTS | DataFrame | **Yes** | - | | ML.IMPUTE.SIMPLE_IMPUTER | object handle | No | Unfitted imputer | | ML.FEATURE_SELECTION.SELECT_PERCENTILE | object handle | No | Unfitted selector | | ML.PIPELINE | object handle | No | Pipeline | | ML.EVAL.GRID_SEARCH | object handle | No | GridSearchCV | | ML.FIT | object handle | No | Fitted version of input | | ML.TRANSFORM | object handle | No | Transformed data | | ML.FIT_TRANSFORM | object handle | No | Transformed data | | ML.PREDICT | object handle | No | Predictions | | ML.DATA.CONVERT_TO_DF | object handle | No | DataFrame | | ML.DATA.SELECT_COLUMNS | object handle | No | DataFrame subset | | ML.DATA.SAMPLE | DataFrame | **Yes** | - | | ML.DATA.INFO | DataFrame | **Yes** | - | | ML.DATA.DESCRIBE | DataFrame | **Yes** | - | | ML.DATA.SHAPE | tuple | **Yes** (2 cells) | - | | ML.DATA.DROP_MISSING_ROWS | object handle | No | DataFrame | | ML.DATASETS.* | object handle | No | DataFrame | | ML.EVAL.SCORE | float | No (1 cell) | - | | ML.EVAL.CV_SCORE | array | **Yes** | - | | ML.EVAL.BEST_PARAMS | DataFrame | **Yes** | - | | ML.EVAL.BEST_SCORE | float | No (1 cell) | - | | ML.EVAL.SEARCH_RESULTS | DataFrame | **Yes** | - | | ML.INSPECT.GET_PARAMS | DataFrame | **Yes** | - | | ML.INSPECT.DECISION_BOUNDARY | DataFrame | **Yes** | - | ### What Each Function Accepts as `model` Parameter: | Function | Accepts | |----------|---------| | ML.FIT | Any unfitted estimator, transformer, pipeline, or GridSearchCV | | ML.TRANSFORM | Any fitted transformer, pipeline with transformers, or ColumnTransformer | | ML.FIT_TRANSFORM | Any unfitted transformer | | ML.PREDICT | Any fitted estimator (classifier, regressor, clusterer) or pipeline ending in estimator | | ML.EVAL.SCORE | Any fitted estimator | | ML.EVAL.CV_SCORE | Any unfitted estimator | | ML.EVAL.GRID_SEARCH | Any unfitted estimator or pipeline | | 
ML.EVAL.BEST_PARAMS | Fitted GridSearchCV only |
| ML.EVAL.BEST_SCORE | Fitted GridSearchCV only |
| ML.EVAL.SEARCH_RESULTS | Fitted GridSearchCV only |
| ML.INSPECT.GET_PARAMS | Any scikit-learn object (fitted or unfitted) |
| ML.DIM_REDUCTION.PCA.RESULTS | Fitted PCA only |

---

## Best Practices

### 1. Always Use Separate Cells for Object Creation
```
WRONG: =ML.FIT(ML.CLASSIFICATION.LOGISTIC(), X, y)

RIGHT:
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Cell A2: =ML.FIT(A1, X, y)
```

### 2. Never Re-Reference Object Handles
```
WRONG:
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Cell A5: =A1                    <- Just copies text, NOT the object!
Cell A6: =ML.FIT(A5, X, y)      <- FAILS!

RIGHT:
Cell A1: =ML.CLASSIFICATION.LOGISTIC()
Cell A6: =ML.FIT(A1, X, y)      <- Reference original cell directly
```

### 3. Check Shape Before Spilling Large Data
```
Cell A1: =ML.DATA.SHAPE(dataset)      <- Returns [150, 5]
Cell A3: =ML.DATA.SAMPLE(dataset, -1) <- Now you know to leave 151 rows clear (header + 150 data rows)!
```

### 4. Use Cell References for Column Selection
```
Cell A7: =ML.DATA.SAMPLE(B4, 5)                   <- Headers spill to A7:E7
Cell B16: =A7                                     <- Reference first column header
Cell B17: =D7                                     <- Reference fourth column header
Cell B19: =ML.DATA.SELECT_COLUMNS(B4, B16:B17)    <- Use the references
```

### 5. Keep Parameters in Dedicated Cells
```
Cell B24: 0.2   <- Test size
Cell B25: 42    <- Random seed
Cell B27: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, B24, B25, 0)
```
This makes it easy to experiment with different values!

### 6. Leave Buffer Space Between Sections
When functions spill data, leave extra rows before the next section to avoid #SPILL! errors.

### 7. Use GET_PARAMS to Understand Objects
```
Cell A65: =ML.INSPECT.GET_PARAMS(B50)  <- See all pipeline parameters
```
This shows exactly what parameters are available for grid search.

### 8. Premium Functions (marked with "Premium")
These require a premium subscription:
- ML.CLASSIFICATION.RANDOM_FOREST_CLF
- ML.REGRESSION.RANDOM_FOREST_REG
- ML.DIM_REDUCTION.KERNEL_PCA
- ML.EVAL.CV_SCORE
- ML.EVAL.GRID_SEARCH
- ML.EVAL.BEST_PARAMS
- ML.EVAL.BEST_SCORE
- ML.EVAL.SEARCH_RESULTS
- ML.DATASETS.OPENML

### 9. Testing Predictions with Custom Values
When predicting for specific input values:
```
Step 1: Create Excel table with SAME columns as original X (headers must match exactly)
Step 2: =ML.DATA.CONVERT_TO_DF(your_table_range)  <- Convert to DataFrame handle
Step 3: =ML.PREDICT(fitted_model, converted_df)   <- Use with fitted model
Step 4: =ML.DATA.SAMPLE(predictions, -1)          <- View results
```

### 10. Scikit-learn Documentation Applies
Since FormulaML uses scikit-learn under the hood, all scikit-learn concepts apply:
- Fit/transform pattern for transformers
- Fit/predict pattern for estimators
- Cross-validation for robust evaluation
- Pipelines for reproducible workflows
- GridSearchCV for hyperparameter tuning
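---

## Worked Example: End-to-End Regression (Illustrative)

The following sketch ties the rules above together in one continuous regression workflow. It is illustrative, not canonical: it uses only functions documented in this reference, every cell address is an arbitrary choice, and the Ridge alpha of 1.0 simply makes the documented default explicit.

```
Cell B2:  =ML.DATASETS.DIABETES()                               <- Dataset handle (442 rows, 10 features + target)
Cell D2:  =ML.DATA.SHAPE(B2)                                    <- Spills [442, 11] (2 cells)
Cell A5:  =ML.DATA.SAMPLE(B2, 5)                                <- SPILLS 6 rows x 11 cols (headers in A5:K5)
Cell B13: =ML.DATA.SELECT_COLUMNS(B2, A5:J5)                    <- X: the 10 feature columns (object handle)
Cell B14: =ML.DATA.SELECT_COLUMNS(B2, K5)                       <- y: the target column (object handle)
Cell B16: 0.2                                                   <- test_size in its own cell
Cell B17: 42                                                    <- random_state in its own cell
Cell B19: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B13, B16, B17, 0)  <- X_train
Cell B20: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B13, B16, B17, 1)  <- X_test
Cell B21: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B14, B16, B17, 0)  <- y_train
Cell B22: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B14, B16, B17, 1)  <- y_test
Cell B24: =ML.REGRESSION.RIDGE(1.0)                             <- Unfitted Ridge (object handle)
Cell B25: =ML.FIT(B24, B19, B21)                                <- Fitted model (references B24 directly)
Cell B26: =ML.EVAL.SCORE(B25, B20, B22)                         <- R² on the test set (float)
Cell B28: =ML.PREDICT(B25, B20)                                 <- Predictions (object handle)
Cell A30: =ML.DATA.SAMPLE(B28, 10)                              <- SPILLS the first 10 predictions
```

Note how every function that consumes a handle points back to the cell where that handle was created, and every spilling formula (D2, A5, A30) has clear space reserved below and to the right.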
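---

## Sketch: Mixed-Type Preprocessing with ML.COMPOSE (Illustrative)

The ML.COMPOSE functions are the only family above without a worked example, so here is a minimal, hedged sketch. Everything below is an assumption made for illustration: a raw table in A1:C21 with numeric columns `age` and `income` and a categorical column `city` (headers in row 1), plus helper cells E1:G1 holding those column names as plain text.

```
Cell E1:  age                                          <- Numeric column name (typed as text)
Cell F1:  income                                       <- Numeric column name (typed as text)
Cell G1:  city                                         <- Categorical column name (typed as text)
Cell B23: =ML.DATA.CONVERT_TO_DF(A1:C21)               <- Raw table with headers, as DataFrame handle
Cell B24: =ML.PREPROCESSING.STANDARD_SCALER()          <- Scaler for the numeric columns
Cell B25: =ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")  <- Encoder for the categorical column
Cell B27: =ML.COMPOSE.COLUMN_TRANSFORMER(B24, E1:F1)   <- Pair scaler with the numeric column names
Cell B28: =ML.COMPOSE.COLUMN_TRANSFORMER(B25, G1)      <- Pair encoder with the categorical column name
Cell B30: =ML.COMPOSE.DATA_TRANSFORMER(B27, B28)       <- Combined ColumnTransformer (object handle)
Cell B31: =ML.FIT_TRANSFORM(B30, B23)                  <- Fitted + transformed data (object handle)
```

As with ML.PIPELINE, each piece lives in its own cell and is combined by reference; view the result with ML.DATA.SAMPLE on B31 if needed.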