Table of Contents
Data Functions Reference
Functions for loading, exploring, and manipulating data in FormulaML.
ML.DATASETS Namespace
Built-in datasets for learning and testing machine learning models.
ML.DATASETS.IRIS()
Loads the classic Iris flower classification dataset.
Syntax:
=ML.DATASETS.IRIS()
Parameters: None
Returns: DataFrame object handle containing 150 samples with 5 columns
Dataset Details:
- Samples: 150
- Features: 4 (sepal length, sepal width, petal length, petal width)
- Target: 1 (species: setosa, versicolor, virginica)
- Use Case: Multi-class classification
Example:
Cell A1: =ML.DATASETS.IRIS()
Result: <Dataset>
# Explore the dataset
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.SAMPLE(A1, 5)
ML.DATASETS.DIABETES()
Loads the diabetes regression dataset for predicting disease progression.
Syntax:
=ML.DATASETS.DIABETES()
Parameters: None
Returns: DataFrame object handle containing 442 samples with 11 columns
Dataset Details:
- Samples: 442
- Features: 10 (age, sex, BMI, blood pressure, and 6 blood serum measurements)
- Target: 1 (disease progression one year after baseline)
- Use Case: Regression analysis
Example:
Cell A1: =ML.DATASETS.DIABETES()
Result: <Dataset>
# Prepare for regression
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9}) # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 10) # Target
ML.DATASETS.DIGITS()
Loads the handwritten digits classification dataset.
Syntax:
=ML.DATASETS.DIGITS()
Parameters: None
Returns: DataFrame object handle containing 1,797 samples with 65 columns
Dataset Details:
- Samples: 1,797
- Features: 64 (8x8 pixel values)
- Target: 1 (digit 0-9)
- Use Case: Multi-class image classification
Example:
Cell A1: =ML.DATASETS.DIGITS()
Result: <Dataset>
# Complex classification task
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63") # All pixel features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64) # Digit label
ML.DATASETS.OPENML()
Loads datasets from the OpenML repository (Premium feature).
Syntax:
=ML.DATASETS.OPENML(name, version)
Parameters:
name(String, Required): Dataset name from OpenMLversion(String, Optional): Dataset version (default: “active”)
Returns: DataFrame object handle
Example:
Cell A1: =ML.DATASETS.OPENML("wine-quality-red", "active")
Result: <Dataset>
# Load specific dataset
Cell B1: =ML.DATASETS.OPENML("credit-g")
Note: Visit https://www.openml.org/ to browse available datasets.
ML.DATASETS.FORCE_2020()
Loads well log data from the Force 2020 dataset (Norwegian Continental Shelf petrophysics competition) for one or more wells. Useful for petrophysics, lithology classification, and well log regression demos.
Syntax:
=ML.DATASETS.FORCE_2020(wells, [columns])
Parameters:
wells(Range or String, Required): One or more well names to load. Pass as a list of strings (e.g., a column range of well names). UseML.DATASETS.FORCE_2020.INFO(TRUE, FALSE)to see available wells.columns(Range or String, Optional): Column names to fetch. If omitted, all available columns are returned. UseML.DATASETS.FORCE_2020.INFO(FALSE, TRUE)to see available columns.
Returns: DataFrame object handle with concatenated well log data
Available columns:
WELL, DEPTH_MD, X_LOC, Y_LOC, Z_LOC, GROUP, FORMATION, CALI, RSHA, RMED, RDEP, RHOB, GR, SGR, NPHI, PEF, DTC, SP, BS, ROP, DTS, DCAL, DRHO, MUDWEIGHT, RMIC, ROPA, RXO, FORCE_2020_LITHOFACIES_LITHOLOGY, dataset
Notes:
- Data is read from parquet files served over HTTPS via DuckDB; first call may take a few seconds.
- Multiple wells are concatenated row-wise into a single DataFrame.
- Invalid well or column names raise an XlwingsError listing the offenders.
Example:
Cell A1: "15/9-13" <- Well name 1
Cell A2: "15/9-14" <- Well name 2
Cell C1: GR
Cell C2: RHOB
Cell C3: NPHI
Cell C4: FORCE_2020_LITHOFACIES_LITHOLOGY
Cell B4: =ML.DATASETS.FORCE_2020(A1:A2, C1:C4) <- Dataset handle (selected columns)
Cell B6: =ML.DATA.SAMPLE(B4, 5) <- Preview rows (SPILLS)
ML.DATASETS.FORCE_2020.INFO()
Returns the list of available wells or columns for the Force 2020 dataset. Set exactly one argument to TRUE. This function SPILLS data.
Syntax:
=ML.DATASETS.FORCE_2020.INFO(wells, columns)
Parameters:
wells(Boolean, Default:FALSE): IfTRUE, returns the list of available well names.columns(Boolean, Default:FALSE): IfTRUE, returns the list of available column names.
Returns: List (SPILLS as a single column)
Notes:
- Exactly one of
wellsorcolumnsmust beTRUE— otherwise an XlwingsError is raised.
Examples:
Cell A1: =ML.DATASETS.FORCE_2020.INFO(TRUE, FALSE) <- SPILLS all available well names
Cell C1: =ML.DATASETS.FORCE_2020.INFO(FALSE, TRUE) <- SPILLS all available column names
ML.DATA Namespace
Functions for data manipulation and exploration.
ML.DATA.CONVERT_TO_DF()
Converts Excel data range to a FormulaML DataFrame object.
Syntax:
=ML.DATA.CONVERT_TO_DF(data_range, has_header)
Parameters:
data_range(Range, Required): Excel range containing datahas_header(Boolean, Required): TRUE if first row contains column names
Returns: DataFrame object handle
Example:
# Convert Excel data with headers
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E100, TRUE)
Result: <DataFrame>
# Convert without headers
Cell B1: =ML.DATA.CONVERT_TO_DF(Data!B2:F50, FALSE)
ML.DATA.INFO()
Returns detailed information about a DataFrame.
Syntax:
=ML.DATA.INFO(dataframe)
Parameters:
dataframe(Object, Required): DataFrame object handle
Returns: DataFrame with columns: Column, Non-Null Count, Dtype
Example:
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.INFO(A1)
# Returns table showing:
# Column Non-Null Count Dtype
# sepal_length 150 float64
# sepal_width 150 float64
# ...
ML.DATA.DESCRIBE()
Generates descriptive statistics for numerical columns.
Syntax:
=ML.DATA.DESCRIBE(dataframe)
Parameters:
dataframe(Object, Required): DataFrame object handle
Returns: DataFrame with statistics (count, mean, std, min, 25%, 50%, 75%, max)
Example:
Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.DESCRIBE(A1)
# Returns statistical summary:
# age sex bmi ...
# count 442.0 442.0 442.0
# mean 48.5 0.05 26.3
# std 13.1 0.04 4.4
# ...
ML.DATA.SAMPLE()
Returns a sample of rows from the DataFrame.
Syntax:
=ML.DATA.SAMPLE(dataframe, n, head, tail, seed, missing_values)
Parameters:
dataframe(Object, Required): DataFrame object handlen(Integer, Optional): Number of rows (default: 10, use -1 for all)head(Boolean, Optional): Return first n rows (default: TRUE)tail(Boolean, Optional): Return last n rows (default: FALSE)seed(Integer, Optional): Random seed for reproducibilitymissing_values(String, Optional): Excel error to show for missing (#N/A default)
Returns: DataFrame with sampled rows
Examples:
# First 5 rows
Cell A1: =ML.DATA.SAMPLE(dataframe, 5)
# Last 10 rows
Cell A2: =ML.DATA.SAMPLE(dataframe, 10, FALSE, TRUE)
# Random 20 rows with seed
Cell A3: =ML.DATA.SAMPLE(dataframe, 20, FALSE, FALSE, 42)
# All data
Cell A4: =ML.DATA.SAMPLE(dataframe, -1)
ML.DATA.SELECT_COLUMNS()
Selects specific columns from a DataFrame.
Syntax:
=ML.DATA.SELECT_COLUMNS(dataframe, columns)
Parameters:
dataframe(Object, Required): DataFrame object handlecolumns(Array/Integer/String, Required): Column indices or names
Returns: DataFrame object handle with selected columns
Examples:
# Select single column by index
Cell A1: =ML.DATA.SELECT_COLUMNS(dataframe, 0)
# Select multiple columns by indices
Cell A2: =ML.DATA.SELECT_COLUMNS(dataframe, {0,1,2,3})
# Select column range (as string)
Cell A3: =ML.DATA.SELECT_COLUMNS(dataframe, "0:3")
# Select by column names (if available)
Cell A4: =ML.DATA.SELECT_COLUMNS(dataframe, {"age","bmi","bp"})
ML.DATA.SHAPE()
Returns the dimensions of a DataFrame as [rows, columns]. Use this before calling ML.DATA.SAMPLE(data, -1) to know how much space the spill will need.
Syntax:
=ML.DATA.SHAPE(dataframe)
Parameters:
dataframe(Object, Required): DataFrame object handle.
Returns: Tuple [n_rows, n_columns] — spills to 2 cells
Example:
Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.SHAPE(A1)
Result: [150, 5] <- 150 rows, 5 columns; leave 150 rows × 5 cols clear before sampling all rows
ML.DATA.DROP_MISSING_ROWS()
Removes rows with missing values from DataFrame.
Syntax:
=ML.DATA.DROP_MISSING_ROWS(dataframe, how, thresh, subset)
Parameters:
dataframe(Object, Required): DataFrame object handlehow(String, Optional): “any” or “all” (default: “any”)thresh(Integer, Optional): Minimum non-null values requiredsubset(Array, Optional): Columns to check for missing values
Returns: DataFrame object handle without missing values
Examples:
# Drop rows with any missing values
Cell A1: =ML.DATA.DROP_MISSING_ROWS(dataframe)
# Drop rows where all values are missing
Cell A2: =ML.DATA.DROP_MISSING_ROWS(dataframe, "all")
# Keep rows with at least 3 non-null values
Cell A3: =ML.DATA.DROP_MISSING_ROWS(dataframe, , 3)
# Check specific columns only
Cell A4: =ML.DATA.DROP_MISSING_ROWS(dataframe, "any", , {"age","bmi"})
ML.DATA.QUERY()
Executes a DuckDB SQL query against a DataFrame and returns the result as a new DataFrame object handle.
Syntax:
=ML.DATA.QUERY(data, query)
Parameters:
data(Object, Required): DataFrame object to query.query(String, Required): DuckDB SQL query string. Reference the input DataFrame asdfinside the query (e.g.,SELECT * FROM df WHERE col > 0).
Returns: DataFrame object handle with the query result
Notes:
- Full DuckDB SQL is supported:
SELECT,WHERE,GROUP BY, window functions, aggregations, joins, and more. - Always reference the input DataFrame as
dfinside the query string. - Column names with spaces or special characters must be double-quoted inside the query (e.g.,
"sepal length (cm)").
Examples:
Cell B4: =ML.DATASETS.IRIS()
# Filter to a single class
Cell B6: =ML.DATA.QUERY(B4, "SELECT * FROM df WHERE target = 0")
# Aggregate by target class
Cell B8: =ML.DATA.QUERY(B4, "SELECT target, AVG(""sepal length (cm)"") AS avg_sl FROM df GROUP BY target")
Common Patterns
Loading and Preparing Data
# Load dataset
Cell A1: =ML.DATASETS.IRIS()
# Explore structure
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.DESCRIBE(A1)
Cell A4: =ML.DATA.SAMPLE(A1, 10)
# Separate features and target
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3}) # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4) # Target
Converting Excel Data
# Prepare Excel range
# Assume data in columns A-E, rows 1-100, with headers
# Convert to DataFrame
Cell F1: =ML.DATA.CONVERT_TO_DF(A1:E100, TRUE)
# Clean missing values
Cell F2: =ML.DATA.DROP_MISSING_ROWS(F1)
# Select relevant columns
Cell F3: =ML.DATA.SELECT_COLUMNS(F2, {1,2,3})
Data Exploration Workflow
# Load data
Cell A1: =ML.DATASETS.DIABETES()
# Get overview
Cell B1: =ML.DATA.INFO(A1) # Structure
Cell B2: =ML.DATA.DESCRIBE(A1) # Statistics
Cell B3: =ML.DATA.SAMPLE(A1, 20) # Sample rows
# Check for patterns
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2}) # First 3 features
Cell C2: =ML.DATA.DESCRIBE(C1) # Stats for subset
Tips and Best Practices
-
Always explore data first
- Use INFO to understand structure
- Use DESCRIBE for statistical overview
- Use SAMPLE to inspect actual values
-
Handle missing values early
- Check INFO output for null counts
- Use DROP_MISSING_ROWS or imputation
-
Use consistent column selection
- Numerical indices are more reliable
- Document which columns represent what
-
Set random seeds for reproducibility
- Use same seed in SAMPLE for consistent results
- Important for debugging and testing
-
Memory considerations
- Use SAMPLE with small n for large datasets
- Select only necessary columns
Related Functions
- ML.PREPROCESSING.TRAIN_TEST_SPLIT() - Split data for training (docs coming soon)
- ML.IMPUTE.SIMPLE_IMPUTER() - Handle missing values (docs coming soon)
- ML.FIT() - Train models with data (docs coming soon)
- ML.PREDICT() - Make predictions on data (docs coming soon)
Navigation
- Back to Function Reference
- Next: Regression Models (coming soon)
- Examples