Data Functions Reference

Complete reference for FormulaML data handling functions including datasets, data manipulation, and exploration.

Data Functions Reference

Functions for loading, exploring, and manipulating data in FormulaML.

ML.DATASETS Namespace

Built-in datasets for learning and testing machine learning models.

ML.DATASETS.IRIS()

Loads the classic Iris flower classification dataset.

Syntax:

=ML.DATASETS.IRIS()

Parameters: None

Returns: DataFrame object handle containing 150 samples with 5 columns

Dataset Details:

  • Samples: 150
  • Features: 4 (sepal length, sepal width, petal length, petal width)
  • Target: 1 (species: setosa, versicolor, virginica)
  • Use Case: Multi-class classification

Example:

Cell A1: =ML.DATASETS.IRIS()
Result: <Dataset>

# Explore the dataset
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.SAMPLE(A1, 5)

ML.DATASETS.DIABETES()

Loads the diabetes regression dataset for predicting disease progression.

Syntax:

=ML.DATASETS.DIABETES()

Parameters: None

Returns: DataFrame object handle containing 442 samples with 11 columns

Dataset Details:

  • Samples: 442
  • Features: 10 (age, sex, BMI, blood pressure, and 6 blood serum measurements)
  • Target: 1 (disease progression one year after baseline)
  • Use Case: Regression analysis

Example:

Cell A1: =ML.DATASETS.DIABETES()
Result: <Dataset>

# Prepare for regression
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3,4,5,6,7,8,9})  # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 10)                      # Target

ML.DATASETS.DIGITS()

Loads the handwritten digits classification dataset.

Syntax:

=ML.DATASETS.DIGITS()

Parameters: None

Returns: DataFrame object handle containing 1,797 samples with 65 columns

Dataset Details:

  • Samples: 1,797
  • Features: 64 (8x8 pixel values)
  • Target: 1 (digit 0-9)
  • Use Case: Multi-class image classification

Example:

Cell A1: =ML.DATASETS.DIGITS()
Result: <Dataset>

# Complex classification task
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # All pixel features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)      # Digit label

ML.DATASETS.OPENML()

Loads datasets from the OpenML repository (Premium feature).

Syntax:

=ML.DATASETS.OPENML(name, version)

Parameters:

  • name (String, Required): Dataset name from OpenML
  • version (String, Optional): Dataset version (default: “active”)

Returns: DataFrame object handle

Example:

Cell A1: =ML.DATASETS.OPENML("wine-quality-red", "active")
Result: <Dataset>

# Load specific dataset
Cell B1: =ML.DATASETS.OPENML("credit-g")

Note: Visit https://www.openml.org/ to browse available datasets.


ML.DATASETS.FORCE_2020()

Loads well log data from the Force 2020 dataset (Norwegian Continental Shelf petrophysics competition) for one or more wells. Useful for petrophysics, lithology classification, and well log regression demos.

Syntax:

=ML.DATASETS.FORCE_2020(wells, [columns])

Parameters:

  • wells (Range or String, Required): One or more well names to load. Pass as a list of strings (e.g., a column range of well names). Use ML.DATASETS.FORCE_2020.INFO(TRUE, FALSE) to see available wells.
  • columns (Range or String, Optional): Column names to fetch. If omitted, all available columns are returned. Use ML.DATASETS.FORCE_2020.INFO(FALSE, TRUE) to see available columns.

Returns: DataFrame object handle with concatenated well log data

Available columns: WELL, DEPTH_MD, X_LOC, Y_LOC, Z_LOC, GROUP, FORMATION, CALI, RSHA, RMED, RDEP, RHOB, GR, SGR, NPHI, PEF, DTC, SP, BS, ROP, DTS, DCAL, DRHO, MUDWEIGHT, RMIC, ROPA, RXO, FORCE_2020_LITHOFACIES_LITHOLOGY, dataset

Notes:

  • Data is read from parquet files served over HTTPS via DuckDB; first call may take a few seconds.
  • Multiple wells are concatenated row-wise into a single DataFrame.
  • Invalid well or column names raise an XlwingsError listing the offenders.

Example:

Cell A1: "15/9-13"                                  <- Well name 1
Cell A2: "15/9-14"                                  <- Well name 2
Cell C1: GR
Cell C2: RHOB
Cell C3: NPHI
Cell C4: FORCE_2020_LITHOFACIES_LITHOLOGY
Cell B4: =ML.DATASETS.FORCE_2020(A1:A2, C1:C4)     <- Dataset handle (selected columns)
Cell B6: =ML.DATA.SAMPLE(B4, 5)                     <- Preview rows (SPILLS)

ML.DATASETS.FORCE_2020.INFO()

Returns the list of available wells or columns for the Force 2020 dataset. Set exactly one argument to TRUE. This function SPILLS data.

Syntax:

=ML.DATASETS.FORCE_2020.INFO(wells, columns)

Parameters:

  • wells (Boolean, Default: FALSE): If TRUE, returns the list of available well names.
  • columns (Boolean, Default: FALSE): If TRUE, returns the list of available column names.

Returns: List (SPILLS as a single column)

Notes:

  • Exactly one of wells or columns must be TRUE — otherwise an XlwingsError is raised.

Examples:

Cell A1: =ML.DATASETS.FORCE_2020.INFO(TRUE, FALSE)    <- SPILLS all available well names
Cell C1: =ML.DATASETS.FORCE_2020.INFO(FALSE, TRUE)    <- SPILLS all available column names

ML.DATA Namespace

Functions for data manipulation and exploration.

ML.DATA.CONVERT_TO_DF()

Converts Excel data range to a FormulaML DataFrame object.

Syntax:

=ML.DATA.CONVERT_TO_DF(data_range, has_header)

Parameters:

  • data_range (Range, Required): Excel range containing data
  • has_header (Boolean, Required): TRUE if first row contains column names

Returns: DataFrame object handle

Example:

# Convert Excel data with headers
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E100, TRUE)
Result: <DataFrame>

# Convert without headers
Cell B1: =ML.DATA.CONVERT_TO_DF(Data!B2:F50, FALSE)

ML.DATA.INFO()

Returns detailed information about a DataFrame.

Syntax:

=ML.DATA.INFO(dataframe)

Parameters:

  • dataframe (Object, Required): DataFrame object handle

Returns: DataFrame with columns: Column, Non-Null Count, Dtype

Example:

Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.INFO(A1)

# Returns table showing:
# Column          Non-Null Count  Dtype
# sepal_length    150            float64
# sepal_width     150            float64
# ...

ML.DATA.DESCRIBE()

Generates descriptive statistics for numerical columns.

Syntax:

=ML.DATA.DESCRIBE(dataframe)

Parameters:

  • dataframe (Object, Required): DataFrame object handle

Returns: DataFrame with statistics (count, mean, std, min, 25%, 50%, 75%, max)

Example:

Cell A1: =ML.DATASETS.DIABETES()
Cell A2: =ML.DATA.DESCRIBE(A1)

# Returns statistical summary:
#        age      sex      bmi     ...
# count  442.0    442.0    442.0
# mean   48.5     0.05     26.3
# std    13.1     0.04     4.4
# ...

ML.DATA.SAMPLE()

Returns a sample of rows from the DataFrame.

Syntax:

=ML.DATA.SAMPLE(dataframe, n, head, tail, seed, missing_values)

Parameters:

  • dataframe (Object, Required): DataFrame object handle
  • n (Integer, Optional): Number of rows (default: 10, use -1 for all)
  • head (Boolean, Optional): Return first n rows (default: TRUE)
  • tail (Boolean, Optional): Return last n rows (default: FALSE)
  • seed (Integer, Optional): Random seed for reproducibility
  • missing_values (String, Optional): Excel error to show for missing (#N/A default)

Returns: DataFrame with sampled rows

Examples:

# First 5 rows
Cell A1: =ML.DATA.SAMPLE(dataframe, 5)

# Last 10 rows
Cell A2: =ML.DATA.SAMPLE(dataframe, 10, FALSE, TRUE)

# Random 20 rows with seed
Cell A3: =ML.DATA.SAMPLE(dataframe, 20, FALSE, FALSE, 42)

# All data
Cell A4: =ML.DATA.SAMPLE(dataframe, -1)

ML.DATA.SELECT_COLUMNS()

Selects specific columns from a DataFrame.

Syntax:

=ML.DATA.SELECT_COLUMNS(dataframe, columns)

Parameters:

  • dataframe (Object, Required): DataFrame object handle
  • columns (Array/Integer/String, Required): Column indices or names

Returns: DataFrame object handle with selected columns

Examples:

# Select single column by index
Cell A1: =ML.DATA.SELECT_COLUMNS(dataframe, 0)

# Select multiple columns by indices
Cell A2: =ML.DATA.SELECT_COLUMNS(dataframe, {0,1,2,3})

# Select column range (as string)
Cell A3: =ML.DATA.SELECT_COLUMNS(dataframe, "0:3")

# Select by column names (if available)
Cell A4: =ML.DATA.SELECT_COLUMNS(dataframe, {"age","bmi","bp"})

ML.DATA.SHAPE()

Returns the dimensions of a DataFrame as [rows, columns]. Use this before calling ML.DATA.SAMPLE(data, -1) to know how much space the spill will need.

Syntax:

=ML.DATA.SHAPE(dataframe)

Parameters:

  • dataframe (Object, Required): DataFrame object handle.

Returns: Tuple [n_rows, n_columns] — spills to 2 cells

Example:

Cell A1: =ML.DATASETS.IRIS()
Cell A2: =ML.DATA.SHAPE(A1)
Result: [150, 5]   <- 150 rows, 5 columns; leave 150 rows × 5 cols clear before sampling all rows

ML.DATA.DROP_MISSING_ROWS()

Removes rows with missing values from DataFrame.

Syntax:

=ML.DATA.DROP_MISSING_ROWS(dataframe, how, thresh, subset)

Parameters:

  • dataframe (Object, Required): DataFrame object handle
  • how (String, Optional): “any” or “all” (default: “any”)
  • thresh (Integer, Optional): Minimum non-null values required
  • subset (Array, Optional): Columns to check for missing values

Returns: DataFrame object handle without missing values

Examples:

# Drop rows with any missing values
Cell A1: =ML.DATA.DROP_MISSING_ROWS(dataframe)

# Drop rows where all values are missing
Cell A2: =ML.DATA.DROP_MISSING_ROWS(dataframe, "all")

# Keep rows with at least 3 non-null values
Cell A3: =ML.DATA.DROP_MISSING_ROWS(dataframe, , 3)

# Check specific columns only
Cell A4: =ML.DATA.DROP_MISSING_ROWS(dataframe, "any", , {"age","bmi"})

ML.DATA.QUERY()

Executes a DuckDB SQL query against a DataFrame and returns the result as a new DataFrame object handle.

Syntax:

=ML.DATA.QUERY(data, query)

Parameters:

  • data (Object, Required): DataFrame object to query.
  • query (String, Required): DuckDB SQL query string. Reference the input DataFrame as df inside the query (e.g., SELECT * FROM df WHERE col > 0).

Returns: DataFrame object handle with the query result

Notes:

  • Full DuckDB SQL is supported: SELECT, WHERE, GROUP BY, window functions, aggregations, joins, and more.
  • Always reference the input DataFrame as df inside the query string.
  • Column names with spaces or special characters must be double-quoted inside the query (e.g., "sepal length (cm)").

Examples:

Cell B4: =ML.DATASETS.IRIS()

# Filter to a single class
Cell B6: =ML.DATA.QUERY(B4, "SELECT * FROM df WHERE target = 0")

# Aggregate by target class
Cell B8: =ML.DATA.QUERY(B4, "SELECT target, AVG(""sepal length (cm)"") AS avg_sl FROM df GROUP BY target")

Common Patterns

Loading and Preparing Data

# Load dataset
Cell A1: =ML.DATASETS.IRIS()

# Explore structure
Cell A2: =ML.DATA.INFO(A1)
Cell A3: =ML.DATA.DESCRIBE(A1)
Cell A4: =ML.DATA.SAMPLE(A1, 10)

# Separate features and target
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})  # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4)          # Target

Converting Excel Data

# Prepare Excel range
# Assume data in columns A-E, rows 1-100, with headers

# Convert to DataFrame
Cell F1: =ML.DATA.CONVERT_TO_DF(A1:E100, TRUE)

# Clean missing values
Cell F2: =ML.DATA.DROP_MISSING_ROWS(F1)

# Select relevant columns
Cell F3: =ML.DATA.SELECT_COLUMNS(F2, {1,2,3})

Data Exploration Workflow

# Load data
Cell A1: =ML.DATASETS.DIABETES()

# Get overview
Cell B1: =ML.DATA.INFO(A1)        # Structure
Cell B2: =ML.DATA.DESCRIBE(A1)    # Statistics
Cell B3: =ML.DATA.SAMPLE(A1, 20)  # Sample rows

# Check for patterns
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2})  # First 3 features
Cell C2: =ML.DATA.DESCRIBE(C1)                 # Stats for subset

Tips and Best Practices

  1. Always explore data first

    • Use INFO to understand structure
    • Use DESCRIBE for statistical overview
    • Use SAMPLE to inspect actual values
  2. Handle missing values early

    • Check INFO output for null counts
    • Use DROP_MISSING_ROWS or imputation
  3. Use consistent column selection

    • Numerical indices are more reliable
    • Document which columns represent what
  4. Set random seeds for reproducibility

    • Use same seed in SAMPLE for consistent results
    • Important for debugging and testing
  5. Memory considerations

    • Use SAMPLE with small n for large datasets
    • Select only necessary columns
  • ML.PREPROCESSING.TRAIN_TEST_SPLIT() - Split data for training (docs coming soon)
  • ML.IMPUTE.SIMPLE_IMPUTER() - Handle missing values (docs coming soon)
  • ML.FIT() - Train models with data (docs coming soon)
  • ML.PREDICT() - Make predictions on data (docs coming soon)