Working with Data

Data is the foundation of any machine learning project. This guide covers everything you need to know about handling data in FormulaML, from loading built-in datasets to preparing your own Excel data for analysis.

Understanding Object Handles

Before we dive into data operations, it’s crucial to understand how FormulaML manages data in Excel.

What are Object Handles?

Object handles are FormulaML’s way of storing complex data structures in Excel cells. They appear as text like <Dataset> or <StandardScaler>.

Think of handles like this:

A handle is like a shipping label on a package
The label (handle) fits in a cell, but references the entire package (your data)
You pass handles between functions to operate on your data

Working with Handles

Cell A1: =ML.DATASETS.IRIS()          → Returns: <Dataset>
Cell A2: =ML.DATA.INFO(A1)            → Uses handle from A1
Cell A3: =ML.DATA.SAMPLE(A1, 5)       → Also uses handle from A1

Each function that uses the handle can access the full dataset without copying it to multiple cells.

Loading Data

FormulaML provides several ways to load data for analysis.

Built-in Datasets

Perfect for learning and testing, these datasets are included with FormulaML:

Iris Dataset (Classification)

=ML.DATASETS.IRIS()

150 samples of iris flowers
4 features: sepal length/width, petal length/width
3 target classes: Setosa, Versicolor, Virginica
Use case: Multi-class classification

Diabetes Dataset (Regression)

=ML.DATASETS.DIABETES()

442 patient samples
10 physiological features
Target: Disease progression after one year
Use case: Regression analysis

Digits Dataset (Classification)

=ML.DATASETS.DIGITS()

1,797 handwritten digit images
64 features (8x8 pixel values)
10 target classes (digits 0-9)
Use case: Image classification

OpenML Datasets (Premium)

=ML.DATASETS.OPENML(dataset_id)

Access thousands of datasets from OpenML repository
Example: =ML.DATASETS.OPENML(61) loads the Iris dataset from OpenML
Requires premium subscription

Force 2020 Well Log Dataset

=ML.DATASETS.FORCE_2020(wells, [columns])

Norwegian Continental Shelf petrophysics competition data
Loads one or more wells by name; multiple wells are concatenated row-wise
Available columns: GR, RHOB, NPHI, DTC, FORCE_2020_LITHOFACIES_LITHOLOGY, and more
Use ML.DATASETS.FORCE_2020.INFO to list available wells and columns

Cell A1: "15/9-13"
Cell A2: "15/9-14"
Cell B4: =ML.DATASETS.FORCE_2020(A1:A2)     <- Load two wells (all columns)
Cell B6: =ML.DATA.SAMPLE(B4, 5)             <- Preview rows

Converting Excel Data

Transform your Excel data into FormulaML DataFrames:

Basic Conversion

=ML.DATA.CONVERT_TO_DF(data_range, has_header)

Parameters:

data_range: Your Excel data range (e.g., A1:E100)
has_header: TRUE if first row contains column names, FALSE otherwise

Example:

=ML.DATA.CONVERT_TO_DF(Sheet1!A1:D50, TRUE)

Converts data from Sheet1, treating the first row as headers.

Practical Example: Sales Data

Suppose you have sales data in columns A-D:

Column A: Date
Column B: Product
Column C: Quantity
Column D: Revenue

Cell F1: =ML.DATA.CONVERT_TO_DF(A:D, TRUE)

Result: <DataFrame> containing your sales data

Exploring Your Data

Once data is loaded, explore it to understand its structure and content.

Data Information

=ML.DATA.INFO(dataframe)

Returns detailed information about your dataset:

Number of rows and columns
Column names and data types
Memory usage
Missing value counts

Statistical Summary

=ML.DATA.DESCRIBE(dataframe)

Provides statistical summary for numerical columns:

Count (non-null values)
Mean (average)
Standard deviation
Min and max values
Quartiles (25%, 50%, 75%)

Viewing Sample Data

=ML.DATA.SAMPLE(dataframe, n_rows)

Displays random rows from your dataset.

Example Workflow:

Cell A1: =ML.DATASETS.DIABETES()
Cell A3: =ML.DATA.INFO(A1)          → Shows 442 samples, 11 columns
Cell A5: =ML.DATA.DESCRIBE(A1)      → Statistical summary
Cell A7: =ML.DATA.SAMPLE(A1, 10)    → View 10 random rows

Selecting and Manipulating Data

Selecting Columns

Select Specific Columns

=ML.DATA.SELECT_COLUMNS(dataframe, columns)

Examples:

Select single column:     =ML.DATA.SELECT_COLUMNS(A1, 0)
Select multiple columns:  =ML.DATA.SELECT_COLUMNS(A1, {0,1,2})
Select column range:      =ML.DATA.SELECT_COLUMNS(A1, "0:3")

Practical Example: Feature and Target Separation

For machine learning, you typically need to separate features (X) from target (y):

Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})  → Features (X)
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4)          → Target (y)

Data Sampling

Random Sampling

=ML.DATA.SAMPLE(dataframe, n_samples, random_state)

n_samples: Number of rows to sample
random_state: Optional seed for reproducibility

Stratified Sampling

For balanced sampling across categories (useful for imbalanced datasets):

=ML.DATA.SAMPLE(dataframe, n_samples, random_state, stratify_column)

SQL Filtering with ML.DATA.QUERY

For complex row filtering or aggregation, use ML.DATA.QUERY to run DuckDB SQL directly against any DataFrame:

Cell A1: =ML.DATASETS.IRIS()

# Filter to one class
Cell B1: =ML.DATA.QUERY(A1, "SELECT * FROM df WHERE target = 0")

# Aggregate: average sepal length by class
Cell C1: =ML.DATA.QUERY(A1, "SELECT target, AVG(""sepal length (cm)"") AS avg_sl FROM df GROUP BY target")

Inside the query string, always refer to the input DataFrame as df. Column names with spaces or special characters must be double-quoted inside the query.

Data Preparation Workflow

Here’s a complete workflow for preparing your data:

Step 1: Load Your Data

Cell A1: =ML.DATA.CONVERT_TO_DF(RawData!A:F, TRUE)

Step 2: Explore the Data

Cell A3: =ML.DATA.INFO(A1)
Cell A5: =ML.DATA.DESCRIBE(A1)
Cell A7: =ML.DATA.SAMPLE(A1, 10)

Step 3: Check for Issues

Look for:

Missing values in the INFO output
Unusual statistics in DESCRIBE (outliers)
Data types that need conversion

Step 4: Select Relevant Columns

Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {1,2,3,4})  → Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 5)          → Target

Step 5: Split for Training/Testing

Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, C1, 0.2, 42)

Testing Predictions with Custom Input Values

To predict outcomes for specific values (e.g., “what would the model predict for a house with 3 bedrooms?”), create an Excel table with the same column headers as your original training data, then convert it with ML.DATA.CONVERT_TO_DF.

The column headers in your custom input MUST match the original X columns exactly (case-sensitive), in the same order.

# 1. Create input table with headers matching training data
Cell F1: sepal length (cm)   Cell G1: sepal width (cm)   Cell H1: petal length (cm)   Cell I1: petal width (cm)
Cell F2: 5.1                 Cell G2: 3.5                Cell H2: 1.4                 Cell I2: 0.2

# 2. Convert to DataFrame handle
Cell K1: =ML.DATA.CONVERT_TO_DF(F1:I2)     <- includes header row

# 3. Predict and view results
Cell K2: =ML.PREDICT(A31, K1)              <- A31 holds the fitted model
Cell K3: =ML.DATA.SAMPLE(K2, -1)           <- SPILLS prediction values

Do NOT pass a raw Excel range directly to ML.PREDICT — always convert with ML.DATA.CONVERT_TO_DF first.

Working with Different Data Types

Numerical Data

FormulaML automatically handles numerical data. Ensure your Excel data:

Uses consistent number formats
Doesn’t mix text with numbers
Handles missing values appropriately (blank cells or specific values)

Categorical Data

For text categories (e.g., “High”, “Medium”, “Low”):

Convert to DataFrame normally
FormulaML will encode categories automatically when needed
Use appropriate algorithms that handle categorical data

Date and Time Data

Convert dates to numerical features before loading:

Excel formula to convert date to days: =A2-$A$2
Excel formula to extract month: =MONTH(A2)
Excel formula to extract day of week: =WEEKDAY(A2)

Handling Missing Data

Identifying Missing Values

The ML.DATA.INFO() function shows missing value counts for each column.

Strategies for Missing Data

Option 1: Remove Rows with Missing Values

Use Excel filtering before converting to DataFrame

Option 2: Imputation (Premium)

=ML.IMPUTE.SIMPLE_IMPUTER(strategy)

Strategies: “mean”, “median”, “most_frequent”, “constant”

Example workflow:

Cell A1: =ML.DATA.CONVERT_TO_DF(MyData!A:D, TRUE)
Cell A2: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A3: =ML.FIT_TRANSFORM(A2, A1)

Best Practices

1. Data Quality Checks

Always explore your data before modeling:

Check for missing values
Look for outliers in statistical summaries
Verify data types are correct

2. Consistent Formatting

Use consistent number formats across columns
Avoid mixing data types in columns
Remove special characters from text data

3. Data Size Considerations

FormulaML handles datasets up to 1 million rows efficiently
For larger datasets, consider sampling
Premium features offer better performance for large data

4. Reproducibility

Use random_state parameters for consistent results:

=ML.DATA.SAMPLE(A1, 100, 42)  → Same sample every time

Common Data Scenarios

Scenario 1: Time Series Sales Forecasting

1. Load data: =ML.DATA.CONVERT_TO_DF(Sales!A:C, TRUE)
2. Create lag features manually in Excel
3. Select features: =ML.DATA.SELECT_COLUMNS(dataframe, {1,2,3})
4. Select target: =ML.DATA.SELECT_COLUMNS(dataframe, 4)

Scenario 2: Customer Segmentation

1. Load customer data: =ML.DATA.CONVERT_TO_DF(Customers!A:G, TRUE)
2. Select numerical features: =ML.DATA.SELECT_COLUMNS(dataframe, {2,3,4,5})
3. Apply clustering: =ML.CLUSTERING.KMEANS(n_clusters=3)

Scenario 3: Binary Classification

1. Load data: =ML.DATA.CONVERT_TO_DF(Data!A:F, TRUE)
2. Separate features and target
3. Split data: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, 0.3)
4. Train classifier: =ML.CLASSIFICATION.LOGISTIC()

Troubleshooting Data Issues

“Object handle not found”

Ensure you’re referencing the correct cell containing the handle
Handles expire when Excel closes; reload data if needed

“Invalid data format”

Check for mixed data types in columns
Remove non-numeric characters from numerical columns
Ensure consistent formatting

“Memory error with large dataset”

Sample your data: =ML.DATA.SAMPLE(dataframe, 10000)
Select only necessary columns
Consider premium version for better performance

Next Steps

Now that you understand data handling in FormulaML:

Learn about models: Explore Models and Algorithms
Try examples: Work through our tutorials
Advanced techniques: Discover preprocessing methods

Ready to build models? Continue to Models and Algorithms →

Table of Contents