Table of Contents
Working with Data
Data is the foundation of any machine learning project. This guide covers everything you need to know about handling data in FormulaML, from loading built-in datasets to preparing your own Excel data for analysis.
Understanding Object Handles
Before we dive into data operations, it’s crucial to understand how FormulaML manages data in Excel.
What are Object Handles?
Object handles are FormulaML’s way of storing complex data structures in Excel cells. They appear as text like <Dataset> or <StandardScaler>.
Think of handles like this:
- A handle is like a shipping label on a package
- The label (handle) fits in a cell, but references the entire package (your data)
- You pass handles between functions to operate on your data
Working with Handles
Cell A1: =ML.DATASETS.IRIS() → Returns: <Dataset>
Cell A2: =ML.DATA.INFO(A1) → Uses handle from A1
Cell A3: =ML.DATA.SAMPLE(A1, 5) → Also uses handle from A1
Each function that uses the handle can access the full dataset without copying it to multiple cells.
Loading Data
FormulaML provides several ways to load data for analysis.
Built-in Datasets
Perfect for learning and testing, these datasets are included with FormulaML:
Iris Dataset (Classification)
=ML.DATASETS.IRIS()
- 150 samples of iris flowers
- 4 features: sepal length/width, petal length/width
- 3 target classes: Setosa, Versicolor, Virginica
- Use case: Multi-class classification
Diabetes Dataset (Regression)
=ML.DATASETS.DIABETES()
- 442 patient samples
- 10 physiological features
- Target: Disease progression after one year
- Use case: Regression analysis
Digits Dataset (Classification)
=ML.DATASETS.DIGITS()
- 1,797 handwritten digit images
- 64 features (8x8 pixel values)
- 10 target classes (digits 0-9)
- Use case: Image classification
OpenML Datasets (Premium)
=ML.DATASETS.OPENML(dataset_id)
- Access thousands of datasets from OpenML repository
- Example:
=ML.DATASETS.OPENML(61)loads the Iris dataset from OpenML - Requires premium subscription
Force 2020 Well Log Dataset
=ML.DATASETS.FORCE_2020(wells, [columns])
- Norwegian Continental Shelf petrophysics competition data
- Loads one or more wells by name; multiple wells are concatenated row-wise
- Available columns:
GR,RHOB,NPHI,DTC,FORCE_2020_LITHOFACIES_LITHOLOGY, and more - Use
ML.DATASETS.FORCE_2020.INFOto list available wells and columns
Cell A1: "15/9-13"
Cell A2: "15/9-14"
Cell B4: =ML.DATASETS.FORCE_2020(A1:A2) <- Load two wells (all columns)
Cell B6: =ML.DATA.SAMPLE(B4, 5) <- Preview rows
Converting Excel Data
Transform your Excel data into FormulaML DataFrames:
Basic Conversion
=ML.DATA.CONVERT_TO_DF(data_range, has_header)
Parameters:
data_range: Your Excel data range (e.g., A1:E100)has_header: TRUE if first row contains column names, FALSE otherwise
Example:
=ML.DATA.CONVERT_TO_DF(Sheet1!A1:D50, TRUE)
Converts data from Sheet1, treating the first row as headers.
Practical Example: Sales Data
Suppose you have sales data in columns A-D:
- Column A: Date
- Column B: Product
- Column C: Quantity
- Column D: Revenue
Cell F1: =ML.DATA.CONVERT_TO_DF(A:D, TRUE)
Result: <DataFrame> containing your sales data
Exploring Your Data
Once data is loaded, explore it to understand its structure and content.
Data Information
=ML.DATA.INFO(dataframe)
Returns detailed information about your dataset:
- Number of rows and columns
- Column names and data types
- Memory usage
- Missing value counts
Statistical Summary
=ML.DATA.DESCRIBE(dataframe)
Provides statistical summary for numerical columns:
- Count (non-null values)
- Mean (average)
- Standard deviation
- Min and max values
- Quartiles (25%, 50%, 75%)
Viewing Sample Data
=ML.DATA.SAMPLE(dataframe, n_rows)
Displays random rows from your dataset.
Example Workflow:
Cell A1: =ML.DATASETS.DIABETES()
Cell A3: =ML.DATA.INFO(A1) → Shows 442 samples, 11 columns
Cell A5: =ML.DATA.DESCRIBE(A1) → Statistical summary
Cell A7: =ML.DATA.SAMPLE(A1, 10) → View 10 random rows
Selecting and Manipulating Data
Selecting Columns
Select Specific Columns
=ML.DATA.SELECT_COLUMNS(dataframe, columns)
Examples:
Select single column: =ML.DATA.SELECT_COLUMNS(A1, 0)
Select multiple columns: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2})
Select column range: =ML.DATA.SELECT_COLUMNS(A1, "0:3")
Practical Example: Feature and Target Separation
For machine learning, you typically need to separate features (X) from target (y):
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3}) → Features (X)
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 4) → Target (y)
Data Sampling
Random Sampling
=ML.DATA.SAMPLE(dataframe, n_samples, random_state)
n_samples: Number of rows to samplerandom_state: Optional seed for reproducibility
Stratified Sampling
For balanced sampling across categories (useful for imbalanced datasets):
=ML.DATA.SAMPLE(dataframe, n_samples, random_state, stratify_column)
SQL Filtering with ML.DATA.QUERY
For complex row filtering or aggregation, use ML.DATA.QUERY to run DuckDB SQL directly against any DataFrame:
Cell A1: =ML.DATASETS.IRIS()
# Filter to one class
Cell B1: =ML.DATA.QUERY(A1, "SELECT * FROM df WHERE target = 0")
# Aggregate: average sepal length by class
Cell C1: =ML.DATA.QUERY(A1, "SELECT target, AVG(""sepal length (cm)"") AS avg_sl FROM df GROUP BY target")
Inside the query string, always refer to the input DataFrame as df. Column names with spaces or special characters must be double-quoted inside the query.
Data Preparation Workflow
Here’s a complete workflow for preparing your data:
Step 1: Load Your Data
Cell A1: =ML.DATA.CONVERT_TO_DF(RawData!A:F, TRUE)
Step 2: Explore the Data
Cell A3: =ML.DATA.INFO(A1)
Cell A5: =ML.DATA.DESCRIBE(A1)
Cell A7: =ML.DATA.SAMPLE(A1, 10)
Step 3: Check for Issues
Look for:
- Missing values in the INFO output
- Unusual statistics in DESCRIBE (outliers)
- Data types that need conversion
Step 4: Select Relevant Columns
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {1,2,3,4}) → Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 5) → Target
Step 5: Split for Training/Testing
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, C1, 0.2, 42)
Testing Predictions with Custom Input Values
To predict outcomes for specific values (e.g., “what would the model predict for a house with 3 bedrooms?”), create an Excel table with the same column headers as your original training data, then convert it with ML.DATA.CONVERT_TO_DF.
The column headers in your custom input MUST match the original X columns exactly (case-sensitive), in the same order.
# 1. Create input table with headers matching training data
Cell F1: sepal length (cm) Cell G1: sepal width (cm) Cell H1: petal length (cm) Cell I1: petal width (cm)
Cell F2: 5.1 Cell G2: 3.5 Cell H2: 1.4 Cell I2: 0.2
# 2. Convert to DataFrame handle
Cell K1: =ML.DATA.CONVERT_TO_DF(F1:I2) <- includes header row
# 3. Predict and view results
Cell K2: =ML.PREDICT(A31, K1) <- A31 holds the fitted model
Cell K3: =ML.DATA.SAMPLE(K2, -1) <- SPILLS prediction values
Do NOT pass a raw Excel range directly to ML.PREDICT — always convert with ML.DATA.CONVERT_TO_DF first.
Working with Different Data Types
Numerical Data
FormulaML automatically handles numerical data. Ensure your Excel data:
- Uses consistent number formats
- Doesn’t mix text with numbers
- Handles missing values appropriately (blank cells or specific values)
Categorical Data
For text categories (e.g., “High”, “Medium”, “Low”):
- Convert to DataFrame normally
- FormulaML will encode categories automatically when needed
- Use appropriate algorithms that handle categorical data
Date and Time Data
Convert dates to numerical features before loading:
Excel formula to convert date to days: =A2-$A$2
Excel formula to extract month: =MONTH(A2)
Excel formula to extract day of week: =WEEKDAY(A2)
Handling Missing Data
Identifying Missing Values
The ML.DATA.INFO() function shows missing value counts for each column.
Strategies for Missing Data
Option 1: Remove Rows with Missing Values
Use Excel filtering before converting to DataFrame
Option 2: Imputation (Premium)
=ML.IMPUTE.SIMPLE_IMPUTER(strategy)
Strategies: “mean”, “median”, “most_frequent”, “constant”
Example workflow:
Cell A1: =ML.DATA.CONVERT_TO_DF(MyData!A:D, TRUE)
Cell A2: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell A3: =ML.FIT_TRANSFORM(A2, A1)
Best Practices
1. Data Quality Checks
Always explore your data before modeling:
- Check for missing values
- Look for outliers in statistical summaries
- Verify data types are correct
2. Consistent Formatting
- Use consistent number formats across columns
- Avoid mixing data types in columns
- Remove special characters from text data
3. Data Size Considerations
- FormulaML handles datasets up to 1 million rows efficiently
- For larger datasets, consider sampling
- Premium features offer better performance for large data
4. Reproducibility
Use random_state parameters for consistent results:
=ML.DATA.SAMPLE(A1, 100, 42) → Same sample every time
Common Data Scenarios
Scenario 1: Time Series Sales Forecasting
1. Load data: =ML.DATA.CONVERT_TO_DF(Sales!A:C, TRUE)
2. Create lag features manually in Excel
3. Select features: =ML.DATA.SELECT_COLUMNS(dataframe, {1,2,3})
4. Select target: =ML.DATA.SELECT_COLUMNS(dataframe, 4)
Scenario 2: Customer Segmentation
1. Load customer data: =ML.DATA.CONVERT_TO_DF(Customers!A:G, TRUE)
2. Select numerical features: =ML.DATA.SELECT_COLUMNS(dataframe, {2,3,4,5})
3. Apply clustering: =ML.CLUSTERING.KMEANS(n_clusters=3)
Scenario 3: Binary Classification
1. Load data: =ML.DATA.CONVERT_TO_DF(Data!A:F, TRUE)
2. Separate features and target
3. Split data: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(X, y, 0.3)
4. Train classifier: =ML.CLASSIFICATION.LOGISTIC()
Troubleshooting Data Issues
“Object handle not found”
- Ensure you’re referencing the correct cell containing the handle
- Handles expire when Excel closes; reload data if needed
“Invalid data format”
- Check for mixed data types in columns
- Remove non-numeric characters from numerical columns
- Ensure consistent formatting
“Memory error with large dataset”
- Sample your data:
=ML.DATA.SAMPLE(dataframe, 10000) - Select only necessary columns
- Consider premium version for better performance
Next Steps
Now that you understand data handling in FormulaML:
- Learn about models: Explore Models and Algorithms
- Try examples: Work through our tutorials
- Advanced techniques: Discover preprocessing methods
Ready to build models? Continue to Models and Algorithms →