Clustering Models Reference

Complete reference for FormulaML clustering models, including K-Means for unsupervised learning.

Functions for creating and training clustering models to discover patterns and group similar data points.

ML.CLUSTERING Namespace

ML.CLUSTERING.KMEANS()

Creates a K-Means clustering model to group similar data points.

Syntax:

=ML.CLUSTERING.KMEANS(n_clusters, init, n_init, max_iter, tol, random_state, algorithm)

Parameters:

  • n_clusters (Integer, Optional): Number of clusters to create (default: 8)
    • Choose based on your data and business needs
    • Use the elbow method to find the optimal number
  • init (String, Optional): Initialization method (default: "k-means++")
    • "k-means++": Smart initialization (recommended)
    • "random": Random initialization (faster but less optimal)
  • n_init (Integer/String, Optional): Number of initialization runs (default: "auto")
    • "auto": Automatically determined
    • Integer: Specific number of runs (10-20 for production)
  • max_iter (Integer, Optional): Maximum iterations per run (default: 300)
    • Higher values give each run more iterations to converge but take longer
  • tol (Number, Optional): Convergence tolerance (default: 0.0001)
    • Lower values = more precise but slower
    • Increase to 0.001 for faster convergence
  • random_state (Integer, Optional): Random seed for reproducibility
    • Use any integer (e.g., 42) for consistent results
  • algorithm (String, Optional): Algorithm variant (default: "lloyd")
    • "lloyd": Works for all cases (default)
    • "elkan": Faster for dense data

Returns: K-Means clustering model object

Use Case: Customer segmentation, pattern discovery, data grouping

Example:

# Basic K-Means with 3 clusters
Cell A1: =ML.CLUSTERING.KMEANS(3)
Result: <KMeans>

# Optimized K-Means
Cell A2: =ML.CLUSTERING.KMEANS(5, "k-means++", 20, 500, 0.0001, 42, "lloyd")

# Fit to data
Cell B1: =ML.FIT(A1, X_data)

# Get cluster labels
Cell C1: =ML.PREDICT(B1, X_data)

Common Patterns

Basic Clustering

# Load and prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})  # Features only

# Create and fit K-Means
Cell C1: =ML.CLUSTERING.KMEANS(3, "k-means++", "auto", 300, 0.0001, 42)
Cell D1: =ML.FIT(C1, B1)

# Get cluster assignments
Cell E1: =ML.PREDICT(D1, B1)

Clustering with Scaling

# Load data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})

# Create preprocessing pipeline
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)

# Create and fit pipeline
Cell D1: =ML.PIPELINE(C1, C2)
Cell E1: =ML.FIT(D1, B1)

# Predict clusters
Cell F1: =ML.PREDICT(E1, B1)

Finding Optimal Number of Clusters (Elbow Method)

# Load and prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})

# Scale the data
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Try different cluster numbers
Cell E1: =ML.CLUSTERING.KMEANS(2, "k-means++", 10, 300, 0.0001, 42)
Cell E2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell E3: =ML.CLUSTERING.KMEANS(4, "k-means++", 10, 300, 0.0001, 42)
Cell E4: =ML.CLUSTERING.KMEANS(5, "k-means++", 10, 300, 0.0001, 42)
Cell E5: =ML.CLUSTERING.KMEANS(6, "k-means++", 10, 300, 0.0001, 42)

# Fit models
Cell F1: =ML.FIT(E1, D1)
Cell F2: =ML.FIT(E2, D1)
Cell F3: =ML.FIT(E3, D1)
Cell F4: =ML.FIT(E4, D1)
Cell F5: =ML.FIT(E5, D1)

# Compare inertia (within-cluster sum of squares)
# Look for "elbow" in the plot of k vs inertia
Cell G1: =ML.INSPECT.GET_PARAMS(F1)  # Check inertia value
Cell G2: =ML.INSPECT.GET_PARAMS(F2)
Cell G3: =ML.INSPECT.GET_PARAMS(F3)
Cell G4: =ML.INSPECT.GET_PARAMS(F4)
Cell G5: =ML.INSPECT.GET_PARAMS(F5)

Customer Segmentation Example

# Assume customer data in columns A-E
# Features: Age, Income, Spending Score, Frequency, Recency
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)

# Handle missing values
Cell B1: =ML.DATA.DROP_MISSING_ROWS(A1)

# Scale features (important for K-Means)
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Create K-Means with 4 segments
Cell E1: =ML.CLUSTERING.KMEANS(4, "k-means++", 20, 500, 0.0001, 42)
Cell F1: =ML.FIT(E1, D1)

# Assign customers to segments
Cell G1: =ML.PREDICT(F1, D1)

# Sample results to see segment distribution
Cell H1: =ML.DATA.SAMPLE(G1, 20)

Clustering with Dimensionality Reduction

# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # 64 features

# Reduce dimensions with PCA
Cell C1: =ML.DIM_REDUCTION.PCA(2)  # Reduce to 2 components
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Cluster in reduced space
Cell E1: =ML.CLUSTERING.KMEANS(10, "k-means++", 15, 300, 0.0001, 42)
Cell F1: =ML.FIT(E1, D1)

# Get cluster labels
Cell G1: =ML.PREDICT(F1, D1)

# Visualize: Plot D1 (2D PCA) colored by G1 (clusters)

Clustering with Different Algorithms

# Prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Lloyd algorithm (default)
Cell E1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42, "lloyd")
Cell F1: =ML.FIT(E1, D1)

# Elkan algorithm (faster for dense data)
Cell E2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42, "elkan")
Cell F2: =ML.FIT(E2, D1)

# Compare results
Cell G1: =ML.PREDICT(F1, D1)
Cell G2: =ML.PREDICT(F2, D1)

Inspecting Cluster Centers

# Create and fit K-Means
Cell A1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell B1: =ML.FIT(A1, X_scaled)

# Get model parameters (includes cluster centers)
Cell C1: =ML.INSPECT.GET_PARAMS(B1)

# Examine cluster_centers_ and other attributes

Tips and Best Practices

  1. Choosing Number of Clusters

    • Use elbow method: plot k vs inertia
    • Use silhouette analysis (see the sketch after this list)
    • Consider business requirements
    • Start with domain knowledge
  2. Feature Scaling

    • Always scale features before K-Means
    • K-Means is sensitive to feature scales
    • Use StandardScaler or MinMaxScaler (see the note after this list)
    • Scaling keeps any one feature from dominating the distance calculation
  3. Initialization

    • Use "k-means++" (default) for better results
    • Increase n_init (10-20) for stability
    • Set random_state for reproducibility
    • "random" init is faster but less reliable
  4. Convergence

    • Default max_iter=300 usually sufficient
    • Increase for complex datasets
    • Lower tol for more precision
    • Monitor convergence in production
  5. Algorithm Selection

    • "lloyd": Safe default, works everywhere
    • "elkan": Faster for dense, Euclidean data
    • Test both on your specific dataset
  6. Handling Issues

    • Empty clusters: Increase n_init
    • Poor convergence: Increase max_iter
    • Unstable results: Lower tol, increase n_init
    • Slow performance: Try “elkan”, reduce max_iter
  7. Validation

    • Check cluster sizes are balanced (see the tally sketch after this list)
    • Examine cluster centers
    • Visualize clusters (2D/3D plots)
    • Validate with domain expertise
  8. Preprocessing Checklist (see the end-to-end sketch after this list)

    • Remove or impute missing values
    • Scale all features
    • Consider feature selection
    • Handle categorical variables
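
Silhouette Analysis Sketch

Tip 1 mentions silhouette analysis as an alternative to the elbow method. A minimal sketch follows, assuming a silhouette metric is exposed; ML.METRICS.SILHOUETTE_SCORE is a hypothetical name chosen to match the library's conventions, so verify it against your installed function list before relying on it.

# Fit candidate models (X_scaled = scaled feature data)
Cell A1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell A2: =ML.CLUSTERING.KMEANS(4, "k-means++", 10, 300, 0.0001, 42)
Cell B1: =ML.FIT(A1, X_scaled)
Cell B2: =ML.FIT(A2, X_scaled)
Cell C1: =ML.PREDICT(B1, X_scaled)
Cell C2: =ML.PREDICT(B2, X_scaled)

# Score each clustering (SILHOUETTE_SCORE is an assumed function name)
Cell D1: =ML.METRICS.SILHOUETTE_SCORE(X_scaled, C1)
Cell D2: =ML.METRICS.SILHOUETTE_SCORE(X_scaled, C2)

Scores closer to 1 indicate better-separated clusters; pick the k with the highest score and sanity-check it against domain knowledge.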
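
Min-Max Scaling Note

Tip 2 lists MinMaxScaler as an alternative to StandardScaler. By analogy with ML.PREPROCESSING.STANDARD_SCALER, a min-max scaler would plausibly be exposed as ML.PREPROCESSING.MIN_MAX_SCALER; the name is an assumption, so check your function reference.

# Scale features to [0, 1] before clustering (scaler name is assumed)
Cell A1: =ML.PREPROCESSING.MIN_MAX_SCALER()
Cell B1: =ML.FIT_TRANSFORM(A1, X_data)
Cell C1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell D1: =ML.FIT(C1, B1)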
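
Cluster Size Tally Sketch

Tip 7 suggests checking that cluster sizes are balanced. If ML.PREDICT spills its labels into a column (behavior depends on your spreadsheet host), native spreadsheet functions can tally segment sizes; the spill-reference syntax below is standard Excel.

# Assuming cluster labels spilled from a prediction in G1
Cell I1: =COUNTIF(G1#, 0)  # members of cluster 0
Cell I2: =COUNTIF(G1#, 1)  # members of cluster 1
Cell I3: =COUNTIF(G1#, 2)  # members of cluster 2

A heavily lopsided distribution (one giant cluster plus several tiny ones) often signals a poor choice of k or unscaled features.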
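
End-to-End Preprocessing Sketch

A minimal flow covering the checklist in Tip 8, built only from functions shown earlier in this reference; the data range and cell layout are illustrative.

# Load raw data (range is illustrative)
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:F1000, TRUE)

# Remove rows with missing values
Cell B1: =ML.DATA.DROP_MISSING_ROWS(A1)

# Keep the numeric feature columns (feature selection)
Cell C1: =ML.DATA.SELECT_COLUMNS(B1, {0,1,2,3})

# Scale all features, then cluster
Cell D1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell E1: =ML.FIT_TRANSFORM(D1, C1)
Cell F1: =ML.CLUSTERING.KMEANS(4, "k-means++", 10, 300, 0.0001, 42)
Cell G1: =ML.FIT(F1, E1)
Cell H1: =ML.PREDICT(G1, E1)

Categorical columns still need encoding before scaling; the encoder to use depends on which preprocessing functions your FormulaML build exposes.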