Clustering Models Reference

Complete reference for FormulaML clustering models, including K-Means for unsupervised learning.

Functions for creating and training clustering models to discover patterns and group similar data points.

ML.CLUSTERING Namespace

ML.CLUSTERING.KMEANS()

Creates a K-Means clustering model to group similar data points.

Syntax:

=ML.CLUSTERING.KMEANS(n_clusters, init, n_init, max_iter, tol, random_state, algorithm)

Parameters:

  • n_clusters (Integer, Optional): Number of clusters to create (default: 8)
    • Choose based on your data and business needs
    • Use the elbow method to find the optimal number
  • init (String, Optional): Initialization method (default: "k-means++")
    • "k-means++": Smart initialization (recommended)
    • "random": Random initialization (faster but less optimal)
  • n_init (Integer/String, Optional): Number of initialization runs (default: "auto")
    • "auto": Automatically determined
    • Integer: Specific number of runs (10-20 for production)
  • max_iter (Integer, Optional): Maximum iterations per run (default: 300)
    • Higher values give each run more iterations to converge but take longer
  • tol (Number, Optional): Convergence tolerance (default: 0.0001)
    • Lower values = more precise but slower
    • Increase to 0.001 for faster convergence
  • random_state (Integer, Optional): Random seed for reproducibility
    • Use any integer (e.g., 42) for consistent results
  • algorithm (String, Optional): Algorithm variant (default: "lloyd")
    • "lloyd": Works for all cases (default)
    • "elkan": Faster for dense data

Returns: K-Means clustering model object

Use Case: Customer segmentation, pattern discovery, data grouping

Example:

# Basic K-Means with 3 clusters
Cell A1: =ML.CLUSTERING.KMEANS(3)
Result: <KMeans>

# Optimized K-Means
Cell A2: =ML.CLUSTERING.KMEANS(5, "k-means++", 20, 500, 0.0001, 42, "lloyd")

# Fit to data
Cell B1: =ML.FIT(A1, X_data)

# Get cluster labels
Cell C1: =ML.PREDICT(B1, X_data)

Common Patterns

Basic Clustering

# Load and prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})  # Features only

# Create and fit K-Means
Cell C1: =ML.CLUSTERING.KMEANS(3, "k-means++", "auto", 300, 0.0001, 42)
Cell D1: =ML.FIT(C1, B1)

# Get cluster assignments
Cell E1: =ML.PREDICT(D1, B1)

Clustering with Scaling

# Load data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})

# Create preprocessing pipeline
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)

# Create and fit pipeline
Cell D1: =ML.PIPELINE(C1, C2)
Cell E1: =ML.FIT(D1, B1)

# Predict clusters
Cell F1: =ML.PREDICT(E1, B1)

Finding Optimal Number of Clusters (Elbow Method)

# Load and prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})

# Scale the data
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Try different cluster numbers
Cell E1: =ML.CLUSTERING.KMEANS(2, "k-means++", 10, 300, 0.0001, 42)
Cell E2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell E3: =ML.CLUSTERING.KMEANS(4, "k-means++", 10, 300, 0.0001, 42)
Cell E4: =ML.CLUSTERING.KMEANS(5, "k-means++", 10, 300, 0.0001, 42)
Cell E5: =ML.CLUSTERING.KMEANS(6, "k-means++", 10, 300, 0.0001, 42)

# Fit models
Cell F1: =ML.FIT(E1, D1)
Cell F2: =ML.FIT(E2, D1)
Cell F3: =ML.FIT(E3, D1)
Cell F4: =ML.FIT(E4, D1)
Cell F5: =ML.FIT(E5, D1)

# Compare inertia (within-cluster sum of squares)
# Look for "elbow" in the plot of k vs inertia
Cell G1: =ML.INSPECT.GET_PARAMS(F1)  # Check inertia value
Cell G2: =ML.INSPECT.GET_PARAMS(F2)
Cell G3: =ML.INSPECT.GET_PARAMS(F3)
Cell G4: =ML.INSPECT.GET_PARAMS(F4)
Cell G5: =ML.INSPECT.GET_PARAMS(F5)

Customer Segmentation Example

# Assume customer data in columns A-E
# Features: Age, Income, Spending Score, Frequency, Recency
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)

# Handle missing values
Cell B1: =ML.DATA.DROP_MISSING_ROWS(A1)

# Scale features (important for K-Means)
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Create K-Means with 4 segments
Cell E1: =ML.CLUSTERING.KMEANS(4, "k-means++", 20, 500, 0.0001, 42)
Cell F1: =ML.FIT(E1, D1)

# Assign customers to segments
Cell G1: =ML.PREDICT(F1, D1)

# Sample results to see segment distribution
Cell H1: =ML.DATA.SAMPLE(G1, 20)

Clustering with Dimensionality Reduction

# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # 64 features

# Reduce dimensions with PCA
Cell C1: =ML.DIM_REDUCTION.PCA(2)  # Reduce to 2 components
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Cluster in reduced space
Cell E1: =ML.CLUSTERING.KMEANS(10, "k-means++", 15, 300, 0.0001, 42)
Cell F1: =ML.FIT(E1, D1)

# Get cluster labels
Cell G1: =ML.PREDICT(F1, D1)

# Visualize: Plot D1 (2D PCA) colored by G1 (clusters)

Clustering with Different Algorithms

# Prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Lloyd algorithm (default)
Cell E1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42, "lloyd")
Cell F1: =ML.FIT(E1, D1)

# Elkan algorithm (faster for dense data)
Cell E2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42, "elkan")
Cell F2: =ML.FIT(E2, D1)

# Compare results
Cell G1: =ML.PREDICT(F1, D1)
Cell G2: =ML.PREDICT(F2, D1)

Inspecting Cluster Centers

# Create and fit K-Means
Cell A1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell B1: =ML.FIT(A1, X_scaled)

# Get model parameters (includes cluster centers)
Cell C1: =ML.INSPECT.GET_PARAMS(B1)

# Examine cluster_centers_ and other attributes

Tips and Best Practices

  1. Choosing Number of Clusters

    • Use elbow method: plot k vs inertia
    • Use silhouette analysis (see the sketch after this list)
    • Consider business requirements
    • Start with domain knowledge
  2. Feature Scaling

    • Always scale features before K-Means
    • K-Means is sensitive to feature scales
    • Use StandardScaler or MinMaxScaler (see the note after this list)
    • Scaling keeps any one feature from dominating the distance calculation
  3. Initialization

    • Use "k-means++" (default) for better results
    • Increase n_init (10-20) for stability
    • Set random_state for reproducibility
    • "random" init is faster but less reliable
  4. Convergence

    • Default max_iter=300 usually sufficient
    • Increase for complex datasets
    • Lower tol for more precision
    • Monitor convergence in production
  5. Algorithm Selection

    • "lloyd": Safe default, works everywhere
    • "elkan": Faster for dense, Euclidean data
    • Test both on your specific dataset
  6. Handling Issues

    • Empty clusters: Increase n_init
    • Poor convergence: Increase max_iter
    • Unstable results: Lower tol, increase n_init
    • Slow performance: Try “elkan”, reduce max_iter
  7. Validation

    • Check cluster sizes are balanced (see the tally sketch after this list)
    • Examine cluster centers
    • Visualize clusters (2D/3D plots)
    • Validate with domain expertise
  8. Preprocessing Checklist (see the end-to-end sketch after this list)

    • Remove or impute missing values
    • Scale all features
    • Consider feature selection
    • Handle categorical variables
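
Silhouette Analysis Sketch

Tip 1 mentions silhouette analysis as an alternative to the elbow method. A minimal sketch follows, assuming a silhouette metric is exposed; ML.METRICS.SILHOUETTE_SCORE is a hypothetical name chosen to match the library's conventions, so verify it against your installed function list before relying on it.

# Fit candidate models (X_scaled = scaled feature data)
Cell A1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell A2: =ML.CLUSTERING.KMEANS(4, "k-means++", 10, 300, 0.0001, 42)
Cell B1: =ML.FIT(A1, X_scaled)
Cell B2: =ML.FIT(A2, X_scaled)
Cell C1: =ML.PREDICT(B1, X_scaled)
Cell C2: =ML.PREDICT(B2, X_scaled)

# Score each clustering (SILHOUETTE_SCORE is an assumed function name)
Cell D1: =ML.METRICS.SILHOUETTE_SCORE(X_scaled, C1)
Cell D2: =ML.METRICS.SILHOUETTE_SCORE(X_scaled, C2)

Scores closer to 1 indicate better-separated clusters; pick the k with the highest score and sanity-check it against domain knowledge.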
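
Min-Max Scaling Note

Tip 2 lists MinMaxScaler as an alternative to StandardScaler. By analogy with ML.PREPROCESSING.STANDARD_SCALER, a min-max scaler would plausibly be exposed as ML.PREPROCESSING.MIN_MAX_SCALER; the name is an assumption, so check your function reference.

# Scale features to [0, 1] before clustering (scaler name is assumed)
Cell A1: =ML.PREPROCESSING.MIN_MAX_SCALER()
Cell B1: =ML.FIT_TRANSFORM(A1, X_data)
Cell C1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell D1: =ML.FIT(C1, B1)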
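
Cluster Size Tally Sketch

Tip 7 suggests checking that cluster sizes are balanced. If ML.PREDICT spills its labels into a column (behavior depends on your spreadsheet host), native spreadsheet functions can tally segment sizes; the spill-reference syntax below is standard Excel.

# Assuming cluster labels spilled from a prediction in G1
Cell I1: =COUNTIF(G1#, 0)  # members of cluster 0
Cell I2: =COUNTIF(G1#, 1)  # members of cluster 1
Cell I3: =COUNTIF(G1#, 2)  # members of cluster 2

A heavily lopsided distribution (one giant cluster plus several tiny ones) often signals a poor choice of k or unscaled features.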
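
End-to-End Preprocessing Sketch

A minimal flow covering the checklist in Tip 8, built only from functions shown earlier in this reference; the data range and cell layout are illustrative.

# Load raw data (range is illustrative)
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:F1000, TRUE)

# Remove rows with missing values
Cell B1: =ML.DATA.DROP_MISSING_ROWS(A1)

# Keep the numeric feature columns (feature selection)
Cell C1: =ML.DATA.SELECT_COLUMNS(B1, {0,1,2,3})

# Scale all features, then cluster
Cell D1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell E1: =ML.FIT_TRANSFORM(D1, C1)
Cell F1: =ML.CLUSTERING.KMEANS(4, "k-means++", 10, 300, 0.0001, 42)
Cell G1: =ML.FIT(F1, E1)
Cell H1: =ML.PREDICT(G1, E1)

Categorical columns still need encoding before scaling; the encoder to use depends on which preprocessing functions your FormulaML build exposes.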