Clustering Models Reference
Functions for creating and training clustering models to discover patterns and group similar data points.
ML.CLUSTERING Namespace
ML.CLUSTERING.KMEANS()
Creates a K-Means clustering model to group similar data points.
Syntax:
=ML.CLUSTERING.KMEANS(n_clusters, init, n_init, max_iter, tol, random_state, algorithm)
Parameters:
n_clusters
(Integer, Optional): Number of clusters to create (default: 8)
- Choose based on your data and business needs
- Use the elbow method to find the optimal number
init
(String, Optional): Initialization method (default: "k-means++")
- "k-means++": Smart initialization (recommended)
- "random": Random initialization (faster but less optimal)
n_init
(Integer/String, Optional): Number of initialization runs (default: "auto")
- "auto": Automatically determined
- Integer: Specific number of runs (10-20 for production)
max_iter
(Integer, Optional): Maximum iterations per run (default: 300)
- Higher values can find better clusters but take longer
tol
(Number, Optional): Convergence tolerance (default: 0.0001)
- Lower values are more precise but slower
- Increase to 0.001 for faster convergence
random_state
(Integer, Optional): Random seed for reproducibility
- Use any integer (e.g., 42) for consistent results
algorithm
(String, Optional): Algorithm variant (default: "lloyd")
- "lloyd": Works for all cases (default)
- "elkan": Faster for dense data
Returns: K-Means clustering model object
Use Case: Customer segmentation, pattern discovery, data grouping
Example:
# Basic K-Means with 3 clusters
Cell A1: =ML.CLUSTERING.KMEANS(3)
Result: <KMeans>
# Optimized K-Means
Cell A2: =ML.CLUSTERING.KMEANS(5, "k-means++", 20, 500, 0.0001, 42, "lloyd")
# Fit to data
Cell B1: =ML.FIT(A1, X_data)
# Get cluster labels
Cell C1: =ML.PREDICT(B1, X_data)
Common Patterns
Basic Clustering
# Load and prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3}) # Features only
# Create and fit K-Means
Cell C1: =ML.CLUSTERING.KMEANS(3, "k-means++", "auto", 300, 0.0001, 42)
Cell D1: =ML.FIT(C1, B1)
# Get cluster assignments
Cell E1: =ML.PREDICT(D1, B1)
Clustering with Scaling
# Load data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
# Create preprocessing pipeline
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
# Create and fit pipeline
Cell D1: =ML.PIPELINE(C1, C2)
Cell E1: =ML.FIT(D1, B1)
# Predict clusters
Cell F1: =ML.PREDICT(E1, B1)
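Because the pipeline bundles the scaler and the model into one object, new rows can be scored without scaling them by hand: the fitted pipeline re-applies the transformation it learned during fitting. A minimal sketch, assuming New_data is a placeholder range with the same four feature columns as B1 (the name is illustrative, not part of the library):
# Score new rows through the fitted pipeline (New_data is a placeholder range)
Cell G1: =ML.PREDICT(E1, New_data)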
Finding Optimal Number of Clusters (Elbow Method)
# Load and prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
# Scale the data
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Try different cluster numbers
Cell E1: =ML.CLUSTERING.KMEANS(2, "k-means++", 10, 300, 0.0001, 42)
Cell E2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell E3: =ML.CLUSTERING.KMEANS(4, "k-means++", 10, 300, 0.0001, 42)
Cell E4: =ML.CLUSTERING.KMEANS(5, "k-means++", 10, 300, 0.0001, 42)
Cell E5: =ML.CLUSTERING.KMEANS(6, "k-means++", 10, 300, 0.0001, 42)
# Fit models
Cell F1: =ML.FIT(E1, D1)
Cell F2: =ML.FIT(E2, D1)
Cell F3: =ML.FIT(E3, D1)
Cell F4: =ML.FIT(E4, D1)
Cell F5: =ML.FIT(E5, D1)
# Compare inertia (within-cluster sum of squares)
# Look for "elbow" in the plot of k vs inertia
Cell G1: =ML.INSPECT.GET_PARAMS(F1) # Check inertia value
Cell G2: =ML.INSPECT.GET_PARAMS(F2)
Cell G3: =ML.INSPECT.GET_PARAMS(F3)
Cell G4: =ML.INSPECT.GET_PARAMS(F4)
Cell G5: =ML.INSPECT.GET_PARAMS(F5)
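For reference, the inertia compared here is the standard K-Means objective, the within-cluster sum of squared distances:

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its center. Inertia always decreases as k grows, so the goal is not to minimize it but to find the "elbow" where adding another cluster stops paying off.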
Customer Segmentation Example
# Assume customer data in columns A-E
# Features: Age, Income, Spending Score, Frequency, Recency
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)
# Handle missing values
Cell B1: =ML.DATA.DROP_MISSING_ROWS(A1)
# Scale features (important for K-Means)
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Create K-Means with 4 segments
Cell E1: =ML.CLUSTERING.KMEANS(4, "k-means++", 20, 500, 0.0001, 42)
Cell F1: =ML.FIT(E1, D1)
# Assign customers to segments
Cell G1: =ML.PREDICT(F1, D1)
# Sample results to see segment distribution
Cell H1: =ML.DATA.SAMPLE(G1, 20)
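To interpret the four segments, inspect the fitted cluster centers: each center is a vector of Age, Income, Spending Score, Frequency, and Recency values (in standardized units, since the data was scaled) that characterizes one segment. A short sketch using the same inspection call documented below:
# Segment profiles live in cluster_centers_ (values are in scaled units)
Cell I1: =ML.INSPECT.GET_PARAMS(F1)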
Clustering with Dimensionality Reduction
# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63") # 64 features
# Reduce dimensions with PCA
Cell C1: =ML.DIM_REDUCTION.PCA(2) # Reduce to 2 components
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Cluster in reduced space
Cell E1: =ML.CLUSTERING.KMEANS(10, "k-means++", 15, 300, 0.0001, 42)
Cell F1: =ML.FIT(E1, D1)
# Get cluster labels
Cell G1: =ML.PREDICT(F1, D1)
# Visualize: Plot D1 (2D PCA) colored by G1 (clusters)
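Reducing 64 features to two components is convenient for plotting, but it can discard structure the clusters depend on. Before trusting the result, it is worth checking how much variance the projection retains. A sketch, assuming ML.FIT also accepts transformers and that ML.INSPECT.GET_PARAMS exposes fitted PCA attributes such as explained_variance_ratio_ the way it exposes cluster centers (neither is confirmed by this reference):
# Fit the PCA separately so the fitted object can be inspected (assumed behavior)
Cell H1: =ML.FIT(C1, B1)
Cell I1: =ML.INSPECT.GET_PARAMS(H1)  # Look for explained_variance_ratio_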
Clustering with Different Algorithms
# Prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Lloyd algorithm (default)
Cell E1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42, "lloyd")
Cell F1: =ML.FIT(E1, D1)
# Elkan algorithm (faster for dense data)
Cell E2: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42, "elkan")
Cell F2: =ML.FIT(E2, D1)
# Compare results
Cell G1: =ML.PREDICT(F1, D1)
Cell G2: =ML.PREDICT(F2, D1)
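Lloyd and Elkan optimize the same objective exactly; Elkan only uses the triangle inequality to skip redundant distance computations. With the same random_state, both runs should therefore converge to the same clustering, so the comparison worth making is runtime and reported inertia rather than the labels:
# Both variants should reach the same solution given identical seeds
Cell H1: =ML.INSPECT.GET_PARAMS(F1)  # Inertia for "lloyd"
Cell H2: =ML.INSPECT.GET_PARAMS(F2)  # Inertia for "elkan"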
Inspecting Cluster Centers
# Create and fit K-Means
Cell A1: =ML.CLUSTERING.KMEANS(3, "k-means++", 10, 300, 0.0001, 42)
Cell B1: =ML.FIT(A1, X_scaled)
# Get model parameters (includes cluster centers)
Cell C1: =ML.INSPECT.GET_PARAMS(B1)
# Examine cluster_centers_ and other attributes
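Each row of cluster_centers_ is one centroid: the coordinate-wise mean of the points assigned to that cluster,

$$\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i$$

If the data was scaled before fitting (as with X_scaled above), the centers live in the scaled space; map them back to original units before presenting them.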
Tips and Best Practices
- Choosing Number of Clusters
  - Use the elbow method: plot k vs. inertia
  - Use silhouette analysis (see the sketch after this list)
  - Consider business requirements
  - Start with domain knowledge
- Feature Scaling
  - Always scale features before K-Means; the algorithm is sensitive to feature scales
  - Use StandardScaler or MinMaxScaler
  - Scaling puts features on comparable scales so no single feature dominates the distance computation
- Initialization
  - Use "k-means++" (default) for better results
  - Increase n_init (10-20) for stability
  - Set random_state for reproducibility
  - "random" init is faster but less reliable
- Convergence
  - The default max_iter=300 is usually sufficient; increase it for complex datasets
  - Lower tol for more precision
  - Monitor convergence in production
- Algorithm Selection
  - "lloyd": Safe default, works everywhere
  - "elkan": Faster for dense, Euclidean data
  - Test both on your specific dataset
- Handling Issues
  - Empty clusters: increase n_init
  - Poor convergence: increase max_iter
  - Unstable results: increase n_init and tighten tol
  - Slow performance: try "elkan" or reduce max_iter
- Validation
  - Check that cluster sizes are reasonably balanced
  - Examine the cluster centers
  - Visualize clusters (2D/3D plots)
  - Validate with domain expertise
- Preprocessing Checklist
  - Remove or impute missing values
  - Scale all features
  - Consider feature selection
  - Encode categorical variables before clustering
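For the silhouette analysis mentioned under Choosing Number of Clusters, the usual recipe is to score each candidate k and keep the one with the highest mean silhouette (values near 1 indicate well-separated clusters, values near 0 indicate overlapping ones). This reference does not document a silhouette function, so the ML.METRICS.SILHOUETTE_SCORE name below is hypothetical; substitute whatever metric function your installation provides. The sketch reuses the fitted models F1:F5 from the elbow-method pattern above:
# Hypothetical silhouette comparison (ML.METRICS.SILHOUETTE_SCORE is illustrative, not confirmed)
Cell H1: =ML.METRICS.SILHOUETTE_SCORE(D1, ML.PREDICT(F1, D1))  # k=2
Cell H2: =ML.METRICS.SILHOUETTE_SCORE(D1, ML.PREDICT(F2, D1))  # k=3
Cell H3: =ML.METRICS.SILHOUETTE_SCORE(D1, ML.PREDICT(F3, D1))  # k=4
# Pick the k with the highest score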
Related Functions
- ML.FIT() - Train clustering model
- ML.PREDICT() - Assign cluster labels
- ML.FIT_TRANSFORM() - Fit a transformer and return the transformed data
- ML.PREPROCESSING.STANDARD_SCALER() - Scale features
- ML.DIM_REDUCTION.PCA() - Reduce dimensions
- ML.INSPECT.GET_PARAMS() - Examine cluster properties