AI-Powered Audience Segmentation: K-Means Clustering with GA4 and BigQuery
Manual audience segments are arbitrary. "High-value customers" based on revenue thresholds miss behavioral nuance. K-means clustering discovers natural user groups from your data. This guide shows how to build behavioral clusters in BigQuery ML and export them to Google Ads for targeted campaigns.
Why Manual Segments Fail
Marketing teams create audience segments like:
- “High-value customers” = revenue > $500
- “Engaged users” = 5+ sessions
- “At-risk” = no visit in 30 days
These thresholds are arbitrary. Why $500 and not $400? Why 5 sessions and not 7?
Worse, they miss behavioral patterns. Two users with $600 lifetime value might behave completely differently: one buys frequently in small amounts, another made one large purchase and never returned. They need different marketing strategies, but your segments treat them identically.
Clustering solves this. Instead of defining segments, you let the algorithm discover natural groupings in your data. Users cluster together based on similar behavior patterns, not arbitrary thresholds.
How K-Means Clustering Works
K-means is one of the simplest clustering algorithms:
1. Choose K (the number of clusters you want)
2. Place K random centroids in your data space
3. Assign each user to the nearest centroid
4. Move each centroid to the center of its assigned users
5. Repeat steps 3-4 until the centroids stop moving
The result: K groups of users where members within each group are similar to each other and different from other groups.
You don’t tell the algorithm what “high-value” or “engaged” means. It discovers structure in the data and groups users accordingly.
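The loop above can be sketched in a few lines of Python. This is a toy implementation on 2-D points purely to make the mechanics concrete; BigQuery ML handles all of this for you on the full feature set:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Toy k-means on 2-D points: assign each point to its nearest
    centroid, recenter the centroids, repeat until they stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: random initial centroids
    for _ in range(iters):
        # Step 3: assign each point to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(range(k),
                          key=lambda i: (x - centroids[i][0]) ** 2
                                      + (y - centroids[i][1]) ** 2)
            clusters[nearest].append((x, y))
        # Step 4: move each centroid to the mean of its assigned points.
        new = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # step 5: centroids settled, we're done
            break
        centroids = new
    return centroids, clusters

# Two obvious behavioral groups: low-engagement users near (1, 1),
# high-engagement users near (10, 10). K-means recovers the split.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

Regardless of where the random centroids start, the loop converges to the two natural groups here; with real user features the same process runs in many more dimensions.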
Prerequisites
- GA4 BigQuery Export enabled
- At least 30 days of data (60+ recommended)
- Enough users for meaningful clusters (1,000+ active users)
Step 1: Build Behavioral Features
Clustering needs numeric features that describe user behavior. Transform GA4 events into user-level metrics:
CREATE OR REPLACE TABLE `project.dataset.user_features` AS
WITH user_events AS (
SELECT
user_pseudo_id,
event_name,
event_date,
device.category as device,
traffic_source.medium as medium,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') as session_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') as engagement_time,
(SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value') as event_value
FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
)
SELECT
user_pseudo_id,
-- Engagement features
COUNT(DISTINCT session_id) as total_sessions,
COUNT(DISTINCT event_date) as active_days,
SUM(IFNULL(engagement_time, 0)) / 1000 / 60 as total_engagement_minutes,
COUNT(*) as total_events,
-- Content consumption
COUNTIF(event_name = 'page_view') as page_views,
COUNTIF(event_name = 'scroll') as scroll_events,
COUNTIF(event_name = 'click') as clicks,
-- Purchase behavior
COUNTIF(event_name = 'view_item') as items_viewed,
COUNTIF(event_name = 'add_to_cart') as cart_additions,
COUNTIF(event_name = 'purchase') as purchases,
SUM(IF(event_name = 'purchase', IFNULL(event_value, 0), 0)) as total_revenue,
-- Derived metrics
SAFE_DIVIDE(COUNTIF(event_name = 'page_view'), COUNT(DISTINCT session_id)) as pages_per_session,
SAFE_DIVIDE(COUNTIF(event_name = 'scroll'), COUNTIF(event_name = 'page_view')) as scroll_rate,
SAFE_DIVIDE(COUNTIF(event_name = 'add_to_cart'), COUNTIF(event_name = 'view_item')) as cart_rate,
SAFE_DIVIDE(COUNTIF(event_name = 'purchase'), COUNTIF(event_name = 'add_to_cart')) as purchase_rate,
-- Recency (days since last visit)
DATE_DIFF(CURRENT_DATE(), MAX(PARSE_DATE('%Y%m%d', event_date)), DAY) as days_since_last_visit,
-- Device preference
COUNTIF(device = 'mobile') / COUNT(*) as mobile_ratio,
-- Traffic source
MAX(CASE WHEN medium = 'organic' THEN 1 ELSE 0 END) as is_organic,
MAX(CASE WHEN medium = 'cpc' THEN 1 ELSE 0 END) as is_paid
FROM user_events
GROUP BY user_pseudo_id
HAVING total_sessions >= 2; -- Exclude single-session users
This creates one row per user with ~20 behavioral features.
Step 2: Normalize Features
K-means is sensitive to feature scales. A user with 10,000 page views and $50 revenue would cluster based mostly on page views because it’s a larger number.
Standardization fixes this:
CREATE OR REPLACE TABLE `project.dataset.user_features_normalized` AS
WITH stats AS (
SELECT
AVG(total_sessions) as avg_sessions, STDDEV(total_sessions) as std_sessions,
AVG(active_days) as avg_days, STDDEV(active_days) as std_days,
AVG(total_engagement_minutes) as avg_engagement, STDDEV(total_engagement_minutes) as std_engagement,
AVG(page_views) as avg_pv, STDDEV(page_views) as std_pv,
AVG(items_viewed) as avg_items, STDDEV(items_viewed) as std_items,
AVG(cart_additions) as avg_cart, STDDEV(cart_additions) as std_cart,
AVG(purchases) as avg_purchases, STDDEV(purchases) as std_purchases,
AVG(total_revenue) as avg_revenue, STDDEV(total_revenue) as std_revenue,
AVG(pages_per_session) as avg_pps, STDDEV(pages_per_session) as std_pps,
AVG(scroll_rate) as avg_scroll, STDDEV(scroll_rate) as std_scroll,
AVG(days_since_last_visit) as avg_recency, STDDEV(days_since_last_visit) as std_recency
FROM `project.dataset.user_features`
)
SELECT
f.user_pseudo_id,
-- Standardized features (z-scores)
(f.total_sessions - s.avg_sessions) / NULLIF(s.std_sessions, 0) as sessions_z,
(f.active_days - s.avg_days) / NULLIF(s.std_days, 0) as days_z,
(f.total_engagement_minutes - s.avg_engagement) / NULLIF(s.std_engagement, 0) as engagement_z,
(f.page_views - s.avg_pv) / NULLIF(s.std_pv, 0) as pageviews_z,
(f.items_viewed - s.avg_items) / NULLIF(s.std_items, 0) as items_z,
(f.cart_additions - s.avg_cart) / NULLIF(s.std_cart, 0) as cart_z,
(f.purchases - s.avg_purchases) / NULLIF(s.std_purchases, 0) as purchases_z,
(f.total_revenue - s.avg_revenue) / NULLIF(s.std_revenue, 0) as revenue_z,
(f.pages_per_session - s.avg_pps) / NULLIF(s.std_pps, 0) as pps_z,
(IFNULL(f.scroll_rate, 0) - s.avg_scroll) / NULLIF(s.std_scroll, 0) as scroll_z,
(f.days_since_last_visit - s.avg_recency) / NULLIF(s.std_recency, 0) as recency_z
FROM `project.dataset.user_features` f
CROSS JOIN stats s;
Now all features have mean ~0 and standard deviation ~1.
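The effect of standardization is easy to verify: after the transform, every feature has mean ≈ 0 and standard deviation ≈ 1, so no single metric dominates the distance calculation. A quick Python sanity check with made-up values (not real GA4 data):

```python
import statistics

# Illustrative raw features on very different scales
page_views = [10, 200, 50, 3000, 400]
revenue = [0.0, 25.0, 5.0, 600.0, 80.0]

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard
    deviation (matching BigQuery's STDDEV)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

pv_z = z_scores(page_views)
rev_z = z_scores(revenue)
# Both lists now have mean ~0 and stddev ~1, so page views and
# revenue contribute comparably to the k-means distance.
```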
Step 3: Choose the Number of Clusters
How many clusters should you create? Too few misses nuance; too many creates noise.
The elbow method: train models with several K values and compare how clustering quality changes as K grows:
-- Try K = 3
CREATE OR REPLACE MODEL `project.dataset.kmeans_k3`
OPTIONS(model_type='KMEANS', num_clusters=3) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;
-- Try K = 4
CREATE OR REPLACE MODEL `project.dataset.kmeans_k4`
OPTIONS(model_type='KMEANS', num_clusters=4) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;
-- Try K = 5
CREATE OR REPLACE MODEL `project.dataset.kmeans_k5`
OPTIONS(model_type='KMEANS', num_clusters=5) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;
-- Try K = 6
CREATE OR REPLACE MODEL `project.dataset.kmeans_k6`
OPTIONS(model_type='KMEANS', num_clusters=6) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;Compare the Davies-Bouldin index (lower is better):
SELECT 3 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k3`)
UNION ALL
SELECT 4 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k4`)
UNION ALL
SELECT 5 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k5`)
UNION ALL
SELECT 6 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k6`)
ORDER BY davies_bouldin_index;
Pick the K where adding more clusters stops significantly improving the score. Often K=4-6 works well for marketing audiences.
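Once you have the scores, picking K is a judgment call: take the smallest K after which the index stops improving meaningfully. A small helper makes the rule explicit (the scores and the 0.05 threshold below are made up for illustration):

```python
def pick_k(scores, min_improvement=0.05):
    """Pick the smallest K after which the Davies-Bouldin index stops
    improving by at least min_improvement (lower index is better)."""
    ks = sorted(scores)
    best = ks[0]
    for prev, curr in zip(ks, ks[1:]):
        if scores[prev] - scores[curr] >= min_improvement:
            best = curr  # still improving meaningfully: keep going
        else:
            break        # improvement flattened out: stop at `best`
    return best

# Hypothetical Davies-Bouldin scores from the ML.EVALUATE queries above
db_scores = {3: 1.42, 4: 1.10, 5: 1.07, 6: 1.06}
print(pick_k(db_scores))  # → 4: going from 4 to 5 gains only 0.03
```

In practice you would eyeball the scores rather than automate this, but the logic is the same: stop adding clusters once the marginal gain is small.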
Step 4: Train the Final Model
Once you’ve chosen K (let’s say 5):
CREATE OR REPLACE MODEL `project.dataset.user_segments_model`
OPTIONS(
model_type='KMEANS',
num_clusters=5,
kmeans_init_method='KMEANS++', -- Better initialization
max_iterations=50
) AS
SELECT * EXCEPT(user_pseudo_id)
FROM `project.dataset.user_features_normalized`;Training takes 2-5 minutes.
Step 5: Assign Users to Clusters
Apply the model to get cluster assignments:
CREATE OR REPLACE TABLE `project.dataset.user_clusters` AS
SELECT
n.user_pseudo_id,
p.CENTROID_ID as cluster_id,
f.* -- Include original (non-normalized) features for analysis
FROM `project.dataset.user_features_normalized` n
JOIN `project.dataset.user_features` f USING (user_pseudo_id)
JOIN ML.PREDICT(MODEL `project.dataset.user_segments_model`,
(SELECT * FROM `project.dataset.user_features_normalized`)
) p ON n.user_pseudo_id = p.user_pseudo_id;
Each user now has a cluster_id from 1 to 5.
Step 6: Interpret Cluster Characteristics
The key step: understand what each cluster represents.
SELECT
cluster_id,
COUNT(*) as users,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) as percent_of_total,
-- Engagement profile
ROUND(AVG(total_sessions), 1) as avg_sessions,
ROUND(AVG(active_days), 1) as avg_active_days,
ROUND(AVG(total_engagement_minutes), 1) as avg_engagement_min,
ROUND(AVG(pages_per_session), 1) as avg_pages_per_session,
-- Purchase profile
ROUND(AVG(items_viewed), 1) as avg_items_viewed,
ROUND(AVG(cart_additions), 1) as avg_cart_adds,
ROUND(AVG(purchases), 2) as avg_purchases,
ROUND(AVG(total_revenue), 2) as avg_revenue,
-- Recency
ROUND(AVG(days_since_last_visit), 0) as avg_days_since_visit,
-- Device
ROUND(AVG(mobile_ratio) * 100, 0) as mobile_percent
FROM `project.dataset.user_clusters`
GROUP BY cluster_id
ORDER BY avg_revenue DESC;
Example output:
| cluster | users | % | sessions | days | revenue | recency | mobile |
|---|---|---|---|---|---|---|---|
| 3 | 2,400 | 8% | 12.3 | 18 | $890 | 5 | 25% |
| 1 | 8,200 | 27% | 6.1 | 9 | $125 | 12 | 45% |
| 5 | 5,100 | 17% | 4.2 | 6 | $45 | 18 | 60% |
| 2 | 9,800 | 32% | 2.1 | 3 | $0 | 25 | 55% |
| 4 | 4,500 | 15% | 1.5 | 2 | $0 | 45 | 70% |
Now name them based on behavior:
- Cluster 3: “Champions” - High engagement, high revenue, recent activity, prefer desktop
- Cluster 1: “Loyal Customers” - Regular visitors, moderate spend, engaged
- Cluster 5: “Browsers” - Some engagement, low conversion, mobile-heavy
- Cluster 2: “Window Shoppers” - Low engagement, no purchases, starting to lapse
- Cluster 4: “Churned” - Inactive, no purchases, mostly mobile
Step 7: Visualize Clusters
Create a summary for stakeholders:
-- Radar chart data: normalized averages per cluster
WITH cluster_avgs AS (
SELECT
cluster_id,
AVG(total_sessions) as sessions,
AVG(total_engagement_minutes) as engagement,
AVG(items_viewed) as browsing,
AVG(purchases) as purchases,
AVG(total_revenue) as revenue
FROM `project.dataset.user_clusters`
GROUP BY cluster_id
),
max_vals AS (
SELECT
MAX(sessions) as max_sessions,
MAX(engagement) as max_engagement,
MAX(browsing) as max_browsing,
MAX(purchases) as max_purchases,
MAX(revenue) as max_revenue
FROM cluster_avgs
)
SELECT
cluster_id,
ROUND(sessions / max_sessions, 2) as sessions_norm,
ROUND(engagement / max_engagement, 2) as engagement_norm,
ROUND(browsing / max_browsing, 2) as browsing_norm,
ROUND(purchases / max_purchases, 2) as purchases_norm,
ROUND(revenue / max_revenue, 2) as revenue_norm
FROM cluster_avgs, max_vals
ORDER BY cluster_id;
Use this in Looker Studio to create radar/spider charts showing each cluster’s profile.
Step 8: Export to Google Ads
Create Audience Tables
-- Champions: High-value retention campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_champions` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 3;
-- Loyal Customers: Upsell campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_loyal` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 1;
-- Browsers: Conversion campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_browsers` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 5;
-- Window Shoppers: Re-engagement campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_window_shoppers` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 2;
-- Churned: Win-back campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_churned` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 4;
Connect to Google Ads
Option 1: Google Ads Data Manager
- In Google Ads, go to Tools > Data Manager
- Link your BigQuery project
- Create audiences from each table
Option 2: GA4 Audiences
If you can match user_pseudo_id to GA4 sessions, create audiences in GA4 and sync them to Google Ads automatically.
Option 3: Customer Match
If you have email/phone linked to users:
SELECT
c.user_pseudo_id,
u.email, -- From your CRM/user table
c.cluster_id
FROM `project.dataset.user_clusters` c
JOIN `project.crm.users` u ON c.user_pseudo_id = u.ga_client_id;
Upload to Google Ads Customer Match for each segment.
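Customer Match expects hashed identifiers: emails are trimmed, lowercased, and SHA-256 hashed before upload (Google's guidelines also drop dots in the local part of gmail.com addresses). A minimal sketch of that normalization, assuming plain email strings from your CRM:

```python
import hashlib

def hash_email(email: str) -> str:
    """Normalize an email per Google Ads Customer Match guidelines
    (trim, lowercase, drop dots in gmail.com local parts), then SHA-256 it."""
    normalized = email.strip().lower()
    local, _, domain = normalized.partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")
    return hashlib.sha256(f"{local}@{domain}".encode("utf-8")).hexdigest()

# "  First.Last@Gmail.com " hashes identically to "firstlast@gmail.com"
print(hash_email("  First.Last@Gmail.com "))
```

Check the current Customer Match formatting rules before uploading; unhashed or inconsistently normalized lists will silently fail to match.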
Campaign Strategies by Cluster
| Cluster | Strategy | Message | Bid Adjustment |
|---|---|---|---|
| Champions | Retention, loyalty | Exclusive offers, early access | +50% |
| Loyal | Upsell, cross-sell | Related products, bundles | +30% |
| Browsers | Conversion | Limited time offers, social proof | +20% |
| Window Shoppers | Re-engagement | “We miss you”, discounts | Base bid |
| Churned | Win-back | Heavy discounts, new products | -20% |
Different clusters need different creatives, landing pages, and bid strategies.
Maintaining Clusters
Refresh Weekly
User behavior changes. Re-cluster weekly to keep segments current:
-- Schedule this query weekly
-- 1. Rebuild features
-- 2. Re-predict cluster assignments
-- 3. Update audience tables
Monitor Cluster Stability
Track cluster sizes over time:
SELECT
DATE_TRUNC(CURRENT_DATE(), WEEK) as week,
cluster_id,
COUNT(*) as users
FROM `project.dataset.user_clusters`
GROUP BY 1, 2
ORDER BY week, cluster_id;
If cluster sizes swing wildly week-to-week, your features might be too sensitive to short-term behavior. Consider using longer lookback windows (90 days instead of 60).
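The week-over-week check is easy to automate: flag any cluster whose size moved more than some threshold relative to the prior snapshot. A sketch assuming you export the weekly counts from the query above (the 25% threshold is an arbitrary starting point):

```python
def unstable_clusters(prev_sizes, curr_sizes, threshold=0.25):
    """Return cluster ids whose user count changed by more than
    `threshold` (as a fraction) between two weekly snapshots."""
    flagged = []
    for cid, prev in prev_sizes.items():
        curr = curr_sizes.get(cid, 0)
        if prev and abs(curr - prev) / prev > threshold:
            flagged.append(cid)
    return flagged

# Example: cluster 4 shrank from 4,500 to 2,900 users (about -36%)
last_week = {1: 8200, 2: 9800, 3: 2400, 4: 4500, 5: 5100}
this_week = {1: 8400, 2: 9500, 3: 2500, 4: 2900, 5: 5200}
print(unstable_clusters(last_week, this_week))  # → [4]
```

A flagged cluster is a prompt to investigate, not an automatic alarm: a promotion or seasonality can legitimately move users between segments.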
Re-evaluate K Periodically
As your business grows, you might need more segments. Re-run the elbow method quarterly to check if more clusters would be beneficial.
Advanced: RFM Clustering
A classic approach specifically for purchase behavior:
- Recency: Days since last purchase
- Frequency: Number of purchases
- Monetary: Total revenue
CREATE OR REPLACE TABLE `project.dataset.rfm_features` AS
SELECT
user_pseudo_id,
-- Recency (lower is better)
DATE_DIFF(CURRENT_DATE(),
MAX(IF(event_name = 'purchase', PARSE_DATE('%Y%m%d', event_date), NULL)),
DAY) as recency,
-- Frequency
COUNTIF(event_name = 'purchase') as frequency,
-- Monetary
SUM(IF(event_name = 'purchase',
(SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'),
0)) as monetary
FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY))
GROUP BY user_pseudo_id
HAVING frequency > 0; -- Only purchasers
Then cluster on these three features for a purchase-focused segmentation.
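The R/F/M computation itself is simple enough to show end to end. A toy Python version over an invented purchase log (user ids, dates, and values are all hypothetical) mirrors what the SQL above does per user:

```python
from datetime import date

# Toy purchase log: (user_pseudo_id, purchase_date, purchase_value)
purchases = [
    ("u1", date(2024, 5, 1), 40.0),
    ("u1", date(2024, 6, 10), 60.0),
    ("u2", date(2024, 3, 15), 500.0),
]
today = date(2024, 6, 30)

# Accumulate last purchase date, purchase count, and total revenue per user
rfm = {}
for user, d, value in purchases:
    last, freq, monetary = rfm.get(user, (d, 0, 0.0))
    rfm[user] = (max(last, d), freq + 1, monetary + value)

# Convert the last-purchase date into recency in days
rfm = {u: ((today - last).days, f, m) for u, (last, f, m) in rfm.items()}
print(rfm)  # {'u1': (20, 2, 100.0), 'u2': (107, 1, 500.0)}
```

Each user ends up as a (recency, frequency, monetary) triple, which is exactly the feature vector the k-means model clusters on.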
Cost Estimate
| Component | Cost |
|---|---|
| K-means training (5 models for elbow) | ~$2-5 |
| Weekly re-clustering | ~$1/week |
| Feature table storage | ~$1/month |
| Monthly total | ~$10-15 |
Summary
K-means clustering discovers natural audience segments that manual rules miss:
- Build behavioral features from GA4 events
- Normalize features so all metrics contribute equally
- Choose K using the elbow method
- Train the model with one SQL statement
- Interpret clusters by analyzing average behaviors
- Export to Google Ads for targeted campaigns
- Refresh weekly to keep segments current
The clusters you discover often surprise you. Users group in ways that don’t match your assumptions, revealing opportunities for more nuanced marketing strategies.
Start with 4-5 clusters. Once you understand those, experiment with more granular segmentation.
Related Resources
- Predicting Customer Value with BigQuery ML - Combine clusters with propensity models for precision targeting
- Automated Anomaly Detection for GA4 - Detect data quality issues before they corrupt your clusters
- Tracking Strategy: Build a Measurement System That Works