AI-Powered Audience Segmentation: K-Means Clustering with GA4 and BigQuery

Manual audience segments are arbitrary. "High-value customers" based on revenue thresholds miss behavioral nuance. K-means clustering discovers natural user groups from your data. This guide shows how to build behavioral clusters in BigQuery ML and export them to Google Ads for targeted campaigns.

Why Manual Segments Fail

Marketing teams create audience segments like:

  • “High-value customers” = revenue > $500
  • “Engaged users” = 5+ sessions
  • “At-risk” = no visit in 30 days

These thresholds are arbitrary. Why $500 and not $400? Why 5 sessions and not 7?

Worse, they miss behavioral patterns. Two users with $600 lifetime value might behave completely differently: one buys frequently in small amounts, another made one large purchase and never returned. They need different marketing strategies, but your segments treat them identically.

Clustering solves this. Instead of defining segments, you let the algorithm discover natural groupings in your data. Users cluster together based on similar behavior patterns, not arbitrary thresholds.


How K-Means Clustering Works

K-means is one of the simplest clustering algorithms:

  1. Choose K (number of clusters you want)
  2. Place K random centroids in your data space
  3. Assign each user to the nearest centroid
  4. Move centroids to the center of their assigned users
  5. Repeat steps 3-4 until centroids stop moving

The result: K groups of users where members within each group are similar to each other and different from other groups.

You don’t tell the algorithm what “high-value” or “engaged” means. It discovers structure in the data and groups users accordingly.
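The loop above can be sketched in a few lines of Python — a toy NumPy illustration, not the BigQuery ML implementation (which handles initialization and convergence for you):

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Toy k-means on an (n, d) array: returns (centroids, labels)."""
    # Step 2 (simplified): use the first k points as starting centroids;
    # real implementations pick randomly or use k-means++
    centroids = points[:k].copy()
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Six users described by (sessions, pages per session): two obvious groups
users = np.array([[1, 2], [9, 10], [2, 1], [1, 1], [10, 9], [10, 10]], dtype=float)
centroids, labels = kmeans(users, k=2)
print(labels)  # users 0, 2, 3 share one label; users 1, 4, 5 share the other
```

The algorithm was never told what the groups mean; it only sees that some rows sit close together.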


Prerequisites

  • GA4 BigQuery Export enabled
  • At least 30 days of data (60+ recommended)
  • Enough users for meaningful clusters (1,000+ active users)

Step 1: Build Behavioral Features

Clustering needs numeric features that describe user behavior. Transform GA4 events into user-level metrics:

CREATE OR REPLACE TABLE `project.dataset.user_features` AS

WITH user_events AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_date,
    device.category as device,
    traffic_source.medium as medium,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') as session_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') as engagement_time,
    (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value') as event_value
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY))
    AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
)

SELECT
  user_pseudo_id,

  -- Engagement features
  COUNT(DISTINCT session_id) as total_sessions,
  COUNT(DISTINCT event_date) as active_days,
  SUM(IFNULL(engagement_time, 0)) / 1000 / 60 as total_engagement_minutes,
  COUNT(*) as total_events,

  -- Content consumption
  COUNTIF(event_name = 'page_view') as page_views,
  COUNTIF(event_name = 'scroll') as scroll_events,
  COUNTIF(event_name = 'click') as clicks,

  -- Purchase behavior
  COUNTIF(event_name = 'view_item') as items_viewed,
  COUNTIF(event_name = 'add_to_cart') as cart_additions,
  COUNTIF(event_name = 'purchase') as purchases,
  SUM(IF(event_name = 'purchase', IFNULL(event_value, 0), 0)) as total_revenue,

  -- Derived metrics
  SAFE_DIVIDE(COUNTIF(event_name = 'page_view'), COUNT(DISTINCT session_id)) as pages_per_session,
  SAFE_DIVIDE(COUNTIF(event_name = 'scroll'), COUNTIF(event_name = 'page_view')) as scroll_rate,
  SAFE_DIVIDE(COUNTIF(event_name = 'add_to_cart'), COUNTIF(event_name = 'view_item')) as cart_rate,
  SAFE_DIVIDE(COUNTIF(event_name = 'purchase'), COUNTIF(event_name = 'add_to_cart')) as purchase_rate,

  -- Recency (days since last visit)
  DATE_DIFF(CURRENT_DATE(), MAX(PARSE_DATE('%Y%m%d', event_date)), DAY) as days_since_last_visit,

  -- Device preference
  COUNTIF(device = 'mobile') / COUNT(*) as mobile_ratio,

  -- Traffic source
  MAX(CASE WHEN medium = 'organic' THEN 1 ELSE 0 END) as is_organic,
  MAX(CASE WHEN medium = 'cpc' THEN 1 ELSE 0 END) as is_paid

FROM user_events
GROUP BY user_pseudo_id
HAVING total_sessions >= 2;  -- Exclude single-session users

This creates one row per user with ~20 behavioral features.


Step 2: Normalize Features

K-means is sensitive to feature scales. A user with 10,000 page views and $50 revenue would cluster based mostly on page views because it’s a larger number.

Standardization fixes this:

CREATE OR REPLACE TABLE `project.dataset.user_features_normalized` AS

WITH stats AS (
  SELECT
    AVG(total_sessions) as avg_sessions, STDDEV(total_sessions) as std_sessions,
    AVG(active_days) as avg_days, STDDEV(active_days) as std_days,
    AVG(total_engagement_minutes) as avg_engagement, STDDEV(total_engagement_minutes) as std_engagement,
    AVG(page_views) as avg_pv, STDDEV(page_views) as std_pv,
    AVG(items_viewed) as avg_items, STDDEV(items_viewed) as std_items,
    AVG(cart_additions) as avg_cart, STDDEV(cart_additions) as std_cart,
    AVG(purchases) as avg_purchases, STDDEV(purchases) as std_purchases,
    AVG(total_revenue) as avg_revenue, STDDEV(total_revenue) as std_revenue,
    AVG(pages_per_session) as avg_pps, STDDEV(pages_per_session) as std_pps,
    AVG(scroll_rate) as avg_scroll, STDDEV(scroll_rate) as std_scroll,
    AVG(days_since_last_visit) as avg_recency, STDDEV(days_since_last_visit) as std_recency
  FROM `project.dataset.user_features`
)

SELECT
  f.user_pseudo_id,

  -- Standardized features (z-scores)
  (f.total_sessions - s.avg_sessions) / NULLIF(s.std_sessions, 0) as sessions_z,
  (f.active_days - s.avg_days) / NULLIF(s.std_days, 0) as days_z,
  (f.total_engagement_minutes - s.avg_engagement) / NULLIF(s.std_engagement, 0) as engagement_z,
  (f.page_views - s.avg_pv) / NULLIF(s.std_pv, 0) as pageviews_z,
  (f.items_viewed - s.avg_items) / NULLIF(s.std_items, 0) as items_z,
  (f.cart_additions - s.avg_cart) / NULLIF(s.std_cart, 0) as cart_z,
  (f.purchases - s.avg_purchases) / NULLIF(s.std_purchases, 0) as purchases_z,
  (f.total_revenue - s.avg_revenue) / NULLIF(s.std_revenue, 0) as revenue_z,
  (f.pages_per_session - s.avg_pps) / NULLIF(s.std_pps, 0) as pps_z,
  (IFNULL(f.scroll_rate, 0) - s.avg_scroll) / NULLIF(s.std_scroll, 0) as scroll_z,
  (f.days_since_last_visit - s.avg_recency) / NULLIF(s.std_recency, 0) as recency_z

FROM `project.dataset.user_features` f
CROSS JOIN stats s;

Now all features have mean ~0 and standard deviation ~1.
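The same transform in Python, to make the effect concrete (a NumPy stand-in for the SQL above):

```python
import numpy as np

# Two features on wildly different scales, as in the example above
page_views = np.array([10_000.0, 200.0, 50.0, 3_000.0, 400.0])
revenue = np.array([50.0, 600.0, 0.0, 120.0, 80.0])

def zscore(x):
    """Standardize to mean 0, std 1; guard against zero spread
    the way the SQL guards with NULLIF(std, 0)."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

pv_z, rev_z = zscore(page_views), zscore(revenue)

# After standardization both features vary on the same scale,
# so neither dominates the distance calculation
print(pv_z.mean(), pv_z.std())   # mean ~0, std ~1
print(rev_z.mean(), rev_z.std())  # mean ~0, std ~1
```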


Step 3: Choose the Number of Clusters

How many clusters should you create? Too few misses nuance; too many creates noise.

The Elbow Method: Train models with different K values and look at the error:

-- Try K = 3
CREATE OR REPLACE MODEL `project.dataset.kmeans_k3`
OPTIONS(model_type='KMEANS', num_clusters=3) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;

-- Try K = 4
CREATE OR REPLACE MODEL `project.dataset.kmeans_k4`
OPTIONS(model_type='KMEANS', num_clusters=4) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;

-- Try K = 5
CREATE OR REPLACE MODEL `project.dataset.kmeans_k5`
OPTIONS(model_type='KMEANS', num_clusters=5) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;

-- Try K = 6
CREATE OR REPLACE MODEL `project.dataset.kmeans_k6`
OPTIONS(model_type='KMEANS', num_clusters=6) AS
SELECT * EXCEPT(user_pseudo_id) FROM `project.dataset.user_features_normalized`;

Compare the results. ML.EVALUATE returns both the mean squared distance (the error the elbow method tracks) and the Davies-Bouldin index (lower is better):

SELECT 3 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k3`)
UNION ALL
SELECT 4 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k4`)
UNION ALL
SELECT 5 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k5`)
UNION ALL
SELECT 6 as k, * FROM ML.EVALUATE(MODEL `project.dataset.kmeans_k6`)
ORDER BY davies_bouldin_index;

Pick the K where adding more clusters stops significantly improving the score. Often K=4-6 works well for marketing audiences.
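The same sweep can be prototyped locally before paying for four BigQuery ML training runs — for example with scikit-learn, whose inertia_ attribute is the within-cluster sum of squared distances that the elbow method plots (a sketch on synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the normalized feature table: 3 true groups in 4 features
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 4)) for c in (-2, 0, 2)])

inertias = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias[k] = model.inertia_

for k, inertia in sorted(inertias.items()):
    print(k, round(inertia, 1))
# The drop in inertia flattens sharply after k=3 -- the "elbow" at the true group count
```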


Step 4: Train the Final Model

Once you’ve chosen K (let’s say 5):

CREATE OR REPLACE MODEL `project.dataset.user_segments_model`
OPTIONS(
  model_type='KMEANS',
  num_clusters=5,
  kmeans_init_method='KMEANS++',  -- Better initialization
  max_iterations=50
) AS
SELECT * EXCEPT(user_pseudo_id)
FROM `project.dataset.user_features_normalized`;

Training typically takes a few minutes, depending on data volume.


Step 5: Assign Users to Clusters

Apply the model to get cluster assignments:

CREATE OR REPLACE TABLE `project.dataset.user_clusters` AS

SELECT
  f.user_pseudo_id,
  p.CENTROID_ID as cluster_id,
  f.* EXCEPT (user_pseudo_id)  -- Original (non-normalized) features for analysis
FROM ML.PREDICT(
  MODEL `project.dataset.user_segments_model`,
  (SELECT * FROM `project.dataset.user_features_normalized`)
) p
JOIN `project.dataset.user_features` f
  ON p.user_pseudo_id = f.user_pseudo_id;

Each user now has a cluster_id from 1 to 5.


Step 6: Interpret Cluster Characteristics

The key step: understand what each cluster represents.

SELECT
  cluster_id,
  COUNT(*) as users,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) as percent_of_total,

  -- Engagement profile
  ROUND(AVG(total_sessions), 1) as avg_sessions,
  ROUND(AVG(active_days), 1) as avg_active_days,
  ROUND(AVG(total_engagement_minutes), 1) as avg_engagement_min,
  ROUND(AVG(pages_per_session), 1) as avg_pages_per_session,

  -- Purchase profile
  ROUND(AVG(items_viewed), 1) as avg_items_viewed,
  ROUND(AVG(cart_additions), 1) as avg_cart_adds,
  ROUND(AVG(purchases), 2) as avg_purchases,
  ROUND(AVG(total_revenue), 2) as avg_revenue,

  -- Recency
  ROUND(AVG(days_since_last_visit), 0) as avg_days_since_visit,

  -- Device
  ROUND(AVG(mobile_ratio) * 100, 0) as mobile_percent

FROM `project.dataset.user_clusters`
GROUP BY cluster_id
ORDER BY avg_revenue DESC;

Example output:

cluster | users | %   | sessions | days | revenue | recency | mobile
3       | 2,400 | 8%  | 12.3     | 18   | $890    | 5       | 25%
1       | 8,200 | 27% | 6.1      | 9    | $125    | 12      | 45%
5       | 5,100 | 17% | 4.2      | 6    | $45     | 18      | 60%
2       | 9,800 | 32% | 2.1      | 3    | $0      | 25      | 55%
4       | 4,500 | 15% | 1.5      | 2    | $0      | 45      | 70%

Now name them based on behavior:

  • Cluster 3: “Champions” - High engagement, high revenue, recent activity, prefer desktop
  • Cluster 1: “Loyal Customers” - Regular visitors, moderate spend, engaged
  • Cluster 5: “Browsers” - Some engagement, low conversion, mobile-heavy
  • Cluster 2: “Window Shoppers” - Low engagement, no purchases, starting to lapse
  • Cluster 4: “Churned” - Inactive, no purchases, mostly mobile
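The same profiling works locally with pandas once assignments are exported (a sketch with made-up rows; column names follow the feature table above):

```python
import pandas as pd

# Toy export of user_clusters: one row per user
df = pd.DataFrame({
    "cluster_id":     [3, 3, 1, 1, 4, 4],
    "total_sessions": [12, 13, 6, 7, 1, 2],
    "total_revenue":  [900.0, 880.0, 120.0, 130.0, 0.0, 0.0],
})

profile = (
    df.groupby("cluster_id")
      .agg(users=("cluster_id", "size"),
           avg_sessions=("total_sessions", "mean"),
           avg_revenue=("total_revenue", "mean"))
      .sort_values("avg_revenue", ascending=False)
)
print(profile)  # cluster 3 tops the table: the "Champions" candidate
```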

Step 7: Visualize Clusters

Create a summary for stakeholders:

-- Radar chart data: normalized averages per cluster
WITH cluster_avgs AS (
  SELECT
    cluster_id,
    AVG(total_sessions) as sessions,
    AVG(total_engagement_minutes) as engagement,
    AVG(items_viewed) as browsing,
    AVG(purchases) as purchases,
    AVG(total_revenue) as revenue
  FROM `project.dataset.user_clusters`
  GROUP BY cluster_id
),
max_vals AS (
  SELECT
    MAX(sessions) as max_sessions,
    MAX(engagement) as max_engagement,
    MAX(browsing) as max_browsing,
    MAX(purchases) as max_purchases,
    MAX(revenue) as max_revenue
  FROM cluster_avgs
)

SELECT
  cluster_id,
  ROUND(sessions / max_sessions, 2) as sessions_norm,
  ROUND(engagement / max_engagement, 2) as engagement_norm,
  ROUND(browsing / max_browsing, 2) as browsing_norm,
  ROUND(purchases / max_purchases, 2) as purchases_norm,
  ROUND(revenue / max_revenue, 2) as revenue_norm
FROM cluster_avgs, max_vals
ORDER BY cluster_id;

Use this in Looker Studio to create radar/spider charts showing each cluster’s profile.


Step 8: Export to Google Ads

Create Audience Tables

-- Champions: High-value retention campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_champions` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 3;

-- Loyal Customers: Upsell campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_loyal` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 1;

-- Browsers: Conversion campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_browsers` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 5;

-- Window Shoppers: Re-engagement campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_window_shoppers` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 2;

-- Churned: Win-back campaigns
CREATE OR REPLACE TABLE `project.dataset.audience_churned` AS
SELECT user_pseudo_id
FROM `project.dataset.user_clusters`
WHERE cluster_id = 4;

Connect to Google Ads

Option 1: Google Ads Data Manager

  1. In Google Ads, go to Tools > Data Manager
  2. Link your BigQuery project
  3. Create audiences from each table

Option 2: GA4 Audiences

If you can match user_pseudo_id to GA4 sessions, create audiences in GA4 and sync to Google Ads automatically.

Option 3: Customer Match

If you have email/phone linked to users:

SELECT
  c.user_pseudo_id,
  u.email,  -- From your CRM/user table
  c.cluster_id
FROM `project.dataset.user_clusters` c
JOIN `project.crm.users` u ON c.user_pseudo_id = u.ga_client_id;

Upload to Google Ads Customer Match for each segment.


Campaign Strategies by Cluster

Cluster         | Strategy           | Message                           | Bid Adjustment
Champions       | Retention, loyalty | Exclusive offers, early access    | +50%
Loyal           | Upsell, cross-sell | Related products, bundles         | +30%
Browsers        | Conversion         | Limited-time offers, social proof | +20%
Window Shoppers | Re-engagement      | "We miss you", discounts          | Base bid
Churned         | Win-back           | Heavy discounts, new products     | -20%

Different clusters need different creatives, landing pages, and bid strategies.
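The bid arithmetic itself is just base bid × (1 + adjustment). A quick sketch (the $2.00 base CPC is a made-up figure; the percentages mirror the strategy table above):

```python
base_bid = 2.00  # hypothetical base CPC in dollars

# Bid adjustment per segment, as a fraction of the base bid
adjustments = {
    "Champions": 0.50,
    "Loyal": 0.30,
    "Browsers": 0.20,
    "Window Shoppers": 0.00,
    "Churned": -0.20,
}

# Effective CPC per segment
bids = {segment: round(base_bid * (1 + adj), 2) for segment, adj in adjustments.items()}
print(bids)  # Champions bid $3.00, Churned bid $1.60
```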


Maintaining Clusters

Refresh Weekly

User behavior changes. Re-cluster weekly to keep segments current:

-- Schedule this query weekly
-- 1. Rebuild features
-- 2. Re-predict cluster assignments
-- 3. Update audience tables

Monitor Cluster Stability

Track cluster sizes over time. Since user_clusters is rebuilt each week, append every run's counts to a history table (here called cluster_history; create it once) so weeks can be compared:

-- Run after each weekly re-clustering
INSERT INTO `project.dataset.cluster_history` (week, cluster_id, users)
SELECT
  DATE_TRUNC(CURRENT_DATE(), WEEK) as week,
  cluster_id,
  COUNT(*) as users
FROM `project.dataset.user_clusters`
GROUP BY 1, 2;

If cluster sizes swing wildly week-to-week, your features might be too sensitive to short-term behavior. Consider using longer lookback windows (90 days instead of 60).
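That check can be automated: compare each cluster's share of users week over week and flag big moves. A sketch in Python (the 10-point threshold is an arbitrary starting point; the counts reuse the example table above):

```python
def unstable_clusters(last_week, this_week, threshold_pts=10.0):
    """Return cluster_ids whose share of users moved more than threshold_pts
    percentage points between two weekly runs. Inputs: {cluster_id: user_count}."""
    def shares(counts):
        total = sum(counts.values())
        return {cid: 100.0 * n / total for cid, n in counts.items()}

    prev, curr = shares(last_week), shares(this_week)
    return sorted(
        cid for cid in curr
        if abs(curr[cid] - prev.get(cid, 0.0)) > threshold_pts
    )

last_week = {1: 8200, 2: 9800, 3: 2400, 4: 4500, 5: 5100}
this_week = {1: 8100, 2: 5500, 3: 2500, 4: 8900, 5: 5000}
print(unstable_clusters(last_week, this_week))  # clusters 2 and 4 shifted sharply
```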

Re-evaluate K Periodically

As your business grows, you might need more segments. Re-run the elbow method quarterly to check if more clusters would be beneficial.


Advanced: RFM Clustering

A classic approach specifically for purchase behavior:

  • Recency: Days since last purchase
  • Frequency: Number of purchases
  • Monetary: Total revenue

CREATE OR REPLACE TABLE `project.dataset.rfm_features` AS

SELECT
  user_pseudo_id,

  -- Recency (lower is better)
  DATE_DIFF(CURRENT_DATE(),
    MAX(IF(event_name = 'purchase', PARSE_DATE('%Y%m%d', event_date), NULL)),
    DAY) as recency,

  -- Frequency
  COUNTIF(event_name = 'purchase') as frequency,

  -- Monetary
  SUM(IF(event_name = 'purchase',
    (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'),
    0)) as monetary

FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY))
GROUP BY user_pseudo_id
HAVING frequency > 0;  -- Only purchasers

Then cluster on these three features for a purchase-focused segmentation.
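Once the RFM table exists, the rest is the same pipeline: standardize the three columns and fit k-means. A local scikit-learn sketch on toy RFM rows (a stand-in for the BigQuery ML step):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy RFM rows: (recency_days, frequency, monetary)
rfm = np.array([
    [5, 12, 900.0], [8, 10, 750.0], [7, 11, 820.0],  # recent, frequent, high spend
    [40, 2, 60.0], [55, 1, 30.0], [48, 1, 45.0],     # lapsing, infrequent, low spend
])

# Standardize so monetary does not dominate recency and frequency
scaled = StandardScaler().fit_transform(rfm)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)  # the three high-value rows share one label, the rest the other
```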


Cost Estimate

Component                              | Cost
K-means training (elbow sweep + final) | ~$2-5
Weekly re-clustering                   | ~$1/week
Feature table storage                  | ~$1/month
Monthly total                          | ~$10-15

Summary

K-means clustering discovers natural audience segments that manual rules miss:

  1. Build behavioral features from GA4 events
  2. Normalize features so all metrics contribute equally
  3. Choose K using the elbow method
  4. Train the model with one SQL statement
  5. Interpret clusters by analyzing average behaviors
  6. Export to Google Ads for targeted campaigns
  7. Refresh weekly to keep segments current

The clusters you discover often surprise you. Users group in ways that don’t match your assumptions, revealing opportunities for more nuanced marketing strategies.

Start with 4-5 clusters. Once you understand those, experiment with more granular segmentation.


