Predicting Customer Value with BigQuery ML and GA4 Data

Your GA4 data contains patterns that predict which users will convert, churn, or become high-value customers. BigQuery ML lets you build machine learning models using SQL alone, with no Python required. This guide walks through building a purchase propensity model and using its predictions to power Google Ads audiences.

Why Predict Customer Behavior?

Every user on your site leaves behavioral signals. Page views, scroll depth, time on site, products viewed, cart additions. These signals correlate with outcomes: purchases, sign-ups, churn.

Traditional analytics tells you what happened. Predictive analytics tells you what will happen.

With predictions, you can:

  • Target high-propensity users with Google Ads before they convert elsewhere
  • Identify churn risk and trigger retention campaigns
  • Allocate budget toward users likely to have high lifetime value
  • Personalize experiences based on predicted interests

The challenge has always been building these models. Until BigQuery ML.


What is BigQuery ML?

BigQuery ML lets you create and execute machine learning models using standard SQL. No Python, no TensorFlow, no ML infrastructure to manage.

CREATE MODEL `project.dataset.my_model`
OPTIONS(model_type='LOGISTIC_REG') AS
SELECT * FROM training_data;

That’s it. BigQuery handles preprocessing (feature scaling and one-hot encoding of categorical columns), training, hyperparameter defaults, and model hosting. You still design the input features yourself, as we’ll do below.

Supported model types include:

  • Linear regression - Predict continuous values (revenue, LTV)
  • Logistic regression - Predict binary outcomes (will purchase / won’t purchase)
  • K-means clustering - Segment users into groups
  • XGBoost - High-accuracy predictions with gradient boosting
  • Deep Neural Networks - Complex pattern recognition
  • ARIMA - Time series forecasting

For most marketing use cases, logistic regression and XGBoost cover 90% of needs.
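
If logistic regression underperforms, swapping in gradient-boosted trees is a one-line change. A sketch (assuming a training_data table like the one we build later; BOOSTED_TREE_CLASSIFIER is BigQuery ML’s XGBoost-based model type):

CREATE OR REPLACE MODEL `project.dataset.purchase_propensity_xgb`
OPTIONS(
  model_type='BOOSTED_TREE_CLASSIFIER',  -- gradient boosting (XGBoost)
  input_label_cols=['will_purchase']
) AS
SELECT * FROM training_data;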


Prerequisites

1. GA4 BigQuery Export

Your GA4 property must export data to BigQuery. This is free and takes 5 minutes to set up:

  1. In GA4, go to Admin > BigQuery Links
  2. Click Link and select your Google Cloud project
  3. Choose Daily export (streaming adds cost)
  4. Enable Include advertising identifiers if you want to build Google Ads audiences

Data starts flowing within 24 hours. You’ll get tables like:

analytics_PROPERTY_ID.events_YYYYMMDD
analytics_PROPERTY_ID.events_intraday_YYYYMMDD

2. BigQuery Access

You need a Google Cloud project with BigQuery enabled. If you linked GA4, you already have this.

Billing must be enabled, but model training on GA4 data typically costs $1-10 depending on data volume.
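
To sanity-check cost before training, you can measure how much data your export holds. On-demand pricing is per byte scanned, so total export size is a rough upper bound (this query uses BigQuery’s __TABLES__ metadata view):

-- Total size of the GA4 export tables, in GB
SELECT
  ROUND(SUM(size_bytes) / POW(10, 9), 2) AS total_gb
FROM `project.analytics_PROPERTY_ID.__TABLES__`
WHERE table_id LIKE 'events_%';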


Building a Purchase Propensity Model

Let’s build a model that predicts which users will purchase in the next 7 days.

Step 1: Understand Your Data

First, explore what’s available:

-- See all event types in your data
SELECT
  event_name,
  COUNT(*) as event_count
FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260130'
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 20;

Common events useful for prediction:

  • page_view - Engagement level
  • scroll - Content consumption
  • view_item - Product interest
  • add_to_cart - Purchase intent
  • begin_checkout - Strong purchase intent
  • purchase - Our target variable

Step 2: Create Training Features

We need to transform raw events into features (input variables) that predict purchase behavior.

CREATE OR REPLACE TABLE `project.dataset.training_data` AS

WITH user_features AS (
  SELECT
    user_pseudo_id,

    -- Engagement metrics
    COUNT(*) as total_events,
    COUNT(DISTINCT event_date) as active_days,
    COUNTIF(event_name = 'page_view') as page_views,
    COUNTIF(event_name = 'scroll') as scroll_events,

    -- Purchase intent signals
    COUNTIF(event_name = 'view_item') as items_viewed,
    COUNTIF(event_name = 'add_to_cart') as cart_additions,
    COUNTIF(event_name = 'begin_checkout') as checkout_starts,

    -- Session metrics
    COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')) as sessions,

    -- Device and traffic
    MAX(device.category) as device_category,
    MAX(traffic_source.medium) as traffic_medium,

    -- Time-based
    MIN(PARSE_DATE('%Y%m%d', event_date)) as first_seen,
    MAX(PARSE_DATE('%Y%m%d', event_date)) as last_seen

  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260123'  -- Training period
  GROUP BY user_pseudo_id
),

-- Label: Did user purchase in the following 7 days?
purchase_labels AS (
  SELECT DISTINCT
    user_pseudo_id,
    1 as purchased
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260124' AND '20260130'  -- Prediction window
    AND event_name = 'purchase'
)

SELECT
  f.*,
  DATE_DIFF(f.last_seen, f.first_seen, DAY) as days_active,
  COALESCE(l.purchased, 0) as will_purchase  -- Target variable
FROM user_features f
LEFT JOIN purchase_labels l USING (user_pseudo_id);

This creates a table where each row is a user with:

  • Features: Their behavior during the training period
  • Label: Whether they purchased in the following 7 days (1 or 0)
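
Before training, it’s worth checking how imbalanced the label is — purchasers are usually a small minority, which is why the next step enables auto_class_weights. A quick sanity check on the table we just built:

SELECT
  will_purchase,
  COUNT(*) AS users,
  ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 4) AS share
FROM `project.dataset.training_data`
GROUP BY will_purchase;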

Step 3: Train the Model

Now create the model:

CREATE OR REPLACE MODEL `project.dataset.purchase_propensity_model`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['will_purchase'],
  auto_class_weights=TRUE,  -- Handle imbalanced data (few purchasers vs many non-purchasers)
  enable_global_explain=TRUE  -- See feature importance
) AS
SELECT
  -- Exclude user_pseudo_id and date fields from features
  total_events,
  active_days,
  page_views,
  scroll_events,
  items_viewed,
  cart_additions,
  checkout_starts,
  sessions,
  device_category,
  traffic_medium,
  days_active,
  will_purchase
FROM `project.dataset.training_data`;

Training takes 2-10 minutes depending on data size. BigQuery handles everything: feature scaling, regularization, cross-validation.
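
You can audit the training run itself with ML.TRAINING_INFO, which reports loss per iteration (eval_loss should decrease alongside loss; if it rises while loss keeps falling, the model is overfitting):

SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL `project.dataset.purchase_propensity_model`)
ORDER BY iteration;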

Step 4: Evaluate the Model

Check how well the model performs:

SELECT *
FROM ML.EVALUATE(MODEL `project.dataset.purchase_propensity_model`);

Key metrics to look at:

Metric      What It Means                                             Good Value
precision   Of users predicted to purchase, how many actually did?   > 0.3
recall      Of users who purchased, how many did we predict?          > 0.5
accuracy    Overall correct predictions                               > 0.7
f1_score    Balance of precision and recall                           > 0.4
roc_auc     Model’s ability to distinguish purchasers                 > 0.7

For marketing use cases, roc_auc > 0.7 indicates a useful model.
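
For a concrete view of the errors at a given decision threshold, ML.CONFUSION_MATRIX breaks predictions down by class. A sketch using 0.5 as the threshold — note this evaluates on the training table, which is optimistic; a held-out slice is better in practice:

SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `project.dataset.purchase_propensity_model`,
  (SELECT
     total_events, active_days, page_views, scroll_events,
     items_viewed, cart_additions, checkout_starts, sessions,
     device_category, traffic_medium, days_active, will_purchase
   FROM `project.dataset.training_data`),
  STRUCT(0.5 AS threshold)
);

Raising the threshold trades recall for precision: fewer users are flagged, but a larger share of them actually purchase.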

Step 5: Understand Feature Importance

See which behaviors most predict purchases:

SELECT *
FROM ML.GLOBAL_EXPLAIN(MODEL `project.dataset.purchase_propensity_model`)
ORDER BY attribution DESC;

Typical findings:

  • cart_additions and checkout_starts are strongest predictors (obviously)
  • items_viewed matters more than page_views
  • active_days shows engaged users convert more
  • traffic_medium reveals which channels bring high-intent users

Making Predictions

Score Current Users

Apply the model to users from the last 7 days:

CREATE OR REPLACE TABLE `project.dataset.user_propensity_scores` AS

WITH recent_users AS (
  SELECT
    user_pseudo_id,
    COUNT(*) as total_events,
    COUNT(DISTINCT event_date) as active_days,
    COUNTIF(event_name = 'page_view') as page_views,
    COUNTIF(event_name = 'scroll') as scroll_events,
    COUNTIF(event_name = 'view_item') as items_viewed,
    COUNTIF(event_name = 'add_to_cart') as cart_additions,
    COUNTIF(event_name = 'begin_checkout') as checkout_starts,
    COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')) as sessions,
    MAX(device.category) as device_category,
    MAX(traffic_source.medium) as traffic_medium,
    DATE_DIFF(MAX(PARSE_DATE('%Y%m%d', event_date)), MIN(PARSE_DATE('%Y%m%d', event_date)), DAY) as days_active
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
  GROUP BY user_pseudo_id
)

SELECT
  user_pseudo_id,
  -- Select the probability for the positive class by label, not by array position
  (SELECT prob FROM UNNEST(predicted_will_purchase_probs) WHERE label = 1) as purchase_probability
FROM ML.PREDICT(
  MODEL `project.dataset.purchase_propensity_model`,
  (SELECT * FROM recent_users)
);

Each user now has a purchase_probability score from 0 to 1.

Create Audience Segments

Segment users by propensity:

SELECT
  CASE
    WHEN purchase_probability >= 0.7 THEN 'High Intent'
    WHEN purchase_probability >= 0.4 THEN 'Medium Intent'
    WHEN purchase_probability >= 0.1 THEN 'Low Intent'
    ELSE 'Very Low Intent'
  END as segment,
  COUNT(*) as users,
  AVG(purchase_probability) as avg_probability
FROM `project.dataset.user_propensity_scores`
GROUP BY 1
ORDER BY avg_probability DESC;

Exporting to Google Ads

The real power comes from acting on predictions. Export high-propensity users to Google Ads for targeted campaigns.

Option 1: GA4 Audiences via BigQuery

If you have GA4 360 or use the Measurement Protocol, you can send propensity scores back to GA4 as user properties, then build audiences in GA4’s UI.

Option 2: Google Ads Data Manager

  1. In Google Ads, go to Tools > Data Manager
  2. Connect your BigQuery project
  3. Create an audience from your propensity table

-- Create a table formatted for Google Ads
SELECT
  user_pseudo_id as user_id,
  'High Intent Purchasers' as audience_name
FROM `project.dataset.user_propensity_scores`
WHERE purchase_probability >= 0.6;

Option 3: Customer Match

Export emails/phone numbers (if you have them linked to user_pseudo_id) for Customer Match upload.
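
A sketch, assuming you maintain a CRM table that maps user_pseudo_id to a hashed email (the crm_user_mapping table name and its columns are hypothetical):

-- Hypothetical mapping table: user_pseudo_id -> hashed_email
SELECT
  c.hashed_email
FROM `project.dataset.user_propensity_scores` s
JOIN `project.dataset.crm_user_mapping` c USING (user_pseudo_id)
WHERE s.purchase_probability >= 0.6;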


Going Further: LTV Prediction

Purchase propensity predicts if someone will buy. Lifetime Value (LTV) predicts how much they’ll spend over time.

LTV Model

CREATE OR REPLACE MODEL `project.dataset.ltv_model`
OPTIONS(
  model_type='LINEAR_REG',  -- Regression for continuous values
  input_label_cols=['future_revenue']
) AS

WITH user_history AS (
  -- Features from first 30 days
  SELECT
    user_pseudo_id,
    COUNTIF(event_name = 'purchase') as purchases_30d,
    -- Only count revenue from purchase events; other events can also carry a 'value' param
    SUM(IF(event_name = 'purchase',
           IFNULL((SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'), 0),
           0)) as revenue_30d,
    COUNT(DISTINCT event_date) as active_days_30d
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260130'
  GROUP BY user_pseudo_id
),

future_value AS (
  -- Revenue in next 90 days
  SELECT
    user_pseudo_id,
    SUM(IFNULL((SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'), 0)) as future_revenue
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260131' AND '20260430'
    AND event_name = 'purchase'
  GROUP BY user_pseudo_id
)

SELECT
  h.*,
  COALESCE(f.future_revenue, 0) as future_revenue
FROM user_history h
LEFT JOIN future_value f USING (user_pseudo_id);

Now you can identify not just who will buy, but who will become your most valuable customers.
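
Scoring works the same way as with the propensity model; for regression, ML.PREDICT returns a single predicted_<label> column. A sketch that recomputes the same 30-day features over a recent window:

SELECT
  user_pseudo_id,
  predicted_future_revenue
FROM ML.PREDICT(
  MODEL `project.dataset.ltv_model`,
  (SELECT
     user_pseudo_id,
     COUNTIF(event_name = 'purchase') as purchases_30d,
     SUM(IF(event_name = 'purchase',
            IFNULL((SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'), 0),
            0)) as revenue_30d,
     COUNT(DISTINCT event_date) as active_days_30d
   FROM `project.analytics_PROPERTY_ID.events_*`
   WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
     AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
   GROUP BY user_pseudo_id)
)
ORDER BY predicted_future_revenue DESC;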


Best Practices

Data Quality

  • Minimum 30 days of training data
  • At least 1000 conversion events for reliable models
  • Consistent tracking - model breaks if events change
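
A quick check against these requirements, counting conversions in your intended training window:

-- Do we have enough purchase events and purchasers to train on?
SELECT
  COUNTIF(event_name = 'purchase') AS purchase_events,
  COUNT(DISTINCT IF(event_name = 'purchase', user_pseudo_id, NULL)) AS purchasers
FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260130';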

Feature Engineering

  • Ratios often outperform counts: cart_additions / items_viewed (cart rate) can be stronger than raw counts
  • Recency matters: Weight recent behavior higher than old behavior
  • Exclude obvious signals: Don’t include begin_checkout if you want to catch users earlier in the funnel
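
For ratio features, SAFE_DIVIDE avoids divide-by-zero errors for users who never triggered the denominator event (it returns NULL instead of failing):

SELECT
  user_pseudo_id,
  SAFE_DIVIDE(cart_additions, items_viewed) AS cart_rate,        -- NULL when items_viewed = 0
  SAFE_DIVIDE(checkout_starts, cart_additions) AS checkout_rate  -- NULL when cart_additions = 0
FROM `project.dataset.training_data`;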

Model Maintenance

  • Retrain monthly - user behavior shifts over time
  • Monitor performance - track precision/recall on a holdout set
  • A/B test campaigns - verify that targeting high-propensity users actually improves ROI

Cost Considerations

BigQuery ML pricing:

Operation        Cost
Model training   $250 per TB processed
Predictions      $5 per TB processed
Storage          $0.02 per GB/month

For a typical GA4 export (1M users, 30 days):

  • Training: ~$2-5
  • Daily predictions: ~$0.50
  • Storage: ~$1/month

Far cheaper than external ML platforms, and no data leaves Google’s infrastructure.


Summary

BigQuery ML transforms GA4 data from historical reporting into predictive intelligence:

  1. Export GA4 to BigQuery (free, 5 minutes)
  2. Create training features from user behavior
  3. Train a model with one SQL statement
  4. Score users to get propensity/LTV predictions
  5. Export to Google Ads for targeted campaigns

The models aren’t perfect, but they don’t need to be. A model that identifies high-intent users with 70% accuracy still dramatically improves campaign ROI compared to broad targeting.

Start with purchase propensity. Once that’s working, expand to churn prediction, LTV modeling, and product recommendations.


