Predicting Customer Value with BigQuery ML and GA4 Data
Your GA4 data contains patterns that predict which users will convert, churn, or become high-value customers. BigQuery ML lets you build machine learning models using SQL, no Python required. This guide walks through building a purchase propensity model and using predictions for Google Ads audiences.
Why Predict Customer Behavior?
Every user on your site leaves behavioral signals. Page views, scroll depth, time on site, products viewed, cart additions. These signals correlate with outcomes: purchases, sign-ups, churn.
Traditional analytics tells you what happened. Predictive analytics tells you what will happen.
With predictions, you can:
- Target high-propensity users with Google Ads before they convert elsewhere
- Identify churn risk and trigger retention campaigns
- Allocate budget toward users likely to have high lifetime value
- Personalize experiences based on predicted interests
The challenge has always been building these models. Until BigQuery ML.
What is BigQuery ML?
BigQuery ML lets you create and execute machine learning models using standard SQL. No Python, no TensorFlow, no ML infrastructure to manage.
```sql
CREATE MODEL `project.dataset.my_model`
OPTIONS(model_type='LOGISTIC_REG') AS
SELECT * FROM training_data;
```

That's it. BigQuery handles feature engineering, training, hyperparameter tuning, and model deployment.
Supported model types include:
- Linear regression - Predict continuous values (revenue, LTV)
- Logistic regression - Predict binary outcomes (will purchase / won’t purchase)
- K-means clustering - Segment users into groups
- XGBoost - High-accuracy predictions with gradient boosting
- Deep Neural Networks - Complex pattern recognition
- ARIMA - Time series forecasting
For most marketing use cases, logistic regression and XGBoost cover 90% of needs.
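Swapping model types is just a change to the `model_type` option — for example, the XGBoost variant (which BigQuery ML exposes as `BOOSTED_TREE_CLASSIFIER`), sketched in the same minimal form as the example above:

```sql
-- Same statement, different model_type: gradient-boosted trees (XGBoost)
CREATE MODEL `project.dataset.my_xgb_model`
OPTIONS(model_type='BOOSTED_TREE_CLASSIFIER') AS
SELECT * FROM training_data;
```

Boosted trees usually take longer to train than logistic regression but handle non-linear feature interactions without manual feature crosses.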
Prerequisites
1. GA4 BigQuery Export
Your GA4 property must export data to BigQuery. This is free and takes 5 minutes to set up:
- In GA4, go to Admin > BigQuery Links
- Click Link and select your Google Cloud project
- Choose Daily export (streaming adds cost)
- Enable Include advertising identifiers if you want to build Google Ads audiences
Data starts flowing within 24 hours. You'll get tables like:

```
analytics_PROPERTY_ID.events_YYYYMMDD
analytics_PROPERTY_ID.events_intraday_YYYYMMDD
```

2. BigQuery Access
You need a Google Cloud project with BigQuery enabled. If you linked GA4, you already have this.
Billing must be enabled, but model training on GA4 data typically costs $1-10 depending on data volume.
Building a Purchase Propensity Model
Let’s build a model that predicts which users will purchase in the next 7 days.
Step 1: Understand Your Data
First, explore what’s available:
```sql
-- See all event types in your data
SELECT
  event_name,
  COUNT(*) as event_count
FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260130'
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 20;
```

Common events useful for prediction:

- `page_view` - Engagement level
- `scroll` - Content consumption
- `view_item` - Product interest
- `add_to_cart` - Purchase intent
- `begin_checkout` - Strong purchase intent
- `purchase` - Our target variable
Step 2: Create Training Features
We need to transform raw events into features (input variables) that predict purchase behavior.
```sql
CREATE OR REPLACE TABLE `project.dataset.training_data` AS
WITH user_features AS (
  SELECT
    user_pseudo_id,
    -- Engagement metrics
    COUNT(*) as total_events,
    COUNT(DISTINCT event_date) as active_days,
    COUNTIF(event_name = 'page_view') as page_views,
    COUNTIF(event_name = 'scroll') as scroll_events,
    -- Purchase intent signals
    COUNTIF(event_name = 'view_item') as items_viewed,
    COUNTIF(event_name = 'add_to_cart') as cart_additions,
    COUNTIF(event_name = 'begin_checkout') as checkout_starts,
    -- Session metrics
    COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')) as sessions,
    -- Device and traffic
    MAX(device.category) as device_category,
    MAX(traffic_source.medium) as traffic_medium,
    -- Time-based
    MIN(PARSE_DATE('%Y%m%d', event_date)) as first_seen,
    MAX(PARSE_DATE('%Y%m%d', event_date)) as last_seen
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260123' -- Training period
  GROUP BY user_pseudo_id
),
-- Label: Did user purchase in the following 7 days?
purchase_labels AS (
  SELECT DISTINCT
    user_pseudo_id,
    1 as purchased
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260124' AND '20260130' -- Prediction window
    AND event_name = 'purchase'
)
SELECT
  f.*,
  DATE_DIFF(f.last_seen, f.first_seen, DAY) as days_active,
  COALESCE(l.purchased, 0) as will_purchase -- Target variable
FROM user_features f
LEFT JOIN purchase_labels l USING (user_pseudo_id);
```

This creates a table where each row is a user with:
- Features: Their behavior during the training period
- Label: Whether they purchased in the following 7 days (1 or 0)
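Before training, it's worth checking how imbalanced the label is, since purchasers are usually a small minority. A quick check against the table just created:

```sql
-- Check the class balance of the target variable
SELECT
  will_purchase,
  COUNT(*) AS users,
  ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 4) AS share
FROM `project.dataset.training_data`
GROUP BY will_purchase;
```

If purchasers are well under 1% of rows, the `auto_class_weights` option used in the next step matters a lot.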
Step 3: Train the Model
Now create the model:
```sql
CREATE OR REPLACE MODEL `project.dataset.purchase_propensity_model`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['will_purchase'],
  auto_class_weights=TRUE, -- Handle imbalanced data (few purchasers vs many non-purchasers)
  enable_global_explain=TRUE -- See feature importance
) AS
SELECT
  -- Exclude user_pseudo_id and date fields from features
  total_events,
  active_days,
  page_views,
  scroll_events,
  items_viewed,
  cart_additions,
  checkout_starts,
  sessions,
  device_category,
  traffic_medium,
  days_active,
  will_purchase
FROM `project.dataset.training_data`;
```

Training takes 2-10 minutes depending on data size. BigQuery handles the details: feature scaling, regularization, and an automatic train/evaluation data split.
Step 4: Evaluate the Model
Check how well the model performs:
```sql
SELECT *
FROM ML.EVALUATE(MODEL `project.dataset.purchase_propensity_model`);
```

Key metrics to look at:
| Metric | What It Means | Good Value |
|---|---|---|
| precision | Of users predicted to purchase, how many actually did? | > 0.3 |
| recall | Of users who purchased, how many did we predict? | > 0.5 |
| accuracy | Overall correct predictions | > 0.7 |
| f1_score | Balance of precision and recall | > 0.4 |
| roc_auc | Model’s ability to distinguish purchasers | > 0.7 |
For marketing use cases, roc_auc > 0.7 indicates a useful model.
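If you need a concrete probability cutoff for audience building, `ML.ROC_CURVE` lets you inspect the trade-off at each threshold rather than guessing. A sketch:

```sql
-- Inspect candidate thresholds to choose a cutoff for audiences
SELECT
  threshold,
  recall,
  true_positives,
  false_positives
FROM ML.ROC_CURVE(MODEL `project.dataset.purchase_propensity_model`)
ORDER BY threshold;
```

Pick the lowest threshold where the false-positive volume is still acceptable for your ad budget.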
Step 5: Understand Feature Importance
See which behaviors most predict purchases:
```sql
SELECT *
FROM ML.GLOBAL_EXPLAIN(MODEL `project.dataset.purchase_propensity_model`)
ORDER BY attribution DESC;
```

Typical findings:

- `cart_additions` and `checkout_starts` are the strongest predictors (obviously)
- `items_viewed` matters more than `page_views`
- `active_days` shows engaged users convert more
- `traffic_medium` reveals which channels bring high-intent users
Making Predictions
Score Current Users
Apply the model to users from the last 7 days:
```sql
CREATE OR REPLACE TABLE `project.dataset.user_propensity_scores` AS
WITH recent_users AS (
  SELECT
    user_pseudo_id,
    COUNT(*) as total_events,
    COUNT(DISTINCT event_date) as active_days,
    COUNTIF(event_name = 'page_view') as page_views,
    COUNTIF(event_name = 'scroll') as scroll_events,
    COUNTIF(event_name = 'view_item') as items_viewed,
    COUNTIF(event_name = 'add_to_cart') as cart_additions,
    COUNTIF(event_name = 'begin_checkout') as checkout_starts,
    COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')) as sessions,
    MAX(device.category) as device_category,
    MAX(traffic_source.medium) as traffic_medium,
    DATE_DIFF(MAX(PARSE_DATE('%Y%m%d', event_date)), MIN(PARSE_DATE('%Y%m%d', event_date)), DAY) as days_active
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
  GROUP BY user_pseudo_id
)
SELECT
  user_pseudo_id,
  -- Match on label = 1 explicitly; the order of the probs array is not guaranteed
  (SELECT p.prob FROM UNNEST(predicted_will_purchase_probs) p WHERE p.label = 1) as purchase_probability
FROM ML.PREDICT(
  MODEL `project.dataset.purchase_propensity_model`,
  (SELECT * FROM recent_users)
);
```

Each user now has a `purchase_probability` score from 0 to 1.
Create Audience Segments
Segment users by propensity:
```sql
SELECT
  CASE
    WHEN purchase_probability >= 0.7 THEN 'High Intent'
    WHEN purchase_probability >= 0.4 THEN 'Medium Intent'
    WHEN purchase_probability >= 0.1 THEN 'Low Intent'
    ELSE 'Very Low Intent'
  END as segment,
  COUNT(*) as users,
  AVG(purchase_probability) as avg_probability
FROM `project.dataset.user_propensity_scores`
GROUP BY 1
ORDER BY avg_probability DESC;
```

Exporting to Google Ads
The real power comes from acting on predictions. Export high-propensity users to Google Ads for targeted campaigns.
Option 1: GA4 Audiences via BigQuery
If you have GA4 360 or use the Measurement Protocol, you can send propensity scores back to GA4 as user properties, then build audiences in GA4’s UI.
Option 2: Google Ads Data Manager
- In Google Ads, go to Tools > Data Manager
- Connect your BigQuery project
- Create an audience from your propensity table
```sql
-- Create a table formatted for Google Ads
SELECT
  user_pseudo_id as user_id,
  'High Intent Purchasers' as audience_name
FROM `project.dataset.user_propensity_scores`
WHERE purchase_probability >= 0.6;
```

Option 3: Customer Match
Export emails/phone numbers (if you have them linked to user_pseudo_id) for Customer Match upload.
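If your CRM stores emails keyed by the GA4 client ID, a join like the following produces the upload list. Note that the `crm_users` table and its columns are hypothetical placeholders for your own mapping:

```sql
-- Hypothetical: crm_users(user_pseudo_id, email) is your own CRM mapping table
SELECT
  c.email
FROM `project.dataset.user_propensity_scores` s
JOIN `project.dataset.crm_users` c USING (user_pseudo_id)
WHERE s.purchase_probability >= 0.6;
```

Remember that Customer Match requires identifiers to be hashed (SHA-256) before upload.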
Going Further: LTV Prediction
Purchase propensity predicts if someone will buy. Lifetime Value (LTV) predicts how much they’ll spend over time.
LTV Model
```sql
CREATE OR REPLACE MODEL `project.dataset.ltv_model`
OPTIONS(
  model_type='LINEAR_REG', -- Regression for continuous values
  input_label_cols=['future_revenue']
) AS
WITH user_history AS (
  -- Features from first 30 days
  SELECT
    user_pseudo_id,
    COUNTIF(event_name = 'purchase') as purchases_30d,
    -- Sum the 'value' param on purchase events only, so non-purchase events don't inflate revenue
    SUM(IF(event_name = 'purchase',
           IFNULL((SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'), 0),
           0)) as revenue_30d,
    COUNT(DISTINCT event_date) as active_days_30d
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260130'
  GROUP BY user_pseudo_id
),
future_value AS (
  -- Revenue in next 90 days
  SELECT
    user_pseudo_id,
    SUM(IFNULL((SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'), 0)) as future_revenue
  FROM `project.analytics_PROPERTY_ID.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260131' AND '20260430'
    AND event_name = 'purchase'
  GROUP BY user_pseudo_id
)
SELECT
  h.*,
  COALESCE(f.future_revenue, 0) as future_revenue
FROM user_history h
LEFT JOIN future_value f USING (user_pseudo_id);
```

Now you can identify not just who will buy, but who will become your most valuable customers.
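Scoring works the same way as with the classifier: for a regression model, `ML.PREDICT` returns a `predicted_future_revenue` column. A sketch, where `recent_user_history` is a hypothetical table built exactly like the `user_history` CTE above but over a recent 30-day window:

```sql
-- Score users; input must contain the same feature columns used in training
SELECT
  user_pseudo_id,
  predicted_future_revenue
FROM ML.PREDICT(
  MODEL `project.dataset.ltv_model`,
  (SELECT user_pseudo_id, purchases_30d, revenue_30d, active_days_30d
   FROM `project.dataset.recent_user_history`)
)
ORDER BY predicted_future_revenue DESC;
```

Extra columns like `user_pseudo_id` pass through to the output, so you can join the scores back to audience tables.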
Best Practices
Data Quality
- Minimum 30 days of training data
- At least 1000 conversion events for reliable models
- Consistent tracking - model breaks if events change
Feature Engineering
- Ratios often outperform counts: `cart_additions / items_viewed` (cart rate) can be stronger than raw counts
- Recency matters: Weight recent behavior higher than old behavior
- Exclude obvious signals: Don't include `begin_checkout` if you want to catch users earlier in the funnel
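As an example of a ratio feature, `SAFE_DIVIDE` avoids division-by-zero errors for users who viewed no items. A sketch of how the training table could be extended:

```sql
-- Ratio feature: cart rate, with SAFE_DIVIDE returning NULL for zero denominators
SELECT
  user_pseudo_id,
  cart_additions,
  items_viewed,
  SAFE_DIVIDE(cart_additions, items_viewed) AS cart_rate
FROM `project.dataset.training_data`;
```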
Model Maintenance
- Retrain monthly - user behavior shifts over time
- Monitor performance - track precision/recall on a holdout set
- A/B test campaigns - verify that targeting high-propensity users actually improves ROI
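Holdout monitoring can reuse `ML.EVALUATE` with an explicit table. Here `holdout_data` is a hypothetical table built the same way as `training_data`, just from a later date range the model never saw:

```sql
-- Evaluate the existing model against fresh, labeled data
SELECT precision, recall, roc_auc
FROM ML.EVALUATE(
  MODEL `project.dataset.purchase_propensity_model`,
  (SELECT * FROM `project.dataset.holdout_data`)
);
```

If `roc_auc` drifts noticeably below the value from Step 4, that's the signal to retrain.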
Cost Considerations
BigQuery ML pricing:
| Operation | Cost |
|---|---|
| Model training | $250 per TB processed |
| Predictions | $5 per TB processed |
| Storage | $0.02 per GB/month |
For a typical GA4 export (1M users, 30 days):
- Training: ~$2-5
- Daily predictions: ~$0.50
- Storage: ~$1/month
Far cheaper than external ML platforms, and no data leaves Google’s infrastructure.
Summary
BigQuery ML transforms GA4 data from historical reporting into predictive intelligence:
- Export GA4 to BigQuery (free, 5 minutes)
- Create training features from user behavior
- Train a model with one SQL statement
- Score users to get propensity/LTV predictions
- Export to Google Ads for targeted campaigns
The models aren’t perfect, but they don’t need to be. A model that identifies high-intent users with 70% accuracy still dramatically improves campaign ROI compared to broad targeting.
Start with purchase propensity. Once that’s working, expand to churn prediction, LTV modeling, and product recommendations.
Related Resources
- Automated Anomaly Detection for GA4 with BigQuery - Monitor your models and data quality with statistical anomaly detection
- AI-Powered Audience Segmentation: K-Means Clustering - Discover natural user groups to complement propensity scoring
- Tracking Strategy: How to Build a Measurement System That Works
- GA4 Custom Dimensions & Metrics
- Google Cloud BigQuery ML Documentation