dbt for GA4 BigQuery: Transform Raw Exports into Clean Tables
If your data stack spans multiple warehouses or your team already knows dbt, it is the right tool for transforming GA4's messy BigQuery export. This guide covers the Velir/dbt-ga4 package, building custom models from scratch, incremental loading, testing, and how dbt compares to Dataform for this specific use case.
Why dbt for GA4 BigQuery Data
I recently wrote a complete guide to building GA4 transformation pipelines with Dataform. Dataform is the right choice if you are all-in on BigQuery and want zero additional tooling. But if any of the following apply to you, dbt is the better fit:
- You already use dbt for other data sources and want GA4 models in the same project.
- Your data stack spans multiple warehouses — Snowflake, Redshift, or Databricks alongside BigQuery.
- You want the richest ecosystem of community packages, testing frameworks, and documentation tooling.
- Your team knows Jinja and prefers it over JavaScript templating.
The problem dbt solves is the same one Dataform solves: GA4’s BigQuery export uses a deeply nested schema with RECORD and REPEATED fields. Every query requires UNNEST subqueries, dashboards scan full tables, and analysts waste hours on boilerplate SQL. I covered the cost implications in GA4 BigQuery Export Cost Optimization — the short version is $25–60/month in wasted query costs for a medium-traffic site.
dbt transforms the raw export into clean, flat, partitioned tables. Query once, use everywhere.
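To make the pain concrete, here is the same question asked of the raw export and of a transformed table. The dataset and table names below are placeholders for your own project, not names from the package:

```sql
-- Raw export: UNNEST boilerplate on every query, date-sharded table scan
SELECT
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  COUNT(*) AS pageviews
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260201' AND '20260228'
  AND event_name = 'page_view'
GROUP BY page_location;

-- After dbt: flat columns, date-partitioned, no UNNEST
SELECT page_location, COUNT(*) AS pageviews
FROM `your-project.dbt_ga4.stg_ga4__events`
WHERE event_date BETWEEN '2026-02-01' AND '2026-02-28'
  AND event_name = 'page_view'
GROUP BY page_location;
```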
dbt vs. Dataform: A Quick Comparison
If you have not read the Dataform article, here is the summary:
| Factor | dbt | Dataform |
|---|---|---|
| Cost | dbt Cloud ~$100/user/month; dbt Core is free | Free (BigQuery compute only) |
| Warehouse support | BigQuery, Snowflake, Redshift, Databricks, and more | BigQuery only |
| Templating | Jinja | JavaScript |
| Community | Massive (400+ packages on dbt Hub) | Growing (smaller ecosystem) |
| GA4 package | Velir/dbt-ga4 (386 stars) | ga4_dataform (152 stars) |
| Scheduling | dbt Cloud, Airflow, or any orchestrator | Built-in via Cloud Workflows |
| Testing | Mature framework (generic + singular tests, packages) | Assertions (simpler, built-in) |
The bottom line: both tools produce equivalent output. The choice is about your team’s existing stack and preferences, not capability. If you picked dbt, this article is for you.
Option 1: The Velir/dbt-ga4 Package
The fastest way to get started is the Velir/dbt-ga4 package — the most popular community package for GA4 BigQuery transformations with 386 stars and 164 forks.
What It Provides
Staging models (the heavy lifting):
- `stg_ga4__events` — Cleaned event data with session and event keys
- `stg_ga4__event_*` — Individual models per event type (page_view, purchase, etc.) with flattened parameters
- `stg_ga4__event_items` — Ecommerce item-level data
- `stg_ga4__user_properties` — Most recent user property values per user
- `stg_ga4__sessions_traffic_sources` — First-click attribution per session
- `stg_ga4__sessions_traffic_sources_last_non_direct_daily` — Last non-direct attribution with configurable lookback
Marts models (analysis-ready):
- `dim_ga4__sessions` / `dim_ga4__sessions_daily` — Session dimensions (geography, device, acquisition)
- `fct_ga4__pages` — Page-level metrics by date and location
- `fct_ga4__sessions` / `fct_ga4__sessions_daily` — Session facts with conversion counts
- `dim_ga4__client_keys` — Device dimension with first/last page viewed
Installation
Add to your packages.yml:
```yaml
packages:
  - package: Velir/ga4
    version: ['>=6.2.0', '<6.3.0']
```

Then install:

```shell
dbt deps
```

Configuration
Add the required variables to your dbt_project.yml:
```yaml
vars:
  ga4:
    source_project: 'your-gcp-project-id'
    property_ids: [123456789]
    start_date: '20240101'
    static_incremental_days: 3
```

Key optional variables:
```yaml
vars:
  ga4:
    # Extract UTM parameters from URLs into columns
    query_parameter_extraction:
      - 'utm_source'
      - 'utm_medium'
      - 'utm_campaign'
    # Remove tracking parameters from page_location
    query_parameter_exclusions:
      - 'gclid'
      - 'fbclid'
      - 'mc_cid'
    # Define which events count as conversions
    conversion_events:
      - 'purchase'
      - 'generate_lead'
      - 'sign_up'
    # Attribution lookback window (default: 30 days)
    session_attribution_lookback_window_days: 30
    # Custom parameters to extract on all events
    default_custom_parameters:
      - name: 'content_group'
        value_type: 'string_value'
      - name: 'logged_in'
        value_type: 'string_value'
```

First Run
```shell
# Load source category mappings (required before first run)
dbt seed

# Run all models
dbt run

# Run tests
dbt test
```

The first run does a full historical load. Subsequent runs process incrementally based on `static_incremental_days`.
Multi-Property Support
If you have multiple GA4 properties, the package can consolidate them:
```yaml
vars:
  ga4:
    combined_dataset: 'ga4_combined'
    property_ids: [111111111, 222222222]
```

The package clones date shards from each property into the combined dataset before running transformations. This is a significant advantage over manual approaches, where you would need to UNION tables across datasets.
Option 2: Building Custom Models
The Velir package is comprehensive, but sometimes you need full control. Here is how to build the core transformations from scratch — the same architecture covered in the Dataform article, translated to dbt conventions.
Project Structure
```text
├── models/
│   ├── staging/
│   │   ├── _staging__sources.yml
│   │   ├── _staging__models.yml
│   │   └── stg_ga4__events.sql
│   ├── intermediate/
│   │   ├── _intermediate__models.yml
│   │   ├── int_ga4__sessions.sql
│   │   └── int_ga4__user_mapping.sql
│   └── marts/
│       ├── _marts__models.yml
│       ├── fct_ga4__sessions_daily.sql
│       └── fct_ga4__pageviews_daily.sql
├── macros/
│   └── unnest_param.sql
├── tests/
│   └── assert_event_date_range.sql
├── dbt_project.yml
└── packages.yml
```

Source Declaration
Create models/staging/_staging__sources.yml:
```yaml
version: 2

sources:
  - name: ga4
    database: "{{ var('source_project') }}"
    schema: "analytics_{{ var('property_id') }}"
    tables:
      - name: events
        identifier: 'events_*'
        description: 'Raw GA4 BigQuery export (date-sharded)'
```

Macro: Unnest Parameter
Create macros/unnest_param.sql — this is the dbt equivalent of the JavaScript helper from the Dataform article:
```sql
{% macro unnest_param(key, value_type='string_value') %}
  (SELECT value.{{ value_type }} FROM UNNEST(event_params) WHERE key = '{{ key }}')
{% endmacro %}

{% macro unnest_user_property(key, value_type='string_value') %}
  (SELECT value.{{ value_type }} FROM UNNEST(user_properties) WHERE key = '{{ key }}')
{% endmacro %}
```

Usage in models:
```sql
SELECT
  {{ unnest_param('page_location') }} AS page_location,
  {{ unnest_param('ga_session_id', 'int_value') }} AS ga_session_id,
  {{ unnest_user_property('user_tier') }} AS user_tier,
```

Jinja macros serve the same purpose as Dataform's JavaScript helpers — abstracting the repetitive UNNEST boilerplate into reusable functions.
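For reference, a call like `{{ unnest_param('page_location') }}` expands to a correlated subquery when dbt compiles the model (you can inspect the result under `target/compiled/` after `dbt compile`):

```sql
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
```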
Staging: Flattened Events
Create models/staging/stg_ga4__events.sql:
```sql
{{
  config(
    materialized='incremental',
    partition_by={
      "field": "event_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=["event_name"],
    incremental_strategy='merge',
    unique_key=['event_date', 'event_timestamp', 'user_pseudo_id', 'event_name', 'row_num']
  )
}}

WITH source AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_date, event_timestamp, user_pseudo_id, event_name
    ) AS row_num
  FROM {{ source('ga4', 'events') }}
  WHERE _TABLE_SUFFIX NOT LIKE '%intraday%'
  {% if is_incremental() %}
    AND PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ var('static_incremental_days', 3) }} DAY)
  {% endif %}
)

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
  event_name,
  user_pseudo_id,
  user_id,
  -- Extracted event parameters
  {{ unnest_param('page_location') }} AS page_location,
  {{ unnest_param('page_title') }} AS page_title,
  {{ unnest_param('page_referrer') }} AS page_referrer,
  {{ unnest_param('source') }} AS source,
  {{ unnest_param('medium') }} AS medium,
  {{ unnest_param('campaign') }} AS campaign,
  {{ unnest_param('ga_session_id', 'int_value') }} AS ga_session_id,
  {{ unnest_param('ga_session_number', 'int_value') }} AS ga_session_number,
  {{ unnest_param('engagement_time_msec', 'int_value') }} AS engagement_time_msec,
  {{ unnest_param('session_engaged') }} AS session_engaged,
  -- Device and geo
  device.category AS device_category,
  device.operating_system AS device_os,
  device.web_info.browser AS browser,
  geo.country AS country,
  geo.city AS city,
  -- Traffic source
  traffic_source.source AS traffic_source,
  traffic_source.medium AS traffic_medium,
  traffic_source.name AS traffic_campaign,
  -- Ecommerce
  ecommerce.purchase_revenue AS purchase_revenue,
  ecommerce.transaction_id AS transaction_id,
  row_num
FROM source
```

The key difference from Dataform: dbt uses `{% if is_incremental() %}` instead of `${when(incremental(), ...)}`, and `{{ source('ga4', 'events') }}` instead of `${ref("ga4_raw_events")}`. The SQL logic is identical.
Intermediate: Sessions
Create models/intermediate/int_ga4__sessions.sql:
```sql
{{
  config(
    materialized='incremental',
    partition_by={
      "field": "session_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=["source", "medium", "device_category"],
    incremental_strategy='merge',
    unique_key=['ga_session_key', 'session_date']
  )
}}

WITH events AS (
  SELECT *
  FROM {{ ref('stg_ga4__events') }}
  {% if is_incremental() %}
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ var('static_incremental_days', 3) }} DAY)
  {% endif %}
),

session_source AS (
  SELECT
    CONCAT(user_pseudo_id, '.', CAST(ga_session_id AS STRING)) AS ga_session_key,
    event_date,
    COALESCE(source, traffic_source) AS source,
    COALESCE(medium, traffic_medium) AS medium,
    COALESCE(campaign, traffic_campaign) AS campaign
  FROM events
  WHERE event_name = 'session_start'
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY user_pseudo_id, ga_session_id
    ORDER BY event_timestamp
  ) = 1
)

SELECT
  CONCAT(e.user_pseudo_id, '.', CAST(e.ga_session_id AS STRING)) AS ga_session_key,
  e.user_pseudo_id,
  e.user_id,
  e.ga_session_id,
  e.ga_session_number,
  MIN(e.event_date) AS session_date,
  MIN(e.event_timestamp) AS session_start,
  MAX(e.event_timestamp) AS session_end,
  TIMESTAMP_DIFF(MAX(e.event_timestamp), MIN(e.event_timestamp), SECOND) AS session_duration_seconds,
  ss.source,
  ss.medium,
  ss.campaign,
  COUNT(*) AS event_count,
  COUNTIF(e.event_name = 'page_view') AS pageviews,
  COUNTIF(e.event_name = 'purchase') AS purchases,
  SUM(COALESCE(e.engagement_time_msec, 0)) AS total_engagement_time_msec,
  MAX(CASE WHEN e.session_engaged = '1' THEN TRUE ELSE FALSE END) AS is_engaged,
  ARRAY_AGG(
    IF(e.event_name = 'page_view', e.page_location, NULL) IGNORE NULLS
    ORDER BY e.event_timestamp LIMIT 1
  )[SAFE_OFFSET(0)] AS landing_page,
  ARRAY_AGG(e.device_category IGNORE NULLS ORDER BY e.event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS device_category,
  ARRAY_AGG(e.country IGNORE NULLS ORDER BY e.event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS country,
  SUM(COALESCE(e.purchase_revenue, 0)) AS session_revenue
FROM events e
LEFT JOIN session_source ss
  ON CONCAT(e.user_pseudo_id, '.', CAST(e.ga_session_id AS STRING)) = ss.ga_session_key
WHERE e.ga_session_id IS NOT NULL
GROUP BY 1, 2, 3, 4, 5, ss.source, ss.medium, ss.campaign
```

If you compared this to the Dataform sessions model, the SQL is nearly identical. The differences are syntax-level: `{{ config() }}` vs. Dataform's `config {}`, `{{ ref() }}` vs. `${ref()}`, and `{% if is_incremental() %}` vs. `${when(incremental(), ...)}`.
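Once this model is built, downstream reporting becomes a simple flat query. A usage sketch: the fully qualified table path is a placeholder for wherever your dbt target writes the model:

```sql
-- Sessions, engagement, and revenue by channel for the last 28 days
SELECT
  session_date,
  source,
  medium,
  COUNT(*) AS sessions,
  COUNTIF(is_engaged) AS engaged_sessions,
  SUM(session_revenue) AS revenue
FROM `your-project.dbt_ga4.int_ga4__sessions`
WHERE session_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY session_date, source, medium
ORDER BY session_date DESC, sessions DESC;
```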
Testing and Data Quality
This is where dbt genuinely shines over Dataform. The testing framework is more mature, with more test types and better integration with CI/CD.
Schema Tests
Add to models/staging/_staging__models.yml:
```yaml
version: 2

models:
  - name: stg_ga4__events
    description: 'Flattened GA4 events with extracted parameters'
    columns:
      - name: user_pseudo_id
        description: 'Cookie-based user identifier'
        tests:
          - not_null
      - name: event_date
        description: 'Event date'
        tests:
          - not_null
      - name: event_name
        description: 'GA4 event name'
        tests:
          - not_null
          - accepted_values:
              values:
                - 'page_view'
                - 'session_start'
                - 'first_visit'
                - 'scroll'
                - 'click'
                - 'view_item'
                - 'add_to_cart'
                - 'purchase'
                - 'begin_checkout'
                - 'user_engagement'
              config:
                severity: warn

  - name: int_ga4__sessions
    description: 'Session-level aggregation with traffic source attribution'
    columns:
      - name: ga_session_key
        description: 'Unique session identifier (user_pseudo_id + ga_session_id)'
        tests:
          - not_null
      - name: session_date
        tests:
          - not_null
    tests:
      - unique:
          column_name: "ga_session_key || '-' || CAST(session_date AS STRING)"
```

Custom Singular Tests
Create tests/assert_event_date_range.sql:
```sql
-- Events with dates in the future or more than 400 days ago indicate export issues
SELECT *
FROM {{ ref('stg_ga4__events') }}
WHERE event_date > CURRENT_DATE()
   OR event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 400 DAY)
```

Create tests/assert_session_count_bounds.sql:
```sql
-- Alert if yesterday's sessions dropped by more than 80% vs. the 30-day average
WITH daily_counts AS (
  SELECT
    session_date,
    COUNT(*) AS session_count
  FROM {{ ref('int_ga4__sessions') }}
  WHERE session_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 31 DAY)
  GROUP BY 1
),

averages AS (
  SELECT
    AVG(CASE WHEN session_date < DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN session_count END) AS avg_30d,
    MAX(CASE WHEN session_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN session_count END) AS yesterday
  FROM daily_counts
)

SELECT *
FROM averages
WHERE yesterday < avg_30d * 0.2
```

dbt Testing Packages
For additional test coverage, add dbt-utils and dbt-expectations:
```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: ['>=1.0.0', '<2.0.0']
  - package: calogica/dbt_expectations
    version: ['>=0.10.0', '<0.11.0']
```

This unlocks tests like `expect_column_values_to_be_between`, `expect_table_row_count_to_be_between`, and `recency` (ensure data is not stale). Dataform's assertion system is simpler and built-in, but dbt's testing ecosystem is significantly deeper.
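As a sketch of how these package tests might look applied to the custom sessions model above (the thresholds and the two-day recency window are illustrative assumptions, not package defaults):

```yaml
# models/intermediate/_intermediate__models.yml (illustrative thresholds)
version: 2

models:
  - name: int_ga4__sessions
    tests:
      # Fail if no new session rows have landed in the last 2 days
      - dbt_utils.recency:
          datepart: day
          field: session_date
          interval: 2
    columns:
      - name: session_duration_seconds
        tests:
          # Sessions longer than 24 hours usually indicate a sessionization bug
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 86400
              config:
                severity: warn
```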
Running Tests in CI
```shell
# Run all tests
dbt test

# Run only tests for staging models
dbt test --select staging

# Treat test warnings as errors (useful in CI)
dbt --warn-error test
```

Integrate this into your CI/CD pipeline (GitHub Actions, GitLab CI, etc.) to catch data quality issues before they reach production.
Incremental Loading Strategy
dbt’s incremental strategy for GA4 follows the same logic as Dataform’s — process only recent data, use MERGE to avoid duplicates.
The static_incremental_days Pattern
The Velir package uses static_incremental_days (default: 3) as the lookback window. On each run:
- Only `_TABLE_SUFFIX` values within the last N days are scanned
- A MERGE statement inserts new rows and updates existing ones
- Late-arriving events from GA4's export process are captured
```sql
{% if is_incremental() %}
  AND PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) >= DATE_SUB(
    CURRENT_DATE(),
    INTERVAL {{ var('static_incremental_days', 3) }} DAY
  )
{% endif %}
```

Full Refresh vs. Incremental
```shell
# First run or schema change: full refresh
dbt run --full-refresh

# Daily runs: incremental (default)
dbt run
```

A full refresh on a site with a year of GA4 data might scan 50–100 GB. A 3-day incremental run scans 1–3 GB. At $6.25/TB on-demand, that is the difference between roughly $0.63 and $0.02 per run.
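The arithmetic behind those figures is just gigabytes scanned times the on-demand rate. A quick sketch, with scan sizes that are illustrative rather than measured:

```python
# Back-of-the-envelope BigQuery on-demand cost for a dbt run.
ON_DEMAND_USD_PER_TB = 6.25  # on-demand rate quoted above

def query_cost_usd(gb_scanned: float) -> float:
    """Cost in USD for scanning the given number of gigabytes."""
    return gb_scanned / 1000 * ON_DEMAND_USD_PER_TB

full_refresh = query_cost_usd(100)  # roughly a year of history
incremental = query_cost_usd(3)     # 3-day lookback

print(f"full refresh: ${full_refresh:.3f}")  # full refresh: $0.625
print(f"incremental:  ${incremental:.3f}")   # incremental:  $0.019
```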
Disabling Expensive Models
For high-traffic sites, the Velir package lets you disable unpartitioned models that would otherwise scan the entire history:
```yaml
# dbt_project.yml
models:
  ga4:
    staging:
      stg_ga4__sessions_traffic_sources:
        +enabled: false  # Use the _daily variant instead
    marts:
      fct_ga4__sessions:
        +enabled: false  # Use fct_ga4__sessions_daily instead
```

This is a practical cost optimization that the package handles well.
Scheduling and Orchestration
Unlike Dataform, dbt does not have built-in scheduling. You need an external orchestrator.
Option 1: dbt Cloud
The simplest option if you are paying for dbt Cloud. Create a job with a daily schedule:
- Schedule: `0 9 * * *` (9 AM daily, after the GA4 export lands)
- Commands: `dbt seed && dbt run && dbt test`
- Environment: Production
dbt Cloud also provides run history, alerting, and the IDE for development.
Option 2: GitHub Actions (Free)
For dbt Core users, a GitHub Actions workflow is the most common approach:
```yaml
# .github/workflows/dbt-daily.yml
name: dbt daily run

on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dbt
        run: pip install dbt-bigquery
      - name: Write GCP credentials to a file
        # GOOGLE_APPLICATION_CREDENTIALS must be a file path, not the key contents
        run: echo '${{ secrets.GCP_SA_KEY }}' > ${{ runner.temp }}/gcp-key.json
      - name: Run dbt
        env:
          DBT_PROFILES_DIR: .
          GOOGLE_APPLICATION_CREDENTIALS: ${{ runner.temp }}/gcp-key.json
        run: |
          dbt deps
          dbt seed
          dbt run
          dbt test
```

Store your GCP service account key JSON as a GitHub secret (GCP_SA_KEY here). Because GOOGLE_APPLICATION_CREDENTIALS must point to a file, the workflow writes the secret to disk before running the pipeline daily.
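A workflow with DBT_PROFILES_DIR set to the repo root also needs a profiles.yml committed there. A minimal sketch for dbt-bigquery, where the profile name, GCP project, and dataset are placeholders that must match your own dbt_project.yml:

```yaml
# profiles.yml (top-level key must match `profile:` in dbt_project.yml)
ga4_dbt:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      keyfile: "{{ env_var('GOOGLE_APPLICATION_CREDENTIALS') }}"
      project: your-gcp-project-id
      dataset: dbt_ga4
      location: US
      threads: 4
```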
Option 3: Apache Airflow / Cloud Composer
For teams with existing Airflow infrastructure, use Astronomer Cosmos — the standard way to run dbt projects as Airflow DAGs. Cosmos turns each dbt model into an individual Airflow task, giving you full visibility and retry control. This is overkill for a standalone GA4 pipeline but makes sense when dbt is part of a larger orchestration graph.
dbt Documentation and Lineage
One of dbt’s standout features is automatic documentation generation:
```shell
dbt docs generate
dbt docs serve
```

This produces a browsable website with:
- Model descriptions from your YAML files
- Column-level documentation and tests
- Lineage graph showing how models depend on each other (source → staging → intermediate → marts)
- SQL compiled output for debugging
For GA4 pipelines, the lineage graph is particularly useful — it visualizes how raw events flow through flattening, sessionization, and into the marts layer. Share it with stakeholders who want to understand where their dashboard numbers come from.
Common Issues and Troubleshooting
"Not found: Dataset" on First Run
Cause: dbt does not auto-create BigQuery datasets.
Fix: Create the target datasets manually in BigQuery before running, or add on-run-start hooks to your dbt_project.yml:
```yaml
on-run-start:
  - 'CREATE SCHEMA IF NOT EXISTS {{ target.schema }}'
```

Slow Initial Full Refresh
Cause: Scanning the entire GA4 export history.
Fix: Set start_date in your variables to limit the initial load:
```yaml
vars:
  ga4:
    start_date: '20250101'  # Only load from Jan 2025
```

You can always backfill earlier data later with a targeted `--full-refresh`.
_TABLE_SUFFIX Filter Not Working
Cause: Filtering on the derived event_date column instead of _TABLE_SUFFIX. BigQuery prunes date shards only when the filter is on _TABLE_SUFFIX itself; it cannot prune using a column parsed from the shard name.
Fix: Always filter on _TABLE_SUFFIX for date-sharded tables:
```sql
-- Good: shard pruning works
WHERE _TABLE_SUFFIX BETWEEN '20260201' AND '20260228'

-- Bad: scans ALL tables, then filters
WHERE PARSE_DATE('%Y%m%d', event_date) >= '2026-02-01'
```

Tests Failing with “accepted_values”
Cause: GA4 sends custom event names that are not in your allowed list.
Fix: Set severity to warn instead of error for event name validation, or expand the accepted values list to include your custom events.
When to Use the Package vs. Custom Models
Use Velir/dbt-ga4 if:
- You want a working pipeline in under an hour
- You need multi-property support
- You want maintained, community-tested models
- You do not need heavy customization of the transformation logic
Build custom models if:
- You need full control over the schema and column names
- Your business logic (attribution, sessionization) differs from the package defaults
- You want to match an existing data model or naming convention
- You are learning dbt and want to understand every transformation
A common middle ground: start with the Velir package, then gradually replace individual models with custom ones as your needs diverge. The package models are well-documented, so reading their source code is an excellent way to learn GA4 transformation patterns.
Related Resources
- Dataform for GA4: Build Your First BigQuery Transformation Pipeline — The Dataform equivalent of this guide, covering the same transformations with Dataform’s syntax
- GA4 BigQuery Export Cost Optimization — Understand where BigQuery costs come from and how transformation tables reduce them
- Automated Anomaly Detection for GA4 with BigQuery — Build monitoring on top of your transformed session tables
- Predicting Customer Value with BigQuery ML and GA4 Data — ML models that benefit from clean, flat input tables
Sources
- Velir/dbt-ga4 — GitHub
- modataconsulting/dbt_ga4_project — GitHub
- Google Analytics, BigQuery & dbt: A dbt example project — stacktonic.com
- Let’s build together: towards a community-driven dbt package for GA4 — ga4bigquery.com
- How to standardize GA4 data with dbt — Medium
- dbt vs. Dataform: Which Should You Choose in 2026? — The Data Letter
- Dataform vs dbt for BigQuery Transformation Workflows — Valiotti Analytics