dbt for GA4 BigQuery: Transform Raw Exports into Clean Tables

If your data stack spans multiple warehouses or your team already knows dbt, it is the right tool for transforming GA4's messy BigQuery export. This guide covers the Velir/dbt-ga4 package, building custom models from scratch, incremental loading, testing, and how dbt compares to Dataform for this specific use case.

Why dbt for GA4 BigQuery Data

I recently wrote a complete guide to building GA4 transformation pipelines with Dataform. Dataform is the right choice if you are all-in on BigQuery and want zero additional tooling. But if any of the following apply to you, dbt is the better fit:

  • You already use dbt for other data sources and want GA4 models in the same project.
  • Your data stack spans multiple warehouses — Snowflake, Redshift, or Databricks alongside BigQuery.
  • You want the richest ecosystem of community packages, testing frameworks, and documentation tooling.
  • Your team knows Jinja and prefers it over JavaScript templating.

The problem dbt solves is the same one Dataform solves: GA4’s BigQuery export uses a deeply nested schema with RECORD and REPEATED fields. Every query requires UNNEST subqueries, dashboards scan full tables, and analysts waste hours on boilerplate SQL. I covered the cost implications in GA4 BigQuery Export Cost Optimization — the short version is $25–60/month in wasted query costs for a medium-traffic site.

dbt transforms the raw export into clean, flat, partitioned tables. Query once, use everywhere.
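To make "flattening" concrete before diving into dbt, here is a small illustrative Python sketch (not part of any dbt model): GA4 stores event parameters as a repeated record of key/value structs, and the transformation pivots selected keys into ordinary columns.

```python
# Illustrative only: GA4 stores event_params as a REPEATED RECORD of
# key/value structs; flattening pivots chosen keys into flat columns,
# leaving NULL (None) when a parameter is absent from an event.
def flatten_event(event: dict, keys: list) -> dict:
    params = {
        p["key"]: next((v for v in p["value"].values() if v is not None), None)
        for p in event["event_params"]
    }
    flat = {"event_name": event["event_name"]}
    for key in keys:
        flat[key] = params.get(key)
    return flat

raw = {
    "event_name": "page_view",
    "event_params": [
        {"key": "page_location",
         "value": {"string_value": "https://example.com/", "int_value": None}},
        {"key": "ga_session_id",
         "value": {"string_value": None, "int_value": 1738}},
    ],
}
print(flatten_event(raw, ["page_location", "ga_session_id"]))
```

This is exactly what the UNNEST subqueries do in SQL; doing it once in a transformation layer means dashboards never pay for it again.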


dbt vs. Dataform: A Quick Comparison

If you have not read the Dataform article, here is the summary:

| Factor | dbt | Dataform |
|---|---|---|
| Cost | dbt Cloud ~$100/user/month; dbt Core is free | Free (BigQuery compute only) |
| Warehouse support | BigQuery, Snowflake, Redshift, Databricks, and more | BigQuery only |
| Templating | Jinja | JavaScript |
| Community | Massive (400+ packages on dbt Hub) | Growing (smaller ecosystem) |
| GA4 package | Velir/dbt-ga4 (386 stars) | ga4_dataform (152 stars) |
| Scheduling | dbt Cloud, Airflow, or any orchestrator | Built-in via Cloud Workflows |
| Testing | Mature framework (generic + singular tests, packages) | Assertions (simpler, built-in) |

The bottom line: both tools produce equivalent output. The choice is about your team’s existing stack and preferences, not capability. If you picked dbt, this article is for you.


Option 1: The Velir/dbt-ga4 Package

The fastest way to get started is the Velir/dbt-ga4 package — the most popular community package for GA4 BigQuery transformations with 386 stars and 164 forks.

What It Provides

Staging models (the heavy lifting):

  • stg_ga4__events — Cleaned event data with session and event keys
  • stg_ga4__event_* — Individual models per event type (page_view, purchase, etc.) with flattened parameters
  • stg_ga4__event_items — Ecommerce item-level data
  • stg_ga4__user_properties — Most recent user property values per user
  • stg_ga4__sessions_traffic_sources — First-click attribution per session
  • stg_ga4__sessions_traffic_sources_last_non_direct_daily — Last non-direct attribution with configurable lookback

Marts models (analysis-ready):

  • dim_ga4__sessions / dim_ga4__sessions_daily — Session dimensions (geography, device, acquisition)
  • fct_ga4__pages — Page-level metrics by date and location
  • fct_ga4__sessions / fct_ga4__sessions_daily — Session facts with conversion counts
  • dim_ga4__client_keys — Device dimension with first/last page viewed

Installation

Add to your packages.yml:

packages:
  - package: Velir/ga4
    version: ['>=6.2.0', '<6.3.0']

Then install:

dbt deps

Configuration

Add the required variables to your dbt_project.yml:

vars:
  ga4:
    source_project: 'your-gcp-project-id'
    property_ids: [123456789]
    start_date: '20240101'
    static_incremental_days: 3

Key optional variables:

vars:
  ga4:
    # Extract UTM parameters from URLs into columns
    query_parameter_extraction:
      - 'utm_source'
      - 'utm_medium'
      - 'utm_campaign'

    # Remove tracking parameters from page_location
    query_parameter_exclusions:
      - 'gclid'
      - 'fbclid'
      - 'mc_cid'

    # Define which events count as conversions
    conversion_events:
      - 'purchase'
      - 'generate_lead'
      - 'sign_up'

    # Attribution lookback window (default: 30 days)
    session_attribution_lookback_window_days: 30

    # Custom parameters to extract on all events
    default_custom_parameters:
      - name: 'content_group'
        value_type: 'string_value'
      - name: 'logged_in'
        value_type: 'string_value'

First Run

# Load source category mappings (required before first run)
dbt seed

# Run all models
dbt run

# Run tests
dbt test

The first run does a full historical load. Subsequent runs process incrementally based on static_incremental_days.
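To see what a lookback of 3 actually touches, here is a hypothetical Python helper (not part of the package) that lists the _TABLE_SUFFIX values an incremental run would scan, assuming daily shards named events_YYYYMMDD:

```python
from datetime import date, timedelta

def incremental_suffixes(today: date, static_incremental_days: int = 3) -> list:
    # Shards with dates >= today - N days, formatted like GA4's date suffix;
    # note the window is inclusive, so N=3 rescans four daily shards.
    return [(today - timedelta(days=d)).strftime("%Y%m%d")
            for d in range(static_incremental_days, -1, -1)]

print(incremental_suffixes(date(2026, 2, 28)))
```

Rescanning a few recent shards is how the pipeline picks up GA4's late-arriving events without reprocessing history.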

Multi-Property Support

If you have multiple GA4 properties, the package can consolidate them:

vars:
  ga4:
    combined_dataset: 'ga4_combined'
    property_ids: [111111111, 222222222]

The package clones date shards from each property into the combined dataset before running transformations. This is a significant advantage over manual approaches where you would need to UNION tables across datasets.


Option 2: Building Custom Models

The Velir package is comprehensive, but sometimes you need full control. Here is how to build the core transformations from scratch — the same architecture covered in the Dataform article, translated to dbt conventions.

Project Structure

├── models/
│   ├── staging/
│   │   ├── _staging__sources.yml
│   │   ├── _staging__models.yml
│   │   └── stg_ga4__events.sql
│   ├── intermediate/
│   │   ├── _intermediate__models.yml
│   │   ├── int_ga4__sessions.sql
│   │   └── int_ga4__user_mapping.sql
│   └── marts/
│       ├── _marts__models.yml
│       ├── fct_ga4__sessions_daily.sql
│       └── fct_ga4__pageviews_daily.sql
├── macros/
│   └── unnest_param.sql
├── tests/
│   └── assert_event_date_range.sql
├── dbt_project.yml
└── packages.yml

Source Declaration

Create models/staging/_staging__sources.yml:

version: 2

sources:
  - name: ga4
    database: "{{ var('source_project') }}"
    schema: "analytics_{{ var('property_id') }}"
    tables:
      - name: events
        identifier: 'events_*'
        description: 'Raw GA4 BigQuery export (date-sharded)'

Macro: Unnest Parameter

Create macros/unnest_param.sql — this is the dbt equivalent of the JavaScript helper from the Dataform article:

{% macro unnest_param(key, value_type='string_value') %}
  (SELECT value.{{ value_type }} FROM UNNEST(event_params) WHERE key = '{{ key }}')
{% endmacro %}

{% macro unnest_user_property(key, value_type='string_value') %}
  (SELECT value.{{ value_type }} FROM UNNEST(user_properties) WHERE key = '{{ key }}')
{% endmacro %}

Usage in models:

SELECT
  {{ unnest_param('page_location') }} AS page_location,
  {{ unnest_param('ga_session_id', 'int_value') }} AS ga_session_id,
  {{ unnest_user_property('user_tier') }} AS user_tier
FROM {{ source('ga4', 'events') }}
Jinja macros serve the same purpose as Dataform’s JavaScript helpers — abstracting the repetitive UNNEST boilerplate into reusable functions.
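As a sanity check on what the macro compiles to, here is a tiny Python stand-in (not dbt itself) that produces the same SQL string the Jinja macro renders at compile time:

```python
# Mimics what {{ unnest_param(key, value_type) }} compiles to in dbt.
# Pure string templating; the real macro lives in macros/unnest_param.sql.
def unnest_param(key: str, value_type: str = "string_value") -> str:
    return (f"(SELECT value.{value_type} FROM UNNEST(event_params) "
            f"WHERE key = '{key}')")

print(unnest_param("ga_session_id", "int_value"))
```

Running dbt compile shows the real rendered output, which is useful when debugging a model that extracts a dozen parameters.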

Staging: Flattened Events

Create models/staging/stg_ga4__events.sql:

{{
  config(
    materialized='incremental',
    partition_by={
      "field": "event_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=["event_name"],
    incremental_strategy='merge',
    unique_key=['event_date', 'event_timestamp', 'user_pseudo_id', 'event_name', 'row_num']
  )
}}

WITH source AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_date, event_timestamp, user_pseudo_id, event_name
    ) AS row_num
  FROM {{ source('ga4', 'events') }}
  WHERE _TABLE_SUFFIX NOT LIKE '%intraday%'
  {% if is_incremental() %}
    AND PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ var('static_incremental_days', 3) }} DAY)
  {% endif %}
)

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
  event_name,
  user_pseudo_id,
  user_id,

  -- Extracted event parameters
  {{ unnest_param('page_location') }} AS page_location,
  {{ unnest_param('page_title') }} AS page_title,
  {{ unnest_param('page_referrer') }} AS page_referrer,
  {{ unnest_param('source') }} AS source,
  {{ unnest_param('medium') }} AS medium,
  {{ unnest_param('campaign') }} AS campaign,
  {{ unnest_param('ga_session_id', 'int_value') }} AS ga_session_id,
  {{ unnest_param('ga_session_number', 'int_value') }} AS ga_session_number,
  {{ unnest_param('engagement_time_msec', 'int_value') }} AS engagement_time_msec,
  {{ unnest_param('session_engaged') }} AS session_engaged,

  -- Device and geo
  device.category AS device_category,
  device.operating_system AS device_os,
  device.web_info.browser AS browser,
  geo.country AS country,
  geo.city AS city,

  -- Traffic source
  traffic_source.source AS traffic_source,
  traffic_source.medium AS traffic_medium,
  traffic_source.name AS traffic_campaign,

  -- Ecommerce
  ecommerce.purchase_revenue AS purchase_revenue,
  ecommerce.transaction_id AS transaction_id,

  row_num

FROM source

The key difference from Dataform: dbt uses {% if is_incremental() %} instead of ${when(incremental(), ...)}, and {{ source('ga4', 'events') }} instead of ${ref("ga4_raw_events")}. The SQL logic is identical.

Intermediate: Sessions

Create models/intermediate/int_ga4__sessions.sql:

{{
  config(
    materialized='incremental',
    partition_by={
      "field": "session_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=["source", "medium", "device_category"],
    incremental_strategy='merge',
    unique_key=['ga_session_key', 'session_date']
  )
}}

WITH events AS (
  SELECT *
  FROM {{ ref('stg_ga4__events') }}
  {% if is_incremental() %}
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ var('static_incremental_days', 3) }} DAY)
  {% endif %}
),

session_source AS (
  SELECT
    CONCAT(user_pseudo_id, '.', CAST(ga_session_id AS STRING)) AS ga_session_key,
    event_date,
    COALESCE(source, traffic_source) AS source,
    COALESCE(medium, traffic_medium) AS medium,
    COALESCE(campaign, traffic_campaign) AS campaign
  FROM events
  WHERE event_name = 'session_start'
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY user_pseudo_id, ga_session_id
    ORDER BY event_timestamp
  ) = 1
)

SELECT
  CONCAT(e.user_pseudo_id, '.', CAST(e.ga_session_id AS STRING)) AS ga_session_key,
  e.user_pseudo_id,
  e.user_id,
  e.ga_session_id,
  e.ga_session_number,
  MIN(e.event_date) AS session_date,
  MIN(e.event_timestamp) AS session_start,
  MAX(e.event_timestamp) AS session_end,
  TIMESTAMP_DIFF(MAX(e.event_timestamp), MIN(e.event_timestamp), SECOND) AS session_duration_seconds,

  ss.source,
  ss.medium,
  ss.campaign,

  COUNT(*) AS event_count,
  COUNTIF(e.event_name = 'page_view') AS pageviews,
  COUNTIF(e.event_name = 'purchase') AS purchases,
  SUM(COALESCE(e.engagement_time_msec, 0)) AS total_engagement_time_msec,
  MAX(CASE WHEN e.session_engaged = '1' THEN TRUE ELSE FALSE END) AS is_engaged,

  ARRAY_AGG(
    IF(e.event_name = 'page_view', e.page_location, NULL) IGNORE NULLS
    ORDER BY e.event_timestamp LIMIT 1
  )[SAFE_OFFSET(0)] AS landing_page,

  ARRAY_AGG(e.device_category IGNORE NULLS ORDER BY e.event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS device_category,
  ARRAY_AGG(e.country IGNORE NULLS ORDER BY e.event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS country,

  SUM(COALESCE(e.purchase_revenue, 0)) AS session_revenue

FROM events e
LEFT JOIN session_source ss
  ON CONCAT(e.user_pseudo_id, '.', CAST(e.ga_session_id AS STRING)) = ss.ga_session_key
WHERE e.ga_session_id IS NOT NULL
GROUP BY 1, 2, 3, 4, 5, ss.source, ss.medium, ss.campaign

If you compared this to the Dataform sessions model, the SQL is nearly identical. The differences are syntax-level: {{ config() }} vs Dataform’s config {}, {{ ref() }} vs ${ref()}, and {% if is_incremental() %} vs ${when(incremental(), ...)}.
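For intuition on what the session aggregation computes, here is an illustrative Python sketch of the MIN/MAX/COUNTIF logic applied to one session's events (timestamps simplified to plain seconds; not production code):

```python
# Sketch of the per-session aggregates from int_ga4__sessions:
# session bounds, duration, and a COUNTIF-style pageview count.
def summarize_session(events: list) -> dict:
    ts = [e["event_timestamp"] for e in events]
    return {
        "session_start": min(ts),
        "session_end": max(ts),
        "session_duration_seconds": max(ts) - min(ts),
        "pageviews": sum(1 for e in events if e["event_name"] == "page_view"),
    }

events = [
    {"event_timestamp": 1000, "event_name": "session_start"},
    {"event_timestamp": 1005, "event_name": "page_view"},
    {"event_timestamp": 1065, "event_name": "page_view"},
]
print(summarize_session(events))
```

The SQL version does the same thing across all sessions at once, with GROUP BY on the session key.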


Testing and Data Quality

This is where dbt genuinely shines over Dataform. The testing framework is more mature, with more test types and better integration with CI/CD.

Schema Tests

Add to models/staging/_staging__models.yml:

version: 2

models:
  - name: stg_ga4__events
    description: 'Flattened GA4 events with extracted parameters'
    columns:
      - name: user_pseudo_id
        description: 'Cookie-based user identifier'
        tests:
          - not_null
      - name: event_date
        description: 'Event date'
        tests:
          - not_null
      - name: event_name
        description: 'GA4 event name'
        tests:
          - not_null
          - accepted_values:
              values:
                [
                  'page_view',
                  'session_start',
                  'first_visit',
                  'scroll',
                  'click',
                  'view_item',
                  'add_to_cart',
                  'purchase',
                  'begin_checkout',
                  'user_engagement',
                ]
              config:
                severity: warn

  - name: int_ga4__sessions
    description: 'Session-level aggregation with traffic source attribution'
    columns:
      - name: ga_session_key
        description: 'Unique session identifier (user_pseudo_id + ga_session_id)'
        tests:
          - not_null
      - name: session_date
        tests:
          - not_null
    tests:
      - unique:
          column_name: "ga_session_key || '-' || CAST(session_date AS STRING)"

Custom Singular Tests

Create tests/assert_event_date_range.sql:

-- Events with dates in the future or more than 400 days ago indicate export issues
SELECT *
FROM {{ ref('stg_ga4__events') }}
WHERE event_date > CURRENT_DATE()
  OR event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 400 DAY)

Create tests/assert_session_count_bounds.sql:

-- Alert if yesterday's sessions dropped by more than 80% vs 30-day average
WITH daily_counts AS (
  SELECT
    session_date,
    COUNT(*) AS session_count
  FROM {{ ref('int_ga4__sessions') }}
  WHERE session_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 31 DAY)
  GROUP BY 1
),
averages AS (
  SELECT
    AVG(CASE WHEN session_date < DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN session_count END) AS avg_30d,
    MAX(CASE WHEN session_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN session_count END) AS yesterday
  FROM daily_counts
)
SELECT *
FROM averages
WHERE yesterday < avg_30d * 0.2
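The guardrail condition itself is simple; sketched in Python for clarity (a dbt singular test fails when the query returns rows, i.e. when this predicate is true):

```python
# True when yesterday's session count fell below 20% of the 30-day average,
# mirroring the WHERE clause of assert_session_count_bounds.sql.
def sessions_dropped(yesterday: int, avg_30d: float, threshold: float = 0.2) -> bool:
    return yesterday < avg_30d * threshold

print(sessions_dropped(150, 1000))  # -> True: an 85% drop trips the alert
```

Tune the threshold to your traffic's natural weekday/weekend swing, or weekend dips will page you every Saturday.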

dbt Testing Packages

For additional test coverage, add dbt-utils and dbt-expectations:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: ['>=1.0.0', '<2.0.0']
  - package: calogica/dbt_expectations
    version: ['>=0.10.0', '<0.11.0']

This unlocks tests like expect_column_values_to_be_between, expect_table_row_count_to_be_between, and recency (ensure data is not stale). Dataform’s assertion system is simpler and built-in, but dbt’s testing ecosystem is significantly deeper.
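As an example, here is a hedged sketch of how these package tests attach to the models above in a YAML file (test names are real dbt_utils/dbt_expectations generics; the thresholds are placeholders to adapt):

```yaml
models:
  - name: int_ga4__sessions
    tests:
      # Fail if no sessions have landed in the last 2 days
      - dbt_utils.recency:
          datepart: day
          field: session_date
          interval: 2
    columns:
      - name: pageviews
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```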

Running Tests in CI

# Run all tests
dbt test

# Run only tests for staging models
dbt test --select staging

# Run tests with severity filtering
dbt test --warn-error

Integrate this into your CI/CD pipeline (GitHub Actions, GitLab CI, etc.) to catch data quality issues before they reach production.


Incremental Loading Strategy

dbt’s incremental strategy for GA4 follows the same logic as Dataform’s — process only recent data, use MERGE to avoid duplicates.

The static_incremental_days Pattern

The Velir package uses static_incremental_days (default: 3) as the lookback window. On each run:

  1. Only _TABLE_SUFFIX values within the last N days are scanned
  2. A MERGE statement inserts new rows and updates existing ones
  3. Late-arriving events from GA4’s export process are captured

The relevant filter, as it appears in the staging model:

{% if is_incremental() %}
  AND PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) >= DATE_SUB(
    CURRENT_DATE(),
    INTERVAL {{ var('static_incremental_days', 3) }} DAY
  )
{% endif %}

Full Refresh vs. Incremental

# First run or schema change: full refresh
dbt run --full-refresh

# Daily runs: incremental (default)
dbt run

A full refresh on a site with a year of GA4 data might scan 50–100 GB. A 3-day incremental run scans 1–3 GB. At $6.25/TB on-demand, that is the difference between $0.63 and $0.02 per run.
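Spelled out as arithmetic (using the on-demand rate and the assumed scan sizes above):

```python
# BigQuery on-demand rate per TB scanned, as quoted in the text
ON_DEMAND_PER_TB = 6.25  # USD

def run_cost(scanned_gb: float) -> float:
    # Cost of a single run that scans `scanned_gb` of data
    return scanned_gb / 1000 * ON_DEMAND_PER_TB

print(run_cost(100))  # full refresh, ~100 GB scanned
print(run_cost(3))    # 3-day incremental run, ~3 GB scanned
```

Pennies either way per run, but at hourly schedules across many models the incremental discipline is what keeps the bill flat.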

Disabling Expensive Models

For high-traffic sites, the Velir package lets you disable unpartitioned models that would otherwise scan the entire history:

# dbt_project.yml
models:
  ga4:
    staging:
      stg_ga4__sessions_traffic_sources:
        +enabled: false # Use the _daily variant instead
    marts:
      fct_ga4__sessions:
        +enabled: false # Use fct_ga4__sessions_daily instead

This is a practical cost optimization that the package handles well.


Scheduling and Orchestration

Unlike Dataform, dbt does not have built-in scheduling. You need an external orchestrator.

Option 1: dbt Cloud

The simplest option if you are paying for dbt Cloud. Create a job with a daily schedule:

  • Schedule: 0 9 * * * (9 AM daily, after GA4 export lands)
  • Commands: dbt seed && dbt run && dbt test
  • Environment: Production

dbt Cloud also provides run history, alerting, and the IDE for development.

Option 2: GitHub Actions (Free)

For dbt Core users, a GitHub Actions workflow is the most common approach:

# .github/workflows/dbt-daily.yml
name: dbt daily run
on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-bigquery

      - name: Write service account key
        run: echo '${{ secrets.GCP_SA_KEY }}' > "$RUNNER_TEMP/gcp-key.json"

      - name: Run dbt
        env:
          DBT_PROFILES_DIR: .
          # Must be a file path, not the key contents, hence the step above
          GOOGLE_APPLICATION_CREDENTIALS: ${{ runner.temp }}/gcp-key.json
        run: |
          dbt deps
          dbt seed
          dbt run
          dbt test

Store your GCP service account key as a GitHub secret. The workflow runs daily, installs dbt, and executes the pipeline.
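Because DBT_PROFILES_DIR points at the repository root, the workflow expects a profiles.yml checked in there. A minimal sketch (the profile name, project, and dataset are placeholders; the profile name must match the one in your dbt_project.yml):

```yaml
# profiles.yml — safe to commit; the key itself stays in GitHub secrets
ga4_project:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      keyfile: "{{ env_var('GOOGLE_APPLICATION_CREDENTIALS') }}"
      project: your-gcp-project-id
      dataset: dbt_ga4
      threads: 4
      location: US
```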

Option 3: Apache Airflow / Cloud Composer

For teams with existing Airflow infrastructure, use Astronomer Cosmos — the standard way to run dbt projects as Airflow DAGs. Cosmos turns each dbt model into an individual Airflow task, giving you full visibility and retry control. This is overkill for a standalone GA4 pipeline but makes sense when dbt is part of a larger orchestration graph.


dbt Documentation and Lineage

One of dbt’s standout features is automatic documentation generation:

dbt docs generate
dbt docs serve

This produces a browsable website with:

  • Model descriptions from your YAML files
  • Column-level documentation and tests
  • Lineage graph showing how models depend on each other (source → staging → intermediate → marts)
  • SQL compiled output for debugging

For GA4 pipelines, the lineage graph is particularly useful — it visualizes how raw events flow through flattening, sessionization, and into the marts layer. Share it with stakeholders who want to understand where their dashboard numbers come from.


Common Issues and Troubleshooting

“Not found: Dataset” on First Run

Cause: dbt does not auto-create BigQuery datasets.

Fix: Create the target datasets manually in BigQuery before running, or add on-run-start hooks to your dbt_project.yml:

on-run-start:
  - 'CREATE SCHEMA IF NOT EXISTS {{ target.schema }}'

Slow Initial Full Refresh

Cause: Scanning the entire GA4 export history.

Fix: Set start_date in your variables to limit the initial load:

vars:
  ga4:
    start_date: '20250101' # Only load from Jan 2025

You can always backfill earlier data later with a targeted --full-refresh.

_TABLE_SUFFIX Filter Not Working

Cause: Filtering on the event_date column instead of _TABLE_SUFFIX in the WHERE clause. For date-sharded tables, BigQuery can only skip shards when the filter is applied directly to _TABLE_SUFFIX; a filter on a derived value forces every shard to be read.

Fix: Always filter on _TABLE_SUFFIX for date-sharded tables:

-- Good: only the matching date shards are read
WHERE _TABLE_SUFFIX BETWEEN '20260201' AND '20260228'

-- Bad: reads ALL shards, then filters
WHERE PARSE_DATE('%Y%m%d', event_date) >= '2026-02-01'

Tests Failing with “accepted_values”

Cause: GA4 sends custom event names that are not in your allowed list.

Fix: Set severity to warn instead of error for event name validation, or expand the accepted values list to include your custom events.


When to Use the Package vs. Custom Models

Use Velir/dbt-ga4 if:

  • You want a working pipeline in under an hour
  • You need multi-property support
  • You want maintained, community-tested models
  • You do not need heavy customization of the transformation logic

Build custom models if:

  • You need full control over the schema and column names
  • Your business logic (attribution, sessionization) differs from the package defaults
  • You want to match an existing data model or naming convention
  • You are learning dbt and want to understand every transformation

A common middle ground: start with the Velir package, then gradually replace individual models with custom ones as your needs diverge. The package models are well-documented, so reading their source code is an excellent way to learn GA4 transformation patterns.


