Dataform for GA4: Build Your First BigQuery Transformation Pipeline

GA4's BigQuery export gives you raw, deeply nested event data. Every query requires UNNEST subqueries, every dashboard scans the full table, and every analyst writes the same boilerplate SQL. Dataform fixes this: transform once into clean, flat, partitioned tables, then query cheaply downstream. This guide walks through the complete pipeline, from flattening events to sessionization, incremental loading, and data quality testing.

The GA4 BigQuery Export Problem

You enabled the GA4 BigQuery export. Now every analyst on your team needs to write queries like this to get something as basic as pageviews by page:

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS date,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260201' AND '20260228'
  AND event_name = 'page_view'
GROUP BY 1, 2
ORDER BY pageviews DESC

That nested (SELECT ... FROM UNNEST(event_params) WHERE key = ...) pattern appears in every single query. Want page_title? Another UNNEST. Want session_id? Another UNNEST. Want traffic source from session_start events? UNNEST with a JOIN. The schema uses RECORD and REPEATED fields for event_params, user_properties, and items — arrays of key-value pairs instead of flat columns.

This is not just a readability problem. It is a cost problem. Every SELECT * scans every nested column. Analysts connecting Looker Studio directly to the raw export tables generate full-table scans on every dashboard refresh. I covered the cost implications in detail in GA4 BigQuery Export Cost Optimization — the short version is that a medium-traffic site can burn through $25–60/month from unoptimized dashboards alone.

The solution is a transformation pipeline: process the raw export once into clean, flat tables, then point all downstream queries at those tables. This is what Dataform does.


What Is Dataform and Why Use It for GA4

Dataform is Google Cloud’s native SQL transformation framework. Google acquired the company in 2020 and integrated it directly into the BigQuery console. You define SQL workflows as .sqlx files with dependency management, incremental materializations, and built-in data quality testing.

For GA4 specifically, Dataform is a natural fit:

  • Source and destination are both BigQuery. No cross-warehouse complexity, no separate infrastructure to manage.
  • Zero licensing cost. You pay only for BigQuery compute. No per-seat fees.
  • Native BigQuery UI integration. The DAG visualization, table descriptions, and workflow execution all live in the BigQuery console.
  • JavaScript templating. Dynamic SQL generation using JavaScript rather than Jinja — useful for generating repetitive UNNEST patterns programmatically.
  • Built-in scheduling. Managed through Cloud Workflows and Cloud Scheduler. No Airflow, no cron jobs.

The workflow: you write .sqlx model files that transform raw GA4 data into progressively cleaner tables. Dataform compiles them into a DAG (Directed Acyclic Graph), resolves dependencies, and executes them in the right order. Version control everything in Git.


Dataform vs. dbt: Which Tool Should You Choose?

If you have been researching data transformation tools, you have encountered dbt. The comparison is worth addressing upfront because it affects your architecture.

Factor | Dataform | dbt
Cost | Free (BigQuery compute only) | dbt Cloud ~$100/user/month; dbt Core is free but self-managed
Warehouse support | BigQuery only | BigQuery, Snowflake, Redshift, Databricks, and more
Templating | JavaScript | Jinja
Integration | Native in BigQuery console | Separate UI (dbt Cloud) or CLI
Community | Growing (smaller ecosystem) | Massive (5,000+ packages on dbt Hub)
GA4 packages | google-marketing-solutions/ga4_dataform (150 stars) | Velir/dbt-ga4 (386 stars)
Scheduling | Built-in via Cloud Workflows | Requires dbt Cloud or external orchestration
AI features | Gemini integration (code generation in editor) | dbt Copilot (dbt Cloud only)

Choose Dataform if you are fully committed to BigQuery, want zero additional tooling cost, and prefer native Google Cloud integration. This covers most analytics teams working exclusively with GA4 data.

Choose dbt if you need multi-warehouse support, already have a dbt investment, or want the richest ecosystem of community packages and testing frameworks.

The real competition is “no transformation pipeline.” Both tools are excellent. The worst choice is running ad-hoc UNNEST queries directly from dashboards for the next two years.

This article focuses on Dataform. If you are going with dbt, the Velir/dbt-ga4 package provides equivalent transformations (flattened events, sessions, user dimensions) using dbt conventions.


Pipeline Architecture: The Four-Layer Pattern

Before touching any code, understand the architecture. A well-designed GA4 pipeline follows four layers:

1. Raw Layer — The untouched GA4 BigQuery export (analytics_PROPERTY_ID.events_*). Never modify this data.

2. Staging Layer — Flattened events with extracted parameters. Each event_params key becomes its own column. This is the single most impactful transformation.

3. Intermediate Layer — Business logic applied: sessionization (grouping events into sessions), user stitching (cross-device identity resolution), and traffic source attribution.

4. Marts Layer — Aggregated tables ready for dashboards and downstream tools. Sessions summary, daily pageviews, conversion funnels, user cohorts.

Each layer is a set of Dataform .sqlx files with explicit ref() dependencies forming a clean DAG. Downstream models wait for upstream models to complete.

The result: analysts query the marts layer without ever writing UNNEST again. Dashboards scan megabytes instead of gigabytes. Query costs drop by 90%+.
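For contrast, here is the pageviews-by-page report from the introduction rewritten against the flattened staging table (table and column names anticipate the stg_events model built later in this guide; the project ID is a placeholder):

```sql
SELECT
  event_date,
  page_location,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `your-gcp-project-id.ga4_transformed.stg_events`
WHERE event_date BETWEEN '2026-02-01' AND '2026-02-28'  -- partition filter
  AND event_name = 'page_view'
GROUP BY 1, 2
ORDER BY pageviews DESC
```

No UNNEST anywhere, and the event_date partition filter limits the scan to one month of flat, narrow data.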


Setting Up Your Dataform Workspace

Prerequisites

Before creating the repository, make sure you have:

  • A GA4 property linked to BigQuery, with export data already landing in analytics_PROPERTY_ID.events_*
  • A Google Cloud project with billing enabled
  • The Dataform API enabled, plus permissions to create BigQuery datasets and tables

Creating the Repository

  1. Open the BigQuery console and navigate to Dataform in the left sidebar.
  2. Click Create repository. Link it to a GitHub or GitLab repo, or use Dataform's built-in Git repository if you do not need an external remote.
  3. Set the region to match your BigQuery dataset (e.g., europe-west1 for EU data).
  4. Create a development workspace — this is where you write and test models before deploying.

Project Structure

A standard Dataform project follows this layout:

├── definitions/
│   ├── staging/          # Flattened events, extracted parameters
│   ├── intermediate/     # Sessions, user stitching, attribution
│   └── marts/            # Business-ready aggregations
├── includes/
│   ├── constants.js      # Project config (dataset, property ID)
│   └── helpers.js        # Reusable SQL generation functions
└── workflow_settings.yaml

Configuration

The workflow_settings.yaml file defines your project defaults:

defaultProject: your-gcp-project-id
defaultLocation: europe-west1
defaultDataset: ga4_transformed
defaultAssertionDataset: ga4_assertions

Create an includes/constants.js file for GA4-specific configuration:

const GA4_DATASET = 'analytics_123456789';
const GA4_PROJECT = 'your-gcp-project-id';
const OUTPUT_DATASET = 'ga4_transformed';
// How many days to look back for incremental runs
const INCREMENTAL_LOOKBACK_DAYS = 3;

module.exports = {
  GA4_DATASET,
  GA4_PROJECT,
  OUTPUT_DATASET,
  INCREMENTAL_LOOKBACK_DAYS,
};

The Starter Project Shortcut

Google’s Marketing Solutions team maintains ga4_dataform, an open-source Dataform project with pre-built models for common GA4 transformations. It produces sessions, events, and user transaction tables out of the box. Fork it, update the configuration, run a “Full Refresh”, and you have a working pipeline in under an hour.

Another option is the dataform-ga4-sessions package by Artem Korneev, which can be installed as a Dataform dependency and provides session and event models with sensible defaults.

Both are good starting points. This article walks through building the models yourself so you understand what they do.


Core Transformation 1: Flattening GA4 Event Parameters

This is the single most impactful transformation. You convert the nested event_params array into flat columns that BigQuery can query efficiently.

The Pattern

Instead of this in every downstream query:

(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location')

You create a staging model that extracts each parameter once:

Create definitions/staging/stg_events.sqlx:

config {
  type: "incremental",
  schema: "ga4_transformed",
  description: "Flattened GA4 events with extracted parameters",
  bigquery: {
    partitionBy: "event_date",
    clusterBy: ["event_name"]
  },
  uniqueKey: ["event_date", "event_timestamp", "user_pseudo_id", "event_name", "row_num"]
}

WITH source AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_date, event_timestamp, user_pseudo_id, event_name
    ) AS row_num
  FROM ${ref("ga4_raw_events")}
  ${when(incremental(),
    `WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL ${constants.INCREMENTAL_LOOKBACK_DAYS} DAY))`
  )}
)

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
  event_name,
  user_pseudo_id,
  user_id,

  -- Extracted event parameters
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS page_title,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_referrer') AS page_referrer,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') AS medium,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign') AS campaign,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_number') AS ga_session_number,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') AS engagement_time_msec,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'session_engaged') AS session_engaged,

  -- Device and geo
  device.category AS device_category,
  device.operating_system AS device_os,
  device.web_info.browser AS browser,
  geo.country AS country,
  geo.city AS city,

  -- Traffic source (user-level, first-touch — does NOT change per session)
  traffic_source.source AS first_user_source,
  traffic_source.medium AS first_user_medium,
  traffic_source.name AS first_user_campaign,

  -- Session-level traffic source (last non-direct click attribution)
  session_traffic_source_last_click.manual_campaign.source AS session_source,
  session_traffic_source_last_click.manual_campaign.medium AS session_medium,
  session_traffic_source_last_click.manual_campaign.campaign_name AS session_campaign,

  -- Ecommerce (if applicable)
  ecommerce.purchase_revenue AS purchase_revenue,
  ecommerce.transaction_id AS transaction_id,

  row_num

FROM source
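One thing the model above assumes: ${ref("ga4_raw_events")} must resolve to a source Dataform knows about. Raw export tables are declared, not managed, sources. A minimal declaration sketch (the filename definitions/sources/ga4_raw_events.sqlx and the IDs are placeholders):

```sql
config {
  type: "declaration",
  database: "your-gcp-project-id",  // placeholder project ID
  schema: "analytics_123456789",    // your GA4 export dataset
  name: "events_*"                  // sharded daily export tables
}
```

Dataform resolves ref() by the declared name, so with this file you would write ${ref("events_*")} in the staging model — whichever name you choose, keep the declaration and the ref in sync.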

A Note on Traffic Source Fields

The GA4 BigQuery export contains multiple traffic source structs that are easy to confuse:

  • traffic_source — The user’s first-ever observed traffic source (first-touch attribution). This value never changes across sessions for the same user. Useful for acquisition analysis, but not for session-level attribution.
  • session_traffic_source_last_click — The session-level traffic source with last non-direct click attribution applied. This is equivalent to what you see in GA4’s standard reports and is what you want for most session-level analysis.
  • collected_traffic_source — The event-level raw traffic source data as collected. Useful when you need to build custom attribution models.

The staging model above extracts both traffic_source (renamed with first_user_ prefix for clarity) and session_traffic_source_last_click (as session_source/medium/campaign). The sessionization model downstream uses the session-level fields.

Making It DRY with JavaScript Helpers

That (SELECT ... FROM UNNEST(event_params) WHERE key = ...) pattern gets repetitive. Create a helper function in includes/helpers.js:

function unnestParam(key, valueType) {
  const valueField = valueType || 'string_value';
  return `(SELECT value.${valueField} FROM UNNEST(event_params) WHERE key = '${key}')`;
}

function unnestUserProperty(key, valueType) {
  const valueField = valueType || 'string_value';
  return `(SELECT value.${valueField} FROM UNNEST(user_properties) WHERE key = '${key}')`;
}

module.exports = { unnestParam, unnestUserProperty };

Then use it in your models:

SELECT
  ${helpers.unnestParam("page_location")} AS page_location,
  ${helpers.unnestParam("ga_session_id", "int_value")} AS ga_session_id,
  ${helpers.unnestUserProperty("user_tier")} AS user_tier,

This is where Dataform’s JavaScript templating shines. The helper function generates the verbose UNNEST SQL at compile time, keeping your model files clean and consistent.
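Taking the helper one step further (a hypothetical extension, not part of the stock helpers file): drive all extracted columns from a single config array, so adding a parameter is a one-line change.

```javascript
// Hypothetical compile-time helper: generate every extracted-parameter
// column from one config array.
const PARAMS = [
  { key: 'page_location', type: 'string_value' },
  { key: 'page_title', type: 'string_value' },
  { key: 'ga_session_id', type: 'int_value' },
];

function paramColumns(params) {
  return params
    .map(p => `(SELECT value.${p.type} FROM UNNEST(event_params) WHERE key = '${p.key}') AS ${p.key}`)
    .join(',\n  ');
}

// Emits one UNNEST subquery per configured parameter, ready to paste
// into the SELECT list via ${helpers.paramColumns(PARAMS)}.
console.log(paramColumns(PARAMS));
```
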

The Impact

A raw GA4 table for a medium site might be 1 GB/day. The equivalent flattened staging table with only the columns you need is dramatically smaller. When Looker Studio queries the flattened table instead of the raw export, data scanned drops by 80–95%. This single transformation often justifies the entire Dataform setup.


Core Transformation 2: Sessionization

GA4’s raw export contains individual events but no pre-built session table. Sessionization means grouping events into sessions, calculating session-level metrics, and attributing traffic sources.

Create definitions/intermediate/sessions.sqlx:

config {
  type: "incremental",
  schema: "ga4_transformed",
  description: "Session-level aggregation with traffic source attribution",
  bigquery: {
    partitionBy: "session_date",
    clusterBy: ["source", "medium", "device_category"]
  },
  uniqueKey: ["ga_session_key", "session_date"]
}

WITH events AS (
  SELECT *
  FROM ${ref("stg_events")}
  ${when(incremental(),
    `WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL ${constants.INCREMENTAL_LOOKBACK_DAYS} DAY)`
  )}
),

session_source AS (
  -- Get session-level traffic source (last non-direct click attribution)
  SELECT
    CONCAT(user_pseudo_id, '.', CAST(ga_session_id AS STRING)) AS ga_session_key,
    event_date,
    COALESCE(session_source, source) AS source,
    COALESCE(session_medium, medium) AS medium,
    COALESCE(session_campaign, campaign) AS campaign
  FROM events
  WHERE event_name = 'session_start'
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY user_pseudo_id, ga_session_id
    ORDER BY event_timestamp
  ) = 1
)

SELECT
  CONCAT(e.user_pseudo_id, '.', CAST(e.ga_session_id AS STRING)) AS ga_session_key,
  e.user_pseudo_id,
  e.user_id,
  e.ga_session_id,
  e.ga_session_number,
  MIN(e.event_date) AS session_date,
  MIN(e.event_timestamp) AS session_start,
  MAX(e.event_timestamp) AS session_end,
  TIMESTAMP_DIFF(MAX(e.event_timestamp), MIN(e.event_timestamp), SECOND) AS session_duration_seconds,

  -- Traffic source from session_start event
  ss.source,
  ss.medium,
  ss.campaign,

  -- Session metrics
  COUNT(*) AS event_count,
  COUNTIF(e.event_name = 'page_view') AS pageviews,
  COUNTIF(e.event_name = 'purchase') AS purchases,
  SUM(COALESCE(e.engagement_time_msec, 0)) AS total_engagement_time_msec,
  LOGICAL_OR(COALESCE(e.session_engaged, '0') = '1') AS is_engaged,

  -- Landing page
  ARRAY_AGG(
    IF(e.event_name = 'page_view', e.page_location, NULL) IGNORE NULLS
    ORDER BY e.event_timestamp LIMIT 1
  )[SAFE_OFFSET(0)] AS landing_page,

  -- Device and geo (from first event)
  ARRAY_AGG(e.device_category IGNORE NULLS ORDER BY e.event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS device_category,
  ARRAY_AGG(e.country IGNORE NULLS ORDER BY e.event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS country,

  -- Revenue
  SUM(COALESCE(e.purchase_revenue, 0)) AS session_revenue

FROM events e
LEFT JOIN session_source ss
  ON CONCAT(e.user_pseudo_id, '.', CAST(e.ga_session_id AS STRING)) = ss.ga_session_key
WHERE e.ga_session_id IS NOT NULL
GROUP BY 1, 2, 3, 4, 5, ss.source, ss.medium, ss.campaign

Why ga_session_key Instead of ga_session_id

ga_session_id is not globally unique — it is a timestamp-based integer generated per device. Two different users can (and will) have the same ga_session_id. The unique identifier is ga_session_key: the concatenation of user_pseudo_id and ga_session_id. Always use this as your session primary key.
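The collision risk is easy to demonstrate in plain code (illustrative only — in SQL this is the CONCAT in the sessions model above):

```javascript
// ga_session_id is a per-device timestamp, so two users who start a
// session at the same moment share it. Prefixing user_pseudo_id keeps
// the keys distinct.
function gaSessionKey(userPseudoId, gaSessionId) {
  return `${userPseudoId}.${gaSessionId}`;
}

const a = gaSessionKey('1234.5678', 1761900000); // user A
const b = gaSessionKey('9999.0001', 1761900000); // user B, same ga_session_id
console.log(a === b); // false — keys stay distinct per user
```
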

Default Channel Grouping

The session source model above gives you raw source and medium values. To match GA4’s default channel grouping, you need a CASE statement that maps source/medium combinations to channels. The ga4_dataform starter project includes a channel grouping function in its helpers — use it as a reference rather than writing your own from scratch.
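The mapping itself is a cascade of rules. Here is a heavily simplified sketch of that cascade as a plain function — the real GA4 definition has far more rules (paid social, shopping, video, and so on), and in Dataform the helper version would emit an equivalent SQL CASE expression at compile time rather than run at query time:

```javascript
// Simplified sketch of GA4-style default channel grouping.
// Covers only a handful of channels; GA4's real rule set is much longer.
function channelGroup(source, medium) {
  if (source === '(direct)' && (medium === '(none)' || medium === '(not set)')) return 'Direct';
  if (/^(google|bing|duckduckgo|yahoo|baidu)$/.test(source) && medium === 'organic') return 'Organic Search';
  if (medium === 'cpc' || medium === 'ppc' || medium === 'paid') return 'Paid Search';
  if (/^(facebook|instagram|twitter|linkedin|pinterest|tiktok)/.test(source)) return 'Organic Social';
  if (medium === 'email') return 'Email';
  if (medium === 'referral') return 'Referral';
  return 'Unassigned';
}

console.log(channelGroup('google', 'organic'));   // Organic Search
console.log(channelGroup('(direct)', '(none)')); // Direct
```
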


Core Transformation 3: Cross-Device User Identity Stitching

By default, GA4 identifies users via user_pseudo_id — a cookie-based, device-specific identifier. When a user logs in and you collect a user_id, you can stitch those identities together for cross-device analysis.

Create definitions/intermediate/user_mapping.sqlx:

config {
  type: "incremental",
  schema: "ga4_transformed",
  description: "Maps user_pseudo_id to user_id for cross-device stitching",
  bigquery: {
    partitionBy: "first_seen_date"
  },
  uniqueKey: ["user_pseudo_id", "user_id"]
}

SELECT
  user_pseudo_id,
  user_id,
  MIN(event_date) AS first_seen_date,
  MAX(event_date) AS last_seen_date,
  COUNT(*) AS event_count

FROM ${ref("stg_events")}
WHERE user_id IS NOT NULL
  AND user_pseudo_id IS NOT NULL
  ${when(incremental(),
    `AND event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL ${constants.INCREMENTAL_LOOKBACK_DAYS} DAY)`
  )}
GROUP BY 1, 2

Then in your sessions or events models, join against this mapping to create a stitched user identifier:

SELECT
  e.*,
  COALESCE(e.user_id, um.user_id, e.user_pseudo_id) AS stitched_user_id
FROM ${ref("stg_events")} e
LEFT JOIN (
  -- Keep one user_id per pseudo_id (the most recently seen) so the
  -- join cannot fan out events when a device maps to several logins
  SELECT user_pseudo_id, user_id
  FROM ${ref("user_mapping")}
  WHERE TRUE
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY user_pseudo_id ORDER BY last_seen_date DESC
  ) = 1
) um
  ON e.user_pseudo_id = um.user_pseudo_id

This is conceptually similar to GA4’s “Blended” identity space, but in BigQuery you have full control over the resolution logic. The backstitching propagates user_id to all events from the same user_pseudo_id, even those that occurred before login.

Caveat: stitching retroactively requires reprocessing historical data. On your first run, do a full refresh. After that, the incremental pattern handles new mappings as they arrive.


Incremental Loading Patterns

Full refreshes become expensive fast. A site with a year of GA4 data might have 50–100 GB of raw events. Reprocessing all of it daily wastes money and compute time.

The MERGE Pattern

Dataform’s incremental materialization uses a MERGE statement under the hood: new rows are inserted, existing rows (matching the uniqueKey) are updated. This is idempotent — running the same day’s pipeline twice updates rather than duplicates rows.

The key ingredients in your .sqlx config:

config {
  type: "incremental",
  uniqueKey: ["ga_session_key", "session_date"],
  bigquery: {
    partitionBy: "session_date"
  }
}
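Conceptually, an incremental run on the sessions model compiles to a statement of this shape (a simplified sketch — the real generated SQL enumerates every column, and fresh_sessions stands in for the compiled model query):

```sql
MERGE `ga4_transformed.sessions` AS target
USING (
  -- the model's SELECT, restricted by the incremental WHERE clause
  SELECT * FROM fresh_sessions
) AS source
ON target.ga_session_key = source.ga_session_key
   AND target.session_date = source.session_date
WHEN MATCHED THEN
  -- one assignment per column in the real statement
  UPDATE SET target.pageviews = source.pageviews
WHEN NOT MATCHED THEN
  INSERT ROW
```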

The ${when(incremental(), ...)} block adds a WHERE clause that filters to only recent data on subsequent runs:

${when(incremental(),
  `WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`
)}

Why a 3-Day Lookback?

GA4’s export is not instant. The daily export typically lands next morning, but late-arriving events can trickle in for 1–3 days. The intraday table (events_intraday_*) gets replaced by the final daily table when processing completes. A 3-day lookback window ensures you capture these late arrivals without reprocessing the entire history.

Cost Impact

Smart incremental strategies reduce BigQuery costs by 90%+ compared to daily full refreshes on large datasets. Instead of scanning 100 GB every day, you scan only the last 3 days — typically 1–3 GB. At $6.25/TB, that is the difference between $0.63/day and $0.02/day.
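The arithmetic, spelled out (assumes on-demand pricing and decimal TB):

```javascript
// Sketch of the lookback cost arithmetic at on-demand pricing
// ($6.25 per decimal TB — adjust if your rate differs).
const PRICE_PER_TB = 6.25;

const costPerDay = (gbScanned) => (gbScanned / 1000) * PRICE_PER_TB;

console.log(costPerDay(100).toFixed(2)); // "0.63" — daily full refresh, 100 GB
console.log(costPerDay(3).toFixed(2));   // "0.02" — 3-day incremental window
```
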


Data Quality Assertions

A pipeline that transforms incorrect data is worse than no pipeline at all — it gives you confidence in wrong numbers. Dataform assertions are SQL queries that return rows violating a condition. If any rows are returned, the assertion fails.

Built-in Assertion Types

Add assertions directly in your model’s config block:

config {
  type: "incremental",
  assertions: {
    nonNull: ["ga_session_key", "session_date", "user_pseudo_id"],
    uniqueKey: ["ga_session_key", "session_date"]
  }
}

Custom Assertions

For GA4-specific checks, create standalone assertion files. Here is definitions/assertions/assert_event_date_range.sqlx:

config {
  type: "assertion",
  description: "Catches events with dates outside the expected range"
}

SELECT *
FROM ${ref("stg_events")}
WHERE event_date > CURRENT_DATE()
  OR event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 400 DAY)

GA4 Assertions Worth Adding

  • session_id never null (for events that should have sessions)
  • No duplicate events (event_timestamp + user_pseudo_id + event_name + row_num should be unique)
  • event_date within range (catches export errors or schema drift)
  • page_location not null for page_view events (catches tracking breaks)
  • Row count bounds (if yesterday’s events are 80% below the 30-day average, something broke)
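The row-count bound from that list can be written as a standalone assertion. A sketch, assuming the stg_events model and a 20%-of-average threshold you would tune for your traffic:

```sql
config {
  type: "assertion",
  description: "Fails when yesterday's event volume is 80% below the 30-day average"
}

WITH daily AS (
  SELECT event_date, COUNT(*) AS events
  FROM ${ref("stg_events")}
  WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 31 DAY)
    AND event_date < CURRENT_DATE()
  GROUP BY 1
)

SELECT d.*
FROM daily d
WHERE d.event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND d.events < 0.2 * (
    SELECT AVG(events) FROM daily
    WHERE event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  )
```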

The dataform-ga4-tests repository provides a pre-built set of GA4-specific assertions you can install as a starting point.

Assertions run as part of every Dataform workflow execution. A failed assertion does not block the pipeline by default, but you can configure it to do so — useful for catching data drops before they propagate into reports.


Scheduling Your Pipeline

Cloud Scheduler Configuration

GA4’s daily export typically completes by 6–8 AM in your property’s timezone. Schedule your Dataform workflow to run after that window:

  1. In the BigQuery console, go to your Dataform repository.
  2. Click Release Configurations and create a new release (e.g., production).
  3. Click Workflow Configurations and create a new schedule:
    • Release: production
    • Frequency: 0 9 * * * (9 AM daily — adjust for your timezone and export timing)
    • Compilation overrides: Set your production dataset and project

The first run should be a Full Refresh to populate all tables from scratch. Subsequent scheduled runs execute incrementally, processing only the lookback window.

Handling Failures

Dataform sends execution results to Cloud Logging. Set up a log-based alert for failed workflow runs:

  1. Create a Cloud Monitoring alert policy
  2. Filter for Dataform execution failures
  3. Route notifications to email, Slack, or PagerDuty

For more sophisticated monitoring, feed the execution metadata into your anomaly detection system to catch gradual data quality degradation, not just hard failures.


Common Issues and Troubleshooting

"Table Not Found" Errors

Cause: The raw GA4 export table name does not match your Dataform configuration.

Fix: Verify that analytics_PROPERTY_ID in your constants.js matches your GA4 property ID exactly. A common mistake is using the Measurement ID (G-XXXXXXX) instead of the numeric property ID. Check in GA4 under Admin > Property Settings for the correct number.

Incremental Runs Producing No New Data

Cause: The _TABLE_SUFFIX filter is excluding recent partitions because the GA4 export has not landed yet when the pipeline runs.

Fix: Increase INCREMENTAL_LOOKBACK_DAYS from 3 to 5 if your export consistently arrives late. Alternatively, shift your schedule later — some properties do not finish exporting until 10–11 AM. You can check export timing in BigQuery by looking at _TABLE_SUFFIX values and their creation_time in INFORMATION_SCHEMA.TABLES.
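A query along these lines shows when each daily shard actually landed (project and dataset IDs are placeholders):

```sql
SELECT table_name, creation_time
FROM `your-gcp-project-id.analytics_123456789.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE 'events_2026%'
ORDER BY creation_time DESC
LIMIT 10
```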

Assertion Failures After Deployment

Cause: Row count assertions fail during legitimate traffic spikes or drops (Black Friday, site downtime, seasonal patterns).

Fix: Use percentage-based thresholds rather than absolute numbers. A “80% below the 30-day average” assertion is more resilient than “fewer than 10,000 events”. For expected anomalies (planned maintenance, holiday traffic), either adjust thresholds temporarily or add date-based exclusions to your assertion logic.

MERGE Statement Timeouts on Large Tables

Cause: The uniqueKey is not selective enough, causing BigQuery to compare millions of rows during the MERGE operation.

Fix: Ensure your partitionBy column aligns with the date filter in your incremental WHERE clause. The MERGE should only touch partitions within the lookback window. If you are partitioning by event_date and filtering by event_date >= DATE_SUB(...), BigQuery prunes partitions correctly. Problems arise when the partition column and filter column do not match.


Putting It All Together

Here is the complete pipeline, from raw data to analyst-friendly tables:

GA4 Property
    ↓ (BigQuery Export)
Raw Layer: analytics_PROPERTY_ID.events_*
    ↓ (Dataform)
Staging: ga4_transformed.stg_events

Intermediate: ga4_transformed.sessions
              ga4_transformed.user_mapping

Marts: ga4_transformed.daily_sessions_summary
       ga4_transformed.daily_pageviews
       ga4_transformed.conversion_funnel

Looker Studio / Sheets / downstream tools

Getting Started Checklist

  1. Enable the Dataform API and create a repository
  2. Set up your project structure (definitions/, includes/, workflow_settings.yaml)
  3. Start with the flattened events model (stg_events) — this alone delivers most of the value
  4. Add sessionization once the staging layer is stable
  5. Implement incremental loading (3-day lookback)
  6. Add assertions for critical data quality checks
  7. Schedule the workflow to run daily after the GA4 export lands
  8. Point all dashboards at the transformed tables, never at the raw export

Start simple. A single flattened events table with incremental loading is worth more than a complex pipeline that never ships. Add sessionization, user stitching, and marts as your needs grow.

Measuring Success

After deploying your pipeline, track:

  • Query cost reduction — Compare BigQuery bills before and after. Expect 80–95% reduction for dashboard queries.
  • Analyst productivity — How many minutes does a new query take? If the answer goes from “30 minutes of UNNEST wrestling” to “5 minutes on flat tables”, the pipeline is working.
  • Data freshness — Assertions passing daily means your data is both fresh and correct.

Pipeline ROI Example

Here is a realistic before/after for a medium-traffic site (500K events/day):

Before Dataform (raw table queries):

Source | Scanned/day | Monthly cost
100 analyst queries × 10 GB each | 1,000 GB | $187
5 Looker Studio dashboards × 20 GB each | 100 GB | $19
Total | | $206/month

After Dataform (transformed table queries):

Source | Scanned/day | Monthly cost
Same 100 queries × 0.5 GB each | 50 GB | $9
Same 5 dashboards × 0.1 GB each | 0.5 GB | $0.09
Dataform pipeline execution (3-day lookback) | 2 GB | $0.38
Total | | $9.47/month

Savings: ~$197/month (95% reduction). The pipeline pays for itself on day one — there is nothing to pay beyond BigQuery compute you are already spending.


Frequently Asked Questions

How much does Dataform cost for GA4 transformations?

Dataform itself is free — there is no licensing fee or per-seat charge. You pay only for BigQuery compute when the pipeline runs. For a medium-traffic site processing a 3-day incremental window daily, expect $0.01–0.10/day in pipeline execution costs. The net effect is a cost reduction because your downstream queries scan transformed tables instead of raw exports.

Should I use Dataform or dbt for GA4 BigQuery data?

Use Dataform if you are fully committed to BigQuery, want zero additional tooling cost, and prefer native Google Cloud integration. Use dbt if you need multi-warehouse support (Snowflake, Redshift, Databricks), already have an existing dbt investment, or want access to the larger community ecosystem with 5,000+ packages. Both tools produce equivalent results for GA4 transformations.

How long does it take to set up a Dataform GA4 pipeline?

Using the ga4_dataform starter project, you can have a working pipeline in under one hour. Building custom models from scratch — staging, sessionization, user stitching, assertions, and scheduling — takes 4–8 hours for a complete implementation. The staging layer alone (flattened events) takes about one hour and delivers most of the value.

What is the difference between incremental and full refresh in Dataform?

A full refresh drops and rebuilds the entire table from scratch, scanning all historical data. An incremental run processes only recent data (e.g., the last 3 days) and uses a MERGE statement to insert new rows and update existing ones. Incremental runs cost 90%+ less than full refreshes on large datasets. Use full refresh for your first run and when you change the model schema. Use incremental for daily scheduled runs.
