GA4 Data Quality Monitoring with BigQuery SQL
GA4 exports to BigQuery silently. It also breaks silently. There is no error email when the daily export stalls, no warning when a consent mode change drops your user_pseudo_id population by 30%, and no dashboard that tells you purchase events started duplicating after a GTM container update. You find out weeks later when a stakeholder questions a number, and by then the data is finalized and partially unrecoverable.
This article gives you the SQL and architecture to catch these problems early. The patterns here are borrowed from the data engineering discipline — freshness checks, volume anomaly detection, duplicate flagging, validity scoring — but translated to GA4’s actual BigQuery schema: events_YYYYMMDD tables, nested event_params arrays, user_pseudo_id, and all the quirks that come with them.
Why GA4 Data Quality Breaks Silently
Before building monitoring, it helps to understand the specific failure modes of the GA4-to-BigQuery pipeline.
Modeled data never reaches BigQuery. GA4’s modeled conversions and blended metrics exist only inside the GA4 UI. The BigQuery export contains only raw, unmodeled hits. If your team is making decisions off modeled numbers, there is a structural gap between what they see and what you can verify.
The free-tier 1M event cap causes silent pauses. Free GA4 properties are capped at 1 million events per day in the BigQuery export. When you exceed the cap, the export for that day is paused or truncated, with no error surfaced anywhere. Properties that occasionally spike — Black Friday, a viral post — are especially vulnerable.
72-hour finalization means intraday data is incomplete. The events_intraday_YYYYMMDD table is continuously updated but unstable. The events_YYYYMMDD finalized table typically appears within 24 hours of the session date, but GA4 can process late hits and update data for up to 72 hours. Querying yesterday’s finalized table today is usually safe; building dashboards on same-day data is not.
Consent mode strips user_pseudo_id. When a user declines consent, GA4 with Consent Mode v2 sends cookieless pings. Depending on your implementation, these events may arrive with a null or rotating user_pseudo_id. In EU-heavy markets, this can affect 20–40% of sessions. Funnel analysis and session stitching become unreliable unless you explicitly account for this population.
Schema changes arrive without warning. GA4 custom dimensions and event parameters can change anytime a developer pushes a new GTM container or updates client-side tracking. A parameter that was reliably present in event_params for six months can vanish or change data type overnight. There is no schema migration log in BigQuery.
A 2–5% daily discrepancy between GA4 UI numbers and BigQuery is normal and expected. In consent-heavy EU deployments, that gap commonly runs 20–40%. The goal of monitoring is not to eliminate discrepancy — it is to detect when something changed abnormally, so you can investigate before the data is weeks old.
Six Dimensions of GA4 Data Quality
The data engineering community has converged on a shared vocabulary for data quality. Tools like dbt, Great Expectations, and Soda Core all use variations of the same framework. Applied to GA4, the six dimensions are:
- Freshness — Does today’s export exist? Is yesterday’s finalized table present with a reasonable row count?
- Volume — Are event counts within normal range? Has daily volume dropped or spiked unexpectedly?
- Completeness — Are critical fields populated? Are purchase events arriving with transaction_id? Are page_view events arriving with page_location?
- Uniqueness — Are transaction IDs unique? Are there duplicate hits from double-firing tags?
- Validity — Are field values within expected ranges? Are engagement_time_msec values reasonable?
- Consistency — Are cross-field relationships coherent? Do all purchase events have a corresponding session_start within the same day?
The rest of this article works through each dimension with production-ready SQL.
Freshness: Detecting Incomplete Exports
Freshness is the cheapest check you can run. Query the dataset’s __TABLES__ metadata view to verify the expected daily table exists and exceeds a minimum row count.
-- Check if yesterday's finalized table exists and has sufficient rows
DECLARE min_rows INT64 DEFAULT 500000; -- adjust for your property size
SELECT
table_id,
row_count,
CASE
WHEN row_count IS NULL THEN 'MISSING'
WHEN row_count < min_rows THEN 'BELOW_THRESHOLD'
ELSE 'OK'
END AS freshness_status,
DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
WHERE table_id = CONCAT(
'events_',
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY))
);
If the query returns no rows, the table does not exist. If it returns a row with row_count below your threshold, the export ran but was likely truncated.
For intraday comparison, you can compare the current intraday table against the previous finalized day to get a directional completeness signal:
-- Compare today's intraday row count to yesterday's finalized count
WITH yesterday AS (
SELECT row_count AS finalized_rows
FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
WHERE table_id = CONCAT(
'events_',
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY))
)
),
today_intraday AS (
SELECT row_count AS intraday_rows
FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
WHERE table_id = CONCAT(
'events_intraday_',
FORMAT_DATE('%Y%m%d', CURRENT_DATE('UTC'))
)
)
SELECT
y.finalized_rows,
t.intraday_rows,
SAFE_DIVIDE(t.intraday_rows, y.finalized_rows) AS intraday_ratio
FROM yesterday y
CROSS JOIN today_intraday t;
For production dashboards, apply a 2-day lag. Use DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY) as your most recent reporting date. This gives finalization time to complete and avoids building reports on unstable intraday data.
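One lightweight way to enforce that lag is a reporting view that only exposes finalized dates. This is a sketch; the analytics_reporting dataset and view name are placeholders, not part of the GA4 export.
-- Sketch: reporting view pinned to finalized data only (2-day lag); names are illustrative
CREATE OR REPLACE VIEW `your-project.analytics_reporting.events_stable` AS
SELECT *
FROM `your-project.analytics_XXXXXXXXX.events_*`
-- the suffix filter also excludes events_intraday_* tables, whose suffix sorts after date strings
WHERE _TABLE_SUFFIX <= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 2 DAY));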
Volume Anomaly Detection
Freshness tells you the data exists. Volume monitoring tells you whether the data is plausible.
The standard approach is a rolling 28-day average with standard deviation bounds. Days falling outside 2 standard deviations are flagged.
-- 28-day rolling stats per event_name
WITH daily_counts AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS event_date,
event_name,
COUNT(*) AS event_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY 1, 2
),
rolling_stats AS (
SELECT
event_date,
event_name,
event_count,
AVG(event_count) OVER (
PARTITION BY event_name
ORDER BY event_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS rolling_avg,
STDDEV(event_count) OVER (
PARTITION BY event_name
ORDER BY event_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS rolling_stddev
FROM daily_counts
)
SELECT
event_date,
event_name,
event_count,
ROUND(rolling_avg, 0) AS rolling_avg,
ROUND(rolling_stddev, 0) AS rolling_stddev,
CASE
WHEN ABS(event_count - rolling_avg) > 2 * rolling_stddev
THEN 'ANOMALY'
ELSE 'NORMAL'
END AS volume_status
FROM rolling_stats
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND event_name IN ('purchase', 'page_view', 'session_start', 'add_to_cart')
ORDER BY event_name;
For day-over-day spike detection on critical events, a simpler ratio check is often more actionable:
-- Flag events with >50% day-over-day change
WITH daily AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS event_date,
event_name,
COUNT(*) AS cnt
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX IN (
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)),
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
)
GROUP BY 1, 2
)
SELECT
today.event_name,
today.cnt AS today_count,
yesterday.cnt AS yesterday_count,
SAFE_DIVIDE(today.cnt - yesterday.cnt, yesterday.cnt) AS pct_change,
CASE
WHEN ABS(SAFE_DIVIDE(today.cnt - yesterday.cnt, yesterday.cnt)) > 0.5
THEN 'SPIKE'
ELSE 'NORMAL'
END AS volume_flag
FROM daily today
JOIN daily yesterday
ON today.event_name = yesterday.event_name
AND today.event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND yesterday.event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY);
For properties with enough history and more complex seasonality, BigQuery ML’s ARIMA_PLUS model and ML.DETECT_ANOMALIES provide time-series-aware anomaly detection without requiring you to manage window logic manually. See the BigQuery ML.DETECT_ANOMALIES documentation for implementation details. The SQL approach above is sufficient for most properties and has zero ML infrastructure overhead.
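For orientation, a hedged sketch of that BigQuery ML route, assuming a dedicated analytics_monitoring dataset and an illustrative 90-day training window (the model name, dataset, and window are placeholders, not prescribed by the GA4 export):
-- Sketch: train an ARIMA_PLUS model on daily page_view volume
CREATE OR REPLACE MODEL `your-project.analytics_monitoring.page_view_volume_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'event_date',
  time_series_data_col = 'event_count'
) AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  COUNT(*) AS event_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
  FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND event_name = 'page_view'
GROUP BY 1;

-- Flag days the model considers anomalous at a 99% probability threshold
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `your-project.analytics_monitoring.page_view_volume_model`,
  STRUCT(0.99 AS anomaly_prob_threshold)
);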
Duplicate Detection
GTM misfires, SPA navigation quirks, and server-side double-sends all produce duplicate events. For purchase events, duplicates directly corrupt revenue reporting.
-- Find duplicate transaction_ids in purchase events
WITH purchase_params AS (
SELECT
event_date,
event_timestamp,
user_pseudo_id,
(
SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'transaction_id'
) AS transaction_id
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'purchase'
)
SELECT
transaction_id,
COUNT(*) AS hit_count,
COUNT(DISTINCT user_pseudo_id) AS distinct_users,
MIN(event_timestamp) AS first_seen,
MAX(event_timestamp) AS last_seen
FROM purchase_params
WHERE transaction_id IS NOT NULL
GROUP BY transaction_id
HAVING COUNT(*) > 1
ORDER BY hit_count DESC;
When you need deduplicated purchase data for downstream reporting, use ROW_NUMBER():
-- Deduplicate purchase events, keep earliest hit per transaction_id
WITH purchase_params AS (
SELECT
*,
(
SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'transaction_id'
) AS transaction_id,
ROW_NUMBER() OVER (
PARTITION BY (
SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'transaction_id'
)
ORDER BY event_timestamp ASC
) AS rn
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'purchase'
)
SELECT * EXCEPT(rn)
FROM purchase_params
WHERE rn = 1
AND transaction_id IS NOT NULL;
Session-level duplicates — multiple identical events firing with the same event_timestamp for the same user — can be detected by grouping on (user_pseudo_id, event_name, event_timestamp) and counting groups with more than one row. These often indicate a double-render in a SPA framework or a misconfigured trigger.
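A minimal version of that check, run against yesterday’s finalized table:
-- Flag identical events repeated at the same timestamp for the same user
SELECT
  user_pseudo_id,
  event_name,
  event_timestamp,
  COUNT(*) AS hit_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY user_pseudo_id, event_name, event_timestamp
HAVING COUNT(*) > 1
ORDER BY hit_count DESC;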
Bot Traffic Detection
GA4 does some bot filtering by default, but it is not exhaustive. Automated crawlers, performance testing tools, and competitor scrapers routinely appear in raw BigQuery data.
A scoring approach is more robust than any single threshold:
-- Bot scoring per session
WITH session_stats AS (
SELECT
user_pseudo_id,
(
SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'ga_session_id'
) AS session_id,
COUNT(*) AS events_in_session,
SUM(
CASE WHEN event_name = 'user_engagement' THEN
COALESCE((
SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'engagement_time_msec'
), 0)
ELSE 0 END
) AS total_engagement_ms,
MIN(event_timestamp) AS session_start_ts,
MAX(event_timestamp) AS session_end_ts
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY user_pseudo_id, session_id
),
user_daily_sessions AS (
SELECT
user_pseudo_id,
COUNT(DISTINCT session_id) AS sessions_today
FROM session_stats
GROUP BY user_pseudo_id
)
SELECT *
FROM (
  SELECT
    s.user_pseudo_id,
    s.session_id,
    s.events_in_session,
    s.total_engagement_ms,
    u.sessions_today,
    (s.session_end_ts - s.session_start_ts) / 1000000.0 AS session_duration_sec,
    (
      CASE WHEN s.total_engagement_ms = 0 THEN 1 ELSE 0 END
      + CASE WHEN s.events_in_session > 100
             AND (s.session_end_ts - s.session_start_ts) < 60000000
             THEN 1 ELSE 0 END
      + CASE WHEN u.sessions_today > 50 THEN 1 ELSE 0 END
    ) AS bot_score
  FROM session_stats s
  JOIN user_daily_sessions u USING (user_pseudo_id)
)
-- bot_score is filtered in an outer query; the alias is not visible to WHERE/HAVING at the inner level
WHERE bot_score >= 2
ORDER BY bot_score DESC, events_in_session DESC;
A bot_score of 2 or 3 is a strong signal. A score of 1 warrants investigation but is not conclusive on its own. For properties with significant bot traffic, a BigQuery ML classification model trained on labeled sessions can replace the hand-tuned thresholds and adapt over time, as sketched below.
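A rough sketch of that direction, assuming you have materialized the session features above into a session_features table and hand-labeled a labeled_sessions table with an is_bot column (both table names are hypothetical):
-- Sketch: train a logistic regression bot classifier in BigQuery ML
CREATE OR REPLACE MODEL `your-project.analytics_monitoring.bot_classifier`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['is_bot']
) AS
SELECT
  f.events_in_session,
  f.total_engagement_ms,
  f.sessions_today,
  l.is_bot
FROM `your-project.analytics_monitoring.session_features` f   -- hypothetical feature table
JOIN `your-project.analytics_monitoring.labeled_sessions` l   -- hypothetical hand-labeled table
  USING (user_pseudo_id, session_id);

-- Score new sessions instead of applying fixed thresholds
SELECT *
FROM ML.PREDICT(
  MODEL `your-project.analytics_monitoring.bot_classifier`,
  TABLE `your-project.analytics_monitoring.session_features`
);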
Consent Mode Impact Assessment
Consent mode changes the composition of your tracked population, not just the total count. Events from non-consenting users may arrive with rotating or null user_pseudo_id values, which breaks session stitching, user-level metrics, and funnel analysis.
This query quantifies the consent-affected population by detecting sessions where user_pseudo_id changes across events that should belong to the same logical session:
-- Estimate consent-impacted sessions
WITH session_uid_counts AS (
SELECT
(
SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'ga_session_id'
) AS session_id,
COUNT(DISTINCT user_pseudo_id) AS distinct_uids_in_session,
COUNT(*) AS event_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_pseudo_id IS NOT NULL
GROUP BY session_id
)
SELECT
COUNTIF(distinct_uids_in_session = 1) AS stable_sessions,
COUNTIF(distinct_uids_in_session > 1) AS unstable_sessions,
COUNT(*) AS total_sessions,
ROUND(
SAFE_DIVIDE(
COUNTIF(distinct_uids_in_session > 1),
COUNT(*)
) * 100,
2
) AS pct_consent_impacted
FROM session_uid_counts;
To understand the bias introduced by unconsented traffic, segment key conversion metrics by consent signal availability:
-- Segment purchase conversion by user_pseudo_id stability proxy
WITH user_sessions AS (
SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
MAX(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS converted
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY user_pseudo_id, session_id
)
SELECT
CASE
WHEN user_pseudo_id IS NULL THEN 'no_user_id'
WHEN LENGTH(user_pseudo_id) < 10 THEN 'short_uid_possible_consent_mode'
ELSE 'standard_uid'
END AS uid_type,
COUNT(*) AS sessions,
SUM(converted) AS conversions,
ROUND(SAFE_DIVIDE(SUM(converted), COUNT(*)) * 100, 3) AS conversion_rate_pct
FROM user_sessions
GROUP BY uid_type;
Automated Monitoring with Scheduled Queries
Individual SQL checks are only useful if they run automatically and alert someone when they fail. The simplest viable architecture uses BigQuery Scheduled Queries and Cloud Monitoring.
Architecture:
- A scheduled query runs daily at 08:00 UTC.
- The query writes any detected failures as rows into a data_quality_alerts table.
- A Cloud Monitoring alert fires when the alerts table row count exceeds zero for the current date.
- Cloud Monitoring routes the alert to Slack, email, or PagerDuty via a notification channel.
The data_quality_alerts table schema:
CREATE TABLE IF NOT EXISTS `your-project.analytics_monitoring.data_quality_alerts` (
check_name STRING,
severity STRING, -- 'CRITICAL', 'WARNING', 'INFO'
details STRING,
event_date DATE,
detected_at TIMESTAMP
);
The scheduled query inserts rows only when a check fails. See the BigQuery Scheduled Queries alerting documentation for the full Cloud Monitoring integration setup.
For more complex routing — different Slack channels by severity, PagerDuty only for CRITICAL, suppression windows during known maintenance — replace the Cloud Monitoring alert with a Cloud Function triggered via Pub/Sub. The scheduled query publishes a message on completion; the Cloud Function reads severity and routes accordingly.
Dataplex Auto Data Quality
Dataplex Universal Catalog can profile BigQuery tables automatically and run declarative data quality rules without writing SQL checks by hand. It detects null rates, value distributions, statistical outliers, and freshness at the table level.
For GA4 you can define rules like:
- transaction_id must not be null for rows where event_name = 'purchase'
- event_date must be within the last 3 days
- event_timestamp must be greater than zero
The main advantage is a non-technical UI that analytics stakeholders without strong SQL skills can navigate. Rules are versioned and auditable.
The limitation in GA4 contexts is the nested schema. Dataplex rules operate on table columns. Extracting transaction_id from inside event_params ARRAY<STRUCT<...>> requires either a pre-flattened view or a custom SQL rule, which largely negates the no-SQL advantage. For monitoring top-level fields — event_date, event_name, user_pseudo_id null rates — Dataplex works well out of the box. For parameter-level validation, custom SQL is more practical.
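A sketch of such a pre-flattened view (the view name is illustrative), exposing transaction_id and ga_session_id as plain columns that Dataplex rules can target directly:
-- Sketch: flatten purchase parameters into top-level columns for rule-based tools
CREATE OR REPLACE VIEW `your-project.analytics_monitoring.purchases_flat` AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  event_timestamp,
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id') AS transaction_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE event_name = 'purchase';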
Reference: Dataplex Auto Data Quality overview.
Enterprise Observability Tools vs DIY
There is no shortage of tools in this space. A three-tier framing is the most honest:
Tier 1 — DIY SQL + Cloud Monitoring (free)
What this article describes. Covers freshness, volume, completeness, uniqueness, and validity with standard BigQuery SQL and scheduled queries. Requires SQL knowledge to build and maintain. No licensing cost beyond query processing. Right starting point for most teams.
Tier 2 — dbt tests / Great Expectations / Soda Core
Open-source frameworks that bring test infrastructure to SQL: schema tests, custom assertions, test result storage, CI integration. dbt’s built-in tests (not_null, unique, accepted_values) cover most GA4 quality dimensions when applied to flattened models. Appropriate when the warehouse has grown beyond GA4 and quality monitoring needs to be consistent across multiple sources.
Tier 3 — Monte Carlo, Bigeye, Metaplane
Commercial data observability platforms. Monte Carlo uses ML-based anomaly detection on your warehouse metadata without requiring you to write any checks. Pricing typically starts at $30–50K/year. Justified for large data teams where the cost of undetected data issues exceeds the platform cost. Not justified for a team whose primary analytical source is GA4.
The sweet spot for most analytics teams: Tier 1 for GA4 monitoring, graduating to Tier 2 as the warehouse grows and dbt models proliferate.
Starter Monitoring Kit
Here is a composite query that runs all core checks and writes failures to the alerts table in a single scheduled query execution. Run this daily at 08:00 UTC.
-- Composite GA4 data quality check — daily scheduled query
-- Writes to analytics_monitoring.data_quality_alerts only on failure
DECLARE yesterday DATE DEFAULT DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY);
DECLARE yesterday_suffix STRING DEFAULT FORMAT_DATE('%Y%m%d', yesterday);
INSERT INTO `your-project.analytics_monitoring.data_quality_alerts`
(check_name, severity, details, event_date, detected_at)
-- CHECK 1: Freshness
SELECT
'freshness_daily_table' AS check_name,
'CRITICAL' AS severity,
CONCAT('events_', yesterday_suffix, ' has ', CAST(COALESCE(row_count, 0) AS STRING), ' rows') AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
  -- MAX() over zero matching rows returns a single NULL row, so a missing table still triggers the alert
  SELECT MAX(row_count) AS row_count
  FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
  WHERE table_id = CONCAT('events_', yesterday_suffix)
) t
WHERE COALESCE(t.row_count, 0) < 500000
UNION ALL
-- CHECK 2: Volume anomaly on page_view
SELECT
'volume_page_view_anomaly' AS check_name,
'WARNING' AS severity,
CONCAT(
'page_view count ', CAST(today_count AS STRING),
' vs 28-day avg ', CAST(ROUND(rolling_avg, 0) AS STRING)
) AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
WITH daily AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS dt,
COUNT(*) AS cnt
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(yesterday, INTERVAL 28 DAY))
AND yesterday_suffix
AND event_name = 'page_view'
GROUP BY dt
),
stats AS (
SELECT
dt,
cnt,
AVG(cnt) OVER (ORDER BY dt ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS rolling_avg,
STDDEV(cnt) OVER (ORDER BY dt ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS rolling_std
FROM daily
)
SELECT cnt AS today_count, rolling_avg, rolling_std
FROM stats
WHERE dt = yesterday
AND ABS(cnt - rolling_avg) > 2 * rolling_std
)
UNION ALL
-- CHECK 3: Purchase events missing transaction_id
SELECT
'completeness_purchase_transaction_id' AS check_name,
'CRITICAL' AS severity,
CONCAT(
CAST(null_count AS STRING), ' purchase events missing transaction_id of ',
CAST(total_count AS STRING), ' total'
) AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
SELECT
COUNTIF(transaction_id IS NULL) AS null_count,
COUNT(*) AS total_count
FROM (
SELECT
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id') AS transaction_id
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = yesterday_suffix
AND event_name = 'purchase'
)
)
WHERE null_count > 0
UNION ALL
-- CHECK 4: Duplicate transaction_ids
SELECT
'uniqueness_duplicate_transactions' AS check_name,
'CRITICAL' AS severity,
CONCAT(CAST(dup_count AS STRING), ' duplicate transaction_ids detected') AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
  -- count how many distinct transaction_ids appear more than once
  SELECT COUNT(*) AS dup_count
  FROM (
    SELECT transaction_id
    FROM (
      SELECT
        (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id') AS transaction_id
      FROM `your-project.analytics_XXXXXXXXX.events_*`
      WHERE _TABLE_SUFFIX = yesterday_suffix
        AND event_name = 'purchase'
    )
    WHERE transaction_id IS NOT NULL
    GROUP BY transaction_id
    HAVING COUNT(*) > 1
  )
)
WHERE dup_count > 0
UNION ALL
-- CHECK 5: Bot traffic rate above 1%
SELECT
'validity_bot_traffic_rate' AS check_name,
'WARNING' AS severity,
CONCAT(CAST(ROUND(bot_pct * 100, 2) AS STRING), '% of sessions flagged with bot_score >= 2') AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
WITH session_stats AS (
SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
COUNT(*) AS event_count,
MIN(event_timestamp) AS ts_start,
MAX(event_timestamp) AS ts_end,
SUM(CASE WHEN event_name = 'user_engagement' THEN
COALESCE((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec'), 0)
ELSE 0 END) AS total_engagement_ms
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = yesterday_suffix
GROUP BY user_pseudo_id, session_id
),
user_sessions AS (
SELECT user_pseudo_id, COUNT(*) AS daily_sessions
FROM session_stats
GROUP BY user_pseudo_id
),
scored AS (
SELECT
(
CASE WHEN total_engagement_ms = 0 THEN 1 ELSE 0 END
+ CASE WHEN event_count > 100 AND (ts_end - ts_start) < 60000000 THEN 1 ELSE 0 END
+ CASE WHEN u.daily_sessions > 50 THEN 1 ELSE 0 END
) AS bot_score
FROM session_stats s
JOIN user_sessions u USING (user_pseudo_id)
)
SELECT SAFE_DIVIDE(COUNTIF(bot_score >= 2), COUNT(*)) AS bot_pct
FROM scored
)
WHERE bot_pct > 0.01;
Deployment checklist:
- Create the data_quality_alerts table using the DDL above.
- Create a BigQuery Scheduled Query running the composite check at 08:00 UTC daily.
- Create a Cloud Monitoring alert triggered when the insert row count is greater than zero.
- Wire the alert to your notification channel (email, Slack webhook, PagerDuty).
That is one scheduled query, one alert, one notification channel. It covers the most critical failure modes for most GA4 properties.
Data quality monitoring is not a one-time setup. Your property will grow, custom events will change, consent mode configurations will be updated, and new failure modes will emerge that none of these checks anticipated. Review the alerts table weekly, add a new check whenever you discover an issue that would have been caught earlier, and remove checks that have fired zero times in six months. A monitoring system that generates noise is worse than no monitoring at all.
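A small rollup over the alerts table makes that weekly review concrete, for example:
-- Alert frequency by check over the last 180 days: spot noisy checks and checks that never fire
SELECT
  check_name,
  severity,
  COUNT(*) AS times_fired,
  MAX(event_date) AS last_fired
FROM `your-project.analytics_monitoring.data_quality_alerts`
WHERE detected_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY)
GROUP BY check_name, severity
ORDER BY times_fired DESC;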
Related Resources
- BigQuery Scheduled Queries for GA4 Automated Reporting — Automate the transformation layer that feeds your monitoring tables
- GA4 Data Retention Workarounds with BigQuery — Keep raw event data beyond GA4’s 14-month limit
- Dataform for GA4: Build Your First BigQuery Pipeline — Add dependency management and data quality assertions to your pipeline