GA4 Data Quality Monitoring with BigQuery SQL
GA4 exports to BigQuery silently. It also breaks silently. There is no error email when the daily export stalls, no warning when a consent mode change drops your user_pseudo_id population by 30%, and no dashboard that tells you purchase events started duplicating after a GTM container update. You find out weeks later when a stakeholder questions a number, and by then the data is finalized and partially unrecoverable.
This article gives you the SQL and architecture to catch these problems early. The patterns here are borrowed from the data engineering discipline — freshness checks, volume anomaly detection, duplicate flagging, validity scoring — but translated to GA4’s actual BigQuery schema: events_YYYYMMDD tables, nested event_params arrays, user_pseudo_id, and all the quirks that come with them.
Why GA4 Data Quality Breaks Silently
Before building monitoring, it helps to understand the specific failure modes of the GA4-to-BigQuery pipeline.
Modeled data never reaches BigQuery. GA4’s modeled conversions and blended metrics exist only inside the GA4 UI. The BigQuery export contains only raw, unmodeled hits. If your team is making decisions off modeled numbers, there is a structural gap between what they see and what you can verify.
The free-tier 1M event cap causes silent pauses. Free GA4 properties are capped at 1 million events per day in the BigQuery export. When you exceed the cap, the export for that day is paused or truncated, with no error surfaced anywhere. Properties that occasionally spike — Black Friday, a viral post — are especially vulnerable.
72-hour finalization means intraday data is incomplete. The events_intraday_YYYYMMDD table is continuously updated but unstable. The events_YYYYMMDD finalized table typically appears within 24 hours of the session date, but GA4 can process late hits and update data for up to 72 hours. Querying yesterday’s finalized table today is usually safe; building dashboards on same-day data is not.
Consent mode strips user_pseudo_id. When a user declines consent, GA4 with Consent Mode v2 sends cookieless pings. Depending on your implementation, these events may arrive with a null or rotating user_pseudo_id. In EU-heavy markets, this can affect 20–40% of sessions. Funnel analysis and session stitching become unreliable unless you explicitly account for this population.
Schema changes arrive without warning. GA4 custom dimensions and event parameters can change anytime a developer pushes a new GTM container or updates client-side tracking. A parameter that was reliably present in event_params for six months can vanish or change data type overnight. There is no schema migration log in BigQuery.
A 2–5% daily discrepancy between GA4 UI numbers and BigQuery is normal and expected. In consent-heavy EU deployments, that gap commonly runs 20–40%. The goal of monitoring is not to eliminate discrepancy — it is to detect when something changed abnormally, so you can investigate before the data is weeks old.
Six Dimensions of GA4 Data Quality
The data engineering community has converged on a shared vocabulary for data quality. Tools like dbt, Great Expectations, and Soda Core all use variations of the same framework. Applied to GA4, the six dimensions are:
- Freshness — Does today’s export exist? Is yesterday’s finalized table present with a reasonable row count?
- Volume — Are event counts within normal range? Has daily volume dropped or spiked unexpectedly?
- Completeness — Are critical fields populated? Are purchase events arriving with transaction_id? Are page_view events arriving with page_location?
- Uniqueness — Are transaction IDs unique? Are there duplicate hits from double-firing tags?
- Validity — Are field values within expected ranges? Are engagement_time_msec values reasonable?
- Consistency — Are cross-field relationships coherent? Do all purchase events have a corresponding session_start within the same day?
The rest of this article works through each dimension with production-ready SQL.
Freshness: Detecting Incomplete Exports
Freshness is the cheapest check you can run. Query the dataset’s __TABLES__ metadata view to verify the expected daily table exists and exceeds a minimum row count.
-- Check if yesterday's finalized table exists and has sufficient rows
DECLARE min_rows INT64 DEFAULT 500000; -- adjust for your property size
SELECT
table_id,
row_count,
CASE
WHEN row_count IS NULL THEN 'MISSING'
WHEN row_count < min_rows THEN 'BELOW_THRESHOLD'
ELSE 'OK'
END AS freshness_status,
DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
WHERE table_id = CONCAT(
'events_',
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY))
);
If the query returns no rows, the table does not exist. If it returns a row with row_count below your threshold, the export ran but was likely truncated.
For intraday comparison, you can compare the current intraday table against the previous finalized day to get a directional completeness signal:
-- Compare today's intraday row count to yesterday's finalized count
WITH yesterday AS (
SELECT row_count AS finalized_rows
FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
WHERE table_id = CONCAT(
'events_',
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY))
)
),
today_intraday AS (
SELECT row_count AS intraday_rows
FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
WHERE table_id = CONCAT(
'events_intraday_',
FORMAT_DATE('%Y%m%d', CURRENT_DATE('UTC'))
)
)
SELECT
y.finalized_rows,
t.intraday_rows,
SAFE_DIVIDE(t.intraday_rows, y.finalized_rows) AS intraday_ratio
FROM yesterday y
CROSS JOIN today_intraday t;
For production dashboards, apply a 2-day lag. Use DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY) as your most recent reporting date. This gives finalization time to complete and avoids building reports on unstable intraday data.
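One lightweight way to enforce that lag is a reporting view that only exposes finalized dates. This is a sketch; the analytics_reporting dataset and view name are placeholders, not part of the GA4 export.
-- Sketch: reporting view pinned to finalized data only (2-day lag); names are illustrative
CREATE OR REPLACE VIEW `your-project.analytics_reporting.events_stable` AS
SELECT *
FROM `your-project.analytics_XXXXXXXXX.events_*`
-- the suffix filter also excludes events_intraday_* tables, whose suffix sorts after date strings
WHERE _TABLE_SUFFIX <= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 2 DAY));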
Volume Anomaly Detection
Freshness tells you the data exists. Volume monitoring tells you whether the data is plausible.
The standard approach is a rolling 28-day average with standard deviation bounds. Days falling outside 2 standard deviations are flagged.
-- 28-day rolling stats per event_name
WITH daily_counts AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS event_date,
event_name,
COUNT(*) AS event_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY 1, 2
),
rolling_stats AS (
SELECT
event_date,
event_name,
event_count,
AVG(event_count) OVER (
PARTITION BY event_name
ORDER BY event_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS rolling_avg,
STDDEV(event_count) OVER (
PARTITION BY event_name
ORDER BY event_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS rolling_stddev
FROM daily_counts
)
SELECT
event_date,
event_name,
event_count,
ROUND(rolling_avg, 0) AS rolling_avg,
ROUND(rolling_stddev, 0) AS rolling_stddev,
CASE
WHEN ABS(event_count - rolling_avg) > 2 * rolling_stddev
THEN 'ANOMALY'
ELSE 'NORMAL'
END AS volume_status
FROM rolling_stats
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND event_name IN ('purchase', 'page_view', 'session_start', 'add_to_cart')
ORDER BY event_name;
For day-over-day spike detection on critical events, a simpler ratio check is often more actionable:
-- Flag events with >50% day-over-day change
WITH daily AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS event_date,
event_name,
COUNT(*) AS cnt
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX IN (
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)),
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
)
GROUP BY 1, 2
)
SELECT
today.event_name,
today.cnt AS today_count,
yesterday.cnt AS yesterday_count,
SAFE_DIVIDE(today.cnt - yesterday.cnt, yesterday.cnt) AS pct_change,
CASE
WHEN ABS(SAFE_DIVIDE(today.cnt - yesterday.cnt, yesterday.cnt)) > 0.5
THEN 'SPIKE'
ELSE 'NORMAL'
END AS volume_flag
FROM daily today
JOIN daily yesterday
ON today.event_name = yesterday.event_name
AND today.event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND yesterday.event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY);
For properties with enough history and more complex seasonality, BigQuery ML’s ARIMA_PLUS model and ML.DETECT_ANOMALIES provide time-series-aware anomaly detection without requiring you to manage window logic manually. See the BigQuery ML.DETECT_ANOMALIES documentation for implementation details. The SQL approach above is sufficient for most properties and has zero ML infrastructure overhead.
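For orientation, a hedged sketch of that BigQuery ML route, assuming a dedicated analytics_monitoring dataset and an illustrative 90-day training window (the model name, dataset, and window are placeholders, not prescribed by the GA4 export):
-- Sketch: train an ARIMA_PLUS model on daily page_view volume
CREATE OR REPLACE MODEL `your-project.analytics_monitoring.page_view_volume_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'event_date',
  time_series_data_col = 'event_count'
) AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  COUNT(*) AS event_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
  FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND event_name = 'page_view'
GROUP BY 1;

-- Flag days the model considers anomalous at a 99% probability threshold
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `your-project.analytics_monitoring.page_view_volume_model`,
  STRUCT(0.99 AS anomaly_prob_threshold)
);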
Duplicate Detection
GTM misfires, SPA navigation quirks, and server-side double-sends all produce duplicate events. For purchase events, duplicates directly corrupt revenue reporting.
-- Find duplicate transaction_ids in purchase events
WITH purchase_params AS (
SELECT
event_date,
event_timestamp,
user_pseudo_id,
(
SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'transaction_id'
) AS transaction_id
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'purchase'
)
SELECT
transaction_id,
COUNT(*) AS hit_count,
COUNT(DISTINCT user_pseudo_id) AS distinct_users,
MIN(event_timestamp) AS first_seen,
MAX(event_timestamp) AS last_seen
FROM purchase_params
WHERE transaction_id IS NOT NULL
GROUP BY transaction_id
HAVING COUNT(*) > 1
ORDER BY hit_count DESC;
When you need deduplicated purchase data for downstream reporting, use ROW_NUMBER():
-- Deduplicate purchase events, keep earliest hit per transaction_id
WITH purchase_params AS (
SELECT
*,
(
SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'transaction_id'
) AS transaction_id,
ROW_NUMBER() OVER (
PARTITION BY (
SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'transaction_id'
)
ORDER BY event_timestamp ASC
) AS rn
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'purchase'
)
SELECT * EXCEPT(rn)
FROM purchase_params
WHERE rn = 1
AND transaction_id IS NOT NULL;
Session-level duplicates — multiple identical events firing with the same event_timestamp for the same user — can be detected by grouping on (user_pseudo_id, event_name, event_timestamp) and counting groups with more than one row. These often indicate a double-render in a SPA framework or a misconfigured trigger.
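A minimal version of that check, run against yesterday’s finalized table:
-- Flag identical events repeated at the same timestamp for the same user
SELECT
  user_pseudo_id,
  event_name,
  event_timestamp,
  COUNT(*) AS hit_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY user_pseudo_id, event_name, event_timestamp
HAVING COUNT(*) > 1
ORDER BY hit_count DESC;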
Bot Traffic Detection
GA4 does some bot filtering by default, but it is not exhaustive. Automated crawlers, performance testing tools, and competitor scrapers routinely appear in raw BigQuery data.
A scoring approach is more robust than any single threshold:
-- Bot scoring per session
WITH session_stats AS (
SELECT
user_pseudo_id,
(
SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'ga_session_id'
) AS session_id,
COUNT(*) AS events_in_session,
SUM(
CASE WHEN event_name = 'user_engagement' THEN
COALESCE((
SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'engagement_time_msec'
), 0)
ELSE 0 END
) AS total_engagement_ms,
MIN(event_timestamp) AS session_start_ts,
MAX(event_timestamp) AS session_end_ts
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY user_pseudo_id, session_id
),
user_daily_sessions AS (
SELECT
user_pseudo_id,
COUNT(DISTINCT session_id) AS sessions_today
FROM session_stats
GROUP BY user_pseudo_id
)
SELECT *
FROM (
  SELECT
    s.user_pseudo_id,
    s.session_id,
    s.events_in_session,
    s.total_engagement_ms,
    u.sessions_today,
    (s.session_end_ts - s.session_start_ts) / 1000000.0 AS session_duration_sec,
    (
      CASE WHEN s.total_engagement_ms = 0 THEN 1 ELSE 0 END
      + CASE WHEN s.events_in_session > 100
             AND (s.session_end_ts - s.session_start_ts) < 60000000
             THEN 1 ELSE 0 END
      + CASE WHEN u.sessions_today > 50 THEN 1 ELSE 0 END
    ) AS bot_score
  FROM session_stats s
  JOIN user_daily_sessions u USING (user_pseudo_id)
)
-- bot_score is filtered in an outer query; the alias is not visible to WHERE/HAVING at the inner level
WHERE bot_score >= 2
ORDER BY bot_score DESC, events_in_session DESC;
A bot_score of 2 or 3 is a strong signal. A score of 1 warrants investigation but is not conclusive on its own. For properties with significant bot traffic, a BigQuery ML classification model trained on labeled sessions can replace the hand-tuned thresholds and adapt over time, as sketched below.
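A rough sketch of that direction, assuming you have materialized the session features above into a session_features table and hand-labeled a labeled_sessions table with an is_bot column (both table names are hypothetical):
-- Sketch: train a logistic regression bot classifier in BigQuery ML
CREATE OR REPLACE MODEL `your-project.analytics_monitoring.bot_classifier`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['is_bot']
) AS
SELECT
  f.events_in_session,
  f.total_engagement_ms,
  f.sessions_today,
  l.is_bot
FROM `your-project.analytics_monitoring.session_features` f   -- hypothetical feature table
JOIN `your-project.analytics_monitoring.labeled_sessions` l   -- hypothetical hand-labeled table
  USING (user_pseudo_id, session_id);

-- Score new sessions instead of applying fixed thresholds
SELECT *
FROM ML.PREDICT(
  MODEL `your-project.analytics_monitoring.bot_classifier`,
  TABLE `your-project.analytics_monitoring.session_features`
);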
Consent Mode Impact Assessment
Consent mode changes the composition of your tracked population, not just the total count. Events from non-consenting users may arrive with rotating or null user_pseudo_id values, which breaks session stitching, user-level metrics, and funnel analysis.
This query quantifies the consent-affected population by detecting sessions where user_pseudo_id changes across events that should belong to the same logical session:
-- Estimate consent-impacted sessions
WITH session_uid_counts AS (
SELECT
(
SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'ga_session_id'
) AS session_id,
COUNT(DISTINCT user_pseudo_id) AS distinct_uids_in_session,
COUNT(*) AS event_count
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_pseudo_id IS NOT NULL
GROUP BY session_id
)
SELECT
COUNTIF(distinct_uids_in_session = 1) AS stable_sessions,
COUNTIF(distinct_uids_in_session > 1) AS unstable_sessions,
COUNT(*) AS total_sessions,
ROUND(
SAFE_DIVIDE(
COUNTIF(distinct_uids_in_session > 1),
COUNT(*)
) * 100,
2
) AS pct_consent_impacted
FROM session_uid_counts;
To understand the bias introduced by unconsented traffic, segment key conversion metrics by consent signal availability:
-- Segment purchase conversion by user_pseudo_id stability proxy
WITH user_sessions AS (
SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
MAX(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS converted
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY user_pseudo_id, session_id
)
SELECT
CASE
WHEN user_pseudo_id IS NULL THEN 'no_user_id'
WHEN LENGTH(user_pseudo_id) < 10 THEN 'short_uid_possible_consent_mode'
ELSE 'standard_uid'
END AS uid_type,
COUNT(*) AS sessions,
SUM(converted) AS conversions,
ROUND(SAFE_DIVIDE(SUM(converted), COUNT(*)) * 100, 3) AS conversion_rate_pct
FROM user_sessions
GROUP BY uid_type;
Automated Monitoring with Scheduled Queries
Individual SQL checks are only useful if they run automatically and alert someone when they fail. The simplest viable architecture uses BigQuery Scheduled Queries and Cloud Monitoring.
Architecture:
- A scheduled query runs daily at 08:00 UTC.
- The query writes any detected failures as rows into a data_quality_alerts table.
- A Cloud Monitoring alert fires when the alerts table row count exceeds zero for the current date.
- Cloud Monitoring routes the alert to Slack, email, or PagerDuty via a notification channel.
The data_quality_alerts table schema:
CREATE TABLE IF NOT EXISTS `your-project.analytics_monitoring.data_quality_alerts` (
check_name STRING,
severity STRING, -- 'CRITICAL', 'WARNING', 'INFO'
details STRING,
event_date DATE,
detected_at TIMESTAMP
);
The scheduled query inserts rows only when a check fails. See the BigQuery Scheduled Queries alerting documentation for the full Cloud Monitoring integration setup.
For more complex routing — different Slack channels by severity, PagerDuty only for CRITICAL, suppression windows during known maintenance — replace the Cloud Monitoring alert with a Cloud Function triggered via Pub/Sub. The scheduled query publishes a message on completion; the Cloud Function reads severity and routes accordingly.
Dataplex Auto Data Quality
Dataplex Universal Catalog can profile BigQuery tables automatically and run declarative data quality rules without writing SQL checks by hand. It detects null rates, value distributions, statistical outliers, and freshness at the table level.
For GA4 you can define rules like:
- transaction_id must not be null for rows where event_name = 'purchase'
- event_date must be within the last 3 days
- event_timestamp must be greater than zero
The main advantage is a non-technical UI that analytics stakeholders without strong SQL skills can navigate. Rules are versioned and auditable.
The limitation in GA4 contexts is the nested schema. Dataplex rules operate on table columns. Extracting transaction_id from inside event_params ARRAY<STRUCT<...>> requires either a pre-flattened view or a custom SQL rule, which largely negates the no-SQL advantage. For monitoring top-level fields — event_date, event_name, user_pseudo_id null rates — Dataplex works well out of the box. For parameter-level validation, custom SQL is more practical.
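A sketch of such a pre-flattened view (the view name is illustrative), exposing transaction_id and ga_session_id as plain columns that Dataplex rules can target directly:
-- Sketch: flatten purchase parameters into top-level columns for rule-based tools
CREATE OR REPLACE VIEW `your-project.analytics_monitoring.purchases_flat` AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  event_timestamp,
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id') AS transaction_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE event_name = 'purchase';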
Reference: Dataplex Auto Data Quality overview.
Enterprise Observability Tools vs DIY
There is no shortage of tools in this space. A three-tier framing is the most honest:
Tier 1 — DIY SQL + Cloud Monitoring (free)
What this article describes. Covers freshness, volume, completeness, uniqueness, and validity with standard BigQuery SQL and scheduled queries. Requires SQL knowledge to build and maintain. No licensing cost beyond query processing. Right starting point for most teams.
Tier 2 — dbt tests / Great Expectations / Soda Core
Open-source frameworks that bring test infrastructure to SQL: schema tests, custom assertions, test result storage, CI integration. dbt’s built-in tests (not_null, unique, accepted_values) cover most GA4 quality dimensions when applied to flattened models. Appropriate when the warehouse has grown beyond GA4 and quality monitoring needs to be consistent across multiple sources.
Tier 3 — Monte Carlo, Bigeye, Metaplane
Commercial data observability platforms. Monte Carlo uses ML-based anomaly detection on your warehouse metadata without requiring you to write any checks. Pricing typically starts at $30–50K/year. Justified for large data teams where the cost of undetected data issues exceeds the platform cost. Not justified for a team whose primary analytical source is GA4.
The sweet spot for most analytics teams: Tier 1 for GA4 monitoring, graduating to Tier 2 as the warehouse grows and dbt models proliferate.
Starter Monitoring Kit
Here is a composite query that runs all core checks and writes failures to the alerts table in a single scheduled query execution. Run this daily at 08:00 UTC.
-- Composite GA4 data quality check — daily scheduled query
-- Writes to analytics_monitoring.data_quality_alerts only on failure
DECLARE yesterday DATE DEFAULT DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY);
DECLARE yesterday_suffix STRING DEFAULT FORMAT_DATE('%Y%m%d', yesterday);
INSERT INTO `your-project.analytics_monitoring.data_quality_alerts`
(check_name, severity, details, event_date, detected_at)
-- CHECK 1: Freshness
SELECT
'freshness_daily_table' AS check_name,
'CRITICAL' AS severity,
CONCAT('events_', yesterday_suffix, ' has ', CAST(COALESCE(row_count, 0) AS STRING), ' rows') AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
  -- MAX() over zero matching rows returns a single NULL row, so a missing table still triggers the alert
  SELECT MAX(row_count) AS row_count
  FROM `your-project.analytics_XXXXXXXXX.__TABLES__`
  WHERE table_id = CONCAT('events_', yesterday_suffix)
) t
WHERE COALESCE(t.row_count, 0) < 500000
UNION ALL
-- CHECK 2: Volume anomaly on page_view
SELECT
'volume_page_view_anomaly' AS check_name,
'WARNING' AS severity,
CONCAT(
'page_view count ', CAST(today_count AS STRING),
' vs 28-day avg ', CAST(ROUND(rolling_avg, 0) AS STRING)
) AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
WITH daily AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS dt,
COUNT(*) AS cnt
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(yesterday, INTERVAL 28 DAY))
AND yesterday_suffix
AND event_name = 'page_view'
GROUP BY dt
),
stats AS (
SELECT
dt,
cnt,
AVG(cnt) OVER (ORDER BY dt ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS rolling_avg,
STDDEV(cnt) OVER (ORDER BY dt ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS rolling_std
FROM daily
)
SELECT cnt AS today_count, rolling_avg, rolling_std
FROM stats
WHERE dt = yesterday
AND ABS(cnt - rolling_avg) > 2 * rolling_std
)
UNION ALL
-- CHECK 3: Purchase events missing transaction_id
SELECT
'completeness_purchase_transaction_id' AS check_name,
'CRITICAL' AS severity,
CONCAT(
CAST(null_count AS STRING), ' purchase events missing transaction_id of ',
CAST(total_count AS STRING), ' total'
) AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
SELECT
COUNTIF(transaction_id IS NULL) AS null_count,
COUNT(*) AS total_count
FROM (
SELECT
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id') AS transaction_id
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = yesterday_suffix
AND event_name = 'purchase'
)
)
WHERE null_count > 0
UNION ALL
-- CHECK 4: Duplicate transaction_ids
SELECT
'uniqueness_duplicate_transactions' AS check_name,
'CRITICAL' AS severity,
CONCAT(CAST(dup_count AS STRING), ' duplicate transaction_ids detected') AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
  -- count how many distinct transaction_ids appear more than once
  SELECT COUNT(*) AS dup_count
  FROM (
    SELECT transaction_id
    FROM (
      SELECT
        (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id') AS transaction_id
      FROM `your-project.analytics_XXXXXXXXX.events_*`
      WHERE _TABLE_SUFFIX = yesterday_suffix
        AND event_name = 'purchase'
    )
    WHERE transaction_id IS NOT NULL
    GROUP BY transaction_id
    HAVING COUNT(*) > 1
  )
)
WHERE dup_count > 0
UNION ALL
-- CHECK 5: Bot traffic rate above 1%
SELECT
'validity_bot_traffic_rate' AS check_name,
'WARNING' AS severity,
CONCAT(CAST(ROUND(bot_pct * 100, 2) AS STRING), '% of sessions flagged with bot_score >= 2') AS details,
yesterday AS event_date,
CURRENT_TIMESTAMP() AS detected_at
FROM (
WITH session_stats AS (
SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
COUNT(*) AS event_count,
MIN(event_timestamp) AS ts_start,
MAX(event_timestamp) AS ts_end,
SUM(CASE WHEN event_name = 'user_engagement' THEN
COALESCE((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec'), 0)
ELSE 0 END) AS total_engagement_ms
FROM `your-project.analytics_XXXXXXXXX.events_*`
WHERE _TABLE_SUFFIX = yesterday_suffix
GROUP BY user_pseudo_id, session_id
),
user_sessions AS (
SELECT user_pseudo_id, COUNT(*) AS daily_sessions
FROM session_stats
GROUP BY user_pseudo_id
),
scored AS (
SELECT
(
CASE WHEN total_engagement_ms = 0 THEN 1 ELSE 0 END
+ CASE WHEN event_count > 100 AND (ts_end - ts_start) < 60000000 THEN 1 ELSE 0 END
+ CASE WHEN u.daily_sessions > 50 THEN 1 ELSE 0 END
) AS bot_score
FROM session_stats s
JOIN user_sessions u USING (user_pseudo_id)
)
SELECT SAFE_DIVIDE(COUNTIF(bot_score >= 2), COUNT(*)) AS bot_pct
FROM scored
)
WHERE bot_pct > 0.01;
Deployment checklist:
- Create the data_quality_alerts table using the DDL above.
- Create a BigQuery Scheduled Query running the composite check at 08:00 UTC daily.
- Create a Cloud Monitoring alert triggered when the insert row count is greater than zero.
- Wire the alert to your notification channel (email, Slack webhook, PagerDuty).
That is one scheduled query, one alert, one notification channel. It covers the most critical failure modes for most GA4 properties.
Data quality monitoring is not a one-time setup. Your property will grow, custom events will change, consent mode configurations will be updated, and new failure modes will emerge that none of these checks anticipated. Review the alerts table weekly, add a new check whenever you discover an issue that would have been caught earlier, and remove checks that have fired zero times in six months. A monitoring system that generates noise is worse than no monitoring at all.
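A small rollup over the alerts table makes that weekly review concrete, for example:
-- Alert frequency by check over the last 180 days: spot noisy checks and checks that never fire
SELECT
  check_name,
  severity,
  COUNT(*) AS times_fired,
  MAX(event_date) AS last_fired
FROM `your-project.analytics_monitoring.data_quality_alerts`
WHERE detected_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY)
GROUP BY check_name, severity
ORDER BY times_fired DESC;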
Related Resources
- BigQuery Scheduled Queries for GA4 Automated Reporting — Automate the transformation layer that feeds your monitoring tables
- GA4 Data Retention Workarounds with BigQuery — Keep raw event data beyond GA4’s 14-month limit
- Dataform for GA4: Build Your First BigQuery Pipeline — Add dependency management and data quality assertions to your pipeline