How GA4 Counts Millions of Users with 12 Kilobytes: The HyperLogLog Algorithm

GA4 reports 2.4 million unique users, but how does it count them without storing 2.4 million IDs? The answer is HyperLogLog, a probabilistic algorithm that trades perfect accuracy for radical efficiency. This is the story of how a clever mathematical trick powers modern analytics.

The Problem: Counting Is Hard

You want to know how many unique users visited your website last month.

Simple, right? Just keep a list of every user ID you see, and count how many are in the list.

With 100 users, this works fine. With 1 million users, you need to store 1 million IDs. With 1 billion users across Google Analytics, you need… a lot of storage.

And it gets worse. Every time a new event arrives, you need to check if that user is already in your list. With a billion entries, that lookup is slow.

Now multiply this by every metric in GA4: users by page, users by country, users by device, users by campaign, users by hour. The number of dimension combinations explodes.

Storing exact counts doesn’t scale.

This is where HyperLogLog enters the picture.


The Insight: Randomness Contains Information

Here’s a thought experiment.

I flip a coin repeatedly until I get heads. Sometimes I stop after 1 flip (heads immediately). Sometimes after 3 flips (tails, tails, heads). Occasionally after 10 flips.

If I tell you my longest streak before hitting heads was 20 flips, what can you infer?

You’d guess I flipped the coin many times. Getting 20 tails in a row has probability (1/2)^20, about one in a million. To see such a rare event, I probably made on the order of a million flips.

The longest streak tells you something about the total count.

This is the core insight behind HyperLogLog: you can estimate how many things you’ve seen by tracking the “most unusual” thing you’ve encountered.


From Coins to Hashes

Computers don’t flip coins. They compute hash functions.

A hash function takes any input (like a user ID) and produces a seemingly random number. The same input always produces the same output, but different inputs produce wildly different outputs.

hash("user_123") → 0110100011010111001010...
hash("user_456") → 1011001110000101110011...
hash("user_789") → 0000001011101001001010...

Look at that third hash. It starts with six zeros. That’s unusual—like flipping six tails in a row.

If you hash millions of user IDs, you’ll eventually see hashes starting with many zeros. The more users you have, the more likely you are to see rare patterns.

HyperLogLog counts users by tracking the rarest hash pattern it has seen.
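The leading-zero trick is easy to see in code. A minimal sketch in Python, using SHA-256 as a stand-in for whatever hash an analytics system actually uses (GA4's internal hash is not public):

```python
import hashlib

def leading_zeros(user_id: str) -> int:
    """Hash a user ID and count leading zero bits in a 64-bit slice."""
    # SHA-256 is an arbitrary stand-in; any well-mixed hash behaves the same.
    value = int.from_bytes(hashlib.sha256(user_id.encode()).digest()[:8], "big")
    return 64 if value == 0 else 64 - value.bit_length()

# Same input, same output; different inputs, wildly different counts.
print(leading_zeros("user_123"), leading_zeros("user_789"))
```

The specific counts depend entirely on the hash function chosen; what matters is that they are deterministic per ID and uniformly distributed across IDs.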


The Algorithm: Counting Zeros

Here’s HyperLogLog in plain English:

  1. Hash every user ID to get a binary number
  2. Count leading zeros in each hash
  3. Remember the maximum leading zeros seen
  4. Estimate cardinality from that maximum

If the longest run of leading zeros you’ve seen is k, the estimated count is roughly 2^k.

Saw a hash starting with 10 zeros? You’ve probably seen around 2^10 = 1,024 unique values.

Saw 20 leading zeros? Probably around 2^20 = 1 million unique values.
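The four steps above fit in a few lines. A toy single-counter estimator in Python (SHA-256 again standing in for the production hash):

```python
import hashlib

def leading_zeros(item: str) -> int:
    # 64-bit slice of a SHA-256 digest (a stand-in for the production hash)
    value = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    return 64 if value == 0 else 64 - value.bit_length()

def naive_estimate(items) -> int:
    """Single-counter estimator: remember the max, return 2 ** max."""
    k = max(leading_zeros(x) for x in items)
    return 2 ** k

# Right order of magnitude for 1,000 distinct IDs, but noisy:
print(naive_estimate(f"user_{i}" for i in range(1000)))
```

Run it a few times with different ID prefixes and the estimate jumps around wildly, which is exactly the problem the next sections address.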

The Memory Magic

Here’s the beautiful part: you only store one number—the maximum leading zeros count.

Not a list of user IDs. Not a bloom filter. Just a single integer.

One byte can represent maximums up to 255 zeros, which could in principle estimate around 2^255 unique values, astronomically more than any real dataset.

From terabytes of user IDs to a single byte.


The Problem with One Number

There’s a catch. A single maximum is noisy.

Imagine you’ve seen 1,000 users, and your maximum leading zeros is 15. That suggests 2^15 = 32,768 users. Way off.

Or you’ve seen 1 million users but got unlucky—your maximum is only 12, suggesting 4,096 users.

Single observations have high variance.


The Fix: Divide and Average

HyperLogLog’s clever trick: use many counters and combine them.

  1. Split hashes into buckets using the first few bits
  2. Track maximum leading zeros separately for each bucket
  3. Combine estimates using a harmonic mean

With 2,048 buckets, for example, you get 2,048 independent estimates. Averaging them dramatically reduces variance.

Bucket 0:  max_zeros = 8  → estimate ~256
Bucket 1:  max_zeros = 12 → estimate ~4,096
Bucket 2:  max_zeros = 10 → estimate ~1,024
...
Bucket 2047: max_zeros = 11 → estimate ~2,048

Combined estimate: ~500,000 users (±2.3% error)
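Here is a toy version of the bucketed algorithm in Python. The bucket count, the SHA-256 hash, and the absence of small- and large-range corrections are all simplifications of a production implementation:

```python
import hashlib

class TinyHLL:
    """Toy HyperLogLog with 2,048 buckets. Illustrative only: real
    implementations add small- and large-range corrections."""
    P = 11            # bucket-index bits; 2**11 = 2,048 buckets
    M = 1 << P

    def __init__(self):
        self.buckets = [0] * self.M

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        j = h >> (64 - self.P)                   # first 11 bits pick a bucket
        rest = h & ((1 << (64 - self.P)) - 1)    # remaining 53 bits
        rank = (64 - self.P) - rest.bit_length() + 1   # leading zeros + 1
        self.buckets[j] = max(self.buckets[j], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.M)    # large-m correction constant
        return alpha * self.M ** 2 / sum(2.0 ** -b for b in self.buckets)

hll = TinyHLL()
for i in range(100_000):
    hll.add(f"user_{i}")
print(round(hll.estimate()))   # within a few percent of 100,000
```

The sketch itself is just the 2,048-entry `buckets` list; the 100,000 IDs are never stored.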

Memory Calculation

Each bucket stores a small number (typically 5-6 bits for the zero count).

  • 2,048 buckets × 6 bits = 12,288 bits = 1.5 kilobytes

That’s the entire data structure. 1.5 KB to estimate billions of unique users.

GA4 uses a slight variation with about 12 KB per HyperLogLog sketch, achieving approximately 0.8% standard error.


Why Harmonic Mean?

Why not use a regular average?

Arithmetic mean is sensitive to outliers. One bucket with an unusually high maximum would skew the entire estimate upward.

The harmonic mean—computed as n / (1/x₁ + 1/x₂ + ... + 1/xₙ)—naturally dampens outliers. It’s closer to the geometric mean, which handles multiplicative relationships better.
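A two-line comparison makes the dampening visible. With one wildly high bucket estimate (the numbers below are made up), the arithmetic mean is dragged far upward while the harmonic mean barely moves:

```python
def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

# Four typical bucket estimates plus one wild outlier (made-up numbers):
estimates = [256, 256, 256, 256, 65536]
print(sum(estimates) / len(estimates))   # arithmetic mean: 13312.0
print(harmonic_mean(estimates))          # harmonic mean: ~319.7
```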

The actual HyperLogLog formula includes correction factors discovered through mathematical analysis and empirical testing:

E = α * m² / Σ(2^(-M[j]))

Where:

  • m = number of buckets
  • M[j] = maximum for bucket j (strictly, the standard formulation tracks leading zeros + 1, the position of the first 1-bit)
  • α = correction constant (~0.7213 for large m)

The math is elegant, but the intuition is simple: average many rough estimates to get one good estimate.
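The formula transcribes directly into Python. This is illustrative only; real implementations add the range corrections mentioned above:

```python
# Direct transcription of E = alpha * m^2 / sum(2^(-M[j])), given the list
# of per-bucket maxima. No small- or large-range corrections.
def hll_estimate(M):
    m = len(M)
    alpha = 0.7213 / (1 + 1.079 / m)   # the large-m correction constant
    return alpha * m * m / sum(2.0 ** -Mj for Mj in M)

# 1,024 buckets that all recorded a maximum of 5:
print(round(hll_estimate([5] * 1024)))   # roughly 23,600
```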


Accuracy vs. Memory Tradeoff

More buckets = more accuracy = more memory.

Buckets    Memory       Standard Error
16         12 bytes     26%
256        192 bytes    6.5%
2,048      1.5 KB       2.3%
16,384     12 KB        0.81%
65,536     48 KB        0.41%

GA4 uses 16,384 buckets (sometimes called precision 14, since 2^14 = 16,384), giving ~0.8% standard error with 12 KB memory.

For 10 million actual users, the estimate would typically be between 9.92 million and 10.08 million.

Good enough for analytics. Incredible for the memory cost.
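The error column in the table above isn't arbitrary: it follows the well-known rule of thumb that standard error ≈ 1.04 / √m, where m is the bucket count. A quick check in Python:

```python
import math

# Rule of thumb for HyperLogLog accuracy: standard error ~ 1.04 / sqrt(m)
def std_error(m: int) -> float:
    return 1.04 / math.sqrt(m)

# Reproduce the error column for each precision level
for p in (4, 8, 11, 14, 16):
    print(f"2^{p} = {2**p:>6} buckets: {std_error(2**p):.2%}")
```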


HyperLogLog in GA4

When you see “Users” in any GA4 report, you’re seeing a HyperLogLog estimate.

Where It’s Used

  • Total users in any report
  • New users vs returning users
  • Users by dimension (country, device, page, etc.)
  • User counts in segments
  • BigQuery export user counts

The 0.5 User Problem

Ever seen a report showing “0.5 users” or non-integer user counts?

HyperLogLog itself produces non-integer estimates — the harmonic mean formula outputs a real number, not a whole number. When GA4 aggregates sketches across dimensions or time periods, or when data sampling kicks in and results get extrapolated, fractional user counts appear in reports.

It’s not literally half a person — it’s a statistical estimate.

Merging Sketches

The killer feature of HyperLogLog: sketches can be merged.

Want to know unique users across two date ranges? Don’t reprocess all the events. Just merge the two HyperLogLog sketches.

Sketch A (January): 500,000 users
Sketch B (February): 600,000 users
Merged (Jan + Feb): 850,000 users (not 1.1M—many users visited both months)

The merged sketch correctly handles overlap. This is why GA4 can compute “users last 90 days” without scanning 90 days of raw data.
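The merge itself is almost trivially simple: take the element-wise maximum of the two bucket arrays. A sketch, with tiny four-bucket arrays for readability:

```python
# Merging is element-wise max over the bucket arrays. A user seen in both
# months raises the same bucket to the same rank in both sketches, so the
# overlap is handled automatically.
def merge(buckets_a, buckets_b):
    return [max(a, b) for a, b in zip(buckets_a, buckets_b)]

january = [3, 7, 5, 2]
february = [4, 6, 5, 9]
print(merge(january, february))   # [4, 7, 5, 9]
```

Because max is associative and commutative, sketches can be merged in any order and grouping, which is what makes per-day sketches composable into arbitrary date ranges.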

BigQuery HLL Functions

If you export GA4 data to BigQuery, you can use HyperLogLog directly:

-- Create a HyperLogLog sketch
SELECT HLL_COUNT.INIT(user_pseudo_id) as user_sketch
FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX = '20260131';

-- Merge sketches from multiple days into one sketch
SELECT HLL_COUNT.MERGE_PARTIAL(user_sketch) as merged_sketch
FROM daily_sketches;

-- Extract the count estimate
SELECT HLL_COUNT.EXTRACT(merged_sketch) as estimated_users
FROM merged_sketches;

(HLL_COUNT.MERGE does both steps at once: it merges the sketches and returns the estimated count directly.)

This is how you can compute user counts across massive datasets efficiently.


When HyperLogLog Struggles

Low Cardinalities

HyperLogLog is optimized for large counts. For small numbers (under 1,000), the relative error can be significant.

GA4 and BigQuery implementations include “small range corrections” that switch to exact counting when cardinality is low.
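One common small-range correction, known as linear counting, estimates from the fraction of still-empty buckets rather than from the maxima. A Python sketch (illustrative; the exact switch-over thresholds vary by implementation):

```python
import math

# "Linear counting": when many buckets are still empty, estimate the
# cardinality from the fraction of empty buckets instead of the maxima.
def linear_counting(buckets):
    m = len(buckets)
    v = buckets.count(0)        # number of empty buckets
    return m * math.log(m / v)

# 1,024 buckets, 100 of them hit: the estimate lands near the true 100.
print(round(linear_counting([1] * 100 + [0] * 924)))
```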

Intersections Are Hard

HyperLogLog excels at unions (users who visited page A or page B) but struggles with intersections (users who visited page A and page B).

There’s no efficient way to compute intersections from two sketches. GA4 has to use other techniques for segmented analysis.
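The standard workaround is inclusion-exclusion over three estimates, which works but compounds the error of each one. All numbers below are made up for illustration:

```python
# No merge yields an intersection sketch; the usual workaround is
# inclusion-exclusion: |A intersect B| = |A| + |B| - |A union B|.
# When all three are ~1%-error estimates, a small intersection can
# even come out negative.
def intersection_estimate(count_a, count_b, count_union):
    return count_a + count_b - count_union

print(intersection_estimate(500_000, 600_000, 1_080_000))   # 20,000
```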

Exact Counts in BigQuery

If you need exact user counts (for billing, compliance, etc.), query the raw data:

SELECT COUNT(DISTINCT user_pseudo_id) as exact_users
FROM `project.analytics_PROPERTY_ID.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260131';

This scans all data and costs more, but gives precise results.


The Broader Lesson

HyperLogLog represents a fundamental computer science insight: perfect answers aren’t always necessary.

An estimate with less than 1% error is fine for analytics. Nobody makes different business decisions based on 10,000,000 vs 10,080,000 users.

This tradeoff—sacrificing precision for efficiency—appears everywhere:

  • Bloom filters: Probably in the set vs. definitely not in the set
  • Count-Min Sketch: Approximate frequency counting
  • Locality-Sensitive Hashing: Approximate nearest neighbors
  • Streaming algorithms: Process data without storing it all

These probabilistic data structures make modern analytics possible. Without them, Google couldn’t report user counts for billions of websites in real-time.


Summary

GA4 counts millions of users using a 12-kilobyte data structure because of HyperLogLog:

  1. Hash user IDs to random binary numbers
  2. Track the rarest pattern (most leading zeros) in each bucket
  3. Combine estimates using harmonic mean
  4. Merge sketches across time periods without reprocessing

The result: constant memory usage regardless of user count, with ~0.8% standard error.

Next time you see “2.4M users” in GA4, remember: Google didn’t store 2.4 million IDs. They stored a clever 12-kilobyte summary that contains the same information—approximately.

That’s the power of probabilistic algorithms.

