'Email Already Taken': Checking Millions of Signups at Scale

When you sign up for a service, checking whether an email is already taken seems straightforward. A simple database query. But here's the thing: at scale (millions of signups, thousands of concurrent requests, bots hammering your endpoints), this becomes a real distributed systems problem. You're dealing with:

Latency: Every signup request hits the database → slower response times
Race conditions: Two users registering the same email simultaneously
Security: Attackers probing which emails are registered
Resource contention: Database under heavy load from duplicate checks

This is exactly the kind of problem Google, Meta, and other tech giants had to solve. And the answer? A combination of Bloom filters, email normalization, and smart database constraints.

Let's break it down.

The Real-World Challenge

Imagine you're running a popular service and you're experiencing 1,000 signups per minute. That's roughly 1.4 million signups per day. Without optimization, that's 1.4 million database queries just for email uniqueness checks.

Each query takes time—even if it's milliseconds. On a busy database, those milliseconds add up. Users see slower signup times. Your database CPU spikes. This is a classic scaling problem.

The solution? Use a probabilistic data structure to pre-filter obvious non-duplicates before hitting the database.

Introducing: The Bloom Filter

A Bloom filter is a probabilistic data structure that answers one question with certainty:

"Has this value definitely NOT been seen before?"

Here's the magic: it has no false negatives, only false positives.

Scenario	Result	What It Means
Email not in Bloom	100% sure it's new	Skip the DB query! ✅
Email maybe in Bloom	Already registered, or a false positive	Check the DB to be sure

Why This Works

Speed: O(1) lookup, just a few hash function calls
Memory: Compact. 100,000 emails ≈ 180 KB at a 0.1% false positive rate (about 14.4 bits per email), far less than storing the addresses themselves
Trade-off: A "maybe seen" answer is occasionally wrong (false positive), but a "definitely new" answer never is
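To make the speed and memory claims concrete, here is a minimal Bloom filter sketch in Python. The class name `BloomFilter`, the method names, and the SHA-256 double-hashing scheme are illustrative choices, not something the approach prescribes. The sizing uses the standard formulas m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k hash positions per item."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bloom = BloomFilter(expected_items=100_000, false_positive_rate=0.001)
bloom.add("johndoe@gmail.com")
print(bloom.might_contain("johndoe@gmail.com"))  # True: added items are never missed
print(len(bloom.bits))  # filter size in bytes
```

Each lookup costs a fixed number of hash positions regardless of how many emails are stored, which is where the O(1) behavior comes from.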

Real example:

10,000 signup attempts arrive
Bloom says "maybe exists" for 50 emails (false positives + true positives)
Only 50 lookup queries hit the database instead of 10,000
200x reduction in duplicate-check queries
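The count of "maybe exists" answers can be estimated from two inputs: how many signups are real duplicates and the filter's false positive rate. The sketch below uses a hypothetical split (40 real duplicates out of 10,000 attempts, 0.1% false positive rate) that is not from any measured system; the function name is also illustrative.

```python
def expected_db_lookups(
    signups: int, true_duplicates: int, false_positive_rate: float
) -> float:
    """Expected number of signups the filter sends to the database for confirmation."""
    new_emails = signups - true_duplicates
    # Every true duplicate is flagged (no false negatives); new emails are
    # flagged only at the false positive rate.
    return true_duplicates + new_emails * false_positive_rate


lookups = expected_db_lookups(
    signups=10_000, true_duplicates=40, false_positive_rate=0.001
)
print(round(lookups))  # 50
```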

How It Fits Into Your Architecture

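Here's the signup flow: normalize the email, ask the Bloom filter, query the database only if the filter answers "maybe exists", and then insert under a UNIQUE constraint.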

The database is the source of truth. The Bloom filter is just a speed layer.

Email Normalization: The Forgotten Feature

Before checking anything, you need to normalize emails. Why? Because people are creative with email addresses.

Look at these:

john.doe@gmail.com
johndoe@gmail.com
john.d.o.e@gmail.com
john+spam@gmail.com
john+newsletter@gmail.com

Gmail delivers all five to just two mailboxes. It ignores dots in the local part and everything from the + onward, so the first three resolve to johndoe@gmail.com and the last two resolve to john@gmail.com.

This is critical. If you don't normalize, someone could sign up with multiple variations and bypass your email check, creating duplicate users with the same effective email address.

Normalization rules for popular domains:

Gmail → lowercase, remove dots, remove +tags
Microsoft (Outlook, Hotmail) → lowercase, remove +tags (dots are significant there)
Others → just lowercase
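A minimal sketch of rule-driven normalization might look like this. The `PROVIDER_RULES` table and the `normalize_email` name are hypothetical, and only the Gmail rule is filled in; other providers would get their own entries once their behavior has been confirmed against their documentation.

```python
# Hypothetical rule table: only domains listed here get provider-specific handling.
PROVIDER_RULES = {
    "gmail.com": {"strip_dots": True, "strip_plus_tag": True},
}


def normalize_email(raw: str) -> str:
    """Return a canonical form of the address for uniqueness checks."""
    email = raw.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a valid email address: {raw!r}")

    rules = PROVIDER_RULES.get(domain)
    if rules:
        if rules["strip_plus_tag"]:
            local = local.split("+", 1)[0]  # drop "+tag" and everything after it
        if rules["strip_dots"]:
            local = local.replace(".", "")
    if not local:
        raise ValueError(f"nothing left of the local part: {raw!r}")
    return f"{local}@{domain}"


print(normalize_email("John.Doe@Gmail.com"))     # johndoe@gmail.com
print(normalize_email("john+spam@gmail.com"))    # john@gmail.com
print(normalize_email("A.B+tag@example.com"))    # a.b+tag@example.com
```

Storing the normalized form for uniqueness while keeping the address as typed for delivery is one common design; which of the two you send mail to is a product decision.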

Race Conditions: The Database Wins

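Here's a tricky scenario. Two users hit your signup endpoint simultaneously with the same email. Both requests pass the Bloom filter and the existence check before either has written a row, so both go on to attempt the insert.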

This is why the UNIQUE constraint on the email column is non-negotiable. It's your final line of defense. The database enforces global uniqueness, even under concurrent writes. Your code should gracefully handle database constraint violations when the race condition occurs.
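Handling the constraint violation can be sketched with Python's built-in `sqlite3` module. The `users` table, its columns, and the function names are assumptions for illustration; other databases and drivers raise their own equivalent of `sqlite3.IntegrityError`.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
    )


def insert_user(conn: sqlite3.Connection, email: str) -> bool:
    """Try to create the user; return False if the email already exists."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        # Another request inserted the same email first: the loser of the race.
        return False


conn = sqlite3.connect(":memory:")
create_schema(conn)
print(insert_user(conn, "johndoe@gmail.com"))  # True
print(insert_user(conn, "johndoe@gmail.com"))  # False
```

The point of the design is that the insert itself is the check: no separate "does it exist?" query can close the gap between two concurrent requests.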

Security: Don't Leak Email Addresses

Here's an attack vector many developers miss:

Bad response: "Email already taken"
An attacker can now probe for valid emails. Check a million emails, and they know which ones are registered. This is called email enumeration.

Better response: "If this email is valid, you'll receive a verification link shortly"
Now the attacker gets the same response whether the email exists or not. The only way to confirm registration is to check their inbox.

Many security-conscious services take this approach for signup and password-reset flows.
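One way to keep the two outcomes indistinguishable is to build the HTTP response without looking at the outcome at all and let only the outgoing email differ. In this sketch the `SignupResponse` type, the template names, and the `outbox` list (standing in for a background mail queue) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

GENERIC_MESSAGE = "If this email is valid, you'll receive a verification link shortly"


@dataclass(frozen=True)
class SignupResponse:
    status_code: int
    message: str


def respond_to_signup(
    email: str, created: bool, outbox: List[Tuple[str, str]]
) -> SignupResponse:
    # The outcome only decides which email gets queued, so the difference
    # is visible in the inbox and nowhere in the HTTP response.
    template = "verify_new_account" if created else "signup_attempt_notice"
    outbox.append((email, template))
    return SignupResponse(status_code=202, message=GENERIC_MESSAGE)


outbox: List[Tuple[str, str]] = []
new_user = respond_to_signup("new@example.com", created=True, outbox=outbox)
existing = respond_to_signup("taken@example.com", created=False, outbox=outbox)
print(new_user == existing)  # True: both callers see the same response
```

Response timing can leak the same information, so both branches should do comparable work before returning.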

Implementation Strategy

While the specific implementation varies by framework and language, the pattern is universal:

Normalize the email at the start of signup
Check the Bloom filter (O(1) operation)
If not found, proceed directly to database insert
If maybe found, query the database to confirm
Attempt insert with UNIQUE constraint (final guarantee)
On success, add to Bloom filter for next time
On failure, respond with privacy-first message (never reveal if email exists)

This three-layer approach—Bloom filter → database query → UNIQUE constraint—catches issues at different scales and provides defense in depth.
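The seven steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than a reference implementation: `TinyBloom` is a fixed-size filter included only to keep the example self-contained, `normalize` only lowercases, and the `users` table in SQLite stands in for your real database.

```python
import hashlib
import sqlite3
from typing import Tuple

GENERIC_MESSAGE = "If this email is valid, you'll receive a verification link shortly"


class TinyBloom:
    """Small fixed-size Bloom filter, included only to keep the sketch runnable."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def normalize(email: str) -> str:
    # Placeholder: provider-specific rules would go here.
    return email.strip().lower()


def signup(conn: sqlite3.Connection, bloom: TinyBloom, raw_email: str) -> Tuple[bool, str]:
    """Return (created, message); the message is the same for every outcome."""
    email = normalize(raw_email)                      # step 1
    if bloom.might_contain(email):                    # step 2
        row = conn.execute(                           # step 4: confirm a "maybe"
            "SELECT 1 FROM users WHERE email = ?", (email,)
        ).fetchone()
        if row is not None:
            return False, GENERIC_MESSAGE             # step 7
    try:
        with conn:                                    # steps 3 and 5
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    except sqlite3.IntegrityError:
        return False, GENERIC_MESSAGE                 # step 7: lost a race
    bloom.add(email)                                  # step 6
    return True, GENERIC_MESSAGE


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
bloom = TinyBloom()
print(signup(conn, bloom, "John@Example.com")[0])  # True
print(signup(conn, bloom, "john@example.com")[0])  # False
```

Adding to the filter only after a successful insert keeps failed or rolled-back signups from inflating the false positive rate.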

The code for this is straightforward in any framework. Check your framework's documentation for email normalization utilities and database constraint handling. The key is the logic flow, not the specific syntax.

Production Considerations

Cold Start Problem

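When you first deploy, or whenever an in-memory filter is rebuilt after a restart, the Bloom filter is empty. An empty filter reports every email as "definitely new", including ones already registered, so duplicate signups skip the lookup and are rejected only by the UNIQUE constraint at insert time. Until the filter is populated, its "definitely new" answers cannot be trusted.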

Solution: On app startup, preload the Bloom filter from the database or a cached state. Most frameworks provide lifecycle hooks (app startup, initialization) where you can load all existing emails into memory.
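A startup preload can be sketched as a batched scan of the users table. The `preload_filter` name, the `users` table, and the batch size are assumptions; the function accepts any object with an `add` method, and a plain `set` stands in for the Bloom filter in the usage lines so the example runs on its own. It also assumes emails are stored in normalized form.

```python
import sqlite3


def preload_filter(conn: sqlite3.Connection, email_filter, batch_size: int = 10_000) -> int:
    """Add every stored email to the filter; return how many were loaded."""
    cursor = conn.execute("SELECT email FROM users")
    loaded = 0
    while True:
        # Fetch in batches so the whole table is never held in memory at once.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for (email,) in rows:
            email_filter.add(email)
            loaded += 1
    return loaded


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("c@example.com",)],
)

seen = set()  # stand-in for the Bloom filter; only .add() is required
print(preload_filter(conn, seen, batch_size=2))  # 3
```

For large tables, persisting the filter's bit array and reloading it is usually faster than rescanning, at the cost of keeping that snapshot in sync.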

Distributed Systems

For true distributed systems (multiple servers), an in-memory Bloom filter won't work. One server won't know about emails registered on another.

Solution: Use a distributed cache like Redis to store the Bloom filter. This way, all servers reference the same filter, maintaining consistency across your infrastructure.
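One way to share a filter is to keep its bit array in a single Redis key and use the core SETBIT and GETBIT commands. The sketch below is written against any client exposing `setbit(name, offset, value)` and `getbit(name, offset)`, which is how redis-py names them; the `SharedBloom` class, the key name, and the sizing are hypothetical. `InMemoryBitStore` is a stand-in so the example runs without a Redis server.

```python
import hashlib
from typing import Dict, List, Tuple


class InMemoryBitStore:
    """Stand-in with the same two methods, for running the sketch without Redis."""

    def __init__(self) -> None:
        self._bits: Dict[Tuple[str, int], int] = {}

    def setbit(self, name: str, offset: int, value: int) -> int:
        previous = self._bits.get((name, offset), 0)
        self._bits[(name, offset)] = value
        return previous

    def getbit(self, name: str, offset: int) -> int:
        return self._bits.get((name, offset), 0)


class SharedBloom:
    """Bloom filter whose bits live in a shared store instead of process memory."""

    def __init__(self, store, key: str = "signup:emails:bloom",
                 num_bits: int = 1 << 24, num_hashes: int = 7) -> None:
        self.store, self.key = store, key
        self.num_bits, self.num_hashes = num_bits, num_hashes

    def _positions(self, item: str) -> List[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.store.setbit(self.key, pos, 1)

    def might_contain(self, item: str) -> bool:
        return all(self.store.getbit(self.key, pos) for pos in self._positions(item))


store = InMemoryBitStore()      # in production: a Redis client shared by all servers
server_a = SharedBloom(store)
server_b = SharedBloom(store)
server_a.add("johndoe@gmail.com")
print(server_b.might_contain("johndoe@gmail.com"))  # True: B sees what A wrote
```

Each check here costs one round trip per hash position, so a real deployment would likely batch them in a pipeline or use the native Bloom filter commands (BF.ADD, BF.EXISTS) if the Redis Bloom module is available.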

Handling Scale Progression

As your service grows:

What Google Does (Conceptually)

Google's signup system for Gmail, YouTube, and other services likely involves:

Email normalization at the edge (geographically distributed)
Distributed Bloom filters in every region
Spanner-like database (globally consistent, multi-region)
Rate limiting (prevent abuse)
CAPTCHA (bot protection)
Privacy-first responses (no email enumeration)
Async pipelines (send verification emails in background)
Monitoring (track duplicate attempts, patterns)

The Bloom filter is just one piece, but it's a critical optimization.

Key Takeaways

Bloom filters are perfect for "maybe exists" checks — they're fast and memory-efficient
Email normalization prevents duplicate tricks — normalize before every check
The UNIQUE constraint is non-negotiable — it's your guarantee against duplicates
Database is the source of truth — Bloom filter is just a speed layer
Race conditions are real — handle constraint violations gracefully
Security matters — don't leak whether an email is registered
Cold start is a trap — preload the Bloom filter on startup

Find the complete code on GitHub