'Email Already Taken': Checking Millions of Signups at Scale
When you sign up for a service, checking whether an email is already taken seems straightforward. A simple database query. But here's the thing: at scale (millions of signups, thousands of concurrent requests, bots hammering your endpoints), this becomes a real distributed systems problem. You're dealing with:
- Latency: Every signup request hits the database → slower response times
- Race conditions: Two users registering the same email simultaneously
- Security: Attackers probing which emails are registered
- Resource contention: Database under heavy load from duplicate checks
This is exactly the kind of problem Google, Meta, and other tech giants had to solve. And the answer? A combination of Bloom filters, email normalization, and smart database constraints.
Let's break it down.
The Real-World Challenge
Imagine you're running a popular service and you're experiencing 1,000 signups per minute. That's roughly 1.4 million signups per day. Without optimization, that's 1.4 million database queries just for email uniqueness checks.
Each query takes time—even if it's milliseconds. On a busy database, those milliseconds add up. Users see slower signup times. Your database CPU spikes. This is a classic scaling problem.
The solution? Use a probabilistic data structure to pre-filter obvious non-duplicates before hitting the database.
Introducing: The Bloom Filter
A Bloom filter is a probabilistic data structure that answers one question with certainty:
"Has this value definitely NOT been seen before?"
Here's the magic: it has no false negatives, only false positives. When it says "not seen", that is always correct; when it says "seen", that only means "maybe".
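To make the idea concrete, here is a minimal sketch of a Bloom filter in plain Python. The class name `BloomFilter`, the method names, and the choice of SHA-256 with double hashing are assumptions made for illustration; a production system would more likely use a tuned library and a faster hash.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from two 64-bit halves of one SHA-256 digest
        # (double hashing). This hashing scheme is an illustrative choice.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True means "maybe added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter(num_bits=1_000_000, num_hashes=7)
bloom.add("johndoe@gmail.com")
print(bloom.might_contain("johndoe@gmail.com"))  # True: added items are never missed
print(bloom.might_contain("new.user@example.com"))
```

Note that there is no `remove` method: clearing a bit could also erase evidence of other emails that share it, which is why a plain Bloom filter only ever grows.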
Why This Works
- Speed: O(1) lookup, just a few hash function calls
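- Memory: Small. 100,000 emails fit in about 180 KB at a 0.1% false positive rate (roughly 14.4 bits per email)
- Trade-off: Occasionally wrong about new emails (a false positive reports "maybe exists" for an email it has never seen), but never wrong about emails that were added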
Real example:
- 10,000 signup requests come in
- Bloom says "maybe exists" for 50 emails (false positives + true positives)
- Only 50 existence queries hit the database instead of 10,000
- 200x reduction in duplicate-check queries
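The memory cost of a Bloom filter follows from the standard sizing formulas: m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p, and k = (m / n) ln 2 hash functions. The helper below is a small sketch of that arithmetic; the function name `bloom_size` is made up for this example.

```python
import math


def bloom_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for n_items at the target false positive rate."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


bits, hashes = bloom_size(100_000, 0.001)
print(f"{bits} bits (~{bits / 8 / 1000:.0f} KB), {hashes} hash functions")
```

For 100,000 emails at a 0.1% false positive rate this works out to roughly 180 KB and 10 hash functions. Memory grows linearly with the number of emails, so 100 million emails at the same rate need on the order of 180 MB.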
How It Fits Into Your Architecture
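Here's the signup flow: normalize the email, check the Bloom filter, query the database only when the filter says "maybe exists", and let the insert under a UNIQUE constraint make the final call.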
The database is the source of truth. The Bloom filter is just a speed layer.
Email Normalization: The Forgotten Feature
Before checking anything, you need to normalize emails. Why? Because people are creative with email addresses.
Look at these:
- john.doe@gmail.com
- johndoe@gmail.com
- john.d.o.e@gmail.com
- john+spam@gmail.com
- john+newsletter@gmail.com
Gmail delivers all of these to one of just two inboxes, because it ignores dots in the local part and everything after a +. The three dotted variants resolve to johndoe@gmail.com, and the two +tag variants resolve to john@gmail.com.
This is critical. If you don't normalize, someone could sign up with multiple variations and bypass your email check, creating duplicate users with the same effective email address.
Normalization rules for popular domains:
- Gmail → lowercase, remove dots, remove +tags
- Microsoft (Outlook, Hotmail) → lowercase, remove +tags (dots are significant, so keep them)
- Others → just lowercase
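A normalizer along these lines might look like the sketch below. The function name `normalize_email` and the two domain sets are illustrative assumptions and are not exhaustive. The sketch also assumes that Gmail ignores both dots and +tags while Outlook-family domains ignore only +tags, so verify each provider's behavior before relying on it.

```python
# Per-domain rules. These domain lists are assumptions for the example.
STRIP_DOTS = {"gmail.com", "googlemail.com"}
STRIP_PLUS_TAG = {"gmail.com", "googlemail.com", "outlook.com", "hotmail.com", "live.com"}


def normalize_email(raw: str) -> str:
    """Return a canonical form of an address, for uniqueness checks only."""
    local, sep, domain = raw.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {raw!r}")
    if domain in STRIP_PLUS_TAG:
        local = local.split("+", 1)[0]  # drop "+tag" and everything after it
    if domain in STRIP_DOTS:
        local = local.replace(".", "")
    if not local:
        raise ValueError(f"empty local part after normalization: {raw!r}")
    return f"{local}@{domain}"


print(normalize_email("John.D.O.E@Gmail.com"))   # johndoe@gmail.com
print(normalize_email("john+spam@gmail.com"))    # john@gmail.com
print(normalize_email("a.b+c@example.com"))      # a.b+c@example.com
```

One reasonable design is to store both forms: the normalized address backs the uniqueness check, and the address exactly as the user typed it is used for delivery.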
Race Conditions: The Database Wins
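Here's a tricky scenario. Two users hit your signup endpoint simultaneously with the same email. Both requests pass the Bloom filter and the database lookup before either has written a row, so both go on to attempt the insert.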
This is why the UNIQUE constraint on the email column is non-negotiable. It's your final line of defense. The database enforces global uniqueness, even under concurrent writes. Your code should gracefully handle database constraint violations when the race condition occurs.
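As a sketch of what handling the violation gracefully can look like, the example below uses Python's built-in `sqlite3` module. The `users` table layout and the `insert_user` helper are assumptions for illustration; other databases raise their own driver-specific integrity errors, but the shape is the same.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # The UNIQUE constraint is what actually guarantees one row per email.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
    )


def insert_user(conn: sqlite3.Connection, email: str) -> bool:
    """Try to insert; return False if the UNIQUE constraint rejected the row."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        # The losing request of a race lands here. This is an expected
        # outcome, not a server error.
        return False


conn = sqlite3.connect(":memory:")
create_schema(conn)
first = insert_user(conn, "johndoe@gmail.com")
second = insert_user(conn, "johndoe@gmail.com")
print(first, second)  # True False
```

The second call stands in for the request that loses the race: it was allowed to attempt the insert, and the database turned it away.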
Security: Don't Leak Email Addresses
Here's an attack vector many developers miss:
Bad response: "Email already taken"
An attacker can now probe for valid emails. Check a million emails, and they know which ones are registered. This is called email enumeration.
Better response: "If this email is valid, you'll receive a verification link shortly"
Now the attacker gets the same response whether the email exists or not. The only way to confirm registration is to check their inbox.
Security-conscious services use this pattern, most visibly in password-reset flows, and it is what OWASP's authentication guidance recommends.
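One way to keep the response uniform is to make the outcome decide only which email gets sent, never what the HTTP client sees. The sketch below assumes a hypothetical `handle_signup` function and a plain list standing in for an outbound mail queue. Sending a notice to an already-registered address is a common practice rather than something this article prescribes.

```python
from dataclasses import dataclass

GENERIC_MESSAGE = "If this email is valid, you'll receive a verification link shortly"


@dataclass(frozen=True)
class SignupResponse:
    status: int
    message: str


def handle_signup(email: str, already_registered: bool, outbox: list) -> SignupResponse:
    """Queue the right email out of band; return the same response either way."""
    if already_registered:
        # Only the inbox owner learns that an account exists (assumed policy).
        outbox.append((email, "account_exists_notice"))
    else:
        outbox.append((email, "verification_link"))
    return SignupResponse(status=202, message=GENERIC_MESSAGE)


outbox = []
new_resp = handle_signup("new@example.com", False, outbox)
dup_resp = handle_signup("taken@example.com", True, outbox)
print(new_resp == dup_resp)  # True
```

Response timing can leak the same information as the message text, so both branches should do comparable work before returning. Queueing the email for a background worker helps with that.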
Implementation Strategy
While the specific implementation varies by framework and language, the pattern is universal:
- Normalize the email at the start of signup
- Check the Bloom filter (O(1) operation)
- If not found, proceed directly to database insert
- If maybe found, query the database to confirm
- Attempt insert with UNIQUE constraint (final guarantee)
- On success, add to Bloom filter for next time
- On failure, respond with privacy-first message (never reveal if email exists)
This three-layer approach—Bloom filter → database query → UNIQUE constraint—catches issues at different scales and provides defense in depth.
The code for this is straightforward in any framework. Check your framework's documentation for email normalization utilities and database constraint handling. The key is the logic flow, not the specific syntax.
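As one possible rendering of that flow, here is a compact sketch using `sqlite3` and a small stand-in filter. `TinyBloom`, `signup`, the `users` table, and the lowercase-only normalization are all placeholders chosen to keep the example self-contained; they are not a reference implementation.

```python
import hashlib
import sqlite3

GENERIC = "If this email is valid, you'll receive a verification link shortly"


class TinyBloom:
    """Compact stand-in Bloom filter (hypothetical helper for this sketch)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str) -> list:
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def signup(conn: sqlite3.Connection, bloom: TinyBloom, raw_email: str) -> tuple:
    """Return (created, client_message); the message is identical either way."""
    email = raw_email.strip().lower()  # normalize (placeholder: lowercase only)
    if bloom.might_contain(email):
        # "Maybe exists": confirm against the source of truth.
        row = conn.execute("SELECT 1 FROM users WHERE email = ?", (email,)).fetchone()
        if row is not None:
            return False, GENERIC
    try:
        with conn:  # the UNIQUE constraint is the final guarantee
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    except sqlite3.IntegrityError:
        bloom.add(email)  # the filter was stale for this email; repair it
        return False, GENERIC
    bloom.add(email)  # remember the email for next time
    return True, GENERIC


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
bloom = TinyBloom()
print(signup(conn, bloom, "Alice@Example.com")[0])  # True
print(signup(conn, bloom, "alice@example.com")[0])  # False
```

The `IntegrityError` branch covers both the concurrent-signup race and a filter that is missing entries, which is why it also adds the email to the filter before returning.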
Production Considerations
Cold Start Problem
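When you first deploy (or restart a server that keeps the filter in memory), the Bloom filter is empty. An empty filter answers "definitely not seen" for every email, including ones already in the database. Existing emails then skip the lookup and are rejected only by the UNIQUE constraint at insert time, and the "no false negatives" guarantee does not hold until the filter contains every stored email.
Solution: On app startup, preload the Bloom filter from the database or a cached state. Most frameworks provide lifecycle hooks (app startup, initialization) where you can load all existing emails into the filter.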
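A preload step can be as small as the sketch below, which streams emails from the database into anything that exposes an `add` method. The `preload_filter` name and the `users` table are assumptions. The demo passes a plain `set` as a stand-in for the filter only so that the snippet runs on its own.

```python
import sqlite3
from typing import Protocol


class SupportsAdd(Protocol):
    def add(self, item: str) -> None: ...


def preload_filter(conn: sqlite3.Connection, bloom: SupportsAdd) -> int:
    """Stream every stored email into the filter; return how many were loaded."""
    count = 0
    # Iterating the cursor streams rows instead of materializing the whole table.
    for (email,) in conn.execute("SELECT email FROM users"):
        bloom.add(email)
        count += 1
    return count


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("c@example.com",)],
)
loaded = set()  # stand-in for a real Bloom filter; any object with add() works
print(preload_filter(conn, loaded))  # 3
```

Until the preload finishes, the server should either hold off on serving signups or treat every filter answer as "maybe exists" and query the database.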
Distributed Systems
For true distributed systems (multiple servers), a per-process in-memory Bloom filter won't work. One server won't know about emails registered on another.
Solution: Use a distributed cache like Redis to store the Bloom filter. All servers then read and update the same filter, which keeps their answers consistent across your infrastructure.
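A rough sketch of a shared filter is shown below. It keeps the bits in a single Redis string and touches them through `setbit` and `getbit`, which the redis-py client provides. The `SharedBloom` class and the key name are invented for this example, and `FakeBitStore` is an in-memory stand-in so the snippet runs without a Redis server. In a real deployment you would pass a `redis.Redis(...)` client instead.

```python
import hashlib


class SharedBloom:
    """Bloom filter whose bits live in one key of a shared bit store."""

    def __init__(self, client, key: str, num_bits: int, num_hashes: int) -> None:
        self.client, self.key = client, key
        self.num_bits, self.num_hashes = num_bits, num_hashes

    def _positions(self, item: str) -> list:
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.client.setbit(self.key, pos, 1)

    def might_contain(self, item: str) -> bool:
        return all(self.client.getbit(self.key, pos) for pos in self._positions(item))


class FakeBitStore:
    """In-memory stand-in exposing the same setbit/getbit calls (demo only)."""

    def __init__(self) -> None:
        self.bits = {}

    def setbit(self, key: str, offset: int, value: int) -> int:
        old = self.bits.get((key, offset), 0)
        self.bits[(key, offset)] = value
        return old

    def getbit(self, key: str, offset: int) -> int:
        return self.bits.get((key, offset), 0)


store = FakeBitStore()  # plays the role of the shared Redis instance
server_a = SharedBloom(store, "signup:emails", num_bits=1_000_000, num_hashes=7)
server_b = SharedBloom(store, "signup:emails", num_bits=1_000_000, num_hashes=7)
server_a.add("johndoe@gmail.com")
print(server_b.might_contain("johndoe@gmail.com"))  # True
```

This version makes one round trip per bit, so a real deployment would probably batch the calls in a pipeline or use the RedisBloom module's `BF.ADD` and `BF.EXISTS` commands, which do the hashing on the server.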
Handling Scale Progression
As your service grows:
What Google Does (Conceptually)
Google's signup system for Gmail, YouTube, and other services likely involves:
- Email normalization at the edge (geographically distributed)
- Distributed Bloom filters in every region
- Spanner-like database (globally consistent, multi-region)
- Rate limiting (prevent abuse)
- CAPTCHA (bot protection)
- Privacy-first responses (no email enumeration)
- Async pipelines (send verification emails in background)
- Monitoring (track duplicate attempts, patterns)
The Bloom filter is just one piece, but it's a critical optimization.
Key Takeaways
- Bloom filters are perfect for "maybe exists" checks — they're fast and memory-efficient
- Email normalization prevents duplicate tricks — normalize before every check
- The UNIQUE constraint is non-negotiable — it's your guarantee against duplicates
- Database is the source of truth — Bloom filter is just a speed layer
- Race conditions are real — handle constraint violations gracefully
- Security matters — don't leak whether an email is registered
- Cold start is a trap — preload the Bloom filter on startup