
If Your API Is Stateless, Why Do Users Keep Getting Logged Out?

System Design, Authentication, Distributed Systems, JWT, DevOps, Debugging


The invisible session bugs hiding in plain sight


You've built a stateless API. You're proud of it. Each request carries a JWT, no server-side session storage, horizontally scalable to infinity. Textbook.

Then a user files a ticket: "I keep getting randomly logged out."

You check the logs. The token was valid. The server responded correctly. Nothing looks wrong.

Welcome to one of the most misunderstood classes of bugs in distributed systems.


First: What "Stateless" Actually Means

A stateless API means the server doesn't remember you between requests. Every request must carry all the information needed to authenticate and process it — typically in the form of a JWT (JSON Web Token) stored client-side.

The promise: any server in your fleet can handle any request. No sticky sessions. Pure horizontal scale.

The catch: stateless on the server doesn't mean stateless in the system. State still exists — it just moved somewhere else. And that "somewhere else" is full of failure modes.


The Real Culprits Behind Random Logouts

1. Token Expiry — The Obvious One (That's Still Mishandled)

JWTs have an exp claim. When it passes, the token is invalid. If your refresh logic is flawed, users get logged out.

But here's what makes it "random":

  • Clock skew between servers. Your auth server says it's 12:00:00. Your API server thinks it's 11:59:58. A token that just expired on one server is still valid on another — and vice versa. The same user, hitting different servers, gets different results.
  • Tab/background behavior. A user leaves a tab open for 45 minutes, comes back, and clicks something. The refresh request silently fails (network hiccup, server restart), and instead of retrying, the client throws them to the login screen.
  • Refresh race conditions. Two requests fire simultaneously, and both detect an expired token and try to refresh. If refresh tokens are rotated on use, one refresh succeeds and invalidates the old refresh token; the other presents that now-invalidated token, gets a 401, and the user is logged out.
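The clock-skew bullet is easy to reproduce with nothing but arithmetic. The sketch below assumes a made-up exp value and a 2-second skew, applies the usual JWT rule (reject once the server's clock reaches exp) on two servers, then repeats the check with a leeway. `check_exp` is a hypothetical helper, not part of any JWT library; it also logs the `now - exp` delta on every rejection, which is the number worth having in your logs when chasing these tickets.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(name)s: %(message)s")
log = logging.getLogger("auth")

def check_exp(exp: int, now: float, leeway: float = 0.0, server: str = "api") -> bool:
    """Return True if a token with this exp claim is still accepted at `now`.

    Applies the usual JWT rule (reject once now >= exp), with `leeway`
    seconds of tolerance added to absorb clock skew between servers.
    """
    if now >= exp + leeway:
        # Log how far past exp this server's clock is. Many rejections with a
        # delta of only a few seconds are a strong hint of clock skew.
        log.info("rejected on %s: now - exp = %+.1fs (leeway %.0fs)",
                 server, now - exp, leeway)
        return False
    return True

exp = 1_700_000_000   # hypothetical exp claim (Unix seconds)
auth_now = exp + 1.0  # auth server's clock: 1 s past exp
api_now = exp - 1.0   # API server's clock: 2 s behind the auth server

# No leeway: the verdict depends on which server handles the request.
print(check_exp(exp, auth_now, server="auth"))  # False
print(check_exp(exp, api_now, server="api"))    # True

# 30 s leeway: both servers still accept the token inside the skew window.
print(check_exp(exp, auth_now, leeway=30, server="auth"))  # True
print(check_exp(exp, api_now, leeway=30, server="api"))    # True
```

Leeway doesn't remove the disagreement between clocks; it pushes it 30 seconds later, by which point a client that refreshes proactively should already hold a new token.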

Fix: Implement a single-flight refresh (at most one refresh in flight at a time; other requests wait for its result). Validate exp with a clock skew leeway of 30–60 seconds, and have clients refresh shortly before exp instead of waiting for a 401. Log every token rejection with the delta between exp and server time.
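The single-flight refresh from the fix above can be sketched with asyncio: the first caller that finds the token expired starts the refresh, and every concurrent caller awaits that same task instead of starting its own. `TokenManager` and `refresh_fn` are hypothetical names; `refresh_fn` stands in for whatever coroutine calls your token endpoint.

```python
import asyncio
from typing import Awaitable, Callable, Optional

class TokenManager:
    """Single-flight token refresh: concurrent callers share one refresh."""

    def __init__(self, refresh_fn: Callable[[], Awaitable[str]]):
        self._refresh_fn = refresh_fn  # hypothetical call to your token endpoint
        self._inflight: Optional[asyncio.Task] = None
        self.access_token: Optional[str] = None

    async def refresh(self) -> str:
        # No await between the check and the assignment, so on a single event
        # loop only the first caller creates the task; the rest reuse it.
        if self._inflight is None:
            self._inflight = asyncio.create_task(self._do_refresh())
        # shield(): one caller being cancelled must not cancel the shared refresh.
        return await asyncio.shield(self._inflight)

    async def _do_refresh(self) -> str:
        try:
            self.access_token = await self._refresh_fn()
            return self.access_token
        finally:
            # Clear the slot so the next expiry starts a new refresh.
            self._inflight = None

async def demo():
    calls = 0

    async def fake_refresh() -> str:
        # Hypothetical stand-in for a POST to the token endpoint.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"access-token-{calls}"

    manager = TokenManager(fake_refresh)
    # Five requests discover the expired token at the same moment.
    results = await asyncio.gather(*(manager.refresh() for _ in range(5)))
    return calls, results

if __name__ == "__main__":
    calls, results = asyncio.run(demo())
    print(calls, results)  # one refresh call, shared by all five requests
```

If the refresh itself fails, every waiter sees the same exception, so the client makes one decision (retry or send the user to login) instead of five conflicting ones.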


2. Multiple Servers, Multiple Signing Keys

This one is subtle and catastrophic.

If you're running multiple instances of your auth service, every instance must sign with the same secret (for HS256) or the same private key (for RS256/ES256), and every service that validates tokens must hold that same secret or the matching public key.

What happens in practice:

  • A deployment goes wrong. The environment variable for JWT_SECRET isn't set, so it falls back to a default, a random value, or an empty string.
  • Keys are rotated, but the old key isn't kept in a validation set during the transition period.
  • In a multi-region setup, secrets aren't synced and one region generates tokens the other can't verify.

The result: tokens issued by Server A are rejected by Server B. With two instances behind a round-robin load balancer, users get valid responses about half the time and "random" 401s the other half.

Fix: Use asymmetric signing (RS256/ES256): only the auth service holds the private key, and every validator gets the public key. Store secrets in a secrets manager (AWS Secrets Manager, Vault, GCP Secret Manager), never in environment variables set per-instance. During key rotation, publish a JWKS endpoint that lists both the old and new public keys, identified by the kid header, so tokens signed with the old key stay valid through the rollover window.
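Key rotation comes down to looking up the verification key by the token's `kid` header and keeping the old key in the set until tokens signed with it have expired. The sketch below shows that lookup using only the Python standard library, so it uses HS256 (HMAC-SHA256) as a stand-in; with RS256/ES256 the key set would hold public keys fetched from your JWKS endpoint, and you would use a maintained JWT library rather than hand-rolled parsing. `sign`, `verify`, the key values, and the kid names `k1`/`k2` are all hypothetical, and the sketch deliberately skips the exp check.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(text: str) -> bytes:
    # Restore the padding that b64url() stripped.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign(claims: dict, kid: str, key: bytes) -> str:
    """Issue an HS256 token whose header names the signing key via `kid`."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    signature = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

def verify(token: str, keys: dict) -> dict:
    """Verify against a key set of {kid: key}; raises ValueError on failure."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected alg")  # never let the token pick the algorithm
    key = keys.get(header.get("kid"))
    if key is None:
        raise ValueError(f"unknown kid {header.get('kid')!r}")
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")  # e.g. an instance with a different secret
    return json.loads(b64url_decode(payload_b64))

old_key, new_key = b"old-secret", b"new-secret"
token = sign({"sub": "user-1"}, "k1", old_key)   # issued before rotation
rollover_keys = {"k1": old_key, "k2": new_key}   # both keys during the window
print(verify(token, rollover_keys))              # {'sub': 'user-1'}
```

Dropping `k1` from the set too early reproduces the "random" 401s: tokens still in circulation name a kid the validator no longer knows. A token signed by a misconfigured instance (right kid, wrong secret) fails with `bad signature` instead.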


3. The Load Balancer Terminating Connections Mid-Request

Your API is stateless. Your load balancer is not.

If you're using connection-level load balancing (common with HTTP/1.1 keep-alive), a user's requests may be "sticky" to a server — until that server is restarted, scaled down, or health-checked out of rotation. When that happens, in-flight requests die, and depending on how your client handles it, the user gets logged out instead of seeing a retry.

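Fix: Ensure your client retries idempotent requests on connection failure before surfacing an error, and never treats a network error as an authentication failure. Use HTTP/2 where possible: requests are multiplexed over one connection, and a server that is shutting down sends a GOAWAY frame telling the client which requests were not processed and can be safely retried. Make sure your load balancer drains connections before pulling a node.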
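The retry rule is easier to see in code: a dropped connection is a transport failure, not an authentication failure, so only a real 401 should end the session. In this minimal sketch `send` is a hypothetical transport function that returns `(status, body)` and raises Python's built-in `ConnectionError` when the connection drops; real HTTP clients raise their own exception types, which you would catch instead. `send_with_retry` and `AuthExpired` are likewise hypothetical names.

```python
import time

# Methods that are safe to repeat (the idempotent methods in RFC 9110).
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}

class AuthExpired(Exception):
    """Raised only for a genuine 401: the one case that should end the session."""

def send_with_retry(send, method: str, url: str, attempts: int = 3, backoff: float = 0.2):
    """Call `send(method, url)`, retrying idempotent requests on dropped connections.

    `send` is a hypothetical transport returning (status, body) and raising
    the built-in ConnectionError when the connection is reset mid-request.
    """
    for attempt in range(1, attempts + 1):
        try:
            status, body = send(method, url)
        except ConnectionError:
            # A dead connection says nothing about the token. Retry if safe;
            # otherwise surface a network error (not a logout) to the caller.
            if method.upper() not in IDEMPOTENT_METHODS or attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # linear backoff between attempts
            continue
        if status == 401:
            raise AuthExpired(url)
        return status, body
```

A full client would also attempt a single-flight refresh on the first 401 before giving up; the point here is only that `ConnectionError` and `AuthExpired` take different paths.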


4. Cookie Misconfiguration Across Subdomains

If you're storing tokens in cookies (rather than localStorage), the Domain, SameSite, and Secure attributes govern exactly when that cookie is sent.

Common traps:

  • Domain=api.yourapp.com: the cookie is only sent to api.yourapp.com (and its subdomains), never to app.yourapp.com. Users are authenticated on one subdomain and invisible on another.
  • Missing Secure flag: the cookie also travels over plain HTTP, and browsers reject a SameSite=None cookie that lacks Secure outright, so a cookie meant for cross-site requests is silently dropped in production.
  • SameSite=Strict on a site with third-party OAuth redirects: the cookie is not sent on the cross-site redirect back to your app, so the user completes OAuth and immediately appears logged out.

Fix: Set Domain=yourapp.com to cover yourapp.com and all its subdomains (a leading dot, as in .yourapp.com, is ignored by modern browsers, so it is harmless but unnecessary). Always set Secure in production. Use SameSite=Lax for most cases; only use Strict if you've explicitly tested your auth flows against it.
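Python's standard `http.cookies` module (3.8+ for the `samesite` attribute) is enough to show the recommended attributes together, along with a simplified version of the RFC 6265 domain-match rule that decides which hosts receive the cookie. The cookie name `session` and the domain `yourapp.com` are placeholders, and `domain_matches` ignores the IP-address and public-suffix rules that browsers also enforce.

```python
from http.cookies import SimpleCookie

def auth_cookie_header(token: str, domain: str = "yourapp.com") -> str:
    """Build a Set-Cookie header value with the attributes discussed above."""
    cookie = SimpleCookie()
    cookie["session"] = token          # "session" is a placeholder cookie name
    morsel = cookie["session"]
    morsel["domain"] = domain          # covers the domain and all its subdomains
    morsel["path"] = "/"
    morsel["secure"] = True            # only sent over HTTPS
    morsel["httponly"] = True          # not readable from JavaScript
    morsel["samesite"] = "Lax"         # still sent on top-level GET navigations
    return morsel.OutputString()

def domain_matches(host: str, cookie_domain: str) -> bool:
    """Simplified RFC 6265 domain-match: is a cookie with this Domain sent to host?"""
    host = host.lower()
    cookie_domain = cookie_domain.lower()
    if cookie_domain.startswith("."):
        cookie_domain = cookie_domain[1:]  # a leading dot is ignored
    return host == cookie_domain or host.endswith("." + cookie_domain)

print(auth_cookie_header("header.payload.signature"))
print(domain_matches("app.yourapp.com", "yourapp.com"))      # True
print(domain_matches("app.yourapp.com", "api.yourapp.com"))  # False
print(domain_matches("app.yourapp.com", ".yourapp.com"))     # True
```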


5. Token Blacklisting Without a Shared Store

Here's an ironic one: you implement "stateless" logout by blacklisting the token in a local in-memory store on the server that received the logout request.

User logs out → Server 3 marks token as invalid. User's next request → hits Server 1 → token is still valid → user is... logged in again?

Or the inverse: the invalidation logic is too broad (for example, it revokes by user ID instead of by the token's jti), and the user can't log back in because every new token is rejected too.

Fix: If you need server-side token invalidation (for logout, password changes, account suspension), you need a shared store that every instance checks; Redis is a common choice. Alternatively, use short-lived access tokens (5–15 minutes) paired with refresh token rotation, so "logout" means revoking the refresh token server-side. The access token keeps working until it expires, which is why it must be short-lived.
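A minimal sketch of the shared-store approach: the denylist is keyed by the token's `jti` claim, and each entry lives only until the token's own `exp`, so the store never grows without bound. A plain dict stands in for Redis here, which is an assumption made to keep the example self-contained; with Redis the equivalent write would be a `SET` with an `EX` expiry. `SharedDenylist` and `ApiServer` are hypothetical names.

```python
import time

class SharedDenylist:
    """Denylist of revoked token IDs (jti), shared by every API instance.

    A dict stands in for Redis here; with Redis, revoke() would be a SET of
    a key like "denylist:<jti>" with an EX expiry of (exp - now) seconds.
    """

    def __init__(self, clock=time.time):
        self._revoked = {}  # jti -> the token's exp (Unix seconds)
        self._clock = clock

    def revoke(self, jti: str, exp: float) -> None:
        # Only keep the entry until the token would have expired anyway.
        if exp > self._clock():
            self._revoked[jti] = exp

    def is_revoked(self, jti: str) -> bool:
        exp = self._revoked.get(jti)
        if exp is None:
            return False
        if self._clock() >= exp:
            del self._revoked[jti]  # lazy cleanup, mimicking key expiry
            return False
        return True

class ApiServer:
    """A hypothetical API instance that checks exp and the denylist."""

    def __init__(self, name: str, denylist: SharedDenylist, clock=time.time):
        self.name = name
        self.denylist = denylist
        self.clock = clock

    def logout(self, claims: dict) -> None:
        self.denylist.revoke(claims["jti"], claims["exp"])

    def authorize(self, claims: dict) -> bool:
        if self.clock() >= claims["exp"]:
            return False
        return not self.denylist.is_revoked(claims["jti"])
```

With one `SharedDenylist` passed to every `ApiServer`, a logout handled by any instance is honored by all of them. Giving each instance its own `SharedDenylist()` instead reproduces the bug from the logout example above: the instance that handled the logout rejects the token, and every other instance still accepts it.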


6. Mobile Apps and OS-Level Token Eviction

On mobile, the OS can kill your app under memory pressure and purge cache directories when storage runs low. If your token lives only in memory or in a cache location, rather than in platform-backed secure storage (iOS Keychain, Android Keystore), it can disappear.

Additionally, OS updates sometimes clear app sandboxes. Users wake up after a phone update, open your app, and they're logged out. No server involved.

Fix: Store tokens in platform-secure storage. On iOS: Keychain with kSecAttrAccessibleAfterFirstUnlock. On Android: EncryptedSharedPreferences or the Keystore. Test what happens to stored tokens after OS updates.


7. CDN and Reverse Proxy Caching 401 Responses

If a 401 response gets cached — even briefly — every user hitting that CDN node gets logged out until the cache expires.

This is rare but memorable when it happens. One bad deploy that returns a 401 for one second, cached by a CDN with a 60-second TTL, and thousands of users are logged out for a minute.

Fix: Ensure auth endpoints and any endpoint that returns WWW-Authenticate headers have Cache-Control: no-store, no-cache. Audit your CDN rules. Never cache 4xx responses without an explicit, short TTL.
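One way to enforce this at the origin is a small WSGI middleware that rewrites `Cache-Control` on 401s, on any response carrying `WWW-Authenticate`, and on auth routes. This is a sketch: `no_store_auth_responses` is a hypothetical name, the `/auth/` prefix is an assumed route layout, and a CDN rule that ignores origin headers will still override it, which is why the CDN audit matters.

```python
# Hypothetical route prefix for login/refresh/logout endpoints.
AUTH_PATH_PREFIXES = ("/auth/",)
NO_STORE = "no-store, no-cache"

def no_store_auth_responses(app):
    """WSGI middleware that marks auth-related responses as uncacheable."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            header_names = {name.lower() for name, _ in headers}
            auth_related = (
                status.startswith("401")
                or "www-authenticate" in header_names
                or path.startswith(AUTH_PATH_PREFIXES)
            )
            if auth_related:
                # Drop any cache directive the app set, then forbid storage.
                headers = [(n, v) for n, v in headers if n.lower() != "cache-control"]
                headers.append(("Cache-Control", NO_STORE))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped
```

Wrapping the application once (`application = no_store_auth_responses(application)`) covers every route, so a new endpoint that starts returning 401s can't accidentally inherit a cacheable default.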


The lesson isn't that stateless APIs are broken. They're not. The lesson is:

Stateless means no server holds your session. It doesn't mean the system has no state.

State lives in the token, in the cookie jar, in the signing keys, in the clock, in the CDN cache, in the OS keychain. Every one of those is a failure domain.

Build your auth infrastructure with that map in your head, and the "random" logouts start to look very, very deterministic.


If this helped you track down a bug, share it with whoever filed the ticket. They deserve to know why.