What OpenAI’s PostgreSQL Choices Reveal About Pragmatic Scaling

OpenAI recently shared details on how they scale PostgreSQL to power ChatGPT for over 800 million users. When you hear numbers like that, you probably imagine a complex, distributed, sharded database architecture. You might expect them to be using something like Spanner, CockroachDB, or Cassandra.

The reality is surprisingly simple: They use a single Postgres primary with nearly 50 read replicas.

It sounds counter-intuitive. How does a single database node handle traffic for one of the most popular apps in the world? The answer provides a fascinating look at how real-world systems evolve versus how we design them on a whiteboard.

Systems are Grown, Not Built

To understand their architecture, you have to look at their history. If you were designing a system for 800 million users from scratch today, you probably wouldn't choose a single-primary architecture. You would likely pick a distributed database that handles horizontal scaling natively to avoid the single point of failure.

But OpenAI didn't start with 800 million users. They started small, used a standard Postgres setup, and then experienced unprecedented hyper-growth. Migrating a massive, active database to a new technology while trying to keep the product running is incredibly risky and expensive. Their current architecture isn't necessarily the "perfect end state." It is a successful example of extending the runway of your existing technology. They optimized what they had because they had to.
Don’t paralyze yourself trying to design the perfect "end-game" architecture on Day 1. Start with what works for today's scale, and keep enough flexibility to rewrite it when you are lucky enough to break it.
Cost is a quiet constraint throughout this journey.

For organizations with deep pockets, early rewrites or heavy over-provisioning may be acceptable. For most growing teams, extending a familiar system reduces migration risk, limits operational overhead, and buys time to invest engineering effort where it matters most. 

One Size Does Not Fit All

The team at OpenAI didn't try to force Postgres to do everything. They recognized that while Postgres is excellent for relational data, it struggles with massive write volumes due to its internal design, specifically Multi-Version Concurrency Control (MVCC). Every update in Postgres creates a "dead tuple" that must be cleaned up by the VACUUM process. At their scale, heavy writes lead to a "vacuum death spiral," saturating CPU and I/O.
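To make that pressure visible, here is a minimal monitoring sketch (the connection string and threshold are placeholders, not anything OpenAI has published) that reads Postgres's built-in pg_stat_user_tables view to flag tables whose dead tuples are accumulating:

```python
# Minimal monitoring sketch (assumed DSN and threshold, not OpenAI's setup):
# flag tables whose dead tuples are piling up faster than autovacuum clears them.
import psycopg2

BLOAT_QUERY = """
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > %(min_dead)s
ORDER BY n_dead_tup DESC
LIMIT 20;
"""

def report_dead_tuples(dsn: str, min_dead: int = 100_000) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BLOAT_QUERY, {"min_dead": min_dead})
        for relname, live, dead, last_autovacuum in cur.fetchall():
            print(f"{relname}: {dead} dead vs {live} live tuples "
                  f"(last autovacuum: {last_autovacuum})")

if __name__ == "__main__":
    report_dead_tuples("postgresql://localhost/appdb")
```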

To mitigate this, they identified write-heavy, shardable workloads like audit logs or temporary state and actively migrated them to CosmosDB. This was a crucial strategic decision: by offloading the heaviest writes to a different system, they kept the Postgres primary healthy for the core application logic.
A hybrid architecture is often more stable than forcing a single system to handle workloads it was not designed for.
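As a rough illustration of that split (not OpenAI's actual code; the document-store client and the table and container names are invented), write-heavy, append-only data goes to a document store while core relational data stays on Postgres:

```python
# Illustrative routing only: an abstract document-store client stands in for
# Cosmos DB, and the table/container names are invented for this example.
from typing import Protocol
import psycopg2

class DocumentStore(Protocol):
    def insert(self, container: str, doc: dict) -> None: ...

def record_audit_event(doc_store: DocumentStore, event: dict) -> None:
    # Append-only, write-heavy, easily partitioned: keep it off the Postgres primary.
    doc_store.insert("audit_logs", event)

def update_user_profile(pg_dsn: str, user_id: int, display_name: str) -> None:
    # Relational, read-mostly core data stays in Postgres.
    with psycopg2.connect(pg_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE users SET display_name = %s WHERE id = %s",
            (display_name, user_id),
        )
```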

Workload and Consistency

This specific setup works for ChatGPT because the workload is heavily read-biased. Users send a prompt (a write), but the system performs many lookups for context, history, and settings (reads). Because the ratio heavily favors reads, they were able to scale by adding more read replicas—currently operating around 50 of them across multiple regions.
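A minimal read/write-splitting sketch looks something like this, assuming one primary DSN and a handful of replica DSNs (all placeholders); in production the connections would come from a pooler rather than being opened per call:

```python
# Read/write splitting sketch. The DSNs are placeholders; in production these
# connections would come from a pooler rather than being opened per call.
import itertools
import psycopg2

PRIMARY_DSN = "postgresql://primary.internal/appdb"
REPLICA_DSNS = [
    "postgresql://replica-1.internal/appdb",
    "postgresql://replica-2.internal/appdb",
]
_replicas = itertools.cycle(REPLICA_DSNS)

def run_read(query: str, params=()):
    # Reads are spread across replicas and may see slightly stale data.
    conn = psycopg2.connect(next(_replicas))
    try:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchall()
    finally:
        conn.close()

def run_write(query: str, params=()):
    # Anything that mutates state goes to the single writable primary.
    conn = psycopg2.connect(PRIMARY_DSN)
    try:
        with conn, conn.cursor() as cur:   # commit on success, rollback on error
            cur.execute(query, params)
    finally:
        conn.close()
```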

However, even replicating data has limits. The primary node has to stream the Write Ahead Log (WAL) to every single replica, which consumes significant network bandwidth and CPU. To solve this, they are working on Cascading Replication, where the primary streams data to a few "intermediate" replicas, which then relay it to the rest.
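A quick back-of-the-envelope calculation shows why cascading helps. The numbers below are illustrative assumptions, not OpenAI's figures:

```python
# Back-of-the-envelope sketch (illustrative numbers only): outbound WAL
# bandwidth on the primary with flat fan-out versus cascading replication.
WAL_RATE_MB_S = 100    # assumed WAL generation rate on the primary
REPLICAS = 50          # total read replicas
INTERMEDIATES = 5      # relay replicas fed directly by the primary (part of the fleet)

flat_primary_out = WAL_RATE_MB_S * REPLICAS              # primary feeds everyone
cascading_primary_out = WAL_RATE_MB_S * INTERMEDIATES    # primary feeds relays only
per_relay_out = WAL_RATE_MB_S * (REPLICAS - INTERMEDIATES) / INTERMEDIATES

print(f"Flat fan-out, primary outbound:   {flat_primary_out} MB/s")
print(f"Cascading, primary outbound:      {cascading_primary_out} MB/s")
print(f"Cascading, per-relay outbound:    {per_relay_out:.0f} MB/s")
```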
You must understand your read/write ratio. If your application is read-heavy, you can stay on a single writable primary far longer than you might think, but only if you are willing to accept the consistency trade-offs involved, such as replica lag on reads.

Fundamentals You Can't Ignore

At this volume, you cannot rely on hardware alone; you have to obsess over the fundamentals. The team focuses heavily on preventing "OLTP anti-patterns." They identified that complex joins (like one joining 12 tables!) were causing outages, so they moved that logic to the application layer. They also aggressively police their Object-Relational Mapping (ORM) code, often rewriting inefficient queries into raw SQL.
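Here is a hypothetical example of "moving a join to the application layer": instead of one wide multi-table join, the code performs two narrow, indexed lookups and stitches the result together itself (the table and column names are invented for illustration):

```python
# Hypothetical example: replace one wide join with two narrow, indexed lookups
# and assemble the result in application code. Table and column names are invented.
import psycopg2

def conversation_with_settings(dsn: str, conversation_id: int) -> dict:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, user_id, title FROM conversations WHERE id = %s",
                (conversation_id,),
            )
            conv_id, user_id, title = cur.fetchone()

            cur.execute(
                "SELECT theme, model_preference FROM user_settings WHERE user_id = %s",
                (user_id,),
            )
            theme, model_pref = cur.fetchone()
    finally:
        conn.close()

    # The "join" happens here, in application code, not inside the database.
    return {
        "id": conv_id,
        "title": title,
        "settings": {"theme": theme, "model_preference": model_pref},
    }
```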

They are equally strict about schema changes. They enforce a 5-second timeout on any schema migration. If a change (like adding a column) takes longer than 5 seconds, it is aborted to prevent locking the table and taking down the site.
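In Postgres, that kind of guard can be expressed with the standard lock_timeout and statement_timeout settings. A minimal sketch, assuming psycopg2 and a hypothetical DDL statement:

```python
# Migration guard sketch using standard Postgres timeout settings; the DDL
# statement and DSN are hypothetical.
import psycopg2
from psycopg2 import errors

def run_guarded_migration(dsn: str, ddl: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SET lock_timeout = '5s'")       # give up if we can't get the lock
            cur.execute("SET statement_timeout = '5s'")  # give up if the DDL itself is slow
            cur.execute(ddl)
        conn.commit()
    except (errors.QueryCanceled, errors.LockNotAvailable):
        conn.rollback()
        print("Migration exceeded the 5-second budget and was aborted.")
    finally:
        conn.close()

# Adding a nullable column is normally near-instant; anything slower gets aborted.
# run_guarded_migration(dsn, "ALTER TABLE conversations ADD COLUMN archived boolean")
```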
Cloud resources can buy time, but they won’t compensate for inefficient queries or poor data access patterns.

Everything Matters at Scale

When you operate at this level, infrastructure details that usually don't matter become critical "hidden killers."

Connection Pooling: You cannot simply open a new database connection for every user request. OpenAI uses PgBouncer to pool connections. This single change dropped their average connection latency from 50ms to 5ms—a 10x improvement.
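PgBouncer is a standalone pooler that sits in front of Postgres, but the principle is easy to show with an in-process pool: pay the connection handshake once and reuse connections instead of reconnecting per request. The DSN and pool sizes below are placeholders:

```python
# In-process pooling analogue (OpenAI uses PgBouncer, a standalone pooler in
# front of Postgres, but the principle is the same). DSN and sizes are placeholders.
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(
    minconn=5,
    maxconn=50,
    dsn="postgresql://localhost/appdb",  # with PgBouncer, the app points at its port instead
)

def fetch_user(user_id: int):
    conn = pool.getconn()                # reuse an already-open connection
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()
    finally:
        pool.putconn(conn)               # hand it back instead of closing it
```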

The Thundering Herd: When a cache item expires, thousands of requests might hit the database simultaneously for the same key. They implemented Cache Leasing (or locking), ensuring that only one request hits the database to refresh the cache while the others wait.
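A single-process sketch of the leasing idea (in a real deployment the lease would live in the shared cache itself, for example as a set-if-not-exists key with a short TTL):

```python
# Single-process sketch of cache leasing. In a shared cache the lease would be
# a set-if-not-exists key with a short TTL; here a local lock plays that role.
import threading
import time

_cache: dict[str, tuple[float, object]] = {}   # key -> (expires_at, value)
_lease = threading.Lock()

def get_with_lease(key: str, ttl: float, load_from_db):
    entry = _cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                          # fresh hit, database untouched

    with _lease:                                 # only one refresher at a time
        entry = _cache.get(key)
        if entry and entry[0] > time.time():     # someone refreshed while we waited
            return entry[1]
        value = load_from_db(key)                # exactly one database hit for the herd
        _cache[key] = (time.time() + ttl, value)
        return value
```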

Rate Limiting: They implemented rate limits at every layer (application, proxy, and even the database query level) to prevent a single abusive user or a "retry storm" from taking down the platform.
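A minimal token-bucket limiter for the application layer might look like this; the rates and the in-memory bucket store are illustrative only:

```python
# Application-layer token-bucket sketch. Rates and the in-memory bucket store
# are illustrative; a real deployment would keep buckets in a shared store.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # reject or back off; never retry in a tight loop

_buckets: dict[str, TokenBucket] = {}

def check_rate_limit(user_id: str) -> bool:
    bucket = _buckets.setdefault(user_id, TokenBucket(rate_per_s=5, burst=20))
    return bucket.allow()
```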
The database isn't just about storage; it is also about how clients connect, retry, and fail.

Resilience Through Isolation

When you run a single primary, you are effectively running a Single Point of Failure. Since they couldn't easily shard the database to remove this risk, they focused on isolating it. They explicitly split traffic into "High Priority" and "Low Priority" tiers and route them to separate database instances.

This ensures that a heavy, non-critical background job (like an analytics query) doesn't compete for resources with a user trying to send a message. If the low-priority tier gets overloaded, the user experience remains unaffected.
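A sketch of that routing, with hypothetical endpoints for the two tiers:

```python
# Priority-tier routing sketch; the endpoints and tier labels are placeholders.
import psycopg2

TIER_DSNS = {
    "high": "postgresql://chat-critical.internal/appdb",  # user-facing requests
    "low":  "postgresql://chat-batch.internal/appdb",     # analytics, background jobs
}

def connect_for(tier: str):
    # Unknown workloads default to the low-priority tier so they can never
    # starve user-facing traffic.
    return psycopg2.connect(TIER_DSNS.get(tier, TIER_DSNS["low"]))
```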
If you cannot eliminate a single point of failure, you must control its blast radius.

Technical Debt Should Be Intentional

Perhaps the most interesting insight for senior engineers is how they handle technical debt. They effectively "froze" the complexity of their monolith. They have a strict rule: No new tables are allowed in the main Postgres cluster for new features.

If a team wants to build a new feature that requires storage, they must use a separate data store (like CosmosDB). This ensures that the legacy database stops growing in complexity, preventing it from becoming an unmanageable "distributed monolith."
Sometimes the right strategy isn't a rewrite, but containment. Stop adding new complexity to old systems and force new features to use modern alternatives.

Summary

OpenAI’s journey shows that large-scale systems are rarely built in one step. They evolve through careful trade-offs, operational discipline, and constraint-aware decisions.

Real engineering is not about predicting the future perfectly. It is about making careful, reversible decisions in the present.

These views are my own and reflect my interpretation of publicly shared information.
Follow me on LinkedIn for more writing on pragmatic system design, databases, and engineering at scale: Rathish Kumar B
