An Aurora Deep Dive Series by Rathish Kumar B - Part 2
Transaction Flow in Traditional Database Systems
The following diagram illustrates this entire process, tracing the path of a single write transaction through the tightly-coupled layers of a monolithic database engine.
Transaction Flow in Traditional Database Systems
To see how these layers work in practice, let's trace the lifecycle of a single SQL statement using a concrete example: updating an account balance.
UPDATE accounts SET balance = balance - 5000 WHERE account_id = 101;
SQL Parsing & Planning
- The SQL processor first checks for valid syntax.
- Then, it generates an execution plan. If account_id is indexed, the optimizer chooses an index scan over a full table scan.
- This step is lightweight, but the quality of the execution plan has a huge impact on performance.
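To see the planner's choice concretely, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL/PostgreSQL (the table and values simply mirror the running example, and EXPLAIN QUERY PLAN is SQLite's analogue of EXPLAIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (101, 20000)")

# Because account_id is the (indexed) primary key, the planner reports an
# index search rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "UPDATE accounts SET balance = balance - 5000 WHERE account_id = 101"
).fetchall()
for row in plan:
    print(row)   # e.g. (..., 'SEARCH accounts USING INTEGER PRIMARY KEY (rowid=?)')
```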
Transaction Manager
- A transaction ID is assigned.
- Locks are acquired (e.g., exclusive row lock on account #101).
- The system enforces ACID guarantees: Atomicity, Consistency, Isolation, Durability.
A transaction has a clear start (an explicit BEGIN or an implicit one) and a clear end (COMMIT or ROLLBACK). Locking ensures no other transaction can modify this row until the current one completes, preventing dirty reads and write conflicts.
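Here is a minimal sketch of that lifecycle, again using sqlite3 so it runs as-is (SQLite locks at the database level rather than per row, but the begin/commit/rollback flow is the same):

```python
import sqlite3

conn = sqlite3.connect("bank.db", isolation_level=None)  # we issue BEGIN/COMMIT ourselves
conn.execute("CREATE TABLE IF NOT EXISTS accounts (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES (101, 20000)")

try:
    conn.execute("BEGIN IMMEDIATE")  # explicit start; takes a write lock up front
    conn.execute("UPDATE accounts SET balance = balance - 5000 WHERE account_id = 101")
    conn.execute("COMMIT")           # the change becomes permanent only here
except sqlite3.Error:
    conn.execute("ROLLBACK")         # atomicity: on any failure, all work is undone
```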
Buffer Pool / Caching
- The database works with data in fixed-size memory blocks called pages.
- When the database needs a page, it first checks the buffer pool.
- If the page is there (a cache hit), it's read instantly from fast RAM.
- If the page isn't there (a cache miss), the system must perform a much slower read from the disk to fetch it into the cache.
- The page for account #101 is loaded into the pool, and the UPDATE happens in-memory, marking the page as "dirty."
By manipulating pages in this high-speed staging area, the database can perform operations thousands of times faster than if every access touched disk. But how are these fast, in-memory changes made safe in case of a crash? This is where the strict Write-Ahead Logging protocol comes into play. The diagram below illustrates the Buffer Pool in action, showing how database pages are fetched from disk, updated in memory, and eventually written back.
Database Buffer Pool. Source: CMU Database Systems
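The mechanics are easy to see in a toy version. The sketch below (a hypothetical BufferPool class, not any real engine's implementation) shows the hit/miss path, in-memory updates, the dirty flag, and LRU eviction:

```python
from collections import OrderedDict

class BufferPool:
    """Toy page cache: check memory first, fall back to disk, evict least-recently-used."""

    def __init__(self, capacity, disk):
        self.capacity = capacity       # number of pages that fit in RAM
        self.disk = disk               # stand-in for the data files: page_id -> page contents
        self.pages = OrderedDict()     # page_id -> (page contents, dirty flag)

    def get_page(self, page_id):
        if page_id in self.pages:                  # cache hit: served from fast RAM
            self.pages.move_to_end(page_id)
            return self.pages[page_id][0]
        page = self.disk[page_id]                  # cache miss: much slower disk read
        self._put(page_id, page, dirty=False)
        return page

    def update_page(self, page_id, new_contents):
        self.get_page(page_id)                     # make sure the page is resident
        self.pages[page_id] = (new_contents, True) # modify in memory and mark it dirty

    def _put(self, page_id, page, dirty):
        if len(self.pages) >= self.capacity:       # no room: evict the least recently used page
            old_id, (old_page, was_dirty) = self.pages.popitem(last=False)
            if was_dirty:
                self.disk[old_id] = old_page       # write the dirty page back before dropping it
        self.pages[page_id] = (page, dirty)

# Pretend page 7 on disk holds account #101; the UPDATE then happens purely in memory.
disk = {7: {"account_id": 101, "balance": 20000}}
pool = BufferPool(capacity=128, disk=disk)
pool.update_page(7, {"account_id": 101, "balance": 15000})   # page 7 is now dirty
```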
Redo/Undo Logging
- The database follows a strict Write-Ahead Logging (WAL) protocol, meaning changes are logged before they are written to the page.
- A log record describing the change (e.g., before/after values for account #101) is created and flushed to the permanent log file on disk.
- A transaction is only considered committed when its log records are secure.
Write-Ahead Logging. Source: CMU Database Systems
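The ordering rule is the whole trick. Here is a minimal sketch with a hypothetical WriteAheadLog class (real engines use binary records and log sequence numbers, but the "fsync the log before touching the page" discipline is the point):

```python
import json
import os

class WriteAheadLog:
    """Toy WAL: a change is appended and fsync'd to the log before any page is modified."""

    def __init__(self, path):
        self.log_file = open(path, "a", encoding="utf-8")

    def log_update(self, txn_id, page_id, before, after):
        record = {"txn": txn_id, "page": page_id, "before": before, "after": after}
        self.log_file.write(json.dumps(record) + "\n")
        self.log_file.flush()
        os.fsync(self.log_file.fileno())   # the record is durable once this returns

# Order matters: first make the log record durable, only then apply the change
# to the in-memory page (and, much later, to the data file itself).
wal = WriteAheadLog("redo.log")
wal.log_update(txn_id=42, page_id=7,
               before={"balance": 20000}, after={"balance": 15000})
```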
Checkpointing
- Periodically, the DBMS performs a checkpoint to flush all dirty pages from the buffer pool to the main data files on disk.
- This crucial background process synchronizes the in-memory state with durable storage and bounds recovery time; after a crash, the database only needs to replay logs created since the last successful checkpoint.
- This operation creates a trade-off, as checkpoints can cause I/O spikes that slow down transactions.
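Continuing the toy BufferPool sketch from earlier, a checkpoint is conceptually just "flush every dirty page, then remember how far the log was applied"; everything after that marker is what crash recovery must replay:

```python
def checkpoint(buffer_pool, data_file, current_lsn):
    """Flush every dirty page to the data file, then record how far the log was applied."""
    for page_id, (page, dirty) in list(buffer_pool.pages.items()):
        if dirty:
            data_file[page_id] = page                    # this burst of writes is the I/O spike
            buffer_pool.pages[page_id] = (page, False)   # the page is clean again
    return current_lsn   # crash recovery only replays log records written after this point
```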
Durable Storage
- The final layer is the physical disk (attached or a SAN), where the main data files and log files permanently reside.
Trade-offs and Scalability Challenges
- Buffer Pool Contention: A single shared buffer pool improves cache locality but limits concurrency. All writes contend for latches on common pages and I/O bandwidth. Scaling memory beyond one machine is hard: traditional engines cannot share RAM across servers. If the buffer pool is too large, maintenance (like scans or writes) slows down, but if too small it causes more disk I/O and thrashing.
- Log Flush Latency: Every commit requires flushing the WAL to disk. This creates a sequential I/O bottleneck. Databases mitigate this with group-commit (batching multiple transactions’ log writes), but spikes in write traffic or a slow disk can still cause queueing delays. A missing or corrupted WAL record can halt recovery entirely. In practice, each log write is an I/O op, so heavy write workloads incur high IOPS cost.
- Checkpoint/Flush Bottlenecks: When dirty pages are checkpointed, the DB must write large batches of data pages to disk. This can cause a sudden I/O spike that slows incoming transactions. To avoid overloading the disk, databases throttle writes, but that throttling in turn limits throughput. Large transactions or long-running updates can flood the buffer with dirty pages faster than they can be flushed, causing stalls. Moreover, on crash recovery a traditional DB must replay all logs since the last checkpoint, potentially taking minutes to catch up – further delaying availability.
- Single-Machine Storage Limits: Because all data and logs reside on one server’s disks or SAN, the database is constrained by that hardware’s capacity and durability. A single node can usually support only a few terabytes to, at most, a few tens of terabytes of data. Beyond that, storage partitioning or sharding is needed, which complicates the design. Also, every failure mode of that one host (disk failure, a full volume, an AZ outage, etc.) puts the database at risk.
- Slow Recovery and Failover: In a monolithic design, recovering from a crash means restarting the database process and replaying WAL records (redo/undo) to bring the buffer pool and data files into a consistent state. This can take time proportional to the transaction rate since the last checkpoint. Clients must wait (often minutes) before the database is available again. Similarly, promoting a standby replica (in a classic replica setup) can take tens of seconds or more as it now has to catch up on a full copy of the database. By contrast, Aurora’s architecture (discussed below) avoids most of this delay.
How Aurora Re-architects the Stack
To start addressing the limitations of relational databases, we reconceptualized the stack by decomposing the system into its fundamental building blocks. We recognized that the caching and logging layers were ripe for innovation. We could move these layers into a purpose-built, scale-out, self-healing, multitenant, database-optimized storage service. When we began building the distributed storage system, Amazon Aurora was born. We challenged the conventional ideas of caching and logging in a relational database, reinvented the database I/O layer, and reaped major scalability and resiliency benefits. Amazon Aurora is remarkably scalable and resilient, because it embraces the ideas of offloading redo logging, cell-based architecture, quorums, and fast database repairs. — AllThingsDistributed
This is where the magic happens. This distributed storage service receives the stream of logs and applies those changes to the data pages continuously in the background. This design makes the disruptive, I/O-heavy checkpoint process on the database node completely unnecessary, eliminating a major source of latency and contention. Aurora is a symphony of managed AWS services working together: EC2 for compute, a purpose-built log-and-storage service, DynamoDB for metadata, and S3 for backups.
The following diagram provides a high-level overview of Aurora's decoupled architecture.
SQL & Transaction Layer (Compute Nodes)
- Solves the problems of: Buffer Pool Contention and Single-Machine Storage Limits.
- The standard MySQL/PostgreSQL engine runs on stateless EC2 compute nodes.
- These nodes handle all query processing, transaction logic, and caching, but offload permanent page writes.
- Allows for adding up to 15 read replicas, each with an independent cache, to scale reads without cross-node contention.
Because the compute nodes are effectively stateless (aside from their cache), you can break the single-machine barrier. You can add numerous read replicas (up to 15) that all point to the same shared storage volume. Each replica has its own independent buffer cache and CPU, eliminating the cross-node contention and cache coherency overhead that plagues traditional clusters. When a new reader is added, its cache starts empty ("cold") and warms up as it serves queries by fetching pages from the shared storage volume. It locates these pages not by talking to the writer, but by consulting a shared metadata service that maps logical data pages to their physical location in the distributed storage layer, allowing for massive and efficient read scaling.
Redo Logging (Distributed Storage)
- Solves the problem of: Log Flush Latency and the single-point-of-failure risk of a traditional WAL.
- Introduces a fundamental architectural shift: the log is the database. The storage layer uses the log stream as the definitive source of truth.
- The compute node's only write I/O is sending log records to a distributed storage service; data pages are never written from the compute node.
- Writes are confirmed durable after a fast 4-of-6 quorum acknowledgment across multiple AZs, providing extreme fault tolerance.
Instead of writing to a single log file on a local disk, the compute node sends its redo log records over the network in parallel to a fleet of storage nodes spread across three Availability Zones. This is its only write I/O. The storage layer then uses this log stream as the source of truth to materialize data pages on demand or in the background.
The true innovation lies in its consensus protocol for durability. Each write is sent to six storage nodes, but the transaction is confirmed as committed once a quorum of any four nodes acknowledge it. This makes the commit process both extremely fast (it only waits for the fastest four responses) and incredibly fault-tolerant. In essence, Aurora transforms logging from a sequential, fragile bottleneck into a parallel, resilient, and high-throughput data stream that serves as the foundation for the entire database.
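Conceptually, the commit path looks like the sketch below: fan one log record out to all six storage nodes and declare durability after any four acknowledgments. The send function and node objects are placeholders; Aurora's actual protocol is covered in the next article.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

WRITE_QUORUM = 4          # any four of the six storage nodes are enough

def quorum_write(log_record, storage_nodes, send):
    """Fan one log record out to all six nodes; report success after four acknowledgments."""
    pool = ThreadPoolExecutor(max_workers=len(storage_nodes))
    futures = [pool.submit(send, node, log_record) for node in storage_nodes]
    acks = 0
    for future in as_completed(futures):
        if future.result():               # this node confirmed the record is durable
            acks += 1
        if acks >= WRITE_QUORUM:
            pool.shutdown(wait=False)     # don't wait for the two slowest replies
            return True                   # the commit can be acknowledged to the client
    pool.shutdown(wait=False)
    return False                          # quorum not reached: the write is not durable
```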
Crash Recovery (Storage Service)
- Solves the problem of: Slow Recovery and Failover.
- A direct result of the "log is the database" design is that compute node crash recovery is near-instant.
- The traditional, time-consuming WAL replay process on the database instance is completely eliminated.
- A "survivable cache," managed in a separate process, allows a restarted node to come back "warm" and immediately performant.
Aurora sidesteps this entirely. Since the distributed storage layer is the durable source of truth, a crashed or restarted compute instance simply reconnects to the already-consistent storage volume. There is no need for a WAL replay on the compute node itself. This is what enables failover times measured in seconds, not minutes.
Furthermore, the "survivable cache" is managed in a separate process from the database engine. This means that for many events, like an engine crash or a Zero-Downtime Patching operation, the database process can restart and find its valuable in-memory cache already warm and waiting. While a full host failure would clear the instance's RAM and thus its cache, Aurora's fundamental design ensures that recovery remains exceptionally fast regardless, as it never depends on the state of that local cache to begin with.
Checkpointing (Implicit in Storage)
- Solves the problem of: Checkpoint/Flush Bottlenecks.
- Because the log is the database, the disruptive checkpoint process is completely eliminated from the compute node.
- The storage layer continuously "materializes" new page versions from the log stream in the background.
- This granular, continuous process replaces the large, indiscriminate I/O storms of traditional checkpoints.
This process is fundamentally more efficient. A classic checkpoint is governed by the length of the entire log chain, forcing a huge, indiscriminate flush. Aurora’s continuous page materialization, however, is granular and driven by the needs of individual pages, completely eliminating I/O storms and leading to smoother, more predictable performance.
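In miniature, materialization is just "base page plus the log records that touch it, applied in LSN order." The sketch below is purely conceptual, not Aurora's on-disk format:

```python
def materialize_page(base_page, log_records):
    """A storage node's view: the current page is the base image plus every
    log record that touches it, applied in log-sequence order."""
    page = dict(base_page)
    for record in sorted(log_records, key=lambda r: r["lsn"]):
        page.update(record["after"])      # apply the change the record describes
    return page

# The page holding account #101, rebuilt entirely from the log stream.
page = materialize_page(
    {"account_id": 101, "balance": 20000},
    [{"lsn": 1, "after": {"balance": 15000}}],
)
```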
Durable Storage (Multi-AZ Shared Volume)
- Solves the problem of: Single-Machine Storage Limits and the risk of data loss from a single component or AZ failure.
- Aurora replaces local disks with a custom, log-structured, distributed storage volume that is shared by all compute nodes.
- Data is automatically replicated 6 ways across 3 Availability Zones (AZs) for extreme durability and availability.
- The volume automatically scales in 10GB segments up to 128 TB, eliminating the need for manual storage provisioning.
This architecture provides immense durability, easily tolerating the loss of an entire AZ without impacting data availability. It also offers seamless scalability. As your data grows, Aurora automatically adds new segments to the volume, scaling up to 128 TB without you having to provision storage in advance. This multi-AZ, log-structured store not only delivers high throughput and redundancy but also allows read requests to be served from any of the data copies, further distributing the load.
Backups (S3 Offload)
- Solves the problem of: Slow and performance-impacting backups.
- Continuously and asynchronously streams page snapshots and log journals to Amazon S3.
- The backup process is completely decoupled from the compute node, causing zero performance impact on the live database.
- Enables fast point-in-time restores and the ability to quickly provision database clones from S3.
Metadata (DynamoDB)
- Solves the problem of: Metadata access becoming a bottleneck and creating a single point of failure.
- Cluster metadata (like volume configuration and storage segment maps) is stored in Amazon DynamoDB.
- Using DynamoDB provides a fast, highly available, and globally accessible control plane.
- This decouples cluster state from any single database instance, ensuring all nodes have a consistent view.
All critical cluster metadata—such as the configuration of the storage volume, the map of which data lives on which storage segments, and backup pointers—resides in DynamoDB. This means metadata lookups are consistently fast and don't compete with user queries for resources. More importantly, it decouples the cluster's state from any single compute node. When you add a new instance or perform a failover, all nodes get the latest, consistent map of the cluster by querying DynamoDB, ensuring quick and reliable coordination without a single point of failure.
Failover/Discovery (Route 53 Endpoints)
- Solves the problem of: Slow Recovery and Failover by making the process transparent to applications.
- Aurora provides stable DNS names (endpoints) for the writer and reader instances, managed by Amazon Route 53.
- Applications connect to these endpoints, not to a specific database instance's IP address.
- On a failover, Aurora automatically updates the DNS record to point to the newly promoted writer, abstracting the complexity from the client.
Aurora solves this by using a service-oriented approach with Amazon Route 53. It provides a stable cluster endpoint (a DNS name) that always points to the current writer instance. If a failover occurs, Aurora automatically promotes a replica and, in coordination with Route 53, updates the cluster endpoint's DNS record to resolve to the new writer's IP address. Your application simply needs to handle the connection drop and reconnect to the exact same endpoint name. This DNS-based discovery, managed under the hood by RDS, makes failover fast and transparent, eliminating the need for complex client-side logic or manual reconfiguration.
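On the application side, the only required logic is "reconnect to the same endpoint name." A sketch using PyMySQL, where the endpoint, user, password, and database names are made up for illustration:

```python
import time
import pymysql   # psycopg2 plays the same role for Aurora PostgreSQL

# Hypothetical cluster endpoint; it always resolves to the current writer.
CLUSTER_ENDPOINT = "mycluster.cluster-abc123.us-east-1.rds.amazonaws.com"

def connect_with_retry(retries=10, delay=2):
    """Reconnect to the same DNS name; after a failover it resolves to the new writer."""
    for attempt in range(retries):
        try:
            return pymysql.connect(host=CLUSTER_ENDPOINT, user="app",
                                   password="...", database="bank")
        except pymysql.err.OperationalError:
            time.sleep(delay)   # failover in progress; the DNS record is being updated
    raise RuntimeError("writer did not become available in time")
```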
In essence, Aurora successfully dismantles the monolithic stack, reassembling it as a symphony of cloud-native services: compute engines on EC2, a distributed log-structured storage service, durable backups on S3, a metadata control plane on DynamoDB, and intelligent routing via Route 53. It is this modular, service-oriented design that allows Aurora to break through the performance and availability ceilings that limit traditional databases.
What’s Next
How does Aurora ensure that writes are both incredibly fast and highly durable when it has to coordinate across multiple servers and Availability Zones? The answer lies in the heart of its design: the purpose-built, distributed storage engine.
In the next article, we’ll take a deep dive into this storage layer. We will explore how its quorum-based protocol achieves consensus without the high latency of traditional methods, how it handles replication and consistency across AZs, and how its unique log-structured design makes crash recovery near-instantaneous. Stay tuned as we unpack the innovative engineering that powers Aurora's performance and resilience.
References & Further Reading
Amazon Aurora Deep Dive Series: The Scaling Bottleneck - Why Traditional Databases Fail and How Aurora Wins
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes
Amazon Aurora ascendant: How we designed a cloud-native relational database
Amazon Aurora: Cluster Cache Management
CMU Database Systems: Buffer Pools
AWS re:Invent 2024 - Deep dive into Amazon Aurora and its innovations (DAT405)
(Disclaimer: The views and opinions expressed in this article are my own and do not necessarily reflect the official policy or position of any organization I am associated with.)
-------------------------------
If you enjoyed this, let’s connect!
🔗Connect with me on LinkedIn and share ideas!