A service often needs to publish messages as part of a transaction that updates the database. Both updating the database and sending the message must happen within a transaction. Otherwise, a service might update the database and then crash, for example, before sending the message. If the service doesn’t perform these two operations atomically, a failure could leave the system in an inconsistent state. Let us see an example:
In this diagram, we can see the user who wants to order a pizza using a food delivery platform. This platform includes two core services: order service which stores the information about orders and delivery service which assigns delivery of orders to couriers. Upon receiving a request, the order service stores order information in the database and send an event to Kafka for further processing. There are no guarantees that both writes will happen atomically.
Suppose that Kafka is not available or a network fault occurred:
As you can see, due to the network error, the event will never reach the delivery service, making our system inconsistent and the user will never see his pizza, even if his order has been successfully saved in the database.
The problem our platform has just experienced is called dual write and it usually occurs when you change data in multiple systems, such as database, message broker, search engine, distributed cache, mail transfer agent, etc., without an additional layer that ensures data consistency across all storages.
Now we will look at this issue from the code perspective:
Our goal is to guarantee eventual consistency of both write operations, let us think about how we can achieve that. The most common solution I saw in many projects is adding @Transactional annotation and if something goes wrong the transaction will be rolled back. This approach is not going to work because:
Not all of the storage systems support transactions
ORM frameworks such as Hibernate usually postpone the actual DML queries execution putting them in the queue until a flush forces them to be executed against the database. As result, event publishing can occur before insert statement and if the query fails for some reason, maybe because some constraints get violated, our system remains in an inconsistent state since the first write has been completed.
In distributed systems when the network fault occurs we never know if the interruption happened before the target system received and processed the request, or after, hence we get stuck since we do not know whether to proceed with the transaction or abort it.
Even if you use a transaction synchronization mechanism(after commit listener) for event publishing, it will not help because the target system might not be available
Another solution is to use a distributed transaction that spans the database and the message broker. But distributed transactions aren’t a good choice for modern applications because they require locks and don’t scale well. They also need all involved systems to be up and running at the same time. Moreover, many modern brokers such as Apache Kafka don’t support distributed transactions.
Transactional outbox and Polling publisher patterns
These patterns were suggested by Chris Richardson in his iconic book: Microservices architecture patterns.
The idea behind the Transactional outbox pattern is to use a database table as a temporary message queue. As you see in this diagram, as part of a database transaction that creates order, the service inserts a corresponding event into the outbox table. Atomicity is guaranteed because this is a local ACID transaction. The polling publisher, for its part, polls the table for unpublished messages at a fixed rate. Once the event has been sent to its destination message channel, it is soft or hard deleted from the outbox table.
Now we will see how we can implement this pattern: https://github.com/giova333/transactional-messaging
Instead of directly publishing event to Kafka, here we are using an intermediate decorator which stores events in the outbox table. Using Spring scheduling API we have implemented the polling publisher which queries the database every 3 seconds for unpublished events, sends them to Kafka and updates their statuses. This solution provides at-least once delivery semantic which is completely fine since all modern message brokers use this delivery mode as default. If you are looking for exactly-once delivery you won't be able to achieve it on this planet and don't trust the people and organizations that claim the opposite. However, you can achieve exactly-once processing by making consumers idempotent.
Polling the database is a simple approach that works reasonably well at low scale. The downside is that frequently polling the database can be expensive. Also, whether you can use this approach with a NoSQL database depends on its querying capabilities.
In this article, we have explained the dual write problem and how you can tolerate it by applying transactional outbox and polling publisher patterns. Because of drawbacks and limitations, the polling publisher has, it is often better and in some cases, necessary to use the more sophisticated and performant approach of tailing the database transaction log that we are going to cover in the next article https://www.up-2date.com/post/debezium.