Write Skew Problem and Distributed Lock

Updated: May 26, 2021

In the previous article https://www.up-2date.com/post/lost-update, I showed you the most common technics how you can prevent the lost update problem, however what if I say that there is a way to avoid updates at all by redesigning our model and database table to have append-only operations rather than the final state? Let us see how we can achieve that. This article is accompanied by a working code example: https://github.com/giova333/spring-boot-distributed-lock

First of all, we need to introduce Operation which represents either deposit or withdrawal in our case:


We will now refactor the Account model:


Account no longer has a state which was updated on every withdraw/deposit resulting in the lost update. Now every time we want to update an account balance, we insert a separate immutable record in the operation table rather than updating the balance column in the account table:



Moreover, now we have a complete history of account transactions that occurred in our system, which is also very useful and important when building financial systems.

The downside of this approach is whenever we want to get an account balance we need to join and aggregate all the operations, this is exactly what we are doing in Account.getCurrentBalance.


Since we no longer store the balance I have added a separate endpoint in AccountController which allows us to see the status of accounts:


Now let us try to test our changes:


If we check the account statuses you will see that there is some inconsistency:


Although we have fixed the lost update problem, in our new design we introduced an anomaly called write skew.


Write Skew

To begin, let us try to figure out what went wrong in our system and why Bob's account has a negative balance.

As you can see in this sequence diagram, we have two concurrent transactions trying to withdraw balance from Bob's account. Our application first checks that the account balance is greater than the withdrawal amount, if yes, it assumes it’s safe to proceed. Each transaction inserts its record and as a result, Bob has a negative balance.

This anomaly is called write skew. It is neither a dirty write nor a lost update because the two transactions are dealing with two different records. It is less obvious that a conflict occurred here, but it’s a race condition: if the two transactions had run one after another, the second withdrawal would have been aborted. The anomalous behavior was only possible because the transactions ran concurrently.

To put it simply, there are three phantoms causing write skew:

  1. A SELECT query checks whether some requirement is satisfied by searching for rows that match some search condition

  2. Depending on the result of the first query, the application code decides how to continue (perhaps to go ahead with the operation, or perhaps to report an error to the user and abort).

  3. If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction.


Preventing Write Skew

Unfortunately, the write skew problem is more difficult to deal with comparing to lost updates. The easiest way to prevent it is to use serializable isolation either globally or:


Nevertheless, serializable isolation brings a lot of overhead and is not an option for modern applications.


Alternatively, you can use pessimistic locking when retrieving account:


This approach works, but if you have multiple places in your code where operation can be inserted it is a matter of time when inconsistent will occur. To prevent this from happening we have only one place responsible for account modification: the Account model. In our design, from the perspective of Domain-Driven Design, Operation is a local entity while Account is an aggregate root responsible for account modification and protection of its boundaries. This design guarantees the strong consistency of our model.


Distributed Lock

So far we have seen serializable isolation and pessimistic locking which helped us to prevent write skew. Both approaches work well for RDBMS, however, what if we want to find a database-agnostic solution? There are several ways how distributed locks can be implemented but in our example, we will be using Redis because it's a high-performance distributed key-value store that supports a range of useful features for distributed lock implementation such as TTL and set only if it does not already exist:


In this example, the command will set the key only if it does not already exist (NX option), with an expiration of 30000 milliseconds (PX option).

To implement a distributed lock using Redis we need to add a dependency


You can run Redis in the docker container:


We will now define an abstraction for our lock:


Implementation:


Now we need to instruct AccountService to use the newly created lock:


This approach works quite well, however, it has the following problems:

  • Suppose the first client requests to get a lock, but the server response is longer than the lease time; as a result, the client uses the expired key, and at the same time, another client could get the same key, now both of them have the same key simultaneously! It violet the mutual exclusion. (To solve this problem, we can set a timeout for Redis clients, and it should be less than the lease time)

  • Garbage collector pause(Stop-the-World) can block the thread acquired the lock and it’s likely that the lease will have expired by the time the request is processed, and another thread has already taken the lock.

  • Data loss can happen during node outage in case of single node cluster or master-slave with asynchronous replication which can cause the violation of the mutual exclusion

We have seen only a few issues but there are many more. The main idea you should extract from this article is: "Never implement custom distributed lock".

There are some libraries such as Redisson that manage some of the issues we have seen above and provide robust configuration. To use it you need to add this dependency:


Add Redisson lock manager implementation:


At first glance, it looks similar to our custom implementation however this library provides a lot of settings that you can customize for your use case:


Conclusion

In this article, we have redesigned our system to have append-only operations and prevent lost updates. However, our new design introduced a different problem called write skew. We saw that there are multiple ways to deal with this anomaly, and none of them are perfect. Besides that, we created a custom distributed lock using Redis and understood that it is not so easy to implement a fault-tolerant lock, hence it is recommended to use existing libraries such as Redisson which are providing distributed lock implementations out of the box.

200 views