When talking about databases the term ACID comes up quite often. It’s a model for designing a database system that was introduced based on how data were usually manipulated in real use cases.
ACID is considered by many a fundamental concept that software engineer should be aware of and having been asked about it in interviews before, I thought reviewing it more thoroughly would be worth the effort.
The term ACID was coined by Theo Haerder and Andreas Reuter in their paper ‘Principles of Transaction-Oriented Database Recovery’. It was used to describe properties of a transaction in a database system. So firstly, a definition for transaction should be defined.
Transaction
From our common usage, a process would need to perform “A transfer 100$ to B” and the steps would look something like this:
transfer_amount = 100;
balance_of_A = get_balance_of_A();
if balance_of_A is < transfer_amount then {
report failure;
go to 'commit transaction'
}
set balance_of_A = balance_of_A - transfer_amount;
set balance_of_B = balance_of_B + transfer_amount;
In an abstract understanding, what does the operation do? In essence, it performs several operation that changes the state of the data from one to another. Additionally, this change is meaningful to the user. If A originally have 200$ and B have 300$, the user expects the end result of this running those steps is A have 100$ and B have 400$, if there was any problem in between those steps, A and B should have their original balance unchanged. There should not be a case where A is left with 100$ but B still only have 300$. In other words, all operations within this confined procedure should succeed and be reflected on the database or nothing happens. That concept is refered to as ‘transaction’ in the context of database system and it symbolizes a unit of work.
ACID
To have such behavior, the paper suggests a transaction must have four properties: atomicity, consistency, isolation, durability. Let’s go over each, one by one and try to match them with an example.
Atomicity
Atomic usually means the smallest possible unit. In our case, it means a transaction should be considered a smallest work unit. It happened or it doesn’t. There should not be any intermidate state in a transaction, at least, not visible to the outside. For example a transaction ‘saving a list of items’:
begin transaction;
create list
create item 1
create item 2
add item 1 to list
add item 2 to list
commit transaction;
From the previous money transfer example, there are many steps in a transaction, and those steps can make changes to the data. However, from the user perspective, the change is only visible after every step has finished successfully. The user should only see data reflect either a list with 2 items or no list exists, no items exist. This is the “all or nothing” nature of the transaction.
Consistency
This property assert that a transaction must only commit legal results. Meaning all constraints on the database must be satisfied before the transaction can finalize. For example a field ‘age’ must be greater or equals to zero. By the end of the transaction, if somehow the field is less than zero, than that transaction must fail and not committed. If we don’t have this property then our database will not have much uses since any changes we make might leave the data corrupted and meaningless. If the system lets user sets constraints but any changes might have chance of breaking that constraint and the user have to manually check everything themselves. Then that means the setting constraints on the database system is a useless action.
Isolation
A database system commonly evolves many processes trying to access and manipulate the data at the same time. Therefore transactions must provide certain levels of isolation, the higher the level, the more indepedent the transactions and the less intermediate states are visible between transactions. If the database system doesn’t guarantees it, then the user would have to deal with the concurrency issues by themselves leaving more room for errors. Techniques for creating isolation are refered to as ‘synchronization’. Concurrent transactions in database system will usually be scheduled, arrange in a way that tries to maximizes correctness without sacrifcing too much performance. There are different types of ‘Read phenomena’ such as:
- Dirty reads : reading uncommited values.
- Non-repeatable reads: getting different results when reading multiple times within a transaction.
- Phantom reads: a special case of non-repeatable reads where there might be changes to the many rows or number of rows and the query is trying to read a range of rows. There several isolation levels (serialzable, repeatable reads, read committed, read uncommitted) that mitigates a some of the read problems above but I won’t go into these details of all of them. In short, serializable is the highest isolation level (transactions execute as if they were performed sequentially) and read uncommitted is the lowest, allowing dirty reads.
Durability
Last but not least is the property that ensures data is safe. Committed changes must be reflected, saved, stored somehow no matter what happens to the system, especially in case of unexpected failures. Therefore, databases such as MySQL uses a commit log that persist to disk (a non-volatile storage) to ensure data durability. This is important because from the user perspective, there is no information regarding the internals states of transaction. The users has to fully trust the system when it says something is committed or not. So in case of failure, if the system doesn’t have a backup plan to restore data to a valid state, then data is corrupted leaving the user helpless when it happens.
Footnote
It’s a lot of theory and technical jargon at times, but I think the ACID model is a key concept to help understand why the database management systems we use work the way they do. Also we might need to tweak them according to our use cases and correctly doing so will require this type of knowledge.