java concurrency nonblocking compare-and-swap

Is my code thread-safe? [Java, CAS, money transfer]

I am solving classic java concurency kata: you have a bunch of accounts and you need to transfer money from one to another in multi-thread env.

So I've learned and wrote all possible solutions with locking/synchronizations, and now I am trying to find out how to solve this task in non-blocking way, by using CAS and AtomicReference.

The problem is that very rarely during testing I see that sum of balances not the same - it means that i have a drawback in my solution. Please help)

this is my code for Account:

public class AtomicRefAccount implements Account {
    private final int id;
    private final AtomicReference<BigDecimal> atomicAmount;
    private final int MAX_RETRY = 5;
    private final Lock lock = new ReentrantLock();

    public AtomicRefAccount(AtomicReference<BigDecimal> atomicAmount, int id) {
        this.id = id;
        this.atomicAmount = atomicAmount;
    }

    public static AtomicRefAccount of(int amount) {
        return new AtomicRefAccount(new AtomicReference<>(new BigDecimal(amount)), 0);
    }

    @Override
    public int getId() {
        return id;
    }

    @Override
    public BigDecimal getBalance() {
        return atomicAmount.get();
    }

    @Override
    public void withdrawAmount(BigDecimal amount) {
        for (int i = 0; i < MAX_RETRY; i++) {
            BigDecimal curVal = atomicAmount.get();
            if (curVal.compareTo(amount) == -1) {
                throw new IllegalStateException("[withdrawAmount] insufficient funds");
            } else {
                if (atomicAmount.compareAndSet(curVal, curVal.subtract(amount))) {
                    return;
                }else {
                    Thread.yield();
                }
            }
        }
        System.out.println("[withdrawAmount] retries fired");
        // trying locking now
        lock.lock();
        try {
            if (atomicAmount.get().compareTo(amount) == -1) {
                throw new IllegalStateException("[withdrawAmount] insufficient funds");
            } else {
                atomicAmount.set(atomicAmount.get().subtract(amount));
            }
        } finally {
            lock.unlock();
        }
    }

    @Override
    public void addAmount(BigDecimal amount) {
        for (int i = 0; i < MAX_RETRY; i++) {
            BigDecimal curVal = atomicAmount.get();
            if (atomicAmount.compareAndSet(curVal, curVal.add(amount))) {
                return;
            }else {
                Thread.yield();
            }
        }
        System.out.println("[addAmount] retries fired");
        // trying locking now
        lock.lock();
        try {
            atomicAmount.set(atomicAmount.get().add(amount));
        } finally {
            lock.unlock();
        }
    }
}

this is my code for Money Transfer Service:

public class DefaultMoneyTransferService implements MoneyTransferService {
    @Override
    public void transfer(Account from, Account to, BigDecimal amountToTransfer) {
        //validation start
        validation(from, to, amountToTransfer);
        //validation end

        from.withdrawAmount(amountToTransfer);
        to.addAmount(amountToTransfer);
    }

    private static void validation(Account from, Account to, BigDecimal amountToTransfer) {
        Objects.requireNonNull(from, "[from] account is null");
        Objects.requireNonNull(to, "[to] account is null");
        Objects.requireNonNull(amountToTransfer, "[amountToTransfer] account is null");
        if (from == to) {
            throw new IllegalArgumentException("[from|to] provided to same accounts");
        }
        if (amountToTransfer.compareTo(BigDecimal.ZERO) != 1) {
            throw new IllegalArgumentException("[amountToTransfer] have to be positive");
        }
        if (from.getBalance().compareTo(amountToTransfer) == -1) {
            throw new IllegalStateException("[from] have insufficient funds");
        }
    }
}

I am trying to have a bullet-prof solution in terms of concurency, but using non-blocking approach.

Solution

It's not bulletproof by a long shot, but I think we're mincing definitions here. Let's go through the major failures in this code.

Misuse of `compareTo`

You appear to labour under the assumption that the compareTo method returns -1, 0, or 1. As per the spec no, it doesn't. It returns 0, positive, or negative. You must NEVER compare the result of compareTo against anything other than 0. i.e. the only valid operations are == 0, != 0, > 0, < 0, <= 0, and >= 0, for, respectively, 'equal', 'not equal', 'a is larger than b', 'a is smaller than b', 'a is equal or smaller than b / b is NOT larger than a', 'a is equal or larger than b / b is NOT smaller than a'.

Lock fallback is broken

You fall back on using a lock. This is utterly broken - locks are a two-way street: They do not do anything unless every access to the thing you are trying to concurrency-proof locks. Your code doesn't. Hence, this scenario is trivially possible:

There's €1000,- in your account. And you managed to get your hands on 20 identical cards.
You coordinate with fellow hackers: Each walks off with a card and goes to an ATM, with the plan to simultaneously all withdraw cash to mess with your code.
Due to all this, you end up with somebody in the lock-based part of the code (because they blew through the retries), and simultaneously somebody inside the CAS-based loop. Larry is in the lock-based part, and Carl is in the CAS part.
Larry's code executes atomicAmount.get().subtract(amount), which ends up being €900 (Larry is withdrawing €100 from the machine).
Larry's thread is pre-empted for whatever reason (hey, it happens - that's the point, you cannot control when it happens. Thread.yield() makes it happen faster, but there is no Thread.unyield() - and, of course, there are multiple cores, things can run simultaneously.
Carl's thread runs both curVal.subtract(amount) as well as the curVal.compareAndSet. Hence, the CAS call succeeds; Carl's code has also subtracted €100,- from the account, and Carl's ATM spits out a crisp €100,- bill, and your account is now at €900,-.
Larry's thread continues, and just overwrites the account balance.
Larry's ATM also spits out a crisp €100,- bill.
Your account is now at €900,-.

Tada. Bank has been fleeced for €100,-: An account that had €1000,- before, has €900,- now, but we have 2 100 euro bills.

SOLUTION: You can't do it. If retry fails, you tell the ATM to tell the user that for whatever reason they can't get cash right now, sorry, and spit the card back out. There is no falling back on locks, period. If you're going to use locks, you have to always use them.

It's just not bulletproof, period.

The whole approach is fundamentally broken, but here's where the word mincery might come up.

Hackers are at it again:

Their plan is now not to steal from the bank but to get them regulatorily destroyed. Their plan is to make the bank look like they are stealing from their clients.

So, here is their plan: They will go to the bank in person and act like an official from some major political party. They gathered a few minor things that makes them look legit. Such as a fake passport. It's a shoddy fake, though.

They walk into the bank and ask to transfer money from the savings account of the party to the running account. Because this appears to be a transfer within the confines of a single entity, the bank teller is not too fussed and doesn't put any serious effort in checking that the person on the other side of the window is actually authorized to do this. After all, what possible purpose could a hacker have to transfer money from one entity's savings account to that same entity's running account, right?

Thus, the teller starts the transfer operation. The hacker is somehow aware of timings quite well and manages to pull the plug on the computer in the middle of the operation - in fact, exactly between your from.withdrawAmount(amountToTransfer) and to.addAmount(amountToTransfer) calls. The teller is none the wiser, nor is the bank. But what has happened is that €5000,- has disappeared from the political entity's savings account. That money is gone, it hasn't been added to their running account.

Eventually the political party figures this out, makes a big stink, and the bank is fined hundreds of millions.

The solution?

JOURNALING.

If you're old like me, remember this scene from spaceballs - preparing to prepare?

That. The right way to do this is as follows:

Write in a journal log 'about to start transaction 12-34-56: Transfer €5000,- from account 12345 to 54321 for [reason]', authorized by '[operator]'. Wait for the disk to report it has really really saved it.
Write in a journal log '[12-34-56]: Subtracting €5000,- from 12345's current €9500,-'. Here, and always, wait for the disk to report it has done so.
subtract €5000,- from 12345 - it is now €4500,- - this is the first time you're really doing something and not just logging.
Write in a journal log: `[12-34-56]: Subtracted €5000,- from 12345; it is now €4500,-'
You get the idea.

The advantage of this system is that on bootup, before the system is back online, it can check the last few lines of the log and 'fix' all half-completed transactions, or at least check what happened. For example, if the last line in the log is '[12-34-56]: Subtracting €5000,- from 12345's current €9500,-', then you check the balance of 12345. If it is €9500,- you know that act did not go through before the system failed. You either continue the transaction (the first log line) from there, or you know there's nothing to revert, and can continue the boot procedure.

If the account balance is €4500,-, you know that the thing the log line says is about to happen, did happen, but the log to report that it did this didn't make it before the hacker pulled the plug on the machine. You can now either reverse that (set the balance back to €9500,-), or continue the transaction from that point forward (I.e. start by writing in the log that it happened).

Once all transactions have been 'fixed' (nothing that was designed to be atomic is left in a half-baked state - i.e. all have been either completed or reverted), the system starts up as normal.

This even works when you are not in direct control of knowing if things happened. For example, an ATM does the same thing: "Preparing to spit out €100,- bill. Spitting out €100,- bill. Recording that user grabbed the bill. Recording that door is now closed.".

If the ATM crashes and the last line in the journal is 'Spitting out €100,- bill', if you have to, you can ask a human operator to check the video feed and check if the bill actually was spit out (and grabbed by the account holder), or if the machine never quite managed to get to the 'door is now open' phase of the 'spit out cash' procedure.

The problem with untestable rocket science

As you can see, this stuff is incredibly difficult. It's one of those unknown unknowns things: How do you know you've covered every avenue? You can't write tests for ways your concurrency can fail that you don't know about.

Hence, do not do any of this and leave it to the pros. The normal way to do this stuff is to get a good database such as postgresql, set it to SERIALIZABLE transaction level isolation, start a transaction, make the transfer inside that single transaction, and COMMIT.

psql actually works more or less exactly how your code with my fixes as outlined above would work: psql does not necessarily use locks (it's 'optimistic locking' - you can search wikipedia for an explanation, it's CAS based, more or less what you're doing), and journals everything it does, and uses that journal on boot to first undo any half-bakery if you trip over a power cord right in the middle of psql saving the transaction permanently, before it allows any incoming connections.

RETRY failure

To really land why you shouldn't handroll this stuff, there's a serious bug in your code that you're never going to figure out.

Ever ran into the situation that you are about to walk right into somebody, so you veer left to avoid a collision, but they also veer left? You laugh sheepishly, veer right, but.. they also veer right?

It's rare but it happens. But, in computer land, it is incredibly common - after all, computers are usually quite deterministic. To a fault, really.

Hence, it is possible, even likely, that 2+ threads that are simultaneously attempting to write an update to the same bank account will continually get in each others way. They cause each others CAS to fail and for these processes to then start over and thus mess with each other enough that none of them ever finish, and they all blow through their retry limits.

The fix for this is to roll some dice. No, really. This nifty invention is from the brain of Mr Metcalfe whose at the time quite derided ethernet solution to networking handily beat Token Ring and all the other techs that lots of money was thrown behind, because the idiotic idea of dice rolling works so well.

The trick is, if a retry is needed, don't instantly try so. Instead, ROLL SOME DICE. Wait a random amount of time and then retry. This avoids the 'both parties dodge at the same time in the same direction and thus the collision isn't actually avoided' scenario. Imagine in the '2 pedestrians are about to collide' scenario, instead of both instantly veering to avoid the collision, they both mentally pick a random number, wait that long, THEN veer? The odds are very high indeed that one veers earlier than the other, and then all is well.

This is really how computers do this: psql and such really do roll some dice, and your ethernet cable really does detect collisions and waits random amounts of times to avoid repeat collisions.

For reasons, it's a good idea to do this exponentially.

Your code needs a line at the end of the retry block that looks something like Thread.sleep((int) (Math.random() * i * 500));. Wait a random amount, and wait longer if we've been retrying for a while (i is your counter that tracks which retry we're in).

You're never going to remember to do this every time. Hence, you want a framework. Keep whacking away at that idea and soon you've rebuilt a database engine from scratch. Skip the process, and just use a database.