I'm imagining something similar is going on in your database races...?
N.
A better way of explaining what he's getting at is that you have a shop that takes orders by mail. There is a large billboard at the front with a large banner pinned to it that shows the current number of orders that have been received but not processed. Nobody is allowed to go home if it shows a number higher than zero, and if it shows zero at any time after noon, everybody is required to go home for the day.
There are three secretaries each checking in incoming orders. Whenever a secretary finishes checking in an order, while still at their desk, they take a new banner out of a box, look at the billboard, add one to the number already pinned to the billboard, and write it on their banner. They then walk up to the billboard, throw away the old banner, and pin up the new one. There are three box packers that do the same when they finish processing and shipping an order, except that they subtract one from the number currently on the billboard.
The problem is that, since everybody writes the new banner at their desk, the banner you looked at to add one to might not be the same banner that you replaced when you finally got up front. So if several secretaries are all checking in new orders at the same time, they might all see that the billboard indicates 4 current jobs, each write "5" on their banner, and then have the first one replace the old "4" banner with "5", the second one replace the first one's "5" banner with a new "5" banner, and so on. So if there were three new jobs, the job count should read 7, but will actually read 5. If the box packers complete their work on the corresponding jobs at a pace such that they don't all reach the "update the billboard" step at once, then the subtraction of the completed jobs will proceed normally, and the job count will reach zero before the actual number of pending jobs is zero, and, if it's after noon, everyone will go home with work still pending.
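For anyone who wants to poke at it, here is a minimal sketch of the three secretaries in Python. Everything in it (`billboard`, `read_billboard`, `pin_banner`) is a made-up name for illustration, and the interleaving is forced by hand rather than left to real threads, so the result is the same every run:

```python
# Hypothetical model of the billboard: one shared number, with
# "read it at your desk" and "pin up the new banner" as separate steps.
billboard = 4  # pending orders currently shown


def read_billboard():
    return billboard


def pin_banner(value):
    global billboard
    billboard = value


# Three secretaries each check in one order at about the same time.
# All three look at the billboard before any of them walks up front,
# so each one writes "5" on their banner.
banners = [read_billboard() + 1 for _ in range(3)]

# Now they walk up one after another and replace whatever is pinned.
for banner in banners:
    pin_banner(banner)

lost_updates = (4 + 3) - billboard
print(billboard)     # 5, although 7 orders are actually pending
print(lost_updates)  # 2 orders the billboard no longer knows about
```

With real threads the same thing happens only some of the time, which is exactly what makes it so hard to catch in testing.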
It could also happen the other way around: new jobs are added normally, but several completed ones step on each other at the "update the billboard" step. Then the shop runs out of work to do before the count reaches zero, so nobody can go home because The Rules say that the count has to be zero before anybody can leave, and everybody has to stay overnight and hope for an offsetting error the next day.
This is, of course, a completely daft way to run a shop, but the way computers work makes it easier to run into such problems than in the physical world, so programmers have to be trained to spot such problems and build safeguards against them.
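One common safeguard, sketched here in Python as an assumption about how you might do it rather than as the fix for any particular system, is to let only one person at a time do the whole "look, add, pin" sequence. The `Billboard` class and `worker` function are hypothetical names; `threading.Lock` is the standard library's mutex:

```python
import threading


class Billboard:
    """Hypothetical shared counter; the lock is the safeguard."""

    def __init__(self, count=0):
        self._count = count
        self._lock = threading.Lock()

    def add(self, delta):
        # Only one worker at a time may read the number, compute the
        # new one, and pin it up. Nobody else can sneak in between.
        with self._lock:
            self._count = self._count + delta

    @property
    def count(self):
        with self._lock:
            return self._count


def worker(board, delta, times):
    for _ in range(times):
        board.add(delta)


board = Billboard(4)
# Three secretaries adding orders, three packers removing them.
threads = [threading.Thread(target=worker, args=(board, +1, 10_000)) for _ in range(3)]
threads += [threading.Thread(target=worker, args=(board, -1, 10_000)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_count = board.count
print(final_count)  # 4: every +1 was matched by a -1, none were lost
```

In shop terms: the banner box gets moved up to the billboard, and you have to queue for it. Slower per update, but the number is always right.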
And actually, with servers, the "running out of work before reaching a count of zero" scenario is often *less* troublesome than the "reaching zero before running out of work" scenario, because if the program tries to get the next work item when there is no such thing, it will, worst case, try to access memory that doesn't exist and crash, and then the operating system will restart it, and everything is fine and dandy. On the other hand, if there are jobs waiting and the program thinks there aren't, then each of those jobs will take up memory. Not a lot, but if you've got 100 jobs per second coming in, and half of them don't update the count properly (and remember, the heavier your load, the more jobs will try to update the count at the same time), then even a small amount of memory (tens of kB) per job can use up tens of GiB in five hours. And when there's not enough memory, the OS will start using the hard disk to pretend it has more memory, but hard disks are s...l...o...w... .
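A quick back-of-the-envelope check of those numbers, in Python. The 20 KiB per job is my own assumption to stand in for "tens of kB"; the rest comes straight from the example:

```python
jobs_per_second = 100
lost_fraction = 0.5        # half the jobs miss the count, as in the example
bytes_per_job = 20 * 1024  # assumption: 20 KiB per stuck job
seconds = 5 * 60 * 60      # five hours

leaked_jobs = int(jobs_per_second * lost_fraction * seconds)
leaked_gib = leaked_jobs * bytes_per_job / 1024**3

print(leaked_jobs)           # 900000
print(round(leaked_gib, 1))  # 17.2
```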
And as using disk to emulate RAM slows the system down, the process of reading the job count, adding one, and writing the job count back slows down, so the window for jobs to step on each other updating the count gets wider, which means a higher percentage of jobs fails to update the count correctly, more memory gets tied up, more disk is set to the task of pretending to be memory, so the system gets slower.
And *then*, 100,000 customers that have been trying to get to your website for the last 10 minutes all have their browsers tell them that the page load timed out, so they all hit refresh, and now you're getting 200 requests per second, 100 from new customers, and a hundred from all the waiting customers trying to refresh the page.
And then your tech support line gets a call spike and suddenly wait times for tech support are going up, and pretty soon grumpy customers that have been listening to hold music for an hour are chewing out underpaid agents that have had back-to-back calls with nothing but grumpy customers for an hour, and pretty soon emotionally spent agents are snapping back at customers and avoiding calls, and then a customer makes a recording of his conversation with an exceptionally grumpy agent and posts it to social media and/or the press. And then the board of directors is asking pointed questions of middle management, and middle management deflects the issue to the software engineers, and then Urwumpe says "we opened a ticket on this two years ago".
But you can see from the above that Urwumpe got one thing wrong. He was too modest:
The destructive yield of emails like that is measured not in gigatons, but in units like gigafoe and millions of solar masses * c^2.
Somewhere in the most uninhabited reaches of the Andromeda Galaxy, underneath the outback of some uninhabited desert planet, or at any rate, far away from Wolfsburg, a German middle manager is digging with inhuman speed towards the core of the planet, crawling into the deepest, darkest hole he can until the storm passes.
---------- Post added at 08:26 ---------- Previous post was at 08:15 ----------
"Get a life you fanatics!"
SCREEEEEEEEEEECH!
SPLASH!
"I told you the sign should have said 'the bridge is out'!"