Single bit errors really sUXoRz (adventures in troubleshooting #217)

Keatah · Feb 15, 2012

A user working through the wintry evening on mathematical equations describing the path a spacecraft would take as it cruised the solar system was up in arms. After days of trying to determine why the simulations she was running would sometimes allow the craft to land on target and other times miss by hundreds of thousands of miles, she gave up.

Dejected and demoralized, she passed the code off to a band of colleagues who had little difficulty in figuring out the problem. The next day they showed her the simulation and the craft was landing on target run after run, +/- 100 meters.

More perplexing still, they wondered why she couldn't find the cause. The error was glaringly obvious! So plain was the "mistake" that they thought it was a mis-timed April Fools' joke. Why in god's universe would her program be firing a mid-course correction long enough for complete propellant depletion, and at an angle such as to effect a plane change instead of a retrograde delta-V of, say, perhaps 2 m/s? Why? Because the RAM in which the simulation was running had 1 single bit that was intermittently bad. Not a whole location, but 1/8th of a location. In this simulation the bad bit was in the routine that determined whether there would be a mid-course correction and how long it would last. The failed bit would **sometimes** make a sign error and invert the time the RCS was to fire. A few seconds or a few billion seconds.
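To see how one bit can turn "a few seconds" into "a few billion seconds", here is a minimal sketch. It assumes the burn time was held as an IEEE 754 double, which the story doesn't actually say; `flip_bit` and `burn_seconds` are names made up for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit inverted (0 = least significant, 63 = sign)."""
    if not 0 <= bit <= 63:
        raise ValueError("bit must be in 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


burn_seconds = 2.0  # hypothetical nominal burn time

print(flip_bit(burn_seconds, 63))  # sign bit flipped     -> -2.0
print(flip_bit(burn_seconds, 57))  # exponent bit flipped -> 8589934592.0
```

Same variable, same code path, and the only difference between a gentle nudge and emptying the tanks is which bit happened to go bad.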

---------------------------------

Keep in mind, unlike a simulated spacecraft living in a desktop system, a real one would have several processors running checks against each other before doing maneuvers.
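One common way to have processors check each other is a majority vote over three redundant copies of the same result. That is an assumption on my part about how such a cross-check could look, not a description of any particular spacecraft; `majority_vote` and `disagreeing_units` are hypothetical helpers.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of the same word."""
    # Each output bit is 1 only if at least two of the inputs have it set.
    return (a & b) | (a & c) | (b & c)


def disagreeing_units(a: int, b: int, c: int) -> list:
    """Name the copies that differ from the voted result."""
    voted = majority_vote(a, b, c)
    return [name for name, word in (("a", a), ("b", b), ("c", c)) if word != voted]


good = 0b10100110
bad = good ^ (1 << 3)  # unit b suffers a single flipped bit

print(majority_vote(good, bad, good) == good)  # True
print(disagreeing_units(good, bad, good))      # ['b']
```

A single flipped bit in one unit gets outvoted, and the odd one out can be flagged before anything fires.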

That's-a right! A single bit error, one single bit, mind you, really sucks. It's like the hardware version of flipper. Flipper is a virus I wrote that just kinda hangs out and looks like part of an application but it randomly flips a bit here and there, in a random file of its choice, from time to time. But that's beside the point. Really insidious: sometimes the problem can show up, sometimes it can be hidden for months.

Anyways. I had the fun task of tending to a sick machine that seemed to be, mostly, on a mission to corrupt files, sometimes, and of a certain size (especially larger datasets) by flipping one bit in them. The failed bit in module #2 was discovered with Memtest86. Not saying it was simple. We had to cycle the machine for a few days to uncover the issue. And it was annoying because the problem would mostly show up at the end of the test.

Imagine a tic-tac-toe board. Let us equate it to a 9-bit memory chip. All X's, good. All O's, good.

Now, make it all X's except for the center square #5. Make that an O.
After a few minutes the O might fade and become an X, sometimes.
The complement is also true.

Now, make it a random distribution of X's and O's. As long as there is a mixture of both X's and O's surrounding the center #5 element, the chip will be stable, provided the temperature and local background radiation levels are right. It can get worse: the center #5 cell can intermittently couple itself to any surrounding element, randomly. It doesn't have to be all at once either.
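The tic-tac-toe picture can be turned into a toy model. This sketch makes the fault deterministic so the effect is easy to see; the real thing is intermittent and depends on temperature. The names `settle` and `pattern_detects_fault` are invented for the example.

```python
X, O = 1, 0
CENTER = 4  # square #5 on the board, as a zero-based index


def settle(board: list) -> list:
    """Apply the modeled fault once: the weak center cell drifts to its
    neighbors' value only when all eight neighbors agree and it differs."""
    neighbors = [cell for i, cell in enumerate(board) if i != CENTER]
    result = list(board)
    if len(set(neighbors)) == 1 and board[CENTER] != neighbors[0]:
        result[CENTER] = neighbors[0]
    return result


def pattern_detects_fault(board: list) -> bool:
    """A test pattern exposes the fault if it reads back changed."""
    return settle(board) != board


all_x = [X] * 9
lone_o = [X] * 4 + [O] + [X] * 4
mixed = [X, O, X, O, O, X, O, X, O]

print(pattern_detects_fault(all_x))   # False: solid pattern hides the fault
print(pattern_detects_fault(lone_o))  # True: lone O in the center fades to X
print(pattern_detects_fault(mixed))   # False: mixed neighbors keep it stable
```

Which is why a memory tester has to cycle through many patterns: only some of them put the weak cell in the situation where it misbehaves.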

In the big picture, this system would corrupt some larger files when doing file transfers, or doing disk operations. And since the bit seemed to be near the edge of (and between) the o/s code that controls the reading and writing and storage-in-memory (of said file), depending on what Windows was doing at the moment, a file could pass through unscathed sometimes. Since this was a Windows XP system, the problem remained relatively contained to file corruption, no other symptoms showed up, like sound or graphics issues.
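If you still have a known-good original, comparing it against the copy bit by bit shows whether the damage really is a lone flipped bit. A minimal sketch; `flipped_bits` is a hypothetical helper, not part of any tool mentioned here.

```python
def flipped_bits(original: bytes, copy: bytes) -> list:
    """Return (byte_offset, bit_index) for every bit that differs."""
    if len(original) != len(copy):
        raise ValueError("lengths differ, so this is not a pure bit flip")
    diffs = []
    for offset, (a, b) in enumerate(zip(original, copy)):
        delta = a ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit))
    return diffs


data = bytes(range(256))
damaged = bytearray(data)
damaged[100] ^= 0x10  # simulate one bit going bad during the copy

print(flipped_bits(data, bytes(damaged)))  # [(100, 4)]
```

If the same bit position keeps turning up across different corrupted files, that points at memory rather than at the disk or the application.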

If this was in a Windows 7 system, there could be problems all over the place, because win7 loads itself into memory differently every time. In an attempt to avoid some low-level hacks, win7 will sometimes load graphics stuff here, sound stuff there, and filesystem routines in yet a different spot, different each and every time you start your computer. So this failing RAM bit might sometimes handle code that does graphics, or sound, or mouse, or printing operations. You'll just never ever know!

Back to XP, in contrast, XP tends to keep a memory map that's not as dynamic. On this system, code that handled disk ops and scratchpad memory (for file transfer), happened to be making use of the failed bit. So the scope of errors was limited to file corruption when copying or creating files.

On Linux, this could have been a non-issue (almost) because, assuming testing was accurate, the bad locations could have been marked off-limits in the kernel's memory map and simply never used (see the BadRAM links below).

This is something that a lot of failsafe & mission critical computers do: they will test their own memory and automatically map the failed address off-limits. The systems onboard a spacecraft will do this in conjunction with mission control.
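The idea of mapping a failed address off-limits can be sketched as carving the pages that contain bad addresses out of the usable ranges. This is only an illustration of the concept: the 4096-byte page size is an assumption, `usable_ranges` is a made-up function, and it is not the format used by BadRAM or any real kernel.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def usable_ranges(total_bytes: int, bad_addresses: list,
                  page_size: int = PAGE_SIZE) -> list:
    """Return [start, end) byte ranges that skip every page holding a bad address."""
    bad_pages = sorted({addr // page_size
                        for addr in bad_addresses
                        if 0 <= addr < total_bytes})
    ranges = []
    start = 0
    for page in bad_pages:
        page_start = page * page_size
        if page_start > start:
            ranges.append((start, page_start))
        start = page_start + page_size  # resume after the bad page
    if start < total_bytes:
        ranges.append((start, total_bytes))
    return ranges


# Four pages of memory with one bad bit somewhere in the second page.
print(usable_ranges(4 * PAGE_SIZE, [5000]))  # [(0, 4096), (8192, 16384)]
```

One bad bit costs a whole page, which is a small price next to chasing random corruption for months.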

SSDs, like standard hard disks, have been doing it for years too. All we have left is to do this for RAM. Yehp! Your $4,000 gaming rig can fail due to one single marginal bit. And that failure can be intermittent.

Single bit errors can..
Make you swap graphics card drivers 10 times.
Make you reload your recalcitrant applications.
Make you swap a motherboard, power supply, and make you remove all your add-in cards.
You'll be checking heatsink paste, spending hours dicking with bios options.
You will also read up on and replace bad capacitors.
We're not done yet.. You'll methodically patch and update your o/s and apps.
You'll post to message boards and try ridiculously non-applicable suggestions.
Make you check cables and clean contacts.
Make you think you've tried everything and reload the o/s from scratch.
Keep you up all night long.
Make you spend hours swapping and exchanging all sorts of hardware.
Make you call tech support and bitch how bad their service & support is, especially on that $5,000 gaming PC you ordered with dad's money.

Single-bit errors can be the most insidious forms of failure, especially when they show up in a certain temperature range only, and then become intermittent. If it's seemingly too hot or too cold, the memory works fine. But in a certain temperature range, the metal gate in a microchip can become leaky and interact with the surrounding cells and cause a bit to flip. Read the technical section of memtest to see how subtle some of these failures can be.

http://www.memtest86.com/
http://fgouget.free.fr/misc/badram.shtml
http://rick.vanrein.org/linux/badram/
http://en.wikipedia.org/wiki/Dynamic_random-access_memory
http://en.wikipedia.org/wiki/Error_detection_and_correction
http://it.slashdot.org/story/10/06/24/2210214/tracking-down-a-single-bit-ram-error

Artlav · Feb 15, 2012

Never forget how much of a Rube Goldberg machine a computer is.
A tiny error, software or hardware, can wreck it.
