May 19

The Case of the Random Lockup

We had a bug that just wouldn’t go away. Sometimes it would show up quickly, sometimes it would show up after the code had been running for a few days… but it would show up. What’s a programmer to do?

The Setup

So here’s the background. We have an ARM Cortex-M3 MCU that we are using as a controller and IO processor for a device. It’s coordinating about 4 serial connections. Luckily, only two to three of those are active at a time. We had the proof of concept code running without hangup for over a week. Add a few more functions and things look OK… then we started getting lockups after the program ran for a few days. The first temptation was to ignore them because they were happening after the device was running for a few days. People wouldn’t use it that long in the real world, right? Then we had one of the lockups occur after 9 hours. That was a little harder to marginalize… so down the rabbit hole we go.

Initial Diagnosis

When I’d stop the MCU in the debugger, it was reliably stopped at the same location in memory. The stack trace was no help either, since the debugger couldn’t figure out which subroutine we were in… because we were outside programmed memory. It felt like the Star Trek: The Next Generation episode where they traveled past the edge of the universe. After I made some slight changes to the code to aid in debugging, it started hanging reliably at a different address. I figured there had to be a reference I had wrong in the code that was causing the jump to nowhere, but how to find it?

A week of code review didn’t turn it up. I thought it might have something to do with a stack overrun (I had this problem when using sprintf() before), so I pumped the stack up to 3x what I thought it needed. I even went to the trouble of initializing the stack so I could see how much of it was used. That was instructive, but it showed that stack overflow was not the culprit.
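For anyone who wants to try the same stack check, here is a minimal sketch of the idea. It is not the code from this project: the sentinel value and the function names are made up for illustration, and on a real target the painting would happen in startup code before the region is in use as a stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary sentinel chosen for illustration; any unlikely value works. */
#define STACK_PAINT_PATTERN 0xDEADBEEFu

/* Fill the whole stack region with the sentinel.
 * base points at the lowest address of the region. */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_PAINT_PATTERN;
    }
}

/* Cortex-M stacks grow downward, so untouched words sit at the lowest
 * addresses. Count sentinel words up from base until the first word that
 * has been overwritten; everything above that is treated as used. */
size_t stack_words_used(const uint32_t *base, size_t words)
{
    size_t untouched = 0;
    while (untouched < words && base[untouched] == STACK_PAINT_PATTERN) {
        untouched++;
    }
    return words - untouched;
}
```

Halt in the debugger (or call stack_words_used() from a diagnostic command) after the code has been running for a while. If the result stays comfortably below the region size, stack overflow is an unlikely suspect.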

But where does it start?

Trying to piece the stack trace together manually, it looked like the code was locking up during a delay loop. Statistically, that was to be expected since that’s where the processor spends 90%+ of its time. So, was it causation or coincidence? Was the address where it locked up the first place it went off the reservation, or just where it got stuck? To help with this I made sure that whenever I programmed the MCU, it cleared all the flash memory, not just the portion I was using. That changed where the program was hanging. Now it was hanging on a jump to the end of the 32-bit address space… where I didn’t even have memory.

At this point I’m starting to rip out whole modules of code to try and narrow down what is occurring. We’re two weeks into trying to track this bug down. It’s still hanging. Ripping out sections of code hasn’t solved anything. It did lead to a clue. When I activated the SysTick interrupt I now had the processor reliably hanging less than a second after I enabled interrupts on the MCU. So, this is interrupt related. Hmmm…

On the Trail

I went and checked my vector interrupt table (VIT) and it looked good. I’m pretty sure it is the interrupts, but what is going on here? How am I getting sent to the end of the known address space when I’ve looked at the actual values being stored, which are all legit? How to further narrow things down?

I went looking for an opcode I could write across all unused memory to further figure things out. After a lot of digging I found an opcode (breakpoint, 0xbexx) that would pause things when debugging. I went into the linker script and had all unused memory filled with that opcode. Now let’s see where that gets us.

A Break in the Case

When we started up the code this time, it did indeed halt… at an address equal to the opcode value I had just filled all the unused memory with. So… we’re interrupting and jumping to the address that all my unused memory is pointing at. At this point an epiphany occurred. While I hadn’t made many changes to the C code for the project, I had changed the location of things in flash memory so I could do boot-loading. I had also changed where I stored the VIT. Maybe I was onto something here? I did some digging and found that while the VIT is relocatable, I have to tell the MCU where I relocated it to. The software is not smart enough to do that for you. Some of the built-in libraries may do that for you, but if you’re moving things in the boot loader, good luck.

The Fix is In

So, was it that simple: set a register and we’re good? To test, I found said register and set it before any interrupts were enabled… and let it run… and run… and run. After about four days I declared it good… and then let it run over the weekend just to be sure.
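For reference, here is a rough sketch of what that kind of fix can look like. On ARMv7-M cores such as the Cortex-M3, the register in question is the Vector Table Offset Register (VTOR) at 0xE000ED08. The function name, the alignment parameter, and the example table address below are my own illustration, not the project’s actual code, and vendor libraries typically provide their own way to do this. The register pointer is passed in so the logic can also be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Vector Table Offset Register (SCB->VTOR) on ARMv7-M cores. */
#define VTOR_ADDRESS 0xE000ED08u

/* Minimum vector table alignment on a Cortex-M3 (32 words). Parts with
 * more interrupts need a larger power-of-two alignment; check the
 * datasheet for the specific MCU. */
#define VECTOR_TABLE_MIN_ALIGN 128u

/* Point the core at a relocated vector table.
 * vtor:       register to write (VTOR_ADDRESS on target)
 * table_addr: where the linker actually placed the table
 * align:      required alignment for this part, a power of two
 * Returns false and leaves the register alone if the address is unusable. */
bool relocate_vector_table(volatile uint32_t *vtor,
                           uint32_t table_addr,
                           uint32_t align)
{
    if (align < VECTOR_TABLE_MIN_ALIGN || (align & (align - 1u)) != 0u) {
        return false;   /* alignment must be a power of two >= 128 bytes */
    }
    if ((table_addr & (align - 1u)) != 0u) {
        return false;   /* table is not aligned as the hardware requires */
    }
    *vtor = table_addr;
    return true;
}

/* Hypothetical use on target, before any interrupts are enabled:
 *
 *   relocate_vector_table((volatile uint32_t *)VTOR_ADDRESS,
 *                         0x00002000u,   // assumed application base
 *                         256u);         // assumed alignment for the part
 */
```

The important part is the ordering: the write has to happen before interrupts are enabled, otherwise the first interrupt still fetches its handler address from the old table location.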

The moral of the story is, just because you tell one piece of software (the linker in this case) what you are doing, don’t assume that the other piece (the compiler) or the hardware is smart enough to figure out what you are doing. For the gory details see the e2e.ti.com thread. I hope that this helps.

Permanent link to this article: http://blog.curioussystem.com/2012/05/the-case-of-the-random-lockup/
