Schrödinger bug disappearing when breakpoint is set
I have a strange bug in my code which disappears when I try to debug it.

In my timer interrupt (always running system ticker) I have something like this:

 if (a && lot && of && conditions)
 {
     some_global_flag = 1;                   // breakpoint 2
 }

in my main loop I have

 if (some_global_flag)
 {
     some_global_flag = 0;
     do_something_very_important();   // breakpoint 1
 }

The condition in the main loop is never true, even though (I think) the conditions in the timer interrupt are fulfilled. The conditions are external (port pins, ADC results, etc.). First I put a breakpoint at position 1, and it is never triggered.

To check it, I put breakpoint nr. 2 on the line some_global_flag = 1;, and in this case the code works: both breakpoints are triggered when the conditions are true.

Update 1:

To find out whether some timing condition is responsible and the if in the timer is simply never entered when running without the debugger, I added the following to my timer:

 if (a && lot && of && conditions)
 {
     some_global_flag = 1;                   // breakpoint 2
 }


 if (some_global_flag)
 {
     #asm("NOP");    // breakpoint 3
 }

The flag is not used anywhere else in the code. It is in RAM, and the RAM is cleared to zero at the beginning.

Now, when all the breakpoints are disabled (or only breakpoint 1 in the main loop is enabled), the code does not work correctly and the function is not executed. However, if I enable only breakpoint 3 on the NOP, the code works! The breakpoint is triggered, and after continuing, the function is executed. (It has visible and audible output, so it's obvious when it runs.)

Update 2:

The timer interrupt was interruptible, by means of an "SEI" at its beginning. I removed that line, but the behavior did not change in any noticeable way.

Update 3:

I'm not using any external memory. As I'm very close to the limit in the flash, I have size optimization in the compiler on maximum.

Can the compiler (CodeVision) be responsible, or did I do something very wrong?

Ditzel answered 8/2, 2012 at 7:29 Comment(7)
Is there some way you can try adding some sort of logging, rather than setting breakpoints? I'd usually hesitate to suggest that, but in this case it may be less perturbing to the system. In particular, what I'd log is the state of your conditions on each pass through the timer interrupt code. – Wilhelm
How is some_global_flag defined? Are you using volatile int some_global_flag? – Yager
Are you using external memory? Or just the internal SRAM? What chip are you using exactly? – Yager
@DipSwitch: I'm just using a global unsigned char some_global_flag;, and it's used only in one source file. Previously I used a bit variable, but changed it to make debugging easier. I'm very close to the limit in the flash. Using volatile did not change the behavior. – Ditzel
@DipSwitch: a lone ATmega168 without external memory. – Ditzel
Are there any other bits of code that modify the variable? Maybe something else resets it to 0 (either explicitly or possibly some buggy code that overwrites it by mistake), and the timing is such that in a normal run the 1 never gets noticed in the main loop before being reset to 0, but when you stop with the debugger, it does. Try manually setting the variable to 1, step through your main loop, and see if it gets reset to 0 somewhere other than just before do_something_very_important() is called. – Geometrize
Since you're dealing with interrupts and thus concurrent execution, the influence of the debugger on code execution is not negligible. This is not a solution to your problem, but an explanation of why this behaviour (the bug disappearing) is completely normal and not a deficiency of the tools, but of your application (unless you know what you're doing, which I don't suppose). – Lebbie
It might seem strange, but the problem finally proved to be caused by strong transients on one of the input lines (which powers the system, and whose ADC measurement is also used as one of the conditions).

The system can suffer short periodic power failures, and important temporary data is kept in part of the internal SRAM, which is not cleared at startup and is designed to retain its contents (for 10 minutes or more) by means of a small capacitor while the CPU is in brown-out.

I did not mention this in the question because I had tested this part of the system and it worked perfectly, so I did not want to throw you off course.

What I found out in the end is that a new feature was being used in an environment which created very strong transients; one of the conditions in my question depended on a state derived from one of those variables in the "permanent RAM", and setting a breakpoint happened to shield the code from the effects of that transient.

Finally the problem was solved with adjustments in timing.

Edit: what helped me find the location of the problem was that I logged the values of my most important variables in the "permanent RAM" area and could see that a few of them got corrupted.

Ditzel answered 8/2, 2012 at 17:57 Comment(5)
This is actually a general reason for this sort of problem -- debugging tends to perturb timing in some way, and if you have a situation that's sensitive to timing issues, then debugging can certainly affect it. There's often not much meaningful information in how the debugging affects it (in part because it's hard to tell how the debugging is affecting the timing), though. – Wilhelm
I do think it would be interesting to add a bit to this answer describing how you ended up diagnosing the problem, as well as what you found. Also, since you know this is the correct answer, you should accept it for completeness. :) – Wilhelm
Good that you solved it. Something good came out of it: you added volatile to your global flag (if you did not add it, please do, it really should be there). – Hamitosemitic
@Gauthier: no, I made a permanent place for it in a register, as a bit variable. As it's in a register, it's volatile by default. – Ditzel
@vsz, Brooks Moses has the general answer: debugging affects timing. Suggest you add a remark to that effect. – Pleasant
Debuggers can and do change the way the processor runs and the code executes, so this is not surprising.

Divide and conquer. Start removing things until it works. In parallel, start from nothing: add only the timer interrupt and the few lines of code in the main loop, with do_something_very_important() being something simple like blinking an LED or spitting something out the UART. If that doesn't work, you won't get the bigger app to work either. If it does work, start adding the init code and more conditions in your interrupt, but do not complicate the main loop beyond the few lines described. Keep adding the interrupt handler's code back in until it fails.

When you reach the boundary where adding one thing makes it fail and removing it makes it work, do some disassembly to see whether it is a compiler issue. This might warrant another SO question if the cause is not obvious: "why does my AVR interrupt handler break when I add ..."

If you are able to get this down to a small amount of code, a dozen or so lines in main and just the few interrupt lines, post that so others can try it on their own hardware and perhaps figure it out in parallel.

Mcleroy answered 8/2, 2012 at 16:22 Comment(0)
This is probably a typical optimization/debugging bug. Make sure that some_global_flag is marked as volatile. It can be an int, uint8_t, uint64_t, whatever you like...

volatile int some_global_flag;

This way you tell the compiler not to make any assumptions about the value of some_global_flag. You must do this because the compiler/optimizer can't see any call to your interrupt routine, so it assumes some_global_flag is always 0 (the initial state) and never changes.

Sorry, I misread the part where you said you had already tried it...

You can try compiling the code with avr-gcc and see if you get the same behavior...

Yager answered 8/2, 2012 at 8:4 Comment(1)
I tried it with volatile, resulting in no change in behavior. – Ditzel
I may be wrong here, but if you are using a debugger to attach to the board in question and debug the program on the hardware it was supposed to run on, I think it can change the behavior of the microcontroller when it performs the attach... Other than that and the volatile keyword suggested above, I have no clues.

Chap answered 8/2, 2012 at 8:0 Comment(1)
The strange thing is that the correct functionality is achieved when the debugger is used. – Ditzel
This is written assuming an ARM processor.

Using a breakpoint (RAM or ROM breakpoint) forces the processor to switch from run mode to debug mode at the breakpoint (either halt mode or monitor mode), and to either run at debug speed or run an abort handler; hence JTAG-based debugging is inherently intrusive.

The ETM (Embedded Trace Macrocell), specifically on ARM (or other types of bus instrumentation), is designed to be non-intrusive and can log instructions and data in real time, so that we can inspect what really happened.

Haupt answered 9/9, 2014 at 15:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.