Different methodologies for solving bugs that only occur in production

Asked 27/7, 2009 at 16:24 Answered 27/7, 2009 at 17:26

Solved debugging websphere testing production

As one who is relatively new to the whole support and bug fixing environment and a young programmer I had never come across a bug that only occurs in the Websphere environment but not on the localhost test enviroment, until today. When I first got this bug report I was confused as to why I couldn't reproduce it on the localhost test environment. I decided to try on the Websphere test environment to see what would happen and I successfully reproduced the bug. The problem is I can't make changes and build to the Websphere test enviroment. I can only make changes to my local environment. Given this handicap what methodologies exist for resolving these kinds of bugs. Or are there even any methodologies at all? Any advice or help on how to approach issues like this?

Tacho answered 27/7, 2009 at 16:24 Comment(0)

Campaign for full access to a test environment. Being able to tweak things, redeploy and retry makes a huge difference. It's entirely reasonable to explain how not having access severely restricts your ability to do your job.
Make sure you've got sufficient logging, and make it configurable. Make sure you keep the logs for long enough to track down a problem reported by a customer even if it happened a few days ago.
When you finally diagnose a problem and why it only happens in a particular environment, document it and try to persuade your local system to behave the same way - that should make it easier to diagnose another symptom of the same problem next time.

Lorineloriner answered 27/7, 2009 at 16:29 Comment(1)

The logs are proving a huge help it let me find the pattern behind where it was breaking. I'm definitely going to keep this advice in mind from now on for all debugging. – Tacho 27/7, 2009 at 16:49

In short, the methodology is to isolate and understand the differences between environments and which one or ones might be causing the issue.

Check your local build. Make sure it the same version (code and database) as Test and Prod. If it is, what are the environment differences that could effect the issue you are seeing? (Multi-core, load balancing, operating system version, 3rd library version). Don't run locally in the debugger, make sure your running a release build (if that's what is on Test and Prod) and maybe even do a local deployment rather than building from source.
Check to see if it is particular data that might be causing the problem. If you can, pull a copy of the database back from Test onto Local and see if that enables you to repro the problem.
Check with other developers. See if they can repro. the issue in their environment. Check with your QA guys, get their thoughts on what might be causing such an issue (often times they will have seen "similar" issues and might give you a clue).

At that point, if nothing helps, I generally go into a deep state of zen to try and understand what I am missing. But, there must be a difference, you just have to find it.

Signorelli answered 27/7, 2009 at 16:37 Comment(2)

Along with the logs(I compared the local logs to the test logs), taking time to notice the differences in environment and just stopping to think about what I'm missing and discovered the error. – Tacho 27/7, 2009 at 16:51

Good for you. Thinking deeply about the problem gets a bad rap sometimes, but it can be very effective. :) – Signorelli 27/7, 2009 at 16:54

The Scientific Method always applies-- check your assumptions first. If the systems are different, the problem might reside in some sort of implicit default being different, or a different implementation of some function.

In all debugging processes, localization is the key. You gotta isolate the area of the problem first. If your OS, patches, libraries, and the main software itself are all identical, then it's probably the system settings (limits for sockets, file descriptors, etc). If you know you have enough inodes, space and memory left, then it's not a resource issue. If the computer is barely responsive to your interactive prodding, your load is too high, or you have some runaway processes. Remember what every process needs to run, and make sure they got what they need.

It can be also code just can't deal with the load of the production system. Locking mechanisms are a very popular cause of problem in production vs dev/test systems, simply because you can't generate enough of test cases that you get for free in production.

Logging can be easily overlooked, but I also like to add a lot of debug values into the code, to make debugging easier. I cannot even count how many times some environment variable, path, or broken symlink have ruined my day, just to realize that it would be a trivial fix if I looked at the values of variables while running, not just looking at the static code.

If all else fails, ltrace and strace are the best way to really look at what's going on under the hood. They're not easy to read, but once you get used to how spot and interpret return values of some syscalls, you gain a very powerful debugging tool.

Belovo answered 27/7, 2009 at 17:26 Comment(0)

Recommended topics

Hot tags