Tracing memory corruption on a production linux server

H

8

15

Guys, could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every few hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is definitely a sign of memory being corrupted somewhere.

Actually I have already tried some tools without much luck. Here is my experience so far:

Valgrind is a great (I'd even say the best) tool, but it slows the server down too much, making it unusable in production. I tried it on a stage server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my stage server under Valgrind for several hours but still couldn't spot any serious errors.
ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the stage server in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well? I have no idea.
DUMA - same story as ElectricFence but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is built with the -g flag for sure).
dmalloc - I configured the server to use it instead of the standard malloc routines, but it hangs after several minutes. Attaching gdb to the process reveals it's hung somewhere in dmalloc :(

I'm gradually going crazy and simply don't know what to do next. I still have mtrace and mpatrol left to try, but maybe someone has a better idea?

I'd greatly appreciate any help on this issue.

Update: I managed to find the source of the bug. However, I found it on the stage server, not the production one, using helgrind/DRD/tsan - there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without a significant slowdown...

Holdfast answered 25/7, 2009 at 19:22 Comment(6)

Did you compile libefence in or use the LD_PRELOAD environment variable? ElectricFence is supposedly thread-safe if it is compiled with -DUSE_SEMAPHORE – Sportsman 25/7, 2009 at 19:54

I'm using libefence.a, not the .so. And I didn't compile it myself, I installed it using emerge on Gentoo. Would you recommend building it manually with this flag instead? – Holdfast 25/7, 2009 at 20:4

One thing that might help is to look at the 200 bytes on either side of the address where the segfault says the data is corrupted. By looking at the data you might be able to get an idea of what is causing the memory corruption. – Consecration 25/7, 2009 at 20:17

Could you please elaborate a bit more on this or give a link where I can find more info on this? How can I do it with gdb? – Holdfast 25/7, 2009 at 20:28

If the faulting address has the value x, calculate y as x-200, then run the gdb command x/400xb y (replacing y with the calculated address) – Consecration 26/7, 2009 at 7:7

Well I see a lot of hex data and have no clue how to interpret it :) Could you please tell me what exactly should I try to look for in this raw data? – Holdfast 26/7, 2009 at 19:11

H

4

Folks, I managed to find the source of the bug. However, I found it on the stage server using helgrind/DRD/tsan - there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without a significant slowdown...
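To illustrate the kind of bug described above, here is a minimal sketch, assuming C++11 or later. SharedLog and run_writers are hypothetical names made up for this example and are not taken from the actual server code. It shows several threads appending to one std::vector; the comment marks where the unsynchronized version would race, and the lock is what makes it safe.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared container, for illustration only.
class SharedLog {
public:
    void append(int value) {
        // Without this lock, concurrent push_back calls race on the vector's
        // internal buffer: a reallocation in one thread can leave another
        // thread writing through a stale pointer, which corrupts the heap
        // and may only crash much later inside malloc.
        std::lock_guard<std::mutex> lock(mutex_);
        values_.push_back(value);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return values_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> values_;
};

// Starts `threads` writers that each append `per_thread` values,
// waits for all of them, and returns the final element count.
std::size_t run_writers(int threads, int per_thread) {
    SharedLog log;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&log, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                log.append(i);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return log.size();
}
```

With the lock_guard removed from append, this is the sort of code a race detector such as helgrind, DRD or tsan would be expected to flag, even on runs where it happens not to crash.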

Holdfast answered 31/7, 2009 at 20:33 Comment(0)

P

7

Yes, C/C++ memory corruption problems are tough. I have also used Valgrind several times; sometimes it revealed the problem and sometimes not.

When examining Valgrind output, don't dismiss its results too quickly. Sometimes, after spending considerable time, you'll see that Valgrind gave you the clue in the first place, but you ignored it.

Another piece of advice is to compare the code against the last known stable release. That's not a problem if you use some sort of version control system (e.g. svn). Examine all memory-related functions (e.g. memcpy, memset, sprintf, new, delete/delete[]).

Petes answered 25/7, 2009 at 20:4 Comment(3)

As for examining all memory-related functions - I don't use them directly anywhere, all pointers are shared_ptrs or weak_ptrs and all containers are from the STL... – Holdfast 25/7, 2009 at 20:13

STL is good, but even with STL you can run into memory corruption problems, for example when using an invalidated iterator. See angelikalanger.com/Conferences/Slides/… – Petes 25/7, 2009 at 20:22

Yep, I know it's always possible to shoot oneself in the foot even with such high-level libraries – Holdfast 25/7, 2009 at 20:29
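As a small illustration of the invalidated-iterator point in the comments above, here is a sketch, assuming C++11 or later. drop_evens and first_after_growth are hypothetical helper names invented for this example. It shows the two usual safe patterns: continuing from the iterator that erase() returns, and holding an index rather than a reference while a vector grows.

```cpp
#include <cstddef>
#include <vector>

// Removes even numbers from a copy of the input without ever
// touching an invalidated iterator.
std::vector<int> drop_evens(std::vector<int> values) {
    for (auto it = values.begin(); it != values.end(); ) {
        if (*it % 2 == 0) {
            // erase() invalidates `it` and every iterator after it, so
            // continue from the iterator it returns instead of doing ++it.
            it = values.erase(it);
        } else {
            ++it;
        }
    }
    return values;
}

// Appends `extra` elements and then reads the first element.
// Precondition: `values` is not empty.
int first_after_growth(std::vector<int>& values, int extra) {
    // Buggy variant: `int& first = values[0];` taken here would dangle
    // as soon as one of the push_back calls below reallocates the buffer.
    const std::size_t index = 0;  // an index stays valid across reallocation
    for (int i = 0; i < extra; ++i) {
        values.push_back(i);
    }
    return values[index];
}
```

Reading or writing through the dangling reference in the buggy variant is undefined behaviour, and it can silently overwrite memory that the allocator has already handed out again.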

C

6

Compile your program with gcc 4.1 or later and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing, this should be able to detect it. You might need to play with some of the additional parameters of SSP.

Consecration answered 25/7, 2009 at 22:11 Comment(0)

R

3

Have you tried -fmudflap? (scroll up a few lines to see the options available).

Rosetterosewall answered 25/7, 2009 at 22:49 Comment(2)

I'm currently fighting with "error: mudflap cannot track unknown size extern ‘__prime_list’" errors :( Any idea why they happen? I have no __prime_list symbol anywhere in the code... – Holdfast 26/7, 2009 at 7:10

It does rely on libmudflap being installed. Maybe it's not? – Myoglobin 31/7, 2009 at 21:41

S

1

You can try IBM Purify, but I am afraid it is not open source.

Soup answered 25/7, 2009 at 19:57 Comment(2)

Well if nothing else works... But I still believe there should be an OpenSource solution to this. – Holdfast 25/7, 2009 at 20:6

Also, Purify slows down the application considerably and cannot be used on a production machine. – Consecration 25/7, 2009 at 20:13

T

1

Google Perftools, which is open source, may be of help; see the heap checker documentation.

Tacit answered 25/7, 2009 at 20:15 Comment(1)

Unfortunately the heap checker is pretty limited: it can detect only memory leaks, not memory overruns. It could not even detect a mismatched new[]/delete :( – Holdfast 26/7, 2009 at 11:14
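For readers unfamiliar with the new[]/delete mismatch mentioned in the comment above, here is a minimal sketch, assuming C++11 or later. sum_with_unique_ptr and sum_with_vector are hypothetical names used only for this example. The mismatched form is shown in a comment, and the code uses two ownership types that always pair the allocation with the right deallocation.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Sums 1..n using a heap array owned by unique_ptr<int[]>,
// whose default deleter calls delete[] rather than delete.
long sum_with_unique_ptr(std::size_t n) {
    // Mismatched variant (undefined behaviour):
    //   int* raw = new int[n];
    //   ...
    //   delete raw;   // should have been delete[] raw
    std::unique_ptr<int[]> buffer(new int[n]);
    std::iota(buffer.get(), buffer.get() + n, 1);
    return std::accumulate(buffer.get(), buffer.get() + n, 0L);
}

// Same computation with std::vector, which hides the allocation entirely.
long sum_with_vector(std::size_t n) {
    std::vector<int> buffer(n);
    std::iota(buffer.begin(), buffer.end(), 1);
    return std::accumulate(buffer.begin(), buffer.end(), 0L);
}
```

Note that a plain smart pointer to a single object, for example a shared_ptr<int> constructed from new int[n] without a custom deleter, has the same mismatch hidden inside it.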

D

1

Try this one: http://www.hexco.de/rmdebug/ I have used it extensively. It has a low impact on performance (it mostly affects the amount of RAM used), and the allocation algorithm stays the same. It has always proven enough to find any allocation bugs. Your program will crash as soon as the bug occurs, and it will produce a detailed log.

Decapolis answered 30/7, 2009 at 5:59 Comment(1)

Thanks, I'll have a look at it. I wonder if it works fine in a c++ multithreading app... – Holdfast 30/7, 2009 at 9:56

M

1

I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (malloc man page) turns on additional checking in the default Linux malloc implementation, and typically doesn't have a significant runtime cost.

Mor answered 2/8, 2009 at 18:27 Comment(1)

Thanks, I've tried it as well (MALLOC_CHECK_=3); however, it didn't show me any source of memory corruption, since (as I wrote earlier) the memory was corrupted by a data race, not by improper usage of malloc/free... – Holdfast 3/8, 2009 at 4:41
