Segfault during write to the realloc'd area
C


I have a very frustrating problem. My application has run flawlessly on several machines for a month. However, on one machine it crashes nearly every day with a segfault, and always at the same instruction address:

segfault at 7fec33ef36a8 ip 000000000041c16d sp 00007fec50a55c80 error 6 in myapp[400000+f8000]

This address points to a memcpy call.

Below is excerpt #1 from my app:

....
uint32_t size = messageSize - sizeof(uint64_t) + 1;

stack->trcData = (char*)Realloc(stack->trcData,(stack->trcSize + size + sizeof(uint32_t)));
char* buffer = stack->trcData + stack->trcSize;

uint32_t n_size = htonl(size);
memcpy(buffer,&n_size,sizeof(uint32_t)); /* ip 000000000041c16d points here*/
buffer += sizeof(uint32_t);

....
stack->trcSize += size + sizeof(uint32_t);
....

where stack is a structure:

struct Stack{
  char*     trcData;    
  uint32_t  trcSize;    
  /* ... some other elements */
};

and Realloc is a realloc wrapper:

#define Realloc(x,y)    _Realloc((x),(y),__LINE__)

void* _Realloc(void* ptr,size_t size,int line){

  void *tmp = realloc(ptr,size);
  if(tmp == NULL){
    fprintf(stderr,"R%i: Out of memory: trying to allocate: %zu.\n",line,size); /* %zu is the specifier for size_t */
    exit(EXIT_FAILURE);
  }
  return tmp;
}

messageSize is of type uint32_t and its value is always greater than 44. Code #1 runs in a loop. stack->trcData is just a buffer that collects data until some condition is fulfilled; it is always initialized to NULL. The application is compiled with gcc with -O3 optimization enabled. When I ran it under gdb, it of course did not crash, as I expected ;)

I have run out of ideas as to why myapp crashes during the memcpy call. Realloc returns without error, so I assume it allocated enough space and that I can write to this area. Valgrind

valgrind --leak-check=full --track-origins=yes --show-reachable=yes myapp

shows absolutely no invalid reads/writes.

Is it possible that the memory on this particular machine is faulty and that this causes the frequent crashes? Or maybe I corrupt memory somewhere else in myapp, but if so, why does it not crash earlier, when the invalid write is made?

Thanks in advance for any help.

Assembly piece:

41c164: 00 
41c165: 48 01 d0                add    %rdx,%rax
41c168: 44 89 ea                mov    %r13d,%edx
41c16b: 0f ca                   bswap  %edx
41c16d: 89 10                   mov    %edx,(%rax)
41c16f: 0f b6 94 24 47 10 00    movzbl 0x1047(%rsp),%edx
41c176: 00

I'm not sure whether this is relevant, but all the machines on which my application runs successfully have Intel processors, while the one causing the problem has an AMD processor.

Crabstick answered 3/9, 2013 at 12:16 Comment(25)
How/where do you set stack->trcData initially? How/where is messageSize set? Your segfault could be due to a memory management bug in your code, but you don't have enough pieces here to determine that. – Adalbertoadalheid
I wouldn't rule out faulty hardware. Have your system administrators run a heavy-duty memory test on the computer where your code crashes, and see if they can tell you anything interesting. – Durkin
@mbratch stack->trcData is set to NULL initially. A value is assigned to messageSize and it's always checked. – Crabstick
@dasblinkenlight He plans to run a memory test but not very soon. – Crabstick
@DariuszSendkowski: what does the disassembly in that area look like? I don't see why a call to memcpy would crash at the call site. – Hyper
Then you shouldn't plan to provide a fix "very soon" either - it's a good idea to ensure that you aren't embarking on a wild goose chase before you begin. If valgrind says you're good, the search would be very costly. – Durkin
In your code #1, Realloc() is called with only two parameters instead of three. Is that the case in the original code as well? – Thorny
@Ingo Leonhardt Sorry, Realloc is a macro. I've just edited my post. – Crabstick
Are you sure you have a prototype of void *_Realloc() in your code? Thanks to the cast you have made, the code would compile without one as well. But on some 64-bit architectures you would then store only the last four bytes of the eight-byte address in stack->trcData. – Thorny
@Ingo Leonhardt Yes, I have the prototype of _Realloc in the code. – Crabstick
@Ernest Friedman-Hill I call htonl since stack->trcData is eventually sent to another application over the network. On the other side of the communication, the size is decoded by calling ntohl. – Crabstick
Is it possible that at some point messageSize has a value that makes stack->trcSize + size + sizeof(uint32_t) = 0, making realloc return NULL? (For example, with messageSize = 4 and trcSize = 0, if my calculations are correct...) – Assertion
Have you tried monitoring the code on other machines to make sure the code at this location is executed OK elsewhere? Do any other applications crash on the machine where this one does? If none of the other machines running this code actually execute it, then it doesn't necessarily point to the hardware; if all the other machines do execute this same code flawlessly, then it supports the 'machine at fault' contention. If other applications are failing on the same machine for a similar reason, that supports 'machine at fault'; if no other application runs into the problem, maybe not. – Mistook
@Jonathan Leffler This piece of code is one of the most frequently called pieces in the whole application. This problem occurs only on a single, particular machine. – Crabstick
@Assertion No, it is not possible. messageSize is always greater than 44 bytes. Its value is always checked before the Realloc call. – Crabstick
What are the values of stack->trcData before and after the Realloc() call in an instance where it crashes? What is the value of the rax register when it crashes? What are all of the regions of memory mapped into your program when it crashes (cat /proc/<PID>/maps)? – Mylan
It might be worth trying either an alternate malloc implementation (e.g. TC malloc) or seeing whether your existing malloc has any diagnostics that might uncover problems: gnu.org/software/libc/manual/html_node/… – Chromatology
To answer your question about an invalid write elsewhere in the program - it absolutely can lead to this, and the reason it does not crash is that it's writing to a valid location in memory, just the wrong location. Seg faults are raised by the kernel when the hardware tells the kernel that the process accessed a memory location for which its memory map does not have an entry. A memory checker could help; glibc has one built in. – Erythrocyte
To enable glibc's checker, set the environment variable MALLOC_CHECK_ to 1 (errors go to stderr), 2 (an error calls abort()), or 3 (the error is printed to stderr and abort() is called). – Erythrocyte
Is stack->trcSize appropriately updated elsewhere in the code? – Favrot
@Erythrocyte Enabling MALLOC_CHECK_ gave no extra information. The application crashed exactly the same as before. – Crabstick
@Claudix The size is updated within the same block. – Crabstick
@DariuszSendkowski - ah, that means that memory allocation did not detect the error. Perhaps it's not related to malloc and free... – Erythrocyte
Have you tried your own version of memcpy, i.e., just copying byte-by-byte in a loop? It's only to rule out a possible memcpy malfunction. Even better, replace the memcpy line with this statement: *((uint32_t*)buffer) = htonl(size) – Favrot
I think I know what can cause this situation. Suppose that at some loop step stack->trcSize + size exceeds UINT32_MAX. That means Realloc in fact shrinks stack->trcData. Next, I define buffer, which now points far beyond the allocated area. Hence, when I write to buffer I get a segfault. What do you think? – Crabstick

Here is the cause of my problem. At some loop step, stack->trcSize + size exceeds UINT32_MAX. Because both operands are uint32_t, the sum wraps around to a small value, so Realloc in fact shrinks stack->trcData. Next, I define buffer at the old offset stack->trcSize, which now points far beyond the allocated area. Hence, when I write to buffer I get a segfault. I've checked it, and this was indeed the cause.
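
For anyone hitting the same thing, here is a minimal sketch of the wraparound and of one possible guard. It is not the code from my application: the helper names wrapped_sum, trc_fits and trc_append_header are hypothetical, the struct is reduced to the two fields shown in the question, and failure is reported with a return value instead of going through the Realloc wrapper. The idea is simply to do the size arithmetic in 64 bits and refuse to grow the buffer once the total no longer fits in a uint32_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl */

/* Reduced version of the struct from the question. */
struct Stack {
    char    *trcData;
    uint32_t trcSize;
};

/* Hypothetical helper: the sum as the original expression computes it.
 * Both operands are uint32_t, so the result wraps modulo 2^32 before it
 * is ever widened to size_t. */
static uint32_t wrapped_sum(uint32_t trcSize, uint32_t size)
{
    return trcSize + size;
}

/* Hypothetical helper: 1 if trcSize + size + sizeof(uint32_t) still fits
 * in a uint32_t, 0 otherwise. The arithmetic is done in 64 bits. */
static int trc_fits(uint32_t trcSize, uint32_t size)
{
    uint64_t total = (uint64_t)trcSize + size + sizeof(uint32_t);
    return total <= UINT32_MAX;
}

/* Hypothetical helper: grow the buffer by a 4-byte length prefix plus
 * `size` payload bytes and write the prefix in network byte order.
 * Returns 0 on success, -1 on overflow or allocation failure; on failure
 * the stack is left unchanged. */
static int trc_append_header(struct Stack *stack, uint32_t size)
{
    assert(stack != NULL);

    if (!trc_fits(stack->trcSize, size))
        return -1;                      /* would wrap: refuse to shrink */

    size_t total = (size_t)stack->trcSize + size + sizeof(uint32_t);
    char *tmp = realloc(stack->trcData, total);
    if (tmp == NULL)
        return -1;
    stack->trcData = tmp;

    uint32_t n_size = htonl(size);
    memcpy(stack->trcData + stack->trcSize, &n_size, sizeof n_size);

    stack->trcSize = (uint32_t)total;   /* safe: total <= UINT32_MAX */
    return 0;
}
```

With trcSize = UINT32_MAX - 10 and size = 45, wrapped_sum gives 34, so the original expression would have asked realloc for only 38 bytes while buffer still pointed almost 4 GiB past the start of the block. Whether to flush the buffer, switch trcSize to size_t, or treat this as an error when trc_append_header returns -1 depends on the application.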

Crabstick answered 12/9, 2013 at 11:52 Comment(0)
