segfault during write to the realloc'd area

Asked 3/9, 2013 at 12:16 Answered 12/9, 2013 at 11:52

I have a very frustrating problem. My application has been running flawlessly on a few machines for a month. However, on one machine it crashes nearly every day with a segfault, always at the same instruction address:

segfault at 7fec33ef36a8 ip 000000000041c16d sp 00007fec50a55c80 error 6 in myapp[400000+f8000]

This address points to the memcpy call.

Below is excerpt #1 from my app:

....
uint32_t size = messageSize - sizeof(uint64_t) + 1;

stack->trcData = (char*)Realloc(stack->trcData,(stack->trcSize + size + sizeof(uint32_t)));
char* buffer = stack->trcData + stack->trcSize;

uint32_t n_size = htonl(size);
memcpy(buffer,&n_size,sizeof(uint32_t)); /* ip 000000000041c16d points here*/
buffer += sizeof(uint32_t);

....
stack->trcSize += size + sizeof(uint32_t);
....

where stack is a structure:

struct Stack{
  char*     trcData;    
  uint32_t  trcSize;    
  /* ... some other elements */
};

and Realloc is a realloc wrapper:

#define Realloc(x,y)    _Realloc((x),(y),__LINE__)

void* _Realloc(void* ptr,size_t size,int line){

  void *tmp = realloc(ptr,size);
  if(tmp == NULL){
    fprintf(stderr,"R%i: Out of memory: trying to allocate: %zu.\n",line,size); /* %zu is the conversion for size_t */
    exit(EXIT_FAILURE);
  }
  return tmp;
}

messageSize is of type uint32_t and its value is always greater than 44 bytes. Code #1 runs in a loop. stack->trcData is a buffer that collects data until some condition is fulfilled. stack->trcData is always initialized to NULL. The application is compiled with gcc with -O3 optimization enabled. When I ran it in gdb it did not crash, of course, just as I expected ;)

I have run out of ideas as to why myapp crashes during the memcpy call. Realloc returns with no error, so I assume it allocated enough space and I can write to this area. Valgrind

valgrind --leak-check=full --track-origins=yes --show-reachable=yes myapp

shows absolutely no invalid reads/writes.

Is it possible that the memory itself is faulty on this particular machine and that this causes the frequent crashes? Or maybe I corrupt memory somewhere else in myapp, but if that is the case, why does it not crash earlier, when the invalid write is made?

Thanks in advance for any help.

Assembly piece:

41c164: 00 
41c165: 48 01 d0                add    %rdx,%rax
41c168: 44 89 ea                mov    %r13d,%edx
41c16b: 0f ca                   bswap  %edx
41c16d: 89 10                   mov    %edx,(%rax)
41c16f: 0f b6 94 24 47 10 00    movzbl 0x1047(%rsp),%edx
41c176: 00

I'm not sure whether this is relevant, but all the machines my application runs on successfully have Intel processors, whilst the one causing the problem has an AMD processor.

Crabstick answered 3/9, 2013 at 12:16 Comment(25)

How/where do you set stack->trcData initially? How/where is messageSize set? Your segfault could be due to a memory management bug in your code, but you don't have enough pieces here to determine that. – Adalbertoadalheid 3/9, 2013 at 12:22

I wouldn't rule out faulty hardware. Have your system administrators run a heavy duty memory test on the computer where your code crashes, and see if they could tell you anything interesting. – Durkin 3/9, 2013 at 12:23

@mbratch stack->trcData is set to NULL initially. A value is assigned to messageSize and it's always checked. – Crabstick 3/9, 2013 at 12:25

@dasblinkenlight He plans to run a memory test but not very soon. – Crabstick 3/9, 2013 at 12:28

@DariuszSendkowski: what does the disassembly in that area look like? I don't see why a call to memcpy would crash at the call site. – Hyper 3/9, 2013 at 12:32

Then you shouldn't plan to provide a fix "very soon" either - it's a good idea to ensure that you aren't embarking on a wild goose chase before you begin. If valgrind says you're good, the search would be very costly. – Durkin 3/9, 2013 at 12:33

In your code #1 Realloc() is called with only two instead of three parameters. Is that the case in the original code as well? – Thorny 3/9, 2013 at 12:39

@Ingo Leonhardt Sorry, Realloc is a macro. I've just edited my post. – Crabstick 3/9, 2013 at 12:43

Are you sure you have a prototype of void *_Realloc() in your code? Thanks to the cast you have made, the code would compile without one as well. But on some 64-bit architectures you would then store only the last four bytes of the eight-byte address in stack->trcData – Thorny 3/9, 2013 at 12:49

@Ingo Leonhardt Yes, I have the prototype of _Realloc in the code. – Crabstick 3/9, 2013 at 12:52

@Ernest Friedman-Hill I call htonl since stack->trcData is sent to another application over network eventually. On the other side of communication, the size is decoded by calling ntohl. – Crabstick 3/9, 2013 at 12:56

Is it possible that at some point messageSize has a value that makes stack->trcSize + size + sizeof(uint32_t) = 0, so that realloc returns NULL? (For example with messageSize = 4 and trcSize = 0, if my calculations are correct...) – Assertion 3/9, 2013 at 14:29

Have you tried monitoring the code on other machines to make sure the code at this location is executed OK elsewhere? Do any other applications crash on the machine where this one does? If none of the other machines running this code actually execute it, then it doesn't necessarily point to the hardware; if all the other machines do execute this same code flawlessly, then it supports the 'machine at fault' contention. If other applications are failing on the same machine for a similar reason, that supports 'machine at fault'; if no other application runs into the problem, maybe not. – Mistook 3/9, 2013 at 14:37

@Jonathan Leffler This piece of code is one of the most frequently called pieces in the whole application. This problem occurs only on a single, particular machine. – Crabstick 3/9, 2013 at 15:3

@Assertion No, it is not possible. messageSize is always greater than 44 bytes. Its value is always checked before Realloc call. – Crabstick 3/9, 2013 at 15:5

What are the values of stack->trcData before and after the Realloc() call in an instance where it crashes? What is the value of the rax register when it crashes? What are all of the regions of memory mapped into your program when it crashes (cat /proc/<PID>/maps)? – Mylan 3/9, 2013 at 23:1

It might be worth trying either an alternate malloc implementation (e.g. TC malloc) or see if your existing malloc has any diagnostics that might uncover problems: gnu.org/software/libc/manual/html_node/… – Chromatology 4/9, 2013 at 2:59

To answer your question about an invalid write elsewhere in the program - it absolutely can lead to this, and the reason it does not crash is that it's writing to a valid location in memory, just the wrong location. Seg faults are caught by the kernel when the hardware tells the kernel the process accessed a memory location for which its memory map does not have an entry. A memory checker could help; glibc has one built in. – Erythrocyte 4/9, 2013 at 6:29

To enable glibc's checker, set the environment variable MALLOC_CHECK_ to 1 (errors go to stderr), 2 (error calls abort()), or 3 (error is printed to stderr and calls abort()). – Erythrocyte 4/9, 2013 at 6:32

Is stack->trcSize appropriately updated elsewhere in the code? – Favrot 8/9, 2013 at 14:52

@Erythrocyte Enabling MALLOC_CHECK_ gave no extra information. The application crashed exactly the same as before. – Crabstick 8/9, 2013 at 16:39

@Claudix The size is updated within the same block. – Crabstick 8/9, 2013 at 16:42

@DariuszSendkowski - ah, that means that memory allocation did not detect the error. Perhaps it's not related to malloc and free... – Erythrocyte 9/9, 2013 at 4:54

Have you tried your own version of memcpy, i.e., just copying byte-by-byte in a loop? It's only to rule out a possible memcpy malfunction. Even better, replace the memcpy line with this statement: *((uint32_t*)buffer) = htonl(size) – Favrot 9/9, 2013 at 7:34

I think I know what can cause this situation. Suppose that at some loop step stack->trcSize + size exceeds UINT32_MAX. That means Realloc in fact shrinks stack->trcData. Next, I define buffer, which now points far beyond the allocated area. Hence, when I write to buffer, I get a segfault. What do you think? – Crabstick 11/9, 2013 at 11:0

Here is the cause of my problem. At some loop step, stack->trcSize + size exceeds UINT32_MAX. Both operands are uint32_t, so the sum wraps around and Realloc in fact shrinks stack->trcData. Next, I define buffer, which now points far beyond the allocated area. Hence, when I write to buffer, I get a segfault. I've checked it, and it was indeed the cause.
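
The wraparound can be reproduced in isolation. The following is only a minimal sketch, assuming a platform where unsigned int is 32 bits (as on x86-64 Linux). The names wrapped_request and checked_request are hypothetical and not part of the original code; the first mirrors the size expression passed to Realloc, the second shows one possible way to detect the overflow before reallocating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the size expression from the question. Assumption: unsigned int
   is 32 bits, so trcSize + size is evaluated in uint32_t and wraps before
   sizeof(uint32_t) is added. */
static size_t wrapped_request(uint32_t trcSize, uint32_t size)
{
    return trcSize + size + sizeof(uint32_t);
}

/* Hypothetical helper, not part of the original code: widens to 64 bits
   before adding and fails if the new total would not fit in the 32-bit
   trcSize field. Returns 0 on success, -1 on overflow. */
static int checked_request(uint32_t trcSize, uint32_t size, size_t *total)
{
    assert(total != NULL);
    uint64_t wide = (uint64_t)trcSize + size + sizeof(uint32_t);
    if (wide > UINT32_MAX)
        return -1;              /* caller should flush or stop collecting */
    *total = (size_t)wide;
    return 0;
}
```

With trcSize close to UINT32_MAX, wrapped_request returns a tiny value, which is why realloc succeeds while buffer = stack->trcData + stack->trcSize ends up far outside the block. In the loop from the question, a check like checked_request would go before the Realloc call; an alternative would be to make trcSize a size_t, provided the receiving side can handle buffers that large.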

Crabstick answered 12/9, 2013 at 11:52 Comment(0)
