[This is lengthy and full of details. My specific questions are introduced by the boldface word Question below.]
Summary
We are running some of our test suites under valgrind and encountering an error that doesn't make much sense to me. I'm looking for advice on figuring out in more detail what might be going wrong.
- Valgrind complains of an "invalid write of size 8".
- The error is consistent from run to run but comes and goes with should-be-irrelevant code changes, different compiler/stdlib versions, etc.
- The address being written to is on the stack and so far as I can see is a perfectly reasonable address for our code to be writing to.
- Its alignment is consistent with the size of the write.
- The place where it happens is deep inside the standard library.
All of which smells rather as if the real problem is elsewhere: something is getting corrupted and leading to confusion later on. But this is the first problem valgrind reports, so if there's memory-stomping elsewhere then valgrind is failing to catch it. I suspect that either I am missing something obvious, or there is a subtle problem that those with more valgrind expertise than I have may be able to point me at.
Some details
Here are some details and some specific questions.
This is on a Linux box running Ubuntu 14.04 on x64 hardware.
Here is valgrind's complaint in one fairly typical instance:
==14259== Invalid write of size 8
==14259== at 0x662BBC9: __printf_fp (printf_fp.c:663)
==14259== by 0x6629792: vfprintf (vfprintf.c:1660)
==14259== by 0x664D578: vsnprintf (vsnprintf.c:119)
==14259== by 0x52DCE0F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19)
==14259== by 0x52E3263: std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, char, double) const (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19)
==14259== by 0x52E354F: std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::do_put(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, double) const (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19)
==14259== by 0x52EEAF4: std::ostream& std::ostream::_M_insert<double>(double) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19)
==14259== by 0x694725: CRVinfo::appendValue(std::string const&, double) (CRVinfo.cpp:197)
==14259== by 0x6902DB: CRVdouble::info(CRVinfo&) const (CRVdouble.cpp:103)
==14259== by 0x6913B4: CRVcollection::info(CRVinfo&) const (CRVcollection.cpp:60)
==14259== by 0x6913B4: CRVcollection::info(CRVinfo&) const (CRVcollection.cpp:60)
==14259== by 0x68F87F: CRVvalue::generate() (CRVvalue.cpp:71)
==14259== Address 0xffeffde68 is on thread 1's stack
==14259== in frame #0, created by __printf_fp (printf_fp.c:161)
The things beginning with "CRV" are ours; the things above them are in libstdc++ and glibc. Ubuntu 14.04 uses version 2.19 of glibc -- except that in fact it seems to be using eglibc 2.19 rather than plain glibc 2.19; you can find the relevant version of printf_fp.c here.
Running valgrind with --vgdb and asking gdb for a disassembly shows (consistently with the source code linked above) that the instruction we're actually about to execute when valgrind stops us is callq __mpn_lshift.
The topmost stack frame involving "our" code looks like this:
void CRVinfo::appendValue(const std::string &name, double value) {
    addIndent();
    addElementBegin(name);
    std::ostringstream oss;
    oss << value;
    m_valueTree.append(oss.str());
    addElementEnd(name);
}
and it's inside oss << value; that the trouble occurs. m_valueTree is a std::string; you can guess what sort of thing addIndent and addElementBegin do; the latter uses a stringstream to do it, the former doesn't. (Probably-irrelevant note: you might think this looks inefficient and you'd be right, but this is not at all performance-critical code.)
So, anyway, we're getting an invalid write of size 8 at address 0xffeffde68, on a callq instruction. You'd expect callq to write to memory pointed to by rsp, and so it does (I have verified that at this point rsp equals 0xffeffde68) ... but valgrind objects to this, and it's not clear to me why.
(One obvious guess might be that we're overflowing our stack. But (1) I'd have thought that would happen at a rounder-looking address, and (2) I have attempted to increase the stack size and it hasn't made these valgrind complaints go away, and (3) I would expect a segfault on overflowing the stack and that isn't happening, and (4) we haven't used much stack at this point anyway; at the earliest point I've been able to probe, rsp is 0xfff000598, so we've used 0x2730 = 10032 bytes, a little under 10 KiB of stack, at the point of failure.)
Question: Should it be apparent to me what valgrind objects to about this write? If not, is there any way to make valgrind tell me more about why it doesn't like it?
Question: Is it plausible that the immediate problem here is an error in valgrind (albeit perhaps provoked by some earlier misbehaviour in our code)? If so, is there any good way to track such things down or rule them out?
Question: Does this look like any known issue with glibc or libstdc++? (Such web-searching as I've done so far hasn't turned up any such known issue.)
More information in case it's useful
If I allow execution to continue after this invalid write, valgrind then complains -- inside the __mpn_lshift function being called here -- of an invalid read of size 8. It's reading from the same address, and disassembling in gdb indicates, unsurprisingly, that it's the retq instruction at the end of __mpn_lshift that is to blame.
None of my stack frames appears to be terribly large. Valgrind doesn't complain about large stack frames, inquire whether the stack has moved, suggest increasing --max-stacksize, or anything of the kind.
On another machine with a slightly different version of gcc and perhaps different versions of the standard libraries, valgrind again reports an invalid write of size 8 in __printf_fp, but in a different part of it and this time not on a call instruction. (Unfortunately, this was on a colleague's computer, and since we observed it, changes have been made that make his version show the same failure as mine, so I am unable to give more details with any confidence. But I'm 95% sure the failure occurred on a mov instruction, and was writing strictly inside the current stack frame.)