This post describes a crash we ran into in a recent software project, along with the hard-earned fix, so that readers can benefit from it.
Background
At my company, I'm the implementer and maintainer of an internal library that uses the Boost.Asio (asynchronous I/O) socket framework for cross-platform data transfer over sockets. A colleague recently came to me with the following problem: her BlackBerry 10 application, which links against my library, crashed within several seconds if the Wi-Fi router was unceremoniously turned off during a file transfer.
Enabling the library's built-in tracing showed us that the crash occurred when the library called boost::asio::write() with a boost::asio::ip::tcp::socket that was no longer 'valid' (i.e., the socket was unusable). Placing a try/catch (boost::system::system_error) block around the write() call failed to catch anything, so the crash appeared to be happening inside Boost itself.
Because the crash occurred only in release builds, we could not use the debugger.
Technical Information
- The QNX compiler, QCC, is based on GCC 4.6.3
- The version of Boost used is 1.48.0
Here is a typical command line invocation of the compiler:
/home/foobar/bbndk/host_10_1_0_238/linux/x86/usr/bin/QCC
-Vgcc_ntoarmv7le
-lang-c++
-x c++
-DLINUX -DQNX -DSUPPORT_LAN -DUSE_SQLITE_FOR_DATABASE
-Wno-psabi -Wno-write-strings
-O3
-DNDEBUG
-fno-strict-aliasing
-fPIC
-I/home/foobar/Libraries/BlackBerry_10/boost_1.48/include
...
-I/home/foobar/Libraries/BlackBerry_10/utfcpp_1.0/include
-o CMakeFiles/Internals.dir/ConfigFileSingleton.cpp.o
-c /home/foobar/myproject_dev/myproject/SDK/Internals/ConfigFileSingleton.cpp
Steps used to locate the problem source
We wrote a minimal app to try to reproduce the problem with far less code, first using raw sockets, then using Boost.Asio. Had the crash occurred there, we could have concluded that our proprietary library was not the cause. The crash did not reproduce, however, which led us to suspect that our library was at fault.
We wrote a lightweight tracing framework for use inside Boost.Asio's header files and instrumented the functions related to the problem. The framework printed a string on entry to and exit from these functions, and also let us trace the values of variables.
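For illustration, a raw-socket test of this kind might look like the sketch below. It is not our actual test app: it assumes Linux, a C++11-capable compiler (for std::system_error), and a local socketpair() standing in for the network connection. The names write_or_throw and provoke_broken_pipe are made up for this example.

```cpp
#include <cerrno>
#include <cstddef>
#include <system_error>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: send on a socket and turn a failure into an exception.
// MSG_NOSIGNAL (Linux) suppresses SIGPIPE so the error surfaces through errno.
void write_or_throw(int fd, const char* data, std::size_t len) {
    ssize_t n = ::send(fd, data, len, MSG_NOSIGNAL);
    if (n < 0) {
        throw std::system_error(errno, std::generic_category(), "send");
    }
}

// Creates a connected local socket pair, closes one end to simulate the peer
// going away, then writes to the other end. Returns the error code carried by
// the caught exception, 0 if the write unexpectedly succeeded, or -1 if the
// socket pair could not be created.
int provoke_broken_pipe() {
    int fds[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }
    ::close(fds[1]);  // the "peer" disappears

    int code = 0;
    try {
        write_or_throw(fds[0], "x", 1);
    } catch (const std::system_error& e) {
        code = e.code().value();
    }
    ::close(fds[0]);
    return code;
}
```

In a healthy build the exception is simply caught and the function returns the "broken pipe" error code. A crash instead of a clean catch would point at the throw machinery rather than at the code around it.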
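The idea behind such a framework can be sketched with a small RAII class. This is a minimal sketch, not the code we actually used: ScopeTrace, send_chunk, and demo_trace are hypothetical names, and the output format is an assumption.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical RAII tracer: prints a line when constructed (function entry)
// and another when destroyed (function exit). Every line is flushed at once
// so that it survives a crash that follows shortly after.
class ScopeTrace {
public:
    ScopeTrace(std::ostream& out, const std::string& name)
        : out_(out), name_(name) {
        out_ << "ENTER " << name_ << std::endl;
    }
    ~ScopeTrace() {
        out_ << "EXIT " << name_ << std::endl;
    }

    // Trace the value of a variable inside the instrumented function.
    template <class T>
    void value(const std::string& label, const T& v) {
        out_ << "  " << name_ << ": " << label << " = " << v << std::endl;
    }

private:
    // Non-copyable, written C++03 style to suit older compilers.
    ScopeTrace(const ScopeTrace&);
    ScopeTrace& operator=(const ScopeTrace&);

    std::ostream& out_;
    std::string name_;
};

// Stand-in for an instrumented function.
int send_chunk(std::ostream& log, int bytes) {
    ScopeTrace trace(log, "send_chunk");
    trace.value("bytes", bytes);
    return bytes;
}

// Captures the trace in a string instead of printing it.
std::string demo_trace() {
    std::ostringstream log;
    send_chunk(log, 42);
    return log.str();
}
```

When hunting a crash, the stream would normally be std::cerr or a log file. If an ENTER line has no matching EXIT line, the crash happened somewhere inside that function.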
Using the tracing framework, we were able to prove that the crash occurred in the boost::throw_exception() function template (irrelevant #ifdef'd code removed). Boost calls this function when the system-level write operation fails with a "broken pipe" error:
template<class E> BOOST_ATTRIBUTE_NORETURN inline void throw_exception( E const & e )
{
    // All boost exceptions are required to derive from std::exception,
    // to ensure compatibility with BOOST_NO_EXCEPTIONS.
    throw_exception_assert_compatibility(e);
    throw enable_current_exception(enable_error_info(e));
}
Inserting traces and splitting the 'throw' statement into separate statements proved to us that the crash was occurring in the process of throwing the exception object. In all probability, something was going very wrong with the unwinding of the stack when the exception was being thrown.
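The technique of splitting a 'throw' into separately traced steps can be sketched as follows. This uses standard-library stand-ins rather than the Boost internals: wrap() is a hypothetical placeholder for Boost's wrapping helpers, and the position and wording of the traces are assumptions.

```cpp
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the helpers that decorate the exception object.
template <class E>
E wrap(const E& e) {
    return e;
}

// The single statement "throw wrap(e);" split into separate steps, with a
// flushed trace line between them. The last line to appear in the log shows
// how far execution got before a crash.
template <class E>
void traced_throw(const E& e, std::ostream& log) {
    log << "1: entered" << std::endl;
    E wrapped = wrap(e);
    log << "2: wrapped, about to throw" << std::endl;
    throw wrapped;
}

// Runs the traced throw and records whether the exception reached the handler.
std::string run_traced_throw() {
    std::ostringstream log;
    try {
        traced_throw(std::runtime_error("broken pipe"), log);
    } catch (const std::runtime_error& e) {
        log << "3: caught " << e.what() << std::endl;
    }
    return log.str();
}
```

In a healthy build all three lines appear. A log that ends at line 2, with the handler never reached, points at the throw itself, which is the pattern described above.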
The Solution
Once we realized it was most probably a compiler bug rather than an application-level one, we examined the compiler options used to build the library. We considered memory corruption unlikely, on the assumption that Boost's internal code is mature and well tested. That left release-mode optimization as the prime suspect, and from there the solution was a short step away: drop the optimization level from -O3 to -O2.
With that change, the crash disappeared.
We have since modified the Blackberry.cmake file in the QNX toolchain to use "-O2" rather than the original "-O3":
SET(CMAKE_C_FLAGS_RELEASE "-O2 -DNDEBUG")
#SET(CMAKE_C_FLAGS_RELEASE "-O3 -DNDEBUG")
. . .
SET(CMAKE_CXX_FLAGS_RELEASE "-O2 -DNDEBUG -fno-strict-aliasing -fPIC")
#SET(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG -fno-strict-aliasing -fPIC")
In light of this crash, it may be advisable to use "-O3" with caution. Our minimal app failed to reproduce the problem because it was compiled at optimization level 2, not 3.
We're still looking for an SSCCE (a short, self-contained, correct example) to submit to the QNX and/or GNU teams.