I'm tuning our product for G1GC, and as part of that testing I'm hitting regular segfaults on my Spark Workers, which of course cause the JVM to crash. When this happens, the Spark Worker/Executor JVM automagically restarts itself and then overwrites the GC logs that were written by the previous Executor JVM.
To be honest, I'm not quite sure of the mechanism by which the Executor JVM restarts itself, but I launch the Spark Driver service via init.d, which in turn calls a bash script. That script does append a timestamp to the GC log filename:
# Timestamp (to millisecond precision), evaluated once when this script runs
today=$(date +%Y%m%dT%H%M%S%3N)
# Heap dump on OOM, written to a file tagged with this script's PID and the timestamp
SPARK_HEAP_DUMP="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${SPARK_LOG_HOME}/heapdump_$$_${today}.hprof"
# GC and safepoint logs, tagged with the same timestamp
SPARK_GC_LOGS="-Xloggc:${SPARK_LOG_HOME}/gc_${today}.log -XX:LogFile=${SPARK_LOG_HOME}/safepoint_${today}.log"
# Detailed GC/TLAB/safepoint logging, rotated across 15 files of 48M each
GC_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput -XX:+PrintFlagsFinal -XX:+PrintJNIGCStalls -XX:+PrintTLAB -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=15 -XX:GCLogFileSize=48M -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintAdaptiveSizePolicy -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintReferenceGC -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
I think the problem is that this script passes these options to the Spark Driver, which then forwards them to the Spark Executors (via the -Dspark.executor.extraJavaOptions argument) running on separate servers. When an Executor JVM crashes, it is restarted with the exact command it was originally launched with, which means the timestamp portion of the GC log filename never changes:
# Pull any extra JVM args from spark.properties, then append the GC, safepoint,
# and heap-dump options built above (all of which contain the frozen ${today} value)
SPARK_STANDALONE_OPTS=`property ${SPARK_APP_CONFIG}/spark.properties "spark-standalone.extra.args"`
SPARK_STANDALONE_OPTS="$SPARK_STANDALONE_OPTS $GC_OPTS $SPARK_GC_LOGS $SPARK_HEAP_DUMP"
# Launch the Driver; spark.executor.extraJavaOptions hands the same option string,
# verbatim, to every Executor JVM
exec java ${SPARK_APP_HEAP_DUMP} ${GC_OPTS} ${SPARK_APP_GC_LOGS} \
${DRIVER_JAVA_OPTIONS} \
-Dspark.executor.memory=${EXECUTOR_MEMORY} \
-Dspark.executor.extraJavaOptions="${SPARK_STANDALONE_OPTS}" \
-classpath ${CLASSPATH} \
com.company.spark.Main >> ${SPARK_APP_LOGDIR}/${SPARK_APP_LOGFILE} 2>&1 &
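As a sanity check, the frozen timestamp can be seen in the command line of a restarted Executor on a Worker host. This is a minimal sketch assuming the Workers run Linux and a standalone deployment where the Executor's main class is CoarseGrainedExecutorBackend; adjust the pgrep pattern if your setup differs:
# Print the -Xloggc flag of each running Executor JVM on this Worker host;
# after a crash/restart it still shows the timestamp from the original launch
for pid in $(pgrep -f CoarseGrainedExecutorBackend); do
  tr '\0' '\n' < /proc/$pid/cmdline | grep '^-Xloggc:'
done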
This is making it difficult for me to debug the cause of the segfaults, since I'm losing the activity and state of the Workers that led up to the JVM crash. Any ideas for how I can handle this situation and keep the GC logs on the Workers, even after a JVM crash/segfault?