very strange behaviour with ruby, openssl, unicorn, systemd (Gcloud)
Asked Answered
P

3

7

We started seeing some strange errors in our logs that normally appear when ruby isn't compiled properly with OpenSSL. But it's inconcistent...

We're getting errors like:

We managed to reproduce it when starting unicorn using service unicorn start or systemctl start unicorn. But only with some requests... Not all of them. Some requests that use OpenSSL under the hood do work. Others don't.

However, when we start unicorn with /etc/init.d/unicorn start, everything works without a hitch. (to clarify, systemd starts the same /etc/init.d script)

We tried debugging ENV vars, user permissions, file/dir ownership, recompile ruby, bootstrap a new server from scratch... Nothing seems to help.

In case this helps:

What are we missing? What can we try that we haven't thought of?

UPDATE 1

  • output of some debug commands, e.g. OpenSSL, ruby etc
  • PATH is being set inside the init.d script
  • unicorn is being executed via su into www-data user
  • The same problem happens when we use this unicorn.service file in /etc/systemd/system
  • We're running Ubuntu 16.04 on Gcloud
  • Ruby was not installed via apt (explicitly removed, in case platform came pre-installed) and compiled from scratch. We're currently running 2.3.4 and tried also 2.3.6. Compiled either manually or using ruby-build. No rbenv, nor RVM.
  • We install libssl-dev via apt (we're running apt-get install -y autoconf bison build-essential libssl-dev libyaml-dev libreadline6-dev zlib1g-dev libncurses5-dev libffi-dev libgdbm3 libgdbm-dev before building ruby)

UPDATE 2

We're using a scripted/repeatable build process for the VM (using fabric), and this problem is consistent on multiple VMs we bootstrapped on GCloud. We then tried a VM on DigitalOcean with the same bootstrap scripts, and the problem doesn't seem to appear there.

In both cases we picked Ubuntu 16.04 64bit base image, but obviously there are some differences with kernel versions, base installed packages etc...

UPDATE 3

The problem simply vanished. See my answer below.

Postimpressionism answered 28/2, 2018 at 18:1 Comment(6)
I realise it's not a classic question for SO, but we're stuck and really need any help we could get.Postimpressionism
I’d say the question is very well stated and it’s perfectly fine for SO. Upvoted.Sousa
Can you post your complete systemd unit file? Sometimes, systemd does strange stuff like system-call filtering, setting a chroot or resource constraints, ... which might affect the started processes in strange and sometimes indeterministic ways.Amphimixis
we don't use a systemd .service file. We use an init.d file. We actually tried to create a .service file, but had the same problem with it.Postimpressionism
In your systemd service file, you should set your path the the same way you do in your init.d file for parity: Environment="PATH=/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin:/sbin"Graaf
@MarkStosberg I added it, but the behaviour is the same. Updated the gist to reflect the changes I've made to the service file. To clarify, we originally kept just the init.d file, which was loaded automatically by systemd... I was just trying with a "real" .service file to test if it makes a difference, but it doesn't appear to...Postimpressionism
M
2

@gingerlime I had a similar situation with our Jenkins on GCP, we're using ChefDK 3.1.0 (ruby embeed 2.5.1p57) -- tried other also, over a Jenkins that was running over systemd (Ubuntu 16.04) and upstart (Ubuntu 14.04) -- we tried on both versions, right now running over 16.04 in 4.15.0-1023-gcp kernel version, running a few jobs with kitchen-docker and this problem always emerge in a few situations.

I digged into and found that this only happens when the Etc.getlogin class gets called (for me here), this doesn't return any error, it return the correct info, the correct type of the class (String), but once it gets a call, the Unsupported digest algorithm gets raised.

If I start the process manually by root or jenkins user, this problem doesn't happen. I tried to implement the Etc.getlogin in several different ways, like using ENV['USER'], a fixed String, or other classes from Etc, like getpwuid, simulating the return class and values from Etc.getlogin, and the error doesn't get raised.

I'm not sure if this is some bug related to the ruby version and the custom kernel that GCP instances uses, but it happens in a similar situation like yours, and for me, the Etc.getlogin was the problem. Right now, I fixed by using a custom configuration that doesn't gets the call from this function, and it's working normally.

Mucoid answered 27/10, 2018 at 21:15 Comment(1)
Yes, I think there's something fishy on GCP, but I couldn't figure out what it was. Even with Gold support, they won't tell me what changes were made to their platform that could have caused / resolved those issues unfortunately.Postimpressionism
G
1

One option is that this isn't an issue of sysVinit vs systemd at all, but you just haven't triggered the issue with your sysVinit script yet.

When you run your svsVinit script through the systemctl command it's going through a compatibility layer, and there may be a problem there. Your problem would be simplified both yourself and for us if you reproduced the issue directly with a systemd service file and shared that file.

You mentioned debugging ENV, but didn't mention exactly what you checked in the ENV. This is definitely one place where systemd could make a difference. As seen in man systemd.exec, systemd sets $PATH in the environment to a fixed value:

/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

If this is not exactly the same as when run directly as an sysVinit script, that could be an issue.

I would also check for all your copies of SSL on the system. Do you have more than one? Where? Do you have more than copy of the ruby openssl module loaded?

 locate -r lib/.*libssl.*so

Also see the answer to the FAQ: Why do things behave differently under systemd?

Graaf answered 28/2, 2018 at 21:37 Comment(8)
"I would also check for all your copies of SSL on the system..." - +1. I would be interested to know more about the different OpenSSL's installed on the system. And I would like to know which one Ruby was compiled and linked against; and which one it runtime links against. I've got a feeling its an intermittent memory error due to mixing/matching OpenSSL versions.Gean
Thanks Mark and @jww. I tried to answer your questions and provide more info. See UPDATE 1 on the question. Happy to run more things, answer more questions, try stuff... we're getting frustrated trying to figure this one out.Postimpressionism
Related posts about SSLError (SSL_CTX_new: (null) point to the issue being related to multiple versions of SSL being involved, which your debug output appears to confirm is happening on the system. You can try running the app with strace to confirm while files are being accessed before the crash.Graaf
I tried strace, but couldn't quite figure out what's going on to be honest... Any tips/pointers what to look for? (also any tips about potentially removing different versions, or pointing to the right version would be appreciated).Postimpressionism
@MarkStosberg thanks for your help. Upvoted. The problem suddenly vanished. It's really confusing and frustrating, but we just can't reproduce it all of a sudden. And it's not just on one place. On several instances, even when restored from snapshot. See my answer above, although it doesn't really shed much light onto this. I really appreciate your help though.Postimpressionism
Another theory I had, which seems like likely is that the SSL problem was actually with sites you were connecting to and not with your code. I don't think this matches up with your "SSLError (SSL_CTX_new: (null))" error, could it could explain how the issue "just vanished" when a site on the other end resolved the problem.Graaf
Thanks @MarkStosberg. Interesting theory! ... but I don't think it applies. We could trigger it with a local rails app that only connects to a local sphinx server using the mysql2 gem that didn't use SSL at all. I'm not too familiar with the mysql2 gem internals, nor with sphinx to tell you with absolute certainty. The problem also disappeared suddenly on this local setup. In addition, this problem didn't happen with Digital Ocean for example with the same external resources... And it also manifested itself with non-SSL local ops like SHA1 digests etc...Postimpressionism
... aaaand if it were external, then how come it worked fine when we started the process using /etc/init.d/unicorn start but failed with systemctl start unicorn?Postimpressionism
P
0

(also posted on this github issue)

It looks like the problem just vanished. We were testing and reproducing it consistently, across several Compute Engine instances on Google Cloud. Under certain conditions (unicorn / puma started by systemd, etc), it was completely reproducible both with our own rails app, and with a plain vanilla rails app we've set for testing purposes. It was reproducible across several ruby versions as well (we tested 2.3.4, 2.3.6 and 2.5.0).

Suddenly, all instances that were consistently failing started working without exhibiting these problems. Like it never existed. We didn't even reboot some of those instances, and we saw no evidence of any unattended upgrades taking place... We also had one snapshot of a system that had this problem, and that we can reliably reproduce on. Creating an instance from this snapshot stopped exhibiting it as well from that specific point in time a few hours ago.

We're totally confused as to what might have cause it, and what might have made it disappear... However, without being able to reproduce it now, I guess there's no point leaving this issue open, so will close it. Chalk it up to Deus ex machina I suppose. (perhaps the google support gods, but they haven't reported anything back to us yet)

Postimpressionism answered 2/3, 2018 at 19:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.