Cluster hangs in 'ssh-ready' state using Spark 1.2.0 EC2 launch script
I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k <key-pair> -i <identity-file>.pem -r us-west-2 -s 3 launch test
Setting up security groups...
Searching for existing cluster test...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 3 slaves in us-west-2c, regid = r-b_______6
Launched master in us-west-2c, regid = r-0______0
Waiting for all instances in cluster to enter 'ssh-ready' state..........

Yet I can SSH into these instances without complaint:

ubuntu@machine:~$ ssh -i <identity-file>.pem root@master-ip
Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB-CCCC-DDD.eee1.ff.provider.net

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 59 security update(s) out of 257 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2014.09 is available.
[root@ip-internal ~]$

I'm trying to figure out whether this is a problem with AWS or with the Spark scripts. I never had this issue until recently.

Officialism answered 17/1, 2015 at 17:54 Comment(1)
1. Where are you SSH-ing into the cluster from? 2. Where are you launching the cluster from? 3. Are you sure all the nodes in the cluster are accessible by SSH? 4. Does this happen consistently? – Vireo

Spark 1.3.0+

This issue is fixed in Spark 1.3.0.


Spark 1.2.0

Your problem is caused by SSH failing silently due to conflicting entries in your SSH known_hosts file.

To resolve the issue, add -o UserKnownHostsFile=/dev/null to the SSH options used by your spark_ec2.py script.
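A minimal sketch of that change, assuming the options are assembled in a helper like the 1.2.0-era ssh_args function (the exact name and shape may differ in your copy of the script):

# Sketch of the spark_ec2.py change (Python 2, matching the script's era).
# The added UserKnownHostsFile=/dev/null option stops stale known_hosts
# entries from silently breaking the SSH readiness probe.
def ssh_args(opts):
    parts = ['-o', 'StrictHostKeyChecking=no',
             '-o', 'UserKnownHostsFile=/dev/null']  # the added option
    if opts.identity_file is not None:
        parts += ['-i', opts.identity_file]
    return parts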


Optionally, to clean up and avoid problems connecting to your cluster over SSH later on, I recommend that you:

  1. Remove all the lines from ~/.ssh/known_hosts that refer to EC2 hosts, for example:

ec2-54-154-27-180.eu-west-1.compute.amazonaws.com,54.154.27.180 ssh-rsa (...)

     (If you prefer not to edit the file by hand, ssh-keygen -R <hostname> removes the entry for a given host.)

  2. Use this solution to stop checking and storing the fingerprints of your EC2 instances' temporary IPs altogether.
Microscopy answered 20/1, 2015 at 12:9 Comment(10)
I did not need to remove all of the known AWS hosts, as setting UserKnownHostsFile to /dev/null is enough to correct the problem where the SSH process fails silently and appears to hang. – Adkison
@Adkison thanks, I edited the answer to separate the necessary and optional steps (and more :). – Microscopy
I opened an issue in Spark's JIRA and a PR with my change: issues.apache.org/jira/browse/SPARK-5403. Please vote on it if you're affected! – Microscopy
I followed all the steps, waited 2+ hours and bang! The cluster started. A lot of patience needed. – Aurochs
@Aurochs I am glad it works for you. :) But it's strange that it's so slow. For me it takes at most 10 minutes for a cluster with 10 quite big slaves. What kind of cluster are you launching? – Microscopy
@GrzegorzDubicki surprisingly, just a master and one slave for playing around. I agree it's strange. – Aurochs
Thanks @GrzegorzDubicki. Issue resolved in pull request 4196 (github.com/apache/spark/pull/4196), fixed in Spark 1.3.0. – Officialism
I get this same issue even with the change (it's now been pulled into the spark-ec2 scripts). Any other ideas, folks? – Tipperary
I have the same issue, but I'm using 1.3.1. Warning: SSH connection error. (This could be temporary.) Host: ec2-[deleted for privacy] SSH return code: 255 SSH output: ssh: connect to host ec2-[deleted for privacy] port 22: Connection refused. Cluster is now in 'ssh-ready' state. Waited 486 seconds. – Lanceolate
I think that it's another problem, @Frank B. See if https://mcmap.net/q/274086/-ssh-script-returns-255-error helps. – Microscopy

I had the same problem, and I followed all the steps mentioned in the thread (mainly adding -o UserKnownHostsFile=/dev/null to the spark_ec2.py script), but it still hung, saying:

Waiting for all instances in cluster to enter 'ssh-ready' state

Short answer:

Change the permissions of the private key file and rerun the spark-ec2 script:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% chmod 0400 /tmp/mykey.pem

Long answer:

To troubleshoot, I modified spark_ec2.py to log the SSH command it used, then tried executing that command at the prompt myself; the problem turned out to be bad permissions on the key file:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/mykey.pem -o ConnectTimeout=3 root@52.1.208.72
Warning: Permanently added '52.1.208.72' (RSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '/tmp/mykey.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /tmp/mykey.pem
Permission denied (publickey).
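
If you want to catch this before launching, here is a quick hypothetical helper (not part of spark_ec2.py) that fails fast when the key is readable by group or others, which is the same rule SSH enforces:

import os
import stat
import sys

def check_key_permissions(path):
    """Exit early if ssh would reject the private key as too open."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        sys.exit("Permissions %04o for '%s' are too open; run: chmod 400 %s"
                 % (mode, path, path))

check_key_permissions("/tmp/mykey.pem")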
Knowing answered 7/3, 2015 at 20:31 Comment(0)

I just ran into the exact same situation. I went into the Python script at def is_ssh_available() and had it dump out the return code and command.

except subprocess.CalledProcessError, e:
    print "CalledProcessError"
    print e.returncode
    print e.cmd
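
For context, here is a self-contained sketch of what the instrumented check might look like; the function body and option names are assumptions, not copied from the actual script:

import subprocess

def is_ssh_available(host, opts):
    """Sketch: return True once we can SSH into `host` (Python 2 style)."""
    cmd = ['ssh', '-o', 'StrictHostKeyChecking=no',
           '-o', 'UserKnownHostsFile=/dev/null',
           '-o', 'ConnectTimeout=3',
           '-i', opts.identity_file,
           '%s@%s' % (opts.user, host), 'true']
    try:
        subprocess.check_call(cmd)
        return True
    except subprocess.CalledProcessError, e:
        # The added debugging: print exactly which command failed and how,
        # so it can be re-run by hand to surface the real SSH error.
        print "CalledProcessError"
        print e.returncode  # 255 usually means an SSH-level failure
        print e.cmd
        return False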

I had the key file location as ~/.pzkeys/mykey.pem. As an experiment, I changed it to the fully qualified path, i.e. /home/pete.zybrick/.pzkeys/mykey.pem, and that worked OK.
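
A likely explanation (an assumption on my part, not verified against the script) is that the tilde was never expanded, because paths passed straight to subprocess do not get shell tilde expansion:

import os.path

# Hypothetical illustration: '~/.pzkeys/mykey.pem' is passed to ssh
# literally unless the script expands it first.
identity_file = os.path.expanduser("~/.pzkeys/mykey.pem")
print identity_file  # e.g. /home/pete.zybrick/.pzkeys/mykey.pem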

Right after that, I ran into another error: I tried to use --user=ec2-user (I try to avoid using root) and got a permission error on rsync, so I removed --user=ec2-user so the script would use root by default, made another attempt with --resume, and it ran to successful completion.

Aircondition answered 17/1, 2015 at 22:42 Comment(0)

I used the absolute (not relative) path to my identity file (inspired by Peter Zybrick) and did everything Grzegorz Dubicki suggested. Thank you.

Officialism answered 23/1, 2015 at 10:40 Comment(1)
If Grzegorz Dubicki's answer was correct, mark that as the correct answer. – Cunning
