I'm stumped. I'm trying to run a Vagrant/VirtualBox/CoreOS cluster on Windows 8.1 to develop the cluster for running in the cloud. I've tried this on four machines, all running Windows 8.1 with the latest updates, the latest VirtualBox, Vagrant, and Git, and the same Vagrant config. I check the Vagrant config out of a repo on all four systems, so I'm confident the configs are identical. I get two successes and two failures.
Two machines succeed like this:
Bringing machine 'core-01' up with 'virtualbox' provider...
==> core-01: Checking if box 'coreos-stable' is up to date...
(snip)
core-01: SSH address: 127.0.0.1:2222
core-01: SSH username: core
core-01: SSH auth method: private key
core-01: Warning: Connection timeout. Retrying...
==> core-01: Machine booted and ready!
==> core-01: Setting hostname...
==> core-01: Configuring and enabling network interfaces...
vagrant ssh and vagrant halt both work fine on these two systems.
Two other Windows machines fail like this:
Bringing machine 'core-01' up with 'virtualbox' provider...
==> core-01: Importing base box 'coreos-stable'...
==> core-01: Matching MAC address for NAT networking...
==> core-01: Checking if box 'coreos-stable' is up to date...
==> core-01: Setting the name of the VM: coreos-vm-cluster_core-01_1422899531630_88904
==> core-01: Clearing any previously set network interfaces...
==> core-01: Preparing network interfaces based on configuration...
core-01: Adapter 1: nat
core-01: Adapter 2: hostonly
==> core-01: Forwarding ports...
core-01: 22 => 2222 (adapter 1)
==> core-01: Running 'pre-boot' VM customizations...
==> core-01: Booting VM...
==> core-01: Waiting for machine to boot. This may take a few minutes...
core-01: SSH address: 127.0.0.1:2222
core-01: SSH username: core
core-01: SSH auth method: private key
core-01: Warning: Connection timeout. Retrying...
core-01: Warning: Authentication failure. Retrying...
core-01: Warning: Authentication failure. Retrying...
core-01: Warning: Authentication failure. Retrying...
core-01: Warning: Authentication failure. Retrying...
core-01: Warning: Authentication failure. Retrying...
core-01: Warning: Authentication failure. Retrying...
Note that both the working and non-working systems hit one connection timeout. The successful ones then connect and finish bringing up the VM, whereas the unsuccessful ones get stuck in an authentication retry loop.
Following the authentication failure, if I leave it to time out or even if I ctrl+C, I can run "vagrant ssh core-01" and it takes me straight in:
CoreOS (stable)
core@localhost ~ $
'vagrant halt' also fails to make an ssh connection on these systems:
==> core-01: Attempting graceful shutdown of VM...
core-01: Guest communication could not be established! This is usually because
core-01: SSH is not running, the authentication information was changed,
core-01: or some other networking issue. Vagrant will force halt, if
core-01: capable.
==> core-01: Forcing shutdown of VM...
I can successfully use PuTTY or other ssh clients to access the VM using insecure_private_key for authentication, so I'm assuming the VM itself has the correct config and the problem lies with Vagrant's ability to call ssh to get in. If "vagrant up" can't ssh in, it cannot finish the startup config for the VM, so I'd like to solve this primarily for that reason.
This is the ssh config that lets me get in with other ssh clients and I believe should be used by Vagrant:
Host: 127.0.0.1
Port: 2222
Username: core
Private key: C:/Users/Mike/.vagrant.d/insecure_private_key
I have also enabled the GUI for the VMs, and the console does not show any errors; it gets all the way to a login prompt just fine (which is consistent with the fact that I can ssh in and otherwise use the VM).
I believe (but don't know how to verify) that Vagrant is calling the OpenSSH client in C:\Program Files (x86)\Git\bin.
All are running Vagrant version 1.7.2 and git 1.9.5. Ruby 2.0.0p353.
My %PATH% is about 500 chars long. I'm confident Vagrant is finding an ssh client of some sort due to getting at least one or two timeouts followed by an authentication failure.
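One way to check which ssh client a PATH lookup would resolve to is to walk the PATH entries in order. The following is a minimal Python sketch, not something from my Vagrant setup; the helper name find_executables is made up for illustration. It only shows what is on the PATH and does not tell you whether Vagrant actually invokes that executable during "vagrant up".

```python
import os
import shutil


def find_executables(name, path=None):
    """List every match for `name` across PATH entries, first hit first.

    `find_executables` is a hypothetical helper for illustration only.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        # shutil.which honours PATHEXT on Windows, so "ssh" can match ssh.exe.
        found = shutil.which(name, path=directory)
        if found and found not in matches:
            matches.append(found)
    return matches


if __name__ == "__main__":
    for hit in find_executables("ssh"):
        print(hit)
```

Running this on a working and a non-working machine and comparing the first line of output would show whether the two resolve to different ssh clients.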
Thanks in advance for any ideas!
Update: Buried deep in the output of "vagrant up --debug" is this little gem:
D, [2015-02-02T23:11:10.755468 #3920] DEBUG --
net.ssh.authentication.session[14661cc]: trying publickey
E, [2015-02-02T23:11:10.756472 #3920] ERROR --
net.ssh.authentication.key_manager[1473e1c]:
could not load public key file
`C:/Users/Mike/.vagrant.d/insecure_private_key':
Net::SSH::Exception (public key at
C:/Users/Mike/.vagrant.d/insecure_private_key.pub is not valid)
That final "insecure_private_key.pub is not valid" seems like a solid clue.
I've tried modifying that file to ensure it has just LF for line endings as well as CRLF and it makes no difference. Visually it looks fine. It's also 100% byte-for-byte identical to the file that's working on one of the other systems. Why would it be invalid? I have verified the current user has full control permissions on the file and also tried vagrant up as Administrator. No change in behavior. :(
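For anyone wanting to rule out the obvious causes, here is a rough Python sketch of the checks I mean by "visually it looks fine". It assumes the usual one-line OpenSSH public key layout ("<type> <base64> [comment]") and looks for a byte order mark, carriage returns, bad base64, and a key type mismatch. The function name check_public_key is hypothetical, and this is not the validation Net::SSH itself performs.

```python
import base64
import struct


def check_public_key(data):
    """Return a list of problems found in the raw bytes of a .pub file.

    Hypothetical helper; assumes the '<type> <base64> [comment]' layout.
    """
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
        data = data[3:]
    if b"\r" in data:
        problems.append("file contains carriage returns (CR)")
    fields = data.split()
    if len(fields) < 2:
        problems.append("expected at least '<type> <base64>' on one line")
        return problems
    key_type, encoded = fields[0], fields[1]
    try:
        blob = base64.b64decode(encoded, validate=True)
    except ValueError:
        problems.append("second field is not valid base64")
        return problems
    # The decoded blob starts with a 4-byte big-endian length followed by
    # the key type name, which should match the first field of the line.
    if len(blob) < 4:
        problems.append("decoded key is too short")
        return problems
    (name_len,) = struct.unpack(">I", blob[:4])
    if blob[4:4 + name_len] != key_type:
        problems.append("key type inside the blob does not match the first field")
    return problems
```

For example, passing it the bytes of C:/Users/Mike/.vagrant.d/insecure_private_key.pub (opened in "rb" mode) returns an empty list if none of these particular problems are present. An empty list does not prove that Net::SSH would accept the file; it only eliminates these specific causes.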
vagrant ssh does work. Is it always reproducible that 2 Windows machines work and 2 don't (even after a vagrant destroy)? You could try turning on more verbose debug messages with Vagrant, then compare a working with a non-working system to see if any differences appear. (Docs: Debugging and Troubleshooting) – Batfowl