Disclaimer: cloud-init is very complex, it supports lots of cloud vendors, and it's used in lots of different ways, but I think this is a fairly accurate simplified overview.
A couple of minor corrections first: cloud-init can run on any machine, not just a VM, and it can run on any boot, not just 'first boot'. It's basically just a way to run scripts during boot. Current Ubuntu server images, for example, come with cloud-init pre-installed, and it runs during boot, even on your desktop machine.
However, the main use case is first boot of "cloud images". The problem here is that cloud vendors want to ship an official distro release which just works, without the end-user having to actually carry out an installation, or the cloud vendor having to modify the distro in some way. cloud-init handles this by retrieving configuration data at various points during the boot process. In practice, this tends to be user names, passwords, ssh keys, locales, hostnames, additional repos, and so on. In other words, the sort of stuff you would have manually typed in during an installation, but normally without the network setup.
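To make that a bit more concrete, here's a minimal sketch in Python that writes out a small user-data file in cloud-config format. The key names (hostname, locale, users, ssh_authorized_keys) are standard cloud-config keys as far as I know; every value, the truncated ssh key, and the write_user_data helper are made up for illustration.

```python
from pathlib import Path

# Hypothetical values throughout; only the key names are meant to be real
# cloud-config keys. The first line must be the "#cloud-config" header.
USER_DATA = """\
#cloud-config
hostname: demo-host
locale: en_GB.UTF-8
users:
  - name: demo
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...placeholder demo@example
"""


def write_user_data(directory: Path) -> Path:
    """Write the example user-data into `directory` and return the file path."""
    path = directory / "user-data"
    path.write_text(USER_DATA, encoding="utf-8")
    return path
```

Calling write_user_data(Path(".")) leaves a user-data file in the current directory, which is the sort of thing a data source would then hand to cloud-init.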
cloud-init can frequently determine exactly what it is running on during boot, by querying the DMI/SMBIOS data, or a specific file such as /proc/1/environ. In these cases, it has built-in knowledge of where to find the required configuration data. In general, however, the data will come from the network or, failing that, a filesystem that is bundled with the image.
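As a rough illustration of the DMI check, here's a sketch that reads the product name from the Linux sysfs view of DMI and maps it to a platform. This is not cloud-init's actual detection code: the KNOWN_PLATFORMS table and both function names are my own, and the two product strings are just examples of what such a table might contain.

```python
from pathlib import Path
from typing import Optional

# Linux exposes DMI/SMBIOS fields as small text files under this directory.
DMI_DIR = Path("/sys/class/dmi/id")

# Illustrative table only; NOT the list cloud-init really uses.
KNOWN_PLATFORMS = {
    "OpenStack Nova": "openstack",
    "Google Compute Engine": "gce",
}


def read_dmi(field: str, base: Path = DMI_DIR) -> Optional[str]:
    """Return one DMI field as a stripped string, or None if it can't be read."""
    try:
        return (base / field).read_text(encoding="utf-8", errors="replace").strip()
    except OSError:
        # Missing file (no DMI, e.g. in a container) or insufficient permissions.
        return None


def guess_platform(base: Path = DMI_DIR) -> str:
    """Map the DMI product name to a platform label, defaulting to 'unknown'."""
    product = read_dmi("product_name", base)
    return KNOWN_PLATFORMS.get(product or "", "unknown")
```

The base parameter is only there so the function can be pointed at a fake directory for testing; on a real machine you'd just call guess_platform().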
Many (most? all?) cloud vendors run a private webserver for the image to talk to. The image itself is set up for dhcp on eth0 (it can instead retrieve the required network configuration from another data source, but I think it's much more common just to use dhcp, which is the fallback position). The webserver responds to requests from cloud-init for the user, vendor, and instance data. If you've created a VM at a cloud provider you'll have seen a user-data block that you can fill in - this is returned to cloud-init as the user data.
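The request side of this is just an HTTP GET. Here's a sketch using only the standard library, assuming the commonly used 169.254.169.254 metadata address and an EC2-style /latest/user-data path; both are assumptions here, since paths differ between vendors, and some services also want a session token or a special header, which this ignores. fetch_user_data is a hypothetical helper, not part of cloud-init.

```python
import urllib.request
from typing import Optional

IMDS_BASE = "http://169.254.169.254"   # assumed metadata service address
USER_DATA_PATH = "/latest/user-data"   # EC2-style path; other vendors differ


def fetch_user_data(base: str = IMDS_BASE, timeout: float = 2.0) -> Optional[bytes]:
    """GET the user data from a metadata service; return None on any failure."""
    url = base.rstrip("/") + USER_DATA_PATH
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Outside a cloud instance the call simply times out or is refused and you get None back, which is roughly the "no data source here" case.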
The docs have a simple tutorial which does exactly this: it uses QEMU to run an image, and the qemu-system-x86_64 command line sets the guest's smbios info to tell cloud-init where the Python webserver is (10.0.2.2:8000). In practice, most cloud vendors serve this private data from the link-local address 169.254.169.254. This is the 'Instance Metadata Service' (IMDS).
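For the server side, something like the following sketch would do for local experiments. It creates a seed directory containing user-data, meta-data, and vendor-data files and serves it with Python's built-in http.server. The file contents and both function names are made up for illustration, and binding to port 8000 is an assumption chosen to match the address mentioned above.

```python
import functools
import http.server
import tempfile
import threading
from pathlib import Path


def make_seed_dir() -> Path:
    """Create a temp directory holding the three files the guest will request."""
    seed = Path(tempfile.mkdtemp())
    # Hypothetical contents, kept as small as possible.
    (seed / "user-data").write_text("#cloud-config\nhostname: demo-host\n")
    (seed / "meta-data").write_text("instance-id: demo-instance-01\n")
    (seed / "vendor-data").write_text("")
    return seed


def start_server(seed: Path, host: str = "0.0.0.0",
                 port: int = 8000) -> http.server.ThreadingHTTPServer:
    """Serve `seed` over HTTP in a background thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(seed)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

You'd call start_server(make_seed_dir()) on the host and keep the process alive while the guest boots; with QEMU's default user-mode networking the guest reaches the host at 10.0.2.2, hence the 10.0.2.2:8000 address.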
There are various other ways to get the data, in addition to or instead of IMDS: a disk labelled config-2, for example, which is attached to the instance when it boots, or the kernel command line, or specific files in the filesystem.
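Reading a config drive is mostly a matter of finding the labelled disk and opening a couple of files on it. The sketch below assumes the OpenStack-style layout (openstack/latest/meta_data.json and openstack/latest/user_data) and assumes the disk has already been mounted somewhere; the function names are hypothetical and mounting is left out entirely.

```python
import json
from pathlib import Path
from typing import Optional, Tuple

# udev normally creates this symlink for a filesystem labelled "config-2".
LABEL_PATH = Path("/dev/disk/by-label/config-2")


def config_drive_present(label_path: Path = LABEL_PATH) -> bool:
    """Return True if a device with the config-2 label is visible."""
    return label_path.exists()


def read_config_drive(mount_point: Path) -> Tuple[dict, Optional[bytes]]:
    """Read metadata and (optional) user data from a mounted config drive.

    Assumes the OpenStack-style layout under openstack/latest/.
    """
    latest = mount_point / "openstack" / "latest"
    meta = json.loads((latest / "meta_data.json").read_text(encoding="utf-8"))
    user_data_file = latest / "user_data"
    # User data is optional, so a missing file is not an error.
    user_data = user_data_file.read_bytes() if user_data_file.is_file() else None
    return meta, user_data
```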
Note that cloud-init fits a very specific niche, where a vendor has to provide a standard image to an end-user, with some customisation. You can run custom images at a cloud vendor without cloud-init, but some vendors won't let you install custom images, for reasons best known to themselves.