← Back to team overview

cloud-init team mailing list archive

cloudinit-analyze: tool for examining cloud-init boot time

 

Hello,

I've been working on a tool[1] to help analyze cloud-init boot time
performance.  It's loosely based on systemd-analyze where we have two
main actions: show and blame.

The 'show' action is similar to 'systemd-analyze critical-chain' which
prints a list of units, the time they started and how long they took.
For cloud-init, we have four stages, and within each stage a number
of modules may run depending on configuration.  ‘cloudinit-analyze show’
will, for each boot, print this information and a summary total time,
per boot. [2]

The 'blame' action matches 'systemd-analyze blame' where it prints, in
descending order, the units that took the longest to run.  This output is
highly useful for examining where cloud-init is spending its time during
execution. [3]

Additional data may be needed to further understand why a particular module
takes as long as it does.  I've provided some tooling to help capture this
data.

What I’ve found most useful so far is dumping strace output from the exec
of each cloud-init stage (init-local, init-net, config-modules,
config-final).
This is performed via the cloud-init.wrap shell script which I've integrated
to run when profiling is enabled (touch /etc/cloud/profiler-enable).  The
resulting
strace data is made available in
/var/run/cloud-{init-local,init,config,final}*.strace.PID

I've created a profiling "service" unit which when profiling is enabled will
run  just before cloud-init-local unit and stops at the end of
cloud-init-final unit.  The three hooks I've implemented so far are
blktrace,
perf and dstat.  This units will start up the various profilers which
collect data during cloud-init run and will write out data to
/var/lib/cloud/data/ on a per-profiler, per-boot directory which is
compressed
and can be post-processed at a later time.

cloudinit-analyze has a few requirements that must be met to be useful.
First, we require the event reporting infrastructure to be included in
cloud-init; this is first available in cloud-init 0.7.6 or newer.  Without
the
event reporting infrastructure we don't have timestamps collected at the
start
and finish of each event.  The second requirement is a subsecond timestamp
configuration.  Current cloud-init will merge with rsyslog configuration
during boot and at least in Ubuntu, the default rsyslog config does not use
a
high resolution timestamp.  This is addressed by updating the cloud-init
logging configuration.

With these two requirements met cloudinit-analyze can process cloud-init.log
data to produce useful summaries.

The simplest way to try out cloudinit-analyze and collect your own data is
to
use my modified version of cloud-init from this ppa:

ppa:raharper/cloud-init-dev

For existing Xenial or Yakkety Ubuntu instances, you can do the following:

sudo add-apt-repository -y ppa:raharper/cloud-init-dev
sudo apt update && sudo apt install --no-install-recommends cloud-init
sudo apt install --no-install-recommends linux-tools-virtual blktrace dstat
sudo rm -rf /var/log/cloud-init* /var/lib/cloud/data
/var/lib/cloud/instance*
# optionally enable profiling/tracing with:
# sudo touch /etc/cloud/profiler-enable
sudo reboot

# after logging into the instance
sudo cloudinit-analyze show
sudo cloudinit-analyze blame


If you're using LXD, you can do this locally:

# install some deps (cloud-utils with lxd support, lxc1 for lxc-usernsexec)
sudo add-apt-repository -y ppa:raharper/cloud-init-dev
sudo apt update && sudo apt install --no-install-recommends cloud-utils lxc1
git clone https://git.launchpad.net/~raharper/+git/cloudinit-analyze
cd cloudinit-analyze
lxc init ubuntu-daily:yakkety y1
./bin/inject-payload.sh lxd:y1 ppa
lxc start y1
lxc exec y1 -- systemctl is-active cloud-init.target

Once the is-active command returns 'active', then cloud-init has completed
and you can start processing cloud-init.log data:

Outside container:

lxc file pull y1/var/log/cloud-init.log - | ./bin/cloudinit-analyze show -i
-

or from within:

lxc exec y1 -- cloudinit-analyze show

After looking at your output, if you have questions or would like to discuss
what you're seeing and how to improve boot time, please join us in
#cloud-init
on FreeNode (irc.freenode.net) and the cloud-init community will be happy to
help.

I’ve used a combination of the above tools to track down a few items of
overhead in the NoCloud DataSource while testing with lxd and kvm booting
cloud-images yielding ~3 to ~13% improvement[4] (with profiling disabled of
course) . I’ll provide a follow-up email with some more details, but the
changes driven from the analysis thus far are captured in these patches[5]
I’m proposing for upstream which should help in other DataSources as well,
but
I've not yet taken measurements beyond NoCloud instances. Note, these
patches
are currently integrated into the cloud-init included my cloud-init-dev PPA.

Thanks,
Ryan Harper


1. https://git.launchpad.net/~raharper/+git/cloudinit-analyze
2. http://paste.ubuntu.com/23277326/
3. http://paste.ubuntu.com/23277328/
4. http://paste.ubuntu.com/23271663/
5.
https://code.launchpad.net/~raharper/cloud-init/+git/cloud-init/+ref/nocloud-boottime-improvements