← Back to team overview

cloud-init-dev team mailing list archive

[Merge] ~samgilson/cloud-init:cloud-init-analyze-boot into cloud-init:master

 

Sam Gilson has proposed merging ~samgilson/cloud-init:cloud-init-analyze-boot into cloud-init:master.

Commit message:
Cloud-init analyze module: Added an ability to analyze boot events leading up to cloud-init startup.

This branch introduces a new command line feature for cloud-init. Currently, the cloud-init module has the capability to analyze the events in cloud-init.log in three ways: 'show', 'blame', 'dump'. These changes add a fourth capability, called 'boot'. Running the command 'cloud-init analyze boot' will provide the user three timestamps. 
1) Timestamp for when the kernel starts initializing. 
2) Timestamp for when the kernel finishes its initialization. 
3) Timestamp for when systemd activates cloud-init.
This feature enables cloud-init users to analyze different boot phases that happen before cloud-init can start. This would aid in debugging performance issues related to cloud-init startup or tracking regression. 


Requested reviews:
  cloud-init commiters (cloud-init-dev)

For more details, see:
https://code.launchpad.net/~samgilson/cloud-init/+git/cloud-init/+merge/368943
-- 
Your team cloud-init commiters is requested to review the proposed merge of ~samgilson/cloud-init:cloud-init-analyze-boot into cloud-init:master.
diff --git a/cloudinit/analyze/__main__.py b/cloudinit/analyze/__main__.py
index f861365..0c09110 100644
--- a/cloudinit/analyze/__main__.py
+++ b/cloudinit/analyze/__main__.py
@@ -7,7 +7,7 @@ import re
 import sys
 
 from cloudinit.util import json_dumps
-
+from datetime import datetime
 from . import dump
 from . import show
 
@@ -52,9 +52,55 @@ def get_parser(parser=None):
                              dest='outfile', default='-',
                              help='specify where to write output. ')
     parser_dump.set_defaults(action=('dump', analyze_dump))
+    parser_boot = subparsers.add_parser(
+        'boot', help='Print list of boot times for kernel and cloud-init')
+    parser_boot.add_argument('-i', '--infile', action='store',
+                             dest='infile', default='/var/log/cloud-init.log',
+                             help='specify where to read input. ')
+    parser_boot.add_argument('-o', '--outfile', action='store',
+                             dest='outfile', default='-',
+                             help='specify where to write output.')
+    parser_boot.set_defaults(action=('boot', analyze_boot))
     return parser
 
 
+def analyze_boot(name, args):
+    """Report a list of how long different boot operations took.
+    For Example:
+        Kernel started at <time>
+        Kernel ended boot at <time>
+        Kernel time to boot (seconds): <time>
+        Cloud-init started at <time>
+        Time between Kernel boot and Cloud-init start (seconds): <time>
+    """
+    configured_io = configure_io(args)
+    outfh = configured_io[1]
+    kernel_info = show.dist_check_timestamp()
+    if kernel_info == show.TIMESTAMP_UNKNOWN:
+        outfh.write('Your Linux distro does not support this functionality.\n'
+                    'You must be running a Kernel Telemetry supported '
+                    'distro.\n Please check '
+                    'cloud-init/doc/rtd/topics/analyze.rst '
+                    'for more '
+                    'information on supported distros.\n')
+        return
+    kernel_start, kernel_end, cloudinit_start = kernel_info
+    kernel_start_timestamp = datetime.utcfromtimestamp(kernel_start)
+    kernel_end_timestamp = datetime.utcfromtimestamp(kernel_end)
+    cloudinit_start_timestamp = datetime.utcfromtimestamp(cloudinit_start)
+    outfh.write('-- Most Recent Boot Record --\n')
+    outfh.write('   Kernel Started at: %s\n' %
+                kernel_start_timestamp)
+    outfh.write('   Kernel ended boot at: %s\n' % kernel_end_timestamp)
+    kernel_runtime = (kernel_end - kernel_start)
+    outfh.write('   Kernel time to boot (seconds): %f\n' % kernel_runtime)
+    outfh.write('   Cloud-init start: %s\n' % cloudinit_start_timestamp)
+    between_process_runtime = (cloudinit_start - kernel_end)
+    outfh.write('   Time between Kernel boot '
+                'and Cloud-init '
+                'start (seconds): %f\n' % between_process_runtime)
+
+
 def analyze_blame(name, args):
     """Report a list of records sorted by largest time delta.
 
@@ -119,7 +165,7 @@ def analyze_dump(name, args):
 
 def _get_events(infile):
     rawdata = None
-    events, rawdata = show.load_events(infile, None)
+    events, rawdata = show.load_events_infile(infile)
     if not events:
         events, _ = dump.dump_events(rawdata=rawdata)
     return events
diff --git a/cloudinit/analyze/show.py b/cloudinit/analyze/show.py
index 3e778b8..ead57a7 100644
--- a/cloudinit/analyze/show.py
+++ b/cloudinit/analyze/show.py
@@ -8,6 +8,7 @@ import base64
 import datetime
 import json
 import os
+import time
 
 from cloudinit import util
 
@@ -49,6 +50,7 @@ format_key = {
 
 formatting_help = " ".join(["{0}: {1}".format(k.replace('%', '%%'), v)
                            for k, v in format_key.items()])
+TIMESTAMP_UNKNOWN = (-1, -1, -1)
 
 
 def format_record(msg, event):
@@ -125,9 +127,160 @@ def total_time_record(total_time):
     return 'Total Time: %3.5f seconds\n' % total_time
 
 
+class SystemctlReader(object):
+    '''
+    Class for dealing with all systemctl subp calls in a consistent manner.
+    '''
+    def __init__(self, property, parameter=None):
+        self.epoch = None
+        self.args = ['/bin/systemctl', 'show']
+        if parameter:
+            self.args.append(parameter)
+        self.args.extend(['-p', property])
+        # Don't want the init of our object to break. Instead of throwing
+        # an exception, set an error code that gets checked when data is
+        # requested from the object
+        self.failure = self.subp()
+
+    def subp(self):
+        '''
+        Make a subp call based on set args and handle errors by setting
+        failure code
+        :return: whether the subp call failed or not
+        '''
+        try:
+            value, err = util.subp(self.args, capture=True)
+            if err:
+                return err
+            self.epoch = value
+            return None
+        except Exception as systemctl_fail:
+            return systemctl_fail
+
+    def parse_epoch_as_float(self):
+        '''
+        If subp call succeeded, return the timestamp from subp as a float.
+        :return: timestamp as a float
+        '''
+        # subp has 2 ways to fail: it either fails and throws an exception,
+        # or returns an error code. Raise an exception here in order to make
+        # sure both scenarios throw exceptions
+        if self.failure:
+            raise RuntimeError('Subprocess call to systemctl has failed, '
+                               'returning error code ({})'
+                               .format(self.failure))
+        # Output from systemctl show has the format Property=Value.
+        # For example, UserspaceMonotonic=1929304
+        timestamp = self.epoch.split('=')[1]
+        # Timestamps reported by systemctl are in microseconds, converting
+        return float(timestamp) / 1000000
+
+
+def dist_check_timestamp():
+    '''
+    Determine which init system a particular linux distro is using.
+    Each init system (systemd, upstart, etc) has a different way of
+    providing timestamps.
+    :return: timestamps of kernelboot, kernelendboot, and cloud-initstart
+    or TIMESTAMP_UNKNOWN if the timestamps cannot be retrieved.
+    '''
+    cur_platform = util.system_info()['dist']
+    distro = cur_platform[0].lower()
+    version = cur_platform[1]
+
+    distros_to_check = {'ubuntu': lambda version: float(version) >= 15.04,
+                        'centos': lambda version: int(version.split('.')[0])
+                        >= 7,
+                        'redhat': lambda version: float(version) >= 15.04,
+                        'suse': lambda version: float(version) >= 12.2,
+                        'debian': lambda version: float(version) >= 8.0,
+                        'arch': lambda version: True,
+                        'fedora': lambda version: int(version) >= 15,
+                        'sles': lambda version: float(version) >= 12,
+                        'coreos': lambda version: int(version.split('.')[0])
+                        >= 94}
+
+    if distro in distros_to_check and distros_to_check[distro](version):
+        return gather_timestamps_using_systemd()
+
+    # Use dmesg to get timestamps if the distro does not have systemd
+    if util.is_FreeBSD() or 'gentoo' in \
+            util.system_info()['system'].lower():
+        return gather_timestamps_using_dmesg()
+
+    # this distro doesn't fit anything that is supported by cloud-init. just
+    # return error codes
+    return TIMESTAMP_UNKNOWN
+
+
+def gather_timestamps_using_dmesg():
+    '''
+    Gather timestamps that corresponds to kernel begin initialization,
+    kernel finish initialization using dmesg as opposed to systemctl
+    :return: the two timestamps plus a dummy timestamp to keep consistency
+    with gather_timestamps_using_systemd
+    '''
+    try:
+        data, err = util.subp(['dmesg'], capture=True)
+        if err:
+            return TIMESTAMP_UNKNOWN
+        split_entries = data[0].splitlines()
+        for i in split_entries:
+            if i.decode('UTF-8').find('user') != -1:
+                splitup = i.decode('UTF-8').split()
+                stripped = splitup[1].strip(']')
+
+                # kernel timestamp from dmesg is equal to 0,
+                # with the userspace timestamp relative to it.
+                user_space_timestamp = float(stripped)
+                kernel_start = float(time.time()) - float(util.uptime())
+                kernel_end = kernel_start + user_space_timestamp
+
+                # systemd wont start cloud-init in this case,
+                # so we cannot get that timestamp
+                return kernel_start, kernel_end, kernel_end
+
+    except Exception:
+        pass
+    return TIMESTAMP_UNKNOWN
+
+
+def gather_timestamps_using_systemd():
+    '''
+    Gather timestamps that corresponds to kernel begin initialization,
+    kernel finish initialization. and cloud-init systemd unit activation
+    :return: the three timestamps
+    '''
+    kernel_start = float(time.time()) - float(util.uptime())
+    try:
+        delta = SystemctlReader('UserspaceTimestampMonotonic')\
+            .parse_epoch_as_float()
+        kernel_end = kernel_start + delta
+        delta = SystemctlReader('InactiveExitTimestampMonotonic',
+                                'cloud-init-local').parse_epoch_as_float()
+        cloudinit_start = kernel_start + delta
+    # Except ALL exceptions as Systemctl reader can throw many different
+    # errors, but any failure in systemctl means that timestamps cannot be
+    # obtained
+    except Exception:
+        return TIMESTAMP_UNKNOWN
+    return kernel_start, kernel_end, cloudinit_start
+
+
 def generate_records(events, blame_sort=False,
                      print_format="(%n) %d seconds in %I%D",
                      dump_files=False, log_datafiles=False):
+    '''
+    Take in raw events and create parent-child dependencies between events
+    in order to order events in chronological order.
+    :param events: JSONs from dump that represents events taken from logs
+    :param blame_sort: whether to sort by timestamp or by time taken.
+    :param print_format: formatting to represent event, time stamp,
+    and time taken by the event in one line
+    :param dump_files: whether to dump files into JSONs
+    :param log_datafiles: whether or not to log events generated
+    :return: boot records ordered chronologically
+    '''
 
     sorted_events = sorted(events, key=lambda x: x['timestamp'])
     records = []
@@ -189,19 +342,35 @@ def generate_records(events, blame_sort=False,
 
 
 def show_events(events, print_format):
+    '''
+    A passthrough method that makes it easier to call generate_records()
+    :param events: JSONs from dump that represents events taken from logs
+    :param print_format: formatting to represent event, time stamp,
+    and time taken by the event in one line
+    :return: boot records ordered chronologically
+    '''
     return generate_records(events, print_format=print_format)
 
 
-def load_events(infile, rawdata=None):
-    if rawdata:
-        data = rawdata.read()
-    else:
-        data = infile.read()
+def load_events_infile(infile):
+    '''
+    Takes in a log file, read it, and convert to json.
+    :param infile: The Log file to be read
+    :return: json version of logfile, raw file
+    '''
+    data = infile.read()
+    return json_from_data(data), data
 
+
+def json_from_data(data):
+    '''
+    Take in a read-in logfile, and tries to make a json object of the log file
+    :param data: logfile.read()
+    :return: json version of data
+    '''
     j = None
     try:
         j = json.loads(data)
     except ValueError:
         pass
-
-    return j, data
+    return j
diff --git a/cloudinit/analyze/tests/test_boot.py b/cloudinit/analyze/tests/test_boot.py
new file mode 100644
index 0000000..fcf6e05
--- /dev/null
+++ b/cloudinit/analyze/tests/test_boot.py
@@ -0,0 +1,121 @@
+import os
+from cloudinit.analyze.show import dist_check_timestamp, SystemctlReader
+from cloudinit.analyze.__main__ import (analyze_boot, get_parser)
+from cloudinit.tests.helpers import CiTestCase, mock
+
+
+class TestDistroChecker(CiTestCase):
+
+    @mock.patch('cloudinit.util.system_info', return_value={'dist': ('', '',
+                                                                     ''),
+                                                            'system': ''})
+    @mock.patch('platform.linux_distribution', return_value=('', '', ''))
+    @mock.patch('cloudinit.util.is_FreeBSD', return_value=False)
+    def test_blank_distro(self, m_sys_info, m_linux_distribution, m_free_bsd):
+        err_code = (-1, -1, -1)
+        self.assertEqual(err_code, dist_check_timestamp())
+
+    @mock.patch('cloudinit.util.system_info', return_value={'dist': ('', '',
+                                                                     '')})
+    @mock.patch('platform.linux_distribution', return_value=('', '', ''))
+    @mock.patch('cloudinit.util.is_FreeBSD', return_value=True)
+    def test_freebsd_gentoo_cant_find(self, m_sys_info,
+                                      m_linux_distribution, m_is_FreeBSD):
+        err_code = (-1, -1, -1)
+        self.assertEqual(err_code, dist_check_timestamp())
+
+    @mock.patch('cloudinit.util.subp', return_value=(0, 1))
+    def test_subp_fails(self, m_subp):
+        err_code = (-1, -1, -1)
+        self.assertEqual(err_code, dist_check_timestamp())
+
+
+class TestSystemCtlReader(CiTestCase):
+
+    def test_systemctl_invalid_property(self):
+        reader = SystemctlReader('dummyProperty')
+        with self.assertRaises(RuntimeError):
+            reader.parse_epoch_as_float()
+
+    def test_systemctl_invalid_parameter(self):
+        reader = SystemctlReader('dummyProperty', 'dummyParameter')
+        with self.assertRaises(RuntimeError):
+            reader.parse_epoch_as_float()
+
+    @mock.patch('cloudinit.util.subp', return_value=('U=1000000', None))
+    def test_systemctl_works_correctly_threshold(self, m_subp):
+        reader = SystemctlReader('dummyProperty', 'dummyParameter')
+        self.assertEqual(1.0, reader.parse_epoch_as_float())
+        thresh = 1.0 - reader.parse_epoch_as_float()
+        self.assertTrue(thresh < 1e-6)
+        self.assertTrue(thresh > (-1 * 1e-6))
+
+    @mock.patch('cloudinit.util.subp', return_value=('U=0', None))
+    def test_systemctl_succeed_zero(self, m_subp):
+        reader = SystemctlReader('dummyProperty', 'dummyParameter')
+        self.assertEqual(0.0, reader.parse_epoch_as_float())
+
+    @mock.patch('cloudinit.util.subp', return_value=('U=1', None))
+    def test_systemctl_succeed_distinct(self, m_subp):
+        reader = SystemctlReader('dummyProperty', 'dummyParameter')
+        val1 = reader.parse_epoch_as_float()
+        m_subp.return_value = ('U=2', None)
+        reader2 = SystemctlReader('dummyProperty', 'dummyParameter')
+        val2 = reader2.parse_epoch_as_float()
+        self.assertNotEqual(val1, val2)
+
+    @mock.patch('cloudinit.util.subp', return_value=('100', None))
+    def test_systemctl_epoch_not_splittable(self, m_subp):
+        reader = SystemctlReader('dummyProperty', 'dummyParameter')
+        with self.assertRaises(IndexError):
+            reader.parse_epoch_as_float()
+
+    @mock.patch('cloudinit.util.subp', return_value=('U=foobar', None))
+    def test_systemctl_cannot_convert_epoch_to_float(self, m_subp):
+        reader = SystemctlReader('dummyProperty', 'dummyParameter')
+        with self.assertRaises(ValueError):
+            reader.parse_epoch_as_float()
+
+
+class TestAnalyzeBoot(CiTestCase):
+
+    def set_up_dummy_file(self, path, log_path):
+        infh = open(path, 'w+')
+        infh.write('dummy data')
+        infh.close()
+        outfh = open(log_path, 'w+')
+        outfh.close()
+
+    def remove_dummy_file(self, path, log_path):
+        if os.path.isfile(path):
+            os.remove(path)
+        if os.path.isfile(log_path):
+            os.remove(log_path)
+
+    @mock.patch('cloudinit.analyze.show.dist_check_timestamp',
+                return_value=(-1, -1, -1))
+    def test_boot_invalid_distro(self, m_dist_check_timestamp):
+
+        path = os.path.dirname(os.path.abspath(__file__))
+        log_path = path + '/boot-test.log'
+        path += '/dummy.log'
+        self.set_up_dummy_file(path, log_path)
+
+        parser = get_parser()
+        args = parser.parse_args(args=['boot', '-i', path, '-o',
+                                       log_path])
+        name_default = ''
+        analyze_boot(name_default, args)
+        # now args have been tested, go into outfile and make sure error
+        # message is in the outfile
+        outfh = open(args.outfile, 'r')
+        data = outfh.read()
+        err_string = 'Your Linux distro does not support this ' \
+                     'functionality.\n' 'You must be running a ' \
+                     'Kernel Telemetry supported '\
+                     'distro.\n Please check ' \
+                     'cloud-init/doc/rtd/topics/analyze.rst for more '\
+                     'information on supported distros.\n'
+
+        self.assertEqual(err_string, data)
+        self.remove_dummy_file(path, log_path)
diff --git a/doc/rtd/topics/analyze.rst b/doc/rtd/topics/analyze.rst
new file mode 100644
index 0000000..a6a6c23
--- /dev/null
+++ b/doc/rtd/topics/analyze.rst
@@ -0,0 +1,84 @@
+*************************
+Cloud-init Analyze Module
+*************************
+
+Overview
+========
+The analyze module was added to cloud-init in order to help analyze cloud-init boot time 
+performance. It is loosely based on systemd-analyze where there are 4 main actions: 
+show, blame, dump, and boot.
+
+The 'show' action is similar to 'systemd-analyze critical-chain' which prints a list of units, the 
+time they started and how long they took. For cloud-init, we have four stages, and within each stage
+a number of modules may run depending on configuration.  ‘cloudinit-analyze show’ will, for each 
+boot, print this information and a summary total time, per boot.
+
+The 'blame' action matches 'systemd-analyze blame' where it prints, in descending order, 
+the units that took the longest to run.  This output is highly useful for examining where cloud-init 
+is spending its time during execution.
+
+The 'dump' action simply dumps the cloud-init logs that the analyze module is performing
+the analysis on and returns a list of dictionaries that can be consumed for other reporting needs.
+
+The 'boot' action prints out kernel related timestamps that are not included in any of the
+cloud-init logs. There are three different timestamps that are presented to the user: 
+kernel start, kernel finish boot, and cloud-init start. This was added for additional
+clarity into the boot process that cloud-init does not have control over, to aid in debugging of 
+performance issues related to cloudinit startup or tracking regression.
+
+Usage
+=====
+Using each of the printing formats is as easy as running one of the following bash commands:
+
+.. code-block:: shell-session
+
+  cloud-init analyze show
+  cloud-init analyze blame
+  cloud-init analyze dump
+  cloud-init analyze boot
+
+Cloud-init analyze boot Timestamp Gathering
+===========================================
+The following boot related timestamps are gathered on demand when cloud-init analyze boot runs:
+- Kernel Startup, which is inferred from system uptime
+- Kernel Finishes Initialization, which is inferred from systemd UserSpaceMonotonicTimestamp property
+- Cloud-init activation, which is inferred from the property InactiveExitTimestamp of the cloud-init
+local systemd unit.
+
+In order to gather the necessary timestamps using systemd, running the commands
+
+.. code-block:: shell-session
+
+	systemctl show -p UserspaceTimestampMonotonic  
+	systemctl show cloud-init-local -p InactiveExitTimestampMonotonic
+
+will gather the UserspaceTimestamp and InactiveExitTimestamp. 
+The UserspaceTimestamp tracks when the init system starts, which is used as an indicator of kernel
+finishing initialization. The InactiveExitTimestamp tracks when a particular systemd unit transitions
+from the Inactive to Active state, which can be used to mark the beginning of systemd's activation
+of cloud-init.
+
+Currently this only works for distros that use systemd as the init process. We will be expanding
+support for other distros in the future and this document will be updated accordingly.
+
+If systemd is not present on the system, dmesg is used to attempt to find an event that logs the
+beginning of the init system. However, with this method only the first two timestamps are able to be found;
+dmesg does not monitor userspace processes, so no cloud-init start timestamps are emitted like when
+using systemd.
+
+List of Cloud-init analyze boot supported distros
+=================================================
+- Arch
+- CentOS
+- Debian
+- Fedora
+- OpenSuSE
+- Red Hat Enterprise Linux
+- Ubuntu
+- SUSE Linux Enterprise Server
+- CoreOS
+
+List of Cloud-init analyze boot unsupported distros
+===================================================
+- FreeBSD
+- Gentoo
\ No newline at end of file
diff --git a/doc/rtd/topics/capabilities.rst b/doc/rtd/topics/capabilities.rst
index 0d8b894..6d85a99 100644
--- a/doc/rtd/topics/capabilities.rst
+++ b/doc/rtd/topics/capabilities.rst
@@ -217,6 +217,7 @@ Get detailed reports of where cloud-init spends most of its time. See
 * **dump** Machine-readable JSON dump of all cloud-init tracked events.
 * **show** show time-ordered report of the cost of operations during each
   boot stage.
+* **boot** show timestamps from kernel initialization, kernel finish initialization, and cloud-init start.
 
 .. _cli_devel:
 
diff --git a/doc/rtd/topics/debugging.rst b/doc/rtd/topics/debugging.rst
index 51363ea..e13d915 100644
--- a/doc/rtd/topics/debugging.rst
+++ b/doc/rtd/topics/debugging.rst
@@ -68,6 +68,19 @@ subcommands default to reading /var/log/cloud-init.log.
          00.00100s (modules-final/config-rightscale_userdata)
          ...
 
+* ``analyze boot`` Make subprocess calls to the kernel in order to get relevant 
+  pre-cloud-init timestamps, such as the kernel start, kernel finish boot, and cloud-init start.
+
+.. code-block:: shell-session
+
+    $ cloud-init analyze boot 
+    -- Most Recent Boot Record --
+    	Kernel Started at: 2019-06-13 15:59:55.809385
+    	Kernel ended boot at: 2019-06-13 16:00:00.944740
+    	Kernel time to boot (seconds): 5.135355
+    	Cloud-init start: 2019-06-13 16:00:05.738396
+    	Time between Kernel boot and Cloud-init start (seconds): 4.793656
+
 
 Analyze quickstart - LXC
 ---------------------------