← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1349617] Re: SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

 

So this is what I found out.

Instance log from a failing istance [1]. The important bit there is
"cirros-apply-local already run per instance", and not "no userdata for
datasource" as initially thought. That was just me being stupid and
thinking the public key was part of user data. That was really silly.

"cirros-apply-local already run per instance" seems to appear in the console log for all SSH protocol banner failures [2]. the presence of duplicates makes it difficult to prove correlation with SSH protocol banner failures.
However, they key here is that local testing revealing that when the SSH connection fails there is no authorized_keys file in /home/cirros/.ssh. This obviously explains the authentication failure. Whether the subsequent SSH protocol banner errors are due to the cited MTU problems or else it has to be clarified yet.
What is certain is that cirros processes the data source containing the public SSH key before starting sshd. So the auth failures cannot be due to the init process not being yet complete.

The cirros initialization process executes a set of steps on an instance basis. This steps include setting public ssh keys.
"On an instance basis " means that these steps are not executed at each boot but once per instance.

cirros-apply local [3] is the step which processes, among other things, ssh public keys.
It is called by the cirros-per scripts [4], which at the end of its execution writes a marker file [5]. The cirros-per process will terminate if when executed the marker file is already present [6]

During the failing test it has been observed the following:

from the console log: 
[    3.696172] rtc_cmos 00:01: setting system clock to 2014-09-04 19:05:27 UTC (1409857527)

from the cirros-apply marker directory:
$ ls -le /var/lib/cirros/sem/
total 3
-rw-r--r--    1 root     root            35 Thu Sep  4 13:06:28 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.check-version
-rw-r--r--    1 root     root            22 Thu Sep  4 13:05:07 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.cirros-apply-local
-rw-r--r--    1 root     root            24 Thu Sep  4 13:06:31 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.userdata

as cirros defaults to MDT (UTC -6), this means the apply-local marker has been applied BEFORE instance boot.
This is consistent with the situation we're seeing where the failure always occur after events such as resize or stop.
The ssh public key should be applied in the first boot of the VM. When it's restarted the process is skipped as the key should already be there. Unfortunately the key isn't there, which is a bit of a mystery, especially since the instance is powered off in a graceful way thanks to [7].

Nevertheless when an instance receives a shutdown signal it sends a TERM signal to all processes. Meaning that the apply-local spawned by cirros-per at [4] can be killed before it actually writes the key.
However, cirros-per even if it retrieves the return code it writes the marker in any case [5]. 
This creates the conditions for a situation where the marker can be present without having actually completed the apply-local phase. As a result it is possible to have guests without SSH public key which manifest the failure reported in this bug.

Why is this happening only recently.
It seems a paradox, but [7] might be the reason.
This patch (and a few similar others) introduced soft instance shutdown. Soft instance shutdown avoid the abrupt shutdown which can actually leave the cirros-init process incomplete.
However, since cirros-per writes the marker regardless whether the process it called terminated with 0 or other code, it does not guarantee a successful completion.

On the other hand, introducing [7] added a delay before stopping the
instance. For instance in the case [8] it took 13 seconds. Previously
tempest was just immediately powering off the instance, not giving it a
chance to run cirros-init. Now, with the added dealy, the cirros-init
process is executed and this might be the reason that this failure,
which was previously occasional, has recently become the biggest gate
breaker.

What are the possible solutions?
1) fix cirros-per to not write the marker if the called process returned a non-zero value. The feasibility of this depends on wheter cirros-apply can be considered idempotent
2) adjust tempest to wait a little after the instance become ACTIVE. It could wait a fixed amount of time just to ensure instance initialization is completed.
3) Your proposal here.


[1] http://paste.openstack.org/show/106049/
[2] http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiY2lycm9zLWFwcGx5LWxvY2FsIGFscmVhZHkgcnVuIHBlciBpbnN0YW5jZVwiICBBTkQgdGFnczpjb25zb2xlIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA5ODc0NDg4NDM0fQ==
[3] http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/sbin/cirros-apply
[4] http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L102
[5] http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L106
[6]  http://bazaar.launchpad.net/~cirros-dev/cirros/0.3/view/head:/src/bin/cirros-per#L92
[7] http://git.openstack.org/cgit/openstack/nova/diff/nova/virt/libvirt/driver.py?id=c07ed15415c0ec3c5862f437f440632eff1e94df
[8] http://logs.openstack.org/95/117695/5/gate/gate-tempest-dsvm-neutron-full/d28fe19/logs/screen-n-cpu.txt.gz#_2014-09-02_16_36_04_542

** Also affects: cirros
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1349617

Title:
  SSHException: Error reading SSH protocol banner[Errno 104] Connection
  reset by peer

Status in CirrOS a tiny cloud guest:
  New
Status in Grenade - OpenStack upgrade testing:
  New
Status in OpenStack Neutron (virtual network service):
  New
Status in OpenStack Compute (Nova):
  Confirmed
Status in Tempest:
  New

Bug description:
  Noticed a drop in categorized bugs on grenade jobs, so looking at
  latest I see this:

  http://logs.openstack.org/63/108363/5/gate/gate-grenade-dsvm-partial-
  ncpu/1458072/console.html

  Running this query:

  message:"Failed to establish authenticated ssh connection to cirros@"
  AND message:"(Error reading SSH protocol banner[Errno 104] Connection
  reset by peer). Number attempts: 18. Retry after 19 seconds." AND
  tags:"console"

  http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRmFpbGVkIHRvIGVzdGFibGlzaCBhdXRoZW50aWNhdGVkIHNzaCBjb25uZWN0aW9uIHRvIGNpcnJvc0BcIiBBTkQgbWVzc2FnZTpcIihFcnJvciByZWFkaW5nIFNTSCBwcm90b2NvbCBiYW5uZXJbRXJybm8gMTA0XSBDb25uZWN0aW9uIHJlc2V0IGJ5IHBlZXIpLiBOdW1iZXIgYXR0ZW1wdHM6IDE4LiBSZXRyeSBhZnRlciAxOSBzZWNvbmRzLlwiIEFORCB0YWdzOlwiY29uc29sZVwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTkwMTEwMzMyLCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

  I get 28 hits in 7 days, and it seems to be very particular to grenade
  jobs.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cirros/+bug/1349617/+subscriptions