openstack team mailing list archive

Swift reliability

To: "'openstack@xxxxxxxxxxxxxxxxxxx'" <openstack@xxxxxxxxxxxxxxxxxxx>
From: Phil Holden <Phil.Holden@xxxxxxxxxxxxxxxxx>
Date: Wed, 26 Sep 2012 10:39:13 +0000

Hello,

I have been continuing to run the Swift reliability test described at
https://answers.launchpad.net/swift/+question/201627
This is now using ext4 filesystems but continues to have some issues.
The test has been resized a little and now consists of 40 threads doing
a PUT with an object, then a GET on it some time later. Each thread will
eventually PUT 15,000 objects in 1 container per thread. The object
number then wraps around and it should thereafter be over-writing
objects which already exist. The data objects are very small, e.g.,
"Content of object 11234 in container 15-1 \n"
The test is rate limited. It has been run at up to 2,100 HTTP requests
(GET or PUT) per minute which is the expected traffic rate we want it to
support.
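
For reference, the figures above imply the following per-thread pacing and time to wrap-around. This is a rough sketch only: it assumes one GET for every PUT (so half of all requests are PUTs) and that the rate limit is shared evenly across the 40 threads. The function names are illustrative and are not taken from the test harness.

```python
# Back-of-envelope numbers for the test as described above.
THREADS = 40
OBJECTS_PER_THREAD = 15_000
PUT_FRACTION = 0.5  # assumption: one GET per PUT


def per_thread_interval_s(rate_per_min, threads=THREADS):
    """Seconds between requests on each thread if the rate is shared evenly."""
    return 60.0 * threads / rate_per_min


def hours_to_wrap(rate_per_min, put_fraction=PUT_FRACTION):
    """Hours until every thread has PUT all its objects once and starts over-writing."""
    total_puts = THREADS * OBJECTS_PER_THREAD
    puts_per_min = rate_per_min * put_fraction
    return total_puts / puts_per_min / 60.0


if __name__ == "__main__":
    for rate in (2_100, 525):
        print(rate, round(per_thread_interval_s(rate), 2), round(hours_to_wrap(rate), 1))
```

Under those assumptions, 2,100 requests per minute works out to roughly one request every 1.14 seconds per thread, and the object numbers wrap after about 9.5 hours.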

The Swift cluster consists of a load balancer in front of 2 x Swift
proxies, in turn connected to 6 Swift data nodes. All these systems are
VMs in a managed cluster of physical servers and so may compete for
physical resources, but we think they are provisioned adequately for
this phase of testing. Other tests have achieved over 3,500 HTTP
requests/minute using this cluster. The rings are configured for 3
replicas of the data. The Swift version is Essex (2012.1).

A number of problems continue to be encountered with the test. These
have been as follows:

The problems described in question 201627 (above) continued to occur
when XFS filesystems were used. They are not seen when ext4
filesystems are used.

The remaining problems have only been seen using ext4 filesystems. They
occur after the test has been running for several days. Using XFS
filesystems, the test gets stuck as in question 201627 before
encountering any of these.

After the test has wrapped around on the object number that it is
writing, space usage continues to grow, eventually filling all the data
nodes. If an object is over-written, replacing its contents, is the
old data freed immediately or is it left around, waiting to be tidied
away later by some clean-up process? The object-expirer is being run on
one of the proxy nodes, but all objects should be over-written well
before their expiry time.
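
One way to check whether superseded versions are being left on disk would be to look for object directories that hold more than one .data file. The following is a minimal sketch, not something that has been run against this cluster. It assumes the usual Swift on-disk layout of <device>/objects/<partition>/<suffix>/<hash>/<timestamp>.data, and the function names are illustrative.

```python
import os
from collections import Counter


def count_data_files(objects_root):
    """Map each directory under objects_root to the number of .data files it holds."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(objects_root):
        n = sum(1 for name in filenames if name.endswith(".data"))
        if n:
            counts[dirpath] = n
    return counts


def stale_dirs(objects_root):
    """Directories holding more than one .data file, i.e. possible old versions."""
    return sorted(d for d, n in count_data_files(objects_root).items() if n > 1)
```

Calling stale_dirs() on the objects directory of each device on a data node would list any candidates. An empty result would suggest the growth comes from somewhere other than old object versions.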

On one occasion half the data nodes filled to 100% and the cluster
overall became unresponsive. This was resolved by a rolling restart, in
which each of the data nodes was restarted one by one.

HTTP 404 : Not Found is repeatedly reported on a PUT to an object in an
existing container. The test gets stuck on this until it is resolved.
This can often be resolved by a rolling restart where each of the data
nodes is restarted, one-by-one.

One of the Swift proxy server processes became unresponsive. This meant
that only half the requests succeeded, the ones which went through the
other proxy. There was nothing evident in the logfiles. The proxy
process did not respond to an ordinary kill (SIGTERM). A SIGKILL was
needed to remove it. The object-expirer which was running at the same
time on the same host did respond to SIGTERM and stopped. Everything
continued normally after the proxy server and object-expirer were
restarted.

Further testing is being performed at a reduced rate of 525 HTTP
requests per minute (25% of the target rate) to see if this Swift
cluster will perform more reliably at this reduced rate.

Can anyone shed any light on the problems described above and suggest
ways they could be prevented from happening?

The overall purpose of the test is to determine if Swift can be reliably
used for storage of mission-critical data. Obviously open source
software such as this comes with no warranty, but, in a similar manner
to making a judgement about use of the Linux kernel and filesystems and
related software for mission-critical activities, a judgement about the
use of Swift needs to be made. This test is intended to support the
ability to make this decision.

Regards
- Phil -

Follow ups

Re: Swift reliability
From: John Dickinson, 2012-09-26