

Re: Swift performance for very small objects

 

On 05/19/2012 06:34 PM, Paulo Ricardo Motta Gomes wrote:
Hello,

I'm doing some experiments on a Swift cluster testbed of 9 nodes/devices
and 3 zones (3 nodes in each zone).

In one of my tests, I noticed that PUTs of very small objects are
extremely inefficient.

- 5000 PUTs of objects with an average size of 40K - total of 195MB -
took 67s (avg time per request: 0.0135s)
- 5000 PUTs of objects with an average size of 190 bytes - total of
930KB - took 60s (avg time per request: 0.0123s)

I plotted object size vs. request time and found that there is a
significant difference in request times only above 200KB. When objects
are smaller than this, PUT requests have a minimum execution time of
0.01s, no matter the object size.
I suppose Swift is not optimized for such small objects, but I wonder
what the main cause for this is: the HTTP overhead or the disk writes. I
checked the logs of the object servers and requests are taking an
average of 0.006s, whether objects are 40K or 190 bytes, which indicates
part of the bottleneck could be at the disk. Currently I'm using a
loopback device for storage.

I thought that maybe this could be improved a bit if the proxy server
maintained persistent connections to the storage nodes instead of
opening a new one for each request?

Persistent TCP connections do nothing to improve disc performance, so they would be unlikely to have much of an effect if indeed the bulk of the response time is in getting things to disc on the server.

If you take a tcpdump trace you will be able to see how much of the time to service a request is taken up by the TCP three-way handshake. You would take the time from the client's send of the SYNchronize segment to the time the SYN|ACKnowledgement segment arrives from the server. You can also then look at the time between when the last byte of the request is sent and when the first/last byte of the response arrives.
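Purely as a rough sketch - assuming Python with scapy installed, and a capture holding just the one client connection - something like this would pull those two intervals out of a trace:

# Sketch: how much of one transaction in a pcap is TCP handshake
# (client SYN -> server SYN|ACK) vs. request/response time (last
# request data out -> first response data in). Assumes the capture
# contains a single connection from the given client port, e.g.
#   tcpdump -i eth0 -w /tmp/example.pcap port 63978
from scapy.all import rdpcap, TCP

SYN, ACK = 0x02, 0x10

def handshake_and_service_times(pcap_path, client_port):
    syn_t = synack_t = last_req_t = first_resp_t = None
    for p in rdpcap(pcap_path):
        if TCP not in p:
            continue
        t = p[TCP]
        if t.sport == client_port and (t.flags & SYN) and not (t.flags & ACK):
            syn_t = float(p.time)               # client SYN
        elif t.dport == client_port and (t.flags & SYN) and (t.flags & ACK):
            synack_t = float(p.time)            # server SYN|ACK
        elif t.sport == client_port and len(t.payload):
            last_req_t = float(p.time)          # last request data seen
        elif t.dport == client_port and len(t.payload) and first_resp_t is None:
            first_resp_t = float(p.time)        # first response data seen
    return synack_t - syn_t, first_resp_t - last_req_t

hs, svc = handshake_and_service_times("/tmp/example.pcap", 63978)
print("handshake: %.0f usec, request->response: %.0f usec"
      % (hs * 1e6, svc * 1e6))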

If you wanted, you could probably compare that with the likes of, say, a netperf TCP_CRR test and see it with something that is only TCP. The netperf TCP_CRR test will open a TCP connection, send a request of a given size, receive a response of a given size, and then close the connection.
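In other words, each TCP_CRR-style transaction boils down to roughly the following (a sketch, not netperf itself; the server name and port are placeholders for whatever answers a 318-byte request with a 128-byte response):

# Sketch of a TCP_CRR-style loop: connect, send a request, read a
# response, close, repeat, and report the transaction rate.
# SERVER:PORT is a purely hypothetical endpoint.
import socket, time

SERVER, PORT = "internal-host.example.com", 8080   # hypothetical
REQ_SIZE, RESP_SIZE, DURATION = 318, 128, 30.0

def one_transaction():
    with socket.create_connection((SERVER, PORT)) as s:
        s.sendall(b"x" * REQ_SIZE)
        remaining = RESP_SIZE
        while remaining > 0:
            chunk = s.recv(remaining)
            if not chunk:
                break
            remaining -= len(chunk)

count, start = 0, time.monotonic()
while time.monotonic() - start < DURATION:
    one_transaction()
    count += 1
elapsed = time.monotonic() - start
print("%.2f transactions/sec (%.0f usec per connect/request/response)"
      % (count / elapsed, elapsed / count * 1e6))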

For an example, here is a TCP_CRR test between my laptop and my workstation (100 Mbit/s), with netperf configured to include histogram support via configure --enable-histogram. I just picked a number out of the ether for the sizes:

raj@tardy:~/netperf2_trunk$ src/netperf -H raj-8510w.americas.hpqcorp.net -t TCP_CRR -l 30 -v 2 -- -r `expr 190 + 128`,128
MIGRATED TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net () port 0 AF_INET : histogram
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  318      128     30.00    2340.51
16384  87380

Alignment      Offset
Local  Remote  Local  Remote
Send   Recv    Send   Recv
    8      8       0      0

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_USEC  :    0:    0:    0: 18238: 47290: 3945:  486:  128:   34:   20
UNIT_MSEC     :    0:   47:   14:    1:    6:    2:    2:    0:    1:    1
TEN_MSEC      :    0:    0:    1:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
>100_SECS: 0
HIST_TOTAL:      70216

So, on average it was 1/2340, or about 0.00043 seconds, per complete transaction just doing things at the TCP level, without any OpenStack code or disc I/O, etc. There was a tail, but it was too short to be TCP retransmissions; my bet would be task scheduling, as my workstation was not otherwise idle.

In the midst of all that I did:

raj@tardy:~/netperf2_trunk$ sudo tcpdump -i eth0 -c 20 -w /tmp/example.pcap

and post-processing that with:

sudo tcpdump -i eth0 -r /tmp/example.pcap -ttt

here is one of those transactions:

raj@tardy:~/netperf2_trunk$ sudo tcpdump -i eth0 -r /tmp/example.pcap -ttt port 63978
reading from file /tmp/example.pcap, link-type EN10MB (Ethernet)
00:00:00.000000 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [SEW], seq 1709857287, win 14600, options [mss 1460,sackOK,TS val 111970417 ecr 0,nop,wscale 7], length 0
00:00:00.000097 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [S.E], seq 3540302987, ack 1709857288, win 14480, options [mss 1460,sackOK,TS val 428732549 ecr 111970417,nop,wscale 5], length 0
00:00:00.000007 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [.], ack 1, win 115, options [nop,nop,TS val 111970417 ecr 428732549], length 0
00:00:00.000010 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [P.], seq 1:319, ack 1, win 115, options [nop,nop,TS val 111970417 ecr 428732549], length 318
00:00:00.000158 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [.], ack 319, win 486, options [nop,nop,TS val 428732549 ecr 111970417], length 0
00:00:00.000062 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [P.], seq 1:129, ack 319, win 486, options [nop,nop,TS val 428732549 ecr 111970417], length 128
00:00:00.000004 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [F.], seq 129, ack 319, win 486, options [nop,nop,TS val 428732549 ecr 111970417], length 0
00:00:00.000006 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [.], ack 129, win 123, options [nop,nop,TS val 111970417 ecr 428732549], length 0
00:00:00.000008 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [F.], seq 319, ack 130, win 123, options [nop,nop,TS val 111970417 ecr 428732549], length 0

It was, in this LAN-based case, a very short time between the SYN (the first line, the one with "SEW" in it - the E and W relate to Explicit Congestion Notification, which is enabled on my systems) and the second line, which is the SYN|ACK (and the acceptance of ECN for this connection). I used "port 63978" as a packet filter to keep the relative timestamps helpful - otherwise, a last ACK from a previous connection would have been in there.

Anyway, it was only 97+7 microseconds (0.000104 s) before the connection was established from the point of view of the client, and only another 10 microseconds after that before the request was on its way. (Strictly speaking, past the tracepoint in the stack running on the client, but I rather doubt it was much longer before it was actually through the NIC and on the wire.)

It would be great if you could share your thoughts on this and how the
performance of this special case could be improved.

You should repeat the above (at least the tcpdump of your actual PUTs, if not also the pure netperf test) on your setup. I am as much a fan of persistent connections as anyone, but I think you will find that TCP connection setup overhead, in wall-clock time, is not the biggest component of the response-time floor you have found.
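If it helps to see the same split from the client side of an actual PUT, here is a rough sketch - the proxy host, path, and token below are only hypothetical placeholders, not your real settings:

# Sketch: time one small PUT, splitting the TCP connect from the
# request/response portion. Host, path and token are placeholders.
import http.client, time

PROXY_HOST, PROXY_PORT = "proxy.example.com", 8080          # hypothetical
PATH = "/v1/AUTH_test/container/small-object"               # hypothetical
TOKEN = "AUTH_tk_replace_me"                                # hypothetical

body = b"x" * 190                                           # ~190-byte object

t0 = time.monotonic()
conn = http.client.HTTPConnection(PROXY_HOST, PROXY_PORT)
conn.connect()                                              # TCP handshake
t1 = time.monotonic()
conn.request("PUT", PATH, body=body,
             headers={"X-Auth-Token": TOKEN,
                      "Content-Type": "application/octet-stream"})
resp = conn.getresponse()
resp.read()
t2 = time.monotonic()
conn.close()

print("status %d: connect %.4fs, request/response %.4fs"
      % (resp.status, t1 - t0, t2 - t1))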

The tcpdump suggestion above was for the client side. If you also take a trace at the server, you can see the length of time inside the server between when the request arrived off the wire and when the response was sent (queued to the driver, at least).

Caveat - you should not try to do math between absolute timestamps on the client and on the server unless you know that the two systems have really, Really, REALLY well synchronized clocks...

rick jones
http://www.netperf.org/

If you are still reading, for grins here is the TCP_RR for the same sizes. Just no connection establishment overhead:

raj@tardy:~/netperf2_trunk$ src/netperf -H raj-8510w.americas.hpqcorp.net -t TCP_RR -l 30 -v 2 -- -r `expr 190 + 128`,128
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net () port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  318      128     30.00    4114.35
16384  87380
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   243.052   4114.347 10.467    4.213

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_USEC  :    0:  774: 117564: 4923:  102:   25:    9:    4:    4:    7
UNIT_MSEC     :    0:   14:    2:    1:    0:    0:    0:    1:    1:    1
TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
>100_SECS: 0
HIST_TOTAL:      123432

Certainly, when/if there is very little service time in the server, a persistent connection will be "faster" - it will send rather fewer segments back and forth per transaction. But if the service times are rather larger than the network round-trip times involved, the gains from a persistent connection will be minimized.
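If you want a quick client-side feel for that trade-off, a sketch along these lines (same hypothetical proxy placeholders as before) compares fresh connections per request against a single reused connection:

# Sketch: N small PUTs over fresh connections vs. one persistent
# (keep-alive) connection. Host, path and token are placeholders.
import http.client, time

HOST, PORT = "proxy.example.com", 8080                      # hypothetical
PATH, TOKEN = "/v1/AUTH_test/container/obj-%d", "AUTH_tk_replace_me"
N, body = 200, b"x" * 190
headers = {"X-Auth-Token": TOKEN,
           "Content-Type": "application/octet-stream"}

def put(conn, i):
    conn.request("PUT", PATH % i, body=body, headers=headers)
    conn.getresponse().read()

start = time.monotonic()                # one connection per request
for i in range(N):
    conn = http.client.HTTPConnection(HOST, PORT)
    put(conn, i)
    conn.close()
per_request = time.monotonic() - start

start = time.monotonic()                # one persistent connection
conn = http.client.HTTPConnection(HOST, PORT)
for i in range(N):
    put(conn, i)
conn.close()
persistent = time.monotonic() - start

print("fresh: %.2f ms/PUT, persistent: %.2f ms/PUT"
      % (per_request / N * 1000, persistent / N * 1000))

With a near-idle server the persistent case should look noticeably cheaper per PUT; as the per-request service time grows, the two numbers converge.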

