openstack team mailing list archive
-
openstack team
-
Mailing list archive
-
Message #17932
Re: Troubleshooting Swift 1.7.4 on mini servers
Also check the number of inodes used: `df -i`
--John
On Oct 29, 2012, at 8:31 AM, Nathan Trueblood <nathan@xxxxxxxxxxxxxxxx> wrote:
> Yeah, I read about the 507 error. However, when the error occurs on my I can see with 'df' that the drive is only 1% full and is definitely not unmounted. I can write files to the mounted filesystem directly before, during, and after the Swift error occurs. So the problem must be some kind of timeout that is causing the object server to think that something is wrong with the disk.
>
> I'll keep digging...
>
> On Fri, Oct 26, 2012 at 11:21 PM, John Dickinson <me@xxxxxx> wrote:
> A 507 is returned by the object servers in 2 situations: 1) the drives are full or 2) the drives have been unmounted because of disk error.
>
> It's highly likely that you simply have full drives. Remember that the usable space in your cluster is 1/N where N = replica count. As an example, with 3 replicas and 5 nodes with a single 1TB drive each, you only have about 1.6TB available for data.
>
> As Pete suggested in his response, how big are your drives, and what does `df` tell you?
>
> --John
>
>
> On Oct 26, 2012, at 5:26 PM, Nathan Trueblood <nathan@xxxxxxxxxxxxxxxx> wrote:
>
> > Hey folks-
> >
> > I'm trying to figure out what's going wrong with my Swift deployment on a small cluster of "mini" servers. I have a small test cluster (5 storage nodes, 1 proxy) of mini-servers that are ARM-based. The proxy is a regular, Intel-based server with plenty of RAM. The object/account/container servers are relatively small, with 2GB of RAM per node.
> >
> > Everything starts up fine, but now I'm trying to troubleshoot a strange problem. After I successfully upload a few test files, it seems like the storage system stops responding and the proxy gives me a 503 error.
> >
> > Here's the test sequence I run on my proxy:
> >
> > lab@proxy01:~/bin$ ./swiftcl.sh stat
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass stat
> > Account: AUTH_system
> > Containers: 5
> > Objects: 4
> > Bytes: 47804968
> > Accept-Ranges: bytes
> > X-Timestamp: 1351294912.72119
> > lab@proxy01:~/bin$ ./swiftcl.sh upload myfiles1 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles1 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab@proxy01:~/bin$ ./swiftcl.sh upload myfiles2 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles2 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab@proxy01:~/bin$ ./swiftcl.sh upload myfiles3 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles3 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab@proxy01:~/bin$ ./swiftcl.sh upload myfiles4 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles4 /home/lab/bigfile1
> > home/lab/bigfile1
> > lab@proxy01:~/bin$ ./swiftcl.sh upload myfiles5 /home/lab/bigfile1
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass upload myfiles5 /home/lab/bigfile1
> > Object PUT failed: http://172.16.1.111:8080/v1/AUTH_system/myfiles5/home/lab/bigfile1 503 Service Unavailable [first 60 chars of response] 503 Service Unavailable
> >
> > The server is currently unavailable
> > lab@proxy01:~/bin$ ./swiftcl.sh stat
> > swift -A http://proxy01:8080/auth/v1.0 -U system:root -K testpass stat
> > Account: AUTH_system
> > Containers: 6
> > Objects: 5
> > Bytes: 59756210
> > Accept-Ranges: bytes
> > X-Timestamp: 1351294912.72119
> >
> > Here's the corresponding log on the Proxy:
> >
> > Oct 26 17:06:52 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/06/52 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:13 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/13 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0017
> > Oct 26 17:07:13 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/13 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:22 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/22 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:22 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/22 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:27 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/27 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:27 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/27 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:27 proxy01 proxy-server Handoff requested (1) (txn: tx6946419daba54efe9c2878f8a2a78f88) (client_ip: 172.16.1.111)
> > Oct 26 17:07:27 proxy01 proxy-server Handoff requested (2) (txn: tx6946419daba54efe9c2878f8a2a78f88) (client_ip: 172.16.1.111)
> > Oct 26 17:07:33 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/33 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0010
> > Oct 26 17:07:33 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/33 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0016
> > Oct 26 17:07:33 proxy01 proxy-server Handoff requested (1) (txn: tx5f9659f74cb2491f9a63cbb84f680c5c) (client_ip: 172.16.1.111)
> > Oct 26 17:07:33 proxy01 proxy-server Handoff requested (2) (txn: tx5f9659f74cb2491f9a63cbb84f680c5c) (client_ip: 172.16.1.111)
> > Oct 26 17:07:39 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/39 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0009
> > Oct 26 17:07:39 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/07/39 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0009
> > Oct 26 17:07:39 proxy01 proxy-server Handoff requested (1) (txn: tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> > Oct 26 17:07:39 proxy01 proxy-server Handoff requested (2) (txn: tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> > Oct 26 17:07:40 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: tx8dc917a4a8c84c40a4429b7bab0323c6) (client_ip: 172.16.1.111)
> > Oct 26 17:07:41 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: tx07a1f5dfaa23445a88eaa4a2ade68466) (client_ip: 172.16.1.111)
> > Oct 26 17:07:43 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: tx938d08b706844db3886695b798bd9fad) (client_ip: 172.16.1.111)
> > Oct 26 17:07:47 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: txa35e9f8a54924f139e13d6f3a5dc457f) (client_ip: 172.16.1.111)
> > Oct 26 17:07:55 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: txc2f5c35b54a14b32a2d0178423bc7771) (client_ip: 172.16.1.111)
> > Oct 26 17:08:11 proxy01 proxy-server Object PUT returning 503, 1/2 required connections (txn: txd8df89e76edd40f3b4c04091b3cbc857) (client_ip: 172.16.1.111)
> > Oct 26 17:08:17 proxy01 proxy-server - 127.0.0.1 27/Oct/2012/00/08/17 GET /auth/v1.0/ HTTP/1.0 200 - - - - - - - - 0.0011
> >
> > And here's the log from one of the Storage Nodes:
> >
> > Oct 26 17:07:20 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:20 +0000] "PUT /sda6/200589/AUTH_system/myfiles1" 201 - "tx063cd0643ef64d169b3bc5a86a808df4" "-" "-" 0.0510
> > Oct 26 17:07:20 data05 account-server 192.168.1.204 - - [27/Oct/2012:00:07:20 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "tx063cd0643ef64d169b3bc5a86a808df4" "-" "-" 0.0154 ""
> > Oct 26 17:07:20 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:20 +0000] "HEAD /sda6/36256/AUTH_system/myfiles1/home/lab/bigfile1" 404 - "-" "tx6f1018e5da4d4974a85d544a0874a581" "-" 0.0015
> > Oct 26 17:07:21 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:21 +0000] "PUT /sda6/36256/AUTH_system/myfiles1/home/lab/bigfile1" 201 - "-" "tx59f931d3281d41159eba0e0c88c18107" "-" 1.0255
> > Oct 26 17:07:21 data05 container-server 192.168.1.202 - - [27/Oct/2012:00:07:21 +0000] "PUT /sda6/200589/AUTH_system/myfiles1/home/lab/bigfile1" 201 - "tx59f931d3281d41159eba0e0c88c18107" "-" "-" 0.0218
> > Oct 26 17:07:24 data05 account-replicator Beginning replication run
> > Oct 26 17:07:25 data05 account-replicator Replication run OVER
> > Oct 26 17:07:25 data05 account-replicator Attempted to replicate 1 dbs in 0.21688 seconds (4.61073/s)
> > Oct 26 17:07:25 data05 account-replicator Removed 0 dbs
> > Oct 26 17:07:25 data05 account-replicator 2 successes, 0 failures
> > Oct 26 17:07:25 data05 account-replicator no_change:0 ts_repl:0 diff:2 rsync:0 diff_capped:0 hashmatch:0 empty:0
> > Oct 26 17:07:28 data05 account-server 192.168.1.204 - - [27/Oct/2012:00:07:28 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:29 data05 account-server 192.168.1.205 - - [27/Oct/2012:00:07:29 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 - "txbace9f92a41a4645beb9159a802fc12d" "-" "-" 0.0178 ""
> > Oct 26 17:07:30 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:30 +0000] "PUT /sda6/63086/AUTH_system/myfiles2/home/lab/bigfile1" 201 - "-" "txa4fac7ceab2948ae9010b4ce4cb56b97" "-" 0.7135
> > Oct 26 17:07:33 data05 object-replicator Starting object replication pass.
> > Oct 26 17:07:33 data05 object-replicator 4/4 (100.00%) partitions replicated in 0.17s (23.71/sec, 0s remaining)
> > Oct 26 17:07:33 data05 object-replicator 4 suffixes checked - 50.00% hashed, 0.00% synced
> > Oct 26 17:07:33 data05 object-replicator Partition times: max 0.1012s, min 0.0164s, med 0.0273s
> > Oct 26 17:07:33 data05 object-replicator Object replication complete. (0.00 minutes)
> > Oct 26 17:07:34 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:34 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 - "txa547450c156d4df9ad7e42c04dcabd8f" "-" "-" 0.0168 ""
> > Oct 26 17:07:34 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:34 +0000] "PUT /sda6/118594/AUTH_system/myfiles3" 202 - "txa547450c156d4df9ad7e42c04dcabd8f" "-" "-" 0.0399
> > Oct 26 17:07:35 data05 container-server 192.168.1.206 - - [27/Oct/2012:00:07:35 +0000] "PUT /sda6/118594/AUTH_system/myfiles3/home/lab/bigfile1" 201 - "tx6946419daba54efe9c2878f8a2a78f88" "-" "-" 0.0022
> > Oct 26 17:07:35 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:35 +0000] "PUT /sda6/35436/AUTH_system/myfiles3/home/lab/bigfile1" 201 - "-" "tx6946419daba54efe9c2878f8a2a78f88" "-" 0.6806
> > Oct 26 17:07:36 data05 account-server 192.168.1.205 - - [27/Oct/2012:00:07:36 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 - "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:37 data05 object-auditor Begin object audit "forever" mode (ZBF)
> > Oct 26 17:07:37 data05 object-auditor Object audit (ZBF) "forever" mode completed: 0.04s. Total quarantined: 0, Total errors: 0, Total files/sec: 89.46 , Total bytes/sec: 0.00, Auditing time: 0.04, Rate: 0.95
> > Oct 26 17:07:40 data05 account-server 192.168.1.204 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/116021/AUTH_system/myfiles4" 201 - "txe095c94f8e5c4c4681a8d6a9c330bb7b" "-" "-" 0.0200 ""
> > Oct 26 17:07:40 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/217542/AUTH_system/myfiles4" 202 - "txe095c94f8e5c4c4681a8d6a9c330bb7b" "-" "-" 0.0671
> > Oct 26 17:07:40 data05 container-server 192.168.1.206 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/217542/AUTH_system/myfiles4/home/lab/bigfile1" 201 - "tx5f9659f74cb2491f9a63cbb84f680c5c" "-" "-" 0.0023
> > Oct 26 17:07:40 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:40 +0000] "PUT /sda6/49697/AUTH_system/myfiles4/home/lab/bigfile1" 201 - "-" "tx5f9659f74cb2491f9a63cbb84f680c5c" "-" 0.6038
> > Oct 26 17:07:44 data05 container-updater Begin container update sweep
> > Oct 26 17:07:44 data05 container-replicator Beginning replication run
> > Oct 26 17:07:44 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:44 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 - "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:44 data05 account-server 192.168.1.206 - - [27/Oct/2012:00:07:44 +0000] "PUT /sda6/116021/AUTH_system/myfiles4" 201 - "-" "-" "-" 0.0021 ""
> > Oct 26 17:07:45 data05 account-server 192.168.1.203 - - [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles2" 201 - "-" "-" "-" 0.0023 ""
> > Oct 26 17:07:45 data05 account-server 192.168.1.203 - - [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles1" 201 - "-" "-" "-" 0.0020 ""
> > Oct 26 17:07:45 data05 container-replicator Replication run OVER
> > Oct 26 17:07:45 data05 account-server 192.168.1.203 - - [27/Oct/2012:00:07:45 +0000] "PUT /sda6/116021/AUTH_system/myfiles3" 201 - "-" "-" "-" 0.0223 ""
> > Oct 26 17:07:45 data05 container-replicator Attempted to replicate 5 dbs in 0.83001 seconds (6.02399/s)
> > Oct 26 17:07:45 data05 container-replicator Removed 0 dbs
> > Oct 26 17:07:45 data05 container-replicator 10 successes, 0 failures
> > Oct 26 17:07:45 data05 container-replicator no_change:8 ts_repl:0 diff:2 rsync:0 diff_capped:0 hashmatch:0 empty:0
> > Oct 26 17:07:45 data05 container-updater Container update sweep completed: 0.98s
> > Oct 26 17:07:46 data05 account-server 192.168.1.205 - - [27/Oct/2012:00:07:46 +0000] "PUT /sda6/116021/AUTH_system/myfiles5" 201 - "tx7ba0264da03c485ca3c3b4048f6a3d8c" "-" "-" 0.0098 ""
> > Oct 26 17:07:46 data05 container-server 192.168.1.111 - - [27/Oct/2012:00:07:46 +0000] "PUT /sda6/122255/AUTH_system/myfiles5" 202 - "tx7ba0264da03c485ca3c3b4048f6a3d8c" "-" "-" 0.0465
> > Oct 26 17:07:46 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:46 +0000] "HEAD /sda6/150861/AUTH_system/myfiles5/home/lab/bigfile1" 404 - "-" "txf08e8b91b2894a9da099f26b5300327b" "-" 0.0011
> > Oct 26 17:07:46 data05 object-server 192.168.1.111 - - [27/Oct/2012:00:07:46 +0000] "PUT /sda6/150861/AUTH_system/myfiles5/home/lab/bigfile1" 507 - "-" "tx8dc917a4a8c84c40a4429b7bab0323c6" "-" 0.0031
> >
> > Right now, there is nothing special in the object-server.conf:
> >
> > [DEFAULT]
> > bind_ip = 192.168.1.206
> > workers = 1
> >
> > [pipeline:main]
> > pipeline = object-server
> >
> > [app:object-server]
> > use = egg:swift#object
> >
> > [object-replicator]
> >
> > [object-updater]
> >
> > [object-auditor]
> >
> > [object-server]
> > set log_level = DEBUG
> >
> > My general question is: how do I troubleshoot this? I've tried turning on debug logging on the Object Server, but I don't get any additional clues.
> >
> > The Object-servers do give a 507 error, which might indicate a disk problem, but there is nothing wrong with the storage drive. And also if there was a fundamental drive problem then I wouldn't be able to upload objects in the first place.
> >
> > My best guess at the moment is that I should adjust the configuration of the object servers to reflect their lower performance. What would I adjust, the cache size? I'm not seeing any log messages that would indicate a time-out.
> >
> > Thanks in advance for your advice.
> >
> > -N
> > _______________________________________________
> > Mailing list: https://launchpad.net/~openstack
> > Post to : openstack@xxxxxxxxxxxxxxxxxxx
> > Unsubscribe : https://launchpad.net/~openstack
> > More help : https://help.launchpad.net/ListHelp
>
>
Attachment:
smime.p7s
Description: S/MIME cryptographic signature
References