graphite-dev team mailing list archive

Re: [Question #170527]: carbon-cache.py grinds to a halt

 

Question #170527 on Graphite changed:
https://answers.launchpad.net/graphite/+question/170527

    Status: Answered => Open

Dyna is still having a problem:
Hi

Thanks for your answer.

Metrics:

8 different metrics:
cpu, user cpu, system cpu, context switch, disk usage, write rate kb, read rate kb, mem usage

The metrics are in a CSV file, one line every five (5) minutes (this is
the standard data extract from HP-UX).

I'm reading each line then sending it like this:

           lines.append("server.hpux.%s.global.cpu %s %s" % (server, CPU, EPOC))
           lines.append("server.hpux.%s.global.system_cpu %s %s" % (server, SCPU, EPOC))
           lines.append("server.hpux.%s.global.user_cpu %s %s" % (server, UCPU, EPOC))
           lines.append("server.hpux.%s.global.context_switch %s %s" % (server, CSWI, EPOC))
           lines.append("server.hpux.%s.global.disk %s %s" % (server, DISK, EPOC))
           lines.append("server.hpux.%s.global.read_rate_kb %s %s" % (server, READ, EPOC))
           lines.append("server.hpux.%s.global.write_rate_kb %s %s" % (server, WRITE, EPOC))
           lines.append("server.hpux.%s.global.mem %s %s" % (server, MEM, EPOC))
           message = '\n'.join(lines) + '\n'
           ......
           sock.sendall(message)
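
For reference, each line above follows carbon's plaintext format, "<metric path> <value> <timestamp>", one metric per line. A minimal sketch of building the message for a single row (written for Python 3; the helper name build_message, the host name and the sample values are illustrative assumptions, not part of the attached program):

```python
def build_message(server, epoch, values):
    """Format one CSV row as carbon plaintext lines: '<path> <value> <timestamp>'."""
    lines = [
        "server.hpux.%s.global.%s %s %d" % (server, name, value, epoch)
        for name, value in values
    ]
    # Every line, including the last one, is terminated by a newline
    return "\n".join(lines) + "\n"


# Hypothetical sample values for one five-minute row
message = build_message("hpux01", 1315548540, [("cpu", "12.5"), ("mem", "48.0")])
print(message, end="")
```

On Python 3 the string would need to be encoded before sending, e.g. sock.sendall(message.encode("ascii")).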

I do this in a loop, sending as fast as I can. It's not very fast even
when I start from scratch (I can see this because I print a little dot
for each line sent).

I've installed carbon/graphite on two servers so I can compare; both are
running the latest update of RHEL 5.

Server1:

VMware virtual server: 500MB mem and 1 CPU allocated. The hosting VMware
server is a DL380G5 with 4GB RAM and 1 CPU, with about 10 virtual servers
running at the same time (I expect this server to be slow).

Server2:
A real box, DL360G5 (dual CPU) with 8GB of RAM (I expect this box to be
fast since it has loads of memory and CPU resources compared to the
other one).


Here are my findings so far.

Visually I don't see any difference in commit rate; they are both slow,
and according to the listener log it's about 1.5 commits per second when
they start with fresh database files. In my test yesterday Server2
ground to a halt after just 50 min of importing the data. By then it had
eaten up 75% of the memory and didn't respond to kill signals (i.e.
carbon-cache.py stop). However, if you kill the commit process exporting
the data, carbon-cache.py wakes up after a little while (2-10 min),
complaining that it lost the connection with the other end.

What was even funnier is that Server1 continued to work until late
last evening. Then you see random hangs like this:

09/09/2011 06:09:02 :: [listener] MetricLineReceiver connection with 10.102.246.140:60858 established
09/09/2011 06:09:02 :: [listener] MetricLineReceiver connection with 10.102.246.140:60857 closed cleanly
09/09/2011 06:09:03 :: [listener] MetricLineReceiver connection with 10.102.246.140:60859 established
09/09/2011 06:21:03 :: [listener] MetricLineReceiver connection with 10.102.246.140:60858 closed cleanly
09/09/2011 06:21:04 :: [listener] MetricLineReceiver connection with 10.102.246.140:61511 established
09/09/2011 06:21:04 :: [listener] MetricLineReceiver connection with 10.102.246.140:60859 closed cleanly
09/09/2011 06:21:04 :: [listener] MetricLineReceiver connection with 10.102.246.140:61512 established
09/09/2011 06:21:04 :: [listener] MetricLineReceiver connection with 10.102.246.140:61511 closed cleanly
09/09/2011 06:21:05 :: [listener] MetricLineReceiver connection with 10.102.246.140:61513 established
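
To make stalls like this easier to spot in a long listener log, the gaps between consecutive timestamps can be computed. A minimal sketch, assuming every line starts with a "dd/mm/YYYY HH:MM:SS" stamp followed by " :: " (the day/month order is an assumption; the function name find_gaps and the 60-second threshold are illustrative):

```python
from datetime import datetime


def find_gaps(lines, threshold_seconds=60):
    """Return (previous, current, gap_seconds) for every gap above the threshold."""
    gaps = []
    previous = None
    for line in lines:
        stamp = line.split(" :: ", 1)[0]
        current = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > threshold_seconds:
                gaps.append((previous, current, gap))
        previous = current
    return gaps


# First five lines of the excerpt above
LOG_LINES = [
    "09/09/2011 06:09:02 :: [listener] MetricLineReceiver connection with 10.102.246.140:60858 established",
    "09/09/2011 06:09:02 :: [listener] MetricLineReceiver connection with 10.102.246.140:60857 closed cleanly",
    "09/09/2011 06:09:03 :: [listener] MetricLineReceiver connection with 10.102.246.140:60859 established",
    "09/09/2011 06:21:03 :: [listener] MetricLineReceiver connection with 10.102.246.140:60858 closed cleanly",
    "09/09/2011 06:21:04 :: [listener] MetricLineReceiver connection with 10.102.246.140:61511 established",
]

for start, end, gap in find_gaps(LOG_LINES):
    print("%s -> %s: %d seconds of silence" % (start.time(), end.time(), gap))
```

On the excerpt this reports one 12-minute gap, between 06:09:03 and 06:21:03.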


It looks like the hang of carbon-cache.py is pretty random. It doesn't
look like a server size issue, since the server with very low resources
continued to work until late last evening while it only took 50 min or
so for the beefy one to hang.

I'd say this is some sort of bug, either in the program, in my config,
or possibly in the program that commits. Committing five data points
every second is not much load; any system should be able to handle that.
Another issue is that when I commit, the CPU spikes to nearly 100%, and
that for ~300-450 metrics a minute???
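
As a quick sanity check on those numbers, the per-minute figures convert to per-second rates as follows (a trivial sketch; the helper name per_second is illustrative):

```python
def per_second(per_minute):
    """Convert a per-minute metric count into a per-second rate."""
    return per_minute / 60.0


for per_minute in (300, 450):
    print("%d metrics/minute = %.1f metrics/second" % (per_minute, per_second(per_minute)))
```

So 300-450 metrics a minute corresponds to roughly 5 to 7.5 data points per second, in line with the "five data points every second" estimate.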

I'm aware of MAX_CACHE_SIZE = inf (although I've not dug deep enough to
see how it's parsed, so I don't know if I can limit it like this:
MAX_CACHE_SIZE = 1GB). However, MAX_UPDATES_PER_SECOND = 1000 indicates
to me that my 1 update per second should be child's play ;)..
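
On the parsing question, here is a minimal sketch of what would happen if the setting were converted with a try-int-then-float approach. That conversion order is an assumption about carbon's config handling, not something confirmed in this thread, and parse_setting is a hypothetical helper:

```python
def parse_setting(raw):
    """Try int, then float, otherwise keep the raw string (assumed behaviour)."""
    for convert in (int, float):
        try:
            return convert(raw)
        except ValueError:
            continue
    return raw


for raw in ("inf", "1000", "1GB"):
    print("%-5s -> %r" % (raw, parse_setting(raw)))
```

Under that assumption "inf" becomes a float infinity and "1000" an integer, while "1GB" stays a plain string and would not act as a numeric limit. As far as I recall from the comments in carbon.conf.example, MAX_CACHE_SIZE counts cached data points rather than bytes, and MAX_UPDATES_PER_SECOND limits whisper update calls rather than incoming data points, so a plain number such as 2000000 would be the safer thing to try.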


I'm attaching my program that submits the data, and I'm now about to
pickle it to see if that solves the problem. Sorry for the poor Python
programming.
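
For the pickle route, the Graphite documentation describes the message as a pickled list of (path, (timestamp, value)) tuples, prefixed with a 4-byte big-endian length header and sent to the pickle receiver port (2004 by default). A minimal sketch of building such a message (Python 3; the helper name, host name and sample values are illustrative assumptions):

```python
import pickle
import struct


def build_pickle_message(datapoints):
    """Frame a list of (path, (timestamp, value)) tuples for carbon's pickle receiver."""
    payload = pickle.dumps(datapoints, protocol=2)
    # 4-byte unsigned big-endian length, followed by the pickled payload
    header = struct.pack("!L", len(payload))
    return header + payload


# Hypothetical sample values for one five-minute row
datapoints = [
    ("server.hpux.hpux01.global.cpu", (1315548540, 12.5)),
    ("server.hpux.hpux01.global.mem", (1315548540, 48.0)),
]
message = build_pickle_message(datapoints)
print("%d bytes, header announces %d" % (len(message), struct.unpack("!L", message[:4])[0]))
```

The resulting bytes can be passed to sock.sendall(message) on a connection to the pickle port instead of port 2003.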

Cheers Dyna

----

#!/usr/bin/python

import getopt
import sys
import time
import os
import platform
import csv
from socket import socket

CARBON_SERVER = 'ldn4lin15.ebrd.com'
CARBON_PORT = 2003


def usage():
    print "Usage\n"

def main():
    try:
        # "f:" and "s:" both take an argument; "h" and "v" are plain flags
        opts, args = getopt.getopt(sys.argv[1:], "hf:s:v", ["help", "file=", "server="])
    except getopt.GetoptError, err:
        # print help information and exit:
        #print str(err) # will print something like "option -a not recognized"
        usage()
        sys.exit(2)
    message = None
    file = None
    server = None
    epoc = 0
    verbose = False
    for o, a in opts:
        if o == "-v":
            verbose = True
        elif o in ("-h", "--help"):
            usage()
            sys.exit()
        elif o in ("-f", "--file"):
            file = a
        elif o in ("-s", "--server"):
            server = a
        else:
            assert False, "unhandled option"
    if file == None or server == None:
        usage()
        sys.exit(2)
    print ("Sending global stats file %s from server %s \n" % (file, server))

    # Open the CSV file once, inside the try block, so a missing file exits cleanly
    try:
        infile = open(file, "rb")
    except IOError:
        sys.exit(2)

    reader = csv.reader(infile)
    rownum = 0
    version = 0
    dot = 0
    lines = []
    for row in reader:
        if rownum ==  1:
           #print ("%s,%s,%s,%s,%s" % (row[0],row[1],row[2],row[3],row[4]))
           if row[1] == "Cache ":
                #print "Version 1\n"
                version = 1
           elif (row[1] == "Cache Mem") and (row[6] != "   Phys   "):
                #print "Version 2\n"
                version = 2
           elif row[6] == "   Phys   ":
                #print "Version 3\n"
                version = 3
           else:
                version=4
        if rownum >= 3 and version == 1:
           EPOC=int(time.mktime(time.strptime(row[0], '%m/%d/%Y %H:%M'))) - time.timezone
           CMHIT=row[1]
           SCPU=row[2]
           UCPU=row[3]
           CSWI=row[4]
           DISK=row[5]
           SWAP=row[6]
           MEM=row[7]

           lines.append("server.hpux.%s.global.cache_mem_hit %s %s" % (server, CMHIT, EPOC))
           lines.append("server.hpux.%s.global.system_cpu %s %s" % (server, SCPU, EPOC))
           lines.append("server.hpux.%s.global.user_cpu %s %s" % (server, UCPU, EPOC))
           lines.append("server.hpux.%s.global.context_swith %s %s" % (server, CSWI, EPOC))
           lines.append("server.hpux.%s.global.disk %s %s" % (server, DISK, EPOC))


        if rownum >= 3 and version == 2:
           EPOC=int(time.mktime(time.strptime(row[0], '%m/%d/%Y %H:%M:%S'))) - time.timezone
           CMHIT=row[1]
           CPU=row[2]
           SCPU=row[3]
           UCPU=row[4]
           CSWI=row[5]
           DISK=row[6]
           SWAP=row[7]

           lines.append("server.hpux.%s.global.cache_mem_hit %s %s" % (server, CMHIT, EPOC))
           lines.append("server.hpux.%s.global.cpu %s %s" % (server, CPU, EPOC))
           lines.append("server.hpux.%s.global.system_cpu %s %s" % (server, SCPU, EPOC))
           lines.append("server.hpux.%s.global.user_cpu %s %s" % (server, UCPU, EPOC))
           lines.append("server.hpux.%s.global.context_swith %s %s" % (server, CSWI, EPOC))
           lines.append("server.hpux.%s.global.disk %s %s" % (server, DISK, EPOC))
           lines.append("server.hpux.%s.global.swap %s %s" % (server, SWAP, EPOC))

        if rownum >= 3 and version == 3:
           EPOC=int(time.mktime(time.strptime(row[0], '%m/%d/%Y %H:%M:%S'))) - time.timezone
           CMHIT=row[1]
           CPU=row[2]
           SCPU=row[3]
           UCPU=row[4]
           CSWI=row[5]
           PIOR=row[6]
           DISK=row[7]

           lines.append("server.hpux.%s.global.cache_mem_hit %s %s" % (server, CMHIT, EPOC))
           lines.append("server.hpux.%s.global.cpu %s %s" % (server, CPU, EPOC))
           lines.append("server.hpux.%s.global.system_cpu %s %s" % (server, SCPU, EPOC))
           lines.append("server.hpux.%s.global.user_cpu %s %s" % (server, UCPU, EPOC))
           lines.append("server.hpux.%s.global.context_swith %s %s" % (server, CSWI, EPOC))
           lines.append("server.hpux.%s.global.phys_io_rate %s %s" % (server, PIOR, EPOC))
           lines.append("server.hpux.%s.global.disk %s %s" % (server, DISK, EPOC))


        if rownum >= 2 and version == 4:
           #EPOC=row[0]
           EPOC=int(time.mktime(time.strptime(row[0], '%m/%d/%Y %H:%M:%S'))) - time.timezone
           CPU=row[1]
           SCPU=row[2]
           UCPU=row[3]
           CSWI=row[4]
           DISK=row[5]
           READ=row[6]
           WRITE=row[7]
           MEM=row[8]

           lines.append("server.hpux.%s.global.cpu %s %s" % (server, CPU, EPOC))
           lines.append("server.hpux.%s.global.system_cpu %s %s" % (server, SCPU, EPOC))
           lines.append("server.hpux.%s.global.user_cpu %s %s" % (server, UCPU, EPOC))
           lines.append("server.hpux.%s.global.context_swith %s %s" % (server, CSWI, EPOC))
           lines.append("server.hpux.%s.global.disk %s %s" % (server, DISK, EPOC))
           lines.append("server.hpux.%s.global.read_rate_kb %s %s" % (server, READ, EPOC))
           lines.append("server.hpux.%s.global.write_rate_kb %s %s" % (server, WRITE, EPOC))
           lines.append("server.hpux.%s.global.mem %s %s" % (server, MEM, EPOC))

        if lines:
           message = '\n'.join(lines) + '\n'
           #print message
           if dot == 70:
              dot = 0
              print ".\n",
           else:
              dot = dot + 1
              print ".",
           sock = socket()
           try:
              sock.connect( (CARBON_SERVER,CARBON_PORT) )
           except Exception:
              print "Couldn't connect to %(server)s on port %(port)d, is carbon-cache.py running?" % { 'server':CARBON_SERVER, 'port':CARBON_PORT }
              sys.exit(1)

           sock.sendall(message)
           sock.close()
        rownum = rownum + 1

if __name__ == "__main__":
    main()
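
One thing worth checking in the program above: `lines` is created once before the loop and never emptied, so each sendall() appears to re-send every earlier row as well, and a new connection is opened for every row. A minimal sketch of the alternative, with a fresh list per row and one shared connection (Python 3; send_rows, lines_sent_without_reset and RecordingSocket are hypothetical names, and RecordingSocket only stands in for a connected socket so the sketch runs without a carbon server):

```python
def send_rows(sock, server, rows):
    """Send one small message per row over an already-connected socket.

    rows is an iterable of (epoch, [(metric_name, value), ...]) tuples.
    Returns the total number of metric lines sent.
    """
    sent = 0
    for epoch, values in rows:
        # Build a fresh list for every row so earlier rows are not re-sent
        lines = ["server.hpux.%s.global.%s %s %d" % (server, name, value, epoch)
                 for name, value in values]
        sock.sendall(("\n".join(lines) + "\n").encode("ascii"))
        sent += len(lines)
    return sent


def lines_sent_without_reset(row_count, metrics_per_row):
    """Lines sent when the list keeps growing: row k re-sends rows 1..k."""
    return metrics_per_row * row_count * (row_count + 1) // 2


class RecordingSocket(object):
    """Stand-in for a connected socket; it only records what would be sent."""

    def __init__(self):
        self.chunks = []

    def sendall(self, data):
        self.chunks.append(data)


sock = RecordingSocket()
rows = [(1315548540 + 300 * i, [("cpu", "12.5"), ("mem", "48.0")]) for i in range(3)]
print("lines sent with a fresh list per row:", send_rows(sock, "hpux01", rows))
print("lines sent if the list is never cleared:", lines_sent_without_reset(3, 2))
```

For one day of five-minute rows (288 rows, 8 metrics each) the difference is 2304 lines versus 332928 lines, which might explain both the CPU spikes and the growing memory use.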
