
bigdata-dev team mailing list archive

Re: [Merge] lp:~aisrael/charms/trusty/apache-hadoop-client/benchmarks into lp:~bigdata-dev/charms/trusty/apache-hadoop-client/trunk


Awesome!  Thanks for this.  See my two inline comments, though, regarding the /etc/environment issues you ran into.

Diff comments:

> === modified file 'README.md'
> --- README.md	2015-05-19 16:00:07 +0000
> +++ README.md	2015-05-28 21:05:22 +0000
> @@ -22,6 +22,50 @@
>      juju ssh client/0
>      hadoop jar my-job.jar
>  
> +## Benchmarking
> +
> +    You can run a terasort benchmark to gauge the performance of your environment:
> +
> +        $ juju action do apache-hadoop-client/0 terasort
> +        Action queued with id: cbd981e8-3400-4c8f-8df1-c39c55a7eae6
> +        $ juju action fetch --wait 0 cbd981e8-3400-4c8f-8df1-c39c55a7eae6
> +        results:
> +          meta:
> +            composite:
> +              direction: asc
> +              units: ms
> +              value: "206676"
> +          results:
> +            raw: '{"Total vcore-seconds taken by all map tasks": "439783", "Spilled Records":
> +              "30000000", "WRONG_LENGTH": "0", "Reduce output records": "10000000", "HDFS:
> +              Number of bytes read": "1000001024", "Total vcore-seconds taken by all reduce
> +              tasks": "50275", "Reduce input groups": "10000000", "Shuffled Maps ": "8", "FILE:
> +              Number of bytes written": "3128977482", "Input split bytes": "1024", "Total
> +              time spent by all reduce tasks (ms)": "50275", "FILE: Number of large read operations":
> +              "0", "Bytes Read": "1000000000", "Virtual memory (bytes) snapshot": "7688794112",
> +              "Launched map tasks": "8", "GC time elapsed (ms)": "11656", "Bytes Written":
> +              "1000000000", "FILE: Number of read operations": "0", "HDFS: Number of write
> +              operations": "2", "Total megabyte-seconds taken by all reduce tasks": "51481600",
> +              "Combine output records": "0", "HDFS: Number of bytes written": "1000000000",
> +              "Total time spent by all map tasks (ms)": "439783", "Map output records": "10000000",
> +              "Physical memory (bytes) snapshot": "2329722880", "FILE: Number of write operations":
> +              "0", "Launched reduce tasks": "1", "Reduce input records": "10000000", "Total
> +              megabyte-seconds taken by all map tasks": "450337792", "WRONG_REDUCE": "0",
> +              "HDFS: Number of read operations": "27", "Reduce shuffle bytes": "1040000048",
> +              "Map input records": "10000000", "Map output materialized bytes": "1040000048",
> +              "CPU time spent (ms)": "195020", "Merged Map outputs": "8", "FILE: Number of
> +              bytes read": "2080000144", "Failed Shuffles": "0", "Total time spent by all
> +              maps in occupied slots (ms)": "439783", "WRONG_MAP": "0", "BAD_ID": "0", "Rack-local
> +              map tasks": "2", "IO_ERROR": "0", "Combine input records": "0", "Map output
> +              bytes": "1020000000", "CONNECTION": "0", "HDFS: Number of large read operations":
> +              "0", "Total committed heap usage (bytes)": "1755840512", "Data-local map tasks":
> +              "6", "Total time spent by all reduces in occupied slots (ms)": "50275"}'
> +        status: completed
> +        timing:
> +          completed: 2015-05-28 20:55:50 +0000 UTC
> +          enqueued: 2015-05-28 20:53:41 +0000 UTC
> +          started: 2015-05-28 20:53:44 +0000 UTC
> +
>  
>  ## Contact Information
>  
> 
> === added directory 'actions'
> === added file 'actions.yaml'
> --- actions.yaml	1970-01-01 00:00:00 +0000
> +++ actions.yaml	2015-05-28 21:05:22 +0000
> @@ -0,0 +1,38 @@
> +teragen:
> +    description: foo
> +    params:
> +        size:
> +            description: The number of 100-byte rows; defaults to 100MB of data to generate and sort
> +            type: string
> +            default: "10000000"
> +        indir:
> +            description: foo
> +            type: string
> +            default: 'tera_demo_in'
> +terasort:
> +    description: foo
> +    params:
> +        indir:
> +            description: foo
> +            type: string
> +            default: 'tera_demo_in'
> +        outdir:
> +            description: foo
> +            type: string
> +            default: 'tera_demo_out'
> +        size:
> +            description: The number of 100-byte rows; defaults to 100MB of data to generate and sort
> +            type: string
> +            default: "10000000"
> +        maps:
> +            description: The default number of map tasks per job. 1-20
> +            type: integer
> +            default: 1
> +        reduces:
> +            description: The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Try 1-20
> +            type: integer
> +            default: 1
> +        numtasks:
> +            description: How many tasks to run per jvm. If set to -1, there is no limit.
> +            type: integer
> +            default: 1
> 
> === added file 'actions/parseTerasort.py'
> --- actions/parseTerasort.py	1970-01-01 00:00:00 +0000
> +++ actions/parseTerasort.py	2015-05-28 21:05:22 +0000
> @@ -0,0 +1,54 @@
> +#!/usr/bin/env python
> +"""
> +Simple script to parse terasort's output counters
> +and reformat them as JSON for sending back to juju
> +"""
> +import sys
> +import subprocess
> +import json
> +from charmhelpers.contrib.benchmark import Benchmark
> +import re
> +
> +
> +def action_set(key, val):
> +    action_cmd = ['action-set']
> +    if isinstance(val, dict):
> +        for k, v in val.iteritems():
> +            action_set('%s.%s' % (key, k), v)
> +        return
> +
> +    action_cmd.append('%s=%s' % (key, val))
> +    subprocess.check_call(action_cmd)
> +
> +
> +def parse_terasort_output():
> +    """
> +    Parse the output from terasort and set the action results:
> +
> +    """
> +
> +    results = {}
> +
> +    # Find all of the interesting things
> +    regex = re.compile('\t+(.*)=(.*)')
> +    for line in sys.stdin.readlines():
> +        m = regex.match(line)
> +        if m:
> +            results[m.group(1)] = m.group(2)
> +    action_set("results.raw", json.dumps(results))
> +
> +    # Calculate what's important
> +    if 'CPU time spent (ms)' in results:
> +        composite = int(results['CPU time spent (ms)']) + int(results['GC time elapsed (ms)'])
> +        Benchmark.set_composite_score(
> +            composite,
> +            'ms',
> +            'asc'
> +        )
> +    else:
> +        print "Invalid test results"
> +        print results
> +
> +
> +if __name__ == "__main__":
> +    parse_terasort_output()
> 
> === added file 'actions/teragen'
> --- actions/teragen	1970-01-01 00:00:00 +0000
> +++ actions/teragen	2015-05-28 21:05:22 +0000
> @@ -0,0 +1,21 @@
> +#!/bin/bash
> +set -eux
> +SIZE=`action-get size`
> +IN_DIR=`action-get indir`
> +
> +benchmark-start
> +
> +# I don't know why, but have to source /etc/environment before and after

The Hadoop system needs certain vars set up in the env, such as JAVA_HOME.  The charm configures those in /etc/environment, which is automatically sourced for login shells.  By default, su does not use a login shell, so you have to manually source the file (inside the here-doc).  Alternatively, you could use the --login (-l) option for su.

You are also using the interpolating form of a here-doc (an unquoted EOF marker at the start), which means JAVA_HOME is expanded from the containing environment before the here-doc is handed to su; that's why you also need to source /etc/environment outside the here-doc.  The fix is to escape the $JAVA_HOME references inside the here-doc as \$JAVA_HOME, and likewise $HADOOP_HOME, so they're expanded by the inner shell instead.  You could potentially use a quoted EOF marker (su ubuntu << 'EOF'), but that disables interpolation entirely, meaning $SIZE and $IN_DIR wouldn't be expanded either, so I'm not sure how you'd get those passed in that way.
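
Roughly, that would make the teragen body look like this (untested sketch; \$JAVA_HOME and \$HADOOP_HOME now come from the inner source, while ${SIZE} and ${IN_DIR} still interpolate from the outer shell):

    su ubuntu << EOF
    . /etc/environment
    if JAVA_HOME=\$JAVA_HOME hadoop fs -stat ${IN_DIR}; then
        JAVA_HOME=\$JAVA_HOME hadoop fs -rm -r -skipTrash ${IN_DIR} || true
    fi
    JAVA_HOME=\$JAVA_HOME hadoop jar \$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen ${SIZE} ${IN_DIR}
    EOF

with the outer ". /etc/environment" dropped entirely.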

> +# invoking the bash shell to get it working.
> +. /etc/environment
> +su ubuntu << EOF
> +. /etc/environment
> +if JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${IN_DIR}; then
> +    JAVA_HOME=${JAVA_HOME} hadoop fs -rm -r -skipTrash ${IN_DIR} || true
> +fi
> +
> +JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen ${SIZE} ${IN_DIR}
> +
> +EOF
> +
> +benchmark-finish
> 
> === added file 'actions/terasort'
> --- actions/terasort	1970-01-01 00:00:00 +0000
> +++ actions/terasort	2015-05-28 21:05:22 +0000
> @@ -0,0 +1,49 @@
> +#!/bin/bash
> +IN_DIR=`action-get indir`
> +OUT_DIR=`action-get outdir`
> +SIZE=`action-get size`
> +OPTIONS=''
> +
> +MAPS=`action-get maps`
> +REDUCES=`action-get reduces`
> +NUMTASKS=`action-get numtasks`
> +
> +OPTIONS="${OPTIONS} -D mapreduce.job.maps=${MAPS}"
> +OPTIONS="${OPTIONS} -D mapreduce.job.reduces=${REDUCES}"
> +OPTIONS="${OPTIONS} -D mapreduce.job.jvm.numtasks=${NUMTASKS}"
> +
> +mkdir -p /opt/terasort
> +chown ubuntu:ubuntu /opt/terasort
> +run=`date +%s`
> +
> +# HACK: the environment reset below is munging the PATH

See other comment.  Escaping $JAVA_HOME/$HADOOP_HOME inside the here-doc should remove the need for the outer /etc/environment source and this path hack.
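
i.e. with the references escaped, the terasort body reduces to something like this (again untested, and with the teragen/cleanup stat checks elided for brevity):

    su ubuntu << EOF
    . /etc/environment

    mkdir -p /opt/terasort/results/$run

    benchmark-start
    JAVA_HOME=\$JAVA_HOME hadoop jar \$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar terasort ${OPTIONS} ${IN_DIR} ${OUT_DIR} &> /opt/terasort/results/$run/terasort.log
    benchmark-finish
    EOF

and both the OLDPATH save/restore and the outer source go away.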

> +OLDPATH=$PATH
> +
> +
> +# I don't know why, but have to source /etc/environment before and after
> +# invoking the bash shell to get it working.
> +. /etc/environment
> +su ubuntu << EOF
> +. /etc/environment
> +
> +mkdir -p /opt/terasort/results/$run
> +
> +# If there's no data generated yet, create it using the action defaults
> +if ! JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${IN_DIR} &> /dev/null; then
> +    JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen ${SIZE} ${IN_DIR} > /dev/null
> +
> +fi
> +
> +# If there's already sorted data, remove it
> +if JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${OUT_DIR} &> /dev/null; then
> +    JAVA_HOME=${JAVA_HOME} hadoop fs -rm -r -skipTrash ${OUT_DIR} || true
> +fi
> +
> +benchmark-start
> +JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar terasort ${OPTIONS} ${IN_DIR} ${OUT_DIR} &> /opt/terasort/results/$run/terasort.log
> +benchmark-finish
> +
> +EOF
> +PATH=$OLDPATH
> +
> +`cat /opt/terasort/results/$run/terasort.log | python $CHARM_DIR/actions/parseTerasort.py`
> 
> === added file 'hooks/benchmark-relation-changed'
> --- hooks/benchmark-relation-changed	1970-01-01 00:00:00 +0000
> +++ hooks/benchmark-relation-changed	2015-05-28 21:05:22 +0000
> @@ -0,0 +1,3 @@
> +#!/bin/bash
> +
> +relation-set benchmarks=terasort
> 
> === modified file 'hooks/install'
> --- hooks/install	2015-05-11 22:25:12 +0000
> +++ hooks/install	2015-05-28 21:05:22 +0000
> @@ -1,2 +1,4 @@
>  #!/bin/bash
> +apt-get install -y python-pip && pip install -U charm-benchmark
> +
>  hooks/status-set blocked "Please add relation to apache-hadoop-plugin"
> 
> === added symlink 'hooks/upgrade-charm'
> === target is u'install'
> === modified file 'metadata.yaml'
> --- metadata.yaml	2015-05-12 22:18:09 +0000
> +++ metadata.yaml	2015-05-28 21:05:22 +0000
> @@ -12,3 +12,5 @@
>    hadoop-plugin:
>      interface: hadoop-plugin
>      scope: container
> +  benchmark:
> +    interface: benchmark
> 


-- 
https://code.launchpad.net/~aisrael/charms/trusty/apache-hadoop-client/benchmarks/+merge/260526
Your team Juju Big Data Development is requested to review the proposed merge of lp:~aisrael/charms/trusty/apache-hadoop-client/benchmarks into lp:~bigdata-dev/charms/trusty/apache-hadoop-client/trunk.

