bigdata-dev team mailing list archive: Message #00084
Re: [Merge] lp:~aisrael/charms/trusty/apache-hadoop-client/benchmarks into lp:~bigdata-dev/charms/trusty/apache-hadoop-client/trunk
Awesome! Thanks for this. See my two inline comments, though, regarding the /etc/environment issues you ran into.
Diff comments:
> === modified file 'README.md'
> --- README.md 2015-05-19 16:00:07 +0000
> +++ README.md 2015-05-28 21:05:22 +0000
> @@ -22,6 +22,50 @@
> juju ssh client/0
> hadoop jar my-job.jar
>
> +## Benchmarking
> +
> + You can perform a terasort benchmark, in order to gauge performance of your environment:
> +
> + $ juju action do apache-hadoop-client/0 terasort
> + Action queued with id: cbd981e8-3400-4c8f-8df1-c39c55a7eae6
> + $ juju action fetch --wait 0 cbd981e8-3400-4c8f-8df1-c39c55a7eae6
> + results:
> + meta:
> + composite:
> + direction: asc
> + units: ms
> + value: "206676"
> + results:
> + raw: '{"Total vcore-seconds taken by all map tasks": "439783", "Spilled Records":
> + "30000000", "WRONG_LENGTH": "0", "Reduce output records": "10000000", "HDFS:
> + Number of bytes read": "1000001024", "Total vcore-seconds taken by all reduce
> + tasks": "50275", "Reduce input groups": "10000000", "Shuffled Maps ": "8", "FILE:
> + Number of bytes written": "3128977482", "Input split bytes": "1024", "Total
> + time spent by all reduce tasks (ms)": "50275", "FILE: Number of large read operations":
> + "0", "Bytes Read": "1000000000", "Virtual memory (bytes) snapshot": "7688794112",
> + "Launched map tasks": "8", "GC time elapsed (ms)": "11656", "Bytes Written":
> + "1000000000", "FILE: Number of read operations": "0", "HDFS: Number of write
> + operations": "2", "Total megabyte-seconds taken by all reduce tasks": "51481600",
> + "Combine output records": "0", "HDFS: Number of bytes written": "1000000000",
> + "Total time spent by all map tasks (ms)": "439783", "Map output records": "10000000",
> + "Physical memory (bytes) snapshot": "2329722880", "FILE: Number of write operations":
> + "0", "Launched reduce tasks": "1", "Reduce input records": "10000000", "Total
> + megabyte-seconds taken by all map tasks": "450337792", "WRONG_REDUCE": "0",
> + "HDFS: Number of read operations": "27", "Reduce shuffle bytes": "1040000048",
> + "Map input records": "10000000", "Map output materialized bytes": "1040000048",
> + "CPU time spent (ms)": "195020", "Merged Map outputs": "8", "FILE: Number of
> + bytes read": "2080000144", "Failed Shuffles": "0", "Total time spent by all
> + maps in occupied slots (ms)": "439783", "WRONG_MAP": "0", "BAD_ID": "0", "Rack-local
> + map tasks": "2", "IO_ERROR": "0", "Combine input records": "0", "Map output
> + bytes": "1020000000", "CONNECTION": "0", "HDFS: Number of large read operations":
> + "0", "Total committed heap usage (bytes)": "1755840512", "Data-local map tasks":
> + "6", "Total time spent by all reduces in occupied slots (ms)": "50275"}'
> + status: completed
> + timing:
> + completed: 2015-05-28 20:55:50 +0000 UTC
> + enqueued: 2015-05-28 20:53:41 +0000 UTC
> + started: 2015-05-28 20:53:44 +0000 UTC
> +
>
> ## Contact Information
>
>
> === added directory 'actions'
> === added file 'actions.yaml'
> --- actions.yaml 1970-01-01 00:00:00 +0000
> +++ actions.yaml 2015-05-28 21:05:22 +0000
> @@ -0,0 +1,38 @@
> +teragen:
> + description: foo
> + params:
> + size:
> + description: The number of 100 byte rows, default to 100MB of data to generate and sort
> + type: string
> + default: "10000000"
> + indir:
> + description: foo
> + type: string
> + default: 'tera_demo_in'
> +terasort:
> + description: foo
> + params:
> + indir:
> + description: foo
> + type: string
> + default: 'tera_demo_in'
> + outdir:
> + description: foo
> + type: string
> + default: 'tera_demo_out'
> + size:
> + description: The number of 100 byte rows, default to 100MB of data to generate and sort
> + type: string
> + default: "10000000"
> + maps:
> + description: The default number of map tasks per job. 1-20
> + type: integer
> + default: 1
> + reduces:
> + description: The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Try 1-20
> + type: integer
> + default: 1
> + numtasks:
> + description: How many tasks to run per jvm. If set to -1, there is no limit.
> + type: integer
> + default: 1
>
> === added file 'actions/parseTerasort.py'
> --- actions/parseTerasort.py 1970-01-01 00:00:00 +0000
> +++ actions/parseTerasort.py 2015-05-28 21:05:22 +0000
> @@ -0,0 +1,54 @@
> +#!/usr/bin/env python
> +"""
> +Simple script to parse cassandra-stress' transaction results
> +and reformat them as JSON for sending back to juju
> +"""
> +import sys
> +import subprocess
> +import json
> +from charmhelpers.contrib.benchmark import Benchmark
> +import re
> +
> +
> +def action_set(key, val):
> + action_cmd = ['action-set']
> + if isinstance(val, dict):
> + for k, v in val.iteritems():
> + action_set('%s.%s' % (key, k), v)
> + return
> +
> + action_cmd.append('%s=%s' % (key, val))
> + subprocess.check_call(action_cmd)
> +
> +
> +def parse_terasort_output():
> + """
> + Parse the output from terasort and set the action results:
> +
> + """
> +
> + results = {}
> +
> + # Find all of the interesting things
> + regex = re.compile('\t+(.*)=(.*)')
> + for line in sys.stdin.readlines():
> + m = regex.match(line)
> + if m:
> + results[m.group(1)] = m.group(2)
> + action_set("results.raw", json.dumps(results))
> +
> + # Calculate what's important
> + if 'CPU time spent (ms)' in results:
> + composite = int(results['CPU time spent (ms)']) + int(results['GC time elapsed (ms)'])
> + Benchmark.set_composite_score(
> + composite,
> + 'ms',
> + 'asc'
> + )
> + else:
> + print "Invalid test results"
> + print results
> +
> +
> +if __name__ == "__main__":
> + parse_terasort_output()
>
> === added file 'actions/teragen'
> --- actions/teragen 1970-01-01 00:00:00 +0000
> +++ actions/teragen 2015-05-28 21:05:22 +0000
> @@ -0,0 +1,21 @@
> +#!/bin/bash
> +set -eux
> +SIZE=`action-get size`
> +IN_DIR=`action-get indir`
> +
> +benchmark-start
> +
> +# I don't know why, but have to source /etc/environment before and after
Hadoop needs certain environment variables set, such as JAVA_HOME. The charm configures those in /etc/environment, which is sourced automatically for login shells. By default, su does not start a login shell, so you have to source the file manually (inside the here-doc). Alternatively, you could pass the --login (-l) option to su.
You are also using the interpolating form of a here-doc (no quotes, or double quotes, around the EOF marker at the start), so $JAVA_HOME is expanded from the containing environment — which is why you also need to source /etc/environment outside the here-doc. The fix for that is to escape the $JAVA_HOME references inside the here-doc as \$JAVA_HOME, and likewise $HADOOP_HOME. You could potentially use a single-quoted EOF marker instead (su ubuntu << 'EOF'), but that disables interpolation entirely, so $SIZE and $IN_DIR would no longer be substituted, and I'm not sure how you'd pass those in that way.
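To illustrate what I mean, here's a minimal sketch of the two expansion behaviors. It uses a temp file as a stand-in for /etc/environment (so it's self-contained), and made-up values — it's not the charm's actual code:

```shell
# Stand-in for /etc/environment (hypothetical value)
ENVFILE=$(mktemp)
echo 'JAVA_HOME=/usr/lib/jvm/inner' > "$ENVFILE"

SIZE=100                        # action parameter: interpolation IS wanted here
JAVA_HOME=/usr/lib/jvm/outer    # value in the containing shell

# Unquoted EOF marker: $SIZE and $JAVA_HOME both expand in the OUTER shell,
# so the inner ". $ENVFILE" cannot affect the already-expanded $JAVA_HOME.
out1=$(bash << EOF
. $ENVFILE
echo "size=$SIZE java=$JAVA_HOME"
EOF
)

# Escaped \$JAVA_HOME: expansion is deferred to the inner shell, which picks
# up the value it sourced itself; $SIZE still interpolates from the outer shell.
out2=$(bash << EOF
. $ENVFILE
echo "size=$SIZE java=\$JAVA_HOME"
EOF
)

echo "$out1"    # size=100 java=/usr/lib/jvm/outer
echo "$out2"    # size=100 java=/usr/lib/jvm/inner
rm -f "$ENVFILE"
```

With the escaped form, the outer `. /etc/environment` (and the PATH restore in terasort) shouldn't be needed at all.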
> +# invoking the bash shell to get it working.
> +. /etc/environment
> +su ubuntu << EOF
> +. /etc/environment
> +if JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${IN_DIR}; then
> + JAVA_HOME=${JAVA_HOME} hadoop fs -rm -r -skipTrash ${IN_DIR} || true
> +fi
> +
> +JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen ${SIZE} ${IN_DIR}
> +
> +EOF
> +
> +benchmark-finish
>
> === added file 'actions/terasort'
> --- actions/terasort 1970-01-01 00:00:00 +0000
> +++ actions/terasort 2015-05-28 21:05:22 +0000
> @@ -0,0 +1,49 @@
> +#!/bin/bash
> +IN_DIR=`action-get indir`
> +OUT_DIR=`action-get outdir`
> +SIZE=`action-get size`
> +OPTIONS=''
> +
> +MAPS=`action-get maps`
> +REDUCES=`action-get reduces`
> +NUMTASKS=`action-get numtasks`
> +
> +OPTIONS="${OPTIONS} -D mapreduce.job.maps=${MAPS}"
> +OPTIONS="${OPTIONS} -D mapreduce.job.reduces=${REDUCES}"
> +OPTIONS="${OPTIONS} -D mapreduce.job.jvm.numtasks=${NUMTASKS}"
> +
> +mkdir -p /opt/terasort
> +chown ubuntu:ubuntu /opt/terasort
> +run=`date +%s`
> +
> +# HACK: the environment reset below is munging the PATH
See other comment. Escaping $JAVA_HOME/$HADOOP_HOME inside the here-doc should remove the need for the outer /etc/environment source and this path hack.
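For reference, this is the mechanism behind the munging — a sketch with a stand-in env file (not the real /etc/environment) and a hypothetical charm tool directory: on Ubuntu, /etc/environment assigns PATH outright, so sourcing it in the outer shell clobbers anything the charm prepended, forcing the save/restore dance.

```shell
# Stand-in env file that assigns PATH, as /etc/environment does on Ubuntu
ENVFILE=$(mktemp)
echo 'PATH=/usr/local/bin:/usr/bin:/bin' > "$ENVFILE"

PATH="/opt/charm/bin:$PATH"     # hypothetical charm-added tool directory
OLDPATH=$PATH
. "$ENVFILE"                    # outer source resets PATH wholesale...
after_source=$PATH
PATH=$OLDPATH                   # ...so the script has to restore it afterwards
rm -f "$ENVFILE"

echo "$after_source"            # /usr/local/bin:/usr/bin:/bin
echo "${PATH%%:*}"              # /opt/charm/bin
```

If the source only happens inside the su here-doc (per the escaping fix), the outer shell's PATH is never touched and both OLDPATH lines can go.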
> +OLDPATH=$PATH
> +
> +
> +# I don't know why, but have to source /etc/environment before and after
> +# invoking the bash shell to get it working.
> +. /etc/environment
> +su ubuntu << EOF
> +. /etc/environment
> +
> +mkdir -p /opt/terasort/results/$run
> +
> +# If there's no data generated yet, create it using the action defaults
> +if ! JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${IN_DIR} &> /dev/null; then
> + JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen ${SIZE} ${IN_DIR} > /dev/null
> +
> +fi
> +
> +# If there's already sorted data, remove it
> +if JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${OUT_DIR} &> /dev/null; then
> + JAVA_HOME=${JAVA_HOME} hadoop fs -rm -r -skipTrash ${OUT_DIR} || true
> +fi
> +
> +benchmark-start
> +JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar terasort ${OPTIONS} ${IN_DIR} ${OUT_DIR} &> /opt/terasort/results/$run/terasort.log
> +benchmark-finish
> +
> +EOF
> +PATH=$OLDPATH
> +
> +`cat /opt/terasort/results/$run/terasort.log | python $CHARM_DIR/actions/parseTerasort.py`
>
> === added file 'hooks/benchmark-relation-changed'
> --- hooks/benchmark-relation-changed 1970-01-01 00:00:00 +0000
> +++ hooks/benchmark-relation-changed 2015-05-28 21:05:22 +0000
> @@ -0,0 +1,3 @@
> +#!/bin/bash
> +
> +relation-set benchmarks=terasort
>
> === modified file 'hooks/install'
> --- hooks/install 2015-05-11 22:25:12 +0000
> +++ hooks/install 2015-05-28 21:05:22 +0000
> @@ -1,2 +1,4 @@
> #!/bin/bash
> +apt-get install -y python-pip && pip install -U charm-benchmark
> +
> hooks/status-set blocked "Please add relation to apache-hadoop-plugin"
>
> === added symlink 'hooks/upgrade-charm'
> === target is u'install'
> === modified file 'metadata.yaml'
> --- metadata.yaml 2015-05-12 22:18:09 +0000
> +++ metadata.yaml 2015-05-28 21:05:22 +0000
> @@ -12,3 +12,5 @@
> hadoop-plugin:
> interface: hadoop-plugin
> scope: container
> + benchmark:
> + interface: benchmark
>
--
https://code.launchpad.net/~aisrael/charms/trusty/apache-hadoop-client/benchmarks/+merge/260526
Your team Juju Big Data Development is requested to review the proposed merge of lp:~aisrael/charms/trusty/apache-hadoop-client/benchmarks into lp:~bigdata-dev/charms/trusty/apache-hadoop-client/trunk.