bigdata-dev team mailing list archive

[Merge] lp:~bigdata-dev/charms/bundles/apache-hadoop-spark-notebook/trunk into lp:~charmers/charms/bundles/apache-hadoop-spark-notebook/bundle

Kevin W Monroe has proposed merging lp:~bigdata-dev/charms/bundles/apache-hadoop-spark-notebook/trunk into lp:~charmers/charms/bundles/apache-hadoop-spark-notebook/bundle.

Requested reviews:
  Kevin W Monroe (kwmonroe)
Related bugs:
  Bug #1475634 in Juju Charms Collection: "bigdata solution: need a Apache Hadoop, SPark, and ipython notebook for SPark solution"
  https://bugs.launchpad.net/charms/+bug/1475634

For more details, see:
https://code.launchpad.net/~bigdata-dev/charms/bundles/apache-hadoop-spark-notebook/trunk/+merge/286952

Updates from bigdata-dev:
- version lock charms in bundle.yaml
- update bundle tests
- fix README formatting
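For reviewers new to charm pinning: version locking points each service at a specific charm store revision, so the bundle deploys exactly the charms it was tested with instead of whatever revision is latest at deploy time. A minimal sketch of the pattern, using the spark service and revision number taken from the bundle.yaml diff below:

    services:
      spark:
        # unpinned (before): resolves to the latest published revision at deploy time
        # charm: cs:trusty/apache-spark
        # pinned (after): always deploys revision 6, the revision this bundle was tested with
        charm: cs:trusty/apache-spark-6
        num_units: 1
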
=== modified file 'README.md'
--- README.md	2015-07-09 15:09:34 +0000
+++ README.md	2016-02-23 21:01:00 +0000
@@ -16,81 +16,81 @@
   - 1 Notebook (colocated on the Spark unit)
 
 
-  ## Usage
-  Deploy this bundle using juju-quickstart:
-
-      juju quickstart u/bigdata-dev/apache-hadoop-spark-notebook
-
-  See `juju quickstart --help` for deployment options, including machine
-  constraints and how to deploy a locally modified version of the
-  apache-hadoop-spark-notebook bundle.yaml.
-
-
-  ## Testing the deployment
-
-  ### Smoke test HDFS admin functionality
-  Once the deployment is complete and the cluster is running, ssh to the HDFS
-  Master unit:
-
-      juju ssh hdfs-master/0
-
-  As the `ubuntu` user, create a temporary directory on the Hadoop file system.
-  The steps below verify HDFS functionality:
-
-      hdfs dfs -mkdir -p /tmp/hdfs-test
-      hdfs dfs -chmod -R 777 /tmp/hdfs-test
-      hdfs dfs -ls /tmp # verify the newly created hdfs-test subdirectory exists
-      hdfs dfs -rm -R /tmp/hdfs-test
-      hdfs dfs -ls /tmp # verify the hdfs-test subdirectory has been removed
-      exit
-
-  ### Smoke test YARN and MapReduce
-  Run the `terasort.sh` script from the Spark unit to generate and sort data. The
-  steps below verify that Spark is communicating with the cluster via the plugin
-  and that YARN and MapReduce are working as expected:
-
-      juju ssh spark/0
-      ~/terasort.sh
-      exit
-
-  ### Smoke test HDFS functionality from user space
-  From the Spark unit, delete the MapReduce output previously generated by the
-  `terasort.sh` script:
-
-      juju ssh spark/0
-      hdfs dfs -rm -R /user/ubuntu/tera_demo_out
-      exit
-
-  ### Smoke test Spark
-  SSH to the Spark unit and run the SparkPi demo as follows:
-
-      juju ssh spark/0
-      ~/sparkpi.sh
-      exit
-
-  ### Access the IPython Notebook web interface
-  Access the notebook web interface at
-  http://{spark_unit_ip_address}:8880. The ip address can be found by running
-  `juju status spark/0 | grep public-address`.
-
-
-  ## Scale Out Usage
-  This bundle was designed to scale out. To increase the amount of Compute
-  Slaves, you can add units to the compute-slave service. To add one unit:
-
-      juju add-unit compute-slave
-
-  Or you can add multiple units at once:
-
-      juju add-unit -n4 compute-slave
-
-
-  ## Contact Information
-
-  - <bigdata-dev@xxxxxxxxxxxxxxxxxxx>
-
-
-  ## Help
-
-  - [Juju mailing list](https://lists.ubuntu.com/mailman/listinfo/juju)
-  - [Juju community](https://jujucharms.com/community)
+## Usage
+Deploy this bundle using juju-quickstart:
+
+    juju quickstart apache-hadoop-spark-notebook
+
+See `juju quickstart --help` for deployment options, including machine
+constraints and how to deploy a locally modified version of the
+apache-hadoop-spark-notebook bundle.yaml.
+
+
+## Testing the deployment
+
+### Smoke test HDFS admin functionality
+Once the deployment is complete and the cluster is running, SSH to the HDFS
+Master unit:
+
+    juju ssh hdfs-master/0
+
+As the `ubuntu` user, create a temporary directory on the Hadoop file system.
+The steps below verify HDFS functionality:
+
+    hdfs dfs -mkdir -p /tmp/hdfs-test
+    hdfs dfs -chmod -R 777 /tmp/hdfs-test
+    hdfs dfs -ls /tmp # verify the newly created hdfs-test subdirectory exists
+    hdfs dfs -rm -R /tmp/hdfs-test
+    hdfs dfs -ls /tmp # verify the hdfs-test subdirectory has been removed
+    exit
+
+### Smoke test YARN and MapReduce
+Run the `terasort.sh` script from the Spark unit to generate and sort data. The
+steps below verify that Spark is communicating with the cluster via the plugin
+and that YARN and MapReduce are working as expected:
+
+    juju ssh spark/0
+    ~/terasort.sh
+    exit
+
+### Smoke test HDFS functionality from user space
+From the Spark unit, delete the MapReduce output previously generated by the
+`terasort.sh` script:
+
+    juju ssh spark/0
+    hdfs dfs -rm -R /user/ubuntu/tera_demo_out
+    exit
+
+### Smoke test Spark
+SSH to the Spark unit and run the SparkPi demo as follows:
+
+    juju ssh spark/0
+    ~/sparkpi.sh
+    exit
+
+### Access the IPython Notebook web interface
+Access the notebook web interface at
+http://{spark_unit_ip_address}:8880. The IP address can be found by running
+`juju status spark/0 | grep public-address`.
+
+
+## Scale Out Usage
+This bundle was designed to scale out. To increase the number of compute
+slaves, add units to the compute-slave service. To add one unit:
+
+    juju add-unit compute-slave
+
+Or you can add multiple units at once:
+
+    juju add-unit -n4 compute-slave
+
+
+## Contact Information
+
+- <bigdata-dev@xxxxxxxxxxxxxxxxxxx>
+
+
+## Help
+
+- [Juju mailing list](https://lists.ubuntu.com/mailman/listinfo/juju)
+- [Juju community](https://jujucharms.com/community)

=== modified file 'bundle.yaml'
--- bundle.yaml	2015-07-16 20:35:31 +0000
+++ bundle.yaml	2016-02-23 21:01:00 +0000
@@ -1,46 +1,46 @@
 services:
   compute-slave:
-    charm: cs:trusty/apache-hadoop-compute-slave
+    charm: cs:trusty/apache-hadoop-compute-slave-9
     num_units: 3
     annotations:
       gui-x: "300"
       gui-y: "200"
-    constraints: mem=3G
+    constraints: mem=7G
   hdfs-master:
-    charm: cs:trusty/apache-hadoop-hdfs-master
+    charm: cs:trusty/apache-hadoop-hdfs-master-9
     num_units: 1
     annotations:
       gui-x: "600"
       gui-y: "350"
     constraints: mem=7G
   plugin:
-    charm: cs:trusty/apache-hadoop-plugin
+    charm: cs:trusty/apache-hadoop-plugin-10
     annotations:
       gui-x: "900"
       gui-y: "200"
   secondary-namenode:
-    charm: cs:trusty/apache-hadoop-hdfs-secondary
+    charm: cs:trusty/apache-hadoop-hdfs-secondary-7
     num_units: 1
     annotations:
       gui-x: "600"
       gui-y: "600"
     constraints: mem=7G
   spark:
-    charm: cs:trusty/apache-spark
+    charm: cs:trusty/apache-spark-6
     num_units: 1
     annotations:
       gui-x: "1200"
       gui-y: "200"
     constraints: mem=3G
   yarn-master:
-    charm: cs:trusty/apache-hadoop-yarn-master
+    charm: cs:trusty/apache-hadoop-yarn-master-7
     num_units: 1
     annotations:
       gui-x: "600"
       gui-y: "100"
     constraints: mem=7G
   notebook:
-    charm: cs:trusty/apache-spark-notebook
+    charm: cs:trusty/apache-spark-notebook-3
     annotations:
       gui-x: "1200"
       gui-y: "450"
@@ -53,4 +53,4 @@
   - [plugin, yarn-master]
   - [plugin, hdfs-master]
   - [spark, plugin]
-  - [notebook, spark]
+  - [spark, notebook]

=== removed file 'tests/00-setup'
--- tests/00-setup	2015-07-16 20:35:31 +0000
+++ tests/00-setup	1970-01-01 00:00:00 +0000
@@ -1,8 +0,0 @@
-#!/bin/bash
-
-if ! dpkg -s amulet &> /dev/null; then
-    echo Installing Amulet...
-    sudo add-apt-repository -y ppa:juju/stable
-    sudo apt-get update
-    sudo apt-get -y install amulet
-fi

=== modified file 'tests/01-bundle.py'
--- tests/01-bundle.py	2015-07-16 20:35:31 +0000
+++ tests/01-bundle.py	2016-02-23 21:01:00 +0000
@@ -1,61 +1,32 @@
 #!/usr/bin/env python3
 
 import os
-import time
 import unittest
 
 import yaml
 import amulet
 
 
-class Base(object):
-    """
-    Base class for tests for Apache Hadoop Bundle.
-    """
+class TestBundle(unittest.TestCase):
     bundle_file = os.path.join(os.path.dirname(__file__), '..', 'bundle.yaml')
-    profile_name = None
 
     @classmethod
-    def deploy(cls):
-        # classmethod inheritance doesn't work quite right with
-        # setUpClass / tearDownClass, so subclasses have to manually call this
+    def setUpClass(cls):
         cls.d = amulet.Deployment(series='trusty')
         with open(cls.bundle_file) as f:
             bun = f.read()
-        profiles = yaml.safe_load(bun)
-        # amulet always selects the first profile, so we have to fudge it here
-        profile = {cls.profile_name: profiles[cls.profile_name]}
-        cls.d.load(profile)
-        cls.d.setup(timeout=9000)
-        cls.d.sentry.wait()
-        cls.hdfs = cls.d.sentry.unit['hdfs-master/0']
-        cls.yarn = cls.d.sentry.unit['yarn-master/0']
-        cls.slave = cls.d.sentry.unit['compute-slave/0']
-        cls.secondary = cls.d.sentry.unit['secondary-namenode/0']
-        cls.plugin = cls.d.sentry.unit['plugin/0']
-        cls.client = cls.d.sentry.unit['client/0']
-
-    @classmethod
-    def reset_env(cls):
-        # classmethod inheritance doesn't work quite right with
-        # setUpClass / tearDownClass, so subclasses have to manually call this
-        juju_env = amulet.helpers.default_environment()
-        services = ['hdfs-master', 'yarn-master', 'compute-slave', 'secondary-namenode', 'plugin', 'client']
-
-        def check_env_clear():
-            state = amulet.waiter.state(juju_env=juju_env)
-            for service in services:
-                if state.get(service, {}) != {}:
-                    return False
-            return True
-
-        for service in services:
-            cls.d.remove(service)
-        with amulet.helpers.timeout(300):
-            while not check_env_clear():
-                time.sleep(5)
-
-    def test_hadoop_components(self):
+        bundle = yaml.safe_load(bun)
+        cls.d.load(bundle)
+        cls.d.setup(timeout=1800)
+        cls.d.sentry.wait_for_messages({'notebook': 'Ready'}, timeout=1800)
+        cls.hdfs = cls.d.sentry['hdfs-master'][0]
+        cls.yarn = cls.d.sentry['yarn-master'][0]
+        cls.slave = cls.d.sentry['compute-slave'][0]
+        cls.secondary = cls.d.sentry['secondary-namenode'][0]
+        cls.spark = cls.d.sentry['spark'][0]
+        cls.notebook = cls.d.sentry['notebook'][0]
+
+    def test_components(self):
         """
         Confirm that all of the required components are up and running.
         """
@@ -63,17 +34,48 @@
         yarn, retcode = self.yarn.run("pgrep -a java")
         slave, retcode = self.slave.run("pgrep -a java")
         secondary, retcode = self.secondary.run("pgrep -a java")
-        client, retcode = self.client.run("pgrep -a java")
+        spark, retcode = self.spark.run("pgrep -a java")
+        notebook, retcode = self.spark.run("pgrep -a python")
 
         # .NameNode needs the . to differentiate it from SecondaryNameNode
         assert '.NameNode' in hdfs, "NameNode not started"
+        assert '.NameNode' not in yarn, "NameNode should not be running on yarn-master"
+        assert '.NameNode' not in slave, "NameNode should not be running on compute-slave"
+        assert '.NameNode' not in secondary, "NameNode should not be running on secondary-namenode"
+        assert '.NameNode' not in spark, "NameNode should not be running on spark"
+
         assert 'ResourceManager' in yarn, "ResourceManager not started"
+        assert 'ResourceManager' not in hdfs, "ResourceManager should not be running on hdfs-master"
+        assert 'ResourceManager' not in slave, "ResourceManager should not be running on compute-slave"
+        assert 'ResourceManager' not in secondary, "ResourceManager should not be running on secondary-namenode"
+        assert 'ResourceManager' not in spark, "ResourceManager should not be running on spark"
+
         assert 'JobHistoryServer' in yarn, "JobHistoryServer not started"
+        assert 'JobHistoryServer' not in hdfs, "JobHistoryServer should not be running on hdfs-master"
+        assert 'JobHistoryServer' not in slave, "JobHistoryServer should not be running on compute-slave"
+        assert 'JobHistoryServer' not in secondary, "JobHistoryServer should not be running on secondary-namenode"
+        assert 'JobHistoryServer' not in spark, "JobHistoryServer should not be running on spark"
+
         assert 'NodeManager' in slave, "NodeManager not started"
+        assert 'NodeManager' not in yarn, "NodeManager should not be running on yarn-master"
+        assert 'NodeManager' not in hdfs, "NodeManager should not be running on hdfs-master"
+        assert 'NodeManager' not in secondary, "NodeManager should not be running on secondary-namenode"
+        assert 'NodeManager' not in spark, "NodeManager should not be running on spark"
+
         assert 'DataNode' in slave, "DataServer not started"
+        assert 'DataNode' not in yarn, "DataNode should not be running on yarn-master"
+        assert 'DataNode' not in hdfs, "DataNode should not be running on hdfs-master"
+        assert 'DataNode' not in secondary, "DataNode should not be running on secondary-namenode"
+        assert 'DataNode' not in spark, "DataNode should not be running on spark"
+
         assert 'SecondaryNameNode' in secondary, "SecondaryNameNode not started"
+        assert 'SecondaryNameNode' not in yarn, "SecondaryNameNode should not be running on yarn-master"
+        assert 'SecondaryNameNode' not in hdfs, "SecondaryNameNode should not be running on hdfs-master"
+        assert 'SecondaryNameNode' not in slave, "SecondaryNameNode should not be running on compute-slave"
+        assert 'SecondaryNameNode' not in spark, "SecondaryNameNode should not be running on spark"
 
-        return hdfs, yarn, slave, secondary, client  # allow subclasses to do additional checks
+        assert 'spark' in spark, 'Spark should be running on spark'
+        assert 'notebook' in notebook, 'Notebook should be running on spark'
 
     def test_hdfs_dir(self):
         """
@@ -84,11 +86,11 @@
 
         NB: These are order-dependent, so must be done as part of a single test case.
         """
-        output, retcode = self.client.run("su hdfs -c 'hdfs dfs -mkdir -p /user/ubuntu'")
+        output, retcode = self.spark.run("su hdfs -c 'hdfs dfs -mkdir -p /user/ubuntu'")
         assert retcode == 0, "Created a user directory on hdfs FAILED:\n{}".format(output)
-        output, retcode = self.client.run("su hdfs -c 'hdfs dfs -chown ubuntu:ubuntu /user/ubuntu'")
+        output, retcode = self.spark.run("su hdfs -c 'hdfs dfs -chown ubuntu:ubuntu /user/ubuntu'")
         assert retcode == 0, "Assigning an owner to hdfs directory FAILED:\n{}".format(output)
-        output, retcode = self.client.run("su hdfs -c 'hdfs dfs -chmod -R 755 /user/ubuntu'")
+        output, retcode = self.spark.run("su hdfs -c 'hdfs dfs -chmod -R 755 /user/ubuntu'")
         assert retcode == 0, "Setting directory permissions on hdfs FAILED:\n{}".format(output)
 
     def test_yarn_mapreduce_exe(self):
@@ -112,59 +114,15 @@
             ('cleanup',      "su hdfs -c 'hdfs dfs -rm -r /user/ubuntu/teragenout'"),
         ]
         for name, step in test_steps:
-            output, retcode = self.client.run(step)
+            output, retcode = self.spark.run(step)
             assert retcode == 0, "{} FAILED:\n{}".format(name, output)
 
-
-class TestScalable(unittest.TestCase, Base):
-    profile_name = 'apache-core-batch-processing'
-
-    @classmethod
-    def setUpClass(cls):
-        cls.deploy()
-
-    @classmethod
-    def tearDownClass(cls):
-        cls.reset_env()
-
-    def test_hadoop_components(self):
-        """
-        In addition to testing that the components are running where they
-        are supposed to be, confirm that none of them are also running where
-        they shouldn't be.
-        """
-        hdfs, yarn, slave, secondary, client = super(TestScalable, self).test_hadoop_components()
-
-        # .NameNode needs the . to differentiate it from SecondaryNameNode
-        assert '.NameNode' not in yarn, "NameNode should not be running on yarn-master"
-        assert '.NameNode' not in slave, "NameNode should not be running on compute-slave"
-        assert '.NameNode' not in secondary, "NameNode should not be running on secondary-namenode"
-        assert '.NameNode' not in client, "NameNode should not be running on client"
-
-        assert 'ResourceManager' not in hdfs, "ResourceManager should not be running on hdfs-master"
-        assert 'ResourceManager' not in slave, "ResourceManager should not be running on compute-slave"
-        assert 'ResourceManager' not in secondary, "ResourceManager should not be running on secondary-namenode"
-        assert 'ResourceManager' not in client, "ResourceManager should not be running on client"
-
-        assert 'JobHistoryServer' not in hdfs, "JobHistoryServer should not be running on hdfs-master"
-        assert 'JobHistoryServer' not in slave, "JobHistoryServer should not be running on compute-slave"
-        assert 'JobHistoryServer' not in secondary, "JobHistoryServer should not be running on secondary-namenode"
-        assert 'JobHistoryServer' not in client, "JobHistoryServer should not be running on client"
-
-        assert 'NodeManager' not in yarn, "NodeManager should not be running on yarn-master"
-        assert 'NodeManager' not in hdfs, "NodeManager should not be running on hdfs-master"
-        assert 'NodeManager' not in secondary, "NodeManager should not be running on secondary-namenode"
-        assert 'NodeManager' not in client, "NodeManager should not be running on client"
-
-        assert 'DataNode' not in yarn, "DataNode should not be running on yarn-master"
-        assert 'DataNode' not in hdfs, "DataNode should not be running on hdfs-master"
-        assert 'DataNode' not in secondary, "DataNode should not be running on secondary-namenode"
-        assert 'DataNode' not in client, "DataNode should not be running on client"
-
-        assert 'SecondaryNameNode' not in yarn, "SecondaryNameNode should not be running on yarn-master"
-        assert 'SecondaryNameNode' not in hdfs, "SecondaryNameNode should not be running on hdfs-master"
-        assert 'SecondaryNameNode' not in slave, "SecondaryNameNode should not be running on compute-slave"
-        assert 'SecondaryNameNode' not in client, "SecondaryNameNode should not be running on client"
+    def test_spark(self):
+        output, retcode = self.spark.run("su ubuntu -c 'bash -lc /home/ubuntu/sparkpi.sh 2>&1'")
+        assert 'Pi is roughly' in output, 'SparkPi test failed: %s' % output
+
+    def test_notebook(self):
+        pass  # requires javascript; how to test?
 
 
 if __name__ == '__main__':

=== added file 'tests/tests.yaml'
--- tests/tests.yaml	1970-01-01 00:00:00 +0000
+++ tests/tests.yaml	2016-02-23 21:01:00 +0000
@@ -0,0 +1,3 @@
+reset: false
+packages:
+  - amulet
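
A note on the new tests/tests.yaml: it replaces the removed tests/00-setup bootstrap script with declarative configuration for the bundle test runner. An annotated sketch of the same three lines (comments are editorial, not part of the file):

    reset: false    # keep the deployed environment between test files instead of tearing it down
    packages:
      - amulet      # apt packages the runner installs up front, superseding 00-setup

With amulet installed and a Juju environment bootstrapped, the suite can also be invoked directly with `python3 tests/01-bundle.py`.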

