
trafodion-firefighters team mailing list archive

Re: check test failures

 

This issue is probably related to the one discussed in the attached email.


From: Trafodion-firefighters [mailto:trafodion-firefighters-bounces+khaled.bouaziz=hp.com@xxxxxxxxxxxxxxxxxxx] On Behalf Of Subbiah, Suresh
Sent: Monday, March 02, 2015 8:31 AM
To: trafodion-firefighters@xxxxxxxxxxxxxxxxxxx
Subject: [Trafodion-firefighters] FW: check test failures

Hi FFs,

Any suggestions on how to resolve this problem?

Thanks
Suresh

From: Varnau, Steve (Trafodion)
Sent: Saturday, February 28, 2015 11:08 PM
To: Subbiah, Suresh; Sheedy, Chris (Trafodion)
Cc: Govindarajan, Selvaganes; Cooper, Joanie
Subject: RE: check test failures

Hi Suresh,

Joanie hit this on Friday too. There seems to be an intermittent failure introduced recently, which causes a hang in sqstart. Maybe firefighters need to tackle it.

-Steve

From: Subbiah, Suresh
Sent: Saturday, February 28, 2015 17:53
To: Varnau, Steve (Trafodion); Sheedy, Chris (Trafodion)
Cc: Govindarajan, Selvaganes
Subject: check test failures

Hi Steve, Chris

For my checkin https://review.trafodion.org/#/c/1169/ I am getting check test failures in either the phoenix or the core-seabase suites.
As far as I can tell the issue is not with any particular test failing, but with something in the setup stage not working as expected.
Is there anything I can do to avoid these errors? Copying Selva since I think his tests may be running into similar problems, though I have not checked.

Thanks
Suresh


Current Run
For the current run, the first error I see for the failing phoenix test is:
2015-03-01 01:27:01 ***ERROR: (hdfs dfs -setfacl -R -m mask::rwx /apps/hbase/data/archive) command failed
2015-03-01 01:27:01 ***ERROR: traf_hortonworks_mods98 exited with error.
2015-03-01 01:27:01 ***ERROR: Please check log files.
2015-03-01 01:27:01 ***ERROR: Exiting....

Previous run (core Seabase)
***INFO: End of DCS install.
***INFO: starting Trafodion instance
***INFO: End of DCS install.
***INFO: starting Trafodion instance
Build timed out (after 200 minutes). Marking the build as failed.

Two runs before (once in phoenix and once in core Seabase)
***ERROR: (hdfs dfs -setfacl -R -m mask::rwx /apps/hbase/data/archive) command failed
***ERROR: traf_hortonworks_mods98 exited with error.
***ERROR: Please check log files.
***ERROR: Exiting....
--- Begin Message ---
Yeah, hdfs definitely has permission to do anything in the filesystem. The only question is who will be the owner/group of the archive directory. My script assumed that the hbase user was the owner of /hbase and was the desired owner of /hbase/archive. So you might want to add a chown command after the mkdir?
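
A minimal sketch of what that could look like (assuming the Cloudera layout /hbase/archive and that hbase:hbase is the desired owner; adjust the path for Hortonworks):

  sudo -u hdfs hdfs dfs -mkdir -p /hbase/archive
  sudo -u hdfs hdfs dfs -chown hbase:hbase /hbase/archive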



-Steve



From: Anderson, Marvin
Sent: Tuesday, February 24, 2015 23:25
To: Varnau, Steve (Trafodion); Bouaziz, Khaled
Cc: Moran, Amanda
Subject: RE: Gate job failing



I get an error using the hbase userid to do the mkdir on our test clusters; it has to be the hdfs userid.



[andersma@sea-nodepool installer]$ sudo su hbase --command "hdfs dfs -mkdir -p /hbase/archive"

mkdir: Permission denied: user=hbase, access=WRITE, inode="/":hdfs:hdfs:drwxr-xr-x



[andersma@sea-nodepool installer]$ sudo su hdfs --command "hdfs dfs -mkdir -p /hbase/archive"



So, I guess I'll put it in for the hdfs userid.



--Marvin



From: Varnau, Steve (Trafodion)
Sent: Tuesday, February 24, 2015 4:59 PM
To: Anderson, Marvin; Bouaziz, Khaled
Cc: Moran, Amanda
Subject: RE: Gate job failing



Here is what my script does that runs on non-cluster-mgr Cloudera nodes:



  sudo -u hbase hdfs dfs -mkdir -p /hbase/archive

  sudo -u hdfs hdfs dfs -setfacl -R -m user:jenkins:rwx /hbase/archive

  sudo -u hdfs hdfs dfs -setfacl -R -m default:user:jenkins:rwx /hbase/archive

  sudo -u hdfs hdfs dfs -setfacl -R -m mask::rwx /hbase/archive



This is sufficient for the tests. I can't say what other implications there might be.
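
To double-check that the ACLs actually took effect, something like this should print them (assuming the same Cloudera path):

  sudo -u hdfs hdfs dfs -getfacl /hbase/archive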



-Steve



From: Anderson, Marvin
Sent: Tuesday, February 24, 2015 13:44
To: Varnau, Steve (Trafodion); Bouaziz, Khaled
Cc: Moran, Amanda
Subject: RE: Gate job failing



Does the installer just do a simple HDFS mkdir command like the following?



Hortonworks

hdfs dfs -mkdir -p /apps/hbase/data/archive



Cloudera

hdfs dfs -mkdir -p /hbase/archive



Does it need to have permissions, ownership, or other settings configured in any specific way?



--Marvin



From: Varnau, Steve (Trafodion)
Sent: Tuesday, February 24, 2015 2:58 PM
To: Bouaziz, Khaled; Anderson, Marvin
Cc: Moran, Amanda
Subject: RE: Gate job failing



This may indeed be due to the way we clean up the test nodes between jobs. To protect against any change or test that corrupts hbase data, we completely remove hbase data (in HDFS and in Zookeeper) at the beginning of each job and bring hbase back up to initialize it.



Apparently just bringing up hbase is not creating that directory. So, you could add installer logic to create it if it does not exist, but if you think it should always be there on a normal system, then I can have the script re-create the archive directory when doing that cleanup.
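
A rough sketch of that installer logic, offered only as an example (the path and owner here are assumptions, not taken from the actual installer):

  # /apps/hbase/data/archive on Hortonworks, /hbase/archive on Cloudera
  ARCHIVE_DIR=/hbase/archive
  # create the archive directory only if it does not already exist
  if ! sudo -u hdfs hdfs dfs -test -d "$ARCHIVE_DIR"; then
      sudo -u hdfs hdfs dfs -mkdir -p "$ARCHIVE_DIR"
      sudo -u hdfs hdfs dfs -chown hbase:hbase "$ARCHIVE_DIR"
  fi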



-Steve



From: Bouaziz, Khaled
Sent: Tuesday, February 24, 2015 11:31
To: Anderson, Marvin
Cc: Varnau, Steve (Trafodion); Moran, Amanda
Subject: RE: Gate job failing



Hi Marvin:



I think Steve mentioned something like this before, and I think he already has a solution.



thanks



From: Anderson, Marvin
Sent: Tuesday, February 24, 2015 1:12 PM
To: Bouaziz, Khaled
Cc: Varnau, Steve (Trafodion); Moran, Amanda
Subject: Gate job failing



Hi Khaled,

I was checking in the changes for snapshot scan support. Those changes run fine on several of our test clusters, but they are failing on the Jenkins build gate machines for both Cloudera and Hortonworks.



***INFO: Setting HDFS ACLs for snapshot scan support

setfacl: `/hbase/archive': No such file or directory

***ERROR: (hdfs dfs -setfacl -R -m user:trafodion:rwx /hbase/archive) command failed

***ERROR: traf_cloudera_mods98 exited with error.





***INFO: Setting HDFS ACLs for snapshot scan support
setfacl: `/apps/hbase/data/archive': No such file or directory
***ERROR: (hdfs dfs -setfacl -R -m mask::rwx /apps/hbase/data/archive) command failed
***ERROR: traf_hortonworks_mods98 exited with error.



Are these missing directories ones we should be creating, or should they already be there? It appears they are not there on the build machines but were there on all of our test machines. So, is this a problem with the build machines' Hadoop environment, or something we should be creating?



--Marvin








--- End Message ---
