Friday, December 12, 2014

Running Spark on Yarn from Outside the Cluster (a remote machine)

In figuring out how to run Spark on Yarn from a machine that wasn't part of the cluster, I found that I (like a few others in the forums) was confused about how it works.  I was trying to follow along here:

http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/

However, those instructions (and a lot of the documentation I found) seem to be oriented around running the client from a node in the Yarn cluster.  Here's one thing that confused people:

You Don't Have to Install Spark On The Cluster

That's right, extracting the Spark tar file only has to be done on the machine you want to launch from--that could be a node in the cluster or a completely different machine.  It doesn't have to have Hadoop/Yarn on it at all.  You don't need to drop any bits on the cluster.  That probably confuses people who were used to installing things on the cluster before Yarn.  I believe that with Spark on Yarn, the Spark client delivers everything Yarn needs to set the application up at runtime.

But what about that "export YARN_CONF_DIR=/etc/hadoop/conf" thing?  How does that work if I'm running remotely?  Well, at first I thought that path was supposed to refer to the configuration directory on the cluster itself.  But as I worked with the command line arguments, I realized there was no way Spark could know where the cluster was, since I wasn't giving it a URL.  So I scp'd the contents of /etc/hadoop/conf from my cluster to my non-cluster machine and pointed YARN_CONF_DIR at it.  Maybe there is a better way, but it worked.
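To make that concrete, here is roughly what it looks like--the hostname and paths below are placeholders for your own environment, and the submit command is just the stock SparkPi example that ships in the Spark download:

# copy the cluster's Hadoop/Yarn client configuration to the launch machine
mkdir -p ~/hadoop-conf
scp -r myuser@cluster-node.example.com:/etc/hadoop/conf/* ~/hadoop-conf/
export YARN_CONF_DIR=~/hadoop-conf

# then try a submit in yarn-cluster mode from wherever you extracted the Spark tar file
cd ~/spark
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster --num-executors 2 \
    lib/spark-examples*.jar 10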

That may be all you need to get both cluster and client modes working from outside the cluster.  That said, you are more likely to hit permission errors (like I did) since you are off-cluster:

Permission denied: user=myusername, access=WRITE, inode="/user":hdfs:hdfs:drwxr-xr-x

If you see this, you just need to provision, on the cluster, the user you are running as locally--probably something along the lines of:

MyUsername=someusername
sudo useradd $MyUsername
sudo -u hdfs hadoop fs -mkdir /user/$MyUsername
sudo -u hdfs hadoop fs -chown -R $MyUsername /user/$MyUsername

Anyway, once you get to permission errors, it probably means you've got your Spark configuration right--especially if you see your cluster URL showing up in the console logs.

And kudos to the Spark devs for good error messages--I hit this one trying to run with bits I'd gotten on a thumb drive from them at an event:

Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.

This was easily resolved by just downloading the tar file from the Hortonworks article and using that (shame on me for being lazy...).
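For reference, if you do want to build Spark from source instead of using a pre-built package, the Spark 1.x build documentation describes enabling YARN support via Maven profiles--something along these lines for a Hadoop 2.4-based cluster:

# enable the yarn profile so the YARN classes get compiled in
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package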

Wednesday, October 8, 2014

Gnome Desktop Blank Screen (spinner never goes away)

UPDATE:  The root cause of this issue was that GDM AutomaticLogin was enabled in /etc/gdm/custom.conf but the login account had an expired password.  The problem went away when this was resolved.
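For anyone hitting the same combination, a quick way to check both pieces (assuming the auto-login account is the Quickstart VM's default "cloudera" user):

# see whether automatic login is enabled (look under the [daemon] section)
grep -i AutomaticLogin /etc/gdm/custom.conf

# see whether that account's password has expired
chage -l cloudera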

Using the Cloudera CDH Quickstart VM 5.1, converted to a VMware template and running on ESXi 5.5, I was frequently getting a never-ending spinner on the graphical desktop console (so basically, a blank black screen with the spinning "beachball").  It wasn't totally repeatable, but it was happening more than half the time.  The VM uses CentOS 6.5 under the hood.

The fix was to kill the GNOME gdm-simple-slave process, which then restarts automatically.  This causes the console to flash, and then the desktop comes up.  This command will do it:


pkill -o -f gdm-simple-slave

You can either ssh directly into the machine to run this, or do a Ctrl-Alt-F6 to switch the console to a command line and run it.

I'm not sure what the underlying problem is that's causing this, but at least this gets past it.

Friday, September 12, 2014

Git ignore pattern in a file not working? Watch out for spaces on the line!

I couldn't figure out why a git ignore pattern applied via:

git config --global core.excludesfile

wasn't working.  Turns out there were spaces at the end of the pattern line that were preventing it from working.  According to the docs, spaces at the end of lines aren't supposed to matter:

Trailing spaces are ignored unless they are quoted with backslash ("\")
ref:  http://git-scm.com/docs/gitignore

However, they did matter in this case.  Perhaps this is just an issue with Git on Windows.
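An easy way to spot the problem is to grep the excludes file for trailing spaces.  The path below is just an example--use whatever you pointed core.excludesfile at:

# show which excludes file is configured
git config --global core.excludesfile

# list any lines that end in one or more spaces
grep -nE ' +$' ~/.gitignore_global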

TIP:  Use the following to test out your ignore patterns in dry run mode via -n:

git add -n *

Monday, September 8, 2014

Resetting (Deleting and Cleaning Out) an Ambari Cluster

If you are experimenting with Ambari for Hadoop cluster provisioning, it is useful to be able to wipe the Ambari server and agents clean so you can try again.  There are some commands provided by Ambari that you can run to do this, but there are also a couple of things to watch out for--detailed below.  These instructions worked for me on Ambari 1.6.1 with Red Hat 6.5.

First, stop and reset on the Ambari server:

[root@test-ambari ambuser]# ambari-server stop
[root@test-ambari ambuser]# ambari-server reset

Next, to prevent a possible obscure "no more mirrors to try" error on re-provisioning, clean out the yum cache on all the agent machines--as I showed here.  I have SaltStack installed, so I can run it across my cluster like this (or just log into each machine and run 'yum clean all'):

[root@test-ambari ~]# salt '*' cmd.run 'yum clean all'

Then go to each Ambari agent machine and run the host cleanup.  It would be nice to do this with SaltStack, but that requires giving sudo tty permissions for the command (which I didn't want to get into).  
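If you have root ssh access to the agents, a simple loop gets the same job done (the agent hostnames below are placeholders for yours):

for host in test-agent1.example.com test-agent2.example.com test-agent3.example.com; do
    ssh root@$host "python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent"
done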

I'm showing some of the output below but you may see different behaviour depending on the particulars of the cluster and how far the prior provisioning process got:

[root@master-master ~]# python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent

Now restart the Ambari server:

[root@test-ambari ambuser]# ambari-server start

Now I don't know if this was documented anywhere, but I am using a script to provision my cluster via the API--and I found I had to wait until all the machines (agents) self-register with the Ambari server (at least this is what I think is going on).  Here, I am using the Ambari API, piped through "wc", to monitor the count of registered machines.  It took about 45 seconds for all the agents to register (when the count finally hit 4).

[root@test-ambari ~]# curl -sH "X-Requested-By: ambari" -u $USER:$PWD -i  http://localhost:8080/api/v1/hosts | grep host_name | wc
      2       6     102
[root@test-ambari ~]# curl -sH "X-Requested-By: ambari" -u $USER:$PWD -i  http://localhost:8080/api/v1/hosts | grep host_name | wc
      3       9     152
[root@test-ambari ~]# curl -sH "X-Requested-By: ambari" -u $USER:$PWD -i  http://localhost:8080/api/v1/hosts | grep host_name | wc
      3       9     152
[root@test-ambari ~]# curl -sH "X-Requested-By: ambari" -u $USER:$PWD -i  http://localhost:8080/api/v1/hosts | grep host_name | wc
      4      12     202
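Rather than eyeballing it, you can poll until the expected number of hosts shows up--a rough sketch (set EXPECTED to the number of machines in your cluster; $USER/$PWD are the same Ambari credentials used in the curl commands above):

EXPECTED=4
until [ "$(curl -s -H "X-Requested-By: ambari" -u $USER:$PWD \
    http://localhost:8080/api/v1/hosts | grep -c host_name)" -ge "$EXPECTED" ]; do
    echo "waiting for agents to register..."
    sleep 5
done
echo "all $EXPECTED hosts registered"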

If you proceed before everything is registered, you may run into this error using the API:

  "status" : 400,
  "message" : "Attempted to add unknown hosts to a cluster.  These hosts have not been registered with the server: test-agent3.example.com"
At this point, you should have clean ambari server/agent cluster substrate to create the next cluster.  Happy provisioning!



Here are the commands with output:

Ambari-server  stop/reset:

[root@test-ambari ambuser]# ambari-server stop
Using python  /usr/bin/python2.6
Stopping ambari-server
Ambari Server stopped
[root@test-ambari ambuser]# ambari-server reset
Using python  /usr/bin/python2.6
Resetting ambari-server
**** WARNING **** You are about to reset and clear the Ambari Server database. This will remove all cluster host and configuration information from the database. You will be required to re-configure the Ambari server and re-run the cluster wizard. 
Are you SURE you want to perform the reset [yes/no] (no)? y
Confirm server reset [yes/no](no)? y
Resetting the Server database...
Connecting to local database...done.
WARNING: Non critical error in DDL, use --verbose for more information
Ambari Server 'reset' completed with warnings.

Yum cache cleaning:

[root@test-ambari ~]# salt '*' cmd.run 'yum clean all'
test-master.example.com:
    Loaded plugins: product-id, refresh-packagekit, rhnplugin, security,
    Cleaning repos: HDP-2.1 HDP-UTILS-1.1.0.17 Updates-ambari-1.6.1 ambari-1.x
                  : dogfood dogfood_6_x86-64 epel6_x86-64 rhel-x86_64-server-6
                  : rhel-x86_64-server-optional-6 rhel-x86_64-server-supplementary-6
    Cleaning up Everything
...SNIP

Host Cleanup (on the agents)--your output could be quite different:

[root@master-master ~]# python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py --silent
INFO:HostCleanup:
Killing pid's: ['']
INFO:HostCleanup:Deleting packages: ['']
INFO:HostCleanup:
Deleting users: ['ambari-qa', 'yarn', 'hdfs', 'mapred', 'zookeeper']
INFO:HostCleanup:Executing command: sudo userdel -rf ambari-qa
INFO:HostCleanup:Successfully deleted user: ambari-qa
INFO:HostCleanup:Executing command: sudo userdel -rf yarn
INFO:HostCleanup:Successfully deleted user: yarn
INFO:HostCleanup:Executing command: sudo userdel -rf hdfs
INFO:HostCleanup:Successfully deleted user: hdfs
INFO:HostCleanup:Executing command: sudo userdel -rf mapred
INFO:HostCleanup:Successfully deleted user: mapred
INFO:HostCleanup:Executing command: sudo userdel -rf zookeeper
INFO:HostCleanup:Successfully deleted user: zookeeper
INFO:HostCleanup:Executing command: sudo groupdel hadoop
WARNING:HostCleanup:Cannot delete group : hadoop, groupdel: cannot remove the primary group of user 'tez'
INFO:HostCleanup:Path doesn't exists: /home/ambari-qa
INFO:HostCleanup:Path doesn't exists: /home/yarn
INFO:HostCleanup:Path doesn't exists: /home/hdfs
INFO:HostCleanup:Path doesn't exists: /home/mapred
INFO:HostCleanup:Path doesn't exists: /home/zookeeper
INFO:HostCleanup:
Deleting directories: ['']
INFO:HostCleanup:Path doesn't exists: 
INFO:HostCleanup:
Deleting repo files: []
INFO:HostCleanup:
Erasing alternatives:{'symlink_list': [''], 'target_list': ['']}
INFO:HostCleanup:Path doesn't exists: 

INFO:HostCleanup:Clean-up completed. The output is at /var/lib/ambari-agent/data/hostcleanup.result

Restart the Ambari server:

[root@test-ambari ambuser]# ambari-server start
Using python  /usr/bin/python2.6
Starting ambari-server
Ambari Server running with 'root' privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Waiting for server start...
sh: line 0: ulimit: open files: cannot modify limit: Operation not permitted
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Ambari Server 'start' completed successfully.

Finally, once the agents have registered, kick off the cluster provisioning script (the API script mentioned above):

[root@test-ambari ambuser]# python AmbariApiScript.py

Friday, September 5, 2014

Ambari Cluster Provisioning Failure -- No More Mirrors To Try

Saw this when trying to re-provision a cluster after doing an "ambari-server reset" (Ambari 1.6.1 on Red Hat 6.5):

Fail: Execution of '/usr/bin/yum -d 0 -e 0 -y install hadoop-yarn' returned 1. Error Downloading Packages:
  hadoop-yarn-2.4.0.2.1.5.0-695.el6.x86_64: failure: hadoop/hadoop-yarn-2.4.0.2.1.5.0-695.el6.x86_64.rpm from HDP-2.1: [Errno 256] No more mirrors to try.
  hadoop-2.4.0.2.1.5.0-695.el6.x86_64: failure: hadoop/hadoop-2.4.0.2.1.5.0-695.el6.x86_64.rpm from HDP-2.1: [Errno 256] No more mirrors to try.
  zookeeper-3.4.5.2.1.5.0-695.el6.noarch: failure: zookeeper/zookeeper-3.4.5.2.1.5.0-695.el6.noarch.rpm from HDP-2.1: [Errno 256] No more mirrors to try.

The solution was to do a "yum clean all" on the agents and retry (which requires doing the "ambari-server reset" and agent cleanup all over again).