Wednesday, March 30, 2011

whirr config, hadoop cmds, test mapreduce

Although the commands below are replicated in the Cloudera Whirr install doc, I find it helpful to document my own experiences, as I usually encounter errors not mentioned in the documentation.

Whirr Configuration
linux-z6tw:~> whirr version
Apache Whirr 0.3.0-CDH3B4

linux-z6tw:~> cat hadoop.properties
whirr.service-name=hadoop
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt
whirr.provider=ec2
whirr.identity=0TM8DSRMM
whirr.credential=VhmAq9QzCxzKpxhQxBoA5jO
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.hadoop-install-runurl=cloudera/cdh/install
whirr.hadoop-configure-runurl=cloudera/cdh/post-configure
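
For reference, the whirr.instance-templates roles are jt (jobtracker), nn (namenode), dn (datanode) and tt (tasktracker), so this file asks for one master and one worker. Also, rather than hardcoding AWS keys in a file that may get shared, Whirr property files support variable interpolation, so the credentials can come from the environment instead; a minimal sketch, assuming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are exported in your shell:

whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}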


Launching a Whirr Cluster
linux-z6tw:~> whirr launch-cluster --config hadoop.properties
Bootstrapping cluster
Configuring template
Starting 1 node(s) with roles [tt, dn]
Configuring template
Starting 1 node(s) with roles [jt, nn]

Nodes started: [[id=us-east-1/i-c4e942ab, providerId=i-c4e942ab, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-2a1fec43, os=[name=null, family=amzn-linux, version=2011.02.1, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2011.02.1.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.116.209.9], publicAddresses=[50.17.125.68], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Nodes started: [[id=us-east-1/i-c0e942af, providerId=i-c0e942af, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-2a1fec43, os=[name=null, family=amzn-linux, version=2011.02.1, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2011.02.1.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.101.11.193], publicAddresses=[184.72.91.22], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Authorizing firewall
Running configuration script
Configuration script run completed
Running configuration script
Configuration script run completed
Completed configuration of myhadoopcluster
Web UI available at http://ec2-184-72-91-22.compute-1.amazonaws.com
Wrote Hadoop site file /home/sodo/.whirr/myhadoopcluster/hadoop-site.xml
Wrote Hadoop proxy script /home/sodo/.whirr/myhadoopcluster/hadoop-proxy.sh
Wrote instances file /home/sodo/.whirr/myhadoopcluster/instances
Started cluster of 2 instances
Cluster{instances=[Instance{roles=[jt, nn], publicAddress=/184.72.91.22, privateAddress=/10.101.11.193, id=us-east-1/i-c0e942af}, Instance{roles=[tt, dn], publicAddress=/50.17.125.68, privateAddress=/10.116.209.9, id=us-east-1/i-c4e942ab}], configuration={hadoop.job.ugi=root,root, mapred.job.tracker=ec2-184-72-91-22.compute-1.amazonaws.com:8021, hadoop.socks.server=localhost:6666, fs.s3n.awsAccessKeyId=058DSRMMTF, fs.s3.awsSecretAccessKey=VhmOHmAq9QzCxzKpxhQxBoA5jOxZksq62jpO5mbD, fs.s3.awsAccessKeyId=058DSRMMTF, hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory, fs.default.name=hdfs://ec2-184-72-91-22.compute-1.amazonaws.com:8020/, fs.s3n.awsSecretAccessKey=VhmAq9QzCxzKpxhQxBoA5jOxZks}}


Update the local Hadoop configuration to use hadoop-site.xml
linux-z6tw:~> ls .whirr/myhadoopcluster/
hadoop-proxy.sh hadoop-site.xml instances

linux-z6tw:~> sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
root's password:
linux-z6tw:~> sudo rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
linux-z6tw:~> sudo cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr

linux-z6tw:/> sudo /usr/sbin/update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50

linux-z6tw:/> /usr/sbin/update-alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.whirr
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
/etc/hadoop-0.20/conf.whirr - priority 50
Current `best' version is /etc/hadoop-0.20/conf.whirr.
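
When the cluster goes away, the local configuration can be pointed back at whatever it used before; a sketch using the same update-alternatives tool:

# switch back to the pseudo-distributed configuration listed above
sudo /usr/sbin/update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.pseudo
# or remove the whirr alternative entirely
sudo /usr/sbin/update-alternatives --remove hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr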


Running a Whirr Proxy
linux-z6tw:/> . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
Running proxy to Hadoop cluster at ec2-50-17-36-127.compute-1.amazonaws.com. Use Ctrl-c to quit.
Warning: Permanently added 'ec2-50-17-36-127.compute-1.amazonaws.com,10.194.74.132' (RSA) to the list of known hosts.
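
The proxy is what makes the local client work at all: the generated hadoop-site.xml sets hadoop.rpc.socket.factory.class.default to SocksSocketFactory and hadoop.socks.server to localhost:6666 (both visible in the launch output above), so every Hadoop RPC is routed through a SOCKS tunnel that the proxy script holds open over ssh. Conceptually it boils down to something like this sketch (not the generated script verbatim; hostname taken from the launch output):

# hold open a SOCKS proxy on localhost:6666, tunneled to the master node
ssh -i ~/.ssh/id_rsa -D 6666 -N ec2-184-72-91-22.compute-1.amazonaws.com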


Hadoop Commands
You'll get this error if the proxy isn't running:
linux-z6tw:/> hadoop fs -ls /
11/03/31 01:53:08 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
11/03/31 01:53:10 INFO ipc.Client: Retrying connect to server: ec2-184-72-91-22.compute-1.amazonaws.com/184.72.91.22:8020. Already tried 0 time(s).
11/03/31 01:53:11 INFO ipc.Client: Retrying connect to server: ec2-184-72-91-22.compute-1.amazonaws.com/184.72.91.22:8020. Already tried 1 time(s).
11/03/31 01:53:12 INFO ipc.Client: Retrying connect to server: ec2-184-72-91-22.compute-1.amazonaws.com/184.72.91.22:8020. Already tried 2 time(s).
11/03/31 01:53:13 INFO ipc.Client: Retrying connect to server: ec2-184-72-91-22.compute-1.amazonaws.com/184.72.91.22:8020. Already tried 3 time(s).
11/03/31 01:53:14 INFO ipc.Client: Retrying connect to server: ec2-184-72-91-22.compute-1.amazonaws.com/184.72.91.22:8020. Already tried 4 time(s).
^Csodo@linux-z6tw:/>

With the proxy running, the same command succeeds:
linux-z6tw:~/.whirr/myhadoopcluster> hadoop fs -ls /
Found 4 items
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /hadoop
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /mnt
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /tmp
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /user
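
If -ls hangs or shows nothing, it's worth confirming the datanode actually registered with the namenode before loading data; the report also goes through the proxy:

# summarize HDFS capacity and list live datanodes
hadoop dfsadmin -report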

Relative HDFS paths resolve to the user's home directory, /user/sodo:
linux-z6tw:~> hadoop fs -mkdir input

linux-z6tw:~> hadoop fs -put /usr/lib/hadoop-0.20/LICENSE.txt input

linux-z6tw:~> hadoop fs -ls /user/sodo/input
Found 1 items
-rw-r--r-- 3 sodo supergroup 13366 2011-03-31 02:02 /user/sodo/input/LICENSE.txt


MapReduce Test
linux-z6tw:~> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-*.jar wordcount input output
11/03/31 02:07:53 INFO input.FileInputFormat: Total input paths to process : 1
11/03/31 02:07:54 INFO mapred.JobClient: Running job: job_201103310543_0001
11/03/31 02:07:55 INFO mapred.JobClient: map 0% reduce 0%
11/03/31 02:08:07 INFO mapred.JobClient: map 100% reduce 0%
11/03/31 02:08:23 INFO mapred.JobClient: map 100% reduce 100%
11/03/31 02:08:27 INFO mapred.JobClient: Job complete: job_201103310543_0001
11/03/31 02:08:27 INFO mapred.JobClient: Counters: 22
11/03/31 02:08:27 INFO mapred.JobClient: Job Counters
11/03/31 02:08:27 INFO mapred.JobClient: Launched reduce tasks=1
11/03/31 02:08:27 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12167
11/03/31 02:08:27 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/03/31 02:08:27 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/03/31 02:08:27 INFO mapred.JobClient: Launched map tasks=1
11/03/31 02:08:27 INFO mapred.JobClient: Data-local map tasks=1
11/03/31 02:08:27 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15843
11/03/31 02:08:27 INFO mapred.JobClient: FileSystemCounters
11/03/31 02:08:27 INFO mapred.JobClient: FILE_BYTES_READ=10206
11/03/31 02:08:27 INFO mapred.JobClient: HDFS_BYTES_READ=13508
11/03/31 02:08:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=114918
11/03/31 02:08:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=7376
11/03/31 02:08:27 INFO mapred.JobClient: Map-Reduce Framework
11/03/31 02:08:27 INFO mapred.JobClient: Reduce input groups=714
11/03/31 02:08:27 INFO mapred.JobClient: Combine output records=714
11/03/31 02:08:27 INFO mapred.JobClient: Map input records=244
11/03/31 02:08:27 INFO mapred.JobClient: Reduce shuffle bytes=10206
11/03/31 02:08:27 INFO mapred.JobClient: Reduce output records=714
11/03/31 02:08:27 INFO mapred.JobClient: Spilled Records=1428
11/03/31 02:08:27 INFO mapred.JobClient: Map output bytes=19699
11/03/31 02:08:27 INFO mapred.JobClient: Combine input records=1887
11/03/31 02:08:27 INFO mapred.JobClient: Map output records=1887
11/03/31 02:08:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=142
11/03/31 02:08:27 INFO mapred.JobClient: Reduce input records=714


More Hadoop Commands
linux-z6tw:~> hadoop fs -ls /user/sodo
Found 3 items
drwx------ - sodo supergroup 0 2011-03-31 02:04 /user/sodo/.staging
drwxr-xr-x - sodo supergroup 0 2011-03-31 02:02 /user/sodo/input
drwxrwxrwx - sodo supergroup 0 2011-03-31 02:04 /user/sodo/output

linux-z6tw:~> hadoop fs -ls /user/sodo/output
Found 3 items
-rw-r--r-- 3 sodo supergroup 0 2011-03-31 02:04 /user/sodo/output/_SUCCESS
drwxrwxrwx - sodo supergroup 0 2011-03-31 02:03 /user/sodo/output/_logs
-rw-r--r-- 3 sodo supergroup 7376 2011-03-31 02:04 /user/sodo/output/part-r-00000


linux-z6tw:~> hadoop fs -cat /user/sodo/output/part-* | head
"AS 3
"Contribution" 1
"Contributor" 1
"Derivative 1
"Legal 1
"License" 1
"License"); 1
"Licensor" 1
"NOTICE" 1
"Not 1
cat: Unable to write to output stream.
The "Unable to write to output stream" message is harmless: head exits after printing ten lines, closing the pipe that cat is still writing to.
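
Rather than paging results through -cat, the whole output can be pulled down locally; a sketch, with an arbitrary local filename:

# merge all part files from HDFS into one local file
hadoop fs -getmerge output wordcount-results.txt
# top ten words by count (wordcount emits word<TAB>count)
sort -t$'\t' -k2 -nr wordcount-results.txt | head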


Verify SSH connectivity
sodo@linux-z6tw:~/trendingtopics> ssh ec2-user@ec2-204-236-240-136.compute-1.amazonaws.com
Last login: Sun Apr 3 15:35:18 2011 from c-69-248-248-90.hsd1.nj.comcast.net

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|  Beta

See /usr/share/doc/system-release-2011.02 for latest release notes. :-)
[ec2-user@ip-10-114-102-177 ~]$ exit
logout
Connection to ec2-204-236-240-136.compute-1.amazonaws.com closed


Hadoop Update script
sodo@linux-z6tw:~/trendingtopics> cat ~/updateHadoopConfig.sh
#!/bin/bash -v
# run this from the home directory, where hadoop.properties lives
whirr version
cat hadoop.properties
echo "Launch the cluster..hit ENTER"
read ANSWER
whirr launch-cluster --config hadoop.properties
ls .whirr/myhadoopcluster/
echo "Update the local Hadoop configuration to use hadoop-site.xml..hit ENTER"
read ANSWER
sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
sudo rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
sudo cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
sudo /usr/sbin/update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
sudo /usr/sbin/update-alternatives --display hadoop-0.20-conf
echo "Now, run the proxy..hit ENTER"
read ANSWER
. ~/.whirr/myhadoopcluster/hadoop-proxy.sh
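
One caveat: sourcing hadoop-proxy.sh at the end means the script never returns, since the proxy runs in the foreground. To keep the shell free, the proxy can be run in the background instead (a sketch; the log path is arbitrary):

nohup bash ~/.whirr/myhadoopcluster/hadoop-proxy.sh > ~/hadoop-proxy.log 2>&1 &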


MapReduce Test Script
sfrase@linux-z6tw:~> cat mapReduceTestJob.sh
#!/bin/bash -v
# assumes HADOOP_HOME points at the Hadoop install, e.g. /usr/lib/hadoop-0.20
hadoop fs -mkdir input
hadoop fs -put $HADOOP_HOME/LICENSE.txt input
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
hadoop fs -cat output/part-* | head
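
Two follow-ups worth knowing: a second run of this script will fail because the output directory already exists, and the EC2 instances keep billing until the cluster is torn down. A sketch of both, using the CDH3-era -rmr syntax and Whirr's destroy-cluster command:

# remove the previous job output so wordcount can run again
hadoop fs -rmr output
# shut down the EC2 instances when finished
whirr destroy-cluster --config hadoop.properties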


Reference
Hadoop Command Reference
Hadoop Streaming
How MapReduce Works

1 comment:

  1. I followed your instructions, and I actually have the proxy running; so far everything works great. BUT when I try to browse HDFS, I get the error you mentioned ("You'll get this error if the proxy isn't running").
    The proxy says:
    channel 2: open failed: connect failed: Connection refused
    channel 2: open failed: connect failed: Connection refused

    Do you have any idea what I can do?
    Thanks in advance!
