Although the commands below are replicated in the Cloudera Whirr install doc, I find it helpful to document my own experiences, as I usually encounter errors not mentioned in the documentation.
Whirr Configuration
linux-z6tw:~> whirr version
Apache Whirr 0.3.0-CDH3B4
linux-z6tw:~> cat
whirr.instance-templates=1 jt+nn,1 dn+tt
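The whirr.instance-templates value packs the cluster layout into one string: comma-separated templates, each a node count followed by roles joined with "+". The launch output further down reports one node with roles [jt, nn] and one with [tt, dn], which matches that reading. As a rough sketch (the function name describe_templates is made up for illustration and is not part of Whirr), the value can be unpacked like this:

```shell
#!/bin/sh
# Hypothetical helper, not part of Whirr: print one line per template
# in the form "<count> node(s): <role> <role> ...".
describe_templates() {
  # Split the value on commas, then split each template into count and roles.
  printf '%s\n' "$1" | tr ',' '\n' | while read -r count roles; do
    printf '%s node(s): %s\n' "$count" "$(printf '%s' "$roles" | tr '+' ' ')"
  done
}

describe_templates "1 jt+nn,1 dn+tt"
```

For the value above this prints one line for the jt/nn node and one for the dn/tt node.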
Launching a Whirr Cluster
linux-z6tw:~> whirr launch-cluster --config
Bootstrapping cluster
Configuring template
Starting 1 node(s) with roles [tt, dn]
Configuring template
Starting 1 node(s) with roles [jt, nn]
Nodes started: [[id=us-east-1/i-c4e942ab, providerId=i-c4e942ab, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-2a1fec43, os=[name=null, family=amzn-linux, version=2011.02.1, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2011.02.1.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[], publicAddresses=[], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Nodes started: [[id=us-east-1/i-c0e942af, providerId=i-c0e942af, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-2a1fec43, os=[name=null, family=amzn-linux, version=2011.02.1, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2011.02.1.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[], publicAddresses=[], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Authorizing firewall
Running configuration script
Configuration script run completed
Running configuration script
Configuration script run completed
Completed configuration of myhadoopcluster
Web UI available at
Wrote Hadoop site file /home/sodo/.whirr/myhadoopcluster/hadoop-site.xml
Wrote Hadoop proxy script /home/sodo/.whirr/myhadoopcluster/
Wrote instances file /home/sodo/.whirr/myhadoopcluster/instances
Started cluster of 2 instances
Cluster{instances=[Instance{roles=[jt, nn], publicAddress=/, privateAddress=/, id=us-east-1/i-c0e942af}, Instance{roles=[tt, dn], publicAddress=/, privateAddress=/, id=us-east-1/i-c4e942ab}], configuration={hadoop.job.ugi=root,root,, hadoop.socks.server=localhost:6666, fs.s3n.awsAccessKeyId=058DSRMMTF, fs.s3.awsSecretAccessKey=VhmOHmAq9QzCxzKpxhQxBoA5jOxZksq62jpO5mbD, fs.s3.awsAccessKeyId=058DSRMMTF,,, fs.s3n.awsSecretAccessKey=VhmAq9QzCxzKpxhQxBoA5jOxZks}}
Update the local Hadoop configuration to use hadoop-site.xml
linux-z6tw:~> ls .whirr/myhadoopcluster/
hadoop-site.xml instances
linux-z6tw:~> sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
root's password:
linux-z6tw:~> sudo rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
linux-z6tw:~> sudo cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
linux-z6tw:/> sudo /usr/sbin/update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
linux-z6tw:/> /usr/sbin/update-alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.whirr
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
/etc/hadoop-0.20/conf.whirr - priority 50
Current `best' version is /etc/hadoop-0.20/conf.whirr.
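The display output shows why conf.whirr wins: in auto mode, update-alternatives points the link at the registered alternative with the highest priority, and 50 beats the 10 and 30 of the existing entries. A small sketch of that selection rule follows, with the helper name best_alternative invented for illustration; it only mimics the rule on "path priority" text and does not touch the alternatives system.

```shell
#!/bin/sh
# Hypothetical helper: read "path priority" lines on stdin and print the
# path with the highest numeric priority (the auto-mode selection rule).
best_alternative() {
  sort -k2,2nr | head -n 1 | awk '{print $1}'
}

best_alternative <<'EOF'
/etc/hadoop-0.20/conf.empty 10
/etc/hadoop-0.20/conf.pseudo 30
/etc/hadoop-0.20/conf.whirr 50
EOF
```

Registering the Whirr configuration with any priority above 30 would have had the same effect here.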
Running a Whirr Proxy
linux-z6tw:/> . ~/.whirr/myhadoopcluster/
Running proxy to Hadoop cluster at Use Ctrl-c to quit.
Warning: Permanently added ',' (RSA) to the list of known hosts.
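The cluster configuration printed at launch contains hadoop.socks.server=localhost:6666, so the local Hadoop client expects a SOCKS proxy on that port. A quick way to check whether anything is listening before running Hadoop commands is sketched below. It assumes bash with /dev/tcp support, and the function name port_is_open is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds.
port_is_open() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP connection; the subshell
  # ensures the descriptor is closed again as soon as the check is done.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if port_is_open localhost 6666; then
  echo "something is listening on localhost:6666"
else
  echo "nothing on localhost:6666; start the Whirr proxy script first"
fi
```

This only shows that the port accepts connections, not that the listener is actually the Whirr proxy.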
Hadoop Commands
Error: You get this error if you don't have the proxy running:
linux-z6tw:/> hadoop fs -ls /
11/03/31 01:53:08 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
11/03/31 01:53:10 INFO ipc.Client: Retrying connect to server: Already tried 0 time(s).
11/03/31 01:53:11 INFO ipc.Client: Retrying connect to server: Already tried 1 time(s).
11/03/31 01:53:12 INFO ipc.Client: Retrying connect to server: Already tried 2 time(s).
11/03/31 01:53:13 INFO ipc.Client: Retrying connect to server: Already tried 3 time(s).
11/03/31 01:53:14 INFO ipc.Client: Retrying connect to server: Already tried 4 time(s).
With the proxy running in another terminal, the same command succeeds:
linux-z6tw:~/.whirr/myhadoopcluster> hadoop fs -ls /
Found 4 items
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /hadoop
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /mnt
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /tmp
drwxrwxrwx - hdfs supergroup 0 2011-03-31 01:43 /user
linux-z6tw:~> hadoop fs -mkdir input
linux-z6tw:~> hadoop fs -put /usr/lib/hadoop-0.20/LICENSE.txt input
linux-z6tw:~> hadoop fs -ls /user/sodo/input
Found 1 items
-rw-r--r-- 3 sodo supergroup 13366 2011-03-31 02:02 /user/sodo/input/LICENSE.txt
MapReduce Test
linux-z6tw:~> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-*.jar wordcount input output
11/03/31 02:07:53 INFO input.FileInputFormat: Total input paths to process : 1
11/03/31 02:07:54 INFO mapred.JobClient: Running job: job_201103310543_0001
11/03/31 02:07:55 INFO mapred.JobClient: map 0% reduce 0%
11/03/31 02:08:07 INFO mapred.JobClient: map 100% reduce 0%
11/03/31 02:08:23 INFO mapred.JobClient: map 100% reduce 100%
11/03/31 02:08:27 INFO mapred.JobClient: Job complete: job_201103310543_0001
11/03/31 02:08:27 INFO mapred.JobClient: Counters: 22
11/03/31 02:08:27 INFO mapred.JobClient: Job Counters
11/03/31 02:08:27 INFO mapred.JobClient: Launched reduce tasks=1
11/03/31 02:08:27 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12167
11/03/31 02:08:27 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/03/31 02:08:27 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/03/31 02:08:27 INFO mapred.JobClient: Launched map tasks=1
11/03/31 02:08:27 INFO mapred.JobClient: Data-local map tasks=1
11/03/31 02:08:27 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15843
11/03/31 02:08:27 INFO mapred.JobClient: FileSystemCounters
11/03/31 02:08:27 INFO mapred.JobClient: FILE_BYTES_READ=10206
11/03/31 02:08:27 INFO mapred.JobClient: HDFS_BYTES_READ=13508
11/03/31 02:08:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=114918
11/03/31 02:08:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=7376
11/03/31 02:08:27 INFO mapred.JobClient: Map-Reduce Framework
11/03/31 02:08:27 INFO mapred.JobClient: Reduce input groups=714
11/03/31 02:08:27 INFO mapred.JobClient: Combine output records=714
11/03/31 02:08:27 INFO mapred.JobClient: Map input records=244
11/03/31 02:08:27 INFO mapred.JobClient: Reduce shuffle bytes=10206
11/03/31 02:08:27 INFO mapred.JobClient: Reduce output records=714
11/03/31 02:08:27 INFO mapred.JobClient: Spilled Records=1428
11/03/31 02:08:27 INFO mapred.JobClient: Map output bytes=19699
11/03/31 02:08:27 INFO mapred.JobClient: Combine input records=1887
11/03/31 02:08:27 INFO mapred.JobClient: Map output records=1887
11/03/31 02:08:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=142
11/03/31 02:08:27 INFO mapred.JobClient: Reduce input records=714
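Most of that output is timestamp and logger prefix. To look only at the counters, the "name=value" part can be pulled out with sed. This is a sketch that assumes the line format shown above; the function name job_counters is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: keep only "name=<number>" counter lines from
# JobClient log output on stdin, dropping the timestamp and logger prefix.
job_counters() {
  sed -n 's/^.*INFO mapred\.JobClient: *\(.*=[0-9][0-9]*\)$/\1/p'
}

job_counters <<'EOF'
11/03/31 02:08:27 INFO mapred.JobClient: Map-Reduce Framework
11/03/31 02:08:27 INFO mapred.JobClient: Map input records=244
11/03/31 02:08:27 INFO mapred.JobClient: Map output records=1887
11/03/31 02:08:27 INFO mapred.JobClient: Reduce output records=714
EOF
```

Against a live job, the log would need to be piped into the function, for example by appending "2>&1 | job_counters" to the hadoop jar command, assuming the client writes its log to stderr.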
More Hadoop Commands
linux-z6tw:~> hadoop fs -ls /user/sodo
Found 3 items
drwx------ - sodo supergroup 0 2011-03-31 02:04 /user/sodo/.staging
drwxr-xr-x - sodo supergroup 0 2011-03-31 02:02 /user/sodo/input
drwxrwxrwx - sodo supergroup 0 2011-03-31 02:04 /user/sodo/output
linux-z6tw:~> hadoop fs -ls /user/sodo/output
Found 3 items
-rw-r--r-- 3 sodo supergroup 0 2011-03-31 02:04 /user/sodo/output/_SUCCESS
drwxrwxrwx - sodo supergroup 0 2011-03-31 02:03 /user/sodo/output/_logs
-rw-r--r-- 3 sodo supergroup 7376 2011-03-31 02:04 /user/sodo/output/part-r-00000
linux-z6tw:~> hadoop fs -cat /user/sodo/output/part-* | head
"AS 3
"Contribution" 1
"Contributor" 1
"Derivative 1
"Legal 1
"License" 1
"License"); 1
"Licensor" 1
"Not 1
cat: Unable to write to output stream.
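The closing "Unable to write to output stream" message is harmless: head exits once it has printed its lines, so the cat on the other end of the pipe has nowhere left to write. Because the part file is ordered by word, head mostly shows tokens that start with punctuation. To see the most frequent words instead, the output can be sorted by the count column. The sketch below assumes the default tab-separated "word, count" layout of the wordcount example; top_words is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read "word<TAB>count" lines on stdin and print the
# N lines with the highest counts (N defaults to 10).
top_words() {
  sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-10}"
}

# Small local sample standing in for the real part file.
printf 'the\t3\nlicense\t7\nof\t5\n' | top_words 2
```

On the cluster this would be used as "hadoop fs -cat /user/sodo/output/part-* | top_words".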
Verify SSH connectivity
sodo@linux-z6tw:~/trendingtopics> ssh
Last login: Sun Apr 3 15:35:18 2011 from
__| __|_ ) Amazon Linux AMI
_| ( / Beta
See /usr/share/doc/system-release-2011.02 for latest release notes. :-)
[ec2-user@ip-10-114-102-177 ~]$ exit
Connection to closed
Hadoop Update script
sodo@linux-z6tw:~/trendingtopics> cat ~/
#!/bin/bash -v
whirr version
echo "Launch the cluster..hit ENTER"
read -r
whirr launch-cluster --config
ls ~/.whirr/myhadoopcluster/
echo "Update the local Hadoop configuration to use hadoop-site.xml..hit ENTER"
read -r
sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
sudo rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
sudo cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
sudo /usr/sbin/update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
sudo /usr/sbin/update-alternatives --display hadoop-0.20-conf
echo "Now, run the proxy..hit ENTER"
read -r
. ~/.whirr/myhadoopcluster/
MapReduce Test Script
sfrase@linux-z6tw:~> cat
#!/bin/bash -v
hadoop fs -mkdir input
hadoop fs -put $HADOOP_HOME/LICENSE.txt input
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
hadoop fs -cat output/part-* | head
Hadoop Command Reference
Hadoop Streaming
How MapReduce Works
Wednesday, March 30, 2011
Saturday, March 26, 2011
Cloudera for Hadoop, Beta 3 install and Whirr
I needed a test bed, so I fired up a Micro openSUSE 11.4 instance on EC2, as it is relatively cheap ($0.02 per hour for a Micro instance at the time of this writing). I'll build a VM for this purpose later, but it is late in the evening and I just want to get something working.
Once I had a VM running, I went through the install docs and got both Cloudera for Hadoop Beta 3 and Whirr installed. Finally, I was able to fire up a cluster using Whirr. I describe the Whirr configuration and Hadoop commands, and run a quick MapReduce test, in a follow-up post.
Cloudera for Hadoop Beta 3
Following instructions here..
Whirr Notes
ip-10-212-121-180:~ # cat .bashrc
alias whirr='java -jar /usr/lib/whirr/whirr-cli-0.3.0-CDH3B4.jar'
alias whirr-ec2='whirr --identity=058DSRMMTFQMQRER2 --credential=VHmAq9QzCxzKpxhQxBoA5jOxZksq62jpO5mbD'
ip-10-212-121-180:~ # cat .bash_profile
export WHIRR_HOME=/usr/lib/whirr
export AWS_SECRET_ACCESS_KEY="VHmAq9QzCxzKpxhQxBoA5jOxZks"
Hadoop Properties
ip-10-212-121-180:~ # cat
whirr.instance-templates=1 jt+nn,1 dn+tt
Passphraseless SSH for Localhost
ip-10-212-121-180:~ # ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/
*Make sure you log in over SSH as ec2-user on any Hadoop cluster member.
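Generating the key pair is only half of passphraseless SSH to localhost: the public key also has to be listed in the account's authorized_keys file. A minimal sketch of that step follows. The function name authorize_key is invented, and the paths in the usage comment are assumed OpenSSH defaults rather than something taken from the transcript above.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file
# unless that exact line is already present.
authorize_key() {
  pubkey="$1"
  authfile="$2"
  mkdir -p "$(dirname "$authfile")"
  touch "$authfile"
  # -x matches whole lines, -F treats the key as a fixed string.
  grep -qxF "$(cat "$pubkey")" "$authfile" || cat "$pubkey" >> "$authfile"
  # sshd ignores authorized_keys files with loose permissions.
  chmod 600 "$authfile"
}

# Usage (assumed default paths; adjust if the key was saved elsewhere):
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh localhost true    # should no longer ask for a password
```

Running the function twice leaves a single entry, so it is safe to repeat.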
Launching a Hadoop cluster with Whirr
ip-10-212-121-180:~ # whirr launch-cluster --config
Bootstrapping cluster
Configuring template
Starting 1 node(s) with roles [tt, dn]
Configuring template
Starting 1 node(s) with roles [jt, nn]
Nodes started: [[id=us-east-1/i-2a55fb45, providerId=i-2a55fb45, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-2a1fec43, os=[name=null, family=amzn-linux, version=2011.02.1, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2011.02.1.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[], publicAddresses=[], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Nodes started: [[id=us-east-1/i-c455fbab, providerId=i-c455fbab, tag=myhadoopcluster, name=null, location=[id=us-east-1d, scope=ZONE, description=us-east-1d, parent=us-east-1], uri=null, imageId=us-east-1/ami-2a1fec43, os=[name=null, family=amzn-linux, version=2011.02.1, arch=paravirtual, is64Bit=false, description=amzn-ami-us-east-1/amzn-ami-2011.02.1.i386.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[], publicAddresses=[], hardware=[id=m1.small, providerId=m1.small, name=m1.small, processors=[[cores=1.0, speed=1.0]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=150.0, device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
Authorizing firewall
Running configuration script
Configuration script run completed
Running configuration script
Configuration script run completed
Completed configuration of myhadoopcluster
Web UI available at
Wrote Hadoop site file /root/.whirr/myhadoopcluster/hadoop-site.xml
Wrote Hadoop proxy script /root/.whirr/myhadoopcluster/
Wrote instances file /root/.whirr/myhadoopcluster/instances
Started cluster of 2 instances
Cluster{instances=[Instance{roles=[tt, dn], publicAddress=/, privateAddress=/, id=us-east-1/i-2a55fb45}, Instance{roles=[jt, nn], publicAddress=/, privateAddress=/, id=us-east-1/i-c455fbab}], configuration={hadoop.job.ugi=root,root,, hadoop.socks.server=localhost:6666, fs.s3n.awsAccessKeyId=05TNM8DSRMM, fs.s3.awsSecretAccessKey=VhmAq9QzCxzKpxhQxBoA5jOxZksq62jpO5mbD, fs.s3.awsAccessKeyId=058DSRM,, fs.defa, fs.s3n.awsSecretAccessKey=VhmAq9QzCxzKpxhQxBoA5j
Destroying a Cluster
ip-10-212-121-180:~ # whirr destroy-cluster --config
Destroying myhadoopcluster cluster
Cluster myhadoopcluster destroyed
More Whirr, Hadoop and MapReduce testing here
CDH3 Beta install and test cases
Pseudo-distributed mode and passphraseless SSH (for MapReduce jobs sample)
Hadoop Shell Commands
Friday, March 25, 2011
ec2 ami, ebs and s3 storage
Thursday, March 24, 2011
Data driven website in the AWS cloud..oh boy!
Reading Semil Shah's excellent article on BigData started me thinking that there is a wellspring of data in my weblogs that I'm not taking advantage of. Peter Skomoroch got me psyched to try Hadoop and Cloudera with his Hadoop World talk on rapid prototyping of data intensive web apps. Pete posted a great guide on how to piece together the components for the site. I think it would be a great exercise to:
1) get a version of the site up and running
2) apply that knowledge to big dataset management tasks back at work
Pete's guide is a fabulous open source resource. However, it has been two years since he wrote the application, and many details of how each piece of software works have changed slightly or been deprecated. (Funny how web technology techniques become obsolete in two years!)
Rather than get into the nitty gritty details, I think it would be helpful to take a step back and visualize the architecture that I am trying to replicate as a whole. It is not inconsequential:

With that baseline set, I will delve into the more technical details of the project implementation in my upcoming posts.
Next steps: getting a Cloudera-Hadoop cluster fired up with Whirr and running MapReduce on a dataset.
Amazon Public Data Sets
Wikipedia Traffic Stats and Raw Data
