Friday, April 1, 2011

Cloudera for Hadoop..the errors

Following DataWrangling's prerequisites for the TrendingTopics data-driven website, I got myself an Amazon EC2 account. After mapping the virtual machine images Amazon offers against the Cloudera Quick Start Guide, I decided to first set up a virtual machine instance of SUSE Linux Enterprise Server 11, 64-bit, in VMware Player. I did this so that I wouldn't waste any money on EC2 if I ran into installation glitches. Which, of course, I did.

SUSE Linux Enterprise is a very nice distribution..I hadn't used it before. Novell has kept the desktop very clean, as opposed to OpenSuSE's choice of KDE for 11.4..ugh! YaST is a nicely integrated system management tool. The installation into VMware Player was also quick..about 15 minutes from the install DVD. There are two ISO DVDs, but the base install required only DVD1. The Sun Java JDK is a prerequisite here too, so I downloaded and installed that.

There are a few tasks to accomplish (sketched as commands below):
1) install the Cloudera repository
2) install Hadoop in pseudo-distributed mode (hadoop-0.20-conf-pseudo) + hadoop-hive
3) start the Hadoop services
4) test
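
Roughly, on SLES with CDH3, those four steps look like this (a sketch, not verbatim from my session; package and service names may vary by CDH release):
sudo zypper addrepo -f http://archive.cloudera.com/sles/11/x86_64/cdh/cloudera-cdh3.repo
sudo zypper refresh
sudo zypper install hadoop-0.20-conf-pseudo hadoop-hive
for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc start; done
hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000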

I got hung up in a few places:
1) install the RPM, not the .bin. On SUSE Enterprise Linux 64-bit, it'd be:
sudo rpm -Uvh ./jdk-6u24-linux-amd64.rpm

2) the Cloudera documentation sends you to the wrong location for the repository file for SUSE 64-bit. It should be:
http://archive.cloudera.com/sles/11/x86_64/cdh/cloudera-cdh3.repo
(also, Cloudera's ami-8759bfee is out of date; its Ubuntu package sources have moved to old-releases.ubuntu.com)

3) the namenode service crapped out:
metrics.MetricsUtil: Unable to obtain hostName

This was because the hostname of my local machine did not resolve. I added an entry to /etc/hosts to fix it.
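
For example, something like this in /etc/hosts (the exact layout is a sketch; SUSE conventionally maps the local hostname to 127.0.0.2, and linux-8u67 is this box's hostname):
127.0.0.1   localhost
127.0.0.2   linux-8u67.site linux-8u67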

4) the test jar for calculating the value of pi failed:
sodo@linux-8u67:/var/log/hadoop-0.20> hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
11/03/24 16:28:49 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8021. Already tried 0 time(s).


For some reason, the MapReduce jobtracker service was not up and listening on port 8021. This turned out to be related to the timing of when the other Hadoop services start. If I restarted the Hadoop jobtracker service on its own, it came up just fine:
sodo@linux-8u67:/etc/hadoop/conf> sudo /etc/init.d/hadoop-0.20-jobtracker restart
Stopping Hadoop jobtracker daemon (hadoop-jobtracker): done
no jobtracker to stop

Starting Hadoop jobtracker daemon (hadoop-jobtracker): done
starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-linux-8u67.out
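
A quick sanity check before submitting a job is to make sure something is actually listening on the jobtracker's default port (netstat flags assume the standard SLES net-tools):
sudo netstat -tlnp | grep 8021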


5) after the package installs:
insserv: Script jexec is broken: incomplete LSB comment

This is a Sun Java packaging problem. Workaround here: https://bugzilla.novell.com/show_bug.cgi?id=504596#c14
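
If I remember the bug report right, the workaround boils down to giving /etc/init.d/jexec the LSB header that insserv expects, something along these lines (a sketch only; see the link for the canonical version):
### BEGIN INIT INFO
# Provides:          jexec
# Required-Start:    $local_fs
# Required-Stop:     $local_fs
# Default-Start:     1 2 3 5
# Default-Stop:
# Description:       Supports direct execution of binary formats
### END INIT INFO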

6) insserv: FATAL: service syslog is missed in the runlevels 4 to use service hadoop-0.20-*

Workaround here or here; the gist is sketched below.
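
The gist: syslog isn't registered in runlevel 4 on SLES, so one fix is to drop runlevel 4 from the Default-Start line of each Hadoop init script and re-register them (a sketch; eyeball the headers yourself before running sed on them):
sudo sed -i '/^# Default-Start:/s/ 4//' /etc/init.d/hadoop-0.20-*
sudo insserv /etc/init.d/hadoop-0.20-*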

7) FAILED: Unknown exception : org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hive-hadoop. Name node is in safe mode

I had not waited long enough after the Hadoop services started. I gave them a couple of minutes to cook, and then my job ran.
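
Rather than guessing at the timing, you can ask the namenode directly (standard Hadoop 0.20 dfsadmin subcommands):
hadoop dfsadmin -safemode get
hadoop dfsadmin -safemode wait
The first reports whether safe mode is ON or OFF; the second blocks until the namenode leaves safe mode.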


8) hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available

I had this Java stack installed:
linux-z6tw:/> java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (suse-1.2.1-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)


I yanked out the java-1_6_0-openjdk* packages (note that zypper pulled in java-1_6_0-sun as the replacement):
linux-z6tw:~> sudo zypper remove java-1_6_0-openjdk*
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following NEW package is going to be installed:
java-1_6_0-sun

The following packages are going to be REMOVED:
java-1_6_0-openjdk java-1_6_0-openjdk-plugin

Retrieving package java-1_6_0-sun-1.6.0.u23-3.3.x86_64 (1/1), 20.8 MiB (88.6 MiB unpacked)
Retrieving: java-1_6_0-sun-1.6.0.u23-3.3.x86_64.rpm [done (1.6 MiB/s)]
Removing java-1_6_0-openjdk-plugin-1.6.0.0_b20.1.9.7-1.2.1 [done]
Removing java-1_6_0-openjdk-1.6.0.0_b20.1.9.7-1.2.1 [done]


So that only the Sun JDK was left:
linux-z6tw:~> java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)
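
One thing worth double-checking after a JDK swap: Hadoop picks up JAVA_HOME from /etc/hadoop/conf/hadoop-env.sh, so make sure it points at the Sun JDK. The path below is an assumption; verify the real one with readlink:
readlink -f $(which java)
# then, in /etc/hadoop/conf/hadoop-env.sh (path is a guess for the SLES package):
export JAVA_HOME=/usr/lib64/jvm/java-1_6_0-sun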


And Hadoop is now running MapReduce jobs!
sodo@linux-8u67:~/Desktop> hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
11/03/24 17:43:13 INFO mapred.FileInputFormat: Total input paths to process : 2
11/03/24 17:43:14 INFO mapred.JobClient: Running job: job_201103241738_0001
11/03/24 17:43:15 INFO mapred.JobClient: map 0% reduce 0%
11/03/24 17:43:23 INFO mapred.JobClient: map 100% reduce 0%
11/03/24 17:43:31 INFO mapred.JobClient: map 100% reduce 100%
11/03/24 17:43:31 INFO mapred.JobClient: Job complete: job_201103241738_0001
11/03/24 17:43:31 INFO mapred.JobClient: Counters: 23
11/03/24 17:43:31 INFO mapred.JobClient: Job Counters
11/03/24 17:43:31 INFO mapred.JobClient: Launched reduce tasks=1
11/03/24 17:43:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11869
11/03/24 17:43:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/03/24 17:43:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/03/24 17:43:31 INFO mapred.JobClient: Launched map tasks=2
11/03/24 17:43:31 INFO mapred.JobClient: Data-local map tasks=2
11/03/24 17:43:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8403
11/03/24 17:43:31 INFO mapred.JobClient: FileSystemCounters
11/03/24 17:43:31 INFO mapred.JobClient: FILE_BYTES_READ=50
11/03/24 17:43:31 INFO mapred.JobClient: HDFS_BYTES_READ=472
11/03/24 17:43:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=156550
11/03/24 17:43:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
11/03/24 17:43:31 INFO mapred.JobClient: Map-Reduce Framework
11/03/24 17:43:31 INFO mapred.JobClient: Reduce input groups=2
11/03/24 17:43:31 INFO mapred.JobClient: Combine output records=0
11/03/24 17:43:31 INFO mapred.JobClient: Map input records=2
11/03/24 17:43:31 INFO mapred.JobClient: Reduce shuffle bytes=28
11/03/24 17:43:31 INFO mapred.JobClient: Reduce output records=0
11/03/24 17:43:31 INFO mapred.JobClient: Spilled Records=8
11/03/24 17:43:31 INFO mapred.JobClient: Map output bytes=36
11/03/24 17:43:31 INFO mapred.JobClient: Map input bytes=48
11/03/24 17:43:31 INFO mapred.JobClient: Combine input records=0
11/03/24 17:43:31 INFO mapred.JobClient: Map output records=4
11/03/24 17:43:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=236
11/03/24 17:43:31 INFO mapred.JobClient: Reduce input records=4
Job Finished in 18.177 seconds
Estimated value of Pi is 3.14118000000000000000


Neato. More to come..

References
Hadoop Default Ports
Unable to Obtain HostName error
Cloudera Documentation
Official OpenSuSE Documentation
Unofficial OpenSuSE Documentation
OpenSuSE Package Search
AddRepos
http://download.opensuse.org/distribution/11.1/repo/non-oss/
http://download.opensuse.org/distribution/11.1/repo/oss/
