Wednesday, December 21, 2011

hadoop pseudo mode gotchas

Reviving my TrendingTopics project, I wanted to get the Hadoop pseudo-distributed mode configuration working again. That would allow me to test and run my shell scripts (adapted from Pete Skomoroch's originals found here) locally, rather than having to worry about EC2 config and setup.  I also had a larger problem that impacted my Hadoop setup: some of the subdirectories under /var were missing.  This unwelcome problem led me on a longer troubleshooting journey than I initially expected.

The Hadoop pseudo setup is based on this doc: https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode

In a nutshell:
1) Make sure the Hadoop pseudo-distributed config package is installed:
$ sudo yum install hadoop-0.20-conf-pseudo
2) I had some problems where /var directories important for Hadoop did not exist. So make sure these directories exist and are writable:
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/pids -> /var/run/hadoop-0.20 
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/logs -> /var/log/hadoop-0.20 

$ ll /var/lock/
drwxrwxrwx 2 root root 4096 Dec 22 00:41 subsys

I took the easy way out and just chmod'd them:
$ sudo chmod 777 /var/run/hadoop-0.20 /var/log/hadoop-0.20
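
If the directories are missing outright, a cleaner fix than chmod 777 is to recreate them with ownership the daemons can use. A minimal sketch, assuming the CDH3 daemons run as the hdfs and mapred users in the hadoop group (check your own install):
$ sudo mkdir -p /var/run/hadoop-0.20 /var/log/hadoop-0.20 /var/lock/subsys
$ sudo chgrp hadoop /var/run/hadoop-0.20 /var/log/hadoop-0.20
$ sudo chmod g+w /var/run/hadoop-0.20 /var/log/hadoop-0.20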

After which, you should be able to start the services: 
linux-z6tw:/var/log # for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done 
Starting Hadoop datanode daemon (hadoop-datanode): done 
starting datanode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-linux-z6tw.out 


Starting Hadoop jobtracker daemon (hadoop-jobtracker): done 
starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-linux-z6tw.out 


Starting Hadoop namenode daemon (hadoop-namenode): done 
starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-linux-z6tw.out 


Starting Hadoop secondarynamenode daemon (hadoop-secondarynamenode): done 
starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-linux-z6tw.out 


Starting Hadoop tasktracker daemon (hadoop-tasktracker): done 
starting tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-linux-z6tw.out 

Always look at the logs to make sure all the daemons are working:
sodo@linux-z6tw:/var/log/hadoop-0.20> ls -ltr 
-rw-r--r-- 1 mapred mapred 394764 Dec 27 14:25 hadoop-hadoop-jobtracker-linux-z6tw.log 
-rw-r--r-- 1 hdfs hdfs 789914 Dec 27 14:25 hadoop-hadoop-namenode-linux-z6tw.log 
-rw-r--r-- 1 hdfs hdfs 536726 Dec 27 14:25 hadoop-hadoop-datanode-linux-z6tw.log 
-rw-r--r-- 1 mapred mapred 2524526 Dec 27 14:25 hadoop-hadoop-tasktracker-linux-z6tw.log
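
A quick sanity check (assuming the JDK's jps tool is on your path) is to confirm all five daemons are actually running and to scan the logs for exceptions:
$ sudo jps
$ grep -iE "error|exception" /var/log/hadoop-0.20/*.log | tail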

Also, view the name node and job tracker status interfaces as outlined here: http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop#running-the-hadoop-job

One of them being the jobtracker:
http://localhost:50030/jobtracker.jsp 
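
The name node has a similar status page on its default web UI port (50070 is the stock 0.20 default; adjust if you've changed dfs.http.address):
http://localhost:50070/dfshealth.jsp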


Give 'er a test 
Once the daemons are running properly, test with the ol' pi calculation example: 
sodo@linux-z6tw:/> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000 
Number of Maps = 2 
Samples per Map = 100000 
Wrote input for Map #0 
Wrote input for Map #1 
Starting Job 
11/12/22 02:14:57 INFO mapred.FileInputFormat: Total input paths to process : 2 
11/12/22 02:14:58 INFO mapred.JobClient: Running job: job_201112220213_0002 
11/12/22 02:14:59 INFO mapred.JobClient: map 0% reduce 0% 
11/12/22 02:15:05 INFO mapred.JobClient: map 100% reduce 0% 
11/12/22 02:15:13 INFO mapred.JobClient: map 100% reduce 33% 
11/12/22 02:15:15 INFO mapred.JobClient: map 100% reduce 100% 
11/12/22 02:15:16 INFO mapred.JobClient: Job complete: job_201112220213_0002 
11/12/22 02:15:16 INFO mapred.JobClient: Counters: 23 
11/12/22 02:15:16 INFO mapred.JobClient: Job Counters 
11/12/22 02:15:16 INFO mapred.JobClient: Launched reduce tasks=1 
11/12/22 02:15:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9121 
11/12/22 02:15:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 
11/12/22 02:15:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/22 02:15:16 INFO mapred.JobClient: Launched map tasks=2 
11/12/22 02:15:16 INFO mapred.JobClient: Data-local map tasks=2 
11/12/22 02:15:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8995 
11/12/22 02:15:16 INFO mapred.JobClient: FileSystemCounters 
11/12/22 02:15:16 INFO mapred.JobClient: FILE_BYTES_READ=50 
11/12/22 02:15:16 INFO mapred.JobClient: HDFS_BYTES_READ=472 
11/12/22 02:15:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=156541 
11/12/22 02:15:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215 
11/12/22 02:15:16 INFO mapred.JobClient: Map-Reduce Framework 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce input groups=2 
11/12/22 02:15:16 INFO mapred.JobClient: Combine output records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Map input records=2 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce shuffle bytes=56 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce output records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Spilled Records=8 
11/12/22 02:15:16 INFO mapred.JobClient: Map output bytes=36 
11/12/22 02:15:16 INFO mapred.JobClient: Map input bytes=48 
11/12/22 02:15:16 INFO mapred.JobClient: Combine input records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Map output records=4 
11/12/22 02:15:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=236 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce input records=4 
Job Finished in 18.777 seconds 
Estimated value of Pi is 3.14118000000000000000 

Errors
1) Hadoop may hang if you have an incorrect /etc/hosts entry
Since I didn't have a DHCP reservation for my machine's IP, the IP address changed and the name node was sending packets out my gateway.  Hardcoding an /etc/hosts entry fixed this.
http://getsatisfaction.com/cloudera/topics/hadoop_setup_example_job_hangs_in_reduce_task_getimage_failed_java_io_ioexception_content_length_header_is_not_provided_by-1m4p8b
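
A minimal /etc/hosts entry for this setup might look like the following (the IP is hypothetical; use your machine's actual LAN address and hostname):
192.168.1.50   linux-z6tw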

2) 11/05/02 23:59:47 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /blah/blah could only be replicated to 0 nodes
Stupidly, I built my root filesystem with only 8GB of space, so the data node ran out of room as soon as it tried to run any Hadoop job, which produced the error above.

The Hadoop dfsadmin utility is good for diagnosing issues like the one above:
sodo@linux-z6tw:/var/log/hadoop-0.20> hadoop dfsadmin -report
Configured Capacity: 33316270080 (31.03 GB)
Present Capacity: 26056040448 (24.27 GB)
DFS Remaining: 21374869504 (19.91 GB)
DFS Used: 4681170944 (4.36 GB)
DFS Used%: 17.97%
Under replicated blocks: 9
Blocks with corrupt replicas: 0
Missing blocks: 6
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 33316270080 (31.03 GB)
DFS Used: 4681170944 (4.36 GB)
Non DFS Used: 7260229632 (6.76 GB)
DFS Remaining: 21374869504(19.91 GB)
DFS Used%: 14.05%
DFS Remaining%: 64.16%
Last contact: Tue Dec 27 15:11:27 EST 2011

The resolution was to:
1) add a new filesystem to my virtual machine
2) move the data node's /tmp directory to the larger filesystem (see the sketch after this list)
3) move my mysql installation to the new filesystem as well (nice instructions for that here: http://kaliphonia.com/content/linux/how-to-move-mysql-datadir-to-another-drive)
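
A sketch of step 2, assuming the data node keeps its blocks under hadoop.tmp.dir (the 0.20 default when dfs.data.dir isn't set explicitly) and that your config lives in /etc/hadoop-0.20/conf; adjust names and paths to match your install. In core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <!-- hypothetical mount point on the new, larger filesystem -->
  <value>/data1/hadoop-tmp</value>
</property>

Stop the daemons before the move, copy the existing cache directory over (or reformat the name node if you don't mind losing HDFS contents), then start them back up.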

3) name node cannot start due to permissions
I moved my /tmp directory to a new filesystem, and the name node did not have the permissions it needed to write to the new temp directory. So I set perms like so:
$ chmod 777 /tmp
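
Note that /tmp conventionally carries the sticky bit as well, so the safer equivalent is:
$ sudo chmod 1777 /tmp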

4) ERROR 1148 (42000) at line 40: The used command is not allowed with this MySQL version
Security issue with local data loads:
http://dev.mysql.com/doc/refman/5.0/en/load-data-local.html

You must use "--local-infile" as a parameter to the mysql command line like so:
sodo@linux-z6tw:~/trendingtopics/lib/sql> mysql -u user trendingtopics_development < loadSampleData.sql --local-infile
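
If you'd rather not pass the flag every time, local data loads can also be enabled in my.cnf (a sketch, assuming you control both the client and server config):

[mysql]
local-infile=1

[mysqld]
local-infile=1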

5) "Cannot delete ... Name node is in safe mode"
The name node must leave safe mode before HDFS files can be deleted.
https://issues.apache.org/jira/browse/HADOOP-5937
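
The name node normally leaves safe mode on its own once enough blocks have reported in; if it doesn't, you can force it out manually:
$ hadoop dfsadmin -safemode leave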

References
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/single_node_setup.html#PseudoDistributed
https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode
https://cwiki.apache.org/confluence/display/WHIRR/Quick+Start+Guide
Map Reduce Tutorial
"could only be replicated to 0 nodes" FAQ
HDFS Basics for Developers
