This Hadoop pseudo-distributed setup is based on this doc: https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode
In a nutshell...
1) Make sure the Hadoop pseudo-distributed config package is installed:
$ sudo yum install hadoop-0.20-conf-pseudo
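To confirm the package landed and see which config files it drops, rpm can list its contents (assuming an RPM-based distro, which yum implies):
# list the files installed by the pseudo-distributed config package
$ rpm -ql hadoop-0.20-conf-pseudo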
2) I had some problems where /var directories important for Hadoop did not exist. So make sure these directories exist and are writable (the package symlinks point at them):
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/pids -> /var/run/hadoop-0.20
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/logs -> /var/log/hadoop-0.20
$ ll /var/lock/subsys/
drwxrwxrwx 2 root root 4096 Dec 22 00:41 subsys
I took the easy way out and just chmod'd them:
$ sudo chmod 777 /var/run/hadoop-0.20 /var/log/hadoop-0.20
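If you'd rather not go full 777, something along these lines should also work (this assumes the hadoop group that the CDH packages create for the hdfs and mapred users; adjust to your install):
# a less permissive alternative to 777
$ sudo mkdir -p /var/run/hadoop-0.20 /var/log/hadoop-0.20 /var/lock/subsys
$ sudo chgrp hadoop /var/run/hadoop-0.20 /var/log/hadoop-0.20
$ sudo chmod 775 /var/run/hadoop-0.20 /var/log/hadoop-0.20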
After which, you should be able to start the services:
linux-z6tw:/var/log # for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
Starting Hadoop datanode daemon (hadoop-datanode): done
starting datanode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-linux-z6tw.out
Starting Hadoop jobtracker daemon (hadoop-jobtracker): done
starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-linux-z6tw.out
Starting Hadoop namenode daemon (hadoop-namenode): done
starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-linux-z6tw.out
Starting Hadoop secondarynamenode daemon (hadoop-secondarynamenode): done
starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-linux-z6tw.out
Starting Hadoop tasktracker daemon (hadoop-tasktracker): done
starting tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-linux-z6tw.out
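To double-check that the daemons stayed up, the same init scripts should also accept a status argument:
# ask each Hadoop init script whether its daemon is still running
$ for service in /etc/init.d/hadoop-0.20-*; do sudo $service status; done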
Always look at the logs to make sure all the daemons are working:
sodo@linux-z6tw:/var/log/hadoop-0.20> ls -ltr
-rw-r--r-- 1 mapred mapred 394764 Dec 27 14:25 hadoop-hadoop-jobtracker-linux-z6tw.log
-rw-r--r-- 1 hdfs hdfs 789914 Dec 27 14:25 hadoop-hadoop-namenode-linux-z6tw.log
-rw-r--r-- 1 hdfs hdfs 536726 Dec 27 14:25 hadoop-hadoop-datanode-linux-z6tw.log
-rw-r--r-- 1 mapred mapred 2524526 Dec 27 14:25 hadoop-hadoop-tasktracker-linux-z6tw.log
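Rather than eyeballing each file, a quick grep across the logs will surface anything serious:
# scan all Hadoop daemon logs for serious problems, newest hits last
$ grep -iE 'ERROR|FATAL' /var/log/hadoop-0.20/*.log | tail -20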
Also, view the name node and job tracker status interfaces as outlined here: http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop#running-the-hadoop-job
One of them is the jobtracker status page:
http://
A quick sanity check is to run the example pi job that ships with Hadoop; it should finish with a line like:
Estimated value of Pi is 3.14118000000000000000
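For reference, a run like the one that produced that estimate can be kicked off with the bundled examples jar (the jar path below is an assumption based on where CDH3 installs it; adjust if yours differs):
# run the pi estimator example: 10 maps, 1000 samples per map
$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 10 1000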
Errors
1) Hadoop may hang if you have an incorrect /etc/hosts entry
Since I didn't have a DHCP reservation for my machine's IP, the IP address changed and the name node was sending packets out my gateway. Hardcoding an /etc/hosts entry fixed this.
http://getsatisfaction.com/cloudera/topics/hadoop_setup_example_job_hangs_in_reduce_task_getimage_failed_java_io_ioexception_content_length_header_is_not_provided_by-1m4p8b
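The entry itself is nothing fancy; the address below is a placeholder for your machine's real LAN IP:
# /etc/hosts -- placeholder values, substitute your own IP and hostname
127.0.0.1      localhost
192.168.1.50   linux-z6tw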
2) 11/05/02 23:59:47 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /blah/blah could only be replicated to 0 nodes
Stupidly, I built my root filesystem with only 8GB of space, so the datanode ran out of room whenever it tried to run a Hadoop job; that is when the error above appeared.
The Hadoop dfsadmin utility is good for diagnosing issues like this:
sodo@linux-z6tw:/var/log/hadoop-0.20> hadoop dfsadmin -report
Configured Capacity: 33316270080 (31.03 GB)
Present Capacity: 26056040448 (24.27 GB)
DFS Remaining: 21374869504 (19.91 GB)
DFS Used: 4681170944 (4.36 GB)
DFS Used%: 17.97%
Under replicated blocks: 9
Blocks with corrupt replicas: 0
Missing blocks: 6
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 33316270080 (31.03 GB)
DFS Used: 4681170944 (4.36 GB)
Non DFS Used: 7260229632 (6.76 GB)
DFS Remaining: 21374869504(19.91 GB)
DFS Used%: 14.05%
DFS Remaining%: 64.16%
Last contact: Tue Dec 27 15:11:27 EST 2011
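Since HDFS blocks live on the local filesystem, pairing the report with plain df shows which mount is actually running out of room:
# compare local disk usage against what dfsadmin reports
$ df -h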
The resolution was to:
1) add a new filesystem to my virtual machine
2) move the datanode /tmp directory to the larger filesystem (see the core-site.xml sketch below)
3) move my MySQL installation to the new filesystem as well (nice instructions for that here: http://kaliphonia.com/content/linux/how-to-move-mysql-datadir-to-another-drive)
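For step 2, one way to relocate the datanode's storage is to point hadoop.tmp.dir at the new filesystem. This is only a sketch: the property is standard, but /data/hadoop is a made-up mount point.
<!-- /etc/hadoop-0.20/conf/core-site.xml (sketch) -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/cache/${user.name}</value>
</property>
Restart the daemons after changing it so the datanode picks up the new location.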
3) Name node cannot start due to permissions
After moving /tmp to the new filesystem, Hadoop did not have permission to write to the new temp directory, so I set the permissions like so (1777 rather than 777 keeps the sticky bit that /tmp normally carries):
$ sudo chmod 1777 /tmp
4) ERROR 1148 (42000) at line 40: The used command is not allowed with this MySQL version
Security issue with local data loads:
http://dev.mysql.com/doc/refman/5.0/en/load-data-local.html
You must pass "--local-infile" on the mysql command line, like so:
sodo@linux-z6tw:~/trendingtopics/lib/sql> mysql -u user trendingtopics_development < loadSampleData.sql --local-infile
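If you don't want to pass the flag on every invocation, local-infile can be enabled in the MySQL config instead (the my.cnf location varies by distro):
# /etc/my.cnf -- enable LOCAL INFILE for both server and client
[mysqld]
local-infile=1

[mysql]
local-infile=1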
Finally, Hadoop safe mode must be disabled before HDFS will accept writes:
https://issues.apache.org/jira/browse/HADOOP-5937
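If the namenode is sitting in safe mode, dfsadmin can check and clear it:
# check whether the namenode is in safe mode
$ hadoop dfsadmin -safemode get
# force it out (only after confirming HDFS is healthy)
$ hadoop dfsadmin -safemode leave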
References
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/single_node_setup.html#PseudoDistributed
https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode
https://cwiki.apache.org/confluence/display/WHIRR/Quick+Start+Guide
Map Reduce Tutorial
"could only be replicated to 0 nodes" FAQ
HDFS Basics for Developers