My next goal was to get DataWrangling's update scripts working. Because of my lack of knowledge about Hadoop and Hive, I encountered a LOT of errors, made a lot of mistakes, and lost many hours normally devoted to sleep. Ah well... what's a techie supposed to do? Of course, encountering problems is a great way to learn, and I had problems aplenty. It was well worth the effort.
I first installed Cloudera Hadoop in pseudo-distributed mode. This was great for testing out DW's map/reduce and Python scripts BEFORE firing up multiple EC2 instances that will cost you money. Ahem.
I spent about a week's worth of nights figuring out the scripts. I rewrote them to make them easier to use. Once I get a bit of time, I will post them in a more readable format. Once I got those scripts working, I updated my hadoop.properties file in order to munge through the big dataset on EC2.
INSERT AMAZON EC2 EXPERIENCE HERE
I spent too much money on what I should have been doing in my unit test environment first.
*** end EXPERIENCE ***
The only trouble I had was that once I was finished with the EC2 configuration, I didn't know how to revert to the pseudo-distributed config. Eventually I'll figure this out, but for now I had to reinstall Hadoop to clear out the EC2 config and get back to pseudo mode.
Below I list the things I learned this week.
Pseudo-distributed mode
The quickest way to reset the Hadoop configuration in order to get back to pseudo mode from EC2 mode was to reinstall Hadoop.
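If you go the reinstall route, it boils down to something like the following. This is only a sketch: I'm assuming the Cloudera CDH3 packages here, and the exact package names and package manager depend on your distro.
# Assumption: CDH3 packages on a zypper-based box; the conf-pseudo package carries the stock pseudo-distributed config
zypper remove hadoop-0.20 hadoop-0.20-conf-pseudo
zypper install hadoop-0.20 hadoop-0.20-conf-pseudo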
Make sure your environment is set up properly!
PATH=$PATH:/usr/local/apache-maven-3.0.3/bin
export PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
#export HADOOP_ROOT_LOGGER=DEBUG,console
export HIVE_HOME=/usr/lib/hive
export MYBUCKET=trendingtopics
export MYSERVER=linux-z6tw
export MAILTO=cacasododom@gmail.com
export JAVA_HOME=/usr/java/jdk1.6.0_24/
export AWS_ACCESS_KEY_ID="DSRMMT"
export AWS_SECRET_ACCESS_KEY="zKpxhQxBoA5jOxZk"
In my scripts, I had to add the AWS access ID and secret key to the URL for S3 access:
s3n://accessid:secretkey@$MYBUCKET/...
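A quick way to sanity-check the credentials from the shell is to list the bucket with the same style of URL (a sketch; note that a secret key containing slashes will break the URL unless it's escaped):
# Hypothetical smoke test: list the bucket using credentials embedded in the s3n URL
hadoop fs -ls "s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@$MYBUCKET/"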
Debugging
Use the Hive logs to tell you what the hell is going wrong with Hive:
linux-z6tw:/var/lib/hive # ll /tmp/root/hive.log
-rw-r--r-- 1 root root 503689 Apr 8 15:23 /tmp/root/hive.log
linux-z6tw:/var/lib/hive # date
Fri Apr 8 15:28:37 EDT 2011
linux-z6tw:/var/lib/hive # tail /tmp/root/hive.log
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Enable Hadoop debug logging to see streaming errors
HADOOP_ROOT_LOGGER=DEBUG,console
Like so:
11/04/08 16:42:06 DEBUG streaming.StreamJob: Error in streaming job
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password
By default, Hadoop Streaming does not spit out very clear errors. Once debug mode is enabled, the message that pops out of hadoop-streaming is fairly self-explanatory.
Errors
FAILED: Error in semantic analysis: line 3:17 Invalid Path 'outputStage1': source contains directory: hdfs://ec2-184-73-137-122.compute-1.amazonaws.com/user/root/outputStage1/_logs
You must delete the _logs directory before Hive will load from that path.
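Something like this clears it out; the path comes straight from the error above:
hadoop fs -rmr /user/root/outputStage1/_logs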
ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved
Corrupt Hive Metastore?
Maybe move or delete the metastore to resolve mystery failures; Hive will recreate an empty one the next time it runs (you'll lose your table definitions, though).
linux-z6tw:~ # ll /var/lib/hive/metastore/
total 16
-rw-r--r-- 1 root root 354 Apr 8 15:52 derby.log
drwxr-xr-x 5 root root 4096 Apr 8 15:58 metastore_db
drwxr-xr-x 5 root root 4096 Apr 8 15:51 metastore_dbBACKUP
drwxrwxrwt 3 root root 4096 Apr 8 14:38 scripts
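The metastore_dbBACKUP directory in that listing came from a move along these lines (a sketch; Hive builds a fresh metastore_db on its next run):
# Hypothetical: stash the possibly-corrupt Derby metastore and let Hive create a new one
cd /var/lib/hive/metastore
mv metastore_db metastore_dbBACKUP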
Make sure you don't have another Hive CLI session open...
Don't have two Hive command line interfaces (CLI) up when trying to add/drop/delete tables; the default embedded Derby metastore only allows one connection at a time, so you'll get strange failures like these:
"DROPing redirect_table"
FAILED: Error in semantic analysis: Table not found redirect_table
FAILED: Error in semantic analysis: Unable to fetch table daily_trends
For LOAD DATA LOCAL INFILE calls, use the --local-infile command line argument:
mysql -u user -p --local-infile
Otherwise, you'd get:
ERROR 1148 (42000): The used command is not allowed with this MySQL version
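For reference, here's the shape of the load that needs that flag (a sketch: the database name, file path, and delimiter are placeholders, and daily_trends is just borrowed from the table names above):
# Hypothetical load; --local-infile enables LOCAL INFILE support on the client side
mysql -u user -p --local-infile -e \
  "LOAD DATA LOCAL INFILE '/tmp/daily_trends.txt' INTO TABLE trendingtopics.daily_trends FIELDS TERMINATED BY '\t'"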
DataWrangling's reducer2 seems to need the exact number of logfiles you're munging, passed as its final argument:
-reducer "daily_timelines.py reducer2 1" \
Reference
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
Running a multi-node Hadoop cluster
Cloudera Pseudo-Distributed Mode