Wednesday, December 21, 2011

hadoop pseudo mode gotchas

Reviving my TrendingTopics project, I wanted to get the Hadoop pseudo-distributed configuration working again. That would allow me to test and run my shell scripts (adapted from Pete Skomoroch's originals found here) locally, rather than having to worry about EC2 config and setup. I also had a larger problem that impacted my Hadoop setup: some of the subdirectories under /var were missing. This unwelcome problem led me on a longer troubleshooting journey than I initially expected.

The Hadoop pseudo setup is based on this doc: https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode

In a nutshell..
1) Make sure the Hadoop pseudo-distributed config package is installed (on a yum-based system: $ sudo yum install hadoop-0.20-conf-pseudo)
2) I had some problems where /var directories important for Hadoop did not exist. So make sure these directories exist and are writable:
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/pids -> /var/run/hadoop-0.20
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/logs -> /var/log/hadoop-0.20

$ ll /var/lock/
drwxrwxrwx 2 root root 4096 Dec 22 00:41 subsys

I took the easy way out and just chmod'd them:
$ sudo chmod 777 /var/run/hadoop-0.20 /var/log/hadoop-0.20
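
A tidier alternative (untested on my box) is to recreate anything that's missing with sane ownership instead of opening the permissions wide. This sketch assumes the CDH packages created a 'hadoop' group; check /etc/group first:
$ sudo mkdir -p /var/run/hadoop-0.20 /var/log/hadoop-0.20 /var/lock/subsys
$ sudo chgrp hadoop /var/run/hadoop-0.20 /var/log/hadoop-0.20   # 'hadoop' group is an assumption
$ sudo chmod 775 /var/run/hadoop-0.20 /var/log/hadoop-0.20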

After which, you should be able to start the services: 
linux-z6tw:/var/log # for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done 
Starting Hadoop datanode daemon (hadoop-datanode): done 
starting datanode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-linux-z6tw.out 


Starting Hadoop jobtracker daemon (hadoop-jobtracker): done 
starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-linux-z6tw.out 


Starting Hadoop namenode daemon (hadoop-namenode): done 
starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-linux-z6tw.out 


Starting Hadoop secondarynamenode daemon (hadoop-secondarynamenode): done 
starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-linux-z6tw.out 


Starting Hadoop tasktracker daemon (hadoop-tasktracker): done 
starting tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-linux-z6tw.out 

Always look at the logs to make sure all the daemons are working:
sodo@linux-z6tw:/var/log/hadoop-0.20> ls -ltr 
-rw-r--r-- 1 mapred mapred 394764 Dec 27 14:25 hadoop-hadoop-jobtracker-linux-z6tw.log 
-rw-r--r-- 1 hdfs hdfs 789914 Dec 27 14:25 hadoop-hadoop-namenode-linux-z6tw.log 
-rw-r--r-- 1 hdfs hdfs 536726 Dec 27 14:25 hadoop-hadoop-datanode-linux-z6tw.log 
-rw-r--r-- 1 mapred mapred 2524526 Dec 27 14:25 hadoop-hadoop-tasktracker-linux-z6tw.log

Also, view the name node and job tracker status interfaces as outlined here: http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop#running-the-hadoop-job

One of them is the jobtracker status page:
http://<hostname>:50030/jobtracker.jsp 
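
The namenode has a similar status page on port 50070 (the default for Hadoop 0.20):
http://<hostname>:50070/dfshealth.jsp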


Give 'er a test 
Once the daemons are running properly, test with the ol' pi calculation example: 
sodo@linux-z6tw:/> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000 
Number of Maps = 2 
Samples per Map = 100000 
Wrote input for Map #0 
Wrote input for Map #1 
Starting Job 
11/12/22 02:14:57 INFO mapred.FileInputFormat: Total input paths to process : 2 
11/12/22 02:14:58 INFO mapred.JobClient: Running job: job_201112220213_0002 
11/12/22 02:14:59 INFO mapred.JobClient: map 0% reduce 0% 
11/12/22 02:15:05 INFO mapred.JobClient: map 100% reduce 0% 
11/12/22 02:15:13 INFO mapred.JobClient: map 100% reduce 33% 
11/12/22 02:15:15 INFO mapred.JobClient: map 100% reduce 100% 
11/12/22 02:15:16 INFO mapred.JobClient: Job complete: job_201112220213_0002 
11/12/22 02:15:16 INFO mapred.JobClient: Counters: 23 
11/12/22 02:15:16 INFO mapred.JobClient: Job Counters 
11/12/22 02:15:16 INFO mapred.JobClient: Launched reduce tasks=1 
11/12/22 02:15:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9121 
11/12/22 02:15:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 
11/12/22 02:15:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/22 02:15:16 INFO mapred.JobClient: Launched map tasks=2 
11/12/22 02:15:16 INFO mapred.JobClient: Data-local map tasks=2 
11/12/22 02:15:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8995 
11/12/22 02:15:16 INFO mapred.JobClient: FileSystemCounters 
11/12/22 02:15:16 INFO mapred.JobClient: FILE_BYTES_READ=50 
11/12/22 02:15:16 INFO mapred.JobClient: HDFS_BYTES_READ=472 
11/12/22 02:15:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=156541 
11/12/22 02:15:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215 
11/12/22 02:15:16 INFO mapred.JobClient: Map-Reduce Framework 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce input groups=2 
11/12/22 02:15:16 INFO mapred.JobClient: Combine output records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Map input records=2 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce shuffle bytes=56 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce output records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Spilled Records=8 
11/12/22 02:15:16 INFO mapred.JobClient: Map output bytes=36 
11/12/22 02:15:16 INFO mapred.JobClient: Map input bytes=48 
11/12/22 02:15:16 INFO mapred.JobClient: Combine input records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Map output records=4 
11/12/22 02:15:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=236 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce input records=4 
Job Finished in 18.777 seconds 
Estimated value of Pi is 3.14118000000000000000 

Errors
1) Hadoop may hang if you have an incorrect /etc/hosts entry
Since I didn't have a DHCP reservation for my machine, its IP address changed and the namenode ended up sending packets out my gateway.  Hardcoding an /etc/hosts entry fixed this.
http://getsatisfaction.com/cloudera/topics/hadoop_setup_example_job_hangs_in_reduce_task_getimage_failed_java_io_ioexception_content_length_header_is_not_provided_by-1m4p8b
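
For illustration, the entry just pins the machine's name to a fixed address (the IP and hostname below are examples; use your own):
192.168.1.50   linux-z6tw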

2) 11/05/02 23:59:47 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /blah/blah could only be replicated to 0 nodes
Stupidly, I built my root filesystem with only 8GB of space. So the data node ran out of space when it tried to run any Hadoop job. I got the above error when that happened.

Hadoop DFSadmin utility is good for diagnosing issues like the above:
sodo@linux-z6tw:/var/log/hadoop-0.20> hadoop dfsadmin -report
Configured Capacity: 33316270080 (31.03 GB)
Present Capacity: 26056040448 (24.27 GB)
DFS Remaining: 21374869504 (19.91 GB)
DFS Used: 4681170944 (4.36 GB)
DFS Used%: 17.97%
Under replicated blocks: 9
Blocks with corrupt replicas: 0
Missing blocks: 6
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 33316270080 (31.03 GB)
DFS Used: 4681170944 (4.36 GB)
Non DFS Used: 7260229632 (6.76 GB)
DFS Remaining: 21374869504(19.91 GB)
DFS Used%: 14.05%
DFS Remaining%: 64.16%
Last contact: Tue Dec 27 15:11:27 EST 2011

The resolution was to:
1) add a new filesystem to my virtual machine
2) move the datanode /tmp directory to the larger filesystem (see the sketch below)
3) move my MySQL installation to that new filesystem as well (nice instructions for that here: http://kaliphonia.com/content/linux/how-to-move-mysql-datadir-to-another-drive)
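
For step 2, here is a rough sketch of relocating the datanode's storage, assuming the new filesystem is mounted at /mnt/data and the storage currently lives under /tmp/hadoop-hdfs (that path is an example; check hadoop.tmp.dir / dfs.data.dir in your config for the real location):
$ sudo /etc/init.d/hadoop-0.20-datanode stop
$ sudo mv /tmp/hadoop-hdfs /mnt/data/hadoop-hdfs
$ sudo ln -s /mnt/data/hadoop-hdfs /tmp/hadoop-hdfs
$ sudo /etc/init.d/hadoop-0.20-datanode start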

3) name node cannot start due to permissions
I moved my /tmp directory to a new filesystem and the namenode did not have the proper permissions to write to the new temp directory. So I set perms like so (1777 rather than 777 preserves the sticky bit that /tmp normally carries):
$ chmod 1777 /tmp

4) ERROR 1148 (42000) at line 40: The used command is not allowed with this MySQL version
Security issue with local data loads:
http://dev.mysql.com/doc/refman/5.0/en/load-data-local.html

You must use "--local-infile" as a parameter to the mysql command line like so:
sodo@linux-z6tw:~/trendingtopics/lib/sql> mysql -u user trendingtopics_development < loadSampleData.sql --local-infile

5) Cannot delete Name node is in safe mode
Hadoop safemode must be disabled
https://issues.apache.org/jira/browse/HADOOP-5937
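
If you don't want to wait for the namenode to finish its block checks, you can force it out of safe mode by hand (risky if blocks really are missing):
$ hadoop dfsadmin -safemode leave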

References
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/single_node_setup.html#PseudoDistributed
https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode
https://cwiki.apache.org/confluence/display/WHIRR/Quick+Start+Guide
Map Reduce Tutorial
"could only be replicated to 0 nodes" FAQ
HDFS Basics for Developers

Wednesday, August 17, 2011

min requirements and env for ec2-api-tools

I started a mini Fedora EC2 instance on Amazon Web Services the other day. The instance had a slim set of installed programs. Here is the short list of dependencies to get the ec2-api-tools running:
  • yum install openssh-clients (for SCP)
  • yum install unzip (for the zipped ec2-api-tools install zip)
  • install latest jdk (latest 6_27 from Oracle's site)
As well, ec2-api-tools depends on a few environment variables..the five lines at the bottom of this .bash_profile:
export PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HIVE_HOME=/usr/lib/hive
export MYBUCKET=frasestrendingtopics
export MYSERVER=localhost
export MAILTO=cacasododom@gmail.com
export JAVA_HOME=/usr/java/jdk1.6.0_24/
export EC2_HOME=/usr/bin/ec2-api-tools-1.4.4.0
export EC2_PRIVATE_KEY=~/doc/software/amazon/certs/pk-GEQDDYWBHKJKKUWNWAS.pem
export EC2_CERT=~/doc/software/amazon/certs/cert-DYWBHKJKKUWNWASYNF2YL.pem
export ACCESS_KEY_ID=8DSRMMTFQMQRER2
export SECRET_ACCESS_KEY=mAq9QzCVhmxzKpxhQxBoA5jOxZksq62jpO5mbD
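
A quick sanity check that the tools and credentials are wired up (this assumes $EC2_HOME/bin ended up on your PATH) is to list the available regions:
$ ec2-describe-regions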

Cheers..


Friday, August 5, 2011

ec2 api tools through an http proxy

I spent about an hour on this today, so I thought it would be reasonable to post about it. I was trying to use the EC2 API tools from behind the firewall at work, through our HTTP proxy server. You can get this to work by setting a JVM environment variable called EC2_JVM_ARGS, as shown below (the examples are from a Linux box).

Example without username authentication (using ntlmaps)
export EC2_JVM_ARGS="-DproxySet=true -DproxyHost=localhost -DproxyPort=5865 -Dhttps.proxySet=true -Dhttps.proxyHost=localhost -Dhttps.proxyPort=5865"

Example with username authentication
export EC2_JVM_ARGS="-DproxySet=true -DproxyHost=http-proxy -DproxyPort=8080 -Dhttps.proxySet=true -Dhttps.proxyHost=http-proxy -Dhttps.proxyPort=8080 -Dhttp.proxyUser=username -Dhttp.proxyPass=password -Dhttps.proxyUser=username -Dhttps.proxyPass=password"

Voila..it works!
sodo@linux-z6tw:~> ec2din -v
Setting User-Agent to [ec2-api-tools 1.4.4.0]
Using proxy [http-proxy:8080]
Using proxy credentials [username@password] for all realms
------------------------------[ REQUEST ]-------------------------------
------------------------------------------------------------------------
REQUEST ID 76b60060-790c-403f-b0dd-75d06d7f3a79
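
When you're back on a direct connection, just clear the variable so the tools stop trying to use the proxy:
$ unset EC2_JVM_ARGS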


Hooray!
sodo

Reference
https://forums.aws.amazon.com/thread.jspa?messageID=52041

Wednesday, April 27, 2011

preparing an EBS volume to use as Amazon Public Dataset

I had created a three-month refresh of Pete Skomoroch's Amazon Public Dataset of the Wikipedia Traffic Stats. The snapshot (snap-5300883c) is about 150GB of data and includes 2060 logfiles spanning 1/1/2011-3/31/2011.

To share the snapshot with Amazon, I first had to migrate the data from my S3 bucket to an EBS volume. Then, I created a snapshot of the EBS volume so that the Amazon support folks could use it as a public data set. Next, I consulted an Amazon support engineer so that they could give me permission and the proper authorization number.

Once I got proper authorization, here are the steps I performed.

Migrate the data from S3 to EBS
1) Create an EBS volume of the necessary size and attach it to an EC2 instance
2) Transfer the data from S3 to that volume
3) Create an EBS snapshot
4) Share the snapshot with the Amazon account as described previously

Share the EBS Snapshot with Amazon
1) Select the "EC2" tab in the AWS Console
2) Select "Snapshots" item in the left menu bar
3) Right-click on the snapshot you would like to share and select "Snapshot Permissions"
4) Choose the "Private" option (though it will also be available to me if you leave it as "Public")
5) Enter XXXXXXXX (number given by Amazon rep) next to "AWS Account Number 1:" and select "Save"
6) Select "Snapshot Permissions" on that snapshot again, if this was done correctly "amazon" should now show up under "Remove Create Volume Permission:"

Amazon then created their own snapshot based off of my shared snapshot. It was an interesting exercise to go through.

'sodo

creating, attaching, formatting, mounting an ebs volume

In preparation to move the most recent Wikipedia traffic data to a public dataset, I moved my S3 storage bucket contents to EBS. First, though, I had to create the EBS volume and copy the S3 files to it. Here are some notes on creating, attaching, formatting, and mounting an Amazon EBS volume.

Overview
* create the EBS volume
* attach the EBS volume to an instance
* validate attachment through dmesg, system logs
* partition the volume
* format the partition
* create a mount point
* mount
* copy files to the new volume

Detail
1) create the EBS volume
When creating your EBS volume, make sure your EBS volume is in the same Availability Zone as your Amazon EC2 instance.


2) attach the EBS volume to an instance
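If you prefer the command line to the AWS console, the ec2-api-tools can handle steps 1 and 2 as well (size, zone, and IDs below are examples):
$ ec2-create-volume --size 150 --availability-zone us-east-1a
$ ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdf
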
3) review the system message log to verify it got attached
ip-10-32-69-206:~ # dmesg | tail
[ 1.073737] kjournald starting. Commit interval 15 seconds
[ 1.097264] EXT3-fs (sda2): using internal journal
[ 1.097280] EXT3-fs (sda2): mounted filesystem with ordered data mode
[ 1.125686] JBD: barrier-based sync failed on sda1 - disabling barriers
[ 83.353564] sdf: unknown partition table


As it is raw storage, we'll need to partition the volume.
4) partition the EBS volume
ip-10-32-69-206:~ # fdisk /dev/sdf
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-19581, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-19581, default 19581):
Using default value 19581

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.


And of course, format!
5) format the newly created partition
ip-10-32-69-206:~ # mkfs.ext3 /dev/sdf1
mke2fs 1.41.11 (14-Mar-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
9830400 inodes, 39321087 blocks
1966054 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
1200 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 34 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.


6) create a mount point for the partition
ip-10-32-69-206:~ # mkdir /mnt/data

7) mount the partition
ip-10-32-69-206:~ # mount -t ext3 /dev/sdf1 /mnt/data
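
If you want the mount to survive a reboot, an /etc/fstab entry along these lines does the trick (same device and mount point as above):
/dev/sdf1  /mnt/data  ext3  defaults,noatime  0 0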

Good to go for copy!
8) copy the S3 storage bucket data over to the EBS volume
ip-10-32-69-206:/data/wikistats # s3cmd get s3://sodotrendingtopics/wikistats/* /data
s3://sodotrendingtopics/wikistats/pagecounts-20110101-000000.gz -> ./pagecounts-20110101-000000.gz [1 of 2161]
..


The 2161 files took about four hours to copy over from S3. Amazon does not charge for data transfer between S3 and EC2 within the same region, so moving files from S3 to EBS (and back) incurs no transfer charges.

Saturday, April 23, 2011

refreshed trendingtopics data

I was able to successfully refresh the TrendingTopics website data today. I used a sample of an updated Wikipedia dataset that I set up on Amazon S3. The updated data spans 1/1/2011-3/31/2011. Since crunching through months of data would take too long, or cost too much if I used the EC2 cloud, I didn't use the whole dataset..I sampled only one hour of weblogs from each day. You can check out my local version of TrendingTopics here (CTRL-click or wheel-click to open in a new browser window).

In case the site is offline, you can see the more recent dates in the screenshot below:


The caveat with the refresh is that I can only load 100 records of the sample data. For some reason, the Rails app bombs when I try to load the full dataset.
* must figure out why

--more to come--

Saturday, April 16, 2011

timeline and trend scripts, part I

DataWrangling's TrendingTopics website relies on lots of shell, Hive, MySQL and Python scripts to update and perform statistical analysis on the Wikipedia data. I've been trying to make sense of and optimize them over the past couple of weeks and am slowly making headway.

I'll start at the beginning. The two main shell scripts take the most recent logfile data from Wikipedia, perform some data massaging on the logs, and then output files ready to load into MySQL, the database engine behind the web app. These two shell scripts are:
  • run_daily_timelines.sh
  • run_daily_trends.sh
The shell scripts kick off Hadoop Streaming jobs that use python scripts to perform MapReduce tasks. The scripts also utilize Hive to manipulate the structured data summarized from the weblogs.
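
For orientation, a Hadoop Streaming invocation of one of these Python jobs looks roughly like this; the input/output paths are illustrative, and the real arguments live in the shell scripts themselves:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar \
  -input pagecounts/ \
  -output outputStage1 \
  -mapper "daily_timelines.py mapper1" \
  -reducer "daily_timelines.py reducer1" \
  -file /home/sodo/trendingtopics/lib/python_streaming/daily_timelines.py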

run_daily_timelines.sh
Generally, this script uses Hadoop clusters to convert the Wikipedia weblog data using MapReduce. The data is then loaded into Hive for further merging into temp tables. The final output will be used in later scripts to refresh three of the TrendingTopics MySQL database tables:
  • pages
  • daily_timelines 
  • sample_pages

INPUT
Wikipedia traffic stats data for a selected period

DETAIL
The more detailed view of what run_daily_timelines.sh does is as follows:
1) it grabs the hourly logfiles for a number of days (based on a LIMIT constant)
2) runs a two-part Hadoop Streaming job (daily_timelines.py) to convert the logfiles as described in my previous entry
3) deletes local hadoop _log file directories so that Hive can load the trend data
4) grabs the wikipedia page id lookup table
5) does a bunch of data munging in Hive (not in MySQL yet)
  a. creates daily_timelines, pages, raw_daily_stats_table, redirect_table, sample_pages tables
  b. overwrites redirect_table with the wiki data from step 4
  c. loads the munged stage2 output into raw_daily_stats_table
  d. overwrites the pages table with data from the recently created redirect_table
6) exports the tab delimited data out of Hive for bulk loading into MySQL
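
Step 6 is typically just a Hive query redirected to a flat file; a minimal sketch (the exact query in the script may differ) looks like:
hive -S -e 'SELECT * FROM pages;' > /mnt/pages.txt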

OUTPUT
The script outputs five files:
-rw-r--r-- 1 root root 162454865 Apr 16 17:24 page_lookup_nonredirects.txt
-rw-r--r-- 1 root root 61580486 Apr 16 17:33 pages.txt
-rw-r--r-- 1 root root 5564 Apr 16 17:34 sample_pages.txt
-rw-r--r-- 1 root root 36958988 Apr 16 17:51 daily_timelines.txt
-rw-r--r-- 1 root root 8684 Apr 16 17:52 sample_daily_timelines.txt


Here's what they look like. I'm grepping for Oldsmobile because I used to own a number of the cars and, more importantly, a search on a full name like Oldsmobile drops some of the entries beginning with punctuation that might be confusing. This is the data that will be loaded into MySQL:
sodo@linux-z6tw:/mnt> grep Oldsmobile page_lookup_nonredirects.txt | head -2
Oldsmobile Oldsmobile 52040 276183917
Oldsmobile_98 Oldsmobile 98 540806 272510635


sodo@linux-z6tw:/mnt> grep Oldsmobile pages.txt | head -2

52040 Oldsmobile Oldsmobile 276183917 46 46.0

52040 Oldsmobile Oldsmobile 276183917 46 46.0

sodo@linux-z6tw:/mnt> head -2 sample_pages.txt

25895 Robert_Bunsen Robert Bunsen 276223823 100439 100439.0

14944 Insect Insect 276199176 13679 13679.0


sodo@linux-z6tw:/mnt> cat daily_timelines.txt | head -2
600744 [20110330,20110331] [21,16] 37
4838455 [20110330,20110331] [3,3] 6

sodo@linux-z6tw:/mnt> cat sample_daily_timelines.txt | head -2

3382 [20110330,20110331] [1077,867] 1944
4924 [20110330,20110331] [27,5770] 5797


run_daily_trends.sh
This second script interacts with the Hive tables that were created in the run_daily_timelines.sh script. This script outputs two files that will be loaded into MySQL by later scripts.

INPUT
Wikipedia traffic stats data for a selected period

DETAIL
It does the following:
1) runs Hadoop to MapReduce (daily_trends.py) some trend data as per my other post
2) loads the trend data into Hive
3) does a bunch of data munging in Hive
  a. drop the redirect_table and sample_pages tables
  b. create a new redirect_table, raw_daily_trends_table, daily_trends, sample_pages
  c. load the page_lookups_nonredirects.txt file into redirect_table (redundant to run_daily_timelines, I think)
  d. load the sample_pages.txt file into sample_pages
  e. load the MapReduce data that was just produced into raw_daily_trends_table
  f. overwrite daily_trends with the redirect_table data joined to the raw_daily_trends_table
4) output the daily_trends table information to a file
5) output sample_pages data

OUTPUT
The two files produced by run_daily_trends:
-rw-r--r-- 1 root root 37728906 Apr 16 19:36 daily_trends.txt
-rw-r--r-- 1 root root 3742 Apr 16 19:36 sample_daily_trends.txt


They mainly have statistical trend information that will be loaded into MySQL at a later time:
linux-z6tw:/home/sodo/trendingtopics/lib/scripts # head -2 /mnt/daily_trends.txt
600744 888.58405328 0.0821994936527
4838455 101.253019797 0.204124145232

linux-z6tw:/home/sodo/trendingtopics/lib/scripts # head -2 /mnt/sample_daily_trends.txt

3382 77440.6080993 0.0113402302907
3382 77440.6080993 0.0113402302907


I've made some changes to both scripts to make them configurable for a new user and will post them once perfected.

The next step is to get the refreshed data loaded into MySQL.
..to be continued..

References
Raw traffic stats
My Amazon Public Data Set
Pete's Original Amazon Public Data Set

Friday, April 8, 2011

script overview

1) run_daily_timelines.sh
-hive_daily_timelines.sql

2) run_daily_trends.sh
-hive_daily_trends.sql
-outputs daily_trends.txt

3) create dropindex procedure

4) run_daily_load.sh (grabs people/companies)
INPUT: /mnt/pages.txt, /mnt/daily_timelines (for two days, they have 1012222 records)
a. verifies if the new_pages table exists
b. checks the last date from daily_timelines
c. if new_pages staging table exists, load data; if not, backup tables to new_pages
TODO rename "new_*" tables to "staging_*"
d. fetch people & companies data
TODO find the script that actually builds these files
e. create dropindex procedure
TODO should be moved to top of script
f. load "new tables" via load_history.sql
.truncates people/companies tables and loads recent people/company data
.drops/recreates people/companies index
.truncates staging tables (new_pages, new_daily_timelines)
..in dev, those tables have no data anyway???
.here is where the dropindex procedure is..I yanked out into separate create_dropindex.sql
.dropindexes on new_pages (6), new_daily_timelines (2)
.disable primary keys on new_pages, new_daily_timelines
.load /mnt/pages -> new_pages, /mnt/timelines -> new_daily_timelines
.enable primary keys on new_pages, new_daily_timelines
.create indexes on new_pages (6), new_daily_timelines (2)
g. load "trends" table via load_trends.sql
.truncates new_daily_trends table
..in dev, zero data anyway
.load /mnt/daily_trends.txt (zero rows..I believe because we don't have enough days (need 45))
.(Yes, after loading 45 days..we have trends)
.enable keys on new_daily_trends
h. find max date of the trendsdb
i. load featured_pages
.new_featured_pages not working..probably because file from s3 hasn't been updated
j. alter new_pages
.add featured
.create index, drop index, create index
k. archive to s3
l. archive to s3 by date

After run of daily_load.sh
Companies: 1448
Daily Timelines: 100
Daily Trends: 100
Featured Pages: 0
Pages: 100
People: 345603
New Daily Timelines: 1012230
New Daily Trends: 2714791
New Featured Pages: 841
New Pages: 506111
Weekly Trends: 100


New Featured Pages, New Timelines, New Trends and New Pages have data..must copy to live prod via last step:
m. swap table to go live


-execute run_daily_load.sh (grabs people/companies) up until s3cmd
linux-z6tw:/home/sfrase/trendingtopics/lib/scripts # ./daily_load.sh
Enter password:
create index page' at line 1

real 3m56.042s
user 0m0.021s
sys 0m0.457s
create index pages_id_index on new_pages (id);
-- Query OK, 2804203 rows affected (9 min 57.82 sec)

-- create index pages_autocomp_trend_index on new_pages (title(64), monthly_trend);
-- Query OK, 2783939 rows affected (6 min 20.95 sec)
-- Records: 2783939 Duplicates: 0 Warnings: 0

-- for main pagination
create index pages_trend_index on new_pages (monthly_trend);
-- Query OK, 2783939 rows affected (1 min 25.65 sec)
-- Records: 2783939 Duplicates: 0 Warnings: 0

-- for sparklines
create index timeline_pageid_index on new_daily_timelines (page_id);
-- Query OK, 2804057 rows affected (22 min 33.80 sec)
-- Records: 2804057 Duplicates: 0 Warnings: 0

-rename_backup_to_new.sql (only if there is no new_pages table)
-load_history.sql
-load_trends.sql (loads /mnt/daily_trends.txt and indexes the table)
linux-z6tw:/home/sodo/trendingtopics/lib/sql # mysql -u root -p --local-infile trendingtopics_development < load_trends.sql
Enter password:

-load_featured_pages.sql (/mnt/featured_pages created by daily_load.sh)
linux-z6tw:/home/sodo/trendingtopics/lib/scripts # ./daily_load.sh
Enter password:
loading featured pages

-archives trendsdb (again, and by date)
-rename_new_to_live.sql (swaps new tables to go live automatically)

linux-z6tw:/home/sodo/trendingtopics/lib/sql # time mysql -u root -p trendingtopics_development --local-infile < /mnt/app/current/lib/sql/load_featured_pages.sql
Enter password:

real 0m1.831s
user 0m0.012s
sys 0m0.011s


Unknown
run_daily_merge.sh

linux-z6tw:/home/sodo/trendingtopics/lib/scripts # ./daily_load.sh
staging tables exist, loading data
loading history tables
loading trends table
Enter password:
loading featured pages
Traceback (most recent call last):
File "/mnt/app/current/lib/scripts/generate_featured_pages.py", line 14, in
from BeautifulSoup import BeautifulSoup
ImportError: No module named BeautifulSoup

References
http://www.crummy.com/software/BeautifulSoup/
Install BeautifulSoup
sudo python setup.py install
running install
running build
running build_py
running install_lib
copying build/lib/BeautifulSoupTests.py -> /usr/local/lib/python2.7/site-packages
copying build/lib/BeautifulSoup.py -> /usr/local/lib/python2.7/site-packages
byte-compiling /usr/local/lib/python2.7/site-packages/BeautifulSoupTests.py to BeautifulSoupTests.pyc
byte-compiling /usr/local/lib/python2.7/site-packages/BeautifulSoup.py to BeautifulSoup.pyc
running install_egg_info
Writing /usr/local/lib/python2.7/site-packages/BeautifulSoup-3.2.0-py2.7.egg-info

Install MySQLdb python
-need python-setuptools
-then installed the whole shebang

sodo@linux-z6tw:~/Downloads/MySQL-python-1.2.3> sudo python setup.py install
running install
Checking .pth file support in /usr/local/lib64/python2.7/site-packages/
/usr/bin/python -E -c pass
TEST PASSED: /usr/local/lib64/python2.7/site-packages/ appears to support .pth files
running bdist_egg
running egg_info
writing MySQL_python.egg-info/PKG-INFO
writing top-level names to MySQL_python.egg-info/top_level.txt
writing dependency_links to MySQL_python.egg-info/dependency_links.txt
...
Installed /usr/local/lib64/python2.7/site-packages/MySQL_python-1.2.3-py2.7-linux-x86_64.egg
Processing dependencies for MySQL-python==1.2.3
Finished processing dependencies for MySQL-python==1.2.3

Thursday, April 7, 2011

TrendingTopics MySQL database

I'm starting to dissect the TrendingTopics shell and sql scripts, but got a bit confused with the tables available to me. So I installed and used MySQL Workbench to show the database schema. There are 11 tables in trendingtopics_development, but no referential integrity is enforced:
daily_timelines
daily_trends
pages
weekly_trends
schema_migrations
new_daily_timelines
new_pages
new_daily_trends
companies
people
featured_pages


Here is the table layout of the development database:


Only five tables are populated with data. The top four have 100 records of test data and are the core tables that drive the application. The last table called "schema_migrations" is an audit table and has no data intended for the web application:
daily_timelines
daily_trends
pages
weekly_trends
schema_migrations


Here's a sample of the contents of the four core tables:


MySQL Workbench install hiccup
Starting MySQL on bootup gave an error; I needed to edit the startup runlevels in /etc/init.d/mysql
linux-z6tw:/etc/init.d # chkconfig --level 235 mysql on
insserv: FATAL: service network is missed in the runlevels 2 to use service mysql
insserv: exiting now!
/sbin/insserv failed, exit code 1

After editing /etc/init.d/mysql # Default-Start: 3 5
linux-z6tw:/etc/init.d # chkconfig --level 35 mysql on
linux-z6tw:/etc/init.d # /etc/init.d/mysql start
Starting service MySQL done
linux-z6tw:/etc/init.d # netstat -an | grep LISTEN
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN

MySQL Reference
MySQL passwordless login
Copying tables
Insert Into example
Delete syntax
Total number of locks exceed the lock table size
Option files
Inno DB Transaction Model
Various Advice on Enabling Explicit Commits

hadoop pseudo-mode, debugging, errors

My next goal was to get DataWrangling's update scripts working. Because of my lack of knowledge about Hadoop and Hive, I've encountered a LOT of errors, made a lot of mistakes, and lost many hours normally devoted to sleep. Ah well..what's a techy supposed to do? Of course, encountering problems is a great way to learn and I had problems aplenty. It was well worth the effort.

I first installed Cloudera Hadoop in pseudo-distributed mode. This was great for testing out DW's map/reduce and python scripts BEFORE you fire up multiple EC2 instances that will cost you. Ahem.

I spent about a week's worth of nights figuring out the scripts. I rewrote them to make them easier to use. Once I get a bit of time, I will post them in a more readable format. Once I got those scripts working, I updated my hadoop.properties file in order to munge through the big dataset on EC2.

INSERT AMAZON EC2 EXPERIENCE HERE
I spent too much money on what I should have been doing in my unit test environment first.
*** end EXPERIENCE ***

The only trouble I had was that once I was finished with the EC2 configuration, I didn't know how to revert back to the pseudo-distributed config. Eventually I'll figure this out, but for now I had to reinstall Hadoop to clear out the config and get back to pseudo mode.

Below I list the things I learned this week.

pseudo-distributed mode
The quickest way to reset the Hadoop configuration in order to get back to pseudo mode from EC2 mode was to reinstall Hadoop.

Make sure your environment is set up properly!
PATH=$PATH:/usr/local/apache-maven-3.0.3/bin
export PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
#export HADOOP_ROOT_LOGGER=DEBUG,console
export HIVE_HOME=/usr/lib/hive
export MYBUCKET=trendingtopics
export MYSERVER=linux-z6tw
export MAILTO=cacasododom@gmail.com
export JAVA_HOME=/usr/java/jdk1.6.0_24/
export AWS_ACCESS_KEY_ID="DSRMMT"
export AWS_SECRET_ACCESS_KEY="zKpxhQxBoA5jOxZk"


In my scripts, I had to add the AWS access id/secret key to the URL for S3 access:
s3n://accessid:secretkey@$MYBUCKET/...
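
An alternative that keeps the keys out of the URL (and therefore out of logs and process listings) should be to pass them as Hadoop properties on the streaming command line; something along these lines, reusing the environment variables set above:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar \
  -D fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
  -D fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
  -input s3n://$MYBUCKET/wikistats/ \
  ...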

Debugging
Use Hive Logs
..to tell you what the hell is going wrong with Hive.
linux-z6tw:/var/lib/hive # ll /tmp/root/hive.log
-rw-r--r-- 1 root root 503689 Apr 8 15:23 /tmp/root/hive.log
linux-z6tw:/var/lib/hive # date
Fri Apr 8 15:28:37 EDT 2011
linux-z6tw:/var/lib/hive # tail /tmp/root/hive.log
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)


Enable Hadoop debug logging for streaming errors
HADOOP_ROOT_LOGGER=DEBUG,console
Like so:
11/04/08 16:42:06 DEBUG streaming.StreamJob: Error in streaming job
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password

By default, Hadoop Streaming does not spit out very clear errors. Once debug mode is enabled, the message that pops out of hadoop-streaming is fairly self-explanatory.

Errors
FAILED: Error in semantic analysis: line 3:17 Invalid Path 'outputStage1': source contains directory: hdfs://ec2-184-73-137-122.compute-1.amazonaws.com/user/root/outputStage1/_logs

You must delete the _logs directory
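
Clearing it is a one-liner (the path is whatever your output directory is):
$ hadoop fs -rmr outputStage1/_logs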


ERROR DataNucleus.Plugin
>> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
>> "org.eclipse.core.resources" but it cannot be resolve


Corrupt Hive Metastore?
Maybe move/delete metastore to resolve unknown issues.
linux-z6tw:~ # ll /var/lib/hive/metastore/
total 16
-rw-r--r-- 1 root root 354 Apr 8 15:52 derby.log
drwxr-xr-x 5 root root 4096 Apr 8 15:58 metastore_db
drwxr-xr-x 5 root root 4096 Apr 8 15:51 metastore_dbBACKUP
drwxrwxrwt 3 root root 4096 Apr 8 14:38 scripts


Make sure you don't have another hive CLI session open..
Don't have two command line Hive interfaces (CLI) up when trying to add/drop/delete tables,
or you'll get strange failures, like this:
"DROPing redirect_table"
FAILED: Error in semantic analysis: Table not found redirect_table
FAILED: Error in semantic analysis: Unable to fetch table daily_trends


For LOAD DATA LOCAL INFILE calls, use --local-infile command line arg
mysql -u user -p --local-infile
Otherwise, you'd get:
ERROR 1148 (42000): The used command is not allowed with this MySQL version

DataWrangling's reducer2 seems to need the exact number of logfiles you're munging
-reducer "daily_timelines.py reducer2 1" \


Reference
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
Running a multi-node Hadoop cluster
Cloudera Pseudo-Distributed Mode

Saturday, April 2, 2011

MapReduce and Hive by example

In my last post, I was able to get MySQL and the Ruby-on-Rails environment setup. After updating my Rails configuration, I was able to fire up a working version of DataWrangling's TrendingTopics site.


Of course, the data is still two years old, so I need to delve further into how I would update the data. Here, Pete has some interesting stuff going on:
- uses Hadoop Streaming to power Python scripts that chunk through that fat weblog data
- kicks off HiveQL scripts to load the final output and create other temporary tables
- joins tables in Hive and preps the latest daily data to ship off to MySQL
- wraps up the status of what happens during the process in an email

Let's look at MapReduce and Hive in a bit more depth.

MapReduce
Using a hint from Sujit Pal's blog post, I found it helpful to see exactly what Pete's mapper and reducer scripts do. These scripts are written in Python and are called by run_daily_timelines.sh and run_daily_trends.sh.

daily_timelines.py
run_daily_timelines.sh has two Hadoop Streaming jobs that run Pete's MapReduce code. The MR code is in daily_timelines.py. The first mapper code outputs this:
[sodo@computer python_streaming]$ zgrep '^en Oldsmobile' ~/Downloads/pagecounts-20090401-000001.gz | python daily_timelines.py mapper1 | sort | head
LongValueSum:Oldsmobile}20090419 65
LongValueSum:Oldsmobile_442}20090419 31
LongValueSum:Oldsmobile_66}20090419 1
LongValueSum:Oldsmobile_88}20090419 26
LongValueSum:Oldsmobile_98}20090419 8
LongValueSum:Oldsmobile_Achieva}20090419 5
LongValueSum:Oldsmobile_achieva_catalytic_converter}20090419 1
LongValueSum:Oldsmobile_achieva_fuel_tanks}20090419 1
LongValueSum:Oldsmobile_Alero_1999_2004_Center}20090419 1
LongValueSum:Oldsmobile_Alero}20090419 20


In daily_timelines.py, the reducer morphs the output of the mapper into this:
[sodo@computer python_streaming]$ zgrep '^en Oldsmobile' ~/Downloads/pagecounts-20090401-000001.gz | python daily_timelines.py mapper1 | sort | python daily_timelines.py reducer1 | head
Oldsmobile}20090419 65
Oldsmobile_442}20090419 31
Oldsmobile_66}20090419 1
Oldsmobile_88}20090419 26
Oldsmobile_98}20090419 8
Oldsmobile_Achieva}20090419 5
Oldsmobile_achieva_catalytic_converter}20090419 1
Oldsmobile_achieva_fuel_tanks}20090419 1
Oldsmobile_Alero_1999_2004_Center}20090419 1
Oldsmobile_Alero}20090419 20


A second mapper takes that output and changes it some more:
[sodo@computer python_streaming]$ zgrep '^en Oldsmobile' ~/Downloads/pagecounts-20090401-000001.gz | python daily_timelines.py mapper1 | sort | head -100 | python daily_timelines.py reducer1 | python daily_timelines.py mapper2
Oldsmobile 20090419 65
Oldsmobile_442 20090419 31
Oldsmobile_66 20090419 1
Oldsmobile_88 20090419 26
Oldsmobile_98 20090419 8
Oldsmobile_Achieva 20090419 5
Oldsmobile_achieva_catalytic_converter 20090419 1
Oldsmobile_achieva_fuel_tanks 20090419 1
Oldsmobile_Alero_1999_2004_Center 20090419 1


Finally, the second reducer code formats the timeline data, now ready to import into Hive:
[sodo@computer python_streaming]$ zgrep '^en Oldsmobile' ~/Downloads/pagecounts-20090401-000001.gz | python daily_timelines.py mapper1 | sort | python daily_timelines.py reducer1 | python daily_timelines.py mapper2 | python daily_timelines.py reducer2 1 | head
Oldsmobile [20090419] [65] 65 65.0
Oldsmobile_442 [20090419] [31] 31 31.0
Oldsmobile_66 [20090419] [1] 1 1.0
Oldsmobile_88 [20090419] [26] 26 26.0
Oldsmobile_98 [20090419] [8] 8 8.0
Oldsmobile_Achieva [20090419] [5] 5 5.0
Oldsmobile_achieva_catalytic_converter [20090419] [1] 1 1.0
Oldsmobile_achieva_fuel_tanks [20090419] [1] 1 1.0
Oldsmobile_Alero_1999_2004_Center [20090419] [1] 1 1.0
Oldsmobile_Alero [20090419] [20] 20 20.0


* It is here that I discovered a problem: the reducer script requires a minimum of 45 days of data. For testing purposes, I changed the following line in the script to lower the minimum to one day:
-reducer "daily_timelines.py reducer2 45" \

daily_trends.py
In like fashion, the run_daily_trends.sh script also utilizes a Hadoop Streaming job to calculate trend data against the Wikipedia weblogs. This MapReduce code is found in daily_trends.py. Herein, the mapper does a first pass at the data:
linux-z6tw:/home/sodo/trendingtopics/lib/scripts # zcat /home/sodo/Downloads/pagecounts-20110328-010000.gz | python ../python_streaming/daily_trends.py mapper | grep '^Oldsmobile' | head
Oldsmobile 20090419 64
Oldsmobile_"Rocket_V8"_engine 20090419 2
Oldsmobile_4-4-2 20090419 1
Oldsmobile_442 20090419 40
Oldsmobile_66 20090419 2
Oldsmobile_88 20090419 39
Oldsmobile_98 20090419 25
Oldsmobile_Achieva 20090419 2
Oldsmobile_Aerotech 20090419 1
Oldsmobile_Aerotech_III 20090419 1


The reducer takes the mapper feed and creates final trend numbers like so:
linux-z6tw:/home/sodo/trendingtopics/lib/scripts # zcat /home/sodo/Downloads/pagecounts-20110328-010000.gz | python ../python_streaming/daily_trends.py mapper | python ../python_streaming/daily_trends.py reducer 1 | grep '^Oldsmobile' | head
Oldsmobile 331.160785273 0.125
Oldsmobile_"Rocket_V8"_engine 4.19722457734 0.707106781187
Oldsmobile_4-4-2 1.69314718056 1.0
Oldsmobile_442 188.542882668 0.158113883008
Oldsmobile_66 4.19722457734 0.707106781187
Oldsmobile_88 182.86629871 0.160128153805
Oldsmobile_98 106.452413451 0.2
Oldsmobile_Achieva 4.19722457734 0.707106781187
Oldsmobile_Aerotech 1.69314718056 1.0
Oldsmobile_Aerotech_III 1.69314718056 1.0


Later scripts will then load this data into Hive and eventually, MySQL for display in the web app.

HiveQL
To really oversimplify, HiveQL is just like SQL (the basic syntax is the same), but for databases that sit atop a Hadoop HDFS filesystem. The benefit is that you can divide up the computational tasks of a relational database into little chunks spread out over many servers, managed by MapReduce. This is especially helpful if you have, say, big fat-ass weblogs you need to munge through in a relational fashion.

Here's sample HiveQL code, courtesy of Peter Skomoroch:
sodo@linux-z6tw:~/trendingtopics/lib/hive> cat hive_daily_timelines.sql
CREATE TABLE raw_daily_stats_table (redirect_title STRING, dates STRING, pageviews STRING, total_pageviews BIGINT, monthly_trend DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

CREATE TABLE redirect_table (redirect_title STRING, true_title STRING, page_id BIGINT, page_latest BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

CREATE TABLE daily_timelines (page_id BIGINT, dates STRING, pageviews STRING, total_pageviews BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

CREATE TABLE pages (page_id BIGINT, url STRING, title STRING, page_latest BIGINT, total_pageviews BIGINT, monthly_trend DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

CREATE TABLE sample_pages (page_id BIGINT, url STRING, title STRING, page_latest BIGINT, total_pageviews BIGINT, monthly_trend DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/mnt/page_lookup_nonredirects.txt' OVERWRITE INTO TABLE redirect_table;
-- columns: redirect_title string, true_title string, page_id bigint, page_latest bigint
-- sample data: Anarchism Anarchism 12 275854375

LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
-- columns: redirect_title string, dates string, pageviews string, total_pageviews bigint, monthly_trend double
-- sample data: Anarchism [20110215,20110216,20110217] [127,126,100] 4394 -286.0

INSERT OVERWRITE TABLE pages
SELECT redirect_table.page_id, redirect_table.redirect_title, redirect_table.true_title, redirect_table.page_latest, raw_daily_stats_table.total_pageviews, raw_daily_stats_table.monthly_trend
FROM redirect_table
JOIN raw_daily_stats_table ON (redirect_table.redirect_title = raw_daily_stats_table.redirect_title);
-- columns: page_id bigint, url string, title string, page_latest bigint, total_pageviews bigint, monthly_trend double
-- sample data: 12 Anarchism Anarchism 275854375 4394 -286.0

INSERT OVERWRITE TABLE sample_pages
SELECT * FROM pages SORT BY monthly_trend DESC LIMIT 100;

INSERT OVERWRITE TABLE daily_timelines
SELECT redirect_table.page_id, raw_daily_stats_table.dates, raw_daily_stats_table.pageviews, raw_daily_stats_table.total_pageviews FROM redirect_table JOIN raw_daily_stats_table ON (redirect_table.redirect_title = raw_daily_stats_table.redirect_title);
-- columns: page_id bigint, dates string, pageviews string, total_pageviews bigint
-- sample data: 12 [20110215,20110216,20110217] [127,126,100] 4394


You see a variety of standard SQL features that Pete applies to Hadoop's distributed filesystem model:
1) loading data from a file in a local filesystem
2) loading data from an HDFS/Hadoop filesystem
3) creating tables by SELECTing from other tables
4) creating tables based on JOINs

Nice work, Mr.Skomoroch!

Minutiae and Some Time Wasters
Set environment variables
export MYBUCKET=trendingtopics
export MYSERVER=linux-z6tw
export MAILTO=cacasododom@gmail.com

Commands to be run as non-privileged user
whirr launch-cluster --config=hadoop.properties
./updateHadoopConfig.sh (including launch proxy)

Commands to be run as root
start run_daily_timelines.sh
bash trendingtopics/lib/scripts/run_daily_timelines.sh $MYBUCKET $MYSERVER $MAILTO

Diff between Pete's original run_daily_timelines.sh and mine
1) my s3n input parameter doesn't accept wildcards like this:
-input s3n://$1/wikistats/pagecounts-200*
2) my scripts are in a different directory:
-file '/home/sodo/trendingtopics/lib/python_streaming/daily_timelines.py'
-file '/home/sfrase/trendingtopics/lib/python_streaming/daily_timelines.py'
hive -f /home/sfrase/trendingtopics/lib/hive/hive_daily_timelines.sql

3) the hadoop streaming jar file has changed:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar
4) for testing purposes, I had to change the reducer to look at one day of data
-reducer "daily_timelines.py reducer2 1"

Among the various permissions, credentials, and communication issues, a final gotcha was that at one point the script failed because of incorrect pathing, but only after it had already created some of the data tables in Hive. So when I ran the script again, the HiveQL steps would bomb out:
FAILED: Error in metadata: AlreadyExistsException(message:Table raw_daily_stats_table already exists)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask


Once I dropped those tables, all was right with the world.
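
For the record, dropping the half-created tables from the hive CLI (or via hive -e) was all it took; the table names are the ones from hive_daily_timelines.sql:
$ hive -e 'DROP TABLE raw_daily_stats_table; DROP TABLE redirect_table; DROP TABLE daily_timelines; DROP TABLE pages; DROP TABLE sample_pages;'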

References
http://sujitpal.blogspot.com/2009/10/parallelizing-crawling-with-hadoop.html
Writing a MapReduce Program in Python
Tom White's MapReduce Primer
How To Setup Nutch and Hadoop
How Hadoop MapReduce Works

success on getting Rails app working!

After a couple of nights of re-education on mysql and rails, I was able to get an instance of TrendingTopics up and running out in Amazon's cloud. This is using an OpenSuSE 11.4 instance:

Hooray!

However, if you look carefully at the data, you'll notice it's two years old. It would be nice to refresh the public dataset.

My next step is to understand DataWrangling's data update methodology, which uses Hadoop Streaming (Python scripts) to munge the weblog data with MapReduce and then stores the results in Hive datasets. Sounds complex.
;)
Tally ho!



Install Notes
Fire up new instance of EC2 micro OpenSuSE11.4 ami-00c83b69
setup ssh key
zypper in expect gcc git httpd make mysql mysql-devel mysql-server openssl-devel ruby rubygems ruby-devel s3cmd zlib*

note that s3cmd requires a manual install on OpenSuse 11.4
sudo python setup.py install

mysql
# start mysql
/etc/init.d/mysql start

# set a password
/usr/bin/mysqladmin -u root password '***'
# test the login
mysql -u root -p

# note the new open port
ip-10-117-78-175:~ # ns
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN


Gems Installs
# install base rails gem
sudo gem install rails --no-ri --no-rdoc
# install mongrel and mongrel_cluster gems
(--include-dependencies not needed)
sudo gem install mongrel --no-ri --no-rdoc
sudo gem install mongrel_cluster --no-ri --no-rdoc


If you're behind a proxy, this command will work with ntlmaps installed:
[sodo@computer ~]$ sudo gem install rails --no-ri --no-rdoc --http-proxy http://localhost:5865
Successfully installed rails-3.0.7
1 gem installed


Back to Pete's doc..
# install version specific rails gem
sudo gem install -v=2.3.2 rails --no-ri --no-rdoc
# install mysql gem
sudo gem install mysql --no-ri --no-rdoc
# install memcached
sudo gem install cached_model --no-ri --no-rdoc

You may encounter this problem with the cached_model gem:
[sodo@computer ~]$ sudo gem install cached_model --no-ri --no-rdoc --http-proxy http://localhost:5865
ERROR: Error installing cached_model:
hoe requires RubyGems version >= 1.4. Try 'gem update --system' to update RubyGems itself.
[sodo@computer ~]$ sudo gem update --system --no-ri --no-rdoc --http-proxy http://localhost:5865
Updating RubyGems
Updating rubygems-update
Successfully installed rubygems-update-1.7.2
Updating RubyGems to 1.7.2
Installing RubyGems 1.7.2
RubyGems 1.7.2 installed

------------------------------------------------------------------------------

RubyGems installed the following executables:
/usr/bin/gem


After which, you'll have to update the Rails 2.3.2 gem.

Nice trick to delete all gems at once:
sudo gem list | cut -d" " -f1 | xargs sudo gem uninstall -aIx

Rails config
git clone git://github.com/datawrangling/trendingtopics.git
$ cd trendingtopics
cp config/config.yml.sample config/config.yml
domain: '`hostname`'
admin_user: root
admin_password:

cp config/database.yml.sample config/database.yml
username: root
password:
socket: /var/run/mysql/mysql.sock


# rake time!
sudo rake gems:install

# rails app error
error: rake aborted! undefined local variable or method 'version_requirements'
http://www.redmine.org/boards/2/topics/22358?r=22797#message-22797

# edit config/environment.rb between bootstrap and initializer:
if Gem::VERSION >= "1.3.6"
  module Rails
    class GemDependency
      def requirement
        r = super
        (r == Gem::Requirement.default) ? nil : r
      end
    end
  end
end

#change "jpignata-bossman" to 'bossman'
$ rake db:create
$ rake db:migrate
$ rake db:develop


# start the Rails app
ip-10-117-78-175:~/trendingtopics # script/server
=> Booting Mongrel
=> Rails 2.3.2 application starting on http://0.0.0.0:3000
=> Call with -d to detach
=> Ctrl-C to shutdown server

# edit EC2 instance security settings to allow for inbound rails traffic on port 3000


Add a proxy to your environment.rb
For some reason, gem install wasn't recognizing my http_proxy environment variable. I discovered a really ugly workaround. For example:
Rails::Initializer.run do |config|
# Specify gems that this application depends on and have them installed with rake gems:install
config.gem "ya2yaml --http-proxy http://username:password@http-proxy"
config.gem 'rubaidh-google_analytics --http-proxy http://username:password@http-proxy', :lib => 'rubaidh/google_analytics', :source => 'http://gems.github.com'


Assorted Problems
sodo@linux-z6tw:~> sudo /sbin/chkconfig --level 35 mysql on
insserv: FATAL: service network is missed in the runlevels 2 to use service mysql
insserv: exiting now!
/sbin/insserv failed, exit code 1
sodo@linux-z6tw:~> vi /etc/init.d/mysql
sodo@linux-z6tw:~> sudo vi /etc/init.d/mysql  # remove Default Start Runlevel of 2
sodo@linux-z6tw:~> sudo /sbin/chkconfig --level 35 mysql on


Run "rcmysql start" to help debug issues

Reference
http://s3tools.org/download
MySQL Tutorial
Ruby on Rails Guide
Rails: Basics of Layouts
Rails: Enabling Forgery Protection
Cache Mgmt in Rails
Sweeping the Rails cache
Google Charts for Rails
Memcached.org
Memcached Basics
http://dev.mysql.com/doc/mysql/en/crashing.html
Suse 11.4 Download Link 

Friday, April 1, 2011

Cloudera for Hadoop..the errors

Following DataWrangling's prerequisites for the TrendingTopics data-driven website, I got myself an Amazon EC2 account. From the selection of virtual machine images that Amazon shows you, and mapping them to the Cloudera Quick Start Guide, I decided to first set up a virtual machine instance of SUSE Linux Enterprise Server 11 (64-bit) in VMware Player. I did this so that I wouldn't waste any money on EC2 if I encountered any installation glitches. Which, of course, I did.

SUSE Enterprise Linux is a very nice distribution..I hadn't used it before. Novell has kept the desktop very clean, as opposed to OpenSuSE's choice of KDE for 11.4..ugh! YaST is a very nicely integrated system management tool. The installation into VMware Player was also quick..about 15 minutes from the install DVD. There were two ISO DVDs; the base install only required DVD1. Also, the Sun Java JDK is a prerequisite here, so I downloaded and installed that.

There are a few tasks to accomplish:
1) install the Cloudera repository
2) install Hadoop in pseudo-distributed mode (hadoop-0.20-conf-pseudo) + hadoop-hive
3) start the Hadoop services
4) test

I got hung up a few places:
1) install the rpm, not the bin. On SUSE Enterprise Linux 64-bit, it'd be:
sudo rpm -Uvh ./jdk-6u24-linux-amd64.rpm

2) the Cloudera documentation sends you to the wrong location for the repository file for SUSE 64-bit. It should be:
http://archive.cloudera.com/sles/11/x86_64/cdh/cloudera-cdh3.repo
(also, Cloudera's ami-8759bfee is out of date: old-releases.ubuntu.com)

3) the namenode service crapped out:
metrics.MetricsUtil: Unable to obtain hostName

This was due to the fact that the hostname of my local machine was not recognized. I added an entry into /etc/hosts to resolve this.

4) the test jar for calculating the value of pi failed:
sodo@linux-8u67:/var/log/hadoop-0.20> hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
11/03/24 16:28:49 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8021. Already tried 0 time(s).


For some reason, the mapReduce jobtracker service was not up and running on port 8021. This issue turned out to be related to the timing of when the other Hadoop services start. If I restart the Hadoop jobtracker service on its own, the service starts just fine:
sodo@linux-8u67:/etc/hadoop/conf> sudo /etc/init.d/hadoop-0.20-jobtracker restart
Stopping Hadoop jobtracker daemon (hadoop-jobtracker): done
no jobtracker to stop

Starting Hadoop jobtracker daemon (hadoop-jobtracker): done
starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-linux-8u67.out


5) After package installs:
insserv: Script jexec is broken: incomplete LSB comment

Sun Java problem. Workaround here: https://bugzilla.novell.com/show_bug.cgi?id=504596#c14

6) insserv: FATAL: service syslog is missed in the runlevels 4 to use service hadoop-0.20-*

Workaround here or here.

7) FAILED: Unknown exception : org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hive-hadoop. Name node is in safe mode

I did not wait long enough after hadoop services were started. I gave them a couple minutes to cook and then my job ran.
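
If you'd rather not guess, you can check (or force) safe mode directly:
$ hadoop dfsadmin -safemode get
$ hadoop dfsadmin -safemode leave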


8) hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available

I had this java stack installed:
linux-z6tw:/> java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (suse-1.2.1-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)


I yanked out the java-1_6_0-openjdk* version:
linux-z6tw:~> sudo zypper remove java-1_6_0-openjdk*
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following NEW package is going to be installed:
java-1_6_0-sun

The following packages are going to be REMOVED:
java-1_6_0-openjdk java-1_6_0-openjdk-plugin

Retrieving package java-1_6_0-sun-1.6.0.u23-3.3.x86_64 (1/1), 20.8 MiB (88.6 MiB unpacked)
Retrieving: java-1_6_0-sun-1.6.0.u23-3.3.x86_64.rpm [done (1.6 MiB/s)]
Removing java-1_6_0-openjdk-plugin-1.6.0.0_b20.1.9.7-1.2.1 [done]
Removing java-1_6_0-openjdk-1.6.0.0_b20.1.9.7-1.2.1 [done]


So that only the Sun JDK was left:
linux-z6tw:~> java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)


And Hadoop is now running MapReduce jobs!
sodo@linux-8u67:~/Desktop> hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
11/03/24 17:43:13 INFO mapred.FileInputFormat: Total input paths to process : 2
11/03/24 17:43:14 INFO mapred.JobClient: Running job: job_201103241738_0001
11/03/24 17:43:15 INFO mapred.JobClient: map 0% reduce 0%
11/03/24 17:43:23 INFO mapred.JobClient: map 100% reduce 0%
11/03/24 17:43:31 INFO mapred.JobClient: map 100% reduce 100%
11/03/24 17:43:31 INFO mapred.JobClient: Job complete: job_201103241738_0001
11/03/24 17:43:31 INFO mapred.JobClient: Counters: 23
11/03/24 17:43:31 INFO mapred.JobClient: Job Counters
11/03/24 17:43:31 INFO mapred.JobClient: Launched reduce tasks=1
11/03/24 17:43:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11869
11/03/24 17:43:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/03/24 17:43:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/03/24 17:43:31 INFO mapred.JobClient: Launched map tasks=2
11/03/24 17:43:31 INFO mapred.JobClient: Data-local map tasks=2
11/03/24 17:43:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8403
11/03/24 17:43:31 INFO mapred.JobClient: FileSystemCounters
11/03/24 17:43:31 INFO mapred.JobClient: FILE_BYTES_READ=50
11/03/24 17:43:31 INFO mapred.JobClient: HDFS_BYTES_READ=472
11/03/24 17:43:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=156550
11/03/24 17:43:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
11/03/24 17:43:31 INFO mapred.JobClient: Map-Reduce Framework
11/03/24 17:43:31 INFO mapred.JobClient: Reduce input groups=2
11/03/24 17:43:31 INFO mapred.JobClient: Combine output records=0
11/03/24 17:43:31 INFO mapred.JobClient: Map input records=2
11/03/24 17:43:31 INFO mapred.JobClient: Reduce shuffle bytes=28
11/03/24 17:43:31 INFO mapred.JobClient: Reduce output records=0
11/03/24 17:43:31 INFO mapred.JobClient: Spilled Records=8
11/03/24 17:43:31 INFO mapred.JobClient: Map output bytes=36
11/03/24 17:43:31 INFO mapred.JobClient: Map input bytes=48
11/03/24 17:43:31 INFO mapred.JobClient: Combine input records=0
11/03/24 17:43:31 INFO mapred.JobClient: Map output records=4
11/03/24 17:43:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=236
11/03/24 17:43:31 INFO mapred.JobClient: Reduce input records=4
Job Finished in 18.177 seconds
Estimated value of Pi is 3.14118000000000000000


Neato. More to come..

References
Hadoop Default Ports
Unable to Obtain HostName error
Cloudera Documentation
Official OpenSuSE Documentation
Unofficial OpenSuSE Documentation
OpenSuSE Package Search
AddRepos
http://download.opensuse.org/distribution/11.1/repo/non-oss/
http://download.opensuse.org/distribution/11.1/repo/oss/