Wednesday, December 21, 2011

hadoop pseudo mode gotchas

Reviving my TrendingTopics project, I wanted to get the Hadoop pseudo-distributed mode configuration working again. That would let me test and run my shell scripts (adapted from Pete Skomoroch's originals found here) locally, rather than having to worry about EC2 config and setup. I had a larger problem that impacted my Hadoop setup: some of the subdirectories under /var were missing. This unwelcome problem led me on a longer troubleshooting journey than I initially expected.

The Hadoop pseudo setup is based on this doc: https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode

In a nutshell..
1) Make sure the Hadoop pseudo-distributed config package is installed:
$ sudo yum install hadoop-0.20-conf-pseudo
2) I had some problems where /var directories important for Hadoop did not exist. Make sure these directories exist and are writeable:
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/pids -> /var/run/hadoop-0.20 
lrwxrwxrwx 1 root root 20 Apr 8 2011 /usr/lib/hadoop-0.20/logs -> /var/log/hadoop-0.20 

$ ll /var/lock/
drwxrwxrwx 2 root root 4096 Dec 22 00:41 subsys
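If any of those targets are missing, you can recreate them by hand before touching permissions (a minimal sketch; the root:hadoop ownership is my assumption, not something I verified against the CDH3 package):
$ sudo mkdir -p /var/run/hadoop-0.20 /var/log/hadoop-0.20 /var/lock/subsys
$ sudo chown root:hadoop /var/run/hadoop-0.20 /var/log/hadoop-0.20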

I took the easy way out and just chmod'd them:
$ sudo chmod 777 /var/run/hadoop-0.20 /var/log/hadoop-0.20

After which, you should be able to start the services: 
linux-z6tw:/var/log # for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done 
Starting Hadoop datanode daemon (hadoop-datanode): done 
starting datanode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-linux-z6tw.out 


Starting Hadoop jobtracker daemon (hadoop-jobtracker): done 
starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-linux-z6tw.out 


Starting Hadoop namenode daemon (hadoop-namenode): done 
starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-linux-z6tw.out 


Starting Hadoop secondarynamenode daemon (hadoop-secondarynamenode): done 
starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-linux-z6tw.out 


Starting Hadoop tasktracker daemon (hadoop-tasktracker): done 
starting tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-linux-z6tw.out 

Always look at the logs to make sure all the daemons are working:
sodo@linux-z6tw:/var/log/hadoop-0.20> ls -ltr 
-rw-r--r-- 1 mapred mapred 394764 Dec 27 14:25 hadoop-hadoop-jobtracker-linux-z6tw.log 
-rw-r--r-- 1 hdfs hdfs 789914 Dec 27 14:25 hadoop-hadoop-namenode-linux-z6tw.log 
-rw-r--r-- 1 hdfs hdfs 536726 Dec 27 14:25 hadoop-hadoop-datanode-linux-z6tw.log 
-rw-r--r-- 1 mapred mapred 2524526 Dec 27 14:25 hadoop-hadoop-tasktracker-linux-z6tw.log

Also, view the name node and job tracker status interfaces as outlined here: http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop#running-the-hadoop-job

One of them is the jobtracker:
http://localhost:50030/jobtracker.jsp 


Give 'er a test 
Once the daemons are running properly, test with the ol' pi calculation example: 
sodo@linux-z6tw:/> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000 
Number of Maps = 2 
Samples per Map = 100000 
Wrote input for Map #0 
Wrote input for Map #1 
Starting Job 
11/12/22 02:14:57 INFO mapred.FileInputFormat: Total input paths to process : 2 
11/12/22 02:14:58 INFO mapred.JobClient: Running job: job_201112220213_0002 
11/12/22 02:14:59 INFO mapred.JobClient: map 0% reduce 0% 
11/12/22 02:15:05 INFO mapred.JobClient: map 100% reduce 0% 
11/12/22 02:15:13 INFO mapred.JobClient: map 100% reduce 33% 
11/12/22 02:15:15 INFO mapred.JobClient: map 100% reduce 100% 
11/12/22 02:15:16 INFO mapred.JobClient: Job complete: job_201112220213_0002 
11/12/22 02:15:16 INFO mapred.JobClient: Counters: 23 
11/12/22 02:15:16 INFO mapred.JobClient: Job Counters 
11/12/22 02:15:16 INFO mapred.JobClient: Launched reduce tasks=1 
11/12/22 02:15:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9121 
11/12/22 02:15:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 
11/12/22 02:15:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/22 02:15:16 INFO mapred.JobClient: Launched map tasks=2 
11/12/22 02:15:16 INFO mapred.JobClient: Data-local map tasks=2 
11/12/22 02:15:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8995 
11/12/22 02:15:16 INFO mapred.JobClient: FileSystemCounters 
11/12/22 02:15:16 INFO mapred.JobClient: FILE_BYTES_READ=50 
11/12/22 02:15:16 INFO mapred.JobClient: HDFS_BYTES_READ=472 
11/12/22 02:15:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=156541 
11/12/22 02:15:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215 
11/12/22 02:15:16 INFO mapred.JobClient: Map-Reduce Framework 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce input groups=2 
11/12/22 02:15:16 INFO mapred.JobClient: Combine output records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Map input records=2 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce shuffle bytes=56 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce output records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Spilled Records=8 
11/12/22 02:15:16 INFO mapred.JobClient: Map output bytes=36 
11/12/22 02:15:16 INFO mapred.JobClient: Map input bytes=48 
11/12/22 02:15:16 INFO mapred.JobClient: Combine input records=0 
11/12/22 02:15:16 INFO mapred.JobClient: Map output records=4 
11/12/22 02:15:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=236 
11/12/22 02:15:16 INFO mapred.JobClient: Reduce input records=4 
Job Finished in 18.777 seconds 
Estimated value of Pi is 3.14118000000000000000 

Errors
1) Hadoop may hang if you have an incorrect /etc/hosts entry
Since I didn't have a DHCP reservation for my machine's IP, the IP address changed and the name node was sending packets out my gateway.  Hardcoding an /etc/hosts entry fixed this.
http://getsatisfaction.com/cloudera/topics/hadoop_setup_example_job_hangs_in_reduce_task_getimage_failed_java_io_ioexception_content_length_header_is_not_provided_by-1m4p8b
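For reference, the fix is just a static entry mapping the hostname to its current address (the IP below is a placeholder for whatever your machine is actually using):
$ cat /etc/hosts
127.0.0.1      localhost
192.168.1.50   linux-z6tw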

2) 11/05/02 23:59:47 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /blah/blah could only be replicated to 0 nodes
Stupidly, I had built my root filesystem with only 8GB of space, so the data node ran out of room whenever I tried to run a Hadoop job, which produced the error above.

The Hadoop dfsadmin utility is good for diagnosing issues like the one above:
sodo@linux-z6tw:/var/log/hadoop-0.20> hadoop dfsadmin -report
Configured Capacity: 33316270080 (31.03 GB)
Present Capacity: 26056040448 (24.27 GB)
DFS Remaining: 21374869504 (19.91 GB)
DFS Used: 4681170944 (4.36 GB)
DFS Used%: 17.97%
Under replicated blocks: 9
Blocks with corrupt replicas: 0
Missing blocks: 6
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 33316270080 (31.03 GB)
DFS Used: 4681170944 (4.36 GB)
Non DFS Used: 7260229632 (6.76 GB)
DFS Remaining: 21374869504(19.91 GB)
DFS Used%: 14.05%
DFS Remaining%: 64.16%
Last contact: Tue Dec 27 15:11:27 EST 2011

The resolution was to:
1) add a new filesystem to my virtual machine
2) move the data node's /tmp storage to the larger filesystem (see the sketch below)
3) move my MySQL installation to that new filesystem as well (nice instructions for that here: http://kaliphonia.com/content/linux/how-to-move-mysql-datadir-to-another-drive)
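A rough sketch of step 2, assuming the new filesystem is mounted at /mnt/data and the data node keeps its blocks under /tmp (both assumptions; check hadoop.tmp.dir and dfs.data.dir in your core-site.xml/hdfs-site.xml for the real paths):
$ sudo /etc/init.d/hadoop-0.20-datanode stop
$ sudo mkdir -p /mnt/data/hadoop
$ sudo chown hdfs:hadoop /mnt/data/hadoop               # assumed CDH3 owner/group
$ sudo rsync -a /tmp/hadoop-hdfs/ /mnt/data/hadoop/     # assumed source path
# edit hadoop.tmp.dir (or dfs.data.dir) to point at /mnt/data/hadoop, then:
$ sudo /etc/init.d/hadoop-0.20-datanode start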

3) name node cannot start due to permissions
I moved my /tmp directory to a new filesystem, and the name node did not have permission to write to the new temp directory. So I set the perms like so:
$ chmod 777 /tmp

4) ERROR 1148 (42000) at line 40: The used command is not allowed with this MySQL version
Security issue with local data loads:
http://dev.mysql.com/doc/refman/5.0/en/load-data-local.html

You must use "--local-infile" as a parameter to the mysql command line like so:
sodo@linux-z6tw:~/trendingtopics/lib/sql> mysql -u user trendingtopics_development < loadSampleData.sql --local-infile

5) Cannot delete <path>. Name node is in safe mode
The NameNode must leave safe mode before you can delete files from HDFS:
https://issues.apache.org/jira/browse/HADOOP-5937
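If it is stuck there, you can take the NameNode out of safe mode manually:
$ hadoop dfsadmin -safemode leave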

References
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/single_node_setup.html#PseudoDistributed
https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode
https://cwiki.apache.org/confluence/display/WHIRR/Quick+Start+Guide
Map Reduce Tutorial
"could only be replicated to 0 nodes" FAQ
HDFS Basics for Developers

Wednesday, August 17, 2011

min requirements and env for ec2-api-tools

I started a mini Fedora EC2 instance on Amazon Web Services the other day. The instance had a slim set of installed programs. Here is the short list of dependencies to get the ec2-api-tools running:
  • yum install openssh-clients (for SCP)
  • yum install unzip (for the zipped ec2-api-tools install zip)
  • install the latest JDK (6u27 from Oracle's site at the time)
The ec2-api-tools also depend on a few environment variables, namely the five lines at the bottom of this .bash_profile:
export PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HIVE_HOME=/usr/lib/hive
export MYBUCKET=frasestrendingtopics
export MYSERVER=localhost
export MAILTO=cacasododom@gmail.com
export JAVA_HOME=/usr/java/jdk1.6.0_24/
export EC2_HOME=/usr/bin/ec2-api-tools-1.4.4.0
export EC2_PRIVATE_KEY=~/doc/software/amazon/certs/pk-GEQDDYWBHKJKKUWNWAS.pem
export EC2_CERT=~/doc/software/amazon/certs/cert-DYWBHKJKKUWNWASYNF2YL.pem
export ACCESS_KEY_ID=8DSRMMTFQMQRER2
export SECRET_ACCESS_KEY=mAq9QzCVhmxzKpxhQxBoA5jOxZksq62jpO5mbD
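Once those are set (and $EC2_HOME/bin is on your PATH), a quick sanity check is to print the tool version and list the regions:
$ ec2ver
$ ec2-describe-regions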

Cheers..


Friday, August 5, 2011

ec2 api tools through an http proxy

I spent about an hour on this today, so I thought it would be reasonable to post about it. I was trying to access the EC2 API tools from behind my firewall at work, through our HTTP proxy server. You can get this to work by setting an environment variable called EC2_JVM_ARGS that passes proxy settings to the JVM. The examples below are from a Linux box.

Example without username authentication (using ntlmaps)
export EC2_JVM_ARGS="-DproxySet=true -DproxyHost=localhost -DproxyPort=5865 -Dhttps.proxySet=true -Dhttps.proxyHost=localhost -Dhttps.proxyPort=5865"

Example with username authentication
export EC2_JVM_ARGS="-DproxySet=true -DproxyHost=http-proxy -DproxyPort=8080 -Dhttps.proxySet=true -Dhttps.proxyHost=http-proxy -Dhttps.proxyPort=8080 -Dhttp.proxyUser=username -Dhttp.proxyPass=password -Dhttps.proxyUser=username -Dhttps.proxyPass=password"

Voila..it works!
sodo@linux-z6tw:~> ec2din -v
Setting User-Agent to [ec2-api-tools 1.4.4.0]
Using proxy [http-proxy:8080]
Using proxy credentials [username@password] for all realms
------------------------------[ REQUEST ]-------------------------------
------------------------------------------------------------------------
REQUEST ID 76b60060-790c-403f-b0dd-75d06d7f3a79


Hooray!
sodo

Reference
https://forums.aws.amazon.com/thread.jspa?messageID=52041

Wednesday, April 27, 2011

preparing an EBS volume to use as Amazon Public Dataset

I created a three-month refresh of Pete Skomoroch's Amazon Public Dataset of the Wikipedia traffic stats. The snapshot (snap-5300883c) is about 150GB of data and includes 2060 logfiles spanning 1/1/2011-3/31/2011.

To share the snapshot with Amazon, I first had to migrate the data from my S3 bucket to an EBS volume. Then I created a snapshot of the EBS volume so that the Amazon support folks could use it as a public data set. Next, I consulted an Amazon support engineer so that they could give me permission and the proper authorization number.

Once I got proper authorization, here are the steps I performed.

Migrate the data from S3 to EBS
1) Create an EBS volume of the necessary size and attach it to an EC2 instance
2) Transfer the data from S3 to that volume
3) Create an EBS snapshot
4) Share the snapshot with the Amazon account as described previously
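Steps 1, 3 and 4 can also be driven from the command line with the ec2-api-tools; a rough sketch, with the volume size, availability zone, and the volume/instance IDs as placeholders:
$ ec2-create-volume --size 200 --availability-zone us-east-1a
$ ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdf
$ ec2-create-snapshot vol-xxxxxxxx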

Share the EBS Snapshot with Amazon
1) Select the "EC2" tab in the AWS Console
2) Select "Snapshots" item in the left menu bar
3) Right-click on the snapshot you would like to share and select "Snapshot Permissions"
4) Choose the "Private" option (though it will also be available to me if you leave it as "Public")
5) Enter XXXXXXXX (number given by Amazon rep) next to "AWS Account Number 1:" and select "Save"
6) Select "Snapshot Permissions" on that snapshot again, if this was done correctly "amazon" should now show up under "Remove Create Volume Permission:"

Amazon then created their own snapshot based off of my shared snapshot. It was an interesting exercise to go through.

'sodo

creating, attaching, formatting, mounting an ebs volume

In preparation to move the most recent Wikipedia traffic data to a public dataset, I moved my S3 storage bucket to EBS. First, though, I had to create the EBS volume and copy the S3 files over to it. Here are some notes on creating, attaching, formatting and mounting an Amazon EBS volume.

Overview
* create the EBS volume
* attach the EBS volume to an instance
* validate attachment through dmesg, system logs
* partition the volume
* format the partition
* create a mount point
* mount
* copy files to the new volume

Detail
1) create the EBS volume
When creating the EBS volume, make sure it is in the same Availability Zone as your Amazon EC2 instance.


2) attach the EBS volume to an instance
3) review the system message log to verify it got attached
ip-10-32-69-206:~ # dmesg | tail
[ 1.073737] kjournald starting. Commit interval 15 seconds
[ 1.097264] EXT3-fs (sda2): using internal journal
[ 1.097280] EXT3-fs (sda2): mounted filesystem with ordered data mode
[ 1.125686] JBD: barrier-based sync failed on sda1 - disabling barriers
[ 83.353564] sdf: unknown partition table


As it is raw storage, we'll need to partition the volume.
4) partition the EBS volume
ip-10-32-69-206:~ # fdisk /dev/sdf
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-19581, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-19581, default 19581):
Using default value 19581

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.


And of course, format!
5) format the newly created partition
ip-10-32-69-206:~ # mkfs.ext3 /dev/sdf1
mke2fs 1.41.11 (14-Mar-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
9830400 inodes, 39321087 blocks
1966054 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
1200 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 34 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.


6) create a mount point for the partition
ip-10-32-69-206:~ # mkdir /mnt/data

7) mount the partition
ip-10-32-69-206:~ # mount -t ext3 /dev/sdf1 /mnt/data
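If you want the volume to come back after a reboot, you could also add an /etc/fstab entry matching the device and mount point above (a sketch; skip it if you attach volumes ad hoc):
/dev/sdf1   /mnt/data   ext3   defaults   0 0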

Good to go for copy!
8) copy the S3 storage bucket data over to the EBS volume
ip-10-32-69-206:/data/wikistats # s3cmd get s3://sodotrendingtopics/wikistats/* /data
s3://sodotrendingtopics/wikistats/pagecounts-20110101-000000.gz -> ./pagecounts-20110101-000000.gz [1 of 2161]
..


The 2161 files took about four hours to copy over from S3. Amazon does not charge you to move data within their network from S3 to EBS and vice versa.

Saturday, April 23, 2011

refreshed trendingtopics data

I was able to successfully refresh the TrendingTopics website data today. I used a sample of an updated Wikipedia dataset that I set up on Amazon S3. The updated data spans 1/1/2011-3/31/2011. As crunching through months of data would take too long or cost too much on the EC2 cloud, I didn't use the whole dataset..I sampled only one hour of weblogs from each day. You can check out my local version of TrendingTopics here (CTRL-click or wheel-click to open it in a new browser window).

In case the site is offline, you can see the more recent dates in the screenshot below:


The caveat with the refresh is that I can only load 100 records of the sample data. For some reason, the Rails app bombs when I try to load the full dataset.
* must figure out why

--more to come--

Saturday, April 16, 2011

timeline and trend scripts, part I

DataWrangling's TrendingTopics website relies on lots of shell, Hive, MySQL and Python scripts to update and perform statistical analysis on the Wikipedia data. I've been trying to make sense of and optimize them over the past couple of weeks and am slowly making headway.

I'll start at the beginning. The two main shell scripts take the most recent logfile data from Wikipedia, perform some data massaging on the logs, and then output files ready to load into MySQL, the database engine behind the web app. These two shell scripts are:
  • run_daily_timelines.sh
  • run_daily_trends.sh
The shell scripts kick off Hadoop Streaming jobs that use python scripts to perform MapReduce tasks. The scripts also utilize Hive to manipulate the structured data summarized from the weblogs.
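For context, a Hadoop Streaming invocation of this sort looks roughly like the following; the input/output paths and the mapper/reducer arguments are illustrative, not the exact ones the scripts use:
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/sodo/wikistats/raw \
    -output /user/sodo/wikistats/stage1 \
    -mapper "daily_timelines.py mapper" \
    -reducer "daily_timelines.py reducer" \
    -file daily_timelines.py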

run_daily_timelines.sh
Generally, this script uses Hadoop to convert the Wikipedia weblog data via MapReduce. The data is then loaded into Hive for further merging into temp tables. The final output is used by later scripts to refresh three of the TrendingTopics MySQL database tables:
  • pages
  • daily_timelines 
  • sample_pages

INPUT
Wikipedia traffic stats data for a selected period

DETAIL
The more detailed view of what run_daily_timelines.sh does is as follows:
1) it grabs the hourly logfiles for a number of days (based on a LIMIT constant)
2) runs a two-part Hadoop Streaming job (daily_timelines.py) to convert the logfiles, as described in my previous entry
3) deletes the local Hadoop _logs directories so that Hive can load the trend data
4) grabs the Wikipedia page ID lookup table
5) does a bunch of data munging in Hive (not in MySQL yet)
  a. creates daily_timelines, pages, raw_daily_stats_table, redirect_table, sample_pages tables
  b. overwrites redirect_table with the wiki data from step 4
  c. loads the munged stage2 output into raw_daily_stats_table
  d. overwrites the pages table with data from the recently created redirect_table
6) exports the tab delimited data out of Hive for bulk loading into MySQL
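Step 6 amounts to running Hive queries from the shell and redirecting the output to flat files; a hedged sketch of the idea (the actual queries in the script are more involved than a bare SELECT *):
$ hive -e "SELECT * FROM pages" > /mnt/pages.txt
$ hive -e "SELECT * FROM daily_timelines" > /mnt/daily_timelines.txt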

OUTPUT
The script outputs five files:
-rw-r--r-- 1 root root 162454865 Apr 16 17:24 page_lookup_nonredirects.txt
-rw-r--r-- 1 root root 61580486 Apr 16 17:33 pages.txt
-rw-r--r-- 1 root root 5564 Apr 16 17:34 sample_pages.txt
-rw-r--r-- 1 root root 36958988 Apr 16 17:51 daily_timelines.txt
-rw-r--r-- 1 root root 8684 Apr 16 17:52 sample_daily_timelines.txt


Here's what they look like. I'm grepping for Oldsmobile because I used to own a number of those cars and, more importantly, because a search on a full name like Oldsmobile drops some of the entries beginning with punctuation that might be confusing. This is the data that will be loaded into MySQL.
sodo@linux-z6tw:/mnt> grep Oldsmobile page_lookup_nonredirects.txt | head -2
Oldsmobile Oldsmobile 52040 276183917
Oldsmobile_98 Oldsmobile 98 540806 272510635


sodo@linux-z6tw:/mnt> grep Oldsmobile pages.txt | head -2
52040 Oldsmobile Oldsmobile 276183917 46 46.0
52040 Oldsmobile Oldsmobile 276183917 46 46.0

sodo@linux-z6tw:/mnt> head -2 sample_pages.txt
25895 Robert_Bunsen Robert Bunsen 276223823 100439 100439.0
14944 Insect Insect 276199176 13679 13679.0

sodo@linux-z6tw:/mnt> cat daily_timelines.txt | head -2
600744 [20110330,20110331] [21,16] 37
4838455 [20110330,20110331] [3,3] 6

sodo@linux-z6tw:/mnt> head -2 sample_daily_timelines.txt
3382 [20110330,20110331] [1077,867] 1944
4924 [20110330,20110331] [27,5770] 5797


run_daily_trends.sh
This second script interacts with the Hive tables that were created by run_daily_timelines.sh. It outputs two files that will be loaded into MySQL by later scripts.

INPUT
Wikipedia traffic stats data for a selected period

DETAIL
It does the following:
1) runs a Hadoop Streaming job (daily_trends.py) to MapReduce some trend data, as described in my other post
2) loads the trend data into Hive
3) does a bunch of data munging in Hive
  a. drop the redirect_table and sample_pages tables
  b. create a new redirect_table, raw_daily_trends_table, daily_trends, sample_pages
  c. load the page_lookup_nonredirects.txt file into redirect_table (redundant with run_daily_timelines.sh, I think)
  d. load the sample_pages.txt file into sample_pages
  e. load the MapReduce data that was just produced into raw_daily_trends_table
  f. overwrite daily_trends with the redirect_table data joined to the raw_daily_trends_table
4) output the daily_trends table information to a file
5) output sample_pages data

OUTPUT
The two files produced by run_daily_trends:
-rw-r--r-- 1 root root 37728906 Apr 16 19:36 daily_trends.txt
-rw-r--r-- 1 root root 3742 Apr 16 19:36 sample_daily_trends.txt


They mainly have statistical trend information that will be loaded into MySQL at a later time:
linux-z6tw:/home/sodo/trendingtopics/lib/scripts # head -2 /mnt/daily_trends.txt
600744 888.58405328 0.0821994936527
4838455 101.253019797 0.204124145232

linux-z6tw:/home/sodo/trendingtopics/lib/scripts # head -2 /mnt/sample_daily_trends.txt
3382 77440.6080993 0.0113402302907
3382 77440.6080993 0.0113402302907

I've made some changes to both scripts to make them configurable for a new user and will post them once perfected.

The next step is to get the refreshed data loaded into MySQL.
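As a preview, that bulk load will look something like the following (a sketch only; the real loader scripts handle the column mapping, which I'm glossing over, and the daily_trends table name is assumed from the app schema):
$ mysql --local-infile -u user trendingtopics_development \
    -e "LOAD DATA LOCAL INFILE '/mnt/daily_trends.txt' INTO TABLE daily_trends FIELDS TERMINATED BY '\t'"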
..to be continued..

References
Raw traffic stats
My Amazon Public Data Set
Pete's Original Amazon Public Data Set