Saturday, April 16, 2011

timeline and trend scripts, part I

DataWrangling's TrendingTopics website relies on a pile of shell, Hive, MySQL and Python scripts to update the Wikipedia data and perform statistical analysis on it. I've been trying to make sense of them and optimize them over the past couple of weeks, and am slowly making headway.

I'll start at the beginning. The two main shell scripts take the most recent logfile data from Wikipedia, massage the logs a bit and then output files ready to load into MySQL, the database engine behind the web app. These two shell scripts are:
  • run_daily_timelines.sh
  • run_daily_trends.sh
The shell scripts kick off Hadoop Streaming jobs that use Python scripts to perform the MapReduce tasks. The scripts also use Hive to manipulate the structured data summarized from the weblogs.
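For reference, a Streaming job of this sort is launched with a command along these lines (a minimal sketch; the jar path, HDFS directories and mapper/reducer arguments are my assumptions, not a verbatim excerpt from the scripts):

# sketch only: the jar location and HDFS paths are assumptions
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input /wikistats/pagecounts \
  -output /wikistats/stage1 \
  -mapper "daily_timelines.py mapper1" \
  -reducer "daily_timelines.py reducer1" \
  -file daily_timelines.py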

run_daily_timelines.sh
In broad strokes, this script runs MapReduce jobs on a Hadoop cluster to convert the Wikipedia weblog data, then loads the results into Hive for further merging through temp tables. The final output is used by later scripts to refresh three of the TrendingTopics MySQL database tables:
  • pages
  • daily_timelines 
  • sample_pages

INPUT
Wikipedia traffic stats data for a selected period

DETAIL
The more detailed view of what run_daily_timelines.sh does is as follows:
1) grabs the hourly logfiles for a number of days (based on a LIMIT constant)
2) runs a two-part Hadoop Streaming job (daily_timelines.py) to convert the logfiles, as described in my previous entry
3) deletes the local Hadoop _logs directories so that Hive can load the trend data
4) grabs the Wikipedia page ID lookup table
5) does a bunch of data munging in Hive (not in MySQL yet; see the sketch after this list)
  a. creates the daily_timelines, pages, raw_daily_stats_table, redirect_table and sample_pages tables
  b. overwrites redirect_table with the wiki data from step 4
  c. loads the munged stage2 output into raw_daily_stats_table
  d. overwrites the pages table with data from the freshly created redirect_table
6) exports the tab-delimited data out of Hive for bulk loading into MySQL
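To make steps 5 and 6 a bit more concrete, the Hive portion boils down to statements along these lines (my own sketch, not a verbatim excerpt; the column lists and join key are guesses based on the output files shown below):

# sketch of step 5: table and column names are assumptions
hive -e "
CREATE TABLE IF NOT EXISTS redirect_table (
  redirect_title STRING, true_title STRING,
  page_id BIGINT, page_latest BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/mnt/page_lookup_nonredirects.txt'
OVERWRITE INTO TABLE redirect_table;

INSERT OVERWRITE TABLE pages
SELECT r.page_id, r.redirect_title, r.true_title,
       r.page_latest, s.total_pageviews, s.monthly_trend
FROM redirect_table r
JOIN raw_daily_stats_table s ON (r.redirect_title = s.page_title);"

# sketch of step 6: dump the tab-delimited rows back out for MySQL
hive -e "SELECT * FROM pages;" > /mnt/pages.txt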

OUTPUT
The script outputs five files:
-rw-r--r-- 1 root root 162454865 Apr 16 17:24 page_lookup_nonredirects.txt
-rw-r--r-- 1 root root 61580486 Apr 16 17:33 pages.txt
-rw-r--r-- 1 root root 5564 Apr 16 17:34 sample_pages.txt
-rw-r--r-- 1 root root 36958988 Apr 16 17:51 daily_timelines.txt
-rw-r--r-- 1 root root 8684 Apr 16 17:52 sample_daily_timelines.txt


Here's what they look like. I'm grepping for Oldsmobile because I used to own a few of those cars and, more importantly, because a search on a full name like Oldsmobile drops some of the entries beginning with punctuation, which might otherwise be confusing. This is the data that will be loaded into MySQL:
sodo@linux-z6tw:/mnt> grep Oldsmobile page_lookup_nonredirects.txt | head -2
Oldsmobile Oldsmobile 52040 276183917
Oldsmobile_98 Oldsmobile 98 540806 272510635


sodo@linux-z6tw:/mnt> grep Oldsmobile pages.txt | head -2
52040 Oldsmobile Oldsmobile 276183917 46 46.0
52040 Oldsmobile Oldsmobile 276183917 46 46.0

sodo@linux-z6tw:/mnt> head -2 sample_pages.txt
25895 Robert_Bunsen Robert Bunsen 276223823 100439 100439.0
14944 Insect Insect 276199176 13679 13679.0


Each daily_timelines row holds a page ID, an array of dates, an array of daily pageview counts and the total views for the period:
sodo@linux-z6tw:/mnt> cat daily_timelines.txt | head -2
600744 [20110330,20110331] [21,16] 37
4838455 [20110330,20110331] [3,3] 6

sodo@linux-z6tw:/mnt> cat sample_daily_timelines.txt | head -2
3382 [20110330,20110331] [1077,867] 1944
4924 [20110330,20110331] [27,5770] 5797


run_daily_trends.sh
This second script works against the Hive tables created by run_daily_timelines.sh and outputs two files that later scripts will load into MySQL.

INPUT
Wikipedia traffic stats data for a selected period

DETAIL
It does the following:
1) runs a Hadoop Streaming job (daily_trends.py) to MapReduce the trend data, as per my other post
2) loads the trend data into Hive
3) does a bunch of data munging in Hive (see the sketch after this list)
  a. drops the redirect_table and sample_pages tables
  b. creates a new redirect_table, raw_daily_trends_table, daily_trends and sample_pages
  c. loads the page_lookup_nonredirects.txt file into redirect_table (redundant with run_daily_timelines, I think)
  d. loads the sample_pages.txt file into sample_pages
  e. loads the MapReduce output that was just produced into raw_daily_trends_table
  f. overwrites daily_trends with the redirect_table data joined to the raw_daily_trends_table
4) outputs the daily_trends table to a file
5) outputs the sample_pages data
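Step 3f and the exports presumably reduce to something like this in Hive (again a sketch; the trend and error column names are guesses from eyeballing the output files below):

# sketch: join the lookup table to the raw trend scores, then dump for MySQL
hive -e "
INSERT OVERWRITE TABLE daily_trends
SELECT r.page_id, t.trend, t.error
FROM redirect_table r
JOIN raw_daily_trends_table t ON (r.redirect_title = t.page_title);"

hive -e "SELECT * FROM daily_trends;" > /mnt/daily_trends.txt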

OUTPUT
The two files produced by run_daily_trends:
-rw-r--r-- 1 root root 37728906 Apr 16 19:36 daily_trends.txt
-rw-r--r-- 1 root root 3742 Apr 16 19:36 sample_daily_trends.txt


They mainly contain statistical trend information that will be loaded into MySQL later on:
linux-z6tw:/home/sodo/trendingtopics/lib/scripts # head -2 /mnt/daily_trends.txt
600744 888.58405328 0.0821994936527
4838455 101.253019797 0.204124145232

linux-z6tw:/home/sodo/trendingtopics/lib/scripts # head -2 /mnt/sample_daily_trends.txt
3382 77440.6080993 0.0113402302907
3382 77440.6080993 0.0113402302907


I've made some changes to both scripts to make them configurable for a new user; I'll post them once they're polished.

The next step is to get the refreshed data loaded into MySQL.
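Presumably that will boil down to a bulk load per file, something like this (a sketch assuming a LOAD DATA LOCAL INFILE approach; the database name and column list are my guesses):

# sketch only: db name, column list and path are assumptions
mysql --local-infile=1 -u root trendingtopics_production -e "
LOAD DATA LOCAL INFILE '/mnt/daily_trends.txt'
INTO TABLE daily_trends
FIELDS TERMINATED BY '\t'
(page_id, trend, error);"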
..to be continued..

References
Raw traffic stats
My Amazon Public Data Set
Pete's Original Amazon Public Data Set
