Thursday, March 24, 2011

Data-driven website in the AWS cloud... oh boy!

Semil Shah's excellent article on BigData got me thinking that there is a wellspring of data in my weblogs that I'm not taking advantage of. Peter Skomoroch got me psyched to try Hadoop and Cloudera with his Hadoop World talk on rapid prototyping of data-intensive web apps. Pete also posted a great guide on how to piece together the components for the site. I think it would be a great exercise to:
1) get a version of the TrendingTopics.org site up and running
2) apply that knowledge to big dataset management tasks back at work

Pete's guide is a fabulous open source resource. However, he wrote the application two years ago, and since then many of the details of how each piece of software works have changed slightly or been deprecated. (Funny how web technology techniques become obsolete in two years!)

Rather than get into the nitty-gritty details, I think it would be helpful to take a step back and visualize the architecture that I am trying to replicate as a whole. It is not trivial:


With that baseline set, I will delve into the more technical details of the project implementation in my upcoming posts.

Next steps: getting a Cloudera-Hadoop cluster fired up with Whirr and running MapReduce on a dataset.

References
AWS
Hadoop
Hive
MySQL
Whirr
Amazon Public Data Sets
Wikipedia Traffic Stats and Raw Data
