1) Get a version of the TrendingTopics.org site up and running.
2) Apply that knowledge to big dataset management tasks back at work.
Pete's guide is a fabulous open source resource. However, it has been two years since he wrote the application, and many of the details about how each piece of software works have changed slightly or been deprecated. (Funny how web technologies become obsolete in two years!)
Rather than get into the nitty-gritty details, I think it would be helpful to take a step back and visualize, as a whole, the architecture that I am trying to replicate. It is not inconsequential:

With that baseline set, I will delve into the more technical details of the project implementation in my upcoming posts.
Next steps: getting a Cloudera-Hadoop cluster fired up with Whirr and running MapReduce on a dataset.
References
AWS
Hadoop
Hive
MySQL
Whirr
Amazon Public Data Sets
Wikipedia Traffic Stats and Raw Data