Wednesday, April 27, 2011

preparing an EBS volume to use as Amazon Public Dataset

I created a three-month refresh of Pete Skomoroch's Amazon Public Dataset of the Wikipedia Traffic Stats. The snapshot (snap-5300883c) holds about 150GB of data: 2060 logfiles spanning 1/1/2011 through 3/31/2011.

To share the data with Amazon, I first had to migrate it from my S3 bucket to an EBS volume. Then I created a snapshot of the EBS volume so that the Amazon support folks could use it as a public data set. Finally, I contacted an Amazon support engineer, who gave me permission and the proper authorization number.

Once I had the proper authorization, these are the steps I performed.

Migrate the data from S3 to EBS
1) Create an EBS volume of the necessary size and attach it to an EC2 instance
2) Transfer the data from S3 to that volume
3) Create an EBS snapshot
4) Share the snapshot with the Amazon account as described below
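
Steps 1 and 3 can also be scripted. Below is a minimal sketch in Python, assuming the boto3 library rather than whatever tooling was used for this post. The helper names, the 10% headroom, and the /dev/sdf device name are all illustrative assumptions, not part of the original procedure.

```python
GIB = 2**30


def volume_size_gib(data_bytes, headroom_percent=10):
    """Whole-GiB volume size for data_bytes plus headroom.

    The 10% default headroom is an assumption, not an AWS requirement.
    """
    numerator = data_bytes * (100 + headroom_percent)
    return -(-numerator // (100 * GIB))  # ceiling division on integers


def create_and_attach(instance_id, zone, data_bytes, device="/dev/sdf"):
    """Step 1: create a volume in the instance's availability zone and attach it."""
    import boto3  # imported lazily so the sizing helper works without boto3

    ec2 = boto3.client("ec2")
    volume_id = ec2.create_volume(
        AvailabilityZone=zone, Size=volume_size_gib(data_bytes)
    )["VolumeId"]
    # Wait until the new volume is ready before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(Device=device, InstanceId=instance_id, VolumeId=volume_id)
    return volume_id


def snapshot_volume(volume_id, description):
    """Step 3: snapshot the volume once the S3 data has been copied onto it."""
    import boto3

    ec2 = boto3.client("ec2")
    response = ec2.create_snapshot(VolumeId=volume_id, Description=description)
    return response["SnapshotId"]
```

Step 2 is not shown: the copy from S3 runs on the instance itself, after the attached volume has been formatted and mounted.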

Share the EBS Snapshot with Amazon
1) Select the "EC2" tab in the AWS Console
2) Select the "Snapshots" item in the left menu bar
3) Right-click on the snapshot you would like to share and select "Snapshot Permissions"
4) Choose the "Private" option (though the snapshot will also be available to Amazon if you leave it as "Public")
5) Enter XXXXXXXX (number given by Amazon rep) next to "AWS Account Number 1:" and select "Save"
6) Select "Snapshot Permissions" on that snapshot again. If this was done correctly, "amazon" should now show up under "Remove Create Volume Permission:"
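
The console steps above have an API equivalent. Here is a hedged sketch, again assuming boto3, which this post did not use: it grants create-volume permission on a snapshot to a single AWS account and then reads the permission back, mirroring the check in step 6. The function names are hypothetical, and the account number must be the one supplied by the Amazon rep.

```python
CREATE_VOLUME_PERMISSION = "createVolumePermission"


def share_request(snapshot_id, account_id):
    """Build the arguments that grant one account create-volume permission."""
    return {
        "SnapshotId": snapshot_id,
        "Attribute": CREATE_VOLUME_PERMISSION,
        "OperationType": "add",
        "UserIds": [account_id],
    }


def is_shared_with(attribute_response, account_id):
    """Check a describe_snapshot_attribute response for the given account."""
    permissions = attribute_response.get("CreateVolumePermissions", [])
    return any(entry.get("UserId") == account_id for entry in permissions)


def share_snapshot(snapshot_id, account_id):
    """Share the snapshot, then confirm the permission was recorded."""
    import boto3  # imported lazily so the helpers above work without boto3

    ec2 = boto3.client("ec2")
    ec2.modify_snapshot_attribute(**share_request(snapshot_id, account_id))
    response = ec2.describe_snapshot_attribute(
        SnapshotId=snapshot_id, Attribute=CREATE_VOLUME_PERMISSION
    )
    return is_shared_with(response, account_id)
```

Sharing with a specific account this way corresponds to the "Private" option in the console; the snapshot stays invisible to everyone else.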

Amazon then created their own snapshot based off of my shared snapshot. It was an interesting exercise to go through.



  1. Hi,
    My name is Mobeen, and I am a CS student at Earlham College in Richmond, IN. I am working on a project for a database management class, and I would like to use your data for it. Where and how can I download the data? Is it available for free?

    I would really appreciate your help.

    You can contact me at

    Thanks a lot,

  2. Mobeen,
    The data is available as a public data set on Amazon EC2 here:

    The best way to get started is to read (and read multiple times) Pete Skomoroch's great guide on Tracking Trends with Hadoop & Hive on EC2:

    Once you have absorbed Pete's overview, my website digs into all the nitty-gritty details of "getting it done." It picks up where Pete's discussion leaves off: