Analysing Amazon's Buckets

Wed 25th May 11

So, as I promised, here is some analysis of the data I got from running my Bucket Finder tool.

I decided to run it with a list of names, as I figured that most people creating buckets will probably name them after themselves. Rather than run a huge list to start off, I went with the list of common names from Packet Storm. This list contains 2268 names and didn't take too long to run through. Here is a breakdown of the results.

Buckets

Type	Count
Don't Exist	1206
Private	848
Public	131

As you can see, most of the names tried don't exist, but 5% do and are public, which I think is a good hit rate. Packet Storm has other word lists which top 100,000 words; if the same return holds true then we are looking at over 5000 buckets to investigate.
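
To make the three categories concrete, here is a minimal Python sketch of how the response to an anonymous request for a bucket URL could be sorted into them, plus the projection arithmetic from the paragraph above. It assumes S3's usual behaviour for unauthenticated requests (404 for a bucket that doesn't exist, 403 for one that denies access, 200 for one that lists its contents). It is not Bucket Finder's actual code, and the function names are made up for illustration.

```python
def classify_bucket(status_code: int) -> str:
    """Map the HTTP status of an anonymous GET on a bucket URL to a category.

    Assumption: 404 = no such bucket, 403 = access denied, 200 = listing allowed.
    Anything else (for example a redirect to another region) is left as Unknown.
    """
    if status_code == 404:
        return "Don't Exist"
    if status_code == 403:
        return "Private"
    if status_code == 200:
        return "Public"
    return "Unknown"


def projected_public(wordlist_size: int, public_found: int, names_tried: int) -> int:
    """Scale the observed public hit rate up to a larger word list (rounded down)."""
    return wordlist_size * public_found // names_tried


if __name__ == "__main__":
    for code in (404, 403, 200, 301):
        print(code, classify_bucket(code))
    # 131 public buckets from 2268 names, scaled to a 100,000 word list
    print(projected_public(100_000, 131, 2268))
```

The projection simply assumes the hit rate for common names carries over to a general word list, which may well not be true.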

Files

Of the public buckets found this is the breakdown of the files found in them:

Type	Count
Private	6016
Public	9683
Total	15699

This shows that even when buckets are made public, a good chunk of the files in them (6016 of 15699, around 38%) are still private. This may imply that people using the system know what they are doing and are deliberately choosing which files to share and which to keep private, or it may be that some of the applications used to manage the buckets require the buckets to be public but then create the files in them as private.

Finally, here is a breakdown of the public file types I found based on file extension:

Type	Extensions	Count
Images	jpg|png|gif|tiff|psd|bmp	7086
Web	html|css|js	1377
Videos and Music	mp3|mp4|flv|mov|avi|wmv|m4v|aa|mpg	436
Documents	pdf|doc|xls|ppt	80
Archives	rar|zip|gz	57
SQL	sql	1
Other		646
Total		9683

And a pretty pie chart to show it as well:

[Pie chart: Breakdown of public files]
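
The extension breakdown in the table above could be reproduced with something like the following Python sketch. The extension groups are taken straight from the table; the function names, the sample keys and the decision to match extensions case-insensitively are assumptions made for illustration, not a record of how the original counts were produced.

```python
import os
from collections import Counter

# Extension groups copied from the table above.
CATEGORIES = {
    "Images": {"jpg", "png", "gif", "tiff", "psd", "bmp"},
    "Web": {"html", "css", "js"},
    "Videos and Music": {"mp3", "mp4", "flv", "mov", "avi", "wmv", "m4v", "aa", "mpg"},
    "Documents": {"pdf", "doc", "xls", "ppt"},
    "Archives": {"rar", "zip", "gz"},
    "SQL": {"sql"},
}


def categorise(key: str) -> str:
    """Return the table category for a file key, based on its final extension."""
    # Assumption: extensions are compared case-insensitively.
    ext = os.path.splitext(key)[1].lstrip(".").lower()
    for name, extensions in CATEGORIES.items():
        if ext in extensions:
            return name
    return "Other"


def breakdown(keys):
    """Count how many keys fall into each category."""
    return Counter(categorise(key) for key in keys)


if __name__ == "__main__":
    # Made-up sample keys, purely for demonstration.
    sample = ["photos/baby.JPG", "site/index.html", "backup.tar.gz", "README"]
    for category, count in breakdown(sample).items():
        print(category, count)
```

Only the last extension is looked at, so a file such as `backup.tar.gz` counts as an archive and anything without a recognised extension lands in Other.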

Most people are using S3 to store images. I grabbed random selections of these and found they were mostly personal photo collections, with lots of photos of babies, which suggests people want to share them and find Amazon a good way to throw them quickly onto the net.

Browsing some of the documents, I found an MOD training requisition form that included an SSN and loads of other personal data, a couple of sets of company accounts and some other company documents that really shouldn't have been online.

The videos didn't reveal much of interest; the ones I grabbed were mostly training and motivational material.

In the music category there were a few people sharing large mp3 collections with the world and I now have a couple of new bands whose music I'll be following.

All-in-all, a pretty mixed bag. There are definitely some gems in there that are worth pulling out but with the amount of data to trawl through it is either going to take a lot of human hours or some very good automation to try to spot them. If anyone has ideas on how to automate this let me know and I'll see what I can do about building things in.
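
As one possible starting point for that automation, here is a rough Python sketch that ranks file keys so the likeliest "gems" float to the top. Everything in it is an assumption for illustration: the scores given to each extension, the keyword list and the function names are all hypothetical and would need tuning against real data.

```python
import os

# Hypothetical weights: the document, archive and SQL extensions from the
# breakdown above score higher than images or media, which score nothing.
PRIORITY_EXTENSIONS = {
    "sql": 3,
    "zip": 2, "rar": 2, "gz": 2,
    "pdf": 1, "doc": 1, "xls": 1, "ppt": 1,
}

# Hypothetical keywords that might hint at sensitive content.
KEYWORDS = ("backup", "password", "account", "invoice")


def score(key: str) -> int:
    """Score a file key by its extension and by keywords found in its name."""
    name = key.lower()
    ext = os.path.splitext(name)[1].lstrip(".")
    points = PRIORITY_EXTENSIONS.get(ext, 0)
    points += sum(1 for word in KEYWORDS if word in name)
    return points


def triage(keys, limit=20):
    """Return up to `limit` keys with a non-zero score, highest score first."""
    ranked = sorted(keys, key=score, reverse=True)
    return [key for key in ranked if score(key) > 0][:limit]


if __name__ == "__main__":
    # Made-up keys, purely for demonstration.
    for key in triage(["baby.jpg", "db/backup.sql", "accounts-2010.xls"]):
        print(score(key), key)
```

Filename matching like this will miss anything with an innocuous name, so it only narrows down where the human hours get spent rather than replacing them.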
