When testing a web app you usually end up spidering the site, but what if the site could give you most of that information without you having to go hunting for it? Enter sitemap.xml, a file a lot of sites use to tell spiders, like Google's, all about their content.
This script parses that file to extract all the URLs, then requests each one through your proxy of choice (Burp, ZAP, etc.). It won't find anything that isn't mentioned in the file and it won't do any brute forcing, but it is a nice way to identify all the pages on the site that the admins want you to know about.
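To give a feel for the parsing step, here is a minimal sketch of pulling the URLs out of a sitemap. It is not the script's actual code: the extract_urls name is made up for illustration, and it assumes the REXML library that ships with standard Ruby installs.

```ruby
require 'rexml/document'

# Hypothetical helper: collect the text of every <loc> element in a sitemap.
# Matching on the local element name sidesteps the sitemap XML namespace.
def extract_urls(xml)
  doc = REXML::Document.new(xml)
  urls = []
  doc.each_recursive do |element|
    urls << element.text.to_s.strip if element.name == 'loc'
  end
  urls.reject(&:empty?)
end
```

Because it matches on any loc element, the same helper would also pick up the entries of a sitemap index file, which point at further sitemaps rather than pages.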
sitemap2proxy is a simple Ruby script and doesn't require any additional gems to be installed. Just make it executable and that's it.
Usage is pretty simple: you can either specify a sitemap that you've already downloaded or point it at one on the site. It will take either raw XML (sitemap.xml) or a gzip'ed file (sitemap.xml.gz). I've not seen any other variants, but if there are any let me know and I'll add handling for them. The other parameter it requires is the proxy URL.
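Handling both variants can be sketched roughly like this. It is an illustration rather than the script's own code: the sitemap_xml name is hypothetical, and it assumes the input is recognised by the two gzip magic bytes rather than by file extension.

```ruby
require 'zlib'
require 'stringio'

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = "\x1f\x8b".b

# Hypothetical helper: return sitemap XML whether the input was raw or gzip'ed.
def sitemap_xml(raw)
  data = raw.b
  if data.start_with?(GZIP_MAGIC)
    data = Zlib::GzipReader.new(StringIO.new(data)).read
  end
  # Assumption: sitemaps are UTF-8, as the sitemap protocol requires.
  data.force_encoding(Encoding::UTF_8)
end
```

Sniffing the content instead of trusting the .gz extension means the same code works for a downloaded file and for a response body fetched from the site.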
By default the requests are made with the Googlebot user agent string to try to hide the traffic in the logs. If you want to change this, you can specify your own agent using the --ua parameter.
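For the curious, sending a request through a proxy with a chosen user agent looks something like the sketch below, using Ruby's standard Net::HTTP. The helper names are made up, and the Googlebot string is the commonly published one, which I am assuming rather than quoting from the script.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Assumed default: the widely published Googlebot user agent string.
GOOGLEBOT_UA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

# Hypothetical helper: an HTTP client for target_url that talks via proxy_url.
def proxy_client(target_url, proxy_url)
  target = URI.parse(target_url)
  proxy = URI.parse(proxy_url)
  http = Net::HTTP.new(target.host, target.port, proxy.host, proxy.port)
  if target.scheme == 'https'
    http.use_ssl = true
    # Intercepting proxies re-sign TLS traffic, so verification is skipped here.
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  end
  http
end

# Hypothetical helper: a GET request carrying the chosen user agent.
def build_request(target_url, user_agent = GOOGLEBOT_UA)
  request = Net::HTTP::Get.new(URI.parse(target_url).request_uri)
  request['User-Agent'] = user_agent
  request
end

# Send one URL through the proxy and return the response code as a string.
def fetch_via_proxy(target_url, proxy_url, user_agent = GOOGLEBOT_UA)
  proxy_client(target_url, proxy_url).request(build_request(target_url, user_agent)).code
end
```

Nothing is sent until fetch_via_proxy is called, so the client and request can be inspected first if you want to check what will go over the wire.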
Here are some examples.
Grab Google's sitemap.xml file and pass it through a local proxy on port 8080:
./sitemap2proxy.rb --url http://www.google.com/sitemap.xml --proxy http://localhost:8080
Note: I wouldn't recommend running this against Google. They have 35k records in their sitemap and just parsing that takes quite a while.
Do the same thing, this time pretending to be the Yahoo bot:
./sitemap2proxy.rb --url http://www.google.com/sitemap.xml \
  --proxy http://localhost:8080 \
  --ua "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Parse a file you've already downloaded and send it through a proxy on a different machine:
./sitemap2proxy.rb --file sitemap.xml.gz --proxy http://proxyserver.int:8080
Do the above but verbosely:
./sitemap2proxy.rb -v --file sitemap.xml.gz --proxy http://proxyserver.int:8080
If you are stuck, simply ask for instructions:
While testing this I found that the robots.txt file on google.com specifies a bunch of additional sitemaps. I didn't know you could do that. You should always be checking the robots.txt file for juicy stuff, and I think the possible findings just got juicier.
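Pulling those extra sitemaps out of a robots.txt file is easy enough to sketch. The helper below is hypothetical and not part of sitemap2proxy; it simply assumes the "Sitemap: URL" line format described by the sitemap protocol.

```ruby
# Hypothetical helper: list the sitemap URLs declared in a robots.txt body.
# The directive name is matched case-insensitively at the start of a line.
def sitemaps_from_robots(robots_txt)
  robots_txt.scan(/^[ \t]*sitemap[ \t]*:[ \t]*(\S+)/i).flatten
end
```

Each URL it returns could then be passed to the script with --url in the same way as the main sitemap.xml.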
- Version 1.1 - Added response code stats
- Version 1 - Released