Page count and sitemap cross-check

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
Posts: 6
Joined: Fri Oct 07, 2011 3:58 pm


Post by iainhu »

I mainly use BW for site checking, so these are just a couple of ideas....

1. How about an end-of-crawl report giving stats on the number of pages (HTML documents) retrieved, maybe min, max and average sizes, load times, and the numbers and types of images, CSS files etc.? BW must know all this info from its crawl, so it may just be a case of presenting it at the end.

2. How about a sitemap cross-check? Give BW a sitemap location and have it verify that what is in the sitemap is found and working, and that what's found is included in the sitemap. Maybe even offer to generate a sitemap from what's found...crawl my site and generate a sitemap please...

3. Comparison to prior versions. BW is great for finding link and other errors, but wouldn't it be neat if it could save copies of crawled sites and then compare them to new versions? This would be a great site integrity check. So I make a change...a known change...and then run a crawl against the new site. The BW crawl should report no changed pages other than the known change. If there are other differences, then I have possibly unintentionally altered something else as well. Of course dynamic content is an issue...even a page that only carries a timestamp will look different from one crawl to the next...but anyway, a germ of an idea?
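
To make idea 2 concrete, here is a rough sketch of the cross-check in standalone Python. It is not BW's Expert script language, and the names `sitemap_urls`, `cross_check` and the `crawled` dictionary (URL mapped to HTTP status code) are made up for illustration. It assumes the sitemap follows the standard sitemaps.org format.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def cross_check(sitemap, crawled):
    """Compare sitemap URLs against crawl results.

    sitemap: set of URLs taken from the sitemap.
    crawled: dict mapping URL -> HTTP status code (hypothetical crawl output).
    """
    found = set(crawled)
    return {
        # Listed in the sitemap but never reached by the crawl.
        "missing_from_crawl": sorted(sitemap - found),
        # Listed in the sitemap and reached, but returned an error status.
        "broken": sorted(u for u in sitemap & found if crawled[u] >= 400),
        # Reached and working, but not listed in the sitemap.
        "missing_from_sitemap": sorted(
            u for u in found - sitemap if crawled[u] < 400
        ),
    }
```

Generating a sitemap from the crawl would then just be a matter of writing out the working URLs in `crawled` as `<url><loc>` entries.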


Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: Page count and sitemap cross-check

Post by Support »

All of your suggestions can be done using the Expert script. Your ideas are good ones, so I'll see what we can do to make such scripts.

3. Yes, that would be great, but large sites and those using JS, Flash, ASP, PHP or anything else that is not plain .html will be a problem, because a script such as PHP doesn't send the page source, only the page output. The timestamp is no problem, as we can just verify the data itself, but dynamic pages will not work at all.
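
As a rough illustration of "verifying the data itself", the sketch below compares two saved crawls by fingerprinting each page's output after stripping anything that looks like a timestamp. This is plain Python, not an Expert script, and everything in it is an assumption: the function names, the `old`/`new` dictionaries (URL mapped to saved page output) and especially the timestamp pattern, which would need adjusting per site.

```python
import hashlib
import re

# Assumed timestamp format, e.g. "2011-10-07 15:58:02"; adjust per site.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?")


def fingerprint(html):
    """Hash the page output with timestamps removed and whitespace collapsed."""
    normalized = TIMESTAMP.sub("", html)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_pages(old, new):
    """Return the sorted URLs that were added, removed, or changed.

    old, new: dicts mapping URL -> saved page output from two crawls.
    """
    urls = set(old) | set(new)
    return sorted(
        u
        for u in urls
        if u not in old
        or u not in new
        or fingerprint(old[u]) != fingerprint(new[u])
    )
```

After a known change, the list returned by `changed_pages` should contain only the page that was edited. Pages whose output really differs on every request would still show up as changed.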
Your support team.
