Script Needed: Scan and Download HTML Files Linked to From Portions of Sitemap Page

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.

Post by Scraper » Mon Mar 21, 2016 10:24 am

OS: Windows 7
BlackWidow Version: Unicode v6.30

I want to scan a site using the website's sitemap page in order to limit the content I want to download. The sitemap page is organized into four sections: a) About, b) Team, c) Categories, d) Content.

I only want to scan the links for Categories and Content.

Furthermore, I want to download only the top-level HTML files of the Categories and Content pages.

For example, if one of the Categories is "Great Scientists of the 20th Century", then I want to download the HTML page
http://www.MainSite.com/categories/great-scientists-of-the-20th-century.html

And I want to do this for all the top-level Categories and Content pages listed in the Sitemap.

Can someone help me create the bw6 script that will accomplish this task?


Post by Support » Mon Mar 21, 2016 12:54 pm

Basically, set the scanner not to scan everything, and add filters for the directories you want to scan, restricted to HTML only. Like this...

Code:

[BlackWidow v6.00 filters]
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 10:10 seconds
4 threads
[x] Add /categories/[^/]+\.html$ from URL using regular expression
[end]
Change categories to whatever directory you want, and add another filter of the same form with a different directory.
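
To illustrate (a minimal sketch in Python rather than bw6, and assuming the Content pages live under a /content/ directory), the second filter line would read "Add /content/[^/]+\.html$ from URL using regular expression". The [^/]+ part is what limits the match to top-level pages, since it cannot cross another slash:

Code:

import re

# The pattern from the bw6 filter above, plus the same pattern for an
# assumed /content/ directory.
patterns = [
    re.compile(r"/categories/[^/]+\.html$"),
    re.compile(r"/content/[^/]+\.html$"),
]

urls = [
    "http://www.MainSite.com/categories/great-scientists-of-the-20th-century.html",
    "http://www.MainSite.com/categories/physics/deeper-page.html",  # nested, not top-level
    "http://www.MainSite.com/content/some-article.html",
    "http://www.MainSite.com/about.html",
]

for url in urls:
    keep = any(p.search(url) for p in patterns)
    print("KEEP" if keep else "SKIP", url)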
Your support team.
http://SoftByteLabs.com


Post by Scraper » Mon Mar 21, 2016 5:13 pm

Thanks for the answer.

In the example script you wrote (above), is the URL element needed?
Also, you did not include the "Structure Records". Are they needed?

Example Script:

Code:

[BlackWidow v6.00 filters]
URL = http://www.MySite.com/sitemap
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 3
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page:
Startup referrer:
[ ] Slow down by 10:15 seconds
6 threads
[x] Add /categories/[^/]+\.html$ from URL using regular expression

[BlackWidow v6.30 structure]

{Begin Record}
Path = http://www.MySite.com/sitemap
FileName = sitemap
MimeType = text/html
Title = Sitemap
FileSize = 278219
Modified = 42449.9536531597
Selected = Yes
Referer = http://www.MySite.com/sitemap
{End Record}

[end]
- Are the elements (FileSize and Modified) necessary? How are they generated?
>> "FileSize = 278219" and "Modified = 42449.9536531597"

- What is the difference between "URL =" in the Filters section, and "Path =" in the Structure section?


Can you tell me the meaning of some of this terminology...

- "Scan External Links" (I assume this means links away from the current site.)
- What is the "Startup Referrer:" ?
- What does it mean to only "Verify" external links?
- "FileName = sitemap" (How does this affect the parsing engine?)
- What is the purpose of "Title ="?
- How is "Referrer =" different from "Startup Referrer:"?
- What is the purpose of the entry "Default Index Page" ?
- If you do not want to follow (scan) any external links, can you set the number to "-1" ?


Post by Support » Mon Mar 21, 2016 8:23 pm

The URL element is the URL to scan, that's from the browser tab.
The structure record is not included, as the above is only the filters that tell BW what to scan and what not to scan. It is not the scan data itself; that would be the bw6 file after saving the structure.

FileSize and Modified are not needed, but BW uses them to fill in the structure display when you load the bw6 file.
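
For what it's worth, that Modified value looks like a Delphi TDateTime / OLE Automation serial date (days since 1899-12-30, with the time of day as the fraction). That format is an assumption, but it decodes to a date consistent with this thread:

Code:

from datetime import datetime, timedelta

# Assumption: Modified is an OLE Automation / Delphi TDateTime serial,
# i.e. days (with fractional time of day) since 1899-12-30.
OLE_EPOCH = datetime(1899, 12, 30)

def from_ole_date(serial: float) -> datetime:
    return OLE_EPOCH + timedelta(days=serial)

print(from_ole_date(42449.9536531597))  # -> 2016-03-20 22:53:15...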

There is no difference between URL and Path. We called it Path because it's the path to the file, and URL because it's the address in the browser to scan.

Scan External Links will scan all links, from all websites, eventually scanning the whole Internet if you don't have any filters to limit the scan :shock:

The Startup Referrer sends the server the URL of a page where the link was supposedly clicked from. It is not needed most of the time.
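
Presumably this just sets the HTTP Referer header on the first request. A rough sketch of the equivalent, with hypothetical URLs:

Code:

import urllib.request

# Hypothetical example: a request for the sitemap that claims the link
# was clicked on the site's front page (the HTTP Referer header).
req = urllib.request.Request(
    "http://www.MySite.com/sitemap",
    headers={"Referer": "http://www.MySite.com/"},
)
print(req.get_header("Referer"))  # -> http://www.MySite.com/
# urllib.request.urlopen(req) would then fetch the page with that header set.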

Verify" external links only verify that they are good and if not, they will show in the Link Errors list. Nice to know if you have some links on your site pointing to other sites which may no longer exist.

FileName = sitemap: not sure where you got that one!

The Referrer is the page from which the link was found.

The Default Index Page is the page the web server sends by default. For example, http://mysite.com/ doesn't name a page; usually it's index.html, as in http://mysite.com/index.html, but it could also be index.php or default.asp. If you know it, BW will use it for all links ending with a slash /
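
In other words (a rough sketch; index.html is just the usual default):

Code:

# Sketch: how a known default index page would be applied to links
# ending in a slash, per the explanation above.
DEFAULT_INDEX = "index.html"

def resolve(url: str) -> str:
    return url + DEFAULT_INDEX if url.endswith("/") else url

print(resolve("http://mysite.com/"))          # -> http://mysite.com/index.html
print(resolve("http://mysite.com/page.php"))  # unchanged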

No, you cannot set it to -1. If you don't want to follow any external links, just uncheck Scan External Links instead.
Your support team.
http://SoftByteLabs.com
