Filter Help

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
jdahlin
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Post by jdahlin » Thu Jul 19, 2012 5:37 pm

I am trying to scan our intranet.
I started off with "Scan Everything", "Stay within the full URL", and nothing for the "External Links".
But instead of staying within "intranet.companyName.org", it also went to "directory.companyName.org".

There are also a couple of directories that I want to exclude, and I am having a hard time getting the filters set right.
1- I only want "http://intranet.companyName.org/" and no other subdomains.
2- exclude everything under "http://intranet.companyName.org/phone_directory/"
3- exclude scanning URLs like "http://intranet.companyName.org/news/articles/2012/11" or "http://intranet.companyName.org/news/articles/2011/1", but I do want files (they are all .html) that are contained in these directories. (The reason for preventing these directories from being scanned is because the crawler hangs on them... I believe this is due to our proxy settings.)

Thanks!

Support
Site Admin
Posts: 1830
Joined: Sun Oct 02, 2011 10:49 am

Re: Filter Help

Post by Support » Thu Jul 19, 2012 6:41 pm

OK, in this case, do not select "Scan everything". Select "Scan whole site" and "Scan external links" instead, and let the filters decide what gets scanned. Then add filters so that only what you want is scanned.

So, first, let's not scan the /phone_directory/ directory; then let's scan news/articles for .htm and .html files only; then do not scan anything else in news/articles/; and finally scan everything else in http://intranet.companyName.org/.

The way the filters work is that the first filter to match wins, and the rest of the filters are ignored. Here are the complete filters. Just change intranet.companyName.org to whatever yours is...

[BlackWidow v6.00 filters]
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 10:60 seconds
[x] Do not follow /phone_directory/ using regular expression
[x] Follow ^http://intranet.companyName.org/news/articles/.*\.html?$ using regular expression
[x] Do not follow /news/articles/ using regular expression
[x] Follow ^http://intranet.companyName.org/ using regular expression
[end]
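
For anyone who wants to check the first-match-wins behaviour outside BlackWidow, here is a minimal Python sketch. It only models the rule described above (the first filter whose regular expression matches decides, and the rest are ignored). The `decide` helper, the `FILTERS` list layout, the "skip" fallback when nothing matches, and the use of Python's `re.search` are all assumptions for illustration, not BlackWidow's actual engine or regex flavour.

```python
import re

# The four filters from the list above, in the same order.
# The (action, pattern) tuple layout is hypothetical.
FILTERS = [
    ("skip",   r"/phone_directory/"),
    ("follow", r"^http://intranet.companyName.org/news/articles/.*\.html?$"),
    ("skip",   r"/news/articles/"),
    ("follow", r"^http://intranet.companyName.org/"),
]


def decide(url, filters=FILTERS, default="skip"):
    """Return the action of the first filter whose pattern matches the URL."""
    for action, pattern in filters:
        if re.search(pattern, url):
            return action
    # Assumption: the thread does not say what happens when no filter
    # matches, so this sketch simply skips the URL.
    return default


if __name__ == "__main__":
    samples = [
        "http://intranet.companyName.org/phone_directory/a.html",
        "http://intranet.companyName.org/news/articles/2012/11/story.html",
        "http://intranet.companyName.org/news/articles/2012/11",
        "http://intranet.companyName.org/about/index.html",
        "http://directory.companyName.org/index.html",
    ]
    for url in samples:
        print(decide(url), url)
```

In this model the article page is followed because the second filter matches before the broader /news/articles/ exclusion is reached, while the bare 2012/11 listing URL falls through to that exclusion.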
Your support team.
http://SoftByteLabs.com

jdahlin
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Filter Help

Post by jdahlin » Thu Jul 19, 2012 7:11 pm

Looks to be working much better, except, oddly, it is now getting hung up on all URLs that do not have a file name in them (not just the ones I described in the "news/articles/2011/1" example).

I presume I can modify this line so it applies to every directory and forces a file of some sort (*.*).

[x] Follow ^http://intranet.companyName.org/news/articles/.*\.html?$ using regular expression
I tried a couple variations, but my regEx skills are horrible...

Support
Site Admin
Posts: 1830
Joined: Sun Oct 02, 2011 10:49 am

Re: Filter Help

Post by Support » Thu Jul 19, 2012 7:40 pm

Then you can add a new filter not to follow a URL ending with /, as follows...

[x] Do not follow /$ using regular expression

and add it before the last filter.
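
To see what this filter would catch, here is a minimal Python sketch. It assumes BlackWidow treats `/$` the way Python's `re.search` does (an unanchored search, with `$` meaning end of the URL); the `ends_with_slash` helper and the sample URLs are made up for illustration.

```python
import re

# Same pattern as the "Do not follow /$" filter.
TRAILING_SLASH = re.compile(r"/$")


def ends_with_slash(url):
    """True if the URL ends with a slash, i.e. names a directory, not a file."""
    return TRAILING_SLASH.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://intranet.companyName.org/news/",
        "http://intranet.companyName.org/news/articles/2011/1",
        "http://intranet.companyName.org/index.html",
    ]
    for url in samples:
        print(ends_with_slash(url), url)
```

Note that a directory-style URL written without the trailing slash, such as .../news/articles/2011/1, is not matched by this pattern in the sketch, so it would still need a filter of its own.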

jdahlin
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Filter Help

Post by jdahlin » Fri Jul 20, 2012 8:24 am

Thanks a bunch - the scanning is working now.
For some reason, nothing shows up in "Structure", even if I tell it to download (which I don't want it to do; I just need a site map). I left it running overnight... perhaps I just need a reboot.

[BlackWidow v6.00 filters]
URL = http://intranet.ops.tiaa-cref.org/index.html
[ ] Expert mode
[x] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: index.html
Browser user agent: Mozilla/4.0 (compatible; MSIE 7.0; BlackWidow v6 - http://SoftByteLabs.com)
Startup referrer: 
[x] Slow down by 1:6 seconds
6 threads
[x] Do not follow /phone_directory/ using regular expression
[x] Do not follow /$ using regular expression
[x] Do not follow \# using regular expression
[x] Follow ^http://intranet.ops.tiaa-cref.org/ using regular expression
[ ] Follow ^http://intranet.ops.tiaa-cref.org/news/articles/.*\.html?$ using regular expression
[ ] Do not follow /news/articles/ using regular expression
[end]

Support
Site Admin
Posts: 1830
Joined: Sun Oct 02, 2011 10:49 am

Re: Filter Help

Post by Support » Fri Jul 20, 2012 8:46 am

I didn't have anything to test the filters on, but I think you need to change "Follow" to "Add" instead. This will instruct BW to follow the link and add it to the structure as well.

jdahlin
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Filter Help

Post by jdahlin » Fri Jul 20, 2012 11:02 am

A reboot fixed the "not adding to structure" issue.

In my other adjustments, I've managed to break the "not following links like news/articles/2011/12" behavior (perhaps because there is no trailing slash).
Also, I turned off "Scan external links" because I saw a few external URLs pop up in the scan again. Perhaps my filters were screwy, though.

[BlackWidow v6.00 filters]
URL = http://intranet.ops.tiaa-cref.org/index.html
[ ] Expert mode
[x] Scan everything
[x] Scan whole site
Local depth: 0
[ ] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: index.html
Browser user agent: Mozilla/4.0 (compatible; MSIE 7.0; BlackWidow v6 - http://SoftByteLabs.com)
Startup referrer: 
[x] Slow down by 1:6 seconds
10 threads
[x] Do not follow /phone_directory/ using regular expression
[x] Do not follow /$ using regular expression
[x] Do not follow \# using regular expression
[x] Do not follow /news/articles/ using regular expression
[x] Follow ^http://intranet.ops.tiaa-cref.org/news/articles/.*\.html?$ using regular expression
[x] Follow ^http://intranet.ops.tiaa-cref.org/ using regular expression
[end]
Thanks a ton for the assistance - I really appreciate it!
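
One thing that may be worth checking in the list above is the order of the two news/articles filters. Assuming the first-match-wins rule Support described earlier in the thread, the "Do not follow /news/articles/" entry now sits above the "Follow ...\.html?$" entry, so it would match the article .html pages before the Follow rule is ever reached. The Python sketch below only models that reasoning: the `first_match` helper and the tuple layout are hypothetical, Python's `re` stands in for BlackWidow's regex engine, and the effect of the "Scan everything" checkbox is not modelled.

```python
import re

HOST = r"^http://intranet.ops.tiaa-cref.org/"

# Filters in the order posted above (hypothetical tuple layout).
POSTED = [
    ("skip",   r"/phone_directory/"),
    ("skip",   r"/$"),
    ("skip",   r"\#"),
    ("skip",   r"/news/articles/"),
    ("follow", HOST + r"news/articles/.*\.html?$"),
    ("follow", HOST),
]

# Same filters, with the article .html rule moved back above the
# /news/articles/ exclusion, as in Support's original list.
REORDERED = POSTED[:3] + [POSTED[4], POSTED[3], POSTED[5]]


def first_match(url, filters):
    """Return the action of the first matching filter, or None if none match."""
    for action, pattern in filters:
        if re.search(pattern, url):
            return action
    return None


if __name__ == "__main__":
    article = "http://intranet.ops.tiaa-cref.org/news/articles/2011/12/story.html"
    listing = "http://intranet.ops.tiaa-cref.org/news/articles/2011/12"
    for name, filters in (("posted", POSTED), ("reordered", REORDERED)):
        print(name, "article:", first_match(article, filters))
        print(name, "listing:", first_match(listing, filters))
```

In this model the 2011/12 listing URL is skipped under both orderings, and only the reordered list follows the article page. Whether that matches what BlackWidow actually did here is not stated in the thread.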

Support
Site Admin
Posts: 1830
Joined: Sun Oct 02, 2011 10:49 am

Re: Filter Help

Post by Support » Fri Jul 20, 2012 5:37 pm

One trick to know: if you need to reverse the logic of a filter, click on "Scan everything" to reverse it, then double-click the filter and click OK, then click on "Scan everything" again to put it back to what it was.
