Filter Help

Posted: Thu Jul 19, 2012 5:37 pm
by jdahlin
I am trying to scan our intranet.
I started off with "Scan Everything", "Stay within the full URL", and nothing for the "External Links".
But instead of staying within "intranet.companyName.org", it also went to "directory.companyName.org".

There are also a couple of directories that I want to exclude, and I am having a hard time getting the filters set right.
1- I only want "http://intranet.companyName.org/" and no other subdomains.
2- Exclude everything under "http://intranet.companyName.org/phone_directory/"
3- Exclude scanning URLs like "http://intranet.companyName.org/news/articles/2012/11" or "http://intranet.companyName.org/news/articles/2011/1", but I do want the files (they are all .html) that are contained in these directories. (The reason for preventing these directories from being scanned is that the crawler hangs on them... I believe this is due to our proxy settings.)

Thanks!

Re: Filter Help

Posted: Thu Jul 19, 2012 6:41 pm
by Support
OK, in this case, do not select "Scan everything". Select "Scan whole site" and "Scan external links" instead, and let the filters decide what to scan. Then add filters so that only what you want gets scanned.

So, first, let's not scan the /phone_directory/ directory; then let's scan news/articles/ for .htm and .html files only; then do not scan anything else in news/articles/; and finally scan everything else in http://intranet.companyName.org/.

The way the filters work is that the first filter to match wins, and the rest of the filters are ignored. Here are the complete filters. Just change intranet.companyName.org to whatever your host name is...

Code:

[BlackWidow v6.00 filters]
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 10:60 seconds
[x] Do not follow /phone_directory/ using regular expression
[x] Follow ^http://intranet.companyName.org/news/articles/.*\.html?$ using regular expression
[x] Do not follow /news/articles/ using regular expression
[x] Follow ^http://intranet.companyName.org/ using regular expression
[end]
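
The "first match wins" rule can be sketched in a few lines of Python. This is only a rough model, not BlackWidow's actual code: it uses Python's `re` module as a stand-in for BlackWidow's regex engine, the names `FILTERS` and `decide` are made up for illustration, and it assumes a URL that matches no filter is not followed.

```python
import re

# Ordered (action, pattern) pairs copied from the filter list above.
FILTERS = [
    ("skip",   r"/phone_directory/"),
    ("follow", r"^http://intranet.companyName.org/news/articles/.*\.html?$"),
    ("skip",   r"/news/articles/"),
    ("follow", r"^http://intranet.companyName.org/"),
]

def decide(url, filters=FILTERS, default="skip"):
    """Return the action of the first filter whose pattern matches url."""
    for action, pattern in filters:
        if re.search(pattern, url):
            return action
    # Assumption: a URL that matches no filter is not followed.
    return default

if __name__ == "__main__":
    samples = [
        "http://intranet.companyName.org/phone_directory/list.html",
        "http://intranet.companyName.org/news/articles/2012/11/story.html",
        "http://intranet.companyName.org/news/articles/2012/11",
        "http://intranet.companyName.org/about/index.html",
        "http://directory.companyName.org/",
    ]
    for url in samples:
        print(decide(url), url)
```

Under these assumptions, only the article .html file and the /about/ page come out as "follow"; the phone directory, the bare news/articles/2012/11 listing, and the other subdomain are all skipped.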

Re: Filter Help

Posted: Thu Jul 19, 2012 7:11 pm
by jdahlin
Looks to be working much better, except, oddly, it is now getting hung up on all URLs that do not have a file name in them (not just the ones I described in the "news/articles/2011/1" example).

I presume I can modify this line so that it applies to every directory and requires a file name of some sort (*.*):

Code:

[x] Follow ^http://intranet.companyName.org/news/articles/.*\.html?$ using regular expression

I tried a couple of variations, but my regex skills are horrible...

Re: Filter Help

Posted: Thu Jul 19, 2012 7:40 pm
by Support
Then you can add a new filter not to follow a URL ending with /, as follows...

[x] Do not follow /$ using regular expression

and add it before the last filter.
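
A quick way to see what the /$ pattern catches is to try it on a few URLs. The sketch below uses Python's `re` module as a stand-in for BlackWidow's regex engine (an assumption; the two may differ in details), and the helper name `ends_with_slash` is made up for illustration.

```python
import re

# The pattern from the filter above: a slash as the last character.
TRAILING_SLASH = re.compile(r"/$")

def ends_with_slash(url):
    """True when the pattern /$ matches, i.e. the URL ends with a slash."""
    return TRAILING_SLASH.search(url) is not None

if __name__ == "__main__":
    for url in [
        "http://intranet.companyName.org/news/",
        "http://intranet.companyName.org/news/articles/2011/1",
        "http://intranet.companyName.org/news/index.html",
    ]:
        print(ends_with_slash(url), url)
```

Note that a directory-style URL written without a trailing slash, such as news/articles/2011/1, is not matched by /$, so it still depends on the /news/articles/ filter.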

Re: Filter Help

Posted: Fri Jul 20, 2012 8:24 am
by jdahlin
Thanks a bunch - the scanning is working now.
For some reason, nothing shows up in "Structure", even if I tell it to download (which I don't want it to do... I just need a site map). I left it running overnight... perhaps I just need a reboot.

Code:

[BlackWidow v6.00 filters]
URL = http://intranet.ops.tiaa-cref.org/index.html
[ ] Expert mode
[x] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: index.html
Browser user agent: Mozilla/4.0 (compatible; MSIE 7.0; BlackWidow v6 - http://SoftByteLabs.com)
Startup referrer: 
[x] Slow down by 1:6 seconds
6 threads
[x] Do not follow /phone_directory/ using regular expression
[x] Do not follow /$ using regular expression
[x] Do not follow \# using regular expression
[x] Follow ^http://intranet.ops.tiaa-cref.org/ using regular expression
[ ] Follow ^http://intranet.ops.tiaa-cref.org/news/articles/.*\.html?$ using regular expression
[ ] Do not follow /news/articles/ using regular expression
[end]

Re: Filter Help

Posted: Fri Jul 20, 2012 8:46 am
by Support
I didn't have anything to test the filters on, but I think you need to change "Follow" to "Add" instead. This will instruct BW to follow the link and add it to the structure as well.

Re: Filter Help

Posted: Fri Jul 20, 2012 11:02 am
by jdahlin
A reboot fixed the "not adding to structure" issue.

In my other adjustments, I've managed to break the "not following links like news/articles/2011/12" behavior (perhaps because there is no trailing slash).
Also, I turned off "Scan External Links" because I saw a few external URLs pop up in the scan again. Perhaps my filters were screwy, though.

Code:

[BlackWidow v6.00 filters]
URL = http://intranet.ops.tiaa-cref.org/index.html
[ ] Expert mode
[x] Scan everything
[x] Scan whole site
Local depth: 0
[ ] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: index.html
Browser user agent: Mozilla/4.0 (compatible; MSIE 7.0; BlackWidow v6 - http://SoftByteLabs.com)
Startup referrer: 
[x] Slow down by 1:6 seconds
10 threads
[x] Do not follow /phone_directory/ using regular expression
[x] Do not follow /$ using regular expression
[x] Do not follow \# using regular expression
[x] Do not follow /news/articles/ using regular expression
[x] Follow ^http://intranet.ops.tiaa-cref.org/news/articles/.*\.html?$ using regular expression
[x] Follow ^http://intranet.ops.tiaa-cref.org/ using regular expression
[end]

Thanks a ton for the assistance - I really appreciate it!
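
If the "first match wins" rule from earlier in the thread applies here, the order of the filters in this list matters: "Do not follow /news/articles/" now sits ahead of the .html "Follow" rule. The sketch below is a rough model under that assumption. Whether BlackWidow behaves the same way with "Scan everything" checked is not confirmed in this thread. Python's `re` stands in for BlackWidow's regex engine, the placeholder host from Support's reply is used, and the names `AS_POSTED`, `REORDERED`, and `first_match` are made up for illustration.

```python
import re

HOST = "http://intranet.companyName.org"  # placeholder host, as in Support's reply

# Filters in the order of the list posted above.
AS_POSTED = [
    ("skip",   r"/phone_directory/"),
    ("skip",   r"/$"),
    ("skip",   r"\#"),
    ("skip",   r"/news/articles/"),
    ("follow", r"^" + HOST + r"/news/articles/.*\.html?$"),
    ("follow", r"^" + HOST + r"/"),
]

# Same filters, with the .html rule moved ahead of the /news/articles/ rule.
REORDERED = AS_POSTED[:3] + [AS_POSTED[4], AS_POSTED[3], AS_POSTED[5]]

def first_match(url, filters):
    """Return (action, pattern) of the first matching filter, or None."""
    for action, pattern in filters:
        if re.search(pattern, url):
            return action, pattern
    return None

if __name__ == "__main__":
    article = HOST + "/news/articles/2011/12/story.html"  # made-up file name
    listing = HOST + "/news/articles/2011/12"
    for name, filters in [("as posted", AS_POSTED), ("reordered", REORDERED)]:
        print(name, "article:", first_match(article, filters))
        print(name, "listing:", first_match(listing, filters))
```

In this model, the posted order skips the article file because /news/articles/ matches first, while the reordered list follows the file and still skips the bare news/articles/2011/12 listing.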

Re: Filter Help

Posted: Fri Jul 20, 2012 5:37 pm
by Support
One trick to know: if you need to reverse the logic of a filter, click on "Scan everything" to reverse it, then double-click on the filter and click OK, then click on "Scan everything" again to put it back to what it was.