Scan stopping at 2 links deep

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.


Postby jdahlin » Tue Dec 20, 2011 12:10 pm

I have the radio button for "Scan the whole site" checked, but the scan stops after only 2 links deep from the home page. Other than stopping early, the site structure returned appears to be exactly what I want. If I manually navigate to one of the URLs returned from this scan and re-scan, it picks up more pages. How do I get it to keep going instead of stopping early?

I presumed that I needed to go into the Expert Filters and change LocalLinkDepth to 99... but when I run the scan from the Expert Filters, it hangs while parsing js files and some images...

My goal is to pull back a structure of all HTML, PDF, and SWF files on the site. I do not need images, js files, etc.

All I really want to do is have it run the default behavior, but not stop after only 2 links.

Thanks.
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby jdahlin » Tue Dec 20, 2011 2:28 pm

Example...
Entering our starting URL "http://www.blahblah.org" (which redirects to https://www.blahblah.org/public/index.html) and running the scan, it gets as far as "http://www.blahblah.org/public/support/forms/index.html". All told, it located 304 pages. But that page has links to lots of HTML pages, and none of them show up in the structure.

Without changing any settings, if I manually browse to that page (.../forms/index.html), it correctly locates another 1200+ pages to add to our inventory.
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby jdahlin » Tue Dec 20, 2011 3:04 pm

Handling "Forbidden"
It is scanning directories that do not have an index.html and receiving a "Forbidden" response, which is correct... but it keeps trying again, and again, and again, until all 6 of my threads are occupied by these URLs. How do I teach it to just throw the URL into "link errors" and move on? The only URLs that ever make it into link errors are those I manually place there (by terminating an active scan).

Ignoring name anchors
I don't want it to scan URLs like "xyz.html#maincontent". Similarly, I do not need it to scan all the #a, #b, #c, etc. anchors in our glossary pages. I added a filter for "*#*", but I still see "cookie sent --> receiving response" statuses for those URLs anyway.

edit - I tried changing the wildcard *#* to a regular expression [#]... we'll see if that works or not.
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby Support » Tue Dec 20, 2011 4:55 pm

Most of the time, either the links in the href tags are not formatted properly, or they are in a JavaScript onClick event. If you view the source of the page and locate a link that BW can't find, you'll know which case it is. If they are in a JavaScript call, we can add a filter to modify the tag and turn them into a proper href tag that BW can understand. Let me know and I'll provide the details.

Have you also selected "Scan everything" at the top of the filter page?

To remove links containing #, add a filter as follows...

Code: Select all
[x] Do not follow \# using regular expression


and that will take care of it.

To scan only html, pdf and swf files, uncheck 'Scan everything' and add filters to follow only html pages and to add html, pdf and swf files...

Code: Select all
[x] Follow \.html$ using regular expression
[x] Add \.(html|pdf|swf)$ from URL using regular expression
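
If you want to sanity-check those patterns outside of BW, here's a minimal Python sketch of how the same regular expressions classify a few made-up URLs (just an illustration, not BW's internal logic):

Code: Select all
import re

# The same patterns as the filters above (illustration only; not BW internals).
skip_anchor = re.compile(r'#')                  # Do not follow \#
follow_page = re.compile(r'\.html$')            # Follow \.html$
keep_file   = re.compile(r'\.(html|pdf|swf)$')  # Add \.(html|pdf|swf)$ from URL

# Made-up example URLs.
urls = [
    "http://www.blahblah.org/public/index.html",
    "http://www.blahblah.org/public/glossary.html#a",
    "http://www.blahblah.org/public/forms/guide.pdf",
    "http://www.blahblah.org/js/menu.js",
]

for url in urls:
    if skip_anchor.search(url):
        print(url, "-> ignored (contains #)")
    else:
        followed = "followed" if follow_page.search(url) else "not followed"
        added = "added" if keep_file.search(url) else "not added"
        print(url, "->", followed + ",", added)
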
Your support team.
http://SoftByteLabs.com
Support
Site Admin
 
Posts: 908
Joined: Sun Oct 02, 2011 10:49 am

Re: Scan stopping at 2 links deep

Postby jdahlin » Wed Dec 21, 2011 2:22 pm

I think the issue with stopping "two links deep" was that the redirect from http://www.mysite to https://www.mysite made BW treat it as an external URL, and I have external link scanning turned off.

Thanks for the filters - I'll use those instead of the ones I have defined.

New issue:
Our site contains a <script> tag inside the BODY that links to an external site. When browsing to this page, the external site's URL shows up in the address bar, so BW does not scan the HTML page that is on our server. Turning external link scanning on (and setting it to 1) only results in scanning the URL of the JS and not the HTML page on our site, even if I have that page open in the tool's browser tab.
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby Support » Wed Dec 21, 2011 2:56 pm

jdahlin wrote:I think the issue with stopping "two links deep" was that the redirect from http://www.mysite to https://www.mysite made BW treat it as an external URL, and I have external link scanning turned off.


Yes, that would do it.

jdahlin wrote:New issue:
Our site contains a <script> tag inside the BODY that links to an external site. When browsing to this page, the external site's URL shows up in the address bar, so BW does not scan the HTML page that is on our server. Turning external link scanning on (and setting it to 1) only results in scanning the URL of the JS and not the HTML page on our site, even if I have that page open in the tool's browser tab.


In that case, paste the URL into the browser address bar, but do not pull the page. BW will scan that URL even if the browser is blank. As for the <script> tag, BW does not run any JavaScript; it only runs in the browser. But if there is a link in it, you can add a filter that will convert the <script> tag into an href tag so that BW can see the URL. If you can provide me with the <script> block, I can do that for you.
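
Since you haven't posted the actual <script> block yet, here is only a rough illustration of the kind of substitution such a filter does, assuming the URL sits in a src attribute (the page fragment below is made up):

Code: Select all
import re

# Hypothetical page fragment -- the real <script> block was not posted in this thread.
html = '<body><script type="text/javascript" src="https://cdn.example.com/widget.js"></script></body>'

# Rewrite the script tag into a plain href tag so a link scanner can pick up the URL.
converted = re.sub(
    r'<script[^>]*\bsrc="([^"]+)"[^>]*>\s*</script>',
    r'<a href="\1"></a>',
    html,
)
print(converted)  # <body><a href="https://cdn.example.com/widget.js"></a></body>
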
Your support team.
http://SoftByteLabs.com
Support
Site Admin
 
Posts: 908
Joined: Sun Oct 02, 2011 10:49 am

Re: Scan stopping at 2 links deep

Postby jdahlin » Wed Dec 21, 2011 3:06 pm

This page contains lots of links to news articles which demonstrate the issue. None of the news articles appear in my scan.
http://www.tiaa-cref.org/plansponsors/n ... index.html

This page is an example of one of the news articles:
http://www.tiaa-cref.org/public/about/n ... 1_304.html

I tested putting the http://www.tiaa-cref.org/public/about/n ... 1_304.html URL into the browser and immediately scanning it, and it was added to the directory structure, but I can only do that for URLs that I already know should be in the scan results but aren't. If I knew that list of URLs, I would not be using the tool. ;o)

Please advise... if I cannot get this working today, my boss will ask me to manually click through our entire site and notate every html, pdf, and swf file!
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby Support » Wed Dec 21, 2011 3:16 pm

OK, now that I have a URL to work with, what exactly do you need to show in the Structure?

Say we start the scan from http://www.tiaa-cref.org/plansponsors/news/tiaa-cref-news/index.html. I see a whole lot of links on that page. Are those the ones you need? And if so, just the pages themselves?
Your support team.
http://SoftByteLabs.com
Support
Site Admin
 
Posts: 908
Joined: Sun Oct 02, 2011 10:49 am

Re: Scan stopping at 2 links deep

Postby jdahlin » Wed Dec 21, 2011 3:29 pm

I need all html, pdf, and swf files from the entire site.
http://www.tiaa-cref.org redirects to https://www.tiaa-cref.org/public/index.html,
but I need everything from both http:// and https://.

(There are a few directories that I don't need, plus whatever is in robots.txt, but it's easier to throw those away afterwards than to exclude them from the capture.)
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby jdahlin » Wed Dec 21, 2011 4:14 pm

BTW, I don't need to download any of it... just looking for a site inventory.
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby Support » Wed Dec 21, 2011 4:28 pm

If you use the following filters, it will get all html, pdf and swf files. I'm not sure whether swf files will be found if they are embedded in JavaScript.

Copy the code below and click on the "Paste settings" button in the Filters window and start the scan.

Code: Select all
[BlackWidow v6.00 filters]
URL = https://www.tiaa-cref.org/public/index.html
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page:
Startup referrer:
[ ] Slow down by 10:60 seconds
4 threads
[x] Do not follow \# using regular expression
[x] Follow tiaa-cref\.org\/ using regular expression
[x] Add \.(htm|html|pdf|swf)$ from URL using regular expression
[end]
Your support team.
http://SoftByteLabs.com
Support
Site Admin
 
Posts: 908
Joined: Sun Oct 02, 2011 10:49 am

Re: Scan stopping at 2 links deep

Postby Support » Wed Dec 21, 2011 4:29 pm

jdahlin wrote:BTW, I don't need to download any of it... just looking for a site inventory.


That's exactly what it will do. Also, in the Structure, you can select and delete what you don't want: files and folders, domains and all. Just select and hit the Del key.
Your support team.
http://SoftByteLabs.com
Support
Site Admin
 
Posts: 908
Joined: Sun Oct 02, 2011 10:49 am

Re: Scan stopping at 2 links deep

Postby jdahlin » Thu Dec 22, 2011 10:02 am

Thanks for the assist.

I did a scan last night from home but had to abort it 6 hours into its run... I noticed that in the settings you posted, it is set to scan external sites. The structure that came back contained lots and lots of links to non-tiaa-cref sites. Are these links included because they were verified but not parsed for additional links, or was it parsing them for more links too?

Unfortunately, it's still not working correctly now that I'm back in the office... I suspect our proxy settings are causing this (the symptoms did not appear when I ran it from home).

In one instance, it keeps getting hung up when looking at a few URLs that don't have a file. When I navigate to the URL directly, my browser gives me an error ("The page isn't redirecting properly - Firefox has detected that the server is redirecting the request for this address in a way that will never complete. This problem can sometimes be caused by disabling or refusing to accept cookies."), but BW keeps trying again and again, eventually eating up all of the available threads.

Based on the error message and what I am seeing BW do, the URL is redirecting back to itself over and over, so BW keeps following each redirect, then the next, and the next, and so on.
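
As a quick check outside of BW (a rough sketch assuming Python and the requests library; the URL below is just a stand-in for one of the hanging directories), the loop shows up like this:

Code: Select all
import requests

# Stand-in URL -- substitute one of the directories that hangs the scan.
url = "http://www.tiaa-cref.org/some/looping/directory/"

try:
    # requests follows up to 30 redirects by default, then gives up.
    resp = requests.get(url, allow_redirects=True, timeout=15)
    print(resp.status_code, "after", len(resp.history), "redirect(s)")
except requests.exceptions.TooManyRedirects:
    print("Redirect loop: the server keeps sending this URL back to itself.")
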

I'll try running BW again tonight from my home network where the proxy won't interfere.
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby jdahlin » Fri Dec 23, 2011 7:30 am

Kicked off a scan last night... it ran for 13 hours and collected over 210,000 pages from about 400 different websites.

Perhaps there is a way of changing
Code: Select all
[x] Follow tiaa-cref\.org\/ using regular expression


so that it does not follow sites that are not tiaa-cref.org, but still follows external links?
jdahlin
 
Posts: 13
Joined: Tue Dec 20, 2011 12:00 pm

Re: Scan stopping at 2 links deep

Postby Support » Fri Dec 23, 2011 4:29 pm

jdahlin wrote:Kicked off a scan last night... it ran for 13 hours and collected over 210,000 pages from about 400 different websites.

Perhaps there is a way of changing
Code: Select all
[x] Follow tiaa-cref\.org\/ using regular expression


so that it does not follow sites that are not tiaa-cref.org, but still follows external links?


Indeed, but not that filter, because it already follows only the site. The problem is the next filter: it says to add anything that ends in html, pdf or swf, from any site! So the last filter should be this instead...

Code: Select all
[x] Add tiaa-cref\.org\/.*\.(htm|html|pdf|swf)$ from URL using regular expression
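
To see the difference that change makes, here is a quick Python check of the old Add pattern against the corrected one (the URLs below are made-up examples):

Code: Select all
import re

old_add = re.compile(r'\.(htm|html|pdf|swf)$')                    # adds matching files from any site
new_add = re.compile(r'tiaa-cref\.org\/.*\.(htm|html|pdf|swf)$')  # adds only from tiaa-cref.org

# Made-up example URLs.
urls = [
    "https://www.tiaa-cref.org/public/example/article.html",
    "http://www.othersite.com/press/story.html",
]

for url in urls:
    print(url,
          "| old filter:", "add" if old_add.search(url) else "skip",
          "| new filter:", "add" if new_add.search(url) else "skip")
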
Your support team.
http://SoftByteLabs.com
Support
Site Admin
 
Posts: 908
Joined: Sun Oct 02, 2011 10:49 am

