no re-visiting

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
alpha2
Posts: 31
Joined: Tue Oct 30, 2012 8:24 am

no re-visiting

Post by alpha2 » Thu Nov 01, 2012 4:22 am

Hi,

Sorry for another question: I have the impression that pages are re-visited. For example, A links to B (and others), B links to C, and C links back to A. Is it possible to avoid this?

Regards,

Alpha2

Support
Site Admin
Posts: 1848
Joined: Sun Oct 02, 2011 10:49 am

Re: no re-visiting

Post by Support » Thu Nov 01, 2012 11:02 am

BlackWidow avoids this automatically.
Your support team.
http://SoftByteLabs.com
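
For readers wondering how a crawler can avoid the A -> B -> C -> A loop in general, the usual technique is to remember every page already queued and skip it when it is linked again. The following is a minimal sketch of that idea, not BlackWidow's actual implementation; the function name `crawl` and the in-memory `links` graph are made up for illustration.

```python
from collections import deque


def crawl(start, links):
    """Breadth-first walk over `links` that visits each page at most once."""
    visited = {start}          # every page that has been queued so far
    order = []                 # pages in the order they were scanned
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:   # skip pages we already know about
                visited.add(target)
                queue.append(target)
    return order


# Hypothetical link graph with the cycle from the question: A -> B -> C -> A.
links = {"A": ["B", "D"], "B": ["C"], "C": ["A"]}
print(crawl("A", links))
```

The link from C back to A is ignored because A is already in `visited`, so the walk terminates after scanning each page once.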

Re: no re-visiting

Post by alpha2 » Fri Nov 02, 2012 4:51 pm

Sorry. Now I know what the problem is: there is a site http://www. But there is also http://www. and xxxxx2 and xxxxx3 and so on, which are linked from other sites. I only need http://www. The other sites are the same as aaa. I only need the site once, but there is no guarantee that aaa will come first; aaa/xxxxx1 or 2 might come first. What can I do to avoid re-visiting? Is it possible? Only in expert mode?

Re: no re-visiting

Post by Support » Fri Nov 02, 2012 4:58 pm

If you start scanning from www.aaa.com and enable the setting "Stay within site", then it will not scan anything that does not start with www.aaa.com.
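
As described, the "Stay within site" rule amounts to a prefix check on each link. A rough sketch of such a check, assuming the scheme is ignored and the comparison is a plain string prefix (the function `within_site` is hypothetical and is not part of BlackWidow):

```python
def within_site(url, start="www.aaa.com"):
    """Return True if `url` starts with the start address, ignoring the scheme."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]   # drop the scheme before comparing
    return url.startswith(start)


print(within_site("http://www.aaa.com/page.htm"))
print(within_site("http://www.other.com/page.htm"))
```

Only the first link would be followed; the second does not start with www.aaa.com and would be skipped.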

Your examples were cut off so I'm not sure what they were!

Re: no re-visiting

Post by alpha2 » Fri Nov 02, 2012 5:23 pm

That's strange. The site has the format www (dot) sitename (dot) com / aaa. But there is also www (dot) sitename (dot) com / aaa / xxxx1, xxxx2 and so on. As there are multiple aaa's, I cannot tell BW to stay within aaa, because then I wouldn't get the other aaa's. I just have to get rid of the entries with */xxxx* at the end. But it is not certain that I get aaa first; I might get aaa / xxxxx1 first. So I cannot simply tell BW to ignore all pages matching */*/*, because I might never get aaa itself.

Re: no re-visiting

Post by Support » Fri Nov 02, 2012 5:42 pm

OK, I'm not following you here. Let's say the site is called www.sbl.net

From what I understand, there are links like www.sbl.net/aaa/xxx1 and www.sbl.net/aaa/xxx2 and www.sbl.net/aaa/xxx3 etc., right?

There is also www.sbl.net/aab/xxx1 and www.sbl.net/aab/xxx2 and www.sbl.net/aab/xxx3, is that correct?

Now, which links do you need to scan, and which links do you need to add to the structure?

Re: no re-visiting

Post by alpha2 » Fri Nov 02, 2012 6:08 pm

Support wrote:OK, I'm not following you here. Let's say the site is called http://www.sbl.net

From what I understand, there are links like http://www.sbl.net/aaa/xxx1 and http://www.sbl.net/aaa/xxx2 and http://www.sbl.net/aaa/xxx3 etc., right?

There is also http://www.sbl.net/aab/xxx1 and http://www.sbl.net/aab/xxx2 and http://www.sbl.net/aab/xxx3, is that correct?

Now, which links do you need to scan, and which links do you need to add to the structure?

Yes. But there is also http://www.sbl.net/aaa and http://www.sbl.net/aab (without the suffix "xxx1" etc.). As all these pages are the same, any of them is fine. I need aaa and aab only once each. It doesn't matter whether I get aaa or aaa/xxx1 or aaa/xxx2. The problem is that I never know whether aaa comes first or aaa/xxx1 (2...). So I cannot tell BW to ignore " */*/* ".

Re: no re-visiting

Post by Support » Fri Nov 02, 2012 6:09 pm

So you mean that aaa/xxx1 is the same as aab/xxx1?

Re: no re-visiting

Post by alpha2 » Fri Nov 02, 2012 6:18 pm

Support wrote:So you mean that aaa/xxx1 is the same as aab/xxx1?
No. www.sbl.net/aaa is the same as www.sbl.net/aaa/xxx1 is the same as www.sbl.net/aaa/xxx2 ...
And www.sbl.net/aab is the same as www.sbl.net/aab/xxx1 is the same as www.sbl.net/aab/xxx2 ...
But www.sbl.net/aaa is different from www.sbl.net/aab.
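
The equivalence described here (aaa, aaa/xxx1 and aaa/xxx2 are the same page, while aaa and aab differ) can be handled in general by reducing every URL to a key made of host plus first path segment and keeping only the first URL seen per key. This is only a sketch of that idea in Python, not a BlackWidow feature; `page_key` and `dedupe` are hypothetical names, and it assumes the first path segment is what identifies a page.

```python
from urllib.parse import urlsplit


def page_key(url):
    """Map www.sbl.net/aaa and www.sbl.net/aaa/xxx1 to the same key."""
    if "://" not in url:
        url = "http://" + url   # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    first = segments[0] if segments else ""
    return (parts.netloc.lower(), first)


def dedupe(urls):
    """Keep the first URL seen for each key, whichever variant comes first."""
    seen = set()
    kept = []
    for url in urls:
        key = page_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


found = [
    "www.sbl.net/aaa/xxx1",
    "www.sbl.net/aaa",
    "www.sbl.net/aab/xxx2",
    "www.sbl.net/aaa/xxx2",
    "www.sbl.net/aab",
]
print(dedupe(found))
```

Because the key ignores everything after the first path segment, it does not matter whether aaa or aaa/xxx1 is found first: exactly one variant of each page is kept.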

Re: no re-visiting

Post by Support » Fri Nov 02, 2012 6:47 pm

So then, you only need to scan everything in aaa, aab, aac but exclude all xx1, xx2, xx3 etc., correct?

Re: no re-visiting

Post by alpha2 » Fri Nov 02, 2012 6:58 pm

Support wrote:So then, you only need to scan everything in aaa, aab, aac but exclude all xx1, xx2, xx3 etc., correct?
No. aaa.htm is a page of its own, and I don't need anything else within it. My problem is that in 99% of all cases aaa/xxx1 etc. comes first and aaa.html comes later. On the other hand, if the crawler sees aaa/xxx1, it is 100% certain that there is a page aaa.htm.

Re: no re-visiting

Post by Support » Fri Nov 02, 2012 7:06 pm

I see, so you mean that you have no way to know the link of aaa.html unless you scan xxx1?

When you say you don't know which comes first, do you mean this: scanning sbl.net/index.html may give links to xxx1, and xxx1 then links to aaa.html, which is the page you want; but it can also happen that sbl.net/index.html links directly to aaa.html, which you want, and aaa.html contains a link to xxx1, which you do not want?

Do you know the names of the links? Are they aaa, aab etc., or are they random names? Perhaps you can send me a PM with the real link so I can better understand?

Re: no re-visiting

Post by alpha2 » Fri Nov 02, 2012 7:19 pm

I've sent the PM. Did you get it? (It is in my Outbox, but not in Sent.) Maybe you can understand the problem without clicking on the links...

Re: no re-visiting

Post by Support » Fri Nov 02, 2012 7:37 pm

Yes, I got it. It's not an email, so it won't be in your Outlook; it's a feature of this forum. Let me take a look...

Re: no re-visiting

Post by Support » Fri Nov 02, 2012 7:39 pm

OK, I think I understand, but I'll reply to your PM to keep this private. Look at the top of this page; it should show that you have 1 message. Click on it...