List of pages to be scanned - export / import

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
alpha2
Posts: 31
Joined: Tue Oct 30, 2012 8:24 am

List of pages to be scanned - export / import

Post by alpha2 » Mon Feb 04, 2013 3:28 am

Hi,

I understood that in version 5 it was possible to export and import the pages to be scanned, so that you could stop a crawl, export the "remaining" pages, and come back later. This feature is no longer available in version 6. Will it come in version 7 or in an interim release? It would be very useful if you have to shut down the computer, or if you get lists of pages to be scanned from external sources. When is the next release of BW planned?

Regards,

Alpha


Re: List of pages to be scanned - export / import

Post by alpha2 » Mon Feb 04, 2013 5:04 am

P. S.

1) Is it possible to access external files in expert mode? E.g.:

open file x
while not eof()
  lnk = ReadLn();
  Scanlink(lnk);
loop

When the system scans the links as per the Scanlink commands, is it also possible to define how deep it scans and which patterns it follows? E.g. starting from abc.htm and the other pages, I would like to also crawl pages abc/type1.htm, but not abc/type2.htm.

2) How many lines of code can be used in expert mode?
If I cannot use an external file (see 1)), I could copy and paste the list of pages to be crawled into expert mode, e.g.
Scanlink('http://www.site.com/abc.htm');
Scanlink('http://www.site.com/abd.htm');
Scanlink('http://www.site.com/abf.htm');
Scanlink('http://www.site.com/abg.htm');
Scanlink('http://www.site.com/abx.htm');
Scanlink('http://www.site.com/aby.htm');
...

Support
Site Admin
Posts: 1854
Joined: Sun Oct 02, 2011 10:49 am

Re: List of pages to be scanned - export / import

Post by Support » Mon Feb 04, 2013 10:52 am

No, but you can paste the entire file content between single or double quotes...

links = "
link1
link2
link3
...
";

then you can do...
for each line in links as alink do
  ScanLink(Trim(alink));
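
If the pasted block contains blank lines (or a trailing "..." placeholder line), a small guard keeps ScanLink() from being called on empty strings. A sketch only, assuming a Pascal-style <> inequality operator exists in this language:

for each line in links as alink do
  if Trim(alink) <> '' then
    ScanLink(Trim(alink));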

As for your other question: blank out the script editor and restart BW. The script editor will then contain a default script with explanations. Does this help?
Your support team.
http://SoftByteLabs.com


Re: List of pages to be scanned - export / import

Post by alpha2 » Mon Feb 04, 2013 4:49 pm

How many lines of code are possible? Unlimited?

Does this approach only scan the links in the list, or can I also go, let's say, two links further with certain rules (e.g. don't follow *xyz*, but follow *zzz*)?


Re: List of pages to be scanned - export / import

Post by Support » Mon Feb 04, 2013 4:53 pm

Not infinite, but as much as available memory allows.

It's up to you to define what to follow or not. The way it is, it'll scan what's on the list, but if you want to go further, you'll have to ScanLink() the links within each page.
Your support team.
http://SoftByteLabs.com


Re: List of pages to be scanned - export / import

Post by alpha2 » Wed Feb 06, 2013 10:56 am

Hi,

what does this mean: "It's up to you to define what to follow or not. The way it is, it'll scan what's on the list, but if you want to go further, you'll have to ScanLink() the links within the page."?

How do I define what to follow, what not to follow, and to what depth? E.g. I want to follow only the links from the start pages in the list and the links from the subsequent pages, but only if the links have a certain format. How can I add this to the list?

Let's say I use the following format:

Scanlink('http://www.site.com/abc.htm');
Scanlink('http://www.site.com/abd.htm');
Scanlink('http://www.site.com/abf.htm');
Scanlink('http://www.site.com/abg.htm');

Before going from abc to abd, I would like to follow the links on abc to a depth of 2, but only those links with a certain format.


Re: List of pages to be scanned - export / import

Post by Support » Wed Feb 06, 2013 1:21 pm

What you are asking is complicated to implement, because you'd have to keep track of everything to know when to start the next link in the list. The default script will allow this, but only for a single URL.

What do you mean by "but only if the links have a certain format"?
Your support team.
http://SoftByteLabs.com


Re: List of pages to be scanned - export / import

Post by alpha2 » Wed Feb 06, 2013 6:00 pm

Starting with a page abc, I would like to follow abc/h1..h9, but not abc/k1..k9. After I have parsed all of h1..h9, I would like to continue with the next line in the script.


Re: List of pages to be scanned - export / import

Post by Support » Wed Feb 06, 2013 6:30 pm

I can't write you a script to do this because it would take several hours, but do you have any programming experience at all?
Your support team.
http://SoftByteLabs.com


Re: List of pages to be scanned - export / import

Post by alpha2 » Wed Feb 06, 2013 6:50 pm

Yes, I sometimes do some programming, but I haven't had the time to go through the whole documentation. Maybe I can live without this feature and just crawl the pages in the scan list. After BW has gone through the scan list, does it automatically continue with the links from the parsed documents, or does it stop after the last document in the list?


Re: List of pages to be scanned - export / import

Post by Support » Wed Feb 06, 2013 7:04 pm

The language used is a mix of Basic, Pascal and JavaScript, so it should be easy if you know any of those three.

It will stop after the last link in the list. But what you can do is use the default script (which documents everything at the beginning; scroll down to see the script) and modify it as you need. For example, insert the links to scan in the "Starting" events section, then use the "BeforeFetch" section to filter out links you don't want, then use the "AfterFetch" section to parse for links you want to scan. You won't need "BeforeParsing", so you can delete that section. Then use "BeforeAdding" to add only the links you want to the Structure list, and maybe change the 3 variables in "Starting" to control whether external links are scanned and how many levels deep to scan.

Just remember that the ~= operator is a regular expression test. It works the same as "if a = b then", except that "if a ~= b then" tests "a" against a regular expression mask in "b". The rest should be easy to understand.
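
Applied to the list-paste approach from earlier in this thread, ~= can filter which lines get scanned at all, e.g. keeping the h1..h9 pages while skipping k1..k9. A sketch only; it assumes the mask uses ordinary regular expression syntax, and the URLs are made-up examples:

links = "
http://www.site.com/abc/h1.htm
http://www.site.com/abc/k1.htm
";
for each line in links as alink do
  if Trim(alink) ~= 'abc/h[1-9]\.htm' then
    ScanLink(Trim(alink));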
Your support team.
http://SoftByteLabs.com


Re: List of pages to be scanned - export / import

Post by alpha2 » Thu Feb 07, 2013 3:24 am

I tried to copy the 500,000 lines of code (Scanlink(...)) into expert mode. It worked fine with 100 lines, so it is not a code problem, but with the full list I couldn't run it. The code itself is only 25 MB, yet memory consumption grew drastically, to more than 350 MB, so memory use grows much faster than the sheer code size. I think the content of the expert mode is written to the Registry, and maybe there is some limit there. I'll keep trying.


Re: List of pages to be scanned - export / import

Post by Support » Thu Feb 07, 2013 8:33 pm

I think you're right. When you exit BW it writes the script to the registry, and 10,000 lines is about all it can take. But you can still paste the lines in the script and run a scan. Just clear the script before you exit BW.
Your support team.
http://SoftByteLabs.com


Re: List of pages to be scanned - export / import

Post by alpha2 » Fri Feb 08, 2013 7:42 am

Did you test it on your PC? How many lines of code does it accept? For me, it doesn't start the crawl if there are too many.

Is there any plan for a future release, where you can export and import link lists to the tool and run it with large numbers of links?


Re: List of pages to be scanned - export / import

Post by Support » Fri Feb 08, 2013 11:31 am

I was able to scan 9,000 lines on mine.

Perhaps you need to look at our BrownRecluse spider instead. This one will let you open/save files, and scan the links in any way you like.
Your support team.
http://SoftByteLabs.com


Re: List of pages to be scanned - export / import

Post by alpha2 » Sat Feb 09, 2013 6:19 pm

Does BrownRecluse support https? I was so happy that BW runs without problems (in principle); if I could keep using it, that would be better...


Re: List of pages to be scanned - export / import

Post by Support » Sat Feb 09, 2013 6:32 pm

Yes, it does https also. We have lots of free scripts in the Script section.
Your support team.
http://SoftByteLabs.com

waldo
Posts: 2
Joined: Wed Dec 19, 2012 5:18 am

Re: List of pages to be scanned - export / import

Post by waldo » Mon Mar 11, 2013 4:09 pm

Hi,

I scanned with BW 6 as you explained. After several iterations with new sets of links, an error showed up saying "Invalid protocol. Must be HTTP or HTTPS".
After clicking OK it disappeared, but it came back when I started the scanner again. Restarting BW solved the problem. How can I avoid this?

Thx


Re: List of pages to be scanned - export / import

Post by Support » Mon Mar 11, 2013 4:33 pm

I don't know; it has never happened to me. Maybe there was a typo in the URL? If it doesn't start with http:// or https://, it will give you this error. It will also do that if the URL you try to scan redirects BW to an FTP site, for example.
Your support team.
http://SoftByteLabs.com
