List of pages to be scanned - export / import

Posted: Mon Feb 04, 2013 3:28 am
by alpha2
Hi,

I understand that in version 5 it was possible to export and import the list of pages to be scanned, so that you could stop a crawl, export the "remaining" pages and come back later. This feature is no longer available in version 6. Will it return in version 7 or in an interim release? It would be very useful if you have to shut down the computer, or if you get lists of pages to be scanned from an external source. When is the next release of BW planned?

Regards,

Alpha

Re: List of pages to be scanned - export / import

Posted: Mon Feb 04, 2013 5:04 am
by alpha2
P. S.

1) Is it possible to access external files in the expert mode?
e. g.

open file x
while not eof()
lnk=Readln()
Scanlink(lnk);
loop

When the system scans the links as per the Scanlink commands, is it also possible to define how deep it scans and with which patterns? E.g. starting from abc.htm and the other pages, I would like to also crawl pages abc/type1.htm, but not abc/type2.htm.

2) How many lines of code can be used in expert mode?
If I cannot use an external file (see 1)), I could copy and paste the list of pages to be crawled into the expert mode, e.g.
Scanlink('http://www.site.com/abc.htm');
Scanlink('http://www.site.com/abd.htm');
Scanlink('http://www.site.com/abf.htm');
Scanlink('http://www.site.com/abg.htm');
Scanlink('http://www.site.com/abx.htm');
Scanlink('http://www.site.com/aby.htm');
...

Re: List of pages to be scanned - export / import

Posted: Mon Feb 04, 2013 10:52 am
by Support
No, but you can paste the entire file content between single or double quotes...

links = "
link1
link2
link3
...
";

then you can do...
for each line in links as alink do
ScanLink(Trim(alink));

As for your other question, blank out the script editor and restart BW. The script editor will then contain a default script with explanations. Does this help?
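For a long list kept in an external file, the pasted script above can be generated rather than typed. The following is a minimal Python sketch, run outside BW, that builds the same script text from a list of links. The function name build_bw_script is hypothetical, and the emitted script syntax is copied from the example in this post, not from BW documentation.

```python
def build_bw_script(links):
    """Return script text in the shape shown in this post.

    Assumption: no link contains a double quote, since the whole
    list is pasted between double quotes.
    """
    # Drop blank lines and surrounding whitespace before pasting.
    cleaned = [link.strip() for link in links if link.strip()]
    body = "\n".join(cleaned)
    return (
        'links = "\n'
        + body
        + '\n";\n'
        "for each line in links as alink do\n"
        "ScanLink(Trim(alink));\n"
    )


if __name__ == "__main__":
    sample = ["http://www.site.com/abc.htm", "http://www.site.com/abd.htm"]
    print(build_bw_script(sample))
```

The printed text can then be pasted into the script editor.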

Re: List of pages to be scanned - export / import

Posted: Mon Feb 04, 2013 4:49 pm
by alpha2
How many lines of code are possible? Infinite?

Does this approach only scan the sites in the list, or can I also go, let's say, two links away with certain rules (e.g. don't follow *xyz*, but follow *zzz*)?

Or does this approach only scan the links in the list?

Re: List of pages to be scanned - export / import

Posted: Mon Feb 04, 2013 4:53 pm
by Support
Not infinite but as much as available memory.

It's up to you to define what to follow or not. The way it is, it'll scan what's on the list, but if you want to go further, you'll have to ScanLink() the links within the page.

Re: List of pages to be scanned - export / import

Posted: Wed Feb 06, 2013 10:56 am
by alpha2
Hi,

What does this mean: "It's up to you to define what to follow or not. The way it is, it'll scan what's on the list, but if you want to go further, you'll have to ScanLink() the links within the page."

How do I define whether or not to follow links, and to what depth? E.g. I want to follow just the links from the start pages in the list and the links from the subsequent page, but only if the links have a certain format. How can I add this to the list?

Let's say I use the following format:

Scanlink('http://www.site.com/abc.htm');
Scanlink('http://www.site.com/abd.htm');
Scanlink('http://www.site.com/abf.htm');
Scanlink('http://www.site.com/abg.htm');

Before going from abc to abd I would like to follow the links on abc with depth 2, but only those links with a certain format.

Re: List of pages to be scanned - export / import

Posted: Wed Feb 06, 2013 1:21 pm
by Support
What you are asking is complicated to implement because you'll have to keep track of everything to know when to start the next link in the list. The default script allows this, but only for a single URL.

What do you mean by "but only if the links have a certain format"?

Re: List of pages to be scanned - export / import

Posted: Wed Feb 06, 2013 6:00 pm
by alpha2
Starting with a page abc, I would like to follow abc/h1..h9, but not abc/k1..k9. After I have parsed all of h1..h9, I would like to continue with the next line in the script.
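The bookkeeping being asked for here can be sketched independently of BW. The Python below is only an illustration of the idea, not a BW script: it finishes a depth-limited, filtered walk from one start page before moving on to the next entry in the list. The names crawl_seed, crawl_list and get_links are hypothetical, and the page contents are a made-up in-memory table rather than real fetches.

```python
import re
from collections import deque


def crawl_seed(seed, get_links, follow, max_depth):
    """Breadth-first walk from one seed, following only matching links."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not look for links beyond the depth limit
        for link in get_links(url):
            if link not in seen and follow.search(link):
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


def crawl_list(seeds, get_links, follow, max_depth=2):
    """Finish each seed completely before starting the next one."""
    visited = []
    for seed in seeds:
        visited.extend(crawl_seed(seed, get_links, follow, max_depth))
    return visited


if __name__ == "__main__":
    # Hypothetical site structure standing in for fetched pages.
    site = {
        "abc.htm": ["abc/h1.htm", "abc/k1.htm"],
        "abc/h1.htm": ["abc/h2.htm"],
        "abc/h2.htm": ["abc/h3.htm"],
    }
    follow = re.compile(r"/h[1-9]\.htm$")  # follow h1..h9, never k1..k9
    print(crawl_list(["abc.htm", "abd.htm"], lambda u: site.get(u, []), follow))
```

In this sketch the depth counts link hops from the start page, so a depth of 2 reaches abc/h2.htm but not abc/h3.htm.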

Re: List of pages to be scanned - export / import

Posted: Wed Feb 06, 2013 6:30 pm
by Support
I can't write you a script to do this because it'll take several hours to do, but do you have programming experience at all?

Re: List of pages to be scanned - export / import

Posted: Wed Feb 06, 2013 6:50 pm
by alpha2
Yes, I sometimes do some programming. But I didn't have the time to go through the whole documentation. But maybe I can live without this feature and just crawl the sites in the scan list. After BW has gone through the scan list, does it continue automatically with the links from the parsed documents, or does it stop after the last document in the list?

Re: List of pages to be scanned - export / import

Posted: Wed Feb 06, 2013 7:04 pm
by Support
The language used is a mix of Basic, Pascal and JavaScript, so it should be easy if you know any of those three.

It will stop after the last link in the list. But what you can do is use the default script (which documents everything at the beginning; scroll down to see the script) and modify it as you need. For example, insert the links to scan in the "Starting" events section, then use the "BeforeFetch" section to filter out links you don't want, and the "AfterFetch" section to parse for links you want to scan. You won't need "BeforeParsing", so you can delete that section. Then use "BeforeAdding" to add only the links you want to the Structure list, and maybe change the 3 variables in "Starting" to control whether external links are scanned and how many link levels deep to scan.

Just remember that the ~= operator is a regular expression test. It works the same way as in "if a = b then", except that "if a ~= b then" tests "a" against the regular expression mask in "b". The rest should be easy to understand.
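As an analogy only, the difference between the two tests looks like this in Python. This is not BW script, the helper name matches is hypothetical, and whether ~= matches anywhere in the string or requires a full match is an assumption here (the sketch matches anywhere).

```python
import re


def matches(value, mask):
    """Rough Python stand-in for 'value ~= mask' (assumed: match anywhere)."""
    return re.search(mask, value) is not None


if __name__ == "__main__":
    url = "http://www.site.com/abc/h3.htm"
    mask = r"/abc/h[1-9]\.htm$"
    print(url == mask)         # plain equality compares the two strings literally
    print(matches(url, mask))  # the regular expression test applies the mask
```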

Re: List of pages to be scanned - export / import

Posted: Thu Feb 07, 2013 3:24 am
by alpha2
I tried to copy the 500,000 lines of code (Scanlink(...)) into the expert mode (it worked fine with 100, so it is not a code problem). I couldn't run it. The code itself is only 25 MB. On the other hand, memory consumption grew drastically, to more than 350 MB. So memory consumption grows faster than the sheer code size. I think that the content of the expert mode is written to the Registry. Maybe there is some limit. I'll keep trying.

Re: List of pages to be scanned - export / import

Posted: Thu Feb 07, 2013 8:33 pm
by Support
I think you're right. When you exit BW it writes the script to the registry, and 10,000 lines is about all it can take. But you can still paste the lines in the script and run a scan. Just clear the script before you exit BW.

Re: List of pages to be scanned - export / import

Posted: Fri Feb 08, 2013 7:42 am
by alpha2
Did you test it on your PC? How many lines of code does it accept? For me, it doesn't start the crawl, if there are too many.

Is there any plan for a future release, where you can export and import link lists to the tool and run it with large numbers of links?

Re: List of pages to be scanned - export / import

Posted: Fri Feb 08, 2013 11:31 am
by Support
I was able to scan 9,000 lines on mine.

Perhaps you need to look at our BrownRecluse spider instead. This one will let you open/save files, and scan the links in any way you like.
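Given the figures in this thread (about 10,000 lines before the registry write becomes a problem, and 9,000 lines scanned successfully), one workaround is to split a large list into batches and paste one batch per run. A minimal Python sketch, run outside BW, with the hypothetical helper batches and a default batch size of 9,000 taken from the number above:

```python
def batches(links, size=9000):
    """Yield consecutive slices of at most `size` links."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(links), size):
        yield links[start:start + size]


if __name__ == "__main__":
    # Made-up URLs standing in for a real link list.
    links = [f"http://www.site.com/p{i}.htm" for i in range(20001)]
    for number, batch in enumerate(batches(links), start=1):
        print(number, len(batch))
```

At 9,000 links per batch, a list of 500,000 links would need 56 runs.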

Re: List of pages to be scanned - export / import

Posted: Sat Feb 09, 2013 6:19 pm
by alpha2
Does BrownRecluse support https? I was so happy, that BW runs without problems (in principle). If I could use it, it would be better...

Re: List of pages to be scanned - export / import

Posted: Sat Feb 09, 2013 6:32 pm
by Support
Yes, it does https also. We have lots of free scripts in the Script section.

Re: List of pages to be scanned - export / import

Posted: Mon Mar 11, 2013 4:09 pm
by waldo
Hi,

I scanned with BW 6 as you explained. After several iterations with new sets of links, an error showed up saying "Invalid protocol. Must be HTTP or HTTPS".
After clicking OK it disappeared, but came back when starting the scanner again. Restarting BW solved the problem. How can I avoid this?

Thx

Re: List of pages to be scanned - export / import

Posted: Mon Mar 11, 2013 4:33 pm
by Support
I don't know, it has never happened to me. Maybe there was a typo in the URL? If it doesn't start with http:// or https:// then it will give you this error. It will also do that if the URL you try to scan redirects BW to an FTP site, for example.
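A list can be checked for the first cause before it is pasted into BW. The Python sketch below flags entries whose scheme is not http or https; the function name invalid_links is hypothetical. It cannot detect the second cause, since a redirect to an FTP site only shows up when the page is actually requested.

```python
from urllib.parse import urlsplit


def invalid_links(links):
    """Return (line number, link) pairs whose scheme is not http or https."""
    bad = []
    for number, link in enumerate(links, start=1):
        try:
            scheme = urlsplit(link.strip()).scheme.lower()
        except ValueError:
            scheme = ""  # treat unparseable entries as invalid
        if scheme not in ("http", "https"):
            bad.append((number, link))
    return bad


if __name__ == "__main__":
    sample = [
        "http://www.site.com/abc.htm",
        "www.site.com/abd.htm",        # missing scheme
        "ftp://www.site.com/abf.htm",  # wrong protocol
    ]
    for number, link in invalid_links(sample):
        print(f"line {number}: {link}")
```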