Problems with Wikipedia.

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.

Problems with Wikipedia.

Post by manganese » Wed Jul 04, 2012 1:31 am

I want to save pages off Wikipedia, for example http://en.wikipedia.org/wiki/Falcon. I want that page and all pages that are linked to on that page, including images.
I have tried using filters I have successfully used with other sites, but images still don't get saved.

solution?


Re: Problems with Wikipedia.

Post by Support » Wed Jul 04, 2012 12:16 pm

Just tell it to scan external sites to a depth of 1 or 2 and it should do it. Also, do not select the option to scan everything or the whole site.
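
In script form, the same idea looks roughly like the sketch below. It is only a trimmed-down illustration of the full Expert filter posted later in this thread, so the events and variable names are taken from that script rather than invented; accepting everything else and limiting only the link depth is an assumption about how simple you want the filter to be.

Code:

case ScannerEvent of

	Starting:
	begin
		ExternalLinkDepth = 2;  // follow links off the start page to a depth of 1 or 2
	end;

	FoundLink:
	begin
		// only queue links that are within the allowed depth
		AcceptEvent = (FoundLinkDepth <= ExternalLinkDepth);
	end;

else
	AcceptEvent = Yes; // accept fetching, parsing and adding for everything else

end;
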
Your support team.
http://SoftByteLabs.com


Re: Problems with Wikipedia.

Post by manganese » Wed Jul 04, 2012 5:20 pm

I did as you said, except I ticked "scan everything, let" because otherwise it returned no results.

The saved pages still contain no image files. :(


Re: Problems with Wikipedia.

Post by Support » Wed Jul 04, 2012 10:55 pm

OK, here is an actual script that will do that. Copy the whole code below, and in the Filters window, click Expert (top right) and replace the entire content with it (Ctrl+A, then paste).

Code:

/*

	Basically, the script handles the events generated by the BlackWidow scanner.
	The 'ScannerEvent' is set to one of the following events...

	Starting:      The scan is about to start.
	BeforeFetch:   The scanner is about to fetch a document (download a URL)
	AfterFetch:    The scanner has fetched the document.
	BeforeParsing: The scanner is about to parse the fetched document for new links.
	BeforeAdding:  The scanner is about to add a link to the Structure.
	FoundLink:     The parser has found a new link to scan.
	Finishing:     The scan is about to end.

	The AcceptEvent must be set to True or False (Yes or No) for every event except
	the Starting and Finishing events. If AcceptEvent is set to True (or Yes), the
	scanner will proceed with its function; otherwise, it will not. For example, if
	AcceptEvent is set to False in the BeforeFetch event, the scanner will not fetch
	the link and therefore will not find any links in that document.

	Optionally, you can add links to the scan queue using the ScanLink() function.
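
	For example (a hypothetical call, using the sample page from this thread):
	ScanLink('http://en.wikipedia.org/wiki/Falcon'); would add that page to the scan queue.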

	Below are some examples of what you can do in each event triggered by the scanner...

  Starting: Set any global variables, and optionally, generate links to scan
            and add them to the scan queue.

  BeforeFetch: If the link of the document is not to be scanned, set AcceptEvent to
               False and the scanner will not fetch it. You can test the document
               in various ways. The following variables will be set accordingly when
               this event is triggered...
               DocumentURL:   The URL of the document.
               DocumentType:  The mime/type of the document, such as text/html
               DocumentSize:  The size in bytes of the document, if known.
               DocumentDepth: The scan depth.


  AfterFetch:  You can parse the document for links such as those embedded in javascript
               and use ResolveRelative before passing the links to ScanLink().
               The following variables will be set when this event is triggered...
               Document:      The document itself, that is, the data fetched.
               DocumentURL:   The URL of the document.
               DocumentType:  The mime/type of the document, such as text/html
               DocumentSize:  The size in bytes of the document.
               DocumentDepth: The scan depth.


  BeforeParsing: You can modify the fetched document before the parser parses it.
                 For example, you may want to convert thumbnail image links into large
                 image links to bypass the need to scan each thumbnail page. You can
                 also convert javascript links into normal href tags, as is the case
                 for obscured email addresses. The following variables will be set
                 accordingly when this event is triggered...
                 Document:      The document itself, that is, the data fetched.
                 DocumentURL:   The URL of the document.
                 DocumentType:  The mime/type of the document, such as text/html
                 DocumentSize:  The size in bytes of the document.
                 DocumentDepth: The scan depth.


  BeforeAdding: Here you can control what will be added to the Structure. Set
                AcceptEvent to False and the link will not be added to the Structure.
                For example, you may want to add only images to the Structure, so
                you can test whether the DocumentType contains 'image', and if so, set
                AcceptEvent to True and the link will be added to the Structure.
                The following variables will be set accordingly when this event is
                triggered...
                DocumentURL:   The URL of the document.
                DocumentType:  The mime/type of the document, such as image/jpeg
                DocumentSize:  The size in bytes of the document, if known.
                DocumentDepth: The scan depth.


  FoundLink: When the scanner has fetched a document, it will attempt to parse it and
             find links in the document. For each link it finds, this event will be
             triggered and the following variables will be set...
             FoundLinkURL:      The URL of the found link.
             FoundLinkReferrer: The document URL where the link was found.
             FoundLinkDepth:    The next scan depth should this link be followed.
             Use the above variables to determine which links you want to follow,
             including those to later add to the Structure. If you set AcceptEvent to
             True, the link will be added to the scan queue and the BeforeFetch
             event will be triggered when it's time to fetch it.


  Finishing: Well, there is not much you can do here as the scan has ended. However,
             it is provided so that future versions of BlackWidow can use it.


	This concludes the scripting capabilities of BlackWidow.


	Below is a basic script using all of the events. Modify as you please. Just remember
	that when an event is triggered, the entire script is run, and it is up to you to
	test ScannerEvent to find out which event is triggered, and which part of the script
	to run for that event. The sample below provides this for you.

	Note on the programming syntax: ~= uses a regular expression to test for a match,
	and ~!= tests for a non-match.

	BlackWidow uses Perl-compatible regular expressions (PCRE). You can look this up on
	Google and find very detailed documentation on it.
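
	For example (illustrative patterns only, not part of the script below):
	(DocumentType ~= 'image')    is True for any image mime/type, such as image/jpeg.
	(FoundLinkURL ~!= '\.pdf$')  is True for any link that does not end in .pdf.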

 	We are currently adding new features and working on a manual. If you have any questions,
	please post them in our Q&A board so everyone can benefit from the answers provided.

*/
case ScannerEvent of

	Starting:
	begin
		ExternalLinkDepth = 1;  // set to 0 for no external links.
		LocalLinkDepth    = 1;  // set to high number for no limit.
		ScanWholeSite     = No; // Stay within StartupURL
	end;

	BeforeFetch:
	begin
		// Fetch only html pages to parse for links.
		AcceptEvent = (DocumentType ~= 'text/html');
	end;

  /*AfterFetch:
	begin
		for each matching('"([^"]+\.swf)"') in Document as aLink do begin
			aLink.ResolveRelative(DocumentURL); // resolve links like ../foo/bar/
			ScanLink(aLink); // add the link to the scan queue.
		end;
	end;*/

	/*BeforeParsing:
	begin
    Document.Replace('/tn_', '/'); // try to make a large image from a thumbnail.
    AcceptEvent = Yes;
	end;*/

	BeforeAdding:
	begin
		AcceptEvent = (DocumentType ~= 'image|html'); // add images and html pages to the Structure.
	end;

	FoundLink:
	begin
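		// Local links (as decided by InRootURI/InBaseURI against StartupURL) use
		// LocalLinkDepth; all other links use ExternalLinkDepth.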
		if (ScanWholeSite and FoundLinkURL.InRootURI(StartupURL)) or
		   ((not ScanWholeSite) and FoundLinkURL.InBaseURI(StartupURL))
		then
			AcceptEvent = (FoundLinkDepth <= LocalLinkDepth)
		else
			AcceptEvent = (FoundLinkDepth <= ExternalLinkDepth);
	end;

	Finishing:
	begin
		Output('Done');
	end;

else
	AcceptEvent = No;

end;
Your support team.
http://SoftByteLabs.com


Re: Problems with Wikipedia.

Post by manganese » Thu Jul 05, 2012 11:46 pm

Still doesn't work, but thanks anyway.


Re: Problems with Wikipedia.

Post by Support » Thu Jul 05, 2012 11:48 pm

It did work for me! When I run it on the sample URL, I get the page plus the linked pages, and the images too. What do you get?
Your support team.
http://SoftByteLabs.com


Re: Problems with Wikipedia.

Post by manganese » Fri Jul 06, 2012 12:23 am

Here is a .mht file that shows exactly what I did; hopefully you can point out what I've done wrong.

https://www.dropbox.com/s/jst0adzvjkfxd ... 5_2217.mht

Thank you.


Re: Problems with Wikipedia.

Post by Support » Fri Jul 06, 2012 9:57 am

Everything looks fine. But I think you are trying to view the HTML offline, aren't you? If so, it will rarely work because of all the scripting in the pages.
Your support team.
http://SoftByteLabs.com
