Getting images retrieved using JavaScript

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
shark61
Posts: 5
Joined: Fri Jul 12, 2013 11:09 am

Getting images retrieved using JavaScript

Post by shark61 »

I just tried to download from a website that has videos and images. BW got the videos fine, but not the images, which are all shown as small thumbnails on a webpage. It does find the images when they are inside a .zip. I noticed that it only downloads directories about seven subdirectories deep, but the images are nine deep. I saw someone had a similar issue back in 2011 on this forum. You said BW would not find a link if it is built by a regular expression or by JavaScript. This site uses JavaScript, and maybe regular expressions, to build the image URLs on a page.

I am familiar with regular expressions, but from that post I don't understand how to work around the issue.

In BW v6, I clicked Add Filter, which brought up a screen with different things to filter, and went to Bad Link Substitution. I am guessing that is a way to work around it, but it says I should use Modify Document. Is there anything that explains how to use this capability in the Add Rule window?

Also, do I need to stop the download before a new filter takes effect? I tried filtering out .m4v videos, but I did not stop the scan, and it is still downloading them. I am guessing you have to restart the scanner for the new filter to take effect. If I do stop the download, is there a way to make it start where it left off and catch the links it missed before? If I want to test this, should I just start with the page before the JavaScript/regular-expression links? And does BW notice when new things are added to a website and download just those?

Interestingly, when I saved the page with the thumbnail images in Firefox and then reopened it, the JavaScript figured out where to put all the thumbnails.

The link in question ends with 'thumbnails.html?pictorial='. After 'pictorial=' comes the URL of a grandparent directory (two levels up) of the images.

The page uses JavaScript. '?' is a metacharacter in regular expressions, but that is not how it is being used here, so I am guessing it is interpreted by a JavaScript library such as jQuery (this site uses an old version of it). How do I get BW to recognize the images there as well, since that is just a parent directory where the scan stopped before?

Is there a way to get JavaScript files that are in sibling directories of the main one that holds all the webpages?

Thanks for your assistance!

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: Getting images retrieved using JavaScript

Post by Support »

Yes, you need to stop the scan for the new filters to take effect. It may ask you if you want to resume the scan, but if you already missed something, it will not be scanned unless you restart.

When links are embedded in JavaScript, what you need to do is use "Modify Document" with a regular expression that converts the JavaScript into an href or img tag that the BW parser can see.
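For example, here is a minimal sketch of the kind of rewrite a Modify Document rule should perform, shown in Python just to illustrate the idea (the loadPics() call and the path are made up; match whatever call your site actually uses, with BW's own find/replace syntax in the rule itself):

    import re

    # A page fragment where the link only exists inside a JavaScript call.
    html = "<script>loadPics('thumbnails.html?pictorial=/gallery/set01/');</script>"

    # Rewrite the call into a plain href tag the parser can follow.
    fixed = re.sub(r"loadPics\('([^']+)'\)", r'<a href="\1">pics</a>', html)
    print(fixed)
    # <script><a href="thumbnails.html?pictorial=/gallery/set01/">pics</a>;</script>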

But if the link is already in an href tag, such as <a href="thumbnails.html?pictorial=/pix/pic001.jpg">, BW will scan it automatically. Just make sure your filters are set to scan the URL where the pictures are. So if the site is www.pic.com and the pictures are on img.pic.com, you will need to set the filters to scan external sites, then disable "Scan everything" and set filters to only follow the links you want scanned.

If you can send me the URL, it would be easier for me to help. If it's private, you can PM me.
Your support team.
http://SoftByteLabs.com

shark61
Posts: 5
Joined: Fri Jul 12, 2013 11:09 am

Re: Getting images retrieved using JavaScript

Post by shark61 »

I noticed that at the beginning of downloading from this website, it said the number of remaining files was 47K, and it stayed there for an extremely long time, until at least 10K files had been scanned. The number remaining went down by a few thousand by the time 30-40K files had been scanned. Now, at 52K scanned, it says 53K files are still remaining.

This leaves me wondering how the number of files remaining is calculated, and whether it grows as files are found via the JavaScript. Maybe that is why the number remaining didn't go down for a long time and why it went up later, but I know many more JavaScript-built URLs were processed than plain ones.

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: Getting images retrieved using JavaScript

Post by Support »

The number of files remaining to scan can go up or down as it scans, because when it finds new links, it adds them. When it pulls one link and finds 10 more links within, the number of links left to scan goes down by one and up by ten.
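As a rough illustration (a Python sketch of a crawl queue, not BW's actual code), each page pulled off the queue can push several new links back on, so the remaining count can rise even while scanning:

    from collections import deque

    # Pretend parse results: each scanned page yields some new links.
    found = {"start": ["p1", "p2"], "p1": ["img1", "img2", "img3"], "p2": []}

    frontier = deque(["start"])   # links left to scan
    seen = {"start"}
    while frontier:
        page = frontier.popleft()            # remaining drops by one...
        for link in found.get(page, []):     # ...then rises by each new link
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        print(f"scanned {page}, remaining: {len(frontier)}")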

BTW, I've replied to your PM.
Your support team.
http://SoftByteLabs.com
