# link filter help

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
NuclearFox
Posts: 8
Joined: Sun Jan 29, 2012 6:36 pm

Post by NuclearFox »

The website I am scanning is probably using PHP, so their directory ends with gallery/
In this gallery directory is the basic default HTML file, but with # links to change the language... so currently BlackWidow scans all 14 language options plus the directory, adding thousands of extra links.

So the full URL would be somedirectory/gallery/ followed by somedirectory/gallery/#ja for Japanese, #sp for Spanish, #nl for Dutch, etc... How do I stop it from even scanning these links?

Also, the directory I am scanning has /gallery/low/, /gallery/high/ and /gallery/medium/. I am not interested in following these either, yet BlackWidow still follows them. I have tried
do not follow */low/* using wildcard, yet BlackWidow still follows them.

All of the images I want are labeled t_randomfilename.jpg for the thumbnails, l_randomfilename.jpg for large, m_randomfilename.jpg for medium, and s_randomfilename.jpg for small. I am only interested in having BlackWidow snag the t_ and the l_ files...

The filters don't seem to be cooperating... Of course, I may never get to the actual thumbnail download, as BlackWidow is adding roughly 2000 links for every single directory via /low/, /low/#ko, /low/#ru, /low/#sp, /low/#fr, /low/#nl, etc. By the time one directory is even scanned, there are approximately 3000 unnecessary links.
[Attached screenshot: blackwidowfix.jpg]

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: # link filter help

Post by Support »

The best way to do this is to list only the URLs to follow, rather than those not to follow.

In other words, from the starting page, which link(s) do you need to click on to get to the thumbnail and large image?
If you can provide that, I can show you the filters to use. You can remove the domain name if you like.
Your support team.
http://SoftByteLabs.com

Re: # link filter help

Post by NuclearFox »

The problem isn't following a specific link. The problem is that every single link to any page has a #language option at the end, for a few dozen languages including German, French, Dutch, Japanese, Chinese and Korean, among others...
So EVERY link downloads every language option for that file as well.
See below for a rough example... but I just want BlackWidow to stop following anything with a # in it. Right now it downloads a copy, follows it, scans it and runs it through the filters. It's taking forever to scan a single artist's directory, because it adds roughly 3000 to 5600 files to the scan of one artist. Those files are totally redundant since they are the same thing: just a script loaded on the site to give you a language option on every page, even if the page has no text, just an image.

Typing just the URLs to follow will give me nothing, because in order to defeat bot scanning they included the entire m2k checksum of the files as part of the directory name.
The problem is all the #__ links it follows, slowing down the scan and adding hundreds of files to the folder,
plus the fact that I have the /low/ /med/ /high/ directories.

The main page itself has each file linked as a thumbnail, so it's like
{image} <-- thumb, which takes you to the gallery viewer /low/, with links to /med/ and /high/ right at the top of every page
low 6kb <--link to image
med 53kb <--link to image
large 173kb <--link to image

Here is a basic layout. imagesby_______ is the artist's name, and there are about 2000 of them, so like imagesbyRolpha, imagesbyBobHenry, imagesbyKaatlyn, etc...
http:// members/somesite.com/members/images/imagesby________/
http:// members/somesite.com/members/images/imagesby________/#en
http:// members/somesite.com/members/images/imagesby________/#fr
http:// members/somesite.com/members/images/imagesby________/#ru
http:// members/somesite.com/members/images/imagesby________/#jp
http:// members/somesite.com/members/images/imagesby________/#ch
http:// members/somesite.com/members/images/imagesby________/#nl
http:// members/somesite.com/members/images/imagesby________/#nz
http:// members/somesite.com/members/images/imagesby________/#.......etc
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/ <-- first gallery collection (there can be upwards of 300 galleries with up to 100 images in each directory, and each gallery directory name is some random m2k-type name)
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/#en
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/#fr
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/#ru
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/#jp
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/#.......
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/low/
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/low/#en
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/low/#fr
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/low/#ru
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/low/#jp
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/low/#.......
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/medium/
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/medium/#en
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/medium/#fr
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/medium/#ru
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/medium/#jp
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/medium/#.......
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/#en
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/#fr
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/#ru
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/#jp
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/#.......
and it goes further into the directory for each one
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/1/ <--- first image out of 100: an HTML page with left and right and large/medium/low options as well
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/1/#en
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/1/#fr
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/1/#ru
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/1/#jp
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/1/#.......
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/2/
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/2/#en
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/2/#fr
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/2/#ru
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/2/#jp
http:// members/somesite.com/members/images/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/large/2/#.......
and each of these has a /2/low/ 2/medium/ 2/large/ followed by 2/large/#en 2/large/#fr etc etc. ...

Meanwhile, this is where the images are stored; I only want the thumbnail t_ and large l_ files:
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/t_0c467e2bfdd753ad5b9914ef41fa9d4c.jpg <-- image 1 thumb
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/s_0c467e2bfdd753ad5b9914ef41fa9d4c.jpg <-- image 1 small
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/m_0c467e2bfdd753ad5b9914ef41fa9d4c.jpg <-- image 1 medium
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/l_0c467e2bfdd753ad5b9914ef41fa9d4c.jpg <-- image 1 large
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/t_0A4C02B3266D8A979AB3948D005AA99.jpg <-- image 2 thumb (notice the file name changed but the directory did not)
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/s_0A4C02B3266D8A979AB3948D005AA99.jpg <-- image 2 small
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/m_0A4C02B3266D8A979AB3948D005AA99.jpg <-- image 2 medium
http:// members/somesite.com/members/files/imagesby________/0c467e2bfdd753ad5b9914ef41fa9d4c/l_0A4C02B3266D8A979AB3948D005AA99.jpg <-- image 2 large
and so on, for 100 to 150 images


So again... just to sum up..
help me stop blackwidow from following
ANYTHING with a pound sign in it (that's the shift 3 button not a random number) --> #
Anything with a /low/ or a /med/ or /high/ in it
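
For reference, the part of a URL after the # (the fragment) is handled by the browser and is not sent to the server, so every #en, #fr, #ru variant points at the same page. The sketch below is only an illustration in Python, not something BlackWidow runs: it shows how the language links collapse to one page once the fragment is dropped. The host name somesite.example and the sample links are made up for the demonstration.

```python
from urllib.parse import urldefrag


def unique_pages(urls):
    """Drop the #fragment from each URL and keep the first occurrence of each page."""
    seen = set()
    pages = []
    for url in urls:
        page, _fragment = urldefrag(url)  # "page#frag" -> ("page", "frag")
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages


# Hypothetical sample links in the shape described in the post
links = [
    "http://somesite.example/members/images/imagesbyGeorge/",
    "http://somesite.example/members/images/imagesbyGeorge/#en",
    "http://somesite.example/members/images/imagesbyGeorge/#fr",
    "http://somesite.example/members/images/imagesbyGeorge/low/",
    "http://somesite.example/members/images/imagesbyGeorge/low/#ru",
]

pages = unique_pages(links)
for page in pages:
    print(page)
```

Five links shrink to two distinct pages here, which is the same redundancy the scan is suffering from.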

Re: # link filter help

Post by NuclearFox »

All in all, if I can get BlackWidow to quit following # and /low/ and /med/ and /high/ links,
the entire structure can be scanned by just 2 files in each artist's name and directory,
i.e.
http:// members/somesite.com/members/images/imagesbyGeorge/ <--- Root artist page saved as index.htm locally
http:// members/somesite.com/members/images/imagesbyGeorge/0c467e2bfdd753ad5b9914ef41fa9d4c/ <--- Root gallery 1 page saved as index.htm locally
http:// members/somesite.com/members/images/imagesbyGeorge/314b1e2b677d753ad2f6914e341fa9d4c/ <--- Root gallery 2 page saved as index.htm locally

and the whole site would be finished scanning with roughly an artist name, the 1 to 100 galleries, and their images, consisting of maybe 1000 thumbnails, 1000 large images, and 2 to 200 gallery pages.

Instead, BlackWidow is providing an artist name, 20 languages for the artist, 1 to 100 galleries with 2000 images, and roughly 300000 HTML files covering every language, every view size and every image. By the time I end up scanning the 1000 artists' directories, I have roughly 3 to 10 million files that are redundant copies of pages within the directory in multiple languages, which I don't want. (The language option is saved via a cookie anyway; the site just links the language option on every page, every view page, every HTML file, etc.)

Re: # link filter help

Post by Support »

OK, here are the filters. Copy the whole block, then click on the "Paste settings" button.

First, let me explain what each one does. I have it set to scan the whole site and external links, but not to scan everything, so the filters will decide what to scan...

[x] Follow /images/imagesby[^/]+/$ using regular expression
this one will scan links like your example...
http:// members/somesite.com/members/images/imagesby________/
and will not scan those with #fr and such.

[x] Follow /images/imagesby[^/]+/[0-9a-z]+/$ using regular expression
this one will scan...
http:// members/somesite.com/members/images/imagesby________/________/
and again, it will not scan those with #fr and such.
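
As a rough check of why the trailing /$ keeps the #fr links out, the same pattern can be tried in Python's re module. This is only a sketch: it assumes BlackWidow's regex engine treats the pattern the same way as Python does, and the host name somesite.example and artist name George are made up.

```python
import re

# Second Follow filter from the post, compiled with Python's re module
GALLERY = re.compile(r"/images/imagesby[^/]+/[0-9a-z]+/$")

# Hypothetical gallery URL in the shape shown in the thread
base = ("http://somesite.example/members/images/imagesbyGeorge/"
        "0c467e2bfdd753ad5b9914ef41fa9d4c/")

samples = [base, base + "#fr", base + "low/", base + "low/#ru"]

# Only URLs whose last character is the / right after the gallery name match
followed = [url for url in samples if GALLERY.search(url)]

for url in samples:
    print("follow" if url in followed else "skip  ", url)
```

Because $ pins the match to the end of the URL, anything after the gallery's closing slash (a #fr fragment or a further /low/ directory) makes the match fail.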

[x] Follow /images/imagesby[^/]+/[0-9a-z]+/large/\d+/$ using regular expression
this one will scan...
http:// members/somesite.com/members/images/imagesby________/_______/large/___/
and again, it will not scan those with #fr and such.
but if you don't want to scan the /large/ directory, then what is the link of the page hosting the links to the thumb and large image?

[x] Add /files/imagesby[^/]+/[0-9a-z]+/(t|l)_[^/]+\.jpg$ from URL using regular expression
this one will scan and add to the structure...
http:// members/somesite.com/members/files/imagesby________/________/t_______.jpg
and
http:// members/somesite.com/members/files/imagesby________/________/l_______.jpg
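
The (t|l)_ part is what separates the wanted files from the s_ and m_ ones. A small sketch in Python's re module, assuming the pattern behaves the same way in BlackWidow and using a made-up host and artist name:

```python
import re

# "Add" filter from the post, compiled with Python's re module
IMAGE = re.compile(r"/files/imagesby[^/]+/[0-9a-z]+/(t|l)_[^/]+\.jpg$")

# Hypothetical folder and file name in the shape shown in the thread
folder = ("http://somesite.example/members/files/imagesbyGeorge/"
          "0c467e2bfdd753ad5b9914ef41fa9d4c/")
name = "0c467e2bfdd753ad5b9914ef41fa9d4c.jpg"

# One file per size prefix: thumb, small, medium, large
files = [folder + prefix + name for prefix in ("t_", "s_", "m_", "l_")]

# Keep only the files the filter would add
kept = [f for f in files if IMAGE.search(f)]

for f in kept:
    print(f)
```

Only the t_ and l_ files survive; the s_ and m_ files fail at the (t|l) alternation.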

If this is almost correct, do this. Give me the link of the starting page, then all the links you need to click all the way to the page that has the links to the thumb and large image.


[BlackWidow v6.00 filters]
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page:
Startup referrer:
[ ] Slow down by 10:60 seconds
4 threads
[x] Follow /images/imagesby[^/]+/$ using regular expression
[x] Follow /images/imagesby[^/]+/[0-9a-z]+/$ using regular expression
[x] Follow /images/imagesby[^/]+/[0-9a-z]+/large/\d+/$ using regular expression
[x] Add /files/imagesby[^/]+/[0-9a-z]+/(t|l)_[^/]+\.jpg$ from URL using regular expression
[end]
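
To see how the four filters work together, here is a simplified model in Python. It is not how BlackWidow is implemented; it only assumes that a URL matching an Add filter is added, a URL matching a Follow filter is scanned, and everything else is ignored. The first Follow pattern is written with [^/]+ here (an assumption, so that artist names longer than one character match), and the host and artist names are made up.

```python
import re

# Follow filters (the first one is written with "+", which is an assumption)
FOLLOW = [
    re.compile(r"/images/imagesby[^/]+/$"),
    re.compile(r"/images/imagesby[^/]+/[0-9a-z]+/$"),
    re.compile(r"/images/imagesby[^/]+/[0-9a-z]+/large/\d+/$"),
]
# Add filter: thumbnails (t_) and large images (l_) only
ADD = [re.compile(r"/files/imagesby[^/]+/[0-9a-z]+/(t|l)_[^/]+\.jpg$")]


def classify(url):
    """Return 'add', 'follow' or 'skip' under this simplified filter model."""
    if any(p.search(url) for p in ADD):
        return "add"
    if any(p.search(url) for p in FOLLOW):
        return "follow"
    return "skip"


# Hypothetical URLs in the shape shown in the thread
site = "http://somesite.example/members"
gallery = site + "/images/imagesbyGeorge/0c467e2bfdd753ad5b9914ef41fa9d4c/"
stored = site + "/files/imagesbyGeorge/0c467e2bfdd753ad5b9914ef41fa9d4c/"

for url in [
    site + "/images/imagesbyGeorge/",
    site + "/images/imagesbyGeorge/#en",
    gallery,
    gallery + "medium/",
    gallery + "large/1/",
    gallery + "large/1/#fr",
    stored + "l_0c467e2bfdd753ad5b9914ef41fa9d4c.jpg",
    stored + "m_0c467e2bfdd753ad5b9914ef41fa9d4c.jpg",
]:
    print(classify(url), url)
```

Listing only what to follow means the # variants and the /low/ and /medium/ pages never need filters of their own; they fall through to "skip".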

Re: # link filter help

Post by NuclearFox »

wow... i will give it a try...
too bad this whole thing couldn't have been fixed by just going
do not follow # regular text....

thanks for your help

Re: # link filter help

Post by Support »

It could be done the other way too, but there is too much to filter out. That's why I did it this way: it is much simpler and does the same thing.

Re: # link filter help

Post by NuclearFox »

Follow /images/imagesby[^/]/$ using regular expression

How do you add the character - to the expression? Quite a few of the artist names are tom-brown and cindy-j-fox etc., so the blank can have up to 7 minus signs in the name,
such as when they collaborate on a project:
tom-brown-and-cindy-j-fox-and-jarad-b-sanders
/images/imagesby[^/]/$ doesn't seem to scan anything beyond the first -

Re: # link filter help

Post by Support »

The regular expression [^/] means anything but a / character, so it will pick up the dashes. It needs a + after it, though, to match more than one character.

/images/imagesby[^/]+/$ means the URL must contain /images/imagesby followed by any characters up to and including a / character, and that / character must be the last character of the URL...

So it should match /images/imagesbyYOU-and-ME/

but will not match /images/imagesbyYOU/and/ME/
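
A quick way to check how [^/] behaves with and without a following + is to try both forms in Python's re module. This is a sketch under the assumption that BlackWidow's regex engine agrees with Python on these basics; the third sample URL is made up to show the single-character case.

```python
import re

# [^/] alone matches exactly one non-slash character
ONE_CHAR = re.compile(r"/images/imagesby[^/]/$")
# [^/]+ matches one or more non-slash characters
ONE_OR_MORE = re.compile(r"/images/imagesby[^/]+/$")

urls = [
    "/images/imagesbyYOU-and-ME/",
    "/images/imagesbyYOU/and/ME/",
    "/images/imagesbyX/",  # hypothetical one-letter artist name
]

# For each URL, record whether each pattern matches
results = {
    url: (bool(ONE_CHAR.search(url)), bool(ONE_OR_MORE.search(url)))
    for url in urls
}

for url, (one_char, one_or_more) in results.items():
    print(f"{url}  [^/]: {one_char}  [^/]+: {one_or_more}")
```

Without the +, only a one-character name can sit between imagesby and the final slash, so dashed names such as tom-brown only match the [^/]+ form. Neither form matches the URL with extra slashes in the name.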

Re: # link filter help

Post by NuclearFox »

OK, I think I have finally figured this out. Thank you so much for your help.
Last edited by NuclearFox on Tue Jan 31, 2012 3:33 am, edited 2 times in total.

Re: # link filter help

Post by NuclearFox »

I guess I follow a link, but do I also add it too? Or should I only add the link first?

Re: # link filter help

Post by Support »

Well, starting from the main page, there may be only one link, like /Members/index.html, then there are more links, each very similar to one another, then again and again, until you get to the thumbnail and the large image. Those are the links to add. If you can, for example, give me 2 sets of links, both starting from the main page and going down to 2 different images, that would be perfect for me to make you a very simple filter that works fast. It's hard for me to know if I don't have the URLs. Or you can PM me the real URL if you want.
