Filters Help

Posted: Wed Jul 25, 2012 6:28 pm
by mdemon
Hi, I am trying to get the BlackWidow program to scan a site and download all of the images on a particular part of the site. The site is
http://www.zerochan.net/touhou (I want only the images in the "Touhou" part of the site)

There are three sizes of each image: a thumbnail (the URL contains .240.), a small (.600.) and a full (.full.). I want just the small images.

There are currently 44,957 images on this part of the site, and the "page numbers" (even though it says there are over 1800 pages) only go up to 100; after that it starts listing the id of the first image on each page instead.

I have been trying to get BlackWidow to scan that part of the site for a while but haven't had any luck :(

Any help would be great :D

Thanks!

Re: Filters Help

Posted: Wed Jul 25, 2012 6:56 pm
by Support
Here are the filters. Copy the block of text below and in the Filters window, click on "Paste Settings" and start the scan...

[BlackWidow v6.00 filters]
URL = http://www.zerochan.net/touhou
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 1:1 seconds
[x] Replace .240. with .600. using plain text
[x] Replace "? with "http://www.zerochan.net/touhou? using plain text
[x] Follow /touhou\?p=(([0-9]{1,2})|(100))$ using regular expression
[x] Add .600. from URL using plain text
[end]
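
For readers unfamiliar with the filter syntax, the two plain-text Replace lines can be pictured as ordinary string substitutions. The Python sketch below is only an illustration, assuming BlackWidow applies them to the page source in the order listed (the thread does not say exactly how it applies them). The sample HTML and the `example.invalid` image host are made up.

```python
def apply_replacements(page_source: str) -> str:
    """Mimic the two plain-text Replace filters, in the order listed."""
    # Swap the thumbnail marker for the small-image marker.
    page_source = page_source.replace(".240.", ".600.")
    # Turn relative links such as href="?p=2" into absolute ones.
    page_source = page_source.replace('"?', '"http://www.zerochan.net/touhou?')
    return page_source


# Hypothetical snippet of page source, not taken from the real site.
sample = '<a href="?p=2"><img src="http://img.example.invalid/pic.240.1.jpg"></a>'
print(apply_replacements(sample))
```

After both substitutions the link points at an absolute listing URL and the image URL carries .600. instead of .240., which is what the Follow and Add filters then pick up.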

Re: Filters Help

Posted: Wed Jul 25, 2012 8:55 pm
by mdemon
Thank you! That worked! How do I extend it to include pages past 100 though? (Like, do I put the same thing as the /touhou\?p= filter except with o= and make it a range from a low to a high number?)

Re: Filters Help

Posted: Wed Jul 25, 2012 9:32 pm
by Support
The regular expression I used works as follows...

Let's break it down into parts...

[0-9] means any digit from 0 to 9, but only one.
{1,2} means 1 or 2 of the previous expression.
| means OR
and 100 means just that, exactly 100

So when you read it, it's like this...

[0-9]{1,2} means any digits from 0 to 9, but either 1 or 2 of them. This will cover from 0 to 99.
That expression is enclosed between ( and ), so it is now ([0-9]{1,2})
Now we have | which means the previous expression or the next one.
The next one is simply (100).
So in all, it means there is a match if there are one or two digits from 0 to 9, or if it is 100; otherwise there is no match.
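
The breakdown above can be checked outside BlackWidow as well. This is a rough Python sketch, assuming Python's `re` module treats this particular pattern the same way BlackWidow's engine does and that the filter is applied as a search anywhere in the URL. The helper name `follows` is made up for the example.

```python
import re

# The Follow filter from the first set of filters.
PAGE_FILTER = re.compile(r"/touhou\?p=(([0-9]{1,2})|(100))$")


def follows(url: str) -> bool:
    """True if the URL would be picked up by the Follow filter."""
    return PAGE_FILTER.search(url) is not None


for url in [
    "http://www.zerochan.net/touhou?p=7",
    "http://www.zerochan.net/touhou?p=99",
    "http://www.zerochan.net/touhou?p=100",
    "http://www.zerochan.net/touhou?p=101",
]:
    print(url, follows(url))
```

Pages 7, 99 and 100 match; 101 does not, because neither "one or two digits" nor "exactly 100" can cover it before the end of the URL.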

If you want to make it go to 200, then we would do this...
(1?[0-9]{1,2})|(200)

1? means either there is nothing or there is a 1, followed by 1 or 2 digits from 0 to 9, or 200.

So 1? can be nothing or 1, then come 1 or 2 digits, so it can be 0 to 199, or 200.

If you want it to go to 500, you'd use this...
([1-4]?[0-9]{1,2})|(500)

[1-4]? means either nothing or a single digit from 1 to 4, followed by 0 to 99. That gives us 0 to 499, and then the OR 500.
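
A quick way to confirm which numbers a range expression covers is to try every number in turn. The sketch below does that in Python for the 200 and 500 variants. `fullmatch` stands in for the `p=` prefix and the `$` anchor of the real filter, and the helper name `matched_numbers` is invented for the example.

```python
import re


def matched_numbers(pattern: str, upper: int = 1000) -> list[int]:
    """Return every n in 0..upper whose decimal form fully matches pattern."""
    regex = re.compile(pattern)
    return [n for n in range(upper + 1) if regex.fullmatch(str(n))]


up_to_200 = matched_numbers(r"(1?[0-9]{1,2})|(200)")
up_to_500 = matched_numbers(r"([1-4]?[0-9]{1,2})|(500)")

print(up_to_200[0], up_to_200[-1], len(up_to_200))
print(up_to_500[0], up_to_500[-1], len(up_to_500))
```

Both lists come out gap-free: every number from 0 up to the limit matches and nothing above it does.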

And the $ at the end means the end of the line. So php$ means it has to end with php, like index.php or thumbs.php, but not index.php?id=100

You can use the RegEx evaluator in the BW Filters to test your expressions and see if they work. If you have a range of numbers you want to make one for, let me know and I'll do it.
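
The effect of the `$` anchor on the three file names mentioned can be reproduced with a few lines of Python. This is only an illustration using Python's `re`, on the assumption that BlackWidow's engine treats `$` the same way for single-line input.

```python
import re

ENDS_WITH_PHP = re.compile(r"php$")

for name in ["index.php", "thumbs.php", "index.php?id=100"]:
    # search() scans the whole string; $ only lets the match sit at the end.
    print(name, bool(ENDS_WITH_PHP.search(name)))
```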

I hope I made sense!

Re: Filters Help

Posted: Wed Jul 25, 2012 9:41 pm
by mdemon
I got that :) Thanks for explaining it, but sadly, on this site anything past page 100 looks like this: http://www.zerochan.net/touhou?o=1155370

So sadly, can't add new pages over 100 ^^'

Re: Filters Help

Posted: Wed Jul 25, 2012 9:51 pm
by Support
It may come in handy on another site though!

Re: Filters Help

Posted: Wed Jul 25, 2012 10:25 pm
by mdemon
If I can find another site with 40,000 images of that topic again ^^' But is there any way of creating a filter that will include the pages past 100 as well? You can't put 101 or 102 on the end of it; it would need a range, something between 1 and 1204438, because after page 100 each "page number" is actually the id of the first picture on the page, even though there are over 1800 pages.

Re: Filters Help

Posted: Wed Jul 25, 2012 10:34 pm
by Support
Anything is possible with regular expressions. If you've got a site with that many pages, let me know and I'll make you the filters.

You can't just put 102 at the end in place of 100, because that's an OR, so it would mean 0-99 or 102, with no 100 and no 101.

Re: Filters Help

Posted: Wed Jul 25, 2012 10:42 pm
by mdemon
At the bottom of http://www.zerochan.net/Touhou it says "page 1 of 1,874" but once you try to go past page 100, the URL goes from http://www.zerochan.net/Touhou?p=100 to http://www.zerochan.net/Touhou?o=1155370 , which should actually say "p=101" in the url instead of "o=1155370" but it doesn't.
Then it goes to http://www.zerochan.net/Touhou?o=1154513 , which should really say "p=102" in the URL instead of "o=1154513", but it doesn't either.

Try going to http://www.zerochan.net/Touhou?p=100 then at the bottom of the page click "next", and you will see what I mean.

Re: Filters Help

Posted: Wed Jul 25, 2012 10:48 pm
by Support
I know what you mean about the page number, but do you want to scan those pages above 100 also?

Re: Filters Help

Posted: Wed Jul 25, 2012 10:50 pm
by mdemon
Yes please ^^' sorry if I wasn't making myself clear.

Re: Filters Help

Posted: Wed Jul 25, 2012 10:52 pm
by Support
Oh ok, then change the filter that contains...

/touhou\?p=(([0-9]{1,2})|(100))$

for this...

/touhou\?p=\d+$

and that will scan all the pages, from 0 to as many as they list.
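
To see the difference between the two expressions, here is a small side-by-side check in Python. It is a sketch, assuming Python's `re` agrees with BlackWidow's engine on these patterns. The page number 1874 is taken from the "page 1 of 1,874" mentioned earlier, and the helper name `followed` is made up.

```python
import re

OLD = re.compile(r"/touhou\?p=(([0-9]{1,2})|(100))$")
NEW = re.compile(r"/touhou\?p=\d+$")  # \d+ means one or more digits


def followed(pattern: re.Pattern, url: str) -> bool:
    """True if the given Follow pattern matches somewhere in the URL."""
    return pattern.search(url) is not None


for page in (100, 101, 1874):
    url = f"http://www.zerochan.net/touhou?p={page}"
    print(page, followed(OLD, url), followed(NEW, url))
```

The old expression stops at 100, while the `\d+` version accepts any `p=` value made up only of digits.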

Re: Filters Help

Posted: Wed Jul 25, 2012 10:55 pm
by mdemon
Thanks! I will need to have two though, because after page 100 it goes from "p=" to "o=". Will it work if I have two filters there, one with p= and the other with o=?

Re: Filters Help

Posted: Wed Jul 25, 2012 10:56 pm
by Support
Yes, that will work, or change p for [po] and it'll do both on one line.
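
The `[po]` form is a character class: it accepts exactly one character, either p or o, in that position. A minimal Python check follows, with the usual caveat that BlackWidow's engine is assumed to agree with Python's `re` here. The `q=` URL is a made-up counterexample.

```python
import re

PAGE_OR_OFFSET = re.compile(r"/touhou\?[po]=\d+$")

urls = [
    "http://www.zerochan.net/touhou?p=100",
    "http://www.zerochan.net/touhou?o=1155370",
    "http://www.zerochan.net/touhou?q=5",        # hypothetical, should not match
    "http://www.zerochan.net/Touhou?o=1155370",  # capital T
]
for url in urls:
    print(url, PAGE_OR_OFFSET.search(url) is not None)
```

Note that Python matches case-sensitively by default, so the capital-T URL is rejected in this sketch. Whether BlackWidow ignores case is not stated in the thread.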

Re: Filters Help

Posted: Wed Jul 25, 2012 11:03 pm
by mdemon
Thank you so much! That is working just fine! :) 520 images downloaded so far!

Re: Filters Help

Posted: Wed Jul 25, 2012 11:35 pm
by mdemon
How would I make it ignore the "login" site? cause it keeps scanning "http://www.zerochan.net/login?ref=/logi ... login?ref="

And each time it scans that specific thing, it adds another "/login?ref=", and it scans it like 8 at a time.

Re: Filters Help

Posted: Thu Jul 26, 2012 12:17 am
by Support
Just add a filter that reads...

[x] Do not follow login using plain text

and move it up as far as it can go.

Here are the new filters to skip that link...

[BlackWidow v6.00 filters]
URL = http://www.zerochan.net/touhou
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 1:1 seconds
4 threads
[x] Replace .240. with .600. using plain text
[x] Replace "? with "http://www.zerochan.net/touhou? using plain text
[x] Do not follow login using plain text
[x] Follow /touhou\?[po]=\d+$ using regular expression
[x] Add .600. from URL using plain text
[end]
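
The advice to move the login filter up as far as it can go suggests that the order of the filters matters. The Python sketch below models that with a simple "first matching rule wins" list. This is an assumption for illustration only, since the thread does not spell out how BlackWidow resolves overlapping filters. The rule lists, the helper names and the test URL are all made up, and the Follow pattern is the combined p/o form suggested earlier in the thread.

```python
import re


def is_login(url: str) -> bool:
    """Plain-text test, like 'Do not follow login'."""
    return "login" in url


def is_listing_page(url: str) -> bool:
    """Regular-expression test, like the Follow filter."""
    return re.search(r"/touhou\?[po]=\d+$", url) is not None


# Hypothetical rule lists: (action, test) pairs, checked top to bottom.
LOGIN_FIRST = [("skip", is_login), ("follow", is_listing_page)]
LOGIN_LAST = [("follow", is_listing_page), ("skip", is_login)]


def decide(url: str, rules) -> str:
    """Return the action of the first rule that matches (assumed behaviour)."""
    for action, test in rules:
        if test(url):
            return action
    return "unmatched"


# Made-up URL that matches both rules.
tricky = "http://www.zerochan.net/login?ref=/touhou?p=3"
print(decide(tricky, LOGIN_FIRST))
print(decide(tricky, LOGIN_LAST))
```

Under this model, a login URL that also happens to end in a page number is skipped only when the login rule sits above the Follow rule.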

Re: Filters Help

Posted: Thu Jul 26, 2012 12:29 am
by mdemon
Thank you so much for your help! :)

Re: Filters Help

Posted: Thu Jul 26, 2012 12:42 am
by Support
You're welcome.

Re: Filters Help

Posted: Thu Jul 26, 2012 1:31 pm
by mdemon
Ok, I need help again ^^' What do I change in the filters to make it go through http://www.animepaper.net/gallery/pictu ... me/touhou/ instead? (There are no "page numbers" either)

I want the full image and not the thumbnails (I really need to learn how to set filters better...I tried messing with it a little and couldn't get it working)

Re: Filters Help

Posted: Thu Jul 26, 2012 1:51 pm
by Support
It's not the same for this site, here it is...

[BlackWidow v6.00 filters]
URL = http://www.animepaper.net/gallery/pictures/anime/touhou/
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 1:1 seconds
4 threads
[x] Follow /gallery/pictures/anime/touhou/\d+$ using regular expression
[x] Follow /art/\d+/[^/#]+$ using regular expression
[x] Add /thumbnails/preview/.*\.(jpg|png|gif|bmp)$ from URL using regular expression
[end]
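
As a rough way to read the three expressions in this last set of filters, the Python sketch below runs them against a few URLs. The URL shapes are guesses made up for the example, not taken from the real site, and the `classify` helper with its "add" / "follow" / "ignore" labels is only an illustration of which filter would fire, assuming Python's `re` behaves like BlackWidow's engine for these patterns.

```python
import re

FOLLOW_GALLERY = re.compile(r"/gallery/pictures/anime/touhou/\d+$")
FOLLOW_ART = re.compile(r"/art/\d+/[^/#]+$")  # one path segment, no # fragment
ADD_PREVIEW = re.compile(r"/thumbnails/preview/.*\.(jpg|png|gif|bmp)$")


def classify(url: str) -> str:
    """Say which kind of filter, if any, would match the URL."""
    if ADD_PREVIEW.search(url):
        return "add"
    if FOLLOW_GALLERY.search(url) or FOLLOW_ART.search(url):
        return "follow"
    return "ignore"


# Hypothetical URLs for illustration only.
examples = [
    "http://www.animepaper.net/gallery/pictures/anime/touhou/2",
    "http://www.animepaper.net/art/12345/some-title",
    "http://www.animepaper.net/art/12345/some-title#comments",
    "http://img.example.invalid/thumbnails/preview/ab/pic.jpg",
]
for url in examples:
    print(classify(url), url)
```

The `[^/#]+$` part is what keeps the same art page from being followed again through its `#...` anchor links.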