Hi, I am trying to get the BlackWidow program to scan a site and download all of the images on a particular part of the site. The site is http://www.zerochan.net/touhou (I want only the images in the "Touhou" part of the site)
The images come in three sizes: a thumbnail (the URL contains .240.), a small (.600.) and a full (.full.). I want just the small images.
There are currently 44,957 images on this part of the site, and the "page numbers" (even though it says there are over 1800 pages) only go up to 100; after that, it starts listing the id of the first image on each page instead.
I have been trying to get BlackWidow to scan that part of the site for a while but haven't had any luck.
[BlackWidow v6.00 filters]
URL = http://www.zerochan.net/touhou
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page:
Startup referrer:
[ ] Slow down by 1:1 seconds
[x] Replace .240. with .600. using plain text
[x] Replace "? with "http://www.zerochan.net/touhou? using plain text
[x] Follow /touhou\?p=(([0-9]{1,2})|(100))$ using regular expression
[x] Add .600. from URL using plain text
[end]
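For anyone wondering what the two Replace rules above do to a page, here is a rough Python sketch. It assumes the rules are applied as simple substring substitutions on the page text, which is a guess about BlackWidow's behaviour, and the HTML fragment and image host are made up for illustration.

```python
def apply_replacements(text):
    """Apply the two plain-text Replace rules from the filter list, in order."""
    # Rule 1: point thumbnail URLs (.240.) at the small size (.600.).
    text = text.replace(".240.", ".600.")
    # Rule 2: turn relative links that start with "? into absolute ones.
    text = text.replace('"?', '"http://www.zerochan.net/touhou?')
    return text

# Hypothetical page fragment; the real markup and image host may differ.
html = '<a href="?p=2"><img src="http://images.example/Touhou.240.12345.jpg"></a>'
result = apply_replacements(html)
print(result)
```

If the real links look different (for example single quotes instead of double quotes), the second rule would need adjusting.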
Thank you! That worked! How do I extend it to include pages past 100 though? (Like, do I put the same thing as the /touhou\?p= thing except with o=, and make it a range between a low and a high number?)
[0-9] means any single digit from 0 to 9.
{1,2} means 1 or 2 of the previous expression.
| means OR
and 100 means just that, exactly 100
So when you read it, it's like this...
[0-9]{1,2} means any digit from 0 to 9, but either 1 or 2 of them. This will cover 0 to 99.
That expression is enclosed between ( and ), so it is now ([0-9]{1,2})
Now we have | which means the previous expression or the next one.
The next one is simply (100).
So in all, it means there is a match if there are one or two digits from 0 to 9, or if it is 100; otherwise there is no match.
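To check that reading, here is a minimal Python sketch that tries the Follow pattern from the filter list against a few links. Python's re module is only a stand-in for BlackWidow's own RegEx engine, which may differ in details, and the sample links are made up for illustration.

```python
import re

# The "Follow" pattern from the filter list above.
PAGE_FILTER = re.compile(r"/touhou\?p=(([0-9]{1,2})|(100))$")

def is_followed(url):
    """Return True if the pattern matches somewhere in the URL."""
    return PAGE_FILTER.search(url) is not None

# Sample links (made up for illustration).
samples = [
    "http://www.zerochan.net/touhou?p=7",
    "http://www.zerochan.net/touhou?p=99",
    "http://www.zerochan.net/touhou?p=100",
    "http://www.zerochan.net/touhou?p=101",
    "http://www.zerochan.net/touhou?o=1204438",
]

for url in samples:
    print(url, is_followed(url))
```

The first three links should be followed and the last two should not: 101 is neither one or two digits nor exactly 100, and the o= link does not contain p= at all.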
If you want to make it go to 200, then we would do this...
(1?[0-9]{1,2})|(200)
1? means either there is nothing or there is a 1, followed by 1 or 2 digits from 0 to 9; the alternative after the | is 200.
So 1? can be nothing or 1, then 1 or 2 digits, so it can be 0 to 199, or 200.
If you want it to go to 500, you'd use this...
([1-4]?[0-9]{1,2})|(500)
[1-4]? means either nothing or a single digit from 1 to 4, followed by 0 to 99. That gives us 0 to 499, and then the OR 500.
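The two range expressions can be checked by brute force. This is a sketch in Python rather than in BlackWidow itself, and it assumes each expression is wrapped in an outer group and followed by $, the same way the original /touhou\?p= filter wraps ([0-9]{1,2})|(100). Without that outer group the | would split the whole pattern in two.

```python
import re

# Range expressions from the explanation above, wrapped in an outer group so
# that the /touhou\?p= prefix and the trailing $ apply to both alternatives.
UP_TO_200 = re.compile(r"/touhou\?p=((1?[0-9]{1,2})|(200))$")
UP_TO_500 = re.compile(r"/touhou\?p=(([1-4]?[0-9]{1,2})|(500))$")

def matched_pages(pattern, limit=1000):
    """Return every page number below `limit` whose link the pattern matches."""
    return [n for n in range(limit) if pattern.search(f"/touhou?p={n}")]

# Each comparison checks that exactly the pages 0..200 (or 0..500) match.
print(matched_pages(UP_TO_200) == list(range(201)))
print(matched_pages(UP_TO_500) == list(range(501)))
```

The same helper can be pointed at any other range expression before pasting it into the filter list.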
And the $ at the end means the end of the line. So php$ means it has to end with php, like index.php or thumbs.php, but not index.php?id=100
You can use the RegEx evaluator in the BW Filters to test your expressions and see if they work. If you have a range of numbers you want to make one for, let me know and I'll do it.
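A quick way to see the effect of $ is to run the php$ example in Python. This is only an illustration with Python's re module; BlackWidow's own evaluator is the one that counts for the actual filters.

```python
import re

# php$ only matches when "php" is the very last thing in the string.
ENDS_WITH_PHP = re.compile(r"php$")

for name in ["index.php", "thumbs.php", "index.php?id=100"]:
    print(name, bool(ENDS_WITH_PHP.search(name)))
```

The first two names match; the third does not, because ?id=100 comes after php.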
If I can find another site with 40,000 images of that topic again ^^' But is there any way of creating a filter that also includes the pages past 100? You can't just put 101 or 102 on the end of it; it would need a range, something like 1 to 1204438, because after page 100 each "page number" is actually the id of the first picture on the page (even though there are over 1800 pages).
Thanks! I will need to have two though, because after page 100 it goes from "p=" to "o=". Will it work if I have two filters there, one with p= and the other with o=?
[BlackWidow v6.00 filters]
URL = http://www.zerochan.net/touhou
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page:
Startup referrer:
[ ] Slow down by 1:1 seconds
4 threads
[x] Replace .240. with .600. using plain text
[x] Replace "? with "http://www.zerochan.net/touhou? using plain text
[x] Do not follow login using plain text
[x] Follow /touhou\?p=\d+$ using regular expression
[x] Add .600. from URL using plain text
[end]
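On the p= versus o= question, here is a small Python sketch of how the \d+ filter above behaves and what a second filter for o= might look like. The o= pattern and the combined [po] pattern are hypothetical illustrations, not part of the posted filter set, and the o= link format is taken only from the description earlier in the thread.

```python
import re

# The page filter from the filter set above: any run of digits after p=.
P_FILTER = re.compile(r"/touhou\?p=\d+$")
# Hypothetical second filter for the o= links described in the thread.
O_FILTER = re.compile(r"/touhou\?o=\d+$")
# Hypothetical single filter covering both, using a character class.
PO_FILTER = re.compile(r"/touhou\?[po]=\d+$")

def follows(url):
    """True if either of the two separate filters matches the URL."""
    return bool(P_FILTER.search(url) or O_FILTER.search(url))

samples = [
    "http://www.zerochan.net/touhou?p=100",
    "http://www.zerochan.net/touhou?o=1204438",  # id-style link, value from the thread
    "http://www.zerochan.net/touhou?p=2&x=1",    # extra parameter, rejected by $
]
for url in samples:
    # Two separate filters and the combined filter should agree on each link.
    print(url, follows(url), bool(PO_FILTER.search(url)))
```

Because \d+ accepts any number of digits, no explicit low and high range is needed for the ids; whether BlackWidow treats two Follow lines as "either one matches" is an assumption worth confirming in its RegEx evaluator.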
I want the full image and not the thumbnails. (I really need to learn how to set filters better... I tried messing with it a little and couldn't get it working.)