Filters Help

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Post by mdemon » Wed Jul 25, 2012 6:28 pm

Hi, I am trying to get the BlackWidow program to scan a site and download all of the images on a particular part of the site. The site is
http://www.zerochan.net/touhou (I want only the images in the "Touhou" part of the site)

There are three sizes of each image: a thumbnail (the URL contains .240.), a small (.600.) and a full (.full.). I want just the small images.

There are currently 44,957 images on this part of the site, and the "page numbers" (even though it says there are over 1800 pages) only go up to 100; after that it starts listing the ID of the first image on each page.

I have been trying to get BlackWidow to scan that part of the site for a while but haven't had any luck :(

Any help would be great :D

Thanks!

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Wed Jul 25, 2012 6:56 pm

Here are the filters. Copy the block of text below and in the Filters window, click on "Paste Settings" and start the scan...

[BlackWidow v6.00 filters]
URL = http://www.zerochan.net/touhou
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 1:1 seconds
[x] Replace .240. with .600. using plain text
[x] Replace "? with "http://www.zerochan.net/touhou? using plain text
[x] Follow /touhou\?p=(([0-9]{1,2})|(100))$ using regular expression
[x] Add .600. from URL using plain text
[end]
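
For anyone curious what the two .600. rules amount to, here is a minimal Python sketch. It is not BlackWidow's code: the function names and the example URL are made up for illustration, and it assumes the plain-text Replace rule is a simple substring substitution and the Add rule a simple substring test.

```python
def to_small(url: str) -> str:
    """Swap the thumbnail size marker for the small-image marker."""
    return url.replace(".240.", ".600.")


def is_small(url: str) -> bool:
    """Keep only URLs that carry the .600. marker."""
    return ".600." in url


# Hypothetical thumbnail URL, used only to show the rewrite.
thumb = "http://example.invalid/Some.Image.240.12345.jpg"
small = to_small(thumb)

print(small)
print(is_small(thumb), is_small(small))
```

The thumbnail link is rewritten before it is tested, so only the rewritten link passes the .600. check.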
Your support team.
http://SoftByteLabs.com

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 8:55 pm

Thank you! That worked! How do I extend it to include pages past 100 though? (Do I put the same thing as the /touhou\?p= filter, except with o=, and make it a range between a low and a high number?)

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Wed Jul 25, 2012 9:32 pm

The regular expression I used works as follows...

Let's break it down into parts...

[0-9] means any digit from 0 to 9, but only one.
{1,2} means 1 or 2 of the previous expression.
| means OR
and 100 means just that, exactly 100

So when you read it, it's like this...

[0-9]{1,2} means any digit from 0 to 9, but either 1 or 2 of them. This will cover from 0 to 99.
That expression is enclosed between ( and ), so it is now ([0-9]{1,2})
Now we have | which means the previous expression or the next one.
The next one is simply (100).
So in all, it means there is a match if there are one or two digits from 0 to 9, or if it is 100; otherwise there is no match.
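
The breakdown above can be checked with a short Python sketch. This assumes Python's `re` module treats these simple constructs the same way the BW regex engine does, and that the filter is matched anywhere in the URL; the helper name `follows` is made up for the example.

```python
import re

# The Follow filter from the first set of filters in this thread.
PAGE_FILTER = re.compile(r"/touhou\?p=(([0-9]{1,2})|(100))$")


def follows(url: str) -> bool:
    """True if the filter matches somewhere in the URL."""
    return PAGE_FILTER.search(url) is not None


for page in (0, 7, 42, 99, 100, 101, 1000):
    url = f"http://www.zerochan.net/touhou?p={page}"
    print(page, follows(url))
```

Pages 0 to 100 are accepted; 101 and 1000 are rejected because the $ leaves no room for a third digit unless the number is exactly 100.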

If you want to make it go to 200, then we would do this...
(1?[0-9]{1,2})|(200)

1? means there is either nothing or a 1, followed by 1 or 2 digits from 0 to 9, or 200.

So 1? can be nothing or 1, then 1 or 2 digits, so it can be 0 to 199, or 200.

If you want it to go to 500, you'd use this...
([1-4]?[0-9]{1,2})|(500)

[1-4]? means either nothing or a single digit from 1 to 4, followed by 0 to 99; that gives us 0 to 499, and then the OR 500.
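
As a sanity check on the two range expressions, this Python sketch tries every number below 1000 against each one. It tests the number part on its own with `fullmatch`; in a real filter the expression would sit inside `/touhou\?p=( ... )$`. The helper name `accepted` is made up, and Python's `re` is assumed to behave like the BW engine here.

```python
import re

# Number part only, as described in this post.
UP_TO_200 = re.compile(r"(1?[0-9]{1,2})|(200)")
UP_TO_500 = re.compile(r"([1-4]?[0-9]{1,2})|(500)")


def accepted(pattern: re.Pattern, limit: int = 1000) -> list[int]:
    """Return every n below limit whose decimal form matches in full."""
    return [n for n in range(limit) if pattern.fullmatch(str(n))]


# Each comparison checks that exactly 0..200 (or 0..500) is accepted.
print(accepted(UP_TO_200) == list(range(201)))
print(accepted(UP_TO_500) == list(range(501)))
```

Both lines print True, so each expression accepts exactly the intended range and nothing above it.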

And the $ at the end means the end of the line. So php$ means it has to end with php, like index.php or thumbs.php, but not index.php?id=100
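
The effect of $ can be seen in a few lines of Python (again assuming Python's `re` as a stand-in for the BW engine):

```python
import re

ENDS_WITH_PHP = re.compile(r"php$")

for name in ("index.php", "thumbs.php", "index.php?id=100"):
    # search() looks anywhere in the string, but $ pins the match to the end.
    print(name, ENDS_WITH_PHP.search(name) is not None)
```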

You can use the RegEx evaluator in the BW Filters to test your expressions and see if they work. If you have a range of numbers you want to make one for, let me know and I'll do it.

I hope I made sense!

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 9:41 pm

I got that :) Thanks for explaining it, but sadly, on this site anything past page 100 looks like this: http://www.zerochan.net/touhou?o=1155370

So sadly, can't add new pages over 100 ^^'

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Wed Jul 25, 2012 9:51 pm

It may come in handy on another site though!

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 10:25 pm

If I can find another site with 40,000 images on that topic again ^^' But is there any way of creating a filter that will also include the pages past 100? You can't put 101 or 102 and so on at the end of it; it would need a range, something between 1 and 1204438 or so, because after page 100 each "page number" is actually the ID of the first picture on the page, even though there are over 1800 pages.

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Wed Jul 25, 2012 10:34 pm

Anything is possible with regular expressions. If you got a site with that many pages, let me know and I'll make you the filters.

You can't just put 102 at the end, because that's an OR, so it would mean 0-99 or 102, with no 100 and no 101.
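
A quick Python sketch of that pitfall, testing the number part on its own. The second expression, with `10[0-2]`, is just one possible way to cover 100 to 102; it is not something from the posts above, and the names are made up for the example.

```python
import re

NAIVE = re.compile(r"([0-9]{1,2})|(102)")      # 0-99 OR exactly 102
WIDER = re.compile(r"([0-9]{1,2})|(10[0-2])")  # 0-99 OR 100, 101, 102


def accepted(pattern: re.Pattern, limit: int = 110) -> list[int]:
    """Return every n below limit whose decimal form matches in full."""
    return [n for n in range(limit) if pattern.fullmatch(str(n))]


# Numbers the second expression accepts that the first one misses.
missing = sorted(set(accepted(WIDER)) - set(accepted(NAIVE)))
print(missing)
```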

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 10:42 pm

At the bottom of http://www.zerochan.net/Touhou it says "page 1 of 1,874", but once you try to go past page 100, the URL goes from http://www.zerochan.net/Touhou?p=100 to http://www.zerochan.net/Touhou?o=1155370 , which should really say "p=101" instead of "o=1155370", but it doesn't.
Then it goes to http://www.zerochan.net/Touhou?o=1154513 , which should really say "p=102" instead of "o=1154513", but it doesn't either.

Try going to http://www.zerochan.net/Touhou?p=100 then at the bottom of the page click "next", and you will see what I mean.

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Wed Jul 25, 2012 10:48 pm

I know what you mean about the page number, but do you want to scan those pages above 100 also?

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 10:50 pm

Yes please ^^' sorry if I wasn't making myself clear.

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Wed Jul 25, 2012 10:52 pm

Oh OK, then change the filter that contains...

/touhou\?p=(([0-9]{1,2})|(100))$

to this...

/touhou\?p=\d+$

and that will scan all the pages, from 0 to as many as they list.

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 10:55 pm

Thanks! I will need to have two though, because after page 100 it goes from "p=" to "o=". Will it work if I have two filters there, one with p= and the other with o=?

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Wed Jul 25, 2012 10:56 pm

Yes, that will work, or change p to [po] and it'll do both on one line.
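
To see the difference between the p-only filter and the [po] version, here is a small Python sketch (Python's `re` standing in for the BW engine; the third URL is made up to show a non-numeric value being rejected):

```python
import re

P_ONLY = re.compile(r"/touhou\?p=\d+$")
P_OR_O = re.compile(r"/touhou\?[po]=\d+$")  # [po] means one character: p or o

urls = [
    "http://www.zerochan.net/touhou?p=100",
    "http://www.zerochan.net/touhou?o=1155370",
    "http://www.zerochan.net/touhou?p=abc",  # hypothetical, not a real page
]

for url in urls:
    print(url, P_ONLY.search(url) is not None, P_OR_O.search(url) is not None)
```

Only the [po] version picks up the o= pages that the site uses past page 100.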

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 11:03 pm

Thank you so much! That is working just fine! :) 520 images downloaded so far!

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Wed Jul 25, 2012 11:35 pm

How would I make it ignore the "login" page? It keeps scanning "http://www.zerochan.net/login?ref=/logi ... login?ref="

And each time it scans that specific thing, it adds another "/login?ref=", and it scans it like 8 at a time.

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Thu Jul 26, 2012 12:17 am

Just add a filter that reads...

[x] Do not follow login using plain text

and move it up as far as it can go.

Here are the new filters to skip that link...

[BlackWidow v6.00 filters]
URL = http://www.zerochan.net/touhou
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 1:1 seconds
4 threads
[x] Replace .240. with .600. using plain text
[x] Replace "? with "http://www.zerochan.net/touhou? using plain text
[x] Do not follow login using plain text
[x] Follow /touhou\?[po]=\d+$ using regular expression
[x] Add .600. from URL using plain text
[end]
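
Why the login rule is moved up can be sketched in Python. This is only an illustration under two assumptions that the thread does not confirm: that rules are checked from top to bottom and that the first matching rule decides. The rule list, the function name and the example URL are all made up; the Follow pattern is the [po] variant suggested earlier in the thread.

```python
import re

FOLLOW = re.compile(r"/touhou\?[po]=\d+$")

# Each rule is (action, test). Order matters under the first-match assumption.
RULES = [
    ("skip", lambda url: "login" in url),
    ("follow", lambda url: FOLLOW.search(url) is not None),
]


def decide(url: str, rules=RULES) -> str:
    """Return the action of the first rule that matches, else 'ignore'."""
    for action, matches in rules:
        if matches(url):
            return action
    return "ignore"


# Hypothetical login link that also happens to end like a page link.
url = "http://example.invalid/login?ref=/touhou?p=3"
print(decide(url))                  # login rule first
print(decide(url, RULES[::-1]))     # follow rule first
```

With the login rule on top the link is skipped; with the order reversed the same link would be followed.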

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Thu Jul 26, 2012 12:29 am

Thank you so much for your help! :)

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Thu Jul 26, 2012 12:42 am

You're welcome.

mdemon
Posts: 11
Joined: Wed Jul 25, 2012 6:20 pm

Re: Filters Help

Post by mdemon » Thu Jul 26, 2012 1:31 pm

OK, I need help again ^^' What do I change in the filters to make it go through http://www.animepaper.net/gallery/pictu ... me/touhou/ instead? (There are no "page numbers" either.)

I want the full images and not the thumbnails. (I really need to learn how to set filters better... I tried messing with it a little and couldn't get it working.)

Support
Site Admin
Posts: 1847
Joined: Sun Oct 02, 2011 10:49 am

Re: Filters Help

Post by Support » Thu Jul 26, 2012 1:51 pm

It's not the same for this site, here it is...

[BlackWidow v6.00 filters]
URL = http://www.animepaper.net/gallery/pictures/anime/touhou/
[ ] Expert mode
[ ] Scan everything
[x] Scan whole site
Local depth: 0
[x] Scan external links
[ ] Only verify external links
External depth: 0
Default index page: 
Startup referrer: 
[ ] Slow down by 1:1 seconds
4 threads
[x] Follow /gallery/pictures/anime/touhou/\d+$ using regular expression
[x] Follow /art/\d+/[^/#]+$ using regular expression
[x] Add /thumbnails/preview/.*\.(jpg|png|gif|bmp)$ from URL using regular expression
[end]