How to crawl urls with no file extension?

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
avalanch
Posts: 43
Joined: Fri Mar 16, 2012 11:31 am

Post by avalanch »

For example, a site that has URLs like:

http://www.site.com/forums/65-Guadosalam
http://www.site.com/threads/583-Happy-Birthday-Someone

The two examples above were generated by vBSEO, a popular SEO plugin for vBulletin.

How can I make it crawl these and save them as HTML or something? BlackWidow v6.30 tends to just skip them rather than download them.

Also, I've noticed more and more sites rewriting their URLs like this with .htaccess, nginx and other software. A big example is GameFAQs:
http://www.gamefaqs.com/nes/587273-faxanadu/cheats

avalanch
Posts: 43
Joined: Fri Mar 16, 2012 11:31 am

Re: How to crawl urls with no file extension?

Post by avalanch »

Well? Is there a way to do this?

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: How to crawl urls with no file extension?

Post by Support »

I don't quite understand your question, and without real URLs I cannot test to see if there is a way to make it work.

Usually, if there is a problem, it's because the page contains JavaScript that builds the URLs, so they are not static in the page.
Your support team.
http://SoftByteLabs.com

avalanch
Posts: 43
Joined: Fri Mar 16, 2012 11:31 am

Re: How to crawl urls with no file extension?

Post by avalanch »

I want to know how to make your software crawl and download these pages when it encounters them.
Instead of saving the page and calling the end result .file, can we make it save it as .html?

http://www.finalfantasyforums.net/forum.php

That is one of the sites I'd like to crawl.

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: How to crawl urls with no file extension?

Post by Support »

I am still not clear on this. When you scan a site and have links like /index.php?id=w252344&token=46452372, BlackWidow will save them with the .html extension automatically when you download them.
Your support team.
http://SoftByteLabs.com

avalanch
Posts: 43
Joined: Fri Mar 16, 2012 11:31 am

Re: How to crawl urls with no file extension?

Post by avalanch »

Unfortunately it doesn't seem to want to put the .html for me.

Image

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: How to crawl urls with no file extension?

Post by Support »

That's because you have "Hide extensions for known file types" set in the Folder/View options. If you turn this off, you will see the extension.
Attachment: Untitled-1.jpg (215.13 KiB)
Your support team.
http://SoftByteLabs.com

avalanch
Posts: 43
Joined: Fri Mar 16, 2012 11:31 am

Re: How to crawl urls with no file extension?

Post by avalanch »

Even with what you suggested, it still remains the same. I verified it further by opening the file in EditPlus, where it displays the correct syntax colors for opened/closed tags and other good stuff.

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: How to crawl urls with no file extension?

Post by Support »

Well, it seems to work for me...
Attachment: Untitled-1.jpg (403.72 KiB)
Your support team.
http://SoftByteLabs.com

avalanch
Posts: 43
Joined: Fri Mar 16, 2012 11:31 am

Re: How to crawl urls with no file extension?

Post by avalanch »

And what about the threads? Those are the main ones that give trouble. Also, maybe you have different rule sets than me, because mine isn't saving as .php.htm;
instead it's saving with apparently no extension, like this:
2468-Ultimecia-Vs-Sephiroth-Vs-Kuja-Vs-Kefka
When it does encounter a .php file, it saves it with the correct extension, but for directory-style URLs with no visible extension, it does not append one.

So instead of 2468-Ultimecia-Vs-Sephiroth-Vs-Kuja-Vs-Kefka becoming 2468-Ultimecia-Vs-Sephiroth-Vs-Kuja-Vs-Kefka.html, it's just not working right in that regard.
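To make the expected behaviour concrete, here is a minimal Python sketch of the naming rule being asked for. It is not BlackWidow's code; `local_filename` and its `default_ext` parameter are hypothetical names used only for illustration. It keeps an existing extension and appends .html only when the last path segment has none.

```python
import posixpath
from urllib.parse import urlsplit


def local_filename(url: str, default_ext: str = ".html") -> str:
    """Derive a file name from a URL, adding default_ext if none is present.

    Hypothetical helper for illustration; not part of BlackWidow.
    """
    # Work on the path only, so query strings and fragments are ignored.
    path = urlsplit(url).path
    # Take the last path segment; fall back to "index" for a bare directory.
    name = posixpath.basename(path.rstrip("/")) or "index"
    # splitext returns an empty extension when the segment contains no dot.
    _, ext = posixpath.splitext(name)
    return name if ext else name + default_ext


for url in (
    "http://www.site.com/threads/583-Happy-Birthday-Someone",
    "http://www.gamefaqs.com/nes/587273-faxanadu/cheats",
    "http://www.finalfantasyforums.net/forum.php",
):
    print(url, "->", local_filename(url))
```

With this rule, forum.php keeps its name, while the extension-less thread and cheats URLs get .html appended.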

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: How to crawl urls with no file extension?

Post by Support »

That setting is automatic, and I downloaded the threads and all of them had .htm on them.

What I think is that you do not have the HTML MIME type set on your PC. What I mean is, you do not have the .htm extension assigned to open with MSIE or Chrome. So what I suggest is that you rename one of those files to .html and double-click it. Windows will ask which software you want to open it with; select IE or Chrome, and then try downloading some of the files again with BW to see if they get the .htm extension.
Your support team.
http://SoftByteLabs.com

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: How to crawl urls with no file extension?

Post by Support »

Also, check the MIME database in your registry to see if it is set...
Attachment: Untitled-1.jpg (165.76 KiB)
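The same check can be done without regedit. The following is a rough Python sketch, assuming the standard Windows location `HKEY_CLASSES_ROOT\MIME\Database\Content Type\<mime type>` with an `Extension` value; `registry_extension` is a hypothetical helper name, and on non-Windows systems it simply returns None.

```python
import sys


def registry_extension(mime_type: str):
    """Return the extension Windows maps to mime_type, or None if unavailable.

    Hypothetical helper for illustration; reads the registry, changes nothing.
    """
    if sys.platform != "win32":
        return None  # The MIME database lives in the Windows registry only.
    import winreg  # Standard library module, available on Windows only.

    key_path = "MIME\\Database\\Content Type\\" + mime_type
    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "Extension")
            return value
    except OSError:
        # Key or value is missing, or access to the registry was denied.
        return None


print(registry_extension("text/html"))
```

If this prints None on Windows, the text/html entry is missing or unreadable, which would be consistent with files being saved without an extension.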
Your support team.
http://SoftByteLabs.com

avalanch
Posts: 43
Joined: Fri Mar 16, 2012 11:31 am

Re: How to crawl urls with no file extension?

Post by avalanch »

Yes, the MIME type was set correctly. I manage several sites and do some maintenance on them now and then, so having a browser set to open htm/html/shtml/php files etc. is pretty important to me, but as per your request I checked via regedit and here is the result.
Attachment: softbytemime.png (296.88 KiB)

Support
Site Admin
Posts: 1892
Joined: Sun Oct 02, 2011 10:49 am

Re: How to crawl urls with no file extension?

Post by Support »

I really don't know why it does not work for you. I also use BW v6.30.

When BW tries to save a file without an extension, it looks up the MIME type in the registry and uses that extension.
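As a rough analogue of that lookup (not BW's actual implementation), the Python sketch below picks an extension from the server's Content-Type header when the URL has none. `filename_from_response` is a hypothetical name, and Python's `mimetypes` module stands in for the registry MIME database here; the exact extension it returns (.html or .htm) can vary by platform and Python version.

```python
import mimetypes
import posixpath
from urllib.parse import urlsplit


def filename_from_response(url: str, content_type: str) -> str:
    """Name a downloaded file, using the MIME type when the URL has no extension.

    Hypothetical helper for illustration; mimetypes stands in for the registry.
    """
    name = posixpath.basename(urlsplit(url).path.rstrip("/")) or "index"
    if posixpath.splitext(name)[1]:
        return name  # The URL already carries an extension; keep it.
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    # guess_extension returns None for unknown types; then no extension is added.
    ext = mimetypes.guess_extension(mime) or ""
    return name + ext


print(filename_from_response(
    "http://www.site.com/forums/65-Guadosalam", "text/html; charset=utf-8"
))
```

If the lookup fails (here, `guess_extension` returning None), the file ends up with no extension at all, which matches the symptom described in this thread.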

Maybe your anti-virus has blocked BW from accessing the registry?
Your support team.
http://SoftByteLabs.com
