Website lookup from a list and link capture

BrownRecluse is a programmable web spider. Scan a web site and retrieve the information you need from it. For example, you could scan a real estate web site, collect all of the agents' addresses, phone numbers, and emails, and place the data into a tab-delimited database file, then import that file into Excel.
Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Unfortunately not, because BR needs IE to begin with!
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Ok. Thanks

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Could the script run a timer that forces the spider to go to the next URL once the timer expires? Would this circumvent the login popup?

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Unfortunately not!
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Using this script, how can I pull metadata from the page header (the keywords meta tag and its content) into the delimited output file?

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Can you give me an example?
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

From this URL:

http://www.usaultimate.org/index.html

if I look at the source on lines 5 and 6 I see:

<title id="PageTitle">USA Ultimate | Home Page</title>
<meta name="keywords" content="USA Ultimate,UPA,Ultimate,Disc,National Governing Body,Ultimate Players Association,US Ultimate,Spirit of the Game,Self Officiated,Sportsmanship,SOTG,Frisbee,College Ultimate,Club Ultimate,Youth Ultimate,Juniors Ultimate,Juniors Frisbee,Championships,Sanctioning,Observers,Coaching,WFDF,Nationals,Regionals,Sectionals,Score Reporter,Ultrastar,Tournament,League,Ultimate Videos,Ultimate Photos,Ultimate Tournament,Huck,Pull,Flying Disc,Layout,Forehand,Backhand,Hammer,Field Sport" />

I'd like to output the keywords and the page title to the txt file.

If possible, I'd also like to output the date the page was last updated.

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

OK, no problem, but are you implementing this in an existing script? If so, which one?
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

I made some slight modifications to the last script you provided in this thread. I wanted the output in a delimited format. I chose "|" as my delimiter because a "," caused issues when reading the output file into Excel (there is probably a better delimiter to use). I also changed the script to output just the server status instead of the full header.

PerlRegEx = Yes;
Output.Clear;

Keywords = 'about, contact';

sk = New(Stack);
sk.Split(Keywords, ',');
sk.Reverse;

fn = SelectFile('Open file...');
if fn = Nothing then Terminate;

f = New(File);
f.Open(fn);
f.Seek(BeginningOfFile);

DecodeFileName(fn, [drv,dir,fln], [Drive,Directory,FileName]);
outfile = drv+dir+fln + '.output.txt';
f2 = New(File);
f2.Open(outfile);
f2.Truncate;

Link = New(URL);

while f.Position < f.Size do begin

  lnk = f.Read;

  if Link.Get(lnk) then begin
    hdr = Link.ServerCode;
    f2.Write(lnk+'|');
    f2.Write(Trim(hdr)+'|');
    for i = 1 to sk.Count do begin
      k = sk.Items;
      cnt = WildGet(Link.Data, 'href="([^"]*'+k+'[^"]*)"');
      if cnt = Nothing then
        cnt = WildGet(Link.Data, "href='([^']*"+k+"[^']*)'");
      if cnt = Nothing then
        cnt = WildGet(Link.Data, '<a[^>]+href="([^"]+)">[^<]*'+k);
      if cnt = Nothing then
        cnt = WildGet(Link.Data, "<a[^>]+href='([^']+)'>[^<]*"+k);
      if cnt then begin
        cnt = Link.FixUp(cnt);
        f2.Write(cnt+'|');
      end;
    end;
  end;
  f2.Write(crlf);
end;

f2.close;
f.close;

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

OK, then add these 2 lines right after "if Link.Get(lnk) then begin"

Code:

	  PageTitle = WildGet(Link.Data, '<title[^>]*>([^<]*)');
	  PageKeywords = WildGet(Link.Data, '<meta\s+name="keywords"\s+content="([^"]+)"');
and then add PageTitle and PageKeywords to your f2.Write command. I use TAB as a delimiter because Excel loves tabs when importing...

f2.Write(cnt+TAB+PageTitle+TAB+PageKeywords);
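As a sanity check, the same two patterns can be exercised with Python's re module (a sketch; the sample HTML is trimmed from the page discussed above, and WildGet's dialect may differ in other details):

```python
import re

# Header lines from the example page, trimmed for brevity
html = ('<title id="PageTitle">USA Ultimate | Home Page</title>\n'
        '<meta name="keywords" content="USA Ultimate,UPA,Ultimate,Disc" />')

# Same two patterns as the WildGet calls above
title_m = re.search(r'<title[^>]*>([^<]*)', html)
keywords_m = re.search(r'<meta\s+name="keywords"\s+content="([^"]+)"', html)

page_title = title_m.group(1) if title_m else ''
page_keywords = keywords_m.group(1) if keywords_m else ''
print(page_title)     # USA Ultimate | Home Page
print(page_keywords)  # USA Ultimate,UPA,Ultimate,Disc
```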
Your support team.
http://SoftByteLabs.com

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

One more thing: when you want to paste code in a message, click on the Code button; it'll show your code the same way as mine.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Works great, thanks! I see the Code button; I'll use it next time.

Can BrownRecluse compile and execute JavaScript code?

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

It does not execute JavaScript, because that would be a huge security risk. But if you have a page with some JavaScript code, you can copy it and, with a little modification, it will run in BrownRecluse to some extent.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

ok. thanks

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

For the script in this thread, is there a way to retrieve the server IP address for each URL and place it in the .txt output file?

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

I do not believe so. I can't find any reference for getting the IP.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

If I wanted to capture phone numbers, URLs, and ZIP codes in my script using:

\((?<AreaCode>\d{3})\)\s*(?<Number>\d{3}(?:-|\s*)\d{4})(?x) # Phone numbers

(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*(?x) # URL

(?<Zip>\d{5})-(?<Sub>\d{4})(?x) # Zip Codes

Can I insert these expressions directly into the script?
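For a quick check outside the tool, here are the same three patterns exercised with Python's re module (a sketch; Python spells named groups (?P<name>...) where the originals use the PCRE/.NET (?<name>...) form, and the sample text is hypothetical):

```python
import re

# The three patterns above, rewritten with Python's (?P<name>...) group syntax
phone_re = re.compile(r'\((?P<AreaCode>\d{3})\)\s*(?P<Number>\d{3}(?:-|\s*)\d{4})')
url_re = re.compile(r'(?P<Protocol>\w+)://(?P<Domain>[\w.]+/?)\S*')
zip_re = re.compile(r'(?P<Zip>\d{5})-(?P<Sub>\d{4})')

# Hypothetical sample text just for exercising the patterns
sample = 'Call (555) 123-4567, see http://www.usaultimate.org/index.html, ZIP 80301-2112'

phone = phone_re.search(sample)
url = url_re.search(sample)
zip_code = zip_re.search(sample)
print(phone.group('AreaCode'), phone.group('Number'))  # 555 123-4567
print(url.group('Protocol'), url.group('Domain'))      # http www.usaultimate.org/
print(zip_code.group('Zip'), zip_code.group('Sub'))    # 80301 2112
```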

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Yes you can. But I would suggest you try them first in the "Expression evaluator" window to make sure they work.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

The script runs great, but I am having trouble breaking out the results from the output file in an organized way. Would it be possible to output the file with column headings for each keyword and header result, leaving the field empty when there is no result?

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

I've noticed that some of the captured data contains a line feed, and occasionally maybe a tab. Is there a way to encapsulate the output or remove the line feed/tab when the data is captured?

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Yes, you can remove them from the variable holding the data, for example...

Code:

PageTitle = PageTitle - '\t|\r|\n';
The | character means "OR" as in "this or that": \t means a TAB, \r a RETURN, and \n a NEWLINE. You can also do x = x - 'some regex text';.
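For readers outside BrownRecluse, the same cleanup can be sketched with Python's re.sub (the captured value here is hypothetical):

```python
import re

# Equivalent of PageTitle = PageTitle - '\t|\r|\n'; in Python:
# delete every tab, carriage return, and newline from the captured value
page_title = 'USA Ultimate |\r\n\tHome Page'   # hypothetical captured value
cleaned = re.sub(r'\t|\r|\n', '', page_title)
print(cleaned)  # USA Ultimate |Home Page
```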
Your support team.
http://SoftByteLabs.com


gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Or would it be better to output a separate file for each keyword/header result, since there might be more than one result for each keyword/header query?

Support
Site Admin
Posts: 1881
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

What you need to do is write each field to the file, even when it is empty, so the columns stay aligned. The way you have it now, it only writes the field when it is not empty...

if cnt then begin
  cnt = Link.FixUp(cnt);
  f2.Write(cnt+'|');
end;

should be...

if cnt then
  cnt = Link.FixUp(cnt);

f2.Write(cnt+'|');
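The fix can be sketched in Python: append something for every keyword, even when the lookup found nothing, so every row has the same column count (the example.com URLs and the matches dict are hypothetical):

```python
keywords = ['about', 'contact']
matches = {'about': 'http://example.com/about'}  # 'contact' found no link

fields = ['http://example.com', '200']           # URL and server status columns
for k in keywords:
    fields.append(matches.get(k, ''))            # empty string keeps the column

row = '|'.join(fields)
print(row)  # http://example.com|200|http://example.com/about|
```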
Your support team.
http://SoftByteLabs.com
