Website lookup from a list and link capture

BrownRecluse is a programmable web spider. Scan a web site and retrieve from it the information you need. For example, you could scan a real estate web site, collect all of the agent addresses, phone numbers, and emails, and place the data into a tab-delimited database file, then import it into Excel.
User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Wed Jul 24, 2013 2:34 pm

Unfortunately not, because BR needs IE to begin with!
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Wed Jul 24, 2013 2:37 pm

Ok. Thanks

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Thu Jul 25, 2013 10:12 pm

Could the script run a timer that forces the spider to go to the next URL once the timer expires? Would this circumvent the login popup?

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Thu Jul 25, 2013 10:41 pm

Unfortunately not!
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Tue Jul 30, 2013 1:37 pm

Using this script, how can I pull metadata from the header (the keywords and content) into the delimited output file?

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Tue Jul 30, 2013 2:18 pm

Can you give me an example?
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Tue Jul 30, 2013 6:55 pm

From this URL:

http://www.usaultimate.org/index.html

if I look at the source on lines 5 and 6 I see:

<title id="PageTitle">USA Ultimate | Home Page</title>
<meta name="keywords" content="USA Ultimate,UPA,Ultimate,Disc,National Governing Body,Ultimate Players Association,US Ultimate,Spirit of the Game,Self Officiated,Sportsmanship,SOTG,Frisbee,College Ultimate,Club Ultimate,Youth Ultimate,Juniors Ultimate,Juniors Frisbee,Championships,Sanctioning,Observers,Coaching,WFDF,Nationals,Regionals,Sectionals,Score Reporter,Ultrastar,Tournament,League,Ultimate Videos,Ultimate Photos,Ultimate Tournament,Huck,Pull,Flying Disc,Layout,Forehand,Backhand,Hammer,Field Sport" />

I'd like to output the keywords and the page title to the txt file.

If possible, I'd also like to output the date the page was last updated.

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Tue Jul 30, 2013 10:11 pm

OK, no problem, but are you implementing this in an existing script? And if so, which one?
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Wed Jul 31, 2013 7:01 am

I made some slight modifications to the last script you provided in this thread. I wanted the output to be in a delimited format. I chose "|" as my delimiter because a "," caused issues when reading the output file into Excel (there is probably a better delimiter to use). I also changed the script to output just the server status instead of the full header.

PerlRegEx = Yes;
Output.Clear;

Keywords = 'about, contact';

sk = New(Stack);
sk.Split(Keywords, ',');
sk.Reverse;

fn = SelectFile('Open file...');
if fn = Nothing then Terminate;

f = New(File);
f.Open(fn);
f.Seek(BeginningOfFile);

DecodeFileName(fn, [drv,dir,fln], [Drive,Directory,FileName]);
outfile = drv+dir+fln + '.output.txt';
f2 = New(File);
f2.Open(outfile);
f2.Truncate;

Link = New(URL);

while f.Position < f.Size do begin

  lnk = f.Read;

  if Link.Get(lnk) then begin
    hdr = Link.ServerCode;
    f2.Write(lnk+'|');
    f2.Write(Trim(hdr)+'|');
    for i = 1 to sk.Count do begin
      k = sk.Items;
      cnt = WildGet(Link.Data, 'href="([^"]*'+k+'[^"]*)"');
      if cnt = Nothing then
        cnt = WildGet(Link.Data, "href='([^']*"+k+"[^']*)'");
      if cnt = Nothing then
        cnt = WildGet(Link.Data, '<a[^>]+href="([^"]+)">[^<]*'+k);
      if cnt = Nothing then
        cnt = WildGet(Link.Data, "<a[^>]+href='([^']+)'>[^<]*"+k);
      if cnt then begin
        cnt = Link.FixUp(cnt);
        f2.Write(cnt+'|');
      end;
    end;
  end;
  f2.Write(crlf);
end;

f2.close;
f.close;
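For readers following along outside BrownRecluse, the link-matching logic in the loop above can be sketched in Python with the standard `re` module. This is an illustration only: the four patterns are the same ones the script tries in order, and the case-insensitive matching is an assumption I've added.

```python
import re

def find_keyword_link(html, keyword):
    """Try the same four patterns as the script: an href whose URL
    contains the keyword, then an anchor whose link text contains it."""
    patterns = [
        r'href="([^"]*' + re.escape(keyword) + r'[^"]*)"',
        r"href='([^']*" + re.escape(keyword) + r"[^']*)'",
        r'<a[^>]+href="([^"]+)">[^<]*' + re.escape(keyword),
        r"<a[^>]+href='([^']+)'>[^<]*" + re.escape(keyword),
    ]
    for pat in patterns:
        m = re.search(pat, html, re.IGNORECASE)
        if m:
            return m.group(1)
    return ''  # no match: return an empty field

html = '<a href="/about-us.html">Learn more</a> <a href="/info.html">Contact us</a>'
print(find_keyword_link(html, 'about'))    # → /about-us.html
print(find_keyword_link(html, 'contact'))  # → /info.html
```

The first two patterns catch keywords inside the URL itself; the last two catch keywords in the visible link text instead.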

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Wed Jul 31, 2013 2:13 pm

ok, then add these 2 lines right after if Link.Get(lnk) then begin

Code: Select all

	  PageTitle = WildGet(Link.Data, '<title[^>]*>([^<]*)');
	  PageKeywords = WildGet(Link.Data, '<meta\s+name="keywords"\s+content="([^"]+)"');
and then add PageTitle and PageKeywords to your f2.Write command. I use TAB as a delimiter because Excel loves tabs when importing...

f2.Write(cnt+TAB+PageTitle+TAB+PageKeywords);
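Those two expressions can be sanity-checked against the sample source from usaultimate.org with equivalent Python regexes (illustrative only; the keyword list is shortened):

```python
import re

html = '''<title id="PageTitle">USA Ultimate | Home Page</title>
<meta name="keywords" content="USA Ultimate,UPA,Ultimate,Disc" />'''

# Same two patterns as the BrownRecluse snippet above
title_m = re.search(r'<title[^>]*>([^<]*)', html)
kw_m = re.search(r'<meta\s+name="keywords"\s+content="([^"]+)"', html)

page_title = title_m.group(1) if title_m else ''
page_keywords = kw_m.group(1) if kw_m else ''

print(page_title)     # → USA Ultimate | Home Page
print(page_keywords)  # → USA Ultimate,UPA,Ultimate,Disc
```

Note that this sample title itself contains a "|", which is one more reason TAB is the safer delimiter here.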
Your support team.
http://SoftByteLabs.com

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Wed Jul 31, 2013 2:14 pm

One more thing: when you want to paste code in a message, click the Code button; it'll show your code the same way as mine.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Thu Aug 01, 2013 9:18 pm

Works great. thanks! I see the code button - I'll use it next time.

Can BrownRecluse compile and execute javascript code?

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Thu Aug 01, 2013 9:22 pm

It does not execute JavaScript because that would be a huge security risk. But if you have a page with some JavaScript code, you can copy it and, with a little modification, it will run in BrownRecluse to some extent.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Thu Aug 01, 2013 9:23 pm

ok. thanks

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Sun Aug 18, 2013 8:21 pm

For the script in this thread, is there a way to retrieve the server IP address for each URL and place it in the .txt output file?

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Sun Aug 18, 2013 8:44 pm

I do not believe so. I can't find any reference for getting the IP.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Wed Aug 21, 2013 9:06 am

If I wanted to capture phone numbers, URLs, and ZipCode in my script using:

\((?<AreaCode>\d{3})\)\s*(?<Number>\d{3}(?:-|\s*)\d{4})(?x) # Phone numbers

(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*(?x) # URL

(?<Zip>\d{5})-(?<Sub>\d{4})(?x) # Zip Codes

Can I insert these expressions directly to the script?
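For what it's worth, the patterns can be tried outside BrownRecluse too. Here is a Python sketch (Python spells named groups (?P<name>...) rather than (?<name>...), and I've dropped the trailing (?x) free-spacing flags and comments, since Python expects inline flags at the start of the pattern):

```python
import re

# The three patterns from the post, translated to Python's named-group syntax
phone = re.compile(r'\((?P<AreaCode>\d{3})\)\s*(?P<Number>\d{3}(?:-|\s*)\d{4})')
url   = re.compile(r'(?P<Protocol>\w+)://(?P<Domain>[\w.]+/?)\S*')
zcode = re.compile(r'(?P<Zip>\d{5})-(?P<Sub>\d{4})')

text = 'Call (555) 123-4567, visit http://SoftByteLabs.com, ZIP 12345-6789.'
print(phone.search(text).group('AreaCode'))  # → 555
print(url.search(text).group('Domain'))      # → SoftByteLabs.com
print(zcode.search(text).group('Zip'))       # → 12345
```

The sample text is made up; the point is only that each pattern finds its match and exposes the named groups.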

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Wed Aug 21, 2013 11:36 am

Yes you can. But I would suggest you try them first in the "Expression evaluator" window to make sure they work.
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Mon Aug 26, 2013 9:23 pm

The script runs great, but I am having trouble breaking out the results from the output file in an organized way. Would it be possible to output the file with column headings referring to each keyword and header result? If there is no result, the field would be empty.

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Tue Sep 03, 2013 8:27 pm

I've noticed that some of the captured data contains a line feed, and maybe a tab occasionally. Is there a way to encapsulate the output or remove the line feed/tab when the data is captured?

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Tue Sep 03, 2013 8:40 pm

Yes, you can remove them from the variable holding the data, for example...

Code: Select all

PageTitle = PageTitle - '\t|\r|\n';
The | character means "OR", as in "this or that"; \t means a TAB, \r a RETURN, and \n a NEWLINE. You can do x = x - 'some regex text'; as well.
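The same removal can be mirrored in Python with re.sub (illustrative; the sample string is made up):

```python
import re

raw = 'USA Ultimate | Home Page\r\n'
# Remove every tab, carriage return, and newline, as in the snippet above
cleaned = re.sub(r'\t|\r|\n', '', raw)
print(cleaned)  # → USA Ultimate | Home Page
```

One caveat: plain removal glues together words that a tab or newline separated; replacing each run with a single space, e.g. re.sub(r'[\t\r\n]+', ' ', raw).strip(), is often the safer cleanup.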
Your support team.
http://SoftByteLabs.com

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Fri Sep 06, 2013 6:53 am

The script runs great, but I am having trouble breaking out the results from the output file in an organized way. Would it be possible to output the file with column headings referring to each keyword and header result? If there is no result, the field would be empty.

gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich » Fri Sep 06, 2013 8:05 am

Or would it be better to output a separate file for each keyword/header result since there might be more than one result for each keyword/header query?

User avatar
Support
Site Admin
Posts: 1720
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support » Fri Sep 06, 2013 10:44 am

What you need to do is write each field to the file so as to keep the columns aligned. The way you have it now, it writes only if the field is not empty...

if cnt then begin
  cnt = Link.FixUp(cnt);
  f2.Write(cnt+'|');
end;

should be...

if cnt then
  cnt = Link.FixUp(cnt);
f2.Write(cnt+'|');
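The fix above boils down to emitting one field per keyword on every row, empty or not. A minimal Python sketch of the same idea (the function name and keyword list are illustrative, not part of the BrownRecluse script):

```python
def build_row(url, status, links):
    """Always emit one field per keyword, empty when there was no match,
    so every row has the same number of '|'-delimited columns."""
    fields = [url, status] + [links.get(k, '') for k in ('about', 'contact')]
    return '|'.join(fields)

print(build_row('http://example.com', '200', {'about': '/about.html'}))
# → http://example.com|200|/about.html|
```

With a fixed field count per row, the delimited file imports cleanly into Excel with one column per keyword.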
Your support team.
http://SoftByteLabs.com

Post Reply