Website lookup from a list and link capture

BrownRecluse is a programmable web spider. It scans a web site and retrieves the information you need. For example, you could scan a real estate web site, collect all of the agent addresses, phone numbers and emails, place the data into a tab-delimited database file, and then import that file into Excel.
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Post by gcarmich »

I installed the BrownRecluse trial today. I'm not sure how to create the routine I have in mind. Here is a description:
I have a list of several hundred URLs in .csv (could be .txt) format. I'd like to go to each web page (URL) and check for (or capture) the "contact", "contacts" or "contact us" link on the page, if it exists. Then I'd like to create a file (or update the original list) with the original URL and an indicator or the link (http://) to the "contacts" page, if it exists. This way, I end up with a list of the pages that have a "contact" link and can go directly to those pages for more information. Is this a routine that BrownRecluse could do?

For example: the input file would contain

http://www.softbytelabs.com/

and the output file would have

http://www.softbytelabs.com/ , http://www.softbytelabs.com/us/contacts.html

Thank you,
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Here is a script that will do this...

PerlRegEx = Yes;
Output.Clear;

fn = SelectFile('Open file...');
if fn = Nothing then Terminate;

f = New(File);
f.Open(fn);
f.Seek(BeginningOfFile);

outfile = fn + '.contacts.txt';
f2 = New(File);
f2.Truncate;

Link = New(URL);

while f.Position < f.Size do begin

	lnk = f.Read;

	if Link.Get(lnk) then begin
	  cnt = WildGet(Link.Data, 'href="([^"]+contact[^"]+)"');
	  if cnt = Nothing then
	    cnt = WildGet(Link.Data, "href='([^']+contact[^']+)'");
	  if cnt then begin
	  	cnt = Link.FixUp(lnk);
	  	f2.Write(lnk+tab+cnt);
	  	Output(cnt);
		end;
	end;

end;

f2.close;
f.close;
Your support team.
https://SoftByteLabs.com
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Thanks! I'll give it a try.
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

It runs but it appears to be returning just the original URL and it does not include the "contact" URL. Also, it doesn't seem to write out the .txt file after completion - I see the output in the "runtime outputs" window. I'm new to this and might be doing something wrong.
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

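There was a typo on line 26 of the script: Link.FixUp was given lnk instead of cnt, so the original URL was written out in place of the contact link. Here is the corrected one...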

PerlRegEx = Yes;
Output.Clear;

fn = SelectFile('Open file...');
if fn = Nothing then Terminate;

f = New(File);
f.Open(fn);
f.Seek(BeginningOfFile);

outfile = fn + '.contacts.txt';
f2 = New(File);
f2.Open(outfile);
f2.Truncate;

Link = New(URL);

while f.Position < f.Size do begin

   lnk = f.Read;

   if Link.Get(lnk) then begin
     cnt = WildGet(Link.Data, 'href="([^"]+contact[^"]+)"');
     if cnt = Nothing then
       cnt = WildGet(Link.Data, "href='([^']+contact[^']+)'");
     if cnt then begin
        cnt = Link.FixUp(cnt);
        f2.Write(lnk+tab+cnt+crlf);
        Output(cnt);
      end;
   end;

end;

f2.close;
f.close;
Your support team.
https://SoftByteLabs.com
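
For readers who want to try the two href patterns from the script above outside BrownRecluse, here is a minimal Python sketch of the same lookup. It is not BrownRecluse code. The function name find_contact_link is hypothetical, and urljoin is only assumed to behave like Link.FixUp (turning a relative link into a full URL).

```python
import re
from urllib.parse import urljoin

# Same idea as the two WildGet calls: double-quoted href first, then single-quoted.
HREF_PATTERNS = (
    r'href="([^"]+contact[^"]+)"',
    r"href='([^']+contact[^']+)'",
)


def find_contact_link(page_url, html):
    """Return the absolute URL of the first href containing 'contact', or None."""
    for pattern in HREF_PATTERNS:
        match = re.search(pattern, html)
        if match:
            # Assumption: resolving against the page URL mirrors Link.FixUp.
            return urljoin(page_url, match.group(1))
    return None


if __name__ == "__main__":
    page = "http://www.softbytelabs.com/"
    html = '<a href="/us/contacts.html">Contact</a>'
    # Prints the page URL and the contact URL separated by a tab.
    print(page + "\t" + str(find_contact_link(page, html)))
```

The sketch only covers the matching step; fetching each page and looping over the input file are left out.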
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Thanks for the sample code. I emailed a request to the SoftByte Labs developers for a quote on some coding, but I haven't received a response yet. Is there a way to verify they received the request?

Thanks.
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Yes, they have, but we are so busy right now finishing up Raylectron that it may be another couple of days.
Your support team.
https://SoftByteLabs.com
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Ok. Np. Thank you.
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Here is a script for the request you emailed me the other day. I did not test it, so if there are any errors, let me know...

PerlRegEx = Yes;
Output.Clear;

Keyworkds = 'contact,about,music';

sk = New(Stack);
sk.Split(Keyworkds, ',');
sk.Reverse;

fn = SelectFile('Open file...');
if fn = Nothing then Terminate;

f = New(File);
f.Open(fn);
f.Seek(BeginningOfFile);

outfile = fn + '.output.txt';
f2 = New(File);
f2.Truncate;

Link = New(URL);

while f.Position < f.Size do begin

	lnk = f.Read;

	if Link.Get(lnk) then begin
		hdr = Link.Headers;
		f2.Write(lnk+crlf);
		f2.Write(hdr+crlf);
		for i = 1 to sk.Count do begin
		  k = sk.Items[i];
			cnt = WildGet(Link.Data, 'href="([^"]+'+k+'[^"]+)"');
			if cnt = Nothing then
				cnt = WildGet(Link.Data, "href='([^']+"+k+"[^']+)'");
			if cnt then begin
				cnt = Link.FixUp(cnt);
				f2.Write(cnt);
			end;
		end;
		f2.Write('-'*80+crlf);
	end;

end;

f2.close;
f.close;
Your support team.
https://SoftByteLabs.com
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

I ran it today, but the code produced no output. I selected the source file when prompted. Do I need to do anything else?
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

The input file should have one URL per line, and you should also set the keywords on the third line of code (the one that starts with Keyworkds).
Your support team.
https://SoftByteLabs.com
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

The input file does have a single URL per line; it is the same file I ran successfully with the earlier code. I left the third line as is. The keywords were on the URL referenced pages.
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

What do you mean by "The keywords were on the URL referenced pages"?

The way I have it set up, the input file has one URL per line and the keywords are stored in the script itself. Isn't that how you wanted it?
Your support team.
https://SoftByteLabs.com
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

I mean that some of the pages have the keywords on them, so at least a few of the URLs should have been returned in the output.
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

OK, this one is working; I just tested it on 3 sites. It creates the output file in the same folder as the input file, with the same name but with .output.txt appended to it...

PerlRegEx = Yes;
Output.Clear;

Keyworkds = 'contact,about,music';

sk = New(Stack);
sk.Split(Keyworkds, ',');
sk.Reverse;

fn = SelectFile('Open file...');
if fn = Nothing then Terminate;

f = New(File);
f.Open(fn);
f.Seek(BeginningOfFile);

DecodeFileName(fn, [drv,dir,fln], [Drive,Directory,FileName]);
outfile = drv+dir+fln + '.output.txt';
f2 = New(File);
f2.Open(outfile);
f2.Truncate;

Link = New(URL);

while f.Position < f.Size do begin

   lnk = f.Read;

   if Link.Get(lnk) then begin
      hdr = Link.Headers;
      f2.Write('-'*80+crlf);
      f2.Write(lnk+crlf);
      f2.Write('-'*80+crlf);
      f2.Write(Trim(hdr)+crlf);
      f2.Write('-'*80+crlf);
      for i = 1 to sk.Count do begin
        k = sk.Items[i];
         cnt = WildGet(Link.Data, 'href="([^"]*'+k+'[^"]*)"');
         if cnt = Nothing then
            cnt = WildGet(Link.Data, "href='([^']*"+k+"[^']*)'");
         if cnt then begin
            cnt = Link.FixUp(cnt);
            f2.Write(cnt+crlf);
         end;
      end;
   end;
   f2.Write(crlf);
   f2.Write(crlf);

end;

f2.close;
f.close;
Your support team.
https://SoftByteLabs.com
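
One difference between this script and the previous one is that the href patterns now use * instead of + on either side of the keyword. The Python sketch below (not BrownRecluse code; the helper name href_with_keyword is hypothetical) shows what that changes: with +, at least one character must appear both before and after the keyword inside the href, so a link such as /contact is skipped, while * also accepts it.

```python
import re


def href_with_keyword(html, keyword, quantifier):
    """Return the first double-quoted href containing keyword, or None.

    quantifier is '+' (one or more characters on each side of the keyword)
    or '*' (zero or more characters on each side).
    """
    pattern = 'href="([^"]{q}{k}[^"]{q})"'.format(q=quantifier, k=re.escape(keyword))
    match = re.search(pattern, html)
    return match.group(1) if match else None


if __name__ == "__main__":
    html = '<a href="/contact">Contact</a>'
    print(href_with_keyword(html, "contact", "+"))  # nothing follows the keyword
    print(href_with_keyword(html, "contact", "*"))
```

Only the double-quoted form is shown; the single-quoted pattern behaves the same way.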
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Perfect! Thank you
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Is it possible to search for the keywords in both the link and the visible text on the page that references the link, and then return the link as the existing code does? Sometimes it says "Contacts" on the page but the link does not contain the word. Does that make sense?
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

OK, this one will also find links that contain a keyword in the clickable text...

PerlRegEx = Yes;
Output.Clear;

Keyworkds = 'contact,about,music';

sk = New(Stack);
sk.Split(Keyworkds, ',');
sk.Reverse;

fn = SelectFile('Open file...');
if fn = Nothing then Terminate;

f = New(File);
f.Open(fn);
f.Seek(BeginningOfFile);

DecodeFileName(fn, [drv,dir,fln], [Drive,Directory,FileName]);
outfile = drv+dir+fln + '.output.txt';
f2 = New(File);
f2.Open(outfile);
f2.Truncate;

Link = New(URL);

while f.Position < f.Size do begin

   lnk = f.Read;

   if Link.Get(lnk) then begin
      hdr = Link.Headers;
      f2.Write('-'*80+crlf);
      f2.Write(lnk+crlf);
      f2.Write('-'*80+crlf);
      f2.Write(Trim(hdr)+crlf);
      f2.Write('-'*80+crlf);
      for i = 1 to sk.Count do begin
        k = sk.Items[i];
         cnt = WildGet(Link.Data, 'href="([^"]*'+k+'[^"]*)"');
         if cnt = Nothing then
            cnt = WildGet(Link.Data, "href='([^']*"+k+"[^']*)'");
         if cnt = Nothing then
            cnt = WildGet(Link.Data, '<a[^>]+href="([^"]+)">[^<]*'+k);
         if cnt = Nothing then
            cnt = WildGet(Link.Data, "<a[^>]+href='([^']+)'>[^<]*"+k);
         if cnt then begin
            cnt = Link.FixUp(cnt);
            f2.Write(cnt+crlf);
         end;
      end;
   end;
   f2.Write(crlf);
   f2.Write(crlf);

end;

f2.close;
f.close;
Your support team.
https://SoftByteLabs.com
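
The two clickable-text patterns added in the script above can be tried on their own with a short Python sketch (not BrownRecluse code; link_by_anchor_text and its ignore_case parameter are hypothetical). Python's re module is case sensitive by default, so "Contacts" only matches the keyword "contact" when ignore_case is set. Whether WildGet ignores case is not stated in this thread.

```python
import re


def link_by_anchor_text(html, keyword, ignore_case=False):
    """Return the href of the first <a> whose clickable text contains keyword."""
    flags = re.IGNORECASE if ignore_case else 0
    patterns = (
        r'<a[^>]+href="([^"]+)">[^<]*' + re.escape(keyword),
        r"<a[^>]+href='([^']+)'>[^<]*" + re.escape(keyword),
    )
    for pattern in patterns:
        match = re.search(pattern, html, flags)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    html = '<a class="nav" href="/reach-us.html">Our contact page</a>'
    print(link_by_anchor_text(html, "contact"))
```

As written, the patterns expect href to be the last attribute before the closing > of the tag, and they do not look inside nested tags such as a span within the link.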
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

thank you
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

If the spider encounters a site with a login screen, is it possible to "cancel" the login and proceed to the next URL in the list?
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

How can I add "http://www." to the beginning of the URLs in the source .txt file before the spider tries to access the URL?
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Using the last posted script, on the line...

lnk = f.Read;

change it to...

lnk = 'http://www.' + f.Read;
Your support team.
https://SoftByteLabs.com
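
The one-line change above adds the prefix to every line, so an entry that already starts with http:// would end up with the prefix twice. A conditional version of the same idea is sketched below in Python (not BrownRecluse code; normalize_url is a hypothetical helper, and treating a leading "www." as needing only the scheme is an assumption).

```python
def normalize_url(line):
    """Return a URL with a scheme for one line of the input list, or None if blank."""
    url = line.strip()
    if not url:
        return None
    lowered = url.lower()
    if lowered.startswith(("http://", "https://")):
        return url  # already complete, leave as is
    if lowered.startswith("www."):
        return "http://" + url  # only the scheme is missing
    return "http://www." + url


if __name__ == "__main__":
    for entry in ["softbytelabs.com", "www.example.com", "http://www.softbytelabs.com/"]:
        print(normalize_url(entry))
```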
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

Thanks. Did you see my question about login screens?
Support
Site Admin
Posts: 2989
Joined: Sun Oct 02, 2011 10:49 am

Re: Website lookup from a list and link capture

Post by Support »

Oh no, I didn't.

I don't think there is a way, because that window comes from IE (Internet Explorer), not BR (BrownRecluse)!
Your support team.
https://SoftByteLabs.com
gcarmich
Posts: 29
Joined: Wed Jun 12, 2013 6:15 am

Re: Website lookup from a list and link capture

Post by gcarmich »

If I uninstalled IE would that prevent the login window from coming up and interrupting the process?