Request help for filter

BlackWidow scans websites (it's a site ripper). It can download an entire website, or download portions of a site.
jplanas
Posts: 7
Joined: Fri Feb 10, 2012 10:58 am


Post by jplanas »

I need filters to view and then download files from an open website hosting old newspaper pages. There must be about 3,500 of them, from

http://www.memoriademadrid.es/buscador. ... total=2491 to

http://www.memoriademadrid.es/buscador. ... _total=298

Notice that num_id and num_total seem to be irrelevant; the only number that matters is the one in VerFicha&id, which changes with every file.

Thanks for your help!

Jorge

Support
Site Admin
Posts: 1879
Joined: Sun Oct 02, 2011 10:49 am

Re: Request help for filter

Post by Support »

Do you need the whole page or just the large image on that page?
Your support team.
http://SoftByteLabs.com

jplanas
Posts: 7
Joined: Fri Feb 10, 2012 10:58 am

Re: Request help for filter

Post by jplanas »

Just the image, thanks. Jorge

Support
Site Admin
Posts: 1879
Joined: Sun Oct 02, 2011 10:49 am

Re: Request help for filter

Post by Support »

Here's what I did. I made a script that scans the result pages, converts the small thumbnails to the large images, and adds them to the structure. The result pages run from 0 to 2151; the loop as posted covers only pages 0 to 10, so raise its upper bound to 2151 to scan them all. Note that the following is not a filter but a script: copy it, open the Filters window, click the "Expert" button at the top right, and paste the script into the editor window, replacing any existing content. Then start the scan.

Code:

case ScannerEvent of

  Starting:
  begin
    for Pagina = 0 to 10 do begin
      lnk = 'http://www.memoriademadrid.es/buscador.php?accion=ResultadosAvanzados&pagina='
      +Pagina+
      '&busqueda_libre_01_tipo=*&busqueda_libre_01=&operador=+AND+&busqueda_libre_02_tipo=*'+
      '&busqueda_libre_02=&periodico=&dia=&mes=&anio=&dia_inicio=&mes_inicio=&anio_inicio='+
      '&dia_final=&mes_final=&anio_final=&documentos_ocr=&orden_listado=';
      ScanLink(lnk);
    end;
  end;

  BeforeFetch:
  begin
    AcceptEvent = (DocumentType ~= 'text/html');
  end;

  BeforeParsing:
  begin
    Document.Replace('miniatura&min=50', '1');
    Document.Replace('-prev', '');
    AcceptEvent = Yes;
  end;

  BeforeAdding:
  begin
    AcceptEvent = (DocumentType ~= 'image/');
  end;

  FoundLink:
  begin
    AcceptEvent = (FoundLinkURL ~= '\.jpg$');
  end;

else
  AcceptEvent = No;

end;
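
For readers who want to check the URL handling outside BlackWidow, here is a rough Python sketch of the same three steps: build the result-page link, rewrite thumbnail references to full-size ones, and keep only links ending in .jpg. It is not BlackWidow script, the function names are made up for illustration, and the sample thumbnail reference is hypothetical (the real markup of the site is not shown in this thread).

```python
import re

# Query string copied from the script above; only the page number varies.
SEARCH_URL = (
    "http://www.memoriademadrid.es/buscador.php?accion=ResultadosAvanzados&pagina={page}"
    "&busqueda_libre_01_tipo=*&busqueda_libre_01=&operador=+AND+&busqueda_libre_02_tipo=*"
    "&busqueda_libre_02=&periodico=&dia=&mes=&anio=&dia_inicio=&mes_inicio=&anio_inicio="
    "&dia_final=&mes_final=&anio_final=&documentos_ocr=&orden_listado="
)


def result_page_url(page: int) -> str:
    """Return the search-result URL for one page number (hypothetical helper)."""
    return SEARCH_URL.format(page=page)


def thumbnails_to_full_size(html: str) -> str:
    """Apply the same two text replacements the script performs before parsing."""
    return html.replace("miniatura&min=50", "1").replace("-prev", "")


def is_jpg_link(url: str) -> bool:
    """Mirror the script's '\\.jpg$' test: accept only links ending in .jpg."""
    return re.search(r"\.jpg$", url) is not None


# Hypothetical thumbnail reference, invented only to exercise the replacements.
SAMPLE = "imagen.php?accion=miniatura&min=50&img=page_0001-prev.jpg"

if __name__ == "__main__":
    print(result_page_url(0))
    print(thumbnails_to_full_size(SAMPLE))
```
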

jplanas
Posts: 7
Joined: Fri Feb 10, 2012 10:58 am

Re: Request help for filter

Post by jplanas »

Thanks a lot for your prompt reply!

I'm sorry, but I might have misled you. I am getting URLs of this type: http://www.memoriademadrid.es/imagen.ph ... 9_0256.jpg
but I'm looking to download this type: http://www.memoriademadrid.es/fondos/HE ... 080101.pdf where the last number is the one that changes, from 18080101 (Jan 1, 1808) to http://www.memoriademadrid.es/fondos/HE ... 141231.pdf (Dec 31, 1814). It's all about issues of a 19th century newspaper in Barcelona.

I believe there used to be an application included in the program to change the mask, but I cannot find it, which is why I am asking for your help.

Jorge

Support
Site Admin
Posts: 1879
Joined: Sun Oct 02, 2011 10:49 am

Re: Request help for filter

Post by Support »

OK, then here is the new script:

Code:

case ScannerEvent of

  Starting:
  begin
    for x = 18080101 to 18141231 do begin
      lnk = 'http://www.memoriademadrid.es/fondos/HEM/DiarioNoticioso/HM_D_BARCELONA_'+x+'.pdf';
      ScanLink(lnk);
    end;
  end;

  BeforeAdding:
  begin
    AcceptEvent = Yes;
  end;

else
  AcceptEvent = No;

end;
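
The loop above walks every integer between 18080101 and 18141231, and most of those are not calendar dates. As a rough illustration (in Python, not BlackWidow script; the helper name is made up), the sketch below counts how many of the candidates are real dates by trying to build a date from each one.

```python
from datetime import date


def is_real_date(stamp: int) -> bool:
    """Return True if an 8-digit YYYYMMDD integer is an actual calendar date."""
    year, rest = divmod(stamp, 10000)
    month, day = divmod(rest, 100)
    try:
        date(year, month, day)  # raises ValueError for month 0, day 32, etc.
    except ValueError:
        return False
    return True


# Same bounds as the script's numeric loop (inclusive on both ends).
candidates = range(18080101, 18141231 + 1)
real_dates = [n for n in candidates if is_real_date(n)]

if __name__ == "__main__":
    print(len(candidates), "links requested by the numeric loop")
    print(len(real_dates), "of them correspond to real dates")
```

Whether every real date has a published issue is a separate question that this check cannot answer.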

jplanas
Posts: 7
Joined: Fri Feb 10, 2012 10:58 am

Re: Request help for filter

Post by jplanas »

Thank you very much. It worked beautifully!

Jorge

jplanas
Posts: 7
Joined: Fri Feb 10, 2012 10:58 am

Re: Request help for filter

Post by jplanas »

Well, it didn't work so well after all... Sorry to be a pest!

Because there were many numbers that did not mean anything (e.g. 18099999 is not a date), I decided to scan and download by year, modifying the filter so that the range of '+x+' went, for each year from 1808 to 1812, from 0101 to 1231, which reduced the number of "empty" links.

The problem is that it worked well for 1808, but when I applied it to 1809 or 1810, the structure showed only the first issue (18090101 or 18100101), and even though it scanned 1132 links each time, only one was added to the structure.

What did I do wrong? No link errors were reported.

Jorge

jplanas
Posts: 7
Joined: Fri Feb 10, 2012 10:58 am

Re: Request help for filter

Post by jplanas »

I think I have the solution. The website contains errors, and the file names differ from year to year: instead of HM_D_BARCELONA as in 1808, the 1809 files use HM_D_BARCELOMA and the 1810 files use HEM_D_BARCELONA. I doubt they did it on purpose, but they had me wondering for a long while!
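
One way to keep track of these per-year differences is a small lookup table. The sketch below is Python rather than BlackWidow script, and its names are invented for illustration. It records only the three prefixes reported above; falling back to HM_D_BARCELONA for other years, and assuming the directory part of the URL stays the same, are guesses that would need checking against the site.

```python
BASE = "http://www.memoriademadrid.es/fondos/HEM/DiarioNoticioso/"

# Prefixes reported in this thread; other years have not been verified.
PREFIX_BY_YEAR = {
    1808: "HM_D_BARCELONA",
    1809: "HM_D_BARCELOMA",
    1810: "HEM_D_BARCELONA",
}

# Assumption: years not listed above use the 1808 spelling.
DEFAULT_PREFIX = "HM_D_BARCELONA"


def issue_url(year: int, month: int, day: int) -> str:
    """Build the PDF link for one issue, using the prefix known for that year."""
    prefix = PREFIX_BY_YEAR.get(year, DEFAULT_PREFIX)
    return f"{BASE}{prefix}_{year:04d}{month:02d}{day:02d}.pdf"


if __name__ == "__main__":
    for year in (1808, 1809, 1810):
        print(issue_url(year, 1, 1))
```
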

Thanks for your invaluable help.

Jorge

Support
Site Admin
Posts: 1879
Joined: Sun Oct 02, 2011 10:49 am

Re: Request help for filter

Post by Support »

Perhaps it was a typo when they entered the data? In any case, you can always re-run the script for that one year, changing the file name in the link to match. Here is another script that generates the links by year, month, and day, so you won't get anything above 12 for months or above 31 for days:

Code:

case ScannerEvent of

  Starting:
  begin
    for year = '1810' to '1810' range('0'..'9') do
    for month = '01' to '12' range('0'..'9') do
    for day = '01' to '31' range('0'..'9') do begin
      lnk = 'http://www.memoriademadrid.es/fondos/HEM/DiarioNoticioso/HM_D_BARCELONA_'+year+month+day+'.pdf';
      ScanLink(lnk);
    end;
  end;

  BeforeAdding:
  begin
    AcceptEvent = Yes;
  end;

else
  AcceptEvent = No;

end;
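
The nested loops above still request a few impossible dates, such as February 30, because every month is given 31 days. As a hedged Python sketch (not BlackWidow script; the function name is invented), the day range can be trimmed per month with the standard-library calendar module. The prefix is the one used in the script above and may need changing for other years.

```python
import calendar

# Same link prefix as the script above; adjust it if a year uses another spelling.
PREFIX = "http://www.memoriademadrid.es/fondos/HEM/DiarioNoticioso/HM_D_BARCELONA_"


def year_urls(year: int) -> list[str]:
    """Return one PDF link per real calendar day of the given year."""
    urls = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month).
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            urls.append(f"{PREFIX}{year:04d}{month:02d}{day:02d}.pdf")
    return urls


if __name__ == "__main__":
    links = year_urls(1810)
    print(len(links), "links, from", links[0], "to", links[-1])
```

Compared with 12 x 31 = 372 links per year from the nested loops, this produces 365 or 366, so the saving is small; the main benefit is that no request is made for a date that cannot exist.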

jplanas
Posts: 7
Joined: Fri Feb 10, 2012 10:58 am

Re: Request help for filter

Post by jplanas »

Even better, thanks!

Jorge
