Data extraction output to file from complicated HTML without specific tags
Posted: Mon Apr 14, 2014 1:53 pm
Hi, I am working on a script that sequentially visits .php pages and mines the HTML data on each page.
I have been trying to adapt several of the free scripts, but I can't figure it out given the way the HTML is constructed on my target website.
Here is what I am doing specifically. On http://www.start.umd.edu/tops/terrorist ... .asp?q=All there is a list of several organizations. I want to step through each of them. Their pages typically have this kind of URL: http://www.start.umd.edu/tops/terrorist ... e.asp?id=x , where x runs from 1 through about 5000, with lots of empty pages in between (like http://www.start.umd.edu/tops/terrorist ... sp?id=5000, which yields a page but no data).
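To make the stepping-through part concrete, here is a minimal Python sketch of the loop I have in mind. Everything named here is an assumption for illustration: `URL_TEMPLATE` is a hypothetical placeholder (not the real address), and `has_data` guesses that a page with data contains the text "Date Formed", which I have not verified against the site.

```python
import time
import urllib.request

# Hypothetical placeholder: substitute the real profile URL pattern here.
URL_TEMPLATE = "http://example.invalid/profile.asp?id={id}"


def fetch_html(url):
    """Download one page and decode it, replacing undecodable bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def has_data(html, marker="Date Formed"):
    """Assumption: a page with real data contains a known field label."""
    return marker in html


def iter_profiles(ids, fetch=fetch_html, delay=1.0):
    """Yield (id, html) for every id whose page loads and is not empty."""
    for group_id in ids:
        url = URL_TEMPLATE.format(id=group_id)
        try:
            html = fetch(url)
        except OSError:
            # Covers network and HTTP errors raised by urllib; skip this id.
            continue
        if has_data(html):
            yield group_id, html
        if delay:
            time.sleep(delay)  # pause between requests to be polite
```

Calling `iter_profiles(range(1, 5001))` would then visit each id in turn and hand back only the pages that look populated. Passing `fetch` as a parameter makes it possible to try the loop on saved pages before hitting the live site.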
For a page like: http://www.start.umd.edu/tops/terrorist ... sp?id=4438, I am trying to dump only a portion of the contents into a .txt file. The data I am collecting from that example page is:
The group name : 1920 Revolution Brigades
Mothertongue Name: كتائب ثورة العشرين (note that most times it is not a foreign script but sometimes it is Arabic)
Aliases: (all of them grouped together separated by commas)
Bases of Operation: (all of them grouped together separated by commas)
Date Formed: (however it is written)
Strength: (however it is written)
Classifications: (however it is written)
Financial Sources: (however it is written)
Founding Philosophy: (usually written as several paragraphs. I want to collect all of those paragraphs into "one cell". Is there a limit on tab-delimited files in this regard?)
Current Goals: (same as above)
Related Groups: (grab each one, whatever is written, but separate each by a comma)
I am not interested in the rest of the data on each page. There do not seem to be helpful HTML tags attached to the data (maybe because of the .php calls?). Any help on how I can tackle this data extraction project is greatly appreciated. Thanks!
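For reference, this is roughly the kind of extraction I am after, sketched in Python with only the standard library. It rests on assumptions I have not confirmed: that each field appears in the page text as "Label:" followed by its value, and that the labels in `LABELS` match the page exactly. The group name has no label, so it is not handled here. The function names are my own, not from any existing script.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Assumed field labels, copied from the list above.
LABELS = [
    "Mothertongue Name", "Aliases", "Bases of Operation", "Date Formed",
    "Strength", "Classifications", "Financial Sources",
    "Founding Philosophy", "Current Goals", "Related Groups",
]


class TextExtractor(HTMLParser):
    """Collects the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse all whitespace runs to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def extract_fields(html, labels=LABELS):
    """Split the page text at each 'Label:' and map label -> value."""
    text = html_to_text(html)
    pattern = re.compile(
        "(" + "|".join(re.escape(label) for label in labels) + r")\s*:"
    )
    # With one capture group, split() returns
    # [text_before, label1, value1, label2, value2, ...].
    parts = pattern.split(text)
    fields = {label: "" for label in labels}
    for label, value in zip(parts[1::2], parts[2::2]):
        if not fields[label]:  # keep the first occurrence only
            fields[label] = value.strip()
    return fields


def to_tsv_row(fields, labels=LABELS):
    """Format one record as a single tab-delimited line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([fields[label] for label in labels])
    return buf.getvalue()
```

Because whitespace is collapsed, several paragraphs end up as one long cell on one line. As far as I can tell the tab-delimited format itself has no cell length limit, though whatever program opens the file may have one. The output file would need to be opened with `encoding="utf-8"` so the Arabic names survive. Two known weaknesses of this sketch: the last field found swallows the rest of the page text, and text inside script or style tags is not filtered out.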