Data extraction output to file from complicated HTML without specific tags

BrownRecluse is a programmable web spider. Scan a web site and retrieve the information you need from it. For example, you could scan a real estate web site, collect all of the agent addresses, phone numbers and emails, place the data into a tab-delimited database file, and then import that file into Excel.
agtmulder17
Posts: 2
Joined: Mon Apr 14, 2014 1:28 pm

Data extraction output to file from complicated HTML without specific tags

Post by agtmulder17 » Mon Apr 14, 2014 1:53 pm

Hi, I am working on a script that sequentially visits .php pages and mines the HTML data on each page.

I have been trying to adapt several of the free scripts, but I can't figure it out because of the way the HTML is constructed on my target website.

Here is what I am doing specifically. On http://www.start.umd.edu/tops/terrorist ... .asp?q=All I have the list of several organizations. I want to step through each of them. They typically have this kind of URL: http://www.start.umd.edu/tops/terrorist ... e.asp?id=x , where x runs from 1 through about 5000, with lots of empty pages in between (like page http://www.start.umd.edu/tops/terrorist ... sp?id=5000, which yields a page but no data).

For a page like: http://www.start.umd.edu/tops/terrorist ... sp?id=4438, I am trying to dump only a portion of the contents into a .txt file. The data I am collecting from that example page is:

The group name : 1920 Revolution Brigades
Mothertongue Name: كتائب ثورة العشرين (note that most times it is not a foreign script but sometimes it is Arabic)
Aliases: (all of them grouped together separated by commas)
Bases of Operation: (all of them grouped together separated by commas)
Date Formed: (however it is written)
Strength: (however it is written)
Classifications: (however it is written)
Financial Sources: (however it is written)
Founding Philosophy: (usually written as several paragraphs. I want to collect all of those paragraphs into "one cell". Is there a limit on tab-delimited files in this regard?)
Current Goals: (same as above)
Related Groups: (grab each one, whatever is written, but separate them with commas)

I am not interested in the rest of the data on each page. There do not seem to be any helpful HTML tags linked to the data (maybe because of the .php calls?). Any help on how I can tackle this data extraction project is greatly appreciated. Thanks! :D

agtmulder17
Posts: 2
Joined: Mon Apr 14, 2014 1:28 pm

Re: Data extraction output to file from complicated HTML without specific tags

Post by agtmulder17 » Mon Apr 14, 2014 1:58 pm

I am using BrownRecluse Pro v.1.62. Thanks

Support
Site Admin
Posts: 1854
Joined: Sun Oct 02, 2011 10:49 am

Re: Data extraction output to file from complicated HTML without specific tags

Post by Support » Mon Apr 14, 2014 2:13 pm

The fields you are looking for have one thing in common: they can all be found using this regular expression...

<label>[^<]+:</label></td>(.*?)</tr>

If you use the following on the page HTML...

rx = New(RegEx);
rx.Data = html;
rx.Mask = '<label>[^<]+:</label></td>(.*?)</tr>';

while rx.Match do begin
	txt = Decode(rx.Value[1]);
	txt -= '<[^>]*>';
	txt.Trim;
	Output(txt);
end;

It will output the text of every field, minus the HTML tags, one field per line (here in the message the lines may wrap), like this...

&#1603;&#1578;&#1575;&#1574;&#1576; &#1579;&#1608;&#1585;&#1577; &#1575;&#1604;&#1593;&#1588;&#1585;&#1610;&#1606;
20th Revolution Brigades, Revolution of the 1920s Brigades, Twentieth Revolution Brigades
Iraq
June 2003
Unknown number of members
Nationalist/Separatist, Religious
Unknown
The 1920 Revolution Brigades (Kata'ib Thawrat al-Ishreen) is a Sunni Islamic extremist group in the Iraqi insurgency that has claimed responsibility for several attacks on U.S. forces as well as some high profile incidents including the kidnapping of U.S. marine Wassef Ali Hassoun in June 2004 and the bombing of the al-Arabiya television network headquarters in Baghdad in October 2005.The group first appeared in June 2003 as a "nationalist Jihadist movement" dedicated to the withdrawal of U.S. forces from Iraq in order to build an Islamic state. The 1920 Revolution Brigades is the military wing of the Islamic Resistance Movement in Iraq, formerly called the Iraqi National Islamic Resistance. The group is named after the 1920 Iraqi uprising against British colonial occupation following World War I, when the League of Nations granted the United Kingdom control over three Ottoman territories -- Baghdad, Mosul, and Basra -- that make up present day Iraq. Arabic script in the group's logo contains a verse from the Quran popular among Jihadists, "Fight them, God shall torture them by your hands," below which reads, "Islamic Resistance Movement, Twentieth Revolution Brigades."Little is known about the group's leadership, except that on 2 January 2005, the Iraqi Defense Ministry reported that Iraqi security forces arrested Hatim al-Zawba'i, whom they identified as a commander of the 1920 Revolution Brigades.The 1920 Revolution Brigades employs tactics common to other Iraqi insurgency groups such as roadside improvised explosive device (IED) attacks on military vehicles, suicide bombings, and mortar and rocket attacks. Unlike some Jihadist organizations, the group has stated that it prohibits the targeting of public areas and oil facilities and generally forbids the killing of Muslims. 
The group is also active online, maintaining a website and frequently publishing claims of responsibility and releasing videos of insurgent operations.As one of several militant organizations in Iraq, the 1920 Revolution Brigades has affirmed its autonomy within the insurgency and has declined to join the Mujahideen Shura Council, a union of several Iraqi insurgent bands established in January 2006. In November 2005, however, the group published joint statements with other prominent resistance units under the name Joint Coordination Bureau for Jihad Groups.The 1920 Revolution Brigades gained international media attention on 27 June 2004 when the Arab television network al-Jazeera broadcast a hostage video of captured U.S. marine Wassef Ali Hassoun. A group called Islamic Response, identifying themselves as the security wing of the 1920 Revolution Brigades, claimed responsibility for the kidnapping. The incident later appeared to be a hoax when Hassoun surfaced in his native Lebanon three weeks after he was supposedly captured. Hassoun then reported to the U.S. embassy in Beirut and returned to Camp Lejeune in North Carolina, but he disappeared again in January 2005 just before his military hearing.
The 1920 Revolution Brigades continues to target U.S. troops in Iraq. In a statement issued on 13 February 2006, the group vowed to "carry on jihad until the liberation and victory or [until they are] martyred," and adamantly denied any relation to the Ba'ath party. Given the current trend in activity, it is likely that the group will continue fighting U.S. forces by carrying out suicide bombings, planting IEDs, and launching rockets and mortars at U.S. positions. Likewise, they will probably continue to issue statements online and release videos of their attacks.
No
No
No
No
No
No
No
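
For anyone who wants to try the same idea outside BrownRecluse, here is a rough Python sketch of it, not a tested scraper for that site. It reuses the mask from the script above; the function names `extract_fields` and `to_tsv_row` and the sample markup are made up for illustration, and the real pages may be laid out differently. It also shows one way to keep a multi-paragraph field in "one cell" of a tab-delimited file: collapse any tabs and line breaks inside the field before joining.

```python
import html
import re

# Same mask as the BrownRecluse script; DOTALL lets a value span several lines.
FIELD_MASK = re.compile(r"<label>[^<]+:</label></td>(.*?)</tr>", re.DOTALL)
TAG_MASK = re.compile(r"<[^>]*>")


def extract_fields(page_html):
    """Return the text of every labelled field, with the html tags removed."""
    fields = []
    for match in FIELD_MASK.finditer(page_html):
        text = TAG_MASK.sub("", match.group(1))  # drop the html tags
        text = html.unescape(text)               # turn entities such as &#1603; into characters
        fields.append(text.strip())
    return fields


def to_tsv_row(fields):
    """Join fields with tabs; inner tabs and line breaks become single spaces."""
    return "\t".join(" ".join(field.split()) for field in fields)


# Hypothetical markup, invented for this example (not copied from the site).
sample = (
    "<tr><td><label>Bases of Operation:</label></td><td>Iraq</td></tr>\n"
    "<tr><td><label>Date Formed:</label></td><td><b>June</b>\n 2003</td></tr>"
)
print(to_tsv_row(extract_fields(sample)))
```

Note that this sketch strips the tags before decoding the entities, the reverse of the script above, so that a decoded `<` in the text is not mistaken for the start of a tag. Either order gives the same result on the sample.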

I hope this helps you accomplish what you need. If not, let me know.
Your support team.
http://SoftByteLabs.com
