by AbstraktMethodz 25. November 2005 16:32
I've had a lot of success developing several website scraping scripts that parse by matching regular expressions. Using scraping as a method of obtaining data is always a bad idea. Obviously, whatever system you're scraping isn't supporting your efforts in any way. So even assuming the process isn't copyright infringement or otherwise illegal, the code will inevitably require a lot of maintenance. That being said, let's scrape.

This is the HTML for a link that is the result of a Google search:

<a class="l" onmousedown="return clk(this.href,'res','1','')" href="http://www.microsoft.com/"><b>Microsoft</b> Corporation</a>

Now say, for instance, we want to scrape all the href attributes and the display text of each link. I believe Google has a web service that would be the correct way to access this data, but that is just a reminder of why you should never do this. I go about the parsing by using tags as the tokens to look for, while ignoring their attributes in matches. In most cases attributes carry more specific display information about a page, while the tags outline the structure of the page and should be subject to less change. With that in mind, the following is a regular expression that will match the entire text of any anchor with an href, regardless of whether it has class or onmousedown attributes.

/<a [^>]*?href="(?P<ANCHOR_HREF>[^"]*)"[^>]*>(?P<ANCHOR_TEXT>.*?)<\/a>/s

The best way to skip over the rest of an opening tag in a markup language is the expression [^>]*. It lets any text other than the closing greater-than sign match there, i.e. the first two attributes I am ignoring. The question mark after a quantifier tells the RegEx to evaluate non-greedily, matching as little as possible. Doing this for a single-node expression like this prevents some spurious matches, such as one match swallowing everything up to the last </a> on the page. The final noteworthy aspect of this snippet is the s after the closing delimiter, which lets . match newlines so link text that spans multiple lines still matches. (The m modifier, despite its name, only changes where ^ and $ match.)

Here is what executing this in PHP might look like:

$pattern = '/<a [^>]*?href="(?P<ANCHOR_HREF>[^"]*)"[^>]*>(?P<ANCHOR_TEXT>.*?)<\/a>/s';
preg_match_all($pattern, $myHtml, $result, PREG_SET_ORDER);
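To see the whole round trip, here is a minimal sketch that runs an anchor pattern of this kind over a small sample and walks the matches. The second link in the sample, the group name ANCHOR_HREF, and the $links array are assumptions made up for illustration, not part of the original scripts.

```php
<?php
// Sample input: the Google result link from above plus a second, made-up
// link whose text spans two lines (hypothetical data for illustration).
$myHtml = <<<HTML
<a class="l" onmousedown="return clk(this.href,'res','1','')" href="http://www.microsoft.com/"><b>Microsoft</b> Corporation</a>
<a href="http://www.example.com/">Example
Site</a>
HTML;

// Attributes before and after href are skipped with [^>]*; the s modifier
// lets . match newlines inside the link text.
$pattern = '/<a [^>]*?href="(?P<ANCHOR_HREF>[^"]*)"[^>]*>(?P<ANCHOR_TEXT>.*?)<\/a>/s';

// PREG_SET_ORDER gives one array per matched anchor.
preg_match_all($pattern, $myHtml, $result, PREG_SET_ORDER);

$links = [];
foreach ($result as $match) {
    // strip_tags drops inner markup such as <b>; the whitespace collapse
    // joins text that was split across lines.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($match['ANCHOR_TEXT'])));
    $links[] = ['href' => $match['ANCHOR_HREF'], 'text' => $text];
}

print_r($links);
```

With PREG_SET_ORDER each element of $result holds both numeric and named keys for one anchor, so the loop only has to pick out the named groups it cares about.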

The parsers I've written have been running for months with little maintenance. I think I owe that to extending this method to match larger portions of HTML with a single regular expression. This helps ensure the validity of the data matched, because the likelihood that the exact same HTML structure would exist with the wrong data in it is low. For example, I would parse a segment like this:

<tr class="dataRow" height="15px">
<td class="dataCell">5/21/2001</td>
<td class="dataCell">Active</td>
<td class="dataCell">75.6</td>
<td class="dataCell">CLASS B</td>
<td class="dataCell">foo</td>
<td class="dataCell">bar</td>
<td class="dataCell">
baz
</td>
</tr>

The usual case was having to parse every row of a data table, so I matched every row and every value in each cell. Once we're done we find it all neatly dumped into an associative array.

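The gotcha here is matching not only the spaces, but also the carriage return and newline characters between and inside the tags. To do this use the expression [\s\r\n]*. Strictly speaking, \s already covers carriage returns and newlines in PCRE, so \s* alone does the same job; spelling out \r and \n just makes the intent explicit.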
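As a rough sketch of the row-matching idea, the following builds one pattern for an entire table row and collects the cells into an associative array. The shortened three-cell sample, the group names DATE, STATUS and NOTE, and the $rows array are hypothetical; a real page would need one group per column.

```php
<?php
// Hypothetical sample: one row shaped like the segment above, shortened to
// three cells, with the last value on its own line and \r\n line endings.
$myHtml = "<tr class=\"dataRow\" height=\"15px\">\r\n"
        . "<td class=\"dataCell\">5/21/2001</td>\r\n"
        . "<td class=\"dataCell\">Active</td>\r\n"
        . "<td class=\"dataCell\">\r\nbaz\r\n</td>\r\n"
        . "</tr>";

// Template for one cell. \s* absorbs spaces, tabs, \r and \n around the
// value; the lazy (.*?) stops before trailing whitespace because the
// following \s* must still reach </td>.
$cell = '<td[^>]*>\s*(?P<%s>.*?)\s*<\/td>\s*';

// One pattern for the whole row, with one named group per column.
$pattern = '/<tr[^>]*>\s*'
         . sprintf($cell, 'DATE')
         . sprintf($cell, 'STATUS')
         . sprintf($cell, 'NOTE')
         . '<\/tr>/s';

preg_match_all($pattern, $myHtml, $result, PREG_SET_ORDER);

$rows = [];
foreach ($result as $match) {
    // Keep only the named groups; the match also carries numeric keys.
    $rows[] = [
        'date'   => $match['DATE'],
        'status' => $match['STATUS'],
        'note'   => $match['NOTE'],
    ];
}

print_r($rows);
```

Because the pattern only matches when every cell is present in the expected order, a row whose structure has changed simply produces no match instead of producing wrong data.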


by AbstraktMethodz 25. November 2005 16:27

If you ask my girlfriend Christina what she wants for a gift, the immature yet invariable response is either a pony or a Vespa scooter. These items being a little pricey, others and I have opted to simply give her representations of them. I also recently had a hankering for some C++/OpenGL coding. This is the ideal situation to code a girly screen saver, but I was short on time, so I leveraged some open source stuff:


  • a public domain 3D model of a horse, in 3DS format

  • an open source 3DS file loader & viewer for OpenGL, acquired from www.gametutorials.com

  • an animated snow effect for OpenGL (also open source)


I quickly married it all together, making the snow fall around the horse while it spins around the Z axis.

 


It's like a hybrid of a snow shaker and a music box, without the music or the shaking. Now a screensaver is really just an EXE renamed to SCR, so the final touch is to make the program exit on keyboard and mouse messages.

I've had the idea for a while to make my own OpenGL demo that goes with music, i.e. some sort of wild interactive visualization. After jumping back into OpenGL so easily and that recent success with MP3 decoding and analysis for the WWF, I don't see any big hurdles.
