WebParser: Lookahead Assertions in RegExp


Often you need to parse an RSS feed or a web page without knowing beforehand how many results a particular search will find. If the regular expression searches for five occurrences of something when there are only four, the entire regular expression will fail and no results will be returned to the WebParser measure(s).

This can be solved by using what is known in regular expressions as a "Lookahead Assertion", placed inside a conditional. Together they evaluate an expression you define that says in effect "do this search if some pattern is found ahead, otherwise continue the regular expression without failing".

The best way to explain this is probably to give an example with an explanation of what it is doing.

Let's start with an HTML file. We will use the "file://" functionality of WebParser for this test, so we can parse a local file on our hard drive.

<HTML>
<BODY>
<Item>
<Name>Larry</Name>
</Item>
<Item>
<Name>Curly</Name>
</Item>
</BODY>
</HTML>

Normally, we might use something like this in a WebParser measure to search for and return three results:

[MeasureParent]
Measure=WebParser
URL=file://#CURRENTPATH#Test.html
RegExp="(?siU)<Item>.*<Name>(.*)</Name>.*<Item>.*<Name>(.*)</Name>.*<Item>.*<Name>(.*)</Name>.*"

The problem is that there are currently only two instances of "<Item>" in this file, so the regular expression will fail and no results will be returned to the WebParser measure. A feed or web site might have anywhere from zero to twenty "<Item>" occurrences, and you can't always know ahead of time.

Using a "lookahead assertion" will allow you to solve this. You can use a RegExp of:

;RegExp with look-ahead assertions
RegExp="(?siU)(?(?=.*<Item>).*<Name>(.*)</Name>)(?(?=.*<Item>).*<Name>(.*)</Name>)(?(?=.*<Item>).*<Name>(.*)</Name>)"

Now the WebParser measure will return "Larry" and "Curly" as StringIndex 1 and 2, and will just return an empty string for the third search item (StringIndex 3) without failing.

How does this work?

(?(?=.*<Item>).*<Name>(.*)</Name>)

The (? at the beginning, paired with the ) at the very end, is an IF conditional directive, basically saying "If the following succeeds, then..."

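The (?=.*<Item>) next is the lookahead assertion. It is asking "If I look ahead from here, is there an <Item> anywhere in the remaining text?" A lookahead only tests; it does not consume any text, so the search position stays where it was.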

If "true", the pattern that follows the lookahead assertion, .*<Name>(.*)</Name>, is searched for and the capture is made. The trailing ) ends that starting (? IF conditional.

The point of that overall outside (? ... ) IF conditional is to keep the RegExp from failing if the lookahead assertion is "false". If the test in the lookahead is "true", fine. If not, it just shrugs and continues to the next part of the RegExp or exits gracefully at the end of the expression.
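As a compact recap of the pieces described above, here is the same pattern annotated piece by piece, written as Rainmeter comment lines above a variable definition. The variable name Get matches the one used in the example skin; the comment layout is only illustrative.

```ini
[Variables]
; One optional item, read left to right:
;   (?                     opens the IF conditional
;   (?=.*<Item>)           condition: is there an <Item> somewhere ahead?
;   .*<Name>(.*)</Name>    runs only if the condition is true, captures the name
;   )                      closes the IF conditional
Get=(?(?=.*<Item>).*<Name>(.*)</Name>)
```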

Example skin

Let's put it all together. I will define the search pattern as a variable in [Variables], then just repeatedly use the variable in the RegExp= statement to save some typing, and set a Substitute= on the WebParser measures so we get something other than just an empty string displayed.

[Rainmeter]
DynamicWindowSize=1

[Variables]
Get=(?(?=.*<Item>).*<Name>(.*)</Name>)

[MeasureParent]
Measure=WebParser
URL=file://#CURRENTPATH#Test.html
RegExp="(?siU)#Get##Get##Get#"

[MeasureChild1]
Measure=WebParser
URL=[MeasureParent]
StringIndex=1
Substitute="":"No Moe!"

[MeasureChild2]
Measure=WebParser
URL=[MeasureParent]
StringIndex=2
Substitute="":"No Moe!"

[MeasureChild3]
Measure=WebParser
URL=[MeasureParent]
StringIndex=3
Substitute="":"No Moe!"

[MeterChild1]
Meter=String
MeasureName=MeasureChild1
FontSize=10
FontColor=255,255,255,255
AntiAlias=1
Text=Item 1: %1

[MeterChild2]
Meter=String
MeasureName=MeasureChild2
Y=2R
FontSize=10
FontColor=255,255,255,255
AntiAlias=1
Text=Item 2: %1

[MeterChild3]
Meter=String
MeasureName=MeasureChild3
Y=2R
FontSize=10
FontColor=255,255,255,255
AntiAlias=1
Text=Item 3: %1

This will display:

Item 1: Larry
Item 2: Curly
Item 3: No Moe!
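
If the source might hold more entries, the same approach should extend by repeating the variable once for each item you are prepared to show. A minimal sketch, assuming a fourth slot is wanted; the measure name MeasureChild4 and the substitute text are made up for illustration and are not part of the skin above:

```ini
[MeasureParent]
Measure=WebParser
URL=file://#CURRENTPATH#Test.html
; One #Get# per item slot; slots with no matching <Item> stay empty
RegExp="(?siU)#Get##Get##Get##Get#"

; Hypothetical extra child measure for the fourth slot
[MeasureChild4]
Measure=WebParser
URL=[MeasureParent]
StringIndex=4
Substitute="":"(none)"
```

With the two-item test file, StringIndex 3 and 4 should both come back empty and show their substitute text. A matching String meter would be added in the same way as the three above.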

Details on how "Lookaround" functionality works in regular expressions can be found at RegExp Tutorial.

Guides, tutorials and tools for regular expressions can be found at Resources for regular expressions.