WebParser: Lookahead Assertions in RegExp
When parsing an RSS feed or other web site, you often can't be sure beforehand how many results a particular search will find. If you search for five occurrences of something when there are only four, the entire regular expression will fail and no results will be returned to the WebParser measure(s).
This can be solved by using what is known in regular expressions as a "Lookahead Assertion", wrapped in a conditional. The combination evaluates an expression you define and says in effect "do this search if some pattern is found ahead, otherwise continue the regular expression without failing".
The best way to explain this is probably to give an example with an explanation of what it is doing.
Let's start with an HTML file. We will use the "file://" functionality of WebParser for this test, so we can parse a local file on our hard drive.
Normally, we might use something like this in a WebParser measure to search for and return three results:
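A minimal sketch of such a measure is shown here. The file name Test.html, the measure name, the file contents in the comments, and the (?siU) flags are assumptions made for illustration; only the <Item> and <Name> tags and the names "Larry" and "Curly" come from the text.

```ini
; Assumed contents of the local test file (hypothetical name Test.html):
;   <Item><Name>Larry</Name></Item>
;   <Item><Name>Curly</Name></Item>

[MeasureParent]
Measure=WebParser
; file:// reads a local file; #CURRENTPATH# is the folder of the current skin
URL=file://#CURRENTPATH#Test.html
; Three mandatory captures: the whole expression fails if fewer than three items exist
RegExp=(?siU)<Item>.*<Name>(.*)</Name>.*<Item>.*<Name>(.*)</Name>.*<Item>.*<Name>(.*)</Name>
```

Every part of this pattern is required, so a single missing item is enough to make the whole match fail.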
The problem is, there are only two instances of "<Item>" in this file at this time, and the regular expression will fail and no results will be returned to the WebParser measure. A feed or web site might have none or twenty "<Item>" occurrences, and you can't always know ahead of time.
Using a "lookahead assertion" will allow you to solve this. You can use a RegExp of:
;RegExp with look-ahead assertions
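The expression can be sketched as follows, assembled from the pieces explained under "How does this work?". The measure name, the file name Test.html, and the (?siU) flags are assumptions for illustration, not necessarily what the original example used.

```ini
[MeasureParent]
Measure=WebParser
; Hypothetical local test file
URL=file://#CURRENTPATH#Test.html
; One conditional block per expected item.
; Each block captures <Name> only if another <Item> lies ahead, otherwise it matches nothing.
RegExp=(?siU)(?(?=.*<Item>).*<Name>(.*)</Name>)(?(?=.*<Item>).*<Name>(.*)</Name>)(?(?=.*<Item>).*<Name>(.*)</Name>)
```

Each of the three blocks is independent, so the same block can be repeated as many times as the maximum number of items you want to allow for.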
Now the WebParser measure will return the two StringIndexes for "Larry" and "Curly", and will just return a null value for the third search item (the third StringIndex) without failing.
How does this work?
The (? at the beginning, paired with the ) at the very end, is an IF conditional directive, basically saying "If the following succeeds, then..."
(?=.*<Item>) next is the lookahead assertion. It is asking "If I look ahead from here, does the text contain another <Item>?" A lookahead only tests for a match; it does not consume any text or move the current position.
If "true", what comes after the lookahead assertion, .*<Name>(.*)</Name>, is searched for and the capture is made. The trailing ) ends that starting (? IF conditional.
The point of that overall outside (? ... ) IF conditional is to keep the RegExp from failing if the lookahead assertion is "false". If the test in the lookahead is "true", fine. If not, the conditional matches nothing, its capture is left empty, and the RegExp continues to the next part or exits gracefully at the end of the expression.
Let's put it all together. I will define the search pattern once as a variable in [Variables], then repeatedly use that variable in the RegExp= statement to save some typing, and set a Substitute= on the WebParser measures so we get something other than just a null value displayed.
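A complete skin along these lines might look like the sketch below. The variable name Item, the measure and meter names, the file name Test.html, the "N/A" substitute text, and the meter formatting are all assumptions for illustration.

```ini
[Rainmeter]
Update=1000
DynamicWindowSize=1

[Variables]
; One conditional block: capture <Name> only if another <Item> lies ahead
Item=(?(?=.*<Item>).*<Name>(.*)</Name>)

[MeasureParent]
Measure=WebParser
; Hypothetical local test file in the skin's folder
URL=file://#CURRENTPATH#Test.html
; The variable is expanded three times, giving three optional captures
RegExp=(?siU)#Item##Item##Item#

[MeasureName1]
Measure=WebParser
URL=[MeasureParent]
StringIndex=1
; Replace an empty result with visible text (the text itself is an arbitrary choice)
Substitute="":"N/A"

[MeasureName2]
Measure=WebParser
URL=[MeasureParent]
StringIndex=2
Substitute="":"N/A"

[MeasureName3]
Measure=WebParser
URL=[MeasureParent]
StringIndex=3
Substitute="":"N/A"

[MeterName1]
Meter=String
MeasureName=MeasureName1
FontSize=12
FontColor=255,255,255,255
Text=Item 1: %1

[MeterName2]
Meter=String
MeasureName=MeasureName2
Y=0R
FontSize=12
FontColor=255,255,255,255
Text=Item 2: %1

[MeterName3]
Meter=String
MeasureName=MeasureName3
Y=0R
FontSize=12
FontColor=255,255,255,255
Text=Item 3: %1
```

Keeping the pattern in a single variable means a change to the tag names only has to be made in one place, and allowing for more items is just a matter of appending more #Item# references and adding matching child measures.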
This will display:
Item 1: Larry
Details on how "Lookaround" functionality works in regular expressions can be found at RegExp Tutorial.
Guides, tutorials and tools for regular expressions can be found at Resources for regular expressions.