
Python 2.6 Text Processing: Beginners Guide

Simple string matching

Regular expressions are notoriously hard to read, especially if you’re not familiar with the obscure syntax. For that reason, let’s start simple and look at some easy regular expressions at the most basic level. Before we begin, remember that Python raw strings allow us to include backslashes without the need for additional escaping.

Whenever you define regular expressions, you should do so using the raw string syntax.
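For instance, a quick check at the interpreter shows the difference; the strings here are purely illustrative:

>>> len('\n'), len(r'\n')    # a regular string interprets the escape; a raw string keeps both characters
(1, 2)
>>> '\\n' == r'\n'           # doubling every backslash also works, but raw strings are easier to read
True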

Time for action – testing an HTTP URL

In this example, we’ll check values as they’re entered via the command line as a means to introduce the technology. We’ll dive deeper into regular expressions as we move forward. We’ll be scanning URLs to ensure our end users entered valid data.

  1. Create a new file and name it url_regex.py.
  2. Enter the following code:

    import sys
    import re

    # Make sure we have a single URL argument.
    if len(sys.argv) != 2:
        print >>sys.stderr, "URL Required"
        sys.exit(-1)

    # Easier access.
    url = sys.argv[1]

    # Ensure we were passed a somewhat valid URL.
    # This is a superficial test.
    if re.match(r'^https?:/{2}\w.+$', url):
        print "This looks valid"
    else:
        print "This looks invalid"

    
    
  3. Now, run the example script a few times, passing different values to it on the command line.

    (text_processing)$ python url_regex.py http://www.jmcneil.net
    This looks valid
    (text_processing)$ python url_regex.py http://intranet
    This looks valid
    (text_processing)$ python url_regex.py http://www.packtpub.com
    This looks valid
    (text_processing)$ python url_regex.py https://store
    This looks valid
    (text_processing)$ python url_regex.py httpsstore
    This looks invalid
    (text_processing)$ python url_regex.py https:??store
    This looks invalid
    (text_processing)$

    
    

What just happened?

We took a look at a very simple pattern and introduced you to the plumbing needed to perform a match test. Let’s walk through this little example, skipping the boilerplate code.

First of all, we imported the re module. The re module, as you probably inferred from the name, contains all of Python’s regular expression support.

Any time you need to work with regular expressions, you’ll need to import the re module.

Next, we read a URL from the command line and bind a temporary attribute, which makes for cleaner code. Directly below that, you should notice a line that reads re.match(r'^https?:/{2}\w.+$', url). This line checks to determine whether the string referenced by the url attribute matches the ^https?:/{2}\w.+$ pattern.

If a match is found, we’ll print a success message; otherwise, the end user would receive some negative feedback indicating that the input value is incorrect.

This example leaves out a lot of details regarding HTTP URL formats. If you were performing validation on user input, one place to look would be http://formencode.org/. FormEncode is an HTML form-processing and data-validation framework written by Ian Bicking.

Understanding the match function

The most basic method of testing for a match is via the re.match function, as we did in the previous example. The match function takes a regular expression pattern and a string value. For example, consider the following snippet of code:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'pattern', 'pattern')
<_sre.SRE_Match object at 0x1004811d0>
>>>


Here, we simply passed a regular expression of “pattern” and a string literal of “pattern” to the re.match function. As they were identical, the result was a match. The returned Match object indicates the match was successful. The re.match function returns None otherwise.

>>> re.match(r'pattern', 'failure')
>>>


Learning basic syntax

A regular expression is generally a collection of literal string data and special metacharacters that represents a pattern of text. The simplest regular expression is just literal text that only matches itself.

In addition to literal text, there are a series of special characters that can be used to convey additional meaning, such as repetition, sets, wildcards, and anchors. Generally, the punctuation characters field this responsibility.

Detecting repetition

When building up expressions, it’s useful to be able to match certain repeating patterns without needing to duplicate values. It’s also beneficial to perform conditional matches. This lets us check for content such as “match the letter a, followed by the number one at least three times, but no more than seven times.”

For example, the code below does just that:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^a1{3,7}$', 'a1111111')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^a1{3,7}$', '1111111')
>>>


If the repetition operator follows a valid regular expression enclosed in parentheses, it will perform repetition on that entire expression. For example:

>>> re.match(r'^(a1){3,7}$', 'a1a1a1')
<_sre.SRE_Match object at 0x100493918>
>>> re.match(r'^(a1){3,7}$', 'a11111')
>>>


The following table details the special characters that can be used for marking repeating values within a regular expression.

    *        Matches zero or more instances of the previous entity
    +        Matches one or more instances of the previous entity
    ?        Matches zero or one instance of the previous entity
    {m}      Matches exactly m instances of the previous entity
    {m,n}    Matches between m and n instances of the previous entity; either bound may be omitted
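As a quick illustration of how these behave (the patterns and test strings are made up for demonstration), consider the following interpreter session:

>>> import re
>>> bool(re.match(r'^ab*$', 'a'))        # '*' allows zero or more of the preceding entity
True
>>> bool(re.match(r'^ab+$', 'a'))        # '+' requires at least one
False
>>> bool(re.match(r'^ab?$', 'abb'))      # '?' allows at most one
False
>>> bool(re.match(r'^ab{2,}$', 'abbb'))  # '{2,}' means two or more
True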

Specifying character sets and classes

In some circumstances, it’s useful to collect groups of characters into a set such that any of the values in the set will trigger a match. It’s also useful to match any character at all. The dot operator does just that.

A character set is enclosed within standard square brackets. A set defines a series of alternating (or) entities that will match a given text value. If the first character within a set is a caret (^) then a negation is performed. All characters not defined by that set would then match.

There are a couple of additional interesting set properties.

  1. For ranged values, it’s possible to specify an entire selection using a hyphen. For example, '[0-6a-d]' would match all values between 0 and 6, and between a and d.
  2. Special characters listed within brackets lose their special meaning. The exceptions to this rule are the hyphen and the closing bracket.

If you need to include a closing bracket or a hyphen within a regular expression, you can either place them as the first elements in the set or escape them by preceding them with a backslash.
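Before moving on, here's a small sketch of those set properties in action (the test strings are made up):

>>> import re
>>> bool(re.match(r'^[^0-9]+$', 'hello'))    # the leading caret negates the set
True
>>> bool(re.match(r'^[^0-9]+$', 'hello2'))   # the digit is rejected
False
>>> bool(re.match(r'^[-+]?[0-9]+$', '-42'))  # a hyphen listed first is treated literally
True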

As an example, consider the following snippet, which matches a string containing a hexadecimal number.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^0x[a-f0-9]+$', '0xff')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f0-9]+$', '0x01')
<_sre.SRE_Match object at 0x1004816b0>
>>> re.match(r'^0x[a-f0-9]+$', '0xz')
>>>


In addition to the bracket notation, Python ships with some predefined classes. Generally, these are letter values prefixed with a backslash escape. When they appear within a set, the set includes all values for which they’ll match. The \d escape matches all digit values. It would have been possible to write the above example in a slightly more compact manner.

>>> re.match(r'^0x[a-f\d]+$', '0x33')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f\d]+$', '0x3f')
<_sre.SRE_Match object at 0x1004816b0>
>>>


The following table outlines the character sets and classes available:

    [abc]    A character set; matches any one of the enclosed characters
    [^abc]   A negated set; matches any character not listed
    .        Matches any single character except a newline
    \d       Matches a digit; \D matches any non-digit
    \s       Matches a whitespace character; \S matches non-whitespace
    \w       Matches a word character ([0-9a-zA-Z_]); \W matches non-word characters

One thing that should become apparent is that the lowercase classes match a category of characters, whereas their uppercase counterparts match everything outside that category.
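For example, here's a minimal sketch contrasting a few classes with their uppercase inverses:

>>> import re
>>> bool(re.match(r'^\d+$', '2012'))    # \d matches each digit
True
>>> bool(re.match(r'^\D+$', '2012'))    # \D matches anything that is not a digit
False
>>> bool(re.match(r'^\w+$', 'py_26'))   # \w covers letters, digits, and the underscore
True
>>> bool(re.match(r'^\W+$', 'py_26'))   # \W is the inverse of \w
False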

Applying anchors to restrict matches

There are times where it’s important that patterns match at a certain position within a string of text. Why is this important? Consider a simple number validation test. If a user enters a digit, but mistakenly includes a trailing letter, an expression checking for the existence of a digit alone will pass.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'\d', '1f')
<_sre.SRE_Match object at 0x1004811d0>
>>>


Well, that’s unexpected. The regular expression engine sees the leading ‘1’ and considers it a match. It disregards the rest of the string as we’ve not instructed it to do anything else with it. To fix the problem that we have just seen, we need to apply anchors.

>>> re.match(r'^\d$', '6')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^\d$', '6f')
>>>


Now, attempting to sneak in a non-digit character results in no match. By preceding our expression with a caret (^) and terminating it with a dollar sign ($), we effectively said “between the start and the end of this string, there can only be one digit.”

Anchors, among various other metacharacters, are considered zero-width matches. Basically, this means that a match doesn’t advance the regular expression engine within the test string.

We’re not limited to either end of a string, either. Here’s a collection of the anchors provided by Python:

    ^     Matches at the beginning of a string (or of each line when re.MULTILINE is used)
    $     Matches at the end of a string (or of each line when re.MULTILINE is used)
    \A    Matches only at the very beginning of a string
    \Z    Matches only at the very end of a string
    \b    Matches at a word boundary, between a word character and a non-word character
    \B    Matches anywhere that is not a word boundary
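Here's a brief sketch of a few of them; it uses re.search, which scans the whole string rather than anchoring at the start the way re.match does:

>>> import re
>>> bool(re.search(r'\bcat\b', 'the cat sat'))   # \b matches at word boundaries
True
>>> bool(re.search(r'\bcat\b', 'concatenate'))   # 'cat' is embedded in a larger word
False
>>> bool(re.match(r'\d+$', '123\n'))             # $ will still match before a trailing newline
True
>>> bool(re.match(r'\d+\Z', '123\n'))            # \Z only matches at the absolute end
False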

Wrapping it up

Now that we’ve covered the basics of regular expression syntax, let’s double back and take a look at the expression we used in our first example. It might be a bit easier if we break it down piece by piece.

Now that we’ve provided a bit of background, this pattern should make sense. We begin the regular expression with a caret, which matches the beginning of the string. The very next element is the literal http. As our caret matches the start of a string and must be immediately followed by http, this is equivalent to saying that our string must start with http.

Next, we include a question mark after the s in https. The question mark states that the previous entity should be matched either zero, or one time. By default, the evaluation engine is looking character-by-character, so the previous entity in this case is simply “s.” We do this so our test passes for both secure and non-secure addresses.

As we advanced forward in our string, the next special term we run into is {2}, and it follows a simple forward slash. This says that the forward slash should appear exactly two times. Now, in the real world, it would probably make more sense to simply type the second slash. Using the repetition check like this not only requires more typing, but it also causes the regular expression engine to work harder.

Immediately after the repetition match, we include a \w. The \w, if you’ll remember from the previous tables, expands to [0-9a-zA-Z_], or any word character. This is to ensure that our URL doesn’t begin with a special character.

The dot character after the \w matches anything, except a new line. Essentially, we’re saying “match anything else, we don’t so much care.” The plus sign states that the preceding wild card should match at least once.

Finally, we’re anchoring the end of the string. However, in this example, this isn’t really necessary.
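One way to make that breakdown concrete is to rebuild the same pattern with re.VERBOSE, which lets us annotate each piece inline; this sketch is purely illustrative and isn't part of the example script:

>>> import re
>>> url_re = re.compile(r"""
...     ^          # start of the string
...     https?     # literal 'http', with an optional 's'
...     :/{2}      # a colon followed by exactly two forward slashes
...     \w         # the first character must be a word character
...     .+         # anything else, except a newline
...     $          # end of the string
...     """, re.VERBOSE)
>>> bool(url_re.match('https://www.jmcneil.net'))
True
>>> bool(url_re.match('ftp://www.jmcneil.net'))
False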

Have a go hero – tidying up our URL test

There are a few intentional inconsistencies and problems with this regular expression as designed. To name a few:

  1. Properly formatted URLs should only contain a few special characters. Other values should be URL-encoded using percent escapes. This regular expression doesn’t check for that.
  2. It’s possible to include newline characters towards the end of the URL, which is clearly not supported by any browsers!
  3. The \w followed by the .+ implicitly sets a minimum length of two characters after the protocol specification. A single character is perfectly valid.

You guessed it. Using what we’ve covered thus far, it should be possible for you to backtrack and update our regular expression in order to fix these flaws. For more information on what characters are allowed, have a look at http://www.w3schools.com/tags/ref_urlencode.asp.

Advanced pattern matching

In addition to basic pattern matching, regular expressions let us handle some more advanced situations as well. It’s possible to group characters for purposes of precedence and reference, perform conditional checks based on what exists later, or previously, in a string, and limit exactly how much of a match actually constitutes a match. Don’t worry; we’ll clarify that last phrase as we move on. Let’s go!

Grouping

When crafting a regular expression string, there are generally two reasons you would wish to group expression components together: entity precedence or to enable access to matched parts later in your application.

Time for action – regular expression grouping

In this example, we’ll return to our LogProcessing application. Here, we’ll update our log split routines to divide lines up via a regular expression as opposed to simple string manipulation.

  1. In core.py, add an import re statement to the top of the file. This makes the regular expression engine available to us.
  2. Directly above the __init__ method definition for LogProcessor, add the following lines of code. These have been split to avoid wrapping.

    _re = re.compile(
        r'^([\d.]+) (\S+) (\S+) \[([\w/:+ ]+)\] "(.+?)" '
        r'(?P<rcode>\d{3}) (\S+) "(\S+)" "(.+)"')

    
    
  3. Now, we’re going to replace the split method with one that takes advantage of the new regular expression:

    def split(self, line):
        """
        Split a logfile.

        Uses a simple regular expression to parse out the Apache
        logfile entries.
        """
        line = line.strip()
        match = re.match(self._re, line)
        if not match:
            raise ParsingError("Malformed line: " + line)
        return {
            'size': 0 if match.group(6) == '-'
                else int(match.group(6)),
            'status': match.group('rcode'),
            'file_requested': match.group(5).split()[1]
        }

    
    
  4. Running the logscan application should now produce the same output as it did when we were using a more basic, split-based approach.

    (text_processing)$ cat example3.log | logscan -c logscan.cfg

    
    

What just happened?

First of all, we imported the re module so that we have access to Python’s regular expression services.

Next, at the LogProcessor class level, we defined a regular expression, though this time we did so via re.compile rather than as a simple string. Regular expressions that are used more than a handful of times should be "prepared" by running them through re.compile first. This eases the load placed on the system by frequently used patterns. The re.compile function returns an SRE_Pattern object that can be passed in just about anywhere you can pass in a regular expression.
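As a rough sketch of the difference, a compiled pattern carries its own match method, so the expression is only parsed once no matter how many lines we feed it:

>>> import re
>>> digits = re.compile(r'^\d+$')     # compile once, reuse many times
>>> bool(digits.match('2012'))        # the pattern object has its own match method
True
>>> bool(digits.match('20x12'))
False
>>> bool(re.match(digits, '2012'))    # a compiled pattern also works with re.match
True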

We then replace our split method to take advantage of regular expressions. As you can see, we simply pass self._re in as opposed to a string-based regular expression. If we don’t have a match, we raise a ParsingError, which bubbles up and generates an appropriate error message, much like we would see on an invalid split case.

Now, the end of the split method probably looks somewhat peculiar to you. Here, we’ve referenced our matched values via group identification mechanisms rather than by their list index into the split results. Regular expression components surrounded by parentheses create a group, which can be accessed via the group method on the Match object later down the road. It’s also possible to access a previously matched group from within the same regular expression. Let’s look at a somewhat smaller example.

>>> match = re.match(r'(0x[0-9a-f]+) (?P<two>\1)', '0xff 0xff')
>>> match.group(1)
'0xff'
>>> match.group(2)
'0xff'
>>> match.group('two')
'0xff'
>>> match.group('failure')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: no such group
>>>


Here, we surround two distinct regular expression components with parentheses, (0x[0-9a-f]+) and (?P<two>\1). The first regular expression matches a hexadecimal number. This becomes group ID 1. The second expression matches whatever was found by the first, via the use of \1. The "backslash-one" syntax references the first match. So, this entire regular expression only matches when we repeat the same hexadecimal number twice, separated by a space. The ?P<two> syntax is detailed below.

As you can see, the match is referenced after-the-fact using the match.group method, which takes a numeric index as its argument. Using standard regular expressions, you’ll need to refer to a matched group using its index number. However, if you look at the second group, we added a (?P<name>...) construct. This is a Python extension that lets us refer to groups by name rather than by numeric group ID.

Finally, if an invalid group ID is passed in, an IndexError exception is thrown.

The following table outlines the characters used for building groups within a Python regular expression:

    (...)          A capturing group; its match is stored and numbered from left to right
    (?P<name>...)  A named capturing group; accessible by name as well as by number
    (?P=name)      Matches whatever the earlier group called name matched
    \number        Matches whatever the group with that number matched
    (?:...)        A non-capturing group; useful for precedence when no back reference is needed
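Here's a small sketch contrasting a capturing group, a named group, and a non-capturing group (the test string is made up):

>>> import re
>>> match = re.match(r'(?:0x)([0-9a-f]+)-(?P<suffix>\w+)', '0xff-cafe')
>>> match.group(1)                # the non-capturing (?:0x) group doesn't get a number
'ff'
>>> match.group('suffix')         # named groups can be referenced by name
'cafe'
>>> match.groups()                # only capturing groups are returned
('ff', 'cafe')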

Finally, it’s worth pointing out that parentheses can also be used to alter precedence. For example, consider this code.

>>> re.match(r'abc{2}', 'abcc')
<_sre.SRE_Match object at 0x1004818b8>
>>> re.match(r'a(bc){2}', 'abcc')
>>> re.match(r'a(bc){2}', 'abcbc')
<_sre.SRE_Match object at 0x1004937b0>
>>>


Whereas the first example matches c exactly two times, the second and third lines require us to repeat bc twice. This changes the meaning of the regular expression from “repeat the previous character twice” to “repeat the previous match within parentheses twice.” The value within the group could have been its own complex regular expression, such as a([b-c]){2}.

Have a go hero – updating our stats processor to use named groups

Spend a couple of minutes and update our statistics processor to use named groups rather than integer-based references. This makes it slightly easier to read the assignment code in the split method. You do not need to create names for all of the groups; naming just the ones we’re actually using will do.

Using greedy versus non-greedy operators

Regular expressions generally like to match as much text as possible before giving up or yielding to the next token in a pattern string. If that behavior is unexpected and not fully understood, it can be difficult to get your regular expression correct. Let’s take a look at a small code sample to illustrate the point.

Suppose that with your newfound knowledge of regular expressions, you decided to write a small script to remove the angled brackets surrounding HTML tags. You might be tempted to do it like this:

>>> match = re.match(r'(?P<tag><.+>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>Web Page</title>'
>>>


The result is probably not what you expected. We got this result because regular expressions are greedy by nature. That is, they’ll attempt to match as much as possible. If you look closely, <title> is a match for the supplied regular expression, as is the entire <title>Web Page</title> string. Both start with an angled bracket, contain at least one character, and both end with an angled bracket.

The fix is to insert the question mark character, or the non-greedy operator, directly after the repetition specification. So, the following code snippet fixes the problem.

>>> match = re.match(r'(?P<tag><.+?>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>'
>>>


The question mark changes our meaning from “match as much as you possibly can” to “match only the minimum required to actually match.”
