Parsing Specific Data in Python Text Processing

Python Text Processing with NLTK 2.0 Cookbook

Introduction

This article covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries for accomplishing this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to the NLTK:

dateutil: Provides date/time parsing and time zone conversion

timex: Can identify time words in text

lxml and BeautifulSoup: Can parse, clean, and convert HTML

chardet: Detects the character encoding of text

The libraries can be useful for pre-processing text before passing it to an NLTK object, or post-processing text that has been processed and extracted using NLTK. Here's an example that ties many of these tools together.

Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed to a Python datetime object with dateutil. Once you have the article text, you can use chardet to ensure it's UTF-8 before cleaning out the HTML and running it through NLTK-based part-of-speech tagging, chunk extraction, and/or text classification, to create additional metadata about the article. If there's an event happening at the restaurant, you may be able to discover that by looking at the time words identified by timex. The point of this example is that real-world text processing often requires more than just NLTK-based natural language processing, and the functionality covered in this article can help with those additional requirements.

Parsing dates and times with Dateutil

If you need to parse dates and times in Python, there is no better library than dateutil. The parser module can parse datetime strings in many more formats than can be shown here, while the tz module provides everything you need for looking up time zones. Combined, these modules make it quite easy to parse strings into time zone aware datetime objects.

Getting ready

You can install dateutil using pip or easy_install. The package is published under the name python-dateutil, so the commands are sudo pip install python-dateutil or sudo easy_install python-dateutil. Complete documentation can be found at http://labix.org/python-dateutil

How to do it...

Let's dive into a few parsing examples:

>>> from dateutil import parser
>>> parser.parse('Thu Sep 25 10:36:28 2010')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('Thursday, 25. September 2010 10:36AM')
datetime.datetime(2010, 9, 25, 10, 36)
>>> parser.parse('9/25/2010 10:36:28')
datetime.datetime(2010, 9, 25, 10, 36, 28)
>>> parser.parse('9/25/2010')
datetime.datetime(2010, 9, 25, 0, 0)
>>> parser.parse('2010-09-25T10:36:28Z')
datetime.datetime(2010, 9, 25, 10, 36, 28, tzinfo=tzutc())

As you can see, all it takes is importing the parser module and calling the parse() function with a datetime string. The parser will do its best to return a sensible datetime object, but if it cannot parse the string, it will raise a ValueError.
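Because parse() reports failure by raising an exception, code that handles untrusted input usually wraps the call. The following is a minimal sketch of one way to do that; the helper name parse_or_none is hypothetical and is not part of dateutil.

```python
import datetime
from dateutil import parser


def parse_or_none(text):
    """Return a datetime for text, or None if dateutil cannot parse it.

    parse_or_none is a hypothetical helper name, not a dateutil function.
    """
    try:
        return parser.parse(text)
    except (ValueError, OverflowError):
        # ValueError: the string is not in a recognizable format.
        # OverflowError: a numeric field is too large for the platform.
        return None


print(parse_or_none('9/25/2010 10:36:28'))
print(parse_or_none('not a date'))
```

Returning None keeps the calling code simple, at the cost of hiding why a string was rejected; log the exception instead if that detail matters.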

How it works...

The parser does not use regular expressions. Instead, it looks for recognizable tokens and does its best to guess what those tokens refer to. The order of these tokens matters. For example, some cultures use a date format that looks like Month/Day/Year (the default order), while others use a Day/Month/Year format. To deal with this, the parse() function takes an optional keyword argument, dayfirst, which defaults to False. If you set it to True, it can correctly parse dates in the latter format.

>>> parser.parse('25/9/2010', dayfirst=True)
datetime.datetime(2010, 9, 25, 0, 0)
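The effect of dayfirst is easiest to see when both leading numbers could be a month. The following short sketch parses the same string both ways; the sample string is made up for illustration.

```python
import datetime
from dateutil import parser

# Both 1 and 2 are valid months, so only the ordering rule decides.
ambiguous = '1/2/2010'

month_first = parser.parse(ambiguous)                 # January 2, 2010
day_first = parser.parse(ambiguous, dayfirst=True)    # February 1, 2010

print(month_first)
print(day_first)
```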

Another ordering issue can occur with two-digit years. For example, '10-9-25' is ambiguous. Since dateutil defaults to the Month-Day-Year format, '10-9-25' is parsed to the year 2025. But if you pass yearfirst=True into parse(), it will be parsed to the year 2010.

>>> parser.parse('10-9-25')
datetime.datetime(2025, 10, 9, 0, 0)
>>> parser.parse('10-9-25', yearfirst=True)
datetime.datetime(2010, 9, 25, 0, 0)

There's more...

The dateutil parser can also do fuzzy parsing, which allows it to ignore extraneous characters in a datetime string. With the default value of False, parse() will raise a ValueError when it encounters unknown tokens. But if fuzzy=True, then a datetime object can usually be returned.

>>> try:
...     parser.parse('9/25/2010 at about 10:36AM')
... except ValueError:
...     'cannot parse'
'cannot parse'
>>> parser.parse('9/25/2010 at about 10:36AM', fuzzy=True)
datetime.datetime(2010, 9, 25, 10, 36)
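Fuzzy parsing is more permissive, so it can be useful to know when it was needed. The following is a minimal sketch that tries strict parsing first and only then falls back to fuzzy=True; parse_lenient is a hypothetical helper name, not part of dateutil.

```python
import datetime
from dateutil import parser


def parse_lenient(text):
    """Return (datetime, used_fuzzy) for text.

    Strict parsing is attempted first. If it raises ValueError, the
    string is parsed again with fuzzy=True, which may still raise.
    parse_lenient is a hypothetical helper, not a dateutil function.
    """
    try:
        return parser.parse(text), False
    except ValueError:
        return parser.parse(text, fuzzy=True), True


print(parse_lenient('9/25/2010'))
print(parse_lenient('9/25/2010 at about 10:36AM'))
```

The second element of the result lets the caller flag fuzzily parsed values for review.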

Time zone lookup and conversion

Most datetime objects returned from the dateutil parser are naive, meaning they don't have an explicit tzinfo, which specifies the time zone and UTC offset. In the previous recipe, only one of the examples had a tzinfo, and that's because it's in the standard ISO 8601 format for UTC date and time strings. UTC is Coordinated Universal Time, which for practical purposes is the same as GMT. ISO is the International Organization for Standardization, which, among other things, specifies standard date and time formatting.

Python datetime objects can be either naive or aware. If a datetime object has a tzinfo, then it is aware; otherwise, it is naive. To make a naive datetime object time zone aware, you must give it an explicit tzinfo. However, the Python datetime library only defines an abstract base class for tzinfo, and leaves it up to others to actually implement tzinfo creation. This is where the tz module of dateutil comes in: it provides everything you need to look up time zones from your OS time zone data.
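The naive/aware distinction can be checked in code. The following is a small sketch, assuming dateutil is installed; is_aware is a hypothetical helper name that follows the rule given in the standard datetime documentation.

```python
import datetime
from dateutil import tz


def is_aware(dt):
    """Return True if dt has a tzinfo that yields a UTC offset.

    is_aware is a hypothetical helper, not part of datetime or dateutil.
    """
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


naive = datetime.datetime(2010, 9, 25, 10, 36)
aware = naive.replace(tzinfo=tz.tzutc())

print(is_aware(naive))
print(is_aware(aware))
```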

Getting ready

dateutil should be installed using pip or easy_install. You should also make sure your operating system has time zone data. On Linux, this is usually found in /usr/share/zoneinfo, and the Ubuntu package is called tzdata. If you have a number of files and directories in /usr/share/zoneinfo, such as America/, Europe/, and so on, then you should be ready to proceed. The following examples show directory paths for Ubuntu Linux.

How to do it...

Let's start by getting a UTC tzinfo object. This can be done by calling tz.tzutc(), and you can check that the offset is 0 by calling the utcoffset() method with a UTC datetime object.

>>> from dateutil import tz
>>> tz.tzutc()
tzutc()
>>> import datetime
>>> tz.tzutc().utcoffset(datetime.datetime.utcnow())
datetime.timedelta(0)

To get tzinfo objects for other time zones, you can pass in a time zone file path to the gettz() function.

>>> tz.gettz('US/Pacific')
tzfile('/usr/share/zoneinfo/US/Pacific')
>>> tz.gettz('US/Pacific').utcoffset(datetime.datetime.utcnow())
datetime.timedelta(-1, 61200)
>>> tz.gettz('Europe/Paris')
tzfile('/usr/share/zoneinfo/Europe/Paris')
>>> tz.gettz('Europe/Paris').utcoffset(datetime.datetime.utcnow())
datetime.timedelta(0, 7200)

You can see the UTC offsets are timedelta objects, where the first number is days, and the second number is seconds.
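A negative timedelta is stored as a negative day count plus a positive number of seconds, which is why an offset west of UTC shows up as -1 day and 61200 seconds. The following sketch, using only the standard library and the two offsets shown above, converts them to hours; offset_hours is a hypothetical helper name.

```python
import datetime

# The two offsets from the examples above, written out by hand.
pacific_offset = datetime.timedelta(-1, 61200)
paris_offset = datetime.timedelta(0, 7200)


def offset_hours(delta):
    """Return a timedelta as a signed number of hours."""
    return delta.total_seconds() / 3600.0


print(offset_hours(pacific_offset))   # -1 day + 17 hours = 7 hours behind UTC
print(offset_hours(paris_offset))     # 2 hours ahead of UTC
```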

If you're storing datetimes in a database, it's a good idea to store them all in UTC to eliminate any time zone ambiguity. Even if the database can recognize time zones, it's still a good practice.

To convert a non-UTC datetime object to UTC, it must be made time zone aware. If you try to convert a naive datetime to UTC, you'll get a ValueError exception. To make a naive datetime time zone aware, you simply call the replace() method with the correct tzinfo. Once a datetime object has a tzinfo, then UTC conversion can be performed by calling the astimezone() method with tz.tzutc().

>>> pst = tz.gettz('US/Pacific')
>>> dt = datetime.datetime(2010, 9, 25, 10, 36)
>>> dt.tzinfo
>>> dt.astimezone(tz.tzutc())
Traceback (most recent call last):
  File "/usr/lib/python2.6/doctest.py", line 1248, in __run
    compileflags, 1) in test.globs
  File "<doctest __main__[22]>", line 1, in <module>
    dt.astimezone(tz.tzutc())
ValueError: astimezone() cannot be applied to a naive datetime
>>> dt.replace(tzinfo=pst)
datetime.datetime(2010, 9, 25, 10, 36, tzinfo=tzfile('/usr/share/zoneinfo/US/Pacific'))
>>> dt.replace(tzinfo=pst).astimezone(tz.tzutc())
datetime.datetime(2010, 9, 25, 17, 36, tzinfo=tzutc())

How it works...

The tzutc and tzfile objects are both subclasses of tzinfo. As such, they know the correct UTC offset for time zone conversion (which is 0 for tzutc). A tzfile object knows how to read your operating system's zoneinfo files to get the necessary offset data. The replace() method of a datetime object does what its name implies—it replaces attributes. Once a datetime has a tzinfo, the astimezone() method will be able to convert the time using the UTC offsets, and then replace the current tzinfo with the new tzinfo.

Note that both replace() and astimezone() return new datetime objects. They do not modify the current object.
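The following sketch illustrates that neither call changes the object it is called on. To stay independent of the operating system's zoneinfo files, it assumes a fixed -7 hour tzoffset as a stand-in for US/Pacific daylight time; the variable names are illustrative.

```python
import datetime
from dateutil import tz

# Assumption: a fixed -7 hour offset stands in for US/Pacific in
# September, so no zoneinfo lookup is needed.
pdt = tz.tzoffset('PDT', -7 * 3600)

naive = datetime.datetime(2010, 9, 25, 10, 36)
aware = naive.replace(tzinfo=pdt)        # new object; naive is unchanged
in_utc = aware.astimezone(tz.tzutc())    # new object; aware is unchanged

print(naive.tzinfo)
print(aware)
print(in_utc)
```

Because the originals are untouched, the result of each call has to be assigned to a name or it is lost.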

There's more...

You can pass a tzinfos keyword argument into the dateutil parser to detect otherwise unrecognized time zones.

>>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)',
...              fuzzy=True)
datetime.datetime(2010, 8, 4, 18, 30)
>>> tzinfos = {'CDT': tz.gettz('US/Central')}
>>> parser.parse('Wednesday, Aug 4, 2010 at 6:30 p.m. (CDT)',
...              fuzzy=True, tzinfos=tzinfos)
datetime.datetime(2010, 8, 4, 18, 30, tzinfo=tzfile('/usr/share/zoneinfo/US/Central'))

In the first instance, we get a naive datetime since the time zone is not recognized. However, when we pass in the tzinfos mapping, we get a time zone aware datetime.
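The values in a tzinfos mapping can be any tzinfo objects, not only ones loaded from zoneinfo files. The following is a minimal sketch that maps the abbreviation to a fixed -5 hour tzoffset, which is an assumption standing in for US/Central daylight time, and then converts the result to UTC. The input string is a simplified, made-up variant of the one above.

```python
import datetime
from dateutil import parser, tz

# Assumption: CDT is treated as a fixed offset of UTC-5.
tzinfos = {'CDT': tz.tzoffset('CDT', -5 * 3600)}

local = parser.parse('Aug 4, 2010 6:30 PM CDT', tzinfos=tzinfos)
in_utc = local.astimezone(tz.tzutc())

print(local)
print(in_utc)
```

A fixed offset ignores daylight saving transitions, so a zoneinfo-backed tzinfo is the better choice when the same mapping must cover dates across the whole year.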

Local time zone

If you want to look up your local time zone, you can call tz.tzlocal(), which will use whatever your operating system thinks is the local time zone. In Ubuntu Linux, this is usually specified in the /etc/timezone file.
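A short sketch of tz.tzlocal() in use follows. The printed values depend on the machine's configured time zone, so no expected output is shown.

```python
import datetime
from dateutil import tz

# An aware "now" in whatever zone the operating system reports.
local_now = datetime.datetime.now(tz.tzlocal())

# The same instant expressed in UTC.
utc_now = local_now.astimezone(tz.tzutc())

print(local_now)
print(utc_now)
```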

Custom offsets

You can create your own tzinfo object with a custom UTC offset using the tzoffset object. A custom offset of one hour can be created as follows:

>>> tz.tzoffset('custom', 3600)
tzoffset('custom', 3600)

You must provide a name as the first argument, and the offset time in seconds as the second argument.
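A tzoffset can be attached to a datetime like any other tzinfo. The following minimal sketch shows the one-hour offset from above applied to a sample datetime and then converted to UTC; the date itself is arbitrary.

```python
import datetime
from dateutil import tz

plus_one = tz.tzoffset('custom', 3600)

dt = datetime.datetime(2010, 9, 25, 10, 36, tzinfo=plus_one)
in_utc = dt.astimezone(tz.tzutc())   # one hour earlier on the UTC clock

print(dt.tzname())
print(dt.utcoffset())
print(in_utc)
```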

Tagging temporal expressions with Timex

The NLTK project has a little-known contrib repository that contains, among other things, a module called timex.py that can tag temporal expressions. A temporal expression is just one or more time words, such as "this week" or "next month". These are ambiguous expressions that are relative to some other point in time, such as when the text was written. The timex module provides a way to annotate text so these expressions can be extracted for further analysis. More on TIMEX can be found at http://timex2.mitre.org/

Getting ready

The timex.py module is part of the nltk_contrib package, which is separate from the current version of NLTK. This means you need to either install nltk_contrib yourself or use the timex.py module on its own. You can download timex.py directly from http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/timex.py

If you want to install the entire nltk_contrib package, you can check out the source at http://nltk.googlecode.com/svn/trunk/ and do sudo python setup.py install from within the nltk_contrib folder. If you do this, you'll need to do from nltk_contrib import timex instead of just import timex as done in the following How to do it... section.

For this recipe, you have to download the timex.py module into the same folder as the rest of the code, so that import timex does not cause an ImportError.

You'll also need to install the egenix-mx-base package. This is a C extension library for Python, so if you have all the correct Python development headers installed, you should be able to do sudo pip install egenix-mx-base or sudo easy_install egenix-mx-base. If you're running Ubuntu Linux, you can instead do sudo apt-get install python-egenix-mxdatetime. If none of those work, you can go to http://www.egenix.com/products/python/mxBase/ to download the package and find installation instructions.

How to do it...

Using timex is very simple: pass a string into the timex.tag() function and get back an annotated string. The annotations will be XML TIMEX tags surrounding each temporal expression.

>>> import timex
>>> timex.tag("Let's go sometime this week")
"Let's go sometime <TIMEX2>this week</TIMEX2>"
>>> timex.tag("Tomorrow I'm going to the park.")
"<TIMEX2>Tomorrow</TIMEX2> I'm going to the park."

How it works...

The implementation of timex.py is essentially over 300 lines of conditional regular expression matches. When one of the known expressions matches, it creates a RelativeDateTime object (from the mx.DateTime module). This RelativeDateTime is then converted back to a string with surrounding TIMEX tags and replaces the original matched string in the text.
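The general idea of regular-expression tagging can be sketched with the standard library alone. The following is not the code from timex.py: the pattern covers only a handful of expressions chosen for illustration, the RelativeDateTime step is left out, and the names tag_sketch and extract_expressions are hypothetical. The second function shows how the tagged expressions can be pulled back out for further analysis.

```python
import re

# Illustrative pattern only: far fewer expressions than timex.py
# handles, and not copied from its source.
TEMPORAL = re.compile(
    r'\b(today|tomorrow|yesterday|tonight'
    r'|(?:this|next|last)\s+(?:week|month|year))\b',
    re.IGNORECASE,
)
TAGGED = re.compile(r'<TIMEX2>(.*?)</TIMEX2>')


def tag_sketch(text):
    """Wrap each matched temporal expression in TIMEX2 tags."""
    return TEMPORAL.sub(r'<TIMEX2>\1</TIMEX2>', text)


def extract_expressions(tagged_text):
    """Return the expressions found inside TIMEX2 tags, in order."""
    return TAGGED.findall(tagged_text)


tagged = tag_sketch("Let's go sometime this week")
print(tagged)
print(extract_expressions(tagged))
```

The extraction function works on any string annotated with TIMEX2 tags, so it can also be applied to the output of timex.tag().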