Python Text Processing with NLTK: Storing Frequency Distributions in Redis

0
104
9 min read

 

Python Text Processing with NLTK 2.0 Cookbook

Python Text Processing with NLTK 2.0 Cookbook

Use Python’s NLTK suite of libraries to maximize your Natural Language Processing capabilities.

  • Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
  • Learn how machines and crawlers interpret and process natural languages
  • Easily work with huge amounts of data and learn how to handle distributed processing
  • Part of Packt’s Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible
        Read more about this book      

(For more resources on Python, see here.)

Storing a frequency distribution in Redis`

Redis is a data structure server that is one of the more popular NoSQL databases. Among other things, it provides a network accessible database for storing dictionaries (also known as hash maps). Building a FreqDist interface to a Redis hash map will allow us to create a persistent FreqDist that is accessible to multiple local and remote processes at the same time.

Most Redis operations are atomic, so it’s even possible to have multiple processes write to the FreqDist concurrently.

Getting ready

For this and subsequent recipes, we need to install both Redis and redis-py. A quick start install guide for Redis is available at http://code.google.com/p/redis/wiki/ QuickStart. To use hash maps, you should install at least version 2.0.0 (the latest version as of this writing).

The Redis Python driver redis-py can be installed using pip install redis or easy_ install redis. Ensure you install at least version 2.0.0 to use hash maps. The redispy homepage is at http://github.com/andymccurdy/redis-py/.

Once both are installed and a redis-server process is running, you’re ready to go. Let’s assume redis-server is running on localhost on port 6379 (the default host and port).

How to do it…

The FreqDist class extends the built-in dict class, which makes a FreqDist an enhanced dictionary. The FreqDist class provides two additional key methods: inc() and N(). The inc() method takes a single sample argument for the key, along with an optional count keyword argument that defaults to 1, and increments the value at sample by count. N() returns the number of sample outcomes, which is the sum of all the values in the frequency distribution.

We can create an API-compatible class on top of Redis by extending a RedisHashMap (that will be explained in the next section), then implementing the inc() and N() methods. Since the FreqDist only stores integers, we also override a few other methods to ensure values are always integers. This RedisHashFreqDist (defined in redisprob.py) uses the hincrby command for the inc() method to increment the sample value by count, and sums all the values in the hash map for the N() method.

from rediscollections import RedisHashMap
class RedisHashFreqDist(RedisHashMap):
def inc(self, sample, count=1):
self._r.hincrby(self._name, sample, count)
def N(self):
return int(sum(self.values()))
def __getitem__(self, key):
return int(RedisHashMap.__getitem__(self, key) or 0)
def values(self):
return [int(v) for v in RedisHashMap.values(self)] def items(self):
return [(k, int(v)) for (k, v) in RedisHashMap.items(self)]

We can use this class just like a FreqDist. To instantiate it, we must pass a Redis connection and the name of our hash map. The name should be a unique reference to this particular FreqDist so that it doesn’t clash with any other keys in Redis.

>>> from redis import Redis
>>> from redisprob import RedisHashFreqDist
>>> r = Redis()
>>> rhfd = RedisHashFreqDist(r, 'test')
>>> len(rhfd)
0
>>> rhfd.inc('foo')
>>> rhfd['foo']1
>>> rhfd.items()
>>> len(rhfd)
1

The name of the hash map and the sample keys will be encoded to replace whitespace and & characters with _. This is because the Redis protocol uses these characters for communication. It’s best if the name and keys don’t include whitespace to begin with.

How it works…

Most of the work is done in the RedisHashMap class, found in rediscollections.py, which extends collections.MutableMapping, then overrides all methods that require Redis-specific commands. Here’s an outline of each method that uses a specific Redis command:

  • __len__(): Uses the hlen command to get the number of elements in the hash map
  • __contains__(): Uses the hexists command to check if an element exists in the hash map
  • __getitem__(): Uses the hget command to get a value from the hash map
  • __setitem__(): Uses the hset command to set a value in the hash map
  • __delitem__(): Uses the hdel command to remove a value from the hash map
  • keys(): Uses the hkeys command to get all the keys in the hash map
  • values(): Uses the hvals command to get all the values in the hash map
  • items(): Uses the hgetall command to get a dictionary containing all the keys and values in the hash map
  • clear(): Uses the delete command to remove the entire hash map from Redis

Extending collections.MutableMapping provides a number of other dict compatible methods based on the previous methods, such as update() and setdefault(), so we don’t have to implement them ourselves.

The initialization used for the RedisHashFreqDist is actually implemented here, and requires a Redis connection and a name for the hash map. The connection and name are both stored internally to use with all the subsequent commands. As mentioned before, whitespace is replaced by underscore in the name and all keys, for compatibility with the Redis network protocol.

import collections, re
white = r'[s&]+'
def encode_key(key):
return re.sub(white, '_', key.strip())
class RedisHashMap(collections.MutableMapping):
def __init__(self, r, name):
self._r = r
self._name = encode_key(name)
def __iter__(self):
return iter(self.items())
def __len__(self):
return self._r.hlen(self._name)
def __contains__(self, key):
return self._r.hexists(self._name, encode_key(key))
def __getitem__(self, key):
return self._r.hget(self._name, encode_key(key))
def __setitem__(self, key, val):
self._r.hset(self._name, encode_key(key), val)
def __delitem__(self, key):
self._r.hdel(self._name, encode_key(key))
def keys(self):
return self._r.hkeys(self._name)
def values(self):
return self._r.hvals(self._name)
def items(self):
return self._r.hgetall(self._name).items()
def get(self, key, default=0):
return self[key] or default
def iteritems(self):
return iter(self)
def clear(self):
self._r.delete(self._name)

There’s more…

The RedisHashMap can be used by itself as a persistent key-value dictionary. However, while the hash map can support a large number of keys and arbitrary string values, its storage structure is more optimal for integer values and smaller numbers of keys. However, don’t let that stop you from taking full advantage of Redis. It’s very fast (for a network server) and does its best to efficiently encode whatever data you throw at it.

While Redis is quite fast for a network database, it will be significantly slower than the in-memory FreqDist. There’s no way around this, but while you sacrifice speed, you gain persistence and the ability to do concurrent processing.

See also

In the next recipe, we’ll create a conditional frequency distribution based on the Redis frequency distribution created here.

Storing a conditional frequency distribution in Redis

The nltk.probability.ConditionalFreqDist class is a container for FreqDist instances, with one FreqDist per condition. It is used to count frequencies that are dependent on another condition, such as another word or a class label. Here, we’ll create an API-compatible class on top of Redis using the RedisHashFreqDist from the previous recipe.

Getting ready

As in the previous recipe, you’ll need to have Redis and redis-py installed with an instance of redis-server running.

How to do it…

We define a RedisConditionalHashFreqDist class in redisprob.py that extends nltk.probability.ConditionalFreqDist and overrides the __contains__() and __getitem__() methods. We then override __getitem__() so we can create an instance of RedisHashFreqDist instead of a FreqDist, and override __contains__() so we can call encode_key() from the rediscollections module before checking if the RedisHashFreqDist exists.

from nltk.probability import ConditionalFreqDist
from rediscollections import encode_key
class RedisConditionalHashFreqDist(ConditionalFreqDist):
def __init__(self, r, name, cond_samples=None):
self._r = r
self._name = name
ConditionalFreqDist.__init__(self, cond_samples)
# initialize self._fdists for all matching keys
for key in self._r.keys(encode_key('%s:*' % name)):
condition = key.split(':')[1] self[condition] # calls self.__getitem__(condition)
def __contains__(self, condition):
return encode_key(condition) in self._fdists
def __getitem__(self, condition):
if condition not in self._fdists:
key = '%s:%s' % (self._name, condition)
self._fdists[condition] = RedisHashFreqDist(self._r, key)
return self._fdists[condition] def clear(self):
for fdist in self._fdists.values():
fdist.clear()

An instance of this class can be created by passing in a Redis connection and a base name. After that, it works just like a ConditionalFreqDist.

>>> from redis import Redis
>>> from redisprob import RedisConditionalHashFreqDist
>>> r = Redis()
>>> rchfd = RedisConditionalHashFreqDist(r, 'condhash')
>>> rchfd.N()
0
>>> rchfd.conditions()
[]>>> rchfd['cond1'].inc('foo')
>>> rchfd.N()
1
>>> rchfd['cond1']['foo']1
>>> rchfd.conditions()
['cond1']>>> rchfd.clear()

How it works…

The RedisConditionalHashFreqDist uses name prefixes to reference RedisHashFreqDist instances. The name passed in to the RedisConditionalHashFreqDist is a base name that is combined with each condition to create a unique name for each RedisHashFreqDist. For example, if the base name of the RedisConditionalHashFreqDist is ‘condhash‘, and the condition is ‘cond1‘, then the final name for the RedisHashFreqDist is ‘condhash:cond1‘. This naming pattern is used at initialization to find all the existing hash maps using the keys command. By searching for all keys matching ‘condhash:*‘, we can identify all the existing conditions and create an instance of RedisHashFreqDist for each.

Combining strings with colons is a common naming convention for Redis keys as a way to define namespaces. In our case, each RedisConditionalHashFreqDist instance defines a single namespace of hash maps.

The ConditionalFreqDist class stores an internal dictionary at self._fdists that is a mapping of condition to FreqDist. The RedisConditionalHashFreqDist class still uses self._fdists, but the values are instances of RedisHashFreqDist instead of FreqDist. self._fdists is created when we call ConditionalFreqDist.__init__(), and values are initialized as necessary in the __getitem__() method.

There’s more…

RedisConditionalHashFreqDist also defines a clear() method. This is a helper method that calls clear() on all the internal RedisHashFreqDist instances. The clear() method is not defined in ConditionalFreqDist.

See also

The previous recipe covers the RedisHashFreqDist in detail.

LEAVE A REPLY

Please enter your comment!
Please enter your name here