Python 2.6 Text Processing: Beginners Guide

The easiest way to learn how to manipulate text with Python

Deals with the most important textual data formats you will encounter

Learn to use the most popular text processing libraries available for Python

Packed with examples to guide you through

We won't dive into too much detail with any single approach. Rather, the goal of this article is to teach you the basics so that you can get started and explore the details on your own. Also, remember that our goal isn't to be pretty; it's to present a usable subset of functionality. In other words, our PDF layouts are ugly!

Unfortunately, the third-party packages used in this article are not yet compatible with Python 3. Therefore, the examples listed here will only work with Python 2.6 and 2.7.

Dealing with PDF files using PLATYPUS

The ReportLab framework provides an easy mechanism for dealing with PDF files. It provides a low-level interface, known as pdfgen, as well as a higher-level interface, known as PLATYPUS. PLATYPUS is an acronym, which stands for Page Layout and Typography Using Scripts. While the pdfgen framework is incredibly powerful, we'll focus on the PLATYPUS system here as it's slightly easier to deal with. We'll still use some of the lower-level primitives as we create and modify our PLATYPUS rendered styles.

The ReportLab Toolkit is not entirely Open Source. While the pieces we use here are indeed free to use, other portions of the library fall under a commercial license. We'll not be looking at any of those components here. For more information, see the ReportLab website, available at http://www.reportlab.com

Time for action – installing ReportLab

Like the other third-party packages we've installed thus far, the ReportLab Toolkit can be installed using SetupTools' easy_install command. Go ahead and do that now from your virtual environment. To conserve space, we've truncated the output; only the last lines are shown.

(text_processing)$ easy_install reportlab

[Screenshot: last lines of the easy_install output for reportlab]

What just happened?

The ReportLab package was downloaded and installed locally. Note that some platforms may require a C compiler in order to complete the installation process. To verify that the package has been installed correctly, let's simply display the version tag.

(text_processing)$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import reportlab
>>> reportlab.Version
'2.4'
>>>

Generating PDF documents

In order to build a PDF document using PLATYPUS, we'll arrange elements onto a document template via a flow. The flow is simply a Python list that contains our individual document components. When we finally ask the toolkit to generate our output file, it will merge all of our individual components together and produce a PDF.

Time for action – writing PDF with basic layout and style

In this example, we'll generate a PDF that contains a set of basic layout and style mechanisms. First, we'll create a cover page for our document. In a lot of situations, we want our first page to differ from the remainder of our output. We'll then use a different format for the remainder of our document.

Create a new Python file and name it pdf_build.py. Enter the following code exactly as it appears:

import sys
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.platypus import Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch

from reportlab.lib import colors

class PDFBuilder(object):
    HEIGHT = defaultPageSize[1]
    WIDTH = defaultPageSize[0]

    def _intro_style(self):
        """Introduction Specific Style"""
        style = getSampleStyleSheet()['Normal']
        style.fontName = 'Helvetica-Oblique'
        style.leftIndent = 64
        style.rightIndent = 64
        style.borderWidth = 1
        style.borderColor = colors.black
        style.borderPadding = 10
        return style

    def __init__(self, filename, title, intro):
        self._filename = filename
        self._title = title
        self._intro = intro
        self._style = getSampleStyleSheet()['Normal']
        self._style.fontName = 'Helvetica'

    def title_page(self, canvas, doc):
        """
        Write our title page.

        Generates the top page of the deck,
        using some special styling.
        """
        canvas.saveState()
        canvas.setFont('Helvetica-Bold', 18)
        canvas.drawCentredString(
            self.WIDTH/2.0, self.HEIGHT-180, self._title)
        canvas.setFont('Helvetica', 12)
        canvas.restoreState()

    def std_page(self, canvas, doc):
        """
        Write our standard pages.
        """
        canvas.saveState()
        canvas.setFont('Helvetica', 9)
        canvas.drawString(inch, 0.75*inch, "%d" % doc.page)
        canvas.restoreState()

    def create(self, content):
        """
        Creates a PDF.

        Saves the PDF named in self._filename.
        The content parameter is an iterable; each
        line is treated as a standard paragraph.
        """
        document = SimpleDocTemplate(self._filename)
        flow = [Spacer(1, 2*inch)]

        # Set our font and print the intro
        # paragraph on the first page.
        flow.append(
            Paragraph(self._intro, self._intro_style()))
        flow.append(PageBreak())

        # Additional content
        for para in content:
            flow.append(
                Paragraph(para, self._style))
            # Space between paragraphs.
            flow.append(Spacer(1, 0.2*inch))

        document.build(
            flow, onFirstPage=self.title_page,
            onLaterPages=self.std_page)

if __name__ == '__main__':
    if len(sys.argv) != 5:
        print "Usage: %s <output> <title> <intro file> <content file>" % \
            sys.argv[0]
        sys.exit(-1)

    # Do Stuff
    builder = PDFBuilder(
        sys.argv[1], sys.argv[2], open(sys.argv[3]).read())

    # Generate the rest of the content from a text file
    # containing our paragraphs.
    builder.create(open(sys.argv[4]))

Next, we'll create a text file that will contain the introductory paragraph. We've placed it in a separate file so it's easier to manipulate. Enter the following into a text file named intro.txt:
This is an example document that we've created from scratch; it has no story to tell. Its purpose? To serve as an example.

Now, we need to create our PDF content. Let's add one more text file and name it paragraphs.txt. Feel free to create your own content here. Each new line will start a new paragraph in the resulting PDF. Our test data is as follows:
This is the first paragraph in our document and it really serves no meaning other than example text.
This is the second paragraph in our document and it really serves no meaning other than example text.
This is the third paragraph in our document and it really serves no meaning other than example text.
This is the fourth paragraph in our document and it really serves no meaning other than example text.
This is the final paragraph in our document and it really serves no meaning other than example text.

Now, let's run the PDF generation script:

(text_processing)$ python pdf_build.py output.pdf "Example Document" intro.txt paragraphs.txt

If you view the generated document in a reader, the generated pages should resemble the following screenshots:

[Screenshot: title page of the generated PDF]

The preceding screenshot displays the clean title page, which we derive from the command-line arguments and the contents of the introduction file. The next screenshot contains document copy, which we also read from a file.

[Screenshot: a content page of the generated PDF]

What just happened?

We used the ReportLab Toolkit to generate a basic PDF. In the process, we created two different layouts: one for the initial page and one for subsequent pages. The first page serves as our title page. We printed the document title and a summary paragraph. The second (and third, and so on) pages simply contain text data.

At the top of our code, as always, we import the modules and classes that we'll need to run our script. We import SimpleDocTemplate, Paragraph, Spacer, and PageBreak from the reportlab.platypus module. The last three are items that will be added to our document flow; SimpleDocTemplate is the template the flow is rendered onto.

Next, we bring in getSampleStyleSheet. We use this function to generate a sample, or template, stylesheet that we can then change as we need. Stylesheets are used to provide appearance instructions to Paragraph objects here, much like they would be used in an HTML document.

The remaining imports bring in some page size defaults, the inch unit, and the colors module. We'll use these to better lay out our content on the page. Note that everything here outside of the first line is part of the more general-purpose portion of the toolkit.

The bulk of our work is handled in the PDFBuilder class we've defined. Here, we manage our styles and hide the PDF generation logic. The first thing we do here is assign the default document height and width to class variables named HEIGHT and WIDTH, respectively. This is done to make our code easier to work with and to make for easier inheritance down the road.

The _intro_style method is responsible for generating the paragraph style that we use for the introductory paragraph that appears in the box. First, we call getSampleStyleSheet to create a new sample stylesheet and take its 'Normal' style. Next, we simply change the attributes that we wish to modify from their defaults.

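Attribute        Value                Effect
fontName         Helvetica-Oblique    Oblique (italic) sans serif typeface
leftIndent       64                   Left indent, in points
rightIndent      64                   Right indent, in points
borderWidth      1                    Width of the box border, in points
borderColor      colors.black         Color of the box border
borderPadding    10                   Space between the text and the border, in points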

The values in the preceding table define the style used for the introductory paragraph, which is different from the standard style. Note that this is not an exhaustive list; this simply details the variables that we've changed.

Next we have our __init__ method. In addition to setting variables corresponding to the arguments passed, we also create a new stylesheet. This time, we simply change the font used to Helvetica (default is Times New Roman). This will be the style we use for default text.

The next two methods, title_page and std_page, define layout functions that are called when the PDF engine generates both the first and subsequent pages. Let's walk through the title_page method in order to understand what exactly is happening.

First, we save the current state of the canvas. This is a lower-level concept that is used throughout the ReportLab Toolkit. We then change the active font to a bold sans serif at 18 point. Next, we draw the title string centered horizontally, 180 points below the top of the page. Lastly, we restore our state as it was before the method was executed.
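
The position arithmetic in title_page can be checked without the toolkit. Canvas coordinates are measured in points (72 per inch) from the bottom-left corner of the page, so a y value is a distance up from the bottom edge. The sketch below assumes an A4 page, which is ReportLab's default page size unless configured otherwise; the helper name title_position and the rounded page dimensions are illustrative assumptions.

```python
POINTS_PER_INCH = 72.0

# Assumed page size: A4 expressed in points (rounded).
PAGE_WIDTH, PAGE_HEIGHT = 595.27, 841.89


def title_position(page_width, page_height, drop=180):
    """Return the (x, y) anchor for a horizontally centered title.

    x is the horizontal midpoint of the page. y is measured up from
    the bottom edge, so subtracting drop from the page height places
    the text baseline drop points below the top of the page.
    """
    return page_width / 2.0, page_height - drop


x, y = title_position(PAGE_WIDTH, PAGE_HEIGHT)
drop_in_inches = 180 / POINTS_PER_INCH  # 180 points is 2.5 inches
```

These are the same two expressions passed to drawCentredString in the script, with WIDTH and HEIGHT replaced by the assumed A4 values.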

If you take a quick look at std_page, you'll see that we're actually deciding how to write the page number. The library isn't taking care of that for us. However, it does help us out by giving us the current page number in the doc object.

Neither the std_page nor the title_page methods actually lay the text out. They're called when the pages are rendered to perform annotations. This means that they can do things such as write page numbers, draw logos, or insert callout information. The actual text formatting is done via the document flow.

The last method we define is create, which is responsible for driving title page creation and feeding the rest of our data into the toolkit. Here, we create a basic document template via SimpleDocTemplate. We'll flow all of our components onto this template as we define them.

Next, we create a list named flow that contains a Spacer instance. The Spacer ensures we do not begin writing at the top of the PDF document.

We then build a Paragraph containing our introductory text, using the style built in the self._intro_style method. We append the Paragraph object to our flow and then force a page break by also appending a PageBreak object.

Next, we iterate through all of the lines passed into the method as content. Each generates a new Paragraph object with our default style.
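
Because every line becomes a Paragraph, blank lines in the content file produce empty paragraphs. One possible refinement, sketched here with a hypothetical helper named iter_paragraphs that is not part of the original script, is to filter and strip the lines before they reach create.

```python
def iter_paragraphs(lines):
    """Yield one stripped paragraph per non-blank input line."""
    for line in lines:
        text = line.strip()
        if text:
            yield text


# Any iterable of strings works, including an open file object.
sample = ["First paragraph.\n", "\n", "  Second paragraph.  \n"]
paragraphs = list(iter_paragraphs(sample))
```

Since create accepts any iterable, the result of iter_paragraphs(open('paragraphs.txt')) could be passed to it in place of the bare file object.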

Finally, we call the build method of the document template object. We pass it our flow and two different methods to be called: one when building the first page and one when building subsequent pages.

Our __main__ section simply sets up calls to our PDFBuilder class and reads in our text files for processing.

The ReportLab Toolkit is very heavily documented and is quite easy to work with. For more information, see the documents available at http://www.reportlab.com/software/opensource/. There is also a code snippets library that contains some common PDF recipes.

Have a go hero – drawing a logo

The toolkit provides easy mechanisms for including graphics directly into a PDF document. JPEG images can be included without any additional library support. Using the documentation referenced earlier, alter our title_page method such that you include a logo image below the introductory paragraph.

Writing native Excel data

Here, we'll look at an advanced technique that allows us to write native Excel data (without requiring Microsoft Windows). To do this, we'll be using the xlwt package.

Time for action – installing xlwt

Again, like the other third-party modules we've installed thus far, xlwt can be downloaded and installed via the easy_install system. Activate your virtual environment and install it now. Your output should resemble the following:

(text_processing)$ easy_install xlwt

[Screenshot: easy_install output for xlwt]

What just happened?

We installed the xlwt package from the Python Package Index. To ensure your install worked correctly, start up Python and display the current version of the xlwt library.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import xlwt
>>> xlwt.__VERSION__
'0.7.2'
>>>

At the time of this writing, the xlwt module supports the generation of Excel xls format files, which are compatible with Excel 95 – 2003 (and later). MS Office 2007 and later use Office Open XML (OOXML) by default.
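
As a first taste of the package, here is a minimal sketch that writes a small xls workbook. It assumes xlwt is installed; the sheet name, the sample rows, and the output location (a temporary directory) are illustrative choices only.

```python
import os
import tempfile

import xlwt

# Sample data: a header row followed by two data rows.
rows = [
    ('name', 'count'),
    ('alpha', 1),
    ('beta', 2),
]

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('Example')

# Cells are addressed by zero-based row and column index.
for row_index, row in enumerate(rows):
    for col_index, value in enumerate(row):
        sheet.write(row_index, col_index, value)

# Save into a fresh temporary directory so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'example.xls')
workbook.save(path)
```

The saved file uses the binary xls format described above, so it should open in Excel as well as in other spreadsheet programs that read that format.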