The following text (SomeText.txt) file saved at a suitable location on the hard drive is used. This particular text is:
Rose is Red
Chrysanthemum is yellow
Violets are violet
Rose can be Pink
Hyacinth is white
Desk jobs are the best
Lily is also white
The girl is wearing a rose garland
The boy is handsome
Pink rose is not red
rose garland is made of roses
Roserose is not rose
He rose to power
The desk is made of rose wood
The reason for using the above text is to see how well the Term Extract transformation is able to distinguish words and phrases and find how often they are found in a body of text. The transformation works for text in English and is capable of distinguishing between nouns and other parts of speech.
In the following steps we will create a Visual Studio 2005 Business Intelligence project and access some text stored on the hard drive, and apply this transformation and review the results.
Creating a Business Intelligence Project
In the Visual Studio 2005 IDE, File | New | Project opens the New Project window as shown, where you can highlight the Integration Services Project template in the Business Intelligence page and change its default name to something different. For this tutorial TermExtract has been used as the project name.
Text to be accessed shown above is saved to a file, SomeText.txt in the C: drive. In order to access this from the Integration Services we need to create a Package with a data flow task. The source for this data is the SomeText.txt file on the C: drive.
Change the name of the default package file name to something different, In this case MineText.dtsx. Click Yes on the Microsoft Visual Studio message box asking whether you want to rename the package.
Add a Data Flow Task
Drag and drop a Data Flow Task to the Control Flow page as shown in the next figure. The Data Flow Task will access the SomeText.txt using a connection manager, an intermediary between SQL Server Integration Services and the external system.
Add a Flat File Source
Click on the Data Flow Task page. Drag and drop a Flat File Source from the Data Flow Sources group in the Toolbox and drop it on the Data Flow Page which is open as shown.
When the Flat File Source is dropped on the Data Flow Task page you may see this error in the error window as shown. This is nothing to worry about because a connection is not yet established.
Add a Connection Manager to Manage Flat File Source
Now right click in the Connection Manager’s pane as shown to display the pick list of connection managers and choose New Flat File Connection… as shown.
This immediately displays the Flat File Connection Manager‘s editor window as shown. You must provide a name of your choice to the Connection Manager, and a description of your choice. Then you need to use the Browse button to locate the SomeText.txt file on your hard drive. The next figure shows the editor after these choices are made. The rest of the fields such as Locale, Code page, etc were automatically chosen by the program.
Now click on the Columns list item in the left of the Editor. The one column that gets populated with the data from the SomeText.txt gets displayed. The program has correctly configured the fields for this text.
Click on the OK button on the Editor. This adds a Connection Manager, MyText to the Connection manager’s pane in the SSIS designer. With this, the SomeText.txt is available for the other controls that you may add.
Add a Term Extraction Transformation
The column that was populated in the above will now pass through the Term Extraction Transformation added by dragging and dropping this from the Toolbox on to the Data Flow Page.
Click the dangling green line and extend it to touch the Term Extraction Transformation. This is an easy way to establish a connection from the source to a transformation, a destination.
Double click the Term Extraction Transformation to open its Editor as shown in the next figure. In the Term Extraction tabbed page you see a single column which is displayed unchecked. Place a check mark for this column as shown in the next figure.
When the ‘terms’ are extracted, the output column will have a ‘term’ and a ‘score’ column. The term refers to a noun, a noun phrase, or a noun and a noun phrase. The score represents how many times each term is repeated in the body of the text. Pay attention to the message that says the column can have only values of a certain types and the disabled OK button.
The data type of the data going into the Term Extract Transformation can be found by right clicking on the connecting green line and looking at the page that reveals the Meta data list item as shown in the next figure. This is of the data type DT_STR.
To rectify this, there are two options, either use one more transformation, the data conversion transformation or use the Advanced Editor of the Flat File Source which can be displayed by right clicking the Flat File source component and choosing the Show Advanced Editor. This option was made to change Str[DT_STR] to Unicode str [DT_WSTR].
The DT_* shows the data type that are supported. The following information about these data types are shown extracted from the Books on line.
- DT_STR: A null-terminated ANSI/MBCS character string
- DT_NTEXT: A Unicode character string with a maximum length of 2^30-1 characters
- DT_WSTR: a null terminated Unicode character string
Now when you place a check mark for the Column 0 in the Term Extract Transformation Editor, the OK gets enabled.
Click on the Exclusion tab to reveal its page. This page when configured, allows you to exclude (skip) certain terms stored in an OLEDB database. The figure shows the details of editing this page. A Microsoft Access 2003 database called ‘SkipTerms’ was created and a new table ‘SkipTable’ was created in this database. It has two columns SkipID (autonumber, Primary Key) and SkipThis (text). A new OLEDB Connection was established along the same lines as the connection manager to the Flat File Source. Of course you need to choose an OLEDB Provider in making this connection. The ‘SkipThis’ column has just one entry, ‘desk’. This noun is found twice in SomeText.txt. The word ‘desk’ will be skipped in the output column when the Column 0 is processed by this transformation.
Click on the Advanced tab to open its page as shown. This is where you choose type of terms, nouns, noun phrases, or both noun and noun phrases. You also select the score which shows how many times (Frequency and Frequency Threshold) the terms appear in the text. As chosen here, the transformation will be looking for noun(s) that gets repeated twice. The case sensitive option can also be chosen but left blank in this exercise. The score type TFIDF is another type of scoring more appropriate for a document collection and not a single document like in this article. You may learn more details on this from this link.