Managing local files
In this section we will look at local file operations. We’ll cover common operations that all computer users will be familiar with—copying, deleting, moving, renaming, and archiving files. We’ll also look at some not-so-common techniques, such as timestamping files, checking for the existence of a file, and listing the files in a directory.
For our first file job, let’s look at a simple file copy process. We will create a job that looks in a specific directory for a file and copies it to another location.
Let’s do some setup first (we can use this for all of the file examples). In your project directory, create a new folder and name it FileManagement. Within this folder, create two more folders and name them Source and Target. In the Source directory, drop a simple text file and name it original.txt. Now let’s create our job:
- Create a new folder in Repository and name it Chapter6.
- Create a new job within the Chapter6 directory and name it FileCopy.
- In the Palette, search for copy. You should be able to locate a tFileCopy component. Drop this onto the Job Designer.
- Click on its Component tab. Set the File Name field to point to the original.txt file in the Source directory.
- Set the Destination directory field to point to the Target directory.
For now, let’s leave everything else unchanged. Click on the Run tab and then click on the Run button. The job should complete pretty quickly and, because we only have a single component, there are no data flows to observe. Check your Target folder and you will see the original.txt file in there, as expected. Note that the file still remains in the Source folder, as we were simply copying the file.
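For readers who like to see what a job amounts to outside the Studio, the copy can be sketched in plain Java. This is not the code that tFileCopy generates; the class name, the method name, the relative paths, and the choice to overwrite an existing copy are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class FileCopySketch {

    // Copies sourceFile into targetDir under the same file name and
    // returns the path of the copy. The source file is left in place.
    static Path copyInto(Path sourceFile, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path destination = targetDir.resolve(sourceFile.getFileName());
        // Assumption: an existing copy in the target is overwritten.
        return Files.copy(sourceFile, destination, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical relative paths mirroring the folders set up earlier.
        Path source = Paths.get("FileManagement", "Source", "original.txt");
        Path target = Paths.get("FileManagement", "Target");
        System.out.println("Copied to " + copyInto(source, target));
    }
}
```

As in the job, the original file is still in Source once the copy has finished.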
Copying and removing files
Our next example is a variant of our first file management job. Previously, we copied a file from one folder to another, but often you will want to move a file instead. To use an analogy from desktop operating systems and programs, we want to do a cut and paste rather than a copy and paste. Open the FileCopy job and follow the given steps:
- Remove the original.txt file from the Target directory, making sure it still exists in the Source directory.
- In the Basic settings tab of the tFileCopy component, select the checkbox for Remove source file.
- Now run the job. This time the original.txt file will be copied to the Target directory and then removed from the Source directory.
We can also use the tFileCopy component to rename files as we copy or move. Again, let’s work with the FileCopy job we have created previously. Reset your Source and Target directories so that the original.txt file only exists in Source.
- In the Basic settings tab, check the Rename checkbox. This will reveal a new parameter, Destination filename.
- Change the default value of the Destination filename parameter to modified_name.txt.
- Run the job. The original file will be copied to the Target directory and renamed. The original file will also be removed from the Source directory.
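The move and the move-with-rename variants above can be sketched the same way in plain Java. Again, this is an illustration rather than the Studio's generated code; the class, the method, and the convention of passing null to keep the original name are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class FileMoveSketch {

    // Moves sourceFile into targetDir. If newName is null the original
    // file name is kept; otherwise the file is renamed as it is moved.
    static Path moveInto(Path sourceFile, Path targetDir, String newName) throws IOException {
        Files.createDirectories(targetDir);
        String name = (newName != null) ? newName : sourceFile.getFileName().toString();
        return Files.move(sourceFile, targetDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical relative paths mirroring the folders set up earlier.
        Path source = Paths.get("FileManagement", "Source", "original.txt");
        Path target = Paths.get("FileManagement", "Target");
        System.out.println("Moved to " + moveInto(source, target, "modified_name.txt"));
    }
}
```

After the call, the file exists only in the target directory, which corresponds to checking both Remove source file and Rename in the component.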
It is really useful to be able to delete files, for example once they have been transformed or processed into other systems. Our integration jobs should “clean up afterwards”, rather than leaving lots of interim files cluttering up the directories. In this job example we’ll delete a file from a directory. This is a single-component job.
- Create a new job and name it FileDelete.
- In your workspace directory, FileManagement/Source, create a new text file and name it file-to-delete.txt.
- From the Palette, search for filedelete and drag a tFileDelete component onto the Job Designer.
- Click on its Component tab to configure it. Change the File Name parameter to be the path to the file you created earlier in step 2.
- Run the job. After it is complete, go to your Source directory and the file will no longer be there.
Note that the file does not get moved to the recycle bin on your computer, but is deleted immediately.
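A plain-Java sketch shows the same behavior. The standard Files.delete call also removes the file permanently rather than sending it to the recycle bin, and it throws NoSuchFileException when the file is not there. The class name, method name, and path below are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FileDeleteSketch {

    // Permanently deletes the given file. Files.delete throws
    // NoSuchFileException if the file does not exist.
    static void delete(Path file) throws IOException {
        Files.delete(file);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical relative path mirroring the file created above.
        delete(Paths.get("FileManagement", "Source", "file-to-delete.txt"));
        System.out.println("Deleted");
    }
}
```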
Timestamping a file
In real-life use, integration jobs, like any software, can fail or raise an error. Server issues, previously unencountered bugs, or a host of other things can cause a job to behave in an unexpected manner, and when this happens, manual intervention may be needed to investigate the issue or recover the job that failed. A useful trick to incorporate into your jobs is to save files once they have been consumed or processed, in case you need to reprocess them at some point or, indeed, just for investigation and debugging purposes should something go wrong. A common way to save files is to rename them using a date/timestamp. By doing this you can easily identify when files were processed by the job. Follow the given steps to achieve this:
- Create a new job and call it FileTimestamp.
- Create a file in the Source directory named timestamp.txt. The job is going to move this to the Target directory, adding a timestamp to the file name as it processes.
- From the Palette, search for filecopy and drop a tFileCopy component onto the Job Designer.
- Click on its Component tab and change the File Name parameter to point to the timestamp.txt file we created in the Source directory.
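- Change the Destination Directory to point to your Target directory.
- Check the Rename checkbox and change the Destination filename parameter to "timestamp"+TalendDate.getDate("yyyyMMddHHmmss")+".txt". Use straight double quotes, as this is a Java expression. The HH pattern gives the hour on a 24-hour clock; hh would give a 12-hour value, so a morning and an afternoon run could produce the same file name.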
The previous code snippet concatenates the fixed file name, “timestamp”, with the current date/time as generated by the Studio’s getDate function at runtime. The file extension “.txt” is added to the end too.
Run the job and you will see a new version of the original file drop into the Target directory, complete with timestamp. Run the job again and you will see another file in Target with a different timestamp applied.
Depending on your requirements you can configure different format timestamps. For example, if you are only going to be processing one file a day, you could dispense with the hours, minutes, and second elements of the timestamp and simply set the output format to “yyyyMMdd”. Alternatively, to make the timestamp more readable, you could separate its elements with hyphens—“yyyy-MM-dd”, for example.
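To experiment with patterns outside the Studio, a small plain-Java sketch is enough. It uses the standard SimpleDateFormat class directly instead of the Studio's TalendDate routine; the class name, method name, and use of Locale.ROOT are assumptions for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class TimestampNameSketch {

    // Builds baseName + formatted date + extension.
    // Locale.ROOT keeps the digits independent of the machine's locale.
    static String timestampedName(String baseName, String pattern, Date when, String extension) {
        return baseName + new SimpleDateFormat(pattern, Locale.ROOT).format(when) + extension;
    }

    public static void main(String[] args) {
        Date now = new Date();
        // Full timestamp with a 24-hour clock (HH); hh would be 12-hour.
        System.out.println(timestampedName("timestamp", "yyyyMMddHHmmss", now, ".txt"));
        // Date only, for jobs that process one file a day.
        System.out.println(timestampedName("timestamp", "yyyyMMdd", now, ".txt"));
        // Date only, with hyphens for readability.
        System.out.println(timestampedName("timestamp", "yyyy-MM-dd", now, ".txt"));
    }
}
```

Passing the date in as a parameter, rather than reading the clock inside the method, makes it easy to check what a given pattern produces for a known moment.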
You can find more information about Java date formats at http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html.
Listing files in a directory
Our next example job will show how to list all of the files (or all the files matching a specific naming pattern) in a directory. Where might we use such a process? Suppose our target system had a data “drop-off” directory, where all integration files from multiple sources were placed before being picked up to be processed. As an example, this drop-off directory might contain four product catalogue XML files, three CSV files containing inventory data, and 50 order XML files detailing what had been ordered by the customers. We might want to build a catalogue import process that picks up the four catalogue files, processes them by mapping to a different format, and then moves them to the catalogue import directory. The nature of the processing means we have to deal with each file individually, but we want a single execution of the process to pick up all available files at that point in time. This is where our file listing process comes in very handy and, as you might expect, the Studio has a component to help us with this task. Follow the given steps:
- Let’s start by preparing the directory and files we want to list. Copy the FileList directory from the resource files to the FileManagement directory we created earlier. The FileList directory contains six XML files.
- Create a new job and name it FileList.
- Search for Filelist in the Palette and drop a tFileList component onto the Job Designer.
- Additionally, search for logrow and drop a tLogRow component onto the designer too.
We will use the tFileList component to read all of the filenames in the directory and pass this through to the tLogRow component. In order to do this, we need to connect the tFileList and tLogRow. The tFileList component works in an iterative manner—it reads each filename and passes it onwards before getting the next filename. Its connector type is Iterative, rather than the more common Main connector. However, we cannot connect an iterative component to the tLogRow component, so we need to introduce another component that will act as an intermediary between the two.
- Search for iteratetoflow in the Palette and drop a tIterateToFlow component onto the Job Designer. This bridges the gap between an iterate component and a flow component.
- Click on the tFileList component and then click on its Component tab. Change the Directory value so that it points to the FileList directory we set up in step 1.
- Click on the + button to add a new row to the Files section. Change the Filemask value to "*.xml". This configures the component to search for any files with an XML extension.
- Right-click on the tFileList component, select Row | Iterate, and drop the resulting connector onto the tIterateToFlow component.
- The tIterateToFlow component requires a schema and, as the tFileList component does not have a schema, it cannot propagate one to the iterate-to-flow component when we join them. Instead we will have to create the schema directly. Click on the tIterateToFlow component and then on its Component tab. Click on the Edit schema button and, in the pop-up schema editor, click on the + button to add a row and then rename the column value to filename. Click on OK to close the window.
- A new row will be added to the Mapping table. We need to edit its value, so click in the Value column, delete the setting that exists, and press Ctrl + Space to access the global variables list.
- Scroll through the global variable drop-down list and select tFileList_1_CURRENT_FILE. This will add the required parameter to the Value column.
- Right-click on the tIterateToFlow component, select Row | Main, and connect this to the tLogRow component.
Let’s run the job. It may run too quickly to be visible to the human eye, but the tFileList component will read the name of the first file it finds, pass this forward to the tIterateToFlow component, go back and read the second file, and so on. As the iterate-to-flow component receives its data, it will pass this on to tLogRow as row data. You will see the names of the six XML files printed by the tLogRow component in the console.
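The iterate-then-print pattern of this job can be sketched in plain Java with a directory stream and a glob mask. This is only an analogy for what tFileList, tIterateToFlow, and tLogRow do together; the class name, method name, and relative path are assumptions, and the sorting step is added here because directory iteration order is not guaranteed.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FileListSketch {

    // Returns the names of the regular files in dir that match the glob
    // mask (for example "*.xml"), sorted alphabetically.
    static List<String> listNames(Path dir, String mask) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, mask)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // Name only, comparable to the current-file variable.
                    names.add(file.getFileName().toString());
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical relative path mirroring the FileList directory.
        for (String name : listNames(Paths.get("FileManagement", "FileList"), "*.xml")) {
            System.out.println(name);
        }
    }
}
```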
Now that we have cracked the basics of the file list component, let’s extend the example to a real-life situation. Let’s suppose we have a number of text files in our input directory, all conforming to the same schema. In the resources directory, you will find five files named fileconcat1.txt, fileconcat2.txt, and so on. Each of these has a “random” number of rows. Copy these files into the Source directory of your workspace. The aim of our job is to pick up each file in turn and write its output to a new file, thereby concatenating all of the original files. Let’s see how we do this:
- Create a new job and name it FileConcat.
- For this job we will need a file list component (tFileList), a delimited file input component (tFileInputDelimited), and a delimited file output component (tFileOutputDelimited). As we will see in a minute, the delimited input component will be a “placeholder” for each of the input files in turn.
- Find the components in the Palette and drop them onto the Job Designer.
- Click on the file list component and change its Directory value to point to the Source directory.
- In the Files box, add a row and change the Filemask value to “*.txt”.
- Right-click on the file list component and select Row | Iterate. Drop the connector onto the delimited input component.
- Select the delimited input component and edit its schema so that it has a single field, rowdata, of data type String.
- We need to modify the File name/Stream value, but in this case it is not a fixed file we are looking for but a different file with each iteration of the file list component. The Studio gives us an easy way to add such variables into the component definitions. First, though, click on the File name/Stream box and clear the default value.
- In the bottom-left corner of the Studio you should see a window named Outline. If you cannot see the Outline window, select Window | Show View from the menu bar and type outline into the pop-up search box. You will see the Outline view in the search results. Double-click on this to open it.
- Now that we can see the Outline window, expand the tFileList item to see the variables available in it. The variables differ depending on the component selected. In the case of a file list component, the variables are mostly attributes of the current file being processed. We are interested in the filename for each iteration, so click on the variable Current File Name with path and drag it to the File name/Stream box in the Component tab of the delimited input component.
- You can see that the Studio completes the parameter value with a globalMap variable—in this case, tFileList_1_CURRENT_FILEPATH, which denotes the current filename and its directory path.
- Now right-click on the delimited input, select Row | Main, and drop the connector onto the delimited output.
- Change the File Name of the delimited output component to fileconcatout.txt in our Target directory and check the Append checkbox, so that the Studio adds the data from each iteration to the bottom of the output file. If Append is not checked, the Studio will overwrite the file on each iteration and all that will be left is the data from the final iteration.
Run the job and check the output file in the target directory. You will see a single file with the contents of the five original files in it. Note that the Studio shows the number of iterations of the file list component that have been executed, but does not show the number of lines written to the output file, as we are used to seeing in non-iterative jobs.
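The role of the Append option is easy to see in a plain-Java sketch of the same idea: each input file is read in turn and its lines are appended to a single output file. This is an illustration, not the Studio's generated code; the class name, method name, paths, UTF-8 encoding, and alphabetical processing order are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FileConcatSketch {

    // Appends the lines of every file in sourceDir that matches mask to
    // outputFile, one input file at a time, and returns the number of
    // lines written. The output file should live outside sourceDir so
    // that it is not picked up as an input.
    static int concatenate(Path sourceDir, String mask, Path outputFile) throws IOException {
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sourceDir, mask)) {
            for (Path file : files) {
                inputs.add(file);
            }
        }
        Collections.sort(inputs); // assumption: process files in name order
        if (outputFile.getParent() != null) {
            Files.createDirectories(outputFile.getParent());
        }
        int written = 0;
        for (Path input : inputs) {
            List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
            // APPEND adds to the end of the file; without it, each
            // iteration would overwrite the previous one.
            Files.write(outputFile, lines, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            written += lines.size();
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical relative paths mirroring the Source and Target folders.
        Path source = Paths.get("FileManagement", "Source");
        Path output = Paths.get("FileManagement", "Target", "fileconcatout.txt");
        System.out.println(concatenate(source, "*.txt", output) + " lines written");
    }
}
```

Because the sketch appends, running it twice against the same output file doubles the content, so delete the output file between test runs.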
Checking for files
Let’s look at how we can check for the existence of a file before we undertake an operation on it. Perhaps the first question is “Why do we need to check if a file exists?”
To illustrate why, open the FileDelete job that we created earlier. If you look at its component configuration, you will see that it will delete a file named file-to-delete.txt in the Source directory. Go to this directory using your computer’s file explorer and delete this file manually. Now try to run the FileDelete job. You will get an error when the job executes.
The assumption behind a delete component (or a copy, rename, or other file operation process) is that the file does, in fact, exist and so the component can do its work. When the Studio finds that the file does not exist, an error is produced. Obviously, such an error is not desirable. In this particular case nothing too untoward happens—the job simply errors and exits—but it is better if we can avoid unnecessary errors.
What we should really do here is check if the file exists and, if it does, then delete it. If it does not exist, then the delete command should not be invoked. Let’s see how we can put this logic together:
- Create a new job and name it FileExist.
- Search for fileexist in the Palette and drop a tFileExist component onto the Job Designer. Then search for filedelete and place a tFileDelete component onto the designer too.
- In our Source directory, create a file named file-exist.txt and configure File Name of the tFileDelete component to point to this.
- Now click on the tFileExist component and set its File name/Stream parameter to be the same file in the Source directory.
- Right-click on the tFileExist component, select Trigger | Run if, and drop the connector onto the tFileDelete component. The connecting line between the two components is labeled If.
When our job runs the first component will execute, but the second component, tFileDelete, will only run if some conditions are satisfied. We need to configure the if conditions.
- Click on If and, in the Component tab, a Condition box will appear.
- In the Outline window (in the bottom-left corner of the Studio), expand the tFileExist component. You will see three attributes there. The Exists attribute is highlighted in red in the following screenshot:
- Click on the Exists attribute and drag it into the Condition box of the Component tab.
- As before, a global-map variable is written to the configuration.
- The logic of our job is as follows:
i. Run the tFileExist component.
ii. If the file named in tFileExist actually exists, run the tFileDelete component.
Note that if the file does not exist, the job will exit.
We can check if the job works as expected by running it twice. The file we want to delete is in the Source directory, so we would expect both components to run on the first execution (and for the file to be deleted). When the if condition is evaluated, the result will show in the Job Designer view. In this case, the if condition was true—the file did exist.
Now try to run the job again. We know that the file we are checking for does not exist, as it was deleted on the last execution.
This time, the if condition evaluates to false, and the delete component does not get invoked. You can also see in the console window that the Studio did not log any errors. Much better!
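The check-then-delete logic of the job maps onto a very small plain-Java sketch. The class name, method name, and path are assumptions for illustration; the standard library also offers Files.deleteIfExists, which combines the two steps, but the explicit version below mirrors the two components of the job.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FileExistSketch {

    // Deletes the file only if it exists. Returns true when a delete
    // happened and false when there was nothing to delete.
    static boolean deleteIfPresent(Path file) throws IOException {
        if (Files.exists(file)) {   // the existence check
            Files.delete(file);     // the delete, run only if the check passed
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical relative path mirroring the file created above.
        Path file = Paths.get("FileManagement", "Source", "file-exist.txt");
        System.out.println("First run deleted a file: " + deleteIfPresent(file));
        System.out.println("Second run deleted a file: " + deleteIfPresent(file));
    }
}
```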
Sometimes we may want to verify that a file does not exist before we invoke another component. We can achieve this in a similar way to checking for the existence of a file, as shown earlier. Drag the Exists variable into the Condition box and prefix the statement with !, the Java operator for “not”.
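The two conditions can be tried out in a small standalone sketch. It assumes that the component keeps its default name, tFileExist_1, and that dragging Exists produces a Boolean lookup under the key tFileExist_1_EXISTS; the HashMap below only stands in for the Studio's globalMap, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class RunIfConditionSketch {

    // Condition for "run the next component if the file exists".
    static boolean fileExists(Map<String, Object> globalMap) {
        return ((Boolean) globalMap.get("tFileExist_1_EXISTS"));
    }

    // Condition for "run the next component if the file does NOT exist":
    // the same expression, prefixed with the Java "not" operator.
    static boolean fileMissing(Map<String, Object> globalMap) {
        return !((Boolean) globalMap.get("tFileExist_1_EXISTS"));
    }

    public static void main(String[] args) {
        // Stand-in for the Studio's globalMap (an assumption for this sketch).
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileExist_1_EXISTS", Boolean.FALSE);
        System.out.println("exists condition:  " + fileExists(globalMap));
        System.out.println("missing condition: " + fileMissing(globalMap));
    }
}
```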