In this section we will look at local file operations. We’ll cover common operations that all computer users will be familiar with—copying, deleting, moving, renaming, and archiving files. We’ll also look at some not-so-common techniques, such as timestamping files, checking for the existence of a file, and listing the files in a directory.
For our first file job, let’s look at a simple file copy process. We will create a job that looks in a specific directory for a file and copies it to another location.
Let’s do some setup first (we can use this for all of the file examples). In your project directory, create a new folder and name it FileManagement. Within this folder, create two more folders and name them Source and Target. In the Source directory, drop a simple text file and name it original.txt. Now let’s create our job:
Set the Destination directory field to direct to the Target directory.
For now, let’s leave everything else unchanged. Click on the Run tab and then click on the Run button. The job should complete pretty quickly and, because we only have a single component, there are no data flows to observe. Check your Target folder and you will see the original.txt file in there, as expected. Note that the file still remains in the Source folder, as we were simply copying the file.
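Under the hood, a copy like this comes down to a single filesystem operation. As a rough illustration in plain Java (a sketch of the idea, not the code the Studio actually generates), the directory names are stand-ins for the Source and Target folders created above:

```java
import java.io.IOException;
import java.nio.file.*;

public class FileCopySketch {
    public static void main(String[] args) throws IOException {
        // Stand-in paths for the Source and Target directories from the example
        Path source = Paths.get("Source");
        Path target = Paths.get("Target");
        Files.createDirectories(source);
        Files.createDirectories(target);
        Files.write(source.resolve("original.txt"), "sample content".getBytes());

        // Copy the file; as with tFileCopy's default behaviour, the original stays put
        Files.copy(source.resolve("original.txt"),
                   target.resolve("original.txt"),
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```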
Our next example is a variant of our first file management job. Previously, we copied a file from one folder to another, but often you will want to affect a file move. To use an analogy from desktop operating systems and programs, we want to do a cut and paste rather than a copy and paste. Open the FileCopy job and follow the given steps:
In the Basic settings tab of the tFileCopy component, select the checkbox for Remove source file.
We can also use the tFileCopy component to rename files as we copy or move. Again, let’s work with the FileCopy job we have created previously. Reset your Source and Target directories so that the original.txt file only exists in Source.
Change the default value of the Destination filename parameter to modified_name.txt.
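Ticking Remove source file and changing the Destination filename together amount to a move-and-rename. In plain Java terms (again a sketch, not the Studio's generated code, with stand-in directory names), both effects come from a single `Files.move` call:

```java
import java.io.IOException;
import java.nio.file.*;

public class FileMoveRenameSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in paths for the Source and Target directories
        Path source = Paths.get("Source");
        Path target = Paths.get("Target");
        Files.createDirectories(source);
        Files.createDirectories(target);
        Files.write(source.resolve("original.txt"), "sample content".getBytes());

        // Move and rename in one step: the source file is removed,
        // mirroring tFileCopy with "Remove source file" ticked
        Files.move(source.resolve("original.txt"),
                   target.resolve("modified_name.txt"),
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```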
It is often useful to delete files, for example, once they have been transformed or processed into other systems. Our integration jobs should clean up after themselves, rather than leaving lots of interim files cluttering up the directories. In this job example we’ll delete a file from a directory. This is a single-component job.
In your workspace directory, FileManagement/Source, create a new text file and name it file-to-delete.txt.
Click on its Component tab to configure it. Change the File Name parameter to be the path to the file you created in the previous step.
Note that the file does not get moved to the recycle bin on your computer, but is deleted immediately.
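The plain-Java equivalent behaves the same way: `Files.delete` removes the file immediately, with no trip to the recycle bin. A minimal sketch, with the path a stand-in for the file created above:

```java
import java.io.IOException;
import java.nio.file.*;

public class FileDeleteSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for FileManagement/Source/file-to-delete.txt
        Path dir = Paths.get("Source");
        Files.createDirectories(dir);
        Path victim = dir.resolve("file-to-delete.txt");
        Files.write(victim, "temporary".getBytes());

        // Deletes immediately; there is no recycle bin involved
        Files.delete(victim);
    }
}
```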
Sometimes in real life use, integration jobs, like any software, can fail or give an error. Server issues, previously unencountered bugs, or a host of other things can cause a job to behave in an unexpected manner, and when this happens, manual intervention may be needed to investigate the issue or recover the job that failed. A useful trick to try to incorporate into your jobs is to save files once they have been consumed or processed, in case you need to re-process them again at some point or, indeed, just for investigation and debugging purposes should something go wrong. A common way to save files is to rename them using a date/timestamp. By doing this you can easily identify when files were processed by the job. Follow the given steps to achieve this:
Create a file in the Source directory named timestamp.txt. The job is going to move this to the Target directory, adding a timestamp to the file as it processes.
The previous code snippet concatenates the fixed file name, “timestamp”, with the current date/time as generated by the Studio’s getDate function at runtime. The file extension “.txt” is added to the end too.
Run the job and you will see a new version of the original file drop into the Target directory, complete with timestamp. Run the job again and you will see another file in Target with a different timestamp applied.
Depending on your requirements you can configure different format timestamps. For example, if you are only going to be processing one file a day, you could dispense with the hours, minutes, and second elements of the timestamp and simply set the output format to “yyyyMMdd”. Alternatively, to make the timestamp more readable, you could separate its elements with hyphens—“yyyy-MM-dd”, for example.
You can find more information about Java date formats at http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html.
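To see what such timestamped names look like in plain Java (a sketch of the idea, not the Studio's getDate routine itself), SimpleDateFormat covers all of the format variations mentioned above:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampNameSketch {
    public static void main(String[] args) {
        Date now = new Date();

        // Full timestamp down to seconds, e.g. timestamp20240131143059.txt
        String full = "timestamp" + new SimpleDateFormat("yyyyMMddHHmmss").format(now) + ".txt";

        // Date-only variant, enough when only one file is processed per day
        String daily = "timestamp" + new SimpleDateFormat("yyyyMMdd").format(now) + ".txt";

        // More readable, hyphen-separated variant
        String readable = "timestamp" + new SimpleDateFormat("yyyy-MM-dd").format(now) + ".txt";

        System.out.println(full);
        System.out.println(daily);
        System.out.println(readable);
    }
}
```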
Our next example job will show how to list all of the files (or all the files matching a specific naming pattern) in a directory. Where might we use such a process? Suppose our target system had a data “drop-off” directory, where all integration files from multiple sources were placed before being picked up to be processed. As an example, this drop-off directory might contain four product catalogue XML files, three CSV files containing inventory data, and 50 order XML files detailing what had been ordered by the customers. We might want to build a catalogue import process that picks up the four catalogue files, processes them by mapping to a different format, and then moves them to the catalogue import directory. The nature of the processing means we have to deal with each file individually, but we want a single execution of the process to pick up all available files at that point in time. This is where our file listing process comes in very handy and, as you might expect, the Studio has a component to help us with this task. Follow the given steps:
Let’s start by preparing the directory and files we want to list. Copy the FileList directory from the resource files to the FileManagement directory we created earlier. The FileList directory contains six XML files.
Create a new job and name it FileList.
Search for Filelist in the Palette and drop a tFileList component onto the Job Designer.
Additionally, search for logrow and drop a tLogRow component onto the designer too.
We will use the tFileList component to read all of the filenames in the directory and pass this through to the tLogRow component. In order to do this, we need to connect the tFileList and tLogRow. The tFileList component works in an iterative manner—it reads each filename and passes it onwards before getting the next filename. Its connector type is Iterative, rather than the more common Main connector. However, we cannot connect an iterative component to the tLogRow component, so we need to introduce another component that will act as an intermediary between the two.
Search for iteratetoflow in the Palette and drop a tIterateToFlow component onto the Job Designer. This bridges the gap between an iterate component and a flow component.
Click on the tFileList component and then click on its Component tab. Change the directory value so that it points to the FileList directory we created in step 1.
Click on the + button to add a new row to the File section. Change the value to “*.xml”. This configures the component to search for any files with an XML extension.
Right-click on the tFileList component, select Row | Iterate, and drop the resulting connector onto the tIterateToFlow component.
The tIterateToFlow component requires a schema and, as the tFileList component does not have a schema, it cannot propagate this to the iterate-to-flow component when we join them. Instead we will have to create the schema directly. Click on the tIterateToFlow component and then on its Component tab. Click on the Edit schema button and, in the pop-up schema editor, click on the + button to add a row and then rename the column value to filename. Click on OK to close the window.
A new row will be added to the Mapping table. We need to edit its value, so click in the Value column, delete the setting that exists, and press Ctrl + space bar to access the global variables list.
Scroll through the global variable drop-down list and select “tFileList_1_CURRENT_FILE”. This will add the required parameter to the Value column.
Right-click on the tIterateToFlow component, select Row | Main, and connect this to the tLogRow component.
Let’s run the job. It may run too quickly to be visible to the human eye, but the tFileList component will read the name of the first file it finds, pass this forward to the tIterateToFlow component, go back and read the second file, and so on. As the iterate-to-flow component receives its data, it will pass this onto tLogRow as row data. You will see the following output in the tLogRow component:
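Outside the Studio, the same pattern-based listing can be sketched in a few lines of Java. `Files.newDirectoryStream` accepts a glob such as `*.xml`, much like the file mask we set on tFileList (the directory name and files here are stand-ins for the prepared FileList directory):

```java
import java.io.IOException;
import java.nio.file.*;

public class FileListSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for the FileList directory prepared earlier
        Path dir = Paths.get("FileList");
        Files.createDirectories(dir);
        Files.write(dir.resolve("catalogue1.xml"), new byte[0]);
        Files.write(dir.resolve("catalogue2.xml"), new byte[0]);
        Files.write(dir.resolve("notes.txt"), new byte[0]);

        // Iterate over files matching the *.xml mask, one name at a time,
        // much as tFileList feeds names onwards to tIterateToFlow
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, "*.xml")) {
            for (Path file : matches) {
                System.out.println(file.getFileName());
            }
        }
    }
}
```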
Now that we have cracked the basics of the file list component, let’s extend the example to a real-life situation. Let’s suppose we have a number of text files in our input directory, all conforming to the same schema. In the resources directory, you will find five files named fileconcat1.txt, fileconcat2.txt, and so on. Each of these has a “random” number of rows. Copy these files into the Source directory of your workspace. The aim of our job is to pick up each file in turn and write its output to a new file, thereby concatenating all of the original files. Let’s see how we do this:
Find the components in the Palette and drop them onto the Job Designer.
Right-click on the file list component and select Row | Iterate. Drop the connector onto the delimited input component.
In the bottom-left corner of the Studio you should see a window named Outline. If you cannot see the Outline window, select Window | Show View from the menu bar and type outline into the pop-up search box. You will see the Outline view in the search results—double click on this to open it.
Now that we can see the Outline window, expand the tFileList item to see the variables available in it. The variables are different depending upon the component selected. In the case of a file list component, the variables are mostly attributes of the current file being processed. We are interested in the filename for each iteration, so click on the variable Current File Name with path and drag it to the File name/Stream box in the Component tab of the delimited input component.
Run the job and check the output file in the target directory. You will see a single file with the contents of the five original files in it. Note that the Studio shows the number of iterations of the file list component that have been executed, but does not show the number of lines written to the output file, as we are used to seeing in non-iterative jobs.
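The concatenation itself can be sketched in plain Java: loop over each matching file in turn and append its contents to a single output file. The directory and file names below are stand-ins for the example's Source and Target, and the code is an illustration of the pattern rather than the Studio's generated job:

```java
import java.io.IOException;
import java.nio.file.*;

public class FileConcatSketch {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("Source");
        Path target = Paths.get("Target");
        Files.createDirectories(source);
        Files.createDirectories(target);

        // Create five small input files, mirroring fileconcat1.txt .. fileconcat5.txt
        for (int i = 1; i <= 5; i++) {
            Files.write(source.resolve("fileconcat" + i + ".txt"),
                        ("row from file " + i + System.lineSeparator()).getBytes());
        }

        Path output = target.resolve("concatenated.txt");
        Files.deleteIfExists(output);

        // Append each input file's contents to the single output file,
        // as the iterate loop does with the delimited input and output components
        try (DirectoryStream<Path> inputs = Files.newDirectoryStream(source, "fileconcat*.txt")) {
            for (Path in : inputs) {
                Files.write(output, Files.readAllBytes(in),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }
}
```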
Let’s look at how we can check for the existence of a file before we undertake an operation on it. Perhaps the first question is “Why do we need to check if a file exists?”
To illustrate why, open the FileDelete job that we created earlier. If you look at its component configuration, you will see that it will delete a file named file-to-delete.txt in the Source directory. Go to this directory using your computer’s file explorer and delete this file manually. Now try to run the FileDelete job. You will get an error when the job executes:
The assumption behind a delete component (or a copy, rename, or other file operation process) is that the file does, in fact, exist and so the component can do its work. When the Studio finds that the file does not exist, an error is produced. Obviously, such an error is not desirable. In this particular case nothing too untoward happens—the job simply errors and exits—but it is better if we can avoid unnecessary errors.
What we should really do here is check if the file exists and, if it does, then delete it. If it does not exist, then the delete command should not be invoked. Let’s see how we can put this logic together.
In our Source directory, create a file named file-exist.txt and configure File Name of the tFileDelete component to point to this.
Now click on the tFileExist component and set its File name/Stream parameter to be the same file in the Source directory.
Right-click on the tFileExist component, select Trigger | Run if, and drop the connector onto the tFileDelete component. The connecting line between the two components is labeled If.
When our job runs the first component will execute, but the second component, tFileDelete, will only run if some conditions are satisfied. We need to configure the if conditions.
In the Outline window (in the bottom-left corner of the Studio), expand the tFileExist component. You will see three attributes there. The Exists attribute is highlighted in red in the following screenshot:
Click on the Exists attribute and drag it into the Conditions box of the Component tab.
i. Run the tFileExist component.
ii. If the file named in tFileExist actually exists, run the tFileDelete component.
Note that if the file does not exist, the job will exit.
We can check if the job works as expected by running it twice. The file we want to delete is in the Source directory, so we would expect both components to run on the first execution (and for the file to be deleted). When the if condition is evaluated, the result will show in the Job Designer view. In this case, the if condition was true—the file did exist.
Now try to run the job again. We know that the file we are checking for does not exist, as it was deleted on the last execution.
This time, the if condition evaluates to false, and the delete component does not get invoked. You can also see in the console window that the Studio did not log any errors. Much better!
Sometimes we may want to verify that a file does not exist before we invoke another component. We can achieve this in a similar way to checking for the existence of a file, as shown earlier. Drag the Exists variable into the Conditions box and prefix the statement with !—the Java operator for “not”:
!((Boolean)globalMap.get("tFileExist_1_EXISTS"))
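The whole check-then-delete pattern reduces to a simple conditional in plain Java. This is a sketch of the logic only; the Studio wires it up with the Run if trigger rather than generating exactly this code, and the file path is a stand-in:

```java
import java.io.IOException;
import java.nio.file.*;

public class FileExistDeleteSketch {

    // Delete the file only if it exists; returns whether a delete happened
    static boolean deleteIfPresent(Path file) throws IOException {
        if (Files.exists(file)) {   // the tFileExist check
            Files.delete(file);     // the tFileDelete step
            return true;            // the "if" condition evaluated to true
        }
        return false;               // condition false: delete not invoked, no error raised
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("Source");
        Files.createDirectories(dir);
        Path file = dir.resolve("file-exist.txt");
        Files.write(file, "temporary".getBytes());

        System.out.println(deleteIfPresent(file)); // first run: file existed and was deleted
        System.out.println(deleteIfPresent(file)); // second run: file gone, delete skipped
    }
}
```

Note that `java.nio` also offers `Files.deleteIfExists`, which performs the check and the delete in a single call.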