(For more resources on Microsoft Platform, see here.)
Why is it better to shovel one ton of data using two thousand, one pound shovels instead of one big load from a huge power shovel? After all, large commercial databases and the attendant bulk loader or SQL Loader programs are designed to do just that: insert huge loads of data in a single shot.
The bulk load approach works under certain tightly constrained circumstances. They are as follows:
- The “bulk” data comes to you already matching the table structure of the destination system. Of course, this may mean that it was debatched before it gets to your system.
- The destination system can accept some, potentially significant, error rate when individual rows fail to load.
- There are no updates or deletes, just inserts.
- Your destination system can handle bulk loads. Certain systems (for example, some legacy medical systems or other proprietary systems) cannot handle bulk operations.
As the vast majority of data transfer situations will not meet these criteria, we must consider various options. First, one must consider which side of the database event horizon one should perform these tasks. One could, for example, simply dump an entire large file into a staging table on SQL Server, and then debatch using SQL to move the data to the “permanent” tables.
Big Box Stores owns and operates retail chains that include huge Big Box warehouse stores, large retail operations in groceries and general department stores and small convenience stores that sell gasoline, beverages, and fast food. The company has added brands and stores over the past few years through a series of mergers. Each brand has its own unique point of sale system. The stores operate in the United States, Canada, Mexico, and Western Europe.
The loss prevention department has noticed that a number of store sales and clerical staff are helping themselves to “five-finger bonuses.” The staff members use various ruses to take money from cash registers, obtain goods without paying for them, or otherwise embezzle money or steal goods from Big Box. These patterns typically unfold over periods of several days or weeks. For example, employees will make purchases using their employee discount at the store where they work, then return the product for full price at another store where they are not known or they have an accomplice return the goods to the employee for a full refund.
The various methods used to steal from Big Box fall into these recognized patterns and a good deal of this theft can be uncovered by analyzing patterns of sales transactions. Standard ETL techniques will be used to import data concerning the stores, products, and employees to a database where we can analyze these patterns and detect employee theft.
We have been tasked with designing a system that will import comma-delimited files exported by the point of sales (POS) systems into a SQL Server database that will then perform the analysis. Data concerning each sale will be sent from each of the point of sale systems. The files will hold all or part of the prior day’s sales and will range from 30,000 to over 2.5 million rows of data per file. For stores that have “regular” business hours, files will become available approximately one hour after the stores close. This time will vary based on the day of the week and the time of year. During “normal” operations, stores typically close at 9:00 PM local time. During certain peak shopping periods (for example, Christmas or local holiday periods) stores remain open until midnight, local time. Convenience stores are opened 24 hours per day, 7 days per week. Data will be sent for these stores after the POS system has closed the books on the prior day, typically at 1:00 AM local time.
The POS systems can be extended to periodically expose “final” sales to the system throughout the business day via a web service. The impact of using this method during a peak sales period is unknown, and performance of the POS may degrade. A full day’s data may also be extracted from the POS system in the comma-delimited format discussed as follows. The web service would expose the data using the natural hierarchy of “sales header” and “sales detail.”
All data must be loaded and available to the loss prevention department by 9 AM CET for European stores and 9 AM EST for North American stores. It should be noted that the different POS use different data types to identify stores, employees, products, and sales transactions. The load job must account for this and properly relate the data from the store to the master data loaded in a separate application.
The data will be sent in two comma-delimited files, one containing the “Sales Header” data and one containing the sales details. The data will be in the following format:
SalesID, StoreID, EmployeeID, EmployeeFirstName, EmployeeLastName, RegisterID, RegisterLocation, storeAddress, StoreCity, StoreProvince, StorePostalCode, CustomerID, CustomerFirstName, CustomerLastName, CustomerPostalCode, Date, Time, Method of Payment, CreditCardNumber, TotalSales, Amount Tendered, Change, PriorSalesID, Return
SalesID, ProductID, Quantity, markedPrice, ActualPrice, ReturnItem, DiscountCode, DiscountPercent, DiscountDescription, OriginalPurchaseDate, OriginalPurchaseStore, OriginalPurchaseSalesID, originalCustomerID, OriginalFirstName, OriginalLastName, OriginalStoreID, OriginalRegisterID, OriginalEmployeeID
Our mission is to move this data into a data mart that will use a standard star schema for this analysis. Big Box intended to prosecute employees for larceny or theft offences based on evidence this system gathers. Given the legal requirements that evidence gathered through this process must stand up in court, it is vital that this data be correct and loaded with a minimal number of errors or issues.
As is fairly typical, the use case above does not contain information on all of the facts we would need to consider when designing a solution. Every company has operating assumptions that the enterprise takes as a “given” and others we learn through our own involvement with the enterprise. These “facts” are so ingrained into an organization’s culture that people may not even recognize the need to explicitly state these requirements.
For example, if a consultant arrives at a company in Denver, CO that only does business in the United States, then he or she can expect that the business language will be English with US spelling. The exact same company in Calgary, doing business in Canada will need both English with British spelling and French. It is doubtful one would ever see such “requirements” stated explicitly, but anyone designing a solution would do well to keep them in mind.
Other facts may be extrapolated or derived from the given requirements. When you are designing a solution you must take these criteria into account as well. It would be at best unwise to design a solution that was beyond the skill set for the IT staff, for example. In this case, it is probably safe to say the following:
|Fact or Extrapolation||Reason|
|Big Box has a very sophisticated IT staff that can handle any advanced and sophisticated technologies.||They are currently handling multiple POS systems on 2 continents and already do sophisticated ETL work from these systems to existing BI systems.|
|Getting the deliverable “right” is more important than getting it done “fast”.||Legal requirements for using data as evidence.|
|Data must be secure during movement to avoid allegations of evidence tampering.||Legal requirements for using data as evidence.|
|Some level of operational control and monitoring must be built into the application we will design.||Common courtesy to the Network Operations Center (NOC) staff who will deal with this, if nothing else.|
We can tackle this problem from multiple angles, so let us take a look at the available options.
Candidate architecture #1–SSIS
First, we will explore the pros and cons of using SSIS for our solution platform.
Solution design aspects
This scenario is the sweet spot for SSIS. SSIS is, first and foremost, an ETL and batch data processing tool. SSIS can easily read multiple files from a network drive and has the tools out of the box that can debatch, either before or after loading to a database.
Nonetheless, we are faced with certain hurdles that will need to be accounted for in our design. We do not control precisely when the POS data will be made available.
There are a number of variables that influence that timing, not the least of which is the potential need for human intervention in closing books for the day and variable times throughout the year and across the globe when a particular store’s books will be closed. We need to expect that files will be delivered over a time range.
In some ways this is helpful, as it spreads some of the load over time.
One of the great things about SSIS in this situation is the flexibility it provides. We can load all of the data in a single batch to a staging table then move it (debatch) to its final destinations using SQL, or we can debatch on the application side and load directly to the final tables, or any combination that suits us and the strengths of the development team. SSIS can also be extended to monitor directories and load data when it becomes available. Finally, SSIS integrates easily into NOC monitoring systems and provides the ability to guarantee data security and integrity as required for this application. Moreover, SSIS does not incur any additional licensing costs, as it ships with SQL Server out of the box.
Solution delivery aspects
It is not clear from our use case what depth of experience Big Box staff has with SSIS. However, they certainly have experience with database technologies, SQL queries, and with other advance technologies associated with data transfer and integration, given the size of the enterprise operations. We can reasonably expect them to pick up any unfamiliar technologies quickly and easily.
This application will require some extensions to the typical ETL paradigm. Here data must go through some amount of human intervention through the daily “closing” before it is made available. This will involve tasks such as physically counting cash to make sure it matches the records in the POS system. Any number of factors can accelerate or delay the completion of this task. SSIS will therefore need to monitor the directories where data are delivered to ensure the data is available. Also, we will need to design the system so that it does not attempt to load partially completed files. This is a classic ETL problem with many potential solutions and certainly does not present insurmountable issues.
Solution operations aspects
In this case, we have one vitally important operational requirement; the solution must guarantee data integrity and security so that the data can be used to prosecute thieves or otherwise stand up to evidentiary rules. SSIS and SQL Server 2008 Enterprise Edition can handle these requirements. SQL Server 2008 security and data access auditing features will meet chain of custody requirements and ensure that no data tampering occurred. SSIS can enforce business rules programmatically to ensure the precise and accurate transfer of the data sent by the POS systems.
Many of these requirements will be filled with the design of the database itself. We would use, for example, the data access auditing now available with SQL Server 2008 to monitor who has been working with data. The database would use only Windows-based security, not SQL Server based security. Other steps to harden SQL Server against attack should be taken.
All the previously mentioned features secure the data while at rest. We will need to focus on how to ensure data integrity during the transfer of the data—while the data is in motion. SSIS has logging tools that will be used to monitor unsuccessful data transfers. Moreover, we can extend these tools to ensure either a complete data load or that we will have an explanation for any failure to load. It should be noted that the loss prevention staff is interested in outliers, so they will want to carefully examine data that fails to meet business requirements (and therefore fails to load to our target system) to look for patterns of theft.
We understand that Big Box staff has the technical wherewithal to handle this relatively simple extension to existing SQL Server technologies. This is a group of database professionals who deal with multiple stores performing over 2 million transactions per day. They support the POS, financial, inventory, and other systems required to handle this volume on two continents. This is a small step for them in terms of their ability to live with this solution.
Candidate architecture #2–BizTalk Server
While not primarily targeted at bulk data solutions, BizTalk Server can parse large inbound data sets, debatch the individual records, and insert them into a target system.
Solution design aspects
The POS systems that provide sales data to the Big Box data hub typically produce comma-delimited files. Using BizTalk Server, we can define the document structure of delimited files and natively accept and parse them. The requirements earlier also stated that the POS systems could be extended to publish a more real-time feed via web services as opposed to the daily file drop of data. This is more in tune with how BizTalk does standard processing (real-time data feeds) and would be a preferred means to distribute data through the BizTalk bus.
BizTalk Server’s SQL Server adapter is built to insert a record at a time into a database. This means that the BizTalk solution needs to break apart these large inbound data sets and insert each record individually into the final repository. Messages are debatched automatically in BizTalk via pipeline components and specially defined schemas, but this is a CPU-intensive process. We would want to isolate the servers that receive and parse these data sets so that the high CPU utilization doesn’t impede other BizTalk-based solutions from running.
Solution delivery aspects
Big Box leverages SQL Server all across the organization, but does not currently have a BizTalk footprint. This means that they’ll need to set up a small infrastructure to host this software platform. They do have developers well-versed in .NET development and have typically shown a penchant for utilizing external consultants to design and implement large enterprise solutions. It would be critical for them to build up a small center of excellence in BizTalk to ensure that maintenance of this application and the creation of new ones can progress seamlessly.
Solution operations aspects
BizTalk Server provides strong operational support through tooling, scripting, and monitoring. If the downstream database becomes unavailable, BizTalk will queue up the messages that have yet to be delivered. This ensures that no sales information gets lost in transit and provides a level of guarantee that the data mart is always accurate.
Given the relatively large sets of data, the operations team will need to configure a fairly robust BizTalk environment, which can handle the CPU-intensive debatching and perform the database inserts in a timely fashion.
Big Box would be well served by moving to a more real-time processing solution in the near future. This way, they can do more live analysis and not have to wait until daily intervals to acquire the latest actionable data. A messaging-based solution that relies on BizTalk Server is more in tune with that vision.
However, this is a critical program and speed to market is a necessity. Big Box accepts a high level of risk in procuring a new enterprise software product and getting the environments and resources in place to design, develop, and support solutions built upon it.
|SQL Server and SSIS|
|Easily deployed and extensible ETL tool||Need to build sophisticated error handling systems|
|Designed to handle batch processing of large files, exactly the task at hand|
|No additional licensing costs – comes with SQL Server|
|Can be built and maintained by current staff|
|Provides for live, real-time analysis||CPU-intensive processes|
|Can leverage BizTalk capability to send events to downstream transactional systems||High database process overhead|
|Enterprise-class hosting infrastructure||Additional licensing and capital costs|
|Not clear if staff has the skills to support product|
When all is said and done, this is exactly the scenario that SSIS was designed to handle, a batch load to a data mart. Moreover, the selection of SSIS entails no additional licensing costs, as might be the case with BizTalk.