|Read more about this book|
Why consider information integration?
The useful life of pre-relational mainframe database management system engines is coming to an end because of a diminishing application and skills base, and increasing costs.—Gartner Group
During the last 30 years, many companies have deployed mission critical applications running various aspects of their business on the legacy systems. Most of these environments have been built around a proprietary database management system running on the mainframe. According to Gartner Group, the installed base of mainframe, Sybase, and some open source databases has been shrinking. There is vendor sponsored market research that shows mainframe database management systems are growing, which, according to Gartner, is due primarily to increased prices from the vendors, currency conversions, and mainframe CPU replacements.
Over the last few years, many companies have been migrating mission critical applications off the mainframe onto open standard Relational Database Management Systems (RDBMS) such as Oracle for the following reasons:
- Reducing skill base: Students and new entrants to the job market are being trained on RDBMS like Oracle and not on the legacy database management systems. Legacy personnel are retiring, and those that are not are moving into expensive consulting positions to arbitrage the demand.
- Lack of flexibility to meet business requirements: The world of business is constantly changing and new business requirements like compliance and outsourcing require application changes. Changing the behavior, structure, access, interface or size of old databases is very hard and often not possible, limiting the ability of the IT department to meet the needs of the business. Most applications on the aging platforms are 10 to 30 years old and are long past their original usable lifetime.
- Lack of Independent Software Vendor (ISV)applications: With most ISVs focusing on the larger market, it is very difficult to find applications, infrastructure, and tools for legacy platforms. This requires every application to be custom coded on the closed environment by scarce in-house experts or by expensive outside consultants.
- Total Cost of Ownership (TCO): As the user base for proprietary systems decreases, hardware, spare parts, and vendor support costs have been increasing. Adding to this are the high costs of changing legacy applications, paid either as consulting fees for a replacement for diminishing numbers of mainframe trained experts or increased salaries for existing personnel. All leading to a very high TCO which doesn’t even take into account the opportunity cost to the business of having inflexible systems.
Business challenges in data integration and migration
Once the decision has been taken to migrate away from a legacy environment, the primary business challenge is business continuity. Since many of these applications are mission critical, running various aspects of the business, the migration strategy has to ensure continuity to the new application—and in the event of failure, rollback to the mainframe application. This approach requires data in the existing application to be synchronized with data on the new application.
Making the challenge of data migration more complicated is the fact that legacy applications tend to be interdependent, but the need from a risk mitigation standpoint is to move applications one at a time. A follow-on challenge is prioritizing the order in which applications are to be moved off the mainframe, and ensuring that the order meets both the business needs and minimizes the risk in the migration process.
Once a specific application is being migrated, the next challenge is to decide which business processes will be migrated to the new application. Many companies have business processes that are present, because that’s the way their systems work. When migrating an application off the mainframe, many business processes do not need to migrate. Even among the business processes that need to be migrated, some of these business processes will need to be moved as-is and some of them will have to be changed. Many companies utilize the opportunity afforded by a migration to redo the business processes they have had to live with for many years.
Data is the foundation of the modernization process. You can move the application, business logic, and work flow, but without a clean migration of the data the business requirements will not be met. A clean data migration involves:
- Data that is organized in a usable format by all modern tools
- Data that is optimized for an Oracle database
- Data that is easy to maintain
Technical challenges of information integration
The technical challenges with any information integration all stem from the fact that the application accesses heterogeneous data (VSAM, IMS, IDMS, ADABAS, DB2, MSSQL, and so on) that can even be in a non-relational hierarchical format. Some of the technical problems include:
- The flexible file definition feature used in COBOL applications in the existing system will have data files with multi-record formats and multi-record types in the same dataset—neither of which exist in RDBMS. Looping data structure and substructure or relative offset record organization such as a linked list, which are difficult to map into a relational table.
- Data and referential integrity is managed by the Oracle database engine. However, legacy applications already have this integrity built in. One question is whether to use Oracle to handle this integrity and remove the logic from the application.
- Finally, creating an Oracle schema to maximize performance, which includes mapping non-oracle keys to Oracle primary and secondary keys; especially when legacy data is organized in order of key value which can affect the performance on an Oracle RDBMS. There are also differences in how some engines process transactions, rollbacks, and record locking.
General approaches to information integration and migration
There are several technical approaches to consider when doing any kind of integration or migration activity. In this section, we will look at a methodology or approach for both data integration and data migration.
Clearly, given this range of requirements, there are a variety of different integration strategies, including the following:
- Consolidated: A consolidated data integration solution moves all data into a single database and manages it in a central location. There are some considerations that need to be known regarding the differences between non- Oracle and Oracle mechanics. Transaction processing is an example. Some engines use implicit commits and some manage character sets differently than Oracle does, this has an impact on sort order.
- Federated: A federated data integration solution leaves data in the individual data source where it is normally maintained and updated, and simply consolidates it on the fly as needed. In this case, multiple data sources will appear to be integrated into a single virtual database, masking the number and different kinds of databases behind the consolidated view. These solutions can work bidirectionally.
- Shared: A shared data integration solution actually moves data and events from one or more source databases to a consolidated resource, or queue, created to serve one or more new applications. Data can be maintained and exchanged using technologies such as replication, message queuing, transportable table spaces, and FTP.
Oracle has extensive support for consolidated data integration and while there are many obvious benefits to the consolidated solution, it is not practical for any organization that must deal with legacy systems or integrate with data it does not own. Therefore, we will not discuss this type any further, but instead concentrate on federated and shared solutions.
Over 80 percent of migration projects fail or overrun their original budgets/ timelines, according to a study by the Standish Group. In most cases, this is because of a lack of understanding of some of the unique challenges of a migration project. The top five challenges of a migration project are:
- Little migration expertise to draw from: Migration is not an industry-recognized area of expertise with an established body of knowledge and practices, nor have most companies built up any internal competency to draw from.
- Insufficient understanding of data and source systems: The required data is spread across multiple source systems, not in the right format, of poor quality, only accessible through vaguely understood interfaces, and sometimes missing altogether.
- Continuously evolving target system: The target system is often under development at the time of data migration, and the requirements often change during the project.
- Complex target data validations: Many target systems have restrictions, constraints, and thresholds on the validity, integrity, and quality of the data to be loaded.
- Repeated synchronization after the initial migration: Migration is not a one-time effort. Old systems are usually kept alive after new systems launch and synchronization is required between the old and new systems during this handoff period. Also, long after the migration is completed, companies often have to prove the migration was complete and accurate to various government, judicial, and regulatory bodies.
Most migration projects fail because of an inappropriate migration methodology, because the migration problem is thought of as a four stage process:
- Analyze the source data
- Extract/transform the data into the target formats
- Validate and cleanse the data
- Load the data into the target
However, because of the migration challenges discussed previously, this four stage project methodology often fails miserably.
The challenge begins during the initial analysis of the source data when most of the assumptions about the data are proved wrong. Since there is never enough time planned for analysis, any mapping specification from the mainframe to Oracle is effectively an intelligent guess. Based on the initial mapping specification, extractions, and transformations developed run into changing target data requirements, requiring additional analysis and changes to the mapping specification. Validating the data according to various integrity and quality constraints will typically pose a challenge. If the validation fails, the project goes back to further analysis and then further rounds of extractions and transformations. When the data is finally ready to be loaded into Oracle, unexpected data scenarios will often break the loading process and send the project back for more analysis, more extractions and transformations, and more validations. Approaching migration as a four stage process means continually going back to earlier stages due to the five challenges of data migration.
The biggest problem with migration project methodology is that it does not support the iterative nature of migrations. Further complicating the issue is that the technology used for data migration often consists of general-purpose tools repurposed for each of the four project stages. These tools are usually non-integrated and only serve to make difficult processes more difficult on top of a poor methodology.
The ideal model for successfully managing a data migration project is not based on multiple independent tools. Thus, a cohesive method enables you to cycle or spiral your way through the migration process—analyzing the data, extracting and transforming the data, validating the data, and loading it into targets, and repeating the same process until the migration is successfully completed. This approach enables target-driven analysis, validating assumptions, refining designs, and applying best practices as the project progresses. This agile methodology uses the same four stages of analyze, extract/transform, validate and load. However, the four stages are not only iterated, but also interconnected with one another.
An iterative approach is best achieved through a unified toolset, or platform, that leverages automation and provides functionality which spans all four stages. In an iterative process, there is a big difference between using a different tool for each stage and one unified toolset across all four stages. In one unified toolset, the results of one stage can be easily carried into the next, enabling faster, more frequent and ultimately less iteration which is the key to success in a migration project. A single platform not only unifies the development team across the project phases, but also unifies the separate teams that may be handling each different source system in a multi-source migration project.
Architectures: federated versus shared
Federated data integration can be very complicated. This is especially the case for distributed environments where several heterogeneous remote databases are to be synchronized using two-phase commit. Solutions that provide federated data integration access and maintain the data in the place wherever it resides (such as in a mainframe data store associated with legacy applications). Data access is done ‘transparently’ for example, the user (or application) interacts with a single virtual or federated relational database under the control of the primary RDBMS, such as Oracle. This data integration software is working with the primary RDBMS ‘under the covers’ to transform and translate schemas, data dictionaries, and dialects of SQL; ensure transactional consistency across remote foreign databases (using two-phase commit); and make the collection of disparate, heterogeneous, distributed data sources appear as one unified database. The integration software carrying out these complex tasks needs to be tightly integrated with the primary RDBMS in order to benefit from built-in functions and effective query optimization. The RDBMS must also provide all the other important RDBMS functions, including effective query optimization.
Data sharing integration
Data sharing-based integration involves the sharing of data, transactions, and events among various applications in an organization. It can be accomplished within seconds or overnight, depending on the requirement. It may be done in incremental steps, over time, as individual one-off implementations are required. If one-off tools are used to implement data sharing, eventually the variety of data-sharing approaches employed begin to conflict, and the IT department becomes overwhelmed with an unmanageable maintenance, which increases the total cost of ownership.
What is needed is a comprehensive, unified approach that relies on a standard set of services to capture, stage, and consume the information being shared. Such an environment needs to include a rules-based engine, support for popular development languages, and comply with open standards. GUI-based tools should be available for ease of development and the inherent capabilities should be modular enough to satisfy a wide variety of possible implementation scenarios.
The data-sharing form of data integration can be applied to achieve near real-time data sharing. While it does not guarantee the level of synchronization inherent with a federated data integration approach (for example, if updates are performed using two-phase commit), it also doesn’t incur the corresponding performance overhead. Availability is improved because there are multiple copies of the data.
Considerations when choosing an integration approach
There is a range in the complexity of data integration projects from relatively straightforward (for example, integrating data from two merging companies that used the same Oracle applications) to extremely complex projects such as long-range geographical data replication and multiple database platforms. For each project, the following factors can be assessed to estimate the complexity level. Pretend you are a systems integrator such as EDS trying to size a data integration effort as you prepare a project proposal.
- Potential for conflicts: Is the data source updated by more than one application? If so, the potential exists for each application to simultaneously update the same data.
- Latency: What is the required synchronization level for the data integration process? Can it be an overnight batch operation like a typical data warehouse? Must it be synchronous, and with two-phase commit? Or, can it be quasi-real-time, where a two or three second lag is tolerable, permitting an asynchronous solution?
- Transaction volumes and data growth trajectory: What are the expected average and peak transaction rates and data processing throughput that will be required?
- Access patterns: How frequently is the data accessed and from where?
- Data source size: Some data sources of such volume that back up, and unavailability becomes extremely important.
- Application and data source variety: Are we trying to integrate two ostensibly similar databases following the merger of two companies that both use the same application, or did they each have different applications? Are there multiple data sources that are all relational databases? Or are we integrating data from legacy system files with relational databases and realtime external data feeds?
- Data quality: The probability that data quality adds to overall project complexity increases as the variety of data sources increases.
One point of this discussion is that the requirements of data integration projects will vary widely. Therefore, the platform used to address these issues must be a rich superset of the features and functions that will be applied to any one project.