Preparation Analysis of Data Source

5 min read

(For more resources related to this topic, see here.)

List of data sources

Here, the wide varieties of data sources that are supported in the PowerPivot interface are given in brief. The vital part is to install providers such as OLE DB and ODBC that support the existing data source, because when installing the PowerPivot add-in, it will not install the provider too and some providers might already be installed with other applications.

For instance, if there is a SQL Server installed in the system, the OLE DB provider will be installed with the SQL Server, so that later on it wouldn’t be necessary to install OLE DB while adding data sources by using the SQL Server as a data source. Hence, make sure to verify the provider before it is added as a data source.

Perhaps the provider is required only for relational database data sources.

Relational database

By using RDBMS, you can import tables and views into the PowerPivot workbook. The following is a list of various data sources:

Microsoft SQL Server
Microsoft SQL Azure
Microsoft SQL Server Parallel Data Warehouse
Microsoft Access
Oracle
Teradata
Sybase
Informix
IBM DB2
Other providers (OLE DB / ODBC)

Multidimensional sources

Multidimensional data sources can only be added from Microsoft Analysis Services (SSAS).

Data feeds

The three types of data feeds are as follows:

SQL Server Reporting Service (SSRS)
Azure DataMarket dataset
Other feeds such as atom service documents and single data feed

Text files

The two types of text files are given as follows:

Excel file
Text file

Purpose of importing data from a variety of sources

In order to make a decision about a particular subject area, you should analyze all the required data that is relevant to the subject area. If the data is stored in a variety of data sources, importing the data from different data sources has to be done. If all the data is only in one data source, only the data needs to be imported from the required tables for the subject and then various types of analysis can be done.

The reason why users need to import data from different data sources is that they would then have an ample amount of data when they need to make any analysis. Another generic reason would be to cross-analyze data from different business systems such as Customer Relationship Management (CRM) and Campaign Management System (CMS). Data sourcing from only one source wouldn’t be as sophisticated as an analysis done from different data sources as the amount of data from which the analysis was done for multisourced data is more detailed than the data from only a single source. It also might reveal conflicts between multiple data sources.

In real time, usually in the e-commerce industry, blogs, and forum websites wouldn’t ask for more details about customers at the time of registration, because the time consumed for long registrations would discourage the user, leading to cancellation the of the order.

For instance, the customer table that would be stored in the database of an e-commerce industry would contain the following attributes:

Customer

FirstName

LastName

E-mail

BirthDate

Zip Code

Gender

However, this kind of industry needs to know their customers more in order to increase their sales. Since the industry only saves a few attributes about the customer during registration, it is difficult to track down the customers and it is even more difficult to advertise according to individual customers. Therefore, in order to find some relevant data about the customers, the e-commerce industries try to make another data source using the Internet or other types of sources.

For instance, by using the Postcode given by the customer during registration, the user can get Country | State | City from various websites and then use the information obtained to make a table format either in Excel or CSV, as follows:

Location

Postcode

City

State

Country

So, finally the user would have two sources—one source is from the user’s RDBMS database and the other source is from the Excel file created later—both of these can be used in order to make a customer analysis based on their location.

General overview of ETL

The main goal is to facilitate the development of data migration applications by applying data transformations.

Extraction Transform Load (ETL) comprises of the first step in a data warehouse process. It is the most important and the hardest part, because it determines the quality of the data warehouse and the scope of the analyses the users will be able to build upon it.

Let us discuss in detail what ETL is.

The first substep is extraction. As the users would want to regroup all the information a company has, they will have to collect the data from various heterogeneous sources (operational data such as databases and/or external data such as CSV files) and various models (relational, geographical, web pages, and so on).

The difficulty starts here. As data sources are heterogeneous, formats are usually different, and the users would have different types of data models for similar information (different names and/or formats) and different keys for the same objects. These are some of the main problems.

The aim of the transformation task is to correct such problems, as much as possible.

Let us now see what a transformation task is.

The users would need to transform heterogeneous data of different sources into homogeneous data. Here are some examples of what they can do:

Extraction of the data sources
Identification of relevant data sources
Filtering of non-admissible data
Modification of formats or values

Summary

This article shows the users how to prepare data for analysis. We also covered different types of data sources that can be imported into the PowerPivot interface. Some information about the provider and a general overview of the ETL process has also been given. Users now know how the ETL process works in the PowerPivot interface; also, users were shown an introduction and the advantages of DAX.