In this article by Jenny Zhang, author of the book Tableau 10.0 Best Practices, you will learn the Best Practices about Tableau Data Extract. We will look into different ways of creating Tableau data extracts and technical details of how a Tableau data extract works. We will learn on how to create extract with large volume of data efficiently, and then upload and manage Tableau data extract in Tableau online. We will also take a look at refresh Tableau data extract, which is useful to keep your data up to date automatically. Finally, we will take a look using Tableau web connector to create data extract.

(For more resources related to this topic, see here.)

Different ways of creating Tableau data extracts

Tableau provides a few ways to create extracts.

Direct connect to original data sources

Creating an extract by connecting to the original data source (Databases/Salesforce/Google Analytics and so on) will maintain the connection to the original data source. You can right click the extract to edit the extract and refresh the extract from the original data source.

Duplicate of an extract

If you create a duplicate of the extract by right click the data extract and duplicate, it will create a new .tde file and still maintain the connection to the original data source. If you refresh the duplicated data extract, it will not refresh the original data extract that you created the duplicate from.

Connect to a Tableau Extract File

If you create a data extract by connecting to a Tableau extract file (.tde), you will not have that connection to the original data source that the extract is created from since you are just connecting to a local .tde file. You cannot edit or refresh the data from the original data source. Duplicate this extract with connection to the local .tde file will NOT create a new .tde file. The duplication will still point to the same local .tde file.

You can right click – Extract Data to create an extract out of an extract. But we do not normally do that.

Technical details of how a Tableau data extract works

Tableau data extract’s design principle

A Tableau extract (.tde) file is a compressed snapshot of data extracted from a large variety of original data sources (excel, databases, Salesforce, NoSQL and so on). It is stored on disk and loaded into memory as required to create a Tableau Viz.

There are two design principles of the Tableau extract make it ideal for data analytics. The first principle is Tableau extract is a columnar store. The columnar databases store column values rather than row values. The benefit is that the input/output time required to access/aggregate the values in a column is significantly reduced. That is why Tableau extract is great for data analytics.

The second principle is how a Tableau extract is structured to make sure it makes best use of your computer’s memory. This will impact how it is loaded into memory and used by Tableau. To better understand this principle, we need to understand how Tableau extract is created and used as the data source to create visualization.

When Tableau creates data extract, it defines the structure of the .tde file and creates separate files for each column in the original data source. When Tableau retrieves data from the original data source, it sorts, compresses and adds the values for each column to their own file. After that, individual column files are combined with metadata to form a single file with as many individual memory-mapped files as there are the columns in the original data source.

Because a Tableau data extract file is a memory-mapped file, when Tableau requests data from a .tde file, the data is loaded directly into the memory by the operating system. Tableau does not have to open, process or decompress the file. If needed, the operating system continues to move data in and out of RAM to insure that all of the requested data is made available to Tableau. It means that Tableau can query data that is bigger than the RAM on the computer.

Benefits of using Tableau data extract

Following are the seven main benefits of using Tableau data extract

Performance: Using Tableau data extract can increase performance when the underlying data source is slow. It can also speed up CustomSQL.
Reduce load: Using Tableau data extract instead of a live connection to databases reduces the load on the database that can result from heavy traffic.
Portability: Tableau data extract can be bundled with the visualizations in a packaged workbook for sharing with others.
Pre-aggregation: When creating extract, you can choose to aggregate your data for certain dimensions. An aggregated extract has smaller size and contains only aggregated data. Accessing the values of aggregations in a visualization is very fast since all of the work to derive the values has been done. You can choose the level of aggregation. For example, you can choose to aggregate your measures to month, quarter, or year.
Materialize calculated fields: When you choose to optimize the extract, all of the calculated fields that have been defined are converted to static values upon the next full refresh. They become additional data fields that can be accessed and aggregated as quickly as any other fields in the extract. The improvement on performance can be significant especially on string calculations since string calculations are much slower compared to numeric or date calculations.
Publish to Tableau Public and Tableau Online: Tableau Public only supports Tableau extract files. Though Tableau Online can connect to some cloud based data sources, Tableau data extract is most common used.
Support for certain function not available when using live connection: Certain function such as count distinct is only available when using Tableau data extract.

How to create extract with large volume of data efficiently

Load very large Excel file to Tableau

If you have an Excel file with lots of data and lots of formulas, it could take a long time to load into Tableau. The best practice is to save the Excel as a .csv file and remove all the formulas.

Aggregate the values to higher dimension

If you do not need the values down to the dimension of what it is in the underlying data source, aggregate to a higher dimension will significantly reduce the extract size and improve performance.

Use Data Source Filter

Add a data source filter by right click the data source and then choose to Edit Data Source Filter to remove the data you do not need before creating the extract.

Hide Unused Fields

Hide unused fields before creating a data extract can speed up extract creation and also save storage space.

Upload and manage Tableau data extract in Tableau online

Create Workbook just for extracts

One way to create extracts is to create them in different workbooks. The advantage is that you can create extracts on the fly when you need them. But the disadvantage is that once you created many extracts, it is very difficult to manage them. You can hardly remember which dashboard has which extracts. A better solution is to use one workbook just to create data extracts and then upload the extracts to Tableau online. When you need to create visualizations, you can use the extracts in Tableau online. If you want to manage the extracts further, you can use different workbooks for different types of data sources. For example, you can use one workbook for excel files, one workbook for local databases, one workbook for web based data and so on.

Upload data extracts to default project

The default project in Tableau online is a good place to store your data extracts. The reason is that the default project cannot be deleted. Another benefit is that when you use command line to refresh the data extracts, you do not need to specify project name if they are in the default project.

Make sure Tableau online/server has enough space

In Tableau Online/Server, it’s important to make sure that the backgrounder has enough disk space to store existing Tableau data extracts as well as refresh them and create new ones. A good rule of thumb is the size of the disk available to the backgrounder should be two to three times the size of the data extracts that are expected to be stored on it.

Refresh Tableau data extract

Local refresh of the published extract:

Download a Local Copy of the Data source from Tableau Online.
- Go to Data Sources tab
- Click on the name of the extract you want to download
- Click download
Refresh the Local Copy.
- Open the extract file in Tableau Desktop
- Right click on the data source in, and choose Extract- refresh
Publish the refreshed Extract to Tableau Online.
- Right lick the extract and click Publish to server
- You will be asked if you wish to overwrite a file with the same name and click yes

NOTE 1

If you need to make changes to any metadata, please do it before publishing to the server.

NOTE 2

If you use the data extract in Tableau Online to create visualizations for multiple workbooks (which I believe you do since that is the benefit of using a shared data source in Tableau Online), please be very careful when making any changes to the calculated fields, groups, or other metadata. If you have other calculations created in the local workbook with the same name as the calculations in the data extract in Tableau Online, the Tableau Online version of the calculation will overwrite what you created in the local workbook. So make sure you have the correct calculations in the data extract that will be published to Tableau Online.

Schedule data extract refresh in Tableau Online

Only cloud based data sources (eg. Salesforce, Google analytics) can be refreshed using schedule jobs in Tableau online. One option is to use Tableau Desktop command to refresh non-cloud based data source in Tableau Online. Windows scheduler can be used to automate the refresh jobs to update extracts via Tableau Desktop command. Another option is to use the sync application or manually refresh the extracts using Tableau Desktop.

NOTE

If using command line to refresh the extract, + cannot be used in the data extract name.

Tips for Incremental Refreshes

Following are the tips for incremental refrences:

Incremental extracts retrieve only new records from the underlying data source which reduces the amount of time required to refresh the data extract. If there are no new records to add during an incremental extract, the processes associated with performing an incremental extract still execute.
The performance of incremental refresh is decreasing over time. This is because incremental extracts only grow in size, and as a result, the amount of data and areas of memory that must be accessed in order to satisfy requests only grow as well. In addition, larger files are more likely to be fragmented on a disk than smaller ones.
When performing an incremental refresh of an extract, records are not replaced. Therefore, using a date field such as “Last Updated” in an incremental refresh could result in duplicate rows in the extract.
Incremental refreshes are not possible after an additional file has been appended to a file based data source because the extract has multiple sources at that point.

Use Tableau web connector to create data extract

What is Tableau web connector?

The Tableau Web Data Connector is the API that can be used by people who want to write some code to connect to certain web based data such as a web page.

The connectors can be written in java. It seems that these web connectors can only connect to web pages, web services and so on. It can also connect to local files.

How to use Tableau web connector?

Click on Data | New Data source | Web Data Connector.

Is the Tableau web connection live?

The data is pulled when the connection is build and Tableau will store the data locally in Tableau extract. You can still refresh the data manually or via schedule jobs.

Are there any Tableau web connection available?

Here is a list of web connectors around the Tableau community:

Alteryx: http://data.theinformationlab.co.uk/alteryx.html
Facebook: http://tableaujunkie.com/post/123558558693/facebook-web-data-connector

You can check the tableau community for more web connectors

Summary

In summary, be sure to keep in mind the following best practices for data extracts: Use full fresh when possible. Fully refresh the incrementally refreshed extracts on a regular basis. Publish data extracts to Tableau Online/Server to avoid duplicates. Hide unused fields/ use filter before creating extracts to improve performance and save storage space. Make sure there is enough continuous disk space for the largest extract file. A good way is to use SSD drivers.