





















































In this article by Jenny Zhang, author of the book Tableau 10.0 Best Practices, you will learn the Best Practices about Tableau Data Extract. We will look into different ways of creating Tableau data extracts and technical details of how a Tableau data extract works. We will learn on how to create extract with large volume of data efficiently, and then upload and manage Tableau data extract in Tableau online. We will also take a look at refresh Tableau data extract, which is useful to keep your data up to date automatically. Finally, we will take a look using Tableau web connector to create data extract.
(For more resources related to this topic, see here.)
Tableau provides a few ways to create extracts.
Creating an extract by connecting to the original data source (Databases/Salesforce/Google Analytics and so on) will maintain the connection to the original data source. You can right click the extract to edit the extract and refresh the extract from the original data source.
If you create a duplicate of the extract by right click the data extract and duplicate, it will create a new .tde file and still maintain the connection to the original data source. If you refresh the duplicated data extract, it will not refresh the original data extract that you created the duplicate from.
If you create a data extract by connecting to a Tableau extract file (.tde), you will not have that connection to the original data source that the extract is created from since you are just connecting to a local .tde file. You cannot edit or refresh the data from the original data source. Duplicate this extract with connection to the local .tde file will NOT create a new .tde file. The duplication will still point to the same local .tde file.
You can right click – Extract Data to create an extract out of an extract. But we do not normally do that.
A Tableau extract (.tde) file is a compressed snapshot of data extracted from a large variety of original data sources (excel, databases, Salesforce, NoSQL and so on). It is stored on disk and loaded into memory as required to create a Tableau Viz.
There are two design principles of the Tableau extract make it ideal for data analytics. The first principle is Tableau extract is a columnar store. The columnar databases store column values rather than row values. The benefit is that the input/output time required to access/aggregate the values in a column is significantly reduced. That is why Tableau extract is great for data analytics.
The second principle is how a Tableau extract is structured to make sure it makes best use of your computer’s memory. This will impact how it is loaded into memory and used by Tableau. To better understand this principle, we need to understand how Tableau extract is created and used as the data source to create visualization.
When Tableau creates data extract, it defines the structure of the .tde file and creates separate files for each column in the original data source. When Tableau retrieves data from the original data source, it sorts, compresses and adds the values for each column to their own file. After that, individual column files are combined with metadata to form a single file with as many individual memory-mapped files as there are the columns in the original data source.
Because a Tableau data extract file is a memory-mapped file, when Tableau requests data from a .tde file, the data is loaded directly into the memory by the operating system. Tableau does not have to open, process or decompress the file. If needed, the operating system continues to move data in and out of RAM to insure that all of the requested data is made available to Tableau. It means that Tableau can query data that is bigger than the RAM on the computer.
Following are the seven main benefits of using Tableau data extract
If you have an Excel file with lots of data and lots of formulas, it could take a long time to load into Tableau. The best practice is to save the Excel as a .csv file and remove all the formulas.
If you do not need the values down to the dimension of what it is in the underlying data source, aggregate to a higher dimension will significantly reduce the extract size and improve performance.
Add a data source filter by right click the data source and then choose to Edit Data Source Filter to remove the data you do not need before creating the extract.
Hide unused fields before creating a data extract can speed up extract creation and also save storage space.
One way to create extracts is to create them in different workbooks. The advantage is that you can create extracts on the fly when you need them. But the disadvantage is that once you created many extracts, it is very difficult to manage them. You can hardly remember which dashboard has which extracts. A better solution is to use one workbook just to create data extracts and then upload the extracts to Tableau online. When you need to create visualizations, you can use the extracts in Tableau online. If you want to manage the extracts further, you can use different workbooks for different types of data sources. For example, you can use one workbook for excel files, one workbook for local databases, one workbook for web based data and so on.
The default project in Tableau online is a good place to store your data extracts. The reason is that the default project cannot be deleted. Another benefit is that when you use command line to refresh the data extracts, you do not need to specify project name if they are in the default project.
In Tableau Online/Server, it’s important to make sure that the backgrounder has enough disk space to store existing Tableau data extracts as well as refresh them and create new ones. A good rule of thumb is the size of the disk available to the backgrounder should be two to three times the size of the data extracts that are expected to be stored on it.
Local refresh of the published extract:
NOTE 1
If you need to make changes to any metadata, please do it before publishing to the server.
NOTE 2
If you use the data extract in Tableau Online to create visualizations for multiple workbooks (which I believe you do since that is the benefit of using a shared data source in Tableau Online), please be very careful when making any changes to the calculated fields, groups, or other metadata. If you have other calculations created in the local workbook with the same name as the calculations in the data extract in Tableau Online, the Tableau Online version of the calculation will overwrite what you created in the local workbook. So make sure you have the correct calculations in the data extract that will be published to Tableau Online.
Only cloud based data sources (eg. Salesforce, Google analytics) can be refreshed using schedule jobs in Tableau online. One option is to use Tableau Desktop command to refresh non-cloud based data source in Tableau Online. Windows scheduler can be used to automate the refresh jobs to update extracts via Tableau Desktop command. Another option is to use the sync application or manually refresh the extracts using Tableau Desktop.
NOTE
If using command line to refresh the extract, + cannot be used in the data extract name.
Following are the tips for incremental refrences:
The Tableau Web Data Connector is the API that can be used by people who want to write some code to connect to certain web based data such as a web page.
The connectors can be written in java. It seems that these web connectors can only connect to web pages, web services and so on. It can also connect to local files.
Click on Data | New Data source | Web Data Connector.
The data is pulled when the connection is build and Tableau will store the data locally in Tableau extract. You can still refresh the data manually or via schedule jobs.
Here is a list of web connectors around the Tableau community:
You can check the tableau community for more web connectors
In summary, be sure to keep in mind the following best practices for data extracts: Use full fresh when possible. Fully refresh the incrementally refreshed extracts on a regular basis. Publish data extracts to Tableau Online/Server to avoid duplicates. Hide unused fields/ use filter before creating extracts to improve performance and save storage space. Make sure there is enough continuous disk space for the largest extract file. A good way is to use SSD drivers.
Further resources on this subject: