TimeScaleDB announced yesterday that they are going distributed; this version is currently in private beta with the public version slated for later this year. They are also bringing PostgreSQL back. However, with PostgreSQL, a major problem is scaling out. To address this, TimeScaleDB does not implement traditional sharding, instead, using ‘chunking’.
What is TimescaleDB’s chunking?
In TimescaleDB, chunking is the mechanism which scales PostgreSQL for time-series workloads. Chunks are created by automatically partitioning data by multiple dimensions (one of which is time). In a blog post, TimeScaleDB specifies, “this is done in a fine-grain way such that one dataset may be comprised of 1000s of chunks, even on a single node.”
Chunking offers a wide set of capabilities unlike sharding, which only offers the option to scale out. These include scaling-up (on the same node) and scaling-out (across multiple nodes). It also offers elasticity, partitioning flexibility, data retention policies, data tiering, and data reordering. TimescaleDB also automatically partitions a table across multiple chunks on the same instance, whether on the same or different disks.
TimescaleDB’s multi-dimensional chunking auto-creates chunks, keeps recent data chunks in memory, and provides time-oriented data lifecycle management (e.g., for data retention, reordering, or tiering policies).
However, one issue is the management of the number of chunks (i.e., “sub-problems”). For this, they have come up with hypertable abstraction to make partitioned tables easy to use and manage.
Hypertable abstraction makes chunking manageable
Hypertables are typically used to handle a large amount of data by breaking it up into chunks, allowing operations to execute efficiently. However, when the number of chunks is large, these data chunks can be distributed over several machines by using distributed hypertables. Distributed hypertables are similar to normal hypertables, but they add an additional layer of hypertable partitioning by distributing chunks across data nodes. They are designed for multi-dimensional chunking with a large number of chunks (from 100s to 10,000s), offering more flexibility in how chunks are distributed across a cluster.
Users are able to interact with distributed hypertables similar to a regular hypertable (which itself looks just like a regular Postgres table).
Chunking does not put additional burden on applications and developers because TimescaleDB does not interact directly with chunks (and thus do not need to be aware of this partition mapping themselves, unlike in some sharded systems). The system also does not expose different capabilities for chunks than the entire hypertable.
TimescaleDB goes distributed
TimescaleDB is already available for testing in private beta as for selected users and customers. The initial licensed version is expected to be widely available. This version will support features such as high write rates, query parallelism, predicate push down for lower latency, elastically growing a cluster to scale storage and compute, and fault tolerance via physical replica.
Developers were quite intrigued by the new chunking process. A number of questions were asked on Hacker News, duly answered by TimescaleDB creators.
One of the questions put forth is related to the Hot partition problem. A user asks, “The biggest limit is that their “chunking” of data by time-slices may lead directly to the hot partition problem — in their case, a “hot chunk.” Most time series is ‘dull time’ — uninteresting time samples of normal stuff. Then, out of nowhere, some ‘interesting’ stuff happens. It’ll all be in that one chunk, which will get hammered during reads.”
To which Erik Nordström, Timescale engineer replied, “ TimescaleDB supports multi-dimensional partitioning, so a specific “hot” time interval is actually typically split across many chunks, and thus server instances. We are also working on native chunk replication, which allows serving copies of the same chunk out of different server instances.
Apart from these things to mitigate the hot partition problem, it’s usually a good thing to be able to serve the same data to many requests using a warm cache compared to having many random reads that thrashes the cache.”
Another question asked said, “In this vision, would this cluster of servers be reserved exclusively for time series data or do you imagine it containing other ordinary tables as well?”
To which, Mike Freedman, CTO of TimeScale answered, “We commonly see hypertables (time-series tables) deployed alongside relational tables, often because there exists a relation between them: the relational metadata provides information about the user, sensor, server, security instrument that is referenced by id/name in the hypertable. So joins between these time-series and relational tables are often common, and together these serve the applications one often builds on top of your data.
Now, TimescaleDB can be installed on a PG server that is also handling tables that have nothing to do with its workload, in which case one does get performance interference between the two workloads. We generally wouldn’t recommend this for more production deployments, but the decision here is always a tradeoff between resource isolation and cost.”
Some thought sharding remains the better choice even if chunking improves performance.
No doubt at all that Chunking could improve the DML performance but I would still say #Sharding is the best way to improve database performance despite having some drawbacks and caveats in it. #database #postgresql #distributeddatabase #timescaledb https://t.co/CfhT2f9gZ4
— Md. Rashim Uddin (@methu) August 22, 2019