In this article by Nick Frisk, author of the book Mastering Ceph, we will get acquainted with erasure coding. Ceph’s default replication level provides excellent protection against data loss by storing three copies of your data on different OSD’s. The chance of losing all three disks that contain the same objects within the period that it takes Ceph to rebuild from a failed disk, is verging on the extreme edge of probability. However, storing 3 copies of data vastly increases both the purchase cost of the hardware but also associated operational costs such as power and cooling. Furthermore, storing copies also means that for every client write, the backend storage must write three times the amount of data. In some scenarios, either of these drawbacks may mean that Ceph is not a viable option.
Erasure codes are designed to offer a solution. Much like how RAID 5 and 6 offer increased usable storage capacity over RAID1, erasure coding allows Ceph to provide more usable storage from the same raw capacity. However also like the parity based RAID levels, erasure coding brings its own set of disadvantages.
(For more resources related to this topic, see here.)
In this article you will learn:
- What is erasure coding and how does it work
- Details around Ceph’s implementation of erasure coding
- How to create and tune an erasure coded RADOS pool
- A look into the future features of erasure coding with Ceph Kraken release
What is erasure coding
Erasure coding allows Ceph to achieve either greater usable storage capacity or increase resilience to disk failure for the same number of disks versus the standard replica method. Erasure coding achieves this by splitting up the object into a number of parts and then also calculating a type of Cyclic Redundancy Check, the Erasure code, and then storing the results in one or more extra parts. Each part is then stored on a separate OSD. These parts are referred to as k and m chunks, where k refers to the number of data shards and m refers to the number of erasure code shards. As in RAID, these can often be expressed in the form k+m or 4+2 for example.
In the event of an OSD failure which contains an objects shard which isone of the calculated erasure codes, data is read from the remaining OSD’s that store data with no impact. However, in the event of an OSD failure which contains the data shards of an object, Ceph can use the erasure codes to mathematically recreate the data from a combination of the remaining data and erasure code shards.
The more erasure code shards you have, the more OSD failure’s you can tolerate and still successfully read data. Likewise the ratio of k to m shards each object is split into, has a direct effect on the percentage of raw storage that is required for each object.
A 3+1 configuration will give you 75% usable capacity, but only allows for a single OSD failure and so would not be recommended. In comparison a three way replica pool, only gives you 33% usable capacity.
4+2 configurations would give you 66% usable capacity and allows for 2 OSD failures. This is probably a good configuration for most people to use.
At the other end of the scale a 18+2 would give you 90% usable capacity and still allows for 2 OSD failures. On the surface this sounds like an ideal option, but the greater total number of shards comes at a cost. The higher the number of total shards has a negative impact on performance and also an increased CPU demand. The same 4MB object that would be stored as a whole single object in a replicated pool, is now split into 20 x 200KB chunks, which have to be tracked and written to 20 different OSD’s. Spinning disks will exhibit faster bandwidth, measured in MB/s with larger IO sizes, but bandwidth drastically tails off at smaller IO sizes. These smaller shards will generate a large amount of small IO and cause additional load on some clusters. Also its important not to forget that these shards need to be spread across different hosts according to the CRUSH map rules, no shard belonging to the same object can be stored on the same host as another shard from the same object. Some clusters may not have a sufficient number hosts to satisfy this requirement.
Reading back from these high chunk pools is also a problem. Unlike in a replica pool where Ceph can read just the requested data from any offset in an object, in an Erasure pool, all shards from all OSD’s have to be read before the read request can be satisfied. In the 18+2 example this can massively amplify the amount of required disk read ops and average latency will increase as a result. This behavior is a side effect which tends to only cause a performance impact with pools that use large number of shards. A 4+2 configuration in some instances will get a performance gain compared to a replica pool, from the result of splitting an object into shards.As the data is effectively striped over a number of OSD’s, each OSD is having to write less data and there is no secondary and tertiary replica’s to write.
How does erasure coding work in Ceph
As with Replication, Ceph has a concept of a primary OSD, which also exists when using erasure coded pools. The primary OSD has the responsibility of communicating with the client, calculating the erasure shards and sending them out to the remaining OSD’s in the Placement Group (PG) set. This is illustrated in the diagram below:
If an OSD in the set is down, the primary OSD, can use the remaining data and erasure shards to reconstruct the data, before sending it back to the client. During read operations the primary OSD requests all OSD’s in the PG set to send their shards. The primary OSD uses data from the data shards to construct the requested data, the erasure shards are discarded. There is a fast read option that can be enabled on erasure pools, which allows the primary OSD to reconstruct the data from erasure shards if they return quicker than data shards. This can help to lower average latency at the cost of slightly higher CPU usage. The diagram below shows how Ceph reads from an erasure coded pool:
The next diagram shows how Ceph reads from an erasure pool, when one of the data shards is unavailable. Data is reconstructed by reversing the erasure algorithm using the remaining data and erasure shards.
Algorithms and profiles
There are a number of different Erasure plugins you can use to create your erasure coded pool.
The default erasure plugin in Ceph is the Jerasure plugin, which is a highly optimized open source erasure coding library. The library has a number of different techniques that can be used to calculate the erasure codes. The default is Reed Solomon and provides good performance on modern processors which can accelerate the instructions that the technique uses. Cauchy is another technique in the library, it is a good alternative to Reed Solomon and tends to perform slightly better. As always benchmarks should be conducted before storing any production data on an erasure coded pool to identify which technique best suits your workload.
There are also a number of other techniques that can be used, which all have a fixed number of m shards. If you are intending on only having 2 m shards, then they can be a good candidate, as there fixed size means that optimization’s are possible lending to increased performance.
In general the jerasure profile should be prefer in most cases unless another profile has a major advantage, as it offers well balanced performance and is well tested.
The ISA library is designed to work with Intel processors and offers enhanced performance. It too supports both Reed Solomon and Cauchy techniques.
One of the disadvantages of using erasure coding in a distributed storage system is that recovery can be very intensive on networking between hosts. As each shard is stored on a separate host, recovery operations require multiple hosts to participate in the process. When the crush topology spans multiple racks, this can put pressure on the inter rack networking links. The LRC erasure plugin, which stands for Local Recovery Codes, adds an additional parity shard which is local to each OSD node. This allows recovery operations to remain local to the node where a OSD has failed and remove the need for nodes to receive data from all other remaining shard holding nodes. However the addition of these local recovery codes does impact the amount of usable storage for a given number of disks. In the event of multiple disk failures, the LRC plugin has to resort to using global recovery as would happen with the jerasure plugin.
SHingled Erasure Coding
The SHingled Erasure Coding (SHEC) profile is designed with similar goals to the LRC plugin, in that it reduces the networking requirements during recovery. However instead of creating extra parity shards on each node, SHEC shingles the shards across OSD’s in an overlapping fashion. The shingle part of the plugin name represents the way the data distribution resembles shingled tiles on a roof of a house. By overlapping the parity shards across OSD’s, the SHEC plugin reduces recovery resource requirements for both single and multiple disk failures.
Where can I use erasure coding
Since the Firefly release of Ceph in 2014, there has been the ability to create a RADOS pool using erasure coding. There is one major thing that you should be aware of, the erasure coding support in RADOS does not allow an object to be partially updated. You can write to an object in an erasure pool, read it back and even overwrite it whole, but you cannot update a partial section of it. This means that erasure coded pools can’t be used for RBD and CephFS workloads and is limited to providing pure object storage either via the Rados Gateway or applications written to use librados.
The solution at the time was to use the cache tiering ability which was released around the same time, to act as a layer above an erasure coded pools that RBD could be used. In theory this was a great idea, in practice, performance was extremely poor. Every time an object was required to be written to, the whole object first had to be promoted into the cache tier. This act of promotion probably also meant that another object somewhere in the cache pool was evicted. Finally the object now in the cache tier could be written to. This whole process of constantly reading and writing data between the two pools meant that performance was unacceptable unless a very high percentage of the data was idle.
During the development cycle of the Kraken release, an initial implementation for support for direct overwrites on n erasure coded pool was introduced. As of the final Kraken release, support is marked as experimental and is expected to be marked as stable in the following release. Testing of this feature will be covered later in this article.
Creating an erasure coded pool
Let’s bring our test cluster up again and switch into SU mode in Linux so we don’t have to keep prepending sudo to the front of our commands
Erasure coded pools are controlled by the use of erasure profiles, these control how many shards each object is broken up into including the split between data and erasure shards. The profiles also include configuration to determine what erasure code plugin is used to calculate the hashes.
The following plugins are available to use
<list of plugins>
To see a list of the erasure profiles run
# cephosd erasure-code-profile ls
You can see there is a default profile in a fresh installation of Ceph. Lets see what configuration options it contains
# cephosd erasure-code-profile get default
The default specifies that it will use the jerasure plugin with the Reed Solomon error correcting codes and will split objects into 2 data shards and 1 erasure shard. This is almost perfect for our test cluster, however for the purpose of this exercise we will create a new profile.
# cephosd erasure-code-profile set example_profile k=2 m=1 plugin=jerasure technique=reed_sol_van # cephosd erasure-code-profile ls
You can see our new example_profile has been created. Now lets create our erasure coded pool with this profile:
# cephosd pool create ecpool 128 128 erasure example_profile
The above command instructs Ceph to create a new pool called ecpool with a 128 PG’s. It should be an erasure coded pool and should use our “example_profile” we previously created.
Lets create an object with a small text string inside it and the prove the data has been stored by reading it back:
# echo "I am test data for a test object" | rados --pool ecpool put Test1 – # rados --pool ecpool get Test1 -
That proves that the erasure coded pool is working, but it’s hardly the most exciting of discoveries. Lets have a look to see if we can see what’s happening at a lower level.
First, find out what PG is holding the object we just created
# cephosd map ecpoolTest1
The result of the above command tells us that the object is stored in PG 3.40 on OSD’s1, 2 and 0. In this example Ceph cluster that’s pretty obvious as we only have 3 OSD’s, but in larger clusters that is a very useful piece of information.
We can now look at the folder structure of the OSD’s and see how the object has been split.
The PG’s will likely be different on your test cluster, so make sure the PG folder structure matches the output of the “cephosd map” command above.
ls -l /var/lib/ceph/osd/ceph-2/current/1.40s0_head/
# ls -l /var/lib/ceph/osd/ceph-1/current/1.40s1_head/
# ls -l /var/lib/ceph/osd/ceph-0/current/1.40s2_head/ total 4
Notice how the PG directory names have been appended with the shard number, replicated pools just have the PG number as their directory name. If you examine the contents of the object files, you will see our text string that we entered into the object when we created it. However due to the small size of the text string, Ceph has padded out the 2nd shard with null characters and the erasure shard hence will contain the same as the first. You can repeat this example with a new object containing larger amounts of text to see how Ceph splits the text into the shards and calculates the erasure code.
Overwrites on erasure code pools with Kraken
Introduced for the first time in the Kraken release of Cephas an experimental feature, was the ability to allow partial overwrites on erasure coded pools. Partial overwrite support allows RBD volumes to be created on erasure coded pools, making better use of raw capacity of the Ceph cluster.
In parity RAID, where a write request doesn’t span the entire stripe, a read modify write operation is required. This is needed as the modified data chunks will mean the parity chunk is now incorrect. The RAID controller has to read all the current chunks in the stripe, modify them in memory, calculate the new parity chunk and finally write this back out to the disk.
Ceph is also required to perform this read modify write operation, however the distributed model of Ceph increases the complexity of this operation.When the primary OSD for a PG receives a write request that will partially overwrite an existing object, it first works out which shards will be not be fully modified by the request and contacts the relevant OSD’s to request a copy of these shards. The primary OSD then combines these received shards with the new data and calculates the erasure shards. Finally the modified shards are sent out to the respective OSD’s to be committed. This entire operation needs to conform the other consistency requirements Ceph enforces, this entails the use of temporary objects on the OSD, should a condition arise that Ceph needs to roll back a write operation.
This partial overwrite operation, as can be expected, has a performance impact. In general the smaller the write IO’s, the greater the apparent impact. The performance impact is a result of the IO path now being longer, requiring more disk IO’s and extra network hops. However, it should be noted that due to the striping effect of erasure coded pools, in the scenario where full stripe writes occur, performance will normally exceed that of a replication based pool. This is simply down to there being less write amplification due to the effect of striping. If performance of an Erasure pool is not suitable, consider placing it behind a cache tier made up of a replicated pool.
Despite partial overwrite support coming to erasure coded pools in Ceph, not every operation is supported. In order to store RBD data on an erasure coded pool, a replicated pool is still required to hold key metadata about the RBD. This configuration is enabled by using the –data-pool option with the rbd utility.
Partial overwrite is also not recommended to be used with Filestore. Filestore lacks several features that partial overwrites on erasure coded pools uses, without these features extremely poor performance is experienced.
This feature requires the Kraken release or newer of Ceph. If you have deployed your test cluster with the Ansible and the configuration provided, you will be running Ceph Jewel release. The following steps show how to use Ansible to perform a rolling upgrade of your cluster to the Kraken release. We will also enable options to enable experimental options such as bluestore and support for partial overwrites on erasure coded pools.
Edit your group_vars/ceph variable file and change the release version from Jewel to Kraken.
ceph_conf_overrides: global: enable_experimental_unrecoverable_data_corrupting_features: "debug_white_box_testing_ec_overwrites bluestore"
And to correct a small bug when using Ansible to deploy Ceph Kraken, add:
debian_ceph_packages: - ceph - ceph-common - ceph-fuse
To the bottom of the file run the following Ansible playbook:
ansible-playbook -K infrastructure-playbooks/rolling_update.yml
Ansible will prompt you to make sure that you want to carry out the upgrade, once you confirm by entering yes the upgrade process will begin.
Once Ansible has finished, all the stages should be successful as shown below:
Your cluster has now been upgraded to Kraken and can be confirmed by running ceph -v on one of yours VM’s running Ceph.
As a result of enabling the experimental options in the configuration file, every time you now run a Ceph command, you will be presented with the following warning.
This is designed as a safety warning to stop you running these options in a live environment, as they may cause irreversible data loss. As we are doing this on a test cluster, that is fine to ignore, but should be a stark warning not to run this anywhere near live data.
The next command that is required to be run is to enable the experimental flag which allows partial overwrites on erasure coded pools. DO NOT RUN THIS ON PRODUCTION CLUSTERS
cephosd pool get ecpooldebug_white_box_testing_ec_overwrites true
Double check you still have your erasure pool called ecpool and the default RBD pool
# cephosdlspools 0 rbd,1ecpool,
And now create the rbd. Notice that the actual RBD header object still has to live on a replica pool, but by providing an additional parameter we can tell Ceph to store data for this RBD on an erasure coded pool.
rbd create Test_On_EC --data-pool=ecpool --size=1G
The command should return without error and you now have an erasure coded backed RBD image. You should now be able to use this image with any librbd application.
Note: Partial overwrites on Erasure pools require Bluestore to operate efficiently. Whilst Filestore will work, performance will be extremely poor.
Troubleshooting the 2147483647 error
An example of this error is shown below when running the ceph health detail command.
If you see 2147483647 listed as one of the OSD’s for an erasure coded pool, this normally means that CRUSH was unable to find a sufficient number of OSD’s to complete the PG peering process. This is normally due to the number of k+m shards being larger than the number of hosts in the CRUSH topology. However, in some cases this error can still occur even when the number of hosts is equal or greater to the number of shards. In this scenario it’s important to understand how CRUSH picks OSD’s as candidates for data placement. When CRUSH is used to find a candidate OSD for a PG, it applies the crushmap to find an appropriate location in the crush topology. If the result comes back as the same as a previous selected OSD, Ceph will retry to generate another mapping by passing slightly different values into the crush algorithm. In some cases if there is a similar number of hosts to the number of erasure shards, CRUSH may run out of attempts before it can suitably find correct OSD mappings for all the shards. Newer versions of Ceph has mostly fixed these problems by increasing the CRUSH tunable choose_total_tries.
Reproducing the problem
In order to aid understanding of the problem in more detail, the following steps will demonstrate how to create an erasure coded profile that will require more shards than our 3 node cluster can support.
Firstly, like earlier in the articlecreate a new erasure profile, but modify the k/m parameters to be k=3 m=1:
$ cephosd erasure-code-profile set broken_profile k=3 m=1 plugin=jerasure technique=reed_sol_van
And now create a pool with it:
$ cephosd pool create broken_ecpool 128 128 erasure broken_profile
If we look at the output from ceph -s, we will see that the PG’s for this new pool are stuck in the creating state.
The output of ceph health detail, shows the reason why and we see the 2147483647 error.
If you encounter this error and it is a result of your erasure profile being larger than your number of hosts or racks, depending on how you have designed your crushmap. Then the only real solution is to either drop the number of shards, or increase number of hosts.
In this article you have learnt what erasure coding is and how it is implemented in Ceph. You should also have an understanding of the different configuration options possible when creating erasure coded pools and their suitability for different types of scenarios and workloads.
Resources for Article:
Further resources on this subject:
- Ceph Instant Deployment [article]
- Working with Ceph Block Device [article]
- GNU Octave: Data Analysis Examples [article]