
In this article by Wasim Ahmed, author of the book Proxmox Cookbook, we will cover topics such as local storage, shared storage, and Ceph storage, along with a recipe that shows you how to configure Ceph RBD storage.


A storage is where the virtual disk images of virtual machines reside. There are many types of storage systems, with different features, performance characteristics, and use case scenarios. Whether it is a local storage configured with direct attached disks or a shared storage with hundreds of disks, the main responsibility of a storage is to hold virtual disk images, templates, backups, and so on. Proxmox supports several storage types, such as NFS, Ceph, GlusterFS, and ZFS, and different storage types can hold different types of data.

For example, a local storage can hold any type of data, such as disk images, ISO/container templates, and backup files. A Ceph storage, on the other hand, can only hold a .raw format disk image. In order to provide the right type of storage for the right scenario, it is vital to have a proper understanding of the different storage types. The full details of each storage are beyond the scope of this article, but we will look at how to connect them to Proxmox and maintain a storage system for VMs.

Storages can be divided into two main categories:

  • Local storage
  • Shared storage

Local storage

Any storage that resides in the node itself, using directly attached disks, is known as a local storage. This type of storage has no redundancy other than a RAID controller managing an array. If the node itself fails, the storage becomes completely inaccessible. Live migration of a VM stored on local storage is not possible, because during migration the virtual disk of the VM would have to be copied entirely to another node.

A VM can only be live-migrated when there are several Proxmox nodes in a cluster and the virtual disk is stored on a shared storage accessed by all the nodes in the cluster.

Shared storage

A shared storage is one that is available to all the nodes in a cluster through some form of network media. In a virtual environment with shared storage, the actual virtual disk of a VM may be stored on the shared storage while the VM itself runs on another Proxmox host node. With shared storage, live migration of a VM becomes possible without powering down the VM. Multiple Proxmox nodes can share one shared storage, and VMs can be moved between nodes since their virtual disks remain on the shared storage. Usually, a few dedicated nodes with their own resources are used to configure a shared storage, rather than sharing the resources of a Proxmox node that could be used to host VMs.

In recent releases, Proxmox has added some new storage plugins that allow users to take advantage of some great storage systems and integrate them with the Proxmox environment. Most of the storage configuration can be performed through the Proxmox GUI.

Ceph storage

Ceph is a powerful distributed storage system, which provides the RADOS Block Device (RBD), the Ceph filesystem (CephFS), and Ceph object storage. Ceph is built with a very high level of reliability, scalability, and performance in mind. A Ceph cluster can be expanded to several petabytes without compromising data integrity and can be configured using commodity hardware. Any data written to the storage gets replicated across the Ceph cluster. Ceph was originally designed with big data in mind; unlike other types of storage, the bigger a Ceph cluster becomes, the higher the performance. However, it can also be used in small environments just as easily for data redundancy. Lower performance can be mitigated by using SSDs to store Ceph journals. Refer to the OSD Journal subsection in this section for information on journals.

The built-in self-healing features of Ceph provide unprecedented resilience without a single point of failure. In a multinode Ceph cluster, the storage can tolerate not just hard drive failure, but also an entire node failure without losing data. Currently, only an RBD block device is supported in Proxmox.

Ceph comprises a few components that are crucial to understand in order to configure and operate the storage:

  • Monitor daemon (MON)
  • Object Storage Daemon (OSD)
  • OSD Journal
  • Metadata Server (MDS)
  • Controlled Replication Under Scalable Hashing map (CRUSH map)
  • Placement Group (PG)
  • Pool

MON

Monitor daemons form the quorum for a Ceph distributed cluster. There must be a minimum of three monitor daemons configured on separate nodes for each cluster. Monitor daemons can also be configured as virtual machines instead of using physical nodes. Monitors require a very small amount of resources to function, so the allocated resources can be very small. A monitor can be set up through the Proxmox GUI after the initial cluster creation.
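Once a cluster is up, the monitor quorum can be verified from any Ceph node using the standard Ceph CLI; these are stock Ceph commands, independent of Proxmox:

# ceph mon stat
# ceph quorum_status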

OSD

Object Storage Daemons (OSDs) are responsible for the storage and retrieval of actual cluster data. Usually, each physical storage device, such as an HDD or SSD, is configured as a single OSD. Although several OSDs can be configured on a single physical disk, it is not recommended for any production environment. Each OSD requires a journal device, where data first gets written and later gets transferred to the actual OSD. By storing journals on fast-performing SSDs, we can increase Ceph I/O performance significantly.

Thanks to the Ceph architecture, as more and more OSDs are added to the cluster, the I/O performance also increases. An SSD journal works very well on small clusters with about eight OSDs per node. OSDs can be set up through the Proxmox GUI after the initial MON creation.

OSD Journal

Every single piece of data destined for a Ceph OSD first gets written to a journal. The journal allows OSD daemons to commit small writes quickly while giving the actual drives more time to commit the writes. In simpler terms, all data gets written to journals first; the journal then sends the data to the actual drive for permanent storage. So, if the journal is kept on a fast-performing drive, such as an SSD, incoming data will be written at a much higher speed, while behind the scenes the slower-performing SATA drives can commit the writes at their own pace. Journals on SSDs can really improve the performance of a Ceph cluster, especially if the cluster is small, with only a few terabytes of data.

It should also be noted that a journal failure will take down all the OSDs whose journals are kept on that journal drive. In some environments, it may be necessary to mirror two SSDs in RAID 1 and use them for journaling. In a large environment with more than 12 OSDs per node, performance can actually be gained by collocating each journal on its own OSD drive instead of using a shared SSD journal.
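As a sketch of how a journal is placed on a separate device, an OSD can be created from the CLI with its journal on an SSD; the -journal_dev option name and the device paths here are assumptions to be adjusted for your hardware:

# pveceph createosd /dev/sdb -journal_dev /dev/sdc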

MDS

The Metadata Server (MDS) daemon is responsible for providing the Ceph filesystem (CephFS) in a Ceph distributed storage system. MDS can be configured on separate nodes, or it can coexist with already-configured monitor nodes or virtual machines. Although CephFS has come a long way, it is still not fully recommended for use in a production environment; that said, many virtual environments actively run MDS and CephFS without any issues. Currently, it is not recommended to configure more than two MDSs in a Ceph cluster. CephFS is not currently supported by a Proxmox storage plugin; however, it can be mounted locally and then connected to a Proxmox cluster through the Directory storage. MDS cannot be set up through the Proxmox GUI as of version 3.4.
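For illustration, such a local mount might look like the following; the monitor address and the secret file path are assumptions for this sketch, and the kernel CephFS client must be available on the node:

# mkdir -p /mnt/cephfs
# mount -t ceph 172.16.0.71:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

The /mnt/cephfs directory can then be added to Proxmox as a Directory storage.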

CRUSH map

A CRUSH map is the heart of the Ceph distributed storage. The algorithm for storing and retrieving user data in a Ceph cluster is laid out in the CRUSH map. CRUSH allows a Ceph client to directly access an OSD, which eliminates a single point of failure and any physical limitation of scalability, since there are no centralized servers or controllers to manage data in and out. Throughout the cluster, CRUSH maintains a map of all MONs and OSDs, and it determines how data should be chunked and replicated among OSDs spread across several local nodes or even nodes located remotely.

A default CRUSH map is created on a freshly installed Ceph cluster. This can be further customized based on user requirements. For smaller Ceph clusters, this map should work just fine. However, when Ceph is deployed with very big data in mind, this map should be customized. A customized map will allow better control of a massive Ceph cluster.

To operate Ceph clusters of any size successfully, a clear understanding of the CRUSH map is mandatory.

For more details on the Ceph CRUSH map, visit http://ceph.com/docs/master/rados/operations/crush-map/ and http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map.

As of Proxmox VE 3.4, we cannot customize the CRUSH map through the Proxmox GUI. It can only be viewed through the GUI and edited through a CLI.
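The usual CLI round trip for such an edit uses the standard Ceph tools; the file names below are arbitrary:

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
(edit crushmap.txt with any text editor)
# crushtool -c crushmap.txt -o crushmap-new.bin
# ceph osd setcrushmap -i crushmap-new.bin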

PG

In a Ceph storage, data objects are aggregated into groups determined by the CRUSH algorithm. This is known as a Placement Group (PG), since CRUSH places each group on various OSDs depending on the replication level set in the CRUSH map and the number of OSDs and nodes. By tracking groups of objects instead of individual objects, a massive amount of hardware resources is saved; it would be impossible to track millions of individual objects in a cluster. The following diagram shows how objects are aggregated into groups and how PGs relate to OSDs:

[Diagram: objects aggregated into Placement Groups and mapped to OSDs]

To balance available hardware resources, it is necessary to assign the right number of PGs. The number of PGs should vary depending on the number of OSDs in a cluster. The following is a table of PG suggestions made by Ceph developers:

  Number of OSDs        Number of PGs
  Less than 5 OSDs      128
  Between 5-10 OSDs     512
  Between 10-50 OSDs    4096

Selecting the proper number of PGs is crucial, since each PG consumes node resources. Too many PGs for the number of OSDs will penalize the resource usage of an OSD node, while too few PGs in a large cluster will put data at risk. A rule of thumb is to start with the lowest number of PGs possible and increase it as the number of OSDs increases.
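The Ceph documentation also suggests a simple starting formula as a worked example: multiply the number of OSDs by 100, divide by the replica count, and round up to the nearest power of two. For instance, with 6 OSDs and 2 replicas, (6 × 100) / 2 = 300, which rounds up to 512 PGs.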

For details on Placement Groups, visit http://ceph.com/docs/master/rados/operations/placement-groups/.

There’s a great PG calculator created by Ceph developers to calculate the recommended number of PGs for various sizes of Ceph clusters at http://ceph.com/pgcalc/.

Pools

Pools in Ceph are like partitions on a hard drive. We can create multiple pools on a Ceph cluster to separate stored data. For example, a pool named accounting can hold all the accounting department data, while another pool stores the human resources data of a company. When creating a pool, assigning the number of PGs is necessary. During the initial Ceph configuration, three default pools are created: data, metadata, and rbd. Deleting a pool deletes all stored objects permanently.
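As a concrete sketch, the accounting pool from this example could be created, listed, and removed from the CLI with the standard Ceph tools:

# ceph osd pool create accounting 128
# rados lspools
# ceph osd pool delete accounting accounting --yes-i-really-really-mean-it

Note the deliberately awkward delete syntax; Ceph requires the pool name twice plus the confirmation flag precisely because deletion is irreversible.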

For details on Ceph and its components, visit http://ceph.com/docs/master/.

The following diagram shows a basic Proxmox+Ceph cluster:

[Diagram: a basic Proxmox+Ceph cluster]

The preceding diagram shows four Proxmox nodes, three monitor nodes, three OSD nodes, and two MDS nodes comprising a Proxmox+Ceph cluster. Note that Ceph is on a different network than the Proxmox public network. Depending on the replication level set, each incoming data object needs to be written more than once, which causes high bandwidth usage. By separating Ceph onto a dedicated network, we can ensure that the Ceph network can fully utilize the bandwidth.

On advanced clusters, a third network is created between the Ceph nodes only, for cluster replication, which improves network performance even further. As of Proxmox VE 3.4, the same node can be used for both Proxmox and Ceph, which provides a great way to manage all the nodes from the same Proxmox GUI. However, it is not advisable to put Proxmox VMs on a node that is also configured for Ceph. During day-to-day operations, Ceph nodes do not consume large amounts of resources, such as CPU or memory. But when Ceph goes into rebalancing mode due to an OSD or node failure, a large amount of data replication occurs, which takes up a lot of resources; performance will degrade significantly if resources are shared by both VMs and Ceph.

Ceph RBD storage can only store .raw virtual disk image files.

Ceph itself does not come with a GUI to manage it, so having the option to manage Ceph nodes through the Proxmox GUI makes administrative tasks much easier. Refer to the Monitoring the Ceph storage subsection under the How to do it… section of the Connecting the Ceph RBD storage recipe later in this article to learn how to install a great read-only GUI to monitor Ceph clusters.

Connecting the Ceph RBD storage

In this recipe, we are going to see how to configure a Ceph block storage with a Proxmox cluster.

Getting ready

The initial Ceph configuration on a Proxmox cluster must be performed through a CLI. After the Ceph installation, the initial configuration, and the creation of one monitor, all other tasks can be accomplished through the Proxmox GUI.

How to do it…

We will now see how to configure the Ceph block storage with Proxmox.

Installing Ceph on Proxmox

Ceph is not installed by default. Prior to configuring a Proxmox node for the Ceph role, Ceph needs to be installed and the initial configuration must be created through a CLI.

The following steps need to be performed on all Proxmox nodes that will be part of the Ceph cluster:

  1. Log in to each node through SSH or a console.
  2. Configure a second network interface to create a separate Ceph network with a different subnet (a minimal interface example follows this list).
  3. Reboot the nodes to initialize the network configuration.
  4. Using the following command, install the Ceph package on each node:
    # pveceph install --version giant
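For step 2, a minimal /etc/network/interfaces entry for the dedicated Ceph interface might look like the following sketch; the interface name and addressing are assumptions chosen to match the monitor addresses used later in this recipe:

auto eth1
iface eth1 inet static
        address 172.16.0.71
        netmask 255.255.255.0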
Initializing the Ceph configuration

Before Ceph is usable, we have to create the initial Ceph configuration file on one Proxmox+Ceph node.

The following steps need to be performed only on one Proxmox node that will be part of the Ceph cluster:

  1. Log in to the node using SSH or a console.
  2. Run the following command to create the initial Ceph configuration:
    # pveceph init --network <ceph_subnet>/CIDR
  3. Run the following command to create the first monitor:
    # pveceph createmon
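As a concrete example, assuming the 172.16.0.0/24 subnet that matches the monitor addresses used later in this recipe, the two commands would be:

# pveceph init --network 172.16.0.0/24
# pveceph createmon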

Configuring Ceph through the Proxmox GUI

After the initial Ceph configuration and the creation of the first monitor, we can continue with further Ceph configurations through the Proxmox GUI or simply run the Ceph Monitor creation command on other nodes.

The following steps show how to create Ceph Monitors and OSDs from the Proxmox GUI:

  1. Log in to the Proxmox GUI as root or as any other user with administrative privileges.
  2. Select a node where the initial monitor was created in previous steps, and then click on Ceph from the tabbed menu. The following screenshot shows a Ceph cluster as it appears after the initial Ceph configuration:

    [Screenshot: Ceph cluster status after the initial configuration]

    Since no OSDs have been created yet, it is normal for a new Ceph cluster to show a PGs stuck and unclean error.

  3. Click on Disks on the bottom tabbed menu under Ceph to display the disks attached to the node, as shown in the following screenshot:

    [Screenshot: the Disks tab listing the disks attached to the node]

  4. Select an available attached disk, then click on the Create: OSD button to open the OSD dialog box, as shown in the following screenshot:

    [Screenshot: the Create: OSD dialog box]

  5. Click on the Journal Disk drop-down menu to select a different device or collocate the journal on the same OSD by keeping it as the default.
  6. Click on Create to finish the OSD creation.
  7. Create additional OSDs on Ceph nodes as needed.

The following screenshot shows a Proxmox node with three OSDs configured:

[Screenshot: a Proxmox node with three OSDs configured]

By default, Proxmox creates OSDs with an ext3 partition. However, it may sometimes be necessary to create OSDs with a different partition type, due to a requirement or for performance reasons. Use the following command format through the CLI to create an OSD with a different partition type:

# pveceph createosd --fstype ext4 /dev/sdX

The following steps show how to create Monitors through the Proxmox GUI:

  1. Click on Monitor from the tabbed menu under the Ceph feature. The following screenshot shows the Monitor status with the initial Ceph Monitor we created earlier in this recipe:

    [Screenshot: the Monitor status showing the initial Ceph monitor]

  2. Click on Create to open the Monitor dialog box.
  3. Select a Proxmox node from the drop-down menu.
  4. Click on the Create button to start the monitor creation process.
  5. Create a total of three Ceph monitors to establish a Ceph quorum.

The following screenshot shows the Ceph status with three monitors and OSDs added:

[Screenshot: Ceph status with three monitors and OSDs added]

Note that even with three OSDs added, the PGs are still stuck with errors. This is because, by default, CRUSH is set up for two replicas, and so far we have only created OSDs on one node. For successful replication, we need to add OSDs on a second node so that data objects can be replicated twice. Follow the steps described earlier to create three additional OSDs on the second node. After creating three more OSDs, the Ceph status should look like the following screenshot:

[Screenshot: Ceph status after creating OSDs on the second node]

Managing Ceph pools

It is possible to perform basic tasks, such as creating and removing Ceph pools, through the Proxmox GUI. Besides these, we can check the list, status, number of PGs, and usage of the Ceph pools.

The following steps show how to check, create, and remove Ceph pools through the Proxmox GUI:

  1. Click on the Pools tabbed menu under Ceph in the Proxmox GUI. The following screenshot shows the status of the default rbd pool, which has a replica size of 1, 256 PGs, and 0% usage:

    [Screenshot: the Pools tab showing the default rbd pool]

  2. Click on Create to open the pool creation dialog box.
  3. Fill in the required information, such as the name of the pool, replica size, and number of PGs. Unless the CRUSH map has been fully customized, the ruleset should be left at the default value 0.
  4. Click on OK to create the pool.
  5. To remove a pool, select the pool and click on Remove. Remember that once a Ceph pool is removed, all the data stored in this pool is deleted permanently.

To increase the number of PGs, run the following commands through the CLI:

# ceph osd pool set <pool_name> pg_num <value>
# ceph osd pool set <pool_name> pgp_num <value>

It is only possible to increase the PG value. Once increased, the PG value can never be decreased.
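For example, to grow the default rbd pool from its initial 256 PGs to 512, raise pg_num and then bring pgp_num up to match:

# ceph osd pool set rbd pg_num 512
# ceph osd pool set rbd pgp_num 512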

Connecting RBD to Proxmox

Once a Ceph cluster is fully configured, we can proceed to attach it to the Proxmox cluster.

During the initial configuration file creation, Ceph also creates an authentication keyring at the /etc/ceph/ceph.client.admin.keyring path.

This keyring needs to be copied and renamed to match the name of the storage ID to be created in Proxmox. Run the following commands to create a directory and copy the keyring:

# mkdir /etc/pve/priv/ceph

# cd /etc/ceph/

# cp ceph.client.admin.keyring /etc/pve/priv/ceph/<storage>.keyring

For our storage, we are naming it rbd.keyring. After the keyring is copied, we can attach the Ceph RBD storage to Proxmox using the GUI:
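Since the storage ID used in this recipe is rbd, the copy command from the previous step becomes:

# cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/rbd.keyring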

  1. Click on Datacenter, then click on Storage from the tabbed menu.
  2. Click on the Add drop-down menu and select the RBD storage plugin.
  3. Enter the information as described in the following table:

    Item           Type of value                                               Entered value
    ID             The name of the storage.                                    rbd
    Pool           The name of the Ceph pool.                                  rbd
    Monitor Host   The IP addresses and port numbers of the Ceph MONs;         172.16.0.71:6789;172.16.0.72:6789;172.16.0.73:6789
                   multiple MON hosts can be entered for redundancy.
    User name      The default Ceph administrator.                             admin
    Nodes          The Proxmox nodes that will be able to use the storage.     All
    Enable         The checkbox for enabling/disabling the storage.            Enabled

  4. Click on Add to attach the RBD storage. The following screenshot shows the RBD storage under Summary:

    [Screenshot: the RBD storage under Summary]
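Alternatively, the same storage can be attached from the CLI with the pvesm storage manager. The following is a sketch that assumes the rbd storage type options mirror the GUI fields in the preceding table:

# pvesm add rbd rbd -monhost "172.16.0.71:6789;172.16.0.72:6789;172.16.0.73:6789" -pool rbd -username admin -content images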

Monitoring the Ceph storage

Ceph itself does not come with any GUI to manage or monitor the cluster, though we can view the cluster status and perform various Ceph-related tasks through the Proxmox GUI. There are also several third-party tools that provide a Ceph-only GUI to manage and monitor the cluster. Some provide management features, while others are read-only monitors. Ceph Dash is one such tool, providing an appealing read-only GUI to monitor the entire Ceph cluster without logging in to the Proxmox GUI. Ceph Dash is freely available through GitHub. There are other, heavier-weight Ceph GUI dashboards, such as Kraken and Calamari. In this section, we are only going to see how to set up the Ceph Dash cluster monitoring GUI.

The following steps can be used to download and start Ceph Dash to monitor a Ceph cluster using any browser:

  1. Log in to any Proxmox node that is also a Ceph MON.
  2. Run the following commands to download and start the dashboard:
    # mkdir /home/tools
    # apt-get install git
    # cd /home/tools
    # git clone https://github.com/Crapworks/ceph-dash
    # cd /home/tools/ceph-dash
    # ./ceph_dash.py
  3. Ceph Dash will now start listening on port 5000 of the node. If the node is behind a firewall, open port 5000 or set up port forwarding in the firewall.
  4. Open any browser and enter <node_ip>:5000 to open the dashboard. The following screenshot shows the dashboard of the Ceph cluster we have created:

    [Screenshot: the Ceph Dash dashboard]

We can also monitor the status of the Ceph cluster through a CLI using the following commands:

  1. To check the Ceph status:
    # ceph -s
  2. To view OSDs in different nodes:
    # ceph osd tree
  3. To display real-time Ceph logs:
    # ceph -w
  4. To display a list of Ceph pools:
    # rados lspools
  5. To change the number of replicas of a pool:
    # ceph osd pool set <pool_name> size <value>
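For example, to keep two copies of every object in the rbd pool used in this recipe, and then verify the setting:

# ceph osd pool set rbd size 2
# ceph osd pool get rbd size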

Besides the preceding commands, there are many more CLI commands to manage Ceph and perform advanced tasks. The Ceph official documentation has a wealth of information and how-to guides along with the CLI commands to perform them. The documentation can be found at http://ceph.com/docs/master/.

How it works…

At this point, we have successfully integrated a Ceph cluster comprising six OSDs, three MONs, and three nodes with our Proxmox cluster. By viewing the Ceph Status page, we can get a lot of information about the Ceph cluster at a quick glance. From the previous figure, we can see that there are 256 PGs in the cluster and that the total cluster storage space is 1.47 TB. A healthy cluster shows the PG status as active+clean. Depending on the nature of an issue, the PGs can be in various states, such as active+unclean, inactive+degraded, active+stale, and so on.

To learn details about all the states, visit http://ceph.com/docs/master/rados/operations/pg-states/.

By configuring a second network interface, we can separate a Ceph network from the main network.

The pveceph init command creates a Ceph configuration file at /etc/pve/ceph.conf. A newly configured Ceph configuration file looks similar to the following screenshot:

[Screenshot: a newly created ceph.conf]
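As a rough sketch of what such a file contains (the values are illustrative, using the subnet assumed earlier in this recipe; the generated fsid and any additional keys will differ on your cluster):

[global]
        auth client required = cephx
        auth cluster required = cephx
        auth service required = cephx
        cluster network = 172.16.0.0/24
        public network = 172.16.0.0/24
        fsid = <generated-uuid>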

Since the ceph.conf configuration file is stored in pmxcfs, any changes made to it are immediately replicated to all the Proxmox nodes in the cluster.

As of Proxmox VE 3.4, Ceph RBD can only store a .raw image format. No templates, containers, or backup files can be stored on the RBD block storage.

Here is the content of a storage configuration file after adding the Ceph RBD storage:

rbd: rbd
   monhost 172.16.0.71:6789;172.16.0.72:6789;172.16.0.73:6789
   pool rbd
   content images
   username admin
If a situation dictates an IP address change for any node, we can simply edit the monhost line in this configuration file to manually update the IP addresses of the Ceph MON nodes.

Summary

In this article, we went over the different categories of storage and got hands-on practice with the various stages of configuring the Ceph RBD storage.
