Ceph: How to place a pool on specific OSDs?

I have a Ceph cluster of 66 OSDs with a data_pool and a metadata_pool.

I would like to place the metadata_pool on 3 specific OSDs that are backed by SSDs, since the other 63 OSDs have older disks.

How can I force Ceph to place the metadata_pool on those specific OSDs?

Thanks in advance.

Firenze asked 23/9, 2019 at 10:24

You need a special CRUSH rule for your pool that defines which type of storage is to be used. There is a nice answer in the Proxmox forum.

It boils down to this:

Ceph knows whether a drive is an HDD or an SSD. This information can in turn be used to create a CRUSH rule that will place PGs only on that type of device.
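
A quick way to see what Ceph has detected (standard commands on current releases) is to list the known device classes and check the CLASS column of the OSD tree:

$ ceph osd crush class ls   # list the device classes currently known to the cluster
$ ceph osd tree             # the CLASS column shows the class assigned to each OSD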

The default rule shipped with Ceph is the replicated_rule:

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

So if your Ceph cluster contains both types of storage devices, you can create the new CRUSH rules with:

$ ceph osd crush rule create-replicated replicated_hdd default host hdd
$ ceph osd crush rule create-replicated replicated_ssd default host ssd

The newly created rule will look nearly the same. This is the hdd rule:

rule replicated_hdd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}

If your cluster does not contain any devices of the given class (hdd or ssd), the rule creation will fail.
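
Before creating the rule, a quick sanity check (assuming here that "ssd" is the class you intend to use) is to list the OSDs that currently carry that class:

$ ceph osd crush class ls-osd ssd   # prints the ids of all OSDs tagged with class "ssd"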

After this you will be able to set the new rule to your existing pool:

$ ceph osd pool set YOUR_POOL crush_rule replicated_ssd

The cluster will enter HEALTH_WARN and move the objects to their new place on the SSDs until the cluster is HEALTH_OK again.
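
You can watch the rebalancing with the usual status commands:

$ ceph -s            # overall health plus recovery/backfill progress
$ ceph osd df tree   # per-OSD utilization, to see the data landing on the SSD OSDs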

This feature (device classes) was added with Ceph 12.x aka Luminous.

Aparejo answered 23/9, 2019 at 13:21 Comment(1)
In our configuration we also have a few SSDs (besides a lot of HDDs), with data already on them. We would like to move the rgw default.rgw.buckets.index pool data to SSD. Can we set this rule on the already-used SSDs on the fly? Will this do the separation? Will the cluster be available in the meantime? – Woolly

I realize this is an older question, but it comes up in searches as an answer to the general "how do I separate OSDs into pools?" question, and thus I felt an expanded answer would be useful.

First and most important: "device class" is not actually a device class in Ceph; "device class" is nothing more than a label that separates OSDs from each other. This is exceptionally confusing because they have overloaded all of their terminology, but basically the fact that a spinning disk that uses magnetism is given the "device class" of "hdd" is MOSTLY irrelevant (see note below). It could have been given the device class of "fred" or "pizza" and made all the same difference to Ceph. There is no internal meaning to the "device classes" hdd, ssd or nvme beyond them being tags that are different from each other. These tags separate disks from one another. THAT IS IT.

The answer to how to separate different disks into different pools then becomes easy from the command line, once you realize that "hdd" doesn't mean spinning disk and "ssd" doesn't mean "disk on chip".

# Remove the current "device class" (label) on the OSDs I want to move to the new pool.
$> ceph osd crush rm-device-class osd.$OSDNUM

# Add a new "device class" (label) to the OSDs to move.
$> ceph osd crush set-device-class hdd2 osd.$OSDNUM

# Create a new crush rule for the newly labeled devices.
$> ceph osd crush rule create-replicated replicated_rule_hdd2 default host hdd2

# Assign the new CRUSH rule to the Ceph pool that should live on the new device class
# (the pool must already exist; creating it is covered below).
$> ceph osd pool set hdd2pool crush_rule replicated_rule_hdd2

In the Code Above:

  • $OSDNUM is the OSD identifier. When you do "ceph osd tree" it will show the OSDs on your hosts; each OSD will be named "osd.#" where # is a numeric identifier for the OSD. Probably didn't need to mention that, but let's call this "comprehensive" documentation.
  • hdd2 is a user-defined label for a new device class. As noted below, this can be ANYTHING you'd like it to be. This value is arbitrary and carries NO significance within Ceph at all. (See below.)
  • There must be AT LEAST one OSD known by Ceph on the new device class before running the "ceph osd crush rule" command. Otherwise you will get "Error EINVAL: device class does not exist". This error DOES NOT mean that the device class names are a list of known values, it means that Ceph couldn't find an OSD with that device class on it in the cluster already. Run "rm-device-class" and "set-device-class" first.
  • replicated_rule_hdd2 is a user-defined name for a new CRUSH ruleset. Without modification, you will likely have the rule "replicated_rule" already defined in your crushmap... you can use anything you want in place of this text EXCEPT the name of any existing rule you have in your crushmap.
  • hdd2pool is another arbitrarily defined name; this time it's the name of the pool in Ceph which will be set to use the new CRUSH rule.

The first two commands are simply removing and adding a distinct label to each OSD you want to create a new pool for.

The third command creates a Ceph "crushmap" rule that restricts placement to OSDs carrying the above "distinct label".

The fourth command tells an existing pool to use the new crushmap rule created by the third command above; note that this command does not create the pool itself (see the sketch just below).
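
If the pool does not exist yet, it can be created and pointed at the rule in one step. A minimal sketch, reusing the example names from above (the PG counts of 128 are placeholders; pick values appropriate for your cluster):

$> ceph osd pool create hdd2pool 128 128 replicated replicated_rule_hdd2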

Thus this boils down to:

  • Remove Label from an OSD
  • Create a new Label for the OSD
  • Assign the Label to a new Rule
  • Assign the Rule to a new Pool

Upon creating the pool with the rule assigned, Ceph will begin moving data around.
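
To confirm that the pool really ended up on the re-labeled OSDs (reusing the example names hdd2pool and hdd2 from above):

$> ceph osd pool get hdd2pool crush_rule   # should report replicated_rule_hdd2
$> ceph osd crush class ls-osd hdd2        # the OSD ids that carry the new label
$> ceph pg ls-by-pool hdd2pool             # once rebalancing finishes, the acting sets should only contain those OSDs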

NOTE on why I said "mostly irrelevant" above when describing the "device classes":

This is one more part of the confusion surrounding "device class" in Ceph.

When an OSD is created (and potentially when the OSD is re-scanned, such as after a reboot), Ceph, in an attempt to make things easier on the administrator, will automatically detect the type of drive behind the OSD. So if Ceph finds a slow "spinning rust" disk behind the OSD it will automagically assign it the label "hdd", whereas if it finds a "disk on chip" style drive it will assign it the label "ssd" or "nvme".

Because Ceph uses the term "device class" to refer to this label (a term which has a real technical meaning) and sets the device class to an identifier that also has real technical meaning, it incorrectly and confusingly makes it look like the identifier has actual meaning within the context of the Ceph software... that an HDD must be marked "hdd" so that Ceph can treat a "slow" disk in a special way, separately from a "fast" disk such as an SSD. (This is not the case.)

It further becomes confusing because upon re-scan, Ceph CAN CHANGE the device class BACK to what it detects the device type to be. If you install 3 OSDs on "class" hdd and 3 more on class "fred", it's possible at one point you will find all 6 devices in a pool associated with "hdd" and none in a pool associated with "fred" because Ceph has "helpfully" reassigned your disks for you.

This can be stopped by putting:

[osd]
osd_class_update_on_start = false

in the /etc/ceph/ceph.conf file.
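
On releases with the centralized config database (Mimic and later), the same option can also be set cluster-wide from the command line instead of editing ceph.conf on every node:

$> ceph config set osd osd_class_update_on_start false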

Thus the use of "mostly irrelevant" here: because while the labels (device class) have no real meaning to Ceph, the software can make it LOOK like the label has pertinence by forcing labels based upon auto-detection of real disk properties.

Phallic answered 23/9, 2022 at 0:1 Comment(0)