UIS Euler version: unified summary of issues such as GC not reclaiming or reclaiming too quickly

2025-02-08 17:28:43 Published

Problem Description

This case covers multiple issue scenarios:

1. GC reclamation is too fast and affects services

2. The UIS side has completed the release, but the onestor side has not triggered GC reclamation, so no space is freed. For example, ceph osd df shows high disk usage while the corresponding upper-layer storage usage is very low.

Process Analysis

I. The following analysis covers the GC reclamation process:

The UIS Euler version involves GC reclamation, so space release happens at multiple levels:

1. First, determine whether the foreground file system has released the space. If the foreground has not released it, the underlying storage cannot release it either:

Use tgt-admin -s to locate the corresponding rbd block, then check whether its used amount matches df -h (or the foreground storage pool usage). A significant discrepancy indicates that the foreground file system has not released the space.
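As a reference, a minimal sketch of this comparison (the pool name vmstorepool, image name vm-disk-001, and mount point /vms are hypothetical; substitute the values found on site):

tgt-admin -s                      # locate the backing store, which shows the rbd image behind the LUN
rbd du vmstorepool/vm-disk-001    # used size as seen by the storage layer
df -h /vms                        # used size as seen by the foreground file system
# If rbd du reports far more used space than df -h, the foreground file system has not released it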

Execute fstrim -v xxxxx to release space from the upper file system (refer to the fstrim documentation for usage details). This command may occupy some IO and can affect services. The operation releases capacity on the block device itself; when it completes, the amount of capacity released is displayed. Check again with rbd du xxxx, which should now match the results of df -h.
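A minimal usage sketch, again with the hypothetical mount point /vms and image vmstorepool/vm-disk-001:

fstrim -v /vms                    # discards unused blocks and prints how many bytes were trimmed
rbd du vmstorepool/vm-disk-001    # re-check; the used size should now roughly match df -h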

2. If the checks in the first step are consistent or the first step has been completed, you can then use ceph df to check the usage.

If a discrepancy is found here, it indicates that onestor has not yet marked the data as garbage and therefore will not trigger GC reclamation (onestor has a scheduled task for this; if urgent release is needed, follow this case). Manual marking is then required. Execute the following command on all storage nodes (it may occupy IO and impact services; if there is a maintenance window, run it on all nodes simultaneously, otherwise proceed node by node):

ceph daemon dse.`hostname` engine all full_compaction row

The above command marks garbage, so the garbage amount will visibly increase. The following command can be used alongside it to check whether garbage is accumulating (replace defaultDataPool with the on-site pool): ceph engine get_osd_garbage defaultDataPool capacity. Once the garbage amount does increase, GC reclamation is triggered automatically, which will also occupy IO. At that point you can monitor usage with watch ceph df; a decrease confirms that the operation succeeded.
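Putting the whole marking-and-monitoring sequence together, a minimal sketch assuming passwordless SSH between storage nodes and hypothetical node names node1, node2, node3 (replace with the on-site hostnames and pool name):

for node in node1 node2 node3; do
    ssh "$node" 'ceph daemon dse.$(hostname) engine all full_compaction row'   # mark garbage on every storage node
done
ceph engine get_osd_garbage defaultDataPool capacity   # confirm the garbage amount is increasing
watch ceph df                                          # usage dropping confirms GC reclamation is working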

II. The following analysis focuses on the principles of GC reclamation itself:

1. GC collection targets objects. Objects inherently contain garbage, some more and some less, so onestor classifies garbage into tiers (bins). Objects with more garbage are prioritized for processing, such as bin_9, while those with the least garbage are scheduled last, such as bin_-2. For details, see the output of the command below:

Command input: ceph daemon dse.`hostname` engine all gc dump | grep xxx -A 40 (replace xxx with the pool ID visible in ceph df)

As storage pool usage increases, more bin levels become eligible for reclamation. For example, when usage reaches 90%, even bin_0 garbage will be collected, which inevitably impacts services.
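For example, a minimal sketch of reading the bin distribution of a specific pool (the pool ID 2 is hypothetical; take the real ID from ceph df):

ceph df                                                        # note the ID of the data pool
ceph daemon dse.`hostname` engine all gc dump | grep 2 -A 40   # shows the bin_-2 .. bin_9 garbage tiers for that pool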

2. GC collection has a speed factor. When the speed is very high, IO usage becomes significant and services are severely affected; VM IO latency can reach several seconds, almost coming to a halt. So how do we confirm whether the issue is caused by GC collection? Check the current GC collection speed:

ceph daemon dse.`hostname` engine all gc dump | grep running_num (This builds on the previous command and checks the value of running_num. In version 882, for example, it ranges between 1-20, while in later versions such as 886 it ranges between 1-100. A higher value indicates faster GC reclamation. In version 882, values reaching 17 or 20 indicate very fast GC reclamation that consumes significant IO.)
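To watch the reclamation speed over time instead of sampling it once, something like the following can be used on the node being checked (a sketch based on the command above):

watch -n 5 'ceph daemon dse.`hostname` engine all gc dump | grep running_num'   # refreshes the running_num reading every 5 seconds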

3. The UIS side has fully released the space, but ceph osd df shows that a large amount is still in use and GC is not reclaiming it. In this case, a manual garbage scan is required to trigger GC reclamation and reduce OSD usage.

ceph daemon dse.`hostname` engine all full_compaction row (This command needs to be executed on all nodes)

 

Solution

Based on the above analysis of the reclamation behavior, there are two solutions.

The first is to directly reduce the GC reclamation speed:

watch -n 10 ceph daemon dse.`hostname` engine all gc state set concurrency 4 (the concurrency value is on a scale of 1-20, with a maximum of 20; 20 has a significant impact on services; the command only takes effect on a single node)
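Because the setting only takes effect on the node where it is run, it has to be applied on every storage node. A minimal sketch, assuming passwordless SSH and hypothetical node names node1, node2, node3:

for node in node1 node2 node3; do
    ssh "$node" 'ceph daemon dse.$(hostname) engine all gc state set concurrency 4'   # lower GC concurrency on each node
done

The watch -n 10 wrapper in the command above re-applies the value every 10 seconds on the local node; the same can be done inside the loop if needed.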

Second, directly prevent GC from reclaiming by making GC skip multiple bin levels, artificially lowering the perceived storage pool usage. Note that this does not actually reduce usage; it merely deceives GC. It can be executed on a single node and takes effect globally (note: the command in Method 1 only takes effect on the node where it is run).

ceph engine force-start-pool-gc defaultDataPool 50 set (This deceives GC into believing the storage pool usage is 50%)

To revert, change set to unset
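Put together, a minimal before/after sketch (the pool name defaultDataPool and the value 50 are taken from the example above; substitute the on-site pool):

ceph engine force-start-pool-gc defaultDataPool 50 set     # make GC believe the pool usage is 50%
# ... once the pressure on services has passed ...
ceph engine force-start-pool-gc defaultDataPool 50 unset   # restore normal GC behavior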

 

 
