CAS/UIS storage pool cannot start because ONEStor OSD usage is nearly full

Cloud Academy

2024-12-27 16:18:37 Published


Limengru

Network Topology

CAS/UIS using ONEStor distributed storage

Problem Description

All virtual machines are abnormal, and the storage pool cannot be started:

Internal error. OCFS2 configuration error, unknown reason. Please add the storage again.

Process Analysis

The shared storage pool vol1 reported that it could not start. Running ceph osd df tree to check disk usage showed that the storage utilization of osd.2 was 97%. This had reached the service threshold and left storage abnormal across the entire cluster, which affected the virtual machines.

At 97%, osd.2 had exceeded the default business interruption threshold of 95%.
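The utilization check in this analysis can be automated. The following is a minimal Python sketch, assuming that `ceph osd df --format json` returns a `nodes` list whose entries carry `name` and `utilization` (in percent) fields, as upstream Ceph does; ONEStor output may differ. The function name `osds_over_threshold` and the sample data are hypothetical and only mirror the figures in this case.

```python
import json

# Default business interruption threshold cited in this case.
FULL_THRESHOLD_PERCENT = 95.0


def osds_over_threshold(osd_df_json, threshold=FULL_THRESHOLD_PERCENT):
    """Return (name, utilization) pairs for OSDs at or above the threshold."""
    report = json.loads(osd_df_json)
    flagged = []
    for node in report.get("nodes", []):
        utilization = float(node.get("utilization", 0.0))
        if utilization >= threshold:
            flagged.append((node.get("name", "unknown"), utilization))
    # Highest utilization first so the worst OSD is listed at the top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample shaped like `ceph osd df --format json` output.
SAMPLE = json.dumps({
    "nodes": [
        {"name": "osd.0", "utilization": 71.2},
        {"name": "osd.1", "utilization": 68.9},
        {"name": "osd.2", "utilization": 97.0},
    ]
})

if __name__ == "__main__":
    for name, utilization in osds_over_threshold(SAMPLE):
        print(f"{name} is at {utilization:.1f}% "
              f"(threshold {FULL_THRESHOLD_PERCENT:.0f}%)")
```

On a live cluster, the JSON string would come from the command output instead of `SAMPLE`.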

Solution

1. Manually raise the threshold at which all OSD services are disabled to 98%.

2. Execute fstrim /vms/vol1 and fstrim /vms/vol2 to release storage space.
3. Monitor storage utilization in real time as space is released and wait until it drops to 95%.

4. After storage utilization dropped to 95%, the GUI still could not be loaded.
5. The virsh command run in the background CLI did not respond, indicating that the virtualization layer was stuck.

6. Running the ocfs2_kernel_stat_collect.sh script in the background CLI returned "Process have been hung up!", indicating that a process was stuck.

7. The OCFS2 log confirmed that a process was stuck and that there was a deadlock on the host.

8. Disable automatic startup of the storage pool and restart the hosts one at a time to restore services:
Rename the autostart file to autostart.bak, then restart the hosts one by one. Wait for each host to recover (it accepts SSH logins to the operating system) before restarting the next one. Restart the CVK hosts first, then the CVM.

9. Rename autostart.bak back to autostart, manually start the storage pool, and then start the virtual machines to restore services.
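The restart sequence in step 8 can be sketched in Python as follows. This is only an illustration under stated assumptions: the host inventory, the addresses, and the helper names `restart_order`, `ssh_reachable`, and `wait_until_recovered` are hypothetical, and "recovered" is approximated by a successful TCP connection to the SSH port. The sketch does not reboot anything; the restart itself remains a manual action.

```python
import socket
import time


def restart_order(hosts):
    """Order hosts so that all CVK nodes come before CVM nodes."""
    cvk = [h for h in hosts if h["role"] == "cvk"]
    cvm = [h for h in hosts if h["role"] == "cvm"]
    return cvk + cvm


def ssh_reachable(address, port=22, timeout=3.0):
    """Return True if a TCP connection to the SSH port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_recovered(address, probe=ssh_reachable, attempts=60, delay=10.0):
    """Poll the host until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(address):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    # Hypothetical inventory; replace with the real hosts of the cluster.
    hosts = [
        {"name": "cvm01", "address": "192.0.2.10", "role": "cvm"},
        {"name": "cvk01", "address": "192.0.2.11", "role": "cvk"},
        {"name": "cvk02", "address": "192.0.2.12", "role": "cvk"},
    ]
    for position, host in enumerate(restart_order(hosts), start=1):
        print(f"{position}. restart {host['name']}, then wait for SSH "
              f"on {host['address']} before continuing")
```

In practice, `wait_until_recovered(host["address"])` would be called after each manual restart, and the next host would be restarted only when it returns True.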

Conclusion
1. When OSD storage utilization reaches the 95% service threshold, the storage pool goes down and host I/O can no longer be written to the storage, so virtual machines become abnormal and cluster services go down.

2. If the storage link of a CVK becomes abnormal while that CVK holds the lock, the lock cannot be released. Other CVKs still need to acquire the lock to read and write storage data, which results in an OCFS2 deadlock. The affected hosts hang and must be restarted to recover.
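To act before the 95% threshold is reached, utilization can be classified against a warning level as well. The sketch below is an assumption-laden illustration: the 85% warning level is the upstream Ceph default nearfull ratio and may not match the ONEStor configuration, and the function names `classify_utilization` and `headroom` are hypothetical.

```python
NEARFULL_PERCENT = 85.0  # assumption: upstream Ceph default nearfull ratio
FULL_PERCENT = 95.0      # service threshold cited in this case


def classify_utilization(utilization, nearfull=NEARFULL_PERCENT, full=FULL_PERCENT):
    """Classify an OSD utilization percentage as ok, nearfull, or full."""
    if utilization >= full:
        return "full"
    if utilization >= nearfull:
        return "nearfull"
    return "ok"


def headroom(utilization, full=FULL_PERCENT):
    """Percentage points left before the full threshold (negative if exceeded)."""
    return round(full - utilization, 2)


if __name__ == "__main__":
    # The value for osd.2 comes from this case; the other two are illustrative.
    for name, utilization in [("osd.0", 71.2), ("osd.1", 88.0), ("osd.2", 97.0)]:
        print(name, classify_utilization(utilization), headroom(utilization))
```

An OSD reported as nearfull is a signal to release or expand space while the storage pool is still running.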
