I made a few simple tweaks to btrfs-snap to help alleviate this problem. I added a check to make sure only one instance of the btrfs-snap script can be actively removing snapshots at a time. I also added a delay between snapshot removals.
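Roughly, the two tweaks look like the sketch below. It is an illustration, not the actual btrfs-snap code: the names LOCKFILE, DELAY, remove_snapshot and prune_snapshots are made up for this example, and the removal itself is stubbed out with an echo so nothing touches a real file system.

```shell
#!/bin/sh
# Hypothetical names throughout; none of these come from btrfs-snap itself.
LOCKFILE="${LOCKFILE:-/tmp/btrfs-snap-remove.lock}"   # sentinel file (assumed path)
DELAY="${DELAY:-5}"                                   # seconds to wait between removals

# Stand-in for the real removal, which would be something like
# `btrfs subvolume delete "$1"`.
remove_snapshot() {
    echo "would remove: $1"
}

# Remove the given snapshots one at a time, unless another instance
# already holds the sentinel file.
prune_snapshots() {
    # With noclobber (set -C) the redirect fails if the file already exists,
    # so testing for the sentinel and creating it happen in a single step.
    if ! ( set -C; echo "$$" > "$LOCKFILE" ) 2>/dev/null; then
        echo "another instance is removing snapshots; skipping" >&2
        return 1
    fi
    for snap in "$@"; do
        remove_snapshot "$snap"
        sleep "$DELAY"    # give the file system a breather between removals
    done
    rm -f "$LOCKFILE"
}
```

Note that if a removal hangs, the sentinel stays in place, so later runs skip pruning instead of piling up behind the stuck one.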
This version of the script has been running on my laptop for the last few days, keeping a dozen snapshots at five-minute intervals without any problems. With the original script, btrfs-snap processes would start getting gummed up within the first few hours.
I’ve been running this script for a little over two weeks now, and I ran into my first runaway snapshot situation last night. Snapshot removal was hanging, and by the time I noticed, I had over 500 extra snapshots of each volume, for a total of more than 1650 snapshots on the file system.
After a reboot, snapshots could be removed again. Early on, the removals took over 30 seconds each, and disk I/O slowed to an absolute crawl. I don’t really want to be stuck with this many snapshots again…
I added a check to btrfs-snap that skips snapshot creation if too many snapshots with the same prefix already exist.
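A check like that could be sketched as follows. The function names and the example directory, prefix and limit are assumptions for illustration, not what btrfs-snap actually uses.

```shell
#!/bin/sh
# Hypothetical helpers; names and arguments are illustrative only.

# count_snapshots DIR PREFIX
# Prints the number of entries in DIR whose names start with PREFIX.
count_snapshots() {
    dir=$1
    prefix=$2
    n=0
    for entry in "$dir/$prefix"*; do
        # An unmatched glob stays literal, so only count entries that exist.
        [ -e "$entry" ] && n=$((n + 1))
    done
    echo "$n"
}

# too_many_snapshots DIR PREFIX MAX
# Succeeds when more than MAX matching snapshots already exist.
too_many_snapshots() {
    [ "$(count_snapshots "$1" "$2")" -gt "$3" ]
}

# Example use before creating a new snapshot (path and limit are made up):
#   if too_many_snapshots /home/.snapshot 5min_ 50; then
#       echo "too many snapshots already; not creating another" >&2
#       exit 1
#   fi
```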
I seem to be getting gummed up more often lately, probably every few days. The file system isn’t getting clogged with huge numbers of extra snapshots anymore, but by the time I notice that things have gone wrong, my process table usually has a few thousand btrfs-snap processes sitting around.
They’re getting hung up trying to count snapshots. It seems that it isn’t possible to get an ls of the .snapshot directory while btrfs is in the middle of failing to remove a snapshot. I moved the check for the sentinel file up a bit so that the script creates the lock before counting snapshots. I also added a little countdown loop so that it gives up if it can’t get the lock after a few tries.
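The countdown idea, in sketch form. Again, LOCKFILE, TRIES, WAIT and acquire_lock are names invented for this example, and the defaults are arbitrary; the real script may be structured differently.

```shell
#!/bin/sh
# Hypothetical names and defaults; not taken from btrfs-snap.
LOCKFILE="${LOCKFILE:-/tmp/btrfs-snap.lock}"   # sentinel file (assumed path)
TRIES="${TRIES:-5}"                            # attempts before giving up
WAIT="${WAIT:-10}"                             # seconds between attempts

# Try to create the sentinel file, counting down the remaining attempts.
# Returns 0 once the lock is held, 1 if every attempt failed.
acquire_lock() {
    tries=$TRIES
    while [ "$tries" -gt 0 ]; do
        # noclobber (set -C) makes the redirect fail if the file exists.
        if ( set -C; echo "$$" > "$LOCKFILE" ) 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$WAIT"
    done
    return 1
}

# Intended ordering: take the lock first, and only then list or count the
# snapshot directory, so a process that cannot get the lock exits instead
# of hanging on a directory listing:
#   acquire_lock || exit 1
#   ...count and prune snapshots...
#   rm -f "$LOCKFILE"
```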