FreeNAS ZFS “slapshot” fun..

You may recall (or you may not) a post about upgrading our in office FreeNAS server from 8.2.0-p1 to 9.2.1.5 a little while ago. I’m not sure I mentioned why I was doing that – other than it being stupidly out of date of course..

Well – the reason was simple. I couldn’t get our ZFS snapshots to list through the WebUI anymore. We’d taken so many over 12 months, somewhere in the region of 34-36 thousand… Yeah.. 34-36 THOUSAND. No wonder the web interface was shitting a brick whilst trying to load them. You’re possibly thinking “Is this guy on smack? 36K snapshots. What the hell is the point in that!?”

I’ll explain. That ~36K is made up of roughly 102 per day. There are six 2-hourly snapshots of each share, and 17 (+/- 1) shares in total.

Working on a 4 week month;

17*6 = 102 per day.
102*7 = 714 per week.
714*4 = 2856 per month.
2856*12 = 34272 per year.

Working on a 30 day month;

17*6 = 102 per day.
102*30 = 3060 per month.
3060*12 = 36720 per year.

Quite a few whichever way you work it out, and the cause of my problem is clear: too much data, not enough processing-fu in the UI.

So.. Hi ho, Hi ho – it’s to the CLI I go!

Instantly, using a bit of ‘zfs’ magic I can get a list;

zfs list -t snapshot -o name -s name

zfs list = Tells zfs to enter ‘list’ mode
-t snapshot = Restricts the listing to objects of type ‘snapshot’
-o name = Tells zfs to output only the ‘name’ column
-s name = Tells zfs to sort the list by name
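
If you just want a headline number before wading in, the same listing can be piped through wc; the -H flag drops the header line so only snapshot names get counted:

zfs list -t snapshot -H -o name | wc -l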

This spews out an entire list of the snapshots I have on the system. We have to be careful from here on.. ZFS assumes intelligence and takes no prisoners if you’re lacking..

{ set mode:concentrate: on; }

‘zfs destroy’ is the command to do the dangerous part. It deletes datasets/snapshots whether they’re periodic or recursive. It doesn’t really care.. in fact, it’s a lot like the Honey Badger (https://www.youtube.com/watch?v=M0wi7Ugct1w), so with that in mind and concentration activated.. we proceed further into obtaining a list and deleting the snapshots we want to get rid of.

My pool is called ‘main’.
Our shares are prefixed with a ‘z’ to indicate a local production share to the people within the office (it’s a legacy thing. We used to have a mirrored Samba share which exposed read-only rsync copies of the production shares. Mirrors were prefixed with ‘xShareName’ whilst writable shares were ‘zShareName’.)

An example snapshot name: ‘main@auto-20140612.1209-4w’
Another example name: ‘main/zProductDevelopment@auto-20140612.1209-4w’

With that all cleared up; let’s get a list and process it.

zfs list -t snapshot -o name -s name | grep '^main' | xargs -n 1 zfs destroy -r

The above simply generates a list of snapshots, ensures the list only contains snapshots from my pool (main), and then calls ‘zfs destroy -r’ on each snapshot in the list. Once started, the command is neither interactive nor verbose; it just sits there quietly working through the list.
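
If you’re nervous (and with ‘zfs destroy’ you should be), it’s worth a dry run first. Sticking ‘echo’ in front of the destroy via xargs prints the commands that would run instead of actually running them:

zfs list -t snapshot -o name -s name | grep '^main' | xargs -n 1 echo zfs destroy -r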

To keep an eye on the process, one can use bash and ‘ps’ in a while loop with a ‘sleep’ and ‘clear’ to keep things updating and easy to read.

while true; do ps aux | grep zfs; sleep 2; clear; done

That should output something similar to the below;

[root@freenas /]# while true; do ps aux | grep zfs; sleep 2; clear; done
root 307 0.1 0.0 0 96 ?? DL 10Jun14 3:56.75 [zfskern]
root 30405 0.0 0.3 53968 22484 0 I+ 11:40AM 0:00.37 zfs list -t snapshot -o name -s name
root 30408 0.0 0.0 9900 1600 0 S+ 11:40AM 0:00.33 xargs -n 1 zfs destroy -r
root 48490 0.0 0.0 37584 2908 0 D+ 12:20PM 0:00.00 zfs destroy -r main/zMarcoms@auto-20140603.0800-4w
root 48492 0.0 0.0 16268 1948 1 S+ 12:20PM 0:00.00 grep zfs

Each line above is one element of the command at work: we can see the snapshot list being created, xargs running, and the actual zfs destroy command being executed for a particular snapshot (plus the grep itself). We can keep an eye on things from here.
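
Another way to track progress is numerically; if a count of the remaining snapshots on the pool keeps dropping, the destroy is chewing through the list:

while true; do zfs list -t snapshot -H -o name | grep -c '^main'; sleep 60; done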

The best bit about this is that even after removing only 10-20K snapshots.. the WebUI is working again!

The cause of this issue is somewhat unknown at this time. I’m thinking that something with the ZFS replication is preventing the stale cleanup from running properly. We use periodic snapshots to capture data every 2 hours, and then a replication task to fire the snapshots at our backup server.

It’ll be interesting to see how well this all works between two hosts running v9.2.1.5.

Will update when I know myself 😀

Oh what fun.. Upgrading FreeNAS v8.2.0-p1 –> v9.2.1.5

The Situation

We’ve been using FreeNAS in the office at work “happily” for a number of years now.. as you can probably tell from the version number I attempt to upgrade from. 8.2.0-p1 was released in Julyish 2012 (http://www.freenas.org/whats-new/2012/07/freenas-8-2-released.html). Can’t quite remember if it was the first version we started using, but it’s certainly what we left it on!

We have FreeNAS configured with ZFS over 4x 1TB 7200rpm drives for fault tolerance and performance.

Data is then shared by CIFS/Samba to Linux/OSX/Windows users across the network.

ZFS snapshots occur every 2 hours within “core hours” of 08:00 – 20:00. These snapshots can be restored to provide recovery in the event of data loss.

ZFS snapshot replication to another FreeNAS box provides redundancy against data loss on a single unit.

Extra cronjob scripts handle rsync/scp, taking care of file-level (non-ZFS-snapshot) backups which are then scp’d from the backup server to an offsite Linux server.
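
For the curious, the cron side of that is nothing clever. Conceptually it boils down to something like the sketch below; the schedule, paths and hostnames here are placeholders for illustration, not the real ones:

# illustrative only - real schedule, paths and hosts differ
# crontab on the backup server:
#   30 23 * * * /root/offsite-backup.sh
# offsite-backup.sh, roughly:
rsync -a root@freenas-primary:/mnt/main/ /mnt/backup/files/          # pull file-level copies of the shares
scp -r /mnt/backup/files offsite.example.com:/srv/freenas-backup/    # push them to the offsite box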

  • All tends to work pretty well.
  • We get glitches with permissions in the UI sometimes, and have to restart the CIFS service for changes to take effect.
  • ZFS snapshots don’t list in the UI, so we can’t restore them.
  • We used to have to hack Windows 7 to connect to it.

The Upgrade

Following the instructions proved pointless.

Upgrading through the UI was an abortion of an idea. It just didn’t work. The process is supposed to be: upload a firmware image along with a SHA checksum of the file through the UI, it verifies the integrity, and magic ensues..

Wellllll.. I must have run out of magic dust because my upgrade didn’t work like that.

Each attempt returned with a message stating:

“Error: The firmware is invalid”

You can read about that here –> http://forums.freenas.org/index.php?threads/error-the-firmware-is-invalid-when-upgrading-from-8-2-0-to-8-3-1.11972/

Upgrading via the command line didn’t seem to be an option either. My despair grew, especially considering the backup server was quite happily running v9.2.1.5, having upgraded from a much later version like a charm.

To perform the upgrade I had to rewrite the USB stick that we boot from.

Was easy really;

  • Download the USB image
  • ‘dd’ it onto the USB key (see the sketch below)
  • Boot the server from it
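
In practice, the ‘dd’ step looks roughly like the below when run from another FreeBSD/Linux box. The image filename and target device are examples only; double-check which device is actually the USB key (dmesg is your friend) before writing, because dd will happily flatten the wrong disk:

# /dev/da0 and the image name are examples - check dmesg for the real device first
dd if=FreeNAS-9.2.1.5-RELEASE-x64.img of=/dev/da0 bs=64k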

The software booted up, but it was instantly obvious something was wrong. The server monitor was still going mad and the FreeNAS UI couldn’t be reached.. Damn.

Seems like it booted clean but dropped its configuration (doh!!!).

Lucky for me that I had taken a configuration backup before starting! 😀

Once I’d manually reset the network details, I uploaded the config backup, the restore completed, the system upgraded the data and rebooted. After a minute or two I could access the UI again 🙂

The Fallout

At this point I checked a few CIFS shares from the Macbook and all looked good.. Hah.. If only.

It transpires that FreeNAS v9.2.0+ jumped from Samba v3.x to Samba v4.x, which brings a big switch in the permission model in use. Unix-style permissions no longer work properly and everything has to be Windows/Mac ACL style.

The actual symptom of this was that Windows users could create, open and view files, but were unable to edit/modify anything. It didn’t matter if they had literally just created the file.. they couldn’t then edit it. Fully bizarre and also, what a pain! That meant changing all the shares to use the new mode and then recursively setting the file-level permissions… ugh :\

Setting permissions recursively through the UI is painfully slow, especially when you’ve thousands of files to go through.. so rather than watching python eat CPU, I decided to switch the permission model to Windows/Mac ACL through the UI and then fix the actual permissions via the CLI using find, exec and setfacl.

All our shares are stored within ‘/mnt/main/SHARE_NAME’, so rather than going through them one by one, file by file, folder by folder, we do the below (run from within /mnt/main so the ‘ls -1’ picks up the share folders);

To fix permissions on the files;

for i in `ls -1`; do find /mnt/main/$i -type f -exec setfacl -m owner@:full_set::allow,group@:modify_set::allow {} \;; done

To fix permissions on the folders;

for i in `ls -1`; do find /mnt/main/$i -type d -exec setfacl -m owner@:full_set:fd:allow,group@:modify_set:fd:allow {} \;; done
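
Both loops assume you’re sitting in /mnt/main when you run them (that’s where the ‘ls -1’ gets its list of shares from). A slightly more defensive version of the same idea, as a sketch, makes that explicit and copes with odd characters in a share name:

cd /mnt/main || exit 1
for share in */; do
    # files: full control for the owner, modify for the group
    find "/mnt/main/$share" -type f -exec setfacl -m owner@:full_set::allow,group@:modify_set::allow {} \;
    # directories: the same, with 'fd' so newly created children inherit the ACL
    find "/mnt/main/$share" -type d -exec setfacl -m owner@:full_set:fd:allow,group@:modify_set:fd:allow {} \;
done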

After this, a quick CIFS service restart brings all back to a working state.

lol. As if it did.

When trying to restart the service I found another bug! One that could fortunately be fixed with a little script from the FreeNAS bug tracker (https://bugs.freenas.org/issues/4874#note-11)

Snippet:

Updated by Cyber Jock, 25 days ago

Per request.. here’s how you apply this patch on 9.2.1.5:

As root and in SSH do the following commands:

  1. cd /tmp
  2. fetch https://bugs.freenas.org/attachments/download/768/fixup.sh.txt
  3. chmod +x fixup.sh.txt
  4. mv fixup.sh.txt fixup.sh
  5. ./fixup.sh

After doing this and restarting (or starting) the service again.. CIFS shares are working once more! Hooray!

ZFS replication seems to be working too. I wonder if I can get that snapshot list to load..