Is It Safe to Use a Big, Honking USB Hard Drive on Your Raspberry Pi Server?


The short answer seems to be yes, but why did I decide to write about this question today?!

Through the magic of Tailscale, my Seafile server, a Dropbox-like cloud-storage server, has been running remotely on a Raspberry Pi in Brian Moses’s home office for 22 months. The Pi boots off of a cheap micro SD card with all logging turned off, and Seafile stores all its data on a 14 TB Seagate USB hard drive.

Last week I had a scare! The ext4 filesystem on my USB hard drive set itself to read-only, and there were errors in the kernel’s ring buffer. Did my big storage drive fail prematurely?!

What happened?!

We may never know. I rebooted my little server, and then I ran an fsck. When that finished, I mounted the LUKS encrypted filesystem on the 14 TB USB hard drive, fired Seafile back up, and everything was happy. I haven’t seen another error in 13 days.

dmesg output

What do you think might have gone wrong? Is my USB cable flaky? Did the USB-to-SATA hardware in the USB drive get into a weird state? Did my little friend Zoe run through Brian’s office and bang into the cabinet and jostle the USB cable enough at just the right time to generate an error?

This is the first error I’ve seen in 22 months outside of a couple of random power outages.

This is good enough for me to say that it is safe to run a USB hard drive on my Raspberry Pi server. Maybe that’s not good enough for you. If not, then I hope this is at least a useful data point in your research.

The power outages all wind up being long Seafile outages, because I have to SSH in to punch in the passphrase to mount the encrypted filesystem before Seafile can start. Not only that, but I have to notice that Seafile is down before I can spend two minutes firing it back up again!

I am pleased that my USB hard drive didn’t fail!

I have been comparing my Dropbox-style server to Google Drive’s pricing. Google sells 2 TB of cloud sync storage for $100 per year, and I paid about $300 for my 14 TB drive and my Raspberry Pi. I am getting my off-site colocation for free by storing my Pi at Brian Moses’s house on his 1-gigabit fiber connection. You can just barely see my Pi server in the background of our Butter, What?! Show episodes!

I have done a bad job of properly keeping track of how much money I haven’t paid Google Drive or Dropbox. When I started out, I would have needed to rent 4 TB of storage. At some point during the first year, I would have had to add another 2 TB, and then I would have had to add another 2 TB at some point during this year.

I am currently using 6.8 TB out of a possible 14 TB.

Is it OK if I simplify the payments I would have had to make to Google Drive? How about we just say I would have spent $200 last year, $400 this year, and I would be spending another $400 in two months? I expect that's underestimating by around $100.

I have already paid for all the hardware, and I would still be at least $50 ahead even if I had to order a replacement 14 TB drive last week.

That would have been fine, but I would have been bummed out if I had to do the work of resyncing my data to a new Seafile server. I also might have had to set up accounts for the friends I share work with, and they might have had to sync their data back up. That would all be a bummer.

I don’t account for the cost of labor in my “savings”

I know what I used to charge in the short spans of time when I did hourly consulting. I understand that my time is valuable, and that value almost certainly exceeds the cost of a few years of Google Drive storage.

There are quite a number of advantages to what I am doing, and they are very difficult to put a price on.

  • I own all my data
  • My server is only accessible from my Tailscale network
  • Everything except Tailscale is blocked on the LAN port
  • My hard disk is LUKS encrypted
  • My Seafile libraries are encrypted client side
  • I don’t get stopped to pay more at every 2 TB

Aside from all of this, it is a bit disingenuous for me to imply that getting my Seafile setup running requires time and effort while setting up Google Drive or Dropbox requires none. I have my data split up into a dozen Seafile libraries, and I have no idea what the equivalent functionality would look like with Dropbox or Drive. I’d still have to make sure that the correct data is synced to the correct devices.

Even though getting Seafile running on a Pi was more persnickety and time consuming than I anticipated, I still spent the majority of the time setting up clients on all my machines and syncing the correct data to each one.

I didn’t intend for this to be an update on the Seafile Pi with Tailscale

Even though this wasn’t my intention, this seemed like a good time for an update, since the server has been in operation for nearly two years. I am quite pleased with how things have been working.

A few months ago, I started booting Raspbian’s 64-bit kernel, and I replaced the 32-bit Tailscale binaries with 64-bit static binaries. That increased my Tailscale throughput from 70 megabits per second to just over 200 megabits per second.
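
If you want to make the same switch, this is roughly what it looks like. Consider it a sketch: it assumes Raspberry Pi OS with the stock boot config, and the Tailscale version number is just an example.

# Tell the Pi's firmware to boot the 64-bit kernel (takes effect on reboot).
echo 'arm_64bit=1' | sudo tee -a /boot/config.txt

# Swap the 32-bit Tailscale binaries for 64-bit static builds.
wget https://pkgs.tailscale.com/stable/tailscale_1.32.2_arm64.tgz
tar xzf tailscale_1.32.2_arm64.tgz
sudo systemctl stop tailscaled
sudo install tailscale_1.32.2_arm64/tailscale /usr/bin/tailscale
sudo install tailscale_1.32.2_arm64/tailscaled /usr/sbin/tailscaled
sudo systemctl start tailscaled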

I also swapped out the 4 GB Pi for a 2 GB model. The two boards are identical except for the RAM, but the 2 GB Pi will not negotiate gigabit Ethernet. It is stuck at 100 megabit, so I have given up some of my Tailscale encryption speed boost!

Would I do this with a Raspberry Pi today?

Not a chance! The hardware is working just fine, but Raspberry Pi boards are never in stock. When they are in stock, they are overpriced.

My friend Brian Moses recently grabbed a Beelink box for $140 from Amazon. It has a Celeron N5095, 8 GB of RAM, and a 256 GB SSD.

You get so much more with the Beelink! There’s a real SSD inside. The CPU is several times faster, and so is the GPU. You get full-size HDMI ports. You get 802.11ac. The Beelink isn’t just a bare board. It comes in a nice case, and it comes with a power supply!

The Raspberry Pi 4 is more than enough for my purposes, and I would gladly use one for this purpose again if they were still under $50.

One of the neat things about the Beelink boxes is that you can get a lot more horsepower than the Celeron version that Brian bought. You can also get a Beelink with 16 GB RAM and a 6-core Ryzen 5560U that is about three times faster. One of our friends on our Discord server bought one of those when it was on sale for only $300.

Conclusion

I am relieved that my unexpected and unexplained SATA error didn’t wind up being a drive failure, and I am super excited that the minor gamble that is this entire Seafile Pi experiment is paying off. I have invested less than $300 in hardware, and in two months I will have not had to pay Google $1,000 or Dropbox $1,200 for cloud storage.

Even if we assign me a rather high hourly rate, I am confident that I have crossed the point where I am saving money. Isn’t that awesome?!

Enabling WiFi Fast Transition Between Access Points with OpenWRT


This isn’t a tutorial. The steps to enable 802.11r on two or more access points running any reasonably recent release of OpenWRT are pretty simple. I will spell those steps out shortly, but there won’t be screenshots walking you through the process. I am writing today to document what I have done, what I might do next, and how things are working out so far.

As I fact-check this post, it just keeps getting longer and longer. So much longer than I expected! Every time I look up specifics to make sure my memories of various events are reasonably accurate, I wind up finding something that I had misunderstood or misremembered.

Everything in here should be pretty accurate now, but I am definitely not an expert when it comes to any of the newer WiFi roaming specifications or technologies. This is just my experience while messing around with this stuff over the last week or so.

Why is Pat monkeying with his WiFi setup?!

I have not been entirely unhappy with the WiFi signal around my house for the last few years, so how did we get here? Why have I upgraded things?

I was at Brian Moses’s house a couple of weeks ago. He’d just replaced one of his OpenWRT access points, a TP-Link Archer A7 WiFi 5 router, with a GL.iNet WiFi 6 router. That got my gears turning. I was thinking I could use Brian’s old router to solve my poor connectivity issues with my Raspberry Pi Zero W on my CNC machine in the garage. It’d get me a switch port to plug the laptop into as well, so I asked him how much he wanted for it.

TP-Link Archer A7 v5 with OpenWRT

The next week he sent me home with two of his old OpenWRT-compatible routers: that TP-Link and a Linksys WRT3200ACM. Now that I had an overabundance of WiFi routers, I figured it was time to reengineer my home network and eliminate my only access point that can't run OpenWRT.

NOTE: You shouldn’t buy the TP-Link or Linksys router. The Linksys seems terribly overpriced, and it feels like the TP-Link should costs less, too. You can get two better equipped WiFi 6 routers from GL.iNet for the price of the Linksys WRT3200ACM WiFi 5 router, and the GL.iNet routers ship from the factory with OpenWRT.

The layout of our house

There are much bigger homes in our neighborhood, but the homes with significantly more square footage tend to have a second story. Our house is shaped like the letter U, and Zillow claims it is around 2,200 square feet. I also need to cover the garage, which adds another 500 square feet. Let's just say I need to cover 2,500 square feet with WiFi, and reaching out into the yard would be a nice bonus!

My house and its wifi access points

I have two access points, and each access point has a separate SSID on each radio. Chicanery2.4 and Chicanery5 are on the gateway router that lives in the network cupboard. Shenanigans2.4 and Shenanigans5 are on an access point in the living room. Both devices are D-Link DIR-860L routers. Both have faster WiFi when running D-Link firmware, but the one in the living room can’t even run OpenWRT.

Shenanigans is named after Farva’s favorite restaurant, and Chicanery is named after its Canadian equivalent from the sequel.

This is how things were configured last week.

Why not just put the same SSID on every access point?

We can call this my poor man’s WiFi steering. The access point in the living room does a good job reaching the entire house, but there are some important devices that are close to the network cupboard.

There is a television in the room opposite the network cupboard that is connected to chicanery5, and the television in the living room is also connected to chicanery5.

Why is the TV in the living room not connected to the access point in the living room? It is close enough to the cupboard to get a good connection, so there is no reason to waste 20 or 30 megabits on the house's primary access point.

These devices are still connected to the access point in the cupboard today.

What have I changed?

I wound up putting the Linksys WRT3200ACM in the cupboard as our new Internet gateway and our upgraded chicanery access point. I did some tests, and this router can easily manage routing and NAT at 920 megabits per second. That’s only about 20 megabits shy of the maximum speed of gigabit Ethernet, so we won’t have to worry about swapping routers if we upgrade the speed of our FiOS service.

I put the TP-Link Archer A7 in the living room. Both routers are running the latest release of OpenWRT, and they have all their old SSID settings copied over.

I also added a new SSID called kerchow to every radio. I set up 802.11r fast transition on that SSID for nearly instantaneous roaming.

How do you enable 802.11r roaming in OpenWRT?

I gave every radio that I wanted to participate in the roaming group the same SSID and WPA2 key. I also had to choose a mobility domain for the group. Then I had to do the following (the resulting configuration is sketched just after this list):

  • Tick the checkbox on that interface to enable 802.11r
  • Enter my mobility domain
  • Choose FT over the Air for the FT protocol
  • Set DTIM Interval to 3 (supposedly helps iOS devices roam)
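
For reference, this is roughly what those steps write out to /etc/config/wireless. Treat it as a sketch from my own setup: the section name, radio, and key are placeholders, and beef is the mobility domain I talk about below.

config wifi-iface 'kerchow5'
        option device 'radio0'
        option mode 'ap'
        option ssid 'kerchow'
        option encryption 'psk2'
        option key 'not-my-real-passphrase'
        # 802.11r fast transition, FT over the Air
        option ieee80211r '1'
        option mobility_domain 'beef'
        option ft_over_ds '0'
        # supposedly helps iOS devices roam
        option dtim_period '3'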

The first two are the only steps that are truly required. Lots of advice on the Internet recommended using FT over the Air, but I haven't tested both options. They said to do it, and it seems to work well, so I am sticking with it.

I have no iOS devices, so I have no idea about the DTIM Interval.

How do I know if 802.11r is working?

I am not entirely sure that I have tested things correctly, but I am mostly confident in my results.

I used ssh to connect to both routers and ran logread -f to follow the logs to see when machines were connecting to WiFi. I had the OpenWRT LuCI GUI open in two browser tabs, and I would go in and click the disconnect button for my laptop.

Stable Diffusion Hacker

I would immediately see output in one of the logs saying the laptop reconnected. I had a ping running on the laptop the entire time, and I never missed a ping. Sometimes one or two responses would bump up to 200 or 300 milliseconds, but they always made it.

This seems like a success to me!

How do you know if your 802.11r setup is working well?!

I had a glitch. My Windows 11 laptop very much preferred 2.4 GHz connections. Once it picked a 2.4 GHz radio, it would always connect to another 2.4 GHz radio after being disconnected. I had to turn WiFi off and on again on the laptop to get it to connect to 5.8 GHz again.

This is a bummer, because the 5.8 GHz radios are about four times faster. This happens because 802.11r doesn't care about bandwidth. It only cares about signal strength. A 2.4 GHz signal has a much easier time penetrating walls than a 5.8 GHz signal, so the slower radio often has a stronger signal even though its throughput will be lower.

I solved this problem in a simple, ham-fisted, but very effective way. I now have three SSIDs and three different mobility domains.

I have kerchow with a mobility domain of beef on every single radio. I have kerchow2.4 with a mobility domain of b0b0 on the 2.4 GHz radios, and I have kerchow5 with a mobility domain of b00b on the 5.8 GHz radios.

I wish I had discovered both DAWN and usteer before setting this up.

I added a GL.iNet WiFi 6 router to the mix

I have a GL.iNet GL-AXT1800 travel router here, and I was curious if the 802.11r configuration would even work on this hardware. These particular routers have radio chips that aren’t yet supported by OpenWRT, so they are shipped with an oddball proprietary Qualcomm fork of OpenWRT.

Brian said his GL.iNet WiFi 6 router got pretty goofed up when he configured the WiFi through LuCI instead of through GL.iNet's interface, but it did let him add additional SSIDs to each interface through the LuCI GUI. I was worried that other things like 802.11r wouldn't work.

GL.iNet Mango and AXT1800 on my desk

I ran out of Farva-related restaurants, so I put the SSID of slammin on my GL-AXT1800. Then I added the three kerchow SSIDs with the correct mobility domains.

I don’t really have a good location for a third access point. I just found an open network jack near a power outlet that is about half way between the other two access points. My laptop rarely decides to roam to this access point. I had to click disconnect in the OpenWRT GUI so many times, but it did eventually connect right up to the WiFi 6 router.

As far as I can tell, 802.11r fast transition does work with the oddball OpenWRT firmware.

What about DAWN and usteer?

I am bummed out about this. A few weeks ago I learned the name of one of these packages, but I didn’t remember to make a note of it. I tried finding it several times over the last week, but I failed miserably.

Do you know when I managed to find one of them? Minutes after I set up nine different variations of kerchow on three different routers.

I don’t have a lot to say about either one at this point. They are both available as OpenWRT packages. They are tools that help you configure how devices roam between your WiFi access points.

Both seem fiddly and complicated, and success seems to involve a lot of trial and error. I'd be happy if either package could easily manage to accomplish two things.

I’d like to be able to eliminate two of the three kerchow variations. It sounds like configuring one of these tools to make your 2.4 GHz less appealing only requires a few lines of configuration. That would be awesome.

Why am I waiting before trying DAWN or usteer?!

Both require replacing the default wpad-basic OpenWRT package with the more feature-complete version. I'm sure this will go smoothly, but I am just not ready to try something that might require that much effort!

If I were just monkeying around with a few test routers, this would be fine. That’s not what I am doing. These are the routers running my home network. I need them to work, and they are working right now.

NOTE: I might be confused about this. My fresh installs of OpenWRT 22.03 all have wpad-basic installed, and the description for this package claims it supports 802.11r. I am guessing that wpad-basic has replaced wpad-mini, which lacks this support. I am not certain that wpad-basic has the features required for DAWN or usteer, but I suspect that it will work fine!
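
If I ever work up the nerve, the swap itself should only be a couple of opkg commands. This is a hedged sketch: the package names vary by release, so check which wpad variant you actually have before removing anything.

opkg list-installed | grep wpad   # see which variant is installed
opkg update
opkg remove wpad-basic-wolfssl    # or whichever variant turned up above
opkg install wpad-wolfssl         # the full-featured build
reboot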

Your setup is probably more complicated than mine!

I have learned that I can cover my entire house with acceptable WiFi speeds using a single access point. Sure, the speed drops to 20 or 30 megabits per second in the bathroom and garage, but that’s more than enough to watch YouTube videos. Anywhere with a desk can manage 200 megabits per second.

I don’t truly need multiple access points with fast roaming, but maybe you actually do! If you do, I imagine things get more complicated!

Stable Diffusion hacker

My access points have their radios cranked up to maximum output. If you’re trying to cover a slightly larger area with two access points, you probably don’t want to be blasting quite so hard. You want your devices to notice that one access point is weaker than the other so they can switch over.

I am also cheating because there is a gigabit Ethernet port at every desk in the house. I only need just enough WiFi reaching every comfy chair for my phone and laptop to at most play YouTube videos, and I need a couple of dozen megabits to get 4K video streaming to two televisions. Things get more complicated if you really need a solid 200 megabit wireless connection at specific workstations in the house.

Why does Pat keep using numbers like 200 megabit?! Can’t modern WiFi do gigabit?!

I’ve been doing iperf tests to various access points from all sorts of distances and different rooms all week. The fastest WiFi 6 router I have access to has managed to pull numbers really close to 700 megabit, but that only happens in the same room, and it only happens some of the time. I am way more likely to see 400 megabits per second while in the same room.

Once you start putting a couple of walls in between your devices, these numbers drop. It seems like 250 megabits per second is about what I can expect to see from about one room away from the access point. The signal either has to bounce around corners through doorways, or it has to go through the pair of walls that make up the hallway.

The fastest advertised WiFi speeds are only possible under ideal conditions.

Something that surprised me!

First of all, if you are buying a fresh set of OpenWRT routers for your project, you should probably be buying GL.iNet routers that ship from the factory with OpenWRT. At least that way you know what you're going to get. One of my old D-Link tube routers isn't running OpenWRT because I had no way to know which revision of the hardware I was going to get before it arrived.

Not only that, but many WiFi routers that can run OpenWRT have radio chipsets that don’t run as well with the open source drivers. My older revision D-Link router’s WiFi is about 50% faster than the newer revision with OpenWRT.

I was handed a box of free OpenWRT-compatible WiFi hardware. I already knew the hardware would work. If this didn’t all happen accidentally, I would have bought GL.iNet routers instead.

I guess I should get to the weird part. The Linksys WRT3200ACM can be set to 200 mW on 5.8 GHz and 1,000 mW on 2.4 GHz. The TP-Link Archer is the opposite! It can be set to 1,000 mW on 5.8 GHz and 200 mW on 2.4 GHz!
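
OpenWRT configures transmit power in dBm rather than milliwatts, so those two limits work out to roughly 30 dBm and 23 dBm. Here's a sketch of cranking a radio up from the command line; radio1 being the 5.8 GHz radio is an assumption from my own hardware.

uci set wireless.radio1.txpower='30'   # 30 dBm is 1,000 mW; 23 dBm is about 200 mW
uci commit wireless
wifi reload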

Does the OpenWRT driver really know how much power the radio is going to put out? Is it anywhere near correct? We have no idea. I never trust numbers like this.

Radio is weird. You have to quadruple the transmit power to double your range. I have no way to properly verify just how much power the Archer A7 is pumping out, but the WiFi analyzer seems to think the TP-Link 5.8 GHz signal is quite a bit stronger than the WRT3200ACM!

Conclusion

I don’t know if there is a conclusion. I set up 802.11r. I pointed a few of my WiFi devices at the 802.11r SSIDs. I think I am at the point where I wait and see how things work out.

Maybe the conclusion is that 802.11r was really easy to set up on my OpenWRT hardware. It isn’t much more work than checking a box, though it does get a bit more fiddly when you are trying to put the correct mobility domain on three different SSIDs on three different routers. I know I goofed at least one up on my first attempt!

OpenWRT, Two GL.iNet Routers, and Tailscale: Successes and Failures


This blog isn’t going to lead to some keen piece of insight or an interesting conclusion. I’ve been messing around with a pair of very different OpenWRT routers from GL.iNet: the GL-MT300N-V2 Mango and the GL-AXT1800 Slate AX.

Last week I learned some things I didn't know about, and I learned just a little more about Tailscale. This is mostly just to write down what has worked and what hasn't worked for me.

GL.iNet Mango and Slate AX 1800

I figured I should write some of these things down both for future me and for anyone else who might be trying to accomplish something similar. Mostly for future me.

How did we get here?

A friend of ours is moving to Ireland. He just wants to watch Jeopardy on his Hulu Live TV subscription. He's been watching Brian and me monkey with Tailscale on our GL.iNet Mango routers for ages. I know our friend has had his Mango for a while, quite possibly for about as long as Brian and I have had our Mangos, but now he's been trying to use it to connect his Apple TV to an exit node in the United States.

The story is more complicated than this, but the Mango didn’t work out for him, so he tried the much beefier GL.iNet Slate AX, and he couldn’t make that work. That’s how I wound up having a Slate AX here in my possession. He is currently using a Raspberry Pi 4 with Tailscale as a router to forward his Apple TV to America.

I am able to get my tiny Mango to route traffic of connected devices through one of my Tailscale exit nodes. I am not able to do the same with the much nicer Slate AX router.

Who is Pat blaming?!

I don’t think anyone needs to accept the blame here. I don’t believe that running Tailscale on an OpenWRT router isn’t officially supported. Running Tailscale on the Mango is a bit of a hack because the Mango doesn’t have enough storage to even hold Tailscale.

Not only that, but as I learned long after I started trying to figure this out, the GL.iNet WiFi6 routers aren’t even built upon official upstream OpenWRT!

We can probably blame me for not doing science correctly. I am also only a sample size of one, so just because something is working well for me several times on the day I happen to test it does not always mean it works smoothly next week.

Keeping track of what is going on is problematic. The logs on the OpenWRT routers are ephemeral. Not only that, but I didn’t think I would need to troubleshoot anything on my Mango, so my startup scripts redirect all tailscaled output to /dev/null!

Pat is bad at science

I did a bad job writing things down. I didn’t know I was going to have to. All I really managed to do was write down some things after the fact on our Discord server. Those were usually the weirdest happenings.

We also aren’t helped by the fact that not everything happened at the same time. I updated Tailscale and OpenWRT on my Mango a few weeks ago, and everything there was fine. Then, when I tried booting up the router a few days later, things weren’t going as smoothly there as I thought. Then even more time passed before I got to fart around with the beefier WiFi 6 router.

Science is hard. I was kind of expecting to make things work, wipe everything clean, and make it work again. That last step would have given me all the documentation I needed, but I never made things work on the beefy router.

Connecting my GL.iNet Mango to a Tailscale exit node

First of all, I am cheating when I load Tailscale on my tiny Mango. It only has 16 megabytes of flash, and most of that is taken up by GL.iNet’s OpenWRT firmware. The official Tailscale binaries are bigger than that, so I wound up installing them on a USB flash drive.

It was a rather manual process, and I documented it as best as I could. I had to make a few tweaks in the OpenWRT LuCI GUI to get exit-node traffic routing correctly, but once I did, I thought I had it working pretty well. I powered the router off and on several times, watched it succeed every time, then I packed the Mango away in my laptop bag to use in an emergency.

The Mango is pretty slow. It can only encrypt traffic over my Tailscale link at about 4 megabits per second. That’s way slower than Wireguard in the kernel on this machine, and probably only just barely enough for video streaming. I am guessing Netflix would bump you down to standard definition via a Tailscale exit node on the Mango.

A few weeks later, when my friend was having trouble, I pulled my Mango out of the bag and plugged it in. My Mango didn't want to connect to my Tailnet unless I killed and restarted tailscaled. Adding a long sleep to my Tailscale script seems to help with this, but it isn't perfect.

I should also note that I only managed to get the Mango to route traffic through my exit node if the Mango is using Ethernet for its WAN connection. If I set up the Mango to use WiFi as its WAN, it won’t route traffic via the exit node.

I didn’t notice this right away, and I haven’t investigated what is going on here. This is only a problem when trying to route traffic through an exit node. The Mango works fine for me with WiFi WAN as long as it is just a regular node or subnet router.

NOTE: On the Mango, I have to create an interface in LuCI for the tailscale0 interface and make sure it is attached to the WAN firewall rules. This coaxes OpenWRT’s firewall scripting to apply the correct iptables rules for packets to flow.
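
Here is roughly what that note translates to on the command line instead of clicking through LuCI. This is a sketch against the Mango's OpenWRT 19.07 base, where the option is ifname (newer releases call it device), and it assumes the wan zone is the second zone in /etc/config/firewall.

uci set network.tailscale=interface
uci set network.tailscale.proto='none'
uci set network.tailscale.ifname='tailscale0'
uci commit network

# Attach the new interface to the wan zone so the NAT and forwarding rules apply.
uci add_list firewall.@zone[1].network='tailscale'
uci commit firewall

/etc/init.d/network restart
/etc/init.d/firewall restart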

The GL.iNet Slate AX is built on Qualcomm’s OpenWRT fork

Everything that went wrong for me when attempting to route the GL-AXT1800 through a Tailscale exit node was completely different than on the Mango. I assumed something was just a little different between the newer release of OpenWRT that this router was built on, so I investigated the idea of installing a different version.

I couldn’t downgrade the AXT-1800 to match the Mango. This made sense to me. Why would GL.iNet start crafting their build for a new device with an older version of OpenWRT for a new piece of hardware?

I also noticed that I couldn’t upgrade the Mango to match the Slate. I knew there were official OpenWRT builds for the Mango, so I checked the OpenWRT site for firmware for the Slate, but I couldn’t find it. There wasn’t any open-source firmware for any of GL.iNet’s WiFi6 devices.

I thought I read that there wasn’t yet any OpenWRT support for WiFi6 devices

At least I thought I read this, but GL.iNet sells a few OpenWRT routers with WiFi6 chips. How can that be?! We just figured it out, and GL.iNet doesn’t hide what they’re doing. They spell it out right in the product description.

Qualcomm fork of OpenWRT

OpenWRT doesn’t yet have much support for WiFi6 devices, but Qualcomm has a fork of OpenWRT that supports their latest WiFi6 chipsets.

I haven’t decided precisely how I feel about this, but it is definitely making my troubleshooting more difficult. That is assuming you’re willing to call my ham-fisted attempts at messing about here troubleshooting!

I understand that I don’t truly understand how Tailscale works

I understand well enough how packets get from a Tailscale device at my house to a Tailscale device at another location. What I don’t understand is how packets on my Linux machines find their way into the Tailscale process, but I do understand that Tailscale has its own little IP stack hiding in there.

There are no entries in my routing tables listing any of my Tailscale addresses. They should all be matched by my default gateway, yet they manage to get snagged by Tailscale and routed appropriately.
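
As far as I can tell, the routes aren't actually missing; they just aren't in the main table. Tailscale on Linux uses policy routing, and its routes live in their own routing table, table 52. Assuming a machine with iproute2, these two commands show where the packets are really being matched:

ip rule show            # one of the rules points at table 52
ip route show table 52  # Tailscale's routes live here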

What works for me with Tailscale on the GL.iNet GL-AXT1800?

I installed the Tailscale OpenWRT package and immediately noticed that it is built with an old enough version of Tailscale that it doesn’t support exit nodes. I cheated and replaced the package’s binaries with the latest Tailscale binaries. It fires right up, connects to my Tailnet, and I can connect to things.

I figured cheating was the right thing to do. It seemed smart to let the OpenWRT Tailscale package install startup scripts and maybe even LuCI GUI configuration bits and pieces for me, and I could just replace the outdated binaries.

What happens when I try to route traffic through an exit node?

When I run the command to route traffic through an exit node, things get weird.

As soon as the exit node is enabled, I can no longer reach the exit node’s IP address, but I can reach other nodes just fine. At least I thought I could reach other nodes just fine.

One time I was accidentally pinging the wrong node in the background while enabling the exit node. I got really excited when it continued to respond, but then I noticed that the round-trip time increased quite a bit. A few seconds later I noticed that I was pinging the wrong address.

You can pass Tailscale the --netfilter-mode=off option to prevent Tailscale from creating any firewall rules. This gave me the same results.

What is the solution?

Our friend’s solution is Tailscale on a Raspberry Pi. That’s a fine solution, but I did something that made me feel a bit dirty. I set up a Wireguard server in a Docker container.

I followed somebody’s guide. I don’t think it was this exact guide, but it was similar. I was disappointed when I saw that I had to hit my page-down key twelve times to get to the end of the documentation. This is an order of magnitude more work that setting up Tailscale and a handful of usable exit nodes.

Not only was it long, but the documentation didn’t work for me. I had to install the wireguard-dkms package on the Docker machine. This makes perfect sense, but it took longer than I’d like to admit for me to figure out that I needed Wireguard support in my host kernel.

The good news is that GL.iNet’s awesome Domino interface makes it extremely easy to connect to a Wireguard or OpenVPN server.

The Domino GUI on the Slate AX is fancier than the tiny Mango!

These are very different travel routers. The Mango cost me $20 two years ago, only has 2.4 GHz WiFi and a pair of 10/100 Ethernet ports, and it happily boots up when plugged into a 0.5-amp USB hub.

The Slate AX has a beefy heat sink with a little fan. It comes with a 5-volt 4-amp power supply. It has near bleeding-edge WiFi speeds. It is a dense beast of a little machine, and it costs nearly six times more than the Mango. I am not surprised that there are significantly more options in the stock Domino interface.

Their Domino GUI is a really nice feature on both routers. Either will let you do things like connect to WiFi or tether your phone to use as your WAN. You could do this with OpenWRT and LuCI, but you'd have to click dozens of buttons and change so many settings, and you'd better not mess any of it up!

The Domino GUI lets you do this with just a few clicks. I’m not excited about having this at home, but it is extremely handy on a travel router.

The Slate AX has options to allow you to use Ethernet, WiFi, and a tethered cell phone as a weighted multiport WAN. I haven’t had a chance to test this, but I am impressed that the option is available, and it looks easy to configure. You can do this sort of thing with stock OpenWRT, but you will have to work very hard to do it!

Conclusion

It is a bummer that we couldn’t get a GL-AXT1800 routing through a Tailscale exit node, but I am mostly only bummed out about it because we couldn’t get our friend watching Jeopardy through an OpenWRT router with Tailscale. The things that we are currently able to do with Tailscale on an OpenWRT router are still quite impressive to me.

When I bought the Mango two years ago, loading Tailscale on it was an afterthought, and exit nodes didn’t even exist yet. I just thought it was neat that I could leave my Mango behind and still be able to ssh to it. That option alone has a ton of value to me, and there are dozens of times when something like this would have come in handy over the last twenty years.

Two years later, and my Mango can route traffic through an exit node or route outside traffic to its own local subnet. I am hopeful that things will improve over time from every angle. GL.iNet seems to really want to get the IPQ6000 support ported to upstream OpenWRT. OpenWRT should eventually have a more recent version of Tailscale in their package repositories. And of course Tailscale is constantly improving.

This conclusion has gotten away from me. What I am trying to say is that it sure seems like everyone involved is probably doing a good job.

Is It Time For You to Set Up Tailscale ACLs?


If you’re a lone Tailscale user like me, there’s a good chance that you have no pressing need to set up Tailscale’s access control lists (ACLs). Until quite recently, I didn’t feel there was much reason to lock anything down.

Pretty much every computer I own has been running Tailscale for more than a year now. They could all ping each other. In fact, most of them are on the same LAN, and they could ping each other before I had Tailscale. Tailscale already locked them down a bit more thoroughly for me. Why lock them down any more?

Then I started using Tailscale SSH

As soon as I started enabling Tailscale SSH, I needed to set up some access controls. I wanted to emulate my previous setup.

My desktop and two laptops had their own SSH private keys, and their matching public keys were distributed to all my other machines. That meant these three computers could connect to any computer I own.

"ssh": [
    // I don't actually use this rule anymore!
    {
        "action": "accept",
        "src":    ["tag:workstation"],
        "dst":    ["tag:server", "tag:workstation"],
        "users":  ["autogroup:nonroot", "root"],
    },
],

I gave those three devices a tag of workstation, and I stuck a server tag on everything else. Then I set up an ssh rule in Tailscale to allow any workstation to ssh into any server or any other workstation.

So far, so good. This configuration does happen on Tailscale’s Access Controls tab, but it isn’t in the acls section of the file. At this point, my Tailnet was still wide open.

I got worried when I added my public web server to my Tailnet

I have a tiny Digital Ocean droplet running nginx hosting patshead.com, briancmoses.com, and butterwhat.com. I always said I should install Tailscale out there, but my web server droplet has been running an outdated operating system for a while, so I knew I would be creating a fresh VPS at some point.

I finally did that. I spun up one of the new $4 per month droplets, copied my nginx config over, and installed Tailscale. I am super excited about this because it means I don’t even have to have an ssh port open to the Internet on my web server.
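
Closing that SSH port is the easy part once Tailscale is up. This isn't a copy of my exact firewall setup, just a hedged sketch of the idea using ufw, allowing SSH in only via the Tailscale interface:

# Allow SSH only when it arrives over Tailscale.
ufw allow in on tailscale0 to any port 22 proto tcp
# Then remove whatever rule used to allow public SSH, for example:
ufw delete allow OpenSSH
ufw enable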

However, this means that a scary server that I don’t personally own that is sitting out there listening for connections on the Internet is connected directly to my Tailnet. Yikes!

Tagging all your machines for use in ACLs is hard!

It isn’t hard because you have to click on every machine to add tags. It is challenging because choosing names for your tags is easy to goof up!

My original decision that a workstation would connect to a server and never the other way around was too simple. It wasn’t the right way for me to break things down, and as I started adding more tags, I wasn’t able to easily set things up the way I wanted.

I’ve been doing my best to make sure my Tailscale nodes don’t have any services open on their physical network adapters. My workstations are mostly locked down well, and I moved things like my Octoprint virtual machine behind the NAT interface of KVM instead of being bridged to my LAN.

Even so, I have two servers at home that need to be accessible from outside my Tailnet. My NAS shares video to my Fire TV devices just in case I need to watch Manimal, and I have lots of unsafe devices around the house that need to connect to my Home Assistant server.

This seemed easy. I immediately tagged my NAS, my Home Assistant server, and my public web server with a tag of dmz.

What was the problem with this?

I want my workstations to be able to see everything. I want my servers to be able to communicate with each other, but I don’t want my servers in the dmz to be able to connect to my internal servers or workstations.

This all seemed simple and smart until I realized that everything in my dmz already had a server tag. I also very quickly realized that my Home Assistant server listening to my LAN is much less threatening than my web server listening to the public Internet. One of those should be on an even more restricted tag!

Where did I actually land?

I have four main tags now:

  • workstation
  • server-ts
  • server-dmz
  • server-external

My personal workstations can connect to anything. Machines tagged server-ts can connect to machines tagged server-ts and server-dmz, while the server-dmz servers can only talk to other server-dmz machines.

"acls": [
    {"action": "accept", "src": ["tag:workstation"],   "dst": ["*:*"]},
    {"action": "accept", "src": ["tag:server-ts"],     "dst": ["tag:server-ts:*", "tag:server-dmz:*", "autogroup:internet:*"]},
    {"action": "accept", "src": ["tag:server-dmz"],    "dst": ["tag:server-dmz:*"]},
    {"action": "accept", "src": ["tag:blogdev"],       "dst": ["tag:blogprod:22"]},
    {"action": "accept", "src": ["nas"],               "dst": ["seafile:*"]},
    {"action": "accept", "src": ["autogroup:shared"],  "dst": ["tag:shared:22,80,443"]},
],

These are all my ACLs as of writing this. There are a couple of more specific rules there that I didn’t talk about yet.

There’s a rule there that allows one of my virtual machines here at home to publish content to my public web server.

My NAS is in the dmz, so I had to give it its own rule to allow it to connect to my Seafile Pi. My NAS syncs extra copies of some of my data for use as a local backup!

I goobered up my exit nodes!

I am more than a little embarrassed by how many times I had to go back and forth between desks to figure out why the exit node on my GL.iNet Mango stopped passing traffic to the Internet.

The Mango had a tag that allowed it to access the exit node. If I took that tag away, it couldn’t ping the exit node. I’d add it back, and while it could ping the exit node, it couldn’t route any farther. If I dropped the original ACL that leaves everything wide open, the Mango could route traffic just fine. What was going wrong?!

It seems like I had this idea in my head that Tailscale’s ACLs only applied to Tailscale nodes and addresses. I didn’t immediately realize that I had to explicitly allow access to the Internet or even other subnets I might be routing!

{"action": "accept", "src": ["tag:server-ts"], "dst": ["tag:server-ts:*", "tag:server-dmz:*", "autogroup:internet:*"]},

I just had to add autogroup:internet to the allowed destinations for the appropriate tag. Duh!

Don’t think too hard before implementing your ACLs

This is especially true if you are down here at my scale with a couple dozen nodes and only a few shared nodes. Just drop some tags on things and set up some access controls that allow nodes access to what they need.

You probably won’t set things up optimally. I know I didn’t on my first try, and I am already seeing things I’d like to do differently. Even if my initial attempt left things more open than I might like, it was still a huge win just because it blocked my public web server from connecting to the rest of my Tailnet. Any other improvements are minor by comparison.

If money and other people’s livelihoods are on the line, maybe you should spend some time having meetings and planning things out on whiteboards. It only takes a few seconds to switch back to the single default ACL that leaves your Tailnet wide open, so if you do find a problem, you can at least revert your changes quickly and easily!

Tailscale SSH is affected by Tailscale network ACLs!

This seems obvious, but I wasn't positive that it would be the case! Tailscale seems to always make the best possible default choices, and that got me thinking that Tailscale's own SSH server might ignore the ACLs as long as the connection was allowed in the ssh section of the access control configuration.

This does not seem to be the case. If you want to use Tailscale SSH, then your networking ACLs have to allow it. To be clear, I think this was the correct thing for Tailscale to do.

Shared nodes are allowed access by default

I wasn’t sure about this. The default single ACL just has one line that allows everyone access to everything. The first thing you do when designing your own ACLs is delete that entry. At that point nobody has access to anything, so I assumed I would need to add a line similar to this:

{"action": "accept", "src": ["autogroup:shared"], "dst": ["tag:shared:*"]},

We tested this. This wasn’t necessary, but I figured it would be a good idea to lock down my shared nodes just a bit, so I wound up using this ACL:

{"action": "accept", "src": ["autogroup:shared"], "dst": ["tag:shared:22,80,443"]},

It is a bit lazy. Three people need access to ports 80 or 443 on the Seafile server, and Brian needs SSH access to rsync files to his blog. It gets the job done.

I did test out removing ports 80 and 443 from this ACL, and I watched the connections on my Seafile server. All the Tailscale IP addresses that I didn’t own dropped off the netstat list, and when I put those ports back in the ACLs, everyone connected back up immediately.

I am sure the documentation explains this, but I doubt I am the only one who likes to see things work in practice just to make sure!

Forgetting you have Tailscale ACLs configured makes troubleshooting a real challenge!

This happened to me yesterday! A friend sent me a GL.iNet GL-AXT1800 router to help him get his identical router to pass local traffic through a Tailscale exit node.

I installed the ancient OpenWRT Tailscale package, replaced the binaries with the official Tailscale static ARM binaries, ran tailscale up, and it gave me the URL to open to authenticate this new node. Everything went smoothly, except I couldn’t ping any of my other Tailscale devices!

Derp. Since I didn’t remember to put any tags on my new Tailscale device, it wasn’t matching any of my Tailscale ACLs, so it couldn’t actually connect to anything!

This was a simple mistake, but I walked back and forth between two desks and rebooted the GL.iNet router at least twice before remembering that I even configured any Tailscale ACLs in the first place!

Conclusion

If you’re just a home gamer like I am, you probably don’t need to worry about Tailscale ACLs. If you have one or more nodes on your Tailnet that have services running on the open Internet, you may want to lock things down a bit. It would be a real bummer if someone managed to crack open your public web server, because they might be able to ride Tailscale past your other routers and firewalls.

One of the awesome things about Tailscale is that I have absolutely no idea what you’re doing with it. You might just be one person sharing a Minecraft server with some friends. You might be sharing a couple of servers with business partners like I am. You might even be managing a massive and complicated Tailnet at a giant corporation.

You and your Minecraft server probably don’t need to worry about ACLs, but if you are in a position where you should be thinking about tightening up your access controls, I hope my thoughts have been helpful!

The OpenWRT Routers from GL.iNet Are Even Cooler Than I Thought!


I have had my little GL.iNet Mango router for about two years now. It was an impulse buy. It was on sale for less than $20 on Amazon, and I just couldn’t pass it up. It was exciting for me to learn that there is a manufacturer that ships OpenWRT on their routers, and I really wanted to mess around with one.

I rarely use my Mango router. It lives in my laptop bag. If I ever need a WiFi extender, it is there. If my home router fails, it would be my emergency spare. My Mango is a Tailscale subnet router, so if I am ever away from home and need to reach an odd device via my Tailscale network, then I can. It is pretty awesome!

I bought a new laptop a few months ago, and I have been tidying up my various laptop bags. I realized that I hadn’t updated OpenWRT on my Mango in two years, and my Tailscale install is just as old. It seems like it is time to update things!

I had some problems along the way, and I managed to lose all access to my Mango. It could hand out DHCP addresses. It could route traffic. It wouldn’t respond to pings, HTTP, or SSH.

I am really excited that I had problems, because I learned that the GL.iNet routers are even more awesome than I thought!

NOTE: I didn’t really have any problems with my Mango! Something weird was happening on my Windows 11 laptop.

There’s more available than just the stock firmware!

When I could no longer ping my Mango router, I first tried resetting to factory defaults. That didn’t work. Then I tried re-flashing the latest firmware, and it still didn’t work.

Then I noticed that GL.iNet supplies several different firmware images for their routers. There's the stock image with their own GUI called Domino. There's another that skips Domino and just has the official OpenWRT LuCI GUI. Then there's a third firmware that routes all your traffic through the Tor network. How cool is that?!

I flashed the LuCI-only firmware, and my Mango started working correctly. All the official GL.iNet firmware images for the Mango are based on OpenWRT 19.07.08. That's not too bad. The OpenWRT folks are still updating version 19, but the first release of version 21 happened last year.

You can definitely download a version 21 build or a release candidate of version 22 for the Mango directly from the OpenWRT site.

Should I just run LuCI, or do I want the Domino GUI?

I love LuCI. If I were permanently installing a GL.iNet router in my home, I would most definitely skip GL.iNet's Domino GUI. I would most likely be installing that release candidate of OpenWRT 22.03 just to avoid a major upgrade in the near future.

My Mango doesn’t have a permanent home. It is a tool that lives in my laptop bag. There’s a very good chance that I might let a friend borrow it. The Domino GUI is WAY more friend-friendly than LuCi!

The Domino GUI also makes some difficult things as easy as clicking a button.

The GL.iNet firmware has a simple dialog that allows you to use another WiFi network as your WAN. It has an equally simple dialog to configure the Mango as a WiFi repeater.

Either of those configurations would require dozens of clicks in OpenWRT's LuCI GUI, and Domino even lets you tie those configuration settings to a physical switch on the router.

I definitely want the Domino GUI on my toolkit’s router.

Should I have bought a higher-end GL.iNet router?

Two really cool things came into my life at about the same time two years ago: the GL.iNet Mango and Tailscale. The Mango only has three or four megabytes of free disk space, and the Tailscale static binaries add up to more than 20 megabytes. One cool thing doesn’t fit on the other cool thing!

Two years ago, the only way to get Tailscale onto an OpenWRT router was to install it manually. Now you can just install it with the OpenWRT package manager, and that is awesome!
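
On a current official OpenWRT release with enough free flash, that install is just a couple of commands. A sketch, with the caveat that the packaged version may lag behind upstream, and some releases split the daemon into a separate tailscaled package:

opkg update
opkg install tailscale
/etc/init.d/tailscale start
tailscale up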

I cheated and put the Tailscale binary on a USB flash drive when I set things up two years ago. It’d be nice to not have to do this, but in a way, I am pleased with this configuration.

What if I loan my Mango to a friend? What if they’re less than trustworthy? I can just pop the USB drive out! All the Tailscale configuration and keys live on that drive. If they don’t have that, they can’t access my Tailnet.

I am pretty sure the OpenWRT Tailscale package will work on the Mango

The Tailscale package is only around 2.8 megabytes. That would nearly fit on a fresh Mango router with the stock GL.iNet firmware!

The GL.iNet firmware is running OpenWRT 19, and there don't seem to be any Tailscale packages in the OpenWRT 19 repositories. Even if you could squeeze the package in, you're going to have trouble getting an official OpenWRT package for that release.

I did notice that when I installed the clean OpenWRT 19 image from GL.iNet, there's around 7 megabytes of free space. That's plenty of room to install the Tailscale package!

You should be in good shape if you download the latest version of OpenWRT for your Mango straight from the OpenWRT site. It sure looks as though you’ll have enough room, and the packages will be in the repository for you to install right away.

I didn’t want to give up the Domino GUI. Being able to connect to the router and click a few buttons to switch modes between routing, repeating, and other things is ridiculously handy.

How do I run Tailscale on the Mango if the Mango doesn’t have enough storage?

I have been arguing with myself for five minutes about how much information to include in this section. A step-by-step guide would make this blog way too long, and a 10,000’ overview seems too broad. Let’s see if I can land in a good spot near the middle.

I mostly repeated what I did to install Tailscale on my Mango in 2020, but I made room on the diminutive SanDisk flash drive for Ventoy. I also cleaned things up so I can modify the Tailscale startup job without logging in to the Mango.

Ventoy is occupying the first two partitions on my USB drive, so I added a small ext3 filesystem as the third partition. This has a copy of my tsup.sh script, the state file for Tailscale, and it is where I unpacked the Tailscale mipsle package. For the convenience of future upgrades, I created a symlink pointing to the current version of Tailscale. This is the root directory of the ext3 filesystem:

pat@zaphod:~$ ls -l /mnt/sda3
total 17744
drwx------ 2 root root    16384 Jul 24 16:49 lost+found
lrwxrwxrwx 1 root root       25 Sep 18 06:52 tailscale -> tailscale_1.31.71_mipsle/
drwxr-xr-x 3 root root     4096 Jul 18 12:58 tailscale_1.28.0_mipsle
drwxr-xr-x 3 root root     4096 Sep 15 22:54 tailscale_1.31.71_mipsle
-rw------- 1 root root     1418 Sep 18 07:05 tailscale.state
-rwxr-xr-x 1 root root      676 Sep 18 07:12 tsup.sh

This is my tsup.sh:

#! /bin/sh

# Not sure if the sleep is necessary!
sleep 10

# Start tailscaled in the background, discarding its output.
/mnt/sda3/tailscale/tailscaled -state /mnt/sda3/tailscale.state > /dev/null 2>&1 &

# Make sure my bootable USB partition is unmounted cleanly
/bin/umount /mnt/sda2
/bin/umount /mnt/Ventoy

To make this work, I used the advanced settings tab to add this one line to the end of OpenWRT’s startup script:

(sleep 15; /mnt/sda3/tsup.sh) &

This could all be better, but it works. I did have to sign in once via ssh to run tailscaled and tailscale up manually so I could authorize the Mango on my Tailnet.

The various sleep commands sprinkled around are just laziness. You can probably guess why each of them exists.

I purposely chose to store the tailscale.state file on the flash drive. If I loan out my Mango to a friend, I might not want them connecting to my Tailscale network. If I pop the flash drive out, they won’t have any of the data needed to make a connection.

My GL.iNet Mango can’t use Tailscale as an exit node

And I am not sure exactly why! Tailscale routes packets without issue. I have this node configured as a Tailscale subnet router for its own local subnet. That seems to work correctly, so it is able to route packets from WiFi clients to nodes on my Tailnet.

I was hoping to be able to have the Mango route traffic through an exit node. That way, a FireTV or AppleTV or something similar could watch American Netflix from Ireland, but it isn't cooperating with me.

At first I tried tailscale up --exit-node=seafile, but that immediately cut off all access to local clients connected to the Mango. I was able to ssh in via Tailscale and verify that the Mango was using the exit node.

I updated that command to tailscale up --exit-node=seafile --exit-node-allow-lan-access, and my Mango's local devices were able to talk to the Mango again, but they weren't able to pass traffic any farther than the Mango.

I am close, but not quite close enough!

UPDATE: I got my Mango routing properly through an exit node just a few hours after publishing this blog! This should most likely get a proper write-up, but here’s the short answer. I added the tailscale0 interface as an unmanaged interface in the LuCI interface and made sure it was attached to the WAN firewall group. I am guessing this let the OpenWRT NAT rules do their thing!

What else can I do with my 32 gigabyte Tailscale USB drive?!

When I tested the viability of running Tailscale on a USB flash drive, I used a drive I had on hand. It was an extremely large drive in the physical sense. Once I knew it was working, I bought the smallest SanDisk Cruzer Fit that I could find. It was 32 GB, which was nearly 32 GB more storage than I needed!

While I was redoing things this week, I decided that I should find a use for the rest of that space. I installed Ventoy and a whole mess of bootable disk images. Ventoy should let the drive boot on both UEFI and legacy BIOS systems. Ventoy’s installation script even had an option to leave some space on the end of the drive, so I added a little 512 megabyte ext3 partition for OpenWRT to use.
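
Ventoy's installer handles the reserved-space trick for you. This is a sketch of what that looks like; the device name is a placeholder, and the installer will wipe the drive, so double-check it first.

# Install Ventoy and reserve 512 MB at the end of the drive.
sudo sh Ventoy2Disk.sh -i -r 512 /dev/sdX

# Carve the reserved space into a third partition and format it ext3.
sudo fdisk /dev/sdX        # create /dev/sdX3 in the leftover free space
sudo mkfs.ext3 /dev/sdX3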

My little Ventoy drive has images for:

  • Memtest86
  • FreeDOS
  • Ubuntu 22.04 installer
  • Xubuntu 22.04 installer
  • Windows 10 installer
  • Windows 11 installer

None of this is terribly exciting. I only boot up a computer with a USB drive once every few years now, but I did have to make several USB boot drives over the last few months. I had to reinstall Windows 10 on a laptop with a dead NVMe. I had to install Xubuntu 22.04 on my desktop when I upgraded to an NVMe. I had to run Memtest86 when I bought new RAM a few weeks ago.

I wish I had thought to set this up sooner!

I should be carrying an identical bootable drive in my laptop bag, but I figure it can’t hurt to have spare boot images squirrelled away in my travel router’s USB port!

Conclusion

I think I made the correct choice by continuing to use the stock GL.iNet firmware on my Mango. If this were my permanent home router, it would be way more valuable having an extra 10 megabytes of flash for packages, but this isn’t my home firewall. This is a Swiss Army Knife that I keep in my laptop bag.

Being able to quickly configure the Mango to be a router using a wired connection, a router using WiFi, or a WiFi extender is so much more valuable in my laptop bag! Why can’t I do this easily with stock OpenWRT? Is there a package I don’t know about?!

How Much RAM Do You Need in 2022?


I probably wouldn’t have given this much thought if I didn’t have a stick of RAM fail on me last year. I don’t know that I can remember another time when a stick of RAM that passed a long Memtest86+ test failed on me, and this was a total failure. The machine locked up and wouldn’t boot until I found and removed the bad stick.

Four sticks of RAM, One is Dead!

I couldn’t figure out whether I could do an advance RMA of my memory, and Corsair wanted me to RMA all four sticks as a set. I didn’t want to deal with downtime, and I didn’t want to buy RAM to hold me over while I waited, so I figured I’d limp along with this single-channel 24 GB configuration until it caused problems.

Running while short 8 GB really didn’t cause problems. Everything that I do fit pretty well into 24 GB of RAM. Even so, I bought a faster pair of inexpensive 16 GB DDR4-3200 DIMMs a few weeks ago, so I am back at 32 GB, back to a dual-channel configuration, and my slightly overclocked Ryzen 1600’s RAM is running at 2933 instead of 2666.

Some benchmarks are quite a bit faster with dual-channel RAM, but I’m not noticing the extra 8 GB.

I’ve always bought as much RAM as possible

Within reason. There are always diminishing returns, but any extra RAM will be used for disk caching. For the last two or three decades, disks have been slow. Really slow. Especially when it comes to random access.

A 7200 RPM disk can perform between 100 and 200 random reads or writes per second. That was true for my 13 GB IBM Deskstar drives twenty years ago, and is true even for the latest 18 TB 7200 RPM drives today. The heads in a mechanical disk have to wait until the data they need passes underneath. Any given point on the disk only passes under the read head 120 times each second.
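
Here is the back-of-the-envelope version of that math. The 4 ms average seek time is my ballpark assumption, and real drives vary:

7200 RPM / 60 = 120 rotations per second
average rotational delay = half a rotation = 1/240 s ≈ 4.2 ms
4.2 ms rotational delay + ~4 ms average seek ≈ 8 ms per random access
1,000 ms / 8 ms ≈ 125 random operations per second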

In those days, extra RAM was the only thing hiding those slow seek times.

pat@zaphod:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       9.0Gi       1.6Gi       917Mi        20Gi        20Gi
Swap:           23Gi        93Mi        23Gi
pat@zaphod:~$

My memory usage today is definitely higher than it was a decade ago, but I have always bought enough RAM to make sure at least 50% would be used for disk cache.

My last two workstations have had 32 GB of RAM

I am doing my best to think back to all of my past desktop computers. The timeline is pretty fuzzy, but it seems like I approximately doubled the memory with each upgrade. I almost listed each machine off here, but that feels unnecessary.

My FX-8350 had 32 GB of RAM, which was double the RAM in the giant HP laptop that it replaced. I put 32 GB into the Ryzen 1600 when I built it in 2017, and I left it at 32 GB when I replaced the RAM last month.

What’s different today?

NVMe and SSD drives are really fast!

We’ve needed to cache our disks with RAM for decades because our disks had been stuck at 200 random I/O operations per second. My first SSD could manage more than 5,000 I/O operations per second, and my new NVMe can handle 500,000 operations per second.
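
If you want to see where your own drive lands, fio can measure 4K random-read IOPS directly. This is a minimal sketch with reasonable job parameters, not a tuned benchmark:

fio --name=randread --rw=randread --bs=4k --size=1g \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=30 --time_based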

We don’t have to spackle over slow disks any longer. If I push my machine to the point where it only has a couple of gigabytes of free RAM to use as disk cache, it doesn’t matter. I won’t notice the difference.

While I am just sitting here writing this blog, my machine is using around nine gigabytes of memory for actual work. If I fire up a game, that will likely eat up another 10 or 12 gigabytes.

While I was limping along with only 24 GB in my machine, this was never a problem. Running a game might bring me down to just several gigabytes of RAM for disk cache, yet I didn’t notice any sort of slowdowns or stutters when I would switch back and forth between my game and productivity tasks.

My SATA SSD was fast enough!

I forgot to mention that the vast majority of the months where I was limping along with 24 GB of RAM happened before I upgraded to an NVMe. My 280 megabyte per second Crucial SSD that could only manage a few tens of thousands of I/O operations per second was plenty fast enough for me to never notice when I was down to just a couple of gigabytes of memory available for caching.

In the old days before solid-state drives, this would never have worked out. In the days when I only had eight gigabytes of RAM, my workstation would have felt like it was struggling if I only had a gigabyte or two of free RAM available for caching. If I didn’t have half my RAM available for cache, I would have been shopping for a memory upgrade!

The future we are living in is fantastic.

Your mileage may vary!

I was getting by just fine with 24 gigabytes, and I bet I could just barely squeak by with just 16 gigabytes of memory, but I wouldn’t want to bother trying. I definitely wouldn’t want to give up dual-channel RAM, but if it were possible to buy 12 gigabyte DIMMs, I might have enjoyed having a dual-channel 24-gigabyte setup!

I’m using a handful of gigabytes of memory for Firefox, Thunderbird, Discord, and various other programs. The stuff that is normally running eats up around 9 gigabytes of memory.

The heaviest things I run are Davinci Resolve and games, but never at the same time. I don’t have enough GPU memory for that.

gigabytes of ram meme

In the old days, I would have at least one or two virtual machines running on my workstation. Today, a separate server in my office handles that job.

It used to be handy having my virtual machines on my laptop in the days when my only workstation was my laptop. It was awesome having everything with me at home, at the office, or on the road.

I get more value today having those virtual machines on a dedicated box. I don’t want Home Assistant or Octoprint to reboot just because my Nvidia driver goobers things up and forces me to reboot my desktop!

Besides which, it isn’t 2008 anymore. I don’t have to hope a coffee shop has WiFi. I don’t have to wiggle through an ssh tunnel to get to my data at home or in the office. I can share my phone’s 500-megabit Internet connection and connect to my machines around the world using Tailscale, and they’ll work just like they do when I’m at home.

You might need more memory for those tasks that I don’t have!

If you’re running a mess of virtual machines on your workstation, then you probably already know how much memory you need for those to comfortably fit. If the VM disk images live on an SSD or NVMe, maybe those machines don’t need as much RAM allocated as you think they do.

Those virtual machines are still computers, even if they are sharing processors and disks. Just like my desktop PC, our virtual machines don’t need to rely on cache memory nearly as heavily as they did before we had NVMe drives. The old rules of thumb from the days of slow mechanical disks just don’t apply anymore.

If you’re running make -j 32 on your 16-core Ryzen 5950X, you know you might need a lot of memory just to support all those compiler tasks, but it almost definitely isn’t a big deal if your whole source tree doesn’t stay cached all day. Your NVMe can touch hundreds of thousands of files every second without breaking a sweat!

Is swapping to an NVMe fast?!

I spent a week unscientifically messing with various swap and dirty page settings. I figured that Apple must be leaning on fast NVMe swap and paging to make their 8 GB M1 MacBook Air a usable machine. If they can do it, maybe I can force Linux to dip deeper into swap.
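
I didn’t keep careful notes, but these are the kinds of knobs such an experiment touches. The values here are illustrative, not the settings I landed on:

sudo sysctl vm.swappiness=100            # default is 60; higher makes the kernel swap more eagerly
sudo sysctl vm.dirty_background_ratio=5  # start background writeback of dirty pages sooner
sudo sysctl vm.dirty_ratio=20            # hard ceiling on dirty pages before writers block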

I was able to get about 5 or 6 GB onto my swap partition. When I did, things usually acted just fine. I couldn’t even tell you that my machine was swapping.

Every once in a while, though, things would get really goofy. The whole machine would just grind to a halt without much warning. I never timed it, but if I walked away, it would usually have worked itself out by the time I got back.

There’s probably a happy medium between my current settings and the problematic settings that would work all right. Maybe the defaults would push me a few gigabytes into swap if I disabled half my RAM.

This was fun to experiment with a bit, but not worth spending a real amount of time working on.

Conclusion

It is better to err on the side of caution. If you need to round your memory requirements to the nearest pair of sticks of RAM, you should definitely round up instead of down. If you’re like me, and you think you can get by with 24 GB of RAM, then you had better buy 32 GB!

For decades I always made the same choice. If you asked me to choose between more memory or faster memory, I would always choose more memory. It wasn’t always a problem you could spend your way out of. Sometimes DIMMs with double the capacity were only available in slower speeds. Sometimes your chipset only supports faster RAM speeds with two DIMMs and not four.

My FX-8350 build in 2013 had 32 GB of RAM. My Ryzen 1600 build from 2017 has 32 GB of RAM. If I upgrade to a Ryzen 7800X, it will also have 32 GB of RAM. Before my FX-8350, every major computer upgrade I went through at least doubled my RAM. This feels weird, but it is also amazing and awesome!

Six Months of lvmcache on My Desktop

| Comments

I admit it. It hasn’t quite been six full months since I put a fresh NVMe into my desktop machine and turned on lvmcache. I am nearly three weeks short of that date as I am writing this sentence, but it will probably be another week before I finish this blog, and if I wait any longer I might miss the target by a few months!

I believe I only have good news to report. I’ve torn down, rebuilt, or reconfigured the cache at least three times: once when I installed the NVMe drive, once when I split my slow storage volume into two pieces, and again when I replaced the ancient 4 TB drive with a 12 TB drive.

Here’s the tl;dr!

The cache is fantastic. It works well enough to cache all the games that I play, and they load just as fast as they would if they were installed directly on the NVMe.

I am no longer the least bit concerned about wearing out my flash storage. It sure looks like I won’t run out of write endurance until 10 years after Samsung’s 5-year warranty expires.

I split my slow storage into two separately cached volumes

I did some math a few weeks after setting up my lvmcache. My lvmcache partition on the NVMe is 300 GB, and I process around 200 GB of video files each month. That much is just fine.

Quite a few of my Steam games are over 100 GB in size.

Testing says that the video files I am working on do indeed wind up nearly 100% cached. If we oversimplify the way lvmcache works, and we assume that the cache will be smart enough to always evict the older video files that I won’t be working on in the near future, this only leaves me enough room in cache for a single game.

-------------------------------------------------------------------------
LVM [2.03.11(2)] cache report of given device /dev/mapper/zaphodvg-slow
-------------------------------------------------------------------------
- Cache Usage: 99.9% - Metadata Usage: 6.6%
- Read Hit Rate: 66.2% - Write Hit Rate: 56.7%
- Demotions/Promotions/Dirty: 27129/27165/0
- Feature arguments in use: metadata2 writeback no_discard_passdown
- Core arguments in use : migration_threshold 8192 smq 0
  - Cache Policy: stochastic multiqueue (smq)
- Cache Metadata Mode: rw
- MetaData Operation Health: ok

-------------------------------------------------------------------------
LVM [2.03.11(2)] cache report of given device /dev/mapper/zaphodvg-churn
-------------------------------------------------------------------------
- Cache Usage: 85.2% - Metadata Usage: 24.4%
- Read Hit Rate: 50.1% - Write Hit Rate: 3.1%
- Demotions/Promotions/Dirty: 0/183205/0
- Feature arguments in use: metadata2 writethrough no_discard_passdown
- Core arguments in use : migration_threshold 8192 smq 0
  - Cache Policy: stochastic multiqueue (smq)
- Cache Metadata Mode: rw
- MetaData Operation Health: ok

The math says I should have used 300 GB for my operating system and 700 GB for the cache. Resizing the encrypted root filesystem and juggling everything around felt like too much effort, so I just set up a separate cache on my old Crucial 480 GB SATA SSD.

I call my cached volumes slow and churn

I probably need better names. The slow volume’s name has been grandfathered in, but the churn volume’s name is rather appropriate.

One of those volumes is where I churn through data. I dump 100 GB of video onto that volume at a time, work on it for a few weeks, then I dump another 100 GB of video. This just keeps happening every few weeks. This sounds a bit like churning, doesn’t it?!

The churn volume is an 8 TB slice of my new 12 TB hard drive, and it is cached by the old 480 GB SSD. That SSD is plenty fast enough to handle the 50- and 100-megabit-per-second video files our Sony ZV-1 and Sony A7S3 cameras create.

The slow volume got its name because it just isn’t fast like the NVMe. This is where my Steam library lives. I have installed just over 2.3 TB of games that are being cached by a 300 GB partition on my 1 TB NVMe.
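
Attaching a cache like this to an existing logical volume only takes a couple of commands. This is a minimal sketch using my volume group’s name from the reports above; the NVMe partition path is a placeholder:

sudo lvcreate -L 300G -n slow_cache zaphodvg /dev/nvme0n1p3   # carve the cache out of the NVMe
sudo lvconvert --type cache --cachevol slow_cache \
     --cachemode writeback zaphodvg/slow                      # attach it to the slow volume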

Is my mid-range NVMe going to survive being a cache?!

I was going to tell you that it depends on who you ask, but both the conservative and the pessimistic answers are positive!

My 1 TB Samsung 980 has a 5-year warranty with a guarantee of 320 TB of written data. I am right around 10 TB of writes after six months. That means Samsung thinks I will make it 10 years past their warranty period.

Percentage Used:                    1%
Data Units Read:                    22,669,873 [11.6 TB]
Data Units Written:                 19,392,238 [9.92 TB]
Host Read Commands:                 188,764,027
Host Write Commands:                308,581,800

The data in the SMART report says I have only used 1% of my writes. If that’s correct, then this NVMe will outlive me.
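
That report comes straight from smartmontools, if you want to pull the same numbers for your own drive. The device name is whatever your NVMe shows up as:

sudo smartctl -a /dev/nvme0n1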

I assume that my writes have slowed considerably. I had to drop the lvmcache every time I resized my slow and churn volumes. That means 1 TB of those writes had to happen just to refill the cache.

I’m also no longer passing 300 GB of video files through the NVMe’s cache partition every month. My old Crucial SSD is bearing that weight now.

That little old SSD is doing a good job. It spent eight years as the primary storage device in my desktop computer, and SMART says it has 33% of its life remaining. The data sheet says the Crucial SSD is rated for 72 TB of writes, so it will probably make it through the next couple of years!

The conundrum of two caches

Is it better for me to have two separate caches? Would I be better off with one 800 GB cache instead of a 300 GB cache and a 480 GB cache? It is complicated, and I just can’t make up my mind about how I feel about this.

I do know for certain that I would much rather have both of these caches on my NVMe. If they were both on the NVMe, I would just adjust the proportions.

On one hand, dealing with a single storage volume can be much more convenient. When I installed my 12 TB hard drive, I had to decide how much space I needed for my Steam volume and how much space I would need for video files.

If I made the wrong choices, I will have to shrink one volume and extend the other in a year or two. I will have to disable and recreate both lvmcache caches to make that happen.

On the other hand, having two different caches handling two different kinds of data is a much more effective use of cache space. My Steam games that I play regularly tend to just stay put in the cache, and it doesn’t matter how long the older videos stay in their separate cache, because they won’t be pushing games out!

If I had one unified cache I bet it would probably take a month or more for old videos to get demoted. It wouldn’t surprise me if that means I’d have 300 GB of unnecessary video that I’ve already finished editing clogging up my cache at any given time.

One could argue that I could have sidestepped that problem by buying a 2 TB NVMe and using a bigger cache, but that doesn’t eliminate the issue. It just makes it a lot smaller, right? Besides, the goal was to save money by buying less flash storage!

I’m not running Linux! Is there some way I can do this on Windows?!

Yes. Maybe. Most likely.

My friend Brian Moses has been watching me talk about lvmcache for ages, and he’s been watching me post screenshots full of cache data for just as long on our Discord server. When he built his new gaming PC, he did some research and wound up buying a copy of Primocache.

I don’t think he’s run much in the way of benchmarks, and if he has, he hasn’t posted the results of those tests. I asked Google about primocache for gaming, and the first hit is Leonard Putra’s video showing side-by-side footage of a few games loading with and without Primocache.

Primocache seems to be doing the job for Leonard. Three out of four games had a 98% cache hit rate. Just Cause 4 had a slightly lower hit rate, and it didn’t load much faster.

Some games just don’t benefit from faster disks. Most of the games I tested load just as quickly from my SATA SSD as they do from my much faster NVMe.

I have no first-hand experience with Primocache, but it certainly looks like it is worth checking out.

Conclusion

This sort of caching is only a Band-Aid. In five years we will likely have more NVMe storage than we know what to do with.

In the meantime, I am excited to have lvmcache available. I only have a 1 terabyte NVMe, but I have 2.4 terabytes of games installed. How awesome is that?!

Are you thinking about using a solid-state disk cache in front of a slow disk on your desktop or workstation? Are you already caching your workstation with lvmcache or something similar? How is it working out for you? Let me know in the comments, or stop by the Butter, What?! Discord server to chat with me about it!

So Many Tailscale Exit Nodes!

| Comments

I don’t know how I managed to notice this, because I almost never open the Google Play store on my phone, but I did open it a few nights ago, and there was a Tailscale update waiting. I clicked the update button, and I think I might have had to open Tailscale to fire the VPN connection back up.

That’s when I noticed a menu option to enable using my phone as an exit node. What?! My phone is set to install Tailscale beta releases. This says it is a release candidate, so I guess this feature has been hiding on my phone for a little while already.

Of course I had to try it out. It works just fine. This did make me realize that I have yet to set up any exit nodes on my Tailnet, so it must be time to put exit nodes on all the things.

I set up an exit node on one of my virtual servers in the house, my Android phone, and on my Raspberry Pi server at Brian Moses’s house.

Then I got an email telling me that I paid $4.26 for the month for my Digital Ocean droplet that runs the Nginx server for several of our blogs. Why didn’t I think to enable the droplet as an exit node?! It is an exit node now.

What is an exit node? Why would you need one?

An exit node is how you get yourself some of the functionality of something like NordVPN or Private Internet Access for free. Once a machine is configured to be an exit node, any other machine on your Tailnet can force all their Internet traffic through that node.

Exit Nodes Everywhere!

What if you’re on your laptop at Starbucks and want to make sure the barista who owns the WiFi can’t snoop on your traffic? What if the network in your hotel is blocking access to YouTube? What if you’re in Ireland and want to watch shows that are only on American Netflix?

You just click on your Tailscale icon, choose the exit node option, and choose which exit node you want to route this computer’s Internet traffic through. All your traffic will flow through an encrypted Wireguard connection from your laptop in Ireland to your other computer in Plano, TX, and from there it will travel the unencrypted Internet to Netflix.
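
On a Linux machine without the tray icon, the same dance is two commands. The Tailscale IP here is just a placeholder, and the exit node also needs IP forwarding enabled and approval in the admin console:

sudo tailscale up --advertise-exit-node         # on the machine offering the exit node
sudo tailscale up --exit-node=100.101.102.103   # on the laptop routing through it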

Tailscale does the right thing again

It wasn’t until the next morning that I worried I had committed an offense! It seemed sensible to turn on at least one exit node at every physical location where I have a Tailscale node, and one of those nodes is my Seafile server at Brian Moses’s house.

I remembered that I am sharing the Seafile Pi with Jeremy Cook and my wife. Neither of them is the sort of nefarious character I would expect to abuse Brian’s Internet connection, but I certainly hadn’t thought about this, and I most definitely didn’t want to abuse my free colocation facility!

I didn’t need to worry. Tailscale does the right thing. If you activate an exit node after you’ve already shared the node, the people you shared it with won’t get access to the exit node. Not only that, but you can’t give your friends access to the exit node after the fact without their knowledge.

Tailscale Sharing Dialog

You have to send them a new share invite with the exit node enabled. I verified this by having Brian check to see if my Seafile server showed up in his list of available exit nodes.

Conclusion

Tailscale exit nodes are neat. Sometimes you need Netflix to think you’re in a different country. Sometimes you want to hide your traffic from Starbucks or your employer. Sometimes you just need to test that your website is working as expected from another physical location. A Tailscale exit node can cover all these situations and more.

I am not sure when I will need an exit node on an Android phone, but I am excited that I have the option, and I am excited about the idea of repurposing old Android hardware. You can run Octoprint on a phone using Octo4a, someone has set up a backup server on their old cracked Android phone, and now you can throw Tailscale on a cheap old phone from your junk drawer and leave an exit node behind anywhere you want. That’s awesome!

What I Learned Selling the 3D Printed Soda Adapter for Six Months on Tindie

| Comments

Putting the SodaStream Terra adapters up for sale in my Tindie store was an accident. My friend Alex designed the adapter. He got busy with real life and didn’t want to deal with the hassle of selling them on Etsy any longer, so he asked if we would like to take over.

Chris had just started setting up her Etsy store the week before, and she only had one item for sale. The timing seemed good, and he was selling one or two adapters every day. It seemed like a good way to get some initial sales onto her store, so we took on the task of printing and selling 3D printed soda adapters.

There was some lag between Alex running out of stock and us adding the item to Chris’s store, so there were immediately a bunch of orders. Chris paid for labels and shipped those out, then more orders came in, and she paid for labels and shipped those out.

Then Etsy closed her store. Etsy didn’t say why. Etsy didn’t respond to her emails. The store is gone, and Chris never got paid for the inventory she shipped out. It was quite a bummer.

So we dropped the item on my existing Tindie store.

tl;dr I just want a SodaStream Terra adapter!

I am no longer selling the adapters. As has been the case for most of the time the adapters have been in my store, you can download the 3D model of the soda adapter from printables.com and make your own.

The harder part is acquiring the rubber o-rings. They’re easy to get in quantities of 100 or 200, and they’re easy to get in assortments of hundreds of o-rings. The trouble we’ve had with the assortments is that not all assortments are measured the same way!

I have a whole mess of o-rings left over. You can find the correct o-rings in my Tindie store.

In my opinion, you should skip the 3D printed adapter and get the metal soda adapter from Amazon. I’ve been recommending this in my Tindie store since the product became available. It is a much more robust solution!

Why was I hesitant to sell the adapter?

Alex called me up one day and explained that he wanted to use his 3D printer to make an adapter to connect the old-style SodaStream CO2 canisters to the SodaStream Terra. I told him it was a bad idea, and that it couldn’t be done.

We drove to Target, bought a SodaStream Terra, and got to measuring. We had a basic part designed and printed in a couple of hours. It didn’t work, because SodaStream designed the new fitting to be difficult to connect to. Even though the adapter worked for Alex, I suspect SodaStream’s purposefully convoluted engineering has been trouble for some of our customers.

It took him a few iterations to get the air directed to the correct places, but he did get it working.

Just because it was working doesn’t mean it is a good solution. I’ve been designing 3D printed parts for eight years. I know that 3D prints are weakest along their layer lines. I know PLA and PLA+ aren’t the ideal material to stand up to this sort of pressure.

Seeing it work and hearing that his customers were excited about using their adapters helped ease my concerns here.

There’s also the fact that SodaStream made it difficult to adapt to their connector on purpose. I could write 2,000 words about this part alone!

The failure rate is just too high

I sold 240 adapters over roughly six months. I’ve issued refunds or sent replacements for around 30 orders. Why are they failing so often? Let’s start with the problems that may qualify as user error.

More than a few people have managed to cross-thread the adapter. If you are at all mechanically inclined, it is really obvious that this is about to happen. It is also pretty difficult to do accidentally, but if you do, most people are plenty strong enough to destroy PLA+ threads.

At least a few people seem to have trouble trimming the plunger to the correct length. It is a bummer that the plunger has to be trimmed to account for different bulk CO2 kits.

We suspect that many failures happen because the customer doesn’t screw the adapter on tight enough. If you don’t compress the large o-ring enough to make a good seal, CO2 can escape. Once the CO2 starts escaping, it has a much larger surface area to push up against.

This ties in with another problem. Some folks have most definitely managed to tighten the adapter way too much! The adapter is only 3D printed PLA+, so a person is definitely strong enough to break things. Especially if they put a wrench on it!

There’s no good way to document this for the average customer. Saying, “You have to tighten it enough, but don’t tighten it too much!” just isn’t terribly helpful.

There is also a good chance that some people’s SodaStreams are just built to slightly different tolerances than the machine Alex designed the adapter against. If the machining on Alex’s unit leaned towards the tighter side of the tolerances, then there’s a good chance that folks with machines leaning towards the looser side would have leaks.

Mitigating the weaknesses of 3D printing

At first, I was 3D printing with the default PrusaSlicer profiles just like Alex. As the failures came in, I started making tweaks.

Alex tried increasing the infill percentage, but that doesn’t make parts all that much stronger. I started by adding as many perimeters as would fit. Then I started slowly increasing the temperature and extrusion multipliers.

Hotter plastic tends to have better layer adhesion, at least up to a point, but it leads to stringier prints. I’d rather the adapters work than attempt to completely avoid stringing.

The slightly higher extrusion multiplier also helps keep gaps out of the layers, which helps with adhesion. I doubt either of these changes make a huge difference, but every inch counts!

The increased extrusion multiplier also has the side effect of making the tolerances a bit tighter. That means the small o-ring is tighter in its slot, and the plunger pushes on it just a little harder. That ought to make it less likely to leak. The correct way to tighten up the tolerances would be editing the model, but that wasn’t really my goal. It was just a happy accident.
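
If you want to experiment with similar tweaks, they all live in PrusaSlicer’s print and filament settings. The values below are illustrative, not the exact numbers I settled on:

# PrusaSlicer config overrides (illustrative values)
perimeters = 6               # as many walls as will fit in the part
temperature = 215            # hotter plastic bonds layers better, but strings more
extrusion_multiplier = 1.05  # slight over-extrusion closes gaps between lines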

Why not try a different material?

This is where we get to the fundamental problem of Pat selling soda adapters.

I don’t drink soda. I am not a soda enthusiast. I am not excited about SodaStream machines. Printing with a very different material would require testing, tweaking, and more testing.

If this were my hobby, I would be diving right in. It isn’t my hobby, though, so I am just not excited about pushing the design into new materials.

Especially now that the all-metal soda machine adapter is available. There’s no beating that solution with plastic!

Since PLA+ works more than 80% of the time, I am confident that nylon would survive more than 99% of the time. Nylon is a pain to print with an FDM 3D printer. I sure don’t want to be doing that every day!

Expectation vs. reality

Most customers found my Tindie listing by way of Alex’s video about his adapter design. Alex’s video is pretty positive. He is proud of the work he did, as he should be, and he made that video before a significant number of people got adapters in their hands.

I’ve tried to keep a balanced product description on Tindie. I don’t hide that there are failures. I made sure to point everyone towards the solid metal bulk CO2 adapter.

I believe most people understood what they were ordering. I think at least a few people were expecting some sort of unicorn to arrive in their mailbox.

Why continue to sell the plastic adapter when the metal adapter exists?!

I expected that I would be discontinuing the product as soon as the metal adapter was in stock. Surely everyone is using the 3D-printed adapter for bulk setups, right?!

Some people definitely continued to use the 3D-printed adapter for bulk CO2. A few people ordered adapters before messaging me to ask which bulk-CO2 kit they should buy! I told them they shouldn’t, and they should order the parts that match the all-metal adapter. If they told me that’s what they wanted to do, I refunded their money.

Most of my customers just want to be able to plug the SodaStream canisters from ALDI into their SodaStream Terra. They’re the reason I decided to keep on selling these adapters.

Conclusion

There aren’t any soda adapters in my Tindie store, but things are still chugging along. I am still cutting carbon-fiber ducts on the CNC pretty regularly, and I added a new carbon-fiber backpack hacking item to my store recently. I am pretty excited about those no-sew backpack straps, but I don’t have a good way to put them in front of the people who would want to use them. I don’t even have a good name for them!

I am sorry to see the extra revenue go. The extra money has actually made a real difference for us this year, but the ratio of happy to unhappy customers just isn’t high enough for me to feel comfortable. I am much happier selling over-engineered carbon-fiber doodads than plastic bits that have to stand up to 1,200 PSI!

Do You Need to Buy The Fastest NVMe?

| Comments

Do you want the easy answer? No! You almost definitely do not need the fastest NVMe available. Most of us probably won’t even notice the difference between the slowest and the fastest NVMe drives.

NOTE: The XPG Gammix S70 isn’t literally the absolute fastest NVMe available, but it is definitely very near the top of the list, and it is the super-fast drive I most often see good deals on. Even if you manage to exhaust the S70’s large write cache in one go, it is still quite fast, and it often goes on sale for $100 per terabyte.

I am also absolutely certain that there is someone out there with a very particular use case that would truly benefit from 7 GB per second reads or writes. Most of us don’t even have software that can keep that up for more than a fraction of a second.

I don’t have a budget! I am just going to buy the fastest thing!

If you truly have no budget, then you should absolutely buy what makes you happy. Most of us who say we don’t have a budget are still making choices based on price.

The price-to-performance graph for any piece of hardware in your computer tends to look like a hockey stick. The price of a component usually increases pretty linearly from the low end to very nearly the high end, but it usually takes a sharp turn about 80% or 90% of the way along the graph. You might have to pay three times as much to go from 80% to the very top of the performance graph.

This may not even be worth writing about because the fastest NVMe drives only cost twice as much as the no-name cheap drives. The cheapest no-name NVMe deal I’ve seen so far was $55 per terabyte, while some of the fastest NVMe drives go on sale for around $110 per terabyte. The middle-of-the-road drives with good warranties from reputable manufacturers are usually between $80 and $90 per terabyte.

I wonder how much cheaper these will be while you are reading this in the future?!

This isn’t anywhere near as big a jump as going from the biggest Ryzen CPU to the smallest Threadripper. Even so, if my words mean you can move $60 from your NVMe to a slightly faster CPU or GPU, then it was worth my time!

I can’t max out my lower-end Samsung 980 NVMe

I can run a benchmark or spam some zeroes over the drive with dd and hit several gigabytes per second. I am running LUKS on top of my NVMe, and that layer of AES encryption seems to have me capped out at around 1.6 gigabytes per second. I haven’t found a use case that will register anywhere near that much bandwidth while monitoring with dstat.
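
You can check your own machine’s encryption ceiling without touching a disk, because cryptsetup ships a synthetic in-memory benchmark:

cryptsetup benchmark   # reports AES-XTS and other cipher throughput for your CPU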

I can hit these big numbers if I copy a huge directory of files from the NVMe to itself. This isn’t something most people do all that often.

The Internet says my drive will run out of write cache if I manage to write around 300 gigabytes in one sustained burst. I don’t have any external sources that can supply data that fast. In practice, my 40-gigabit Infiniband network tops out at 13 gigabits per second because it is limited by my desktop computer’s PCIe slots. That’s roughly as fast as my encryption can go, but the drives on my server can only sustain about 60% of that under the very best conditions.

The most data I ever dump onto my computer comes from my cameras. It is normal for me to have one or two nearly full 64 GB microSD cards after filming. This could potentially fill up 1/3 of my Samsung NVMe’s write cache, but those cards only read at about 20 megabytes per second.

I edit lots of video, but that never needs more than 100 or 150 megabytes per second of disk bandwidth.

I’ve been monitoring game loading times from my lvmcache. I haven’t found a game that has a bottleneck on disk operations, and I have yet to see a number higher than 180 megabytes per second in dstat while loading a game or new level in a game.
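
If you want to watch the same numbers, all of this monitoring is just dstat running in a terminal while a game loads:

dstat -d 1   # per-second disk read/write totals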

It is nice that my NVMe can manage hundreds of thousands of IOPS. That’s at least ten times more than my old SATA SSD, but my usual tasks don’t go any faster after my upgrade.

dstat doesn’t tell the whole story

Just because I am only seeing 180 megabytes per second in dstat doesn’t mean that I’m not benefiting from the 1.6 gigabytes per second or more that my NVMe is capable of providing. dstat is giving me a snapshot of my throughput at intervals of one second.

During that full second, whatever game was loading had read 180 megabytes from the disk. Odds are that this happened in a little over 100 milliseconds. My old SATA SSD would have also read 180 megabytes during that same second, but it would have taken nearly 500 milliseconds.

This improved latency is nice, and if software is blocked while waiting for that data, then hundreds of milliseconds saved here and there would add up to actual seconds. Something that took 20 seconds to load on the SATA SSD might now take 17 seconds.

The game loading times that I have managed to check don’t show such improvements. These games are likely still busy computing something important while waiting for more data.

Conclusion

I am certain that some of you reading this will actually benefit from a top-of-the-line NVMe. There are most definitely workflows out there that can benefit from 7 gigabyte per second reads and writes. I haven’t run into one myself yet, and I’d bet that the majority of you won’t either.

When I upgraded from a mechanical disk that topped out at 200 IOPS to the ancient Intel X25-M with its 5,000 IOPS, it was an absolute game changer. Everything seemed to load from disk instantaneously. Upgrading to the next SATA SSD with 50,000 IOPS didn’t feel much different, and neither does this NVMe with 500,000 IOPS.

We need some pretty serious changes in our hardware, operating systems, and software to really take advantage of the difference between 50,000 and 500,000 IOPS. Until then, we can definitely save a few bucks by skipping the upgrades to the fastest NVMe drives on the market.