I can’t help myself. I don’t need to try another coding plan. My needs are simple. I never reach the quota on my Z.ai Coding Lite plan, though things may be tighter now that Z.ai charges three requests against the quota for every GLM-5 message. Even so, I couldn’t help myself. I signed up for a $3 per month plan at Chutes.ai. I had so much fun there that I wound up also signing up for a $20 per month subscription from Synthetic.new.
I will definitely pare this back down to one or two subscriptions, but I figured I should write about these while I have all three subscriptions active.

I added my coding plan quota status to my Home Assistant macropad dashboard on my desk. Isn’t that neat?!
I am going to assume that you are like me. Just a guy at home. Maybe you have a homelab. You write the occasional script to glue things together. You use OpenCode to help you set up fancy things in Home Assistant. You use the robots to help you model things in OpenSCAD.
That means I am going to focus on the cheaper end of every service’s pricing chart. We don’t need 1,350 LLM requests per hour. We aren’t developing software eight hours each day. We have a busy day if we manage to burn through 120 requests.
I am not going to attempt to benchmark the speed of these services with any precision. I am not going to try to figure out which service has stronger models. I am only going to tell you about my experiences. Benchmarks are hard work!
Comparing coding plans isn’t straightforward
Some plans have better models. Some have quotas that reset every five hours, while others reset once a day. Some plans that aren’t in this comparison have a separate weekly quota in addition to the five-hour limits.
That reset timing might matter more than you’d think. Maybe you only sit down to crank out code for a couple of hours each day, so you’d like to burn through the 300 requests you’re allotted for the entire day all at once. But if you’re the type to chip away at projects throughout the day, you might get more for your money with a quota of 120 requests that reset every five hours.
Then there’s the whole model situation. Some services give you one family of models, and that’s it. Others hand you the full assortment of Kimi, DeepSeek, and MiniMax. Having choices is great, but I’ll admit I sometimes waste five minutes just deciding which model to use. Sometimes you just get things done when you don’t have a choice.
Oh, and don’t get me started on everything else baked into the price. One subscription includes useful MCP servers, some providers are faster, and some are hosted in the US. Privacy policies seem alright across the board, but their verbiage is all different. By the time you factor it all in, you realize comparing these things is like comparing apples to… slightly different apples.
I started writing this post based on incorrect information!
I signed up for Synthetic.new because I figured that I had to try it out. Surely you get something when you pay seven times more for comparable quotas, right? You do, but it isn’t quite as impressive as I first thought.
On my first evening, Synthetic’s Kimi K2.5 throughput was three times faster than Chutes’ or OpenCode Zen’s temporarily free Kimi K2.5. What I didn’t find out until a few days later is that Synthetic was having trouble getting Kimi K2.5 up and running on their hardware, so they were outsourcing inference to another party.
Synthetic is running Kimi K2.5 on their own network now.
| Run | Chutes | Synthetic | OpenCode Zen (free) |
|---|---|---|---|
| 1 | 471 | 26.5 | 63 |
| 2 | 146 | 53 | 159 |
| 3 | 71 | 29 | 65 |
| 4 | 60 | 153 | 39 |
| 5 | 120 | 74 | 29 |
NOTE: The table shows the number of seconds each run took to check the grammar on this blog post using Kimi K2.5.
Don’t take this table as being proper science. I decided to point my grammar-checking swarm at three different Kimi K2.5 providers, and I ran a grammar check of this blog post at random times. I’m not averaging runs. I’m not running this pseudobenchmark every ten minutes. All that I can say for sure is that the numbers match my experiences.
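If you want to run the same kind of unscientific test yourself, a sketch like this is all it takes. It assumes OpenCode's non-interactive `opencode run` command and its `--model` flag, and the model IDs are placeholders that will depend entirely on how the providers are named in your own OpenCode config.

```python
import subprocess
import time

PROMPT = "Check the grammar of this blog post and list every mistake you find."

# Placeholder model IDs -- swap in the provider/model names from your
# own OpenCode configuration before running this.
MODELS = [
    "chutes/kimi-k2.5",
    "synthetic/kimi-k2.5",
    "opencode/kimi-k2.5-free",
]

for model in MODELS:
    start = time.monotonic()
    # `opencode run` fires a single non-interactive prompt and exits.
    subprocess.run(["opencode", "run", "--model", model, PROMPT], check=True)
    print(f"{model}: {time.monotonic() - start:.0f} seconds")
```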
We’ll talk about Chutes and Synthetic in more detail soon, but I do want to make a quick note about OpenCode Zen’s free Kimi K2.5. It won’t be free forever, and I have had instances where it rejects my prompts because it is over capacity. When that happens, it can take 90 seconds or more for a prompt to go through.
Z.ai’s coding plan
Z.ai has gotten a lot of criticism since I signed up for my Z.ai coding plan. Their service got slower for a while when they released GLM-4.7. I assume the hype of the release drove up usage and subscriber count, and I wouldn’t be surprised if it took them a while to figure out how to balance their GLM-4.6 and GLM-4.7 split on their inference servers.
They recently announced that they were going to be limiting the number of new users who can subscribe so they don’t go too far over capacity. Even so, their service is slower than ever. It is possibly too slow to reach the quotas of their Pro plan, and not nearly fast enough to properly utilize a Max plan.
Enough readers of my blog have used my Z.ai referral code that I wound up upgrading to the Pro plan. I’ve never managed to use more than 3% of my quota, but I also never work for an entire five-hour period.
Z.ai’s plan only gives you access to Z.ai’s models. GLM-4.7 is a capable model that works great with OpenCode, and their newly released GLM-5 feels like it might actually be a little better than Kimi K2.5.
The biggest perk of Z.ai’s coding plan is probably the MCP servers that they offer. I have them all configured in OpenCode. The work I do doesn’t hit them all that often, but I do rack up several dozen uses of the WebFetch and WebSearch MCP servers each month. They also offer a vision MCP with OCR and their Zread MCP for searching indexes of documentation for open-source repositories.
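If you haven’t wired these up before, they live in your opencode.json. This is a minimal sketch of the shape, assuming OpenCode’s remote MCP config format; the URL is a placeholder, and Z.ai’s documentation lists the real endpoints along with the API key each server expects.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "zai-web-search": {
      "type": "remote",
      "url": "https://example.invalid/zai-web-search-mcp",
      "enabled": true
    }
  }
}
```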
Z.ai’s Lite coding plan has a quota of 120 requests every five hours. There is no weekly quota, but they have been saying one is coming. It might already be in place, but I can’t see it on my legacy account.
There is a big problem with Z.ai’s Lite plan at the time I am writing this. Their excellent new GLM-5 model is only available on the Pro and Max plans, but not on the Lite plan. Z.ai says they are working to add capacity so they can add GLM-5 to the Lite plan, but they haven’t said when that will happen. Until they do, this gives a big advantage to the next provider on the list.
- Is The $6 Z.ai Coding Plan a No-Brainer?
- OpenCode with Local LLMs — Can a 16 GB GPU Compete With The Cloud?
Chutes.ai
I have been searching Google for inexpensive LLM subscriptions every week since signing up for my Z.ai plan. I didn’t learn about Chutes until I was reading a random Reddit thread where somebody mentioned them. Their service is well hidden!
I was immediately intrigued. $3 per month with a quota of 300 requests per day. They give you access to all sorts of open-weight models including GLM-5, Devstral 2, Kimi K2.5, MiniMax M2.5, and DeepSeek V3.2.
That is a real $3. They aren’t telling you this is 50% off like Z.ai’s marketing has been doing for the last several months. It is just listed at $3. Even better, the next step up the ladder jumps to nearly seven times the daily quota for $10.
| Model | Z.ai | Chutes | Synthetic |
|---|---|---|---|
| GLM-5 | Pro/Max | Yes | |
| GLM-4.7 | Yes | Yes | Yes |
| Kimi K2.5 | | Yes | Yes |
| MiniMax M2.5 | | Yes | Yes |
| Devstral 2 | | Yes | |
| DeepSeek V3.2 | | Yes | Yes |
| GPT-OSS-120B | | Yes | Yes |
| Gemma 3 27B | | Yes | |
| MCP Servers | Yes | | |
There’s a lot of value here. You get access to all the same models that Z.ai offers and then some. You can probably squeeze a bit more value out of Z.ai if you manage to burn through your Z.ai quota during three five-hour periods in the same day, but Chutes gives you more than double Z.ai’s five-hour quota for the entire day, and they’ll let you burn through all of them in a single session. This works better for me, and Chutes’ smallest plan is less than half the price of Z.ai’s offering.
While Z.ai has gotten slow over the last two months, Chutes has been slightly unreliable. Sometimes my OpenCode session will just seem to get stuck. I have to cancel the current operation and ask it to continue.
Is this a deal-breaker for you? I don’t mind sending a “boop” message every once in a while.
I’ve come to rely on Chutes for my blogging workflow. I have an OpenCode skill that launches a swarm of subagents to check my grammar. Each subagent uses a different model, and the main model collects the suggestions and shows me the ones that two models agreed on. I am using GLM-4.7 via my Z.ai subscription, and GPT-OSS-120B and Gemma 3 27B via my Chutes subscription. These three different lineages of models sometimes provide me with very different results.
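The skill itself is mostly prompts, but the consensus step at the end is simple enough to sketch. This is a hedged approximation of the idea in plain Python; in my actual skill, the main model does the matching with judgment rather than exact string comparisons, so treat this as an illustration rather than the real implementation.

```python
from collections import Counter

def normalize(s: str) -> str:
    # Collapse case and whitespace so near-identical suggestions from
    # different models count as the same vote.
    return " ".join(s.lower().split())

def consensus(suggestion_lists: list[list[str]], min_votes: int = 2) -> list[str]:
    votes: Counter[str] = Counter()
    first_seen: dict[str, str] = {}
    for suggestions in suggestion_lists:
        for s in suggestions:
            first_seen.setdefault(normalize(s), s)
        # Each model gets at most one vote per unique suggestion.
        for key in {normalize(s) for s in suggestions}:
            votes[key] += 1
    return [first_seen[key] for key, n in votes.items() if n >= min_votes]

# Example with three hypothetical result sets:
glm = ["Change 'one subscriptions' to 'one subscription'",
       "Missing comma after 'Even so'"]
gpt_oss = ["change 'one subscriptions' to 'one subscription'"]
gemma = ["Missing comma after 'Even so'",
         "Sentence fragment in the final paragraph"]

print(consensus([glm, gpt_oss, gemma]))
# Only the two suggestions that two models agreed on survive.
```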
Synthetic.new
Synthetic.new is priced in a different category than the other two. Their cheapest plan is $20 per month with a quota similar to Z.ai’s Lite plan. They are based in the United States, and my understanding is that they run all their inference in the clouds of large US providers.
Synthetic has a more professional and enterprise-grade feel than Chutes or Z.ai. The more time I spend flipping back and forth between the same models on Synthetic and Chutes, the more I think that professional veneer doesn’t matter.
Synthetic really needs to be as good or better than an OpenAI Codex subscription at this price.
The first thing I did was fire up two simultaneous OpenCode sessions running Kimi K2.5, one on Chutes and one on Synthetic.new. I gave them the same code to work on and exactly the same prompt. It was obvious that Synthetic.new was running significantly faster.
They did NOT come back with identical plans, so comparing the completion times isn’t entirely fair or precise. That said, Synthetic.new can be three times faster than Chutes. Even so, sometimes Synthetic is slower than everyone else. I do feel that Synthetic is more consistent, but I don’t know if that small potential boost in speed is worth seven times the price.
I don’t feel that Synthetic.new fits all that well into this comparison. When I saw the pricing, I immediately thought their $60 plan would be a fantastic deal for professionals who are using Claude Max. The offerings from both Synthetic.new and Anthropic are definitely targeted at someone with bigger workloads than mine.
If Kimi K2.5, GLM-5, and Sonnet 4.6 were students, they would attend the same classes and all three would be getting passing grades. I suspect Sonnet 4.6 would be doing a little better, but they’re definitely peers. Maybe you would want to subscribe to Claude Max 5x at $100 per month to use your 75 or so requests with Opus for the planning phase, then use the 1,350 requests with Kimi K2.5 on your Synthetic.new Pro plan for implementation. You only get 900 Sonnet requests with a $200 Claude Max subscription.
NOTE: I am not sure how Anthropic does their math. Claude Max 5x gives you 225 Sonnet requests per five-hour window, and it is my understanding that they count one Opus request as three Sonnet requests. I can’t find any official documentation that says this. If you’re already using Claude Max, then you probably already have an idea of how much Opus use you get each day.
Z.ai and Chutes are good for the workhorse requests when using Claude, Codex, or Gemini for planning, but Synthetic.new’s service and privacy policy seem to match the three big providers more closely.
NOTE: I am not a legal expert. I’m not convinced that Synthetic’s privacy policy or terms of service are stronger than Z.ai’s or Chutes.ai’s, but there is something about Synthetic that feels more trustworthy to me.
There is something unique about Synthetic.new’s quotas!
Synthetic gives you a limited number of LLM requests during a 5-hour window, just like almost every competing LLM subscription. What is unique about Synthetic is that they don’t charge you a full request for small prompts.
They only charge you 0.1 of a request for tiny tool-call requests. I see these tiny requests scroll past my OpenCode window quite often. A dozen of them going by in a 20-minute session is normal for me.
Remember when I said that it is difficult to directly compare coding plans? Add this to the list of reasons. In practice, you’re probably getting more like 160 or 180 requests every five hours.
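Here is the back-of-the-envelope math behind that estimate. The quota and the 0.1 charge are Synthetic’s numbers; the traffic mix is my own assumption based on watching OpenCode scroll by, not anything they publish.

```python
quota = 135             # requests per five-hour window on the Standard plan
tool_call_cost = 0.1    # what a tiny tool-call request costs you
tool_call_share = 0.25  # assume one in four requests is a tiny tool call

avg_cost = (1 - tool_call_share) * 1.0 + tool_call_share * tool_call_cost
print(f"Effective prompts per window: {quota / avg_cost:.0f}")  # -> 174
```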
How much does speed matter?
I keep wanting to use the word performance, but when people talk about an LLM’s performance, they usually mean the quality of its output, regardless of how long it takes. This blog post is mostly focused on Kimi K2.5, or at least on models that perform similarly enough. What I am noticing here is the speed of token delivery.
I was working on my blog post about local LLMs on a used Radeon RX 580 GPU the week I signed up for Chutes. I had OpenCode creating Podman containers on a remote host over SSH. I want to say that I averaged around 100 requests for each test container I created.
The speed of the LLM wasn’t a significant bottleneck. OpenCode was waiting for images to download and for llama.cpp to compile. It didn’t help that my test server was a high-performance gaming PC in 2013. A machine like that is a little slow these days!
I am enjoying the extra speed of Synthetic.new, even if it isn’t always faster than Chutes. When Synthetic is faster, it is almost twice as fast. That isn’t a game changer for me, but it is nice!
While it doesn’t make much difference for me, the extra speed might be a huge selling point for you!
- OpenCode with Local LLMs — Can a 16 GB GPU Compete With The Cloud?
- Fast Machine Learning in Your Homelab on a Budget
Is there a difference in quality of responses from different providers?
Yes, but I am not the right person to attempt to conduct the science to figure out exactly how well each provider is doing.
You can quantize models to fit in less VRAM. That means you can fit larger models on smaller GPUs, like how I squeezed a 3-bit quant of GLM-4.7-Flash onto my 16 GB 9070 XT GPU. A smaller quant could also be used to squeeze more parallel requests onto the same GPU. The smaller the quant, the lower the quality of the output.
Kimi K2.5 isn’t a model that a provider is likely to be squeezing down into a smaller quant. It is a massive model with 1 trillion parameters, but it is already natively running at 4 bits. That is between one-half and one-quarter of the native size per weight of most other models.
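The arithmetic here is simple enough to sketch. The Kimi figures are the ones from the paragraph above; the 30B model is an arbitrary illustration of the 16 GB GPU case, not GLM-4.7-Flash’s actual parameter count.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    # Rough size of the weights alone; KV cache, activations, and
    # runtime overhead all claim VRAM on top of this.
    return params_billions * bits_per_weight / 8

print(weight_gb(1000, 4))   # Kimi K2.5 at its native 4 bits: 500.0 GB
print(weight_gb(1000, 16))  # the same weights at 16 bits: 2000.0 GB
print(weight_gb(30, 3))     # a hypothetical 30B model at 3 bits: 11.25 GB
```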
I have been signing up for more services specifically to try Kimi K2.5, so I am less likely to notice if one provider is serving degraded models.
I have swapped back and forth between Z.ai and Chutes for GLM-4.7 and GLM-5 quite a bit, and I haven’t noticed Chutes performing any worse. This is not science. This is only my anecdote.
Quantization doesn’t matter. The model you’re using doesn’t matter. What matters is that the model on the service you choose gives you results that you’re pleased with at a price you can afford.
Which provider should you choose?
Whatever you choose, I would suggest that you start small. Don’t just prepay for a year of Z.ai’s Max plan without trying the Lite or Pro plan first. Upgrading later is easy.
Synthetic.new is the answer if you want to support an American company with a good privacy policy or if a little extra speed, reliability, and consistency are more important than price. Odds are good that Synthetic is more likely to be serving higher quality quants or unquantized models.
I am not sure that Synthetic is worth the price, though, when you can pay a bit less for OpenAI Codex and have access to even stronger and faster models.
NOTE: I have had a couple of nights where Kimi K2.5 on Synthetic, Chutes, and OpenCode Zen’s free tier were all extremely slow. Synthetic does SEEM to perform better more often, but their service isn’t perfect.
Z.ai is a good choice if you need their MCP servers, or if you want to show your support to one of the companies that is releasing fantastic open-weight LLMs. Z.ai is the creator of all the models that it hosts. Maybe the lack of choice in models is a bonus for you. GLM-5 is a fine model, and GLM-4.7 works well with the build agent. You won’t be tempted to waste time trying to figure out which open-weight coding model is ideal for your use case if you don’t have them in your arsenal.
| Provider | Price per month | Quota | MCPs |
|---|---|---|---|
| Z.ai (Lite) | $3 | 120 / 5 hours | Yes |
| Z.ai (Pro) | $15 | 600 / 5 hours | Yes |
| Z.ai (Max) | $30 | 2,400 / 5 hours | Yes |
| Chutes.ai (Base) | $3 | 300 / day | No |
| Chutes.ai (Plus) | $10 | 2,000 / day | No |
| Chutes.ai (Pro) | $20 | 5,000 / day | No |
| Synthetic.new (Standard) | $20 | 135 / 5 hours | No |
| Synthetic.new (Pro) | $60 | 1,350 / 5 hours | No |
NOTE: Z.ai is shaking up their plans a little, so I am not sure if the requests in the quota are actually the same today as they were when I made this table. I do know that Z.ai says they are charging 3 requests for every GLM-5 message.
Chutes is probably the best value. Chutes has all the best open-weight models that are currently available. Their pricing is the lowest. Their speed is adequate. If you told me to cancel everything and pick just one provider today, I would be choosing Chutes.
They are all inexpensive enough that you could try them all. You get a discount on Z.ai if you use my referral link, though I am not sure how much money you save. You definitely get $10 off your first month at Synthetic.new when using my referral code. Chutes doesn’t have a referral program.
You could spend less than $30 all at once and immediately have a ton of new toys to play with for the next month, or you could space out your testing over two or three months like I did. It is nice to have a week or two of overlap between services so you can see them operating side by side!
- Is The $6 Z.ai Coding Plan a No-Brainer?
- Squeezing Value from Free and Low-Cost AI Coding Subscriptions
- Synthetic.new
What about NanoGPT?
This is another service that was tough to find! NanoGPT gives you 30,000 requests per month for $8, and they have all the latest open-weight models. Pricing, models, and usage are comparable in value to the $10 plan from Chutes.
I keep reading about problems with NanoGPT. I came very close to just explaining that the reviews were all poor, that the vibe didn’t seem right, and that I had no interest in trying the service. I couldn’t do that. I had to sign up and give it a try. Just stay away.

Kimi K2.5, GLM-5, and MiniMax M2.5 are all slow. I don’t think I’ve had a single successful tool call with Kimi K2.5 or GLM-5 in half a million tokens. I tried the same prompt with GLM-4.7, but it just kept coming back with no response. I did manage to see some successful tool calls with MiniMax M2.5, but it is going at an absolute snail’s pace.
Sometimes Kimi K2.5 via NanoGPT will just say that it is going to investigate, but it just stops after that statement.
These two issues don’t always happen, but when they’re happening with a model, they do not stop happening. I assume it depends on which provider NanoGPT is routing your requests through at that time. I have been having better luck using models that only have one provider on NanoGPT, like MiniMax M2.5. If I have to stick to such a limited selection of models, why should I use NanoGPT at all?
They have my $8. I’m not going to try to get it back, but my NanoGPT subscription is nearly useless with OpenCode.
Conclusion
At the end of the day, any of these services will get the job done for casual coding, tinkering, and creating Podman or Docker containers. You don’t need to overthink it. I have been happily using Z.ai for months, and I only started experimenting with the others because I enjoy comparing tools and finding good deals. I imagine that most people would be perfectly content picking one service and sticking with it.
I didn’t write this blog post to do an accurate, exhaustive head-to-head comparison of these services. It took me months to learn that Chutes and NanoGPT even existed, and judging by the threads in r/OpenCodeCLI, most people using OpenCode don’t know about these subscriptions. I just want you to be aware that they exist, and I want you to understand where they might fit into your workflow.
If you have questions or want to share your own experiences, come hang out with us in our Discord community. We are a friendly group of people who are all figuring this stuff out together. We aren’t just talking about machine learning in there. We have a good overlap of homelabbers, 3D printing enthusiasts, and gamers.
- Is The $6 Z.ai Coding Plan a No-Brainer?
- OpenCode with Local LLMs — Can a 16 GB GPU Compete With The Cloud?
- Devstral with Vibe CLI vs. OpenCode: AI Coding Tools for Casual Programmers
- Squeezing Value from Free and Low-Cost AI Coding Subscriptions
- Fast Machine Learning in Your Homelab on a Budget
- Synthetic.new