The last time we visited this topic, I had ordered a used 8 GB Radeon RX 580 for $56, and I had tested most of the capabilities required for Home Assistant voice control. A local LLM was smart and fast enough, and speech recognition was also fast on the GPU. I didn’t get voice generation working on the GPU, but the CPU is plenty fast enough for that. What I hadn’t done was combine everything together, and I didn’t find a good way to get my voice command into Home Assistant.
I am not nearly finished setting this up, but I am far enough down the path that all the pieces are in place and functioning, and the pieces that are functioning are working well. Things are extremely barebones so far, because all the batteries are definitely not included when you turn on Home Assistant’s Voice Assist, and most of my scenes and automations are named poorly. Asking for things to be done in my home has to be worded pretty awkwardly!
These are all solvable problems. The difficult technical hurdles have all been cleared.
Ava turns your Android device into a voice satellite
This is the part that held me back. I looked into microphone and speaker hardware that is compatible with Home Assistant, and I didn’t get a good vibe from any of it. Some hardware seemed overpriced, other hardware seemed like complete junk. How was I going to get my words into Home Assistant?
I looked at and dismissed one or two projects intended to get either your Android device or a web browser plumbed into Home Assistant’s voice controls. Then I stumbled across Ava!
Ava is an open-source Android app that runs in the background and emulates an ESPHome voice satellite. This is awesome, because a 10” Android tablet already has a permanent home on my desk. Not only do I have a full-time Discord display and Home Assistant macropad dashboard on this small screen, but the microphone is always listening for the wake word.
This was the part of the setup that I was most worried about. Thank goodness it is working perfectly. Home Assistant saw Ava immediately, it was easy to configure, and it hasn’t missed a wake word yet.
I don’t know about you, but I have quite a few old Android tablets. I can start dropping Android tablets with Home Assistant dashboards in every room of the house as soon as I get all the rough edges ironed out, and I won’t even have to buy anything! Ava claims to work well on extremely low-end Android hardware.
The ESPHome voice satellite protocol supports two different wake words, each pointing to a different assistant. I’m only using one wake word right now, but I’m already planning to point the second one at an OpenClaw instance. There’s an integration available in HACS, which would give me a general-purpose AI assistant alongside my Home Assistant voice control on the same tablet.
Ava is not available in the Play Store. You have to download the release from their GitHub repository and manually install it.
Qwen 3.5 4B is delightful
New models get released so fast! When I first visited this topic in January, I was squeezing Gemma 3 4B onto my 8 GB GPU. That was the best model available in this size range that could use vision to analyze photos. I don’t have a camera on my front porch yet, but my plans involve having the local robot be able to identify whether or not a package has been delivered!
More models have been released since then. I tried Qwen 3.5 4B at Q6 as soon as it was available, and I was immediately impressed. Gemma 3 wasn’t terrible at tool calling, but Qwen 3.5 4B hasn’t missed a tool call yet. This is important, because every time you ask the LLM to turn on a light or activate a scene, the LLM has to call a tool in Home Assistant to make that happen.
Performance was an issue at first. Home Assistant sends a system prompt to the LLM that contains information about every single exposed entity. That puts me at nearly 2,500 tokens of context before even getting to the words that I spoke, and processing that entire prompt takes somewhere between 5 and 8 seconds.
It was a little tricky to tune with the latest llama.cpp update, but I managed to set up the context checkpoint caching to keep most of the preprocessed prompt in memory. The first prompt still takes 8 seconds, but now it takes between 500 and 900 milliseconds to process any subsequent prompts. I don’t lose the cache unless I have to restart llama.cpp or my entities in Home Assistant change.
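The caching matters because of simple arithmetic. Here is a rough back-of-envelope sketch; the ~350 tokens/second prompt-processing rate is an assumption for a 4B model on an RX 580 (measure your own with llama-bench), and this only models prompt processing, not token generation:

```python
# Back-of-envelope math for prompt processing with and without a warm cache.
# The 350 tokens/sec prompt-processing rate is an assumed figure, not a
# measurement from this exact setup.

SYSTEM_PROMPT_TOKENS = 2500   # Home Assistant's entity dump
SPOKEN_TOKENS = 30            # a typical voice command
PP_RATE = 350                 # tokens/sec, assumed

cold = (SYSTEM_PROMPT_TOKENS + SPOKEN_TOKENS) / PP_RATE
warm = SPOKEN_TOKENS / PP_RATE  # the cache already holds the system prompt

print(f"cold start: {cold:.1f} s, cached: {warm * 1000:.0f} ms")
```

The cold start lands right around my observed 8 seconds, and the cached case shows why subsequent prompts feel nearly instant: only the handful of freshly spoken tokens need processing.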
I think this is perfect. My $56 GPU can respond to the text from spoken sentences in less than a second.
I asked OpenCode to figure out how to compile llama.cpp with the correct Vulkan support, and I had OpenCode set me up scripts to start, stop, and update the llama.cpp container on my test machine.
- Vibe Coding My Home Assistant Setup — I Can’t Believe How Well This Works!
- Fast Machine Learning in Your Homelab on a Budget
Wyoming Whisper: CPU vs. GPU
The first thing I did after installing Ava was click the buttons to install the voice assistant things on my Home Assistant server. I let it set things up on its own, so it installed a Wyoming Whisper speech-to-text service and the Piper text-to-speech service in the HASSOS virtual machine on my little Intel N100 mini PC.
I also installed the Extended OpenAI Conversation HACS integration. The default OpenAI integration can only connect to OpenAI’s servers and not third-party OpenAI-compatible endpoints. I did initially take the lazy route and connect it to GPT-OSS-120B on my $3 Chutes subscription.
It worked, but it was slow. The CPU-based Whisper server is slow on the N100, and the responses from the LLM are slow on my cheap Chutes account, but at least Piper is reasonably responsive.
My test machine is an old AMD FX-8350 machine running Bazzite. It is my home office’s second gaming PC, and I might even keep things set up this way. I asked OpenCode to set up a Wyoming Whisper Podman container with Vulkan support for me, and it has a similar management script to my llama.cpp container.
It is working great. Using the default base.en-q5_1 model has my voice commands transcribed in 500 to 600 milliseconds while using only 26 megabytes of VRAM. Stepping up to the larger models pushed transcription time to 1.2 seconds. That isn’t a massive difference, and I have VRAM to spare, but the base model is transcribing my voice just fine.
- OpenCode on a Budget — Synthetic.new, Chutes.ai, and Z.ai
- Fast Machine Learning in Your Homelab on a Budget
What is this going to cost me?!
I’ve dug myself into a bit of a conundrum. This old FX-8350 machine was my gaming PC and workstation in 2013. When I upgraded, it became my homelab and NAS. When this old machine was effectively my entire homelab, it averaged 70 watts at the power outlet. It might be a little higher with the beefier GPU now.
I replaced that server with an N100 mini PC and a single 14 TB USB hard drive which combined average less than 15 watts. Even my best mini PC can’t quite match the LLM performance of my $56 GPU, but adding a slightly underclocked FX-8350 running 24/7 back into the mix will cost me between $80 and $100 in electricity every year.
That is $56 up front for the GPU, plus, let’s just call it twenty cents per day in electricity. I can add a dashboard display and voice satellite to any room for an extra $50 or so.
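The twenty-cents-per-day figure comes from simple math. The 70-watt average is from my old homelab measurements; the $0.13 per kWh rate is an assumption, so plug in your own:

```python
# Rough yearly electricity cost for a machine running 24/7.
WATTS = 70              # measured average draw at the outlet
RATE_PER_KWH = 0.13     # USD, assumed -- substitute your local rate

kwh_per_day = WATTS * 24 / 1000
cost_per_day = kwh_per_day * RATE_PER_KWH
cost_per_year = cost_per_day * 365

print(f"{cost_per_day:.2f} USD/day, {cost_per_year:.0f} USD/year")
```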
I don’t think that’s terrible for a voice assistant, even if my current voice assistant is extremely limited in capabilities.
Qwen 3.5 4B is pretty quick to respond when the context checkpoints are working!
The big win arrives when you are doing image recognition. I figured out in January that I would spend way more than $80 per year in API costs if I had to ask the cheapest vision model whether there was a package on my front porch once every five minutes during business hours. I can do that today with nearly zero additional cost, and I could even check multiple cameras without doubling, tripling, or quadrupling my spending.
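The polling math makes the case on its own. The $0.003-per-image price is an assumed ballpark for a cheap cloud vision model, and note that every additional camera multiplies the total:

```python
# Yearly cost of polling a cloud vision model for package detection.
# The per-query price is an assumed ballpark, not a quote from any vendor.
QUERIES_PER_HOUR = 12    # one check every five minutes
BUSINESS_HOURS = 8       # checks only run during delivery hours
COST_PER_QUERY = 0.003   # USD per image query, assumed

queries_per_year = QUERIES_PER_HOUR * BUSINESS_HOURS * 365
yearly_cost = queries_per_year * COST_PER_QUERY

print(f"{queries_per_year} queries/year per camera ≈ {yearly_cost:.0f} USD")
```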
I have a Ryzen 6800H mini PC. It is my Steam machine in the living room. We play Grand Theft Auto V, Red Dead Redemption 2, and Dead Cells out there. I have done some llama.cpp testing on that box, and that $330 mini PC that idles at 8 watts is roughly 80% as fast as the $56 RX 580 GPU.
Maybe your homelab has a free PCIe slot, or maybe your homelab has an extremely basic GPU that can be upgraded. I think a used RX 580 with 8 GB of VRAM would be a fantastic option for you. If you’re just starting your Home Assistant journey, then choosing a mini PC with decent memory bandwidth and a reasonable iGPU might be a better place to start.
What if you already have an N100 mini PC that just can’t run the LLM fast enough? A $3 per month Chutes subscription would go a long way, and this use case doesn’t violate their terms of service.
- Should You Run A Large Language Model On An Intel N100 Mini PC?
- Bazzite on a Ryzen 6800H Living Room Gaming PC
- Fast Machine Learning in Your Homelab on a Budget
More info on performance, noise, and power
I chose the 8 GB RX 580 because it is inexpensive, fast enough for the job, and isn’t a dead end. If my homelab machine-learning experiments didn’t wind up being useful, I could still build a low-end gaming PC out of my spare parts. This $56 GPU combined with old junk parts from my closet is several times faster than the $330 Ryzen 6800H mini PC that I use as a Steam machine in the living room.
I fully expected to wind up turning my spare-parts gaming PC test rig back into a Proxmox host. The trouble is that I didn’t expect my ancient machine to be able to run Arc Raiders at better than 60 FPS, and I didn’t know that I would enjoy the idea of having a second gaming PC at the second desk in my home office.
That means I can’t relegate this machine to my network cupboard on the other side of the house. It has to be in the room with me. I have to be able to tolerate the noise.
The first thing I did was rip out the old 92-mm CPU fan and the older, filthy, loud 120-mm case fan. I replaced both with a spare pair of reasonably quiet and modern Antec 120-mm fans. That helped a lot, but when the GPU gets choochin’ on inference, the GPU fans get pretty loud.
The RX 580 defaults to 150 watts. I decided to run some llama-bench tests from 75 watts to the default 150 watts, and then all the way up to the maximum of 220 watts. I would have run these tests with Qwen 3.5 4B if I had known that I would include the table in this blog post. Instead, I ran OmniCoder-2-9B. It is a heavier model, and I pushed the context high enough to heat up the GPU, but I kept the context low enough that the tests would complete in a reasonable amount of time.
I learned that the noise level stays low as long as I keep the RX 580’s fans under 50%. That is why I ran that extra test at 115 watts, and that is where I wound up setting the power limit. The default power limit is only 9% faster at prompt processing. That is a small price to pay to keep my home office quiet!
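If you want to do the same on Linux, the amdgpu driver exposes the power cap through hwmon in sysfs. This is a minimal sketch, not a polished tool: the `hwmon3`/`card0` path is hypothetical and varies per system, and you need root to write the file.

```python
# Minimal sketch of capping an amdgpu card's power limit via sysfs.
# amdgpu's power1_cap file expects the value in microwatts.
from pathlib import Path

def watts_to_microwatts(watts: int) -> int:
    """Convert a power limit in watts to amdgpu's microwatt units."""
    return watts * 1_000_000

# Hypothetical path -- find yours with: ls /sys/class/drm/card*/device/hwmon/
CAP_FILE = Path("/sys/class/drm/card0/device/hwmon/hwmon3/power1_cap")

if CAP_FILE.exists():
    CAP_FILE.write_text(str(watts_to_microwatts(115)))  # 115 W cap
else:
    print("power1_cap not found; adjust the hwmon path for your system")
```

Keep in mind the cap resets on reboot, so you’ll want a startup script or udev rule to reapply it.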
Home Assistant’s voice assist isn’t a replacement for your Google Home Mini
At least not out of the box. I can’t set timers. I can’t set alarms. I can’t play music. I can’t check the weather report. I can only query and control entities in my home.
I use our Google Home Mini in the kitchen regularly to set timers and to listen to the Dirtman while I make my morning latte. I learned that all I have to do is add a timer helper entity to my Home Assistant server, and I could have voice timers on my new voice satellite!
This was technically correct. I added the helper, and I was able to ask for a 5-minute timer. However, when I asked for the status of the timer, she explained that I had no timers, and I couldn’t see the timer anywhere in Home Assistant. Even so, the satellite started making noise five minutes later. Usable, but not exactly at parity with Google here.
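If you want to try the same experiment, this is roughly what that timer helper boils down to in `configuration.yaml`; the entity name and duration here are just examples:

```yaml
# A minimal timer entity -- the same thing the UI's timer helper creates.
timer:
  kitchen:
    name: Kitchen Timer
    duration: "00:05:00"
```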
As I said earlier, I am not terribly deep into trying to get all the functionality working. It definitely looks like I will be able to get all the features I need working, but I will have to work to enable them myself. I don’t know for sure how successful I will be.
At this point, I am just excited to have input, output, and machine-learning hardware that can handle what should be the hardest part of the job!
What’s next?
My plan is to just inch along. All the hard parts that I was worried about are working, and they are working well, so I thought it was worth writing this much down. I might need to rename some of my scenes and entities to make them easier to control verbally, but I’m not so sure I would even want to verbally control individual entities. I should probably work on setting up good names for the scenes that are useful but challenging to automate.
I am also looking into a locally hosted doorbell camera. The Internet seems to like the Reolink doorbells, and a couple of people in our Discord community are having good luck with them. Being able to determine when deliveries are waiting is high on my priority list, and I am now wondering if I could have Qwen silence the doorbell when salespeople are trying to ring the bell!
Have you been tinkering with Home Assistant voice control? Are you running a local LLM on your homelab hardware, or are you sticking with cloud services? What kind of GPU or mini PC are you using, and how has your experience been? Come hang out with us in our Discord community and let’s compare notes! We’re a friendly bunch of homelabbers, tinkerers, and machine learning enthusiasts.