I don’t know for certain that I have enough to say here to justify writing an entire blog post, but let’s see what happens!
It has been a little over a year since I upgraded this computer from an ancient Nvidia GTX 970 to an AMD Radeon RX 6700 XT. I really needed that upgrade, but I was stubborn, and I didn’t want to pay those inflated crypto-mining GPU prices, so I hung on to that GTX 970 for way longer than I wanted to.
I think I did a good job. Fifteen months later, and I have only seen my new GPU go on sale for at most $120 less than what I paid. I am happy to have paid less than $10 per month for the privilege of being able to have so much more fun playing games, so I think the timing of my upgrade was pretty decent!
I am not some sort of software developer trying to teach an LLM to learn how to read and do other stuff good too. I am just a guy hoping to use existing models in some sort of useful way.
One of the first things I learned immediately after installing my new GPU was that running AI models with an AMD GPU on Linux was a crapshoot.
At the time of my upgrade, getting Stable Diffusion to run with an Nvidia GPU was as easy as running one command and waiting for things to download. Getting it running on my Radeon took several attempts, and I felt like I was lucky to get it working at all. Every time I wanted to update my Stable Diffusion install, it was almost certain that something would break.
Getting Llama up and running seemed like it would be nearly impossible, but things are much improved today!
I had Oobabooga’s text-generation-webui up and running in thirty minutes. Since you are reading this, you can probably do it in less than half the time.
The first problem I had was picking out a model or two to download. I don’t know if I will find something better, but so far I have been pleased with MistralMakise-Merged-13B. It seems reasonably capable, and it fits well in my 12 GB of VRAM.
NOTE: So far, I am happier with DiscoPOP-zephyr-7b-gemma, and I am using it to help me put the finishing touches on this blog post before I send it to my editor for review.
My second problem was easily solved by punching some error messages into Google, but it took a few clicks before I found the solution. It is mentioned in their documentation under the AMD heading, but that section is way down near the bottom, and I managed to miss it.
If you have a functioning Nvidia GPU, CUDA will just work. If you have a working AMD GPU, things are a little more complicated. When you have the RIGHT Radeon GPU with ROCm correctly installed, Oobabooga’s text-generation-webui will also probably just work.

When you have a different Radeon, you have to give pytorch some hints as to which ROCm bits to actually use. This is a pain point, but if this is the only real problem we’re going to be running into today, then things are already infinitely better than they were a year ago!
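I won’t claim this is character-for-character what I pasted in, but the usual hint for a consumer Radeon like my RX 6700 XT is the HSA_OVERRIDE_GFX_VERSION environment variable, which tells the ROCm runtime to treat the card’s gfx1031 chip as the officially supported gfx1030. Here is a minimal Python sketch of the idea; in practice you would export the variable in your shell or launcher script before starting the webui.

```python
import os

# Assumption: the RX 6700 XT reports gfx1031, which the ROCm builds of PyTorch
# don't officially target, so we ask the runtime to treat it as gfx1030.
# This has to be set before PyTorch initializes its ROCm/HIP backend.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

import torch

# ROCm builds of PyTorch reuse the CUDA API surface, so this tells us
# whether the Radeon is actually visible to PyTorch.
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("HIP version:", torch.version.hip)
```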
Installing ROCm and OpenCL might be a pain!
The documentation says that I need ROCm 5.6 to use text-generation-webui, but I already have ROCm 5.4.6 installed. That is the version that seems to work well with DaVinci Resolve Studio 19, my RX 6700 XT, and Mesa 24. It seems to be working just fine for text-generation-webui as well!
I would love to tell you the correct way to install ROCm and OpenCL, but I always goof something up, and I wind up jumping through hoops to fix it. That means I don’t REALLY know how to install these things. I know how to goof it up, correct at least some of my mistakes, then wind up with a working installation. I am not even confident that doing things in what seems to be the correct way would get me to the correct destination!
The newer versions of ROCm can’t easily coexist with the bleeding-edge versions of Mesa. If you install ROCm 5.6 or newer, you can expect to be unable to play games or run DaVinci Resolve. At least, that was the case when I last set things up. This is a problem that should eventually straighten itself out.
I don’t think this is currently any better or worse than it was a year ago. This is something AMD really, really, really needs to do better. Really.
Should you bother running an LLM on your local machine?
I am sure there are some fantastic reasons to avoid using the ChatGPT API. I do not enjoy the idea of sending all my words up to some company’s servers, but all the words I write are published to one of my blogs, so that doesn’t really matter for me.
The ChatGPT API is quite inexpensive. It didn’t even cost me an entire dime when I was messing around with sending every paragraph of my blog posts up to ChatGPT for rewrites and suggestions. That was with GPT-3.5-Turbo.
GPT-4o is still inexpensive, but I could easily get up into dollars instead of cents. Part of the problem is that GPT-4o offers a much bigger context window than GPT-3.5-Turbo did, so I can send entire blog posts up as context. Even though GPT-4o is still inexpensive per token, it encourages me to send up 1,500 tokens of context with each query.
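To put rough numbers on that, here is a hedged back-of-the-envelope sketch. The per-token prices are my assumption of roughly what GPT-4o cost when I was testing, so check them against OpenAI’s current price list before trusting the output, and the reply length is just a guess.

```python
# Assumed pricing -- verify against OpenAI's current price list.
INPUT_PER_MILLION = 5.00    # dollars per 1M input tokens (assumption)
OUTPUT_PER_MILLION = 15.00  # dollars per 1M output tokens (assumption)

context_tokens = 1_500  # a whole blog post's worth of context per query
reply_tokens = 300      # rough guess at a rewritten paragraph or two

cost_per_query = (context_tokens * INPUT_PER_MILLION
                  + reply_tokens * OUTPUT_PER_MILLION) / 1_000_000

print(f"~${cost_per_query:.4f} per query")               # about 1.2 cents
print(f"~${cost_per_query * 500:.2f} for 500 queries")   # a few dollars
```

A penny per query sounds like nothing, but a few hundred lazy thesaurus lookups a month adds up to real dollars in a way that GPT-3.5-Turbo never did.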
OpenAI’s API is FAST. I forgot just how fast it was until I decided to ask GPT-4o and my local Mistral LLM to rephrase the same paragraphs. I ran the paragraph through ChatGPT first because I have some shortcuts to make that happen in Emacs, and I was surprised that it was able to ingest my context and give me a full answer almost instantly. The local LLM on my $300 GPU took a noticeable number of seconds to give me a response.
OpenAI’s API isn’t just ridiculously fast—it also gives significantly better responses than my tiny GPU ever could. I can’t even run an equivalent to GPT-3.5-Turbo locally, and GPT-4 and GPT-4o are orders of magnitude bigger than that.
Speed doesn’t matter if you don’t want to send your private information to a third party.
Sometimes, quantity beats quality!
The game shifts a little when you can run something locally and do not have to pay for every single transaction.
My GPU can consume an additional 175 watts when running at full tilt. It would take something like four hours of me constantly interacting with a local LLM to add 10 cents to my electric bill, and I certainly can’t ask her enough questions to keep her running without lots of breaks. My cost to keep the LLM running and answering all my questions is effectively zero.
I absolutely love being able to run Stable Diffusion locally. I can try out a handful of weird prompts to find something that makes me giggle. Then I can ask Stable Diffusion to generate eight images at two different guidance scales using six different checkpoints. It will grind away for ten to fifteen minutes while I make a latte, and I will have 96 images to evaluate when I sit down. Usually one will be goofy enough to break up a wall of words in a blog post.
I can’t max out my GPU with an LLM for long, but asking Stable Diffusion to generate 96 images will keep my GPU maxed out for ten minutes. That means I can generate more than 2,000 images for a dime.
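The arithmetic is easy to sanity-check. The only number I am assuming here is my electricity rate, and I am backing that out of my own 10-cents-for-four-hours guess, which works out to roughly 14 cents per kWh.

```python
gpu_watts = 175        # extra draw with the GPU running at full tilt
rate_per_kwh = 0.143   # dollars per kWh (assumption implied by my own estimate)

# Four hours of nonstop LLM chatter
llm_kwh = gpu_watts * 4 / 1000                        # 0.7 kWh
print(f"4 hours of LLM: ${llm_kwh * rate_per_kwh:.2f}")   # about $0.10

# One 96-image Stable Diffusion batch keeps the GPU busy for about ten minutes
batch_kwh = gpu_watts * (10 / 60) / 1000              # roughly 0.03 kWh
batches_per_dime = 0.10 / (batch_kwh * rate_per_kwh)
print(f"Images per dime: {batches_per_dime * 96:.0f}")    # well over 2,000
```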
I can see myself doing something similar for my blog-writing workflow in Emacs. Right now, I just send a paragraph or two to GPT-4o when I can’t find a synonym I like, can’t decide how to start the next paragraph, or just don’t like the flow of a sentence. OpenAI’s API is almost always just a lazy thesaurus for me. ChatGPT’s writing feels either way too pretentious or way too corporate for my taste, but it does often inspire me to reorder sentence fragments into something that reads more pleasantly.
When the LLM doesn’t cost me anything to run, why not throw everything into that blender to see what comes out? I could write some Emacs Lisp that will send every paragraph to the OobaBooga interface as soon as I hit the Enter key. I’ve already tried connecting my Emacs automations to my local LLM’s API, and it works great even if it feels so much slower than GPT-4o!
Maybe it could show me the rephrased paragraph next to the window I am typing in. Maybe I could puzzle out a prompt that would coax the robot into only speaking up if its rewrite or suggestion seems like it would be helpful to me. Perhaps I could send it the last two or three paragraphs and give it a chance to write the next one?
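The Emacs side of that plumbing would obviously be Emacs Lisp, but the request it fires at the local model is simple enough to sketch in Python. This assumes text-generation-webui was started with its OpenAI-compatible API enabled (the --api flag) on its default local port; the prompt and the helper function are just placeholders I made up.

```python
import requests

# text-generation-webui's OpenAI-compatible endpoint (when launched with --api).
# Port 5000 is the project's default; adjust if you changed it.
API_URL = "http://127.0.0.1:5000/v1/chat/completions"

def rephrase(paragraph: str) -> str:
    """Ask the local model for a rewrite of a single paragraph."""
    response = requests.post(API_URL, json={
        "messages": [
            {"role": "system",
             "content": "Rewrite the user's paragraph so it reads more smoothly. "
                        "Keep the author's casual tone."},
            {"role": "user", "content": paragraph},
        ],
        "max_tokens": 400,
        "temperature": 0.7,
    }, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

print(rephrase("I had Oobabooga's text-generation-webui up and running in thirty minutes."))
```

Swapping between the local endpoint and OpenAI’s is mostly a matter of changing the URL and adding an API key, which is exactly why this experiment is so cheap to try.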
I think this sort of thing would have to be done one paragraph at a time, or at least be limited to a few paragraphs. When I asked ChatGPT to turn six columns of a Google Sheet into a Markdown table, it gave me back the results in a few seconds. It LOOKED like it was typing the results slowly, but I was able to hit the copy code button right away, and the entire table was available.
It took my local Mistral robot 90 seconds to give me the complete table of mini PC prices and performance. The latency would be too high if my local artificial brain works with too much text at once!
Not every employee needs to be Albert Einstein
My little Radeon 6700 XT with 12 GB of VRAM will never run an LLM that can compete with what can be run on even a single AMD MI300X with 192 GB of RAM, and it certainly can’t compete with a server full of those cards!
That is OK. I don’t need to hire Albert Einstein to figure out how fast my FPV drone falls when I dive down the side of a building. A high school student should be equipped to handle that task, just like my little Mistral 7B LLM can give me a handful of synonyms.
I don’t need to hire Douglas Adams to fix up my words, even if I wish I still could!
Let’s get back on topic
We are supposed to be talking about how much easier it is now to run machine learning stuff on a Radeon GPU. I feel like automatic1111’s stable-diffusion-webui and oobabooga’s text-generation-webui cover something like 90% of the machine learning tasks we might want to do at home. These are both reasonably easy to get going with ROCm.
The other popular machine learning project is the Whisper speech-to-text engine. There is a webui for this, but it doesn’t seem to make it simple to get going with a Radeon GPU. Even so, I am not certain that a webui would be the right place to use Whisper.
Whisper feels like it needs to be built into something else. I want it to transcribe my video footage and turn the text into subtitles. I want to automatically transcribe any audio files that land in a particular directory. I don’t want to be doing any of this manually.
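Here is roughly the shape of the automation I have in mind, sketched with the openai-whisper Python package. The drop-folder path and model size are placeholders, and this only sweeps the directory once rather than truly watching it.

```python
from pathlib import Path

import whisper  # pip install openai-whisper

AUDIO_DIR = Path("~/audio-inbox").expanduser()   # placeholder drop folder
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}

# "base" is fast; "medium" or "large" trade speed for accuracy.
model = whisper.load_model("base")

for audio_file in sorted(AUDIO_DIR.iterdir()):
    if audio_file.suffix.lower() not in AUDIO_EXTENSIONS:
        continue
    transcript_file = audio_file.with_suffix(".txt")
    if transcript_file.exists():
        continue  # already transcribed on an earlier pass
    result = model.transcribe(str(audio_file))
    transcript_file.write_text(result["text"].strip() + "\n")
    print(f"Transcribed {audio_file.name}")
```

Turning the output into subtitles would mean using the timestamped segments Whisper returns instead of the plain text, but the plumbing is the same.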
DaVinci Resolve Studio has a fantastic speech-to-text workflow. You can delete words from the transcription, and Resolve will cut it right out of the video timeline. How cool is that?!
I very nearly had to delete this entire blog post!
The version 1.8 release of text-generation-webui showed up in my feeds right in the middle of writing the previous section. I did the thing that any competent professional writer might do, and I upgraded to the latest release!
My GPU acceleration immediately stopped working. That took my performance down from somewhere between 12 and 25 tokens per second to an abysmal 2 to 5 tokens per second.
Someone already filed a bug report. I decided to put this blog on hold, and I figured I could check back in a few days. The stickied AMD bug thread had a fix that worked. I had to edit the requirements_amd.txt file to replace one of the packages with an older version.
There were two lines with two slightly different versions. I assume that they weren’t supposed to be there, so I deleted both before pasting in the URL from the comment.
Llama 3.1 and Gemma 2 on an AMD GPU with Oobabooga
All the recent releases of oobabooga ship with broken support for llama.cpp when using AMD’s ROCm.
I forged ahead and installed the latest version anyway. I wound up getting GPTQ versions of Llama 3.1 8B and Gemma 2 9B running using the ExLlamav2_HF loader. They both seem to run at comparable speeds to the Llama 3 and InternLM GGUF models I was using before, so that is exciting!
I was a bit bummed out because not having a working llama.cpp meant that I couldn’t use any of the GGUF files I have been running. The new models are better than what I was using, but I didn’t want to miss out on using CodeQwen 1.5 7B.
I wound up editing the requirements_amd.txt file once again, and I downgraded the llama.cpp packages just like I did before. That means I can run all my old GGUF files exactly as I was, and I can now also run the newer models via ExLlamav2_HF. That’ll do!
Conclusion
I was eager to write this blog. It was exciting to no longer feel like a second-class citizen in the world of machine learning with my budget-friendly AMD Radeon GPU. Then I found out that no one attempted to run the text-generation-webui with a ROCm GPU in the two days between that dependency being updated and the release of version 1.8, and my citizenship level was once again demoted.
Is that the end of the world? Definitely not. Finding and applying a fix wasn’t a challenge, but even so, everything would have just worked if I had bought an Nvidia GPU, and everything would have just worked for the entirety of this past year. My 6700 XT is comparable in gaming performance and price to an RTX 4060 Ti, but I have 50% more VRAM. In theory, I should have as much or more machine-learning performance as well, except that there is so much less optimization work happening outside of the world of Nvidia’s CUDA.
What do you think? Are you running automatic1111’s stable-diffusion-webui or OobaBooga’s text-generation-webui on an AMD GPU? How have things been working out for you? Do you think it is worth the extra effort and problems to be able to own a Radeon GPU with 24 GB of VRAM for less than half the price of an Nvidia RTX 4090? Let me know in the comments, or stop by the Butter, What?! Discord server to chat with me about it!