GPT-4o Mini Vs. My Local LLM


I am not some sort of machine-learning enthusiast. I am not trying to shoehorn large language models into every problem I encounter. I am just a dude who writes a blog post every week or so, and I have been messing around with various things to see how they might help streamline my workflow.

I am probably only scratching the surface, but I figured this is a reasonable time to write down what I have learned so far.

[Stable Diffusion image: a man using an LLM]

I am almost certainly not using the optimal local models. I have tried a handful of different models available on Hugging Face, and I settled on the ones that seem to work well for my needs.

Large language models aren’t replacements for search engines

I used to see a lot of comments on Reddit and Hacker News excitedly posting what ChatGPT had to say about the topic at hand, or what ChatGPT said the answer was. More recently, I am seeing questions on Reddit that don’t get answered in a way the poster believes is adequate, so the original poster leaves a comment with the better advice they believe they got from ChatGPT.

Large language models make stuff up all the time. I asked one of the local models I was trying out about the specifications of my first childhood computer: the Texas Instruments TI 99/4a. Much to my surprise, that rather tiny large language model seemed to be correct! I didn’t verify that everything was perfect, but it matched my memory, and that was more than enough to impress me.

Then I asked it for the specs of an Apple IIe. It confidently told me that it was a dual Intel Xeon machine with 192 GB of RAM.

There is a lot of information hiding in a lossy compressed state inside these models. GPT-4o Mini definitely has more real-world information hiding in its weights than my local LLM, and the full GPT-4 model has way more information than that. Either has a better chance of being correct than my tiny local model, but they all suffer from the same problem.

Even if full GPT-4 will be correct more often, it will still be just as confidently wrong as my local model.

I have asked ChatGPT to summarize some of my 3D printing blog posts, and it has given me back lots of bullet points that are exactly the opposite of what I actually said. I only know this because I wrote the words. I’d be careful basing any important decisions off of a summary from ChatGPT if I were you.

Large language models can definitely supplement your use of search engines!

When Windows Copilot+ Recall was announced, I almost immediately started writing a blog post about how I felt about this sort of software. I remembered with absolute certainty that there was a Gnome project more than twenty years ago that attempted to keep track of your information for you in a similar way.

The project was called Dashboard. It monitored your email, instant messages, and text files in an attempt to gather related information before you needed it. It didn’t use AI, but it felt a lot like what Copilot is hoping to accomplish. The trouble is that I couldn’t remember the name of the project, and my Google searches were coming up completely empty.

[Stable Diffusion image: a man using an LLM]

I had a back-and-forth with ChatGPT about it. It went off the rails a lot, and ChatGPT was extremely confident of some wrong answers, but I did eventually get it to mention both Dashboard and Nat Friedman in one of its responses.

Finding that nugget of information made it easy for me to find some old references to the actual Dashboard project!

This applies to OpenAI’s models and local models. They are so often breathtakingly incorrect, but we used to say the same thing about search engines twenty years ago. If you’re not a subject matter expert, you had better dig a little deeper to verify the responses from whichever chatbot you decide to use!

OpenAI’s API is stupidly inexpensive

OpenAI’s most expensive model costs $30 per million tokens via the API, and their newest and cheapest model, GPT-4o Mini, costs only $0.60 per million tokens.

I have been using an Emacs package to help me quickly and easily send paragraphs and entire blog posts up to their API to have things rewritten, rephrased, or have introductions written for me. I added $6.61 to my OpenAI account in November of 2023, and I have $5.83 left in my account as of July 2024.

I have no idea why I wound up adding $6.61 to my account. That is a really odd number, isn’t it?!
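If you are curious what those pennies are actually buying, here is a rough Python sketch of what a request like mine boils down to. This is not my actual Emacs package; the system prompt is made up, and the flat $0.60-per-million-token estimate is just the GPT-4o Mini figure from above (OpenAI actually prices input and output tokens separately), so treat the cost math as approximate:

# Rough sketch of the kind of request my Emacs package sends for me.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

paragraph = "The Sovol SV06 is a capable machine, especially when you consider the price."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Rephrase my paragraph, but keep my voice."},
        {"role": "user", "content": paragraph},
    ],
)

print(response.choices[0].message.content)

# Ballpark the cost with the flat $0.60-per-million-token figure.
# Input tokens are actually cheaper, so this overestimates a little.
tokens = response.usage.total_tokens
print(f"{tokens} tokens, roughly ${tokens / 1_000_000 * 0.60:.4f}")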

Sending a paragraph or two at a time for GPT-3.5 Turbo to rephrase was costing me a few pennies per blog post. Things got way more expensive when I started using the huge context window of GPT-4 to start each session with entire 2,500 word blog posts. That was costing me nearly an entire dime to work on a single blog post!

I used a local LLM to help me write an intro, conclusion, and call to action for my most recent blog post. When GPT-4o Mini went live, I sent it the exact same text, and I gave it the exact same prompt. That work cost me less than a penny. Not just less than a penny, but less than a tenth of a penny.

This was fun, because I have been messing around with various local models using the oobabooga webui for a few weeks, and I had forgotten how lightning fast the OpenAI API can be!

My local LLM is slower and not as good as GPT-4o Mini

I didn’t buy some big, expensive LLM-friendly workstation with a $2,000 24 GB Nvidia RTX 4090. I have a little $300 12 GB Radeon 6700 XT. I didn’t buy it for running AI stuff. I bought it to play games! The AI stuff is a free bonus.

My local LLM can ingest an entire blog post in a handful of seconds, and it can respond to my prompt about that blog at around 25 tokens per second. That winds up settling down to something around 15 tokens per second as I get deeper into the chat and the context goes up.
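To put those rates in perspective, here is the napkin math, assuming a conclusion-sized reply of around 400 tokens, which is just my guess at a typical response:

# Napkin math for how long a reply takes at my local generation speeds.
reply_tokens = 400                 # assumed size of a typical conclusion-style response

for rate in (25, 15):              # tokens per second: fresh chat vs. deep into the context
    seconds = reply_tokens / rate
    print(f"{rate} tok/s -> about {seconds:.0f} seconds for a {reply_tokens}-token reply")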

How much better is GPT-4o Mini?

I gave both my local Gemma 2 9B LLM and GPT-4o Mini the following prompt:

If I give you the text of a blog post I am writing, can you write me a “conclusion” section with a call to action that includes joining our discord server and leaving a comment?

I pasted them both the contents of the blog post. I got this response back from GPT-4o Mini:

Conclusion

In the world of 3D printing, choosing the right machine can be both exciting and daunting. While the Sovol SV06 has its merits, especially with the recent firmware upgrades that enhance its performance, it’s essential to recognize that the Bambu A1 Mini excels in various aspects, including user-friendliness and print quality. For those who value speed and ease of use, the A1 Mini is hard to beat, while the Sovol SV06 may appeal to enthusiasts looking for a larger build volume or the joy of tuning and customizing their machines.

Ultimately, your choice should align with your printing needs and preferences. But whether you opt for the Sovol or the Bambu, remember to keep your expectations realistic and informed.

I’d love to hear your thoughts! Have you had experiences with either the Sovol SV06 or the Bambu A1 Mini? Share your insights in the comments below. And if you’re passionate about 3D printing or want to connect with other enthusiasts, consider joining our Discord server! It’s a fantastic space to share tips, get advice, and stay updated on the latest in the 3D printing community. Let’s keep the conversation going!

This was the response I got from Gemma 2 9B:

Great! Here’s a conclusion with a call to action incorporating your Discord server and a comment section:

Sovol’s SV06 is a capable machine, especially when you consider the price. But, if you’re looking for an out-of-the-box experience with exceptional print quality and ease of use, the Bambu A1 Mini is hard to beat.

Want to explore the world of 3D printing and share your own projects? Join our Discord server where you can connect with fellow makers, get advice, and show off your creations! [Link to your Discord server]

Let me know in the comments what you think of the Sovol SV06 – would you consider it for your next 3D printing project?

I could never use any of these words without heavy editing. Every AI has a weird voice that feels way too pretentious for me.

I also don’t want to say that one of these did a better job than the other. The response from OpenAI is rather long-winded, but it does contain a lot of useful information that Gemma’s response is missing. That said, the response I got from Gemma 2 9B is much more like the sort of conclusion I would write on my own.

While both GPT-4o Mini and my local LLM do appear to slowly type a response to your query, the difference is that GPT-4o Mini is only pretending to be as slow as it appears. It actually took GPT-4o Mini about two seconds to complete the entire task, whereas my local LLM took 25 seconds to ingest my blog post and return its entire first response.

That is definitely a usable pace, but I have spent a lot of money on hardware to get lesser responses that take longer. I could work with GPT-4o Mini to help me rewrite all 800,000 words of blog posts I have ever written and it wouldn’t even cost 5% of what I paid for my GPU.

Do you know what I think is neat? I had an LLM generate a conclusion for one of my recent blogs, and I decided to leave in a couple of sentences and phrases that I absolutely never would have written on my own. I didn’t feel entirely comfortable speaking the way the machine wrote the words, but they were useful and relevant statements. If it didn’t work out, I could blame the robot!

The GPT-4o Mini API is extremely cheap, but my local LLM is almost free

The electricity required to have an LLM churning away on my workstation is a rounding error. My GPU maxes out at 174 watts, but llama.cpp doesn’t seem capable of utilizing all of the capacity, so it rarely goes far past 100 watts. It would cost less than twenty cents if you could somehow coax oobabooga into running for 10 hours straight with no downtime. That would be somewhere between 500,000 and 900,000 tokens.
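Here is the napkin math behind that, assuming an electricity rate of fifteen cents per kilowatt-hour, which will obviously vary:

# Electricity cost of a 10-hour local LLM marathon on my GPU.
watts = 100                  # llama.cpp rarely pushes my 6700 XT much past this
hours = 10
price_per_kwh = 0.15         # assumed electricity rate; yours will differ

kwh = watts * hours / 1000
print(f"{kwh:.1f} kWh -> about ${kwh * price_per_kwh:.2f}")

# Token throughput over the same window at 15 to 25 tokens per second.
for tps in (15, 25):
    print(f"{tps} tok/s -> {tps * hours * 3600:,} tokens in {hours} hours")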

The hardware is a sunk cost. I need my GPU to edit videos with DaVinci Resolve and play games. I need my CPU, RAM, and monitor just to be able to do any work at all. I am not investing in hardware to run a language model. I already own it. Running the model is effectively free.

[Stable Diffusion image: a neon PCB]

Free isn’t cheap enough to justify the effort. My back catalog of blog posts should be somewhere around a million tokens. It would cost me 30 cents every time I have GPT-4o Mini ingest the whole thing, and it’d only cost $1.20 to get a full rewrite of my blog back out. If I were working with the API to actually do a good job of reworking my old posts, then I would most definitely have to go back and forth more than once with many of the paragraphs.

I can’t imagine having to spend more than $10 or so conversing with the OpenAI API to rewrite my entire blog. The OpenAI API would respond faster than my local API via oobabooga, and that alone would save me way more than $10 worth of my own time.

I would never actually do this, but this is by far the most extreme use case I can come up with for using an LLM, and it would only cost me ten bucks!

What if I don’t want to upload all my words to a third party?

This has to be the best reason by far to avoid using an LLM in the cloud. Would Stephen King want to risk the chapters of his latest novel leaking? Maybe he wouldn’t care. I don’t think that a leak would have a significant impact on his life, but I think you understand the idea.

I have no qualms about uploading my words to OpenAI before I publish them. It feels icky in principle, but I’m not some sort of investigative journalist trying to beat everyone to the latest scoop. The majority of the words that I write wind up on the public Internet anyway. Who cares if they manage to sneak out a week early?

You might not be as fortunate as I am. You might have actual private work that needs to stay private. I could totally see investing in a small server with an Nvidia 3090 to run Gemma 2 27B so your small business can have access to a reasonably powerful LLM. Spending a few thousand dollars to not leak your secrets is pretty inexpensive!

What is Pat actually running locally?

The two models that I have really settled on are CodeQwen1.5-7B-Chat-GGUF and gemma-2-9b-it-GPTQ, both of which are available at Hugging Face. I think Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 is also fantastic, but Gemma 2 seems to fit my blogging needs better.

$ du -sh *GPTQ*
5.4G  hugging-quants_Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
5.8G  marcsun13_gemma-2-9b-it-GPTQ

$ ls -sh *.gguf
4.5G codeqwen-1_5-7b-chat-q4_k_m.gguf
4.0G internlm2_5-7b-chat-IQ4_XS.gguf
4.4G internlm2_5-7b-chat-q4_k_m.gguf
6.0G internlm2_5-7b-chat-Q6_K.gguf
4.2G Llama-3-Instruct-8B-SPPO-Iter3-IQ4_XS.gguf

I was mostly using InternLM 2.5 before Gemma 2 9B and Llama 3.1 8B were released. I tried three different quantizations of InternLM with the context window set to about 12,000 tokens. They eat up about 6, 6.5, and 7 gigabytes of VRAM on my Radeon 6700 XT, and they seem to have the same speed and quality for my use cases.

InternLM supports a context window as large as one million tokens. The more VRAM I leave free, the larger I can set the context window. I haven’t needed more than six or seven thousand tokens of context yet.

I had trouble getting CodeQwen1.5 to load. I recall the errors in the stack trace seeming a little nonspecific, so I originally assumed there was just a compatibility issue somewhere. I dialed back CodeQwen’s massive default context window, and it loaded right up. I was just running out of VRAM!
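In oobabooga that just means dialing back the n_ctx setting before loading the model. If you were loading the same GGUF directly with llama-cpp-python instead, the idea looks something like this; it is only a sketch, not my actual setup:

# Loading a GGUF with a smaller context window so the KV cache fits in 12 GB of VRAM.
# Sketch only: I actually do this through the loader settings in the oobabooga webui.
from llama_cpp import Llama

llm = Llama(
    model_path="codeqwen-1_5-7b-chat-q4_k_m.gguf",
    n_ctx=12288,          # dialed back from the model's enormous default context window
    n_gpu_layers=-1,      # offload every layer to the GPU
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short shell script that renames *.jpeg files to *.jpg"}],
)
print(result["choices"][0]["message"]["content"])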

I have not properly used an LLM to help write any code. I don’t tend to write much code. I just figured I should try out one of the programming-focused LLMs, and CodeQwen seemed to do a nice job of spitting out the short shell scripts that I asked for.

Stable Diffusion is also awesome!

I feel like I have to say this every time I write any words about machine learning. I have been using Stable Diffusion via the automatic1111 webui since I upgraded to my current GPU. It is so much fun generating silly images to break up the walls of words that are my blogs.

I can queue up a bunch of images with different config scales using a handful of different checkpoints, then I can wander off and make my morning latte. I will have several hundred images waiting for me when I get back, and I usually just pick the ones that make me giggle the most. The more fingers the better!
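I do the queueing in the webui itself, but if you launch automatic1111 with the --api flag, the same sort of sweep can be scripted. This is just a rough sketch; the prompt and settings are placeholders:

# Sweep a few CFG scales against a local automatic1111 instance started with --api.
import base64
import requests

for cfg in (4, 7, 10):
    payload = {
        "prompt": "a man using an llm, neon, highly detailed",
        "steps": 25,
        "cfg_scale": cfg,
        "batch_size": 4,
    }
    r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
    r.raise_for_status()
    for i, image in enumerate(r.json()["images"]):
        with open(f"cfg{cfg}_{i}.png", "wb") as handle:
            handle.write(base64.b64decode(image))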

Why am I messing around with running these things locally at all?!

I’ve already said that it doesn’t make a lick of difference if my blog posts go up into the cloud before they are published, and OpenAI’s API is much faster and costs almost nothing. Why bother with any of this?

I think it is neat, and I am having some fun. I am excited that I know where the limits of local models seem to be, and I now understand how much GPU you need to buy to run something useful.

It is awesome that things are moving so quickly. When I bought my 12 GB GPU just over a year ago, I looked around to see what sort of value a large language model small enough to fit in my VRAM might add. At the time, those small models seemed to do a rather poor job.

A year later, and these small models are starting to feel quite capable! I imagine that things will continue to improve in the future.

Getting Llama 3.1 and Gemma 2 working with my Radeon GPU and oobabooga webui required a bit of a kludge!

At least, I think I am using Llama 3.1 8B. You have to roll back llama.cpp to an older version if you want to use the latest releases of oobabooga with AMD’s ROCm, and that version of llama.cpp doesn’t work with Llama 3.1 or Gemma 2, so I am running Llama 3.1 and Gemma 2 9B via ExLlama V2. I have no idea if I am doing this correctly.

Conclusion

As I continue to explore the capabilities of large language models and local alternatives, it’s clear that these tools have the potential to assist with my creative processes in interesting ways. My hope is that machine learning can take some of the monotonous work off my shoulders.

I believe it is doing that to a small extent, but it is also creating more work for me while improving my writing. I’m not sure how much my blog posts are improving when I bounce things off of the artificial intelligence monstrosities, but it is interesting enough that I am going to continue to do so just to see where it leads.

This is the part where GPT-4o Mini suggested that I invite you to join the Butter, What?! Discord community, where we share our experiences, insights, and tips on leveraging AI in creative projects. I can’t say that, because it isn’t even true! There is hardly any machine learning nonsense going on in there, but it is a fun place where a lot of homelab, NAS, and 3D printing projects are often being talked about!

Additionally, I’d love to hear your thoughts on this topic! Have you experimented with local models or found innovative ways to integrate LLMs into your work? Leave me a comment below and let’s start a conversation!
