We live in a day and age where using AI on our own servers has become possible. As a business owner and software entrepreneur, I think it's just as interesting as it is important to consider setting up AI systems on your own backend instead of relying on hosted platforms and APIs.
If you're a software founder interested in using AI technology without depending on someone else’s unit economics, this is for you. Today, we’ll dive into running your own ChatGPT replacement for fun and profit.
I’ve been tinkering with this tech over the last few weeks to great success. So, in the spirit of building in public, why not tell you what I did and how I did it?
In my latest SaaS, PodScan, I use two types of artificial intelligence: an audio-to-text transcription system and an LLM, like ChatGPT, that generates responses based on prompts.
The transcription system is cool but not something every software entrepreneur needs since it's specific to converting audio into text, and audio is a niche medium for most of us. But all software founders work with text data somewhere in our databases: customer records, notes, instructions, it's all text.
Founders got very excited when ChatGPT came out. All of a sudden, particularly once we could access the service through an API, we could build on top of these amazingly “smart” language models.
And that pioneering spirit has brought us to an interesting inflection point. Because the “Open” in OpenAI has been a catalyst for the open-source community.
It turns out that the most exciting development in recent years is not just the existence of ChatGPT but the fact that many universities, research groups, and companies have open-sourced their code for training these models.
And when the nerds start building stuff together in the open, working on public data, for free and without restrictions, interesting things happen.
One of them is llama.cpp. It’s a cross-platform framework that allows us to run our own AI models on our own consumer hardware. Now, we don’t have the massive GPUs and RAM amounts that the big guys have, but we get to run tech that’s almost as good. And in most cases, it’s good enough.
A big part of the appeal is that we avoid the costs and dependencies that come with hosted platforms. On top of that, we gain more control and flexibility over the AI features in our businesses. So risk goes down, and control goes up.
That’s the indie founder’s dream, right?
Let me share an example from just this week.
Podscan was, until Wednesday, a keyword alerting tool for podcasts. You’d write down a list of words, Podscan would transcribe every newly released podcast out there, compare your list against the transcript, and alert you if there was a match.
So far, so good. This is already creating massive value in the world of podcast discovery, which is severely underserved.
But what if you don’t know the keywords beforehand? What if you want to be alerted for something as nebulous as “podcasts where people talk about community events organized by women” or “podcasts where people really nerd out about their favorite Sci-Fi show?”
Even if you wanted to, you couldn’t come up with all the keywords that would allow you to reliably match every podcast that falls into those categories.
But what if you could ask each show a simple question? “Does this episode have nerds talking about sci-fi in an excited way?” That’s what local AI allowed me to build. With the help of llama.cpp and an LLM called Mistral 7B, I set up a backend service that takes a transcript and a question and spits out either a “yes” or a “no.”
Any transcript. Any question! In under a second per combination.
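For the technically curious, here’s a minimal sketch of what such a service can look like. It assumes llama.cpp’s example server is already running on localhost:8080 with a Mistral 7B model loaded; the port, prompt wording, and function name are illustrative, not PodScan’s actual implementation.

```python
# A minimal sketch: ask a local llama.cpp server a yes/no question about a transcript.
# Assumes the llama.cpp example server runs on localhost:8080 with a Mistral 7B model
# loaded; the request shape follows its /completion HTTP API.
import requests

def answer_question(transcript: str, question: str) -> bool:
    prompt = (
        "You are a strict classifier. Read the podcast transcript and answer "
        "the question with a single word: yes or no.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    response = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": prompt,
            "n_predict": 3,      # we only need a single short word back
            "temperature": 0.0,  # deterministic: no creative answers, please
        },
        timeout=30,
    )
    answer = response.json()["content"].strip().lower()
    return answer.startswith("yes")
```

Feed it any transcript and any question, and you get a boolean back, which is all an alerting pipeline needs.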
And, most importantly, all of this runs on the same hardware that my transcription servers already use to slurp up new podcast episodes and transcribe them. I don’t have to count API calls. I just have to have a computer with a GPU and 8 GB of RAM.
Cloud hosting for GPUs is still super expensive. You pay around $500 per month for a single server with a GPU. But even a Mac Mini can run this kind of AI inference, answering one question about a transcript every second.
Now, platforms have started to compete on price, and OpenAI’s API is a big mover here. It's really affordable to use GPT-3.5, the “budget” version, for any task where you need scale. You can get millions of tokens (chunks of text, each a few characters long) for under a dollar. That's impressive and fits most budgets.
However, it doesn't fit all budgets. If you deal with lots of data and need to run prompts on that data constantly, like analyzing every podcast out there, GPT-3.5 can cost tens or even hundreds of dollars per day. GPT-4 would easily go into the thousands. That's not scalable for a small business. But running your own local LLM sure is.
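To make that concrete, here’s a rough back-of-envelope calculation. Every number in it is an illustrative assumption (episode volume, transcript length, per-token prices), not PodScan’s actual figures or current list prices.

```python
# Back-of-envelope: running one prompt over every newly released episode, every day.
# All values are illustrative assumptions, not real PodScan numbers or current prices.
episodes_per_day = 30_000         # assumed number of newly published episodes
tokens_per_transcript = 10_000    # assumed average transcript length in tokens

daily_tokens = episodes_per_day * tokens_per_transcript  # 300 million tokens per day

gpt35_per_million = 0.50   # assumed $ per million input tokens
gpt4_per_million = 10.00   # assumed $ per million input tokens

print(f"GPT-3.5: ~${daily_tokens / 1e6 * gpt35_per_million:,.0f} per day")  # ~$150
print(f"GPT-4:   ~${daily_tokens / 1e6 * gpt4_per_million:,.0f} per day")   # ~$3,000
```

On a local machine you’ve already paid for, that marginal cost is essentially just electricity.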
And here's another thing that business owners eventually need to scale. Our sponsor this week is Paid.
Rediscover payroll with Paid. This cutting-edge system is designed specifically for founders and owners, providing all essential payroll services at zero cost. Set yourself apart with an exclusive just-in-time submission feature, enhancing cash flow by processing payroll hours before employees receive their pay. Additionally, incentivize and retain top talent with innovative flexible compensation capabilities, allowing you to restructure - not just increase - compensation. With Paid, superior payroll solutions are finally accessible, free, and tailored to empower your team's financial wellness and performance!
Having AI on a server was impossible for a long time. And the requirement of GPUs still makes it expensive.
But fortunately, AI work comes in two forms: training and inference (applying a prompt and getting a reply). And inference is surprisingly possible on regular old CPUs as well.
Traditionally, over the last few years, all kinds of machine learning and AI work has been done on a GPU because GPUs are designed for massively parallel computation. Recent GPUs have added tensor cores, which accelerate the matrix math behind machine learning, modern game rendering, and other computation-heavy tasks.
Your computer's boring old CPU handles regular computations, and a GPU is much faster for certain tasks. But thanks to optimizations like quantization, LLMs that used to require a GPU now run hilariously fast on a CPU. So you can run these models on your computer without a graphics card.
Which is what most servers are: computers without GPUs, but with plenty of RAM and a lot of CPU cores. This shift has fueled an open-source community creating local large language models that run on either kind of chip.
The .cpp in llama.cpp (and whisper.cpp, its speech-to-text sibling project) stands for C++, a sign that these tools were built with efficient CPU-based inference in mind.
And the open-source nature of these projects has been supported by an unlikely ally.
Meta, the company behind Facebook, released an open source large language model called Llama in 2023. That is BIG! Since OpenAI’s GPT models are proprietary and the company has published only research papers and no code, people have been inspired to build their own models using public data. There's even a benchmarking system to compare these self-trained models with GPT-3 or GPT-4. And some of these new models come pretty close.
Now, independent companies and open-source communities release new large language models almost daily, many of which outperform GPT-3.5 and come close to GPT-4 in speed and accuracy. These open-source models are available to everyone and help advance the field of AI language processing.
You’ll find them all on HuggingFace.co, a website where you can download open-source language models in various forms. I recommend following Tom Jobbins (better known as TheBloke) there. His model releases come in all kinds of formats and have been reliably good. People like Tom share the model weights and all the files needed to run them on your computer.
And that’s all there is to it. You download llama.cpp, you compile it, you download a model, and then you’re done. llama.cpp lets you run a one-off command on your computer or start a server that loads a large language model into your graphics card’s memory or your regular RAM. It then lets you do local inference through an HTTP API. It even comes with an example web page.
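And because newer builds of that server can expose an OpenAI-compatible endpoint (an assumption about the version you compile, so check yours), you can even point the standard OpenAI client library at your own machine and treat it as a drop-in ChatGPT replacement. A minimal sketch:

```python
# A sketch of using a local llama.cpp server as a drop-in ChatGPT replacement.
# Assumes a llama.cpp server build with the OpenAI-compatible API enabled,
# listening on localhost:8080; the model name and port are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # point the client at your local server
    api_key="not-needed-locally",         # the local server doesn't check the key
)

reply = client.chat.completions.create(
    model="mistral-7b",  # llama.cpp serves whichever model it loaded at startup
    messages=[{"role": "user", "content": "Summarize this transcript in one sentence: ..."}],
)
print(reply.choices[0].message.content)
```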
Even if you’re not into AI and don’t see an immediate use for it, I highly recommend looking into this. llama.cpp detects the best hardware your computer has for running inference. It checks if there's a GPU available, if you have the right drivers, and if you have the necessary toolkits installed on your system. Then, it uses these to maximize efficiency. You’ll have AI running as fast as your personal computer allows, right on your own machine. It’s quite magical.
And it’s yours.
This is the year when software entrepreneurs learn how to wrangle control back onto their systems. It’s a wild ride, for sure, and things change every day, but there is something incredibly powerful about knowing that OpenAI can implode and shut down their API tomorrow, but my local installation of Mistral and a couple of cloud computers that I have it running on will still be mine to command.
Local AI is here to stay.
I'll share a few updates about my SaaS on the pod, and I'd love to know what you think about them! Please leave a voice message at podline.fm/arvid 🥰
And if you want to track your brand mentions on podcasts, check out podscan.fm!
Classifieds
I just launched The Bootstrapper's Bundle, which contains Zero to Sold, The Embedded Entrepreneur, and Find your Following. If you want to start a bootstrapped business and build a validated product and a personal platform while doing it, check out this bundle. It contains all eBooks, audiobooks, video courses and extra materials I ever created. It's just $50, for now.
Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.
If you want to reach tens of thousands of creators, makers, and dreamers, you can apply to sponsor an episode of this newsletter. Or just reply to this email!