
Set up a local LLM on CPU with chat UI in 15 minutes
By Kasper Groes Albin Ludvigsen, February 2024


This blog post shows how to run an LLM locally and how to set up a ChatGPT-like GUI in four easy steps.


Thanks to the global open source community, it is now easier than ever to run performant large language models (LLMs) on consumer laptops or CPU-based servers and to interact with them easily through well-designed graphical user interfaces.

This is particularly valuable to all the organizations that are not allowed, or not willing, to use services that require sending data to a third party.

This tutorial shows how to set up a local LLM with a neat ChatGPT-like UI in four easy steps. If you have the prerequisite software installed, it will take you no more than 15 minutes of work (excluding the computer processing time used in some of the steps).

This tutorial assumes you have the following installed on your machine:

  • Ollama
  • Docker
  • React
  • Python and common packages including transformers

Now let’s get going.

The first step is to decide what LLM you want to run locally. Maybe you already have an idea. Otherwise, for English, the instruct version of Mistral 7b seems to be the go-to choice. For Danish, I recommend Munin-NeuralBeagle, although it's known to over-generate tokens (perhaps because it's a merge of a model that was not instruction fine-tuned). For other Scandinavian languages, see ScandEval's evaluation of Scandinavian generative models.

Next step is to quantize your chosen model, unless you selected a model that is already quantized – if your model's name ends in GGUF or GPTQ, it is already quantized. Quantization is a technique that converts the weights of a model (its learned parameters) to a smaller data type than the original, e.g. from fp16 to int4. This makes the model take up less memory, and it also makes inference faster, which is a nice feature if you're running on CPU.
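
To get a sense of the savings, here is a rough back-of-envelope calculation for a 7B-parameter model. It ignores runtime overhead such as activations and the KV cache, so treat the numbers as approximate:

# Approximate weight memory for a 7B-parameter model at different precisions.
# Overhead (activations, KV cache, quantization metadata) is deliberately ignored.
n_params = 7_000_000_000
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

for dtype, n_bytes in bytes_per_param.items():
    print(f"{dtype}: ~{n_params * n_bytes / 1024**3:.1f} GB")

# Prints roughly: fp16 ~13.0 GB, int8 ~6.5 GB, int4 ~3.3 GB

Going from fp16 to int4 cuts the weight memory by roughly a factor of four, which is what makes a 7B model practical on a typical laptop.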

The script quantize.py in my repo local_llm is adapted from Maxime Labonne’s fantastic Colab notebook (see his LLM course for other great LLM resources). You can use his notebook or my script. The method has been tested on Mistral and Mistral-like models.

To quantize, first clone my repo:

git clone https://github.com/KasperGroesLudvigsen/local_llm.git

Now, change the MODEL_ID variable in the quantize.py file to reflect your model of choice.
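
As an illustration, if you chose the instruct version of Mistral 7b, the variable would point at its Hugging Face model ID. The exact layout of quantize.py may differ slightly from this sketch:

# In quantize.py: point MODEL_ID at the Hugging Face repo of your chosen model.
# The ID below is just an example; swap in any compatible model.
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"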

Then, in your terminal, run the script:

python quantize.py

This will take some time. While the quantization process runs, you can proceed to the next step.

We will run the model with Ollama. Ollama is a software framework that neatly wraps a model into an API. Ollama also integrates easily with various front ends as we’ll see in the next step.

To build an Ollama image of the model, you need a so-called model file which is a plain text file that configures the Ollama image. If you’re acquainted with Dockerfiles, Ollama’s model files will look familiar.

In the example below, we first specify which LLM to use. We’re assuming that there is a folder in your repo called mistral7b and that the folder contains a model called quantized.gguf. Then we set the model’s context window to 8,000 – Mistral 7b’s max context size. In the Modelfile, you can also specify which prompt template to use, and you can specify stop tokens.

Save the model file, e.g. as Modelfile.txt.

For more configuration options, see Ollama’s GitHub.

FROM ./mistral7b/quantized.gguf

PARAMETER num_ctx 8000

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop <|im_end|>
PARAMETER stop <|im_start|>user
PARAMETER stop <|end|>
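
To make the TEMPLATE directive concrete: Ollama replaces {{ .System }} and {{ .Prompt }} at request time, so a single exchange expands to a plain ChatML-style string. The sketch below only mimics that substitution for illustration; Ollama does it for you:

# Illustration only: the string the TEMPLATE above expands to for one exchange.
# Ollama performs this substitution itself; this just shows the resulting format.
system_msg = "You are a helpful assistant."
user_prompt = "What is quantization?"

rendered = (
    "<|im_start|>system\n"
    f"{system_msg}<|im_end|>\n"
    "<|im_start|>user\n"
    f"{user_prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(rendered)

The stop parameters then tell Ollama to cut generation as soon as the model emits one of those tokens, so the reply ends cleanly at the assistant turn.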

Now that you have made the Modelfile, build an Ollama image from the Modelfile by running this from your terminal. This will also take a few moments:

ollama create choose-a-model-name -f <location of the file e.g. ./Modelfile>

When the “create” process is done, start the Ollama server by running this command. This will expose all your Ollama model(s) in a way that the GUI can interact with them.

ollama serve
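
By default, the server listens on http://localhost:11434, and you can already talk to the model over Ollama's REST API before any GUI is in place. A minimal sketch, assuming you named the model "mistral7b-local" in the create step and have the requests package installed:

import requests

# Query the local Ollama server directly over its REST API.
# "mistral7b-local" is a placeholder -- use the name you passed to `ollama create`.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral7b-local",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return the full answer as a single JSON object
    },
    timeout=120,
)
print(response.json()["response"])

If this prints a sensible answer, the model is served correctly and the GUI in the next step only needs to point at the same address.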

The next step is to set up a GUI to interact with the LLM. Several options exist for this. In this tutorial, we’ll use “Chatbot Ollama” – a very neat GUI that has a ChatGPT feel to it. “Ollama WebUI” is a similar option. You can also set up your own chat GUI with Streamlit.

By running the commands below, you’ll first clone the Chatbot Ollama GitHub repo, then change into the project folder and install its dependencies:

git clone https://github.com/ivanfioravanti/chatbot-ollama.git
cd chatbot-ollama
npm ci

The next step is to build a Docker image from the Dockerfile. If you’re on Linux, you need to change the OLLAMA_HOST environment variable in the Dockerfile from http://host.docker.internal:11434 to http://localhost:11434.

Now, build the Docker image and run a container from it by executing these commands from a terminal. Run them from the root of the project.

docker build -t chatbot-ollama .

docker run -p 3000:3000 chatbot-ollama

The GUI is now running inside a Docker container on your local computer. In the terminal, you’ll see the address at which the GUI is available (e.g. “http://localhost:3000”).

Visit that address in your browser, and you should now be able to chat with the LLM through the Ollama Chat UI.

This concludes this brief tutorial on how to easily set up a chat UI that lets you interact with an LLM running on your local machine. Easy, right? Only four steps were required:

  1. Select a model on Hugging Face
  2. (Optional) Quantize the model
  3. Wrap model in Ollama image
  4. Build and run a Docker container that wraps the GUI

Remember, it’s all made possible because open source is awesome 👏

GitHub repo for this article: https://github.com/KasperGroesLudvigsen/local_llm


