Skip to main content
HomeTutorialsArtificial Intelligence (AI)

Run LLMs Locally: 7 Simple Methods

Run LLMs locally (Windows, macOS, Linux) by leveraging these easy-to-use LLM frameworks: GPT4All, LM Studio, Jan, llama.cpp, llamafile, Ollama, and NextChat.
May 2024  · 14 min read

Using large language models (LLMs) on local systems is becoming increasingly popular thanks to their improved privacy, control, and reliability. Sometimes, these models can be even more accurate and faster than ChatGPT.

We’ll show seven ways to run LLMs locally with GPU acceleration on Windows 11, but the methods we cover also work on macOS and Linux.

Feature image for the tutorial: 7 Ways to Run LLMs Locally

LLM frameworks that help us run LLMs locally. Image by Abid Ali Awan.

If you want to learn about LLMs from scratch, a good place to start is this course on Large Learning Models (LLMs).

Let’s start by exploring our first LLM framework.

1. GPT4All

The nomic-ai/gpt4all is an LLM framework and chatbot application for all operating systems. We can run the LLMs locally and then use the API to integrate them with any application, such as an AI coding assistant on VSCode. This is the most beginner-friendly and simple method of downloading and running LLMs on your local machines.

A. Downloading the client

Download the installer from the nomic-ai/gpt4all GitHub repository. Your choice depends on your operating system—for this tutorial, we choose Windows.

B. Downloading the model

Install the GPT4All package by selecting the default options. When we launch the GPT4All application, we’ll be prompted to download the language model before using it. Download a model of your choice.

Download the top LLM on GPT4ALL

C. Selecting the model

Once the downloading is complete, close the model page to access the chat user interface.

Select the model you’ve downloaded—we chose Nous Hermes 2 Mistral DPO.

Chose the Model on GPT4ALL

D. Generating response

If you have a CUDA install, it’ll automatically start using a GPU to accelerate the response generation. If not, and you have an Nvidia GPU, you might want to install CUDA Toolkit 12.4 first.

We can use the application similar to how we use ChatGPT online. Notice that it’s much faster than the typical GPT-4 response.

Using the GPT4ALL

E. Model settings

We can customize the model response by going to the settings and playing around with model parameters.

GPT4ALL settings

We can also connect a local folder with the files to get a context-aware response.

Moreover, we can enable the API server so that any application can use our model using an API key.

F. Accessing OpenAI models

We can access GPT-3.5 and GPT-4 models by providing the OpenAI API key.

We need to go to the model’s page, scroll down, provide the API key to the GPT-4 model, and press the install button.

Setting up OpenAI API key for accessing the GPT-4 model

Then, we select the ChatGPT-4 model at the chat user interface.

Selecting the GPT-4 model

We can now start using it as if we’re using it on our browser.

Using GPT4ALL

Let’s move on to our next LLM. This tutorial on LLM classification will help you choose the best LLM for your application.

2. LM Studio

LM Studio provides options similar to GPT4All, except it doesn’t allow connecting a local folder to generate context-aware answers.

A. Installation

We can download the installer from LM Studio’s home page.

Once the download is complete, we install the app with default options.

Finally, we launch LM Studio!

LM Studio on windows

B. Downloading the model

We can download any model from Hugging Face using the search function.

In our case, we'll download the smallest model, Google’s Gemma 2B Instruct.

Downloading the Gemma 2B model on LM studio

C. Generating the response

We can select the downloaded model from the drop-down menu at the top and chat with it as usual. LM Studio offers more customization options than GPT4All.

Using LM Studio

D. Local inference server

Like GPT4All, we can customize the model and launch the API server with one click. To access the model, we can use the OpenAI API Python package, CURL, or directly integrate with any application.

Running Local Inference Server

E. Using multiple models

The key feature of LM Studio is that it offers the option to run and serve multiple models at once. This allows users to compare different model results and use them for multiple applications. In order to run multiple model sessions, we need a high GPU VRAM.

Running multiple LLM models

Fine-tuning is another way of generating context-aware and customized responses. You can learn to fine-tune your Google Gemma model by following the tutorial Fine Tuning Google Gemma: Enhancing LLMs with Customized Instructions. You'll learn to run inference on GPUs/TPUs and fine-tune the latest Gemma 7b-it model on a role-play dataset.

3. Jan

One of the most popular and best-looking local LLM applications is Jan. It’s faster than any local LLM application—it generates a response at 53.26 tokens/sec. For comparison, GPT4All’s rate is 31 tokens/sec.

A. Installation

We can download the installer from Jan.ai.

Once we install the Jan application with default settings, we’re ready to launch the application.

Jan AI Windows application

B. Importing the model

When we covered GPT4All and LM Studio, we already downloaded two models. Instead of downloading another one, we'll import the ones we already have by going to the model page and clicking the Import Model button.

Importing the model file

Then, we go to the applications directory, select the GPT4All and LM Studio models, and import each.

  • GPT4All: "C:/Users/<user_name>/AppData/Local/nomic.ai/GPT4All/"
  • LM Studio: "C:/Users/<user_name>/.cache/lm-studio/models"

C. Accessing the local models

To access the local models, we go to the chat user interface and open the model section in the right panel.

Selecting the Nous-Hermess-2-Mistral-7b model

We see our imported models are already there. We can select the one we want and start using it immediately!

D. Generating the response

The response generation is very fast. The user interface feels natural, similar to ChatGPT, and does not slow down your laptop or PC.

generating the response in the Jan AI

Jan's unique feature is that it allows us to install extensions and use proprietary models from OpenAI, MistralAI, Groq, TensorRT, and Triton RT.

E. Local API server

Like LM Studio and GPT4All, we can also use Jan as a local API server. It provides more logging capabilities and control over the LLM response.

Running the Jan AI local server

4. llama.cpp

Another popular open-source LLM framework is llama.cpp. It's written purely in C/C++, which makes it fast and efficient.

Many local and web-based AI applications are based on llama.cpp. Thus, learning to use it locally will give you an edge in understanding how other LLM applications work behind the scenes.

A. Downloading the llama.cpp

First, we need to go to our project directory using the cd command in the shell—you can learn more about the terminal in this Introduction to Shell course.

Then, we clone all the files from the GitHub server using the command below:

$ git clone --depth 1 https://github.com/ggerganov/llama.cpp.git

B. Using MakeFile on Windows

The make command line tool is available by default in Linux and MacOS. For Windows, however, we need to take the following steps:

  • Download the latest w64devkit Fortran version of w64devkit for Windows.Downloading the w64devkit zip file
  • Extract w64devkit on our local directory.
  • In the main folder, we need to find the file w64devkit.exe and run it.
  • Use the $ cd C:/Repository/GitHub/llama.cpp command to access the llama.cpp folder.
  • Type $ make and press Enter to install llama.cpp.

running the make file to install necessary packages

 

B. Starting the llama.cpp’s WebUI server

After we complete the installation, we run the llama.cpp web UI server by typing out the command below. (Note: We’ve copied the model file from the GPT4All folder to the llama.cpp folder so we can easily access the model).

$ ./server -m Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf -ngl 27 -c 2048 --port 6589

launching the llama.cpp web server

 

The web server is running at http://127.0.0.1:6589/. You can copy this URL and paste it into your browser to access the llama.cpp web interface.

Before interacting with the chatbot, we should modify the settings and model’s parameters.

llama.cpp web app running in the browserCheck out this llama.cpp tutorial if you want to learn more!

D. Generating the response

The response generation is slow because we run it on CPU, not GPU. We must install a different version of llama.cpp to run it on GPU.

$ make LLAMA_CUDA=1

using the llama.cpp web application

 

5. llamafile

If you find llama.cpp a bit too complex, try llamafile. This framework simplifies LLMs for both developers and end-users by combining llama.cpp with Cosmopolitan Libc into a single-file executable. It removes all the complexities associated with LLMs, making them more accessible.

A. Downloading the model file

We can download the model file we want from llamafile’s GitHub repository.

We'll download LLaVA 1.5 because it can also understand images.

downloading the LLaVA 1.5 llamafile

B. Making changes for Windows

Windows users must add .exe to file names in the terminal. To do this, right-click the downloaded file and select Rename.

renaming the llamafile

C. Running the LlamaFile

We first go to llamafile directory by using the cd command in the terminal. Then, we run the command below to start the llama.cpp web server.

$ ./llava-v1.5-7b-q4.llamafile -ngl 9999

The web server uses the GPU without requiring you to install or configure anything.

llamafile running in the terminal

It'll also automatically launch the default web browser with the llama.cpp web application running. If it doesn’t, we can use the URL http://127.0.0.1:8080/ to access it directly.

D. Generating the response

After we settle on the model’s configuration, we can start using the web application.

llamafile web application

Running the llama.cpp using the llamafile is easier and more efficient. We generated the response with 53.18 tokens/sec (without llamafile, the rate was 10.99 tokens/sec).

using the llama.cpp web app

 

6. Ollama

Ollama is a tool that allows us to easily access through the terminal LLMs such as Llama 3, Mistral, and Gemma.

Additionally, multiple applications accept an Ollama integration, which makes it an excellent tool for faster and easier access to language models on our local machine.

A. Installing Ollama

We can download Ollama from the download page.

Once we install it (use default settings), the Ollama logo will appear in the system tray.

ollama running in the notification bar

B. Running Ollama

We can download the Llama 3 model by typing the following terminal command:

$ ollama run llama3

Llama 3 is now ready to use! Bellow, we see a list of commands we need to use if we want to use other LLMs:

Various ollama commands for using various LLMs

C. Running custom models

To access models that have already been downloaded and are available in the llama.cpp folder, we need to:

 
  • Go to the llama.cpp folder using the cd command.
$ cd C:/Repository/GitHub/llama.cpp
 
  • Create a filename called Modelfile and add the line "FROM ./Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf".
$ echo "FROM ./Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf" > Modelfile
 
  • Build the model by providing the model’s name.
$ ollama create NHM-7b -f Modelfile
 

creating the custom model

  • Run the NHM-7b model.
$ ollama run NHM-7b
 
  • Use it as any other chat application.

running the custom model

With this method, we can download any LLM from Hugging Face with the .gguf extension and use it in the terminal. If you want to learn more, check out this course on Working with Hugging Face.

7. NextChat

NextChat, previously known as ChatGPT-Next-Web, is a chat application that allows us to use GPT-3, GPT-4, and Gemini Pro via an API key.

It’s also available on the web UI, and we can even deploy our own web instant using one click on Vercel.

You can learn more about NextChat by following this detailed introduction to ChatGPT Next Web (NextChat).

A. Installation

We can download the installer from the GitHub repository. For Windows, select the .exe file.

download the windows installer for NextChat

As before, we’ll install the package using the default settings.

B. Setting up the API key

The NextChat application won't run until we add a Google AI or OpenAI API key.

NextChat windows application

To get the API key for Google AI, we need to go to Gemini API and click the blue button Get API key in Google AI Studio.

After that, we need to click the Get API key button and then create and copy the API key. It’s free, with no token limitation.

Once we have the API key, we navigate to NextChat’s settings and scroll down to the Model Provider section. We select Google as the model provider and paste the API key into the designated field.

NextChat Settings

C. Generating the response

On the main chat user interface page, click the robot (🤖) button above the chat input and select the gemini-pro model.

Selecting the Gemini Pro model

We use Google Gemini locally and have full control over customization. The user data is also saved locally.

Using the NextChat application

Similarly, we can use the OpenAI API key to access GPT-4 models, use them locally, and save on the monthly subscription fee.

Conclusion

Installing and using LLMs locally can be a fun and exciting experience. We can experiment with the latest open-source models on our own, enjoy privacy, control, and an enhanced chat experience.

Using LLMs locally also has practical applications, such as integrating it with other applications using API servers and connecting local folders to provide context-aware responses. In some cases, it is essential to use LLMs locally, especially when privacy and security are critical factors.

You can learn more about LLMs and building AI applications by following these resources:


Photo of Abid Ali Awan
Author
Abid Ali Awan

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.

Topics

Build your AI career with DataCamp!

track

AI Business Fundamentals

11hrs hours
Accelerate your AI journey, conquer ChatGPT, and develop a comprehensive Artificial Intelligence strategy.
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

blog

The Pros and Cons of Using LLMs in the Cloud Versus Running LLMs Locally

Key Considerations for selecting the optimal deployment strategy for LLMs.
Abid Ali Awan's photo

Abid Ali Awan

8 min

blog

8 Top Open-Source LLMs for 2024 and Their Uses

Discover some of the most powerful open-source LLMs and why they will be crucial for the future of generative AI
Javier Canales Luna's photo

Javier Canales Luna

13 min

tutorial

How to Run Llama 3 Locally: A Complete Guide

Run LLaMA 3 locally with GPT4ALL and Ollama, and integrate it into VSCode. Then, build a Q&A retrieval system using Langchain, Chroma DB, and Ollama.
Abid Ali Awan's photo

Abid Ali Awan

15 min

tutorial

How to Build LLM Applications with LangChain Tutorial

Explore the untapped potential of Large Language Models with LangChain, an open-source Python framework for building advanced AI applications.
Moez Ali's photo

Moez Ali

12 min

tutorial

Fine-Tuning Llama 3 and Using It Locally: A Step-by-Step Guide

We'll fine-tune Llama 3 on a dataset of patient-doctor conversations, creating a model tailored for medical dialogue. After merging, converting, and quantizing the model, it will be ready for private local use via the Jan application.
Abid Ali Awan's photo

Abid Ali Awan

19 min

tutorial

Llama.cpp Tutorial: A Complete Guide to Efficient LLM Inference and Implementation

This comprehensive guide on Llama.cpp will navigate you through the essentials of setting up your development environment, understanding its core functionalities, and leveraging its capabilities to solve real-world use cases.
Zoumana Keita 's photo

Zoumana Keita

11 min

See MoreSee More