
Llama.cpp Tutorial: A Complete Guide to Efficient LLM Inference and Implementation

This comprehensive guide on Llama.cpp will navigate you through the essentials of setting up your development environment, understanding its core functionalities, and leveraging its capabilities to solve real-world use cases.

Nov 2023 · 11 min read

Large language models (LLMs) are revolutionizing various industries. From customer service chatbots to sophisticated data analysis tools, the capabilities of this powerful technology are reshaping the landscape of digital interaction and automation.

However, practical applications of LLMs can be limited by the need for high-powered computing or the necessity for quick response times. These models typically require sophisticated hardware and extensive dependencies, which can make their adoption difficult in more constrained environments.

This is where LLaMa.cpp (or LLaMa C++) comes to the rescue, providing a lighter, more portable alternative to the heavyweight frameworks.

Llama.cpp logo (source)


What is Llama.cpp?

Llama.cpp was developed by Georgi Gerganov. It implements Meta’s LLaMa architecture in efficient C/C++, and it has one of the most dynamic open-source communities around LLM inference, with more than 390 contributors, 43,000+ stars on the official GitHub repository, and 930+ releases at the time of writing.

Some key benefits of using Llama.cpp for LLM inference

Universal Compatibility: Llama.cpp's design as a CPU-first C++ library means less complexity and seamless integration into other programming environments. This broad compatibility accelerated its adoption across various platforms. 
Comprehensive Feature Integration: Acting as a repository for critical low-level features, Llama.cpp mirrors LangChain's approach for high-level capabilities, streamlining the development process albeit with potential future scalability challenges. 
Focused Optimization: Llama.cpp focuses on a single model architecture, enabling precise and effective improvements. Its commitment to Llama models through formats like GGML and GGUF has led to substantial efficiency gains.

With this understanding of Llama.cpp, the next sections of this tutorial walk through the process of implementing a text generation use case. We start by exploring the Llama.cpp basics, understanding the overall end-to-end workflow of the project at hand, and analyzing one of its applications in industry.

Llama.cpp Architecture

Llama.cpp’s backbone is the original Llama family of models, which is based on the transformer architecture. The authors of Llama leveraged various improvements that were subsequently proposed and used in different models such as PaLM.

Difference between Transformers and Llama architecture (Llama architecture by Umar Jamil)

The main differences between the LLaMa architecture and the original transformer architecture are:

Pre-normalization (GPT3): used to improve training stability by normalizing the input of each transformer sub-layer with RMSNorm, instead of normalizing the output.
SwiGLU activation function (PaLM): the original ReLU non-linearity is replaced by the SwiGLU activation function, which leads to performance improvements.
Rotary Embeddings (GPTNeo): rotary positional embeddings (RoPE) were added at each layer of the network after removing the absolute positional embeddings.
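To make the three changes above more concrete, here is a minimal pure-Python sketch of each operation on a single vector. It is an illustration only, not the code used by Llama.cpp: the function names, the eps value, the base of 10000, and the pairing of neighboring elements in rope are assumptions chosen for readability, and real implementations operate on batched tensors with learned weights.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (also called Swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Element-wise SwiGLU gating: SiLU(gate) * up.

    In a feed-forward block, gate and up would be two separate linear
    projections of the same input; here they are passed in directly.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


def rope(x, position, base=10000.0):
    """Rotate the pairs (x[0], x[1]), (x[2], x[3]), ... by an angle that
    depends on the token position and on the pair index.

    Assumes len(x) is even.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        out.append(x[i] * cos_t - x[i + 1] * sin_t)
        out.append(x[i] * sin_t + x[i + 1] * cos_t)
    return out
```

Two properties are easy to check by hand: rms_norm returns a vector whose mean square is close to 1 when all gains are 1, and rope leaves the length of the vector unchanged because each pair is only rotated.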

Setting Up the Environment

The prerequisites to start working with Llama.cpp include:

Python: needed to run pip, the Python package manager
llama-cpp-python: the Python binding for Llama.cpp

Create the virtual environment

It is recommended to create a virtual environment to avoid any trouble related to the installation process, and conda can be a good candidate for the environment creation.

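All the commands in this section are run from a terminal. Using the conda create statement, we create a virtual environment called llama-cpp-env. Adding python to the command installs Python and pip inside the new environment, so that the later pip install targets it.

conda create --name llama-cpp-env python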

After successfully creating the virtual environment, we activate it using the conda activate statement, as follows:

conda activate llama-cpp-env

The above statement should display the name of the environment in parentheses at the beginning of the terminal prompt, as follows:

Name of the virtual environment after activation

Now, we can install the Llama-cpp-python package as follows:

pip install llama-cpp-python
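or, to pin a specific release (note that the GGUF model used later in this tutorial requires a release recent enough to read GGUF files):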
pip install llama-cpp-python==0.1.48

To make sure the installation is successful, let’s create a script containing only the import statement and then execute it.

First, add from llama_cpp import Llama to a file named llama_cpp_script.py, then
Run python llama_cpp_script.py to execute the file. If the library fails to import, an error is thrown, and the installation process needs further diagnosis.

The successful execution of llama_cpp_script.py means that the library is correctly installed.
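As a slightly more informative variant of the check above, the sketch below reports whether the binding can be found and, when package metadata is available, which version is installed. The function name binding_status is hypothetical; the script only relies on the standard library and assumes the package is distributed as llama-cpp-python and imported as llama_cpp, as in this tutorial.

```python
import importlib.util
from importlib import metadata


def binding_status(module_name="llama_cpp", dist_name="llama-cpp-python"):
    """Return (installed, version) without importing the module itself."""
    if importlib.util.find_spec(module_name) is None:
        return False, None
    try:
        return True, metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The module is importable, but no metadata exists under that name.
        return True, None


if __name__ == "__main__":
    installed, version = binding_status()
    if installed:
        print(f"llama-cpp-python is importable (version: {version})")
    else:
        print("llama-cpp-python is not installed in this environment")
```

Running the script inside the activated llama-cpp-env environment helps confirm that pip installed the package into that environment rather than into another Python installation.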

Understand Llama.cpp Basics

At this stage, the installation should be complete, so let’s dive into the basics of Llama.cpp.

The Llama class imported above is the main entry point when using Llama.cpp from Python. Its constructor loads the model, and calling the resulting object generates text. In the list below, model_path and n_gpu_layers are passed to the constructor, while the other parameters are passed when calling the model. The list is not exhaustive; the complete list of parameters is provided in the official documentation:

model_path: The path to the Llama model file being used
prompt: The input prompt to the model. This text is tokenized and passed to the model.
n_gpu_layers: The number of model layers to offload to the GPU when the library is built with GPU support. With the default value of 0, the model runs entirely on the CPU.
max_tokens: The maximum number of tokens to be generated in the model’s response
stop: A list of strings that will cause the model generation process to stop
temperature: This value is typically set between 0 and 1. The lower the value, the more deterministic the end result. On the other hand, a higher value leads to more randomness, hence more diverse and creative output.
top_p: Controls the diversity of the predictions through nucleus sampling, meaning that only the most probable tokens whose cumulative probability reaches the given threshold are considered. A value close to zero keeps only the few most likely tokens, while a higher value admits more candidates and hence more varied output.
echo: A boolean used to determine whether the model includes the original prompt at the beginning of its output (True) or does not include it (False)
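The effect of temperature and top_p can be illustrated on a toy distribution. The sketch below is not how Llama.cpp implements sampling internally; it is a simplified pure-Python version with hypothetical function names, and it assumes a temperature strictly greater than zero.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
    print(softmax_with_temperature(toy_logits, temperature=1.0))
    print(softmax_with_temperature(toy_logits, temperature=0.3))
    print(top_p_filter(softmax_with_temperature(toy_logits), top_p=0.1))
```

With a temperature of 0.3, most of the probability moves to the highest-scoring token, and with a top_p of 0.1 only that single token survives the filter in this toy example, which is why such low values produce nearly deterministic output.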

For instance, let’s consider that we want to use a large language model called <MY_AWESOME_MODEL> stored in the current working directory. The instantiation process will look like this:

from llama_cpp import Llama

# Instantiate the model
my_awesome_llama_model = Llama(model_path="./MY_AWESOME_MODEL")

# Define the parameters
prompt = "This is a prompt"
max_tokens = 100
temperature = 0.3
top_p = 0.1
echo = True
stop = ["Q", "\n"]

# Run the model on the prompt
model_output = my_awesome_llama_model(
    prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    echo=echo,
    stop=stop,
)

# Keep only the generated text
final_result = model_output["choices"][0]["text"].strip()

The code is self-explanatory and can be easily understood from the initial bullet points stating the meaning of each parameter.

The result of the model is a dictionary containing the generated response along with some additional metadata. The format of the output is explored in the next sections of the article.
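The stop list deserves a closer look, because short entries can end a response earlier than expected. The sketch below approximates the effect after the fact on a finished string, under the assumption that the stop string itself is excluded from the returned text. The library applies stopping while generating, so this is an illustration of the outcome rather than of the mechanism, and truncate_at_stop is a hypothetical helper.

```python
def truncate_at_stop(text, stop):
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for marker in stop:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


if __name__ == "__main__":
    stop = ["Q", "\n"]
    print(repr(truncate_at_stop("Answer one.\nQ: next question", stop)))
    print(repr(truncate_at_stop("High Quality code", stop)))
```

Since matching is case-sensitive and works on substrings, a stop entry such as "Q" also cuts a response at any word containing a capital Q, and "\n" limits the response to a single line.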

Your First Llama.cpp Project

Now, it is time to get started with the implementation of the text generation project. Starting a new Llama.cpp project requires nothing more than following the above Python code template, which covers all the steps from loading the large language model of interest to generating the final response.

The project leverages the GGUF version of the Zephyr-7B-Beta from Hugging Face. It is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO).

Our tutorial An Introduction to Using Transformers and Hugging Face provides a better understanding of Transformers and how to harness their power to solve real-life problems. We also have a Mistral 7B tutorial.

Zephyr model from Hugging Face (source)

Once the model is downloaded locally, we can move it to the project location in the model folder. Before diving into the implementation, let’s understand the project structure:

The structure of the project

The first step is to load the model using the Llama constructor. Since this is a large model, it is important to specify the maximum context size of the model to be loaded. In this specific project, we use 512 tokens.

from llama_cpp import Llama


# GLOBAL VARIABLES
my_model_path = "./model/zephyr-7b-beta.Q4_0.gguf"
CONTEXT_SIZE = 512


# LOAD THE MODEL
zephyr_model = Llama(model_path=my_model_path,
                    n_ctx=CONTEXT_SIZE)
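The context size is shared between the prompt and the generated tokens, so a long prompt leaves less room for the response. The following sketch shows that arithmetic with two hypothetical helpers; the token counts are assumed to come from the model's tokenizer, which is not shown here.

```python
def fits_in_context(prompt_tokens, max_tokens, n_ctx=512):
    """True when the prompt plus the requested completion fit in the window."""
    return prompt_tokens + max_tokens <= n_ctx


def max_completion_budget(prompt_tokens, n_ctx=512):
    """Largest max_tokens value that still fits after the prompt."""
    return max(n_ctx - prompt_tokens, 0)


if __name__ == "__main__":
    # A 12-token prompt with max_tokens=100 fits comfortably in 512 tokens.
    print(fits_in_context(prompt_tokens=12, max_tokens=100))
    # A 500-token prompt leaves room for only 12 generated tokens.
    print(max_completion_budget(prompt_tokens=500))
```

For longer prompts or responses, the value of CONTEXT_SIZE can be raised, at the cost of more memory.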

Once the model is loaded, the next step is the text generation phase. We reuse the original code template, this time wrapped in a helper function called generate_text_from_prompt.

def generate_text_from_prompt(user_prompt,
                              max_tokens=100,
                              temperature=0.3,
                              top_p=0.1,
                              echo=True,
                              stop=["Q", "\n"]):

    # Run the model with the given parameters
    model_output = zephyr_model(
        user_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )

    return model_output

Within the __main__ clause, the function can be executed using a given prompt.

if __name__ == "__main__":

    my_prompt = "What do you think about the inclusion policies in Tech companies?"

    zephyr_model_response = generate_text_from_prompt(my_prompt)

    print(zephyr_model_response)

The model response is provided below:

The model’s response

The prompt sent to the model is <What do you think about the inclusion policies in Tech companies?>, and the exact response of the model is highlighted in the orange box.

The original prompt has 12 tokens
The response (completion) has 10 tokens
The total is the sum of the two, which is 22 tokens
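The token counts above can also be read programmatically. The sketch below uses a hand-written stand-in for the returned dictionary, assuming it follows the OpenAI-style completion layout with a "usage" entry. Only the prompt, the answer text, the model path, and the three token counts come from this tutorial; the remaining values are placeholders, and summarize_usage is a hypothetical helper.

```python
# Hand-written stand-in for a completion dictionary. The id, created,
# logprobs, and finish_reason values are placeholders for illustration.
sample_response = {
    "id": "cmpl-placeholder",
    "object": "text_completion",
    "created": 0,
    "model": "./model/zephyr-7b-beta.Q4_0.gguf",
    "choices": [
        {
            "text": "What do you think about the inclusion policies in Tech companies?"
                    " Tech companies want diverse workforces to build better products.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 10, "total_tokens": 22},
}


def summarize_usage(response):
    """Format the token counts stored under the 'usage' key."""
    usage = response["usage"]
    return (f"prompt={usage['prompt_tokens']} "
            f"completion={usage['completion_tokens']} "
            f"total={usage['total_tokens']}")


if __name__ == "__main__":
    print(summarize_usage(sample_response))
```

Tracking these numbers is a simple way to notice when prompts start approaching the context size chosen at load time.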

Even though this complete output can be useful for further use, we might be only interested in the textual response of the model. We can format the response to get such a result by selecting the “text” field of the “choices” element as follows:

final_result = zephyr_model_response["choices"][0]["text"].strip()

The strip() function is used to remove any leading and trailing whitespace from a string, and the result is:

Tech companies want diverse workforces to build better products.
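Because the helper function is called with echo set to True, the "text" field can start with the prompt itself. A minimal sketch of a helper that drops the echoed prompt before stripping whitespace is shown below; extract_answer is a hypothetical name, and the dictionary is a hand-written stand-in containing only the field that the helper reads.

```python
def extract_answer(response, prompt=None):
    """Return the generated text, dropping the echoed prompt when present."""
    text = response["choices"][0]["text"]
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hand-written stand-in for the dictionary returned by the helper function.
my_prompt = "What do you think about the inclusion policies in Tech companies?"
echoed_response = {
    "choices": [
        {"text": my_prompt
                 + " Tech companies want diverse workforces to build better products."}
    ]
}

print(extract_answer(echoed_response, my_prompt))
```

Setting echo to False when calling the model avoids the need for this step altogether.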

Llama.cpp Real-World Applications

This section walks through a real-world application of Llama.cpp and provides the underlying problem, the possible solution, and the benefits of using Llama.cpp.

Problem

Imagine ETP4Africa, a tech startup that needs a language model that can operate efficiently on various devices for their educational app without causing delays.

Solution with Llama.cpp

They implement Llama.cpp, taking advantage of its CPU-optimized performance and the ability to interface with their Go-based backend.

Benefits

Portability and Speed: Llama.cpp's lightweight design ensures fast responses and compatibility with many devices.
Customization: Tailored low-level features allow the app to provide real-time coding assistance effectively.

The integration of Llama.cpp allows the ETP4Africa app to offer immediate, interactive programming guidance, improving the user experience and engagement.

Data Engineering is a key component of any Data Science and AI project, and our tutorial Introduction to LangChain for Data Engineering & Data Applications provides a complete guide for including AI from large language models inside data pipelines and applications.

Conclusion

In summary, this article has provided a comprehensive overview of setting up and utilizing large language models with Llama.cpp.

Detailed instructions were provided for understanding the basics of Llama.cpp, setting up the working environment, installing the required library, and implementing a text generation (question-answering) use case.

Finally, practical insights were provided for a real-world application and how Llama.cpp can be used to efficiently tackle the underlying problem.

Ready to dive deeper into the world of large language models? Enhance your skills with the powerful deep learning frameworks LangChain and PyTorch used by AI professionals with our How to Build LLM Applications with LangChain tutorial and How to Train a LLM with PyTorch.

Author

Zoumana Keita

A multi-talented data scientist who enjoys sharing his knowledge and giving back to others, Zoumana is a YouTube content creator and a top tech writer on Medium. He finds joy in speaking, coding, and teaching. Zoumana holds two master’s degrees: the first in computer science with a focus on Machine Learning from Paris, France, and the second in Data Science from Texas Tech University in the US. His career path started as a Software Developer at Groupe OPEN in France, before moving on to IBM as a Machine Learning Consultant, where he developed end-to-end AI solutions for insurance companies. Zoumana then joined Axionable, the first Sustainable AI startup based in Paris and Montreal. There, he served as a Data Scientist and implemented AI products, mostly NLP use cases, for clients from France, Montreal, Singapore, and Switzerland. Additionally, 5% of his time was dedicated to Research and Development. As of now, he is working as a Senior Data Scientist at IFC, the World Bank Group.

