Deploy and run LLM on LattePanda Sigma (LLaMA, Alpaca, LLaMA2, ChatGLM)

by L.P

Introduction

In the fast-moving field of AI, pairing language models with capable hardware has become a notable pursuit. The LattePanda Sigma is an SBC (single-board computer) based on the Intel Core i5-1340P processor. In this article, we look at how to deploy and run popular LLMs (LLaMA, Alpaca, LLaMA2, ChatGLM) on the Sigma (32GB), and how to tune the setup when building your own AI chatbot server on this device. These models represent significant research achievements in NLP (natural language processing) and can be used for tasks such as dialogue generation and text summarization. Taking LLaMA2 as an example, we describe the CPU requirements for running these models, explain the steps to deploy an LLM on the Sigma, and give several suggestions for accelerating it, such as modifying the run command or using OpenVINO. Finally, we offer a token speed comparison table for running LLaMA, Alpaca, LLaMA2, and ChatGLM on the LattePanda Sigma (32GB).

How to Choose LLM

1. How to choose LLM?

LattePanda Sigma is a compact, flagship-performance single-board computer powered by the Intel Core i5-1340P processor. With 12 cores, 16 threads, and turbo frequencies of up to 4.60 GHz (performance cores) and 3.40 GHz (efficiency cores), it is well suited to compute-intensive workloads. We chose the LattePanda Sigma (32GB) and several popular models to test how LLMs perform on the Sigma.

P.S.

1. ARC (AI2 Reasoning Challenge)

2. HellaSwag (testing the model's common sense reasoning abilities)

3. MMLU (Measuring Massive Multitask Language Understanding)

4. TruthfulQA (Measuring How Models Mimic Human Falsehoods)

2. How to select different versions of the same LLM?

For example, LLaMA2-7B-chat-hf is the version intended for HuggingFace's Transformers library, which is used for inference; the downloaded .bin files hold the weights in HuggingFace format. LLaMA2-7B-chat uses PyTorch directly for inference, and the downloaded .pth file holds the weights in PyTorch format.

Transformers itself runs on top of PyTorch, so the two versions hold the same weights in different layouts. In theory, LLaMA2-7B-chat-hf may offer faster inference than LLaMA2-7B-chat where the library's optimized inference paths apply, but any difference should be confirmed by measurement.

How to run LLM

Here is the process for deploying and running LLaMA2 on the LattePanda Sigma CPU (32GB, Ubuntu 20.04):

On the Sigma, open a terminal and use git to clone the repository:

CODE

git clone https://github.com/ggerganov/llama.cpp

Run make inside the llama.cpp directory to compile the C++ code. gcc, g++, and Python must be installed in advance (apt install gcc g++):

CODE

make
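If make fails straight away, the usual cause is a missing prerequisite. The sketch below is one way to check for them before building. The tool list is taken from the commands used in this article, and the helper name `missing_tools` is purely illustrative.

```python
import shutil

# Tools invoked by the steps in this article (git, make, python3) plus the
# compilers named as prerequisites (gcc, g++).
REQUIRED_TOOLS = ["git", "make", "gcc", "g++", "python3"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All build prerequisites found.")
```

Anything reported as missing can be installed with apt before running make again.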

Create a models/ folder in your llama.cpp directory and place in it the 7B folder, together with its sibling files and folders, from the LLaMA model you downloaded.

Download URL: https://huggingface.co/meta-llama/Llama-2-7b-chat/tree/main

The llama.cpp/models/7B folder should now contain the downloaded model files, including consolidated.00.pth.

Install a series of Python modules. These modules will work with the model to create a chatbot.

CODE

pip install torch numpy sentencepiece

Before running the conversion scripts, models/7B/consolidated.00.pth should be a 13.5 GB file.
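A truncated download is easy to miss, so it can be worth checking the file size programmatically before converting. This is a minimal sketch that assumes the path given above. The helper names and the 0.5 GB tolerance are illustrative choices and are not part of llama.cpp.

```python
import os

def size_gb(path):
    """Return the size of a file in decimal gigabytes (1 GB = 10**9 bytes)."""
    return os.path.getsize(path) / 1e9

def looks_complete(path, expected_gb=13.5, tolerance_gb=0.5):
    """True if the file exists and its size is within tolerance of expected_gb."""
    if not os.path.isfile(path):
        return False
    return abs(size_gb(path) - expected_gb) <= tolerance_gb

if __name__ == "__main__":
    weights = "models/7B/consolidated.00.pth"
    status = "OK" if looks_complete(weights) else "missing or unexpected size"
    print(weights, status)
```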

The convert.py script converts the model to "ggml FP16 format" and writes the result to models/7B/ggml-model-f16.gguf:

CODE

python3 convert.py models/7B/ --ctx 4096

Then quantize the FP16 model to 4 bits (q4_0):

CODE

./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

After quantization, the llama.cpp/models/7B folder should also contain ggml-model-q4_0.gguf. The quantized q4 model is approximately 3.8GB.
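Both file sizes quoted in this section can be sanity-checked with simple arithmetic. The sketch below rests on two assumptions that are not stated in the article: that LLaMA2-7B has roughly 6.74 billion parameters, and that q4_0 stores each block of 32 weights in 18 bytes (16 bytes of 4-bit values plus a 2-byte FP16 scale), which works out to 4.5 bits per weight. Real files differ slightly because of metadata and tensors kept at higher precision.

```python
N_PARAMS = 6.74e9  # assumed approximate parameter count of LLaMA2-7B

def model_size_gb(n_params, bits_per_weight):
    """Approximate size in decimal GB of n_params weights at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

FP16_BITS = 16
# q4_0 (assumed layout): 32 weights per block, 16 bytes of quants + 2-byte scale
Q4_0_BITS = (16 + 2) * 8 / 32  # 4.5 bits per weight

print(f"FP16: {model_size_gb(N_PARAMS, FP16_BITS):.1f} GB")  # FP16: 13.5 GB
print(f"q4_0: {model_size_gb(N_PARAMS, Q4_0_BITS):.1f} GB")  # q4_0: 3.8 GB
```

The same function gives a quick estimate of whether a larger model would fit in the Sigma's 32GB of RAM at a given quantization level.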

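Modify the model path in examples/chat.sh so that it points to the quantized model (./models/7B/ggml-model-q4_0.gguf).

Run the script: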

CODE

./examples/chat.sh

Running results:

How to read the LLaMA2 timing output:

· load time: time spent loading the model file.

· sample time: time spent choosing the next token from the model's predicted probabilities.

· prompt eval time: how long LLaMA took to process the prompt/file before generating new text.

· eval time: how long it took to generate the output (until [end of text] or the user-set limit).

· total: all of the above together.
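To compare runs, it is convenient to turn these timings into a tokens-per-second figure. The sketch below parses one timing line and divides the token count by the elapsed time. The sample line is made up for illustration and is not a measurement from the Sigma. Its layout is an assumption modelled on what llama.cpp printed at the time of writing, and it may differ between versions.

```python
import re

# Matches e.g. "eval time = 1000.00 ms / 20 runs" or "... / 19 tokens".
TIMING_RE = re.compile(
    r"(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+)\s*ms"
    r"\s*/\s*(?P<count>\d+)\s+(?:runs|tokens)"
)

def tokens_per_second(line):
    """Return (stage name, tokens per second), or None if the line has no count."""
    match = TIMING_RE.search(line)
    if match is None:
        return None
    elapsed_s = float(match.group("ms")) / 1000.0
    return match.group("name").strip(), int(match.group("count")) / elapsed_s

# Hypothetical sample line (illustrative numbers only).
sample = "llama_print_timings:        eval time =  1000.00 ms /    20 runs"
print(tokens_per_second(sample))  # ('eval', 20.0)
```

Lines without a token count, such as load time, yield None. The eval time figure is the one that corresponds to generation speed.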

How to accelerate LLM

Modify running command

Default command:

CODE

./main -m models/llama-13b-v2/ggml-model-q4_0.gguf -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

· -c 512: Set the context size to 512 tokens.

· -b 1024: Set the batch size for prompt processing to 1024.

· -n 256: Set the number of tokens to generate to 256.

· --keep 48: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.

· --repeat_penalty 1.0: Control the penalty applied to repeated token sequences in the generated text (1.0 disables the penalty).

· --color: Enable colored output.

· -i: Enable interactive mode.

· -r "User:": Set the reverse prompt. Generation pauses and control returns to the user whenever the model outputs "User:".

· -f prompts/chat-with-bob.txt: Specify the path to the input file.
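When experimenting with these flags, it can help to assemble the command from named parameters instead of editing a long string by hand. This is a minimal sketch. The function name `build_main_command` is hypothetical, and its defaults simply mirror the command shown above.

```python
import shlex

def build_main_command(model, ctx=512, batch=1024, n_predict=256, keep=48,
                       repeat_penalty=1.0, reverse_prompt="User:",
                       prompt_file="prompts/chat-with-bob.txt"):
    """Return the llama.cpp ./main invocation as an argument list."""
    return [
        "./main", "-m", model,
        "-c", str(ctx),            # context size
        "-b", str(batch),          # batch size for prompt processing
        "-n", str(n_predict),      # number of tokens to generate
        "--keep", str(keep),
        "--repeat_penalty", str(repeat_penalty),
        "--color", "-i",
        "-r", reverse_prompt,
        "-f", prompt_file,
    ]

cmd = build_main_command("models/llama-13b-v2/ggml-model-q4_0.gguf")
print(shlex.join(cmd))  # printable form; the list itself suits subprocess.run
```

Changing a single keyword argument, for example `ctx=256`, then gives a variant of the command that is easy to compare against the default.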

To make LLaMA2 run faster on your device, you can try the following changes, which also reduce memory requirements:

· Decrease the batch size.

· Reduce the context size (the amount of conversation history the model keeps).

· Decrease the length of generated text.
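To see why a smaller context saves memory, consider the key/value cache, which grows linearly with the context size. The estimate below assumes the LLaMA2-7B architecture (32 layers, hidden size 4096) and an FP16 cache at 2 bytes per value. None of these figures come from this article, so treat the output as a rough guide.

```python
def kv_cache_mib(n_ctx, n_layers=32, n_embd=4096, bytes_per_value=2):
    """Approximate KV cache size in MiB for a given context length.

    Each layer stores one key and one value vector of n_embd values
    for every token position in the context.
    """
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value / 2**20

for n_ctx in (4096, 2048, 512):
    print(f"-c {n_ctx}: {kv_cache_mib(n_ctx):.0f} MiB")
# -c 4096: 2048 MiB
# -c 2048: 1024 MiB
# -c 512: 256 MiB
```

Under these assumptions, going from a 4096-token context to the 512 used in the default command cuts the cache from about 2 GiB to 256 MiB.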

For specific function parameters, please refer to: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md

Accelerate LLM using OpenVINO

You can also consider using OpenVINO to accelerate LLM. OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit developed by Intel. It is designed to optimize and deploy deep learning models across a variety of Intel hardware platforms, including CPUs, GPUs, FPGAs, and VPUs (Vision Processing Units). LattePanda Sigma is compatible with OpenVINO.

Please refer to these projects for more information:

1. Running Llama2 on CPU with OpenVINO: https://raymondlo84.medium.com/running-llama2-on-cpu-with-openvino-125fbf10daa1

2. Create LLM-powered Chatbot using OpenVINO: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/254-llm-chatbot

OpenVINO supports GPU acceleration in many projects, providing faster execution speeds. If you wish to utilize the integrated graphics (iGPU) acceleration on the Sigma, please refer to the following link: https://github.com/openvinotoolkit/openvino_notebooks/discussions/540
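Before following either guide, it may be useful to check which devices OpenVINO can see on the board. The sketch below assumes the `openvino` Python package has been installed (for example with pip); it returns an empty list when the package is absent. The function name is illustrative.

```python
def list_openvino_devices():
    """Return the device names OpenVINO reports, or [] if it is not installed."""
    try:
        from openvino import Core  # import path in recent releases
    except ImportError:
        try:
            from openvino.runtime import Core  # import path in older releases
        except ImportError:
            return []
    return list(Core().available_devices)

print(list_openvino_devices())
```

If the integrated graphics driver is set up correctly, a GPU entry would be expected alongside CPU in the list. If only CPU appears, the driver installation is the first thing to check.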

Summary

Test for LattePanda Sigma (32GB) CPU & LLM

This article explored the possibilities and challenges of running LLMs on the LattePanda Sigma. Practical deployment and testing showed that the approach is feasible, which opens up new possibilities for implementing an AI chatbot server on edge devices.
