Deploy and Run LLMs on LattePanda Sigma (LLaMA, Alpaca, LLaMA2, ChatGLM)
In the dynamic field of AI, combining language models with hardware accelerators has become a notable pursuit. The LattePanda Sigma is an SBC (single-board computer) based on the Intel Core i5-1340P processor. In this article, we show how to deploy and run popular LLMs (LLaMA, Alpaca, LLaMA2, ChatGLM) on the Sigma (32GB), and how to tune the setup when building your own AI chatbot server on this device. These models represent significant research achievements in NLP (natural language processing) and can be used for tasks such as dialogue generation and text summarization. Taking Llama2 as an example, we cover the CPU requirements for running these models, the steps to deploy an LLM on the Sigma, and several ways to accelerate inference, such as modifying the run command or using OpenVINO. Finally, we’ll offer a token speed comparison table for running LLaMA, Alpaca, LLaMA2, and ChatGLM on the LattePanda Sigma (32GB).
How to Choose an LLM
1. How to choose an LLM?
LattePanda Sigma is a compact single-board computer with flagship-level performance, powered by the Intel Core i5-1340P processor. With 12 cores and 16 threads, it reaches turbo frequencies of up to 4.60 GHz (performance cores) and 3.40 GHz (efficiency cores), which makes it well suited to compute-heavy workloads. We chose the LattePanda Sigma (32GB) and several popular models to test LLM performance on the Sigma.
When comparing candidate models, four widely used benchmarks are:
1. ARC (AI2 Reasoning Challenge)
2. HellaSwag (tests a model's commonsense reasoning abilities)
3. MMLU (Measuring Massive Multitask Language Understanding)
4. TruthfulQA (Measuring How Models Mimic Human Falsehoods)
2. How to select different versions of the same LLM?
For example, LLaMA2-7B-chat-hf is meant to be used with HuggingFace's transformers library for inference, and the downloaded .bin files hold the weights in HuggingFace format. LLaMA2-7B-chat uses the PyTorch library directly for inference, and the downloaded .pth file holds the weights in PyTorch format.
Because the HuggingFace library is designed specifically for natural language processing tasks and includes an optimized inference pipeline, LLaMA2-7B-chat-hf may, in theory, offer faster inference than LLaMA2-7B-chat.
How to run an LLM
Here is the process for deploying and running LLaMA2 on the LattePanda Sigma CPU (32GB, Ubuntu 20.04):
On the Sigma, open a terminal and use git to clone the repository:
git clone https://github.com/ggerganov/llama.cpp
Run make inside the cloned directory to compile the C++ code. gcc and g++ (apt install gcc / apt install g++) and Python must be installed beforehand:
cd llama.cpp
make
Create a models/ folder in your llama.cpp directory that directly contains the 7B folder and the sibling files and folders from the LLaMA model you downloaded.
The llama.cpp/models/7B folder should contain the downloaded weight files, including consolidated.00.pth.
Install a series of Python modules. These modules will work with the model to create a chatbot.
pip install torch numpy sentencepiece
Before running the conversion scripts, models/7B/consolidated.00.pth should be a 13.5 GB file.
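A quick way to confirm the download before converting is to check the file size from Python. This is a minimal sketch and not part of llama.cpp: the helper names `weight_size_gb` and `looks_complete` and the 0.5 GB tolerance are assumptions chosen for illustration.

```python
from pathlib import Path


def weight_size_gb(path):
    """Return the size of a file in GB (10^9 bytes)."""
    return Path(path).stat().st_size / 1e9


def looks_complete(path, expected_gb=13.5, tolerance_gb=0.5):
    """Return True if the file exists and its size is close to the expected size."""
    p = Path(path)
    if not p.is_file():
        return False
    return abs(weight_size_gb(p) - expected_gb) <= tolerance_gb


if __name__ == "__main__":
    # Path taken from the article; run this from the llama.cpp directory.
    target = "models/7B/consolidated.00.pth"
    print(target, "looks complete:", looks_complete(target))
```

A file that is much smaller than expected usually means the download was interrupted, so it is worth checking before starting the conversion.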
The script convert.py converts the model to FP16 and writes it to models/7B/ggml-model-f16.gguf:
python3 convert.py models/7B/ --ctx 4096
Then quantize the FP16 model to 4 bits (q4_0):
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
The llama.cpp/models/7B folder should now also contain ggml-model-f16.gguf and ggml-model-q4_0.gguf. The quantized q4_0 model is approximately 3.8 GB.
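The 13.5 GB and 3.8 GB figures can be sanity-checked with a little arithmetic. The sketch below assumes a parameter count of about 6.74 billion for the 7B model and assumes the q4_0 layout stores 32 weights in 18 bytes (16 bytes of 4-bit values plus a 2-byte scale). Both numbers are assumptions of this example, not values reported by the tools above.

```python
# Assumed parameter count for the 7B model (about 6.74 billion).
N_PARAMS_7B = 6_738_415_616

# FP16 stores each weight in 2 bytes.
FP16_BYTES_PER_WEIGHT = 2.0

# Assumed q4_0 layout: 32 weights per block, 16 bytes of 4-bit values
# plus a 2-byte FP16 scale, i.e. 18 bytes per 32 weights (4.5 bits each).
Q4_0_BYTES_PER_WEIGHT = 18 / 32


def size_gb(n_params, bytes_per_weight):
    """Estimated file size in GB (10^9 bytes), ignoring file metadata."""
    return n_params * bytes_per_weight / 1e9


if __name__ == "__main__":
    print(f"FP16: {size_gb(N_PARAMS_7B, FP16_BYTES_PER_WEIGHT):.2f} GB")
    print(f"q4_0: {size_gb(N_PARAMS_7B, Q4_0_BYTES_PER_WEIGHT):.2f} GB")
```

Both estimates land within about 0.1 GB of the sizes quoted in this article. The real quantized file can be slightly larger, since not every tensor is necessarily stored at 4 bits.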
Modify the model path in chat.sh so that it points to the quantized model file.
How to read the Llama2 timing output:
· load time: time spent loading the model file.
· sample time: time spent choosing the next likely token while generating from the prompt/file.
· prompt eval time: how long LLaMA took to process the prompt/file before generating new text.
· eval time: how long it took to generate the output (until [end of text] or the user-set limit).
· total time: all of the above together.
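The token speed used for comparisons follows directly from the eval time and the number of generated tokens. Here is a minimal sketch of that conversion. The function names are our own, and the numbers in the example are made up for illustration, not measurements from the Sigma.

```python
def tokens_per_second(elapsed_ms, n_tokens):
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_tokens / (elapsed_ms / 1000.0)


def ms_per_token(elapsed_ms, n_tokens):
    """Average time spent on each token, in milliseconds."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return elapsed_ms / n_tokens


if __name__ == "__main__":
    # Hypothetical values: 200 tokens generated in 20 000 ms of eval time.
    eval_ms, generated = 20_000.0, 200
    print(f"{tokens_per_second(eval_ms, generated):.1f} tokens/s")
    print(f"{ms_per_token(eval_ms, generated):.1f} ms per token")
```

Applying the same conversion to the prompt eval time gives the prompt-processing speed, which is usually reported separately from the generation speed.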
How to accelerate an LLM
Modify the run command
./main -m models/llama-13b-v2/ggml-model-q4_0.gguf -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
· -c 512: Set the context size to 512 tokens.
· -b 1024: Set the batch size for prompt processing to 1024.
· -n 256: Set the number of tokens to generate to 256.
· --keep 48: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.
· --repeat_penalty 1.0: Control the penalty applied to repeated token sequences in the generated text (1.0 disables the penalty).
· --color: Enable colored output.
· -i: Enable interactive mode.
· -r "User:": Set the reverse prompt. Generation pauses and control returns to the user whenever the model outputs "User:".
· -f prompts/chat-with-bob.txt: Specify the path to the prompt file.
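When experimenting with these options, it can help to build the command from named parameters instead of editing a long line by hand. The sketch below only assembles the argument list shown above; the function name `build_main_command` and its keyword names are our own, and it assumes the `./main` binary has been built in the current directory.

```python
import shlex


def build_main_command(model, prompt_file, ctx=512, batch=1024, n_predict=256,
                       keep=48, repeat_penalty=1.0, reverse_prompt="User:",
                       binary="./main"):
    """Return the command from this article as a list of arguments."""
    return [
        binary,
        "-m", model,                            # path to the quantized model
        "-c", str(ctx),                         # context size
        "-b", str(batch),                       # batch size
        "-n", str(n_predict),                   # number of tokens to generate
        "--keep", str(keep),                    # prompt tokens kept on context reset
        "--repeat_penalty", str(repeat_penalty),
        "--color",
        "-i",                                   # interactive mode
        "-r", reverse_prompt,                   # string passed to -r
        "-f", prompt_file,                      # prompt file
    ]


if __name__ == "__main__":
    cmd = build_main_command("models/llama-13b-v2/ggml-model-q4_0.gguf",
                             "prompts/chat-with-bob.txt")
    # Print a copy-pasteable shell line; the list itself can be handed to
    # subprocess.run(cmd) to start the chat.
    print(shlex.join(cmd))
```

Keeping each option as a keyword argument makes it easy to try, for example, a smaller `ctx` or `batch` without retyping the rest of the command.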
To make Llama2 run faster on your device, you can try the following changes, which also reduce memory requirements:
· Decrease the batch size (-b).
· Reduce the context size (-c), so that less conversation history is kept.
· Decrease the length of the generated text (-n).
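To see why the context size matters for memory, consider the key/value cache that a transformer keeps for every token in the context. The estimate below is a rough sketch under stated assumptions: a 7B model with 32 layers and an embedding width of 4096, one key and one value vector per layer per token, and 2 bytes (FP16) per cached value. It ignores the model weights and any scratch buffers.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_embd=4096, bytes_per_value=2):
    """Approximate key/value cache size for a given context length.

    Each token stores one key vector and one value vector (hence the
    factor 2) of width n_embd in every layer.
    """
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value


if __name__ == "__main__":
    for n_ctx in (512, 2048, 4096):
        mib = kv_cache_bytes(n_ctx) / 2**20
        print(f"-c {n_ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions the cache grows linearly with the context size, so going from 4096 to 512 tokens cuts it by a factor of eight. For a 13B model, the same function can be called with n_layers=40 and n_embd=5120.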
For details on each parameter, please refer to: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md
Accelerate LLMs using OpenVINO
You can also consider using OpenVINO to accelerate LLM inference. OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit developed by Intel. It is designed to optimize and deploy deep learning models across a variety of Intel hardware platforms, including CPUs, GPUs, FPGAs, and VPUs (Vision Processing Units). LattePanda Sigma is compatible with OpenVINO.
Please refer to these projects for more information:
1. Running Llama2 on CPU with OpenVINO
2. Create LLM-powered Chatbot using OpenVINO: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/254-llm-chatbot
OpenVINO supports GPU acceleration in many projects, providing faster execution speeds. If you wish to utilize the integrated graphics (iGPU) acceleration on the Sigma, please refer to the following link: https://github.com/openvinotoolkit/openvino_notebooks/discussions/540
Test for LattePanda Sigma (32GB) CPU & LLM
This article explored the possibilities and challenges of running LLMs on the LattePanda Sigma. Through practical deployment and testing, we have demonstrated the feasibility of this approach, which opens up new possibilities for implementing an AI chatbot server on edge devices.