
Deploy and run LLM on Lattepanda 3 Delta 864 (LLaMA, LLaMA2, Phi-2, ChatGLM2)

Introduction

This article will guide you through deploying and running popular LLMs (Large Language Models) on the Lattepanda 3 Delta 864, including LLaMA, LLaMA2, Phi-2, and ChatGLM2. We will compare these LLMs in runtime speed, resource consumption, and model performance to help you select a model that meets your needs and to provide a reference for AI research on limited hardware. Additionally, we will walk through the key steps and considerations so you can experience and test the performance of LLMs on the Lattepanda 3 Delta 864.

 

How to Choose an LLM

An LLM project usually states minimum CPU/GPU requirements. Since GPU inference is not currently available on the Lattepanda 3 Delta 864, we need to prioritize models that support CPU inference. Because of the RAM limitation, we should also prefer smaller models: as a rule of thumb, a model needs roughly twice its file size in RAM to run smoothly. Quantized models have much lower memory demands, so we recommend using quantized models to experience the performance of LLMs on the Lattepanda 3 Delta 864.
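As a rough sizing sketch (assuming ChatGLM-6B's published 6.2B parameter count, 2 bytes per weight for fp16, and about 0.5625 bytes per weight for GGML q4_0, i.e. 18 bytes per block of 32 weights):

CODE
python3 -c "p=6.2e9; print(f'fp16: {p*2/2**30:.1f} GiB  q4_0: {p*0.5625/2**30:.1f} GiB')"

By the double-the-file-size rule of thumb, the ~11.5 GiB fp16 model would want over 20GB of RAM, while the ~3.3 GiB q4_0 model fits comfortably within the board's 8GB.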

 

The following list is a selection of smaller models from the open_llm_leaderboard on the Hugging Face website, along with the latest popular models.

 

 

P.S.

1. ARC (AI2 Reasoning Challenge)

2. HellaSwag (testing the model's common-sense reasoning abilities)

3. MMLU (Measuring Massive Multitask Language Understanding)

4. TruthfulQA (measuring how models mimic human falsehoods)

 

How to Run an LLM

We used llama.cpp and chatglm.cpp to run LLM inference on the CPU of the Lattepanda 3 Delta 864. Here, we will take ChatGLM-6B as an example and walk through how to deploy and run an LLM on the Lattepanda 3 Delta 864 (8GB RAM, 64GB eMMC) running Ubuntu 20.04.

 

Quantization

The following is the process of quantizing ChatGLM-6B to a 4-bit GGML model on a Linux PC:

The first stage of the process is to set up ChatGLM.cpp on a Linux PC, download the ChatGLM-6B-int4 model, convert it, and copy the result to a USB drive. We need the Linux PC's extra power for the conversion, as the 8GB of RAM on the Lattepanda 3 Delta 864 is insufficient.

Clone the ChatGLM.cpp repository into your local machine:

CODE
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatglm.cpp folder:

CODE
git submodule update --init --recursive

Install the necessary Python packages:

CODE
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece

Compile the project using CMake:

CODE
# install the build tool, then configure and build in Release mode
sudo apt-get install cmake
cmake -B build
cmake --build build -j --config Release

Then pin transformers to version 4.33.0, which the conversion script below works with:

CODE
pip uninstall transformers
pip install transformers==4.33.0
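If the build succeeds, the main binary used in the following steps should now exist:

CODE
ls -lh build/bin/main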

Download the model and the accompanying files from https://huggingface.co/THUDM/chatglm-6b-int4 into chatglm.cpp/THUDM/chatglm-6b.
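One way to fetch the files is a Git LFS clone into the expected folder (a sketch assuming git-lfs is installed; you can also download the files manually from the model page):

CODE
# requires git-lfs (sudo apt-get install git-lfs)
git lfs install
git clone https://huggingface.co/THUDM/chatglm-6b-int4 THUDM/chatglm-6b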

Use convert.py to transform ChatGLM-6B into quantized GGML format. For example, to convert the fp16 original model to a q4_0 (int4-quantized) GGML model, run:

CODE
python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o chatglm-ggml.bin
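Then confirm the quantized file was written and copy it to the USB drive. The mount point below is a placeholder; substitute your drive's actual path:

CODE
ls -lh chatglm-ggml.bin
# /media/$USER/USBDRIVE is an example mount point
cp chatglm-ggml.bin /media/$USER/USBDRIVE/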

Model Deployment

Here is the process of deploying and running the q4_0 ChatGLM-6B model on the Lattepanda 3 Delta 864 running Ubuntu 20.04. The setup mirrors the steps performed on the Linux PC:

CODE
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
git submodule update --init --recursive

CODE
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece

CODE
sudo apt-get install cmake
cmake -B build
cmake --build build -j --config Release

CODE
pip uninstall transformers
pip install transformers==4.33.0
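Next, copy the chatglm-ggml.bin file converted on the Linux PC from the USB drive into the chatglm.cpp folder. The mount point is again a placeholder; adjust it to your drive's actual path:

CODE
cp /media/$USER/USBDRIVE/chatglm-ggml.bin ~/chatglm.cpp/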

To run the model in interactive mode, add the -i flag. For example:

CODE
cd chatglm.cpp
./build/bin/main -m chatglm-ggml.bin -i

In interactive mode, your chat history will serve as the context for the next round of conversation.

Run ./build/bin/main -h to explore more options!
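For a quick non-interactive check, you can also pass a single prompt with the -p flag, as shown in the chatglm.cpp README:

CODE
./build/bin/main -m chatglm-ggml.bin -p "Hello"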

 

Summary

Test for Lattepanda 3 Delta 864 (8GB) & LLM

Test for Raspberry Pi 5 (8GB) & LLM

 

 

Reference:

ChatGLM.cpp: https://github.com/li-plus/chatglm.cpp

Llama.cpp: https://github.com/ggerganov/llama.cpp

 

Related Article:

Deploy and run LLM on LattePanda Sigma (LLaMA, Alpaca, LLaMA2, ChatGLM)

Deploy and run LLM on Raspberry Pi 5 vs Raspberry Pi 4B (LLaMA, LLaMA2, Phi-2, Mixtral-MOE, mamba-gp

Deploy and run LLM on Raspberry Pi 4B (LLaMA, Alpaca, LLaMA2, ChatGLM)