
Run SLMs (phi3, gemma2, mathstral, llama3.1) on a Compute Module (LattePanda Mu)

Introduction

In today's era of intelligent computing, Single Board Computers (SBCs) have gained popularity among developers for their compact design and capable computing performance. At the same time, Small Language Models (SLMs) are useful across many application scenarios thanks to their efficiency and convenience. This article compares the performance of several SLMs on the LattePanda Mu x86 compute module running Ubuntu 22.04. We compare models such as mathstral, phi3, llama3.1, gemma2 2b, qwen, Deepseek Coder V2, and llama2 in terms of execution speed, model size, open-source license, and runtime framework. Our goal is to provide developers with useful data and insights.

 

mathstral-7B-v0.1-q4

Model size: 4.1GB

Speed: 2.35 tokens/s

Open-source license: Apache 2.0

Runtime framework: ollama

 

Mathstral is built on Mistral 7B and supports a 32k context window. It is a model specialized for mathematical reasoning.

CODE
curl -fsSL https://ollama.com/install.sh | sh
sudo ollama run mathstral
Token speed of mathstral-7B-v0.1-q4 running on LattePanda Mu
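The article does not say how the tokens/s figures were captured. One plausible way is ollama's `--verbose` flag, which prints timing statistics after each reply. The helper below is a minimal sketch, assuming those statistics include a line of the form `eval rate: 5.62 tokens/s`; the function name `eval_rate` is made up for illustration.

```shell
# Print the generation speed found in timing statistics read from stdin.
# Assumption: the line of interest starts with "eval rate:", e.g.
#   eval rate:            5.62 tokens/s
# The "prompt eval rate:" line starts with "prompt" and is skipped on purpose.
eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}
```

A possible invocation is `sudo ollama run mathstral --verbose "What is 12 * 7?" 2>&1 | eval_rate`. The `2>&1` is included on the assumption that the statistics may be written to stderr rather than stdout.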

 

phi3 3.8b-q4

Model size: 2.2GB

Speed: 5.62 tokens/s

Open-source license: MIT

Runtime framework: ollama

 

Install ollama and run the command:

CODE
sudo ollama run phi3
Token speed of phi3 3.8b-q4 running on LattePanda Mu
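Ollama's HTTP API exposes the same measurement as raw counters: a response from `/api/generate` includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below shows the conversion to tokens/s. The helper name `tokens_per_second` and the sample numbers are illustrative, not values recorded in this test.

```shell
# Convert ollama's raw counters into tokens per second.
# $1 = eval_count (number of generated tokens)
# $2 = eval_duration (nanoseconds spent generating)
tokens_per_second() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / (d / 1e9) }'
}

# Hypothetical sample: 281 tokens generated in 50 seconds.
tokens_per_second 281 50000000000   # prints 5.62
```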


 

Llama 3.1-8b-q4

Model size: 4.7GB

Speed: 3.18 tokens/s

Open-source license: Llama 3.1 Community License

Runtime framework: ollama

 

Install ollama and run the command:

CODE
sudo ollama run llama3.1
Token speed of Llama 3.1-8b-q4 running on LattePanda Mu


 

gemma2-2b-q4

Model size: 1.6 GB

Speed: 1.51 tokens/s

Open-source license: Gemma Terms of Use

Runtime framework: ollama

 

Install ollama and run the command:

CODE
sudo ollama run gemma2:2b
Token speed of gemma2-2b-q4 running on LattePanda Mu


 

 

qwen-0.5b

Model size: 395MB

Speed: 16.1 tokens/s

Open-source license: Apache 2.0

Runtime framework: ollama

 

Install ollama and run the command:

CODE
sudo ollama run qwen:0.5b
Token speed of qwen-0.5b running on LattePanda Mu


 

Summary

Differences in SLMs

- Mathstral-7B-v0.1-q4: Focuses on mathematical reasoning, built on Mistral 7B, suitable for scenarios requiring complex mathematical calculations and reasoning.

- Phi3 3.8b-q4: Versatile with a wide range of applications, highly flexible, and suitable for general natural language processing tasks.

- Llama 3.1-8b-q4: A powerful general-purpose language model, well-suited for various NLP tasks, including text generation, translation, and dialogue systems.

- Gemma2-2b-q4: A smaller model designed for resource-constrained environments while delivering decent performance.

- Qwen-0.5b: Supports Chinese, small in size, and fast, making it ideal for real-time applications that require high responsiveness.

- Deepseek V2-7b-q4: Specializes in code-related tasks, offering efficient code generation and understanding for development and programming applications.

 

Performance Summary of SLMs on Lattepanda Mu


 

Model Size Comparison:

Smallest Model: qwen-0.5b-q4, weighing only 395MB, is significantly smaller than other models and suitable for environments with limited resources.

Medium Models: gemma2-2b-q4 and phi3 3.8b-q4, at 1.6GB and 2.2GB respectively, balance performance and resource demands.
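These sizes are roughly what 4-bit quantization predicts. As a back-of-the-envelope sketch, the file size can be estimated from the parameter count, assuming about 4.5 bits per weight (the q4_0 layout: 4-bit values plus one 16-bit scale per block of 32 weights) and ignoring any tensors kept at higher precision. The function name is hypothetical, and whether every model here uses exactly this quantization is an assumption.

```shell
# Estimate the size in GB of a 4-bit quantized model.
# $1 = parameter count in billions
# Assumption: ~4.5 bits per weight on average, 8 bits per byte.
estimate_q4_gb() {
  awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 4.5 / 8 }'
}

estimate_q4_gb 8   # prints 4.5
```

The estimate for an 8B model, 4.5GB, is close to the 4.7GB listed for Llama 3.1-8b-q4. The remainder is plausibly metadata and layers stored at higher precision.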

 

Processing Speed Comparison:

Fastest Model: qwen-0.5b-q4, achieving a processing speed of 16.1 tokens/s.

Faster Model: phi3 3.8b-q4 takes second place with a speed of 5.62 tokens/s, demonstrating a good performance-to-model-size ratio.

Medium-Speed Models: Llama 3.1-8b-q4 and mathstral-7B-v0.1-q4, with processing speeds of 3.18 and 2.35 tokens/s respectively.

Slowest Model: gemma2-2b-q4, at 1.51 tokens/s, despite being the second-smallest model tested.
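To put these rates in practical terms, the arithmetic below converts a tokens/s figure into the wait for a reply of a given length. It is a sketch only: the 200-token reply length is an arbitrary assumption, the helper name is made up, and prompt processing and model load time are ignored.

```shell
# Seconds needed to generate a reply.
# $1 = reply length in tokens, $2 = speed in tokens/s
seconds_for_tokens() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

# Speeds are the figures reported in this article.
for entry in "qwen-0.5b 16.1" "phi3-3.8b 5.62" "llama3.1-8b 3.18" \
             "mathstral-7b 2.35" "gemma2-2b 1.51"; do
  set -- $entry   # split "name speed" into $1 and $2
  printf '%-14s %6s s\n' "$1" "$(seconds_for_tokens 200 "$2")"
done
```

At these rates a 200-token answer takes roughly 12 seconds on qwen-0.5b and more than two minutes on gemma2-2b, which is the gap that matters for interactive use.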

FAQs

  • Why should I consider using Lattepanda Mu for running SLMs?
    LattePanda Mu is a compact x86 compute module that runs Ubuntu 22.04 and ollama, which makes it a portable option for developers. In these tests it generated between 1.51 and 16.1 tokens/s depending on the model. It is not suited to tasks that require GPU acceleration or very high throughput.
  • Is Lattepanda Mu suitable for running large language models like Llama 3.1?
    Yes, LattePanda Mu can run Llama 3.1 8b, but performance is constrained by model size. The largest model tested here, at 4.7GB, ran at 3.18 tokens/s. Larger models can be expected to generate tokens more slowly, which limits real-time applications.
  • How do I set up Lattepanda Mu to run SLMs using ollama?
    Install ollama with its install script (curl -fsSL https://ollama.com/install.sh | sh), then start a model with ollama run, for example sudo ollama run phi3. The tests here used Ubuntu 22.04. Check that the model fits in available memory first, since resource exhaustion can make the system unstable.
  • How does the processing speed of qwen-0.5b compare to other models on Lattepanda Mu?
    Qwen-0.5b is the fastest model tested on LattePanda Mu, at 16.1 tokens/s, thanks to its small size. The next fastest, phi3, reached 5.62 tokens/s. The trade-off is capability: a 0.5b model may not handle complex tasks as well as larger models such as phi3 or llama3.1.
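As a guard against the resource exhaustion mentioned in the setup answer, a pre-flight check along these lines could compare a model's size with free memory before loading it. This is a sketch under stated assumptions: the helper names are hypothetical, the 1024MB safety margin is an arbitrary choice, and actual runtime memory use also depends on context length.

```shell
# Free memory in MB, read from the Linux MemAvailable counter (reported in kB).
available_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

# Succeed when a model fits into the given amount of RAM.
# $1 = model size in MB, $2 = available RAM in MB,
# $3 = safety margin in MB (default 1024, an arbitrary assumption)
fits_in_ram() {
  [ $(( $1 + ${3:-1024} )) -le "$2" ]
}
```

For example, `fits_in_ram 4700 "$(available_mb)" && sudo ollama run llama3.1` would start the 4.7GB model only when enough memory is free.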