
Comparing OpenVINO Benchmarks Across Compute Engines at Different Batch Sizes

Posted by LattePanda on 2020-10-21

Hello, fellow panda lovers!

Here is yet another wonderful post from the phenomenal AI Developer Contest that DFRobot held back in August. This post has been translated from the original Chinese into English for your convenience; the original post by community user Xiaoshou Zhilian Laoxu can be found here. When reposting this article, please give credit where credit is due, and please enjoy!
 

I recently participated in the "Intel® OpenVINO™ Pilot Alliance DFRobot Industry AI Developer Competition". While benchmarking my model with benchmark_app.py, I found that different batch sizes produced a big performance gap between the CPU, GPU, and MYRIAD devices. The results are organized below for reference.

Hardware Platform:

The organizer provided us with a LattePanda Delta (4 GB of onboard memory) and an Intel Neural Compute Stick 2. All the data listed in this article were obtained on this platform.

The CPU in this project is the LattePanda Delta's Intel Celeron N4100, a 4-core/4-thread processor from Intel's N-series, with a burst frequency of up to 2.40 GHz and 4 MB of cache.

The GPU in this project is the integrated Intel UHD Graphics 600, with a base frequency of 200 MHz and a maximum dynamic frequency of 700 MHz.

The MYRIAD device in this project is the Intel Neural Compute Stick 2 (NCS2), built around the Intel® Movidius™ Myriad™ X VPU and plugged into the LattePanda Delta via its USB 3.1 Type-A port.

Software:

Windows 10, OpenVINO 2020.4, Python 3.6.5

Model:

The model used: Xubett964.fp16.xml

The input: 28x28 pixel single-channel images (the original post includes a diagram of the network structure).
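
For readers who want to reproduce a single inference outside of benchmark_app.py, here is a minimal sketch using the OpenVINO 2020.4 Python API. The .xml/.png file names follow the post, the .bin weights file name and the grayscale-resize preprocessing are my own assumptions, and 'imageinput' is the input name shown in the logs below.

import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="Xubett964.fp16.xml", weights="Xubett964.fp16.bin")

input_name = next(iter(net.input_info))                    # 'imageinput' in the logs below
n, c, h, w = net.input_info[input_name].input_data.shape   # (1, 1, 28, 28)

# Load the test image as a single-channel 28x28 array in NCHW layout
img = cv2.imread("testImg2.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, (w, h)).astype(np.float32).reshape(n, c, h, w)

exec_net = ie.load_network(network=net, device_name="CPU")  # or "GPU" / "MYRIAD"
result = exec_net.infer({input_name: img})
print({name: blob.shape for name, blob in result.items()})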

Batch Size: 1

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d CPU

[ INFO ] Read network took 174.37 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 234.38 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)

Count: 751604 iterations
Duration: 60002.19 ms
Latency: 0.29 ms
Throughput: 12526.28 FPS


benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD

[ INFO ] Read network took 31.25 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 1589.09 ms

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)

Count: 33676 iterations
Duration: 60010.21 ms
Latency: 7.01 ms
Throughput: 561.17 FPS

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d GPU

[ INFO ] Read network took 15.62 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 14325.00 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 2 streams for GPU, limits: 60000 ms duration)

Count: 67800 iterations
Duration: 60000.80 ms
Latency: 3.39 ms
Throughput: 1129.98 FPS

You can see that the results of the CPU are very good, far surpassing the GPU and MYRIAD.
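
The log lines above also show benchmark_app keeping four asynchronous inference requests in flight at once. The following is a rough sketch of that pattern with the 2020.4 Python API (my own simplification, not benchmark_app's actual code); the zero-filled input is just a placeholder.

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="Xubett964.fp16.xml", weights="Xubett964.fp16.bin")

# Four infer requests, matching "4 inference requests" in the logs above
exec_net = ie.load_network(network=net, device_name="CPU", num_requests=4)

dummy = np.zeros((1, 1, 28, 28), dtype=np.float32)   # placeholder batch-1 input

# Start every request asynchronously, then wait for each one to finish
for request in exec_net.requests:
    request.async_infer({"imageinput": dummy})
for request in exec_net.requests:
    request.wait()   # blocks until this request completes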

Batch Size: 32

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d CPU -b 32

[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 156.26 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)

Count: 44064 iterations
Duration: 60006.13 ms
Latency: 5.48 ms
Throughput: 23498.40 FPS

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD -b32

[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 1561.80 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)

Count: 8680 iterations
Duration: 60017.96 ms
Latency: 27.45 ms
Throughput: 4627.95 FPS

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d GPU -b32

[ INFO ] Read network took 22.20 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 14444.84 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 2 streams for GPU, limits: 60000 ms duration)

Count: 25220 iterations
Duration: 60010.31 ms
Latency: 10.25 ms
Throughput: 13448.36 FPS
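
The -b flag used in the commands above does not change the model file; as the "Reshaping network" log lines show, it changes the network's batch dimension before the network is loaded onto the device. Here is a hedged sketch of the equivalent Python calls (file names and the 'imageinput' name are taken from the post; either of the two approaches alone is enough):

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="Xubett964.fp16.xml", weights="Xubett964.fp16.bin")

# Either set the batch size directly ...
net.batch_size = 32
# ... or reshape the named input explicitly, mirroring
# "Reshaping network: 'imageinput': [32, 1, 28, 28]" in the logs
net.reshape({"imageinput": (32, 1, 28, 28)})

exec_net = ie.load_network(network=net, device_name="GPU")
print(net.input_info["imageinput"].input_data.shape)   # [32, 1, 28, 28]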

Batch Size: 1024

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d CPU -b1024

[ INFO ] Read network took 31.24 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 222.76 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)

Count: 1420 iterations
Duration: 60276.96 ms
Latency: 183.38 ms
Throughput: 24123.31 FPS

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD -b1024

[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 15.63 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 1521.97 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)

Count: 388 iterations
Duration: 60890.69 ms
Latency: 627.76 ms
Throughput: 6525.00 FPS

benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d GPU -b1024

[ INFO ] Read network took 15.65 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 15.68 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 14388.24 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28

[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 2 streams for GPU, limits: 60000 ms duration)

Count: 1128 iterations
Duration: 60223.88 ms
Latency: 189.66 ms
Throughput: 19179.64 FPS

Summary:

For a relatively small input like the 28x28 images in this article, a batch size of 1 leaves the CPU far ahead of the GPU and MYRIAD, because those devices spend more time reading the data than the CPU does. MYRIAD receives its data over USB, which has the longest latency; the GPU is only slightly better, and the I/O overhead per inference is relatively large. Increasing the batch size transfers more data per request, so the relative I/O burden shrinks, the proportion of time spent on computation grows, and throughput rises, while the latency of each request increases accordingly. Limited by memory and processing capacity, throughput does not grow in proportion to the batch size. An appropriate batch size therefore has to be chosen according to the input size, the network architecture, the latency requirements, and the target compute engine.
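
As a quick sanity check of the numbers above (my own arithmetic, not additional measurements), the reported throughput is simply iterations × batch size divided by the duration in seconds:

def throughput_fps(iterations, batch_size, duration_ms):
    # Images per second = total images processed / elapsed seconds
    return iterations * batch_size / (duration_ms / 1000.0)

print(throughput_fps(751604, 1, 60002.19))    # ~12526 FPS  (CPU, batch 1)
print(throughput_fps(44064, 32, 60006.13))    # ~23498 FPS  (CPU, batch 32)
print(throughput_fps(1128, 1024, 60223.88))   # ~19180 FPS  (GPU, batch 1024)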