Benchmark demonstrates remarkable reduction in glass-to-glass latency using GPUDirect/RDMA

5 February 2020

In mission-critical military embedded computing applications, time is of the essence: milliseconds can mean the difference between life and death. Every electronic warfare developer aims to minimize latency in applications that take huge amounts of sensor data and turn it into actionable information as quickly as possible. Examples include image analysis, image enhancement, 360-degree video stitching, sensor fusion and target detection.

GPUs are widely recognized for providing the tremendous horsepower required by compute-intensive workloads, because they can apply hundreds of cores to a problem in parallel. As a result, GPUs consume data much faster than CPUs, and as GPU computing horsepower increases, so does the demand for I/O bandwidth to keep them fed.

With GPUDirect, multiple GPUs, network adapters, solid-state drives (SSDs) and now NVMe drives can read from and write to CUDA host and device memory directly. This eliminates unnecessary memory copies, which reduces latency and dramatically lowers CPU overhead.

GPUDirect RDMA is a technology that uses standard PCI Express features to provide a direct data path between the GPU and a third-party peer device, such as a network interface, video acquisition device or storage adapter.

Abaco’s GR4 3U VPX High Performance Quad Channel Video Capture Board is designed for the highest performance video capture applications such as ISR, C4ISR, SAR, situational awareness and remote sensing/analysis.

To demonstrate the ability of GPUDirect to substantially reduce latency, and specifically ‘glass-to-glass’ latency (the time between data being captured by a camera lens and actionable information being presented on a screen), we built a benchmark around a GR4.

Benchmark hardware configuration

The system used for this benchmark comprised the VPX370 3U VPX development platform, the SBC329 3U VPX single board computer, the GR4 video capture board and an HD-SDI camera.

Method: HD-SDI ‘camera pointed at screen’

  • Display a stopwatch on the screen
  • Point the HD-SDI camera at the screen and display its live capture on the same screen
  • Take a screen capture
  • Determine the time delta between the live stopwatch and its image in the captured video. This delta is the latency from frame capture at the camera to rendering of the HD-SDI input on the screen (glass-to-glass).
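The arithmetic in the last step can be sketched as follows. This is a minimal illustration, assuming the two stopwatch readings (live and captured) have already been read off each screen capture by hand; the function name `glass_to_glass_ms` and the sample readings are hypothetical, not benchmark data. Because a single snapshot depends on camera and monitor timing, summarizing several captures gives a steadier figure.

```python
from statistics import mean


def glass_to_glass_ms(readings: list[tuple[float, float]]) -> tuple[float, float, float]:
    """Return (min, mean, max) glass-to-glass latency in milliseconds.

    Each tuple is (live_stopwatch_ms, captured_stopwatch_ms) read from one
    screen capture. The live stopwatch is always ahead of its image in the
    captured video, so the difference is the glass-to-glass latency.
    """
    if not readings:
        raise ValueError("need at least one screen capture")
    deltas = []
    for live_ms, captured_ms in readings:
        if captured_ms > live_ms:
            raise ValueError("captured reading cannot be ahead of the live stopwatch")
        deltas.append(live_ms - captured_ms)
    return min(deltas), mean(deltas), max(deltas)


# Hypothetical readings from three screen captures (illustrative only).
samples = [(1000.0, 940.0), (2000.0, 1934.0), (3000.0, 2937.0)]
lo, avg, hi = glass_to_glass_ms(samples)
print(f"min {lo:.1f} ms, mean {avg:.1f} ms, max {hi:.1f} ms")
```

Reporting the spread alongside the mean makes it visible how much an individual snapshot can vary.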

Non-RDMA Pipeline:

  • Wait for a full frame to be delivered
  • Wait for OpenGL draw() to be called
  • Deliver frame from CPU RAM to GPU memory via OpenGL Sub Texture
  • OpenGL Vertex Drawing
  • Monitor Render

RDMA Pipeline:

  • Wait for OpenGL draw() to be called
  • Deliver frame from FPGA RAM buffer to GPU Memory via OpenGL Sub Texture
  • OpenGL Vertex Drawing
  • Monitor Render
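The practical difference between the two pipelines can be sketched as a simple additive latency model. This is a hypothetical illustration, not Abaco's implementation: it assumes each stage adds a fixed delay and that the stages run strictly one after another. The `StageDelays` fields mirror the pipeline steps listed above, and every value passed in is an assumption rather than a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageDelays:
    """Hypothetical per-stage delays in milliseconds (assumed, not measured)."""
    full_frame_wait_ms: float  # waiting for a complete frame to land in CPU RAM
    draw_wait_ms: float        # waiting for the next OpenGL draw() call
    upload_ms: float           # sub-texture upload into GPU memory
    vertex_draw_ms: float      # OpenGL vertex drawing
    monitor_ms: float          # monitor render (scan-out to the display)


def non_rdma_latency_ms(d: StageDelays) -> float:
    # Non-RDMA: every stage runs in sequence, starting with the full-frame wait.
    return (d.full_frame_wait_ms + d.draw_wait_ms + d.upload_ms
            + d.vertex_draw_ms + d.monitor_ms)


def rdma_latency_ms(d: StageDelays) -> float:
    # RDMA: frame data reaches GPU-accessible memory without a full-frame
    # wait in CPU RAM, so the pipeline starts at the draw() wait.
    return d.draw_wait_ms + d.upload_ms + d.vertex_draw_ms + d.monitor_ms


# Illustrative inputs only; 20 ms is one frame period at 50 Hz.
example = StageDelays(full_frame_wait_ms=20.0, draw_wait_ms=8.0,
                      upload_ms=2.0, vertex_draw_ms=1.0, monitor_ms=16.7)
print(f"non-RDMA {non_rdma_latency_ms(example):.1f} ms, "
      f"RDMA {rdma_latency_ms(example):.1f} ms")
```

The model isolates the structural difference between the pipelines (the full-frame wait); it is not calibrated against the benchmark results, which also depend on camera and monitor timing.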

Results: glass-to-glass latency measurements

  • Non-RDMA (CPU capture): 80 ms
  • RDMA (GPU capture): 50 ms

The benchmark shows that using RDMA reduces glass-to-glass latency by a remarkable 30 ms compared with the non-RDMA approach, a reduction of 37.5% (roughly 40%).

Notes

The exact measurement depends on the relative synchronization of camera frames and monitor refresh, and on when the screen capture is taken.

It is typically an FPGA that communicates with the GPU over PCI Express. Since no frame buffering is done in the FPGA, writing to GPU memory with RDMA takes only microseconds when done at scanline level. With the non-RDMA capture method, we must wait for a full frame and then deliver it to GPU memory. That full-frame wait alone is typically 16.6-20 ms, which is one frame period at 60 Hz or 50 Hz.
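The figures in this note follow from basic video timing. As a worked sketch, assuming progressive 1080-line HD video with 1125 total lines per frame (1080 active lines plus vertical blanking, as in SMPTE 274M), the full-frame wait is one frame period while a single scanline arrives in well under 20 µs. The helper names are illustrative, and other HD formats would use a different line count.

```python
def frame_period_ms(frame_rate_hz: float) -> float:
    """Time to deliver one complete frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def line_period_us(frame_rate_hz: float, total_lines: int) -> float:
    """Time to deliver one scanline (active or blanking), in microseconds."""
    return 1_000_000.0 / (frame_rate_hz * total_lines)


# Assumption: 1080-line progressive video with 1125 total lines per frame.
TOTAL_LINES_1080P = 1125

for rate in (60.0, 50.0):
    print(
        f"{rate:.0f} Hz: full-frame wait {frame_period_ms(rate):.1f} ms, "
        f"one scanline {line_period_us(rate, TOTAL_LINES_1080P):.1f} us"
    )
```

A full frame takes about 1125 times longer to arrive than one scanline, which is why moving data at scanline granularity shrinks the capture-side wait from milliseconds to microseconds.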

The GPUDirect feature is currently supported on Linux and has been integrated into Abaco’s AXIS ImageFlex, a toolkit designed to simplify the development of real-time image processing, visualization and autonomy applications. ImageFlex interoperates easily with OpenGL, CUDA, OpenCL, OpenCV and other libraries.

Fabio Ancona

Fabio is Abaco’s Field Application Engineer for Germany, Austria, Switzerland, Italy and the Balkans. He has over 25 years’ experience working in embedded electronics – initially as an embedded software engineer, followed by 16 years as regional sales manager. He took up his current assignment two years ago. Fabio has a PhD in Electronic Engineering and Computer Science from the University of Genoa.