In mission-critical military embedded computing applications, time is of the essence: milliseconds can mean the difference between life and death. The goal of every electronic warfare developer is to minimize latency in applications that derive huge amounts of data from sensors and turn that data into actionable information in the least possible time – applications such as image analysis, image enhancement, 360-degree video stitching, sensor fusion and target detection.
GPUs are widely recognized for providing the tremendous horsepower required by compute-intensive workloads, enabled by their ability to apply hundreds of cores in parallel to the problem. This means that GPUs consume data much faster than CPUs – and, as the computing horsepower of GPUs increases, so too does the demand for I/O bandwidth.
Using GPUDirect, multiple GPUs, network adapters, solid-state drives (SSDs) and now NVMe drives can directly read from and write to CUDA host and device memory, eliminating unnecessary memory copies – reducing latency and dramatically lowering CPU overhead.
GPUDirect RDMA is a technology that enables a direct path for data exchange between the GPU and a third-party peer device – such as a network interface, video acquisition device or storage adapter – using standard features of PCI Express.
Abaco’s GR4 3U VPX High Performance Quad Channel Video Capture Board is designed for the highest performance video capture applications such as ISR, C4ISR, SAR, situational awareness and remote sensing/analysis.
In order to demonstrate the ability of GPUDirect to substantially reduce latency – and, specifically, ‘glass-to-glass’ latency, which is the time taken between data being captured by a lens and actionable information being presented on a screen – we deployed a GR4 at the heart of a benchmark.
Benchmark hardware configuration
The system used for this benchmark comprised the VPX370 3U VPX development platform; the SBC329 3U VPX single-board computer; the GR4; and an HD-SDI camera.
Method: HD-SDI ‘camera pointed at screen’
- Display a running stopwatch on the screen
- Display the live HD-SDI camera capture of that screen alongside it
- Capture the screen
- Determine the time delta between the two stopwatch readings – this is the latency from frame capture at the camera to frame rendering of the HD-SDI input (glass-to-glass)
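The measurement step above reduces to a simple subtraction, assuming both stopwatch readings can be read off the screen capture. The function name below is illustrative, not part of the benchmark software:

```python
# Hypothetical sketch of the glass-to-glass measurement (names are illustrative).
# The screen capture shows two stopwatch readings: the live on-screen stopwatch,
# and the same stopwatch as seen through the camera -> capture -> render path.

def glass_to_glass_latency_ms(live_stopwatch_ms: float, camera_view_ms: float) -> float:
    """Latency is the difference between the live stopwatch reading and the
    (older) reading visible in the rendered camera feed."""
    return live_stopwatch_ms - camera_view_ms

# Example: live stopwatch reads 1000 ms, camera-fed image shows 920 ms
print(glass_to_glass_latency_ms(1000.0, 920.0))  # → 80.0
```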
Non-RDMA Pipeline:
- Wait for a full frame to be delivered
- Wait for OpenGL draw() to be called
- Deliver frame from CPU RAM to GPU memory via OpenGL Sub Texture
- OpenGL Vertex Drawing
- Monitor Render
RDMA Pipeline:
- Wait for OpenGL draw() to be called
- Deliver frame from FPGA RAM buffer to GPU Memory via OpenGL Sub Texture
- OpenGL Vertex Drawing
- Monitor Render
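A rough latency-budget model shows where the RDMA path saves time. Apart from the full-frame wait (which the notes below put at one frame period, 16.6–20 ms), the component figures here are placeholder assumptions for illustration, not measured values:

```python
# Rough, illustrative latency budget (ms) for the two pipelines.
# Only FRAME_WAIT_MS reflects a figure from the text (60 Hz frame period);
# the other component times are assumed placeholders.

FRAME_WAIT_MS = 16.6      # wait for a full frame in CPU RAM (60 Hz video)
CPU_TO_GPU_COPY_MS = 4.0  # assumed: OpenGL sub-texture upload from CPU RAM
DRAW_WAIT_MS = 8.0        # assumed: average wait for the next draw() call
RENDER_MS = 16.6          # assumed: vertex drawing plus monitor refresh

non_rdma_ms = FRAME_WAIT_MS + DRAW_WAIT_MS + CPU_TO_GPU_COPY_MS + RENDER_MS

# RDMA path: scanline-level writes land in GPU memory in microseconds, so the
# full-frame wait and the CPU-to-GPU copy effectively disappear from the budget.
rdma_ms = DRAW_WAIT_MS + RENDER_MS

print(f"non-RDMA ≈ {non_rdma_ms:.1f} ms, RDMA ≈ {rdma_ms:.1f} ms")
```

Whatever the exact component values, the structural difference is the same: the RDMA pipeline simply has two fewer terms.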
Results: glass-to-glass latency measurements
- Non-RDMA (CPU capture) = 80 ms
- RDMA (GPU capture) = 50 ms
The benchmark clearly shows that glass-to-glass latency is reduced by 30 ms – approximately 37% – when RDMA is used compared with the non-RDMA approach.
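The reduction percentage follows directly from the two measured figures:

```python
# Relative latency reduction from the two measured glass-to-glass figures.
non_rdma_ms = 80.0
rdma_ms = 50.0

reduction_pct = (non_rdma_ms - rdma_ms) / non_rdma_ms * 100

print(f"{reduction_pct:.1f}%")  # → 37.5%
```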
Notes
The actual measurement depends on the relative sync of the camera frame and monitor refresh, and the time of the snapshot.
It is typically an FPGA that communicates with the GPU over PCI Express. Since no buffering is done in the FPGA, the actual latency of writing to GPU memory when using RDMA is measured in microseconds when done at a scanline level. Using the non-RDMA capture method, we must wait for a full frame to arrive in CPU RAM and then deliver it to GPU memory; the frame wait alone is typically 16.6–20 ms – one frame period at 60 Hz or 50 Hz respectively.
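The 16.6–20 ms figure is simply one frame period at common video rates:

```python
# One frame period in milliseconds for a given frame rate.
def frame_period_ms(fps: float) -> float:
    return 1000.0 / fps

print(round(frame_period_ms(60), 1))  # 60 Hz → 16.7 ms (≈16.6)
print(frame_period_ms(50))            # 50 Hz → 20.0 ms
```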
The GPUDirect feature is currently supported by Linux, and has been integrated into Abaco’s AXIS ImageFlex, a toolkit designed to simplify the development of real-time image processing, visualization and autonomy applications. It is easily interoperable with OpenGL, CUDA, OpenCL, OpenCV and so on.