Propose ArmNN-based accelerated vision_detector node for non-CUDA hardware

Description

The vision_darknet_detect node currently used by Autoware to provide object detection is written
using the Darknet framework, which optionally uses CUDA for acceleration.

Performance is noticeably diminished on drive / development platforms without CUDA support enabled,
which limits the node's adoption in real-time applications.

A unified vision_detector package could be written using the ArmNN framework, which accepts pre-trained
models and targets a wide variety of processors and accelerators at compile time, allowing the node to take
immediate advantage of recent and future advances in compute technology.
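
As a rough illustration, the sketch below parses a pre-trained TensorFlow Lite model, optimises it for a
chosen backend, and loads it into the ArmNN runtime. The model file name is a placeholder, exact API details
may differ between ArmNN releases, and other targets are selected simply by extending the backend list.

```cpp
#include <vector>
#include <armnn/ArmNN.hpp>
#include <armnnTfLiteParser/ITfLiteParser.hpp>

int main()
{
    // Parse a pre-trained model; ArmNN also ships parsers for other formats
    // (e.g. ONNX), so the training framework is not fixed by the node.
    armnnTfLiteParser::ITfLiteParserPtr parser = armnnTfLiteParser::ITfLiteParser::Create();
    armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile("detector.tflite");

    // Create the runtime and optimise the network for a list of backends.
    // Only the reference CPU backend is used here, as in the proof of concept
    // proposed below; other targets are added to this list.
    armnn::IRuntime::CreationOptions options;
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);
    std::vector<armnn::BackendId> backends = { armnn::Compute::CpuRef };
    armnn::IOptimizedNetworkPtr optNet =
        armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

    // Load the optimised network; the id is later used to enqueue inference.
    armnn::NetworkId networkId;
    runtime->LoadNetwork(networkId, std::move(optNet));
    return 0;
}
```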

When the proof of concept is developed and performance metrics are available, there are two possible
integration paths: (i) for the CUDA-accelerated vision_darknet_detect and the unified vision_detector
to co-exist and remain modular implementations of object detection; and (ii) for vision_detector to
absorb the darknet-CUDA backend as a compile-time option. The community can decide on a suitable
course of action at that point.

Implementation considerations

An ArmNN implementation of vision_detector would:

  • Exhibit suitable characteristics for real-time applications, namely:
    • low processing delay;
    • minimal delay jitter;
    • drop messages and time out in a deterministic way.
  • Maintain black-box, abstraction, and compatibility guarantees:
    • use a subset (contravariance) of input topics;
    • publish a superset (covariance) of output topics;
    • present the same retry / queueing / timeout semantics.
  • Be suitable for upstream adoption in production systems:
    • use a framework written in C++;
    • provide a variety of accelerator back-ends;
    • support model export and import from a variety of training frameworks.
  • Present a drop-in, reference implementation of a vision_detector node (a minimal node skeleton is sketched below).
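
To make the black-box and drop-in points concrete, the skeleton below shows the intended topic interface
under ROS 1. The topic names and the use of autoware_msgs::DetectedObjectArray are assumptions modelled on
the existing vision_darknet_detect node, not a fixed design; the inference itself is left as a stub.

```cpp
#include <ros/ros.h>
#include <sensor_msgs/Image.h>
#include <autoware_msgs/DetectedObjectArray.h>

class VisionDetectorNode
{
public:
  VisionDetectorNode()
  {
    // Subscribe to the same image input as the existing detector (queue size 1
    // so stale frames are dropped rather than queued).
    image_sub_ = nh_.subscribe("image_raw", 1, &VisionDetectorNode::imageCallback, this);
    // Publish the same detection message type; additional topics (a superset)
    // can be added without breaking existing consumers.
    objects_pub_ = nh_.advertise<autoware_msgs::DetectedObjectArray>(
        "detection/image_detector/objects", 1);
  }

private:
  void imageCallback(const sensor_msgs::ImageConstPtr& msg)
  {
    autoware_msgs::DetectedObjectArray detections;
    detections.header = msg->header;
    // The ArmNN-backed inference would run here and populate detections.objects.
    objects_pub_.publish(detections);
  }

  ros::NodeHandle nh_;
  ros::Subscriber image_sub_;
  ros::Publisher objects_pub_;
};

int main(int argc, char** argv)
{
  ros::init(argc, argv, "vision_detector");
  VisionDetectorNode node;
  ros::spin();
  return 0;
}
```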

Alternatives

  1. Develop separate accelerator nodes targeting each supported platform
    A set of object-detection nodes could be developed: one for vector-instruction acceleration;
    one for GPU acceleration; one for ML ASIC acceleration; etc. A unified node that uses a high-
    level framework (e.g. ArmNN) and pluggable backends would reduce maintainer workload.

Additional Information

Proposed Steps

  1. Develop a proof-of-concept vision_detector node targeting the ArmNN reference CPU backend.
  2. Extend target platform support to include the following (a backend-selection sketch follows these steps):
    a. SIMD extensions: NEON
    b. General-Purpose compute: OpenCL
    c. Purpose-built accelerators
  3. Present findings and discuss next step regarding vision_detector architecture:
    a. separate nodes, or
    b. unified node with build options.
  4. Test to ensure conflict-free operation when run as part of the Autoware stack.
  5. Integrate into the upstream development branch for release.
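
Regarding step 2, the expectation is that extending platform support is largely a matter of changing the
backend preference list passed to armnn::Optimize; a hedged sketch, using ArmNN's standard backend
identifiers (the helper name is illustrative):

```cpp
#include <vector>
#include <armnn/ArmNN.hpp>

// Optimise a parsed network for the best backend available on the platform.
// Layers that a backend cannot run fall back to the next entry in the list.
armnn::IOptimizedNetworkPtr OptimizeForPlatform(const armnn::INetwork& network,
                                                const armnn::IRuntime& runtime)
{
    std::vector<armnn::BackendId> backends = {
        armnn::Compute::GpuAcc,  // OpenCL (step 2b)
        armnn::Compute::CpuAcc,  // NEON (step 2a)
        armnn::Compute::CpuRef   // reference CPU, always available
    };
    // Purpose-built accelerators (step 2c) register their own BackendIds and
    // would be prepended to this list.
    return armnn::Optimize(network, backends, runtime.GetDeviceSpec());
}
```

In principle only the ArmNN build configuration changes between targets; the node code itself stays the same.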

More about ArmNN

ArmNN is a portable C++ framework with pluggable backend support; targets include
CPU (reference), NEON, OpenCL, and external accelerators (e.g. USB, PCIe solutions).
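
As a small illustration of the pluggable design, the backends compiled into a particular ArmNN build can be
listed at run time; a sketch assuming the BackendRegistry header (the include path may vary between releases):

```cpp
#include <iostream>
#include <armnn/BackendRegistry.hpp>

int main()
{
    // Print the backend identifiers registered in this ArmNN build,
    // e.g. CpuRef, CpuAcc (NEON), GpuAcc (OpenCL), plus any plugin backends.
    for (const armnn::BackendId& id : armnn::BackendRegistryInstance().GetBackendIds())
    {
        std::cout << id.Get() << std::endl;
    }
    return 0;
}
```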

Disclaimer

I work for Arm, and it is in my interest to develop a proof-of-concept using the ArmNN framework.


Can you give more detail on the trade-offs of using a GPU directly versus using it via the ArmNN framework?

How easy is ArmNN to work with as a library? How well does it work in virtualised environments, Docker, etc.?

There are benefits to having a single implementation that can choose acceleration at compile time, but on the other hand if we end up just swapping one annoying framework (CUDA) for another we have not necessarily gained anything.

I understand you may need to do some of your proof-of-concept work before you can give complete answers to these questions.


Targeting an accelerator directly requires a significant amount of non-transferable development effort. The problem is that the acceleration is architecture-specific and cannot be run on other platforms. It would be ideal to target each accelerator with a separate node, but maintenance and development costs make this impractical.

ArmNN allows developers to reach more platforms and maintain less code by presenting an abstraction layer that parses models from a number of popular frameworks and executes them on supported backends, at the expense of latency. It would not be unreasonable to expect the overhead to be a fraction of the execution time of the models.
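
One way to quantify that overhead would be simple wall-clock timing around a single inference, compared
against the same model run natively on the backend in question; a rough sketch (the runtime, network id,
and tensor bindings are assumed to be created during node set-up):

```cpp
#include <chrono>
#include <armnn/ArmNN.hpp>

// Wall-clock latency of one inference, in milliseconds.
double MeasureInferenceMs(armnn::IRuntime& runtime,
                          armnn::NetworkId networkId,
                          const armnn::InputTensors& inputs,
                          const armnn::OutputTensors& outputs)
{
    const auto start = std::chrono::steady_clock::now();
    runtime.EnqueueWorkload(networkId, inputs, outputs);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Repeating this across backends (CpuRef, CpuAcc, GpuAcc) and against the existing darknet-CUDA node would
give directly comparable numbers.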

I am still familiarising myself with it.

I have managed to get ArmNN to build and run in a Docker container with CPU (reference) and NEON support but have not tried using the OpenCL backend as of yet.

I agree. After I have made sufficient headway into my investigation, I will update this thread and we (the community) can discuss a suitable course of action, including the possibility of merging this as a separate node upstream: the existing node would provide CUDA acceleration, and ArmNN could be used to target everything else.

When you have some data to quantify the overhead, I think we would all be interested in seeing it.

This is a reasonable course of action, but not the ideal one, because we don’t want to effectively maintain the same code twice. However, CUDA is a popular platform, so if necessary we will have to do this.