Currently, Autoware.ai contains three types of traffic light classification systems:
- A heuristic approach (`region_tlr`)
- A DNN approach which utilizes the MXNet framework (`region_tlr_mxnet`) and CUDA
- A DNN approach which utilizes the SSD framework (`region_tlr_ssd`) and CUDA
Each of these assumes that it will be handed two pieces of information:
- A list of regions of interest which contain the traffic lights
- A raw image
These traffic light ROIs (regions of interest) are provided by the `feat_proj` (feature projection) node, which uses information from the vector map along with the camera's intrinsic and extrinsic parameters to project the bulb locations from 3D map space into the 2D image plane of the camera's field of view.
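For context, the core of that projection is a standard pinhole-camera model. Here's a minimal sketch (not the actual `feat_proj` code), assuming a hypothetical intrinsic matrix `K` and known map-to-camera extrinsics:

```python
import numpy as np

# Hypothetical intrinsics: focal lengths fx, fy and principal point cx, cy.
K = np.array([[1400.0,    0.0, 960.0],
              [   0.0, 1400.0, 604.0],
              [   0.0,    0.0,   1.0]])

def project_bulb(p_map, R_cam_map, t_cam_map):
    """Project a 3D bulb position (map frame) to pixel coordinates.

    R_cam_map (3x3) and t_cam_map (3,) are the extrinsics that take
    map-frame points into the camera frame (from calibration/TF).
    Returns (u, v) or None if the point is behind the camera.
    """
    p_cam = R_cam_map @ p_map + t_cam_map
    if p_cam[2] <= 0.0:               # behind the image plane -> not visible
        return None
    u, v, _ = K @ (p_cam / p_cam[2])  # perspective divide, then intrinsics
    return u, v
```

An ROI is then formed by padding around (u, v); deciding how much to pad is exactly where the pitch/roll/yaw sensitivity described below comes into play.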
In developing our own traffic light classification node, we have run into the following issues:
- The results from the heuristic approach were terrible and nearly unusable in real-world scenarios.
- Both the MXNet and SSD/Caffe detectors contain backbones that were trained on the COCO dataset, which is not available for commercial use. Because these nodes were designed specifically around those frameworks and models, this severely limits their applicability.
- Much of the functionality is duplicated between the DNN-based nodes.
- Attempting to separate the cropping/ROI-extraction functionality of these nodes from the classification functionality leads to either timing issues or chicken-and-egg problems (described in more detail below).
- The `feat_proj` node is somewhat naive in its cropping mechanisms. Variance in the vehicle's pitch/roll/yaw, multiple bulbs in certain vector map versions, and the field of view of the detection camera lead to ROI images which are too large, too small, cover only a portion of the light, or do not point at the light whatsoever.
We think it is possible to provide a general-purpose architecture for traffic light recognition (and other constrained-ROI image classification problems), but some pros and cons must be considered for each approach. I'm not going to speak too much to the types or capabilities of neural networks which would fit the bill since that isn't my area of expertise - I'll invite my colleague Joe Driscoll to speak on this topic - and will instead focus on the overall architecture. I think our intended goals for the new architecture look something like this:
- The recognizer must take in raw images from one or more cameras and detect if a traffic light which exists in the vector map also exists within those images.
- If a traffic light exists in the vector map and is applicable to the currently-occupied lane and direction of travel but is outside the field of view of the camera(s), the recognizer should provide this feedback.
- The recognizer must classify a found traffic light from the image as having one of four states:
- Stop (usually red)
- Yield (usually yellow)
- Go (usually green or blue)
- Unknown
- The recognizer must publish three pieces of information (a minimal sketch of this output stage follows the list):
- An overall recommendation to the vehicle control system about how to proceed given the detected and classified traffic light(s).
- An image which contains the contents of the raw image plus superimposed squares around the detected traffic lights and their associated classifications (this is currently provided by all classifiers but is only useful for human interaction and not necessary for system functionality).
- A `visualization_msgs/MarkerArray` which contains markers for the detected lights and their classification results (again, this is not necessary for system functionality but is provided by all current classifiers).
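To make that output contract concrete, here's a minimal sketch of the publishing side in `rospy`. The topic names and the `UInt8` encoding of the recommendation are assumptions for illustration, not an existing Autoware interface:

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import UInt8
from visualization_msgs.msg import MarkerArray

# The four classification states listed above (numeric encoding is hypothetical).
STOP, YIELD, GO, UNKNOWN = 0, 1, 2, 3

rospy.init_node('tlr_recognizer')

# Hypothetical topic names -- the actual interface is what's under discussion.
reco_pub   = rospy.Publisher('tlr/recommendation', UInt8, queue_size=1)
image_pub  = rospy.Publisher('tlr/superimposed_image', Image, queue_size=1)
marker_pub = rospy.Publisher('tlr/markers', MarkerArray, queue_size=1)

def publish_results(state, annotated_image, marker_array):
    """Publish the three outputs described above for one camera frame."""
    reco_pub.publish(UInt8(data=state))   # overall recommendation
    image_pub.publish(annotated_image)    # raw image + superimposed boxes
    marker_pub.publish(marker_array)      # markers for RViz visualization
```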
For implementing an architecture which does the above, these possible options come to mind:
- Combine the problems of detection and classification into a single, neural-network-based approach.
- Pros:
- Simplifies the architecture immensely
- Reduces compute resources (no image re-publishing, cropping, etc.)
- Gets rid of latency and chicken-and-egg problems
- Cons:
- Difficult to make general-purpose given the feature, language, and library support of the many neural network frameworks that are available.
- Difficult to troubleshoot since the task domain has increased and you can’t obtain partial results.
- People who are using Autoware without a GPU are out of luck.
- Keep the detection task separate from the classification task, but improve `feat_proj`'s ROI selection and, in the classification task, separate the actual classifier from the ROS node by either creating the classifier as a library or as a separate node with a ROS service call in between.
- Pros:
- Reuses existing code.
- The classification task can be heuristic or DNN-based and the classifier-interface node doesn’t have to care - it just makes a service call or overloaded function call.
- Cons:
- The classifier-interface node can run into latency issues. Whether the classifier is a library or a separate node, the latency between receiving a raw image and its ROIs and publishing the superimposed image with classification results is non-trivial. To keep the published superimposed image consistent, the classifier-interface node has to receive the raw image and ROIs, call the classifier, wait for the result, annotate the image and marker arrays, and then publish them. Incoming images and ROIs can go unprocessed while the node waits on the current classification result, and making the call asynchronously instead means managing a list of received raw images and ROIs and matching them with the returned classification results (a sketch of this callback pattern follows this option's cons).
- If using a library-based approach, only one language can be used and the choice of the language for the classifier-interface node would determine which language would be supported for the classification libraries (e.g., if using C++, can’t use Python for the classifier).
- The detection task essentially remains the same but contains band-aids and hacks to make it better at feature projection in a real-world environment.
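To make the latency concern in option 2 concrete, here's roughly what the classifier-interface node's synchronized callback would look like. I'm assuming `feat_proj` still publishes its ROI list as `autoware_msgs/Signals` on `roi_signal` (adjust if the types differ), and `classify()` is a stand-in for the library or service boundary:

```python
import rospy
import message_filters
from sensor_msgs.msg import Image
from autoware_msgs.msg import Signals  # feat_proj's ROI list

def classify(image_msg, rois_msg):
    """Stand-in for the classifier boundary (an overloaded library call or a
    hypothetical ROS service, e.g. rospy.ServiceProxy('classify_light', ...))."""
    raise NotImplementedError

def synced_cb(image_msg, rois_msg):
    # Blocking call: no new (image, ROI) pairs are handled until the
    # classifier returns -- exactly the latency problem described above.
    result = classify(image_msg, rois_msg)
    # ...annotate the image and marker array with `result`, then publish...

rospy.init_node('classifier_interface')
image_sub = message_filters.Subscriber('image_raw', Image)
rois_sub  = message_filters.Subscriber('roi_signal', Signals)
sync = message_filters.ApproximateTimeSynchronizer(
    [image_sub, rois_sub], queue_size=10, slop=0.1)
sync.registerCallback(synced_cb)
rospy.spin()
```

Going asynchronous instead means keeping a queue of (image, ROI) pairs keyed by timestamp and matching classifier results back to them - exactly the bookkeeping mentioned above.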
- Use feature projection to either a) determine if a relevant traffic light is in the image(s) AND find the ROI for that light, or b) if a GPU is available, only determine if a relevant traffic light is in the image(s) and hand the ROI-finding task to a DNN-based detector node when it is. In addition, separate the classifier as described in option 2. (A sketch of this gating logic follows this option's pros and cons.)
- Pros:
- Makes all involved nodes “general-purpose” and usable with or without a GPU.
- Improves the ROI generation with a common approach that is known to produce high-reliability results (the DNN-based detector).
- The placement of traffic lights in the vector map and the yaw/pitch/roll of the vehicle are much less relevant to the detection task because the DNN-based detector only uses the raw image as input.
- The neural network for traffic light detection could be very shallow if only looking for one type of object in the image.
- Each stage of detection and classification is easier to troubleshoot and fine-tune due to increased visibility into the pipeline.
- Same pros as option 2.
- Cons:
- More computationally expensive than the other approaches because multiple neural networks are running simultaneously and more nodes are involved.
- Same cons as option 2.
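As a rough illustration of option 3's control flow (every helper name here is a hypothetical placeholder, not an existing API):

```python
def detect_lights(image, projected_lights, dnn_detector=None):
    """Option 3 gating: feature projection answers 'is a relevant light in
    view?'; a DNN detector (GPU path) finds the actual ROIs when present."""
    visible = [light for light in projected_lights if in_fov(light, image)]
    if not visible:
        # Feedback case: a relevant light exists in the vector map but is
        # outside the camera's field of view.
        return []
    if dnn_detector is not None:
        # GPU path (3b): hand ROI-finding to the DNN-based detector.
        return dnn_detector.find_lights(image)
    # CPU fallback (3a): use the feature-projected ROIs directly.
    return [projected_roi(light, image) for light in visible]

def in_fov(light, image):
    """Placeholder: true if the projected bulb pixel lands inside the image."""
    u, v = light.pixel                 # hypothetical attribute
    h, w = image.shape[:2]
    return 0 <= u < w and 0 <= v < h

def projected_roi(light, image, pad=32):
    """Placeholder: fixed-size crop box around the projected bulb pixel."""
    u, v = light.pixel
    return (u - pad, v - pad, u + pad, v + pad)
```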
Well, now that I’ve written a novel on the subject, please provide feedback and suggestions on how we can make an awesome traffic light detector that works well enough to be used in real traffic!