Generalized Traffic Light Classification Architecture

Currently, Autoware.ai contains 3 types of traffic light classification systems:

  • A heuristic approach (region_tlr)
  • A DNN approach which utilizes the MXNet framework (region_tlr_mxnet) and CUDA
  • A DNN approach which utilizes the SSD framework (region_tlr_ssd) and CUDA

Each of these assumes that it will be handed two pieces of information:

  1. A list of regions of interest which contain the traffic lights
  2. A raw image

These traffic light ROIs (regions of interest) are provided by the feat_proj (feature projection) node, which uses information from the vector map along with the camera intrinsic and extrinsic parameters to project the bulb locations from 3D map space onto the 2D image plane of the camera’s field of view.
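
For concreteness, here is a minimal sketch of the projection feat_proj performs, assuming a plain pinhole camera model; the function and variable names are illustrative, not the actual feat_proj API:

import numpy as np

def project_bulb_to_pixel(p_map, T_cam_map, K):
    # p_map:     bulb position in the map frame, shape (3,)
    # T_cam_map: 4x4 extrinsic transform from map frame to camera frame
    # K:         3x3 camera intrinsic matrix
    p_cam = T_cam_map @ np.append(p_map, 1.0)  # map -> camera frame
    if p_cam[2] <= 0.0:
        return None                            # bulb is behind the image plane
    uv = K @ (p_cam[:3] / p_cam[2])            # perspective divide + intrinsics
    return uv[0], uv[1]

The ROI is then grown around the returned pixel, so any error in the vehicle pose or the map translates directly into a misplaced or mis-sized ROI.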

In developing our own traffic light classification node, we have run into the following issues:

  1. The results from the heuristic approach were terrible and nearly unusable in real-world scenarios.
  2. Both the MXNet and SSD/Caffe detectors contain backbones which were trained using the COCO dataset, which is not available for commercial use. This severely limits the applicability of these nodes since they were designed specifically for these frameworks.
  3. Much of the functionality is duplicated between the DNN-based nodes.
  4. Attempting to separate the cropping/ROI extraction functionality of these nodes from the classification functionality leads to either timing issues or chicken-and-egg problems (described in more detail in a bit).
  5. The feat_proj node is somewhat naive in its cropping mechanisms. Variance in the pitch/roll/yaw of the vehicle, multiple bulbs in certain vector map versions, and the field of view of the detection camera lead to ROI images which are too large, too small, cover only a portion of the light, or miss the light entirely.

We think it is possible to provide a general-purpose architecture for traffic light recognition (and other constrained-ROI-type image classification problems) but some pros/cons must be considered for each approach. I’m not going to speak too much to the types or capabilities of neural networks which would fit the bill since that isn’t my area of expertise - I’ll invite my colleague Joe Driscoll to speak on this topic - but more on the overall architecture. I think our intended goals for the new architecture look something like this:

  1. The recognizer must take in raw images from one or more cameras and detect if a traffic light which exists in the vector map also exists within those images.
  2. If a traffic light exists in the vector map and is applicable to the currently-occupied lane and direction of travel but is outside the field of view of the camera(s), the recognizer should provide this feedback.
  3. The recognizer must classify a found traffic light from the image as having one of four states:
    1. Stop (usually red)
    2. Yield (usually yellow)
    3. Go (usually green or blue)
    4. Unknown
  4. The recognizer must publish three pieces of information:
    1. An overall recommendation to the vehicle control system about how to proceed given the detected and classified traffic light(s).
    2. An image which contains the contents of the raw image plus superimposed squares around the detected traffic lights and their associated classifications (this is currently provided by all classifiers but is only useful for human interaction and not necessary for system functionality).
    3. A visualization_msgs/MarkerArray which contains markers for the detected lights and their classification results (again, this is not necessary for system functionality but is provided by all current classifiers).
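
As a concrete reference for goal 4, a recognizer node’s outputs might look like the following skeleton. This is only a sketch: the topic names are made up, and std_msgs/String stands in for whatever recommendation message type is eventually chosen.

import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String
from visualization_msgs.msg import MarkerArray

# The four classification states from goal 3
STOP, YIELD, GO, UNKNOWN = range(4)

class RecognizerOutputs:
    def __init__(self):
        # 1. Overall recommendation to the vehicle control system
        self.recommendation = rospy.Publisher("tlr/recommendation", String, queue_size=1)
        # 2. Raw image with superimposed boxes and labels (debugging only)
        self.debug_image = rospy.Publisher("tlr/superimposed_image", Image, queue_size=1)
        # 3. RViz markers for the detected lights (debugging only)
        self.markers = rospy.Publisher("tlr/markers", MarkerArray, queue_size=1)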

For implementing an architecture which does the above, these possible options come to mind:

  1. Combine the problems of detection and classification into a single, neural-network-based approach.
  • Pros:
    • Simplifies the architecture immensely
    • Reduces compute resources (no image re-publishing, cropping, etc.)
    • Gets rid of latency and chicken-and-egg problems
  • Cons:
    • Difficult to make general-purpose given the feature, language, and library support of the many neural network frameworks that are available.
    • Difficult to troubleshoot since the task domain has increased and you can’t obtain partial results.
    • People who are using Autoware without a GPU are out of luck.
  2. Keep the detection task separate from the classification task but improve feat_proj's ROI selection and, in the classification task, separate the actual classifier from the ROS node by either creating the classifier as a library or as a separate node with a ROS service call in-between.
  • Pros:
    • Reuses existing code.
    • The classification task can be heuristic or DNN-based and the classifier-interface node doesn’t have to care - it just makes a service call or overloaded function call.
  • Cons:
    • The classifier-interface node can run into latency issues. Whether using a library or separate-node approach for the classifier, the latency between receiving the raw image and ROIs for that image and publishing the superimposed image with classification results is non-trivial. To keep consistency in the published superimposed image, the classifier-interface node has to receive the raw image and ROIs, make a call to the classifier, wait for the classifier to return the result, annotate the image and marker arrays, and then publish them. This can cause incoming images and ROIs to go unprocessed while the classifier-interface node is waiting for the current classification result. Making this call asynchronously means that you have to manage a list of received raw images and ROIs and coordinate them with the returned classification results (see the synchronization sketch after this list).
    • If using a library-based approach, only one language can be used and the choice of the language for the classifier-interface node would determine which language would be supported for the classification libraries (e.g., if using C++, can’t use Python for the classifier).
    • The detection task essentially remains the same but contains band-aids and hacks to make it better at feature projection in a real-world environment.
  3. Use feature projection to either a) determine if a relevant traffic light is in the image(s) AND find the ROI for that light or b) if a GPU is available, just determine if a relevant traffic light is in the image(s) while handing the detection task to a DNN-based node if the traffic light is in the image(s). In addition, separate the classifier as described in 2.
  • Pros:
    • Makes all involved nodes “general-purpose” and usable with or without a GPU.
    • Improves the ROI generation with a common approach that is known to produce high-reliability results (the DNN-based detector).
    • The placement of traffic lights in the vector map and the yaw/pitch/roll of the vehicle are much less relevant to the detection task because the DNN-based detector only uses the raw image as input.
    • The neural network for traffic light detection could be very shallow if only looking for one type of object in the image.
    • Each of the stages of detection and classification can be more easily troubleshot and fine-tuned due to increased visibility into the pipeline.
    • Same pros as 2.
  • Cons:
    • More computationally expensive than other approaches because of multiple neural networks running simultaneously and more nodes.
    • Same cons as 2.
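
Regarding the asynchronous bookkeeping mentioned in option 2’s cons, ROS already provides most of it: message_filters can pair each raw image with the ROIs generated from it by header stamp. A rough sketch, assuming the ROI message (autoware_msgs/Signals today) carries a stamp comparable to the image’s:

import message_filters
import rospy
from sensor_msgs.msg import Image
from autoware_msgs.msg import Signals  # feat_proj's ROI output

def synced_callback(image_msg, roi_msg):
    # The image and its ROIs now arrive together, so a classifier call
    # (sync or async) can carry one shared stamp through the pipeline.
    pass

rospy.init_node("classifier_interface")
image_sub = message_filters.Subscriber("image_raw", Image)
roi_sub = message_filters.Subscriber("roi_signal", Signals)
# slop = maximum stamp difference (s) still treated as the same frame
sync = message_filters.ApproximateTimeSynchronizer([image_sub, roi_sub],
                                                   queue_size=10, slop=0.1)
sync.registerCallback(synced_callback)
rospy.spin()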

Well, now that I’ve written a novel on the subject, please provide feedback and suggestions on how we can make an awesome traffic light detector that works well enough to be used in real traffic!


@JWhitleyWork
Thank you for starting the discussion. I believe a lot of people are interested in the topic.

About approaches 1 & 3: don’t you still need to calculate the ROI from vector_map information even with a DNN? Determining the presence of a relevant traffic light does not ensure that irrelevant traffic lights are not in the image. If the DNN detects multiple traffic lights in the image, you need information to choose which one to look at.

You are correct. That is a fact that I missed. However, we could use the centroid of the initial ROI estimate from the feature projection node to find the closest detected light among those that the DNN node found.

It’s really a trade-off between attempting to make the feature-projection node more robust to differences between the idealized world (vector map) and the real world and providing an alternative to feature projection that is less prone to the same real-world pitfalls as feature projection but needs feature projection as an anchor to the idealized world.
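
The centroid-matching idea above could be as simple as a nearest-neighbour search in pixel space; a sketch (the names and distance gate are illustrative):

def match_projected_to_detected(projected_uv, detections, max_dist_px=100.0):
    # projected_uv: (u, v) centroid of feat_proj's ROI estimate
    # detections:   list of (u, v) centroids found by the DNN detector
    # Returns the closest detection, or None if nothing is plausibly close.
    best, best_d = None, max_dist_px
    for det in detections:
        d = ((det[0] - projected_uv[0]) ** 2 +
             (det[1] - projected_uv[1]) ** 2) ** 0.5
        if d < best_d:
            best, best_d = det, d
    return best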

@JWhitleyWork thank you for your comprehensive analysis of the current status and for providing possible solutions related to the classification of traffic lights.

Here are some of my comments related to your post:

If you agree with me, (A) even with the current ADAS map format, the definition of which traffic light applies to which lane is not well defined. The current format defines a “closest lane ID”, but not exactly which lane it is, and it only allows the definition of a single lane. (I’m not sure about other ADAS formats.)

(B) Just like you mentioned previously, the current definition of a traffic light is ambiguous. It requires searching for individual “lamps” to define a traffic light object. This is not as straightforward as it should be. In my opinion, a new layer defining traffic light objects needs to be added to the ADAS format.

Now, regarding the three proposed solutions: I still think that approach (2) from your list is the most “dynamic”, since different nodes can be used to classify regardless of the DL framework. However, I also agree that it needs to be improved.

In my opinion, the tlr_xxx nodes should only publish an array of results for each image, in a similar fashion to astuff_sensor_msgs/ObjectArray, but containing the traffic light classification results.

Visualization (the superimposed image and markers) should be handled by an independent tlr_visualization node.

I mentioned an array of classification results above, instead of a single result, so the classifier can handle more complex combinations.
Case in point: in central Japan alone, these three types of traffic lights can be found:

(image: three Japanese traffic light layouts with green arrow lamps)
So instead of adding a new class for each combination (RED_GREEN_LEFT, RED_GREEN_RIGHT, etc.), the classifier would just report the most likely classes with their scores, i.e.:

Example:

TrafficLightResultArray[] = { 
[RED, 0.99], 
[GREEN_LEFT, 0.9], 
[GREEN_FWD, 0.9],
[YELLOW, 0.0]
}

would match to:
red_green

I believe this should be able to handle different combinations in different parts of the world.
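
As an illustration of how a consumer might condense such a scored array into a composite label (the threshold and naming scheme here are made up, not a proposal for the actual message contract):

def condense(results, threshold=0.5):
    # results: list of (class_name, score) pairs from the classifier
    active = sorted(name.lower() for name, score in results if score >= threshold)
    return "_".join(active) if active else "unknown"

print(condense([("RED", 0.99), ("GREEN_LEFT", 0.9),
                ("GREEN_FWD", 0.9), ("YELLOW", 0.0)]))
# -> green_fwd_green_left_red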

What do you think?

Is it possible to emit only a single ROI from the feature-projection node via a simple filtering rule, such as FOV cropping or taking the largest ROI?

@amc-nu I agree with your assessment of the shortcomings of the mapping format with regard to traffic lights. However, we are unable to modify the existing ADASMap format (it is Aisan’s proprietary format) which is why Autoware is moving toward a new “Autoware Map” format. This format supports many-to-one relationships between lights and lanes (see page 17 of https://gitlab.com/autowarefoundation/autoware.ai/utilities/uploads/cf3a0b8082aab657d55bd3ea421a0653/AutowareMapsFormat.pdf). I believe the new format also supports what you are suggesting in regard to multiple lights on a signal. @mitsudome-r may be able to comment more on this issue.

I like the idea for tlr_visualization but it will run into the following problem:

  • Camera publishes raw image
  • Feature Projection publishes ROIs (or light positions, as suggested)
  • Classifier consumes raw image and ROIs (or light positions, as suggested) and produces classification (takes unknown amount of time)
  • tlr_visualization consumes raw image and classification

In this scenario, because of the latency introduced by the classifier, tlr_visualization cannot correctly identify which raw image is tied to the classification results. If it publishes the most recently received version of both, it will superimpose old classifications on new images. The time delay won’t be much, but I just wanted to mention this issue.
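
One common workaround is to echo the source image’s header.stamp in the classification result and have tlr_visualization cache recent frames keyed by stamp. A sketch of that cache, assuming the result message is extended to carry the stamp:

from collections import OrderedDict

class ImageCache:
    def __init__(self, max_size=30):
        self.images = OrderedDict()  # header.stamp -> raw image message
        self.max_size = max_size

    def add(self, image_msg):
        self.images[image_msg.header.stamp] = image_msg
        if len(self.images) > self.max_size:
            self.images.popitem(last=False)  # evict the oldest frame

    def lookup(self, stamp):
        # stamp comes from the classification result, so the returned
        # frame is exactly the one the classifier saw (or None if evicted)
        return self.images.get(stamp)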

Publishing a probability with the state would just push the decision logic to the consumer module. This could cause duplication in the case of multiple consumer modules. IMHO, the tl modules are the best place to make a definitive decision and output a definitive result. Please give a possible scenario where this probabilistic information would be useful to a consumer module.

I really like approach 1 for the following reasons:

  1. Simplicity: it simplifies the code and compute graph and reduces large message passing.
  2. Robustness: given all the problems with the current feature extraction node, a DNN approach with some filtering from vector map information should give much more robust results, especially in areas where the vector map is inaccurate or lacking.
  3. Achievability: this seems to be a well-explored approach, with a lot of networks and training data that can be used. https://arxiv.org/pdf/1805.02523.pdf, https://medium.com/@anthony_sarkis/self-driving-cars-implementing-real-time-traffic-light-detection-and-classification-in-2017-7d9ae8df1c58, https://hci.iwr.uni-heidelberg.de/node/6132
  4. Speed: people running Autoware without a GPU should already be in a lot of pain given that the perception and tl detection modules already rely on NNs. A CPU-only simple heuristic node could still be provided by combining the current feat_proj and region_tlr nodes, but I suspect the resulting quality would fit nobody’s needs.

PS: Crazy idea: draw the ROI using the perception node and run a small NN on the cropped ROI, eliminating feat_proj.


@LiyouZhou - I agree with your assessment about passing the probability on to another node for the decision. However, the probabilities would be very useful for training/tweaking, so I don’t think publishing them is a bad thing. We could publish them and just not use them in any downstream nodes.

Regarding approach 1, I would tend to agree with most of your points. However, as far as using the perception node for the light ROI, as @amc-nu mentioned, we need to have some measure of certainty about which light applies to the lane we are currently in, and we can’t determine this without data from the vector map, thus necessitating feat_proj.

which is why Autoware is moving toward a new “Autoware Map” format. This format supports many-to-one relationships between lights and lanes (see page 17 of https://gitlab.com/autowarefoundation/autoware.ai/utilities/uploads/cf3a0b8082aab657d55bd3ea421a0653/AutowareMapsFormat.pdf ).

Actually, we decided at the TSC meeting to use the Lanelet2 format as both the IO format and the internal format. (AutowareMapFormat will disappear.)
However, we would need to add some extensions to the Lanelet2 format so that it has enough information to minimize the degradation of current Autoware functionality. I will post the idea for the extended format on Discourse soon.

I believe the new format also supports what you are suggesting in regard to multiple lights on a signal.

Lanelet2 supports a single lane linked to multiple traffic lights and also multiple lanes linked to a single traffic light.

@LiyouZhou I also agree that it would be simpler to have the tl modules make a definitive decision. However, one of the concerns that @amc-nu mentioned is that having only four states (Stop, Yield, Go, Unknown) is not enough in some situations (at least in Japan).

Even if the lighting pattern doesn’t change, whether you are allowed to “Go” or not depends on which way the vehicle is going. In the following example from @amc-nu’s post, the vehicle is allowed to “Go” only straight or to the left, but not to the right.

Therefore, you would need more than the four states.

Yes, more states should be helpful. Attaching a probability to each state, not so much. I wonder if the following is enough:

  1. The tfd node publishes a light color for going forward, turning left, and turning right, defaulting to red. Hence, tfd would be in charge of condensing the different shapes of lights into these 3 channels.
  2. DecisionMakerNode currently has logic to do straight/left/right recognition using angles. This node can then make a go/hold decision based on the tfd output.
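
A sketch of what that three-channel output and the downstream go/hold decision could look like (the names are hypothetical; a real version would be a ROS message definition):

RED, YELLOW, GREEN = range(3)

class DirectionalLightState:
    # One colour channel per direction of travel, defaulting to red,
    # the safe state, as suggested above.
    def __init__(self, forward=RED, left=RED, right=RED):
        self.forward = forward
        self.left = left
        self.right = right

def go_or_hold(state, maneuver):
    # DecisionMakerNode picks the channel matching the planned maneuver
    channel = {"straight": state.forward,
               "left": state.left,
               "right": state.right}[maneuver]
    return "GO" if channel == GREEN else "HOLD"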

So, I think there are multiple ways to handle the light layout that @mitsudome-r describes above. Here are a couple:

  1. Have the entire light cluster classified by the tlr in one go. This is difficult to implement because of the required number of statuses for the different light combinations and the complete re-training required for the existing neural networks, not to mention the number of images required for each state of each light combination for training.
  2. Create individual, “direction-based” signals. Using the example image from above, create 3 signals with separate classification states for each. We would need to add metadata to the mapping format to indicate the direction of travel that each signal controls and then either have feat_proj choose the correct one based on a “direction-of-travel” input or have a decision node after the classification decide which to use. The upside of this approach is that the existing tlr NNs would only require a bit of transfer learning rather than complete re-training.
  3. Create “direction-based” signals as described in 2 but have the classifier learn to distinguish between the different signal types and produce both a signal type and a state classification.

Thoughts?

I am having a hard time imagining how approach 2 would work. What is the image that the classifier will classify? It would have to be an image of a whole light cluster, i.e. the cluster pictured above. I don’t think there is a way to subdivide this image any further.

Also, what do you mean by “direction-based” signals? Who creates these signals? Do you mean 3 separate ROS topics (left, right, ahead), each separately publishing an [r, g, y] state?

Please clarify.

Maybe I misunderstood the meaning of the lights in your example cluster. I’m not very familiar with traffic lights in Japan. Is each of the arrow lights on the bottom row associated with a single light on the top row, or does the entire top row indicate something totally separate from the bottom row?

@amc-nu will have to correct me, but if I were driving in Japan, I would interpret this as:

  1. The top row is the normal traffic light scenario, i.e. 3 lights with R/Y/G, meaning Stop/Slow/Go for all directions. (I think each position only shows one colour; otherwise, it would be difficult for colour-blind people.)
  2. The bottom row contains modifiers that override the top row instruction. (They are only ever green; they have 2 states: off/green.)

In the above picture:

  1. Stop for all directions.
  2. For a left turn, override to Go. For going straight ahead, override to Go.

Hence, the full state of the signal for all 3 directions can only be obtained after recognizing the whole cluster, so we wouldn’t be able to subdivide the cluster.
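
Under that reading, condensing a whole cluster into per-direction states is a small piece of logic; a sketch, assuming the detector reports the lit top-row colour and the set of lit green arrows:

def cluster_to_directions(base_colour, green_arrows):
    # base_colour:  colour of the lit top-row lamp ("red"/"yellow"/"green")
    # green_arrows: set of directions with a lit arrow, e.g. {"left", "ahead"}
    return {d: ("green" if d in green_arrows else base_colour)
            for d in ("left", "ahead", "right")}

# The picture in the thread: red base, left + ahead arrows lit
print(cluster_to_directions("red", {"left", "ahead"}))
# -> {'left': 'green', 'ahead': 'green', 'right': 'red'}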

It looks like the TLD needs to be modified for geographical regions anyway, e.g.:

  1. In Japan, the green light is blue, so the pre-processing of the raw image needs to account for that.
  2. In the US, there is right-on-red, and the TLD or another node needs to be able to recognize the “No Turn on Red” sign.
  3. In Europe, there is only ever one directional modifier in addition to the main light.

Hence, I think it would be more important to design the topics to be flexible enough to cover most traffic rules. Then the implementation can be swapped in and out for different regions.

WRT the discussion of a detection + classification 2-stage pipeline vs. a single-stage pipeline, here is some research on both approaches:

  1. Joint traffic sign and traffic light detection and classification using a single DNN: https://arxiv.org/abs/1806.07987
  2. A 2-stage pipeline: https://ywpkwon.github.io/pdf/18itsc.pdf

Paper 2 also references a large number of research papers using 2-stage solutions, which use either an image processing technique or a YOLO-like object detection CNN as the first stage.

To offer maximum flexibility, I propose the following:

  • Specify a single-stage TLD. If multiple stages are required, they can be implemented as internal stages of the TLD.
  • feat_proj will continue to exist but will be improved to take into account the pose of the car.
  • The TLD takes the output of feat_proj as a suggestion but doesn’t wholly rely on the map to provide accurate information.
  • The TLD outputs go/slow/stop states for each of the right/left/ahead directions.
  • A decision node will take the output of the TLD and the path planner and issue STOP/GO signals to the vehicle.

In a graph

@startdot
digraph G {
    ROI [shape = diamond]
    raw_image [shape = diamond]
    light_bounding_boxes_and_states [shape = diamond] 
    light_left_right_ahead_state [shape = diamond] 
    planned_path [shape = diamond]
    STOP_GO [shape = diamond]
    Feature_Projection -> ROI
    raw_image -> TLD_BlackBox
    ROI -> TLD_BlackBox
    TLD_BlackBox -> light_bounding_boxes_and_states
    TLD_BlackBox -> light_left_right_ahead_state
    light_bounding_boxes_and_states -> GUI
    light_left_right_ahead_state -> light_applicability_decider
    planned_path -> light_applicability_decider
    light_applicability_decider -> STOP_GO
    subgraph cluster_01 {
        label = "Legend";
        Topics [shape = diamond]
        Nodes
        Nodes -> Topics
    }
} 
@enddot
  • To deal with geographical differences, the TLD node will have different implementations depending on the application.
  • Autoware can provide a generic TLD node trained using a publicly available database. As there are established research projects in this area, a reference implementation can follow one of the papers.

To improve ease of internationalisation, it would be helpful to split the TLD black box into 2 parts:

  1. TLD_Vision_Detector to perform purely the task of detecting the lights from an image.
  2. TLD_Internationalisation to integrate information from other sources and condense vision detections into left/right/ahead light signals.

@startdot
digraph G {
    ROI [shape = diamond]
    raw_image [shape = diamond]
    light_bounding_boxes_and_states [shape = diamond] 
    light_left_right_ahead_state [shape = diamond] 
    planned_path [shape = diamond]
    STOP_GO [shape = diamond]
    Feature_Projection -> ROI
    raw_image -> TLD_Vision_Detector
    ROI -> TLD_Vision_Detector
    TLD_Vision_Detector -> light_bounding_boxes_and_states
    light_bounding_boxes_and_states -> TLD_Internationalisation
    TLD_Internationalisation -> light_left_right_ahead_state
    light_bounding_boxes_and_states -> GUI
    light_left_right_ahead_state -> light_applicability_decider
    planned_path -> light_applicability_decider
    light_applicability_decider -> STOP_GO

    subgraph cluster_01 {
        label = "Legend";
        Topics [shape = diamond]
        Nodes
        Nodes -> Topics
    }
} 
@enddot

@LiyouZhou - In your revised graph above, wouldn’t the purpose of the “TLD_Vision_Detector” only be to verify that the light exists inside the ROI? Doesn’t this then become a two-stage detection system as was described above? I feel like we might be better off offering alternative implementations of the “TLD_BlackBox” described in your first graph for different countries or just flags which set the country to enable/disable functionality within the node. Thoughts?

I think Feature_Projection and ROI can be deprecated. Neural networks are capable of detecting and classifying traffic lights from the raw image alone. I left it in the graph as an “optional” input.

  • TLD_Vision_Detector would only deal with running a neural network and is hence very simple.
  • TLD_Internationalisation would then take multiple signals and discern a left/right/ahead signal.

For example, there could be multiple traffic lights in view of the camera. The Vision Detector would detect all of them. TLD_Internationalisation would then take the one “closest” to the vehicle and output its state.

This way, if special sauce is needed for a particular country, for example right-on-red in the US (default green for turning right), then one need only modify TLD_Internationalisation and not touch any neural network bits in TLD_Vision_Detector.
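
For instance, the US right-on-red rule could be a small hook inside TLD_Internationalisation, leaving the detector untouched (entirely hypothetical names and semantics):

def apply_us_right_on_red(directions, no_turn_on_red_sign_detected):
    # directions: per-direction states, e.g. {"left": "red", "ahead": "red", "right": "red"}
    if directions["right"] == "red" and not no_turn_on_red_sign_detected:
        # right turn permitted after a full stop
        directions["right"] = "go_after_stop"
    return directions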

That is my intention, but I would have no problem if all of this were done as a black box inside a single node.