
Tackling Flaky Tests

In the last ROS 2 TSC call, a question came up about how to deal with the “flaky” behavior of the OSRF buildfarm. At Apex.AI we have, over the last couple of years, developed some tooling and best practices for dealing with “flaky” tests.

Suggestions for addressing flaky tests

  1. Stop calling them flaky tests: they are broken tests
  2. Track broken tests - plotting the number of failures over time for each broken test helps you spot new unstable tests and see how often they fail (a small sketch for extracting these counts from JUnit XML reports follows this list)
  3. Don’t blame CI
    • If the test needs special guarantees (real-time constraints, system resources, etc), then as a test developer, you should be aware of these requirements
      • More often than not, a unit test can be rewritten to not need special guarantees
      • In other cases, it may be possible to configure the CI machines appropriately (e.g. increasing network buffers), but this can only be done if the test developer provides these requirements to the people maintaining the CI machines
    • If special guarantees are needed, then the test likely needs a special test system; these special test systems are usually needed for higher-level tests – performance tests, stress tests, marathon tests, etc.
  4. Triage tests - a broken test shouldn’t prevent others from continuing to develop
  5. Avoid writing tests that are time-sensitive:
    • Any test that requires a sleep is likely a bad test, since, unless you are running on a real-time kernel, there is no guarantee that the sleep time will be respected
    • Tests can be rewritten to make them robust to timing jitter (e.g. https://github.com/ros2/rcutils/pull/167); see the polling sketch after this list
  6. Isolate tests:
    • Especially when running integration tests of ROS 2 nodes, make sure you aren’t getting messages from other tests!
      • This can be done by setting the ROS_DOMAIN_ID, for example using the domain_coordinator (https://github.com/ros2/ament_cmake_ros/tree/master/domain_coordinator) – a simplified sketch of the idea follows this list
      • You can also make sure that tests use unique topic names: e.g., demo_nodes_cpp and demo_nodes_cpp_native used to have talkers with the same topic name, which resulted in cross-talk between the two packages’ tests
    • Make sure that two Docker containers on the same test machine can’t talk to each other either – this is more of a paranoia measure, but it is easy to configure
    • Another option is to run tests sequentially, but this really just masks the problem rather than fixing it
  7. Understand what you are trying to test, and write the test accordingly
  8. Gather as much information as possible from the CI machine:
    • We have a background job that collects CPU/memory statistics while the tests run (see the monitoring sketch after this list)
    • We collect coredumps
  9. Provide ways to reproduce the environment as closely as possible:
    • For example, we have ade debug (in ade-cli) which will download the artifacts from a failed pipeline, and start a docker environment with the exact same build artifacts
    • ade debug --connect me@my-rpi can be used to reproduce aarch64 setups
  10. Run the tests multiple times and with stress:
    • The main difference between your local machine and the CI machine is that the CI machine is running a lot more tests, a lot more often; a small rerun-loop sketch is included after this list
  11. If all else fails, spend time to understand why the test is broken:
    • Unfortunately, some of these failures are hard to reproduce locally, and the only option is to look at the code
    • Read the code critically:
      • What assumptions is it making?
      • What if you added random long sleeps in the middle of the test?
      • Are there possible race conditions?
      • Is everything getting initialized properly?
      • Is everything getting cleaned up properly?
      • etc.
    • All of this is time-consuming, and there is usually a combination of hubris (“I write perfect code”) and laziness (“It’s more interesting to write code than to read it”) that needs to be overcome
  12. The disk performance of cloud machines is unstable, so to rule out flakiness caused by slow disk I/O, we mount some essential directories in memory.
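
A few minimal Python sketches to make some of the points above concrete (these are hypothetical illustrations, not our actual tooling). For point 2, one way to get per-test failure counts is to parse the JUnit XML reports a CI run produces; appending these counts to a per-test time series gives you the failures-over-time plot:

```python
# Count failures per test case across all JUnit XML reports under a directory.
# The layout assumed here (testcase elements with failure/error children) is
# the standard JUnit XML format that gtest and pytest can emit.
import sys
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_failures(results_dir):
    failures = Counter()
    for report in Path(results_dir).rglob("*.xml"):
        try:
            root = ET.parse(report).getroot()
        except ET.ParseError:
            continue  # skip files that are not valid XML
        for case in root.iter("testcase"):
            if case.find("failure") is not None or case.find("error") is not None:
                failures[f"{case.get('classname')}.{case.get('name')}"] += 1
    return failures


if __name__ == "__main__":
    for name, count in sorted(count_failures(sys.argv[1]).items()):
        print(f"{count}\t{name}")
```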
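
For point 5, the usual fix is to poll for the expected condition with a generous deadline instead of sleeping for a fixed amount of time. The Worker class below is a toy stand-in so the sketch runs under plain pytest; wait_until is the pattern that matters:

```python
# Replace "sleep then assert" with "poll until the condition holds or a
# deadline passes". Worker is a made-up asynchronous component under test.
import threading
import time


class Worker:
    def __init__(self):
        self._done = threading.Event()

    def start(self):
        # Simulate asynchronous work finishing after ~200 ms.
        threading.Timer(0.2, self._done.set).start()

    def result_ready(self):
        return self._done.is_set()


def wait_until(predicate, timeout=5.0, poll_interval=0.01):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval)
    return predicate()  # one final check after the deadline


def test_worker_produces_result():
    worker = Worker()
    worker.start()
    # Fragile: time.sleep(1.0); assert worker.result_ready()
    # Robust: tolerate scheduling jitter by waiting up to a generous deadline.
    assert wait_until(worker.result_ready, timeout=10.0)
```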
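
For point 6, the essential idea behind domain isolation is to give each test process its own DDS domain before any ROS 2 node is started. The sketch below just derives a domain ID from the PID, which can still collide; the domain_coordinator package linked above reserves IDs properly and should be preferred:

```python
# Give this test process its own DDS domain so integration tests cannot
# cross-talk. Deriving the ID from the PID is a simplification; use the
# domain_coordinator package for real coordination between test processes.
import os


def isolated_domain_id():
    # ROS 2 domain IDs should stay in a safe range (roughly 0-101 on most
    # platforms), so fold the PID into 1..100.
    return os.getpid() % 100 + 1


# Must be set before rclpy is initialized or any ROS 2 process is launched
# from this test, since the middleware reads ROS_DOMAIN_ID at startup.
os.environ["ROS_DOMAIN_ID"] = str(isolated_domain_id())
```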
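
For point 8, the background resource monitor can be as simple as a thread sampling psutil while the test suite runs; the file name and sampling interval below are arbitrary choices:

```python
# Sample CPU and memory usage in the background while tests run and write the
# samples to a JSON-lines file for later inspection.
import json
import threading
import time

import psutil


def monitor(stop_event, path="resource_usage.jsonl", interval=1.0):
    with open(path, "w") as out:
        while not stop_event.is_set():
            sample = {
                "time": time.time(),
                "cpu_percent": psutil.cpu_percent(interval=None),
                "mem_percent": psutil.virtual_memory().percent,
            }
            out.write(json.dumps(sample) + "\n")
            out.flush()
            stop_event.wait(interval)


if __name__ == "__main__":
    stop = threading.Event()
    sampler = threading.Thread(target=monitor, args=(stop,), daemon=True)
    sampler.start()
    # ... run the test suite here, e.g. subprocess.run(["colcon", "test"]) ...
    time.sleep(5.0)  # placeholder for the test run
    stop.set()
    sampler.join()
```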
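
For point 10, rerunning a suspect test many times, ideally while something like stress-ng loads the machine, is often the fastest way to reproduce a failure locally (ctest's --repeat-until-fail option serves a similar purpose). A trivial rerun loop:

```python
# Run a test command repeatedly and count how often it fails.
# Usage (hypothetical path): python rerun.py ./build/my_pkg/test_my_feature
import subprocess
import sys


def rerun(cmd, iterations=100):
    failures = 0
    for i in range(iterations):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures += 1
            print(f"run {i}: FAILED", file=sys.stderr)
    return failures


if __name__ == "__main__":
    failed = rerun(sys.argv[1:], iterations=100)
    print(f"{failed}/100 runs failed")
```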

This article has a nice description of the different properties of good tests: https://medium.com/@kentbeck_7670/test-desiderata-94150638a4b3

JP Samper & Fadi Labib

This is excellent! Thank you for writing up the process. The big problem with flakiness is, of course, that it makes CI uninformative and unactionable. If the user expects green CI, then red CI or a newly failed test is an obvious issue. But if CI is always red or yellow, there’s not an obvious way to know when a change is concerning.

  1. How does Apex track when a failure is handled? Is there a way to make it visible from the CI status report whether a failure has an owner (e.g. a name, a link to a GitHub issue, or an in-progress PR)? It’s often impossible to know even what repo the failure is in, let alone whether a failure has a known cause.

  2. How does Apex suppress or deal with non-test failures? For example, persistent failures that are picked up by warnings plugins and not associated with a test case, like those mentioned in https://github.com/ros2/rclcpp/pull/1047#issuecomment-613499118?