Design By Contract

Quick replies:

  • pyros-test is MIT or BSD, something along those lines; I just haven’t taken the time to put a license file there… I’ll try to do it soon.
  • Python makes things simpler than C++, as there are de facto standard test frameworks. So I am trying to support whatever basic Python and ROS support (unittest, doctest, and nose) plus pytest. By the way, unittest already includes a mock library in Python 3, which is just the mock library from Python 2.
  • pyros-test is currently very, very simple (probably too simple to need a separate package), and I’d like to eventually improve it when I get the chance…

But IMHO you’re probably better off extracting what I already started in https://github.com/pyros-dev/pyros-msgs and https://github.com/pyros-dev/pyros-schemas: have a look at property-based testing and hypothesis; it should be simple enough to automatically generate fake nodes based on an existing message definition and then send fake messages around :slight_smile:.

Some examples of hypothesis use here: https://github.com/pyros-dev/pyros-schemas/blob/nested_merged/tests/test_pyros_schemas/hypothesis_example.py and an example of generating messages with it there: https://github.com/pyros-dev/pyros-schemas/blob/nested_merged/tests/test_pyros_schemas/test_basic_fields.py
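
For instance, a rough, untested sketch (not taken from those repos; it assumes hypothesis and geometry_msgs are available) of letting hypothesis generate bounded message field values and checking a property on the resulting message:

```python
from hypothesis import given, strategies as st
from geometry_msgs.msg import Twist

# Strategy for bounded, NaN-free velocity values (the bounds are made up).
velocities = st.floats(min_value=-1.0, max_value=1.0,
                       allow_nan=False, allow_infinity=False)

@given(linear_x=velocities, angular_z=velocities)
def test_generated_twist_stays_in_range(linear_x, angular_z):
    msg = Twist()
    msg.linear.x = linear_x
    msg.angular.z = angular_z
    # In a real test the message would be sent to the node under test and
    # the node's output checked against the expected property.
    assert -1.0 <= msg.linear.x <= 1.0
```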

By the way, roschaos looks fun, and there is probably a clever way to integrate it with roslaunch or feed it launch/test files… I’ll play with it when I get some time.


Wow, a ros-hypothesis could be very powerful and effective. I knew about property-based testing (e.g. RapidCheck for C++) but had not thought about adapting it to ROS so far. Creating a framework for property-based ROS node testing could be a hard task, I guess. What use cases are you thinking about exactly? I would love to contribute to it :blush:

While I’m a big fan of automatic checking, I wonder which problem(s) this proposal tries to solve.

This is not to say there are no problems. I just think it would help the discussion a lot if we knew what people here are interested in, in terms of the outward behavior of the system.

The proposed contracts relate to a) rates and b) response times. From my own work, I know that rate information is necessary but not sufficient for determining whether a system can have sampling effects. I am also strongly of the opinion that rates should be a property of a system, not a component.

I don’t see much utility, but a great deal of problems, in specifying response times for a distributed system, particularly one with very little real-time support, such as ROS or ROS 2. If you really care about that, put your stuff into one process and ensure response times by the usual means, which have little to do with ROS.

Initially, my intention in starting this thread was to discuss whether and how formal specification and formal verification of ROS 2 node/nodelet interfaces by means of DbC could be applied to and implemented in ROS 2. I am interested in DbC because it has the big advantage that it could speed up the integration of ROS 2 nodes/nodelets within a system, since it prevents struggling with bugs related to the interface-based interaction between nodes/nodelets. However, I do not consider DbC a measure for formal verification in the testing context, but in the debugging context instead, because it can hardly be accurate enough w.r.t. timing, as you said, especially in real-time systems. (I do not know of any tracing tools for ROS, which are the usual tools for real-time-related verification.) In the debugging context a comparably rough estimate is often sufficient.

Assuming “system” means the overall sum of ROS nodes/nodelets, this thread is not about the “outward behavior of the system” but only about its “internal” integration. If your system exposes ROS interfaces which interact with another system, DbC could address the “outward behaviour” of the single sub-systems.

As DbC would be hard to implement in ROS 2, and its benefits are not considered relevant enough in comparison to other quality-improving measures, like developing and using tools such as a “ROS Simian Army”, the thread turned towards the question of which tools are missing and could help verify a ROS-based system. (Not in terms of verifying real-time behaviour.)

For me the proposed contracts are more about valid and invalid values/value ranges of the node/nodelet interfaces, such as topics.

However w.r.t. rates and response times I would consider different “classes”:

  1. the “incoming” rates a node/nodelet expects to receive from other nodes, the node’s/nodelet’s “outgoing” rates which other nodes expect, and the response time of a single node
  2. the rates and response time of a component
  3. the rates and response time of a system

If 1) the nodes/nodelets do not satisfy “rough” rate or response time requirements, there is a pretty good chance that 2) the component or 3) the system will not behave as you would like it to either.

“I am also strongly of the opinion that rates should be a property of a system, not a component.” → People with a background in safety-critical, real-time embedded system development might disagree here.

If we care about rates and response times, we already put nodelets into a single process where possible. (If nodes are distributed over different machines, I do not know of any way to improve rates and response times anyway.)

Can you give one or more examples of the kind of interface bugs you mean?

In my experience, the most frequent issue – at least initially – with connecting nodes is that something is not connected, because of a wrong name. The next most frequent seems to be spelling and/or range issues in parameters.

I wouldn’t call these interface issues as such; or at least, we don’t need new specs for that, as it would be sufficient to actually check the current ones.

Have you seen my talk at ROSCon? It’s not specifically about that, but I mention how we used LTTng to trace messages and callback invocations. This is in the soon-to-be-released tracetools package.

Aren’t those specs strongly related to your requirements or, in other words, to the outward behavior your system is expected to have? For example, when I drive at 1 m/s in a warehouse, I might have different requirements regarding pose update rate than when I drive at 50 km/h through city traffic.

Hmm, most of what I know about embedded systems comes from the automotive people here at Bosch, and they are very concerned with safety. I might have got it wrong, of course, but given how often we’ve internally talked about this, I would be surprised.

What is true is that rates are often specified at the task level, with callbacks assigned to a task according to their rate needs, but that is an implementation aspect which most people actually don’t like, yet accept as the way things are, unfortunately, implemented.

Anyway, in most cases, they are not interested in rates at all, only in response time, and that’s again a system-level aspect.

The response time of the system can be broken down into the response time of each stage of the processing pipeline’s longest path. Thus you need to know what the response time patterns of the nodes in that path look like. If we are dealing with a hard real-time system, then it should be possible to define, for a given execution environment, what response time requirement the node is capable of satisfying, which would be useful information for a system integrator. But it seems to me that this is a very specific requirement, as it would be tied completely to your specific environment, including the other nodes and how they execute (e.g. does one of them tend to hog the CPU a little?). So after @iluetkeb’s talk and speaking to him in person, I think that he’s right about the need for tools that make it really easy to understand these sorts of properties of nodes once they are installed in your system, rather than a way to specify something like a maximum update rate capability as part of a node’s interface. That seems like a figure that would change too much between execution environments, based on things as obvious as CPU speed and as esoteric as the structure of the disc controller.

One example: you integrate two nodes (“a” and “b”) which have not been tested before (or not well enough). The first node, “a”, publishes a topic, and the second one, “b”, subscribes to it and publishes a topic of its own as well. You know that the message values of the subscribing node’s (“b”) topic may never be outside a valid value range. During integration something goes wrong, and you locate the root cause in wrong message values published by node “b”. You do not know why exactly those values are invalid. The cause could be invalid message values which “b” received from “a”, or a wrong implementation of “b” itself which does not prevent publishing invalid values. With a DbC mechanism, violations of such contracts would be reported during node integration, even before any integration issue is discovered at all. (Integration issues could occur only in rare cases, stay undetected during integration, and pop up in the field for the first time.)
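
A minimal, hypothetical sketch of what I mean by such a contract check (rospy and std_msgs assumed; the topic names, the value range, and the processing function are made up): node “b” checks a precondition on what it receives from “a” and a postcondition on what it publishes, so a violation is reported at the offending interface instead of surfacing later as an obscure integration issue.

```python
#!/usr/bin/env python
import rospy
from std_msgs.msg import Float64

VALID_RANGE = (0.0, 100.0)  # assumed contract for both topics

def require(value, low, high, label):
    # Report a contract violation instead of silently passing bad data on.
    if not (low <= value <= high):
        rospy.logerr("contract violation (%s): %f not in [%f, %f]",
                     label, value, low, high)

def process(x):
    return x * 2.0  # placeholder for b's actual computation

def callback(msg):
    require(msg.data, VALID_RANGE[0], VALID_RANGE[1], "precondition on input from a")
    out = Float64(data=process(msg.data))
    require(out.data, VALID_RANGE[0], VALID_RANGE[1], "postcondition on b's output")
    pub.publish(out)

if __name__ == "__main__":
    rospy.init_node("b")
    pub = rospy.Publisher("b_out", Float64, queue_size=10)
    rospy.Subscriber("a_out", Float64, callback)
    rospy.spin()
```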

As far as I know, the only way to detect unconnected nodes which should be connected is to use rqt_graph. (Mismatched topic types are even harder to detect and require looking for a missing Connections: section when using rosnode info <node>.) Does anyone know how to check issues like that in an automated fashion?

Aren’t rostests with paramtest test nodes the candidate to prevent range issues in parameters?

Right. (It was never about interface issues in terms of the ROS implementation, just in terms of its usage.)

Not yet. I was looking for exactly something like LTTng. Will there be open sourced tools in addition to LTTng?

If the same node is used in different applications and/or environments, that’s true; the requirements could be classified via use cases (here: warehouse, car) and the checks parametrized accordingly. (Application-specific: a warehouse robot should never drive faster than 1 m/s. Environment-specific: none of your self-driving cars should drive faster than 50 km/h in the city or 200 km/h on the autobahn.)
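
As a hypothetical illustration of such a parametrized check (rospy assumed; the topic, parameter name, and limits are made up), the same contract could read its bound from the parameter server per deployment:

```python
import rospy
from geometry_msgs.msg import Twist

def check_speed(msg):
    # The contract bound comes from the parameter server instead of being
    # hard-coded, e.g. 1.0 m/s (warehouse) or 13.9 m/s (~50 km/h, city).
    max_speed = rospy.get_param("~max_speed", 1.0)
    if abs(msg.linear.x) > max_speed:
        rospy.logwarn("commanded speed %.2f m/s exceeds contract limit %.2f m/s",
                      msg.linear.x, max_speed)

if __name__ == "__main__":
    rospy.init_node("speed_contract_monitor")
    rospy.Subscriber("cmd_vel", Twist, check_speed)
    rospy.spin()
```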

I do not know about the automotive sector at all, only the industrial automation sector (hard real-time, safety-critical, up to IEC 61508 SIL 3). However, I know for sure that people use trace tools at the RTOS level, like Trace for FreeRTOS/SafeRTOS, and at the C function level, like microtrace, for function runtime and response time analysis. However, it’s possible that this is only done in cases where the system’s overall response time cannot be determined using measurements at the system level (e.g. error-case handling, which can be hard to force at the system level in some cases).

Okay, range specs would certainly be interesting.

I also once had a case with parameters, where the combination of two parameters resulted in an out-of-range condition.

You can use “roswtf” at runtime; it will report subscriptions that have no publisher, and also type mismatches.

The thing is, you never know whether a subscription might be optional, and there are also cases where either one or the other subscription can be used, but not both.

Again, I agree this would be useful to check. We’ve so far called such things “graph consistency checks”.

The video link is here: ROSCon 2017 Vancouver Day 2: Determinism in ROS – or when things break (sometimes) and how to fix it… on Vimeo

I have since published the related code at https://github.com/boschresearch/ros1_tracetools, but be aware that it’s not very useful without our instrumentation of roscpp. I’m also in the process of publishing that, but I’m not quite there yet.

Well, yes, you could, but that’s why I would prefer to put those specs (at least in part) onto the resulting system.

Right, those are typical cases for combinatorial ROS node unit tests. However, it can often be hard to guess the right sample combinations of input data (parameters, received topic messages, service requests, action requests) for tests in advance. It can be easier to define property-based tests which are set up to execute with many combinations of input data, and to decide later whether the input data of failing tests is actually invalid or ok (and exclude it from further test executions to prevent false positives, if applicable). However, I do not know the current state of property-based unit testing of nodes in ROS. Some of @asmodehn’s projects already use hypothesis, which is a property-based testing framework for Python, but I don’t know how it is used in detail yet.
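
To make the idea concrete, a hypothetical two-parameter property test in that spirit (hypothesis assumed; the parameter names, bounds, and the checked function are made up), where the tool explores parameter combinations instead of hand-picked samples and so can catch cases where only the combination is out of range:

```python
from hypothesis import given, strategies as st

def resulting_rate(base_rate, multiplier):
    # Stand-in for the node's actual parameter handling.
    return base_rate * multiplier

@given(base_rate=st.floats(min_value=0.1, max_value=100.0, allow_nan=False),
       multiplier=st.floats(min_value=0.1, max_value=10.0, allow_nan=False))
def test_parameter_combinations_stay_in_range(base_rate, multiplier):
    # The combined result must stay within an assumed valid range.
    assert 0.0 < resulting_rate(base_rate, multiplier) <= 1000.0
```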

Thanks for that hint. (I thought about wrapping some ROS command line tool functionality for finding graph inconsistencies into a ROS node, running that node during node integration, and letting it assert with error log messages if the graph is inconsistent.)
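
Roughly what I had in mind, as an untested sketch (ROS 1 rosgraph API assumed; this covers only the missing-publisher case that roswtf reports):

```python
import rosgraph

def unconnected_subscriptions(caller_id="/graph_check"):
    # Ask the master for the full graph state and return subscribed topics
    # that currently have no publisher at all.
    master = rosgraph.Master(caller_id)
    publishers, subscribers, _services = master.getSystemState()
    published = {topic for topic, _nodes in publishers}
    return [(topic, nodes) for topic, nodes in subscribers
            if topic not in published]

if __name__ == "__main__":
    for topic, nodes in unconnected_subscriptions():
        print("no publisher for %s (subscribed by %s)" % (topic, ", ".join(nodes)))
```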

Thanks a lot. I am looking forward to it.

Right. Makes sense.

Hi @iluetkeb, the ros_comm branch kinetic-devel is Bosch CR’s instrumented version of ros_comm, I guess.

Yes, though the plan is to make a PR soon.


In case someone is still interested in Design by Contract in ROS 1, feel free to check out my spare-time project rosdbc. (Don’t expect fast progress.)


In the web development domain, Consumer Driven Contracts (Design by Contract for microservice architectures) have been considered best practice for quite a long time (consider the web post “Consumer Driven Contracts” from 2006). Most people developing microservices try to limit end-to-end testing to the absolute minimum because such tests are “difficult, slow, brittle, and expensive” and stick to Service Component Tests and Service Integration Contract Tests instead.

These test patterns don’t depend on instrumentation. Other patterns, like some of the observability patterns listed on microservices.io, don’t depend on instrumentation either and could be (conceptually) adapted to ROS 2 as well, and made available e.g. as packages to ease the implementation of these concepts and avoid copy&paste-style boilerplate (compare Spring Cloud and the Microservice chassis pattern).

There are other well-known concepts, like Distributed Tracing, which do require instrumentation. There is probably some potential for adapting this concept to ROS 2 as well, if it is not already being considered. (As instrumentation is hard to add later, it usually needs early attention.)

But it is probably better to open one or several new threads for these topics, right?

From my very personal point of view, software in robotics is moving slower than software in general… and before integrating new concepts into ROS, I would like to see ROS use what is already there in a better way.

In microservices, in the Python case like your example, developers work hard to write lots of tests and achieve high coverage, under high load, primarily because their business value is linked to the site never being down.

It is something I have yet to see in robotics, probably because the internals of a robot are much more opaque to the end user…

Python has a lot of tools and best practices that are not even followed in ROS.

So I think we need to follow what exists in the language we use and, where needed, add instrumentation and reporting tools to get a better idea of the quality level of robot software. We need more people to be able to “see” the quality before we can get the resources to address it.

So yep, a thread on specific test tool integration or Python process conformance might be useful :wink:

I agree.

Right. Reliability (like the other quality attributes) is business-driven, which means money-driven. If unreliable software can get very expensive, as in the “finance” web domain, people start to implement things to prevent unreliable software.

If no one complains, nothing will change. There is a well-known statement in agile software development: “let everyone feel the pain”… in the end, in terms of money.

I think we could prevent a lot of future work if we

  • consider some concepts now,
  • think about what they depend on (from a design/implementation point of view), and
  • address these dependencies (design/implementation) asap if the effort is reasonable.

Relating to this and the rest of your post, a robot can be a very opaque and complex system, and for a complex ROS-based robot there may be many nodes with a lot of data flying around. Making it easy to see where an error originated could go a long way towards helping integrators find nodes with quality problems so those can be reported. This means making it more obvious why a node process died, for example. Was it because of a logic error in the node, or because another node published some bad data? (In the latter case, both nodes need to be fixed.)

We already discussed the idea of having a “chaos node” send random messages (with content following the message format) to existing topics and services, which helps discover “unsafe nodes” during development.

But here I notice your focus on post-mortem analysis of “systems in production”. Maybe we could parametrize the “chaos node” so we could also run it in production, just like server farms run some potentially destructive tests in production, but when demand is low?
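
A very rough, hypothetical sketch of such a chaos node (rospy assumed; the topic, message type, and default rate are made up), where the rate parameter is what would be tuned down or scheduled for low-demand windows:

```python
import random
import rospy
from std_msgs.msg import Float64

if __name__ == "__main__":
    rospy.init_node("chaos_node")
    # Publish well-formed messages with arbitrary content within the
    # declared field type, to a topic chosen via parameter.
    pub = rospy.Publisher(rospy.get_param("~topic", "some_topic"),
                          Float64, queue_size=10)
    rate = rospy.Rate(rospy.get_param("~rate", 1.0))
    while not rospy.is_shutdown():
        pub.publish(Float64(data=random.uniform(-1e6, 1e6)))
        rate.sleep()
```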

Post-mortem also means the system is running in production but crashes, and we want to know what happened. I am not aware of any tools for this purpose yet… Each node can use the tools of the programming language it was written in, but as for the communication between nodes, maybe we should have a library bagging each message and keeping it for a certain amount of time? Some instrumentation that node writers could add to their code, so that when a node crashes, we can get the log of all messages received (and sent) by that node?

Is anyone aware of such a lib/module, or maybe it’s just an extra feature we can add to existing core ROS libs?

We could use rosbag for this purpose, except that instead of storing the topics in a file, it would store them in a buffer (which might be a file) of fixed size/time duration.
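
Something like this untested sketch (rospy assumed; the topic, message type, and buffer size are made up): keep only the last N messages of a monitored topic in a ring buffer and dump them when the node shuts down, instead of writing a full rosbag.

```python
import collections
import rospy
from std_msgs.msg import Float64

BUFFER = collections.deque(maxlen=1000)  # fixed-size ring buffer

def record(msg):
    BUFFER.append((rospy.get_time(), msg.data))

def dump(reason="shutdown"):
    # On shutdown (or crash handling), print the retained message history.
    rospy.logwarn("dumping last %d messages (%s)", len(BUFFER), reason)
    for stamp, data in BUFFER:
        rospy.loginfo("%.3f %s", stamp, data)

if __name__ == "__main__":
    rospy.init_node("flight_recorder")
    rospy.Subscriber("b_out", Float64, record)
    rospy.on_shutdown(dump)
    rospy.spin()
```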

In the worst-case scenario of a seg-fault, we can’t get all the messages that the affected node published/received, but we can certainly get the messages from the “publisher queue” out of the memory dump (if we know where to look). Getting subscriber messages might also be possible in a similar manner (after changes to the current subscriber code). The bigger issue would be finding where the relevant memory sections are in the dump. It would require some tooling; off the top of my head, something like a gdb plugin such as CfViz.

Maybe we could run the chaos node and the modified rosbag-style logging node together. We could check the list of nodes at a user-dictated frequency (similar to rosnode list) and permanently store only those messages which originate on topics subscribed/published by a node going off the radar without publishing on a certain topic (only because ROS 2 brings in the concept of managed nodes).
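
A very rough sketch of just the polling part (ROS 1 rosnode Python API assumed; deciding which buffered messages to keep and the managed-nodes aspect are left out):

```python
import time
import rosnode

def watch(period=1.0):
    # Poll the node list at a user-dictated frequency and report nodes that
    # vanish; that is the point where their buffered topics would be kept.
    known = set(rosnode.get_node_names())
    while True:
        current = set(rosnode.get_node_names())
        for gone in sorted(known - current):
            print("node %s went off the radar, keep its buffered topics" % gone)
        known = current
        time.sleep(period)

if __name__ == "__main__":
    watch()
```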

Not just in production, but in CI as well. For example, look at the CI system that Fetch Robotics presented at ROSCon 2016. They record all the data they can on the robot so they can analyse it if an error occurs during a test run. Widely available tools to make analysis of this data straightforward, such that they rapidly direct the developer back from the point of the error through the causality chain, would be immensely useful. A lot of this is about instrumentation, but a lot of it is also about tools to analyse the data that comes out of that instrumentation.