Design By Contract

Types/Contracts : In my mind, ‘Design by Contract’ was an informal concept/practice introduced a few years ago because the type systems of most languages at that time were insufficient to guarantee correct program behavior. But it is fundamentally the same thing…
Except that we have researched type theory for a while now, whereas ‘contract theory’ is probably not what you would expect after learning about DbC…

These days I am following dependent types and experiments to bring them into distributed systems.


That summarizes the difference between types and contracts, and it is exactly why I didn’t propose types here: in my experience, the hard-to-find defects tend to have their root cause in implicit, incomplete or missing definitions of dynamic characteristics, here in ROS those of node interactions.

Unfortunately that is exactly what I found out when looking into the ROS2 sources. One could add deadlines for topics that (a) do not change or (b) do change over the node’s runtime; they could be (a) defined and/or (b) updated via the rmw C API, which wraps the DDS DynamicData API or the statically generated DDS functionality from the IDL definitions. However, as you said: the interface considers only the aspects of the message description language IDL, not a node description language. And a node description language would be required to add the functionality that would be most beneficial.

That’s right. The question should be: “How can I avoid introducing defects into, or detect defects in, distributed ROS systems which have their root cause in the dynamic interaction of several ROS nodes?” I am biased and did not propose physical continuous integration because that seems hard to implement for distributed systems. DbC, or actually model checking based on a kind of node description language, seems cheaper to me.

I might be stating the obvious here, but still worth reminding everyone I think…

How can I avoid introducing defects into distributed ROS systems which have their root cause in the dynamic interaction of several ROS nodes?

  • Don’t build a distributed (==multiprocess) system if you don’t have to. Programming language elements (functions, classes, libraries, packages) are made for composing correctly in all sorts of ways, and there is usually theoretical background, tooling, conventions and processes to help you satisfy the cognitive biases you didn’t know you had. No distributed software system that allows you to control the distribution graph has anything equivalent to that currently. ROS is no exception (actually Erlang might be the only exception).
    Example : A whole part of operating system design is about preventing interactions between processes, and most recent OSes are preemptive. This is the opposite of what suits a distributed system, which by definition needs process cooperation, and where controlling when each process can or cannot be interrupted is really useful. In one process, in one language, all these problems vanish.

  • If you have to build a distributed system, congratulations, you are doing distributed system research. This is not robotics and there is a different set of assumptions coming along in that context.
    Example : most existing and widely-used distributed software systems rely on the fact that a message, a unit of computation or “task”, is atomic and idempotent. That requirement usually cannot be met on a robotic platform, because of side effects on the real world, which are the whole point of it. Painful lesson after a year working on asmodehn/celeros (Celery ROS Python interface).

For the rest of us who have to do both distribution and real-world side effects, I feel the most promising way is still integrating side effects into the theory. But, as far as I know, that is still a software research topic in its own right.

Regarding ROS, the best bet is likely to integrate/interface/implement ROS with an existing programming language that provides the features you need, instead of trying to integrate “that awesome language feature” into ROS (because that implies re-implementation and proactive maintenance by the ROS community for something that is not purely robotics related).

For DbC, I’m thinking that if you get around to implementing an Eiffel-based ROS interface/integration/implementation, you might find some interesting changes needed in ROS itself, even in REPs, in order to make that possible without compromising Eiffel. I’m thinking these changes would likely be worth it for ROS, especially in the long run.
Disclaimer: I’m currently following the same path with Python, improving how ROS integrates with it along the way, and finding basic problems where I didn’t expect to…

But ultimately, writing a “solid” software project is a matter of computer science and software engineering expertise, so general software theory, knowledge and tools apply there. It’s not a problem specific to robotics, and therefore robotic science and tools (like ROS) are not focusing on it.

ROS is a multi-process system on the robot/(multi-)processor level itself. I guess you mean a multi-device system rather than a multiprocess system. Right, distributed systems are beyond robotics. However, robotics and distributed systems have already merged into distributed robotics (Kiva robots in an Amazon warehouse in 2011). Isn’t it time to ease the development of such systems (and single robots) by providing better framework capabilities?

As Amazon already did non-real-time distributed robotics in 2011, I think considerations w.r.t. preemption and inter-process interaction beyond the robot level are more critical in real-time (soft/firm/hard) applications.

I would argue that if a framework implementation lacks conceptual features, this cannot be compensated for by choosing suitable programming languages, which address lower levels of abstraction only. But you are right, it is better to improve a framework w.r.t. its given capabilities instead of trying to integrate language features (at least in the short term).

Instead of waiting for more capabilities in ROS2 it is probably more valuable to add more capabilities to the current ROS1 framework. Without DbC on the ROS node level one can verify the ROS node interface with rostest. Currently the set of reusable test nodes is limited. What about adding more generic test nodes like fake topic publishers (draft state) to ros_comm? (The example tests can be run with catkin_make run_tests_rostest_rostest_test_faketopicpublisher.test and catkin_make run_tests_rostest_rostest_test_faketopicpublisher0.test in the catkin workspace. Quick start guide about the other generic test nodes of rostest.).

Actually I do mean multiprocess. But I do not mean “We cannot make it work/do what we want” or even “We cannot make it do what we want all the time”. I mean “We cannot be sure that it will never do things that weren’t intended”.
And I know for a fact that there are more robots out in the world doing unexpected dangerous things, because they weren’t programmed with total safety in mind, than most people know about. All it takes is an unchecked integer wrapping around, and disaster strikes in the real world. Make it distributed, and the likelihood of disaster increases exponentially.

So sure, we can build robots and distributed systems, but when it comes to a robot that can poke a child in the eye because the eye looks like a button to press, it’s different from a harmless backend database cluster in the basement, so you had better be sure of what you’re programming… and for distributed (multiprocess) systems the theory is quite new, so most languages/frameworks won’t help you there.

Definitely yes, but instead of adding potentially heavy features without being sure they will be used and maintained, I would first focus on doing something like https://jepsen.io/, that is, provide tools that show people working in robotics what and where the problems are in the systems they build. Actually, probably doing the same as what works for security hackers: tell people their system is broken/unsafe, and nobody cares. Make anyone (including their customers) able to break it, and then they react… and some might listen.

I personally fully agree with this statement, but I think most people are focusing on ROS2 these days, which means even less maintenance resource for ROS1, so we need to be careful that what we add is really worth it.

I would also agree there. You can always send a PR to add the test nodes you are missing to rostest, and discuss it with the maintainers :slight_smile:
And you can also write a package for the specific nodes you need. I started doing that for my own needs in pyros-dev/pyros-test (test package for Pyros).
But these days I am thinking we need something more like a ROS Simian Army :

  • some package that randomly kills and restarts nodes, probably based on launch files…
  • some package that randomly sends messages around, like a ros-hypothesis that would generate any valid message based on a ROS definition, to test your nodes against. I have already implemented most of this one as part of other projects, but I still need to make it a package of its own, whenever I get the time and motivation…
  • probably a few more…
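
The second idea in the list above (generating arbitrary valid messages from a definition) can be sketched with the standard library alone. This is only an illustration and not the pyros implementation: `FIELD_GENERATORS`, `parse_definition`, `random_message` and the example definition are hypothetical names, only a handful of ROS primitive types are covered, and a real tool would more likely build hypothesis strategies than call `random` directly.

```python
import random

# Hypothetical mapping from a few ROS primitive field types to value generators.
FIELD_GENERATORS = {
    "bool": lambda rng: rng.choice([True, False]),
    "int8": lambda rng: rng.randint(-2**7, 2**7 - 1),
    "uint8": lambda rng: rng.randint(0, 2**8 - 1),
    "int32": lambda rng: rng.randint(-2**31, 2**31 - 1),
    "float64": lambda rng: rng.uniform(-1e6, 1e6),
    "string": lambda rng: "".join(
        rng.choice("abcxyz ") for _ in range(rng.randint(0, 8))),
}


def parse_definition(text):
    """Parse '<type> <name>' lines of a simplified .msg definition."""
    fields = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        field_type, field_name = line.split()
        fields.append((field_type, field_name))
    return fields


def random_message(definition, rng):
    """Return a dict holding one random, type-valid value per field."""
    return {name: FIELD_GENERATORS[field_type](rng)
            for field_type, name in parse_definition(definition)}


EXAMPLE_DEFINITION = """
# simplified, made-up message definition
uint8 level
int32 counter
bool active
string label
"""

if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that a failing input can be replayed
    for _ in range(3):
        print(random_message(EXAMPLE_DEFINITION, rng))
```

Seeding the generator is a deliberate choice here: a node that crashes on a generated message is only useful as a finding if the same message can be produced again.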

I cannot visualize a possible worst-case situation better than you did (I will have to remember that one for the future). However, even if you work in an environment where your work could potentially lead to disastrous situations, you should change your mindset from “prevent/find every bug” (a missed bug could certainly lead to something like an injured child’s eye, but that can never be ruled out with 100% certainty either) to “become better at preventing/finding the most important bugs”… better than suffering a depression.

Thanks for that hint.

Good to not provide a USB port…

I am not going to PR into ROS1 :wink: Right now I am fine with a fork of ros_comm/rostest for dummy, fake, spy and mock nodes. (In that case I will consider your package as a template.)

Having something like that would be great.

I began a new ROS package roschaos. The package is in an early stage but it can already be used to kill local ROS node processes randomly using a command line interface. To get feedback and proposals for improvement right from the beginning I decided to make the project public already in this early stage. However there are a lot of features missing (refer to issues). Feel free to contribute to get more features implemented.
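
To illustrate the "kill a random node" step, here is a minimal sketch. It is not how roschaos is implemented: it relies on the ROS 1 command line tools `rosnode list` and `rosnode kill` (which ask a node to shut down rather than killing its local process), and the function names and the protected-node list are assumptions made for this example.

```python
import random
import subprocess


def pick_victim(nodes, rng, protected=("/rosout",)):
    """Pick one node name at random, never one of the protected nodes."""
    candidates = sorted(n for n in nodes if n not in protected)
    return rng.choice(candidates) if candidates else None


def list_nodes():
    """Return the running node names as reported by `rosnode list`."""
    out = subprocess.check_output(["rosnode", "list"], universal_newlines=True)
    return [line.strip() for line in out.splitlines() if line.strip()]


def kill_random_node(rng, lister=list_nodes, killer=None):
    """Kill one randomly chosen node and return its name (or None)."""
    if killer is None:
        # Default: delegate to the ROS 1 command line tool.
        killer = lambda name: subprocess.check_call(["rosnode", "kill", name])
    victim = pick_victim(lister(), rng)
    if victim is not None:
        killer(victim)
    return victim
```

Passing `lister` and `killer` as arguments keeps the random selection testable without a running ROS master, for example `kill_random_node(random.Random(0), lister=lambda: ["/talker"], killer=print)`.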

BTW: Thanks @gavanderhoorn for your answers on answers.ros.org like this one which helped a lot to get started.

I put generic test nodes which act as dummy or fake nodes for faking, not verifying (according to the general terminology of test doubles in software engineering), into a package rosfake. However, integrating custom verification nodes into rostest does not seem to be straightforward according to the comment on this question on answers.ros.org. @asmodehn What is your approach to integrating custom verification nodes?

It makes sense to put spy and mock nodes for verifying into a package rosmock. However, making something like that generic is harder because, for example, the package depends on the test framework used to assert.

What license do you use for pyros-test? What test framework do you use to assert? If one used Python unittest, which is widely supported for “being integrated” into other frameworks, one could think about merging generic verification nodes into a kind of rosmock. PRs into rostest take too long to be accepted (if they get accepted at all). I think the functionality of rostest (providing the test framework) and the verification part (generic fake nodes → rosfake, generic verification → rosmock) would better be separated anyway…

Quick replies :

  • pyros-test is MIT or BSD, along these lines, I just haven’t taken the time to put a license file there… I’ll try to do it soon.
  • Python makes things simpler than C++ as there are de facto standard test frameworks. So I am trying to support whatever basic Python and ROS support (unittest, doctest, and nose) and pytest. By the way, unittest already includes a mock library in Python 3, which is just the mock library from Python 2.
  • pyros-test is currently very very simple (probably too simple to need a separate package), and I’d like to eventually improve it when I get the chance…

But IMHO you’re probably better off extracting what I already started in https://github.com/pyros-dev/pyros-msgs and https://github.com/pyros-dev/pyros-schemas : have a look at property based testing and hypothesis, it should be simple enough to automatically generate fake nodes based on an existing message definition and then send fake messages around :slight_smile: .

Some example of hypothesis use here : https://github.com/pyros-dev/pyros-schemas/blob/nested_merged/tests/test_pyros_schemas/hypothesis_example.py and an example of generating messages with it there https://github.com/pyros-dev/pyros-schemas/blob/nested_merged/tests/test_pyros_schemas/test_basic_fields.py

By the way, roschaos looks fun, and there is probably a clever way to integrate it with roslaunch or feed it launch/test files… I’ll play with it when I get some time.


Wow, a ros-hypothesis could be very powerful and effective. I knew about property based testing (e.g. RapidCheck for C++) but had not thought about adapting it to ROS so far. Creating a framework for property based ROS node testing could be a hard task, I guess. What use cases are you thinking about exactly? I would love to contribute to it :blush:

While I’m a big fan of automatic checking, I wonder which problem(s) this proposal tries to solve.

This is not to say there are no problems. I just think it would help the discussion a lot if we knew what people here are interested in, in terms of outward behavior of system.

The proposed contracts relate to a) rates and b) response times. From my own work, I know that rate information is necessary but not sufficient for determining whether a system can have sampling effects. I am also strongly of the opinion that rates should be a property of a system, not a component.

I don’t know of much utility, but a great deal of problems, in specifying response times for a distributed system, particularly one with very little real-time support, such as ROS or ROS2. If you really care about that, put your stuff into one process and ensure response times by the usual means, which have little to do with ROS.

Initially my intention in starting this thread was to discuss whether and how formal specification and formal verification of ROS 2 node/nodelet interfaces by means of DbC could be applied to/implemented in ROS 2. I am interested in DbC because it has the big advantage that it could speed up the integration of ROS 2 nodes/nodelets within a system, since it avoids struggling with bugs which relate to the interface-based interaction between nodes/nodelets. However, I do not consider DbC a measure for formal verification in the testing context but in the debugging context instead, because it can hardly be accurate enough w.r.t. timing, as you said, especially in real-time systems. (I do not know of any tracing tools for ROS, which are the usual tools for real-time related verification.) In the debugging context a comparably rough estimate is often sufficient. Assuming “system” means the overall sum of ROS nodes/nodelets, this thread is not about “outward behavior of system” but about its “internal” integration only. If your system exposes interfaces in terms of ROS interfaces which interact with another system, DbC could address the “outward behaviour” of the single sub-systems.

As DbC would be hard to implement in ROS 2 and its benefits are not considered relevant enough in comparison to other quality-improving measures, like developing and using tools such as a “ROS Simian Army”, the thread turned towards the question of which tools are missing and could be helpful to verify a ROS-based system. (Not in terms of verifying real-time behaviour.)

For me the proposed contracts are more about valid and invalid values/value ranges of the node/nodelet interfaces like topics.

However w.r.t. rates and response times I would consider different “classes”:

  1. the “incoming” rates one node/nodelet expects to receive from other nodes, the node’s/nodelet’s “outgoing” rates which other nodes expect, the response time of one node
  2. the rates and response time of a component
  3. the rates and response time of a system

If 1) the node/nodelets do not satisfy “rough” rate or response time requirements there is a pretty good chance that 2) the component or 3) the system will not behave like you would like it to behave as well.
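
A “rough” rate requirement of class 1 could be checked from message arrival times alone. The following is a minimal sketch under stated assumptions: the function names and the 20% default tolerance are made up for illustration, and the timestamps are plain floats in seconds rather than ROS time objects.

```python
def observed_rate(timestamps):
    """Average message rate in Hz over a list of arrival times in seconds."""
    if len(timestamps) < 2:
        return 0.0
    span = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / span if span > 0 else float("inf")


def rate_ok(timestamps, expected_hz, tolerance=0.2):
    """True if the observed rate is within +/- tolerance (a fraction) of expected_hz."""
    rate = observed_rate(timestamps)
    return abs(rate - expected_hz) <= tolerance * expected_hz


if __name__ == "__main__":
    ten_hz = [i * 0.1 for i in range(11)]   # messages arriving every 100 ms
    five_hz = [i * 0.2 for i in range(6)]   # messages arriving every 200 ms
    print(rate_ok(ten_hz, expected_hz=10.0))
    print(rate_ok(five_hz, expected_hz=10.0))
```

An average over a window hides jitter and bursts, so this is only suitable for the rough, debugging-level checks discussed here, not for real-time verification.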

“I am also strongly of the opinion that rates should be a property of a system, not a component.” → People having a background in safety critical, real-time embedded system development could disagree here.

If we care about rates and response times we are already putting nodelets into a single process if this is possible. (If nodes are distributed over different machines I do not know about any way to improve rates and response times anyway.)

Can you give one or more examples of the kind of interface bugs you mean?

In my experience, the most frequent issue – at least initially – with connecting nodes is that something is not connected, because of a wrong name. The next most frequent seems to be spelling and/or range issues in parameters.

I wouldn’t call these interface issues as such, or at least we don’t need new specs on that, it would be sufficient to actually check the current ones.

Have you seen my talk at ROSCon? It’s not specifically about that, but I mention how we used LTTng to trace messages and callback invocations. This is in the soon-to-be-released tracetools package.

Aren’t those specs strongly related to your requirements or, in other words, to the outward behavior your system is expected to have? For example, when I drive at 1m/s in a warehouse, I might have different requirements regarding pose update rate than when I drive 50km/h through city traffic.

Hmm, most of what I know about embedded systems comes from the automotive people here at Bosch, and they are very concerned with safety. I might have got it wrong, of course, but given how often we’ve internally talked about this, I would be surprised.

What is true is that rates are often specified on a task level, with callbacks being assigned to a task according to their rate needs, but that is an implementational aspect, which most people actually don’t like, but accept as the way things are, unfortunately, implemented.

Anyway, in most cases, they are not interested in rates at all, only in response time, and that’s again a system-level aspect.

The response time of the system can be broken down into the response time of each stage of the processing pipeline’s longest path. Thus you need to know what the response time patterns of the nodes in that path look like. If we are dealing with a hard real-time system, then it should be possible to define, for a given execution environment, what response time requirement the node is capable of satisfying, which would be useful information for a system integrator. But it seems to me that this is a very specific requirement, as it would be tied completely to your specific environment, including the other nodes and how they execute (e.g., does one of them tend to hog the CPU a little?). So after @iluetkeb’s talk and speaking to him in person, I think that he’s right about the need for tools that make it really easy to understand these sorts of properties of nodes when they are installed in your system, rather than a way to specify something like a maximum update rate capability as part of a node’s interface. That seems like a figure that would change too much between execution environments based on things as obvious as CPU speed and as esoteric as the structure of the disc controller.
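
The breakdown into per-stage response times along the longest path can be written down in a few lines. This is a sketch only: the node names and the millisecond figures are invented, the processing graph is assumed to be acyclic, and measured values would of course depend on the execution environment as described above.

```python
def worst_case_path(graph, response_time, source):
    """Return (total, path) for the slowest path starting at `source` in a DAG."""
    best_total, best_path = response_time[source], [source]
    for successor in graph.get(source, []):
        total, path = worst_case_path(graph, response_time, successor)
        if response_time[source] + total > best_total:
            best_total = response_time[source] + total
            best_path = [source] + path
    return best_total, best_path


# Hypothetical pipeline: edges point from publisher to subscriber.
GRAPH = {
    "camera": ["detector", "logger"],
    "detector": ["planner"],
    "planner": ["driver"],
}
# Hypothetical per-node worst-case response times in milliseconds.
RESPONSE_MS = {"camera": 5, "detector": 30, "logger": 2, "planner": 12, "driver": 3}

if __name__ == "__main__":
    total, path = worst_case_path(GRAPH, RESPONSE_MS, "camera")
    print(total, " -> ".join(path))  # 50 camera -> detector -> planner -> driver
```

The recursion revisits shared sub-paths, which is fine for graphs of this size; a larger graph would call for memoization or a topological-order pass.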

One example: You integrate 2 nodes (“a” and “b”) which have not been tested before (or, if so, not well enough). The first node “a” publishes a topic, and the second one “b” subscribes to it and publishes a topic of its own as well. You know that the topic message values of the subscribing node (“b”) may never be outside a valid value range. During integration something goes wrong and you locate the root cause in wrong topic message values published by node “b”. You do not know exactly why the topic message values are invalid. It could be due to invalid topic message values which “b” received from “a”, or to a wrong implementation of “b” itself which does not prevent it from publishing invalid values. With a DbC mechanism, violations of such expectations would be reported during node integration even before integration issues could be discovered at all. (It is possible that integration issues occur only in rare cases, so that an issue stays undetected during integration and pops up in the field for the first time.)
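
In plain Python the idea looks roughly like this. It is a sketch, not an existing ROS facility: `contract`, `ContractViolation` and the callback are hypothetical, and the value ranges are invented for the example. The point is that a failed precondition blames the upstream node “a”, while a failed postcondition blames “b” itself.

```python
class ContractViolation(Exception):
    """Raised when a pre- or postcondition of a callback does not hold."""


def contract(pre, post):
    """Wrap a callback so that its input and its output are checked."""
    def decorate(callback):
        def checked(value):
            if not pre(value):
                raise ContractViolation(
                    "precondition failed: upstream sent %r" % (value,))
            result = callback(value)
            if not post(result):
                raise ContractViolation(
                    "postcondition failed: callback produced %r" % (result,))
            return result
        return checked
    return decorate


@contract(pre=lambda x: 0.0 <= x <= 100.0, post=lambda y: 0.0 <= y <= 1.0)
def node_b_callback(percent):
    """Stand-in for node "b": converts a percentage from "a" into a ratio."""
    return percent / 100.0


if __name__ == "__main__":
    print(node_b_callback(50.0))
    try:
        node_b_callback(150.0)  # "a" violated the agreed input range
    except ContractViolation as exc:
        print(exc)
```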

As far as I know the only way to detect unconnected nodes which should be connected is to use rqt_graph. (Mismatched topic types are even harder to detect and require looking for missing Connections: when using rosnode info <node>.) Does anyone know how to check for issues like that in an automated fashion?

Aren’t rostests with paramtest test nodes a candidate for preventing range issues in parameters?

Right. (It was never about interface issues in terms of the ROS implementation just in terms of its usage.)

Not yet. I was looking for exactly something like LTTng. Will there be open sourced tools in addition to LTTng?

If the same node is used in different applications and/or environments, that’s true; the requirements could be classified via use cases (here: warehouse, car) and the checks parametrized accordingly. (Application specific: A warehouse robot should never drive faster than 1m/s. Environment specific: Each of your self-driving cars should not drive faster than 50km/h in the city or faster than 200km/h on the autobahn.)
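
Parametrizing one check by use case could be as simple as a lookup table. The limits below are the ones from the examples above, converted to m/s; the table name, the function name and the use-case keys are assumptions made for this sketch.

```python
KMH = 1000.0 / 3600.0  # one km/h expressed in m/s

# Use-case specific limits on the speed magnitude, in m/s.
SPEED_LIMITS_MPS = {
    "warehouse": 1.0,
    "city": 50 * KMH,
    "autobahn": 200 * KMH,
}


def speed_ok(use_case, speed_mps):
    """True if the speed magnitude respects the limit of the given use case."""
    return 0.0 <= speed_mps <= SPEED_LIMITS_MPS[use_case]


if __name__ == "__main__":
    print(speed_ok("warehouse", 0.8))
    print(speed_ok("city", 14.0))  # 14 m/s is slightly above 50 km/h
```

The same node code then runs unchanged in every deployment, and only the selected use case decides which limit is enforced.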

I do not know the automotive sector at all, only the industrial automation sector (hard real-time, safety critical, up to IEC61508 SIL3). However, I know for sure that people use trace tools on the RTOS level like Trace for FreeRTOS/SafeRTOS and on the C function level like microtrace for function runtime and response time analysis. It’s possible, though, that this is only done for cases where the system’s overall response time cannot be determined using measurements on the system level (e.g. error case handling, which can be hard to force on the system level in some cases).

Okay, range specs would certainly be interesting.

I also once had a case with parameters, where the combination of two parameters resulted in an out-of-range condition.

You can use “roswtf” during runtime, it will report subscriptions that have no publisher, and also type mismatches.

The thing is, you never know whether a subscription might be optional, and there are also cases where either one or the other subscription can be used, but not both.

Again, I agree this would be useful to check. We’ve so far called such things “graph consistency checks”.
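
Such a graph consistency check can be expressed as a pure function over a description of the graph. This is a sketch and not what roswtf does internally: the function name, the tuple layout and the `optional` parameter are assumptions, and the lists would have to be filled from the running system or from launch files.

```python
def check_graph(publishers, subscribers, optional=()):
    """Return a list of human-readable findings.

    publishers / subscribers: lists of (node, topic, msg_type) tuples.
    optional: topics whose subscriptions may legitimately stay unconnected.
    """
    published = {}
    for _node, topic, msg_type in publishers:
        published.setdefault(topic, set()).add(msg_type)

    findings = []
    for node, topic, msg_type in subscribers:
        if topic not in published:
            if topic not in optional:
                findings.append("%s: no publisher for %s" % (node, topic))
        elif msg_type not in published[topic]:
            findings.append("%s: type mismatch on %s (%s vs %s)" % (
                node, topic, msg_type, ", ".join(sorted(published[topic]))))
    return findings


if __name__ == "__main__":
    pubs = [("/a", "/scan", "sensor_msgs/LaserScan"),
            ("/a", "/odom", "nav_msgs/Odometry")]
    subs = [("/b", "/scan", "sensor_msgs/LaserScan"),
            ("/b", "/odom", "geometry_msgs/Pose"),
            ("/b", "/joy", "sensor_msgs/Joy"),
            ("/c", "/goal", "geometry_msgs/PoseStamped")]
    for finding in check_graph(pubs, subs, optional=("/joy",)):
        print(finding)
```

The `optional` argument addresses the point that an unconnected subscription is not always an error; the "either one or the other, but not both" case would need an additional rule that this sketch does not cover.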

The video-link is here: ROSCon 2017 Vancouver Day 2: Determinism in ROS – or when things break /sometimes / and how to fix it… on Vimeo

I have since published the related code at boschresearch/ros1_tracetools (tracing tools for ROS), but be aware that it’s not very useful without our instrumentation of roscpp. I’m also in the process of publishing that, but not quite there yet.

Well, yes, you could, but that’s why I would prefer to put those specs (at least in part) onto the resulting system.

Right, those are typical cases for combinatorial ROS node unit tests. However, it can often be hard to guess the right sample combinations of input data (parameters, received topic messages, service requests, action requests) for tests in advance. It can be easier to define property based tests which can be set up to execute the tests with all combinations of input data, and to decide later whether the input data of failing tests are actually invalid or whether they are ok (and exclude them from further test executions to prevent false positives, if applicable). However, I do not know about the current state of property based unit testing of nodes in ROS. Some of @asmodehn’s projects already use hypothesis, which is a property based testing framework for Python, but I don’t know yet how it is used in detail.
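
The "run all combinations and look at the failures afterwards" approach can be sketched without any test framework. Everything here is hypothetical: the helper name, the two parameters and the 2 m stopping-distance property are invented, and a tool like hypothesis would sample and shrink inputs rather than enumerate them exhaustively.

```python
import itertools


def failing_combinations(prop, **samples):
    """Run `prop` on every combination of the sampled inputs; return the failures."""
    names = sorted(samples)
    failures = []
    for values in itertools.product(*(samples[name] for name in names)):
        kwargs = dict(zip(names, values))
        if not prop(**kwargs):
            failures.append(kwargs)
    return failures


def stopping_distance_ok(max_speed, max_decel):
    """Hypothetical property: the robot must be able to stop within 2 m."""
    return max_speed ** 2 / (2.0 * max_decel) <= 2.0


if __name__ == "__main__":
    failures = failing_combinations(
        stopping_distance_ok,
        max_speed=[0.5, 1.0, 2.0],   # m/s, each value valid on its own
        max_decel=[0.5, 1.0, 2.0],   # m/s^2, each value valid on its own
    )
    print(failures)  # [{'max_decel': 0.5, 'max_speed': 2.0}]
```

Here every parameter value is acceptable in isolation and only one combination violates the property, which is the kind of case that hand-picked samples tend to miss.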

Thanks for that hint. (I thought about wrapping some ROS command line tool functionality for finding graph inconsistencies into a ROS node, and running that node during node integration so that it asserts with error log messages if the graph is inconsistent.)

Thanks a lot. I am looking forward to it.

Right. Makes sense.