Design process of ROS 2

During today’s navigation WG meeting, we had some more discussion about the use of life cycle nodes and actions in the ROS 2 navigation code. This discussion came back again to issues that several people have with the design of the APIs for these features, which then further got into the fact that there is not much discussion about the detailed design and implementation of many features. Several of us feel that time pressures are leading to designs that are driven by implementation, features that are not fully fleshed-out before being implemented, or what are assumed to be prototype implementations ending up not changing and then being used by an increasing number of users, making them hard to change. We would prefer to see long-term design that aims for “we will eventually get to here”. Even if the implementation does not get there immediately, at least everyone will know where we are going and that any current design or implementation is not the final result.

The design repository is good and has allowed many people outside the OSRF, myself included, to contribute to designing concepts for ROS 2 and in some cases put quite a lot of detail into specific parts (the launch facilities come to mind). But the documents in this repository more often than not do not go into detail on things like APIs or library organisation, which are important details that impact how developers can use the feature being discussed. The two most prominent examples that I have seen are the life cycle nodes being implemented as a separate object type, which requires duplication of much of the API and means life cycle nodes can’t be used in some parts of the ROS 2 API, and the implementation of actions external to the Node class while topics and services are included in the Node class, which contrasts with what many were expecting would happen before implementation would begin. I have also seen the design of topic names often come up in security-related discussions, such as this issue.
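To make the life cycle node concern concrete, here is a minimal sketch of the general problem. It assumes nothing about the real rclcpp or rclpy code: `Node`, `LifecycleNode` and `attach_helper` are hypothetical stand-ins invented for illustration, and the `isinstance` check plays the role that static typing plays in C++.

```python
# Hypothetical classes for illustration only; these are NOT the rclcpp/rclpy APIs.


class Node:
    """Stand-in for an ordinary node type."""

    def __init__(self, name):
        self.name = name
        self.publishers = []

    def create_publisher(self, topic):
        self.publishers.append(topic)
        return topic


class LifecycleNode:
    """Separate type: it repeats Node's API instead of sharing it."""

    def __init__(self, name):
        self.name = name
        self.publishers = []
        self.state = "unconfigured"

    def create_publisher(self, topic):  # duplicated from Node
        self.publishers.append(topic)
        return topic

    def configure(self):
        self.state = "inactive"


def attach_helper(node):
    """Stand-in for a library function written against the Node type only."""
    if not isinstance(node, Node):
        raise TypeError(f"expected Node, got {type(node).__name__}")
    return node.create_publisher("/helper_topic")
```

In this sketch `attach_helper(Node("a"))` works, while `attach_helper(LifecycleNode("b"))` raises a `TypeError` even though both classes offer the same `create_publisher` method. That is the kind of duplication and incompatibility being described, whatever form it takes in the actual libraries.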

We recognise that there are limited resources being put into ROS 2 development, that those resources are also being asked to maintain and develop ROS 1, create new ROS 1 releases, maintain a build farm, and so on in addition to developing ROS 2 under huge time pressure. However the main problem is not so much the lack of time being put into detailed design of core ROS 2 features as it is the lack of an opportunity for people to comment on the details until a pull request with a completed implementation comes along - by which time it is difficult to say “change the whole concept of how this is implemented”. When detailed discussions do happen in GitHub issues, it is usually after-the-fact and frequently the discussion thread just stops without a resolution - probably because the relevant people have so much on their plates to deal with. It was also mentioned today that these discussions often end up going circular due to the limitations of talking in text all the time. This is in contrast to there being working groups for non-core features that meet regularly (weekly, in the case of the navigation WG) and talk about requirements, goals, etc. that are driving how many things will work. These weekly meetings help push things along and help prevent discussions just stopping.

The purpose of this thread is not to accuse anyone of ignoring detailed design. The conversations on the issues linked above show that the Open Robotics developers put a lot of thought into how they design APIs and implement features. I do feel that in general the core ROS 2 libraries are well-designed, but there are a growing number of places where I have concerns, and I know that others have places that concern them, too. What I think we need is:

  • A venue to have regular discussion of detailed design and implementation issues before we get presented with an implementation, and not buried amongst the dozens of other issues that come across our GitHub notification lists every day.
  • Active participation in this venue by anyone working on a core ROS 2 feature.

I agree strongly. I have the same concerns, particularly about Actions and Lifecycle nodes, which we’re trying to implement within Navigation2 in the near future (Dashing release). I would really like to see a “Design Forum” or working group formed to discuss topics such as these.

I think there are two opposing goals about the development process here: on the one hand it should be as thoroughly designed and implemented as possible and on the other hand the community wants many new features as quickly as possible. We have done more of the former in the beginning of the ROS 2 development and then more recently - based on the feedback of the community - changed more towards the latter. The desire for a fast development pace comes with the side effect that a first implementation of a concept might not be the final perfect solution. But I think that is absolutely fine. That is why ROS distributions exist. While the API in each distribution stays stable, a newer ROS distribution has the freedom to break it in order to move forward.

For ROS 1 - which has been around for 10+ years - it is reasonable to expect that “core” API doesn’t usually change between distros - or if it does that it aims for a tick tock cycle. For ROS 2, which is much less mature, I don’t think this is required at all. Just because a certain feature has been implemented doesn’t imply it is final by any means. ROS 2 is too young to assume everything is stable / mature and not changing. As long as changes are clearly documented, packages using the API can be expected to follow those instructions (e.g. for Dashing) in order to support future ROS distros.

The two features mentioned - lifecycle and actions - are good examples of the process. There has been a significant amount of discussion and iteration on the design documents themselves as well as on the pull requests implementing these features. But most of the current concerns were not raised during that multi-month period. This kind of feedback often only comes up much later, when a bigger audience is trying to use the new functionality, which I think is natural. Would it have been great if these aspects had been brought up and addressed earlier in the process? For sure. But the fact that they didn’t come up earlier shouldn’t prevent us from iterating on the design as well as the implementation in the future as needed. If this would imply “change the whole concept of how this is implemented” that would be unfortunate, but if the reasoning behind that need is sound it should probably be done.

Regarding the question about the venue / forum for this kind of discussion: imo these already exist in various forms. There is the design repo with already merged, pending or to-be-created issues and PRs, the actual tickets implementing specific features, as well as this Discourse category. All these can be used to discuss topics like this. You referenced two specific tickets where this exact kind of discussion is actually happening. The main step missing from these discussions is imo a follow-up in terms of someone taking the lead to act on the discovered shortcomings and working towards improving them. And ultimately this is very often limited due to resource constraints as well as lack of initiative.

@gbiggs What kind of forum do you have in mind? As @dirk-thomas mentions, there are already ample asynchronous communication tools at our fingertips, but they don’t appear to meet your (or @mkhansen’s) needs. Is the problem from your perspective the lack of a regular synchronous interaction?

Maybe I’m just tired from a long week (yay Saturday!) but I don’t seem to be as worried about this as I was yesterday. :slight_smile: Also I kinda agree with all but the first half of the last paragraph of Dirk’s post. I like an agile, iterative approach to development like the one that ROS2 is following (or at least trying to).

My post above was an amalgamation of the points raised in the navigation WG meeting, so I don’t want to speak out of turn for all those points of view. I think for me the main problem is that although we expect and need to iterate on features, that iteration is not necessarily happening. Dirk alluded to this in his last paragraph. For example, the life cycle nodes implementation is still mostly unchanged (in terms of how it is implemented) from the first implementation 2 (?) years ago. Yet now tools, libraries and major projects are starting to be built on it. The larger the mass of users the greater the force against major changes, and at some point each feature will reach a critical mass of use where the pushback against change is greater than the push for change, and we end up stuck where we are. Isn’t this why ROS2 began in the first place?

But as everyone keeps pointing out, this lack of iteration likely comes from a lack of resources. As each feature gets done, developers move on to the next feature being demanded by the community and lose track of something that they probably originally intended to go over again. I know I’m guilty of this over the years. (Doing new stuff is always more fun than reworking old stuff!)

In regard to what kind of forum might be needed, I don’t like the way it is hard to see into the thought processes of why a feature is implemented the way it is, and where it is intended to go. We are not all in the same office so it is hard to just roll over in my chair (too lazy to walk) and ask. Yes, we have many textual forums available, but text is slow and easy to misunderstand while simultaneously hard to correct. This is why I like the idea of having a regularly scheduled teleconference where we can ask about something and get a rapid, easily-clarified answer. A question for the TSC, perhaps?

Another improvement could be to introduce epics into the feature management process. That might make it easier to track how a feature is being iterated on, what the current end goal is and how far along towards that goal we are. There are some issues that track major features with a checklist of smaller issues, but these are usually “implement large feature X for next release”. Epics are traditionally used across many release cycles and take a longer-term view. I’m not sure how easy it would be to get epics into GitHub’s or Waffle board’s facilities, but I feel like this could be a relatively low-resource way to improve visibility of the longer-term development goals and process. I’ve added a task to my list to look into if it is technically possible with our current tools in the next week.

I think lack of time to iterate and the lack of a democratic process for contentious design choices are issues here, but instead of solving those difficult issues here and now, I think better communication practices could help in the meantime.

tl;dr I think we should:

  • use instantaneous chat more often
  • consider some tools to facilitate that (like discord https://discordapp.com/open-source or slack)
  • try to mitigate overwhelming core contributors with community moderation (as needed)

I think the discussion we had about the actions API suffered due to the nature of email/discourse discussions being so delayed. You end up with a lot of interleaved responses and it’s hard to iterate on disagreements.

To that point, I think one thing we’re missing is good instantaneous chat.

We have an IRC channel (#ros) which is text chat and is in my opinion just ok.
I am usually on IRC but rarely use it.
However, I recently had some design-related discussions with a community member (@kyrofa sorry to single you out) and it was honestly much more productive than back-and-forth on GitHub or Discourse usually is. So I’d like to see more of this kind of chat to see if it helps these cases.

However, IRC is just ok and we lack a convenient way to do higher bandwidth communication like voice or video without scheduling a meeting (for whatever reason we don’t use IRC to agree to do a Google meet or skype or w/e). I really enjoyed having a google hangout “situation room” during the crystal release. I think it was helpful for community members trying to get things into the release to be able to just hop in and ask a question, or just listen.

I know @dirk-thomas likes to say “different tools will not solve the problem” (and I tend to agree), usually meaning that when there’s an underlying issue like lack of time to read and respond to design questions different tools won’t help, but in this case I think a chat system that’s more modern than IRC might be helpful, especially if it has a voice and video option which would make it casual to start up a conversation.

I think using text chat (whether IRC or something else) would help, but honestly it will cause other issues. Usually, if I don’t answer a question about design things on discourse or GitHub immediately it’s because I’m busy. If someone hits me up on IRC with a very specific question, they’re more likely to get an immediate response out of me, and that’s good for both me and them, but it also makes it harder for me to focus on long running tasks. It’s all too easy for me to lose an entire day to reviews or design questions or help with bugs (whether they come from the OR slack or IRC or GitHub, etc…). So if we spend more time talking on these high bandwidth channels, I think it’s going to improve others’ ability to work with us, but it’s also going to impact our productivity as a consequence.

I’ve seen this issue with content creators in the video game industry, and the way they balance this is by having community moderation, where issues bubble up through levels of access and insulate people who are easily overwhelmed. Email/GitHub/Discourse are kind of good for this, because it makes it easier for me to ignore conversations until I have time, but I still see them and often fall into reading them, thinking about them, or responding to them. I think a more casual setting might encourage community members to respond to each other before I even have time to read it. For whatever reason, I find this is more common on something like IRC than in forums like email. Perhaps because the perceived cost of a response is lower.

I’ve been thinking about suggesting that we emulate these communities for some time, but ironically I haven’t had the time to put together a proposal. But briefly, I think we could learn from their use of community moderation, and the tools they use to accomplish that. One pretty common thing is that almost all of them use discord, which is a free service that provides text, voice, and video chat in channels (sort of like slack). It is geared towards gaming and gaming communities, but it seems to excel at community management, and it provides a lot of tools and automation for this purpose. There are a few projects using it already, and they have an Open Source outreach page: https://discordapp.com/open-source Other communities have similar solutions with different tools, choosing to use slack or gitter or one of the other options. I’m not tied to discord, but I do think it’s a good solution. Obviously there’s always a discussion to be had about memorandum and the benefits of asynchronous communication like email, but I personally think there’s a place for these more modern solutions to facilitate ephemeral discussions.

Currently we don’t use the services we already have (IRC) to full effect, but I think if we had a more modern tool like discord or slack, and more of us committed to being available on it (IRC is pretty dead most of the time), then we might increase our use of it. This would help with communication. We could start with no moderation or hierarchy, i.e. everyone has “access” to everyone else, and if it becomes an issue we can start to add some structure to prevent overwhelming individuals.

No matter what we do (in terms of mitigation techniques like community moderation), this will negatively impact some people’s productivity more than others, but it will also hopefully let us better leverage other contributors who are blocked or frustrated otherwise. So I think we just need to be aware of this trade-off and (this isn’t a new request from me) also take this time spent collaborating into account when scheduling feature development. Personally I think this point needs discussion and support from the TSC level, but that’s just my opinion.

Maybe comparing some political models for open-source communities might be helpful.

Such as

The main trade-off may be: speeding up any process until an outcome vs. improving how well the outcome covers the needs of different (sub-)communities.

I was coming to say pretty much exactly this. IRC isn’t perfect, although we’ve been using it successfully for years in Ubuntu. If we were starting today would we use something else? Maybe, but I don’t think the choice of tech matters so much as long as it’s open to everyone and consistently used and attended (including Open Robotics folks). We have rooms with hundreds of people in them, and our community knows that we Canonical employees are always there. How do they know? Because we have our own design discussions there, live, in front of everyone. Some of us actually used Telegram for about a year because we were starting to tire of IRC, but we found that having design discussions and making architectural decisions out of the public eye made our community feel a little alienated and out of the loop. This conversation sounds familiar :slight_smile: .

I totally get that being on [chat platform] and being pinged causes a context switch that quickly eats up time, but delegation is key here-- if everyone is there, we can all share the load. And I’ve actually told people “I’m sorry, I’d love to help but I’m slammed right now, can you come back tomorrow maybe an hour earlier in the day?” and they will. Or “I can’t help at the moment, but [other person] was working on that yesterday.”

Not every community works like Ubuntu, I get that. One of the reasons we’re able to have design discussions in front of everyone is that Canonical is a mainly remote company, and we have to have those conversations somewhere. We might as well have them in public, and that has worked really well for us. As I understand it, Open Robotics is not a remote company, but you’re working with a large community that is. Having discussions in chat instead of in-person would probably slow you down, but I wanted to share my perspective.

I’m a total outsider here, but with 20+ years of experience in the aerospace software world. I would not suggest adopting the methods used in my industry; they result in very slow progress. But I would make one hard change in your process.

Note the observation about the “lack of an opportunity for people to comment on the details until a pull request with a completed…”. This is natural: people don’t notice until the change is finally made.

One change in your software process might be that “NO CODE IS WRITTEN UNTIL X DAYS AFTER A DETAILED DESIGN IS CHECKED IN”. This will cause some slowdown, but it means that everyone sees the pull request for the change in the design documents and comments THEN, not after the code is already written. There is a natural tendency to NOT change working code.

Basically take advantage of that observation that no one cares until after the pull request. So make the pull an early step.

Just saying “we should talk more” rarely leads to it actually happening. Make a rule that forces more talking.

The goal should be to make changes early, not “later”.
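As a rough illustration of how such a waiting-period rule could be checked mechanically, here is a small sketch. Everything in it is an assumption made for the example: the function name, the idea of comparing calendar dates, and the 14-day value standing in for “X” are hypothetical and not taken from any existing ROS 2 tooling.

```python
from datetime import date, timedelta


def implementation_may_start(design_checked_in, today, waiting_days):
    """Return True once `waiting_days` have passed since the design was checked in.

    All three parameters are hypothetical inputs for this sketch:
    two `datetime.date` values and the "X days" from the proposed rule.
    """
    if waiting_days < 0:
        raise ValueError("waiting_days must not be negative")
    return today >= design_checked_in + timedelta(days=waiting_days)


# Example with made-up dates and a made-up 14-day comment window.
design_day = date(2019, 3, 1)
too_early = implementation_may_start(design_day, date(2019, 3, 10), 14)
late_enough = implementation_may_start(design_day, date(2019, 3, 15), 14)
```

Here `too_early` is `False` and `late_enough` is `True`. A check like this would only enforce the delay; whether people actually review the design during that window is still a matter of process.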

With the increasing addition of remote engineering resources thanks to the efforts of the TSC, there must be some kind of instantaneous chat going on. @wjwwood mentioned an Open Robotics slack; having an internal slack is not something I’m going to argue against because I agree it is needed. As everyone seems to agree, what is lacking is the equivalent for the whole community that is actually used. I gave up trying to maintain my connection to #ROS because it was 95% no activity, 4% someone making a random statement along the lines of “Hi there! I’m learning ROS! Yay”, and the remaining 1% was interesting discussion. This not-being-used thing is a problem. As @kyrofa said, having to hold your design discussions in a public channel that supports rapid, lightweight response is better all around.

What technical solution to use for instantaneous chat is another thread now, but I am all in favour of trying to transform things a bit in this way.

Plus my collection of Slacks is not nearly big enough yet.

I agree with this. I’ll go to whichever chat platform is used. I think it would be useful to examine what particular issue causes the IRC to be inactive. My first thought is that more activity from Open Robotics folks would be helpful.

I’m not sure I agree that adding slack or IRC is going to resolve this any better than GitHub and Discourse have. I think it’s just going to add more confusion, not less, IMO. I already get 10-30 GitHub notifications, dozens of emails, and 5-10 Discourse notifications each day. Adding IRC/Slack will only make things less focused, not more. I think something more like the model that Google Cartographer is using for their Open House is a better way to discuss the key design issues in real-time.

I would also like to add that I do understand the iterative approach, but without knowing what the long-term design goal is, often the first iteration becomes the only implementation (as noted above by @gbiggs). At least if the goal is stated clearly in a design document (or an Epic as he suggested), then it can be made clear that there will be changes upcoming to reach that final design. I don’t think we have that clarity right now. It might be clear to the implementers that it was only meant to be a first implementation, not the final one, but to the community, it’s not clear. See the thread on Actions that was mentioned above for a good example of this.

My observation was that of two of my recent design interactions with people outside our office, one was on GitHub and one was on GitHub but with short bursts on IRC, and that the latter was much more productive. It certainly doesn’t address issues where we have insufficiently written down what’s in our heads so that external people can more clearly see what we envision for the API or design, nor does it address the fact that we’re all already overwhelmed with notifications.

To be clear we already have IRC, so we don’t need to add it, though I suppose you mean “add it to our workflow”, in which case my experience is that it would help, but again that’s just my opinion.

In my opinion, that’s just using instantaneous chat on a regular schedule with meeting notes. Another way to address this shortfall of “being on the same page” would be to have more meetings where we sync with one another, though I’m not sure if that’s more or less efficient than ad-hoc discussions in something like IRC or discord (or a voip conversation arranged in either). Personally, I’m at my capacity for meetings already.

I’ve always been a proponent of more detailed design and finding as many issues as possible before starting work (to that point I created the design.ros2.org website and tried to encourage its use), but I’ve always been encouraged to just jump in and get started or to avoid things like “decision paralysis”, “premature optimization”, or “over-engineering”. I think those are valid concerns, but at the same time I anticipated issues related to not having enough of our design written down in order to involve external team members. I also started to make a ROS 2 version of the “Conceptual Overview” from ROS 1 in order to convey more of this, but I’ve not had time to keep it up, nor has any project come along that working on it seemed to fit with:

ros_core_documentation/source/rclcpp_cpp_client_library_overview.rst at master · ros2/ros_core_documentation · GitHub (note this is out of date by now)

Perhaps someone who is more on the side of “get things done” could speak up here.

I think it is important to discuss this point in particular. I don’t think that anything in ROS 2 is written in stone. The decisions that have been made are generally made with the best information at the time, but as we gain more experience and use cases, these can all evolve. So things that are still on “first iteration” are mostly there because nobody has had the time to evolve them to where they need to be.

(clipping some good points by William)

I am typically more on the “let’s get something in and iterate later” side, but when building a middleware system, some advance thought does need to be put in. While my point above about pieces evolving is true, we can’t always improve them in the current release (since we generally guarantee API compatibility). So my opinion is that there should be some design work done up-front, but then things should move along and we can iterate towards a better solution later.

Circling back towards whether instantaneous chat helps or hurts here, I have also seen it work on previous projects. But it will mean that everyone involved (including at Open Robotics) needs to make a dedicated effort to be on that chat and to conduct design discussions on that chat, even when they are with people in the same building.

It always comes back to the resources problem. I’m curious if someone (@gerkey?) is able to shed some light on how much development resources ROS2 currently has, how much help the TSC idea has been in increasing them, and how much development resources are expected to grow in the near future. It won’t really help this discussion but it would put some things in a better context.

I was reminded of this need again this morning by this post on a design pull request. I’m not going to argue against wanting to gather around a whiteboard and discuss things in detail. That would be stupid. It’s by far the most efficient approach (for the people there). But I wonder, were an instantaneous chat platform and instant teleconferencing available and used, whether it would be feasible, when such a discussion begins, to announce it in the chat and throw up a teleconference for anyone interested and available at that time to join in. I’m grateful that notes were taken and posted to the pull request, but if I had the opportunity I would have liked to be involved in that discussion and hear the context for and detail of the comments made.

Another idea is, when a new design pull request or major feature implementation pull request goes up, to schedule a day or two later a teleconference to talk about it as part of the review process. Give people some time to read the document or the code, perhaps start their GitHub-based reviews, but also give a forum after that short period for having a chat about it to clear up more complex concerns.

A couple thoughts, as an outsider (slash interested student).

Something like PX4’s weekly dev call would help outsiders like me keep up with development and slowly get involved. It would also make development more transparent and open to the community. There’s already a couple WGs for specific areas (e.g. navigation), but there are none for core ROS2 stuff (AFAIK) other than the (closed-ish) TSC.

Too many options just get confusing, even if each one has its specific uses (discourse vs. answers vs. GitHub vs. IRC).

This is something I’m really in favour of, even though I have too many meetings already.
