Transport Priority QoS policy to solve IP flow ambiguity while requesting 5G network QoS

My name is Ananya Muddukrishna. I am a researcher at Ericsson working on productive methods to connect ROS2 applications to 5G networks (telecommunication). The topic is relevant to Ericsson since both ROS2 and 5G networks have distinctive QoS concepts for machine-to-machine communication.

I would like to discuss a limitation in ROS2 design and potential solutions with you.

5G networks enable robotic applications to assign the required quality of service (QoS) to IP (Internet Protocol) flows [1,2]. The QoS assignment requires that IP flows between communicating entities be unambiguously identified. However, this is problematic in the context of ROS2 because the IP flows between publishers and subscribers that share the same source and destination nodes are ambiguous. One solution to the problem involves exposing the transport_priority QoS policy of DDS within the ROS2 API so that the ambiguity is reduced.

As an example, consider a ROS2 application with nodes N1 and N2. N1 contains publishers P1 and P2. N2 contains subscribers S1 and S2. P1 and S1 are associated with the same topic T1 and compatible QoS. Similarly P2 and S2 are associated with topic T2 and compatible QoS. So connections P1-S1 and P2-S2 are established and messages start to flow.

Nodes N1 and N2 are assigned unique transport layer identifiers in the DDS layer. These transport identifiers are a combination of IP addresses and UDP/TCP ports. The IP packets of messages from P1 and P2 have the same transport identifier. So do packets from S1 and S2. Since source and destination nodes are the same, the IP flows of P1-S1 and P2-S2 appear similar from a transport layer viewpoint.

Assume that P1-S1 requires a 5G network QoS, say a delay of 5 ms, that is different from what P2-S2 requires, say a delay of 300 ms. Because their IP flows are indistinguishable, they cannot each be assigned the required 5G network QoS: both P1-S1 and P2-S2 must be assigned either the 5 ms delay QoS or the 300 ms delay QoS. The former wastes network resources; the latter leads to performance loss.
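
To make the ambiguity concrete, here is a minimal sketch of the situation described above. It models a flow by the usual 5-tuple, following the simplified description in this post. The addresses, port numbers, and the `assign_qos` helper are hypothetical and only illustrate the reasoning; they are not part of any ROS2, DDS, or 5G API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """What the network can see without DSCP: the classic 5-tuple."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


# Hypothetical transport identifiers (IP address, port) of nodes N1 and N2.
N1 = ("10.0.0.1", 7411)
N2 = ("10.0.0.2", 7413)


def flow_key(src, dst, protocol="udp"):
    return FlowKey(src[0], src[1], dst[0], dst[1], protocol)


# Both connections start in N1 and end in N2, so their keys are identical.
flows = {"P1-S1": flow_key(N1, N2), "P2-S2": flow_key(N1, N2)}

# Delay budget each connection would like, in milliseconds.
wanted_delay_ms = {"P1-S1": 5, "P2-S2": 300}


def assign_qos(flows, wanted):
    """Give every distinct flow key one delay budget.

    Connections sharing a key must share a budget; this sketch picks the
    strictest one so that no requirement is violated.
    """
    per_key = {}
    for name, key in flows.items():
        per_key[key] = min(per_key.get(key, wanted[name]), wanted[name])
    return {name: per_key[key] for name, key in flows.items()}


assigned = assign_qos(flows, wanted_delay_ms)
print(assigned)  # {'P1-S1': 5, 'P2-S2': 5}
```

In this sketch P2-S2 ends up with the 5 ms budget it does not need, which is the "waste of resources" case; choosing `max` instead of `min` would give the "performance loss" case.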

DDS has a good solution for reducing such ambiguity: the transport_priority QoS policy. This policy can be used to differentiate packets across publisher-subscriber connections. The idea is that publishers specify a transport_priority value (unsigned long type). This value is transferred (perhaps with some bit-twiddling) to the Differentiated Services Code Point (DSCP) field [3] in the IP packets of the messages between the publisher and its subscribers. No other action is taken by DDS. The transport identifier of the publisher now becomes a combination of IP address, TCP/UDP port, and DSCP value, enabling unambiguous IP flow differentiation (up to 2^6 = 64 publisher-subscriber pairs across two nodes).
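
As a rough illustration of the marking step, the sketch below sets the DSCP value on a plain UDP socket using the standard `IP_TOS` socket option (as available on Linux). The `priority_to_dscp` mapping is an assumption made for this example: how a DDS implementation actually converts transport_priority into a DSCP value is vendor-specific.

```python
import socket

DSCP_BITS = 6  # the DSCP field is 6 bits wide, hence 2**6 = 64 codepoints


def priority_to_dscp(transport_priority: int) -> int:
    """Hypothetical mapping: keep the low 6 bits of the priority value."""
    return transport_priority & ((1 << DSCP_BITS) - 1)


def dscp_to_tos(dscp: int) -> int:
    """DSCP sits in the upper 6 bits of the former TOS byte (low 2 bits are ECN)."""
    if not 0 <= dscp < (1 << DSCP_BITS):
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Open an IPv4 UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

A publisher-side transport could then call, for example, `open_marked_udp_socket(priority_to_dscp(10))` for P1 and use a different value for P2, so that the two flows differ in the DSCP field even though addresses and ports are the same.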

However, the transport_priority policy is not available in ROS2. I don’t know the reasons for its exclusion. In my own private forks of ROS2 repositories, I have found that simple refactoring can expose transport_priority from DDS implementations to rclcpp/rclpy.

Are there any fundamental design problems with exposing transport_priority as a new QoS policy in ROS2?

Do you see other good methods to resolve the IP flow ambiguity problem?

References:

  1. 5G System Architecture, 3GPP TS 23.501
  2. Exposure of 5G Capabilities for Connected Industries and Automation Applications, 5G-ACIA white paper
  3. Differentiated Services Architecture

Thanks for the detailed explanation of the problem you are trying to solve.

One of the design goals of ROS 2 is to be transport-agnostic. While we recommend and use DDS as the default transport layer, there are also alternate transports in use (some examples here).

Every time we expose a DDS feature through the RMW interface, this causes additional work for the non-DDS RMWs. So we have tried to carefully choose what is exposed through the RMW.

The other reason we have been cautious about exposing all of the DDS features is to avoid overwhelming users with too many choices about their transport. As we’ve discussed elsewhere, most of our users want to work on robotics, not on transports. They expect the transport to work reasonably well “out-of-the-box” as much as possible.

Neither of those reasons means we can’t expose new DDS features, only that we need to carefully consider the impact of doing so.

I will also point out that there is another alternative here. It is possible to “side-load” some DDS features into the DDS implementation of your choice (see <no title> for some examples). The method currently varies between DDS vendors, and there are some caveats to doing so, but it allows a way to go around the more limited ROS 2 RMW API. We’ve had some interest in standardizing this, but there hasn’t really been much work done along these lines yet: Add design document on configuring QoS at startup time · Issue #280 · ros2/design · GitHub.

Thanks for the sound arguments and relevant hyperlinks, Chris.

I understand the rationale to be transport-agnostic and enable middleware other than DDS. Side-loading is a great idea but unfortunately options to set transport priorities for select publishers in a ROS2 application are unavailable in any of the tier-1 DDS implementations.

I want to think more about what you have said and pointed to. Will get back here with my thoughts.

@clalancette I have also made use of DDS’s transport_priority QoS in multiple commercial products for some of the same reasons as @anamud, plus a few others. I wanted to add another vote here for the usefulness of that particular QoS policy, and also to comment on the value of being able to access DDS features that are very useful for the more specific and advanced use-cases common in commercial and industrial settings (see also my recent writeup on partition QoS here: https://github.com/ros2/design/issues/261).

I’ve weighed in a couple of times on some discussions around the interface to the DDS APIs, knowing that while the end users of the robotic systems may not want to, or don’t care about, the implementation details of transport and middleware configurations, the system developers who are creating platforms for these users (such as Ericsson in this case, and my company as well) very much do.

Perhaps rather than placing the burden on the RMW layer itself, such that all middleware integrators would have to implement this feature, there could be a ROS2 DDS API abstraction layer that allows applications to access all standard DDS features when they are running on top of a DDS-based RMW implementation (which, I would argue, most systems will be). This would spare them from writing vendor-specific code, at the expense of writing DDS-specific code. I would argue that there is no real penalty there, since two robots running on different families of RMW implementations (such as DDS vs OPCUA) won’t be able to communicate directly without a bridge layer anyway. If the features in the DDS layer find enough use, perhaps they could eventually be elevated out of the DDS API layer into RMW, if it makes sense.

Perhaps this would be a good topic for the Middleware Working Group that I recently saw a post about? I would love to find a way to have deeper discussions on these DDS/QoS related topics, and would be happy to provide resources to tackle some of these issues.

Curious to hear your thoughts.

I think this would be a great topic for the Middleware WG! I’ve been wondering if we could think about rearchitecting how we expose QoS settings in a more generic way. We’ve chosen a few we care about, but exposing each of them required “shotgun surgery” across many layers and repositories, which is not ideal. I’d love to discuss different potential architectures.

One clarification for me (I am not familiar with 5G) - how are you connecting ROS2 communications across the cellular connection? I have only seen examples of communicating over the LAN. I think RTI has a separate bridge server you can run, but I haven’t seen the equivalent for other RMW implementations. The only way I personally achieved this functionality in ROS1 was by opening a VPN to the robot (which was on LTE at a known IP), but this solution is less than ideal.

@emersonknapp While I can’t speak to what specific use-case Ericsson has relating to 5G, I have some professional experience working with private LTE cells, which are now fairly accessible due to the ability to operate on the technology neutral 3.5GHz spectrum via the recently opened CBRS band (see: https://en.wikipedia.org/wiki/Citizens_Broadband_Radio_Service). I know Ericsson, Nokia, and Motorola have products/services that allow you to set up private 4G LTE cells/private enterprise networks composed of multiple technologies. Since the spectrum is technology neutral, it will be possible for 5G systems to be deployed in these networks as well, which I’m sure all of the players in the industry are pursuing.

This is particularly useful in IIoT environments such as marine ports, construction sites, automotive, etc., because you get a completely private, low-latency, high-bandwidth LTE network for your devices and robots to operate on, without paying an MNO/MVNO for LTE data. In my experience, these cells and networks can be configured pretty much out of the box to function similarly to any wireless LAN solution, like WiFi combined with routers/switches, and DDS/ROS2 work over them without too much fuss, since you aren’t dealing with carrier-grade NAT, firewalls, or changing IPs like you would in traditional MNO networks.

The problem with a proposal like this is that it immediately breaks the promise of a transport-agnostic API. If lots of packages start using this, then you can’t take an arbitrary ROS package and run it on the transport of your choice (which is what you can do today). You have to use DDS.

I’d be more interested in proposals around this concept, to either a) make it easier to expose more features, or b) expose the features in a much more dynamic way.

I would argue that that is only a problem if you are developing a package that is 1.) to be distributed at all and 2.) intended for completely generic use-cases. For many commercial and industrial applications, that isn’t the case, and if their software is to be deployed for others to use, placing a constraint on the use of DDS is generally not a problem. It is also not difficult to split the functionality in your packages into middleware-agnostic components and middleware-dependent components. In my experience, core algorithms, logic, functions, etc. are often written as libraries in such a way as to completely isolate them from the middleware/transport interfaces, while the end-users of the package (typically individual nodes) are what use the middleware layer to connect data with the algorithms.

I highly doubt you would start to see a lot of DDS-dependent libraries and nodes popping up in the ecosystem, since most regular users of ROS won’t need such configurations, and if you do, it may be a sign that those features actually are valuable for a broader set of users. Where you are more likely to see these functions used is in nodes and frameworks that have been designed to create a specific robot ecosystem, typically by integrators and product developers, where the middleware implementation may be selected and made consistent throughout the ecosystem.

As an example, at OpenROV and now at my current company, we are developing a ROS2-based framework for marine robotics, including GUI applications and the software components that run on the robots. There will never be a case where it is necessary to maintain middleware agnosticism in the nodes, and if there ever were a user who wanted to integrate with other RMW layers, that could be accomplished easily enough by creating a bridge between the two ROS2 implementations. I would expect that to be a very rare case, and a bridge would be necessary anyway, since DDS-based communications wouldn’t interoperate with the other middleware. There are very specific DDS features that I would like to rely on to cover important use-cases in our field, and there is no value proposition to most of our users in maintaining middleware agnosticism. There IS a value proposition in maintaining DDS vendor agnosticism though, in that there is flexibility to use open source vs proprietary implementations.

Sometimes it’s easier to come up with good proposals after brainstorming with some relevant parties first :)

So I will say that large parts of our ecosystem are exactly this; distributed, and intended for generic use-cases.

But this is clearly not true in the ecosystem at large, as the existence of the other RMWs demonstrates. These RMWs aren’t toys; I know they are being used in production in some places. So it’s not a factor that we can ignore.

To be clear; I’m not talking about making DDS interoperate with other middlewares. As you’ve mentioned elsewhere, that requires bridging, etc., and I would agree that most users don’t care about this. What users do care about is being able to design a system completely based on DPS or iceoryx as the middleware, for instance. Those middlewares have advantages that DDS can’t match in some circumstances.

I’m not making the argument that there aren’t ecosystems that require agnosticism or use different implementations, nor do I have any conception that systems using alternate RMW implementations are toys. I have been working with ROS, DDS, and communication middlewares for a long time now and am very aware of the various pros, cons, and tradeoffs that the different middleware options provide.

Your point that those middlewares have advantages that DDS can’t match in some circumstances (and in my case, vice-versa) indicates the need to expose middleware-dependent features in some way, such that developers in those ecosystems have the ability to leverage the advantages of their middleware implementation. After all, why would anyone choose a different RMW implementation technology if not for some perceived advantage or feature set that that implementation provides (leaving out the obvious discussion of licensing of implementations within a middleware family)?

In the end, my argument is that use of these features is typically a concern of developers who are deploying software within an ecosystem they are designing, and there ought to be a way for them to take advantage of those features in a consistent, vendor-agnostic manner. This argument applies not just for DDS, but I would argue that given multiple OPCUA, or multiple zero-copy shmem middleware vendors, or whatever it is, there ought to be a way to expose APIs to specific transport/middleware capabilities.

@clalancette

If I may, could I ask whether there are criteria for what can be integrated into the rmw interfaces and what cannot (such as being transport-agnostic)? Or maybe that is something we could discuss in the Middleware WG, which is starting up now?
I understand that it depends on the situation and the trade-offs, and that the boundary is hard to draw clearly, but even a general concept or idea would be really helpful. This information matters a lot to us.

I’d be more interested in proposals around this concept, to either a) make it easier to expose more features, or b) expose the features in a much more dynamic way.

+1

thanks in advance,

Sorry, off the top of my head, I don’t have criteria. But I do think that would be a good task for the Middleware Working Group to look at.

@clalancette

SGTM! thanks.

I too believe there is room for improvement in the current QoS implementation. It does not exercise the underlying RMW QoS controls to their full potential.

ROS2 applications that I study are connected to bespoke private industrial networks. The capacities of these networks are planned out in detail before deployment. See more details about such networks here and here. Similar to @spiderkeys, I did not face problems getting basic ROS2 applications up and running in my tests.

I believe that an approach similar to QoS profiles is key to extending the QoS API. QoS profiles abstract away the various underlying QoS policies by assigning sensible defaults. This enables the addition of new QoS policies, including all of the 20-odd remaining QoS policies of DDS, without burdening the end-user. In other words, any number of new QoS policies can be added as long as sensible defaults are available within QoS profiles and the RMW realizes the QoS required by the profile. The advantage is that users who are able to configure QoS policies directly can fine-tune QoS profiles without resorting to complicated workarounds like side-loading. High-level pseudo-code for the idea is shown below:

QosProfile qp = SensorQosProfile();

/* Modify directly */
qp.depth(1);
qp.deadline(1s);

/* Resolve to DDS RMW and modify */
DdsQosProfile dds_qp = qp.resolve(RMW_DDS);
dds_qp.transport_priority(0x42);
dds_qp.latency_budget(30ms);

/* Resolve to NON_DDS RMW and modify */
NonDdsQosProfile non_dds_qp = qp.resolve(RMW_NON_DDS);
non_dds_qp.burst(512);
non_dds_qp.compression(9);

In the code, users who are happy with the default sensor profile can use it as is. Those who would like to customize it can modify it directly; modification of the depth and deadline policies is shown. Further fine-tuning is possible by resolving to specific RMW components. The first case resolves to DDS RMW components and fine-tunes the transport_priority and latency_budget QoS policies. The second resolves to a hypothetical RMW called NON_DDS and fine-tunes its burst and compression QoS policies. Note that the QoS policies in the resolved QoS profiles need not be similar.

The essential idea is that it is the QoS profile that enables RMW-/DDS-/transport-agnosticism, not the QoS policy.
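
For readers who want something executable, the pseudo-code above could be modelled roughly as follows. This is only a sketch of the idea: none of these classes or functions exist in ROS2, the default values are invented, and the policy names for the non-DDS middleware are as hypothetical as in the pseudo-code.

```python
from dataclasses import dataclass


@dataclass
class QosProfile:
    """RMW-agnostic policies shared by every middleware (hypothetical model)."""
    depth: int
    deadline_ms: int


@dataclass
class DdsQosProfile(QosProfile):
    """Adds DDS-only policies, each with a default the user may override."""
    transport_priority: int = 0
    latency_budget_ms: int = 0


@dataclass
class NonDdsQosProfile(QosProfile):
    """Adds the policies of a hypothetical non-DDS middleware."""
    burst: int = 1
    compression: int = 0


# Which specific profile type each RMW family resolves to.
RESOLVERS = {"dds": DdsQosProfile, "non_dds": NonDdsQosProfile}


def resolve(profile: QosProfile, rmw: str) -> QosProfile:
    """Copy the generic policies; RMW-specific ones start at their defaults."""
    return RESOLVERS[rmw](depth=profile.depth, deadline_ms=profile.deadline_ms)


def sensor_qos_profile() -> QosProfile:
    """Stand-in for a predefined sensor profile (values are made up)."""
    return QosProfile(depth=5, deadline_ms=0)


qp = sensor_qos_profile()
qp.depth = 1              # modify generic policies directly
qp.deadline_ms = 1000

dds_qp = resolve(qp, "dds")
dds_qp.transport_priority = 0x42
dds_qp.latency_budget_ms = 30

non_dds_qp = resolve(qp, "non_dds")
non_dds_qp.burst = 512
non_dds_qp.compression = 9
```

In this model the generic policies carry over to every resolved profile, while each resolved profile exposes only the extra policies its own middleware understands.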

The catch with this is, what do we do for an RMW implementation that doesn’t use DDS and doesn’t provide all of those QoS policies? The default assumption is that we have to implement them in the RMW implementation for that non-DDS middleware, which can be a significant burden.

In the code, users who are happy with the default sensor profile can use it as is. Those who would like to customize it can modify directly.

This proposal makes a node’s implementation depend directly on two underlying middlewares, breaking the RMW abstraction.

I am unable to see the catch. A QoS profile is composed using mutually-exclusive QoS policies from underlying RMW implementations. The assumption that all QoS policies should be implemented by all RMW implementations goes away as soon as QoS profiles are introduced. A valid RMW is only required to realize the QoS expected by the profile.

Please clarify how it breaks. I see it differently. A node is dependent on the RMW deep down. The proposed code makes that explicit to end-users so that they can fine-tune the coupling. The RMW abstraction is not lost entirely. Users who are not interested in fine-tuning can stay away from the explicit interfaces.

One of the main reasons for the RMW abstraction is so that a node never has to know what RMW implementation is being used. By putting code into the node that depends on specific RMW implementations, that abstraction is lost and the node’s behaviour starts changing depending on which RMW the user of the node loads.