This is a topic that I thought of discussing in another thread, but it is probably better to do it separately.
In practice, the use_sim_time parameter is a point of failure in a ROS 2 application that is sometimes hard to detect. My question is: why is this parameter necessary? If there is a /clock topic, it is clear that the nodes should use it.
Apart from simulators, this topic is useful in a distributed system (yes, I know that there is a complex theory around this) on a local network when you want to have a unique clock. We do that in the MOCAP4ROS2 project with a node that publishes its clock, but it is strange to set use_sim_time to true when we are not really using a simulator.
I thought the scenario “publishing wall-time to /clock” was something from the category “do not ever try it”. Does it work? Can it work? How well can it work? I thought this kind of scenario should be set up so that all PCs in the LAN are time-synchronized by some standard mechanism (NTP, PTP, chrony) and all of them use their wall clock. This is our setup, too.
As for the importance of the parameter - I can imagine that the /clock publisher can appear late in the network, and it’s probably better if all earlier started nodes instantly know that they should work in simtime and wait for the /clock publisher, rather than starting in walltime and doing a switch to simtime when already running… That would often lead to ROS time jumping backwards by a lot for the nodes, which is something that’s better avoided (not all nodes process this event correctly). Timers and sleeps could also have a problem with doing this switch.
Yes, it can work, but you will have objectively worse performance than with standard synchronization techniques in almost every metric. You will end up with a lot of overhead from sending messages very often, or with very choppy stepwise time, or both. Systems that leverage a local clock plus corrections are much more efficient and provide more accurate, continuous results. You can start trying to send projections, curves, and rates, but then you’re really reinventing the very well established NTP protocol and other synchronization methods, which have extensive research behind them for exactly this problem. I would strongly recommend against trying to use /clock in real time, and recommend using the much more optimized tools such as NTP and chrony instead.
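To put rough numbers on the "overhead or choppy time" trade-off, here is a toy model in plain Python. It is only a sketch: it assumes ideal, zero-latency delivery of /clock messages (real networks add latency and jitter on top), and the function names and the sampling step are made up for illustration, not part of any ROS API.

```python
NS_PER_S = 1_000_000_000


def last_published_tick_ns(true_time_ns: int, publish_rate_hz: int) -> int:
    """Latest time value a subscriber holds, assuming zero-latency delivery."""
    period_ns = NS_PER_S // publish_rate_hz
    return (true_time_ns // period_ns) * period_ns


def worst_case_staleness_ns(publish_rate_hz: int, sample_step_ns: int = 137_000) -> int:
    """Largest gap between true time and the held value over one second of queries."""
    worst = 0
    for true_time_ns in range(0, NS_PER_S, sample_step_ns):
        held_ns = last_published_tick_ns(true_time_ns, publish_rate_hz)
        worst = max(worst, true_time_ns - held_ns)
    return worst


if __name__ == "__main__":
    for rate_hz in (10, 100, 1000):
        staleness_ms = worst_case_staleness_ns(rate_hz) / 1e6
        print(f"{rate_hz:5d} msgs/s -> time can lag by up to ~{staleness_ms:.2f} ms")
```

In this model the worst-case lag is bounded by one publish period, so halving the choppiness means doubling the message rate for every subscriber. A disciplined local clock does not have that coupling.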
If you want a way to think about this: you’re no longer using a real timeline but a simulated timeline that’s being controlled by your “unique clock”. If you’re making that simulated timeline look like the real timeline and don’t expect it to diverge, I’ll fall back to my suggestion above that you not publish what is effectively a poor replacement for system time.
As @peci1 mentions, the reason for the parameter is that there are many edge cases. In particular, you start with the presumption "if there is a /clock topic", which does not always have a deterministic answer in a distributed system. The most common case is at startup of a node. It will start up, and before it is able to complete discovery and connectivity for the /clock topic the user may ask for the current time. If the parameter is set (which can be guaranteed before startup) then the node will know not to just use the system time. Without it, the node might set up a few rates or timers on wall time when it first starts and then jump a billion seconds into the past to simulated time, at which point every single timer, rate, sleep, piece of cached data, and anything else must be reset in the node.
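The startup race can be illustrated with a small toy model. This is not the rclcpp/rclpy API: ToyNodeClock, the "auto" mode, and the fixed wall-clock value are all hypothetical. The model also assumes that a node on simulated time reports zero until the first /clock message arrives, which is my understanding of what ROS 2 does.

```python
from enum import Enum
from typing import Optional, Tuple

NS_PER_S = 1_000_000_000
WALL_NOW_NS = 1_700_000_000 * NS_PER_S  # hypothetical wall-clock reading


class Mode(Enum):
    WALL = "wall"  # use_sim_time is false
    SIM = "sim"    # use_sim_time is true, known before startup
    AUTO = "auto"  # hypothetical: switch to /clock whenever it shows up


class ToyNodeClock:
    """Toy stand-in for a node's time source."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.last_clock_msg_ns: Optional[int] = None

    def on_clock_msg(self, sim_time_ns: int) -> None:
        self.last_clock_msg_ns = sim_time_ns

    def now_ns(self) -> int:
        if self.mode is Mode.WALL:
            return WALL_NOW_NS
        if self.last_clock_msg_ns is not None:
            return self.last_clock_msg_ns
        # No /clock message received yet.
        if self.mode is Mode.SIM:
            return 0  # assumed behaviour: report zero rather than wall time
        return WALL_NOW_NS  # AUTO has nothing better than wall time so far


def readings_around_discovery(mode: Mode) -> Tuple[int, int]:
    clock = ToyNodeClock(mode)
    before = clock.now_ns()              # user asks for time before discovery completes
    clock.on_clock_msg(5 * NS_PER_S)     # first /clock message: simulated time is 5 s
    after = clock.now_ns()
    return before, after
```

With Mode.AUTO the second reading is roughly a billion seconds earlier than the first, which is exactly the negative jump described above. With Mode.SIM time only ever moves forward from zero.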
There are also potential challenges if a node loses connectivity, potentially temporarily. This may be caused by network issues or Wi-Fi range. Or it may be caused by the publisher temporarily restarting, such as when playing two rosbags in a row. If the node publishing /clock restarts or is replaced, then there’s an instant when all other nodes in the system might not see /clock; they would then return to system time temporarily and jump back to simulated time when the new /clock publisher comes online.
Any back-and-forth jumping between timelines can make a well written node close to non-functional, because it generally must clear all internal state on a timeline jump: all state and historical data generally has to be considered invalid as you’re on a new timeline. In addition, many nodes are not robust to negative timeline jumps (forward jumps usually work out, with rates, timers, and caches automatically aging out), so avoiding the initial negative timeline jump improves the user experience.
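As a rough sketch of what "clearing state on a timeline jump" means for a node, here is a hypothetical stamped cache in plain Python. ToyStampedBuffer and its fields are invented for illustration. A real node would typically hook into whatever time-jump notification its client library offers rather than comparing timestamps by hand.

```python
from collections import deque
from typing import Any, Deque, Optional, Tuple


class ToyStampedBuffer:
    """Hypothetical cache of stamped samples that resets when time moves backwards."""

    def __init__(self, max_age_ns: int) -> None:
        self.max_age_ns = max_age_ns
        self.samples: Deque[Tuple[int, Any]] = deque()  # (stamp_ns, value), oldest first
        self.last_now_ns: Optional[int] = None
        self.resets = 0

    def update(self, now_ns: int) -> None:
        if self.last_now_ns is not None and now_ns < self.last_now_ns:
            # Negative jump: everything stored belongs to the old timeline.
            self.samples.clear()
            self.resets += 1
        self.last_now_ns = now_ns
        # Forward motion, including forward jumps, simply ages old samples out.
        while self.samples and now_ns - self.samples[0][0] > self.max_age_ns:
            self.samples.popleft()

    def add(self, now_ns: int, value: Any) -> None:
        self.update(now_ns)
        self.samples.append((now_ns, value))
```

The forward case needs no special handling because the age check already discards stale data. The backward case is the one that has to be written explicitly, and it is the one many nodes omit.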
The first implementations of ROS time didn’t have the parameter; it was quickly added because the systems were much harder to use without it.
You also have challenges such as someone starting a node that happens to advertise /clock and thereby being able to completely halt a running robot, for example when testing playback of a bag that you just recorded. Any system deployment with hardware in the loop should be very careful and explicit about when to allow simulated time.