Error using python3 nodes on ARM64 & CoreDX

I have everything for ROS2 beta3 compiled and running on aarch64 with CoreDX. Most things work fine. However, whenever I try to create a subscription in a python node, I get this error:
Error in /usr/bin/python3: free(): invalid pointer: 0x…
I see the error when running demo_nodes_py listener. The same code executes fine on x64 platforms. Does our python extension rely upon platform-specific code for some of its pointer handling or struct copies? Of course aarch64 should also be using 64bit pointers. What would be different between FastRTPS (which works) and CoreDX in the python extensions? What can I do? Where should I look?

PS, FastRTPS still spews a lot of this when I run the listener: [RTPS_HISTORY Error] Change payload size of ‘8808’ bytes is larger than the history payload size of ‘5000’ bytes and cannot be resized. -> Function add_change

  • Do you happen to have a backtrace?
  • Does it fail when you create a subscription or when you try to deserialize a message?
  • Do you have the same problem with message types that are of known size (like uint64t)?

Can you reproduce it with a pure C communication? (running test_messages_c from test_communication for example)
I would expect the problem to be the same but that would at least allow you to rule out anything Python related.

My guess is that it may be something in the CoreDX C typesupport. At the moment we only test Fast-RTPS on aarch64, and it uses the introspection mechanism. It’s possible that the alignment (and thus the pointers you access and free) is not correct when it’s built for aarch64.

Can you please provide more details on the related thread (Payload size error with FastRTPS)?

Do you happen to have a backtrace?

I can’t seem to create one. No core dump is produced even with ulimit -c unlimited. The error must be caught (but subscription not created).

Does it fail when you create a subscription or when you try to deserialize a message?

Definitely when creating the subscription.

Do you have the same problem with message types that are of known size (like uint64t)?

Yes, I do.

Can you reproduce it with a pure C communication?

No, I cannot.

running test_messages_c…

All the rclcpp tests from test_communications pass and all the rclpy tests fail with the same error:
[test_subscriber] *** Error in `/usr/bin/python3': free(): invalid pointer: 0x0000007f9753f768 ***

I question the PyMem_Free call here (and in similar contexts throughout the file): rclpy/rclpy/src/rclpy/_rclpy.c at commit ec2220239d089f7b5d03a6ed1f6a3136e7c4d049 of ros2/rclpy.
I’m no Python extension expert, but surely it’s not best practice to destroy objects passed in by the user; that should be the caller’s responsibility. What if the same QoS object were passed to multiple subscription calls but the first one destroyed it? From looking around online, it seems that the right approach is to call Py_XDECREF on each object obtained via PyArg_ParseTuple (every parameter with type “O”, other types having their own cleanup instructions).
Update: The documentation says that “O” objects are not ref-incremented, so I think the stuff I was reading about decrementing their ref count is not correct.

Thanks for the detailed answer.

That is a very good point. We could either modify the behavior and make it clear that users should destroy such objects themselves or document the fact that this is freed by the function. Though I tried creating a bunch of subscriptions using the same qos_profile object (custom or default) and couldn’t reproduce the error described in this thread.

To go back to the problem at hand, if the PyMem_Free is the issue, it doesn’t explain why the first subscription created would crash because at that point the object has not been destroyed yet. And if this was only a logic issue in the memory management in the Python stack, I would expect the error to be consistent regardless of the rmw implementation used…
So I still think that it’s something inside rmw_coredx not behaving as expected.

Could you confirm if the error disappears if you modify rclpy to not free the qos_profile on subscription creation?

gdb should allow you to trace the code through the c extensions.
gdb --args python3 <PATH_TO_YOUR_PYTHON_LISTENER>
The backtrace would be very useful to track down where the problem comes from.

Just to confirm: you have exactly the beta3 code without modification except that you are using your own version of rmw_coredx?

Could you confirm if the error disappears if you modify rclpy to not free the qos_profile on subscription creation?

I confirmed that the error persists even without destroying the qos_profile.

you have exactly the beta3 code without modification except that you are using your own version of rmw_coredx?

Correct. I did manage to get a not-so-helpful stack trace from gdb but I don’t think it is right yet since it’s from publisher.py. I’ll keep working on that.

I’ve been attempting to get a stack trace of the crash through manual means. It crashes here:

  1. https://github.com/ros2/rosidl/blob/48eb1bc0ee2902feaad47eb55c19c10a2c322410/rosidl_generator_c/src/message_type_support.c#L28
  2. from https://github.com/asirobots/rmw_coredx/blob/5aad0cca594edfeee7e8967327ac7f0fc6348a56/rmw_coredx_cpp/src/functions.cpp#L60
  3. from rmw_create_subscription in that same file.

What is func set to there? Or where would I put my next printf?

The level of “remote” debugging is getting pretty difficult. If you can provide a set of copy-n-paste-able steps to reproduce the problem I can take a look at the actual failure in gdb.

Sure, although I’m not sure these steps are easier than helping me understand message_type_support.c. Steps to be run on aarch64 (ARM Cortex 53 or 57):

  1. Take the ros2.repos file from https://github.com/asirobots/ros2 in the release-beta3-asi branch.
  2. Remove rmw_opensplice from the repos file so as to not confuse rclpy (which seems to load opensplice message libraries at random when running with CoreDX as the middleware – feel free to debug that too).
  3. Use vcs tool to process the repos file.
  4. Modify src/ros2/tinyxml_vendor/tinyxml_cmakelists.txt to include set(CMAKE_POSITION_INDEPENDENT_CODE ON), which might only be necessary for GCC7 (which I have been using for cortex57 support; I’m not sure if the error happens with GCC5).
  5. export CFLAGS="-march=native", export CXXFLAGS="-march=native"
  6. export RMW_IMPLEMENTATION=rmw_coredx_cpp
  7. compile: src/ament/ament_tools/scripts/ament.py build --skip-packages ros1_bridge --cmake-args -DCMAKE_BUILD_TYPE=Release (although you probably don’t want a Release build for tracking this)
  8. Get your CoreDX environment variables established (run their env script and then run its output). You may need a copy of CoreDX from here: http://www.twinoakscomputing.com/lic_eval/coredx-4.0.20-Linux_2.6_aarch64_gcc49-Evaluation.tgz . TwinOaks will give you an evaluation license if needed.
  9. Source in the install and execute ros2 run demo_nodes_py listener – it will give an error and not work correctly.

The struct containing func is being declared here.

For the CoreDX C typesupport it is initialized here.

The function basically provides the typesupport for a specific message, which enables “generically” written code to use and interact with messages that were not available at compile time but were defined later.
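To make the role of func more concrete, here is a minimal sketch of the general pattern: a handle carries an identifier, an opaque data pointer and a function pointer, and generic code asks the handle to resolve itself for a requested identifier. All names here (TypeSupportHandle, match_identifier, get_handle) are hypothetical and simplified for illustration; this is not the actual rosidl or rmw_coredx code.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, simplified stand-in for a message type-support handle.
struct TypeSupportHandle;
using HandleFunction =
  const TypeSupportHandle * (*)(const TypeSupportHandle *, const char *);

struct TypeSupportHandle
{
  const char * identifier;  // names the implementation this handle belongs to
  const void * data;        // implementation-specific payload (opaque to generic code)
  HandleFunction func;      // resolves a handle for a requested identifier
};

// One possible callback: return the handle itself when the identifiers match,
// otherwise nullptr.
const TypeSupportHandle * match_identifier(
  const TypeSupportHandle * handle, const char * identifier)
{
  assert(handle != nullptr);
  assert(identifier != nullptr);
  return std::strcmp(handle->identifier, identifier) == 0 ? handle : nullptr;
}

// Generic entry point: it knows nothing about the message type and simply
// delegates to whatever callback was stored in the handle.
const TypeSupportHandle * get_handle(
  const TypeSupportHandle * handle, const char * identifier)
{
  assert(handle != nullptr);
  assert(handle->func != nullptr);
  return handle->func(handle, identifier);
}
```

In a layout like this, a crash inside the generic entry point usually means that the handle, its func member, or the identifier string is not what the caller expects, which is why printing those values right before the call is a reasonable next step.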

I would suggest building in Debug so that the asserts are checked. Additionally, printing the value of handle as well as its members and identifier (within message_type_support.c) might provide valuable information.

I was expecting a list of command line invocations which I could easily run by copy-and-pasting them into a console. I managed to follow your text instructions (even though bullets 7 and 8 are in the wrong order). In the future it would be great to provide ready-to-run commands (e.g. instead of asking people to edit files, it is more convenient to share the already edited version through a Gist), since that reduces the effort needed to help you and makes the whole process less ambiguous.

Anyway my build finished and when I run the talker and listener I see the following error message twice:

dds_thread_set_stacksize(): pthread_attr_setstacksize() error(22):Invalid argument

Besides that, the talker prints Publishing: "Hello World: N" and the listener prints I heard: [Hello World: N], so that seems to work just fine for me. Note: I built in Debug since I expected to need to debug asserts / segfaults / etc.

Wow, Dirk. Kudos! That’s good support. I thought no one would work through those steps. I worked on this issue a little more today. First, I discovered that I don’t get the error when I compile rosidl_interfaces in debug mode. When in release mode, it is crashing in this line std::istringstream ss(value) from the split method in type_support_dispatch.cpp (the c one, not the cpp one). It must be a bug in whatever local version of the standard library I’m running. Is it glibc that I would check the version on? I’m out of time for the day, but I will try a different split implementation tomorrow.

I replaced the split implementation in
src/ros2/rosidl_typesupport/rosidl_typesupport_c/src/type_support_dispatch.cpp
It then crashed for me in very similar code found in
src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp

After that everything was working for me. It appears to be a bug in the aarch64 build of libstdc++ 6.0.24, which is at least the version that I have on my ARM64 machines; it came with GCC 7.2. I was unable to write a simple program that reproduced the problem. However, I see that there are some pending stringstream fixes for the next GCC version, e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81338 .

For my fix I incorporated this split implementation: https://stackoverflow.com/a/1493195/1208289
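For anyone hitting the same crash, the idea is simply to split the string without going through std::istringstream. Below is a minimal sketch of that approach using only std::string::find and substr. The function name split_no_stream and its signature are my own for illustration; it is neither the exact code from the linked answer nor the signature used in type_support_dispatch.cpp or functions.cpp.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split `input` on every occurrence of `delimiter` using only std::string
// operations (no std::istringstream). Empty tokens are kept, so an empty
// input yields a single empty token.
std::vector<std::string> split_no_stream(
  const std::string & input, const std::string & delimiter)
{
  // An empty delimiter would never advance `start` and loop forever.
  assert(!delimiter.empty());

  std::vector<std::string> tokens;
  std::string::size_type start = 0;
  std::string::size_type pos;
  while ((pos = input.find(delimiter, start)) != std::string::npos) {
    tokens.push_back(input.substr(start, pos - start));
    start = pos + delimiter.size();
  }
  // Whatever remains after the last delimiter is the final token.
  tokens.push_back(input.substr(start));
  return tokens;
}
```

Note that this sketch keeps empty tokens, including a trailing one when the input ends with the delimiter, so callers that expect such tokens to be dropped would need to filter them out.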