OpenVDB Compile Size

openvdb_vendor was unfortunately not part of this release, despite the rosdistro PR being merged. It is required for spatio_temporal_voxel_layer.

/tmp/binarydeb/ros-jazzy-openvdb-vendor-2.5.0/.obj-aarch64-linux-gnu/openvdb_vendor-prefix/src/openvdb_vendor/openvdb/openvdb/…/openvdb/tools/LevelSetRebuild.h:360:1: fatal error: error writing to /tmp/ccTany5H.s: No space left on device
360 | } // namespace openvdb

Is this disk space issue something that resolves itself?

The main problem isn’t disk space, as far as I can tell. Here is the latest attempted build: Jbin_uN64__openvdb_vendor__ubuntu_noble_amd64__binary #9 Console [Jenkins]

If you look closely, you can see that the compiler was killed. That is almost always because the compilation has used too much memory, and the OOM-killer has kicked in. So probably the best thing to do is to try to profile the compilation, and see if there are ways to reduce the amount of memory it is using at compile time.
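One rough way to start (just a sketch; heavy_tu.cpp is a placeholder file name, and you would add the project's real include and define flags) is to measure a single heavy translation unit with GNU time and GCC's own memory reporting:

    # Peak resident memory for one heavy translation unit (heavy_tu.cpp is a placeholder;
    # add the project's real include/define flags). Uses GNU time, not the shell builtin.
    /usr/bin/time -v g++ -c -O2 heavy_tu.cpp -o heavy_tu.o 2>&1 | grep 'Maximum resident set size'

    # GCC can also report its own per-pass memory and time usage:
    g++ -c -O2 -fmem-report -ftime-report heavy_tu.cpp -o heavy_tu.o

That should at least point at which headers or templates dominate the footprint.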

For some reason, I didn’t get emails about this one, so I didn’t notice it in time to address it sooner! Sorry about that. It looks like openvdb_vendor doesn’t have my email in the package.xml.

So probably the best thing to do is to try to profile the compilation

How much memory is the build farm configured with by default? I’ll say even I struggle to compile OpenVDB on my laptop for the same reason. This was built successfully on Humble/Iron, though it’s about 30/70 whether it gets through the build.

The source jobs turn over reliably, which is interesting. I expect the difference there is release flags.

I’m not sure. @nuclearsandwich can probably chime in here.

Each build agent is a compute-optimized “xlarge” instance in EC2, spread between c5.xlarge, c6i.xlarge, and m5.xlarge, all of which have 4 vCPUs and 8 GiB of memory.

Each individual package build is expected to use 1/4th of those resources: 1 vCPU and 2 GiB of RAM. These are soft limits that are not enforced, but since agents run up to four jobs at a time, packages using more than the available memory will be subject to pressure from the OS.

That explains some of the non-determinism on the Humble / Iron jobs that seem to pass 20-30% of the time.

Is there a not-too-painful way to get more memory for this job? For most packages, I might choose not to release them if this is going to be an issue, but this is a somewhat core package for the Nav2 userbase; we added the vendorization by popular demand and because the openvdb binaries were not suitable on multiple OSes. I don’t expect this package to be rebuilt often (it has very infrequent updates).

Do you know how much memory is needed? 8GiB is a hard limit since we don’t have bigger machines than that in the worker pool.

Most of these beefy packages succeed when retried after everything else has already built and they can use an entire worker.

I don’t have a hard number. I’ll say that my developer PC has issues with it at 16 GB, but Iron / Humble seem to be able to do it on the build farm with 8 GB (not sure what’s different about Jazzy beyond 24.04). @Timple, were you compiling this locally previously? Can you chime in?

Part of how we size the machines on the build farm is that it also effectively sets a requirement for any user who compiles from source.

Because this is a core package, it’s effectively going to set a minimum requirement for any Nav2 user compiling from source. If we bump that up to 8 GB, it will make life a lot harder for SBC users and other embedded dev environments.

Hi, @smac, thank you for providing openvdb_vendor, which makes it easy to use OpenVDB with CMake. We plan to add OpenVDB support to RTAB-Map for better map representation. New features are expected to be released in about half a year. We hope this will be fixed so that we can use OpenVDB from the ROS repository directly as a dependency.

It took some iterations on Humble and Iron as well, but it succeeded in the end. And as you mentioned, there hasn’t been a need for recompilation, thanks to the minimal set of dependencies.

If there isn’t a maximum duration for the jobs, we might be able to enforce something like MAKEFLAGS=-j1 for the buildfarm.
Does anyone here know how to pass this to the buildfarm without hardcoding it in the package?
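In the meantime, a rough way to approximate the farm’s constraints locally would be something like the following (values are illustrative only; the 2G cap mirrors the soft limit mentioned above, and systemd-run needs cgroup v2 memory delegation for user scopes):

    # Single-threaded build under a ~2 GiB memory cap (illustrative values):
    export MAKEFLAGS="-j1"
    systemd-run --user --scope -p MemoryMax=2G \
        colcon build --packages-select openvdb_vendor --parallel-workers 1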

There is a maximum duration (2 hours), but I don’t think that matters anyway; the buildfarm by default already uses MAKEFLAGS=-j1, as far as I understand (@nuclearsandwich can confirm).

The default is already -j1 inherited from Debian defaults. We have never changed this specifically because we intend for the build farm to parallelize primarily at the package build level so that individual package builds are as deterministic as possible to aid debugging.

@Timple, what’s your compiling experience been? Admittedly, I don’t use STVL day to day anymore in my current role.

While suboptimal, we could compile a Debian package and attach it to a GitHub release that people can download and install. I’m not 100% sure how we’d make STVL apt-installable using that, except by having the STVL CMake manually download and unpack that file. This is a bit of a security issue, but that way STVL can be apt installed while openvdb_vendor is compiled once somewhere we don’t have these limitations.
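Roughly, that fetch-and-unpack step would boil down to something like this (URL, filename, and install prefix are all placeholders, not real artifacts):

    # Hypothetical fetch-and-unpack of a prebuilt deb attached to a GitHub release:
    wget -O openvdb_vendor.deb "<release-url>/ros-jazzy-openvdb-vendor_<version>_amd64.deb"
    dpkg-deb -x openvdb_vendor.deb "<install-prefix>"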

I was also thinking perhaps GitHub Actions could be used here to compile it… that way there’s at least some visibility into the workflow that created the binary.
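The steps such a workflow would run are roughly the usual bloom/debhelper ones; this is only a sketch, not a tested pipeline:

    # Generate debian/ metadata and build the binary package, roughly as the farm does:
    sudo apt-get install -y python3-bloom fakeroot debhelper dh-python
    cd openvdb_vendor
    bloom-generate rosdebian --os-name ubuntu --os-version noble --ros-distro jazzy
    fakeroot debian/rules binary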

Without the build farm / binaries provided by some mechanism, I’m pretty sure SBC users can’t compile openvdb on their own hardware anyway.

I took a quick peek through openvdb/CMakeLists.txt, but it seems that only the core and the binaries are built by default.
The binaries are disabled here.

Which leaves the core, which builds only two libraries, a static one and a shared one. But since -j1 is already present, these should build sequentially.

So not sure how to reduce the memory footprint for this compilation…

(We might roughly halve the build time by disabling the static library, though, depending on how well ccache does its work.)
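If openvdb_vendor forwards CMake arguments to the vendored OpenVDB build, something along these lines should skip the static core library and the binaries (option names are taken from OpenVDB’s top-level CMakeLists and may differ between versions; the source path is a placeholder):

    # Illustrative cache options for the vendored OpenVDB build:
    cmake -DOPENVDB_BUILD_BINARIES=OFF \
          -DOPENVDB_CORE_STATIC=OFF \
          -DOPENVDB_CORE_SHARED=ON \
          <path-to-openvdb-source>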

I don’t think time is the issue here; it’s memory. 03:37:43 is when the build starts, and 04:41:31 is when it hits c++: fatal error: Killed signal terminated program cc1plus.

Admittedly, that’s suspiciously close to 1 hour, if the build duration maximum were actually 1 hour instead of 2…

Another job failed with the following, which I’m not 100% sure how to interpret, but I could believe it’s another manifestation of “I’m out of memory”, unless someone knows it’s due to something else. I see jobs where this happens after 30 minutes to 1 hour. Most jobs are actually failing this way (not with c++: fatal error: Killed signal terminated program cc1plus), but only the bin jobs.


03:37:42 FATAL: command execution failed
03:37:42 java.nio.channels.ClosedChannelException
03:37:42 	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
03:37:42 	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
03:37:42 	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:817)
03:37:42 	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
03:37:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:179)
03:37:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:281)
03:37:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:501)
03:37:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:246)
03:37:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:198)
03:37:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:211)
03:37:42 	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:785)
03:37:42 	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
03:37:42 	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311)
03:37:42 	at hudson.remoting.Channel.close(Channel.java:1502)
03:37:42 	at hudson.remoting.Channel.close(Channel.java:1455)
03:37:42 	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:884)
03:37:42 	at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:110)
03:37:42 	at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:765)
03:37:42 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
03:37:42 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
03:37:42 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
03:37:42 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
03:37:42 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
03:37:42 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
03:37:42 	at java.lang.Thread.run(Thread.java:750)
03:37:42 Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from ip-10-0-1-232.us-west-1.compute.internal/10.0.1.232:51244' is disconnected.

The maximum duration is definitely two hours: ros_buildfarm_config/jazzy/release-build.yaml at 9985e2e406678e41ff026373e52b0d2ce5541bf3 · ros2/ros_buildfarm_config · GitHub


On Humble it succeeded on the first try (once the dependencies were in order). However, the second build took 7 iterations.

On Iron, both times it took 4 iterations.

So it does indeed seem like an inconsistency in the buildfarm, where sometimes the runner is lucky enough not to be competing with other runners.

I’m not sure the “8 GB should be the limit to guard developer requirements” argument holds. My experience with ROS 2 in general is that memory usage is several times higher than for the same code in ROS 1. But that is just how it is; I’m not sure we want to enforce these specifications for the future to come. I just popped more memory into my laptop :sunglasses:

The problem is cost. All of the workers run in the cloud, and going from 8 GB to something larger costs more. Given how many workers we use across the infrastructure, this adds up fast. So it’s probably not viable to increase the memory size of the workers just to compile one or two packages (unless someone dumps a bunch of money on OSRF to support that).

For this instance specifically (so not a proper solution): what about building it in a PPA / some other context and then setting up an import job @nuclearsandwich? Similar to how this is done for other packages (like Colcon, Gazebo and catkin_lint) in ros-infrastructure/reprepro-updater?

That’s probably what @smac was thinking of in his comment.

Would there be a way to donate cycles instead of money?