Ros.org latency & availability

Hey everyone,

Today ros.org was basically unavailable for me for the whole day,
although it did respond sporadically. What is happening there?

In the same context: at least over here in Hamburg, Germany, the whole domain,
and especially the wiki, is often awfully slow.
I notice this every time I teach a practical course and ask the students
to look up tutorials or package descriptions on the wiki…

Did others notice that too?

Would it be possible to improve the reliability of the site?

Yes, there are problems. We’re actively working on it right now, trying to understand what’s happening and how to fix it.

It looks like we’re still having some issues with the server.

If you need content urgently, there are several mirrors worldwide. The main reference is here: http://wiki.ros.org/Mirrors

I’ve copied the up-to-date mirrors, as of today, below.

You can set up your own as well. Instructions are here: https://github.com/ros-infrastructure/mirror
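For anyone setting one up, here’s a minimal sketch of what the periodic sync job might look like. The rsync endpoint and paths below are placeholders, not real module names; the actual endpoints and options are in the instructions linked above.

```python
#!/usr/bin/env python3
"""Minimal sketch of a wiki mirror sync job (illustrative only).

The real setup is described in ros-infrastructure/mirror; the
endpoint below is a placeholder, not an actual rsync module.
"""
import subprocess
import sys

SOURCE = "rsync://rsync.example.org/ros-wiki/"  # placeholder endpoint
DEST = "/var/www/wiki-mirror/"

def sync() -> int:
    # -a preserves metadata, -z compresses in transit, --delete drops
    # files removed upstream so the mirror doesn't drift.
    return subprocess.run(["rsync", "-az", "--delete", SOURCE, DEST]).returncode

if __name__ == "__main__":
    sys.exit(sync())
```

In practice you’d run something like this from cron on a regular schedule so the mirror stays current.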

@v4hn Note that the wiki response time is sluggish even locally, since we’ve grown beyond what MoinMoin was designed for. Though it’s clearly amplified at a greater distance.


Are there any mirrors of docs.ros.org?

The status page is still reporting website, wiki, and docs as down.

The standard mirror process does both the wiki and the docs.

The usual approach is to swap the sub-domain wiki for docs; the sketch after the mirror list below shows the idea.

From the list, at least the following mirrors have docs active:

http://mirror-ap.docs.ros.org/
http://mirror-eu.docs.ros.org/
http://docs.ros.org.ros.informatik.uni-freiburg.de/
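To make the sub-domain swap concrete, here’s a small illustrative helper (not official tooling, and the function names are made up) that rewrites a docs.ros.org URL onto the mirrors above and probes which one currently responds:

```python
"""Illustrative helper for the sub-domain swap described above: rewrite
a docs.ros.org URL onto the known mirrors and probe which one answers.
Not official tooling, just a sketch."""
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

MIRROR_HOSTS = [
    "mirror-ap.docs.ros.org",
    "mirror-eu.docs.ros.org",
    "docs.ros.org.ros.informatik.uni-freiburg.de",
]

def mirror_urls(url):
    """Rebuild the URL once per mirror host, keeping path and query."""
    parts = urlsplit(url)
    return [urlunsplit((parts.scheme, host, parts.path, parts.query, ""))
            for host in MIRROR_HOSTS]

def first_responding(url, timeout=5.0):
    """Return the first mirror URL that answers with HTTP 200, else None."""
    for candidate in mirror_urls(url):
        try:
            with urlopen(candidate, timeout=timeout) as resp:
                if resp.status == 200:
                    return candidate
        except OSError:  # mirror down or unreachable; try the next one
            continue
    return None

if __name__ == "__main__":
    print(first_responding("http://docs.ros.org/"))
```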

PS: The main site is up at the moment, but we are still trying to solve the underlying instability.

Just a short follow-up: we’ve identified that the /Robots page on the wiki uses a lot of resources. We’ve actually known that for a while, but right now it’s the easiest thing we can fix to help with stability and speed.

We’re working on a new solution which is a statically generated site. We’ll make an announcement when that’s available.

Wiki is down again with 500 errors…

https://status.ros.org/ page is tracking it.

Rohan

Is there a longer-term solution for this in the works, or can we do anything to help?

We’re actively moving the mirror rsync jobs off the server: Updated rsync endpoints

The Apache instance has been falling behind when the mirrors update, and if it falls too far behind it never catches up and needs a restart.

In the very long term we’re looking at more static hosting methods for ROS 2 documentation, which can scale much better than the wiki.

We’re well over the size MoinMoin was designed for. We’ve known that for a while and have been optimizing over time, disabling features and streamlining our delivery. Looking briefly at the analytics, we’ve had approximately a 20% increase in traffic in the last 2 months, which has probably pushed us past our last stable equilibrium.

The biggest challenge is that we have a lot of content, and switching to any other system will take a herculean effort. There are a lot of automated conversion tools we could try. However, they don’t take into account the many custom macros that we use heavily for automating the tutorials etc. @wdsmart helped lead an effort to look into migrating to MediaWiki and might be able to comment.

We’ve looked at this a lot and there just hasn’t been a clear best solution. It’s late now so I can’t go into more details at the moment. I’ve got the machine back online now.

Like @tfoote said, we looked at migrating the wiki to MediaWiki a while back, and came to the conclusion that it’s not going to be easy. We can do about 70% of it automatically, but there are so many custom macros involved that we’d basically have to eyeball every page (that’s 18,000 pages, give or take) to make sure things worked and to fix them when they didn’t. I’m not sure that this is practical.
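As a rough illustration of how that manual review could at least be triaged, the sketch below scans MoinMoin page sources for macro calls (MoinMoin’s `<<MacroName(...)>>` syntax) and counts the ones a converter can’t handle. The directory layout and the “safe” macro set are assumptions, not the actual migration tooling:

```python
"""Triage sketch: flag wiki pages whose macros a converter can't handle,
so they can be reviewed by hand during a migration. The directory layout
and the AUTO_CONVERTIBLE set below are assumptions."""
import re
from collections import Counter
from pathlib import Path

# MoinMoin macro calls look like <<MacroName(args)>> or <<MacroName>>.
MACRO_RE = re.compile(r"<<(\w+)(?:\([^)]*\))?>>")

# Hypothetical: macros a converter is known to handle automatically.
AUTO_CONVERTIBLE = {"TableOfContents", "Include", "Anchor"}

def scan(pages_dir):
    """Count, per macro, how many pages would need manual review."""
    counts = Counter()
    for page in Path(pages_dir).rglob("*.txt"):
        macros = set(MACRO_RE.findall(page.read_text(errors="ignore")))
        for name in macros - AUTO_CONVERTIBLE:
            counts[name] += 1  # one count per page using this macro
    return counts

if __name__ == "__main__":
    for name, pages in scan("wiki-pages").most_common():
        print(f"{name}: used on {pages} pages")
```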

We’re working on another solution right now, which would wrap the current wiki and allow us to migrate a page at a time, without losing the current content or having extended downtime. It’s still in the early stages, but I’d be happy to talk about it offline, if anyone’s interested.
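To sketch the routing idea (this is just an illustration of the concept, with made-up hostnames and page names, not the actual system): a thin front end sends requests for already-migrated pages to the new backend and falls through to the existing wiki for everything else.

```python
"""Concept sketch of a 'wrap the wiki' front end: pages that have been
migrated are served by the new site, everything else falls through to
the existing MoinMoin wiki. Hostnames and page names are made up."""
from wsgiref.simple_server import make_server

OLD_WIKI = "http://wiki.ros.org"          # existing MoinMoin instance
NEW_SITE = "http://new-wiki.example.org"  # hypothetical new backend
MIGRATED = {"/ROS/Tutorials", "/ROS/Installation"}  # grows page by page

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    target = NEW_SITE if path in MIGRATED else OLD_WIKI
    # Redirect to keep the sketch short; a real wrapper would
    # reverse-proxy instead, so that URLs stay stable for users.
    start_response("302 Found", [("Location", target + path)])
    return [b""]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()
```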

FWIW, I think that we need to fix this ASAP, since it’s really impacting people. Of course, we (the community) have been saying this for a while now, so this is not news to anyone. The problem is that there’s so much stuff, and it’s so heavily used that the transition strategy isn’t clear.

– Bill

What is the hosting architecture? To me it looks as though the site is served by plain Apache, which is flexible but very slow. If that’s the case, it could easily be improved, for example by putting an HTTP cache in front. I would suggest Varnish, which could improve delivery by a large factor (it should be able to saturate the outgoing bandwidth). And if that is not sufficient, other improvements to the hosting architecture are possible, avoiding the man-hours needed to migrate pages. I’d be happy to discuss further ideas and help where I can.

Regards,
Willem

We’ve tried Varnish in the past, but ran into issues where it cached too aggressively, so we had to add exceptions for certain pages; unfortunately those pages and actions ended up being the ones that constituted the majority of the hosting overhead.

For example, we identified in the past that wiki attachments account for a good deal of the large traffic spikes and tried to cache those. The problem is that wiki attachments have ACL-like settings, so fetching an attachment must go through MoinMoin’s runtime so that it can check whether the currently logged-in user has access under the current security settings. I’m sure there is a way to handle this and still benefit from Varnish, but we haven’t found an out-of-the-box Varnish + MoinMoin setup that handles cases like this.
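The bypass rule itself is simple to state, though. Purely as an illustration (in Python rather than Varnish VCL, and independent of any real deployment): anything that invokes a wiki action, such as MoinMoin’s action=AttachFile attachment fetches, must skip the cache so the ACL checks run, while plain page views are cacheable.

```python
"""Illustration of the cache-bypass rule described above, expressed in
Python rather than Varnish VCL. Requests invoking a wiki action (e.g.
MoinMoin's action=AttachFile) must reach the wiki runtime so per-user
permissions are enforced; plain page views are safe to cache."""
from urllib.parse import parse_qs, urlsplit

def is_cacheable(url):
    query = parse_qs(urlsplit(url).query)
    # Any ?action=... request (AttachFile, edit, login, ...) bypasses
    # the cache and hits MoinMoin directly.
    return "action" not in query

assert is_cacheable("http://wiki.ros.org/ROS/Tutorials")
assert not is_cacheable(
    "http://wiki.ros.org/Robots?action=AttachFile&do=get&target=photo.png")
```

The Varnish equivalent would be a vcl_recv rule that passes any request with an action query parameter straight to the backend.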

Our hosting providers recently suggested Varnish again too, so maybe we/they will look into it more.

The server is doing much better now. We expect the uptime to be much higher now that we’ve deployed this patch:

Since it was deployed we’ve had solid uptime, improved response times, and much lower server load averages.

There are several neighboring PRs to clean up other wiki macros. Cleaning them up both reduces the potential future cost of porting and avoids the potentially high loads the macros can cause.