Why is ROS Answers down again, and for so long? Long outages should not happen in modern environments. Where is your monitoring? Whatever happened, please explain it.
The monitoring is OK, I believe: status.ros.org shows “Partial System Outage”, but all the monitoring won’t help if there are no humans around (i.e., awake) to react to alerts.
Edit: does anyone know why the graphs have disappeared from status.ros.org?
When posting, I think it’s better form to be less aggressive, to try to understand what has happened and, if it bothers you, to be part of the solution.
I understand your frustration, but it’s not productive in this case unless you’re volunteering to help maintain this infrastructure.
Edit: As an example strategy, in complicated topics I try to take the viewpoint that it’s “us against the problem” and not “you versus me”.
I’m sorry that this happened and caused you some problems. The reality of the situation is that ROS Answers is a bit long in the tooth and could use some love and attention. We don’t have a full-time development or ops team working on it at all times (like, say, some other, larger Q&A sites). If the server goes down, someone needs to actually SSH into the server and restart the service.
The ROS Answers server is maintained by a group of people who aren’t formally “on-call”. We do have a monitoring service attached to ROS Answers, but nobody has a “pager” that goes off in the middle of the night if something goes wrong. This particular outage happened on a weekend evening for most of the admins. The server came back up approximately when an admin looked at their e-mail on Sunday morning.
Would distributing that “reset the server” duty across a couple of time zones help?
I’d be willing to push the button if needed during regular business hours here in Europe.
Count me in as well for the pool of SSH resetters.
Or simply, since you already monitor ROS Answers outages, let a script restart the service. No human interaction needed.
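To make the idea concrete, here is a minimal sketch of such a watchdog in Python. It is not how the ROS Answers server is actually set up: the health URL, the service name, and the use of systemd are all assumptions, since none of them are stated in this thread. It only restarts after several consecutive failed checks, so a single slow response does not trigger a restart.

```python
import subprocess
import time
import urllib.error
import urllib.request

# Hypothetical values: the real health URL and service name are not given here.
HEALTH_URL = "https://answers.example.org/"
SERVICE_NAME = "example-qa.service"


def site_is_up(url=HEALTH_URL, timeout=10):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server replied, but with an error status (4xx or 5xx).
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # No usable reply at all: DNS failure, refused connection, timeout.
        return False


def restart_service(name=SERVICE_NAME):
    """Restart the service (assumes systemd and sufficient privileges)."""
    subprocess.run(["systemctl", "restart", name], check=True)


class Watchdog:
    """Trigger a restart after `max_failures` consecutive failed probes."""

    def __init__(self, probe, restart, max_failures=3):
        self.probe = probe
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def tick(self):
        """Run one check; return True if a restart was triggered."""
        if self.probe():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False


def run_forever(dog, interval_s=60):
    """Poll indefinitely, sleeping `interval_s` seconds between checks."""
    while True:
        dog.tick()
        time.sleep(interval_s)
```

Calling `run_forever(Watchdog(site_is_up, restart_service))` on the server would check once a minute and restart after three failed checks in a row. A blind restart can also hide the root cause of an outage, so logging each restart for the admins would be a sensible addition.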
I can do the Asia time zone if needed.