rosmaster stops responding
At some point rosmaster stops responding to queries. The issue is intermittent, but it has been happening more often as our system gets more complex, with more nodes, more topics, etc. The symptom is that rosmaster accepts TCP connections but never sends any data in reply. Nodes that are already running and have already subscribed to topics continue to work, since they talk directly to each other, but anything that needs to subscribe to new topics or list existing nodes hangs. Here is the end of the output of running strace rosnode list:
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
connect(4, {sa_family=AF_INET, sin_port=htons(11311), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "POST /RPC2 HTTP/1.1\r\nHost: 127.0"..., 339, 0, NULL, 0) = 339
recvfrom(4,
The problem tends to happen after 20 minutes to an hour of the system running, so it's not super fast to reproduce. I don't see anything obviously wrong in various log files.
Any pointers on where I should be looking for problems would be helpful. Thanks.
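For anyone trying to catch the moment the hang starts, the symptom above (connection accepted, no reply) can be checked from a script instead of waiting on a blocked rosnode list. The following is only a minimal sketch: it assumes the master is at http://127.0.0.1:11311 as in the strace output, and it uses the getPid call of the ROS Master XML-RPC API. The names probe_master and TimeoutTransport and the caller id /master_probe are made up for illustration.

```python
import socket
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport that gives up instead of blocking forever."""

    def __init__(self, timeout):
        super().__init__()
        self._timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        # Applies to both connect() and the later recv() on the socket.
        conn.timeout = self._timeout
        return conn


def probe_master(uri="http://127.0.0.1:11311", timeout=5.0):
    """Return 'ok', 'hung', 'unreachable', or 'error' for the master at uri."""
    proxy = xmlrpc.client.ServerProxy(uri, transport=TimeoutTransport(timeout))
    try:
        # getPid is a cheap Master API call; the caller id is arbitrary.
        code, _msg, _pid = proxy.getPid("/master_probe")
    except socket.timeout:
        # Connected (or tried to) but no reply arrived within the timeout.
        return "hung"
    except OSError:
        # For example: connection refused because nothing is listening.
        return "unreachable"
    except xmlrpc.client.Error:
        # The server answered, but not with a valid XML-RPC response.
        return "error"
    return "ok" if code == 1 else "error"
```

Running probe_master() in a loop with a timestamp would at least narrow down when the master stops answering, which may help line the hang up with entries in the log files.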
This may be a known issue (with perhaps even already a PR). Check the ros/ros_comm/issues tracker.
Thanks @gvdhoorn. I found a mention of CLOSE_WAIT sockets and noticed I had a lot of those. I will try the suggested fix and see if that solves my problem.
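In case it helps others check for the same thing, here is a rough sketch of counting CLOSE_WAIT sockets per local port on Linux by parsing the contents of /proc/net/tcp. It assumes the standard layout of that file (local address in the second column as hex IP:port, state in the fourth column, with 08 meaning CLOSE_WAIT) and only covers IPv4; /proc/net/tcp6 would need the same treatment. The function name count_close_wait is made up for this example.

```python
from collections import Counter

CLOSE_WAIT = "08"  # Linux kernel TCP state code for CLOSE_WAIT


def count_close_wait(proc_net_tcp_text):
    """Count CLOSE_WAIT sockets per local port from /proc/net/tcp contents."""
    counts = Counter()
    # Skip the header line; every other line describes one socket.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4 or fields[3] != CLOSE_WAIT:
            continue
        # fields[1] looks like "0100007F:2C2F" (hex IP, hex port).
        port_hex = fields[1].rsplit(":", 1)[1]
        counts[int(port_hex, 16)] += 1
    return counts
```

Feeding it the text read from /proc/net/tcp and looking up port 11311 in the result gives the number of CLOSE_WAIT sockets on the master's port. A count that keeps growing while the system runs would match what I saw, though by itself it does not prove that this is the cause of the hang.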