Best practices for reproducibility
Working with half a dozen TurtleBots and 14 students, we have a real problem with reproducibility. That is to say, things work and then stop working without us knowing what might have changed. I can think of many possible culprits:
- a package that was inadvertently or automatically updated (changed)
- slight hardware differences we might not even know about
- the timing and order in which different nodes are launched
- draining batteries
- and so on and so forth
I am sure we're far from the only ones with this problem. It's a question of software and hardware "hygiene", I suspect. We've thought of some techniques to address this but haven't implemented them yet:
- Maintain a single authorized Linux image and install it on bare metal (ok)
- Follow that with a shell script that installs very specific versions of everything (hard, but may be doable)
- Prohibit (via passwords?) anyone from installing or uninstalling anything (not sure how to do this)
- Turn off all automatic update mechanisms (not sure how to do this)
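For the last two bullets, apt can do a lot of this on its own: `apt-mark hold` freezes a package at its installed version so upgrades skip it, and a drop-in under `/etc/apt/apt.conf.d/` turns off the periodic update check and unattended upgrades. A minimal sketch; the package name is only an example, adjust it to your ROS install:

```shell
# Write an apt drop-in that disables the periodic package-list update
# and unattended upgrades (these are real apt configuration keys).
write_apt_autoupgrade_off() {
  cat > "$1" <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF
}

# On each robot (as root):
#   write_apt_autoupgrade_off /etc/apt/apt.conf.d/20auto-upgrades
#
# Then freeze the packages you care about so "apt upgrade" skips them
# (the package name below is an example):
#   sudo apt-mark hold ros-melodic-desktop-full
```

`apt-mark showhold` lists what is currently frozen, and `apt-mark unhold` reverses it when you deliberately want to move to a new version.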
My question is, how do you avoid this problem? What are your best practices? What are your tools?
Asked by pitosalas on 2019-11-26 21:09:52 UTC
Answers
Updated packages: Docker should help with bullet 1 if you give everyone the same Docker image to work with (a colleague, I'm sure, will come by and mention Singularity, another option I haven't personally explored, but from what I've read it's incredibly tempting).
Slight hardware differences: Can you be more specific about how this is causing you issues? In most cases I'm not sure how to get around it. If it's calibration-like parameters, you can keep a file on each robot's computer with that specific robot's calibration results (or version-control them with robot IDs).
Ordering of bringup: In ROS2, I'd say lifecycle nodes. In ROS1, if you're not too concerned about adding a few seconds to bringup, bash scripts.
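For the ROS1/bash route, the usual trick is a retry loop that blocks until the previous stage answers before you launch the next one. A minimal sketch; the commented ROS commands are only illustrations of how you would use it:

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT CMD...: re-run CMD once per second until it
# succeeds, giving up (and returning non-zero) after TIMEOUT seconds.
wait_for() {
  local timeout=$1; shift
  local elapsed=0
  until "$@"; do
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Example bringup ordering (ROS1 commands shown for illustration):
#   roscore &
#   wait_for 30 rostopic list      # blocks until the master answers
#   roslaunch my_robot driver.launch &
```

This makes each launch step deterministic with respect to the one before it, at the cost of a few seconds of polling.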
Draining batteries: Not sure what I can say there: if batteries are getting old, replace them.
It sounds like some type of containerized environment would help. If you'd like to hide it from your students if they're not well versed in it (which, yeah, I don't think many students would be), you can wrap the docker pull and session entry for them; after that it's essentially just a terminal. Admittedly there's some learning curve, but if your students know ROS or are able to learn it in the course, Docker (or Singularity) isn't that big a step.
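That wrapping can be as small as one script on each machine. A hypothetical sketch: the image name is a placeholder (in practice, pin by sha256 digest rather than a tag so the image can never drift underneath you), and a `DRY_RUN` switch lets you sanity-check the command it would run without a Docker daemon:

```shell
#!/usr/bin/env bash
# run_container: drop a student into the lab's pinned ROS container.
# The image reference below is a placeholder; pin by digest in practice.
IMAGE="${ROS_IMAGE:-lab/ros-class:2019.11}"

run_container() {
  # Interactive session; host networking so ROS nodes can reach the robot.
  local cmd="docker run --rm -it --net=host $IMAGE $*"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"   # show what would run, for sanity checks
  else
    $cmd
  fi
}

# Usage: run_container bash
```

From the student's point of view, `run_container bash` is just "open a terminal"; everything underneath is the one authorized image.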
Asked by stevemacenski on 2019-11-27 00:34:09 UTC
Comments
a colleague I'm sure will come by and mention Singularity
hah ;)
Asked by gvdhoorn on 2019-11-27 02:38:20 UTC
Thanks for the great ideas. A few bits of feedback:
Docker: a good idea; better, I think, than trying to maintain shell scripts. You were, I think, referring to using it on the students' computers, but there's no reason not to also use it on the robot itself, yes?
Singularity: I tried it once about 6 months ago and found it really difficult. I never got it working well enough to appreciate its value. At the time it seemed very 'rough'.
The students are all fairly proficient with the shell, Linux, ROS, and so on. It is true that it's often "ready, fire, aim" with them: too quick to try something, and if it works, not looking back and not worrying about why it worked.
See also my comments to my frequent correspondent @gvdhoorn
Asked by pitosalas on 2019-11-27 21:58:36 UTC
Comments
It would be good if you could give some examples of what you feel are "problems with reproducibility".
Right now you only list (what you have identified as) potential causes alongside a list of potential solutions, but you don't really describe the problems you are actually running into.
Working/not working is too vague, and rather binary.
TurtleBots are real systems with closed-loop-ish control. If, for instance, you'd like each and every one of them to reach exactly the same spot in a map, that is not going to happen without some serious tweaking and calibration.
Asked by gvdhoorn on 2019-11-27 02:40:25 UTC
Hoi! When I say "not working" I mean something pretty fundamental. For example: a student tells me they got something to work, but when they run it again to show me, it doesn't work at all. The lidar stops spinning for no apparent reason; a new navigation goal set in rviz doesn't do anything; some weird error I've never seen before shows up in the log. Sometimes rebooting the robot and re-running roscore etc. fixes it; sometimes power-cycling the whole robot makes the problem go away.
Asked by pitosalas on 2019-11-27 22:03:31 UTC
That may be, and it may be perfectly clear to you, but if you don't write these things down, we can't know, and so can't help you.
Asked by gvdhoorn on 2019-11-28 02:58:37 UTC
I know what you’re saying. The very nature of irreproducibility is that it’s different every time, and it’s hard for me to pin down what exactly went wrong. Let me refer back to my original question, which was not to solve a particular problem but to ask experts like you for their best practices. The previous answer I received actually had some specific and actionable practices. So, what is your advice for “good hygiene” in a scenario like mine? (E.g., always brush your teeth at least once a day, don’t drink coffee after 9pm, always look twice before crossing the road.)
Asked by pitosalas on 2019-11-29 18:18:40 UTC