Slow reading from many rosbags in python. No speedup from parallelization.
Hi everyone, I've run into a performance issue with slow read speeds from many rosbag files. For context, all these results were gathered on my personal machine with a WD SN750 SSD (quoted max sequential read speed around 3400 MB/s) and a 4-core Intel i7 (8 threads). I am using ROS Noetic on Ubuntu 20.04 for the read experiments described below, but the rosbag files were generated in a Docker image running ROS Kinetic on Ubuntu 16.04.
I've set up a small Python test program which sequentially opens 81 different bag files, reads all messages from a single odometry topic in each file, and stores the data in a Python list. It does no other processing with the data. Here are the results when I run that program single-threaded:
- Number of workers: 1
- Bag file count, total size (MB), avg size (MB): 81, 532.4, 6.6
- Total time (s): 9.639
- Time per file (s): 0.119
- Throughput (MB/s): 55.2
Note that the "throughput" reported here is likely a gross overestimate of the true read rate, since I am only reading a very small subset of the data stored in each bag file. It was computed as total_size / total_time.
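For reference, the single-threaded test is essentially this pattern (a sketch, not the exact script; the bag directory and `/odom` topic name are placeholders for my actual paths/topics):

```python
import time

def read_odometry(bag_path, topic="/odom"):  # topic name is a placeholder
    """Read all messages on one topic from a single bag file."""
    import rosbag  # ROS Noetic Python API; assumes a ROS environment is sourced
    msgs = []
    with rosbag.Bag(bag_path) as bag:
        for _, msg, _ in bag.read_messages(topics=[topic]):
            msgs.append(msg)
    return msgs

def benchmark(paths, reader=read_odometry):
    """Read every bag sequentially and time the whole loop."""
    start = time.perf_counter()
    data = [reader(p) for p in paths]
    return data, time.perf_counter() - start

if __name__ == "__main__":
    import glob
    paths = sorted(glob.glob("/path/to/bags/*.bag"))  # placeholder directory
    _, elapsed = benchmark(paths)
    print(f"{len(paths)} bags in {elapsed:.3f} s "
          f"({elapsed / max(len(paths), 1):.3f} s/file)")
```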
I monitored CPU and I/O load with htop and iotop respectively while this ran. The I/O impact appeared almost negligible, but CPU usage was pinned at 100% of a single CPU thread; the 7 remaining threads were idle (besides some light background use). Since this looked like a CPU bottleneck, I tried parallelizing the operation with Python's concurrent.futures library, using a separate thread to load and read each bag file. Here are the results:
- Number of workers: 8
- Bag file count, total size (MB), avg size (MB): 81, 532.4, 6.6
- Total time (s): 10.985
- Time per file (s): 0.136
- Throughput (MB/s): 48.5
So parallelizing the operation actually resulted in slightly decreased performance, even though each bag file is entirely independent of the others. I also still saw 100% utilization of only a single CPU thread while the other 7 sat idle. However, that utilization was now divided over the 8 Python threads, with each consuming roughly 12.5%.
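The threaded variant is essentially this pattern (again a sketch; `read_bag` stands in for the per-file reading function, which in my script opens the bag and reads one topic):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_bag(path):
    # Placeholder for the per-file rosbag reading function.
    ...

def benchmark_threaded(paths, reader=read_bag, workers=8):
    """Read the bags with a pool of worker threads and time the whole run."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        data = list(pool.map(reader, paths))  # preserves input order
    return data, time.perf_counter() - start
```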
It seems like the Python rosbag API might not support being parallelized this way, but I'm not familiar with those implementation details. In the past I've tried converting to other data formats (e.g. HDF5) and gotten better read performance, but the conversion takes time and has other disadvantages. Any ideas what may be causing this, and/or potential solutions to improve read speed?
I can post the full test script as well if anyone is interested.
Thanks, Charlie