It is a little hacked together right now and can be activated by saying “Hello ROS”, which whisper transcribes as “Hello Ross”.
Notably, audio_common does not build for ROS 2 yet. Hence, there is a small node that captures and publishes audio using pyaudio instead.
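For reference, a minimal sketch of such a capture node could look like the one below. Node, topic, and parameter names are illustrative, not the ones used in the repository:

```python
import numpy as np
import pyaudio
import rclpy
from rclpy.node import Node
from std_msgs.msg import Int16MultiArray


class AudioPublisher(Node):
    """Captures microphone audio with pyaudio and publishes raw PCM chunks."""

    def __init__(self):
        super().__init__('audio_publisher')
        self.publisher = self.create_publisher(Int16MultiArray, 'audio', 10)
        self.chunk = 1024  # frames per buffer
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(format=pyaudio.paInt16, channels=1,
                                   rate=16000, input=True,
                                   frames_per_buffer=self.chunk)
        # Poll the stream on a timer; one chunk lasts chunk / rate seconds.
        self.timer = self.create_timer(self.chunk / 16000.0, self.capture)

    def capture(self):
        data = self.stream.read(self.chunk, exception_on_overflow=False)
        samples = np.frombuffer(data, dtype=np.int16)
        self.publisher.publish(Int16MultiArray(data=samples.tolist()))


def main():
    rclpy.init()
    node = AudioPublisher()
    try:
        rclpy.spin(node)
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == '__main__':
    main()
```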
Moving forward, it would probably make sense to:

- Fix audio_common
- Clean up the code for running inference with whisper. Ideally, the inference node should just publish text to a ROS 2 topic rather than doing further processing (a rough sketch follows this list)
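A sketch of that cleaner design might look as follows, using the openai-whisper Python package as a stand-in for whisper.cpp bindings and a deque as a simple ring buffer; all names are illustrative:

```python
from collections import deque

import numpy as np
import rclpy
import whisper  # openai-whisper, as a stand-in for whisper.cpp bindings
from rclpy.node import Node
from std_msgs.msg import Int16MultiArray, String


class InferenceNode(Node):
    """Buffers audio in a ring buffer and publishes plain transcriptions."""

    def __init__(self):
        super().__init__('whisper_inference')
        self.model = whisper.load_model('base')
        # Ring buffer holding roughly 5 s of 16 kHz mono audio.
        self.buffer = deque(maxlen=5 * 16000)
        self.create_subscription(Int16MultiArray, 'audio', self.on_audio, 10)
        self.publisher = self.create_publisher(String, 'transcript', 10)
        self.create_timer(5.0, self.transcribe)

    def on_audio(self, msg):
        self.buffer.extend(msg.data)

    def transcribe(self):
        if not self.buffer:
            return
        # whisper expects float32 audio normalized to [-1, 1].
        audio = np.array(self.buffer, dtype=np.float32) / 32768.0
        text = self.model.transcribe(audio)['text'].strip()
        if text:
            self.publisher.publish(String(data=text))


def main():
    rclpy.init()
    rclpy.spin(InferenceNode())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```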
Any chance you could land a README.md at [mhubii/ros2_whisper](https://github.com/mhubii/ros2_whisper) and provide simple instructions on how to get it to work on a Linux-based workstation? I think this has lots of potential to be re-used.
Thanks for checking this out @vmayoral, there is now a README and a little documentation. But this repository is very experimental. I wonder if this could be turned into something more useful:
- An action server for whisper.cpp that publishes intermediate transcriptions as feedback and the final transcription as the result (a hypothetical action definition is sketched after this list)
- A vendor package for whisper.cpp
- A manager for downloading whisper.cpp models to a cache
- A ring buffer for audio
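To make the feedback/result split concrete, a hypothetical action definition could look like this (the field names are invented for illustration and are not the actual interface in the repository):

```
# Inference.action (hypothetical)
int32 max_duration_s   # goal: how long to listen for
---
string text            # result: final transcription
---
string partial_text    # feedback: intermediate transcription
```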
Note that there is another package out there, whisper_ros, which also utilizes whisper.cpp but has no action server, no vendor package, no model caching, and no ring buffer. It does, however, have voice activity detection through silero-vad. This could be supported nicely within ros2_whisper through the action server, e.g. as sketched below.
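For what it's worth, gating audio with silero-vad before running whisper could be as simple as the following sketch. The torch.hub entry point follows silero-vad's documented usage, but the threshold and function name are illustrative, and chunk-size requirements vary between silero-vad versions:

```python
import numpy as np
import torch

# Load silero-vad via torch.hub (entry point per the silero-vad README).
model, _utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                               model='silero_vad')


def is_speech(samples: np.ndarray, sample_rate: int = 16000,
              threshold: float = 0.5) -> bool:
    """Return True if an int16 PCM chunk likely contains speech."""
    # silero-vad expects float32 audio in [-1, 1].
    audio = torch.from_numpy(samples.astype(np.float32) / 32768.0)
    speech_prob = model(audio, sample_rate).item()
    return speech_prob > threshold
```

The action server could then start or stop feeding buffered audio to whisper based on `is_speech`, which would give ros2_whisper the same voice activity detection whisper_ros gets from silero-vad.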