It is a little hacky right now, and can be activated by saying “Hello ROS”, which whisper understands as “Hello Ross”.
Notably, audio_common does not build for ROS 2. Hence, there is a small node that captures and publishes audio using pyaudio.
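For illustration, a minimal version of such a node could look like the sketch below (topic name, chunk size, and the Int16MultiArray message type are assumptions, not the repository's actual code):

```python
# Minimal sketch: capture microphone audio with PyAudio and publish raw
# chunks on a ROS 2 topic. 16 kHz mono matches what whisper expects.
import numpy as np
import pyaudio
import rclpy
from rclpy.node import Node
from std_msgs.msg import Int16MultiArray


class AudioCaptureNode(Node):
    def __init__(self):
        super().__init__('audio_capture')
        self.pub = self.create_publisher(Int16MultiArray, 'audio', 10)
        self.chunk = 1024
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(
            format=pyaudio.paInt16, channels=1, rate=16000,
            input=True, frames_per_buffer=self.chunk)
        # Poll the stream at roughly the chunk rate (1024 / 16000 s).
        self.timer = self.create_timer(self.chunk / 16000.0, self.capture)

    def capture(self):
        data = self.stream.read(self.chunk, exception_on_overflow=False)
        msg = Int16MultiArray()
        msg.data = np.frombuffer(data, dtype=np.int16).tolist()
        self.pub.publish(msg)


def main():
    rclpy.init()
    rclpy.spin(AudioCaptureNode())


if __name__ == '__main__':
    main()
```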
Moving forward, it would probably make sense to:

- Fix audio_common
- Clean up the code for doing inference using whisper. Ideally, the inference node should just publish text to a ROS 2 topic rather than doing further processing (see the sketch below).
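A rough sketch of what that split could look like (topic names, the Int16MultiArray audio format, and the `transcribe()` placeholder are illustrative only):

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import Int16MultiArray, String


class InferenceNode(Node):
    """Subscribes to raw audio, runs whisper, publishes plain text."""

    def __init__(self):
        super().__init__('whisper_inference')
        self.sub = self.create_subscription(
            Int16MultiArray, 'audio', self.on_audio, 10)
        self.pub = self.create_publisher(String, 'transcript', 10)

    def on_audio(self, msg: Int16MultiArray) -> None:
        text = self.transcribe(msg.data)
        if text:
            self.pub.publish(String(data=text))

    def transcribe(self, samples) -> str:
        # Placeholder for the actual whisper inference call.
        return ''


def main():
    rclpy.init()
    rclpy.spin(InferenceNode())


if __name__ == '__main__':
    main()
```

Downstream nodes (e.g. the keyword activation) would then simply subscribe to the text topic.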
Any chance you could land a README.md at GitHub - mhubii/ros2_whisper and provide simple instructions on how to get it to work on a Linux-based workstation? I think this has lots of potential to be re-used.
Thanks for checking this out @vmayoral, there is now a README and a little documentation. But this repository is very experimental. I wonder if it could be turned into something more useful:
- Action server for whisper.cpp that publishes intermediate transcription as feedback and final transcription as result
- A vendor package for whisper.cpp
- A manager for downloading whisper.cpp models to cache
- Ring buffer for audio (sketched below)
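As an illustration of the ring buffer idea (names, capacity, and sample type are placeholders, not the actual implementation), new samples overwrite the oldest ones and the most recent window can be read back in order:

```python
import numpy as np


class RingBuffer:
    """Fixed-size ring buffer for int16 audio samples."""

    def __init__(self, capacity: int):
        self.buffer = np.zeros(capacity, dtype=np.int16)
        self.capacity = capacity
        self.head = 0   # next write position
        self.size = 0   # number of valid samples

    def write(self, samples) -> None:
        # Simple per-sample write; a real implementation would vectorize this.
        for s in samples:
            self.buffer[self.head] = s
            self.head = (self.head + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)

    def read(self) -> np.ndarray:
        """Return the stored samples, oldest first."""
        if self.size < self.capacity:
            return self.buffer[:self.size].copy()
        # Buffer is full: the oldest sample sits at `head`.
        return np.concatenate((self.buffer[self.head:], self.buffer[:self.head]))
```

Whisper inference would then always run on the most recent few seconds of audio returned by `read()`.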
Note that there is another package out there, whisper_ros, which also utilizes whisper.cpp but has no action server, no vendor package, no model caching, and no ring buffer. It does, however, have voice activity detection through silero-vad. This could be supported nicely within ros2_whisper through the action server.
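As a rough idea of how that could look (loading silero-vad via torch.hub as described in its README; the exact utils returned may differ between versions, and the file name and sampling rate are placeholders), VAD would pick out the speech segments that get forwarded to the transcription action server:

```python
import torch

# Load the silero-vad model and helper functions from torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('audio.wav', sampling_rate=16000)
# Timestamps (in samples) of detected speech; only these spans would be
# handed to whisper via the action server.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)
```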
There are some cool concepts with the action server etc., but the whisper.cpp implementation is (despite high expectations and the CUDA backend) much slower than the plain PyTorch one. Maybe there have been some improvements since, but it might make more sense to just use OpenAI's PyTorch version and keep some of the action server concepts.
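For reference, the plain PyTorch route via the openai-whisper package is only a few lines (model size and file name are placeholders):

```python
import whisper

model = whisper.load_model("base")       # downloads the checkpoint on first use
result = model.transcribe("audio.wav")   # runs on the GPU if available
print(result["text"])
```

The action server could wrap a call like `model.transcribe()` instead of whisper.cpp while keeping the same feedback/result interface.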