We implemented an inference node for OpenAI's Whisper speech recognition model here: GitHub - mhubii/ros2_whisper
It is a little hacky right now, and can be activated by saying “Hello ROS”, which Whisper understands as “Hello Ross”.
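One way to make the activation robust to that mis-transcription is to accept both spellings after normalising the text. The following is only a minimal sketch under that assumption, not the code in the repository; `WAKE_PHRASES`, `normalize`, and `is_wake_phrase` are hypothetical names.

```python
import re

# Hypothetical set of accepted spellings. "hello ross" is the
# mis-transcription reported above; the matching logic is an assumption.
WAKE_PHRASES = ("hello ros", "hello ross")


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def is_wake_phrase(transcript: str) -> bool:
    """Return True if the transcript starts with an accepted wake phrase."""
    words = normalize(transcript)
    return any(words == p or words.startswith(p + " ") for p in WAKE_PHRASES)
```

Matching on whole words means that a transcript such as "Hello Rossi" is not treated as the wake phrase.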
Notably, audio_common does not build for ROS 2. Hence, there is a small node that captures and publishes audio using
Moving forward, it would probably make sense to:
- Clean up the code for doing inference using whisper. Ideally, the inference node should just publish text to a ROS 2 topic, rather than doing further processing.
This is pretty cool, thanks @mhubii!
Any chance you could land a README.md at GitHub - mhubii/ros2_whisper and provide simple instructions on how to get it to work on a Linux-based workstation? I think this has lots of potential to be re-used.
Thanks for checking this out, @vmayoral. There is now a README and a little documentation, but this repository is still very experimental. I wonder if this could be turned into something more useful.
There have been changes to the ros2_whisper package. The package now lives under GitHub - ros-ai/ros2_whisper: Whisper C++ inference action server for ROS 2 and is built using whisper.cpp.
Main new ideas are:
- Action server for whisper.cpp that publishes intermediate transcription as feedback and final transcription as result
- A vendor package for whisper.cpp
- A manager for downloading whisper.cpp models to cache
- Ring buffer for audio
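
To illustrate the ring buffer idea from the list above: a fixed-size buffer keeps only the most recent audio and overwrites the oldest samples once it is full. This is a minimal plain-Python sketch, not the package's own implementation; the class name `AudioRingBuffer` and its methods are hypothetical.

```python
class AudioRingBuffer:
    """Fixed-capacity buffer that overwrites the oldest samples when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._data = [0] * capacity
        self._capacity = capacity
        self._head = 0  # index where the next sample is written
        self._size = 0  # number of valid samples currently stored

    def push(self, samples) -> None:
        """Append samples, overwriting the oldest ones once the buffer is full."""
        for sample in samples:
            self._data[self._head] = sample
            self._head = (self._head + 1) % self._capacity
            self._size = min(self._size + 1, self._capacity)

    def snapshot(self) -> list:
        """Return the stored samples, oldest first, without clearing the buffer."""
        start = (self._head - self._size) % self._capacity
        return [self._data[(start + i) % self._capacity] for i in range(self._size)]

    def __len__(self) -> int:
        return self._size
```

With a capacity of four, pushing `[1, 2, 3]` and then `[4, 5, 6]` leaves `[3, 4, 5, 6]` in the buffer. In Python, `collections.deque(maxlen=...)` gives the same overwrite behaviour with less code; the explicit version only makes the indexing visible.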
Note that there is another package out there, whisper_ros, which also utilizes whisper.cpp but has no action server, no vendor package, no model caching, and no ring buffer. It does, however, have voice activity detection through silero-vad. This could be supported nicely within ros2_whisper through the action server.
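
As a rough illustration of how voice activity detection might fit the feedback/result pattern, here is a plain-Python sketch. It uses neither rclpy nor the actual ros2_whisper, whisper.cpp, or silero-vad APIs: `transcribe` and `is_speech` are hypothetical stand-ins for the inference and VAD calls, and `on_feedback` stands in for publishing action feedback.

```python
from typing import Callable, Iterable, List


def run_inference_goal(
    chunks: Iterable[List[float]],
    transcribe: Callable[[List[float]], str],
    is_speech: Callable[[List[float]], bool],
    on_feedback: Callable[[str], None],
    max_silent_chunks: int = 2,
) -> str:
    """Accumulate speech, report intermediate text, and return the final text.

    Stops once `max_silent_chunks` consecutive chunks without speech follow
    the first detected speech. All callables are hypothetical stand-ins.
    """
    audio: List[float] = []
    silent = 0
    heard_speech = False
    for chunk in chunks:
        if is_speech(chunk):
            audio.extend(chunk)
            silent = 0
            heard_speech = True
            on_feedback(transcribe(audio))  # intermediate transcription
        elif heard_speech:
            silent += 1
            if silent >= max_silent_chunks:
                break  # trailing silence ends the goal
    return transcribe(audio) if audio else ""  # final transcription
```

In an action server, the `on_feedback` calls would correspond to feedback messages and the return value to the result, so a client would not need to decide when the utterance has ended.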