ROS2 OpenAI Whisper

We implemented an inference node for the openai whisper audio recognition here GitHub - mhubii/ros2_whisper

It is a little hacked right now, and can be activated by saying “Hello ROS”, which whisper understands as “Hello Ross”.

Noticeably, audio_common does not build for ros2. Hence, there is a small node that captures and publishes audio using pyaudio.

Moving forward, it would probably make sense to

  • Fix audio_common
  • Clean up the code for doing inference using whisper. Ideally, the inference node should just publish text to a ros2 topic, rather than doing further processing

This is pretty cool, thanks @mhubii!

Any chance you could land a at GitHub - mhubii/ros2_whisper and provide simple instructions on how to get it to work in a Linux-based workstation? I think this has lots of potential to be re-used.

thanks for checking this out @vmayoral , there is now a read me and a little documentation. But this repository is very experimental. I wonder if this could be turned into something more useful

1 Like

There have been changes to the ros2_whisper package. The package now lives under GitHub - ros-ai/ros2_whisper: Whisper C++ inference action server for ROS 2 and is built using whisper.cpp.

Main new ideas are:

  • Action server for whisper.cpp that publishes intermediate transcription as feedback and final transcription as result
  • A vendor package for whisper.cpp
  • A manager for downloading whisper.cpp models to cache
  • Ring buffer for audio

Note that there is another package out there whisper_ros, which also utilizes whisper.cpp but has no action server, no vendor, no model caching, and no ring buffer. It does, however, have voice activation detection through silero-vad. This could be supported nicely within ros2_whisper through the action server.