
Do I need a cloud for speech recognition or speech-to-text?

This is cross-posted to the “Robotics” section of StackExchange.

In the context of speaking to a robot and making it understand: is there a difference between the two terms speech recognition and speech-to-text?

My tasks
Given my current knowledge and plans, I only need speech-to-text in the sense that the robot records spoken words via its microphones and converts them to strings, nothing more. How the strings are interpreted is the business of the programmer, i.e. the scripts I put on the machine.

It is not my goal to automatically extract “meaning” from the spoken words the way “smart” speakers try to do.

Do I need a “cloud”?
When and why would I need an external computer system (e.g. a cloud-based service like one of the data-hungry US “AI” systems, or an NVIDIA Jetson system) for these tasks? What are the limits of the different solutions?

Why I ask?
My question is not about specific products or coding problems. I am preparing to buy a research robot (I don’t want to advertise at this point) and am trying to figure out the concrete setup/configuration of the machine.

The machine will be in contact with vulnerable people as part of the research. So there are a lot of reasons why cloud-based services are not an option: privacy of the subjects, data protection laws, ethical concerns (ethics committees would never say OK).

The goal of my question is to get a bigger picture of this topic and its side topics.

No, you do not need a ‘cloud’.

Getting text from speech is a separate problem from the interpretation of that text. That last part is sometimes called NLU: Natural Language Understanding and is related to Natural Language Processing.
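To make that separation concrete, here is a minimal Python sketch. It is only an illustration: the `transcribe` callable stands in for whatever speech-to-text engine you end up using (offline or not), and `COMMANDS`, `interpret`, `handle_utterance` and `fake_transcribe` are hypothetical names made up for this example, not part of any ROS package.

```python
from typing import Callable, Optional

# Hypothetical command table: phrase fragment -> robot action name.
COMMANDS = {
    "stop": "halt",
    "come here": "approach_user",
    "go home": "return_to_dock",
}


def interpret(text: str) -> Optional[str]:
    """Map a transcribed string to an action name, or None if nothing matches.

    This is the "interpretation" half: plain string handling written by the
    programmer, with no NLU involved.
    """
    # Lower-case and collapse whitespace so matching is a bit more forgiving.
    normalized = " ".join(text.lower().split())
    for phrase, action in COMMANDS.items():
        if phrase in normalized:
            return action
    return None


def handle_utterance(audio: bytes, transcribe: Callable[[bytes], str]) -> Optional[str]:
    """Run speech-to-text first, then hand the resulting string to interpret()."""
    text = transcribe(audio)  # speech-to-text half: audio in, string out
    return interpret(text)


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for a real engine (assumption): pretends the audio is UTF-8 text."""
    return audio.decode("utf-8")
```

Swapping `fake_transcribe` for a real engine does not change `interpret` at all, which is the point: the speech-to-text part can run on the robot, and your own scripts decide what the strings mean. The substring matching is deliberately naive and would need tightening for real use.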

Several cloud or online services exist that do both, and there are some that only do the speech-to-text, not the NLU.
There are offline (ROS)-packages as well:


The Mycroft community has been working on local speech-to-text:

Mycroft also has a project called OpenSTT, although it seems to be presently dormant.

Also, if you happen to need something to go in the other direction, say TTS: