No, you do not need a ‘cloud’.
Getting text from speech is a separate problem from interpreting that text. The latter is sometimes called Natural Language Understanding (NLU) and is related to Natural Language Processing (NLP).
Several cloud or online services do both, and there are some that only do the speech-to-text part, not the NLU.
There are offline (ROS) packages as well:
- https://wiki.ros.org/pocketsphinx (though not easy to make reliable in my experience). Also take a look at https://answers.ros.org/question/246247/speech-recognition-packages-for-ros-kinetic-kame/
- For RoboCup@Home, my team uses https://github.com/tue-robotics/dragonfly_speech_recognition together with https://github.com/tue-robotics/grammar_parser
- https://answers.ros.org/question/60323/speech-recognition-packages/
- https://github.com/julius-speech/julius
- https://snips.ai/
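
To illustrate the split between speech-to-text and interpretation, here is a minimal sketch in plain Python. It does not use the API of any of the packages above: `speech_to_text` is a hypothetical placeholder for whichever recognizer you pick, and `GRAMMAR` and `understand` are made-up names for a toy keyword grammar, not a real NLU library.

```python
import re
from typing import Optional

# Hypothetical toy grammar: each intent maps to a regular expression
# whose named groups become the slots of the parsed command.
GRAMMAR = {
    "go_to": re.compile(r"\b(?:go|drive|move) to the (?P<location>\w+)"),
    "bring": re.compile(r"\bbring me (?:a|an|the) (?P<object>\w+)"),
}


def speech_to_text(audio: bytes) -> str:
    """Stage 1 placeholder (assumption): a real system would call an
    offline recognizer here. This stand-in pretends the audio is UTF-8 text."""
    return audio.decode("utf-8")


def understand(text: str) -> Optional[dict]:
    """Stage 2: turn recognized text into an intent plus slots.
    Returns None when no pattern in the grammar matches."""
    normalized = text.lower().strip()
    for intent, pattern in GRAMMAR.items():
        match = pattern.search(normalized)
        if match:
            return {"intent": intent, **match.groupdict()}
    return None


if __name__ == "__main__":
    recognized = speech_to_text(b"Please go to the kitchen")
    print(understand(recognized))
```

Because the two stages only share a string, you can swap the recognizer (offline or cloud) without touching the interpretation step, and the other way around.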