Unlike a traditional remote-controlled robot, this system combines a mobile base with a robotic arm and is able to understand context. When the user says "find the red cup and pick it up," it uses AI vision to locate the object, plans its approach, and executes precise movements. This compound robot is perfect for researchers, students, or anyone fascinated by human-robot interaction.
This project originates from a graduate research team's laboratory at Arizona State University. For detailed project information, see the README in the GitHub repository: Sjschhabra/Voice-Operated-Mobile-Manipulator-Turtlebot-Mycobot.
What You’ll Learn
● Integrating multiple AI models
● Building modular ROS 2 nodes for complex robotic systems
● Creating real-time dashboards with PyQt5
● Implementing sensor fusion with LiDAR, cameras, and IMU
● Designing robust human-robot interaction workflows
Hardware Requirements
● Mobile robot platform with built-in navigation (TurtleBot 4)
● 6-DOF collaborative robotic arm for manipulation tasks (Elephant Robotics myCobot 280)
● OAK-D Camera - Depth perception and RGB imaging
● Raspberry Pi 4 - Edge computing for AI inference
● External Microphone - High-quality audio capture
● Ethernet Switch - Reliable ROS 2 communication
Software Stack
Operating System: Ubuntu 22.04
ROS Version: ROS 2 Humble
AI Models: Whisper.cpp (Speech), YOLOv8 Nano (Vision)
GUI Framework: PyQt5
Communication: Cyclone DDS over Ethernet
System Architecture Deep Dive
This compound robot system operates on a perception-action loop that continuously processes multiple data streams:
- Audio Processing: Whisper.cpp converts speech to text locally
- Command Parsing: Natural language is transformed into robot instructions
- Visual Recognition: YOLOv8 identifies and tracks objects in real time
- Motion Planning: The ROS 2 navigation stack handles path planning and obstacle avoidance
- Manipulation Control: Custom inverse kinematics drive the myCobot 280
ROS 2 Node Network
Each node handles specific responsibilities while communicating through ROS 2 topics, ensuring modularity and fault tolerance.
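Concretely, the data flows through a handful of topics; the names below are taken from the node code in the following steps, and "downstream consumers" is a placeholder for whatever subscribes to the detections (for example, the manipulation logic):

voice_text: mic_listener_node → command_parser_node (also shown on the dashboard)
voice_command: command_parser_node → movement_controller_node
cmd_vel: movement_controller_node → robot base
detected_objects: object_detector_node → downstream consumers
/oakd/rgb/preview/image_raw: OAK-D driver → object_detector_node and the dashboard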
Step-by-Step Build Guide
Step 1: Environment Setup
First, prepare your development environment with ROS 2 and required dependencies:
# Install ROS 2 Humble
sudo apt update
sudo apt install ros-humble-desktop
# Install TurtleBot4 packages
sudo apt install ros-humble-turtlebot4-desktop
# Install Python dependencies
pip install whisper-cpp-python ultralytics PyQt5 opencv-python
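Finally, source the ROS 2 environment in every shell that will run these nodes (adding the line to ~/.bashrc makes it permanent):

source /opt/ros/humble/setup.bash
echo "source /opt/ros/humble/setup.bash" >> ~/.bashrc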
Step 2: Audio Processing Node
Create the speech recognition system using Whisper.cpp for local, privacy-focused processing:
#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import whisper_cpp

class MicListenerNode(Node):
    def __init__(self):
        super().__init__('mic_listener_node')
        self.publisher = self.create_publisher(String, 'voice_text', 10)
        self.whisper_model = whisper_cpp.Whisper('base.en')
        # Initialize microphone capture
        self.setup_audio_capture()

    def setup_audio_capture(self):
        # Configure audio parameters for optimal speech recognition
        self.sample_rate = 16000
        self.chunk_size = 1024
        # Audio capture implementation here

    def process_audio_chunk(self, audio_data):
        # Convert audio to text using Whisper
        result = self.whisper_model.transcribe(audio_data)
        if result['text'].strip():
            msg = String()
            msg.data = result['text']
            self.publisher.publish(msg)
            self.get_logger().info(f"Recognized: {result['text']}")
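The node above leaves the actual microphone capture as a placeholder, and like every node in this guide it still needs a standard rclpy entry point so it can be installed as an executable. A minimal version looks like this (the same boilerplate applies to the other nodes):

def main(args=None):
    rclpy.init(args=args)
    node = MicListenerNode()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        rclpy.shutdown()

if __name__ == '__main__':
    main()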
Step 3: Object Detection Integration
Implement YOLOv8 for real-time object recognition optimized for edge computing:
#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge
from ultralytics import YOLO
import cv2

class ObjectDetectorNode(Node):
    def __init__(self):
        super().__init__('object_detector_node')
        # Load YOLOv8 Nano for speed/accuracy balance
        self.model = YOLO('yolov8n.pt')
        self.bridge = CvBridge()
        # Subscribe to camera feed
        self.subscription = self.create_subscription(
            Image, '/oakd/rgb/preview/image_raw',
            self.image_callback, 10)
        # Publisher for detected objects
        self.detection_pub = self.create_publisher(
            String, 'detected_objects', 10)

    def image_callback(self, msg):
        # Convert ROS image to OpenCV format
        cv_image = self.bridge.imgmsg_to_cv2(msg, 'bgr8')
        # Run YOLO inference
        results = self.model(cv_image, conf=0.5)
        # Process detections
        detected_objects = []
        for result in results:
            for box in result.boxes:
                class_name = self.model.names[int(box.cls)]
                confidence = float(box.conf)
                detected_objects.append(f"{class_name}:{confidence:.2f}")
        # Publish results
        if detected_objects:
            detection_msg = String()
            detection_msg.data = ','.join(detected_objects)
            self.detection_pub.publish(detection_msg)
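To sanity-check the detector on its own, assuming it is installed in the voice_robot package used by the launch file in Step 7 and the OAK-D driver is publishing on /oakd/rgb/preview/image_raw, run the node in one terminal and watch the detection topic in another:

ros2 run voice_robot object_detector_node
ros2 topic echo /detected_objects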
Step 4: Command Processing Logic
Build the brain that converts natural language into robot actions.
#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import re

class CommandParserNode(Node):
    def __init__(self):
        super().__init__('command_parser_node')
        # Command vocabulary mapping
        self.navigation_commands = {
            r'move forward|go forward|forward': 'move_forward',
            r'turn left|left': 'turn_left',
            r'turn right|right': 'turn_right',
            r'stop|halt': 'stop',
            r'go back|reverse|backward': 'move_backward'
        }
        self.manipulation_commands = {
            r'pick up|grab|grasp': 'grasp_object',
            r'wave|hello': 'wave_gesture',
            r'home position|rest': 'home_position'
        }
        # Subscribers and publishers
        self.voice_sub = self.create_subscription(
            String, 'voice_text', self.parse_command, 10)
        self.cmd_pub = self.create_publisher(String, 'voice_command', 10)

    def parse_command(self, msg):
        text = msg.data.lower()
        # Check navigation commands
        for pattern, command in self.navigation_commands.items():
            if re.search(pattern, text):
                self.publish_command(command)
                return
        # Check manipulation commands
        for pattern, command in self.manipulation_commands.items():
            if re.search(pattern, text):
                self.publish_command(command)
                return
        self.get_logger().info(f"Unknown command: {text}")

    def publish_command(self, command):
        cmd_msg = String()
        cmd_msg.data = command
        self.cmd_pub.publish(cmd_msg)
        self.get_logger().info(f"Parsed command: {command}")
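The parser can be exercised without the microphone by publishing a phrase on voice_text from the command line and watching what comes out on voice_command; with the node running, the phrase below should produce grasp_object:

ros2 topic pub --once /voice_text std_msgs/msg/String "{data: 'pick up the red cup'}"
ros2 topic echo /voice_command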
Step 5: Movement Controller
Copy the following Python code to implement the navigation logic that translates commands into robot motion.
#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist
from std_msgs.msg import String

class MovementControllerNode(Node):
    def __init__(self):
        super().__init__('movement_controller_node')
        # Publishers for robot control
        self.cmd_vel_pub = self.create_publisher(Twist, 'cmd_vel', 10)
        # Subscriber for voice commands
        self.command_sub = self.create_subscription(
            String, 'voice_command', self.execute_command, 10)
        # Movement parameters
        self.linear_speed = 0.2   # m/s
        self.angular_speed = 0.5  # rad/s

    def execute_command(self, msg):
        command = msg.data
        twist = Twist()
        if command == 'move_forward':
            twist.linear.x = self.linear_speed
        elif command == 'move_backward':
            twist.linear.x = -self.linear_speed
        elif command == 'turn_left':
            twist.angular.z = self.angular_speed
        elif command == 'turn_right':
            twist.angular.z = -self.angular_speed
        elif command == 'stop':
            # All velocities default to 0
            pass
        self.cmd_vel_pub.publish(twist)
        self.get_logger().info(f"Executing: {command}")
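With the movement controller running, the same technique verifies the command-to-motion path: publish a parsed command and confirm that a Twist with linear.x = 0.2 appears on cmd_vel:

ros2 topic pub --once /voice_command std_msgs/msg/String "{data: 'move_forward'}"
ros2 topic echo /cmd_vel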
Step 6: Real-Time Dashboard
Copy the following Python code to create an intuitive PyQt5 GUI for monitoring.
#!/usr/bin/env python3
import sys
import rclpy
from rclpy.node import Node
from PyQt5.QtWidgets import *
from PyQt5.QtCore import *
from PyQt5.QtGui import *
from sensor_msgs.msg import Image
from std_msgs.msg import String
import cv2
import numpy as np

class DashboardNode(Node, QMainWindow):
    def __init__(self):
        Node.__init__(self, 'web_dashboard_node')
        QMainWindow.__init__(self)
        self.setWindowTitle("Voice Robot Control Dashboard")
        self.setGeometry(100, 100, 1200, 800)
        # Create central widget with tabs
        self.central_widget = QTabWidget()
        self.setCentralWidget(self.central_widget)
        self.setup_camera_tab()
        self.setup_voice_tab()
        self.setup_sensors_tab()
        self.setup_logs_tab()
        # ROS subscriptions
        self.image_sub = self.create_subscription(
            Image, '/oakd/rgb/preview/image_raw',
            self.update_camera_feed, 10)
        self.voice_sub = self.create_subscription(
            String, 'voice_text',
            self.update_voice_display, 10)

    def setup_camera_tab(self):
        camera_widget = QWidget()
        layout = QVBoxLayout()
        # Video display
        self.video_label = QLabel()
        self.video_label.setMinimumSize(640, 480)
        self.video_label.setStyleSheet("border: 2px solid gray")
        layout.addWidget(self.video_label)
        # Detection info
        self.detection_info = QTextEdit()
        self.detection_info.setMaximumHeight(100)
        layout.addWidget(self.detection_info)
        camera_widget.setLayout(layout)
        self.central_widget.addTab(camera_widget, "Camera")

    def setup_voice_tab(self):
        voice_widget = QWidget()
        layout = QVBoxLayout()
        # Voice command display
        self.voice_display = QTextEdit()
        self.voice_display.setFont(QFont("Arial", 14))
        layout.addWidget(QLabel("Voice Commands:"))
        layout.addWidget(self.voice_display)
        # Manual command input
        self.manual_input = QLineEdit()
        self.manual_input.setPlaceholderText("Type command manually...")
        layout.addWidget(self.manual_input)
        voice_widget.setLayout(layout)
        self.central_widget.addTab(voice_widget, "Voice Control")
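The excerpt above references setup_sensors_tab, setup_logs_tab, update_camera_feed, and update_voice_display without showing them, and it has no entry point. The sketch below is one way to fill those gaps, assuming a tightly packed bgr8 image from the OAK-D preview stream and a QTimer to service ROS callbacks alongside the Qt event loop; the widget layout inside the extra tabs is purely illustrative:

    def setup_sensors_tab(self):
        # Simple text readout; wire LiDAR/IMU topics here as needed
        sensors_widget = QWidget()
        layout = QVBoxLayout()
        self.sensor_text = QTextEdit()
        layout.addWidget(self.sensor_text)
        sensors_widget.setLayout(layout)
        self.central_widget.addTab(sensors_widget, "Sensors")

    def setup_logs_tab(self):
        logs_widget = QWidget()
        layout = QVBoxLayout()
        self.log_text = QTextEdit()
        layout.addWidget(self.log_text)
        logs_widget.setLayout(layout)
        self.central_widget.addTab(logs_widget, "Logs")

    def update_camera_feed(self, msg):
        # Assumes a tightly packed bgr8 image (step == 3 * width)
        frame = np.frombuffer(msg.data, dtype=np.uint8).reshape(msg.height, msg.width, 3)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image = QImage(frame.tobytes(), msg.width, msg.height,
                       3 * msg.width, QImage.Format_RGB888)
        self.video_label.setPixmap(QPixmap.fromImage(image))

    def update_voice_display(self, msg):
        self.voice_display.append(msg.data)

def main():
    rclpy.init()
    app = QApplication(sys.argv)
    dashboard = DashboardNode()
    dashboard.show()
    # Pump ROS callbacks from the Qt event loop
    ros_timer = QTimer()
    ros_timer.timeout.connect(lambda: rclpy.spin_once(dashboard, timeout_sec=0))
    ros_timer.start(50)
    exit_code = app.exec_()
    dashboard.destroy_node()
    rclpy.shutdown()
    sys.exit(exit_code)

if __name__ == '__main__':
    main()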
Step 7: System Integration & Launch
Create launch files to orchestrate all nodes:
# launch/voice_robot_launch.py
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        Node(
            package='voice_robot',
            executable='mic_listener_node',
            name='mic_listener'
        ),
        Node(
            package='voice_robot',
            executable='command_parser_node',
            name='command_parser'
        ),
        Node(
            package='voice_robot',
            executable='object_detector_node',
            name='object_detector'
        ),
        Node(
            package='voice_robot',
            executable='movement_controller_node',
            name='movement_controller'
        ),
        Node(
            package='voice_robot',
            executable='web_dashboard_node',
            name='dashboard'
        )
    ])
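With everything packaged (the examples assume a Python package named voice_robot whose setup.py registers the executables listed above and installs the launch file), the stack builds and starts with the usual ROS 2 workflow; the workspace path is just an example:

cd ~/ros2_ws
colcon build --packages-select voice_robot
source install/setup.bash
ros2 launch voice_robot voice_robot_launch.py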
Testing and Calibration
Voice Recognition Tuning
- Microphone Positioning: Place the mic 1-2 feet from the user, away from the robot's motors
- Noise Filtering: Implement voice activity detection to reduce false triggers (see the sketch after this list)
- Command Vocabulary: Start with 10-15 core commands and expand gradually
- Confidence Thresholds: Adjust Whisper confidence levels based on the environment
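The voice activity detection mentioned above does not need to be sophisticated to help; a simple RMS energy gate in front of Whisper already filters out silence and motor hum. The helper below is a minimal sketch (the threshold is an assumption to tune for your microphone, and it expects float32 samples in [-1, 1]):

import numpy as np

def is_speech(audio_chunk, threshold=0.01):
    # Crude energy-based voice activity check; tune threshold per room and mic
    rms = np.sqrt(np.mean(np.square(audio_chunk)))
    return rms > threshold

# In MicListenerNode.process_audio_chunk(), transcribe only when speech is present:
#     if is_speech(audio_data):
#         result = self.whisper_model.transcribe(audio_data)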
Vision System Optimization
- Camera Calibration: Ensure proper focal length and distortion correction
- Lighting Conditions: Test under various lighting scenarios
- Detection Confidence: Tune the YOLO confidence threshold (0.5-0.7 range; see the example after this list)
- Object Training: Fine-tune the model on the specific objects in your environment
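Tuning the detection confidence is a one-line change to the YOLO call in Step 3's image_callback; raising conf from 0.5 toward 0.7 trades recall for fewer false positives:

# Stricter threshold: fewer, more reliable detections
results = self.model(cv_image, conf=0.6)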
Summary
This voice-controlled mobile manipulator demonstrates the power of integrating multiple AI technologies in a single robotic platform. By combining voice recognition, computer vision, and autonomous navigation, we’ve created a system that bridges the gap between human communication and robotic action.
The modular ROS 2 architecture ensures the system can grow and adapt as new capabilities are needed. Whether you’re a researcher exploring human-robot interaction, a student learning about AI integration, or an enthusiast building the next generation of service robots, this project provides a solid foundation for innovation.