Mobile Grasping Robot (AGV+robot arm) with Voice & Vision Control

Unlike traditional remote-controlled robots, this system combines a mobile robot with a robotic arm and can understand context. When a user says "find the red cup and pick it up," it uses AI vision to locate the object, plans an approach, and executes precise movements. This compound robot is a great fit for researchers, students, or anyone fascinated by human-robot interaction.

This project originates from a graduate research team at Arizona State University. For detailed project information, please refer to the following link: Voice-Operated-Mobile-Manipulator-Turtlebot-Mycobot/README.md at main · Sjschhabra/Voice-Operated-Mobile-Manipulator-Turtlebot-Mycobot · GitHub

What You’ll Learn

● Integrating multiple AI models

● Building modular ROS 2 nodes for complex robotic systems

● Creating real-time dashboards with PyQt5

● Implementing sensor fusion with LiDAR, cameras, and IMU

● Designing robust human-robot interaction workflows

Hardware Requirements

● Mobile robot platform with built-in navigation (TurtleBot 4)

● 6-DOF collaborative robotic arm for manipulation tasks (Elephant Robotics myCobot 280)

● OAK-D Camera - Depth perception and RGB imaging

● Raspberry Pi 4 - Edge computing for AI inference

● External Microphone - High-quality audio capture

● Ethernet Switch - Reliable ROS 2 communication

Software Stack

Operating System: Ubuntu 22.04
ROS Version: ROS 2 Humble
AI Models: Whisper.cpp (Speech), YOLOv8 Nano (Vision)
GUI Framework: PyQt5
Communication: Cyclone DDS over Ethernet

System Architecture Deep Dive

This compound robot system operates on a perception-action loop that continuously processes multiple data streams:

  1. Audio Processing: Whisper.cpp converts speech to text locally

  2. Command Parsing: Natural language gets transformed into robot instructions

  3. Visual Recognition: YOLOv8 identifies and tracks objects in real-time

  4. Motion Planning: ROS 2 navigation stack handles path planning and obstacle avoidance

  5. Manipulation Control: Custom inverse kinematics drive the myCobot 280

ROS 2 Node Network

Each node handles a specific responsibility while communicating over ROS 2 topics, which keeps the system modular and fault tolerant.
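
Concretely, the nodes and topics built in the steps below wire together roughly as follows (topic names are taken from the code in this guide):

● voice_text: mic_listener_node → command_parser_node

● voice_command: command_parser_node → movement_controller_node

● cmd_vel: movement_controller_node → robot base

● detected_objects: object_detector_node → dashboard

● /oakd/rgb/preview/image_raw: OAK-D camera driver → object_detector_node, web_dashboard_node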

Step-by-Step Build Guide

Step 1: Environment Setup

First, prepare your development environment with ROS 2 and required dependencies:

# Install ROS 2 Humble
sudo apt update
sudo apt install ros-humble-desktop
 
# Install TurtleBot4 packages
sudo apt install ros-humble-turtlebot4-desktop
 
# Install Python dependencies
pip install whisper-cpp-python ultralytics PyQt5 opencv-python

Step 2: Audio Processing Node

Create the speech recognition system using Whisper.cpp for local, privacy-focused processing:

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import whisper_cpp
 
class MicListenerNode(Node):
    def __init__(self):
        super().__init__('mic_listener_node')
        self.publisher = self.create_publisher(String, 'voice_text', 10)
        self.whisper_model = whisper_cpp.Whisper('base.en')
        
        # Initialize microphone capture
        self.setup_audio_capture()
        
    def setup_audio_capture(self):
        # Configure audio parameters for optimal speech recognition
        self.sample_rate = 16000
        self.chunk_size = 1024
        # Audio capture implementation here
        
    def process_audio_chunk(self, audio_data):
        # Convert audio to text using Whisper
        result = self.whisper_model.transcribe(audio_data)
        
        if result['text'].strip():
            msg = String()
            msg.data = result['text']
            self.publisher.publish(msg)
            self.get_logger().info(f"Recognized: {result['text']}")
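
The snippet above only defines the node class. Every node in this guide also needs a standard rclpy entry point so it can run as an executable; a minimal one for the microphone listener looks like this, and the same pattern applies to the other nodes:

def main(args=None):
    # Standard rclpy entry point: initialize, create the node, spin, clean up
    rclpy.init(args=args)
    node = MicListenerNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()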

Step 3: Object Detection Integration

Implement YOLOv8 for real-time object recognition optimized for edge computing:

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge
from ultralytics import YOLO
import cv2
 
class ObjectDetectorNode(Node):
    def __init__(self):
        super().__init__('object_detector_node')
        
        # Load YOLOv8 Nano for speed/accuracy balance
        self.model = YOLO('yolov8n.pt')
        self.bridge = CvBridge()
        
        # Subscribe to camera feed
        self.subscription = self.create_subscription(
            Image, '/oakd/rgb/preview/image_raw', 
            self.image_callback, 10)
            
        # Publisher for detected objects
        self.detection_pub = self.create_publisher(
            String, 'detected_objects', 10)
    
    def image_callback(self, msg):
        # Convert ROS image to OpenCV format
        cv_image = self.bridge.imgmsg_to_cv2(msg, 'bgr8')
        
        # Run YOLO inference
        results = self.model(cv_image, conf=0.5)
        
        # Process detections
        detected_objects = []
        for result in results:
            for box in result.boxes:
                class_name = self.model.names[int(box.cls)]
                confidence = float(box.conf)
                detected_objects.append(f"{class_name}:{confidence:.2f}")
        
        # Publish results
        if detected_objects:
            detection_msg = String()
            detection_msg.data = ','.join(detected_objects)
            self.detection_pub.publish(detection_msg)
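
The detector publishes a plain comma-separated string of class_name:confidence pairs, for example "cup:0.87,person:0.66". How a downstream node consumes this is up to you; as a rough illustration (the helper below is hypothetical and not part of the original nodes), a grasping node could check whether the spoken target matches any current detection:

def target_detected(detections: str, target_phrase: str) -> bool:
    # detections: payload of the 'detected_objects' topic, e.g. "cup:0.87,person:0.66"
    # target_phrase: the spoken request, e.g. "pick up the red cup"
    for entry in detections.split(','):
        class_name = entry.split(':')[0]
        # Loose match: the YOLO class "cup" is contained in "pick up the red cup"
        if class_name and class_name in target_phrase.lower():
            return True
    return False

A node that subscribes to both voice_text and detected_objects could use a check like this to trigger grasp_object only once the requested object is actually in view.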

Step 4: Command Processing Logic

Build the brain that converts natural language into robot actions.

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import re
 
class CommandParserNode(Node):
    def __init__(self):
        super().__init__('command_parser_node')
        
        # Command vocabulary mapping
        self.navigation_commands = {
            r'move forward|go forward|forward': 'move_forward',
            r'turn left|left': 'turn_left', 
            r'turn right|right': 'turn_right',
            r'stop|halt': 'stop',
            r'go back|reverse|backward': 'move_backward'
        }
        
        self.manipulation_commands = {
            r'pick up|grab|grasp': 'grasp_object',
            r'wave|hello': 'wave_gesture',
            r'home position|rest': 'home_position'
        }
        
        # Subscribers and publishers
        self.voice_sub = self.create_subscription(
            String, 'voice_text', self.parse_command, 10)
        self.cmd_pub = self.create_publisher(String, 'voice_command', 10)
    
    def parse_command(self, msg):
        text = msg.data.lower()
        
        # Check navigation commands
        for pattern, command in self.navigation_commands.items():
            if re.search(pattern, text):
                self.publish_command(command)
                return
        
        # Check manipulation commands  
        for pattern, command in self.manipulation_commands.items():
            if re.search(pattern, text):
                self.publish_command(command)
                return
                
        self.get_logger().info(f"Unknown command: {text}")
    
    def publish_command(self, command):
        cmd_msg = String()
        cmd_msg.data = command
        self.cmd_pub.publish(cmd_msg)
        self.get_logger().info(f"Parsed command: {command}")
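
You can test the parser without the microphone by publishing a phrase manually, for example:

ros2 topic pub --once /voice_text std_msgs/msg/String "{data: 'move forward'}"

The node should log "Parsed command: move_forward" and republish it on the voice_command topic.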

Step 5: Movement Controller

Copy the following Python code to implement the navigation logic and translate parsed commands into robot motion.

#!/usr/bin/env python3
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist
from std_msgs.msg import String
 
class MovementControllerNode(Node):
    def __init__(self):
        super().__init__('movement_controller_node')
        
        # Publishers for robot control
        self.cmd_vel_pub = self.create_publisher(Twist, 'cmd_vel', 10)
        
        # Subscriber for voice commands
        self.command_sub = self.create_subscription(
            String, 'voice_command', self.execute_command, 10)
        
        # Movement parameters
        self.linear_speed = 0.2  # m/s
        self.angular_speed = 0.5  # rad/s
        
    def execute_command(self, msg):
        command = msg.data
        twist = Twist()
        
        if command == 'move_forward':
            twist.linear.x = self.linear_speed
        elif command == 'move_backward':
            twist.linear.x = -self.linear_speed
        elif command == 'turn_left':
            twist.angular.z = self.angular_speed
        elif command == 'turn_right':
            twist.angular.z = -self.angular_speed
        elif command == 'stop':
            # All velocities default to 0
            pass
        
        self.cmd_vel_pub.publish(twist)
        self.get_logger().info(f"Executing: {command}")
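
As written, a movement command keeps the base moving until the user says "stop". A common safety tweak (a sketch, not part of the original node) is to stop automatically a few seconds after each command: initialize self.stop_timer = None and self.command_timeout = 3.0 in __init__, call self.reset_stop_timer() at the end of execute_command, and add these methods to MovementControllerNode:

    def reset_stop_timer(self):
        # Restart the auto-stop countdown whenever a new command arrives
        if self.stop_timer is not None:
            self.stop_timer.cancel()
        self.stop_timer = self.create_timer(self.command_timeout, self.auto_stop)

    def auto_stop(self):
        # rclpy timers repeat by default, so cancel here to make this one-shot
        self.stop_timer.cancel()
        self.cmd_vel_pub.publish(Twist())  # an all-zero Twist stops the robot
        self.get_logger().info("Auto-stop: command timeout reached")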

Step 6: Real-Time Dashboard

Copy the following Python code to create an intuitive PyQt5 GUI for monitoring the system in real time.

#!/usr/bin/env python3
import sys
import rclpy
from rclpy.node import Node
from PyQt5.QtWidgets import *
from PyQt5.QtCore import *
from PyQt5.QtGui import *
from sensor_msgs.msg import Image
from std_msgs.msg import String
import cv2
import numpy as np
 
class DashboardNode(Node, QMainWindow):
    def __init__(self):
        Node.__init__(self, 'web_dashboard_node')
        QMainWindow.__init__(self)
        
        self.setWindowTitle("Voice Robot Control Dashboard")
        self.setGeometry(100, 100, 1200, 800)
        
        # Create central widget with tabs
        self.central_widget = QTabWidget()
        self.setCentralWidget(self.central_widget)
        
        self.setup_camera_tab()
        self.setup_voice_tab()
        self.setup_sensors_tab()
        self.setup_logs_tab()
        
        # ROS subscriptions
        self.image_sub = self.create_subscription(
            Image, '/oakd/rgb/preview/image_raw', 
            self.update_camera_feed, 10)
        self.voice_sub = self.create_subscription(
            String, 'voice_text',
            self.update_voice_display, 10)
    
    def setup_camera_tab(self):
        camera_widget = QWidget()
        layout = QVBoxLayout()
        
        # Video display
        self.video_label = QLabel()
        self.video_label.setMinimumSize(640, 480)
        self.video_label.setStyleSheet("border: 2px solid gray")
        layout.addWidget(self.video_label)
        
        # Detection info
        self.detection_info = QTextEdit()
        self.detection_info.setMaximumHeight(100)
        layout.addWidget(self.detection_info)
        
        camera_widget.setLayout(layout)
        self.central_widget.addTab(camera_widget, "Camera")
    
    def setup_voice_tab(self):
        voice_widget = QWidget()
        layout = QVBoxLayout()
        
        # Voice command display
        self.voice_display = QTextEdit()
        self.voice_display.setFont(QFont("Arial", 14))
        layout.addWidget(QLabel("Voice Commands:"))
        layout.addWidget(self.voice_display)
        
        # Manual command input
        self.manual_input = QLineEdit()
        self.manual_input.setPlaceholderText("Type command manually...")
        layout.addWidget(self.manual_input)
        
        voice_widget.setLayout(layout)
        self.central_widget.addTab(voice_widget, "Voice Control")
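
The class above calls several methods that are not shown. A minimal sketch of the missing pieces follows; it assumes from cv_bridge import CvBridge is added to the imports, and it drives the ROS executor from the Qt event loop with a QTimer, which is one simple way to combine the two. Add the methods inside DashboardNode and the main() function at module level:

    def setup_sensors_tab(self):
        # Placeholder tab for LiDAR / IMU readouts
        sensors_widget = QWidget()
        layout = QVBoxLayout()
        self.sensor_display = QTextEdit()
        layout.addWidget(QLabel("Sensor Status:"))
        layout.addWidget(self.sensor_display)
        sensors_widget.setLayout(layout)
        self.central_widget.addTab(sensors_widget, "Sensors")

    def setup_logs_tab(self):
        # Rolling log view
        logs_widget = QWidget()
        layout = QVBoxLayout()
        self.log_display = QTextEdit()
        self.log_display.setReadOnly(True)
        layout.addWidget(self.log_display)
        logs_widget.setLayout(layout)
        self.central_widget.addTab(logs_widget, "Logs")

    def update_voice_display(self, msg):
        # Append each recognized phrase to the Voice Control tab
        self.voice_display.append(msg.data)

    def update_camera_feed(self, msg):
        # Convert the ROS image to a Qt pixmap and display it
        # (CvBridge() per frame keeps this self-contained; caching it on self also works)
        cv_image = CvBridge().imgmsg_to_cv2(msg, 'bgr8')
        rgb = cv2.cvtColor(cv_image, cv2.COLOR_BGR2RGB)
        h, w, ch = rgb.shape
        qimg = QImage(rgb.data, w, h, ch * w, QImage.Format_RGB888)
        self.video_label.setPixmap(QPixmap.fromImage(qimg))

def main(args=None):
    rclpy.init(args=args)
    app = QApplication(sys.argv)
    dashboard = DashboardNode()
    dashboard.show()

    # Drive ROS callbacks from the Qt event loop
    ros_timer = QTimer()
    ros_timer.timeout.connect(lambda: rclpy.spin_once(dashboard, timeout_sec=0))
    ros_timer.start(30)  # roughly 30 ms between spins

    exit_code = app.exec_()
    dashboard.destroy_node()
    rclpy.shutdown()
    sys.exit(exit_code)

if __name__ == '__main__':
    main()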

Step 7: System Integration & Launch

Create launch files to orchestrate all nodes:

# launch/voice_robot_launch.py
from launch import LaunchDescription
from launch_ros.actions import Node
 
def generate_launch_description():
    return LaunchDescription([
        Node(
            package='voice_robot',
            executable='mic_listener_node',
            name='mic_listener'
        ),
        Node(
            package='voice_robot', 
            executable='command_parser_node',
            name='command_parser'
        ),
        Node(
            package='voice_robot',
            executable='object_detector_node', 
            name='object_detector'
        ),
        Node(
            package='voice_robot',
            executable='movement_controller_node',
            name='movement_controller'
        ),
        Node(
            package='voice_robot',
            executable='web_dashboard_node',
            name='dashboard'
        )
    ])
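
For ros2 launch to find these executables, the package's setup.py must register them as console scripts. A sketch, assuming a Python package named voice_robot with one module per node and a main() in each, might look like:

# setup.py (excerpt)
entry_points={
    'console_scripts': [
        'mic_listener_node = voice_robot.mic_listener_node:main',
        'command_parser_node = voice_robot.command_parser_node:main',
        'object_detector_node = voice_robot.object_detector_node:main',
        'movement_controller_node = voice_robot.movement_controller_node:main',
        'web_dashboard_node = voice_robot.web_dashboard_node:main',
    ],
},

After building with colcon build and sourcing the workspace, the whole system starts with ros2 launch voice_robot voice_robot_launch.py (assuming the launch file is installed to the package's share directory).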

Testing and Calibration

Voice Recognition Tuning

  1. Microphone Positioning: Place mic 1-2 feet from user, away from robot motors

  2. Noise Filtering: Implement voice activity detection to reduce false triggers (a minimal sketch follows this list)

  3. Command Vocabulary: Start with 10-15 core commands, expand gradually

  4. Confidence Thresholds: Adjust Whisper confidence levels based on environment
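
For the noise-filtering item above, a minimal energy-based voice activity gate is often enough to cut down on false triggers (a simple sketch; dedicated libraries such as webrtcvad handle this more robustly):

import numpy as np

def is_speech(audio_chunk, threshold=0.01):
    # audio_chunk: float32 samples in [-1, 1]; tune the threshold to your room
    rms = np.sqrt(np.mean(np.square(audio_chunk)))
    return rms > threshold

In MicListenerNode.process_audio_chunk, call is_speech(audio_data) first and return early when it is False, so Whisper only runs on chunks that actually contain voice.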

Vision System Optimization

  1. Camera Calibration: Ensure proper focal length and distortion correction

  2. Lighting Conditions: Test under various lighting scenarios

  3. Detection Confidence: Tune YOLO confidence thresholds (0.5-0.7 range)

  4. Object Training: Fine-tune on specific objects in your environment
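
For the object-training item, Ultralytics can fine-tune the same nano checkpoint on a custom dataset; a minimal sketch (the dataset YAML name and training settings below are placeholders) looks like:

from ultralytics import YOLO

# Start from the same checkpoint used by object_detector_node
model = YOLO('yolov8n.pt')

# 'my_objects.yaml' is a placeholder dataset config in Ultralytics format
model.train(data='my_objects.yaml', epochs=50, imgsz=640)

The resulting best.pt weights can then replace 'yolov8n.pt' in ObjectDetectorNode.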

Summary

This voice-controlled mobile manipulator demonstrates the power of integrating multiple AI technologies in a single robotic platform. By combining voice recognition, computer vision, and autonomous navigation, we’ve created a system that bridges the gap between human communication and robotic action.

The modular ROS 2 architecture ensures the system can grow and adapt as new capabilities are needed. Whether you’re a researcher exploring human-robot interaction, a student learning about AI integration, or an enthusiast building the next generation of service robots, this project provides a solid foundation for innovation.
