Cobot Magic: AgileX completes the full Mobile Aloha model training workflow in both simulation and real environments
Mobile Aloha is a whole-body teleoperation data collection system developed by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn from Stanford University (link).
Based on Mobile Aloha, AgileX developed Cobot Magic, which runs the complete Mobile Aloha code stack at a higher hardware configuration and lower cost, and is equipped with higher-payload robotic arms and a high-performance industrial computer. For more details about Cobot Magic, please check the AgileX website.
Currently, AgileX has successfully completed the integration of Cobot Magic based on the Mobile Aloha source code project.
Simulation data training
Data collection
After setting up the Mobile Aloha software environment (mentioned in the last section), model training can be carried out in both the simulation and real environments. This part covers data collection in the simulation environment. The data is provided by the team of Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. You can find all scripted/human demos for the simulated environments here.
After downloading, copy it to the act-plus-plus/data directory. The directory structure is as follows:
act-plus-plus/data
├── sim_insertion_human
│   ├── sim_insertion_human-20240110T054847Z-001.zip
│   └── ...
├── sim_insertion_scripted
│   ├── sim_insertion_scripted-20240110T054854Z-001.zip
│   └── ...
├── sim_transfer_cube_human
│   ├── sim_transfer_cube_human-20240110T054900Z-001.zip
│   └── ...
└── sim_transfer_cube_scripted
    ├── sim_transfer_cube_scripted-20240110T054901Z-001.zip
    └── ...
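If you prefer to extract the archives programmatically, a minimal sketch is shown below. It assumes the zip files sit inside the per-task folders exactly as listed above; adjust the root path if your checkout lives elsewhere.
# Sketch: extract every downloaded archive next to its zip file.
# Assumption: the layout shown above, with act-plus-plus/data as the root.
import zipfile
from pathlib import Path

data_root = Path("act-plus-plus/data")
for archive in sorted(data_root.glob("*/*.zip")):
    print(f"Extracting {archive} ...")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(archive.parent)  # episodes land in the same task folder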
Generate episodes and render the results. In the example below, the terminal output shows 10 episodes, 2 of which are successful.
# 1 Run
python3 record_sim_episodes.py --task_name sim_transfer_cube_scripted --dataset_dir <data save dir> --num_episodes 50
# 2 Take sim_transfer_cube_scripted as an example
python3 record_sim_episodes.py --task_name sim_transfer_cube_scripted --dataset_dir data/sim_transfer_cube_scripted --num_episodes 10
# 2.1 Real-time rendering
python3 record_sim_episodes.py --task_name sim_transfer_cube_scripted --dataset_dir data/sim_transfer_cube_scripted --num_episodes 10 --onscreen_render
# 2.2 The output in the terminal shows
episode_idx=0
Rollout out EE space scripted policy
episode_idx=0 Failed
Replaying joint commands
episode_idx=0 Failed
Saving: 0.9 secs
episode_idx=1
Rollout out EE space scripted policy
episode_idx=1 Successful, episode_return=57
Replaying joint commands
episode_idx=1 Successful, episode_return=59
Saving: 0.6 secs
...
Saved to data/sim_transfer_cube_scripted
Success: 2 / 10
The rendered images are shown below:
Data Visualization
Visualize the simulation data. The following figures show episode 0 and episode 9 respectively.
The episode 0 frames in the dataset are shown below; this is a case where the gripper fails to pick up the cube.
The visualization of episode 9 shows a successful grasp.
Print the data for each joint of the robotic arms in the simulation environment. Joints 0-13 correspond to the 14 degrees of freedom of the two arms and their grippers.
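One way to produce such a plot is sketched below with h5py and matplotlib. It assumes the episodes follow the ACT HDF5 layout, where observations/qpos stores the 14-dimensional joint vector for every timestep; check your files if the keys differ.
# Sketch: print and plot the 14 joint/gripper channels of one simulated episode.
# Assumption: ACT-style HDF5 layout with a [T, 14] dataset at observations/qpos.
import h5py
import matplotlib.pyplot as plt

with h5py.File("data/sim_transfer_cube_scripted/episode_0.hdf5", "r") as f:
    qpos = f["observations/qpos"][:]  # joint positions, shape [T, 14]

print(f"timesteps: {qpos.shape[0]}, degrees of freedom: {qpos.shape[1]}")
fig, axes = plt.subplots(qpos.shape[1], 1, figsize=(8, 20), sharex=True)
for joint_idx, ax in enumerate(axes):
    ax.plot(qpos[:, joint_idx])
    ax.set_ylabel(f"joint {joint_idx}")
axes[-1].set_xlabel("timestep")
fig.tight_layout()
fig.savefig("episode_0_qpos.png")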
Model training and inference
The simulated-environment datasets must be downloaded first (see Data Collection).
python3 imitate_episodes.py --task_name sim_transfer_cube_scripted --ckpt_dir <ckpt dir> --policy_class ACT --kl_weight 10 --chunk_size 100 --hidden_dim 512 --batch_size 8 --dim_feedforward 3200 --num_epochs 2000 --lr 1e-5 --seed 0
# run
python3 imitate_episodes.py --task_name sim_transfer_cube_scripted --ckpt_dir trainings --policy_class ACT --kl_weight 1 --chunk_size 10 --hidden_dim 512 --batch_size 1 --dim_feedforward 3200 --lr 1e-5 --seed 0 --num_steps 2000
# During training, you will see the following prompt. If you do not have a W&B account, choose option 3.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
After training is completed, the weights will be saved to the trainings directory. The results are as follows:
trainings
├── config.pkl
├── dataset_stats.pkl
├── policy_best.ckpt
├── policy_last.ckpt
└── policy_step_0_seed_0.ckpt
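If you want to confirm that the checkpoints contain trained weights, they can be inspected directly. The sketch below assumes the .ckpt files are ordinary PyTorch state dicts, as saved by imitate_episodes.py in the upstream ACT code.
# Sketch: peek inside a saved checkpoint.
# Assumption: the .ckpt file is a plain PyTorch state dict.
import torch

state_dict = torch.load("trainings/policy_best.ckpt", map_location="cpu")
total_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {total_params / 1e6:.1f}M parameters")
for name, tensor in list(state_dict.items())[:5]:  # first few entries
    print(name, tuple(tensor.shape))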
Evaluate the model trained above:
# 1 Evaluate the policy; add --onscreen_render for real-time rendering
python3 imitate_episodes.py --eval --task_name sim_transfer_cube_scripted --ckpt_dir trainings --policy_class ACT --kl_weight 1 --chunk_size 10 --hidden_dim 512 --batch_size 1 --dim_feedforward 3200 --lr 1e-5 --seed 0 --num_steps 20 --onscreen_render
The rendered evaluation is shown below.
Data Training in real environment
Data Collection
1. Environment dependencies
1.1 ROS dependencies
Default: an Ubuntu 20.04 + ROS Noetic environment has already been configured.
sudo apt install ros-$ROS_DISTRO-sensor-msgs ros-$ROS_DISTRO-nav-msgs ros-$ROS_DISTRO-cv-bridge
1.2 Python dependencies
# Enter the current workspace directory and install the dependencies listed in the requirements file.
pip install -r requiredments.txt
2. Data collection
2.1 Run collect_data
python collect_data.py -h # see parameters
python collect_data.py --max_timesteps 500 --episode_idx 0
python collect_data.py --max_timesteps 500 --is_compress --episode_idx 0
python collect_data.py --max_timesteps 500 --use_depth_image --episode_idx 1
python collect_data.py --max_timesteps 500 --is_compress --use_depth_image --episode_idx 1
After data collection is completed, the data is saved in the ${dataset_dir}/${task_name} directory.
python collect_data.py --max_timesteps 500 --is_compress --episode_idx 0
# Generates the dataset file episode_0.hdf5. The directory structure is:
collect_data
├── collect_data.py
├── data                      # --dataset_dir
│   └── cobot_magic_agilex    # --task_name
│       ├── episode_0.hdf5    # The location of the generated dataset file
│       ├── episode_idx.hdf5  # idx depends on --episode_idx
│       └── ...
├── readme.md
├── replay_data.py
├── requiredments.txt
└── visualize_episodes.py
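To double-check what ended up inside episode_0.hdf5, you can walk the file and list its groups and datasets. The sketch below uses h5py; the exact keys depend on the flags used during collection (for example --use_depth_image and --is_compress).
# Sketch: recursively list the contents of a collected episode file.
import h5py

def describe(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    else:
        print(f"{name}/")

with h5py.File("data/cobot_magic_agilex/episode_0.hdf5", "r") as f:
    f.visititems(describe)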
The specific parameters are shown below:

Name | Explanation |
---|---|
dataset_dir | Dataset saving path |
task_name | Task name, used as the name of the dataset folder |
episode_idx | Action block (episode) index number |
max_timesteps | Maximum number of time steps per action block |
camera_names | Camera names, default ["cam_high", "cam_left_wrist", "cam_right_wrist"] |
img_front_topic | Camera 1 color image topic |
img_left_topic | Camera 2 color image topic |
img_right_topic | Camera 3 color image topic |
use_depth_image | Whether to use depth information |
depth_front_topic | Camera 1 depth map topic |
depth_left_topic | Camera 2 depth map topic |
depth_right_topic | Camera 3 depth map topic |
master_arm_left_topic | Left master arm topic |
master_arm_right_topic | Right master arm topic |
puppet_arm_left_topic | Left puppet arm topic |
puppet_arm_right_topic | Right puppet arm topic |
use_robot_base | Whether to use mobile base information |
robot_base_topic | Mobile base topic |
frame_rate | Acquisition frame rate; defaults to 30 fps, the rate at which the camera delivers stable images |
is_compress | Whether images are compressed before saving |
The camera views recorded during data collection are shown below:
Data visualization
Run the following code:
python visualize_episodes.py --dataset_dir ./data --task_name cobot_magic_agilex --episode_idx 0
Visualize the collected data. The --dataset_dir, --task_name, and --episode_idx arguments must match those used when collecting the data. When you run the above command, the terminal prints the actions and displays a color image window. The visualization results are as follows:
After the operation is completed, the files episode_${idx}_qpos.png, episode_${idx}_base_action.png, and episode_${idx}_video.mp4 are generated under ${dataset_dir}/${task_name}. The directory structure is as follows:
collect_data
├── data
│   ├── cobot_magic_agilex
│   │   └── episode_0.hdf5
│   ├── episode_0_base_action.png  # base_action
│   ├── episode_0_qpos.png         # qpos
│   └── episode_0_video.mp4        # Color video
Taking episode 30 as an example, replay the collected data. The camera views are as follows:
Model Training and Inference
The Mobile Aloha project studied different imitation-learning strategies and proposed ACT (Action Chunking with Transformers), a Transformer-based action chunking algorithm. It is essentially an end-to-end policy: it maps real-world RGB images directly to actions, letting the robot learn and imitate from visual input without hand-crafted intermediate representations, and it predicts in units of action chunks, which are integrated into accurate and smooth motion trajectories.
The model is as follows:
Let's break the model down and interpret it.
- Sample data
Input: 4 RGB images, each with a resolution of 480 × 640, plus the joint positions of the two robot arms (7 + 7 = 14 DoF in total)
Output: the action space is the absolute joint positions of the two robot arms, a 14-dimensional vector. With action chunking, the policy therefore outputs a k × 14 tensor given the current observation (each action is a 14-dimensional vector, so k actions form a k × 14 tensor)
- Infer Z
The input to the encoder is a [CLS] token consisting of randomly initialized learned weights, together with two projections: a linear layer (layer 2) projects the 14-dimensional joint positions to the 512-dimensional embedding space (the embedded joints), and another linear layer (layer 1) projects the k × 14 action sequence to a k × 512 embedded action sequence.
These three inputs form a sequence of (k + 2) × embedding_dimension, i.e. (k + 2) × 512, which is processed by the transformer encoder. Only the first output, corresponding to the [CLS] token, is kept; another linear network predicts the mean and variance of the Z distribution, parameterizing it as a diagonal Gaussian. A sample of Z is then obtained via reparameterization.
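To make the shapes concrete, here is a compact PyTorch sketch of this encoder path. It follows the dimensions quoted above (14-DoF joints, a k-step action chunk, 512-dimensional embeddings); the latent size and layer count are illustrative assumptions, not the project's exact implementation.
# Sketch of the CVAE-style encoder that infers the latent variable Z.
# Assumptions: latent size 32 and 4 encoder layers are illustrative choices.
import torch
import torch.nn as nn

k, d_model, z_dim = 100, 512, 32

cls_token   = nn.Parameter(torch.randn(1, 1, d_model))   # learned [CLS] embedding
proj_joints = nn.Linear(14, d_model)                      # "linear layer 2": joints -> 512
proj_action = nn.Linear(14, d_model)                      # "linear layer 1": k x 14 -> k x 512
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4)
to_latent = nn.Linear(d_model, 2 * z_dim)                 # predicts mean and log-variance of Z

def infer_z(joints, actions):
    # joints: [B, 14], actions: [B, k, 14]
    B = joints.shape[0]
    tokens = torch.cat([cls_token.expand(B, -1, -1),      # [B, 1, 512]
                        proj_joints(joints).unsqueeze(1), # [B, 1, 512]
                        proj_action(actions)], dim=1)     # [B, k+2, 512]
    cls_out = encoder(tokens)[:, 0]                       # keep only the [CLS] output
    mu, logvar = to_latent(cls_out).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
    return z, mu, logvar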
- Predict an action sequence
① First, each image observation is processed by a ResNet18 to obtain a feature map (15 × 20 × 728), which is then flattened into a feature sequence (300 × 728). These features are projected to the embedding dimension (300 × 512) by a linear layer (layer 5), and a 2D sinusoidal position embedding is added to preserve spatial information.
② Second, this operation is repeated for all 4 images, giving a feature sequence of dimension 1200 × 512.
③ Next, the feature sequences from the cameras are concatenated and used as one input to the transformer encoder. The other two inputs, the current joint positions and the "style variable" z, are projected to 512 dimensions from their original dimensions (14 and 15) by linear layers 6 and 7 respectively.
④ Finally, the transformer encoder input is 1202 × 512 (1200 × 512 from the 4 images, 1 × 512 from the joint positions, and 1 × 512 from the style variable z).
The input to the transformer decoder has two parts:
On the one hand, the "queries" of the transformer decoder are fixed sinusoidal position embeddings, i.e. the position embeddings (fixed) shown in the lower right corner of the figure above, with dimension k × 512.
On the other hand, the "keys" and "values" in the cross-attention layers of the transformer decoder come from the output of the transformer encoder described above.
The transformer decoder thus predicts the action sequence given the encoder output.
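Putting the pieces together, the prediction path can be sketched as follows. The dimensions mirror the description above (ResNet18 features, 512-dimensional embeddings, k queries, 14-dimensional actions); positional embeddings are omitted and the latent size and layer counts are illustrative assumptions, so this is a sketch rather than the upstream implementation.
# Sketch of the prediction path: 4 camera images + joints + z -> k x 14 action chunk.
# Assumptions: latent size 32, 4 encoder / 7 decoder layers; positional embeddings omitted.
import torch
import torch.nn as nn
import torchvision

k, d_model, z_dim = 100, 512, 32

backbone = nn.Sequential(*list(torchvision.models.resnet18(weights=None).children())[:-2])
proj_feat   = nn.Conv2d(512, d_model, kernel_size=1)  # per-camera feature projection ("layer 5")
proj_joints = nn.Linear(14, d_model)                   # joint positions -> 512 ("layer 6")
proj_z      = nn.Linear(z_dim, d_model)                # style variable z -> 512 ("layer 7")
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=4, num_decoder_layers=7,
                             batch_first=True)
query_embed = nn.Parameter(torch.randn(k, d_model))    # k decoder queries (text: fixed sinusoidal)
action_head = nn.Linear(d_model, 14)                   # one 14-dimensional action per query

def predict_actions(images, joints, z):
    # images: [B, 4, 3, 480, 640], joints: [B, 14], z: [B, z_dim]
    B = images.shape[0]
    cam_tokens = []
    for cam in range(images.shape[1]):
        feat = proj_feat(backbone(images[:, cam]))          # [B, 512, 15, 20]
        cam_tokens.append(feat.flatten(2).transpose(1, 2))  # [B, 300, 512]
    src = torch.cat(cam_tokens + [proj_joints(joints).unsqueeze(1),
                                  proj_z(z).unsqueeze(1)], dim=1)  # [B, 1202, 512]
    tgt = query_embed.unsqueeze(0).expand(B, -1, -1)        # [B, k, 512]
    hs = transformer(src, tgt)                              # decoder cross-attends to encoder output
    return action_head(hs)                                  # [B, k, 14]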
After collecting data and training the model described above, you can observe that training converges.
A third-person view of the model's inference results is shown below. The robotic arm infers the motions needed to move the colored blocks from point A to point B.
Summary
Cobot Magic is a whole-body teleoperation data collection device developed by AgileX Robotics based on the Mobile Aloha project from Stanford University. With Cobot Magic, AgileX Robotics has successfully run the open-source code from the Stanford laboratory's Mobile Aloha platform in both simulation and real environments.
AgileX will continue to collect data on various motion tasks with Cobot Magic for model training and inference. Please stay tuned for updates on GitHub. If you are interested in the Mobile Aloha project, join us via this Slack link: Slack. Let's talk about our ideas.
About AgileX
Established in 2016, AgileX Robotics is a leading manufacturer of mobile robot platforms and a provider of unmanned system solutions. The company specializes in independently developed multi-modal wheeled and tracked drive-by-wire chassis technology and has obtained multiple international certifications. AgileX Robotics offers users self-developed innovative application solutions such as autonomous driving, mobile grasping, and navigation positioning, helping users in various industries achieve automation. Additionally, AgileX Robotics has introduced research and education software and hardware products related to machine learning, embodied intelligence, and vision algorithms. The company works closely with research and educational institutions to promote robotics teaching and innovation.