Teaching Robots to Do the Dishes: A Data-Driven Approach
  Imagine a future where robots handle everyday chores like dishwashing. No more dishes piling up in the sink. No more having to load and unload the dishwasher.
The key to training robots to perform complex tasks like dishwashing, across environments with different layouts and objects, lies in collecting and using vast amounts of data.
The Universal Manipulation Interface (UMI), research from Stanford and TRI, offers a practical way to collect this crucial data. UMI consists of portable hand-held grippers equipped with wrist-mounted cameras that allow data collection in any environment with minimal setup.
Physical Mirrors Create Virtual Viewpoints: Two side mirrors are mounted on the hand-held gripper, strategically positioned within the GoPro's wide field-of-view. These mirrors reflect the scene, effectively acting like additional virtual cameras positioned at different viewpoints.
Implicit Stereo Vision: This setup allows the single GoPro camera to capture multiple perspectives of the scene simultaneously – the main view and the reflected views from the mirrors. By analyzing the differences in these views, the system can infer depth information, similar to how traditional stereo vision systems with two cameras work.
Digital Reflection for Consistent Orientation: The images from the mirrors are digitally reflected and the left and right mirror images are swapped before being used for policy learning.
This is crucial because objects in mirror reflections appear with reversed orientations compared to the main camera view. Reflecting the mirror images ensures that object orientations are consistent across all viewpoints, preventing confusion for the vision models. Using the mirror images directly without reflection could negatively impact performance.
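The flip-and-swap step can be sketched in a few lines. This is a minimal illustration, not UMI's actual preprocessing code: the frame is a plain nested list of pixel values, and `left_cols` and `right_cols` are hypothetical column ranges standing in for wherever the two mirrors appear in the GoPro image.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image array


def crop_columns(image: Image, cols: Tuple[int, int]) -> Image:
    """Return the sub-image from column cols[0] (inclusive) to cols[1] (exclusive)."""
    start, stop = cols
    return [row[start:stop] for row in image]


def flip_horizontal(image: Image) -> Image:
    """Mirror an image left-to-right, undoing the reversal a physical mirror introduces."""
    return [row[::-1] for row in image]


def prepare_mirror_views(
    frame: Image, left_cols: Tuple[int, int], right_cols: Tuple[int, int]
) -> Tuple[Image, Image]:
    """Crop both mirror regions, flip each one, and swap left and right.

    Returns (new_left, new_right): the flipped right-mirror crop takes the
    left slot and the flipped left-mirror crop takes the right slot.
    """
    left_crop = flip_horizontal(crop_columns(frame, left_cols))
    right_crop = flip_horizontal(crop_columns(frame, right_cols))
    return right_crop, left_crop


# Toy 2x6 frame: columns 0-1 play the left mirror, columns 4-5 the right mirror.
frame = [
    [1, 2, 0, 0, 7, 8],
    [3, 4, 0, 0, 5, 6],
]
new_left, new_right = prepare_mirror_views(frame, left_cols=(0, 2), right_cols=(4, 6))
print(new_left)   # [[8, 7], [6, 5]]
print(new_right)  # [[2, 1], [4, 3]]
```

With a real image array, the same idea would usually be a slice followed by a horizontal flip; the nested-list version here only keeps the example dependency-free.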
Unlike traditional teleoperation methods that rely on expensive hardware and specialized operators, UMI utilizes readily available GoPro cameras and an intuitive, open-sourced design that can be used as an extension of human hands. Its portability enables the collection of in-the-wild data, capturing diverse human demonstrations in realistic settings, where the operator just puppeteers the hands to collect data, for example while washing dishes. UMI's design minimizes the embodiment gap between human demonstration and robot execution by using the same wrist-mounted camera and end-effector setup for both the data collection device and the robotic arms. This allows for direct skill transfer, making the learned policies hardware-agnostic and deployable on various robot platforms.
MimicGen: Amplifying Human Demonstrations
While UMI simplifies data collection, training robots for complex, multi-step tasks still requires a large volume of demonstrations. MimicGen, research from NVIDIA and UT Austin, is a data generation system that addresses this challenge. MimicGen takes a small set of human demonstrations and generates a significantly larger dataset by adapting them to new contexts. It does this by parsing demonstrations into object-centric segments, transforming them based on object positions in a new scene, and generating a new trajectory for the robot to execute.
  
Well, Usman Roshan, a professor from New Jersey, and his CTO at 7XR have been working on just this for the past few months.
The X post from 7xr.tech offers a glimpse of our robotic future, featuring a pair of robotic arms from Elephant Robotics tackling dirty dishes. The system is completely vision-based, with cameras near the end-effectors, and learns from 60 videos of the CTO washing dishes at his girlfriend's apartment.
Autonomous robot dishwasher (ARD1) identifies plates, tap, water, sponge, sponge box, dirty vs clean plate, which plate to pick, as we see in the four views below - all of this helps it to understand what it is doing #robots #robotics #AI #deeplearning #machinelearning… pic.twitter.com/whSFr8Bh03
— Usman Roshan (@Deeplearner2) September 26, 2024
They don't reveal all the details of the training process, but the data was presumably collected by teleoperating the arms to wash dishes, not from the static camera above the sink watching human hands. The video also shows bounding boxes around the dishes in the sink, and Usman mentions in the comments that YOLO object bounding boxes are fed to a transformer architecture for vision processing. The result is presumably passed to the network that outputs the motor control commands, which is possibly trained with a combination of reinforcement learning and imitation learning.
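The post does not say how the boxes are encoded before they reach the transformer. One common pattern, sketched here purely as an assumption, is to turn each detection into a fixed-length feature vector (normalized box geometry, confidence, one-hot class) and pad to a fixed number of tokens. The `Detection` class, `CLASS_NAMES`, and `boxes_to_tokens` are hypothetical names for illustration, not 7XR's code or any YOLO library's API.

```python
from dataclasses import dataclass
from typing import List

CLASS_NAMES = ["plate", "tap", "sponge"]  # hypothetical label set


@dataclass
class Detection:
    label: str
    x_min: float  # pixel coordinates of the box corners
    y_min: float
    x_max: float
    y_max: float
    confidence: float


def boxes_to_tokens(
    detections: List[Detection], width: int, height: int, max_tokens: int
) -> List[List[float]]:
    """Encode detections as [cx, cy, w, h, confidence, one-hot class...] rows.

    Box geometry is normalized by the image size. The most confident
    detections are kept, and the result is zero-padded to max_tokens rows
    so the sequence length is constant.
    """
    token_size = 5 + len(CLASS_NAMES)
    ranked = sorted(detections, key=lambda d: d.confidence, reverse=True)
    tokens = []
    for det in ranked[:max_tokens]:
        one_hot = [1.0 if name == det.label else 0.0 for name in CLASS_NAMES]
        tokens.append([
            (det.x_min + det.x_max) / 2 / width,   # box center x
            (det.y_min + det.y_max) / 2 / height,  # box center y
            (det.x_max - det.x_min) / width,       # box width
            (det.y_max - det.y_min) / height,      # box height
            det.confidence,
        ] + one_hot)
    while len(tokens) < max_tokens:
        tokens.append([0.0] * token_size)
    return tokens


detections = [Detection("plate", 160, 120, 480, 360, 0.9)]
tokens = boxes_to_tokens(detections, width=640, height=480, max_tokens=4)
print(tokens[0])  # [0.5, 0.5, 0.5, 0.5, 0.9, 1.0, 0.0, 0.0]
```

A fixed-size token list like this is convenient because it can be batched directly; whether the real system uses boxes, cropped image features, or both is not stated in the post.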
    
  
Hans Peter Brondmo, the former CEO of Everyday Robots, the now-disbanded Google X unit that worked on household robotics, says in a recent Wired article: "One of the most significant challenges was teaching robots to function in complex, unpredictable environments. While AI systems have made strides in learning from large datasets, robots require even more data to understand and respond to the real world effectively. It may take ‘many thousands, maybe even millions of robots’ collecting data in diverse settings before we reach a point where AI models can perform tasks beyond narrow, well-defined roles."
Image: The UMI data collection device from Stanford/TRI
UMI's cameras use a clever technique to provide stereo vision with a single GoPro camera: side mirrors.
For instance, in a dishwashing task, MimicGen could take a few human demonstrations of washing a plate and break them down into smaller sub-tasks such as grasping the plate, rinsing it, and placing it on the drying rack. It would then use these object-centric segments to synthesize new demonstrations in novel contexts. Given a new scene with different object placements, MimicGen identifies the relevant segments from the source demonstrations and transforms them to create a new trajectory for the robot to follow. This keeps the generated demonstrations physically consistent while reflecting real-world variations in object placement and task execution.
This process allows MimicGen to act as a data amplifier, generating thousands of demonstrations from a limited set of human demonstrations.
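The core idea of adapting an object-centric segment to a new scene can be sketched as follows. This is a simplified planar version under stated assumptions, not MimicGen's implementation: poses are 2D `(x, y, heading)` tuples rather than full 3D poses, and `to_object_frame`, `to_world_frame`, and `adapt_segment` are hypothetical helper names. Each end-effector pose is expressed relative to the object in the source demonstration, then re-applied to the object's pose in the new scene.

```python
import math
from typing import List, Tuple

Pose2D = Tuple[float, float, float]  # (x, y, heading in radians)


def to_object_frame(ee: Pose2D, obj: Pose2D) -> Pose2D:
    """Express an end-effector pose relative to an object's pose."""
    dx, dy = ee[0] - obj[0], ee[1] - obj[1]
    c, s = math.cos(obj[2]), math.sin(obj[2])
    # Apply the inverse of the object's rotation to the offset.
    return (c * dx + s * dy, -s * dx + c * dy, ee[2] - obj[2])


def to_world_frame(rel: Pose2D, obj: Pose2D) -> Pose2D:
    """Place an object-relative pose back into the world, given the object's pose."""
    c, s = math.cos(obj[2]), math.sin(obj[2])
    return (
        obj[0] + c * rel[0] - s * rel[1],
        obj[1] + s * rel[0] + c * rel[1],
        rel[2] + obj[2],
    )


def adapt_segment(segment: List[Pose2D], src_obj: Pose2D, new_obj: Pose2D) -> List[Pose2D]:
    """Re-target a demonstrated segment so it keeps the same pose relative to the object."""
    return [to_world_frame(to_object_frame(p, src_obj), new_obj) for p in segment]


# One demonstrated "approach the plate" segment, recorded with the plate at the origin.
src_plate = (0.0, 0.0, 0.0)
segment = [(0.0, 0.2, 0.0), (0.0, 0.05, 0.0)]

# Amplify: reuse the same segment for several new plate placements.
new_plate_poses = [
    (0.0, 0.0, 0.0),
    (1.0, 2.0, math.pi / 2),
    (-0.5, 0.3, math.pi),
]
generated = [adapt_segment(segment, src_plate, pose) for pose in new_plate_poses]
for trajectory in generated:
    print([tuple(round(v, 3) for v in pose) for pose in trajectory])
```

One source segment yields one new trajectory per plate placement, which is the amplification effect in miniature. A full pipeline would also need to stitch segments together and check that each generated trajectory actually completes the task.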
MimicGen proves particularly useful for tasks involving diverse scene configurations, object instances, and robot arms, enabling the training of proficient robot agents through imitation learning. This data-driven approach offers a potentially scalable and efficient pathway to training robots for real-world applications.
From loading dishwashers and folding laundry to organizing cluttered rooms, these AI-powered robotic assistants are poised to liberate us from the mundane, ushering in an era where technology truly enhances our daily lives. The future of home automation isn't just knocking at our door—it's ready to roll up its sleeves and get to work.