Get ready for a game-changer in the world of robotics! Xiaomi, the tech giant, is stepping into the robotics arena with a bold move. Introducing Xiaomi-Robotics-0, a groundbreaking robot model that's set to revolutionize the industry.
Xiaomi, known for its smartphones and smart home devices, is now aiming to make its mark in robotics research. And they're not holding back! With an open-source vision-language-action (VLA) model boasting an impressive 4.7 billion parameters, Xiaomi-Robotics-0 is designed to bring together visual understanding, language comprehension, and real-time action execution - the essence of what Xiaomi calls 'physical intelligence'.
But here's where it gets interesting... Xiaomi claims the model already sets records in both simulation benchmarks and real-world tests. How did they pull that off? Let's dive deeper.
At its core, robotics models like Xiaomi-Robotics-0 tackle a closed loop of perception, decision-making, and execution. In simple terms, a robot needs to 'see', 'understand', 'plan', and then 'act'. Xiaomi's model is specially crafted to balance a broad understanding of its environment with precise motor control. And this is the part most people miss - achieving this balance is a significant challenge in robotics.
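To make the see-understand-plan-act loop concrete, here is a minimal sketch of a closed perception-decision-execution cycle. Everything here is illustrative: the class, the 1-D position, and the proportional policy are toy stand-ins, not Xiaomi's actual architecture or API.

```python
# Toy closed loop: perceive -> decide -> act, repeated until the goal is reached.
# A real VLA model replaces decide() with a learned policy over camera frames
# and language instructions; here it is just a proportional correction.

class ToyVLARobot:
    """Minimal robot that 'sees' its state, 'plans' a correction, and 'acts'."""

    def __init__(self, target: float):
        self.target = target     # goal position (1-D for simplicity)
        self.position = 0.0      # current end-effector position

    def perceive(self) -> float:
        # Stand-in for cameras + instruction parsing.
        return self.position

    def decide(self, observation: float) -> float:
        # Stand-in for the learned policy: move halfway toward the goal.
        return 0.5 * (self.target - observation)

    def act(self, action: float) -> None:
        self.position += action

def run_closed_loop(robot: ToyVLARobot, steps: int = 20) -> float:
    for _ in range(steps):
        obs = robot.perceive()
        action = robot.decide(obs)
        robot.act(action)
    return robot.position

robot = ToyVLARobot(target=1.0)
final = run_closed_loop(robot)
```

After 20 iterations the residual error has halved 20 times, so the robot converges on the target: the point is that perception, decision, and execution keep feeding each other in a loop, which is exactly the cycle a VLA model has to close at scale.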
The Xiaomi-Robotics-0 Model: A Two-Pronged Approach
The model utilizes a Mixture-of-Transformers (MoT) architecture, dividing responsibilities between two key components:
Visual Language Model (VLM): Acting as the 'brain', the VLM is trained to interpret human instructions, even vague ones like 'fold the towel'. It understands spatial relationships from high-res visual input, handling object detection, visual question answering, and logical reasoning.
Action Expert: Built around a multi-layer Diffusion Transformer (DiT), the Action Expert generates 'Action Chunks' - short sequences of movements - using flow-matching techniques for accuracy and smoothness. Unlike models that predict one action at a time, it outputs an entire chunk at once, which yields smoother and more versatile motion.
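The two-component split above can be sketched as a simple interface: a "VLM" that turns an instruction plus visual features into a context, and an "action expert" that turns that context into a chunk of low-level actions. All names, shapes, and the tiny noise term are assumptions for illustration; the real model is a 4.7-billion-parameter Mixture-of-Transformers, not these toy functions.

```python
import random

def vlm_encode(instruction: str, image_features: list) -> list:
    """Toy stand-in for the VLM 'brain': mixes instruction and visual features."""
    bias = float(len(instruction) % 7) / 10.0
    return [f + bias for f in image_features]

def action_expert(context: list, chunk_size: int = 8) -> list:
    """Toy stand-in for the DiT: emits a whole *chunk* of actions, not one."""
    rng = random.Random(0)  # deterministic for the example
    chunk = []
    for t in range(chunk_size):
        # Each action is nudged by the context; the small noise mimics
        # the stochastic sampler a real diffusion/flow model would use.
        action = [c * (t + 1) / chunk_size + rng.gauss(0, 0.01) for c in context]
        chunk.append(action)
    return chunk

ctx = vlm_encode("fold the towel", [0.2, -0.1, 0.4])
actions = action_expert(ctx)
```

The design point is the interface: the VLM owns understanding, the action expert owns motion, and only a compact context passes between them.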
One common issue with VLA models is the trade-off between understanding and physical action capabilities. Xiaomi claims to have overcome this by co-training the model on multimodal and action data. The result? A system that can reason about the world and learn to move within it simultaneously.
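Co-training, at its simplest, means every optimization step mixes both kinds of data and sums both objectives, so neither capability is traded away. The sketch below assumes squared-error losses and an equal weighting purely for illustration; Xiaomi's actual loss functions and mixing ratio are not described in this article.

```python
# Minimal co-training objective: a multimodal (VQA-style) loss plus an
# action-prediction loss, combined in one scalar the optimizer would minimize.

def vqa_loss(prediction: float, target: float) -> float:
    return (prediction - target) ** 2

def action_loss(pred_chunk: list, target_chunk: list) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred_chunk, target_chunk)) / len(pred_chunk)

def cotraining_loss(vqa_batch, action_batch, action_weight: float = 1.0) -> float:
    """Combine both objectives so reasoning and motor skills train together."""
    l_vqa = sum(vqa_loss(p, t) for p, t in vqa_batch) / len(vqa_batch)
    l_act = sum(action_loss(p, t) for p, t in action_batch) / len(action_batch)
    return l_vqa + action_weight * l_act

vqa_batch = [(0.9, 1.0), (0.2, 0.0)]              # (prediction, answer) pairs
action_batch = [([0.1, 0.2], [0.0, 0.2])]         # (predicted, demonstrated) chunks
total = cotraining_loss(vqa_batch, action_batch)
```

Because gradients from both terms flow through shared parameters, the model cannot quietly sacrifice visual-language understanding to get better at movement, which is exactly the trade-off Xiaomi says it avoided.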
Training and Optimization: A Step-by-Step Process
The training process is a multi-stage affair. First, an 'Action Proposal' mechanism forces the VLM to predict action distributions while interpreting images, aligning its internal representations with actions. Then the VLM is frozen, and the DiT is trained separately to generate accurate action sequences from noise, conditioning on the frozen VLM's key-value features rather than on raw language tokens.
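The "generate actions from noise" step follows the standard flow-matching recipe: interpolate between a noise sample and the ground-truth action, and regress the constant velocity of that path. The 1-D toy below shows that recipe and why integrating the learned velocity field recovers the action; the conditioning on frozen VLM features is omitted, and the ideal velocity field used at the end is an assumption for the demo, not a trained model.

```python
import random

rng = random.Random(42)

def flow_matching_pair(action: float):
    """One training sample: (t, interpolated point x_t, target velocity)."""
    noise = rng.gauss(0.0, 1.0)
    t = rng.random()
    x_t = (1.0 - t) * noise + t * action   # point on the straight noise-to-action path
    velocity = action - noise              # constant velocity along that path
    return t, x_t, velocity

def euler_sample(velocity_model, steps: int = 100) -> float:
    """Generate an action from pure noise by integrating the velocity field."""
    x = rng.gauss(0.0, 1.0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x += dt * velocity_model(x, t)
    return x

# With a single target action a, the ideal field is v(x, t) = (a - x) / (1 - t);
# integrating it carries any starting noise sample onto the action.
target_action = 0.7
ideal_velocity = lambda x, t: (target_action - x) / (1.0 - t)
sampled = euler_sample(ideal_velocity)
```

Training regresses `velocity` from `(x_t, t)` plus conditioning features; inference then runs the integration loop, which is why a DiT trained this way can turn noise into a coherent action chunk.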
Xiaomi also addressed the practical issue of inference latency, which can lead to awkward pauses or unstable behavior due to delays between model predictions and physical movement. They implemented asynchronous inference, decoupling model computation from robot operation, ensuring continuous movement even if the model takes extra time to 'think'.
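Decoupling computation from execution is a classic producer-consumer pattern: the model thread fills a queue with action chunks while the control loop drains it, so slow "thinking" never stalls motion. The sketch below is a generic illustration of that pattern with made-up timings, not Xiaomi's implementation.

```python
import threading
import queue
import time

action_queue = queue.Queue()
executed = []

def model_worker(n_chunks: int, chunk_size: int) -> None:
    """Simulates slow model inference that produces action chunks."""
    for c in range(n_chunks):
        time.sleep(0.01)                     # pretend inference latency
        for a in range(chunk_size):
            action_queue.put(c * chunk_size + a)
    action_queue.put(None)                   # sentinel: no more chunks coming

def control_loop() -> None:
    """Executes actions as they arrive; never waits for a full recompute."""
    while True:
        action = action_queue.get()
        if action is None:
            break
        executed.append(action)              # stand-in for sending motor commands

worker = threading.Thread(target=model_worker, args=(3, 4))
worker.start()
control_loop()
worker.join()
```

In a real controller the consumer runs at a fixed rate and the queue holds enough buffered actions to cover the model's worst-case latency; that buffer is what keeps the arm moving while the next chunk is still being computed.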
To improve stability, Xiaomi uses a 'Clean Action Prefix' technique, feeding back the previously predicted action to ensure smooth, jitter-free motion. Additionally, a Λ-shaped attention mask biases the model towards current visual input, making the robot more responsive to sudden environmental changes.
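One common reading of a Λ-shaped mask is: every query position may attend to a small global prefix (here standing in for the current visual tokens) plus a sliding window of recent positions, while older middle tokens are masked out. That interpretation, and the sizes below, are assumptions for illustration; the article does not specify Xiaomi's exact mask layout.

```python
def lambda_mask(seq_len: int, prefix: int, window: int) -> list:
    """mask[q][k] is True when query q may attend to key k (causal)."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(q + 1):                 # causal: never attend to the future
            if k < prefix or q - k < window:   # global prefix OR recent window
                mask[q][k] = True
    return mask

m = lambda_mask(seq_len=8, prefix=2, window=3)
```

The effect is the bias the article describes: stale mid-sequence history is dropped, so attention weight concentrates on the always-visible visual prefix and the most recent steps, making the policy quicker to react when the scene changes.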
Benchmarks and Real-World Performance
In benchmark testing, Xiaomi-Robotics-0 reportedly achieved state-of-the-art results in LIBERO, CALVIN, and SimplerEnv simulations, outperforming around 30 other models. But the real test is in the real world.
Xiaomi deployed the model on a dual-arm robot platform for experiments. In long-horizon tasks like folding towels and disassembling building blocks, the robot demonstrated impressive hand-eye coordination, handling both rigid and flexible objects seamlessly. Unlike previous VLA systems that sacrificed multimodal reasoning for action training, Robotics-0 retains strong visual and language capabilities, especially in tasks blending perception and physical interaction.
So, what do you think? Is Xiaomi's Robotics-0 a game-changer or just another robot model? Share your thoughts in the comments! We'd love to hear your opinions on this exciting development in the world of robotics.