A new open-source release, Wall-OSS-0.5, takes a different approach to evaluating robot foundation models. Developed by X Square Robot, this 4B vision-language-action (VLA) model pairs a 3B VLM backbone with action-generation components. What sets it apart is its focus on measuring what the pretrained checkpoint can do before any task-specific fine-tuning.
Evaluated without fine-tuning on a 17-task real-robot suite, the model exceeded 80% progress on four tasks: block sorting, fruit sorting, ring stacking, and rope tightening. The result matters because it shifts attention from fine-tuned downstream scores to the abilities the pretrained model already has.
Wall-OSS-0.5 is trained with a combination of action-token cross entropy, multimodal cross entropy, and flow matching, which gives the VLM backbone a stronger learning signal. The combination addresses the problem that continuous action losses alone are not sufficient for execution.
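To make the idea of a combined objective concrete, here is a minimal sketch in plain Python. It is not the Wall-OSS-0.5 implementation: the function names, the equal default weights, and the linear-interpolation form of flow matching (noisy sample x_t = (1 - t) * noise + t * action, target velocity = action - noise) are all assumptions chosen for illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross entropy for one token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_cross_entropy(logits_seq, targets):
    """Average token-level cross entropy over a sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_seq, targets)]
    return sum(losses) / len(losses)


def flow_matching_pair(action, noise, t):
    """Assumed linear path between noise and the continuous action.

    Returns the interpolated sample x_t and the target velocity
    that a velocity-prediction head would be trained to regress.
    """
    x_t = [(1 - t) * n + t * a for a, n in zip(action, noise)]
    v_target = [a - n for a, n in zip(action, noise)]
    return x_t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


def combined_loss(action_logits, action_targets,
                  mm_logits, mm_targets,
                  v_pred, v_target,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_act, w_mm, w_fm = weights
    return (w_act * mean_cross_entropy(action_logits, action_targets)
            + w_mm * mean_cross_entropy(mm_logits, mm_targets)
            + w_fm * flow_matching_loss(v_pred, v_target))
```

In this sketch the two cross-entropy terms act on discrete tokens (action tokens and multimodal tokens), while the flow-matching term acts on the continuous action vector, so the total loss carries both discrete and continuous supervision. How the actual model weights or schedules these terms is not stated in this article.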
Zero-shot performance does not yet solve the most complex manipulation tasks, such as towel folding and charger insertion, but the direction is a promising one for embodied AI. An open question is whether real-robot zero-shot suites will become a standard for evaluating robot foundation models.
For more information, visit the code repository, paper, project page, or the Hugging Face org.
