Meta Dives Deep into AI Perception and Collaboration with Five New Projects

Meta’s FAIR division has unveiled a suite of five AI projects aimed at replicating human-level perception and enabling advanced collaboration. The initiatives span vision encoding, language modeling, robotics, and collaborative agents, all designed to create AI systems capable of processing sensory input and making intelligent decisions.

The centerpiece is the Perception Encoder, a large-scale vision encoder engineered for strong performance across image and video tasks. Acting as the ‘eyes’ of an AI system, the encoder interprets visual data, bridges vision and language, and handles imagery captured in difficult conditions. Meta claims the model surpasses existing open-source and proprietary solutions in zero-shot image and video classification and retrieval, and that it transfers well to language tasks when aligned with a large language model.
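To make the zero-shot claim concrete, here is a minimal sketch of how zero-shot classification with a paired vision and text encoder typically works. The `encode_image` and `encode_text` functions are hypothetical stand-ins, not Meta's published API; they return deterministic random embeddings purely so the example runs on its own.

```python
import numpy as np

def encode_image(image_path: str) -> np.ndarray:
    # Hypothetical stand-in for a vision-encoder forward pass;
    # returns a unit-normalised embedding vector.
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def encode_text(text: str) -> np.ndarray:
    # Hypothetical stand-in for the aligned text encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def zero_shot_classify(image_path: str, labels: list[str]) -> str:
    # Zero-shot classification: embed the image once, embed a text prompt
    # per candidate label, and pick the label whose embedding has the
    # highest cosine similarity with the image embedding.
    image_emb = encode_image(image_path)
    prompts = [f"a photo of a {label}" for label in labels]
    text_embs = np.stack([encode_text(p) for p in prompts])
    scores = text_embs @ image_emb  # cosine similarity, since vectors are unit length
    return labels[int(np.argmax(scores))]

print(zero_shot_classify("example.jpg", ["cat", "dog", "bicycle"]))
```

No label-specific training is involved: swapping in a new set of candidate labels only changes the text prompts, which is what makes the classification "zero-shot".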

The Perception Language Model (PLM) is an open and reproducible vision-language model focused on complex visual recognition tasks. It was trained on large-scale synthetic data combined with open vision-language datasets, without knowledge distillation from external proprietary models. A new dataset of 2.5 million human-labelled samples targets fine-grained video question answering and spatio-temporal captioning. PLM is available in 1, 3, and 8 billion parameter versions.
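As a rough illustration of what a fine-grained, temporally grounded video question-answering sample might contain, consider the sketch below. The field names are illustrative only and are not the published schema of Meta's dataset.

```python
from dataclasses import dataclass

@dataclass
class VideoQASample:
    # Illustrative fields only; not the actual schema of Meta's dataset.
    video_id: str
    question: str
    answer: str
    start_sec: float  # temporal grounding: when the queried event begins
    end_sec: float    # ...and when it ends

sample = VideoQASample(
    video_id="kitchen_042",
    question="What does the person do immediately after opening the fridge?",
    answer="They take out a carton of milk.",
    start_sec=12.5,
    end_sec=16.0,
)
print(sample)
```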

Meta Locate 3D bridges language commands with physical action. This model empowers robots to pinpoint objects in a 3D environment using natural language queries. Processing 3D point clouds directly from RGB-D sensors, it combines 2D features with a pretrained 3D-JEPA encoder and the Locate 3D decoder to output bounding boxes and masks for specified objects. A new dataset for object localization includes 130,000 language annotations across 1,346 scenes from the ARKitScenes, ScanNet, and ScanNet++ datasets.
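A simplified sketch of that pipeline is shown below: a point cloud goes in, per-point 2D features are attached, an encoder produces contextual embeddings, and a decoder scores points against the language query to produce a box and mask. The functions `lift_2d_features`, `jepa_encode`, and `decode_boxes` are hypothetical stand-ins, not Meta's released code.

```python
import numpy as np

def lift_2d_features(points: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in: attach per-point 2D image features
    # (e.g. lifted from the RGB frames) to the raw XYZ coordinates.
    rgb_features = np.random.rand(points.shape[0], 32)
    return np.concatenate([points, rgb_features], axis=1)

def jepa_encode(featurised_points: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the pretrained 3D-JEPA encoder:
    # produces one contextual embedding per point.
    return featurised_points.mean(axis=1, keepdims=True) * np.ones((1, 64))

def decode_boxes(embeddings: np.ndarray, points: np.ndarray, query: str) -> dict:
    # Hypothetical stand-in for the Locate 3D decoder: scores each point
    # against the language query, keeps the highest-scoring points, and
    # returns their axis-aligned 3D bounding box plus a point mask.
    scores = embeddings @ np.random.rand(embeddings.shape[1])  # placeholder relevance scores
    mask = scores > np.quantile(scores, 0.99)
    selected = points[mask]
    return {
        "query": query,
        "box_min": selected.min(axis=0),
        "box_max": selected.max(axis=0),
        "mask": mask,
    }

# RGB-D sensors yield a 3D point cloud (N points, XYZ coordinates).
point_cloud = np.random.rand(10_000, 3)
embeddings = jepa_encode(lift_2d_features(point_cloud))
print(decode_boxes(embeddings, point_cloud, "the red mug on the kitchen counter"))
```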

Meta has also released the model weights for its 8-billion parameter Dynamic Byte Latent Transformer, further advancing language modeling. This byte-level architecture improves inference efficiency and robustness compared with tokeniser-based models; Meta reports that the Dynamic Byte Latent Transformer “outperforms tokeniser-based models across various tasks, with an average robustness advantage of +7 points.”
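The core idea behind byte-level modeling can be shown in a few lines: instead of a learned tokeniser vocabulary, the model consumes raw UTF-8 bytes, so typos and unusual strings never fall outside the vocabulary. The snippet below illustrates only the input representation, not Meta's architecture.

```python
def to_byte_ids(text: str) -> list[int]:
    # Byte-level "tokenisation": every string maps to a sequence of
    # values in 0..255, so there is no out-of-vocabulary failure mode.
    return list(text.encode("utf-8"))

clean = "transformer"
noisy = "trnasformer"  # a typo that a subword tokeniser may split very differently

print(to_byte_ids(clean))  # e.g. [116, 114, 97, 110, 115, ...]
print(to_byte_ids(noisy))  # differs from the clean sequence only at the two swapped positions
```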

Finally, the Collaborative Reasoner targets AI agents that can collaborate effectively with humans and with other AI agents, offering a framework for assessing and enhancing these skills. It features goal-oriented tasks that require multi-step reasoning carried out as a conversation between two agents. Evaluations revealed that current models struggle to consistently leverage collaboration for better outcomes, so Meta proposes a self-improvement technique that uses synthetic interaction data in which an LLM agent collaborates with itself.
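A minimal sketch of the self-collaboration idea: one model alternates between two conversational roles, and the resulting dialogues can serve as synthetic collaboration data. The `generate` function here is a hypothetical stand-in for any LLM call and is not part of Meta's framework.

```python
def generate(role: str, history: list[str]) -> str:
    # Hypothetical stand-in for an LLM call; a real implementation would
    # prompt the same model with the conversation so far and the role name.
    return f"[{role}] turn {len(history)}: proposes or critiques a reasoning step"

def self_collaborate(task: str, turns: int = 4) -> list[str]:
    # One model plays both agents, alternating roles each turn, so the
    # resulting conversation can be logged as synthetic interaction data.
    history = [f"[task] {task}"]
    roles = ["solver", "critic"]
    for t in range(turns):
        history.append(generate(roles[t % 2], history))
    return history

for line in self_collaborate("Plan a three-step proof of the triangle inequality"):
    print(line)
```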

Photo by Anna Shvets on Pexels