Microsoft Announces Magma Foundation Model That Can Complete Multimodal Agentic Tasks

Microsoft researchers introduced a new foundation model on Wednesday that can perform agentic functions. Dubbed Magma, the artificial intelligence (AI) model is pre-trained on a large number of datasets spanning text, images, videos, and spatial formats. The Redmond-based tech giant said that Magma is an extension of vision-language (VL) models and that it can not only understand multimodal information but also plan and act on it. The agentic model can be used for a range of tasks, including computer vision, user interface (UI) navigation, and robotic manipulation.

Microsoft Announces Magma Foundation Model

In a GitHub post, Microsoft researchers detailed the new Magma foundation model. Foundation models are large AI models that are built from scratch rather than distilled from another model, and they often become the baseline for other models in a series. Magma is unique in the sense that it is pre-trained on a wide range of datasets.

The researchers stated that the base architecture behind Magma is the Llama 3 AI model. However, Magma is also equipped with the ability to plan and act in the visual-spatial world. This allows the model not only to generate outputs like a chatbot but also to execute actions.

It can be used as a computer vision chatbot that offers information about the world it views when paired with camera sensors. Magma can also be used to control the UI of a device. More interestingly, it can also control robots to complete complex tasks using its agentic capabilities.

The researchers said a major reason behind these capabilities is the diverse dataset, along with two technical components: Set-of-Mark and Trace-of-Mark. The former enables action grounding in images, videos, and spatial data by having the model predict numeric marks for buttons or robot arms in image space. The latter feeds the model temporal video dynamics and makes it predict the next frames before it takes action. This allows the model to develop a strong spatial understanding.
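The idea of grounding an action through numeric marks can be illustrated with a small sketch. It is not Microsoft's implementation: the `Box` type, the `assign_marks` and `resolve_mark` helpers, and the example coordinates are all hypothetical, and the model's prediction is replaced by a hard-coded number. The sketch only shows how a predicted mark could be mapped back to a point in image space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box in pixel coordinates."""
    label: str
    x0: int
    y0: int
    x1: int
    y1: int

    def center(self) -> tuple[int, int]:
        # Integer midpoint of the box, used here as the action target.
        return ((self.x0 + self.x1) // 2, (self.y0 + self.y1) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Give each candidate element a numeric mark, starting at 1."""
    return {i: box for i, box in enumerate(boxes, start=1)}


def resolve_mark(marks: dict[int, Box], predicted_mark: int) -> tuple[int, int]:
    """Map a predicted mark back to a point in image space (the box center)."""
    if predicted_mark not in marks:
        raise KeyError(f"unknown mark {predicted_mark}")
    return marks[predicted_mark].center()


# Made-up UI elements; a real system would detect these from a screenshot.
boxes = [
    Box("Search", 10, 10, 110, 40),
    Box("Submit", 200, 300, 280, 340),
]
marks = assign_marks(boxes)

# Stand-in for the model's output: it "chooses" mark 2 (the Submit button).
click_point = resolve_mark(marks, predicted_mark=2)
print(click_point)  # (240, 320)
```

Predicting a small integer instead of raw pixel coordinates turns grounding into a choice among labeled candidates, which is the intuition behind the description above.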

Microsoft researchers also shared the AI model's benchmark scores based on internal testing. It achieved competitive scores across all the agentic evaluation tests, outperforming models by OpenAI, Alibaba, and Google. The company has not released Magma publicly as of now.

February 21, 2025

beiboa.com
