The Allen Institute for AI (Ai2) today announced Molmo, an open-source family of cutting-edge multimodal AI models that outperform top proprietary rivals, including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5, on several third-party benchmarks.
That means the models can accept and analyze images uploaded by users, just like the leading proprietary models.
But Ai2 also pointed out in a post on X that Molmo uses “1000x less data” than its proprietary competitors, thanks to some clever new training techniques described in a technical report paper published by the Paul Allen-founded, Ali Farhadi-led company, which we’ll explain in more detail below.
Ai2 says the release underscores its commitment to open research by making high-performance models with open weights and data available to the broader community and, of course, to companies looking for a solution that they can fully own, control and customize.
This follows another open model from Ai2, OLMoE, released two weeks ago, which is a “mixture of experts,” or combination of smaller models, designed for cost efficiency.
Bridging the gap between open and proprietary AI
Molmo consists of four main models that differ in parameter size and functionality.
- Molmo-72B (72 billion parameters, or settings; the flagship model, based on Alibaba Cloud’s open-source Qwen2-72B model)
- Molmo-7B-D (a “demo model” based on Alibaba’s Qwen2-7B model)
- Molmo-7B-O (based on Ai2’s OLMo-7B model)
- MolmoE-1B (based on Ai2’s OLMoE-1B-7B mixture-of-experts LLM, which Ai2 says “nearly matches the performance of GPT-4V on both academic benchmarks and user preferences”)
These models have achieved high performance in a wide range of third-party benchmarks, outperforming many proprietary alternatives, and they are all available under the permissive Apache 2.0 license, making them suitable for virtually any research or commercial (including enterprise-grade) use.
Notably, Molmo-72B leads the academic evaluations, achieving the highest score on 11 key benchmarks and ranking second only to GPT-4o on user preference.
Vaibhav Srivastav, a machine learning developer advocate engineer at AI code repository company Hugging Face, commented on the release on X, highlighting that Molmo offers a powerful alternative to closed systems and sets a new standard for open multimodal AI.
In addition, Google DeepMind robotics researcher Ted Xiao took to X to praise Molmo’s inclusion of pointing data, which he believes could be a game-changer for visual grounding in robotics.
This capability allows Molmo to provide visual grounding and interact more effectively with physical environments, something most other multimodal models currently lack.
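For a sense of how an application might consume that pointing output, here is a small, hypothetical Python sketch; the tag format and percentage-based coordinates below are assumptions made purely for illustration, so consult Molmo’s demo and model cards for the actual output schema.

```python
# Hypothetical example of parsing point annotations from a Molmo-style response.
# The <point x="..." y="..."> tag format and 0-100 percentage coordinates are
# assumptions for illustration, not a documented guarantee of the model's output.
import re

def parse_points(model_output: str, width: int, height: int):
    """Convert point tags like <point x="61.5" y="40.2">dog</point>
    (coordinates assumed to be percentages) into pixel coordinates."""
    pattern = r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>(.*?)</point>'
    return [
        {"label": label, "x": float(x) / 100 * width, "y": float(y) / 100 * height}
        for x, y, label in re.findall(pattern, model_output)
    ]

print(parse_points('<point x="61.5" y="40.2">dog</point>', 640, 480))
# -> [{'label': 'dog', 'x': 393.6, 'y': 192.96}]
```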
Not only are these models highly performant, they are also fully open, allowing researchers and developers to access and build on cutting-edge technology.
Advanced model architectures and training approaches
Molmo’s architecture is designed to maximize efficiency and performance. All models use OpenAI’s ViT-L/14 336px CLIP model as a vision encoder to process multi-scale, multi-crop images into vision tokens.
These tokens are then projected into the input space of the language model via multi-layer perceptron (MLP) connectors and pooled for dimensionality reduction.
The language model components are decoder-only transformers with a variety of options ranging from the OLMo series to the Qwen2 series to the Mistral series, each offering different capacity and levels of openness.
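To make that pipeline concrete, here is a minimal PyTorch sketch of the connector stage described above, using placeholder dimensions (1024-dimensional CLIP features, a 4096-dimensional language model, simple average pooling). It illustrates the general design, not Ai2’s actual code.

```python
# Illustrative sketch of a vision-to-language connector in the style described
# above. Dimensions and the pooling operator are placeholders; the real LM hidden
# size depends on which decoder (OLMo, Qwen2, Mistral) is used.
import torch
import torch.nn as nn

class VisionToLMConnector(nn.Module):
    """Projects CLIP vision tokens into the language model's input space
    via an MLP, after pooling neighboring tokens to reduce sequence length."""
    def __init__(self, vision_dim=1024, lm_dim=4096, pool=2):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)  # simple stand-in for pooling
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_tokens):  # (batch, num_tokens, vision_dim)
        x = self.pool(vision_tokens.transpose(1, 2)).transpose(1, 2)  # halve token count
        return self.mlp(x)  # (batch, num_tokens // pool, lm_dim)

# The projected vision tokens would then be placed alongside the text token
# embeddings and fed to a decoder-only transformer.
connector = VisionToLMConnector()
vision_tokens = torch.randn(1, 576, 1024)  # e.g., tokens from one 336px crop
lm_ready = connector(vision_tokens)        # shape: (1, 288, 4096)
```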
Molmo’s training strategy involves two main stages:
- Multimodal Pre-Training: In this stage, the model is trained to generate captions using newly collected, detailed image descriptions provided by human annotators. This high-quality dataset, called PixMo, is a key factor in Molmo’s outstanding performance.
- Supervised fine-tuning: The models are then fine-tuned on a diverse combination of datasets, including standard academic benchmarks and newly created datasets that enable the models to handle complex real-world tasks such as document reading, visual reasoning, and even pointing.
Unlike many contemporary models, Molmo does not rely on reinforcement learning from human feedback (RLHF), instead using a carefully tuned training pipeline that updates all model parameters starting from their pre-trained state.
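As a rough illustration of what such a two-stage, full-parameter pipeline could look like, the sketch below uses placeholder data loaders and learning rates; it is not Ai2’s released training code.

```python
# Minimal sketch of the two-stage recipe described above, with every parameter
# trainable in both stages and no RLHF step. The model object, loaders, and
# learning rates are placeholders for illustration.
import torch

def train_stage(model, dataloader, lr, epochs=1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # all parameters are updated
    model.train()
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss  # standard next-token loss on captions/answers
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: multimodal pre-training on dense, human-written captions (PixMo), e.g.:
#   train_stage(model, pixmo_caption_loader, lr=1e-5)
# Stage 2: supervised fine-tuning on a mixture of academic benchmarks and new
# datasets for document reading, visual reasoning, and pointing, e.g.:
#   train_stage(model, sft_mixture_loader, lr=5e-6)
```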
Outperforming Major Benchmarks
The Molmo models show impressive results across multiple benchmarks, especially when compared against proprietary models.
For example, Molmo-72B scores 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories. Additionally, it scores 97.4 on AI2D (Ai2’s own benchmark, short for “A Diagram Is Worth A Dozen Images,” a dataset of more than 5,000 grade school science diagrams and over 150,000 rich annotations).
The model also excels on visual grounding tasks, with Molmo-72B achieving the highest performance on RealWorldQA, making it particularly promising for applications in robotics and complex multi-modal reasoning.
Open Access and Future Releases
Ai2 has made these models and datasets available on its Hugging Face space, and they are fully compatible with popular AI frameworks such as Transformers.
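As an example, a Molmo checkpoint can in principle be loaded through Transformers’ trust_remote_code mechanism. The sketch below assumes the allenai/Molmo-7B-D-0924 repository id and the custom processor.process and generate_from_batch calls shown on its model card, so verify against the current card before running.

```python
# Hedged sketch of loading a Molmo checkpoint via Hugging Face Transformers.
# The repo id and the custom `processor.process` / `generate_from_batch` methods
# come from remote code referenced on the model card and may change over time.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name; check the Hub page
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Download an example image and ask the model to describe it.
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
print(processor.tokenizer.decode(output[0, inputs["input_ids"].size(1):], skip_special_tokens=True))
```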
This open access is part of Ai2’s broader vision to foster innovation and collaboration in the AI community.
Over the coming months, Ai2 plans to release additional models, training code, and expanded technical reports, further enriching the resources available to researchers.
If you’re interested in exploring Molmo’s capabilities, a public demo and several model checkpoints are available via Molmo’s official page.