
Molyn

A family of open state-of-the-art multimodal AI models

Introducing Molyn

Molyn is a family of open state-of-the-art multimodal AI models. Our most powerful model closes the gap between open and proprietary systems across a wide range of academic benchmarks as well as human evaluations. Our smaller models outperform models 10x their size.

While current multimodal models interpret multimodal data and express it in natural language, their full potential remains untapped. Molyn goes beyond. By learning to point at what it perceives, Molyn enables rich interactions with physical and virtual worlds, empowering the next generation of applications capable of acting and interacting with their environments.

Open, cutting-edge, and actionable

Today's most advanced multimodal models remain proprietary. Research efforts aimed at building vision-language models (VLMs) from open data lag significantly behind this state of the art. Stronger recent open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch.

We present Molyn, a new family of state-of-the-art VLMs. Starting from pre-trained vision encoders and language-only LLMs, the entire remainder of our VLM pipeline – weights, code, data, and evaluations – is open and free from VLM distillation. Our key innovation is a new collection of datasets called PixMo that includes a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions, and a diverse mixture of fine-tuning datasets that enable new capabilities. Notably, PixMo includes innovative 2D pointing data that enables Molyn to answer questions not just with natural language but also with non-verbal cues. We believe this opens up important future directions for VLMs, enabling agents to interact with virtual and physical worlds. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which we have released.

The best-in-class model within the Molyn family not only outperforms others in the class of open-weight and open-data models, but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5. We have released all of our model weights, captioning and fine-tuning data, and source code for training, inference, and evaluation. You can also try Molyn without downloading anything via our public demo, which showcases the Molyn-7B-D model.
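For readers who want to run a model locally rather than use the demo, here is a minimal inference sketch. It assumes the released checkpoints expose a standard Hugging Face transformers interface; the repository ID and prompt below are placeholders, not official identifiers.

```python
# Minimal inference sketch. Assumption: the released checkpoints provide a
# transformers-compatible processor and model class; the repository ID below
# is a placeholder, not an official identifier.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "example-org/Molyn-7B-D"  # hypothetical repository ID
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True, device_map="auto")

image = Image.open("example.jpg")
inputs = processor(images=[image], text="Describe this image in detail.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```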

PixMo: Data quality wins over quantity

Large VLMs are conventionally trained on billions of image-text pairs sourced from the web. Such massive corpora tend to be extremely noisy, requiring models to separate signal from noise during training. Noisy text also leads to hallucinations in a model's output. We take a vastly different approach to sourcing data, with an intense focus on data quality, and are able to train powerful models with fewer than 1M image-text pairs, three orders of magnitude less data than many competitive approaches use.

The most critical ingredient to the success of the Molyn family of models is PixMo, Molyn's training data. PixMo includes two broad categories of data: (1) dense captioning data for multimodal pre-training and (2) supervised fine-tuning data for enabling a wide array of user interactions, including behaviors like question answering, document reading, and pointing. Our primary constraint in collecting this data is to avoid making use of existing VLMs, since we want to build a performant VLM from the ground up rather than by distilling an existing system (note that we do make use of language-only LLMs, but we never pass images to these models).

In practice, it is challenging to collect dense captioning datasets from human annotators. If asked to write an image description, the result often only mentions a few salient visual elements and lacks detail. If a minimum word count is enforced, annotators will either take too long to type, making collection uneconomical, or copy-and-paste responses from proprietary VLMs, circumventing our goal of avoiding distillation. As a result, the open research community has struggled to create such datasets without relying on synthetic data from proprietary VLMs. Our key innovation is a simple but effective data collection methodology that avoids these problems: we ask annotators to describe images in speech for 60 to 90 seconds rather than asking them to write descriptions. We prompt the annotators to describe everything they see in great detail, including spatial positioning and relationships. Empirically, we found that with this modality-switching "trick" annotators provide far more detailed descriptions in less time, and for each description we collect an audio receipt (i.e., the annotator's recording) proving that a VLM was not used. In total, we collected detailed audio descriptions for 712k images sampled from 70 diverse topics (e.g., street signs, memes, food, drawings, websites, blurry photos, etc.).
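To make the pipeline concrete, here is a rough sketch of how a spoken description could be turned into a text caption: transcribe the recording with an off-the-shelf ASR model, then have a language-only LLM (which never sees the image) clean up the transcript. The tooling shown and the refine_with_llm helper are illustrative assumptions, not the exact pipeline we used.

```python
# Sketch of a speech-to-caption step. Assumptions: the openai-whisper package is
# used for transcription, and refine_with_llm is a hypothetical stand-in for a
# language-only LLM call that cleans up the transcript without seeing the image.
import whisper

def transcribe_description(audio_path: str) -> str:
    asr = whisper.load_model("base")  # off-the-shelf ASR model
    return asr.transcribe(audio_path)["text"].strip()

def refine_with_llm(transcript: str) -> str:
    # Hypothetical helper: send this prompt to a text-only LLM and return its
    # completion; left unimplemented because the choice of LLM is open.
    prompt = (
        "Rewrite the following spoken image description as a clean, detailed "
        "caption. Do not add any information that is not in the description:\n\n"
        + transcript
    )
    raise NotImplementedError("send `prompt` to a language-only LLM and return its completion")

raw_transcript = transcribe_description("annotation_audio.wav")
caption = refine_with_llm(raw_transcript)
```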

Our fine-tuning data mixture includes standard academic datasets as well as several newly collected datasets, all of which we have released. While the academic datasets primarily allow the model to do well on benchmark evaluations, our newly collected datasets enable a wide range of important functionality: answering general questions about images in chats with users beyond the scope of the academic benchmarks, improved performance on OCR-centric tasks like reading documents and charts, accurate reading of analog clocks, and the ability to point to one or more visual elements in an image. Pointing provides a natural explanation grounded in image pixels, resulting in new and improved capabilities for Molyn. We believe that in the future pointing will be an important communication channel between VLMs and agents. For example, a robot could query a pointing-enabled VLM for a waypoint or the location of an object to pick up, or a web agent could query the VLM for the location of a user interface element to click.
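To make the agent use case concrete, the sketch below parses points from a model response and converts them to pixel coordinates an agent could act on. The `<point x="..." y="...">` tag format and the 0-100 coordinate scale are assumptions for illustration, not a documented Molyn output specification.

```python
# Sketch: turn a pointing response into pixel coordinates an agent could act on.
# Assumption: the model emits points as <point x="..." y="..."> tags with
# coordinates on a 0-100 scale relative to the image; this format is illustrative.
import re
from typing import List, Tuple

POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')

def extract_points(model_output: str, width: int, height: int) -> List[Tuple[int, int]]:
    """Convert normalized point tags in the model output to pixel coordinates."""
    points = []
    for x_str, y_str in POINT_RE.findall(model_output):
        points.append((int(float(x_str) / 100 * width), int(float(y_str) / 100 * height)))
    return points

# Example: a web agent asking for the search button on a 1280x800 screenshot.
response = '<point x="12.5" y="4.0">search button</point>'
print(extract_points(response, width=1280, height=800))  # [(160, 32)]
```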

A description of the newly collected datasets is available below.

PixMo-Cap

PixMo-Cap is a dataset for pre-training VLMs to understand images in great detail. It contains 712,000 distinct images with approximately 1.3 million dense image captions. The captions were generated by human annotators who provided detailed 60-90 second spoken descriptions of diverse web images, which were then transcribed and refined using language models. The dataset covers a wide range of topics and includes detailed descriptions of image contents, objects, text, positions, subtle details, background, style, and color.

PixMo-AskModelAnything

PixMo-AskModelAnything is a dataset designed to enable AI models to answer diverse questions about images. It includes 162,000 question-answer pairs for 73,000 images, created through a process where human annotators selected images, wrote questions, and iteratively refined answers generated by a language model based on image captions and OCR output. The dataset also incorporates unusual requests, such as answers written upside down, to increase diversity.

PixMo-Points

PixMo-Points is a pointing dataset in which human annotators were asked to point at objects in images and write descriptions of them. The dataset contains 2.3 million question-point pairs from 428,000 images, including instances where annotators pointed to every occurrence of a described object and cases where the object was not present in the image. This dataset aims to enable models to point to anything described by text, count objects by pointing, and use pointing as a form of visual explanation.
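For illustration, a single question-point record might look roughly like the following once loaded; the field names are hypothetical, not the released schema.

```python
# Illustrative shape of a single question-point record once loaded. Field names
# are hypothetical, not the released schema; coordinates are normalized to 0-100.
example_record = {
    "image_url": "https://example.com/kitchen.jpg",  # placeholder URL
    "label": "mug",                                   # what the annotator pointed at
    "points": [{"x": 23.4, "y": 61.0}, {"x": 71.9, "y": 58.2}],  # one entry per occurrence
    "count": 2,         # counting by pointing falls out of the point list
    "in_image": True,   # some records mark described objects that are absent
}
```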

PixMo-CapQA

This dataset contains 214,000 question-answer pairs generated from 165,000 image captions using a language model. The questions cover diverse topics and styles to increase variety.

PixMo-Docs

This dataset includes 255,000 text and figure-heavy images (charts, documents, tables, diagrams) with corresponding code generated by a language model. It also contains 2.3 million question-answer pairs based on the generated code.

PixMo-Clocks

This is a synthetic dataset of 826,000 analog clock images with corresponding questions and answers about the time. The dataset features about 50 different watch types and 160,000 realistic watch face styles with randomly chosen times.
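A minimal sketch of how time labels for such a synthetic set could be sampled is shown below; watch-face rendering is omitted and the question/answer phrasing is illustrative.

```python
# Sketch: sample a random time and produce a question-answer pair for one
# synthetic clock image. Watch-face rendering is omitted; phrasing is illustrative.
import random

def sample_clock_qa(rng: random.Random) -> tuple:
    hour, minute = rng.randint(1, 12), rng.randint(0, 59)
    question = "What time is shown on this clock?"
    answer = f"The time shown is {hour}:{minute:02d}."
    return hour, minute, question, answer

print(sample_clock_qa(random.Random(0)))
```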

Benchmark evaluations and large-scale human preference rankings

Vision-language model evaluation is evolving rapidly with new academic benchmarks constantly appearing. These benchmarks work well for evaluating specific skills, but doing well on them often requires answering questions in a benchmark-specific style. These answers are often short and do not work well in other settings. As a result, academic benchmarks provide only a partial picture of how a model performs. To complement these benchmarks we also perform a human evaluation that allows us to rank models according to user preference.

For academic benchmarking, we attempted to collect results for all models on a set of 11 commonly used academic benchmarks.[1] We prioritized using numbers published by the authors themselves when they were available, but many were missing. When results were not available, we attempted to find the best previously reported values from other technical reports or from public leaderboards, such as the OpenVLM Leaderboard. Finally, if a value was still missing, we computed it ourselves. We note that computing results is difficult in practice, and for a fixed model, results on a given benchmark can vary by a large amount (e.g., 10 percentage points) depending on the details of how it was evaluated. Further complicating matters, in many cases critical evaluation details, such as what prompts were used or how the data was processed, may not be available, making it difficult to reproduce published results. These issues underscore the importance of open evaluation.

We also avoid making a strong distinction between claimed "zero-shot" performance (often reported for closed-data models) and the supervised performance of models that explicitly train on benchmark training sets. The distinction between supervised training and zero-shot transfer is fuzzy since one can curate new data sources that serve as effective proxies for any given benchmark's literal training data. When training data is not disclosed, the community has no means of evaluating zero-shot transfer claims.

For our human evaluation, we collected a diverse set of image and text prompt pairs and queried a set of VLMs for responses. We then presented the resulting image-text-response triplets for all VLM pairings to a set of ~870 human annotators, who gave pairwise preference rankings. From these preference rankings, we calculated an Elo ranking using the Bradley-Terry model, following the methodology of LMSYS Org's Chatbot Arena. We collected 325,231 pairwise comparisons across 27 models, making this the largest human preference evaluation for multimodal models to date. As a reference, our Elo rankings are based on 3x more votes than Chatbot Arena (LMSYS) has for vision models.
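For readers unfamiliar with the procedure, the sketch below fits a Bradley-Terry model to pairwise win counts with a simple minorization-maximization loop and maps the fitted strengths onto an Elo-like scale. It is a simplified illustration of the method, not our exact evaluation code; ties and confidence intervals are ignored.

```python
# Sketch: fit a Bradley-Terry model to pairwise preference counts and map the
# fitted strengths onto an Elo-like scale (400 points per factor of 10).
import math
from collections import defaultdict

def bradley_terry_elo(wins, num_iters=200, base_rating=1000.0):
    """wins[(a, b)] = number of comparisons in which model a was preferred over b."""
    models = sorted({m for pair in wins for m in pair})
    games = defaultdict(float)       # total comparisons per model pair
    total_wins = defaultdict(float)  # total wins per model
    for (a, b), w in wins.items():
        games[frozenset((a, b))] += w
        total_wins[a] += w
    strength = {m: 1.0 for m in models}
    for _ in range(num_iters):       # minorization-maximization updates
        new_strength = {}
        for m in models:
            denom = sum(
                games[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models if o != m and games[frozenset((m, o))] > 0
            )
            new_strength[m] = total_wins[m] / denom if denom > 0 else strength[m]
        mean = sum(new_strength.values()) / len(models)
        strength = {m: s / mean for m, s in new_strength.items()}  # fix the scale
    return {m: base_rating + 400.0 * math.log10(s) for m, s in strength.items()}

# Toy usage: counts of pairwise preferences among three models.
ratings = bradley_terry_elo({("A", "B"): 70, ("B", "A"): 30, ("A", "C"): 80,
                             ("C", "A"): 20, ("B", "C"): 55, ("C", "B"): 45})
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```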

Broadly speaking, the academic benchmark results and the human evaluation strongly agree, with the exception of Qwen2-VL, which performs strongly on the academic benchmarks but comparatively underperforms in the human evaluation.

[Figure: Elo human preference rankings across 27 VLMs]

Human preference evaluations. Our Elo human preference evaluation used 15k image and text prompt pairs. We queried each VLM for responses and presented the resulting image-text-response triplets for all VLM pairings to a set of 870 human annotators, who gave pairwise preference rankings, for a total of 325,231 pairwise comparisons across 27 models, making this the largest human preference evaluation for multimodal models to date. As a reference, our Elo rankings are based on 3x more votes than Chatbot Arena (LMSYS) has for vision models.

Small is the new big, less is the new more

We highlight a few key results:

  • Our most efficient Molyn model MolynE-1B, based on our fully open OLMoE-1B-7B mixture-of-experts LLM, nearly matches the performance of GPT-4V on both academic benchmarks and human evaluation.

  • Our two Molyn-7B models perform comfortably between GPT-4V and GPT-4o on both academic benchmarks and human evaluation, and significantly outperform the recently released Pixtral 12B model on both.

  • Our best-in-class Molyn model, Molyn-72B, achieves the highest academic benchmark score and ranks second on human evaluation, just slightly behind GPT-4o.

  • Our best Molyn model also outperforms several state-of-the-art proprietary systems, including Gemini 1.5 Pro, Gemini 1.5 Flash, and Claude 3.5 Sonnet.

Model Architecture

Our model architecture follows the simple and standard design of combining a language model with an image encoder. It consists of four components: (1) a pre-processor that converts the input image into a set of multiscale, multi-crop images; (2) a ViT image encoder that independently maps each of these images into a set of vision tokens; (3) a connector that projects the vision tokens to the language model's input dimension with an MLP and then pools the vision tokens to reduce their count; and (4) a decoder-only Transformer LLM.
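In code, the forward pass of this template looks roughly like the sketch below. Module names, the pooling operator, and shapes are illustrative placeholders; the released training code is the authoritative implementation.

```python
# Schematic forward pass for the four-component template described above.
# Module names, the pooling operator, and shapes are illustrative only.
import torch
import torch.nn as nn

class VLMSketch(nn.Module):
    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int, pool: int = 2):
        super().__init__()
        self.vit = vit                    # (2) ViT image encoder
        self.connector = nn.Sequential(   # (3) MLP projection to the LLM's input dimension
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.pool = nn.AvgPool1d(pool)    # simple pooling to reduce the vision token count
        self.llm = llm                    # (4) decoder-only Transformer LLM

    def forward(self, crops: torch.Tensor, text_embeds: torch.Tensor):
        # crops: [num_crops, 3, H, W] produced by the (1) multi-scale, multi-crop pre-processor
        v = self.vit(crops)                               # [num_crops, T, vit_dim]
        v = self.connector(v)                             # [num_crops, T, llm_dim]
        v = self.pool(v.transpose(1, 2)).transpose(1, 2)  # [num_crops, T // pool, llm_dim]
        v = v.flatten(0, 1).unsqueeze(0)                  # [1, num_crops * (T // pool), llm_dim]
        # Prepend the vision tokens to the text embeddings and decode autoregressively.
        return self.llm(inputs_embeds=torch.cat([v, text_embeds], dim=1))
```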

From this template, we construct a family of models parameterized by the choice of vision encoder and LLM. Given these choices, the subsequent training data and recipe are the same for all models (aside from optimizer learning rates). For the vision encoder, all of our released models use OpenAI's ViT-L/14 336px CLIP model, which provides consistently good results (while this model uses closed data, it can be reproduced from scratch as shown by MetaCLIP; we use the model from OpenAI because it was trained for higher-resolution images). For the LLM, we have trained models on a variety of choices at different scales and degrees of openness, including: the fully open-weight and open-data OLMo-7B-1024 (using the October 2024 pre-release weights, which will be made public at a later date), the efficient fully open-weight and open-data OLMoE-1B-7B-0924, the open-weight Qwen2 7B, Qwen2 72B, Mistral 7B, Gemma2 9B, and Phi 3 Medium. Today we are releasing four models from this family.

Starting from an independently pre-trained vision encoder and LLM, our training process is simple and consists of two stages: (1) multimodal pre-training for caption generation using our newly collected captioning data and (2) supervised fine-tuning using our dataset mixture described above. All model parameters are updated in both stages. We do not use RLHF.
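Schematically, the recipe is two passes over different data with every parameter trainable; the hyperparameters in this sketch are placeholders, not our actual settings.

```python
# Schematic two-stage recipe: caption pre-training, then supervised fine-tuning.
# All parameters are trainable in both stages; hyperparameters below are placeholders.
import torch

def train_stage(model, dataloader, lr, num_steps):
    for p in model.parameters():
        p.requires_grad = True                      # update every parameter in both stages
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, batch in zip(range(num_steps), dataloader):
        loss = model(**batch).loss                  # standard next-token prediction loss
        loss.backward()
        opt.step()
        opt.zero_grad()

# Stage 1: multimodal pre-training for caption generation on PixMo-Cap.
# train_stage(model, caption_loader, lr=1e-5, num_steps=...)
# Stage 2: supervised fine-tuning on the PixMo + academic mixture. No RLHF stage follows.
# train_stage(model, sft_loader, lr=5e-6, num_steps=...)
```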

Releases

Our first release included a demo, inference code, a brief technical report on arXiv, and the following model weights:

  • MolynE-1B, a mixture-of-experts model with 1B active and 7B total parameters

  • Molyn-7B-O, our most open 7B model

  • Molyn-7B-D, our demo model

  • Molyn-72B, our best model

 

In part 2, we have released:

  • A major update to the technical report, including in-depth implementation details and extensive ablation experiments revealing the important model and data design decisions

  • Our PixMo family of datasets

  • Training and evaluation code to complement the already available inference code

 

Future Molyn releases will include:

  • Updated model weights based on new OLMo releases

  • A version of Molyn based on the MetaCLIP vision encoder and the OLMo LLM that will be the most open version of Molyn yet, where every bit of data used in the entire model, including for the pre-trained vision encoder and LLM, is open
