The release of ChatGPT two years ago opened a new chapter in AI, driven by large language models of unprecedented size and complexity. These models are now a leading force in research and business, but many of them don't release their data, their full training recipe or their checkpoints. That’s where the nonprofit Allen Institute for Artificial Intelligence (Ai2) comes in. Ai2 got its start in 2014, founded by Microsoft co-founder Paul Allen. The research group works on language models, multimodal models and evaluation frameworks in open source.
Recently, Ai2 released Molmo, a family of state-of-the-art multimodal AI models aiming to significantly close the gap between open and proprietary systems. “Even our smaller models outperform competitors 10x their size,” says Ai2.
Earlier in September, Ai2 released OLMoE, a mixture-of-experts model with 1 billion active and 7 billion total parameters, developed jointly with Contextual AI. It was trained on 5 trillion tokens and built on a new data mix incorporating lessons from Ai2’s Dolma.
We spoke with Hanna Hajishirzi, Senior Director of NLP Research at Ai2, after her keynote at the PyTorch Conference in San Francisco to discuss open source models and AI literacy.
We did a minor release for OLMoE in September. Despite being a small model, it performs really well on many tasks. Since then, we have seen great reception from the community. We’ve also created an app that runs the language model directly on smartphones without connecting to a GPU. It’s still in progress—we’re working on safety features and improving the UI—but it’s exciting. We’re also working on training bigger models.
It's no surprise that mixture-of-experts models work well; we have seen them included in frontier models. The benefit of a mixture of experts is that with the same training effort, you get higher accuracy compared to dense models. What was interesting for us was to take this to the extreme and train the smallest model we could, like a 1 billion-parameter model, to see what happens. We were excited by the results.
So how did we get there? First, we improved our training pipeline. We started with a dense model architecture and ran several experiments to successfully extend it to a mixture-of-experts model. Second, we made improvements to our data mix, which led to a better model. Together, these two things gave us the best results.
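To make the idea concrete, here is a minimal sketch of a mixture-of-experts feed-forward layer in PyTorch. It is an illustrative toy, not OLMoE's actual architecture: the sizes, the router design and the top-2 routing are assumptions chosen for readability. It shows the trade-off Hajishirzi describes: many expert MLPs hold the parameters, but each token is routed to only a few of them, so per-token compute stays close to that of a single dense MLP.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Toy mixture-of-experts MLP: each token is routed to its top-k experts."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)                 # (n_tokens, n_experts)
        weights, chosen = probs.topk(self.top_k, dim=-1)       # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot_idx = (chosen == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # no token picked this expert in this batch
            # Only the selected tokens pass through expert i, weighted by the router.
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(x[token_idx])
        return out

tokens = torch.randn(10, 64)            # 10 tokens with a 64-dim hidden state
print(MoEFeedForward()(tokens).shape)   # torch.Size([10, 64])
```

OLMoE pushes this balance in the direction described above: roughly 7 billion total parameters spread across experts, with about 1 billion active for any given token.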
There’s a wide range of openness in the AI community. For example, models like OpenAI’s ChatGPT have opened their APIs, but who knows what’s happening behind closed doors?
It all seems very fancy, but this lack of transparency is the opposite of promoting AI literacy. The public has no real understanding of why these models behave the way they do. It all feels like magic as these models seem to get better.
The AI community needs to start releasing more information about opaque models and explaining why they give certain answers. For example, they could explain that a model responds in a certain way because it has encountered specific patterns in its training data.
Educating the public on this is essential. Although it’s challenging to connect specific decisions to data points in a way that’s easy for the public to understand, creating demos that showcase this process would be really impactful.
Exactly! That’s a significant focus of our project: we aim to release both the model weights and the training data.
Using our OLMo and OLMoE models, researchers in the community are working on how model decisions connect to the data. Our open dataset, Dolma, has enabled researchers to analyze the training data, leading to publications that explain how specific data points contribute to model behavior. This transparency would also help inform the public.
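As a rough illustration of the simplest form of this kind of analysis, one can check which training documents share phrases with a model's answer. This is a toy sketch, not an Ai2 tool; the corpus and answer below are invented, and real attribution work over Dolma operates on trillions of tokens with far more careful methods.

```python
import re

def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Invented stand-in "training documents"; a corpus like Dolma holds trillions of tokens.
corpus = {
    "doc-001": "The Eiffel Tower is located in Paris and was completed in 1889.",
    "doc-002": "Python is a popular programming language for machine learning.",
    "doc-003": "Paris is the capital of France and home to the Eiffel Tower.",
}

model_answer = "The Eiffel Tower is located in Paris, the capital of France."

# Report which documents share phrases with the model's answer.
answer_ngrams = ngrams(model_answer)
for doc_id, text in corpus.items():
    overlap = answer_ngrams & ngrams(text)
    if overlap:
        print(doc_id, "shares phrases with the answer:", sorted(overlap))
```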
I can address this from two perspectives. First, when we started this project, we questioned the validity of numbers reported by some companies. We wanted to ensure that those figures weren’t derived from selective test sets or benchmarks. This gets at the question of trust within the research community.
For our model, it’s straightforward, because we provide access to our data and demonstrate how our models are evaluated. This transparency makes clear what is in the data and how the models are trained. We also release various checkpoints, which are intermediate stages of training. Researchers can use these checkpoints to observe how knowledge and improvements develop over time. And some researchers are already leveraging our checkpoints to study this evolution.
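Because those intermediate checkpoints are published alongside the final weights, inspecting them can be as simple as pointing the Hugging Face transformers library at a specific training step. The sketch below assumes the allenai/OLMo-1B-hf repository, and the revision string is illustrative; the actual branch names for each checkpoint are listed on the model's Hub page.

```python
# A minimal sketch of loading an intermediate OLMo checkpoint to study how the
# model evolves during training. The repo id and revision below are assumptions
# for illustration; check Ai2's Hugging Face pages for the real branch names.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-1B-hf"            # assumed repo id of an open OLMo release
revision = "step100000-tokens419B"     # hypothetical intermediate-checkpoint branch

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)

# Probe the partially trained model with the same prompt used on later
# checkpoints to watch capabilities emerge over training.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```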
Finally, in terms of public trust, a similar approach applies. Many people believe that language models simply hallucinate. By connecting their outputs to training data and explaining decision-making processes, we can enhance trustworthiness. Although we’re not there yet, improving transparency about our training data offers significant opportunities to build public trust.
I believe open source AI is essential to enable and accelerate the science of language models. We have made so much progress in research and development in language models due to open, scientific research, and we should continue making efforts to keep open source AI active.