The mixture of experts (MoE) architecture aims to balance the knowledge capacity of larger models with the inference efficiency of smaller models by subdividing the layers of the model’s neural network into multiple “experts.” Rather than activating every model parameter for each token, MoE models use a gating function that activates only the “experts” best suited to processing that token.
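
To make the gating idea concrete, here is a minimal sketch of top-k routing in plain Python. It is an illustration under stated assumptions, not Llama 4's actual router: the gate scores are passed in directly (in a real model they would come from a learned routing layer), the experts are toy functions, and the names `route_token` and `moe_layer` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, k=1):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1. Real MoE models differ in this detail.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, gate_scores, experts, k=1):
    """Run only the selected experts and combine their outputs by weight."""
    selected = route_token(gate_scores, k)
    out = [0.0] * len(token)
    for idx, weight in selected:
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, [idx for idx, _ in selected]

# Toy experts: expert i scales the token vector by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]

output, used = moe_layer([1.0, 2.0], [0.1, 2.0, -1.0, 0.5], experts, k=1)
print(used, output)
```

With `k=1`, only the single highest-scoring expert runs for this token; the other three contribute no compute, which is the source of the efficiency gain described above.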

Llama 4 Scout, the smaller of the two new models with a total parameter count of 109B, is divided into 16 experts. At inference, it has an active parameter count of only 17B, enabling it to serve more users in parallel. Trained on 40 trillion tokens of data, Llama 4 Scout offers performance rivalling or exceeding that of models with significantly larger active parameter counts while keeping costs and latency low. Despite those lean compute requirements, Llama 4 Scout beats comparable models on coding, reasoning, long-context and image understanding benchmarks.

Llama 4 Maverick is divided into 128 experts, drawing from the knowledge of its 400B total parameters while maintaining the same 17B active parameter count as Llama 4 Scout. Per Meta AI’s official announcement, Llama 4 Maverick beats OpenAI’s GPT-4o and Google’s Gemini 2.0 Flash “across the board” on a wide range of multimodal benchmarks and rivals the much larger DeepSeek-V3 on reasoning and coding tasks.
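
The gap between total and active parameters can be worked out directly from the figures quoted above. The short sketch below computes the share of each model's parameters that is active per token; the dictionary layout and the helper name `active_fraction` are illustrative assumptions, and the ratio is only a rough proxy for per-token compute, not a measured cost.

```python
# Parameter counts in billions, as quoted in the text above.
MODELS = {
    "Llama 4 Scout": {"total_b": 109, "active_b": 17, "experts": 16},
    "Llama 4 Maverick": {"total_b": 400, "active_b": 17, "experts": 128},
}

def active_fraction(total_b, active_b):
    """Share of a model's parameters that is active for a single token."""
    return active_b / total_b

for name, m in MODELS.items():
    frac = active_fraction(m["total_b"], m["active_b"])
    print(f"{name}: {m['experts']} experts, {frac:.1%} of parameters active per token")
```

By this measure, Scout activates roughly one sixth of its parameters per token, while Maverick activates well under a twentieth of its parameters despite being nearly four times larger in total.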