Small language models: Edge AI innovation from AI21

While most in the AI world are working to create ever-larger language models, like OpenAI’s GPT-5 and Anthropic’s Claude Sonnet 4.5, Israeli AI startup AI21 is taking a different path.
AI21 has just unveiled Jamba Reasoning 3B, a model with 3 billion parameters. This compact, open-source model can handle a massive context window of 250,000 tokens (meaning it can “remember” and reason about much more text than traditional language models) and can run at high speed, even on consumer devices. This launch highlights a growing shift: smaller, more efficient models could shape the future of AI just as much as raw scale.
“We believe in a more decentralized future for AI, where not everything happens in huge data centers,” says Ori Goshen, co-CEO of AI21, in an interview with IEEE Spectrum. “Large models will still play a role, but small, powerful models running on devices will have a significant impact” on both the future and the economics of AI, he says. Jamba is designed for developers who want to build edge-AI applications and specialized systems that run efficiently on-device.
AI21’s Jamba Reasoning 3B is designed to handle long sequences of text and difficult tasks like math, coding, and logical reasoning, while running at impressive speed on everyday devices like laptops and mobile phones. Jamba Reasoning 3B can also work in a hybrid configuration, as sketched below: simple tasks are handled locally on the device, while more demanding ones are sent to powerful cloud servers. According to AI21, this smarter routing could significantly reduce AI infrastructure costs for certain workloads, potentially by an order of magnitude.
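To make that hybrid configuration concrete, here is a minimal sketch of local-first routing in Python. Everything in it (the difficulty heuristic, the threshold, and the `local_model` and `cloud_client` objects) is illustrative; AI21 has not published its routing logic.

```python
# Minimal sketch of local-first routing between an on-device model and a
# cloud endpoint. The heuristic, threshold, and client objects are
# hypothetical, not part of AI21's actual stack.

def estimate_difficulty(prompt: str) -> float:
    """Crude difficulty proxy: long or reasoning-heavy prompts score higher."""
    keywords = ("prove", "derive", "refactor", "step by step")
    score = min(len(prompt) / 4000, 1.0)
    score += 0.25 * sum(k in prompt.lower() for k in keywords)
    return min(score, 1.0)

def answer(prompt: str, local_model, cloud_client, threshold: float = 0.6) -> str:
    """Run easy prompts on-device; escalate hard ones to a larger cloud model."""
    if estimate_difficulty(prompt) < threshold:
        return local_model.generate(prompt)   # fast, private, no per-token cost
    return cloud_client.complete(prompt)      # bigger model for demanding tasks
```

The savings AI21 describes come from the fact that most everyday requests fall below the threshold and never leave the device, so the cloud is billed only for the hard tail of the workload.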
A small but powerful LLM
With 3 billion parameters, Jamba Reasoning 3B is tiny by today’s AI standards. Models like GPT-5 or Claude far exceed 100 billion parameters, and even smaller models, such as Llama 3 (8B) or Mistral (7B), are more than twice the size of the AI21 model, notes Goshen.
This compact size makes it all the more remarkable that AI21’s model can handle a 250,000-token context window on consumer devices. Some proprietary models, like GPT-5, offer even longer context windows, but Jamba sets a new benchmark among open-source models. The previous open-model record of 128,000 tokens was held by Meta’s Llama 3.2 (3B), Microsoft’s Phi-4 Mini, and DeepSeek R1, all of which are much larger models. Jamba Reasoning 3B can process more than 17 tokens per second even when running at full capacity, that is, with extremely long inputs that use up its entire 250,000-token context window. Many other models slow down or struggle once their input length exceeds 100,000 tokens.
Goshen explains that the model is built on an architecture called Jamba, which combines two types of neural network designs: transformer layers, familiar from other large language models, and Mamba layers, designed to be more memory-efficient. This hybrid design allows the model to handle long documents, large codebases, and other extensive inputs directly on a laptop or phone, using about a tenth of the memory of a traditional transformer model. Goshen says the model runs much faster than traditional transformers because it relies less on a memory component called the KV cache, which slows processing as inputs grow longer.
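A back-of-the-envelope calculation shows why the KV cache becomes the bottleneck at long context lengths. The hyperparameters below are assumptions roughly typical of a 3-billion-parameter transformer, not Jamba’s published configuration:

```python
# Back-of-the-envelope KV-cache size for a pure transformer. The
# hyperparameters are assumed values typical of a ~3B model, not
# Jamba's actual configuration.

layers   = 32     # transformer blocks
kv_heads = 8      # key/value heads (assuming grouped-query attention)
head_dim = 128    # dimension per head
bytes_pp = 2      # fp16/bf16 storage per parameter

def kv_cache_bytes(seq_len: int) -> int:
    # Keys and values (the factor of 2) are cached at every layer
    # for every token, so the cache grows linearly with input length.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_pp

for tokens in (8_000, 100_000, 250_000):
    print(f"{tokens:>8,} tokens -> {kv_cache_bytes(tokens) / 2**30:5.1f} GiB")
```

Under these assumptions the cache alone reaches roughly 30 GiB at 250,000 tokens, more than most laptops have. Mamba layers sidestep this growth by compressing history into a fixed-size state, which is why replacing most attention layers with them cuts memory so sharply.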
Why small LLMs are needed
The model’s hybrid architecture gives it an advantage in both speed and memory efficiency, even with very long inputs, confirms a software engineer working in the LLM industry. The engineer requested anonymity because he is not authorized to comment on other companies’ designs. As more users run generative AI locally on laptops, models must quickly handle long context lengths without consuming too much memory. With 3 billion parameters, Jamba meets these requirements, the engineer explains, making it a model optimized for on-device use.
Jamba Reasoning 3B is open source under the permissive Apache 2.0 license and available on popular platforms such as Hugging Face and LM Studio. The release also comes with recipes for fine-tuning the model through an open-source reinforcement-learning framework (called VERL), making it easier and more affordable for developers to adapt the model to their own tasks.
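For developers who want to try it, loading an open model from Hugging Face typically looks like the sketch below. The repo ID is an assumption based on AI21’s Hugging Face organization; check the model card for the exact name and recommended settings.

```python
# Sketch of loading the model with Hugging Face transformers.
# The repo ID is assumed, not confirmed; consult the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ai21labs/AI21-Jamba-Reasoning-3B"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Summarize the following contract:\n...", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```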
“Jamba Reasoning 3B marks the start of a family of small, efficient reasoning models,” says Goshen. “Scaling down enables decentralization, customization, and cost-effectiveness. Instead of relying on expensive GPUs in data centers, individuals and businesses can run their own models on devices. This opens up new economic opportunities and broader accessibility.”