Merged · Size: XL
Change breakdown: Feature 85% · Performance 10% · Maintenance 5%
PR #15214: Add support for gemma4

Gemma 4 support arrives with multimodal audio and vision capabilities

dhiltgen · Apr 2, 2026

Ollama now runs Google Gemma 4 models locally, adding audio transcription and vision understanding to the open-source LLM runtime.

Google's Gemma 4 family of models is now available in Ollama. The release brings multimodal capabilities to Gemma 4—including vision for image understanding and audio for transcription tasks—expanding what developers can build with local AI inference.

Gemma 4 models use a Mixture of Experts (MoE) architecture that routes each token through a subset of expert networks, so only a fraction of the model's parameters is active per token, improving inference efficiency. The implementation interleaves sliding-window attention layers, which restrict each token to a local context, with full-attention layers that capture longer-range dependencies, and applies RoPE positional encoding whose scaling adapts to the layer type.
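The interplay between the two attention layer types can be sketched with mask construction. This is a generic illustration, not Ollama's implementation; the window size and the ratio of local to global layers are assumptions for the example.

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean causal mask. If `window` is set, each token additionally
    attends only to the previous `window` tokens (sliding-window attention);
    with `window=None` the layer uses full causal attention."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if window is None:
        return causal                  # full-attention layer
    return causal & (i - j < window)   # sliding-window layer

# Hypothetical interleaving of local and global layers (the real ratio
# and window size for Gemma 4 are not specified here).
layer_windows = [1024, 1024, 1024, None] * 8
```

A token at position 4000 in a `window=1024` layer sees only positions 2977..4000, while the full-attention layers let information propagate across the whole context.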

For audio workflows, the update adds a transcription endpoint compatible with OpenAI's API format. Developers can send audio files and receive transcriptions without changing existing code that targets OpenAI's service. The audio pipeline includes a USM conformer encoder that processes WAV files through mel spectrogram conversion before feeding into the model.
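The mel-spectrogram step can be sketched in plain NumPy. This is a generic speech front end, not Ollama's USM conformer pipeline; the frame size, hop, and mel count are common defaults assumed for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wav, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Log-mel features from a mono waveform: frame, window, FFT,
    then pool the power spectrum through a triangular mel filterbank."""
    n_frames = 1 + (len(wav) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wav[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l:
            fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log(power @ fb.T + 1e-10)   # shape: (n_frames, n_mels)
```

Each row of the result is one ~10 ms step of 80 log-mel features, the kind of input a conformer-style audio encoder consumes.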

Vision support follows a similar pattern, with a vision transformer encoder that processes images through patch embedding, 2D positional encoding, and attention layers. Images are downscaled to fit the model's token budget while preserving their aspect ratio.
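The resize step amounts to shrinking the image until its patch count fits the budget. The sketch below assumes a 14-pixel patch and a 576-token budget purely for illustration; Gemma 4's actual values are not stated in the release.

```python
import math

def fit_to_token_budget(width, height, patch=14, max_tokens=576):
    """Downscale (width, height), preserving aspect ratio, so the image
    tiles into at most `max_tokens` patch-size patches. Patch size and
    budget are hypothetical, not Gemma 4's real parameters."""
    def tokens(w, h):
        return math.ceil(w / patch) * math.ceil(h / patch)
    if tokens(width, height) <= max_tokens:
        return width, height           # small images pass through unchanged
    scale = math.sqrt(max_tokens / tokens(width, height))
    while True:
        w = max(patch, int(width * scale))
        h = max(patch, int(height * scale))
        if tokens(w, h) <= max_tokens:
            return w, h
        scale *= 0.98  # ceiling rounding can push just over budget; back off
```

A 1920×1080 frame, for example, shrinks to roughly 444×249 under these assumptions while keeping its 16:9 shape.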

The release also includes a CUDA optimization for Q6_K quantized models, improving performance for users running Gemma 4 with compressed weights. Benchmark tooling received improvements for more accurate token counting during performance testing.
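For intuition about what Q6_K-style quantization does, here is a simplified symmetric per-block scheme: 6-bit signed integers plus one float scale per block. The real Q6_K format in llama.cpp additionally packs sub-block scales into a compact byte layout; this sketch only shows the quantize/dequantize round trip.

```python
import numpy as np

def quantize_block(x, bits=6):
    """Quantize a block of weights to `bits`-bit signed integers with a
    single shared scale (simplified; not the actual Q6_K byte layout)."""
    qmax = (1 << (bits - 1)) - 1           # 31 for 6 bits
    scale = float(np.abs(x).max()) / qmax
    if scale == 0.0:
        scale = 1.0                        # all-zero block
    q = np.round(x / scale).clip(-qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    return q.astype(np.float32) * scale
```

The maximum round-trip error is half the scale, which is why per-block scales (and in Q6_K, per-sub-block scales) keep compressed weights close to the originals.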

In the Ollama runtime, the model integrates with existing infrastructure for chat rendering, tool calling, and the thinking/thought block format that Gemma 4 uses for chain-of-thought reasoning.
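Consuming a thinking-style output typically means separating the reasoning blocks from the visible answer. The delimiter tokens below are an assumption for illustration; the markers Gemma 4 actually emits may differ.

```python
import re

# Hypothetical delimiters; the real Gemma 4 thought markers may differ.
THOUGHT = re.compile(r"<thought>(.*?)</thought>\s*", re.DOTALL)

def split_thinking(text):
    """Return (thought_blocks, visible_answer) from raw model output."""
    thoughts = THOUGHT.findall(text)
    answer = THOUGHT.sub("", text).strip()
    return thoughts, answer
```

A client can then log or display the thought blocks separately while showing only the final answer to the user.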

© 2026 · via Gitpulse