ollama
Development Feed
The app now defaults to showing a chat conversation on first load, replacing the old landing screen with a more immediate experience.
Gemma4 models will no longer use flash attention after benchmark testing revealed severe slowdowns—up to 60% worse than running without the optimization.
AMD GPU users can now build and run GGML-based models without compilation errors — a missing type mapping and const-correctness fix unlock ROCm support.
The Gemma4 model family can now run on NVIDIA GPUs after a one-line change enabled flash attention, resolving the CUDA initialization errors that had blocked the models from loading.
A CUDA operation that crashes during graph capture is now safely skipped, fixing Gemma4 model loading failures when multiple requests run in parallel.
Tool calls from Gemma4 models will no longer silently fail when arguments contain quoted strings with embedded quote characters.
Argument parsing for Gemma4 tool calls has been rebuilt: the complex state-machine logic has been replaced with a regex-based conversion that better aligns with the reference implementation and rejects malformed inputs that previously slipped through.
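To illustrate the failure mode (this is a minimal sketch, not Ollama's actual parser, and the payload and helper names are invented): a regex that treats a quoted string as "anything between two quotes" terminates at the first escaped quote inside a value, while an escape-aware pattern plus standard unquoting handles embedded quotes correctly.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Naive pattern: a quoted string is anything between two quotes.
// It stops at the first escaped quote, truncating the value.
var naiveQuoted = regexp.MustCompile(`"([^"]*)"`)

// Escape-aware pattern: a quoted string is a run of non-quote,
// non-backslash characters or backslash-escaped pairs.
var escapeQuoted = regexp.MustCompile(`"((?:[^"\\]|\\.)*)"`)

// extractQuoted returns the unescaped contents of every quoted string
// in s, using the escape-aware pattern plus strconv.Unquote to resolve
// \" sequences into literal quote characters.
func extractQuoted(s string) []string {
	var out []string
	for _, q := range escapeQuoted.FindAllString(s, -1) {
		if v, err := strconv.Unquote(q); err == nil {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// A hypothetical argument payload with embedded, escaped quotes.
	args := `{"city": "San \"Fog City\" Francisco"}`

	fmt.Println(naiveQuoted.FindAllString(args, -1)) // truncated fragments
	fmt.Println(extractQuoted(args))                 // full values, escapes resolved
}
```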
Ollama now runs Google Gemma 4 models locally, adding audio transcription and vision understanding to the open-source LLM runtime.
The Gemma4 converter now handles the audio tower's updated tensor naming scheme, with mappings for new layer conventions, attention patterns, and feed-forward blocks that mirror the standard naming structure.
Models using SentencePiece BPE encoding (such as Gemma4:31b) will no longer silently drop accented characters and diacritics; non-English output should now render correctly.
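A minimal Go sketch of the likely failure mode, assuming the bug was validating each token's bytes independently (the helper names are invented; this is not Ollama's tokenizer code): byte-level BPE can split a multi-byte UTF-8 character across tokens, so sanitizing per token discards the lone bytes, while accumulating raw bytes and validating once preserves the character.

```go
package main

import (
	"fmt"
	"strings"
)

// decodePerToken sanitizes each token's bytes independently — the buggy
// behavior: a multi-byte character split across tokens is dropped,
// because each half is invalid UTF-8 on its own.
func decodePerToken(tokens [][]byte) string {
	var b strings.Builder
	for _, t := range tokens {
		b.WriteString(strings.ToValidUTF8(string(t), ""))
	}
	return b.String()
}

// decodeAccumulated buffers the raw bytes and validates once at the
// end, so characters split across token boundaries survive.
func decodeAccumulated(tokens [][]byte) string {
	var raw []byte
	for _, t := range tokens {
		raw = append(raw, t...)
	}
	return strings.ToValidUTF8(string(raw), "")
}

func main() {
	// "café" with the 2-byte 'é' (0xC3 0xA9) split across two tokens.
	tokens := [][]byte{[]byte("caf"), {0xC3}, {0xA9}}
	fmt.Println(decodePerToken(tokens))    // accent silently dropped
	fmt.Println(decodeAccumulated(tokens)) // full word intact
}
```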