Tag: Inference Optimization
-
The Hidden Context Window Problem in RAG Systems: A Real Production Incident with vLLM and Qwen3
When Your 32K Context LLM Fails at 4K Tokens: A Production vLLM Troubleshooting Guide
One of the most common misconceptions in Generative AI systems is: “The model supports 32K context, so my application automatically supports 32K context.” In production, that assumption can lead to unexpected failures. Recently, we encountered a production issue…
-
Google Gemma 4 12B: The Model That Signals a Bigger Shift in AI Infrastructure
Over the past year, I’ve spent a considerable amount of time working with both local and production AI environments. On one side, I’ve been experimenting with local LLMs using Ollama, testing quantized models, and exploring how much intelligence can realistically run on developer laptops. On the other side, I’ve been deploying production workloads…