Tag: vLLM
-
Why a 7B Parameter Model Won’t Run Comfortably on a 14 GB GPU (And Why Most Engineers Get This Wrong)
If you’ve recently started working with Large Language Models (LLMs), you’ve probably seen a calculation like this: 7 billion parameters × 2 bytes (FP16) ≈ 14 GB. At first glance, it seems perfectly reasonable to conclude: “A GPU with 14 GB of VRAM should be enough.” Unfortunately, that’s one of the most…
-
The Hidden Context Window Problem in RAG Systems: A Real Production Incident with vLLM and Qwen3
When Your 32K Context LLM Fails at 4K Tokens: A Production vLLM Troubleshooting Guide
One of the most common misconceptions in Generative AI systems is: “The model supports 32K context, so my application automatically supports 32K context.” In production, that assumption can lead to unexpected failures. Recently, we encountered a production issue…