|
Ask hardcore ChatGPT, Claude, and Perplexity users what they think of local LLMs, and you’ll hear a bunch of arguments about their resource-hogging nature, complex setup, and lack of computational prowess. However, the most common complaint is that self-hosted models aren’t capable of anything more than serving as chatbots. And truth be told, I used to think the same way before diving into the local LLM ecosystem, as my limited interactions with weaker models had left me dissatisfied with the results. But after testing all sorts of models on my workstation and wiring them into the open-source toolchain I run at home, I found that a properly configured local model can absolutely be a productivity engine.
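My local models can even call Claude automatically when they get stuck, and that is what pushed me all the way to a local-first workflow.

A lot of people misunderstand local models and assume they can only run as small-parameter versions. Models in the 0.8B to 4B range really do tend to spout nonsense, especially on technical questions. Models at 20B and above reason far better, but they also demand a lot more compute.

The key breakthrough is MoE (Mixture of Experts) offloading. When a traditional dense model doesn’t fit in VRAM, all you can do is split it by whole layers: the -ngl flag sets how many layers go to the GPU, and every remaining layer runs entirely from system memory, with miserable speeds as a result. MoE models are different: you can use --n-cpu-moe to offload the heavy expert weights to the CPU and system RAM while the attention layers keep running on the graphics card. As long as your VRAM plus RAM can hold tens of billions of parameters, even an old machine can run them.

That’s exactly how I run Gemma4-26B-A4B on a GTX 1080 (8GB VRAM + 24GB RAM) and Qwen3.6-35B-A3B on an RTX 3080 Ti (12GB VRAM + 32GB RAM). Token generation has never dropped below 14 t/s, which is plenty for everyday tasks.

Local models pair especially well with coding tools. Even Claude Code supports connecting to local models.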
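
To make the offloading setup above a little more concrete, here is a minimal Python sketch that does two things: a back-of-the-envelope check of whether a model’s quantized weights fit in VRAM plus RAM, and assembly of a llama.cpp `llama-server` command that uses `-ngl` together with `--n-cpu-moe`. Everything numeric here is an assumption rather than something from my setup: the 4.5 bits per weight (a typical 4-bit-class quantization), the 4GB headroom for KV cache and the OS, the context size, the `n_cpu_moe=24` value, and the `model.gguf` path are all placeholders you would tune by hand.

```python
import shlex


def estimate_weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights in GB.

    total_params_b is the parameter count in billions. KV cache and
    runtime overhead are NOT included.
    """
    return total_params_b * bits_per_weight / 8


def fits_in_memory(total_params_b: float, bits_per_weight: float,
                   vram_gb: float, ram_gb: float,
                   headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed headroom fit in VRAM + RAM combined."""
    needed = estimate_weights_gb(total_params_b, bits_per_weight) + headroom_gb
    return needed <= vram_gb + ram_gb


def build_llama_server_cmd(model_path: str, n_cpu_moe: int,
                           n_gpu_layers: int = 99,
                           ctx_size: int = 8192) -> list[str]:
    """Assemble a llama-server command line.

    -ngl asks for all layers on the GPU, while --n-cpu-moe keeps the MoE
    expert weights of the first n_cpu_moe layers on the CPU, so attention
    stays on the graphics card.
    """
    return [
        "llama-server",
        "-m", model_path,               # hypothetical path to a GGUF file
        "-ngl", str(n_gpu_layers),
        "--n-cpu-moe", str(n_cpu_moe),  # tune until the rest fits in VRAM
        "-c", str(ctx_size),
    ]


# Example: a 35B-parameter MoE model on 12GB VRAM + 32GB RAM (assumed quantization).
if fits_in_memory(35, 4.5, vram_gb=12, ram_gb=32):
    print(shlex.join(build_llama_server_cmd("model.gguf", n_cpu_moe=24)))
```

The fit check is deliberately crude. In practice you would raise or lower the `--n-cpu-moe` value until the GPU-resident part of the model just fits in VRAM, since every expert layer you can keep on the card helps generation speed.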
|