I find it puzzling that Google doesn't actively promote its own cloud for Gemma 4 inference. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?
these 2026-05-06 01:49
Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to let me enable it.
Farmadupe 2026-05-06 01:50
I wonder if, for a model that small with a permissive license, it might not be worth their time to host a commercial-grade inference stack? Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card. Also speculating, but I wonder if it might create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs? As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27B and 35B) are not at all worth using, as there are way bigger and better models at a lower price on a per-token basis.
disiplus 2026-05-06 01:54
Nice, will run it later against Qwen3.6 27B. The speed was one of the reasons why I was running Qwen and not Gemma. The difference was big; there is some magic that happens when you have more than 100 tps.
dvt 2026-05-06 01:59
It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while. [1] https://github.com/ml-explore/mlx-lm/pull/990 [2] https://github.com/ggml-org/llama.cpp/pull/22673
zdw 2026-05-06 02:00
MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon. The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
dakolli 2026-05-06 02:04
yet, still mostly useless.
skybrian 2026-05-06 02:07
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
svachalek 2026-05-06 02:08
I've gotten it to work with other models. They've got to be perfectly aligned usually, in terms of provider, quantization etc. Might be a bit before you can get a matched set.
EGreg 2026-05-06 02:15
How does this get added in practice?
shay_ker 2026-05-06 02:18
curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
Havoc 2026-05-06 02:19
Normally when LM Studio doesn't like it, it's because of the presence of mmproj files in the folder. Sometimes removing them helps it show up. They're somehow connected to vision and block speculative decoding... don't ask me how or why, though. For Gemma specifically, I had more luck with speculative decoding via the llama-server route than with LM Studio.
WhitneyLand 2026-05-06 02:19
Yeah, it's important conceptually to remember that MTP is kind of just more weights, while speculative decoding is the runtime algorithm, which is a significant addition to whatever code is serving the model.
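To make the "runtime algorithm" part concrete, here is a minimal sketch of greedy speculative decoding. Everything in it is hypothetical and for illustration only: `toy_target` and `toy_draft` are stand-in functions rather than real models, and `speculative_decode` is not the API of llama.cpp, LM Studio, or any other server. It assumes greedy decoding; sampling-based variants use an accept/reject rule on probabilities instead of an exact token match.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function: token sequence -> next token (greedy pick).
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target checks every proposed position. A real server does this
        #    in ONE batched forward pass; the loop below only imitates that.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)       # draft guessed right, keep it
            else:
                accepted.append(expected)  # first miss: take the target's token
                break
        else:
            # All k drafts matched, so the same pass also yields one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], rounds


# Hypothetical toy models: the draft agrees with the target except when the
# context length is a multiple of 5.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 7 + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    t = toy_target(seq)
    return t if len(seq) % 5 else (t + 1) % 50


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
    print(tokens == greedy_decode(toy_target, [1, 2, 3], 20), rounds)
```

Every token that ends up in the output is one the target itself would have picked, which is why the greedy variant gives the same text as plain decoding; the speedup comes from needing fewer target passes than tokens whenever the draft guesses well.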
nalinidash 2026-05-06 02:20
technical details are here: https://x.com/googlegemma/status/2051694045869879749
Havoc 2026-05-06 02:21
There is a decent YouTube video here going through what Google's logic with Gemma overall might be: https://www.youtube.com/watch?v=sXgZhGzqPmU As for why their cloud offers it, I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.
AlphaSite 2026-05-06 02:22
Yes. Make sure you’re not using the Gemma sparse models since they don’t have a small model to use. Also I removed all the image models from the workspace.
zargon 2026-05-06 02:23
It's the same thing, but Google removed the MTP heads from the original safetensors release. (They were not removed from the LiteRT format.)
pu_pe 2026-05-06 02:26
So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?