Accelerating Gemma 4: Faster Inference with Multi-Token Prediction Drafters

mchusma 2026-05-06 01:41

I find it puzzling that Google doesn't actively promote its own cloud for inference of Gemma 4. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?

these 2026-05-06 01:49

Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to allow me to enable it.

Farmadupe 2026-05-06 01:50

I wonder if, for a model that small with a permissive license, it might not be worth their time to host a commercial-grade inference stack? Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card. Also speculating, but I wonder if it might create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs? As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27B and 35B) are not at all worth using, as there are way bigger and better models at a lower price on a per-token basis.

disiplus 2026-05-06 01:54

Nice, will run it later against qwen3.6 27b. The speed was one of the reasons why I was running Qwen and not Gemma. The difference was big; there is some magic that happens when you have more than 100 tps.

dvt 2026-05-06 01:59

It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while. [1] https://github.com/ml-explore/mlx-lm/pull/990 [2] https://github.com/ggml-org/llama.cpp/pull/22673

zdw 2026-05-06 02:00

MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon. The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.

dakolli 2026-05-06 02:04

yet, still mostly useless.

skybrian 2026-05-06 02:07

Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.

svachalek 2026-05-06 02:08

I've gotten it to work with other models. They've got to be perfectly aligned usually, in terms of provider, quantization etc. Might be a bit before you can get a matched set.

EGreg 2026-05-06 02:15

How does this get added in practice?

shay_ker 2026-05-06 02:18

curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...

Havoc 2026-05-06 02:19

Normally when LM Studio doesn't like it, it's because of the presence of mmproj files in the folder. Sometimes removing them helps it show up. They're somehow connected to vision and block speculative decoding... don't ask me how/why though. For Gemma specifically, I had more luck with speculative decoding via the llama-server route than with LM Studio.

WhitneyLand 2026-05-06 02:19

Yeah important conceptually to remember MTP is kind of just more weights, but speculative decoding is the runtime algorithm that’s a significant add to whatever code is serving the model.
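
To make that split concrete, here is a minimal sketch of the runtime side: greedy speculative decoding over toy stand-in models. Everything named here (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_drafter`) is hypothetical and written for illustration; it is not how llama.cpp, LM Studio, or Google's stack implement it. A real server verifies all drafted positions in one batched forward pass, and sampling-based decoding needs a probabilistic accept/reject rule that this sketch leaves out.

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, drafter: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the drafter proposes k tokens, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. The target checks each drafted position (sequential here for clarity).
        for i in range(k):
            expected = target(out + draft[:i])
            if draft[i] != expected:
                # Keep the agreed prefix, then substitute the target's own token.
                out.extend(draft[:i])
                out.append(expected)
                break
        else:
            # Every drafted token was accepted; the target adds one bonus token.
            out.extend(draft)
            out.append(target(out))
    return out[:goal]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_drafter(prefix: List[int]) -> int:
    """Hypothetical stand-in for the small drafter: wrong on every third position."""
    guess = toy_target(prefix)
    return (guess + 1) % 50 if len(prefix) % 3 == 0 else guess


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_drafter, [1, 2, 3], 20))
```

In this greedy setting every emitted token is one the target would have chosen itself, so the output matches plain decoding with the target alone; the drafter only changes how many target passes are needed.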

nalinidash 2026-05-06 02:20

technical details are here: https://x.com/googlegemma/status/2051694045869879749

Havoc 2026-05-06 02:21

There is a decent YouTube video here going through what Google's logic with Gemma overall might be: https://www.youtube.com/watch?v=sXgZhGzqPmU As for why their cloud offers it, I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.

AlphaSite 2026-05-06 02:22

Yes. Make sure you’re not using the Gemma sparse models since they don’t have a small model to use. Also I removed all the image models from the workspace.

zargon 2026-05-06 02:23

It's the same thing, but Google removed the MTP heads from the original safetensors release. (They were not removed from the LiteRT format.)

pu_pe 2026-05-06 02:26

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?
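
A rough way to reason about how much faster: if each drafted token is accepted independently with probability alpha (a simplifying assumption, not a measured Gemma 4 figure), the expected number of tokens produced per target-model pass with k drafted tokens is (1 - alpha^(k+1)) / (1 - alpha). The sketch below just evaluates that expression; the function name and the alpha values are illustrative, and it ignores the drafter's own compute cost.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass, assuming i.i.d. acceptance rate alpha.

    alpha: assumed probability that a single drafted token is accepted.
    k:     number of tokens the drafter proposes per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # All k drafts accepted every time, plus the target's bonus token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative acceptance rates only; real values depend on model pair and workload.
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

The takeaway is that the gain depends heavily on how often the drafter agrees with the target, which is why a well-matched drafter matters more than a long draft length.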
