Status: Seedling
May 2026

LLM Experiments #3: Noodling

Note: Experimenting with quick writeups as I play around with LLMs and my homegrown agentic workflow system. Previously.

Nothing much to write about today, but here are some things I’ve explored since my last post.

Gemma 4 has goblin tokens

Gemma 4 tokens can be seen in tokenizer.json here. I took a look, and it turns out that Gemma has a “ Yuri” token: "▁Yuri": 142143, as well as a few goblin-related tokens: "▁goblin": 218798, "▁Goblin": 171680, "oblins": 236144 (the underscore-looking character encodes a space). The tokenizer is BPE (line 311 of tokenizer.json), so I wonder what the presence of these tokens says about the frequency of goblins in the training data. Probably not much, since tacos, Tocco, getColumn, StudentVector, flipping, counselling, and combustibles are all tokens too, along with other infrequent words. There are multiple tokens for each of Google and OpenAI, and tokens for anthrop and ic.
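
If you want to poke at the vocabulary yourself, here’s a minimal sketch, assuming you’ve downloaded the model’s tokenizer.json locally (the path is illustrative):

    import json

    # tokenizer.json stores the BPE vocabulary as a token-string -> ID map.
    with open("tokenizer.json") as f:
        vocab = json.load(f)["model"]["vocab"]

    # Find every goblin-adjacent token ("▁" encodes a leading space).
    for token, token_id in vocab.items():
        if "oblin" in token:
            print(token, token_id)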

The existence of these tokens implies that you can use llama-server’s --logit-bias option to selectively upweight goblins, for example, or steer the model into profanity by increasing the probability of bastard.
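
As a sketch of what that might look like: llama-server’s /completion endpoint accepts a logit_bias field as a list of [token_id, bias] pairs. This assumes a server running on localhost:8080 with the Gemma vocabulary above, and the bias value of 5.0 is illustrative, not tuned.

    import json
    import urllib.request

    body = json.dumps({
        "prompt": "Once upon a time,",
        "n_predict": 64,
        # Upweight "▁goblin" (ID 218798 in the tokenizer dump above).
        "logit_bias": [[218798, 5.0]],
    }).encode()
    req = urllib.request.Request(
        "http://localhost:8080/completion",
        body,
        {"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["content"])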

You need to know the goal to refine the task

I relearned this lesson today, this time in the context of optimizing data pipelines whose outputs are subjective.

When you’re just starting out, it can be hard to know what you want precisely enough to describe it to a small language model: small models fail loudly and visibly, forcing you to specify ever more procedural detail in your prompt. Big models, in contrast, are often right enough that you can defer that tweaking until you know more about how the results will be applied.

Sampling in llama-server

One way to control model output is by changing your sampling strategy. llama-server has many options for this. Many strategies are represented as functions that accept and return a probability distribution, and can thus be composed. Others, like adaptive-p, are intended to be used as the last sampler in the chain:

adaptive-p selects a token ID rather than just mutating candidates, so it must be last in the sampler chain. It shares this behaviour with some existing samplers like mirostat, dist, and greedy (mirostat being the closest relative).

Lots of good descriptions in the completions tool readme.
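
As a mental model (my own sketch of the idea, not llama-server’s actual implementation): the filter samplers are distribution-to-distribution functions, and the terminal sampler collapses the distribution to a token ID.

    import numpy as np

    def top_k(k):
        # Filter sampler: distribution in, distribution out, so it composes.
        def apply(probs):
            out = np.zeros_like(probs)
            keep = np.argpartition(probs, -k)[-k:]  # k most likely tokens
            out[keep] = probs[keep]
            return out / out.sum()                  # renormalize
        return apply

    def temperature(t):
        # Another filter: reshape the distribution, return a distribution.
        def apply(probs):
            logits = np.log(probs + 1e-12) / t
            e = np.exp(logits - logits.max())
            return e / e.sum()
        return apply

    def sample_dist(probs):
        # Terminal sampler: returns a token ID, so nothing can come after it.
        return int(np.random.default_rng().choice(len(probs), p=probs))

    probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
    for f in (top_k(3), temperature(0.8)):  # composable stages
        probs = f(probs)
    token_id = sample_dist(probs)           # last in the chain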

Gemma 4 26B is pretty good at extracting emotion-subject pairs

Gemma 4 does surprisingly well, but has false positives, e.g. failing in simple cases like “I want my pilot to lie to me.”, which returns { subject: "pilot", emotion: "desire" }, where the referent of the emotion is inaccurate (and I don’t think wanting is an emotion, though it’s apparently debatable). Other edge cases involve tracking the subject through nested conversations; e.g., it emitted { subject: "speaker", emotion: "ridicule" } for the following quote:

And the funny thing is I own Final Cut Pro. I could just click, click, click and make it happen, but I’m like, but I want to figure it out using ffmpeg. And somebody could objectively say, "You idiot, it’s going to take you an hour to do it this way. It would take a minute to do it this way." I’m like, "But this makes me happier." And that matters.

Whether or not this is acceptable depends on what you’re planning to do with it, but to me, a quote where the speaker imagines a critic ridiculing them feels different from one in which the speaker directly ridicules themselves.

I wonder if, rather than trying to anticipate all possible edge cases in the prompt, a generator-verifier pattern might be useful here. I have an unvalidated intuition that decomposing tasks into workflows can be a way to work with the grain of a model, and, more generally, that designing robust pipelines that inherently compensate for expected failure may be a better strategy than trying to engineer it out.
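
Here’s a rough sketch of what that pattern might look like, assuming a llama-server exposing the OpenAI-compatible /v1/chat/completions endpoint on localhost:8080. The prompts and helper names are hypothetical, and a real verifier would need more care than a YES/NO check.

    import json
    import urllib.request

    URL = "http://localhost:8080/v1/chat/completions"

    def chat(system, user):
        body = json.dumps({
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            "temperature": 0,
        }).encode()
        req = urllib.request.Request(URL, body, {"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    def extract_pairs(quote):
        # Generator: propose candidate emotion-subject pairs.
        return chat(
            'Extract emotion-subject pairs from the quote as a JSON list of '
            '{"subject": ..., "emotion": ...} objects.',
            quote,
        )

    def verify_pairs(quote, pairs):
        # Verifier: second pass that checks each pair against the quote,
        # e.g. whether the emotion is really held by the stated subject.
        return chat(
            "Given a quote and candidate emotion-subject pairs, answer YES if "
            "every pair attributes the emotion to the right subject, "
            "otherwise NO with a reason.",
            f"Quote: {quote}\nPairs: {pairs}",
        )

    quote = "I want my pilot to lie to me."
    pairs = extract_pairs(quote)
    print(verify_pairs(quote, pairs))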
