Status: Seedling
May 2026

LLM Experiments #2: Eval Planning

Note: Experimenting with quick writeups as I play around with LLMs and my homegrown agentic workflow system. Previously.

LLM costs scale with request volume, and it turns out that my extractive summarization approach is much too expensive when run at scale on frontier-ish cloud models. So I’m seeing if I can get a small off-the-shelf model, Gemma 4 E2B, to do the task more efficiently.

Since this is a side project, I’ve been experimenting with letting an agent autonomously drive both the creation and optimization of the pipeline. I view my job as giving the agent a good home: easy access to tools for exploring data, debugging code, and taking performance measurements.

As part of this, the workflow system itself is event-sourced and tracks provenance for every data output, making it effortless to connect outputs back to their inputs and intermediate results.
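For concreteness, here’s a minimal sketch of what one of these provenance events could look like. The event kinds, field names, and file layout are my own invention for illustration, not the system’s actual schema:

import json
import time
import uuid

def emit_event(log_path, kind, inputs, output, params=None):
    """Append one provenance event to an append-only JSONL log.

    Hypothetical schema: each event records which input artifacts
    produced which output artifact, so any result can be traced
    back through its intermediate steps."""
    event = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "kind": kind,            # e.g. "summarize", "score"
        "inputs": inputs,        # ids of upstream artifacts
        "output": output,        # id of the artifact produced
        "params": params or {},  # model, prompt version, etc.
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Example: a summarization step consuming one source document.
emit_event("events.jsonl", "summarize",
           inputs=["doc-123"], output="summary-456",
           params={"model": "gemma-4-E2B", "prompt": "v3"})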

One point of concern is that extractive text compression doesn’t have an obvious scoring function. Fortunately, language models are pretty good at language, so I’m letting Opus 4.7 judge the quality of the output, with small bits of targeted feedback from me. For inspiration, I did have it read John McPhee’s Draft No. 4.
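Roughly, the judging step looks something like the sketch below. The rubric, the judge_summary helper, and the model id are all placeholders of mine, not the actual prompt:

# A rough sketch of the LLM-as-judge step using the Anthropic SDK.
import anthropic

client = anthropic.Anthropic()

RUBRIC = """Score this extractive summary from 1-5 on:
- faithfulness: every sentence appears verbatim in the source
- coverage: the key points of the source are retained
- coherence: the selected sentences read well together
Reply with JSON: {"faithfulness": n, "coverage": n, "coherence": n}"""

def judge_summary(source: str, summary: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder model id
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}",
        }],
    )
    return response.content[0].text  # JSON scores for the agent to parse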

In terms of technology, I’ve installed llama-cli and am running with enough parallelism to almost saturate the GPU at peak load, as measured by mactop. Beyond that, the only tuning I’ve done is making sure we’re caching common prefixes across prompts:

{
    "model": "ggml-org/gemma-4-E2B-it-GGUF",
    "messages": [],
    "cache_prompt": true,
    "max_tokens": 2048,
    "response_format": {}
}
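That payload has the shape of a request to llama.cpp’s OpenAI-compatible chat endpoint. A minimal way to send it might look like the following; the port, prompt, and timeout are assumptions on my part:

# Sketch of posting the payload above to a local llama.cpp server,
# assuming it's listening on the default port 8080.
import requests

paragraph = "An example paragraph to compress..."
payload = {
    "model": "ggml-org/gemma-4-E2B-it-GGUF",
    "messages": [
        {"role": "user", "content": "Compress: " + paragraph},
    ],
    "cache_prompt": True,  # reuse the KV cache for shared prompt prefixes
    "max_tokens": 2048,
}
r = requests.post("http://localhost:8080/v1/chat/completions",
                  json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])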

To look at the results I had the agent build me a viewer that lets me browse the iterative simplifications and easily leave individualized feedback on sentences or paragraphs whose outputs look incorrect. The feedback goes into a file that the agent looks at to guide its optimization.
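The feedback file is just an append-only list of notes keyed to specific outputs. Something like the shape below; the fields are my guess at a reasonable schema, not the viewer’s actual format:

# Hypothetical shape of the feedback file: one JSON object per line,
# keyed to the provenance id of the output being critiqued.
import json

entry = {
    "output_id": "summary-456",  # ties back to the provenance log
    "span": "paragraph 3",       # where in the output the problem is
    "note": "dropped the key caveat about sample size",
}
with open("feedback.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")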
