Lateral and Spatial Reasoning via Crosswords
Hypothesis and the Home Lab
This article is part of a series. Read other articles in the series here.
Hypothesis
I've recently been reading The Scaling Era: An Oral History of AI, 2019–2025. Superficially, it is a collection of selected excerpts from Dwarkesh Patel's podcast. Some people might find that annoying, but I don't, because those excerpts are, in essence, Socratic dialogues. It's fascinating to read transcripts of key figures in the LLM space attempting to reason about things that were happening around them.
One section in particular caught my attention. François Chollet, the creator of ARC-AGI, was arguing that LLMs store "a set of program templates" that are used to find solutions to known problems. Furthermore, he claimed that storing program templates doesn't constitute general intelligence. Rather, "[g]eneral intelligence is the ability to approach any problem, any skill, and very quickly master it using very little data."

François really has an incredible intuition for this stuff. Look at the puzzle side of this ARC-AGI question. Subtle additions like the need to have three blobs touch at once are obvious to humans, but would trip a model up on exactly the limitation he mentions in the interview.
This quote is a few years old now, and I think François has been vindicated by Anthropic's and OpenAI's success in applying RL to long-horizon agentic coding tasks. Coding ability improved because tools like Claude Code supplied tons of training data that paved the way for long-horizon work. The procedures were learned over that long-horizon timeframe; the models didn't pick up these tasks organically without long training traces.
Regardless, I still find his point fascinating, because I know of a benchmark that would be perfect for testing his claim: crosswords. Crosswords require combining spatial reasoning with fact recall and a variety of lateral reasoning categories. Additionally, crossword-solving methodology is unlikely to be in the training data of existing LLMs. If François is right, I don't think language models will be able to reliably solve a crossword puzzle without fine-tuning on related data.

Check out 32 Across. The theme hints all involve combining a color with a food. 32 Across combines both the color and the food into one word.
The Home Lab
So last December, I got my hands on a Framework Desktop. I loaded it up with 128GB of RAM and AMD's sweet iGPU that can consume all of that RAM on the cheap (relatively speaking). I installed Omarchy and got ready to rip some tokens!
That was the plan anyway. The process of setting up the machine for inference was harder than I thought. For one, I tried to etch the installer onto a combo USB-A / USB-C drive. Apparently, those sorts of drives are usually treated as fixed disks rather than removable media, and the boot media needs to be seen as removable by the computer, so that's a non-starter. At least, that's what ChatGPT told me. I'll be honest, I tried to find a primary or even non-LLM secondary source on the topic, but all I was given by our LLM overlords was obscure UEFI spec info and handwaving about manufacturer idiosyncrasies. Frankly, I'm not 100% certain of the accuracy of that explanation, because I had another problem that was blocking my ability to install the OS.
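For what it's worth, you can at least see how Linux classifies a drive via lsblk's removable flag; whether the firmware agrees is another question. Something like:

lsblk -o NAME,RM,SIZE,TRAN,MODEL
# RM = 1 means the kernel treats the device as removable media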
I hadn't disabled Secure Boot. To my discredit, disabling Secure Boot was mentioned multiple times by the LLM and by the official Framework Guide. The guide even gives explicit instructions on why this is necessary and how to do it. C'est la vie. Sometimes you get LLM psychosis, sometimes the LLM psychosis gets you. I should have read the guide first.

It was right there, highlighted in yellow! What did I need, a video of Subway Surfers playing next to it?
Once I got the OS actually running, I installed Ollama for local inference. Ollama has fallen out of favor somewhat, but it serves my needs. I don't have to worry about distributing inference, since I have 128GB of RAM soldered onto the motherboard, which sidesteps one of the major complaints people have about Ollama. It also ships with a pre-built API, which saves me the trouble of connecting a more sophisticated inference engine to a web server. Finally, it's low configuration and comes with a directory of quantized models that fit on my machine. None of that is optimal for maximizing token throughput or scoring well on highly refined benchmarks like MMLU, but this experiment isn't about that, so Ollama fits my needs perfectly.
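As a sketch of what the pre-built API buys you: once the service is running, generating text is a single POST to the local endpoint. The model name below is just a placeholder for whatever you've pulled.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Give a five-letter answer for the clue: capital of France",
  "stream": false
}'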
Ollama is low configuration, but not zero configuration. I learned that quickly, along with how to use systemctl and journalctl. Although AMD's Ryzen AI MAX+ 395 chip is powerful, it does come with unique restrictions. Namely, at least when I set this up, it couldn't take advantage of AMD's CUDA competitor, ROCm. That, in conjunction with the iGPU approach, causes some issues.
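For the record, "learning systemctl and journalctl" here mostly comes down to a couple of commands, assuming the default ollama service name from the Linux installer:

systemctl status ollama    # is the service running, and what did it last complain about?
journalctl -u ollama -f    # follow the service logs live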
In my experience, Ollama tried to start up using ROCm. Presumably it identified the AMD hardware and figured ROCm would be a fit, but that doesn't work with this iGPU setup. I'm sure there are a number of practical concerns, but the one I ran into was dynamic RAM allocation: with Ollama insisting on ROCm, my computer wouldn't allocate any memory to the iGPU, so Ollama complained about having insufficient RAM and fell back to the CPU, where all 128GB of RAM were available.
I didn't notice any of this until I got btop running. First, the CPU was running high, which seemed wrong. Then I saw that the GPU wasn't reporting any telemetry at all; installing rocm-smi-lib solved that. Don't ask me how I figured that out, I don't even know; by then I was in trial-and-error mode. Once I had good GPU telemetry, it was clear that the GPU wasn't running at all.
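If you're on an Arch-based setup like Omarchy, both tools should be a pacman install away (the package names here are my assumption; check your repos):

sudo pacman -S btop rocm-smi-lib    # btop for the dashboard, rocm-smi-lib for AMD GPU telemetry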

This screenshot was taken around the time that the crossword agent had made the decision to ruminate on the puzzle for 20 minutes rather than actually using one of its tools.
To fix that, I adjusted the Ollama service. I added one environment variable to enable Vulkan support and another to turn on debug logs. The logs helped me diagnose that it was trying to start with ROCm and running out of memory as a consequence. Here are the lines I added to my service configuration.
Environment=OLLAMA_DEBUG=1
Environment=OLLAMA_VULKAN=1
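One way to apply these, if you're following along, is a systemd override drop-in rather than editing the unit file itself. A sketch, again assuming the stock ollama service:

sudo systemctl edit ollama       # opens a drop-in override; add the Environment lines under [Service]
sudo systemctl restart ollama    # restart so the new environment takes effect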
Setting the Vulkan environment variable alone didn't fix the problem. The enhanced logs indicated that, for some reason, Ollama still prioritized ROCm over Vulkan. I tried a few clean fixes, but ultimately deleting the ROCm strategy from the Ollama source did the trick. I know that's an inelegant solution, but it worked. Once Ollama actually ran on Vulkan, the iGPU was correctly allowed to have RAM allocated to it.
Quick aside here: Writing this article led me to realize that the AI MAX+ 395 chip does support ROCm. A cursory look at things indicates that this might be a new development, and that it is a bit slow. I'll have to do more research to get a definitive answer.
The last setup issue came when trying to load two models simultaneously. By default, Ollama holds only one set of weights in RAM at a time. To get two models into RAM at once, I had to set another environment variable in the service definition.
Environment=OLLAMA_MAX_LOADED_MODELS=2
Setting the max models environment variable alone didn't solve my problem. Dynamically allocated iGPU RAM doesn't play well with holding multiple models in memory at once: even with the limit set to two, Ollama would still evict the only model in memory because it detected insufficient RAM. I learned that the iGPU's RAM allocation is configurable in the BIOS, so I changed it from auto to a fixed 96GB. That fixed the model eviction problem, and btop finally started showing available GPU RAM correctly.
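With that in place, a quick way to confirm both models stay resident is to send a prompt to each and then check what's loaded. The model names here are just examples standing in for whichever two you've pulled:

ollama run llama3.1 "warm up"
ollama run qwen2.5 "warm up"
ollama ps    # should now list both models as loaded instead of evicting the first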
With Ollama correctly configured, I was ready to start testing. In the next part of this series, I'll discuss the crossword agent harness that I hooked up to this local inference setup.
