Distilling Race Predictions: From LLM to a 5KB Model That Runs in a Millisecond
Here's something I didn't expect: if you give a good LLM a structured summary of someone's recent training — interval paces, tempo distances, long run heart rates, weekly mileage, and a few other signals — and ask it to predict their 5K, 10K, half marathon, and marathon times, it can actually do a decent job.
Not perfect. Not magical. But useful.
I noticed this early while building Hypla, the hybrid training app. Feed an LLM a set of training features like interval pace, tempo pace, long run distance, weekly volume, or heart rate ratios, and it returns plausible race predictions. The standard Riegel formula gets you into the right general area, but an LLM can do something different: it can take several signals into account at once. A runner doing 4:30/km intervals on 25 km per week is a different athlete from one doing the same intervals on 60 km per week. Same interval pace, different likely outcome. LLMs tend to pick up on that distinction reasonably well.
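For context, the Riegel formula is a single power law: one known race performance scaled by a distance ratio raised to a fixed exponent. That is why it cannot see weekly volume at all, no matter what the rest of the training log says. A minimal sketch:

```python
def riegel(known_time_s: float, known_dist_km: float, target_dist_km: float) -> float:
    """Riegel's formula: predicted time scales with the distance ratio
    raised to a fixed fatigue exponent (1.06 is the commonly used value)."""
    return known_time_s * (target_dist_km / known_dist_km) ** 1.06

# A 22:00 (1320 s) 5K predicts roughly a 45:52 10K -- regardless of mileage.
print(riegel(1320, 5, 10))  # ≈ 2752 s
```

The formula has exactly one input performance, which is the limitation the multi-signal LLM approach gets around.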
You can use Claude for this. You can use ChatGPT for this. Probably others too. The point is not one specific model. The point is that modern LLMs are surprisingly capable at mapping structured training summaries to plausible race predictions.
The Problem: This Is Expensive
So great, just call an LLM every time a user opens the dashboard?
Not really.
Every dashboard load would mean an API call. You send the workout summary, the system prompt, maybe some context, and get back four predicted race times. That might work at small scale, but it is still a heavyweight solution to what is, at the end of the day, a simple numerical mapping: around a dozen input features to four output values.
There are workarounds. Cache predictions. Only recompute when workouts change. Batch things in the background. But all of that is really just compensating for the fact that a large model is being used for a small prediction problem.
This is not really a language problem. It is a prediction problem.
And that is exactly where a small model should win.
Teach the Small Model Once, Then Use It Forever
That led me to a pretty simple idea.
If the LLM is already giving reasonable predictions, maybe I do not need to keep calling it in production. Maybe I can instead train a tiny model to copy the pattern in those predictions.
In other words: use the big model once to learn the mapping, then hand the job over to a much smaller one.
That basic idea has a name in machine learning: distillation.
The big model acts as the "teacher." The small model acts as the "student." The student does not need to understand running in any deep sense. It just needs to learn the relationship between the training inputs and the race-time outputs well enough to mimic the teacher.
That is the important mental shift. The LLM is not necessarily the thing you deploy. It is the thing you learn from.
The Data Problem
But there was a problem: to train the small model, I needed data.
And I did not really have any.
Or more precisely, I had my own training data, which is one athlete. That is obviously not enough to train anything generalizable. The ideal solution would be a large dataset of real runners with structured training logs and known race performances. Big platforms may have that. I do not.
So I simulated the data.
I wrote a Python script that generates synthetic athlete profiles across a wide range of ability levels, from relatively casual runners to quite fast ones. Each synthetic profile gets a plausible combination of features: interval pace, tempo pace, weekly mileage, long run distance, heart rate ratios, and so on. The goal was not to recreate real people. The goal was to create realistic training profiles that cover the space of inputs the app is likely to see.
Then came the key step: I passed these synthetic profiles to an LLM and asked it to predict race times.
That gave me a dataset of examples. The inputs were synthetic, but plausible. The outputs were not real race results. They were the LLM's judgments about what those profiles would likely run.
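The labeling step is a loop over profiles: serialize each one into a prompt, ask the model for race times in a fixed JSON shape, and parse the reply. In the sketch below, `ask_llm` is a stand-in stub for whatever provider SDK you use (it returns a canned reply here so the example runs); the output key names are my assumption, not a fixed convention.

```python
import json

def ask_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call. Swap in your provider's
    client; the canned JSON reply below just makes the sketch runnable."""
    return '{"5k_s": 1320, "10k_s": 2760, "half_s": 6120, "marathon_s": 12900}'

def label_profile(profile: dict) -> dict:
    """Ask the teacher LLM to label one synthetic profile with race times."""
    prompt = (
        "Given this runner's recent training summary, predict their race times. "
        "Reply with JSON only, keys 5k_s, 10k_s, half_s, marathon_s (seconds):\n"
        + json.dumps(profile)
    )
    return json.loads(ask_llm(prompt))

labels = label_profile({"interval_pace_s_per_km": 270, "weekly_km": 60})
```

Forcing the model to emit a strict JSON shape is what makes thousands of labels cheap to collect and trivial to feed into training.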
That distinction matters. I am not claiming to have discovered ground truth. I am training a small model to approximate the predictions of an LLM that seems to do a decent job on this task.
For my purposes, that is enough.
Training the Student
Once that dataset existed, the rest was pretty standard.
I trained a small neural network in Python using scikit-learn, with standardized inputs and outputs. Two hidden layers. Nothing fancy. The whole training process takes seconds on a laptop.
The resulting model is tiny. A few thousand parameters. The exported weights are just a small JSON blob. The production inference code is a single TypeScript file: normalize the inputs, run them through the learned weights, convert the outputs back into race times, and clamp them to sane ranges.
No TensorFlow. No ONNX. No Python in production. Just a lightweight numerical forward pass in plain TypeScript.
Inference takes less than a millisecond. It costs nothing. It can run on every page load without anyone noticing.
Does It Work?
The honest answer is: yes, well enough for what I want.
It is not as nuanced as querying a strong LLM directly. The larger model handles odd cases and unusual combinations better. But the distilled model is close enough to be useful in practice, and that is the whole point.
It captures the main relationships you would want it to capture. More mileage generally pushes predictions down. Faster interval work helps. Longer tempo runs matter. Long runs matter more for longer race distances. The model also reflects interactions that simple formulas miss. Two athletes with the same interval pace but very different training volume should not get the same marathon prediction, and the model generally behaves accordingly.
That is exactly the tradeoff I was after: give up a bit of nuance, gain instant and free inference.
The Pattern Is General
What I like about this project is that the pattern seems much more general than this specific use case.
If an LLM can make decent judgments from your structured inputs, you do not have a large labeled dataset, you need predictions to be fast and effectively free at inference time, and the final input-output mapping is simple enough, then this approach is worth considering.
Use the LLM as the teacher. Generate plausible inputs. Let the model label them. Train a small student. Deploy the student.
Race prediction is just one example. VO2 max estimation. Training zone updates from pace, volume, and heart rate patterns. Threshold pace estimation. Fitness score calculation. These are all basically the same kind of problem: structured inputs, numerical outputs, cheap inference preferred.
The Tooling
I ended up packaging the pipeline into a small open-source toolkit.
It covers the full path: generate or assemble structured examples, label them with an LLM, train a small model in Python, export the weights to JSON, and run inference in TypeScript with essentially no overhead.
The TypeScript side is deliberately minimal. It works in the browser, in Node, and in edge environments. No heavy runtime, no binary dependencies, nothing fragile in deployment.
That was part of the appeal for me. Once the model is trained, the whole system becomes boring in the best possible way.
The Bigger Point
There is a tendency right now to treat LLMs as the final layer in every product. You have a problem, so you call the API. Sometimes that is exactly right.
But not always.
For a lot of narrower prediction tasks, the LLM is not the thing you actually want to deploy. It is the thing you use once to help build something much smaller, faster, and cheaper.
That feels like an important distinction.
The expensive model is the teacher. The cheap model is the product.
Train on the teacher's judgment. Deploy the student. Pay for the teacher once. Run the student forever.
Ready to put this into practice?
Get your training plan →