What a Domain-Expert Language Model Can (and Can't) Teach an Agent

It's hard not to be intrigued by the rise of Agentic AI. The idea that networks of LLMs can coordinate to make relatively sound decisions on ambiguous problems is undeniably compelling. Yet it raises a lingering question: how reliable are these decisions when they are ultimately grounded in word-based models of the world?
LLMs are trained on linguistic correlations, not on direct interaction with real-world processes. They can sound confident in reasoning but stumble with numbers or dynamics that demand more than pattern recognition. That limitation made us reflect on whether this "new" paradigm is actually new at all.
In one sense it isn't: Reinforcement Learning has long embodied the agent paradigm, grounded in feedback loops where agents learn by doing. So we found ourselves asking a sharper question: is there a way to combine the grounded, environment-aware learning of RL with the strategic reasoning of LLMs?
The Inspiration
The paper "A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health" (arXiv:2402.14807) tested exactly this idea. It used an LLM planner to influence RL behaviour during training, treating language as a vehicle for injecting domain preferences into a learning agent.
Inspired by that approach, we ran a similar experiment on a deliberately simple problem — retail inventory management — to see whether an LLM could act as a domain expert guiding an RL agent through reward shaping.
The Setup
We built a weekly inventory environment for a single product, with all the messy dynamics you'd expect: seasonal demand modelled as a non-stationary Poisson process, stochastic lead times of one to two weeks, holding costs, stockout penalties, and fixed plus variable ordering costs. The agent's job was to decide how much to order each week to minimise total cost.
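To make those dynamics concrete, here is a minimal sketch of what such an environment could look like. It is not the notebook's code: the class name `InventoryEnv`, every cost value, the sinusoidal seasonality, and the order of events within a week are assumptions chosen for illustration, and it uses only the standard library.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class InventoryEnv:
    """Toy weekly single-product inventory environment (illustrative values only)."""
    holding_cost: float = 1.0       # per unit left in stock at week end (assumed)
    stockout_cost: float = 5.0      # per unit of unmet demand (assumed)
    fixed_order_cost: float = 10.0  # charged whenever an order is placed (assumed)
    unit_order_cost: float = 2.0    # per unit ordered (assumed)
    base_demand: float = 20.0       # mean weekly demand before seasonality (assumed)
    seasonal_amplitude: float = 0.5
    seed: int = 0
    week: int = 0
    on_hand: int = 40
    pipeline: list = field(default_factory=list)  # (arrival_week, quantity) pairs

    def __post_init__(self):
        self.rng = random.Random(self.seed)

    def demand_rate(self, week):
        # A yearly sine wave makes the Poisson rate non-stationary.
        angle = 2 * math.pi * week / 52
        return self.base_demand * (1 + self.seasonal_amplitude * math.sin(angle))

    def _poisson(self, lam):
        # Knuth's multiplication method; adequate for the small rates used here.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= threshold:
                return k
            k += 1

    def step(self, order_qty):
        # 1. Receive any orders due this week.
        arrived = sum(q for w, q in self.pipeline if w <= self.week)
        self.pipeline = [(w, q) for w, q in self.pipeline if w > self.week]
        self.on_hand += arrived

        # 2. Place the new order with a stochastic lead time of one or two weeks.
        order_cost = 0.0
        if order_qty > 0:
            lead_time = self.rng.choice([1, 2])
            self.pipeline.append((self.week + lead_time, order_qty))
            order_cost = self.fixed_order_cost + self.unit_order_cost * order_qty

        # 3. Realise demand and serve what we can from stock.
        demand = self._poisson(self.demand_rate(self.week))
        sold = min(self.on_hand, demand)
        unmet = demand - sold
        self.on_hand -= sold

        # 4. Base cost: holding + stockout + ordering.
        cost = (self.holding_cost * self.on_hand
                + self.stockout_cost * unmet
                + order_cost)
        self.week += 1
        return cost, {"demand": demand, "unmet": unmet, "on_hand": self.on_hand}
```

Calling `env.step(order_qty)` once per week returns that week's cost plus a small info dictionary; an RL agent would typically receive the negative cost as its reward.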
We then layered three policies for comparison:
- A Base Stock Policy derived from the classic newsvendor critical ratio — our analytical benchmark.
- A Proximal Policy Optimisation (PPO) agent with hyperparameters tuned via Optuna — our RL baseline.
- A PPO + LLM approach, where an LLM (GPT-4o-mini) generated candidate reward-shaping functions as executable code, which were then validated, tuned, and evaluated against the baselines.
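For readers unfamiliar with the first benchmark, the newsvendor critical ratio is the stockout cost divided by the sum of stockout and holding costs, and a base stock policy orders up to the demand quantile at that ratio. The sketch below shows one way to compute it for Poisson demand. The function names, the choice of protection interval (lead time plus one review week), and the decision to ignore the fixed ordering cost are assumptions for illustration, not details taken from the notebook.

```python
import math


def critical_ratio(stockout_cost, holding_cost):
    # Newsvendor critical ratio: underage cost / (underage cost + overage cost).
    return stockout_cost / (stockout_cost + holding_cost)


def poisson_quantile(mean, prob):
    """Smallest k such that P(Poisson(mean) <= k) >= prob."""
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must be strictly between 0 and 1")
    pmf = math.exp(-mean)  # P(X = 0)
    cdf = pmf
    k = 0
    while cdf < prob:
        k += 1
        pmf *= mean / k    # recurrence: P(X = k) = P(X = k - 1) * mean / k
        cdf += pmf
    return k


def base_stock_order(inventory_position, weekly_mean, protection_weeks,
                     stockout_cost, holding_cost):
    """Order up to the critical-ratio quantile of demand over the protection interval."""
    target = poisson_quantile(weekly_mean * protection_weeks,
                              critical_ratio(stockout_cost, holding_cost))
    return max(0, target - inventory_position)


# Example: mean demand of 20 per week, covered over 3 weeks (assumed values).
order = base_stock_order(inventory_position=35, weekly_mean=20.0,
                         protection_weeks=3, stockout_cost=5.0, holding_cost=1.0)
```

With seasonal demand, `weekly_mean` would have to be updated each week, so the order-up-to level moves with the season rather than staying fixed.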
The LLM-guided pipeline ran in three phases: a reward function search, a hyperparameter optimisation pass for the top candidates, and a cross-evaluation across multiple seeds and episodes. Crucially, we kept the base cost (the operational metric we actually care about) cleanly separated from the shaping penalty (the LLM-generated nudge), so we could always benchmark policies on real-world value rather than artefacts of a changed reward signal.
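The separation between base cost and shaping penalty can be sketched as follows. Everything here is hypothetical: the candidate source string is an invented example of what an LLM might return (loosely in the spirit of an early-reordering nudge), and `load_candidate`, `validate`, `shaped_step`, and the state keys are illustrative names rather than the notebook's API. Note that restricting `__builtins__` narrows what the generated code can call but is not a real security sandbox.

```python
import math

# Hypothetical LLM output; not taken from the experiment.
CANDIDATE_SOURCE = '''
def shaping_penalty(state):
    # Penalise an inventory position below one week of expected demand.
    shortfall = state["expected_demand"] - state["inventory_position"]
    return max(0.0, shortfall)
'''


def load_candidate(source):
    """Execute generated code in a narrow namespace and return the penalty function."""
    namespace = {"__builtins__": {"max": max, "min": min, "abs": abs}}
    exec(source, namespace)
    fn = namespace.get("shaping_penalty")
    if not callable(fn):
        raise ValueError("candidate must define shaping_penalty(state)")
    return fn


def validate(fn, probe_states):
    """Reject candidates that crash or return non-finite or negative penalties."""
    for state in probe_states:
        try:
            value = fn(dict(state))
        except Exception:
            return False
        if not isinstance(value, (int, float)):
            return False
        if not math.isfinite(value) or value < 0:
            return False
    return True


def shaped_step(base_cost, state, fn, weight=0.1):
    """Combine costs for training while logging the base cost on its own."""
    penalty = weight * fn(state)
    return {
        "reward": -(base_cost + penalty),  # what the agent is trained on
        "base_cost": base_cost,            # what policies are benchmarked on
        "shaping_penalty": penalty,
    }


probes = [
    {"expected_demand": 20.0, "inventory_position": 5.0},
    {"expected_demand": 20.0, "inventory_position": 60.0},
]
penalty_fn = load_candidate(CANDIDATE_SOURCE)
if validate(penalty_fn, probes):
    record = shaped_step(50.0, probes[0], penalty_fn)
```

Because `base_cost` is logged separately from `reward`, two agents trained under different shaping functions can still be compared on the same operational metric.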

What we found
The headline result was a little deflating: in this environment, the LLM's influence was largely negligible compared to a well-tuned PPO baseline.
Across runs, Base PPO consistently performed as well as, or better than, most LLM-shaped variants. In a few isolated cases — like an Early Reordering shaping function — the LLM-generated rewards produced lower base costs than the baseline. But these wins were inconsistent, and in many trials the shaped agents slightly underperformed.
There were methodological caveats worth naming. Phase 1 pruned candidates on short training runs, which may have filtered out shaping ideas that needed longer horizons to prove themselves. Phase 2 optimised for total reward rather than base cost directly, which could have favoured policies that looked good in reward terms but weren't strictly cost-optimal. And the relatively short training runs behind Phase 3's evaluation add noise to any "win" or "loss" we observed.
But even granting those caveats, the broader signal was clear: in a simple, low-dimensional environment where traditional methods already excel, the LLM's domain reasoning gets quietly cancelled out by an RL agent that can simply learn the dynamics directly.
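The Phase 2 caveat is easy to illustrate with a toy comparison. In the sketch below, the per-seed numbers are made-up placeholders, not results from the experiment; they are chosen only so that ranking two hyperparameter trials by total reward and by base cost gives different winners. The names `TRIALS`, `summarise`, and `rank` are likewise hypothetical.

```python
from statistics import mean, stdev

# Placeholder values for illustration only: (total_reward, base_cost) per seed.
# Total reward is -(base_cost + shaping_penalty), so a trial can score a higher
# reward simply by incurring less shaping penalty while costing more to run.
TRIALS = {
    "trial_a": [(-105.0, 95.0), (-107.0, 97.0), (-103.0, 93.0)],
    "trial_b": [(-101.0, 100.0), (-102.0, 101.0), (-100.0, 99.0)],
}


def summarise(runs):
    """Aggregate per-seed results into means and a spread estimate."""
    rewards = [reward for reward, _ in runs]
    costs = [cost for _, cost in runs]
    return {
        "mean_reward": mean(rewards),
        "mean_base_cost": mean(costs),
        "std_base_cost": stdev(costs) if len(costs) > 1 else 0.0,
    }


def rank(trials, key, reverse=False):
    """Return trial names sorted by the chosen summary statistic."""
    return sorted(trials, key=lambda name: summarise(trials[name])[key],
                  reverse=reverse)


by_reward = rank(TRIALS, "mean_reward", reverse=True)  # higher reward is better
by_cost = rank(TRIALS, "mean_base_cost")               # lower cost is better
```

In this toy case the two rankings disagree, which is the failure mode the caveat describes: selecting on total reward can pick a trial that is worse on the operational metric. Reporting the standard deviation across seeds also gives a rough sense of whether a gap is larger than the noise.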
The more interesting question
That finding, while modest, points to something more useful than a clean win would have. It reframes the question from "can LLMs improve RL?" to "when does LLM reasoning meaningfully enhance RL?"
Probably not in tightly defined optimisation problems with clear cost structures and stationary-ish dynamics — PPO already has those well in hand. The promise lies in the messier territory: complex, ambiguous, and deeply human spaces where the task isn't merely to optimise quantitatively, but to interpret meaning beyond the noise of numbers. Robotics with safety preferences. Healthcare allocation under shifting ethical constraints. Resource decisions that depend on context a reward function can't easily encode. These are environments where the LLM's ability to translate human intent into structured incentives might genuinely earn its keep.
The methodology itself also held up well. The LLM reliably produced interpretable shaping functions and behaved as a credible collaborative reward engineer. Even when the numerical impact was small, the process of having a language model articulate why a policy should care about smooth inventory transitions or peak-season buffers added a layer of interpretability that pure end-to-end RL rarely offers.
Why this matters
There's a tendency, in the current excitement around Agentic AI, to treat language-based reasoning as a universal upgrade. This experiment is a small but useful corrective. LLMs are not a free improvement layer — they're a tool whose value depends heavily on the shape of the problem in front of them. Knowing where they help, and where they don't, is what makes the difference between a thoughtful system and an expensive one.
The notebook accompanying this post is a launchpad. If you want to test these ideas on your own environment — particularly something higher-dimensional or harder to specify in pure cost terms — it's structured to let you swap in your own domain and see what happens. We'd love to hear what you find.
Notebook: LLM-Guided Reward Shaping for RL
Reference: Verma, S. et al. (2024). A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health. arXiv:2402.14807
