When are LLMs sufficient policy optimizers?

Stephane Hatgis-Kessell, under supervision of Emma Brunskill

All expressed positions are my own.

Cite this work
@article{hatgiskessell2026promptpo,
  title   = {When are LLMs sufficient policy optimizers?},
  author  = {Hatgis-Kessell, Stephane and Brunskill, Emma},
  year    = {2026}
}

Introduction

A central goal of reinforcement learning (RL) is to produce a policy that maximizes expected return with respect to a reward function [1]. To this end, an influential line of RL research has focused on developing algorithms that train a policy to maximize expected return through continued interaction with an environment [2, 3, 4, 5, 6, 7, 8]. These methods are often computationally intensive—requiring substantial interaction with the environment—and brittle, as they depend on careful hyperparameter tuning and the selection of an appropriate RL algorithm. As a result, applying RL in practice can be challenging.

Figure 1. PromptPO overview: we input a description of the state space, action space, and reward function—all in Python code. Importantly, we avoid inputting context about the environment's transition dynamics so as to evaluate PromptPO in model-free settings. PromptPO generates a set of policies and an evaluation function, both implemented in Python code. The policies are rolled out in the environment, evaluated with respect to the generated evaluation function and the inputted reward function, and then the best performing policy with respect to the mean return is fed back to PromptPO, which proposes a new policy at the next round to maximize expected return.

To address these problems, we investigate when Large Language Models (LLMs) can produce a policy that maximizes expected return as a drop-in replacement for standard RL algorithms. More broadly, we argue that LLM-based optimization can serve as a competitive alternative to standard RL algorithms for many canonical tasks, and should be considered a practical addition to the policy optimization toolbox.

To support this claim, we study a method that prompts an LLM (for this work, we use Gemini 3 Pro) with a Python-formatted description of the state space, action space, and reward function, and asks it to implement a policy that maximizes expected return. We first sample $N$ candidate policies from the LLM. We additionally prompt the LLM to generate auxiliary evaluation criteria beyond the provided reward function. The generated policies are then rolled out and evaluated using both the reward function and these auxiliary metrics. We provide the LLM with the best-performing policy and its evaluation results, and iteratively prompt it to produce improved policies. We refer to this iterative procedure as Prompted Policy Optimization (PromptPO). Across a suite of both canonical and previously unseen environments, we then investigate when PromptPO is a desirable alternative to standard RL algorithms.

PromptPO matches or exceeds the performance of the standard RL algorithms we consider on several delayed-reward environments that require substantial exploration, including gridworld MDPs with randomly generated transition dynamics and Point Maze, a challenging exploration environment. It achieves this performance with significantly fewer environment interactions—on average, over an order of magnitude fewer in the gridworld environments and $7\times$ fewer in Point Maze—suggesting it can design policies that effectively handle non-trivial exploration. Notably, for gridworld environments, PromptPO implements planning algorithms like Dijkstra's algorithm or Value Iteration using an empirical estimate of the environment's transition dynamics—where the ground-truth transition dynamics model is unknown to PromptPO—highlighting its ability to select and apply a canonical planning algorithm when appropriate rather than outputting a rule-based policy.

On a suite of 3 robotic manipulation tasks from Metaworld, PromptPO matches the performance of the standard RL algorithms we consider using, on average, more than $7\times$ fewer environment steps, and in a 4th Metaworld environment, PromptPO outperforms the final performance of the standard RL algorithms within the training budgets we consider. We additionally evaluate PromptPO in three real-world control tasks: designing lockdown regulations for the COVID-19 pandemic, administering insulin to patients with type II diabetes, and controlling a fleet of autonomous vehicles merging onto a highway. PromptPO is more than $60\times$ more sample efficient in the former and outperforms the final performance of the standard RL algorithms within the training budgets we consider in the latter two environments. PromptPO underperforms standard RL algorithms on a suite of MuJoCo continuous control tasks, indicating that current LLMs may have difficulty designing policies for environments that require fine-grained control.

We demonstrate that PromptPO, by way of reasoning over natural language descriptions of the state space, action space, and reward function, can produce policies that maximize the expected return with significantly fewer environment interactions—and fewer algorithmic hyperparameters to tune—than standard RL algorithms in a diverse suite of environments. To maximize expected return, and without further explicit prompting, the policies PromptPO outputs range from tuned proportional controllers or rule-based plans to policies that run planning algorithms like value iteration. Broadly, we show that LLMs are sufficient black-box policy optimizers for many environments and should be seen as practical substitutes for canonical RL algorithms.

We conclude by positioning PromptPO within the RL landscape and discussing its implications for practice and research.

We release our code base here, including the minimal implementation of PromptPO used for this work, the context we provided to PromptPO for each environment, and examples of the policies PromptPO generates for each environment.

Related Works

We follow the same procedure as a line of prior work that prompts an LLM to iteratively generate a code artifact [9, 10, 11] or solution [12, 13, 14] to optimize for a specific objective. We investigate when this framework is sufficient to solve classical, single-agent sequential decision making tasks as a substitute for standard RL algorithms, and in particular how this framework compares to standard RL algorithms with respect to environment sample efficiency. Following the insight from [15] that policies can be represented as executable programs, we represent generated policies as Python code. Our work differs from [15] and a line of others [16, 17, 18, 19, 20] in that they use LLMs to generate policies that satisfy a natural language description of a task or reward function. Rather, we study the setting where an LLM must produce a policy to optimize expected return with respect to a specified reward function, i.e., the standard RL setting. We summarize the differences between our work and these methods in Table 1.

Other work focuses more directly on using LLMs for policy optimization, such as prompting an LLM to output an action directly at every decision making step [21, 22]. This approach doesn't naturally scale to the long-horizon decision making settings we consider. [23] uses an LLM to generate policy parameter vectors, but they assume the policy can be expressed with a small, human-specified set of parameters that the LLM can reliably output, limiting expressivity and requiring manual design of the policy class. We investigate the more general setting where an LLM must output the policy itself, not parameters for a prespecified policy class.

The primary objective of this work is to investigate when PromptPO serves as a strong alternative to standard RL methods, and to argue that LLM-based policy optimization should be adopted as a standard tool within the RL toolkit.

| Method | Evaluates in sequential decision making domains | Uses specific optimization target | Evaluates for sample efficiency improvements |
| --- | --- | --- | --- |
| AutoResearch [9] | ✗ | ✓ | ✗ |
| GEPA [14] | ✗ | ✓ | ✓ |
| AlphaEvolve [11] | ✗ | ✓ | ✗ |
| TextGrad [12] | ✗ | ✓ | ✗ |
| Feedback Descent [13] | ✗ | ✓ | ✓ |
| Voyager [16] | ✓ | ✗ | ✓ |
| Can [17] | ✓ | ✗ | ✗ |
| Inner Monologue [18] | ✓ | ✗ | ✗ |
| Reflexion [19] | ✓ | ✗ | ✗ |
| Language Models as Agents [20] | ✓ | ✗ | ✗ |
| CSRO [10] | ✓ | ✓ | ✗ |
| PromptPO (ours) | ✓ | ✓ | ✓ |

Table 1. Comparison of LLM-based optimization methods across key properties. To evaluate LLM-based optimization as a drop-in replacement for RL algorithms, PromptPO addresses all three properties.

Prompted Policy Optimization (PromptPO)

In this section, we describe a specific instantiation of Prompted Policy Optimization. More broadly, we use this term to refer to a class of methods that leverage LLM-based optimization to generate policies for sequential decision-making tasks. The particular method presented here serves as a concrete implementation to evaluate the effectiveness of this broader class.

PromptPO inputs a description of the state space, action space, and the reward function implementation. These inputs are expressed in Python code and naturally accompany environments designed for RL, such as those exposing Gymnasium [24] interfaces. Importantly, we avoid providing context about the environments' transition dynamics to understand PromptPO's ability to maximize expected return without explicit model-based knowledge. All context used by PromptPO is available in our GitHub repository.

We first ask PromptPO to implement an evaluation class called Feedback via the prompt summarized in Figure 9. The implemented evaluation provides additional supervision beyond the environment reward function. Standard RL algorithms reason about prior interactions by way of Bellman backups or temporal difference errors propagated through the state-action space. In contrast, PromptPO updates policies using only trajectory-level feedback, making the design of evaluation functions beyond the reward signal a key component of the method. While future work could incorporate state–action level feedback into PromptPO, we hypothesize that current LLMs may struggle to reason over long-horizon sequences of individual transitions, particularly in environments with hundreds or thousands of steps. Upon generating an evaluation function, PromptPO is then prompted to generate a policy to maximize expected return using a prompt summarized in Figure 10.

We sample $N$ candidate policies from PromptPO and roll out each policy in the environment. For each rollout, we compute the return under the provided reward function and evaluate performance using the generated evaluation function. The best-performing policy—measured by return—is then reported back to PromptPO, which produces a natural language reflection on its performance using the prompt in Figure 11. The history of the best policy from all previous rounds is maintained in context. Finally, this reflection, along with the history of prior best policies and their associated reflections, is used to generate a new policy via the prompt in Figure 10. At any point in the iterative process, if a Python execution error occurs due to incorrectly generated code (e.g., in the policy or evaluation), PromptPO is provided with the error and prompted to re-implement the relevant code artifact.
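The iterative procedure above can be sketched in a few lines of Python. In this sketch, `generate_policies(history, n)` and `make_feedback()` are hypothetical stand-ins for the LLM prompts summarized in Figures 9–11, and the environment interface is simplified relative to Gymnasium; error handling for faulty generated code is omitted.

```python
def rollout(env, policy, horizon=100):
    """Roll out `policy` in `env`; return the trajectory and its return."""
    obs = env.reset()
    traj, total = [], 0.0
    for _ in range(horizon):
        action = policy(obs)
        obs, reward, done = env.step(action)
        traj.append((obs, action, reward))
        total += reward
        if done:
            break
    return traj, total

def prompt_po(generate_policies, make_feedback, env, rounds=5, n_candidates=10):
    """Skeleton of the PromptPO loop. `generate_policies(history, n)` stands in
    for prompting the LLM with the best policies and reflections from prior
    rounds, and `make_feedback` for the generated Feedback evaluation class;
    both are hypothetical interfaces, not the exact prompts used in the paper."""
    feedback = make_feedback()          # auxiliary metrics beyond the reward
    history = []                        # (policy, return, feedback) per round
    best_policy, best_return = None, float("-inf")
    for _ in range(rounds):
        candidates = generate_policies(history, n_candidates)
        scored = []
        for pol in candidates:
            traj, ret = rollout(env, pol)
            scored.append((ret, pol, feedback(traj)))
        ret, pol, fb = max(scored, key=lambda t: t[0])   # best by mean return
        history.append((pol, ret, fb))                   # kept in context
        if ret > best_return:
            best_policy, best_return = pol, ret
    return best_policy, best_return
```

The key design choice this sketch makes explicit is that only trajectory-level feedback—returns plus the generated evaluation—flows back into policy generation, rather than per-transition signals.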

Experiments Summary

Figures 2 and 3 summarize our results when $N=10$ policy candidates are sampled from PromptPO at each round. PromptPO matches or outperforms the best performing RL algorithms in 15/19 environments, is substantially more sample-efficient in 14/19 environments, and is more than an order of magnitude more sample efficient in 11/19 environments.

Figure 2. Comparison of PromptPO to best performing RL algorithm in terms of final performance (color) and sample efficiency (y position). Green points are environments where PromptPO attains a higher mean return than RL. Blue points are environments where PromptPO attains the same mean return as RL, and red points are those where it attains a lower mean return. All points above the gray dotted line are environments where PromptPO is more sample efficient, and the y-axis denotes by how much. Mean return is computed over 3 seeds. The plotted RL performance corresponds to the best-performing algorithm for each environment, selected from the suite of RL methods described in Section 6.
Figure 3. Comparison of PromptPO to best performing RL at the step when PromptPO achieves its best performance. Returns are normalized such that a uniformly random policy has value 0 and the best-performing RL policy has value 1; values greater than 1 indicate that PromptPO outperforms RL’s best policy. Points below the line $y = x$ correspond to environments where PromptPO attains higher performance than RL at the time PromptPO reaches its best performance. All other details follow Figure 1.
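The return normalization described in the caption of Figure 3 amounts to a one-line rescaling:

```python
def normalize_return(r, r_random, r_rl):
    """Rescale so a uniformly random policy scores 0 and the best-performing
    RL policy scores 1; values above 1 mean PromptPO beat RL's best policy."""
    return (r - r_random) / (r_rl - r_random)
```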

Exploration Environments We introduce NoiseWorld, a suite of randomly generated grid worlds designed to evaluate PromptPO’s exploration capabilities while mitigating the LLM’s prior knowledge of the environment. We additionally evaluate PromptPO on a physics-based maze navigation task—the Point Maze Gymnasium environment [25]—using a custom maze layout. Across these environments, PromptPO matches the performance of the best-performing RL algorithms, demonstrating strong exploration driven by sampling and evaluating many candidate policies at each update round. Notably, 2/5 of the NoiseWorld environments remain unsolved by both PromptPO and RL baselines. Further details are provided in Section 6.1. Interestingly, in the NoiseWorld environments, PromptPO’s generated policies implement a planning algorithm, such as Dijkstra's algorithm or Value Iteration using the board layout, which is provided in the observation, together with an empirically estimated transition model derived from prior experience. The resulting planning algorithm is executed at every environment timestep. This behavior indicates the effectiveness of PromptPO's prior on what optimization procedure to employ. In Point Maze, PromptPO's implemented policy is a proportional controller.
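The planning behavior described above—value iteration against an empirically estimated transition model—can be illustrated with the following tabular sketch. This is our own minimal reconstruction of the pattern, not PromptPO's generated code; the count-based model and fallback for unvisited state-action pairs are illustrative choices.

```python
import numpy as np

def empirical_model(transitions, n_states, n_actions):
    """Estimate P(s'|s,a) and R(s,a) from logged (s, a, r, s') tuples;
    unvisited (s, a) pairs fall back to a uniform next-state guess."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        reward_sum[s, a] += r
        visits[s, a] += 1
    safe = np.maximum(visits, 1)
    P = np.where(visits[..., None] > 0, counts / safe[..., None], 1.0 / n_states)
    R = reward_sum / safe
    return P, R

def value_iteration(P, R, gamma=0.99, iters=500):
    """Greedy policy from value iteration on the empirical model."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)   # Bellman optimality backup
    return (R + gamma * P @ V).argmax(axis=1)  # greedy action per state
```

A generated policy of this type would rebuild the model from its accumulated experience and re-plan at every environment timestep, as observed in the NoiseWorld runs.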

Robotics Environments We evaluate PromptPO on six MuJoCo [26] and four Meta-World robotic manipulation environments [27]. A key distinction is that MuJoCo tasks require actions in the form of joint torques, whereas Meta-World tasks operate over end-effector displacements. PromptPO underperforms standard RL algorithms on the MuJoCo suite, but in Meta-World it achieves greater sample efficiency in all tasks and improves final performance on one task. These results highlight a limitation of PromptPO: it struggles with fine-grained continuous control (e.g., joint torques), but performs well when the action space is more amenable to natural language reasoning. See Section 6.2 for additional details. Across all robotics environments, PromptPO generates proportional controllers and tunes their parameters as it receives additional environment feedback.

Real World Control Environments We evaluate PromptPO on three real-world decision-making environments used by [28]: administering insulin to patients with Type II diabetes, determining lockdown regulations to manage the COVID-19 pandemic, and controlling the accelerations of a fleet of autonomous vehicles merging onto a highway. PromptPO is more than $60\times$ more sample efficient in the Pandemic Mitigation environment, and outperforms the final best performance of RL for the training budget we consider in the Glucose Monitoring and Traffic Control environments. These results suggest that PromptPO can be a strong policy optimizer in real-world settings, where the LLM may leverage prior knowledge about the environment from pretraining to generate effective policies. See Section 6.3 for additional details. In these environments, PromptPO implements rule-based heuristic policies that are iteratively refined as additional environment feedback is obtained.

On Pretraining Biases and Privileged Information in PromptPO Most of the environments we consider are publicly available and likely present in the LLM’s pretraining data. As a result, PromptPO may benefit from strong inductive biases or access to information about environment dynamics or effective policies that would not typically be available to an RL agent. We take several steps to mitigate these concerns. First, we evaluate PromptPO on newly generated environments (e.g., NoiseWorld) and novel configurations of existing environments (e.g., Point Maze) that are unlikely to appear in pretraining data. We do not allow access to external tools such as internet search, and we do not provide transition dynamics in context. We additionally audit generated policies using LLM-guided search to check for reliance on publicly available solutions or privileged information. In one surfaced instance, for example, PromptPO reproduced a maze layout from publicly available documentation; we mitigate this by using a custom layout. While prior exposure to related environments likely provides useful inductive biases in ways that are hard to predict, PromptPO must still synthesize policies that maximize expected return in the specific instances we consider.

On Comparing PromptPO to RL from Scratch We compare PromptPO to RL methods that optimize for expected return starting from a randomly initialized policy. This comparison is intentionally asymmetric: while RL methods learn solely from environment interaction, PromptPO leverages substantial prior knowledge from pretraining. Our goal is not to equalize these settings, but to understand the extent to which pretraining can substitute for environment interaction in canonical RL tasks. In particular, this comparison targets a central question: how much useful structure about policies and environments is already captured by large-scale pretraining, and how effectively can it be leveraged for decision making? From a practical perspective, we argue that such comparisons are appropriate, as practitioners can choose between methods that learn from scratch and, via LLM-based optimization methods, those that use pretrained models. Our results suggest that LLM-based optimization methods like PromptPO can offer a compelling alternative due to their strong priors and resulting sample efficiency.

Detailed Environment Descriptions and Experiments

We provide additional details about the set of environments we use for evaluation in Section 4, and present results for each individual environment. In Appendix 9, we present results for PromptPO trained with different values of $N$, the number of candidate policies generated per round.

Exploration Environments

NoiseWorld We propose NoiseWorld to evaluate PromptPO in a setting where it doesn't reasonably have access to a strong prior on the environment from its training data. NoiseWorld is a grid-world environment where the agent starts in the top left corner and must maximize its expected return by navigating to a goal state in the bottom right corner. The action space consists of moving in the 4 cardinal directions. By default, each NoiseWorld board is $10\times10$ with a horizon of $100$. The agent accrues a $-1$ reward at each timestep, and a $+1000$ reward for reaching the goal state. NoiseWorld has the following cell types:

  • Cell type 0 is a blank cell; transitions out of it are deterministic.
  • Cell type 1 is a wall; no transitions into these cells are successful.
  • Cell type 2 is such that any transitions out of this cell are successful with probability $p$, and otherwise result in the agent staying in the current cell with probability $(1-p)$.
  • Cell type 3 is such that any transitions out of this cell are successful with probability $p$, and otherwise result in the agent transitioning back to the start state with probability $(1-p)$.
  • Cell type 4 ends the episode with a penalty of $-1000$.

We instantiate NoiseWorld1, NoiseWorld2, and NoiseWorld3, where cell types are sampled from the types listed above and, where relevant, the transition success probabilities are also sampled for each cell. When providing the observation space to PromptPO, which consists of the agent's current location and the board layout, we do not indicate which cell types correspond to which transition dynamics or behaviors; PromptPO must figure this out on its own.

We additionally evaluate PromptPO with NoiseWorld4 and NoiseWorld5, which, in addition to the cell types above, also have cell types 5 and 6; there is exactly one cell of each of these types per board, and to attain the reward of $+1000$ upon visiting the goal state, the agent must first visit the cell of type 5 and then the cell of type 6. Otherwise, reaching the goal state ends the episode with no bonus.
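A minimal sketch of a NoiseWorld-style board under the description above may make the dynamics concrete. The layouts, per-cell slip probabilities, and observation encoding used in our experiments differ (probabilities are sampled per cell, and cell types 5 and 6 are omitted here), so treat this as illustrative only.

```python
import random

# Cell types, as described above: 0 blank, 1 wall, 2 slip-and-stay,
# 3 slip-back-to-start, 4 terminal penalty.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

class NoiseWorld:
    """Illustrative NoiseWorld board with a single fixed slip probability `p`
    shared by all noisy cells; the paper samples a probability per cell."""
    def __init__(self, board, p=0.7, horizon=100, seed=0):
        self.board, self.p, self.horizon = board, p, horizon
        self.n = len(board)
        self.rng = random.Random(seed)

    def reset(self):
        self.pos, self.t = (0, 0), 0             # start in the top-left corner
        return self.pos

    def step(self, action):
        self.t += 1
        r, c = self.pos
        cell = self.board[r][c]
        moved = True
        if cell == 2 and self.rng.random() > self.p:
            moved = False                        # slip: stay in place
        elif cell == 3 and self.rng.random() > self.p:
            self.pos = (0, 0)                    # slip: back to the start state
            moved = False
        if moved:
            dr, dc = MOVES[action]
            nr, nc = r + dr, c + dc
            if 0 <= nr < self.n and 0 <= nc < self.n and self.board[nr][nc] != 1:
                self.pos = (nr, nc)              # walls block the transition
        if self.board[self.pos[0]][self.pos[1]] == 4:
            return self.pos, -1000.0, True       # terminal penalty cell
        if self.pos == (self.n - 1, self.n - 1):
            return self.pos, 999.0, True         # -1 step cost + 1000 goal bonus
        return self.pos, -1.0, self.t >= self.horizon
```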

Point Maze We evaluate PromptPO in the Point Maze environment [30], where a force-actuated 2-DoF ball must navigate through a maze to reach a target goal. The action space is continuous and 2-dimensional, and the observation consists of the ball's position and velocity, the goal's position, and whether the goal has been reached. Each episode has a horizon of 800 and a delayed reward of +1 for reaching the goal and 0 otherwise. We generate a new $10 \times 9$ Point Maze map that differs from the default map provided by [30] to force PromptPO to explore the environment rather than rely on the LLM's training data. We found that when we did not do this, PromptPO—without access to the internet—recalled the default map used by Point Maze Large from their documentation.

Considered RL Baselines For NoiseWorld, we evaluate PPO with and without hindsight experience replay [31]—which often aids exploration—and report the mean return of the best performing method for each board layout. For Point Maze, we consider SAC, PPO, and variants of both with hindsight experience replay, reporting the mean return of the best performing method. See Appendix 8 for details on our choice of hyperparameters and implementation details for NoiseWorld. For Point Maze, we use the hyperparameters from [32].

Results PromptPO matches the best performing RL algorithm's performance with significantly fewer environment samples in all NoiseWorld environments and in Point Maze; in the NoiseWorld environments, PromptPO attains its best performing policy without updating its generated policies using feedback from the environment, despite having no information about the different cell types. We attribute this strong performance to PromptPO's ability to sample many different policies, which in effect results in strong exploration capabilities. When the many different policies sampled by PromptPO don't immediately yield the best performing policy—as is the case in Point Maze—PromptPO incorporates the environment feedback (e.g., expressed via the return and generated evaluation function) to sample improved policies in the next round. Figure 5 shows this learning capability for Point Maze, and Figure 4 shows it for NoiseWorld.

Unlike in NoiseWorld1, NoiseWorld2, and NoiseWorld3, in NoiseWorld4 and NoiseWorld5 both PromptPO and the best performing RL algorithm fail to find a policy that behaves near-optimally, i.e., one that learns to visit cell types 5 and 6 to unlock the high reward upon entering the goal state. This result indicates that PromptPO, like standard RL methods, struggles with difficult exploration problems; future work should try to incorporate lessons from exploration for RL into PromptPO. Providing PromptPO with context about cell types 5 and 6, i.e., that visiting them is necessary to attain the high reward bonus at the goal state, results in PromptPO producing a policy that exhibits this desired behavior. This type of prior knowledge might be difficult to specify to a standard RL algorithm, but is quite simple to specify to PromptPO via natural language and, as shown in Figure 15, enables PromptPO to generate a performant policy.

Figure 4. Training curves across NoiseWorld boards for PromptPO and the best performing RL algorithm out of the set of methods we consider. Mean return is reported over 3 seeds. The dotted lines show best achieved final performance. In the rightmost two plots, unlike in the leftmost 3 plots, PromptPO and RL do not attain near-optimal performance.
Figure 5. PromptPO's training performance in Point Maze versus SAC, which is the best performing RL algorithm out of the set of methods we consider. Mean return is reported over 3 seeds. The dotted lines show best achieved final performance.
Robotics Environments

MuJoCo Environments We evaluate PromptPO in the Hopper, Ant, Half-Cheetah, and Swimmer environments, where the objective is to train a locomotion policy, and in the Reacher and Inverted Pendulum environments, where the objectives are to move a two-jointed robot arm close to a target and to balance a pole on a cart, respectively [26].

Metaworld Environments We evaluate PromptPO in the Button Press, Door Open, Drawer Open, and Pick and Place environments [27], where the objective is to train a 7-DOF Sawyer robotic arm to press a button, open a door, open a drawer, and pick and place objects, respectively.

For the MuJoCo and Metaworld environments, we use the default observation and action spaces, and for PromptPO, additionally provide a description of what each observation variable denotes.

Considered RL Baselines For each MuJoCo environment, we train a policy with PPO, SAC, and TD3 and report the mean return of the best performing method per environment. We use the hyperparameters from [33] for the MuJoCo environments, and from [27] for Metaworld.

For Metaworld, we train a policy with SAC, following the best performing method of [27].

Results PromptPO underperforms RL in all MuJoCo environments except Reacher, where it is substantially more sample efficient, and Swimmer, where it attains similar performance but after many more environment interactions. On the other hand, PromptPO is significantly more sample efficient than RL in all Metaworld environments, and additionally outperforms the final performance attained by RL in the Pick and Place environment. In Metaworld, PromptPO generates its best performing policy after incorporating only one or a few rounds of feedback from the environment, as illustrated in Figure 6.

The difference in relative performance between PromptPO and RL in the MuJoCo and Metaworld environments provides insight into the type of environments where PromptPO is a sufficient policy optimizer; for MuJoCo, the action space consists of torques applied to hinge joints, while for Metaworld it is the end-effector displacement and gripper finger positions. We hypothesize that PromptPO struggles to implement a performant policy for environments that require fine-grained continuous control, and hence it performs poorly in MuJoCo environments while exhibiting strong performance in Metaworld environments. For all environments, PromptPO implements a proportional controller and tunes the controller as subsequent environment feedback is provided in context. This suggests that, for some tasks, PromptPO restricts its policy class.
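The proportional-controller policy class described above can be illustrated with the following sketch, in the spirit of the policies PromptPO generates for Metaworld. The observation unpacking, gain value, and gripper rule here are hypothetical, chosen only to show the shape of the policy; in practice the gains are what PromptPO tunes from environment feedback.

```python
import numpy as np

def make_p_controller(kp=5.0, close_gripper_dist=0.02):
    """Hypothetical proportional controller over end-effector displacements.
    Assumes the observation exposes the end-effector and target positions;
    `kp` and `close_gripper_dist` are illustrative, tunable parameters."""
    def policy(ee_pos, target_pos):
        error = np.asarray(target_pos) - np.asarray(ee_pos)
        delta = np.clip(kp * error, -1.0, 1.0)   # displacement action is bounded
        # Close the gripper only once the end effector is near the target.
        grip = 1.0 if np.linalg.norm(error) < close_gripper_dist else -1.0
        return np.concatenate([delta, [grip]])   # 3 displacement dims + gripper
    return policy
```

A controller like this is far easier for an LLM to reason about and tune than a torque-level policy, which is consistent with the MuJoCo/Metaworld performance gap we observe.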

Figure 6. Training curves across Meta-World tasks for PromptPO and the best performing RL algorithm out of the set of methods we consider. Mean return is reported over 3 seeds. The dotted lines show best achieved final performance.
Figure 7. Training curves across MuJoCo continuous control tasks for PromptPO and the best performing RL algorithm out of the set of methods we consider. Mean return is reported over 3 seeds. The dotted lines show best achieved final performance.
Real World Control Tasks

Glucose Monitoring In Glucose Monitoring, the objective is to design a policy to administer insulin to a patient with Type II diabetes; an action must be outputted every 5 minutes for 20 days, and observations are provided by a continuous glucose monitor (CGM). We use the environment parameters and reward function from [28].

Pandemic Mitigation In Pandemic Mitigation, the objective is to design COVID-19 pandemic lockdown regulations to balance economic and health outcomes. The available actions are four lockdown regulation stages, and the observations consist of the states from an SEIRS pandemic model. We use the Pandemic simulator from [34] and the environment parameters and reward function from [28].

Traffic Control In Traffic Control, a policy must control a fleet of autonomous vehicles merging onto a highway to maximize traffic flow. The action space consists of the autonomous vehicle accelerations, and the observation space is the location and velocity—as well as other statistics derived from these measures—of all cars on the highway. We use the simulator from [35] and the environment parameters and reward function from [28].

Considered RL Baselines We train a policy with PPO in all environments, following the implementation and hyperparameters of [36] for these environments.

Results For these environments, representative of many real-world control tasks, PromptPO is both significantly more sample efficient than RL and, for the Traffic Control and Glucose Monitoring environments, substantially outperforms the best policy found by RL within the environment sample budget considered. Training curves for all methods are shown in Figure 8. These results showcase PromptPO's strength as a policy optimizer in real-world settings where it may be capable of leveraging its pretraining data as a strong prior for generating a performant policy.


Figure 8. Training curves across real-world control environments for PromptPO and PPO. Mean return is reported over 3 seeds. The dotted lines show best achieved final performance.

Implications of LLM-Based Policy Optimization

RL practitioners should consider PromptPO as an accessible first approach PromptPO is both easy to use—in that it requires few algorithmic hyperparameters to tune and interfaces with natural language information a practitioner can provide—and is a strong choice of policy optimizer for environments where an LLM can reason over the state and action space. Practitioners without extensive RL expertise may find PromptPO easier to apply to sequential decision-making problems than standard RL methods. While PromptPO has clear limitations—most notably its reliance on state and action representations that are amenable to natural language reasoning, which excludes settings with image or embedding-based observations—it provides a strong and practical starting point for policy optimization in many real-world environments that meet these requirements.

We should prioritize RL research beyond the "training from scratch" paradigm A dominant paradigm in RL research is to compare methods that train policies from scratch. This approach is well suited to developing general-purpose algorithms that can be applied to novel environments where no prior knowledge is available. However, the emergence of LLM-based optimization highlights the value of leveraging large-scale pretraining, which can provide both effective optimization strategies and strong priors over many environments. This suggests an additional research direction for the RL community: developing and evaluating methods in settings where such priors are available, whether over the environment, the policy class, or the optimization process.

For many research questions and methods—particularly those used in conjunction with existing policy optimizers, such as methods to encourage exploration, abide by safety constraints, leverage world models for planning, construct hierarchical policies, or enable continual learning—showing that a proposed technique improves performance with both standard RL methods and LLM-based policy optimization would be a practical contribution. Evaluating methodological improvements may become harder as the internet is populated with more policy optimization solutions, but environments with randomly generated transition dynamics could provide a useful testbed for LLM-based policy optimization methods.

Standard RL libraries should support LLM-based policy optimization methods To encourage adoption of PromptPO and related LLM-based optimization methods, widely used RL libraries should support them as drop-in alternatives to standard approaches such as PPO or SAC. However, widespread adoption may also introduce new evaluation challenges: as solutions to many policy optimization problems become publicly available, it may become increasingly difficult to assess whether future methods genuinely generalize to unseen environments. To mitigate this, we advocate for the development of more unseen evaluation settings, such as environments with randomly generated transition dynamics.

Will LLM-based policy optimization go beyond the capabilities of current RL algorithms by implementing novel RL algorithms? Proponents of the Bitter Lesson [29] might assert that the human inductive biases embedded in LLMs will prevent PromptPO-style methods from significantly surpassing human capabilities. If so, PromptPO remains a practical tool for generating high-performing policies across many environments, and may facilitate broader adoption of RL in real-world settings—particularly among practitioners without RL expertise. But we observe that in the NoiseWorld environments, PromptPO implements Value Iteration without being explicitly prompted to do so. In continuous control environments, such as the MuJoCo and Meta-World environments, PromptPO implements and tunes a proportional controller. This provides evidence that PromptPO can select appropriate policy optimization procedures for a given setting, suggesting that future LLM-based optimization methods may similarly implement more sophisticated existing RL algorithms—or even develop novel ones that extend beyond current approaches.
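For intuition, the proportional controllers PromptPO produced for continuous control resemble the following sketch. This is hypothetical illustrative code, not a policy PromptPO actually emitted: the gain, target, and action bounds are placeholder values.

```python
class ProportionalPolicy:
    """Sketch of a tuned proportional (P) controller of the kind PromptPO
    generated for continuous-control tasks. All parameters are illustrative."""

    def __init__(self, gain=2.0, target=0.0, low=-1.0, high=1.0):
        self.gain = gain                 # proportional gain (what gets "tuned")
        self.target = target             # desired value per observation dimension
        self.low, self.high = low, high  # action-space bounds

    def act(self, obs):
        # Action proportional to the error (target - observation),
        # clipped to the action-space bounds.
        return [max(self.low, min(self.high, self.gain * (self.target - x)))
                for x in obs]
```

Such a controller has a single scalar to tune (the gain), which makes it easy for an iterative generate-evaluate loop to improve across rounds.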

Appendix

Prompts used by PromptPO
Trajectory Feedback Prompt
Implement a class called Feedback with a method summarize_trajectory(self, traj).
traj is a list of observation objects (each a {obs_cls} instance) for one episode in time order.
Compute brief statistics to help improve a policy maximizing expected reward under {reward_function_name} over {n_timesteps} timesteps.
Examples: sum, min, max, mean of reward-related quantities or other useful summaries.
Figure 9. Prompt used to generate trajectory-level feedback summaries for improving policies.
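A class generated in response to the trajectory-feedback prompt might look like the following. This is a hypothetical sketch, not output from our runs; it assumes each observation object exposes a scalar `reward` attribute, which is an assumption for illustration.

```python
class Feedback:
    """Hypothetical example of the class the trajectory-feedback prompt elicits."""

    def summarize_trajectory(self, traj):
        # traj: list of observation objects for one episode, in time order.
        # We assume each exposes a scalar `reward` attribute (illustrative only).
        rewards = [obs.reward for obs in traj]
        return {
            "sum_reward": sum(rewards),
            "min_reward": min(rewards),
            "max_reward": max(rewards),
            "mean_reward": sum(rewards) / len(rewards),
        }
```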
Policy Generation Prompt
Observation Context: {obs_context}
Implementation Details: {policy_context}
{reward_function_name}: {reward_src}
{history_of_previously_generated_best_policies}
Given the {reward_function_name}, implement a policy in python that inputs an observation and outputs an action. The policy should maximize the expected sum of rewards with respect to the {reward_function_name} over {n_timesteps} timesteps. Think step-by-step internally before producing final code. Implement the policy in a class called {policy_class_name} with a function act(obs) that takes in the observation and returns a valid action.
Figure 10. Prompt used to instruct the language model to generate a policy implementation from observation context, implementation details, and a reward function specification.
Policy Evaluation Prompt
{reward_function_name}: {reward_src}
Latest policy to assess (attempt {current_round_index}) source:
{generated_policy}
Latest policy (attempt {current_round_index}) episode returns (sum of rewards over each rollout): {list(episode_returns)}, mean return {mean_return_latest:.6f}.
{history_of_previously_generated_policies}
Compare the latest policy's returns to the earlier attempts listed above. Reply in 3–4 sentences total. Your first sentence must open with a comparison: for example that this policy did better than certain prior attempts, worse than certain prior attempts, or performed similarly to them (name attempt numbers if helpful).
If the latest mean return is clearly very low for the {horizon}-timestep rollout, or clearly far below the best prior mean return in this run, or otherwise obviously suboptimal, open instead with that this policy did poorly (or equivalent), rather than implying improvement.
Then briefly relate this to maximizing the expected sum of rewards under the {reward_function_name} over {horizon} timesteps.
Reply with only that explanation, no code.
Figure 11. Prompt used to elicit concise natural language evaluations of generated policies, comparing performance across attempts and relating outcomes to the target reward function.
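Putting the prompts together, the outer loop described in Figure 1 can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the actual implementation: `candidate_sources` stands in for the N policy programs sampled from the LLM, the environment is reduced to a step function and a reward function, and the helper names (`rollout`, `prompt_po_round`) are hypothetical.

```python
def rollout(policy, env_step, init_obs, n_timesteps, reward_fn):
    """Roll out a policy for n_timesteps; return the episode return."""
    obs, total = init_obs, 0.0
    for _ in range(n_timesteps):
        obs = env_step(obs, policy.act(obs))
        total += reward_fn(obs)
    return total

def prompt_po_round(candidate_sources, policy_class_name, env_step, init_obs,
                    n_timesteps, reward_fn, n_episodes=3):
    """One PromptPO round: execute each generated policy source, roll it out,
    and return the best (source, mean return) pair to feed back to the LLM."""
    best_src, best_mean = None, float("-inf")
    for src in candidate_sources:
        namespace = {}
        exec(src, namespace)  # generated policies arrive as Python source
        policy = namespace[policy_class_name]()
        returns = [rollout(policy, env_step, init_obs, n_timesteps, reward_fn)
                   for _ in range(n_episodes)]
        mean_return = sum(returns) / len(returns)
        if mean_return > best_mean:
            best_src, best_mean = src, mean_return
    return best_src, best_mean
```

In the full method, the returned best policy and its mean return are inserted into the policy generation and policy evaluation prompts above, and a fresh pool of candidates is sampled in the next round.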
Hyperparameter tuning method for NoiseWorld

On each NoiseWorld board, we tune PPO as implemented in Stable-Baselines3 [37]. For PPO we grid-search the learning rate, PPO clip range, minibatch size, and number of optimization epochs (three values each; $3^4{=}81$ configurations per board). Every configuration is trained for a fixed environment step budget with periodic deterministic evaluation rollouts; we record the best mean evaluation return along each run and the environment step at which that peak first occurs. We repeat each configuration with three random seeds and, for a given board, select the hyperparameters that maximize the mean (across seeds) of those per-seed highest returns, breaking ties by the lower mean environment step at which the highest return is first reached.
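The selection rule at the end of this procedure (highest mean peak return across seeds, ties broken by earlier peaks) can be sketched as follows; `select_config` is a hypothetical helper name, and `results` maps each hyperparameter configuration to its per-seed (peak return, step of peak) pairs.

```python
def select_config(results):
    """Pick the configuration with the highest mean peak evaluation return
    across seeds, breaking ties by the lower mean environment step at which
    that peak first occurred.

    results: {config: [(peak_return, step_of_peak), ...]}, one tuple per seed.
    """
    def key(config):
        runs = results[config]
        mean_peak = sum(ret for ret, _ in runs) / len(runs)
        mean_step = sum(step for _, step in runs) / len(runs)
        # min() over (-mean_peak, mean_step): maximize return, then minimize step.
        return (-mean_peak, mean_step)

    return min(results, key=key)
```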

Varying the number of samples generated by PromptPO

The results shown in Figures 1 and 3, as well as all other figures in the main text, sample $N=10$ candidate policy generations in each round of PromptPO. Figures 12 and 13 summarize PromptPO's performance for $N \in \{5,10,20\}$, where the number of PromptPO update rounds remains fixed across all settings evaluated. Decreasing the number of candidate policies per round decreases the final performance attained; PromptPO benefits from sampling a diverse pool of candidates. Increasing the number of candidate policies per round decreases sample efficiency, as more policies must be rolled out and evaluated in the environment.

Figure 12. PromptPO performance summary for different numbers of sampled candidate policies per round ($N \in \{5,10,20\}$). This figure complements Figure 2; see its caption for details on interpretation.
Figure 13. PromptPO performance summary for different numbers of sampled candidate policies per round ($N \in \{5,10,20\}$). This figure complements Figure 3; see its caption for details on interpretation.
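The sample-efficiency side of this tradeoff follows from simple accounting. Under the assumption (made for this sketch) that each of the N candidates is rolled out for a fixed number of evaluation episodes per round, the environment episodes consumed grow linearly in N:

```python
def episodes_consumed(n_rounds, n_candidates, eval_episodes_per_candidate):
    """Environment episodes consumed under a simple accounting (an assumption
    for this sketch): every round rolls out each of the N candidate policies
    for a fixed number of evaluation episodes."""
    return n_rounds * n_candidates * eval_episodes_per_candidate

# With the number of rounds held fixed, doubling N from 10 to 20 doubles
# the environment episodes per round, which is why larger candidate pools
# reduce sample efficiency even as they improve final performance.
```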
Providing exploration context for PromptPO

PromptPO, like the RL algorithms we consider, fails to find a near-optimal policy for NoiseWorld4 and NoiseWorld5, where a sequence of keys must be visited before the agent can attain high reward from the goal state. When provided with the additional context in Figure 14, however, PromptPO does attain near-optimal performance, as illustrated in Figure 15. This type of user-provided prior might be difficult to specify to an RL algorithm, but can be easily specified via natural language to PromptPO.

Observation Context Addition
Trailing progress flags (two scalars in {0,1}, after board):
First flag: 1 if the agent has already visited the cell labeled 6 at least once this episode, else 0.
Second flag: 1 if the agent has already visited the cell labeled 7 at least once this episode after the first flag became 1 (i.e. the ordered pair of milestones is complete), else 0.
Figure 14. Additional observation-context text describing the two trailing progress flags appended after the board representation.
Figure 15. PromptPO's training performance in NoiseWorld5 versus PPO, which is the best performing RL algorithm out of the set of methods we consider. Here, PromptPO is provided with context stating it must visit cell 5 and cell 6 to reap the high positive reward at the goal state. Mean return is reported over 3 seeds. The dotted lines show best achieved final performance.

Acknowledgements

Simon Guo, Miguel Liu-Schiaffini, and Shannon Sequeira were extremely helpful in discussing ideas and providing feedback on this post.

References

  [1] Richard S. Sutton, Andrew G. Barto. "Reinforcement Learning: An Introduction". MIT Press. 2018.
  [2] Christopher J. C. H. Watkins, Peter Dayan. "Q-learning". Machine Learning. 1992.
  [3] Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour. "Policy Gradient Methods for Reinforcement Learning with Function Approximation". NeurIPS. 2000.
  [4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. "Human-level control through deep reinforcement learning". Nature. 2015.
  [5] John Schulman, Filip Wolski, Prafulla Dhariwal, et al. "Proximal Policy Optimization Algorithms". arXiv preprint arXiv:1707.06347. 2017.
  [6] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor". arXiv preprint arXiv:1801.01290. 2018.
  [7] John Schulman, Sergey Levine, Pieter Abbeel, et al. "Trust Region Policy Optimization". International Conference on Machine Learning (ICML). 2015.
  [8] Scott Fujimoto, Herke van Hoof, David Meger. "Addressing Function Approximation Error in Actor-Critic Methods". International Conference on Machine Learning (ICML). 2018.
  [9] Andrej Karpathy. "autoresearch: AI agents running research automatically". 2026.
  [10] Daniel Hennes, Zun Li, John Schultz, Marc Lanctot. "Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models". arXiv preprint arXiv:2603.10098. 2026.
  [11] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, et al. "AlphaEvolve: A coding agent for scientific and algorithmic discovery". arXiv preprint arXiv:2506.13131. 2025.
  [12] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, et al. "TextGrad: Automatic 'Differentiation' via Text". arXiv preprint arXiv:2406.07496. 2024.
  [13] Yoonho Lee, Joseph Boen, Chelsea Finn. "Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison". arXiv preprint arXiv:2511.07919. 2025.
  [14] Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, et al. "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning". arXiv preprint arXiv:2507.19457. 2025.
  [15] Jacky Liang, Wenlong Huang, Fei Xia, et al. "Code as Policies: Language Model Programs for Embodied Control". 2023 IEEE International Conference on Robotics and Automation (ICRA). 2023.
  [16] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, et al. "Voyager: An Open-Ended Embodied Agent with Large Language Models". arXiv preprint arXiv:2305.16291. 2023.
  [17] Michael Ahn, Anthony Brohan, Noah Brown, et al. "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances". arXiv preprint arXiv:2204.01691. 2022.
  [18] Wenlong Huang, Fei Xia, Ted Xiao, et al. "Inner Monologue: Embodied Reasoning through Planning with Language Models". arXiv preprint arXiv:2207.05608. 2022.
  [19] Noah Shinn, Federico Cassano, Ashwin Gopinath, et al. "Reflexion: Language Agents with Verbal Reinforcement Learning". Advances in Neural Information Processing Systems. 2023.
  [20] Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch. "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents". International Conference on Machine Learning (ICML). 2022.
  [21] Ethan Brooks, Logan Walls, Richard L. Lewis, Satinder Singh. "Large Language Models Can Implement Policy Iteration". Advances in Neural Information Processing Systems. 2023.
  [22] Andrew Szot, Max Schwarzer, Harsh Agrawal, et al. "Large Language Models as Generalizable Policies for Embodied Tasks". The Twelfth International Conference on Learning Representations. 2023.
  [23] Yifan Zhou, Sachin Grover, Mohamed El Mistiri, et al. "Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs". arXiv preprint arXiv:2511.21928. 2025.
  [24] Mark Towers, Ariel Kwiatkowski, Jordan Terry, et al. "Gymnasium". 2023.
  [25] Farama Foundation. "Gymnasium Robotics: Point Maze Environment". 2023.
  [26] Emanuel Todorov, Tom Erez, Yuval Tassa. "MuJoCo: A physics engine for model-based control". 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2012.
  [27] Tianhe Yu, Deirdre Quillen, Zhanpeng He, et al. "Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning". Proceedings of the Conference on Robot Learning (CoRL). 2020.
  [28] Alexander Pan, Kush Bhatia, Jacob Steinhardt. "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models". arXiv preprint arXiv:2201.03544. 2022.
  [29] Richard Sutton. "The Bitter Lesson". 2019.
  [30] Farama Foundation. "Point Maze Environment". 2026.
  [31] Marcin Andrychowicz, Filip Wolski, Alex Ray, et al. "Hindsight Experience Replay". Advances in Neural Information Processing Systems. 2017.
  [32] Jiashun Liu, Johan Obando-Ceron, Pablo Samuel Castro, et al. "The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning". International Conference on Machine Learning (ICML). 2025.
  [33] Antonin Raffin. "RL Baselines3 Zoo". 2021.
  [34] Varun Kompella, Roberto Capobianco, Stacy Jong, et al. "Reinforcement Learning for Optimization of COVID-19 Mitigation Policies". arXiv preprint arXiv:2010.10560. 2020.
  [35] Cathy Wu, Abdul Rahman Kreidieh, Kanaad Parvate, et al. "Flow: A Modular Learning Framework for Mixed Autonomy Traffic". IEEE Transactions on Robotics. 2022.
  [36] Cassidy Laidlaw, Shivam Singhal, Anca Dragan. "Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking". arXiv preprint arXiv:2403.03185. 2025.
  [37] Antonin Raffin, Ashley Hill, Adam Gleave, et al. "Stable-Baselines3: Reliable Reinforcement Learning Implementations". Journal of Machine Learning Research. 2021.