When are LLMs sufficient black-box policy optimizers?
Inspired by AutoResearch, we ask: when are LLMs sufficient black-box policy optimizers? That is, when can we replace classic RL algorithms such as PPO or SAC with an LLM?