When are LLMs sufficient policy optimizers for sequential RL tasks?
Inspired by AutoResearch, we ask: when are LLMs sufficient policy optimizers for sequential RL tasks? That is, when can an LLM replace a classic RL algorithm such as PPO or SAC?