The dream of reinforcement learning is that it can one day be used to derive automated solutions to real-world tasks, with little-to-no human effort¹. Unfortunately, in its current state, RL fails to deliver. Essentially no real-world problems have been solved by DRL; even on toy problems, the solutions found are often brittle and fail to generalize to new environments. This means that the per-task human effort – i.e. task-specific engineering effort and hyperparameter tuning – is quite high. Algorithms are sample-inefficient, making them expensive in terms of both data-collection effort and compute effort, too. Currently, RL-based automated solutions compare very unfavorably to alternatives (such as hiring a team of roboticists to engineer a solution, or just not automating at all).

But reinforcement learning, especially deep RL, is an exciting research area precisely because of its enormous unfulfilled potential. Improvements in RL directly translate into an improved ability to automate complex, cognitively-demanding tasks, which is where humanity, collectively, currently spends a lot of effort. If we can push reinforcement learning forward far enough, we will be able to take tasks that currently require substantial human effort, and solve them with no human effort: just a little bit of data and a lot of computation.

With this in mind, let’s delve a bit more into what it means to automate a task with reinforcement learning. The basic process can be decomposed into two steps: first reduce the problem to RL by writing it as an MDP or POMDP, and then solve for the optimal policy of the MDP or POMDP². The optimal policy then allows us to fully automate the task, completing it any number of times with no further human effort.
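To make the first step concrete, here is the skeleton of what a reduction produces. The container and its field names below are my own illustrative choices, not a standard API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MDP:
    """The five ingredients a reduction-to-RL must supply."""
    states: Sequence      # what the agent can observe
    actions: Sequence     # what the agent can do
    transition: Callable  # (state, action) -> next-state distribution
    reward: Callable      # (state, action, next_state) -> float
    gamma: float          # discount factor in (0, 1]

# "Reducing a task to RL" = choosing these five ingredients; "solving the
# MDP" = finding the policy that maximizes expected discounted return.
cartpole = MDP(states=["upright", "fallen"], actions=["left", "right"],
               transition=lambda s, a: s, reward=lambda s, a, s2: 1.0,
               gamma=0.99)
```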

Although this two-part breakdown is straightforward, my impression is that not many RL researchers currently think about their work through the lens of automation. I find it to be a very useful perspective, and it really influences how I think about reinforcement learning research. First, I’ll describe each of the two aspects in some more detail.

Reduction to RL

When we attempt to automate a real-world problem with RL, the first step is to reframe that problem as an MDP. This is easy to do for most problems, since the MDP framework is extremely generic. This is why RL is able to save so much human effort: rather than finding a solution, the human’s only task is to write the problem in a slightly different form.

Unfortunately, not all MDPs are equally easy to solve. In most cases, naively reducing to RL will result in posing an intractably difficult problem. For example, if a task has very sparse rewards that only arrive at the end of an episode, it will be quite difficult to find rewards at all, let alone to solve the task. But by reward shaping – adding in intermediate rewards that guide policies towards the true optimum – we can make the task of the RL algorithm far easier. Another example of a technique to make reduction-to-RL more tractable is sim2real, where we first hand-engineer a simulator (which is an MDP that approximates the real environment’s MDP), then identify the optimal policy on our simulator, and finally transfer it back to the real environment. Collecting data from a simulator requires compute, but it doesn’t require interacting with the real world, which is much slower. Therefore, it is quite a handy way to save effort on problems that require a lot of data collection³.
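As a concrete sketch of reward shaping, here is the potential-based form (Ng et al., 1999) applied to an invented sparse-reward chain task. The potential function, the chain, and the constants are all illustrative choices of mine; worth noting is that this particular potential-based form is the special case that provably preserves the optimal policy, while arbitrary shaping terms generally do not:

```python
GAMMA = 0.99

def potential(state, goal=5):
    # Heuristic "progress toward the goal" signal; closer => higher.
    return -abs(goal - state)

def shaped_reward(state, next_state, env_reward):
    # r'(s, s') = r(s, s') + gamma * phi(s') - phi(s)
    return env_reward + GAMMA * potential(next_state) - potential(state)

# A sparse-reward chain: the environment pays out only upon reaching state 5.
def env_reward(s, s2):
    return 1.0 if s2 == 5 else 0.0

# Moving toward the goal now yields immediate positive feedback, even
# though the environment's own reward is still zero:
progress = shaped_reward(3, 4, env_reward(3, 4))  # positive
stall = shaped_reward(3, 3, env_reward(3, 3))     # near zero
```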

Unfortunately, these techniques come with a huge downside: in changing the MDP, they change the MDP’s optimal policy. This means that even if our MDP solver gives us the perfect solution – the exact optimal policy for the MDP it was given – this solution might perform terribly on the real task! Research in reduction-to-RL is motivated by the question, “How can we pose real-world problems as MDPs, such that when we apply our MDP solver, the resulting policy performs well on the real task?” Subfields of RL that fit under this umbrella include transfer learning, unsupervised RL, meta-learning, sim2real, reward engineering, AI safety, and more.

Solving an MDP

At the core of reinforcement learning is just one fundamental problem: to find the optimal policy of an MDP. An MDP is a nice, clean mathematical abstraction, with none of the messiness of the real world. We do not have to ask questions like, “is this the right reward function?” or “will this solution transfer to other tasks?”, as we did during the reduction-to-RL phase. There is just one ground-truth MDP, and all we need to do is find the policy with the highest expected return.

Since the MDP framework is so generic, there’s still a lot of difficult research ahead of us before we discover an algorithm that will find the optimal policy for any MDP⁴. In order to make progress, it’s often nice to make assumptions about the MDP we are given. For example, we might assume that the state space is tabular, or that it is continuous but the transition function is Lipschitz, or that the rewards are bounded by [-1, 1]. These simplifications make it easier to reason about the MDP, as well as to empirically demonstrate the effectiveness of our proposed solutions.
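The tabular assumption, for instance, buys us a lot: with finitely many states and actions, we can just iterate the Bellman optimality operator to convergence. A minimal sketch on an invented two-state MDP (all names and numbers are mine):

```python
GAMMA = 0.9

# transitions[state][action] = (next_state, reward); deterministic for brevity.
transitions = {
    0: {"stay": (0, 0.0), "go": (1, 1.0)},
    1: {"stay": (1, 2.0), "go": (0, 0.0)},
}

# Value iteration: repeatedly apply the Bellman optimality operator.
V = {s: 0.0 for s in transitions}
for _ in range(1000):
    V = {s: max(r + GAMMA * V[s2] for (s2, r) in acts.values())
         for s, acts in transitions.items()}

# The optimal policy is greedy with respect to the converged values.
policy = {s: max(acts, key=lambda a: acts[a][1] + GAMMA * V[acts[a][0]])
          for s, acts in transitions.items()}
# Here: go to state 1, then stay there collecting the +2 reward forever.
```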

How can this taxonomy guide research?

Most reinforcement learning research is focused on improving or understanding just one of these two aspects; this choice has a large impact on everything from motivations, to evaluation strategies, to interpretation of the results. However, the community currently does not do a good job of differentiating between the two. Many papers seem to straddle the border, leading to a lack of rigor, researchers talking past each other, and misinterpretation of results.

To make this concrete, let’s take a look at one specific paper through this lens: the O.G. deep reinforcement learning paper, “Human-level control through deep reinforcement learning” (Mnih et al., 2015). This was the first paper to conclusively show that a deep reinforcement learning algorithm could learn to play games at human level from high-dimensional pixel inputs, and it is widely credited with kickstarting the entire field of deep RL. I want to start by focusing on one choice that Mnih et al. made: the decision to clip all rewards to [-1, 1].

If we interpret this clipping as an algorithmic choice for their MDP-solver algorithm, it is clearly not a great idea. The algorithm “clip rewards to [-1, 1] and then do Q-learning” is simply a bad algorithm. It’s trivial to come up with an MDP where the policy found by that algorithm is very far from optimal. But if, instead, we interpret this clipping as an RL reduction technique, it begins to seem more sensible. Solving an MDP with unbounded rewards is hard when using a deep-neural-net function approximator; solving an MDP with bounded rewards is much easier. And it turns out that, for most Atari games, the optimal policy of a clipped-reward version of the game is quite similar to the optimal policy of the original game. So, for this problem, it’s a useful reduction technique, and one that Mnih et al. were able to apply with great success.

There are a few other examples of RL reductions that we can find. One is the discount factor, \(\gamma = 0.99\). Since Atari is episodic (i.e. every episode eventually terminates), Q-learning should converge whether or not there is a discount factor. However, since deep RL is quite unstable, solving an MDP with a slow contraction rate is harder than solving one with a fast contraction rate, and so discounting is beneficial. Of course, changing the discount factor from 1 to 0.99 changes the optimal policy – but again, in this case, it seems not to matter. The final reduction heuristic employed⁵ is “terminate on life loss”, which is Atari-specific. An MDP which terminates whenever a life is lost has shorter episodes, making credit assignment easier, which makes it easier to solve; and once again, empirically, it seems that terminating early on these games does not change the optimal policy very much.
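In practice, reduction heuristics like these are typically implemented as environment wrappers. Here is a plain-Python sketch in the style of (but not depending on) Gym wrappers; `_FakeEnv` and the `info["lives"]` convention mimic the ALE interface, and everything here is purely illustrative:

```python
class ClipReward:
    """Clip rewards to [-1, 1]: a different MDP whose optimum is (we hope) similar."""
    def __init__(self, env):
        self.env = env
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, max(-1.0, min(1.0, reward)), done, info

class TerminateOnLifeLoss:
    """End the episode on life loss: shorter horizon, easier credit assignment."""
    def __init__(self, env):
        self.env = env
        self.lives = None  # a real wrapper would set the baseline in reset()
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.lives is not None and info.get("lives", self.lives) < self.lives:
            done = True
        self.lives = info.get("lives", self.lives)
        return obs, reward, done, info

class _FakeEnv:
    """Stands in for an Atari emulator; pays +5 and costs a life each step."""
    def __init__(self):
        self.lives = 3
    def step(self, action):
        self.lives -= 1
        return None, 5.0, False, {"lives": self.lives}

env = TerminateOnLifeLoss(ClipReward(_FakeEnv()))
_, r1, done1, _ = env.step(0)  # reward clipped 5.0 -> 1.0; baseline lives recorded
_, r2, done2, _ = env.step(0)  # a further life lost -> episode terminates early
```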

So, why do I feel that it is important that we identify these algorithmic decisions as being reduction-to-RL oriented, rather than MDP-solution oriented? One major reason is for evaluation. Mnih et al. – and every Atari DRL paper since – evaluate all of their games against the original, ground-truth versions of the Atari environments. This means that any proposed change is automatically evaluated in the context of both its effects on reduction and solution. Entangling these two factors is problematic, and can result in some weird conclusions.

Consider the following hypothetical (but plausible) situation. Let’s say we have some MDP for the game of Pong, PONG. We also have an MDP for the game of Pong with various simplifying changes, such as clipped rewards and a discount factor; let’s call this one GNOP. Let’s say the optimal policy on PONG, \(\pi^*_{\textbf{PONG}}\), gets expected return of 20 points on PONG: \(V_{\textbf{PONG}}(\pi^*_{\textbf{PONG}}) = 20\). Similarly, the optimal policy on GNOP, \(\pi^*_{\textbf{GNOP}}\), gets 15 points on GNOP: \(V_{\textbf{GNOP}}(\pi^*_{\textbf{GNOP}}) = 15\). It’s a bit lower, of course, because of the clipping and discount. Now, as it turns out, when we run \(\pi^*_{\textbf{GNOP}}\) on PONG, we get a return of 18 points: \(V_{\textbf{PONG}}(\pi^*_{\textbf{GNOP}}) = 18\). This means that we can solve the simpler MDP, GNOP, and still get 90% of optimal return on the MDP we actually care about, PONG – not a bad reduction! But now, let’s also assume that there exists some other policy \(\pi^{\text{meh}}\), such that \(V_{\textbf{GNOP}}(\pi^{\text{meh}}) = 14\), but \(V_{\textbf{PONG}}(\pi^{\text{meh}}) = 19\).⁶

Let’s say I’m an RL researcher, and I have come up with some algorithm for solving MDPs, novel(), as well as a baseline, baseline(). I claim that novel() is better at finding the optimal policy than baseline(). So, I go to evaluate it in the standard Atari setting, which involves training on the reduced version of the environment but evaluating on the true environment. As it turns out, baseline(GNOP) = \(\pi^{\text{meh}}\), and novel(GNOP) = \(\pi^*_{\textbf{GNOP}}\). Amazing! My new algorithm has beaten the baseline by finding the true optimal policy of the MDP… but wait. When I go to evaluate on the true environment, I find: \(V_{\textbf{PONG}}(baseline(\textbf{GNOP})) = 19\), but \(V_{\textbf{PONG}}(novel(\textbf{GNOP})) = 18\). Suddenly, my algorithm – which is genuinely an improvement over the baseline – doesn’t look so great. And if we only report evaluation on PONG and not GNOP, nobody will ever know.
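The inversion in this hypothetical is mechanical enough to spell out in a few lines; the numbers below are exactly the ones from the example above:

```python
# Expected returns (V_GNOP, V_PONG) for the two policies in the example.
returns = {
    "pi_star_gnop": (15.0, 18.0),  # optimal on the reduced MDP, GNOP
    "pi_meh":       (14.0, 19.0),  # slightly worse on GNOP, better on PONG
}

best_on_gnop = max(returns, key=lambda p: returns[p][0])
best_on_pong = max(returns, key=lambda p: returns[p][1])

# The ranking flips between the MDP we trained on and the MDP we report on:
# an algorithm that genuinely solves GNOP better can look worse on PONG.
```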

I think that this is a huge issue with the current standards for evaluation within the DRL community. I suspect that conflating the issues of reduction-to-RL and solution-of-MDPs has impeded our ability to recognize progress on either front. This issue is most noticeable in Atari by far, but it applies even in other domains; for example, some MuJoCo control tasks are episodic, but we still train with a discount factor and evaluate without one.

Clearly, the way to solve this issue is to also report numbers on GNOP. This is easy enough to do; simply identify which aspects of the algorithm are reduction-oriented and which are solution-oriented, create a version of the environment which obeys all of the reductions, and then evaluate both your algorithm and your baseline on this new MDP. Make the assumptions associated with each reduction explicit in the paper, in order to give a clear picture of the limitations of each method.

Reduction Results On Atari Are Irrelevant (When Studying MDP-Solving)

Now, I want to push a bit farther into some potentially-controversial territory: I think that evaluating on GNOP is actually far more important than evaluating on PONG. Why? Well, when looking at the big picture, nobody really cares about whether we can learn to play Atari games. It’s a phenomenal benchmark, because it is complex, diverse, has convenient human baselines to compare against, and can be cheaply and quickly emulated. But at the end of the day, it is most useful when viewed simply as a collection of complex MDPs, to be used to drive progress in algorithms to solve complex MDPs. At the moment, those MDPs are too complex for us to solve – and that’s okay! We can just apply reductions to those MDPs until they are solvable, and then do research on algorithms for those more restricted domains. Whether or not the policy transfers well to the other (original) collection of MDPs is more-or-less irrelevant.

Additionally, over time, researchers will push forward by removing reductions on the MDPs they study, getting closer and closer to solving true, original Atari. For example, the POPART method (Hessel et al., 2018) removes the need to clip the rewards. This is a really clever technique, and it is a definite step towards learning a policy for the true Atari environment (since it’s one fewer reduction). However, it also makes the MDP-solving harder, which means we may not be able to see improved performance immediately. To be given a fair evaluation, POPART should be compared to baseline methods on Atari with unclipped rewards, where it stands out far more (see Figure 2 in their paper). Do its learned policies transfer back to the original environment better (because the optimal policy is closer to the true environment’s optimal policy), or worse (because of the optimization difficulties)? Who cares! The important thing is that, within its problem domain, POPART outperforms alternatives.
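For intuition about how POPART removes the need for clipping: it trains on normalized value targets, but rescales the output head whenever the normalization statistics move, so that the unnormalized predictions are preserved exactly. Below is a stripped-down scalar sketch of that update – my own simplification of van Hasselt et al. (2016), on which POPART builds; the real method normalizes a full network head, not a single weight and bias:

```python
import math

class PopArtHead:
    """Scalar sketch: running target statistics + output-preserving rescale."""
    def __init__(self, beta=0.01):
        self.beta = beta              # step size for the running statistics
        self.mu, self.nu = 0.0, 1.0   # running first and second moments
        self.w, self.b = 1.0, 0.0     # parameters of the normalized head

    @property
    def sigma(self):
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def unnormalized(self, x):
        # Prediction in the environment's (unclipped) reward scale.
        return self.sigma * (self.w * x + self.b) + self.mu

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu += self.beta * (target - self.mu)
        self.nu += self.beta * (target ** 2 - self.nu)
        # "Preserve Outputs Precisely": rescale w and b so that the
        # unnormalized prediction is unchanged by the new statistics.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

head = PopArtHead()
before = head.unnormalized(0.7)
head.update_stats(250.0)          # an unclipped, Atari-scale return arrives
after = head.unnormalized(0.7)    # the statistics moved, the output didn't
```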

Of course, parallel research into reduction-to-RL techniques is also important. There are many interesting research avenues around what reductions are useful, what sorts of tasks can be used as stand-in proxies for others, etc. For these sorts of studies, Atari may also be a good benchmark, and the question of performance on the original environment once again becomes the most essential question to answer. I just want to suggest disentangling the two wherever possible, and highlighting these distinctions.


I hope that this perspective is valuable. If there is any research that doesn’t fit nicely into this taxonomy, or you have other thoughts or disagreements, please reach out to me on Twitter @jacobmbuckman, I’d love to discuss.

Thanks for reading. Follow me on Substack for more writing, or hit me up on Twitter @jacobmbuckman with any feedback or questions!

Thanks to Saurabh Kumar for feedback when writing this post.

  1. The goal of automation is to save effort. I’m going to define effort in a very hand-wavy way, something along the lines of “whatever it takes to solve a task”. There are many things which we could consider effort. I want to highlight a few main categories: human effort, data collection effort, and computational effort. (There are clearly other categories too, but these are the most relevant to RL.) These types of effort form a hierarchy of sorts. Human effort is the most expensive, since it requires the time and attention of a person who would rather be doing something else; data collection is next-most-expensive, since it requires setting up infrastructure to interact with the messy real world; and compute effort is the least expensive, because CPUs/GPUs are relatively cheap and algorithms can often be parallelized. We are almost always happy to transmute effort from a higher tier to a lower tier. For example, everyone loves calculators, which convert the effort required to multiply two numbers from human effort into compute effort. 

  2. Just going to write MDP from here on out, but everything I say applies to both MDPs and POMDPs. 

  3. Admittedly, at the expense of human effort to build the simulator. 

  4. Although there’s really no reason to believe we even need to be able to find the optimal policy for all MDPs. It’s enough to be able to solve all MDPs that correspond to problems that we might care about; given the regularity of the world we live in, this set is almost certainly much smaller. 

  5. These are all the ones I noticed, but if you see another one, let me know! 

  6. Of course, in reality, \(\pi^{\text{meh}}\) may or may not actually exist for this specific setting (Pong & reductions of clipping + discount), but just bear with me for the sake of the example! There is no reason, in general, that this couldn’t happen.