4 Related Work
Intrinsic Motivation:
A family of approaches to intrinsic motivation reward an agent based on prediction error [2, 27, 36, 42], prediction uncertainty [11, 44], or improvement [19, 34] of a forward dynamics model of the environment that gets trained along with the agent's policy.
As a result, the agent is driven to reach regions of the environment that are difficult for the forward dynamics model to predict, while the model improves its predictions in those regions.
These adversarial and non-stationary dynamics can give rise to complex behaviors.
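To make this mechanism concrete, the following is a minimal sketch of a prediction-error curiosity bonus: a forward dynamics model is trained on the agent's own transitions and its per-transition error is used as the intrinsic reward. The module names, network sizes, and optimizer setup are illustrative assumptions, not the architectures of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardDynamics(nn.Module):
    """Predicts the next observation from (observation, action)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        # act is assumed to be a one-hot (or continuous) action vector.
        return self.net(torch.cat([obs, act], dim=-1))

def intrinsic_reward(model, obs, act, next_obs):
    # The intrinsic reward is the per-transition prediction error.
    with torch.no_grad():
        pred = model(obs, act)
        return ((pred - next_obs) ** 2).mean(dim=-1)

def dynamics_update(model, optimizer, obs, act, next_obs):
    # The model is trained online on the agent's transitions, so states it
    # predicts poorly yield high reward until its predictions catch up.
    loss = F.mse_loss(model(obs, act), next_obs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```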
Relatively little work has been done in this area on the pure exploration setting where there is no external reward.
Of these, the most closely related are those that use a forward dynamics model of a feature space, such as Stadie et al. [42], who use autoencoder features, and Pathak et al. [27], who use features trained with an inverse dynamics task.
These correspond roughly to the VAE and IDF methods detailed in Section 2.1.
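As a rough sketch of the inverse-dynamics-feature variant (with illustrative names and sizes, and assuming a discrete action space), the embedding is trained with an inverse dynamics loss, while the forward model and the curiosity reward are computed in that feature space rather than on raw observations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDFCuriosity(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, feat_dim: int = 64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))
        self.inverse = nn.Linear(2 * feat_dim, n_actions)        # predicts a_t
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)

    def losses_and_reward(self, obs, next_obs, act):
        f, f_next = self.phi(obs), self.phi(next_obs)
        # Inverse dynamics loss shapes the features to retain what the
        # agent's actions can influence.
        inv_logits = self.inverse(torch.cat([f, f_next], dim=-1))
        inv_loss = F.cross_entropy(inv_logits, act)
        # Forward model predicts next features; its error is the reward.
        # Detaching the features here (an assumption) means only the inverse
        # loss trains the embedding, and the forward loss trains only the
        # forward head.
        act_onehot = F.one_hot(act, inv_logits.shape[-1]).float()
        pred_next = self.forward_model(
            torch.cat([f.detach(), act_onehot], dim=-1))
        fwd_err = ((pred_next - f_next.detach()) ** 2).mean(dim=-1)
        return inv_loss, fwd_err.mean(), fwd_err.detach()  # reward = fwd_err
```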
Smoothed versions of state visitation counts can be used for intrinsic rewards [3, 9, 24, 47].
Count-based methods have already shown very strong results when combined with extrinsic rewards, for example setting the state of the art in the Atari game Montezuma's Revenge [3], and have also shown significant exploration of the game without using the extrinsic reward.
It is not yet clear in which situations count-based approaches should be preferred over dynamics-based approaches; we chose to focus on dynamics-based bonuses in this paper since we found them straightforward to scale and parallelize.
In our preliminary experiments, we did not have sufficient success scaling up existing count-based implementations for a large-scale study.
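For contrast with the dynamics-based bonuses above, the following is a minimal sketch of a count-based bonus, under the simplifying assumption that observations are hashed into discrete buckets; the smoothed and pseudo-count variants in the cited work replace this table with a learned density model.

```python
import hashlib
from collections import defaultdict

import numpy as np

class CountBonus:
    def __init__(self, scale: float = 1.0):
        self.counts = defaultdict(int)
        self.scale = scale

    def _key(self, obs: np.ndarray) -> str:
        # Coarse discretization before hashing; the bin width is an
        # illustrative assumption.
        binned = np.round(obs, decimals=1).tobytes()
        return hashlib.sha1(binned).hexdigest()

    def bonus(self, obs: np.ndarray) -> float:
        k = self._key(obs)
        self.counts[k] += 1
        # Classic 1/sqrt(N(s)) exploration bonus: rarely visited states
        # receive a larger intrinsic reward.
        return self.scale / np.sqrt(self.counts[k])
```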