(Strong AI) Technological Singularity / (World Acceleration) 23 at FUTURE
(Strong AI) Technological Singularity / (World Acceleration) 23 - Killing Time 2ch

484:YAMAGUTIseisei
18/08/06 02:29:34.41 FnAR0u04o BE:132176669-2BP(3)
Metric                     Parallel   Incremental   Units
Area, 32 entries           288        78            LUTs
Area, total, 32 entries    340        150           LUTs
Period                     5.0        4.3           ns
Period, pipelined          2.9        2.5           ns
Area, total * period       1700       645           LUT*ns
Broadcast                  flash      iterative
Event bank conflicts?      never      sometimes
Area, 4 events/cycle       288        156           LUTs
Area, 64 entries           576        130           LUTs

TABLE II:
Comparing parallel and incremental schedulers.


However, the parallel scheduler retains some brute-force advantages.
It can process a broadcast event in a single cycle, whereas the incremental scheduler must iteratively drain a broadcast queue at a rate of 1-2 instructions per cycle.
This may cause issue stalls in some workloads.
The incremental scheduler is also subject to even/odd target bank conflicts which may delay an instruction wake up.
Real workload studies are needed to measure whether these effects overshadow its substantial area*period advantage.
Finally consider future scale up to wider issue and larger instruction windows.
The parallel scheduler does not grow when subdivided into more banks to process twice as many events per cycle, whereas the incremental scheduler core area doubles.
To grow the instruction window to 64 entries, the parallel scheduler requires twice as much area, whereas the incremental scheduler area grows more modestly.

485:YAMAGUTIseisei
18/08/06 02:30:24.21 FnAR0u04o BE:9791322-2BP(3)
IV.
CONCLUSION
This paper presents our work towards a practical out-of-order dataflow soft processor architecture for FPGAs.
We set out to discover whether a novel EDGE instruction set architecture, optimized for simpler high ILP microarchitectures in ASICs, is also a good fit for FPGAs, or whether general purpose soft processors will remain stuck in the scalar RISC slow lane.
We considered two different dataflow instruction scheduler designs and their optimized FPGA implementations.
In the context of commercial 200 MHz, 1,000-2,000 LUT soft processors, the limited FPGA resource cost and clock period impact of either design seems acceptable and practical.
Both design alternatives will scale well to future four-decode/two-issue implementations.


488:YAMAGUTIseisei
18/08/26 16:15:24.18 CL1hr8qnX BE:24477252-2BP(3)
>156 ー 180824 1738 3eCkMSqb
:
>URLリンク(thenextweb.com)
:
> Implementing curiosity in AI,
>
> OpenAI -- the singularity-focused think tank co-funded by Elon Musk -- has published a broad line of research on curiosity-driven learning (snip) ( URLリンク(pathak22.github.io)
:

489:YAMAGUTIseisei
18/08/26 16:18:00.66 CL1hr8qnX BE:73431465-2BP(3)
Page 1

Large-Scale Study of Curiosity-Driven Learning


Yuri Burda∗ (OpenAI)    Harri Edwards∗ (OpenAI)    Deepak Pathak∗ (UC Berkeley)
Amos Storkey (Univ. of Edinburgh)    Trevor Darrell (UC Berkeley)    Alexei A. Efros (UC Berkeley)


Abstract

Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent.
However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent.
Curiosity is a type of intrinsic reward function which uses prediction error as reward signal.
In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite.
Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments.
(b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks,
but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.).
(c) We demonstrate limitations of the prediction-based rewards in stochastic setups.
Game-play videos and code are at
URLリンク(pathak22.github.io)
.

490:YAMAGUTIseisei
18/08/26 16:20:05.55 CL1hr8qnX BE:58745838-2BP(3)
1
Introduction

Reinforcement learning (RL) has emerged as a popular method for training agents to perform complex tasks.
In RL, the agent policy is trained by maximizing a reward function that is designed to align with the task.
The rewards are extrinsic to the agent and specific to the environment they are defined for.
Most of the success in RL has been achieved when this reward function is dense and well-shaped, e.g., a running "score" in a video game [21].
However, designing a well-shaped reward function is a notoriously challenging engineering problem.
An alternative to "shaping" an extrinsic reward is to supplement it with dense intrinsic rewards [26], that is, rewards that are generated by the agent itself.
Examples of intrinsic reward include "curiosity" [11, 22, 27, 35, 40], which uses prediction error as reward signal, and "visitation counts" [3, 20, 24, 30], which discourages the agent from revisiting the same states.
The idea is that these intrinsic rewards will bridge the gaps between sparse extrinsic rewards by guiding the agent to efficiently explore the environment to find the next extrinsic reward.

491:YAMAGUTIseisei
18/08/26 16:20:46.49 CL1hr8qnX BE:66087893-2BP(3)
But what about scenarios with no extrinsic reward at all? This is not as strange as it sounds.
Developmental psychologists talk about intrinsic motivation (i.e., curiosity) as the primary driver in the early stages of development [32, 41]: babies appear to employ goal-less exploration to learn skills that will be useful later on in life.
There are plenty of other examples, from playing Minecraft to visiting your local zoo, where no extrinsic rewards are required.
Indeed, there is evidence that pre-training an agent on a given environment using only intrinsic rewards allows it to learn much faster when fine-tuned to a novel task in a novel environment [27, 28].
Yet, so far, there has been no systematic study of learning with only intrinsic rewards.

*Alphabetical ordering; the first three authors contributed equally.

Preprint.
Work in progress.

492:YAMAGUTIseisei
18/08/26 16:22:09.28 CL1hr8qnX BE:110146695-2BP(3)
Page 2

Figure 1:
A snapshot of the 54 environments investigated in the paper.
We show that agents are able to make progress using no extrinsic reward, or end-of-episode signal, and only using curiosity.
Video results, code and models at
URLリンク(pathak22.github.io)
.

493:YAMAGUTIseisei
18/08/26 16:23:45.97 CL1hr8qnX BE:73431656-2BP(3)
In this paper, we perform a large-scale empirical study of agents driven purely by intrinsic rewards across a range of diverse simulated environments.
In particular, we choose the dynamics-based curiosity model of intrinsic reward presented in Pathak et al. [27] because it is scalable and trivially parallelizable, making it ideal for large-scale experimentation.
The central idea is to represent intrinsic reward as the error in predicting the consequence of the agent's action given its current state, i.e., the prediction error of the learned forward dynamics of the agent.
We thoroughly investigate the dynamics-based curiosity across 54 environments: video games, physics engine simulations, and virtual 3D navigation tasks, shown in Figure 1.

To develop a better understanding of curiosity-driven learning, we further study the crucial factors that determine its performance.
In particular, predicting future state in high dimensional raw observation space (e.g., images) is a challenging problem and, as shown by recent works [27, 42], learning dynamics in an auxiliary feature space leads to improved results.
However, how one should choose such an embedding space is a critical, yet open research problem.
Through a systematic ablation, we examine the role of different ways to encode the agent's observation such that an agent can perform well driven purely by its own curiosity.

494:YAMAGUTIseisei
18/08/26 16:27:38.51 CL1hr8qnX BE:36715853-2BP(3)
To ensure stable online training of dynamics, we argue that the desired embedding space should: (a) be compact in terms of dimensionality,
(b) preserve sufficient information about the observation, and (c) be a stationary function of the observations.
We show that encoding observations via a random network turns out to be a simple, yet effective technique for modeling curiosity across many popular RL benchmarks.
This might suggest that many popular RL video game test-beds are not as visually sophisticated as commonly thought.
Interestingly, we discover that although random features are sufficient for good performance at training, the learned features appear to generalize better (e.g., to novel game levels in Super Mario Bros.).

In summary:
(a) We perform a large-scale study of curiosity-driven exploration across a variety of environments including:
the set of Atari games [4], Super Mario Bros., virtual 3D navigation in Unity [1], multi-player Pong, and Roboschool [39] environments.
(b) We extensively investigate different feature spaces for learning the dynamics-based curiosity: random features, pixels, inverse- dynamics [27] and variational auto-encoders [15] and evaluate generalization to unseen environments.
(c) We conclude by discussing some limitations of a direct prediction-error based curiosity formulation.
We observe that if the agent itself is the source of stochasticity in the environment, it can reward itself without making any actual progress.
We empirically demonstrate this limitation in a 3D navigation task where the agent controls different parts of the environment.


495:YAMAGUTIseisei
18/08/26 16:30:44.18 CL1hr8qnX BE:14686632-2BP(3)
Page 3

2
Dynamics-based Curiosity-driven Learning

Consider an agent that sees an observation xt, takes an action at and transitions to the next state with observation xt+1.
We want to incentivize this agent with a reward rt relating to how informative the transition was.
To provide this reward, we use an exploration bonus involving the following elements:
(a) a network to embed observations into representations φ(x),
(b) a forward dynamics network to predict the representation of the next state conditioned on the previous observation and action p(φ(xt+1)|xt,at).
Given a transition tuple {xt, xt+1, at}, the exploration reward is then defined as rt = −log p(φ(xt+1)|xt, at), also called the surprisal [2].
An agent trained to maximize this reward will favor transitions with high prediction error, which will be higher in areas where the agent has spent less time, or in areas with complex dynamics.
Such a dynamics-based curiosity has been shown to perform quite well across scenarios [27] especially when the dynamics are learned in an embedding space rather than raw observations.
In this paper, we explore dynamics-based curiosity and use the mean squared error corresponding to a fixed-variance Gaussian density as surprisal, i.e., ||f(xt, at) − φ(xt+1)||^2, where f is the learned dynamics model.
However, any other density model could be used.
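
As a minimal sketch of this reward (assuming PyTorch; `embed` and `forward_model` are hypothetical callables standing in for φ and the dynamics model, not the released implementation), the fixed-variance Gaussian surprisal reduces to a squared error in feature space:

import torch

def curiosity_reward(embed, forward_model, obs, action, next_obs):
    # phi(x_t) and phi(x_{t+1}); the embedding may be frozen (RF) or trained elsewhere (IDF, VAE)
    with torch.no_grad():
        phi_t = embed(obs)
        phi_next = embed(next_obs)
        # predicted next-state features given current features and the action taken
        phi_pred = forward_model(phi_t, action)
        # surprisal of a fixed-variance Gaussian is, up to constants, the squared error
        return 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1)

The same squared error, computed with gradients enabled, is what the forward dynamics model itself is trained to minimize.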

2.1
Feature spaces for forward dynamics
Consider the representation φ in the curiosity formulation above.
If φ(x) = x, the forward dynamics model makes predictions in the observation space.
A good choice of feature space can make the prediction task more tractable and filter out irrelevant aspects of the observation space.
But, what makes a good feature space for dynamics driven curiosity? We narrow down a few qualities that a good feature space should have:

496:YAMAGUTIseisei
18/08/26 16:33:46.18 CL1hr8qnX BE:48954645-2BP(3)
• Compact: The features should be easy to model by being low(er)-dimensional and filtering out irrelevant parts of the observation space.
• Sufficient: The features should contain all the important information.
Otherwise, the agent may fail to be rewarded for exploring some relevant aspect of the environment.
• Stable: Non-stationary rewards make it difficult for reinforcement agents to learn.
Exploration bonuses by necessity introduce non-stationarity since what is new and novel becomes old and boring with time.
In a dynamics-based curiosity formulation, there are two sources of non-stationarity: the forward dynamics model is evolving over time as it is trained and the features are changing as they learn.
The former is intrinsic to the method, and the latter should be minimized where possible.

In this work, we systematically investigate the efficacy of a number of feature learning methods, summarized briefly as follows:

Pixels
The simplest case is where φ(x) = x and we fit our forward dynamics model in the observation space.
Pixels are sufficient, since no information has been thrown away, and stable since there is no feature learning component.
However, learning from pixels is tricky because the observation space may be high-dimensional and complex.

Random Features (RF)
The next simplest case is where we take our embedding network, a convolutional network, and fix it after random initialization.
Because the network is fixed, the features are stable.
The features can be made compact in dimensionality, but they are not constrained to be.
However, random features may fail to be sufficient.
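
A sketch of such a fixed random embedding (PyTorch; the layer sizes are illustrative, not the paper's exact architecture): the network is randomly initialized and never updated, so φ stays stationary:

import torch.nn as nn

def make_random_embedding(out_dim=512, in_channels=4):
    # small conv net over 84x84 frame stacks; weights are frozen right after initialization
    net = nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.LeakyReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.LeakyReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.LeakyReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, out_dim),
    )
    for p in net.parameters():
        p.requires_grad_(False)  # never trained: the features are stable by construction
    return net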

497:YAMAGUTIseisei
18/08/26 16:36:04.77 CL1hr8qnX BE:9791322-2BP(3)
Variational Autoencoders (VAE) VAEs were introduced in [15, 31] to fit latent variable generative models p(x, z) for observed data x and latent variable z with prior p(z) using variational inference.
The method calls for an inference network q(z|x) that approximates the posterior p(z|x).
This is a feedforward network that takes an observation as input and outputs a mean and variance vector describing a Gaussian distribution with diagonal covariance.


             VAE    IDF    RF     Pixels
Stable       No     No     Yes    Yes
Compact      Yes    Yes    Maybe  No
Sufficient   Yes    Maybe  Maybe  Yes

Table 1:
Table summarizing the categorization of different kinds of feature spaces considered.



Page 4

We can then use the mapping to the mean as our embedding network φ.
These features will be a low-dimensional approximately sufficient summary of the observation,
but they may still contain some irrelevant details such as noise, and the features will change over time as the VAE trains.
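
A sketch of how the VAE encoder mean can serve as φ (PyTorch; `encoder` is a hypothetical convolutional trunk, and the sizes are illustrative):

import torch.nn as nn

class VAEEmbedding(nn.Module):
    # wraps the inference network q(z|x); only the Gaussian mean is exposed as phi(x)
    def __init__(self, encoder, hidden_dim=512, latent_dim=512):
        super().__init__()
        self.encoder = encoder                            # conv trunk -> flat hidden vector
        self.mean = nn.Linear(hidden_dim, latent_dim)     # mu(x), used as the embedding
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # log sigma^2(x), used only in the VAE loss

    def forward(self, x):
        h = self.encoder(x)
        return self.mean(h)  # note: these features drift over time as the VAE trains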

498:YAMAGUTIseisei
18/08/26 16:37:31.56 CL1hr8qnX BE:156653388-2BP(3)
Inverse Dynamics Features (IDF) Given a transition (st,st+1,at) the inverse dynamics task is to predict the action at given the previous and next states st and st+1.
Features are learned using a common neural network φ to first embed st and st+1.
The intuition is that the features learned should correspond to aspects of the environment that are under the agent's immediate control.
This feature learning method is easy to implement and in principle should be invariant to certain kinds of noise (see [27] for a discussion).
A potential downside could be that the features learned may not be sufficient, that is they do not represent important aspects of the environment that the agent cannot immediately affect.
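
A sketch of the inverse-dynamics objective (PyTorch, discrete actions; `embed` is the shared φ network and the layer width is illustrative, not the released code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamicsHead(nn.Module):
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def loss(self, embed, s_t, s_next, a_t):
        # embed both states with the shared network phi, then predict the action taken
        phi_t, phi_next = embed(s_t), embed(s_next)
        logits = self.classifier(torch.cat([phi_t, phi_next], dim=-1))
        return F.cross_entropy(logits, a_t)  # gradients flow into phi and shape the features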
A summary of these characteristics is provided in Table 1.
Note that the learned features are not stable because their distribution changes as learning progresses.
One way to achieve stability could be to pre-train VAE or IDF networks.
However, unless one has access to the internal state of the game, it is not possible to get representative data of the game scenes to train the features.
One way is to act randomly to collect data, but then it will be biased to where the agent started, and won't generalize further.
Since all the features involve some trade-off of desirable properties, it becomes an empirical question as to how effective each of them is across environments.

2.2
Practical considerations in training an agent driven purely by curiosity
Deciding upon a feature space is only the first part of the puzzle in implementing a practical system.
Here, we detail the critical choices we made in the learning algorithm.
Our goal was to reduce non-stationarity in order to make learning more stable and consistent across environments.
Through the considerations outlined below, we are able to get exploration to work reliably for different feature learning methods and environments with minimal changes to the hyper-parameters.

499:YAMAGUTIseisei
18/08/26 16:38:36.96 CL1hr8qnX BE:176234898-2BP(3)
• PPO.
In general, we have found the PPO algorithm [38] to be a robust learning algorithm that requires little hyper-parameter tuning, and hence we stick to it for our experiments.
• Reward normalization.
Since the reward function is non-stationary, it is useful to normalize the scale of the rewards so that the value function can learn quickly.
We did this by dividing the rewards by a running estimate of the standard deviation of the sum of discounted rewards (a minimal sketch of this and of the observation normalization follows this list).
• Advantage normalization.
While training with PPO, we normalize the advantages [46] in a batch to have a mean of 0 and a standard deviation of 1.
• Observation normalization.
We run a random agent on our target environment for 10000 steps, then calculate the mean and standard deviation of the observation and use these to normalize the observations when training.
This is useful to ensure that the features do not have very small variance at initialization and to have less variation across different environments.
• More actors.
The stability of the method is greatly increased by increasing the number of parallel actors (which affects the batch-size) used.
We typically use 128 parallel runs of the same environment for data collection while training an agent.
• Normalizing the features.
In combining intrinsic and extrinsic rewards, we found it useful to ensure that the scale of the intrinsic reward was consistent across state space.
We achieved this by using batch-normalization [13] in the feature embedding network.
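
A minimal sketch of the reward and observation normalization described in the list above (NumPy; the running-statistics bookkeeping is simplified relative to the released implementation):

import numpy as np

class RewardNormalizer:
    # divide raw curiosity rewards by a running estimate of the std of the discounted reward sum
    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0                       # running discounted return
        self.count, self.mean, self.var = 1e-4, 0.0, 1.0

    def normalize(self, r):
        self.ret = self.gamma * self.ret + r
        delta = self.ret - self.mean         # Welford-style running variance update
        self.count += 1
        self.mean += delta / self.count
        self.var += (delta * (self.ret - self.mean) - self.var) / self.count
        return r / (np.sqrt(self.var) + self.eps)

def make_obs_normalizer(random_agent_obs):
    # statistics gathered from a 10000-step random rollout, as described above
    mu = random_agent_obs.mean(axis=0)
    sigma = random_agent_obs.std(axis=0) + 1e-8
    return lambda obs: (obs - mu) / sigma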

500:YAMAGUTIseisei
18/08/26 16:41:37.25 CL1hr8qnX BE:97908858-2BP(3)
2.3
'Death is not the end': discounted curiosity with infinite horizon
One important point is that the use of an end of episode signal, sometimes called a 'done', can often leak information about the true reward function.
If we don't remove the 'done' signal, many of the Atari games become too simple.
For example, a simple strategy of giving +1 artificial reward at every time-step when the agent is alive and 0 on death is sufficient to obtain a high score in some games,
for instance, the Atari game 'Breakout', where it will seek to maximize the episode length and hence its score.
In the case of negative rewards, the agent will try to end the episode as quickly as possible.


501:YAMAGUTIseisei
18/08/26 16:42:21.10 CL1hr8qnX BE:29373034-2BP(3)
Page 5


[Figure 2 plot: extrinsic reward per episode vs. frames (millions) on BeamRider, Breakout, MontezumaRevenge, Pong, Mario, Qbert, Riverraid, Seaquest and SpaceInvaders, with curves for Pixels, VAE features, Inverse Dynamics features and Random CNN features.]

Figure 2:
A comparison of feature learning methods on 8 selected Atari games and the Super Mario Bros.

502:YAMAGUTIseisei
18/08/26 16:43:13.06 CL1hr8qnX BE:48954454-2BP(3)
These evaluation curves show the mean reward (with standard error) of agents trained purely by curiosity, without reward or an end-of-episode signal.
We see that our purely curiosity-driven agent is able to gather rewards in these environments without using any extrinsic reward at training.
Results on all of the Atari games are in the appendix in Figure 8.
We find that the curiosity model trained on pixels does not work well in any environment, and VAE features perform either the same as or worse than random and inverse-dynamics features.
Further, inverse dynamics-trained features perform better than random features in 55% of the Atari games.
An interesting outcome of this analysis is that random features for modeling curiosity are a simple, yet surprisingly strong baseline and likely to work well in half of the Atari games.


In light of this, if we want to study the behavior of pure exploration agents, we should not bias the agent.
In the infinite horizon setting (i.e., the discounted returns are not truncated at the end of the episode and always bootstrapped using the value function), death is just another transition to the agent, to be avoided only if it is boring.
Therefore, we removed 'done' to separate the gains of an agent's exploration from merely that of the death signal.
In practice, we do find that the agent avoids dying in the games since that brings it back to the beginning of the game, an area it has already seen many times and where it can predict the dynamics well.
This subtlety has been neglected by previous works showing experiments without extrinsic rewards.
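
A sketch of what removing 'done' means for the return computation (illustrative NumPy pseudocode, not the released implementation): returns are always bootstrapped through episode boundaries, so death is just another transition:

import numpy as np

def infinite_horizon_returns(rewards, last_value, gamma=0.99):
    # rewards: array of shape [T]; last_value: V(s_T) used for bootstrapping.
    # There is no 'done' mask anywhere: episode ends are not treated specially.
    returns = np.zeros(len(rewards))
    running = last_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns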

503:YAMAGUTIseisei
18/08/26 16:47:17.14 CL1hr8qnX BE:44059436-2BP(3)
3
Experiments

In all of our experiments, both the policy and the embedding network work directly from pixels.
For our implementation details including hyper-parameters and architectures, please refer to the Appendix A.
Unless stated otherwise, all curves are the average of three runs with different seeds, and the shaded areas are standard errors of the mean.
We have released the code and videos of a purely curious agent playing across all environments on the website 2.

3.1
Curiosity-driven learning without extrinsic rewards
We begin by scaling up pure curiosity-driven learning to a large number of environments without using any extrinsic rewards.
We pick a total of 54 diverse simulated environments, as shown in Figure 1,
including 48 Atari games, Super Mario Bros., 2 Roboschool scenarios (learning Ant controller and Juggling), Two-player Pong, 2 Unity mazes (with and without a TV controlled by the agent).
The goal of this large-scale analysis is to investigate the following questions:
(a) What actually happens when you run a pure curiosity-driven agent on a variety of games without any extrinsic rewards?
(b) What kinds of behaviors can you expect from these agents? (c) What is the effect of the different feature learning variants in dynamics-based curiosity on these behaviors?

2
URLリンク(pathak22.github.io)


504:YAMAGUTIseisei
18/08/26 16:48:28.26 CL1hr8qnX BE:9791322-2BP(3)
Page 6

A) Atari Games
To answer these questions, we began with a collection of well-known Atari games and ran a suite of experiments with different feature learning methods.
One way to measure how well a purely curious agent performs is to measure the extrinsic reward it is able to achieve, i.e. how good is the agent at playing the game.
We show the evaluation curves of mean extrinsic reward on 8 common Atari games in Figure 2 and on the full 48-game Atari suite in Figure 8 in the appendix.
It is important to note that the extrinsic reward is only used for evaluation, not for training.
However, this is just a proxy for pure exploration because the game rewards could be arbitrary and might not align at all with how the agent explores out of curiosity.

The first thing to notice from the curves is: most of them are going up.
This shows that a pure curiosity-driven agent can learn to obtain external rewards even without using any extrinsic rewards during training.
It is remarkable that agents with no extrinsic reward and no end of episode signal can learn to get scores comparable in some cases to learning with the extrinsic reward.
For instance, in Breakout, the game score increases on hitting the ball with the paddle into bricks which disappear and give points when struck.
The more times the bricks are struck in a row by the ball, the more complicated the pattern of remaining bricks becomes, making the agent more curious to explore further, hence collecting points as a by-product.
Further, when the agent runs out of lives, the bricks are reset to a uniform structure that has been seen by the agent many times before and is hence very predictable, so the agent tries to stay alive and remain curious by avoiding the reset caused by death.

505:YAMAGUTIseisei
18/08/26 16:54:32.44 CL1hr8qnX BE:132176669-2BP(3)
This is an unexpected result and might suggest that many popular RL test-beds do not need an external reward.
This may be because game designers (like architects, urban planners, gardeners, etc.) are
very good at setting up curriculums to guide agents through the task, which explains why a curiosity-like objective aligns decently with the extrinsic reward in many human-designed environments [6, 12, 16, 48].
However, this is not always the case, and sometimes a curious agent can even do worse than a random agent!
This happens when the extrinsic reward has little correlation with the agent's exploration, or when the agent fails to explore efficiently (e.g. see the games 'Atlantis' and 'IceHockey' in Figure 8).
We further encourage the reader to refer to the game-play videos of the agent available on the website for a better understanding of the learned skills.

Comparison of feature learning methods:
We compare four feature learning methods in Figure 2: raw pixels, random features, inverse dynamics features and VAE features.
Training the dynamics on raw pixels performs poorly across all environments, while encoding pixels into features does better.
This is likely because it is hard to learn a good dynamics model in pixel space, and prediction errors may be dominated by small irrelevant details.

Surprisingly, random features (RF) perform quite well across tasks and sometimes better than using learned features.
One reason for the good performance is that the random features are kept frozen (stable), so the dynamics model learned on top of them has an easier time because of the stationarity of the target.
In general, random features should work well in domains where the visual observations are simple enough that random features preserve sufficient information about the raw signal, for instance Atari games.
Interestingly, we find that while random features work well at training, IDF learned features appear to generalize better in Mario Bros. (see Section 3.2 for details).

506:YAMAGUTIseisei
18/08/26 16:55:34.83 CL1hr8qnX BE:110146695-2BP(3)
The VAE method also performed well but was somewhat unstable, so we decided to use RF and IDF for further experiments.
The detailed result in appendix Figure 8 compares IDF vs.
RF across the full Atari suite.
To quantify the learned behaviors, we compared our curious agents to a randomly acting agent.
We found that an IDF-curious agent collects more game reward than a random agent in 75% of the Atari games, an RF-curious agent does better in 70%.
Further, IDF does better than RF in 55% of the games.
Overall, random features and inverse dynamics features worked well in general.
Further details in the appendix.

B) Super Mario Bros.
We compare different feature learning methods in Mario Bros. in Figure 2.
Super Mario Bros has already been studied in the context of extrinsic-reward-free learning [27] in small-scale experiments, and so we were keen to see how far curiosity alone can push the agent.
We use a more efficient, faster version of the Mario simulator to scale up to longer training while keeping the observation space, actions, and dynamics of the game intact.
Due to 100x longer training and using PPO for optimization, our agent is able to pass several levels of the game, significantly improving over prior exploration results on Mario Bros.
Could we further push the performance of a purely curious agent by making the underlying optimization more stable? One way is to scale up the batch-size.
We do so by increasing the number of parallel threads for running environments from 128 to 2048.


507:YAMAGUTIseisei
18/08/26 16:58:13.95 CL1hr8qnX BE:34268827-2BP(3)
Page 7


[Figure 3 plots: (a) Mario with large batch: extrinsic reward per episode vs. number of gradient updates, for batches of 128 and 1024 environments; (b) Juggling (Roboschool) and (c) Two-player Pong: pure curiosity (no-reward, infinite-horizon) exploration vs. frames (millions).]

Figure 3:
(a) Left: A comparison of the RF method on Mario with different batch sizes.
Results are without using extrinsic reward.
(b) Center: Number of ball bounces in the Juggling (Roboschool) environment.
(c) Right: Mean episode length in the multiplayer Pong environment.
The discontinuous jump on the graph corresponds to the agent reaching a limit of the environment:
after a certain number of steps in the environment the Atari Pong emulator starts randomly cycling through background colors and becomes unresponsive to the agent's actions.

508:YAMAGUTIseisei
18/08/26 16:59:09.88 CL1hr8qnX BE:19582324-2BP(3)
We show the comparison between training using 128 and 2048 parallel environment threads in Figure 3(a).
As apparent from the graph, training with large batch-size using 2048 parallel environment threads performs much better.
In fact, the agent is able to explore much more of the game: discovering 11 different levels of the game, finding secret rooms and defeating bosses.
Note that the x-axis in the figure is the number of gradient steps, not the number of frames, since the point of this large-scale experiment is not a claim about sample-efficiency, but performance with respect to training the agent.
This result suggests that the performance of a purely curiosity-driven agent would improve as the training of base RL algorithm (which is PPO in our case) gets better.
The video is on the website.

C) Roboschool Juggling
We modified the Pong environment from the Roboschool framework to only have one paddle and to have two balls.
The action space is continuous with two dimensions, and we discretized it into 5 bins per dimension, giving a total of 25 actions.
Both the policy and embedding network are trained on pixel observation space (note: not state space).
This environment is more difficult to control than the toy physics used in games, but the agent learns to intercept and strike the balls when they come into its area.
We monitored the number of bounces of the balls as a proxy for interaction with the environment, as shown in Figure 3(b).
See the video on the project website.
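
A sketch of the action discretization described above (the bin placement over [-1, 1] is an assumption; the paper only states 5 bins per dimension):

import itertools
import numpy as np

BINS = np.linspace(-1.0, 1.0, 5)                         # 5 values per control dimension (assumed range)
DISCRETE_ACTIONS = list(itertools.product(BINS, BINS))   # 5 x 5 = 25 joint actions

def to_continuous(action_index):
    # map the policy's discrete choice back to the 2-D continuous control
    return np.array(DISCRETE_ACTIONS[action_index], dtype=np.float32)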

509:YAMAGUTIseisei
18/08/26 16:59:48.41 CL1hr8qnX BE:119937877-2BP(3)
D) Roboschool Ant Robot
We also explored using the Ant environment which consists of an Ant with 8 controllable joints on a track.
We again discretized the action space and trained policy and embedding network on raw pixels (not state space).
However, in this case, it was harder to measure exploration because the extrinsic distance reward measures progress along the racetrack, but a purely curious agent is free to move in any direction.
We find that walking-like behavior emerges purely out of curiosity-driven training.
We refer the reader to the result video showing that the agent is meaningfully interacting with the environment.

E) Multi-agent curiosity in Two-player Pong
We have already seen that a purely curiosity-driven agent learns to play several Atari games without reward, but we wonder how much of that behavior is caused by the fact that the opposing player is a computer agent with a hardcoded strategy.
What would happen if both sides playing against each other were curious? To find out, we take the Two-player Pong game where both sides (the paddles) are controlled by curiosity-driven agents.
We share the initial layers of both agents and use different action heads, i.e., the total action space is now the cross product of the actions of player 1 and the actions of player 2.
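
A sketch of such a joint head (PyTorch; the factorization into per-player logits that are summed is an illustrative choice, not necessarily how the released code builds the cross-product action space):

import torch.nn as nn

class TwoPlayerHead(nn.Module):
    def __init__(self, feat_dim, n_actions_p1, n_actions_p2):
        super().__init__()
        self.head1 = nn.Linear(feat_dim, n_actions_p1)  # player-1 logits
        self.head2 = nn.Linear(feat_dim, n_actions_p2)  # player-2 logits
        self.n2 = n_actions_p2

    def forward(self, shared_features):
        l1 = self.head1(shared_features)                # [batch, |A1|]
        l2 = self.head2(shared_features)                # [batch, |A2|]
        joint = l1.unsqueeze(-1) + l2.unsqueeze(-2)     # cross product; index = a1 * |A2| + a2
        return joint.reshape(joint.size(0), -1)         # [batch, |A1| * |A2|]

    def decode(self, joint_index):
        return joint_index // self.n2, joint_index % self.n2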

Note that the extrinsic reward is meaningless in this context since the agent is playing both sides, so instead, we show the length of the episode.
The results are shown in Figure 3(c).
We see from the episode length that the agent learns to have more and longer rallies over time, learning to play pong without any teacher, purely by curiosity on both sides.
In fact, the game rallies eventually get so long that they break our Atari emulator causing the colors to change radically, which crashes the policy as shown in the plot.


510:YAMAGUTIseisei
18/08/26 17:01:15.34 CL1hr8qnX BE:19581942-2BP(3)
Page 8

3.2
Generalization across novel levels in Super Mario Bros.
In the previous section, we showed that our purely curious agent can learn to explore efficiently and learn useful skills, e.g., game-playing behaviour in games, walking behaviour in Ant, etc.
So far, these skills were shown in the environments the agent was trained on.
However, one advantage of developing reward-free learning is that one should then be able to utilize abundant "unlabeled" environments without reward functions by showing generalization to novel environments.

To test this, we first pre-train our agent using curiosity only in the Level 1-1 of Mario Bros.
We investigate how well RF and IDF-based curiosity agents generalize to novel levels of Mario.
In Figure 4, we show two examples of training on one level of Mario and finetuning on another testing level, and compare to learning on the testing level from scratch.
The training signal in all the cases is only curiosity reward.
In the first case, from Level 1-1 to Level 1-2, the global statistics of the environments match (both are 'day' environments, i.e., blue background), but the levels have different enemies, geometry and difficulty.
We see that there is strong transfer for both methods in this scenario.
However, the transfer performance is weaker in the second scenario from Level 1-1 to Level 1-3.
This is so because the problem is considerably harder for the latter level pairing as there is a color scheme shift from day to night, as shown in Figure 4.

511:YAMAGUTIseisei
18/08/26 17:02:53.25 CL1hr8qnX BE:29373034-2BP(3)
We further note that IDF-learned features transfer in both the cases and random features transfer in the first case, but do not transfer in the second scenario from day to night.
These results might suggest that while random features perform well on training environments, learned features appear to generalize better to novel levels.
However, this needs more analysis in the future across a large variety of environments.
Overall, we find some promising evidence showing that skills learned by curiosity help our agent explore efficiently in novel environments.


[Figure 4 plots: extrinsic reward per episode vs. frames (millions), left panel "World 1 level 1 to world 2 level 1" and right panel "World 1 level 1 to world 3 level 1", with curves for IDF scratch, IDF transfer, RF scratch and RF transfer.]

Figure 4:
Mario generalization experiments.
On the left we show transfer results from Level 1-1 to Level 1-2, and on the right we show transfer results from Level 1-1 to Level 1-3.
Underneath each plot is a map of the source and target environments.
All agents are trained without extrinsic reward.

512:YAMAGUTIseisei
18/08/26 17:04:02.97 CL1hr8qnX BE:36715853-2BP(3)
[Figure 5 plot (Unity maze): extrinsic reward per episode vs. frames (millions), with curves for Random CNN features, Extrinsic only, and Inverse dynamics features.]

Figure 5: Mean extrinsic reward in the Unity environment while training with terminal extrinsic + curiosity reward.
Note that the curve for extrinsic reward only training is constantly zero.


3.3 Curiosity with Sparse External Reward
In all our experiments so far, we have shown that our agents can learn useful skills without any extrinsic rewards driven purely by curiosity.
However, in many scenarios, we might want the agent to perform some particular task of interest.
This is usually conveyed to the agent by defining extrinsic rewards.
When rewards are dense (e.g. game score at every frame), classic RL works well and intrinsic rewards generally should not help performance.
However, designing dense rewards is a challenging engineering problem (see introduction for details).
In this section, we evaluate how well curiosity can help an agent perform a task in the presence of sparse, or just terminal, rewards.

Terminal reward setting:
For many real problems, e.g. navigation, only a terminal reward is available, a setting where classic RL typically performs poorly.
Hence, we consider 3D navigation in a maze designed in the Unity ML-agent framework with 9 rooms and a sparse terminal reward.


513:YAMAGUTIseisei
18/08/26 17:04:57.59 CL1hr8qnX BE:78326584-2BP(3)
Page 9

There is a discrete action space consisting of: move forwards, look left 15 degrees, look right 15 degrees and no-op.
The agent starts in room 1, which is furthest away from room 9, which contains the goal.
We compare an agent trained with extrinsic reward (+1 when the goal is reached, 0 otherwise) to an agent trained with extrinsic + intrinsic reward.
The extrinsic-only (classic RL) agent never finds the goal in any of our trials, which means it gets no meaningful gradients.
In contrast, extrinsic+intrinsic typically converges to reaching the goal every time.
Figure 5 shows results for vanilla PPO, PPO + IDF-curiosity and PPO + RF-curiosity.

Sparse reward setting: In preliminary experiments, we picked 5 Atari games which have sparse rewards (as categorized by [3]), and compared extrinsic (classic RL) vs.
extrinsic+intrinsic (ours) reward performance.
In 4 games out of 5, the curiosity bonus improves performance (see Table 2 in the appendix; higher scores are better).
We would like to emphasize that this is not the focus of the paper, and these experiments are provided just for completeness.
We simply combined the extrinsic (coefficient 1.0) and intrinsic reward (coefficient 0.01) directly without any tuning.
We leave the question of how to optimally combine extrinsic and intrinsic rewards as a future direction.
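
For concreteness, the combination used in these experiments is just a weighted sum (the coefficients are those stated above; the function name is illustrative):

def combined_reward(extrinsic_r, intrinsic_r, ext_coef=1.0, int_coef=0.01):
    # untuned weighting: extrinsic 1.0, intrinsic 0.01, as stated in the text
    return ext_coef * extrinsic_r + int_coef * intrinsic_r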

514:YAMAGUTIseisei
18/08/26 17:11:36.23 CL1hr8qnX BE:29373326-2BP(3)
4
Related Work

Intrinsic Motivation:
A family of approaches to intrinsic motivation reward
an agent based on prediction error [2, 27, 36, 42], prediction uncertainty [11, 44], or improvement [19, 34] of a forward dynamics model of the environment that gets trained along with the agent's policy.
As a result the agent is driven to reach regions of the environment that are difficult to predict for the forward dynamics model, while the model improves its predictions in these regions.
These adversarial and non-stationary dynamics can give rise to complex behaviors.
Relatively little work has been done in this area on the pure exploration setting where there is no external reward.
Of these, most closely related are those that use a forward dynamics model of a feature space, such as Stadie et al. [42], who use autoencoder features, and Pathak et al. [27], who use features trained with an inverse dynamics task.
These correspond roughly to the VAE and IDF methods detailed in Section 2.1.

Smoothed versions of state visitation counts can be used for intrinsic rewards [3, 9, 24, 47].
Count-based methods have already shown very strong results when combined with extrinsic rewards, such as setting the state of the art in the Atari game Montezuma's Revenge [3],
and also showing significant exploration of the game without using the extrinsic reward.
It is not yet clear in which situations count-based approaches should be preferred over dynamics-based approaches; we chose to focus on dynamics-based bonuses in this paper since we found them straightforward to scale and parallelize.
In our preliminary experiments, we did not have sufficient success with already existing count-based implementations in scaling up for a large-scale study.

515:YAMAGUTIseisei
18/08/26 17:16:26.46 CL1hr8qnX BE:68536447-2BP(3)
Learning without extrinsic rewards or fitness functions has also been studied extensively in evolutionary computing, where it is referred to as 'novelty search' [17, 18, 43].
There the novelty of an event is often defined as the distance of the event to the nearest neighbor amongst previous events, using some statistics of the event to compute distances.
One interesting finding from this literature is that often much more interesting solutions can be found by not solely optimizing for fitness.

Other methods of exploration are designed to work in combination with maximizing a reward function, such as those utilizing uncertainty about value function estimates [5, 23], or those using perturbations of the policy for exploration [8, 29].
Schmidhuber [37], Oudeyer [25], and Oudeyer and Kaplan [26] provide a great review of some of the earlier work on approaches to intrinsic motivation.
Alternative methods of exploration include Sukhbaatar et al. [45], who utilize an adversarial game between two agents for exploration.
In Gregor et al. [10], they optimize a quantity called empowerment, which is a measure of the control an agent has over the state.
In concurrent work, Eysenbach et al. [7] use diversity as a measure to learn skills without reward functions.

Random Features:
One of the findings in this paper is the surprising effectiveness of random features, and there is a substantial literature on random projections and more generally randomly initialized neural networks.
Much of the literature has focused on using random features for classification [14, 33, 49] where the typical finding is that whilst random features can work well for simpler problems,
feature learning performs much better once the problem becomes sufficiently complex.
Whilst we expect this pattern to also hold true for dynamics-based exploration, we have some preliminary evidence showing that learned features appear to generalize better to novel levels in Mario Bros.


516:YAMAGUTIseisei
18/08/26 17:18:41.35 CL1hr8qnX BE:110146695-2BP(3)
Page 10

5
Discussion

We have shown that our agents trained purely with a curiosity reward are able to learn useful behaviours: (a) The agent is able to play many Atari games without using any rewards.
(b) Mario is able to cross over 11 levels without reward.
(c) Walking-like behavior emerged in the Ant environment.
(d) Juggling-like behavior emerged in the Roboschool environment. (e) Rally-making behavior emerged in Two-player Pong with curiosity-driven agents on both sides.
But this is not always true as there are some Atari games where exploring the environment does not correspond to extrinsic reward.

More generally, these results suggest that, in environments designed by humans, the extrinsic reward is perhaps often aligned with the objective of seeking novelty.
Game designers set up curriculums to guide users while playing the game, which explains why a curiosity-like objective aligns decently with the extrinsic reward in many human-designed games [6, 12, 16, 48].



[Figure 6 plot: extrinsic reward per episode vs. frames (millions), with curves for RF with TV off, RF with TV on, IDF with TV off, and IDF with TV on.]

Figure 6:
We add a noisy TV to the Unity environment of Section 3.3.
We compare IDF and RF with and without the TV.

517:YAMAGUTIseisei
18/08/26 17:19:53.47 CL1hr8qnX BE:78326584-2BP(3)
Limitation of prediction error based curiosity:
A more serious potential limitation is the handling of stochastic dynamics.
If the transitions in the environment are random, then even with a perfect dynamics model, the expected reward will be the entropy of the transition, and the agent will seek out transitions with the highest entropy.
Even if the environment is not truly random, unpredictability caused by a poor learning algorithm, an impoverished model class or partial observability can lead to exactly the same problem.
We did not observe this effect in our experiments on games so we designed an environment to illustrate the point.
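
Concretely, if the learned model matches the true transition density p, the expected reward is E[rt] = E[−log p(φ(xt+1)|xt, at)] = H(φ(xt+1)|xt, at), the conditional entropy of the transition, so the agent is drawn to the most stochastic transitions regardless of whether anything remains to be learned.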

We return to the maze of Section 3.3 to empirically validate a common thought experiment called the noisy-TV problem.
The idea is that local sources of entropy in an environment like a TV that randomly changes channels when an action is taken should prove to be an irresistible attraction to our agent.
We take this thought experiment literally and add a TV to the maze along with an action to change the channel.
In Figure 6 we show how adding the noisy-TV affects the performance of IDF and RF.
As expected, the presence of the TV drastically slows down learning, but we note that if you run the experiment for long enough the agents do sometimes converge to getting the extrinsic reward consistently.
We have shown empirically that stochasticity can be a problem, and so it is important for future work to address this issue in an efficient manner.

518:YAMAGUTIseisei
18/08/26 17:22:09.20 CL1hr8qnX BE:44059829-2BP(3)
Future Work:
We have presented a simple and scalable approach that can learn nontrivial behaviors across a diverse range of environments without any reward function or end-of-episode signal.
One surprising finding of this paper is that random features perform quite well, but learned features appear to generalize better.
Whilst we believe that learning features will become important once the environment is complex enough, we leave that up to future work to explore.
Our wider goal, however, is to show that we can take advantage of many unlabeled (i.e., not having an engineered reward function) environments to improve performance on a task of interest.
Given this goal, showing performance in environments with a generic reward function is just the first step, and future work could investigate transfer from unlabeled to labeled environments.

Acknowledgments

We would like to thank Chris Lu for helping with the Unity environment, Phillip Isola and Alex Nichols for feedback on the paper.
We are grateful to the members of BAIR and OpenAI for fruitful discussions.
DP is supported by the Facebook graduate fellowship.

References

[1] Unity ML-agents.
URLリンク(github.com). 2



Page 11


[2] J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning.
arXiv:1703.01732, 2017. 3, 9
[3] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos.
Unifying count-based exploration and intrinsic motivation. In NIPS, 2016. 1, 9
[4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling.
The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, June 2013. 2

519:YAMAGUTIseisei
18/08/26 17:23:49.66 CL1hr8qnX BE:51402637-2BP(3)
[5] R. Y. Chen, J. Schulman, P. Abbeel, and S. Sidor.
UCB and infogain exploration via q-ensembles. arXiv:1706.01502, 2017. 9
[6] G. Costikyan. Uncertainty in games. MIT Press, 2013. 6, 10
[7] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine.
Diversity is all you need: Learning skills without a reward function. arXiv preprint, 2018. 9
[8] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg.
Noisy networks for exploration. arXiv:1706.10295, 2017. 9
[9] J. Fu, J. D. Co-Reyes, and S. Levine.
EX2: Exploration with exemplar models for deep reinforcement learning. NIPS, 2017. 9
[10] K. Gregor, D. J. Rezende, and D. Wierstra.
Variational intrinsic control. ICLR Workshop, 2017. 9
[11] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel.
Vime: Variational information maximizing exploration. In NIPS, 2016. 1, 9
[12] R. Hunicke, M. LeBlanc, and R. Zubek.
MDA: A formal approach to game design and game research. In AAAI Workshop on Challenges in Game AI, 2004. 6, 10
[13] S. Ioffe and C. Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 4
[14] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al.
What is the best multi-stage architecture for object recognition?
In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146-2153. IEEE, 2009. 9
[15] D. P. Kingma and M. Welling.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 2, 3
[16] N. Lazzaro. Why we play games:
Four keys to more emotion in player experiences. In Proceedings of GDC, 2004. 6, 10
[17] J. Lehman and K. O. Stanley.
Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, 2008. 9
[18] J. Lehman and K. O. Stanley.
Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 2011. 9

520:YAMAGUTIseisei
18/08/26 17:25:07.68 CL1hr8qnX BE:51402637-2BP(3)
[19] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer.
Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, 2012. 9
[20] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer.
Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, 2012. 1
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 2015. 1
[22] S. Mohamed and D. J. Rezende.
Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, 2015. 1
[23] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy.
Deep exploration via bootstrapped dqn. In NIPS, 2016. 9
[24] G. Ostrovski, M. G. Bellemare, A. v. d. Oord, and R. Munos.
Count-based exploration with neural density models. arXiv:1703.01310, 2017. 1, 9
[25] P.-Y. Oudeyer.
Computational theories of curiosity-driven learning. arXiv preprint arXiv:1802.10546, 2018. 9



Page 12

[26] P.-Y. Oudeyer and F. Kaplan.
What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 2009. 1, 9
[27] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell.
Curiosity-driven exploration by self-supervised prediction. In ICML, 2017. 1, 2, 3, 4, 6, 9
[28] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell.
Zero-shot visual imitation. In ICLR, 2018. 1
[29] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz.
Parameter space noise for exploration. arXiv:1706.01905, 2017. 9
[30] P. Poupart, N. Vlassis, J. Hoey, and K. Regan.
An analytic solution to discrete bayesian reinforcement learning. In ICML, 2006. 1

521:YAMAGUTIseisei
18/08/26 17:27:09.49 CL1hr8qnX BE:29373034-2BP(3)
[31] D. J. Rezende, S. Mohamed, and D. Wierstra.
Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014. 3
[32] R. M. Ryan and E. L. Deci.
Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary Educational Psychology, 2000. 1
[33] A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng.
On random weights and unsupervised feature learning. In ICML, pages 1089-1096, 2011. 9
[34] J. Schmidhuber.
Curious model-building control systems. In Neural Networks, 1991. 1991 IEEE International Joint Conference on, pages 1458-1463. IEEE, 1991. 9
[35] J. Schmidhuber.
A possibility for implementing curiosity and boredom in model-building neural controllers.
In From animals to animats: Proceedings of the first international conference on simulation of adaptive behavior, 1991. 1
[36] J. Schmidhuber.
A possibility for implementing curiosity and boredom in model-building neural controllers, 1991. 9
[37] J. Schmidhuber.
Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2010. 9
[38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 4
[39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 2
[40] S. P. Singh, A. G. Barto, and N. Chentanez.
Intrinsically motivated reinforcement learning. In NIPS, 2005. 1
[41] L. Smith and M. Gasser.
The development of embodied cognition: Six lessons from babies. Artificial life, 2005. 1
[42] B. C. Stadie, S. Levine, and P. Abbeel.
Incentivizing exploration in reinforcement learning with deep predictive models. NIPS Workshop, 2015. 2, 9
[43] K. O. Stanley and J. Lehman.
Why greatness cannot be planned: The myth of the objective. Springer, 2015. 9

522:YAMAGUTIseisei
18/08/26 17:28:31.63 CL1hr8qnX BE:66087893-2BP(3)
[44] S. Still and D. Precup.
An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 2012. 9
[45] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus.
Intrinsic motivation and automatic curricula via asymmetric self-play. In ICLR, 2018. 9
[46] R. S. Sutton and A. G. Barto.
Reinforcement learning: An introduction. MIT press Cambridge, 1998. 4
[47] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel.
#Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems, 2017. 9
[48] P. Wouters, H. Van Oostendorp, R. Boonekamp, and E. Van der Spek.
The role of game discourse analysis and curiosity in creating engaging and effective serious games by implementing a back story and foreshadowing. Interacting with Computers, 2011. 6, 10
[49] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang.
Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476-1483, 2015. 9


A
Implementation Details

We have released the training code and environments on our website 3.
For full details, we refer the reader to the code and video results on the website.

Pre-processing:
All experiments were done with pixels.
We converted all images to grayscale and resized them to 84x84.
We learn both the agent's policy and the forward dynamics function on a stack of historical observations [x_{t-3}, x_{t-2}, x_{t-1}, x_t] instead of only the current observation.
This is to capture partial observability in these games.
In the case of Super Mario Bros and Atari experiments, we also used a standard frameskip wrapper that repeats each action 4 times.

523:YAMAGUTIseisei
18/08/26 17:29:10.00 CL1hr8qnX BE:117489986-2BP(3)
Architectures:
Our embedding network and policy networks had identical architectures and were based on the standard convolutional networks used in Atari experiments.
The layer we take as features in the embedding network had dimension 512 in all experiments and no nonlinearity.
To keep the scale of the prediction error consistent relative to extrinsic reward, in the Unity experiments we applied batchnorm to the embedding network.
We also did this for the Mario generalization experiments to reduce covariate shift from level to level.
For the VAE auxiliary task and pixel method, we used a similar deconvolutional architecture, the exact details of which can be found in our code submission.
The IDF and forward dynamics networks were heads on top of the embedding network with several extra fully-connected layers of dimensionality 512.

Hyper-parameters:
We used a learning rate of 0.0001 for all networks.
In most experiments, we used 128 parallel environments with the exceptions of the Unity and Roboschool experiments where we could only run 32 parallel environments, and the large scale Mario experiment where we used 2048.
We used rollouts of length 128 in all experiments except for the Unity experiments where we used 512 length rollouts so that the network could quickly latch onto the sparse reward.
In the initial 9 experiments on Mario and Atari, we used 3 optimization epochs per rollout in the interest of speed.
In the Mario scaling and generalization experiments, as well as the Roboschool experiments, we used 6 epochs.
In the Unity experiments, we used 8 epochs, again to more quickly take advantage of sparse rewards.

B
Additional Results

524:YAMAGUTIseisei
18/08/26 17:29:43.70 CL1hr8qnX BE:66087893-2BP(3)
[Figure 7 plots: extrinsic reward per episode vs. frames (millions) for BeamRider, Breakout, MontezumaRevenge, Pong, Mario, Qbert, Riverraid, Seaquest, and SpaceInvaders; curves compare Pixels, VAE features, Inverse Dynamics features, and Random CNN features.]

Figure 7:
(a) Left: Best extrinsic returns on eight Atari games and Mario.
(b) Right: Mean episode lengths on eight Atari games and Mario.

525:YAMAGUTIseisei
18/08/26 17:32:07.16 CL1hr8qnX BE:176234898-2BP(3)
[Figure 8 plots: extrinsic reward per episode vs. frames (millions) on 48 Atari games; curves compare Inverse Dynamics features, Random CNN features, and a random agent baseline.]

Figure 8:
Pure curiosity-driven exploration (no extrinsic reward, or end-of-episode signal) on 48 Atari games.
We observe that the extrinsic returns of curiosity-driven agents often increase despite the agents having no access to the extrinsic return or the end-of-episode signal.
In multiple environments, the performance of the curiosity-driven agents is significantly better than that of a random agent, although there are environments where the behavior of the agent is close to random, or in fact seems to minimize the return rather than maximize it.
For the majority of the training process, RF performs better than a random agent in about 67% of the environments, while IDF performs better than a random agent in about 71% of the environments.


Reward      Gravitar        Freeway      Venture     PrivateEye        MontezumaRevenge
Ext Only    999.3 ± 220.7   33.3 ± 0.6   0 ± 0       5020.3 ± 395      1783 ± 691.7
Ext + Int   1165.1 ± 53.6   32.8 ± 0.3   416 ± 416   3036.5 ± 952.1    2504.6 ± 4.6

Table 2:
These results compare the mean reward (± std-error) after 100 million frames across 3 seeds for an agent trained with intrinsic plus extrinsic reward versus extrinsic reward only.
The extrinsic (coefficient 1.0) and intrinsic reward (coefficient 0.01) were directly combined without any hyper-parameter tuning.
We leave the question of how to optimally combine extrinsic and intrinsic rewards to future work.
This is to emphasize that combining extrinsic with intrinsic rewards is not the focus of the paper; these experiments are provided only for completeness.

526:YAMAGUTIseisei
18/08/26 17:32:51.06 CL1hr8qnX BE:58745546-2BP(3)
B.1
Atari
To better measure the amount of exploration, we provide the best return of curiosity-driven agents in figure 7(a) and the episode lengths in figure 7(b).
Notably, on Pong the increasing episode length combined with a plateau in returns shows that the agent maximizes the number of ball bounces rather than the reward.

Figure 8 shows the performance of curiosity-driven agents based on Inverse Dynamics and Random features on 48 Atari games.

Although not the focus of this paper, for completeness we include some results on combining intrinsic and extrinsic reward on several sparse reward Atari games.
When combining with extrinsic rewards, we use the end of the episode signal.
The reward used is the extrinsic reward plus 0.01 times the intrinsic reward.
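
Written out (our own shorthand for the combination just described, with the coefficients stated above):

  r_t = r_t^{\text{ext}} + 0.01 \, r_t^{\text{int}}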
The results are shown in Table 2.
We don't observe a large difference between the settings, likely because the combination of intrinsic and extrinsic reward needs to be tuned.
We did observe that one of the intrinsic+extrinsic runs on Montezuma's Revenge explored 10 rooms.

3Website at URLリンク(pathak22.github.io)


527:YAMAGUTIseisei
18/08/26 17:33:32.21 CL1hr8qnX BE:58745546-2BP(3)


[Figure 9 plot ("Scale in Mario"): extrinsic reward per episode vs. number of gradient updates, comparing a batch of 128 environments with a batch of 1024 environments.]

Figure 9:
Best extrinsic returns on the Mario scaling experiments.
We observe that larger batches allow the agent to explore more effectively, reaching the same performance in fewer parameter updates and also achieving better ultimate scores.


B.2
Mario

We show the analogue of the plot shown in Figure 3(a), showing max extrinsic returns; see Figure 9.

528:YAMAGUTIseisei
18/08/26 17:36:22.93 CL1hr8qnX BE:29372562-2BP(3)
>>488-

>>514
A family of approaches to intrinsic motivation reward an agent based on prediction error, prediction uncertainty, or improvement of a forward dynamics model of the environment that gets trained along with the agent's policy.

>>515
literature has focused on using random features for classification where the typical finding is that whilst random features can work well for simpler problems, feature learning performs much better once the problem becomes sufficiently complex.

529:YAMAGUTIseisei
18/09/09 07:47:01.01 vMpjqLBja BE:119937877-2BP(3)


A Simple Multi-Processor Computer Based on Subleq


Oleg Mazonka and Alex Kolodin
mazonka@gmail.com alex.kolodin@gmail.com

May 2011 (Revision 3, draft 8)



Abstract

Subleq (Subtract and Branch on result Less than or Equal to zero) is both an instruction set and a programming language for a One Instruction Set Computer (OISC).
We describe a hardware implementation of an array of 28 one-instruction Subleq processors on a low-cost FPGA board.
Our test results demonstrate that the computational power of our Subleq OISC multi-processor is comparable to that of a CPU of a modern personal computer.
Additionally, we provide implementation details of our compiler from a C-style language to Subleq.

530:YAMAGUTIseisei
18/09/09 07:48:08.65 vMpjqLBja BE:48954454-2BP(3)
Contents

1. Introduction
2. Subleq Assembly Language
3. Hardware design
 3.1 Overview
 3.2 Interface description
 3.3 Subleq processor
4. C Compiler for Subleq
 4.1 Stack
 4.2 Expressions
 4.3 Function calls
 4.4 Stack variables
 4.5 Multiplication
 4.6 Conditional jump
5. Results
 5.1 Test #1
 5.2 Test #2
6. Conclusion
7. Appendix
 7.1 C with multiplication
 7.2 C without multiplication
 7.3 Subleq code
References



531:YAMAGUTIseisei
18/09/09 07:51:37.91 vMpjqLBja BE:34268827-2BP(3)
1. Introduction

OISC (One Instruction Set Computer) is the ultimate RISC (Reduced Instruction Set Computer): a conventional CPU in which the number of instructions is reduced to one.
Having only one available processor instruction eliminates the need for an op-code and permits simpler computational elements, thus allowing more of them to be implemented in hardware with the same number of logic gates.
Since our goal was to build a functional multi-processor system with a maximum possible number of processors on a single low-cost programmable chip,
OISC was a natural choice, with the remaining step being selection of a suitable single-processor instruction set.

Currently known OISCs can be roughly separated into three broad categories:

1. Transport Triggered Architecture Machines;
2. Bit Manipulating Machines;
3. Arithmetic Based Turing-Complete Machines.

Transport Triggered Architecture (TTA) is a design in which computation is a side effect of data transport.
Usually some memory registers (triggering ports) within the common address space perform an assigned operation when an instruction references them.
For example, in an OISC utilizing a single memory-to-memory copy instruction [1], this is done by triggering ports performing arithmetic and instruction pointer jumps when writing into them.
Despite its appealing simplicity, there are two major drawbacks associated with such an OISC.
Firstly, the CPU has to have separate functional units controlling triggering ports.
Secondly, there are difficulties with generalization of the design, since any two different hardware designs are likely to use two different assembly languages.
Because of these disadvantages we ruled out this class of OISCs for our implementation.

532:YAMAGUTIseisei
18/09/09 07:52:35.82 vMpjqLBja BE:58745164-2BP(3)
Bit Manipulating Machines is the simplest class.
A bit copying machine, called BitBitJump, copies one bit in memory and passes the execution unconditionally to the address specified by one of the operands of the instruction [2].
This process turns out to be capable of universal computation (i.e.
being able to execute any algorithm and to interpret any other universal machine) because copying bits can conditionally modify the code ahead to be executed.
Another machine, called the Toga computer, inverts a bit and passes the execution conditionally depending on the result of inversion [3].
Yet another bit operating machine, similar to BitBitJump, copies several bits at the same time.
The problem of computational universality is solved in this case by keeping predefined jump tables in the memory [4].
Despite the simplicity of bit-operating machines, we ruled them out too, because they require more memory than is normally available on an inexpensive FPGA.
To make a functional multiprocessor machine with bit-manipulation operations, at least 1 Mb of memory per processor is required.
Therefore we decided that a more complex processor with less memory is a better choice for our purposes.
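
As a concrete illustration of the bit-copying idea, a single step could be modelled as follows (a hedged C sketch based only on the one-sentence description above; the operand fetch and the exact semantics of BitBitJump [2] may differ):

  #include <stdint.h>
  #include <stddef.h>

  /* One step of a bit-copying machine as described above: copy the bit at
   * bit-address a to bit-address b, then pass execution unconditionally to
   * the address c.  Operand encoding is deliberately left out. */
  size_t bitcopy_step(uint8_t *mem, size_t a, size_t b, size_t c)
  {
      int bit = (mem[a / 8] >> (a % 8)) & 1;           /* read bit a  */
      if (bit)
          mem[b / 8] |= (uint8_t)(1u << (b % 8));      /* set bit b   */
      else
          mem[b / 8] &= (uint8_t)~(1u << (b % 8));     /* clear bit b */
      return c;                                        /* jump target */
  }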




Arithmetic-based Turing-complete Machines use an arithmetic operation and a conditional jump.
Unlike the two previous classes which are universal computers, this class is universal and Turing-complete in its abstract representation.
The instruction operates on integers which may also be addresses in memory.
Currently there are several known OISCs of this class, based on different arithmetic operations [5]:
addition (Addleq), decrement (DJN), increment (P1eq), and subtraction (Subleq: Subtract and Branch on result Less than or Equal to zero).
The latter is the oldest, the most popular and, arguably, the most efficient [6][7].
A Subleq instruction consists of three operands: two for subtraction and one for the conditional jump.

533:YAMAGUTIseisei
18/09/09 07:53:50.33 vMpjqLBja BE:117489986-2BP(3)
Attempts to build hardware around Subleq have been undertaken previously.
For example, David A Roberts designed a Subleq CPU and wrote a software Subleq library [8].
His implementation is a single CPU with keyboard input, terminal, control ROM, and 16Mb RAM, and is much more complex than ours.
There were a few other similar designs described on various Internet sites, e.g. [9].
However all of them were just proof-of-concept simulations without a practical implementation.

In the following sections we describe components of the system we built.
Section 2 outlines a Subleq abstract machine and its assembly notation.
Section 3 describes our hardware implementation of the multiprocessor core.
Section 4 briefly describes techniques used to convert a high level programming language into Subleq instruction code.
In Sections 5 and 6 we present comparative speed-test results for our device, followed by a discussion and summary.
In the Appendix the code calculating factorials is presented in C and Subleq notations.


Figure 1: The FPGA board (running the Subleq software) is connected via USB to a PC running the driver and programming environment.


Figure 1 represents the connection of our device to a computer with a USB cable.

534:YAMAGUTIseisei
18/09/09 07:55:07.61 vMpjqLBja BE:73431465-2BP(3)
2. Subleq Assembly Language

A Subleq abstract machine operates on an infinite array of memory, where each cell holds an integer number.
This number can be an address of another memory cell.
The numbering starts from zero.
The program is defined as a sequence of instructions read from the memory with the first instruction at the address zero.
A Subleq instruction has 3 operands:

A B C

Execution of one instruction A B C subtracts the value in the memory cell at the address stored in A from the content of a memory cell at the address stored in B and then writes the result back into the cell with the address in B.
If the value after subtraction in B is less or equal to zero, the execution jumps to the address specified in C; otherwise execution continues to the next instruction, i.e. the address of the memory cell following C.
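
In C-like terms, the effect of one instruction can be sketched as follows (an illustrative rendering of the semantics just described, not code from the paper):

  #include <stdint.h>

  /* Execute the Subleq instruction whose operands A, B, C are stored at
   * cells ip, ip+1, ip+2 of memory[]; returns the address of the next
   * instruction to execute. */
  int32_t subleq_step(int32_t *memory, int32_t ip)
  {
      int32_t A = memory[ip], B = memory[ip + 1], C = memory[ip + 2];

      memory[B] -= memory[A];            /* mem[B] = mem[B] - mem[A]   */
      return (memory[B] <= 0) ? C        /* jump if the result is <= 0 */
                              : ip + 3;  /* otherwise fall through     */
  }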




Assembly notation helps to read and write code in Subleq.
The following is the list of syntax conventions:

- label;
- question mark;
- reduced instruction;
- multi-instruction;
- literal and expression;
- data section;
- comment.

535:YAMAGUTIseisei
18/09/09 07:56:10.49 vMpjqLBja BE:117490368-2BP(3)
Label is a symbolic name of a particular address followed by a colon.
In the following example

 A B C
A:2 B:1 0
C:B B 0

each line represents one instruction with three operands.
Here A, B, and C are not abstract names, but labels (addresses) of specific cells in the memory.
For example, label A refers to the 4th cell, which is initialised with value 2.
The first instruction subtracts the value of cell A from the value of cell B, which is 1, and stores the result in cell B, which becomes -1.
Since the result is less than zero, the next instruction to be executed is the third line, because value C is the address of the first operand of the instruction on the third line.
That subtracts B from B making it zero, so the execution is passed to the address 0.
If these three lines are the whole program, then the first operand of the first instruction has the address 0.
In this case the execution is passed back to the first instruction, which would make B equal to -2.
That process continues forever.
The instructions being executed are only the first and the third lines, and the value of the cell B changes as 1, -1, 0, -2, 0, -2, 0, and so on.

A question mark is defined as the address of the next cell in memory:

A B ?
B B 0

is the same as

A B C
C:B B 0

Reduced instruction format is a convenient shortcut: two operands instead of three assume the third to be the address of the next instruction, i.e. ?; and only one operand assumes the second to be the same as the first, so

536:YAMAGUTIseisei
18/09/09 07:57:06.05 vMpjqLBja BE:44058863-2BP(3)
A

is the same as

A A




and is the same as

A A ?

If more than one instruction is placed on the same line, each instruction except the last must be followed by semicolon.
The following code copies the value from A to B:

Z; B; A Z; Z B

Integer numbers like A:72 are used as constants in the code.
Literals can be used instead of integers assuming their ASCII values.
For example, A:'H' is the same as A:72; and A:"Hi" is the same as A:'H' 'i'.
Addition, subtraction, parenthesis, and unary minus can be used in expression.
The code Z Z ?+3 A B C D E F sets Z to zero and jumps to the third instruction D E F.
Since instructions can be reduced, the assembler must know when to generate a full instruction with three operands.
To avoid such generation, a period at the beginning of the line is used.
Thus program data can be placed on such a line.
The code

537:YAMAGUTIseisei
18/09/09 07:58:40.41 vMpjqLBja BE:198264299-2BP(3)
A A ?+1
. U:-1
U A

sets A to 1.
Without the period on the second line the code would be interpreted as

A A ?+1
U:-1 (-1) ?
U A

Comments are delimited by hash symbol #: everything from # till the end of the line is a comment.
Jump to a negative address halts the program.
Usually (-1) as the third operand is used to stop the program, for example:

# halt
Z Z (-1)

The parentheses around (-1) are necessary to indicate that it is the third operand, so the instruction would not be interpreted as

Z Z-1 ?

To make a Subleq program interactive (requesting data and responding to user while working), input and output operations can be defined as operations on a non-existing memory cell.
The same (-1) address can be used for this.
If the second operand is (-1) the value of the first operand is the output.
If the first operand is (-1), the second operand gets the value from the input stream.
Input and output operations are defined on byte basis in ASCII code.
If the program tries to output a value greater than 255, the behaviour is undefined.

Below is a "Hello world" program adapted from Lawrence Woodman helloworld.sq [10].

538:YAMAGUTIseisei
18/09/09 07:59:47.32 vMpjqLBja BE:34267872-2BP(3)


It is exceptionally terse, but is a good example of Subleq efficiency.

L:H (-1); U L; U ?+2; Z H (-1); Z Z L
. U:-1 H:"hello, world\n" Z:0

A special variable called Z is often used in Subleq as an intermediate temporary variable within a very small scope.
It is commonly assumed that this variable is initialised to zero and left at zero after every use.

The program above consists of five instructions.
The first instruction prints the character pointed to by its first operand (the first pointer), which is initialised to the beginning of the data string, the letter 'h'.
The second instruction increments that pointer, i.e. the first operand of the first instruction.
The third instruction increments the second pointer, which is the second operand of the fourth instruction.
The fourth instruction tests the value pointed to by the second pointer and halts the program when the value is zero.
It becomes zero when the pointer reaches the cell one after the end of the data string, which is Z:0.
The fifth instruction loops back to the beginning of the program, so the process continues until the halt condition is satisfied.

539:YAMAGUTIseisei
18/09/09 08:00:42.28 vMpjqLBja BE:102803876-2BP(3)
3. Hardware design

3.1 Overview

We used an Altera Cyclone III EP3C16 FPGA as the basis for the hardware implementation.
The choice was based on the relatively low price (~ US$30) of this FPGA IC chip and availability of the test hardware for it.

The test board we used has a DDR2 RAM IC fitted, but access to the RAM is limited to one process at a time.
True parallel implementation requires separate internal memory blocks allocated for each processor hence the amount of available memory in the FPGA limits the number of processors.
The EP3C16 FPGA has 56 16-bit-wide memory blocks of 8 Kbits each.
Our implementation of one 32-bit Subleq processor requires a minimum of 2 memory blocks, so only 28 processors can be fitted into the FPGA.
We could have chosen a 16-bit implementation and had more processors (up to 56), but with only 1 Kbyte of memory allocated to each.

The FPGA is connected to the USB bus with the help of an external Cypress FX2 CPU configured as a bridge between USB and SPI (Serial Peripheral Interface), which is used to load the FPGA with code and data.
The interface bridge is transparent for the PC software.


    FPGA
MEMORY  <-> PROCESSOR1
MEMORY  <-> PROCESSOR2
MEMORY  <-> PROCESSOR3
  :       <-> SPI <-> CONTROL_CPU <-> USB
MEMORY  <-> PROCESSOR7
MEMORY  <-> PROCESSOR8

Figure 2 Block-diagram of the board



540:YAMAGUTIseisei
18/09/09 08:01:42.82 vMpjqLBja BE:88117294-2BP(3)
Figure 2 is a communication block diagram of the board.
The solution was coded in VHDL and compiled with Quartus II Web Edition software freely available from the Altera website.
Our code is scalable for up to 63 processors for possible use with larger FPGAs.
The limit of 63 processors is due to our SPI bus addressing implementation and can be increased, if necessary.

All 28 processors run independently and are synchronised by a single 150 MHz clock generated by one of the FPGA PLLs from a reference oscillator fitted on the PCB.

To increase the number of processors, additional boards with FPGAs could be easily connected via USB bus.

3.2 Interface description

Each processor has a serial interface to its allocated memory and a Status byte, accessible from a single address for serial loading.
The serial interface takes over the memory's data and address buses when processing is stopped.

Address space inside the FPGA is organised as a table addressed by two numbers: processor index and memory address.
Reading one byte from index 0 returns the number of processors inside the FPGA.
For this design the returned value is 28.
Indices from 1 to 28 are assigned to the processors, which have 2048 bytes (512 of 32 bit words) of memory available to each.

Writing to a processor memory is an operation which sequentially loads a buffer of 2048 bytes.
Reading from the processor’s memory is different: the first word (4 bytes) returned is the status of the processor and the rest is the memory content.

541:YAMAGUTIseisei
18/09/09 08:02:33.56 vMpjqLBja BE:58745838-2BP(3)
The Status byte, the first byte of the first word, can be in one of three states: 0xA1 (running), 0xA2 (stopped), or 0xA0 (stopped and not run since power on).
Writing into the processor's memory automatically starts execution, thus eliminating the need for a separate command.
Reading from a processor’s memory stops that processor.
An exception is reading the first byte of status, which does not stop the processor.
Additionally, a processor can be stopped by Subleq halt operand (-1) as mentioned in Section 2.
Other negative references, such as input or output as described above in the Subleq assembly language section, also stop the processor, because no IO operations are defined in this architecture.
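
To illustrate the read layout and the status convention described above, a host-side helper might decode the first returned word as follows (a sketch with names of our own; the actual PC driver is not shown here):

  #include <stdint.h>
  #include <stdio.h>

  /* Status values of a Subleq processor, as described above. */
  enum subleq_status {
      SUBLEQ_NEVER_RUN = 0xA0,   /* stopped and not run since power on */
      SUBLEQ_RUNNING   = 0xA1,
      SUBLEQ_STOPPED   = 0xA2
  };

  #define SUBLEQ_MEM_BYTES 2048  /* 512 of 32-bit words per processor */

  /* buf holds the 4 + 2048 bytes read back from one processor:
   * a 32-bit status word followed by the memory content. */
  void print_processor_state(const uint8_t *buf)
  {
      switch (buf[0]) {          /* only the first byte is the Status byte */
      case SUBLEQ_NEVER_RUN: puts("stopped (never run)"); break;
      case SUBLEQ_RUNNING:   puts("running");             break;
      case SUBLEQ_STOPPED:   puts("stopped");             break;
      default:               puts("unknown status");      break;
      }
      /* buf + 4 .. buf + 4 + SUBLEQ_MEM_BYTES - 1 is the memory image. */
  }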

3.3 Subleq processor

The state machine algorithm can be presented in pseudocode as:

  IP = 0
  while (IP >= 0)
  {
    A = memory[IP]
    B = memory[IP+1]
    C = memory[IP+2]
    if (A < 0 or B < 0)
    {
      IP = -1
    }
    else
    {
      memory[B] = memory[B] - memory[A]
      if (memory[B] > 0)
        IP = IP + 3
      else
        IP = C
    }
  }

542:YAMAGUTIseisei
18/09/09 08:03:12.40 vMpjqLBja BE:110146695-2BP(3)


where IP is the instruction pointer, memory[] denotes the value of a memory cell, and A, B, and C are integers.
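
Translated directly into runnable C (a simulation sketch of this state machine, not the VHDL core; the demonstration program is the "set A to 1" example from Section 2, hand-assembled by us and terminated with a halt):

  #include <stdint.h>
  #include <stdio.h>

  #define MEM_WORDS 512   /* 512 of 32-bit words per processor, Section 3.2 */

  /* Run the Subleq program at the start of mem[]; stops when IP goes
   * negative or an operand is negative (no IO in this architecture). */
  void subleq_run(int32_t *mem)
  {
      int32_t ip = 0;
      while (ip >= 0) {
          int32_t a = mem[ip], b = mem[ip + 1], c = mem[ip + 2];
          if (a < 0 || b < 0) {
              ip = -1;                         /* stop, as the hardware does */
          } else {
              mem[b] -= mem[a];
              ip = (mem[b] > 0) ? ip + 3 : c;
          }
      }
  }

  int main(void)
  {
      /* Hand-assembled layout of:
       *   A A ?+1
       *   . U:-1
       *   U A
       *   Z Z (-1)
       * with A at cell 11, U at cell 3 and Z at cell 10. */
      int32_t mem[MEM_WORDS] = { 11, 11, 4,  -1,  3, 11, 7,  10, 10, -1,  0, 0 };

      subleq_run(mem);
      printf("A = %d\n", mem[11]);             /* prints: A = 1 */
      return 0;
  }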

The Subleq processor core is written with the help of the RAM 2-Port Megafunction of the Quartus II software, which we used to build dual-port memory access.
The implemented solution allows access to the content by two distinct addresses (memory[A] and memory[B]) simultaneously, which results in saving on processing clock ticks.
The disadvantage of this implementation is an additional latency of one clock tick to access the data and address buses compared to a single port memory implementation.
However, the total number of processing clock ticks per memory access for the dual-port design is still less than that required for a single-port one.

The core is based on a state machine, which starts automatically when the memory of a processor is loaded.
On any read or write operation, or on encountering a negative operand, processing stops; however, the first status byte can be read at any time without affecting the computation.

4. C Compiler for Subleq

In this section we briefly describe some elements of the compiler we wrote, which compiles simplified C code into Subleq [11].
The compiler is used in one of our tests, so that a direct comparison is possible between execution of compiled native C code and Subleq code compiled from the same C source.
The compiler is a high-level language interface to an OISC, and it is the only such compiler known to us at the time of writing.

4.1 Stack

The primary C programming language concepts are functions and the stack.
Implementation of the stack in Subleq is achievable by using memory below the code.
Using code self-modification, one can place values into the stack and retrieve them from it.
Function calls require the return address to be placed on the stack.
Consider the following C code:

543:YAMAGUTIseisei
18/09/09 08:06:52.07 vMpjqLBja BE:110147459-2BP(3)
  void f()
  {
    ...
  }

  void main()
  {
    ...
    f();
    ...
  }




After the above is compiled to machine code, it must perform the following operations:
1) the address of the instruction immediately after the call to f has to be put on the stack;
2) a jump to the code of the function f must be made;
3) at the end of the function f, the address has to be extracted from the stack; and
4) the execution has to be transferred to the extracted address.
According to the C standard, the function main is a proper C function, i.e. it can be called from other functions including itself.
Hence the program must have a separate entry point, which in the following code is called sqmain.
The C code above compiles into:

544:YAMAGUTIseisei
18/09/09 08:08:07.99 vMpjqLBja BE:110146695-2BP(3)
    0 0 sqmain
  _f:
    ...
    #return ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0
  _main:
    ...
    #call f
    dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
    ?+6; sp ?+2; ?+2 0 _f; . ?; inc sp
    ...
    #return
    ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0
  sqmain:
    #call main
    dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
    ?+6; sp ?+2; ?+2 0 _main; . ?; inc sp
    0 0 (-1)
  . inc:-1 Z:0 dec:1 sp:-sp

The cell stack pointer sp is the last memory cell in the program.
It is initialised with the negative value of its own address.
A negative value is used here to speed up the code execution: working with the subtraction operation may sometimes save a few steps if the data is recorded as the negative of its actual value.
The instruction dec sp subtracts 1 from sp, hence increasing its real value by 1.
Below is an excerpt calling the function f in a more readable form, with relative references (?) replaced by labels.

545:YAMAGUTIseisei
18/09/09 08:13:11.96 vMpjqLBja BE:44059829-2BP(3)
  dec sp
  A; sp A
  B; sp B
  A:0 B:0
  C; sp C
  D C:0 _f
  . D:?
  inc sp

The instruction on the fourth line is to clear the cell in the stack, since some values can be left there from the previous use.
However, clearing the top cell of the stack is not a single-step task, because one has to clear the operands of the instruction itself and then initialise them with the value of the sp pointer.
Thus the executed command sequence is the following:
allocate a new cell in the stack by increasing the stack pointer (first line);
clear the first operand of the instruction;
initialise this operand with the new value of the stack pointer (second line);
do the same with the second operand of the instruction, i.e. clear and initialise it (third line); and then execute the instruction, which clears the allocated cell in the stack (fourth line).

The next two instructions clear and initialise cell C similarly.
The instruction D C:0 _f copies the address of the instruction inc sp to the stack and jumps to _f.
This works because D holds the address of the next memory cell (that is what ? means) and C points to the now-cleared top cell of the stack.




The negative value written to the stack makes the result of the subtraction non-positive, so the conditional jump to the label _f is taken.

Once inside the function f, the stack pointer can be modified, but we assume that functions restore it before they exit.
So the return code has to jump to the address extracted from the stack:

546:YAMAGUTIseisei
18/09/09 08:13:40.63 vMpjqLBja BE:48954454-2BP(3)
  A; sp A
  B; A:0 B
  Z Z B:0

Here the value of the stack pointer sp is written to A, and the instruction A:0 B copies the stored address to B.
The address is stored negated, so the subtraction restores the positive value.

The stack does a little bit more than just storing return addresses.
That will be explored later in subsections 4.3 and 4.4.

4.2 Expressions

C language programs consist of statements, which in turn consist of keyword statements and expressions.
The syntax of keyword statements and expressions is best represented by Backus-Naur Form (BNF), the standard way of representing context-free grammars.
A classic example is the grammar of arithmetic expressions:

  expression :=
    term
    expression + term
    expression - term

  term :=
    primary
    term * primary
    term / primary

  primary :=
    identifier
    constant
    ( expression )
These mutually recursive definitions can be used by a program called a parser to build a tree representation of any grammatically valid expression.
Once such a tree is built, the compiler’s job is to organise the sequence of instructions so that the result of any sub-tree is passed up the tree.
For example, a tree of the expression:

547:YAMAGUTIseisei
18/09/09 08:14:35.49 vMpjqLBja BE:68535874-2BP(3)
  a + ( b - c )


[Figure: parse tree of a + ( b - c )]


consists of a node ‘+’, variable a and a sub-tree, which consists of a node ‘-’, and variables b and c.
To make the calculation, the compiler must use a temporary variable to store the result of the sub-tree, which is used later in the addition, and potentially further up if this expression is part of a larger expression.
In this particular example we need only one temporary, but generally many temporaries are required.
The expression is compiled into the following code:

  t; b Z; Z t; Z
  c t
  a Z; Z t; Z




The first line copies value b into temporary t.
The second line subtracts value c from the temporary.
At this point the compiler is finished with the sub-tree.
Its result is the generated code and the temporary variable t holding the value of the calculated sub-tree.
Now the compiler generates code for the addition.
Its arguments now are variable a and temporary t.
The third line adds a to t.
Now t holds the result of the whole expression.
If this expression is a part of a larger expression, t is passed up the tree as an argument to the upper level node of the tree.
If not, then t value is discarded because the evaluation is finished.
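
The tree-walking scheme described above can be sketched in C as follows (a toy illustration with node and emission conventions of our own, not the paper's compiler; it prints Subleq instructions in the notation of Section 2 and allocates one fresh temporary per operator node rather than reusing t):

  #include <stdio.h>

  typedef struct Node {
      char op;                 /* '+', '-', or 0 for a leaf variable */
      const char *name;        /* variable name for leaves */
      struct Node *left, *right;
  } Node;

  static int next_tmp = 0;

  /* Emit Subleq code that leaves the value of n in a fresh temporary
   * (or simply return the variable name for a leaf); returns the name
   * of the cell holding the result. */
  static const char *gen(Node *n, char *buf)
  {
      if (n->op == 0)
          return n->name;

      char lbuf[64], rbuf[64];
      const char *lhs = gen(n->left, lbuf);
      const char *rhs = gen(n->right, rbuf);

      sprintf(buf, "t%d", next_tmp++);
      printf("%s; %s Z; Z %s; Z\n", buf, lhs, buf);   /* t = lhs (copy via Z) */
      if (n->op == '-')
          printf("%s %s\n", rhs, buf);                /* t -= rhs */
      else
          printf("%s Z; Z %s; Z\n", rhs, buf);        /* t += rhs */
      return buf;
  }

  int main(void)
  {
      /* a + ( b - c ) */
      Node a = {0, "a"}, b = {0, "b"}, c = {0, "c"};
      Node sub = {'-', 0, &b, &c};
      Node add = {'+', 0, &a, &sub};
      char buf[64];
      gen(&add, buf);
      return 0;
  }

Run on the example tree, this prints a four-instruction sequence equivalent to the code above, with the sub-tree's temporary passed up to the '+' node as an operand.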

548:YAMAGUTIseisei
18/09/09 08:15:29.08 vMpjqLBja BE:44059436-2BP(3)
More advanced grammar may involve assignment, dereferencing, unary operations and many others.
But each grammar construction can be represented by a corresponding sub-tree, and processed later by the compiler to produce code.
For example, a subtraction from a dereferenced value represented in C as:

  *k -= a

has to be translated into

  t; k Z; Z t; Z
  a t:0

Here a temporary variable has to be used inside the code for dereferencing.
The sequence of instructions is: clear t; copy value k into t; subtract a from the memory k points to.

Here a few elements of the grammar processing have been touched upon.
The C grammar takes several pages just to list in BNF.
However, larger and more complex grammars are resolved by the compiler in a similar manner.

4.3 Function calls

In subsection 4.1 above it was shown how to push values onto and pop them from the stack.
When a function takes arguments, they have to be pushed onto the stack together with the return address.
The stack must be restored upon function return.
Consider a function taking two arguments:

  int f(int a, int b);
    ...
    f(a,b);

The call to a function f must be translated into something like this

549:YAMAGUTIseisei
18/09/09 08:17:21.13 vMpjqLBja BE:154205497-2BP(3)
  # 1 push b
  # 2 push a
  # 3 push return_address
  # 4 goto f # return_address:
  # 5 sp -= 3

In C, arguments can be expressions, and a function call can itself be part of another expression (a sub-expression), i.e.
the compiler must properly handle more complicated cases like the following:




  int f(int a, int b)
  {
    ...
    return f;
  }
    ...
    int k;
    k=f;
    k(f(1,2),3); // call via variable (indirect call)
    k = f(1,2)(3,4); // call by return value

Here, for simplicity, the C function type int(*)(int,int) is represented as int.
Subleq supports only one variable type.
Therefore, a more elaborate typing system would not introduce extra functionality into the language.

550:YAMAGUTIseisei
18/09/09 08:18:23.81 vMpjqLBja BE:132176096-2BP(3)
Arguments pushed onto the stack can properly be calculated as sub-expressions (sub-trees).
In this sense, for the actual function call it is irrelevant whether program variables or temporaries are pushed onto the stack.

  # 1 push B
    # clearing the next cell in the stack [remember that sp is negative]
    # the line below is same as in C syntax: *(++sp)=0;
    dec sp; t1; sp t1; t2; sp t2; t1:0 t2:0
    # same as in C syntax: *sp+=B;
    t3; sp t3; b Z; Z t3:0; Z

  # 2 push A
    # the same with A
    dec sp; t4; sp t4; t5; sp t5; t4:0 t5:0
    t6; sp t6; a Z; Z t6:0; Z

  # 3 push return_address
    dec sp; t7; sp t7; t8; sp t8; t7:0 t8:0
    t9; sp t9; t10 t9:0 goto_address
    . t10: return_address

  # 4 goto f goto_address: Z Z f

  # 5 sp -= 3
    return_address: const(-3) sp

The notation const(-3) sp is shorthand for

  unique_name sp
  ...
  unique_name:-3

