(Strong AI) Technological Singularity / (World Acceleration) 23 at FUTURE
18/08/26 16:41:37.25 CL1hr8qnX BE:97908858-2BP(3)
2.3 'Death is not the end': discounted curiosity with infinite horizon
One important point is that the use of an end-of-episode signal, sometimes called a 'done', can often leak information about the true reward function.
If we don't remove the 'done' signal, many of the Atari games become too simple.
For example, a simple strategy of giving a +1 artificial reward at every time-step when the agent is alive and 0 on death is sufficient to obtain a high score in some games,
for instance, the Atari game 'Breakout', where it will seek to maximize the episode length and hence its score.
In the case of negative rewards, the agent will try to end the episode as quickly as possible.
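As a minimal sketch of this distinction (our own illustration, not code from the paper; the function name compute_returns is hypothetical), the practical difference is whether the bootstrapped discounted return is truncated at the 'done' flag or bootstrapped through it:

import numpy as np

def compute_returns(rewards, values, dones, gamma=0.99, use_done=True):
    # rewards, dones: arrays of shape [T]; values has one extra bootstrap
    # entry values[T] for the state reached after the last step.
    T = len(rewards)
    returns = np.zeros(T)
    running = values[T]
    for t in reversed(range(T)):
        if use_done:
            # Episodic view: the return is cut off whenever dones[t] is set.
            running = rewards[t] + gamma * (1.0 - dones[t]) * running
        else:
            # Infinite-horizon view: 'done' is ignored and the value function
            # always bootstraps through the death transition.
            running = rewards[t] + gamma * running
        returns[t] = running
    return returns

With use_done=False, a death is just another transition whose outcome the agent can learn to predict, which is the setting considered here.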


501:YAMAGUTIseisei
18/08/26 16:42:21.10 CL1hr8qnX BE:29373034-2BP(3)
[Figure 2 plot: extrinsic reward per episode vs. frames (millions) on BeamRider, Breakout, MontezumaRevenge, Pong, Mario, Qbert, Riverraid, Seaquest, and SpaceInvaders; curves: Pixels, VAE features, Inverse Dynamics features, Random CNN features.]

Figure 2:
A comparison of feature learning methods on 8 selected Atari games and Super Mario Bros.

502:YAMAGUTIseisei
18/08/26 16:43:13.06 CL1hr8qnX BE:48954454-2BP(3)
These evaluation curves show the mean reward (with standard error) of agents trained purely by curiosity, without reward or an end-of-episode signal.
We see that our purely curiosity-driven agent is able to gather rewards in these environments without using any extrinsic reward at training.
Results on all of the Atari games are in the appendix in Figure 8.
We find that the curiosity model trained on pixels does not work well in any environment, and VAE features perform either the same as or worse than random and inverse-dynamics features.
Further, inverse dynamics-trained features perform better than random features in 55% of the Atari games.
An interesting outcome of this analysis is that random features for modeling curiosity are a simple, yet surprisingly strong baseline and likely to work well in half of the Atari games.


In light of this, if we want to study the behavior of pure exploration agents, we should not bias the agent.
In the infinite horizon setting (i.e., the discounted returns are not truncated at the end of the episode and always bootstrapped using the value function), death is just another transition to the agent, to be avoided only if it is boring.
Therefore, we removed 'done' to separate the gains of an agent's exploration from gains that merely come from the death signal.
In practice, we do find that the agent avoids dying in the games since that brings it back to the beginning of the game, an area it has already seen many times and where it can predict the dynamics well.
This subtlety has been neglected by previous works showing experiments without extrinsic rewards.

503:YAMAGUTIseisei
18/08/26 16:47:17.14 CL1hr8qnX BE:44059436-2BP(3)
3
Experiments

In all of our experiments, both the policy and the embedding network work directly from pixels.
For our implementation details, including hyper-parameters and architectures, please refer to Appendix A.
Unless stated otherwise, all curves are the average of three runs with different seeds, and the shaded areas are standard errors of the mean.
We have released the code and videos of a purely curious agent playing across all environments on the website 2.

3.1 Curiosity-driven learning without extrinsic rewards
We begin by scaling up pure curiosity-driven learning to a large number of environments without using any extrinsic rewards.
We pick a total of 54 diverse simulated environments, as shown in Figure 1,
including 48 Atari games, Super Mario Bros., 2 Roboschool scenarios (learning an Ant controller and Juggling), Two-player Pong, and 2 Unity mazes (with and without a TV controlled by the agent).
The goal of this large-scale analysis is to investigate the following questions:
(a) What actually happens when you run a pure curiosity-driven agent on a variety of games without any extrinsic rewards?
(b) What kinds of behaviors can you expect from these agents? (c) What is the effect of the different feature learning variants in dynamics-based curiosity on these behaviors?

2 URL: pathak22.github.io


504:YAMAGUTIseisei
18/08/26 16:48:28.26 CL1hr8qnX BE:9791322-2BP(3)

A) Atari Games
To answer these questions, we began with a collection of well-known Atari games and ran a suite of experiments with different feature learning methods.
One way to measure how well a purely curious agent performs is to measure the extrinsic reward it is able to achieve, i.e. how good is the agent at playing the game.
We show the evaluation curves of mean extrinsic reward on 8 common Atari games in Figure 2, and on the full 48-game Atari suite in Figure 8 in the appendix.
It is important to note that the extrinsic reward is only used for evaluation, not for training.
However, this is just a proxy for pure exploration because the game rewards could be arbitrary and might not align at all with how the agent explores out of curiosity.

The first thing to notice from the curves is: most of them are going up.
This shows that a pure curiosity-driven agent can learn to obtain external rewards even without using any extrinsic rewards during training.
It is remarkable that agents with no extrinsic reward and no end of episode signal can learn to get scores comparable in some cases to learning with the extrinsic reward.
For instance, in Breakout, the game score increases when the paddle hits the ball into the bricks, which disappear and give points when struck.
The more bricks are struck in a row by the ball, the more complicated the pattern of remaining bricks becomes, making the agent more curious to explore further, hence collecting points as a by-product.
Further, when the agent runs out of lives, the bricks are reset to a uniform structure that has been seen by the agent many times before and is hence very predictable, so the agent, in order to remain curious, tries to stay alive and avoid the reset caused by death.

505:YAMAGUTIseisei
18/08/26 16:54:32.44 CL1hr8qnX BE:132176669-2BP(3)
This is an unexpected result and might suggest that many popular RL test-beds do not need an external reward.
This may be because game designers (similar to architects, urban planners, gardeners, etc.) are
very good at setting up curriculums to guide agents through a task, which explains why a curiosity-like objective aligns reasonably well with the extrinsic reward in many human-designed environments [6, 12, 16, 48].
However, this is not always the case, and sometimes a curious agent can even do worse than a random agent!
This happens when the extrinsic reward has little correlation with the agent's exploration, or when the agent fails to explore efficiently (e.g., see the games 'Atlantis' and 'IceHockey' in Figure 8).
We further encourage the reader to refer to the game-play videos of the agent available on the website for a better understanding of the learned skills.

Comparison of feature learning methods:
We compare four feature learning methods in Figure 2: raw pixels, random features, inverse dynamics features and VAE features.
Training the dynamics model on raw pixels performs poorly across all environments, while encoding pixels into features does better.
This is likely because it is hard to learn a good dynamics model in pixel space, and prediction errors may be dominated by small irrelevant details.
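A minimal sketch of the dynamics-based curiosity bonus computed in a feature space (our own illustration; the class name, layer sizes, and scalar action encoding are assumptions, and the paper's exact architectures are given in its Appendix A): a frozen, randomly initialized CNN embeds the observations, and the forward model's prediction error in that feature space serves as the intrinsic reward.

import torch
import torch.nn as nn

class RandomFeatureCuriosity(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Frozen random encoder phi: stack of 4 grayscale 84x84 frames -> features.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.LeakyReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.LeakyReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.LeakyReLU(),
            nn.Flatten(), nn.Linear(64 * 7 * 7, feat_dim),
        )
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # random features are never trained
        # Forward dynamics model f(phi(x_t), a_t) -> predicted phi(x_{t+1}).
        # The action is fed as a single scalar here for brevity; a one-hot
        # embedding would be more typical.
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + 1, 512), nn.ReLU(), nn.Linear(512, feat_dim)
        )

    def intrinsic_reward(self, obs, action, next_obs):
        with torch.no_grad():
            phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        pred = self.forward_model(torch.cat([phi, action.float().unsqueeze(1)], dim=1))
        # Per-sample prediction error, used both as the intrinsic reward and
        # as the training loss for the forward model.
        return 0.5 * (pred - phi_next).pow(2).mean(dim=1)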

Surprisingly, random features (RF) perform quite well across tasks and sometimes better than using learned features.
One reason for the good performance is that the random features are kept frozen (stable), so the dynamics model learned on top of them has an easier time because of the stationarity of its target.
In general, random features should work well in domains where the visual observations are simple enough that random features preserve enough information about the raw signal, for instance, Atari games.
Interestingly, we find that while random features work well at training, IDF learned features appear to generalize better in Mario Bros. (see Section 3.2 for details).

506:YAMAGUTIseisei
18/08/26 16:55:34.83 CL1hr8qnX BE:110146695-2BP(3)
The VAE method also performed well but was somewhat unstable, so we decided to use RF and IDF for further experiments.
The detailed results in appendix Figure 8 compare IDF vs. RF across the full Atari suite.
To quantify the learned behaviors, we compared our curious agents to a randomly acting agent.
We found that an IDF-curious agent collects more game reward than a random agent in 75% of the Atari games, and an RF-curious agent does better in 70%.
Further, IDF does better than RF in 55% of the games.
Overall, random features and inverse dynamics features worked well in general.
Further details are in the appendix.

B) Super Mario Bros.
We compare different feature learning methods in Mario Bros. in Figure 2.
Super Mario Bros has already been studied in the context of extrinsic reward free learning [27] in small-scale experiments, and so we were keen to see how far curiosity alone can push the agent.
We use a faster, more efficient version of the Mario simulator to scale up to longer training, keeping the observation space, actions, and dynamics of the game intact.
Due to 100x longer training and using PPO for optimization, our agent is able to pass several levels of the game, significantly improving over prior exploration results on Mario Bros.
Could we further push the performance of a purely curious agent by making the underlying optimization more stable? One way is to scale up the batch-size.
We do so by increasing the number of parallel threads for running environments from 128 to 2048.


507:YAMAGUTIseisei
18/08/26 16:58:13.95 CL1hr8qnX BE:34268827-2BP(3)
[Figure 3 plots: (a) Mario w/ large batch (extrinsic reward per episode vs. number of gradient updates; batch of 128 vs. 1024 environments); (b) Juggling (Roboschool) and (c) Two-player Pong (pure curiosity, no-reward, infinite-horizon exploration, vs. frames in millions).]

Figure 3:
(a) Left: A comparison of the RF method on Mario with different batch sizes.
Results are without using extrinsic reward.
(b) Center: Number of ball bounces in the Juggling (Roboschool) environment.
(c) Right: Mean episode length in the multiplayer Pong environment.
The discontinuous jump on the graph corresponds to the agent reaching a limit of the environment:
after a certain number of steps in the environment, the Atari Pong emulator starts randomly cycling through background colors and becomes unresponsive to the agent's actions.

508:YAMAGUTIseisei
18/08/26 16:59:09.88 CL1hr8qnX BE:19582324-2BP(3)
We show the comparison between training using 128 and 2048 parallel environment threads in Figure 3(a).
As apparent from the graph, training with large batch-size using 2048 parallel environment threads performs much better.
In fact, the agent is able to explore much more of the game: discovering 11 different levels of the game, finding secret rooms and defeating bosses.
Note that the x-axis in the figure is the number of gradient steps, not the number of frames, since the point of this large-scale experiment is not a claim about sample-efficiency, but performance with respect to training the agent.
This result suggests that the performance of a purely curiosity-driven agent would improve as the training of the base RL algorithm (PPO in our case) gets better.
The video is on the website.

C) Roboschool Juggling
We modified the Pong environment from the Roboschool framework to only have one paddle and to have two balls.
The action space is continuous with two-dimensions, and we discretized the action space into 5 bins per dimension giving a total of 25 actions.
Both the policy and embedding network are trained on pixel observation space (note: not state space).
This environment is more difficult to control than the toy physics used in games, but the agent learns to intercept and strike the balls when they come into its area.
We monitored the number of bounces of the balls as a proxy for interaction with the environment, as shown in Figure 3(b).
See the video on the project website.
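The discretization mentioned above can be sketched as follows (our own illustration; the bin count matches the text, but the action range and helper name are assumptions): each discrete action index is decoded into a 2-dimensional continuous control by picking one of 5 bin centers per dimension, giving 5 x 5 = 25 actions.

import numpy as np

def make_discrete_to_continuous(bins_per_dim=5, low=-1.0, high=1.0, dims=2):
    centers = np.linspace(low, high, bins_per_dim)  # bin centers per dimension
    def to_continuous(action_index):
        # action_index in [0, bins_per_dim ** dims); decode it digit by digit.
        coords = []
        for _ in range(dims):
            coords.append(centers[action_index % bins_per_dim])
            action_index //= bins_per_dim
        return np.array(coords)
    return to_continuous

to_cont = make_discrete_to_continuous()
print(to_cont(0))   # [-1., -1.]
print(to_cont(24))  # [ 1.,  1.]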

509:YAMAGUTIseisei
18/08/26 16:59:48.41 CL1hr8qnX BE:119937877-2BP(3)
D) Roboschool Ant Robot
We also explored using the Ant environment which consists of an Ant with 8 controllable joints on a track.
We again discretized the action space and trained policy and embedding network on raw pixels (not state space).
However, in this case it was harder to measure exploration, because the extrinsic distance reward measures progress along the racetrack, whereas a purely curious agent is free to move in any direction.
We find that walking-like behavior emerges purely out of curiosity-driven training.
We refer the reader to the result video showing that the agent is meaningfully interacting with the environment.

E) Multi-agent curiosity in Two-player Pong
We have already seen that a purely curiosity-driven agent learns to play several Atari games without reward, but we wonder how much of that behavior is caused by the fact that the opposing player is a computer-agent with hardcoded strategy.
What would happen if both teams playing against each other were curious? To find out, we take the Two-player Pong game where both sides (the paddles) are controlled by curiosity-driven agents.
We share the initial layers of both agents and use different action heads, i.e., the total action space is now the cross product of the actions of player 1 and the actions of player 2.
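A sketch of this shared-trunk, two-head arrangement (our own illustration; the layer sizes and per-player action count are assumptions): both paddles share the convolutional trunk, each gets its own action head, and the joint action passed to the emulator is the pair (a1, a2), i.e. an element of the cross product of the two discrete action sets.

import torch
import torch.nn as nn

class TwoHeadPolicy(nn.Module):
    def __init__(self, n_actions_per_player=6, feat_dim=512):
        super().__init__()
        self.trunk = nn.Sequential(  # shared initial layers
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 9 * 9, feat_dim), nn.ReLU(),
        )
        self.head1 = nn.Linear(feat_dim, n_actions_per_player)  # player 1 logits
        self.head2 = nn.Linear(feat_dim, n_actions_per_player)  # player 2 logits

    def forward(self, obs):
        h = self.trunk(obs)
        return self.head1(h), self.head2(h)

# Sampling one action per head yields the joint action (a1, a2), so the
# effective joint action space has n_actions_per_player ** 2 elements.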

Note that the extrinsic reward is meaningless in this context since the agent is playing both sides, so instead, we show the length of the episode.
The results are shown in Figure 3(c).
We see from the episode length that the agent learns to have more and longer rallies over time, learning to play pong without any teacher, purely by curiosity on both sides.
In fact, the game rallies eventually get so long that they break our Atari emulator causing the colors to change radically, which crashes the policy as shown in the plot.


510:YAMAGUTIseisei
18/08/26 17:01:15.34 CL1hr8qnX BE:19581942-2BP(3)

3.2
Generalization across novel levels in Super Mario Bros.
In the previous section, we showed that our purely curious agent can learn to explore efficiently and learn useful skills, e.g., game-playing behaviour in games, walking behaviour in Ant, etc.
So far, these skills were shown in the environment that the agent was trained on.
However, one advantage of developing reward-free learning is that one should then be able to utilize abundant "unlabeled" environments without reward functions by showing generalization to novel environments.

To test this, we first pre-train our agent using curiosity only in the Level 1-1 of Mario Bros.
We investigate how well RF and IDF-based curiosity agents generalize to novel levels of Mario.
In Figure 4, we show two examples of training on one level of Mario and finetuning on another testing level, and compare to learning on the testing level from scratch.
The training signal in all the cases is only curiosity reward.
In the first case, from Level 1-1 to Level 1-2, the global statistics of the environments match (both are 'day' environments, i.e., blue background), but the levels have different enemies, geometry, and difficulty.
We see that there is strong transfer for both methods in this scenario.
However, the transfer performance is weaker in the second scenario from Level 1-1 to Level 1-3.
This is so because the problem is considerably harder for the latter level pairing as there is a color scheme shift from day to night, as shown in Figure 4.

511:YAMAGUTIseisei
18/08/26 17:02:53.25 CL1hr8qnX BE:29373034-2BP(3)
We further note that IDF-learned features transfer in both cases, while random features transfer in the first case but do not transfer in the second scenario from day to night.
These results might suggest that while random features perform well on training environments, learned features appear to generalize better to novel levels.
However, this needs more analysis in the future across a large variety of environments.
Overall, we find some promising evidence showing that skills learned by curiosity help our agent explore efficiently in novel environments.


[Figure 4 plots: extrinsic reward per episode vs. frames (millions); curves: IDF scratch, IDF transfer, RF scratch, RF transfer; left panel: World 1 Level 1 to World 2 Level 1; right panel: World 1 Level 1 to World 3 Level 1.]

Figure 4:
Mario generalization experiments.
On the left we show transfer results from Level 1-1 to Level 1-2, and on the right we show transfer results from Level 1-1 to Level 1-3.
Underneath each plot is a map of the source and target environments.
All agents are trained without extrinsic reward.

512:YAMAGUTIseisei
18/08/26 17:04:02.97 CL1hr8qnX BE:36715853-2BP(3)
[Figure 5 plot: Unity maze; extrinsic reward per episode vs. frames (millions); curves: Random CNN features, Extrinsic only, Inverse dynamics features.]

Figure 5: Mean extrinsic reward in the Unity environment while training with terminal extrinsic + curiosity reward.
Note that the curve for extrinsic reward only training is constantly zero.


3.3 Curiosity with Sparse External Reward
In all our experiments so far, we have shown that our agents can learn useful skills without any extrinsic rewards driven purely by curiosity.
However, in many scenarios, we might want the agent to perform some particular task of interest.
This is usually conveyed to the agent by defining extrinsic rewards.
When rewards are dense (e.g. game score at every frame), classic RL works well and intrinsic rewards generally should not help performance.
However, designing dense rewards is a challenging engineering problem (see introduction for details).
In this section, we evaluate how well curiosity can help an agent perform a task in presence of sparse, or just terminal, rewards.

Terminal reward setting:
For many real problems, e.g. navigation, only a terminal reward is available, a setting where classic RL typically performs poorly.
Hence, we consider the 3D navigation in a maze designed in the Unity ML-agent framework with 9 rooms and a sparse terminal reward.


513:YAMAGUTIseisei
18/08/26 17:04:57.59 CL1hr8qnX BE:78326584-2BP(3)

There is a discrete action space consisting of: move forwards, look left 15 degrees, look right 15 degrees and no-op.
The agent starts in room 1, which is furthest away from room 9, which contains the goal of the agent.
We compare an agent trained with extrinsic reward (+1 when the goal is reached, 0 otherwise) to an agent trained with extrinsic + intrinsic reward.
The extrinsic-only agent (classic RL) never finds the goal in any of our trials, which means it cannot obtain any meaningful gradients.
In contrast, the extrinsic+intrinsic agent typically converges to getting the reward every time.
Figure 5 shows results for vanilla PPO, PPO + IDF-curiosity, and PPO + RF-curiosity.

Sparse reward setting: In preliminary experiments, we picked 5 Atari games which have sparse rewards (as categorized by [3]), and compared extrinsic (classic RL) vs.
extrinsic+intrinsic (ours) reward performance.
In 4 games out of 5, the curiosity bonus improves performance (see Table 2 in the appendix; higher scores are better).
We would like to emphasize that this is not the focus of the paper, and these experiments are provided just for completeness.
We just combined extrinsic (coefficient 1.0) and intrinsic reward (coefficient 0.01) directly without any tuning.
We leave the question on how to optimally combine extrinsic and intrinsic rewards as a future direction.
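Concretely, the combination is just a fixed weighted sum (coefficients as stated above; the function name is ours):

def combined_reward(r_ext, r_int, ext_coef=1.0, int_coef=0.01):
    # Straightforward weighted sum; no per-game tuning was done.
    return ext_coef * r_ext + int_coef * r_int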

514:YAMAGUTIseisei
18/08/26 17:11:36.23 CL1hr8qnX BE:29373326-2BP(3)
4
Related Work

Intrinsic Motivation:
A family of approaches to intrinsic motivation reward
an agent based on prediction error [2, 27, 36, 42], prediction uncertainty [11, 44], or improvement [19, 34] of a forward dynamics model of the environment that gets trained along with the agent 's policy.
As a result the agent is driven to reach regions of the environment that are difficult to predict for the forward dynamics model, while the model improves its predictions in these regions.
These adversarial and non-stationary dynamics can give rise to complex behaviors.
Relatively little work has been done in this area on the pure exploration setting where there is no external reward.
Of these, most closely related are those that use a forward dynamics model of a feature space, such as Stadie et al. [42], who use autoencoder features, and Pathak et al. [27], who use features trained with an inverse dynamics task.
These correspond roughly to the VAE and IDF methods detailed in Section 2.1.

Smoothed versions of state visitation counts can be used for intrinsic rewards [3, 9, 24, 47].
Count-based methods have already shown very strong results when combined with extrinsic rewards, such as setting the state of the art in the Atari game Montezuma's Revenge [3],
and have also shown significant exploration of the game without using the extrinsic reward.
It is not yet clear in which situations count-based approaches should be preferred over dynamics-based approaches; we chose to focus on dynamics-based bonuses in this paper since we found them straightforward to scale and parallelize.
In our preliminary experiments, we did not have sufficient success with already existing count-based implementations in scaling up for a large-scale study.

515:YAMAGUTIseisei
18/08/26 17:16:26.46 CL1hr8qnX BE:68536447-2BP(3)
Learning without extrinsic rewards or fitness functions has also been studied extensively in evolutionary computing, where it is referred to as 'novelty search' [17, 18, 43].
There the novelty of an event is often defined as the distance of the event to the nearest neighbor amongst previous events, using some statistics of the event to compute distances.
One interesting finding from this literature is that often much more interesting solutions can be found by not solely optimizing for fitness.
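As a hedged sketch of the nearest-neighbour novelty score described above (our own illustration, not code from the cited work):

import numpy as np

def novelty(event, archive, k=1):
    # event: feature vector describing a behaviour; archive: list of past events.
    if not archive:
        return float("inf")  # everything is novel at the start
    dists = np.sort([np.linalg.norm(event - e) for e in archive])
    return float(np.mean(dists[:k]))  # mean distance to the k nearest past events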

Other methods of exploration are designed to work in combination with maximizing a reward function, such as those utilizing uncertainty about value function estimates [5, 23], or those using perturbations of the policy for exploration [8, 29].
Schmidhuber [37] and Oudeyer [25], Oudeyer and Kaplan [26] provide a great review of some of the earlier work on approaches to intrinsic motivation.
Alternative methods of exploration include Sukhbaatar et al. [45] where they utilize an adversarial game between two agents for exploration.
In Gregor et al. [10], they optimize a quantity called empowerment which is a measurement of the control an agent has over the state.
In concurrent work, Eysenbach et al. [7] use diversity as a measure to learn skills without reward functions.

Random Features:
One of the findings in this paper is the surprising effectiveness of random features, and there is a substantial literature on random projections and more generally randomly initialized neural networks.
Much of the literature has focused on using random features for classification [14, 33, 49] where the typical finding is that whilst random features can work well for simpler problems,
feature learning performs much better once the problem becomes sufficiently complex.
Whilst we expect this pattern to also hold true for dynamics-based exploration, we have some preliminary evidence showing that learned features appear to generalize better to novel levels in Mario Bros.


516:YAMAGUTIseisei
18/08/26 17:18:41.35 CL1hr8qnX BE:110146695-2BP(3)

5
Discussion

We have shown that our agents trained purely with a curiosity reward are able to learn useful behaviours: (a) an agent able to play many Atari games without using any rewards;
(b) Mario able to cross over 11 levels without reward;
(c) walking-like behavior emerging in the Ant environment;
(d) juggling-like behavior in the Roboschool environment; and (e) rally-making behavior in Two-player Pong with a curiosity-driven agent on both sides.
But this is not always true as there are some Atari games where exploring the environment does not correspond to extrinsic reward.

More generally, these results suggest that, in environments designed by humans, the extrinsic reward is perhaps often aligned with the objective of seeking novelty.
Game designers set up curriculums to guide users while playing the game, which explains why a curiosity-like objective aligns reasonably well with the extrinsic reward in many human-designed games [6, 12, 16, 48].



[Figure 6 plot: extrinsic reward per episode vs. frames (millions); curves: RF with TV off, RF with TV on, IDF with TV off, IDF with TV on.]

Figure 6:
We add a noisy TV to the Unity environment of Section 3.3.
We compare IDF and RF with and without the TV.

517:YAMAGUTIseisei
18/08/26 17:19:53.47 CL1hr8qnX BE:78326584-2BP(3)
Limitation of prediction error based curiosity:
A more serious potential limitation is the handling of stochastic dynamics.
If the transitions in the environment are random, then even with a perfect dynamics model, the expected reward will be the entropy of the transition, and the agent will seek out transitions with the highest entropy.
Even if the environment is not truly random, unpredictability caused by a poor learning algorithm, an impoverished model class or partial observability can lead to exactly the same problem.
We did not observe this effect in our experiments on games so we designed an environment to illustrate the point.
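As a hedged aside (our own restatement, not an equation from the paper): if the forward model is probabilistic, trained by maximum likelihood, and the intrinsic reward is the negative log-likelihood of the observed next state, then under a perfect model the expected per-step reward equals the conditional entropy of the transition,

\mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[ -\log p(s' \mid s, a) \right] = H\left( p(\cdot \mid s, a) \right),

so irreducible stochasticity leaves a reward floor that cannot be predicted away; the squared-error reward used here behaves analogously.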

We return to the maze of Section 3.3 to empirically validate a common thought experiment called the noisy-TV problem.
The idea is that local sources of entropy in an environment like a TV that randomly changes channels when an action is taken should prove to be an irresistible attraction to our agent.
We take this thought experiment literally and add a TV to the maze along with an action to change the channel.
In Figure 6 we show how adding the noisy-TV affects the performance of IDF and RF.
As expected, the presence of the TV drastically slows down learning, but we note that if the experiment is run long enough the agents do sometimes converge to getting the extrinsic reward consistently.
We have shown empirically that stochasticity can be a problem, and so it is important for future work to address this issue in an efficient manner.

518:YAMAGUTIseisei
18/08/26 17:22:09.20 CL1hr8qnX BE:44059829-2BP(3)
Future Work:
We have presented a simple and scalable approach that can learn nontrivial behaviors across a diverse range of environments without any reward function or end-of-episode signal.
One surprising finding of this paper is that random features perform quite well, but learned features appear to generalize better.
Whilst we believe that learning features will become important once the environment is complex enough, we leave that up to future work to explore.
Our wider goal, however, is to show that we can take advantage of many unlabeled (i.e., not having an engineered reward function) environments to improve performance on a task of interest.
Given this goal, showing performance in environments with a generic reward function is just the first step, and future work could investigate transfer from unlabeled to labeled environments.

Acknowledgments

We would like to thank Chris Lu for helping with the Unity environment, Phillip Isola and Alex Nichols for feedback on the paper.
We are grateful to the members of BAIR and OpenAI for fruitful discussions.
DP is supported by the Facebook graduate fellowship.

References

[1] Unity ML-agents. URL: github.com. 2





[2] J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning.
arXiv:1703.01732, 2017. 3, 9
[3] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos.
Unifying count-based exploration and intrinsic motivation. In NIPS, 2016. 1, 9
[4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling.
The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, June 2013. 2

519:YAMAGUTIseisei
18/08/26 17:23:49.66 CL1hr8qnX BE:51402637-2BP(3)
[5] R. Y. Chen, J. Schulman, P. Abbeel, and S. Sidor.
UCB and infogain exploration via Q-ensembles. arXiv:1706.01502, 2017. 9
[6] G. Costikyan. Uncertainty in games. MIT Press, 2013. 6, 10
[7] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine.
Diversity is all you need: Learning skills without a reward function. arXiv preprint, 2018. 9
[8] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg.
Noisy networks for exploration. arXiv:1706.10295, 2017. 9
[9] J. Fu, J. D. Co-Reyes, and S. Levine.
EX2: Exploration with exemplar models for deep reinforcement learning. NIPS, 2017. 9
[10] K. Gregor, D. J. Rezende, and D. Wierstra.
Variational intrinsic control. ICLR Workshop, 2017. 9
[11] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel.
Vime: Variational information maximizing exploration. In NIPS, 2016. 1, 9
[12] R. Hunicke, M. LeBlanc, and R. Zubek.
Mda: A formal approach to game design and game research. In AAAI Workshop on Challenges in Game AI, 2004. 6, 10
[13] S. Ioffe and C. Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 4
[14] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al.
What is the best multi-stage architecture for object recognition?
In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146-2153. IEEE, 2009. 9
[15] D. P. Kingma and M. Welling.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 2, 3
[16] N. Lazzaro. Why we play games:
Four keys to more emotion in player experiences. In Proceedings of GDC, 2004. 6, 10
[17] J. Lehman and K. O. Stanley.
Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, 2008. 9
[18] J. Lehman and K. O. Stanley.
Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 2011. 9

520:YAMAGUTIseisei
18/08/26 17:25:07.68 CL1hr8qnX BE:51402637-2BP(3)
[19] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer.
Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, 2012. 9
[20] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer.
Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, 2012. 1
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 2015. 1
[22] S. Mohamed and D. J. Rezende.
Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, 2015. 1
[23] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy.
Deep exploration via bootstrapped dqn. In NIPS, 2016. 9
[24] G. Ostrovski, M. G. Bellemare, A. v. d. Oord, and R. Munos.
Count-based exploration with neural density models. arXiv:1703.01310, 2017. 1, 9
[25] P.-Y. Oudeyer.
Computational theories of curiosity-driven learning. arXiv preprint arXiv:1802.10546, 2018. 9




[26] P.-Y. Oudeyer and F. Kaplan.
What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 2009. 1, 9
[27] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell.
Curiosity-driven exploration by self-supervised prediction. In ICML, 2017. 1, 2, 3, 4, 6, 9
[28] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell.
Zero-shot visual imitation. In ICLR, 2018. 1
[29] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz.
Parameter space noise for exploration. arXiv:1706.01905, 2017. 9
[30] P. Poupart, N. Vlassis, J. Hoey, and K. Regan.
An analytic solution to discrete bayesian reinforcement learning. In ICML, 2006. 1

521:YAMAGUTIseisei
18/08/26 17:27:09.49 CL1hr8qnX BE:29373034-2BP(3)
[31] D. J. Rezende, S. Mohamed, and D. Wierstra.
Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014. 3
[32] R. M. Ryan and E. L. Deci.
Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary Educational Psychology, 2000. 1
[33] A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng.
On random weights and unsupervised feature learning. In ICML, pages 1089-1096, 2011. 9
[34] J. Schmidhuber.
Curious model-building control systems. In Neural Networks, 1991. 1991 IEEE International Joint Conference on, pages 1458-1463. IEEE, 1991. 9
[35] J. Schmidhuber.
A possibility for implementing curiosity and boredom in model-building neural controllers.
In From animals to animats: Proceedings of the first international conference on simulation of adaptive behavior, 1991. 1
[36] J. Schmidhuber.
A possibility for implementing curiosity and boredom in model-building neural controllers, 1991. 9
[37] J. Schmidhuber.
Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2010. 9
[38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 4
[39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 2
[40] S. P. Singh, A. G. Barto, and N. Chentanez.
Intrinsically motivated reinforcement learning. In NIPS, 2005. 1
[41] L. Smith and M. Gasser.
The development of embodied cognition: Six lessons from babies. Artificial life, 2005. 1
[42] B. C. Stadie, S. Levine, and P. Abbeel.
Incentivizing exploration in reinforcement learning with deep predictive models. NIPS Workshop, 2015. 2, 9
[43] K. O. Stanley and J. Lehman.
Why greatness cannot be planned: The myth of the objective. Springer, 2015. 9

522:YAMAGUTIseisei
18/08/26 17:28:31.63 CL1hr8qnX BE:66087893-2BP(3)
[44] S. Still and D. Precup.
An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 2012. 9
[45] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus.
Intrinsic motivation and automatic curricula via asymmetric self-play. In ICLR, 2018. 9
[46] R. S. Sutton and A. G. Barto.
Reinforcement learning: An introduction. MIT press Cambridge, 1998. 4
[47] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel.
#Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems, 2017. 9
[48] P. Wouters, H. Van Oostendorp, R. Boonekamp, and E. Van der Spek.
The role of game discourse analysis and curiosity in creating engaging and effective serious games by implementing a back story and foreshadowing. Interacting with Computers, 2011. 6, 10
[49] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang.
Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476-1483, 2015. 9




A
Implementation Details

We have released the training code and environments on our website 3.
For full details, we refer the reader to our code and the video results on the website.

Pre-processing:
All experiments were done with pixels.
We converted all images to grayscale and resized to size 84x84.
We learn both the agent's policy and the forward dynamics function on a stack of historical observations [x(t-3), x(t-2), x(t-1), x(t)] instead of only the current observation.
This is to capture partial observability in these games.
In the case of Super Mario Bros and Atari experiments, we also used a standard frameskip wrapper that repeats each action 4 times.
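A sketch of this pre-processing (our own illustration; a classic gym-style interface with reset() -> obs and step() -> (obs, reward, done, info) is assumed, and the wrapper name is ours):

from collections import deque
import cv2
import numpy as np

def preprocess(frame):
    # RGB frame -> grayscale, resized to 84x84.
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

class FrameSkipAndStack:
    # Repeat each action 4 times and keep the last 4 processed frames.
    def __init__(self, env, skip=4, stack=4):
        self.env, self.skip = env, skip
        self.frames = deque(maxlen=stack)

    def reset(self):
        frame = preprocess(self.env.reset())
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames)            # shape (4, 84, 84)

    def step(self, action):
        total_reward, done, info, obs = 0.0, False, {}, None
        for _ in range(self.skip):              # frameskip wrapper
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.frames.append(preprocess(obs))
        return np.stack(self.frames), total_reward, done, info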

523:YAMAGUTIseisei
18/08/26 17:29:10.00 CL1hr8qnX BE:117489986-2BP(3)
Architectures:
Our embedding network and policy networks had identical architectures and were based on the standard convolutional networks used in Atari experiments.
The layer we take as features in the embedding network had dimension 512 in all experiments and no nonlinearity.
To keep the scale of the prediction error consistent relative to extrinsic reward, in the Unity experiments we applied batchnorm to the embedding network.
We also did this for the Mario generalization experiments to reduce covariate shift from level to level.
For the VAE auxiliary task and the pixel method, we used a similar deconvolutional architecture, the exact details of which can be found in our code submission.
The IDF and forward dynamics networks were heads on top of the embedding network with several extra fully-connected layers of dimensionality 512.
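A sketch of the inverse-dynamics (IDF) head mentioned above (our own illustration; the exact sizes are in the released code): it takes the embeddings of two consecutive observations and is trained to classify which action was taken, and the gradients flowing back into the embedding network are what shape the learned features.

import torch
import torch.nn as nn

class InverseDynamicsHead(nn.Module):
    def __init__(self, feat_dim=512, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def loss(self, phi_t, phi_tp1, actions):
        # Predict a_t from (phi(x_t), phi(x_{t+1})); gradients flow through
        # phi_t and phi_tp1 into the embedding network, training the features.
        logits = self.net(torch.cat([phi_t, phi_tp1], dim=1))
        return self.loss_fn(logits, actions)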

Hyper-parameters:
We used a learning rate of 0.0001 for all networks.
In most experiments, we used 128 parallel environments with the exceptions of the Unity and Roboschool experiments where we could only run 32 parallel environments, and the large scale Mario experiment where we used 2048.
We used rollouts of length 128 in all experiments except for the Unity experiments where we used 512 length rollouts so that the network could quickly latch onto the sparse reward.
In the initial 9 experiments on Mario and Atari, we used 3 optimization epochs per rollout in the interest of speed.
In the Mario scaling and generalization experiments, as well as the Roboschool experiments, we used 6 epochs.
In the Unity experiments, we used 8 epochs, again to more quickly take advantage of sparse rewards.

B
Additional Results

524:YAMAGUTIseisei
18/08/26 17:29:43.70 CL1hr8qnX BE:66087893-2BP(3)
[Figure 7 plots: (a) best returns and (b) episode length vs. frames (millions) on BeamRider, Breakout, MontezumaRevenge, Pong, Mario, Qbert, Riverraid, Seaquest, and SpaceInvaders; curves: Pixels, VAE features, Inverse Dynamics features, Random CNN features.]

Figure 7:
(a) Left: Best extrinsic returns on eight Atari games and Mario.
(b) Right: Mean episode lengths on eight Atari games and Mario.

525:YAMAGUTIseisei
18/08/26 17:32:07.16 CL1hr8qnX BE:176234898-2BP(3)
[Figure 8 plots: extrinsic reward per episode vs. frames (millions) on 48 Atari games; curves: Inverse Dynamics features, Random agent, Random CNN features.]

Figure 8:
Pure curiosity-driven exploration (no extrinsic reward, or end-of-episode signal) on 48 Atari games.
We observe that the extrinsic returns of curiosity-driven agents often increase despite the agents having no access to the extrinsic return or end-of-episode signal.
In multiple environments,
the performance of the curiosity-driven agents is significantly better than that of a random agent, although there are environments where the behavior of the agent is close to random, or in fact seems to minimize the return rather than maximize it.
For the majority of the training process, RF performs better than a random agent in about 67% of the environments, while IDF performs better than a random agent in about 71% of the environments.


Reward      Gravitar        Freeway      Venture     PrivateEye       MontezumaRevenge
Ext Only    999.3 ± 220.7   33.3 ± 0.6   0 ± 0       5020.3 ± 395     1783 ± 691.7
Ext + Int   1165.1 ± 53.6   32.8 ± 0.3   416 ± 416   3036.5 ± 952.1   2504.6 ± 4.6

Table 2:
These results compare the mean reward (± std-error) after 100 million frames across 3 seeds for an agent trained with intrinsic plus extrinsic reward versus extrinsic reward only.
The extrinsic (coefficient 1.0) and intrinsic reward (coefficient 0.01) were directly combined without any hyper-parameter tuning.
We leave the question on how to optimally combine extrinsic and intrinsic rewards up to future work.
This is to emphasize that combining extrinsic with intrinsic rewards is not the focus of the paper, and these experiments are provided just for completeness.

526:YAMAGUTIseisei
18/08/26 17:32:51.06 CL1hr8qnX BE:58745546-2BP(3)
B.1
Atari
To better measure the amount of exploration, we provide the best return of curiosity-driven agents in Figure 7(a) and the episode lengths in Figure 7(b).
Notably on Pong the increasing episode length combined with a plateau in returns shows that the agent maximizes the number of ball bounces, rather than the reward.

Figure 8 shows the performance of curiosity-driven agents based on Inverse Dynamics and Random features on 48 Atari games.

Although not the focus of this paper, for completeness we include some results on combining intrinsic and extrinsic reward on several sparse reward Atari games.
When combining with extrinsic rewards, we use the end of the episode signal.
The reward used is the extrinsic reward plus 0.01 times the intrinsic reward.
The results are shown in Table 2.
We don't observe a large difference between the settings, likely because the combination of intrinsic and extrinsic reward needs to be tuned.
We did observe that one of the intrinsic+extrinsic runs on Montezuma's Revenge explored 10 rooms.

3 Website at URL: pathak22.github.io


527:YAMAGUTIseisei
18/08/26 17:33:32.21 CL1hr8qnX BE:58745546-2BP(3)


[Figure 9 plot: Scale in Mario; extrinsic reward per episode vs. number of gradient updates; curves: Batch of 128 environments, Batch of 1024 environments.]

Figure 9:
Best extrinsic returns on the Mario scaling experiments.
We observe that larger batches allow the agent to explore more effectively, reaching the same performance in fewer parameter updates and also achieving better ultimate scores.


B.2 Mario
We show the analogue of the plot in Figure 3(a) using max extrinsic returns; see Figure 9.

528:YAMAGUTIseisei
18/08/26 17:36:22.93 CL1hr8qnX BE:29372562-2BP(3)

529:YAMAGUTIseisei
18/09/09 07:47:01.01 vMpjqLBja BE:119937877-2BP(3)


A Simple Multi-Processor Computer Based on Subleq


Oleg Mazonka and Alex Kolodin
mazonka@gmail.com alex.kolodin@gmail.com

May 2011 (Revision 3, draft 8)



Abstract

Subleq (Subtract and Branch on result Less than or Equal to zero) is both an instruction set and a programming language for a One Instruction Set Computer (OISC).
We describe a hardware implementation of an array of 28 one-instruction Subleq processors on a low-cost FPGA board.
Our test results demonstrate that the computational power of our Subleq OISC multi-processor is comparable to that of a CPU of a modern personal computer.
Additionally, we provide implementation details of our compiler from a C-style language to Subleq.

530:YAMAGUTIseisei
18/09/09 07:48:08.65 vMpjqLBja BE:48954454-2BP(3)
Contents

1. Introduction
2. Subleq Assembly Language
3. Hardware design
  3.1 Overview
  3.2 Interface description
  3.3 Subleq processor
4. C Compiler for Subleq
  4.1 Stack
  4.2 Expressions
  4.3 Function calls
  4.4 Stack variables
  4.5 Multiplication
  4.6 Conditional jump
5. Results
  5.1 Test #1
  5.2 Test #2
6. Conclusion
7. Appendix
  7.1 C with multiplication
  7.2 C without multiplication
  7.3 Subleq code
References



531:YAMAGUTIseisei
18/09/09 07:51:37.91 vMpjqLBja BE:34268827-2BP(3)
1. Introduction

OISC (One Instruction Set Computer) is the ultimate RISC (Reduced Instruction Set Computer): a conventional CPU where the number of instructions is reduced to one.
Having only one available processor instruction eliminates the necessity of an op-code and permits simpler computational elements, thus allowing more of them to be implemented in hardware with the same number of logical gates.
Since our goal was to build a functional multi-processor system with a maximum possible number of processors on a single low-cost programmable chip,
OISC was a natural choice, with the remaining step being selection of a suitable single-processor instruction set.

Currently known OISCs can be roughly separated into three broad categories:

1. Transport Triggered Architecture Machines;
2. Bit Manipulating Machines;
3. Arithmetic Based Turing-Complete Machines.

Transport Triggered Architecture (TTA) is a design in which computation is a side effect of data transport.
Usually some memory registers (triggering ports) within the common address space perform an assigned operation when an instruction references them.
For example, in an OISC utilizing a single memory-to-memory copy instruction [1], this is done by triggering ports performing arithmetic and instruction pointer jumps when writing into them.
Despite appealing simplicity, there are two major drawbacks associated with such OISC.
Firstly, the CPU has to have separate functional units controlling triggering ports.
Secondly, there are difficulties with generalization of the design, since any two different hardware designs are likely to use two different assembly languages.
Because of these disadvantages we ruled out this class of OISCs for our implementation.

532:YAMAGUTIseisei
18/09/09 07:52:35.82 vMpjqLBja BE:58745164-2BP(3)
Bit Manipulating Machines is the simplest class.
A bit copying machine, called BitBitJump, copies one bit in memory and passes the execution unconditionally to the address specified by one of the operands of the instruction [2].
This process turns out to be capable of universal computation (i.e.
being able to execute any algorithm and to interpret any other universal machine) because copying bits can conditionally modify the code ahead to be executed.
Another machine, called the Toga computer, inverts a bit and passes the execution conditionally depending on the result of inversion [3].
Yet another bit operating machine, similar to BitBitJump, copies several bits at the same time.
The problem of computational universality is solved in this case by keeping predefined jump tables in the memory [4].
Despite simplicity of the bit operating machines, we ruled them out too, because they require more memory than is normally available on an inexpensive FPGA.
To make a functional multiprocessor machine with bit manipulation operation, at least 1Mb of memory per processor is required.
Therefore we decided that a more complex processor with less memory is a better choice for our purposes.




Arithmetic based Turing-complete Machines use an arithmetic operation and a conditional jump.
Unlike the two previous classes which are universal computers, this class is universal and Turing-complete in its abstract representation.
The instruction operates on integers which may also be addresses in memory.
Currently there are several known OISCs of this class, based on different arithmetic operations [5]:
addition - Addleq, decrement - DJN, increment - P1eq, and subtraction - Subleq (Subtract and Branch on result Less than or Equal to zero).
The latter is the oldest, the most popular and, arguably, the most efficient [6][7].
A Subleq instruction consists of three operands: two for subtraction and one for the conditional jump.

533:YAMAGUTIseisei
18/09/09 07:53:50.33 vMpjqLBja BE:117489986-2BP(3)
Attempts to build hardware around Subleq have been undertaken previously.
For example, David A Roberts designed a Subleq CPU and wrote a software Subleq library [8].
His implementation is a single CPU with keyboard input, terminal, control ROM, and 16Mb RAM, and is much more complex than ours.
There were a few other similar designs described on various Internet sites, e.g. [9].
However all of them were just proof-of-concept simulations without a practical implementation.

In the following sections we describe components of the system we built.
Section 2 outlines a Subleq abstract machine and its assembly notation.
Section 3 describes our hardware implementation of the multiprocessor core.
Section 4 briefly describes techniques used to convert a high level programming language into Subleq instruction code.
In Sections 5 and 6 we present comparative speed-test results for our device, followed by a discussion and summary.
In the Appendix the code calculating factorials is presented in C and Subleq notations.


Subleq software <- USB -> Driver, Programming environment

Figure 1 FPGA board is connected to PC by USB cable


Figure 1 represents the connection of our device to a computer with USB cable.

534:YAMAGUTIseisei
18/09/09 07:55:07.61 vMpjqLBja BE:73431465-2BP(3)
2. Subleq Assembly Language

A Subleq abstract machine operates on an infinite array of memory, where each cell holds an integer number.
This number can be an address of another memory cell.
The numbering starts from zero.
The program is defined as a sequence of instructions read from the memory with the first instruction at the address zero.
A Subleq instruction has 3 operands:

A B C

Execution of one instruction A B C subtracts the value in the memory cell at the address stored in A from the content of a memory cell at the address stored in B and then writes the result back into the cell with the address in B.
If the value after subtraction in B is less or equal to zero, the execution jumps to the address specified in C; otherwise execution continues to the next instruction, i.e. the address of the memory cell following C.




Assembly notation helps to read and write code in Subleq.
The following is the list of syntax conventions:

- label;
- question mark;
- reduced instruction;
- multi-instruction;
- literal and expression;
- data section;
- comment.

535:YAMAGUTIseisei
18/09/09 07:56:10.49 vMpjqLBja BE:117490368-2BP(3)
Label is a symbolic name of a particular address followed by a colon.
In the following example

 A B C
A:2 B:1 0
C:B B 0

each line represents one instruction with three operands.
Here A, B, and C are not abstract names, but labels (addresses) of specific cells in the memory.
For example, label A refers to the 4th cell, which is initialised with value 2.
The first instruction subtracts the value of cell A from the value of cell B, which is 1, and stores the result back in cell B, which becomes -1.
Since the result is less than zero, the next instruction to be executed is the third line, because value C is the address of the first operand of the instruction on the third line.
That subtracts B from B making it zero, so the execution is passed to the address 0.
If these three lines are the whole program, then the first operand of the first instruction has the address 0.
In this case the execution is passed back to the first instruction which would make B -2.
That process continues forever.
The instructions being executed are only the first and the third lines, and the value of the cell B changes as 1, -1, 0, -2, 0, -2, 0, and so on.
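The semantics just illustrated are small enough to capture in a few lines. Below is a minimal Subleq emulator sketch of our own (not the VHDL implementation described later); it takes the program as a flat list of integers and halts on a negative or out-of-range jump target, ignoring for now the I/O convention introduced further on.

def run_subleq(memory, max_steps=100000):
    # memory: flat list of integers; the instruction at IP is (A, B, C).
    ip = 0
    for _ in range(max_steps):
        if ip < 0 or ip + 2 >= len(memory):
            break                              # negative jump target: halt
        a, b, c = memory[ip], memory[ip + 1], memory[ip + 2]
        if a < 0 or b < 0:
            break                              # I/O addresses not modelled here
        memory[b] -= memory[a]                 # mem[B] = mem[B] - mem[A]
        ip = c if memory[b] <= 0 else ip + 3
    return memory

# The three-line example above assembles to the cells below (A at address 3,
# B at address 4, C at address 6); it loops forever, so max_steps bounds the run.
prog = [3, 4, 6,   2, 1, 0,   4, 4, 0]
run_subleq(prog, max_steps=10)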

A question mark is defined as the address of the next cell in memory:

A B ?
B B 0

is the same as

A B C
C:B B 0

Reduced instruction format is a convenient shortcut: two operands instead of three assume the third to be the address of the next instruction, i.e. ?; and only one operand assumes the second to be the same as the first, so

536:YAMAGUTIseisei
18/09/09 07:57:06.05 vMpjqLBja BE:44058863-2BP(3)
A

is the same as

A A




and is the same as

A A ?

If more than one instruction is placed on the same line, each instruction except the last must be followed by a semicolon.
The following code copies the value from A to B:

Z; B; A Z; Z B

Integer numbers like A:72 are used as constants in the code.
Literals can be used instead of integers assuming their ASCII values.
For example, A:'H' is the same as A:72; and A:"Hi" is the same as A:'H' 'i'.
Addition, subtraction, parentheses, and unary minus can be used in expressions.
The code Z Z ?+3 A B C D E F sets Z to zero and jumps to the third instruction D E F.
Since instructions can be reduced, the assembler must know when to generate a full instruction with three operands.
To avoid such generation, a period at the beginning of the line is used.
Thus program data can be placed on such a line.
The code

537:YAMAGUTIseisei
18/09/09 07:58:40.41 vMpjqLBja BE:198264299-2BP(3)
A A ?+1
. U:-1
U A

sets A to 1.
Without the period on the second line the code would be interpreted as

A A ?+1
U:-1 (-1) ?
U A

Comments are delimited by hash symbol #: everything from # till the end of the line is a comment.
Jump to a negative address halts the program.
Usually (-1) as the third operand is used to stop the program, for example:

# halt
Z Z (-1)

The parentheses around (-1) are necessary to indicate that it is the third operand, so the instruction would not be interpreted as

Z Z-1 ?

To make a Subleq program interactive (requesting data and responding to user while working), input and output operations can be defined as operations on a non-existing memory cell.
The same (-1) address can be used for this.
If the second operand is (-1) the value of the first operand is the output.
If the first operand is (-1), the second operand gets the value from the input stream.
Input and output operations are defined on byte basis in ASCII code.
If the program tries to output a value greater than 255, the behaviour is undefined.

Below is a "Hello world" program adapted from Lawrence Woodman helloworld.sq [10].

538:YAMAGUTIseisei
18/09/09 07:59:47.32 vMpjqLBja BE:34267872-2BP(3)


It is exceptionally terse, but is a good example of Subleq efficiency.

L:H (-1); U L; U ?+2; Z H (-1); Z Z L
. U:-1 H:"hello, world\n" Z:0

A special variable called Z is often used in Subleq as an intermediate temporary variable within a very small scope.
It is commonly assumed that this variable is initialised to zero and left at zero after every usage.

The program above consists of five instructions.
The first instruction prints the character pointed to by its first operand (the first pointer), which is initialised to the beginning of the data string, the letter 'h'.
The second instruction increments that pointer, the first operand of the first instruction.
The third instruction increments the second pointer, which is the second operand of the fourth instruction.
The fourth instruction tests the value pointed to by the second pointer and halts the program when the value is zero.
It becomes zero when the pointer reaches the cell one after the end of the data string, which is Z:0.
The fifth instruction loops back to the beginning of the program, so the process continues until the halt condition is satisfied.
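Continuing the emulator sketch from earlier (our own illustration), the (-1) input/output convention can be added by treating a negative operand as byte I/O instead of a halt; the fall-through to the next instruction after I/O is our assumption of the usual convention.

import sys

def run_subleq_io(memory, max_steps=1000000):
    # Same loop as before, but operand (-1) performs ASCII byte I/O.
    ip = 0
    for _ in range(max_steps):
        if ip < 0 or ip + 2 >= len(memory):
            break                                   # halt
        a, b, c = memory[ip], memory[ip + 1], memory[ip + 2]
        if b == -1:
            sys.stdout.write(chr(memory[a]))        # output mem[A] as a character
            ip += 3
            continue
        if a == -1:
            memory[b] = ord(sys.stdin.read(1))      # read one byte into mem[B]
            ip += 3
            continue
        memory[b] -= memory[a]
        ip = c if memory[b] <= 0 else ip + 3
    return memory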

539:YAMAGUTIseisei
18/09/09 08:00:42.28 vMpjqLBja BE:102803876-2BP(3)
3. Hardware design

3.1 Overview

We have used Altera Cyclone III EP3C16 FPGA as the basis for a hardware implementation.
The choice was based on the relatively low price (~ US$30) of this FPGA IC chip and availability of the test hardware for it.

The test board we used has a DDR2 RAM IC fitted, but access to the RAM is limited to one process at a time.
True parallel implementation requires separate internal memory blocks allocated for each processor hence the amount of available memory in the FPGA limits the number of processors.
The EP3C16 FPGA has 56 16-bit memory blocks of 8 Kbits each.
Our implementation of one 32-bit Subleq processor requires a minimum of 2 memory blocks, so only 28 processors can be fitted into the FPGA.
We could choose a 16-bit implementation and have more processors (up to 56), but with only 1 Kbyte of memory allocated to each.

The FPGA is connected to the USB bus with the help of an external Cypress FX2 CPU, configured as a bridge between USB and SPI (Serial Peripheral Interface), which is used to load the FPGA with code and data.
The interface bridge is transparent to the PC software.


    FPGA
MEMORY  <-> PROCESSOR1
MEMORY  <-> PROCESSOR2
MEMORY  <-> PROCESSOR3
  :       <-> SPI <-> CONTROL_CPU <-> USB
MEMORY  <-> PROCESSOR7
MEMORY  <-> PROCESSOR8

Figure 2 Block-diagram of the board


Figure 2 is a communication block diagram of the board.
The solution was coded in VHDL and compiled with the Quartus II Web Edition software, freely available from the Altera website.
Our code is scalable up to 63 processors for possible use with larger FPGAs.
The limit of 63 processors is due to our SPI bus addressing implementation and can be increased, if necessary.

All 28 processors run independently and are synchronised by a single 150 MHz clock generated by one of the FPGA PLLs from a reference oscillator fitted on the PCB.

To increase the number of processors, additional boards with FPGAs could be easily connected via USB bus.

3.2 Interface description

Each processor has a serial interface to its allocated memory and a Status byte, both accessible through serial loading at a single address.
The serial interface takes over the memory's data and address buses when processing is stopped.

Address space inside the FPGA is organised as a table addressed by two numbers: processor index and memory address.
Reading one byte from index 0 returns the number of processors inside the FPGA.
For this design the returned value is 28.
Indices from 1 to 28 are assigned to the processors, each of which has 2048 bytes (512 32-bit words) of memory available.

Writing to a processor memory is an operation which sequentially loads a buffer of 2048 bytes.
Reading from the processor’s memory is different: the first word (4 bytes) returned is the status of the processor and the rest is the memory content.

The Status byte (the first byte of the first word) can be in one of three states: 0xA1 (running), 0xA2 (stopped), or 0xA0 (stopped and not run since power-on).
Writing into the processor's memory automatically starts execution, thus eliminating the need for a separate command.
Reading from a processor’s memory stops that processor.
An exception is reading the first byte of status, which does not stop the processor.
Additionally, a processor can be stopped by the Subleq halt operand (-1), as mentioned in Section 2.
Other negative references, such as the input or output addresses described above in the Subleq assembly language section, also stop the processor, because no I/O operations are defined in this architecture.

3.3 Subleq processor

The state machine algorithm can be presented in pseudocode as:

  IP = 0
  while (IP >= 0)
  {
    A = memory[IP]
    B = memory[IP+1]
    C = memory[IP+2]
    if (A < 0 or B < 0)
    {
      IP = -1
    }
    else
    {
      memory[B] = memory[B] - memory[A]
      if (memory[B] > 0)
        IP = IP + 3
      else
        IP = C
    }
  }


where IP is an instruction pointer, memory[] is a value of a memory cell, and A, B, and C are integers.

The Subleq processor core is written with the help of the RAM 2-Port Megafunction of the Quartus II software, which we used to implement dual-port memory access.
The implemented solution allows the memory content to be accessed at two distinct addresses (memory[A] and memory[B]) simultaneously, which saves processing clock ticks.
The disadvantage of this implementation is an additional latency of one clock tick when accessing the data and address buses, compared to a single-port memory implementation.
However, the total number of processing clock ticks per memory access for the dual-port implementation is still less than that required for a single port.

The core is based on a state machine, which starts automatically when the memory of a processor is loaded.
On any read or write operation, or on encountering a negative operand, processing stops; however, the first status byte can be read at any time without affecting the computation.

4. C Compiler for Subleq

In this section we briefly describe some elements of the compiler we wrote, which compiles simplified C code into Subleq [11].
The compiler is used in one of our tests, so that a direct comparison is possible between the execution of natively compiled C code and of Subleq code compiled from the same C source.
The compiler is a high-level language interface to OISC, and the only such compiler known to us at the time of writing.

4.1 Stack

The primary C programming language concepts are functions and the stack.
Implementation of the stack in Subleq is achievable by using memory below the code.
Using code self-modification, one can place values on the stack and retrieve them from it.
Function calls require the return address to be placed on the stack.
Consider the following C code:

  void f()
  {
    ...
  }

  void main()
  {
    ...
    f();
    ...
  }


After the above is compiled to machine code, it must perform the following operations:
1) the address of the instruction immediately after the call to f has to be put on the stack;
2) a jump to the code of the function f must be made;
3) at the end of the function f, the address has to be extracted from the stack; and
4) the execution has to be transferred to the extracted address.
According to the C standard, the function main is a proper C function, i.e. it can be called from other functions, including itself.
Hence the program must have a separate entry point, which in the following code is called sqmain.
The C code above compiles into:

    0 0 sqmain
  _f:
    ...
    #return ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0
  _main:
    ...
    #call f
    dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
    ?+6; sp ?+2; ?+2 0 _f; . ?; inc sp
    ...
    #return
    ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0
  sqmain:
    #call main
    dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
    ?+6; sp ?+2; ?+2 0 _main; . ?; inc sp
    0 0 (-1)
  . inc:-1 Z:0 dec:1 sp:-sp

The cell stack pointer sp is the last memory cell in the program.
It is initialised with the negative value of its own address.
A negative value is used here to speed up the code execution: working with the subtraction operation may sometimes save a few steps if the data is stored as the negative of its actual value.
The instruction dec sp subtracts 1 from sp, hence increasing its real value by 1.
Below is an excerpt calling the function f in a more readable form, in which relative references (?) are replaced with labels.

  dec sp
  A; sp A
  B; sp B
  A:0 B:0
  C; sp C
  D C:0 _f
  . D:?
  inc sp

The instruction on the fourth line is to clear the cell in the stack, since some values can be left there from the previous use.
However, to clear the top cell in the stack is not a single-step task because one has to clear the operands of the instruction itself and then initialise it with the value of sp pointer.
Thus the execution command sequence is the following:
allocate a new cell in the stack by increasing stack pointer (first line);
clear the first operand of the instruction;
initialise this operand with the new value of the stack pointer (second line);
do the same with the second operand of the instruction, i.e. clear and initialise it (third line); and then execute the instruction, which will clear the allocated cell in the stack (fourth line).

The next two instructions clear and initialise cell C similarly.
The instruction D C:0 _f copies the address of the instruction inc sp to the stack and jumps to _f.
This works because D holds the address of the next memory cell (recall that ? evaluates to the address of the next cell), and C points to the now cleared top cell on the stack.


A negative value written to the stack then forces a jump to the label _f.

Once inside the function f, the stack pointer can be modified, but we assume that functions restore it before they exit.
So the return code has to jump to the address extracted from the stack:

  A; sp A
  B; A:0 B
  Z Z B:0

Here the value of the stack pointer sp is written to A, and the instruction A:0 B copies the stored address to B.
The address is stored negated, so the subtraction restores the positive value.

The stack does a little bit more than just storing return addresses.
That will be explored later in subsections 4.3 and 4.4.

4.2 Expressions

C programs consist of statements, which in turn consist of keyword statements and expressions.
The syntax of keyword statements and expressions is best represented in Backus-Naur Form (BNF), the standard way of describing context-free grammars.
A classic example is the grammar of arithmetic expressions:

  expression:=
    term
    expression + term
    expression - term

  term:=
    primary
    term * primary
    term / primary

  primary:=
    identifier
    constant
    ( expression )

These mutually recursive definitions can be used by a program called a parser to build a tree representation of any grammatically valid expression.
Once such a tree is built, the compiler’s job is to organise the sequence of instructions so that the result of any sub-tree is passed up the tree.
For example, a tree of the expression:

  a + ( b - c )

Figure: parse tree of the expression a + ( b - c )

consists of a node ‘+’, variable a and a sub-tree, which consists of a node ‘-’, and variables b and c.
To make the calculation, the compiler must use a temporary variable to store the result of the sub-tree, which is used later in the addition, and potentially further up the tree if this expression is part of a larger expression.
In this particular example we need only one temporary, but generally many temporaries are required.
The expression is compiled into the following code:

  t; b Z; Z t; Z
  c t
  a Z; Z t; Z


The first line copies value b into temporary t.
The second line subtracts value c from the temporary.
At this point the compiler is finished with the sub-tree.
Its result is the generated code and the temporary variable t holding the value of the calculated sub-tree.
Now the compiler generates code for the addition.
Its arguments now are variable a and temporary t.
The third line adds a to t.
Now t holds the result of the whole expression.
If this expression is a part of a larger expression, t is passed up the tree as an argument to the upper level node of the tree.
If not, the value of t is discarded because the evaluation is finished.
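
To make the tree-walking idea concrete, below is a toy recursive-descent translator written for this description. It follows the grammar above, prints three-address style lines such as t1 = b - c, and returns the number of the temporary carrying each sub-tree's result. The names and the textual output are inventions of this sketch; the real compiler emits Subleq instructions and, as discussed in 4.3, reuses temporaries instead of numbering them indefinitely.

  #include <stdio.h>

  /* Toy recursive-descent translator for the expression grammar above.
     Identifiers are single letters; each sub-tree's result is placed in a
     freshly numbered temporary and the temporary's number is passed up.  */
  static const char *p;        /* cursor into the input string            */
  static int ntemp = 0;        /* counter used to name temporaries        */

  static int expression(void);

  static int primary(void)
  {
      if (*p == '(') {                     /* primary := ( expression )   */
          p++;
          int t = expression();
          if (*p == ')') p++;
          return t;
      }
      int t = ++ntemp;                     /* primary := identifier       */
      printf("t%d = %c\n", t, *p++);
      return t;
  }

  static int term(void)
  {
      int left = primary();
      while (*p == '*' || *p == '/') {     /* term := term * primary etc. */
          char op = *p++;
          int right = primary();
          int t = ++ntemp;
          printf("t%d = t%d %c t%d\n", t, left, op, right);
          left = t;
      }
      return left;
  }

  static int expression(void)
  {
      int left = term();
      while (*p == '+' || *p == '-') {     /* expression := expr + term   */
          char op = *p++;
          int right = term();
          int t = ++ntemp;
          printf("t%d = t%d %c t%d\n", t, left, op, right);
          left = t;
      }
      return left;
  }

  int main(void)
  {
      p = "a+(b-c)";                       /* the example from the text   */
      printf("# result in t%d\n", expression());
      return 0;
  }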

A more advanced grammar may involve assignment, dereferencing, unary operations and many other constructions.
But each grammar construction can be represented by a corresponding sub-tree, which is later processed by the compiler to produce code.
For example, a subtraction from a dereferenced value represented in C as:

  *k -= a

has to be translated into

  t; k Z; Z t; Z
  a t:0

Here a temporary variable has to be used inside the code for dereferencing.
The sequence of instructions is: clear t; copy value k into t; subtract a from the memory k points to.

Only a few elements of the grammar processing have been touched on here.
The C grammar takes several pages just to list its BNF.
However, larger and more complex grammars are resolved by the compiler in a similar manner.

4.3 Function calls

In subsection 4.1 above it was shown how to push values onto the stack and pop them from it.
When a function takes arguments, they have to be pushed onto the stack together with the return address.
The stack must be restored upon function return.
Consider a function taking two arguments:

  int f(int a, int b);
    ...
    f(a,b);

The call to a function f must be translated into something like this

  # 1 push b
  # 2 push a
  # 3 push return_address
  # 4 goto f # return_address:
  # 5 sp -= 3

In C, arguments can themselves be expressions, and a function call can be part of another expression (a sub-expression), i.e.
the compiler must properly handle more complicated cases like the following:


  int f(int a, int b)
  {
    ...
    return f;
  }
    ...
    int k;
    k=f;
    k(f(1,2),3); // call via variable - indirect call
    k = f(1,2)(3,4); // call by return value

Here, for simplicity, the C function type int(*)(int,int) is represented as int.
Subleq supports only one variable type.
Therefore, a more elaborate typing system would not introduce extra functionality into the language.

Arguments pushed onto the stack can properly be calculated as sub-expressions (sub-trees).
In this sense, for the actual function call it is irrelevant whether program variables or temporaries are pushed onto the stack.

  # 1 push B
    # clearing the next cell in the stack [remember that sp is negative]
    # the line below is same as in C syntax: *(++sp)=0;
    dec sp; t1; sp t1; t2; sp t2; t1:0 t2:0
    # same as in C syntax: *sp+=B;
    t3; sp t3; b Z; Z t3:0; Z

  # 2 push A
    # the same with A
    dec sp; t4; sp t4; t5; sp t5; t4:0 t5:0
    t6; sp t6; a Z; Z t6:0; Z

  # 3 push return_address
    dec sp; t7; sp t7; t8; sp t8; t7:0 t8:0
    t9; sp t9; t10 t9:0 goto_address
    . t10: return_address

  # 4 goto f goto_address: Z Z f

  # 5 sp -= 3
    return_address: const(-3) sp

The notation const(-3) sp is short for

  unique_name sp
  ...
  unique_name:-3

The code above handles neither the return value nor indirect calls yet.
The return value can be stored in a special variable (register).
If the program uses the return value in a sub-expression, then it must copy the value into a temporary immediately upon return.
Indirect calls can be achieved by dereferencing a temporary holding the address of the function.
This is straightforward, but requires more complex code.

The stack pointer can be modified inside a function when the function requires stack (local) variables.
For accessing local variables, a base pointer bp is usually used.
It is initialised on function entry; it is used as a base reference for local variables, each of which has an associated offset from the base pointer; and it is used to restore the stack pointer at the end of the function.
Functions can call other functions, which means that each function must save the base pointer upon entry and restore it upon exit.
So the function body has to be wrapped with the following commands:


  1. # push bp
  2. # sp -> bp
  3. # sp -= stack_size
  4. # ... function body
  5. # bp -> sp
  6. # pop bp
  7. # return

Or, in Subleq code:

  dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
  ?+6; sp ?+2; bp 0
  bp; sp bp
  stack_size sp

  # ... function body

  sp; bp sp
  ?+8; sp ?+4; bp; 0 bp; inc sp
  ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0

stack_size is a constant, which is calculated for every function during parsing.
It turns out that it is not enough to save bp.
A function call can happen inside an expression.
In such a case, all temporaries of the expression have to be saved, because the called function will be using the same temporary memory cells for its own needs.
For the expression f()+g(), the results of the calls may be stored in temporaries t1 and t2.
If function g changes t1, where the result of function f is stored, a problem would arise, as the sketch below illustrates.
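
The hazard can be illustrated in plain C by modelling the compiler's temporaries as shared globals; t1, f and g below are names invented for this illustration only.

  #include <stdio.h>

  /* Both functions park their result in the shared temporary t1.  A code
     generator that relied on t1 still holding f()'s result after the call
     to g() would read a stale value: the sum prints 40 instead of 30.    */
  int t1;

  int f(void) { t1 = 10; return t1; }
  int g(void) { t1 = 20; return t1; }    /* overwrites whatever f() left  */

  int main(void)
  {
      f();                    /* naive scheme: result of f() is "in t1"   */
      int t2 = g();           /* g() reuses and clobbers t1               */
      printf("f()+g() should be 30, naive reuse gives %d\n", t1 + t2);
      return 0;
  }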

A solution is to make every function push all temporaries it is using onto the stack and to restore them upon exit.
Consider the following function:

  int g()
  {
    return k+1;
  }

It translates into:

  _g:
    # save bp
    dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
    ?+6; sp ?+2; bp 0
    bp; sp bp

    # push t1
    dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
    ?+6; sp ?+2; t1 0
    # push t2
    dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
    ?+6; sp ?+2; t2 0

    # calculate addition
    t1; t2
    _k t1
    dec t1
    t1 t2
    # set the return value [negative]
    ax; t2 ax

    # pop t2
    ?+8; sp ?+4; t2; 0 t2; inc sp
    # pop t1
    ?+8; sp ?+4; t1; 0 t1; inc sp

    # restore bp
    sp; bp sp
    ?+8; sp ?+4; bp; 0 bp; inc sp
    # exit
    ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0

If somewhere inside the code there are calls to other functions, the temporaries t1 and t2 hold their calculated values because other functions save and restore them when executed.
Since all temporaries used in a function are pushed onto the stack, it pays off to reduce the number of temporaries.
This can be done simply by releasing a temporary that is no longer needed into a pool of free temporaries.
When a new temporary is later requested, the pool is checked first, and a fresh temporary is allocated only when the pool is empty (a minimal sketch of such a pool follows the examples below).
The expression

  1+k[1]

compiles into

  t1; t2; _k t1; dec t1; t1 t2
  t3; t4; ?+11; t2 Z; Z ?+4; Z; 0 t3; t3 t4;
  t5; t6; dec t5; t4 t5; t5 t6
  # result in t6

When a pool of temporaries is introduced, the number of temporaries is halved:

  t1; t2; _k t1; dec t1; t1 t2
  t1; t3; ?+11; t2 Z; Z ?+4; Z; 0 t1; t1 t3
  t1; t2; dec t1; t3 t1; t1 t2
  # result in t2

which dramatically reduces the code by removing the corresponding push and pop operations.
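
Such a pool takes only a few lines of code. The sketch below is an illustration of the idea, not the compiler's actual interface; acquireTemp, releaseTemp and the fixed pool size are made-up names and limits.

  #include <stdio.h>

  /* Minimal temporary pool: released temporaries are kept on a small stack
     of free numbers and are handed out again before any new one is made.  */
  #define MAX_TEMPS 64

  static int freePool[MAX_TEMPS];   /* numbers of released temporaries     */
  static int nFree  = 0;            /* how many entries the pool holds     */
  static int nTemps = 0;            /* total temporaries ever created      */

  int acquireTemp(void)
  {
      if (nFree > 0)
          return freePool[--nFree]; /* reuse a released temporary          */
      return ++nTemps;              /* otherwise allocate a fresh one      */
  }

  void releaseTemp(int t)
  {
      if (nFree < MAX_TEMPS)
          freePool[nFree++] = t;
  }

  int main(void)
  {
      /* mimic the 1+k[1] example: a sub-result's temporary is released as
         soon as its value has been consumed, and is then reused           */
      int t1 = acquireTemp(), t2 = acquireTemp();
      releaseTemp(t1);
      int t3 = acquireTemp();       /* gets t1's number back               */
      printf("t1=%d t2=%d t3=%d, total created=%d\n", t1, t2, t3, nTemps);
      return 0;
  }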

4.4 Stack variables

Once bp is placed on the stack and sp is decremented to allocate memory, all local variables become available.
They can be accessed only indirectly because the compiler does not know their addresses.
For example, the function f in

  int f(int x, int y)
  {
    int a, b=3, c[3], d=5;
    ...
  }
  f(7,9);

has 4 local variables with the stack size equal to 6.
When this function is entered the stack has the following values:

... y[9] x[7] [return_address] [saved_bp] a[?] b[3] c0[?] c1[?] c2[?] d[5] ...
                                   ^                                   ^
                                 (bp)                                (sp)

The compiler knows about the offset of each variable from bp.


  Variable Offset
  y   -3
  x   -2
  a   1
  b   2
  c   3
  d   6

Hence, in the code, any reference to a local variable that is not an array can be replaced with *(bp+offset).
The array c has to be replaced with (bp+offset), because the name of an array is the address of its first element.
The name itself does not refer to a variable, but referencing it with [] does.
In C

  c[i]

is the same as

  *(c+i)

which can be interpreted in our example as

  *((bp+3)+i)
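
The addressing scheme can be illustrated with a toy model in C, where an int array stands in for the Subleq memory and bp is an index into it. The array name, the value of bp and the offsets below simply mirror the table above; they are not the compiler's real data structures.

  #include <stdio.h>

  /* Toy model of the frame layout: offsets match the table (y=-3, x=-2,
     a=1, b=2, c=3..5, d=6) and every local is reached as *(bp+offset).  */
  int main(void)
  {
      int stack[32] = {0};
      int bp = 10;                 /* arbitrary position of the saved bp  */

      stack[bp - 3] = 9;           /* y, the second argument              */
      stack[bp - 2] = 7;           /* x, the first argument               */
      stack[bp + 2] = 3;           /* b = 3                               */
      stack[bp + 6] = 5;           /* d = 5                               */

      int i = 1;                   /* c[i] is *(c+i), i.e. *((bp+3)+i)    */
      stack[(bp + 3) + i] = 42;

      printf("x=%d y=%d b=%d c[1]=%d d=%d\n",
             stack[bp - 2], stack[bp - 3], stack[bp + 2],
             stack[(bp + 3) + 1], stack[bp + 6]);
      return 0;
  }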

4.5 Multiplication

The only trivial multiplication in Subleq is multiplication by 2,

  t=a+a: t; a Z; a Z; Z t; Z

To multiply 2 numbers one can use the formula

  A*B = (2A)*(B/2) + A*(B%2)

This is a simple recursive formula, but it requires integer and modular division.
Division can be implemented with the following algorithm.
Given two numbers A and B, B is repeatedly doubled until it becomes greater than A.
At the same time as doubling B, we double another variable I, which has been initialised to 1.
When B becomes greater than A, the value of I from before the last doubling holds part of the quotient; the rest is calculated further from A minus the value of B before the last doubling, together with the original B.
This can be done recursively, accumulating all the I's.
At the last step, when A<B, A is the modulus.
This algorithm can be implemented as a short recursive function in C.
Upon exit, this function returns the integer quotient as its result and the remainder (modulus) in the argument j.

  int divMod(int a, int b, int *j)
  {

    if( a < b ) { *j=a; return 0; }

    int b1=b, i=1, bp, ip;

  next:
    bp = b1; ip = i;
    b1 *= 2; i *= 2;
    if( b1 > a )
      return ip+divMod(a-bp,b,j);
    goto next;
  }


This function is not optimal; a more efficient version can be obtained by replacing the recursion with an outer loop.
Multiplication, integer division, and modular division require quite elaborate calculations, so these operations can be implemented as library functions.
That is, each multiplication a*b can be replaced with a call _mul(a,b), and the compiler may later add (if necessary) the implementation of that function.
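
As an illustration of the formula, the divMod function above can be combined with a short recursive multiplication routine. The sketch below repeats divMod so that it is self-contained and assumes non-negative operands; an iterative multiplication of the same kind appears in Appendix 7.2.

  #include <stdio.h>

  /* Integer division with remainder, as described above: returns a/b and
     stores a%b in *j.  Copied from the text so the sketch is standalone. */
  int divMod(int a, int b, int *j)
  {
      if (a < b) { *j = a; return 0; }

      int b1 = b, i = 1, bp, ip;

  next:
      bp = b1; ip = i;
      b1 *= 2; i *= 2;
      if (b1 > a)
          return ip + divMod(a - bp, b, j);
      goto next;
  }

  /* A*B = (2A)*(B/2) + A*(B%2), applied recursively until B reaches 0.   */
  int mult(int a, int b)
  {
      if (b == 0) return 0;
      int rem;
      int half = divMod(b, 2, &rem);
      return mult(a + a, half) + (rem ? a : 0);
  }

  int main(void)
  {
      printf("%d\n", mult(123, 45));       /* prints 5535                 */
      return 0;
  }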

4.6 Conditional jump

In C, Boolean expressions that evaluate to zero are false and those that evaluate to non-zero are true.
In Subleq this leads to longer code when handling Boolean expressions, because every Boolean expression is evaluated on the basis of equality or non-equality to zero.

A better approach is to treat a value less than or equal to zero as false and a positive value as true.
Then the if-statement if(expr){<body>} becomes just one instruction:

  Z t next
  <body>
  next: ...

where t is the result of the expression expr.
However, to remain fully compatible with C (for example, if(x+1){...} involves an implicit conversion to Boolean), all cases where an integer expression is used as a Boolean have to be detected.
Fortunately there are only a few such cases:

  if(expr)
  while(expr)
  for(...,expr,...)
  ! expr
  expr1 && expr2
  expr1 || expr2

This job can be done inside the parser, so the compiler does not have to distinguish between Boolean and integer expressions and can produce much simpler code.

In cases where a Boolean value is used in an expression as an integer, as in:

  passing an argument f(a>0)
  returning from a function return(a>0);
  assignment x=(a>0);
  other arithmetic expression x=1+5*(a>0);

the value must be converted to C style, i.e. a negative result must be zeroed.
This can be done as simply as

  x Z ?+3; x; Z


Figure 3 Diagram representing conditional jumps (the instruction sequence Z x ?+3; Z Z; x Z; Z with branches for x > 0, x == 0, and x < 0).



A terse check for a value being less than, equal to, or greater than zero is:

  Z x ?+3; Z Z G; x Z E; Z; L:

where L, E, and G are the addresses to which execution passes when x is less than, equal to, or greater than zero, respectively.
Figure 3 shows the schema of the execution.
Note that x does not change and Z is zero on every exit.
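
For reference, the intended dispatch can be summarised by a trivial C model (threeWay is a name invented here); the Subleq sequence achieves the same three-way branch while leaving x intact and Z at zero.

  #include <stdio.h>

  /* Returns 'L', 'E' or 'G' for x < 0, x == 0 and x > 0 respectively,
     mirroring the jump targets L, E and G of the Subleq sequence above. */
  char threeWay(int x)
  {
      if (x > 0)  return 'G';
      if (x == 0) return 'E';
      return 'L';
  }

  int main(void)
  {
      printf("%c %c %c\n", threeWay(-5), threeWay(0), threeWay(7));  /* L E G */
      return 0;
  }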

5. Results



Figure 4 FPGA board, 28 Subleq processors with allocated 2 Kb per processor


Figure 4 shows our FPGA board, powered via a USB cable.
Sized about 5 x 7 centimetres, the board implements 28 Subleq processors, each with 2 Kb of allocated memory, running at a clock frequency of 150 MHz.

To test the efficiency of the board we chose two mathematical problems.
The first calculates the size of a function residue of an arithmetic group.
The second calculates modular double factorials.

5.1 Test #1

In the first test we selected a problem of finding the order of the function residue of the following process:

  x[i+1] = 2*x[i] mod M
  y[i+1] = 2*(x[i] + y[i]) mod M

where x and y are integers initialised to 1, mod is the modulo operation, and M is some chosen value.
Starting from the point (x0=1, y0=1) the equations generate a sequence of pairs.
We chose this problem because its solution is difficult, with answers often much greater than M (but less than M^2).
The number M was selected such that the calculations could be completed in a few minutes.
When this sequence is sufficiently long, a newly generated pair of numbers will eventually be the same as a pair generated earlier in the sequence.
The task is to find how many steps have to be completed before the first such repetition occurs.
In our test the selected value of M was M=5039 and the number of iterations was calculated as 12693241.


A C program to solve this problem can be written without the use of multiplication or division:

  int x=1, y=1, m=5039;
  int c=0, ctr=1, t;
  int x0=0, y0=0;

  int printf();
  int main()
  {

    while(1)
    {
      y += x; y += y; x += x;
      while( x>=m ) x-=m;
      while( y>=m ) y-=m;

      if( x==x0 && y==y0 ) break;

      if( ++c==ctr )
      {
        x0=x; y0=y;
        c=0; ctr+=ctr;
      }
    }
    printf("point: %d %d loop: %d of %d\n",x0,y0,c+1,ctr);
  }

This program has been tested in the following cases:

  1. Compiled with our Subleq compiler and run on one of the processors on FPGA board;
  2. Compiled with our Subleq compiler and emulated on PC#1 (Intel Q9650 at 3GHz)
  3. Compiled with Microsoft C/C++ compiler (v16) with full optimisation and run on PC#1.
  4. Same as 2, but run on PC#2 (Pentium 4 at 1.7GHz)
  5. Same as 3 run on PC#2

The table below shows execution time in seconds for each test.

  1 Subleq on 1 processor FPGA 94.0
  2 Subleq on PC#1 46.0
  3 C on PC#1 0.37
  4 Subleq on PC#2 216
  5 C on PC#2 0.54

From these results we conclude that the speed of a single processor on the FPGA is of the same order of magnitude as the speed of the CPU of an ordinary PC emulating Subleq instructions.
Native C code on a PC runs about a hundred times faster.

5.2 Test #2

The second test was the calculation of modular double factorials, namely

  (N!)! mod M = prod[n=1..N] prod[i=1..n] i mod M


In this test case we were able to use the full power of our multi-processor Subleq system, because the multiplications in the above equation could be calculated in parallel across all 28 processors.
For N=5029 and M=5039 the result is 95, and these numbers were used in the test.
The number M was the same as in Test #1, and N was selected to give a result (95) in the ASCII printable range.
The calculations were run in the following configurations:

  1. Hand-written Subleq code run on the FPGA board [Appendix 7.3]
  2. Subleq code emulated on a PC (same as PC#1 in the first test)
  3. Equivalent C code compiled with the same C compiler and run on the PC [Appendix 7.1]
  4. Same C code compiled with the Subleq compiler and emulated on the PC
  5. Equivalent C code without the multiplication operation, compiled with the C compiler and run on the PC [Appendix 7.2]
  6. Same C code as in 5, compiled with the Subleq compiler and emulated on the PC

The code we used was not 100% efficient, since the solution to the problem needs only ~O(N log N) operations if modular exponentiation is utilised, rather than the ~O(N^2) of the code presented in the Appendix.
However this is not important when evaluating relative performance.

The results are presented in the table below.
The values are execution time in seconds.

  1 Subleq on FPGA, parallel on 28 processors 62.0
  2 Subleq on PC (emulation) 865
  3 C with multiplication, executable run on PC 0.15
  4 C with multiplication, Subleq emulated on PC 12060
  5 C without multiplication, executable run on PC 7.8
  6 C without multiplication, Subleq emulated on PC 9795

The 28 FPGA processors easily outperform the emulation of the same Subleq code on a PC.
C code without multiplication compiled into Subleq and emulated runs faster than C code with multiplication, because the compiler’s library multiplication function is not as efficient as the multiplication function written in this example.

6. Conclusion

Using an inexpensive Cyclone III FPGA we have successfully built an OISC multi-processor device with processors running in parallel.
Each processor has its own memory limited to 2 Kb.
Due to this limitation we were unable to build a multi-processor board with an even simpler individual processor instruction set, such as
bit copying [2], because in that case running practically useful computational tasks requires a minimum of ~1 Mb of memory per processor.
The limited memory available in our device also did not permit us to run more advanced programs, such as emulators of another processor, or to use more complex computational algorithms,
because all the computational code has to fit inside the memory allocated to each processor.



The size of the memory available to each processor can be increased by choosing a larger and faster, albeit more expensive, FPGA such as a Stratix V.
Then a faster processing clock and a larger number of CPUs could be implemented as well.
The VHDL code of the CPU state machine could also be optimised, improving computational speed.
Given sufficient memory, it would be possible to emulate any other processor architecture, use algorithms written for other CPUs, or run an operating system.
Apart from the memory constraint, another downside of this minimalist approach is reduced speed.
Our board uses a rather slow CPU clock speed of 150 MHz.
As mentioned above, more expensive FPGAs can run at much faster clock speeds.

On the other hand, the simplicity of our design allows it to be implemented as a standalone miniature-scale multi-processor computer, thus reducing both physical size and energy consumption.
With proper hardware, it might also be possible to power such devices with low-power solar batteries similar to those used in cheap calculators.
Our implementation is scalable: it is easy to increase the number of processors by connecting additional boards without placing a significant load on the host's power supply.
A host PC does not have to be fast to load the code and read back the results.
Since our implementation is FPGA-based, it is possible to create other types of runtime re-loadable CPUs, customised for specific tasks by reprogramming the FPGA.

In conclusion, we have demonstrated the feasibility of the OISC concept and applied it to building a functional prototype of an OISC multi-processor system.
Our results demonstrate that, with a proper hardware and software implementation, substantial computational power can be achieved even with a very simple OISC multi-processor design.



7. Appendix

This section presents pieces of code calculating modular double factorial.

7.1 C with multiplication

The following C program calculates modular double factorial using built-in multiplication and division operations.

1   int printf();
2   int main()
3   {
4     int a=5029;
5     int b=1;
6     int m=5039;
7     int x=1;
8     int i,j;
9
10     for( i=a; i>b; i-- )
11     for( j=1; j<=i; j++ )
12     x = (j*x)%m;
13
14     printf("%d",x);
15   }

Lines 10-12 form a double loop multiplying the numbers from b to a modulo m.

7.2 C without multiplication

This C program does the same calculation as the program above in 7.1, but without built-in multiplication and division operations.
Multiplication and division functions are written explicitly.

1   int DivMod(int a, int b, int *m)
2   {
3     int b1, i1, bp, ip;
4     int z = 0;
5
6   start:
7     if( a<b ){ *m=a; return z; }
8

9     b1=b; i1=1;
10
11   next:
12     bp = b1; ip = i1;
13     b1 += b1; i1 += i1;
14
15     if( b1 > a )
16     {
17       a = a-bp;
18       z += ip;
19       goto start;
20     }
21
22     if( b1 < 0 ) return z;
23
24     goto next;
25   }
26
27   int Mult(int a, int b)
28   {
29     int dmm, r=0;
30
31     while(1)
32     {
33       if( !a ) return r;
34       a=DivMod(a,2,&dmm);
35       if( dmm ) r += b;
36       b += b;
37     }
38   }
39
40   int printf();

41
42  int a=5029, b=1, m=5039;
43   int k=0, x=1, t;
44
45   int main()
46   {
47   start: k=a;
48   loop: t=Mult(k,x);
49     DivMod(t,m,&x);
50
51     if( --k ) goto loop;
52     if( --a > b ) goto start;
53
54     printf("%d",x);
55   }

Lines 1-25 implement the division algorithm described in 4.5, but optimised by removing the recursive call.
The multiplication (lines 27-38) is a straightforward implementation of the formula shown in 4.5.



C loops are replaced with goto statements to make the process flow similar to the Subleq implementation in the next subsection, 7.3.

7.3 Subleq code

The Subleq code calculating modular double factorials was written manually, because the Subleq code compiled from C did not fit into the memory.
The code below has 83 instructions, which fits even into 1 Kb with 32-bit words.

1   0 0 Start
2
3   . A:5029 B:1 MOD:5039
4   . Z:0 K:0 X:1
5
6   Start:
7   A Z; Z K; Z
8
9   Loop:
10   mu_a; K mu_a
11   mu_b; X mu_b
12
13
14   Mult:
15   mu_r
16
17   mu_begin:
18   t2; mu_a t2 mu_return:N2
19
20   dm_a; mu_a dm_a
21   dm_b; C2 dm_b
22   dm_return; N3L dm_return
23   t2 t2 DivMod
24
25   N3:
26   dm_m t2 ?+3
27
28   mu_b mu_r
29
30   mu_a; dm_z mu_a
31   mu_b Z; Z mu_b; Z Z mu_begin
32

33   . mu_a:0 mu_b:0 mu_r:0
34
35   #Mult
36
37
38   N2:
39   dm_a; mu_r Z; Z dm_a; Z
40   dm_b; MOD dm_b
41
42   dm_return; N1L dm_return
43   Z Z DivMod
44
45   N1:
46   X; dm_m X
47
48   C1 K ?+3
49   Z Z Loop
50
51   C1 A
52   B A END
53   B Z; Z A; Z
54   K K Start
55
56   END:
57   X (-1)
58   Z Z (-1)
59
60   DivMod:
61
62   dm_z
63   dm_m
64

