(強いAI)技術的特異点/(世界加速) 23at FUTURE
(強いAI)技術的特異点/(世界加速) 23 - 暇つぶし2ch447:YAMAGUTIseisei
18/08/06 00:54:39.80 FnAR0u04o BE:137071687-2BP(3)
INTRODUCTION

Design productivity is still a challenge for reconfigurable computing.
It is expensive to port workloads into gates and to endure 10^2 to 10^4 second bitstream rebuild design iterations.
Soft processor array overlays can help mitigate these costs.
The costly initial port becomes a simple cross-compile targeting the soft processors, and most design turns are quick recompiles.
Application bottlenecks can then be offloaded to custom hardware exposed as new instructions, function units, autonomous accelerators, memories, or interconnects.
The advent of heterogeneous FPGAs with hard ARM cores does not diminish the complementary utility of soft cores.
As FPGAs double in capacity, the number of potential soft processors per FPGA also doubles.
A mid-range FPGA can now host many hundreds of soft processors and their memory interconnection network, and such massively parallel processor and accelerator arrays (MPPAAs) can sustain hundreds of memory accesses and branches per cycle --
throughput that a few hard processors cannot match.
The microarchitectures of general purpose soft processors have changed little in two decades.
Philip Freidin's 16-bit RISC4005 (1991) was an in-order pipelined scalar RISC, as were j32, xr16, NIOS, and MicroBlaze [1]–[4], and as are their latest versions.
Over the years soft processors have gained caches, branch predictors, and other structures for boosting instruction level parallelism, but the basic scalar RISC microarchitecture still dominates.
This reflects a good fit between this simple microarchitecture and the FPGA primitive elements required to implement it -- particularly LUTs and one write/cycle LUT RAMs.
Unfortunately when such architectures take a cache miss, execution stops dead.

448:YAMAGUTIseisei
18/08/06 00:56:08.62 FnAR0u04o BE:68535874-2BP(3)
Design studies targeting higher instruction level parallelism (ILP) microarchitectures typically implement VLIW [5], [6] or vector [7], [8] architectures instead of out-of-order (OoO) [9]–[11] soft processor cores.
The problem with superscalar OoO microarchitectures is the complexity of the machinery needed to rename registers, schedule instructions in dataflow order, clean up after misspeculation, and retire results in-order for precise exceptions.
This in turn requires expensive circuits such as deep many-ported register files, many-ported CAMs for dataflow instruction scheduling wakeup, and many wide bus multiplexers and bypass networks, all of which are area intensive in FPGAs.
For example, multi-read, multi-write RAMs require a mix of replication, multi-cycle operation, clock doubling, bank interleaving, live-value-tables, and other expensive techniques.
The present work is a new approach to build high ILP OoO superscalar soft processors without most of the complexity and overhead.
Our insight is to implement an Explicit Data Graph Execution (EDGE) [12], [13] instruction set architecture designed for area and energy efficient high ILP execution.



Together the EDGE architecture and its compiler finesse away much of the register renaming, CAMs, and complexity, enabling an out-of-order processor for only a few hundred LUTs more than an in-order scalar RISC.
This paper explores how an EDGE ISA and FPGA optimized EDGE microarchitecture compare to in-order RISCs common on FPGAs today.
The key challenge, and the main contribution of the paper, is how to build a small, fast dataflow instruction scheduler in an FPGA.
We develop and contrast two alternative FPGA implementations on our way to developing a minimal-area EDGE soft processor.

449:YAMAGUTIseisei
18/08/06 00:57:07.53 FnAR0u04o BE:137071687-2BP(3)
z = x + y;
if (z <= 5) {

x=R0, y=R7
HEADER
I[0] READ R0 T[2R]
I[1] READ R7 T[2L]
I[2] ADD T[3L]
I[3] TLEI #5 B[1P]
I[4] BRO.T B1
I[5] BRO.F B1



x += 1;
y -= 1;
x /= y;

HEADER
I[0] READ R0 T[2L]
I[1] READ R7 T[3L]
I[2] ADD #1 T[4L]
I[3] SUB #1 T[4R]
I[4] DIV W[R0]
I[5] BRO

}

450:YAMAGUTIseisei
18/08/06 00:58:15.18 FnAR0u04o BE:117490368-2BP(3)
[Fig. 1 diagram: the instruction window and operand buffers holding the READ R0, READ R7, ADD, TLEI #5, BRO.T, and BRO.F entries with their target and broadcast-predicate fields.]

Fig. 1:
Pseudo code and corresponding instruction block.

451:YAMAGUTIseisei
18/08/06 01:00:51.33 FnAR0u04o BE:137071687-2BP(3)
7b      2b  2b   3b   9b       9b
OPCODE  PR  BID  XOP  TARGET1  TARGET0

PR  = PREDICATE
BID = BROADCAST ID
XOP = EXTENDED OPCODE

Fig. 2:
General instruction format

452:YAMAGUTIseisei
18/08/06 01:06:59.49 FnAR0u04o BE:73431465-2BP(3)
II.
EDGE OVERVIEW

EDGE architectures [12], [14]–[16] execute instructions organized within instruction blocks that are fetched, executed, and committed atomically.
Instructions inside blocks execute in dataflow order, which removes the need for expensive register renaming and provides power efficient out-of-order execution.
The compiler explicitly encodes the data dependencies through the instruction set architecture, freeing the microarchitecture from rediscovering these dependencies at runtime.
Using predication, all intra-block branches are converted to dataflow instructions, and all dependencies other than memory are direct data dependencies.
This target form encoding allows instructions within a block to communicate their operands directly via operand buffers, reducing the number of accesses to a power hungry multi-ported physical register file.
Between blocks, instructions communicate using memory and registers.
By utilizing a hybrid dataflow execution model, EDGE architectures still support imperative programming languages and sequential memory semantics, but reap the benefits of out-of-order execution with near in-order power efficiency and complexity.
Figure 1 shows an example of two EDGE instruction blocks and how instructions explicitly encode their targets.
In this example each block corresponds to a basic block.
The first two READ instructions target the left (T[2L]) and right (T[2R]) operands of the ADD instruction.
A READ is the only instruction that reads from the global register file (however, any instruction may target, i.e. write to, the global register file).
When the ADD receives the result of both register reads it will become ready and execute.

453:YAMAGUTIseisei
18/08/06 01:07:39.23 FnAR0u04o BE:51401873-2BP(3)
Figure 2 shows the general instruction format.
Each EDGE instruction is 32 bits and supports encoding up to two target instructions.
For instructions with more consumers than target fields, the compiler can build a fanout tree using move instructions or it can assign high fanout instructions to broadcasts [15].
Broadcasts support sending an operand over a lightweight network to any number of consumer instructions in a block.
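As a concrete illustration, the sketch below unpacks the Figure 2 fields from a 32-bit instruction word. The module name and the exact bit positions (OPCODE in the most significant bits, TARGET0 in the least) are assumptions for illustration only; Figure 2 specifies field widths and order but not their placement.

module edge_insn_fields(
  input  wire [31:0] insn,
  output wire [6:0]  opcode,
  output wire [1:0]  pr,        // predicate: none / on-true / on-false
  output wire [1:0]  bid,       // broadcast channel ID (00 = none)
  output wire [2:0]  xop,       // extended opcode
  output wire [8:0]  target1,   // explicit target 1: (input slot, IID)
  output wire [8:0]  target0);  // explicit target 0
  assign opcode  = insn[31:25];
  assign pr      = insn[24:23];
  assign bid     = insn[22:21];
  assign xop     = insn[20:18];
  assign target1 = insn[17:9];
  assign target0 = insn[8:0];
endmodule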
In Figure 1, when the TLEI (test-less-than-or-equal-immediate) instruction receives its single input operand from the ADD it will become ready and execute.
The test then produces a predicate operand that is broadcast on channel one (B[1P]) to all instructions listening on the broadcast channel, which in this example are the two predicated branch instructions (BRO.T and BRO.F).
The branch that receives a matching predicate will fire.

A spectrum of EDGE implementations is possible with various area and performance tradeoffs.
Prior EDGE research studied very wide issue implementations [12], [13], as well as fusion of multiple cores [14]–[16] to boost performance on scalar workloads.
In this work we focus on MPPAA scenarios utilizing compact EDGE soft processors with competitive performance/area.
Therefore data and pointers are 32 bits; blocks can be up to 32 instructions long; and the microarchitecture decodes 1-2 instructions per clock and issues one.
We further restrict the load-store queue (LSQ) in this study to a simple, non-speculative design and omit branch or memory dependence prediction.

454:YAMAGUTIseisei
18/08/06 01:09:57.45 FnAR0u04o BE:44059436-2BP(3)
III.
EDGE IN AN FPGA

[Fig. 3 diagram, front end and instruction window: IF stage with instruction cache data (nK x 32, 2 ports, block RAM); DC stage decoder(s); IS stage instruction window with the 32-entry instruction scheduler (T1, T0, IID events), decoded instructions buffer (32 x n LUT RAM(s)), and operand buffers (32 x 32 LUT RAMs).]

455:YAMAGUTIseisei
18/08/06 01:11:14.79 FnAR0u04o BE:29373326-2BP(3)
[Fig. 3 diagram, back end: EX pipeline registers and execute stage (32x32 operand buffers, multiplier); LS stage with load/store queue, data cache data (nK x 32 block RAM), and LS pipeline registers; 32 x 32 LUT RAM register file (x2).]

Fig. 3:
Two decode, single issue EDGE microarchitecture.

456:YAMAGUTIseisei
18/08/06 01:14:03.00 FnAR0u04o BE:198264299-2BP(3)
A.
Microarchitecture
Figure 3 is an example microarchitecture for a compact EDGE processor.
It has much in common with a conventional in-order scalar RISC: instruction and data caches and a five stage pipeline including instruction fetch (IF), decode (DC), operand fetch, execute (EX), and memory/data cache access (LS).
Unlike an in-order processor, instruction operands are read from operand buffers, not the register file; and the instruction to execute next, in dataflow order, is determined by the IS (issue) pipeline stage.
This employs an instruction window comprising a dataflow instruction scheduler, a decoded instructions buffer, and operand buffers.
It uses a simple load-store queue to issue memory instructions in program order.
The front end (IF, DC) runs decoupled from the back end (IS, EX, LS).
It fetches and decodes two instructions per clock into the instruction window.
The instruction window's dataflow scheduler keeps the ready state of each decoded instruction's inputs, i.e. its predication and operands.
When all of its inputs (if any) are ready, the instruction wakes up and is ready to issue.
The lowest numbered ready instruction IID is selected each cycle and its decoded instruction and input operands are read.
Besides the data mux and function unit control signals, this instruction encodes up to two ready events.
The scheduler accepts these and/or events from other sources (muxed into T0 and T1) and updates the ready state of other instructions in the window.
Thus dataflow execution proceeds, starting with the block's ready 0-input instructions, then instructions that these target, and so forth.

457:YAMAGUTIseisei
18/08/06 01:16:14.13 FnAR0u04o BE:154205879-2BP(3)
B.
EDGE dataflow instruction scheduling requirements
The instruction window and scheduler are the linchpin of the core.
Their area, clock period, capabilities, and limitations largely determine the realized performance of an EDGE core and the throughput of an EDGE multiprocessor.

The instruction scheduler has diverse functionality and requirements.
It is highly concurrent.
Each cycle the decoder(s) write instructions' decoded ready state and decoded instructions into the window.
Each cycle the scheduler selects the next instruction to issue, and in response the back end sends ready events --
either target ready events targeting a specific instruction's input slot (predicate, operand #0, operand #1) or broadcast ready events targeting all instructions waiting on a broadcast ID.
These set per-instruction active ready state bits which together with the decoded ready state may signal that the instruction is ready to issue.
Note the scheduler sometimes accepts events for target instructions which have not yet been decoded and must also inhibit reissue of issued ready instructions.
EDGE instructions may be non-predicated, or predicated true or false.
A predicated instruction does not become ready until it is targeted by another instruction's predicate result, and that result matches the predicate condition.
If the predicate doesn't match, the instruction never issues.


458:YAMAGUTIseisei
18/08/06 01:19:27.57 FnAR0u04o BE:24477252-2BP(3)
On branch to a new block all instruction window ready state is flash cleared (block reset).
However when a block branches back to itself (block refresh) only active ready state is cleared; the decoded ready state is preserved so that it is not necessary to re-fetch and decode the block's instructions.
Refresh is key to saving time and energy in loops.
Since some software critical paths consist of a single chain of dependent instructions, i.e.
A targets B targets C, it is important that the dataflow scheduler add no pipeline bubbles for successive back-to-back instruction wakeup.
Therefore the IS-stage ready-issue-target-ready pipeline recurrence should complete in one cycle – assuming this does not severely impact clock frequency.
Instructions such as ADD have a latency of one cycle.
With EX-stage result forwarding the scheduler can wake their target instructions in the IS-stage, even before the instruction completes.
Other instruction results may await ALU comparisons, take multiple cycles, or have unknown latency.
These must wait until later to wake their targets.
Finally, the scheduler design should be scalable across a spectrum of anticipated EDGE implementations – each cycle accepting at least 1-4 decoded instructions and 2-4 target ready events, and issuing 1-2 instructions per cycle.
We consider two alternative dataflow instruction scheduler designs:
a brute-force parallel scheduler, where instructions' ready state is explicitly represented in FPGA D-flip-flops (FFs), in which the ready status of every instruction is reevaluated each cycle;
and a more compact incremental scheduler which keeps ready state in LUT RAM and which updates ready status of only 2-4 target instructions per cycle.

459:YAMAGUTIseisei
18/08/06 01:20:35.82 FnAR0u04o BE:73431465-2BP(3)
C.
A parallel instruction scheduler

[Fig. 4 diagram: 32 instruction window entries (31..0), each holding decoded ready state DBID, DRT, DRF, DR0, DR1 and active ready state RDY, RT, RF, R0, R1, INH; target ready event inputs T1, T0, broadcast ID BID, and enables ENs feed the per-entry NEXT RDYS logic; DEC.RDYS writes and RESET / RESET|REFRESH clears; a 32→(5,1) priority encoder reduces the RDY vector to IID,V.]


Fig. 4:
Block diagram of a parallel dataflow scheduler, with entry #2 shown in more detail.

460:YAMAGUTIseisei
18/08/06 01:22:51.58 FnAR0u04o BE:48954454-2BP(3)
Figure 4 depicts a parallel instruction scheduler for the instruction window of Figure 3.
Note the active ready state is set by target ready events T0, T1 and broadcast ID BID (if any), qualified by various input type enables ENs.
For a 32-entry window there are 32 instances of a one-instruction ready circuit.
In any cycle one or more of the 32 RDY signals may be asserted.
A 32-bit priority encoder reduces this to the 5-bit IID of the next instruction to issue.
For each entry there are six bits of decoded ready state, i.e. state that is initialized by the instruction decoder:

• DBID: 2-bit binary broadcast ID, or 00 if none
• DRT, DRF: decoder: predicate true (false) is ready
• DR0, DR1: decoder: operand #0 (operand #1) is ready

Together these bits encode whether the instruction has been decoded, awaits a predicate and/or some operand(s), perhaps via a broadcast channel, or is immediately ready to issue.
These bits are cleared on block reset only.
There are also six bits of active ready state:

• RT, RF: predicate true (false) is ready
• R0, R1: operand #0 (operand #1) is ready
• INH: inhibit instruction – it has already issued
• RDY: instruction is ready to issue


461:YAMAGUTIseisei
18/08/06 01:25:56.54 FnAR0u04o BE:22029833-2BP(3)
An instruction is ready iff (RT & RF & R0 & R1 & ~INH).
Any of RT, RF, R0, R1 may be set when:
its corresponding DRX is set by the decoder, or
an executing instruction targets that input, explicitly, or via a broadcast event (broadcast ID, input).
Active ready state bits are cleared on block reset or refresh.

Decoded ready state Active ready state
Instruction DBID DRT DRF DR0 DR1 RT RF R0 R1 INH RDY
READ 00 1 1 1 1 1 1 1 1 1 0
READ 00 1 1 1 1 1 1 1 1 0 1
ADD 00 1 1 0 0 1 1 1 0 0 0
TLEI 00 1 1 0 1 1 1 0 1 0 0
BRO.T B1 01 0 1 1 1 0 1 1 1 0 0
BRO.F B1 01 1 0 1 1 1 0 1 1 0 0
undecoded 00 0 0 x x 0 0 x x x 0

TABLE I: Example Instruction Scheduler Ready State

462:YAMAGUTIseisei
18/08/06 01:28:42.21 FnAR0u04o BE:19582324-2BP(3)
Table I depicts a block's instruction scheduler state after decoding six instructions and issuing the first.
The first four non-predicated instructions have DRT and DRF set reflecting that they do not await any particular predicate results.
The two READ instructions, unpredicated and with zero input operands, are immediately ready to issue.
The first has issued – and so is now inhibited from reissue – targeting operand 0 of the ADD, whose R0 is now set.
The second READ will issue in the next IS pipeline cycle.
The TLEI (test-less-than-or-equal-immediate) instruction broadcasts its predicate outcome on channel 1; the two branch instructions, predicated true (resp. false), await this predicate result.
The seventh entry has not been decoded: (DRT|DRF)=0.
To reduce the critical path of dataflow scheduling, the front end writes predecoded EDGE instructions into the decoded instructions buffer.
As instruction IID issues, its decoded instruction is read by the back end.
Amongst other things it contains two target operand ready event fields, _T0 and _T1, which designate the 0-2 (IID, input) explicit targets of the instruction, as well as a 4-bit vector of input enables: ENs={RT EN, RF EN, R0 EN, R1 EN}.
Referring back to Figure 3, these signals are muxed with ready events from other pipeline stages into T0 and T1 input by the scheduler.

465:YAMAGUTIseisei
18/08/06 01:41:18.68 FnAR0u04o BE:36716235-2BP(3)
D.
FPGA implementation of the parallel scheduler
Careful attention to FPGA circuit design is required to minimize the area and clock period of the scheduler.
The 32-instruction window requires 32*(6+6)=384 FFs for the ready state, and many LUTs per entry to decode ready events and update each entry's ready state.
A modern FPGA packs a set of LUTs (lookup tables) and D-flip-flops (FFs) together into a logic cluster.
For example, Xilinx 7 series devices group four 6-LUTs and eight FFs into each ``slice'' cluster.
Each LUT has two outputs and may be used as one 6-LUT, or two 5-LUTs with five common inputs.
Each output may be registered in a FF.
The flip-flops have optional CE (clock enable) and SR (set/reset) inputs but these signals are common to all eight FFs in the cluster.
This basic cluster architecture is similar in Altera FPGAs.
From this follows two design considerations.
Fracturable 6-LUT decoders: For target instruction index decoding, so long as indices are ≤5 bits, two decoders may fit into a single 6-LUT.
Slice FF packing and cluster control set restrictions: To minimize area and wire delays, the design packs the ready state FFs densely, 4-8 FFs per cluster.
Every 6-bit decoded ready state entry is written together (common RST and CE) and can pack into one or two slices.
More care is required for the active ready state FFs.
Each of these 32*6=192 FFs may be individually set, but by packing four FFs per slice, when one FF is clock enabled, all are clock enabled.
Whenever a FF is set by a ready event the other FFs in its slice should not change.
This requires implementing CE functionality in each FF's input LUT, feeding back its output into its input: FF_NXT = FF | (EN & input).

466:YAMAGUTIseisei
18/08/06 01:42:14.67 FnAR0u04o BE:132176669-2BP(3)
generate for (i = 0; i < N; i = i + 1) begin: R
  always @* begin
    // target decoders
    T00[i] = T0 == i;
    T01[i] = T0 == (i|N);
    T10[i] = T1 == i;
    T11[i] = T1 == (i|N);
    B[i]   = BID == DBID[i];

    // next active ready state logic
    RT_NXT[i]  = RT[i] | DRT[i]
               | (RT_EN & (T01[i]|T11[i]|B[i]));
    RF_NXT[i]  = RF[i] | DRF[i]
               | (RF_EN & (T00[i]|T10[i]|B[i]));
    R0_NXT[i]  = R0[i] | DR0[i]
               | (R0_EN & (T00[i]|T10[i]|B[i]));
    R1_NXT[i]  = R1[i] | DR1[i]
               | (R1_EN & (T01[i]|T11[i]|B[i]));
    INH_NXT[i] = INH[i] | (INH_EN & (IID == i));
    RDY_NXT[i] = RT_NXT[i] & RF_NXT[i] & R0_NXT[i]
               & R1_NXT[i] & ~INH_NXT[i];
  end
end endgenerate

Listing 1: Parallel scheduler ``next readys'' logic

467:YAMAGUTIseisei
18/08/06 01:46:49.27 FnAR0u04o BE:73431465-2BP(3)
Listing 1 is Verilog that generates the ``next readys'' for an N-entry parallel scheduler.
Although there are four ready event input types (predicate true, false, operand #0, operand #1),
by ensuring that predicate target events never occur in the same cycle as operand target events, a single target index bit suffices to distinguish false/operand #0 targets from true/operand #1 targets.
(Further decoding is provided by the specific RT_EN, RF_EN, R0_EN, R1_EN enables.) Therefore, for an instruction window with N=32 entries, T0 and T1 are six bits {input#:1; IID:5}.
The target decoders T00, T01, T10, T11 (target-0-input-0, etc.) are each one 6-LUT, as is the broadcast select decoder B.
The next active ready state logic folds together the target decoder outputs with current active and decoded ready state.
This requires another seven LUTs (two for INH_NXT), for a total of 32*12 = 384 LUTs.
This may be improved by splitting the 32-entry scheduler into two 16-entry banks of even and odd instructions.
Within a bank a 4-bit bank-IID suffices.
Then T0, T1 narrow to five bits so T00, T01, T10, T11 fit in two 5,5-LUTs, and INH_NXT in one 6-LUT, or 2*16*(3+6)=288 LUTs in all.

469:YAMAGUTIseisei
18/08/06 01:56:04.52 FnAR0u04o BE:110146695-2BP(3)
Besides shaving LUTs, a two bank scheduler provides two sets of T0, T1 ports and can sink two sets of two events each cycle.
This is essential to sustain wider issue rates of two instructions per cycle (which may target four operands per cycle).
Yet-wider issue and yet-larger instruction windows may even merit a four bank design.
The ready-issue-target-ready scheduling recurrence is the critical path of the IS-stage.
A 32→5 priority encoder reduces the RDY vector to an IID which selects the decoded instruction.
The decoded instruction's fields {_T0, _T1, BID, ENs} are muxed into {T0, T1, BID, ENs}, which update target instructions' ready state, including RDY.
Priority encoder: Many 32-bit encoder designs were evaluated, including one-hot conversion with LUT or carry-logic or-trees, carry-logic zero-scan, and F7MAP/F8MAP muxes.
The present design uses two 16→4 encoders, one per bank, which complete in two LUT delays.
In a one-issue processor, a subsequent 2:1 mux selects one of these encoder outputs.
In particular each 16-bit encoder input I[15:0] is chunked into I[15], I[14:10], I[9:5], I[4:0].
Each 5-bit group indexes a 32x4 LUT ROM with the precomputed encoder output for that group.
Together with three 5-bit zero comparator outputs, these feed a custom 4-bit 3:1 selector which outputs 'b1111 when all three groups are zero.
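As a behavioral sketch (not the hand-mapped RPM version described above), one such 16→4 bank encoder with lowest-index priority might look like the following; the enc5 function stands in for the precomputed 32x4 LUT ROM per 5-bit group, and the if/else chain stands in for the custom 4-bit 3:1 selector. Module and signal names are illustrative.

module penc16_4(
  input  wire [15:0] I,
  output reg  [3:0]  IID,
  output wire        V);

  assign V = |I;                       // any ready instruction at all?

  // Position of the least significant set bit of a 5-bit group;
  // the real design precomputes this in a 32x4 LUT ROM.
  function [2:0] enc5(input [4:0] g);
    casez (g)
      5'b????1: enc5 = 3'd0;
      5'b???10: enc5 = 3'd1;
      5'b??100: enc5 = 3'd2;
      5'b?1000: enc5 = 3'd3;
      5'b10000: enc5 = 3'd4;
      default:  enc5 = 3'd0;
    endcase
  endfunction

  always @* begin
    if      (|I[4:0])   IID = {1'b0, enc5(I[4:0])};           // 0..4
    else if (|I[9:5])   IID = 4'd5  + {1'b0, enc5(I[9:5])};   // 5..9
    else if (|I[14:10]) IID = 4'd10 + {1'b0, enc5(I[14:10])}; // 10..14
    else                IID = 4'b1111;                        // 15, or none (V=0)
  end
endmodule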
Technology mapping and floorplanning:
The design uses an RPM (relationally placed macro) methodology to improve area and interconnect delays and achieve a repeatable layout for easy routing and timing closure under module composition and massive replication.
Structural RTL instantiates modules and tiles them into a scheduler.
The XST annotation (* LUT_MAP="yes" *) on a ≤6-input module locks its logic to one LUT; (* RLOC="XxYy" *) packs FPGA primitives into clusters and places clusters relative to each other.
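A schematic (hypothetical) example of how these annotations are applied: LUT_MAP asks XST to keep a small decoder in a single LUT, and RLOC tiles replicated instances relative to each other within the macro.

// one 6-input target comparator, locked to a single LUT
(* LUT_MAP = "yes" *)
module t_dec #(parameter [5:0] MYID = 6'd0)
  (input wire [5:0] t, output wire match);
  assign match = (t == MYID);
endmodule

// a fragment of one scheduler entry, tiled with relative placement
module sched_entry2(input wire [5:0] T0, T1, output wire t00, t11);
  (* RLOC = "X0Y0" *) t_dec #(.MYID(6'd2))         d00(.t(T0), .match(t00)); // T0 == entry 2, input 0
  (* RLOC = "X1Y0" *) t_dec #(.MYID(6'd2 | 6'd32)) d11(.t(T1), .match(t11)); // T1 == entry 2, input 1
endmodule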

470:YAMAGUTIseisei
18/08/06 01:57:08.33 FnAR0u04o BE:132176096-2BP(3)
Fig. 5:
FPGA implementation of the parallel scheduler

Figure 5 is a Xilinx 7-series implementation of Figure 4, including the scheduler, priority encoder, and decoded instruction buffer, with the critical path highlighted in white.
Each two horizontal rows of FPGA slices correspond to four entries in the instruction window.
Left to right are:

• pale yellow: four 6-bit decoded ready state flip-flops;
• yellow/green: B, T00, T01, T10, T11 target decoders;
• orange: active ready state LUTs/FFs RT_NXT/RT, etc.;
• purple: INH_NXT and INH;
• red: RDY_NXT and RDY.

To the right are the synthesized priority encoders and muxes (blue) and the decoded instructions buffer (white) implemented in several 32x6-bit true dual port LUT RAMs.
Performance: In a Kintex-7 -1-speed grade, the critical path takes 5.0 ns, including RDY clock-to-out, priority encoder, mux, decoded instructions LUT RAM, next readys logic and RDY setup.
Interconnect delay is 85% of the critical path -- unfortunately all paths from any RDY to any RDY must traverse a relatively large diameter netlist.
Cycle time may be reduced to 2.9 ns by adding a pipeline register halfway through the scheduler critical path (the output port of the instruction buffer LUT RAM); however, this will not achieve back-to-back issue (in successive cycles) of a single chain of dependent instructions.

471:YAMAGUTIseisei
18/08/06 02:01:58.00 FnAR0u04o BE:132176096-2BP(3)
E.
Incremental dataflow scheduler ready state
The parallel scheduler is straightforward but it consumes hundreds of LUTs and FFs just to maintain 32x12b of ready state -- a few LUTs worth of LUT RAM -- and this area doubles as the instruction window size doubles.
Also, each cycle its next readys LUTs recompute the readiness of every instruction, even though (broadcast notwithstanding) each issued instruction affects at most two others' ready state.
In contrast, the incremental scheduler keeps decoded and active ready state in LUT RAM, maintains the frontier of ready instructions in queues, and evaluates the ready status of just 2-4 target instructions per cycle.




Compared to an array of FFs, LUT RAM is fast and dense but has some shortcomings: there is no way to flash clear it, and it only supports one write per cycle.

472:YAMAGUTIseisei
18/08/06 02:04:55.29 FnAR0u04o BE:29372843-2BP(3)
[Fig. 6a diagram: DRDYSS decoded ready state LUT RAM (WA=DC_IID, RA=EVT_IID, I=DC_DRDYS, O=DRDYS); ARDYSS active ready state LUT RAM (WA=RA=EVT_IID, I=ARDYS_NXT, O=ARDYS); DVS and AVS valid-bit FC-SO-RAMs (cleared on RESET and RESET|REFRESH respectively, sharing the corresponding write/read addresses); ready logic combining DV, DRDYS, AV, ARDYS, and EVT_RDYS into ARDYS_NXT and READY.]

(a) Design: ready state, validation, and ready logic.

473:YAMAGUTIseisei
18/08/06 02:07:43.81 FnAR0u04o BE:110147459-2BP(3)
(b) FPGA implementation.

Fig. 6:
A 16-entry scheduler bank.

474:YAMAGUTIseisei
18/08/06 02:09:43.22 FnAR0u04o BE:44058863-2BP(3)
// ready logic
always @* begin
  ARDYS_NXT = (DV ? DRDYS : 4'b0000)
            | (AV ? ARDYS : 4'b0000)
            | EVT_RDYS;
  READY = &ARDYS_NXT;
end

Listing 2: Ready logic

475:YAMAGUTIseisei
18/08/06 02:12:33.24 FnAR0u04o BE:48954454-2BP(3)
Instead, the scheduler uses a hybrid of LUT RAM and FF ``RAM''.
Decoded (DRT, DRF, DR0, DR1) and active (RT, RF, R0, R1) ready state are stored in several banks of 16x4 true dual port LUT RAM, which is validated by a 16x1 flash-clearable, set-only RAM (``FC-SO-RAM'').
This comprises 16 FFs (with common reset), 16 write port address decoders (eight 5,5-LUTs), and a 16:1 read port mux (four 6-LUTs, two MUXF7s, one MUXF8) -- just three slices in all.
Each read from this hybrid reads the 4b LUT RAM entry and its valid bit.
Each write updates the LUT RAM and sets its valid bit.
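A behavioral sketch of this FC-SO-RAM follows; module and port names are illustrative, and the mapping of the write decode into 5,5-LUTs and the read mux into MUXF7/MUXF8s is left to the tools here.

// 16 x 1 flash-clearable, set-only valid-bit RAM
module fc_so_ram16(
  input  wire       clk,
  input  wire       rst,   // common flash clear of all 16 valid bits
  input  wire       we,    // set the bit addressed by wa
  input  wire [3:0] wa,
  input  wire [3:0] ra,
  output wire       rd);

  reg [15:0] v;
  always @(posedge clk)
    if (rst)     v <= 16'b0;     // flash clear
    else if (we) v[wa] <= 1'b1;  // set-only write
  assign rd = v[ra];             // 16:1 read mux
endmodule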
Multiple LUT RAM write ports.
To sustain a fetch/decode rate of d instructions/cycle, and an issue rate of i instructions/cycle, it is necessary to update d+2i ready state entries each cycle.
This is a challenge for one write/cycle LUT RAM.
Rather than use clock doubling or replicated RAM banks with live value tables, the incremental scheduler divides the ready state up into four (or more) interleaved, disjoint banks:

(decoded, active) ready state for (even, odd) instructions.
Then the front end can write even and odd decoded ready state while the back end updates even and/or odd target instructions' active ready state.
Figure 6 shows the resulting 16-entry scheduler bank design and implementation.
The blue decoded and active ready state LUT RAMs DRDYSS and ARDYSS are validated by orange/red FC-SO-RAMs DVS and AVS.
Each cycle the decoder writes instruction DC_IID's decoded ready state DC_DRDYS and its valid bit.
Also each cycle the bank's target ready event EVT ::= {EVT_IID; EVT_RDYS} is processed via a read-modify-write of EVT_IID's ARDYS with its DRDYS and EVT_RDYS.
See Listing 2.
The instruction is ready when all four ARDYS bits are set.
All of this logic (cyan) requires just one slice; as an optimization READY's and-reduction is in carry-logic.

476:YAMAGUTIseisei
18/08/06 02:13:32.42 FnAR0u04o BE:78326584-2BP(3)
Note that the EDGE compiler does not guarantee that both targets of an instruction are in disjoint banks so there may be scheduler bank conflicts.
An ADD instruction might target an operand of instruction 10 and an operand of instruction 12.
Since it is not possible to update the active ready state of the two even bank targets in the same cycle, one event is processed and the other is queued for a later cycle.

477:YAMAGUTIseisei
18/08/06 02:17:11.29 FnAR0u04o BE:73431465-2BP(3)
F.
Incremental dataflow scheduler design, operation, and implementation
The core of the scheduler (Figure 7) consists of:
• INSN: decoded instruction with two target event fields
• EVT0, EVT1: even/odd pending event registers
• even/odd event muxes, controlled by predecoded selects
• SCH0, SCH1: even/odd 16-entry scheduler banks
• three ready instruction IID queues:
-- DCRDYQ: decoder ready queue;
-- ISRDYQ: issue (scheduler) ready queue;
-- LSRDYQ: load/store ready queue
• two 3:1 selectors to select next IID
• INSNS: decoded instructions RAM (read port)
Note, in this design, the scheduler recurrence cycle begins and ends with the decoded instruction register.
Now consider execution of the first EDGE code block in Figure 1.
The scheduler is reset, clearing DVS, AVS in SCH0, SCH1.
The front end fetches the block's header and fetches and decodes its instructions into INSNS.
The two READs are ready to issue so their IIDs are enqueued on DCRDYQ.
This ``primes the pump'' for the back end.
The other instructions await operands or predicates and are not ready, so they are not enqueued.


478:YAMAGUTIseisei
18/08/06 02:20:36.99 FnAR0u04o BE:14686632-2BP(3)
[Fig. 7a diagram, annotated with LUT-delay labels 0-6 along the critical path: INSN decoded instruction register with T1/T0 target event fields (0); EVT1/EVT0 pending event registers (1); event muxes feeding the SCH1 and SCH0 16-entry scheduler banks (EVT in; READY, EVT_IID out) together with the LSRDYQ, DCRDYQ, and ISRDYQ ready queues (2-4); the IID selector (5); and INSNS, the 32 x n LUT RAM of decoded instructions (6).]

(a) Design.

479:YAMAGUTIseisei
18/08/06 02:21:24.24 FnAR0u04o BE:132176669-2BP(3)
(b) FPGA implementation.

Fig. 7:
32-entry scheduler, decoded instructions buffer, and ready queues.

480:YAMAGUTIseisei
18/08/06 02:24:31.08 FnAR0u04o BE:66087893-2BP(3)
Back end dataflow execution proceeds as follows.
Initially INSN is invalid and both READYs are negated.
The IID selector tree selects/dequeues the first READ instruction (IID=0) from DCRDYQ.
The decoded READ instruction word is read from INSNS into INSN.

The READ targets ADD operand #1.
Its INSN.T0 (even bank target ready event) field is valid and its mux selects EVT=(2,'b0001) for SCH0.
That updates ADD's active ready state: 'b1100|'b0000|'b0001='b1101, now awaiting only the left operand (operand #0).
Since neither scheduler bank found a READY instruction, the IID selector tree selects/dequeues the second READ from DCRDYQ.

481:YAMAGUTIseisei
18/08/06 02:26:44.11 FnAR0u04o BE:24477252-2BP(3)
This READ targets ADD operand #0;
its INSN.T0 is EVT=(2,'b0010).
SCH0 updates ADD's ready state to 'b1111 and asserts READY so the ADD (IID=2) issues.
ADD's T1 targets the TLEI ready state in SCH1.
TLEI becomes ready and issues.
As for the TLEI itself, neither of its T0/T1 fields designates an IS-stage ready event.
Why? Unlike simple one cycle latency instructions like ADD, test instructions' targets cannot receive ready events until the test executes in the EX-stage.
Once a test completes, its true/false predicate event(s) are signaled.
These proceed through queues and/or muxes (not shown) to the EVT0, EVT1 pending event registers, awaiting idle scheduler event slots.

482:YAMAGUTIseisei
18/08/06 02:27:11.41 FnAR0u04o BE:137071878-2BP(3)
Queues:
The design employs many elastic FIFO ready queues and event queues.
They are small and fast, built with up-down counters and Xilinx SRL32CE 32-bit variable length shift register LUTs.
Besides DCRDYQ the present design has two other ready queues.
ISRDYQ:
In a ``one issue'' design when an instruction issues and it targets and wakes two others, the even instruction issues next, and the odd one is queued on ISRDYQ.
LSRDYQ:
EDGE processors use a load-store queue to provide sequential memory semantics.
One simple area-optimized LSQ defers and reorders certain accesses; then when a (ready) load/store becomes issuable to memory the LSQ enqueues it on LSRDYQ.
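A minimal behavioral sketch of one such elastic queue is shown below, assuming depth 32, W-bit IIDs, and that the caller never dequeues when empty or enqueues when full; the shift-register array plus up-down occupancy counter is the structure that synthesis maps onto SRL32-class shift register LUTs. Names are illustrative.

module rdyq #(parameter W = 5)(
  input  wire         clk, rst,
  input  wire         enq, deq,
  input  wire [W-1:0] din,
  output wire [W-1:0] dout,   // oldest queued entry (undefined when empty)
  output wire         empty);

  reg [5:0]   count;          // occupancy, 0..32
  reg [W-1:0] srl [0:31];     // maps onto SRL32-class shift registers
  integer i;

  always @(posedge clk) begin
    if (enq) begin            // shift the new entry in at position 0
      for (i = 31; i > 0; i = i - 1) srl[i] <= srl[i-1];
      srl[0] <= din;
    end
    if (rst) count <= 6'd0;
    else     count <= count + enq - deq;
  end

  assign empty = (count == 0);
  assign dout  = srl[count - 1];  // the counter addresses the oldest entry
endmodule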
Broadcast wakeup:
Each EDGE result broadcast may target and wake an arbitrary number of instructions in the window.
This is easy for a parallel scheduler but costly for an incremental one.
When a result is broadcast the scheduler must sequentially update the ready state of every decoded instruction with that broadcast input.
Therefore the decoder maintains queues (BR1Q, BR2Q, BR3Q) of IIDs of instructions with a given broadcast input.
Once a broadcast result is known, the scheduler begins to dequeue the BRnQ IIDs into EVTs presented to SCH0, SCH1.
Performance:
The labels 0-6 in Figure 7a depict the number of ``LUT delays'' to each point in the scheduler critical path, the white path in Figure 7b.
In a Kintex-7 -1-speed grade, this takes 4.3 ns including INSN clock-to-out, EVT mux, SCH1's AVS read port mux, ARDYS_NXT and READY logic, IID selector, INSNS read, and INSN setup.
Here interconnect delay is just 70% of the critical path reflecting relatively shorter nets and some use of LUT-local MUXF7/MUXF8/CARRY4 nets.
The scheduler clock period may be reduced to 2.5 ns by adding pipeline registers after the LUT RAM and FC-SO-RAM reads, but as with the parallel scheduler, pipelining precludes back-to-back issue of dependent instructions.

483:YAMAGUTIseisei
18/08/06 02:28:25.23 FnAR0u04o BE:48954645-2BP(3)
G.
Comparing parallel and incremental schedulers
Table II summarizes the differences between the two dataflow scheduler designs.
The core of the incremental scheduler is less than a third the size of the parallel scheduler although the size advantage is smaller when the additional overhead of queues and muxes is added.
The incremental scheduler is also faster and the area*period metric is 2.6x better.


484:YAMAGUTIseisei
18/08/06 02:29:34.41 FnAR0u04o BE:132176669-2BP(3)
Metric                    Parallel   Incremental   Units

Area, 32 entries          288        78            LUTs
Area, total, 32 entries   340        150           LUTs
Period                    5.0        4.3           ns
Period, pipelined         2.9        2.5           ns
Area, total * period      1700       645           LUT*ns

Broadcast                 flash      iterative
Event bank conflicts?     never      sometimes

Area, 4 events/cycle      288        156           LUTs
Area, 64 entries          576        130           LUTs

TABLE II:
Comparing parallel and incremental schedulers.


However the parallel scheduler retains some brute force advantages.
It can process a broadcast event in a single cycle, whereas the incremental scheduler must iteratively drain a broadcast queue at a rate of 1-2 instructions per cycle.
This may cause issue stalls in some workloads.
The incremental scheduler is also subject to even/odd target bank conflicts which may delay an instruction wake up.
Real workload studies are needed to measure whether these effects overshadow its substantial area*period advantage.
Finally consider future scale up to wider issue and larger instruction windows.
The parallel scheduler does not grow when subdivided into more banks to process twice as many events per cycle, whereas the incremental scheduler core area doubles.
To grow the instruction window to 64 entries, the parallel scheduler requires twice as much area, whereas the incremental scheduler area grows more modestly.

485:YAMAGUTIseisei
18/08/06 02:30:24.21 FnAR0u04o BE:9791322-2BP(3)
IV.
CONCLUSION
This paper presents our work towards a practical out-of-order dataflow soft processor architecture for FPGAs.
We set out to discover whether a novel EDGE instruction set architecture, optimized for simpler high ILP microarchitectures in ASICs, is also a good fit for FPGAs, or whether general purpose soft processors will remain stuck in the scalar RISC slow lane.
We considered two different dataflow instruction scheduler designs and their optimized FPGA implementations.
In the context of commercial 200 MHz, 1,000-2,000 LUT soft processors, the limited FPGA resource cost and clock period impact of either design seem acceptable and practical.
Both design alternatives will scale well to future four-decode/two-issue implementations.

REFERENCES
[1] J. Gray,
``Homebrewing RISCs in FPGAs,'' URLリンク(fpgacpu.org), August 1996.
[2] ---- ,
``Building a RISC System in an FPGA,''
Circuit Cellar Ink, no. 116 -- 118, March, April, May 2000.
[Online]. Available:
URLリンク(fpgacpu.org)
[3] Altera Corporation,
``Nios Embedded Processor Software Development Reference Manual,'' March 2001.
[4] Xilinx Inc., ``MicroBlaze Processor Reference Guide,''
2002.
[5] A. K. Jones, R. Hoare, D. Kusic, J. Fazekas, and J. Foster,
``An FPGA-based VLIW Processor with Custom Hardware Execution,''
in Proceedings of the 13th International Symposium on Field-programmable Gate Arrays, 2005, pp. 107 -- 117.
[6] K. O. I. Tili and J. G. Steffan,
``TILT: A Multithreaded VLIW Soft Processor Family,''
in Proceedings of the International Conference on Field Programmable Logic and Applications, August 2013.

486:YAMAGUTIseisei
18/08/06 02:35:01.27 FnAR0u04o BE:61193055-2BP(3)
[7] P. Yiannacouras, J. G. Steffan, and J. Rose,
``VESPA: Portable, Scalable, and Flexible FPGA-based Vector Processors,''
in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2008, pp. 61 -- 70.
[8] J. Yu, G. Lemieux, and C. Eagleston,
``Vector Processing As a Soft-core CPU Accelerator,'' in Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, 2008, pp. 222 -- 232.
[9] R. Carli,
``Flexible MIPS Soft Processor Architecture,''
Master's thesis, Massachusetts Institute of Technology, May 2008.
[10] K. Aasaraai and A. Moshovos,
``Towards a viable out-of-order soft core: Copy-free, checkpointed register renaming,''
in Proceedings of the 19th International Conference on Field Programmable Logic and Applications, August 2009.
[11] B. H. Dwiel, N. K. Choudhary, and E. Rotenberg,
``FPGA Modeling of Diverse Superscalar Processors,''
in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2012, pp. 188 -- 199.
[12] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore,
J. Burrill, R. G. McDonald, W. Yoder, X. Chen, R. Desikan, S. Drolia, J. Gibson, M. S. S. Govindan,
P. Gratz, H. Hanson, C. Kim, S. K. Kushwaha, H. Liu, R. Nagarajan, N. Ranganathan,
E. Reeber, K. Sankaralingam, S. Sethumadhavan, P. Sivakumar, and A. Smith,
``Scaling to the End of Silicon with EDGE architectures,'' IEEE Computer, vol. 37, no. 7, pp. 44 -- 55, July 2004.

487:YAMAGUTIseisei
18/08/06 02:35:26.75 FnAR0u04o BE:24477252-2BP(3)
[13] M. Gebhart, B. A. Maher, K. E. Coons, J. Diamond, P. Gratz, M. Marino, N. Ranganathan, B. Robatmili, A. Smith, J. Burrill, S. W. Keckler, D. Burger, and K. S. McKinley,
``An Evaluation of the TRIPS Computer System,''
in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009, pp. 1 -- 12.
[14] C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler,
``Composable lightweight processors,''
in Proceedings of the 40th International Symposium on Microarchitecture, 2007, pp. 381 -- 394.
[15] B. Robatmili, D. Li, H. Esmaeilzadeh, S. Govindan, A. Smith, A. Putnam, D. Burger, and S. W. Keckler,
``How to Implement Effective Prediction and Forwarding for Fusable Dynamic Multicore Architectures,''
in Proceedings of the 19th International Symposium on High Performance Computer Architecture, 2013, pp. 460 -- 471.
[16] M. S. S. Govindan, B. Robatmili, D. Li, B. Maher, A. Smith, S. W. Keckler, and D. Burger,
``Scaling power and performance via processor composability,''
IEEE Transactions on Computers, March 2013.


488:YAMAGUTIseisei
18/08/26 16:15:24.18 CL1hr8qnX BE:24477252-2BP(3)
>156 ー 180824 1738 3eCkMSqb
:
>URLリンク(thenextweb.com)
:
> Implementing curiosity in AI,
>
> Open AI -- a think tank specializing in the singularity, co-funded by Elon Musk -- has been conducting broad research on learning through curiosity (snip) ( URLリンク(pathak22.github.io)
:

489:YAMAGUTIseisei
18/08/26 16:18:00.66 CL1hr8qnX BE:73431465-2BP(3)

Large-Scale Study of Curiosity-Driven Learning


Yuri Burda∗
OpenAI
Harri Edwards∗
OpenAI
Deepak Pathak∗
UC Berkeley
Amos Storkey
Univ. of Edinburgh
Trevor Darrell
UC Berkeley
Alexei A. Efros
UC Berkeley


Abstract

Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent.
However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent.
Curiosity is a type of intrinsic reward function which uses prediction error as reward signal.
In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite.
Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments.
(b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks,
but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.).
(c) We demonstrate limitations of the prediction-based rewards in stochastic setups.
Game-play videos and code are at
URLリンク(pathak22.github.io)
.

490:YAMAGUTIseisei
18/08/26 16:20:05.55 CL1hr8qnX BE:58745838-2BP(3)
1
Introduction

Reinforcement learning (RL) has emerged as a popular method for training agents to perform complex tasks.
In RL, the agent policy is trained by maximizing a reward function that is designed to align with the task.
The rewards are extrinsic to the agent and specific to the environment they are defined for.
Most of the success in RL has been achieved when this reward function is dense and well-shaped, e.g., a running ``score'' in a video game [21].
However, designing a well-shaped reward function is a notoriously challenging engineering problem.
An alternative to ``shaping'' an extrinsic reward is to supplement it with dense intrinsic rewards [26], that is, rewards that are generated by the agent itself.
Examples of intrinsic reward include ``curiosity'' [11, 22, 27, 35, 40], which uses prediction error as reward signal, and ``visitation counts'' [3, 20, 24, 30], which discourages the agent from revisiting the same states.
The idea is that these intrinsic rewards will bridge the gaps between sparse extrinsic rewards by guiding the agent to efficiently explore the environment to find the next extrinsic reward.

491:YAMAGUTIseisei
18/08/26 16:20:46.49 CL1hr8qnX BE:66087893-2BP(3)
But what about scenarios with no extrinsic reward at all? This is not as strange as it sounds.
Developmental psychologists talk about intrinsic motivation (i.e., curiosity) as the primary driver in the early stages of development [32, 41]: babies appear to employ goal-less exploration to learn skills that will be useful later on in life.
There are plenty of other examples, from playing Minecraft to visiting your local zoo, where no extrinsic rewards are required.
Indeed, there is evidence that pre-training an agent on a given environment using only intrinsic rewards allows it to learn much faster when fine-tuned to a novel task in a novel environment [27, 28].
Yet, so far, there has been no systematic study of learning with only intrinsic rewards.

*Alphabetical ordering; the first three authors contributed equally.

Preprint.
Work in progress.

492:YAMAGUTIseisei
18/08/26 16:22:09.28 CL1hr8qnX BE:110146695-2BP(3)

Figure 1:
A snapshot of the 54 environments investigated in the paper.
We show that agents are able to make progress using no extrinsic reward, or end-of-episode signal, and only using curiosity.
Video results, code and models at
URLリンク(pathak22.github.io)
.

493:YAMAGUTIseisei
18/08/26 16:23:45.97 CL1hr8qnX BE:73431656-2BP(3)
In this paper, we perform a large-scale empirical study of agents driven purely by intrinsic rewards across a range of diverse simulated environments.
In particular, we choose the dynamics-based curiosity model of intrinsic reward presented in Pathak et al. [27] because it is scalable and trivially parallelizable, making it ideal for large-scale experimentation.
The central idea is to represent intrinsic reward as the error in predicting the consequence of the agent's action given its current state, i.e., the prediction error of the learned forward dynamics of the agent.
We thoroughly investigate the dynamics-based curiosity across 54 environments: video games, physics engine simulations, and virtual 3D navigation tasks, shown in Figure 1.

To develop a better understanding of curiosity-driven learning, we further study the crucial factors that determine its performance.
In particular, predicting future state in high dimensional raw observation space (e.g., images) is a challenging problem and, as shown by recent works [27, 42], learning dynamics in an auxiliary feature space leads to improved results.
However, how one should choose such an embedding space is a critical, yet open research problem.
Through a systematic ablation, we examine the role of different ways to encode the agent's observation such that an agent can perform well driven purely by its own curiosity.

494:YAMAGUTIseisei
18/08/26 16:27:38.51 CL1hr8qnX BE:36715853-2BP(3)
To ensure stable online training of dynamics, we argue that the desired embedding space should: (a) be compact in terms of dimensionality,
(b) preserve sufficient information about the observation, and (c) be a stationary function of the observations.
We show that encoding observations via a random network turns out to be a simple, yet effective technique for modeling curiosity across many popular RL benchmarks.
This might suggest that many popular RL video game test-beds are not as visually sophisticated as commonly thought.
Interestingly, we discover that although random features are sufficient for good performance at training, the learned features appear to generalize better (e.g., to novel game levels in Super Mario Bros.).

In summary:
(a) We perform a large-scale study of curiosity-driven exploration across a variety of environments including:
the set of Atari games [4], Super Mario Bros., virtual 3D navigation in Unity [1], multi-player Pong, and Roboschool [39] environments.
(b) We extensively investigate different feature spaces for learning the dynamics-based curiosity: random features, pixels, inverse-dynamics [27] and variational auto-encoders [15], and evaluate generalization to unseen environments.
(c) We conclude by discussing some limitations of a direct prediction-error based curiosity formulation.
We observe that if the agent itself is the source of stochasticity in the environment, it can reward itself without making any actual progress.
We empirically demonstrate this limitation in a 3D navigation task where the agent controls different parts of the environment.


495:YAMAGUTIseisei
18/08/26 16:30:44.18 CL1hr8qnX BE:14686632-2BP(3)

2
Dynamics-based Curiosity-driven Learning

Consider an agent that sees an observation x_t, takes an action a_t and transitions to the next state with observation x_{t+1}.
We want to incentivize this agent with a reward r_t relating to how informative the transition was.
To provide this reward, we use an exploration bonus involving the following elements:
(a) a network to embed observations into representations φ(x),
(b) a forward dynamics network to predict the representation of the next state conditioned on the previous observation and action, p(φ(x_{t+1}) | x_t, a_t).
Given a transition tuple {x_t, x_{t+1}, a_t}, the exploration reward is then defined as r_t = −log p(φ(x_{t+1}) | x_t, a_t), also called the surprisal [2].
An agent trained to maximize this reward will favor transitions with high prediction error, which will be higher in areas where the agent has spent less time, or in areas with complex dynamics.
Such a dynamics-based curiosity has been shown to perform quite well across scenarios [27], especially when the dynamics are learned in an embedding space rather than raw observations.
In this paper, we explore dynamics-based curiosity and use the mean-squared error corresponding to a fixed-variance Gaussian density as surprisal, i.e., ||f(x_t, a_t) − φ(x_{t+1})||_2^2, where f is the learned dynamics model.
However, any other density model could be used.

2.1
Feature spaces for forward dynamics
Consider the representation φ in the curiosity formulation above.
If φ(x) = x, the forward dynamics model makes predictions in the observation space.
A good choice of feature space can make the prediction task more tractable and filter out irrelevant aspects of the observation space.
But what makes a good feature space for dynamics-driven curiosity? We narrow down a few qualities that a good feature space should have:

496:YAMAGUTIseisei
18/08/26 16:33:46.18 CL1hr8qnX BE:48954645-2BP(3)
• Compact: The features should be easy to model by being low(er)-dimensional and filtering out irrelevant parts of the observation space.
• Sufficient: The features should contain all the important information.
Otherwise, the agent may fail to be rewarded for exploring some relevant aspect of the environment.
• Stable: Non-stationary rewards make it difficult for reinforcement agents to learn.
Exploration bonuses by necessity introduce non-stationarity since what is new and novel becomes old and boring with time.
In a dynamics-based curiosity formulation, there are two sources of non-stationarity: the forward dynamics model is evolving over time as it is trained and the features are changing as they learn.
The former is intrinsic to the method, and the latter should be minimized where possible.

In this work, we systematically investigate the efficacy of a number of feature learning methods, summarized briefly as follows:

Pixels
The simplest case is where φ(x) = x and we fit our forward dynamics model in the observation space.
Pixels are sufficient, since no information has been thrown away, and stable since there is no feature learning component.
However, learning from pixels is tricky because the observation space may be high-dimensional and complex.

Random Features (RF)
The next simplest case is where we take our embedding network, a convolutional network, and fix it after random initialization.
Because the network is fixed, the features are stable.
The features can be made compact in dimensionality, but they are not constrained to be.
However, random features may fail to be sufficient.

497:YAMAGUTIseisei
18/08/26 16:36:04.77 CL1hr8qnX BE:9791322-2BP(3)
Variational Autoencoders (VAE) VAEs were introduced in [15, 31] to fit latent variable generative models p(x, z) for observed data x and latent variable z with prior p(z) using variational inference.
The method calls for an inference network q(z|x) that approximates the posterior p(z|x).
This is a feedforward network that takes an observation as input and outputs a mean and variance vector describing a Gaussian distribution with diagonal covariance.


VAE IDF RF Pixels
Stable No No Yes Yes
Compact Yes Yes Maybe No
Sufficient Yes Maybe Maybe Yes

Table 1:
Table summarizing the categorization of different kinds of feature spaces considered.




We can then use the mapping to the mean as our embedding network φ.
These features will be a low-dimensional approximately sufficient summary of the observation,
but they may still contain some irrelevant details such as noise, and the features will change over time as the VAE trains.
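For reference, the standard VAE training objective (not spelled out in the text) maximizes the evidence lower bound, and the embedding used here is the mean of the inference network:

\log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q(z \mid x)\,\big\|\,p(z)\big),
\qquad q(z \mid x) = \mathcal{N}\big(z;\, \mu(x),\, \mathrm{diag}\,\sigma^2(x)\big),
\qquad \phi(x) := \mu(x).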

498:YAMAGUTIseisei
18/08/26 16:37:31.56 CL1hr8qnX BE:156653388-2BP(3)
Inverse Dynamics Features (IDF) Given a transition (s_t, s_{t+1}, a_t), the inverse dynamics task is to predict the action a_t given the previous and next states s_t and s_{t+1}.
Features are learned using a common neural network φ to first embed s_t and s_{t+1}.
The intuition is that the features learned should correspond to aspects of the environment that are under the agent's immediate control.
This feature learning method is easy to implement and in principle should be invariant to certain kinds of noise (see [27] for a discussion).
A potential downside could be that the features learned may not be sufficient, that is they do not represent important aspects of the environment that the agent cannot immediately affect.
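A minimal statement of the inverse dynamics objective, assuming a discrete action space as in Atari (so the action prediction is trained with cross-entropy), is:

\min_{\theta,\,\phi}\;\; \mathbb{E}_{(s_t,\,a_t,\,s_{t+1})}\Big[-\log p_{\theta}\big(a_t \mid \phi(s_t),\, \phi(s_{t+1})\big)\Big]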
A summary of these characteristics is provided in Table 1.
Note that the learned features are not stable because their distribution changes as learning progresses.
One way to achieve stability could be to pre-train VAE or IDF networks.
However, unless one has access to the internal state of the game, it is not possible to get a representative data of the game scenes to train the features.
One way is to act randomly to collect data, but then it will be biased to where the agent started, and won't generalize further.
Since all the features involve some trade-off of desirable properties, it becomes an empirical question as to how effective each of them is across environments.

2.2
Practical considerations in training an agent driven purely by curiosity
Deciding upon a feature space is only the first part of the puzzle in implementing a practical system.
Here, we detail the critical choices we made in the learning algorithm.
Our goal was to reduce non-stationarity in order to make learning more stable and consistent across environments.
Through the following considerations outlined below, we are able to get exploration to work reliably for different feature learning methods and environments with minimal changes to the hyper-parameters.

499:YAMAGUTIseisei
18/08/26 16:38:36.96 CL1hr8qnX BE:176234898-2BP(3)
• PPO.
In general, we have found the PPO algorithm [38] to be a robust learning algorithm that requires little hyper-parameter tuning, and hence we stick to it for our experiments.
• Reward normalization.
Since the reward function is non-stationary, it is useful to normalize the scale of the rewards so that the value function can learn quickly.
We did this by dividing the rewards by a running estimate of the standard deviation of the sum of discounted rewards; a sketch of this computation follows the list.
• Advantage normalization.
While training with PPO, we normalize the advantages [46] in a batch to have a mean of 0 and a standard deviation of 1.
• Observation normalization.
We run a random agent on our target environment for 10000 steps, then calculate the mean and standard deviation of the observation and use these to normalize the observations when training.
This is useful to ensure that the features do not have very small variance at initialization and to have less variation across different environments.
• More actors.
The stability of the method is greatly increased by increasing the number of parallel actors (which affects the batch-size) used.
We typically use 128 parallel runs of the same environment for data collection while training an agent.
• Normalizing the features.
In combining intrinsic and extrinsic rewards, we found it useful to ensure that the scale of the intrinsic reward was consistent across state space.
We achieved this by using batch-normalization [13] in the feature embedding network.
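As referenced in the reward normalization item above, one way to read that step (a sketch of the computation, not the authors' exact code) is as a running discounted sum of rewards whose running standard deviation rescales each intrinsic reward:

R_t = \gamma R_{t-1} + r_t, \qquad \hat{r}_t = \frac{r_t}{\sqrt{\widehat{\mathrm{Var}}[R]}}

where \widehat{\mathrm{Var}}[R] is a running estimate maintained over the collected rollouts.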

500:YAMAGUTIseisei
18/08/26 16:41:37.25 CL1hr8qnX BE:97908858-2BP(3)
2.3
`Death is not the end': discounted curiosity with infinite horizon
One important point is that the use of an end-of-episode signal, sometimes called a `done', can often leak information about the true reward function.
If we don't remove the `done' signal, many of the Atari games become too simple.
For example, a simple strategy of giving +1 artificial reward at every time-step when the agent is alive and 0 on death is sufficient to obtain a high score in some games,
for instance, the Atari game `Breakout', where it will seek to maximize the episode length and hence its score.
In the case of negative rewards, the agent will try to end the episode as quickly as possible.


501:YAMAGUTIseisei
18/08/26 16:42:21.10 CL1hr8qnX BE:29373034-2BP(3)


[Figure 2 plot: extrinsic reward per episode vs. frames (millions) for BeamRider, Breakout, MontezumaRevenge, Pong, Mario, Qbert, Riverraid, Seaquest, and SpaceInvaders, comparing Pixels, VAE features, Inverse Dynamics features, and Random CNN features.]

Figure 2:
A comparison of feature learning methods on 8 selected Atari games and the Super Mario Bros.

502:YAMAGUTIseisei
18/08/26 16:43:13.06 CL1hr8qnX BE:48954454-2BP(3)
These evaluation curves show the mean reward (with standard error) of agents trained purely by curiosity, without reward or an end-of-episode signal.
We see that our purely curiosity-driven agent is able to gather rewards in these environments without using any extrinsic reward at training.
Results on all of the Atari games are in the appendix in Figure 8.
We find that the curiosity model trained on pixels does not work well in any environment, and that VAE features perform either the same as or worse than random and inverse-dynamics features.
Further, inverse-dynamics-trained features perform better than random features in 55% of the Atari games.
An interesting outcome of this analysis is that random features for modeling curiosity are a simple yet surprisingly strong baseline, likely to work well in half of the Atari games.


In light of this, if we want to study the behavior of pure exploration agents, we should not bias the agent.
In the infinite horizon setting (i.e., the discounted returns are not truncated at the end of the episode and always bootstrapped using the value function), death is just another transition to the agent, to be avoided only if it is boring.
Therefore, we removed the 'done' signal to separate the gains of an agent's exploration from those merely due to the death signal.
In practice, we do find that the agent avoids dying in the games since that brings it back to the beginning of the game, an area it has already seen many times and where it can predict the dynamics well.
This subtlety has been neglected by previous works showing experiments without extrinsic rewards.
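To make the mechanics concrete, here is a minimal Python/NumPy sketch (with hypothetical function and variable names) of how bootstrapped returns can be computed with and without the 'done' signal; in the infinite-horizon setting the return is always bootstrapped through episode boundaries.

import numpy as np

def discounted_returns(rewards, dones, last_value, gamma=0.99, use_done=True):
    """Backward pass computing bootstrapped returns for a rollout of length T.
    rewards, dones: arrays of shape (T,); last_value is the value estimate V(s_T).
    With use_done=False (the infinite-horizon setting), the return is bootstrapped
    through episode boundaries, so death is just another transition."""
    T = len(rewards)
    returns = np.zeros(T)
    next_return = last_value
    for t in reversed(range(T)):
        if use_done:
            next_return = rewards[t] + gamma * (1.0 - dones[t]) * next_return
        else:
            next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns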

503:YAMAGUTIseisei
18/08/26 16:47:17.14 CL1hr8qnX BE:44059436-2BP(3)
3
Experiments

In all of our experiments, both the policy and the embedding network work directly from pixels.
For our implementation details, including hyper-parameters and architectures, please refer to Appendix A.
Unless stated otherwise, all curves are the average of three runs with different seeds, and the shaded areas are standard errors of the mean.
We have released the code and videos of a purely curious agent playing across all environments on the website 2.

3.1
Curiosity-driven learning without extrinsic rewards
We begin by scaling up pure curiosity-driven learning to a large number of environments without using any extrinsic rewards.
We pick a total of 54 diverse simulated environments, as shown in Figure 1,
including 48 Atari games, Super Mario Bros., 2 Roboschool scenarios (learning an Ant controller and Juggling), Two-player Pong, and 2 Unity mazes (with and without a TV controlled by the agent).
The goal of this large-scale analysis is to investigate the following questions:
(a) What actually happens when you run a pure curiosity-driven agent on a variety of games without any extrinsic rewards?
(b) What kinds of behaviors can you expect from these agents? (c) What is the effect of the different feature learning variants in dynamics-based curiosity on these behaviors?

2
URLリンク(pathak22.github.io)


504:YAMAGUTIseisei
18/08/26 16:48:28.26 CL1hr8qnX BE:9791322-2BP(3)

A) Atari Games
To answer these questions, we began with a collection of well-known Atari games and ran a suite of experiments with different feature learning methods.
One way to measure how well a purely curious agent performs is to measure the extrinsic reward it is able to achieve, i.e., how good the agent is at playing the game.
We show the evaluation curves of mean extrinsic reward on 8 common Atari games in Figure 2, and on the full 48-game Atari suite in Figure 8 in the appendix.
It is important to note that the extrinsic reward is only used for evaluation, not for training.
However, this is just a proxy for pure exploration because the game rewards could be arbitrary and might not align at all with how the agent explores out of curiosity.

The first thing to notice from the curves is that most of them are going up.
This shows that a pure curiosity-driven agent can learn to obtain external rewards even without using any extrinsic rewards during training.
It is remarkable that agents with no extrinsic reward and no end of episode signal can learn to get scores comparable in some cases to learning with the extrinsic reward.
For instance, in Breakout, the game score increases when the paddle hits the ball into the bricks, which disappear and give points when struck.
The more bricks are struck in a row by the ball, the more complicated the pattern of remaining bricks becomes, making the agent more curious to explore further and hence collect points as a by-product.
Further, when the agent runs out of lives, the bricks are reset to a uniform structure that the agent has seen many times before and can therefore predict well, so the agent tries to stay alive, avoiding the reset caused by death, in order to remain curious.

505:YAMAGUTIseisei
18/08/26 16:54:32.44 CL1hr8qnX BE:132176669-2BP(3)
This is an unexpected result and might suggest that many popular RL test-beds do not need an external reward.
This may be because game designers (like architects, urban planners, and gardeners) are
very good at setting up curricula to guide agents through a task, which explains why a curiosity-like objective aligns reasonably well with the extrinsic reward in many human-designed environments [6, 12, 16, 48].
However, this is not always the case, and sometimes a curious agent can even do worse than a random agent!
This happens when the extrinsic reward has little correlation with the agent's exploration, or when the agent fails to explore efficiently (e.g., see the games 'Atlantis' and 'IceHockey' in Figure 8).
We further encourage the reader to refer to the game-play videos of the agent available on the website for a better understanding of the learned skills.

Comparison of feature learning methods:
We compare four feature learning methods in Figure 2: raw pixels, random features, inverse dynamics features and VAE features.
Training the dynamics model on raw pixels performs poorly across all environments, while encoding pixels into features does better.
This is likely because it is hard to learn a good dynamics model in pixel space, and prediction errors may be dominated by small irrelevant details.

Surprisingly, random features (RF) perform quite well across tasks and sometimes better than using learned features.
One reason for this good performance is that the random features are kept frozen (stable), so the dynamics model learned on top of them has an easier time because its target is stationary.
In general, random features should work well in domains where the visual observations are simple enough that random features preserve sufficient information about the raw signal, as in Atari games.
Interestingly, we find that while random features work well at training, IDF learned features appear to generalize better in Mario Bros. (see Section 3.2 for details).
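As an illustration, the following is a minimal PyTorch sketch of a random-feature embedding network whose weights are frozen at initialization, so that the dynamics model's prediction target stays stationary. The layer sizes, activation, and feature dimension are assumptions for 84x84 inputs, not the paper's exact architecture.

import torch
import torch.nn as nn

class RandomCNNFeatures(nn.Module):
    """A small CNN whose weights are fixed at their random initialization.
    The forward-dynamics model is trained on top of these frozen features."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.LeakyReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.LeakyReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.LeakyReLU(),
        )
        self.fc = nn.Linear(64 * 7 * 7, feature_dim)   # valid for 84x84 inputs
        for p in self.parameters():                    # freeze: never passed to the optimizer
            p.requires_grad_(False)

    def forward(self, obs):                            # obs: (B, 4, 84, 84), normalized
        h = self.conv(obs)
        return self.fc(h.flatten(start_dim=1))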

506:YAMAGUTIseisei
18/08/26 16:55:34.83 CL1hr8qnX BE:110146695-2BP(3)
The VAE method also performed well but was somewhat unstable, so we decided to use RF and IDF for further experiments.
The detailed results in appendix Figure 8 compare IDF vs. RF across the full Atari suite.
To quantify the learned behaviors, we compared our curious agents to a randomly acting agent.
We found that an IDF-curious agent collects more game reward than a random agent in 75% of the Atari games, and an RF-curious agent does better in 70%.
Further, IDF does better than RF in 55% of the games.
Overall, random features and inverse dynamics features worked well in general.
Further details are given in the appendix.

B) Super Mario Bros.
We compare different feature learning methods in Mario Bros. in Figure 2.
Super Mario Bros. has already been studied in the context of extrinsic-reward-free learning [27] in small-scale experiments, and so we were keen to see how far curiosity alone can push the agent.
We use a faster, more efficient version of the Mario simulator to scale up to longer training, keeping the observation space, actions, and dynamics of the game intact.
Due to 100x longer training and using PPO for optimization, our agent is able to pass several levels of the game, significantly improving over prior exploration results on Mario Bros.
Could we further push the performance of a purely curious agent by making the underlying optimization more stable? One way is to scale up the batch-size.
We do so by increasing the number of parallel threads for running environments from 128 to 2048.


507:YAMAGUTIseisei
18/08/26 16:58:13.95 CL1hr8qnX BE:34268827-2BP(3)


[Plots of extrinsic reward per episode: (a) Mario with large batch, comparing batches of 128 and 1024 environments vs. number of gradient updates; (b) Juggling (Roboschool), pure curiosity (no reward, infinite horizon) vs. frames (millions); (c) Two-player Pong, pure curiosity (no reward, infinite horizon) vs. frames (millions).]
Figure 3:
(a) Left: A comparison of the RF method on Mario with different batch sizes.
Results are without using extrinsic reward.
(b) Center: Number of ball bounces in the Juggling (Roboschool) environment.
(c) Right: Mean episode length in the multiplayer Pong environment.
The discontinuous jump on the graph corresponds to the agent reaching a limit of the environment -
after a certain number of steps in the environment, the Atari Pong emulator starts randomly cycling through background colors and becomes unresponsive to the agent's actions.

508:YAMAGUTIseisei
18/08/26 16:59:09.88 CL1hr8qnX BE:19582324-2BP(3)
We show the comparison between training using 128 and 2048 parallel environment threads in Figure 3(a).
As apparent from the graph, training with large batch-size using 2048 parallel environment threads performs much better.
In fact, the agent is able to explore much more of the game: discovering 11 different levels of the game, finding secret rooms and defeating bosses.
Note that the x-axis in the figure is the number of gradient steps, not the number of frames, since the point of this large-scale experiment is not a claim about sample-efficiency, but performance with respect to training the agent.
This result suggests that the performance of a purely curiosity-driven agent would improve as the training of the base RL algorithm (PPO in our case) gets better.
The video is on the website.

C) Roboschool Juggling
We modified the Pong environment from the Roboschool framework to only have one paddle and to have two balls.
The action space is two-dimensional and continuous; we discretized it into 5 bins per dimension, giving a total of 25 actions.
Both the policy and embedding network are trained on pixel observation space (note: not state space).
This environment is more difficult to control than the toy physics used in games, but the agent learns to intercept and strike the balls when they come into its area.
We monitored the number of bounces of the balls as a proxy for interaction with the environment, as shown in Figure 3(b).
See the video on the project website.
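A minimal sketch of such a discretization, assuming the two continuous action dimensions are bounded in [-1, 1]; the bounds and the helper name are illustrative, not taken from the released code.

import itertools
import numpy as np

def make_discrete_actions(low, high, bins_per_dim=5):
    """Map a 2-D continuous action space to 5 bins per dimension (25 discrete actions)."""
    grids = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]
    return list(itertools.product(*grids))     # list of 25 (a0, a1) tuples

actions = make_discrete_actions(low=[-1.0, -1.0], high=[1.0, 1.0])
# The policy outputs an index in [0, 25); the environment receives actions[index].
continuous_action = actions[7]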

509:YAMAGUTIseisei
18/08/26 16:59:48.41 CL1hr8qnX BE:119937877-2BP(3)
D) Roboschool Ant Robot
We also explored using the Ant environment which consists of an Ant with 8 controllable joints on a track.
We again discretized the action space and trained policy and embedding network on raw pixels (not state space).
However, in this case it was harder to measure exploration, because the extrinsic distance reward measures progress along the racetrack, whereas a purely curious agent is free to move in any direction.
We find that walking-like behavior emerges purely from curiosity-driven training.
We refer the reader to the result video showing that the agent is meaningfully interacting with the environment.

E) Multi-agent curiosity in Two-player Pong
We have already seen that a purely curiosity-driven agent learns to play several Atari games without reward, but we wondered how much of that behavior is caused by the fact that the opposing player is a computer agent with a hardcoded strategy.
What would happen if both sides were curious? To find out, we take a Two-player Pong game where both sides (the paddles) are controlled by curiosity-driven agents.
We share the initial layers of both agents and use separate action heads, i.e., the total action space is now the cross product of the actions of player 1 and the actions of player 2.
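A minimal PyTorch sketch of such a shared-trunk, two-headed policy; the layer sizes and the number of actions per player are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class TwoHeadedPolicy(nn.Module):
    """Shared convolutional trunk with one action head per Pong paddle.
    The joint action is the cross product of the two heads' actions."""
    def __init__(self, n_actions_per_player=3, feature_dim=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, 1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim), nn.ReLU(),   # valid for 84x84 inputs
        )
        self.head_p1 = nn.Linear(feature_dim, n_actions_per_player)
        self.head_p2 = nn.Linear(feature_dim, n_actions_per_player)
        self.value = nn.Linear(feature_dim, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.head_p1(h), self.head_p2(h), self.value(h)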

Note that the extrinsic reward is meaningless in this context since the agent is playing both sides, so instead, we show the length of the episode.
The results are shown in Figure 3(c).
We see from the episode length that the agent learns to have more and longer rallies over time, learning to play pong without any teacher, purely by curiosity on both sides.
In fact, the game rallies eventually get so long that they break our Atari emulator causing the colors to change radically, which crashes the policy as shown in the plot.


510:YAMAGUTIseisei
18/08/26 17:01:15.34 CL1hr8qnX BE:19581942-2BP(3)

3.2
Generalization across novel levels in Super Mario Bros.
In the previous section, we showed that our purely curious agent can learn to explore efficiently and to learn useful skills, e.g., game-playing behaviour in games and walking behaviour in Ant.
So far, these skills have been shown only in the environment the agent was trained in.
However, one advantage of developing reward-free learning is that one should then be able to utilize abundant "unlabeled" environments without reward functions, provided the learned skills generalize to novel environments.

To test this, we first pre-train our agent using curiosity only on Level 1-1 of Mario Bros.
We investigate how well RF and IDF-based curiosity agents generalize to novel levels of Mario.
In Figure 4, we show two examples of training on one level of Mario and finetuning on another testing level, and compare to learning on the testing level from scratch.
The training signal in all the cases is only curiosity reward.
In the first case, from Level 1-1 to Level 1-2, the global statistics of the environments match (both are 'day' levels, i.e., blue background), but the levels have different enemies, geometry, and difficulty.
We see strong transfer for both methods in this scenario.
However, the transfer performance is weaker in the second scenario from Level 1-1 to Level 1-3.
This is because the problem is considerably harder for the latter level pairing: there is a color-scheme shift from day to night, as shown in Figure 4.

511:YAMAGUTIseisei
18/08/26 17:02:53.25 CL1hr8qnX BE:29373034-2BP(3)
We further note that IDF-learned features transfer in both cases, while random features transfer in the first case but do not transfer in the second scenario from day to night.
These results might suggest that while random features perform well on training environments, learned features appear to generalize better to novel levels.
However, this needs more analysis in the future across a large variety of environments.
Overall, we find some promising evidence showing that skills learned by curiosity help our agent explore efficiently in novel environments.


[Plots of extrinsic reward per episode vs. frames (millions): left panel 'World 1 level 1 to world 2 level 1', right panel 'World 1 level 1 to world 3 level 1'; curves for IDF scratch, IDF transfer, RF scratch, and RF transfer.]

Figure 4:
Mario generalization experiments.
On the left we show transfer results from Level 1-1 to Level 1-2, and on the right we show transfer results from Level 1-1 to Level 1-3.
Underneath each plot is a map of the source and target environments.
All agents are trained without extrinsic reward.

512:YAMAGUTIseisei
18/08/26 17:04:02.97 CL1hr8qnX BE:36715853-2BP(3)
[Plot: extrinsic reward per episode vs. frames (millions) in the Unity maze; curves for Random CNN features, Inverse dynamics features, and Extrinsic only.]
Figure 5: Mean extrinsic reward in the Unity environment while training with terminal extrinsic + curiosity reward.
Note that the curve for extrinsic-reward-only training is constant at zero.


3.3 Curiosity with Sparse External Reward
In all our experiments so far, we have shown that our agents can learn useful skills without any extrinsic rewards, driven purely by curiosity.
However, in many scenarios, we might want the agent to perform some particular task of interest.
This is usually conveyed to the agent by defining extrinsic rewards.
When rewards are dense (e.g. game score at every frame), classic RL works well and intrinsic rewards generally should not help performance.
However, designing dense rewards is a challenging engineering problem (see introduction for details).
In this section, we evaluate how well curiosity can help an agent perform a task in the presence of sparse, or only terminal, rewards.

Terminal reward setting:
For many real problems, e.g., navigation, only a terminal reward is available, a setting where classic RL typically performs poorly.
Hence, we consider 3D navigation in a maze designed in the Unity ML-agents framework, with 9 rooms and a sparse terminal reward.


513:YAMAGUTIseisei
18/08/26 17:04:57.59 CL1hr8qnX BE:78326584-2BP(3)

There is a discrete action space consisting of: move forwards, look left 15 degrees, look right 15 degrees and no-op.
The agent starts in room 1, which is furthest from room 9, which contains the agent's goal.
We compare an agent trained with extrinsic reward (+1 when the goal is reached, 0 otherwise) to an agent trained with extrinsic + intrinsic reward.
The extrinsic-only (classic RL) agent never finds the goal in any of our trials, which means no meaningful gradients can be obtained.
In contrast, the extrinsic+intrinsic agent typically converges to obtaining the reward every time.
Figure 5 shows results for vanilla PPO, PPO + IDF-curiosity, and PPO + RF-curiosity.

Sparse reward setting: In preliminary experiments, we picked 5 Atari games which have sparse rewards (as categorized by [3]), and compared extrinsic (classic RL) vs.
extrinsic+intrinsic (ours) reward performance.
In 4 games out of 5, the curiosity bonus improves performance (see Table 2 in the appendix; higher scores are better).
We would like to emphasize that this is not the focus of the paper, and these experiments are provided just for completeness.
We just combined extrinsic (coefficient 1.0) and intrinsic reward (coefficient 0.01) directly without any tuning.
We leave the question of how to optimally combine extrinsic and intrinsic rewards to future work.
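As a minimal sketch, the direct combination mentioned above amounts to the following; the names are hypothetical and both reward streams are assumed to have been normalized as in Section 2.2.

# Combine rewards with fixed, untuned coefficients (1.0 extrinsic, 0.01 intrinsic).
EXT_COEF, INT_COEF = 1.0, 0.01

def combined_reward(extrinsic_reward, intrinsic_reward):
    return EXT_COEF * extrinsic_reward + INT_COEF * intrinsic_reward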

514:YAMAGUTIseisei
18/08/26 17:11:36.23 CL1hr8qnX BE:29373326-2BP(3)
4
Related Work

Intrinsic Motivation:
A family of approaches to intrinsic motivation reward
an agent based on prediction error [2, 27, 36, 42], prediction uncertainty [11, 44], or improvement [19, 34] of a forward dynamics model of the environment that gets trained along with the agent's policy.
As a result the agent is driven to reach regions of the environment that are difficult to predict for the forward dynamics model, while the model improves its predictions in these regions.
These adversarial and non-stationary dynamics can give rise to complex behaviors.
Relatively little work has been done in this area on the pure exploration setting where there is no external reward.
Of these, the most closely related are those that use a forward dynamics model of a feature space, such as Stadie et al. [42], who use autoencoder features, and Pathak et al. [27], who use features trained with an inverse dynamics task.
These correspond roughly to the VAE and IDF methods detailed in Section 2.1.

Smoothed versions of state visitation counts can be used for intrinsic rewards [3, 9, 24, 47].
Count-based methods have already shown very strong results when combined with extrinsic rewards, such as setting the state of the art in the Atari game Montezuma's Revenge [3],
and also showing significant exploration of the game without using the extrinsic reward.
It is not yet clear in which situations count-based approaches should be preferred over dynamics-based approaches; we chose to focus on dynamics-based bonuses in this paper since we found them straightforward to scale and parallelize.
In our preliminary experiments, we did not have sufficient success scaling up existing count-based implementations for a large-scale study.

515:YAMAGUTIseisei
18/08/26 17:16:26.46 CL1hr8qnX BE:68536447-2BP(3)
Learning without extrinsic rewards or fitness functions has also been studied extensively in evolutionary computing, where it is referred to as 'novelty search' [17, 18, 43].
There the novelty of an event is often defined as the distance of the event to the nearest neighbor amongst previous events, using some statistics of the event to compute distances.
One interesting finding from this literature is that often much more interesting solutions can be found by not solely optimizing for fitness.

Other methods of exploration are designed to work in combination with maximizing a reward function, such as those utilizing uncertainty about value function estimates [5, 23], or those using perturbations of the policy for exploration [8, 29].
Schmidhuber [37] and Oudeyer [25], Oudeyer and Kaplan [26] provide a great review of some of the earlier work on approaches to intrinsic motivation.
Alternative methods of exploration include Sukhbaatar et al. [45], who utilize an adversarial game between two agents for exploration.
Gregor et al. [10] optimize a quantity called empowerment, which is a measure of the control an agent has over the state.
In concurrent work, Eysenbach et al. [7] use diversity as a measure to learn skills without reward functions.

Random Features:
One of the findings in this paper is the surprising effectiveness of random features, and there is a substantial literature on random projections and more generally randomly initialized neural networks.
Much of the literature has focused on using random features for classification [14, 33, 49] where the typical finding is that whilst random features can work well for simpler problems,
feature learning performs much better once the problem becomes sufficiently complex.
Whilst we expect this pattern to also hold true for dynamics-based exploration, we have some preliminary evidence showing that learned features appear to generalize better to novel levels in Mario Bros.


516:YAMAGUTIseisei
18/08/26 17:18:41.35 CL1hr8qnX BE:110146695-2BP(3)

5
Discussion

We have shown that our agents trained purely with a curiosity reward are able to learn useful behaviours: (a) the agent is able to play many Atari games without using any rewards;
(b) Mario is able to cross over 11 levels without reward;
(c) walking-like behavior emerges in the Ant environment;
(d) juggling-like behavior emerges in the Roboschool environment; and (e) rally-making behavior emerges in Two-player Pong with curiosity-driven agents on both sides.
However, this is not always the case: in some Atari games exploring the environment does not correspond to extrinsic reward.

More generally, these results suggest that, in environments designed by humans, the extrinsic reward is perhaps often aligned with the objective of seeking novelty.
Game designers set up curricula to guide users while playing, which explains why a curiosity-like objective aligns reasonably well with the extrinsic reward in many human-designed games [6, 12, 16, 48].



[Plot: extrinsic reward per episode vs. frames (millions) in the Unity maze; curves for RF with TV off, RF with TV on, IDF with TV off, and IDF with TV on.]
Figure 6:
We add a noisy TV to the Unity environment of Section 3.3.
We compare IDF and RF with and without the TV.

517:YAMAGUTIseisei
18/08/26 17:19:53.47 CL1hr8qnX BE:78326584-2BP(3)
Limitation of prediction-error-based curiosity:
A more serious potential limitation is the handling of stochastic dynamics.
If the transitions in the environment are random, then even with a perfect dynamics model, the expected reward will be the entropy of the transition, and the agent will seek out transitions with the highest entropy.
Even if the environment is not truly random, unpredictability caused by a poor learning algorithm, an impoverished model class or partial observability can lead to exactly the same problem.
We did not observe this effect in our experiments on games so we designed an environment to illustrate the point.

We return to the maze of Section 3.3 to empirically validate a common thought experiment called the noisy-TV problem.
The idea is that local sources of entropy in an environment like a TV that randomly changes channels when an action is taken should prove to be an irresistible attraction to our agent.
We take this thought experiment literally and add a TV to the maze along with an action to change the channel.
In Figure 6 we show how adding the noisy-TV affects the performance of IDF and RF.
As expected, the presence of the TV drastically slows down learning, but we note that if the experiment is run for long enough, the agents do sometimes converge to obtaining the extrinsic reward consistently.
We have shown empirically that stochasticity can be a problem, and so it is important for future work to address this issue in an efficient manner.
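As an illustration only, here is a minimal sketch of such a 'noisy TV' written as a wrapper in the classic Gym interface; the patch size and location, the use of a no-op as the underlying action, and the uint8 image observations are all assumptions rather than the actual Unity setup.

import gym
import numpy as np

class NoisyTVWrapper(gym.Wrapper):
    """Adds one extra 'change channel' action: when chosen, a patch of the observation
    is replaced with fresh random noise and a no-op is sent to the underlying env."""
    def __init__(self, env, patch=(0, 0, 28, 28), noop_action=0):
        super().__init__(env)
        self.patch = patch
        self.noop_action = noop_action
        self.channel_action = env.action_space.n            # id of the new action
        self.action_space = gym.spaces.Discrete(env.action_space.n + 1)
        self._noise = None

    def _with_tv(self, obs):
        if self._noise is None:
            return obs
        y, x, h, w = self.patch
        obs = obs.copy()
        obs[y:y + h, x:x + w] = self._noise                 # paste the current "channel"
        return obs

    def step(self, action):
        if action == self.channel_action:
            y, x, h, w = self.patch
            c = self.env.observation_space.shape[-1]
            self._noise = np.random.randint(0, 256, size=(h, w, c), dtype=np.uint8)
            action = self.noop_action                       # the TV action does nothing else
        obs, reward, done, info = self.env.step(action)
        return self._with_tv(obs), reward, done, info

    def reset(self, **kwargs):
        self._noise = None
        return self.env.reset(**kwargs)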

518:YAMAGUTIseisei
18/08/26 17:22:09.20 CL1hr8qnX BE:44059829-2BP(3)
Future Work:
We have presented a simple and scalable approach that can learn nontrivial behaviors across a diverse range of environments without any reward function or end-of-episode signal.
One surprising finding of this paper is that random features perform quite well, but learned features appear to generalize better.
Whilst we believe that learning features will become important once the environment is complex enough, we leave that up to future work to explore.
Our wider goal, however, is to show that we can take advantage of many unlabeled (i.e., not having an engineered reward function) environments to improve performance on a task of interest.
Given this goal, showing performance in environments with a generic reward function is just the first step, and future work could investigate transfer from unlabeled to labeled environments.

Acknowledgments

We would like to thank Chris Lu for helping with the Unity environment, Phillip Isola and Alex Nichols for feedback on the paper.
We are grateful to the members of BAIR and OpenAI for fruitful discussions.
DP is supported by the Facebook graduate fellowship.

References

[1] Unity ML-agents. URLリンク(github.com). 2





[2] J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning.
arXiv:1703.01732, 2017. 3, 9
[3] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos.
Unifying count-based exploration and intrinsic motivation. In NIPS, 2016. 1, 9
[4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling.
The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, June 2013. 2

519:YAMAGUTIseisei
18/08/26 17:23:49.66 CL1hr8qnX BE:51402637-2BP(3)
[5] R. Y. Chen, J. Schulman, P. Abbeel, and S. Sidor.
UCB and infogain exploration via q-ensembles. arXiv:1706.01502, 2017. 9
[6] G. Costikyan. Uncertainty in games. Mit Press, 2013. 6, 10
[7] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine.
Diversity is all you need: Learning skills without a reward function. arXiv preprint, 2018. 9
[8] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg.
Noisy networks for exploration. arXiv:1706.10295, 2017. 9
[9] J. Fu, J. D. Co-Reyes, and S. Levine.
EX2: Exploration with exemplar models for deep reinforcement learning. NIPS, 2017. 9
[10] K. Gregor, D. J. Rezende, and D. Wierstra.
Variational intrinsic control. ICLR Workshop, 2017. 9
[11] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel.
Vime: Variational information maximizing exploration. In NIPS, 2016. 1, 9
[12] R. Hunicke, M. LeBlanc, and R. Zubek.
MDA: A formal approach to game design and game research. In AAAI Workshop on Challenges in Game AI, 2004. 6, 10
[13] S. Ioffe and C. Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 4
[14] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al.
What is the best multi-stage architecture for object recognition?
In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146-2153. IEEE, 2009. 9
[15] D. P. Kingma and M. Welling.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 2, 3
[16] N. Lazzaro. Why we play games:
Four keys to more emotion in player experiences. In Proceedings of GDC, 2004. 6, 10
[17] J. Lehman and K. O. Stanley.
Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, 2008. 9
[18] J. Lehman and K. O. Stanley.
Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 2011. 9

520:YAMAGUTIseisei
18/08/26 17:25:07.68 CL1hr8qnX BE:51402637-2BP(3)
[19] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer.
Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, 2012. 9
[20] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer.
Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, 2012. 1
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 2015. 1
[22] S. Mohamed and D. J. Rezende.
Variational information maximisation for intrinsically motivated reinforcement learning. In NIPS, 2015. 1
[23] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy.
Deep exploration via bootstrapped dqn. In NIPS, 2016. 9
[24] G. Ostrovski, M. G. Bellemare, A. v. d. Oord, and R. Munos.
Count-based exploration with neural density models. arXiv:1703.01310, 2017. 1, 9
[25] P.-Y. Oudeyer.
Computational theories of curiosity-driven learning. arXiv preprint arXiv:1802.10546, 2018. 9




[26] P.-Y. Oudeyer and F. Kaplan.
What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 2009. 1, 9
[27] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell.
Curiosity-driven exploration by self-supervised prediction. In ICML, 2017. 1, 2, 3, 4, 6, 9
[28] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell.
Zero-shot visual imitation. In ICLR, 2018. 1
[29] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz.
Parameter space noise for exploration. arXiv:1706.01905, 2017. 9
[30] P. Poupart, N. Vlassis, J. Hoey, and K. Regan.
An analytic solution to discrete bayesian reinforcement learning. In ICML, 2006. 1

521:YAMAGUTIseisei
18/08/26 17:27:09.49 CL1hr8qnX BE:29373034-2BP(3)
[31] D. J. Rezende, S. Mohamed, and D. Wierstra.
Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014. 3
[32] R. M. Ryan and E. L. Deci.
Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary Educational Psychology, 2000. 1
[33] A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng.
On random weights and unsupervised feature learning. In ICML, pages 1089-1096, 2011. 9
[34] J. Schmidhuber.
Curious model-building control systems. In Neural Networks, 1991. 1991 IEEE International Joint Conference on, pages 1458-1463. IEEE, 1991. 9
[35] J. Schmidhuber.
A possibility for implementing curiosity and boredom in model-building neural controllers.
In From animals to animats: Proceedings of the first international conference on simulation of adaptive behavior, 1991. 1
[36] J. Schmidhuber.
A possibility for implementing curiosity and boredom in model-building neural controllers, 1991. 9
[37] J. Schmidhuber.
Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2010. 9
[38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 4
[39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 2
[40] S. P. Singh, A. G. Barto, and N. Chentanez.
Intrinsically motivated reinforcement learning. In NIPS, 2005. 1
[41] L. Smith and M. Gasser.
The development of embodied cognition: Six lessons from babies. Artificial life, 2005. 1
[42] B. C. Stadie, S. Levine, and P. Abbeel.
Incentivizing exploration in reinforcement learning with deep predictive models. NIPS Workshop, 2015. 2, 9
[43] K. O. Stanley and J. Lehman.
Why greatness cannot be planned: The myth of the objective. Springer, 2015. 9

522:YAMAGUTIseisei
18/08/26 17:28:31.63 CL1hr8qnX BE:66087893-2BP(3)
[44] S. Still and D. Precup.
An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 2012. 9
[45] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus.
Intrinsic motivation and automatic curricula via asymmetric self-play. In ICLR, 2018. 9
[46] R. S. Sutton and A. G. Barto.
Reinforcement learning: An introduction. MIT press Cambridge, 1998. 4
[47] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel.
#Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems, 2017. 9
[48] P. Wouters, H. Van Oostendorp, R. Boonekamp, and E. Van der Spek.
The role of game discourse analysis and curiosity in creating engaging and effective serious games by implementing a back story and foreshadowing. Interacting with Computers, 2011. 6, 10
[49] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang.
Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476-1483, 2015. 9




A
Implementation Details

We have released the training code and environments on our website 3.
For full details, we refer the reader to our code and video results on the website.

Pre-processing:
All experiments were done with pixels.
We converted all images to grayscale and resized them to 84x84.
We learn both the agent's policy and the forward dynamics function on a stack of historical observations [x_{t-3}, x_{t-2}, x_{t-1}, x_t] instead of only the current observation.
This is to capture partial observability in these games.
In the case of Super Mario Bros and Atari experiments, we also used a standard frameskip wrapper that repeats each action 4 times.
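A minimal sketch of this pre-processing pipeline using the classic Gym wrapper interface (obs, reward, done, info); the wrapper names and the example environment id are illustrative, not the released code.

import collections
import numpy as np
import cv2
import gym

class GrayResize84(gym.ObservationWrapper):
    """Grayscale the frame and resize it to 84x84, as described above."""
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(0, 255, shape=(84, 84, 1), dtype=np.uint8)
    def observation(self, obs):
        gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)[:, :, None]

class FrameSkip(gym.Wrapper):
    """Repeat each action 4 times, summing the rewards."""
    def __init__(self, env, skip=4):
        super().__init__(env)
        self.skip = skip
    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

class FrameStack(gym.Wrapper):
    """Stack the last 4 observations [x_{t-3}, ..., x_t] along the channel axis."""
    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = collections.deque(maxlen=k)
    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames, axis=-1)
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(self.frames, axis=-1), reward, done, info

# Example composition (environment id is illustrative):
# env = FrameStack(GrayResize84(FrameSkip(gym.make("BreakoutNoFrameskip-v4"))))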

