The aggregation method presented in this paper is similar in concept to existing approaches to state
aggregation for MDPs in that it selects which states to aggregate according to a reward-based value.
Stochastic dynamic programming dynamically aggregates states according to an estimate of the
value function (Boutilier, Dearden, & Goldszmidt 2000).
Dietterich and Flann aggregate rectangular regions in a spatial state space based on values propagated
back from a goal-based reward function (Dietterich & Flann 1995).
The main differences between our work and these methods are that ours is designed specifically for partially
observable problems and acts exclusively on the time dimension of finite-horizon problems.
Li et al. provide an overview and analysis of various approaches to state abstraction in fully observable
Markov decision processes (Li, Walsh, & Littman 2006).
Systems that obey the Markov property are memoryless, meaning that the environment's response
at time t + 1 depends only on the state and action at time t, and the rest of the execution history can be forgotten.
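Stated in the usual MDP notation (with $s_t$ and $a_t$ denoting the state and action at time $t$; this notation is assumed here rather than drawn from the surrounding text), the Markov property can be written as
\[
\Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = \Pr(s_{t+1} \mid s_t, a_t),
\]
so conditioning on the full execution history adds nothing beyond the most recent state and action.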