Главная
A I I E Transactions Use of Isolated Forecasts in Markov Decision Processes with Imperfect Information
Use of Isolated Forecasts in Markov Decision Processes with Imperfect Information
Brooks, D. M., Leondes, C. T.Насколько Вам понравилась эта книга?
Какого качества скаченный файл?
Скачайте книгу, чтобы оценить ее качество
Какого качества скаченные файлы?
Том:
6
Язык:
english
Журнал:
A I I E Transactions
DOI:
10.1080/05695557408974959
Date:
September, 1974
Файл:
PDF, 528 KB
Ваши теги:
 Пожалуйста, сначала войдите в свой аккаунт

Нужна помощь? Пожалуйста, ознакомьтесь с инструкцией как отправить книгу на Kindle
В течение 15 минут файл будет доставлен на Ваш email.
В течение 15 минут файл будет доставлен на Ваш kindle.
Примечание: Вам необходимо верифицировать каждую книгу, которую Вы отправляете на Kindle. Проверьте свой почтовый ящик на наличие письма с подтверждением от Amazon Kindle Support.
Примечание: Вам необходимо верифицировать каждую книгу, которую Вы отправляете на Kindle. Проверьте свой почтовый ящик на наличие письма с подтверждением от Amazon Kindle Support.
Связанные Подборки
0 comments
Вы можете оставить отзыв о книге и поделиться своим опытом. Другим читателям будет интересно узнать Ваше мнение о прочитанных книгах. Независимо от того, пришлась ли Вам книга по душе или нет, если Вы честно и подробно расскажете об этом, люди смогут найти для себя новые книги, которые их заинтересуют.
1

2

This article was downloaded by: [Virginia Tech Libraries] On: 14 March 2015, At: 05:22 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 3741 Mortimer Street, London W1T 3JH, UK A I I E Transactions Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uiie19 Use of Isolated Forecasts in Markov Decision Processes with Imperfect Information a D. M. Brooks & C. T. Leondes a b Sperry Support Services Bay , St. Louis, Mississippi b University of California , Los Angles, California Published online: 09 Jul 2007. To cite this article: D. M. Brooks & C. T. Leondes (1974) Use of Isolated Forecasts in Markov Decision Processes with Imperfect Information, A I I E Transactions, 6:3, 244251, DOI: 10.1080/05695557408974959 To link to this article: http://dx.doi.org/10.1080/05695557408974959 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, resellin; g, loan, sublicensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http:// www.tandfonline.com/page/termsandconditions Use of Isolated Forecasts in Markov Decision Processes with Imperfect Information D. M. BROOKS Sperry Support F e c e s Bay St. Louis, Wsassippi C. T. LEONDES Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 University of California Los Angles, California Abstract: The Markov Decision Process formulation and its application to processes in which there is uncertainty as to process state, or imperfect state information, are reviewed. The question of how to determine an optimal action if some form of intelligence or "forecast" of the process state for a single process stage is posed. An expression for the expected return for each available action is developed, as a perturbation to the basic process. The optimal action and value of the forecast are obtained by combining a Policy Iteration solution of the imperfect information process and evaluation of the policies for the perturbed process based on these imperfect information process parameters. The Markov Decision Process formulation has been studied extensively as a means for determining optimal decision policies for sequential processes. Solution algorithms based upon both linear and dynamic programming are discussed by Derman (1). For processes of long duration, Howard (2) developed the Policy Iteration Procedure, an algorithm for establishing the optimal decision for each process state. Further discussion of policy iteration is included in (3). In this algorithm, the set of extremal equations which represent the process is solved alternatively for the relative values of each state, with the policy fixed, and for the policy which optimizes return, with the relative values fixed. The procedure converges on the optimal policy and the corresponding relative values. Processes in which the process state cannot be observed in real time, or with complete accuracy, may in some instances still be treated by Markov Decision process techniques. Eckles (4) developed a dynamic programming method for optimizing a discrete parameter, nonstationary, finitestate Markov process. Smallwood and Sondik (5) studied a partially observable, finite horizon, Markov process and showed that the dynamic behavior of the information vector is itself a discretetime continuousstate Markov Received February 1974; revised July 1974. process. Brooks and Leondes (6) discussed a Markov process with state information lag. In some cases, a revised state space may be used, with states defined in terms of whatever process characteristics may actually be observed. If it is possible to determine a probability distribution relationship between the two state spaces, it may be possible to transform the known transition probability matrix for the original process to one which describes the process in terms of the revised state space. If so, the optimal policy for use with this imperfect state information may be determined using the Policy Iteration algorithm. The expected average return, or gain, of the process when only imperfect state information is available and the corresponding policy is used, will be lower than the expected average return with perfect information, of course. This raises the question of how to compare alternative state information sources and how to determine the optimal action when an improved observation is available for only a single stage of the process. In this paper, the latter question will be considered. Markov Decision Processes The Markov Decision Process model is applicable to processes which may be described in terms of a definitive set of process states, alternative actions available in each state, AIIE TRANSACTIONS, Volume 6, No. 3 transition probabilities among states for each action, and rewards received for each action and transition. The following recursive relationship describes the process dynamics: v(i,n)= max k where [ q(i, k) + p(ili k)v(i,n1) i 4 (i,k ) = ] ['I C p(j/i, k)rG/i,k ) i t21 process, k If a relationship can be established between states of the observed process and those of the underlying process, a relationship between transition probabilities can be deduced. Thus, Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 and: I v(i,n) represents the expected return, if an optimal policy is followed, starting from state i, with n steps remaining in the process. v(i,o) is a known boundary condition. p(j1i.k) is the probability of transition to state j, if the present state is i and action k is implemented. r(ili,k) is the reward received for a transition from i to j under action k. q(i,k) is the expected immediate reward for implementing action k i n state i. If the process is of long duration, with transition probabilities, actions, and rewards that are independent of time, Eq. [I] may be reduced to: g + v(i) = mtx [ p Cili,k )v(i) 4 (i,k) + i I . [31 In this equation v(i) is the relative value of starting the process in state i, the average return per state of the process, or gain, is represented by g, and the probability of the process being in state i by n(i), thus C n(i) q(i, k) L41 v(i,n> = n g + v ( i ) . [51 g = The Policy Iteration Procedure converges on an optimal stationary policy for this process. Imperfect State Information If the decisionmaker cannot observe the process state directly or in real time, he may have to select an action with imperfect state information, e.g., a subset of the parameters which define the state or delayed identification of the state at an earlier stage. Thus, he is observing a process which appears to move among a set of "observable states" defined in terms of observable conditions while the underlying process is actually undergoing transitions among the actual process states. If the observed process can be considered Markovian, with stationary transition probabilities and rewards, it can be described by Eq. [3] ,with modified parameters. Using primes to differentiate the revised state space and process parameters from those of the underlying     September 1947, AIIE TRANSACTIONS     where i f ) is the probability that the state of the underlying process is i, if the observed state is i t . ~ ( J ' / J ] is the probability that the observed state is jr, if the state of the underlying process is j. If one of these probability distributions is known, the other may be obtained using Bayes rule. Thus, if pCi/ir)is known n(ir)is the limiting state probability for the optimal policy. Similarly, q(ir,k) = 2i p (i/ir)q(i,k) . PI These equations, together with p(i/ir),which is assumed to be known, and the underlying process parameters, provide all the parameters necessary to determine an optimal policy for the Imperfect Information process and the corresponding gain. This gain cannot be greater than the optimal gain of the Perfect Information process. Similarly, it cannot be less than the gain resulting from use of the Perfect Information optimal policy in the Imperfect Information process (otherwise that would be the optimal policy). Forecasts One might consider the solution to the imperfect information process in which actions are selected based on certain observable parameters as a lower bound, and the corresponding perfect information process in which actions are selected based on perfect knowledge of the process state as an upper bound. In some instances, it may be possible through some additional effort to obtain supplementary data or intelligence which makes possible an improved estimate or forecast as to the actual state. Presumably, a solution algorithm based on the use of such forecasts would produce higher returns than the basic imperfect information algorithm. However, this will depend on the accuracy or validity of the forecasts and the structure of the return and transition matrices. If forecasts will be available and will be used in each stage of the process, we are simply dealing with still another imperfect information process using another state space. However, if a forecast is available for only a Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 single stage of the process, the value may be determined for that single stage of the process, assuming that the remaining stages are affected only to the extent that the immediate decision alters the immediate stage transition probabilities. It is usually assumed that there is a positive correlation between information and decision makingthat better information always supports improved decisions. However, it is also possible that the cost of additional information may not be consistent with the increase in return realized from the improved decisions. Nelson (7) provided an expression for the maximum expected return based on forecast f, eCf): p('j/xe(f.k)is the probability of outcome j, with forecast f and action k, and r(j/k) is the return from outcome j, with action k. He defined the value of the forecasts as the difference between the expected returns, weighted over all forecasts by the probability of the forecast, p m , and the expected return not using forecasts,e. Nelson also showed that the value of forecasts as defined above is nonnegative, i.e., use of the forecasts is better, averaged over all possible conditions. For some isolated values off, e m may be less than e . Moreover, if the cost of obtaining the forecasts is considered, it is possible for this cost to exceed the value of the forecasts. Equation 1101 is not always appropriate for comparing individual actions, however. Once improved information becomes available, returns based on long term statistics and random states are of questionable validity. A more valid comparison of the returns from the two approaches, with and without forecasts, for a single isolated decision, would use the same event probabilities based on the best available data (usually the data on which the forecast is based): e(f*) = C po'/f, k*) 41,k*) i k* represents k selected without benefit of forecasting. In the Markov decision process situation, with imperfect information, the use of forecast information for a single step of the process may be considered as a onestep perturbation of the long term process. The value of a forecast, by this interpretation, would be the increase in expected return resulting from use of the forecast as a basis for the single stage decision, followed by subsequent use of the optimal imperfect information process decisions. These returns must include the effect of altering the transition probabilities to the next stage of the process, as well as the onestep returns. If one looks first at the situation where the forecast is perfect, he has a one step perfect information process followed by a continuing process with imperfect information. In this situation, the long term expected reward is: 246 Substituting v(j: n1) = (nI) g' + vu') 1131 z q (i, k) + ( n  ~ ) ~+ ' p f/i, k) v(/) i' I An iterative procedure is not required to determine the expected rewards or value of the forecast, since all of the required terms are available already. g, v(ir) q(i, k), p('jr/i,k) were determined in the imperfect information analysis. To determine the optimal k, the term involving g'may be dropped since it is fixed by the imperfect information decision and is common to all forecasts. Indeed, the term "gain" has no significance for the transient or perturbed portion of the process. To establish the value of the forecast, it is necessary to determine the nstage expected return for the case in which the action is chosen by the imperfect information algorithm but the return is computed using the knowledge inherent in the perfect forecast:  v(ir9nlfl =[s(i,@ + j' p(il/i,@ VV: n1) ], .for I i =f vu') in both equations would be determined from the basic imperfect information algorithm. k_ is, of course, a function of it. The value of the perfect forecast, then, is Recalling that p(jl/i, k) = Zjp Q1/j)p(i/i, k), and that k* is therefore chosen using a weighted maximum over a group of states, it is clear that V(pfl is nonnegative. V(pfl differs from the value of perfect information in the continuous Markov decision process, since v(il), the imperfect information followon value, is involved rather than vQ). The logic for analysis of single step perfect forecasts is shown in Fig. 1. If the forecast is not perfect, two alternative approaches are available. One may base his decision procedure on the assumption that the forecast is perfect, as above, or incorporate the forecast accuracy expectation into the decision procedure. In the latter case, one must introduce the potential effects of incorrect state prediction information. It seems appropriate to assume an adequate base of statistical data on forecasting precision to allow construction of a probability distribution of actual state, conditioned on a given forecast state. If such statistical data is not available, AIIE TRANSACTIONS, Volume 6, No. 3 Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 In this case, the value of the forecast is with & as defined before, and assuming that s(i/f) represents the best available estimate for probable actual state. This expression is dependent on the state of the process, which determines &, as well as on the forecast f. The logic for analyzing this process is shown in Fig. 2. In this expression, also, the value of the forecast must be nonnegative, since the second term uses a decision chosen for maximization under broader and less restrictive conditions. However, limits on the applicability of the expression may be required with respect to accuracy of the forecast. As a minimum, the accuracy of the forecast should be higher than the base, which would be the imperfect information process state probability distribution, without the forecast. There may be some question as to how these accuracies are to be compared but one basis is the probability of the process actually being in the forecast state, as given by the conditional probability distribution for the forecast, s(i/f). This probability should be greater than the probability of being in that state given by the imperfect information distribution, p(i/il). s(i/' > p (i/if), for i = f . from I m p e r f e c t I n f o r m a t i o n process Find: Fig. 1. + Single step perfect forecast solution logic. p(jl/f,k)= &s(i/f)p(jl/i,k) 1 it will be necessary to adopt a subjectively assumed conditional probability distribution. When this is done, of course, the results are contingent on the validity of the assumption. Proceeding on the above assumption that the conditional state probability distribution is known, and designating the distribution as each f , k , I Find: I I s(i/' = probability of state i when the forecast i s 5 Find: k ~171 I If Find: = vCf,n) ng', and noting t h a t z s ( i / ' (n 1)g1=(nl)gf, Fig. 2. Single step imperfect information solution logic. September 1974, AIIE TRANSACTIONS In actual practice, since the optimal action k may vary with actual state i, and the accuracy distribution s(i/'may vary among the possible state forecasts, the value of the forecasts may also vary. It may well be that some forecasts have much higher values than others. The cost of the forecasts may also be either fixed or variable. Therefore, the net value of some forecasts may be marginal, or even negative. If possible, the administration of the forecasting service should be set up in such manner that only the forecasts with values above some suitable threshold are provided. Those forecasts would be acted upon and would be perturbations to the basic process; the other forecasts would be ignored and would therefore not affect the basic process. most of the period, but the return reflects a cost of repairs. The expected immediate returns for various states and actions as well as the expected state transition probability matrices are presented in Table 1. The transition probabilities are based on an assumed deterioration rate of 0.2 from the current state to the next lower state, if no maintenance is performed, and upon 95% and 99% probabilities of restoration to good operating condition for major overhaul and replacement, respectively. The transition probability structure is such that there are policies which represent drastic actions and which result in high probabilities of step changes between extreme states, rather than being limited to transitions primarily between adjacent states. Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 Example These methods are applicable to any process which may properly be represented as a Markov process, but primarily they would apply to continuing processes with indefinite horizons in which the state of the process may be classified into one of a finite number of discrete states. As an example of the application of these techniques, this section will be devoted to their use in analysis of a maintenancelreplacement policy situation. The process considered is one in which a complex system performs assigned missions, e.g., a space shuttle vehicle, a patrol submarine, or simply a factory machine. At periodic intervals, its condition may be assessed and maintenance1 replacementloperation type decisions made. At the time of decision, however, because of a requirement for advance planning and a time lag in effecting the desired decision, complete state information may not be available. In other words, maintenance or replacement actions taken on an emergency basis would be more costly than those accomplished on an orderly and planned basis. The parameters of the process and their significance are as follows. The state at any time will be classified into one of four operating conditions: 1 2 3 4 Good Fair Marginal Inoperative The return, which may be interpreted as either profit from use of the unit or as a unit of utility realized from the degree of success in mission accomplishment, is a function of the state. In each state, there are four decision or action alternatives: 1 2 3 4 Use with no maintenance Use with minor repairs Perform major overhaul Replace With either action 3 or 4, the unit is assumed to be out of operation for the next stage of the process. With minor repairs, the unit is assumed to be available for operation for 248 The procedure for analysis of decision policies for this maintenance process is identical to that discussed above. As a basis for comparison and to establish optimal policies for application in the analysis of forecasts, the perfect information process was first studied. The results are shown in Table 2. Only three iterations were required to establish the optimal policy, which is to continue operation as long as the system remains in good operating condition, to replace it if it is inoperative, or to perform a major overhaul if it is in either of the intermediate states. The expected gain or average return per stage of the process is 2.242. As indicated earlier, in the process being considered, there is a one period lag in the implementation of a decison. Thus the type of imperfect information might be described as a oneperiod state information lag. In this situation, the revised state space is defined in terms of the state existing and action taken at the preceding stage of the process. The probability distribution for the actual present state is AIIE TRANSACTIONS, Volume 6, No. 3 Table 2: Perfect information solution results. \~r= Iteration i k Relative value 1 2 3 4 1 1 1 1 46.4286 26.6667 9.5238 0 1 2 + p(i i, k)v(i) new I 4 k 18.91 18.76' 5 .40 6.02 14.56 14.56 14.56' 14.56' 1 3 4 4 12.65 6.00 0.32 1.78 7.04 6.94* 2.27" 1.42 2.24 2.24 2.24 2.24' 1 3 3 4 14.56, 6.76 1.03 12.65 6.00 0.35 7.04 6.94' 2.28" 2.24 2.24 2.24 1 3 3 1 .O 1.78 2.24" 4 *I *z 23.75' 4.00 1 5 1.20 22.25 5.72 13.01  3.05 14.56' 6.76 1.O 1 .O  1 2 3 4 1 3 4 4 12.3171 4.6951 0 0 1 1 12.3183 *3 2 3 4 3 3 4 4.6951 0.0364 0 2.24206 identical to the transition probability distribution for the preceding stage. p (i/ir, n) = p (i/i, k, n +1 ) . In this process, since the next state j' is determined by the present i and k, ~(ili',n), if j' corresponds to selected k pOr/ir,k, n) = q(i,k) 2.24085 3 Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 g 0.9048 0 ,otherwise. From these relationships, and Eqs. [7] and [8], state transition probabilities and rewards for the revised state space may be computed and then used in The Policy Iteration algorithm. September 1974, AIIE TRANSACTIONS 1.41  The results of analysis of a one period information lag (a one period lag in effectivity of the action decision, in this case) by policy iteration are indicated in Table 3. To perform the analysis, expected returns for each state (prior state and action) were determined and new transition probability matrices were prepared. For the initial iteration, policy 1 was used for each state; this iteration reflects the same gain as determined previously in iteration 1 of the perfect information analysis. (Constant use of the same policy produces the same average gain, irrespective of the analytical approach.) After five iterations, the optimal policies have been established and the gain determined to be 2.136. This represents approximately 5% reduction in expected return attributable to this type of imperfect infor mation. The optimal policy in most states is to continue operation. In certain states, those which represent having been in poor operating condition but takingno maintenance action at the prior stage, overhaul or replacement is the policy selected. 2.2145' 0.3242 5.2118 5.633 11.1188 6.601 12.7598 5.3139' 10.1786 12.2145 4.6861 1.1188 14.3803 4.3803 1.6922" 0.1590 5.3562 5.6608 10.9161 6.6610 12.5305 5.4998*10.1157 12.8469 13.8747 14.0719 10.9161 14.0719 1 1 6922 4.5002 1 .I901* 0.6283 5.4586 5.6886 10.7134 6.7209 12.3041 5.6857* 10.0527 12.6364 13.6826 13.7645 10.7134 13.7645 1 2 3 Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 1 1.1 188 1 2 3 V ,, 1 2 3  considered, there is an additional variation, which is not surprising. In general, the value of the forecasts, as determined by the difference in return as contrasted with the policy which would be followed, decreases as the forecast accuracy decreases. It might also be noted that the spread in returns between decisions diminishes, for each state, for forecast accuracy decreases, reflecting an increase in "entropy," and a smearing of previously precise boundaries. 13.136 13.136 14.3803 Perfect Forecasts 95% Forecast Accuracy .9161 90% Forecast Accuracy 1 1 .I901 4.3143 0.7134 3.7645 0.1658' 1.5809 5.7054 5.7442' 10.3080 6.8408 1 1 8514 6.0576 9.9269' 1 3.2292 10.3080 13.1487 10.1658 4.2558 1 2 3 12.1367 13.1487 0.3811 Single Stage Information Lag Forecast Solution Data s s s s (111) (2/1) (311) (411) s S s s (414) (314) (214) (114) 0.95 0.03 0.01 0.01 0.90 0.06 0.03 0.01 0.80 0.12 0.06 0.02 Table 4 contains results of using forecasts in the informationlag process. The numbers tabulated are based on relative values of the base process to which the case reverts at the next stage and on the state which was arbitrarily chosen to be zero. It is apparent that for some process states, the forecast makes possible a significant increase in return. For most forecasts, the optimal action is the same as for the corresponding state in the continuing perfect information process. For forecast 3, the indicated action differs, but the difference in value is small. At the lowest forecast accuracy Table 5 contains the forecast values for various combinations of forecast, forecast accuracy and informationlag state. Each column represents a specific decision, and therefore the states in which that decision would be made using the! informationlag algorithm. This table illustrates that the value depends on the observed process state as well as the forecast. It also shows that the values tend to diminish as forecast accuracy falls off. In this example, the same policy is optimal for all forecast accuracies from 90% up and only one decision is altered for 80% forecasts. This indicates that use of the alternative decison rule determination procedure, wherein the policy for perfect forecasts is used, would have been appropriate. Conclusion This paper has discussed a procedure for determining the optimal decision when supplementary observations are possible, in a Markov Decision Process with Imperfect State Information. The procedure involves enumeration of the returns for all decisions, as perturbations of the continuing AIIE TRANSACTIONS, Volume 6, No. 3 process, using data for the optimal policy as determined by the Policy Iteration Procedure applied to the basic process. It was shown that significant increases in return result in some process states, but that the increase may be zero in many states. This approach has potential value as a means of identifying those process states in which supplementary observations should be made. (6) (7) Brooks, D. M.and Leondex, C. T., "Markov Decision Processes with StateInformation Lag," Operations Research, 21, 904907 (1972). Nelson, R. R., and S. G. Winter, Jr., Weather Information and Economic Decision. A Preliminary Report, RM2620NASA, Rand Corporation ( August 1,1970). Acknowledgment The research in this paper was supported under AFOSR Grant 722166. References Downloaded by [Virginia Tech Libraries] at 05:22 14 March 2015 (1) (2) (3) (4) (5) Derman, C., Finite State Markovian Decision Processes, Academic Press, New York, N.Y. (1970). Howard, R. A., Dynamic Programming and Markov Processes, M.I.T. Press, Cambridge, Mass. (1960). Howard, R. A., Dynamic Probabilistic Systems, Vol. 11, John Wiley & Sons, New York, N.Y. (1971). Eckles, J. E., "Optimum Maintenance with Incomplete Information," Operations Research, 16,10581067 (1970). Smallwood, R. D. and Sondik, E. J., "The Optimum Control of Partially Observable Markov Processes over a Finite Horizon," Operations Research, 21,10711088 (1972). Dr. David M. Brooks is an Engineering Consultant, Sperry Support Services, Division of Sperry Rand, NASA Space Technology Laboratories, Bay St. Louis, Miss. His current activities involve research on data buoys systems for weather service. Dr. Brooks is a retired Navy Officer and has been with North American Rockwell. He holds a BS and MS from MIT and PhD from the University of California, Los Angeles. He is a member of the American Society of Naval Engineers and Sigma Xi. Dr. C. T. Leondes is Professor of Engineering Systems, University of California, Los Angeles. His current research interests are in dynamic systems control and operations research. Dr. Leondes received his BS, MS and PhD all from the University of Pennsylvania. He is a member of AIAA and AAS and a Fellow of IEEE.