Masaryk University Faculty of Informatics
Modern Probabilistic Verification
Habilitation Thesis Jan Křetínský
Brno, 2018
To Zuzka
Abstract
Hardware and software verification is a mature technology, which has been adopted in many areas where correctness of behaviour is critical. In contrast, its younger probabilistic sibling still struggles at many basic points: the properties of interest are often hard to specify, verification engines do not scale well with the size of the system, and the products of the process are difficult to use. We show how machine learning, automata theory, and revisiting established concepts may help to address these issues, based on recent results.
Shrnutí
Hardware and software verification has already established itself in a number of areas where the correct functioning of systems is absolutely crucial. In contrast, its younger, probabilistic sibling still struggles with basic requirements: it is not always easy to express the desired property of a system, usability suffers from excessive dependence on the size of the system, and the form of the results does not match practical needs. This habilitation thesis presents some recently opened possibilities of how machine learning, automata theory, and a reconsideration of some established notions and procedures can help in these respects.
Contents
I COMMENTARY 1
1 Introduction 3
1.1 Probabilistic verification..................... 3
1.2 Focus and structure of the thesis ................ 6
2 Learning to Control 7
2.1 State of the art........................... 7
2.1.1 Numerical verification.................. 7
2.1.2 Statistical model checking................ 9
2.1.3 Strategy Representation................. 12
2.2 Contributions........................... 13
2.2.1 Numerical verification.................. 13
2.2.2 Statistical model checking................ 16
2.2.3 Strategy representation.................. 18
2.3 Contributed papers and activities................ 20
3 From LTL to Automata 23
3.1 State of the art........................... 23
3.2 Contributions........................... 25
3.3 Contributed papers and activities................ 29
4 From Reachability and Expectation to Complex and Group-by Objectives 31
4.1 State of the art........................... 31
4.1.1 Logical and behavioural specifications......... 31
4.1.2 Group-by aggregate operators ............. 35
4.2 Contributions........................... 38
4.2.1 Logical and behavioural specifications......... 38
4.2.2 Group-by aggregate operators ............. 41
4.3 Contributed papers and activities................ 45
5 Conclusion 47
Bibliography 49
II SELECTED PAPERS 71
A Note on Copyright: Acknowledgment of the Publishers 72
A Note on Contribution: Explanation of Assumptions 74
Probabilistic bisimulation: Naturally on distributions (CONCUR 2014)
Verification of Markov decision processes using learning algorithms (ATVA 2014)
Counterexample explanation by learning small strategies in Markov decision processes (CAV 2015)
Unifying two views on multiple mean-payoff objectives in Markov decision processes (Log. Meth. in Comp. Sci. 2017, based on paper in LICS 2015)
Faster statistical model checking for unbounded temporal properties (ACM Trans. on Comp. Log. 2017, based on paper in TACAS 2016)
From LTL and limit-deterministic Büchi automata to deterministic parity automata (TACAS 2017)
One theorem to rule them all: A unified translation of LTL into ω-automata (LICS 2018)
Conditional value-at-risk for reachability and mean payoff in Markov decision processes (LICS 2018)
Value iteration for simple stochastic games: Stopping criterion and learning algorithm (CAV 2018)
Rabinizer 4: From LTL to your favourite deterministic automaton (CAV 2018)
PART I COMMENTARY
Chapter 1
Introduction
In this chapter, we introduce the general topic of the thesis and outline its structure. While the subsequent chapters can be read independently, they all refer to the notions introduced in this chapter.
1.1 Probabilistic verification
Probabilistic systems are abundant in many areas, ranging from telecommunication (randomized protocols), transportation (automotive, aerospace), operations research (queuing networks), and biology (signalling pathways) to daily-life appliances (embedded software controllers), to name just a few.
Since many of these systems are safety-critical, we need to ensure their proper behaviour. To this end, we construct models of these systems and then analyze them, e.g., to guarantee low consumption for resource-limited systems or a high mean time to failure for dependable systems, or to gain understanding of the complex behaviour of natural processes.
The process of model checking follows the general pattern depicted in Figure 1.1. There are two inputs to the procedure: a system and a property.
Firstly, the systems where probabilistic features are essential can be formalized in various ways, depending on which further features are present. In this thesis, we deal with the most fundamental models, namely
Markov chains (MC) [Mar06, Nor98] for fully stochastic systems,
Markov decision processes (MDP) [Bel57, Put94] capturing both stochasticity and non-determinism, either controllable or uncontrollable,
stochastic games (SG) [Sha53, Con90] with all three features present.
[Figure 1.1: a system with probabilistic behaviour, expressed as a probabilistic model M, and a property, expressed as a formula φ, are combined by the model-checking procedure, which outputs Yes/No/How much together with a witness/counterexample; the parts treated in Chapters 2, 3, and 4 are indicated.]

Figure 1.1: The general scheme of probabilistic model checking and the focus of the subsequent chapters
Besides, many richer formalisms are defined in terms of MC and MDP, e.g. stochastic timed automata (STA) [BBB+14] or probabilistic timed automata (PTA) [NPS13], designed to cope with additional timing issues. We will mostly focus on the basic models, such as MDP.
Secondly, the properties of interest range from the simplest properties of the system, e.g. reachability of a given state, to more complex properties defined by a quantitative structure over the system, e.g. long-run average reward [How60, Gil57], by a temporal formula, e.g. of linear temporal logic (LTL) [Pnu77], or by comparison to another system, e.g. bisimulation [Par81, LS89] or distance [CHR10]. Consequently, the property is usually formalized declaratively by a formula in a suitable logic or operationally by a behaviour of another system. We will mostly focus on the former.
As indicated in Figure 1.1, the model checking procedure combines and analyzes the input pair, yielding an answer to the question whether the
Table 1.1: Examples of properties of interest. The classic notions (line 1) have been extended to the probabilistic setting (line 2) and further generalized to more robust notions of quantitative probabilistic properties (line 3). Here P denotes probability, E expectation, R_i the reward obtained in the i-th step, and L(S) the language recognized by S.

                    LTL          average reward (mean payoff)                  trace equivalence
classic             φ            MP := liminf_{n→∞} (1/n) Σ_{i=1}^{n} R_i      L(S_1) = L(S_2)
probabilistic       P[φ] = 1     E[MP]                                         ∀L : P_1[L] = P_2[L]
quantitative pr.    P[φ] ≥ p     P[MP ≥ v] ≥ p                                 ∀L : |P_1[L] − P_2[L]| ≤ d
model satisfies the formula. However, for probabilistic systems, the Boolean notions of satisfaction and equivalence are not always useful. Rather, we need to refine them into truly quantitative notions, such as the extent of satisfaction, e.g. "with 95% chance the long-run average reward is between 0 and 42". For instance, even highly safety-critical systems such as nuclear plants, where each hardware component fails with a certain probability, do not satisfy their safety properties absolutely, but only with some (preferably high) probability. Computing this probability is the task of quantitative probabilistic model checking. Similarly, the probabilities of failures of the components are only empirically estimated, and the slightest imprecision in the estimates may determine whether two systems are or are not equivalent. Instead, we prefer to measure how much they differ. This can be captured by the quantitative notion of distance. Several examples of popular properties are depicted in Table 1.1.
Further, the answer may be documented by a witness, e.g. a motion-planning strategy to be implemented, or by a counterexample, e.g. a scheduling policy leading to a violation of mutual exclusion. This is crucial for the practical applicability of model checking and is the core point of controller synthesis.
The approach to the analysis depends not only on the already discussed types of the two inputs and of the output, but also on
the available knowledge of the model, ranging from
• full quantitative information, i.e. white-box models (assumed in most of the literature), to
• qualitative information only (without the exact quantities), to
• black-box models, which can only be simulated and their behaviour observed at runtime, and
the required guarantees on the result, ranging from
• precise/optimal, to
• ε-precise/ε-optimal (for a known or given error bound ε), to
• confidence intervals, typical in statistical model checking [YS02, ISOLA'16], and probably approximately correct (PAC) results, typical in learning [Val84, SLW+06], to
• best effort, which is mostly out of the scope of the thesis.

1.2 Focus and structure of the thesis
The thesis reports on recent(*) progress in making probabilistic verification more efficient, in particular more scalable despite the state-space-explosion problem, and more usable in that (i) the input formula reflects the desired property more accurately, and (ii) the output artefacts (a counterexample or the produced controller) are easier to understand, debug, or implement.
The content can be classified into three streams, which we outline in the order in which we explained Figure 1.1:
• Chapter 2 discusses how inexact approaches based on simulations and machine learning may improve both exact analysis and the quality of its output.
• Chapter 3 focuses on LTL and how to transform its formulae into different types of automata, allowing for their efficient analysis for each particular setting.
• Chapter 4 handles more complex properties and their extensions, combinations, and alternative interpretations.
The main novel aspects are thus
• bridging some of the gaps between formal and practically used methods,
• innovations in automata theory, leading to practical improvements, and
• identification of new specification concepts and their verification procedures, respectively.
The second part then lists some of the papers on which the discussed results are based.
(*) I.e., since obtaining the second Ph.D. in mid-2014.
Chapter 2
Learning to Control
In this chapter, we explain how imprecise techniques based on simulations and machine learning can enhance traditional verification and controller-synthesis techniques in several ways:
• improving scalability without compromising the result,
• improving scalability for relaxed statistical guarantees, particularly for black-box systems where stronger guarantees are not possible anyway,
• improving structure and size of the produced results.
2.1 State of the art
Verification offers a range of methods for solving various classes of probabilistic verification problems.
2.1.1 Numerical verification
For finite systems, graph algorithms are quite efficient, e.g. [CY95] for MC and MDP, but only work for purely qualitative questions. In the generally quantitative case (quantitative interpretation in line 2 of Table 1.1, or quantitative objectives such as mean payoff in the middle column), numerical methods are usually used, most notably
(i) dynamic programming [Put94], such as value iteration (VI), used in the most used probabilistic model checker PRISM [KNP02], or policy/strategy iteration (SI), or
(ii) linear programming (LP) as used in e.g. DiVinE [BBC+08].
On the one hand, LP provides precise results. On the other hand, it is slow for MDP and not applicable to SG. Since the repetitive evaluation of strategies in SI is often slow in practice, VI is usually preferred. For instance, PRISM [KNP02] and its branch PRISM-Games [CFK+13] use VI as the default option for MDP and SG, respectively. However, while SI is in principle a precise method, VI is an approximative method, which converges only in the limit. Unfortunately, there was no known stopping criterion for VI. Consequently, there were no guarantees on the results returned in finite time. Therefore, current tools stop when the difference between the two most recent approximations is low, and thus may return arbitrarily imprecise results [HM14]. A stopping criterion has been developed in [HM14] and published in parallel with our [atva'14a]; for details see Section 2.2.1.
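To make the above concrete, the following is a minimal sketch of VI for maximum reachability in an MDP, including the common but generally unsound stopping criterion based on the difference of consecutive iterates. The dictionary-based MDP encoding and the toy example are hypothetical illustrations, not the representation used in PRISM.

```python
# Minimal sketch of value iteration (VI) for maximum reachability in an MDP.
# Hypothetical encoding: mdp[s][a] is a list of (successor, probability) pairs.

def value_iteration(mdp, targets, epsilon=1e-6):
    values = {s: 1.0 if s in targets else 0.0 for s in mdp}
    while True:
        new_values = {}
        for s in mdp:
            if s in targets:
                new_values[s] = 1.0
            elif not mdp[s]:                     # absorbing non-target state
                new_values[s] = 0.0
            else:                                # Bellman update over actions
                new_values[s] = max(
                    sum(p * values[t] for t, p in mdp[s][a]) for a in mdp[s])
        # Naive stopping criterion used by many tools: stop once two consecutive
        # approximations are close.  As discussed above, this can return an
        # arbitrarily imprecise result, e.g. in the presence of end components.
        if max(abs(new_values[s] - values[s]) for s in mdp) < epsilon:
            return new_values
        values = new_values

# Hypothetical toy MDP: from s0, action "a" reaches the goal with probability 0.5.
example = {"s0": {"a": [("goal", 0.5), ("sink", 0.5)]}, "goal": {}, "sink": {}}
print(value_iteration(example, {"goal"}))   # {'s0': 0.5, 'goal': 1.0, 'sink': 0.0}
```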
Continuous systems are often transformed into finite systems. The methods typically include either some form of discretization with derived error bounds, as in e.g. MRMC [KZH+11] for continuous-time MDPs, or use more structured finite abstractions, such as regions and zones for (probabilistic extensions of) timed automata, as in e.g. Uppaal [BDL+06, DLL+15] or PRISM [KNP02].
All of these methods provide guarantees on the results. However, the price to pay is a non-trivial complexity with respect to the size of the system and the property. Furthermore, due to the curse of dimensionality, systems grow exponentially with the number of variables and components involved. Consequently, real systems mostly cannot be treated directly. This issue is attacked by many techniques:
• Firstly, there are symbolic techniques avoiding explicit treatment of the whole state space, e.g. [CHJS11, BKH99, ZSF12], and combinations of explicit and symbolic techniques [WBB+10, BBR14].
• Secondly, there are model-transformation techniques:
- Compositional techniques aim to analyse parts of the systems separately and combine the results to infer properties of the whole, e.g. [CDL+10, DH11, BKW14, CCD15] or our [HKK13].
- Abstraction techniques merge states, factoring out information irrelevant for the satisfaction of the particular property [KKNP10, HWZ08, HHWZ10] or our [SKC+15].
- Dual to abstractions, there are reduction techniques, delimiting and analysing only subsystems, thus considering only some behaviours. An example of a safe technique is the partial order (or symmetry) reduction [BGC04, GDF09]. Unfortunately, there are not too many safe reduction techniques. In contrast, non-safe reduction techniques are extensively used in practice. For instance, stochastic simulation is a very useful debugging technique. It can disprove properties and find deficiencies of the system. However, it cannot be simply used to prove properties and assure the system correctness.
2.1.2 Statistical model checking
Verification of MC, MDP, and related systems traditionally relies on numerical approaches. However, numerical analysis of the whole system is often inapplicable in practice:
(i) when the system is too large due to state space explosion, or
(ii) when the exact transitions are unknown (black-box systems).
In such cases, statistical approaches and simulation form a powerful alternative. The statistical approach typically consists in
1. observing (finitely many finitely long) simulation runs,
2. analysis of each run,
3. inferring properties of the system from statistics on the results of the analysis.
Since simulations can often be done very fast, this approach often scales very well. The increased performance, however, comes at the cost of providing only probabilistic guarantees on the result, i.e., the result of the analysis is correct only with probability 1 − ε for a user-given ε. In many contexts this is no limitation since ε can be made very small as the algorithms scale well, so that the uncertainty is negligible. This may be particularly the case when the model itself is validated with respect to the real system only with some certainty.
Statistical model checking (SMC) [YS02] has been successfully applied to various biological [JCL+09, PGL+13], hybrid [ZPC10, DDL+12, EGF12, Lar12] or cyber-physical [BBB+10, CZ11, DDL+13] systems and there is a substantial tool support available [JLS12, BDL+12b, BCLS13, BHH12].
2.1.2.1 Statistical model checking for MCs
Statistical model checking (SMC) [YS02] of Markov chains refers to algorithms with the following specification:
Specification of statistical model checking of Markov chains
Input:
• a finite black-box MC M (i.e., access to any desired finite number of sampled simulation paths of any desired finite lengths)
• a linear property φ
• a threshold probability p
• an indifference region ε > 0
• two error bounds α, β > 0
• possibly some characteristics of M from Table 2.1
Output: if P[M ⊨ φ] ≥ p + ε, return YES with probability at least 1 − α;
if P[M ⊨ φ] ≤ p − ε, return NO with probability at least 1 − β.
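For a bounded property, this specification can be met, for instance, by a fixed number of simulations derived from Hoeffding's inequality. The following is a minimal sketch of such a single-sampling procedure; the simulator `sample_run` and the run-level check `holds` are hypothetical stand-ins for the black-box access to M and the property φ.

```python
import math
import random

def smc_decide(sample_run, holds, p, eps, alpha, beta):
    """Single-sampling SMC sketch for a bounded linear property.
    If the true probability is >= p + eps, Hoeffding's inequality gives
    P[estimate <= p] <= exp(-2*n*eps^2) <= alpha for n >= ln(1/alpha)/(2*eps^2),
    and symmetrically with beta for the NO answer."""
    n = math.ceil(max(math.log(1 / alpha), math.log(1 / beta)) / (2 * eps ** 2))
    successes = sum(1 for _ in range(n) if holds(sample_run()))
    return "YES" if successes / n > p else "NO"

# Hypothetical black-box system: the target is reached with probability 0.7.
decision = smc_decide(
    sample_run=lambda: ["init", "target" if random.random() < 0.7 else "sink"],
    holds=lambda run: "target" in run,
    p=0.5, eps=0.05, alpha=0.01, beta=0.01)
print(decision)   # "YES" with probability at least 0.99
```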
Bounded and Unbounded Properties Most of the previous efforts in SMC have focused on the analysis of properties with bounded horizon [YS02, SVA04, YKNP06, JCL+09, JLS12, BDL+12b]. For such bounded properties (e.g. state r is reached with probability at most 0.5 in the first 1000 steps), statistical guarantees can be obtained in a completely black-box setting, where execution runs of the system can be observed, but no other information is available. Unbounded properties (e.g. state r is reached with probability at most 0.5 in any number of steps) are significantly more difficult, as a stopping criterion is needed when generating a potentially infinite execution run, and some information about the system is necessary for providing statistical guarantees. Table 2.1 presents an overview of the assumptions for the statistical analysis of unbounded properties, detailed below.
SMC of unbounded properties, usually "unbounded until" properties, was first considered in [HLMP04] and the first approach was proposed in [SVA05], but observed to be incorrect in [HJB+10]. Notably, in [YCZ10] two approaches are described. The first approach proposes to terminate sampled paths at every step with some probability p_term and re-weight the result accordingly. In order to guarantee the asymptotic convergence of this method,
Table 2.1: Statistical approaches organised by (i) the class of verified linear properties, and (ii) the required information about the Markov chain, where p_min is the minimal transition probability, |S| is the number of states, and λ is the second largest eigenvalue of the chain. Further, ◊ denotes the (unbounded) reachability and U the (unbounded) until of LTL.

property    | no info             | p_min       | |S|, p_min   | λ        | topology
bounded     | e.g. [YS02, SVA04]  |             |              |          |
◊, U        | ✗                   | [tacas'16]  | [atva'14a]   | [YCZ10]  | [YCZ10, HJB+10]
LTL, MP     | ✗                   | [tacas'16]  | [atva'14a]   |          |
the second eigenvalue λ of the chain must be computed, which is as hard as the verification problem itself. It should be noted that the method provides only asymptotic guarantees as the width of the confidence interval converges to zero. The correctness of [LP08] relies on the knowledge of the second eigenvalue λ, too. The second approach of [YCZ10] requires the qualitative knowledge, i.e. the chain's topology, which is used to transform the chain so that all potentially infinite paths are eliminated. In [HJB+10], a similar transformation is performed, again requiring knowledge of the topology. The (pre)processing of the state space required by the topology-aware methods, as well as by traditional numerical methods for Markov chain analysis, is a major practical hurdle for large (or unknown) state spaces. Another approach, limited to ergodic Markov chains, is taken in [RP09], based on coupling methods. There are also extensions of SMC to timed systems [DLL+15].
2.1.2.2 Statistical model checking for MDPs
The development of statistical model checking techniques for probabilistic models with nondeterminism, such as MDPs, has only been treated quite recently. In [BFHH11], properties are analysed for MDPs with spurious non-determinism, where the way it is resolved does not affect the desired property. In the case of general non-determinism, one approach is to give the non-determinism a probabilistic semantics, e.g., using a uniform distribution instead, as done for timed automata in [DLL+11a, DLL+11b, Lar13]. Others [LP12, HMZ+12, atva'14a] aim to quantify over all strategies and produce an ε-optimal strategy. The works of [HMZ+12] and [LP12] deal with the problem in the setting of bounded and discounted (and for the purposes of approximation thus bounded) properties, respectively. In the former work, candidates for optimal strategies are generated and gradually improved, but "at any given point we cannot quantify how close to optimal the candidate scheduler is" (cited from [HMZ+12]) and the algorithm "does not in general converge to the true optimum" (cited from [LST14]). Further, [LST14] randomly samples (compact representations of) strategies, but again focuses only on (time-)bounded properties.
There are also various practically efficient heuristics that, however, provide no or only very weak guarantees, often based on some form of learning [BT00, LL08, WT16, TT16, AY17, BBS08]. Even for MDP, the first PAC algorithm (limited to discounted reward) was given only in [SLW+06].
2.1.3 Strategy Representation
For systems with non-determinism, both counterexamples as well as witnesses are given as strategies, resolving the non-determinism. Representing the resulting strategy compactly is important in both cases since either (i) it needs to be implemented as a controller and must be simple enough, or (ii) it is a counterexample when trying to prove a property for all strategies and then the corresponding bug needs to be understood and fixed. There are several different classes of data structures and algorithms to represent strategies.
Firstly, in artificial intelligence, compact (factored) representations of MDP structure have been developed using dynamic Bayesian networks [BDG95, KK99], probabilistic STRIPS [KHW94], algebraic decision diagrams [HSAHB99], and also decision trees [BDG95]. Formalisms used to represent MDPs can, in principle, be used to represent values and strategies as well. In particular, variants of decision trees are probably the most used [BDG95, CK91, KP99]. For a detailed survey of compact representations see [BDH99].
Secondly, in the context of verification, MDPs are often represented using variants of (MT)BDDs [dAKN+00, HKN+03, MP04], and strategies by BDDs [WBB+10].
Thirdly, [AL09] uses a directed on-the-fly search to compute sets of most probable diagnostic paths. The notion of paths encoded as AND/OR trees has also been studied in [LL13] to represent probabilistic counterexamples visually as fault trees, and then derive causal relationships between events. Further, [WJV+13, DJW+14] compute a smallest set of guarded commands (of a PRISM-like language) inducing a violating subsystem, but, unlike other methods, do not provide a compact representation of the actual decisions needed to reach an erroneous state; moreover, a command-based counterexample does not always exist.
Finally, decision trees have been used in connection with real-time dynamic programming and reinforcement learning to represent the learned approximation of the value function [BD96, Pye03]. Learning a compact decision-tree representation of a strategy has been investigated in [SLT10] for the case of body sensor networks with discounted objectives.
2.2 Contributions
2.2.1 Numerical verification
In this section, we employ simulations and reinforcement learning to speed up dynamic-programming techniques, such as VI. The structure of the approach is as follows. Firstly, while VI traditionally approximates the value from below, we show how to modify it so that it approximates the value from above, too. Consequently, we obtain a stopping criterion for VI: when the difference between the under- and over-approximation is smaller than the desired precision, we can stop. Secondly, we show how the lower and upper bounds can be utilized to derive guarantees on generally unreliable methods based on simulation and learning, resulting in an efficient reduction technique.
Stopping criterion for VI Deriving an over-approximating VI sequence converging to the value is non-trivial already for MDP with the reachability objective. The reason is that the greatest fixpoint of the VI operator (the greatest solution to the Bellman equations) can be strictly greater than the actual value. This is illustrated in Figure 2.1.
In [atva'14a] we show that when the so-called end components (EC) [DA97] are each abstracted into a single state, there is only a single fixpoint of the VI operator. Hence both sequences converge to the same value. Consequently, the over-approximating sequence on such a modified system induces an over-approximating sequence on the original system that converges to the actual original value. This has been independently discovered also in [HM14] a few months after the first submission of [atva'14a].
In [CAV'18b] we extend this approach to SG. Here end components cannot be abstracted into single states since the values of the individual states are different. Instead, we consider varying temporary abstractions depending on the current approximations. This is illustrated in Figure 2.2.
In [Cav'17] we extend the stopping criterion for MDP with reachability of [atva'14a] to MDP with long-run average reward.
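The following is a minimal sketch of the resulting interval variant of VI for maximum reachability together with its stopping criterion, assuming that end components have already been collapsed (otherwise the upper bounds need not converge to the value, as Figure 2.1 shows); the MDP encoding is again a hypothetical illustration.

```python
# Sketch of VI with lower (L) and upper (U) bounds and a sound stopping
# criterion, assuming end components have already been collapsed.
# Hypothetical encoding: mdp[s][a] is a list of (successor, probability) pairs.

def interval_value_iteration(mdp, targets, precision=1e-6):
    lower = {s: 1.0 if s in targets else 0.0 for s in mdp}
    upper = {s: 0.0 if s not in targets and not mdp[s] else 1.0 for s in mdp}
    while max(upper[s] - lower[s] for s in mdp) > precision:
        for bound in (lower, upper):
            new = {}
            for s in mdp:
                if s in targets or not mdp[s]:   # target or absorbing: keep value
                    new[s] = bound[s]
                else:                            # Bellman update for this bound
                    new[s] = max(sum(p * bound[t] for t, p in mdp[s][a])
                                 for a in mdp[s])
            bound.update(new)
    # Both bounds are now within `precision` of the true value in every state.
    return {s: (lower[s] + upper[s]) / 2 for s in mdp}

example = {"s0": {"a": [("goal", 0.5), ("sink", 0.5)]}, "goal": {}, "sink": {}}
print(interval_value_iteration(example, {"goal"}))   # s0 gets value 0.5
```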
Asynchronous value iteration There are variants of VI, where we apply the VI operator "asynchronously", i.e. with varying frequencies for different states. For such asynchronous methods, nothing is known about the
i    L_i(t) = L_i({s,t})    U_i(t) = U_i(s)    U_i({s,t})
0    0                      1                  1
1    1/3                    1                  2/3
2    4/9                    1                  5/9
3    13/27                  1                  14/27

Figure 2.1: Example illustrating the iterations of the lower (L) and upper (U) bounds for maximum reachability of the double-circled target state. Left: An MDP (as a special case of SG) where the greatest fixpoint lim_{i→∞} U_i(t) = 1 is different from the value (1/2) due to the grayed EC {s, t}. This is due to the fact that U_i(t) depends on U_i(s) and vice versa and, intuitively, s and t mutually suggest the possibility of the value being up to 1, although this illusion is not based on real paths to the target of this measure. Right: The same MDP where the EC is "collapsed" into a single state, ensuring the convergence lim_{i→∞} U_i({s, t}) = 1/2. Bottom: The approximations illustrating the non/convergence in the first few steps.
speed of convergence. Yet, since we can guarantee rigorous lower and upper bounds, we know the current precision of the approximations. Therefore, we can apply techniques from reinforcement learning (Q-learning) [Wat89] or probabilistic planning (bounded real-time dynamic programming, BRTDP) [MLG05] and still obtain results with guaranteed precision.
Moreover, since the frequencies can be arbitrary, the result is often output without ever examining some states, as illustrated in Figure 2.3. Altogether, this approach allows us to focus on the most important part of the system and ignore the rest. Since the former can be orders of magnitude smaller, this reduction technique is a useful tool to fight the state-space explosion problem. For examples of the reductions in the size of the considered part of the state space, see Table 2.2.
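A minimal sketch of such a simulation-guided (BRTDP-style) asynchronous iteration is shown below: simulations follow actions that look best according to the current upper bound, and only the states visited by simulations are ever updated. The encoding is hypothetical, and end components are again assumed to have been collapsed.

```python
import random

# Sketch of BRTDP-style asynchronous VI for maximum reachability: simulations
# follow the most promising action w.r.t. the upper bound, and only visited
# states are updated.  Hypothetical encoding as before; end components are
# assumed to have been collapsed so that the upper bound converges.

def brtdp(mdp, init, targets, precision=1e-6, max_path_len=10**4):
    lower, upper = {}, {}

    def get(bound, s, default):
        if s not in bound:
            bound[s] = 1.0 if s in targets else (0.0 if not mdp[s] else default)
        return bound[s]

    def bellman(bound, s, default):
        return max(sum(p * get(bound, t, default) for t, p in mdp[s][a])
                   for a in mdp[s])

    while get(upper, init, 1.0) - get(lower, init, 0.0) > precision:
        path, s = [], init
        # Simulate a path from init, greedily following the upper bound.
        while s not in targets and mdp[s] and len(path) < max_path_len:
            act = max(mdp[s], key=lambda a: sum(p * get(upper, t, 1.0)
                                                for t, p in mdp[s][a]))
            path.append(s)
            s = random.choices([t for t, _ in mdp[s][act]],
                               [p for _, p in mdp[s][act]])[0]
        # Back-propagate Bellman updates along the visited states only.
        for s in reversed(path):
            upper[s] = bellman(upper, s, 1.0)
            lower[s] = bellman(lower, s, 0.0)
    return lower[init], upper[init]

example = {"s0": {"a": [("goal", 0.5), ("sink", 0.5)]}, "goal": {}, "sink": {}}
print(brtdp(example, "s0", {"goal"}))   # (0.5, 0.5)
```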
Figure 2.2: Left: Example of SG where states of an EC have different values. The Greek letters on the leaving arrows denote the values of the actions. Round and square states belong to the maximizer and minimizer, respectively. Right three figures: Correct collapsing in the cases where α < β, α > β, and α = β, respectively. In contrast to MDP, some actions of the EC leaving the collapsed part have to be removed.
Table 2.2: Reductions in the considered part of the state space on examples from the PRISM Benchmark Suite [KNP12]. The two columns display the size of the whole state space and the size of the part explored by our approach in order to approximate the value with precision 10^{-6}.
Example Number of states visited by
PRISM [KNP02] our [atva'14a]
zeroconf 4,427,159 977
wlan 5,007,548 1,995
firewire 19,213,802 32,214
mer 26,583,064 1,950
We provide and experimentally test this approach for MDP with reachability and LTL in [atva'14a], for MDP with long-run average reward in [Cav'17] and an SI variant in [atva'17], and for SG with reachability in [cav'18b]. We extend the approach to continuous time in [atva'18b]. We also consider a hybrid approach combining BRTDP and Monte Carlo tree search in [isola'18].
The basic algorithm of [HM14] is implemented in PRISM [BKL+17] and the learning approach of [atva'14a] in Storm [DJKV17] and, together with the extensions, in our distribution of PRISM(*). The extension for SG where the interleaving of players is severely limited (every EC belongs to one player only) is discussed in [Ujm15].
(*) Accessible at https://gitlab.lrz.de/i7/prism
Figure 2.3: Example of an MDP where the parts of the system in the clouds can be ignored if the required precision is 0.005. Choosing the thick actions is guaranteed to result in a 0.005-optimal strategy, no matter how the other non-determinism is resolved.
2.2.2 Statistical model checking
In this section, we employ simulations and reinforcement learning to analyze (partially) unknown systems.
2.2.2.1 Statistical model checking for MC
We present SMC algorithms for unbounded properties that, however, do not require much knowledge of the system. They are based on detecting that a simulation run reached a so-called bottom strongly connected component (BSCC), i.e. an EC of the MC regarded as an MDP. Then and only then we can deduce what the rest of the infinite run will be like and can thus terminate the run. Moreover, this also implies that such algorithms can be applied not only to reachability and unbounded-until properties, but can also be easily extended to LTL or mean payoff.
In [atva'14a], a priori bounds for the length of execution runs are calculated from the minimum transition probability p_min and the number of states |S| only. After a long enough trace, we can deduce that we are in a BSCC with very high probability. The length of the trace can be bounded (for a given confidence) by |S| and p_min. Indeed, if there is a way out of the BSCC, it is sufficient to take a path of length at most |S|, which has probability at least p_min^{|S|}.
Figure 2.4: Example of a Markov chain with non-trivial BSCC detection
However, without taking execution information into account, these bounds are exponential in the number of states and highly impractical, as illustrated in the example below.
In [TACAS'16] we further improve on this idea and declare that we have reached a BSCC if the same states are repeated for a long enough time. Since we observe the visited states, we do not need the size of the state space |S| as a bound, but only p_min. This is the first SMC algorithm that uses information obtained from execution prefixes. We illustrate and compare the two approaches on the following example.
Example. Consider the property of reaching state r in the Markov chain depicted in Figure 2.4. While the execution runs reaching r satisfy the property and can be stopped without ever entering any v_i, the finite execution paths without r, such as stuttutuut, are inconclusive. In other words, observing this path does not rule out the existence of a transition from, e.g., u to r, which, if existing, would eventually be taken with probability 1. This transition could have arbitrarily low probability, rendering its detection arbitrarily unlikely, yet its presence would change the probability of satisfying the property from 0.5 to 1. However, knowing that if there exists such a transition leaving the set, its transition probability is at least p_min = 0.01, we can estimate the probability that the system is stuck in the set {t, u} of states. Indeed, if existing, the exit transition was missed at least four times during the execution above, no matter whether it exits t or u. Consequently, the probability that there is no such transition and {t, u} is a BSCC is at least 1 − (1 − p_min)^4.
This means that in the approach of [TACAS'16], in order to get 99% confidence that {t, u} is a BSCC, we only need to see both t and u around 500 times on a run, since 1 − (1 − p_min)^500 = 1 − 0.99^500 ≈ 0.993. This is in stark contrast to a priori bounds that provide the same level of confidence, such as the (1/p_min)^{|S|} = 100^m runs required by [ATVA'14a], which is infeasible for the large m of our example. In contrast, the performance of the method of [TACAS'16] is independent of m. △
Consequently, many execution runs can be stopped quickly. Moreover,
since the number of execution runs necessary for a required confidence level is independent of the size of the state space, it is not very large even for highly confident results (think of opinion polls). Altogether, it efficiently fights the state space explosion for systems where strongly connected components are not too large.
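The core of this monitoring can be sketched as follows, with a hypothetical step function `next_state` standing in for the black-box simulator: a set of states is reported as a BSCC candidate once each of its states has been seen at least k times since the candidate was started, where k is derived from p_min and the desired confidence exactly as in the example above. This is a simplified sketch, not the full algorithm of [TACAS'16].

```python
import math

def required_visits(p_min, delta):
    """Smallest k such that an exit transition of probability >= p_min, if it
    existed, would have been missed k times with probability at most delta,
    i.e. (1 - p_min)^k <= delta.  E.g. p_min = delta = 0.01 gives k = 459,
    matching the "around 500" observations in the example above."""
    return math.ceil(math.log(delta) / math.log(1 - p_min))

def monitor_run(next_state, init, p_min, delta, max_steps=10**6):
    """Follow one simulated run (next_state is a hypothetical black-box step
    function) and stop once its suffix looks like a BSCC with confidence
    at least 1 - delta."""
    k = required_visits(p_min, delta)
    seen = {init}
    candidate = {init: 1}          # states of the current candidate with counts
    s = init
    for _ in range(max_steps):
        s = next_state(s)
        if s not in seen:          # a fresh state: the candidate restarts here
            seen.add(s)
            candidate = {s: 1}
        else:
            candidate[s] = candidate.get(s, 0) + 1
        if min(candidate.values()) >= k:
            return set(candidate)  # believed to be a BSCC
    return None                    # inconclusive within the step bound
```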
2.2.2.2 Statistical model checking for MDP
Further, [atva'14a] is also the first to consider SMC for MDP with unbounded properties. It explores (similarly to [HMZ+12]) the opportunities offered by learning-based methods, as used in fields such as planning or reinforcement learning. The algorithm assumes information limited to |S| and p_min and is based on delayed Q-learning (DQL) [SLW+06]. Like in the numeric algorithm above, it maintains both lower and upper bounds on the result and gradually improves them. Contrary to the numeric algorithm, these bounds are not guaranteed to be correct, but only probably approximately correct (PAC) since there is a non-zero probability that the empirical estimates of the behaviour are significantly incorrect. However, this probability can be set arbitrarily close to 0.
The crucial steps of [atva'14a] are (1) modifying the DQL algorithm with PAC guarantees of [SLW+06] from the discounted setting to the undiscounted setting, but where terminating states are reached almost surely, and (2) lifting this to general MDPs with ECs, where terminating states may not be reached. This relies on simulations and the detection of end components on the fly. This technique extends also to LTL objectives and thus also to both maximum and minimum probabilities.
Further, we consider MDP with qualitative knowledge only and treat a combination of ω-regular objectives and long-run average reward in [concur'18a].
2.2.3 Strategy representation
In this section, we employ simulations and decision-tree learning to post-process strategies so that they are (i) more understandable and (ii) more resource-efficient with respect to memory and time, i.e. smaller and faster to run. We stipulate that the size is the basic measure and can serve as a proxy for the others: simplicity and execution speed.
In [CAV'15a], we propose three steps to obtain smaller strategy representation. Each of them has a positive effect on the resulting size.
1. Obtaining a (possibly partially defined and non-deterministic) ε-optimal strategy. The ε-optimal strategies produced by standard methods, such as VI of PRISM [KP13], may be too large to compute and overly specific. Firstly, as argued in [atva'14a], typically only a small fraction of the system needs to be explored in order to find an ε-optimal strategy, whereas most states are reached with only a very small probability. Without much loss, the strategy may be left undefined there. For example, in the MDP depicted in Figure 2.3, the decisions in s (and in the clouds) are almost irrelevant for the overall probability of reaching 1 from init. Such a partially defined strategy can be obtained using [atva'14a].
2. Identifying important parts of the strategy. Given a strategy, the importance of a state s for reaching the goal is defined as P[◊s | ◊goal], i.e. the probability of visiting s conditioned on reaching the goal (a simulation-based estimator is sketched after this list). Let us shed some light on this definition. Observe that only a fraction of states can be reached while following the strategy, and thus have positive importance. On the unreachable states, which have zero importance, the definition of the strategy is useless. For instance, in the previous example, also the upper cloud was partially explored in order to find out whether it is better to take action up or down. However, if the resulting strategy is to use down and b, the information what to do in the upper cloud is useless. In addition, we consider the lower cloud to be of zero importance, too, since its states are never reached on the way to the target and thus cannot be utilized. Furthermore, apart from ignoring states with zero importance, it is desirable to partially ignore decisions that are unlikely to be made (in less important states such as s), and in contrast, stress the decisions in important states likely to be visited (such as init). The crucial notion of importance is obviously not computed exactly, but only estimated statistically by simulating the system under the given strategy.
3. Data structures for compact representation of strategies. The explicit representation of a strategy by a table of pairs (state, action to play) results in a huge amount of data since the systems often have millions of states. Therefore, a symbolic representation by binary decision diagrams (BDD) looks like a reasonable option. However, there are several drawbacks of using BDDs. Firstly, due to the bit-level representation of the state-action pairs, the resulting BDD is not very readable. Secondly, it is often still too large to be understood by a human, for instance due to a bad ordering of the variables. Thirdly, it cannot quantitatively reflect the differences in the importance of states. Of course, we can store decisions in states with importance
Table 2.3: Representation of strategies for several examples of the PRISM Benchmark Suite [KNP12]. On the left, we display the size of the state space and the maximum reachability probability. On the right, we display the sizes of the representations using explicit listings of sets, BDD, and decision trees (DT), and the relative precision of the strategy induced by the generated tree. The trees are pruned as much as possible to still keep this error below 1%. On the mer benchmark, PRISM mems out while outputting the strategy. The line below shows the case with a partial strategy computed by our method [BCC+14] discussed above, for precision 10^{-6}.

Example      States       Value      Explicit    BDD     DT    Rel.err(DT) %
firewire     481,136      1.0        479,834     4233    1     0.0
investor     35,893       0.958      28,151      783     27    0.886
mer          1,773,664    0.200016   MEM-OUT
  (partial)                          1887        619     13    0.00014
zeroconf     89,586       0.00863    60,463      409     7     0.106
above a certain threshold. However, we obtain much smaller representations and solve all three issues if we allow more variability and reflect the whole quantitative information by decision-tree learning, using entropy.
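The importance estimation announced in step 2 can be sketched as follows; `simulate_run` is a hypothetical function returning the state sequence of one terminated run under the fixed strategy, and the resulting estimates can then be used, e.g., as sample weights when learning the decision tree.

```python
import random
from collections import Counter

def estimate_importance(simulate_run, goal, num_runs=10000):
    """Estimate the importance P[reach s | reach goal] of each state s by
    simulating the system under a fixed strategy.  simulate_run is a
    hypothetical function returning the state sequence of one terminated run."""
    reached_goal = 0
    visits = Counter()
    for _ in range(num_runs):
        run = simulate_run()
        if goal in run:
            reached_goal += 1
            visits.update(set(run))          # count each state once per run
    if reached_goal == 0:
        return {}
    return {s: visits[s] / reached_goal for s in visits}

# Hypothetical runs of a toy system under some strategy; states that never
# appear on goal-reaching runs get importance 0 and their decisions can be
# dropped, while the remaining values can weight the decision-tree learning.
runs = [["init", "s", "goal"], ["init", "goal"], ["init", "sink"]]
print(estimate_importance(lambda: random.choice(runs), "goal", num_runs=1000))
```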
We demonstrate the efficiency of the approach for MDP in [CAV'15a], see Table 2.3. We also give examples of how the reasons for present bugs can be read off the decision trees.
Further, we modify the approach for non-stochastic games in [TACAS'18]. Interestingly, when applied to parametrized examples, the strategies for different values of the parameters are sometimes so similar that a generic solution can be read off. This suggests a potential of this technique for parametrized synthesis.
2.3 Contributed papers and activities
[atva'14a] Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelík, Vojtěch Forejt, Jan Křetínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Verification of Markov decision processes using learning algorithms. In ATVA, volume 8837 of LNCS, pages 98-114. Springer, 2014. Attached in Part 2 of the thesis.
[cav'15a] Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelik, Andreas Fellner, and Jan Křetínský. Counterexample explanation by learning small strategies in Markov decision processes. In CAV (1), volume 9206 of LNCS, pages 158-177. Springer, 2015. Attached in Part 2 of the thesis.
[Tacas'16] Przemyslaw Daca, Thomas A. Henzinger, Jan Křetínský, and Tatjana Petrov. Faster statistical model checking for unbounded temporal properties. In TACAS, volume 9636 of LNCS, pages 112-129. Springer, 2016.
Journal version published in ACM Transaction on Computational Logic, 18(2):12:1-12:25, 2017. Attached in Part 2 of the thesis.
[isola'16] Jan Křetínský. Survey of statistical verification of linear unbounded properties: Model checking and distances. In ISoLA (1), volume 9952 of LNCS, pages 27-45,2016.
[Cav'17] Pranav Ashok, Krishnendu Chatterjee, Przemyslaw Daca, Jan Křetínský, and Tobias Meggendorfer. Value iteration for long-run average reward in Markov decision processes. In CAV (1), volume 10426 of LNCS, pages 201-221. Springer, 2017.
[Atva'17] Jan Křetínský and Tobias Meggendorfer. Efficient strategy iteration for mean payoff in Markov decision processes. In ATVA, volume 10482 of LNCS, pages 380-399. Springer, 2017.
[tacas'18] Tomáš Brázdil, Krishnendu Chatterjee, Jan Křetínský, and Viktor Toman. Strategy representation by decision trees in reactive synthesis. In TACAS (1), volume 10805 of LNCS, pages 385-407. Springer, 2018.
[cav'18b] Edon Kelmendi, Julia Krämer, Jan Křetínský, and Maximilian Weininger. Value iteration for simple stochastic games: Stopping criterion and learning algorithm. In CAV (1), volume 10981 of LNCS, pages 623-642. Springer, 2018. Attached in Part 2 of the thesis.
[Dagstuhl'18] Nils Jansen, Joost-Pieter Katoen, Pushmeet Kohli, and Jan Křetinsky (eds.). Machine learning and model checking join forces (Dagstuhl seminar 18121). Dagstuhl Reports, 8(3):74-93, 2018.
[concur'18a] Jan Křetínský, Guillermo A. Pérez, and Jean-François Raskin. Learning-based mean-payoff optimization in an unknown MDP under ω-regular constraints. In CONCUR, volume 118 of LIPIcs, pages 32:1-32:16. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
[atva'18b] Pranav Ashok, Yuliya Butkova, Holger Hermanns, and Jan Křetínský. Continuous-time Markov decisions based on partial exploration. In ATVA, volume 11138 of LNCS, pages 317-334. Springer, 2018.
[isola'18] Pranav Ashok, Tomáš Brázdil, Jan Křetínský, and Ondřej Slámečka. Monte Carlo tree search for verifying reachability in Markov decision processes. In ISoLA. To appear, 2018.
Activities Combining learning and formal methods is currently a hot topic, discussed at venues for establishing new directions, such as Dagstuhl seminars. The author has co-organized one on this topic [Dagstuhl'18] and presented it in five other Dagstuhl seminars.
In 2018, the author has given invited talks on the topic at the Logic and Learning workshop at the Alan Turing Institute, at an FNRS seminar at the Brussels Free University, at the University of Twente, and at the FOPSS summer school Logic and Learning at FLoC in Oxford.
On this topic, the author has obtained a German Research Foundation grant Statistical Unbounded Verification in 2017 and has twice advanced to the second round of the ERC Starting Grant evaluation (scoring A and B).
Chapter 3
From LTL to Automata
In this chapter, we explain how automata theory can improve scalability of model checking for complex properties.
3.1 State of the art
The automata-theoretic approach [VW86] is a key technique for verification and synthesis of systems with linear-time specifications, such as formulae of linear temporal logic (LTL) [Pnu77]. It proceeds in two steps: first, the formula is translated into a corresponding automaton; second, the product of the system and the automaton is further analyzed. For an instantiation of the framework, see Figure 3.2. The size of the automaton is important as it directly affects the size of the product and thus largely also the analysis time; for deterministic automata and probabilistic model checking, this dependence is particularly direct. For verification of non-deterministic systems, mostly non-deterministic Büchi automata (NBA) are used [Cou99, DGV99, EH00, SB00, GO01, GL02, Fri03, BKRS12, DLLF+16] since they are typically very small and easy to produce.
In contrast to verification of non-deterministic systems, verification of probabilistic systems, such as Markov decision processes (MDP), or synthesis require either more involved techniques, e.g. [KPV06], restrictions to logical fragments, e.g. [AT04, BJP+12], or other types of automata than NBA as detailed below.
Probabilistic LTL model checking cannot profit directly from NBA. Even the qualitative question, whether a formula holds with probability 0 or 1, requires automata with at least a restricted form of determinism. Prime examples are the limit-deterministic (also called semi-deterministic) Büchi
[Figure 3.1: the MDP M (e.g. obtained from PEPA models or text via the PRISM language, possibly as an MTBDD) and the LTL formula φ (e.g. obtained from specification patterns or text) are the inputs; φ is translated, with an exponential blow-up, into a non-deterministic Büchi automaton and, with another exponential blow-up, into a deterministic Rabin automaton R_φ; the product M × R_φ is then analysed by MEC decomposition and evaluation, MEC collapsing, and a reachability computation (by LP, VI, SI, etc., enhanced by topological order, parallel computation, BRTDP, etc.), yielding Pr_max[M ⊨ φ].]

Figure 3.1: Traditional probabilistic LTL model checking for MDP, as implemented in e.g. PRISM
automata (LDBA) [CY88]. However, for the general quantitative questions, where the probability of satisfaction is computed, general limit-determinism is not sufficient. Instead, deterministic Rabin automata (DRA) have mostly been used [KNP02], see Figure 3.1. In principle, all standard types of deterministic automata are applicable here except for deterministic Büchi automata (DBA), which are not as expressive as LTL. However, other types of automata, such as deterministic Muller and deterministic parity automata (DPA), are typically larger than deterministic generalized Rabin automata (DGRA) in terms of the acceptance condition or the state space, respectively. Indeed, note that every DGRA can be written as a Muller automaton on the same state space with an exponentially-sized acceptance condition, and DPA are a special case of DRA and thus of DGRA.
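For concreteness, the first half of this pipeline, the product of an MDP with a deterministic automaton reading the labels of the visited states, can be sketched as follows (hypothetical encodings; one of several possible labelling conventions). The acceptance condition is simply carried over to the product, where accepting MECs are subsequently identified and their reachability is computed, as in Figure 3.1.

```python
# Sketch of the product M x A of an MDP and a deterministic automaton reading
# the labels of the visited MDP states.  Hypothetical encodings: mdp[s][a] is
# a list of (successor, probability) pairs, label(s) is the set of atomic
# propositions holding in s, and delta(q, letter) is the automaton transition
# function.  (Here the automaton reads the label of the state just entered.)

def product(mdp, label, delta, mdp_init, aut_init):
    init = (mdp_init, delta(aut_init, label(mdp_init)))
    prod, todo = {}, [init]
    while todo:
        s, q = todo.pop()
        if (s, q) in prod:
            continue
        prod[(s, q)] = {}
        for a in mdp[s]:
            successors = []
            for t, p in mdp[s][a]:
                target = (t, delta(q, label(t)))   # automaton moves on t's label
                successors.append((target, p))
                if target not in prod:
                    todo.append(target)
            prod[(s, q)][a] = successors
    return init, prod
```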
Recently, specific LDBA were also proved applicable to the quantitative setting [HLS+15].
LTL synthesis can also be solved using the automata-theoretic approach. Although DRA and DGRA transformed into games can be used here, the algorithms for the resulting Rabin games [PP06] are not very efficient in practice. In contrast, DPA may be larger, but in this setting they are the automata of choice due to the good practical performance of parity-game solvers [FL09, ML16, JBB+17, MSL18].
Types of translations. The translations of LTL to NBA, e.g., [VW86], are typically "semantic" in the sense that each state is given by a set of logical formulae and the language of the state can be captured in terms of semantics of these formulae. In contrast, the determinization of Safra [Saf88] or its improvements [Pit06, Sch09, TD14, FL15] are not "semantic" in the sense that they ignore the structure and produce trees as the new states that, however, lack the logical interpretation. As a result, if we apply Safra's determinization on semantically created NBA, we obtain DRA that lack the structure and, moreover, are unnecessarily large since the construction cannot utilize the original structure. In contrast, our previous work [KE12, KLG13, EK14] as well as some follow-ups [KV15, KV17] provide "semantic" constructions, often producing smaller automata.
3.2 Contributions
We provide direct semantic translations of LTL into several types of deterministic automata as well as several semantics-preserving transformations between them. The zoo of the main translations is depicted in Figure 3.2.
In [ATVA'14b] we implement our previous construction of D(G)RA [KE12] and in [fmsd'16] we provide its logic-based presentation, a mechanized Isabelle proof, and a fix of a previous bug. In [Cav'16], we simplify the construction and obtain LDBA of a particular kind that we prove applicable to quantitative LTL model checking (in contrast to general LDBA). In [ATVA'16] we implement both the construction and the model checking procedure. In [TACAS'17a, TACAS'17b] we provide two very different approaches to obtain DPA, applicable to LTL synthesis. The former transforms our LDBA, whereas the latter is a more efficient variation on the classical index appearance record. Both preserve the semantic description, allowing for further optimizations of the resulting automata. In [LlCS'18a], we finally provide an asymptotically optimal and unified translation of LTL into D(G)RA, LDBA, and NBA, which is additionally simpler and more systematic than the previous translations.
Figure 3.2: Translations of LTL into different types of automata. Automata names are strings of the form (N|D|LD)G?(B|R|P)A, where the symbols stand for non-deterministic, deterministic, limit-deterministic, generalized, Büchi, Rabin, parity, and automaton, respectively. Translations implemented in Rabinizer 4 [cav'18a] are indicated with a solid line. The traditional approaches are depicted as dotted arrows. The determinization of NBA to DRA is implemented in ltl2dstar [Kle], to LDBA in Seminator [BDK+17], and to (mostly) DPA in spot [DLLF+16]. The light gray area denotes the types of automata applicable to probabilistic LTL model checking, while the dark gray one denotes applicability to LTL synthesis.
Moreover, we provide mature tool support for all these operations and more. While Rabinizer 3 [atva'14b] implements only the translations to DGRA and DRA, Rabinizer 4 [cav'18a] implements all the translations depicted in Figure 3.2 with solid arrows. It improves all these translations, both algorithmically and implementation-wise, and moreover, features the first implementation of the translation of a frequency extension of LTL, for further details see Chapter 4. The tool outputs the automata in the Hanoi omega-automata (HOA) format, which we established in [cav'15b].
Further, in order to utilize the resulting automata for verification, Rabinizer 4 comes with our own distribution(*) of the PRISM model checker [KNP02], which allows for model checking MDP against LTL using not only DRA and
(*) Our distribution additionally features optimized data structures and algorithms, such as the BRTDP family discussed in Chapter 2.
DGRA, but also using LDBA, and against frequency LTL using so-called DGRMA [lpar'15], see Chapter 4. Finally, the tool can turn the produced DPA into parity games between the players with input and output variables. Therefore, when linked to parity-game solvers, it can also be used for LTL synthesis. Rabinizer 4 is freely available at http://rabinizer.model.in.tum.de together with an on-line demo, visualization, usage instructions, and examples.
Finally, the infrastructure of Rabinizer has been modularized and made easily re-usable as the library Owl [Atva'18a]. It has already demonstrated its re-usability in several projects, also without the involvement of the library authors. For instance, our experience with Master students has shown that a tool for a complex translation, such as [BDK+17], can be easily implemented using roughly 400 lines of code, achieving performance comparable to the original dedicated tool. We have also implemented Safra's determinization procedure from NBA to DPA. Although this procedure is often described as tedious to implement, it required only 60 lines of code in Owl for the algorithms and another 60 lines of code for simple data structures and integration into the pipeline of the tool.
Example. Consider the formula φ = Fa ∨ FG(b ∨ Fc). The classical approach to construct a non-deterministic automaton would yield the NBA below, which non-deterministically decides whether to wait for an a or not:
[automaton figure]
Safra's determinization transforms it into a deterministic automaton with several dozens of states. The state-of-the-art tools ltl2dstar [Kle] with ltl3ba [BKRS12] or spot [DLLF+16], employing sophisticated simplifications, yield an automaton of around five states. In contrast, Rabinizer yields an automaton with two states (the next one below) with a (transition-based) Rabin acceptance condition, not using any simplifications.
The approaches of [KE12, LlCS'18a] and [EK14, fmsd'16] perform this construction as the product of the following automata. First, the master automaton monitors the satisfaction of Fa only:
[automaton figure]
Secondly, the second disjunct FG(b ∨ Fc) is a prefix-independent property and thus is monitored by a set of slave automata. In the former case of [KE12, LlCS'18a], this is an automaton for G(b ∨ Fc) (on the left), which relays monitoring of Fc to another automaton (on the right):
[automaton figures]
These slave automata have two special features. Firstly, both automata have a "universal branching" in the initial state, meaning that a "copy" of each automaton is started in each step and we monitor whether we accept with all but finitely many copies (the left automaton, for a G-formula) and with infinitely many copies (the right automaton, for an F-formula), respectively. Secondly, whether the state Fc (of the left slave) is accepting or not depends on the runtime information provided by the slave on the right.
In the case of [EK14, fmsd'16], there is only one slave automaton for the whole G(b ∨ Fc), in which Fc is treated directly:
[automaton figure] △
3.3 Contributed papers and activities
[atva'14b] Zuzana Komárkova and Jan Křetínský. Rabinizer 3: Safraless translation of LTL to small deterministic automata. In ATVA, volume 8837 of LNCS, pages 235-241. Springer, 2014.
[cav'15b] Tomáš Babiak, František Blahoudek, Alexandre Duret-Lutz, Joachim Klein, Jan Křetínský, David Müller, David Parker, and Jan Strejček. The Hanoi omega-automata format. In CAV (1), volume 9206 of LNCS, pages 479-486. Springer, 2015.
[C AV'16] Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan Křetínský.
Limit-deterministic Büchi automata for linear temporal logic. In CAV (2), volume 9780 of LNCS, pages 312-332. Springer, 2016.
[Atva'16] Salomon Sickert and Jan Křetínský. Mochiba: Probabilistic LTL model checking using limit-deterministic Büchi automata. In ATVA, volume 9938 of LNCS, pages 130-137, 2016.
[Fmsd'16] Javier Esparza, Jan Křetínský, and Salomon Sickert. From LTL to deterministic automata - A safraless compositional approach. Formal Methods in System Design, 49(3):219-271, 2016. Based on a conference paper, which is a part of one of the author's PhD theses.
[TACAS'17a] Javier Esparza, Jan Křetínský, Jean-Francois Raskin, and Salomon Sickert. From LTL and limit-deterministic Büchi automata to deterministic parity automata. In TACAS (1), volume 10205 of LNCS, pages 426-442, 2017. Attached in Part 2 of the thesis.
[TACAS'17b] Jan Křetínský, Tobias Meggendorfer, Clara Waldmann, and Maximilian Weininger. Index appearance record for transforming rabin automata into parity automata. In TACAS (1), volume 10205 of LNCS, pages 443-460,2017.
[LlCS'18a] Javier Esparza, Jan Křetínský, and Salomon Sickert. One theorem to rule them all: A unified translation of LTL into ω-automata. In LICS, pages 384-393. ACM, 2018. Attached in Part 2 of the thesis.
[CAV'18a] Jan Křetínský, Tobias Meggendorfer, Salomon Sickert, and Christopher Ziegler. Rabinizer 4: From LTL to your favourite deterministic automaton. In CAV (1), volume 10981 of LNCS, pages 567-577. Springer, 2018. Attached in Part 2 of the thesis.
[atva'18a] Jan Křetínský, Tobias Meggendorfer, and Salomon Sickert. Owl: A library for ω-words, automata, and LTL. In ATVA, volume 11138 of LNCS, pages 543-550. Springer, 2018.
Activities The paper [fmsd'16] was originally invited to the Journal of the ACM and accepted for publication there. After acceptance, the authors discovered a bug, withdrew the paper, and fixed the bug later.
Merging the features of our PRISM distribution into the public release of PRISM, as well as linking the new version of Rabinizer, is the subject of current collaboration with the authors of PRISM. So far, PRISM newly supports only model checking using DGRA and linking to Rabinizer 3 [atva'14b] or any tool producing D(G)RA in the HOA format [cav'15b]. Strix [MSL18] is a tool for LTL synthesis that uses SI to solve games produced by Rabinizer 4. The tool won all categories of the LTL/TLSF track at SyntComp 2018(†). This shows that traditional explicit synthesis can be competitive once the sizes of automata drop by orders of magnitude compared to those produced by Safra-like constructions. This is possible due to the fundamentally different approach of Rabinizer 4.
The author has given an invited talk at Highlights'18 on the topic of [LlCS'18a]. In 2016, the author has obtained a German Research Foundation grant Verified Model Checkers on this topic.
(1) For the results of the competition, see http://www.syntcomp.org/syntcomp-2018-results/
Chapter 4
From Reachability and Expectation to Complex and Group-by Objectives
In Chapter 2 we have dealt with basic objectives such as reachability probability or expected long-run average reward. In Chapter 3 we have shown how the automata-theoretic methods can be used to extend the results to linear temporal logic by reduction to reachability. Each of the objectives is defined by a payoff function on runs of the system, e.g. long-run average reward, and then the results for each of the runs are combined into a single number using an "aggregate" operator, in the previous cases by the expectation operator. In this chapter, we discuss (1) more complex payoff functions and (2) more complex aggregate operators.
4.1 State of the art
4.1.1 Logical and behavioural specifications
There are two fundamentally different approaches to specifying and verifying properties of systems. Firstly, the logical approach makes use of specifications given as formulae of temporal or modal logics. Secondly, the behavioural approach exploits various equivalence or refinement checking methods, provided the specifications are given in the same formalism as implementations.
Probabilistic CTL Temporal logics are a convenient and useful formalism to describe behaviour of dynamical systems. In Chapter 3, we considered
a probabilistic interpretation of LTL, which allows us to quantify the probability of runs satisfying a given LTL formula. Similarly, probabilistic CTL (PCTL) [HS86, HJ94] is the probabilistic extension of the branching-time logic CTL [EH82], obtained by replacing the existential and universal path quantifiers with probabilistic operators, which allow us to quantify the probability of runs satisfying a given path formula. At first, the probabilities used were only 0 and 1 [HS86], giving rise to the qualitative PCTL (qPCTL). This has been extended to arbitrary values from [0,1] in [HJ94], yielding the (quantitative) PCTL. A simple example of a PCTL formula is ok U^{=1} (X^{≥0.9} finish), which says that on almost all runs we reach a state where there is a 90% chance to finish in the next step, and up to this state ok holds true. Like probabilistic LTL, PCTL formulae are interpreted over Markov chains where each state is assigned the subset of atomic propositions valid in that state.
The PCTL model checking problem has been studied both for finite and infinite Markov chains and decision processes, see e.g. [CY95, HK97, EY09, EKM06, BKS05]. Beside the model checking problem, it is interesting to study the satisfiability problem, asking whether a given formula has a model, i.e. whether there is a Markov chain satisfying it. If a model does exist, we also want to construct it. Apart from being a fundamental problem, it is a possible tool for checking consistency of specifications or for reactive synthesis. Indeed, the underspecified system together with the specification can be encoded in a formula; the model of such a formula yields a controller for the original system that satisfies the specification. The problem has been shown EXPTIME-complete for qualitative PCTL in the setting where we quantify over finite models (finite satisfiability) [HS86, BFKK08] as well as over generally countable models (infinite satisfiability) in our [BFKK08]. The problem for (the general quantitative) PCTL has remained open for decades.
The satisfiability problem for qPCTL and qPCTL* was investigated already in the early 80's [LS83, KL83, HS86], together with the existence of sound and complete axiomatic systems. The decidability for qPCTL over countable models also follows from these general results for qPCTL*, but the complexity was not examined until [BFKK08], showing it is also EXPTIME-complete, both for finite and infinite satisfiability.
As for the non-probabilistic predecessors of PCTL, the satisfiability problem is known to be EXPTIME-complete for CTL [EH82]; the same holds for the more general modal μ-calculus [BB87, FL79]. The complexity of the satisfiability problem has also been investigated for fragments of CTL [KV00] and of the modal μ-calculus [HKM06].
The PCTL strategy synthesis problem asks whether the non-determinism in a given Markov decision process can be resolved so that the resulting Markov chain satisfies the formula [BGL+04, KS08, BBFK06].
Frequency LTL Many properties specifying the desired behaviour, such as "the system is always responsive", can be easily captured by LTL. This logic is in its nature qualitative and cannot express quantitative linear-time properties such as "a given failure happens only rarely". To overcome this limitation, especially apparent for stochastic systems, extensions of LTL with frequency operators have been recently studied [BDL12a, BMM14]. Such extensions come at a cost, and for example the "frequency until" operator can make the controller-synthesis problem undecidable already for non-stochastic systems [BDL12a, BMM14]. It turns out [FK15, THY11, THHY12] that a way of providing significant added expressive power while preserving tractability is to extend LTL only by the "frequency globally" formulae G^{≥f} φ. Such a formula is satisfied if the long-run frequency of satisfying φ on an infinite path is at least f. The respective logic is called frequency LTL (fLTL). MDP controller synthesis for fLTL has been shown decidable for the fragment containing only the operator G^{=1} [FK15].
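For concreteness, one common way to formalize the frequency-globally operator on an infinite path π is the following sketch; [FK15] and [LPAR'15] fix the exact variant, e.g. whether liminf or limsup is used.

% Frequency-globally: the long-run fraction of positions whose suffix satisfies
% phi is at least f (pi^i denotes the suffix of pi starting at position i).
\pi \models \mathrm{G}^{\geq f}\varphi
\quad\text{iff}\quad
\liminf_{n\to\infty}\ \frac{1}{n}\,\bigl|\{\, i < n \mid \pi^{i} \models \varphi \,\}\bigr| \ \geq\ f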
Frequency LTL was studied in another variant in [BDL12a, BMM14] where a frequency until operator is introduced in two different LTL-like logics, and undecidability is proved for related problems. The work [BDL12a] also yields decidability with restricted nesting of the frequency until operator; as the decidable fragment in [BDL12a] does not contain frequency-globally operator, it is not possible to express many useful properties expressible in our logic. A logic that speaks about frequencies on a finite interval was introduced in [THY11], but the paper provides algorithms only for Markov chains and a bounded fragment of the logic.
As we see in Section 4.2.1, this is related to combining LTL and the mean-payoff objective. There are several works that combine mean-payoff objectives with e.g. logics or parity objectives, but in most cases only simple atomic propositions can be used to define the payoff [BCHJ09, BCHK11, CD11]. The work [BKKW14] extends LTL with another form of quantitative operators, allowing accumulated weight constraint expressed using automata, again not allowing quantification over complex formulas. Further, [ABK14] introduces a variant of LTL with a discounted-future operator.
Linear Distances The distance between processes s and t is typically formalized as sup_{p∈C} |p(s) − p(t)| where C is a class of properties of interest and p(s) is the quantitative value of the property p in process s [DGJP99].
This notion has been introduced in [DGJP99] for Markov chains and further developed in various settings, such as Markov decision processes [FPP04], quantitative transition systems [dAMRS07], or concurrent games [dAFS04].
Several kinds of distances have been investigated for Markov chains. On the one hand, [Aba13, DGJP99, vBW06, vBSW07, BBLM13c, BBLM13b, BBLM13a, GP11] lift the equivalence given by the probabilistic bisimulation of Larsen and Skou [LS89] into branching distances. On the other hand, there are linear distances. They are particularly appropriate when (i) we are interested in linear-time properties, and/or (ii) we want to estimate the distance based only on simulation runs of the system, i.e. in a black-box setting. (Recall that for branching distances, the underlying probabilistic bisimulation corresponds to a testing equivalence where not only runs from the initial state can be observed, but also the current state of the system can be dumped at any moment and system copies restarted from that state [LS89].)
There are two main linear distances traditionally considered for Markov chains: total variation distance and trace distance. Algorithms have been proposed for both of them in the case when the Markov chains are known (white-box setting).
Firstly, for the total variation distance in the white-box setting, [CK14] shows that deciding whether it is 1 can be done in polynomial time, but computing it is NP-hard and not known to be decidable, however, it can be approximated; [BBLM15b] considers this distance more generally for semi-Markov processes, provides a different approximation algorithm, and shows it coincides with distances based on (i) metric temporal logic, and (ii) timed automata languages.
Secondly, trace distance is based on the notion of trace equivalence, which can be decided in polynomial time [DHR08] (however, trace refinement on Markov decision processes is already undecidable [FKS16]). Variants of trace distance are considered in [JMLM14] where it is taken as a limit of finite-trace distances, possibly using discounting or averaging. In [BBLM15a] the finite-trace distance is shown to coincide with distances based on (i) LTL and (ii) LTL without U-operator, i.e., only using X-operator and Boolean connectives; it is also shown NP-hard and not known to be decidable, similarly to the total variation distance; finally, an approximation algorithm is shown (again in the white-box setting), where the over-approximants are branching-time distances, showing a nice connection between the branching and linear distances.
4.1.2 Group-by aggregate operators
The fundamental problem for MDP is to design a strategy resolving the non-deterministic choices so that the systems' behaviour is optimized with respect to a given objective function, or, in the case of multi-objective optimization, to obtain the desired trade-off. The classic objective function (in the optimization phrasing) or the query (in the decision-problem phrasing) consists of two parts. First, a payoff is a measurable function assigning an outcome to each run of the system. It can be real-valued, such as the long-run average reward (also called mean payoff), or a two-valued predicate, such as reachability. Second, the payoffs for single runs are combined into an overall outcome of the strategy, typically in terms of expectation. The resulting objective function is then for instance the expected long-run average reward, or the probability to reach a given target state.
In this chapter, firstly, we discuss ways of aggregating the results over single runs other than taking a single expectation. The motivation to do so is mainly to have more fine-grained control over the resulting performance, in particular with respect to infrequent ("tail") behaviour and/or several aspects (multiple objectives) at once. Secondly, we discuss how to connect the two parts yet more tightly in order to control the expectation and other aggregates at each time point. The motivation to do so is most apparent in population systems, such as many biological systems. In the contributions section we try to unify the philosophy of the two extensions as a "group-by" objective.
Risk-averse control aims to overcome one of the main disadvantages of the expectation operator, namely its ignorance towards the incurred risks, intuitively phrased as a question "How bad are the bad cases?" While the standard deviation (or variance) quantifies the spread of the distribution, it does not focus on the bad cases and thus fails to capture the risk. There are a number of quantities used to deal with this issue:
• The worst-case analysis (in the financial context known as discounted maximum loss) looks at the payoff of the worst possible run. While this makes sense in a fully non-deterministic environment and lies at the heart of verification, in the probabilistic setting it is typically unreasonably pessimistic, taking into account events happening with probability 0, e.g., never tossing head on a fair coin.
Risk-averse approaches optimizing the worst case together with expectation have been considered in beyond-worst-case and beyond-almost-sure analysis investigated in both the single-dimensional [BFRR17]
and in the multi-dimensional [CR15] setup.
• The value-at-risk (VaR) describes the value in the worst p-quantile for some p ∈ [0,1]. For instance, the value at the 0.5-quantile is the median; for the 0.05-quantile (the ventile) it is the value of the best run among the 5% worst ones. See Figure 4.1 for an example of VaR for two given probability density functions. As such it captures the "reasonably possible" worst case. There has been an extensive effort spent recently on the analysis of MDP with respect to VaR and the re-formulated notions of quantiles, percentiles, thresholds, satisfaction view etc., see below. Although VaR is more realistic, it tends to ignore the outliers too much, as seen in Figure 4.1 on the right. VaR has been characterized as "seductive, but dangerous" and "not sufficient to control risk" [Bed95].
The decision problem related to VaR has been phrased in probabilistic verification mostly in the form "Is the probability that the payoff is higher than a given value threshold more than a given probability threshold?", the so-called satisfaction semantics [BBC+14] as opposed to the expectation semantics. The total reward gained attention both in the verification community [UB13, HK15, BKKW17] and recently in the AI community [GWX17, LZB17]. Multi-dimensional percentile queries are considered for various objectives, such as mean-payoff, limsup, liminf, shortest path in [RRS17]; for the specifics of two-dimensional case and their interplay, see [BDD+14]; for reachability and LTL, see [EKVY08]. Percentile queries for more complex constraints have also been considered, namely their conjunctions [FKR95, BBC+14] or generally Boolean expressions [HKL17]. Some of these approaches have already been practically applied and found useful by domain experts [BDK+14b, BDK14a].
• The conditional value-at-risk (CVaR a.k.a. average value-at-risk, expected shortfall, expected tail loss) answers the question "What can I expect if things go wrong?" It is defined as the expectation over the whole worst p-quantile, see Figure 4.1. As such it describes the lossy tail, taking outliers into account, but with the respective weight. In the degenerate cases, CVaR for p = 1 is the expectation and for p = 0 the worst case. It is an established risk metric in finance, optimization and operations research, e.g. [ADEH99, RU00], and "is considered to be a more consistent measure of risk than VaR" [RU00]. Recently, it started permeating to areas closer to verification, such as
robotics [CCP16].
There is a body of work that optimizes CVaR in MDP. However, to the best of our knowledge, all the approaches (1) focus on the single-dimensional case, (2) disregard the expectation, and (3) treat neither reachability nor long-run average reward. They focus on the discounted [BOH], total [CCP16], or immediate [KFKT11] reward, as well as extend the results to continuous-time models [HG16, MY17]. This work comes from the area of optimization and operations research, with the notable exception of [CCP16], which focuses on the total reward, generalizing weighted reachability [CCP16]. However, it provides only an approximate solution for the one-dimensional case, neglecting expectation and the respective trade-offs. Further, CVaR is a topic of high interest in finance, e.g., [RU00, Bed95]. The central difference is that they consider variations of portfolios (i.e., the objective functions) while leaving the underlying random process (the market) unchanged. This is dual to our problem, since we fix the objective function and now search for an optimal random process (or the respective strategy).
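To make the two risk measures concrete, the following sketch (the helper function is illustrative and not taken from any cited work) estimates VaR and CVaR of a payoff distribution from samples; higher payoff is better, so the worst p-quantile is the lower tail.

import numpy as np

def var_cvar(payoffs, p):
    """Estimate value-at-risk and conditional value-at-risk of the worst
    p-quantile from sampled payoffs (lower payoff = worse outcome)."""
    xs = np.sort(np.asarray(payoffs, dtype=float))   # ascending: worst runs first
    k = max(1, int(np.ceil(p * len(xs))))            # size of the worst p-quantile
    var = xs[k - 1]                                  # value of the best run among the worst p-quantile
    cvar = xs[:k].mean()                             # expectation over the whole worst p-quantile
    return var, cvar

# Example: 10000 sampled long-run average rewards of some strategy.
rng = np.random.default_rng(0)
samples = rng.normal(loc=10.0, scale=3.0, size=10_000)
print(var_cvar(samples, p=0.05))

Note that for p = 1 the function returns the expectation as CVaR, and for p approaching 0 it returns the worst sampled case, matching the degenerate cases described above.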
Distribution transformers Recently, there has been interest in regarding probabilistic systems as deterministic transformers of probability distributions rather than individual stochastic processes. In the standard semantics of probabilistic systems, when a probabilistic step from a state to a distribution is taken, the random choice is resolved and we continue from one of the successor states. In contrast, under the distribution semantics the choice is not resolved and we continue from the distribution over the successors. Thus, instead of the current state, the transition changes the current distribution over the states. This semantics is adequate in many applications, such as systems biology, sensor networks, robot planning, etc. [Hen12, BBMR08, HMW09].
In [KVAK10], model checking MDP under this semantics is shown undecidable and a decidable subclass is identified. Approximative approaches to model checking MC are considered in [AAGT15]. A simpler problem of synchronization in MDP under the distribution semantics is solved in [DMS14a, DMS14b].
This semantics has been reflected in the study of bisimulations. The theory of bisimulations is a well-established and elegant framework to describe equivalence between processes based on their behaviour. The original definition was given for non-deterministic processes [Par81] and was
further extended to finite probabilistic systems in [LS89]. Since then many variants of bisimulations have been proposed and investigated, in particular an extension to non-deterministic probabilistic systems [SL94]. The distribution semantics has been considered for finite MC [DHR08], finite MDP [FZ14], and finite Markov automata (continuous-time systems) [SZG12]. Further, [CR11, Hen12, Cat05] consider standard bisimulations lifted to distributions, coinciding with the standard bisimulation when projected to Dirac distributions. Similarly, [EHZ10, EHK+13, DH13] consider a lifting that, however, differs from the state-based bisimulation in the weak case.
4.2 Contributions
4.2.1 Logical and behavioural specifications
Probabilistic CTL In [concur'18b], we address the satisfiability problem on fragments of PCTL. In order to get a better understanding of this ultimate problem, we answer the problem for several fragments of PCTL that are
• quantitative, i.e. involving also probabilistic quantification over arbitrary rational numbers (not just 0 and 1),
• step unbounded, i.e. not imposing any horizon for the temporal operators.
Besides, we consider models with unbounded size, i.e. countable models or finite models, but with no a priori restriction on the size of the state space. These are the three distinguishing features, compared to other works. Firstly, solutions for the qualitative PCTL have been given in [HS86, BFKK08] and for a more general qualitative logic PCTL* in [LS83, KL83]. Secondly, [CK16] shows decidability for bounded PCTL where the scope of the operators is restricted by a step bound to a given time horizon. Thirdly, the bounded satisfiability problem is to determine, whether there exists a model of a given size for a given formula. This problem has been solved by encoding it into an SMT problem [BFS12].
In particular, we show decidability of the satisfiability problem for several quantitative unbounded fragments of PCTL, focusing on future- and globally-operators (F,G), and discuss both finite and infinite satisfiability. Further, we identify a "smallest elegant" fragment where the problem remains open and the solution requires additional techniques.
Frequency LTL In [LPAR'15] we make a step towards the ultimate goal of a model checking procedure for the whole fLTL. We address the general quantitative setting with arbitrary frequency bounds p and consider the fragment LTL\GU, which is obtained from frequency LTL by preventing the U operator from occurring inside G or G^{≥f} formulas (but still allowing the F operator to occur anywhere in the formula). The approach we take is completely different from [FK15], where an ad hoc product MDP construction is used, heavily relying on the existence of certain types of strategies in the f = 1 case. In this paper we provide, to the best of our knowledge, the first translation of a quantitative logic to equivalent deterministic automata. This allows us to take the standard automata-theoretic approach to verification [VW86]: after obtaining the finite automaton, we do not deal with the structure of the formula originally given, and we solve a synthesis problem on a product of the single automaton with the MDP.
To the best of our knowledge, this paper gives the first decidability result for probabilistic verification against linear-time temporal logics extended by quantitative frequency operators with complex nested subformulas of the logic.
It works in two steps, keeping the same time complexity as for ordinary LTL. In the first step, an LTL\GU formula gets translated to an equivalent deterministic generalized Rabin automaton extended with mean-payoff objectives. This step is inspired by our previous work [KLG13], but the extension with auxiliary automata for G^{≥f} requires a different construction. The second step is the analysis of MDPs against a conjunction of limit inferior mean-payoff, limit superior mean-payoff, and generalized Rabin objectives. This result is obtained by adapting and combining several existing involved proof techniques of [BCFK13] and our [LICS'15].
Although our algorithm does not allow us to handle the extension of the whole LTL, the considered fragment LTL\GU contains a large class of formulas and offers significant expressive power. It subsumes the GR(1) fragment of LTL [BJP+12], which has found use in synthesis for hardware designs. The U operator, although not allowed within a scope of a G operator, can still be used for example to distinguish paths based on their prefixes.
Example. As an example synthesis problem expressible in this fragment, consider a cluster of servers where each server plays either a role of a load-balancer or a worker. On startup, each server listens for a message specifying its role. A load-balancer forwards each request and only waits for a confirmation whereas a worker processes the requests itself. A specification for a single server in the cluster can require, for example, that the following
formula (with propositions explained above) holds with probability at least 0.95:

((l U b) ⇒ G^{≥0.9}(r ⇒ X(f ∧ F c))) ∧ ((l U w) ⇒ G^{≥0.85}(r ⇒ (X p ∨ XX p)))

△
Previous work [THHY12] offered only a translation of a similar logic to non-deterministic "mean-payoff Büchi automata", noting that it is difficult to give an analogous reduction to deterministic "mean-payoff Rabin automata". The reason is that non-determinism is inherently present in the form of guessing whether the subformulas of G^{≥f} are satisfied on a suffix. Our construction overcomes this difficulty and offers equivalent deterministic automata, utilizing our [KE12, KLG13, LICS'15].
Linear distances In [concur'16], we introduce a simple framework for linear distances between Markov chains by the formula sup_{p∈C} |p(s) − p(t)| above. Here p(s) is the probability of satisfying p when starting a simulation run in state s. In other words, when p is seen as a language, it is the probability to generate a trace belonging to p.
We consider estimating distances only from simulating the systems, i.e. the black-box setting. One of the main difficulties is that the class C typically includes properties with arbitrarily long horizon or even infinite-horizon properties, whereas every simulation run is necessarily finite. Note that we do not want to employ here any simplifications such as an imposed fixed horizon or discounting, typically used for obtaining efficient algorithms, e.g., [DGJP99, vBW06, BBLM13b], and the undiscounted setting is fundamentally more complex [vBSW07]. Since even simpler tasks are impossible for unbounded horizon in the black-box setting without any further knowledge, it is assumed we know an upper bound on the size of the state space |S| and a lower bound on the minimum transition probability p_min, similarly to Section 2.2.2.
Depending on the class C, we obtain various interesting instantiations of our framework:
Example. One extreme choice is to consider all measurable languages, resulting in the total variation distance. The other extreme choices are to consider (1) only the generators of the σ-algebra, i.e. the cones, resulting in the finite-trace distance; or (2) only the elementary events, resulting in the infinite-trace distance. There are many possible choices for C between the two extremes above, such as classes of the Borel hierarchy, long-run average reward criteria, ω-regular languages, classes of automata, or sets of temporal formulae of a certain form (the class of ω-regular languages can also be given by monadic second-order logic or ω-automata), etc. △
We can now use statistics on finite simulation runs to (i) deduce information on the whole infinite runs and (ii) estimate the distance of the systems. For a particular distance function D_C, the goal is to construct an algorithm with the following specification:

Specification of D_C-distance estimation

Input:
• two finite black-box MCs M_1, M_2 (i.e., access to any desired finite number of sampled simulation paths of any desired finite lengths)
• confidence α ∈ (0,1)
• interval width δ ∈ (0,1)

Output: interval I such that |I| ≤ δ and P[D_C(M_1, M_2) ∈ I] ≥ 1 − α
In short, we show that the total variation distance cannot be estimated by simulating the systems, and that the finite-trace distance can be estimated. The former result is further exploited to show that the inestimability result already holds for clopen sets, Rabin automata, and LTL (even without the Until operator). However, it is also shown that the infinite-trace distance and distances for some fragments of LTL are estimable. Moreover, restricting the size of automata also yields estimability. Furthermore, assuming finite precision of transition probabilities, e.g. that they are given by at most two decimal digits, even the total variation distance can be estimated, exploiting the white-box algorithms. Under this assumption, trace equivalence can also be decided correctly with arbitrarily high probability.
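As an illustration of why the finite-trace distance is estimable in the black-box setting, the following simplified sketch computes an empirical distance over cones up to a fixed length k from sampled runs. The sampling interface (sample1, sample2) is an assumption for illustration, and the sketch gives no confidence guarantees; the actual algorithm of [concur'16] additionally derives k and the number of samples from |S|, p_min, α and δ.

from collections import Counter

def empirical_trace_distance(sample1, sample2, runs=10_000, k=5):
    """Crude estimate of the finite-trace distance (restricted to cones of
    length at most k) between two black-box Markov chains, each given as a
    function returning one sampled trace (a tuple of labels) of length k."""
    runs1 = [sample1(k) for _ in range(runs)]
    runs2 = [sample2(k) for _ in range(runs)]
    best = 0.0
    for l in range(1, k + 1):                  # all cone lengths up to k
        c1 = Counter(t[:l] for t in runs1)
        c2 = Counter(t[:l] for t in runs2)
        for trace in set(c1) | set(c2):        # empirical cone probabilities
            best = max(best, abs(c1[trace] - c2[trace]) / runs)
    return best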
4.2.2 Group-by aggregate operators
Risk-averse control A more focused view on differences among possible behaviours of a system can be taken by aggregating similar behaviours, tailoring an abstract view. We draw an analogy with database queries. Instead of looking at all data points, we can aggregate the data by functions such as AVG, corresponding to the expectation semantics, but also MIN when the
worst-case/sure-winning view is taken, or MAX in the analogous situation. Beyond that, we can aggregate in a more sophisticated manner. A query
select count(*) from runs
where value = true // alternatively, value > threshold
corresponds to the satisfaction semantics. A slight variation
select value, count(*) from runs group by value
corresponds more closely to the quantile analysis and CVaR analysis. One can even think of the distribution semantics as grouping according to the state at a particular moment:
select state[t], count(*) from runs group by state[t]
In [LICS'15], we unified the expectation semantics and the satisfaction semantics for the long-run average reward as follows. Intuitively, the problem we consider asks to optimize the expectation while ensuring the satisfaction. Formally, consider an MDP with n reward functions, a probability threshold vector p (or threshold p for the so-called joint interpretation [BBC+14]), and a reward threshold vector r. We consider the set of satisfaction strategies that ensure the satisfaction semantics, i.e., they ensure with probability at least p (or with the respective probabilities of p) that runs have a long-run average reward value vector of at least r. Then the optimization of the expectation is considered with respect to the satisfaction strategies. Note that if p is 0 (assuming non-negative rewards here for simplicity), then the set of satisfaction strategies is the set of all strategies and we obtain the traditional expectation semantics as a special case. We investigate the questions of algorithmic complexity, strategy complexity (in terms of memory and randomization), and trade-offs in the sense of a Pareto curve.
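Schematically, in the single-dimensional joint case the problem described above can be written as the following constrained optimization; this is a sketch in the notation of this section, the precise multi-dimensional statement is in [LICS'15].

% Optimize the expected long-run average reward over strategies sigma that
% satisfy the satisfaction-semantics constraint (single reward, joint case).
\sup_{\sigma}\ \mathbb{E}^{\sigma}\bigl[\mathrm{lr}\bigr]
\quad\text{subject to}\quad
\mathbb{P}^{\sigma}\bigl[\mathrm{lr} \ge r\bigr] \ \ge\ p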
Example. We list a few examples, illustrating the use of such combinations:
• For simple risk aversion, consider a single reward function modelling investment. Positive reward stands for profit, negative for loss. We aim at maximizing the expected long-run average while guaranteeing that it is non-negative with probability at least 95%. This is an instance with n = 1, p = 0.95, r = 0.
• For more dimensions, consider the example [Put94, Problems 6.1, 8.17]. A vendor assigns to each customer either a low or a high rank.
Further, there is a decision the vendor makes each year: either to invest money into sending a catalogue to the customer or not. Depending on the rank and on receiving a catalogue, the customer spends different amounts on the vendor's products and the rank can change. The aim is to maximize the expected profit provided the catalogue is almost surely sent with frequency at most f. Further, one can extend this example to only require that the catalogue frequency does not exceed f with 95% probability, but the 5% best customers may still receive catalogues very often.
• A gratis service for downloading is offered as well as a premium one. For each we model the throughput as rewards r_1, r_2 [...]

Let D(S) denote the set of probability measures (or probability distributions) over S. The following definition is similar to the treatment of [52].
Definition 1. A non-deterministic labelled Markov process (NLMP) is a tuple P = (S, L, {τ_a | a ∈ L}) where S is a measurable space of states, L is a measurable space of labels, and τ_a : S → Σ(D(S)) assigns to each state s a measurable set of probability measures τ_a(s) available in s under a.(1)
When in a state s ∈ S, the NLMP reads a label a ∈ L and non-deterministically chooses a successor distribution μ ∈ D(S) that is in the set of convex combinations(2) over τ_a(s), denoted by s →_a μ. If there is no such distribution, the process halts. Otherwise, it moves into a successor state according to μ. Considering convex combinations is necessary as it gives more power than pure resolution of non-determinism [43].
Example 1. If all sets are finite, we obtain probabilistic automata (PA), defined [43] as a triple (S, L, →) where → ⊆ S × L × D(S) is a probabilistic transition relation with (s, a, μ) ∈ → if μ ∈ τ_a(s).
Example 2. In the continuous setting, consider a random number generator that also remembers the previous number. We set L = [0,1], S = [0,1] × [0,1] and τ_x(⟨new, last⟩) = {μ_x} for x = new and ∅ otherwise, where μ_x is the uniform distribution on [0,1] × {x}. If we start with a uniform distribution over S, the measure of successors under any x ∈ L is 0. Thus in order to get any information about the system we have to consider successors under sets of labels, e.g. intervals.
For a measurable set A ⊆ L of labels, we write s →_A μ if s →_a μ for some a ∈ A, and denote by S_A := {s | ∃μ : s →_A μ} the set of states having some outgoing label from A. Further, we can lift this to probability distributions by setting μ →_A ν if μ(S_A) > 0 and ν = (1/μ(S_A)) ∫_{s∈S_A} ν_s μ(ds) for some measurable function assigning to each state s ∈ S_A a measure ν_s such that s →_A ν_s. Intuitively, in μ we restrict to states that do not halt under A and consider all possible combinations of their transitions; we scale up by 1/μ(S_A) to obtain a distribution again.
Example 3. In the previous example, let ν be the uniform distribution. Due to the independence of the random generator of previous values, we get ν →_L ν. Similarly, ν →_{[0.1,0.2]} ν_{[0.1,0.2]} where ν_{[0.1,0.2]} is uniform on [0,1] in the first component and uniform on [0.1, 0.2] in the second component, with no correlation.
Using this notation, a non-deterministic and probabilistic system such as NLMP can be regarded as a non-probabilistic, thus solely non-deterministic, labelled transition system over the uncountable space of probability distributions. The natural bisimulation from this distribution perspective is as follows.
Definition 2. Let (S, L, {τ_a | a ∈ L}) be an NLMP and R ⊆ D(S) × D(S) be a symmetric relation. We say that R is a (strong) probabilistic bisimulation if for each μ R ν and measurable A ⊆ L

1. μ(S_A) = ν(S_A), and
2. for each μ →_A μ′ there is a ν →_A ν′ such that μ′ R ν′.

We set μ ~ ν if there is a probabilistic bisimulation R such that μ R ν.

(1) We further require that for each s ∈ S we have {(a, μ) | μ ∈ τ_a(s)} ∈ Σ(L) ⊗ Σ(D(S)) and for each A ∈ Σ(L) and Y ∈ Σ(D(S)) we have {s ∈ S | ∃a ∈ A. τ_a(s) ∩ Y ≠ ∅} ∈ Σ(S). Here Σ(D(S)) is the Giry σ-algebra [27] over D(S).

(2) A distribution μ ∈ D(S) is a convex combination of a set M ∈ Σ(D(S)) of distributions if there is a measure ν on D(S) such that ν(M) = 1 and μ = ∫_{μ′∈D(S)} μ′ ν(dμ′).
Example 4. Considering Example 2, states {x} × [0,1] form a class of ~ for each x ∈ [0,1] as the old value does not affect the behaviour. More precisely, μ ~ ν iff the marginals of their first component are the same.
Naturalness. Our definition of bisimulation is not created ad hoc, as it often appears for relational definitions, but is actually an instantiation of the standard bisimulation for a particular coalgebra. Although this aspect is not necessary for understanding the paper, it is another argument for the naturalness of our definition. For the reader's convenience, we present a short introduction to coalgebras and the formal definitions in [31]. Here we only provide an intuitive explanation by example.
Non-deterministic labelled transition systems are essentially given by the transition function S → P(S)^L; given a state s ∈ S and a label a ∈ L, we can obtain the set of successors {s′ ∈ S | s →_a s′}. The transition function corresponds to a coalgebra, which induces a bisimulation coinciding with the classical one of Park and Milner [40]. Similarly, PA are given by the transition function S → P(D(S))^L; instead of successors there are distributions over successors. Again, the corresponding coalgebraic bisimulation coincides with the classical ones of Larsen and Skou [38] and Segala and Lynch [44].
In contrast, our definition can be obtained by considering states S′ to be distributions in D(S) over the original state space and defining the transition function to be S′ → ([0,1] × P(S′))^{Σ(L)}. The difference to the standard non-probabilistic case is twofold: firstly, we consider all measurable sets of labels, i.e. all elements of Σ(L); secondly, for each label set we consider the mass, i.e. an element of [0,1], of the current state distribution that does not deadlock, i.e. can perform some of the labels. These two aspects form the crux of our approach and distinguish it from other approaches.
3 Applications
We now argue by some concrete application domains that the distribution view on bisimulation yields a fruitful notion.
Memoryless vs. memoryfull continuous time. First, we reconsider the motivating discussion from Section 1 revolving around the difference between continuous time represented by real-valued clocks, respectively memoryless stochastic time. For this we introduce a simple model of stochastic automata [10].
Definition 3. A stochastic automaton (SA) is a tuple S = (Q, C, A, →, κ, F) where Q is a set of locations, C is a set of clocks, A is a set of actions, → ⊆ Q × A × 2^C × Q is a set of edges, κ : Q → 2^C is a clock setting function, and F assigns to each clock its distribution over R_{>0}.
Avoiding technical details, S has the following NLMP semantics P_S with state space S = Q × R^C, assuming it is initialized in some location q_0: When a location q is entered, for each clock c ∈ κ(q) a positive value is chosen randomly according to the distribution F(c) and stored in the state space. Intuitively, the automaton idles in location q with all clock values decreasing at the same speed until some edge (q, a, X, q′) becomes enabled, i.e. all clocks from X have value ≤ 0. After this idling time t, the action a is taken and the automaton enters the next location q′. If an edge is enabled on entering a location, it is taken immediately, i.e. t = 0. If more than one edge becomes enabled simultaneously, one of them is chosen non-deterministically. Its formal definition is given in [31]. We are now in the position to harvest Definition 2, to arrive at the novel bisimulation for stochastic automata.
Definition 4. We say that locations q_1, q_2 of an SA S are probabilistic bisimilar, denoted q_1 ~ q_2, if μ_{q_1} ~ μ_{q_2} in P_S where μ_q denotes the distribution over the state space of P_S given by the location being q, every c ∉ κ(q) being 0, and every c ∈ κ(q) being independently set to a random value according to F(c).
This bisimulation identifies q and q′ from Section 1, unlike any previous bisimulation on SA [10]. In Section 4 we discuss how to compute this bisimulation, despite its being continuous-space. Recall that the model initialized by q is obtained by first translating two simple CTMC and then applying the natural interleaving semantics, while the model of q′ is obtained by first applying the equally natural CTMC interleaving semantics prior to translation. The bisimilarity of these two models generalizes to the whole universe of CTMC and SA:
Theorem 1. Let SA(C) denote the stochastic automaton corresponding to a CTMC C. For any CTMC C_1, C_2, we have

SA(C_1) ∥_SA SA(C_2) ~ SA(C_1 ∥_CT C_2).

Here, ∥_CT and ∥_SA denote the interleaving parallel composition of CTMC [33,30] (Kronecker sum of their matrix representations) and of SA [11] (echoing TA parallel composition), respectively.
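The "Kronecker sum of matrix representations" used for the CTMC composition is a standard construction; a minimal numpy sketch with made-up rate matrices:

import numpy as np

def kronecker_sum(Q1, Q2):
    """Rate matrix of the interleaving composition of two CTMCs:
    Q1 (+) Q2 = Q1 (x) I + I (x) Q2."""
    n1, n2 = Q1.shape[0], Q2.shape[0]
    return np.kron(Q1, np.eye(n2)) + np.kron(np.eye(n1), Q2)

# Two one-transition CTMCs with rates 1 and 2 (states: before/after the jump).
Q1 = np.array([[-1.0, 1.0], [0.0, 0.0]])
Q2 = np.array([[-2.0, 2.0], [0.0, 0.0]])
print(kronecker_sum(Q1, Q2))   # generator of the interleaved 4-state CTMC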
Bisimulation for partial-observation MDP (POMDP). A POMDP is a quadruple M = (S, A, δ, O) where (as in an MDP) S is a set of states, A is a set of actions, and δ : S × A → D(S) is a transition function. Furthermore, O ⊆ 2^S partitions the state space. The choice of actions is resolved by a policy, yielding a Markov chain. Unlike in an MDP, such a choice is not based on the knowledge of the current state, only on knowing that the current state belongs to an observation o ∈ O. POMDPs have a wide range of applications in robotic control, automated planning, dialogue systems, medical diagnosis, and many other areas [46].
In the analysis of POMDP, the distributions over states, called beliefs, arise naturally and bisimulations over beliefs have already been considered [7,34].
However, to the best of our knowledge, no algorithms for computing belief bisimilarity for POMDP exist. We fill this gap by our algorithm for computing distribution bisimulation for PA in Section 4. Indeed, two beliefs μ, ν in a POMDP M are belief bisimilar in the spirit of [7] iff μ and ν are distribution bisimilar in the induced PA D_M = (S, O × A, →) where (s, (o, a), μ) ∈ → if s ∈ o and δ(s, a) = μ.(3)
Further applications. Probabilistic automata are especially apt for compositional modelling of distributed systems. The only information a component in a distributed system has about the current state of another component stems from their mutual communication. Therefore, each component can be also viewed from the outside as a partial-observation system. Thus, also in this context, distribution bisimulation is a natural concept. While ~ is not a congruence w.r.t. standard parallel composition, it is apt for compositional modelling of distributed systems where only distributed schedulers are considered. For details, see [31,49].
Furthermore we can understand a PA as a description, in the sense of [25,39], of a representative agent in a large homogeneous population. The distribution view then naturally represents the ratios of agents being currently in the individual states and labels given to this large population of PAs correspond to global control actions [25]. For more details on applications, see [31].
4 Algorithms
In this section, we discuss computational aspects of deciding our bisimulation. Since ~ is a relation over distributions over the system's state space, it is un-countably infinite even for simple finite systems, which makes it in principle intricate to decide. Fortunately, the bisimulation relation has a linear structure, and this allows us to employ methods of linear algebra to work with it effectively. Moreover, important classes of continuous-space systems can be dealt with, since their structure can be exploited. We exemplify this on a subset of deterministic stochastic automata, for which we are able to provide an algorithm to decide bisimilarity.
Finite systems — greatest fixpoints. Let us fix a PA (S, L, →). We apply the standard approach by starting with D(S) × D(S) and pruning the relation until we reach the fixpoint ~. In order to represent ~ using linear algebra, we identify a distribution μ with the vector (μ(s_1), ..., μ(s_{|S|})) ∈ R^{|S|}.
Although the space of distributions is uncountable, we construct an implicit representation of ~ by a system of equations written as columns in a matrix E.
Definition 5. A matrix E with |S| rows is a bisimulation matrix if for some bisimulation R, for any distributions μ, ν,

μ R ν iff (μ − ν)E = 0.

For a bisimulation matrix E, the equivalence class of μ is then the set (μ + {ρ | ρE = 0}) ∩ D(S), the set of distributions that are equal modulo E.
(3) Note that [7] also considers rewards, which can be easily added to ~ and our algorithm.
Example 5. The bisimulation matrix E below encodes that several conditions must hold for two distributions μ, ν to be bisimilar. Among others, if we multiply μ − ν with e.g. the second column, we must get 0. This translates to (μ(v) − ν(v)) · 1 = 0, i.e. μ(v) = ν(v). Hence for bisimilar distributions, the measure of v has to be the same. This proves that u ≁ v (here we identify states and their Dirac distributions). Similarly, we can prove that t ~ ½ t′ + ½ t″. Indeed, if we multiply the corresponding difference vector (0, 0, 1, −½, −½, 0, 0) with any column of the matrix, we obtain 0.

E =
( 1 0 0 0 0 )
( 1 0 0 0 0 )
( 1 0 0 ½ ½ )
( 1 0 0 0 1 )
( 1 0 0 1 0 )
( 1 0 1 0 0 )
( 1 1 0 0 0 )
Note that the unit matrix is always a bisimulation matrix, not relating anything with anything but itself. For which bisimulations do there exist bisimulation matrices? We say a relation R over distributions is convex if μ R ν and μ′ R ν′ imply (pμ + (1 − p)μ′) R (pν + (1 − p)ν′) for any p ∈ [0,1].
Lemma 1. Every convex bisimulation has a corresponding bisimulation matrix.
Since ~ is convex (see [31]), there is a bisimulation matrix corresponding to ~. It is the least restrictive bisimulation matrix E (note that all bisimulation matrices with the least possible dimension have identical solution spaces); we call it the minimal bisimulation matrix. We show that the necessary and sufficient condition for E to be a bisimulation matrix is stability with respect to transitions.
Definition 6. For an |S| × |S| matrix P, we say that a matrix E with |S| rows is P-stable if for every ρ ∈ R^{|S|},

ρE = 0 ⟹ ρPE = 0.   (1)

We first briefly explain the stability in a simpler setting.
Action-deterministic systems. Let us consider PA where in each state, there is at most one transition. For each a ∈ L, we let P_a = (p_{ij}) denote the transition matrix such that for all i, j, if there is a (unique) transition s_i →_a μ, we set p_{ij} to μ(s_j), otherwise to 0. Then μ evolves under a into μP_a. Denote 1 = (1, ..., 1)^T.
Proposition 1. In an action-deterministic PA, E containing 1 is a bisimulation matrix iff it is P_a-stable for all a ∈ L.
To get a minimal bisimulation matrix E, we start with a single vector 1, which stands for an equation saying that the overall probability mass in bisimilar distributions is the same. Then we repetitively multiply all vectors we have by all the matrices P_a and add each resulting vector to the collection if it is linearly independent of the current collection, until there are no changes. In Example 5, the second column of E is obtained as P_c 1, the fourth one as P_a(P_c 1), and so on.
The set of all columns of E is thus given by the described iteration

{P_a | a ∈ L}* 1

modulo linear dependency. Since the P_a have |S| rows, the fixpoint is reached within |S| iterations, yielding 1 ≤ d ≤ |S| equations. Each class then forms an (|S| − d)-dimensional affine subspace intersected with the set of probability distributions D(S). This is also the principal idea behind the algorithms of [51] and [19].
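A minimal numpy sketch of this iteration for action-deterministic PA follows; the interface and names are illustrative and not taken from any implementation accompanying the paper. It also includes the membership test of Definition 5.

import numpy as np

def minimal_bisimulation_matrix(P):
    """Compute a bisimulation matrix E for an action-deterministic PA.

    P: dict mapping each action a to its |S| x |S| transition matrix P_a
       (row i = successor distribution of state s_i under a, or zeros).
    Start from the all-ones column and close the set of columns under
    multiplication by the matrices P_a, keeping only linearly independent
    columns."""
    n = next(iter(P.values())).shape[0]
    E = np.ones((n, 1))
    changed = True
    while changed:
        changed = False
        for Pa in P.values():
            for col in (Pa @ E).T:                 # candidate columns P_a * e
                candidate = np.column_stack([E, col])
                if np.linalg.matrix_rank(candidate) > np.linalg.matrix_rank(E):
                    E = candidate
                    changed = True
    return E

def bisimilar(mu, nu, E, tol=1e-9):
    """Definition 5: mu R nu iff (mu - nu) E = 0."""
    return np.allclose((np.asarray(mu) - np.asarray(nu)) @ E, 0.0, atol=tol)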
Non-deterministic systems. In general, for transitions under A, we have to consider in each s_i all the c_i^A non-deterministic choices among its outgoing transitions under some a ∈ A. We use variables w_i^j denoting the probability that the j-th such transition, say (s_i, a_j, μ_j), is taken by the scheduler/player(4) in s_i. We sum up the choices into a "non-deterministic" transition matrix P_A^W with parameters W whose i-th row equals Σ_{j=1}^{c_i^A} w_i^j · μ_j. It describes where the probability mass moves from s_i under A depending on the collection W of the probabilities the player gives to each choice. By W_A we denote the set of all such W.
A simple generalization of the approach above would be to consider {P_A^W | A ⊆ L, W ∈ W_A}* 1. However, firstly, the set of these matrices is uncountable whenever there are at least two transitions to choose from. Secondly, not all P_A^W may be used, as the following example shows.
Example 6. In each bisimulation class in the following example, the probabilities of s_1 + s_2, s_3, and s_4 are constant, as can also be seen from the bisimulation matrix E, similarly to Example 5. Further, E can be obtained as (1 P_c 1 P_b 1). Observe that E is P_{{a}}^W-stable for W that maximizes the probability of going into the "class" s_3 (both s_1 and s_2 go to s_3, i.e. w_1^1 = w_2^1 = 1); similarly for the "class" s_4.

[Figure of the PA omitted.]

P_{{a}}^W =
( 0 0 w_1^1 w_1^2 )
( 0 0 w_2^1 w_2^2 )
( 0 0 0 0 )
( 0 0 0 0 )

E =
( 1 0 0 )
( 1 0 0 )
( 1 0 1 )
( 1 1 0 )

However, for W with w_1^1 ≠ w_2^1, e.g. s_1 goes to s_3 and s_2 goes with equal probability to s_3 and s_4 (w_1^1 = 1, w_2^1 = w_2^2 = ½), we obtain from P_{{a}}^W E a new independent vector (0, 0.5, 0, 0)^T enforcing a partition finer than ~. This does not mean that Spoiler wins the game when choosing such a mixed W in some μ; it only means that Duplicator needs to choose a different W in a bisimilar ν in order to have μP_{{a}}^W ~ νP_{{a}}^W for the successors.
(4) We use the standard notion of Spoiler-Duplicator bisimulation game (see e.g. [42]) where in {μ_0, μ_1} Spoiler chooses i ∈ {0,1}, A ⊆ L, and μ_i →_A μ_i′; Duplicator has to reply with μ_{1−i} →_A μ′_{1−i} such that μ_i(S_A) = μ_{1−i}(S_A), and the game continues in {μ′_0, μ′_1}. Spoiler wins iff at some point Duplicator cannot reply.
A fundamental observation is that we get the correct bisimulation when Spoiler is restricted to finitely many "extremal" choices and Duplicator is restricted for such extremal W to respond only with the very same W. (*)
To this end, consider M_A^W = P_A^W E where E is the current matrix, each of whose e columns represents an equation. Intuitively, the i-th row of M_A^W describes how much of s_i is moved to the various classes when a step is taken. Denote the linear forms in M_A^W over W by m_{ij}. Since the players can randomize and mix choices of which transition to take, the set of vectors {(m_{i1}(w_i^1, ..., w_i^{c_i^A}), ..., m_{ie}(w_i^1, ..., w_i^{c_i^A})) | w_i^1, ..., w_i^{c_i^A} ≥ 0, Σ_j w_i^j = 1} forms a convex polytope denoted by C_i. Each vector in C_i is thus the i-th row of the matrix where some concrete weights w_i^j are "plugged in". This way C_i describes all the possible choices in s_i and their effect on where the probability mass is moved.

Denote the vertices (extremal points) of a convex polytope P by E(P). Then E(C_i) corresponds to pure (non-randomizing) choices that are "extremal" w.r.t. E. Note that now if s_j ~ s_k then C_j = C_k, or equivalently E(C_j) = E(C_k). Indeed, for every choice in s_j there needs to be a matching choice in s_k and vice versa. However, since we consider bisimulation between generally non-Dirac distributions, we need to combine these extremal choices. For an arbitrary distribution μ ∈ D(S), we say that a tuple c ∈ Π_{i=1}^{|S|} E(C_i) is extremal in μ if μ · c^T is a vertex of the polytope {μ · c′^T | c′ ∈ Π_{i=1}^{|S|} C_i}. Note that each extremal c corresponds to particular pure choices, denoted by W(c). Unfortunately, for choices W(c) of Spoiler extremal in some distribution, Duplicator may in another distribution need to make different choices. Indeed, in Example 6 the tuple corresponding to W is extremal in the Dirac distribution of state s_1. Therefore, we define E(C) to be the set of tuples c extremal in the uniform distribution. Interestingly, tuples extremal in the uniform distribution are (1) extremal in all distributions and (2) reflect all extremal choices, i.e. for every c extremal in some μ, there is a c′ extremal in the uniform distribution such that c′ is also extremal in μ and μ · c = μ · c′. As a result, the fundamental property (*) is guaranteed.
Proposition 2. Let E be a matrix containing 1. It is a bisimulation matrix iff it is P_A^{W(c)}-stable for all A ⊆ L and c ∈ E(C).
Theorem 2. Algorithm 1 computes a minimal bisimulation matrix.
The running time is exponential. We leave open the question whether linear programming or other methods [32] can yield E in polynomial time. The algorithm can easily be turned into one computing other bisimulation notions from the literature, for which there were no algorithms so far; see Section 5.
Continuous-time systems — least fixpoints. Turning our attention to continuous systems, we finally sketch an algorithm for deciding the bisimulation ~ over a subclass of stochastic automata; this constitutes the first algorithm to compute a bisimulation on the uncountably large semantical object.
We need to adopt two restrictions. First, we consider only deterministic SA, where the probability that two edges become enabled at the same time is zero (when initiated in any location). Second, to simplify the exposition, we restrict all
clock distributions F(c) to be exponential.

Input: Probabilistic automaton (S, L, →)
Output: A minimal bisimulation matrix E

foreach A ⊆ L do
    compute P_A^W    // non-deterministic transition matrix
E ← (1)
repeat
    foreach A ⊆ L do
        M_A^W ← P_A^W E
        foreach c ∈ E(C) do
            add to E the columns of P_A^{W(c)} E that are linearly independent of the columns of E
until E does not change

Theorem 3. Let S = (Q, C, A, →, κ, F) be a deterministic SA over exponential distributions. There is an algorithm to decide in time polynomial in |S| and exponential in |C| whether q_1 ~ q_2 for any locations q_1, q_2.
The rest of the section deals with the proof. We fix S = (Q, C, A, →, κ, F) and q_1, q_2 ∈ Q. First, we straightforwardly abstract the NLMP semantics P_S over state space S = Q × R^C by a NLMP P over state space S = Q × (R_{≥0} ∪ {−})^C where all negative values of clocks are expressed by −. Let ξ denote the obvious mapping of distributions D(S) onto D(S). Then ξ preserves bisimulation since two states s_1, s_2 that differ only in negative values satisfy ξ(τ_a(s_1)) = ξ(τ_a(s_2)) for all a ∈ L.
Lemma 2. For any distributions μ, ν on S we have μ ~ ν iff ξ(μ) ~ ξ(ν).
Second, similarly to an embedded Markov chain of a CTMC, we further abstract the NLMP P by a finite deterministic PA D = (S, A, →) such that each state of D is a distribution over the uncountable state space S.
— The set S is the set of states reachable via the transition relation defined below from the distributions μ_{q_1}, μ_{q_2} corresponding to q_1, q_2 (see Definition 4).
— Let us fix a state μ ∈ S (note that μ ∈ D(S)) and an action a ∈ A such that in the NLMP P an a-transition occurs with positive probability, i.e. μ →_{A_a} ν for some ν and for A_a = {a} × R_{>0}. Thanks to restricting to deterministic SA, P is also deterministic and such a distribution ν is uniquely defined. We set (μ, a, M) ∈ → where M is the discrete distribution that assigns probability p_{q,f} to state ν_{q,f} for each q ∈ Q and f : C → {−, +}, where p_{q,f} = ν(S_{q,f}), ν_{q,f} is the conditional distribution ν_{q,f}(X) := ν(X ∩ S_{q,f})/ν(S_{q,f}) for any measurable X ⊆ S, and S_{q,f} = {(q′, v) ∈ S | q′ = q, v(c) > 0 iff f(c) = + for each c ∈ C} is the set of states with location q and where the sign of the clock values matches f.
For exponential distributions, all the reachable states ν ∈ S correspond to some location q where the subset X ⊆ C is newly sampled, hence we obtain:
Lemma 3. For a deterministic SA over exponential distributions, |S| ≤ |Q| · 2^{|C|}.
Instead of a greatest fixpoint computation as employed for the discrete algorithm, we take a complementary approach and prove or disprove bisimilarity by a least fixpoint procedure. We start with the initial pair of distributions (states in D) which generates further requirements that we impose on the relation and try to satisfy them. We work with a tableau, a rooted tree where each node is either an inner node with a pair of discrete probability distributions over states of D as a label, a repeated node with a label that already appears somewhere between the node and the root, or a failure node denoted by □, and the children of each inner node are obtained by one rule from {Step, Lin}. A tableau not containing □ is successful.
Step For a node μ ~ ν where μ and ν have compatible timing, we add for each label a ∈ A one child node μ_a ~ ν_a where μ_a and ν_a are the unique distributions such that μ →_a μ_a and ν →_a ν_a. Otherwise, we add one failure node. We say that μ and ν have compatible timing if for all actions a ∈ A we have that T_a[μ] is equivalent to T_a[ν]. Here T_a[μ] is a measure over R_{>0} such that T_a[μ](I) := μ(S_{{a}×I}), i.e. the measure of states moving after a time in I with action a.
Lin For a node μ ~ ν linearly dependent on the set of remaining nodes in the tableau, we add one child (repeat) node μ ~ ν. Here, we understand each node μ ~ ν as the vector μ − ν in the |S|-dimensional vector space.
Note that compatibility of timing is easy to check. Furthermore, the set of rules is correct and complete w.r.t. bisimulation in P.
Lemma 4. There is a successful tableau from μ ~ ν iff μ ~ ν in P. Moreover, the set of nodes of a successful tableau is a subset of a bisimulation.
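The Lin rule reduces to a linear-dependence test over these vectors; a minimal numpy sketch (names are illustrative, not from the paper):

import numpy as np

def linearly_dependent(node, ancestors, tol=1e-9):
    """Lin rule test: is the difference vector mu - nu of the current node a
    linear combination of the difference vectors of the remaining nodes?"""
    node = np.asarray(node, dtype=float)
    if not ancestors:
        return bool(np.allclose(node, 0.0, atol=tol))
    A = np.column_stack([np.asarray(v, dtype=float) for v in ancestors])
    coeff, *_ = np.linalg.lstsq(A, node, rcond=None)
    return bool(np.allclose(A @ coeff, node, atol=tol))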
We get Theorem 3 since q_1 ~ q_2 iff ξ(μ_{q_1}) ~ ξ(μ_{q_2}) in P and, by Lemma 4, this can be decided by searching for a successful tableau over the finite PA D. We conclude with an example (the accompanying SA is omitted here) where μ_q = q ⊗ Exp(1/2) ⊗ Exp(1/2) is the product of two exponential distributions with rate 1/2, μ_u = u ⊗ Exp(1), and μ_v = v ⊗ Exp(1). Note that for both clocks x and y, the probability of getting to zero first is 0.5.
[Figure: two example tableaux built with the Step and Lin rules are omitted here.]
The finite tableau on the left is successful since it ends in a repeated node, thus it proves u ~ v. The infinite tableau on the right is also successful and proves q ~ v. When using only the rule Step, it is necessarily infinite as no node ever repeats. The rule Lin provides the means to truncate such infinite sequences. Observe that the third node in the tableau on the right above is linearly dependent on its ancestors.
Remark 1. Our approach can be turned into a complete proof system for bisimulation on models with expolynomial distributions.(5) For them, the states of the discrete transition system D can be expressed symbolically. In fact, we conjecture that the resulting semi-algorithm can be twisted to a decision algorithm for this expressive class of models. This is however out of the scope of this paper.
5 Related work and discussion
For an overview of coalgebraic work on probabilistic bisimulations we refer to a survey [47]. A considerable effort has been spent to extend this work to continuous-space systems: the solution of [15] (unfortunately not applicable to R), the construction of [21] (described by [42] as "ingenious and intricate"), sophisticated measurable selection techniques in [18], and further approaches of [17] or [52]. In contrast to this standard setting where relations between states and their successor distributions must be handled, our work uses directly relations on distributions which simplifies the setting. The coalgebraic approach has also been applied to trace semantics of uncountable systems [35]. The topic is still very lively, e.g. in the recent [41] a different coalgebraic description of the classical probabilistic bisimulation is given.
Recently, distribution-based bisimulations have been studied. In [19], a bisimulation is defined in the context of language equivalence of Rabin's deterministic
(5) With density that is positive on an interval [ℓ, u) for ℓ ∈ N_0, u ∈ N ∪ {∞}, given piecewise by expressions of the form Σ_i Σ_j a_{ij} x^i e^{−λ_{ij} x} for a_{ij}, λ_{ij} ∈ R. This class contains many important distributions such as exponential, or uniform, and enables efficient approximation of others.
probabilistic automata and also an algorithm to compute the bisimulation on them. However, only finite systems with no non-determinism are considered. The most related to our notion are the very recent, independently developed [24] and [49]. However, neither of them is applicable in the continuous setting, and for neither of the two has any algorithm previously been given. Nevertheless, since they are close to our definition, our algorithm can with only small changes actually compute them. Although the bisimulation of [24] in a rather complex way extends [19] to the non-deterministic case, reusing their notions, it can be equivalently rephrased as our Definition 2 only considering singleton sets A: instead of restricting to the states of the support that can perform some action of A, it considers those states that can perform exactly the actions of A. Here each i-th row of each transition matrix P_a needs to be set to zero if the set of labels enabled in s_i is different from A.
There are also bisimulation relations over distributions defined over finite [9,29] or uncountable [8] state spaces. They, however, coincide with the classical bisimulation [38] on Dirac distributions and are only directly lifted to non-Dirac distributions. Thus they fail to address the motivating correspondence problem from Section 1. Moreover, no algorithms were given. Further, weak bisimulations [23,22,16] (coarser than the usual state-based analogues), when applied to models without internal transitions, also coincide with the lifting [29] of the classical bisimulation [38], while our bisimulation is coarser.
There are other bisimulations that identify more states than the classical [38] such as [48] and [4] designed to match a specific logic. Another approach to obtain coarser equivalences on probabilistic automata is via testing scenarios [50].
6 Conclusion
We have introduced a general and natural notion of a distribution-based probabilistic bisimulation, have shown its applications in different settings, and have provided algorithms to compute it for finite and some classes of infinite systems. As to future work, the precise complexity of the finite case is certainly of interest. Further, the tableaux decision method opens the arena for investigating wider classes of continuous-time systems where the new bisimulation is decidable. Acknowledgement. We would like to thank Pedro D'Argenio, Filippo Bonchi, Daniel Gebler, and Matteo Mio for valuable feedback and discussions.
References
1. M. Agrawal, S. Akshay, B. Genest, and P. Thiagarajan. Approximate verification of the symbolic dynamics of Markov chains. In LICS, 2012.
2. R. Alur and D. Dill. A theory of timed automata. Theor. Comput. Sci., 126(2):183-235, 1994.
3. G. Behrmann, A. David, K. G. Larsen, P. Pettersson, and W. Yi. Developing UPPAAL over 15 years. Softw., Pract. Exper., 41(2):133-142, 2011.
4. M. Bernardo, R. D. Nicola, and M. Loreti. Revisiting bisimilarity and its modal logic for nondeterministic and probabilistic processes. Technical Report 06, IMT Lucca, 2013.
5. M. Bravetti and P. D'Argenio. Tutte le algebre insieme: Concepts, discussions and relations of stochastic process algebras with general distributions. In Validation of Stochastic Systems, 2004.
6. M. Bravetti, H. Hermanns, and J.-P. Katoen. YMCA: Why Markov Chain Algebra? Electr. Notes Theor. Comput. Sci., 162:107-112, 2006.
7. P. Castro, P. Panangaden, and D. Precup. Equivalence relations in fully and partially observable Markov decision processes. In IJCAI, 2009.
8. S. Cattani. Trace-based Process Algebras for Real-Time Probabilistic Systems. PhD thesis, University of Birmingham, 2005.
9. S. Crafa and F. Ranzato. A spectrum of behavioral relations over ltss on probability distributions. In CONCUR, 2011.
10. P. D'Argenio and J.-P. Katoen. A theory of stochastic systems, part I: Stochastic automata. Inf. Comput., 203(1):1-38, 2005.
11. P. D'Argenio and J.-P. Katoen. A theory of stochastic systems, part II: Process algebra. Inf. Comput., 203(1):39-74, 2005.
12. P. R. D'Argenio and C. Baier. What is the relation between CTMC and TA?, 1999. Personal communication.
13. A. David, K. Larsen, A. Legay, M. Mikucionis, D. Poulsen, J. van Vliet, and Z. Wang. Statistical model checking for networks of priced timed automata. In FORMATS, 2011.
14. A. David, K. Larsen, A. Legay, M. Mikucionis, and Z. Wang. Time for statistical model checking of real-time systems. In CAV, 2011.
15. E. de Vink and J. Rutten. Bisimulation for probabilistic transition systems: A coalgebraic approach. In ICALP, 1997.
16. Y. Deng and M. Hennessy. On the semantics of Markov automata. Inf. Comput., 222:139-168, 2013.
17. J. Desharnais, V. Gupta, R. Jagadeesan, and P. Panangaden. Approximating labeled Markov processes. In LICS, 2000.
18. E.-E. Doberkat. Semi-pullbacks and bisimulations in categories of stochastic relations. In ICALP, 2003.
19. L. Doyen, T. Henzinger, and J.-F. Raskin. Equivalence of labeled Markov chains. Int. J. Found. Comput. Sci., 19(3):549-563, 2008.
20. L. Doyen, T. Massart, and M. Shirmohammadi. Limit synchronization in Markov decision processes. CoRR, abs/1310.2935, 2013.
21. A. Edalat. Semi-pullbacks and bisimulation in categories of Markov processes. Mathematical Structures in Computer Science, 9(5):523-543, 1999.
22. C. Eisentraut, H. Hermanns, J. Krämer, A. Turrini, and L. Zhang. Deciding bisimilarities on distributions. In QEST, 2013.
23. C. Eisentraut, H. Hermanns, and L. Zhang. On probabilistic automata in continuous time. In LICS, 2010.
24. Y. Feng and L. Zhang. When equivalence and bisimulation join forces in probabilistic automata. In FM, 2014.
25. N. Gast and B. Gaujal. A mean field approach for optimization in discrete time. Discrete Event Dynamic Systems, 21(1):63-101, 2011.
26. S. Georgievska and S. Andova. Probabilistic may/must testing: retaining probabilities by restricted schedulers. Formal Asp. Comput., 24(4-6):727-748, 2012.
27. M. Giry. A categorical approach to probability theory. In Categorical aspects of topology and analysis. Springer, 1982.
28. P. G. Harrison and B. Strulo. Spades - a process algebra for discrete event simulation. J. Log. Comput., 10(1):3-42, 2000.
29. M. Hennessy. Exploring probabilistic bisimulations, part I. Formal Asp. Comput., 2012.
30. H. Hermanns, U. Herzog, and V. Mertsiotakis. Stochastic process algebras - between LOTOS and Markov chains. Computer Networks, 30(9-10):901-924, 1998.
31. H. Hermanns, J. Krčál, and J. Křetínský. Probabilistic bisimulation: Naturally on distributions. CoRR, abs/1404.5084, 2014.
32. H. Hermanns and A. Turrini. Deciding probabilistic automata weak bisimulation in polynomial time. In FSTTCS, 2012.
33. J. Hillston. A Compositional Approach to Performance Modelling. Cambridge University Press, New York, NY, USA, 1996.
34. D. Jansen, F. Nielson, and L. Zhang. Belief bisimulation for hidden Markov models - logical characterisation and decision algorithm. In NASA Formal Methods, 2012.
35. H. Kerstan and B. König. Coalgebraic trace semantics for probabilistic transition systems based on measure theory. In CONCUR, 2012.
36. V. Korthikanti, M. Viswanathan, G. Agha, and Y. Kwon. Reasoning about MDPs as transformers of probability distributions. In QEST, 2010.
37. M. Z. Kwiatkowska, G. Norman, and D. Parker. Prism 4.0: Verification of probabilistic real-time systems. In CAV, vol. 6806 of Lecture Notes in Computer Science. Springer, 2011.
38. K. Larsen and A. Skou. Bisimulation through probabilistic testing. In POPL, 1989.
39. R. May et al. Biological populations with nonoverlapping generations: stable points, stable cycles, and chaos. Science, 186(4164):645-647, 1974.
40. R. Milner. Communication and concurrency. PHI Series in computer science. Prentice Hall, 1989.
41. M. Mio. Upper-expectation bisimilarity and Łukasiewicz μ-calculus. In FoSSaCS, 2014.
42. D. Sangiorgi and J. Rutten. Advanced Topics in Bisimulation and Coinduction. Cambridge University Press, New York, NY, USA, 1st edition, 2011.
43. R. Segala. Modeling and Verification of Randomized Distributed Real-time Systems. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995.
44. R. Segala and N. Lynch. Probabilistic simulations for probabilistic processes. In CONCUR, 1994.
45. R. Segala and N. A. Lynch. Probabilistic simulations for probabilistic processes. In CONCUR, vol. 836 of Lecture Notes in Computer Science. Springer, 1994.
46. G. Shani, J. Pineau, and R. Kaplow. A survey of point-based pomdp solvers. AAMAS, 27(1):1-51, 2013.
47. A. Sokolova. Probabilistic systems coalgebraically: A survey. Theor. Comput. Sci., 412(38):5095-5110, 2011.
48. L. Song, L. Zhang, and J. Godskesen. Bisimulations meet PCTL equivalences for probabilistic automata. In CONCUR, 2011.
49. L. Song, L. Zhang, and J. C. Godskesen. Late weak bisimulation for Markov automata. CoRR, abs/1202.4116, 2012.
50. M. Stoelinga and F. Vaandrager. A testing scenario for probabilistic automata. In ICALP, 2003.
51. W. Tzeng. A polynomial-time algorithm for the equivalence of probabilistic automata. SIAM J. Comput., 21(2):216-227, 1992.
52. N. Wolovick. Continuous probability and nondeterminism in labeled transition systems. PhD thesis, Universidad Nacional de Córdoba, 2012.
Verification of Markov Decision Processes using Learning Algorithms*
Tomáš Brázdil1, Krishnendu Chatterjee2, Martin Chmelík2, Vojtěch Forejt3, Jan Křetínský2, Marta Kwiatkowska3, David Parker4, and Mateusz Ujma3
1 Masaryk University, Brno, Czech Republic  2 IST Austria  3 University of Oxford, UK  4 University of Birmingham, UK
Abstract. We present a general framework for applying machine-learning algorithms to the verification of Markov decision processes (MDPs). The primary goal of these techniques is to improve performance by avoiding an exhaustive exploration of the state space. Our framework focuses on probabilistic reachability, which is a core property for verification, and is illustrated through two distinct instantiations. The first assumes that full knowledge of the MDP is available, and performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP, and yields probabilistic guarantees, again in terms of both the lower and upper bounds, which provides efficient stopping criteria for the approximation. The latter is the first extension of statistical model checking for unbounded properties in MDPs. In contrast with other related techniques, our approach is not restricted to time-bounded (finite-horizon) or discounted properties, nor does it assume any particular properties of the MDP. We also show how our methods extend to LTL objectives. We present experimental results showing the performance of our framework on several examples.
1 Introduction
Markov decision processes (MDPs) are a widely used model for the formal verification of systems that exhibit stochastic behaviour. This may arise due to the possibility of failures (e.g. of physical system components), unpredictable events (e.g. messages sent across a lossy medium), or uncertainty about the environment (e.g. unreliable sensors in a robot). It may also stem from the explicit use of randomisation, such as probabilistic routing in gossip protocols or random back-off in wireless communication protocols.
Verification of MDPs against temporal logics such as PCTL and LTL typically reduces to the computation of optimal (minimum or maximum) reachability probabilities, either on the MDP itself or its product with some deterministic ω-automaton. Optimal reachability probabilities (and a corresponding optimal strategy for the MDP) can be computed in polynomial time through a reduction to linear programming, although in
* This research was funded in part by the European Research Council (ERC) under grant agreement 267989 (QUAREM), 246967 (VERIWARE) and 279307 (Graph Games), by the EU FP7 project HIERATIC, by the Austrian Science Fund (FWF) projects S11402-N23 (RiSE), S11407-N23 (RiSE) and P23499-N23, by the Czech Science Foundation grant No P202/12/P612, by EPSRC project EP/K038575/1 and by the Microsoft faculty fellows award.
practice verification tools often use dynamic programming techniques, such as value iteration which approximates the values up to some pre-specified convergence criterion.
The efficiency or feasibility of verification is often limited by excessive time or space requirements, caused by the need to store a full model in memory. Common approaches to tackling this include: symbolic model checking, which uses efficient data structures to construct and manipulate a compact representation of the model; abstraction refinement, which constructs a sequence of increasingly precise approximations, bypassing construction of the full model using decision procedures such as SAT or SMT; and statistical model checking [37,19], which uses Monte Carlo simulation to generate approximate results of verification that hold with high probability.
In this paper, we explore the opportunities offered by learning-based methods, as used in fields such as planning or reinforcement learning [36]. In particular, we focus on algorithms that explore an MDP by generating trajectories through it and, whilst doing so, produce increasingly precise approximations for some property of interest (in this case, reachability probabilities). The approximate values, along with other information, are used as heuristics to guide the model exploration so as to minimise the solution time and the portion of the model that needs to be considered.
We present a general framework for applying such algorithms to the verification of MDPs. Then, we consider two distinct instantiations that operate under different assumptions concerning the availability of knowledge about the MDP, and produce different classes of results. We distinguish between complete information, where full knowledge of the MDP is available (but not necessarily generated and stored), and limited information, where (in simple terms) we can only sample trajectories of the MDP.
The first algorithm assumes complete information and is based on real-time dynamic programming (RTDP) [3]. In its basic form, this only generates approximations in the form of lower bounds (on maximum reachability probabilities). While this may suffice in some scenarios (e.g. planning), in the context of verification we typically require more precise guarantees. So we consider bounded RTDP (BRTDP) [30], which supplements this with an additional upper bound. The second algorithm assumes limited information and is based on delayed Q-learning (DQL) [35]. Again, we produce both lower and upper bounds but, in contrast to BRTDP, where these are guaranteed to be correct, DQL offers probably approximately correct (PAC) results, i.e., there is a non-zero probability that the bounds are incorrect.
Typically, MDP solution methods based on learning or heuristics make assumptions about the structure of the model. For example, the presence of end components [15] (subsets of states where it is possible to remain indefinitely with probability 1) can result in convergence to incorrect values. Our techniques are applicable to arbitrary MDPs. We first handle the case of MDPs that contain no end components (except for trivial designated goal or sink states). Then, we adapt this to the general case by means of on-the-fly detection of end components, which is one of the main technical contributions of the paper. We also show how our techniques extend to LTL objectives and thus also to minimum reachability probabilities.
Our DQL-based method, which yields PAC results, can be seen as an instance of statistical model checking [37,19], a technique that has received considerable attention. Until recently, most work in this area focused on purely probabilistic models, without
nondeterminism, but several approaches have now been presented for statistical model checking of nondeterministic models [13,14,27,4,28,18,29]. However, these methods all consider either time-bounded properties or use discounting to ensure convergence (see below for a summary). The techniques in this paper are the first for statistical model checking of unbounded properties on MDPs.
We have implemented our framework within the PRISM tool [25]. This paper concludes with experimental results for an implementation of our BRTDP-based approach that demonstrate considerable speed-ups over the fastest methods in PRISM.
Detailed proofs omitted due to lack of space are available in [7].
1.1 Related Work
In fields such as planning and artificial intelligence, many learning-based and heuristic-driven solution methods for MDPs have been developed. In the complete information setting, examples include RTDP [3] and BRTDP [30], as discussed above, which generate lower and lower/upper bounds on values, respectively. Most algorithms make certain assumptions in order to ensure convergence, for example through the use of a discount factor or by restricting to so-called Stochastic Shortest Path (SSP) problems, whereas we target arbitrary MDPs without discounting. More recently, an approach called FRET [24] was proposed for a generalisation of SSP, but this gives only a one-sided (lower) bound. We are not aware of any attempts to apply or adapt such methods in the context of probabilistic verification. A related paper is [1], which applies heuristic search methods to MDPs, but for generating probabilistic counterexamples.
As mentioned above, in the limited information setting, our algorithm based on delayed Q-learning (DQL) yields PAC results, similar to those obtained from statistical model checking [37,19,34]. This is an active area of research with a variety of tools [21,8,6,5]. In contrast with our work, most techniques focus on time-bounded properties, e.g., using bounded LTL, rather than unbounded properties. Several approaches have been proposed to transform checking of unbounded properties into testing of bounded properties, for example, [38,17,33,32]. However, these focus on purely probabilistic models, without nondeterminism, and do not apply to MDPs. In [4], unbounded properties are analysed for MDPs with spurious nondeterminism, where the way it is resolved does not affect the desired property.
More generally, the development of statistical model checking techniques for probabilistic models with nondeterminism, such as MDPs, is an important topic, treated in several recent papers. One approach is to give the nondeterminism a probabilistic semantics, e.g., using a uniform distribution instead, as for timed automata in [13,14,27]. Others [28,18], like this paper, aim to quantify over all strategies and produce an ε-optimal strategy. The work in [28] and [18] deals with the problem in the setting of discounted (and for the purposes of approximation thus bounded) or bounded properties, respectively. In the latter work, candidates for optimal schedulers are generated and gradually improved, but "at any given point we cannot quantify how close to optimal the candidate scheduler is" and "the algorithm does not estimate the maximum probability of the property" (cited from [29]). Further, [29] considers compact representation of schedulers, but again focuses only on (time) bounded properties.
Since statistical model checking is simulation-based, one of the most important difficulties is the analysis of rare events. This issue is, of course, also relevant for our
approach; see the section on experimental results. Rare events have been addressed using methods such as importance sampling [17,20] and importance splitting [22].
End components in MDPs can be collapsed either for algorithmic correctness [15] or efficiency [11] (where only lower bounds on maximum reachability probabilities are considered). Asymptotically efficient ways to detect them are given in [10,9].
2 Basics about MDPs and Learning Algorithms
We begin with basic background material on MDPs and some fundamental definitions for our learning framework. We use N, Q, and R to denote the sets of all non-negative integers, rational numbers and real numbers, respectively. Dist(X) is the set of all rational probability distributions over a finite or countable set X, i.e., the functions f : X → [0,1] ∩ Q such that Σ_{x∈X} f(x) = 1; supp(f) denotes the support of f.
Definition 1. An MDP is a tuple M = (S, s̄, A, E, Δ), where S is a finite set of states, s̄ ∈ S is the initial state, A is a finite set of actions, E : S → 2^A assigns non-empty sets of enabled actions to all states, and Δ : S × A → Dist(S) is a (partial) probabilistic transition function defined for all s and a where a ∈ E(s).
Remark 1. For simplicity of presentation we assume w.l.o.g. that, for every action a ∈ A, there is at most one state s such that a ∈ E(s), i.e., E(s) ∩ E(s′) = ∅ for s ≠ s′. If there are states s, s′ such that a ∈ E(s) ∩ E(s′), we can always rename the actions as (s, a) ∈ E(s) and (s′, a) ∈ E(s′), so that the MDP satisfies our assumption.
An infinite path of an MDP M is an infinite sequence ω = s0 a0 s1 a1 ... such that ai ∈ E(si) and Δ(si, ai)(si+1) > 0 for every i ∈ N. A finite path is a finite prefix of an infinite path ending in a state. We use last(ω) to denote the last state of a finite path ω. We denote by IPath (resp. FPath) the set of all infinite (resp. finite) paths, and by IPath_s (resp. FPath_s) the set of infinite (resp. finite) paths starting in a state s.
A state s is terminal if all actions a ∈ E(s) satisfy Δ(s, a)(s) = 1. An end component (EC) of M is a pair (S′, A′) where S′ ⊆ S and A′ ⊆ ⋃_{s∈S′} E(s) such that: (1) if Δ(s, a)(s′) > 0 for some s ∈ S′ and a ∈ A′, then s′ ∈ S′; and (2) for all s, s′ ∈ S′ there is a path ω = s0 a0 ... sn such that s0 = s, sn = s′, and ai ∈ A′ for all 0 ≤ i < n. A maximal end component (MEC) is an EC that is maximal with respect to the point-wise subset ordering.
Strategies. A strategy of MDP M is a function σ : FPath → Dist(A) satisfying supp(σ(ω)) ⊆ E(last(ω)) for every ω ∈ FPath. Intuitively, the strategy resolves the choices of actions in each finite path by choosing (possibly at random) an action enabled in the last state of the path. We write Σ_M for the set of all strategies in M. In standard fashion [23], a strategy σ induces, for any initial state s, a probability measure Pr_s^σ over IPath_s. A strategy σ is memoryless if σ(ω) depends only on last(ω).
Objectives and values. Given a set F ⊆ S of target states, bounded reachability for step k, denoted by ◊≤k F, refers to the set of all infinite paths that reach a state in F within k steps, and unbounded reachability, denoted by ◊F, refers to the set of all infinite paths that reach a state in F. Note that ◊F = ⋃_{k≥0} ◊≤k F. We consider the reachability probability Pr_s^σ(◊F), and strategies that maximise this probability. We denote by V(s) the value in s, defined by sup_{σ∈Σ_M} Pr_s^σ(◊F). Given ε > 0, we say that a strategy σ is ε-optimal in s if Pr_s^σ(◊F) + ε ≥ V(s), and we call a 0-optimal strategy optimal. It is known [31] that, for every MDP and set F, there is a memoryless optimal strategy for ◊F. We are interested in strategies that approximate the value function, i.e., ε-optimal strategies for some ε > 0.
2.2 Learning Algorithms for MDPs
In this paper, we study a class of learning-based algorithms that stochastically approximate the value function of an MDP. Let us fix, for this section, an MDP M = (S, s̄, A, E, Δ) and target states F ⊆ S. We denote by V : S × A → [0,1] the value function for state-action pairs of M, defined for all (s, a) where s ∈ S and a ∈ E(s) by
V(s, a) := Σ_{s′∈S} Δ(s, a)(s′) · V(s′).
Intuitively, V(s, a) is the value in s assuming that the first action performed is a. A learning algorithm A simulates executions of M, and iteratively updates upper and lower approximations U : S × A → [0,1] and L : S × A → [0,1], respectively, of the value function V : S × A → [0,1].
The functions U and L are initialised to appropriate values so that L(s, a) ≤ V(s, a) ≤ U(s, a) for all s ∈ S and a ∈ A. During the computation of A, simulated executions start in the initial state s̄ and move from state to state according to choices made by the algorithm. The values of U(s, a) and L(s, a) are updated for the states s visited by the simulated execution. Since max_{a∈E(s)} U(s, a) and max_{a∈E(s)} L(s, a) represent upper and lower bounds on V(s), a learning algorithm A terminates when max_{a∈E(s̄)} U(s̄, a) − max_{a∈E(s̄)} L(s̄, a) < ε, where the precision ε > 0 is given to the algorithm as an argument. Note that, because U and L are possibly updated based on the simulations, the computation of the learning algorithm may be randomised and even give incorrect results with some probability.
Definition 2. Denote by A(ε) the instance of learning algorithm A with precision ε. We say that A converges surely (resp. almost surely) if, for every ε > 0, the computation of A(ε) surely (resp. almost surely) terminates, and L(s, a) ≤ V(s, a) ≤ U(s, a) holds upon termination.
In some cases, almost-sure convergence cannot be guaranteed, so we demand that the computation terminates correctly with sufficiently high probability. In such cases, we assume the algorithm is also given a confidence δ > 0 as an argument.
Definition 3. Denote by A(ε, δ) the instance of learning algorithm A with precision ε and confidence δ. We say that A is probably approximately correct (PAC) if, for every ε > 0 and every δ > 0, with probability at least 1 − δ, the computation of A(ε, δ) terminates with L(s, a) ≤ V(s, a) ≤ U(s, a).
The function U defines a memoryless strategy σ_U which in every state s chooses, uniformly at random, among all actions a maximising the value U(s, a) over E(s). The strategy σ_U is used in some of the algorithms and also contributes to the output.
Remark 2. If the value function is defined as the infimum over strategies (as in [30]), then the strategy chooses actions to minimise the lower value. Since we consider the dual case of supremum over strategies, the choice of σ_U is to maximise the upper value.
We also need to specify what knowledge about the MDP M is available to the learning algorithm. We distinguish the following two distinct cases.
Definition 4. A learning algorithm has limited information about M if it knows only the initial state s̄, a number K ≥ |S|, a number E_m ≥ max_{s∈S} |E(s)|, a number 0 < q ≤ p_min, where p_min = min{Δ(s, a)(s′) | s ∈ S, a ∈ E(s), s′ ∈ supp(Δ(s, a))}, and the function E (more precisely, given a state s, the learning procedure can ask an oracle for E(s)). We assume that the algorithm may simulate an execution of M starting with s̄ and choosing enabled actions in individual steps.
Definition 5. A learning algorithm has complete information about M if it knows the complete MDP M.
Note that the MDPs we consider are "fully observable", so even in the limited information case strategies can make decisions based on the precise state of the system.
3 MDPs without End Components
We first present algorithms for MDPs without ECs, which considerably simplifies the adaptation of BRTDP and DQL to unbounded reachability objectives. Later, in Section 4, we extend our methods to deal with arbitrary MDPs (with ECs). Let us fix an MDP M = (S, s̄, A, E, Δ) and a target set F. Formally, we assume the following.
Assumption-EC. MDP M has no ECs, except two trivial ones containing distinguished terminal states 1 and 0, respectively, with F = {1}, V(1) = 1 and V(0) = 0.
3.1 Our framework
We start by formalising a general framework for learning algorithms, as outlined in the previous section. We then instantiate this and obtain two learning algorithms: BRTDP and DQL. Our framework is presented as Algorithm 1, and works as follows. Recall that functions U and L store the current upper and lower bounds on the value function V. Each iteration of the outer loop is divided into two phases: EXPLORE and UPDATE. In the EXPLORE phase (lines 5-10), the algorithm samples a finite path ω in M from s̄ to a state in {1, 0} by always randomly choosing one of the enabled actions that maximise the U value, and sampling the successor state using the probabilistic transition function. In the UPDATE phase (lines 11-16), the algorithm updates U and L on the state-action pairs along the path in a backward manner. Here, the function pop pops and returns the last letter of the given sequence.
Algorithm 1 Learning algorithm (for MDPs with no ECs)
1: Inputs: An EC-free MDP M
2: U(·,·) ← 1, L(·,·) ← 0
3: L(1,·) ← 1, U(0,·) ← 0                                        ▷ Initialise
4: repeat
5:     ω ← s̄                                                     /* Explore phase */
6:     repeat
7:         a ← sampled uniformly from argmax_{a∈E(last(ω))} U(last(ω), a)
8:         s ← sampled according to Δ(last(ω), a)                 ▷ GetSucc(ω, a)
9:         ω ← ω a s
10:    until s ∈ {1, 0}                                           ▷ TerminatePath(ω)
11:    repeat                                                     /* Update phase */
12:        s′ ← pop(ω)
13:        a ← pop(ω)
14:        s ← last(ω)
15:        Update((s, a), s′)
16:    until ω = s̄
17: until max_{a∈E(s̄)} U(s̄, a) − max_{a∈E(s̄)} L(s̄, a) < ε          ▷ terminate
3.2 Instantiations: BRTDP and DQL
Our two algorithm instantiations, BRTDP and DQL, differ in the definition of UPDATE.
Unbounded reachability with BRTDP. We obtain BRTDP by instantiating UPDATE with Algorithm 2, which requires complete information about the MDP. Intuitively, UPDATE computes new values of U(s, a) and L(s, a) by taking the weighted average of the corresponding U and L values, respectively, over all successors of s via action a. Formally, denote U(s) = max_{a∈E(s)} U(s, a) and L(s) = max_{a∈E(s)} L(s, a).
Algorithm 2 BRTDP instantiation of Algorithm 1
1: procedure Update((s, a), ·)
2:     U(s, a) := Σ_{s′∈S} Δ(s, a)(s′) · U(s′)
3:     L(s, a) := Σ_{s′∈S} Δ(s, a)(s′) · L(s′)
The following theorem says that BRTDP satisfies the conditions of Definition 2 and never returns incorrect results.
Theorem 1. The algorithm BRTDP converges almost surely under Assumption-EC.
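To make the interplay of the two phases concrete, the following minimal Python sketch runs Algorithm 1 with the BRTDP update of Algorithm 2 on an explicitly given EC-free MDP. The data layout (dictionaries of enabled actions and successor distributions) and all names are assumptions of this illustration, not part of the formal development. On EC-free MDPs the loop terminates almost surely and returns bounds that differ by less than eps.

import random

def brtdp(S, E, Delta, s0, goal, sink, eps):
    # E[s]: list of actions enabled in s; Delta[(s, a)]: dict successor -> probability.
    # 'goal' and 'sink' are the terminal states 1 and 0 of Assumption-EC (with self-loops).
    U = {(s, a): 1.0 for s in S for a in E[s]}
    L = {(s, a): 0.0 for s in S for a in E[s]}
    for a in E[goal]:
        L[(goal, a)] = 1.0
    for a in E[sink]:
        U[(sink, a)] = 0.0
    upper = lambda s: max(U[(s, a)] for a in E[s])
    lower = lambda s: max(L[(s, a)] for a in E[s])

    while upper(s0) - lower(s0) >= eps:
        # Explore phase: follow actions maximising U until a terminal state is hit.
        path, s = [], s0
        while s not in (goal, sink):
            best = max(U[(s, a)] for a in E[s])
            a = random.choice([b for b in E[s] if U[(s, b)] == best])
            succ = Delta[(s, a)]
            path.append((s, a))
            s = random.choices(list(succ), weights=list(succ.values()))[0]
        # Update phase: Bellman backups along the explored path, backwards.
        for (t, a) in reversed(path):
            succ = Delta[(t, a)]
            U[(t, a)] = sum(p * upper(u) for u, p in succ.items())
            L[(t, a)] = sum(p * lower(u) for u, p in succ.items())
    return lower(s0), upper(s0)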
Remark 3. Note that, in the EXPLORE phase, an action maximising the value of U is chosen and the successor is sampled according to the probabilistic transition function of M. However, we can consider various modifications. Actions and successors may be chosen in different ways (e.g., for GETSUCC), for instance, uniformly at random, in a round-robin fashion, or assigning various probabilities (bounded from below by some fixed p > 0) to all possibilities in any biased way. In order to guarantee almost-sure convergence, some conditions have to be satisfied. Intuitively, we require that the state-action pairs used by ε-optimal strategies are chosen sufficiently many times. If this condition is satisfied, then almost-sure convergence is preserved and the practical running times may improve significantly. For details, see Section 5.
Remark 4. The previous BRTDP algorithm is only applicable if the transition probabilities are known. However, if complete information is not known, but A(s,a) can be repeatedly sampled for any s and a, then a variant of BRTDP can be shown to be probably approximately correct.
Unbounded reachability with DQL. Often, complete information about the MDP is unavailable, repeated sampling is not possible, and we have to deal with only limited information about M (see Definition 4). For this scenario, we use DQL, which can be obtained by instantiating UPDATE with Algorithm 3.
Algorithm 3 DQL (delay m, estimator precision ε̄) instantiation of Algorithm 1
1: procedure Update((s, a), s′)
2:     if c(s, a) = m and LEARN(s, a) then
3:         if accum_m^U(s, a)/m < U(s, a) − 2ε̄ then
4:             U(s, a) ← accum_m^U(s, a)/m + ε̄
5:             accum_m^U(s, a) ← 0
6:         if accum_m^L(s, a)/m > L(s, a) + 2ε̄ then
7:             L(s, a) ← accum_m^L(s, a)/m − ε̄
8:             accum_m^L(s, a) ← 0
9:         c(s, a) ← 0
10:    else
11:        accum_m^U(s, a) ← accum_m^U(s, a) + U(s′)
12:        accum_m^L(s, a) ← accum_m^L(s, a) + L(s′)
13:        c(s, a) ← c(s, a) + 1
Macro LEARN(s, a) is true in the k-th call of Update((s, a), ·) if, since the (k − 2m)-th call of Update((s, a), ·), line 4 was not executed in any call of Update(·, ·).
The main idea behind DQL is as follows. As the probabilistic transition function is not known, we cannot update U(s, a) and L(s, a) with the actual values Σ_{s′∈S} Δ(s, a)(s′)U(s′) and Σ_{s′∈S} Δ(s, a)(s′)L(s′), respectively. However, we can instead use simulations executed in the EXPLORE phase of Algorithm 1 to estimate these values. Namely, we use accum_m^U(s, a)/m to estimate Σ_{s′∈S} Δ(s, a)(s′)U(s′), where accum_m^U(s, a) is the sum of the U values of the last m immediate successors of (s, a) seen during the EXPLORE phase. Note that the delay m must be chosen large enough for the estimates to be sufficiently close, i.e., ε̄-close, to the real values.
So, in addition to U(s, a) and L(s, a), the algorithm uses new variables accum_m^U(s, a) and accum_m^L(s, a) to accumulate U and L values, respectively, and a counter c(s, a) recording the number of invocations of a in s since the last update (all these variables are initialised to 0 at the beginning of the computation). Assume that a has been invoked in s during the EXPLORE phase of Algorithm 1, which means that UPDATE((s, a), s′) is eventually called in the UPDATE phase of Algorithm 1 with the corresponding successor s′ of (s, a). If c(s, a) = m at that time, a has been invoked in s precisely m times since the last update concerning (s, a), and the procedure UPDATE((s, a), s′) updates U(s, a) with accum_m^U(s, a)/m plus an appropriate constant ε̄ (unless LEARN is false). Here, the purpose of adding ε̄ is to make U(s, a) stay above the real value V(s, a) with high probability. If c(s, a) < m, then
Fig. 1. MDP M with an EC (left), MDP M^{{m1,m2}} constructed from M in on-the-fly BRTDP (centre), and MDP M′ obtained from M by collapsing C = ({m1, m2}, {a, b}) (right).
UPDATE((s, a), s′) simply accumulates U(s′) into accum_m^U(s, a) and increases the counter c(s, a). The L(s, a) values are estimated by accum_m^L(s, a)/m in a similar way, just subtracting ε̄ from accum_m^L(s, a)/m. The procedure requires m and ε̄ as inputs, and they are chosen depending on ε and δ; more precisely, we choose an estimator precision ε̄ proportional to ε · (p_min/E_m)^K and m = ln(6|S||A|(1 + |S||A|/ε̄)/δ) / (2ε̄²), and establish that DQL is probably approximately correct. The parameters m and ε̄ can be conservatively approximated using only the limited information about the MDP (i.e. using K, E_m and q). Even though the algorithm has limited information about M, we still establish the following theorem.
Theorem 2. DQL is probably approximately correct under Assumption-EC.
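A direct transcription of the DQL update into Python reads as follows; accU, accL and cnt play the roles of accum_m^U, accum_m^L and c, while the LEARN test is abstracted into a boolean dictionary maintained by the caller (the exact bookkeeping of the macro is omitted in this sketch, and all names are illustrative).

def dql_update(sa, s_next, U, L, accU, accL, cnt, learn, m, eps_bar, upper, lower):
    # One call of Update((s, a), s') from Algorithm 3; sa is the pair (s, a).
    # upper(s) / lower(s) return the maximum over enabled actions of U resp. L in state s.
    if cnt[sa] == m and learn[sa]:
        if accU[sa] / m < U[sa] - 2 * eps_bar:
            U[sa] = accU[sa] / m + eps_bar     # successful attempted update of the upper bound
            accU[sa] = 0.0
        if accL[sa] / m > L[sa] + 2 * eps_bar:
            L[sa] = accL[sa] / m - eps_bar     # successful attempted update of the lower bound
            accL[sa] = 0.0
        cnt[sa] = 0
    else:
        accU[sa] += upper(s_next)              # accumulate U and L values of sampled successors
        accL[sa] += lower(s_next)
        cnt[sa] += 1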
Bounded reachability. Algorithm 1 can be trivially adapted to handle bounded reachability properties by preprocessing the input MDP in the standard fashion. Namely, every state is equipped with a bounded counter with values ranging from 0 to k, where k is the step bound, the current value denoting the number of steps taken so far. All target states remain targets for all counter values, and every non-target state with counter value k becomes rejecting. Then, to determine the k-step reachability in the original MDP, we compute the (unbounded) reachability in the new MDP. Although this means that the number of states is multiplied by k + 1, in practice the size of the explored part of the model can be small.
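The preprocessing just described, adding a step counter, is easy to carry out on the explicit representation used in the previous sketches; the following function is an illustrative assumption of how it could look, not the tool's implementation.

def step_bounded_product(S, E, Delta, targets, k):
    # Unfold the MDP with a counter 0..k; unbounded reachability of targets2 in the
    # product equals k-step reachability of 'targets' in the original MDP.
    S2 = [(s, i) for s in S for i in range(k + 1)]
    targets2 = {(s, i) for s in targets for i in range(k + 1)}
    E2, Delta2 = {}, {}
    for (s, i) in S2:
        if s in targets or i == k:
            # Target states stay targets; non-targets with exhausted budget become absorbing.
            E2[(s, i)] = ["loop"]
            Delta2[((s, i), "loop")] = {(s, i): 1.0}
        else:
            E2[(s, i)] = list(E[s])
            for a in E[s]:
                Delta2[((s, i), a)] = {(t, i + 1): p for t, p in Delta[(s, a)].items()}
    return S2, E2, Delta2, targets2
# The initial state of the product is (s0, 0).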
4 Unrestricted MDPs
We first illustrate with an example that the algorithms BRTDP and DQL as presented in Section 3 may not converge when there are ECs in the MDP.
Example 1. Consider the MDP M in Fig. 1 (left) with the EC ({m1, m2}, {a, b}). The values in states m1, m2 are V(m1) = V(m2) = 0.5, but the upper bounds are U(m1) = U(m2) = 1 in every iteration. This is because U(m1, a) = U(m2, b) = 1 and both algorithms greedily choose the action with the highest upper bound. Thus, in every iteration of the algorithm, the error for the initial state m1 is U(m1) − V(m1) = 1/2 and the algorithm does not converge. In general, any state in an EC has upper bound 1
since, by definition, there are actions that guarantee the next state is in the EC, i.e., is a state with upper bound 1. This argument holds even for standard value iteration with values initialised to 1.
One way of dealing with general MDPs is to preprocess them to identify all MECs [10,9] and "collapse" them into single states (see e.g. [15,11]). These algorithms require that the graph model is known and explore the whole state space, but this may not be possible either due to limited information (see Definition 4) or because the model is too large. Hence, we propose a modification to the algorithms from the previous sections that allows us to deal with ECs "on-the-fly". We first describe the collapsing of a set of states and then present a crucial lemma that allows us to identify ECs to collapse.
Collapsing states. In the following, we say that an MDP M′ = (S′, s̄′, A′, E′, Δ′) is obtained from M = (S, s̄, A, E, Δ) by collapsing a tuple (R, B), where R ⊆ S and B ⊆ A with B ⊆ ⋃_{s∈R} E(s), if:
- S′ = (S \ R) ∪ {s_(R,B)},
- s̄′ is either s_(R,B) or s̄, depending on whether s̄ ∈ R or not,
- A′ = A \ B,
- E′(s) = E(s) for s ∈ S \ R; E′(s_(R,B)) = ⋃_{s∈R} E(s) \ B,
- Δ′ is defined for all s ∈ S′ and a ∈ E′(s) by:
  • Δ′(s, a)(s′) = Δ(s, a)(s′) for s, s′ ≠ s_(R,B),
  • Δ′(s, a)(s_(R,B)) = Σ_{s′∈R} Δ(s, a)(s′) for s ≠ s_(R,B),
  • Δ′(s_(R,B), a)(s′) = Δ(s, a)(s′) for s′ ≠ s_(R,B) and s the unique state with a ∈ E(s) (see Remark 1),
  • Δ′(s_(R,B), a)(s_(R,B)) = Σ_{s′∈R} Δ(s, a)(s′) where s is the unique state with a ∈ E(s).
We denote the above transformation, which creates M′ from M, as the COLLAPSE function, i.e., COLLAPSE(R, B). As a special case, given a state s and a terminal state s′ ∈ {0, 1}, we use MAKETERMINAL(s, s′) as shorthand for COLLAPSE({s, s′}, E(s)), where the new state is renamed to s′. Intuitively, after MAKETERMINAL(s, s′), every transition previously leading to state s will now lead to the terminal state s′.
For practical purposes, it is important to note that the collapsing does not need to be implemented explicitly, but can be done by keeping a separate data structure which stores information about the collapsed states.
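One way to keep the collapsing implicit, as suggested above, is a representative map from original states to collapsed states, with successor distributions rewritten on the fly; the sketch below is an illustration and its names are not from the paper.

class CollapseMap:
    # Maps original states to their current representative; states never collapsed map to themselves.
    def __init__(self):
        self.rep = {}          # state -> representative
        self.removed = set()   # actions of collapsed ECs (the sets B)

    def find(self, s):
        # Follow the chain of representatives (ECs may be collapsed repeatedly).
        while s in self.rep:
            s = self.rep[s]
        return s

    def collapse(self, R, B, fresh):
        # Merge the EC (R, B) into the fresh state; its internal actions disappear.
        for s in R:
            t = self.find(s)
            if t != fresh:
                self.rep[t] = fresh
        self.removed |= set(B)

def collapsed_successors(cmap, Delta, s, a):
    # Successor distribution of (s, a) with collapsed states merged into their representatives.
    dist = {}
    for t, p in Delta[(s, a)].items():
        r = cmap.find(t)
        dist[r] = dist.get(r, 0.0) + p
    return dist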
Identifying ECs from simulations. Our modifications will identify ECs "on-the-fly" through simulations that get stuck in them. The next lemma establishes the identification principle. To this end, for a path ω, let us denote by Appear(ω, i) the tuple (S_i, A_i) of M such that s ∈ S_i and a ∈ A_i if and only if (s, a) occurs in ω more than i times.
Lemma 1. Let c = exp(−(p_min/E_m)^κ / κ), where κ = K·E_m + 1, and let i ≥ κ. Assume that the EXPLORE phase in Algorithm 1 terminates with probability less than 1. Then, provided the EXPLORE phase does not terminate within 3iκ iterations, the conditional probability that Appear(ω, i) is an EC is at least 1 − 2c^i · iκ · (p_min/E_m)^{−κ}.
The above lemma allows us to modify the EXPLORE phase of Algorithm 1 in such a way that simulations will be used to identify ECs. The ECs discovered will subsequently be collapsed. We first present the overall skeleton (Algorithm 4) for treating
ECs "on-the-fly", which consists of two parts: (i) identification of ECs; and (ii) processing them. The instantiations for BRTDP and DQL will differ in the identification phase. Hence, before proceeding to the individual identification algorithms, we first establish the correctness of the processing phase.
Algorithm 4 Extension for general MDPs
1: function On-the-fly-EC
2:     ℳ ← IdentifyECs                            ▷ Identification of ECs
3:     for all (R, B) ∈ ℳ do                       ▷ Process ECs
4:         Collapse(R, B)
5:         for all s ∈ R and a ∈ E(s) \ B do
6:             U(s_(R,B), a) ← U(s, a)
7:             L(s_(R,B), a) ← L(s, a)
8:         if R ∩ F ≠ ∅ then
9:             MakeTerminal(s_(R,B), 1)
10:        else if no actions enabled in s_(R,B) then
11:            MakeTerminal(s_(R,B), 0)
Lemma 2. Assume (R, B) is an EC in MDP M, V_M is the value before the Process ECs procedure in Algorithm 4, and V_{M′} is the value after the procedure; then:
- for i ∈ {0, 1}, if MAKETERMINAL(s_(R,B), i) is called, then ∀s ∈ R : V_M(s) = i,
- ∀s ∈ S \ R : V_M(s) = V_{M′}(s),
- ∀s ∈ R : V_M(s) = V_{M′}(s_(R,B)).
Interpretation of collapsing. Intuitively, once an EC (R, B) is collapsed, the algorithm in the EXPLORE phase can choose a state s € R and action a € E(s) \ B to leave the EC. This is simulated in the EXPLORE phase by considering all actions of the EC uniformly at random until s is reached, and then action a is chosen. Since (R, B) is an EC, playing all actions of B uniformly at random ensures s is almost surely reached. Note that the steps made inside a collapsed EC do not count towards the length of the explored path.
Now, we present the on-the-fly versions of BRTDP and DQL. For each case, we describe: (i) modification of Algorithm 1; (ii) identification of ECs; and (iii) correctness.
4.1 Complete information (BRTDP)
Modification of Algorithm 1. To obtain BRTDP working with unrestricted MDPs, we modify Algorithm 1 as follows: for iteration i of the EXPLORE phase, we insert a check after line 9 such that, if the length of the path ω explored (i.e., the number of states) is k_i (see below), then we invoke the ON-THE-FLY-EC function for BRTDP. The ON-THE-FLY-EC function possibly modifies the MDP by processing (collapsing) some ECs as described in Algorithm 4. After the ON-THE-FLY-EC function terminates, we interrupt the current EXPLORE phase and start the EXPLORE phase for the (i+1)-th iteration (i.e., generating a new path again, starting from s̄ in the modified MDP). To complete the description, we describe the choice of k_i and the identification of ECs.
Choice of k_i. Because computing ECs can be expensive, we do not call ON-THE-FLY-EC every time a new state is explored, but only after every k_i steps of the repeat-until
loop at lines 6-10 in iteration i. The specific value of k_i can be decided experimentally and can change as the computation progresses. A theoretical bound for k_i ensuring that there is an EC with high probability can be obtained from Lemma 1.
Identification of ECs. Given the current explored path ω, let (T, G) be Appear(ω, 0), that is, the set of states and actions explored in ω. To obtain the ECs from the set T of explored states, we use Algorithm 5. This computes an auxiliary MDP M^T = (T′, s̄, A′, E′, Δ′) defined as follows:
- T′ = T ∪ {t | ∃s ∈ T, a ∈ E(s) such that Δ(s, a)(t) > 0},
- A′ = ⋃_{s∈T} E(s) ∪ {⊥},
- E′(s) = E(s) if s ∈ T and E′(s) = {⊥} otherwise,
- Δ′(s, a) = Δ(s, a) if s ∈ T, and Δ′(s, ⊥)(s) = 1 otherwise.
It then computes all MECs of M^T that are contained in T and identifies them as ECs. The following lemma states that each of these is indeed an EC in the original MDP.
Algorithm 5 Identification of ECs for BRTDP
1: function IdentifyECs(M, T)
2:     compute M^T
3:     ℳ ← MECs of M^T contained in T
4:     return ℳ
Lemma 3. Every (R, B) returned by Algorithm 5 is an EC of M.
Theorem 3. On-the-fly BRTDP converges almost surely for all MDPs.
Example 2. Let us briefly explain the execution of on-the-fly BRTDP on the MDP M from Fig. 1 (left), with k_i ≥ 6 for all i. The loop at lines 6 to 10 of Algorithm 1 generates a path ω that contains some (possibly zero) number of loops m1 a m2 b followed by m1 a m2 c m3 d t where t ∈ {0, 1}. In the subsequent UPDATE phase, we set U(m3, d) = L(m3, d) = 0.5 and then U(m2, c) = L(m2, c) = 0.5; none of the other values change. In the second iteration of the loop at lines 6 to 10, the path ω′ = m1 a m2 b m1 a m2 b ... is being generated, and the newly inserted check for ON-THE-FLY-EC will be triggered once the path reaches the length k_i.
The algorithm now aims to identify ECs in the MDP based on the part of the MDP explored so far. To do so, the MDP M^T for the set T = {m1, m2} is constructed, and we depict it in Fig. 1 (centre). We then run MEC detection on M^T, finding that ({m1, m2}, {a, b}) is an EC, and so it gets collapsed according to the COLLAPSE procedure. This gives the MDP M′ from Fig. 1 (right).
The execution then continues with M′. A new path is generated at lines 6 to 10 of Algorithm 1; suppose it is ω″ = s_C c m3 d 0. In the UPDATE phase we then update the value U(s_C, c) = L(s_C, c) = 0.5, which makes the condition at the last line of Algorithm 1 satisfied, and the algorithm finishes, having computed the correct value.
4.2 Limited information (DQL)
Modification of Algorithm 1 and identification of ECs. The modification of Algorithm 1 is done exactly as for BRTDP (i.e., we insert a check after line 9 of EXPLORE, which invokes the ON-THE-FLY-EC function if the length of the path ω exceeds k_i). In iteration i, we set k_i to 3iℓ_i, for some ℓ_i (to be described later). The identification of the EC is as follows: we consider Appear(ω, ℓ_i), the set of states and actions that have appeared more than ℓ_i times in the explored path ω, which is of length 3iℓ_i, and identify this set as an EC; i.e., ℳ in line 2 of Algorithm 4 is defined as the set containing the single tuple Appear(ω, ℓ_i). We refer to the algorithm as on-the-fly DQL.
Choice of ℓ_i and correctness. The choice of ℓ_i is as follows. Note that, in iteration i, the error probability obtained from Lemma 1 is at most 2c^{ℓ_i} · ℓ_iκ · (p_min/E_m)^{−κ}, and we choose ℓ_i such that 2c^{ℓ_i} · ℓ_iκ · (p_min/E_m)^{−κ} ≤ δ/2^{i+1}, where δ is the confidence. Note that, since c < 1, the term c^{ℓ_i} decreases exponentially in ℓ_i, and hence for every i such an ℓ_i exists. It follows that the total error of the algorithm due to the on-the-fly EC collapsing is at most δ/2. It follows from the proof of Theorem 2 that for on-the-fly DQL the error is at most δ if we use the same ε̄ as for DQL, but now with DQL confidence δ/4, i.e., with m = ln(24|S||A|(1 + |S||A|/ε̄)/δ) / (2ε̄²). As before, these numbers can be conservatively approximated using only the limited information.
Theorem 4. On-the-fly DQL is probably approximately correct for all MDPs.
Example 3. Let us now briefly explain the execution of on-the-fly DQL on the MDP M from Fig. 1 (left). At first, paths of the same form as ω in Example 2 will be generated and there will be no change to U and L, because in any call to UPDATE (see Algorithm 3) for states s ∈ {m1, m2} with c(s, a) = m, the values accumulated in accum_m^U(s, a)/m and accum_m^L(s, a)/m are the same as the values already held, namely 1 and 0, respectively.
At some point, we call UPDATE for the tuple (m3, d) with c(m3, d) = m, which will result in the change of U(m3, d) and L(m3, d). Note that, at this point, the numbers accum_m^U(m3, d)/m and accum_m^L(m3, d)/m are both equal to the proportion of generated paths that visited the state 1. This number will, with high probability, be very close to 0.5, say 0.499. We thus set U(m3, d) = 0.499 + ε̄ and L(m3, d) = 0.499 − ε̄.
We then keep generating paths of the same form and at some point also update U(m2, c) and L(m2, c) to precisely 0.499 + ε̄ and 0.499 − ε̄, respectively. The subsequently generated path will be looping on m1 and m2, and once it is of length k_i, we identify ({m1, m2}, {a, b}) as an EC due to the definition of Appear(ω, ℓ_i). We then get the MDP from Fig. 1 (right), which we use to generate new paths, until the upper and lower bounds on the value in the new initial state are within the required bound.
4.3 Extension to LTL
So far we have focused on reachability, but our techniques also extend to linear temporal logic (LTL) objectives. By translating an LTL formula to an equivalent deterministic ω-automaton, verifying MDPs with LTL objectives reduces to the analysis of MDPs with ω-regular conditions such as Rabin acceptance conditions. A Rabin acceptance condition consists of a set {(M1, N1), ..., (Md, Nd)} of d pairs (Mi, Ni), where each Mi ⊆ S and
Ni ⊆ S. The acceptance condition requires that, for some 1 ≤ i ≤ d, states in Mi are visited infinitely often and states in Ni are visited finitely often.
Value computation for MDPs with Rabin objectives reduces to optimal reachability of winning ECs, where an EC (R, B) is winning if R ∩ Mi ≠ ∅ and R ∩ Ni = ∅ for some 1 ≤ i ≤ d.

Given a function f : X → R, we write argmax_{x∈X} f(x) = {x ∈ X | f(x) = max_{x′∈X} f(x′)}.
Markov chains. A Markov chain is a tuple M = (L, P, μ) where L is a finite set of locations, P : L → Dist(L) is a probabilistic transition function, and μ ∈ Dist(L) is the initial probability distribution. We denote the respective unique probability measure for M by P.
Markov decision processes. A Markov decision process (MDP) is a tuple G = (S, A, Act, δ, ŝ) where S is a finite set of states, A is a finite set of actions, Act : S → 2^A \ {∅} assigns to each state s the set Act(s) of actions enabled in s, δ : S × A → Dist(S) is a probabilistic transition function that, given a state and an action, gives a probability distribution over the successor states, and ŝ is the initial state. A run in G is an infinite alternating sequence of states and actions ω = s1 a1 s2 a2 ... such that for all i ≥ 1, we have ai ∈ Act(si) and δ(si, ai)(si+1) > 0. A path of length k in G is a finite prefix w = s1 a1 ... ak−1 sk of a run in G.
Strategies and plays. Intuitively, a strategy (or a policy) in an MDP G is a "recipe" to choose actions. Formally, a strategy is a function σ : S → Dist(A) that, given the current state of a play, gives a probability distribution over the enabled actions.1 In general, a strategy may randomize, i.e. return non-Dirac distributions. A strategy is deterministic if it gives a Dirac distribution for every argument.
1 In general, a strategy may be history dependent. However, for objectives considered in this paper, memoryless strategies (depending on the last state visited) are sufficient. Therefore, we only consider memoryless strategies in this paper.
A play of G determined by a strategy σ and a state s ∈ S is a Markov chain G_s^σ where the set of locations is S, the initial distribution μ is Dirac with μ(s) = 1, and
P(s)(s′) = Σ_{a∈A} σ(s)(a) · δ(s, a)(s′).
The induced probability measure is denoted by P_s^σ, and "almost surely" or "almost all runs" refers to happening with probability 1 according to this measure. We usually write P^σ instead of P_ŝ^σ (here ŝ is the initial state of G).
Liberal strategies. A liberal strategy is a function ς : S → 2^A such that for every s ∈ S we have ∅ ≠ ς(s) ⊆ Act(s).
Reachability objectives. Given a target set F ⊆ S, the value of a state s is Val(s) := sup_σ P_s^σ[◊F]. Given ε ≥ 0, we say that a strategy σ is ε-optimal if P^σ[◊F] ≥ Val(ŝ) − ε, and we call a 0-optimal strategy optimal.2 To avoid overly technical notation, we assume that states of F, subject to the reachability objective, are absorbing, i.e. for all s ∈ F, a ∈ Act(s) we have δ(s, a)(s) = 1.
End components. A non-empty set S′ ⊆ S is an end component (EC) of G if there is Act′ : S′ → 2^A \ {∅} such that (1) for all s ∈ S′ we have Act′(s) ⊆ Act(s), (2) for all s ∈ S′, we have a ∈ Act′(s) iff δ(s, a) ∈ Dist(S′), and (3) for all s, t ∈ S′ there is a path ω = s1 a1 ... ak−1 sk such that s1 = s, sk = t, and si ∈ S′, ai ∈ Act′(si) for every i. An end component is a maximal end component (MEC) if it is maximal with respect to the subset ordering. Given an MDP, the set of MECs is denoted by MEC. Given a MEC, actions of Act′(s) and Act(s) \ Act′(s) are called internal and external (in state s), respectively.
3 Computing ε-optimal Strategies
There are many algorithms for solving quantitative reachability in MDPs, such as the value iteration, the strategy improvement, linear programming based methods etc., see [Put94]. The main method implemented in PRISM is the value iteration, which successively (under)approximates the value Val(s, a) = Σ_{s′∈S} δ(s, a)(s′) · Val(s′) of every state-action pair (s, a) by a value V(s, a), and stops when the approximation is good enough. Denoting V(s) := max_{a∈Act(s)} V(s, a), every step of the value iteration improves the approximation V(s, a) by assigning V(s, a) := Σ_{s′∈S} δ(s, a)(s′) · V(s′) (we start with V such that V(s) = 1 if s ∈ F, and V(s) = 0 otherwise).
The disadvantage of the standard value iteration (and also of most of the above mentioned traditional methods) is that it works with the whole state space of the MDP (or at least with its reachable part). For instance, consider the states t_i, v_t of Fig. 1. The paper [BCC+14] adapts methods of bounded real-time dynamic programming (BRTDP, see e.g. [MLG05]) to speed up the computation of the value iteration by improving V(s, a)3 only on "important" state-action pairs identified by simulations.
2 For every MDP, there is a memoryless deterministic optimal strategy, see e.g. [Put94].
3 Here we use V for the lower approximation denoted by V_L in [BCC+14].
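For reference, the plain value iteration described above can be written down in a few lines; this is an illustration over an explicit dictionary representation, with a simple convergence threshold, and not PRISM's actual implementation.

def value_iteration(S, Act, delta, F, theta=1e-6):
    # Iterate V(s, a) = sum_{s'} delta(s, a)(s') * V(s') with V(s) = max_a V(s, a),
    # starting from V(s) = 1 on the target set F and 0 elsewhere (an underapproximation).
    V = {s: (1.0 if s in F else 0.0) for s in S}
    while True:
        change = 0.0
        for s in S:
            if s in F:
                continue  # target states keep value 1
            new = max(sum(p * V[t] for t, p in delta[(s, a)].items()) for a in Act[s])
            change = max(change, new - V[s])
            V[s] = new
        if change < theta:
            return V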
Even though RTDP methods may substantially reduce the size of an ε-optimal strategy, its explicit representation is usually large and difficult to understand. Thus we develop succinct representations of strategies, based on decision trees, that reduce the size even further and also provide a human readable representation. Even though the above methods are capable of yielding deterministic ε-optimal strategies, which can be immediately fed into machine learning algorithms, we found it advantageous to give the learning algorithm more freedom, in the sense that if there are more ε-optimal strategies, we let the algorithm choose (uniformly). This is especially useful within MECs, where many actions have the same value. Therefore, we extract liberal ε-optimal strategies from the value approximation V, output either by the value iteration or BRTDP.
Computing liberal ε-optimal strategies. Let us show how to obtain a liberal strategy ς from the value iteration, or BRTDP. For simplicity, we start with MDPs without MECs.
MDP without end components. We say that V : S × A → [0,1] is a valid ε-underapproximation if the following conditions hold:
1. V(s, a) ≤ Val(s, a) for all s ∈ S and a ∈ A,
2. Val(ŝ) − V(ŝ) ≤ ε,
3. V(s, a) ≤ Σ_{s′∈S} δ(s, a)(s′) · V(s′) for all s ∈ S and a ∈ Act(s).
Given a valid ε-underapproximation V, we define the liberal strategy ς_V by ς_V(s) = argmax_{a∈Act(s)} V(s, a) for every s ∈ S.
Lemma 1. Given ε > 0 and a valid ε-underapproximation V, ς_V is ε-optimal.
General MDP. For MDPs with end components we have to extend the definition of the valid ε-underapproximation. Given a MEC S′ ⊆ S, we say that (s, a) ∈ S × A is maximal-external in S′ if s ∈ S′, a ∈ Act(s) is external, and V(s, a) ≥ V(s′, a′) for all s′ ∈ S′ and a′ ∈ Act(s′). A state s ∈ S′ is an exit (of S′) if (s, a) is maximal-external in S′ for some a ∈ Act(s). We add the following condition to the valid ε-underapproximation:
4. Each MEC S′ ⊆ S has at least one exit.
Now the definition of ς_V is also more complicated:
- For every s ∈ S which is not in any MEC, we put ς_V(s) = argmax_{a∈Act(s)} V(s, a).
- For every s ∈ S which is in a MEC S′,
  • if s is an exit, then ς_V(s) = {a ∈ Act(s) | (s, a) is maximal-external in S′},
  • otherwise, ς_V(s) = {a ∈ Act(s) | a is internal}.
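The definition of ς_V above translates directly into code once a MEC decomposition is available; in the sketch below, each MEC is assumed to be given as a pair of its state set and a map of internal actions, which is an assumption of this illustration rather than the paper's data structures.

def liberal_strategy(S, Act, V, mecs):
    # V[(s, a)]: the computed underapproximation; mecs: list of (states, internal) pairs,
    # where internal[s] is the set of internal actions of s in that MEC.
    in_mec = {s: (states, internal) for (states, internal) in mecs for s in states}
    strat = {}
    for s in S:
        if s not in in_mec:
            best = max(V[(s, a)] for a in Act[s])
            strat[s] = [a for a in Act[s] if V[(s, a)] == best]
        else:
            states, internal = in_mec[s]
            external = [(t, a) for t in states for a in Act[t] if a not in internal[t]]
            best = max(V[(t, a)] for (t, a) in external)   # non-empty by condition 4 (an exit exists)
            maximal_here = [a for a in Act[s]
                            if a not in internal[s] and V[(s, a)] == best]
            if maximal_here:            # s is an exit of the MEC
                strat[s] = maximal_here
            else:
                strat[s] = list(internal[s])
    return strat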
Using these extended definitions, Lemma 1 remains valid. Further, note that ς_V is defined even for states with trivial underapproximation V(s) = 0, for instance a state s that was never subject to any value iteration improvement.

Given a liberal strategy ς : S → 2^A, assigning to each state its good actions, it can be explicitly represented as a list of state-action pairs, i.e., as a subset of
S × A = ∏_{i=1}^{n} Dom(v_i) × A × {0, 1, ..., m}    (1)
In addition, standard optimization algorithms implemented in PRISM use an explicit "don't-care" value −2 for the action in each unreachable state, meaning the strategy is not defined there. However, one could simply not list these pairs at all. Thus a smaller list is obtained, with only the states where ς is defined. Recall that one may also omit states s satisfying Imp^ς(s) = 0, thus ignoring reachable states with zero probability to reach the target. Further optimization may be achieved by omitting states s satisfying Imp^ς(s) < ∆ for a suitable ∆ > 0.
5.1 BDD Representation
The explicit set representation can be encoded as a binary decision diagram (BDD). This has been used, e.g., in [WBB+10, EJPV12]. The principle of the BDD representation of a set is that (1) each element is encoded as a string of bits and (2) an automaton, in the form of a binary directed acyclic graph, is created so that (3) the accepted language is exactly the set of the given bit strings. Although BDDs are quite efficient, see Section 6, each of these three steps can be significantly improved:
1. Instead of a string of bits describing all variables, a string of integers (one per variable) can be used. Branching is then done not on the value of each bit, but according to an inequality comparing the variable to a constant. This significantly improves the readability.
2. Instead of building the automaton according to a chosen order of bits, we let a heuristic choose the order of the inequalities and the actual constants in the inequalities.
3. Instead of representing the language precisely, we allow the heuristic to choose which data to represent and which not. The likelihood that each datum is represented corresponds to its importance, which we provide as another input.
The latter two steps lead to significantly smaller graphs than BDDs. All this can be done in an efficient way using decision trees learning.
5.2 Representation using Decision Trees
Decision trees. A decision tree for a domain ∏_{i=1}^{d} X_i ⊆ Z^d is a tuple T = (T, ρ, θ) where T is a finite rooted binary (ordered) tree with a set of inner nodes N and a set of leaves L, ρ assigns to every inner node a predicate of the form [x_i ~ const] where i ∈ {1, ..., d}, x_i ∈ X_i, const ∈ Z, ~ ∈ {≤, <, ≥, >, =}, and θ assigns to every leaf a value good or bad.9
8 On the one hand, PRISM does not allow different modules to have local variables with the same name, hence we do not distinguish which module a variable belongs to. On the other hand, while PRISM declares no names for non-synchronizing actions, we want to exploit the connection between the corresponding actions of different copies of the same module.
9 There exist many variants of decision trees in the literature allowing arbitrary branching, arbitrary values in the leaves, etc., e.g. [Mit97]. For simplicity, we define only a special suitable subclass.
Similarly to BDDs, the language L(T) ⊆ N^d of the tree is defined as follows. For a vector x̄ = (x̄_1, ..., x̄_d) ∈ N^d, we find a path p from the root to a leaf such that for each inner node n on the path, the predicate ρ(n) is satisfied by the substitution x_i = x̄_i iff the first child of n is on p. Denote the leaf on this particular path by ℓ. Then x̄ is in the language L(T) of T iff θ(ℓ) = good.
Example 2. Consider dimension d = 1 and domain X_1 = {1, ..., 7}. A tree representing the set {1, 2, 3, 7} is depicted in Fig. 2. To depict the ordered tree clearly, we use unbroken lines for the first child, corresponding to the satisfied predicate, and dashed lines for the second one, corresponding to the unsatisfied predicate.
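To make the definition concrete, here is a small Python sketch of membership in ℒ(𝒯); the tuple encoding of nodes is purely illustrative, and the example tree is a reconstruction of the tree of Example 2 (assuming root predicate x_1 ≤ 3 and inner predicate x_1 < 7, which is consistent with the represented set).

import operator

OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge,
       ">": operator.gt, "=": operator.eq}

# Illustrative encoding: inner node = (("pred", i, op, const), first, second),
# leaf = ("leaf", "good") or ("leaf", "bad"); indices i are 0-based.
def in_language(tree, x):
    node = tree
    while node[0] != "leaf":
        (_, i, op, const), first, second = node
        node = first if OPS[op](x[i], const) else second   # satisfied -> first child
    return node[1] == "good"

# The tree of Example 2, representing {1, 2, 3, 7} over X1 = {1, ..., 7}:
example_tree = (("pred", 0, "<=", 3),
                ("leaf", "good"),
                (("pred", 0, "<", 7), ("leaf", "bad"), ("leaf", "good")))

assert [v for v in range(1, 8) if in_language(example_tree, (v,))] == [1, 2, 3, 7]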
In our setting, we use the domain S × A defined by Equation (1), which is of the form ∏_{i=1}^{n+2} X_i, where for each 1 ≤ i ≤ n we have X_i = Dom(v_i), X_{n+1} = A and X_{n+2} = {0, 1, ..., m}. Here the coordinates Dom(v_i) are considered "unbounded" and, consequently, the respective predicates use inequalities. In contrast, we know the possible values of A × {0, 1, ..., m} in advance and they are not too many. Therefore, these coordinates are considered "discrete" and the respective predicates use equality. Examples of such trees are given in Section 6 in Fig. 4 and 5. Now a decision tree 𝒯 for this domain determines a liberal strategy ς : S → 2^A by a ∈ ς(s) iff (s, a) ∈ ℒ(𝒯).

Fig. 2: A decision tree for {1, 2, 3, 7} ⊆ {1, ..., 7}
Learning. We describe the process of learning a training set, which can also be understood as storing the input data. Given a training sequence (repetitions allowed!) x̄^1, ..., x̄^k, with each x̄^j = (x̄^j_1, ..., x̄^j_d) ∈ ℕ^d, partitioned into a positive and a negative subsequence, the process of learning according to the algorithm ID3 [Qui86, Mit97] proceeds as follows:
1. Start with a single node (root), and assign to it the whole training sequence.
2. Given a node n with a sequence τ,
- if all training examples in τ are positive, set θ(n) = good and stop;
- if all training examples in τ are negative, set θ(n) = bad and stop;
- otherwise,
• choose a predicate with the "highest gain" (with lowest entropy, see e.g. [Mit97, Sections 3.4.1, 3.7.2]),
• split τ into the subsequences satisfying and not satisfying the predicate, assign them to the first and the second child, respectively,
• go to step 2 for each child.
Intuitively, the predicate with the highest gain splits the sequence so that it maximizes the portion of positive data in the satisfying subsequence and the portion of negative data in the non-satisfying subsequence.
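A compact Python sketch of this ID3 loop, restricted to predicates of the form x_i ≤ c for brevity; the helper candidate_predicates and the node encoding (the same tuples as in the sketch after Example 2) are illustrative, not taken from [Qui86].

from math import log2

def entropy(seq):
    # seq: list of (vector, label) pairs, label in {"good", "bad"}
    if not seq:
        return 0.0
    p = sum(1 for _, lab in seq if lab == "good") / len(seq)
    return 0.0 if p in (0.0, 1.0) else -(p*log2(p) + (1-p)*log2(1-p))

def candidate_predicates(seq):
    # Illustrative: one predicate [x_i <= c] per dimension i and occurring value c.
    for i in range(len(seq[0][0])):
        for c in sorted({x[i] for x, _ in seq}):
            yield (i, c)

def id3(seq):
    labels = {lab for _, lab in seq}
    if labels == {"good"} or labels == {"bad"}:
        return ("leaf", labels.pop())
    best, best_gain = None, -1.0
    for i, c in candidate_predicates(seq):
        sat = [(x, l) for x, l in seq if x[i] <= c]
        unsat = [(x, l) for x, l in seq if x[i] > c]
        if not sat or not unsat:
            continue
        gain = entropy(seq) - (len(sat)/len(seq))*entropy(sat) \
                            - (len(unsat)/len(seq))*entropy(unsat)
        if gain > best_gain:
            best, best_gain = (i, c, sat, unsat), gain
    if best is None:   # no predicate separates the data: fall back to a majority leaf
        majority = max(("good", "bad"),
                       key=lambda l: sum(1 for _, m in seq if m == l))
        return ("leaf", majority)
    i, c, sat, unsat = best
    return (("pred", i, "<=", c), id3(sat), id3(unsat))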
In addition, the final tree can be pruned. This means that some leaves are merged, resulting in a smaller tree at the cost of some imprecision of storing (the language of the tree changes). The pruning phase is quite sophisticated, hence for the sake of simplicity and brevity, we omit the details here. We use the standard C4.5 algorithm and refer to [Qui93, Mit97]. In Section 6, we comment on the effects of the parameters used in pruning.
Learning a strategy. Assume that we already have a liberal strategy ς : S → 2^A. We show how we learn good and bad state-action pairs so that the language of the
resulting tree is close to the set of good pairs. The training sequence will be composed of state-action pairs where good pairs are positive examples, and bad pairs are negative ones. Since our aim is to ensure that important states are learnt and not pruned away, we repeat pairs with more important states in the training sequence more frequently.
Formally, for every s ∈ S and a ∈ Act(s), we put the pair (s, a) into the training sequence repeat(s) times, where
repeat(s) = c · Imp^ς(s)
for some constant c ∈ ℕ (note that Imp^ς(s) ≤ 1). Since we want to avoid the exact computation of Imp^ς(s), we estimate it using simulations. In practice, we thus run c simulation runs that reach the target and set repeat(s) to be the number of runs in which s was also reached.
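A sketch of this estimation by simulation; simulate_to_target is a hypothetical helper that runs one simulation under the (liberal) strategy and returns the set of visited states, or None if the target was not reached.

from collections import Counter

def estimate_repeats(simulate_to_target, c=10000):
    # Counts, for each state s, the number of target-reaching runs on which s
    # occurred; this count plays the role of repeat(s) = c * Imp(s).
    counts = Counter()
    runs = 0
    while runs < c:
        visited = simulate_to_target()
        if visited is None:          # run did not reach the target; discard it
            continue
        runs += 1
        counts.update(set(visited))  # each state counted at most once per run
    return counts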
6 Experiments
In this section, we present the experimental evaluation of the presented methods, which we have implemented within the probabilistic model checker PRISM [KNP11]. All the results presented in this section were obtained on a single Intel(R) Xeon(R) CPU (3.50GHz) with memory limited to 10GB.
First, we discuss several alternative options to construct the training data and to learn them in a decision tree. Further, we compare decision trees to other data structures, namely sets and BDDs, with respect to the sizes necessary for storing a strategy. Finally, we illustrate how the decision trees can be used to gain insight into our benchmarks.
6.1 Decision Tree Learning
Generating Training Data. The strategies we work with come from two different sources. Firstly, we consider strategies constructed by PRISM, which can be generated using the explicit or sparse model checking engine. Secondly, we consider strategies constructed by the BRTDP algorithm [BCC+14], which are defined on a part of the state space only.
Recall that given a strategy, the training data for the decision trees is constructed from c simulation runs according to the strategy. In our experiments, we found that c = 10000 produces good results in all the examples we consider. Note that we stop each simulation as soon as the target or a state with no path to the target is reached.

Decision Tree Learning in Weka. The decision trees are constructed using the Weka machine learning package [HFH+09]. The Weka suite offers various decision tree classifiers. We use the J48 classifier, which is an implementation of the C4.5 algorithm [Qui93]. The J48 classifier offers two parameters to control the pruning, which affect the size of the decision tree:
- The leaf size parameter M ∈ ℕ determines that each leaf node with fewer than M instances in the training data is merged with its siblings. Therefore, only values smaller than the number of instances per classification class are reasonable, since higher values always result in the trivial tree of size 1.
- The confidence factor C ∈ (0, 0.5] is used internally for determining the amount of pruning during decision tree construction. Smaller values incur more pruning and therefore smaller trees.
Detailed information and an empirical study of the parameters for J48 is available in [DM12].
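We cannot reproduce the exact J48 invocation here; as a loose analogue only, scikit-learn's decision trees in Python expose a similar leaf-size knob (min_samples_leaf, playing the role of M), while cost-complexity pruning (ccp_alpha) corresponds only roughly to the confidence factor C.

from sklearn.tree import DecisionTreeClassifier

# Rough analogue of the J48 parameters in scikit-learn (not Weka):
# min_samples_leaf plays the role of M; ccp_alpha only loosely that of C.
def fit_tree(X, y, leaf_size, alpha=0.0):
    clf = DecisionTreeClassifier(criterion="entropy",
                                 min_samples_leaf=leaf_size,
                                 ccp_alpha=alpha)
    clf.fit(X, y)
    return clf, clf.tree_.node_count   # node_count serves as the tree size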
Effects of the parameters. We illustrate the effects of the parameters C and M on the resulting size of the decision tree on the mer benchmark; similar behaviour appears in all the examples. Figures 3a and 3b show the resulting size of the decision tree for several (random) executions. Each line in the plots corresponds to one decision tree, learned with 15 different values of the parameter. The C parameter scales linearly between 0.0001 and 0.5. The M parameter scales logarithmically between 1 and the minimum number of instances per class in the respective training set. The plots in Figure 3 show that M is an effective parameter for calibrating the resulting tree size, whereas C plays less of a role. Hence, we use C = 10^{-4}. Furthermore, since the tree size is monotone in M, the parameter M can be used to retrieve a desired level of detail.
Fig. 3: Decision tree parameters. (a) fixed M; (b) fixed C = 10^{-4}; (c) tree size vs. error.
Figure 3c depicts the relation of the tree size to the relative error of the induced strategy. It shows that there is a threshold size below which the tree is no longer able to capture the strategy correctly and the error rises quickly. Above the threshold size, the error is around 1%, which we consider reasonable for extracting reliable information. This threshold behaviour is observed in all our examples. Therefore, it is sensible to perform a binary search for the highest M that ensures an error of at most 1%.
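The binary search itself can be sketched as follows; tree_size_and_error(M) is a hypothetical helper that trains a tree with leaf-size parameter M and returns its size together with the relative error of the induced strategy.

def largest_admissible_M(tree_size_and_error, m_max, tol=0.01):
    # Largest M whose tree still keeps the relative error at most tol,
    # exploiting that the tree size is monotone in M.
    lo, hi, best = 1, m_max, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        _, err = tree_size_and_error(mid)
        if err <= tol:
            best, lo = mid, mid + 1    # acceptable error: try a coarser tree
        else:
            hi = mid - 1               # too coarse: decrease M
    return best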
6.2 Results
First, we briefly introduce the four examples from the PRISM benchmark suite [KNP12] on which we tested our method. Note that the majority of the states in the used benchmarks are non-deterministic, so the strategies are non-trivial in most states. firewire models the Tree Identify Protocol of the IEEE 1394 High Performance Serial Bus, which is used to transport video and audio signals within a network of multimedia devices. The reachability objective is that one node gets the root and the other one the child role. investor models a market investor and shares, which change their value probabilistically over time. The reachability objective is that the investor finishes a trade at a time when his shares are more valuable than some threshold value.

Table 2: Effects of various learning variants on the tree size. Smallest trees computed from PRISM or BRTDP and the average time to compute one number are presented.
Example   | IOP | IVP | IOE | IVE | OO | OV   | Avg Time
firewire  |  1  |  1  |  1  |  1  |  1 |  1   |  45s
investor  | 27  | 25  | 31  | 35  | 37 | 37   | 135s
mer_17M   | 17  | 33  | 17  | 29  | 19 | none | 314s
mer_big   |     | 23  | 23  | 37  | 17 | none | 129s
zeroconf  |     |  7  |  7  |  7  |  7 | 17   | 141s
Table 1: Comparison of representation sizes for strategies obtained from PRISM and from BRTDP computation. Sizes are presented as the number of states for explicit lists of values, the number of nodes for BDDs, and the number of nodes for decision trees (DT). For DT, we display the tree size obtained from the binary search described above. DT Error reports the relative error of the strategy determined by the decision tree (on the induced Markov chain) compared to the optimal value, obtained by model checking with PRISM.
          |           |       |              PRISM                   |              BRTDP
Example   | |S|       | Value | Explicit | BDD   | DT | DT Error     | Explicit | BDD    | DT | DT Error
firewire  | 481,136   | 1.000 | 479,834  | 4,233 |  1 | 0.000%       | 766      | 4,763¹ |  1 | 0.000%
investor  | 35,893    | 0.958 | 28,151   | 783   | 27 | 0.886%       | 21,931   | 2,780  | 35 | 0.836%
mer_17M   | 1,773,664 | 0.200 | Memory Out                           | 1,887    | 619    | 17 | 0.000%
mer_big   | ≈ 10^13 ² | 0.200 | Memory Out                           | 1,894    | 646    | 19 | 0.692%
zeroconf  | 89,586    | 0.009 | 60,463   | 409   |  7 | 0.106%       | 1,630    | 905    |  7 | 0.235%
1 Note that BDDs represent states in binary form. Therefore, one entry in the explicit state list corresponds to several nodes in the BDD.
2 We did not measure the state space size, as the MDP does not fit in memory, but extrapolated it from the linear dependence of the model size on one of its parameters, which we could increase to 2^31 − 1. The value is obtained from the BRTDP computation.
mer is a mutual exclusion protocol that regulates the access of two users to two different resources. The protocol should prohibit that both resources are accessed simultaneously. zeroconf is a network protocol which allows users to choose their IP addresses autonomously. The protocol should detect and prohibit IP address conflicts.
For every example, Table 1 shows the size of the state space, the value of the optimal strategy, and the sizes of strategies obtained from explicit model checking by PRISM and by BRTDP, for each discussed data structure.
Learning variants. In order to justify our choice of the importance function Imp, we compare it to several alternatives.
1. When constructing the training data, we can use the importance measure Imp, and add states as often as is indicated by its importance (I), or neglect it and simply add every visited state exactly once (O).
2. Further, states on a simulation are learned conditioned on the fact that the target state is reached (O). Another option is to consider all simulations (V).
3. Finally, instead of the probability to visit the state (P), one can consider the expected number of visits (E).
In Table 2, we report the sizes of the decision trees obtained for all the learning variants. We conclude that our choice (IOP) is the most useful one.
6.3 Understanding Decision Trees
We show how the constructed decision trees can help us to gain insight into the essential features of the systems.
zeroconf example. In Figure 4 we present a decision tree that is a strategy for zeroconf and shows how an unresolved IP address conflict can occur in the protocol. First we present how to read the strategy represented in Figure 4. Next we show how the strategy can explain the conflict in the protocol. Assume that we are classifying a state-action pair (s, a), where action a is enabled in state s.
1. No matter what the current state s is, the action rec is always classified as bad according to the root of the tree. Therefore, the action rec should be played with positive probability only if all other available actions in the current state are also classified as bad.
2. If action a is different from rec, the right son of the root node is reached. If action a is different from the action l>0 & b=1 & ip_mess=1 -> b'=0 & z'=0 & n1'=min(n1+1,8) & ip_mess'=0 (the whole PRISM command is a single action), then a is classified as good in state s. Otherwise, the left son is reached.
3. In the node z < 0, the classification of action a (that is, the action labelling the parent node) depends on the variable valuation of the current state. If the value of variable z is greater than 0, then a is classified as good in state s; otherwise it is classified as bad.
Action rec stands for a network host receiving a reply to a broadcast message, resulting in the resolution of an IP address conflict if one is present, which clearly does not help in constructing an unresolved conflict. The action labelling the right son of the root represents the detection of an IP address conflict by an arbitrary network host. This action is good only if the clock variable z in the current state is greater than 0. The combined meaning of the two nodes is that an unresolved IP address conflict can occur if the conflict is detected too late.
firewire example. For firewire, we obtain a trivial tree with a single node, labelled good. Therefore, playing all available actions in each state guarantees reaching the target almost surely. In contrast to other representations, we have automatically obtained the information that the network always reaches the target configuration, regardless of the individual behaviour of its components and their interleaving.
mer example. In the case of mer, there exists a strategy that violates the required property that the two resources are not accessed simultaneously. The decision tree for the mer strategy is depicted in Figure 5. In order to understand how a state is reached, where both resources are accessed at the same time, it is necessary to determine which user accesses which resource in that state.
1. The two tree nodes labelled by 1 explain which resource user 1 should access. The root node, labelled by the action s1=0 & r1=0 -> r1'=2, specifies that the request to access resource 2 (variable r1 is set to 2) is classified as bad. The only remaining action for user 1 is to request access to resource 1. This action is classified as good by the right son of the root node.
2. Analogously, the tree nodes labelled by 2 specify that user 2 should request access to resource 2 (follows from s2=0 & r2=0 -> r2'=2). Once resource 2 is requested, it should change its internal state s2 to 1 (follows from s2=0 & r2=2 -> s2'=1). It follows that in the state violating the property, user 1 has access to resource 1 and user 2 to resource 2.

Fig. 4: A decision tree for zeroconf (root node labelled action = [rec])
The model is supposed to correctly handle such overlapping requests, but fails to do so in a specific case. In order to further debug the model, one has to find the action of the scheduler that causes this undesired behaviour. The lower part of the tree specifies that u1_request_comm is a candidate for such an action. Inspecting a snippet of the code of u1_request_comm from the PRISM source code (shown below) reveals that in the given situation, the scheduler reacts inappropriately with some probability p.

[u1_request_comm] s=0 & commUser=0 & driveUser!=0 & k ... ->
  (1-p):(s'=1) & (r'=driveUser) & (k'=k+1) + p:(s'=-1) & (gc'=true) & (k'=k+1)
The remaining nodes of the tree that were not discussed are necessary to reset the situation if the non-faulty part (with probability 1 − p) of the u1_request_comm command was executed. It should be noted that executing the faulty u1_request_comm action does not lead to the undesired state right away. The action only grants user 1 access rights in a situation where he should not get these rights. Only a successive action leads to user 1 accessing the resource and the undesired state being reached. This is a common type of bug, where the command that triggered an error is not its cause.
Fig. 5: A decision tree for mer
7 Conclusion
In this work we presented a new approach to representing strategies in MDPs in a succinct and comprehensible way, exploiting machine learning methods to achieve our goals. Interesting directions for future work are to investigate whether other machine learning methods can be integrated with our approach, and to extend our approach from reachability objectives to other objectives (such as long-run average and discounted sum).

Acknowledgements. This research was funded in part by Austrian Science Fund (FWF) Grant No P 23499-N23, FWF NFN Grant No S11407-N23 (RiSE) and Z211-N23 (Wittgenstein Award), European Research Council (ERC) Grant No 279307 (Graph Games), ERC Grant No 267989 (QUAREM), the Czech Science Foundation Grant No P202/12/G061, and the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) REA Grant No 291734.
References
ÁBD+14. E. Ábrahám, B. Becker, C. Dehnert, N. Jansen, J.-P. Katoen, and R. Wimmer. Counterexample generation for discrete-time Markov models: An introductory survey. In Formal Methods for Executable Software Models - 14th International School on Formal Methods for the Design of Computer, Communication, and Software Systems, SFM 2014, Bertinoro, Italy, June 16-20, 2014, pages 65-121, 2014.
ADvR08. M. E. Andres, P. R. D'Argenio, and P. van Rossum. Significant diagnostic counterexamples in probabilistic model checking. In Hardware and Software: Verification and Testing, 4th International Haifa Verification Conference, HVC 2008, Haifa, Israel, October 27-30, 2008. Proceedings, pages 129-148, 2008.
AL09. Husain Aljazzar and Stefan Leue. Generation of counterexamples for model checking of Markov decision processes. In QEST, pages 197-206. IEEE Computer Society, 2009.
AL10. Husain Aljazzar and Stefan Leue. Directed explicit state-space search in the generation of counterexamples for stochastic model checking. IEEE Trans. Software Eng., 36(1):37-60, 2010.
ALFLS11. H. Aljazzar, F. Leitner-Fischer, S. Leue, and D. Simeonov. DiPro - a tool for probabilistic counterexample generation. In SPIN, pages 183-187, 2011.
BCC+14. Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelík, Vojtěch Forejt, Jan Křetínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Verification of Markov decision processes using learning algorithms. In ATVA 2014, Sydney, NSW, Australia, November 3-7, 2014, Proceedings, pages 98-114, 2014.
BD96. C. Boutilier and R. Dearden. Approximating value trees in structured dynamic programming. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 54-62, 1996.
BDG95. C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting structure in policy construction. In IJCAI-95, pages 1104-1111, 1995.
BDH99. C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. JAIR, 11:1-94, 1999.
BH11. Tevfik Bultan and Pao-Ann Hsiung, editors. Automated Technology for Verification and Analysis, 9th International Symposium, ATVA, Taipei, Taiwan, October 11-14, 2011. Proceedings, volume 6996 of LNCS. Springer, 2011.
BJW02. Julien Bernet, David Janin, and Igor Walukiewicz. Permissive strategies: from parity games to safety games. ITA, 36(3):261-275, 2002.
BK08. C. Baier and J.-P. Katoen. Principles of Model Checking (Representation and Mind Series). The MIT Press, 2008.
BKK14. T. Brázdil, S. Kiefer, and A. Kučera. Efficient analysis of probabilistic programs with an unbounded counter. Journal of the ACM, 61(6):41:1-41:35, 2014.
BMOU11. Patricia Bouyer, Nicolas Markey, Jörg Olschewski, and Michael Ummels. Measuring permissiveness in parity games: Mean-payoff parity games revisited. In Bultan and Hsiung [BH11], pages 135-149.
CCD14. Krishnendu Chatterjee, Martin Chmelík, and Przemyslaw Daca. CEGAR for qualitative analysis of probabilistic systems. In CAV, pages 473-490, 2014.
CK91. D. Chapman and L. P. Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons, pages 726-731. Morgan Kaufmann, 1991.
CV10. Rohit Chadha and Mahesh Viswanathan. A counterexample-guided abstraction-refinement framework for Markov decision processes. ACM Trans. Comput. Log., 12(1), 2010.
CY95. C. Courcoubetis and M. Yannakakis. The complexity of probabilistic verification. Journal of the ACM, 42(4):857-907, 1995.
dA97. L. de Alfaro. Formal Verification of Probabilistic Systems. PhD thesis, Stanford University, 1997.
dAKN+00. L. de Alfaro, M. Kwiatkowska, G. Norman, D. Parker, and R. Segala. Symbolic model checking of probabilistic processes using MTBDDs and the Kronecker representation. In TACAS, 2000.
DFK+14. Klaus Dräger, Vojtěch Forejt, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Permissive controller synthesis for probabilistic systems. In TACAS, pages 531-546, 2014.
DHK08. B. Damman, T. Han, and J.-P. Katoen. Regular expressions for PCTL counterexamples. In QEST, pages 179-188. IEEE Computer Society, 2008.
DJJL01. Pedro R. D'Argenio, Bertrand Jeannet, Henrik Ejersbo Jensen, and Kim Guldstrand Larsen. Reachability analysis of probabilistic systems by successive refinements. In PAPM-PROBMIV, LNCS 2165, pages 39-56. Springer, 2001.
DJJL02. Pedro R. D'Argenio, Bertrand Jeannet, Henrik Ejersbo Jensen, and Kim Guldstrand Larsen. Reduction and refinement strategies for probabilistic analysis. In PAPM-PROBMIV, LNCS 2399, pages 57-76. Springer, 2002.
DJW+14. C. Dehnert, N. Jansen, R. Wimmer, E. Ábrahám, and J.-P. Katoen. Fast debugging of PRISM models. In Franck Cassez and Jean-François Raskin, editors, Automated Technology for Verification and Analysis - 12th International Symposium, ATVA 2014, Sydney, NSW, Australia, November 3-7, 2014, Proceedings, volume 8837 of Lecture Notes in Computer Science, pages 146-162. Springer, 2014.
DM12. Sam Drazin and Matt Montag. Decision tree analysis using Weka. Machine Learning - Project II, University of Miami, pages 1-3, 2012.
EJPV12. C. von Essen, B. Jobstmann, D. Parker, and R. Varshneya. Semi-symbolic computation of efficient controllers in probabilistic environments. Technical report, Verimag, 2012.
FHPW10. H. Fecher, M. Huth, N. Piterman, and D. Wagner. PCTL model checking of Markov chains: Truth and falsity as winning strategies in games. Perform. Eval., 67(9):858-872, 2010.
FV97. J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, 1997.
HFH+09. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18, 2009.
HKD09. T. Han, J.-P. Katoen, and B. Damman. Counterexample generation in probabilistic model checking. IEEE Trans. Software Eng., 35(2):241-257, 2009.
HKN+03. H. Hermanns, M. Kwiatkowska, G. Norman, D. Parker, and M. Siegle. On the use of MTBDDs for performability analysis and verification of stochastic systems. Journal of Logic and Algebraic Programming: Special Issue on Probabilistic Techniques for the Design and Analysis of Systems, 56(1-2):23-67, 2003.
How60. Ronald A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.
HSaHB99. J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 279-288. Morgan Kaufmann, 1999.
HWZ08. H. Hermanns, B. Wachter, and L. Zhang. Probabilistic CEGAR. In CAV, pages 162-175, 2008.
JÁK+11. N. Jansen, E. Ábrahám, J. Katelaan, R. Wimmer, J.-P. Katoen, and B. Becker. Hierarchical counterexamples for discrete-time Markov chains. In Bultan and Hsiung [BH11], pages 443-452.
JÁV+12. N. Jansen, E. Ábrahám, M. Volk, R. Wimmer, J.-P. Katoen, and B. Becker. The COMICS tool - computing minimal counterexamples for DTMCs. In ATVA, pages 349-353, 2012.
KH09. M. Kattenbelt and M. Huth. Verification and refutation of probabilistic specifications via games. In IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2009, December 15-17, 2009, IIT Kanpur, India, pages 251-262, 2009.
KHW94. N. Kushmerick, S. Hanks, and D. Weld. An algorithm for probabilistic least-commitment planning. In Proceedings of AAAI-94, pages 1073-1078, 1994.
KK99. M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. IJCAI, pages 740-747, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
KNP06. Marta Z. Kwiatkowska, Gethin Norman, and David Parker. Game-based abstraction for Markov decision processes. In QEST, pages 157-166, 2006.
KNP11. M. Kwiatkowska, G. Norman, and D. Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, pages 585-591, 2011.
KNP12. M. Kwiatkowska, G. Norman, and D. Parker. The PRISM benchmark suite. In QEST, pages 203-204, 2012.
KP99. D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1332-1339. Morgan Kaufmann, 1999.
KP13. Marta Kwiatkowska and David Parker. Automated verification and strategy synthesis for probabilistic systems. In Automated Technology for Verification and Analysis, pages 5-22. Springer, 2013.
KPC12. A. Komuravelli, C. S. Pasareanu, and E. M. Clarke. Assume-guarantee abstraction refinement for probabilistic systems. In CAV, pages 310-326, 2012.
LL13. F. Leitner-Fischer and S. Leue. Probabilistic fault tree synthesis using causality computation. IJCCBS, 4(2):119-143, 2013.
Mit97. Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997.
MLG05. H. Brendan McMahan, Maxim Likhachev, and Geoffrey J. Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In ICML, 2005.
MP04. A. Miner and D. Parker. Validation of Stochastic Systems: A Guide to Current Research, volume 2925 of Lecture Notes in Computer Science (Tutorial Volume), chapter Symbolic Representations and Analysis of Large Probabilistic Systems, pages 296-338. Springer, 2004.
Put94. M.L. Puterman. Markov Decision Processes. Wiley, 1994.
Pye03. Larry D. Pyeatt. Reinforcement learning with decision trees. In The 21st IASTED International Multi-Conference on Applied Informatics (AI 2003), February 10-13, 2003, Innsbruck, Austria, pages 26-31, 2003.
Qui86. J. Ross Quinlan. Induction of decision trees. Mach. Learn., 1(1):81-106, 1986.
Qui93. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Seg95. R. Segala. Modeling and Verification of Randomized Distributed Real-Time Systems. PhD thesis, MIT Press, 1995. Technical Report MIT/LCS/TR-676.
SLT10. S. Liu, A. Panangadan, C. S. Raghavendra, and A. Talukder. Compact representation of coordinated sampling policies for body sensor networks. In Proceedings of the Workshop on Advances in Communication and Networks (Smart Homes for Tele-Health), pages 6-10. IEEE, 2010.
Var85. M. Vardi. Automatic verification of probabilistic concurrent finite state programs. In FOCS, pages 327-338, 1985.
WBB+10. Ralf Wimmer, Bettina Braitling, Bernd Becker, Ernst Moritz Hahn, Pepijn Crouzen, Holger Hermanns, Abhishek Dhama, and Oliver Theel. Symblicit calculation of long-run averages for concurrent probabilistic systems. In QEST, pages 27-36, Washington, DC, USA, 2010. IEEE Computer Society.
WJÁ+14. R. Wimmer, N. Jansen, E. Ábrahám, J.-P. Katoen, and B. Becker. Minimal counterexamples for linear-time probabilistic verification. TCS, 549:61-100, 2014.
WJV+13. Ralf Wimmer, Nils Jansen, Andreas Vorpahl, Erika Ábrahám, Joost-Pieter Katoen, and Bernd Becker. High-level counterexamples for probabilistic automata. In QEST, pages 39-54, 2013.
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MARKOV DECISION PROCESSES
KRISHNENDU CHATTERJEE, ZUZANA KŘETÍNSKÁ, AND JAN KŘETÍNSKÝ
IST Austria
Institut für Informatik, Technische Universität München, Germany
Institut für Informatik, Technische Universität München, Germany
ABSTRACT. We consider Markov decision processes (MDPs) with multiple limit-average (or mean-payoff) objectives. There exist two different views: (i) the expectation semantics, where the goal is to optimize the expected mean-payoff objective, and (ii) the satisfaction semantics, where the goal is to maximize the probability of runs such that the mean-payoff value stays above a given vector. We consider optimization with respect to both objectives at once, thus unifying the existing semantics. Precisely, the goal is to optimize the expectation while ensuring the satisfaction constraint. Our problem captures the notion of optimization with respect to strategies that are risk-averse (i.e., ensure certain probabilistic guarantee). Our main results are as follows: First, we present algorithms for the decision problems, which are always polynomial in the size of the MDP. We also show that an approximation of the Pareto curve can be computed in time polynomial in the size of the MDP, and the approximation factor, but exponential in the number of dimensions. Second, we present a complete characterization of the strategy complexity (in terms of memory bounds and randomization) required to solve our problem.
1. Introduction
MDPs and mean-payoff objectives. The standard models for dynamic stochastic systems with both nondeterministic and probabilistic behaviours are Markov decision processes (MDPs) [How60, Put94, FV97]. An MDP consists of a finite state space, and in every state a controller can choose among several actions (the nondeterministic choices), and given the current state and the chosen action the system evolves stochastically according to a probabilistic transition function. Every action in an MDP is associated with a reward (or cost), and the basic problem is to obtain a strategy (or policy) that resolves the choice of actions in order to optimize the rewards obtained over the run of the system. An objective is a
This is an extended version of the LICS'15 paper with full proofs and additional complexity results. This research was funded in part by Austrian Science Fund Grant No P 23499-N23, European Research Council Grant No 279307 (Graph Games), the DFG Research Training Group PUMA: Programm- und Modell-Analyse (GRK 1480), the Czech Science Foundation grant No. 15-17564S, and People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) REA Grant No 291734.
function that given a sequence of rewards over the run of the system combines them to a single value. A classical and one of the most well-studied objectives in context of MDPs is the limit-average (or long-run average or mean-payoff) objective that assigns to every run the average of the rewards over the run.
Single vs. multiple objectives. MDPs with single mean-payoff objectives have been widely studied (see, e.g., [Put94, FV97]), with many applications ranging from computational biology to the analysis of security protocols, randomized algorithms, or robot planning, to name a few [BK08, KNP02, DEKM98, KGFP09]. In the verification of probabilistic systems, MDPs are widely used for concurrent probabilistic systems [CY95, Var85] and probabilistic systems operating in open environments [Seg95, dA97], and are applied in diverse domains [BK08, KNP02]. However, in several application domains, there is not a single optimization goal, but multiple, potentially dependent and conflicting goals. For example, in designing a computer system, the goal is to maximize average performance while minimizing average power consumption; or in an inventory management system, the goal is to optimize several potentially dependent costs for maintaining each kind of product. These motivate the study of MDPs with multiple mean-payoff objectives, which has also been applied in several problems such as dynamic power management [FKP12].

Two views. There exist two views in the study of MDPs with mean-payoff objectives [BBC+14]. The traditional and classical view is the expectation semantics, where the goal is to maximize (or minimize) the expectation of the mean-payoff objective. There are numerous applications of MDPs with the expectation semantics, such as in inventory control, planning, and performance evaluation [Put94, FV97]. The alternative semantics is called the satisfaction semantics, which, given a mean-payoff value threshold sat and a probability threshold pr, asks for a strategy ensuring that the mean-payoff value is at least sat with probability at least pr. In the case of n reward functions, there are two possible interpretations. Let sat and pr be two vectors of thresholds of dimension n, and let pr also denote a single probability threshold in [0, 1]. The first interpretation (namely, the conjunctive interpretation) requires the satisfaction semantics in each dimension 1 ≤ i ≤ n with thresholds sat_i and pr_i, respectively (where v_i is the i-th component of a vector v). The sets of satisfying runs for the individual rewards may even be disjoint here. The second interpretation (namely, the joint interpretation) requires the satisfaction semantics for all rewards at once. Precisely, it requires that, with probability at least pr, the mean-payoff value vector be at least sat. The distinction of the two views (expectation vs. satisfaction) and their applicability in the analysis of problems related to stochastic reactive systems has been discussed in detail in [BBC+14]. While the joint interpretation of satisfaction has already been introduced and studied in [BBC+14], here we also consider the conjunctive interpretation, which was not considered in [BBC+14]. The conjunctive interpretation was considered in [FKR95]; however, only a partial solution was provided, and it was mentioned that a complete solution would be very useful.

Our problem. In this work we consider a new problem that unifies the two different semantics. Intuitively, the problem we consider asks to optimize the expectation while ensuring the satisfaction. Formally, consider an MDP with n reward functions, a probability threshold vector pr (or a single threshold pr for the joint interpretation), and a mean-payoff value threshold vector sat. We consider the set of satisfaction strategies that ensure the satisfaction semantics.
Then the optimization of the expectation is considered with respect to the satisfaction strategies. Note that if pr is 0, then the set of satisfaction strategies is the set of all strategies, and we obtain the traditional expectation semantics as a special case.
We also consider important special cases of our problem, depending on whether there is a single reward (mono-reward) or multiple rewards (multi-reward), and whether the probability threshold is pr = 1 (qualitative criteria) or the general case (quantitative criteria). Specifically, we consider four cases:
(1) Mono-qual: a single reward function and qualitative satisfaction semantics;
(2) Mono-quant: a single reward function and quantitative satisfaction semantics;
(3) Multi-qual: multiple reward functions and qualitative satisfaction semantics;
(4) Multi-quant: multiple reward functions and quantitative satisfaction semantics.

Note that for multi-qual and the mono cases, the two interpretations (conjunctive and joint) of the satisfaction semantics coincide, whereas in the multi-quant problem (which is the most general problem) we consider both the conjunctive and the joint interpretations, separately (multi-quant-conjunctive, multi-quant-joint) as well as at once (multi-quant-conjunctive-joint).
Motivation. The motivation to study the problem we consider is twofold. Firstly, it presents a unifying approach that combines the two existing semantics for MDPs. Secondly and more importantly, it allows us to consider the problem of optimization along with risk aversion. A risk-averse strategy must ensure a certain probabilistic guarantee on the payoff function. The notion of risk aversion is captured by the satisfaction semantics, and thus the problem we consider captures the notion of optimization under risk-averse strategies that provide a probabilistic guarantee. The notion of strong risk aversion, where the probabilistic choices are treated as an adversary, is considered in [BFRR14], whereas we consider probabilistic (both qualitative and quantitative) guarantees for risk aversion. We now illustrate our problem with several examples:
• For simple risk aversion, consider a single reward function modelling investment. Positive reward stands for profit, negative for loss. We aim at maximizing the expected long-run average while guaranteeing that it is non-negative with probability at least 95%. This is an instance of mono-quant with pr = 0.95, sat = 0.
• For more dimensions, consider the example of [Put94, Problems 6.1, 8.17]. A vendor assigns to each customer either a low or a high rank. Further, each year the vendor decides whether to invest money into sending a catalogue to the customer or not. Depending on the rank and on receiving a catalogue, the customer spends different amounts on the vendor's products, and the rank can change. The aim is to maximize the expected profit provided the catalogue is almost surely sent with frequency at most f. This is an instance of multi-qual. Further, one can extend this example to only require that the catalogue frequency does not exceed f with 95% probability, while 5% of the best customers may still receive catalogues very often (an instance of multi-quant).
• The following is again an instance of multi-quant. A gratis service for downloading is offered as well as a premium one. For each, we model the throughput as rewards r_1 and r_2. For the gratis service, an expected throughput of 1 Mbps is guaranteed, as well as 60% of connections running on at least 0.8 Mbps. For the premium service, not only do we have a higher expectation of 10 Mbps, but also 95% of the connections are guaranteed to run on at least 5 Mbps and 80% on even 8 Mbps (satisfaction constraints). In order to keep this guarantee, we may need to temporarily hire resources from a cloud, whose cost is modelled as a reward r_3. While satisfying
the guarantee, we want to maximize the expectation of p_2 · r_2 − p_3 · r_3, where p_2 is the price per Mb at which the premium service is sold and p_3 is the price at which additional servers can be hired. Note that since the percentages above are different, the constraints cannot be encoded using the joint interpretation, and the conjunctive interpretation is necessary.
The basic computational questions. In MDPs with multiple mean-payoff objectives, different strategies may produce incomparable solutions. Thus, there is no "best" solution in general. Informally, the set of achievable solutions is the set of all vectors v such that there is a strategy that ensures the satisfaction semantics and that the expected mean-payoff value vector under the strategy is at least v. The "trade-offs" among the goals represented by the individual mean-payoff objectives are formally captured by the Pareto curve, which consists of all maximal tuples (with respect to component-wise ordering) that are not strictly dominated by any achievable solution. Pareto optimality has been studied in cooperative game theory [Owe95] and in multi-criterion optimization and decision making in both economics and engineering [Kos88, YC03, SCK04].
We study the following fundamental questions related to the properties of strategies and algorithmic aspects in MDPs:
• Algorithmic complexity: What is the complexity of deciding whether a given vector represents an achievable solution, and if the answer is yes, then compute a witness strategy?
• Strategy complexity: What type of strategies is sufficient (and necessary) for achievable solutions?
• Pareto-curve computation: Is it possible to compute an approximation of the Pareto curve?
Our contributions. We provide comprehensive answers to the above questions. The main highlights of our contributions are:
• Algorithmic complexity. We present algorithms for deciding whether a given vector is an achievable solution and constructing a witness strategy. All our algorithms are polynomial in the size of the MDP. Moreover, they are polynomial even in the number of dimensions, except for multi-quant with conjunctive interpretation where it is exponential.
• Strategy complexity. It is known that for both the expectation and satisfaction semantics with a single reward, deterministic memoryless strategies are sufficient [FV97, BBE10, BBC+14]. We show this carries over to the mono-qual case only. In contrast, we show that for mono-quant both randomization and memory are necessary. Randomized strategies can be stochastic-update, where the memory is updated probabilistically, or deterministic-update, where the memory update is deterministic. We provide precise bounds on the memory size of stochastic-update strategies. Further, we show that for both mono-quant and multi-qual, deterministic-update strategies require memory whose size depends on the MDP. Finally, we also show that deterministic-update strategies are sufficient even for multi-quant, thus extending the results of [BBC+14].
A strategy is memoryless if it is independent of the history and depends only on the current state. A strategy that is not deterministic is called randomized.
• Pareto-curve computation. We show that in all cases with multiple rewards an ε-approximation of the Pareto curve can be achieved in time polynomial in the size of the MDP, exponential in the number of dimensions, and polynomial in 1/ε, for ε > 0. In summary, we unify the two existing semantics, present comprehensive results related to algorithmic and strategy complexities for the unifying semantics, and improve results for the existing semantics.
Technical contributions. In the study of MDPs (with single or multiple rewards), the solution approach is often by characterizing the solution as a set of linear constraints. Similarly to previous works [CMH06, EKVY08, FKN+11, BBC+14], we also obtain our results by showing that the set of achievable solutions can be represented by a set of linear constraints, and from the linear constraints witness strategies for achievable solutions can be constructed. However, previous work on the satisfaction semantics [BBC+14, RRS15] reduces the problem to invoking a linear-programming solution for each maximal end component and a separate linear program to combine the partial results. In contrast, we unify the solution approaches for expectation and satisfaction and provide one complete linear program for the whole problem. This in turn allows us to optimize the expectation while guaranteeing satisfaction. Further, this approach immediately yields a linear program where both conjunctive and joint interpretations are combined, and we can optimize any linear combination of expectations. Finally, we can also optimize the probabilistic guarantees while ensuring the required expectation. The technical device to obtain one linear program is to split the standard variables into several, depending on which subsets of constraints they help to achieve. This causes technical complications that have to be dealt with by making use of conditional probability methods.
Related work. The study of Markov decision processes with multiple expectation objectives was initiated in the area of applied probability theory, where it is known as constrained MDPs [Put94, Alt99]. The attention in the study of constrained MDPs has mainly focused on restricted classes of MDPs, such as unichain MDPs, where all states are visited infinitely often under any strategy. Such a restriction guarantees the existence of memoryless optimal strategies. The more general problem of MDPs with multiple mean-payoff objectives was first considered in [Cha07], and a complete picture was presented in [BBC+14]. The expectation and satisfaction semantics were considered in [BBC+14], and our work unifies the two different semantics for MDPs. For general MDPs, [CMH06, CFW13] studied multiple discounted reward functions. MDPs with multiple ω-regular specifications were studied in [EKVY08]. It was shown that the Pareto curve can be approximated in time polynomial in the size of the MDP and exponential in the number of specifications; the algorithm reduces the problem to MDPs with multiple reachability specifications, which can be solved by multi-objective linear programming [PY00]. In [FKN+11], the results of [EKVY08] were extended to combine ω-regular and expected total reward objectives. The problem of conjunctive satisfaction was introduced in [FKR95]. They present a solution only for stationary (memoryless) strategies, and explicitly mention that such strategies are not sufficient and that a solution to the general problem would be very useful. They also mention that it is unlikely to be a simple extension of the single-dimensional case. Our results not only present the general solution, but also combine both the conjunctive and joint satisfaction semantics with the expectation semantics. Multiple percentile queries are considered for various objectives, such as mean-payoff, limsup, liminf, and shortest path, in [RRS15]. However, [RRS15]
does not consider optimizing the expectation, whereas we consider maximizing the expectation along with the satisfaction semantics. The notion of risk has been considered in MDPs with discounted objectives [WL99], where the goal is to maximize (resp. minimize) the probability (risk) that the expected total discounted reward (resp. cost) is above (resp. below) a threshold. The notion of strong risk aversion, where the probabilistic choices are instead treated as an adversary, was considered in [BFRR14]. In [BFRR14] the problem was considered for a single reward, for mean-payoff and shortest path. In contrast, though inspired by [BFRR14], we consider risk aversion for multiple reward functions with probabilistic guarantees (instead of adversarial guarantees), which is natural for MDPs. Moreover, [BFRR14] generalizes mean-payoff games, for which no polynomial-time solution is known, whereas in our case, we present polynomial-time algorithms for the single reward case and in several cases of multiple rewards (see the first item of our contributions). Further, an independent work [CR15] extends [BFRR14] to multiple dimensions, and they also consider the "beyond almost-sure threshold problem", which corresponds to the multi-qual problem, a special case of our solution. Finally, a very different notion of risk has been considered in [BCFK13], where the goal is to optimize the expectation while ensuring low variance. The problem has been considered only for a single dimension, and no polynomial-time algorithm is known.
2. Preliminaries
2.1. Basic definitions. We mostly follow the basic definitions of [BBC+14] with only minor deviations. We use ℕ, ℚ, ℝ to denote the sets of positive integers, rational numbers, and real numbers, respectively. For n ∈ ℕ, we denote [n] = {1, ..., n}. For a sequence ω = l_1 l_2 ··· and n ∈ ℕ, we denote the n-th element by ω[n].
Given two vectors v, w ∈ ℝ^k, where k ∈ ℕ, we write v ≥ w iff v_i ≥ w_i for all 1 ≤ i ≤ k, where v_i denotes the i-th component of vector v. Further, 1 denotes the vector (1, ..., 1), and 1_x denotes Kronecker's delta, i.e., 1_x(x) = 1 and 1_x(y) = 0 for y ≠ x.
Finally, the set of all distributions over a countable set X is denoted by Dist(X), and d ∈ Dist(X) is Dirac if d(x) = 1 for some x ∈ X, i.e., d = 1_x.
Markov chains. A Markov chain is a tuple M = (L, P, μ) where L is a countable set of locations, P : L → Dist(L) is a probabilistic transition function, and μ ∈ Dist(L) is the initial probability distribution.
A run in M is an infinite sequence ω = l_1 l_2 ··· of locations; a path in M is a finite prefix of a run. Each path w in M determines the set Cone(w) consisting of all runs that start with w. To M we associate the probability space (Runs, F, P), where Runs is the set of all runs in M, F is the σ-field generated by all Cone(w), and P is the unique probability measure such that P(Cone(l_1 ··· l_k)) = μ(l_1) · ∏_{i=1}^{k−1} P(l_i)(l_{i+1}).
Markov decision processes. A Markov decision process (MDP) is a tuple G = (S, A, Act, δ, s_0) where S is a finite set of states, A is a finite set of actions, Act : S → 2^A \ {∅} assigns to each state s the set Act(s) of actions enabled in s so that {Act(s) | s ∈ S} is a partitioning of A, δ : A → Dist(S) is a probabilistic transition function that given an action a gives a probability distribution over the successor states, and s_0 is the initial state. Note that we consider that every action is enabled in exactly one state.
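A minimal Python encoding of this definition, used by the sketches below; the field names are illustrative only and not part of any tool mentioned in this thesis.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MDP:
    # Illustrative encoding of G = (S, A, Act, delta, s0).
    states: List[str]                    # S
    enabled: Dict[str, List[str]]        # Act(s); every action enabled in exactly one state
    delta: Dict[str, Dict[str, float]]   # delta(a): distribution over successor states
    initial: str                         # s0

    def validate(self):
        seen = set()
        for s in self.states:
            assert self.enabled[s], "Act(s) must be non-empty"
            for a in self.enabled[s]:
                assert a not in seen, "each action belongs to exactly one state"
                seen.add(a)
                assert abs(sum(self.delta[a].values()) - 1.0) < 1e-9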
A run in G is an infinite alternating sequence of states and actions ω = s_1 a_1 s_2 a_2 ··· such that for all i ≥ 1, we have a_i ∈ Act(s_i) and δ(a_i)(s_{i+1}) > 0. A path of length k in G is a finite prefix w = s_1 a_1 ··· a_{k−1} s_k of a run in G.
Strategies and plays. The semantics of MDPs is defined using the notion of strategies. Intuitively, a strategy in an MDP G is a "recipe" to choose actions. Usually, a strategy is formally defined as a function σ : (SA)*S → Dist(A) that given a finite path w, representing the history of a play, gives a probability distribution over the actions enabled in the last state. In this paper, we adopt a slightly different (though equivalent—see [BBC+14, Section 6]) definition, which is more convenient for our setting. Let M be a countable set of memory elements. A strategy is a triple σ = (σ_u, σ_n, α), where σ_u : A × S × M → Dist(M) and σ_n : S × M → Dist(A) are memory update and next move functions, respectively, and α is the initial distribution on memory elements. We require that, for all (s, m) ∈ S × M, the distribution σ_n(s, m) assigns a positive value only to actions enabled at s, i.e. σ_n(s, m) ∈ Dist(Act(s)).
A play of G determined by a strategy σ is a Markov chain G^σ = (S × M × A, P, μ), where

μ(s, m, a) = 1_{s_0}(s) · α(m) · σ_n(s, m)(a)
P(s, m, a)(s', m', a') = δ(a)(s') · σ_u(a, s', m)(m') · σ_n(s', m')(a').
Hence, G^σ starts in a location chosen randomly according to α and σ_n. In a current location (s, m, a), the next action to be performed is a, hence the probability of entering s' is δ(a)(s'). The probability of updating the memory to m' is σ_u(a, s', m)(m'), and the probability of selecting a' as the next action is σ_n(s', m')(a'). Note that these choices are independent, and thus we obtain the product above. The induced probability measure is denoted by P^σ, and when the initial state s is not clear from the context, we use P^σ_s to denote the measure corresponding to the MDP where the initial state is set to s. "Almost surely" or "almost all runs" refers to happening with probability 1 according to this measure. The expected value of a random variable f : Runs → ℝ is E^σ_s[f] = ∫_Runs f dP^σ_s, or E^σ[f] = ∫_Runs f dP^σ for short. For t ∈ ℕ, the random variables S_t and A_t return s and a, respectively, where (s, m, a) is the t-th location on the run.
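The product construction above can be read operationally; the following sketch simulates a finite prefix of a play of G^σ, reusing the illustrative MDP encoding above (sigma_u, sigma_n and alpha are functions/distributions supplied by the caller; all names are illustrative).

import random

def sample(dist):
    # dist: dict mapping values to probabilities
    vals = list(dist)
    return random.choices(vals, weights=[dist[v] for v in vals])[0]

def simulate_play(mdp, sigma_u, sigma_n, alpha, steps):
    s, m = mdp.initial, sample(alpha)
    a = sample(sigma_n(s, m))            # initial location drawn via alpha and sigma_n
    run = [(s, m, a)]
    for _ in range(steps):
        s = sample(mdp.delta[a])         # successor state via delta(a)
        m = sample(sigma_u(a, s, m))     # (stochastic) memory update
        a = sample(sigma_n(s, m))        # next action from the new (state, memory)
        run.append((s, m, a))
    return run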
Strategy types. In general, a strategy may use infinite memory M, and both σ_u and σ_n may randomize. The strategy is
• deterministic-update, if α is Dirac and the memory update function gives a Dirac distribution for every argument;
• stochastic-update, if it is not necessarily deterministic-update;
• deterministic, if it is deterministic-update and the next move function gives a Dirac distribution for every argument;
• randomized, if it is not necessarily deterministic.
We also classify the strategies according to the size of memory they use. The important subclasses of strategies are
• memoryless (or 1-memory) strategies, in which M is a singleton,
• n-memory strategies, in which M has exactly n elements,
• finite-memory strategies, in which M is finite, and
• Markov strategies, in which M = ℕ and σ_u(·, ·, n)(n + 1) = 1.
Markov strategies have a nice structure: they only need a counter and to know the current state [FV97].
End components. A set T ∪ B with ∅ ≠ T ⊆ S and B ⊆ ⋃_{t∈T} Act(t) is an end component of G if (1) for all a ∈ B, whenever δ(a)(s') > 0 then s' ∈ T; and (2) for all s, t ∈ T there is a path w = s_1 a_1 ··· a_{k−1} s_k such that s_1 = s, s_k = t, and all states and actions that appear in w belong to T and B, respectively. An end component T ∪ B is a maximal end component (MEC) if it is maximal with respect to the subset ordering. Given an MDP, the set of MECs is denoted by MEC. Finally, if (S, A) is a MEC, we call the MDP strongly connected.
Remark 2.1. The maximal end component (MEC) decomposition of an MDP, i.e., the computation of MEC, can be achieved in polynomial time [CY95]. For improved algorithms for general MDPs and various special cases see [CH11, CH12, CH14, CL13].
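As an illustration of the remark, here is a standard (unoptimized) MEC-decomposition sketch in Python, using networkx for SCC computation and the illustrative MDP encoding from above: repeatedly compute SCCs of the remaining graph and remove actions that may leave their SCC, until a fixpoint is reached.

import networkx as nx

def mec_decomposition(mdp):
    enabled = {s: set(acts) for s, acts in mdp.enabled.items()}
    while True:
        g = nx.DiGraph()
        g.add_nodes_from(enabled)
        for s, acts in enabled.items():
            for a in acts:
                for t in mdp.delta[a]:
                    g.add_edge(s, t)
        sccs = list(nx.strongly_connected_components(g))
        scc_of = {s: i for i, comp in enumerate(sccs) for s in comp}
        changed = False
        for s in list(enabled):
            keep = {a for a in enabled[s]
                    if all(scc_of[t] == scc_of[s] for t in mdp.delta[a])}
            if keep != enabled[s]:
                enabled[s] = keep
                changed = True
            if not enabled[s]:
                del enabled[s]      # no action of s stays within its SCC
                changed = True
        if not changed:
            # remaining SCCs consisting of enabled states are the MECs
            return [(comp, {a for s in comp for a in enabled[s]})
                    for comp in sccs
                    if all(s in enabled for s in comp)]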
Analogously, for a finite-memory strategy σ, a bottom strongly connected component (BSCC) of G^σ is a subset of locations W ⊆ S × M × A such that (i) for all ℓ_1 ∈ W and ℓ_2 ∈ S × M × A, if there is a path from ℓ_1 to ℓ_2 then ℓ_2 ∈ W, and (ii) for all ℓ_1, ℓ_2 ∈ W there is a path from ℓ_1 to ℓ_2. Every BSCC W determines a unique end component {(s, a) | (s, m, a) ∈ W} of G, and we sometimes do not strictly distinguish between W and its associated end component.
For C ∈ MEC, let

Ω_C = {ω ∈ Runs | ∃n_0 : ∀n ≥ n_0 : ω[n] ∈ C}

denote the set of runs with a suffix in C. Similarly, we define Ω_D for a BSCC D. Since almost every run eventually remains in a MEC, e.g. [CY98, Proposition 3.1], {Ω_C | C ∈ MEC} "partitions" almost all runs. More precisely, for every strategy, each run belongs to exactly one Ω_C almost surely; i.e. a run never belongs to two Ω_C's, and for every σ we have P^σ[⋃_{C∈MEC} Ω_C] = 1. Therefore, actions that are not in any MEC are almost surely taken only finitely many times.
2.2. Problem statement. In order to define our problem, we first briefly recall how the long-run average can be defined. Let G = (S, A, Act, δ, s_0) be an MDP, n ∈ ℕ, and r : A → ℚ^n an n-dimensional reward function. Since the random variable given by the limit-average function lr(r) = lim_{T→∞} (1/T) ∑_{t=1}^{T} r(A_t) may be undefined for some runs, we consider maximizing the respective point-wise limit inferior:
lr_inf(r) = liminf_{T→∞} (1/T) ∑_{t=1}^{T} r(A_t),

i.e. for each i ∈ [n] and ω ∈ Runs, we have lr_inf(r)(ω)_i = liminf_{T→∞} (1/T) ∑_{t=1}^{T} r(A_t(ω))_i. Similarly, we could define lr_sup(r) = limsup_{T→∞} (1/T) ∑_{t=1}^{T} r(A_t). However, maximizing the limit superior is less interesting, see [BBC+14]. Further, the respective minimizing problems can be solved by maximization with opposite rewards. This paper is concerned with the following tasks:
Realizability (multi-quant-conjunctive): Given an MDP, n ∈ ℕ, r : A → ℚ^n, exp ∈ ℚ^n, sat ∈ ℚ^n, and pr ∈ ([0, 1] ∩ ℚ)^n, decide whether there is a strategy σ such that for all i ∈ [n]

E^σ[lr_inf(r)_i] ≥ exp_i,   (EXP)
P^σ[lr_inf(r)_i ≥ sat_i] ≥ pr_i.   (conjunctive-SAT)

Witness strategy synthesis: If realizable, construct a strategy satisfying the requirements.

ε-witness strategy synthesis: If realizable, construct a strategy satisfying the requirements with exp − ε·1 and sat − ε·1.

We are mostly interested in (multi-quant-conjunctive) as it is the core of all other discussed problems. However, we also consider the following important special cases:
(multi-qual): pr = 1,
(mono-quant): n = 1,
(mono-qual): n = 1, pr = 1.
Additionally, we are also interested in variants of (multi-quant-conjunctive). Firstly, in (multi-quant-joint), the constraint (conjunctive-SAT) is replaced by

P^σ[lr_inf(r) ≥ sat] ≥ pr   (joint-SAT)

for pr ∈ [0, 1]. Secondly, (multi-quant-conjunctive-joint) arises by adding the (joint-SAT) constraint P^σ[lr_inf(r) ≥ sat] ≥ pr for pr ∈ [0, 1] ∩ ℚ and sat ∈ ℚ^n. The relationship between the problems is depicted in Fig. 1.
Figure 1. Relationship of the defined problems with lower problems being specializations of the higher ones (from top to bottom: multi-quant-conjunctive-joint; multi-quant-conjunctive and multi-quant-joint; multi-qual and mono-quant; mono-qual).
Furthermore, each of the three constraints (EXP), (conjunctive-SAT), and (joint-SAT) defines the respective decision problem given solely by that constraint. Each of these three problems is a special case of (multi-quant-conjunctive-joint) where the other constraints are trivial (e.g. requiring the average reward to be greater than or equal to the minimum reward of the MDP). Finally, apart from decision problems, one often considers optimization problems, where the task is to maximize the parameters so that the answer to the decision problem is still positive. Observe that since optimization in the multi-dimensional setting cannot in general produce a single "best" solution, one can consider Pareto curves, which are sets of all component-wise optimal and mutually incomparable solutions to the optimization problem.
10
krishnendu chatterjee, zuzana kretinska, and jan kretinsky
Example 2.2 (Running example). We illustrate (multi-quant-conjunctive) with an MDP of Fig. 2 with n = 2, rewards as depicted, and exp = (1.1, 0.5), sat = (0.5,0.5),pr = (0.8, 0.8). Observe that rewards of actions £ and r are irrelevant as these actions can almost surely be taken only finitely many times.
a,r(a) = (4,0) 6,r(6) = (l,0) d,r(d) = (0,l)
Figure 2. An MDP with two-dimensional rewards
This instance is realizable and the witness strategy has the following properties. The strategy plays three "kinds" of runs. Firstly, due to pr = (0.8, 0.8), with probability at least 0.8 + 0.8 — 1 = 0.6 runs have to jointly surpass both satisfaction thresholds (at the same time), i.e. exceed the vector (0.5,0.5). This is only possible in the right MEC by playing each b and d half of the time and switching between them with a decreasing frequency, so that the frequency of c, e is in the limit 0. Secondly, in order to ensure the expectation of the first reward, we reach the left MEC with probability 0.2 and play a. Thirdly, with probability 0.2 we reach again the right MEC but only play d with frequency 1, ensuring the expectation of the second reward.
In order to play these three kinds of runs, in the first step in s we take £ with probability 0.4 (arriving to u with probability 0.2) and r with probability 0.6, and if we return back to s we play r with probability 1. If we reach the MEC on the right, we toss a biased coin and with probability 0.25 we go to w and play the third kind of runs, and with probability 0.75 play the first kind of runs.
Observe that although both the expectation and satisfaction value thresholds for the second reward are 0.5, the only solution is not to play all runs with this reward, but some with a lower one and some with a higher one. Also note that each of the three types of runs must be present in any witness strategy. Most importantly, in the MEC at state w we have to play in two different ways, depending on which subset of value thresholds we intend to satisfy on each run. Also note that in order to do that, we use memory with stochastic update. A
3. Solution
In this section, we briefly recall a solution to a previously considered problem and show our solution to the more general (multi-quant-conjunctive) realizability problem, along with an overview of the correctness proof. The solution to the other variants is derived and a detailed analysis of the special cases and the respective complexities is given in Section 6.
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
11
3.1. Previous results.
3.1.1. Linear programming for expectation semantics. In [BBC+14], a solution to the (EXP) constraint has been given. The existence of a witness strategy was shown equivalent to the existence of a solution to the linear program in Fig. 3.
Requiring all variables ya,ySixa f°r a 6 A,s £ S be non-negative, the program is the
following:
(1) transient flow: for s £ S
1s0(s) + ^2ya- S(a)(s) = ^ Va+Vs aGA a^Act(s)
(2) almost-sure switching to recurrent behaviour:
E v> = 1
sGCgMEC
(3) probability of switching in a MEC is the frequency of using its actions: for (7 S MEC
sGC aGC
(4) recurrent flow: for s£S
^xa ■ 5(a)(s) = ^2 xa
aGA a^Act(s)
(5) expected rewards:
~^2 Xa ' r — exP
aGA
FIGURE 3. Linear program of [BBC+14] for (EXP)
Intuitively, xa is the expected frequency of using a on the long run; Equation 4 thus expresses the recurrent flow in MECs and Equation 5 the expected long-run average reward. However, before we can play according to ^-variables, we have to reach MECs and switch from the transient behaviour to this recurrent behaviour. Equation 1 expresses the transient flow before switching. Variables ya are the expected number of using a until we switch to the recurrent behaviour in MECs and ys is the probability of this switch upon reaching s. To relate y- and ^-variables, Equation 3 states that the probability to switch within a given MEC is the same whether viewed from the transient or recurrent flow perspective. Actually, one could eliminate variables ys and use directly xa in Equation 1 and leave out Equation 3 completely, in the spirit of [Put94]. However, the form with explicit ys is more convenient for correctness proofs. Finally, Equation 2 states that switching happens almost surely. Note that summing Equation 1 over all s £ S yields 5Zsgs2/s = ^ Since ys can be shown to equal 0 for state s not in MEC, Equation 2 is redundant, but again more convenient.
The solution above builds on the work [EKVY08], which studied MDPs with multiple reachability and ^-regular specifications. It has inspired Equation 1 as well as computation of the Pareto curve. It was shown that the Pareto curve can be approximated in polynomial time in the size of MDP and exponential in the number of specifications; the algorithm
12
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
reduces the problem to MDPs with multiple reachability specifications, which can be solved by multi-objective linear programming [PYOO].
3.1.2. Linear programming for satisfaction semantics. Apart from considering (EXP) separately, [BBC+14] also considers the constraint (joint-SAT) separately. While the former was solved using the linear program above, the latter required a reduction to one linear program per each MEC and another one to combine the results. More precisely, for each MEC we first decide whether there is a strategy exceeding the threshold. Second, we maximize the probability to reach these MECs. Similarly, in [RRS15], for each MEC we decide for every subset of thresholds whether there is a strategy exceeding them. The results are again combined in a linear program for reachability.
In contrast, we shall provide a single linear program for the (multi-quant-conjunctive) problem, unifying the solution approaches for expectation and satisfaction problem. This in turn allows us to optimize the expectation while guaranteeing satisfaction. Further, this approach immediately yields a linear program where both conjunctive and joint interpretations are combined, and we can optimize any linear combination of expectations. Finally, we can also optimize the probabilistic guarantees while ensuring the required expectation. For greater detail, see Section 3.4.
3.2. Our unifying solution. There are two main tricks to incorporate the satisfaction semantics. The first one is to ensure that a flow exceeds the value threshold. We first explain it on the qualitative case.
3.2.1. Solution to (multi-qual). When the additional constraint (SAT) is added so that almost all runs satisfy h;nf (r) > sat, then the linear program of Fig. 3 shall be extended with the following additional equation:
6. almost-sure satisfaction: for C G MEC
Note that xa represents the absolute frequency of playing a (not relative within the MEC). Intuitively, Equation 6 thus requires in each MEC the average reward be at least sat. Here we rely on the non-trivial fact, that in a MEC, actions can be played on almost all runs with the given frequencies for any flow, see Corollary 5.5.
The second trick ensures that each conjunct in the satisfaction constraint can be handled separately and, consequently, that the probability threshold can be checked.
3.2.2. Solution to (multi-quant-conjunctive). When each value threshold sati comes with a non-trivial probability threshold pri, some runs may and some may not have the long-run average reward exceeding sati. In order to speak about each group, we split the set of runs, for each reward, into parts which do and which do not exceed the threshold.
Technically, we keep Equations 1-5 as well as 6, but split xa into xa^ for N C [n], where N describes the subset of exceeded thresholds; similarly for ys. The linear program L then takes the form displayed in Fig. 4.
Intuitively, only the runs in the appropriate "iV-classes" are required in Equation 6 to have long-run average rewards exceeding the satisfaction value threshold. However, only
unifying two views on multiple mean-payoff objectives in mdps
13
Requiring all variables ya,ys,N,Xa,N f°r a £ A, s £ S, N C [n] be non-negative, the program is the following:
(1) transient flow: for s £ S
1s0(s) + X ya ' S{a)(s) = X ya + X VsjN
a£A a^Act(s) NC[n]
(2) almost-sure switching to recurrent behaviour:
X] ^ = 1
sGCgMEC
NC[n]
(3) probability of switching in a MEC is the frequency of using its actions: for C £ MEC, N C [n]
X y^N = X Xa'N
sGC aGC
(4) recurrent flow: for s £ S, N C [n]
X^a,^ ■ exp
a€A, NC[n]
(6) commitment to satisfaction: for C £ MEC, iV C [n], i E N
X xa,N ■ r(a)i > X Xa'N ' satl
aGC aGC
(7) satisfaction: for i £ [n]
X - Pr*
a€A, NC[n]:i€N
Figure 4. Linear program L for (multi-quant-conjunctive)
the appropriate "iV-classes" are considered for surpassing the probabilistic threshold in Equation 7.
Theorem 3.1. Given a (multi-quant-conjunctive) realizability problem, the respective system L (in Fig. 4) satisfies the following:
(1) The system L is constructible and solvable in time polynomial in the size of G and exponential in n.
(2) Every witness strategy induces a solution to L.
(3) Every solution to L effectively induces a witness strategy.
Example 3.2 (Running example). The linear program L for Example 2.2 is shown in Appendix A. Here we spell out some useful points we need later: Equation 1 for state s
1 + 0.5ye = ye + yr+ ys,0 + ys,{iy + ys,{2} + ys,{i,2}
14
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
expresses the Kirchhoff's law for the flow through the initial state. Equation 6 for the MEC C = {v, w, b, c, d, e}, TV = {1,2}, i = 1
expresses that runs ending up in C and satisfying both satisfaction value thresholds have to use action b at least half of the time. The same holds for d and thus actions c, e must be played with zero frequency on these runs. Equation 7 for i = 1 sums up the gain of all actions on runs that have committed to exceed the satisfaction value threshold either for the first reward, or for the first and the second reward.
Moreover, we show later in Lemma 5.1, that variables xi>N,xrN for any TV C [n] can be omitted from the system as they are zero for any solution. Intuitively, transient actions cannot be used in the recurrent flows. A
3.3. Proof overview. Here, we briefly describe the main ideas of the proof of Theorem 3.1.
The first point. The complexity follows immediately from the syntax of L and the existence of a polynomial-time algorithm for linear programming [Sch86].
The second point. Given a witness strategy a, we construct values for variables so that a valid solution is obtained. The technical details can be found in Section 4.
The proof of [BBC+14, Proposition 4.5], which inspires our proof, sets the values of xa to be the expected frequency of using a by a, i.e.
Since this Cesaro limit (expected frequency) may not be defined, a suitable value f(a) between the limit inferior and superior has to be taken. In contrast to the approach of [BBC+14], we need to distinguish among runs exceeding various subsets of the value thresholds sati,i G [n]. For TV C [n], we call a run N-good if lr;nf(r)j > sati for exactly all i £ TV. TV-good runs thus jointly satisfy the TV-subset of the constraints. Now instead of using frequencies f(a) of each action a, we use frequencies /at(a) of the action a on TV-good runs separately, for each TV. This requires some careful conditional probability considerations, in particular for Equations 1, 4, 6 and 7.
Example 3.3 (Running example). The strategy of Example 2.2 induces the following revalues. For instance, action a is played with a frequency 1 on runs of measure 0.2, hence xa,{l} = 0-2 and xa$ = ^a,{2} = ^,{1,2} = 0- Action d is played with frequency 0.5 on runs of measure 0.6 exceeding both value thresholds, and with frequency 1 on runs of measure 0.2 exceeding only the second value threshold. Consequently, £<2,{i,2} = 0-5 ■ 0.6 = 0.3 and xd,{2} = O-2 whereas xd$ = x^j =0. A
Values for y-variables are derived from the expected number of taking actions during the "transient" behaviour of the strategy. Since the expectation may be infinite in general, an equivalent strategy is constructed, which is memory less in the transient part, but switches to the recurrent behaviour in the same way. Then the expectations are finite and the result
^6,(1,2} ■ 1 > (^6,(1,2} + xc,{l,2} + ^,{1,2} + xe,{l,2}) ' 0-5
t=l
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
15
of [EKVY08] yields values satisfying the transient flow equation. Further, similarly as for revalues, instead of simply switching to recurrent behaviour in a particular MEC, we consider switching in a MEC and the set iV~ for which the following recurrent behaviour is iV-good.
Example 3.4 (Running example). The strategy of Example 2.2 plays in s for the first time £ with probability 0.4 and r with 0.6, and next time r with probability 1. This is equivalent to a memoryless strategy playing £ with 1/3 and r with 2/3. Indeed, both ensure reaching the left MEC with 0.2 and the right one with 0.8. Consequently, for instance for r, the expected number of taking this action is
The third point. Given a solution to L, we construct a witness strategy a, which has a particular structure. The technical details can be found in Section 5. The general pattern follows the proof method of [BBC+14, Proposition 4.5], but there are several important differences.
First, a strategy is designed to behave in a MEC so that the frequencies of actions match the rc-values. The structure of the proof differs here and we focus on underpinning the following key principle. Note that the flow described by ^-variables has in general several disconnected components within the MEC, and thus actions connecting them must not be played with positive frequency. Yet there are strategies that on almost all runs play actions of all components with exactly the given frequencies. The trick is to play the "connecting" actions with an increasingly negligible frequency. As a result, the strategy visits all the states of the MEC infinitely often, as opposed to strategies generated from the linear program in Fig. 3 in [BBC+14], which is convenient for the analysis.
Second, the construction of the recurrent part of the strategy as well as switching to it has to reflect again the different parts of L for different N, resulting in iV-good behaviours.
Example 3.5 (Running example). A solution with x^^^y = 0.3,2^,(1,2} = 0.3 induces two disconnected flows. Each is an isolated loop, yet we can play a strategy that plays both actions exactly half of the time. We achieve this by playing actions c, e with probability l/2fe in the k—th step. In Section 5 we discuss the construction of the strategy from the solution in greater detail, necessary for later complexity discussion. A
3.4. Important aspects of our approach and its consequences. We now explain some important conceptual aspects of our result. The previous proof idea from [BBC+14] is as follows: (1) The problem for expectation semantics is solved by a linear program. (2) The problem for satisfaction semantics is solved as follows: each MEC is considered, solved separately using a linear program, and then a reachability problem is solved using a different linear program. In comparison, our proof has two conceptual steps. Since our goal is to optimize the expectation (which intuitively requires a linear program), the first step is to come up with a single linear program for satisfaction semantics. The second step is to come up with a linear program that unifies the linear program for expectation semantics and the linear program for satisfaction semantics, allowing us to maximize expectation while ensuring satisfaction.
The values yu,{i} = 0-2, yv,{i,2} = 0-6, Uv,{2} = 0-2 are given by the probability measures of each "kind" of runs (see Example 2.2). A
16
krishnendu chatterjee, zuzana kretinska, and jan kretinsky
Since our solution captures all the frequencies separately within one linear program, we can work with all the flows at once. This has several consequences:
• While all the hard constraints are given as a part of the problem, we can easily find maximal solution with respect to a weighted reward expectation, i.e. w ■ lr;nf(r), where w is the vector of weights for each reward dimension. Indeed, it can be expressed as the objective function w ■ ~^2a N xa^ ■ r(a) of the linear program. Further, it is also relevant for the construction of the Pareto curve.
• We can also optimize satisfaction guarantees for given expectation thresholds. For more detail, see Section 8.
• We can easily add more satisfaction constraints (with different thresholds) on the same resource as well as add joint constraints of the form PCT[/\fc. lr;nf(rj,.) > pr]. Both can be solved by adding a copy of Equation 7 for each subset N of all the constraints.
• The number of variables used in the linear program immediately yields an upper bound on the computational complexity of various subclasses of the general problem. Several polynomial bounds are proven in Section 6. A
4. Proof of Theorem 3.1: Witness strategy induces solution to L
Now we present the technical proof of Theorem 3.1. We start with the second point and show how to construct a solution to L from a witness strategy. Let d be a strategy such that Vi G [n] . PCT[lrinf(r)l > sat,} > pr,
• ECT[lrinf(r)l > exPl
We construct a solution to the system L. The proof method roughly follows that of [BBC+14, Proposition 4.5]. However, separate flows for "iV-good" runs require some careful conditional probability considerations, in particular for Equations 4, 6 and 7.
4.1. Recurrent behaviour and Equations 4—7. We start with constructing values for variables xan,a £ 4,iV C [n].
In general, the frequencies of the actions may not be well defined, because the defining limits may not exist. Further, it may be unavoidable to have different frequencies for several sets of runs of positive measure. There are two tricks to overcome this difficulty. Firstly, we partition the runs into several classes depending on which parts of the objective they achieve. Secondly, within each class we pick suitable values lying between lr;nf(r) and fi"sup(r) °f these runs. In order to achieve the first point, we define for N sat, AVi ^ N : lrinf(r)(w)j < sat,}
Then 0,^, N C [n] form a partitioning of Runs. Further, observe that runs of 0,^ are the runs where joint satisfaction holds, for all rewards i G iV. This is important for the algorithm for (multi-quant-joint) from Section 6.
In order to achieve the second point, we define /7v(a), for every a, to be lying between values liminfy-j.oo ^ Yld=i ^[At = a fl Qn] and limsupy^^ ^ Yld=i ^[At = a fl fijv], which can be safely substituted for xa^ in L. Let A be written as {a±,a2, ■ ■ ■ ,cl\a\} and let us first consider the case when Pct[J7tv] > 0. Since every bounded infinite sequence contains an infinite convergent subsequence, there is an increasing sequence of indices, Tq .. .,
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
17
such that lim^oo ^ 5Zt=i IPCT[^t = ai \ &n] is wen defined. Then we can choose a subse-
i
quence T02, if, T22 ... of the sequence T0\ T/, T2X ... so that lim^oo ^ YILi r 0, the transition probabilities on 0, cannot differ from the actual transition probabilities too much all the time. A
0.5 0.5
b,r(b) = 0
FIGURE 5. An MDP illustrating A
We first consider a simpler problem:
Lemma 4.2. Let (At)tepj be i.i.d. Bernoulli variables with expectation 5 = E[At]. Then for any event Q, with P[fi] > 0, we have lim Eq[A(] = 5.
t—>oo
18
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
Proof. For a contradiction, let w.l.o.g. limsup^^ En [At] = 5 + 3e. (If limsupt^^ En [At] < (5, we can consider the variables 1 — At with this property). Moreover, we may safely assume that En[Af] > 5 + 2e for all t £ N, otherwise we consider the respective subsequence. Let Highi C Q be the set of runs of Q such that i Ylt=i At > <5 + £ and similarly Normak C Q be the set of runs of Q such that 4 5Zt=i At < <5 + e. Clearly, = Highi ttJ Normali for every i. Then
1 1 1 5 + 2e<-VEn[At] = -En
<
t=i
l^mghl[Zl=i At] ■ P[^J + ^NormalAZUi &t] ■ P^ma^]
P[ffiff/iJ + F[Normali] 1 ■ Pfffiff/^] + (S + e) ■ F[Normali]
FlHigh,] + F[Normak] Altogether, by comparing the first and the last expression, we get
F[Normak] < 1 ~ 6 ~ 2e . F[High%} (4.2) s
where the fraction is constant for all i. Since by the law of large numbers lim^oo F[Highj] = 0, we obtain lim^oo F[Normali] = 0 and thus F[fl] = 0, a contradiction. □
Now we apply the preceding lemma to MDPs:
Lemma 4.3. Let A?~ C [n] be such that PCT[fijv] > 0. Then for every a £ A, s £ S, we have lim Fa[At = a\QN]- \Atv(a)(s) - 5(a)(s)\ = 0.
t—>oo
Proof plan. Note that if PCT[Af = a \ fijv] = 1 for all t then the result follows directly from the previous lemma where we set At(cj) to 1 if St+i = s and 0 otherwise. Indeed, then E[Af] = 5(a)(s) and En^[At] = Af (a)(s). Consequently, lim^oo PCT [At = a \ ttN] ■ \A?(a)(s)-5(a)(s)\ = l-0.
In the general case, the probability of taking a on the runs can vary over time. In order to cope with that, we consider sets I C N of positions where a is taken with high enough probability (i.e., in "many" runs). The first step of the proof is thus to derive (4.3), an analogue of (4.2), but now relativized to positions in I. In the previous lemma, the second step consisted in applying the law of large numbers to conclude that probability of overly high preference of some outcome has zero probability, causing a contradiction with (4.2). In this proof, the second step will require more math to conclude that, due to the relativization.
Proof. Suppose for a contradiction, that for some a £ A, s £ S there are infinitely many t for which Fa[At = a | ttN] ■ \A^(a)(s) - 6(a)(s)\ > £ for some £ > 0. Denote the set of these i's by T. Since both factors are bounded by 0 and 1, there are £ > 0 and e > 0 such that for all t £ T we have Fa[At = a \ ttN] > C and w.l.o.g. Af (a)(s) > S(a)(s) + 2e (if A^(a)(s) < <5(a)(s) then there is another successor s' of a with this property). Consequently, for every t £ T, we have
F°[nNnAt = anSt+1 = s]
F^NHAt = a]- > 5(a)(s) + 2£
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
19
First step. Now we derive (4.3), a version of (4.2) relativized to finite sets I C. T. The positive probability of taking a in these positions guarantees that overly high preference of the outcome s is well defined.
Formally, similarly to the previous inequality for each t G T, the same holds for the average over any finite set of indices I C.T:
6{a){s) + 2e < —----= (*)
Z^taF "1n nAt = a\
Denoting
i-Tries-In-7 = {to G QN \ \{t G / | At = a}\ = i} i-Successes-In-J = {uj G 0-n | \{t £ I \ At = a D St+i = s} \ = i} we can rewrite the term (*) by grouping runs with same "frequencies" as
Elii i- Fa [i-Successes-In-i]
(*) = —m-= (**)
El=i« ■ PCT[i-Tries-In-J]
Similarly to the previous lemma, we introduce runs with "success rate" higher and lower than 5(a)(s) + e, now relative to the indices of I. Formally,
High! = i-Tries-In-J n [J /c-Successes-In-J
k>i-(s(a)(s)+e)
Normal J = i-Tries-In-J n [J /c-Successes-In-J
k (*) > 5(a)(s) + 2e, we get by the same computation as for obtaining (4.2) 1/1 1 - 5 - 2e 1/1
Vi-Pff [Normalj] <--S^i ■ Fa [Highj] (4.3)
—^ s —^
i=i i=i
for every finite I C. T.
Second step. Now we consider particular Fs leading to a contradiction. Let T be written as (ii,i2) • • •} so that t\ < t2 < ■ ■ ■ ■ For m < n, we consider finite subsets R\ = {tmi tm+ii • • • > tn} of T and will prove that
lim lim V i ■ PCT = 0 (4.4)
20
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKÄ, AND JAN KRETINSKY
As a consequence of (4.3) we obtain also lim.
, lim„
Normal
0 and
thus lim^oo lim^^oo Yl\=i1 i-PCT[i-Tries-In-/™] = 0, i.e. with growing m the average number of tries after m approaches 0, a contradiction with PCT[Aj = a | f^yv] > C f°r infinitely many t and Fa[ttN] > 0.
It remains to prove (4.4). Intuitively, we consider index sets that start later (at position m —> oo) to avoid initial potentially large elements. Summands with high i's, i.e. runs with many tries, below denoted by C, will be shown negligible by the central limit theorem (in the previous lemma the law of large numbers was sufficient). Further, we will have to argue that even summands with low i's are small for high enough m. This is due to the fact that either a is taken frequently enough on some runs (A) or for high enough indices not any more on the other runs (B).
Formally, let Inf = Qn D {At = a for infinitely many t} and Fin>k = H {At =
a for only finitely many t}n{At into
middle {rn)
E «■
i=i
1^1,
a for some t > k]. We split the sum YJi=i
High
High™ n Inf
middle (m)
E ■
i=i
High™ n Fin
\I'n\\
■ E ■
i=middle(m)-\-l
High,
A B c
by defining an appropriate middle : N —> N. We show that each term approaches zero.
A: Observe that for every i and m, we have limj^oo ¥a[i-Tries-In-1^ n Inf] = 0. Hence also limjj-j.oo A = 0 for every m and irrespective of the choice of middleim), and thus lim^oo lim^oo A=0. B: We define middle(m) to be the largest number such that ^™^e(m) i. IpCT[_F%>m] <
1/m. This trivially ensures limm_j>00 B < linVj^oo 1/m = 0. C: Since limm_j.oo ¥° [Fin>m] = 0, we obtain by the definition of middle that for m —> oo also middle(m) —> oo. Consequently, it is sufficient to prove that
1^1 lim y i
i=k
High
—> 0 for k —> oo uniformly for all m .
(4.5)
Fix an arbitrary m. Let Xj denote the indicator random variable of the event that jth use of action a, when looking only at time points im,im+i,im+2 • • •, resulted in the successor s. Precisely, let Tj be an auxiliary random variable with value ti> such that \{q | m < q < i,Atq = a}\ = j and Atq = a; then Xj is 1 if Stj+i = s and 0 otherwise. Due to the Markov property, Xj are Bernoulli i.i.d. with mean 6(a)(s). Further,
Hight
C
> S(a)(s)
Therefore, by central limit theorem
Fa[mghi] <$(-
where e = e/^J5{a)(s) ■ (1 — 5{a)(s)) and is the cumulative distribution function of the standard normal distribution and ;$ denotes that the inequality < holds "only
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
21
for large i", i.e. in the limit. Consequently, for large k, we have
lim ^ i ■ pCT Hi9hi\ ~ X * ' $(~^ ■ ^)
where the right-hand side does not depend on m and is thus a uniform bound for all m. Further, since (—\fi ■ e) decreases exponentially in the right-hand side approaches 0 as k —> 0 (independently of m) and (4.5) follows. □
Now we show, that Equation 4 is satisfied. For all s £ S and N C [n] such that Pct[J7tv] = 0, we have trivially
^xaiN ■ 5(a)(s) = X xaiN
aGA a^Act(s)
and whenever Pct[S7tv] > 0 we have F[k]^/Ar(aH(a)(s)
L 1 aGA
= E ,lim ^Er^ = a I ■ JH^v] ■ 5(a)(a) (definition of fN)
1 Ti
= y lim — y PCTL4t = a I £V] ■ 5(a)(s) (linearity of the limit)
aGA t=l Ti
V lim ^ VPCTUt = a I nN] ■ Af (a)(s) (Lemma 4.3)
aGA t=l 1 ^
lim — V V Fa[At = a \nN}- Af (a)(a) (definition of 2»
t=l aGA 1 Tl
lim — V P^St+i = s | Oat] (definition of Af)
^-J-oo T)
* t=l
1 T'
lim — / Fa[St = s | fijv] (reindexing and Cesaro limit)
1 t=l
1 T'
lim — y y PCT[At = a | S7at] (s must be followed by a £ ylcf(s))
t=l aGAet(s)
1 1
^icj—] X /™rEr[Al = a | ^w]-Pct[^at] (linearity of the limit)
^ W^ aGAct(s) ^ t=l
mL i E ^(a) • (definition of /at)
^ Wj aGAct(s)
22
KRISHNENDU CHATTERJEE, ZUZANA KRETTNSKA, AND JAN KRETINSKY
Equation 5. For all i £ [n], we have
^2 ^2xa,N-rl(a) > ECT[lrinf(ri)] > expl
NC[n] aGA
where the second inequality is due to a being a witness strategy and the first inequality follows from the following:
^ *^2xa,N ■ rl(a)
NC[n] aGA
= ^ ^/jv(a) ■ r-j(a) (definition of xGiN)
NC[n] aGA P"[flN}>0
1 Tl
= V V rAa) ■ lim — V Fa\At = a I VtN] ■ Fa[nN] (definition of fN)
JVCIn] aGA t=l
F°[nN]>o
1
= V Fa[ttN] ■ lim — V V T-j(a) ■ Fa[At = a \ ttN] (linearity of the limit)
e^oo Tp
NC[n] t=l aGA
Pa[QN]>0
1 T
> Fa [nN] ■ lhri inf - Yl r* (a) ' ^ [A* = a I (definition of lim inf)
NC[n] °° t=l aGA
Pa[QN]>0
1 T
= ^ Fa[ttN] ■ lmiinf — ^ ECT[rj(At) | ft at] (definition of the expectation)
NC[n] °° t=l
Pa[QN]>0
> Y ^I^n] ■ ECT[lrinf(n) | ttN] (Fatou's lemma)
NC[n] P"[flN}>0
= ECT[lrinf(rj)] (f^'s partition Runs)
Although Fatou's lemma (see, e.g. [Roy88, Chapter 4, Section 3]) requires the function ri(At) be non-negative, we can replace it with the non-negative function ri(At)—mina£j4 r^(a) and add the subtracted constant afterwards.
In order to show that Equations 6 and 7 hold, we prove the following lemma. This lemma is further necessary when relating the re-variables to the transient flow in Equation 3 later.
Lemma 4.4. For A?~ C [n] and C G MEC, we have
aGC
Proof. The proof is trivial for the case with Pct[S7at] =0. Let us now assume PCT[ftAr] > 0: y xa,N
aGC
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
23
1 1
lim — V V Fa[At = a \ ÜN] ■ Fa[üN] (definition of xa N and Tp]
i=l aGC Ti
lim ^ÉEh^^0!"^
i=l aGC
^ = a|njy\nc]-F7£rf)?d)-n^]
(partitioning of Runs)
t=l aGC
1 ť
lim — V V PCT[At = a | Í2jv n Í2C] ■ Fa[nN n Í2C]
P—±r>r-: l/l < ^ < ^
1 T
(lim - V PCT[At = a | Í2jv \ Í2C] = 0 for a G C)
t=i
lCT[^Ar n ííc] ■ lim ^ V V Pct[Aí = a I Í^at n ííc] (linearity of the limit)
1 t=l aGC
1 Te
'"[fijvnfic] ■ i™ — y Fa[At G C | ttNr)ttc}
Tt
í=l
(taking two different actions at time t are disjoint events) = Pct[S7at H Q.c] (since At G C for all but finitely many t on S7c, see below)
It remains to prove that the last limit is equal to 1. We have
i > lim -y pCT[At ec\nNnnc] = 1™ - y ect
y ia(At) I ííjv n ížc
í=l í=l LaGC
which is by dominated convergence theorem equal to
ect
lim i- V y ia(At) I nN n nc
í=l aGC
eCT[1] = 1
by definition of Qc-
□
Equation 6. For all C G MEC, A/ C [n],i £ N
y ^a,7V ■ n(a) > y a;a,w ■ sat,
follows trivially for Pct[S7tv] = 0, and whenever Pct[S7tv] > 0 we have y xa,N ■ n(«)
1 T
> liminf - y y n(a) ■ PCT[At = a | Qjv] ■ F^Jv]
i=l aGC
(as above for Eq. 5, by def. of xa^, fx, linearity of lim, def. of liminf)
24
KRISHNENDU CHATTERJEE, ZUZANA KŘETÍNSKÁ, AND JAN KŘETÍNSKÝ
= liminf -VV rt(a) ■ Fa[At = a \ ÜN n Í2C] ■ ^ [Í2jv n Í2C]
í=l aGC
(as above in Lemma 4.4, by partitioning Runs, now with additional factor Vi(a))
> ra[nN n nc] ■ ECT[irinf(n) | nN n nc]
(as above for Eq. 5, by def. of expectation and Fatou's lemma)
> F^IQn n Qc] ■ sat, (by definition of QN and i £ AT)
= y^ ^a,AT ■ satj (by Lemma 4.4)
Equation 7. For every iE [n], by assumption on the strategy cr
J] Pct[^at] = PCT[w £ Runs I lrnrfMHi > sat,] > pr,
NC[n]:i€N
and the first term actually equals
E E^a,7V= E E E^a,7V (by (4.1))
JVC[n]:ieATaeA NC[n]:i€N CgMEC aGC
= E E Pl^nfic] (by Lemma 4.4)
JVC[n]:ieJVCeMEC
= Pct[J7tv] (^c's partition almost all Runs)
NC[n]:i€N
4.2. Transient behaviour and Equations 1—3. Now we set the values for yx, \ £ AU(S x 2N), and prove that they satisfy Equations 1-3 of L when the values /jv(g) are assigned to a;a,7V- One could obtain the values yx using the methods of [Put94, Theorem 9.3.8], which requires the machinery of deviation matrices. Instead, we can first simplify the behaviour of a in the transient part to memoryless using [BBC+14] and then obtain yx directly, like in [EKVY08], as expected numbers of taking actions. To this end, for a state s we define ()s to be the set of runs that contain s.
Similarly to [BBC+14, Proposition 4.2 and 4.5], we modify the MDP G into another MDP G as follows: For each s £ S, N C [n], we add a new absorbing state fsin- The only available action for fs^ leads back to fs^ with probability 1. We also add a new action as^ to every s £ S for each N 7\r assigns probability 1 to fsin- Finally, we remove all unreachable states. The construction of [BBC+14] is the same but with only a single value used for N. We denote the copy of each state s of G in Gby s.
Lemma 4.5. There is a strategy a in G such that for every C £ MEC and iV C [n],
EF^[OAjv]=no[ficnnJV] .
sec
Proof. First, we consider an MDP G' created from G in the same way as G, but instead of /s)jv f°r each s £ S, N C [n], we only have a single fs; similarly for actions as. As in [BBC+14, Lemma 4.6], we obtain a strategy a' in G' such that J2s€cPs''i^M = rs0[^c}-
unifying two views on multiple mean-payoff objectives in mdps
25
We modify cr' into cr as follows. It behaves as cr', but instead of taking action as with probability p, we take each action aS:N with probability p ■ \\nc] ■ (For Pso^d = 0, we define a arbitrarily.) Then
E ^[0/^1 = E 'ho?1 ■ K [0/.] = pc n
□
By [EKVY08, Theorem 3.2], there is a memoryless strategy cr satisfying the lemma above such that
CO
ya := ^^Pf [At = a] (for actions a preserved in G)
t=l
vs,n ■■= n^iofsM
are finite values satisfying Equations 1 and 2, and, moreover,
ys,Ar>Ep"[0^]-
sGC
By Lemma 4.5 for each Cg MEC we thus have
sGC
and summing up over all C and N we have
E E*m^ E pCT[^]
NC[n] sG5 ATC[n]
where the first term is 1 by Equation 2, the second term is 1 by partitioning of Runs, hence they are actually equal and thus
E ys,N = iH^c n nN] = E x°,n
sGC aGC
where the last equality follows by Lemma 4.4, yielding Equation 3.
5. Proof of Theorem 3.1: Solution to L induces witness strategy
Now we proceed to the proof of the third point of Theorem 3.1. Let xa^, ya, ys>Ar, seS,a£ A, iV C [n] be a solution to the system L. We show how it effectively induces a witness strategy cr.
We start with the recurrent part. We prove that even if the flow of Equation 4 is "disconnected" we may still play the actions with the exact frequencies xa^ on almost all runs. To formalize the frequency of an action a on a run, recall la is the indicator function of a, i.e. la(a) = 1 and la(6) = 0 for a ^ b £ A. Then Freqa = lr;nf(la) defines a vector random variable, indexed by a £ A. For the moment, we focus on strongly connected MDPs, i.e. the whole MDP is a MEC, and with N C [n] fixed.
Firstly, we construct a strategy for each "strongly connected" part of the solution xa^ and connect the parts, thus averaging the frequencies. This happens at a cost of a small error used for transiting between the strongly connected parts. Secondly, we eliminate this error as we let the transiting happen with measure vanishing over time.
26
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
5.1. ^-values and recurrent behaviour. To begin with, we show that ^-values describe the recurrent behaviour only:
Lemma 5.1. Let xa^, a £ A, N C [n] be a non-negative solution to Equation 4 of system L. Then for any fixed N, Xn '■= {s, a \ xa^ > 0, a £ Act(s)} is a union of end components. In particular, C (JMEC, and for every a £ A \ (J MEC and N C [n], we have
xa,N = 0.
Proof. Denoting xs^ := ^2a€Act(s) xa,N = 12a€Axa,N ' Ar > 0} U {s \ xs^N > 0} . Firstly, we need to show that for all a £ Xn, whenever <5(a)(s') > 0 then s' £ Xn- Since xs',N > xa,N ■ <^(a)(s') > 0) we have s' £ X^r-
Secondly, let there be a path from s to t in Xn- We need to show that there is a path from t to s in X^. Assume the contrary and denote T C X^r the set of states with no path to s in Xn; we assume t £ T. We write the path from s to i as s ■ ■ ■ s'bt' ■ ■ ■ t where s' £ Xn \ T and t' £ T. Then b £ Aci(s') and <5(6)(i') > 0. Consequently,
^""^ a;a ■ (5(a) (s) = a;a (by summming Equation 4 over s £ X^r \ T)
s£XN\Ta£A S£XN\T a^Act(s)
= E E E E E E^-^)(*)
s€XN\T aGAet(s) s€XN\T s£XN\T a€Act(s) sGT
(case split over target states) > E E E (by 5(6)(t') > 0)
s£XN\T a£Act(s) s£XN\T
= ^2 E Xa ' 5(a)(^) (rearranging)
sGXiv\TaG,4et(s): s€XN\T
= xa ■ <5(a) (s) (see below)
s£XN\T a£A
which is a contradiction. The last equality follows by definition of T: actions enabled in T cannot lead to Xn \ T since from Xn \ T there is always a path to s and from T there is no path to s. □
We thus start with the construction of the recurrent behaviour from a;-values. For the moment, we restrict to strongly connected MDP and focus on Equation 4 for a particular fixed N C [n]. Note that for a fixed N C [n] we have a system of equations equivalent to the form
xa ■ 5(a)(s) = xa for each s £ S. (5-1)
aGA a^Act(s)
We set out to prove Corollary 5.5. This crucial observation states that even if the flow of Equation 4 is "disconnected", we may still play the actions with the exact frequencies xa^ on almost all runs.
Firstly, we construct a strategy for each "strongly connected" part of the solution xa (each end-component of Xn of Lemma 5.1).
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
27
Lemma 5.2. In a strongly connected MDP G, let xa^,a £ A be a non-negative solution to Equation 4 of system L for a fixed iV C [n] and Yla€Axa,N > 0- It induces a memoryless strategy £ such that for every BSCCs D of G1', every BGDfli, and almost all runs in D holds
„ xaN Freqa
J2a€DnA xa,N
i.e.
Freqa = a 0- F°r every e > 0, there is a memoryless strategy £e such that for all a £ A almost surely
Freqa > Xa'N--s
Z^a€AXa,N
Proof. We obtain £e by a suitable perturbation of the strategy £ from previous lemma in such a way that all actions get positive probabilities and the frequencies of actions change only slightly, similarly as in [BBC+14, Proposition 5.1, Part 2].
There exists an arbitrarily small (strictly) positive solution x'a of Equation (5.1). Indeed, it suffices to consider a strategy r which always takes the uniform distribution over the actions in every state and then assign KT[Freqa] /M to x'a for sufficiently large M. As the system of Equations (5.1) is linear and homogeneous, assigning xa^ + x'a to xa^ also solves this system (and thus Equation 4 as well) and all values are positive. Consequently, Lemma 5.2 gives us a memoryless strategy C,£ satisfying almost surely (with F^'-probability 1)
Fregfl= (Y + ^M.
Ea'GA {xa',N + xa,)
We may safely assume that ^ZaGA-^a — rzi ' ^2a€Axa,N- Then almost surely
Freqa = Xa,N X<1—— (by Lemma 5.2)
SaGA^a.
N
>^--r (by<>0)
J2a€AXa,N + SaGA
X'
^ -Z-V--Z- ^ X'a < T=F ■ EaGA xa,N)
J2a€AXa,N + i_£ ■ SaGA xa,N Xg,N
(rearranging)
28
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
xa,N xa,N , . \
= —--e ■ —- (rearranging)
Z^a€AXa,N l^a£AXa,N
> --e (by v X*-N < 1)
□
Thirdly, we eliminate this error as we let the transiting (by x'a) happen with probability vanishing over time.
Lemma 5.4. In a strongly connected MDP, let £j be a sequence of strategies, each with Freq = fl almost surely, and such that lim^oo fl is well defined. Then there is Markov strategy £ such that almost surely
Freq = lim fl.
i—>oo
Proof. This proof very closely follows the computation in [BBC+14, Proposition 5.1, Part "Moreover"], but for general £j.
Given a £ A, let If a := lim^oo By definition of limit and the assumption that Freqa = lr;nf(la) is almost surely equal to /* for each £j, there is a subsequence ^ of the sequence £j such that p£> [lrjnf(la) > If a — 2--7-1] = 1. Note that for every j £ N there is Kj £ N such that for all a £ A and s £ S we get
1 T
-Kj t=0
> 1 - T3.
Let us consider a sequence tiq, ni,. .. of numbers where rij > Kj and
EfeZ/a.
1 —¥co 1 ^—'
t=0
Denote by Ek the set of all runs to = soaosiai ■ ■ ■ of such that for some < d < we have
Nj+d-l
-d £ la{ak) < lfa-2~k.
We have F^[Ej] < 2~i and thus J2JLi ^[Ej] = \ < 00 holds- BY Borel-Cantelli lemma [Roy88], almost surely only finitely many of Ej take place. Thus, almost every run lj = soaOslal '' ' of G^ satisfies the following: there is £ such that for all j > £ and all kj < d < rij we have that
Nj+d-l
\ ^ Va-2~j. (5.4)
k=Nj
Consider T 6 N such that Nj < T < Nj+i where j > £. Below, we prove the following inequality
1 T
yE^W ^ (Z/a-21^)(l-21-^). (5.5)
t=o
Taking the limit of (5.5) where T (and thus also j) goes to oo, we obtain
1 T
Freqa{u) = liminf- £ la(at) > liminf(Z/a - 21^)(1 - 21"') = lfa = lim /a
t=0
yielding the lemma. It remains to prove (5.5). First, note that
T Nj—1 T
t=0 t=N.
and that by (5.4)
which gives
t=7Vj_i t=Nj-!
T T
f^Uat) > (lfa - 21~3)^ + I £ Mat). (5-6)
t=0 t=wj Now, we distinguish two cases. First, if T — Nj < Kj+±, then
> -°--=-°--= 1--3 3+— > (1 - 21~3)
T Nj + kJ+i Nj_i + rij + kJ+i Nj-i + rij + kj+i
by (5.2) and (5.3). Therefore, by (5.6),
T
^ E^) ^ (lfa-21-3)(l-21-3).
T
t=o
30
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
Second, if T — Nj > kj+i, then
- y la(at) = r_jy y la (at)--f-
t=Nj 3 t=Nj
> (lfa - 2-1) (1 - Ni-)+ni) (by (5.4))
>(Z/a-2^)(l-2-^-^) (by (5.2))
and thus, by (5.6),
1 y Mat) > [lfa - 2^ + (lfa ~ 2-^) (l - 2-i - |) t=o
>(Va-21-i)(^+(l-2-i-^
>(lfa-21-t)(l-21-*) which finishes the proof of (5.5). □
Now we know that strategies within an end component can be merged into a strategy with frequencies corresponding to the solution of Equation 4 for each fixed N.
Corollary 5.5. For a strongly connected MDP, let xa^, a £ A be a non-negative solution to Equation 4 of system L for a fixed N C [n] and Yla€Axa,N > 0- Then there is Markov strategy £jv such that for each a £ A almost surely
Freqa = —--.
Proof. The strategy ^ is constructed by Lemma 5.4 taking £j to be from Lemma 5.3. □
Remark 5.6. Note that using such strategy, all actions and states in the single MEC are visited infinitely often. (This will be later useful for the strategy complexity analysis.)
Since the fraction is independent of the initial state of the MDP, the frequency is almost surely the same also for all initial states. The reward of £jv is almost surely
lrinf(r)(w) = —-.
lsaxa,N
When the MDP is not strongly connected, we obtain such £jv in each MEC C with SaeC xa,N > 0 and the respective reward of almost all runs in C is thus
E^ [lrinf (r) | nc] = E°fi'a'rr(''> . (5.7)
Z^aGCnA xa,N
Moreover, the long-run average reward is the same for almost all runs, which is a stronger property than in [BBC+14, Lemma 4.3], which does not hold for the induced strategy there. We need this property here in order to combine the satisfaction requirements.
Z^aCCnA Xa,N
1. (5.8)
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
31
5.2. y-values and transient behaviour. We now consider the transient part of the solution that plays £n's with various probabilities. Let "switch to £n m C" denote the event that a strategy updates its memory, while in C, into such an element that it starts playing exactly as We can stitch all £at's together as follows:
Lemma 5.7. Let £n,N C [n] be strategies. Then every non-negative solution ya,ys,n, a G A, s G S, N C [n] to Equation 1 effectively induces a strategy cr such that
PCT [switch to £tv in s] = Us,N and a is memoryless before the switch.
Proof. The idea is similar to [BBC+14, Proposition 4.2, Step 1]. However, instead of switching in s to £ with some probability p, here we have to branch this decision and switch to £m with probability p ■ ^ Vs'N-.
Formally, for every MEC C of G, we denote the number 5ZsgC Stvc[™] Vs,n by yc-According to the Lemma 4.4 of [BBC+14] we have a stochastic-update strategy f) which stays eventually in each MEC C with probability yc-
Then the strategy 7r works as follows. It plays according to i? until a BSCC of G^ is reached. This means that every possible continuation of the path stays in the current MEC C of G. Assume that C has states s±,..., s^. At this point, the strategy 7f changes its behaviour as follows: First, it strives to reach s± with probability one. Upon reaching s±, it chooses randomly with probability VSy^ to behave as £n forever, or otherwise to follow on to s2- If the strategy Ft chooses to go on to s2, it strives to reach s2 with probability one. Upon reaching s2, it chooses with probability -^/s2'JV- to behave as £n forever, or
to follow on to S3, and so on, till Sk- That is, the probability of switching to £n in Si is
_ysi,n_
Since i? stays in a MEC C with probability yc, the probability that the strategy 7r switches to £n in Si is equal to ysi,n- Further, as in [BBC+14] we can transform the part of 7t before switching to £jv to a memoryless strategy and thus get strategy cr. □
Corollary 5.8. Let £n,N C [n] be strategies. Then every non-negative solution yaiy.s,Nixa,Nia £ A, s G S, N C [n] to Equations 1 and 3 effectively induces a strategy a such that for every MEC C
ICT [switch to £tv in C] = xa,N
E
a^CnA
and cr is memoryless before the switch.
Proof. By Lemma 5.7 and Equation 3. □
5.3. Proof of witnessing. We now prove that the strategy cr of Corollary 5.8 with £n, N C [n] of Corollary 5.5 is indeed a witness strategy. Note that existence of £n's depends on the sums of a;-values being positive. This follows by Equation 2 and 3. We evaluate the strategy cr as follows:
ECT[lrinf(r)]
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
Y Y pCT[switch to £w in C\ ■ E^N Prinf (r) | nC] CGMEC7VC[n]
(by Equation 2, Y PCT [switch to £N] = 1
NC[n]
E E ( E ^)-E^[lrinf(r)|ftc]
(by Corollary 5.8) ^ a;a>Ar ■ r(a)/ ^ a;a>Arj (by (5.7))
CGMEC7VC[ra] aGCnA
= E E ( E x°,"
CgMEC NC[n]: aGCnA aGCnA aGCnA
= E E E ^,JVr(a)
NC[n] CgMEC aGCnA
= E E M,r(Q)
NC[n] aGAnlJMEC
= E ^2xa,N-r(a)
NC[n] aGA
> exp
and for each i 6 [n] we have iHlimf (r)i > sat,] = Y Y pCT[switch to 6v in C] ■ P5iV[lrinf(r)l > sat, \ Qc]
CGMEC7VC[n]
(by Equation 2, Y ^ [switch to £N] = 1)
NC[n]
= E E ( E ^a,^) ■ P^piirff'"); > sati I ftc] (by Corollary 5.8)
(by Lemma 5.1) (by Equation 5)
>
CeMEC JVCfnl aGCnA
E E ( E x«,»
CgMEC NC[n]: aGCnA
E E (E^-^
CgMEC ieATC[n]: aGCnA
E E E x°>*
i6JVC[n] CgMEC aGCnA
E E x°>*
!6JVC[n] aGAnlJ MEC
E E^
ieATCfnl aGA
E ■ r(a)*/ E Xa'N -sati
.aGCnA aGCnA
(by (5.8))
Y Xa,N ■ SaU/ Y Xa,N > Sat, .aGCnA aGCnA
(by Equation 6)
(by Lemma 5.1)
unifying two views on multiple mean-payoff objectives in mdps
33
(by Equation 7)
Remark 5.9. The proof of the corresponding claim for e-witness strategies proceeds as above. We get that the strategy cr of Corollary 5.8 with Q£N,N C [n] of Lemma 5.3 is an
6. Algorithmic complexity In this section, we discuss the solutions to and complexity of all the introduced problems.
6.1. Solution to (multi-quant-conjunctive). As we have seen, there are ■ n) ■ 2n variables in the linear program L. By Theorem 3.1, the upper bound on the algorithmic time complexity is polynomial in the number of variables in system L. Hence, the realizability problem for (multi-quant-conjunctive) can be decided in time polynomial in |G| and exponential in n.
6.2. Solution to (multi-quant-joint) and the special cases. In order to decide (multi-quant-joint), the only subset of runs to exceed the probability threshold is the set of runs with all long-run rewards exceeding their thresholds, i.e. Q\n\ (introduced in Section 4.1). The remaining runs need not be partitioned and can be all considered to belong to £1$ without violating any constraint. Intuitively, each xa 0 now stands for the original sum zC/vc[n]-jvy:[n] xa,N'i similarly for y-variables. Consequently, the only non-zero variables of L indexed by A?~ satisfy N = [n] or = 0. The remaining variables can be left out of the system.
Requiring all variables ya,Vs,N,xa,N for a £ A, s £ S, N £ {0, [n]} be non-negative, the program is the following:
(1) transient flow: for s £ S
(3) probability of switching in a MEC is the frequency of using its actions: for C £ MEC
e-witness strategy.
A
aeA a^Act(s)
(2) almost-sure switching to recurrent behaviour:
see
see
aeC
see
aeC
(4) recurrent flow: for s £ S
aeA
a
eAct(s)
aeA
a
eAct(s)
34
KRISHNENDU CHATTERJEE, ZUZANA KRETINSKA, AND JAN KRETINSKY
(5) expected rewards:
E (xa,% + xa,[n\) ' r(a) > eXP
aGA
(6) commitment to satisfaction: for (7 S MEC and i £ [n]
E xa,[n] ■ r(a)i > y xa^n] ■ sat.
aGC aGC
(7) satisfaction:
aGA
Since there are now C(|G| ■ n) variables, the problem as well as its special cases can be decided in polynomial time.
Similarly, for (mono-quant) it is sufficient to consider N = [n] = {1} and iV = 0 only. Consequently, for (multi-qual) N = [n], and for (mono-qual) N = [n] = {1} are sufficient, thus the index iV can be removed completely.
Theorem 6.1. The (multi-quant-joint) realizability problem (and thus also all its special cases) can be decided in time polynomial in |G| and n. □
6.3. Solution to (multi-quant-conjunctive-joint). The linear program for this "combined" problem can be easily derived from the program L in Fig. 4 as follows.
The first step consists in splitting the recurrent flow into two parts, yes and no Requiring all variables be non-negative, the program is the following:
(1) transient flow: for s £ S
(3) probability of switching in a MEC is the frequency of using its actions: for C £
aGA a^Act(s) NC[n]
(2) almost-sure switching to recurrent behaviour:
MEC, iV C [n]
sGC
aGC
sGC
(4) recurrent flow: for s £ S, N C [n]
aGC
aGA
a^Act(s)
aGA
a^Act(s)
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
35
(5) expected rewards:
X (xa,N,yes + xa,N,no) ' >*(a) > exp JVC[n]
(6) commitment to satisfaction: for C G MEC, iV C [n], i G N
Y,^yes-r{a)l>Y,xa^yes-satl
aGC aGC
X xa,N,no ■ r(a)i > X xa,N,no ' Sat, aGC aGC
(7) satisfaction: for i G [n]
^ ^ xa,N,yes ~\~ xa,N,no ^ J^z
aGA, NC[n]:i€N
Note that this program has the same set of solutions as the original program, considering
Substitution O^tv = af3,N,yes + af3,N,no-
The second step consists in using the "yes" part of the flow for ensuring satisfaction of the (joint-SAT) constraint. Formally, we add the following additional equations (of type 6 and 7, respectively):
X xa,N,yes ■ r(a)i > ^ xa,N,yes ■ sat, for i G [n] and N C [n]
aGC aGC
(7)
X xa,N,yes > Vr
aGA JVC[n]
Note that the number of variables is double that for (multi-quant-conjunctive).
Therefore, the complexity remains essentially the same:
Corollary 6.2. The algorithmic complexity for the (multi-quant-conjuctive-joint) is
polynomial in the size of the MDP and exponential in n. □
Remark 6.3. The strategies for the case of (multi-quant-conjunctive-joint) are very similar to that of (multi-quant-conjunctive). Indeed, the structure of the constructed (e-)witness strategies is the same: the memoryless strategy for reaching the desired MECs is followed by a stochastic-update switch to strategies for the recurrent behaviour. The only difference is the following, (e-)witness strategies for (multi-quant-conjunctive) switch to strategies £jv (or Cn)> eacn given by values of a;-variables indexed by a fixed A?~ C [n]. In contrast, strategies for (multi-quant-conjunctive-joint) switch to strategies ^,6 (or C/vb)? each given by values of ^-variables indexed by a fixed A?~ C [n] and b G {yes, no}. A
Furthermore, we can also allow multiple constraints, i.e. more (joint-SAT) constraints or more (conjunctive-SAT), thus specifying probability thresholds for more value thresholds for each reward. Then instead of subsets of [n] as so far, we consider subsets of the set of all constraints. The number of variables is then exponential in the number of constraints rather than just in the dimension of the rewards.
36
krishnendu chatterjee, zuzana kretinska, and jan kretinsky
6.4. Hardness. The (multi-quant-conjunctive-joint) problem is also of significant theoretical interest since we can also prove the following hardness result:
Theorem 6.4. The (multi-quant-conjunctive-joint) problem is NP-hard (even without the (EXP) constraint).
Proof. We proceed by reduction from SAT. Let ip be a formula with the set of clauses C = {ci,..., Cfc} over atomic propositions Ap = {a±,..., ap}. We denote Ap = {oi,..., a^} the literals that are negations of the atomic propositions. We define an MDP Gv = (S, A, Act, 5, sq) as follows:
• S = {st | ie_\p}},
• A = ApUAp,
• Act(si) = {a,i,al} for i G [p],
• S(a>i)(si+i) = 1 and 6(ai)(si+i) = 1 (actions are assigned Dirac distributions),
• s0 = Sl= Sp+l.
The constructed MDP is illustrated in Fig. 6. Intuitively, a run in Gv repetitively chooses a valuation.
We define the dimension of the reward function to be n = k + 2p. We index the components of vectors with this dimension by C U Ap U Ap. The reward function is defined for each £ G A as follows:
Intuitively, we get a positive reward for a clause when it is guaranteed to be satisfied by the choice of a literal. The latter two items simply count the number of uses of a literal; thus lrinf(r)a = Freqa.
The realizability problem instance Rv is then defined by a conjunction of the following (conjunctive-SAT) and (joint-SAT) constraints:
Figure 6. MDP G,
• r{£){al)
X
Fa lrinf (r)e > - > - for each £ G Ap U Ap
(conjunctive-S)
(joint-S)
UNIFYING TWO VIEWS ON MULTIPLE MEAN-PAYOFF OBJECTIVES IN MDPS
37
Intuitively, (conjunctive-S) ensures that almost all runs choose, for each atomic proposition, either the positive literal with frequency 1, or the negative literal with frequency 1; in other words, it ensures that the choice of valuation is consistent within the run almost surely. Indeed, since the choice between aj and al happens every p steps, runs that mix both with positive frequency cannot exceed the value threshold 1/p. Therefore, half of the runs must use only aj, half must use only a7. Consequently, almost all runs choose one of them consistently.
Further, (joint-S) on the top ensures that there is a (consistent) valuation that satisfies all the clauses. Moreover, we require that this valuation is generated with probability at least 1/2. Actually, we only need probability strictly greater than 0.
We now prove that ip is satisfiable if and only if the problem instance defined above on MDP Gv is realizable.
"Only if part": Let v C Ap U Ap be a satisfying valuation for ip. We define a to have initial distribution on memory elements mi, 7712 with probability 1/2 each. With memory mi we always choose action from v and with memory 7712 from the "opposite valuation" v (where a is identified with a).
Therefore, each literal has frequency 1/p either in the first or the second kind of runs. Further, the runs of the first kind (with memory mi) satisfy all clauses.
"If part": Given a witness strategy a for R(p), we construct a satisfying valuation. First, we focus on the property induced by the (conjunctive-S) constraint. We show that almost all runs uniquely induce a valuation
Va ■= {£ £ Ap U ~Ap I Freqi > 0}
which follows from the following lemma:
Lemma 6.5. For every witness strategy a satisfying the (conjunctive-S) constraint, and for each a £ Ap, we have
Freqa = - and Freqa = 0
P
Freqa = 0 and Freqa ■■
Proof. Let a £ Ap be an arbitrary atomic proposition. To begin with, observe that due to the circular shape of MDP Gv, we have
Freqa + Freqa < 1/p (6.1) for every run. Indeed, Freqa + Freqa = liminfT^oo ^ Y%=i 1a + lim infr->oo t ELi 1a ^ liminfT^oo y Yjt=l{^-a + la) = 1/p-
Therefore, the two events Freqa > 1/p and Freq^ > 1/p are disjoint. Due to the (conjunctive-S) constraint, almost surely exactly one of the events occurs. Indeed,
1 >
Freqa > - U Freqa > -
p p
Freqa >
Freqa >
>
1 1
with the equality by disjointness of the events and the last inequality by (conjunctive-S).
Therefore, by (6.1), almost surely either Freqa = 1/p and Freqa = 0, or Freqa = 0 and Freqa = D
38
krishnendu chatterjee, zuzana kretinska, and jan kretinsky
By the (joint-S) constraint, we have a set QSat, with non-zero measure, of runs satisfying lrinf(r)c > 1 f°r each c G C. By the previous lemma, almost all runs of 0,Sat induce unique valuations. Since there are finitely many valuation, at least one of them is induced by a set of non-zero measure. Let lj be one of the runs and v the corresponding valuation. We claim that v is a satisfying valuation for ip.
Let c G C be any clause, we show v \= c. Since lrinf(r)(w)c > 1, there is an action £ such that
• Freqp(uj) > 0, and
• r(a)e > 1.
The former inequality implies that £ G v and the latter that £ \= c. Altogether, v |= c for every c G C, hence v witnesses satisfiability of ip. □
Theorem 6.4 contrasts Theorem 6.1: while extension of (joint-SAT) with (EXP) can be solved in polynomial time, extending (joint-SAT) with (conjunctive-SAT) makes the problem NP-hard. Intuitively, adding (conjunctive-SAT) enforces us to consider the subsets of dimensions, and explains the exponential dependency on the number of dimensions in Theorem 3.1 (though our lower bound does not work for (conjunctive-SAT) with (EXP)).
The results are summarized in Table 2 and contrasted to the previously known polynomial bounds in Table 1.
7. Strategy complexity
First, we recall the structure of witness strategies generated from L in Section 5. In the first phase, a memoryless strategy is applied to reach MECs and switch to the recurrent strategies This switch is performed as a stochastic update, remembering the following two pieces of information: (1) the binary decision to stay in the current MEC C forever, and (2) the set A?~ C [n], such that almost all the produced runs belong to Q,^. Each recurrent strategy £at is then an infinite-memory strategy, where the memory is simply a counter. The counter determines which memoryless strategy ("^ is played.
7.1. Randomization and memory. Similarly to the traditional setting with the expectation or the satisfaction semantics considered separately, the case with a single objective is simpler.
Lemma 7.1. Deterministic memoryless strategies are sufficient for witness strategies for (mono-qual).
Proof. For each MEC, there is a value, which is the maximal long-run average reward. It is achievable for all runs in the MEC, using a memoryless strategy ξ. We prune the MDP to remove MECs with values below the threshold sat. A witness strategy can then be chosen to maximize the single long-run expected average objective, and thus also to be deterministic and memoryless [Put94]. Intuitively, in this case each MEC is either stayed in almost surely, or left almost surely if the value of the outgoing action is higher. □
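The value of a MEC mentioned in the proof can be computed by the classical linear program for the maximal long-run average reward in a strongly connected MDP. The following is a minimal sketch of that LP in Python using SciPy; the toy MEC, its action names and rewards are made up for illustration and are not taken from this paper.

# A minimal sketch (not the paper's exact formulation): the classical LP for the
# maximal long-run average reward in a strongly connected MEC, solved with SciPy.
from scipy.optimize import linprog

# actions: name -> (source state, {successor: probability}, reward)
actions = {
    "a": (0, {0: 0.5, 1: 0.5}, 2.0),
    "b": (0, {1: 1.0},          0.0),
    "c": (1, {0: 1.0},          1.0),
}
states = [0, 1]
names = list(actions)

# Variables x_a = long-run frequency of playing action a.
# Flow balance: for every state s,  sum_{a from s} x_a = sum_a P(a)(s) * x_a.
A_eq, b_eq = [], []
for s in states:
    row = []
    for a in names:
        src, succ, _ = actions[a]
        row.append((1.0 if src == s else 0.0) - succ.get(s, 0.0))
    A_eq.append(row)
    b_eq.append(0.0)
# Normalisation: the frequencies sum up to one.
A_eq.append([1.0] * len(names))
b_eq.append(1.0)

# Maximise sum_a x_a * r(a)  <=>  minimise the negated objective.
c = [-actions[a][2] for a in names]
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(names))
print("maximal long-run average reward:", -res.fun)
print("action frequencies:", dict(zip(names, res.x)))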
Further, both for the expectation and the satisfaction semantics, deterministic memoryless strategies are sufficient for quantitative queries with a single objective [FV97, BBE10]. In contrast, we show that both randomization and memory are necessary in our combined setting, even for ε-witness strategies.
Example 7.2. Randomization and memory are necessary for (mono-quant) with sat = 1, exp = 3, pr = 0.55 and the MDP and r depicted in Fig. 7. We have to remain in the MEC {s, a} with probability p ∈ [0.1, 2/3], hence we need a randomized decision. Further, memoryless strategies would either never leave {s, a} or would leave it eventually almost surely. Finally, the argument applies to ε-witness strategies, since the interval for p contains neither 0 nor 1 for sufficiently small ε.
Figure 7. An MDP with a single objective, where both randomization and memory are necessary; its actions a, c, d have rewards r(a) = 2, r(c) = 0, r(d) = 10. △
In the rest of the section, we discuss bounds on the size of the memory and the degree of randomization. Due to [BBC+14, Section 5], infinite memory is indeed necessary for witnessing (joint-SAT) with pr = 1, hence also for (multi-qual).
7.2. Memory bounds for deterministic update. We prove that finite memory is sufficient in several cases, namely for all e-witness strategies and for (mono-quant) witness strategies. Moreover, these results also hold for deterministic-update strategies. Indeed, as one of our technical contributions, we prove that stochastic update at the moment of switching is not necessary and deterministic update is sufficient, requiring only a finite blow up in the memory size.
Lemma 7.3. Deterministic update is sufficient for witness strategies for (multi-quant-conjunctive) and (multi-quant-joint). Moreover, finite memory is sufficient before switching to the ξ_N's.
Proof idea. The stochastic decision during the switching in MEC C can be done as a deterministic update after a "toss", a random choice between two actions in C in one of the states of C. Such a toss does not affect the long-run average reward as it is only performed finitely many times.
More interestingly, in MECs where no toss is possible, we can remember how many times the states were visited and choose the respective probabilities of leaving or staying in C accordingly. □
Proof. Let σ be a strategy induced by L. We modify it into a strategy ϱ with the same distribution of the long-run average rewards. The only stochastic update that σ performs is in a MEC, switching to some ξ_N with some probability. We modify σ into ϱ in each MEC C separately.
Tossing-MEC case. First, we assume that there are toss, a, b ∈ C with a, b ∈ Act(toss). Whenever σ should perform a step in s ∈ C and possibly make a stochastic update, say to m1 with probability p1 and to m2 with probability p2, ϱ performs a "toss" instead. A (p1, p2)-toss consists of reaching toss with probability 1 (using a memoryless strategy), taking a, b with probabilities p1, p2, respectively, and making a deterministic update based on the result, in order to remember the result of the toss. After the toss, ϱ returns back to s with probability 1 (again using a memoryless strategy). Now, as it already remembers the result of the (p1, p2)-toss, it changes the memory to m1 or m2 accordingly, by a deterministic update.
In general, since the stochastic-update probabilities depend on the action chosen and the state to be entered, we have to perform the toss for each combination before returning to s. Further, whenever there are more possible results for the memory update (e.g. various N), we can use a binary encoding of the choices, say with k bits, and repeat the toss with the appropriate probabilities k times before returning to s.
This can be implemented using finite memory. Indeed, since there are finitely many states in a MEC and σ is memoryless, there are only finitely many combinations of tosses to make and remember till the next simulated update of σ.
Tossfree-MEC case. It remains to handle the case where, for each state s ∈ C, there is only one action a ∈ Act(s) ∩ C. Then all strategies staying in C behave the same here; call this memoryless deterministic strategy ξ. Therefore, the only stochastic update that matters is whether to stay in C or not. The MEC C is left via each action a with the probability
leave_a := Σ_{t=1}^{∞} P[S_t ∈ C and A_t = a and S_{t+1} ∉ C],
and let {a | leave_a > 0} = {a_1, …, a_f} be the leaving actions. The strategy ϱ upon entering C performs the following. First, it leaves C via a_1 with probability leave_{a_1} (see below how), then via a_2 with probability leave_{a_2} / (1 − leave_{a_1}), and so on, via a_i with probability
leave_{a_i} / (1 − Σ_{j=1}^{i−1} leave_{a_j}),
subsequently for each i ∈ [f]. After the last attempt with a_f, if we are still in C, we update the memory to stay in C forever (playing ξ).
Leaving C via a with a given probability leave can be done as follows. Let rate := Σ_{s'∉C} δ(a)(s') ∈ (0, 1] be the one-step probability of leaving C when a is attempted. The strategy attempts to leave via a on m visits and once more with probability p, so that the overall leaving probability satisfies
1 − (1 − rate)^m · (1 − p · rate) = leave.    (7.1)
Therefore, in order to choose m ∈ ℕ, we can simply set m := ⌊log(1 − leave) / log(1 − rate)⌋, which also ensures that p ∈ [0, 1] for the respective p := (1/rate) · (1 − (1 − leave)/(1 − rate)^m), obtained from (7.1).
In order to implement the strategy in MECs of this second type, for each action it is sufficient to have a counter up to the respective m. □
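As a small numeric sanity check of this construction (and of the relation (7.1) in the form reconstructed here), the following hypothetical snippet computes m and p from leave and rate and verifies that the achieved overall leaving probability matches; the numbers are illustrative only.

# Sanity check of the (m, p) choice above, assuming relation (7.1) as stated.
from math import log, floor

def toss_free_parameters(leave: float, rate: float):
    """Return (m, p) with 1 - (1 - rate)**m * (1 - p * rate) == leave."""
    m = floor(log(1.0 - leave) / log(1.0 - rate))
    p = (1.0 - (1.0 - leave) / (1.0 - rate) ** m) / rate
    return m, p

for leave, rate in [(0.3, 0.05), (0.9, 0.2), (0.5, 0.01)]:
    m, p = toss_free_parameters(leave, rate)
    achieved = 1.0 - (1.0 - rate) ** m * (1.0 - p * rate)
    assert 0.0 <= p <= 1.0 and abs(achieved - leave) < 1e-12
    print(f"leave={leave}, rate={rate} -> m={m}, p={p:.4f}, achieved={achieved:.12f}")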
Remark 7.4. Moreover, our proof also shows that finite memory is sufficient before switching to the ξ_N's (as defined in Section 5) for deterministic-update witnessing (and ε-witnessing) strategies. Therefore, finite-memory deterministic update is sufficient for ε-witness strategies, in particular also for (joint-SAT), which improves the strategy complexity known from [BBC+14]. Note that in general, the conversion of a stochastic-update strategy to a deterministic-update strategy requires an infinite blow-up in the memory [dAHK07]. △
As a consequence, we obtain several bounds on memory size valid even for deterministic-update strategies. Firstly, infinite memory is required only for witness strategies:
Lemma 7.5. Deterministic update with finite memory is sufficient for ε-witness strategies for (multi-quant-conjunctive) and (multi-quant-joint).
Proof. After switching, memoryless strategies achieving the objectives up to ε can be played instead of the sequences of memoryless strategies constituting the ξ_N's. □
Remark 7.6. The previous proof of sufficiency of deterministic-update finite memory for ε-witness strategies applies also to (multi-quant-conjunctive-joint). Indeed, firstly, Lemma 7.3 applies verbatim to (multi-quant-conjunctive-joint). Secondly, we switch to only finitely many recurrent strategies due to Remark 6.3. △
Secondly, infinite memory is required only for multiple objectives:
Lemma 7.7. Deterministic-update strategies with finite memory are sufficient witness strategies for (mono-quant).
Proof. After switching in a MEC C, we can play the following memoryless strategy. In C, there can be several components of the flow; we pick any one with the largest long-run average reward. □
Further, the construction in the toss-free case gives us a hint for the respective lower bound on memory, even for the single-objective case.
Example 7.8. For deterministic-update ε-witness strategies for the (mono-quant) problem, memory of size dependent on the transition probabilities is necessary. Indeed, consider the same realizability problem as in Example 7.2, but with a slightly modified MDP parametrized by λ, depicted in Fig. 8. Again, we have to remain in the MEC {s, a} with probability p ∈ [0.1, 2/3]. For ε-witness strategies the interval is slightly wider; let ℓ > 0 denote the minimal probability with which any (ε-)witness strategy has to leave the MEC, and note that all (ε-)witness strategies have to stay in the MEC with positive probability. We show that at
least ⌈ℓ/λ⌉-memory is necessary. Observe that this setting also applies to the (EXP) setting of [BBC+14], e.g. exp = (0.5, 0.5) and the MDP of Fig. 9. Therefore, we provide a lower bound also for this simpler case (no MDP-dependent lower bound is provided in [BBC+14]).
Figure 8. An MDP family (with an action a, r(a) = 2, and transition probabilities parametrized by λ) with a single objective, where memory of size dependent on the transition probabilities is necessary for deterministic-update strategies.
Figure 9. An MDP family (with actions a, r(a) = (1, 0), and c, r(c) = (0, 1), and transition probabilities λ and 1 − λ), where memory of size dependent on the transition probabilities is necessary for deterministic-update strategies even for (EXP) as studied in [BBC+14].
For a contradiction, assume there are fewer than ⌈ℓ/λ⌉ memory elements. Then, by the pigeonhole principle, in the first ⌈ℓ/λ⌉ − 1 visits of s, some memory element m appears twice. Note that due to the deterministic updating, each run generates the same play, thus the same sequence of memory elements. Let p be the probability to eventually leave s provided we are in s with memory m.
If p = 0, then the probability to ever leave s is less than (⌈ℓ/λ⌉ − 2) · λ < ℓ, a contradiction. Indeed, we have at most ⌈ℓ/λ⌉ − 2 tries to leave s before obtaining memory m, and with every try we leave s with probability at most λ; we conclude by the union bound.
Let p > 0. Due to the deterministic updates, all runs staying in s use memory m infinitely often. Since p > 0, there is a finite number of steps such that (1) during these steps the overall probability to leave s is at least p/2 and (2) we are using m again. Consequently, the probability of the runs staying in s is 0, a contradiction. △
7.3. Memory bounds for stochastic update. Although we have shown that stochastic update is not necessary, it may be helpful when memory is small.
Lemma 7.9. Stochastic-update 2-memory strategies are sufficient for witness strategies for (mono-quant).
Proof. The strategy σ of Section 5, which reaches the MECs and stays in them with the given probability, is memoryless up to the point of the switch by Corollary 5.8. Further, we can achieve the optimal value in each MEC using a memoryless strategy as in Lemma 7.7. □
Theorem 7.10. Upper bounds on the memory size of stochastic-update ε-witness strategies are as follows:
• (multi-qual) 2 memory elements,
• (multi-quant-joint) 3 memory elements,
• (multi-quant-conjunctive) 2^n + 1 memory elements,
• (multi-quant-conjunctive-joint) 2^(n+1) + 1 memory elements.
Proof. The structure of ε-witness strategies is described in Remark 5.9. Let us recall from Corollary 5.8 that the strategy σ is memoryless before the switch. For (multi-qual), (multi-quant-joint) and (multi-quant-conjunctive), we perform the stochastic-update switch to different memory elements corresponding to the different strategies ξ_N. From Lemma 5.3 we have that every such strategy is also memoryless. From Lemma 5.7 we have that we switch only to such ξ_N for N ⊆ [n] which correspond to possibly nonzero variables y_{s,N}. Therefore, the number of memory elements needed is the number of possibly nonzero variables y_{s,N} for N ⊆ [n], plus one element for the strategy σ before the switch.
Altogether, we get the following upper bounds on the memory size of ε-witness strategies. For (multi-quant-conjunctive), 2^n + 1 memory elements are sufficient, since all of the y_{s,N} for N ⊆ [n] can be positive. For (multi-quant-joint), 3 memory elements are sufficient, because we use only y_{s,[n]} and y_{s,∅}, as discussed in 6.2. Finally, for (multi-qual), 2 memory elements are sufficient, because we use only y_{s,[n]}, as in 3.2.1.
Due to Remark 6.3, the bound on the number of recurrent strategies for (multi-quant-conjunctive-joint) is twice as large as for (multi-quant-conjunctive), i.e., 2^(n+1). The upper bound on the memory size of ε-witness strategies for (multi-quant-conjunctive-joint) is thus 1 + 2^(n+1), compared to 1 + 2^n for (multi-quant-conjunctive). □
Example 7.11. For (multi-quant-joint), ε-witness strategies may require memory with at least 3 elements. Consider an MDP with two states s and t with transitions and rewards as depicted in Fig. 10. Further, let sat = (1, 0, 0), pr = 1/2, and exp = (0, 1, 1).
Figure 10. An MDP where 3-memory is necessary for (multi-quant-joint); in state s the available actions include a1 with r(a1) = (1, 0, 0) and a2 with r(a2) = (0, 4, 0).
Suppose 2 memory elements were sufficient. In state s, for each memory element we can either stay in s or go with some positive probability to state t. Therefore we have three cases regarding the behaviour in s with respect to the transition to t:
(1) for each memory element we have a positive probability, p1 and p2 respectively, to go to state t,
(2) for both memory elements we have zero probability to go to t and
(3) for one memory element, say memory element 1, we have zero probability and for the other one, say memory element 2, we have positive probability p to go to t.
In the first case, we go to t eventually almost surely. Indeed, in each step we enter t with probability at least min(p1, p2), and we cannot return back. Therefore, we stay in t forever and thus we cannot satisfy the satisfaction constraint.
In the second case, we never enter state t. Hence, we cannot satisfy the expectation constraint, because r(a1)_3 = r(a2)_3 = 0.
In the third case, we first assume that we switch from memory 1 to 2 with some positive probability p1. Then in each step we have probability at least p1 · p to enter t. Therefore, we end up in state t almost surely, not satisfying the constraints, as shown above. Second, suppose we cannot switch from memory 1 to 2. Then we almost surely end up in state s with memory 1 or in state t. In state s with memory 1, we can either play action a1 with probability 1 or with a smaller, possibly zero, probability q. In the former case, lr(r_2) = 0, thus violating the expectation constraint. In the latter case, for almost every run lr(r_1) ≤ 1 − q < 1, contradicting the satisfaction constraint.
Note that a witnessing strategy exists which uses only 3 memory elements. On half of the runs, we play only action a1 to satisfy the satisfaction constraint, so we define σ_n(s, 1)(a1) = 1. The expectation constraints for r_2 and r_3 are then satisfied using the remaining two memory elements. △
However, even with stochastic update, the size of the finite memory cannot be bounded by a constant for (multi-quant-conjunctive).
Example 7.12. Even an ε-witness strategy for (multi-quant-conjunctive) may require memory with at least n memory elements. Consider an MDP with a single state s and self-loops a_i with reward r_i(a_j) equal to 1 for i = j and 0 otherwise, for each i ∈ [n]. Fig. 11 illustrates the case with n = 3. Further, let sat = 1 and pr = 1/n · 1.
Figure 11. An MDP where n-memory is necessary, depicted for n = 3: a single state with self-loops a1, r(a1) = (1, 0, 0), a2, r(a2) = (0, 1, 0), and a3, r(a3) = (0, 0, 1).
The only way to ε-satisfy the constraints is that for each i, a 1/n fraction of the runs takes only a_i, except for a negligible portion of time. Since these constraints are mutually incompatible for a single run, n different decisions have to be repetitively taken at s, showing the memory requirement. △
We summarize the upper and lower bounds for witness and e-witness strategies in Table 3 and Table 4, respectively.
8. PARETO CURVE APPROXIMATION AND COMPLEXITY SUMMARY
For a single objective, no Pareto curve is required and we can compute the optimal value of the expectation in polynomial time by the linear program L with the objective function max Σ_{a∈A} (x_{a,∅} + x_{a,{1}}) · r(a). For multiple objectives we obtain the following:
Theorem 8.1. For ε > 0, an ε-approximation of the Pareto curve for (multi-quant-conjunctive-joint) can be constructed in time polynomial in |G| and 1/ε and exponential in n.
Proof. We replace exp in Equation 5 of L by a vector v of variables. Maximizing with respect to v is a multi-objective linear program. By [PY00], we can ε-approximate the Pareto curve in time polynomial in the size of the program and 1/ε, and exponential in the number of objectives (the dimension of v). □
The proof of Theorem 8.1 shows that we can obtain a Pareto-curve approximation also for the possible values of the sat or pr vectors for a given exp vector. We simply replace these vectors by vectors of variables, obtaining a multi-objective linear program. If we want the complete Pareto-curve approximation for all the parameters sat, pr, and exp, the number of objectives rises from n to 3·n. The complexity is thus still polynomial in the size of the MDP and 1/ε, and exponential in n.
In particular, for the single-objective case, we can compute also the optimal pr given exp and sat, or the optimal sat given pr and exp.
The complexity results are summarized in the following theorem:
Theorem 8.2. The algorithmic complexities are shown in Table 2. The bounds on the complexity of the witness and e-witness strategies are as shown in Table 3 and Table 4, respectively.
Comments on the tables. U: denotes upper bounds (which suffice for all MDPs) and L: lower bounds (which are required in general for some MDPs). Results without a reference are induced by the specialization/generalization relation depicted in Fig. 1, and for Tables 3 and 4 by ε-witness strategies being a weaker notion than witness strategies. The abbreviations stoch.-up., det.-up., rand., det., inf., fin., and X-mem. stand for stochastic update, deterministic update, randomizing, deterministic, infinite-, finite- and X-memory strategies, respectively. Here n is the dimension of the reward function and p = 1/p_min, where p_min is the smallest positive probability in the MDP. Note that inf. actually means that the strategy is in the form of a Markov strategy, see Section 5.
Remark 8.3. For comparison, the results on previously studied subcases of our problems are depicted in Table 1. △
Table 1. Previous results on algorithmic and strategy complexities. The abbreviations alg., strat., and c. stand for algorithmic, strategy, and complexity, respectively. Cases multiple and single refer to the number of objectives. Results for single-objective MDPs are based on classical literature, e.g. [Put94, Thm.9.1.8]. Results for MDPs with multiple objectives are due to [BBC+14].
Case | Alg. c. | Witness strat. c. | ε-witness strat. c.
multiple (joint-SAT) | poly(|G|, n) | U: det.-up. inf.; L: rand. inf. | U: stoch.-up. 2-mem.; L: rand. 2-mem.
multiple (EXP) | poly(|G|, n) | U: det.-up. inf.; L: rand. inf. | U: stoch.-up. 2-mem., det.-up. fin.; L: rand. 2-mem.
single (joint-SAT) | poly(|G|) | U=L: det. 1-mem. | U=L: det. 1-mem.
single (EXP) | poly(|G|) | U=L: det. 1-mem. | U=L: det. 1-mem.
Table 2. Algorithmic complexity results for each of the discussed cases.
Case | Algorithmic complexity
(multi-quant-conj.-joint) | poly(|G|, 2^n) [Cor. 6.2], NP-hard [Thm. 6.4]
(multi-quant-conj.) | poly(|G|, 2^n) [Thm. 3.1]
(multi-quant-joint) | poly(|G|, n) [Thm. 6.1]
(multi-qual) | poly(|G|, n)
(mono-quant) | poly(|G|)
(mono-qual) | poly(|G|)
Table 3. Witness strategy complexity bounds for each of the discussed cases.
Case | Witness strategy complexity
(multi-quant-conj.-joint) | U: det.-up. inf. [Rem. 7.6]; L: rand. inf.
(multi-quant-conj.) | U: det.-up. inf. [Lem. 7.3]; L: rand. inf.
(multi-quant-joint) | U: det.-up. inf.; L: rand. inf.
(multi-qual) | U: det.-up. inf.; L: rand. inf. [BBC+14, Sec. 5]
(mono-quant) | U: stoch.-up. 2-mem. [Lem. 7.9], det.-up. fin. [Lem. 7.7]; L: rand. 2-mem., for det.-up. p-mem.
(mono-qual) | U (trivially also L): det. 1-mem. [Lem. 7.1]
9. Conclusion
We have presented a unifying solution framework to the expectation and satisfaction optimization of Markov decision processes with multiple objectives. This allows us to synthesize
Table 4. e-witness strategy complexity bounds for each of the discussed cases.
Case | ε-witness strategy complexity
(multi-quant-conj.-joint) | U: stoch.-up. (2^(n+1) + 1)-mem. [Thm. 7.10], det.-up. fin. [Rem. 7.6]; L: rand. n-mem. [Ex. 7.12], for det.-up. p-mem.
(multi-quant-conj.) | U: stoch.-up. (2^n + 1)-mem. [Thm. 7.10], det.-up. fin. [Lem. 7.5]; L: rand. n-mem. [Ex. 7.12], for det.-up. p-mem.
(multi-quant-joint) | U: stoch.-up. 3-mem. [Thm. 7.10], det.-up. fin.; L: rand. 3-mem. [Ex. 7.11]
(multi-qual) | U: stoch.-up. 2-mem. [Thm. 7.10], det.-up. fin.; L: rand. mem. [BBC+14, Sec. 3]
(mono-quant) | U: stoch.-up. 2-mem., det.-up. fin.; L: rand. 2-mem. [Ex. 7.2], for det.-up. p-mem. [Ex. 7.8]
(mono-qual) | U (trivially also L): det. 1-mem.
optimal and e-optimal risk-averse strategies. We have considered several possible combinations of the two semantics and provided algorithms for their solution as well as the complete picture of the complexities for all these cases.
Regarding the algorithmic complexity, we have shown that (multi-quant-joint) and all its special cases can be solved in polynomial time. For both (multi-quant-conjunctive) and (multi-quant-conjunctive-joint), we have presented an algorithm that works in time polynomial in the size of the MDP but exponential in the dimension of the reward function. However, the exponential dependence on the dimension of the reward function is not a limitation for most practical purposes, since the dimension is typically low. For the latter case we have also proved that the problem is NP-hard. The complexity of (multi-quant-conjunctive) remains an interesting open question. Moreover, our algorithms for Pareto-curve approximation work in time polynomial in the size of the MDP and exponential in the dimension of the reward function. However, note that even for the special case of the expectation semantics, the currently best known algorithms depend exponentially on the dimension of the reward function [BBC+14].
We have also provided comprehensive results on strategy complexity. It is known that for both the expectation and the satisfaction semantics with a single objective, deterministic memoryless strategies are sufficient [FV97, BBE10, BBC+14]. We have shown that this carries over only to the (mono-qual) case. In contrast, for (mono-quant) both randomization and memory are necessary. However, we have also shown that only a restricted form of randomization (deterministic update) is necessary even for (multi-quant), thus improving the upper bound for ε-witness strategies for the satisfaction problem of [BBC+14] to finite-memory deterministic update. Furthermore, we have established that with deterministic update the memory size depends on the MDP; the result also applies to the expectation problem of [BBC+14], where no MDP-dependent lower bound was given. We have presented upper bounds on stochastic-update ε-witness strategies, which are constant for (multi-qual) and (multi-quant-joint), and exponentially dependent on the dimension of the reward function for (multi-quant-conjunctive) and (multi-quant-conjunctive-joint). The question whether there are polynomially dependent upper bounds for the latter two cases stays open.
Acknowledgements We are very thankful to the anonymous reviewers for their helpful suggestions and pointing at gaps in the proofs of Lemma 4.3 and Lemma 5.1, and to Rasmus Ibsen-Jensen for discussing the proof of Lemma 4.2.
References
[Alt99] E. Altman. Constrained Markov Decision Processes (Stochastic Modeling). Chapman & Hall/CRC, 1999.
[BBC+14] T. Brázdil, V. Brožek, K. Chatterjee, V. Forejt, and A. Kučera. Markov decision processes with multiple long-run average objectives. LMCS, 10(1), 2014.
[BBE10] T. Brázdil, V. Brožek, and K. Etessami. One-counter stochastic games. In FSTTCS, pages 108-119, 2010.
[BCFK13] T. Brázdil, K. Chatterjee, V. Forejt, and A. Kučera. Trading performance for stability in Markov decision processes. In LICS, pages 331-340, 2013.
[BFRR14] V. Bruyère, E. Filiot, M. Randour, and J.-F. Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. In STACS, pages 199-213, 2014.
[BK08] C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, 2008.
[CFW13] K. Chatterjee, V. Forejt, and D. Wojtczak. Multi-objective discounted reward verification in graphs and MDPs. In LPAR, pages 228-242, 2013.
[CH11] K. Chatterjee and M. Henzinger. Faster and dynamic algorithms for maximal end-component decomposition and related graph problems in probabilistic verification. In SODA, pages 1318-1336, 2011.
[CH12] K. Chatterjee and M. Henzinger. An O(n^2) time algorithm for alternating Büchi games. In SODA, pages 1386-1399, 2012.
[CH14] K. Chatterjee and M. Henzinger. Efficient and dynamic algorithms for alternating Büchi games and maximal end-component decomposition. J. ACM, 2014.
[Cha07] K. Chatterjee. Markov decision processes with multiple long-run average objectives. In FSTTCS, pages 473-484, 2007.
[CL13] K. Chatterjee and J. Lacki. Faster algorithms for Markov decision processes with low treewidth. In CAV, pages 543-558, 2013.
[CMH06] K. Chatterjee, R. Majumdar, and T. A. Henzinger. Markov decision processes with multiple objectives. In STACS, pages 325-336, 2006.
[CR15] L. Clemente and J.-F. Raskin. Multidimensional beyond worst-case and almost-sure problems for mean-payoff objectives. In LICS, pages 257-268, 2015.
[CY95] C. Courcoubetis and M. Yannakakis. The complexity of probabilistic verification. Journal of the ACM, 42(4):857-907, 1995.
[CY98] C. Courcoubetis and M. Yannakakis. Markov decision processes and regular events. IEEE Transactions on Automatic Control, 43(10):1399-1418, October 1998.
[dA97] L. de Alfaro. Formal Verification of Probabilistic Systems. PhD thesis, Stanford University, 1997.
[dAHK07] L. de Alfaro, T. A. Henzinger, and O. Kupferman. Concurrent reachability games. Theor. Comput. Sci., 386(3):188-217, 2007.
[DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 1998.
[EKVY08] K. Etessami, M. Kwiatkowska, M. Vardi, and M. Yannakakis. Multi-objective model checking of Markov decision processes. LMCS, 4(4):1-21, 2008.
[FKN+11] V. Forejt, M. Z. Kwiatkowska, G. Norman, D. Parker, and H. Qu. Quantitative multi-objective verification for probabilistic systems. In TACAS, pages 112-127, 2011.
[FKP12] V. Forejt, M. Z. Kwiatkowska, and D. Parker. Pareto curves for probabilistic model checking. In ATVA, pages 317-332, 2012.
[FKR95] J. A. Filar, D. Krass, and K. W. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2-10, January 1995.
[FV97] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, 1997.
[How60] H. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[KGFP09] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370-1381, 2009.
[KNP02] M. Kwiatkowska, G. Norman, and D. Parker. PRISM: Probabilistic symbolic model checker. In TOOLS, pages 200-204, 2002.
[Kos88] J. Koski. Multicriteria truss optimization. In Multicriteria Optimization in Engineering and in the Sciences, 1988.
[Owe95] G. Owen. Game Theory. Academic Press, 1995.
[Put94] M. L. Puterman. Markov Decision Processes. John Wiley and Sons, 1994.
[PY00] C. H. Papadimitriou and M. Yannakakis. On the approximability of trade-offs and optimal access of web sources. In FOCS, pages 86-92, 2000.
[Roy88] H. Royden. Real Analysis. Prentice Hall, 3rd edition, 1988.
[RRS15] M. Randour, J.-F. Raskin, and O. Sankur. Percentile queries in multi-dimensional Markov decision processes. In CAV, Part I, pages 123-139, 2015.
[Sch86] A. Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience, 1986.
[SCK04] R. Szymanek, F. Catthoor, and K. Kuchcinski. Time-energy design space exploration for multi-layer memory architectures. In DATE, pages 318-323, 2004.
[Seg95] R. Segala. Modeling and Verification of Randomized Distributed Real-Time Systems. PhD thesis, MIT, 1995.
[Var85] M. Vardi. Automatic verification of probabilistic concurrent finite state programs. In FOCS, pages 327-338, 1985.
[WL99] C. Wu and Y. Lin. Minimizing risk models in Markov decision processes with policies depending on target values. Journal of Mathematical Analysis and Applications, 231(1):47-67, 1999.
[YC03] P. Yang and F. Catthoor. Pareto-optimization-based run-time task scheduling for embedded systems. In CODES+ISSS, pages 120-125, 2003.
Appendix A. Linear program for the running example
(1) 1 + 0.5% = ye + yr+ ys,0 + ys,{iy + ys,{2} + ys,{i,2} o.5ye + ya = ya + yu$ + yu,{i} + yu,{2} + yu,{i,2}
yr +yb + Ve = yb + yc + yv,iD + Vv,{i} + yv,{2} + Vv,{i,2} yc + yd = yd + ye + yw,% + yw,{i} + yw,{2} + yw,{i,2}
(2) yu,%+yu,{i}+yu,{2}+yu,{i,2}+yv,%+Vv,{iy+Vv,{2}+Vv,{i,2}+yw,%+yw,{i}+yw,{2} +
Vw,{l,2} = i
(3) yu,% = xa,%
yu,{l} = xa,{l} yu,{2} = xa,{2} Vu,{l,2} = xa,{l,2}
Vv,% + Vw,% = Xb,% + xc$ + xd,% + xe,%
Vv,{l} + Vw,{l} = xb,{l} + xc,{l} + xd,{l} + xe,{l}
yv,{2} + yw,{2} = xb,{2} + xc,{2} + xd,{2} + xe,{2}
Vv,{l,2} + Vw,{l,2} = xb,{l,2} + xc,{l,2} + xd,{l,2} + xe,{l,2}
(4) 0.5x^0 = xfß + xrß
0-5a^{2} = xe,{2} + xr,{2}
0-5^,{i,2} = xe,{i,2} + xr,{i,2} 0.5x^0 + xaß = xaß
0.5xe{1} + Xa^!} = Xa^!}
0.5a^{2} + xa,{2} = xa,{2}
0-5^,{l,2} + xa,{l,2} = ^,{1,2}
xr$ + xbfi + xefi = xbfi + xcfi xr,{l} + xb,{l} + xe,{l} = xb,{l} + xc,{l} xr,{2} + xb,{2} + xe,{2} = xb,{2} + xc,{2} xr,{l,2} + xb,{l,2} + xe,{l,2} = xb,{l,2} + xc,{l,2}
xc$ + xd,% = xd,% + xe$ xc,{l} + xd,{l} = xd,{l} + xe,{l} xc,{2} + xd,{2} = xd,{2} + xe,{2} ^,{1,2} + ^,{1,2} = ^,{1,2} + xe,{l,2}
(5) r(£)xefi + r(e)xit{1} + r(í)x^{2} + r(i)xt^2} + r(r)xrfi + r(r)x^{1} + r(r)x^{2} + r(r)xrÁm+(A, O):eq,0+(4, 0):eq,{1} + (4, 0K,{2} + (4, 0)a;aj{lj2} + (l, 0)xb#+(l, 0)a:6j{1} + (l,0)a;6i{2} + (l,0)a:6j{lj2} + (O,OK,0 + (0,0K,{1} + (0, 0K,{2} + (0, 0K,{1,2} + (0, l)a;di0+(O, l)^,{i}+(0, l)^,{2}+(0, l)^,{i,2}+(0, 0)xefi+(0,0K,{1}+(0,0)xeÁ2} + (O.OK,^} > (1.1,0.5)
(6) ^xa,{i} > 0.5a;a,{i} 0 > 0.5a;a,{2} ^xa,{i,2} > 0.5a;a,{i,2} 0 > 0.5a;a,{i,2}
xb,{i} > 0.5a;6,{i} + 0.5a;c,{i} + 0.5a;d,{i} + 0.5a;e,{i} xd,{2} > 0.5a;6,{2} + 0.5a;c,{2} + 0.5a;d,{2} + 0.5a;e,{2} ^6,(1,2} > 0.5a;6,{i,2} + 0.5xC){1)2} + 0.5xd^2} + 0.5xe){1)2} ^,{1,2} > 0.5a;6,{i,2} + 0.5xC){1)2} + 0.5xd:{h2} + 0.5xe){1)2}
(7) xe,{l}+xe,{l,2}+xr,{l}+xr,{l,2}+xa,{l}+xa,{l,2}+xb,{l}+xb,{l,2}+xc,{l}+xc,{l,2} + xd,{l} + xd,{l,2} + xe,{l} + xe,{l,2} > 0-8
xe,{2}+xe,{l,2}+xr,{2}+xr,{l,2}+xa,{2}+xa,{l,2}+xb,{2}+xb,{l,2} + xc,{2}+xc,{l,2} + xd,{2} + xd,{l,2} + xe,{2} + xe,{l,2} > 0-8
Faster Statistical Model Checking for Unbounded Temporal Properties
PRZEMYSLAW DACA and THOMAS A. HENZINGER, IST Austria, Austria
JAN KRETINSKY, Technische Universität München, Germany
TATJANA PETROV, IST Austria, Austria
We present a new algorithm for the statistical model checking of Markov chains with respect to unbounded temporal properties, including full linear temporal logic. The main idea is that we monitor each simulation run on the fly, in order to detect quickly if a bottom strongly connected component is entered with high probability, in which case the simulation run can be terminated early. As a result, our simulation runs are often much shorter than required by termination bounds that are computed a priori for a desired level of confidence on a large state space. In comparison to previous algorithms for statistical model checking our method is not only faster in many cases but also requires less information about the system, namely, only the minimum transition probability that occurs in the Markov chain. In addition, our method can be generalised to unbounded quantitative properties such as mean-payoff bounds.
CCS Concepts: • Theory of computation → Logic and verification; Random walks and Markov chains; Modal and temporal logics;
Additional Key Words and Phrases: Markov chains, mean payoff, simulation, statistical model checking, temporal logic
ACM Reference Format:
Przemyslaw Daca, Thomas A. Henzinger, Jan Křetínský, and Tatjana Petrov. 2016. Faster Statistical Model Checking for Unbounded Temporal Properties. ACM Trans. Comput. Logic 9, 4, Article 39 (March 2010), 26 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
Traditional numerical algorithms for the verification of Markov chains may be computationally intense or inapplicable, when facing a large state space or limited knowledge of the chain. To this end, statistical algorithms are used as a powerful alternative. Statistical model checking (SMC) typically refers to approaches where (i) finite paths of the Markov chain are sampled a finite number of times, (ii) the property of interest is verified for each sampled path (e.g. state r is reached), and (iii) hypothesis testing or statistical estimation is used to infer conclusions (e.g. state r is reached with probability at most 0.5) and give statistical guarantees (e.g. the conclusion is valid with 99% confidence). SMC approaches differ in (a) the class of properties they can verify (e.g. bounded or unbounded properties), (b) the strength of statistical guarantees they provide (e.g. confidence bounds, only asymptotic convergence of the method towards
This is an extended version of [Daca et al. 2016a] with full proofs and an extended discussion of the experimental results. This research was funded in part by the European Research Council (ERC) under grant agreement 267989 (QUAREM), the Austrian Science Fund (FWF) under grants project S11402-N23 (RiSE) and Z211-N23 (Wittgenstein Award), the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) REA Grant No 291734, the SNSF Advanced Postdoc Mobility Fellowship — grant number P300P2_161067, and the Czech Science Foundation under grant agreement P202/12/G061.
Table I. SMC approaches to Markov chain verification, organised by (i) the class of verifiable properties, and (ii) the required information about the Markov chain, where p_min is the minimum transition probability, |S| is the number of states, and λ is the second largest eigenvalue of the chain. Required information (columns): no info; p_min; |S|, p_min; λ; topology.
LTL, mean payoff: ✗ (no info); here (p_min); [Brázdil et al. 2014] (LTL).
◊, U: ✗ (no info); here (p_min); [Younes et al. 2010]; [He et al. 2010]; [Younes et al. 2010].
bounded: [Younes and Simmons 2002], [Sen et al. 2004] (no info).
the correct value, or none), and (c) the amount of information they require about the Markov chain (e.g. the topology of the graph). In this paper, we provide an algorithm for SMC of unbounded properties, with confidence bounds, in the setting where only the minimum transition probability of the chain is known. Such an algorithm is particularly desirable in scenarios when the system is not known ("black box"), but also when it is too large to construct or fit into memory.
Most of the previous efforts in SMC have focused on the analysis of properties with bounded horizon [Younes and Simmons 2002; Sen et al. 2004; Younes et al. 2006; Younes and Simmons 2006; Jha et al. 2009; Jegourel et al. 2012; Bulychev et al. 2012]. For bounded properties (e.g. state r is reached with probability at most 0.5 in the first 1000 steps) statistical guarantees can be obtained in a completely black-box setting, where execution runs of the Markov chain can be observed, but no other information about the chain is available. Unbounded properties (e.g. state r is reached with probability at most 0.5 in any number of steps) are significantly more difficult, as a stopping criterion is needed when generating a potentially infinite execution run, and some information about the Markov chain is necessary for providing statistical guarantees (for an overview, see Table I). On the one hand, some approaches require the knowledge of the full topology in order to preprocess the Markov chain. On the other hand, when the topology is not accessible, there are approaches where the correctness of the statistics relies on information ranging from the second eigenvalue λ of the Markov chain, to knowledge of both the number |S| of states and the minimum transition probability p_min.
Our contribution is a new SMC algorithm for full linear temporal logic (LTL), as well as for unbounded quantitative properties (mean payoff), which provides strong guarantees in the form of confidence bounds. Our algorithm uses less information about the Markov chain than previous algorithms that provide confidence bounds for unbounded properties—we need to know only the minimum transition probability p_min of the chain, and not the number of states nor the topology. Yet, experimentally, our algorithm performs in many cases better than these previous approaches (see Section 5). Our main idea is to monitor each execution run on the fly in order to build statistical hypotheses about the structure of the Markov chain. In particular, if from observing the current prefix of an execution run we can stipulate that with high probability a bottom strongly connected component (BSCC) of the chain has been entered and explored, then we can terminate the current execution run. The information obtained from execution prefixes allows us to terminate executions as soon as the property is decided with the required confidence, which is usually much earlier than any bounds that can be computed a priori. As far as we know, this is the first SMC algorithm that uses information obtained from execution prefixes.
Finding p_min is a light assumption in many realistic scenarios and often does not depend on the size of the chain; e.g. bounds on the rates for reaction kinetics in chemical reaction systems are typically known; alternatively, from a PRISM language model they can be easily inferred without constructing the respective state space.
Fig. 1. A Markov chain.
Example 1.1. Consider the property of reaching state r in the Markov chain depicted in Figure 1. While the execution runs reaching r satisfy the property and can be stopped without ever entering any v_i, the finite execution paths without r, such as stuttutuut, are inconclusive. In other words, observing this path does not rule out the existence of a transition from, e.g., u to r, which, if existing, would eventually be taken with probability 1. This transition could have arbitrarily low probability, rendering its detection arbitrarily unlikely, yet its presence would change the probability of satisfying the property from 0.5 to 1. However, knowing that if there exists such a transition leaving the set, its transition probability is at least p_min = 0.01, we can estimate the probability that the system is stuck in the set {t, u} of states. Indeed, if existing, the exit transition was missed at least four times, no matter whether it exits t or u. Consequently, the probability that there is no such transition and {t, u} is a BSCC is at least
1 − (1 − p_min)^4.
This means that, in order to get 99% confidence that {t, u} is a BSCC, we only need to see both t and u around 500 times on a run. This is in stark contrast to a priori bounds that provide the same level of confidence, such as the (1/p_min)^{|S|} = 100^{|S|} runs required by [Brázdil et al. 2014], which is infeasible for large m. In contrast, our method's performance is independent of m. △
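For concreteness, the "around 500 times" can be recomputed from the bound above: we need the smallest k with 1 − (1 − p_min)^k ≥ 0.99 for p_min = 0.01. A quick, purely illustrative check:

# how many missed exits give 99% confidence that {t, u} is a BSCC?
from math import ceil, log

p_min, confidence = 0.01, 0.99
k = ceil(log(1 - confidence) / log(1 - p_min))
print(k)  # 459 -- hence "around 500" occurrences of t and u suffice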
Monitoring execution prefixes allows us to design an SMC algorithm for complex unbounded properties such as full LTL. More precisely, we present a new SMC algorithm for LTL over Markov chains, specified as follows.
Input:
— we can sample finite runs of arbitrary length from an unknown finite-state discrete-time Markov chain M according to the initial distribution,
— we are given a lower bound p_min > 0 on the transition probabilities in M,
— an LTL formula φ,
— two error bounds α, β > 0.
Output:
— if P[φ] ≥ p + ε, return YES with probability at least 1 − α, and
— if P[φ] ≤ p − ε, return NO with probability at least 1 − β.
2. PRELIMINARIES
2.1. Markov chains
A Markov chain (MC) is a tuple M = (S, P, μ), where S is a finite set of states, μ is the initial probability distribution, and P : S × S → [0, 1] is the transition probability matrix, such that for every s ∈ S it holds that Σ_{s'∈S} P(s, s') = 1. We let p_min := min({P(s, s') > 0 | s, s' ∈ S}) denote the smallest positive transition probability in M. A run of M is an infinite sequence ρ = s0 s1 ⋯ of states, such that for all i ≥ 0, P(s_i, s_{i+1}) > 0; we let ρ[i] denote the state s_i. A path π in M is a finite prefix of a run of M. We denote the empty path by λ and the concatenation of paths π1 and π2 by π1 · π2. Each path π in M determines the set of runs Cone(π) consisting of all runs that start with π. To M we assign the probability space (Runs, F, P_M), where Runs is the set of all runs in M, F is the σ-algebra generated by all Cone(π), and P_M is the unique probability measure such that
P_M[Cone(s0 s1 ⋯ s_k)] = μ(s0) · ∏_{i=1}^{k} P(s_{i−1}, s_i),
where the empty product equals 1. We write P instead of P_M if the Markov chain is clear from the context. The elements of F are called events. The expected value of a random variable f : Runs → ℝ is E[f] = ∫_Runs f dP.
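As a direct rendering of the cone probability above (with made-up toy numbers, for illustration only): P[Cone(s0 s1 ⋯ s_k)] is the initial probability of s0 multiplied by the probabilities of the consecutive transitions.

mu = {'s': 1.0}
P = {('s', 't'): 0.5, ('t', 'u'): 1.0, ('u', 't'): 0.99, ('u', 'u'): 0.01}
path = ['s', 't', 'u', 't']
prob = mu.get(path[0], 0.0)
for a, b in zip(path, path[1:]):
    prob *= P.get((a, b), 0.0)     # multiply by P(s_{i-1}, s_i)
print(prob)                        # 1.0 * 0.5 * 1.0 * 0.99 = 0.495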
A non-empty set C ⊆ S of states is strongly connected (SC) if for every s, s' ∈ C there is a path from s to s'. A set of states C ⊆ S is a bottom strongly connected component (BSCC) of M if it is a maximal SC, and for each s ∈ C and s' ∈ S ∖ C we have P(s, s') = 0. The sets of all SCs and BSCCs in M are denoted by SC and BSCC, respectively. Note that with probability 1, the set of states that appear infinitely many times on a run forms a BSCC. From now on, we use the standard notions of SC and BSCC for directed graphs as well.
2.2. Hypothesis Testing
Let X be a random variable, and suppose we are interested in whether the expected value E[X] is larger or smaller than some threshold p. We formulate this question as a hypothesis testing problem, where we decide between the null hypothesis H_0 and the alternative hypothesis H_1:
H_0 : E[X] ≥ p + ε    H_1 : E[X] ≤ p − ε.    (1)
The indifference region ε > 0 describes the interval (p − ε, p + ε) where both hypotheses are acceptable.
Two types of errors are used to evaluate the precision of a solution. A type I error is the probability of accepting H_1 when H_0 holds. Similarly, a type II error is the probability of choosing H_0 when H_1 holds. The test strength (α, β) is a pair of values that bound the maximum probabilities of making type I and type II errors, respectively. In general, it is not possible to obtain low values of α and β at the same time when the indifference region ε is zero, since the probability E[X] may be arbitrarily close to the threshold from either side, making a type I or II error very likely.
Sequential probability ratio test. The sequential probability ratio test (SPRT) is a popular statistical procedure for hypothesis testing [Wald 1945; Younes 2004]. In the SPRT the number of samples is not fixed, but sampling continues until the observations give strong evidence in favor of H0 or Hi. The SPRT gives no guarantee on the maximal number of samples; in practice, however, it often terminates quickly.
The SPRT works as follows. Suppose X is a Bernoulli random variable, i.e. only the values 0 and 1 are possible. After observing samples x = x_1, …, x_n from X, the following ratio is computed:
p(x | p_1) / p(x | p_0) = ∏_{i=1}^{n} P(X = x_i | E[X] = p_1) / P(X = x_i | E[X] = p_0) = p_1^{d_n} (1 − p_1)^{n−d_n} / (p_0^{d_n} (1 − p_0)^{n−d_n}),
where d_n = Σ_{i=1}^{n} x_i, p_0 = p + ε, and p_1 = p − ε. The decision rule for accepting a hypothesis is:
accept H_0 if p(x | p_1) / p(x | p_0) ≤ B,    accept H_1 if p(x | p_1) / p(x | p_0) ≥ A.    (2)
Finding values of A, B such that the test has the required strength is a difficult task. In practice, the values A = (1 − β)/α and B = β/(1 − α) are used, since they result in a test whose strength is close to (α, β) [Younes 2004].
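The following is a minimal SPRT sketch for Bernoulli samples, following the ratio and the decision rule above with the practical choices A = (1 − β)/α and B = β/(1 − α); the function and variable names are ours, not the paper's.

import random
from math import log

def sprt(sample, p, eps, alpha, beta, max_samples=10**6):
    """Decide between H0: E[X] >= p+eps and H1: E[X] <= p-eps from Bernoulli samples."""
    p0, p1 = p + eps, p - eps
    log_A, log_B = log((1 - beta) / alpha), log(beta / (1 - alpha))
    log_ratio = 0.0                        # log of p(x|p1)/p(x|p0)
    for _ in range(max_samples):
        x = sample()
        log_ratio += log(p1 / p0) if x == 1 else log((1 - p1) / (1 - p0))
        if log_ratio <= log_B:
            return "H0"                    # ratio <= B: accept H0
        if log_ratio >= log_A:
            return "H1"                    # ratio >= A: accept H1
    return None                            # undecided within the sample budget

# toy usage: a biased coin with E[X] = 0.7, threshold p = 0.5
print(sprt(lambda: 1 if random.random() < 0.7 else 0,
           p=0.5, eps=0.1, alpha=0.01, beta=0.01))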
3. SOLUTION FOR REACHABILITY
A fundamental problem in Markov chain verification is computing the probability that a certain set of goal states is reached. For the rest of the paper, let M = (S, P, μ) be a Markov chain and G ⊆ S the set of goal states in M. We let
◊G := {ρ ∈ Runs | ∃ i ≥ 0 : ρ[i] ∈ G}
denote the event that "eventually a state in G is reached." The event ◊G is measurable and its probability P[◊G] can be computed numerically or estimated using statistical algorithms. Since no bound on the number of steps for reaching G is given, the major difficulty for any statistical approach is to decide how long each sampled path should be. We can stop extending the path either when we reach G, or when no more new states can be reached anyway. The latter happens if and only if we are in a BSCC and we have seen all of its states.
In this section, we first show how to monitor each simulation run on the fly, in order to detect quickly if a BSCC has been entered with high probability. Then, we show how to use hypothesis testing in order to estimate P[◊G].
3.1. BSCC detection
We start with an example illustrating how to measure probability of reaching a BSCC from one path observation.
Example 3.1. Recall Example 1.1 and Figure 1. Now, consider an execution path stuttutu. Intuitively, does {t, u} look like a good "candidate" for being a BSCC of M? We visited both t and u three times; we have taken a transition from each of t and u at least twice without leaving {t, u}. By the same reasoning as in Example 1.1, we could have missed some outgoing transition with probability at most (1 − p_min)^2. The structure of the system that can be deduced from this path is in Figure 2 and is correct with probability at least 1 − (1 − p_min)^2. △
Now we formalise our intuition. Given a finite or infinite sequence ρ = s0 s1 ⋯, the support of ρ is the set {s0, s1, …} of states occurring in ρ. Further, the graph of ρ has the support of ρ as vertices and the edges {(s_i, s_{i+1}) | i = 0, 1, …}.
Definition 3.2 (Candidate). If a path π has a suffix κ such that the support of κ is a BSCC of the graph of π, we call this support the candidate of π. Moreover, for k ∈ ℕ, we call it a k-candidate (of π) if each s in the candidate has at least k occurrences in κ and the last element of κ has at least k + 1 occurrences. A k-candidate of a run ρ is a k-candidate of some prefix of ρ.
Note that for each path there is at most one candidate. Therefore, we write K(π) to denote the candidate of π if there is one, and K(π) = ⊥ otherwise. Observe that each K(π) ≠ ⊥ is strongly connected in M.
Example 3.3. Consider the path π = stuttutu; then K(π) = {t, u}. Observe that {t} is not a candidate as it is not maximal. Further, K(π) is a 2-candidate (and as such also a 1-candidate), but not a 3-candidate. Intuitively, the reason is that we only took a transition from u (to the candidate) twice, cf. Example 3.1. △
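The candidate of a path and its k-candidacy are easy to compute directly from Definition 3.2. The following helper code (ours, purely for illustration) reproduces Example 3.3 for π = stuttutu.

from collections import Counter

def candidate(path):
    """Return the candidate K(path) as a set of states, or None (Definition 3.2)."""
    succ, pred = {}, {}
    for u, v in zip(path, path[1:]):
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)

    def reach(start, edges):
        seen, stack = {start}, [start]
        while stack:
            for w in edges.get(stack.pop(), ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    if not path or path[-1] not in succ:          # last state has no outgoing edge yet
        return None
    last = path[-1]
    scc = reach(last, succ) & reach(last, pred)   # SCC of the last state
    # the SCC must be bottom in the graph of the path ...
    if any(w not in scc for v in scc for w in succ.get(v, ())):
        return None
    # ... and must be exactly the support of a suffix of the path
    i = max((j + 1 for j, s in enumerate(path) if s not in scc), default=0)
    return scc if set(path[i:]) == scc else None

def is_k_candidate(path, k):
    cand = candidate(path)
    if cand is None:
        return False
    i = max((j + 1 for j, s in enumerate(path) if s not in cand), default=0)
    counts = Counter(path[i:])                    # occurrences within the candidate suffix
    return all(counts[s] >= k for s in cand) and counts[path[-1]] >= k + 1

pi = "stuttutu"
print(candidate(pi))                              # {'t', 'u'}
print(is_k_candidate(pi, 2), is_k_candidate(pi, 3))   # True False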
Intuitively, the higher the k, the more it looks as if the k-candidate is indeed a BSCC. Denoting by Cand_k(K) the random predicate of K being a k-candidate on a run, the probability of "unluckily" detecting any specific non-BSCC set of states K as a k-candidate can be bounded as follows.
Lemma 3.4. For every K ⊆ S such that K ∉ BSCC, and every s ∈ K, k ∈ ℕ,
P[Cand_k(K) | ◊s] ≤ (1 − p_min)^k.
Proof. Since K is not a BSCC, there is a state t ∈ K with a transition to t' ∉ K. The set of states K is a k-candidate of a run only if t is visited at least k times by the path and was never followed by t' (indeed, even if t is the last state in the path, by definition of a k-candidate, there are also at least k previous occurrences of t in the path). Further, since the transition from t to t' has probability at least p_min, the probability of not taking the transition k times is at most (1 − p_min)^k. □
Fig. 2. The graph of the path stuttutu.
Example 3.5. We illustrate how candidates "evolve over time" along a run. Consider a run ρ = s0 s0 s1 s0 ⋯ of the Markov chain in Figure 3. The empty and the one-letter prefix do not have a candidate, s0 s0 has the candidate {s0}, then again K(s0 s0 s1) = ⊥, and K(s0 s0 s1 s0) = {s0, s1}. One can observe that subsequent candidates are either disjoint or contain some of the previous candidates. Consequently, there are at most 2|S| − 1 candidates on every run, which is in our setting an unknown bound. △
While we have bounded the probability of detecting any specific non-BSCC set K as a k-candidate, we need to bound the overall error for detecting a candidate that is not a BSCC. Since there can be many false candidates on a run before the real BSCC (e.g. Figure 3), we need to bound the error of reporting any of them.
In the following, we first formalise the process of discovering candidates along the run. Second, we bound the error that any of the non-BSCC candidates becomes a fc-candidate. Third, we bound the overall error of not detecting the real BSCC by increasing k every time a different candidate is found.
We start with discovering the sequence of candidates on a run. For a run ρ = s0 s1 ⋯, consider the sequence of random variables defined by K(s0 … s_j) for j ≥ 0, and let (K_i)_{i≥1} be the subsequence without undefined elements and with no repetition of consecutive elements. For example, for a run ρ = s0 s1 s1 s1 s0 s1 s2 s2 ⋯, we have K_1 = {s1}, K_2 = {s0, s1}, K_3 = {s2}, etc. Let K_j be the last element of this sequence, called the final candidate. Additionally, we define K_ℓ := K_j for all ℓ > j. We describe the lifetime of a candidate. Given a non-final K_i, we write ρ = α_i β_i b_i γ_i d_i δ_i so that the support of α_i is disjoint from K_i, the support of β_i b_i γ_i is K_i, d_i ∉ K_i, and K(α_i β_i) ≠ K_i, K(α_i β_i b_i) = K_i. Intuitively, we start exploring K_i in β_i; K_i becomes a candidate at b_i, the birthday of the i-th candidate; and it remains a candidate until d_i, the death of the i-th candidate. For example, for the run ρ = s0 s1 s1 s1 s0 s1 s2 s2 ⋯ and i = 1, we have α_1 = s0, β_1 = s1, b_1 = s1, γ_1 = s1, d_1 = s0, δ_1 = s1 s2 s2 ρ[8] ρ[9] ⋯. Note that the final candidate is almost surely a BSCC of M and would thus have γ_j infinite.
Now, we proceed to bounding the errors for each candidate. Since there is an unknown number of candidates on a run, we will need a slightly stronger definition. First, observe that Cand_k(K_i) holds iff K_i is a k-candidate of β_i b_i γ_i. We say K_i is a strong k-candidate, written SCand_k(K_i), if it is a k-candidate of b_i γ_i. Intuitively, it becomes a k-candidate even without counting the discovery phase. As a result, even if we already assume there exists an i-th candidate, its strong k-candidacy gives the same guarantee on being a BSCC as in Lemma 3.4.
Fig. 3. A family (for n ∈ ℕ) of Markov chains with large eigenvalues.
Lemma 3.6. For every i, k ∈ ℕ, we have
P[SCand_k(K_i) | K_i ∉ BSCC] ≤ (1 − p_min)^k.
Proof.
P[SCand_k(K_i) | K_i ∉ BSCC]
= Σ_{C ∈ SC∖BSCC} Σ_{s ∈ C} P[K_i = C, b_i = s | K_i ∉ BSCC] · P[SCand_k(C) | K_i = C, b_i = s]
≤ Σ_{C ∈ SC∖BSCC} Σ_{s ∈ C} P[K_i = C, b_i = s | K_i ∉ BSCC] · P[Cand_k(C) | ◊s]    (by the Markov property)
≤ Σ_{C ∈ SC∖BSCC} Σ_{s ∈ C} P[K_i = C, b_i = s | K_i ∉ BSCC] · (1 − p_min)^k    (by Lemma 3.4)
= (1 − p_min)^k.    (by P[K_i ∈ SC, b_i ∈ K_i] = 1)
□
Since the number of candidates can only be bounded with some knowledge of the state space, e.g. its size, we assume no bounds and provide a method to bound the error even for an unbounded number of candidates on a run.
Lemma 3.7. For (k_i)_{i=1}^{∞} ∈ ℕ^ω, let Err be the set of runs such that for some i ∈ ℕ, we have SCand_{k_i}(K_i) despite K_i ∉ BSCC. Then
P[Err] ≤ Σ_{i=1}^{∞} (1 − p_min)^{k_i}.
Proof.
P[Err] = P[ ⋃_{i=1}^{∞} (SCand_{k_i}(K_i) ∩ K_i ∉ BSCC) ]
≤ Σ_{i=1}^{∞} P[SCand_{k_i}(K_i) ∩ K_i ∉ BSCC]    (by the union bound)
= Σ_{i=1}^{∞} P[SCand_{k_i}(K_i) | K_i ∉ BSCC] · P[K_i ∉ BSCC]
≤ Σ_{i=1}^{∞} P[SCand_{k_i}(K_i) | K_i ∉ BSCC]
≤ Σ_{i=1}^{∞} (1 − p_min)^{k_i}.    (by Lemma 3.6)
□
In Algorithm 1 we present a procedure for deciding whether a BSCC inferred from a path π is indeed a BSCC with confidence greater than 1 − δ. We use the notation SCAND_{k_i}(K, π) to denote the function deciding whether K is a strong k_i-candidate on π. The overall error bound is obtained by setting k_i := ⌈(i · log 2 − log δ) / (−log(1 − p_min))⌉, which ensures (1 − p_min)^{k_i} ≤ δ/2^i.
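As a quick numeric check of this schedule (with k_i in the form reconstructed above), the following snippet verifies that each error term is indeed dominated by δ/2^i and that their sum stays below δ; the numbers are illustrative only.

from math import ceil, log

p_min, delta = 0.01, 0.05
total = 0.0
for i in range(1, 101):
    k_i = ceil((i * log(2) - log(delta)) / -log(1 - p_min))
    term = (1 - p_min) ** k_i
    assert term <= delta / 2 ** i       # each term is at most delta / 2^i
    total += term
print(total, "<=", delta)               # the series is dominated by delta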
Algorithm 1 REACHEDBSCC
Input: path π = s0 s1 ⋯ s_n, p_min, δ ∈ (0, 1]
Output: Yes iff K(π) ∈ BSCC
  C ← ⊥, i ← 0
  for j = 0 to n do
    if K(s0 ⋯ s_j) ≠ ⊥ and K(s0 ⋯ s_j) ≠ C then
      C ← K(s0 ⋯ s_j)
      i ← i + 1
      k_i ← ⌈(i · log 2 − log δ) / (−log(1 − p_min))⌉
  if i ≥ 1 and SCAND_{k_i}(K(π), π) then return Yes
  else return No
Theorem 3.8. For every δ > 0, Algorithm 1 is correct with error probability at most δ.
Proof. Since M is finite, Algorithm 1 terminates almost surely. The probability of returning an incorrect result can be bounded by the probability of returning an incorrect result for one of the non-final candidates, which by Lemma 3.7 is at most
Σ_{i=1}^{∞} (1 − p_min)^{k_i} ≤ Σ_{i=1}^{∞} δ/2^i = δ.
□
We have shown how to detect a BSCC of a single path with the desired confidence. In Algorithm 2, we show how to use our BSCC detection method to decide whether a given path reaches the set G with confidence 1 − δ. The function NEXTSTATE(π) randomly picks a state according to the initial distribution μ if the path is empty (π = λ); otherwise, if t is the last state of π, it randomly chooses its successor according to P(t, ·). The algorithm returns Yes when π reaches a state in G, and No when for some i, the i-th candidate is a strong k_i-candidate. In the latter case, with probability at least 1 − δ, π has reached a BSCC not containing G. Hence, with probability at most δ, the algorithm returns No for a path that could reach the goal.
Algorithm 2 SINGLEPATHREACH
Input: goal states G of M, p_min, δ ∈ (0, 1]
Output: Yes iff a run reaches G
  π ← λ
  repeat
    s ← NEXTSTATE(π)
    π ← π · s
    if s ∈ G then return Yes                    ▷ We have provably reached G
  until REACHEDBSCC(π, p_min, δ)                ▷ By Theorem 3.8, P[K(π) ∈ BSCC] ≥ 1 − δ
  return No
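Putting Algorithms 1 and 2 together, a self-contained simulation sketch might look as follows. It is our own code on a made-up toy chain, using the schedule k_i in the form reconstructed above; it is not the authors' implementation.

import random
from math import ceil, log
from collections import Counter

# toy Markov chain (illustrative only): transition distributions, goal set G = {'r'}
chain = {
    's': {'t': 0.5, 'r': 0.5},
    't': {'u': 1.0},
    'u': {'t': 0.9, 'u': 0.1},
    'r': {'r': 1.0},
}
G, p_min, delta, init = {'r'}, 0.1, 0.01, 's'

def candidate(path):
    """Candidate K(path) as in Definition 3.2, or None."""
    succ = {}
    for a, b in zip(path, path[1:]):
        succ.setdefault(a, set()).add(b)
    if not path or path[-1] not in succ:
        return None
    seen, stack = {path[-1]}, [path[-1]]          # states reachable from the last state
    while stack:
        for w in succ.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    if any(w not in seen for v in seen for w in succ.get(v, ())):   # not closed
        return None
    i = max((j + 1 for j, s in enumerate(path) if s not in seen), default=0)
    return seen if set(path[i:]) == seen else None

def strong_k_candidate(path, cand, k):
    """k-candidacy counted from the birthday of the current candidate on."""
    start = max((j + 1 for j, s in enumerate(path) if s not in cand), default=0)
    birth = next(j for j in range(start, len(path)) if candidate(path[:j + 1]) == cand)
    counts = Counter(path[birth:])
    return all(counts[s] >= k for s in cand) and counts[path[-1]] >= k + 1

def single_path_reach():
    path, i, k_i, prev = [], 0, 0, None
    while True:
        dist = chain[path[-1]] if path else {init: 1.0}
        path.append(random.choices(list(dist), weights=list(dist.values()))[0])
        if path[-1] in G:
            return True                           # provably reached G
        cand = candidate(path)
        if cand is not None and cand != prev:     # a new candidate appeared: raise k_i
            prev, i = cand, i + 1
            k_i = ceil((i * log(2) - log(delta)) / -log(1 - p_min))
        if cand is not None and strong_k_candidate(path, cand, k_i):
            return False                          # a BSCC without G, with prob. >= 1 - delta

samples = [single_path_reach() for _ in range(200)]
print("estimated P[reach r] ~", sum(samples) / len(samples))   # around 0.5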
3.2. Hypothesis testing with bounded error
In the following, we show how to estimate the probability of reaching a set of goal states, by combining the BSCC detection and hypothesis testing. More specifically, we sample many paths of a Markov chain, decide for each whether it reaches the goal states (Algorithm 2), and then use hypothesis testing to estimate the event probability.
The hypothesis testing is adapted to the fact that testing reachability on a single path may report false negatives.
Let X^G_δ be a Bernoulli random variable such that X^G_δ = 1 if and only if SINGLEPATHREACH(G, p_min, δ) = Yes, describing the outcome of Algorithm 2. The following theorem establishes that X^G_δ estimates P[◊G] with a bias bounded by δ.
Theorem 3.9. For every δ > 0, we have P[◊G] − δ ≤ E[X^G_δ] ≤ P[◊G].
Proof. Since the event ◊G is necessary for X^G_δ = 1, we have P[◊G | X^G_δ = 1] = 1. It follows that E[X^G_δ] = P[X^G_δ = 1] = P[◊G, X^G_δ = 1] ≤ P[◊G], hence the upper bound. As for the lower bound:
E[X^G_δ] = P[X^G_δ = 1] = P[◊G, X^G_δ = 1]    (◊G is necessary for X^G_δ = 1)
= P[◊G] − P[◊G, X^G_δ = 0]
≥ P[◊G] − δ.    (by Theorem 3.8)
□
In order to conclude on the value of P[◊G], the standard statistical model checking approach via hypothesis testing (cf. Section 2.2) decides between the hypotheses
H_0 : P[◊G] ≥ p + ε    H_1 : P[◊G] ≤ p − ε,
where ε is the desired indifference region.
where e is a desired indifference region. As we do not have precise observations on each path, we reduce this problem to a hypothesis testing on the variable X% with a narrower indifference region:
H0 : E[Xl] >p+(e-S) H[ : E[X^] p + e. By Theorem 3.9 it follows that w\X% = 1] > P[OG] - 5 > p + (e - S), thus H'0 also holds. By assumption the test T' accepts H[ with probability at most a, thus, by the reduction, T also accepts Hi with probability < a. The proof for type II error is analogous.
□
Lemma 3.10 gives us the following algorithm to decide between H_0 and H_1. We generate samples x_1, …, x_n ∼ X^G_δ from SINGLEPATHREACH(G, p_min, δ), and apply a statistical test to decide between H'_0 and H'_1. Finally, we accept H_0 if H'_0 was accepted by the test, and H_1 otherwise.
4. EXTENSIONS
In this section, we present how the on-the-fly BSCC detection can be used for verifying LTL and quantitative properties (mean payoff).
4.1. Linear temporal logic
We show how our method extends to properties expressible by linear temporal logic (LTL) [Pnueli 1977] and, in the same manner, to all w-regular properties. Given
a finite set Ap of atomic propositions, a labelled Markov chain (LMC) is a tuple M = (S, P, μ, L), where (S, P, μ) is a MC and L : S → 2^Ap is a labelling function. The formulae of LTL are given by the following syntax:
φ ::= a | ¬φ | φ ∧ φ | Xφ | φ U φ
for a £ Ap. The semantics is defined with respect to a word w € (2Ap)". The ith letter of w is denoted by w[i], i.e. w = w[0]w[l] ■ ■ ■ and we write Wi for the suffix w[i]w[i + 1] • • • . We define
w ⊨ a        ⟺  a ∈ w[0]
w ⊨ ¬φ       ⟺  not w ⊨ φ
w ⊨ φ ∧ ψ    ⟺  w ⊨ φ and w ⊨ ψ
w ⊨ Xφ       ⟺  w_1 ⊨ φ
w ⊨ φ U ψ    ⟺  ∃ k ∈ ℕ : w_k ⊨ ψ and ∀ 0 ≤ j < k : w_j ⊨ φ.
The set {w e (2Ap)" w |= p} is denoted by L (?). Given a labelled Markov chain A4 and an LTL formula p, we are interested in the measure
¥M[p] := p^[{p e Runs | L(p) |= p}],
where L is naturally extended to runs by L(p)[i] = L(p[i\) for all i.
For every LTL formula φ, one can construct a deterministic Rabin automaton that accepts exactly the words that satisfy φ.
Definition 4.1 (Deterministic Rabin automaton). A deterministic Rabin automaton (DRA) is a tuple A = (Q, 2^Ap, γ, q_0, Acc), where
— Q is a finite set of states,
— γ : Q × 2^Ap → Q is the transition function,
— q_0 ∈ Q is the initial state, and
— Acc ⊆ 2^Q × 2^Q is the acceptance condition.
A word w ∈ (2^Ap)^ω induces an infinite sequence A(w) = s_0 s_1 ··· ∈ Q^ω, such that s_0 = q_0 and γ(s_i, w[i]) = s_{i+1} for i ≥ 0. We write Inf(w) for the set of states that occur infinitely often in A(w). Word w is accepted if there exists a pair (E, F) ∈ Acc such that E ∩ Inf(w) = ∅ and F ∩ Inf(w) ≠ ∅. The language L(A) of A is the set of all words accepted by A. The following is a well-known result, see e.g. [Baier and Katoen 2008].
Lemma 4.2. For every LTL formula φ, a DRA A_φ can be effectively constructed such that L(A_φ) = L(φ).
The product of a MC and a DRA is defined in the following way.
Definition 4.3 (Product of a MC and DRA). The product of a Markov chain M = (S, P, μ) and a deterministic Rabin automaton A = (Q, 2^Ap, γ, q_0, Acc) is the Markov chain M ⊗ A = (S × Q, P', μ'), where
— P'((s, q), (s', q')) = P(s, s') if q' = γ(q, L(s')) and P'((s, q), (s', q')) = 0 otherwise,
— μ'(s, q) = μ(s) if γ(q_0, L(s)) = q and μ'(s, q) = 0 otherwise.
Note that M ⊗ A has the same smallest transition probability p_min as M.
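A minimal sketch of this product construction, assuming the chain and the DRA are given as Python dictionaries and functions (P[s] maps successors to probabilities, L labels states, gamma is the DRA transition function); the function names are ours.

def product_initial(mu, L, gamma, q0):
    """Initial distribution of M (x) A: mu'(s, q) = mu(s) when q = gamma(q0, L(s))."""
    return {(s, gamma(q0, L(s))): prob for s, prob in mu.items() if prob > 0}

def product_step(P, L, gamma, state):
    """Successor distribution of a product state (s, q): the DRA component is
    updated deterministically with the label of the successor chain state."""
    s, q = state
    return {(s_next, gamma(q, L(s_next))): prob
            for s_next, prob in P[s].items() if prob > 0}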
The crux of LTL probabilistic model checking relies on the fact that the probability of satisfying an LTL property φ in a Markov chain M equals the probability of reaching an accepting BSCC in the Markov chain M ⊗ A_φ. Formally, a BSCC C of M ⊗ A_φ is accepting if for some (E, F) ∈ Acc we have C ∩ (S × E) = ∅ and C ∩ (S × F) ≠ ∅. Let AccBSCC denote the union of all accepting BSCCs in M ⊗ A_φ. Then we obtain the following well-known fact [Baier and Katoen 2008]:
Lemma 4.4. For every labelled Markov chain M and LTL formula φ, we have P_M[φ] = P_{M⊗A_φ}[◊ AccBSCC].
Algorithm 3 (SinglePathLTL) therefore applies Algorithm 2 to the product M ⊗ A_φ: it simulates a single path π of the product until ReachedBSCC(π, p_min, δ) holds (so that, by Theorem 3.8, P[K(π) ∈ BSCC] ≥ 1 − δ), and then
return ∃(E, F) ∈ Acc : K(π) ∩ (S × E) = ∅ ∧ K(π) ∩ (S × F) ≠ ∅.
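The acceptance check performed in the return statement can be sketched as follows (ours); cand is the candidate set of product states and acc the list of Rabin pairs.

def candidate_accepting(cand, acc):
    """The candidate BSCC cand (a set of product states (s, q)) is accepting iff
    for some Rabin pair (E, F) no automaton component of cand lies in E and
    some automaton component lies in F."""
    automaton_states = {q for (_, q) in cand}
    return any(not (automaton_states & E) and (automaton_states & F)
               for (E, F) in acc)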
4.2. Hypothesis testing with bounded error
Since the input used is a Rabin automaton, the method applies to all ω-regular properties. Let X^δ_φ be a Bernoulli random variable such that X^δ_φ = 1 if and only if SinglePathLTL(A_φ, p_min, δ) = Yes. Since the BSCC must be reached and fully explored to classify it correctly, the error of the algorithm can now be two-sided.
THEOREM 4.5. For every δ > 0, P[φ] − δ ≤ E[X^δ_φ] ≤ P[φ] + δ.
Further, as in Section 3.2, we can reduce the hypothesis testing problem for
H_0 : P[φ] ≥ p + ε and H_1 : P[φ] ≤ p − ε,
for any δ < ε, to the following hypothesis testing problem on the observable X^δ_φ:
H'_0 : E[X^δ_φ] ≥ p + (ε − δ) and H'_1 : E[X^δ_φ] ≤ p − (ε − δ).
4.3. Mean payoff
We show that our method extends also to quantitative properties, such as mean payoff (also called long-run average reward). Let M = (S, P, μ) be a Markov chain and r : S → [0, 1] a reward function. Denoting by S_i the random variable returning the i-th state on a run, the aim is to compute
MP := lim_{n→∞} E[ (1/n) Σ_{i=0}^{n−1} r(S_i) ].
This limit exists (see, e.g., [Norris 1998]) and equals Σ_{C∈BSCC} P[◊C] · MP_C, where MP_C is the mean payoff of runs ending in C. Note that MP_C can be computed from r and the transition probabilities in C [Norris 1998]. We have already shown how our method estimates P[◊C]. Now we show how it extends to estimating transition probabilities in BSCCs and thus the mean payoff.
First, we focus on a single path π that has reached a BSCC C = K(π) and show how to estimate the transition probabilities P(s, s') for each s, s' ∈ C. Let X_{ss'} be the random variable denoting the event that NextState(s) = s'. Then X_{ss'} is a Bernoulli variable with parameter P(s, s'), so we use the obvious estimator P̂(s, s') = #_{ss'}(π)/#_s(π), where #_a(π) is the number of occurrences of a in π. If π is long enough so that #_s(π)
is large enough, the estimation is guaranteed to have the desired precision ξ with the desired confidence 1 − δ_{s,s'}. Indeed, using Hoeffding's inequality, we obtain
P[|P̂(s, s') − P(s, s')| ≥ ξ] ≤ 2e^{−2 #_s(π) ξ²} =: δ_{s,s'}.   (3)
Hence, we can extend the path π with candidate C until it is long enough that we have 1 − δ_C confidence that all the transition probabilities in C are in the ξ-neighbourhood of our estimates, by ensuring that Σ_{s,s'∈C} δ_{s,s'} ≤ δ_C. These estimated transition probabilities P̂ induce an estimated mean payoff M̂P_C. The following theorem relates the estimated and exact mean payoff.
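The two ingredients, the empirical estimates and the visit count needed to make the summed Hoeffding errors small, can be sketched as follows (ours). The uniform split of the error budget over the at most |C|² transitions matches the choice made in Algorithm 4 below.

import math
from collections import Counter

def estimate_transitions(suffix):
    """Empirical transition probabilities P^(s, s') = #ss'(pi) / #s(pi) over the
    path suffix that lies inside the candidate BSCC (the last state is used
    only as a target)."""
    pair_counts = Counter(zip(suffix, suffix[1:]))
    src_counts = Counter(suffix[:-1])
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

def required_visits(num_states, xi, delta_c):
    """Smallest #s(pi) making the union of the per-transition Hoeffding errors
    (Eq. 3) at most delta_c, with the budget split uniformly over num_states^2
    transitions."""
    per_transition = delta_c / num_states ** 2
    return math.ceil((math.log(2) - math.log(per_transition)) / (2 * xi ** 2))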
Theorem 4.6. Let C be a BSCC in a Markov chain M with rewards in the range [0, 1], MP_C be the mean payoff of C, and M̂P_C be the estimated mean payoff of C. Then
|M̂P_C − MP_C| ≤ ζ := (1 + ξ/p_min)^{2|C|} − 1.   (4)
PROOF. Consider a Markov chain C with a reward function r : S → [0, 1] such that C is a single BSCC. The discounted sum MD_λ for a state s of C is defined as
MD_λ(s) := lim_{n→∞} E[ (1 − λ) Σ_{i=0}^{n} λ^i r(S_i) ],
where λ ∈ (0, 1) is a discount factor. We say that a Markov chain Ĉ is ξ-close to C if
(1) Ĉ is over the same states as C,
(2) ∀ s, s' ∈ C : |P_C(s, s') − P_Ĉ(s, s')| ≤ ξ,
(3) ∀ s, s' ∈ C : P_C(s, s') > 0 if and only if P_Ĉ(s, s') > 0.
We write M̂D_λ for the discounted sum computed for Ĉ. By Theorem 4 of [Chatterjee 2012], for every discount factor 0 < λ < 1, every MC Ĉ that is ξ-close to C, and every state s:
|M̂D_λ(s) − MD_λ(s)| ≤ (1 + ξ/p_min)^{2|C|} − 1,   (5)
where p_min is the minimum transition probability in M. By [Solan 2003] we know that the discounted sum converges to the mean payoff:
lim_{λ→1} MD_λ(s) = MP_C    and    lim_{λ→1} M̂D_λ(s) = M̂P_C,
where MP_C and M̂P_C are the mean payoffs of C and Ĉ, respectively. We obtain the result by taking the limit λ → 1 in (5).
□
Note that, by Taylor's expansion, for small ξ we have ζ ≈ 2|C|ξ/p_min.
Algorithm 4 extends Algorithm 2 as follows. It divides the confidence parameter δ into δ_BSCC (used as in Algorithm 2 to detect the BSCC) and δ_C (the total confidence for the estimates of the transition probabilities). For simplicity, we set δ_BSCC = δ_C = δ/2. First, we compute the bound ξ required for ζ-precision (by Eq. 4). Subsequently, we compute the required strength k of the candidate guaranteeing δ_C-confidence on P̂ (from Eq. 3). The path is prolonged until the candidate is strong enough; in such a case M̂P_C is ζ-approximated with 1 − δ_C confidence. If the candidate of the path changes, all values are computed from scratch for the new candidate.
Algorithm 4 SinglePathMP
Input: reward function r, p_min, ζ, δ ∈ (0, 1)
Output: M̂P_C such that |M̂P_C − MP_C| ≤ ζ, where C is the BSCC of the generated run
π ← λ
repeat
    π ← π · NextState(π)
    if K(π) ≠ ⊥ then
        ξ ← p_min ((1 + ζ)^{1/(2|K(π)|)} − 1)              ▷ by Equation (4)
        k ← (ln(2|K(π)|²) − ln(δ/2)) / (2ξ²)                ▷ by Equation (3)
until ReachedBSCC(π, p_min, δ/2) and SCand_k(K(π), π)
return M̂P_{K(π)} computed from P̂ and r
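One way to compute M̂P_{K(π)} from P̂ and r is to solve for the stationary distribution of the estimated matrix restricted to the candidate and average the rewards; the sketch below (ours, not the tool's implementation) assumes the renormalised restriction is irreducible.

import numpy as np

def mean_payoff_estimate(states, P_hat, r):
    """Estimate MP of the candidate BSCC: stationary distribution of P^ times r."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    for (s, t), p in P_hat.items():
        if s in idx and t in idx:
            P[idx[s], idx[t]] = p
    P = P / P.sum(axis=1, keepdims=True)            # renormalise rows
    # stationary distribution: pi P = pi, sum(pi) = 1
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(sum(pi[idx[s]] * r(s) for s in states))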
THEOREM 4.7. For every δ > 0, Algorithm 4 terminates correctly with probability at least 1 − δ.
PROOF. From Eq. 3, by the union bound, the probability that some estimate P̂(s, s') lies outside the ξ-neighbourhood of P(s, s') does not exceed the sum of the respective estimation errors, i.e. δ_C = Σ_{s,s'∈C} δ_{s,s'}. Next, from Eq. 4 and from the fact that C is subject to Theorem 3.8 with confidence δ_BSCC,
P(|M̂P_C(r) − MP_C(r)| > ζ)
  = P(C ∈ BSCC) · P(|M̂P(r) − MP(r)| > ζ | C ∈ BSCC) + P(C ∉ BSCC) · P(|M̂P(r) − MP(r)| > ζ | C ∉ BSCC)
  ≤ 1 · δ_C + δ_BSCC · 1 = δ_C + δ_BSCC ≤ δ.
□
4.4. Hypothesis testing with bounded error
Let the random variable X^{ζ,δ}_MP denote the value SinglePathMP(r, p_min, ζ, δ). The following theorem establishes the relation between the mean payoff MP and the expected value of X^{ζ,δ}_MP.
THEOREM 4.8. For every δ, ζ > 0, MP − ζ − δ ≤ E[X^{ζ,δ}_MP] ≤ MP + ζ + δ.
PROOF. Let us write X^{ζ,δ}_MP as an expression of random variables Y, W, Z:
X^{ζ,δ}_MP = Y(1 − W) + WZ,
where 1) W is a Bernoulli random variable such that W = 0 iff the algorithm correctly detected the BSCC and estimated the transition probabilities within bounds, 2) Y is the value computed by the algorithm if W = 0, and the real mean payoff MP when W = 1, and 3) Z is an arbitrary random variable with range [0, 1]. The interpretation is as follows: when W = 0 we observe the result Y, which has bounded error ζ, and when W = 1 we observe an arbitrary Z. We note that Y, W, Z are not necessarily independent. By Theorem 4.7, E[W] ≤ δ, and by linearity of expectation, E[X^{ζ,δ}_MP] = E[Y] − E[YW] + E[WZ]. For the upper bound, observe that E[Y] ≤ MP + ζ, E[YW] is non-negative, and E[WZ] ≤ δ. As for the lower bound, note that E[Y] ≥ MP − ζ, E[YW] ≤ δ, and E[WZ] is non-negative.
□
As a consequence of Theorem 4.8, if we establish that with confidence 1 − α the value X^{ζ,δ}_MP belongs to the interval [a, b], then we can conclude with confidence 1 − α that MP belongs to the interval [a − ζ − δ, b + ζ + δ]. Standard statistical methods can be applied to find the confidence bound for X^{ζ,δ}_MP [Bickel and Doksum 2000].
5. EXPERIMENTAL EVALUATION
We implemented our algorithms in the probabilistic model checker PRISM [Kwiatkowska et al. 2011], and evaluated them on the DTMC examples from the PRISM benchmark suite [Kwiatkowska et al. 2012]. The benchmarks model communication and security protocols, distributed algorithms, and fault-tolerant systems. To demonstrate how our method performs depending on the topology of Markov chains, we also performed experiments on the generic DTMCs shown in Figure 3 and Figure 4, as well as on two CTMCs from the literature that have large BSCCs: "tandem" [Hermanns et al. 1999] and "gridworld" [Younes et al. 2006].
All benchmarks are parametrised by one or more values, which influence their size and complexity, e.g. the number of modelled components. We have made minor modifications to the benchmarks that could not be handled directly by the SMC component of PRISM, by adding self-loops to deadlock states and fixing one initial state instead of multiple.
Our tool can be downloaded at [Daca 2016]. Experiments were done on a Linux 64-bit machine running an AMD Opteron 6134 CPU with a time limit of 15 minutes and a memory limit of 5GB. To increase performance of our tool, we check whether a candidate has been found every 1000 steps; this optimization does not violate correctness of our analysis. See the appendix for a discussion on this bound.
Fig. 4. A Markov chain with two transient parts consisting of N strongly connected singletons, leading to BSCCs with the ring topology of M states.
Reachability. The experimental results for unbounded reachability are shown in Table II. The PRISM benchmarks were checked against their standard properties, when available. We directly compare our method to another topology-agnostic method, that of [Younes et al. 2010] (SimTermination), where at every step the sampled path is terminated with probability p_term. The approach of [Brázdil et al. 2014] with a priori bounds is not included, since it times out even on the smallest benchmarks. In addition, we performed experiments with two methods that are topology-aware: sampling with reachability analysis of [Younes et al. 2010] (SimAnalysis) and the numerical model-checking algorithm of PRISM (MC). The appendix contains a detailed experimental evaluation of these methods.
The table shows the size of every example, its minimum probability, the number of BSCCs, and the size of the largest BSCC. Column "time" reports the total wall time for the respective algorithm, and "analysis" shows the time for the symbolic reachability analysis in the SimAnalysis method. Highlights show the best result among the topology-agnostic methods. All statistical methods were used with the SPRT test for choosing between the hypotheses, and their results were averaged over five runs.
Table II. Experimental results for unbounded reachability.
Example | BSCC | SimAdaptive | SimTermination [Younes et al. 2010] | SimAnalysis [Younes et al. 2010] | MC
name, size, p_min | no., max. size | time | time | time, analysis | time
bluetooth(4) 149K 7.8·10⁻³ 3K, 1 2.6s 16.4s 83.2s 80.4s 78.2s
bluetooth(7) 569K 7.8·10⁻³ 5.8K, 1 3.8s 50.2s 284.4s 281.1s 261.2s
bluetooth(10) >569K 7.8·10⁻³ >5.8K, 1 5.0s 109.2s TO - TO
brp(500,500) 4.5M 0.01 1.5K, 1 7.6s 13.8s 35.6s 30.7s 103.0s
brp(2K,2K) 40M 0.01 4.5K, 1 20.4s 17.2s 824.4s 789.9s TO
brp(10K,10K) >40M 0.01 >4.5K, 1 89.2s 15.8s TO - TO
crowds(6,15) 7.3M 0.066 >3K, 1 3.6s 253.2s 2.0s 0.7s 19.4s
crowds(7,20) 17M 0.05 >3K, 1 4.0s 283.8s 2.6s 1.1s 347.8s
crowds(8,20) 68M 0.05 >3K, 1 5.6s 340.0s 4.0s 1.9s TO
eql(15,10) 616G 0.5 1,1 16.2s TO 151.8s 145.1s 110.4s
eql(20,15) 1279T 0.5 1,1 28.8s TO 762.6s 745.4s 606.6s
eql(20,20) 1719T 0.5 1, 1 31.4s TO TO - TO
herman(17) 129M 7.6·10⁻⁶ 1, 34 23.0s 33.6s 21.6s 0.1s 1.2s
herman(19) 1162M 1.9·10⁻⁶ 1, 38 96.8s 134.0s 86.2s 0.1s 1.2s
herman(21) 10G 4.7·10⁻⁷ 1, 42 570.0s TO 505.2s 0.1s 1.4s
leader(6,6) 280K 2.1·10⁻⁵ 1, 1 5.0s 5.4s 536.6s 530.3s 491.4s
leader(6,8) >280K 3.8·10⁻⁶ 1, 1 23.0s 26.0s MO - MO
leader(6,11) >280K 5.6·10⁻⁷ 1, 1 153.0s 174.8s MO - MO
nand(50,3) 11M 0.02 51, 1 7.0s 231.2s 36.2s 31.0s 272.0s
nand(60,4) 29M 0.02 61, 1 6.0s 275.2s 60.2s 56.3s TO
nand(70,5) 67M 0.02 71, 1 6.8s 370.2s 148.2s 144.2s TO
tandem(500) >1.7M 2.4·10⁻⁴ 1, >501K 2.4s 6.4s 4.6s 3.0s 3.4s
tandem(1K) 1.7M 9.9·10⁻⁵ 1, 501K 2.6s 19.2s 17.0s 12.7s 13.0s
tandem(2K) >1.7M 4.9·10⁻⁵ 1, >501K 3.4s 72.4s 62.4s 59.8s 59.4s
gridworld(300) 162M 1·10⁻³ 598, 89K 8.2s 81.6s MO - MO
gridworld(400) 384M 1·10⁻³ 798, 160K 8.4s 100.6s MO - MO
gridworld(500) 750M 1·10⁻³ 998, 250K 5.8s 109.4s MO - MO
Fig.3(16) 37 0.5 1, 1 58.6s TO 23.4s 0.4s 2.0s
Fig.3(18) 39 0.5 1, 1 TO TO 74.8s 1.8s 2.0s
Fig.3(20) 41 0.5 1, 1 TO TO 513.6s 11.3s 2.0s
Fig.4(1K,5) 4022 0.5 2, 5 7.8s 218.2s 3.2s 0.5s 1.2s
Fig.4(1K,50) 4202 0.5 2, 50 12.4s 211.8s 3.6s 0.7s 1.0s
Fig.4(1K,500) 6002 0.5 2, 500 431.0s 218.6s 3.6s 1.0s 1.2s
Fig.4(10K,5) 40K 0.5 2, 5 52.2s TO 42.2s 25.4s 25.6s
Fig.4(100K,5) 400K 0.5 2, 5 604.2s 5.4s TO - TO
Note. Simulation parameters: α = β = ε = 0.01, δ = 0.001, p_term = 0.0001. TO means time-out, and MO means memory-out. Our approach is denoted by SimAdaptive here. Highlights show the best result among the topology-agnostic methods.
Finding the optimal termination probability p_term for the SimTermination method is a non-trivial task. If the probability is too high, the method might never reach the target states and thus give an incorrect result; if the value is too low, it might sample unnecessarily long traces that never reach the target. For instance, to ensure a correct answer on the Markov chain in Figure 3, p_term has to decrease exponentially with the number of states. By experimenting we found that the probability p_term = 0.0001 is low enough to ensure correct results. See the appendix for experiments with other values of p_term.
On most examples our method scales better than the SimTermination method. Our method performs well even on examples with large BSCCs, such as "tandem" and "gridworld," due to early termination when a goal state is reached. For instance, on the "gridworld" example, most BSCCs do not contain a goal state, thus have to be fully explored, however the probability of reaching such BSCC is low, and as a consequence full BSCC exploration rarely occurs. The SimTermination method performs well when the target states are unreachable or can be reached by short paths. When long paths are necessary to reach the target, the probability that an individual path reaches the target is small, hence many samples are necessary to estimate the real probability with high confidence.
Moreover, it turns out that our method compares well even with methods that have access to the topology of the system. In many cases, the running time of the numerical algorithm MC increases dramatically with the size of the system, while remaining almost constant in our method. The bottleneck of the SimAnalysis algorithm is the reachability analysis of states that cannot reach the target, which in practice can be as difficult as numerical model checking.
LTL and mean payoff. In the second experiment, we compared our algorithm for checking LTL properties and estimating the mean payoff with the numerical methods of PRISM; the results are shown in Tables III and IV. We compare against PRISM, since we are not aware of any SMC-based or topology-agnostic approach for mean payoff or full LTL. For mean payoff, we computed a 95%-confidence bound of size 0.22 with parameters δ = 0.011, ζ = 0.08, and for LTL we used the same parameters as for reachability. We report results only on a single model of each type, where either method did not time out. In general, our method scales better when BSCCs are fairly small and are discovered quickly.
Table III. Experimental results for LTL properties.
Example LTL
name property SimAdaptive time MC time
bluetooth(10) □ O 8.0s TO
brp(10K,10K) On 90.0s TO
crowds(8,20) On 9.0s TO
eql(20,20) □O 7.0s MO
herman(21) □O TO 2.0s
leader(6,5) □O 277.0s 117.0s
nand(70,5) □O TO
tandem(2K) □O TO 221.0s
gridworld(100) □O -> On TO 110.4s
Fig.3(20) □O -> no TO
Fig.4(100K,5) □O 348.0s TO
Fig.4(1K,500) □O 827.0s 2.0s
Note. Simulation parameters for LTL: α = β = ε = 0.01, δ = 0.001.
Table IV. Experimental results for mean-payoff properties.
Example Mean payoff
name SimAdaptive time MC time
bluetooth(10) 3.0s TO
brp(10K,10K) 6.6s TO
crowds(8,20) 2.0s TO
eql(20,20) 2.6s TO
herman(21) MO
leader(6,6) 48.5 576.0
nand(70,5) 294.0s
tandem(500) TO 191.0s
gridworld(50) TO 58.1s
Fig.3(20) TO
Fig.4(100K,5) TO
Fig.4(lK,500) TO
Note. For mean payoff we computed a 95%-confidence interval of size 0.22 with δ = 0.011, ζ = 0.08.
6. DISCUSSION
As demonstrated by the experimental results, our method is fast on systems that are (1) shallow, and (2) with small BSCCs. In such systems, the BSCC is reached quickly and the candidate is built up quickly. Further, recall that the BSCC is reported when a k-candidate is found, and that k is increased with each candidate along the path. Hence, when there are many strongly connected sets, and thus many candidates, the BSCC is detected by a k-candidate for a large k. However, since k grows linearly in the number of candidates, the most important and limiting factor is the size of the BSCCs.
We state the dependency on the depth of the system and the BSCC sizes formally. Fix the confidence δ, and let sim and k_i denote, respectively, the a priori upper bound on the number of simulations necessary for the SPRT (cf. Section 2.2) and the strength of candidates as in Algorithm 2.
Theorem 6.1. Let R denote the expected number of steps before reaching a BSCC and B the maximum size of a BSCC. Further, let
T := max_{C ∈ BSCC; s,s' ∈ C} E[time to reach s' from s].
In particular, T ∈ O(B / p_min^B). Then the expected running time of Algorithms 2 and 3 is at most
O(sim · k_{R+B} · B · T).
PROOF. We show that the expected running time of each simulation is at most k_{R+B} · B · T. Since the expected number of states visited is bounded by R + B, the expected number of candidates on a run is less than 2(R + B) − 1. Since k_i grows linearly in i, it is sufficient to prove that the expected time to visit each state of a BSCC once (when starting in the BSCC) is at most B · T. We order the states of the BSCC as s_1, ..., s_b; then the time is at most Σ_{i=1}^{b} T ≤ B · T, where b ≤ B. This yields the result since R ∈ O(k_{R+B} · B · T).
It remains to prove that T ≤ B / p_min^B. Let s be a state of a BSCC of size at most B. Then, for any state s' from the same BSCC, the shortest path from s to s' has length at most B and probability at least p_min^B. Consequently, starting at s, with probability at most 1 − p_min^B we have not reached s' after B steps, and we are instead in some state
s" 7^ s', from which, again, the probability to reach s' within B steps at least p^in. Hence, the expected time to reach s' from s is at most
oo
^b.^i-pI)-1?,!,
i=l
where i indicates the number of times a sequence of B steps is observed. The series can be summed by differentiating a geometric series. As a result, we obtain a bound
B/pB.
□
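For completeness, the series can be summed explicitly; the following is the standard computation behind the bound stated above.

\[
\sum_{i \ge 1} B \cdot i \,(1 - p_{\min}^{B})^{i-1}\, p_{\min}^{B}
  \;=\; B\, p_{\min}^{B} \cdot \frac{1}{\bigl(1-(1-p_{\min}^{B})\bigr)^{2}}
  \;=\; \frac{B}{p_{\min}^{B}},
\]
using \(\sum_{i \ge 1} i x^{i-1} = \tfrac{1}{(1-x)^{2}}\) for \(|x| < 1\), obtained by differentiating the geometric series.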
Systems that have large and deep BSCCs require a longer time to reach the required level of confidence. However, such systems are often difficult to handle also for other methods agnostic of the topology. For instance, the correctness of [Younes et al. 2010] on the example in Figure 3 relies on the termination probability p_term being at most 1 − λ, which is less than 2^{−n} here. Larger values lead to incorrect results and smaller values to paths of exponential length. Nevertheless, our procedure usually runs faster than the bound suggests; for a detailed discussion see the appendix.
7. CONCLUSION
To the best of our knowledge, we propose the first statistical model-checking method that exploits the information provided by each simulation run on the fly, in order to detect quickly a potential BSCC, and verify LTL properties with the desired confidence. This is also the first application of SMC to quantitative properties such as mean payoff. We note that for our method to work correctly, the precise value of pmin is not necessary, but a lower bound is sufficient. This lower bound can come from domain knowledge, or can be inferred directly from description of white-box systems, such as the PRISM benchmark.
The approach we present is not meant to replace the other methods, but rather to be an addition to the repertoire of available approaches. Our method is particularly valuable for models that have small BSCCs and huge state space, such as many of the PRISM benchmarks.
In future work, we plan to investigate the applicability of our method to Markov decision processes, and to deciding language equivalence between two Markov chains. The idea of guessing BSCCs by simulation has already been re-used to estimate distances between Markov chains [Daca et al. 2016b].
APPENDIX
A. DETAILED EXPERIMENTS
Table V shows detailed experimental results for unbounded reachability. Compared to Table II, we include: 1) experiments for the SimTermination method with two other values of p_term, 2) the number of sampled paths, reported as "samples," and 3) the average length of sampled paths, reported as "path length." Topology-agnostic methods, such as SimAdaptive and SimTermination, cannot be compared directly with topology-aware methods, such as SimAnalysis and MC; however, for the reader's curiosity, we highlight in the table the best results among all methods.
We observed that on the "herman" example the SMC algorithms run unusually slowly. This problem seems to be caused by a bug in the original sampling engine of PRISM, and it appears that all SMC algorithms suffer equally from it.
Table V. Detailed experimental results for unbounded reachability.
Example | SimAdaptive | SimTermination, p_term = 10⁻³ | SimTermination, p_term = 10⁻⁴ | SimTermination, p_term = 10⁻⁵ | SimAnalysis | MC
name | time, samples, path length | time, samples, path length | time, samples, path length | time, samples, path length | time, samples, path length, analysis | time
bluetooth(4) 2.6s 243 499 185.0s 43K 387 16.4s 3389 484 6.4s 463 495 83.2s 219 502 80.4s 78.2s
bluetooth(7) 3.8s 243 946 697.4s 106K 604 50.2s 6480 897 10.2s 792 931 284.4s 219 937 281.1s 261.2s
bluetooth(10) 5.0s 243 1391 TO 109.2s 9827 1292 15.0s 932 1380 TO TO
brp(500,500) 7.6s 230 3999 3.2s 258 963 13.8s 258 9758 107.2s 258 104K 35.6s 207 3045 30.7s 103.0s
brp(2K,2K) 20.4s 230 13K 3.4s 258 1029 17.2s 258 9127 115.0s 258 98K 824.4s 207 12K 789.9s TO
brp(10K,10K) 89.2s 230 62K 3.6s 258 960 15.8s 258 10K 109.4s 258 96K TO TO
crowds(6,15) 3.6s 395 879 29.2s 7947 878 253.2s 7477 8735 TO 2.0s 400 85 0.7s 19.4s
crowds(7,20) 4.0s 485 859 32.6s 9378 850 283.8s 8993 8464 TO 2.6s 473 98 1.1s 347.8s
crowds(8,20) 5.6s 830 824 38.2s 11K 821 340.0s 10K 8132 TO 4.0s 793 110 1.9s TO
eql(15,10) 16.2s 1149 652 303.2s 28K 628 TO TO 151.8s 1100 201 145.1s 110.4s
eql(20,15) 28.8s 1090 1299 612.8s 44K 723 TO TO 762.6s 999 347 745.4s 606.6s
eql(20,20) 31.4s 1071 1401 TO 11K 156 TO TO TO TO
herman(17) 23.0s 243 30 257.6s 2101 30 33.6s 381 32 29.0s 279 31 21.6s 219 30 0.1s 1.2s
herman(19) 96.8s 243 40 TO 134.0s 355 38 254.4s 279 40 86.2s 219 38 0.1s 1.2s
herman(21) 570.0s 243 46 MO TO MO 505.2s 219 48 0.1s 1.4s
leader(6,6) 5.0s 243 7 7.6s 437 7 5.4s 258 7 258 7 536.6s 219 7 530.3s 491.4s
leader(6,8) 23.0s 243 7 62.4s 560 7 26.0s 279 7 26.2s 258 7 MO MO
leader(6,11) 153.0s 243 7 TO 174.8s 279 7 776.8s 258 7 MO MO
nand(50,3) 7.0s 899 1627 570.6s 140K 846 231.2s 21K 4632 TO 36.2s 1002 1400 31.0s 272.0s
nand(60,4) 6.0s 522 2431 TO 275.2s 25K 4494 TO 60.2s 458 2160 56.3s TO
nand(70,5) 6.8s 391 3343 TO 370.2s 30K 4643 TO 148.2s 308 3080 144.2s TO
tandem(500) 2.4s 243 501 59.6s 43K 394 6.4s 3318 489 2.0s 412 500 4.6s 219 501 3.0s 3.4s
tandem(1K) 2.6s 243 1001 328.4s 114K 632 19.2s 6932 954 3.4s 858 995 17.0s 219 1001 12.7s 13.0s
tandem(2K) 3.4s 243 2001 TO 72.4s 14K 1811 6.6s 1093 1985 62.4s 219 2001 59.8s 59.4s
gridworld(300) 8.2s 1187 453 214.4s 46K 349 81.6s 18K 437 77.4s 16K 449 MO MO
gridworld(400) 8.4s 1047 543 274.8s 53K 399 100.6s 18K 531 93.0s 16K 548 MO MO
gridworld(500) 5.8s 480 637 277.4s 57K 431 109.4s 18K 605 104.4s 15K 627 MO MO
Fig.3(16) 58.6s 128 140K TO TO TO 23.4s 115 141K 0.4s 2.0s
Fig.3(18) TO 2.8s 258 1015 TO TO 74.8s 115 537K 1.8s 2.0s
Fig.3(20) TO WRONG TO TO 513.6s 119 2M 11.3s 2.0s
Fig.4(1K,5) 7.8s 1109 2489 TO 218.2s 23K 5916 TO 3.2s 896 1027 0.5s 1.2s
Fig.4(1K,50) 12.4s 1115 4306 TO 211.8s 23K 5880 TO 3.6s 881 1037 0.7s 1.0s
Fig.4(1K,500) 431.0s 1002 177K TO 218.6s 23K 5903 TO 3.6s 968 1042 1.0s 1.2s
Fig.4(10K,5) 52.2s 1161 20K 2.6s 258 1072 TO TO 42.2s 1057 10K 25.4s 25.6s
Fig.4(100K,5) 604.2s 1331 200K 2.6s 258 981 5.4s 258 9939 TO TO TO
Note. Simulation parameters: α = β = ε = 0.01, δ = 0.001. TO means a time-out or memory-out, and WRONG means that the reported result was incorrect, due to p_term being too large. Our approach is denoted by SimAdaptive here. Highlights show the best result among all methods.
B. IMPLEMENTATION DETAILS
In our algorithms we frequently check whether the simulated path contains a candidate with the required strength. To reduce the time needed for this operation we use two optimisations: 1) we record the strongly connected sets (SCs) visited on the path, and 2) we check whether a candidate has been found only every C_b ≥ 1 steps. Our data structure records the sequence of SCs that have been encountered on the simulated path. The candidate of the path is then the last SC in the sequence. We also record the number of times each state in the candidate has been encountered. By using this data structure we avoid traversing the entire path every time we check whether a strong k-candidate has been reached.
To further reduce the overhead, we update our data structure every C_b steps (in our experiments C_b = 1000). Figures 5 and 6 show the impact of C_b on the running time for two Markov chains. The optimal value of C_b varies among examples, but experience shows that C_b ≈ 1000 is a reasonable choice.
Fig. 5. Total running time and time for processing candidates for the Markov chain in Figure 3, depending on the check bound C_b.
C. THEORETICAL VS. EMPIRICAL RUNNING TIME
In this section, we compare the theoretical upper bound on the running time given in Theorem 6.1 to empirical data. We omit the number of simulation runs (the term sim in the theorem), and report only the logarithm of the average simulation length. Figures 7, 8 and 9 present the comparison for different topologies of Markov chains. In Figure 7 we present the comparison for the worst-case Markov chain, which requires the longest paths to discover the BSCC as a k-candidate. This Markov chain is like the one in Figure 3, but where the last state has a single outgoing transition to the initial state. Figure 8 suggests that the theoretical bound can be a good predictor of running time with respect to the depth of the system; however, Figure 9 shows that it is very conservative with respect to the size of BSCCs.
Fig. 6. Total running time and time for processing candidates for the "eql(20,20)" benchmark, depending on the check bound C_b.
Fig. 7. Average length of simulations for a Markov chain like in Figure 3, but where the last state has a single outgoing transition to the initial state.
Fig. 8. Average length of simulations for the MC in Figure 4, where M = 5 and N varies.
Fig. 9. Average length of simulations for the MC in Figure 4, where N = 1000 and M varies.
REFERENCES
Christel Baier and Joost-Pieter Katoen. 2008. Principles of model checking. MIT Press.
P. J. Bickel and K. A. Doksum. 2000. Mathematical Statistics: Basic Ideas and Selected Topics. Vol. 1. Prentice Hall. http://books.google.at/books?id=8poZAQAAIAAJ
Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelík, Vojtěch Forejt, Jan Křetínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. 2014. Verification of Markov Decision Processes Using Learning Algorithms. In ATVA. 98-114.
Peter E. Bulychev, Alexandre David, Kim Guldstrand Larsen, Marius Mikucionis, Danny Bøgsted Poulsen, Axel Legay, and Zheng Wang. 2012. UPPAAL-SMC: Statistical Model Checking for Priced Timed Automata. In QAPL. 1-16.
Krishnendu Chatterjee. 2012. Robustness of structurally equivalent concurrent parity games. In FoSSaCS. Springer, 270-285.
Przemyslaw Daca. 2016. Tool for the paper, http://pub.ist.ac.at/~przemek/pa_tool.html. (2016).
Przemyslaw Daca, Thomas A. Henzinger, Jan Křetínský, and Tatjana Petrov. 2016a. Faster Statistical Model Checking for Unbounded Temporal Properties. In TACAS. 112-129.
Przemyslaw Daca, Thomas A. Henzinger, Jan Křetínský, and Tatjana Petrov. 2016b. Linear Distances between Markov Chains. In 27th International Conference on Concurrency Theory, CONCUR 2016, August 23-26, 2016, Quebec City, Canada (LIPIcs), Josée Desharnais and Radha Jagadeesan (Eds.), Vol. 59. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 20:1-20:15. DOI: http://dx.doi.org/10.4230/LIPIcs.CONCUR.2016.20
Alexandre David, Kim G. Larsen, Axel Legay, Marius Mikucionis, and Danny Bøgsted Poulsen. 2015. Uppaal SMC tutorial. STTT 17, 4 (2015), 397-415.
Radu Grosu and Scott A. Smolka. 2005. Monte Carlo Model Checking. In TACAS. 271-286.
Ru He, Paul Jennings, Samik Basu, Arka P. Ghosh, and Huaiqing Wu. 2010. A bounded statistical approach for model checking of unbounded until properties. In ASE. 225-234.
Thomas Hérault, Richard Lassaigne, Frederic Magniette, and Sylvain Peyronnet. 2004. Approximate Probabilistic Model Checking. In VMCAI. 73-84.
Holger Hermanns, Joachim Meyer-Kayser, and Markus Siegle. 1999. Multi terminal binary decision diagrams to represent and analyse continuous time Markov chains. In 3rd Int. Workshop on the Numerical Solution of Markov Chains. Citeseer, 188-207.
Cyrille Jégourel, Axel Legay, and Sean Sedwards. 2012. A Platform for High Performance Statistical Model Checking - PLASMA. In TACAS. 498-503.
Sumit Kumar Jha, Edmund M. Clarke, Christopher James Langmead, Axel Legay, André Platzer, and Paolo Zuliani. 2009. A Bayesian Approach to Model Checking Biological Systems. In CMSB. 218-234.
Marta Z. Kwiatkowska, Gethin Norman, and David Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-Time Systems. In CAV. 585-591.
Marta Z. Kwiatkowska, Gethin Norman, and David Parker. 2012. The PRISM Benchmark Suite. In QEST. 203-204.
Richard Lassaigne and Sylvain Peyronnet. 2008. Probabilistic verification and approximation. Ann. Pure Appl. Logic 152, 1-3 (2008), 122-131.
James R. Norris. 1998. Markov Chains. Cambridge University Press.
Johan Oudinet, Alain Denise, Marie-Claude Gaudel, Richard Lassaigne, and Sylvain Peyronnet. 2011. Uniform Monte-Carlo Model Checking. In FASE. 127-140.
Amir Pnueli. 1977. The temporal logic of programs. In FOCS. 46-57.
Diana El Rabih and Nihal Pekergin. 2009. Statistical Model Checking Using Perfect Simulation. In ATVA. 120-134.
Koushik Sen, Mahesh Viswanathan, and Gul Agha. 2004. Statistical Model Checking of Black-Box Probabilistic Systems. In CAV. 202-215.
Koushik Sen, Mahesh Viswanathan, and Gul Agha. 2005. On Statistical Model Checking of Stochastic Systems. In CAV. 266-280.
Eilon Solan. 2003. Continuity of the value of competitive Markov decision processes. Journal of Theoretical Probability 16, 4 (2003), 831-845.
Abraham Wald. 1945. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16, 2 (1945), 117-186.
Håkan L. S. Younes. 2004. Planning and verification for stochastic processes with asynchronous events. In AAAI. 1001-1002.
Håkan L. S. Younes, Edmund M. Clarke, and Paolo Zuliani. 2010. Statistical Verification of Probabilistic Properties with Unbounded Until. In SBMF. 144-160.
Håkan L. S. Younes, Marta Z. Kwiatkowska, Gethin Norman, and David Parker. 2006. Numerical vs. statistical probabilistic model checking. STTT 8, 3 (2006), 216-228.
Håkan L. S. Younes and Reid G. Simmons. 2002. Probabilistic Verification of Discrete Event Systems using Acceptance Sampling. In CAV. Springer, 223-235.
Håkan L. S. Younes and Reid G. Simmons. 2006. Statistical probabilistic model checking with a focus on time-bounded properties. Inf. Comput. 204, 9 (2006), 1368-1409.
From LTL and Limit-Deterministic Büchi Automata to Deterministic Parity Automata *
Javier Esparza¹, Jan Křetínský¹, Jean-François Raskin², and Salomon Sickert¹
¹ Technische Universität München {esparza, jan.kretinsky, sickert}@in.tum.de   ² Université libre de Bruxelles jraskin@ulb.ac.be
Abstract. Controller synthesis for general linear temporal logic (LTL) objectives is a challenging task. The standard approach involves translating the LTL objective into a deterministic parity automaton (DPA) by means of the Safra-Piterman construction. One of the challenges is the size of the DPA, which often grows very fast in practice, and can reach double exponential size in the length of the LTL formula. In this paper we describe a single exponential translation from limit-deterministic Büchi automata (LDBA) to DPA, and show that it can be concatenated with a recent efficient translation from LTL to LDBA to yield a double exponential, "Safraless" LTL-to-DPA construction. We also report on an implementation, a comparison with the SPOT library, and performance on several sets of formulas, including instances from the 2016 SyntComp competition.
1 Introduction
Limit-deterministic Büchi automata (LDBA, also known as semi-deterministic Büchi automata) were introduced by Courcoubetis and Yannakakis (based on previous work by Vardi) to solve the qualitative probabilistic model-checking problem: decide whether the executions of a Markov chain or Markov decision process satisfy a given LTL formula with probability 1 [Var85,VW86,CY95]. The problem faced by these authors was that fully nondeterministic Büchi automata (NBAs), which are as expressive as LTL, and more, cannot be used for probabilistic model checking, and deterministic Büchi automata (DBA) are less expressive than LTL. The solution was to introduce LDBAs as a model in-between: as expressive as NBAs, but deterministic enough.
After these papers, LDBAs received little attention. The alternative path of translating the LTL formula into an equivalent fully deterministic Rabin automaton using Safra's construction [Saf88] was considered a better option, mostly because it also solves the quantitative probabilistic model-checking problem (computing the probability of the executions that satisfy a formula). However, recent papers have shown that LDBAs were unjustly forgotten. Blahoudek et al. have
* This work is partially funded by the DFG Research Training Group "PUMA: Programm- und Modell-Analyse" (GRK 1480), DFG project "Verified Model Checkers", the ERC Starting Grant (279499: in VEST), and the Czech Science Foundation, grant No. P202/12/G061.
shown that LDBAs are easy to complement [BHS+16]. Kini and Viswanathan have given a single exponential translation of LTL\GU to LDBA [KV15]. Finally, Sickert et al. describe in [SEJK16] a double exponential translation for full LTL that can also be applied to the quantitative case, and behaves better than Safra's construction in practice.
In this paper we add to this trend by showing that LDBAs are also attractive for synthesis. The standard solution to the synthesis problem with LTL objectives consists of translating the LTL formula into a deterministic parity automaton (DPA) with the help of the Safra-Piterman construction [Pit07]. While limit-determinism is not "deterministic enough" for the synthesis problem, we introduce a conceptually simple and worst-case optimal translation LDBA→DPA. Our translation bears some similarities with that of [Fin15] where, however, a Muller acceptance condition is used. This condition can also be phrased as a Rabin condition, but not as a parity condition. Moreover, the way of tracking all possible states and finite runs differs.
Together with the translation LTL→LDBA of [SEJK16], our construction provides a "Safraless" procedure to obtain a DPA from an LTL formula. However, the direct concatenation of the two constructions does not yield an algorithm of optimal complexity: the LTL→LDBA translation is double exponential (and there is a double-exponential lower bound), and so for the LTL→DPA translation we only obtain a triple exponential bound. In the second part of the paper we solve this problem. We show that the LDBAs derived from LTL formulas satisfy a special property, and prove that for such automata the concatenation of the two constructions remains double exponential. To the best of our knowledge, this is the first double exponential "Safraless" LTL→DPA procedure. (Another asymptotically optimal "Safraless" procedure for determinization of Büchi automata with Rabin automata as target has been presented in [FKVW15].)
In the third and final part, we report on the performance of an implementation of our LTL—>LDBA—>DPA construction, and compare it with algorithms implemented in the SPOT library [DLLF+16]. Note that it is not possible to force SPOT to always produce DPA, sometimes it produces a deterministic generalized Biichi automaton (DGBA). The reason is that DGBA are often smaller than DPA (if they exist) and game-solving algorithms for DGBA are not less efficient than for DPA. Therefore, also our implementation may produce DGBA in some cases. We show that our implementation outperforms SPOT for several sets of parametric formulas and formulas used in synthesis examples taken from the SyntComp 2016 competition, and remains competitive for randomly generated formulas.
Structure of the paper Section 2 introduces the necessary preliminaries about automata. Section 3 defines the translation LDBA→DPA. Section 4 shows how to compose LTL→LDBA and LDBA→DPA in such a way that the resulting DPA is at most doubly exponential in the size of the LTL formula. Section 5 reports on the experimental evaluation of this worst-case optimal translation, and Section 6 contains our conclusions. Several proofs and more details on the implementation can be found in [EKRS17].
Fig. 1: An LDBA for the LTL language FGa ∨ FGb. The behavior of A is deterministic within the subset of states Q_d = {2, 3, 4}, which is a trap; the accepting transitions are depicted in bold face and they are defined only between states of Q_d.
2 Preliminaries
Büchi automata A (nondeterministic) ω-word automaton A with Büchi acceptance condition (NBA) is a tuple (Q, q_0, Σ, δ, α) where Q is a finite set of states, q_0 ∈ Q is the initial state, Σ is a finite alphabet, δ ⊆ Q × Σ × Q is the transition relation, and α ⊆ δ is the set of accepting transitions³. W.l.o.g. we assume that δ is total in the following sense: for all q ∈ Q and all a ∈ Σ, there exists q' ∈ Q such that (q, a, q') ∈ δ. A is deterministic if for all q ∈ Q and all a ∈ Σ, there exists a unique q' ∈ Q such that (q, a, q') ∈ δ. When δ is deterministic and total, it can equivalently be seen as a function δ : Q × Σ → Q. Given S ⊆ Q and a ∈ Σ, let post_δ^a(S) = {q' | ∃q ∈ S · (q, a, q') ∈ δ}.
A run of A on an ω-word w : ℕ → Σ is an ω-sequence of states ρ : ℕ → Q such that ρ(0) = q_0 and for all positions i ∈ ℕ we have (ρ(i), w(i), ρ(i+1)) ∈ δ. A run ρ is accepting if there are infinitely many positions i ∈ ℕ such that (ρ(i), w(i), ρ(i+1)) ∈ α. The language defined by A, denoted by L(A), is the set of ω-words w for which A has an accepting run.
A limit-deterministic Büchi automaton (LDBA) is a Büchi automaton A = (Q, q_0, Σ, δ, α) such that there exists a subset Q_d ⊆ Q satisfying the three following properties:
1. α ⊆ Q_d × Σ × Q_d, i.e. all accepting transitions are transitions within Q_d;
2. ∀q ∈ Q_d · ∀a ∈ Σ · ∀q_1, q_2 ∈ Q · (q, a, q_1) ∈ δ ∧ (q, a, q_2) ∈ δ ⟹ q_1 = q_2, i.e. the transition relation δ is deterministic within Q_d;
3. ∀q ∈ Q_d · ∀a ∈ Σ · ∀q' ∈ Q · (q, a, q') ∈ δ ⟹ q' ∈ Q_d, i.e. Q_d is a trap (once Q_d is entered it is never left).
W.l.o.g. we assume that q_0 ∈ Q \ Q_d, and we denote Q \ Q_d by Q_d̄. Courcoubetis and Yannakakis show that for every ω-regular language L there exists an LDBA A such that L(A) = L [CY95]. That is, LDBAs are as expressive as NBAs. An example of an LDBA is given in Fig. 1. Note that the language accepted by this LDBA cannot be recognized by a deterministic Büchi automaton.
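For concreteness, the three conditions can be checked mechanically on an explicitly represented automaton; the following sketch (ours, purely illustrative) assumes the transition relation and the accepting set are given as sets of triples.

def is_limit_deterministic(Q, Q_d, Sigma, delta, alpha):
    """Check the three LDBA conditions for an explicit automaton."""
    # 1. all accepting transitions stay within Q_d
    cond1 = all(q in Q_d and q2 in Q_d for (q, a, q2) in alpha)
    # 2. delta is deterministic within Q_d
    cond2 = all(len({q2 for (p, b, q2) in delta if p == q and b == a}) <= 1
                for q in Q_d for a in Sigma)
    # 3. Q_d is a trap
    cond3 = all(q2 in Q_d for (q, a, q2) in delta if q in Q_d)
    return cond1 and cond2 and cond3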
3 Here, we consider automata on infinite words with acceptance conditions based on transitions. It is well known that there are linear translations from automata with acceptance conditions defined on transitions to automata with acceptance conditions defined on states, and vice-versa.
Parity automata A deterministic ω-word automaton A with parity acceptance condition (DPA) is a tuple (Q, q_0, Σ, δ, p), defined as for deterministic Büchi automata with the exception of the acceptance condition p, which is now a function assigning an integer in {1, 2, ..., d}, called a color, to each transition of the automaton. Colors are naturally ordered by the order on integers.
Given a run ρ over a word w, the infinite sequence of colors traversed by ρ is denoted p(ρ) and is equal to p(ρ(0), w(0), ρ(1)) p(ρ(1), w(1), ρ(2)) ··· p(ρ(n), w(n), ρ(n+1)) ···. A run ρ is accepting if the minimal color that appears infinitely often along p(ρ) is even. The language defined by A, denoted by L(A), is the set of ω-words w for which A has an accepting run.
While deterministic Büchi automata are not expressively complete for the class of ω-regular languages, DPAs are complete for ω-regular languages: for every ω-regular language L there exists a DPA A such that L(A) = L, see e.g. [Pit07].
3 From LDBA to DPA
3.1 Run DAGs and their coloring
Run DAG A nondeterministic automaton A may have several (even infinitely many) runs on a given ω-word w. As in [KV01], we represent this set of runs by means of a directed acyclic graph structure called the run DAG of A on w. Given an LDBA A = (Q, Q_d, q_0, Σ, δ, α), this graph G_w = (V, E) has a set of vertices V ⊆ Q × ℕ and edges E ⊆ V × V defined as follows:
— V = ⋃_{i∈ℕ} V_i, where the sets V_i are defined inductively:
  • V_0 = {(q_0, 0)}, and for all i ≥ 1,
  • V_i = {(q, i) | ∃(q', i−1) ∈ V_{i−1} : (q', w(i−1), q) ∈ δ};
— E = {((q, i), (q', i+1)) ∈ V_i × V_{i+1} | (q, w(i), q') ∈ δ}.
We denote by V_i^d the set V_i ∩ (Q_d × {i}) that contains the subset of vertices of layer i that are associated with states in Q_d.
Observe that all the paths of G_w that start from (q_0, 0) are runs of A on w, and, conversely, each run ρ of A on w corresponds to exactly one path in G_w that starts from (q_0, 0). So, we call runs the paths in the run DAG G_w. In particular, we say that an infinite path v_0 v_1 ··· of G_w is an accepting run if there are infinitely many positions i ∈ ℕ such that v_i = (q, i), v_{i+1} = (q', i+1), and (q, w(i), q') ∈ α. Clearly, w is accepted by A if and only if there is an accepting run in G_w. We denote by ρ(0..n) = v_0 v_1 ... v_n the prefix of length n + 1 of the run ρ.
Ordering of runs A function Ord : Q → {1, 2, ..., |Q_d|, +∞} is called an ordering of the states of A w.r.t. Q_d if Ord defines a strict total order on the states of Q_d and maps each state outside Q_d to +∞, i.e.:
— for all q ∈ Q_d̄, Ord(q) = +∞,
— for all q ∈ Q_d, Ord(q) ≠ +∞, and
— for all q, q' ∈ Q_d, Ord(q) = Ord(q') implies q = q'.
We extend Ord to vertices of G_w as follows: Ord((q, i)) = Ord(q).
Starting from Ord, we define the following pre-order on the set of run prefixes of the run DAG G_w. Let ρ(0..n) = v_0 v_1 ... v_n and ρ'(0..n) = v'_0 v'_1 ... v'_n be two run prefixes of length n + 1. We write ρ(0..n) ⊑ ρ'(0..n), meaning that ρ(0..n) is smaller than ρ'(0..n), if:
— for all i, 0 ≤ i ≤ n, Ord(ρ(i)) = Ord(ρ'(i)), or
— there exists i, 0 ≤ i ≤ n, such that:
  • Ord(ρ(i)) < Ord(ρ'(i)), and
  • for all j, 0 ≤ j < i, Ord(ρ(j)) = Ord(ρ'(j)).
The pre-order is extended to (infinite) runs by setting ρ ⊑ ρ' if for all i ≥ 0, ρ(0..i) ⊑ ρ'(0..i).
Remark 1. If A accepts a word w, then A has a ⊑-smallest accepting run for w.
We use the ⊑-relation on run prefixes to order the vertices of V_i that belong to Q_d: for two different vertices v = (q, i) ∈ V_i and v' = (q', i) ∈ V_i, v is ⊑_i-smaller than v' if there is a run prefix of G_w that ends in v and is ⊑-smaller than all the run prefixes that end in v'. This induces a total order among the vertices of V_i^d because the states in Q_d are totally ordered by the function Ord.
Lemma 1. For all i ≥ 0 and all different vertices v = (q, i), v' = (q', i) ∈ V_i^d, either v ⊑_i v' or v' ⊑_i v, i.e., ⊑_i is a total order on V_i^d.
Indexing vertices The index of a vertex v = (q, i) ∈ V_i with q ∈ Q_d, denoted by Ind_i(v), is a value in {1, 2, ..., |Q_d|} that denotes its order in V_i^d according to ⊑_i (the ⊑_i-smallest element has index 1). For i ≥ 0, we identify two important sets of vertices:
— Dec(V_i^d) is the set of vertices v ∈ V_i^d such that there exists a vertex v' ∈ V_{i+1}^d with (v, v') ∈ E and Ind_{i+1}(v') < Ind_i(v), i.e. the set of vertices in V_i^d whose (unique) successor in V_{i+1}^d has a smaller index value.
— Acc(V_i^d) is the set of vertices v = (q, i) ∈ V_i^d such that there exists v' = (q', i+1) ∈ V_{i+1}^d with (v, v') ∈ E and (q, w(i), q') ∈ α, i.e. the set of vertices in V_i^d that are the source of an accepting transition on w(i).
Remark 2. Along a run, the index of vertices can only decrease. As the function Ind(·) has a finite range, the index along a run eventually stabilizes.
Assigning colors The set of colors that are used for coloring the levels of the run DAG G_w is {1, 2, ..., 2|Q_d| + 1}. We associate a color with each transition from level i to level i + 1 according to the following cases:
1. if Dec(V_i^d) = ∅ and Acc(V_i^d) ≠ ∅, the color is 2 · min_{v∈Acc(V_i^d)} Ind_i(v);
2. if Dec(V_i^d) ≠ ∅ and Acc(V_i^d) = ∅, the color is 2 · min_{v∈Dec(V_i^d)} Ind_i(v) − 1;
3. if Dec(V_i^d) ≠ ∅ and Acc(V_i^d) ≠ ∅, the color is the minimal color among
   - c_odd = 2 · min_{v∈Dec(V_i^d)} Ind_i(v) − 1, and
   - c_even = 2 · min_{v∈Acc(V_i^d)} Ind_i(v);
4. if Dec(V_i^d) = Acc(V_i^d) = ∅, the color is 2|Q_d| + 1.
The intuition behind this coloring is as follows: the coloring tracks runs in Q_d (only those are potentially accepting, as α ⊆ Q_d × Σ × Q_d) and tries to produce an even color that corresponds to the smallest index of an accepting run. If in level i the run DAG has an outgoing transition that is accepting, then this is a positive event; as a consequence, the color emitted is even and it is a function of the smallest index of a vertex associated with an accepting transition from V_i to V_{i+1}. Runs in Q_d are deterministic, but they can merge with smaller runs. When this happens, this is considered a negative event, because the even colors that have been emitted by the run that merges with the smaller run should no longer be taken into account. As a consequence, an odd color is emitted in order to cancel all the (good) even colors that were generated by the run that merges with the smaller one. In that case the odd color is a function of the smallest index of a vertex in V_i whose run merges with a smaller vertex in V_{i+1}. These first two situations are handled by cases 1 and 2 above. When both situations happen at the same time, the color is determined by the minimum of the two colors assigned to the positive and the negative events; this is handled by case 3 above. Finally, when there is no accepting transition from V_i to V_{i+1} and no merging, the largest odd color is emitted, as indicated by case 4 above.
According to this intuition, we define the color summary of the run DAG G_w as the minimal color that appears infinitely often along the transitions between its levels. Because of the deterministic behavior of the automaton in Q_d, each run can only merge at most |Q_d| − 1 times with a smaller one (the size of the range of the function Ind(·) minus one), and as a consequence of the definition of the above coloring, we know that, on a word accepted by A, the smallest accepting run will eventually generate infinitely many (good) even colors that are never trumped by smaller odd colors.
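The case distinction can be written compactly as a function of the indices of the vertices in Dec(V_i^d) and Acc(V_i^d); the following sketch (ours) is a direct transcription of the four cases above.

def level_color(dec_indices, acc_indices, n_d):
    """Color emitted between two consecutive levels of the run DAG.
    dec_indices / acc_indices are the indices Ind_i(v) of the vertices in
    Dec(V_i^d) / Acc(V_i^d), and n_d = |Q_d|."""
    if not dec_indices and acc_indices:
        return 2 * min(acc_indices)
    if dec_indices and not acc_indices:
        return 2 * min(dec_indices) - 1
    if dec_indices and acc_indices:
        return min(2 * min(dec_indices) - 1, 2 * min(acc_indices))
    return 2 * n_d + 1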
Example 1. The left part of Fig. 2 depicts the run DAG of the limit-deterministic automaton of Fig. 1 on the word w = abb(ab)^ω. Each path in this graph represents a run of the automaton on this word. The coloring of the run DAG follows the coloring rules defined above. Between level 0 and level 1, the color is equal to 7 = 2|Q_d| + 1, as no accepting edge is taken from level 0 to level 1 and no run merges (within Q_d). The color 7 is also emitted from level 1 to level 2 for the same reason. The color 4 is emitted from level 2 to level 3 because the accepting edge (3, b, 3) is taken and the index of state 3 in level 2 is equal to 2 (state 4 has index 1 as it is the end point of the smallest run prefix within Q_d). The color 3 is emitted from level 3 to level 4 because the run that goes from 3 to 4 merges with the smaller run that goes from 4 to 4. In order to cancel the even colors emitted by the run that goes from 3 to 4, color 3 is emitted; it cancels the even color 4 emitted before by this run. Afterwards, color 3 is emitted forever. The color summary is 3, showing that there is no accepting run in the run DAG.
Fig. 2: The run DAGs of the automaton of Fig. 1 on the word w = abb(ab)^ω (left) and on the word w = aab^ω (right), together with their colorings.
The right part of Fig. 2 depicts the run DAG of the limit-deterministic automaton of Fig. 1 on the word w = aab^ω. The coloring of the run DAG follows the coloring rules defined above. Between levels 0 and 1, color 7 is emitted because no accepting edge is crossed. To the next level, we see the accepting edge (2, a, 2) and the color 2 · 1 = 2 is emitted. Upon reading the first b, we see again 7, since neither is any accepting edge seen nor does any merging take place. Afterwards, each b causes an accepting edge (3, b, 3) to be taken. While the smallest run, which visits 4 forever, is not accepting, the second smallest run, which visits 3 forever, is accepting. As 3 has index 2 in all the levels below level 3, the color is forever equal to 4. The color summary of the run DAG is thus equal to 2 · 2 = 4, and this shows that the word w = aab^ω is accepted by our limit-deterministic automaton of Fig. 1.
The following theorem tells us that the color summary (the minimal color that appears infinitely often) can be used to identify run DAGs that contain accepting runs. The proof can be found in [EKRS17, Appendix A].
Theorem 1. The color summary of the run DAG G_w is even if and only if there is an accepting run in G_w.
3.2 Construction of the DPA
From an LDBA A = (Q, Q_d, q_0, Σ, δ, α) and an ordering function Ord : Q → {1, 2, ..., |Q_d|, +∞} compatible with Q_d, we construct a deterministic parity automaton B = (Q^B, q_0^B, Σ, δ^B, p) that, on a word w, constructs the levels of the run DAG G_w together with the coloring of the previous section. Theorem 1 tells us that such an automaton accepts the same language as A.
First, we need some notation. Given a finite set S, we denote by P(S) the set of its subsets, and by OP(S) the set of its totally ordered subsets. So if (s, <) ∈ OP(S),
then s ⊆ S and < ⊆ s × s is a total strict order on s. For e ∈ s, we denote by Ind_{(s,<)}(e) the position of e among the elements of s in the total strict order <, with the convention that the index of the <-minimum element is equal to 1. The deterministic parity automaton B = (Q^B, q_0^B, Σ, δ^B, p) is defined as follows.
States and initial state The set of states is Q^B = P(Q_d̄) × OP(Q_d), i.e. a state of B is a pair (s, (t, <)) where s is a set of states outside Q_d and t is an ordered subset of Q_d; the ordering reflects the relative index of each state within t. The initial state is q_0^B = ({q_0}, ({}, {})).
Transition function Let (s_1, (t_1, <_1)) be a state of B and σ ∈ Σ. Then δ^B((s_1, (t_1, <_1)), σ) = (s_2, (t_2, <_2)), where s_2 = post_δ^σ(s_1) \ Q_d, t_2 = post_δ^σ(s_1 ∪ t_1) ∩ Q_d, and the order <_2 on t_2 is defined as follows: for two different states q_1, q_2 ∈ t_2 we have q_1 <_2 q_2 if one of the following holds:
1. ¬∃q'_1 ∈ t_1 : q_1 = δ(q'_1, σ), and ¬∃q'_2 ∈ t_1 : q_2 = δ(q'_2, σ), and Ord(q_1) < Ord(q_2), i.e. neither has a σ-predecessor in t_1, and they are ordered using Ord;
2. ∃q'_1 ∈ t_1 : q_1 = δ(q'_1, σ), and ¬∃q'_2 ∈ t_1 : q_2 = δ(q'_2, σ), i.e. q_1 has a σ-predecessor in t_1 and q_2 does not;
3. ∃q'_1 ∈ t_1 : q_1 = δ(q'_1, σ), and ∃q'_2 ∈ t_1 : q_2 = δ(q'_2, σ), and min_{<_1}{q'_1 ∈ t_1 | q_1 = δ(q'_1, σ)} <_1 min_{<_1}{q'_2 ∈ t_1 | q_2 = δ(q'_2, σ)}, i.e. both have a predecessor in t_1, and they are ordered according to the order of their minimal parents.
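A sketch of this successor computation (ours, with Ord approximated by a fixed enumeration of Q_d and the automaton given as a set of transition triples) may make the three cases concrete.

def dpa_successor(s1, t1, sigma, delta, Q_d):
    """Successor of a DPA state (s1, t1) on letter sigma: s1 is a set of states
    outside Q_d, t1 a list of Q_d-states ordered by index."""
    def post(states):
        return {q2 for q in states for (p, a, q2) in delta if p == q and a == sigma}
    s2 = post(s1) - Q_d
    candidates = post(set(t1) | s1) & Q_d
    def parent_rank(q):
        # index (position in t1) of the smallest sigma-predecessor of q in t1
        ranks = [i for i, p in enumerate(t1) if (p, sigma, q) in delta]
        return min(ranks) if ranks else None
    ord_ = sorted(Q_d)                 # stand-in for the ordering function Ord
    def key(q):
        r = parent_rank(q)
        # states with a predecessor in t1 come first, ordered by their minimal
        # parent; fresh states follow, ordered by Ord
        return (0, r) if r is not None else (1, ord_.index(q))
    t2 = sorted(candidates, key=key)
    return s2, t2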
Coloring To define the coloring of the edges of the deterministic automaton, we identify, for a transition (s_1, (t_1, <_1)) --σ--> (s_2, (t_2, <_2)), the sets Acc(t_1) and Dec(t_1) in analogy to the run DAG: Acc(t_1) is the set of states q ∈ t_1 whose outgoing σ-transition is accepting, i.e. (q, σ, δ(q, σ)) ∈ α, and Dec(t_1) is the set of states q ∈ t_1 whose σ-successor has a smaller index in (t_2, <_2) than q has in (t_1, <_1). The color of the transition is then defined as follows:
1. if Dec(t_1) = ∅ and Acc(t_1) ≠ ∅, the color is 2 · min_{q∈Acc(t_1)} Ind_{(t_1,<_1)}(q);
2. if Dec(t_1) ≠ ∅ and Acc(t_1) = ∅, the color is 2 · min_{q∈Dec(t_1)} Ind_{(t_1,<_1)}(q) − 1;
3. if Dec(t_1) ≠ ∅ and Acc(t_1) ≠ ∅, the color is the minimal color among
   - c_odd = 2 · min_{q∈Dec(t_1)} Ind_{(t_1,<_1)}(q) − 1, and
   - c_even = 2 · min_{q∈Acc(t_1)} Ind_{(t_1,<_1)}(q);
4. if Dec(t_1) = Acc(t_1) = ∅, the color is 2|Q_d| + 1.
Example 2. Fig. 3 shows the result of applying the translation LDBA→DPA defined above to the LDBA of Fig. 1 that recognizes the LTL language FGa ∨ FGb. The figure only shows the reachable states of this construction. As specified in the construction above, states of the DPA are labelled with a subset of Q_d̄ and an ordered subset of Q_d of the original NBA. As an illustration of the definitions above, let us explain the color of the edge from state ({1}, [4, 3]) to itself on letter b. When the NBA is in state 1, 3 or 4 and letter b is read, then the next state of the automaton is again 1, 3 or 4. Note also that no runs merge in that case. As a consequence, the color that is emitted is even and equal to twice the index of the smallest state that is the target of an accepting transition. In this case, this is state 3 and its index is 2, which justifies the color 4 on the edge. On the other hand, if letter a is read from state ({1}, [4, 3]), then the automaton moves to state ({1}, [4, 2]). The state 3 is mapped to state 4 and there is a run merging, which implies that the color emitted is odd and equal to 3. This 3 trumps all the 4's that were possibly emitted from state ({1}, [4, 3]) before.
Theorem 2. The language defined by the deterministic parity automaton B is equal to the language defined by the limit-deterministic automaton A, i.e. L(A) = L(B).
Proof. Let w ∈ Σ^ω and let G_w be the run DAG of A on w. It is easy to show by induction that the sequence of colors that occurs along G_w is equal to the sequence of colors defined by the run of the automaton B on w. By Theorem 1, the language of automaton B is thus equal to the language of automaton A. □
3.3 Complexity Analysis
Upper bound Let n = |Q| be the size of the LDBA and let n_d = |Q_d| be the size of the accepting component. We can bound the number of different orderings using the series of reciprocals of factorials (with e being Euler's number):
|OP(Q_d)| = Σ_{i=0}^{n_d} n_d!/(n_d − i)! = n_d! · Σ_{i=0}^{n_d} 1/i! ≤ e · n_d! ∈ O(2^{n_d log n_d}).
Thus the obtained DPA has O(2^n · 2^{n_d log n_d}) = 2^{O(n log n)} states and O(n) colours.
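The bound on the number of ordered subsets can also be checked numerically; the following snippet (ours) evaluates the sum and the bound e · n_d! for small n_d.

from math import factorial, e

def ordered_subsets(n):
    """Number of totally ordered subsets of an n-element set: sum over k of n!/(n-k)!."""
    return sum(factorial(n) // factorial(n - k) for k in range(n + 1))

# the bound used above: ordered_subsets(n) <= e * n!
assert all(ordered_subsets(n) <= e * factorial(n) for n in range(1, 12))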
Lower bound We obtain a matching lower bound by strengthening Theorem 8 from [Löd99]:
Lemma 2. There exists a family (L_n)_{n≥2} of languages (L_n over an alphabet of n letters) such that for every n the language L_n can be recognized by a limit-deterministic Büchi automaton with 3n + 2 states, but cannot be recognized by a deterministic parity automaton with less than n! states.
Proof. The proof of Theorem 8 from [Löd99] constructs a non-deterministic Büchi automaton of exactly this size which is in fact limit-deterministic.
Assume there exists a deterministic parity automaton for L_n with m < n! states. Since parity automata are closed under complementation, we can obtain a parity automaton, and hence also a Rabin automaton, of size m for the complement of L_n, and thus a Streett automaton of size m for L_n, a contradiction to Theorem 8 of [Löd99]. □
Corollary 1. Every translation from limit-deterministic Büchi automata of size $n$ to deterministic parity automata yields automata with $2^{\Omega(n \log n)}$ states in the worst case.
4 From LTL to Parity in $2^{2^{O(n)}}$
In [SEJK16] we present an LTL→LDBA translation. Given a formula $\varphi$ of size $n$, the translation produces an asymptotically optimal LDBA with $2^{2^{O(n)}}$ states. The straightforward composition of this translation with the single-exponential LDBA→DPA translation of the previous section is only guaranteed to be triple exponential, while the Safra-Piterman construction produces a DPA of at most doubly exponential size. In this section we describe a modified composition that yields a doubly exponential DPA. To the best of our knowledge this is the first translation of the whole of LTL to deterministic parity automata that is asymptotically optimal and does not use Safra's construction.
The section is divided into two parts. In the first part, we explain and illustrate a redundancy occurring in our LDBA→DPA translation, responsible for the undesired extra exponential. We also describe an optimization that removes this redundancy when the LDBA satisfies certain conditions. In the second part, we show that these conditions are satisfied by the automata produced by the LTL→LDBA translation, which in turn guarantees a doubly exponential LTL→DPA procedure.
4.1 An improved construction
We can view the second component of a state of the DPA as a sequence of states of the LDBA, ordered by their indices. Since there are $2^{2^{O(n)}}$ states of the LDBA for an LTL formula of length $n$, the number of such sequences is a priori bounded only by $\big(2^{2^{O(n)}}\big)! \in 2^{2^{2^{O(n)}}}$, i.e., it is triple exponential in $n$. If only the length of the sequences (the maximum index) were bounded by $2^n$, the number of such sequences would be smaller than the number of functions $\{1, \ldots, 2^n\} \to 2^{2^{O(n)}}$, which is

$\big(2^{2^{O(n)}}\big)^{2^n} = 2^{2^{O(n)} \cdot 2^n} = 2^{2^{O(n)}}.$
Fix an LDBA with set of states $Q$. Assume the existence of an oracle: a list of statements of the form $L(q) \subseteq \bigcup_{q' \in Q'} L(q')$ for states $q \in Q$ and sets $Q' \subseteq Q$. Call a vertex of a level of the run DAG $G_w$ redundant if, according to the oracle, its language is contained in the union of the languages of the other vertices of the same level. The reduced run DAG $G^r_w$ is obtained from $G_w$ by processing every level $i \ge 1$, with vertices $V_i$, as follows: for every redundant vertex $(v_k, i)$,
- Redirect the transitions leading from vertices in $V_{i-1}$ to $(v_k, i)$ so that they lead to the smallest vertex $(v_1, i)$ of $V_i$.
- Remove the vertices (if any) that are no longer reachable from vertices of $V_{i-1}$.
We define the color summary of $G^r_w$ in exactly the same way as the color summary of $G_w$. The DAG $G^r_w$ satisfies the following crucial property, whose proof can be found in [EKRS17, Appendix B]:
Proposition 1. The color summary of the run DAG $G^r_w$ is even if and only if there is an accepting run in $G_w$.
The mapping on DAGs induces a reduced DPA as follows. The states are the pairs $(s, (t, <))$ such that $(t, <)$ does not contain redundant vertices. There is a transition $(s_1, (t_1, <)) \xrightarrow{\sigma} (s_2, (t_2, <))$ with color $c$ iff there is a word $w$ and an index $i$ such that $(s_1, (t_1, <))$ and $(s_2, (t_2, <))$ correspond to the $i$-th and $(i{+}1)$-th levels of $G^r_w$, and $\sigma$ and $c$ are the letter and color of the step between these levels in $G^r_w$. Observe that the set of transitions is independent of the words chosen to define them.
The equivalence between the initial DPA $A$ and the reduced DPA $A_r$ follows immediately from Proposition 1: $A$ accepts $w$ iff $G_w$ contains an accepting run iff the color summary of $G^r_w$ is even iff $A_r$ accepts $w$.
Example 3. Consider the LDBA of Fig. 1 and an oracle given by $L(4) = \emptyset$, ensuring $L(4) \subseteq \bigcup_{i \in I} L(i)$ for any $I \subseteq Q$. Then 4 is always redundant and merged, removing the two rightmost states of the DPA of Fig. 3 (left), resulting in the DPA of Fig. 3 (right). However, for the sake of technical convenience, we shall refrain from removing a redundant vertex when it is the smallest one (with index 1).
Since the construction of the reduced DPA is parametrized by an oracle, the obvious question is how to obtain an oracle that does not involve applying an expensive language inclusion test. Let us give a first example in which an oracle can be easily obtained:
Example 4. Consider an LDBA where each state $v = \{s_1, \ldots, s_k\}$ arose from some powerset construction on an NBA in such a way that $L(\{s_1, \ldots, s_k\}) = L(s_1) \cup \cdots \cup L(s_k)$. An oracle can, for instance, allow us to merge a vertex $v_k$ whenever $v_k \subseteq \bigcup_{j \neq k} v_j$ for the other vertices $v_j$ of the same level.

This is exactly the situation of the LTL→LDBA translation of [SEJK16]. Since partial evaluation of formulas plays a major role in the translation, we introduce the following definition: given an LTL formula $\varphi$ and a finite word $w$, the formula $\mathit{af}(\varphi, w)$ ("$\varphi$ after $w$") denotes the partial evaluation of $\varphi$ on $w$.

For the experimental evaluation we consider, among others, the parametric formulas

$\theta(n) = \neg\Big(\big(\bigwedge_{i=1}^{n} GF p_i\big) \to G(q \to F r)\Big) \qquad F(n) = \bigwedge_{i=1}^{n} (GF p_i \to GF q_i)$
for which the results are shown in Fig. 4a. Additionally, we consider the $f$ formulas from [SEJK16] (Table 1). Observe that L2P' performs clearly better, and the gap between the tools grows as the parameter increases.
Randomly Generated Formulas from [BKS13] (Fig. 4b).
Real Data. Formulas taken from case studies and synthesis competitions — the intended domain of application of our approach. Figures 4c and 4d show results for the real-world formulas of [BKS13] and the TLSF specifications contained in the Acacia set of [JBB+16]. Table 1 shows results for LTL formulas expressing properties of Szymanski's protocol [DWDMR08], and for the generalised buffer benchmark of Acacia.
Average Compression Ratios. The geometric average compression ratio for a benchmark suite $B$ is defined as $\big(\prod_{\varphi \in B} n^{S}_{\varphi} / n^{L2P'}_{\varphi}\big)^{1/|B|}$, where $n^{S}_{\varphi}$ and $n^{L2P'}_{\varphi}$ denote the number of states of the automata produced by Spot and L2P', respectively. The ratios in our experiments (excluding benchmarks where Spot times out) are: 1.14 for random formulas, 1.12 for the real-world formulas of [BKS13], and 1.35 for the formulas of Acacia.
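The geometric average is simply the $|B|$-th root of the product of the per-formula state ratios; a small sketch (our own, with made-up state counts rather than the benchmark data) is:

```python
from math import prod

def geometric_compression_ratio(spot_sizes, l2p_sizes):
    """Geometric average of per-formula ratios n_spot / n_l2p over a benchmark suite."""
    assert len(spot_sizes) == len(l2p_sizes)
    ratios = [s / l for s, l in zip(spot_sizes, l2p_sizes)]
    return prod(ratios) ** (1 / len(ratios))

# Hypothetical state counts for a three-formula suite.
print(geometric_compression_ratio([20, 100, 50], [10, 80, 40]))
```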
6 Conclusion
We have presented a simple, "Safraless", and asymptotically optimal translation from LTL and LDBA to deterministic parity automata. Furthermore, the translation is suitable for an on-the-fly implementation. The resulting automata are substantially smaller than those produced by the SPOT library for formulas obtained from synthesis specifications, and have comparable or smaller size for other benchmarks. In future work we want to investigate the performance of the translation as part of a synthesis toolchain.

Fig. 4: Comparison of Spot and our implementation using the best configurations: (a) parametric formulas, (b) randomly generated formulas, (c) "real-world" formulas, (d) TLSF/Acacia. Timeouts are denoted by setting the size of the automaton to the maximum.

Table 1: Number of states and number of used colours (in parentheses) for the constructed automata. Timeouts are marked with t.

       f(1,0)  f(1,2)  f(1,4)   f(2,0)   f(2,2)    zn     zp1    zp2      zp3     Buffer
Spot   18(6)   141(8)  2062(8)  208(12)  883(12)   t      t      t        t       t
L2P    12(8)   114(9)  332(15)  144(14)  4732(19)  t      t      t        t       1425(27)
L2P'   12(8)   78(7)   271(11)  106(9)   1904(15)  32(6)  42(6)  111(12)  97(12)  435(4)
Acknowledgments. The authors want to thank Michael Luttenberger for helpful discussions and the anonymous reviewers for constructive feedback.
References
BHS 16. František Blahoudek, Matthias Heizmann, Sven Schewe, Jan Strejček, and Ming-Hsien Tsai. Complementing semi-deterministic Biichi automata. In TACAS, volume 9636 of LNCS, pages 770-787, 2016.
BKS13. František Blahoudek, Mojmír Křetínský, and Jan Strejček. Comparison
of LTL to deterministic Rabin automata translators. In LPAR, volume 8312 of LNCS, pages 164-172, 2013.
CY95. Costas Courcoubetis and Mihalis Yannakakis. The complexity of proba-
bilistic verification. J. ACM, 42(4):857-907, 1995.
DLLF+16. Alexandre Duret-Lutz, Alexandre Lewkowicz, Amaury Fauchille, Thibaud Michaud, Etienne Renault, and Laurent Xu. Spot 2.0 — a framework for LTL and w-automata manipulation. In Proceedings of the 14-th International Symposium on Automated Technology for Verification and Analysis (ATVA'16), volume 9938 of Lecture Notes in Computer Science, pages 122-129. Springer, October 2016. To appear.
DWDMR08. Martin De Wulf, Laurent Doyen, Nicolas Maquet, and Jean-Francois Raskin. Antichains - Alternative Algorithms for LTL Satisfiability and Model-Checking. TACAS, 2008.
EKRS17. Javier Esparza, Jan Křetínský, Jean-Francois Raskin, and Salomon Sick-ert. From LTL and limit-deterministic biichi automata to deterministic parity automata. Technical Report abs/1701.06103, arXiv.org, 2017.
Finl5. Bernd Finkbeiner. Automata, games, and verification, 2015.
Available at https://www.react.uni-saarland.de/teaching/ automata-games-verification-15/downloads/notes.pdf.
FKVW15. Seth Fogarty, Orna Kupferman, Moshe Y. Vardi, and Thomas Wilke.
Profile trees for biichi word automata, with application to determiniza-tion. Inf. Comput, 245:136-151, 2015.
JBB+16. Swen Jacobs, Roderick Bloem, Romain Brenguier, Ayrat Khalimov, Felix Klein, Robert Konighofer, Jens Kreber, Alexander Legg, Nina Narodyt-ska, Guillermo A. Perez, Jean-Francois Raskin, Leonid Ryzhyk, Ocan Sankur, Martina Seidl, Leander Tentrup, and Adam Walker. The 3rd reactive synthesis competition (SYNTCOMP 2016): Benchmarks, participants fe results. CoRR, abs/1609.00507, 2016.
KR10. Orna Kupferman and Adin Rosenberg. The blowup in translating LTL
to deterministic automata. In MoChArt, volume 6572 of LNCS, pages 85-94. Springer, 2010.
KV01. Orna Kupferman and Moshe Y. Vardi. Weak alternating automata are
not that weak. ACM Trans. Comput. Log., 2(3):408-429, 2001.
KV15. Dileep Kini and Mahesh Viswanathan. Limit deterministic and proba-
bilistic automata for LTL \ GU. In TACAS, volume 9035 of LNCS, pages 628-642, 2015.
Löd99. Christof Löding. Optimal bounds for transformations of omega-
automata. In C. Pandu Rangan, Venkatesh Raman, and Ramaswamy Ramanujam, editors, Foundations of Software Technology and Theoretical Computer Science, 19th Conference, Chennai, India, December 13-15, 1999, Proceedings, volume 1738 of Lecture Notes in Computer Science, pages 97-109. Springer, 1999.
Pit07. Nir Piterman. From nondeterministic Biichi and Streett automata to
deterministic parity automata. Logical Methods in Computer Science, 3(3), 2007.
Redl2. Roman R. Redziejowski. An improved construction of deterministic
omega-automaton using derivatives. Fundam. Inform., 119(3-4):393-406, 2012.
Saf88. Shmuel Safra. On the complexity of omega-automata. In FOCS, pages
319-327, 1988.
SEJK16. Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan Kretmsky. Limit-deterministic buchi automata for linear temporal logic. In Computer Aided Verification - 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17-23, 2016, Proceedings, Part II, pages 312-332, 2016.
ST03. Roberto Sebastiani and Stefano Tonetta. "more deterministic" vs.
"smaller" buchi automata for efficient LTL model checking. In Correct Hardware Design and Verification Methods, 12th IFIP WG 10.5 Advanced Research Working Conference, CHARME 2003, LAquila, Italy, October 21-24, 2003, Proceedings, pages 126-140, 2003.
Var85. Moshe Y. Vardi. Automatic verification of probabilistic concurrent finite-
state programs. In FOCS, pages 327-338, 1985.
VW86. Moshe Y. Vardi and Pierre Wolper. An automata-theoretic approach
to automatic program verification (preliminary report). In LICS, pages 332-344, 1986.
One Theorem to Rule Them All: A Unified Translation of LTL into ω-Automata*
Javier Esparza esparza@in.tum.de Technische Universität München Germany
Jan Křetínský jan.kretinsky@in.tum.de Technische Universität München Germany
Salomon Sickert sickert@in.tum.de Technische Universität München Germany
Abstract
We present a unified translation of LTL formulas into deterministic Rabin automata, limit-deterministic Büchi automata, and nondeterministic Büchi automata. The translations yield automata of asymptotically optimal size (double or single exponential, respectively). All three translations are derived from one single Master Theorem of purely logical nature. The Master Theorem decomposes the language of a formula into a positive Boolean combination of languages that can be translated into ω-automata by elementary means. In particular, Safra's, ranking, and breakpoint constructions used in other translations are not needed.
CCS Concepts • Theory of computation —> Automata over infinite objects; Modal and temporal logics;
Keywords Linear temporal logic, Automata over infinite words, Deterministic automata, Non-deterministic automata
1 Introduction
Linear temporal logic (LTL) [32] is a prominent specification language, used both for model checking and automatic synthesis of systems. In the standard automata-theoretic approach [38] the input formula is first translated into an (^-automaton, and then the product of this automaton with the input system is further analyzed. Since the size of the product is often the bottleneck of all the verification algorithms, it is crucial that the (^-automaton is as small as possible. Consequently, a lot of effort has been spent on translating LTL into small automata, e.g. [4, 10-12, 17, 18, 20, 21, 36].
While non-deterministic Buchi automata (NBA) can be used for model checking non-deterministic systems, other applications such as model checking probabilistic systems or synthesis usually require automata with a certain degree of determinism, such as deterministic parity automata (DPA) or deterministic Rabin automata (DRA) [5], deterministic generalized Rabin automata (DGRA) [8], limit-deterministic (or semi-deterministic) Buchi automata (LDBA) [9, 22, 35, 37], unambiguous Buchi automata [6] etc. The usual constructions that produce such automata are based on Safra's de-terminization and its variants [31, 33, 34]. However, they are known
*This work was partially funded and supported by the Czech Science Foundation, grant No. P202/12/G061, and the German Research Foundation (DFG) project "Verified Model Checkers" (317422601).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. LICS '18, July 9-12, 2018, Oxford, United Kingdom © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5583-4/18/07... $15.00 https://doi.org/10.1145/3209108.3209161
to be difficult to implement efficiently, and to be practically inefficient in many cases due to their generality. Therefore, a recent line of work shows how DPA [14, 28], DRA and DGRA [13,15, 26, 27], or LDBA [23, 24, 35] can be produced directly from LTL, without the intermediate step through a non-deterministic automaton. All these works share the principle of describing each state by a collection of formulas, as happens in the classical tableaux construction for translation of LTL into NBA. This makes the approach particularly apt for semantic-based state reductions, e.g., for merging states corresponding to equivalent formulas. These reductions cannot be applied to Safra-based constructions, where this semantic structure gets lost.
In this paper, we provide a unified view of translations of LTL into NBA, LDBA, and DRA enjoying the following properties, absent in former translations:
Asymptotic Optimality. D(G)RA are the most compact among the deterministic automata used in practice, in particular compared to DPA. Previous translations to D(G)RA were either limited to fragments of LTL [3, 26, 27], or only shown to be triply exponential [13, 15]. Here we provide constructions for all mentioned types of automata matching the optimal double exponential bound for DRA and LDBA, and the optimal single exponential bound for NBA.
Symmetry. The first translations [26, 27] used auxiliary automata to monitor each Future- and G/ofcaWy-subformula. While this approach worked for fragments of LTL, subsequent constructions for full LTL [13, 15, 35] could not preserve the symmetric treatment. They only used auxiliary automata for G-subformulas, at the price of more complex constructions. Our translation re-establishes the symmetry of the first constructions. It treats F and G equally (actually, and more generally, it treats each operator and its dual equally), which results into simpler auxiliary automata.
Independence of Syntax. Previous translations were quite sensitive to the operators used in the syntax of LTL. In particular, the only greatest-fixed-point operator they allowed was Globally. Since formulas also had to be in negation normal form, pre-processing of the input often led to unnecessarily large formulas. While our translations still requires negation normal form, it allows for direct treatment of Release, Weak until, and other operators.
Unified View. Our translations rely on a novel Master Theorem, which decomposes the language of a formula into a positive boolean combination of "simple" languages, in the sense that they are easy to translate into automata. This approach is arguably simpler than previous ones (it is certainly simpler than our previous papers [15, 35]). Besides, it provides a unified treatment of DRA, NBA, and LDBA, differing only in the translations of the "simple" languages. The automaton for the formula is obtained from the automata for the "simple" languages by means of standard operations for closure under union and intersection.
l
On top of its theoretical advantages, our translation is comparable to previous DRA translations in practice, even without major optimizations. Summarizing, we think this paper finally achieves the goals formulated in [26], where the first translation of this kind—valid only for what we would now call a small fragment of LTL—was presented.
Structure of the Paper. Section 2 contains preliminaries about LTL and ^-automata. Section 3 introduces some definitions and results of [15, 35]. Section 4 shows how to use these notions to translate four simple fragments of LTL into deterministic Buchi and coBiichi automata; these translations are later used as building blocks. Section 5 presents our main result, the Master Theorem. Sections 6, 7, and 8 apply the Master Theorem to derive translations of LTL into DRA, NBA, and LDBA, respectively. Section 9 compares the paper to related work and puts the obtained results into context. The appendix of the accompanying technical report [16] contains the few omitted proofs and further related material.
2 Preliminaries
2.1 ω-Languages and ω-Automata
Let $\Sigma$ be a finite alphabet. An $\omega$-word $w$ over $\Sigma$ is an infinite sequence of letters, and an $\omega$-language is a set of $\omega$-words. A pre-automaton over $\Sigma$ is a tuple $P = (Q, \Delta, Q_0)$, where $Q$ is a finite set of states, $\Delta: Q \times \Sigma \to 2^Q$ is a transition function, and $Q_0$ is a set of initial states. A transition is a triple $(q, \sigma, q')$ such that $q' \in \Delta(q, \sigma)$. A pre-automaton $P$ is deterministic if $Q_0$ is a singleton and $\Delta(q, \sigma)$ is a singleton for every $q \in Q$ and $\sigma \in \Sigma$.
A run of $P$ on an $\omega$-word $w = \sigma_1\sigma_2\cdots$ is an infinite sequence $r = q_0 q_1 q_2 \cdots$ of states such that $q_0 \in Q_0$ and $q_{i+1} \in \Delta(q_i, \sigma_{i+1})$ for every $i \ge 0$. Accepting conditions are defined in terms of the states visited infinitely often by a run: a Büchi (co-Büchi) condition is given by a set of states, and a Rabin condition is given by some $k \ge 1$ and some sets $I_1, F_1, \ldots, I_k, F_k$ of states.
An $\omega$-automaton over $\Sigma$ is a tuple $\mathcal{A} = (Q, \Delta, Q_0, \alpha)$ where $(Q, \Delta, Q_0)$ is a pre-automaton over $\Sigma$ and $\alpha$ is an accepting condition. A run $r$ of $\mathcal{A}$ is accepting if $r \models \alpha$. A word $w$ is accepted by $\mathcal{A}$ if some run of $\mathcal{A}$ on $w$ is accepting. An $\omega$-automaton is a Büchi (co-Büchi, Rabin) automaton if its accepting condition is a Büchi (co-Büchi, Rabin) condition.
Limit-Deterministic Büchi Automata. Intuitively, an NBA is limit-deterministic if it can be split into a non-deterministic component without accepting states, and a deterministic component. The automaton can only accept by "jumping" from the non-deterministic to the deterministic component, but after the jump it must stay in the deterministic component forever. Formally, an NBA $\mathcal{B} = (Q, \Delta, Q_0, \alpha)$ is limit-deterministic (LDBA) if $Q$ can be partitioned into two disjoint sets $Q = Q_N \uplus Q_D$, s.t.
1. $\Delta(q, \nu) \subseteq Q_D$ and $|\Delta(q, \nu)| = 1$ for every $q \in Q_D$, $\nu \in \Sigma$, and
2. $S \subseteq Q_D$ for all $S \in \alpha$.
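As an illustration of conditions 1 and 2, the following sketch (our own, not from the paper; the dictionary encoding of the transition relation and the toy automaton for FGa are assumptions) checks whether a candidate set $Q_D$ witnesses limit-determinism of an NBA with a Büchi acceptance set.

```python
def is_limit_deterministic(delta, accepting, q_d, alphabet):
    """Check the two LDBA conditions for a Buchi acceptance set.

    delta     : dict mapping (state, letter) -> set of successor states
    accepting : set of accepting states
    q_d       : candidate deterministic part Q_D
    """
    # 1. Inside Q_D every state has exactly one successor per letter, staying in Q_D.
    for q in q_d:
        for letter in alphabet:
            succs = delta.get((q, letter), set())
            if len(succs) != 1 or not succs <= q_d:
                return False
    # 2. All accepting states lie in the deterministic part.
    return accepting <= q_d

# Toy LDBA for FG a: state 0 guesses the stabilization point, state 1 checks G a,
# state 2 is a rejecting sink (so that Q_D is total and deterministic).
E, A = frozenset(), frozenset({"a"})
delta = {
    (0, E): {0}, (0, A): {0, 1},
    (1, A): {1}, (1, E): {2},
    (2, A): {2}, (2, E): {2},
}
print(is_limit_deterministic(delta, accepting={1}, q_d={1, 2}, alphabet=[E, A]))  # True
```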
2.2 Linear Temporal Logic
We work with a syntax for LTL in which formulas are written in negation-normal form, i.e., negations only occur in front of atomic propositions. For every temporal operator we also include in the syntax its dual operator. On top of the next operator X, which is self-dual, we introduce temporal operators F (eventually), U (until), and W (weak until), and their duals G (always), R (release) and M (strong release). The syntax may look redundant but as we shall see it is essential to include W and M and very convenient to include F andG.
Syntax and semantics of LTL. A formula of LTL in negation normal form over a set of atomic propositions (Ap) is given by the syntax:

$\varphi ::= \mathrm{tt} \mid \mathrm{ff} \mid a \mid \neg a \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid X\varphi \mid F\varphi \mid G\varphi \mid \varphi U \varphi \mid \varphi W \varphi \mid \varphi R \varphi \mid \varphi M \varphi \qquad (a \in Ap)$
Vf iff w \=
, w).1
Definition 3.1 (the "after" function, cf. [15, 35]). Let $\varphi$ be a formula and $\nu \in 2^{Ap}$. The formula $\mathit{af}(\varphi, \nu)$ ("$\varphi$ after $\nu$") is defined inductively: $\mathit{af}(\mathrm{tt}, \nu) = \mathrm{tt}$, $\mathit{af}(\mathrm{ff}, \nu) = \mathrm{ff}$, $\mathit{af}(a, \nu) = \mathrm{tt}$ if $a \in \nu$ and $\mathrm{ff}$ otherwise, $\mathit{af}(\neg a, \nu) = \mathrm{ff}$ if $a \in \nu$ and $\mathrm{tt}$ otherwise; $\mathit{af}$ distributes over $\wedge$ and $\vee$, $\mathit{af}(X\varphi, \nu) = \varphi$, and the remaining operators are unfolded by one step, e.g. $\mathit{af}(\varphi U \psi, \nu) = \mathit{af}(\psi, \nu) \vee (\mathit{af}(\varphi, \nu) \wedge \varphi U \psi)$. The function is extended to finite words by $\mathit{af}(\varphi, \epsilon) = \varphi$ and $\mathit{af}(\varphi, \nu w) = \mathit{af}(\mathit{af}(\varphi, \nu), w)$.
Example 4.3. Let $\varphi$ be a formula containing the least-fixed-point subformula $a \vee b$ inside a greatest fixed point. The automaton constructed as above checks that the greatest fixed point holds, but it cannot enforce satisfaction of the least-fixed-point formula $a \vee b$.
If only we were given a promise that $a \vee b$ holds infinitely often, then we could conclude that such a run is accepting. We can actually get such promises: for NBA and LDBA via the non-determinism of the automaton, and for DRA via the "non-determinism" of the acceptance condition. In the next section, we investigate how to utilize such promises (Section 5.3) and how to check whether the promises are fulfilled or not (Section 5.4). △
5 The Master Theorem
We present and prove the Master Theorem: A characterization of the words satisfying a given formula from which we can easily extract deterministic, limit-deterministic, and nondeterministic automata of asymptotically optimal size.
We first provide some intuition with the help of an example. Consider the formula
). Even more, we are promised that along the suffix ws the formula cUd never holds any more. How can we use this advice? First, w |='
Lemma 5.3. For every word $w$ there is an index $i \ge 0$ such that for every $k \ge 0$ the suffix $w_{i+k}$ is $\mu$-stable and the suffix $w_{i+k}$ is $\nu$-stable.
Proof. We only prove the $\mu$-stability part; the proof of the other part is similar. Since $\mathcal{GF}_{w_i} \subseteq \mathcal{F}_{w_i}$ for every $i \ge 0$, it suffices to exhibit an index $i$ such that $\mathcal{GF}_{w_{i+k}} \supseteq \mathcal{F}_{w_{i+k}}$ for every $k \ge 0$. If $\mathcal{GF}_w \supseteq \mathcal{F}_w$ then we can choose $i := 0$. So assume $\mathcal{F}_w \setminus \mathcal{GF}_w \neq \emptyset$. By definition, every $\psi \in \mathcal{F}_w \setminus \mathcal{GF}_w$ holds only finitely often along $w$. So for every $\psi \in \mathcal{F}_w \setminus \mathcal{GF}_w$ there exists an index $i_\psi$ such that
$w_{i_\psi + k} \not\models \psi$ for every $k \ge 0$. Let $i := \max\{i_\psi \mid \psi \in \mathcal{F}_w \setminus \mathcal{GF}_w\}$, which exists because $\mathcal{F}_w$ is a finite set. It follows that $\mathcal{GF}_{w_{i+k}} \supseteq \mathcal{F}_{w_{i+k}}$ for every $k \ge 0$, and so every $w_{i+k}$ is $\mu$-stable. □
Example 5.4. Let again $\varphi = Ga \vee bUc$. The word $w' = \{b\}\{c\}\{a\}^\omega$ is neither $\mu$-stable nor $\nu$-stable, but the suffix $\{a\}^\omega$ and all its suffixes are both $\mu$-stable and $\nu$-stable. △
5.2 The formulas $\varphi[X]_\nu$ and $\varphi[Y]_\mu$.
We first introduce $\varphi[X]_\nu$. Assume we have to determine whether a word $w$ satisfies $\varphi$, and we are given a promise $X$: a set of least-fixed-point ($\mu$-)subformulas of $\varphi$ that hold infinitely often along $w$. The formula $\varphi[X]_\nu$ is obtained from $\varphi$ by weakening the $\mu$-subformulas in $X$ to their greatest-fixed-point relatives and replacing the remaining ones by ff. Dually, given a promise $Y$ of greatest-fixed-point ($\nu$-)subformulas that hold from some point on, the formula $\varphi[Y]_\mu$ is inductively defined as follows:

- If $\varphi = G\psi$ then $\varphi[Y]_\mu = \mathrm{tt}$ if $\varphi \in Y$, and $\mathrm{ff}$ otherwise.
- If $\varphi = \psi_1 W \psi_2$ then $\varphi[Y]_\mu = \mathrm{tt}$ if $\varphi \in Y$, and $(\psi_1[Y]_\mu)\,U\,(\psi_2[Y]_\mu)$ otherwise.
- If $\varphi = \psi_1 R \psi_2$ then $\varphi[Y]_\mu = \mathrm{tt}$ if $\varphi \in Y$, and $(\psi_1[Y]_\mu)\,M\,(\psi_2[Y]_\mu)$ otherwise.

For all other operators the mapping is applied recursively to the subformulas, leaving the operator unchanged.
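To illustrate the substitution as reconstructed above, here is a toy sketch (our own tuple-based formula encoding; only the W, R, G cases and the homomorphic default are covered), not the paper's implementation:

```python
TT, FF = ("tt",), ("ff",)

def subst_mu(phi, Y):
    """phi[Y]_mu: nu-operators promised in Y become tt, the remaining ones are
    strengthened (W -> U, R -> M, G -> ff); everything else is recursed into."""
    if isinstance(phi, str) or phi in (TT, FF):
        return phi
    op, *args = phi
    if op in ("W", "R", "G"):
        if phi in Y:                 # promised to hold from some point on
            return TT
        if op == "G":
            return FF
        a1, a2 = (subst_mu(a, Y) for a in args)
        return ("U" if op == "W" else "M", a1, a2)
    return (op, *[subst_mu(a, Y) for a in args])

phi = ("W", "a", "b")
print(subst_mu(phi, Y={phi}))   # ('tt',)
print(subst_mu(phi, Y=set()))   # ('U', 'a', 'b')
```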
Example 5.7. Let
We prove the following stronger statement via structural induction on $\varphi$: for every $i \ge 0$, $(w_i \models \varphi) \rightarrow (w_i \models \varphi[X]_\nu)$.
We consider one representative of the "interesting" cases, and one of the "straightforward" cases.
Case $\varphi = \psi_1 U \psi_2$: let $i \ge 0$ be arbitrary and assume $w_i \models \psi_1 U \psi_2$. Then $\psi_1 U \psi_2 \in \mathcal{F}_{w_i}$ and so $\varphi \in X$. We prove $w_i \models (\psi_1 U \psi_2)[X]_\nu$:

$w_i \models \psi_1 U \psi_2 \implies w_i \models \psi_1 W \psi_2 \implies \forall j.\; w_{i+j} \models \psi_1 \;\vee\; \exists k \le j.\; w_{i+k} \models \psi_2$
$\implies \forall j.\; w_{i+j} \models \psi_1[X]_\nu \;\vee\; \exists k \le j.\; w_{i+k} \models \psi_2[X]_\nu$ (I.H.)
$\implies w_i \models (\psi_1[X]_\nu)\,W\,(\psi_2[X]_\nu)$
$\implies w_i \models (\psi_1 U \psi_2)[X]_\nu$ (Def. 5.5)
Case $\varphi = \psi_1 \vee \psi_2$: let $i \ge 0$ be arbitrary and assume $w_i \models \psi_1 \vee \psi_2$:

$w_i \models \psi_1 \vee \psi_2 \implies (w_i \models \psi_1) \vee (w_i \models \psi_2) \implies (w_i \models \psi_1[X]_\nu) \vee (w_i \models \psi_2[X]_\nu)$ (I.H.)
$\implies w_i \models (\psi_1 \vee \psi_2)[X]_\nu$ (Def. 5.5) □
Lemma 5.8 suggests to decide w |=?
. Indeed, we could easily check correctness of this advice, because $FGGc \in FG(\nu LTL)$, and with its help checking $GF(b \wedge Gc)$ reduces to checking $GF(b \wedge \mathrm{tt}) = GFb$, which is also easy.
One of the main ingredients of our approach is that in order to verify a promise $X \subseteq \mathcal{GF}_w$ we can rely on a promise $Y \subseteq \mathcal{FG}_w$ about subformulas of $X$, and vice versa. There is no circularity in this rely/guarantee reasoning because the subformula order is well founded, and we eventually reach formulas $\psi$ such that $\psi[X]_\nu = \psi$ or $\psi[Y]_\mu = \psi$. This argument is formalized in the next lemma. The first part of the lemma states that mutually assuming correctness of the other promise is correct. The second part states that, loosely speaking, this rely/guarantee method is complete.
Lemma 5.10. Let
For $i > 0$ we consider two cases:
Case 1: $\psi_i \in Y$, i.e., $X_i = X_{i-1}$ and $Y_i \setminus Y_{i-1} = \{\psi_i\}$.
By induction hypothesis and $X_i = X_{i-1}$ we have $X_i \subseteq \mathcal{GF}_w$ and $Y_{i-1} \subseteq \mathcal{FG}_w$. We prove $\psi_i \in \mathcal{FG}_w$, i.e., $w \models FG\psi_i$, in three steps. Claim 1: $\psi_i[X]_\nu = \psi_i[X_i]_\nu$.
By the definition of the $\cdot[\cdot]_\nu$ mapping, $\psi_i[X]_\nu$ is completely determined by the $\mu$-subformulas of $\psi_i$ that belong to $X$. By the definition of the sequence $(X_0, Y_0), \ldots, (X_n, Y_n)$, a $\mu$-subformula of $\psi_i$ belongs to $X$ iff it belongs to $X_i$, and we are done. Claim 2: $X_i \subseteq \mathcal{GF}_{w_k}$ for every $k \ge 0$. Follows immediately from $X_i \subseteq \mathcal{GF}_w$.
Proof of $w \models FG\psi_i$. By the assumption of the lemma we have $w \models FG(\psi_i[X]_\nu)$, and so, by Claim 1, $w \models FG(\psi_i[X_i]_\nu)$. So there exists an index $j$ such that $w_{j+k} \models \psi_i[X_i]_\nu$ for every $k \ge 0$. By Claim 2 we further have $X_i \subseteq \mathcal{GF}_{w_{j+k}}$ for every $j, k \ge 0$. So we can apply part (a2) of Lemma 5.8 to $X_i$, $w_{j+k}$, and $\psi_i$, which yields $w_{j+k} \models \psi_i$ for every $k \ge 0$. So $w \models FG\psi_i$.
Case 2: $\psi_i \in X$, i.e., $X_i \setminus X_{i-1} = \{\psi_i\}$ and $Y_i = Y_{i-1}$.
In this case $X_{i-1} \subseteq \mathcal{GF}_w$ and $Y_i \subseteq \mathcal{FG}_w$. We prove $\psi_i \in \mathcal{GF}_w$, i.e., $w \models GF\psi_i$, in three steps.
Claim 1: $\psi_i[Y]_\mu = \psi_i[Y_i]_\mu$.
The claim is proved as in Case 1.
Claim 2: There is a $j \ge 0$ such that $Y_i \subseteq \mathcal{FG}_{w_k}$ for every $k \ge j$. Follows immediately from $Y_i \subseteq \mathcal{FG}_w$.
Proof of $w \models GF\psi_i$. By the assumption of the lemma we have $w \models GF(\psi_i[Y]_\mu)$. Let $j$ be the index of Claim 2. By Claim 1 we have $w \models GF(\psi_i[Y_i]_\mu)$, and so there exist infinitely many $k \ge j$ such that $w_k \models \psi_i[Y_i]_\mu$. By Claim 2 we further have $Y_i \subseteq \mathcal{FG}_{w_k}$. So we can apply part (b2) of Lemma 5.8 to $Y_i$, $w_k$, and $\psi_i$, which yields $w_k \models \psi_i$ for infinitely many $k \ge j$. So $w \models GF\psi_i$.
(2.) Let $\psi \in \mathcal{GF}_w$. We have $w \models GF\psi$, and so $w_i \models \psi$ for infinitely many $i \ge 0$. Since $\mathcal{FG}_{w_i} = \mathcal{FG}_w$ for every $i \ge 0$, part (b1) of Lemma 5.8 can be applied to $w_i$, $\mathcal{FG}_{w_i}$, and $\psi$. This yields $w_i \models \psi[\mathcal{FG}_w]_\mu$ for infinitely many $i \ge 0$ and thus $w \models GF(\psi[\mathcal{FG}_w]_\mu)$.
Let $\psi \in \mathcal{FG}_w$. Since $w \models FG\psi$, there is an index $j$ such that $w_{j+k} \models \psi$ for every $k \ge 0$. By Lemma 5.3 the index $j$ can be chosen so that it also satisfies $\mathcal{GF}_w = \mathcal{GF}_{w_{j+k}} = \mathcal{F}_{w_{j+k}}$ for every $k \ge 0$. So part (a1) of Lemma 5.8 can be applied to $\mathcal{GF}_{w_{j+k}}$, $w_{j+k}$, and $\psi$. This yields $w_{j+k} \models \psi[\mathcal{GF}_w]_\nu$ for every $k \ge 0$ and thus $w \models FG(\psi[\mathcal{GF}_w]_\nu)$. □
Example 5.11. Let $\varphi = F(a \wedge G(b \vee Fc))$.

For each of the languages $L_{X,Y}$ appearing in the decomposition we construct an automaton with one single Rabin pair. More precisely, for each of these languages we construct either a DBA or a DCA. We then construct a DRA for $L(\varphi)$.
0 there is a DCA with a state ff such that the automaton rejects iff it reaches this state. Intuitively, if the automaton rejects, then it rejects "after finite time". We prove the following lemma:
Lemma 6.1. Let 0.
aft. Wl |= 0MX]v a cp[X]v) v Xl[X]v (I.H.) => wi |= (OAl a WVX)) v Xl)[X]v (Def. 5.5)
=^wiN?lMv (Def. 3.1)
Loosely speaking, $C_{\varphi,X}$ starts by checking $w \models^? \varphi[X]_\nu$. For this it maintains the formula $(\varphi[X]_\nu)_i$ in its state. If the formula becomes ff after, say, $j$ steps, then $w \not\models \varphi[X]_\nu$, and $C_{\varphi,X}$ proceeds to check $w \models^? \varphi_j[X]_\nu$. In order to "switch" to this new problem, $C_{\varphi,X}$ needs to know $\varphi_j$, and therefore also tracks it in its state. In other words, after $j$ steps $C_{\varphi,X}$ is in a state of the form $(\varphi_j, \cdot)$.
For $w = \{c\}\{c\}(\{a\}\{b\})^\omega$ we have $X = \mathcal{GF}_w$; the word is accepted. For $w' = \{c\}^\omega$ we have $X \neq \mathcal{GF}_{w'}$, and the word is rejected. △
A DBA for $L^{GF}_{X,Y}$. We define a DBA recognizing $L\big(\bigwedge_{\psi \in X} GF(\psi[Y]_\mu)\big)$. Observe that $GF(\psi[Y]_\mu) \in GF(\mu LTL)$ for every $\psi \in X$, and that $\psi[Y]_\mu$ has $O(n)$ subformulas. By Proposition 4.2, $L(GF(\psi[Y]_\mu))$ is recognized by a DBA with at most $2^{2^{O(n)}}$ states. Recall that the intersection of the languages of $k$ DBAs with $s_1, \ldots, s_k$ states is recognized by a DBA with $k \cdot \prod_{j=1}^{k} s_j$ states. Since $|X| \le n$, the intersection of the DBAs for the formulas $GF(\psi[Y]_\mu)$ yields a DBA with at most $n \cdot \big(2^{2^{O(n)}}\big)^{n} = 2^{n \cdot 2^{O(n)}} = 2^{2^{O(n)}}$ states.
A DCA for $L^{FG}_{X,Y}$. A DCA recognizing $L\big(\bigwedge_{\psi \in Y} FG(\psi[X]_\nu)\big)$ is obtained dually to the previous case, using $FG(\psi[X]_\nu) \in FG(\nu LTL)$ and Proposition 4.2.
A DRA for $L(\varphi)$.
), of", dnf
), af", dnf
The automaton consists of two components with sets of states $Q_1$ and $Q_2$ given by

$Q_1 = \{(\varphi', 1) \mid \varphi' \in \mathit{Reach}(\varphi)\} \qquad Q_2 = \{(\varphi'[X]_\nu, 2) \mid \varphi' \in \mathit{Reach}(\varphi)\}$

Transitions either stay in the same component, or "jump" from the first component to the second. Transitions that stay in the same component are of the form $(\varphi', i) \xrightarrow{\nu} (\varphi'', i)$ for $\varphi'' \in \mathit{af}_\nu(\varphi', \nu)$ and $i = 1, 2$. "Jumps" are transitions of the form $(\varphi', 1) \xrightarrow{\nu} (\varphi'[X]_\nu, 2)$. Jumping at position $j$ amounts to nondeterministically guessing that the suffix $w_j$ satisfies $\mathit{af}(\varphi, w_{0j})[X]_\nu$. The accepting condition is $\mathit{inf}(Q_2)$. Notice that the state $(\mathrm{ff}, 2)$ does not have any successors. Since $\mathit{Reach}(\varphi)$ has at most $2^n$ states, the automaton has $2^{O(n)}$ states.
An NBA for $L^{GF}_{X,Y}$. As in the case of DRAs, we define an NBA recognizing $L\big(\bigwedge_{\psi \in X} GF(\psi[Y]_\mu)\big)$. To obtain an NBA with $2^{O(n)}$ states we use a well-known trick. Given a set $\{\psi_1, \ldots, \psi_k\}$ of formulas, we have

$\bigwedge_{i=1}^{k} GF\psi_i = GF\big(\psi_1 \wedge F(\psi_2 \wedge F(\psi_3 \wedge \ldots \wedge F(\psi_{k-1} \wedge F\psi_k)\ldots))\big)$

The formula obtained after applying the trick belongs to $GF(\mu LTL)$ and has $O(n)$ $\mu$-subformulas. By Proposition 7.2.2 we can construct an NBA for it with $2^{O(n)}$ states.
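The rewriting itself is purely syntactic; the following sketch (our own string-based encoding of formulas, purely illustrative) builds the nested formula from a list of conjunct formulas.

```python
def nest_gf(formulas):
    """Rewrite /\\_i GF(psi_i) into GF(psi_1 & F(psi_2 & ... & F(psi_k)...)),
    which has a single GF and only linearly many F-subformulas."""
    assert formulas
    nested = formulas[-1]
    for psi in reversed(formulas[:-1]):
        nested = f"({psi} & F({nested}))"
    return f"GF{nested}"

print(nest_gf(["a", "b", "c"]))   # GF(a & F((b & F(c))))
```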
An NBA for $L^{FG}_{X,Y}$. In this case we apply

$\bigwedge_{i=1}^{k} FG\psi_i = FG\Big(\bigwedge_{i=1}^{k} \psi_i\Big)$

and Proposition 7.2.4, yielding an automaton with $2^{O(n)}$ states.
ANBAforL{
0 satisfying
(1') w, |= af
0. For every \j/ e Reach(
^x, Y-Since Reach(
+ 22
22 ' ' states. Recall that the lower
bound for the blowup of a translation of LTL to LDBA is also doubly exponential (see e.g. [35]).
9 Discussion
This paper builds upon our own work [13, 15, 19, 26, 27, 35]. In particular, the notion of stabilization point of a word with respect to a formula, and the idea of using oracle information that is subsequently checked are already present there. The translations of LTL to LDBAs of [23, 24] are based on similar ideas, also with resemblance to obligation sets of [29, 30].
The essential novelty of this paper with respect to the previous work is the introduction of the symmetric mappings $\cdot[\cdot]_\mu$ and $\cdot[\cdot]_\nu$. Applying them to an arbitrary formula
References
Alexandre Duret-Lutz, Alexandre Lewkowicz, Amaury Fauchille, Thibaud Michaud, Etienne Renault, and Laurent Xu. 2016. Spot 2.0 — A Framework for LTL and ω-Automata Manipulation. In ATVA. 122-129.
Javier Esparza and Jan Křetínský. 2014. From LTL to Deterministic Automata: A Safraless Compositional Approach. In CAV. 192-208.
Javier Esparza, Jan Křetínský, Jean-Francois Raskin, and Salomon Sickert. 2017. From LTL and Limit-Deterministic Biichi Automata to Deterministic Parity Automata. In TACAS. 426-442.
Javier Esparza, Jan Křetínský, and Salomon Sickert. 2016. From LTL to deterministic automata - A safraless compositional approach. Formal Methods in System Design 49, 3 (2016), 219-271.
Javier Esparza, Jan Křetínský, and Salomon Sickert. 2018. One Theorem to Rule Them All: A Unified Translation of LTL into -Automata. CoRR abs/1805.00748 (2018). arXiv:1805.00748 http://arxiv.org/abs/1805.00748
Kousha Etessami and Gerard J. Holzmann. 2000. Optimizing Biichi Automata. In CONCUR. 153-167.
Carsten Fritz. 2003. Constructing Biichi Automata from Linear Temporal Logic Using Simulation Relations for Alternating Biichi Automata. In CIAA. 35-48. Andreas Gaiser, Jan Křetínský, and Javier Esparza. 2012. Rabinizer: Small Deterministic Automata for LTL(F,G). In ATVA. 72-76.
Paul Gastin and Denis Oddoux. 2001. Fast LTL to Biichi Automata Translation. In CAV 53-65.
Dimitra Giannakopoulou and Flavio Lerda. 2002. From States to Transitions: Improving Translation of LTL Formulae to Biichi Automata. In FORTE. 308-326. Ernst Moritz Halin, Guangyuan Li, Sven Schewe, Andrea Turrini, and Lijun Zhang. 2015. Lazy Probabilistic Model Checking without Determinisation. In CONCUR. 354-367.
Dileep Kini and Mahesh Viswanathan. 2015. Limit Deterministic and Probabilistic Automata for LTL \ GU. In TACAS. 628-642.
Dileep Kini and Mahesh Viswanathan. 2017. Optimal Translation of LTL to Limit Deterministic Automata. In TACAS. 113-129.
Zuzana Komárkova and Jan Křetínský. 2014. Rabinizer 3: Safraless Translation of LTL to Small Deterministic Automata. In ATVA (LNCS), Vol. 8837. 235-241. Jan Křetínský and Javier Esparza. 2012. Deterministic Automata for the (F,G)-Fragment of LTL. In CAV 7-22.
Jan Křetínský and Ruslán Ledesma-Garza. 2013. Rabinizer 2: Small Deterministic Automata for LTL \ GU. In ATVA. 446-450.
Jan Křetínský, Tobias Meggendorfer, Clara Waldmann, and Maximilian Weininger. 2017. Index Appearance Record for Transforming Rabin Automata into Parity Automata. In TACAS. 443-460.
Jianwen Li, Geguang Pu, Lijun Zhang, Zheng Wang, Jifeng He, and Kim Guld-strand Larsen. 2013. On the Relationship between LTL Normal Forms and Biichi Automata. In Theories of Programming and Formal Methods - Essays Dedicated to Jifeng He on the Occasion of His 70th Birthday. 256-270.
Jianwen Li, Lijun Zhang, Shufang Zhu, Geguang Pu, Moshe Y. Vardi, and Jifeng He. 2018. An explicit transition system construction approach to LTL satisfiability checking. Formal Asp. Comput. 30, 2 (2018), 193-217.
Nir Piterman. 2006. From Nondeterministic Buchi and Streett Automata to Deterministic Parity Automata. In LICS. 255-264. Amir Pnueli. 1977. The Temporal Logic of Programs. In FOCS. 46-57. Shmuel Safra. 1988. On the Complexity of omega-Automata. In FOCS. 319-327. Sven Schewe. 2009. Tighter Bounds for the Determinisation of Biichi Automata. In FoSSaCS. 167-181.
Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan Křetínský. 2016. Limit-Deterministic Biichi Automata for Linear Temporal Logic. In CAV. 312-332. Fabio Somenzi and Roderick Bloem. 2000. Efficient Biichi Automata from LTL Formulae. In CAV 248-263.
Moshe Y. Vardi. 1985. Automatic Verification of Probabilistic Concurrent Finite-State Programs. In FOCS. 327-338.
Moshe Y. Vardi and Pierre Wolper. 1986. An Automata-Theoretic Approach to Automatic Program Verification (Preliminary Report). In LICS. 332-344.
Conditional Value-at-Risk for Reachability and Mean Payoff in
Markov Decision Processes
Jan Kfetinsky Institut für Informatik (17) Technische Universität München Garching bei München, Bavaria, Germany jan.kretinsky@in.tum.de
Abstract
We present the conditional value-at-risk (CVaR) in the context of Markov chains and Markov decision processes with reachability and mean-payoff objectives. CVaR quantifies risk by means of the expectation of the worst p-quantile. As such it can be used to design risk-averse systems. We consider not only CVaR constraints, but also introduce their conjunction with expectation constraints and quantile constraints (value-at-risk, VaR). We derive lower and upper bounds on the computational complexity of the respective decision problems and characterize the structure of the strategies in terms of memory and randomization.
CCS Concepts • Theory of computation —> Verification by model checking;
ACM Reference Format:
Jan Kfetinsky and Tobias Meggendorfer. 2018. Conditional Value-at-Risk for Reachability and Mean Payoff in Markov Decision Processes. In LICS '18: 33rd Annual ACM/IEEE Symposium on Logic in Computer Science, July 9-12, 2018, Oxford, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3209108.3209176
1 Introduction
Markov decision processes (MDP) are a standard formalism for modelling stochastic systems featuring non-determinism. The fundamental problem is to design a strategy resolving the non-deterministic choices so that the systems' behaviour is optimized with respect to a given objective function, or, in the case of multi-objective optimization, to obtain the desired trade-off. The objective function (in the optimization phrasing) or the query (in the decision-problem phrasing) consists of two parts. First, a payoff is a measurable function assigning an outcome to each run of the system. It can be real-valued, such as the long-run average reward (also called mean payoff), or a two-valued predicate, such as reachability. Second, the payoffs for single runs are combined into an overall outcome of the strategy, typically in terms of expectation. The resulting objective function is then for instance the expected long-run average reward, or the probability to reach a given target state.
Risk-averse control aims to overcome one of the main disadvantages of the expectation operator, namely its ignorance towards the incurred risks, intuitively phrased as a question "How bad are the
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. LICS '18, July 9-12, 2018, Oxford, United Kingdom © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5583-4/18/07... $15.00 https://doi.org/10.1145/3209108.3209176
Tobias Meggendorfer Institut für Informatik (17)
Technische Universität München Garching bei München, Bavaria, Germany
tobias.meggendorfer@in.tum.de
Figure 1. Illustration of VaR and CVaR for some random variables.
bad cases?" While the standard deviation (or variance) quantifies the spread of the distribution, it does not focus on the bad cases and thus fails to capture the risk. There are a number of quantities used to deal with this issue:
• The worst-case analysis (in the financial context known as discounted maximum loss) looks at the payoff of the worst possible run. While this makes sense in a fully non-deterministic environment and lies at the heart of verification, in the probabilistic setting it is typically unreasonably pessimistic, taking into account events happening with probability 0, e.g., never tossing head on a fair coin.
• The value-at-risk (VaR) denotes the worst p-quantile for somep € [0,1]. For instance, the value at the 0.5-quantile is the median, the 0.05-quantile (the vigintile or ventile) is the value of the best run among the 5% worst ones. As such it captures the "reasonably possible" worst-case. See Fig. 1 for an example of VaR for two given probability density functions. There has been an extensive effort spent recently on the analysis of MDP with respect to VaR and the re-formulated notions of quantiles, percentiles, thresholds, satisfaction view etc., see below. Although VaR is more realistic, it tends to ignore outliers too much, as seen in Fig. 1 on the right. VaR has been characterized as "seductive, but dangerous" and "not sufficient to control risk" [8].
• The conditional value-at-risk (average value-at-risk, expected shortfall, expected tail loss) answers the question "What to expect in the bad cases?" It is defined as the expectation over all events worse than the value-at-risk, see Fig. 1. As such it describes the lossy tail, taking outliers into account, weighted respectively. In the degenerate cases, CVaR forp = 1 is the expectation and for p = 0 the (probabilistic) worst case. It is an established risk metric in finance, optimization and operations research, e.g. [1, 33], and "is considered to be a more consistent measure of risk" [33]. Recently, it started permeating to areas closer to verification, e.g. robotics [13].
Our contribution In this paper, we investigate optimization of MDP with respect to CVaR as well as the respective trade-offs with expectation and VaR. We study the VaR and CVaR operators for the first time with the payoff functions of weighted reachability and
LICS '18, July 9-12, 2018, Oxford, United Kingdom
Jan Kfetfnsky and Tobias Meggendorfer
mean payoff, which are fundamental in verification. Moreover, we cover both the single-dimensional and the multi-dimensional case.
Particularly, we define CVaR for MDP and show the peculiarities of the concept. Then we study the computational complexity and the strategy complexity for various settings, proving the following:
• The single dimensional case can be solved in polynomial time through linear programming, see Section 5.
• The multi-dimensional case is NP-hard, even for CVaR-only constraints. Weighted reachability is NP-complete and we give PSPACE and EXPSPACE upper bounds for mean payoff with CVaR and expectation constraints, and with additional VaR constraints, respectively, see Section 6. (Note that already for the sole VaR constraints only an exponential algorithm is known; the complexity is an open question and not even NP-hardness is known [15, 32].)
• We characterize the strategy requirements, both in terms of memory, ranging from memoryless, over constant-size to infinite memory, and the required degree of randomization, ranging from fully deterministic strategies to randomizing strategies with stochastic memory update.
While dealing with the CVaR operator, we encountered surprising behaviour, preventing us from trivially adapting the solutions to the expectation and VaR problems:
• Compared to, e.g., expectation and VaR, CVaR does not behave linearly w.r.t. stochastic combination of strategies.
• A conjunction of CVaR constraints already is NP-hard, since it can force a strategy to play deterministically.
1.1 Related work
Worst case Risk-averse approaches optimizing the worst case together with expectation have been considered in beyond-worst-case and beyond-almost-sure analysis investigated in both the single-dimensional [11] and in the multi-dimensional [17] setup.
Quantiles The decision problem related to VaR has been phrased in probabilistic verification mostly in the form "Is the probability that the payoff is higher than a given value threshold more than a given probability threshold?" The total reward gained attention both in the verification community [6, 24, 35] and recently in the AI community [23, 29]. Multi-dimensional percentile queries are considered for various objectives, such as mean-payoff, limsup, liminf, shortest path in [32]; for the specifics of two-dimensional case and their interplay, see [3]. Quantile queries for more complex constraints have also been considered, namely their conjunctions [9, 20], conjunctions with expectations [15] or generally Boolean expressions [25]. Some of these approaches have already been practically applied and found useful by domain experts [4, 5].
CVaR There is a body of work that optimizes CVaR in MDP. However, to the best of our knowledge, all the approaches (1) focus on the single-dimensional case, (2) disregard the expectation, and (3) treat neither reachability nor mean payoff. They focus on the discounted [7], total [13], or immediate [27] reward, as well as extend the results to continuous-time models [26, 30]. This work comes from the area of optimization and operations research, with the notable exception of [13], which focuses on the total reward. Since the total reward generalizes weighted reachability, [13] is related to our work the most. However, it provides only an approximation
solution for the one-dimensional case, neglecting expectation and the respective trade-offs.
Further, CVaR is a topic of high interest in finance, e.g., [8, 33]. The central difference is that there variations of portfolios (i.e. the objective functions) are considered while leaving the underlying random process (the market) unchanged. This is dual to our problem, since we fix the objective function and now search for an optimal random process (or the respective strategy).
Multi-objective expectation In the last decade, MDP have been extensively studied generally in the setting of multiple objectives, which provides some of the necessary tools for our trade-off analysis. Multiple objectives have been considered for both qualitative payoffs, such as reachability and LTL [19], as well as quantitative payoffs, such as mean payoff [9], discounted sum [14], or total reward [22]. Variance has been introduced to the landscape in [10].
2 Preliminaries
Due to space constraints, some proofs and explanations are shortened or omitted when clear and can be found in [28].
2.1 Basic definitions
We mostly follow the definitions of [9, 15]. $\mathbb{N}$, $\mathbb{Q}$, $\mathbb{R}$ are used to denote the sets of positive integers, rational and real numbers, respectively. For $n \in \mathbb{N}$, let $[n] = \{1, \ldots, n\}$. Further, $k_j$ refers to $k \cdot e_j$, where $e_j$ is the unit vector in dimension $j$.
We assume familiarity with basic notions of probability theory, e.g., probability space $(\Omega, \mathcal{F}, \mu)$, random variable $X$, or expected value $\mathbb{E}$. The set of all distributions over a countable set $C$ is denoted by $\mathcal{D}(C)$. Further, $d \in \mathcal{D}(C)$ is Dirac if $d(c) = 1$ for some $c \in C$. To ease notation, for functions yielding a distribution over some set $C$, we may write $f(\cdot, c)$ instead of $f(\cdot)(c)$ for $c \in C$.
Markov chains. A Markov chain (MC) is a tuple $M = (S, \delta, \mu_0)$, where $S$ is a countable set of states¹, $\delta: S \to \mathcal{D}(S)$ is a probabilistic transition function, and $\mu_0 \in \mathcal{D}(S)$ is the initial probability distribution. The SCCs and BSCCs of an MC are denoted by SCC and BSCC, respectively [31].
A run in $M$ is an infinite sequence $\rho = s_1 s_2 \cdots$ of states; we write $\rho_i$ to refer to the $i$-th state $s_i$. A path $\varrho$ in $M$ is a finite prefix of a run $\rho$. Each path $\varrho$ in $M$ determines the set $\mathit{Cone}(\varrho)$ consisting of all runs that start with $\varrho$. To $M$, we associate the usual probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is the set of all runs in $M$, $\mathcal{F}$ is the $\sigma$-field generated by all $\mathit{Cone}(\varrho)$, and $\mathbb{P}$ is the unique probability measure such that $\mathbb{P}(\mathit{Cone}(s_1 \cdots s_k)) = \mu_0(s_1) \cdot \prod_{i=1}^{k-1} \delta(s_i, s_{i+1})$. Furthermore, $\Diamond B$ ($\Diamond\Box B$) denotes the set of runs which eventually reach (eventually remain in) the set $B \subseteq S$, i.e. all runs where $\rho_i \in B$ for some $i$ (there exists an $i_0$ such that $\rho_i \in B$ for all $i \ge i_0$).
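For intuition, the cone probability is simply the product of transition probabilities along the prefix, weighted by the initial distribution; a small sketch (our own toy encoding of a finite Markov chain as dictionaries) is:

```python
def cone_probability(mu0, delta, path):
    """Probability of the cone of a finite path s_1 ... s_k:
    mu0(s_1) * prod_i delta(s_i)(s_{i+1})."""
    prob = mu0.get(path[0], 0.0)
    for s, t in zip(path, path[1:]):
        prob *= delta[s].get(t, 0.0)
    return prob

# Toy chain: from 'a' go to 'b' or 'c' with probability 1/2 each; 'b' and 'c' are absorbing.
mu0 = {"a": 1.0}
delta = {"a": {"b": 0.5, "c": 0.5}, "b": {"b": 1.0}, "c": {"c": 1.0}}
print(cone_probability(mu0, delta, ["a", "b", "b"]))  # 0.5
```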
Markov decision processes. A Markov decision process (MDP) is a tuple $\mathcal{M} = (S, A, \mathit{Av}, \Delta, s_0)$ where $S$ is a finite set of states, $A$ is a finite set of actions, $\mathit{Av}: S \to 2^A \setminus \{\emptyset\}$ assigns to each state $s$ the set $\mathit{Av}(s)$ of actions enabled in $s$ so that $(\mathit{Av}(s) \mid s \in S)$ is a partitioning of $A$², $\Delta: A \to \mathcal{D}(S)$ is a probabilistic transition function that given an action $a$ yields a probability distribution over the successor states, and $s_0$ is the initial state of the system.
1 We allow the state set to be countable for the formal definition of strategies on MDP. When dealing with Markov Chains in queries, we only consider finite state sets. 2In other words, each action is associated with exactly one state.
A run $\rho$ of $\mathcal{M}$ is an infinite alternating sequence of states and actions $\rho = s_1 a_1 s_2 a_2 \cdots$ such that for all $i \ge 1$, we have $a_i \in \mathit{Av}(s_i)$ and $\Delta(a_i, s_{i+1}) > 0$. Again, $\rho_i$ refers to the $i$-th state visited by this particular run. A path of length $k$ in $\mathcal{M}$ is a finite prefix $\varrho = s_1 a_1 \cdots a_{k-1} s_k$ of a run in $\mathcal{M}$.
Strategies and plays. Intuitively, a strategy in an MDP $\mathcal{M}$ is a "recipe" to choose actions based on the observed events. Usually, a strategy is defined as a function $\sigma: (SA)^*S \to \mathcal{D}(A)$ that given a finite path $\varrho$, representing the history of a play, gives a probability distribution over the actions enabled in the last state. We adopt the slightly different, though equivalent [9, Sec. 6] definition from [15], which is more convenient for our setting.
Let $\mathsf{M}$ be a countable set of memory elements. A strategy is a triple $\sigma = (\sigma_u, \sigma_n, \alpha)$, where $\sigma_u: A \times \mathsf{M} \to \mathcal{D}(\mathsf{M})$ and $\sigma_n: S \times \mathsf{M} \to \mathcal{D}(A)$ are memory update and next move functions, respectively, and $\alpha \in \mathcal{D}(\mathsf{M})$ is the initial memory distribution. We require that, for all $(s, m) \in S \times \mathsf{M}$, the distribution $\sigma_n(s, m)$ assigns positive values only to actions available at $s$, i.e. $\mathrm{supp}\,\sigma_n(s, m) \subseteq \mathit{Av}(s)$.
A play of $\mathcal{M}$ determined by a strategy $\sigma$ is a Markov chain $\mathcal{M}^\sigma = (S^\sigma, \delta^\sigma, \mu_0^\sigma)$, where the set of states is $S^\sigma = S \times \mathsf{M} \times A$, the initial distribution $\mu_0^\sigma$ is zero except for $\mu_0^\sigma(s_0, m, a) = \alpha(m) \cdot \sigma_n(s_0, m)(a)$, and the transition function reflects both the dynamics of $\mathcal{M}$ and the updates of the strategy. The induced probability measure is denoted by $\mathbb{P}^\sigma$, and the expected value of a random variable $X: \Omega \to \mathbb{R}$ is $\mathbb{E}^\sigma[X] = \int_\Omega X \, d\mathbb{P}^\sigma$.
A convex combination of two strategies is defined in the natural way. End components. A pair $(T, B)$ with $\emptyset \neq T \subseteq S$ and $B \subseteq \bigcup_{t \in T} \mathit{Av}(t)$ is an end component if (i) $a \in B$ and $\Delta(a)(s') > 0$ implies $s' \in T$; and (ii) for all states $s, t \in T$ there is a path $\varrho = s_1 a_1 \cdots a_{k-1} s_k \in (TB)^{k-1}T$ with $s_1 = s$, $s_k = t$.
An end component $(T, B)$ is a maximal end component (MEC) if $T$ and $B$ are maximal with respect to subset ordering. Given an MDP, the set of MECs is denoted by MEC. By abuse of notation, $s \in M$ refers to all states of a MEC $M$, while $a \in M$ refers to the actions.
Remark 1. Computing the maximal end component (MEC) decomposition of an MDP, i.e. the computation of MEC, is in P [18].
Remark 2. For any MDP $\mathcal{M}$ and strategy $\sigma$, a run almost surely eventually stays in one MEC, i.e. $\mathbb{P}^\sigma[\bigcup_{M_i \in \mathrm{MEC}} \Diamond\Box M_i] = 1$ [31].
2.2 Random variables on Runs
We introduce two standard random variables, assigning a value to each run of a Markov Chain or Markov Decision Process.
Weighted reachability. Let $T \subseteq S$ be a set of target states and $r: T \to \mathbb{Q}$ be a reward function. Define the random variable $R^r$ as $R^r(\rho) = r(\rho_i)$ for the minimal $i$ with $\rho_i \in T$, if such an $i$ exists, and 0 otherwise. Informally, $R^r$ assigns to each run the value of the first visited target state, or 0 if none. $R^r$ is measurable and discrete, as $S$ is finite [31]. Whenever we are dealing with weighted reachability, we assume w.l.o.g. that all target states are absorbing, i.e. for any $s \in T$ we have $\delta(s, s) = 1$ for MC and $\Delta(a, s) = 1$ for all $a \in \mathit{Av}(s)$ for MDP.
Mean payoff (also known as long-run average reward, and limit average reward). Again, let $r: S \to \mathbb{Q}$ be a reward function. The mean payoff of a run $\rho$ is the average reward obtained per step, i.e. $R^m(\rho) = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} r(\rho_i)$. The liminf is necessary, since the limit may not be defined in general. Further, $R^m$ is measurable [31].
Remark 3. There are several distinct definitions of "weighted reachability". The one chosen here primarily serves as foundation for the more general mean payoff.
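As a concrete illustration of the two payoff functions, the following sketch (our own; it evaluates them on a finite prefix of a run and approximates the liminf by the average over that prefix) computes weighted reachability and an empirical mean payoff:

```python
def weighted_reachability(run_prefix, targets, r):
    """Reward of the first visited target state, or 0 if no target occurs in the prefix."""
    for s in run_prefix:
        if s in targets:
            return r[s]
    return 0

def empirical_mean_payoff(run_prefix, r):
    """Average reward per step over a finite prefix (approximates the liminf for long prefixes)."""
    return sum(r[s] for s in run_prefix) / len(run_prefix)

r = {"s0": 0, "good": 3, "bad": 1}
run = ["s0", "s0"] + ["good"] * 8
print(weighted_reachability(run, targets={"good", "bad"}, r=r))  # 3
print(empirical_mean_payoff(run, r))                             # 2.4
```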
3 Introducing the Conditional Value-at-risk
In order to define our problem, we first introduce the general concept of conditional value-at-risk (CVaR), also known as average value-at-risk, expected shortfall, and expected tail loss. As already hinted, the CVaR of some real-valued random variable $X$ and probability $p \in [0,1]$ intuitively is the expectation below the worst $p$-quantile of $X$.
Let $X: \Omega \to \mathbb{R}$ be a random variable over the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The associated cumulative distribution function (CDF) $F_X: \mathbb{R} \to [0,1]$ of $X$ yields the probability of $X$ being less than or equal to a given value $r$, i.e. $F_X(r) = \mathbb{P}(\{X \le r\})$. The value-at-risk $\mathrm{VaR}_p(X)$ is the worst $p$-quantile of $X$. A naive definition of the conditional value-at-risk would then be $\frac{1}{p}\int_{(-\infty, \mathrm{VaR}_p(X))} x \,\mathrm{d}F_X$, i.e. the normalized expectation below the value-at-risk (possibly with $F_X(\mathrm{VaR}_p(X)) > p$). Unfortunately, this would lead to some complications later on. See [28, Sec. A.1] for details.
Figure 2. Distribution showing peculiarities of CVaR
Example 3.1. Consider a random variable $X$ with a distribution as outlined in Fig. 2. For $p < \frac{1}{2}$, we certainly have $\mathrm{VaR}_p = 2p$. On the other hand, for any $p \in (\frac{1}{2}, 1)$, we get $\mathrm{VaR}_p = 2$. Consequently, the integral remains constant and $\mathrm{CVaR}_p$ would actually decrease for increasing $p$, not matching the intuition. △
General definition. As seen in Ex. 3.1, the previous definition breaks down when $F_X$ is not continuous at the $p$-quantile and consequently $F_X(\mathrm{VaR}_p(X)) > p$. Thus, we handle the values at the threshold separately, similar to [34].
Definition 3.2. Let $X$ be some random variable and $p \in [0,1]$. With $v = \mathrm{VaR}_p(X)$, the CVaR of $X$ is defined as

$\mathrm{CVaR}_p(X) := \frac{1}{p}\left(\int_{(-\infty, v)} x \,\mathrm{d}F_X + (p - \mathbb{P}[X < v]) \cdot v\right),$

which can be rewritten as

$\mathrm{CVaR}_p(X) = \frac{1}{p}\left(\mathbb{P}[X < v] \cdot \mathbb{E}[X \mid X < v] + (p - \mathbb{P}[X < v]) \cdot v\right).$

The corner cases again are $\mathrm{CVaR}_0 := \mathrm{VaR}_0$, and $\mathrm{CVaR}_1 = \mathbb{E}$.
Since the degenerate cases of p = 0 andp = 1 reduce to already known problems, we exclude them in the following.
We demonstrate this definition on the previous example.
Example 3.3. Again, consider the random variable $X$ from Ex. 3.1. For $\frac{1}{2} < p < 1$ we have that $\mathbb{P}[X < \mathrm{VaR}_p(X)] = \mathbb{P}[X < 2] = \frac{1}{2}$. The right hand side of the definition, $(p - \mathbb{P}[X < \mathrm{VaR}_p(X)]) = p - \frac{1}{2}$, captures the remaining discrete probability mass which we have to handle separately. Together with $\int_{(-\infty, 2)} x \,\mathrm{d}F_X = \frac{1}{4}$ we get

$\mathrm{CVaR}_p(X) = \frac{1}{p}\left(\frac{1}{4} + \left(p - \frac{1}{2}\right) \cdot 2\right) = 2 - \frac{3}{4p}.$

For example, with $p = \frac{3}{4}$ this yields the expected result $\mathrm{CVaR}_p(X) = 1$. △
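For a discrete random variable, Definition 3.2 can be evaluated directly. The sketch below is our own illustration for a finitely-supported distribution given as value-probability pairs; taking VaR as the smallest value $v$ with $\mathbb{P}[X \le v] \ge p$ is an assumption of this sketch, not the paper's definition.

```python
def var_cvar(dist, p):
    """VaR_p and CVaR_p of a finitely-supported random variable.

    dist : list of (value, probability) pairs, probabilities summing to 1
    p    : probability in (0, 1)
    """
    outcomes = sorted(dist)
    acc = 0.0
    for v, q in outcomes:          # VaR: smallest value v with P[X <= v] >= p
        acc += q
        if acc >= p:
            var = v
            break
    below = sum(q for x, q in outcomes if x < var)            # P[X < v]
    partial = sum(x * q for x, q in outcomes if x < var)      # integral over (-inf, v)
    cvar = (partial + (p - below) * var) / p                  # Definition 3.2
    return var, cvar

# Fair coin paying 0 or 10: the worst 50% has expectation 0, the worst 75% averages 10/3.
print(var_cvar([(0, 0.5), (10, 0.5)], p=0.5))   # (0, 0.0)
print(var_cvar([(0, 0.5), (10, 0.5)], p=0.75))  # (10, 3.33...)
```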
Remark 4. Recall that $\mathbb{P}[X < r]$ can be expressed as the left limit of $F_X$, namely $\mathbb{P}[X < r] = \lim_{r' \to r^-} F_X(r')$. Hence, $\mathrm{CVaR}_p(X)$ solely depends on the CDF of $X$ and thus random variables with the same CDF also have the same CVaR.
We say that $F_1$ stochastically dominates $F_2$, for two CDFs $F_1$ and $F_2$, if $F_1(r) \le F_2(r)$ for all $r$. Intuitively, this means that a sample drawn from $F_1$ is likely to be larger than or equal to a sample drawn from $F_2$. All three investigated operators ($\mathbb{E}$, CVaR, and VaR) are monotone w.r.t. stochastic dominance [28, Sec. A.1].
4 CVaR in MC and MDP: Problem statement
Now, we are ready to define our problem framework. First, we explain the types of building blocks for our queries, namely lower bounds on expectation, CVaR, and VaR. Formally, we consider the following types of constraints.
Formally, a query is given by an objective, a dimension $d$, a reward function $\mathbf{r} \in \mathbb{Q}^d$, and constraints from $\mathsf{crit}$, given by vectors $\mathbf{e}, \mathbf{c}, \mathbf{v} \in (\mathbb{Q} \cup \{\bot\})^d$ and $\mathbf{p}, \mathbf{q} \in (0, 1)^d$. This implies that in each dimension there is at most one constraint per type. The presented methods can easily be extended to the more general setting of multiple constraints of a particular type in one dimension. The decision problem is to determine whether there exists a strategy $\sigma$ such that all constraints are met.
Technically, this is defined as follows. Let $X$ be the $d$-dimensional random variable induced by the objective obj and reward function $\mathbf{r}$, operating on the probability space of $\mathcal{M}^\sigma$. The strategy $\sigma$ is a witness to the query iff for each dimension $j \in [d]$ we have that $\mathbb{E}[X_j] \ge e_j$, $\mathrm{CVaR}_{p_j}(X_j) \ge c_j$, and $\mathrm{VaR}_{q_j}(X_j) \ge v_j$. Moreover, $\bot$ constraints are trivially satisfied.
For completeness' sake, we also consider $\mathsf{MC}^{\mathsf{crit}}_{\mathsf{obj},\mathsf{dim}}$ queries, i.e. the corresponding problem on (finite-state) Markov chains.
Notation. We introduce the following abbreviations. When dealing with an MDP $\mathcal{M}$, $\mathrm{CVaR}^\sigma_p$ denotes $\mathrm{CVaR}_p$ relative to the probability space over runs induced by the strategy $\sigma$, and analogously for $\mathbb{E}^\sigma$ and $\mathrm{VaR}^\sigma_q$. In the example, the constraints are met since $\mathrm{CVaR}_p = 2.5 > c$ and $\mathrm{VaR}_q = 5 > v$. △
Hence strategies satisfying an expectation constraint together with either a CVaR or VaR constraint may necessarily involve randomization in general. We prove that (i) under mild assumptions randomization actually is sufficient, i.e. no memory is required, and (ii) fixed memory may additionally be required in general.
Definition 5.3. Let $\mathcal{M}$ be an MDP with target set $T$ and reward function $r$. We say that $\mathcal{M}$ satisfies the attraction assumption if A1) the target set $T$ is reached almost surely under any strategy, or A2) for all target states $s \in T$ we have $r(s) > 0$.
Essentially, this definition implies that an optimal strategy never remains in a non-target MEC. This allows us to design memoryless strategies for the weighted reachability problem.
Theorem 5.4. Memoryless randomizing strategies are sufficient for $\mathsf{MDP}^{\{\mathbb{E}, \mathrm{VaR}, \mathrm{CVaR}\}}_{r,\mathrm{single}}$ under the attraction assumption.
Proof. Fix an MDP $\mathcal{M}$ and reward function $r$. We prove that for any strategy $\sigma$ there exists a memoryless, randomizing strategy $\sigma'$ achieving at least the expectation, VaR, and CVaR of $\sigma$.
All target states $t_i \in T$ form single-state MECs, as we assumed that all target states are absorbing. Consequently, $\sigma$ naturally induces a distribution over these $t_i$. Now, we apply [19, Theorem 3.2] to obtain a strategy $\sigma'$ with $\mathbb{P}^{\sigma'}[\Diamond t_i] \ge \mathbb{P}^{\sigma}[\Diamond t_i]$ for all $i$.
With A1), we have $\sum_i p_i = 1$ and thus $\mathbb{P}^{\sigma'}[\Diamond t_i] = \mathbb{P}^{\sigma}[\Diamond t_i]$. Hence, $\sigma'$ obtains the same CDF for the weighted reachability objective. Under A2), the CDF $F'$ of strategy $\sigma'$ stochastically dominates the CDF $F$ of the original strategy $\sigma$, and the claim follows by monotonicity of the three operators w.r.t. stochastic dominance. □
Figure 4. LP used to decide weighted reachability queries given a guess $t$ of $\mathrm{VaR}_p$. $T^{\sim} := \{s \in T \mid r(s) \sim t\}$ for $\sim \in \{<, =, \le\}$.
Inspired by [15, Fig. 3], we use the optimality result from Thm. 5.4 to derive a decision procedure for weighted reachability queries under the attraction assumptions based on the LP in Fig. 4.
To simplify the LP, we make further assumptions - see [28, Sec. A.2] for details. First, all MECs, including non-target ones, consist of a single state. Second, all MECs from which T is not reachable are considered part of T and have r = 0 (similar to the "cleaned-up MDP" from [19]). Finally, we assume that the quantile-probabilities are equal, i.e. p = q. The LP can easily be extended to account for different values by duplicating the x_s variables and adding according constraints.
The central idea is to characterize randomizing strategies by the "flow" they achieve. To this end, Equality (2) essentially models Kirchhoff's law, i.e. inflow and outflow of a state have to be equal. In particular, $y_a$ expresses the transient flow of the strategy as the expected total number of uses of action $a$. Similarly, $x_s$ models the recurrent flow, which under our absorption assumption equals the probability of reaching $s$. Equality (3) ensures that all transient behaviour eventually changes into recurrent behaviour.
In order to deal with our query constraints, Constraints (4) and (5) extract the worst $p$ fraction of the recurrent flow, ensuring that $\mathrm{VaR}_p$ is at least $t$. Note that equality is not guaranteed by the LP; if $\underline{x}_s = x_s$ for all $s \in T^{<}$, we have $\mathrm{VaR}_p > t$. Finally, Inequality (6) enforces satisfaction of the constraints.
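Since the LP itself is only given in Fig. 4, the following sketch is merely a minimal illustration of the flow idea for plain expected weighted reachability (no VaR/CVaR splitting): recurrent flow variables per target state, transient flow variables per action, Kirchhoff-style conservation, and maximization of the expected reward, solved with scipy. The toy MDP and all names are our own assumptions, not the paper's LP.

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP: state s0 with actions a (-> t1) and b (-> t1 w.p. 0.3, t2 w.p. 0.7);
# targets t1, t2 are absorbing with rewards r(t1) = 1, r(t2) = 5.
# Variables: [y_a, y_b, x_t1, x_t2] -- transient flow per action, recurrent flow per target.
rewards = np.array([1.0, 5.0])

c = np.concatenate([np.zeros(2), -rewards])          # maximize expected reward
A_eq = np.array([
    [1.0, 1.0, 0.0, 0.0],     # flow out of s0 equals the initial "inflow" of 1
    [-1.0, -0.3, 1.0, 0.0],   # recurrent flow into t1 = 1.0*y_a + 0.3*y_b
    [0.0, -0.7, 0.0, 1.0],    # recurrent flow into t2 = 0.7*y_b
])
b_eq = np.array([1.0, 0.0, 0.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4, method="highs")
y_a, y_b, x_t1, x_t2 = res.x
print(f"play b with probability {y_b:.2f}, expected reward {-res.fun:.2f}")  # 1.00, 3.80
```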
Theorem 5.6. Let M be an MDP with target states T and reward function r, satisfying the attraction assumption. Fix the constraint probability p ∈ (0, 1) and thresholds e, c ∈ ℚ. Then, we have that
1. for any strategy σ satisfying the constraints, there is a t ∈ r(S) such that the LP in Fig. 4 is feasible, and
2. for any threshold t ∈ r(S), a solution of the LP in Fig. 4 induces a memoryless, randomizing strategy σ satisfying the constraints and VaR_p ≥ t.
Proof. First, we prove for a strategy σ satisfying the constraints that there exists a t ∈ r(S) such that the LP is feasible. By Thm. 5.4, we may assume that it is a memoryless randomizing strategy. From [19, Theorem 3.2], we get an assignment to the y_a's and x_s's satisfying Equalities (1), (2), and (3) such that P^σ[◇s] = x_s for all target states
s ∈ T. Further, let v = VaR_p be the value-at-risk of the strategy. By definition of VaR, we have that P^σ[X < v] ≤ p.
Assume for now that P^σ[X < v] = p, i.e. the probability of obtaining a value strictly smaller than v is exactly p. In this case, choose t to be the next smaller reward, i.e. t = max{r(s) | r(s) < v}. We set x̲_s = x_s for all s ∈ T^<, satisfying Constraints (4) and (5).
Otherwise, we have P^σ[X < v] < p. Now, some non-zero fraction of the probability mass at v contributes to the CVaR. Again, we set the values of x̲_s according to Constraint (4). The only degree of freedom are the values of x̲_s where r(s) = t. There, we assign the values so that Σ_{s∈T^=} x̲_s = p − Σ_{s∈T^<} x_s, satisfying Equality (5).
It remains to check Inequality (6). For expectation, we have Σ_{s∈T} x_s · r(s) = Σ_{s∈T} P^σ[◇s] · r(s) = E^σ[R^r] ≥ e. For CVaR, notice that, due to the already proven Constraints (4) and (5), the respective side of Inequality (6) is equal to CVaR_p and thus at least c.
Second, we prove that a solution to the LP induces the desired strategy σ. Again by [19, Theorem 3.2], we get a memoryless randomizing strategy σ such that P^σ[◇s] = x_s for all states s ∈ T. Then E^σ[R^r] = Σ_{s∈T} P^σ[◇s] · r(s) = Σ_{s∈T} x_s · r(s) ≥ e. Further,
CVaR_p(R^r) = (1/p) · ( Σ_{s∈T^<} x_s · r(s) + (p − Σ_{s∈T^<} x_s) · v )
by definition. Now, we make a case distinction on whether x̲_s = x_s for all s ∈ T^=. If this is true, we have v = VaR_p = min{r ∈ r(S) | r > t} but P^σ[X < v] = p, and consequently T^< = {s ∈ T | r(s) < v}. [...] (a ↦ …, b ↦ …) and play the corresponding action indefinitely, satisfying the constraints. ◁
We prove that this bound actually is tight, i.e. that, given stochastic memory update, two memory elements are sufficient.
Theorem 5.10. Two-memory stochastic strategies (i.e. with both randomization and stochastic update) are sufficient for MDP^{E,VaR,CVaR}_{r/m,single}.
Proof. Let σ be a strategy on an MDP M with reward function r. We construct a two-memory stochastic strategy σ' achieving at least the expectation, VaR, and CVaR of σ.
First, we obtain a memoryless deterministic strategy [...] c / log(1 − 2^{−n}) ∈ O(2^n) memory elements to count the number of steps. The same holds true for any deterministic-update strategy.
On the other hand, a strategy with stochastic memory update can encode this counting by switching its state with a small probability after each step. For example, a strategy switching with probability p = … from "play b" to "play a" satisfies the constraint. ◁
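To illustrate the effect (with a hypothetical switching probability rather than the constants of the above example), the following sketch simulates such a stochastic memory update: the memory flips from the mode "play b" to the mode "play a" with probability q in every step, so the switch happens after 1/q steps in expectation and no explicit counter over exponentially many memory elements is needed.

# Minimal sketch (assumed switching probability q): a stochastic-update memory
# moving from mode "play b" to mode "play a" with probability q per step,
# which counts roughly 1/q steps in expectation without explicit counters.
import random

def average_switch_time(q, runs=100_000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        steps = 1
        while rng.random() >= q:     # memory stays in mode "play b"
            steps += 1
        total += steps               # memory has switched to mode "play a"
    return total / runs

q = 1 / 64                           # hypothetical value; expected switch time is 1/q = 64
print(average_switch_time(q))        # empirically close to 64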
5.3 Single constraint queries
In this section, we discuss an important sub-case of the single-dimensional case, namely queries with only a single constraint, i.e. |crit| = 1. We show that deterministic memoryless strategies are sufficient in this case.
One might be tempted to use standard arguments and directly conclude this from the results of Thm. 5.4 as follows. Recall that this theorem shows that memoryless, randomizing strategies are sufficient, and that any such strategy can be written as a finite convex combination of memoryless, deterministic strategies. Most constraints, for example expectation or reachability, behave linearly under convex combinations of strategies, e.g. E^{σ_λ}[X] = λ · E^{σ_1}[X] + (1 − λ) · E^{σ_2}[X]. [...] > 5 for p = … + 0.5 · 0.2. Since all other gadgets yield 0 in dimension m and only half of the runs going through x_m end up in the shaded area, this corresponds to Ex. 5.14, where p = 0.2.
Once in either state x_m or x̄_m, a state c_n corresponding to a clause C_n satisfied by this assignment is chosen uniformly. In the example gadget, we would have x_m ∈ C_{n_1} ∩ C_{n_2} and x̄_m ∈ C_{n_3}. We set the reward of c_n to 1_n. Then a clause C_n is satisfied under the assignment if the state c_n is visited with positive probability, e.g. if CVaR_1 ≥ … · 0.5 · …. Clearly, a satisfying assignment exists iff a strategy satisfying these constraints exists. □
6.2 NP-completeness and strategies for reachability
For weighted reachability we prove that the previously presented bound is tight, i.e. that the weighted reachability problem with multiple dimensions and CVaR constraints is NP-complete when d is part of the input and in P otherwise. First, we show that the strategy bounds of the single-dimensional case directly transfer. Intuitively this is the case since only the steady-state distribution over the target set T is relevant, independently of the dimensionality.
Theorem 6.3. Two-memory stochastic strategies (i.e. with both randomization and stochastic update) are sufficient for MDP^{E,VaR,CVaR}_{r,multi}. Moreover, if r_j(s) > 0 for all s ∈ T and j ∈ [d], then memoryless randomizing strategies are sufficient.
Proof. Follows directly from the reasoning used in the proofs of Thm. 5.10 and Thm. 5.4. □
(1) All variables y_a, x_s, x̲_s^j are non-negative.
(4) VaR-consistent split for j ∈ [d]:
x̲_s^j = x_s for s ∈ T_j^<     x̲_s^j ≤ x_s for s ∈ T_j^=
(5) Probability-consistent split for j ∈ [d]:
Σ_{s∈T_j^≤} x̲_s^j = p_j
(6) CVaR and expectation satisfaction for j ∈ [d]:
Σ_{s∈T_j^≤} x̲_s^j · r_j(s) ≥ p_j · c_j     Σ_{s∈T} x_s · r_j(s) ≥ e_j
Figure 8. LP used to decide multi-dimensional weighted reachability queries given a guess t of VaR_{p_j}. Equalities (2) and (3) are as in Fig. 4, T_j^~ := {s ∈ T | r_j(s) ~ t_j}, ~ ∈ {<, =, ≤}.
Theorem 6.4. MDP^{E,VaR,CVaR}_{r,multi} is in NP if d is a part of the input; moreover, it is in P for any fixed d.
Proof sketch. To prove containment, we guess the VaR threshold vector t out of the set of potential ones, namely {r | ∃ i ∈ [d], s ∈ T. r_i(s) = r}^d, and use an LP to verify the solution. We again assume that each MEC can reach the target set and is single-state, as we did for Fig. 4. The arguments used to resolve this assumption are still applicable in the multi-dimensional setting. The LP consists of the flow Equalities (2) and (3) from the LP in Fig. 4 together with the modified (In)Equalities (4)-(6) as shown in Fig. 8.
The difference is that we extract the worst fraction of the flow in each dimension. Consequently, we have d instances of each x̲_s variable, namely x̲_s^j. The number of possible guesses t is bounded by |T|^d and thus the guess is of polynomial length. For a fixed d the bound itself is polynomial and hence, as previously, we can try out all vectors. □
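The enumeration underlying the containment argument can be made concrete as follows; the target set, its rewards and the LP oracle below are purely hypothetical placeholders, the point being that for fixed d only polynomially many threshold vectors have to be checked.

# Minimal sketch (toy data, assumed LP oracle): enumerate the candidate VaR
# threshold vectors t for fixed dimension d, as in the proof of Thm. 6.4.
from itertools import product

target_rewards = {"s1": (0.0, 2.0), "s2": (1.0, 0.0), "s3": (3.0, 5.0)}   # hypothetical r_j(s)
d = 2

candidates = sorted({r for vec in target_rewards.values() for r in vec})  # rewards occurring anywhere

def lp_feasible(t):
    return False   # placeholder for solving the LP of Fig. 8 with guess t (assumed oracle)

guesses = list(product(candidates, repeat=d))   # polynomially many vectors for fixed d
print(len(guesses))                             # here 5**2 = 25 candidate vectors
feasible = [t for t in guesses if lp_feasible(t)]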
6.3 Upper bounds of mean payoff
In this section, we provide an upper bound on the complexity of mean-payoff queries. Strategies in this context are known to have higher complexity.
Proposition 6.5 ([9]). Infinite memory is necessary for MDP^{E}_{m,multi}.
Note that this directly transfers to MDP^{CVaR}_{m,multi}, as CVaR_1 = E.
However, closing gaps between lower and upper bounds for the mean-payoff objective is notoriously more difficult. For instance, MDP^{VaR}_{m,multi} is known to be in EXP, but not even known to be NP-hard, and neither is MDP^{E,VaR}_{m,multi}. Since we have proven that MDP^{CVaR}_{m,multi} is NP-hard, we can expect that obtaining the matching NP upper bound will be yet more difficult. The fundamental difference of the multi-dimensional mean-payoff case is that the solutions within MECs cannot be pre-computed; rather, non-trivial trade-offs must be considered. Moreover, the trade-offs are not "local" and must be synchronized over all the MECs, see [15] for details.
We now observe that, as opposed to quantile queries, i.e. VaR constraints, the behaviour inside each MEC can be assumed to be quite simple. Our results primarily rely on [16] and use a similar notation. In particular, given a run ρ, Freq_a(ρ) yields the average frequency of action a, i.e. Freq_a(ρ) := lim inf_{n→∞} (1/n) · Σ_{t=1}^{n} 1_a(a_t), where a_t refers to the action taken by ρ in step t.
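On a finite prefix of a run, the quantity that the lim inf refers to can be approximated by the empirical action frequencies, as in the following sketch with a hypothetical prefix.

# Minimal sketch (hypothetical run prefix): empirical average frequency of each
# action over the first n steps, approximating Freq_a(rho).
from collections import Counter

def empirical_frequencies(action_prefix):
    n = len(action_prefix)
    counts = Counter(action_prefix)
    return {a: counts[a] / n for a in counts}

prefix = ["a", "b", "a", "a", "b", "a"]    # assumed actions a_1, ..., a_6 of some run
print(empirical_frequencies(prefix))       # {'a': 0.666..., 'b': 0.333...}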
Definition 6.6. A strategy σ is MEC-constant if for all M_i ∈ MEC with P^σ[◇□M_i] > 0 and all j ∈ [d] there is a v ∈ ℝ such that P^σ[R^m_j = v | ◇□M_i] = 1.
Lemma 6.7. MEC-constant strategies are sufficient for MDP^{E,CVaR}_{m,multi}.
Proof. Fix an MDP M with MECs MEC = {M_1, ..., M_n}, reward function r, and a strategy σ. Further, define p_i = P^σ[◇□M_i]. We construct a strategy σ' so that (i) P^{σ'}[◇□M_i] = p_i for all M_i, and (ii) all behaviours of σ on a MEC M_i are "mixed" into each run on M_i, making it MEC-constant.
We first define the mixing strategies [...] E^σ[Freq_a | ◇□M_i] by construction. Consequently, E^{σ'}[R^m] ≥ E^σ[R^m].
Since σ' is MEC-constant, we have CVaR^{σ'}_p(R^m | ◇□M_i) = E^{σ'}[R^m | ◇□M_i]. Further, by E^σ[Freq_a | ◇□M_i] · p_i ≤ E^{σ'}[Freq_a] for all a, we get E^σ[R^m | ◇□M_i] ≤ E^{σ'}[R^m | ◇□M_i]. So, CVaR^{σ'}_p(R^m | ◇□M_i) = E^{σ'}[R^m | ◇□M_i] ≥ E^σ[R^m | ◇□M_i] ≥ CVaR^σ_p(R^m | ◇□M_i), as CVaR ≤ E.
Finally, we apply this inequality together with property (i), obtaining CVaR^σ_p(R^m) ≤ CVaR^{σ'}_p(R^m) by [28, Thm. A.4]. □
We utilize this structural property to design a linear program for these constraints. However, similarly to the previously considered LPs, it relies on knowing the VaR for each CVaRp constraint. Due to the non-linear behaviour of CVaR, the classical techniques do not allow us to conclude that VaR is polynomially sized and thus we do not present the "matching" NP upper bound, but a PSPACE upper bound, which we achieve as follows.
Theorem 6.8. MDP^{E,CVaR}_{m,multi} is in PSPACE.
Proof sketch. We use the existential theory of the reals, which is NP-hard and in PSPACE [12], to encode our problem. The VaR vector t is existentially quantified and the formula encodes a polynomially sized program with constraints linear in the VaRs and linear in the remaining variables. This shows the complexity result.
The details of the procedure are as follows. For each j ∈ [d], we use the existential theory of the reals to guess the achieved VaR t_j = VaR_{p_j}. Further, we non-deterministically obtain the following polynomially-sized information (or deterministically try out all options in PSPACE). For each j ∈ [d] and each MEC M_i, we guess whether the value achieved in M_i is at most (denoted M_i ∈ MEC_j^≤) or above (denoted M_i ∈ MEC_j^>) the respective t_j, and exactly one MEC M_j^= which achieves a value equal to it. Given these guesses, we check whether the LP in Fig. 9 has a solution.
(1) All variables y_a, y_s, x_a, x_s are non-negative.
(2) Transient flow for s ∈ S:
1_{s_0}(s) + Σ_{a∈A} y_a · Δ(a, s) = Σ_{a∈Av(s)} y_a + y_s
(3) Probability of switching in a MEC is the frequency of using its actions, for M_i ∈ MEC:
Σ_{s∈M_i} y_s = Σ_{a∈M_i} x_a
(4) Recurrent flow for s ∈ S:
x_s = Σ_{a∈A} x_a · Δ(a, s) = Σ_{a∈Av(s)} x_a
(5) CVaR and expectation satisfaction for j ∈ [d]:
Σ_{s∈S_j^≤} x_s · r_j(s) + (p_j − Σ_{s∈S_j^≤} x_s) · t_j ≥ p_j · c_j
Σ_{s∈S} x_s · r_j(s) ≥ e_j
(6) Verify MEC classification guess for j ∈ [d]:
Σ_{s∈M_i} x_s · r_j(s) ≤ t_j · Σ_{s∈M_i} x_s   for M_i ∈ MEC_j^≤ ∪ {M_j^=}
Σ_{s∈M_i} x_s · r_j(s) ≥ t_j · Σ_{s∈M_i} x_s   for M_i ∈ MEC_j^> ∪ {M_j^=}
(7) Verify VaR guess for j ∈ [d]:
Σ_{s∈S_j^≤} x_s ≥ p_j     Σ_{s∈S_j^≤ \ M_j^=} x_s ≤ p_j
Figure 9. LP used to decide multi-dimensional mean-payoff queries given a guess t of VaR_{p_j} and a MEC classification MEC_j^≤, M_j^=, and MEC_j^>. S_j^~ := {s ∈ S | s ∈ M and M ∈ MEC_j^~}, ~ ∈ {≤, >}.
Equations (1)-(4) describe the transient flow like the previous LPs and, additionally, the recurrent flow as in [31, Sec. 9.3] or [9, 16, 19]. This addition is needed, since now our MECs are not trivial, i.e. not single-state. Again, Inequalities (5) verify that the CVaR and expectation constraints are satisfied. Finally, Inequalities (6) and (7) verify the previously guessed information, i.e. the VaR vector and the MEC classification.
Using the very same techniques, it is easy to prove that solutions to the LP correspond to satisfying strategies and vice versa. In particular, Inequalities (6) and (7) directly make use of the MEC-constant property of Lem. 6.7. □
While MEC-constant strategies are sufficient for E with CVaR, in contrast, they are not even sufficient for just MDP^{VaR}_{m,multi} [15, Ex. 22]. Consequently, only an exponentially large LP is known for MDP^{VaR}_{m,multi}.
We can combine all the objective functions together as follows:
Theorem 6.9. MDP^{E,VaR,CVaR}_{m,multi} is in EXPSPACE.
Proof sketch. We proceed exactly as in the previous case, but now the flows in Equality (4) are split into exponentially many flows, depending on the set of dimensions where they achieve the given VaR threshold, see LP L in [15, Fig. 4]. The resulting size of the program is polynomial in the size of the system and exponential in d. Hence the call to the decision procedure of the existential theory of reals results in the EXPSPACE upper bound. □
Table 1. Schematic summary of known and new results. Strategies are abbreviated by "C/n-M", where C is either Deterministic or Randomizing, n is the size of the memory, and M is either Deterministic or Stochastic MEMory.
dim      | single | multi
obj      | any | r | m
crit     | |crit| = 1 | |crit| ≥ 2, CVaR ∈ crit | {E, VaR} | {VaR} | {CVaR}, {CVaR, E} | {E, CVaR, VaR}
Complex. | P | NP-c., P for fixed d | P | EXP | NP-h., PSPACE | NP-h., EXPSPACE
Strat.   | D/1-MEM | R/2-SMEM | R/2-SMEM | R/∞-DMEM
7 Conclusion
We introduced the conditional value-at-risk for Markov decision processes in the setting of classical verification objectives of reachability and mean payoff. We observed that in the single dimensional case the additional CVaR constraints do not increase the computational complexity of the problems. As such they provide a useful means for designing risk-averse strategies, at no additional cost. In the multidimensional case, the problems become NP-hard. Nevertheless, this may not necessarily hinder the practical usability. Our results are summarized in Table 1.
We conjecture that the VaRs for given CVaR constraints are polynomially large numbers. In that case, the provided algorithms would yield NP-completeness for MDP^{CVaR}_{m,multi} and EXPTIME-containment for MDP^{E,VaR,CVaR}_{m,multi}, where the exponential dependency is only on the dimension, not the size of the system.
Acknowledgments
This research has been partially supported by the Czech Science Foundation grant No. 18-11193S and the German Research Foundation (DFG) project KR 4890/2 "Statistical Unbounded Verification" (383882557). We thank Vojtech Forejt for bringing up the topic of CVaR and the initial discussions with Jan Krcal and wish them both happy life in industry. We also thank Michael Luttenberger and the anonymous reviewers for insightful comments and valuable suggestions.
References
[1] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. 1999. Coherent Measures of Risk. Mathematical Finance 9, 3 (1999), 203-228.
[2] Pranav Ashok, Krishnendu Chatterjee, Przemyslaw Daca, Jan Křetínský, and Tobias Meggendorfer. 2017. Value Iteration for Long-Run Average Reward in Markov Decision Processes. In CAV (LNCS), Vol. 10426. Springer, 201-221.
[3] Christel Baier, Marcus Daum, Clemens Dubslaff, Joachim Klein, and Sascha Klüppelholz. 2014. Energy-Utility Quantiles. In NFM (LNCS), Vol. 8430. Springer, 285-299.
[4] Christel Baier, Clemens Dubslaff, and Sascha Klüppelholz. 2014. Trade-off analysis meets probabilistic model checking. In CSL-LICS. ACM, 1:1-1:10.
[5] Christel Baier, Clemens Dubslaff, Sascha Klüppelholz, Marcus Daum, Joachim Klein, Steffen Märcker, and Sascha Wunderlich. 2014. Probabilistic Model Checking and Non-standard Multi-objective Reasoning. In FASE (LNCS), Vol. 8411. Springer, 1-16.
[6] Christel Baier, Joachim Klein, Sascha Klüppelholz, and Sascha Wunderlich. 2017. Maximizing the Conditional Expected Reward for Reaching the Goal. In TACAS (LNCS), Vol. 10206. 269-285.
[7] Nicole Bäuerle and Jonathan Ott. 2011. Markov Decision Processes with Average-Value-at-Risk criteria. Math. Meth. of OR 74, 3 (2011), 361-379.
[8] Tanya Styblo Beder. 1995. VAR: Seductive but dangerous. Financial Analysts Journal 51, 5 (1995), 12-24.
[9] Tomáš Brázdil, Václav Brožek, Krishnendu Chatterjee, Vojtěch Forejt, and Antonín Kučera. 2014. Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes. LMCS 10, 1 (2014).
[10] Tomáš Brázdil, Krishnendu Chatterjee, Vojtěch Forejt, and Antonín Kučera. 2013. Trading Performance for Stability in Markov Decision Processes. In LICS. IEEE Computer Society, 331-340.
[11] Véronique Bruyère, Emmanuel Filiot, Mickael Randour, and Jean-François Raskin. 2017. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. Inf. Comput. 254 (2017), 259-295.
[12] John R. Canny. 1988. Some Algebraic and Geometric Computations in PSPACE. In STOC. ACM, 460-467.
[13] Stefano Carpin, Yinlam Chow, and Marco Pavone. 2016. Risk aversion in finite Markov Decision Processes using total cost criteria and average value at risk. In ICRA. IEEE, 335-342.
[14] Krishnendu Chatterjee, Vojtěch Forejt, and Dominik Wojtczak. 2013. Multi-objective Discounted Reward Verification in Graphs and MDPs. In LPAR (LNCS), Vol. 8312. Springer, 228-242.
[15] Krishnendu Chatterjee, Zuzana Komárková, and Jan Křetínský. 2015. Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes. In LICS. IEEE Computer Society, 244-256.
[16] Krishnendu Chatterjee, Zuzana Křetínská, and Jan Křetínský. 2017. Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes. LMCS 13, 2 (2017).
[17] Lorenzo Clemente and Jean-François Raskin. 2015. Multidimensional beyond Worst-Case and Almost-Sure Problems for Mean-Payoff Objectives. In LICS. IEEE Computer Society, 257-268.
[18] Costas Courcoubetis and Mihalis Yannakakis. 1995. The Complexity of Probabilistic Verification. J. ACM 42, 4 (1995), 857-907.
[19] Kousha Etessami, Marta Z. Kwiatkowska, Moshe Y. Vardi, and Mihalis Yannakakis. 2008. Multi-Objective Model Checking of Markov Decision Processes. LMCS 4, 4 (2008).
[20] J. A. Filar, D. Krass, and K. W. Ross. 1995. Percentile performance criteria for limiting average Markov decision processes. IEEE Trans. Automat. Control 40, 1 (Jan 1995), 2-10.
[21] Jerzy A. Filar, Dmitry Krass, and Keith W. Ross. 1995. Percentile performance criteria for limiting average Markov decision processes. IEEE Trans. Automat. Control 40 (1995), 2-10.
[22] Vojtěch Forejt, Marta Z. Kwiatkowska, Gethin Norman, David Parker, and Hongyang Qu. 2011. Quantitative Multi-objective Verification for Probabilistic Systems. In TACAS (LNCS), Vol. 6605. Springer, 112-127.
[23] Hugo Gilbert, Paul Weng, and Yan Xu. 2017. Optimizing Quantiles in Preference-Based Markov Decision Processes. In AAAI. AAAI Press, 3569-3575.
[24] Christoph Haase and Stefan Kiefer. 2015. The Odds of Staying on Budget. In ICALP (LNCS), Vol. 9135. Springer, 234-246.
[25] Christoph Haase, Stefan Kiefer, and Markus Lohrey. 2017. Computing quantiles in Markov chains with multi-dimensional costs. In LICS. IEEE Computer Society, 1-12.
[26] Yonghui Huang and Xianping Guo. 2016. Minimum Average Value-at-Risk for Finite Horizon Semi-Markov Decision Processes in Continuous Time. SIAM Journal on Optimization 26, 1 (2016), 1-28.
[27] Masayuki Kageyama, Takayuki Fujii, Koji Kanefuji, and Hiroe Tsubaki. 2011. Conditional Value-at-Risk for Random Immediate Reward Variables in Markov Decision Processes. American J. Computational Mathematics 1, 3 (2011), 183-188.
[28] Jan Křetínský and Tobias Meggendorfer. 2018. Conditional Value-at-Risk for Reachability and Mean Payoff in Markov Decision Processes. Technical Report abs/1805.xxxxx. arXiv.org.
[29] Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. 2017. Quantile Markov Decision Process. CoRR abs/1711.05788 (2017).
[30] Christopher W. Miller and Insoon Yang. 2017. Optimal Control of Conditional Value-at-Risk in Continuous Time. SIAM J. Control and Optimization 55, 2 (2017), 856-884.
[31] M. L. Puterman. 1994. Markov Decision Processes. J. Wiley and Sons.
[32] Mickael Randour, Jean-François Raskin, and Ocan Sankur. 2017. Percentile queries in multi-dimensional Markov decision processes. FMSD 50, 2-3 (2017), 207-248.
[33] R. Tyrrell Rockafellar and Stanislav Uryasev. 2000. Optimization of Conditional Value-at-Risk. Journal of Risk 2 (2000), 21-41.
[34] R. Tyrrell Rockafellar and Stanislav Uryasev. 2002. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance 26, 7 (2002), 1443-1471.
[35] Michael Ummels and Christel Baier. 2013. Computing Quantiles in Markov Reward Models. In FoSSaCS (LNCS), Vol. 7794. Springer, 353-368.
Value Iteration for Simple Stochastic Games: Stopping Criterion and Learning Algorithm*
Edon Kelmendi, Julia Krämer, Jan Křetínský, and Maximilian Weininger
Technical University of Munich
Abstract. Simple stochastic games can be solved by value iteration (VI), which yields a sequence of under-approximations of the value of the game. This sequence is guaranteed to converge to the value only in the limit. Since no stopping criterion is known, this technique does not provide any guarantees on its results. We provide the first stopping criterion for VI on simple stochastic games. It is achieved by additionally computing a convergent sequence of over-approximations of the value, relying on an analysis of the game graph. Consequently, VI becomes an anytime algorithm returning the approximation of the value and the current error bound. As another consequence, we can provide a simulation-based asynchronous VI algorithm, which yields the same guarantees, but without necessarily exploring the whole game graph.
1 Introduction
A simple stochastic game (SG) [Con92] is a zero-sum two-player game played on a graph by Maximizer and Minimizer, who choose actions in their respective vertices (also called states). Each action is associated with a probability distribution determining the next state to move to. The objective of Maximizer is to maximize the probability of reaching a given target state; the objective of Minimizer is the opposite.
Stochastic games constitute a fundamental problem for several reasons. From the theoretical point of view, the complexity of this problem1 is known to be in UP ∩ coUP [HK66], but no polynomial-time algorithm is known. Further, several other important problems can be reduced to SG, for instance parity games, mean-payoff games, discounted-payoff games and their stochastic extensions [CF11]. The task of solving SG is also polynomial-time equivalent to solving perfect information Shapley, Everett and Gillette games [AM09]. Besides, the problem is practically relevant in verification and synthesis. SG can model reactive systems, with players corresponding to the controller of the system and to its environment, where quantified uncertainty is explicitly modelled. This is useful in many application domains, ranging from smart energy management
* This research was funded in part by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement No. 291763 for TUM - IAS, the Studienstiftung des deutschen Volkes project "Formal methods for analysis of attack-defence diagrams", the Czech Science Foundation grant No. 18-11193S, TUM IGSSE Grant 10.06 (PARSEC), and the German Research Foundation (DFG) project KR 4890/2-1 "Statistical Unbounded Verification".
1 Formally, the problem is to decide, for a given p ∈ [0, 1], whether Maximizer has a strategy ensuring probability at least p to reach the target.
[CFK+13b] to autonomous urban driving [CKSW13], robot motion planning [LaV00] to self-adaptive systems [CMG14]; for various recent case studies, see e.g. [SK16]. Finally, since Markov decision processes (MDP) [Put14] are a special case with only one player, SG can serve as abstractions of large MDP [KKNP10].
Solution techniques  There are several classes of algorithms for solving SG, most importantly strategy iteration (SI) algorithms [HK66] and value iteration (VI) algorithms [Con92]. Since the repetitive evaluation of strategies in SI is often slow in practice, VI is usually preferred, similarly to the special case of MDPs [KM17]. For instance, the most used probabilistic model checker PRISM [KNP11] and its branch PRISM-games [CFK+13a] use VI for MDP and SG as the default option, respectively. However, while SI is in principle a precise method, VI is an approximative method, which converges only in the limit. Unfortunately, there is no known stopping criterion for VI applied to SG. Consequently, there are no guarantees on the results returned in finite time. Therefore, current tools stop when the difference between the two most recent approximations is low, and thus may return arbitrarily imprecise results [HM17].
Value iteration with guarantees  In the special case of MDP, in order to obtain bounds on the imprecision of the result, one can employ a bounded variant of VI [MLG05,BCC+14] (also called interval iteration [HM17]). Here one computes not only an under-approximation, but also an over-approximation of the actual value as follows. On the one hand, iterative computation of the least fixpoint of Bellman equations yields an under-approximating sequence converging to the value. On the other hand, iterative computation of the greatest fixpoint yields an over-approximation, which, however, does not converge to the value. Moreover, it often results in the trivial bound of 1. A solution suggested for MDPs [BCC+14,HM17] is to modify the underlying graph, namely to collapse end components. In the resulting MDP there is only one fixpoint, thus the least and greatest fixpoint coincide and both approximating sequences converge to the actual value. In contrast, for general SG no procedure where the greatest fixpoint converges to the value is known. In this paper we provide one, yielding a stopping criterion. We show that the pre-processing approach of collapsing is not applicable in general and provide a solution on the original graph. We also characterize SG where the fixpoints coincide and no processing is needed. The main technical challenge is that states in an end component in SG can have different values, in contrast to the case of MDP.
Practical efficiency using guarantees We further utilize the obtained guarantees to practically improve our algorithm. Similar to the MDP case [BCC+14], the quantification of the error allows for ignoring parts of the state space, and thus a speed up without jeopardizing the correctness of the result. Indeed, we provide a technique where some states are not explored and processed at all, but their potential effect is still taken into account. The information is further used to decide the states to be explored next and to be analyzed in more detail. To this end, simulations and learning are used as tools. While for MDP this idea has already demonstrated speed ups in orders of magnitude [BCC+14,ACD+17], this paper provides the first technique of this kind for SG.
Our contribution is summarized as follows
— We introduce a VI algorithm yielding both under- and over-approximation sequences, both of which converge to the value of the game. Thus we present the first stopping criterion for VI on SG and the first anytime algorithm with guaranteed precision. We also characterize when a simpler solution is sufficient.
— We provide a learning-based algorithm, which preserves the guarantees, but is in some cases more efficient since it avoids exploring the whole state space.
— We evaluate the running times of the algorithms experimentally, concluding that obtaining guarantees requires an overhead that is either negligible or mitigated by the learning-based approach.
Related work The works closest to ours are the following. As mentioned above, [BCC+14,HM17] describe the solution to the special case of MDP. While [BCC+14] also provides a learning-based algorithm, [HM17] discusses the convergence rate and the exact solution. The basic algorithm of [HM17] is implemented in PRISM [BKL+17] and the learning approach of [BCC+14] in Storm [DJKV17a]. The extension for SG where the interleaving of players is severely limited (every end component belongs to one player only) is discussed in [Ujm15].
Further, in the area of probabilistic planning, bounded real-time dynamic programming [MLG05] is related to our learning-based approach. However, it is limited to the setting of stopping MDP where the target sink or the non-target sink is reached almost surely under any pair of strategies and thus the fixpoints coincide. Our algorithm works for general SG, not only for stopping ones, without any blowup.
For SG, the tools implementing the standard SI and/or VI algorithms are PRISM-games [CFK+13a], GAVS+ [CKLB11] and GIST [CHJR10]. The latter two are, however, neither maintained nor accessible via the links provided in their publications any more.
Apart from fundamental algorithms to solve SG, there are various practically efficient heuristics that, however, provide none or weak guarantees, often based on some form of learning [BT00,LL08,WT16,TT16,AY17,BBS08]. Finally, the only currently available way to obtain any guarantees through VI is to perform γ² iterations and then round to the nearest multiple of 1/γ, yielding the value of the game with precision 1/γ [CH08]; here γ cannot be freely chosen, but it is a fixed number, exponential in the number of states and the used probability denominators. However, since the precision cannot be chosen and the number of iterations is always exponential, this approach is infeasible even for small games.
Organization of the paper Section 2 introduces the basic notions and revises value iteration. Section 3 explains the idea of our approach on an example. Section 4 provides a full technical treatment of the method as well as the learning-based variation. Section 5 discusses experimental results and Section 6 concludes. The appendix (available in [KKKW18]) gives technical details on the pseudocode as well as the conducted experiments and provides more extensive proofs to the theorems and lemmata; in this paper, there are only proof sketches and ideas.
2 Preliminaries
2.1 Basic definitions
A probability distribution on a finite set X is a mapping δ : X → [0, 1] such that Σ_{x∈X} δ(x) = 1; the set of all probability distributions on X is denoted by D(X). A simple stochastic game (SG) is a tuple (S, S_□, S_○, s_0, A, Av, δ), where S is a finite set of states partitioned into the sets S_□ and S_○ of states of Maximizer and Minimizer, respectively, s_0 is the initial state, A is a finite set of actions, Av : S → 2^A assigns to every state a set of available actions, and δ : S × A → D(S) is a transition function that given a state s and an action a ∈ Av(s) yields a probability distribution over successor states.
A Markov decision process (MDP) is a special case of SG where S_○ = ∅. We assume that SGs are non-blocking, so for all states s we have Av(s) ≠ ∅. Further, 1 and 0 only have one action and it is a self-loop with probability 1. Additionally, we can assume that the SG is preprocessed so that all states with no path to 1 are merged with 0.
For a state s and an available action a ∈ Av(s), we denote the set of successors by Post(s, a) := {s' | δ(s, a, s') > 0}. Finally, for any set of states T ⊆ S, we use T_□ and T_○ to denote the states in T that belong to Maximizer and Minimizer, whose states are drawn in the figures as □ and ○, respectively.
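For concreteness, one possible explicit encoding of such a game is sketched below in Python; the particular game is hypothetical and only fixes the ingredients S_□, S_○, Av and δ, with Post derived from δ.

# Minimal sketch (hypothetical game): a dictionary-based SG with Maximizer states
# S_box, Minimizer states S_circ, available actions Av and transitions delta.
S_box = {"p", "one", "zero"}                      # Maximizer states (including the sinks here)
S_circ = {"q"}                                    # Minimizer states
Av = {"p": ["a"], "q": ["b", "c"], "one": ["loop"], "zero": ["loop"]}
delta = {("p", "a"): {"q": 1.0},
         ("q", "b"): {"p": 0.5, "one": 0.5},
         ("q", "c"): {"zero": 1.0},
         ("one", "loop"): {"one": 1.0},
         ("zero", "loop"): {"zero": 1.0}}

def post(s, a):
    """Successors reachable with positive probability."""
    return {s2 for s2, prob in delta[(s, a)].items() if prob > 0}

print(post("q", "b"))                             # {'p', 'one'}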
The semantics of SG is given in the usual way by means of strategies and the induced Markov chain and the respective probability space, as follows. An infinite path ρ is an infinite sequence ρ = s_0 a_0 s_1 a_1 ... ∈ (S × A)^ω, such that for every i ∈ ℕ, a_i ∈ Av(s_i) and s_{i+1} ∈ Post(s_i, a_i). Finite paths are defined analogously as elements of (S × A)* × S. Since this paper deals with the reachability objective, we can restrict our attention to memoryless strategies, which are optimal for this objective. We still allow randomizing strategies, because they are needed for the learning-based algorithm later on. A strategy of Maximizer or Minimizer is a function σ : S_□ → D(A) or σ : S_○ → D(A), respectively, such that [...] : S → [0, 1] such that L_0(1) = 1 and, for all other s ∈ S, L_0(s) = 0. Then we repetitively apply Bellman updates (3) and (4)
L_n(s, a) := Σ_{s'∈S} δ(s, a, s') · L_{n−1}(s')      (3)
L_n(s) := max_{a∈Av(s)} L_n(s, a) if s ∈ S_□,  and  L_n(s) := min_{a∈Av(s)} L_n(s, a) if s ∈ S_○      (4)
2 Throughout the paper, for any function f : S → [0, 1] we overload the notation and also write f(s, a) meaning Σ_{s'∈S} δ(s, a, s') · f(s').
until convergence. Note that convergence may happen only in the limit even for such a simple game as in Figure 1 on the left. The sequence is monotonic, at all times a lower bound on V, i.e. L_i(s) ≤ V(s) for all s ∈ S, and the least fixpoint satisfies L* := lim_{n→∞} L_n = V.
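A direct rendering of the updates (3) and (4) on a small hypothetical game might look as follows; the game, the iteration count and the encoding are assumptions made only for this sketch, which computes the non-decreasing lower bounds L_n described above.

# Minimal sketch (hypothetical game): value iteration for the lower bound L via
# the Bellman updates (3) and (4).
S_box = {"p", "one", "zero"}                            # Maximizer states
S_circ = {"q"}                                          # Minimizer states
Av = {"p": ["a", "d"], "q": ["b", "c"], "one": ["loop"], "zero": ["loop"]}
delta = {("p", "a"): {"q": 1.0}, ("p", "d"): {"one": 0.3, "zero": 0.7},
         ("q", "b"): {"p": 0.5, "one": 0.5}, ("q", "c"): {"zero": 1.0},
         ("one", "loop"): {"one": 1.0}, ("zero", "loop"): {"zero": 1.0}}

def bellman(f, s, a):
    """f(s, a) = sum over s' of delta(s, a, s') * f(s'), as in footnote 2."""
    return sum(prob * f[s2] for s2, prob in delta[(s, a)].items())

L = {s: 0.0 for s in S_box | S_circ}
L["one"] = 1.0                                          # L_0(1) = 1
for _ in range(50):                                     # assumed number of iterations
    L = {s: (max if s in S_box else min)(bellman(L, s, a) for a in Av[s]) for s in L}
print(L["p"], L["q"])                                   # converges to 0.3 and 0.0 here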
Unfortunately, there is no known stopping criterion, i.e. no guarantees how close the current under-approximation is to the value [HM17]. The current tools stop when the difference between two successive approximations is smaller than a certain threshold, which can lead to arbitrarily wrong results [HM17].
For the special case of MDP, it has been suggested to also compute the greatest fixpoint [MLG05] and thus an upper bound as follows. The function G : S → [0, 1] is initialized for all states s ∈ S as G_0(s) = 1 except for G_0(0) = 0. Then we repetitively apply updates (3) and (4), where L is replaced by G. The resulting sequence G_n is monotonic, provides an upper bound on V, and the greatest fixpoint G* := lim_n G_n is the greatest solution to the Bellman equations on [0, 1]^S.
This approach is called bounded value iteration (BVI) (or bounded real-time dynamic programming (BRTDP) [MLG05,BCC+14] or interval iteration [HM17]). If L* = G* then they are both equal to V and we say that BVI converges. BVI is guaranteed to converge in MDP if the only ECs are those of 1 and 0 [BCC+14]. Otherwise, if there are non-trivial ECs they have to be "collapsed"3. Computing the greatest fixpoint on the modified MDP results in another sequence U_n of upper bounds on V, converging to U* := lim_n U_n. Then BVI converges even for general MDPs, U* = V [BCC+14], when transformed this way. The next section illustrates this difficulty and the solution through collapsing on an example.
In summary, all versions of BVI discussed so far and later on in the paper follow the pattern of Algorithm 1. In the naive version, UPDATE just performs the Bellman update on L and U according to Equations (3) and (4).4 For a general MDP, U does not converge to V, but to G*, and thus the termination criterion may never be met if G*(s_0) − V(s_0) > 0. If the ECs are collapsed in pre-processing then U converges to V.
For the general case of SG, the collapsing approach fails and this paper provides another version of BVI where U converges to V, based on a more detailed structural analysis of the game.
3 Example
In this section, we illustrate the issues preventing BVI convergence and our solution on a few examples. Recall that G is the sequence converging to the greatest solution of the Bellman equations, while U is in general any sequence over-approximating V that one or another BVI algorithm suggests.
Firstly, we illustrate the issue that arises already for the special case of MDP. Consider the MDP of Figure 1 on the left. Although V(s) = V(t) = 0.5, we have
3 All states of an EC are merged into one, all leaving actions are preserved and all other actions are discarded. For more detail see [KKKW18, Appendix A.1].
4 For the straightforward pseudocode, see [KKKW18, Appendix A.2.].
Algorithm 1 Bounded value iteration algorithm
1: procedure BVI(precision ε > 0)
2:   for s ∈ S do                 \* Initialization *\
3:     L(s) = 0                   \* Lower bound *\
4:     U(s) = 1                   \* Upper bound *\
5:   L(1) = 1                     \* Value of sinks is determined a priori *\
6:   U(0) = 0
7:   repeat
8:     UPDATE(L, U)               \* Bellman updates or their modification *\
9:   until U(s_0) − L(s_0) < ε    \* Guaranteed error bound *\
G_i(s) = G_i(t) = 1 for all i. Indeed, the upper bound for t is always updated as the maximum of G_i(t, c) and G_i(t, b). Although G_i(t, c) decreases over time, G_i(t, b) remains the same, namely equal to G_i(s), which in turn remains equal to G_i(s, a) = G_i(t). This cyclic dependency lets both s and t remain in an "illusion" that the value of the other one is 1.
The solution for MDP is to remove this cyclic dependency by collapsing all MECs into singletons and removing the resulting purely self-looping actions. Figure 1 in the middle shows the MDP after collapsing the EC {s, t}. This turns the MDP into a stopping one, where 1 or 0 is reached with probability 1 under any strategy. In such an MDP, there is a unique solution to the Bellman equations. Therefore, the greatest fixpoint is equal to the least one and thus to V.
Secondly, we illustrate the issues that additionally arise for general SG. It turns out that the collapsing approach can be extended only to games where all states of each EC belong to one player only [Ujm15]. In this case, both Maximizer's and Minimizer's ECs are collapsed the same way as in MDP.
However, when both players are present in an EC, then collapsing may not solve the issue. Consider the SG of Figure 2. Here α and β represent the values of the respective actions.5 There are three cases:
First, let α < β. If the bounds converge to these values we eventually observe G_i(q, e) < L_i(r, f) and learn the induced inequality. Since p is a Minimizer's state it will never pick the action leading to the greater value of β. Therefore, we can safely merge p and q, and remove the action leading to r, as shown in the second subfigure.
Second, if α > β, p and r can be merged in an analogous way, as shown in the third subfigure.
Third, if α = β, both previous solutions as well as collapsing all three states as in the fourth subfigure are possible. However, since the approximants may only converge to α and β in the limit, we may not know in finite time which of these cases applies and thus cannot decide for any of the collapses.
Consequently, the approach of collapsing is not applicable in general. In order to ensure BVI convergence, we suggest a different method, which we call
5 Precisely, we consider them to stand for a probabilistic branching with probability α (or β) to 1 and with the remaining probability to 0. To avoid clutter in the figure, we omit this branching and depict only the value.
deflating. It does not involve changing the state space, but rather decreasing the upper bound to the least value that is currently provable (and thus still correct). To this end, we analyze the exiting actions, i.e. with successors outside of the EC, for the following reason. If the play stays in the EC forever, the target is never reached and Minimizer wins. Therefore, Maximizer needs to pick some exiting action to avoid staying in the EC.
i | L_i({s,t}) | G_i({s,t})
0 | 0 | 1
1 | 1/3 | 2/3
2 | 4/9 | 5/9
3 | 13/27 | 14/27
Fig. 1: Left: An MDP (as special case of SG) where BVI does not converge due to the grayed EC. Middle: The same MDP where the EC is collapsed, making BVI converge. Right: The approximations illustrating the convergence of the MDP in the middle.
Fig. 2: Left: Collapsing ECs in SG may lead to incorrect results. The Greek letters on the leaving arrows denote the values of the exiting actions. Right three figures: Correct collapsing in different cases, depending on the relationship of α and β. In contrast to MDP, some actions of the EC exiting the collapsed part have to be removed.
For the EC with the states s and t in Figure 1, the only exiting action is c. In this example, since c is the only exiting action, (t, c) is the highest possible upper bound that the EC can achieve. Thus, by decreasing the upper bound of all states in the EC to that number6, we still have a safe upper bound. Moreover, with this modification BVI converges in this example, intuitively because now the upper bound of t depends on action c as it should.
For the example in Figure 2, it is correct to decrease the upper bound to the maximal exiting one, i.e. max{ᾱ, β̄}, where ᾱ := U_i(a), β̄ := U_i(b) are the
6 We choose the name "deflating" to evoke decreasing the overly high "pressure" in the EC until it equalizes with the actual "pressure" outside.
current approximations of α and of β. However, this itself does not ensure BVI convergence. Indeed, if for instance α < β then deflating all states to β̄ is not tight enough, as the values of p and q can even be bounded by α. In fact, we have to find a certain sub-EC that corresponds to α, in this case {p, q}, and set all its upper bounds to ᾱ. We define and compute these sub-ECs in the next section.
In summary, the general structure of our convergent BVI algorithm is to produce the sequence U by application of Bellman updates and occasionally find the relevant sub-ECs and deflate them. The main technical challenge is that states in an EC in SG can have different values, in contrast to the case of MDP.
4 Convergent Over-approximation
In Section 4.1, we characterize SGs where Bellman equations have more solutions. Based on the analysis, subsequent sections show how to alter the procedure computing the sequence over-approximating V so that the resulting tighter sequence IF still over-approximates V, but also converges to V. This ensures that thus modified BVI converges. Section 4.4 presents the learning-based variant of our BVI.
4.1 Bloated end components cause non-convergence
As we have seen in the example of Fig. 2, BVI generally does not converge due to ECs with a particular structure of the exiting actions. The analysis of ECs relies on the extremal values that can be achieved by exiting actions (in the example, α and β). Given the value function V or just its current over-approximation U, we define the most profitable exiting action for Maximizer (denoted by □) and Minimizer (denoted by ○) as follows.
Definition 3 (bestExit). Given a set of states T ⊆ S and a function f : S → [0, 1] (see footnote 2), the f-value of the best T-exiting action of Maximizer and Minimizer, respectively, is defined as
bestExit_f^□(T) = max_{s∈T_□, (s,a) exits T} f(s, a)
bestExit_f^○(T) = min_{s∈T_○, (s,a) exits T} f(s, a)
with the convention that max_∅ = 0 and min_∅ = 1.
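Read off an explicit game encoding, the definition can be sketched as follows; the tiny game fragment, the set T and the function f are hypothetical and only illustrate the two conventions for empty maxima and minima.

# Minimal sketch (hypothetical fragment): bestExit of Definition 3 for a set T,
# with the max over an empty set read as 0 and the min as 1.
S_box = {"s"}                                     # Maximizer states in the fragment
S_circ = {"t"}                                    # Minimizer states in the fragment
delta = {("s", "a"): {"t": 1.0},                  # staying action of Maximizer
         ("t", "b"): {"s": 1.0},                  # staying action of Minimizer
         ("t", "c"): {"one": 0.5, "zero": 0.5}}   # Minimizer's exiting action
f = {"s": 0.5, "t": 0.5, "one": 1.0, "zero": 0.0}

def f_val(s, a):
    return sum(prob * f[s2] for s2, prob in delta[(s, a)].items())

def exits(T, s, a):
    return any(s2 not in T for s2 in delta[(s, a)])

def best_exit(T, player_states, empty_value, pick):
    vals = [f_val(s, a) for (s, a) in delta if s in T & player_states and exits(T, s, a)]
    return pick(vals) if vals else empty_value

T = {"s", "t"}
print(best_exit(T, S_box, 0.0, max), best_exit(T, S_circ, 1.0, min))   # 0.0 and 0.5

In this hypothetical fragment Maximizer has no T-exiting action at all, which is exactly the situation discussed in Example 3 below.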
Example 1. In the example of Fig. 2 on the left with T = {p, q, r} and α < β, we have bestExit_V^□(T) = β, bestExit_V^○(T) = 1. It is due to β < 1 that BVI does not converge here. We generalize this in the following lemma. ◁
Lemma 1. Let T be an EC. For every m satisfying bestExit_V^□(T) ≤ m ≤ bestExit_V^○(T), there is a solution f : S → [0, 1] to the Bellman equations, which on T is constant and equal to m.
Proof (Idea). Intuitively, such a constant m is a solution to the Bellman equations on T for the following reasons. As both players prefer getting m to exiting and getting "only" the values of their respective bestExit, they both choose to stay in the EC (and the extrema in the Bellman equations are realized on non-exiting actions). On the one hand, Maximizer (Bellman equations with max) is hoping for the promised m, which is however not backed up by any actions actually exiting towards the target. On the other hand, Minimizer (Bellman equations with min) does not realize that staying forever results in her optimal value 0 instead of m. □
Corollary 1. If bestExit_V^○(T) > bestExit_V^□(T) for some EC T, then G* ≠ V.
Proof. Since there are m_1, m_2 such that bestExit_V^□(T) ≤ m_1 < m_2 ≤ bestExit_V^○(T), by Lemma 1 there are two different solutions to the Bellman equations. In particular, G* > L* = V, and BVI does not converge. □
In accordance with our intuition that ECs satisfying the above inequality should be deflated, we call them bloated.
Definition 4 (BEC). An EC T is called a bloated end component (BEC), if bestExit_V^○(T) > bestExit_V^□(T).
Example 2. In the example of Fig. 2 on the left with α < β, the ECs {p, q} and {p, q, r} are BECs. ◁
Example 3. If an EC T has no exiting actions of Minimizer (or no Minimizer's states at all, as in an MDP), then bestExit_V^○(T) = 1 (the case with min_∅). Hence all numbers between bestExit_V^□(T) and 1 are a solution to the Bellman equations and G*(s) = 1 for all states s ∈ T.
Analogously, if Maximizer does not have any exiting action in T, then bestExit_V^□(T) = 0 (the case with max_∅), T is a BEC and all numbers between 0 and bestExit_V^○(T) are a solution to the Bellman equations.
Note that in MDP all ECs belong to one player, namely Maximizer. Consequently, all ECs are BECs except for ECs where Maximizer has an exiting action with value 1; all other ECs thus have to be collapsed (or deflated) to ensure BVI convergence in MDPs. Interestingly, all non-trivial ECs in MDPs are a problem, while in SGs, through the presence of the other player, some ECs can converge, namely if both players want to exit (see e.g. [KKKW18, Appendix A.3]). ◁
We show that BECs are indeed the only obstacle for BVI convergence.
Theorem 1. If the SG contains no BECs except for {0} and {1}, then G* = V.
Proof (Sketch). Assume, towards a contradiction, that there is some state s with a positive difference G*(s) — V(s) > 0. Consider the set D of states with the maximal difference. D can be shown to be an EC. Since it is not a BEC there has to be an action exiting D and realizing the optimum in that state. Consequently, this action also has the maximal difference, and all its successors, too. Since some of the successors are outside of D, we get a contradiction with the maximality of D. □
In Section 4.2, we show how to eliminate BECs by collapsing their "core" parts, called below MSECs (maximal simple end components). Since MSECs can only be identified with enough information about V, Section 4.3 shows how to avoid direct a priori collapsing and instead dynamically deflate candidates for MSECs in a conservative way.
4.2 Static MSEC decomposition
Now we turn our attention to SG with BECs. Intuitively, since in a BEC all Minimizer's exiting actions have a higher value than what Maximizer can achieve, Minimizer does not want to use any of his own exiting actions and prefers staying in the EC (or steering Maximizer towards his worse exiting actions). Consequently, only Maximizer wants to take an exiting action. In the MDP case he can pick any desirable one. Indeed, he can wait until he reaches a state where it is available. As a result, in MDP all states of an EC have the same value and can all be collapsed into one state. In the SG case, he may be restricted by Minimizer's behaviour or even not given any chance to exit the EC at all. As a result, a BEC may contain several parts (below denoted MSECs), each with different value, intuitively corresponding to different exits. Thus instead of MECs, we have to decompose into finer MSECs and only collapse these.
Definition 5 (Simple EC). An EC T is called simple (SEC), if for all s ∈ T we have V(s) = bestExit_V^□(T).
A SEC C is maximal (MSEC) if there is no SEC C' such that C ⊊ C'.
Intuitively, an EC is simple if Minimizer cannot keep Maximizer away from his bestExit. Independently of Minimizer's decisions, Maximizer can reach the bestExit almost surely, unless Minimizer decides to leave, in which case Maximizer could achieve an even higher value.
Example 4. Assume α < β in the example of Figure 2. Then {p, q} is a SEC and an MSEC. Further observe that action c is sub-optimal for Minimizer and removing it does not affect the value of any state, but simplifies the graph structure. Namely, it destructs the whole EC into several (here only one) SECs and some non-EC states (here r). ◁
Algorithm 2, called FIND_MSEC, shows how to compute MSECs. It returns the set of all MSECs if called with parameter V. However, later we also call this function with other parameters f : S → [0, 1]. The idea of the algorithm is the following. The set X consists of Minimizer's sub-optimal actions, leading to a higher value. As such they cannot be a part of any SEC and thus should be ignored when identifying SECs. (The previous example illustrates that ignoring X is indeed safe as it does not change the value of the game.) We denote the game G where the available actions Av are changed to the new available actions Av' (ignoring the Minimizer's sub-optimal ones) as G_{[Av/Av']}. Once removed, Minimizer has no choices to affect the value and thus each EC is simple.
Algorithm 2 FIND_MSEC
1: function FIND_MSEC(f : S → [0, 1])
2:   X ← {(s, {a ∈ Av(s) | f(s, a) > f(s)}) | s ∈ S_○}
3:   Av' ← [...]
[...] for any f : S → [0, 1] such that f ≥ V and any EC T, DEFLATE(T, f) ≥ V.
This allows us to define our BVI algorithm as the naive BVI with only the additional lines 3-4, see Algorithm 4.
Algorithm 4 UPDATE procedure for bounded value iteration on SG
1: procedure UPDATE(L : S → [0, 1], U : S → [0, 1])
2:   L, U get updated according to Eq. (3) and (4)   \* Bellman updates *\
3:   for T ∈ FIND_MSEC(L) do                         \* Use lower bound to find ECs *\
4:     U ← DEFLATE(T, U)                             \* and deflate the upper bound there *\
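The deflating step itself, i.e. capping the upper bound of every state of T by the best value Maximizer can still secure by exiting T, can be sketched as follows; the game fragment and the current bounds are hypothetical, and identifying the MSECs is assumed to be done separately by FIND_MSEC.

# Minimal sketch (hypothetical fragment and bounds): DEFLATE(T, U), i.e. capping
# U(s) for s in T by the U-value of Maximizer's best T-exiting action
# (bestExit of Definition 3, max over the empty set read as 0).
S_box = {"s", "t"}                                  # both EC states belong to Maximizer here
delta = {("s", "a"): {"t": 1.0},
         ("t", "b"): {"s": 1.0},
         ("t", "c"): {"one": 0.5, "zero": 0.5}}     # the only exiting action

def u_val(U, s, a):
    return sum(prob * U[s2] for s2, prob in delta[(s, a)].items())

def deflate(T, U):
    exiting = [u_val(U, s, a) for (s, a) in delta
               if s in T and s in S_box and any(s2 not in T for s2 in delta[(s, a)])]
    cap = max(exiting) if exiting else 0.0          # Maximizer's bestExit w.r.t. U
    return {s: (min(U[s], cap) if s in T else U[s]) for s in U}

U = {"s": 1.0, "t": 1.0, "one": 1.0, "zero": 0.0}   # current upper bounds
print(deflate({"s", "t"}, U))                       # caps U(s) and U(t) at 0.5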
Theorem 2 (Soundness and completeness). Algorithm 1 (calling Algorithm 4) produces monotonic sequences L under- and U over-approximating V, and terminates.
Proof (Sketch). The crux is to show that U converges to V. We assume towards a contradiction that there exists a state s with lim_{n→∞} U_n(s) − V(s) > 0. Then there exists a nonempty set of states X where the difference between lim_{n→∞} U_n and V is maximal. If the upper bound of states in X depends on states outside of X, this yields a contradiction, because then the difference between upper bound and value would decrease in the next Bellman update. So X must be an EC where all states depend on each other. However, if that is the case, calling DEFLATE decreases the upper bound to something depending on the states outside of X, thus also yielding a contradiction. □
Summary of our approach:
1. We cannot collapse MECs, because we cannot collapse BECs with non-constant values.
2. If we remove X (the sub-optimal actions of Minimizer) we can collapse MECs (now actually MSECs with constant values).
3. Since we know neither X nor SECs we gradually deflate SEC approximations.
4.4 Learning-based algorithm
Asynchronous value iteration selects in each round a subset T ⊆ S of states and performs the Bellman update in that round only on T. Consequently, it may speed up computation if "important" states are selected. However, using the standard VI it is even more difficult to determine the current error bound. Moreover, if some states are not selected infinitely often the lower bound may not even converge.
In the setting of bounded value iteration, the current error bound is known for each state and thus convergence can easily be enforced. This gave rise to asynchronous VI, such as BRTDP (bounded real-time dynamic programming) in the setting of stopping MDPs [MLG05], where the states are selected as those that appear on a simulation run. Very similar is the adaptation for general MDP [BCC+14]. In order to simulate a run, the transition probabilities determine how to resolve the probabilistic choice. In order to resolve the non-deterministic choice of Maximizer, the "most promising action" is taken, i.e., with the highest U. This choice is derived from a reinforcement algorithm called delayed Q-learning and ensures convergence while practically performing well [BCC+14].
In this section, we harvest our convergence results and BVI algorithm for SG, which allow us to trivially extend the asynchronous learning-based approach of BRTDP to SGs. On the one hand, the only difference to the MDP algorithm is how to resolve the choice for Minimizer. Since the situation is dual, we again pick the "most promising action", in this case with the lowest L. On the other hand, the only difference to Algorithm 1 calling Algorithm 4 is that the Bellman updates of U and L are performed on the states of the simulation run only, see lines 2-3 of Algorithm 5.
Algorithm 5 Update procedure for the learning/BRTDP version of BVI on SG
1: procedure UPDATE(L : S → [0, 1], U : S → [0, 1])
2:   ρ ← path s_0, s_1, ..., s_ℓ of length ℓ ≤ k, obtained by simulation where the successor of s is s' with probability δ(s, a, s') and a is sampled randomly from arg max_a U(s, a) and arg min_a L(s, a) for s ∈ S_□ and s ∈ S_○, respectively
3:   L, U get updated by Eq. (3) and (4) on states s_ℓ, s_{ℓ−1}, ..., s_0   \* all s ∈ ρ *\
4:   for T ∈ FIND_MSEC(L) do
5:     DEFLATE(T, U)
If 1 or 0 is reached in a simulation, we can terminate it. It can happen that the simulation cycles in an EC. To that end, we have a bound k on the maximum number of steps. The choice of k is discussed in detail in [BCC+14] and we use 2 · |S| to guarantee the possibility of reaching sinks as well as exploring new states. If the simulation cycles in an EC, the subsequent call of DEFLATE ensures that next time there is a positive probability to exit this EC. Further details can be found in [KKKW18, Appendix A.4].
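The simulation in line 2 of Algorithm 5 can be sketched as follows on an explicit encoding; the game, the current bounds and the length cap k are hypothetical, and the non-deterministic choices are resolved greedily by U for Maximizer and by L for Minimizer as described above.

# Minimal sketch (hypothetical game and bounds): sampling one simulation path as
# in line 2 of Algorithm 5 - Maximizer follows argmax U, Minimizer argmin L.
import random

S_box = {"p"}                                  # Maximizer states
Av = {"p": ["a", "d"], "q": ["b", "c"], "one": ["loop"], "zero": ["loop"]}
delta = {("p", "a"): {"q": 1.0}, ("p", "d"): {"one": 0.3, "zero": 0.7},
         ("q", "b"): {"p": 0.5, "one": 0.5}, ("q", "c"): {"zero": 1.0},
         ("one", "loop"): {"one": 1.0}, ("zero", "loop"): {"zero": 1.0}}

def f_val(f, s, a):
    return sum(prob * f[s2] for s2, prob in delta[(s, a)].items())

def simulate(s0, L, U, k, seed=1):
    rng = random.Random(seed)
    path, s = [s0], s0
    while len(path) <= k and s not in {"one", "zero"}:
        if s in S_box:
            a = max(Av[s], key=lambda act: f_val(U, s, act))   # most promising for Maximizer
        else:
            a = min(Av[s], key=lambda act: f_val(L, s, act))   # most promising for Minimizer
        succs, probs = zip(*delta[(s, a)].items())
        s = rng.choices(succs, weights=probs)[0]               # probabilistic successor
        path.append(s)
    return path

L = {"p": 0.0, "q": 0.0, "one": 1.0, "zero": 0.0}
U = {"p": 1.0, "q": 1.0, "one": 1.0, "zero": 0.0}
print(simulate("p", L, U, k=8))                                # e.g. ['p', 'q', 'zero']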
5 Experimental results
We implemented both our algorithms as an extension of PRISM-games [CFK+13a], a branch of PRISM [KNP11] that allows for modelling SGs, utilizing previous work of [BCC+14,Ujm15] for MDP and SG with single-player ECs. We tested the implementation on the SGs from the PRISM-games case studies [gam] that have reachability properties and one additional model from [CKJ12] that was also used in [Ujm15]. We compared the results with both the explicit and the hybrid engine of PRISM-games, but since the models are small both of them performed similarly and we only display the results of the hybrid engine in Table 1.
Furthermore, we ran experiments on MDPs from the PRISM benchmark suite [KNP12]. We compared our results there to the hybrid and explicit engines of PRISM, the interval iteration implemented in PRISM [HM17], the hybrid engine of Storm [DJKV17b], and the BRTDP implementation of [BCC+14].
Recall that the aim of the paper is not to provide a faster VI algorithm, but rather the first guaranteed one. Consequently, the aim of the experiments is not to show any speed ups, but to experimentally estimate the overhead needed for computing the guarantees.
For information on the technical details of the experiments, all the models and the tables for the experiments on MDPs we refer to [KKKW18, Appendix B]. Note that although some of the SG models are parametrized they could only be scaled by manually changing the model file, which complicates extensive benchmarking.
Although our approaches compute the additional upper bound to give the convergence guarantees, for each of the experiments one of our algorithms performed similar to PRISM-games. Table 1 shows this result for three of the four SG models in the benchmarking set. On the fourth model, PRISM's pre-computations already solve the problem and hence it cannot be used to compare the approaches. For completeness, the results are displayed in [KKKW18, Appendix B.5].
Table 1: Experimental results for the experiments on SGs. The left two columns denote the model and the given parameters, if present. Columns 3 to 5 display the verification time in seconds for each of the solvers, namely PRISM-games (referred to as PRISM), our BVI algorithm (BVI) and our learning-based algorithm (BRTDP). The next two columns compare the number of states that BRTDP explored (#States_B) to the total number of states in the model. The rightmost column shows the number of MSECs in the model.
Model Parameters PRISM BVI BRTDP #States_B #States #MSECs
mdsm prop=l 8 8 17 767 62,245 1
prop=2 4 4 29 407 62,245 1
cdmsn 2 2 3 1,212 1,240 1
cloud N=5 3 7 15 1,302 8,842 4,421
N=6 6 59 4 570 34,954 17,477
Whenever there are few MSECs, as in mdsm and cdmsn, BVI performs like PRISM-games, because only little time is used for deflating. Apparently the additional upper bound computation takes very little time in comparison to the other tasks (e.g. parsing, generating the model, pre-computation) and does not slow down the verification significantly. For cloud, BVI is slower than PRISM-games, because there are thousands of MSECs and deflating them takes over 80% of the time. This comes from the fact that we need to compute the expensive end component decomposition for each deflating step. BRTDP performs well for cloud, because in this model, as well as generally often if there are many MECs [BCC+14], only a small part of the state space is relevant for convergence. For the other models, BRTDP is slower than the deterministic approaches, because the models are so small that it is faster to first construct them completely than to explore them by simulation.
Our more extensive experiments on MDPs compare the guaranteed approaches based on collapsing (i.e. learning-based from [BCC+14] and deterministic from [HM17]) to our guaranteed approaches based on deflating (so BRTDP and BVI). Since both learning-based approaches as well as both deterministic approaches perform similarly (see Table 2 in [KKKW18, Appendix B]), we conclude that collapsing and deflating are both useful for practical purposes, while the latter is also applicable to SGs. Furthermore, we compared the usual unguaranteed value iteration of PRISM's explicit engine to BVI and saw that our guaranteed approach did not take significantly more time in most cases. This strengthens the point that the overhead for the computation of the guarantees is negligible.
6 Conclusions
We have provided the first stopping criterion for value iteration on simple stochastic games and an anytime algorithm with bounds on the current error (guarantees on the precision of the result). The main technical challenge was that states in end components of an SG can have different values, in contrast to the case of MDPs. We have shown that collapsing is in general not possible, but we utilized the analysis to obtain the procedure of deflating, a solution working on the original graph. Besides, whenever an SEC is identified for certain, it can be collapsed, and the two techniques of collapsing and deflating can thus be combined.
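The overall procedure can be summarised by the following simplified Python outline (a sketch under our own naming, assuming hypothetical helpers bellman_update and find_msecs and the deflate step sketched above; it is not the actual implementation).

```python
def bounded_value_iteration(game, targets, epsilon):
    """Anytime value iteration with a stopping criterion: maintain a lower bound L
    and an upper bound U on the value of every state; U - L bounds the current error."""
    L = {s: (1.0 if s in targets else 0.0) for s in game.states}
    U = {s: 1.0 for s in game.states}
    while U[game.initial] - L[game.initial] > epsilon:
        L = game.bellman_update(L)          # max over Maximizer's, min over Minimizer's actions
        U = game.bellman_update(U)
        for msec in game.find_msecs(U):     # candidate MSECs w.r.t. the current bounds
            deflate(msec, U, game.max_actions, game.succ, game.prob)
    # The true value of the initial state lies in the returned interval.
    return L[game.initial], U[game.initial]
```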
The experiments indicate that the price to pay for the overhead to compute the error bound is often negligible. For each of the available models, at least one of our two implementations has performed similarly to or better than the standard approach, which yields no guarantees. Further, the obtained guarantees open the door to (e.g. learning-based) heuristics which treat only a part of the state space and can thus potentially lead to huge improvements. Surprisingly, already our straightforward adaptation of such an algorithm for MDPs to SGs yields interesting results, palliating the overhead of our non-learning method, despite the most naive implementation of deflating. Future work could reveal whether other heuristics or a more efficient implementation can lead to huge savings as in the case of MDPs [BCC+14].
References
[ACD+17] Pranav Ashok, Krishnendu Chatterjee, Przemysław Daca, Jan Křetínský, and Tobias Meggendorfer. Value iteration for long-run average reward in Markov decision processes. In CAV, pages 201-221, 2017.
[AM09] Daniel Andersson and Peter Bro Miltersen. The complexity of solving stochastic games on graphs. In ISAAC, pages 112-121, 2009.
[AY17] Gürdal Arslan and Serdar Yüksel. Decentralized Q-learning for stochastic teams and games. IEEE Trans. Automat. Contr., 62(4):1545-1558, 2017.
[BBS08] Lucian Buşoniu, Robert Babuška, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Systems, Man, and Cybernetics, Part C, 38(2):156-172, 2008.
[BCC+14] Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelík, Vojtěch Forejt, Jan Křetínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Verification of Markov decision processes using learning algorithms. In ATVA, pages 98-114. Springer, 2014.
[BK08] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
[BKL+17] Christel Baier, Joachim Klein, Linda Leuschner, David Parker, and Sascha Wunderlich. Ensuring the reliability of your model checker: Interval iteration for Markov decision processes. In CAV, pages 160-180, 2017.
[BT00] Ronen I. Brafman and Moshe Tennenholtz. A near-optimal polynomial time algorithm for learning in certain classes of stochastic games. Artif. Intell., 121(1-2):31-47, 2000.
[CF11] Krishnendu Chatterjee and Nathanaël Fijalkow. A reduction from parity games to simple stochastic games. In GandALF, pages 74-86, 2011.
[CFK+13a] T. Chen, V. Forejt, M. Kwiatkowska, D. Parker, and A. Simaitis. PRISM-games: A model checker for stochastic multi-player games. In N. Piterman and S. Smolka, editors, Proc. 19th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'13), volume 7795 of LNCS, pages 185-191. Springer, 2013.
[CFK+13b] Taolue Chen, Vojtěch Forejt, Marta Z. Kwiatkowska, David Parker, and Aistis Simaitis. Automatic verification of competitive stochastic systems. Formal Methods in System Design, 43(1):61-92, 2013.
[CH08] Krishnendu Chatterjee and Thomas A. Henzinger. Value iteration. In 25 Years of Model Checking, pages 107-138. Springer, 2008.
[CHJR10] Krishnendu Chatterjee, Thomas A. Henzinger, Barbara Jobstmann, and Arjun Radhakrishna. Gist: A solver for probabilistic games. In CAV, pages 665-669, 2010.
[CKJ12] Radu Calinescu, Shinji Kikuchi, and Kenneth Johnson. Compositional reverification of probabilistic safety properties for large-scale complex IT systems, pages 303-329. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[CKLB11] Chih-Hong Cheng, Alois Knoll, Michael Luttenberger, and Christian Buckl. GAVS+: An open platform for the research of algorithmic game solving. In ETAPS, pages 258-261, 2011.
[CKSW13] Taolue Chen, Marta Z. Kwiatkowska, Aistis Simaitis, and Clemens Wiltsche. Synthesis for multi-objective stochastic games: An application to autonomous urban driving. In QEST, pages 322-337, 2013.
[CMG14] Javier Cámara, Gabriel A. Moreno, and David Garlan. Stochastic game analysis and latency awareness for proactive self-adaptation. In 9th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, SEAMS 2014, Proceedings, Hyderabad, India, June 2-3, 2014, pages 155-164, 2014.
[Con92] Anne Condon. The complexity of stochastic games. Information and Computation, 96(2):203-224, 1992.
[CY95] Costas Courcoubetis and Mihalis Yannakakis. The complexity of probabilistic verification. Journal of the ACM, 42(4):857-907, July 1995.
[DJKV17a] Christian Dehnert, Sebastian Junges, Joost-Pieter Katoen, and Matthias Volk. A Storm is coming: A modern probabilistic model checker. In CAV, pages 592-600, 2017.
[DJKV17b] Christian Dehnert, Sebastian Junges, Joost-Pieter Katoen, and Matthias Volk. A Storm is coming: A modern probabilistic model checker. CoRR, abs/1702.04311, 2017.
[gam] PRISM-games case studies. prismmodelchecker.org/games/casestudies.php. Accessed: 2017-09-18.
[HK66] A. J. Hoffman and R. M. Karp. On nonterminating stochastic games. Management Science, 12(5):359-370, 1966.
[HM17] Serge Haddad and Benjamin Monmege. Interval iteration algorithm for MDPs and IMDPs. Theoretical Computer Science, 2017.
[KKKW18] Edon Kelmendi, Julia Krämer, Jan Křetínský, and Maximilian Weininger. Value iteration for simple stochastic games: Stopping criterion and learning algorithm. Technical Report abs/1804.04901, arXiv.org, 2018.
[KKNP10] Mark Kattenbelt, Marta Z. Kwiatkowska, Gethin Norman, and David Parker. A game-based abstraction-refinement framework for Markov decision processes. Formal Methods in System Design, 36(3):246-280, 2010.
[KM17] Jan Křetínský and Tobias Meggendorfer. Efficient strategy iteration for mean payoff in Markov decision processes. In ATVA, pages 380-399, 2017.
[KNP11] M. Kwiatkowska, G. Norman, and D. Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585-591. Springer, 2011.
[KNP12] M. Kwiatkowska, G. Norman, and D. Parker. The PRISM benchmark suite. In 9th International Conference on Quantitative Evaluation of Systems (QEST'12), pages 203-204. IEEE, 2012.
[LaV00] Steven M. LaValle. Robot motion planning: A game-theoretic foundation. Algorithmica, 26(3-4):430-465, 2000.
[LL08] Jianwei Li and Weiyi Liu. A novel heuristic Q-learning algorithm for solving stochastic games. In IJCNN, pages 1135-1144, 2008.
[Mar75] Donald A. Martin. Borel determinacy. Annals of Mathematics, pages 363-371, 1975.
[MLG05] H. Brendan McMahan, Maxim Likhachev, and Geoffrey J. Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In ICML'05, pages 569-576, 2005.
[Put14] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[SK16] Mária Svoreňová and Marta Kwiatkowska. Quantitative verification and strategy synthesis for stochastic games. Eur. J. Control, 30:15-30, 2016.
[TT16] Alain Tcheukam and Hamidou Tembine. One swarm per queen: A particle swarm learning for stochastic games. In SASO, pages 144-145, 2016.
[Ujm15] Mateusz Ujma. On Verification and Controller Synthesis for Probabilistic Systems at Runtime. PhD thesis, Wolfson College, Oxford, 2015.
[WT16] Min Wen and Ufuk Topcu. Probably approximately correct learning in stochastic games with temporal logic specifications. In IJCAI, pages 3630-3636, 2016.
Rabinizer 4: From LTL to Your Favourite Deterministic Automaton*
Jan Křetínský, Tobias Meggendorfer, Salomon Sickert, and Christopher Ziegler
Technical University of Munich
Abstract. We present Rabinizer 4, a tool set for translating formulae of linear temporal logic to different types of deterministic ω-automata. The tool set implements and optimizes several recent constructions, including the first implementation translating the frequency extension of LTL. Further, we provide a distribution of PRISM that links Rabinizer and offers model-checking procedures for probabilistic systems that are not in the official PRISM distribution. Finally, we evaluate the performance, and in cases where previous implementations exist we show improvements both in terms of the size of the automata and the computational time, due to algorithmic as well as implementation improvements.
1 Introduction
The automata-theoretic approach [VW86] is a key technique for verification and synthesis of systems with linear-time specifications, such as formulae of linear temporal logic (LTL) [Pnu77]. It proceeds in two steps: first, the formula is translated into a corresponding automaton; second, the product of the system and the automaton is further analyzed. The size of the automaton is important as it directly affects the size of the product and thus largely also the analysis time, particularly for deterministic automata and probabilistic model checking, where the dependence is very direct. For verification of non-deterministic systems, mostly non-deterministic Büchi automata (NBA) are used [EH00,SB00,GO01,GL02,BKRS12,DLLF+16] since they are typically very small and easy to produce.
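For concreteness, a minimal Python sketch of the product construction used in the second step (hypothetical data structures, not taken from PRISM or Rabinizer): the system is a transition system with a labelling of states by atomic propositions, the automaton is deterministic, and the product synchronises the two. In probabilistic model checking the same construction is applied to a Markov chain or MDP, and the analysis then boils down to reachability of accepting parts of the product.

```python
def product(system_succ, system_label, system_init, aut_delta, aut_init):
    """Build the synchronous product of a labelled transition system and a
    deterministic automaton. system_succ[s] lists successor states,
    system_label[s] is the set of atomic propositions holding in s, and
    aut_delta(q, label) returns the unique automaton successor."""
    init = (system_init, aut_delta(aut_init, frozenset(system_label[system_init])))
    visited = {init}
    stack = [init]
    transitions = []
    while stack:
        s, q = stack.pop()
        for t in system_succ[s]:
            q2 = aut_delta(q, frozenset(system_label[t]))  # automaton reads the label of t
            if (t, q2) not in visited:
                visited.add((t, q2))
                stack.append((t, q2))
            transitions.append(((s, q), (t, q2)))
    return visited, transitions
```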
Probabilistic LTL model checking cannot profit directly from NBA. Even the qualitative question, whether a formula holds with probability 0 or 1, requires automata with at least a restricted form of determinism. The prime examples are the limit-deterministic (also called semi-deterministic) Büchi automata (LDBA) [CY88] and the generalized LDBA (LDGBA). However, for the general quantitative questions, where the probability of satisfaction is computed, general limit-determinism is not sufficient. Instead, deterministic Rabin automata (DRA) have mostly been used [KNP11] and recently also deterministic generalized Rabin automata (DGRA) [CGK13]. In principle, all standard types of deterministic automata are applicable here, except for deterministic Büchi automata (DBA), which are not as expressive as LTL.
* This work has been partially supported by the Czech Science Foundation grant No. P202/12/G061 and the German Research Foundation (DFG) project KR 4890/1-1 "Verified Model Checkers" (317422601). A part of the frequency extension has been implemented within Google Summer of Code 2016.
[Figure: diagram of the LTL translations, with LTL translated to NBA, DGRA, DRA, LDBA and DPA, and edges labelled by the respective constructions [KE12], [Saf88], [Pit06,Sch09], [KMWW17], [EKRS17].]
Fig. 1. LTL translations to different types of automata. Translations implemented in Rabinizer 4 are indicated with a solid line. The traditional approaches are depicted as dotted arrows. The determinization of NBA to DRA is implemented in ltl2dstar [Kle], to LDBA in Seminator [BDK+17] and to (mostly) DPA in Spot [DLLF+16].
However, other types of automata, such as deterministic Muller and deterministic parity automata (DPA), are typically larger than DGRA in terms of the acceptance condition or the state space, respectively.1 Recently, several approaches with specific LDBA were proved applicable to the quantitative setting [HLS+15,SEJK16] and competitive with DGRA. Besides, model checking MDPs against LTL properties involving frequency operators [BDL12] also allows for an automata-theoretic approach, via deterministic generalized Rabin mean-payoff automata (DGRMA) [FKK15].
LTL synthesis can also be solved using the automata-theoretic approach. Although DRA and DGRA transformed into games can be used here, the algorithms for the resulting Rabin games [PP06] are not very efficient in practice. In contrast, DPA may be larger, but in this setting they are the automata of choice due to the good practical performance of parity-game solvers [FL09,ML16,JBB+17].
Types of translations. The translations of LTL to NBA, e.g., [VW86], are typically "semantic" in the sense that each state is given by a set of logical formulae and the language of the state can be captured in terms of the semantics of these formulae. In contrast, the determinization of Safra [Saf88] or its improvements [Pit06,Sch09,TD14,FL15] are not "semantic" in the sense that they ignore this structure and produce trees as the new states, which lack the logical interpretation. As a result, if we apply Safra's determinization to semantically created NBA, we obtain DRA that lack the structure and, moreover, are unnecessarily large since the construction cannot utilize the original structure. In contrast, the recent works [KE12,KLG13,EK14,KV15,SEJK16,EKRS17,MS17,KV17] provide "semantic" constructions, often producing smaller automata. Furthermore, various transformations such as degeneralization [KE12], index appearance record [KMWW17] or determinization of limit-deterministic automata [EKRS17] preserve the semantic description, allowing for further optimizations of the resulting automata.
Our contribution. While all previous versions of Rabinizer [GKE12,KLG13,KK14] featured only the translation LTL→DGRA→DRA, Rabinizer 4 now implements all the translations depicted by the solid arrows in Fig. 1. It improves all these
1 Note that every DGRA can be written as a Muller automaton on the same state space with an exponentially-sized acceptance condition, and DPA are a special case of DRA and thus DGRA.
translations, both algorithmically and implementation-wise, and moreover, features the first implementation of the translation of a frequency extension of LTL [FKK15].
Further, in order to utilize the resulting automata for verification, we provide our own distribution2 of the PRISM model checker [KNP11], which allows for model checking MDP against LTL using not only DRA and DGRA, but also using LDBA and against frequency LTL using DGRMA. Finally, the tool can turn the produced DPA into parity games between the players with input and output variables. Therefore, when linked to parity-game solvers, Rabinizer 4 can be also used for LTL synthesis.
Rabinizer 4 is freely available at http://rabinizer.model.in.tum.de together with an on-line demo, visualization, usage instructions and examples.
2 Functionality
We recall that the previous version Rabinizer 3 has the following functionality:
— It translates LTL formulae into equivalent DGRA or DRA.
— It is linked to PRISM, allowing for probabilistic verification using DGRA (previously PRISM could only use DRA).
2.1 Translations
Rabinizer 4 inputs formulae of LTL and outputs automata in the standard HOA format [BBD+15], which is used, e.g., as the input format in PRISM. Automata in the HOA format can be directly visualized, displaying the "semantic" description of the states. Rabinizer 4 features the following command-line tools for the respective translations depicted as the solid arrows in Fig. 1:
ltl2dgra and ltl2dra correspond to the original functionality of Rabinizer 3, i.e., they translate LTL (now with the extended syntax, including all common temporal operators) to DGRA and DRA [EK14], respectively.
ltl2ldgba and ltl2ldba translate LTL to LDGBA using the construction of [SEJK16] and to LDBA, respectively. The latter is our modification of the former, which produces smaller automata than chaining the former with the standard degeneralization.
ltl2dpa translates LTL to DPA using two modes:
— The default mode uses the translation to LDBA, followed by an LDBA-to-DPA determinization [EKRS17] specially tailored to LDBA with the "semantic" labelling of states, avoiding an additional exponential blow-up of the resulting automaton.
— The alternative mode uses the translation to DRA, followed by our improvement of the index appearance record of [KMWW17].
fltl2dgrma translates the frequency extension of LTL\GU, i.e., LTL\GU [KLG13] extended with the G~p operator3, to DGRMA using the construction of [FKK15].
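To give an impression of the interchange format mentioned above, a hand-written toy example of a deterministic Büchi automaton for the formula Fa in the HOA format (for illustration only, not output produced by Rabinizer) could look as follows:

```
HOA: v1
States: 2
Start: 0
AP: 1 "a"
acc-name: Buchi
Acceptance: 1 Inf(0)
--BODY--
State: 0
[0] 1
[!0] 0
State: 1 {0}
[t] 1
--END--
```

State 0 waits for a; state 1, which belongs to the single accepting set, is entered and never left once a has occurred. State names (omitted here) are where the "semantic" description of states can be carried.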
2 Merging these features into the public release of PRISM as well as linking the new version of Rabinizer is subject to current collaboration with the authors of PRISM.
3 The sequential globally construct [BDL12,BMM14] G~p ψ with ~ ∈ {≥, >, ≤, <} and p ∈ [0,1] intuitively means that the fraction of positions satisfying ψ satisfies ~ p. Formally, the fraction on an infinite run is defined using the long-run average [BMM14].
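For concreteness, one natural way to formalise this long-run fraction is via the limit inferior of the prefix averages (a sketch; the precise choice of limit in [BMM14,FKK15] may be stated slightly differently):

```latex
\[
  \mathrm{freq}_{\psi}(\rho)
    \;=\; \liminf_{n\to\infty}\ \frac{1}{n}\,
          \bigl|\{\, i < n \mid \rho_i \models \psi \,\}\bigr| ,
  \qquad
  \rho \models \mathbf{G}^{\sim p}\psi
  \;\Longleftrightarrow\;
  \mathrm{freq}_{\psi}(\rho) \sim p .
\]
```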
2.2 Verification and synthesis
The resulting automata can be used for model checking probabilistic systems and for LTL synthesis. To this end, we provide our own distribution of the probabilistic model checker PRISM as well as a procedure transforming automata into games to be solved.
Model checking: PRISM distribution For model checking Markov chains and Markov decision processes, PRISM [KNP11] uses DRA and recently also more efficient DGRA [CGK13,KK14]. Our distribution, which links Rabinizer, additionally features model checking using the LDBA [SEJK16,SK16] that are created by our ltl2ldba.
Further, the distribution provides an implementation of frequency LTL\GU model checking, using DGRMA. To the best of our knowledge, there are no other implemented procedures for logics with frequency. Here, techniques of linear programming for multi-dimensional mean-payoff satisfaction [CKK15] and the model-checking procedure of [FKK15] are implemented and applied.
Synthesis: Games The automata-theoretic approach to LTL synthesis requires transforming the LTL formula into a game of the input and output players. We provide this transformer and thus an end-to-end LTL synthesis solution, provided a respective game solver is linked. Since current solutions to Rabin games are not very efficient, we implemented a transformation of DPA into parity games and a serialization to the format of PG Solver [FL09]. Due to the explicit serialization, we foresee the main use in quick prototyping.
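A rough Python sketch of this splitting into a game (our own simplification with hypothetical names, not the transformation as implemented in the tool): in every round the environment first fixes a valuation of the input propositions, the system answers with a valuation of the outputs, and the DPA moves on the combined letter; node priorities are inherited from the DPA.

```python
from itertools import combinations

def all_valuations(aps):
    """All truth assignments over the atomic propositions aps,
    represented as frozensets of the propositions set to true."""
    return [frozenset(c) for r in range(len(aps) + 1) for c in combinations(aps, r)]

def dpa_to_parity_game(delta, priority, q0, inputs, outputs):
    """Turn a DPA (transition function delta, state priorities, initial state q0)
    over AP = inputs ∪ outputs into a two-player parity game arena."""
    env_moves = {}  # environment node q       -> intermediate nodes (q, i)
    sys_moves = {}  # intermediate node (q, i) -> successor DPA states
    for q in priority:
        env_moves[q] = [(q, i) for i in all_valuations(inputs)]
        for i in all_valuations(inputs):
            sys_moves[(q, i)] = [delta(q, i | o) for o in all_valuations(outputs)]
    return env_moves, sys_moves, q0
```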
3 Optimizations, Implementation, and Evaluation
Compared to the theoretical constructions and previous implementations, there are numerous improvements, heuristics, and engineering enhancements. We evaluate the improvements both in terms of the size of the resulting automaton as well as the running time. When comparing with respect to the original Rabinizer functionality, we compare our implementation ltl2dgra to the previous version Rabinizer 3.1, which is already a significantly faster [EKS16] re-implementation of the official release Rabinizer 3 [KK14]. All of the benchmarks have been executed on a host with i7-4700MQ CPU (4x2.4 GHz), running Linux 4.9.0-5-amd64 and the Oracle JRE 9.0.4+11 JVM. Due to the start-up time of JVM, all times below 2 seconds are denoted by <2 and not specified more precisely. All experiments were given a time-out of 900 seconds and mem-out of 4GB, denoted by —.
Algorithmic improvements and heuristics for each of the translations:
ltl2dgra and ltl2dra These translations create a master automaton monitoring the satisfaction of the given formula and a dedicated slave automaton for each subformula of the form Gψ [EK14]. We (i) simplify several classes of slaves and (ii) "suspend" (in the spirit of [BBDL+13]) some of them so that they appear in the final product only in some states. The effect on the size of the state space is illustrated in Table 1 on a nested formula. Further, (iii) the acceptance condition is considered separately for each strongly connected component (SCC) and then combined.
[Tables 1 and 2: effect of the simplifications, suspension, and SCC-based acceptance computation for ltl2dgra on families of nested formulae ψ_i built from subformulae of the form φ_i = XG((a_i U b_i) ∨ (c_i U d_i)), comparing Rabinizer 3.1 [EKS16] with ltl2dgra in terms of running time and automaton size.]
Table 3. Effect of break-point elimination for ltl2ldba on safety formulae s(n,m) = ⋀_{i=1}^{n} G(a_i ∨ X^m b_i) and for ltl2ldgba on liveness formulae l(n,m) = ⋀_{i=1}^{n} GF(a_i ∧ X^m b_i), displaying #states (#Büchi conditions).

            s(1,3)   s(2,3)    s(3,3)      s(4,3)       s(1,4)   s(2,4)    s(3,4)      s(4,4)
[SEJK16]    20 (1)   400 (2)   8·10^3 (3)  16·10^4 (4)  48 (1)   2304 (2)  110592 (3)
ltl2ldba     8 (1)    64 (1)   512 (1)     4096 (1)     16 (1)    256 (1)    4096 (1)  65536 (1)

            l(1,1)   l(2,1)   l(3,1)   l(4,1)   l(1,4)   l(2,4)    l(3,4)     l(4,4)
[SEJK16]     3 (1)    9 (2)   27 (3)   81 (4)   10 (1)   100 (2)   10^3 (3)   10^4 (4)
ltl2ldgba    3 (1)    5 (2)    9 (3)   17 (4)    3 (1)     5 (2)      9 (3)     17 (4)
Table 4. Effect of non-determinism of the initial component for ltl2ldba on formulae f(i) = F(a ∧ X^i Gb), displaying #states (#Büchi conditions).

            f(1)    f(2)    f(3)     f(4)     f(5)     f(6)
[SEJK16]    4 (1)   6 (1)   10 (1)   18 (1)   34 (1)   66 (1)
ltl2ldba    2 (1)   3 (1)    4 (1)    5 (1)    6 (1)    7 (1)
On a concrete example of Table 2, the automaton for i = 8 has 31 atomic propositions, whereas the number of atomic propositions relevant in each component of the master automaton is constant, which we utilize and thus improve performance on this family both in terms of size and time.
ltl2ldba This translation is based on breakpoints for subformulae of the form Gψ. We provide a heuristic that avoids breakpoints when ψ is a safety or co-safety subformula, see Table 3.
Besides, we add an option to generate a non-deterministic initial component for the LDBA instead of a deterministic one. Although the LDBA is then no longer suitable for quantitative probabilistic model checking, it still is for qualitative model checking. At the same time, it can be much smaller, see Table 4, which shows a significant improvement on the particular formula.
Table 5. Comparison of the average performance with the previous version of Rabinizer. The statistics are taken over a set of 200 standard formulae [KMS18] used, e.g., in [BKS13,EKS16], run in a batch mode for both tools to eliminate the effect of the JVM start-up overhead.
Tool                    Avg. #states   Avg. #acc. sets   Avg. runtime
Rabinizer 3.1 [EKS16]        6.3             6.7              0.23
ltl2dgra                     6.2             4.4              0.12
ltl2dpa Both modes inherit the improvements of the respective ltl2ldba and ltl2dgra translations. Further, since complementing DPA is trivial, we can run both the translation of the input formula and of its negation in parallel, returning the smaller of the two results. Finally, we introduce several heuristics to optimize the treatment of safety subformulae of the input formula.
dra2dpa The index appearance record of [KMWW17] keeps track of a permutation (ordering) of Rabin pairs. To do so, all ties between pairs have to be resolved. In our implementation, we keep a pre-order instead, where irrelevant ties are not resolved. Consequently, it cannot happen that an irrelevant tie is resolved in two different ways as in [KMWW17], thus effectively merging such states.
Implementation The main performance bottleneck of the older implementations is that explicit data structures for the transition system are not efficient for larger alphabets. To this end, Rabinizer 3.1 provided a symbolic (BDD) representation of states and edge labels. On top of that, Rabinizer 4 represents the transition function symbolically, too.
Besides, there are further engineering improvements on issues such as storing the acceptance condition only as a local edge labelling, caching, data-structure overheads, SCC-based divide-and-conquer constructions, or the introduction of parallelization for batch inputs.
Average performance evaluation We have already illustrated the improvements on several hand-crafted families of formulae. In Tables 1 and 2 we have even seen the respective running-time speed-ups. As the basis for the overall evaluation of the improvements, we use some established datasets from literature, see [KMS18], altogether two hundred formulae. The results in Table 5 indicate that the performance improved also on average among the more realistic formulae.
4 Conclusion
We have presented Rabinizer 4, a tool set to translate LTL to various deterministic automata and to use them in probabilistic model checking and in synthesis. The tool set extends the previous functionality of Rabinizer, improves on previous translations, and also gives the very first implementations of frequency LTL translation as well as model checking. Finally, the tool set is also more user-friendly due to its richer input syntax, its connection to PRISM and PG Solver, and the on-line version with direct visualization, which can be found at http://rabinizer.model.in.tum.de.
References
[BBD+15] Tomáš Babiak, František Blahoudek, Alexandre Duret-Lutz, Joachim Klein, Jan Křetínský, David Müller, David Parker, and Jan Strejček. The Hanoi omega-automata format. In CAV, Part I, pages 479-486, 2015.
[BBDL+13] Tomáš Babiak, Thomas Badie, Alexandre Duret-Lutz, Mojmír Křetínský, and Jan Strejček. Compositional approach to suspension and other improvements to LTL translation. In SPIN, pages 81-98, 2013.
[BDK+17] František Blahoudek, Alexandre Duret-Lutz, Mikuláš Klokočka, Mojmír Křetínský, and Jan Strejček. Seminator: A tool for semi-determinization of omega-automata. In LPAR, pages 356-367, 2017.
[BDL12] Benedikt Bollig, Normann Decker, and Martin Leucker. Frequency linear-time temporal logic. In TASE, pages 85-92, 2012.
[BKRS12] Tomáš Babiak, Mojmír Křetínský, Vojtěch Řehák, and Jan Strejček. LTL to Büchi automata translation: Fast and more deterministic. In TACAS, pages 95-109, 2012.
[BKS13] František Blahoudek, Mojmír Křetínský, and Jan Strejček. Comparison of LTL to deterministic Rabin automata translators. In LPAR, volume 8312 of LNCS, pages 164-172, 2013.
[BMM14] Patricia Bouyer, Nicolas Markey, and Raj Mohan Matteplackel. Averaging in LTL. In CONCUR, pages 266-280, 2014.
[CGK13] Krishnendu Chatterjee, Andreas Gaiser, and Jan Křetínský. Automata with generalized Rabin pairs for probabilistic model checking and LTL synthesis. In CAV, pages 559-575, 2013.
[CKK15] Krishnendu Chatterjee, Zuzana Komárková, and Jan Křetínský. Unifying two views on multiple mean-payoff objectives in Markov decision processes. In LICS, pages 244-256, 2015.
[CY88] Costas Courcoubetis and Mihalis Yannakakis. Verifying temporal properties of finite-state probabilistic programs. In FOCS, pages 338-345, 1988.
[DLLF+16] Alexandre Duret-Lutz, Alexandre Lewkowicz, Amaury Fauchille, Thibaud Michaud, Etienne Renault, and Laurent Xu. Spot 2.0 - a framework for LTL and ω-automata manipulation. In ATVA, pages 122-129, October 2016.
[EH00] Kousha Etessami and Gerard J. Holzmann. Optimizing Büchi automata. In CONCUR, pages 153-167, 2000.
[EK14] Javier Esparza and Jan Křetínský. From LTL to deterministic automata: A Safraless compositional approach. In CAV, pages 192-208, 2014.
[EKRS17] Javier Esparza, Jan Křetínský, Jean-François Raskin, and Salomon Sickert. From LTL and limit-deterministic Büchi automata to deterministic parity automata. In TACAS, pages 426-442, 2017.
[EKS16] Javier Esparza, Jan Křetínský, and Salomon Sickert. From LTL to deterministic automata - A Safraless compositional approach. Formal Methods in System Design, 49(3):219-271, 2016.
[FKK15] Vojtěch Forejt, Jan Krčál, and Jan Křetínský. Controller synthesis for MDPs and frequency LTL\GU. In LPAR, pages 162-177, 2015.
[FL09] Oliver Friedmann and Martin Lange. Solving parity games in practice. In ATVA, pages 182-196, 2009.
[FL15] Dana Fisman and Yoad Lustig. A modular approach for Büchi determinization. In CONCUR, pages 368-382, 2015.
[GKE12] Andreas Gaiser, Jan Křetínský, and Javier Esparza. Rabinizer: Small deterministic automata for LTL(F,G). In ATVA, pages 72-76, 2012.
[GL02] Dimitra Giannakopoulou and Flavio Lerda. From states to transitions: Improving translation of LTL formulae to Büchi automata. In FORTE, pages 308-326, 2002.
[GO01] Paul Gastin and Denis Oddoux. Fast LTL to Büchi automata translation. In CAV, pages 53-65, 2001. Tool accessible at http://www.lsv.ens-cachan.fr/~gastin/ltl2ba/.
[HLS+15] Ernst Moritz Hahn, Guangyuan Li, Sven Schewe, Andrea Turrini, and Lijun Zhang. Lazy probabilistic model checking without determinisation. In CONCUR, volume 42 of LIPIcs, pages 354-367, 2015.
[JBB+17] Swen Jacobs, Nicolas Basset, Roderick Bloem, Romain Brenguier, Maximilien Colange, Peter Faymonville, Bernd Finkbeiner, Ayrat Khalimov, Felix Klein, Thibaud Michaud, Guillermo A. Pérez, Jean-François Raskin, Ocan Sankur, and Leander Tentrup. The 4th reactive synthesis competition (SYNTCOMP 2017): Benchmarks, participants & results. CoRR, abs/1711.11439, 2017.
[KE12] Jan Křetínský and Javier Esparza. Deterministic automata for the (F,G)-fragment of LTL. In CAV, volume 7358 of LNCS, pages 7-22, 2012.
[KK14] Zuzana Komárková and Jan Křetínský. Rabinizer 3: Safraless translation of LTL to small deterministic automata. In ATVA, volume 8837 of LNCS, pages 235-241, 2014.
[Kle] Joachim Klein. ltl2dstar - LTL to deterministic Streett and Rabin automata. http://www.ltl2dstar.de/.
[KLG13] Jan Křetínský and Ruslán Ledesma-Garza. Rabinizer 2: Small deterministic automata for LTL\GU. In ATVA, pages 446-450, 2013.
[KMS18] Jan Křetínský, Tobias Meggendorfer, and Salomon Sickert. LTL Store: Repository of LTL formulae from literature and case studies. CoRR, abs/1804.xxxx, 2018.
[KMWW17] Jan Křetínský, Tobias Meggendorfer, Clara Waldmann, and Maximilian Weininger. Index appearance record for transforming Rabin automata into parity automata. In TACAS, pages 443-460, 2017.
[KNP11] Marta Z. Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, pages 585-591, 2011.
[KV15] Dileep Kini and Mahesh Viswanathan. Limit deterministic and probabilistic automata for LTL\GU. In TACAS, pages 628-642, 2015.
[KV17] Dileep Kini and Mahesh Viswanathan. Optimal translation of LTL to limit deterministic automata. In TACAS, 2017. To appear.
[ML16] Philipp J. Meyer and Michael Luttenberger. Solving mean-payoff games on the GPU. In ATVA, pages 262-267, 2016.
[MS17] David Müller and Salomon Sickert. LTL to deterministic Emerson-Lei automata. In GandALF, pages 180-194, 2017.
[Pit06] Nir Piterman. From nondeterministic Büchi and Streett automata to deterministic parity automata. In LICS, pages 255-264, 2006.
[Pnu77] Amir Pnueli. The temporal logic of programs. In FOCS, pages 46-57, 1977.
[PP06] Nir Piterman and Amir Pnueli. Faster solutions of Rabin and Streett games. In LICS, pages 275-284, 2006.
[Saf88] Shmuel Safra. On the complexity of omega-automata. In FOCS, pages 319-327, 1988.
[SB00] Fabio Somenzi and Roderick Bloem. Efficient Büchi automata from LTL formulae. In CAV, pages 248-263, 2000.
[Sch09] Sven Schewe. Tighter bounds for the determinisation of Büchi automata. In FoSSaCS, volume 5504 of LNCS, pages 167-181, 2009.
[SEJK16] Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan Křetínský. Limit-deterministic Büchi automata for linear temporal logic. In CAV, pages 312-332, 2016.
[SK16] Salomon Sickert and Jan Křetínský. MoChiBA: Probabilistic LTL model checking using limit-deterministic Büchi automata. In ATVA, pages 130-137, 2016.
[TD14] Cong Tian and Zhenhua Duan. Büchi determinization made tighter. Technical Report abs/1404.1436, arXiv.org, 2014.
[VW86] Moshe Y. Vardi and Pierre Wolper. An automata-theoretic approach to automatic program verification (preliminary report). In LICS, pages 332-344, 1986.