Bayesian Learning, Shutdown and Convergence∗

Leopold Sögner†

October 15, 2011

Abstract

This article investigates a partial equilibrium production model with dynamic information aggregation. Firms apply Bayesian learning to estimate the unknown model parameter. In the baseline setting, where prices and quantities are supported on the real line and the noise term is Gaussian, convergence of the limited information to the full information setting is obtained. Imposing a non-negativity constraint on quantities destroys the convergence results obtained in the baseline model. With this constraint, firms learning an unknown demand intercept parameter exit with strictly positive probability, even when the true value of this parameter would induce production in the full information setting. Parts of the model can be rescued by assuming bounded support for the stochastic noise term. Although shutdown can be excluded with bounded support when the forecasts of the agents satisfy relatively mild regularity conditions, Bayesian learning and convergence need not take place in general.

Keywords: Bayesian Learning, Consistency, Convergence. JEL: D82, G10.

∗ The author appreciates helpful comments from Larry Blume, Egbert Dierker, Hildegard Dierker, Klaus Ritzberger, Andreas Danis, David Florysiak, Michael Greinecker, Maarten Janssen, Julian Kolm, Martin Meier and all participants of the Micro Jour Fixe at IHS (summer & winter term 2010), the VGSE seminar (Vienna 2011) and the 11th SAET conference (Faro 2011).
† Leopold Sögner, Department of Economics and Finance, Institute for Advanced Studies, Stumpergasse 56, 1060 Vienna, Austria, soegner@ihs.ac.at

1 Introduction

One of the workhorse models when considering learning is a partial equilibrium setting with normally distributed noise and agents applying Bayesian learning. The mean parameter of the stochastic noise term is unknown while the variance term is kept fixed. By assuming a conjugate normal prior, the posterior of the unknown mean parameter is normal, and the posterior mean is a mixture of the prior mean and the sample average. In addition to some other nice features of the normal distribution, this property also explains why a normal setting is used so frequently: a dynamic setting with an affine implied law of motion results in an affine perceived actual law of motion.

In this article the baseline model is the partial equilibrium model with production investigated in Jun and Vives (1996), JV in the following. JV have already shown that learning is not equivalent to convergence of the limited information to the full information equilibrium. However, for all non-unit-root settings the unknown parameter is learned, while convergence of the limited information quantities to full information quantities is attained for all feasible parameter values. This paper shows that various deviations from the affine normal setting can result in neither learning nor convergence. While quantities are real numbers in the baseline model, by requiring non-negative quantities - while sticking to a normally distributed demand intercept - we show that due to a shutdown option firms with limited information can leave the market while informed firms stay in it. Here the convergence result of the baseline setting breaks down.

Let us relate this article to recent literature. As already stated above, JV have shown that learning and convergence are not equivalent. The JV setting assumes a continuum of price takers and public signals only.
By the first assumption, solving for equilibrium does not require concepts from game theory, while the second assumption excludes any herding effects (for an overview see e.g. Vives (2008)[chapter 6]). Vives (2008) also presents a convergence result with herding behavior; this model also works with the normal distribution. The deviations discussed in this paper would also have an impact on the convergence results presented there. Alternatively, Smith and Sørensen (2000) considered a herding model with binomial outcomes. If there were no private signals in their setting, the Bayes estimator would be consistent since the sample space is finite. Smith and Sørensen (2000) show that learning from public and private information can give rise to confounded learning, a situation where a Markov transition probability becomes independent of the last realization of the state variable. Since only public information arises in our model, such a situation cannot appear here. However, our setting provides different examples where learning does not take place. We need neither heterogeneity of agents (informed vs. non-informed) nor herding behavior. Non-convergent behavior arises either from deviations from the normal distribution or from a shutdown condition (or both). For further stability results with respect to herding we refer the reader to Alfarano and Milakovic (2009).

A model very similar to our baseline setting has also been investigated in Guesnerie (1992) to check for eductive stability or a strongly rational expectations equilibrium; see also Guesnerie (1993) and Guesnerie and Jara-Moroni (2009) for a more general treatment of this concept. This strand of literature raises the question whether rational expectations can be supported by game-theoretic reasoning. There the author has shown that strong rationality depends on the eigenvalue of a cobweb mapping derived from a composition of the inverse aggregate demand with the supply function. In this article we abstract from the problem of strong rationalization and put the focus on the requirement of non-negative quantities.

This paper is organized as follows: Section 2 describes the benchmark model. Section 3 introduces non-negative quantities and a shutdown option for firms. If the feasible set is restricted to the non-negative real axis, shutdown - that is to say, a production of zero - can be an optimal strategy. Since no signals are received after shutdown, the agents then also decide to stop learning, and even in the limit the unknown parameter remains a random variable. With these modifications the convergence results obtained by JV may break down. Section 4 shows that a bounded support of the stochastic noise term and regularity conditions on the forecasts provide a more realistic model in which the results of the benchmark model can be resurrected. Although a convergence result follows easily in a quite mechanical way, it is important to note that this simple setup already shows that even with bounded support of the noise term, convergence need not be attained in general. Regularity conditions on the learning scheme are necessary to guarantee convergence. Examples show that matching these conditions can but need not be trivial. Last but not least we investigate robustness by allowing for deviations from normal noise. We present examples - based on Diaconis and Freedman (1986) and Christensen (2009) - where convergence need not occur even with bounded support.
This last topic is related to the consistency of the Bayes estimate (see e.g. Diaconis and Freedman (1986) or Strasser (1985)). Applications and extensions in economic theory have been presented e.g. in Blume and Easley (1993), Blume and Easley (1998), Feldman (1991), Kalai and Lehrer (1992) and Stinchcombe (2005). Appendix B reconsiders the results presented in the literature and provides some further examples.

2 The Benchmark Model

This section describes the model and the key results obtained in JV. Their results are based on the assumption that both the prices $p_t$ and in particular the quantities $x_t$ may be negative. JV and also Vives (2008)[chapter 7.2] consider a discrete time model of dynamic information aggregation; time is indexed by $t = 0, 1, 2, \dots$, economic activity starts at $t = 1$. The aggregate demand function is described by the stochastic linear relationship:

$$p_t = z_t - \beta x_t \quad \text{for } t \geq 1. \quad (1)$$

$p_t$ is the price established in period $t$, $\beta > 0$ is a constant and $x_t$ is the aggregate quantity consumed. $z_t$ is a random variable. The fractional differences $\Delta_\zeta z_t = z_t - \zeta z_{t-1}$ are described by

$$\Delta_\zeta z_t = (1 - \zeta)\theta + \eta_t, \quad (2)$$

where $\theta, \zeta \in \mathbb{R}$. With $u_t = z_t - \theta$ we get an autoregressive and a moving average representation:

$$z_t = \zeta z_{t-1} + (1 - \zeta)\theta + \eta_t = \theta + u_t = \theta + \sum_{s=0}^{t-1} \zeta^s \eta_{t-s} + \zeta^t u_0 = \theta + \sum_{s=0}^{t-1} \zeta^s \eta_{t-s} + \zeta^t (z_0 - \theta). \quad (3)$$

In addition $u_t = \zeta u_{t-1} + \eta_t$. In the baseline setting $\eta_t$ is iid normal with mean zero and variance $\sigma_\eta^2$, $\eta_t \sim N(0, \sigma_\eta^2)$. For $\zeta \neq 1$ the process follows a first order autoregressive process with normal innovations, while for $\zeta = 1$ we get a random walk. By this specification $x_t$ and $p_t \in \mathbb{R}$. $z_t$ is the stochastic demand intercept.¹ We assume that the process $(z_t)$ is started at $z_0 = \theta$, which has also been done implicitly in JV by assuming $u_0 = 0$. This assumption will reduce the computational burden in Section 4. It is important to note that $z_0$ cannot be observed by the firms.

Consider a continuum of firms $i \in [0, 1]$ endowed with Lebesgue measure. Each firm $i$ is a price taker and produces the homogeneous output $x_{it}$ at cost $C(x_{it}) = \frac{\lambda}{2} x_{it}^2$; the parameter $\lambda > 0$. Aggregate output fulfills $x_t = \int_0^1 x_{it} \, di$.² Depending on what firms observe, we distinguish between:

1. Full information (FI): At period $t$, the firms know $\theta$ and the past prices $p^{t-1} = \{p_1, \dots, p_{t-1}\}$. In addition firms know the structure of demand (1) and the model parameters $\lambda$, $\beta$, $\sigma_\eta^2$, $\zeta$, and $u_0 = 0$.

2. Limited information (LI): Firms know $\lambda$, $\beta$, $\sigma_\eta^2$, $\zeta$, and $u_0 = 0$ and observe past prices $p^{t-1}$. They do not know $\theta$, but they know the structure of demand (1) and (2). Firms use Bayes' rule to update beliefs about $\theta$. The prior is given by $\theta \sim N(\theta_0, \sigma_0^2)$. This is what Vives (2008)[chapter 7.1] calls learning within an equilibrium.

For notational simplicity abbreviate the information sets by $\mathcal{I}_t^{FI}$ and $\mathcal{I}_t^{LI}$ for the full information and the limited information case, respectively. Since firms are price takers, the profit function is $\pi_{it} = p_t x_{it} - \frac{\lambda}{2} x_{it}^2$ and the value function is

$$V_{it} = (1 - \delta) E\left( \sum_{k=0}^{\infty} \delta^k \left( p_{t+k} x_{i,t+k} - \frac{\lambda}{2} x_{i,t+k}^2 \right) \Big| \mathcal{I}_t^{(.)} \right)$$

with some discount factor $\delta \in (0, 1)$. By (1), (3) and $\zeta \in \mathbb{R}$ the value function need not be finite in general. However, by the structure of the optimization problem, maximizing $V_{it}$ breaks up into the one period optimization problems

$$\max_{x_{it}} E\left( p_t x_{it} - \frac{\lambda}{2} x_{it}^2 \Big| \mathcal{I}_t^{(.)} \right). \quad (4)$$

Given the information set $\mathcal{I}_t^{(.)}$ the first order condition yields:

$$x_{it} = \frac{E(p_t | \mathcal{I}_t^{(.)})}{\lambda}. \quad (5)$$

¹ For some examples it is more convenient to work with $\Delta_\zeta z_t = \varepsilon_t$, where $\varepsilon_t$ is centered around $(1 - \zeta)\theta$ (if the mean exists).
Such a transformation makes sense if asymmetric noise (e.g. an asymmetric (truncated) normal) or noise supported only on a subset of $\mathbb{R}$ is considered (e.g. Gamma distribution, truncated distributions). This notation will be applied in Examples 1 to 4 and in some examples in Appendix B.

² Since the $x_{it}$ are equal for all $i$, except for countably many $i$, the exact law of large numbers can be applied, such that $x_t = \int_0^1 x_{it} \, di$ still holds. For more details see Sun (2006).

Market Clearing: In period $t$ firm $i$ produces $x_{it} = E(p_t|\mathcal{I}_t^{(.)})/\lambda$ and aggregate (average) output is $\int_0^1 x_{it} \, di = x_t = E(p_t|\mathcal{I}_t^{(.)})/\lambda$. Immediately after this output decision the random variable $\eta_t$ realizes, resulting in $p_t = z_t - \beta x_t = z_t - \beta E(p_t|\mathcal{I}_t^{(.)})/\lambda$.

Remark 1. Since $\mathrm{sgn}(x_{it}) = \mathrm{sgn}(E(p_t|\mathcal{I}_t^{(.)}))$ by (5), expected revenues are positive no matter whether positive or negative prices are expected. Substitution of (5) into the expected profit function results in

$$E(\pi_{it}|\mathcal{I}_t^{(.)}) = \frac{E(p_t|\mathcal{I}_t^{(.)})^2}{\lambda} - \frac{\lambda E(p_t|\mathcal{I}_t^{(.)})^2}{2\lambda^2} = \frac{E(p_t|\mathcal{I}_t^{(.)})^2}{2\lambda} > 0 \quad \text{(a.s.)}.$$

Agents commit to supply $x_{it} \in \mathbb{R}$ even if $\mathrm{sgn}(x_{it}) = \mathrm{sgn}(E(p_t|\mathcal{I}_t^{(.)})) \neq \mathrm{sgn}(p_t)$. In this case $\pi_{it} < 0$. Hence, realized profits $\pi_{it} \in \mathbb{R}$. By the law of iterated expectations $E(\pi_{it}) > 0$.

Convergence, Prediction and Learning: We restrict our analysis to learning within an equilibrium, as defined in Vives (2008)[p. 249]. Examples of different learning schemes in different fields of economics are provided e.g. in Timmermann (1996), Brock and Hommes (1997), Kelly and Kolstad (1999), Routledge (1999) or Evans and Honkapohja (2001). For an overview and a bulk of literature we refer the reader to Vives (2008)[chapters 7.1 and 10.2].

Due to (5), firms have to predict the price $p_t$. By the inverse demand function (1) we get

$$E(p_t|\mathcal{I}_t^{(.)}) = E(z_t|\mathcal{I}_t^{(.)}) - \beta x_t. \quad (6)$$

$x_t$ is $\mathcal{I}_t^{(.)}$ measurable. For $k \geq 0$, the $k$ step ahead prediction is obtained by means of

$$E(z_{t+k}|\mathcal{I}_t^{(.)}) = (1 - \zeta) E(\theta|\mathcal{I}_t^{(.)}) \sum_{i=0}^{k} \zeta^i + \zeta^{k+1} z_{t-1}. \quad (7)$$

For the full information case, where $\theta$ is known, (7) results in $E(z_t|\mathcal{I}_t^{FI}) = (1 - \zeta)\theta + \zeta z_{t-1}$, such that

$$x_t^{FI} = \frac{(1 - \zeta)\theta + \zeta z_{t-1}}{\lambda + \beta} = \frac{\theta + \zeta u_{t-1}}{\lambda + \beta}. \quad (8)$$

By the inverse demand function, $y_t := p_t + \beta x_t = p_t + \beta E(p_t|\mathcal{I}_t^{(.)})/\lambda = \theta + u_t = z_t$. With limited information firms know that $y_t = z_t$. From (1) and the specification of the noise term, $\eta_t \sim$ iid $N(0, \sigma_\eta^2)$, algebra yields

$$z_t = \theta + \zeta(z_{t-1} - \theta) + \eta_t \quad \text{and} \quad \Delta_\zeta z_t := z_t - \zeta z_{t-1} = (1 - \zeta)\theta + \eta_t. \quad (9)$$

For $\zeta = 1$, we directly observe from $z_t = \theta(1 - \zeta) + \zeta z_{t-1} + \eta_t$ that the parameter $\theta$ is not identified. That is to say, for any $z_t \in \mathbb{R}$ the likelihood $f(z_t|\theta, \zeta = 1) = f(z_t|\theta', \zeta = 1)$ for arbitrary pairs $\theta, \theta'$. The last term in (9) corresponds to a linear regression setting with response variable $\Delta_\zeta z_t$, the (constant) prediction variable $(1 - \zeta)$ and normal innovations $\eta_t$. With the conjugate normal prior³ $\theta \sim N(\theta_0, \sigma_0^2)$, under limited information firms derive the posterior distribution of the parameter $\theta$ by means of (see e.g. Vives (2008)[Appendix 10.2], Chib (1993), Robert (1994)[Chapter 4]):

$$\theta \sim N(a_t, A_t), \quad \text{where} \quad a_t = \frac{\frac{1}{\sigma_\eta^2}(1 - \zeta) \sum_{s=1}^{t} \Delta_\zeta z_s + \frac{\theta_0}{\sigma_0^2}}{\frac{1}{\sigma_\eta^2}(1 - \zeta)^2 t + \frac{1}{\sigma_0^2}}, \quad A_t = \left( \frac{1}{\sigma_\eta^2}(1 - \zeta)^2 t + \frac{1}{\sigma_0^2} \right)^{-1}. \quad (10)$$

By means of $a_t$ firms immediately get the conditional expectation of $\theta$.⁴ Since $\mathcal{I}_t^{LI}$ is generated by observations from $1, \dots, t-1$, the conditional expectation $E(\theta|\mathcal{I}_t^{LI}) = a_{t-1}$. Then, by (7), forecasts of $z_{t+k}$ are

$$E(z_{t+k}|\mathcal{I}_t^{LI}) = (1 - \zeta) E(\theta|\mathcal{I}_t^{LI}) \sum_{i=0}^{k} \zeta^i + \zeta^{k+1} z_{t-1} = (1 - \zeta) a_{t-1} \sum_{i=0}^{k} \zeta^i + \zeta^{k+1} z_{t-1}. \quad (11)$$

(5) and (11), with $k = 0$, yield

$$x_t = \frac{E(z_t|\mathcal{I}_t^{LI})}{\lambda + \beta}. \quad (12)$$

³ The following analysis can also be performed with an uninformative Jeffreys prior (see e.g. Robert (1994)), which can be motivated by the (frequentist) argument that information should only be provided by the data. This type of prior minimizes the impact of the prior. For the current normal setting this implies that the terms including $\theta_0$, $\sigma_0^2$ or both vanish in (10). Proposition 1 and the results in Examples 5 and 6 still hold when the conjugate normal prior is replaced by an uninformative Jeffreys prior.

⁴ Only $z_t$ for $t \geq 1$ are available. At $t = 1$ firms use $E(z_1|\mathcal{I}_1^{LI}) = \theta_0$. At $t = 2$, $z_1$ is already known by the firms, but they cannot calculate $\Delta_\zeta z_1$. Here we assume that firms apply $\Delta_\zeta z_1 = z_1 - \theta_0$. Then (10) can be used. Alternatively, we could also start with a setup where no $\Delta_\zeta z_1$ is available but agents apply Bayesian simulation methods to sample the joint posterior of the parameter $\theta$ and the initial value, as described in Appendix C.
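The updating and output rules (10)-(12) are straightforward to implement. The following minimal sketch (our own illustration; function and variable names are ours) computes the posterior of $\theta$ and the limited-information output:

```python
import numpy as np

def posterior_theta(dz, zeta, sigma2_eta, theta0, sigma2_0):
    """Conjugate update (10): posterior N(a_t, A_t) of theta given the
    fractional differences dz = (Delta_zeta z_1, ..., Delta_zeta z_t)."""
    t = len(dz)
    A_t = 1.0 / ((1.0 - zeta) ** 2 * t / sigma2_eta + 1.0 / sigma2_0)
    a_t = A_t * ((1.0 - zeta) * np.sum(dz) / sigma2_eta + theta0 / sigma2_0)
    return a_t, A_t

def li_output(a_prev, z_prev, zeta, lam, beta):
    """Limited-information output (12), using the one step ahead forecast
    E(z_t | I_t^LI) = (1 - zeta) a_{t-1} + zeta z_{t-1}, i.e. (11) with k = 0."""
    return ((1.0 - zeta) * a_prev + zeta * z_prev) / (lam + beta)
```

Note that for $\zeta = 1$ the update leaves $(a_t, A_t)$ at $(\theta_0, \sigma_0^2)$, in line with the identification problem discussed above.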
Before we restate the JV convergence result for the baseline model, let us briefly discuss some properties of $(x_t)$ and $(x_t^{FI})$. $(x_t^{FI})$ follows a first order autoregressive process. Generally, a first order autoregressive process $u_t = \zeta u_{t-1} + \eta_t$ is ergodic (stationary in the limit) if $|\zeta| < 1$ and $E(|\eta_t|) < \infty$ (see e.g. Meyn and Tweedie (2009)[Chapters 11 and 15]; in this case $(u_t)$ is also geometrically ergodic). This implies that the distribution of $u_t$, $\mu_t$, converges to its stationary distribution $\mu$. If $u_0 \sim \mu$, then the process is stationary and ergodic. $u_t \in L^p$ if $\int |u_t|^p \, d\mu_t < \infty$, where $\mu_t$ is the probability law of $u_t$. For normal innovations, if $|\zeta| < 1$ then $(u_t)$ is stationary in the limit and the $p$-th moments exist, such that $u_t \in L^p$. Therefore, $(x_t^{FI})$ is stationary in the limit and in $L^p$ if $|\zeta| < 1$. If $|\zeta| \geq 1$, neither $(x_t)$ nor $(x_t^{FI})$ is ergodic; discussing $L^p$ convergence of the quantities does not make sense in this case. If $|\zeta| \geq 1$, we can work with the fractional differences $\Delta_\zeta x_t^{(.)} = x_t^{(.)} - \zeta x_{t-1}^{(.)}$. For the full-information quantities this results in $\Delta_\zeta x_t^{FI} = \frac{(1 - \zeta)\theta + \zeta \eta_{t-1}}{\lambda + \beta}$. $(\Delta_\zeta x_t^{FI})$ is stationary and in $L^p$ if $(\eta_t)$ is stationary and in $L^p$. Given our model assumptions, the fractional differences $\Delta_\zeta x_t^{FI}$ are ergodic and in $L^p$ for all $\zeta$. For the limited information case, some algebra yields

$$\Delta_\zeta x_t^{LI} = \Delta_\zeta x_t^{FI} + \frac{(1 - \zeta)\left[ a_{t-1} - \zeta a_{t-2} - (1 - \zeta)\theta \right]}{\lambda + \beta}. \quad (13)$$

The second term on the right-hand side of (13) goes to zero if $(a_t)$ converges to $\theta$. (13) also implies that $x_t - x_t^{FI} = \frac{1 - \zeta}{\lambda + \beta}(a_{t-1} - \theta)$. If $\zeta = 1$, then $x_t^{LI} = x_t^{FI}$ at least for $t \geq 2$, but $\theta$ is not learned. For all $\zeta \neq 1$, $(a_t)$ converges to $\theta$ almost surely and in $L^p$. $\Delta_\zeta x_t^{LI}$ and $x_t^{LI}$ converge to their full information counterparts for $\zeta \in \mathbb{R}$ and for $|\zeta| < 1$, respectively. Based on this discussion we reformulate the convergence result obtained by JV:

Proposition 1. [Propositions 2.1, 2.2 in JV; Proposition 7.1 in Vives (2008)] (i) If $|\zeta| < 1$ then $E(\theta|\mathcal{I}_t^{LI})$ converges to $\theta$ and $x_t \to x_t^{FI}$ (a.s. and in mean square). (ii) If $\zeta = 1$, except for the first period, no information about $\theta$ can be inferred from prices (with the precision of $E(\theta|\mathcal{I}_t^{LI})$ constant at $1/\sigma_0^2 + 1/\sigma_\eta^2$) but $x_t = x_t^{FI}$ for $t \geq 2$. (iii) If $\zeta \in \mathbb{R}$ and $\zeta \neq 1$, then $E(\theta|\mathcal{I}_t^{LI})$ converges to $\theta$ and $\Delta_\zeta x_t \to \Delta_\zeta x_t^{FI}$ (a.s. and in mean square).

Remark 2. JV also derived that $\sqrt{t}$ convergence is attained. This issue will not be investigated here. If $\zeta \neq 1$ then $E(\theta|\mathcal{I}_t^{LI}) \to \theta$. This implies $\Delta_\zeta x_t \to \Delta_\zeta x_t^{FI}$. Due to the second part of Proposition 1, this is only an "if" statement and not an "if and only if". That is to say, convergence of $x_t$ and learning $\theta$ are not equivalent.
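The convergence in Proposition 1(i) is easy to reproduce numerically. The following self-contained sketch (our own illustration, not the code behind Figure 1; it uses the parameter values of Figure 1 and skips $\Delta_\zeta z_1$ since $z_0$ is unobserved, cf. footnote 4) simulates the baseline model and tracks both quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
zeta, theta, beta, lam = 0.5, 1.0, 1.0, 1.0        # parameter values of Figure 1
theta0, sigma2_0, sigma2_eta = 1.0, 1.0, 1.0
T = 1000

z = [theta]                                        # z_0 = theta (unobserved by firms)
a_prev = theta0                                    # E(theta | I_1^LI) = theta_0
x_li, x_fi = np.empty(T), np.empty(T)

for t in range(1, T + 1):
    # quantities (8) and (12), chosen before eta_t realizes
    x_fi[t - 1] = ((1 - zeta) * theta + zeta * z[-1]) / (lam + beta)
    x_li[t - 1] = ((1 - zeta) * a_prev + zeta * z[-1]) / (lam + beta)
    # nature draws eta_t; firms back out z_t = y_t = p_t + beta * x_t
    z.append((1 - zeta) * theta + zeta * z[-1] + rng.normal(0.0, np.sqrt(sigma2_eta)))
    # conjugate update (10) on Delta_zeta z_s, s = 2, ..., t
    dz = np.array(z[2:]) - zeta * np.array(z[1:-1])
    if dz.size > 0:
        A = 1.0 / ((1 - zeta) ** 2 * dz.size / sigma2_eta + 1.0 / sigma2_0)
        a_prev = A * ((1 - zeta) * dz.sum() / sigma2_eta + theta0 / sigma2_0)

print("posterior mean:", a_prev)                   # close to theta = 1
print("|x_t - x_t^FI| tail:", np.abs(x_li - x_fi)[-3:])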
Subfigures (a) and (b) of Figure 1 provide a graphical illustration of the convergence result obtained in this section.

Remark 3. Reconsidering the model and the results of this section, $\theta$ is estimated in order to predict prices and $z_t$. One might therefore claim that - at least from the firms' point of view - the ultimate goal is not learning but forecasting. The Bayesian tool to describe the distribution of the forecast is the predictive density (predictive distribution). It is derived by means of the Bayes theorem (see e.g. Bickel and Doksum (2001)[p. 254] and Poirier (1995)[Chapter 8.6]). Generally, estimation and prediction are not equivalent. Blume and Easley (1998) present an example where learning occurs but the predictive density remains the same, and an example where the predictive density remains the same even if learning does not take place. In addition the authors show that absolute continuity of the marginal distribution of the parameter with respect to the likelihood is sufficient for convergence of the predictive distribution (see Blume and Easley (1998)[Theorem 2.3]). For the current model: If $\zeta \neq 1$ the predictive density of $z_t - \zeta z_{t-1}$ is a normal distribution with mean $(1 - \zeta)a_{t-1}$ and variance $(1 - \zeta)^2 A_{t-1} + \sigma_\eta^2$. If $t \to \infty$ then $a_t \to \theta$ and the forecast variance converges to $\sigma_\eta^2$. This implies that the asymptotic Bayesian and frequentist prediction intervals are equal. (If the parameter $\sigma_\eta^2$ is unknown, the predictive density is a t-distribution; the convergence result for the forecast densities still holds.) For $\zeta = 1$ the predictive distribution remains the same for all $t \geq 2$ although $\theta$ is not learned: $z_t$ is normal with mean $z_{t-1}$ and variance $\sigma_\eta^2$. That is to say, although learning need not take place in this baseline model, the forecasting distribution converges to the correct limit distribution for all $\zeta$.
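The closed form in Remark 3 is easy to evaluate. The following sketch (our own illustration) computes a Bayesian prediction interval for $\Delta_\zeta z_t$ and shows how it shrinks to its frequentist counterpart as $A_{t-1} \to 0$:

```python
from scipy.stats import norm

def predictive_interval(a_prev, A_prev, zeta, sigma2_eta, level=0.95):
    """Predictive density of Delta_zeta z_t given I_t^LI (Remark 3, zeta != 1):
    N((1 - zeta) * a_prev, (1 - zeta)**2 * A_prev + sigma2_eta)."""
    mean = (1 - zeta) * a_prev
    sd = ((1 - zeta) ** 2 * A_prev + sigma2_eta) ** 0.5
    q = norm.ppf(0.5 + level / 2)
    return mean - q * sd, mean + q * sd

print(predictive_interval(1.0, 1.0, 0.5, 1.0))  # parameter uncertainty widens the interval
print(predictive_interval(1.0, 0.0, 0.5, 1.0))  # A -> 0: frequentist interval based on sigma2_eta
```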
Robustness of the Linear Gaussian Model: A further question arises when we deviate from the Gaussian distribution. In general, not every Bayes estimator remains consistent once we change the distribution of $\eta_t$. This implies that $a_t$ need not converge to $\theta$ in an arbitrary setting, even if $\zeta = 0$. Appendix B presents some statistical theory and the convergence results from the literature. Sufficient conditions for convergence in regular and non-regular cases are provided (i.e. also for cases where the support of the distribution is part of the unknown parameter vector). Dirichlet mixtures provide examples that break the convergence results obtained with the linear Gaussian setting presented in this section. In these examples the Bayes estimate is not consistent on sets of prior measure zero (small in a measure-theoretic sense), while the same sets are dense in the parameter space; for more details see Examples 5 and 6 in Appendix B.

3 Shutdown and Non-negative Quantities

Section 2 assumed that the domains of $x_t$ and $p_t$ are the real line. What does this mean in economic terms? E.g. $x_t = x_{it} < 0$ implies that firms consume the product or use $x_t$ as an input, while consumers supply $x_t$. If $x_t = x_{it} \neq 0$ a cost of $\frac{\lambda}{2} x_{it}^2$ arises for all firms, no matter whether firms produce or consume. In addition, if prices and quantities are negative this implies that the firm receives a subsidy for producing a negative quantity. Therefore, the assumptions of Section 2 are not very convincing whenever quantities and prices are outside the non-negative orthant.

One may argue that the model parameters can be chosen such that the probability of negative quantities and prices becomes small. Indeed this might be the case, but with normally distributed noise there is always a non-zero probability of violating the non-negativity constraints.

Now assume $x_{it} \geq 0$; otherwise we stay as close as possible to the model by JV. In particular, (1) is maintained.⁵ This scenario is motivated as follows: If firms produce $x_{it} = x_t > 0$, but $u_t$ realizes such that there is no demand for this quantity (negative demand intercept), then firms cannot store this quantity and have negative revenues $p_t x_{it} < 0$, e.g. due to a scrapping cost for destroying $x_{it}$. Formally, the following assumptions are adopted:

Assumption 1. Quantities $x_{it} \geq 0$.

Assumption 2. Firms are permitted to set $x_{it} = 0$ (shutdown in period $t$). If $x_t = x_{it} = 0$, firms do not receive any signal from the market. No price is realized if there is no market.

By Assumption 1, $x_t = \int_0^1 x_{it} \, di \geq 0$. The domain of $x_t$ is the non-negative part of the real axis. If $\eta_t \sim N(0, \sigma_\eta^2)$ then $p_t \in \mathbb{R}$ by (1). Assumption 2 could also have been included in Section 2; since the expected profits are non-negative there ($x_t = 0$ arises with probability zero), shutdown was not an issue. In this section the analysis is more convenient with

Assumption 3. $|\zeta| < 1$.

⁵ If (1) were replaced by $p_t = \max(0, \theta + u_t - \beta x_t)$, then the conditional expectation of $p_t$ would no longer be linear in $x_t$, which makes the model much more complicated.

Assumption 3 excludes the unit root case and explosive processes. Now we stick to Assumptions 1 to 3. This yields:⁶

1. If $x_t > 0$, then $p_t = z_t - \beta x_t$ and $y_t = z_t$ (this last equality holds for all $x_t$ in Section 2).

2. If $x_t = 0$, then - by Assumption 2 - firms do not receive any information on the demand intercept $z_t$. In terms of econometrics this implies that $p_t$ and $z_t$ are missing values.

⁶ From Section 2 we already know that $y_t$ has been defined as $y_t = p_t + \beta x_t$. Whenever production takes place, $y_t = p_t + \beta x_t = \theta + u_t = z_t$ has to hold. In Section 2, $y_t = z_t$ for all $t$.

Now $(p_t)$ and $(z_t)$ are time series with some missing observations. Whenever $x_t > 0$, $p_t + \beta x_t = z_t$ continues to hold. Taking account of missing values, the information sets are $\mathcal{F}_t^{FI}$ and $\mathcal{F}_t^{LI}$. The conditional expectations are $E(p_{t+k}|\mathcal{F}_t^{(.)}, x_{i,t+k} > 0) = E(z_{t+k}|\mathcal{F}_t^{(.)}) - \beta x_{i,t+k} 1_{x_{i,t+k} > 0}$; the forecasts of $z_{t+k}$ are derived by means of (11), where $\mathcal{I}_t^{(.)}$ has to be replaced by $\mathcal{F}_t^{(.)}$. Regarding parameter estimation, the difference to Section 2 is the presence of missing values. Exact Bayesian parameter estimation with missing values is briefly described in Appendix C. In a slightly simpler way we proceed as follows: If at time $t$ the observations $z_{t-j}, \dots, z_{t-1}$ are missing, then firms adapt (7) to

$$E(z_{t+k}|\mathcal{F}_t^{LI}) = (1 - \zeta) E(\theta|\mathcal{F}_{t-j}^{LI}) \sum_{i=0}^{k+j} \zeta^i + \zeta^{k+j+1} z_{t-j-1} \quad \text{and} \quad E(z_{t+k}|\mathcal{F}_t^{FI}) = \theta + \zeta^{k+j+1} u_{t-j-1}. \quad (14)$$

Equipped with (14) firms can solve the profit maximization problem

$$\max_{x_{i,t+k}} E(p_{t+k}|\mathcal{F}_t^{(.)}) x_{i,t+k} - \frac{\lambda}{2} x_{i,t+k}^2 \quad \text{s.t.} \quad x_{i,t+k} \geq 0.$$

The first order conditions are

$$E(p_{t+k}|\mathcal{F}_t^{(.)}) - \lambda x_{i,t+k} \leq 0 \quad \text{and} \quad \left( E(p_{t+k}|\mathcal{F}_t^{(.)}) - \lambda x_{i,t+k} \right) x_{i,t+k} = 0. \quad (15)$$

Condition (15) implies that firm $i$ currently produces $x_{it} = E(p_t|\mathcal{F}_t^{(.)})/\lambda$ units if $E(p_t|\mathcal{F}_t^{(.)}) > 0$ and zero units otherwise. By (15) firms also obtain the production plans for the periods $t + k$, $k > 0$, given current information. $E(p_{t+k}|\mathcal{F}_t^{(.)}) > 0$ requires $E(z_{t+k}|\mathcal{F}_t^{(.)}) > 0$ by the model structure (see equations (6) and (11)); if the latter term is positive there is an interior solution. This can be summarized as follows:
Proposition 2. Suppose that Assumptions 1-3 hold:

C1 If $E(\theta|\mathcal{F}_t^{(.)}) > 0$ and $E(z_t|\mathcal{F}_t^{(.)}) > 0$, then in period $t$, $E(z_{t+k}|\mathcal{F}_t^{(.)}) > 0$ for all $k \geq 0$. In $t$ firms produce $x_t > 0$. For the periods $t + k$ firms currently plan to supply $x_{t+k} > 0$.

C2 If $E(\theta|\mathcal{F}_t^{(.)}) > 0$ and $E(z_t|\mathcal{F}_t^{(.)}) < 0$, then $E(z_{t+k}|\mathcal{F}_t^{(.)}) > 0$ for some $k \geq 1$. Firms do not produce in period $t$ but enter the market at period $t + k$.

C3 If $E(\theta|\mathcal{F}_t^{(.)}) < 0$ and $E(z_t|\mathcal{F}_t^{(.)}) > 0$, then firms produce $x_t > 0$ in this period. At $t$ firms plan to exit after $k > 0$ periods.

C4 If $E(\theta|\mathcal{F}_t^{(.)}) < 0$ and $E(z_t|\mathcal{F}_t^{(.)}) < 0$, this also implies $E(z_{t+k}|\mathcal{F}_t^{(.)}) < 0$ for all $k \geq 0$. Firms do not produce and exit for all $t + k$, $k \geq 0$.

Proposition 2 has important economic implications: If $\theta > 0$, in the full-information case only production or entry after $k$ periods is possible (cases C1 and C2). However, with limited information, cases C3 and C4 are possible as well. In case C3 the firms may receive a positive signal and remain in the market. With C4, on the other hand, the following result obtains:

Corollary 1. Suppose that $\theta > 0$, $E(\theta|\mathcal{F}_t^{LI}) < 0$ and $E(z_t|\mathcal{F}_t^{LI}) < 0$. Then firms with limited information exit and $x_t$ does not converge to $x_t^{FI}$. For finite $t$ there always exists an $\eta_{t-1}^+$ such that $E(\theta|\mathcal{F}_t^{LI}) < 0$ and $E(z_t|\mathcal{F}_t^{LI}) < 0$ for any $\eta_{t-1} \leq \eta_{t-1}^+$; the probability that $\eta_{t-1} \leq \eta_{t-1}^+$ is strictly positive.

As a short summary, Figure 1 provides a graphical illustration of the formal results obtained in Propositions 1 and 2. A shock occurs in period $t = 4$. In Subfigures (a) and (b) we observe the conditional expectation of the limited-information agent and the output of the informed and the limited-information agents under the assumptions of the baseline model presented in Section 2. We attain convergence to the true parameter $\theta = 1$ and diminishing differences between the quantities $x_t$ and $x_t^{FI}$. In Subfigures (c) and (d) Assumptions 1-3 hold. This results in negative $E(z_t|\mathcal{F}_t^{LI})$ and $E(\theta|\mathcal{F}_t^{LI})$ in period $t = 5$. While the fully informed agents start production after some periods (Case C2), the limited-information agents exit (Case C4) in Subfigure (d). By applying (14), $E(z_t|\mathcal{F}_t^{LI})$ converges to $E(\theta|\mathcal{F}_t^{LI})$ after exit, as observed in Subfigure (c).

Figure 1: Quantities, Conditional Expectations and Convergence. This figure plots time series of the quantities produced and conditional expectations. Parameters are set to $\zeta = 0.5$, $\theta = \theta_0 = 1$, $\sigma_\eta^2 = \sigma_0^2 = 1$, $\lambda = \beta = 1$. A shock occurs in period $t = 4$. Subfigures (a) and (b) present representative output from the baseline model of Section 2: (a) plots $E(\theta|\mathcal{I}_t^{LI})$ for $t = 1, \dots, 1000$; (b) plots $x_t$ (solid line) and $x_t^{FI}$ (dotted line) for the first 100 periods. Subfigures (c) and (d) present a non-convergent path for the setting presented in this section: (c) conditional expectations $E(z_t|\mathcal{F}_t^{LI})$ (dashed-solid line) and $E(\theta|\mathcal{F}_t^{LI})$ (solid line) for the limited information case; (d) output $x_t$ (solid line) and $x_t^{FI}$ (dashed-solid line).

Remark 4. Assumption 1, $x_{it} \geq 0$, could also be replaced by $x_{it} \geq \bar{x}$, where $\bar{x}$ could be a minimum production level or a short-selling constraint. For restrictions like this the results from this section can be adapted.
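Corollary 1 can be checked by a small Monte Carlo experiment. The sketch below (our own illustration, not the code behind Figure 1) implements the exit rule of Case C4 and, as a simplification, updates (10) only on adjacent observations; it estimates the frequency with which limited-information firms exit although $\theta > 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
zeta, theta = 0.5, 1.0                        # theta > 0: FI firms never exit
theta0, sigma2_0, sigma2_eta = 1.0, 1.0, 1.0
T, n_paths, exits = 50, 20_000, 0

for _ in range(n_paths):
    z_true = theta                            # latent intercept, z_0 = theta
    a, state = theta0, theta0                 # posterior mean; forecast state, cf. (14)
    sum_dz, n_obs = 0.0, 0
    z_obs, last_t = None, None                # last observed z and its date
    for t in range(1, T + 1):
        z_true = (1 - zeta) * theta + zeta * z_true + rng.normal(0.0, np.sqrt(sigma2_eta))
        ez = (1 - zeta) * a + zeta * state    # forecast E(z_t | F_t^LI), cf. (11)/(14)
        if a < 0 and ez < 0:                  # Case C4: exit, no further signals
            exits += 1
            break
        if ez > 0:                            # produce; observe z_t = p_t + beta * x_t
            if last_t == t - 1:               # update (10) on adjacent observations only
                sum_dz += z_true - zeta * z_obs
                n_obs += 1
                A = 1.0 / ((1 - zeta) ** 2 * n_obs / sigma2_eta + 1.0 / sigma2_0)
                a = A * ((1 - zeta) * sum_dz / sigma2_eta + theta0 / sigma2_0)
            z_obs, last_t, state = z_true, t, z_true
        else:                                 # Case C2: wait; z_t is a missing value
            state = ez

print("exit frequency:", exits / n_paths)     # strictly positive although theta > 0
```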
4 Bounded Support

An assumption that avoids the above implications is as follows:

Assumption 4. The support of $\eta_t$ is a proper subset of $[\underline{\eta}, \bar{\eta}]$ and the parameter $\theta > 0$. W.l.o.g. $E(\eta_t) = 0$. $\mathrm{Supp}_\eta$ denotes the support of $\eta_t$.

As in Section 3: (i) If $x_t > 0$, then $p_t = z_t - \beta x_t$ and $y_t = z_t$. (ii) If $x_t = 0$, $p_t$ and $z_t$ are missing values by Assumption 2. The first order condition (15) continues to hold. $z_0 = \theta$ and (3) imply:

$$z_t = \theta(1 - \zeta) + \zeta z_{t-1} + \eta_t = \theta + \sum_{s=0}^{t-1} \zeta^s \eta_{t-s}. \quad (16)$$

With $|\zeta| < 1$ and bounded support, the absolute value of the last term in (16) remains smaller than $\frac{1}{1 - |\zeta|} \max\{\bar{\eta}, |\underline{\eta}|\} =: \eta_{\max}$. Therefore, with $|\zeta| < 1$, $z_0 = \theta$ and $\eta_t \in [\underline{\eta}, \bar{\eta}]$, we get the lower and upper bounds

$$\underline{z} = \theta - \eta_{\max} \quad \text{and} \quad \bar{z} = \theta + \eta_{\max}. \quad (17)$$

To derive a positive $z_t$ we have to restrict the support of $\eta_t$ such that $\underline{z} > 0$. This yields $\max\{\bar{\eta}, |\underline{\eta}|\} < \theta(1 - |\zeta|)$. The strict inequality $<$ comes from requiring $z_t > 0$ for an arbitrary distribution of $\eta_t$ (with bounded support), i.e. atoms at the borders are also allowed. As long as the distribution of $\eta_t$ is absolutely continuous, $<$ can be replaced by $\leq$. This yields:

Corollary 2. Suppose that Assumptions 1-4 hold and the support of $\eta_t$ is a proper subset of $[-(1 - |\zeta|)\theta, (1 - |\zeta|)\theta]$. In addition, firms are equipped with a Bayesian updating scheme such that $E(\theta|\mathcal{F}_t^{LI}) \to \theta$ as $t \to \infty$ (a.s. and in $L^2$) and $E(\theta|\mathcal{F}_t^{LI}) \in (\underline{\theta}, \bar{\theta})$ for all $t \geq 0$, where $\underline{\theta} = \theta - \frac{1}{1 - |\zeta|} \max\{\bar{\eta}, |\underline{\eta}|\}$ and $\bar{\theta} = \theta + \frac{1}{1 - |\zeta|} \max\{\bar{\eta}, |\underline{\eta}|\}$. Then $E(\theta|\mathcal{F}_t^{(.)}) > 0$ and $E(z_t|\mathcal{F}_t^{(.)}) > 0$; $x_t > 0$ for all $t$ by the first order condition (15) (i.e. only C1 of Proposition 2 is possible). The convergence results of Proposition 1[part (i)] continue to hold. Bayesian learning can be replaced by the weaker assumption that agents use an estimator $\hat{\theta}_t$ with $\hat{\theta}_t \to \theta$, where $\hat{\theta}_t$ and the forecasts of $z_t$ remain in $[\underline{z}, \bar{z}]$.

Corollary 2 demands positive $z_t$, consistency with respect to the support and consistency of the Bayes estimate. That is to say, a bounded support of the random variable such that $z_t > 0$ is not sufficient for convergence. The following examples shed some light on these issues:

Example 1 (Truncated normal with misspecified upper and lower bounds). Suppose that $\theta > 0$ and $\varepsilon_t$ follows a truncated normal distribution with lower bound $\underline{\varepsilon}$, upper bound $\bar{\varepsilon}$, mean parameter $\nu$ and variance parameter $\sigma_{TN}^2$, resulting in $z_t > 0$ (for the truncated normal we refer the reader to Appendix A). Truncated normal priors are assumed. In addition, assume that $\zeta = 0$, such that $E(z_t|\mathcal{F}_t^{LI}) < 0$ is sufficient for exit. Take (18) of Appendix A and a parameter $\nu$ such that $E_{f_{TN}(\nu, \sigma_{TN}^2, \underline{\varepsilon}, \bar{\varepsilon})}(z_t) = \theta > 0$. These assumptions result in $\underline{z} = \underline{\varepsilon}$ and $\bar{z} = \bar{\varepsilon}$; $\underline{\eta}$ and $\bar{\eta}$ can be obtained by means of $\underline{\eta} = \underline{\varepsilon} - \theta$ and $\bar{\eta} = \bar{\varepsilon} - \theta$. Suppose that firms know that the data are generated from a truncated normal but assume a lower bound $\tilde{\underline{\varepsilon}} < \underline{\varepsilon}$ and an upper bound $\tilde{\bar{\varepsilon}}$; these bounds are fixed and the assumed distribution can be asymmetric. By (18) we can find upper and lower bounds such that $0 < E_{f_{TN}(\nu, \sigma_{TN}^2, \tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}})}(\varepsilon_t) < \theta$. Now suppose that the firms apply Rodriguez-Yam et al. (2004) (as in Example 9) to estimate $\nu$. Then a sufficiently small $z_t > 0$ can result in an estimate $\hat{\nu}_t$ such that $E_{f_{TN}(\hat{\nu}_t, \sigma_{TN}^2, \tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}})}(\varepsilon_t) = E(z_t|\mathcal{I}_t^{LI}) < 0$, where firms exit.
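The mechanics behind Example 1 are easy to see numerically. The following sketch (our own illustration; all numbers are hypothetical) evaluates the truncated normal mean (18) and shows how a misspecified lower bound can pull the implied forecast below zero:

```python
from scipy.stats import norm

def truncnorm_mean(nu, sigma, lo, hi):
    """Expected value (18) of a truncated normal with location nu, scale sigma
    and truncation bounds [lo, hi]."""
    a, b = (lo - nu) / sigma, (hi - nu) / sigma
    return nu + sigma * (norm.pdf(a) - norm.pdf(b)) / (norm.cdf(b) - norm.cdf(a))

# hypothetical true bounds [0.5, 1.5] around theta = 1, sigma_TN = 1
print(truncnorm_mean(1.0, 1.0, 0.5, 1.5))     # approx. 1 = theta
# misspecified lower bound -10: a small estimate nu_hat implies a negative
# forecast E(z_t | I_t^LI), triggering exit as in Example 1
print(truncnorm_mean(-2.5, 1.0, -10.0, 1.5))  # negative forecast
```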
Example 2 (Truncated normal, non-regular case). The misspecification of the econometric model in Example 1 can be repaired by firms estimating the bounds of the truncated normal. Example 9 in Appendix B demonstrates that the estimates are still consistent if $(z_t)$ is observed and the upper and lower bounds are unknown parameters. The bounds converge to the true bounds as $t \to \infty$. This does not imply $E(z_t|\mathcal{F}_t^{LI}) > 0$ almost surely for all $t$. E.g. if the firms put relatively strong priors in the neighborhood of the bounds $\tilde{\underline{\varepsilon}}$ and $\tilde{\bar{\varepsilon}}$ assumed in Example 1, convergence need not take place.

Example 3 (Truncated normal with correctly specified upper and lower bounds). Given the assumptions of Example 1, but the lower and upper bounds assumed by the firms correspond to the true parameters. Then $E(z_t|\mathcal{F}_t^{LI}) > 0$ follows by applying Rodriguez-Yam et al. (2004) and (7).

The above examples demonstrate why bounded support of $(z_t)$ is not sufficient for convergence. In Example 1 the firms apply a prior and a likelihood with inconsistent bounds. That is to say, firms know or guess the correct class of distributions, but use a misspecified econometric model with respect to the range of the random variable. Here the estimates would be inconsistent even in the baseline case. Example 2 repairs this drawback by estimating the bounds. Although the posterior converges to the true parameter $\psi$ if $(z_t)_{t=0}^{\infty}$ is observed, this does not imply that the bounds for exit cannot be hit. Example 3 works with correctly specified bounds of the distributions. This is a strong assumption, since firms then know exactly the worst and the best shock that could happen.

Remark 5. Corollary 2 does not guarantee non-negative prices. The inverse demand function (1) and the first order condition (15) result in $p_t = z_t - \frac{\beta}{\beta + \lambda} E(z_t|\mathcal{F}_t^{LI})$. Even under the assumptions of Corollary 2, where $\theta > 0$ and $\eta_t \in (\underline{\eta}, \bar{\eta})$, the price can become negative if $z_t$ is small but the expectation is large. With forecasts in $[\underline{z}, \bar{z}]$, a simple plug-in of the upper and lower bounds of $z_t$ shows that $p_t \geq 0$ if $\underline{z} \geq \frac{\beta}{\beta + \lambda} \bar{z}$.

Remark 6. Non-negative prices and quantities can also be obtained in different ways. E.g. Adam and Marcet (2010) work with log returns, while Vives (2008)[p. 271] proposes a model with an isoelastic cost and inverse demand structure. By these assumptions the authors restrict attention to Case C1 of Proposition 2. With isoelastic demand, $x_t \to 0$ forces prices to go to $+\infty$ for any signal. This guarantees an interior solution with $x_t > 0$. A further option is to transform $z_t$ by a function $g(.)$ such that $g(z_t)$ fulfills the requirements of Corollary 2; for example the logistic function, the arctangent function, etc. can be used in this case. In addition, Bayesian learning then results in non-negative forecasts.

Robustness: Based on these last examples and Remark 6, we might expect that with bounded support consistency and convergence take place under fairly mild conditions. "Fairly mild" is a matter of taste. Even with innovations on bounded support and $\theta \in (\underline{\theta}, \bar{\theta})$ we can use Dirichlet mixtures as already discussed in the last paragraph of Section 2 and in Appendix B. Similar to the asymmetric normal distribution constructed in Appendix A, a truncated asymmetric normal distribution can be constructed in the same way:

Example 4 (Mixture of Truncated Normals). Take the upper and lower bounds $\underline{\varepsilon}$ and $\bar{\varepsilon}$ from Example 3, such that $z_t > 0$, and assume that these bounds are known by all agents. Let $\varepsilon_t$ follow a truncated normal with parameter $\nu$ for irrational $\nu$, while $\varepsilon_t$ is an asymmetric truncated normal of Fernandez and Steel (1998) type with some asymmetry parameter $\gamma \neq 1$ for rational $\nu$ (see Appendix A). Once again the Bayes estimate of $\nu$ is given by Rodriguez-Yam et al. (2004) (see Example 9 in Appendix B). This estimate converges to the true parameter for all irrational $\nu$, while for rational $\nu$ we get inconsistent estimates.⁷ In Example 4 the requirements of Case C1 in Proposition 2 are satisfied all the time, i.e. we have excluded shutdown.
However, even under these circumstances the Bayes estimate need not be consistent, and convergence of the limited to the full information case need not take place for all elements of the parameter space. This implies that we can also construct examples with inconsistent estimates although shutdown is not observed. Similar examples can be constructed with other distributions where the distribution of the noise is from a different class for rational and irrational elements. Therefore, even if we replace the linear demand structure by a log-linear one to force prices and quantities to be positive (as suggested by Remark 6), inconsistency of the Bayes estimate can still arise. Despite the fact that we can make the model more realistic through different assumptions that make prices and quantities non-negative, learning the true parameter still remains an issue in all these cases.

5 Conclusions

By simple considerations we observed that eliminating negative quantities while maintaining the linear normal demand structure breaks the convergence results derived in Jun and Vives (1996) and Vives (2008). With a simple shutdown condition, the quantities provided in a full information economy need not agree - even in the limit - with the quantities produced in the limited information setting. To rescue the optimistic convergence results, bounded support of the noise term and regularity conditions on learning are required.

⁷ Note that consistency of an estimate means that for all elements of the parameter space the posterior converges to a point mass at the true parameter (see Appendix B).

A The Truncated Normal and the Asymmetric Normal Distribution

Consider a truncated normal random variable $X$ with density $f_{TN}(x; \nu, \sigma_{TN}^2, \underline{\varepsilon}, \bar{\varepsilon})$ (see e.g. Paolella (2007)). The expected value of a truncated normal random variable $X$ with location and scale parameters $\nu$, $\sigma_{TN}^2$ and bounds $\underline{\varepsilon} < \bar{\varepsilon}$ is

$$E(X) = \nu + \frac{f_{SN}((\underline{\varepsilon} - \nu)/\sigma_{TN}) - f_{SN}((\bar{\varepsilon} - \nu)/\sigma_{TN})}{F_{SN}((\bar{\varepsilon} - \nu)/\sigma_{TN}) - F_{SN}((\underline{\varepsilon} - \nu)/\sigma_{TN})} \, \sigma_{TN}. \quad (18)$$

$f_{SN}$ and $F_{SN}$ are the standard normal density and distribution function. Based on Fernandez and Steel (1998), an asymmetric distribution with density $g_\gamma(x)$ can be constructed from a symmetric distribution with density $f(x)$ as follows (see e.g. also Paolella (2007)):

$$g_\gamma(x) = \frac{2}{\gamma + 1/\gamma} \left( f(x/\gamma) 1_{x \geq 0} + f(\gamma x) 1_{x < 0} \right). \quad (19)$$

The parameter $\gamma > 0$ controls the degree of asymmetry. For $\gamma = 1$ the distribution is symmetric, such that $g_1(x) = f(x)$. The moments of $X$ are given by:

$$E(X^r | \gamma) = \frac{\gamma^{r+1} + (-1)^r / \gamma^{r+1}}{\gamma + 1/\gamma} \, 2 E(X^r | X > 0, \gamma = 1). \quad (20)$$

By using the standard normal density $f_{SN}$ we get an asymmetric normal distribution with $E(X) \neq 0$ for $\gamma \neq 1$ (and $E(X) = 0$ for $\gamma = 1$). For the normal distribution, $E(X^r | X > 0, \gamma = 1)$ can be derived as follows, where we write $E(X^r | X > 0, \gamma = 1) = \int_0^\infty x^r f_{SN}(x) \, dx$. Given the standard normal density $f_{SN}(x)$ we observe that $f_{SN}'(x) = (-x) f_{SN}(x)$. By partial integration we observe that

$$\int_0^\infty x^r f_{SN}(x) \, dx = \left[ \frac{x^{r+1}}{r+1} f_{SN}(x) \right]_0^\infty - \frac{1}{r+1} \int_0^\infty x^{r+1} f_{SN}'(x) \, dx = \frac{1}{r+1} \int_0^\infty x^{r+2} f_{SN}(x) \, dx,$$

such that $E(X^r | X > 0, \gamma = 1) = \frac{1}{r+1} E(X^{r+2} | X > 0, \gamma = 1)$. Since $f_{SN}'(x) = (-x) f_{SN}(x)$ we get $\int_0^\infty x f_{SN}(x) \, dx = -[f_{SN}(x)]_0^\infty = f_{SN}(0)$, yielding $E(X^1 | X > 0, \gamma = 1) = \frac{1}{\sqrt{2\pi}}$, while $E(X^2 | X > 0, \gamma = 1) = \frac{1}{2}$ for a standard normal random variable. By the above recursive relationship of the moments we get $E(X^r | X > 0, \gamma = 1)$ for the standard normal. Some algebra yields

$$E(X^r | X > 0, \gamma = 1) = \frac{r!}{2^{r/2 + 1} (r/2)!} \ \text{for even } r \quad \text{and} \quad E(X^r | X > 0, \gamma = 1) = 2^{(r-1)/2} \left( \frac{r-1}{2} \right)! \, \frac{1}{\sqrt{2\pi}} \ \text{for odd } r.$$
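The construction (19) also suggests a direct way to sample from $g_\gamma$. A minimal sketch (our own illustration; the two-branch rule follows from integrating the two halves of (19), which gives $P(X \geq 0) = \gamma^2/(1 + \gamma^2)$):

```python
import numpy as np

def sample_fs_asymmetric_normal(gamma, size, rng=None):
    """Draw from the Fernandez-Steel density (19) with f = standard normal:
    with probability gamma^2 / (1 + gamma^2) set X = gamma * |Z|,
    otherwise X = -|Z| / gamma, where Z is standard normal."""
    rng = rng or np.random.default_rng()
    z = np.abs(rng.standard_normal(size))
    pos = rng.random(size) < gamma**2 / (1.0 + gamma**2)
    return np.where(pos, gamma * z, -z / gamma)

rng = np.random.default_rng(0)
gamma = 2.0
x = sample_fs_asymmetric_normal(gamma, 200_000, rng)
# first moment via (20): nonzero for gamma != 1
m1 = (gamma**2 - 1 / gamma**2) / (gamma + 1 / gamma) * 2 / np.sqrt(2 * np.pi)
print(x.mean(), m1)   # both approx. 1.197
```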
Given a standardized skewed normal random variable $X$, we get as usual a non-standardized random variable $Y$ by means of $Y = \theta + \sigma X$, where $\theta$ and $\sigma^2$ are the mean and the variance of $Y$ for $\gamma = 1$. By means of (19) we can also construct an asymmetric truncated normal distribution by using $f_{TN}(.)$ instead of $f(.)$.

B Bayesian Learning and Consistency

This section reviews Bayesian learning and presents examples with non-Gaussian innovations. Regarding convergence, we refer the reader to Ghosal et al. (2000) and the web note of Shalizi (2010), which describe the problems and provide literature on this topic.

Let us start with a prior $\pi$ on $\Psi$. $\Psi$ is the parameter space, its elements are $\psi$. Given the distribution of $x^t = (x_1, \dots, x_t)$, $f(x^t|\psi)$, the posterior can be derived by means of the Bayes theorem:

$$\pi(\psi|x^t) = \frac{f(x^t|\psi) \, \pi(\psi)}{\int_\Psi f(x^t|\psi) \, \pi(d\psi)}. \quad (21)$$

Only for a small number of applications can $\pi(\psi|x^t)$ be derived analytically. Bayesian simulation methods to simulate from the posterior $\pi(\psi|x^t)$ are described in Appendix C.

A more detailed picture is obtained by following Diaconis and Freedman (1986) and Blume and Easley (1993): The parameter space $\Psi$ is a Borel subset of a complete separable metric space, its elements are $\psi$. Suppose that $\pi$ is non-degenerate and $\pi > 0$ for any Borel subset containing the true parameter generating the data. $Q_\psi$ is a probability measure on Borel subsets of the set of data $X$, $x \in X$. $Q_\psi^\infty$ is the product measure on $X^\infty$. The map $\psi \to Q_\psi$ is one to one and Borel; the point mass at $\psi$ is denoted by $\delta_\psi$. The joint distribution is $P_\pi(A \times B) = \int_A Q_\psi^\infty(B) \, \pi(d\psi)$, where $A$ and $B$ are Borel sets in $\Psi$ and $X^\infty$, respectively. For a Borel subset $H$ of $\Psi \times X^\infty$ we get $P_\pi(H) = \int_\Psi (\delta_\psi \times Q_\psi^\infty)(H) \, \pi(d\psi)$. Then the posterior $\pi(\psi|B)$ - also abbreviated by $\pi(\psi|x^\infty)$ - is a version of the $P_\pi$ law of $\psi$. (Consistency does not only depend on the prior but also on the version of the posterior we choose; for more details see Diaconis and Freedman (1986)[Example A.1].)
This does not guarantee convergence to the true parameter for all ψ ∈ Ψ as required. Even worse, given a set of priors M, Diaconis and Freedman (1986) have shown that the set of all pairs (ψ, π) ∈ Ψ×M for which the posterior based on π is consistent is a meager set (of Baire category 1; here the topology of weak convergence has been used). The original result has been shown on a countable parameter space, while Feldman (1991) extended this result to a complete and separable, non-finite parameter space. Remark 7. By the latter result we observe that there are many ”bad priors”, the set of ”good priors” is topologically small. On the other hand if the statistician - in our case the firm - is ”certain about the prior π(ψ)”, Doob’s theorem tells us that the set where inconsistency might arise is small in a measure theoretic sense. The key aspect is that the firm has to choose a good prior as e.g. in the benchmark model. However, choosing a good prior is in general a strong assumption on the capabilities of the economic agents. 20 Statistical literature provides examples where Bayesian learning is inconsistent; here the reader is referred e.g. to Diaconis and Freedman (1986) and the literature cited there. Especially, when nonparametric Bayesian techniques are used, inconsistency becomes a serious problem. Consistency for nonparametric Bayesian models has been investigated more recently in Barron et al. (1999), Walker (2004), Ghosal and van der Vaart (2007) and Rousseau et al. (2010). For sufficient conditions for consistency and (counter-)examples in the non-parametric case we refer the reader to Jang et al. (2010). Based on Christensen (2009) we get the following example: Example 5 (Inconsistency, Application of Christensen (2009)). Given the assumptions of Section 2, but ηt ∼ N(θ, σ2 η) for irrational θ, while ηt is Cauchy with location parameter θ and variance parameter σ2 e for rational θ. By sticking to the normal prior θ ∼ N(θ0, σ2 0), the posterior π(θ|zt) = π(θ|ILI t+1) is still given by (10), where θ ∼ N(at, At). For irrational θ this estimate is consistent, while for rational θ we get inconsistent estimates. In this example we get inconsistency on a set of prior probability 0, i.e. this set is small in a measure theoretic sense. However, we get inconsistent estimates on a dense set. A set is small in a topological sense (meager set or set of the first Baire category), if it can be expressed as the countable union of nowhere dense sets. Using this topological concept, the set of consistent estimates is meager in this example. A more heuristic interpretation of this results is as follows: In any case an agent can only report rationals/ key in rational numbers into a computer. However, with any of these rational numbers the estimate is not consistent all the time. In addition if the sampling distribution of the unknown parameter θ puts positive probability mass on the rational numbers, the problem becomes even worse. Remark 8. From (10) we observe that with Gaussian noise the estimator of θ is a convex combination of the sample mean and a term arising from the prior which becomes neglectable if t becomes larger. For independent Gaussian noise the strong law of large number holds such that the sample mean converges to θ. For a Cauchy distribution this is not the case. In Example 5 this is the main driving force for the inconsistency of the estimator on rational θ. 
If distributions with non-finite expectation do not seem very realistic to the reader, we can think of an economy where estimates of $\theta$ and $\sigma_\eta^2$ enter into equilibrium quantities. Suppose that $\theta$ is fixed while $\sigma_\eta^2$ is unknown. The expected value should be $\theta$ for all $\sigma_\eta^2$. Assume a normal distribution for irrational $\sigma_\eta^2 > 0$ and a distribution where the second moment does not exist for rational $\sigma_\eta^2 > 0$ (e.g. a student t-distribution with $1 < \nu \leq 2$ degrees of freedom, symmetric with respect to $\theta$, where $E(X) = \theta$ and $E(X^2)$ does not exist). In this case the posterior based on the Gaussian distribution is an inverse Gamma distribution (see Robert (1994)[Chapter 4] or Frühwirth-Schnatter (2006)), where the sample variance determines the scale parameter of this distribution. By the same argument as above we get inconsistent estimates if the true $\sigma_\eta^2$ is a rational number.

Example 6 (Inconsistency, Variation of Example 5). Similar to Example 5 we can also construct Dirichlet mixtures with finite expectation. Let $\eta_t \sim N(\theta, \sigma_\eta^2)$ for irrational $\theta$, while $\eta_t$ is asymmetric normal of Fernandez and Steel (1998) type for rational $\theta$ (see Appendix A). $\pi(\theta|z^t) = \pi(\theta|\mathcal{I}_{t+1}^{LI})$ is once again given by (10). For irrational $\theta$ this estimate is consistent, such that $a_t \to \theta$, while for rational $\theta$, $a_t \to \theta + E(\eta_t)$ by the law of large numbers. This implies $\lim a_t \neq \theta$ in the asymmetric case. A similar example can be constructed with mixtures of asymmetric truncated normal distributions, etc.

One of the usual procedures to check for convergence to the true parameter in a particular model is to show that the probability of non-convergence to the true parameter value $\psi$ goes to zero (see also Blume and Easley (1993)[Corollary 2.1]), or to work with so-called consistency hypothesis tests as discussed and presented in Ghosal et al. (1994), Ghosal et al. (1995a), Ghosal (2000), Ghosal et al. (2000), Ghosal and Tang (2006), Ghosal and van der Vaart (2007) and Walker (2004). Section 2 investigated a regular problem, where the support of the stochastic noise term does not change with the parameter $\psi$. For regular problems the result derived by Schwartz (1965) can often be applied to check for consistency. Further regular problems will be presented in Examples 8 and 9. Non-regular problems have also been investigated in Ghosal et al. (1994), Ghosal et al. (1995a), Chernozhukov and Hong (2004) and Hirano and Porter (2003). The following proposition provides sufficient conditions for almost sure convergence to the true parameter:

Proposition 3. [Proposition 1 in Ghosal et al. (1995b); based on Ibragimov and Has'minskii (1981)] Consider a prior $\pi$, $\Psi \subseteq \mathbb{R}^d$ and $U_t = \varphi_t^{-1}(\Psi - \psi)$ for some normalizing factor $\varphi_t$. The likelihood ratio process is defined as

$$L_t(u) = \frac{f(x^t|\psi + \varphi_t u)}{f(x^t|\psi)}, \quad u \in U_t.$$

Assume that the conditions

IH1 For some $M > 0$, $m \geq 0$ and $\alpha > 0$,
$$E_\psi \left| L_t^{1/2}(u_1) - L_t^{1/2}(u_2) \right|^2 \leq M (1 + R^m) \|u_1 - u_2\|^\alpha$$
for all $u_1, u_2 \in U_t$ satisfying $\|u_1\| \leq R$ and $\|u_2\| \leq R$.

IH2 For all $u \in U_t$,
$$E_\psi L_t^{1/2}(u) \leq \exp(-g_t(\|u\|)),$$
where $\{g_t\}$ is a sequence of real valued functions on $[0, \infty)$ satisfying the following: (a) for fixed $t \geq 1$, $g_t(y) \to \infty$ as $y \to \infty$; (b) for any $N > 0$, $\lim_{y, t \to \infty} y^N \exp(-g_t(y)) = 0$.

GGS For some $s > 0$, $\sum_{t=1}^{\infty} \|\varphi_t\|^s < \infty$.

are satisfied. Then almost sure convergence of the posterior to the true parameter is attained. Usually $\varphi_t = 1/\sqrt{t}$ is applied. When only conditions IH1 and IH2 are satisfied, we get convergence in probability.
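To see how condition IH2 operates, consider the baseline Gaussian model with $\zeta = 0$, so that $\Delta_\zeta z_s = \theta + \eta_s$ with $\eta_s \sim$ iid $N(0, \sigma_\eta^2)$ (a sketch of our own; it uses the standard Hellinger affinity of two normal densities with common variance). With $\psi = \theta$, $\varphi_t = 1/\sqrt{t}$ and $\theta' = \theta + u/\sqrt{t}$,

$$E_\theta \left[ \frac{f(z|\theta')}{f(z|\theta)} \right]^{1/2} = \int \sqrt{f(z|\theta') f(z|\theta)} \, dz = \exp\left( -\frac{(\theta - \theta')^2}{8 \sigma_\eta^2} \right),$$

and hence, for $t$ iid observations,

$$E_\theta L_t^{1/2}(u) = \left[ \exp\left( -\frac{(u/\sqrt{t})^2}{8 \sigma_\eta^2} \right) \right]^t = \exp\left( -\frac{u^2}{8 \sigma_\eta^2} \right) =: \exp(-g_t(|u|)), \qquad g_t(y) = \frac{y^2}{8 \sigma_\eta^2},$$

which satisfies both (a) and (b) of IH2.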
It is worth noting that Proposition 3 requires neither iid data nor a regular problem; only the existence of the densities (likelihood) $f(x^t|\psi)$ is assumed. Proposition 3 also tells us that the prior has to be strictly positive around the true parameter value and that it is allowed to grow at most polynomially. $L^p$ convergence generally does not follow from almost sure convergence. If almost sure convergence is observed and the $p$-th moment of the sequence of random variables is uniformly integrable, then $L^p$ convergence is attained, e.g. by means of Lebesgue's dominated convergence theorem (see e.g. Billingsley (1986)[p. 213] or Klenke (2008)[p. 140]).⁸

⁸ A further issue is the limit distribution of the posterior around the true parameter value. For finite dimensional parameter spaces the Bernstein-von Mises theorem (see e.g. LeCam and Yang (1990), Lehmann (1991), Freedman (1999), Bickel and Doksum (2001)[p. 339]) states that the Bayesian posterior around its mean is close to the asymptotic distribution of the maximum likelihood estimate around the true parameter value. In this case consistency of the maximum likelihood estimate implies consistent Bayes estimates. Also here Ghosal et al. (1994), Ghosal et al. (1995a) and Ghosal and van der Vaart (2007) provide useful tools to derive the limit distribution of the Bayes estimate.

In Section 2 the parameter $\psi$ is $\theta$ and $\Psi$ is the real line. (10) provides the posterior $\pi(\theta|\mathcal{I}_t^{LI})$ in closed form. Consistency, $\pi(\theta|\mathcal{I}_t^{LI}) \to \delta_\theta$, can be checked directly by taking limits. Alternatively, consistency follows from Blume and Easley (1993)[Theorem 2.2]; learning the mean of a normal random variable is also presented as an example there. $L^p$ convergence follows from the properties of the normal distribution. The conditional expectation of $z_t$ was given by a convex combination of the parameter $\theta$ and the former realization $z_{t-1}$. Convergence results for the parameter $\theta$ automatically carry over to convergence of the quantities. Generally, this need not be as simple, as demonstrated in the following example:

Example 7. Assume that $\eta_t$ is Cauchy. Suppose that the posterior of the location parameter $\theta$ converges to the true parameter $\theta$. In this setting the parameter $\psi = \theta$ has been learned. In contrast to learning, the expectation of $\eta_t$ does not exist. As a consequence the conditional expectation of $z_t$ does not exist (neither in the full information nor in the limited information setting) and no convergence results on $(x_t)$ can be obtained.

Example 7 demonstrates that consistent estimates are in general not sufficient for convergence (even with $|\zeta| < 1$). Hence, we additionally require uniform integrability of the random variables $z_t$ or $\Delta_\zeta z_t$. Based on this discussion, Proposition 1 can be extended as follows:

Corollary 3. [Extension of Proposition 1] Suppose $\varepsilon_t \sim g(\varepsilon_t|\psi)$, $E(|\varepsilon_t|^p) < \infty$ for all $t$ and some $p \geq 1$. In addition, normalize the noise term such that $E(\Delta_\zeta z_t) = E(\varepsilon_t) = \theta(1 - \zeta)$. The unknown parameters are $\psi$, $\Psi \subseteq \mathbb{R}^d$. If $|\zeta| \neq 1$ and the consistency conditions of Proposition 3 are fulfilled, then the posterior converges to the true parameter $\psi$ almost surely. If, in addition, $|\zeta| \neq 1$ and $|E(\psi|\mathcal{I}_t^{LI})|^p$ is uniformly integrable for $p \geq 1$, then $E(\psi|\mathcal{I}_t^{LI})$ converges to $\psi$ in $L^p$. For $|\zeta| < 1$ and $E(\psi|\mathcal{I}_t^{LI}) \to \psi$ almost surely and in $L^p$, $x_t \to x_t^{FI}$ almost surely and in $L^p$. For $\zeta \in \mathbb{R}$, $\zeta \neq 1$ and $E(\psi|\mathcal{I}_t^{LI}) \to \psi$ almost surely and in $L^p$, $\Delta_\zeta x_t$ converges to $\Delta_\zeta x_t^{FI}$ a.s. and in $L^p$. For $\zeta = 1$, $x_t = x_t^{FI}$ for $t \geq 2$.

The following examples illustrate the implications of Corollary 3.
In the first example we replace the normally distributed noise term by a student-t distributed noise term with $\kappa$ degrees of freedom. The second example investigates the truncated normal case.

Example 8. Assume $\zeta \neq 1$ and $\Delta_\zeta z_t = (1 - \zeta)\theta + \eta_t$, where $\eta_t$ is student t centered around zero with $1 < \kappa \leq 2$ degrees of freedom. The variance parameter $\sigma_\eta^2$ is fixed. For the parameter $\theta$ the agents apply a student t prior with $\nu$ degrees of freedom, mean parameter $m_0$ and variance parameter $\sigma_\eta^2/M_0$. W.l.o.g. set $\sigma_\eta^2 = 1$. Then the posterior of the unknown parameter $\theta$ is student t with variance parameter $M_t = 1/(t(1 - \zeta)^2 + M_0)$ and mean parameter $m_t = \frac{(1 - \zeta) \sum_{s=1}^{t} \Delta_\zeta z_s + M_0 m_0}{t(1 - \zeta)^2 + M_0}$ (see Harrison and West (1997)[p. 422]). In the same way as in the baseline model we get $E(\theta|\mathcal{I}_t^{LI}) = m_{t-1}$. $m_t$ converges to $\theta$ (this can also be observed directly by applying Etemadi's version of the strong law of large numbers, Klenke (2008)[p. 112]). Here only $L^{\kappa'}$ convergence, with $\kappa' < \kappa$, follows from the dominated convergence theorem.

Example 9. Given the assumptions of Section 2, but $\Delta_\zeta z_t = \varepsilon_t$, where $\varepsilon_t$ follows a truncated normal distribution with lower bound $\underline{\varepsilon}$ and upper bound $\bar{\varepsilon}$. The mean parameter is $\nu$, the variance parameter $\sigma_{TN}^2$ is fixed, $\nu \in [\underline{\varepsilon}, \bar{\varepsilon}]$. $\nu$ is such that $E(\varepsilon_t) = \theta(1 - \zeta)$. Suppose a normal or a truncated normal prior. With the variance parameter and the bounds fixed, we verified that the conditions of Proposition 3 hold for the unknown parameter $\nu$. Therefore we attain almost sure convergence to the true parameter value $\nu$. Although $L^2$ convergence generally depends on the prior $\pi$ used, if agents assume a (truncated) normal prior then $L^2$ convergence is obtained. Suppose in addition that agents assume a truncated normal prior with parameters $\nu_{TN0}$ and $\sigma_{TN0}^2$ and the same lower and upper bounds. Rodriguez-Yam et al. (2004) have shown that the posterior of $\nu$ based on the data $\Delta_\zeta z^t$ is a truncated normal distribution with mean parameter $a_{TNt} = \frac{(1/\sigma_{TN}^2) \sum_{s=1}^{t} \Delta_\zeta z_s + \nu_{TN0}/\sigma_{TN0}^2}{(1/\sigma_{TN}^2) t + 1/\sigma_{TN0}^2}$ and variance parameter $A_{TNt} = \left( (1/\sigma_{TN}^2) t + 1/\sigma_{TN0}^2 \right)^{-1}$.⁹

⁹ In the baseline model and in this example it is not difficult to verify that the estimates of $\psi = \{\nu, \sigma_{TN}^2\}$ are also consistent, i.e. the variance parameter can also be estimated consistently. Simulation methods as described in Appendix C have to be applied to derive samples from the posterior.

In Examples 8 and 9 we investigated regular problems, i.e. problems where the support of the stochastic noise term does not change with the parameter $\psi$. In Example 9 knowledge of the upper and the lower bounds is a strong assumption. By applying Proposition 3 we derive the following result:

Example 10 (Application of Ghosal et al. (1995a)). Given the assumptions of Example 9, but the lower and the upper bounds of the truncated normal distribution are unknown, i.e. the true parameter generating the data is $\psi = \{\nu, \underline{\varepsilon}, \bar{\varepsilon}\}$. The parameter space is $\Psi \subset \mathbb{R}^3$, with $(\tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}}) \in \Upsilon$ and $\tilde{\underline{\varepsilon}} < \tilde{\bar{\varepsilon}}$. Apply a normal prior to $\nu$. Either (i) $\Upsilon$ is a compact subset of $\mathbb{R}^2$, assumed to be "sufficiently large", and a uniform prior is applied to the pairs $(\tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}}) \in \Upsilon$; or (ii) normal priors are applied to $\tilde{\underline{\varepsilon}}$ and $\tilde{\bar{\varepsilon}}$. With these priors we verified that the conditions of Proposition 3 hold. This implies that the posterior converges almost surely to the true parameter.
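For completeness, the update stated in Example 9 in code form (a sketch of our own implementation; it takes the truncated normal posterior formula above at face value, and all numbers are hypothetical):

```python
import numpy as np
from scipy.stats import truncnorm

def truncnorm_posterior(dz, sigma2_tn, nu0, sigma2_0, lo, hi):
    """Posterior of nu in Example 9: a truncated normal on [lo, hi] with
    mean parameter a_TNt and variance parameter A_TNt as stated above."""
    t = len(dz)
    A = 1.0 / (t / sigma2_tn + 1.0 / sigma2_0)
    a = A * (np.sum(dz) / sigma2_tn + nu0 / sigma2_0)
    sd = np.sqrt(A)
    return truncnorm((lo - a) / sd, (hi - a) / sd, loc=a, scale=sd)

# hypothetical data: nu = 1, bounds [0.5, 1.5], sigma2_TN = 1, zeta = 0
rng = np.random.default_rng(0)
eps = truncnorm(-0.5, 0.5, loc=1.0, scale=1.0).rvs(500, random_state=rng)
post = truncnorm_posterior(eps, 1.0, 1.0, 1.0, 0.5, 1.5)
print(post.mean())   # posterior mean close to nu = 1
```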
C Bayesian Parameter Estimation

In most cases $\pi(\psi|x^t)$ cannot be derived analytically. Fortunately, the statistical literature has developed a bulk of numerical tools, Markov Chain Monte Carlo (MCMC) methods, which allow one to simulate from the posterior distribution. We refer the reader to textbooks on Bayesian statistics, such as Robert (1994), Robert and Casella (1999), Casella and Berger (2001) or Frühwirth-Schnatter (2006). For the underlying paper it is sufficient to know that these econometric tools allow one to sample from the posterior $\pi(\psi|x^t)$ whenever the likelihood $f(x^t|\psi)$ is available. The samples are denoted by $\psi^{[m]}$, $m = B + 1, \dots, M$. $B$ is the number of burn-in steps required to attain convergence of the corresponding Markov chain. The conditional expectation $E(\psi|.)$ is then derived by taking the sample mean of the $\psi^{[m]}$.

Since we have to deal with a missing data problem in Section 3, we briefly remark on data augmentation, proposed in Tanner and Wong (1987). The trick of this method is quite simple: Suppose that $f(x|\psi)$ is not available in closed form but $f(x|\psi, y)$ is, where $y$ is a latent variable, a missing value or the realization of a latent process. By means of the Bayes theorem the posterior $\pi(\psi, y|x)$ is proportional to $f(x|\psi, y) \pi(\psi, y)$. By the above arguments, we are able to sample from the joint posterior $\pi(\psi, y|x)$, such that posterior samples of $\psi$ and $y$ are available as well.

This trick can be applied to a regression setting with missing values. Using the notation of Section 2, consider the model $\Delta_\zeta z_t := z_t - \zeta z_{t-1} = (1 - \zeta)\theta + \eta_t$, $\zeta \neq 1$ and $\eta_t \sim N(0, \sigma_\eta^2)$, where the data $\Delta_\zeta z^t$ include missing values. The regular observations are abbreviated by $\Delta_\zeta z^*$, the missing values are $\Delta_\zeta z^t \setminus \Delta_\zeta z^*$. Now augment the parameter space by $\Delta_\zeta z^{**}$, where each element of $\Delta_\zeta z^{**}$ replaces a missing observation of $\Delta_\zeta z^t$. Then the joint posterior of $\theta$ and $\Delta_\zeta z^{**}$ can be derived by running a Bayesian sampler consisting of the following steps. For each sweep $m$:

Step 1: sample $\theta^{[m]}$ from $p(\theta|\Delta_\zeta z^*, \Delta_\zeta z^{**,[m-1]})$.

Step 2: sample $\Delta_\zeta z^{**,[m]}$ from $p(\Delta_\zeta z^{**}|\theta^{[m]}, \Delta_\zeta z^*)$.

Sampling Step 1 exactly corresponds to drawing from the normal distribution described by (10). For the augmented missing values $\Delta_\zeta z^{**}$ one can work with the Metropolis-Hastings algorithm. Given $\theta$, $\Delta_\zeta z_t^{**}$ is normally distributed with mean $(1 - \zeta)\theta$ and variance $\sigma_\eta^2$; by means of $\theta^{[m]}$ and $\Delta_\zeta z^t$ one directly obtains the residuals $\eta_t$. The likelihood $f((\Delta_\zeta z^*, \Delta_\zeta z^{**})|\theta)$ is given by the product of normals with mean $(1 - \zeta)\theta$ and variance $\sigma_\eta^2$. Thus, one proposes $\Delta_\zeta z_t^{**,p}$ from a normal density $q(\Delta_\zeta z_t^{**,.})$ with mean $(1 - \zeta)\theta^{[m]}$ and variance $\sigma_\eta^2$ and accepts the proposal $\Delta_\zeta z^{**,p}$ with probability

$$\min\left\{ 1, \, \frac{f((\Delta_\zeta z^*, \Delta_\zeta z^{**,p})|\theta^{[m]})}{f((\Delta_\zeta z^*, \Delta_\zeta z^{**,[m-1]})|\theta^{[m]})} \, \frac{q(\Delta_\zeta z^{**,[m-1]})}{q(\Delta_\zeta z^{**,p})} \right\},$$

otherwise $\Delta_\zeta z^{**,[m]} = \Delta_\zeta z^{**,[m-1]}$. Here $\Delta_\zeta z^{**,[m-1]}$ are the current augmented missing values in the sampler. Alternative proposal densities can be applied as well, as can forward-filtering backward sampling (see Frühwirth-Schnatter (1994)). If necessary, the missing $z_t$ can now be derived from $z_0$ and $\Delta_\zeta z^t$.

Prediction and Missing Values: If at time $t$ the observations $z_{t-j}, \dots, z_{t-1}$ are missing, then firms can make use of (14). Strictly sticking to Bayesian sampling, $\Delta_\zeta z_{t-j-1}^{**}, \dots, \Delta_\zeta z_{t-1}^{**}$ will be simulated by means of Step 2; the missing $z_t^{**}$ follow from $\Delta_\zeta z^t$. By the structure of the model, $z_{t+1}|z_t, \theta^{[m]}$ is normally distributed with mean $(1 - \zeta)\theta^{[m]} + \zeta z_t$ and variance $\sigma_\eta^2$. $\theta^{[m]}$ is, by (10), normally distributed with mean $a_{t-1}$ and variance $A_{t-1}$.
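A compact sketch of the two-step sampler above for the Gaussian model (our own implementation; note that the proposal $q$ described above coincides with the conditional density of the missing values given $\theta$, so the acceptance probability equals one and Step 2 reduces to a Gibbs draw):

```python
import numpy as np

def augmented_gibbs(dz, zeta, sigma2_eta, theta0, sigma2_0, n_sweeps=5000, rng=None):
    """Data-augmentation sampler for theta in Delta_zeta z_t = (1-zeta)*theta + eta_t.
    `dz` is an array with np.nan marking missing fractional differences.
    Step 1 draws theta from the conjugate posterior (10) given the completed data;
    Step 2 redraws the missing values from N((1-zeta)*theta, sigma2_eta)."""
    rng = rng or np.random.default_rng()
    dz = np.asarray(dz, dtype=float).copy()
    miss = np.isnan(dz)
    dz[miss] = (1 - zeta) * theta0            # initialize augmented values
    t = len(dz)
    draws = np.empty(n_sweeps)
    for m in range(n_sweeps):
        # Step 1: theta | completed data ~ N(a_t, A_t), cf. (10)
        A = 1.0 / ((1 - zeta) ** 2 * t / sigma2_eta + 1.0 / sigma2_0)
        a = A * ((1 - zeta) * dz.sum() / sigma2_eta + theta0 / sigma2_0)
        theta = a + np.sqrt(A) * rng.standard_normal()
        # Step 2: missing Delta_zeta z | theta ~ N((1-zeta)*theta, sigma2_eta)
        dz[miss] = (1 - zeta) * theta + np.sqrt(sigma2_eta) * rng.standard_normal(miss.sum())
        draws[m] = theta
    return draws

# hypothetical example: zeta = 0.5, theta = 1, two missing observations
rng = np.random.default_rng(0)
dz = 0.5 * 1.0 + rng.normal(0.0, 1.0, 50)
dz[[10, 11]] = np.nan
print(augmented_gibbs(dz, 0.5, 1.0, 0.0, 10.0, rng=rng)[1000:].mean())
```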
References

Adam, K. and Marcet, A. (2010). Booms and busts in asset prices. IMES Discussion Paper Series 10-E-02, Institute for Monetary and Economic Studies, Bank of Japan.

Alfarano, S. and Milakovic, M. (2009). Network structure and n-dependence in agent-based herding models. Journal of Economic Dynamics and Control, 33(1):78–92.

Barron, A., Schervish, M. J., and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics, 27(2):536–561.

Bickel, P. and Doksum, K. (2001). Mathematical Statistics. Prentice Hall, New Jersey, 2nd edition.

Billingsley, P. (1986). Probability and Measure. Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 2nd edition.

Blume, L. and Easley, D. (1993). Rational expectations and rational learning. Game Theory and Information, EconWPA.

Blume, L. E. and Easley, D. (1998). Rational expectations and rational learning. In Mukul Majumdar, editor, Organizations with Incomplete Information. Cambridge University Press.

Brock, W. A. and Hommes, C. H. (1997). A rational route to randomness. Econometrica, 65(5):1059–1096.

Casella, G. and Berger, R. (2001). Statistical Inference. Duxbury Resource Center, Boston, 2nd edition.

Chernozhukov, V. and Hong, H. (2004). Likelihood estimation and inference in a class of nonregular econometric models. Econometrica, 72(5):1445–1480.

Chib, S. (1993). Bayes regression with autoregressive errors. Journal of Econometrics, 58:77–99.

Christensen, R. (2009). Inconsistent Bayesian estimation. Bayesian Analysis, 4:413–416.

Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates. Annals of Statistics, 14(1):1–26.

Evans, G. and Honkapohja, S. (2001). Learning and Expectations in Macroeconomics. Princeton University Press, Princeton.

Feldman, M. (1991). On the generic nonconvergence of Bayesian actions and beliefs. Economic Theory, 1:301–321.

Fernandez, C. and Steel, M. F. (1998). On Bayesian modeling of fat tails and skewness. Journal of the American Statistical Association, 93(441):359–371.

Freedman, D. (1999). On the Bernstein-von Mises theorem with infinite dimensional parameters. Technical Report 492, Department of Statistics, University of California, Berkeley.

Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis, 15:183–202.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, New York.

Ghosal, S., Ghosh, J. K., and Samanta, T. (1995a). Stability and convergence of posterior in non-regular problems. Mathematical Methods of Statistics, 4(4):361–388.

Ghosal, S. (2000). A review of consistency and convergence of posterior distribution. Technical report, Proceedings of the National Conference in Bayesian Analysis, Banaras Hindu University, Varanasi, India.

Ghosal, S., Ghosh, J. K., and Samanta, T. (1994). Stability and convergence of posterior in nonregular problems. In Gupta, S. S. and Berger, J. O., editors, Statistical Decision Theory and Related Topics 5. Springer, New York.

Ghosal, S., Ghosh, J. K., and Samanta, T. (1995b). On convergence of posterior distributions. Annals of Statistics, 23(6):2145–2152.

Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28:500–531.

Ghosal, S. and Tang, Y. (2006). Bayesian consistency for Markov processes. Sankhya, 68:227–239.

Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-iid observations. Annals of Statistics, 35:192–223.
Guesnerie, R. (1992). An exploration of the eductive justifications of the rational-expectations hypothesis. American Economic Review, 82(5):1254–1278.

Guesnerie, R. (1993). Theoretical tests of the rational expectations hypothesis in economic dynamical models. Journal of Economic Dynamics and Control, 17(5-6):847–864.

Guesnerie, R. and Jara-Moroni, P. (2009). Expectational coordination in simple economic contexts: Concepts and analysis with emphasis on strategic substitutabilities. Working Paper 2009-27, Paris School of Economics.

Harrison, J. and West, M. (1997). Bayesian Forecasting and Dynamic Models. Springer.

Hirano, K. and Porter, J. R. (2003). Asymptotic efficiency in parametric structural models with parameter-dependent support. Econometrica, 71(5):1307–1338.

Ibragimov, I. A. and Has'minskii, R. Z. (1981). Statistical Estimation: Asymptotic Theory. Applications of Mathematics, vol. 16, Springer, New York.

Jang, G. H., Lee, J., and Lee, S. (2010). Posterior consistency of species sampling priors. Statistica Sinica, 20:581–593.

Jun, B. and Vives, X. (1996). Learning and convergence to a full-information equilibrium are not equivalent. Review of Economic Studies, 63(4):653–674.

Kalai, E. and Lehrer, E. (1992). Bayesian forecasting. Discussion Paper 998, Department of Managerial Economics and Decision Sciences, Kellogg School of Management, Northwestern University.

Kelly, D. L. and Kolstad, C. D. (1999). Bayesian learning, growth, and pollution. Journal of Economic Dynamics and Control, 23(4):491–518.

Klenke, A. (2008). Probability Theory - A Comprehensive Course. Springer.

LeCam, L. M. and Yang, G. L. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer, New York.

Lehmann, E. (1991). Theory of Point Estimation. Wadsworth and Brooks/Cole.

Lijoi, A., Prünster, I., and Walker, S. G. (2007). Bayesian consistency for stationary models. Econometric Theory, 23(4):749–759.

Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability. Cambridge University Press (Cambridge Mathematical Library), New York, 2nd edition.

Paolella, M. (2007). Intermediate Probability - A Computational Approach. Wiley.

Poirier, D. J. (1995). Intermediate Statistics and Econometrics: A Comparative Approach. MIT Press.

Pötscher, B. M. and Prucha, I. R. (1997). Dynamic Nonlinear Econometric Models: Asymptotic Theory. Springer, New York.

Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York.

Robert, C. P. (1994). The Bayesian Choice. Springer, New York.

Rodriguez-Yam, G., Davis, R. A., and Scharf, L. (2004). Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. Technical report, Department of Statistics, Columbia University.

Rousseau, J., Chopin, N., and Liseo, B. (2010). Bayesian nonparametric estimation of the spectral density of a long or intermediate memory Gaussian process. Working paper, Université Paris Dauphine, Paris.

Routledge, B. (1999). Adaptive learning in financial markets. The Review of Financial Studies, 12:1165–1202.

Schwartz, L. (1965). On Bayes procedures. Probability Theory and Related Fields, 4:10–26.

Shalizi, C. R. (2010). Frequentist consistency of Bayesian procedures. Notebook, University of Michigan, http://cscs.umich.edu/~crshalizi/notebooks/bayesian-consistency.html.

Smith, L. and Sørensen, P. (2000). Pathological outcomes of observational learning. Econometrica, 68(2):371–398.
Stinchcombe, M. B. (2005). The unbearable flightiness of Bayesians: Generically erratic updating. Working paper, University of Texas at Austin.

Strasser, H. (1985). Mathematical Theory of Statistics. Studies in Mathematics 7. de Gruyter, Berlin.

Sun, Y. (2006). The exact law of large numbers via Fubini extension and characterization of insurable risks. Journal of Economic Theory, 126(1):31–69.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–550.

Timmermann, A. (1996). Excess volatility and predictability of stock prices in autoregressive dividend models with learning. Review of Economic Studies, 63(4):523–575.

Vives, X. (2008). Information and Learning in Markets. Princeton University Press.

Walker, S. (2004). New approaches to Bayesian consistency. Annals of Statistics, 32:2028–2043.