Bayesian Learning, Shutdown and Convergence∗

Leopold Sögner†

October 15, 2011

Abstract

This article investigates a partial equilibrium production model with dynamic information aggregation. Firms apply Bayesian learning to estimate the unknown model parameter. In the baseline setting, where prices and quantities are supported on the real line and the noise term is Gaussian, convergence of the limited information to the full information setting is obtained. Imposing a non-negativity constraint on quantities destroys the convergence results obtained in the baseline model. With this constraint, firms learning an unknown demand intercept parameter exit with strictly positive probability, even when the true value of this parameter would induce production in the full information setting. Parts of the model can be rescued by assuming bounded support for the stochastic noise term. Although shutdown can be excluded with bounded support when the forecasts of the agents satisfy relatively mild regularity conditions, Bayesian learning and convergence need not take place in general.

Keywords: Bayesian Learning, Consistency, Convergence. JEL: D82, G10.

∗ The author appreciates helpful comments from Larry Blume, Egbert Dierker, Hildegard Dierker, Klaus Ritzberger, Andreas Danis, David Florysiak, Michael Greinecker, Maarten Janssen, Julian Kolm, Martin Meier and all participants of the Micro Jour Fixe at IHS (summer & winter term 2010), the VGSE seminar (Vienna 2011) and the 11th SAET conference (Faro 2011).
† Leopold Sögner, Department of Economics and Finance, Institute for Advanced Studies, Stumpergasse 56, 1060 Vienna, Austria, soegner@ihs.ac.at

1 Introduction

One of the workhorse models when considering learning is a partial equilibrium setting with normally distributed noise and agents applying Bayesian learning. The mean parameter of the stochastic noise term is unknown while the variance term is kept fixed. By assuming a conjugate normal prior, the posterior of the unknown mean parameter is normal, and the posterior mean is a mixture of the prior mean and the sample average. In addition to some other nice features of the normal distribution, this property also explains why a normal setting is used so frequently: a dynamic setting with an affine implied law of motion results in an affine perceived actual law of motion.

In this article the baseline model is the partial equilibrium model with production investigated in Jun and Vives (1996), JV in the following. JV have already shown that learning is not equivalent to convergence of the limited information to the full information equilibrium. However, for all non-unit-root settings the unknown parameter is learned, while convergence of the limited information quantities to full information quantities is attained for all feasible parameter values. This paper shows that various deviations from the affine normal setting can result in neither learning nor convergence. While quantities are real numbers in the baseline model, by requiring non-negative quantities - while sticking to a normally distributed demand intercept - we show that due to a shutdown option firms with limited information can leave the market while informed firms stay in it. Here the convergence result of the baseline setting breaks down.

Let us relate this article to recent literature. As already stated above, JV have shown that learning and convergence are not equivalent. The JV setting assumes a continuum of price takers and public signals only.
By the first assumption, solving for equilibrium does not require concepts from game theory, while the second assumption excludes any herding effects (for an overview see e.g. Vives (2008)[chapter 6]). Vives (2008) also presents a convergence result with herding behavior; this model also works with the normal distribution. The deviations discussed in this paper would also have an impact on the convergence results presented there. Alternatively, Smith and Sørensen (2000) considered a herding model with binomial outcomes. If there were no private signals in their setting, the Bayes estimator would be consistent since the sample space is finite. Smith and Sørensen (2000) show that learning from public and private information can give rise to confounded learning, a situation where a Markov transition probability becomes independent of the last realization of the state variable. Since only public information arises in our model, such a situation cannot appear here. However, our setting provides different examples where learning does not take place. We need neither heterogeneity of agents (informed vs. non-informed) nor herding behavior. Non-convergent behavior arises either from deviations from the normal distribution or from a shutdown condition (or both). For further stability results with respect to herding we refer the reader to Alfarano and Milakovic (2009).

A model very similar to our baseline setting has also been investigated in Guesnerie (1992) to check for eductive stability or a strongly rational expectations equilibrium; see also Guesnerie (1993) and Guesnerie and Jara-Moroni (2009) for a more general treatment of this concept. This strand of literature raises the question whether rational expectations can be supported by game-theoretic reasoning. There the author has shown that strong rationality depends on the eigenvalue of a cobweb mapping derived from a composition of the inverse aggregate demand with the supply function. In this article we abstract from the problem of strong rationalization and put the focus on the requirement of non-negative quantities.

This paper is organized as follows: Section 2 describes the benchmark model. Section 3 introduces non-negative quantities and a shutdown option for firms. If the feasible set is restricted to the non-negative real axis, shutdown - that is to say, a production of zero - can be an optimal strategy. Since no signals are received after shutdown, the agents then also decide to stop learning, and even in the limit the unknown parameter remains a random variable. With these modifications the convergence results obtained by JV may break down. Section 4 shows that a bounded support of the stochastic noise term and regularity conditions on the forecasts provide a more realistic model in which the results of the benchmark model can be resurrected. Although a convergence result follows easily in a quite mechanical way, it is important to note that this simple setup already shows that even with bounded support of the noise term, convergence need not be attained in general. Regularity conditions on the learning scheme are necessary to guarantee convergence. Examples show that matching these conditions can but need not be trivial. Last but not least we investigate robustness by allowing for deviations from normal noise. We present examples - based on Diaconis and Freedman (1986) and Christensen (2009) - where convergence need not occur even with bounded support.
This last topic is related to the consistency of the Bayes estimate (see e.g. Diaconis and Freedman (1986) or Strasser (1985)). Applications and extensions in economic theory have been presented e.g. in Blume and Easley (1993), Blume and Easley (1998), Feldman (1991), Kalai and Lehrer (1992) and Stinchcombe (2005). Appendix B reconsiders the results presented in the literature and provides some further examples.

2 The Benchmark Model

This section describes the model and the key results obtained in JV. Their results are based on the assumption that both the prices $p_t$ and in particular the quantities $x_t$ may be negative. JV and also Vives (2008)[chapter 7.2] consider a discrete time model of dynamic information aggregation; time is indexed by $t = 0, 1, 2, \dots$, economic activity starts at $t = 1$. The aggregate demand function is described by the stochastic linear relationship:

$$p_t = z_t - \beta x_t \quad \text{for } t \geq 1. \quad (1)$$

$p_t$ is the price established in period $t$, $\beta > 0$ is a constant and $x_t$ is the aggregate quantity consumed. $z_t$ is a random variable. The fractional differences $\Delta_\zeta z_t = z_t - \zeta z_{t-1}$ are described by

$$\Delta_\zeta z_t = (1 - \zeta)\theta + \eta_t, \quad (2)$$

where $\theta, \zeta \in \mathbb{R}$. With $u_t = z_t - \theta$ we get an autoregressive and a moving average representation:

$$z_t = \zeta z_{t-1} + (1 - \zeta)\theta + \eta_t = \theta + u_t = \theta + \sum_{s=0}^{t-1} \zeta^s \eta_{t-s} + \zeta^t u_0 = \theta + \sum_{s=0}^{t-1} \zeta^s \eta_{t-s} + \zeta^t (z_0 - \theta). \quad (3)$$

In addition $u_t = \zeta u_{t-1} + \eta_t$. In the baseline setting $\eta_t$ is iid normal with mean zero and variance $\sigma_\eta^2$, $\eta_t \sim N(0, \sigma_\eta^2)$. For $\zeta \neq 1$ the process follows a first order autoregressive process with normal innovations, while for $\zeta = 1$ we get a random walk. By this specification $x_t$ and $p_t \in \mathbb{R}$. $z_t$ is the stochastic demand intercept.¹ We assume that the process $(z_t)$ is started at $z_0 = \theta$, which has also been done implicitly in JV by assuming $u_0 = 0$. This assumption will reduce the computational burden in Section 4. It is important to note that $z_0$ cannot be observed by the firms.

Consider a continuum of firms $i \in [0, 1]$ endowed with Lebesgue measure. Each firm $i$ is a price taker and produces the homogeneous output $x_{it}$ at cost $C(x_{it}) = \frac{\lambda}{2} x_{it}^2$; the parameter $\lambda > 0$. Aggregate output fulfills $x_t = \int_0^1 x_{it} \, di$.² Depending on what firms observe, we distinguish between:

1. Full information (FI): At period $t$, the firms know $\theta$ and the past prices $p^{t-1} = \{p_1, \dots, p_{t-1}\}$. In addition firms know the structure of demand (1) and the model parameters $\lambda$, $\beta$, $\sigma_\eta^2$, $\zeta$, and $u_0 = 0$.

2. Limited information (LI): Firms know $\lambda$, $\beta$, $\sigma_\eta^2$, $\zeta$, and $u_0 = 0$ and observe past prices $p^{t-1}$. They do not know $\theta$, but they know the structure of demand (1) and (2). Firms use Bayes' rule to update beliefs about $\theta$. The prior is given by $\theta \sim N(\theta_0, \sigma_0^2)$. This is what Vives (2008)[chapter 7.1] calls learning within an equilibrium.

For notational simplicity abbreviate the information sets by $\mathcal{I}_t^{FI}$ and $\mathcal{I}_t^{LI}$ for the full information and the limited information case, respectively. Since firms are price takers, the profit function is $\pi_{it} = p_t x_{it} - \frac{\lambda}{2} x_{it}^2$ and the value function is

$$V_{it} = (1 - \delta) E\left( \sum_{k=0}^{\infty} \delta^k \left( p_{t+k} x_{i,t+k} - \frac{\lambda}{2} x_{i,t+k}^2 \right) \Big| \mathcal{I}_t^{(.)} \right)$$

with some discount factor $\delta \in (0, 1)$. By (1), (3) and $\zeta \in \mathbb{R}$ the value function need not be finite in general. However, by the structure of the optimization problem, maximizing $V_{it}$ breaks up into the one period optimization problems

$$\max_{x_{it}} E\left( p_t x_{it} - \frac{\lambda}{2} x_{it}^2 \Big| \mathcal{I}_t^{(.)} \right). \quad (4)$$

Given the information set $\mathcal{I}_t^{(.)}$ the first order condition yields:

$$x_{it} = \frac{E(p_t | \mathcal{I}_t^{(.)})}{\lambda}. \quad (5)$$

¹ For some examples it is more convenient to work with $\Delta_\zeta z_t = \varepsilon_t$, where $\varepsilon_t$ is centered around $(1 - \zeta)\theta$ (if the mean exists).
Such a transformation makes sense if asymmetric noise (e.g. an asymmetric (truncated) normal) or noise supported only on a subset of $\mathbb{R}$ is considered (e.g. Gamma distribution, truncated distributions). This notation will be applied in Examples 1 to 4 and in some examples in Appendix B.

² Since the $x_{it}$ are equal for all $i$, except for countably many $i$, the exact law of large numbers can be applied, such that $x_t = \int_0^1 x_{it} \, di$ still holds. For more details see Sun (2006).

Market Clearing: In period $t$ firm $i$ produces $x_{it} = E(p_t|\mathcal{I}_t^{(.)})/\lambda$ and aggregate (average) output is $\int_0^1 x_{it} \, di = x_t = E(p_t|\mathcal{I}_t^{(.)})/\lambda$. Immediately after this output decision the random variable $\eta_t$ realizes, resulting in $p_t = z_t - \beta x_t = z_t - \beta E(p_t|\mathcal{I}_t^{(.)})/\lambda$.

Remark 1. Since $\mathrm{sgn}(x_{it}) = \mathrm{sgn}(E(p_t|\mathcal{I}_t^{(.)}))$ by (5), expected revenues are positive no matter whether positive or negative prices are expected. Substitution of (5) into the expected profit function results in

$$E(\pi_{it}|\mathcal{I}_t^{(.)}) = \frac{E(p_t|\mathcal{I}_t^{(.)})^2}{\lambda} - \frac{\lambda E(p_t|\mathcal{I}_t^{(.)})^2}{2\lambda^2} = \frac{E(p_t|\mathcal{I}_t^{(.)})^2}{2\lambda} > 0 \quad \text{(a.s.)}.$$

Agents commit to supply $x_{it} \in \mathbb{R}$ even if $\mathrm{sgn}(x_{it}) = \mathrm{sgn}(E(p_t|\mathcal{I}_t^{(.)})) \neq \mathrm{sgn}(p_t)$. In this case $\pi_{it} < 0$. Hence, realized profits $\pi_{it} \in \mathbb{R}$. By the law of iterated expectations $E(\pi_{it}) > 0$.

Convergence, Prediction and Learning: We restrict our analysis to learning within an equilibrium, as defined in Vives (2008)[p. 249]. Examples of different learning schemes in different fields of economics are provided e.g. in Timmermann (1996), Brock and Hommes (1997), Kelly and Kolstad (1999), Routledge (1999) or Evans and Honkapohja (2001). For an overview and a bulk of literature we refer the reader to Vives (2008)[chapters 7.1 and 10.2].

Due to (5), firms have to predict the price $p_t$. By the inverse demand function (1) we get

$$E(p_t|\mathcal{I}_t^{(.)}) = E(z_t|\mathcal{I}_t^{(.)}) - \beta x_t. \quad (6)$$

$x_t$ is $\mathcal{I}_t^{(.)}$ measurable. For $k \geq 0$, the $k$ step ahead prediction is obtained by means of

$$E(z_{t+k}|\mathcal{I}_t^{(.)}) = (1 - \zeta) E(\theta|\mathcal{I}_t^{(.)}) \sum_{i=0}^{k} \zeta^i + \zeta^{k+1} z_{t-1}. \quad (7)$$

For the full information case, where $\theta$ is known, (7) results in $E(z_t|\mathcal{I}_t^{FI}) = (1 - \zeta)\theta + \zeta z_{t-1}$, such that

$$x_t^{FI} = \frac{(1 - \zeta)\theta + \zeta z_{t-1}}{\lambda + \beta} = \frac{\theta + \zeta u_{t-1}}{\lambda + \beta}. \quad (8)$$

By the inverse demand function, $y_t := p_t + \beta x_t = p_t + \beta E(p_t|\mathcal{I}_t^{(.)})/\lambda = \theta + u_t = z_t$. With limited information firms know that $y_t = z_t$. From (1) and the specification of the noise term, $\eta_t \sim$ iid $N(0, \sigma_\eta^2)$, algebra yields

$$z_t = \theta + \zeta(z_{t-1} - \theta) + \eta_t \quad \text{and} \quad \Delta_\zeta z_t := z_t - \zeta z_{t-1} = (1 - \zeta)\theta + \eta_t. \quad (9)$$

For $\zeta = 1$, we directly observe from $z_t = \theta(1 - \zeta) + \zeta z_{t-1} + \eta_t$ that the parameter $\theta$ is not identified. That is to say, for any $z_t \in \mathbb{R}$ the likelihood $f(z_t|\theta, \zeta = 1) = f(z_t|\theta', \zeta = 1)$ for arbitrary pairs $\theta, \theta'$. The last term in (9) corresponds to a linear regression setting with response variable $\Delta_\zeta z_t$, the (constant) prediction variable $(1 - \zeta)$ and normal innovations $\eta_t$. With the conjugate normal prior³ $\theta \sim N(\theta_0, \sigma_0^2)$, under limited information firms derive the posterior distribution of the parameter $\theta$ by means of (see e.g. Vives (2008)[Appendix 10.2], Chib (1993), Robert (1994)[Chapter 4]):

$$\theta \sim N(a_t, A_t), \quad \text{where} \quad a_t = \frac{\frac{1}{\sigma_\eta^2}(1 - \zeta) \sum_{s=1}^{t} \Delta_\zeta z_s + \frac{\theta_0}{\sigma_0^2}}{\frac{1}{\sigma_\eta^2}(1 - \zeta)^2 t + \frac{1}{\sigma_0^2}}, \quad A_t = \left( \frac{1}{\sigma_\eta^2}(1 - \zeta)^2 t + \frac{1}{\sigma_0^2} \right)^{-1}. \quad (10)$$

By means of $a_t$ firms immediately get the conditional expectation of $\theta$.⁴ Since $\mathcal{I}_t^{LI}$ is generated by observations from $1, \dots, t-1$, the conditional expectation $E(\theta|\mathcal{I}_t^{LI}) = a_{t-1}$. Then, by (7), forecasts of $z_{t+k}$ are

$$E(z_{t+k}|\mathcal{I}_t^{LI}) = (1 - \zeta) E(\theta|\mathcal{I}_t^{LI}) \sum_{i=0}^{k} \zeta^i + \zeta^{k+1} z_{t-1} = (1 - \zeta) a_{t-1} \sum_{i=0}^{k} \zeta^i + \zeta^{k+1} z_{t-1}. \quad (11)$$

(5) and (11), with $k = 0$, yield

$$x_t = \frac{E(z_t|\mathcal{I}_t^{LI})}{\lambda + \beta}. \quad (12)$$

³ The following analysis can also be performed with an uninformative Jeffreys prior (see e.g. Robert (1994)), which can be motivated by the (frequentist) argument that information should only be provided by the data. This type of prior minimizes the impact of the prior. For the current normal setting this implies that the terms including $\theta_0$, $\sigma_0^2$ or both vanish in (10). Proposition 1 and the results in Examples 5 and 6 still hold when the conjugate normal prior is replaced by an uninformative Jeffreys prior.

⁴ Only $z_t$ for $t \geq 1$ are available. At $t = 1$ firms use $E(z_1|\mathcal{I}_1^{LI}) = \theta_0$. At $t = 2$, $z_1$ is already known by the firms, but they cannot calculate $\Delta_\zeta z_1$. Here we assume that firms apply $\Delta_\zeta z_1 = z_1 - \theta_0$. Then (10) can be used. Alternatively, we could also start with a setup where no $\Delta_\zeta z_1$ is available but agents apply Bayesian simulation methods to sample the joint posterior of the parameter $\theta$ and the initial value, as described in Appendix C.
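The updating and output rules (10)-(12) are straightforward to implement. The following minimal sketch (our own illustration; function and variable names are ours) computes the posterior of $\theta$ and the limited-information output:

```python
import numpy as np

def posterior_theta(dz, zeta, sigma2_eta, theta0, sigma2_0):
    """Conjugate update (10): posterior N(a_t, A_t) of theta given the
    fractional differences dz = (Delta_zeta z_1, ..., Delta_zeta z_t)."""
    t = len(dz)
    A_t = 1.0 / ((1.0 - zeta) ** 2 * t / sigma2_eta + 1.0 / sigma2_0)
    a_t = A_t * ((1.0 - zeta) * np.sum(dz) / sigma2_eta + theta0 / sigma2_0)
    return a_t, A_t

def li_output(a_prev, z_prev, zeta, lam, beta):
    """Limited-information output (12), using the one step ahead forecast
    E(z_t | I_t^LI) = (1 - zeta) a_{t-1} + zeta z_{t-1}, i.e. (11) with k = 0."""
    return ((1.0 - zeta) * a_prev + zeta * z_prev) / (lam + beta)
```

Note that for $\zeta = 1$ the update leaves $(a_t, A_t)$ at $(\theta_0, \sigma_0^2)$, in line with the identification problem discussed above.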
Before we restate the JV convergence result for the baseline model, let us briefly discuss some properties of $(x_t)$ and $(x_t^{FI})$. $(x_t^{FI})$ follows a first order autoregressive process. Generally, a first order autoregressive process $u_t = \zeta u_{t-1} + \eta_t$ is ergodic (stationary in the limit) if $|\zeta| < 1$ and $E(|\eta_t|) < \infty$ (see e.g. Meyn and Tweedie (2009)[Chapters 11 and 15]; in this case $(u_t)$ is also geometrically ergodic). This implies that the distribution of $u_t$, $\mu_t$, converges to its stationary distribution $\mu$. If $u_0 \sim \mu$, then the process is stationary and ergodic. $u_t \in L^p$ if $\int |u_t|^p \, d\mu_t < \infty$, where $\mu_t$ is the probability law of $u_t$. For normal innovations, if $|\zeta| < 1$ then $(u_t)$ is stationary in the limit and the $p$-th moments exist, such that $u_t \in L^p$. Therefore, $(x_t^{FI})$ is stationary in the limit and in $L^p$ if $|\zeta| < 1$. If $|\zeta| \geq 1$, neither $(x_t)$ nor $(x_t^{FI})$ is ergodic; discussing $L^p$ convergence of the quantities does not make sense in this case. If $|\zeta| \geq 1$, we can work with the fractional differences $\Delta_\zeta x_t^{(.)} = x_t^{(.)} - \zeta x_{t-1}^{(.)}$. For the full-information quantities this results in $\Delta_\zeta x_t^{FI} = \frac{(1 - \zeta)\theta + \zeta \eta_{t-1}}{\lambda + \beta}$. $(\Delta_\zeta x_t^{FI})$ is stationary and in $L^p$ if $(\eta_t)$ is stationary and in $L^p$. Given our model assumptions, the fractional differences $\Delta_\zeta x_t^{FI}$ are ergodic and in $L^p$ for all $\zeta$. For the limited information case, some algebra yields

$$\Delta_\zeta x_t^{LI} = \Delta_\zeta x_t^{FI} + \frac{(1 - \zeta)\left[ a_{t-1} - \zeta a_{t-2} - (1 - \zeta)\theta \right]}{\lambda + \beta}. \quad (13)$$

The second term on the right-hand side of (13) goes to zero if $(a_t)$ converges to $\theta$. (13) also implies that $x_t - x_t^{FI} = \frac{1 - \zeta}{\lambda + \beta}(a_{t-1} - \theta)$. If $\zeta = 1$, then $x_t^{LI} = x_t^{FI}$ at least for $t \geq 2$, but $\theta$ is not learned. For all $\zeta \neq 1$, $(a_t)$ converges to $\theta$ almost surely and in $L^p$. $\Delta_\zeta x_t^{LI}$ and $x_t^{LI}$ converge to their full information counterparts for $\zeta \in \mathbb{R}$ and for $|\zeta| < 1$, respectively. Based on this discussion we reformulate the convergence result obtained by JV:

Proposition 1. [Propositions 2.1, 2.2 in JV; Proposition 7.1 in Vives (2008)] (i) If $|\zeta| < 1$ then $E(\theta|\mathcal{I}_t^{LI})$ converges to $\theta$ and $x_t \to x_t^{FI}$ (a.s. and in mean square). (ii) If $\zeta = 1$, except for the first period, no information about $\theta$ can be inferred from prices (with the precision of $E(\theta|\mathcal{I}_t^{LI})$ constant at $1/\sigma_0^2 + 1/\sigma_\eta^2$) but $x_t = x_t^{FI}$ for $t \geq 2$. (iii) If $\zeta \in \mathbb{R}$ and $\zeta \neq 1$, then $E(\theta|\mathcal{I}_t^{LI})$ converges to $\theta$ and $\Delta_\zeta x_t \to \Delta_\zeta x_t^{FI}$ (a.s. and in mean square).

Remark 2. JV also derived that $\sqrt{t}$ convergence is attained. This issue will not be investigated here. If $\zeta \neq 1$ then $E(\theta|\mathcal{I}_t^{LI}) \to \theta$. This implies $\Delta_\zeta x_t \to \Delta_\zeta x_t^{FI}$. Due to the second part of Proposition 1, this is only an "if" statement and not an "if and only if". That is to say, convergence of $x_t$ and learning $\theta$ are not equivalent.
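The convergence in Proposition 1(i) is easy to reproduce numerically. The following self-contained sketch (our own illustration, not the code behind Figure 1; it uses the parameter values of Figure 1 and skips $\Delta_\zeta z_1$ since $z_0$ is unobserved, cf. footnote 4) simulates the baseline model and tracks both quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
zeta, theta, beta, lam = 0.5, 1.0, 1.0, 1.0        # parameter values of Figure 1
theta0, sigma2_0, sigma2_eta = 1.0, 1.0, 1.0
T = 1000

z = [theta]                                        # z_0 = theta (unobserved by firms)
a_prev = theta0                                    # E(theta | I_1^LI) = theta_0
x_li, x_fi = np.empty(T), np.empty(T)

for t in range(1, T + 1):
    # quantities (8) and (12), chosen before eta_t realizes
    x_fi[t - 1] = ((1 - zeta) * theta + zeta * z[-1]) / (lam + beta)
    x_li[t - 1] = ((1 - zeta) * a_prev + zeta * z[-1]) / (lam + beta)
    # nature draws eta_t; firms back out z_t = y_t = p_t + beta * x_t
    z.append((1 - zeta) * theta + zeta * z[-1] + rng.normal(0.0, np.sqrt(sigma2_eta)))
    # conjugate update (10) on Delta_zeta z_s, s = 2, ..., t
    dz = np.array(z[2:]) - zeta * np.array(z[1:-1])
    if dz.size > 0:
        A = 1.0 / ((1 - zeta) ** 2 * dz.size / sigma2_eta + 1.0 / sigma2_0)
        a_prev = A * ((1 - zeta) * dz.sum() / sigma2_eta + theta0 / sigma2_0)

print("posterior mean:", a_prev)                   # close to theta = 1
print("|x_t - x_t^FI| tail:", np.abs(x_li - x_fi)[-3:])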
Subfigures (a) and (b) of Figure 1 provide a graphical illustration of the convergence result obtained in this section.

Remark 3. Reconsidering the model and the results of this section, $\theta$ is estimated in order to predict prices and $z_t$. One might therefore claim that - at least from the firms' point of view - the ultimate goal is not learning but forecasting. The Bayesian tool to describe the distribution of the forecast is the predictive density (predictive distribution). It is derived by means of the Bayes theorem (see e.g. Bickel and Doksum (2001)[p. 254] and Poirier (1995)[Chapter 8.6]). Generally, estimation and prediction are not equivalent. Blume and Easley (1998) present an example where learning occurs but the predictive density remains the same, and an example where the predictive density remains the same even if learning does not take place. In addition the authors show that absolute continuity of the marginal distribution of the parameter with respect to the likelihood is sufficient for convergence of the predictive distribution (see Blume and Easley (1998)[Theorem 2.3]). For the current model: If $\zeta \neq 1$ the predictive density of $z_t - \zeta z_{t-1}$ is a normal distribution with mean $(1 - \zeta)a_{t-1}$ and variance $(1 - \zeta)^2 A_{t-1} + \sigma_\eta^2$. If $t \to \infty$ then $a_t \to \theta$ and the forecast variance converges to $\sigma_\eta^2$. This implies that the asymptotic Bayesian and frequentist prediction intervals are equal. (If the parameter $\sigma_\eta^2$ is unknown, the predictive density is a t-distribution; the convergence result for the forecast densities still holds.) For $\zeta = 1$ the predictive distribution remains the same for all $t \geq 2$ although $\theta$ is not learned: $z_t$ is normal with mean $z_{t-1}$ and variance $\sigma_\eta^2$. That is to say, although learning need not take place in this baseline model, the forecasting distribution converges to the correct limit distribution for all $\zeta$.
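The closed form in Remark 3 is easy to evaluate. The following sketch (our own illustration) computes a Bayesian prediction interval for $\Delta_\zeta z_t$ and shows how it shrinks to its frequentist counterpart as $A_{t-1} \to 0$:

```python
from scipy.stats import norm

def predictive_interval(a_prev, A_prev, zeta, sigma2_eta, level=0.95):
    """Predictive density of Delta_zeta z_t given I_t^LI (Remark 3, zeta != 1):
    N((1 - zeta) * a_prev, (1 - zeta)**2 * A_prev + sigma2_eta)."""
    mean = (1 - zeta) * a_prev
    sd = ((1 - zeta) ** 2 * A_prev + sigma2_eta) ** 0.5
    q = norm.ppf(0.5 + level / 2)
    return mean - q * sd, mean + q * sd

print(predictive_interval(1.0, 1.0, 0.5, 1.0))  # parameter uncertainty widens the interval
print(predictive_interval(1.0, 0.0, 0.5, 1.0))  # A -> 0: frequentist interval based on sigma2_eta
```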
Robustness of the Linear Gaussian Model: A further question arises when we deviate from the Gaussian distribution. In general, not every Bayes estimator remains consistent once we change the distribution of $\eta_t$. This implies that $a_t$ need not converge to $\theta$ in an arbitrary setting, even if $\zeta = 0$. Appendix B presents some statistical theory and the convergence results from the literature. Sufficient conditions for convergence in regular and non-regular cases are provided (i.e. also for cases where the support of the distribution is part of the unknown parameter vector). Dirichlet mixtures provide examples that break the convergence results obtained with the linear Gaussian setting presented in this section. In these examples the Bayes estimate is not consistent on sets of prior measure zero (small in a measure-theoretic sense), while the same sets are dense in the parameter space; for more details see Examples 5 and 6 in Appendix B.

3 Shutdown and Non-negative Quantities

Section 2 assumed that the domains of $x_t$ and $p_t$ are the real line. What does this mean in economic terms? E.g. $x_t = x_{it} < 0$ implies that firms consume the product or use $x_t$ as an input, while consumers supply $x_t$. If $x_t = x_{it} \neq 0$ a cost of $\frac{\lambda}{2} x_{it}^2$ arises for all firms, no matter whether firms produce or consume. In addition, if prices and quantities are negative this implies that the firm receives a subsidy for producing a negative quantity. Therefore, the assumptions of Section 2 are not very convincing whenever quantities and prices are outside the non-negative orthant.

One may argue that the model parameters can be chosen such that the probability of negative quantities and prices becomes small. Indeed this might be the case, but with normally distributed noise there is always a non-zero probability of violating the non-negativity constraints.

Now assume $x_{it} \geq 0$; otherwise we stay as close as possible to the model by JV. In particular, (1) is maintained.⁵ This scenario is motivated as follows: If firms produce $x_{it} = x_t > 0$, but $u_t$ realizes such that there is no demand for this quantity (negative demand intercept), then firms cannot store this quantity and have negative revenues $p_t x_{it} < 0$, e.g. due to a scrapping cost for destroying $x_{it}$. Formally, the following assumptions are adopted:

Assumption 1. Quantities $x_{it} \geq 0$.

Assumption 2. Firms are permitted to set $x_{it} = 0$ (shutdown in period $t$). If $x_t = x_{it} = 0$, firms do not receive any signal from the market. No price is realized if there is no market.

By Assumption 1, $x_t = \int_0^1 x_{it} \, di \geq 0$. The domain of $x_t$ is the non-negative part of the real axis. If $\eta_t \sim N(0, \sigma_\eta^2)$ then $p_t \in \mathbb{R}$ by (1). Assumption 2 could also have been included in Section 2; since the expected profits are non-negative there ($x_t = 0$ arises with probability zero), shutdown was not an issue. In this section the analysis is more convenient with

Assumption 3. $|\zeta| < 1$.

⁵ If (1) were replaced by $p_t = \max(0, \theta + u_t - \beta x_t)$, then the conditional expectation of $p_t$ would no longer be linear in $x_t$, which makes the model much more complicated.

Assumption 3 excludes the unit root case and explosive processes. Now we stick to Assumptions 1 to 3. This yields:⁶

1. If $x_t > 0$, then $p_t = z_t - \beta x_t$ and $y_t = z_t$ (this last equality holds for all $x_t$ in Section 2).

2. If $x_t = 0$, then - by Assumption 2 - firms do not receive any information on the demand intercept $z_t$. In terms of econometrics this implies that $p_t$ and $z_t$ are missing values.

⁶ From Section 2 we already know that $y_t$ has been defined as $y_t = p_t + \beta x_t$. Whenever production takes place, $y_t = p_t + \beta x_t = \theta + u_t = z_t$ has to hold. In Section 2, $y_t = z_t$ for all $t$.

Now $(p_t)$ and $(z_t)$ are time series with some missing observations. Whenever $x_t > 0$, $p_t + \beta x_t = z_t$ continues to hold. Taking account of missing values, the information sets are $\mathcal{F}_t^{FI}$ and $\mathcal{F}_t^{LI}$. The conditional expectations are $E(p_{t+k}|\mathcal{F}_t^{(.)}, x_{i,t+k} > 0) = E(z_{t+k}|\mathcal{F}_t^{(.)}) - \beta x_{i,t+k} 1_{x_{i,t+k} > 0}$; the forecasts of $z_{t+k}$ are derived by means of (11), where $\mathcal{I}_t^{(.)}$ has to be replaced by $\mathcal{F}_t^{(.)}$. Regarding parameter estimation, the difference to Section 2 is the presence of missing values. Exact Bayesian parameter estimation with missing values is briefly described in Appendix C. In a slightly simpler way we proceed as follows: If at time $t$ the observations $z_{t-j}, \dots, z_{t-1}$ are missing, then firms adapt (7) to

$$E(z_{t+k}|\mathcal{F}_t^{LI}) = (1 - \zeta) E(\theta|\mathcal{F}_{t-j}^{LI}) \sum_{i=0}^{k+j} \zeta^i + \zeta^{k+j+1} z_{t-j-1} \quad \text{and} \quad E(z_{t+k}|\mathcal{F}_t^{FI}) = \theta + \zeta^{k+j+1} u_{t-j-1}. \quad (14)$$

Equipped with (14) firms can solve the profit maximization problem

$$\max_{x_{i,t+k}} E(p_{t+k}|\mathcal{F}_t^{(.)}) x_{i,t+k} - \frac{\lambda}{2} x_{i,t+k}^2 \quad \text{s.t.} \quad x_{i,t+k} \geq 0.$$

The first order conditions are

$$E(p_{t+k}|\mathcal{F}_t^{(.)}) - \lambda x_{i,t+k} \leq 0 \quad \text{and} \quad \left( E(p_{t+k}|\mathcal{F}_t^{(.)}) - \lambda x_{i,t+k} \right) x_{i,t+k} = 0. \quad (15)$$

Condition (15) implies that firm $i$ currently produces $x_{it} = E(p_t|\mathcal{F}_t^{(.)})/\lambda$ units if $E(p_t|\mathcal{F}_t^{(.)}) > 0$ and zero units otherwise. By (15) firms also obtain the production plans for the periods $t + k$, $k > 0$, given current information. $E(p_{t+k}|\mathcal{F}_t^{(.)}) > 0$ requires $E(z_{t+k}|\mathcal{F}_t^{(.)}) > 0$ by the model structure (see equations (6) and (11)); if the latter term is positive there is an interior solution. This can be summarized as follows:
Proposition 2. Suppose that Assumptions 1-3 hold:

C1 If $E(\theta|\mathcal{F}_t^{(.)}) > 0$ and $E(z_t|\mathcal{F}_t^{(.)}) > 0$, then in period $t$, $E(z_{t+k}|\mathcal{F}_t^{(.)}) > 0$ for all $k \geq 0$. In $t$ firms produce $x_t > 0$. For the periods $t + k$ firms currently plan to supply $x_{t+k} > 0$.

C2 If $E(\theta|\mathcal{F}_t^{(.)}) > 0$ and $E(z_t|\mathcal{F}_t^{(.)}) < 0$, then $E(z_{t+k}|\mathcal{F}_t^{(.)}) > 0$ for some $k \geq 1$. Firms do not produce in period $t$ but enter the market at period $t + k$.

C3 If $E(\theta|\mathcal{F}_t^{(.)}) < 0$ and $E(z_t|\mathcal{F}_t^{(.)}) > 0$, then firms produce $x_t > 0$ in this period. At $t$ firms plan to exit after $k > 0$ periods.

C4 If $E(\theta|\mathcal{F}_t^{(.)}) < 0$ and $E(z_t|\mathcal{F}_t^{(.)}) < 0$, this also implies $E(z_{t+k}|\mathcal{F}_t^{(.)}) < 0$ for all $k \geq 0$. Firms do not produce and exit for all $t + k$, $k \geq 0$.

Proposition 2 has important economic implications: If $\theta > 0$, in the full-information case only production or entry after $k$ periods is possible (cases C1 and C2). However, with limited information, cases C3 and C4 are possible as well. In case C3 the firms may receive a positive signal and remain in the market. With C4, on the other hand, the following result obtains:

Corollary 1. Suppose that $\theta > 0$, $E(\theta|\mathcal{F}_t^{LI}) < 0$ and $E(z_t|\mathcal{F}_t^{LI}) < 0$. Then firms with limited information exit and $x_t$ does not converge to $x_t^{FI}$. For finite $t$ there always exists an $\eta_{t-1}^+$ such that $E(\theta|\mathcal{F}_t^{LI}) < 0$ and $E(z_t|\mathcal{F}_t^{LI}) < 0$ for any $\eta_{t-1} \leq \eta_{t-1}^+$; the probability that $\eta_{t-1} \leq \eta_{t-1}^+$ is strictly positive.

As a short summary, Figure 1 provides a graphical illustration of the formal results obtained in Propositions 1 and 2. A shock occurs in period $t = 4$. In Subfigures (a) and (b) we observe the conditional expectation of the limited-information agent and the output of the informed and the limited-information agents under the assumptions of the baseline model presented in Section 2. We attain convergence to the true parameter $\theta = 1$ and diminishing differences between the quantities $x_t$ and $x_t^{FI}$. In Subfigures (c) and (d) Assumptions 1-3 hold. This results in negative $E(z_t|\mathcal{F}_t^{LI})$ and $E(\theta|\mathcal{F}_t^{LI})$ in period $t = 5$. While the fully informed agents start production after some periods (Case C2), the limited-information agents exit (Case C4) in Subfigure (d). By applying (14), $E(z_t|\mathcal{F}_t^{LI})$ converges to $E(\theta|\mathcal{F}_t^{LI})$ after exit, as observed in Subfigure (c).

Figure 1: Quantities, Conditional Expectations and Convergence. This figure plots time series of the quantities produced and conditional expectations. Parameters are set to $\zeta = 0.5$, $\theta = \theta_0 = 1$, $\sigma_\eta^2 = \sigma_0^2 = 1$, $\lambda = \beta = 1$. A shock occurs in period $t = 4$. Subfigures (a) and (b) present representative output from the baseline model of Section 2: (a) plots $E(\theta|\mathcal{I}_t^{LI})$ for $t = 1, \dots, 1000$; (b) plots $x_t$ (solid line) and $x_t^{FI}$ (dotted line) for the first 100 periods. Subfigures (c) and (d) present a non-convergent path for the setting presented in this section: (c) conditional expectations $E(z_t|\mathcal{F}_t^{LI})$ (dashed-solid line) and $E(\theta|\mathcal{F}_t^{LI})$ (solid line) for the limited information case; (d) output $x_t$ (solid line) and $x_t^{FI}$ (dashed-solid line).

Remark 4. Assumption 1, $x_{it} \geq 0$, could also be replaced by $x_{it} \geq \bar{x}$, where $\bar{x}$ could be a minimum production level or a short-selling constraint. For restrictions like this the results from this section can be adapted.
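Corollary 1 can be checked by a small Monte Carlo experiment. The sketch below (our own illustration, not the code behind Figure 1) implements the exit rule of Case C4 and, as a simplification, updates (10) only on adjacent observations; it estimates the frequency with which limited-information firms exit although $\theta > 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
zeta, theta = 0.5, 1.0                        # theta > 0: FI firms never exit
theta0, sigma2_0, sigma2_eta = 1.0, 1.0, 1.0
T, n_paths, exits = 50, 20_000, 0

for _ in range(n_paths):
    z_true = theta                            # latent intercept, z_0 = theta
    a, state = theta0, theta0                 # posterior mean; forecast state, cf. (14)
    sum_dz, n_obs = 0.0, 0
    z_obs, last_t = None, None                # last observed z and its date
    for t in range(1, T + 1):
        z_true = (1 - zeta) * theta + zeta * z_true + rng.normal(0.0, np.sqrt(sigma2_eta))
        ez = (1 - zeta) * a + zeta * state    # forecast E(z_t | F_t^LI), cf. (11)/(14)
        if a < 0 and ez < 0:                  # Case C4: exit, no further signals
            exits += 1
            break
        if ez > 0:                            # produce; observe z_t = p_t + beta * x_t
            if last_t == t - 1:               # update (10) on adjacent observations only
                sum_dz += z_true - zeta * z_obs
                n_obs += 1
                A = 1.0 / ((1 - zeta) ** 2 * n_obs / sigma2_eta + 1.0 / sigma2_0)
                a = A * ((1 - zeta) * sum_dz / sigma2_eta + theta0 / sigma2_0)
            z_obs, last_t, state = z_true, t, z_true
        else:                                 # Case C2: wait; z_t is a missing value
            state = ez

print("exit frequency:", exits / n_paths)     # strictly positive although theta > 0
```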
4 Bounded Support

An assumption that avoids the above implications is as follows:

Assumption 4. The support of $\eta_t$ is a proper subset of $[\underline{\eta}, \bar{\eta}]$ and the parameter $\theta > 0$. W.l.o.g. $E(\eta_t) = 0$. $\mathrm{Supp}_\eta$ denotes the support of $\eta_t$.

As in Section 3: (i) If $x_t > 0$, then $p_t = z_t - \beta x_t$ and $y_t = z_t$. (ii) If $x_t = 0$, $p_t$ and $z_t$ are missing values by Assumption 2. The first order condition (15) continues to hold. $z_0 = \theta$ and (3) imply:

$$z_t = \theta(1 - \zeta) + \zeta z_{t-1} + \eta_t = \theta + \sum_{s=0}^{t-1} \zeta^s \eta_{t-s}. \quad (16)$$

With $|\zeta| < 1$ and bounded support, the absolute value of the last term in (16) remains smaller than $\frac{1}{1 - |\zeta|} \max\{\bar{\eta}, |\underline{\eta}|\} =: \eta_{\max}$. Therefore, with $|\zeta| < 1$, $z_0 = \theta$ and $\eta_t \in [\underline{\eta}, \bar{\eta}]$, we get the lower and upper bounds

$$\underline{z} = \theta - \eta_{\max} \quad \text{and} \quad \bar{z} = \theta + \eta_{\max}. \quad (17)$$

To derive a positive $z_t$ we have to restrict the support of $\eta_t$ such that $\underline{z} > 0$. This yields $\max\{\bar{\eta}, |\underline{\eta}|\} < \theta(1 - |\zeta|)$. The strict inequality $<$ comes from requiring $z_t > 0$ for an arbitrary distribution of $\eta_t$ (with bounded support), i.e. atoms at the borders are also allowed. As long as the distribution of $\eta_t$ is absolutely continuous, $<$ can be replaced by $\leq$. This yields:

Corollary 2. Suppose that Assumptions 1-4 hold and the support of $\eta_t$ is a proper subset of $[-(1 - |\zeta|)\theta, (1 - |\zeta|)\theta]$. In addition, firms are equipped with a Bayesian updating scheme such that $E(\theta|\mathcal{F}_t^{LI}) \to \theta$ as $t \to \infty$ (a.s. and in $L^2$) and $E(\theta|\mathcal{F}_t^{LI}) \in (\underline{\theta}, \bar{\theta})$ for all $t \geq 0$, where $\underline{\theta} = \theta - \frac{1}{1 - |\zeta|} \max\{\bar{\eta}, |\underline{\eta}|\}$ and $\bar{\theta} = \theta + \frac{1}{1 - |\zeta|} \max\{\bar{\eta}, |\underline{\eta}|\}$. Then $E(\theta|\mathcal{F}_t^{(.)}) > 0$ and $E(z_t|\mathcal{F}_t^{(.)}) > 0$; $x_t > 0$ for all $t$ by the first order condition (15) (i.e. only C1 of Proposition 2 is possible). The convergence results of Proposition 1[part (i)] continue to hold. Bayesian learning can be replaced by the weaker assumption that agents use an estimator $\hat{\theta}_t$ with $\hat{\theta}_t \to \theta$, where $\hat{\theta}_t$ and the forecasts of $z_t$ remain in $[\underline{z}, \bar{z}]$.

Corollary 2 demands positive $z_t$, consistency with respect to the support and consistency of the Bayes estimate. That is to say, a bounded support of the random variable such that $z_t > 0$ is not sufficient for convergence. The following examples shed some light on these issues:

Example 1 (Truncated normal with misspecified upper and lower bounds). Suppose that $\theta > 0$ and $\varepsilon_t$ follows a truncated normal distribution with lower bound $\underline{\varepsilon}$, upper bound $\bar{\varepsilon}$, mean parameter $\nu$ and variance parameter $\sigma_{TN}^2$, resulting in $z_t > 0$ (for the truncated normal we refer the reader to Appendix A). Truncated normal priors are assumed. In addition, assume that $\zeta = 0$, such that $E(z_t|\mathcal{F}_t^{LI}) < 0$ is sufficient for exit. Take (18) of Appendix A and a parameter $\nu$ such that $E_{f_{TN}(\nu, \sigma_{TN}^2, \underline{\varepsilon}, \bar{\varepsilon})}(z_t) = \theta > 0$. These assumptions result in $\underline{z} = \underline{\varepsilon}$ and $\bar{z} = \bar{\varepsilon}$; $\underline{\eta}$ and $\bar{\eta}$ can be obtained by means of $\underline{\eta} = \underline{\varepsilon} - \theta$ and $\bar{\eta} = \bar{\varepsilon} - \theta$. Suppose that firms know that the data are generated from a truncated normal but assume a lower bound $\tilde{\underline{\varepsilon}} < \underline{\varepsilon}$ and an upper bound $\tilde{\bar{\varepsilon}}$; these bounds are fixed and the assumed distribution can be asymmetric. By (18) we can find upper and lower bounds such that $0 < E_{f_{TN}(\nu, \sigma_{TN}^2, \tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}})}(\varepsilon_t) < \theta$. Now suppose that the firms apply Rodriguez-Yam et al. (2004) (as in Example 9) to estimate $\nu$. Then a sufficiently small $z_t > 0$ can result in an estimate $\hat{\nu}_t$ such that $E_{f_{TN}(\hat{\nu}_t, \sigma_{TN}^2, \tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}})}(\varepsilon_t) = E(z_t|\mathcal{I}_t^{LI}) < 0$, where firms exit.
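The mechanics behind Example 1 are easy to see numerically. The following sketch (our own illustration; all numbers are hypothetical) evaluates the truncated normal mean (18) and shows how a misspecified lower bound can pull the implied forecast below zero:

```python
from scipy.stats import norm

def truncnorm_mean(nu, sigma, lo, hi):
    """Expected value (18) of a truncated normal with location nu, scale sigma
    and truncation bounds [lo, hi]."""
    a, b = (lo - nu) / sigma, (hi - nu) / sigma
    return nu + sigma * (norm.pdf(a) - norm.pdf(b)) / (norm.cdf(b) - norm.cdf(a))

# hypothetical true bounds [0.5, 1.5] around theta = 1, sigma_TN = 1
print(truncnorm_mean(1.0, 1.0, 0.5, 1.5))     # approx. 1 = theta
# misspecified lower bound -10: a small estimate nu_hat implies a negative
# forecast E(z_t | I_t^LI), triggering exit as in Example 1
print(truncnorm_mean(-2.5, 1.0, -10.0, 1.5))  # negative forecast
```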
Example 2 (Truncated normal, non-regular case). The misspecification of the econometric model in Example 1 can be repaired by firms estimating the bounds of the truncated normal. Example 9 in Appendix B demonstrates that the estimates are still consistent if $(z_t)$ is observed and the upper and lower bounds are unknown parameters. The bounds converge to the true bounds as $t \to \infty$. This does not imply $E(z_t|\mathcal{F}_t^{LI}) > 0$ almost surely for all $t$. E.g. if the firms put relatively strong priors in the neighborhood of the bounds $\tilde{\underline{\varepsilon}}$ and $\tilde{\bar{\varepsilon}}$ assumed in Example 1, convergence need not take place.

Example 3 (Truncated normal with correctly specified upper and lower bounds). Given the assumptions of Example 1, but the lower and upper bounds assumed by the firms correspond to the true parameters. Then $E(z_t|\mathcal{F}_t^{LI}) > 0$ follows by applying Rodriguez-Yam et al. (2004) and (7).

The above examples demonstrate why bounded support of $(z_t)$ is not sufficient for convergence. In Example 1 the firms apply a prior and a likelihood with inconsistent bounds. That is to say, firms know or guess the correct class of distributions, but use a misspecified econometric model with respect to the range of the random variable. Here the estimates would be inconsistent even in the baseline case. Example 2 repairs this drawback by estimating the bounds. Although the posterior converges to the true parameter $\psi$ if $(z_t)_{t=0}^{\infty}$ is observed, this does not imply that the bounds for exit cannot be hit. Example 3 works with correctly specified bounds of the distributions. This is a strong assumption, since firms then know exactly the worst and the best shock that could happen.

Remark 5. Corollary 2 does not guarantee non-negative prices. The inverse demand function (1) and the first order condition (15) result in $p_t = z_t - \frac{\beta}{\beta + \lambda} E(z_t|\mathcal{F}_t^{LI})$. Even under the assumptions of Corollary 2, where $\theta > 0$ and $\eta_t \in (\underline{\eta}, \bar{\eta})$, the price can become negative if $z_t$ is small but the expectation is large. With forecasts in $[\underline{z}, \bar{z}]$, a simple plug-in of the upper and lower bounds of $z_t$ shows that $p_t \geq 0$ if $\underline{z} \geq \frac{\beta}{\beta + \lambda} \bar{z}$.

Remark 6. Non-negative prices and quantities can also be obtained in different ways. E.g. Adam and Marcet (2010) work with log returns, while Vives (2008)[p. 271] proposes a model with an isoelastic cost and inverse demand structure. By these assumptions the authors restrict attention to Case C1 of Proposition 2. With isoelastic demand, $x_t \to 0$ forces prices to go to $+\infty$ for any signal. This guarantees an interior solution with $x_t > 0$. A further option is to transform $z_t$ by a function $g(.)$ such that $g(z_t)$ fulfills the requirements of Corollary 2; for example the logistic function, the arctangent function, etc. can be used in this case. In addition, Bayesian learning then results in non-negative forecasts.

Robustness: Based on these last examples and Remark 6, we might expect that with bounded support consistency and convergence take place under fairly mild conditions. "Fairly mild" is a matter of taste. Even with innovations on bounded support and $\theta \in (\underline{\theta}, \bar{\theta})$ we can use Dirichlet mixtures as already discussed in the last paragraph of Section 2 and in Appendix B. Similar to the asymmetric normal distribution constructed in Appendix A, a truncated asymmetric normal distribution can be constructed in the same way:

Example 4 (Mixture of Truncated Normals). Take the upper and lower bounds $\underline{\varepsilon}$ and $\bar{\varepsilon}$ from Example 3, such that $z_t > 0$, and assume that these bounds are known by all agents. Let $\varepsilon_t$ follow a truncated normal with parameter $\nu$ for irrational $\nu$, while $\varepsilon_t$ is an asymmetric truncated normal of Fernandez and Steel (1998) type with some asymmetry parameter $\gamma \neq 1$ for rational $\nu$ (see Appendix A). Once again the Bayes estimate of $\nu$ is given by Rodriguez-Yam et al. (2004) (see Example 9 in Appendix B). This estimate converges to the true parameter for all irrational $\nu$, while for rational $\nu$ we get inconsistent estimates.⁷ In Example 4 the requirements of Case C1 in Proposition 2 are satisfied all the time, i.e. we have excluded shutdown.
However, even under these circumstances the Bayes estimate need not be consistent, and convergence of the limited to the full information case need not take place for all elements of the parameter space. This implies that we can also construct examples with inconsistent estimates although shutdown is not observed. Similar examples can be constructed with other distributions where the distribution of the noise is from a different class for rational and irrational elements. Therefore, even if we replace the linear demand structure by a log-linear one to force prices and quantities to be positive (as suggested by Remark 6), inconsistency of the Bayes estimate can still arise. Despite the fact that we can make the model more realistic through different assumptions that make prices and quantities non-negative, learning the true parameter still remains an issue in all these cases.

5 Conclusions

By simple considerations we observed that eliminating negative quantities while maintaining the linear normal demand structure breaks the convergence results derived in Jun and Vives (1996) and Vives (2008). With a simple shutdown condition, the quantities provided in a full information economy need not agree - even in the limit - with the quantities produced in the limited information setting. To rescue the optimistic convergence results, bounded support of the noise term and regularity conditions on learning are required.

⁷ Note that consistency of an estimate means that for all elements of the parameter space the posterior converges to a point mass at the true parameter (see Appendix B).

A The Truncated Normal and the Asymmetric Normal Distribution

Consider a truncated normal random variable $X$ with density $f_{TN}(x; \nu, \sigma_{TN}^2, \underline{\varepsilon}, \bar{\varepsilon})$ (see e.g. Paolella (2007)). The expected value of a truncated normal random variable $X$ with location and scale parameters $\nu$, $\sigma_{TN}^2$ and bounds $\underline{\varepsilon} < \bar{\varepsilon}$ is

$$E(X) = \nu + \frac{f_{SN}((\underline{\varepsilon} - \nu)/\sigma_{TN}) - f_{SN}((\bar{\varepsilon} - \nu)/\sigma_{TN})}{F_{SN}((\bar{\varepsilon} - \nu)/\sigma_{TN}) - F_{SN}((\underline{\varepsilon} - \nu)/\sigma_{TN})} \, \sigma_{TN}. \quad (18)$$

$f_{SN}$ and $F_{SN}$ are the standard normal density and distribution function. Based on Fernandez and Steel (1998), an asymmetric distribution with density $g_\gamma(x)$ can be constructed from a symmetric distribution with density $f(x)$ as follows (see e.g. also Paolella (2007)):

$$g_\gamma(x) = \frac{2}{\gamma + 1/\gamma} \left( f(x/\gamma) 1_{x \geq 0} + f(\gamma x) 1_{x < 0} \right). \quad (19)$$

The parameter $\gamma > 0$ controls the degree of asymmetry. For $\gamma = 1$ the distribution is symmetric, such that $g_1(x) = f(x)$. The moments of $X$ are given by:

$$E(X^r | \gamma) = \frac{\gamma^{r+1} + (-1)^r / \gamma^{r+1}}{\gamma + 1/\gamma} \, 2 E(X^r | X > 0, \gamma = 1). \quad (20)$$

By using the standard normal density $f_{SN}$ we get an asymmetric normal distribution with $E(X) \neq 0$ for $\gamma \neq 1$ (and $E(X) = 0$ for $\gamma = 1$). For the normal distribution, $E(X^r | X > 0, \gamma = 1)$ can be derived as follows, where we write $E(X^r | X > 0, \gamma = 1) = \int_0^\infty x^r f_{SN}(x) \, dx$. Given the standard normal density $f_{SN}(x)$ we observe that $f_{SN}'(x) = (-x) f_{SN}(x)$. By partial integration we observe that

$$\int_0^\infty x^r f_{SN}(x) \, dx = \left[ \frac{x^{r+1}}{r+1} f_{SN}(x) \right]_0^\infty - \frac{1}{r+1} \int_0^\infty x^{r+1} f_{SN}'(x) \, dx = \frac{1}{r+1} \int_0^\infty x^{r+2} f_{SN}(x) \, dx,$$

such that $E(X^r | X > 0, \gamma = 1) = \frac{1}{r+1} E(X^{r+2} | X > 0, \gamma = 1)$. Since $f_{SN}'(x) = (-x) f_{SN}(x)$ we get $\int_0^\infty x f_{SN}(x) \, dx = -[f_{SN}(x)]_0^\infty = f_{SN}(0)$, yielding $E(X^1 | X > 0, \gamma = 1) = \frac{1}{\sqrt{2\pi}}$, while $E(X^2 | X > 0, \gamma = 1) = \frac{1}{2}$ for a standard normal random variable. By the above recursive relationship of the moments we get $E(X^r | X > 0, \gamma = 1)$ for the standard normal. Some algebra yields

$$E(X^r | X > 0, \gamma = 1) = \frac{r!}{2^{r/2 + 1} (r/2)!} \ \text{for even } r \quad \text{and} \quad E(X^r | X > 0, \gamma = 1) = 2^{(r-1)/2} \left( \frac{r-1}{2} \right)! \, \frac{1}{\sqrt{2\pi}} \ \text{for odd } r.$$
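The construction (19) also suggests a direct way to sample from $g_\gamma$. A minimal sketch (our own illustration; the two-branch rule follows from integrating the two halves of (19), which gives $P(X \geq 0) = \gamma^2/(1 + \gamma^2)$):

```python
import numpy as np

def sample_fs_asymmetric_normal(gamma, size, rng=None):
    """Draw from the Fernandez-Steel density (19) with f = standard normal:
    with probability gamma^2 / (1 + gamma^2) set X = gamma * |Z|,
    otherwise X = -|Z| / gamma, where Z is standard normal."""
    rng = rng or np.random.default_rng()
    z = np.abs(rng.standard_normal(size))
    pos = rng.random(size) < gamma**2 / (1.0 + gamma**2)
    return np.where(pos, gamma * z, -z / gamma)

rng = np.random.default_rng(0)
gamma = 2.0
x = sample_fs_asymmetric_normal(gamma, 200_000, rng)
# first moment via (20): nonzero for gamma != 1
m1 = (gamma**2 - 1 / gamma**2) / (gamma + 1 / gamma) * 2 / np.sqrt(2 * np.pi)
print(x.mean(), m1)   # both approx. 1.197
```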
Given a standardized skewed normal random variable $X$, we get as usual a non-standardized random variable $Y$ by means of $Y = \theta + \sigma X$, where $\theta$ and $\sigma^2$ are the mean and the variance of $Y$ for $\gamma = 1$. By means of (19) we can also construct an asymmetric truncated normal distribution by using $f_{TN}(.)$ instead of $f(.)$.

B Bayesian Learning and Consistency

This section reviews Bayesian learning and presents examples with non-Gaussian innovations. Regarding convergence, we refer the reader to Ghosal et al. (2000) and the web note of Shalizi (2010), which describe the problems and provide literature on this topic.

Let us start with a prior $\pi$ on $\Psi$. $\Psi$ is the parameter space, its elements are $\psi$. Given the distribution of $x^t = (x_1, \dots, x_t)$, $f(x^t|\psi)$, the posterior can be derived by means of the Bayes theorem:

$$\pi(\psi|x^t) = \frac{f(x^t|\psi) \, \pi(\psi)}{\int_\Psi f(x^t|\psi) \, \pi(d\psi)}. \quad (21)$$

Only for a small number of applications can $\pi(\psi|x^t)$ be derived analytically. Bayesian simulation methods to simulate from the posterior $\pi(\psi|x^t)$ are described in Appendix C.

A more detailed picture is obtained by following Diaconis and Freedman (1986) and Blume and Easley (1993): The parameter space $\Psi$ is a Borel subset of a complete separable metric space, its elements are $\psi$. Suppose that $\pi$ is non-degenerate and $\pi > 0$ for any Borel subset containing the true parameter generating the data. $Q_\psi$ is a probability measure on Borel subsets of the set of data $X$, $x \in X$. $Q_\psi^\infty$ is the product measure on $X^\infty$. The map $\psi \to Q_\psi$ is one to one and Borel; the point mass at $\psi$ is denoted by $\delta_\psi$. The joint distribution is $P_\pi(A \times B) = \int_A Q_\psi^\infty(B) \, \pi(d\psi)$, where $A$ and $B$ are Borel sets in $\Psi$ and $X^\infty$, respectively. For a Borel subset $H$ of $\Psi \times X^\infty$ we get $P_\pi(H) = \int_\Psi (\delta_\psi \times Q_\psi^\infty)(H) \, \pi(d\psi)$. Then the posterior $\pi(\psi|B)$ - also abbreviated by $\pi(\psi|x^\infty)$ - is a version of the $P_\pi$ law of $\psi$. (Consistency does not only depend on the prior but also on the version of the posterior we choose; for more details see Diaconis and Freedman (1986)[Example A.1].)
This does not guarantee convergence to the true parameter for all ψ ∈ Ψ as required. Even worse, given a set of priors M, Diaconis and Freedman (1986) have shown that the set of all pairs (ψ, π) ∈ Ψ×M for which the posterior based on π is consistent is a meager set (of Baire category 1; here the topology of weak convergence has been used). The original result has been shown on a countable parameter space, while Feldman (1991) extended this result to a complete and separable, non-finite parameter space. Remark 7. By the latter result we observe that there are many ”bad priors”, the set of ”good priors” is topologically small. On the other hand if the statistician - in our case the firm - is ”certain about the prior π(ψ)”, Doob’s theorem tells us that the set where inconsistency might arise is small in a measure theoretic sense. The key aspect is that the firm has to choose a good prior as e.g. in the benchmark model. However, choosing a good prior is in general a strong assumption on the capabilities of the economic agents. 20 Statistical literature provides examples where Bayesian learning is inconsistent; here the reader is referred e.g. to Diaconis and Freedman (1986) and the literature cited there. Especially, when nonparametric Bayesian techniques are used, inconsistency becomes a serious problem. Consistency for nonparametric Bayesian models has been investigated more recently in Barron et al. (1999), Walker (2004), Ghosal and van der Vaart (2007) and Rousseau et al. (2010). For sufficient conditions for consistency and (counter-)examples in the non-parametric case we refer the reader to Jang et al. (2010). Based on Christensen (2009) we get the following example: Example 5 (Inconsistency, Application of Christensen (2009)). Given the assumptions of Section 2, but ηt ∼ N(θ, σ2 η) for irrational θ, while ηt is Cauchy with location parameter θ and variance parameter σ2 e for rational θ. By sticking to the normal prior θ ∼ N(θ0, σ2 0), the posterior π(θ|zt) = π(θ|ILI t+1) is still given by (10), where θ ∼ N(at, At). For irrational θ this estimate is consistent, while for rational θ we get inconsistent estimates. In this example we get inconsistency on a set of prior probability 0, i.e. this set is small in a measure theoretic sense. However, we get inconsistent estimates on a dense set. A set is small in a topological sense (meager set or set of the first Baire category), if it can be expressed as the countable union of nowhere dense sets. Using this topological concept, the set of consistent estimates is meager in this example. A more heuristic interpretation of this results is as follows: In any case an agent can only report rationals/ key in rational numbers into a computer. However, with any of these rational numbers the estimate is not consistent all the time. In addition if the sampling distribution of the unknown parameter θ puts positive probability mass on the rational numbers, the problem becomes even worse. Remark 8. From (10) we observe that with Gaussian noise the estimator of θ is a convex combination of the sample mean and a term arising from the prior which becomes neglectable if t becomes larger. For independent Gaussian noise the strong law of large number holds such that the sample mean converges to θ. For a Cauchy distribution this is not the case. In Example 5 this is the main driving force for the inconsistency of the estimator on rational θ. 
If distributions with non-finite expectation do not seem very realistic to the reader, we can think of an economy where estimates of $\theta$ and $\sigma_\eta^2$ enter into equilibrium quantities. Suppose that $\theta$ is fixed while $\sigma_\eta^2$ is unknown. The expected value should be $\theta$ for all $\sigma_\eta^2$. Assume a normal distribution for irrational $\sigma_\eta^2 > 0$ and a distribution where the second moment does not exist for rational $\sigma_\eta^2 > 0$ (e.g. a student t-distribution with $1 < \nu \leq 2$ degrees of freedom, symmetric with respect to $\theta$, where $E(X) = \theta$ and $E(X^2)$ does not exist). In this case the posterior based on the Gaussian distribution is an inverse Gamma distribution (see Robert (1994)[Chapter 4] or Frühwirth-Schnatter (2006)), where the sample variance determines the scale parameter of this distribution. By the same argument as above we get inconsistent estimates if the true $\sigma_\eta^2$ is a rational number.

Example 6 (Inconsistency, Variation of Example 5). Similar to Example 5 we can also construct Dirichlet mixtures with finite expectation. Let $\eta_t \sim N(\theta, \sigma_\eta^2)$ for irrational $\theta$, while $\eta_t$ is asymmetric normal of Fernandez and Steel (1998) type for rational $\theta$ (see Appendix A). $\pi(\theta|z^t) = \pi(\theta|\mathcal{I}_{t+1}^{LI})$ is once again given by (10). For irrational $\theta$ this estimate is consistent, such that $a_t \to \theta$, while for rational $\theta$, $a_t \to \theta + E(\eta_t)$ by the law of large numbers. This implies $\lim a_t \neq \theta$ in the asymmetric case. A similar example can be constructed with mixtures of asymmetric truncated normal distributions, etc.

One of the usual procedures to check for convergence to the true parameter in a particular model is to show that the probability of non-convergence to the true parameter value $\psi$ goes to zero (see also Blume and Easley (1993)[Corollary 2.1]), or to work with so-called consistency hypothesis tests as discussed and presented in Ghosal et al. (1994), Ghosal et al. (1995a), Ghosal (2000), Ghosal et al. (2000), Ghosal and Tang (2006), Ghosal and van der Vaart (2007) and Walker (2004). Section 2 investigated a regular problem, where the support of the stochastic noise term does not change with the parameter $\psi$. For regular problems the result derived by Schwartz (1965) can often be applied to check for consistency. Further regular problems will be presented in Examples 8 and 9. Non-regular problems have also been investigated in Ghosal et al. (1994), Ghosal et al. (1995a), Chernozhukov and Hong (2004) and Hirano and Porter (2003). The following proposition provides sufficient conditions for almost sure convergence to the true parameter:

Proposition 3. [Proposition 1 in Ghosal et al. (1995b); based on Ibragimov and Has'minskii (1981)] Consider a prior $\pi$, $\Psi \subseteq \mathbb{R}^d$ and $U_t = \varphi_t^{-1}(\Psi - \psi)$ for some normalizing factor $\varphi_t$. The likelihood ratio process is defined as

$$L_t(u) = \frac{f(x^t|\psi + \varphi_t u)}{f(x^t|\psi)}, \quad u \in U_t.$$

Assume that the conditions

IH1 For some $M > 0$, $m \geq 0$ and $\alpha > 0$,
$$E_\psi \left| L_t^{1/2}(u_1) - L_t^{1/2}(u_2) \right|^2 \leq M (1 + R^m) \|u_1 - u_2\|^\alpha$$
for all $u_1, u_2 \in U_t$ satisfying $\|u_1\| \leq R$ and $\|u_2\| \leq R$.

IH2 For all $u \in U_t$,
$$E_\psi L_t^{1/2}(u) \leq \exp(-g_t(\|u\|)),$$
where $\{g_t\}$ is a sequence of real valued functions on $[0, \infty)$ satisfying the following: (a) for fixed $t \geq 1$, $g_t(y) \to \infty$ as $y \to \infty$; (b) for any $N > 0$, $\lim_{y, t \to \infty} y^N \exp(-g_t(y)) = 0$.

GGS For some $s > 0$, $\sum_{t=1}^{\infty} \|\varphi_t\|^s < \infty$.

are satisfied. Then almost sure convergence of the posterior to the true parameter is attained. Usually $\varphi_t = 1/\sqrt{t}$ is applied. When only conditions IH1 and IH2 are satisfied, we get convergence in probability.
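To see how condition IH2 operates, consider the baseline Gaussian model with $\zeta = 0$, so that $\Delta_\zeta z_s = \theta + \eta_s$ with $\eta_s \sim$ iid $N(0, \sigma_\eta^2)$ (a sketch of our own; it uses the standard Hellinger affinity of two normal densities with common variance). With $\psi = \theta$, $\varphi_t = 1/\sqrt{t}$ and $\theta' = \theta + u/\sqrt{t}$,

$$E_\theta \left[ \frac{f(z|\theta')}{f(z|\theta)} \right]^{1/2} = \int \sqrt{f(z|\theta') f(z|\theta)} \, dz = \exp\left( -\frac{(\theta - \theta')^2}{8 \sigma_\eta^2} \right),$$

and hence, for $t$ iid observations,

$$E_\theta L_t^{1/2}(u) = \left[ \exp\left( -\frac{(u/\sqrt{t})^2}{8 \sigma_\eta^2} \right) \right]^t = \exp\left( -\frac{u^2}{8 \sigma_\eta^2} \right) =: \exp(-g_t(|u|)), \qquad g_t(y) = \frac{y^2}{8 \sigma_\eta^2},$$

which satisfies both (a) and (b) of IH2.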
It is worth noting that Proposition 3 requires neither iid data nor a regular problem; only the existence of the densities (likelihood) $f(x^t|\psi)$ is assumed. Proposition 3 also tells us that the prior has to be strictly positive around the true parameter value and that it is allowed to grow at most polynomially. $L^p$ convergence generally does not follow from almost sure convergence. If almost sure convergence is observed and the $p$-th moment of the sequence of random variables is uniformly integrable, then $L^p$ convergence is attained, e.g. by means of Lebesgue's dominated convergence theorem (see e.g. Billingsley (1986)[p. 213] or Klenke (2008)[p. 140]).⁸

⁸ A further issue is the limit distribution of the posterior around the true parameter value. For finite dimensional parameter spaces the Bernstein-von Mises theorem (see e.g. LeCam and Yang (1990), Lehmann (1991), Freedman (1999), Bickel and Doksum (2001)[p. 339]) states that the Bayesian posterior around its mean is close to the asymptotic distribution of the maximum likelihood estimate around the true parameter value. In this case consistency of the maximum likelihood estimate implies consistent Bayes estimates. Also here Ghosal et al. (1994), Ghosal et al. (1995a) and Ghosal and van der Vaart (2007) provide useful tools to derive the limit distribution of the Bayes estimate.

In Section 2 the parameter $\psi$ is $\theta$ and $\Psi$ is the real line. (10) provides the posterior $\pi(\theta|\mathcal{I}_t^{LI})$ in closed form. Consistency, $\pi(\theta|\mathcal{I}_t^{LI}) \to \delta_\theta$, can be checked directly by taking limits. Alternatively, consistency follows from Blume and Easley (1993)[Theorem 2.2]; learning the mean of a normal random variable is also presented as an example there. $L^p$ convergence follows from the properties of the normal distribution. The conditional expectation of $z_t$ was given by a convex combination of the parameter $\theta$ and the former realization $z_{t-1}$. Convergence results for the parameter $\theta$ automatically carry over to convergence of the quantities. Generally, this need not be as simple, as demonstrated in the following example:

Example 7. Assume that $\eta_t$ is Cauchy. Suppose that the posterior of the location parameter $\theta$ converges to the true parameter $\theta$. In this setting the parameter $\psi = \theta$ has been learned. In contrast to learning, the expectation of $\eta_t$ does not exist. As a consequence the conditional expectation of $z_t$ does not exist (neither in the full information nor in the limited information setting) and no convergence results on $(x_t)$ can be obtained.

Example 7 demonstrates that consistent estimates are in general not sufficient for convergence (even with $|\zeta| < 1$). Hence, we additionally require uniform integrability of the random variables $z_t$ or $\Delta_\zeta z_t$. Based on this discussion, Proposition 1 can be extended as follows:

Corollary 3. [Extension of Proposition 1] Suppose $\varepsilon_t \sim g(\varepsilon_t|\psi)$, $E(|\varepsilon_t|^p) < \infty$ for all $t$ and some $p \geq 1$. In addition, normalize the noise term such that $E(\Delta_\zeta z_t) = E(\varepsilon_t) = \theta(1 - \zeta)$. The unknown parameters are $\psi$, $\Psi \subseteq \mathbb{R}^d$. If $|\zeta| \neq 1$ and the consistency conditions of Proposition 3 are fulfilled, then the posterior converges to the true parameter $\psi$ almost surely. If, in addition, $|\zeta| \neq 1$ and $|E(\psi|\mathcal{I}_t^{LI})|^p$ is uniformly integrable for $p \geq 1$, then $E(\psi|\mathcal{I}_t^{LI})$ converges to $\psi$ in $L^p$. For $|\zeta| < 1$ and $E(\psi|\mathcal{I}_t^{LI}) \to \psi$ almost surely and in $L^p$, $x_t \to x_t^{FI}$ almost surely and in $L^p$. For $\zeta \in \mathbb{R}$, $\zeta \neq 1$ and $E(\psi|\mathcal{I}_t^{LI}) \to \psi$ almost surely and in $L^p$, $\Delta_\zeta x_t$ converges to $\Delta_\zeta x_t^{FI}$ a.s. and in $L^p$. For $\zeta = 1$, $x_t = x_t^{FI}$ for $t \geq 2$.

The following examples illustrate the implications of Corollary 3.
In the first example we replace the normally distributed noise term by a student-t distributed noise term with $\kappa$ degrees of freedom. The second example investigates the truncated normal case.

Example 8. Assume $\zeta \neq 1$ and $\Delta_\zeta z_t = (1 - \zeta)\theta + \eta_t$, where $\eta_t$ is student t centered around zero with $1 < \kappa \leq 2$ degrees of freedom. The variance parameter $\sigma_\eta^2$ is fixed. For the parameter $\theta$ the agents apply a student t prior with $\nu$ degrees of freedom, mean parameter $m_0$ and variance parameter $\sigma_\eta^2/M_0$. W.l.o.g. set $\sigma_\eta^2 = 1$. Then the posterior of the unknown parameter $\theta$ is student t with variance parameter $M_t = 1/(t(1 - \zeta)^2 + M_0)$ and mean parameter $m_t = \frac{(1 - \zeta) \sum_{s=1}^{t} \Delta_\zeta z_s + M_0 m_0}{t(1 - \zeta)^2 + M_0}$ (see Harrison and West (1997)[p. 422]). In the same way as in the baseline model we get $E(\theta|\mathcal{I}_t^{LI}) = m_{t-1}$. $m_t$ converges to $\theta$ (this can also be observed directly by applying Etemadi's version of the strong law of large numbers, Klenke (2008)[p. 112]). Here only $L^{\kappa'}$ convergence, with $\kappa' < \kappa$, follows from the dominated convergence theorem.

Example 9. Given the assumptions of Section 2, but $\Delta_\zeta z_t = \varepsilon_t$, where $\varepsilon_t$ follows a truncated normal distribution with lower bound $\underline{\varepsilon}$ and upper bound $\bar{\varepsilon}$. The mean parameter is $\nu$, the variance parameter $\sigma_{TN}^2$ is fixed, $\nu \in [\underline{\varepsilon}, \bar{\varepsilon}]$. $\nu$ is such that $E(\varepsilon_t) = \theta(1 - \zeta)$. Suppose a normal or a truncated normal prior. With the variance parameter and the bounds fixed, we verified that the conditions of Proposition 3 hold for the unknown parameter $\nu$. Therefore we attain almost sure convergence to the true parameter value $\nu$. Although $L^2$ convergence generally depends on the prior $\pi$ used, if agents assume a (truncated) normal prior then $L^2$ convergence is obtained. Suppose in addition that agents assume a truncated normal prior with parameters $\nu_{TN0}$ and $\sigma_{TN0}^2$ and the same lower and upper bounds. Rodriguez-Yam et al. (2004) have shown that the posterior of $\nu$ based on the data $\Delta_\zeta z^t$ is a truncated normal distribution with mean parameter $a_{TNt} = \frac{(1/\sigma_{TN}^2) \sum_{s=1}^{t} \Delta_\zeta z_s + \nu_{TN0}/\sigma_{TN0}^2}{(1/\sigma_{TN}^2) t + 1/\sigma_{TN0}^2}$ and variance parameter $A_{TNt} = \left( (1/\sigma_{TN}^2) t + 1/\sigma_{TN0}^2 \right)^{-1}$.⁹

⁹ In the baseline model and in this example it is not difficult to verify that the estimates of $\psi = \{\nu, \sigma_{TN}^2\}$ are also consistent, i.e. the variance parameter can also be estimated consistently. Simulation methods as described in Appendix C have to be applied to derive samples from the posterior.

In Examples 8 and 9 we investigated regular problems, i.e. problems where the support of the stochastic noise term does not change with the parameter $\psi$. In Example 9 knowledge of the upper and the lower bounds is a strong assumption. By applying Proposition 3 we derive the following result:

Example 10 (Application of Ghosal et al. (1995a)). Given the assumptions of Example 9, but the lower and the upper bounds of the truncated normal distribution are unknown, i.e. the true parameter generating the data is $\psi = \{\nu, \underline{\varepsilon}, \bar{\varepsilon}\}$. The parameter space is $\Psi \subset \mathbb{R}^3$, with $(\tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}}) \in \Upsilon$ and $\tilde{\underline{\varepsilon}} < \tilde{\bar{\varepsilon}}$. Apply a normal prior to $\nu$. Either (i) $\Upsilon$ is a compact subset of $\mathbb{R}^2$, assumed to be "sufficiently large", and a uniform prior is applied to the pairs $(\tilde{\underline{\varepsilon}}, \tilde{\bar{\varepsilon}}) \in \Upsilon$; or (ii) normal priors are applied to $\tilde{\underline{\varepsilon}}$ and $\tilde{\bar{\varepsilon}}$. With these priors we verified that the conditions of Proposition 3 hold. This implies that the posterior converges almost surely to the true parameter.
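For completeness, the update stated in Example 9 in code form (a sketch of our own implementation; it takes the truncated normal posterior formula above at face value, and all numbers are hypothetical):

```python
import numpy as np
from scipy.stats import truncnorm

def truncnorm_posterior(dz, sigma2_tn, nu0, sigma2_0, lo, hi):
    """Posterior of nu in Example 9: a truncated normal on [lo, hi] with
    mean parameter a_TNt and variance parameter A_TNt as stated above."""
    t = len(dz)
    A = 1.0 / (t / sigma2_tn + 1.0 / sigma2_0)
    a = A * (np.sum(dz) / sigma2_tn + nu0 / sigma2_0)
    sd = np.sqrt(A)
    return truncnorm((lo - a) / sd, (hi - a) / sd, loc=a, scale=sd)

# hypothetical data: nu = 1, bounds [0.5, 1.5], sigma2_TN = 1, zeta = 0
rng = np.random.default_rng(0)
eps = truncnorm(-0.5, 0.5, loc=1.0, scale=1.0).rvs(500, random_state=rng)
post = truncnorm_posterior(eps, 1.0, 1.0, 1.0, 0.5, 1.5)
print(post.mean())   # posterior mean close to nu = 1
```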
C Bayesian Parameter Estimation

In most cases $\pi(\psi|x^t)$ cannot be derived analytically. Fortunately, the statistical literature has developed a bulk of numerical tools, Markov Chain Monte Carlo (MCMC) methods, which allow one to simulate from the posterior distribution. We refer the reader to textbooks on Bayesian statistics, such as Robert (1994), Robert and Casella (1999), Casella and Berger (2001) or Frühwirth-Schnatter (2006). For the underlying paper it is sufficient to know that these econometric tools allow one to sample from the posterior $\pi(\psi|x^t)$ whenever the likelihood $f(x^t|\psi)$ is available. The samples are denoted by $\psi^{[m]}$, $m = B + 1, \dots, M$. $B$ is the number of burn-in steps required to attain convergence of the corresponding Markov chain. The conditional expectation $E(\psi|.)$ is then derived by taking the sample mean of the $\psi^{[m]}$.

Since we have to deal with a missing data problem in Section 3, we briefly remark on data augmentation, proposed in Tanner and Wong (1987). The trick of this method is quite simple: Suppose that $f(x|\psi)$ is not available in closed form but $f(x|\psi, y)$ is, where $y$ is a latent variable, a missing value or the realization of a latent process. By means of the Bayes theorem the posterior $\pi(\psi, y|x)$ is proportional to $f(x|\psi, y) \pi(\psi, y)$. By the above arguments, we are able to sample from the joint posterior $\pi(\psi, y|x)$, such that posterior samples of $\psi$ and $y$ are available as well.

This trick can be applied to a regression setting with missing values. Using the notation of Section 2, consider the model $\Delta_\zeta z_t := z_t - \zeta z_{t-1} = (1 - \zeta)\theta + \eta_t$, $\zeta \neq 1$ and $\eta_t \sim N(0, \sigma_\eta^2)$, where the data $\Delta_\zeta z^t$ include missing values. The regular observations are abbreviated by $\Delta_\zeta z^*$, the missing values are $\Delta_\zeta z^t \setminus \Delta_\zeta z^*$. Now augment the parameter space by $\Delta_\zeta z^{**}$, where each element of $\Delta_\zeta z^{**}$ replaces a missing observation of $\Delta_\zeta z^t$. Then the joint posterior of $\theta$ and $\Delta_\zeta z^{**}$ can be derived by running a Bayesian sampler consisting of the following steps. For each sweep $m$:

Step 1: sample $\theta^{[m]}$ from $p(\theta|\Delta_\zeta z^*, \Delta_\zeta z^{**,[m-1]})$.

Step 2: sample $\Delta_\zeta z^{**,[m]}$ from $p(\Delta_\zeta z^{**}|\theta^{[m]}, \Delta_\zeta z^*)$.

Sampling Step 1 exactly corresponds to drawing from the normal distribution described by (10). For the augmented missing values $\Delta_\zeta z^{**}$ one can work with the Metropolis-Hastings algorithm. Given $\theta$, $\Delta_\zeta z_t^{**}$ is normally distributed with mean $(1 - \zeta)\theta$ and variance $\sigma_\eta^2$; by means of $\theta^{[m]}$ and $\Delta_\zeta z^t$ one directly obtains the residuals $\eta_t$. The likelihood $f((\Delta_\zeta z^*, \Delta_\zeta z^{**})|\theta)$ is given by the product of normals with mean $(1 - \zeta)\theta$ and variance $\sigma_\eta^2$. Thus, one proposes $\Delta_\zeta z_t^{**,p}$ from a normal density $q(\Delta_\zeta z_t^{**,.})$ with mean $(1 - \zeta)\theta^{[m]}$ and variance $\sigma_\eta^2$ and accepts the proposal $\Delta_\zeta z^{**,p}$ with probability

$$\min\left\{ 1, \, \frac{f((\Delta_\zeta z^*, \Delta_\zeta z^{**,p})|\theta^{[m]})}{f((\Delta_\zeta z^*, \Delta_\zeta z^{**,[m-1]})|\theta^{[m]})} \, \frac{q(\Delta_\zeta z^{**,[m-1]})}{q(\Delta_\zeta z^{**,p})} \right\},$$

otherwise $\Delta_\zeta z^{**,[m]} = \Delta_\zeta z^{**,[m-1]}$. Here $\Delta_\zeta z^{**,[m-1]}$ are the current augmented missing values in the sampler. Alternative proposal densities can be applied as well, as can forward-filtering backward sampling (see Frühwirth-Schnatter (1994)). If necessary, the missing $z_t$ can now be derived from $z_0$ and $\Delta_\zeta z^t$.

Prediction and Missing Values: If at time $t$ the observations $z_{t-j}, \dots, z_{t-1}$ are missing, then firms can make use of (14). Strictly sticking to Bayesian sampling, $\Delta_\zeta z_{t-j-1}^{**}, \dots, \Delta_\zeta z_{t-1}^{**}$ will be simulated by means of Step 2; the missing $z_t^{**}$ follow from $\Delta_\zeta z^t$. By the structure of the model, $z_{t+1}|z_t, \theta^{[m]}$ is normally distributed with mean $(1 - \zeta)\theta^{[m]} + \zeta z_t$ and variance $\sigma_\eta^2$. $\theta^{[m]}$ is, by (10), normally distributed with mean $a_{t-1}$ and variance $A_{t-1}$.
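A compact sketch of the two-step sampler above for the Gaussian model (our own implementation; note that the proposal $q$ described above coincides with the conditional density of the missing values given $\theta$, so the acceptance probability equals one and Step 2 reduces to a Gibbs draw):

```python
import numpy as np

def augmented_gibbs(dz, zeta, sigma2_eta, theta0, sigma2_0, n_sweeps=5000, rng=None):
    """Data-augmentation sampler for theta in Delta_zeta z_t = (1-zeta)*theta + eta_t.
    `dz` is an array with np.nan marking missing fractional differences.
    Step 1 draws theta from the conjugate posterior (10) given the completed data;
    Step 2 redraws the missing values from N((1-zeta)*theta, sigma2_eta)."""
    rng = rng or np.random.default_rng()
    dz = np.asarray(dz, dtype=float).copy()
    miss = np.isnan(dz)
    dz[miss] = (1 - zeta) * theta0            # initialize augmented values
    t = len(dz)
    draws = np.empty(n_sweeps)
    for m in range(n_sweeps):
        # Step 1: theta | completed data ~ N(a_t, A_t), cf. (10)
        A = 1.0 / ((1 - zeta) ** 2 * t / sigma2_eta + 1.0 / sigma2_0)
        a = A * ((1 - zeta) * dz.sum() / sigma2_eta + theta0 / sigma2_0)
        theta = a + np.sqrt(A) * rng.standard_normal()
        # Step 2: missing Delta_zeta z | theta ~ N((1-zeta)*theta, sigma2_eta)
        dz[miss] = (1 - zeta) * theta + np.sqrt(sigma2_eta) * rng.standard_normal(miss.sum())
        draws[m] = theta
    return draws

# hypothetical example: zeta = 0.5, theta = 1, two missing observations
rng = np.random.default_rng(0)
dz = 0.5 * 1.0 + rng.normal(0.0, 1.0, 50)
dz[[10, 11]] = np.nan
print(augmented_gibbs(dz, 0.5, 1.0, 0.0, 10.0, rng=rng)[1000:].mean())
```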
References

Adam, K. and Marcet, A. (2010). Booms and busts in asset prices. IMES Discussion Paper Series 10-E-02, Institute for Monetary and Economic Studies, Bank of Japan.

Alfarano, S. and Milakovic, M. (2009). Network structure and n-dependence in agent-based herding models. Journal of Economic Dynamics and Control, 33(1):78–92.

Barron, A., Schervish, M. J., and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics, 27(2):536–561.

Bickel, P. and Doksum, K. (2001). Mathematical Statistics. Prentice Hall, New Jersey, 2nd edition.

Billingsley, P. (1986). Probability and Measure. Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 2nd edition.

Blume, L. and Easley, D. (1993). Rational expectations and rational learning. Game Theory and Information, EconWPA.

Blume, L. E. and Easley, D. (1998). Rational expectations and rational learning. In Mukul Majumdar, editor, Organizations with Incomplete Information. Cambridge University Press.

Brock, W. A. and Hommes, C. H. (1997). A rational route to randomness. Econometrica, 65(5):1059–1096.

Casella, G. and Berger, R. (2001). Statistical Inference. Duxbury Resource Center, Boston, 2nd edition.

Chernozhukov, V. and Hong, H. (2004). Likelihood estimation and inference in a class of nonregular econometric models. Econometrica, 72(5):1445–1480.

Chib, S. (1993). Bayes regression with autoregressive errors. Journal of Econometrics, 58:77–99.

Christensen, R. (2009). Inconsistent Bayesian estimation. Bayesian Analysis, 4:413–416.

Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates. Annals of Statistics, 14(1):1–26.

Evans, G. and Honkapohja, S. (2001). Learning and Expectations in Macroeconomics. Princeton University Press, Princeton.

Feldman, M. (1991). On the generic nonconvergence of Bayesian actions and beliefs. Economic Theory, 1:301–321.

Fernandez, C. and Steel, M. F. (1998). On Bayesian modeling of fat tails and skewness. Journal of the American Statistical Association, 93(441):359–371.

Freedman, D. (1999). On the Bernstein-von Mises theorem with infinite dimensional parameters. Technical Report 492, Department of Statistics, University of California, Berkeley.

Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis, 15:183–202.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, New York.

Ghosal, S., Ghosh, J. K., and Samanta, T. (1995a). Stability and convergence of posterior in non-regular problems. Mathematical Methods of Statistics, 4(4):361–388.

Ghosal, S. (2000). A review of consistency and convergence of posterior distribution. Technical report, Proceedings of the National Conference in Bayesian Analysis, Banaras Hindu University, Varanasi, India.

Ghosal, S., Ghosh, J. K., and Samanta, T. (1994). Stability and convergence of posterior in nonregular problems. In Gupta, S. S. and Berger, J. O., editors, Statistical Decision Theory and Related Topics 5. Springer, New York.

Ghosal, S., Ghosh, J. K., and Samanta, T. (1995b). On convergence of posterior distributions. Annals of Statistics, 23(6):2145–2152.

Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28:500–531.

Ghosal, S. and Tang, Y. (2006). Bayesian consistency for Markov processes. Sankhya, 68:227–239.

Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-iid observations. Annals of Statistics, 35:192–223.
Guesnerie, R. (1992). An exploration of the eductive justifications of the rational-expectations hypothesis. American Economic Review, 82(5):1254–1278.

Guesnerie, R. (1993). Theoretical tests of the rational expectations hypothesis in economic dynamical models. Journal of Economic Dynamics and Control, 17(5-6):847–864.

Guesnerie, R. and Jara-Moroni, P. (2009). Expectational coordination in simple economic contexts: Concepts and analysis with emphasis on strategic substitutabilities. Working Paper 2009-27, Paris School of Economics.

Harrison, J. and West, M. (1997). Bayesian Forecasting and Dynamic Models. Springer.

Hirano, K. and Porter, J. R. (2003). Asymptotic efficiency in parametric structural models with parameter-dependent support. Econometrica, 71(5):1307–1338.

Ibragimov, I. A. and Has'minskii, R. Z. (1981). Statistical Estimation: Asymptotic Theory. Applications of Mathematics, vol. 16, Springer, New York.

Jang, G. H., Lee, J., and Lee, S. (2010). Posterior consistency of species sampling priors. Statistica Sinica, 20:581–593.

Jun, B. and Vives, X. (1996). Learning and convergence to a full-information equilibrium are not equivalent. Review of Economic Studies, 63(4):653–674.

Kalai, E. and Lehrer, E. (1992). Bayesian forecasting. Discussion Paper 998, Department of Managerial Economics and Decision Sciences, Kellogg School of Management, Northwestern University.

Kelly, D. L. and Kolstad, C. D. (1999). Bayesian learning, growth, and pollution. Journal of Economic Dynamics and Control, 23(4):491–518.

Klenke, A. (2008). Probability Theory - A Comprehensive Course. Springer.

LeCam, L. M. and Yang, G. L. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer, New York.

Lehmann, E. (1991). Theory of Point Estimation. Wadsworth and Brooks/Cole.

Lijoi, A., Prünster, I., and Walker, S. G. (2007). Bayesian consistency for stationary models. Econometric Theory, 23(4):749–759.

Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability. Cambridge University Press (Cambridge Mathematical Library), New York, 2nd edition.

Paolella, M. (2007). Intermediate Probability - A Computational Approach. Wiley.

Poirier, D. J. (1995). Intermediate Statistics and Econometrics: A Comparative Approach. MIT Press.

Pötscher, B. M. and Prucha, I. R. (1997). Dynamic Nonlinear Econometric Models: Asymptotic Theory. Springer, New York.

Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York.

Robert, C. P. (1994). The Bayesian Choice. Springer, New York.

Rodriguez-Yam, G., Davis, R. A., and Scharf, L. (2004). Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. Technical report, Department of Statistics, Columbia University.

Rousseau, J., Chopin, N., and Liseo, B. (2010). Bayesian nonparametric estimation of the spectral density of a long or intermediate memory Gaussian process. Working paper, Université Paris Dauphine, Paris.

Routledge, B. (1999). Adaptive learning in financial markets. The Review of Financial Studies, 12:1165–1202.

Schwartz, L. (1965). On Bayes procedures. Probability Theory and Related Fields, 4:10–26.

Shalizi, C. R. (2010). Frequentist consistency of Bayesian procedures. Notebook, University of Michigan, http://cscs.umich.edu/~crshalizi/notebooks/bayesian-consistency.html.

Smith, L. and Sørensen, P. (2000). Pathological outcomes of observational learning. Econometrica, 68(2):371–398.
Stinchcombe, M. B. (2005). The unbearable flightiness of Bayesians: Generically erratic updating. Working paper, University of Texas at Austin.

Strasser, H. (1985). Mathematical Theory of Statistics. Studies in Mathematics 7. de Gruyter, Berlin.

Sun, Y. (2006). The exact law of large numbers via Fubini extension and characterization of insurable risks. Journal of Economic Theory, 126(1):31–69.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–550.

Timmermann, A. (1996). Excess volatility and predictability of stock prices in autoregressive dividend models with learning. Review of Economic Studies, 63(4):523–575.

Vives, X. (2008). Information and Learning in Markets. Princeton University Press.

Walker, S. (2004). New approaches to Bayesian consistency. Annals of Statistics, 32:2028–2043.