I'm currently reading Chapter 15, Probabilistic Reasoning over Time, of Russell and Norvig's Artificial Intelligence: A Modern Approach, and I'm not able to follow the derivation for filtering and prediction (page 572).
Given the result of filtering up to time t, the agent needs to compute the result for t + 1 from the new evidence et+1,
P(Xt+1|e1:t+1) = f(et+1, P(Xt|e1:t)),
for some function f. This is called recursive estimation. We can view the calculation as being composed of two parts: first, the current state distribution is projected forward from t to t + 1; then it is updated using the new evidence et+1. This two-part process emerges quite simply when the formula is rearranged:

P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1) (dividing up the evidence)
= α P(et+1|Xt+1, e1:t) P(Xt+1|e1:t) (using Bayes' rule)
How does applying Bayes' rule give the last formula? Shouldn't it instead be

α P(e1:t, et+1|Xt+1) P(Xt+1)?
Both are correct, except that the α is not the same in the two expressions. The trick is that in the book's application of Bayes' rule, the conditioning on e1:t is carried along unchanged throughout. Imagine it weren't there to begin with; all the identities would still hold. Now add that conditioning to every term in the equation, and the identities continue to hold.
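As a quick sanity check (not from the book), here is that "conditional Bayes' rule" identity, P(A|B,C) = P(B|A,C) P(A|C) / P(B|C), verified numerically on a small made-up joint distribution over three binary variables:

```python
# Numeric check that Bayes' rule still holds when every term carries
# the same extra conditioning on background evidence C.
import itertools

# p[(a, b, c)] = P(A=a, B=b, C=c); arbitrary positive values, normalized.
raw = dict(zip(itertools.product([0, 1], repeat=3),
               [3, 1, 4, 1, 5, 9, 2, 6]))
total = sum(raw.values())
p = {k: v / total for k, v in raw.items()}

def marg(**fixed):
    """Sum P(A,B,C) over all entries that match the fixed variables."""
    idx = {'a': 0, 'b': 1, 'c': 2}
    return sum(v for k, v in p.items()
               if all(k[idx[name]] == val for name, val in fixed.items()))

# P(A=1 | B=1, C=1) computed directly from the joint ...
lhs = marg(a=1, b=1, c=1) / marg(b=1, c=1)

# ... and via Bayes' rule with C=1 conditioned on in every term:
# P(A|B,C) = P(B|A,C) * P(A|C) / P(B|C)
rhs = (marg(a=1, b=1, c=1) / marg(a=1, c=1)) \
      * (marg(a=1, c=1) / marg(c=1)) \
      / (marg(b=1, c=1) / marg(c=1))

assert abs(lhs - rhs) < 1e-12  # the two routes agree
```

The same check works for any joint distribution you plug in; the extra conditioning simply rides along on every factor, which is exactly what happens with e1:t in the book's derivation.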
Alternatively, the derivation can be done without explicitly invoking Bayes' rule, using only the definitions of joint and conditional probability (i.e., P(A|B)P(B) = P(A,B) = P(B|A)P(A)). Writing the joint P(Xt+1, e1:t, et+1) as a product in both orders gives
P(Xt+1|e1:t, et+1) P(et+1,e1:t) = P(e1:t, et+1|Xt+1) P(Xt+1)
Expanding the joint probabilities on both sides gives
P(Xt+1|e1:t, et+1) P(et+1|e1:t) P(e1:t) = P(et+1|Xt+1,e1:t) P(e1:t|Xt+1) P(Xt+1)
The last two factors on the RHS are again a product-rule expansion of the joint P(Xt+1, e1:t), so they can be rewritten in the other order:
P(Xt+1|e1:t, et+1) P(et+1|e1:t) P(e1:t) = P(et+1|Xt+1,e1:t) P(Xt+1|e1:t) P(e1:t)
Now P(e1:t) cancels from both sides (assuming it is nonzero), and we have
P(Xt+1|e1:t, et+1) = α P(et+1|Xt+1,e1:t) P(Xt+1|e1:t)
where α = 1 / P(et+1|e1:t).
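To see the result in action, here is a minimal sketch of one filtering step in Python, using the umbrella world from the same chapter (transition P(rain_{t+1}|rain_t) = 0.7, sensor P(umbrella|rain) = 0.9 and P(umbrella|¬rain) = 0.2; double-check the numbers against your edition). Note that the chapter's Markov assumptions let P(et+1|Xt+1, e1:t) reduce to P(et+1|Xt+1), and the prediction term P(Xt+1|e1:t) is obtained by summing over the transition model:

```python
# One step of the recursive filter derived above, on the umbrella example.
import numpy as np

# Transition model: row = previous state (rain, no rain),
# column = next state, so T[i, j] = P(X_{t+1}=j | X_t=i).
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

# Sensor likelihood for observing an umbrella:
# P(umbrella | Rain_{t+1} = true), P(umbrella | Rain_{t+1} = false)
O_umbrella = np.array([0.9, 0.2])

def filter_step(belief, evidence_likelihood):
    """belief = P(X_t | e_{1:t}); returns P(X_{t+1} | e_{1:t+1}).

    predict: P(X_{t+1} | e_{1:t}) = sum_x P(X_{t+1} | x) P(x | e_{1:t})
    update:  alpha * P(e_{t+1} | X_{t+1}) * P(X_{t+1} | e_{1:t}),
    where alpha = 1 / P(e_{t+1} | e_{1:t}) is just the normalizer.
    """
    predicted = T.T @ belief
    unnormalized = evidence_likelihood * predicted
    return unnormalized / unnormalized.sum()  # dividing IS multiplying by alpha

# Start from a uniform prior and observe an umbrella on day 1:
prior = np.array([0.5, 0.5])
day1 = filter_step(prior, O_umbrella)  # -> approx [0.818, 0.182]
```

The normalization at the end is the whole content of α: we never compute P(et+1|e1:t) explicitly, we just rescale the two unnormalized values so they sum to 1, which is why the book can get away with leaving α unspecified.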