Temporal Probability Models (Markov Models and Particle Filtering)
Chain Rule and HMMs
Look at the model below.

From the chain rule, every joint distribution over $X_1, E_1, \ldots, X_T, E_T$ can be written as:

$$P(X_1, E_1, \ldots, X_T, E_T) = P(X_1)\,P(E_1 \mid X_1)\prod_{t=2}^{T} P(X_t \mid X_1, E_1, \ldots, X_{t-1}, E_{t-1})\,P(E_t \mid X_1, E_1, \ldots, X_t)$$

Assuming the following conditional independences hold:

$$X_t \perp X_1, \ldots, X_{t-2}, E_1, \ldots, E_{t-1} \mid X_{t-1}$$
$$E_t \perp X_1, E_1, \ldots, X_{t-1}, E_{t-1} \mid X_t$$

the joint simplifies to:

$$P(X_1, E_1, \ldots, X_T, E_T) = P(X_1)\,P(E_1 \mid X_1)\prod_{t=2}^{T} P(X_t \mid X_{t-1})\,P(E_t \mid X_t)$$
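To make the factorization concrete, here is a minimal sketch (with hypothetical CPTs for a two-state, binary-evidence HMM; all numbers are assumptions) that computes the probability of one full trajectory using exactly the simplified product above:

```python
import numpy as np

# Hypothetical CPTs (assumed numbers, for illustration only).
p_x1 = np.array([0.5, 0.5])        # P(X1)
trans = np.array([[0.7, 0.3],      # P(X_t | X_{t-1}); row = x_{t-1}
                  [0.3, 0.7]])
emit = np.array([[0.9, 0.1],       # P(E_t | X_t); row = x_t
                 [0.2, 0.8]])

def joint(xs, es):
    """P(x1, e1, ..., xT, eT) via the simplified HMM factorization."""
    p = p_x1[xs[0]] * emit[xs[0], es[0]]
    for t in range(1, len(xs)):
        p *= trans[xs[t - 1], xs[t]] * emit[xs[t], es[t]]
    return p

print(joint([0, 1], [0, 1]))  # 0.5 * 0.9 * 0.3 * 0.8 = 0.108
```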
Real HMM examples include tasks like the robot localization and weather prediction scenarios discussed below.
Filtering/Monitoring
First, we define $B_t(X) = P(X_t \mid e_1, \ldots, e_t)$ as the belief state: our distribution over the hidden variable at time $t$, given all observations from the start up to now. We begin with an initial belief state $B_0(X)$ (usually uniform) and, as time passes or observations arrive, we update $B_t(X)$. In other words, we maintain a vector whose length is the number of possible values of the hidden variable, where each cell holds the probability that the corresponding value is the true one. This task of tracking the distribution $B_t(X)$ over time is called "filtering" or "monitoring".
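As a minimal sketch (the dictionary representation here is our own assumption, not something from the original), the belief state can be stored as a mapping from each possible hidden value to its probability, starting uniform; the `normalize` helper will be reused in the updates below:

```python
# A belief state B(X): a dict from each hidden value to its probability.
def uniform_belief(values):
    return {x: 1.0 / len(values) for x in values}

def normalize(belief):
    z = sum(belief.values())
    return {x: p / z for x, p in belief.items()}

B0 = uniform_belief(["rain", "sun"])  # hypothetical two-value hidden state
print(B0)                             # {'rain': 0.5, 'sun': 0.5}
```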
Example: Robot Localization
Robot localization is the process of determining where a mobile robot is located with respect to its environment. Localization is one of the most fundamental competencies required by an autonomous robot, as knowledge of the robot's own location is an essential precursor to making decisions about future actions. In a typical robot localization scenario, a map of the environment is available, and the robot is equipped with sensors that observe the environment as well as monitor its own motion. In this example, the sensor can read in which directions there is a wall, and it never makes more than one mistake per reading; the motion model may fail to execute an action with some small probability. As mentioned earlier, $B_0(X)$ is assigned a uniform distribution.
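Here is a hedged sketch of such a sensor model (the exact probabilities are not given in the original; `eps`, a per-direction error rate, is an assumed parameter). The robot reads, for each of the four directions, whether a wall is present, and the likelihood of a reading drops off sharply with the number of mismatched directions, approximating the "never more than one mistake" behaviour:

```python
# Likelihood P(reading | cell): each of the 4 wall bits (N, E, S, W) is
# assumed to flip independently with small probability eps.
def sensor_likelihood(reading, true_walls, eps=0.1):
    mistakes = sum(r != w for r, w in zip(reading, true_walls))
    return (eps ** mistakes) * ((1 - eps) ** (len(reading) - mistakes))

print(sensor_likelihood((1, 0, 1, 0), (1, 0, 1, 0)))  # exact match: highest
print(sensor_likelihood((1, 0, 1, 0), (1, 1, 1, 0)))  # 1 mistake: lower
print(sensor_likelihood((1, 0, 1, 0), (1, 1, 0, 0)))  # 2 mistakes: near zero
```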

In the figures below, each cell's colour shows its probability; for example, at t = 0 every cell has equal probability.

Cells that are compatible with the sensor evidence have the highest probability of being the robot's true location. Lighter grey cells could also produce the reading, but are less likely because that would require one sensor mistake. White cells would need more than one mistake, so their probability is near zero.
After skipping ahead a few time steps we have:

And then:

At this point the belief is concentrated, and the robot's location is known almost certainly.
Passage of Time
Suppose the current belief is $B(X_t) = P(X_t \mid e_{1:t})$. After one time step passes, for $P(X_{t+1} \mid e_{1:t})$ we have:

$$P(X_{t+1} \mid e_{1:t}) = \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$$
Note that $P(X_{t+1} \mid e_{1:t})$ is not what we defined as $B_{t+1}(X)$ (it does not yet condition on $e_{t+1}$), so we call it $B'(X_{t+1})$, and we have:

$$B'(X_{t+1}) = \sum_{x_t} P(X_{t+1} \mid x_t)\, B(x_t)$$
We call the first factor, $P(X_{t+1} \mid x_t)$, the transition model, and say that beliefs get "pushed" through the transitions.
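A minimal sketch of this update, assuming the transition model is stored as a dict of dicts with `trans[x][x_next] = P(x_next | x)`:

```python
# Passage of time: B'(X_{t+1}) = sum over x of P(X_{t+1} | x) * B(x).
def elapse_time(belief, trans):
    new_belief = {x: 0.0 for x in belief}
    for x, p in belief.items():
        for x_next, p_next in trans[x].items():
            new_belief[x_next] += p_next * p
    return new_belief
```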
Example: Passage of Time
In this model, as time passes without observations, uncertainty about the answer accumulates and the belief spreads out.
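Since the original figure is not reproduced here, a small numeric demonstration (with an assumed two-state transition model) shows the same effect: repeated time updates, with no observations, flatten the belief toward the stationary distribution:

```python
trans = {"rain": {"rain": 0.7, "sun": 0.3},   # assumed numbers
         "sun":  {"rain": 0.3, "sun": 0.7}}

B = {"rain": 1.0, "sun": 0.0}                 # start fully certain
for _ in range(5):
    B = elapse_time(B, trans)
    print(B)  # drifts toward {'rain': 0.5, 'sun': 0.5}: certainty fades
```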

Observation
Now we want to incorporate the observation into our belief about the hidden variable's next value:

$$P(X_{t+1} \mid e_{1:t+1}) \propto P(e_{t+1} \mid X_{t+1})\, P(X_{t+1} \mid e_{1:t})$$
So in each state, after the observation we have:

$$B(X_{t+1}) \propto P(e_{t+1} \mid X_{t+1})\, B'(X_{t+1})$$
The second factor, $B'(X_{t+1}) = P(X_{t+1} \mid e_{1:t})$, is the belief after the time update; we say beliefs get "reweighted" by the likelihood of the evidence.
**Note:** unlike the passage-of-time update, here we have to renormalize, since the reweighted values no longer sum to one.
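A sketch of the combined reweight-and-renormalize step, assuming the emission model is a dict of dicts with `emit[x][e] = P(e | x)` and reusing `normalize` from above:

```python
# Observation: B(X_{t+1}) ∝ P(e | X_{t+1}) * B'(X_{t+1}), then renormalize.
def observe(belief, emit, evidence):
    weighted = {x: emit[x][evidence] * p for x, p in belief.items()}
    return normalize(weighted)
```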
Example: Observation
In this model, as we get observations, beliefs get reweighted and uncertainty about the answer decreases.
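With an assumed emission table, a single observation sharpens an initially uniform belief:

```python
emit = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},   # assumed numbers
        "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

print(observe({"rain": 0.5, "sun": 0.5}, emit, "umbrella"))
# -> {'rain': ~0.82, 'sun': ~0.18}: uncertainty decreases
```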

Example: Weather HMM
In this example we want to predict the weather by watching our friend and seeing whether or not he arrives with an umbrella. On the first day we use a uniform distribution; on each following day we compute $B'$ from the previous belief and then, after observing the umbrella, compute $B$ for that day. Repeating this every day decreases the uncertainty.
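Putting the two updates together, a sketch of the day-by-day loop (reusing `elapse_time`, `observe`, and the assumed `trans` and `emit` tables from above; the actual numbers in the original figure are not available here):

```python
B = {"rain": 0.5, "sun": 0.5}                      # uniform B0
for e in ["umbrella", "umbrella", "no_umbrella"]:  # hypothetical observations
    B_prime = elapse_time(B, trans)                # passage of time: B'
    B = observe(B_prime, emit, e)                  # reweight + renormalize: B
    print(B)
```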

The Forward Algorithm
We are given evidence at each time and want to know:
$$B_t(X) = P(X_t \mid e_{1:t})$$
We can derive the following update:

$$P(x_t \mid e_{1:t}) \propto_X P(x_t, e_{1:t}) = P(e_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1}, e_{1:t-1})$$
We can normalize as we go, if we want $P(x \mid e)$ at each time step, or just once at the end. Which is better? Normalizing at each step costs little, keeps the numbers from underflowing, and yields the filtered distribution directly, so it is usually preferable.
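A sketch of the forward algorithm built from this recurrence; the `normalize_each_step` flag switches between the two options just discussed:

```python
# Forward algorithm: f_t(x) = P(x, e_{1:t}), with the update
# f_t(x) = P(e_t | x) * sum over x_prev of P(x | x_prev) * f_{t-1}(x_prev).
def forward(prior, trans, emit, evidence_seq, normalize_each_step=True):
    f = dict(prior)
    for e in evidence_seq:
        f = {x: emit[x][e] * sum(trans[xp][x] * f[xp] for xp in f)
             for x in f}
        if normalize_each_step:  # cheap, and prevents numerical underflow
            f = normalize(f)
    return f

# Reusing the assumed trans/emit tables from the weather example:
print(forward({"rain": 0.5, "sun": 0.5}, trans, emit, ["umbrella"] * 3))
```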
Online Belief Updates
Every time step, we start with the current $P(X \mid \text{evidence})$.
We update for time:

$$P(x_t \mid e_{1:t-1}) = \sum_{x_{t-1}} P(x_{t-1} \mid e_{1:t-1})\, P(x_t \mid x_{t-1})$$

We update for evidence:

$$P(x_t \mid e_{1:t}) \propto_X P(x_t \mid e_{1:t-1})\, P(e_t \mid x_t)$$
Particle Filtering
In some problems, $|X|$ is too big to compute with exactly, or even to store $B(X)$, so the previous algorithms become infeasible; for example, when X is continuous. In these situations we must use approximate inference.
In this algorithm we track only samples of X, not all values, and call these samples particles. Time per step is linear in the number of samples, though the number needed may be large; in memory we store a list of particles rather than a table over all states.

We represent P(X) by a list of N particles, and approximate P(x) by the fraction of particles with value x. Generally N ≪ |X|, so many values x may get P(x) = 0, which is undesirable; to mitigate this, we can use more particles for more accuracy. (For now, assume all particles have the same weight.)
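A sketch of this representation, with a hypothetical three-value state space:

```python
import random

# P(X) is represented by a list of N particles; P(x) is estimated as the
# fraction of particles whose value is x.
def estimate(particles, x):
    return sum(p == x for p in particles) / len(particles)

particles = [random.choice(["A", "B", "C"]) for _ in range(200)]
print(estimate(particles, "A"))  # near 1/3; more particles -> more accurate
```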
Elapse Time
Each particle is moved to its next position by sampling from the transition model:

$$x' = \mathrm{sample}\big(P(X' \mid x)\big)$$
Just like prior sampling, the sample frequencies reflect the transition probabilities. As mentioned earlier, to stay close to the exact values we must use enough samples.
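A sketch of the elapse-time step for particles, reusing the dict-of-dicts transition model from before:

```python
# Sample one value x' ~ P(X' | x) from a discrete distribution.
def sample_from(dist):
    r, cum = random.random(), 0.0
    for x, p in dist.items():
        cum += p
        if r < cum:
            return x
    return x  # guard against floating-point rounding

# Elapse time: every particle is moved through the transition model.
def elapse_time_particles(particles, trans):
    return [sample_from(trans[x]) for x in particles]
```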

Observe
Just like likelihood weighting, each sample is weighted according to the evidence: $w(x) = P(e \mid x)$. (As before, the weights don't sum to one, since all have been downweighted, so they will need normalizing.)
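A sketch of the observe step: each particle keeps its value but picks up a weight equal to the evidence likelihood:

```python
# Observe: weight each particle by w = P(e | x); weights don't sum to one.
def weight_particles(particles, emit, evidence):
    return [(x, emit[x][evidence]) for x in particles]
```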

Resample
Instead of tracking weighted samples, we resample: we draw N times from our weighted sample distribution (i.e. draw with replacement). This is equivalent to renormalizing the distribution.
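A sketch of the resampling step, drawing with replacement in proportion to weight (`random.choices` does exactly this):

```python
# Resample: draw N new, unweighted particles with replacement, with
# probability proportional to weight; equivalent to renormalizing.
def resample(weighted_particles):
    values, weights = zip(*weighted_particles)
    return random.choices(values, weights=weights, k=len(values))
```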
Now the update is complete for this time step; we continue with the next one.
