Temporal Probability Models (Markov Models and Particle Filtering)
Chain Rule and HMMs
Look at the model below.

From the chain rule, every joint distribution over $X_1, E_1, \ldots, X_T, E_T$ can be written as:

$$P(X_1, E_1, \ldots, X_T, E_T) = P(X_1)\,P(E_1 \mid X_1)\prod_{t=2}^{T} P(X_t \mid X_1, E_1, \ldots, X_{t-1}, E_{t-1})\,P(E_t \mid X_1, E_1, \ldots, X_t)$$

Assuming the following conditional independences hold:

$$X_t \perp X_1, \ldots, X_{t-2}, E_1, \ldots, E_{t-1} \mid X_{t-1}$$
$$E_t \perp X_1, E_1, \ldots, X_{t-1}, E_{t-1} \mid X_t$$

the joint simplifies to:

$$P(X_1, E_1, \ldots, X_T, E_T) = P(X_1)\,P(E_1 \mid X_1)\prod_{t=2}^{T} P(X_t \mid X_{t-1})\,P(E_t \mid X_t)$$
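To make the factorization concrete, here is a minimal sketch (with hypothetical CPTs for a two-state, binary-evidence HMM; all numbers are assumptions) that computes the probability of one full trajectory using exactly the simplified product above:

```python
import numpy as np

# Hypothetical CPTs (assumed numbers, for illustration only).
p_x1 = np.array([0.5, 0.5])        # P(X1)
trans = np.array([[0.7, 0.3],      # P(X_t | X_{t-1}); row = x_{t-1}
                  [0.3, 0.7]])
emit = np.array([[0.9, 0.1],       # P(E_t | X_t); row = x_t
                 [0.2, 0.8]])

def joint(xs, es):
    """P(x1, e1, ..., xT, eT) via the simplified HMM factorization."""
    p = p_x1[xs[0]] * emit[xs[0], es[0]]
    for t in range(1, len(xs)):
        p *= trans[xs[t - 1], xs[t]] * emit[xs[t], es[t]]
    return p

print(joint([0, 1], [0, 1]))  # 0.5 * 0.9 * 0.3 * 0.8 = 0.108
```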
Real HMM examples include tasks like the robot localization and weather prediction scenarios discussed below.
Filtering/Monitoring
First, we define $B_t(X) = P(X_t \mid e_1, \ldots, e_t)$ as the belief state: our distribution over the hidden variable at time $t$, given all observations from the start up to now. We begin with an initial belief state $B_0(X)$ (usually uniform) and, as time passes or observations arrive, we update $B_t(X)$. In other words, we maintain a vector whose length is the number of possible values of the hidden variable, where each cell holds the probability that the corresponding value is the true one. This task of tracking the distribution $B_t(X)$ over time is called "filtering" or "monitoring".
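As a minimal sketch (the dictionary representation here is our own assumption, not something from the original), the belief state can be stored as a mapping from each possible hidden value to its probability, starting uniform; the `normalize` helper will be reused in the updates below:

```python
# A belief state B(X): a dict from each hidden value to its probability.
def uniform_belief(values):
    return {x: 1.0 / len(values) for x in values}

def normalize(belief):
    z = sum(belief.values())
    return {x: p / z for x, p in belief.items()}

B0 = uniform_belief(["rain", "sun"])  # hypothetical two-value hidden state
print(B0)                             # {'rain': 0.5, 'sun': 0.5}
```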
Example: Robot Localization
Robot localization is the process of determining where a mobile robot is located with respect to its environment. Localization is one of the most fundamental competencies required by an autonomous robot, as knowledge of the robot's own location is an essential precursor to making decisions about future actions. In a typical robot localization scenario, a map of the environment is available, and the robot is equipped with sensors that observe the environment as well as monitor its own motion. In this example, the sensor can read in which directions there is a wall, and it never makes more than one mistake per reading; the motion model may fail to execute an action with some small probability. As mentioned earlier, $B_0(X)$ is assigned a uniform distribution.
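Here is a hedged sketch of such a sensor model (the exact probabilities are not given in the original; `eps`, a per-direction error rate, is an assumed parameter). The robot reads, for each of the four directions, whether a wall is present, and the likelihood of a reading drops off sharply with the number of mismatched directions, approximating the "never more than one mistake" behaviour:

```python
# Likelihood P(reading | cell): each of the 4 wall bits (N, E, S, W) is
# assumed to flip independently with small probability eps.
def sensor_likelihood(reading, true_walls, eps=0.1):
    mistakes = sum(r != w for r, w in zip(reading, true_walls))
    return (eps ** mistakes) * ((1 - eps) ** (len(reading) - mistakes))

print(sensor_likelihood((1, 0, 1, 0), (1, 0, 1, 0)))  # exact match: highest
print(sensor_likelihood((1, 0, 1, 0), (1, 1, 1, 0)))  # 1 mistake: lower
print(sensor_likelihood((1, 0, 1, 0), (1, 1, 0, 0)))  # 2 mistakes: near zero
```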

In the figures below, each cell's colour shows its probability; for example, at t = 0 every cell has equal probability.

Cells that are compatible with the sensor evidence have the highest probability of being the robot's true location. Lighter grey cells could also produce the reading, but are less likely because that would require one sensor mistake. White cells would need more than one mistake, so their probability is near zero.
After skipping ahead a few time steps we have:

And then:

At this point the belief is concentrated, and the robot's location is known almost certainly.
Passage of Time
Suppose the current belief is $B(X_t) = P(X_t \mid e_{1:t})$. After one time step passes, for $P(X_{t+1} \mid e_{1:t})$ we have:

$$P(X_{t+1} \mid e_{1:t}) = \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$$
Note that $P(X_{t+1} \mid e_{1:t})$ is not what we defined as $B_{t+1}(X)$ (it does not yet condition on $e_{t+1}$), so we call it $B'(X_{t+1})$, and we have:

$$B'(X_{t+1}) = \sum_{x_t} P(X_{t+1} \mid x_t)\, B(x_t)$$
We call the first factor, $P(X_{t+1} \mid x_t)$, the transition model, and say that beliefs get "pushed" through the transitions.
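A minimal sketch of this update, assuming the transition model is stored as a dict of dicts with `trans[x][x_next] = P(x_next | x)`:

```python
# Passage of time: B'(X_{t+1}) = sum over x of P(X_{t+1} | x) * B(x).
def elapse_time(belief, trans):
    new_belief = {x: 0.0 for x in belief}
    for x, p in belief.items():
        for x_next, p_next in trans[x].items():
            new_belief[x_next] += p_next * p
    return new_belief
```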
Example: Passage of Time
In this model, as time passes without observations, uncertainty about the answer accumulates and the belief spreads out.
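Since the original figure is not reproduced here, a small numeric demonstration (with an assumed two-state transition model) shows the same effect: repeated time updates, with no observations, flatten the belief toward the stationary distribution:

```python
trans = {"rain": {"rain": 0.7, "sun": 0.3},   # assumed numbers
         "sun":  {"rain": 0.3, "sun": 0.7}}

B = {"rain": 1.0, "sun": 0.0}                 # start fully certain
for _ in range(5):
    B = elapse_time(B, trans)
    print(B)  # drifts toward {'rain': 0.5, 'sun': 0.5}: certainty fades
```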

Observation
Now we want to incorporate the observation into our belief about the hidden variable's next value:

$$P(X_{t+1} \mid e_{1:t+1}) \propto P(e_{t+1} \mid X_{t+1})\, P(X_{t+1} \mid e_{1:t})$$
So in each state, after the observation we have:

$$B(X_{t+1}) \propto P(e_{t+1} \mid X_{t+1})\, B'(X_{t+1})$$
The second factor, $B'(X_{t+1}) = P(X_{t+1} \mid e_{1:t})$, is the belief after the time update; we say beliefs get "reweighted" by the likelihood of the evidence.
**Note:** unlike the passage-of-time update, here we have to renormalize, since the reweighted values no longer sum to one.
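A sketch of the combined reweight-and-renormalize step, assuming the emission model is a dict of dicts with `emit[x][e] = P(e | x)` and reusing `normalize` from above:

```python
# Observation: B(X_{t+1}) ∝ P(e | X_{t+1}) * B'(X_{t+1}), then renormalize.
def observe(belief, emit, evidence):
    weighted = {x: emit[x][evidence] * p for x, p in belief.items()}
    return normalize(weighted)
```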
Example: Observation
In this model, as we get observations, beliefs get reweighted and uncertainty about the answer decreases.
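With an assumed emission table, a single observation sharpens an initially uniform belief:

```python
emit = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},   # assumed numbers
        "sun":  {"umbrella": 0.2, "no_umbrella": 0.8}}

print(observe({"rain": 0.5, "sun": 0.5}, emit, "umbrella"))
# -> {'rain': ~0.82, 'sun': ~0.18}: uncertainty decreases
```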

Example: Weather HMM
In this example we want to predict the weather by watching our friend and seeing whether or not he arrives with an umbrella. On the first day we use a uniform distribution; on each following day we compute $B'$ from the previous belief and then, after observing the umbrella, compute $B$ for that day. Repeating this every day decreases the uncertainty.
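Putting the two updates together, a sketch of the day-by-day loop (reusing `elapse_time`, `observe`, and the assumed `trans` and `emit` tables from above; the actual numbers in the original figure are not available here):

```python
B = {"rain": 0.5, "sun": 0.5}                      # uniform B0
for e in ["umbrella", "umbrella", "no_umbrella"]:  # hypothetical observations
    B_prime = elapse_time(B, trans)                # passage of time: B'
    B = observe(B_prime, emit, e)                  # reweight + renormalize: B
    print(B)
```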

The Forward Algorithm
We are given evidence at each time and want to know:
$$B_t(X) = P(X_t \mid e_{1:t})$$
We can derive the following update:

$$P(x_t \mid e_{1:t}) \propto_X P(x_t, e_{1:t}) = P(e_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1}, e_{1:t-1})$$
We can normalize as we go, if we want $P(x \mid e)$ at each time step, or just once at the end. Which is better? Normalizing at each step costs little, keeps the numbers from underflowing, and yields the filtered distribution directly, so it is usually preferable.
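A sketch of the forward algorithm built from this recurrence; the `normalize_each_step` flag switches between the two options just discussed:

```python
# Forward algorithm: f_t(x) = P(x, e_{1:t}), with the update
# f_t(x) = P(e_t | x) * sum over x_prev of P(x | x_prev) * f_{t-1}(x_prev).
def forward(prior, trans, emit, evidence_seq, normalize_each_step=True):
    f = dict(prior)
    for e in evidence_seq:
        f = {x: emit[x][e] * sum(trans[xp][x] * f[xp] for xp in f)
             for x in f}
        if normalize_each_step:  # cheap, and prevents numerical underflow
            f = normalize(f)
    return f

# Reusing the assumed trans/emit tables from the weather example:
print(forward({"rain": 0.5, "sun": 0.5}, trans, emit, ["umbrella"] * 3))
```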
Online Belief Updates
Every time step, we start with the current $P(X \mid \text{evidence})$.
We update for time:

$$P(x_t \mid e_{1:t-1}) = \sum_{x_{t-1}} P(x_{t-1} \mid e_{1:t-1})\, P(x_t \mid x_{t-1})$$

We update for evidence:

$$P(x_t \mid e_{1:t}) \propto_X P(x_t \mid e_{1:t-1})\, P(e_t \mid x_t)$$
Particle Filtering
In some problems, $|X|$ is too big to compute with exactly, or even to store $B(X)$, so the previous algorithms become infeasible; for example, when X is continuous. In these situations we must use approximate inference.
In this algorithm we track only samples of X, not all values, and call these samples particles. Time per step is linear in the number of samples, though the number needed may be large; in memory we store a list of particles rather than a table over all states.

We represent P(X) by a list of N particles, and approximate P(x) by the fraction of particles with value x. Generally N ≪ |X|, so many values x may get P(x) = 0, which is undesirable; to mitigate this, we can use more particles for more accuracy. (For now, assume all particles have the same weight.)
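A sketch of this representation, with a hypothetical three-value state space:

```python
import random

# P(X) is represented by a list of N particles; P(x) is estimated as the
# fraction of particles whose value is x.
def estimate(particles, x):
    return sum(p == x for p in particles) / len(particles)

particles = [random.choice(["A", "B", "C"]) for _ in range(200)]
print(estimate(particles, "A"))  # near 1/3; more particles -> more accurate
```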
Elapse Time
Each particle is moved to its next position by sampling from the transition model:

$$x' = \mathrm{sample}\big(P(X' \mid x)\big)$$
Just like prior sampling, the sample frequencies reflect the transition probabilities. As mentioned earlier, to stay close to the exact values we must use enough samples.
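A sketch of the elapse-time step for particles, reusing the dict-of-dicts transition model from before:

```python
# Sample one value x' ~ P(X' | x) from a discrete distribution.
def sample_from(dist):
    r, cum = random.random(), 0.0
    for x, p in dist.items():
        cum += p
        if r < cum:
            return x
    return x  # guard against floating-point rounding

# Elapse time: every particle is moved through the transition model.
def elapse_time_particles(particles, trans):
    return [sample_from(trans[x]) for x in particles]
```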

Observe
Just like likelihood weighting, each sample is weighted according to the evidence: $w(x) = P(e \mid x)$. (As before, the weights don't sum to one, since all have been downweighted, so they will need normalizing.)
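A sketch of the observe step: each particle keeps its value but picks up a weight equal to the evidence likelihood:

```python
# Observe: weight each particle by w = P(e | x); weights don't sum to one.
def weight_particles(particles, emit, evidence):
    return [(x, emit[x][evidence]) for x in particles]
```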

Resample
Instead of tracking weighted samples, we resample: we draw N times from our weighted sample distribution (i.e. draw with replacement). This is equivalent to renormalizing the distribution.
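A sketch of the resampling step, drawing with replacement in proportion to weight (`random.choices` does exactly this):

```python
# Resample: draw N new, unweighted particles with replacement, with
# probability proportional to weight; equivalent to renormalizing.
def resample(weighted_particles):
    values, weights = zip(*weighted_particles)
    return random.choices(values, weights=weights, k=len(values))
```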
Now the update is complete for this time step; we continue with the next one.
