Consider the following characterization of Bayes' theorem:
Bayes' Theorem
Given some observed data $x$, the posterior probability that the parameter $\Theta$ has the value $\theta$ is $p(\theta \mid x) = p(x \mid \theta) p (\theta) / p(x)$, where $p(x \mid \theta)$ is the likelihood, $p(\theta)$ is the prior probability of the value $\theta$, and $p(x)$ is the marginal probability of the value $x$.
Is there any special reason why we call $p(x)$ the "marginal probability"? What is "marginal" about it?
23 Answers
If you consider a joint distribution to be a table of values in columns and rows with their probabilities entered in the cells, then the "marginal distribution" is found by summing the values in the table along rows (or columns) and writing the totals in the margins of the table.
$$\begin{array}{c c} & X \\ \Theta & \boxed{\begin{array}{c|cc|c} ~ & 0 & 1 & P(\Theta) \\ \hline 0 & 0.15 & 0.35 & 0.5 \\ 1 & 0.20 & 0.30 & 0.5 \\\hline P(X) & 0.35 & 0.65 & ~\end{array}}\end{array}$$
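If it helps to see this computationally, here is a minimal sketch (using NumPy purely for illustration) that recovers the margins of the table above by summing the joint table along its rows and columns:

```python
import numpy as np

# Joint distribution p(theta, x) from the table above:
# rows index theta in {0, 1}, columns index x in {0, 1}.
joint = np.array([[0.15, 0.35],
                  [0.20, 0.30]])

# Summing along each row (over x) gives the marginal of theta,
# the numbers written in the right-hand margin of the table.
p_theta = joint.sum(axis=1)   # -> [0.5, 0.5]

# Summing down each column (over theta) gives the marginal of x,
# the numbers written in the bottom margin of the table.
p_x = joint.sum(axis=0)       # -> [0.35, 0.65]

print(p_theta, p_x)
```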
To me, Bayes' theorem is all about inverting likelihood functions, and in that context calling it the marginal probability makes sense.
- Let's say I have an observation $c$,
- and a collection of states $\mathbf{s}=\{s_1,\ldots,s_n\}$, any of which could be causing that observation.
- Each of those states also defines a likelihood $P(c\mid s_i)$,
- and we also have a prior $P(s_i)$ (I'm assuming you have already motivated the prior; if not, ask another question on this site).
- So I want to infer the state from the observation.
- If I just wanted to know the most likely state, and how the states compare to each other, I could define a scoring function combining the likelihood of our observation given we are in the state with the base chance of being in the state: $$\operatorname{score}_c(s_i)= P(c\mid s_i)P(s_i)$$
- Then to find the most likely state $s^\star$, I would just take the argmax: $$s^\star = \operatorname{argmax}_{\forall s_i \in \mathbf{s}} \operatorname{score}_c(s_i) = \operatorname{argmax}_{\forall s_i \in \mathbf{s}} P(c\mid s_i)P(s_i) $$
- That score function is quite nice. We can think of a score vector that holds all the scores, from which we can see which state is the most likely and which is the least. But it does not sum to one. We'd like to make it sum to one -- we would normalise it and call it a probability (even if it isn't obviously one -- it will turn out that it is). The normalisation depends only on $c$, so the normalised score will be $P(s_i\mid c)$; a proper justification of this is beyond the scope of this answer (see the numerical sketch after this list). The normalised score is given by $$P(s_i\mid c)=\dfrac{\operatorname{score}_c(s_i)}{\sum_{\forall s_j\in \mathbf{s}} \operatorname{score}_c(s_j) } = \dfrac{P(c\mid s_i)P(s_i)}{\sum_{\forall s_j\in \mathbf{s}} P(c\mid s_j)P(s_j) }$$
- The above is a very useful form of Bayes' theorem.
- Let's take a closer look at the bottom line: $$\sum_{\forall s_j\in \mathbf{s}} P(c\mid s_j)P(s_j) = \sum_{\forall s_j\in \mathbf{s}} P(c,s_j)$$
- So we are summing the joint probability over all possible values that one of its fields can take. That is the very definition of the marginal probability of the other field: $$P(c) = \sum_{\forall s_j\in \mathbf{s}} P(c,s_j)$$
- Our bottom line -- the normalising factor that makes everything sum to one -- is just the marginal probability of $c$. Substituting that back in: $$P(s_i\mid c) = \dfrac{P(c\mid s_i)P(s_i)}{P(c)}$$

So the bottom line $P(c)$ was just a marginal probability, which we find by summing the top line over all possible values of the other field (the states $s_j$).
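Here is a minimal numerical sketch of that score-then-normalise pipeline; the three states and the particular prior and likelihood values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical numbers: priors P(s_i) over three states and
# likelihoods P(c | s_i) of one observation c under each state.
prior = np.array([0.5, 0.3, 0.2])       # P(s_i)
likelihood = np.array([0.1, 0.6, 0.3])  # P(c | s_i)

# The unnormalised score from the answer: P(c | s_i) * P(s_i).
score = likelihood * prior

# The most likely state needs no normalisation at all.
s_star = np.argmax(score)

# The normalising constant is the sum of the scores, i.e. the
# marginal P(c) = sum_j P(c | s_j) P(s_j).
p_c = score.sum()

# Dividing through gives the posterior P(s_i | c), which sums to one.
posterior = score / p_c

print(s_star, p_c, posterior)
```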
The explanation I was given when I was taught conditional probabilities is that if you draw up a table of the probabilities $p(x,y)$, then the row/column sums$$ p(x) = \sum_{y} p(x,y) $$(by the law of total probability) are written in the margins of the table.
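For instance, with the joint table from the first answer above, the column sums $p(X=0)=0.15+0.20=0.35$ and $p(X=1)=0.35+0.30=0.65$ are exactly the entries written in the bottom margin.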