Introduction
Let \(E_1, \ldots, E_n\) be a list of events in a probability space
with sample space \(\Omega\). Most are aware of the result
\begin{equation*}
P(E_1 \cup \cdots \cup E_n) \leq P(E_1) + \cdots + P(E_n).
\end{equation*}
Fewer are aware of the principle of inclusion-exclusion. If you aren’t,
don’t fear—all will be revealed with the power of indicator functions.
Indicator Functions
Indicator Function: Let \(X \subseteq Y\), then
\(1_X : Y \to \{0, 1\}\) is the indicator function on \(X\)
assigning \(1\) to elements in \(X\) and \(0\) elsewhere.
We can now rephrase the probability of an event \(E\) as
\(\mathbb{E}[1_{E_1}]\). With a probability (density of mass)
function of \(p(x)\), we can interpret it in two ways. In terms of
integration, this is
\(\int_\Omega 1_{E_1} p(x) \, \mathrm{d}x = \int_{E_1} p(x)
\, \mathrm{d}x\). In terms of summations, this just becomes
\(\sum_{x \in E_1} p(x)\). With general measure theory, one defines
the integral accordingly \(\int_\Omega 1_{E_1} \, \mathrm{d}
\mu = \mu(E_1) = P(E_1)\). More importantly, indicator functions allow
one to prove many things in more general settings using the properties
of expectation.
Markov’s Inequality: Let \(X\) be a non-negative random variable.
\begin{equation*}
\frac{\mathbb{E}[x]}{\lambda} \geq P(X \geq \lambda)
\end{equation*}
Our event is \(X \geq \lambda\). Then note
\(X \geq \lambda 1_{X \geq \lambda}\). To see this is true, consider
all the possible values \(X\) could be. If \(X < \lambda\), then
our indicator function is \(0\), so \(X \geq 0\) is the trivial
statement. If \(X \geq \lambda\), then the indicator function is
\(1\), so we have \(X \geq \lambda\). This is again trivially
true. Taking expectation of both sides. We can use the linearity of
expectation to get
\(\mathbb{E}[X] \geq \lambda E[1_{X \geq \lambda}]\) or
\(\frac{\mathbb{E}[X]}{\lambda} \geq P(X \geq \lambda)\).
Principle of Inclusion-Exclusion
The principle of inclusion-exclusion is usually framed as a way to count
a union, which is easier to understand.
As Counting
Say you are trying to count the size of a set
\(E_1 \cup \dots \cup E_n\). This might be computationally
expensive, but it might be much simpler to count the size of
intersections like \(E_{i_1} \cap \cdots \cap E_{i_m}\). We can
recover the size of the unions easily. First, we’ll tackle the
\(n = 3\) case.
We start by counting all the elements in \(E_1\), \(E_2\), and
\(E_3\). But since these sets might intersect, we need to remove
duplicates: \(E_1 \cap E_2\), \(E_1
\cap E_3\), and \(E_2 \cap E_3\). But this might remove too many,
because these sets might intersect. If any two intersect, we know that
element was contained in every set, so we added it \(3\) times and
removed it \(3\) times, so we only need to add it back in once.
In the arbitrary case, we do the same thing. We add in sets, then
subtract intersections, add in more intersections, subtract, etc.
As Indicator Functions
The counting approach has a few problems:
- It’s hard to prove.
- It’s hard to generalize.
We can rephrase and prove the principle with indicator functions. We
start with the easy fact that
\(1_{E_1} \cdots 1_{E_n} = 1_{E_1 \cap \cdots \cap E_n}\). Then let
\(E = E_1 \cap \cdots \cap E_n\), so
\(1_E \cdot 1_{E_i} = 1_{E_i}\). Lastly, we’ll need a little bit of
notation.
:math:`A` and :math:`E_A`: Let \(A\) be a subset of
\(\{1, \ldots, n\}\). Now let \(E_A = \bigcap_{i \in A} E_i\).
Then our adding and subtracting can easily be represented with sums over
subsets \(A\) of a specific size:
Principle of Inclusion-Exclusion
\begin{equation*}
1_E = \sum_{|A| = 1} 1_{E_A} - \sum_{|A| = 2} 1_{E_A} + \cdots - (-1)^n
\sum_{|A| = n} 1_{E_A}
\end{equation*}
We start with the statement
\((1_E - 1_{E_1}) \cdots (1_E - 1_{E_n}) = 0\). Recall that the left
side is a function on \(\Omega\), so plugging in \(x \in 1_E\)
tells us that \(x \in E\), so \(x \in E_i\) for some \(i\).
Thus \(1_E - 1_{E_i} = 0\). If \(x \not\in E\), then
\(x \not\in E_i\) for any \(i\), so \(1_E - 1_{E_i} = 0\).
Thus for all \(x \in \Omega\),
\((1_E - 1_{E_1}) \cdots (1_E - 1_{E_n})(x) = 0(x) = 0\).
Now we can apply just expand our polynomial and get:
\begin{equation*}
1_E^n - 1_E^{n-1} (1_{E_1} + \cdots + 1_{E_n}) + \cdots - \cdots + (-1)^n 1_{E_1} \cdots 1_{E_n} = 0
\end{equation*}
One subtraction later and a little sum notation, and we’ve arrived at
our answer.
\begin{equation*}
1_E^n = \sum_{|A| = 1} 1_{E_A} - \cdots - (-1)^n \sum_{|A| = n} 1_{E_A}
\end{equation*}
By summing over both sides over \(x \in X\), we get
Counting Version
\begin{equation*}
|E_1 \cup \cdots \cap E_n| = |E_1| + \cdots + |E_n| - |E_1 \cap E_2| - \cdots
+ \cdots - (-1)^n |E_1 \cap \cdots \cap E_n|.
\end{equation*}
Now we take the expectation as well to get
Probability Version
\begin{equation*}
P(E_1 \cup \cdots \cap E_n) = P(E_1) + \cdots + P(E_n) - P(E_1 \cap E_2) - \cdots
+ \cdots - (-1)^n P(E_1 \cap \cdots \cap E_n).
\end{equation*}
But even better than this is that we can reason about the indicator’s
functions values with combinatorics.
Leading up to the Bonferroni Inequalities
Let’s say we truncated our expanded polynomial from earlier at \(k\)
terms, giving us
\begin{equation*}
f = 1_E - \sum_{|A| = 1} 1_{E_A} + \cdots + (-1)^k \sum_{|A| = k} 1_{E_A}.
\end{equation*}
This is not going to be \(0\) anymore. If we choose \(k = 1\),
then anything in the intersection will be \((-1)^k\).
To evaluate this more easily, notice that for any given \(x \in X\),
each term of \(f(x)\) only depends on how many of
\(E_1, \cdots E_n\) that \(x\) is in. By summing over all
different combinations, we lose information about which subset that
\(x\) was in.
Let \(B\) be all the \(i\) such that \(x \in E_i\).
Examining the term \(\sum_{|A| =
i} 1_{E_A}(x)\), each \(1_{E_A}\) is \(1\) iff
\(A \subset B\). If \(A \not\subseteq B\), one element of
\(A\) has \(x \not\in i\), so \(x \not\in E_A\). If
\(A \subset B\), then \(x \in E_A\), so the indicator function
holds. Thus \(\sum_{|A| = i} 1_{E_A}\) is precisely the number of
subsets of size \(i\) contained in \(B\). If \(|B| = m\),
then this term is \(\binom{m}{i}\).
So then for \(x\) in precisely \(m\) \(E_i\),
\begin{equation*}
f(x) = \binom{m}{0} - \binom{m}{1} + \cdots + (-1)^k \binom{m}{k}.
\end{equation*}
There is the slight caveat that once \(k > m\), \(x\) is not in
any intersection of \(k\) subsets. So we take
\(\binom{m}{k} = 0\) when \(k > m\). This just means that adding
extra terms doesn’t really improve the situation. From now on, we will
assume \(k \leq m\).
Counting with Inversions
We will find a closed form expression of \(f\) by pairing adjacent
terms in such a way that they cancel each other out. We’ll need a
perculiar function first.
Let
\(i : \mathcal{P}(\{1, \ldots, m\}) \to \mathcal{P}(\{1, \cdots, m\})\)
where \(\mathcal{P}(S)\) is the power-set of \(S\). We define
\(i\) as follows:
\begin{equation*}
i(A) = \begin{cases}
A \cup \{m\} & \text{if } m \not\in A \\
A - \{m\} & \text{if } m \in A
\end{cases}.
\end{equation*}
Since \(i^2(A)\) first adds \(n\) then subtracts \(n\) if
\(n \not\in A\), or subtracts \(n\) then adds \(n\) if
\(n \in A\), \(i^2(A) = A\). This fact is why we calll \(i\)
an inversion. It also implies that \(i\) is bijective.
But the really crucial property is that \(i\) takes subsets of
\(\{1, \ldots, m\}\) from one term of \(f\) to another. A term
of \(f(x)\), say \(\binom{m}{i}\), counts the number of subsets
of a given size. Since \(i(A)\) has either one more or one less
element of \(A\), counting the subset associated to \(i(A)\)
cancels with the counting the subset \(A\) since adjacent terms are
off by a \(-1\).
There is one exception. When we are looking at the subsets counted in
the last term, \(\binom{m}{k}\), we may try to pair it with subsets
that are one larger. Sadly those subsets are not being counted in
\(f\). Luckily we can still count them. They are all the subsets not
containing \(n\) but still of size \(k\). In other terms,
\begin{equation*}
f(x) = (-1)^k \binom{m-1}{k}.
\end{equation*}
The Bonferroni Inequalities
Now that we have done all the counting inversion business, we really
only need the sign:
\begin{equation*}
f = 1_E - \sum_{|A| = 1} 1_{E_A} + \cdots + (-1)^k \sum_{|A| = k}
1_{E_A} \begin{cases}
\geq 0 & \text{if } k \text{ is even} \\
\leq 0 & \text{if } k \text{ is odd} \\
\end{cases}.
\end{equation*}
This gives us
\begin{equation*}
\begin{cases}
1_E \geq \sum_{|A| = 1} 1_{E_A} - \cdots - (-1)^k \sum_{|A| = k} 1_{E_A}
& \text{if } k \text{ is even} \\
1_E \leq \sum_{|A| = 1} 1_{E_A} - \cdots - (-1)^k \sum_{|A| = k} 1_{E_A}
& \text{if } k \text{ is odd} \\
\end{cases}
\end{equation*}
Finally taking the expectation of both sides gives us
Bonferroni Inequalities
\begin{equation*}
\begin{cases}
P(E) \geq \sum_{|A| = 1} P(E_A) - \cdots - (-1)^k \sum_{|A| = k} P(E_A)
& \text{if } k \text{ is even} \\
P(E) \leq \sum_{|A| = 1} P(E_A) - \cdots - (-1)^k \sum_{|A| = k} P(E_A)
& \text{if } k \text{ is odd} \\
\end{cases}.
\end{equation*}
As a special case, we have
\begin{equation*}
P(E_1 \cup \cdots \cup E_n) \leq P(E_1) + \cdots + P(E_N).
\end{equation*}