Introduction
Let \(E_1, \ldots, E_n\) be a list of events in a probability space with sample space \(\Omega\). Most are aware of the result
Fewer are aware of the principle of inclusion-exclusion. If you aren’t, don’t fear—all will be revealed with the power of indicator functions.
Indicator Functions
Indicator Function: Let \(X \subseteq Y\), then \(1_X : Y \to \{0, 1\}\) is the indicator function on \(X\) assigning \(1\) to elements in \(X\) and \(0\) elsewhere.
We can now rephrase the probability of an event \(E\) as \(\mathbb{E}[1_{E_1}]\). With a probability (density of mass) function of \(p(x)\), we can interpret it in two ways. In terms of integration, this is \(\int_\Omega 1_{E_1} p(x) \, \mathrm{d}x = \int_{E_1} p(x) \, \mathrm{d}x\). In terms of summations, this just becomes \(\sum_{x \in E_1} p(x)\). With general measure theory, one defines the integral accordingly \(\int_\Omega 1_{E_1} \, \mathrm{d} \mu = \mu(E_1) = P(E_1)\). More importantly, indicator functions allow one to prove many things in more general settings using the properties of expectation.
Markov’s Inequality: Let \(X\) be a non-negative random variable.
Our event is \(X \geq \lambda\). Then note \(X \geq \lambda 1_{X \geq \lambda}\). To see this is true, consider all the possible values \(X\) could be. If \(X < \lambda\), then our indicator function is \(0\), so \(X \geq 0\) is the trivial statement. If \(X \geq \lambda\), then the indicator function is \(1\), so we have \(X \geq \lambda\). This is again trivially true. Taking expectation of both sides. We can use the linearity of expectation to get \(\mathbb{E}[X] \geq \lambda E[1_{X \geq \lambda}]\) or \(\frac{\mathbb{E}[X]}{\lambda} \geq P(X \geq \lambda)\).
Principle of Inclusion-Exclusion
The principle of inclusion-exclusion is usually framed as a way to count a union, which is easier to understand.
As Counting
Say you are trying to count the size of a set \(E_1 \cup \dots \cup E_n\). This might be computationally expensive, but it might be much simpler to count the size of intersections like \(E_{i_1} \cap \cdots \cap E_{i_m}\). We can recover the size of the unions easily. First, we’ll tackle the \(n = 3\) case.
We start by counting all the elements in \(E_1\), \(E_2\), and \(E_3\). But since these sets might intersect, we need to remove duplicates: \(E_1 \cap E_2\), \(E_1 \cap E_3\), and \(E_2 \cap E_3\). But this might remove too many, because these sets might intersect. If any two intersect, we know that element was contained in every set, so we added it \(3\) times and removed it \(3\) times, so we only need to add it back in once.
In the arbitrary case, we do the same thing. We add in sets, then subtract intersections, add in more intersections, subtract, etc.
As Indicator Functions
The counting approach has a few problems:
- It’s hard to prove.
- It’s hard to generalize.
We can rephrase and prove the principle with indicator functions. We start with the easy fact that \(1_{E_1} \cdots 1_{E_n} = 1_{E_1 \cap \cdots \cap E_n}\). Then let \(E = E_1 \cap \cdots \cap E_n\), so \(1_E \cdot 1_{E_i} = 1_{E_i}\). Lastly, we’ll need a little bit of notation.
:math:`A` and :math:`E_A`: Let \(A\) be a subset of \(\{1, \ldots, n\}\). Now let \(E_A = \bigcap_{i \in A} E_i\).
Then our adding and subtracting can easily be represented with sums over subsets \(A\) of a specific size:
Principle of Inclusion-Exclusion
We start with the statement \((1_E - 1_{E_1}) \cdots (1_E - 1_{E_n}) = 0\). Recall that the left side is a function on \(\Omega\), so plugging in \(x \in 1_E\) tells us that \(x \in E\), so \(x \in E_i\) for some \(i\). Thus \(1_E - 1_{E_i} = 0\). If \(x \not\in E\), then \(x \not\in E_i\) for any \(i\), so \(1_E - 1_{E_i} = 0\). Thus for all \(x \in \Omega\), \((1_E - 1_{E_1}) \cdots (1_E - 1_{E_n})(x) = 0(x) = 0\).
Now we can apply just expand our polynomial and get:
One subtraction later and a little sum notation, and we’ve arrived at our answer.
By summing over both sides over \(x \in X\), we get
Counting Version
Now we take the expectation as well to get
Probability Version
But even better than this is that we can reason about the indicator’s functions values with combinatorics.
Leading up to the Bonferroni Inequalities
Let’s say we truncated our expanded polynomial from earlier at \(k\) terms, giving us
This is not going to be \(0\) anymore. If we choose \(k = 1\), then anything in the intersection will be \((-1)^k\).
To evaluate this more easily, notice that for any given \(x \in X\), each term of \(f(x)\) only depends on how many of \(E_1, \cdots E_n\) that \(x\) is in. By summing over all different combinations, we lose information about which subset that \(x\) was in.
Let \(B\) be all the \(i\) such that \(x \in E_i\). Examining the term \(\sum_{|A| = i} 1_{E_A}(x)\), each \(1_{E_A}\) is \(1\) iff \(A \subset B\). If \(A \not\subseteq B\), one element of \(A\) has \(x \not\in i\), so \(x \not\in E_A\). If \(A \subset B\), then \(x \in E_A\), so the indicator function holds. Thus \(\sum_{|A| = i} 1_{E_A}\) is precisely the number of subsets of size \(i\) contained in \(B\). If \(|B| = m\), then this term is \(\binom{m}{i}\).
So then for \(x\) in precisely \(m\) \(E_i\),
There is the slight caveat that once \(k > m\), \(x\) is not in any intersection of \(k\) subsets. So we take \(\binom{m}{k} = 0\) when \(k > m\). This just means that adding extra terms doesn’t really improve the situation. From now on, we will assume \(k \leq m\).
Counting with Inversions
We will find a closed form expression of \(f\) by pairing adjacent terms in such a way that they cancel each other out. We’ll need a perculiar function first.
Let \(i : \mathcal{P}(\{1, \ldots, m\}) \to \mathcal{P}(\{1, \cdots, m\})\) where \(\mathcal{P}(S)\) is the power-set of \(S\). We define \(i\) as follows:
Since \(i^2(A)\) first adds \(n\) then subtracts \(n\) if \(n \not\in A\), or subtracts \(n\) then adds \(n\) if \(n \in A\), \(i^2(A) = A\). This fact is why we calll \(i\) an inversion. It also implies that \(i\) is bijective.
But the really crucial property is that \(i\) takes subsets of \(\{1, \ldots, m\}\) from one term of \(f\) to another. A term of \(f(x)\), say \(\binom{m}{i}\), counts the number of subsets of a given size. Since \(i(A)\) has either one more or one less element of \(A\), counting the subset associated to \(i(A)\) cancels with the counting the subset \(A\) since adjacent terms are off by a \(-1\).
There is one exception. When we are looking at the subsets counted in the last term, \(\binom{m}{k}\), we may try to pair it with subsets that are one larger. Sadly those subsets are not being counted in \(f\). Luckily we can still count them. They are all the subsets not containing \(n\) but still of size \(k\). In other terms,
The Bonferroni Inequalities
Now that we have done all the counting inversion business, we really only need the sign:
This gives us
Finally taking the expectation of both sides gives us
Bonferroni Inequalities
As a special case, we have
None of this is original
This end result is an early exercise in Tao & Vu’s book Additive Combinatorics. It’s pretty obvious that it’s true, but figuring out a proof took me down this path.
See these resources:
- http://math.bme.hu/~gabor/oktatas/SztoM/TaoVu.AddComb.pdf for the original problem
- https://planetmath.org/proofofbonferroniinequalities for an alternate approach that I found confusing
- https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle to dig through for more about the principle of inclusion-exclusion
- https://math.stackexchange.com/questions/208766/how-to-prove-bonferroni-inequalities which led me to
- https://planetmath.org/proofofbonferroniinequalities for an alternate approach that I found confusing
- https://math.stackexchange.com/questions/306771/combinatorial-proof-of-combinatorial-identity-involving-1k-binom-n-1k where I got the nice inversion trick