Notes on Representation in PGMs.

Graphs

A graph, denoted $\mathcal{K}$ is a tuple of $\mathcal{X}$ and $\mathcal{E}$ where $\mathcal{X}=\{X_1,\ldots,X_n\}$ is the sets of nodes (or vertices) and $\mathcal{E}$ is the set of edges. \begin{equation} \mathcal{K}=(\mathcal{X},\mathcal{E}) \end{equation}

Nodes, Edges

Any pair of nodes $X_i,X_j$, for $i\neq j$ is connected by either a directed edge $X_i\rightarrow X_j$ or an undirected edge $X_i-X_j$1. We use the notation $X_i\rightleftharpoons X_j$ to denote that $X_i$ is connected to $X_j$ via some edge, whether directed (in any direction) or undirected.

If the graph contains directed edges only, we call it a directed graph, denoted $\mathcal{G}$, else if the graph established by undirected edge only, it is referred as undirected graph, denoted $\mathcal{H}$.

Graph example
Figure 1: (taken from the PGM book) Example of a partially directed graph $\mathcal{K}$

Following are some necessary notations:

  • If $X_i\rightarrow X_j\in\mathcal{E}$, we say that $X_i$ is the parent of $X_j$ while $X_j$ is the child of $X_i$.
    E.g. node $I$ is a child of nodes $C,E$ and $H$ while $D$ is a parent of $G$.
  • If $X_i-X_j\in\mathcal{E}$, we say that $X_i$ is a neighbor of $X_j$, and vice versa.
    E.g. node $F$ is a neighbor of $G$.
  • If $X\rightleftharpoons Y\in\mathcal{E}$, we say that $X$ and $Y$ are adjacent.
    E.g. nodes $A$ and $C$ are adjacent, while $D$ is adjacent to $E$.
  • We use $\text{Pa}_X$ to denote the set of parents of $X$, $\text{Ch}_X$ to denote the set of its children and $\text{Nb}_X$ to denote its neighbors. The set $\text{Boundary}_X\doteq\text{Pa}_X\cup\text{Nb}_X$ is known as the boundary of $X$.
    E.g. $\text{Pa}_I=\{C,E,H\}$ and $\text{Boundary}_F=\text{Pa}_F\cup\text{Nb}_F=\{C\}\cup\{G\}=\{C,G\}$.
  • The degree of a node $X$ is the number of edges in which it participates; its indegree is the number of directed edges $Y\rightarrow X$. The degree of a graph is the maximal degree of a node in the graph.
    E.g. node $D$ has degree of $3$, indegree of $0$; the graph $\mathcal{K}$ has degree of $3$.

Subgraphs

Consider the graph $\mathcal{K}=(\mathcal{X},\mathcal{E})$ and let $\mathbf{X}\subset\mathcal{X}$ be a subset of nodes in $\mathcal{K}$. Then:

  • The induced subgraph of $\mathcal{K}$, denoted $\mathcal{K}[\mathbf{X}]$ is defined as the graph $(\mathbf{X},\mathcal{E}')$ where \begin{equation} \mathcal{E}'=\{X\rightleftharpoons Y:X,Y\in\mathbf{X}\} \end{equation}
  • A subgraph over $\mathbf{X}$ is complete if every two nodes in $\mathbf{X}$ are connected via some edges. The set $\mathbf{X}$ is known as a clique; or even a maximal clique if for any set of nodes $\mathbf{Y}\supset\mathbf{X}$, $\mathbf{Y}$ is not a clique, i.e. \begin{equation} \{\mathbf{Y}\text{ clique}:\mathbf{Y}\supset\mathbf{X}\}=\emptyset \end{equation}
  • The set $\mathbf{X}$ is called upward closed in $\mathbf{K}$ if for any $X\in\mathbf{X}$, we have that \begin{equation} \text{Boundary}_X\subset\mathbf{X} \end{equation} The upward closure of $\mathbf{X}$ is the minimal closed subset $\mathbf{Y}$ covering $\mathbf{X}$, i.e. \begin{equation} \mathbf{Y}=\sup\{\bar{\mathbf{Y}}\text{ upward closed in }\mathcal{K}:\bar{\mathbf{Y}}\supset\mathbf{X}\} \end{equation} The upwardly closed subgraph of $\mathbf{X}$, denoted $\mathcal{K}^+[\mathbf{X}]$, is the induced subgraph over $\mathbf{Y}$, $\mathcal{K}[\mathbf{Y}]$.

Paths, Trails

Consider the graph $\mathcal{K}=(\mathcal{X},\mathcal{E})$, the basic notion of edges gives rise to following definitions:

  • $X_1,\ldots,X_k$ form a path in $\mathcal{K}$ if for every $i=1,\ldots,k-1$, we have that either $X_i\rightarrow X_{i+1}$ or $X_i-X_{i+1}$. A path is directed if there exists a directed edge $X_i\rightarrow X_{i+1}$.
  • $X_1,\ldots,X_k$ form a trail in $\mathcal{K}$ if for every $i=1,\ldots,k-1$, we have that $X_i\rightleftharpoons X_{i+1}$.
  • $\mathcal{K}$ is connected if for every pair $X_i,X_j$ there is a trail between $X_i$ and $X_j$.
  • $X$ is an ancestor of $Y$ and correspondingly $Y$ is a descendant of $X$ in $\mathcal{K}$ if there exists a directed path $X_1,\ldots,X_k$ with $X_1=X$ and $X_k=Y$.
  • An ordering of nodes $X_1,\ldots,X_n$ is a topological ordering relative to $\mathbf{K}$ if whenever we have $X_i\rightarrow X_j\in\mathcal{E}$, then $i\lt j$. This gives rise to critical results that for each node $X_i$, we have that \begin{equation} \text{Pa}_{X_i}\subset\{X_1,\ldots,X_{i-1}\}, \end{equation} and \begin{equation} \text{Ch}_{X_i}\subset\{X_{i+1},\ldots,X_n\} \end{equation}

Cycles, Loops

  • A cycle in $\mathcal{K}$ is a directed path $X_1,\ldots,X_k$ where $X_1=X_k$. $\mathcal{K}$ is acyclic if it contains no cycles.
  • $\mathcal{K}$ is a directed acyclic graph (or DAG) if it is both directed and acyclic.
  • An acyclic graph containing both directed and undirected edges is known as a partially directed acyclic graph (or PDAG).
    Let $\mathcal{K}$ be a PDAG over $\mathcal{X}$ and let $\mathbf{K}_1,\ldots,\mathbf{K}_\ell$ be a disjoint partition of $\mathcal{X}$ such that:
    • the induced subgraph over $\mathbf{K}_i$ contains no directed edges;
    • for any pair of nodes $X\in\mathbf{K}_i$ and $Y\in\mathbf{K}_j$ for $i\lt j$, an edge between $X$ and $Y$ can only be a directed edge $X\rightarrow Y$.
    Each component $\mathbf{K}_i$ is called a chain component, while $\mathcal{K}$ is referred as a chain graph.
  • A loop in $\mathcal{K}$ is a trail $X_1,\ldots,X_k$ where $X_1=X_k$. A graph is singly connected if it contains no loops. A node in a singly connected graph is called a leaf if it has exactly one adjacent node.
  • A singly connected directed graph is called a polytree, while a singly connected undirected graph is known as a forest; if it is also connected, it is called a tree.
  • A directed graph is a forest if each node has at most one parent. A directed forest is a tree if it is also connected.
  • Let $X_1-\ldots-X_k-X_1$ be a loop in the graph, a chord in the loop is an edge connecting two nonconsecutive nodes $X_i$ and $X_j$. An undirected graph $\mathcal{H}$ is known as chordal if any loop $X_1-\ldots-X_k-X_1$ has a chord for $k\geq 4$. This implies that in a chordal graph, the longest minimal loop (the one has no shortcut) has length of $3$.

Directed Graphical Model

A Directed Graphical Model (or Bayesian network) is a tuple $\mathcal{B}=(\mathcal{G},P)$ where

  • $\mathcal{G}$ is a Bayesian network structure,
  • $P$ factorizes according to $\mathcal{G}$,
  • $P$ is specified as a set of CPDs associated with $\mathcal{G}$’s nodes.

Bayesian Network Structure

A Bayesian network structure (or Bayesian network graph, BN graph) is a DAG, denoted $\mathcal{G}=(\mathcal{X},\mathcal{E})$ with $\mathcal{X}=\{X_1,\ldots,X_n\}$ where

  • Each node $X_i\in\mathcal{X}$ represents a random variable.
  • Each node $X_i\in\mathcal{X}$ is associated with a conditional independencies assumption, called local independencies, denoted $\mathcal{I}_\ell(\mathcal{G})$, which says that $X_i$ is conditionally independent of its non-descendants given its parent, i.e. \begin{equation} (X_i\perp\text{NonDescendants}_{X_i}\vert\hspace{0.1cm}\text{Pa}_{X_i}), \end{equation} where $\text{NonDescendants}_{X_i}$ denotes the set of non-descendant nodes of $X_i$.

I-Maps

Let $P$ be a distribution over $\mathcal{X}$, we define $\mathcal{I}(P)$ to be the set of independence assertions of the form $(X\perp Y\hspace{0.1cm}\vert Z)$ that hold in $P$, i.e. \begin{equation} P\models(X\perp Y\vert Z), \end{equation} or \begin{equation} P(X,Y\vert Z)=P(X\vert Z)P(Y\vert Z) \end{equation} Let $\mathcal{K}$ be a graph associated with a set of independencies $\mathcal{I}(\mathcal{K})$, then $\mathcal{K}$ is an I-map (for independence map) for a set of independencies $\mathcal{I}$ if $\mathcal{I}(\mathcal{K})\subset\mathcal{I}$.

Hence, if $P$ satisfies the local dependencies associated with $\mathcal{G}$, we have \begin{equation} \mathcal{I}_\ell(\mathcal{G})\subset\mathcal{I}(P), \end{equation} which implies that $\mathcal{G}$ is an I-map for $\mathcal{I}(P)$, or simply an I-map for $P$.

Factorization

Let $\mathcal{G}$ is a BN graph over $X_1,\ldots,X_n\in\mathcal{X}$. A distribution $P$ over $\mathcal{X}$ is said to factorize according to $\mathcal{G}$ if $P$ can be expressed as a product \begin{equation} P(X_1,\ldots,X_n)=\prod_{i=1}^{n}P(X_i\vert\text{Pa}_{X_i}) \end{equation} This equation is known as the chain rule for Bayesian networks. Each individual factor $P(X_i\vert\text{Pa}_{X_i})$, which is a conditional probability distribution (CPD), is called the local probabilistic model.

I-Map - Factorization Connection

Theorem 1: Let $\mathcal{G}$ be a BN graph over a set of random variables $\mathcal{X}$ and let $P$ be a joint distribution over $\mathcal{X}$. Then $\mathcal{G}$ is an I-map for $P$ if and only if $P$ factorizes over $\mathcal{G}$.

Proof

  • I-map $\Rightarrow$ Factorization
    Without loss of generality, let $X_1,\ldots,X_n$ be a topological ordering of the variables in $\mathcal{X}$.
    Let us consider an arbitrary $X_i$ for $i\in\{1,\ldots,n\}$. As mentioning above, the topological ordering implies that \begin{align} \text{Pa}_{X_i}&\subset\{X_1,\ldots,X_{i-1}\}, \\ \text{Ch}_{X_i}&\subset\{X_{i+1},\ldots,X_n\} \end{align} Consequently, none of descendants of $X_i$ is in $\{X_1,\ldots,X_{n-1}\}$. Thus, if we denote the set of all non-descendant nodes of $X_i$ as $\text{NonDescentdants}_{X_i}$ and let $\mathbf{Z}\subset\text{NonDescentdants}_{X_i}$, then \begin{equation} \mathbf{Z}\cup\text{Pa}_{X_i}=\{X_1,\ldots,X_{i-1}\} \end{equation} Moreover, the local independence for $X_i$ implies that \begin{equation} (X_i\perp\mathbf{Z}\vert\text{ Pa}_{X_i}) \end{equation} Therefore, since $\mathcal{G}$ is an I-map for $P$ we obtain \begin{equation} P(X_i\vert X_1,\ldots,X_{i-1})=P(X_i\vert\text{Pa}_{X_i}) \end{equation} Thus, by the chain rule for probabilities, we have \begin{equation} P(X_1,\ldots,X_n)=\prod_{i=1}^{n}P(X_i\vert X_1,\ldots,X_{i-1})=\prod_{i=1}^{n}P(X_i\vert\text{Pa}_{X_i}) \end{equation}
  • I-map $\Leftarrow$ Factorization
    To prove that $\mathcal{G}$ is an I-map according to $P$, we need to show that $\mathcal{I}_\ell(\mathcal{G})$ holds in $P$. Consider an arbitrary node $X_i$ and the local independencies $(X_i\perp\text{NonDescendants}_{X_i}\vert\text{Pa}_{X_i})$, our problem remains to prove that \begin{equation} P(X_i\vert\mathcal{X}\backslash X_i)=P(X_i\vert\text{Pa}_{X_i}) \end{equation} since \begin{equation} P(X_i\vert\mathcal{X}\backslash X_i)=P(X_i\vert\text{NonDescendants}_{X_i}\cup\text{Pa}_{X_i}) \end{equation} By factorization, we have that \begin{align} P(\mathcal{X}\backslash X_i)&=\sum_{X_i}P(X_1,\ldots,X_n) \\ &=\sum_{X_i}\prod_{j=1}^{n}P(X_j\vert\text{Pa}_{X_j}) \\ &=\left(\prod_{j=1,j\neq i}^{n}P(X_j\vert\text{Pa}_{X_j})\right)\sum_{X_i}P(X_i\vert\text{Pa}_{X_i}) \\ &=\prod_{j=1,j\neq i}^{n}P(X_j\vert\text{Pa}_{X_j}), \end{align} where in the last step, we use the fact that the conditional probability function $P(X_i\vert\text{Pa}_{X_i})$ sum to $1$ over the sample space of $X_i$. This implies that by Bayes rules \begin{align} P(X_i\vert\mathcal{X}\backslash X_i)&=\frac{P(X_1,\ldots,X_n)}{P(\mathcal{X}\backslash X_i)} \\ &=\frac{\prod_{j=1}^{n}P(X_j\vert\text{Pa}_{X_j})}{\prod_{j=1,j\neq i}^{n}P(X_j\vert\text{Pa}_{X_j})} \\ &=P(X_i\vert\text{Pa}_{X_i}) \end{align}

Independencies in Bayesian Network

D-separation

Let $\mathcal{G}$ be a BN structure, $X_1\rightleftharpoons\ldots\rightleftharpoons X_n$ be a trail in $\mathcal{G}$ and let $\mathbf{Z}$ be a subset of observed variables. The trail $X_1\rightleftharpoons\ldots\rightleftharpoons X_n$ is active, i.e. influence can flow from $X_1$ to $X_n$ via $X_2,\ldots,X_{n-1}$, if

  • Whenever we have a v-structure $X_{i-1}\rightarrow X_i\leftarrow X_{i+1}$, $X_i$ or one of its descendants are in $\mathbf{Z}$;
  • No other node along the trail are in $\mathbf{Z}$.
Two-edge trails
Figure 2: (taken from the PGM book) The four possible two-edge trails from $X$ to $Y$ via $Z$: (a) Causal trail; (b) Evidential trail; (c) Common cause trail; (d) Common effect trail

Consider the trails forming from two edges as illustrated above:

  • The trail $X\rightarrow Z\rightarrow Y$ is active $\Leftrightarrow$ $Z$ is not observed.
  • The trail $X\leftarrow Z\leftarrow Y$ is active $\Leftrightarrow$ $Z$ is not observed.
  • The trail $X\leftarrow Z\rightarrow Y$ is active $\Leftrightarrow$ $Z$ is not observed.
  • The trail $X\rightarrow Z\leftarrow Y$ is active $\Leftrightarrow$ either $Z$ or one of its descendants is observed.

Let $\mathbf{X},\mathbf{Y},\mathbf{Z}$ be sets of nodes in $\mathcal{G}$. Then $\mathbf{X}$ and $\mathbf{Y}$ are said to be d-separated given $\mathbf{Z}$, denoted $\text{d-sep}_\mathcal{G}(\mathbf{X};\mathbf{Y}\vert\mathbf{Z})$, if there is no active trail between any node $X\in\mathbf{X}$ and $Y\in\mathbf{Y}$ given $\mathbf{Z}$.

We define the global Markov independencies associated with $\mathcal{G}$, denoted $\mathcal{I}(\mathcal{G})$, to be the set of all independencies that correspond to d-separation in $\mathcal{G}$ \begin{equation} \mathcal{I}(\mathcal{G})=\big\{(\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z}):\text{d-sep}_\mathcal{G}(\mathbf{X};\mathbf{Y}\vert\mathbf{Z})\big\} \end{equation}

Soundness, Completeness

Theorem 2 (Soundness of d-separation) If a distribution $P$ factorizes according to $\mathcal{G}$, then \begin{equation} \mathcal{I}(\mathcal{G})\subset\mathcal{I}(P) \end{equation} The soundness property says that if $\text{d-sep}_\mathcal{G}(X;Y\vert\mathbf{Z})$ then they are conditional independent given $\mathbf{Z}$, or $(X\perp Y\vert\mathbf{Z})$.

Theorem 3 (Completeness of d-separation) If two variables $X$ and $Y$ are independent given $\mathbf{Z}$, then they are d-separated.

The completeness property says that d-separation detects all possible independencies.

Theorem 4: Let $\mathcal{G}$ be a BN graph. If $X$ and $Y$ are not d-separated given $\mathbf{Z}$, then $X$ and $Y$ are dependent given $\mathbf{Z}$ in some distribution $P$ that factorizes over $\mathcal{G}$.

Theorem 5: For almost all distributions $P$ that factorize over $\mathcal{G}$, i.e. for all distributions except for a set of measure zero in the space of CPD parameterizations, we have that \begin{equation} \mathcal{I}(\mathcal{G})=\mathcal{I}(P) \end{equation} These results state that for almost all parameterizations $P$ of the graph $\mathcal{G}$, the d-separation test precisely characterizes the independencies that hold for $P$.

I-Equivalence

Two graph $\mathcal{K}_1$ and $\mathcal{K}_2$ over $\mathcal{X}$ are said to be I-equivalent if they encode the same set of conditional independencies assertions, i.e. \begin{equation} \mathcal{I}(\mathcal{K}_1)=\mathcal{I}(\mathcal{K}_2) \end{equation} This implies that any distribution $P$ that factorizes over $\mathcal{K}_1$ also factorizes according to $\mathcal{K}_2$ and vice versa.

The skeleton of a BN graph $\mathcal{G}$ over $\mathcal{X}$ is an undirected graph over $\mathcal{X}$ containing an edge $\{X,Y\}$ for every edge $(X,Y)$ in $\mathcal{G}$.

Theorem 6 (skeleton + v-structures $\Rightarrow$ I-equivalence) Let $\mathcal{G}_1$ and $\mathcal{G_2}$ be two graphs over $\mathcal{X}$. If $\mathcal{G}_1,\mathcal{G}_2$ both have the same skeleton and the same set of v-structures then they are I-equivalent.2

I-equivalent graphs
Figure 3: (taken from the PGM book) Two graphs have the same skeleton and set of v-structures, i.e. $\{X\rightarrow Y\leftarrow Z\}$, and thus are I-equivalent
Immorality

A v-structure $X\rightarrow Z\leftarrow Y$ is an immorality if there is no direct edge between. If there is such an edge, it is called a covering edge for the v-structure.

It is easily seen that not every v-structure is an immorality, which implies that two networks with the same set of immoralities do not necessarily have the same set of v-structures.

Theorem 7 (skeleton + immoralities $\Leftrightarrow$ I-equivalence) Let $\mathcal{G}_1$ and $\mathcal{G_2}$ be two graphs over $\mathcal{X}$. If $\mathcal{G}_1,\mathcal{G}_2$ both have the same skeleton and the same set of immoralities iff they are I-equivalent.

Proof

  • To prove the theorem, we first introduce the notion of minimal active trail and triangle.
    Definition (Minimal active trail) An active trail $X_1,\ldots,X_m$ is minimal if there is no other active trail from $X_1$ to $X_m$ that shortcuts some of the nodes, i.e. there is no active trail \begin{equation} X_1\rightleftharpoons X_{i_1}\rightleftharpoons\ldots X_{i_k}\rightleftharpoons X_m\hspace{1cm}\text{for }1\lt i_1\lt\ldots\lt i_k\lt m \end{equation} Definition (Triangle) Any three consecutive nodes in a trail $X_1,\ldots,X_m$ are called a triangle if their skeleton is fully connected, i.e. forms a 3-clique.

    Our attention now is to prove that the only possible triangle in minimal active trail is the one having form of $X_{i-1}\leftarrow X_i\rightarrow X_{i+1}$ and either $X_{i-1}\rightarrow X_{i+1}$ or $X_{i-1}\leftarrow X_{i+1}$.
    Consider a two-edge trail from $X_{i-1}$ to $X_{i+1}$ via $X_i$, which as being mentioned above, has four possible forms
    • $X_{i-1}\rightarrow X_i\rightarrow X_{i+1}$
      It is easily seen that $X_i$ has to be not observed to make the trail active. If $X_{i-1}$ is connected to $X_{i+1}$ via $X_{i-1}\rightarrow X_{i+1}$, this gives rise to a shortcut. On the other hand, if they are connected by $X_{i-1}\leftarrow X_{i+1}$, the triangle now induces a cycle.
    • $X_{i-1}\leftarrow X_i\leftarrow X_{i+1}$
      This case is symmetrically identical to the previous one, and thus is not viable.
    • $X_{i-1}\leftarrow X_i\rightarrow X_{i+1}$
      The first observation is that $X_i$ has to be not given. The second observation is $X_{i-1}$ and $X_{i+1}$ are symmetric through $X_i$, so we only need to consider some specific cases of $X_{i-1}$ and the same logic is applied to $X_{i+1}$ analogously.
      Let us examine the two-edge trail $X_{i-2},X_{i-1},X_i$. On the one hand, if we have $X_{i-2}\rightarrow X_{i-1}$, $X_{i-1}$ then has to be given, which implies that
      • If $X_{i-1}\leftarrow X_{i+1}$ exists, it will create a shortcut, which is not allowed.
      • If $X_{i-1}\rightarrow X_{i+1}$ exists, no shortcut appears, $X_{i-1},X_i,X_{i+1}$ satisfies the condition of a triangle in the minimal active trail $X_1,\ldots,X_m$.
      On the other hand, if we have $X_{i-2}\leftarrow X_{i-1}$, then $X_{i-1}$ is not observed, analogously, we instead have
      • If $X_{i-1}\leftarrow X_{i+1}$ exists, no shortcut is formed, $X_{i-1},X_i,X_{i+1}$ create a triangle.
      • If $X_{i-1}\rightarrow X_{i+1}$ exists, $X_{i-1},X_i,X_{i+1}$ is do not form a triangle due to the appearance of a shortcut through $X_{i-1}$ to $X_{i+1}$.
    • $X_{i-1}\rightarrow X_i\leftarrow X_{i+1}$
      In this case, $X_i$ or one of its descendant is observed. Using the similar procedure to previous case gives us no viable triangle formed by $X_{i-1},X_i,X_{i+1}$.
    Given these results, we are now ready for the main part. Let us begin with the forward path.
  • Skeleton + Immoralities $\Rightarrow$ I-equivalence
    Assume that there exists node $X,Y,Z$ such that \begin{align} (X\perp Y\vert Z)&\in\mathcal{I}(\mathcal{G}_1), \\ (X\perp Y\vert Z)&\not\in\mathcal{I}(\mathcal{G}_2), \end{align} which implies that there is an active trail through $X,Y$ and $Z$ in the graph $\mathcal{G}_2$. Let us consider the minimal one and continue by examining two cases that whether $Z$ is observed.
    • If $Z$ is observed, in $\mathcal{G}_1$, we have $X\rightarrow Z\rightarrow Y$, or $X\leftarrow Z\leftarrow Y$, or $X\leftarrow Z\rightarrow Y$, while we have $X\rightarrow Z\leftarrow Y$ in $\mathcal{G}_2$, which is a v-structure. To assure that both graphs have the same set of moralities, there exist an edge that directly connects $X$ and $Y$, or in other words, $X,Y,Z$ form a triangle. This contradicts to the claim we have proved in the previous part.
    • If $Z$ is not observed, thus in $\mathcal{G}_1$, $X,Y,Z$ now must form a v-structure $X\rightarrow Z\leftarrow Y$. And also, to guarantee that both graphs have the same moralities, there exists an edge, without loss of generality, we assume $X\rightarrow Y$. However, this edge will active the trail $X,Y,Z$, or in other words, in $\mathcal{G}_1$, we now have $(X\not\perp Y\vert Z)$, which is a contradiction of our assumption.
  • Skeleton + Immoralities $\Leftarrow$ I-equivalence
    Consider two I-equivalent graphs $\mathcal{G}_1$ and $\mathcal{G}_2$.
    • First assuming that they do not have that same skeleton. This implies without loss of generality that there exists a trail in $\mathcal{G}_1$ that does not appear in $\mathcal{G}_2$, which induces a conditional independence in $\mathcal{G}_1$ but not in $\mathcal{G}_2$, contradicts to the fact that they two graphs are I-equivalent.
    • Now assuming that two graphs do not have the same set of moralities.

From Distributions to Graphs

Given a distribution $P$, how can we represent the independencies of $P$ with a graph $\mathcal{G}$?

Minimal I-Maps

A graph $\mathcal{K}$ is a minimal I-map for a set of independencies $\mathcal{I}$ if it is an I-map for $\mathcal{I}$, and removing one edge from $\mathcal{K}$ makes it no longer be an I-map.

Perfect Maps

A graph $\mathcal{K}$ is a perfect map (or P-map) for a set of independencies $\mathcal{I}$ if we have that $\mathcal{I}(\mathcal{K})=\mathcal{I}$; and if $\mathcal{I}(\mathcal{K})=\mathcal{I}(P)$, $\mathcal{K}$ is said to be a perfect map for $P$.

Undirected Graphical Model

Similar to the directed case, each node in an undirected graphical model represents a random variable. However, as indicated from the name, each edge that connects two nodes is now undirected, and thus can not describe causal relationship between those nodes as in Bayesian network.

Markov Networks

Factors

Let $\mathbf{D}$ be a set of r.v.s. The function $\phi:\text{Val}(\mathbf{D})\mapsto\mathbb{R}$ is referred as a factor with the scope $\mathbf{D}$, denoted $\mathbf{D}=\text{Scope}(\phi)$.

A factor is nonnegative if all of its entries are nonnegative.

Let $\mathbf{X},\mathbf{Y},\mathbf{Z}$ be disjoint sets of variables, and let $\phi_1(\mathbf{X},\mathbf{Y})$ and $\phi_2(\mathbf{Y},\mathbf{Z})$ be factors. The product $\phi_1\times\phi_2$ is called a factor product, which is a factor $\psi:\text{Val}(\mathbf{X},\mathbf{Y},\mathbf{Z})\mapsto\mathbb{R}$, given as \begin{equation} \psi(\mathbf{X},\mathbf{Y},\mathbf{Z})=\phi_1(\mathbf{X},\mathbf{Y})\cdot\phi_2(\mathbf{Y},\mathbf{Z}) \end{equation}

Factor product
Figure 4: (taken from the PGM book) An example of factor product

It is worth remarking that both CPDs and joint distribution are factors. As each Bayesian network define a joint distribution factor, which is the product of the CPDs factor. Specifically, let $\phi_{X_i}(X_i,\text{Pa}_{X_i})$ denote $P(X\vert\text{Pa}_{X_i})$, we can write \begin{equation} P(X_1,\ldots,X_n)=\prod_{i=1}^{n}\phi_{X_i} \end{equation}

Markov Random Fields

Given the notions of factor and factor product, we are now ready to define an undirected parameterization of a distribution.

An Undirected Graphical Model (or Markov Random Field, or Markov network) is defined by an undirected graph $\mathcal{H}$ and a probability distribution $P_\Phi$ parameterized by a set of factors $\Phi=\{\phi_1(\mathbf{D}_1),\ldots,\phi_K(\mathbf{D}_K)\}$ over variables $X_1,\ldots,X_n$ such that \begin{equation} P_\Phi(X_1,\ldots,X_n)=\frac{1}{Z}\tilde{P}_\Phi(X_1,\ldots,X_n), \end{equation} where

  • Each node of $\mathcal{H}$ correspond to a variable $X_i$.
  • The factor product \begin{equation} \tilde{P}_\Phi(X_1,\ldots,X_n)=\phi_1(\mathbf{D}_1)\times\ldots\times\phi_K(\mathbf{D}_K) \end{equation} is an unnormalized measure.
  • $Z$ is a normalizing constant called the partition function, given by \begin{equation} Z=\sum_{X_1,\ldots,X_n}\tilde{P}_\Phi(X_1,\ldots,X_n) \end{equation}
  • $P_\Phi$ is also called a Gibbs distribution, which factorizes over $\mathcal{H}$, in the sense that each $\mathbf{D}_k$ for $k=1,\ldots,K$ is a complete subgraph (or clique) of $\mathcal{H}$.
  • The factors $\phi_1,\ldots,\phi_K$ that parameterize $\mathcal{H}$ are referred as clique potentials, or potential functions of $\mathcal{H}$.

Reduced Markov Networks

Consider the task of conditioning a distribution on some assignment $\mathbf{u}$ to some subset of variables $\mathbf{U}$. This task corresponds to the process

  • Step 1. Eliminate all entries in the joint distribution that are inconsistent with the event $\mathbf{U}=\mathbf{u}$.
  • Step 2. Normalize the remaining entries to sum to $1$.

Consider the case that the distribution is in form of $P_\Phi$ for some set of factor $\Phi$.

Factor Reduction

Let $\phi(\mathbf{Y})$ be a factor, and let $\mathbf{U}=\mathbf{u}$ be an assignment for $\mathbf{U}\subset\mathbf{Y}$. The reduction of the factor $\phi$ to the context $\mathbf{U}=\mathbf{u}$, denoted $\phi[\mathbf{U}=\mathbf{u}]$, or $\phi[\mathbf{u}]$ for short, is defined to be a factor over scope $\mathbf{Y}’=\mathbf{Y}\backslash\mathbf{U}$, such that \begin{equation} \phi[\mathbf{u}](\mathbf{y}’)=\phi(\mathbf{y}’,\mathbf{u}) \end{equation} For $\mathbf{U}\not\subset\mathbf{Y}$, we define $\phi[\mathbf{u}]$ to be $\phi[\mathbf{U}’=\mathbf{u}’]$ where

  • $\mathbf{U}’=\mathbf{U}\cap\mathbf{Y}$;
  • $\mathbf{u}’\doteq\mathbf{u}[\mathbf{U}’]$, where $\mathbf{u}[\mathbf{U}’]$ denotes the assignment in $\mathbf{u}$ to the variable in $\mathbf{U}’$.

Example 1: Consider the factor $\phi$ with $\text{Scope}(\phi)=\{A,B,C\}$, as given in the right-most table of Figure 4. The reduction of $\phi$ to the context $C=c^1$ is a factor over scope $\{A,B,C\}\backslash\{C\}=\{A,B\}$, given by \begin{equation} \phi[c^1](a,b)=\phi(a,b,c^1), \end{equation} which is illustrated in the following table

Factor reduction
Figure 5: (taken from the PGM book) Factor reduction: The factor computed in Figure 4, reduced to the context $C=c^1$.

With the definition of factor reduction, let us consider a product of factors. We have that an entry in the product is consistent with $\mathbf{u}$ iff it is a product of entries that are all consistent with $\mathbf{u}$.

Reduced Gibbs Distribution

Let $P_\Phi$ be a Gibbs distribution parameterized by $\Phi=\{\phi_1,\ldots,\phi_K\}$, and let $\mathbf{u}$ be a context. The reduced Gibbs distribution $P_\Phi[\mathbf{u}]$ is the Gibbs distribution defined by the set of factors \begin{equation} \Phi[\mathbf{u}]=\{\phi_1[\mathbf{u}],\ldots,\phi_K[\mathbf{u}]\} \end{equation}

Proposition 8: Let $P_\Phi(\mathbf{X})$ be a Gibbs distribution, we then have \begin{equation} P_\Phi[\mathbf{u}]=P_\Phi(\mathbf{W}\vert\mathbf{u}), \end{equation} where $\mathbf{W}=\mathbf{X}\backslash\mathbf{U}$.

Reduced Markov Network

Let $\mathcal{H}$ be a Markov network over the nodes $\mathbf{X}$ and $\mathbf{U}=\mathbf{u}$ be a context. The reduced Markov network $\mathcal{H}[\mathbf{u}]$ is a Markov network over the nodes $\mathbf{W}=\mathbf{X}\backslash\mathbf{U}$, where we have an edge $X-Y$ if there is an edge $X-Y$ in $\mathcal{H}$.

Proposition 9: Let $P_\Phi(\mathbf{X})$ be a Gibbs distribution that factorizes over $\mathcal{H}$, and let $\mathbf{U}=\mathbf{u}$ be a context. Then we have that $P_\Phi[\mathbf{u}]$ factorizes over $\mathcal{H}[\mathbf{u}]$.

Reduced Markov network
Figure 6: (taken from the PGM book) An example of a Markov network and the reduction of its factors to some contexts (a) The initial Markov network; (b) The reduced network to the context $G=g$; (c) The reduced network to the context $G=g,S=s$.

Independencies in Markov Network

Analogy to Bayesian network, the graph structure in a Markov Random Field can be viewed as encoding a set of independence assumptions, which can be specified by considering the undirected paths through nodes.

Separation

Let $\mathcal{H}$ be a Markov network structure and let $X_1-\ldots-X_k$ be a path in $\mathcal{H}=(\mathcal{X},\mathcal{E})$. Let $\mathbf{Z}\subset\mathcal{X}$ be a set of observed variables. Then $X_1-\ldots-X_k$ is active given $\mathbf{Z}$ if none of $X_1,\ldots,X_k$ is in $\mathbf{Z}$.

Let $\mathbf{X},\mathbf{Y}$ be set of nodes in $\mathcal{H}$. We say that $\mathbf{Z}$ separates $\mathbf{X}$ and $\mathbf{Y}$, denoted $\text{sep}_\mathcal{H}(\mathbf{X};\mathbf{Y}\vert\mathbf{Z})$ if there is no active path between any node $X\in\mathbf{X}$ and $Y\in\mathbf{Y}$ given $\mathbf{Z}$.

Global Markov Independencies

As in the case of Bayesian network, we define the global Markov independencies associated with $\mathcal{H}$, denoted $\mathcal{I}(\mathcal{H})$, to be the set of all independencies correspond to separation in $\mathcal{H}$ \begin{equation} \mathcal{I}(\mathcal{H})=\big\{(\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z}):\text{sep}_\mathcal{H}(\mathbf{X};\mathbf{Y}\vert\mathbf{Z})\big\} \end{equation}

Local Markov Independencies

Let $\mathcal{H}$ be a Markov network. We define the pairwise independencies associated with $\mathcal{H}$ to be \begin{equation} \mathcal{I}_p(\mathcal{H})=\big\{(X\perp Y\vert\mathcal{X}\backslash\{X,Y\}):X-Y\notin\mathcal{H}\big\} \end{equation} Or in other words, the pairwise independencies states that $X$ and $Y$ are independent given all other nodes in $\mathcal{H}$.

For a given graph $\mathcal{H}$ and for an arbitrary node $X$ of $\mathcal{H}$, the set of neighbors of $X$ is also called the Markov blanket of $X$, denoted $\text{MB}_\mathcal{H}(X)$. With this notion, we define the local independencies, or Markov local independencies, associated with $\mathcal{H}$ to be \begin{equation} \mathcal{I}_\ell(\mathcal{H})=\big\{\big(X\perp\mathcal{X}\backslash(\{X\}\cup\text{MB}_\mathcal{H}(X))\vert\text{MB}_\mathcal{H}(X)\big):X\in\mathcal{X}\big\} \end{equation} Or in other words, the Markov local independencies says that $X$ is independent of the rest of the nodes in $\mathcal{H}$ given its neighbors.

The definition of Markov blanket can also be rewritten using independencies assertions:
A set $\mathbf{U}$ is a Markov blanket of $X$ in a distribution $P$ if $X\notin\mathbf{U}$ and if $\mathbf{U}$ is a minimal set of nodes such that \begin{equation} \big(X\perp\mathcal{X}\backslash(\{X\}\cup\mathbf{U})\vert\mathbf{U}\big)\in\mathcal{I}(P) \end{equation}

Markov Independencies Relationships

Theorem 10: Let $\mathcal{H}$ be a Markov network and $P$ be a positive distribution. The following three statement are then equivalent:

  • $P\models\mathcal{I}_\ell(\mathcal{H})$.
  • $P\models\mathcal{I}_p(\mathcal{H})$.
  • $P\models\mathcal{I}(\mathcal{H})$.

Proof

  • (i) $\Rightarrow$ (ii)
    Consider an arbitrary node $X$ in $\mathcal{H}$. Let $Y\in\mathcal{X}$ such that $X-Y\notin\mathcal{H}$, then $Y\notin\text{MB}_\mathcal{H}(X)$, or in other words \begin{equation} Y\in\mathcal{X}\backslash(\{X\}\cup\text{MB}_\mathcal{H}(X)) \end{equation} Moreover, since $P\models\mathcal{I}_\ell(\mathcal{H})$, we have that $P$ satisfies \begin{equation} \big(X\perp\mathcal{X}\backslash(\{X\}\cup\text{MB}_\mathcal{H}(X))\vert\text{MB}_\mathcal{H}(X)\big), \end{equation} which implies that \begin{equation} \big(X\perp Y\vert\text{MB}_\mathcal{X}\cup\mathcal{X}\backslash(\{X,Y\}\cup\text{MB}_\mathcal{H}(X))\big) \end{equation} holds for $P$. Or in other words, for any $X,Y$ such that $X-Y\notin\mathcal{H}$, we have \begin{equation} P\models(X\perp Y\vert\mathcal{X}\backslash\{X,Y\}), \end{equation} which proves our claim.
  • (ii) $\Rightarrow$ (iii)
  • (iii) $\Rightarrow$ (i)
    This follows directly from the fact that if two nodes $X$ and $Y$ are not connected, then they are necessarily separated by all remaining nodes of the graph.

Soundness, Completeness

Theorem 11 (Soundness of separation) Let $\mathcal{H}=(\mathcal{X},\mathcal{E})$ be a Markov network structure and let $P$ be a distribution over $\mathcal{X}$. If $P$ is a Gibbs distribution that factorizes over $\mathcal{H}$, then $\mathcal{H}$ is an I-map for $P$.

Proof
Let $\mathbf{X},\mathbf{Y}$ and $\mathbf{Z}$ be any disjoint subsets of $\mathcal{X}$ such that $\text{sep}_\mathcal{H}(\mathbf{X};\mathbf{Y}\vert\mathbf{Z})$. We need to show that \begin{equation} P\models(\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z}) \end{equation} We begin by considering the case that $\mathbf{X}\cup\mathbf{Y}\cup\mathbf{Z}=\mathcal{X}$. Since $\mathbf{Z}$ separates $\mathbf{X}$ and $\mathbf{Y}$, then for any $X\in\mathbf{X}$ and for any $Y\in\mathbf{Y}$, there is no direct edge between $X,Y$. This implies that any clique in $\mathcal{H}$ is either in $\mathbf{X}\cup\mathbf{Z}$ or in $\mathbf{Y}\cup\mathbf{Z}$.

Let $I_\mathbf{X}$ denote the indices of the set of cliques within $\mathbf{X}\cup\mathbf{Z}$ and let $I_\mathbf{Y}$ represent the indices of the ones in $\mathbf{Y}\cup\mathbf{Z}$. Thus, as $P$ factorizes over $\mathcal{H}$, we have that \begin{align} P(X_1,\ldots,X_n)&=\frac{1}{Z}\prod_{i}\phi_i(\mathbf{D}_i) \\ &=\frac{1}{Z}\left(\prod_{i\in I_\mathbf{X}}\phi_i(\mathbf{D}_i)\right)\cdot\left(\prod_{i\in I_\mathbf{Y}}\phi_i(\mathbf{D}_i)\right) \\ &=\frac{1}{Z}\phi_\mathbf{X}(\mathbf{X},\mathbf{Z})\phi_\mathbf{Y}(\mathbf{Y},\mathbf{Z}), \end{align} where $\phi_\mathbf{X},\phi_\mathbf{Y}$ are some factors with scopes $\mathbf{X}\cup\mathbf{Z}$ and $\mathbf{Y}\cup\mathbf{Z}$ respectively. Hence, it follows that \begin{equation} P\models(\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z}) \end{equation} Now consider the case that $\mathbf{X}\cup\mathbf{Y}\cup\mathbf{Z}\subset\mathcal{X}$. Let $\mathbf{U}=\mathcal{X}\backslash(\mathbf{X}\cup\mathbf{Y}\cup\mathbf{Z})$. We can divide $\mathbf{U}$ into two disjoint sets $\mathbf{U}_1$ and $\mathbf{U}_2$ such that \begin{equation} \text{sep}_\mathcal{H}(\mathbf{X}\cup\mathbf{U}_1;\mathbf{Y}\cup\mathbf{U}_2\vert\mathbf{Z}) \end{equation} And since $(\mathbf{X}\cup\mathbf{U}_1)\cup(\mathbf{Y}\cup\mathbf{U}_2)\cup\mathbf{Z}=\mathcal{X}$, we can follow the previous procedure to conclude that \begin{equation} P\models(\mathbf{X}\cup\mathbf{U}_1\perp\mathbf{Y}\cup\mathbf{U}_2\vert\mathbf{Z}), \end{equation} which implies that \begin{equation} P\models(\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z}) \end{equation}

Theorem 12 (Hammersley-Clifford) Let $\mathcal{H}=(\mathcal{X},\mathcal{E})$ be a Markov network structure and let $P$ be a positive distribution over $\mathcal{X}$. If $\mathcal{H}$ is an I-map for $P$, then $P$ is a Gibbs distribution that factorizes over $\mathcal{H}$.

Corollary 13: The positive distribution $P$ factorizes a Markov network $\mathcal{H}$ iff $\mathcal{H}$ is an I-map for $P$.

Theorem 14 (Completeness of separation) Let $\mathcal{H}$ be a Markov network structure. If $X$ and $Y$ are not separated given $\mathbf{Z}$ in $\mathcal{H}$, then $X$ and $Y$ are dependent given $\mathbf{Z}$ in some distribution $P$ that factorizes over $\mathcal{H}$.

From Distributions to Graphs

As mentioned above, the notion of minimal I-map lets us encode the independencies of a given distribution $P$ using a graph structure.

In particular, for a distribution $P$, we can construct the minimal I-map based on either the pairwise independencies or the local independencies.

Theorem 15: Let $P$ be a positive distribution and $\mathcal{H}$ be a Markov network defined by including an edge $X-Y$ for all $X,Y$ such that $P\not\models(X\perp Y\vert\mathcal{X}\backslash\{X,Y\}) $. Then $\mathcal{H}$ is the unique minimal I-map for $P$.

Theorem 16: Let P be a positive distribution and let $\mathcal{H}$ be a Markov network defined by including an edge $X-Y$ for all $X$ and all $Y\in\text{MB}_\mathcal{H}(X)$. Then $\mathcal{H}$ is the unique minimal I-map for $P$.

Remark: Not every distribution has a perfect map as UGM (proof by contradiction).

Factor Graphs

A factor graph $\mathcal{F}$ is an undirected graph whose nodes are divided into two groups: variable nodes (denoted as ovals) and factor nodes (denoted as squares) and whose edges only connect each factor (potential function) $\psi_c$ to its dependent nodes $X\in X_c$.

Same Markov network different factor graphs
Figure 7: (based on figure from the PGM book) Different factor graphs for the same Markov network: (a) A Markov network consists of nodes $X_1,X_2,X_3$; (b) A factor graph with a factor $\psi_{1,2,3}$ connected to each $X_1,X_2,X_3$; (c) A factor graph with three pairwise factors $\psi_{1,2}$ (connected to $X_1,X_2$), $\psi_{1,3}$ (connected to $X_1,X_3$) and $\psi_{2,3}$ (connected to $X_2,X_3$)

Log-Linear Models

A distribution $P$ is a log-linear model over a Markov network $\mathcal{H}$ if it is associated with

  • a set of features $\mathcal{F}=\{\phi_1(\mathbf{X}_1),\ldots,\phi_k(\mathbf{X}_k)\}$ where each $\mathbf{X}_i$ is a complete subgraph in $\mathcal{H}$,
  • a set of weight $w_1,\ldots,w_k$, such that \begin{equation} P(X_1,\ldots,X_n)=\frac{1}{Z}\exp\left[-\sum_{i=1}^{k}w_i\phi_i(\mathbf{X}_i)\right] \end{equation}

The function $\phi_i$ are called energy functions.

Canonical Parameterization

The canonical parameterization of a Gibbs distribution over $\mathcal{H}$ is defined via a set of energy functions over all cliques. For instance, the Markov network given in Figure 7(a) would have energy functions for each of the cliques \begin{equation} \{\{X_1,X_2,X_3\},\{X_1,X_2\},\{X_2,X_3\},\{X_1,X_3\},\{X_1\},\{X_2\},\{X_3\}\}, \end{equation} plus a constant energy function for the empty clique.

Bayesian & Markov Networks

We are ready to derive the relationship between representations: Bayesian network and Markov network. Specifically, we can find a Bayesian network which is a minimal I-map for a given Markov network and vice versa.

Bayesian Networks to Markov Networks

Let us begin by considering a distribution $P_\mathcal{B}$, where $\mathcal{B}$ is a parameterized network over a graph $\mathcal{G}$. Then, $P_\mathcal{B}$ can be seen as a Gibbs distribution by considering each CPD $P(X_i\vert\text{Pa}_{X_i})$ as a factor with scope $X_i,\text{Pa}_{X_i}$. This Gibbs distribution then has $1$ as its partition function.

Proposition 17: Let $\mathcal{B}$ be a Bayesian network over $\mathcal{X}$ and let $\mathbf{E}=\mathbf{e}$ be an observation. Let $\mathbf{W}=\mathcal{X}\backslash\mathbf{E}$. Then $P_\mathcal{B}(\mathbf{W}\vert\mathbf{e})$ is a Gibbs distribution, defined by the factors $\Phi=\{\phi_{X_i}\}_{X_i\in\mathcal{X}}$, where \begin{equation} \phi_{X_i}=P_\mathcal{B}(X_i\vert\text{Pa}_{X_i})[\mathbf{E}=\mathbf{e}] \end{equation} The partition function for this Gibbs distribution is $P(\mathbf{e})$.

This result lets us consider any Bayesian network conditioned as evidence $\mathbf{e}$ as a Gibbs distribution with partion function $P(\mathbf{e})$.

To find the undirected graph serving as an I-map for a set of factors in a Bayesian network, we recall that we have considered each CPD $P(X_i\vert\text{Pa}_{X_i})$ as a factor with scope $X_i,\text{Pa}_{X_i}$, in the undirected I-map. Therefore, in the undirected I-map, we need to have an edge between $X_i$ and each of its parents, as well as between all of the parents of $X_i$ (due to each factor corresponds to a clique).

Moralized Graph

The moral graph $\mathcal{M}[\mathcal{G}]$ of a Bayesian network structure $\mathcal{G}$ over $\mathcal{X}$ is the undirected graph over $\mathcal{X}$ that consists of an undirected edge between $X$ and $Y$ if

  • there is a directed edge between them (in either direction), or
  • $X$ and $Y$ are both parents of the same node.
Moral graph
Figure 8: (based on figure from the PGM book) A Bayesian network and its moral graph: (a) A Bayesian network; (b) The moral graph established by converting directed edges into undirected, plus adding edges between non-connected nodes which are both parents of the same nodes (newly created edges are denoted as $\color{red}{red}$ color)

The construction of moral graphs follows directly to a result.

Corollary 18: Let $\mathcal{G}$ be a BN graph. Then for any distribution $P_\mathcal{B}$ such that $\mathcal{B}$ is a parameterization of $\mathcal{G}$, we have that $\mathcal{M}[\mathcal{G}]$ is an I-map for $P_\mathcal{B}$.

Proposition 19: Let $\mathcal{G}$ be any BN graph. The moralized graph $\mathcal{M}[\mathcal{G}]$ is a minimal I-map for $\mathcal{G}$.

Proof
We begin by introducing the notion of Markov blanket in a Bayesian network $\mathcal{G}$.
Definition (Markov blanket in BN) The Markov blanket of a node $X\in\mathcal{X}$ in a Bayesian network $\mathcal{G}$, denoted $\text{MB}_\mathcal{G}(X)$, is the set of $X$’s parents, $X$’s children, and other parents of $X$’s children.

Let $X\in\mathcal{X}$ be a node of $\mathcal{G}$, we have that $\text{MB}_\mathcal{G}(X)$ d-separates $X$ from all other variables in $\mathcal{G}$; and that no subset of $\text{MB}_\mathcal{G}(X)$ has that property. Specifically:

  • Let $\mathbf{W}=\mathcal{X}\backslash\big(\{X\}\cup\text{MB}_\mathcal{G}(X)\big)$, and let $Z\in\text{MB}_\mathcal{G}(X)$ be some node in the Markov blanket of $X$. Then for each $Y\in\mathbf{W}$ that connected to $X$ via a trail, we have three possible cases: \begin{equation} X\rightarrow Z\rightarrow Y;X\leftarrow Z\leftarrow Y;X\leftarrow Z\rightarrow Y \end{equation} The v-structure is excluded due to the definition of $\text{MB}_\mathcal{G}(X)$, i.e. if $X\rightarrow Z\leftarrow Y$ exists, then $Y\in\text{MB}_\mathcal{G}(X)$. As $\text{MB}_\mathcal{G}(X)$ is given, $Z$ is observed, all of those 2-edge trails are then in-active, which implies that $\text{d-sep}_\mathcal{G}(X;Y\vert Z)$, and thus $\text{d-sep}_\mathcal{G}(X;\mathbf{W}\vert\text{MB}_\mathcal{G}(X))$.
  • As we have mentioned above that if a v-structure exists, then $Y$ must be in the Markov blanket of $X$, it follows directly that $\text{MB}_\mathcal{G}(X)$ is the minimal set having the property.

Thus, in other words, we can conclude that the Markov blanket of $X$, $\text{MB}_\mathcal{G}(X)$, is the smallest set required to render $X$ independent of all other nodes in $\mathcal{G}$. For each $X\in\mathcal{X}$, by viewing its Markov blanket in $\mathcal{G}$ as the set of its neighbors in an undirected graph $\mathcal{H}$ (which is the definition of Markov blanket in a Markov network), we then have that $\mathcal{H}$ is a minimal I-map for $\mathcal{G}$. Additionally, by how it is constructed, $\mathcal{H}$ is also a moral graph of $\mathcal{G}$, and thus $\mathcal{I}(\mathcal{H})\subset\mathcal{I}(\mathcal{G})$.

Remark:

  • The addition of the moralizing edges to the Markov network $\mathcal{H}$ leads to the loss of independence information implied by $\mathcal{G}$.
  • However, moralization causes loss of independencies assertions only when it introduces new edges into the graph.

Proposition 20: If the directed graph $\mathcal{G}$ is moral (in the sense that it contains no immoralities, i.e. for any pair of $X,Y$ in $\mathcal{G}$ sharing a child, there is a covering edge between $X$ and $Y$), then its moralized graph $\mathcal{M}[\mathcal{G}]$, which now has the same edges as $\mathcal{G}$, is a perfect map of $\mathcal{G}$.

In other words, this result states that a moral graph $\mathcal{G}$ can be converted to a Markov network without losing independencies assertions.

Proof
Let $\mathcal{H}=\mathcal{M}[\mathcal{G}]$, then $\mathcal{H}$ and $\mathcal{G}$ have the same edges. As in Proposition 19, we have shown that $\mathcal{I}(\mathcal{H})\subset\mathcal{I}(\mathcal{G})$, our problem remains to prove that $\mathcal{I}(\mathcal{H})\supset\mathcal{I}(\mathcal{G})$.
Assume that there is an independence \begin{equation} (\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z})\in\mathcal{I}(\mathcal{G}), \end{equation} which is not in $\mathcal{I}(\mathcal{H})$. This implies that there exists some active trail from $\mathbf{X}$ to $\mathbf{Y}$ given $\mathbf{Z}$ in $\mathcal{H}$. Consider some such trail which is minimal. As $\mathcal{H},\mathcal{G}$ have same edges, the same trail must also exist in $\mathcal{G}$. Thus, it is also in-active in $\mathcal{G}$ given $\mathbf{Z}$, which implies that it contains a v-structure, say $X_1\rightarrow X_2\leftarrow X_3$. Moreover, as $\mathcal{G}$ is moral, there exists an edge connecting $X_1$ and $X_3$, contradicts the assumption that the trail is minimal.

Markov Networks to Bayesian Networks

Theorem 21: Let $\mathcal{H}$ be a Markov network graph, and let $\mathcal{G}$ be any Bayesian network minimal I-map for $\mathcal{H}$. Then $\mathcal{G}$ can have no immoralities.

Chordal Graphs

Theorem 22: Let $\mathcal{H}$ be a non-chordal Markov network. Then there is no Bayesian newtork $\mathcal{G}$ which is a perfect map for $\mathcal{H}$, i.e. $\mathcal{I}(\mathcal{G})=\mathcal{I}(\mathcal{H})$.

Conditional Random Fields

A conditional random field, or CRF, is an undirected graph $\mathcal{H}$ whose nodes correspond to $\mathbf{X}\cup\mathbf{Y}$ where $\mathbf{X}$ is a set of observed variables and $\mathbf{Y}$ is a (disjoint) set of target variables which specifies a conditional distribution (instead of a joint distribution) \begin{equation} P(\mathbf{Y}\vert\mathbf{X})=\frac{1}{Z(\mathbf{X})}\prod_{c\in C}\psi_c(\mathbf{X}_c,\mathbf{Y}_c), \end{equation} where the partition function $Z(\mathbf{X})$ now depends on $\mathbf{X}$ (rather than being a constant) \begin{equation} Z(\mathbf{X})=\sum_\mathbf{Y}\prod_{c\in C}\psi_c(\mathbf{X}_c,\mathbf{Y}_c) \end{equation}

Example - Naive Markov

Consider a CRF over the binary-value variables $\mathbf{X}=\{X_1,\ldots,X_k\}$ and $\mathbf{Y}=\{Y\}$, and a pairwise potential between $Y$ and each $X_i$. The model is referred as a naive Markov model. Assume that the pairwise potentials defined via the log-linear model \begin{equation} \psi_i(X_i,Y)=\exp\big[w_i\mathbf{1}\{X_i=1,Y=1\}\big] \end{equation} Additionally, we also have a single-node potential \begin{equation} \psi_0(Y)=\exp\big[w_0\mathbf{1}\{Y=1\}\big] \end{equation} We then have that \begin{equation} P(Y=1\vert x_1,\ldots,x_k)=\frac{1}{1+\exp\left[-\left(w_0+\sum_{i=1}^{k}w_i x_i\right)\right]} \end{equation}

Local Probabilistic Models

Tabular CPDs

For a space of discrete-valued random variables only, we can encode the CPDs $P(X\vert\text{Pa}_X)$ as a table where each entry corresponds to a pair of $X,\text{Pa}_X$.

It is easily seen that this representation raises a disadvantage that the number of parameters required grows exponentially in the number of parents. Also, it is impossible to store the conditional probability corresponding to a continuous-valued random variables.

To avoid these problems, instead of viewing CPDs as tables listing all of the conditional probabilities $P(x\vert\text{pa}_X)$, we should consider them as functions that given $\text{pa}_X$ and $x$, return the conditional probability $P(x\vert\text{pa}_X)$.

Deterministic CPDs

The simplest type of non-tabular CPD corresponds to a variable $X$ being a deterministic function of its parents $\text{Pa}_X$. It means, there exists $f:\text{Val}(Pa_X)\mapsto\text{Val}(X)$ such that \begin{equation} P(x\vert\text{pa}_X)=\begin{cases}1&\hspace{1cm}x=f(\text{pa}_X) \\ 0&\hspace{1cm}\text{otherwise}\end{cases} \end{equation} Deterministic variables are denoted as double-line ovals, as illustrated in the following example

Network with a deterministic CPD
Figure 9: (taken from the PGM book) A network with $C$ being a deterministic function of $A$ and $B$.

Consider the above figure, as $C$ being a deterministic function of $A$ and $B$, we can deduce that $C$ is fully observed if $A$ and $B$ are both observed. In other words, we have that \begin{equation} (D\perp E\vert A,B) \end{equation}

Theorem 22: Let $\mathcal{G}$ be a network structure, and let $\mathbf{X},\mathbf{Y},\mathbf{Z}$ be sets of variables, $\mathbf{D}$ be set of deterministic variables. If $\mathbf{X}$ is deterministically separated from $\mathbf{Y}$ given $\mathbf{Z}$3, then for all distributions $P$ such that $P\models\mathcal{I}_\ell(\mathcal{G})$ and where, for each $X\in\mathbf{D}$, $P(X\vert\text{Pa}_X)$ is a deterministic CPD, we have that $P\models(\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z})$.

Theorem 23: Let $\mathcal{G}$ be a network structure, and let $\mathbf{X},\mathbf{Y},\mathbf{Z}$ be sets of variables, $\mathbf{D}$ be set of deterministic variables. If $\mathbf{X}$ is not deterministically separated from $\mathbf{Y}$ given $\mathbf{Z}$, then there exists a distribution $P$ such that $P\models\mathcal{I}_\ell(\mathcal{G})$ and where, for each $X\in\mathbf{D}$, $P(X\vert\text{Pa}_X)$ is a deterministic CPD, but we instead have $P\not\models(\mathbf{X}\perp\mathbf{Y}\vert\mathbf{Z})$.

It is worth remarking that particular deterministic CPD might imply additional independencies. For instance, let us consider the following examples

Example 2: Consider the following Bayesian network

Network with a deterministic CPD
Figure 10: (taken from the PGM book) Another Bayesian network with $C$ being a deterministic function of $A$ and $B$.

In the above figure, if $C=A\text{ XOR }B$, we have that $A$ is fully determined given $C$ and $B$. In other words, we have that \begin{equation} (D\perp E\vert B,C) \end{equation} Example 3: Consider the network given in Figure 9, with $C=A\text{ OR }B$. Assume that we are given $A=a^1$, it is then immediately that $C=c^1$ without taking into account the value of $B$. Or in other words, we have that \begin{equation} P(D\vert B,a^1)=P(D\vert a^1) \end{equation} On the other hand, given $A=a^0$, the value of $C$ is not fully determined, and still depend the value of $B$.

Context-Specific Independence

Let $\mathbf{X},\mathbf{Y},\mathbf{Z}$ be pairwise disjoint sets of variables, let $\mathbf{C}$ be a set of variable (which might overlap with $\mathbf{X}\cup\mathbf{Y}\cup\mathbf{Z}$), and let $\mathbf{c}\in\text{Val}(\mathbf{C})$. We say that $X$ and $\mathbf{Y}$ are contextually independent given $\mathbf{Z}$ and the context $\mathbf{C}$ denoted $(\mathbf{X}\perp_c\mathbf{Y}\vert\mathbf{Z},\mathbf{c})$ if \begin{equation} P(\mathbf{Y},\mathbf{Z},\mathbf{c})>0\Rightarrow P(\mathbf{X}\vert\mathbf{Y},\mathbf{Z},\mathbf{c})=P(\mathbf{X}\vert\mathbf{Z},\mathbf{c}) \end{equation} Independence statements of this form is known as the context-specific independencies (CSI).

Given this definition, let us examine some examples.

Example 4: Given the Bayesian network in Figure 9 with $C$ being a deterministic function $\text{OR}$ of $A$ and $B$. By properties of $\text{OR}$ function, we can conclude some independence assertions \begin{align} &(C\perp_c B\hspace{0.1cm}\vert\hspace{0.1cm}a^1), \\ &(D\perp_c B\hspace{0.1cm}\vert\hspace{0.1cm}a^1), \\ &(A\perp_c B\hspace{0.1cm}\vert\hspace{0.1cm}c^0), \\ &(D\perp_c E\hspace{0.1cm}\vert\hspace{0.1cm}c^0), \\ &(D\perp_c E\hspace{0.1cm}\vert\hspace{0.1cm}b^0,c^1) \end{align} Example 5: Given the Bayesian network in Figure 10 with $C$ being the exclusive or of $A$ and $B$. We can also conclude some independence assertions using properties of $\text{XOR}$ function \begin{align} &(D\perp_c E\hspace{0.1cm}\vert\hspace{0.1cm}b^1,c^0), \\ &(D\perp_c E\vert\hspace{0.1cm}b^0,c^1) \end{align}

Context-specific CPDs

Tree-CPDs

A tree-CPD representing a CPD for variable $X$ is a rooted tree, where:

  • each leaf node is labeled with a distribution $P(X)$;
  • each internal node is labeled with some variable $Z\in\text{Pa}_X$;
  • each edge from an internal node, which is labeled as some $Z$, to its child nodes corresponds to a $z_i\in\text{Val}(Z)$.
Tree-CPD
Figure 11: (taken from the PGM book) A tree-CPD for $P(J\vert A,S,L)$.

The structure is common in cases where a variable can depend on a set of r.v.s but we have uncertainty about which r.v.s it depends on. For example, in the above tree-CDP representing $P(J\vert A,S,L)$, we have that \begin{align} &(J\perp_c L,S\vert\hspace{0.1cm}a^0), \\ &(J\perp_c L\vert\hspace{0.1cm}a^1,s^1), \\ &(J\perp_c L\vert\hspace{0.1cm}s^1) \end{align}

Multiplexer CPD

A CPD $P(Y\vert A,Z_1,\ldots,Z_k)$ is said to be a multiplexer CPD if $\text{Val}(A)=\{1,\ldots,k\}$ and \begin{equation} P(Y\vert a,Z_1,\ldots,Z_k)=\mathbf{1}\{Y=Z_a\}, \end{equation} where $a$ is the value of $A$. The variable $A$ is referred as the selector variable for the CPD.

Multiplexer-CPD
Figure 12: (based on figure from the PGM book) (a) A Bayesian network for $P(J,C,L_1,L_2)$; (b) Tree-CPD for $P(J\vert C,L_1,L_2)$; (c) Modified network with additional variable $L$ acting as a multiplexer CPD.

Rule CPDs

A rule $\rho$ is a pair $(\mathbf{c},p)$ where $\mathbf{c}$ is an assignment to some subset of variables $\mathbf{C}$ and $p\in[0,1]$. $\mathbf{C}$ is then referred as the scope of $\rho$, denoted $\mathbf{C}=\text{Scope}(\rho)$.

This representation decomposes a tree-CPD into its most basic elements.

Example 6: Consider the tree-CPD given in Figure 11. The tree defines eight rules \begin{equation} \left\{\begin{array}{l}(a^0,j^0;0.8), \\ (a^0,j^1;0.2), \\ (a^1,s^0,l^0,j^0;0.9), \\ (a^1,s^0,l^0,j^1;0.1), \\ (a^1,s^0,l^1,j^0;0.4), \\ (a^1,s^0,l^1,j^1;0.6), \\ (a^1,s^1,j^0;0.1), \\ (a^1,s^1,j^1;0.9)\end{array}\right\} \end{equation} It is necessary that each conditional distribution $P(X\vert\text{Pa}_X)$ is specified by exactly one rule. Or in other words, the rules in a tree-CPD must be mutually exclusive and exhaustive.

Rule-based CPD

A rule-based CPD $P(X\vert\text{Pa}_X)$ is a set of rules $\mathcal{R}$ such that

  • For each $\rho\in\mathcal{R}$, we have that \begin{equation} \text{Scope}(\rho)\subset\{X\}\cup\text{Pa}_X \end{equation}
  • For each assignment $(x,\mathbf{u})$ to $\{X\}\cup\text{Pa}_X$, we have exactly one rule $(\mathbf{c};p)\in\mathcal{R}$ such that $\mathbf{c}$ is compatible with $(x,\mathbf{u})$. And we say that \begin{equation} P(X=x\vert\text{Pa}_X=\mathbf{u})=p \end{equation}
  • $\sum_x P(x\vert\mathbf{u})=1$.

Example 7: Let $X$ be a variable with $\text{Pa}_X=\{A,B,C\}$ with $X$’s CPD is defined by sets of rules \begin{equation} \left\{\begin{array}{l}\rho_1:(a^1,b^1,x^0;0.1), \\ \rho_2:(a^1,b^1,x^1;0.9), \\ \rho_3:(a^0,c^1,x^0;0.2), \\ \rho_4:(a^0,c^1,x^1;0.8), \\ \rho_5:(b^0,c^0,x^0;0.3), \\ \rho_6:(b^0,c^0,x^1;0.7), \\ \rho_7:(a^1,b^0,c^1,x^0;0.4), \\ \rho_8:(a^1,b^0,c^1,x^1;0.6), \\ \rho_9:(a^0,b^1,c^0;0.5)\end{array}\right\} \end{equation} The tree-CPD corresponds to the above rule-based CPD $P(X\vert A,B,C)$ is given as:

Rule-based-CPD
It is worth noticing that both CPD entries $P(x^0\vert a^0,b^1,c^0)$ and $P(x^1\vert a^0,b^1,c^0)$ are determined by rule $\rho_9$ only. This kind of rule only works for uniform distribution.

Independencies in Context-specific CPDs

If $\mathbf{c}$ be a context associated with a branch in the tree-CPD for $X$, then $X$ is independent of the remaining parents, $\text{Pa}_X\backslash\text{Scope}(\mathbf{c})$, given $\mathbf{c}$. Moreover, there might exist CSI statements conditioned on contexts which are not induced by complete branches.

Example 8: Consider the tree-CPD given in Figure 11, as mentioned above, we have that \begin{equation} (J\perp_c L\hspace{0.1cm}\vert\hspace{0.1cm}s^1), \end{equation} where $s^1$ is not the full assignment associated with a branch.
Also, consider the tree-CPD given in Figure 12(b), we have that \begin{align} &(J\perp_c L_2\hspace{0.1cm}\vert\hspace{0.1cm}c^1), \\ &(J\perp_c L_1\hspace{0.1cm}\vert\hspace{0.1cm}c^2), \end{align} where neither $c^1$ nor $c^2$ is the full assignment associated with a branch.

Reduced Rule

Let $\rho=(\mathbf{c}’;p)$ be a rule and $\mathbf{c}$ be a context. If $\mathbf{c}’$ is compatible with $\mathbf{c}$, we say that $\rho\sim\mathbf{c}$.

In this case, let $\mathbf{c}’’$ be the assignment in $c’$ to the variables in $\text{Scope}(\mathbf{c}’)\backslash\text{Scope}(\mathbf{c})$. We then define the reduced rule $\rho[\mathbf{c}]=(\mathbf{c}’’;p)$. If $\mathcal{R}$ be a set of rules, we define the reduced rule set \begin{equation} \mathcal{R}[\mathbf{c}]=\{\rho[\mathbf{c}]:\rho\in\mathcal{R},\rho\sim\mathbf{c}\} \end{equation}

Example 9: Consider the rule set $\mathcal{R}$ given in Example 7, we have that the reduced set corresponding to context $a^1$ is \begin{equation} \mathcal{R}[a^1]=\left\{\begin{array}{l}\rho_1’:(b^1,x^0;0.1), \\ \rho_2:(b^1,x^1;0.9), \\ \rho_5:(b^0,c^0,x^0;0.3), \\ \rho_6:(b^0,c^0,x^1;0.7), \\ \rho_7:(b^0,c^1,x^0;0.4), \\ \rho_8’:(b^0,c^1,x^1;0.6),\end{array}\right\} \end{equation} which is obtained by selecting rules compatible with $a^1$, i.e. $\{\rho_1,\rho_2,\rho_5,\rho_6,\rho_7,\rho_8\}$, then canceling out $a^1$ from all the rules where it appeared.

Proposition 20: Let $\mathcal{R}$ be the rules in the rule-based CPD for a variable $X$, and let $\mathcal{R}_\mathbf{c}$ be the rules in $\mathcal{R}$ compatible with $\mathbf{c}$. Let $\mathbf{Y}\subset\text{Pa}_X$ be a subset such that $\mathbf{Y}\cap\text{Scope}(\mathbf{c})=\emptyset$. If for all $\rho\in\mathcal{R}[\mathbf{c}]$, we have that $\mathbf{Y}\cap\text{Scope}(\rho)=\emptyset$, then \begin{equation} (X\perp_c\mathbf{Y}\hspace{0.1cm}\vert\hspace{0.1cm}\text{Pa}_X\backslash\mathbf{Y},\mathbf{c}) \end{equation}

Spurious Edge

Let $P(X\vert\text{Pa}_X)$ be a CPD, $Y\in\text{Pa}_X$ be a set and let $\mathbf{c}$ be a context. The edge $Y\rightarrow X$ is said to be spurious in the context $\mathbf{c}$ if $P(X\vert\text{Pa}_X)$ satisfies $(X\perp_c Y\hspace{0.1cm}\vert\hspace{0.1cm}\text{Pa}_X\backslash\{Y\},\mathbf{c})$, where $\mathbf{c}’$ is the restriction of $\mathbf{c}$ to variables in $\text{Pa}_X$.

Hence, by examining the reduced rule set, we can specify whether an edge is spurious, i.e. if $\mathcal{R}$ be the rule-based CPD for $P(X\vert\text{Pa}_X)$, then $Y\rightarrow X$ is spurious in context $\mathbf{c}$ if $Y$ does not appear in $\mathcal{R}[\mathbf{c}]$.

Theorem 21: Let $\mathcal{G}$ be a network structure, $P$ be a distribution such that $P\models\mathcal{I}_\ell(\mathcal{G})$, $\mathbf{c}$ be a context, and $\mathbf{X},\mathbf{Y},\mathbf{Z}$ be sets of variables. If $\mathbf{X}$ is CSI-separated from $\mathbf{Y}$ given $\mathbf{Z}$ in the context $\mathbf{c}$4, then we have that $P\models(\mathbf{X}\perp_c\mathbf{Y}\hspace{0.1cm}\vert\hspace{0.1cm}\mathbf{Z},\mathbf{c})$.

Independence of Causal Influence

Noisy-Or Model

Let $Y$ be a binary-valued r.v with $k$ binary-valued parents $X_1,\ldots,X_k$. The CPD $P(Y\vert X_1,\ldots,X_k)$ is a noisy-or if there are $k+1$ noise parameters $\lambda_0,\lambda_1,\ldots,\lambda_k$ such that \begin{align} P(y^0\vert X_1,\ldots,X_k)&=(1-\lambda_0)\prod_{i:X_i=x_i^1}(1-\lambda_i)\label{eq:nom.1} \\ P(y^1\vert X_1,\ldots,X_k)&=1-(1-\lambda_0)\prod_{i:X_i=x_i^1}(1-\lambda_i) \end{align} If we interpret $x_i^0$ as $0$ and $x_i^1$ as $1$, \eqref{eq:nom.1} can be rewritten as \begin{equation} P(y^0\vert X_1,\ldots,X_k)=(1-\lambda_0)\prod_{i=1}^{k}(1-\lambda_i)^{x_i} \end{equation}

Generalized Linear Models

Binary-valued Variables
Multivalued Variables

Continuous Variables

Linear Gaussian Model

Let $Y$ be a continuous variable with continuous parents $X_1,\ldots,X_k$. We say that $Y$ has a linear Gaussian model if there are parameters $\beta_0,\ldots,\beta_k$ such that \begin{equation} p(Y\vert x_1,\ldots,x_k)=\mathcal{N}(\beta_0+\beta_1 x_1+\ldots+\beta_k x_k;\sigma^2) \end{equation} In vector notation, we have \begin{equation} p(Y\vert\mathbf{x})=\mathcal{N}(\beta_0+\boldsymbol{\beta}^\text{T}\mathbf{x};\sigma^2) \end{equation}

Conditional Bayesian Networks

A conditional Bayesian network $\mathcal{B}$ over $\mathbf{Y}$ given $\mathbf{X}$ is defined as a DAG $\mathcal{G}$ whose nodes are $\mathbf{X}\cup\mathbf{Y}\cup\mathbf{Z}$ where $\mathbf{X},\mathbf{Y},\mathbf{Z}$ are disjoint. The variables in $\mathbf{X}$ are called inputs, the ones in $\mathbf{Y}$ are referred as outputs and the others in $\mathbf{Z}$ are known as encapsulated.

The variables in $\mathbf{X}$ have no parents in $\mathcal{G}$, while the variables in $\mathbf{Y}\cup\mathbf{Z}$ are associated with a CPD. The network defines a CPD using chain rule \begin{equation} P_\mathcal{B}(\mathbf{Y},\mathbf{Z}\vert\mathbf{X})=\prod_{T\in\mathbf{Y}\cup\mathbf{Z}}P(T\vert\text{Pa}_T) \end{equation} The distribution $P_\mathcal{B}(\mathbf{Y}\vert\mathbf{X})$ is defined as the marginal of $P_\mathcal{B}(\mathbf{Y},\mathbf{Z}\vert\mathbf{X})$ \begin{equation} P_\mathcal{B}(\mathbf{Y}\vert\mathbf{X})=\sum_\mathbf{Z}P_\mathcal{B}(\mathbf{Y},\mathbf{Z}\vert\mathbf{X}) \end{equation} The conditional Bayesian network is the directed version of CRF mentioned above.

Encapsulated CPD

Let $Y$ be a r.v with $k$ parents $X_1,\ldots,X_k$. The CPD $P(Y\vert X_1,\ldots,X_k)$ is an encapsulated CPD if it is represented using a conditional Bayesian network over $Y$ given $X_1,\ldots,X_k$.

Template-based Representations

Temporal Models

In a temporal model, for each $X_i\in\mathcal{X}$, we let $X_i^{(t)}$ denote its instantiation at time $t$. The variables $X_i$ are referred as template variables.

Consider a distribution over trajectories sampled over time $t=0,1,\ldots,T$ - $P(\mathcal{X}^{(0)},\mathcal{X}^{(1)},\ldots,\mathcal{X}^{(T)})$, or $P(\mathcal{X}^{(0:T)})$ where $\mathcal{X}^{(t)}=\{X_i^{(t)}\}$. Using the chain rule for probabilities, we have that \begin{equation} P(\mathcal{X}^{(0:T)})=P(\mathcal{X}^{(0)},\mathcal{X}^{(1)},\ldots,\mathcal{X}^{(T)})=P(\mathcal{X}^{(0)})\prod_{t=0}^{T-1}P(\mathcal{X}^{(t+1)}\vert \mathcal{X}^{(0:t)}),\label{eq:tm.1} \end{equation} where $\mathcal{X}^{(t_1:t_2)}\doteq\{\mathcal{X}^{(t_1)},\mathcal{X}^{(t_1+1)},\ldots,\mathcal{X}^{(t_2-1)},\mathcal{X}^{(t_2)}\}$ for $t_1<t_2$. In other words, the distribution over trajectories is the product of conditional distribution, over the variables in each time step $t$ given the preceding one.

Markovian System

A dynamic system over the template variables $\mathcal{X}$ is referred as Markovian if it satisfies the Markov assumption, in the sense that \begin{equation} (\mathcal{X}^{(t+1)}\perp\mathcal{X}^{(0:t-1)}\vert\mathcal{X}^{(t)}) \end{equation} In other words, in such systems, we have a more compact form of \eqref{eq:tm.1}, which is \begin{equation} P(\mathcal{X}^{(0)},\mathcal{X}^{(1)},\ldots,\mathcal{X}^{(T)})=P(\mathcal{X}^{(0)})\prod_{t=0}^{T-1}P(\mathcal{X}^{(t+1)}\vert\mathcal{X}^{(t)}) \end{equation} A Markovian dynamic system is said to be stationary (or time invariant, or homogeneous) if $P(\mathcal{X}^{(t+1)}\vert\mathcal{X}^{(t)})$ is the same for all $t$. In this case, we can represent the process using a transition model $P(\mathcal{X}’\vert\mathcal{X})$, so that for any $t\geq0$, we have \begin{equation} P(\mathcal{X}^{(t+1)}=\xi’\vert\mathcal{X}^{(t)}=\xi)=P(\mathcal{X}’=\xi’\vert\mathcal{X}=\xi) \end{equation}

Dynamic Bayesian Networks

A 2-time-slice Bayesian network (2-TBN) for a process over $\mathcal{X}$ is a conditional Bayesian network over $\mathcal{X}’$ given $\mathcal{X}_I$, where $\mathcal{X}_I\subset\mathcal{X}$ is a set of interface variables.

Hence, as mentioned above, we have

  • Only the variables $\mathcal{X}'$ are associated with CPDs (i.e. having parents).
  • The interface variables $\mathcal{X}_I$ are variables whose values at time $t$ have a direct effect on the variables at time $t+1$. Thus, only the variables in $\mathcal{X}_I$ can be parents of variables in $\mathcal{X}'$.
  • The 2-TBN represents the distribution \begin{equation} P(\mathcal{X}'\vert\mathcal{X})=P(\mathcal{X}'\vert\mathcal{X}_I)=\prod_{i=1}^{n}P(X_i'\vert\text{Pa}_{X_i'}) \end{equation}
  • For each template variable $X_i$, the CPD $P(X_i'\vert\text{Pa}_{X_i'})$ is a template factor, i.e. it will be instantiated multiple times within the model, for multiple variables $X_i^{(t)}$ (and their parents).

In a 2-TBN, edges that go between time slices are called inter-time-slice, while the ones connecting variables in the same slices are known as intra-time-slice. Additionally, inter-time-slice edges having the form of $X\rightarrow X’$ are referred as persistence. The variable $X$ for which we have an edge $X\rightarrow X’$ is also called persistent variable.

2-TBN
Figure 13: (based on figure from the PGM book) A 2-TBN.

Based on the stationary property, a 2-TBN defines the probability distribution $P(\mathcal{X}^{(t+1)}\vert\mathcal{X}^{(t)})$ for any $t$. Given a distribution over the initial states, we can unroll the network over sequences of any length, to define a Bayesian network that induces a distribution over trajectories of that length.

A dynamic Bayesian network (or DBN) is a tuple $(\mathcal{B}_0,\mathcal{B}_\rightarrow)$, where

  • $\mathcal{B}_0$ is a Bayesian network over $\mathcal{X}^{(0)}$ representing the initial distribution over states;
  • $\mathcal{B}_\rightarrow$ is a 2-TBN for the process.

For any time span $T\geq0$, the distribution over $\mathcal{X}^{(0:T)}$ is defined as an unrolled Bayesian network, where, for any $i=1,\ldots,n$:

  • The structure and CPDs of $X_i^{(0)}$ are the same as those for $X_i$ in $\mathcal{B}_0$.
  • The structure and CPDs of $X_i^{(t)}$ for $t>0$ are the same as those for $X_i'$ in $\mathcal{B}_\rightarrow$.

Or in other words, $\mathcal{B}_0$ is the initial state, while $\mathcal{B}_\rightarrow$ represents the transition model.

Remark: Hence, we can view a DBN as a compact representation from which we can generate an infinite set of Bayesian networks (one for every $T>0$).

DBN
Figure 14: (based on figure from the PGM book) (a) $\mathcal{B}_\rightarrow$; (b) $\mathcal{B}_0$; (c) 3-step unrolled DBN.

In DBNs, we can partition the variables $\mathcal{X}$ into disjoint subsets $\mathbf{X}$ and $\mathbf{O}$ such that variables in $\mathbf{X}$ are always hidden, while ones in $\mathbf{O}$ are always observed. This introduces us to an another way of representing temporal process, which is the state-observation model.

State-Observation Models

A state-observation model utilizes two independent assumptions:

  • Markov assumption: \begin{equation} (\mathbf{X}^{(t+1)}\perp\mathbf{X}^{(0:t-1)}\vert\mathbf{X}^{(t)}) \end{equation}
  • Observations depend on current state only: \begin{equation} (\mathbf{O}^{(t)}\perp\mathbf{X}^{(0:t-1)},\mathbf{X}^{(t+1:\infty)}\vert\mathbf{X}^{(t)}) \end{equation}

Therefore, we can view our probabilistic model containing two components: the transition model, $P(\mathbf{X}’\vert\mathbf{X})$, and the observation model, $P(\mathbf{O}\vert\mathbf{X})$. This corresponds to a 2-TBN structure where the observation variables $\mathbf{O}’$ are all leaves, and have parents only in $\mathbf{X}’$. For instance, as considering Figure 13, $\textit{Observation}’$ are acting as $\mathbf{O}’$.

Hidden Markov Models

A Hidden Markov model, or HMM, is the simplest example of a state-observation model, and also a special case of a simple DBN, which has a sparse transition model $P(S’\vert S)$. Thus, HMMs are often represented using a different graphical notation which visualizes this sparse transition model.

Specifically, in the is representation, the transition model is encoded using a directed graph, where

  • Nodes represent the different states of the system, i.e. the values in $\text{Val}(S)$.
  • Each directed edge $s\rightarrow s'$ represents a possible transitioning from $s$ to $s'$, i.e. $P(s'\vert s)>0$.

Example 10: Consider an HMM with state variable $S$ that takes 4 possible values $s_1,s_2,s_3,s_4$ and with a transition model

$s_1$$s_2$$s_3$$s_4$
$s_1$0.30.700
$s_2$000.40.6
$s_3$0.5000.5
$s_4$00.900.1

where the rows correspond to states $s$, while the columns to next states $s’$. On other words, the $i$-th row represents the CPD $P(s’\vert s=s_i)$, and thus must sum to $1$. Its transition graph is shown below

HMM
Linear Dynamical Systems

A linear dynamical system, or Kalman filter represents a system of one or more real-valued variable that evolve linearly over time, with some Gaussian noise.

Such systems can be viewed as DBNs where the variables are all continuous and all of the dependencies are linear Gaussian.

They are traditionally represented as a state-observation model, where the state and the observation are both vector-valued r.v.s; and where the transition model and observation model are both encoded using matrices. In more specific, the model is generally defined via the set of equations \begin{align} P(\mathbf{X}^{(t)}\vert\mathbf{X}^{(t-1)})&=\mathcal{N}(A\mathbf{X}^{(t-1)};Q), \\ P(O^{(t)}\vert\mathbf{X}^{(t)})&=\mathcal{N}(H\mathbf{X}^{(t)};R), \end{align} where

  • $\mathbf{X}\in\mathbb{R}^n$ is the vector of state variables;
  • $O\in\mathbb{R}^m$ is the vector of observation variables;
  • $A\in\mathbb{R}^{n\times n}$ (precisely $A\in[0,1]^{n\times n}$) is the transition matrix, defines the linear transition model;
  • $H\in\mathbb{R}^{n\times m}$ (also $H\in[0,1]^{n\times m}$ to be exact) defines the linear observation model;
  • $R\in\mathbb{R}^{m\times m}$ defines the Gaussian noise associated with the observations.
Extended Kalman Filters

A nonlinear variant of the linear dynamical system, known as extended Kalman filter, is a system where the state transition and observation model are nonlinear functions, i.e. \begin{align} P(\mathbf{X}^{(t)}\vert\mathbf{X}^{(t-1)})&=f(\mathbf{X}^{(t-1)},\mathbf{U}^{(t-1)}) \\ P(O^{(t)}\vert\mathbf{X}^{(t)})&=g(\mathbf{X}^{(t)},\mathbf{W}^{(t)}), \end{align} where

  • $f,g$ are nonlinear functions;
  • $\mathbf{U}^{(t)},\mathbf{W}^{(t)}$ are Gaussian r.v.s.

Template Variables & Template Factors

As viewing the world as a set of objects, each of which can be divided into a set of mutually exclusive and exhaustive classes $\mathcal{Q}=Q_1,\ldots,Q_k$.

An attribute $A$ is a function $A(U_1,\ldots,U_k)$ whose range is some set $\text{Val}(A)$ and where each argument $U_i$ is known as a logical variable associated with a particular class $Q[U_i]\doteq Q_i$. The tuple $(U_1,\ldots,U_k)$ is called the argument signature of the attribute $A$, denoted $\alpha(A)$ \begin{equation} \alpha(A)\doteq(U_1,\ldots,U_k) \end{equation} Example 11: The argument signature of $\textit{Grade}$ attribute would have two logical variables $S,C$ where $S$ is of class $\textit{Student}$, and where $C$ is of class $\textit{Course}$.

Let $\mathcal{Q}$ be a set of classes, and $\aleph$ be a set of attributes over $\mathcal{Q}$. An object skeleton $\kappa$ specifies a fixed, finite set of objects $\mathcal{O}^\kappa[Q]$ for every $Q\in\mathcal{Q}$. We also define \begin{equation} \mathcal{O}^\kappa[\alpha(A)]=\mathcal{O}^\kappa[U_1,\ldots,U_k]\doteq\mathcal{O}^\kappa[Q[U_1]]\times\ldots\times\mathcal{O}^\kappa[Q[U_k]] \end{equation} By default, we let $\Gamma_\kappa[A]=\mathcal{O}^\kappa[\alpha(A)]$ to be the set of possible assignments to the logical variables in the argument signature of $A$. However, an object skeleton might also specify a subset of legal assignments, i.e. $\Gamma_\kappa[A]\subset\mathcal{O}^\kappa[\alpha(A)]$.

For an object skeleton $\kappa$ over $\mathcal{Q},\aleph$. We define sets of ground random variables \begin{align} \mathcal{X}_\kappa[A]&\doteq\{A(\gamma):\gamma\in\Gamma_\kappa[A]\} \\ \mathcal{X}_\kappa[\aleph]&\doteq\bigcup_{A\in\aleph}\mathcal{X}_\kappa[A] \end{align} Here, we are abusing notation, identifying an argument $\gamma=(U_1\mapsto u_1,\ldots,U_k\mapsto u_k)$ with the tuple $(u_1,\ldots,u_k)$.

A template factor $\xi$ is a function defined over a tuple of template attributes $A_1,\ldots,A_l$ where each $A_i$ has a range $\text{Val}(A_i)$. It defines a mapping $\text{Val}(A_1)\times\ldots\times\text{Val}(A_l)\mapsto\mathbb{R}$. Given r.v.s $X_1,\ldots,X_l$ such that $\text{Val}(X_i)=\text{Val}(A_i)$ for all $i=1,\ldots,j$, we define $\xi(X_1,\ldots,X_l)$ to be the instantiated factor from $\mathbf{X}$ to $\mathbb{R}$.

Directed Probabilistic Models for Object-Relational

Plate Models

A plate model $\mathcal{M}_\text{Plate}$ defines for each template attribute $A\in\aleph$ with argument signature $U_1,\ldots,U_k$:

  • a set of template parents \begin{equation} \text{Pa}_A=\{B_1(\mathbf{U}_1),\ldots,B_l(\mathbf{U}_l)\}, \end{equation} such that for each $B_i(\mathbf{U}_i)$, we have that $\mathbf{U}_i\subset\{U_1,\ldots,U_k\}$. The variables $\mathbf{U}_i$ are the argument signature of the parent $B_i$.
  • a template CPD $P(A\vert\text{Pa}_A)$.
Plate models
Figure 15: (based on figure from the PGM book) Plate models and induced ground Bayesian networks: (a) Single plate: for any student $s$, $P(I(s))$ and $P(G(s)\vert I(s))$ are the same; (b) Nested plates: for any (student, course) pair $(s,c)$, $\textit{Grade}(s,c)$ depends on $\textit{Difficulty}(c)$ and on $\textit{Intelligence}(s,c)$; (c) Intersecting plates: for any (student, course) pair $(s,c)$, $\text{Grade}(s,c)$ depends on $\textit{Difficulty}(c)$ and on $\textit{Intelligence}(s)$.
Ground Bayesian Networks for Plate Models

A plate model $\mathcal{M}_\text{Plate}$ and object skeleton $\kappa$ define a ground Bayesian network $\mathcal{B}_\kappa^{\mathcal{M}_\text{Plate}}$ as follows. Let $A(U_1,\ldots,U_k)$ be any template attribute in $\aleph$. Then, for any $\gamma=(U_1\mapsto u_1,\ldots,U_k\mapsto u_k)\in\Gamma_\kappa[A]$, we have a variable $A(\gamma)$ in the ground network, with parents $B(\gamma)$ for all $B\in\text{Pa}_A$, and the instantiated CPD $P(A(\gamma)\vert\text{Pa}_A(\gamma))$.

Example 12: Consider the Figure 15(c), without loss of generality, we have that:

  • The plate model $\mathcal{M}_\text{Plate}$ is defined over a set $\aleph=\{\textit{Grade},\textit{Difficulty},\textit{Intelligence}\}$ of template attributes, for each of which:
    • $\alpha(\textit{Grade})=(S,C)$ and $\text{Pa}_\textit{Grade}=\{\textit{Difficulty}(C),\textit{Intelligence}(S)\}$;
    • $\alpha(\textit{Difficulty})=(C)$ and $\text{Pa}_\textit{Difficulty}=\emptyset$;
    • $\alpha(\textit{Intelligence})=(S)$ and $\text{Pa}_\textit{Intelligence}=\emptyset$,
    where $S$ is logical variable of class $\textit{Student}$, and $C$ is a logical variable of class $\textit{Course}$.
  • Let $(S\mapsto s,C\mapsto c)$ be some assignment to some logical variables $S,C$ where $S$ is of class $\textit{Student}$, $C$ is of class $\textit{Course}$. We then have instantiated CPDs: \begin{align} P(G(s,c)\vert\text{Pa}_G(s,c))&=P(G(s,c)\vert D(s,c),I(s,c))=P(G(s,c)\vert D(c),I(s)), \\ P(D(c)\vert\text{Pa}_D(c))&=P(D(c)), \\ P(I(s)\vert\text{Pa}_I(s))&=P(I(s)) \end{align}

Probabilistic Relational Models

For a template attribute $A$, we define a contingent dependency model as a tuple containing:

  • A parent argument signature $\alpha(\text{Pa}_A)$, which is a tuple of typed logical variables $U_i$ such that $\alpha(\text{Pa}_A)\supset\alpha(A)$.
  • A guard $\Gamma$, which is a binary-valued formula defined in terms of a set of template attributes $\text{Pa}_A^\Gamma$ over the argument signature $\alpha(\text{Pa}_A)$.
  • A set of template parents \begin{equation} \text{Pa}_A=\{B_1(\mathbf{U}_1),\ldots,B_l(\mathbf{U}_l)\}, \end{equation} such that for each $B_i(\mathbf{U}_i)$, we have that $\mathbf{U}_i\subset\alpha(\text{Pa}_A)$.

A probabilistic relational model (or PRM) $\mathcal{M}_\text{PRM}$ defines, for each $A\in\aleph$ a contingent dependency model and a template CPD.

Ground Bayesian Networks for PRMs

A PRM $\mathcal{M}_\text{PRM}$ and object skeleton $\kappa$ define a ground Bayesian network $\mathcal{B}_\kappa^{\mathcal{M}_\text{PRM}}$ as follows. Let $A(U_1,\ldots,U_k)$ be any template attribute in $\aleph$. Then, for any assignment $\gamma\in\Gamma_\kappa[A]$, we have a variable $A(\gamma)$ in the ground network. This variable has, for any $B\in\text{Pa}_A^\Gamma\cup\text{Pa}_A$ and any assignment $\gamma’$ to $\alpha(\text{Pa}_A)\backslash\alpha(A)$, the parent that is the instantiated variable $B(\gamma,\gamma’)$.

Undirected Representation

A relational Markov network $\mathcal{M}_\text{RMN}$ is defined in terms of a set $\Lambda$ of template features, where each $\lambda\in\Lambda$ contains:

  • a real-valued template feature $f_\lambda$ whose arguments are $\aleph(\lambda)=\{A_1(\mathbf{U}_1),\ldots,A_l(\mathbf{U}_l)\}$;
  • a weight $w_\lambda\in\mathbb{R}$.

Ground Gibbs Distribution

Given an RMN $\mathcal{M}_\text{RMN}$, an object skeleton $\kappa$, we can define a ground Gibbs distribution $P_\kappa^{\mathcal{M}_\text{RMN}}$ as:

  • The variables in the network are $\mathcal{X}_\kappa[\aleph]$ (as defined above);
  • $P_\kappa^{\mathcal{M}_\text{RMN}}$ contains a term \begin{equation} \exp\big(w_\lambda f_\lambda(\gamma)\big), \end{equation} for each feature template $\lambda\in\Lambda$ and each assignment $\gamma\in\Gamma_\kappa[\alpha(\lambda)]$.

References

[1] Daphne Koller, Nir Friedman. Probabilistic Graphical Models. The MIT Press, 2009.

[2] Michael I. Jordan. An Introduction to Probabilistic Graphical Models. In preparation.

[3] Eric P. Xing. 10-708: Probabilistic Graphical Model. CMU Spring 2020.

[4] Stefano Ermon. CS228: Probabilistic Graphical Model. Stanford Winter 2017-2018.

Footnotes


  1. Note that $X_i\rightarrow X_j\equiv X_j\leftarrow X_i$ but $X_i\rightarrow X_j\not\equiv X_i\leftarrow X_j$, while $X_i-X_j\equiv X_j-X_i$. ↩︎

  2. Note that the inverse is not true. ↩︎

  3. This can be specified by doing the procedure

    Let $\mathbf{Z}^+\leftarrow\mathbf{Z}$
    While $\exists X_i$ such that $X_i\in\mathbf{D}$ and $\text{Pa}_{X_i}\subset\mathbf{Z}^+$
    $\hspace{1cm}\mathbf{Z}^+\leftarrow\mathbf{Z}\cup\{X_i\}$
    return $\text{d-sep}(\mathbf{X};\mathbf{Y}\vert\mathbf{Z})$

     ↩︎
  4.  ↩︎