Nearest Neighbor Estimator

NB: for the formalism needed to understand what follows, please go to the MuTE page and read the SOME FORMALISM section.

Since its first introduction in 1967 (Cover, 1967), the nearest neighbours method has been shown to be a powerful nonparametric technique for classification, density estimation, and regression estimation. This method can be used to estimate the entropy H(X) of a d-dimensional random variable X starting from a random sample (x_1, \ldots, x_N) of N realizations of X. Following the reasoning in Kraskov (2004), if we consider the probability distribution P_k(\varepsilon) for the distance between x_i and
its kth nearest neighbour, the probability P_k(\varepsilon)d\varepsilon is equal to the chance that there is one point within distance r \in [\varepsilon/2, \varepsilon/2 + d\varepsilon/2] from x_i, that there are k - 1 other points at smaller distances, and that the remaining N - k - 1 points lie at larger distances from x_i. Let p_i(\varepsilon) be the mass of the \varepsilon-sphere centered at x_i, p_i(\varepsilon) = \int_{\| \xi - x_i\| < \varepsilon/2}{\mu(\xi)\,d\xi}. Then the expectation value of log(p_i(\varepsilon)) is

(1)   \begin{equation*}  E(log(p_i)) = \int_0^{\infty}{P_k(\varepsilon)\,log(p_i(\varepsilon))\,d\varepsilon} = \psi(k) - \psi(N) \end{equation*}

where P_k(\varepsilon) is evaluated through the trinomial formula and \psi(\cdot) is the digamma function. The expectation is taken here over the positions of all other N - 1 points, with x_i kept fixed. An estimator for log(\mu(x)) is then obtained by assuming that \mu(x) is constant over the entire \varepsilon-sphere. This assumption gives

(2)   \begin{equation*}  p_i(\varepsilon) \approx c_d \varepsilon^d \mu(x_i) \end{equation*}

where d is the dimension of x and c_d is the volume of the d-dimensional unit sphere. For the maximum norm one has simply c_d = 1, while c_d = \pi^{d/2}/\Gamma(1+d/2)/2^d for the Euclidean norm. From (1) and (2) we can evaluate log(\mu(x_i)) and finally obtain:

(3)   \begin{equation*} H(X) = - \psi(k) + \psi(N) +log(c_d) + \frac{d}{N}\sum_{i=1}^N {log(\varepsilon(i))} \end{equation*}
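As a minimal illustrative sketch (not the MuTE implementation), equation (3) can be coded with SciPy's k-d tree. The function name is ours, and we assume the maximum norm, for which log(c_d) = 0; \varepsilon(i) is twice the distance from x_i to its kth neighbour, following Kraskov (2004):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy(x, k=4):
    """Nearest neighbor entropy estimate of eq. (3), max norm (log c_d = 0).

    x : (N, d) array of N realizations of a d-dimensional variable X.
    k : number of neighbours used in the search.
    """
    x = np.asarray(x, dtype=float)
    N, d = x.shape
    tree = cKDTree(x)
    # query k+1 neighbours because the query point itself is returned first
    dist, _ = tree.query(x, k=k + 1, p=np.inf)
    eps = 2.0 * dist[:, -1]  # epsilon(i): diameter of the k-th neighbour ball
    return -digamma(k) + digamma(N) + d * np.mean(np.log(eps))
```

For a standard Gaussian sample the estimate approaches the analytical value 0.5*log(2*pi*e) ≈ 1.42 nats as N grows.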

The NN estimator addresses the bias arising when multiple entropies of vector variables of different dimensions are estimated and summed: the entropy sums are computed through a neighbor search in the space of highest dimension and range searches in the projected sub-spaces of lower dimension (Kraskov, 2004). This approach can be fruitfully exploited for the estimation of the TE, as previously done, e.g., in Wibral (2011) and Vicente (2011). To do this, we first rewrite the expression for TE given in the SOME FORMALISM section in terms of the components of the embedding vector V = [V_n^Y, V_n^X, V_n^{\textbf{Z}}] spanning a space of dimension (d_X + d_Y + d_Z):

(4)   \begin{equation*} TE_{X \rightarrow Y|\mathbf{Z}} = H(Y_n,V_n^Y,V_n^{\textbf{Z}}) - H(V_n^Y,V_n^{\textbf{Z}}) - H(Y_n,V) + H(V) \end{equation*}

The term H(Y_n,V) is estimated through a neighbor search in the (d_X + d_Y + d_Z + 1)-dimensional space, while the three other terms are estimated through range searches in the spaces of dimension (d_Y + d_Z + 1), (d_Y + d_Z) and (d_X + d_Y + d_Z). Accordingly, adaptation of (3) to the four terms in (4) yields the equation for TE based on the nearest neighbor estimator:

(5)   \begin{equation*} TE_{X \rightarrow Y|\mathbf{Z}} = \psi(k) +\left \langle \psi(N_{V_n^YV_n^Z} + 1) - \psi(N_{Y_nV_n^YV_n^Z} + 1) - \psi(N_V + 1) \right \rangle \end{equation*}

where N_{V_n^YV_n^Z}, N_{Y_nV_n^YV_n^Z} and N_V are the numbers of points whose distance from
[V_n^Y,V_n^{\textbf{Z}}], [Y_n,V_n^Y,V_n^{\textbf{Z}}] and V, respectively, is strictly less than the distance from [Y_n,V] to its k-th neighbor.
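The combination of one neighbor search and three range searches in equation (5) can be sketched as follows. This is an illustrative implementation of the bivariate case (empty conditioning set \mathbf{Z}, so the three counts reduce to N_{V^Y}, N_{Y_n V^Y} and N_V with V = [V^Y, V^X]); the function name, the max norm, and the small slack used to emulate the strict inequality are our assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def transfer_entropy_knn(y_now, v_y, v_x, k=4):
    """KSG-style TE_{X->Y} estimate per eq. (5), with empty conditioning set Z.

    y_now : (N,) target samples Y_n
    v_y   : (N, d_Y) past embedding of Y;  v_x : (N, d_X) past embedding of X
    """
    y_now = np.asarray(y_now, float).reshape(-1, 1)
    v_y = np.atleast_2d(np.asarray(v_y, float))
    v_x = np.atleast_2d(np.asarray(v_x, float))
    full = np.hstack([y_now, v_y, v_x])  # [Y_n, V], the highest-dimensional space
    # neighbor search: distance to the k-th neighbour of each point (max norm)
    d_k = cKDTree(full).query(full, k=k + 1, p=np.inf)[0][:, -1]

    def range_count(space):
        # range search: points strictly closer than d_k (tiny slack emulates
        # the strict inequality); subtract 1 to exclude the point itself
        tree = cKDTree(space)
        return np.array([len(tree.query_ball_point(p, r, p=np.inf)) - 1
                         for p, r in zip(space, d_k - 1e-12)])

    n_vy = range_count(v_y)                        # N_{V^Y}
    n_yvy = range_count(np.hstack([y_now, v_y]))   # N_{Y_n V^Y}
    n_v = range_count(np.hstack([v_y, v_x]))       # N_V
    return digamma(k) + np.mean(digamma(n_vy + 1)
                                - digamma(n_yvy + 1) - digamma(n_v + 1))
```

For a unidirectionally coupled pair (X driving Y), the estimate in the causal direction clearly exceeds the one in the reverse direction.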

In the NUE implementation of the NN estimator, the maximization of the mutual information between the component
\hat{W}_n selected at step k and the target variable Y_n (step (*) of the algorithm) was obtained through the maximization of the conditional mutual information I(Y_n;\hat{W}_n|V_n^{(k-1)}), computed with the approach described above: the four relevant entropies are estimated through a neighbor search in the space of highest dimension and range searches in the projected sub-spaces of lower dimension. Moreover, the randomization procedure applied to test candidate significance consisted in shuffling randomly and independently both the points of \hat{W}_n and those of Y_n. These techniques have recently been shown to be optimal for the selection of candidates in a non-uniform embedding approach using nearest neighbor entropy estimators (Kugiumtzis, 2013).
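The shuffling-based significance test can be sketched generically as a permutation test. This is a schematic version under our own naming; `stat` stands for any dependence estimator (e.g. a kNN conditional mutual information as above), and the independent shuffling of both variables follows the procedure described in the text:

```python
import numpy as np

def shuffle_significance(w, y, stat, n_surrogates=100, alpha=0.05, seed=None):
    """Permutation test for candidate significance: shuffle W and Y randomly
    and independently, recompute the statistic, and compare the observed
    value against the surrogate distribution.
    """
    rng = np.random.default_rng(seed)
    observed = stat(w, y)
    surrogates = [stat(rng.permutation(w), rng.permutation(y))
                  for _ in range(n_surrogates)]
    # one-sided p-value with the +1 correction for a valid permutation test
    p = (1 + sum(s >= observed for s in surrogates)) / (n_surrogates + 1)
    return observed, p, p < alpha
```

A candidate is retained when the observed statistic exceeds the surrogate distribution at the chosen significance level alpha.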


  1. Cover, Thomas, Hart, Peter: Nearest neighbor pattern classification. In: IEEE Trans Inf Theory, 13 (1), pp. 21–27, 1967.
  2. Kraskov, Alexander, Stögbauer, Harald, Grassberger, Peter: Estimating mutual information. In: Phys Rev E, 69 (6), pp. 066138, 2004.
  3. Wibral, Michael, Rahm, Benjamin, Rieder, Maria, Lindner, Michael, Vicente, Raul, Kaiser, Jochen: Transfer entropy in magnetoencephalographic data: Quantifying information flow in cortical and cerebellar networks. In: Prog Biophys Mol Biol, 105 (1), pp. 80–97, 2011.
  4. Vicente, Raul, Wibral, Michael, Lindner, Michael, Pipa, Gordon: Transfer entropy a model-free measure of effective connectivity for the neurosciences. In: J Comput Neurosci, 30 (1), pp. 45–67, 2011.
  5. Kugiumtzis, D.: Direct-coupling information measure from nonuniform embedding. In: Phys Rev E, 87 , pp. 062918, 2013.
