## Neural Networks Estimator

**Introduction**

A fundamental problem in the study of dynamical systems is how to find the interdependencies among their individual components, whose activity is recorded and stored in time series. Over the last few years, considerable effort has been dedicated to the development of algorithms for the inference of causal relationships among subsystems, a problem which is strictly related to the estimate of the information flow among subsystems, Wibral (2014), Sameshima (2014). Two major approaches to accomplish this task are Granger causality (GC), Granger (1969), Bressler (2011) and transfer entropy (TE), Schreiber (2000). GC is based on regression, testing whether a source variable (driver) is helpful to improve the prediction of a destination variable (target) beyond the degree to which the target predicts its own future. GC is a model-based approach, implying that the corresponding statistics for validation can be derived from analytic models, resulting in a fast and accurate analysis. A pitfall, however, is inherent to model-based approaches: the model assumed to explain the data often implies strong assumptions and the method is not able to detect the correct directed dynamical networks

when these assumptions are not met. On the other hand non-parametric approaches, such as transfer entropy, allow the pattern of influences to be obtained in the absence of any guidance or constraints from theory; the main disadvantages of non-parametric methods are the unavailability of analytic formulas to evaluate the significance of the transfer entropy and the computational burden, typically heavier than those required by model-based approaches.

Feed-forward neural networks, consisting of layers of interconnected *artificial neurons*, Rumelhart (1986), are among the most widely used statistical tools for non-parametric regression. Relying on neural networks, the proposed approach to Granger causality will be both non-parametric and based on regression, thus realizing the Granger paradigm in a non-parametric fashion.

In this paper we address the implementation of Granger’s original definition of causality in the context of the artificial neural networks approach Bishop (1995). The metrics used to validate the hypothesis of directed influence is the *prediction error*: the difference between the network output and the expected target. The choice of the correct prediction error, and consequently the choice of the past states of the time series that will be fed to the model, has to be accompanied by a validation phase. Only under optimization of the *generalization error* one can be sure that the network is not overfitting.

In order to deal with an increasing number of inputs, each one representing a specific candidate source of directed influence, we will adopt a non-uniform embedding procedure, Faes (2011) that is an iterative procedure to select only the most informative past states of the system to predict the future of the target series among a wider number of available past states. In line with this procedure the network will be trained with an increasing number of inputs, each of them representing a precise past state of the variables that are most helpful to predict the target. Also this selection process will be implemented using the notions of prediction error and generalization error, the former quantifying how well the training data are reproduced, the latter describing the goodness of the validation on a novel set of data.

It is worth stressing that a neural networks approach to GC has been already proposed in Attanasio (2011), where neural networks with a fixed number of inputs, together with other estimators of information flow, are used to evaluate GC. In Attanasio (2011) neural networks are trained without a validation set and an empirical method to avoid overfitting is adopted. To our knowledge the present approach is the first time that non-uniform embedding and a regularization strategy by a validation set are used together in the context of neural network approaches to detect dynamic causal links. Moreover, the neural networks built by our approach will accomplish not only the task of estimating information flows among variables, they may also be used for dynamic classification task as well.

**Introduction to Neural Networks**

Artificial neural networks (ANN) are a very popular branch of machine learning. Here we give a brief introduction to neural networks.

Neural networks can be represented as oriented graphs whose nodes are simple processing elements called “neurons” handling their local input, consisting of a weighted summation of the outputs from the parents nodes, Rumelhart (1986). The input signal is processed by means of a function, called “activation function”, and the corresponding outcome, called “output”, is then sent to the linked nodes by a weighted connection; the weight is a real number that represents the degree of relevance of that connection inside the neural network. The most common architecture of a neural network consists of neurons ordered into layers. The first one is called “input layer” that receives the external inputs. The last layer is called “output layer” that gives the result of the computations made by the whole network. All the layers between the input and output layer are called “hidden layers”.

Neural networks with at least one hidden layer and activation functions as the sigmoid function on the hidden nodes are able to adequately approximate all kinds of continuous functions defined on a compact set from a -dimensional input space , the domain, to a -dimensional output space , the codomain given a sufficient number of hidden nodes: in this sense one can say that neural networks can perform any mapping between two different vector spaces, Bishop (1995). In order to allow a neural network to find the correct mapping, a so-called “learning phase” is needed. In this work we use *supervised* learning, during which inputs are presented to the network and its output is compared to a known output. The weights are adjusted by the network that tries to minimize a cost function that depends upon the network output and the known output. This kind of learning allows a network to discover hidden patterns inside the data.

In this work we implement a growing neural network to study dynamical interactions in a system made up of several variables, described by time series, interacting with each other. The aim of the work is not only to find a directional relationship of influence between a subset of time series, the *source*, and a *target* time series taking into account the rest of the series collected in a set, called *conditioning*, but also to determine the delay at which the source variables are influencing the target. We will then see how the neural networks approach can be useful to accomplish, under the same framework, several tasks such as: finding the directed dynamical influences among variables chosen at a certain delay; predicting a target series when the network is fed with a novel realization of a dynamical system whose connectivity structure has been previously learned; classifying a new data set, giving information about how close the causal relationships are to those observed in data sets used during the learning phase.

**Mathematical Framework**

In this work we deal with growing feed-forward neural networks to better infer the directed dynamical influences in a composite system. Each stochastic variable at hand is assumed to be zero mean (the mean of the data sample is subtracted from data), hence we will deal with neural networks without bias terms. A classical feed-forward neural network without bias is usually described by: a finite set of nodes divided in inputs, output nodes and hidden nodes; a finite set of one way direction connections each one connecting a node belonging to the -th layer to a node belonging to the -th layer, with . A weight is associated with each connection from the node to the node . Each node is characterized by an input function , an input value , an activation function , and an output value , Bishop (1995). Let us now define as the weights vector of the connections which leave the nodes of the -th layer and reach the node . Let us define as the output vector of a generic layer of nodes. The input is given by . Usually we have: . Each is given by

(1)

To evaluate the output of a multilayer network, consecutive applications of (1) are needed to activate all network nodes. Fig1 depicts the schematic structure of the feed-forward networks under discussion here.

To summarize: the output values of a network can be expressed as deterministic functions of the inputs. Assuming that the network has only one hidden layer , we can say that the whole network represents a function, linear or non-linear depending on the linearity or non-linearity of the , between the -dimensional input space and the -dimensional output space, with parameters given by the network weights. The relation holding between inputs and outputs of the network can be approximated in the multidimensional space spanned by the hidden nodes by either an hyperplane if linear functions are used as activation functions of the hidden nodes or by a smoother approximation when non-linear functions are set as activation functions of the hidden nodes. Usually, the activation functions of the output nodes are set to be the identity function that does not modify the input values of the output nodes because the outputs of the network are not supposed to be bounded in order to assume values as close as possible to the training target values, assuming that overfitting is avoided.

So far we have shown how neural networks can process inputs and how they can be mapped onto a parametric function .

We can now assume that there is a function to be modelled and we know a finite set of couples , where , is the value of the function evaluated in plus an error . We want to approximate using the parametric function . The function can be found through the minimization of a certain error function . For instance a classical error function is the sum of squares function (2) to minimize by means of an iterative procedure that requires the data to be presented to the network several times through consecutive realizations called *epochs*:

(2)

The training of the network is the process to determine, starting from a finite set of couples , the weights that can better shape to be as close as possible to . After each epoch of the training phase the weights in the network are adjusted. At this point a definition of *close* is in order. Let us suppose a noisy dataset consisting of and where is the noise term. If we train the network until the input can be exactly reproduced then is not only reproducing , but the noise too. It is easy to understand that the more specialized the network the less it will be able to predict the right when a never seen before is presented to the network. In this case we say that the network is not able to *generalize*. To overcome this issue the is embedded in the learning. The validation phase is paramount because it allows the network to both model the function from which the data could have been drawn and to avoid modeling fluctuations produced by noise in the training set. In order to accomplish these two modeling tasks at the same time, the whole learning procedure is divided into two well distinguished steps:

- the whole data set is divided in two groups. One group is used for the training step during which the weights are updated
- the second group is used for the validation step. These data have not been used in the previous step. We used a maximum amount of
*training epochs*and a smaller number of epochs called*validation epochs*. The validation phase is embedded in the learning phase: this combination of training and validation avoids erroneous use of the training procedure, thus avoiding overfitting.

In the following section, we present the algorithm used for the learning phase:

- training step: adjust the weights after a number of
*training epochs* - validation step: evaluate the generalization error and store it in a vector
- repeat steps 1. and 2. continuing to train the network until one of the following three stop conditions is verified:
- the relative error evaluated as
is less than a

*validation threshold*set to . The value of is set to 5; -
- the maximum number of training epochs is reached.

The and *validation threshold* values have been chosen taking into account a cautious gradient descent implying small updating steps as the main concern here is the risk of overfitting.

**Granger Causality with neural networks**

The aim is to find directed dynamical influences among variables, modeled as time series, using neural networks as a powerful tool to compute the prediction errors needed to evaluate causality in the Granger sense. According to the original definition, Granger causality (GC) deals with two linear models of the present state of a target variable. The first model does not include information about the past states of a driver variable, while the second model contains such information. If the second model’s error is less than that of the first model in predicting the present state of the target, then we can safely say that the driver is causing the target in the sense of Granger Granger (1969). Here we introduce a new Granger causality measure called *neural networks Granger causality* (NNGC) defined as

(3)

where is the prediction error obtained by the network that does not take into account the driver’s past states, while is the prediction error evaluated by the network that takes into account the driver’s past states.

Therefore, instead of fitting predefined models, (linear ones in the original proposal by Granger) we train a neural network to estimate the target using only the past states that can better explain the target series, by using the non-uniform embedding technique. Such strategy leads to growing neural networks, with an increasing number of input neurons, each input neuron representing a past state chosen from the amount of past states available, considering all the variables in the system. The architecture of the network and choice of the most suitable past states, performed through the non-uniform embedding approach, are described in detail in the next sections. Relying on

neural networks, this method realizes the Granger paradigm in a non-parametric fashion, like in Ancona (2004), Marinazzo (2006), where radial basis function networks where employed. This article improves such previous work by (i) using non-uniform embedding and (ii) employing training and validation phases concurrently to ensure a more robust detection of dynamical interactions.

**Non-Uniform Embedding Using Neural Networks (NeuNet NUE)**

Here we want to investigate the opportunity to use neural networks to create the two models needed to evaluate NNGC and, at the same time, to better choose the right candidates to be considered as terms of the models. In this sense our method is a model-free approach because we do not assume any model a priori that can explain the data, but we allow the network to explore the parameters space in order to find the model we need. The procedure will be able to model a function from the input space, spanned by the time series’ past states, and the output space, spanned by the present state of the target series: . It will be possible to estimate a function as close as possible to . This will ensure a precise prediction of from itself, and . It is easy to see that from it is possible to assess whether for another data set , the same relation holds: in this case the network will be able to generalize.

In this study a three-layers feed-forward neural network is used Bishop (1995), trained by means of the resilient back propagation technique that is one of the fastest learning algorithms, Riedmiller (1993). Briefly, the resilient back propagation is an optimized algorithm to update the weights of a neural network based on the gradient descent technique. Let be the weight update value that only determines the size of the weight update and the error function. Then the resilient back propagation rule is the following:

(4)

where . To summarize: every time the partial derivative of the current weight changes in sign, i.e. the error function slope changes indicating that a local minimum has been avoided, the updated value is decreased by the factor allowing a reversal, or “*coming back*“, in the parameters space towards the local minimum. If the derivative does not change sign, then the updated value is increased by the factor accelerating towards a local minimum.

Once the updated value is evaluated, the weight update is quite straightforward as shown by the following equations:

(5)

so that . However, we should also take into account the case when the partial derivative changes sign: the previous weight update is then reverted as follows:

(6)

Following the NUE scheme, each input corresponds to a candidate, while the minimization criterion is the prediction error between the network output and . We should keep in mind that the core of the entire procedure lies in the choice of the candidates that can actually help to predict the target series. Once the relative prediction error, defined as

where can assume values from 1 to the number of initial available candidates, is greater than or equal to a threshold, the procedure stops and no further candidates are chosen. To summarize: the hypothesis of Granger causality evaluates how much information is introduced by adding a new input with respect to the information carried only by the inputs previously considered. Moreover, it is worth stressing that in this case we do not rely on the comparison with a null distribution in order to choose whether a given candidate must be chosen or not. On the other hand, when a driver-response relationship among variables holds, the algorithm will find, input by input, the candidate that will give the lowest prediction error, this being a condition that can hold only if we ensure the network is not overfitted. The risk of overfitting is the reason why a validation phase, described in detail in the following sections, was implemented and the idea of a fixed amount of training iterations was discarded. As soon as the error on the validation set, called *generalization error* increases, the training of the network stops ensuring the capability of the network to generalize, implying that it has not been overfitted.

To better explain the steps implemented to select the past states as a pseudo-code we can say that a *for* loop is nested within a *while* loop. The *while* loop condition, that takes into account the decreasing of the prediction error during the training phase, determines whether or not an additional input should be added to the network. It is worth stressing that during the whole procedure of the candidates’ selection, the internal architecture of the network is kept fixed: the number of hidden nodes is set up as a fraction of the maximum number of candidates available and it does not change. The *for* loop, instead, tests each available candidate given the previous inputs already chosen. During this test, at each iteration of the *for* loop, a network is trained taking into account the current candidate and the validation phase takes place according to the procedure explained at the point above. Therefore, the validation error is taken into account in order to allow the network itself to reach its best performance, in terms of the generalization task, according to the current candidate. At the end of the *for* loop the candidate which gives the minimum prediction error is selected. If the prediction error satisfies the *while* loop condition, such that the relative prediction error is smaller than a threshold, the candidate is chosen and deleted from the set of the available candidates so that the procedure can continue. Otherwise, the procedure will stop. The pseudo-code of the algorithm is shown in the following:

- Initialize network parameters;
- Initialize the embedded matrix
*EM* - Initialize the prediction error
*PE*vector - Initialize the current prediction error
*CPE*vector - Initialize final candidate matrix FCM
- Initialize CS . The terms and should be considered too in case the instantaneous effects should be taken into account.
- k = 1
**While**CS**If**CS is full- Initialize the network NN with one input, the number of chosen hidden nodes, one output
**Else**- Add to NN another input;
- Initialize only the weights between the new input and the hidden nodes keeping all the rest fixed;
**End If****For****While**epoch maxTrainEpochs- train the network, after 30 training epochs evaluate the prediction error;
- validate the network evaluating the generalized error;
**If**epochs/valEpochs == 0**and**epochs maxTrainEpochs- evaluate the relative validation error
**If**validationThreshold**or**relative error**or**epochs == maxTrainEpochs- Store the prediction error in CPE(i)
- epoch = maxTrainEpochs + 1
**End If**- epoch = epoch + 1
**End If****End While****End For**- NN = neural network having in input the candidates that give the minimum prediction error stored in CPE
- PE = min(CPE)
**If**relative prediction error trainThreshold- NN = NN
- PE = PE
- CS
**Else**- NN = NN
- add to FCM the candidate of CS that returns the minimum prediction error
- delete from CS the candidate that returns the minimum prediction error
- k = k + 1
**End If****End While**- return NN; FCM; PE

In the following we will explain in more detail the algorithm showed above:

- In the initialization phase it is worth noting that , , represent the maximum lag considered for the past variables of the observed processes. In the following experiments we will set , , to take into account more past states than needed.
- at the -th step of the while loop at line , where runs on the number of inputs chosen, the network tests all the candidates available by means of the for loop at line : there are inputs. The first -1 inputs are the ones chosen so far and on the -th input one candidate per time is considered. The initial conditions are the same for each candidate: the weights have been fixed so the ones departing from the -1 inputs are the same found as the result of the training at the (-1)-th step and the weights departing from the -th input are the same at the beginning of each training session when the RMSE between the network output and is evaluated. Lines – take care of whether to stop the training phase for the current candidate according to the generalization error.
- lines – check whether it would be worth adding candidates, or it is better to stop the whole procedure because no further meaningful information can be added to better predict the target. The generalization error is not relevant at this stage, since it is only used to stop the training phase.

The network is finally trained to reproduce the best correspondence between the space spanned by the terms of FCM and the space spanned by . The network is then the model that can be used to explain , including the driver’s candidates. This model will give . To evaluate the candidates belonging to the source system should be removed from the network, so the corresponding inputs and weights should not be considered. This configuration leaves the weights between the hidden nodes and the output unchanged, so the network now won’t be able to approximate as well as during the evaluation of the previous term if the causal hypothesis holds between the driver and the target. can be computed projecting the information carried by the inputs representing the candidates belonging to the target and the conditioning variables on the output space and evaluating the RMSE between the network output and . NNGC is now immediately evaluated by the difference between the two terms. The significance of the causality measure estimated with the neural network method embedded into the NUE approach results simply from the selection of, at least, one lagged component of the driver. In other words, if at least one component from the driver is selected, the Granger causality is strictly positive and can be assumed as significant. If this is not the case, the estimated causality that results is exactly zero and is assumed to be non-significant.

NUE is used here as a feature selection algorithm. Other feature selection algorithms can be used to select the most informative candidates; in the present work our choice is in line with other approaches to detect dynamical interactions present in literature, thus offering a coherent framework for all the estimators.

**Bibliography**

- Directed Information Measures in Neuroscience. Springer, 2014.
- Methods in Brain Connectivity Inference through Multivariate Time Series Analysis. CRC Press, 2014.
- Investigating causal relations by econometric models and cross-spectral methods. In: Econometrica, pp. 424–438, 1969.
- Wiener--Granger causality: a well established methodology. In: Neuroimage, 58 (2), pp. 323–329, 2011.
- Measuring information transfer. In: Phys Rev Lett, 85 (2), pp. 461, 2000.
- Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA, USA, 1986, ISBN: 0-262-68053-X.
- Neural networks for pattern recognition. In: 1995.
- Information-based detection of nonlinear Granger causality in multivariate processes via a nonuniform embedding technique. In: Phys. Rev. E, 83 , pp. 051112, 2011.
- Detecting human influence on climate using neural networks based Granger causality. In: Theoretical and Applied Climatology, 103 (1-2), pp. 103–107, 2011.
- MuTE: A MATLAB Toolbox to Compare Established and Novel Estimators of the Multivariate Transfer Entropy. In: PloS one, 9 (10), pp. e109462, 2014.
- Radial basis function approach to nonlinear Granger causality of time series. In: Phys. Rev. E, 70 , pp. 056221, 2004.
- Nonlinear parametric model for Granger causality of time series. In: Phys. Rev. E, 73 , pp. 066216, 2006.
- A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: Neural Networks, 1993., IEEE International Conference on, pp. 586–591, IEEE 1993.