Neural Networks but in infinite dimensions!
In this blog post, we will dive deep into this so-called infinite-dimensional extension of neural networks -- Neural Operators. We start by making precise some notions related to Neural Operators.
Discretization-Invariant Models
Generally, when we train neural networks (NNs) to solve partial differential equations (PDEs), we fix the resolution of the discretization grid. The training error is then small for that particular discretization, but the NN does not generalize to other resolutions, where the error can be large. A discretization-invariant NN is independent of the chosen grid resolution and does not need to be re-trained when we want to change the resolution. For a model to be discretization-invariant, it needs to satisfy the following conditions:
- It can be evaluated at any point of the input domain.
- It can be evaluated at any point of the output domain.
- As we refine the chosen discretization, it converges to a continuum version of the operator.
Some popular architectures such as Graph Neural Networks 1 and Transformer models 2 can operate on varying resolutions, but they fail to satisfy the last property given above.
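To make the last condition concrete, here is a tiny numerical illustration (my own toy example, not from the paper): take the continuum operator \(\cal G(a)(x) = \int_0^x a(y)\,dy\) and discretize it with a Riemann sum. As the grid is refined, the discretized evaluation converges to the continuum operator.

```python
import numpy as np

# Continuum operator: G(a)(x) = \int_0^x a(y) dy. For a(y) = cos(y) the exact
# output at x = 1 is sin(1). A left-endpoint Riemann sum plays the role of the
# discretized model; its output approaches the continuum value as the grid refines.
a, x_eval = np.cos, 1.0
for L in [8, 32, 128, 512]:
    grid = np.linspace(0.0, x_eval, L, endpoint=False)  # L-point discretization
    approx = np.sum(a(grid)) * (x_eval / L)             # discretized evaluation at x = 1
    print(L, abs(approx - np.sin(x_eval)))              # error shrinks with refinement
```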
Neural Operator
Neural operators are multi-layer architectures in which the layers are themselves operators composed with non-linear activation functions. The overall composition is therefore an operator, and thus satisfies the discretization-invariance property. The authors describe the key design choice for neural operators: the operator layers. These layers are linear operators with activation functions on top of them, which closely resembles the classical neural-network setting. Moreover, neural operators also satisfy a universal approximation property, which states that they can approximate any continuous operator, uniformly over compact subsets of the input Banach space, up to any desired accuracy.
Neural operators are the only known class of models that guarantee both discretization invariance and universal approximation.
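A typical operator layer combines a pointwise linear map with a kernel integral operator, \(v_{t+1}(x) = \sigma\big(W v_t(x) + b + \int_D \kappa(x,y)\, v_t(y)\,dy\big)\). Below is a minimal sketch of such a layer discretized on an \(L\)-point grid; the Gaussian kernel, the shapes, and all names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def operator_layer(v, x, W, b, kappa):
    """One (illustrative) neural-operator layer on an L-point grid:
        v_out(x_i) = relu( W v(x_i) + b + (1/L) * sum_j kappa(x_i, x_j) v(x_j) ).
    v : (L, c) channel values at grid points x : (L, d); W : (c, c); b : (c,);
    kappa : callable returning a (c, c) kernel matrix for a pair of points."""
    L = x.shape[0]
    integral = np.zeros_like(v)
    for i in range(L):
        # Riemann / Monte-Carlo approximation of the kernel integral at x_i.
        integral[i] = sum(kappa(x[i], x[j]) @ v[j] for j in range(L)) / L
    return np.maximum(v @ W.T + b + integral, 0.0)  # ReLU activation

# Toy usage with a hypothetical Gaussian kernel acting identically on all channels.
rng = np.random.default_rng(0)
L, c = 32, 4
x = np.linspace(0.0, 1.0, L)[:, None]               # grid points in D = [0, 1]
v = rng.standard_normal((L, c))                      # current layer's function values
W, b = rng.standard_normal((c, c)) / c, np.zeros(c)
kappa = lambda xi, xj: np.exp(-np.sum((xi - xj) ** 2)) * np.eye(c)
v_next = operator_layer(v, x, W, b, kappa)
```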
Learning Operators
Generic Parametric PDEs
We consider the generic family of PDEs of the following form:

\[
(L_a u)(x) = f(x), \qquad x \in D,
\]
\[
u(x) = 0, \qquad x \in \partial D,
\]
for some \(a \in \cal{A}, f \in \cal{U}^*\) and \(D \subset \bb{R}^d\) a bounded domain. We assume that the solution \(u : D \to \bb R\) belongs to the Banach space \(\cal U\) and \(L_a : \cal A \to \cal L(\cal U,\cal U^*)\) is a mapping from the parameter Banach space \(\cal A\) to the space of (possibly unbounded) linear operators mapping \(\cal U\) to its dual \(\cal U^*\). A natural operator arising from this PDE is \(\cal G^\dagger := L_a^{-1} f: \cal A \to \cal U\), defined to map the parameter to the solution, \(a \mapsto u\).
Wherever needed, we will assume that the domain \(D\) is discretized into \(K \in \bb N\) points and that we observe \(N \in \bb N\) pairs of coefficient functions and (approximate) solution functions \(\{a^{(i)}, u^{(i)}\}^N_{i=1}\) that are used to train the model. We also assume that the \(a^{(i)}\) are i.i.d. samples from a probability measure \(\mu\) supported on \(\cal A\), so that the \(u^{(i)} = \cal G^\dagger(a^{(i)})\) are distributed according to the pushforward of \(\mu\) under \(\cal G^\dagger\).
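As a concrete (hypothetical) instance of the coefficient-to-solution map \(a \mapsto u = L_a^{-1} f\), the sketch below generates training pairs for the 1D elliptic problem \(-(a(x) u'(x))' = f(x)\) on \((0,1)\) with zero boundary conditions, using a standard finite-difference solver; the random coefficient family is an arbitrary choice for illustration.

```python
import numpy as np

def solve_darcy_1d(a, f, K):
    """Finite-difference solve of -(a(x) u'(x))' = f(x) on (0,1), u(0) = u(1) = 0.
    a, f : callables; K : number of grid cells. Returns the grid and the solution."""
    x = np.linspace(0.0, 1.0, K + 1)           # K+1 grid points
    h = 1.0 / K
    a_half = a(x[:-1] + h / 2)                 # coefficient at cell midpoints
    # Tridiagonal stiffness matrix for the K-1 interior unknowns.
    main = (a_half[:-1] + a_half[1:]) / h**2
    off = -a_half[1:-1] / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    u_int = np.linalg.solve(A, f(x[1:-1]))
    return x, np.concatenate(([0.0], u_int, [0.0]))

# Draw random coefficients a^(i) and record the pairs (a^(i)|_grid, u^(i)|_grid).
rng = np.random.default_rng(0)
K, N = 64, 8
pairs = []
for _ in range(N):
    c = rng.uniform(0.5, 2.0, size=3)
    a_fn = lambda x, c=c: c[0] + c[1] * np.sin(np.pi * x) + c[2] * x**2  # a > 0 on [0, 1]
    x, u = solve_darcy_1d(a_fn, lambda x: np.ones_like(x), K)
    pairs.append((a_fn(x), u))
```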
Problem setting
The goal is to learn a mapping between two infinite-dimensional spaces using a finite collection of observed input-output pairs from this mapping.
Let \(\cal A\) and \(\cal U\) be Banach spaces of functions defined on bounded domains \(D \subset \bb R^d, D' \subset \bb R^{d'}\) respectively and \(\cal G^\dagger : \cal A \to \cal U\) be a (typically) non-linear map. Suppose we have observations \(\{a^{(i)},u^{(i)}\}^N_{i=1}\) where \(a^{(i)} \sim \mu\) are i.i.d. samples drawn from some probability measure \(\mu\) supported on \(\cal A\) and \(u^{(i)} = \cal G^\dagger(a^{(i)})\), possibly corrupted with noise. We aim to build an approximation of \(\cal G^\dagger\) by constructing a parametric map

\[
\cal G_\theta : \cal A \to \cal U, \qquad \theta \in \bb R^p,
\]
with parameters from the finite-dimensional space \(\bb R^p\), and then choosing \(\theta^\dagger \in \bb R^p\) so that \(\cal G_{\theta^\dagger} \approx \cal G^\dagger\).
We will be interested in controlling the error of the approximation on average with respect to \(\mu\). In particular, assuming \(\cal G^\dagger\) is \(\mu\)-measurable, we will aim to control the \(L^2_{\mu}(\cal A;\cal U)\) Bochner norm of the approximation,

\[
\| \cal G^\dagger - \cal G_\theta \|^2_{L^2_\mu(\cal A;\cal U)} = \bb E_{a \sim \mu} \| \cal G^\dagger(a) - \cal G_\theta(a) \|^2_{\cal U} = \int_{\cal A} \| \cal G^\dagger(a) - \cal G_\theta(a) \|^2_{\cal U} \, d\mu(a).
\]
With this framework for learning in infinite dimensions, we can solve the associated empirical-risk minimization problem

\[
\min_{\theta \in \bb R^p} \bb E_{a \sim \mu} \| \cal G^\dagger(a) - \cal G_\theta(a) \|^2_{\cal U} \approx \min_{\theta \in \bb R^p} \frac{1}{N} \sum_{i=1}^N \| u^{(i)} - \cal G_\theta(a^{(i)}) \|^2_{\cal U},
\]
which directly mirrors the finite-dimensional setting. Along with this, we will also consider the setting where the error is measured uniformly over compact subsets of \(\cal A\). In particular, given any compact \(K \subset \cal A\), we consider

\[
\sup_{a \in K} \| \cal G^\dagger(a) - \cal G_\theta(a) \|_{\cal U},
\]
which is the standard error metric in the approximation-theory literature. The architecture proposed in this paper 3 does not depend on the way the functions \(a^{(i)}, u^{(i)}\) are discretized.
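On a grid, the \(\cal U\)-norm inside the empirical risk has to be approximated by a quadrature rule. Here is a minimal sketch, assuming a uniform grid of spacing \(h\), a Riemann-sum approximation of the \(L^2\) norm, and a placeholder model standing in for \(\cal G_\theta\); all names are illustrative.

```python
import numpy as np

def empirical_risk(model, a_batch, u_batch, h):
    """Estimate of (1/N) sum_i ||u^(i) - G_theta(a^(i))||^2_{L^2}, with the L^2
    norm approximated by a Riemann sum on a uniform grid of spacing h.
    a_batch, u_batch : arrays of shape (N, L) holding point values on the grid;
    model : any callable mapping an (L,)-vector of coefficient values to an
    (L,)-vector of predicted solution values (a stand-in for G_theta)."""
    errs = [h * np.sum((model(a) - u) ** 2) for a, u in zip(a_batch, u_batch)]
    return float(np.mean(errs))

# Example with random data and the trivial zero model, which predicts u = 0.
zero_model = lambda a: np.zeros_like(a)
a_batch = np.random.default_rng(0).standard_normal((4, 64))
u_batch = np.random.default_rng(1).standard_normal((4, 64))
print(empirical_risk(zero_model, a_batch, u_batch, h=1.0 / 64))
```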
Discretization
Our data \(a^{(i)}\) and \(u^{(i)}\) are functions, and we assume access only to their pointwise evaluations. For simplicity, take \(D = D'\) and suppose that the input and output functions are both real-valued. Let \(D^{(i)} = \{x^{(i)}_l\}^L_{l=1} \subset D\) be an \(L\)-point discretization of the domain \(D\) and assume we observe \(a^{(i)}|_{D^{(i)}}, u^{(i)}|_{D^{(i)}} \in \bb R^L\) for a finite collection of input-output pairs indexed by \(i\). The parametric operator class presented in this paper is consistent, i.e. for a fixed set of parameters, refining the input discretization yields convergence to the true function-space operator. This notion is made precise below.
Definition 1
We call a discrete refinement of the domain \(D \subset \bb R^d\) any sequence of nested sets \(D_1 \subset D_2 \subset \dots \subset D\) with \(|D_L| = L\) for any \(L \in \bb N\) such that, for any \(\epsilon > 0\), there exists a number \(L = L(\epsilon)\) such that

\[
D \subseteq \bigcup_{x \in D_L} \{ y \in \bb R^d : \| y - x \|_2 < \epsilon \}.
\]
Definition 2
Given a discrete refinement \((D_L)^{\infty}_{L=1}\) of the domain \(D \subset \bb R^d\), any member \(D_L\) is called a discretization of \(D\).
Since \(a: D \subset \bb R^d \to \bb R^m\), pointwise evaluation of the function at a set of \(L\) points gives rise to the data set \(\{ (x_l,a(x_l))\}^L_{l=1}\). This can also be viewed as a vector in \(\bb R^{Ld} \times \bb R^{Lm}\).
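For instance (with made-up numbers), a single input function evaluated on an \(L\)-point discretization can be flattened into such a vector as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, m = 16, 2, 1                                       # L points, d input dims, m output dims
x_pts = rng.uniform(size=(L, d))                         # an L-point discretization of D
a_vals = np.sin(x_pts.sum(axis=1, keepdims=True))        # a(x_l) for a sample input function
flat = np.concatenate([x_pts.ravel(), a_vals.ravel()])   # a vector in R^{Ld} x R^{Lm}
```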
Definition 3
Suppose \(\cal A\) is a Banach space of \(\bb R^m\)-valued functions on the domain \(D \subset \bb R^d\). Let \(\cal G : \cal A \to \cal U\) be an operator, \(D_L\) be an \(L\)-point discretization of \(D\), and \(\hat{\cal G} : \bb R^{Ld} \times \bb R^{Lm} \to \cal U\) some map. For any \(K \subset \cal A\) compact, we define the discretized uniform risk as

\[
R_K(\cal G, \hat{\cal G}) := \sup_{a \in K} \| \cal G(a) - \hat{\cal G}(D_L, a|_{D_L}) \|_{\cal U}.
\]
Definition 4
Let \(\Theta \subseteq \bb R^p\) be a finite-dimensional parameter space and \(\cal G : \cal A \times \Theta \to \cal U\) a map representing a parametric class of operators with parameters \(\theta \in \Theta\). Given a discrete refinement \((D_L)^\infty_{L=1}\) of the domain \(D \subset \bb R^d\), we say \(\cal G\) is discretization-invariant if there exists a sequence of maps \(\hat{\cal G}_1, \hat{\cal G}_2, \dots\) where \(\hat{\cal G}_L : \bb R^{Ld} \times \bb R^{Lm} \times \Theta \to \cal U\) such that, for any \(\theta \in \Theta\) and any compact set \(K \subset \cal A\),

\[
\lim_{L \to \infty} R_K\big(\cal G(\cdot, \theta), \hat{\cal G}_L(\cdot, \cdot, \theta)\big) = 0.
\]
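Putting Definitions 1 through 4 together, the toy script below builds a nested dyadic refinement of \([0,1]\), discretizes the simple operator \(\cal G(a)(x) = \int_0^x a(y)\,dy\) with a trapezoid rule, and estimates the discretized uniform risk over a small set of test inputs, which decays as the refinement proceeds. All choices here (the operator, the test inputs, the sup-norm surrogate for \(\|\cdot\|_{\cal U}\)) are illustrative assumptions, not the paper's parametric class.

```python
import numpy as np

# Nested dyadic grids D_L of [0, 1] form a discrete refinement. For the operator
# G(a)(x) = \int_0^x a(y) dy and its trapezoid-rule discretization G_hat_L, estimate
# the discretized uniform risk over a few test inputs a in K and watch it decay.
x_out = np.linspace(0.0, 1.0, 257)                       # fixed output grid for the U-norm
K_set = [np.cos, np.sin, lambda y: np.exp(-y)]           # a few test inputs in K
exact = [np.sin(x_out), 1.0 - np.cos(x_out), 1.0 - np.exp(-x_out)]  # exact G(a)

for level in range(3, 8):
    D_L = np.linspace(0.0, 1.0, 2 ** level + 1)          # nested dyadic discretization
    risk = 0.0
    for a, u_true in zip(K_set, exact):
        segments = np.diff(D_L) * 0.5 * (a(D_L[:-1]) + a(D_L[1:]))   # trapezoid rule
        cumulative = np.concatenate(([0.0], np.cumsum(segments)))    # G_hat_L on D_L
        u_hat = np.interp(x_out, D_L, cumulative)        # evaluate on the output grid
        risk = max(risk, np.max(np.abs(u_hat - u_true))) # sup-norm surrogate for ||.||_U
    print(len(D_L), risk)                                # risk shrinks as L grows
```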
References
1. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. URL: https://ieeexplore.ieee.org/document/4700287, doi:10.1109/TNN.2008.2005605.
2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, 6000–6010. Red Hook, NY, USA, 2017. Curran Associates Inc.
3. Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89):1–97, 2023. URL: http://jmlr.org/papers/v24/21-1524.html.