NeurIPS 2025 · Spotlight

ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition

We jointly amortize Bayesian inference, active learning, and Bayesian experimental design within a single Transformer architecture.

Daolang Huang1,2 Xinyi Wen1,2,3 Ayush Bharti2 Samuel Kaski*1,2,4 Luigi Acerbi*3
ELLIS Institute Finland1 Aalto University2 University of Helsinki3 University of Manchester4

The Challenge: Closing the Loop in Real-Time Science

Imagine a drone locating survivors after a disaster. Ideally, it should strategically decide the next exploration area to maximize discovery chances, while simultaneously estimating survivor locations from incoming sensor data. Similarly, in emergency diagnosis, a doctor needs to determine the optimal sequence of tests and instantly infer the illness from unfolding results to enable timely treatment. These scenarios demand two critical capabilities working in tandem: strategically deciding which data to acquire next, and instantly updating beliefs as those data arrive.

While deep learning has successfully amortized these tasks individually, the loop remains broken. Amortized Bayesian inference methods, such as Neural Processes [1, 2, 3], Prior-data Fitted Networks [4, 5], and Neural Posterior Estimation [6], offer rapid belief updates but are typically passive observers, unable to strategize data collection. Conversely, amortized Bayesian Experimental Design (BED) methods [7, 8, 9] excel at picking informative data points but often lack a mechanism for efficient, instant inference updates after every step. This separation blocks the path to truly autonomous, real-time science.

In this paper, we introduce the Amortized Active Learning and Inference Engine (ALINE). It builds on our earlier work on flexible amortized inference, extending the architecture of the Amortized Conditioning Engine (ACE) [3] to the active setting. Leveraging a Transformer trained via reinforcement learning, ALINE jointly queries informative data and refines posterior beliefs, all within a single forward pass. More importantly, ALINE introduces flexible targeting: it can dynamically refocus its strategy on specific parameter subsets or predictive goals at runtime, without retraining.

Key Capabilities

  • Instantly select the most informative data point to acquire.
  • Instantly estimate target posterior or predictive distributions.
  • Flexibly adapt to specified goals without retraining.
Comparison between ALINE and prior amortized inference or design methods
ALINE closes the loop by integrating amortized inference and data acquisition, while prior approaches solve only one side.

Method: Amortized Active Learning and Inference Engine

Problem setup

We assume a parametric conditional model \(p(y \mid x, \theta)\) with parameters \(\theta\) and prior \(p(\theta)\). During an episode of length \(T\), the system sequentially collects a dataset \(D_T = \{(x_i, y_i)\}_{i=1}^T\) with the aim of enabling accurate and rapid Bayesian inference.

The key idea in ALINE is to make both inference and data acquisition flexible. To achieve this goal, we introduce a target specifier \(\xi\) that tells the model what it should care about in the current episode:

  • Parameter targets \(\xi_S^\theta\): the goal is to infer a subset of parameters \(\theta_S = \{\theta_l : l \in S\}\). Examples include focusing only on the parameters of scientific interest and treating the rest as nuisance variables.
  • Predictive targets \(\xi^{y^\star}_{p_\star}\): the goal is to improve the posterior predictive distribution \(p(y^\star \mid x^\star, D_T)\) for test inputs \(x^\star \sim p_\star(x^\star)\), e.g., a region of the input space where accurate predictions matter most.

The set of possible targets \(\Xi\) contains both parameter and predictive tasks, and during training we sample \(\xi \sim p(\xi)\), so ALINE learns to adapt its strategy to whatever goal is specified at runtime.
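As a concrete illustration, a target specifier can be thought of as a small tagged object that is sampled afresh for each training episode. Below is a minimal Python sketch under that assumption; the names ParameterTarget, PredictiveTarget, and sample_target are illustrative, not ALINE's actual implementation.

import random
from dataclasses import dataclass

@dataclass
class ParameterTarget:
    subset: list          # indices S of the parameters of interest

@dataclass
class PredictiveTarget:
    x_star: list          # test inputs drawn from p_star(x*)

def sample_target(num_params, num_test=16):
    """Draw xi ~ p(xi): either a parameter subset or a predictive task."""
    if random.random() < 0.5:
        k = random.randint(1, num_params)
        return ParameterTarget(subset=random.sample(range(num_params), k))
    # Predictive target: here p_star is uniform on a random subinterval of [0, 1].
    lo = random.uniform(0.0, 0.5)
    hi = random.uniform(lo, 1.0)
    return PredictiveTarget(x_star=[random.uniform(lo, hi) for _ in range(num_test)])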

Workflow of ALINE: closed loop between data acquisition and inference
Conceptual workflow of ALINE. Given the current history and target, a single forward pass proposes the next query and updates the approximate posterior or predictive distribution, closing the loop between design and inference.

Architecture

ALINE builds on Transformer Neural Processes [2] and organizes its inputs into three sets:

  • Context set \(D_t = \{(x_i, y_i)\}_{i=1}^t\): the history of queried inputs and observations.
  • Query set \(Q = \{x_n^{(q)}\}_{n=1}^N\): a pool of candidate points that the policy may choose as the next query.
  • Target set \(\mathcal{T} = \{\xi\}\): an embedding of the current inference goal (parameter subset or predictive task).

Each element is passed through embedding layers, then processed by shared Transformer layers. Self-attention captures dependencies inside the context set, while cross-attention lets query and target tokens attend to the encoded context.

To support flexible targeting, ALINE adds an extra query–target cross-attention block: candidate queries explicitly attend to the target tokens, so each candidate's representation encodes how informative that point is for the current goal.

On top of the shared backbone, ALINE uses two specialized output heads:

  • An inference head \(q_\phi\) that outputs approximate marginal posteriors over parameters or predictive distributions, parameterized as a Gaussian mixture.
  • An acquisition head \(\pi_\psi\) that assigns a score or probability to each candidate in the query set and samples the next input \(x_{t+1}\).
ALINE architecture with context, query and target sets, transformer backbone, and two heads
ALINE architecture. Context, query and target sets are embedded and processed by shared Transformer layers. The inference head performs approximate Bayesian inference, while the acquisition head outputs a policy over candidate queries.
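To make the data flow concrete, here is a schematic PyTorch sketch of the three-set design, with placeholder dimensions and a plain self-attention encoder standing in for the masked attention scheme described above; it is a sketch under these assumptions, not the released implementation.

import torch
import torch.nn as nn

class ALINESketch(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, xi_dim=8, d_model=64, n_layers=4, n_mix=3):
        super().__init__()
        self.embed_ctx = nn.Linear(x_dim + y_dim, d_model)  # (x_i, y_i) tokens
        self.embed_qry = nn.Linear(x_dim, d_model)          # candidate query tokens
        self.embed_tgt = nn.Linear(xi_dim, d_model)         # target-specifier token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.inference_head = nn.Linear(d_model, 3 * n_mix)  # mixture weights/means/log-scales
        self.acquisition_head = nn.Linear(d_model, 1)        # one logit per candidate

    def forward(self, ctx_xy, qry_x, tgt_xi):
        # ctx_xy: [B, t, x+y], qry_x: [B, N, x], tgt_xi: [B, xi_dim]
        tokens = torch.cat([
            self.embed_ctx(ctx_xy),
            self.embed_qry(qry_x),
            self.embed_tgt(tgt_xi).unsqueeze(1),
        ], dim=1)
        h = self.backbone(tokens)
        n_ctx, n_qry = ctx_xy.shape[1], qry_x.shape[1]
        mixture_params = self.inference_head(h[:, -1])       # inference head on target token
        query_logits = self.acquisition_head(h[:, n_ctx:n_ctx + n_qry]).squeeze(-1)  # policy
        return mixture_params, query_logits

The next query can then be sampled with torch.distributions.Categorical(logits=query_logits), mirroring the acquisition head's role as a stochastic policy over the candidate pool.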

Training objectives

The inference head is trained with standard negative log-likelihood objectives. For a parameter target \(\xi_S^\theta\) we minimize

$$ \mathcal{L}_S^\theta(\phi) = - \mathbb{E}_{p(\theta)p(D_T \mid \theta)} \left[ \sum_{l \in S} \log q_\phi(\theta_l \mid D_T) \right], $$

which encourages the marginals \(q_\phi(\theta_l \mid D_T)\) to match the true posterior over the parameters of interest. For a predictive target \(\xi^{y^\star}_{p_\star}\) we draw target inputs \(x^\star \sim p_\star(x^\star)\) and minimize

$$ \mathcal{L}^{y^\star}_{p_\star}(\phi) = - \mathbb{E}_{p(\theta)p(D_T \mid \theta) p_\star(x^\star)p(y^\star \mid x^\star,\theta)} \left[ \log q_\phi(y^\star \mid x^\star, D_T) \right]. $$

These losses drive the inference head to approximate either the posterior or the posterior predictive distribution, and they also provide the learning signal used to train the acquisition policy.
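Assuming a hypothetical tensor layout in which the inference head emits mixture weights, means, and log-scales for each marginal, the parameter-target loss \(\mathcal{L}_S^\theta\) reduces to a few lines (a Monte Carlo sketch, not the paper's exact code):

import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal

def mixture_nll(head_out, theta_S):
    """head_out: [B, |S|, 3K] raw mixture parameters; theta_S: [B, |S|] true values."""
    w, mu, log_sigma = head_out.chunk(3, dim=-1)
    q = MixtureSameFamily(Categorical(logits=w), Normal(mu, log_sigma.exp()))
    # Monte Carlo estimate of L_S^theta: average NLL over the batch and the subset S.
    return -q.log_prob(theta_S).mean()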

Directly optimizing the true trajectory-level information gain for the acquisition head would be intractable. Instead, we form variational lower bounds using the approximate distributions provided by \(q_\phi\). For a parameter target \(\xi_S^\theta\) the objective is

$$ \mathcal{J}_S^\theta(\psi) = \mathbb{E}_{p(\theta)p(D_T \mid \pi_\psi,\theta)} \left[ \sum_{l \in S} \log q_\phi(\theta_l \mid D_T) \right] + H[p(\theta_S)], $$

and for a predictive target \(\xi^{y^\star}_{p_\star}\)

$$ \mathcal{J}^{y^\star}_{p_\star}(\psi) = \mathbb{E}_{p(\theta)p(D_T \mid \pi_\psi,\theta) p_\star(x^\star)p(y^\star \mid x^\star,\theta)} \left[ \log q_\phi(y^\star \mid x^\star, D_T) \right] + \mathbb{E}_{p_\star(x^\star)} [H[p(y^\star \mid x^\star)]]. $$

Both objectives are lower bounds on the corresponding trajectory-level EIG and EPIG [10]. As the inference head becomes more accurate, these bounds tighten and the policy receives a more faithful reward signal.

Learning to query with self-estimated information gain

We train the acquisition head as a policy network using policy-gradient reinforcement learning. Instead of a single terminal reward, ALINE uses a dense per-step reward that measures how much the inference head improved after observing the new data point \((x_t, y_t)\).

Concretely, if the current target is \(\xi_S^\theta\), the reward at step \(t\) is the increase in average log-probability that \(q_\phi\) assigns to the true parameters of interest:

$$ R_t(\xi_S^\theta) = \frac{1}{|S|} \sum_{l \in S} \Big[ \log q_\phi(\theta_l \mid D_t) - \log q_\phi(\theta_l \mid D_{t-1}) \Big], $$

and an analogous definition is used for predictive targets based on the change in log predictive density. The policy \(\pi_\psi\) is then updated to maximize the discounted return \(\sum_{t=1}^T \gamma^t R_t(\xi)\), while gradients from the policy loss are not propagated back through the inference head.
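The sketch below spells out this reward and a matching REINFORCE-style loss, assuming a Categorical policy over the candidate set; log_q_new and log_q_old are placeholders for the inference head's log-densities of the true targets after and before the new observation, not a real API.

import torch

def step_reward(log_q_new, log_q_old):
    """R_t: mean over l in S of log q(theta_l | D_t) - log q(theta_l | D_{t-1})."""
    return (log_q_new - log_q_old).mean()

def policy_loss(log_probs, rewards, gamma=0.99):
    """log_probs: [T] log pi(x_t | D_{t-1}); rewards: [T] per-step R_t."""
    T = rewards.shape[0]
    # gamma^t R_t for t = 1..T; detach so no gradient reaches the inference head.
    discounted = rewards.detach() * gamma ** torch.arange(1, T + 1)
    # Return-to-go: credit each action with all subsequent discounted rewards.
    returns = torch.flip(torch.cumsum(torch.flip(discounted, [0]), 0), [0])
    return -(log_probs * returns).sum()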

In practice we first warm up the model by training only the inference head with randomly sampled queries. After this phase the policy and inference heads are trained jointly, leading to a single Transformer that can both propose informative queries and instantly perform Bayesian inference for a wide range of targets.

Applications and Experimental Results

Active Learning for Regression

To efficiently learn an unknown function, we need to sequentially select the most informative points to query. We pre-train ALINE entirely on diverse synthetic functions sampled from GP priors, then evaluate on both in-distribution draws and out-of-distribution benchmarks (such as Gramacy) to assess generalization.

Interactive demos: active querying on a synthetic 1D GP function and on the Gramacy benchmark function.

We compare against standard GP-based methods (GP-US, GP-EPIG [10], etc.) and the amortized baseline ACE. ALINE outperforms other methods on most out-of-distribution benchmarks.

Active learning quantitative comparisons
Figure: RMSE (↓) over query steps on synthetic and benchmark functions. Shaded areas denote 95% CI.

We then test ALINE's unique ability to switch its goal at runtime, without retraining, to infer the underlying GP's hyperparameters (e.g., lengthscale, output scale).

Figure: posterior visualization over query steps. At step 1, exploration begins and posterior beliefs are diffuse.

Bayesian Experimental Design

We evaluate ALINE on two Bayesian experimental design tasks: Location Finding and Constant Elasticity of Substitution (CES). We compare ALINE against both non-amortized baselines (VPCE) and state-of-the-art amortized methods (DAD, RL-BOED). Results show that ALINE achieves superior or competitive EIG while maintaining instant deployment speeds.

Location Finding

CES Task

Results on BED benchmarks
Results on BED benchmarks.

Psychometric Model

To demonstrate ALINE's flexible targeting capability, we consider the psychometric model, which describes the relationship between stimulus intensity and response probability. The model is governed by four parameters:

$$ \psi(x; \theta, \alpha, \gamma, \lambda) = \gamma + (1 - \gamma - \lambda) \Phi\left(\frac{x - \theta}{\alpha}\right) $$

where \(\theta\) is the threshold, \(\alpha\) is the slope, \(\gamma\) is the guess rate, and \(\lambda\) is the lapse rate.
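For reference, the model transcribes directly into a few lines of Python, with \(\Phi\) (the standard normal CDF) written via the error function; this is a plain transcription of the formula above, not the paper's code.

import math

def psychometric(x, theta, alpha, gamma, lam):
    """Response probability at stimulus intensity x."""
    z = (x - theta) / alpha
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return gamma + (1.0 - gamma - lam) * Phi

For example, in a two-alternative forced-choice task one would typically fix the guess rate at \(\gamma = 0.5\).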

When the goal is to estimate threshold (\(\theta\)) and slope (\(\alpha\)): ALINE performs comparably to the baselines. Its query strategy concentrates stimuli near the threshold region.

When the goal is to estimate guess rate (\(\gamma\)) and lapse rate (\(\lambda\)): QUEST+ performs suboptimally, while ALINE and Psi-marginal achieve significantly better accuracy. ALINE now selects 'easy' stimuli at extreme values to best reveal biases and lapses.

Psychometric function estimation results
Performance comparison on psychometric function estimation with different parameter targets.

BibTeX

@inproceedings{huang2025aline,
  title     = {ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition},
  author    = {Huang, Daolang and Wen, Xinyi and Bharti, Ayush and Kaski, Samuel and Acerbi, Luigi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025}
}

References

  1. Garnelo M, Rosenbaum D, Maddison C, Ramalho T, Saxton D, Shanahan M, Teh Y W, Rezende D, Eslami S A. Conditional neural processes. In International Conference on Machine Learning, 2018.
  2. Nguyen T, Grover A. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. In International Conference on Machine Learning, 2022.
  3. Chang P E, Loka N, Huang D, Remes U, Kaski S, Acerbi L. Amortized probabilistic conditioning for optimization, simulation and inference. In International Conference on Artificial Intelligence and Statistics, 2025.
  4. Müller S, Hollmann N, Arango S P, Grabocka J, Hutter F. Transformers can do Bayesian inference. In International Conference on Learning Representations, 2022.
  5. Hollmann N, Müller S, Purucker L, Krishnakumar A, Körfer M, Hoo S B, Schirrmeister R T, Hutter F. Accurate predictions on small data with a tabular foundation model. Nature, 2025.
  6. Papamakarios G, Murray I. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, 2016.
  7. Foster A, Ivanova D R, Malik I, Rainforth T. Deep adaptive design: Amortizing sequential Bayesian experimental design. In International Conference on Machine Learning, 2021.
  8. Blau T, Bonilla E V, Chades I, Dezfouli A. Optimizing sequential experimental design with deep reinforcement learning. In International Conference on Machine Learning, 2022.
  9. Huang D, Guo Y, Acerbi L, Kaski S. Amortized Bayesian experimental design for decision-making. In Advances in Neural Information Processing Systems, 2024.
  10. Smith F B, Kirsch A, Farquhar S, Gal Y, Foster A, Rainforth T. Prediction-oriented Bayesian active learning. In International Conference on Artificial Intelligence and Statistics, 2023.

Acknowledgments

This work was supported by the ELLIS Institute Finland; the Research Council of Finland (grants 358980 and 356498, and the Flagship programme: Finnish Center for Artificial Intelligence FCAI); Business Finland (project 3576/31/2023); and the UKRI Turing AI World-Leading Researcher Fellowship [EP/W002973/1]. The authors thank the Finnish Computing Competence Infrastructure (FCCI), the Aalto Science-IT project, and CSC–IT Center for Science, Finland, for the computational and data storage resources provided, including access to the LUMI supercomputer.