NeurIPS 2025 · Spotlight
We jointly amortize Bayesian inference, active learning, and Bayesian experimental design within a single Transformer architecture.
Imagine a drone locating survivors after a disaster. Ideally, it should strategically decide the next exploration area to maximize discovery chances, while simultaneously estimating survivor locations from incoming sensor data. Similarly, in emergency diagnosis, a doctor needs to determine the optimal sequence of tests and instantly infer the illness from unfolding results to enable timely treatment. These scenarios demand two critical capabilities working in tandem: strategically choosing which data to acquire next, and instantly updating beliefs about the quantities of interest as each new observation arrives.
While deep learning has successfully amortized these tasks individually, the loop remains broken. Amortized Bayesian inference methods, such as Neural Processes [1, 2, 3], Prior-data Fitted Networks [4, 5], and Neural Posterior Estimation [6], offer rapid updates but are typically passive observers, unable to strategize data collection. Conversely, amortized Bayesian Experimental Design (BED) [7, 8, 9] excels at picking informative data points but often lacks the mechanism to perform efficient, instant inference updates after every step. This separation blocks the path to truly autonomous, real-time science.
In this paper, we introduce the Amortized Active Learning and Inference Engine (ALINE). This work builds on our previous work on flexible amortized inference, extending the architecture of the Amortized Conditioning Engine (ACE) [3] to the active setting. Leveraging a Transformer trained via reinforcement learning, ALINE jointly queries informative data and refines posterior beliefs, all within a single forward pass. More importantly, ALINE introduces Flexible Targeting, which allows it to dynamically refocus its strategy on specific parameter subsets or predictive goals at runtime, without the need for retraining.
We assume a parametric conditional model \(p(y \mid x, \theta)\) with parameters \(\theta\) and prior \(p(\theta)\). During an episode of length \(T\), the system sequentially collects a dataset \(D_T = \{(x_i, y_i)\}_{i=1}^T\) with the goal of enabling accurate and rapid Bayesian inference.
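For reference, assuming observations are conditionally independent given \(\theta\), the posterior that such an episode is meant to support follows from Bayes' rule applied to the collected data:

\[
p(\theta \mid D_T) \;\propto\; p(\theta) \prod_{i=1}^{T} p(y_i \mid x_i, \theta).
\]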
The key idea in ALINE is to make both inference and data acquisition flexible. To achieve this, we introduce a target specifier \(\xi\) that tells the model what it should care about in the current episode: a parameter target \(\xi_S^\theta\), which designates a subset \(S\) of the model parameters to infer, or a predictive target \(\xi^{y^\star}_{p_\star}\), which designates predictions \(y^\star\) at inputs \(x^\star\) drawn from a distribution \(p_\star(x^\star)\). The set of possible targets \(\Xi\) thus contains both parameter and predictive tasks, and during training we sample \(\xi \sim p(\xi)\), so ALINE learns to adapt its strategy to whatever goal is specified at runtime.
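As a rough illustration of how such a target specifier might be represented in code, here is a minimal Python sketch; the class and field names are hypothetical and ours, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class TargetSpecifier:
    # Hypothetical container for the target specifier xi; field names are ours.
    kind: str                                      # "parameter" or "predictive"
    param_subset: Optional[Sequence[int]] = None   # indices S of the parameters of interest
    target_input_sampler: Optional[Callable[[int], Sequence[float]]] = None  # draws x* ~ p_star

# Example: focus the episode on parameters 0 and 2 of the model.
xi = TargetSpecifier(kind="parameter", param_subset=[0, 2])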
ALINE builds on Transformer Neural Processes and organizes its inputs into three sets: the context set of data points \((x_i, y_i)\) observed so far, the query set of candidate points that could be acquired next, and the target set of tokens encoding the current goal (latent parameters and/or prediction locations).
Each element is passed through embedding layers, then processed by shared Transformer layers. Self-attention captures dependencies inside the context set, while cross-attention lets query and target tokens attend to the encoded context.
To support flexible targeting, ALINE adds an extra query–target cross-attention block: candidate queries explicitly attend to the target tokens, so their representations encode how informative each candidate point is for the current goal.
On top of the shared backbone, ALINE uses two specialized output heads: an inference head, which outputs the approximate posterior and predictive distributions \(q_\phi\), and an acquisition head, which acts as a policy \(\pi_\psi\) that scores the candidate queries and selects the next point to acquire.
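To make the shared backbone and the two heads concrete, here is a minimal PyTorch-style sketch. It is our simplification, assuming 1-D inputs, a Gaussian-style inference head, and single attention blocks; class and layer names are ours, not the released implementation.

import torch
import torch.nn as nn

class ALINEBackboneSketch(nn.Module):
    # Minimal sketch of the ALINE-style backbone and heads (our simplification).
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.embed_context = nn.Linear(2, d_model)   # embeds an observed (x, y) pair
        self.embed_query = nn.Linear(1, d_model)     # embeds a candidate query location x
        self.embed_target = nn.Linear(1, d_model)    # embeds a target token (parameter id or x*)
        self.ctx_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.query_target_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.inference_head = nn.Linear(d_model, 2)    # e.g. mean and log-scale per target
        self.acquisition_head = nn.Linear(d_model, 1)  # unnormalized score per candidate query

    def forward(self, ctx_xy, query_x, target_tok):
        c = self.embed_context(ctx_xy)     # (B, N_ctx, d_model)
        q = self.embed_query(query_x)      # (B, N_query, d_model)
        t = self.embed_target(target_tok)  # (B, N_target, d_model)
        c, _ = self.ctx_self_attn(c, c, c)      # self-attention inside the context set
        q, _ = self.cross_attn(q, c, c)         # queries attend to the encoded context
        t, _ = self.cross_attn(t, c, c)         # targets attend to the encoded context
        q, _ = self.query_target_attn(q, t, t)  # queries attend to targets (flexible targeting)
        dist_params = self.inference_head(t)                  # distribution parameters per target
        query_scores = self.acquisition_head(q).squeeze(-1)   # acquisition score per candidate
        return dist_params, query_scores

# Toy usage: 8 episodes, 10 observed points, 32 candidate queries, 3 targets.
model = ALINEBackboneSketch()
dist_params, scores = model(torch.randn(8, 10, 2), torch.randn(8, 32, 1), torch.randn(8, 3, 1))

The acquisition scores can then be turned into a softmax policy over the candidate queries, while the distribution parameters define the approximate posterior or predictive distribution for each target token.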
The inference head is trained with standard negative log-likelihood objectives. For a parameter target \(\xi_S^\theta\), we minimize the expected negative log-probability that \(q_\phi\) assigns to the true values of the parameters in \(S\), which encourages the marginals \(q_\phi(\theta_l \mid D_T)\) to match the true posterior over the parameters of interest. For a predictive target \(\xi^{y^\star}_{p_\star}\), we draw target inputs \(x^\star \sim p_\star(x^\star)\) and minimize the corresponding negative log predictive density.
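Schematically, and in our notation (the paper may include additional averaging or weighting over episode steps), these two objectives can be written as

\[
\mathcal{L}_{\text{param}}(\phi) = \mathbb{E}\Big[-\sum_{l \in S} \log q_\phi(\theta_l \mid D_T)\Big],
\qquad
\mathcal{L}_{\text{pred}}(\phi) = \mathbb{E}_{x^\star \sim p_\star}\Big[-\log q_\phi(y^\star \mid x^\star, D_T)\Big],
\]

where \(\theta_l\) and \(y^\star\) denote the ground-truth values generated alongside the training episode.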
These losses drive the inference head to approximate either the posterior or the posterior predictive distribution, and they also provide the learning signal used to train the acquisition policy.
For the acquisition head we do not optimize the true trajectory-level information gain directly, which would be intractable. Instead, we form variational lower bounds using the approximate distributions provided by \(q_\phi\): for a parameter target \(\xi_S^\theta\), the objective rewards trajectories along which \(q_\phi\) assigns high probability to the true parameters of interest, and for a predictive target \(\xi^{y^\star}_{p_\star}\), it rewards high predictive density at held-out target inputs \(x^\star \sim p_\star(x^\star)\).
Both objectives are lower bounds on the corresponding trajectory-level EIG and EPIG [10]. As the inference head becomes more accurate, these bounds tighten and the policy receives a more faithful reward signal.
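For concreteness, a Barber–Agakov-style form of the parameter-target bound, written schematically in our notation (the prior-entropy term \(H[p(\theta_S)]\) is constant with respect to the policy), is

\[
\mathrm{EIG}(\pi_\psi) \;\ge\; \mathbb{E}_{p(\theta)\, p(D_T \mid \theta, \pi_\psi)}\Big[\sum_{l \in S} \log q_\phi(\theta_l \mid D_T)\Big] + H\big[p(\theta_S)\big],
\]

with the predictive-target bound obtained analogously by replacing the parameter log-density with the log predictive density at inputs \(x^\star \sim p_\star(x^\star)\).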
We train the acquisition head as a policy network using policy-gradient reinforcement learning. Instead of a single terminal reward, ALINE uses a dense per-step reward that measures how much the inference head improved after observing the new data point \((x_t, y_t)\).
Concretely, if the current target is \(\xi_S^\theta\), the reward at step \(t\) is the increase in the average log-probability that \(q_\phi\) assigns to the true parameters of interest after conditioning on \((x_t, y_t)\); an analogous definition based on the change in log predictive density is used for predictive targets. The policy \(\pi_\psi\) is then updated to maximize the discounted return \(\sum_{t=1}^T \gamma^t R_t(\xi)\), while gradients from the policy loss are not propagated back through the inference head.
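Written out for a parameter target (in our notation; the paper's exact normalization may differ), this per-step reward is

\[
R_t(\xi_S^\theta) \;=\; \frac{1}{|S|}\sum_{l \in S}\Big[\log q_\phi(\theta_l \mid D_t) - \log q_\phi(\theta_l \mid D_{t-1})\Big],
\]

where \(\theta_l\) denote the true parameter values underlying the episode and \(D_t = D_{t-1} \cup \{(x_t, y_t)\}\).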
In practice we first warm up the model by training only the inference head with randomly sampled queries. After this phase the policy and inference heads are trained jointly, leading to a single Transformer that can both propose informative queries and instantly perform Bayesian inference for a wide range of targets.
To efficiently learn an unknown function, we need to sequentially select the most informative points to query. We pre-train ALINE entirely on diverse synthetic functions sampled from GP priors, then evaluate on both in-distribution draws and out-of-distribution benchmarks (such as Gramacy) to assess generalization.
We compare against standard GP-based methods (GP-US, GP-EPIG [10], etc.) and the amortized baseline ACE. ALINE outperforms other methods on most out-of-distribution benchmarks.
We then test ALINE's unique ability to switch its goal at runtime - without retraining - to infer the underlying GP's hyper-parameters (e.g., lengthscale, output scale).
We evaluate ALINE on two Bayesian experimental design tasks: Location Finding and Constant Elasticity of Substitution (CES). We compare ALINE against both non-amortized baselines (VPCE) and state-of-the-art amortized methods (DAD, RL-BOED). Results show that ALINE achieves superior or competitive EIG while maintaining instant deployment speeds.
To demonstrate ALINE's flexible targeting capability, we consider the psychometric model, which describes the relationship between stimulus intensity and response probability. The model is governed by four parameters: the threshold \(\theta\), the slope \(\alpha\), the guess rate \(\gamma\), and the lapse rate \(\lambda\).
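A common parameterization of such a psychometric function, shown here with a logistic link purely for illustration (the exact functional form used in the paper may differ), is

\[
p(\text{positive response} \mid x) \;=\; \gamma + (1 - \gamma - \lambda)\,\sigma\big(\alpha (x - \theta)\big), \qquad \sigma(u) = \frac{1}{1 + e^{-u}},
\]

where \(x\) is the stimulus intensity.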
When the goal is to estimate threshold (\(\theta\)) and slope (\(\alpha\)): ALINE performs comparably to the baselines. Its query strategy concentrates stimuli near the threshold region.
When the goal is to estimate guess rate (\(\gamma\)) and lapse rate (\(\lambda\)): QUEST+ performs sub-optimally, while ALINE and Psi-marginal achieve significantly better accuracy. ALINE now selects 'easy' stimuli at extreme values to best reveal biases and lapses.
@inproceedings{huang2025aline,
title = {ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition},
author = {Huang, Daolang and Wen, Xinyi and Bharti, Ayush and Kaski, Samuel and Acerbi, Luigi},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025}
}
This work was supported by the ELLIS Institute Finland, the Research Council of Finland (grants 358980 and 356498 and Flagship programme: Finnish Center for Artificial Intelligence FCAI); Business Finland (project 3576/31/2023); the UKRI Turing AI World-Leading Researcher Fellowship [EP/W002973/1]. The authors thank Finnish Computing Competence Infrastructure (FCCI), Aalto Science-IT project, and CSC–IT Center for Science, Finland, for the computational and data storage resources provided, including access to the LUMI supercomputer.