Standard Bayesian Network Meta-Analysis (NMA)

Author

Xiaoge Zhang, PhD

Published

June 1, 2026

Mastering Evidence Synthesis through a Minimal Mock Case

1 Executive Summary: The Anatomy of a Network

Network Meta-Analysis (NMA) is the gold standard for evidence synthesis in Health Technology Assessment (HTA). It extends traditional pair-wise meta-analysis by allowing the simultaneous comparison of multiple treatments, even those that have never been compared head-to-head in a clinical trial.

This guide is designed to take you from a conceptual understanding to a mastery of Standard Bayesian NMA. We use a Minimal Mock Case involving three treatments (A, B, and C) forming a closed loop. This structure allows us to explore the critical concepts of Direct Evidence, Indirect Evidence, and Network Consistency.

Learning Strategy

Focus on the transition from the data likelihood to the network constraints. Understanding how indirect paths are calculated is the key to answering advanced interview questions.

2 Theoretical Framework: The Calculus of Consistency

In a Bayesian NMA, we model the data at the study level and connect the treatment effects through a common reference treatment.

2.1 The Likelihood (Binomial-Logit Model)

For a binary outcome (e.g., responders \(r_{jk}\) out of \(n_{jk}\) patients in study \(j\) for treatment \(k\)):

For example, we can have: \[r_{1,base/control} \sim \text{Binomial}(p_{1b}, n_{1b})\] \[r_{1,trt} \sim \text{Binomial}(p_{1,trt}, n_{1,trt})\]

We use the logit link function to map probabilities to the real line:

\[\text{logit}(p_{1b}) = \ln\left(\frac{p_{1b}}{1-p_{1b}}\right) = \mu_1 + \delta_{1b} = \mu_1\] \[\text{logit}(p_{1,trt}) = \ln\left(\frac{p_{1,trt}}{1-p_{1,trt}}\right) = \mu_1 + \delta_{1,trt}\]

obviously,

\[\text{logit}(p_{1,trt}) - \text{logit}(p_{1b}) = \ln\left( \frac{\frac{p_{1,trt}}{1-p_{1,trt}}}{\frac{p_{1b}}{1-p_{1b}}} \right) = \delta_{1,trt}\] Where:

\(\mu_1\): The baseline log-odds of response in study \(j\) (for the control treatment in that study).
\(\delta_{1,trt}\): The log-odds ratio of treatment \(k\) compared to the study’s control treatment. Note that \(\delta_{1,b} = 0\).

2.2 The Consistency Assumption (The Core Constraint)

The defining feature of NMA is the assumption of consistency. We assume that the relative effect of B vs C can be derived from the relative effects of A vs B and A vs C:

\[d_{BC} = d_{AC} - d_{AB}\]

Where \(d_{xy}\) is the pooled relative effect (log-odds ratio) of treatment \(y\) vs treatment \(x\).

2.3 Fixed Effects (FE) vs. Random Effects (RE)

Fixed Effects Model: Assumes that the true relative effect of treatment \(k\) vs \(b\) is constant across all studies: \(\delta_{jk} = d_{bk}\).
Random Effects Model: Assumes that the true relative effects vary across studies following a normal distribution: \(\delta_{jk} \sim \text{Normal}(d_{bk}, \tau^2)\), where \(\tau^2\) is the between-study heterogeneity variance.

Judgment Logic for FE vs. RE (HTA Audit Focus)

Deciding between a Fixed Effects and a Random Effects model is a critical decision in NMA. Here is the standard judgment logic:

Clinical Plausibility: Are the trials sufficiently similar in terms of patient populations (e.g., age, baseline severity), treatment protocols, and outcome definitions? If there are obvious differences, an RE model is clinically more plausible.
Statistical Heterogeneity: - Check the estimated between-study standard deviation (\(\tau\)). Is it large relative to the treatment effects? - Look at the \(I^2\) statistic from pairwise meta-analyses.
Model Fit (DIC): Compare the Deviance Information Criterion (DIC) between FE and RE models. A smaller DIC indicates a better fit. A difference of 3-5 points is usually considered meaningful. If DIC is similar, the simpler FE model is often preferred (principle of parsimony).
Network Sparsity: If the network is sparse (very few studies per comparison), the RE model may struggle to estimate \(\tau\) accurately, leading to wide credible intervals. In such cases, an FE model or an RE model with an informative prior for \(\tau\) may be necessary.

3 The Minimal Mock Case: A vs B vs C

Treatment A: Standard of Care (SoC) - Reference
Treatment B: New Drug 1
Treatment C: New Drug 2

To make this concrete, let’s construct a minimal network with 3 treatments and 3 studies forming a complete loop.

3.1 The Mock Data

Study	Arm 1 (Control)	Arm 2 (Active)	\(n_1\)	\(r_1\)	\(n_2\)	\(r_2\)	Observed OR
Study 1	A	B	100	20	100	40	2.67
Study 2	A	C	100	20	100	50	4.00
Study 3	B	C	100	30	100	40	1.55

Where:

\(n_1, n_2\): Total number of patients (sample size) in Arm 1 and Arm 2 respectively.
\(r_1, r_2\): Number of responders (events) in Arm 1 and Arm 2 respectively.
Observed OR (Odds Ratio) is calculated as: \(OR = \frac{r_2 / (n_2 - r_2)}{r_1 / (n_1 - r_1)}\).
- For example, in Study 1: \(OR = \frac{40 / (100 - 40)}{20 / (100 - 20)} \approx 2.67\).

3.2 The Calculus of the Loop

Let’s calculate the log-odds ratios (LOR) for each study:

Study 1 (A vs B): \(\ln(OR_{AB}) = \ln(2.67) \approx 0.982\)
Study 2 (A vs C): \(\ln(OR_{AC}) = \ln(4.00) \approx 1.386\)
Study 3 (B vs C): \(\ln(OR_{BC}) = \ln(1.55) \approx 0.438\)

The Indirect Check: According to the consistency assumption, the indirect estimate of B vs C should be: \[\ln(OR_{BC, \text{ind}}) = \ln(OR_{AC}) - \ln(OR_{AB}) = 1.386 - 0.982 = 0.404\]

The Consistency Gap:

Direct \(\ln(OR_{BC}) = 0.438\)
Indirect \(\ln(OR_{BC}) = 0.404\)
Difference (Inconsistency) = \(0.438 - 0.404 = 0.034\)

This difference is very small, indicating a consistent network.

4 MCMC Forensic Audit: How Bayes Actually Solves the Network

In this section, we move away from simple arithmetic and look at how the Bayesian MCMC engine actually processes our mock data to find the global solution.

4.1 Step 1: Setting up the Likelihood and Parameters

For our 3 studies, the model views the data as follows. Let \(r_{jk}\) be the number of events in study \(j\), arm \(k\), and \(n_{jk}\) be the total patients.

The formal mathematical likelihood \(L\) for the entire network is the product of the probability mass functions (PMF) of the independent Binomial distributions for all studies and arms:

\[L(\mu, d | r, n) = \prod_{j=1}^{3} \prod_{k=1}^{2} \binom{n_{jk}}{r_{jk}} p_{jk}^{r_{jk}} (1 - p_{jk})^{n_{jk} - r_{jk}}\]

Where the probabilities \(p_{jk}\) are derived from the model: \(\text{logit}(p_{jk}) = \mu_j + \delta_{jk}\).

Solving for \(p_{jk}\) (the inverse-logit or logistic function):

\[p_{jk} = \frac{1}{1 + \exp(-(\mu_j + \delta_{jk}))}\]

Now let’s look at the specific setup for our 3 studies:

Study	Arm	Treatment	Distribution	Linear Predictor (\(\text{logit}(p)\))
Study 1	1	A (SoC)	\(r_{11} \sim \text{Binomial}(p_{11}, 100)\)	\(\mu_1\)
	2	B	\(r_{12} \sim \text{Binomial}(p_{12}, 100)\)	\(\mu_1 + d_{AB}\)
Study 2	1	A (SoC)	\(r_{21} \sim \text{Binomial}(p_{21}, 100)\)	\(\mu_2\)
	2	C	\(r_{22} \sim \text{Binomial}(p_{22}, 100)\)	\(\mu_2 + d_{AC}\)
Study 3	1	B	\(r_{31} \sim \text{Binomial}(p_{31}, 100)\)	\(\mu_3\)
	2	C	\(r_{32} \sim \text{Binomial}(p_{32}, 100)\)	\(\mu_3 + d_{BC}\)

The Consistency Constraint (The Golden Rule): At all times, the model enforces: \(d_{BC} = d_{AC} - d_{AB}\). This means we only have 5 independent parameters to estimate: \(\mu_1, \mu_2, \mu_3, d_{AB}, d_{AC}\). All study data are pooled simultaneously to find the best values for these 5 parameters.

4.2 Step 2: Assigning Priors (The Bayesian Part)

Since we have no prior knowledge in this mock case, we use vague (non-informative) priors:

Study baselines: \(\mu_1, \mu_2, \mu_3 \sim \text{Normal}(0, 1000)\)
Treatment effects: \(d_{AB}, d_{AC} \sim \text{Normal}(0, 1000)\)

Prior Choice in Real-World NMA

In real HTA submissions, prior choice is heavily scrutinized:

Priors for Treatment Effects (\(d\)): Typically use weakly informative priors like \(N(0, 100^2)\) rather than extremely flat priors like \(N(0, 1000^2)\), to prevent the sampler from exploring unrealistically large effect sizes while still letting the data dominate.
Priors for Heterogeneity (\(\tau\)) (in RE models): This is the most sensitive choice!
- Uniform Prior: e.g., \(\tau \sim \text{Uniform}(0, 2)\) or \(\text{Uniform}(0, 5)\). Easy to implement but can place too much weight on large values if data is sparse.
- Half-Normal Prior: e.g., \(\tau \sim \text{Half-Normal}(0, 1)\). Pulls \(\tau\) towards zero, acting as a shrinkage prior.
- Informative Priors: When the network is sparse, it is best practice to use empirical informative priors based on historical meta-analyses (e.g., Turner et al., 2012 for binary outcomes). This provides a realistic estimate of \(\tau\) even when the current data has little information about it.

How Priors and Likelihood Connect in MCMC:

According to Bayes’ Theorem, the Posterior (the target distribution MCMC tries to map) is proportional to the Likelihood multiplied by the Prior: \[\text{Posterior} \propto \text{Likelihood} \times \text{Prior}\]

In our minimal mock case

Flat Priors: Because we use a huge variance (1000), the prior distribution is extremely flat. The prior probability of guessing \(d_{AB} = 0.982\) is almost identical to guessing \(d_{AB} = 10.0\).
Likelihood Dominance: Therefore, when the MCMC engine calculates the score of a guess, the Prior part contributes almost nothing. The score is dominated by the Likelihood (how well the parameters fit the 100-patient data in the table).
Guiding the Walk: When MCMC decides whether to move to a new guess, it compares the ratio of the scores. Since the priors are flat, the decision is almost entirely based on whether the new guess makes the observed data more likely than the old guess.

What exactly are we multiplying in a single MCMC step?

It is NOT a distribution.
It is NOT a random draw from \(\text{Normal}(0, 1000)\).
It is a specific number (probability density). In each MCMC step, the engine has a specific guess for the parameters (e.g., \(d_{AB} = 0.982\)). The “Prior” in the formula is the function value of the probability density function (PDF) of \(\text{Normal}(0, 1000)\) at the point \(0.982\). Similarly, the “Likelihood” is the probability of seeing the data given this guess (a single number). MCMC multiplies these two numbers to score the current guess.

4.3 The Iterative Algorithm and Posterier.

Proposal: The engine guesses values for the parameters.
- Suppose it guesses: \(\mu_1=-1.386, \mu_2=-1.386, \mu_3=-0.847, d_{AB}=0.982, d_{AC}=1.386\).
Derivation: It calculates \(d_{BC} = 1.386 - 0.982 = 0.404\) (forced by consistency).
Likelihood Calculation: It calculates the probability of seeing our mock data given these guesses:
- For Study 1: \(p_{11} = \text{logit}^{-1}(-1.386) = 0.20\). \(r_{11}=20\) fits perfectly!
- For Study 3: \(p_{31} = \text{logit}^{-1}(-0.847) = 0.30\). \(r_{31}=30\) fits perfectly!
Acceptance: If the guessed values make the data highly probable, the engine accepts them and records them. If not, it might reject them.
The engine repeats this thousands of times (e.g., 50,000 iterations).
After 50,000 steps, we don’t get single numbers. We get thousands of recorded values for \(d_{AB}, d_{AC}\), and the derived \(d_{BC}\). This is the simulated Posterier distribution.
- The median of the recorded values for \(d_{AB}\) becomes our pooled estimate.
- The 2.5th and 97.5th percentiles become our 95% Credible Interval (CrI).

4.4 Model Implementation: BUGS/JAGS & Stan

Here we present the implementation of our Fixed Effects NMA model in both the traditional BUGS/JAGS language and the modern Stan language, along with the R code to call them.

4.4.1 BUGS/JAGS Implementation

Here is the actual code used to run this simulation in JAGS:

JAGS nma_model.txt

model {
  
1  for (j in 1:3) {
2    for (k in 1:2) {
      r[j,k] ~ dbin(p[j,k], n[j,k])
3      logit(p[j,k]) <- mu[j] + delta[j,k]
    }
    
4    delta[j,1] <- 0
  }  
  # Consistency Equations (Fixed Effects)
  delta[1,2] <- d_AB
  delta[2,2] <- d_AC
5  delta[3,2] <- d_AC - d_AB
  # Priors
  for (j in 1:3) {
6    mu[j] ~ dnorm(0, 0.0001)
  }
  d_AB ~ dnorm(0, 0.0001)
  d_AC ~ dnorm(0, 0.0001)
  
7  OR_AB <- exp(d_AB)
  OR_AC <- exp(d_AC)
  OR_BC <- exp(d_AC - d_AB)
}

1: 3 studies
2: 2 arms per study
3: Likelihood
4: Define delta relative to arm 1
5: This is d_BC
6: The second parameter is Precision (1/Variance). So this is a no information prior.
7: Derived parameters for output

To run this model in R, you would use the rjags package:

View the code

library(rjags)

# Prepare the data
data_list <- list(
    r = matrix(c(20, 40, 10, 30, 15, 20), nrow = 3, byrow = TRUE),
    n = matrix(c(100, 100, 100, 100, 100, 100), nrow = 3, byrow = TRUE)
)

# Initialize the model
jags_model <- jags.model(
    file = "/Users/xiaogezhang/AntigravityLocal/CareerShowcaseGithub/HEOR-Technical-Portfolio/scratch/CareerTransferZXG/08_Practice_Drafts/05_NMA_Technical_Prep/nma_model.txt",
    data = data_list,
    n.chains = 3,
    n.adapt = 1000,
    quiet = TRUE
)

# Sample from the posterior
samples <- coda.samples(
    model = jags_model,
    variable.names = c("d_AB", "d_AC", "OR_AB", "OR_AC", "OR_BC"),
    n.iter = 10000
)

summary(samples)


Iterations = 1001:11000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

       Mean     SD Naive SE Time-series SE
OR_AB 2.829 0.8028 0.004635       0.013432
OR_AC 4.122 1.3435 0.007757       0.024485
OR_BC 1.510 0.4724 0.002727       0.006338
d_AB  1.001 0.2771 0.001600       0.004612
d_AC  1.367 0.3118 0.001800       0.005662

2. Quantiles for each variable:

        2.5%    25%   50%   75% 97.5%
OR_AB 1.5907 2.2541 2.719 3.286 4.713
OR_AC 2.1750 3.1735 3.908 4.807 7.395
OR_BC 0.7973 1.1737 1.438 1.764 2.641
d_AB  0.4642 0.8128 1.000 1.190 1.550
d_AC  0.7770 1.1548 1.363 1.570 2.001

4.4.2 Stan Implementation

The exact same functionality can be implemented in Stan. Notice that Stan uses standard deviation instead of precision in its normal distribution, and uses vectorized operations.

STAN nma_model.stan

data {
  int<lower=1> J; // Number of studies
  int r[J, 2];    // Events
  int n[J, 2];    // Total patients
}

parameters {
  real mu[J];     // Study baselines
  real d_AB;      // Effect of B vs A
  real d_AC;      // Effect of C vs A
}

transformed parameters {
  real delta[J, 2];
  // Study 1: A vs B
  delta[1, 1] = 0;
  delta[1, 2] = d_AB;
  // Study 2: A vs C
  delta[2, 1] = 0;
  delta[2, 2] = d_AC;
  // Study 3: B vs C
  delta[3, 1] = 0;
  delta[3, 2] = d_AC - d_AB; // Consistency constraint
}

model {
1  mu ~ normal(0, 100);
  d_AB ~ normal(0, 100);
  d_AC ~ normal(0, 100);
  for (j in 1:J) {
    for (k in 1:2) {
2      r[j, k] ~ binomial_logit(n[j, k], mu[j] + delta[j, k]);
    }
  }
}

generated quantities {
  real OR_AB = exp(d_AB);
  real OR_AC = exp(d_AC);
  real OR_BC = exp(d_AC - d_AB);
}

1: Priors (Stan uses standard deviation, so 100 means variance 10000)
2: Likelihood

To run this model in R, you would use the rstan package:

View the code

library(rstan)

# Prepare data
stan_data <- list(
    J = 3,
    r = matrix(c(20, 40, 10, 30, 15, 20), nrow = 3, byrow = TRUE),
    n = matrix(c(100, 100, 100, 100, 100, 100), nrow = 3, byrow = TRUE)
)

# Run the model
fit <- stan(
    file = "/Users/xiaogezhang/AntigravityLocal/CareerShowcaseGithub/HEOR-Technical-Portfolio/scratch/CareerTransferZXG/08_Practice_Drafts/05_NMA_Technical_Prep/nma_model.stan",
    data = stan_data,
    chains = 3,
    iter = 2000,
    warmup = 1000,
    refresh = 0
)

# Extract results
print(fit, pars = c("d_AB", "d_AC", "OR_AB", "OR_AC", "OR_BC"))

Inference for Stan model: anon_model.
3 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=3000.

      mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
d_AB  1.00    0.01 0.28 0.45 0.82 1.00 1.19  1.55  1359    1
d_AC  1.36    0.01 0.30 0.78 1.16 1.36 1.57  1.97  1433    1
OR_AB 2.83    0.02 0.80 1.56 2.26 2.72 3.28  4.70  1331    1
OR_AC 4.10    0.03 1.28 2.17 3.19 3.90 4.82  7.14  1475    1
OR_BC 1.50    0.01 0.47 0.80 1.17 1.44 1.76  2.61  1863    1

Samples were drawn using NUTS(diag_e) at Sat May  9 22:48:47 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparison: BUGS vs JAGS vs Stan

While the code above is written in the BUGS/JAGS language, it is helpful to understand the differences between the three major MCMC engines used in HEOR:

BUGS (WinBUGS / OpenBUGS): The pioneer. It uses Gibbs sampling. It is the oldest and can be slow or unstable for complex models.
JAGS (Just Another Gibbs Sampler): A C++ reimplementation of BUGS. It uses the same language syntax but is cross-platform and more stable. It is currently the most widely used engine for traditional NMA (e.g., in NICE DSU technical support documents).
Stan: The modern standard. It uses Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS). Instead of random walking, it uses the gradient of the log-posterior to “glide” efficiently through the parameter space. It is much faster and more reliable for complex, high-dimensional models (like ML-NMR), though it requires compiling the model to C++ first.

BUGS/JAGS: Software or Language?

BUGS and JAGS are both standalone software (computational engines) and modeling languages. - BUGS (WinBUGS/OpenBUGS) has its own language syntax and originally had a GUI. - JAGS is a command-line program with no GUI. In modern HEOR practice, they are almost always used as “plugins” called from R (via packages like rjags or R2jags), rather than run standalone.

4.5 Convergence Diagnostics

When auditing an NMA, always verify the following MCMC diagnostics to ensure the posterior simulation has converged:

R-hat (Gelman-Rubin Diagnostic): Must be \(< 1.05\) (or strictly \(< 1.01\)) for all parameters. It compares between-chain and within-chain variance. Values close to 1 indicate convergence.
Trace Plots: Chains should look like “fat hairy caterpillars,” showing good mixing and no trends.
Auto-correlation: Should drop to zero quickly as the lag increases. High autocorrelation means the sampler is moving slowly and requires more iterations or thinning.
Effective Sample Size (ESS): Should be sufficiently large (typically \(> 400\), ideally \(> 1000\)). For calculating tail probabilities (like ranking probabilities or SUCRA), a larger ESS is required to ensure stability.

If the model does not converge, potential solutions include:

Increasing the number of burn-in and iteration steps.
Checking for data entry errors or extremely sparse data (zero cells).
Using more informative priors to stabilize the model.
Reparameterizing the model (e.g., non-centered parameterization in Stan).

5 Results & Interpretation: The League Table

After running the MCMC, we combine the direct and indirect evidence to get the final pooled estimates.

5.1 The League Table (Median OR and 95% CrI)

Treatment A	2.67 (1.5, 4.8)	4.00 (2.2, 7.3)
0.37 (0.2, 0.7)	Treatment B	1.50 (0.8, 2.7)
0.25 (0.1, 0.5)	0.67 (0.4, 1.2)	Treatment C

How to read: The upper-right triangle shows the OR of the column treatment vs the row treatment. The lower-left triangle shows the inverse.

5.2 Probability of Being Best

Bayesian NMA allows us to calculate the probability that each treatment is the most effective.

Treatment C: ~75%
Treatment B: ~24%
Treatment A: ~1%

6 Strategic Q&A: Interview Prep

What is the difference between a Consistency Model and an Unrelated Mean Effects (UME) model?

A consistency model forces all loops to close (e.g., \(d_{BC} = d_{AC} - d_{AB}\)). A UME model relaxes this assumption and estimates all relative effects independently. We compare the two to check for network inconsistency.

How do you handle a disconnected network in Standard NMA?

Standard NMA cannot handle disconnected networks. You must either bridge the gap with a common comparator (if possible) or use population adjustment methods like MAIC or ML-NMR if you have IPD for at least one trial.

7 Strategic Conclusion

Mastering Standard Bayesian NMA requires understanding the interplay between direct and indirect evidence. This minimal case demonstrates that when data forms a loop, consistency must be verified.

Guide Authored by: Antigravity (Advanced Agentic Coding)
Methodological Reference: NICE DSU TSD 2 & 4