Interpreting Boltzgen Metrics and Filtering in Protein Design
Written by Danial Gharaie Amirabadi | Published 2025-11-18
BoltzGen does not end at generation. Every design run produces a set of candidate binders that must be evaluated, filtered, and ranked before they can move toward experimental validation. The usefulness of any campaign depends on understanding this post-processing stage. If the filtering logic or the metrics are misunderstood, strong designs can be discarded and weak ones can slip through.
The pipeline produces several key outputs: structural models, a metrics table for all generated designs, and a metrics table for the final selected set. These artifacts encode structural quality, interface geometry, physicochemical interactions, and refolding stability. They form the basis for deciding which candidates are worth synthesizing.
BoltzGen’s filtering process is built to narrow a broad search space into a small set of high-confidence designs. It first applies strict thresholds on structural and sequence properties. It then ranks the remaining designs using a weighted collection of metrics, including predicted TM scores, predicted alignment error at the interface, the number of hydrogen bonds and salt bridges, and changes in solvent-accessible surface area. Each metric contributes differently to the final ranking based on an inverse-importance weighting scheme.
This post explains how to interpret these outputs. The goal is to provide a clear understanding of how the filters operate, what each metric means, and how they combine into the final ranking. With the right mental model, the results report becomes a practical tool for selecting strong candidates and guiding iterative design.
BoltzGen evaluates every generated design through a structured sequence of checks before any ranking or selection takes place. These checks remove designs that fail basic structural or biochemical requirements so that downstream ranking focuses only on viable candidates.
The first stage applies mandatory thresholds. A design must pass every threshold in the table below to proceed.
| feature | lower_is_better | threshold | Pass |
|---|---|---|---|
| has_x | True | 0.0 | 1 |
| filter_rmsd | True | 2.5 | 0 |
| filter_rmsd_design | True | 2.5 | 0 |
| CYS_fraction | True | 0.0 | 1 |
| ALA_fraction | True | 0.2 | 1 |
| GLY_fraction | True | 0.2 | 1 |
| GLU_fraction | True | 0.2 | 1 |
| LEU_fraction | True | 0.2 | 1 |
| VAL_fraction | True | 0.2 | 1 |
These constraints prevent backbone instability, avoid excessive use of certain residues, and ensure that basic structural requirements are met. Any design that fails even one entry in this table is removed before ranking.
Only designs that satisfy all filters move forward. This step ensures that the ranking stage works with designs that are at least structurally coherent. If very few designs survive, the thresholds, refolding configuration, or design specification may need adjustment.
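As a concrete sketch, the mandatory filter amounts to a per-feature pass/fail check. The snippet below is a minimal Python reconstruction of that logic, assuming each design is a dict keyed by the feature names in the table above; the real pipeline reads these values from its metrics files, and the function and variable names here are illustrative, not BoltzGen's own API.

```python
# Reconstruction of the mandatory-threshold stage described above.
# Feature names mirror the threshold table; each entry maps a feature to
# (lower_is_better, threshold). A design must satisfy every entry to proceed.
THRESHOLDS = {
    "has_x": (True, 0.0),              # no unknown 'X' residues allowed
    "filter_rmsd": (True, 2.5),
    "filter_rmsd_design": (True, 2.5),
    "CYS_fraction": (True, 0.0),       # no cysteines
    "ALA_fraction": (True, 0.2),
    "GLY_fraction": (True, 0.2),
    "GLU_fraction": (True, 0.2),
    "LEU_fraction": (True, 0.2),
    "VAL_fraction": (True, 0.2),
}

def passes_filters(design: dict) -> bool:
    """Return True only if the design satisfies every mandatory threshold."""
    for feature, (lower_is_better, threshold) in THRESHOLDS.items():
        value = design[feature]
        ok = value <= threshold if lower_is_better else value >= threshold
        if not ok:
            return False
    return True

# Two toy designs: the second fails the refolding-RMSD threshold.
designs = [
    {"has_x": 0, "filter_rmsd": 1.8, "filter_rmsd_design": 2.0,
     "CYS_fraction": 0.0, "ALA_fraction": 0.10, "GLY_fraction": 0.05,
     "GLU_fraction": 0.12, "LEU_fraction": 0.15, "VAL_fraction": 0.08},
    {"has_x": 0, "filter_rmsd": 3.1, "filter_rmsd_design": 2.0,
     "CYS_fraction": 0.0, "ALA_fraction": 0.10, "GLY_fraction": 0.05,
     "GLU_fraction": 0.12, "LEU_fraction": 0.15, "VAL_fraction": 0.08},
]
survivors = [d for d in designs if passes_filters(d)]
```

Because the thresholds are conjunctive, a single violation removes a design regardless of how well it scores elsewhere.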
After filtering removes structurally invalid or compositionally problematic designs, the remaining candidates are ranked. BoltzGen uses a metric-based system that converts a diverse set of quality indicators into a single ordering. The goal is not to maximize one metric but to identify designs that perform consistently well across many dimensions.
Each metric is ranked independently, and the ranks are then divided by that metric's inverse-importance weight. Metrics with higher inverse-importance values therefore have less influence on the final score. This approach prevents any single metric from dominating the selection unless it is explicitly marked as highly important.
| Metric | Inverse Importance |
|---|---|
| design_iiptm | 1 |
| design_ptm | 1 |
| neg_min_design_to_target_pae | 1 |
| plip_hbonds_refolded | 2 |
| plip_saltbridge_refolded | 2 |
| delta_sasa_refolded | 2 |
The table shows that predicted structural quality metrics (design_iiptm, design_ptm, and interface PAE) are treated as high-importance signals. Interface hydrogen bonds, salt bridges, and solvent-accessible surface area changes contribute as well but are given higher inverse weights and therefore influence the ranking less.
1. Each metric produces a rank for every surviving design.
2. Ranks are divided by the metric's inverse-importance weight.
3. For each design, the maximum of these scaled ranks becomes its final score.
4. Lower scores are better: a low score means the design does not perform poorly on any metric.
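The steps above can be sketched in a few lines of Python. This is a reconstruction of the described max-of-scaled-ranks scheme, not BoltzGen's actual implementation; the metric names follow the weight table, all metrics are assumed to be oriented so that higher raw values are better (the PAE term is already negated as `neg_min_design_to_target_pae`), and the toy values are invented to show how a single weak metric sinks an otherwise strong design.

```python
# Reconstruction of the ranking scheme: rank each metric, scale ranks by the
# inverse-importance weight, and take the per-design maximum (lower is better).
INVERSE_IMPORTANCE = {
    "design_iiptm": 1,
    "design_ptm": 1,
    "neg_min_design_to_target_pae": 1,
    "plip_hbonds_refolded": 2,
    "plip_saltbridge_refolded": 2,
    "delta_sasa_refolded": 2,
}

def final_scores(designs: list[dict]) -> list[float]:
    """Max-of-scaled-ranks score per design; lower scores rank higher."""
    scores = [0.0] * len(designs)
    for metric, weight in INVERSE_IMPORTANCE.items():
        # Rank 0 = best (highest) value for this metric.
        order = sorted(range(len(designs)),
                       key=lambda i: designs[i][metric], reverse=True)
        for rank, i in enumerate(order):
            scores[i] = max(scores[i], rank / weight)
    return scores

designs = [
    # Balanced: second-best on most metrics, weak on none.
    {"design_iiptm": 0.90, "design_ptm": 0.9,
     "neg_min_design_to_target_pae": -3.0, "plip_hbonds_refolded": 8,
     "plip_saltbridge_refolded": 2, "delta_sasa_refolded": 1200},
    # Best on five metrics but worst on design_ptm.
    {"design_iiptm": 0.95, "design_ptm": 0.4,
     "neg_min_design_to_target_pae": -2.0, "plip_hbonds_refolded": 10,
     "plip_saltbridge_refolded": 3, "delta_sasa_refolded": 1400},
    # Weak across the board.
    {"design_iiptm": 0.70, "design_ptm": 0.8,
     "neg_min_design_to_target_pae": -6.0, "plip_hbonds_refolded": 5,
     "plip_saltbridge_refolded": 1, "delta_sasa_refolded": 900},
]
scores = final_scores(designs)
```

Here the second design leads on five of six metrics, yet its poor `design_ptm` gives it the same final score as the uniformly weak third design, while the balanced first design comes out on top.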
This method penalizes designs that have even one weak metric: a candidate that is strong across all metrics rises to the top, while one that excels in most dimensions but fails in a single one falls behind. The selection therefore favors balanced profiles, rewarding physically coherent binders with consistent geometric and energetic properties over designs that score well by chance on a single metric.
The metrics computed during evaluation capture structural confidence, interface quality, physicochemical interactions, and changes in solvent exposure. Each metric is recorded in the results tables, and together they provide a detailed view of how well a design is expected to fold and bind.
Below is the reference table summarizing the meaning of each column.
| Column | Description |
|---|---|
| id | filename used to retrieve the design |
| design_sequence | amino acids that were designed (may be a subset of a chain) |
| designed_chain_sequence | full sequence of the chain containing the designed residues |
| num_design | number of designed residues |
| secondary_rank | intermediate rank from the sorting procedure |
| design_ptm | predicted TM score for intra-design contacts (higher is better) |
| design_iptm | predicted TM score for design–target contacts (higher is better) |
| design_to_target_iptm | same as design_iptm but for multi-chain designs |
| min_design_to_target_pae | minimum predicted alignment error between design and target (lower better) |
| plip_saltbridge | number of salt-bridge interactions |
| plip_hbonds | number of hydrogen-bond interactions |
| plip_hydrophobic | number of hydrophobic interactions |
| delta_sasa_original | change in solvent-accessible surface area upon binding |
| delta_sasa_refolded | same as above but computed on the refolded structure |
**design_ptm**
Measures the structural confidence of the designed binder itself. Higher values indicate a well-formed internal structure.

**design_iptm / design_to_target_iptm**
Capture the quality of interactions between the binder and the target. These scores approximate interface correctness, similar to metrics used in structure prediction tasks.

**min_design_to_target_pae**
Reports the lowest predicted alignment error at the interface. Lower values imply a more precise and confident interaction geometry.

**plip_hbonds, plip_saltbridge, plip_hydrophobic**
Count interface contacts by physical interaction type. Higher counts suggest stronger and more specific binding, but they must be interpreted in the context of the design's size and interface area.

**delta_sasa_original / delta_sasa_refolded**
Quantify the surface area buried upon binding. Large burial is often associated with stronger affinity, but extreme values can reflect overpacking or potential instability.
Together, these metrics provide a multi-angle description of a candidate binder’s quality. Structural confidence, interface accuracy, and physical interaction density must align for a design to be competitive during ranking. This section establishes the vocabulary needed to interpret the tables and plots that make up the remainder of the results report.
Curious about what BoltzGen can do? Check out our BoltzGen Service and see it in action.