Interpreting Boltzgen Metrics and Filtering in Protein Design
Written by Danial Gharaie Amirabadi | Published 2025-11-18
BoltzGen does not end at generation. Every design run produces a set of candidate binders that must be evaluated, filtered, and ranked before they can move toward experimental validation. The usefulness of any campaign depends on understanding this post-processing stage. If the filtering logic or the metrics are misunderstood, strong designs can be discarded and weak ones can slip through.
The pipeline produces several key outputs: structural models, a metrics table for all generated designs, and a metrics table for the final selected set. These artifacts encode structural quality, interface geometry, physicochemical interactions, and refolding stability. They form the basis for deciding which candidates are worth synthesizing.
BoltzGen’s filtering process is built to narrow a broad search space into a small set of high-confidence designs. It first applies strict thresholds on structural and sequence properties. It then ranks the remaining designs using a weighted collection of metrics, including predicted TM scores, predicted alignment error at the interface, the number of hydrogen bonds and salt bridges, and changes in solvent-accessible surface area. Each metric contributes differently to the final ranking based on an inverse-importance weighting scheme.
This post explains how to interpret these outputs. The goal is to provide a clear understanding of how the filters operate, what each metric means, and how they combine into the final ranking. With the right mental model, the results report becomes a practical tool for selecting strong candidates and guiding iterative design.
BoltzGen evaluates every generated design through a structured sequence of checks before any ranking or selection takes place. These checks remove designs that fail basic structural or biochemical requirements so that downstream ranking focuses only on viable candidates.
The first stage applies mandatory thresholds. A design must pass every threshold in the table below to proceed.
| feature | lower_is_better | threshold | Pass |
|---|---|---|---|
| has_x | True | 0.0 | 1 |
| filter_rmsd | True | 2.5 | 0 |
| filter_rmsd_design | True | 2.5 | 0 |
| CYS_fraction | True | 0.0 | 1 |
| ALA_fraction | True | 0.2 | 1 |
| GLY_fraction | True | 0.2 | 1 |
| GLU_fraction | True | 0.2 | 1 |
| LEU_fraction | True | 0.2 | 1 |
| VAL_fraction | True | 0.2 | 1 |
These constraints prevent backbone instability, avoid excessive use of certain residues, and ensure that basic structural requirements are met. Any design that fails even one entry in this table is removed before ranking.
Only designs that satisfy all filters move forward. This step ensures that the ranking stage works with designs that are at least structurally coherent. If very few designs survive, the thresholds, refolding configuration, or design specification may need adjustment.
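As a concrete sketch, the mandatory filter amounts to a per-feature pass/fail check. The snippet below is a minimal Python reconstruction of that logic, assuming each design is a dict keyed by the feature names in the table above; the real pipeline reads these values from its metrics files, and the function and variable names here are illustrative, not BoltzGen's own API.

```python
# Reconstruction of the mandatory-threshold stage described above.
# Feature names mirror the threshold table; each entry maps a feature to
# (lower_is_better, threshold). A design must satisfy every entry to proceed.
THRESHOLDS = {
    "has_x": (True, 0.0),              # no unknown 'X' residues allowed
    "filter_rmsd": (True, 2.5),
    "filter_rmsd_design": (True, 2.5),
    "CYS_fraction": (True, 0.0),       # no cysteines
    "ALA_fraction": (True, 0.2),
    "GLY_fraction": (True, 0.2),
    "GLU_fraction": (True, 0.2),
    "LEU_fraction": (True, 0.2),
    "VAL_fraction": (True, 0.2),
}

def passes_filters(design: dict) -> bool:
    """Return True only if the design satisfies every mandatory threshold."""
    for feature, (lower_is_better, threshold) in THRESHOLDS.items():
        value = design[feature]
        ok = value <= threshold if lower_is_better else value >= threshold
        if not ok:
            return False
    return True

# Two toy designs: the second fails the refolding-RMSD threshold.
designs = [
    {"has_x": 0, "filter_rmsd": 1.8, "filter_rmsd_design": 2.0,
     "CYS_fraction": 0.0, "ALA_fraction": 0.10, "GLY_fraction": 0.05,
     "GLU_fraction": 0.12, "LEU_fraction": 0.15, "VAL_fraction": 0.08},
    {"has_x": 0, "filter_rmsd": 3.1, "filter_rmsd_design": 2.0,
     "CYS_fraction": 0.0, "ALA_fraction": 0.10, "GLY_fraction": 0.05,
     "GLU_fraction": 0.12, "LEU_fraction": 0.15, "VAL_fraction": 0.08},
]
survivors = [d for d in designs if passes_filters(d)]
```

Because the thresholds are conjunctive, a single violation removes a design regardless of how well it scores elsewhere.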
After filtering removes structurally invalid or compositionally problematic designs, the remaining candidates are ranked. BoltzGen uses a metric-based system that converts a diverse set of quality indicators into a single ordering. The goal is not to maximize one metric but to identify designs that perform consistently well across many dimensions.
Each metric is ranked independently, and the ranks are then divided by that metric's inverse-importance weight. Metrics with higher inverse-importance values therefore have less influence on the final score. This approach prevents any single metric from dominating the selection unless it is explicitly marked as highly important.
| Metric | Inverse Importance |
|---|---|
| design_iiptm | 1 |
| design_ptm | 1 |
| neg_min_design_to_target_pae | 1 |
| plip_hbonds_refolded | 2 |
| plip_saltbridge_refolded | 2 |
| delta_sasa_refolded | 2 |
The table shows that predicted structural quality metrics (design_iiptm, design_ptm, and interface PAE) are treated as high-importance signals. Interface hydrogen bonds, salt bridges, and solvent-accessible surface area changes contribute as well but are given higher inverse weights and therefore influence the ranking less.
1. Each metric produces a rank for every surviving design.
2. Ranks are divided by the metric's inverse-importance weight.
3. For each design, the maximum of these scaled ranks becomes its final score.
4. Lower scores are better: a low score means the design does not perform poorly on any metric.
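The steps above can be sketched in a few lines of Python. This is a reconstruction of the described max-of-scaled-ranks scheme, not BoltzGen's actual implementation; the metric names follow the weight table, all metrics are assumed to be oriented so that higher raw values are better (the PAE term is already negated as `neg_min_design_to_target_pae`), and the toy values are invented to show how a single weak metric sinks an otherwise strong design.

```python
# Reconstruction of the ranking scheme: rank each metric, scale ranks by the
# inverse-importance weight, and take the per-design maximum (lower is better).
INVERSE_IMPORTANCE = {
    "design_iiptm": 1,
    "design_ptm": 1,
    "neg_min_design_to_target_pae": 1,
    "plip_hbonds_refolded": 2,
    "plip_saltbridge_refolded": 2,
    "delta_sasa_refolded": 2,
}

def final_scores(designs: list[dict]) -> list[float]:
    """Max-of-scaled-ranks score per design; lower scores rank higher."""
    scores = [0.0] * len(designs)
    for metric, weight in INVERSE_IMPORTANCE.items():
        # Rank 0 = best (highest) value for this metric.
        order = sorted(range(len(designs)),
                       key=lambda i: designs[i][metric], reverse=True)
        for rank, i in enumerate(order):
            scores[i] = max(scores[i], rank / weight)
    return scores

designs = [
    # Balanced: second-best on most metrics, weak on none.
    {"design_iiptm": 0.90, "design_ptm": 0.9,
     "neg_min_design_to_target_pae": -3.0, "plip_hbonds_refolded": 8,
     "plip_saltbridge_refolded": 2, "delta_sasa_refolded": 1200},
    # Best on five metrics but worst on design_ptm.
    {"design_iiptm": 0.95, "design_ptm": 0.4,
     "neg_min_design_to_target_pae": -2.0, "plip_hbonds_refolded": 10,
     "plip_saltbridge_refolded": 3, "delta_sasa_refolded": 1400},
    # Weak across the board.
    {"design_iiptm": 0.70, "design_ptm": 0.8,
     "neg_min_design_to_target_pae": -6.0, "plip_hbonds_refolded": 5,
     "plip_saltbridge_refolded": 1, "delta_sasa_refolded": 900},
]
scores = final_scores(designs)
```

Here the second design leads on five of six metrics, yet its poor `design_ptm` gives it the same final score as the uniformly weak third design, while the balanced first design comes out on top.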
This method penalizes designs that have even one weak metric: a candidate that is strong across all metrics rises to the top, while one that excels in most dimensions but fails in a single one falls behind. The selection therefore favors balanced profiles, rewarding physically coherent binders with consistent geometric and energetic properties over designs that score well by chance on a single metric.
The metrics computed during evaluation capture structural confidence, interface quality, physicochemical interactions, and changes in solvent exposure. Each metric is recorded in the results tables, and together they provide a detailed view of how well a design is expected to fold and bind.
Below is the reference table summarizing the meaning of each column.
| Column | Description |
|---|---|
| id | filename used to retrieve the design |
| design_sequence | amino acids that were designed (may be a subset of a chain) |
| designed_chain_sequence | full sequence of the chain containing the designed residues |
| num_design | number of designed residues |
| secondary_rank | intermediate rank from the sorting procedure |
| design_ptm | predicted TM score for intra-design contacts (higher is better) |
| design_iptm | predicted TM score for design–target contacts (higher is better) |
| design_to_target_iptm | same as design_iptm but for multi-chain designs |
| min_design_to_target_pae | minimum predicted alignment error between design and target (lower better) |
| plip_saltbridge | number of salt-bridge interactions |
| plip_hbonds | number of hydrogen-bond interactions |
| plip_hydrophobic | number of hydrophobic interactions |
| delta_sasa_original | change in solvent-accessible surface area upon binding |
| delta_sasa_refolded | same as above but computed on the refolded structure |
**design_ptm**
Measures the structural confidence of the designed binder itself. Higher values indicate a well-formed internal structure.

**design_iptm / design_to_target_iptm**
Capture the quality of interactions between the binder and the target. These scores approximate interface correctness, similar to metrics used in structure prediction tasks.

**min_design_to_target_pae**
Reports the lowest predicted alignment error at the interface. Lower values imply a more precise and confident interaction geometry.

**plip_hbonds, plip_saltbridge, plip_hydrophobic**
Count interface contacts by physical interaction type. Higher counts suggest stronger and more specific binding, but they must be interpreted in the context of the design's size and interface area.

**delta_sasa_original / delta_sasa_refolded**
Quantify the surface area buried upon binding. Large burial is often associated with stronger affinity, but extreme values can reflect overpacking or potential instability.
Together, these metrics provide a multi-angle description of a candidate binder’s quality. Structural confidence, interface accuracy, and physical interaction density must align for a design to be competitive during ranking. This section establishes the vocabulary needed to interpret the tables and plots that make up the remainder of the results report.
Curious about what BoltzGen can do? Check out our BoltzGen Service and see it in action.