Dr Karsten Eastman, CEO, Sethera Therapeutics, highlights the challenges around ‘true’ diversification in DNA libraries.

Drug discovery loves big numbers for screening campaigns. Trillions of molecules! 10²⁰-member libraries! But there is a very real question to ask: how many distinct molecules can actually fit in a tube?

The attraction of theoretical diversity

In early-stage drug discovery, library size matters because biology is sparse. For most targets, only a tiny fraction of chemical space will bind at all, and an even smaller fraction will bind with the right combination of potency, selectivity, and developability. Large libraries increase the probability of sampling those rare, high-value molecules, especially for difficult targets such as protein–protein interactions or shallow binding surfaces, where viable binders may be exceptionally uncommon. Modern discovery platforms such as mRNA and cDNA display, phage display, DNA-encoded libraries (DELs), and related technologies are therefore often described in terms of the diversity they can reach.

For example, if we take 12 random positions in a peptide, and each position can be any of the 20 standard amino acids, the total sequence space is 20¹² = 4,096,000,000,000,000 possibilities! That’s a bit over four quadrillion potential variants.
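As a quick sanity check, a minimal Python sketch of that arithmetic (nothing here beyond the numbers already stated above):

    positions = 12
    alphabet = 20                       # the standard amino acids
    theoretical_space = alphabet ** positions
    print(f"{theoretical_space:,}")     # 4,096,000,000,000,000 - just over four quadrillion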

Similarly, when people talk about “10¹⁴” mRNA display libraries or “10¹²” DNA‑encoded libraries, they are usually referring to how many different sequences could exist in principle if every combination were realised. But there’s a second, less exciting question that matters for real screens: “How many distinct molecules are physically present in a single selection pool?”

That number is set by grams of DNA, basic chemistry, and the realities of target production. DNA mass imposes a hard ceiling, and that ceiling is often overlooked when considering how diverse a screening system or library can really be.

Start with a hypothetical build: suppose there are 4 micrograms of double‑stranded DNA in a tube, each construct is about 200 base pairs long, and each base pair contributes roughly 660 g/mol (the standard approximation for double‑stranded DNA). A 200 bp fragment therefore has a molecular weight of roughly 132,000 g/mol. Dividing 4 × 10⁻⁶ g by that number and multiplying by Avogadro’s constant gives ≈ 1.8 × 10¹³ molecules, or about 18 trillion distinct DNA molecules, at most.
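A short Python sketch of that mass-to-molecules conversion, using only the figures given above (4 µg, 200 bp, ~660 g/mol per base pair) and Avogadro’s constant:

    AVOGADRO = 6.022e23            # molecules per mole

    dna_mass_g = 4e-6              # 4 micrograms of double-stranded DNA
    construct_bp = 200             # construct length in base pairs
    g_per_mol_per_bp = 660         # standard approximation for dsDNA

    mw = construct_bp * g_per_mol_per_bp       # ~132,000 g/mol per construct
    molecules = dna_mass_g / mw * AVOGADRO
    print(f"{molecules:.2e}")                  # ~1.8e13, about 18 trillion molecules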

That number is a hard ceiling on the number of unique templates that can be present in the tube. It doesn’t matter how perfect the translational enzymes are or how efficient any downstream steps become; no workflow can conjure more unique molecules than the DNA it started with.

Another common design is to use NNK randomisation at each variable position (N = A/C/G/T, K = G/T). NNK encodes 32 codons, 31 of which specify amino acids and one of which is a stop codon.

If we randomise 12 positions with NNK, the fraction of sequences without a stop codon is (31/32)¹² ≈ 68%.

From those 18 trillion DNA molecules, only about 12 trillion correspond to full‑length, stop‑free peptides or constructs. Twelve trillion is a huge number, but it is still only ~0.3% of the four‑quadrillion theoretical sequence space discussed above. 
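Continuing the same sketch, the NNK stop-codon penalty and the resulting coverage of the theoretical space work out as follows:

    positions = 12
    stop_free_fraction = (31 / 32) ** positions          # ~0.68, as above
    dna_molecules = 1.8e13                                # from the 4 ug calculation
    stop_free_molecules = dna_molecules * stop_free_fraction
    theoretical_space = 20 ** positions                   # ~4.1e15 sequences
    print(f"{stop_free_molecules:.2e}")                   # ~1.2e13, about 12 trillion
    print(f"{stop_free_molecules / theoretical_space:.2%}")   # ~0.3% of the theoretical space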

An aggressive but still realistic build samples less than one third of one percent of the sequences that “exist on paper” for that 12‑position design.

What it would take to “cover everything”

How much DNA would be necessary to have one copy of each sequence in that 20¹² space? If 4 quadrillion sequences each need a single DNA molecule, the total comes out to just under 1 milligram of DNA at 200 bp. That is more than two orders of magnitude above the typical 4 µg sample in our example.
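The one-copy-each figure follows directly from the per-molecule mass used earlier; a brief sketch:

    AVOGADRO = 6.022e23
    g_per_construct = 200 * 660 / AVOGADRO       # mass of one 200 bp molecule

    one_copy_each_g = 20 ** 12 * g_per_construct
    print(f"{one_copy_each_g * 1e3:.2f} mg")     # ~0.90 mg for one copy of every sequence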

But “one copy each” is not enough in practice. Random sampling behaves like drawing balls from an enormous urn. To be reasonably sure that members of the library are not missed, multiple copies per sequence are needed on average.

Under random (Poisson) sampling, the average copy number needed is −ln(1 − coverage): if ~95% of all possible variants should be present at least once, about three copies per sequence are needed on average, and ~99% coverage would require ~4.6 copies per sequence. For the 20¹² space, that translates to 2.7 mg of DNA for ~95% coverage and 4.1 mg of DNA for ~99% coverage.
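A minimal sketch of that coverage calculation, assuming Poisson sampling and the ~0.9 mg one-copy-each figure from above:

    import math

    def copies_needed(coverage):
        # Poisson: P(at least one copy) = 1 - exp(-lam), so lam = -ln(1 - coverage)
        return -math.log(1 - coverage)

    one_copy_each_mg = 0.9                      # ~0.9 mg for one copy of each sequence
    for cov in (0.95, 0.99):
        lam = copies_needed(cov)                # ~3.0 and ~4.6 copies per sequence
        print(f"{cov:.0%}: {lam:.1f} copies -> ~{lam * one_copy_each_mg:.1f} mg of DNA")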

Those numbers are not unimaginable, but they’re already far beyond the DNA amounts routinely used for many display and DEL workflows, especially when considering replication, ligations, transcription, and other losses.

Expanded alphabets exacerbate the problem

Now imagine moving beyond the 20 standard amino acids. Many platforms are exploring noncanonical residues, stereochemical variants, or other building blocks to increase functional diversity and make their libraries more “drug-like”. In the peptide space, two common extensions are D-amino acids (mirror-image versions of the natural L-forms) and β-amino acids (which have one extra carbon in the backbone). If the same 20 side chains used by nature are allowed in three backbone flavours, L-α (standard), D-α, and β, the per-position alphabet expands from 20 to 60 defined building blocks.

Re-calculating the diversity for 12 positions gives 60¹² ≈ 2.2 × 10²¹ possible sequences. In other words: about two sextillion variants. One copy of each of those at 200 bp would require on the order of half a kilogram of DNA. To achieve strong occupancy (again, ~95% of sequences present at least once), the requirement lands in the 1.4 kilogram range.
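The expanded-alphabet figures follow from the same per-molecule mass; a short sketch:

    AVOGADRO = 6.022e23
    g_per_construct = 200 * 660 / AVOGADRO       # 200 bp construct, ~660 g/mol per bp

    expanded_space = 60 ** 12                    # ~2.2e21 sequences
    one_copy_each_g = expanded_space * g_per_construct
    print(f"{one_copy_each_g:.0f} g")            # ~480 g, roughly half a kilogram
    print(f"{one_copy_each_g * 3.0 / 1000:.1f} kg")   # ~1.4 kg for ~95% coverage (3 copies each)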

Even before considering the enzymes required to process the DNA, the selection format, or manufacturing, the sheer mass of DNA involved means this is not a realisable single-pool library, for anyone.

Targets are finite, too

Another constraint that is rarely mentioned is target abundance. Each molecule in a library can only be selected if it has a chance to encounter the target; in screens this is commonly a purified protein or protein complex immobilised on beads or a surface. For a hypothetical library with 10²⁰ variants and only one copy of each, the idealised world would also provide comparable or higher numbers of target molecules and enough surface area and binding sites that interactions are not dominated by a tiny fraction of the library.

Most screens are done with micrograms to milligrams of protein, even for relatively accessible targets. Scaling target production by 6–8 orders of magnitude to “match” absurdly large library claims would be both economically and technically prohibitive. Even if it were somehow possible to produce a 10²⁰‑member library, it would likely not be possible to produce or handle enough target to screen it effectively.
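To illustrate the scale, here is a rough sketch of the protein required to pair a 10²⁰-member library with one target molecule per library member; the ~50 kDa target molecular weight is an assumption for illustration only, not a figure from the argument above:

    AVOGADRO = 6.022e23

    library_members = 1e20
    target_mw = 50_000                           # assumed ~50 kDa protein target (illustrative)
    target_mass_g = library_members / AVOGADRO * target_mw
    print(f"{target_mass_g:.1f} g")              # ~8 g of purified protein for 1:1 stoichiometry

Eight grams of purified, active, correctly folded protein is several orders of magnitude beyond the microgram-to-milligram quantities typical of real screens.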

Practical diversity is more architectures, not just more zeros

If we repeat that 4 µg, 12‑position build 100 times using the same architecture, we would cumulatively sample on the order of ~1.8 × 10¹⁵ total molecules (about 1,800 trillion) across runs. Even if every single one of those molecules were unique and non‑overlapping, which is generous, that still covers only ~45% of the 20¹² space (and ~30% if counting only stop‑free NNK products).
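The repeated-build arithmetic, in the same sketch form as before:

    per_build_molecules = 1.8e13                 # molecules per 4 ug, 200 bp build
    builds = 100
    total = per_build_molecules * builds         # ~1.8e15 molecules across all runs

    theoretical_space = 20 ** 12
    print(f"{total / theoretical_space:.0%}")                    # ~44-45%, assuming zero overlap
    print(f"{total * (31/32)**12 / theoretical_space:.0%}")      # ~30% counting only stop-free NNK products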

From a platform perspective, those 100 builds are arguably much better spent on different architectures: varying loop lengths, constraint patterns, chemistries, or topologies. That strategy buys orders of magnitude more coverage of structural and topological space, rather than brute‑forcing one template.

In other words, real discovery power comes less from adding zeros and more from how intelligently finite molecules are generated and “spent” on targets.

How to read library‑size claims going forward

mRNA display can routinely access libraries in the 10¹²–10¹³ range; some systems push higher. DNA‑encoded libraries leverage DNA’s amplifiability to screen hundreds of billions of small molecules in a single experiment. What is often missing is a clear distinction between practical diversity and purely theoretical diversity.

Whenever encountering a “trillion‑member” or “10²⁰-member” claim, three questions help anchor the conversation.

First, how much DNA went into the actual selection pool? From the mass and construct length, the maximum number of distinct templates can be determined.

Second, how many copies per variant were present? That determines whether most of the claimed space was even sampled once.

Third, how much target was used, and how was it presented? The economics and physics of the target often constrain the practical library size more than the chemistry does.

Grounding diversity in grams of DNA and moles of target keeps the conversation anchored in what is real, and it shows that we are already pushing against what physics and biochemistry allow. Smart design, multiple architectures, and honest accounting of diversity will end up mattering more than ever‑larger numbers on slide decks.