WORTH KNOWING

Scientists Generate 10 Million Data Points in Days to Supercharge AI Protein Engineering

By Jamie Sullivan · Tuesday, April 21, 2026

Finn's Take· TL;DR

Rice University scientists developed Sequence Display, generating 10 million protein data points in days using smart barcoding to accelerate AI training.
Traditional protein engineering faces impractical testing challenges; this platform automates variant screening through DNA barcodes read by sequencing, eliminating months of manual work.
AI and experiments work together synergistically—AI models use the massive datasets to predict beneficial protein mutations for therapeutics and research tools.

See this from any side — with sources:

Left take Neutral Right take

Revolutionary Data Generation Platform

Scientists at Rice University have developed a groundbreaking method that can generate more than 10 million data points in a single experiment , dramatically accelerating artificial intelligence training for protein engineering. The platform, called Sequence Display, represents a major leap forward in addressing one of the biggest bottlenecks in AI-guided protein design: the lack of sufficient high-quality training data.

"One of the biggest bottlenecks in AI-guided protein engineering is not coming up with machine-learning models. It is generating the right and enough experimental data to train them," said Han Xiao, Rice University professor of chemistry, biosciences and bioengineering . Traditional protein engineering faces an almost impossible challenge— for a protein that is just 50 amino acids in length, this leads to approximately 1.13x1065 potential combinations to test , making laboratory testing impractical.

Xiao's team and collaborators from Johns Hopkins University and Microsoft have done just that, sharing an approach that provided the needed data and created accurate models in just three days . This breakthrough, published in Nature Biotechnology, fundamentally changes how researchers can approach protein optimization.

Smart Barcoding System Tracks Protein Activity

The Sequence Display platform works through an ingenious barcoding system that automatically records protein performance. First, they mutated the DNA that codes for the Cas9 protein, creating many variations. A blank DNA barcode was attached to each variant, along with a special editor that would change the barcode in response to the protein's activity level. As the protein's activity levels increased, so did the editor's .

The DNA barcodes were then read by next-generation sequencing, which would essentially scan the barcode and classify each sequence by level of activity . This automated approach eliminates the need for researchers to manually test each protein variant, a process that previously took months or years.

"We were able to develop an activity-based barcoding system that records the activity of individual protein variants and generates the kind of dataset needed to train a machine learning model," said Linqi Cheng, a Rice graduate student and first author on the study . The team successfully repeated this process with other proteins, including aminoacyl-tRNA synthetases, cytosine deaminase and uracil glycosylase inhibitor. In each case, the barcoding experiment generated enough data points to train AI models .

AI and Experiments Work Together

Unlike approaches that rely solely on computational power, Sequence Display creates a symbiotic relationship between laboratory experiments and artificial intelligence. "The AI is not replacing the experiment here. It instead depends on the experiment," Cheng said . The massive datasets generated by the platform provide AI models with the rich information they need to make accurate predictions about protein behavior.

These data points are then fed into protein language AI models, which use them to predict which changes to a protein's amino acids will create the desired change for the protein's activity or function . The approach has already yielded practical results, with researchers discovering several Cas9 variants with expanded protospacer-adjacent motif recognition and evolved aminoacyl-tRNA synthetase variants capable of recognizing different noncanonical amino acids .

"What this approach provides is a practical framework for integrating AI with protein engineering," said Xiao, who is also a Cancer Prevention and Research Institute Scholar. "Rather than relying on machine learning as a stand-alone solution, we couple it with an experimental platform that generates high-quality training data. This synergy enables more efficient discovery of advanced research tools and next-generation therapeutic proteins" .

Transforming Drug Discovery and Beyond

The implications extend far beyond academic research. By adopting methods that leverage massive datasets, labs can move toward a "predict-first" model. This shift allows highly trained personnel to spend less time on repetitive pipetting and more time on the strategic interpretation of AI-generated results . The technology promises to accelerate the development of new therapeutics, industrial enzymes, and biotechnology applications.

As these models become more accessible, the ability to rapidly design and verify AI-designed DNA and protein sequences will become a standard requirement for competitive research facilities. Implementing these advanced workflows ensures that a lab remains at the forefront of the rapidly evolving field of generative biology .

The Sequence Display platform represents a fundamental shift in how scientists approach protein engineering—moving from slow, manual processes to rapid, data-driven design cycles that could revolutionize everything from medicine to sustainable manufacturing.

Have a question about this story?

Ask Finn — answers grounded in this article, from any viewpoint.