Figures (3)  Tables (1)
    • Figure 1. 

      Comparison of deep learning sequence models[49]. RNN/LSTM: sequence information is transmitted iteratively through hidden states. CNN: information from adjacent regions is aggregated through a local receptive field. Transformer: the self-attention mechanism fully captures pattern information across arbitrary spans of the input sequence.

    • Figure 2. 

      Illustration of the masked pre-training task for BERT models on DNA and text sequences[49,51]; a minimal code sketch of the tokenization and masking steps is given after Table 1. The process is as follows: 1) Input sequence preparation: One-dimensional character sequences serve as the initial input data. 2) Data preprocessing: The original character sequences undergo segmentation, trimming, and padding to standardize the sequence length, ensuring a consistent data format for subsequent processing. 3) Tokenization, masking, and special token insertion: The preprocessed sequences are tokenized, converting the character sequences into token sequences. Tokens are randomly selected for masking, which involves replacing them with a special [MASK] token, substituting them with randomly chosen tokens, or leaving them unchanged. The special tokens [CLS] and [SEP] are inserted at the beginning and end of the sequence, respectively, to mark its boundaries and help the model comprehend its overall structure. Tokens, as the fundamental units of model learning, encode the semantic information of the sequence. 4) Embedding layer processing: The tokenized sequences are fed into the embedding layer, where each token is converted into a vector representation that encodes its feature information. 5) Transformer module training: The embedded vector sequences are input into the Transformer module for deep training. The Transformer module comprises multiple stacked encoder layers, each containing a self-attention mechanism, a feedforward neural network, and normalization layers; together these components process the sequence to capture linguistic features and contextual information. 6) Model output and token prediction: The model outputs vector representations of the masked tokens and attempts to predict the original tokens at the masked positions. 7) Loss computation and parameter updating: The discrepancy between the predicted tokens and the original (pre-masked) tokens is computed, typically with a cross-entropy loss. Through backpropagation, the loss is propagated back through the model, updating its parameters and gradually improving its ability to predict masked tokens. 8) Training iteration and optimization: The process is repeated over multiple training epochs to continuously optimize the model parameters and improve the model's understanding and generation of the sequences.

    • Figure 3. 

      Similarity between genome sequences and language sequences. The genome size data are derived from the NCBI genome database[82].

    • Model | Date | Base model | Model architecture | Parameters | Pre-training data (species) | Capability | Tokenizer type | Token length (bp) | Ref.
      DNABERT | 2021.02 | BERT | Transformer encoder | 110 M | Animal | Function prediction | K-mer (3−6) | 512 | [49]
      Enformer | 2021.10 | Transformer + CNN | Transformer encoder | 240 M | Animal | Function prediction | Single-base | 196,608 | [81]
      MoDNA | 2022.08 | BERT | Transformer encoder | - | Animal | Function prediction | K-mer (6) | 512 | [84]
      GPN | 2022.08 | BERT + CNN | Transformer encoder | - | Plant | Variant prediction | Single-base | 512 | [33]
      iEnhance-BERT | 2022.08 | BERT (DNABERT) + CNN | Transformer encoder | - | - | Enhancer prediction | K-mer (3−6) | 512 | [85]
      FloraBERT | 2022.08 | BERT | Transformer encoder | - | Plant | Function prediction | BPE | 256 | [36]
      TFBert | 2022.09 | BERT (DNABERT) | Transformer encoder | 110 M | Animal | Epigenomics prediction | K-mer (3−6) | 512 | [86]
      iDNA-ABF | 2022.10 | BERT (DNABERT) | Transformer encoder | 110 M | Animal | Function prediction | K-mer (3−6) | 512 | [24]
      iEnhancer-ELM | 2022.12 | BERT (DNABERT) | Transformer encoder | 110 M | Animal | Enhancer prediction | K-mer (3−6) | 512 | [87]
      Species-aware DNALM | 2023.01 | BERT (DNABERT) | Transformer encoder | 110 M | Fungus | Function prediction | K-mer (6) | 512 | [25]
      Nucleotide Transformer | 2023.01 | BERT | Transformer encoder | 50 M−2.5 B | Animal | Function prediction | K-mer (6) | 1,000−2,048 | [88]
      GPN-MSA | 2023.01 | BERT | Transformer encoder | 86 M | Animal | Variant prediction | Single-base | 128 | [89]
      SpliceBERT | 2023.02 | BERT | Transformer encoder | 19.4 M | Animal | Splicing prediction | Single-base | 1,024 | [28]
      miProBERT | 2023.03 | BERT (DNABERT) | Transformer encoder | 110 M | - | Promoter prediction | K-mer (6) | 512 | [90]
      RNA-MSM | 2023.03 | BERT + CNN | Transformer encoder | - | - | Structure prediction | Single-base | 512 | [31]
      MRM-BERT | 2023.06 | BERT (DNABERT) | Transformer encoder | 110 M | Animal + Microorganism | RNA modification prediction | K-mer (6) | 512 | [26]
      GENA-LM | 2023.06 | BERT/BigBird | Transformer encoder | 110−360 M | Animal | Function prediction | BPE | 512−4,096 | [91]
      DNABERT-2 | 2023.06 | BERT | Transformer encoder | 117 M | Animal | Function prediction | BPE | 128 | [92]
      Geneformer | 2023.06 | BERT | Transformer encoder | - | Animal | Function prediction | Custom tokenizer | 2,048 | [93]
      PLPMpro | 2023.07 | BERT (DNABERT) | Transformer encoder | - | Eukaryote | Promoter prediction | K-mer (6) | 512 | [94]
      EpiGePT | 2023.07 | BERT + CNN | Transformer encoder | - | Animal | Epigenomics prediction | Single-base | 128 | [29]
      Uni-RNA | 2023.07 | BERT | Transformer encoder | 25−400 M | - | Function prediction | Single-base | 512−1,280 | [32]
      AgroNT | 2023.10 | BERT | Transformer encoder | 1 B | Plant | Function prediction | K-mer (6) | 1,024 | [35]
      FGBERT | 2024.02 | BERT | Transformer encoder | 954.73 M | Metagenomics | Function prediction | - | - | [27]
      RiNALMo | 2024.02 | BERT | Transformer encoder | 135−650 M | - | Function prediction | Single-base | 1,024 | [95]
      gLM | 2024.04 | BERT | Transformer encoder | 1 B | Metagenomics | Function prediction | - | - | [96]
      RNAErnie | 2024.05 | BERT | Transformer encoder | 105 M | - | Function prediction | Single-base | 512 | [30]
      GenSLMs | 2022.10 | Diffusion + GPT | Transformer decoder | 25−250 M | Virus | Variant prediction | K-mer (3) | 2,048 | [97]
      DNAGPT | 2023.08 | GPT | Transformer decoder | 100 M−3 B | Animal | Function prediction + Sequence generation | K-mer (6) | 512−4,096 | [98]
      ENBED | 2023.11 | Transformer | Transformer encoder−decoder | 1.2 B | Animal + Plant + Insect + Bacteria | Function prediction + Sequence generation | Single-base | 16,384 | [99]
      HyenaDNA | 2023.06 | Hyena | Hyena | 0.44−6.6 M | Animal | Function prediction + Species classification | Single-base | 64 K−1 M | [100]
      Evo | 2024.02 | StripedHyena | StripedHyena | 7 B | Bacteria + Archaea | Function prediction + Sequence generation | Single-base | 131 K | [101]
      PDLLM | 2024.12 | Mamba | Mamba | 130 M | Plant | Function prediction | Single-base/K-mer (2−6)/BPE | 512 | [34]

      Table 1. 

      Summary table of genomic LLMs.
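
The masked pre-training procedure summarized in the Figure 2 caption can be made concrete with a short sketch. The Python snippet below is a minimal, hypothetical illustration of steps 1-3 (overlapping 6-mer tokenization, insertion of [CLS]/[SEP], and random masking); the toy sequence, the 15% selection rate, and the 80/10/10 corruption split follow the original BERT recipe and are assumptions here, not parameters reported for any particular model in Table 1.

```python
import random

K = 6                                   # k-mer size (DNABERT, for example, uses 3-6)
CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"
BASES = "ACGT"

def kmer_tokenize(seq, k=K):
    """Split a DNA string into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_rate=0.15):
    """Randomly select ~15% of tokens and corrupt them: 80% become [MASK],
    10% become a random k-mer, 10% stay unchanged. Returns the corrupted
    sequence and the labels the model must recover (None = not selected)."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in (CLS, SEP) or random.random() >= mask_rate:
            continue                    # skip special tokens and unselected positions
        labels[i] = tok                 # prediction target at this position
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK
        elif r < 0.9:
            corrupted[i] = "".join(random.choice(BASES) for _ in range(K))
        # else: leave the original token in place
    return corrupted, labels

seq = "ACGTAGCTAGCTTACGGATCCGTA"        # toy input sequence (assumption)
tokens = [CLS] + kmer_tokenize(seq) + [SEP]
masked, labels = mask_tokens(tokens)
print(masked)                           # corrupted tokens fed to the embedding layer (step 4)
print(labels)                           # non-None entries are the cross-entropy targets (step 7)
```

In a full pipeline, the corrupted token sequence would pass through the embedding layer and the stacked Transformer encoder layers (steps 4-5), and the cross-entropy loss over the non-None positions would drive the parameter updates described in steps 7-8.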