If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers as well as some other libraries. Uncomment the following cell and run it.
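A sketch of what that install cell can look like; the exact package list is an assumption based on the libraries used later in this notebook:

```python
# Indicative install cell -- uncomment to run on Colab.
# The package list is an assumption based on the libraries used below.
# !pip install -q transformers datasets peft accelerate scikit-learn
```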
If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.
To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.
First you have to log in to the Hugging Face Hub.
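A minimal sketch of the login step, using the notebook_login helper from huggingface_hub:

```python
from huggingface_hub import notebook_login

# Opens a prompt asking for a Hugging Face access token
notebook_login()
```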
Then you need to install Git-LFS. Uncomment the following instructions:
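For reference, on a Debian-based machine such as a Colab runtime this usually amounts to something like the following (the exact command depends on your environment):

```python
# Uncomment on a Debian/Ubuntu system (e.g. a Colab runtime)
# !apt-get install git-lfs
```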
We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.
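A sketch of that telemetry cell, using the send_example_telemetry helper from transformers.utils; the example name passed here is an assumption:

```python
from transformers.utils import send_example_telemetry

# Reports which example notebook is being run; delete this cell to opt out
send_example_telemetry("nucleotide_transformer_dna_sequence_modelling_with_peft_notebook", framework="pytorch")
```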
Fine-Tuning the Nucleotide-transformer with LoRA
The Nucleotide Transformer paper (Dalla-Torre et al., 2023) introduces four genomics foundation models developed by InstaDeep. These transformers, of various sizes and trained on different datasets, provide powerful representations of DNA sequences that can be used to tackle a very diverse set of problems, such as chromatin accessibility, deleteriousness prediction, and promoter and enhancer prediction. These representations can be extracted from the transformer and used as proxies of the DNA sequences (this is called probing), or the transformer can be trained further on a specific task (this is called fine-tuning).
This notebook allows you to fine-tune one of these models.
LoRA: Low-Rank Adaptation of Large Language Models is one of the state-of-the-art parameter-efficient fine-tuning methods, explained in detail in this blog post. Any transformer model can be fine-tuned with this method with very little effort using the 🤗 Transformers library, which is why it is used in this notebook instead of the IA³ technique presented in the original paper.
The model we are going to use is the 500M Human Ref model, a transformer with 500M parameters pre-trained on the human reference genome, following the training methodology presented in the Nucleotide Transformer paper. It is one of the four models introduced, all available on the InstaDeep HuggingFace page:
Note that even though the fine-tuning is done with a parameter-efficient method, using the larger checkpoints will still require more GPU memory and result in longer fine-tuning times.
In the following, we showcase the nucleotide transformer's ability to classify genomic sequences according to two of the most basic genomic motifs: promoters and enhancer types. Both are classification tasks, but the enhancer types task is more challenging with its 3 classes.
These two tasks are very basic, but the nucleotide transformers have been shown to beat or match state-of-the-art models on much more complex tasks such as DeepSEA, which predicts 919 chromatin profiles across a diverse set of human cells and tissues from a single DNA sequence, or DeepSTARR, which predicts an enhancer's activity.
Importing required packages and setting up PEFT model
Import and install
Prepare and create the model for fine-tuning
The nucleotide transformer will be fine-tuned on two classification tasks: promoter and enhancer types classification. The AutoModelForSequenceClassification module automatically loads the model and adds a simple classification head on top of the final embeddings.
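A minimal sketch of that loading step; the checkpoint name is an assumption based on the 500M Human Ref model on the InstaDeep Hugging Face page:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name assumed from the InstaDeep Hugging Face page
model_name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 for the binary promoter task (use 3 for enhancer types)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```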
The LoRA parameters are now added to the model, and the parameters that will be finetuned are indicated.
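A sketch of how this is typically done with the 🤗 PEFT library; the LoRA hyperparameter values and target module names below are illustrative assumptions, not necessarily the exact settings used here:

```python
from peft import LoraConfig, TaskType, get_peft_model

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification task
    r=1,                                # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections of the ESM-style encoder
)

lora_model = get_peft_model(model, peft_config)
# Prints the number and share of trainable (LoRA + classification head) parameters
lora_model.print_trainable_parameters()
```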
First task: Promoter prediction
Promoter prediction is a sequence classification problem, in which the DNA sequence is predicted to be either a promoter or not.
A promoter is a region of DNA where transcription of a gene is initiated. Promoters are a vital component of expression vectors because they control the binding of RNA polymerase to DNA. RNA polymerase transcribes DNA to mRNA, which is ultimately translated into a functional protein.
This task was introduced in DeePromoter, where a set of TATA and non-TATA promoters was gathered. A negative sequence was generated from each promoter by randomly sampling subsets of the sequence, to guarantee that some obvious motifs were present in both the positive and negative datasets.
Dataset loading and preparation
Let us have a look at the data. If we extract the last sequence of the dataset, we see that it is indeed a promoter, as its label is 1. Furthermore, we can also see that it is a TATA promoter, as the TATA motif is present at the 221st nucleotide of the sequence!
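A sketch of the loading and inspection step; the dataset repository, configuration name, and column names are assumptions:

```python
from datasets import load_dataset

# Repository and configuration names are assumptions
train_dataset = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks", "promoter_all", split="train"
)
test_dataset = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks", "promoter_all", split="test"
)

# Inspect the last training example: a label of 1 marks a promoter
last_example = train_dataset[-1]
print(last_example["sequence"])
print(last_example["label"])
```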
Tokenizing the datasets
All inputs to neural nets must be numerical. The process of converting strings into numerical indices suitable for a neural net is called tokenization.
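With the tokenizer loaded earlier, tokenization is typically applied with a batched map over the dataset; the "sequence" column name is an assumption:

```python
def tokenize_function(examples):
    # "sequence" is the assumed name of the raw DNA column
    return tokenizer(examples["sequence"])

# Tokenize both splits and drop the raw string column
tokenized_train = train_dataset.map(tokenize_function, batched=True, remove_columns=["sequence"])
tokenized_test = test_dataset.map(tokenize_function, batched=True, remove_columns=["sequence"])
```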
Fine-tuning and evaluation
We initialize our TrainingArguments. These control the various training hyperparameters and will be passed to our Trainer.
The hyperparameters used for the IA³ method in the paper do not provide good performance for the LoRA method. In particular, LoRA introduces more trainable parameters and therefore requires a smaller learning rate. We use here a learning rate of 5·10⁻⁴, which enables us to get close to the paper's performance.
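A sketch of such a configuration; apart from the 5·10⁻⁴ learning rate, the batch size of 8 and the 1000 training steps discussed in this notebook, the values (and the output directory name) are illustrative assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="nucleotide-transformer-finetuned-lora-promoter",  # assumed output directory
    learning_rate=5e-4,
    per_device_train_batch_size=8,
    max_steps=1000,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    remove_unused_columns=False,
)
```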
Next, we define the metric we will use to evaluate our models and write a compute_metrics function. We can load this from the scikit-learn library.
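A minimal sketch of such a function, using scikit-learn's f1_score:

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple at evaluation time
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"f1_score": f1_score(labels, predictions)}
```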
We can now fine-tune our model by just calling the train method:
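Putting the pieces from the previous cells together, the call typically looks like this; for simplicity this sketch evaluates directly on the test split, which is an assumption rather than the notebook's exact setup:

```python
from transformers import Trainer

trainer = Trainer(
    model=lora_model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,   # simplification: test split used for evaluation
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```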
Note that the fine-tuning is done with a small batch size (8). The training time can be reduced by increasing the batch size, which better leverages the parallelism of the GPU.
Validation F1 score
F1 score on the test dataset
For the promoter prediction task, we reproduced the experiment carried out in the article by adapting the learning rate to the LoRA method. An F1 score of 0.937 is obtained after just 1000 training steps. To get closer to the 0.954 score obtained in the Nucleotide Transformer paper after 10,000 training steps, we surely need to train for longer!
Second task: Enhancer prediction
In this section, we fine-tune the nucleotide transformer model on enhancer type prediction, which consists of classifying a DNA sequence as a strong, weak, or non-enhancer.
In genetics, an enhancer is a short (50–1500 bp) region of DNA that can be bound by proteins (activators) to increase the likelihood that transcription of a particular gene will occur.
The dataset used here was introduced in A deep learning framework for enhancer prediction using word embedding and sequence generation, which augmented an original set of enhancers with 6000 synthetic enhancers and 6000 synthetic non-enhancers produced through a generative model.
Dataset loading and preparation
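Loading can follow the same pattern as for the promoter task; the repository and the "enhancers_types" configuration name are assumptions:

```python
from datasets import load_dataset

# Repository and configuration names are assumptions
enhancer_train = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks", "enhancers_types", split="train"
)
enhancer_test = load_dataset(
    "InstaDeepAI/nucleotide_transformer_downstream_tasks", "enhancers_types", split="test"
)
```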
Tokenizing the datasets
Fine-tuning and evaluation
We initialize our TrainingArguments. These control the various training hyperparameters and will be passed to our Trainer.
We keep the same hyperparameters as for the promoter task, i.e. the same as in the paper except for a learning rate of 5·10⁻⁴, which enables us to get close to the paper's performance.
Here, the metric used to evaluate the model is the Matthews correlation coefficient, which is more relevant than accuracy when the classes in the dataset are unbalanced. We can load a predefined function from the scikit-learn library.
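A minimal sketch of the corresponding compute_metrics function, using scikit-learn's matthews_corrcoef:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple at evaluation time
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"mcc": matthews_corrcoef(labels, predictions)}
```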
We can now fine-tune our model by just calling the train method:
As with the first task, the training time can be greatly reduced by increasing the batch size.
Validation MCC score
MCC on the test dataset
For the enhancer types prediction task, we obtain a performance of 0.40 after 1000 training steps, which already beats the baseline against which the Nucleotide Transformer is compared (0.395). This is still, however, 8.5 points below its performance (0.485) reported in the Nucleotide Transformer paper. To match the paper's results more closely, it will probably be necessary to increase the number of training steps. Also note that the paper used a parameter-efficient fine-tuning method called IA³, whereas in this notebook we use the LoRA setting, which differs in several respects.