CoCalc -- Assignment2.ipynb


Kernel: Python 3 (system-wide)

In [0]:

Assignment 2:

Prerequisites and Dependencies

Software Tools:

Nextflow: nf-core/fetchngs, nf-core/rnaseq

Documentation: fetchngs, rnaseq

Reference Genome: NCBI RefSeq (GRCh38.p14)

FASTA: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz

GTF: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz

Sequence data:

Sequence IDs: obtain the run accessions for GEO series GSE180869 from the SRA Run Selector.

Hardware:

32 vCPUs

128 GB RAM

Either:

50 GB SSD disk (if you are running the full pipeline using a subset of the RNASeq data)

(optional) or 150 GB SSD disk (if you are running the full pipeline using the full set of RNASeq data)

Note: There is no difference in learning experience or marks received in your choice of dataset. The only additional lesson you will learn is how expensive computing is.

Installation

Let's start by installing the core software Nextflow.

Create a terminal in your Lab 2 folder.

Run the shell script below.

IMPORTANT: This step only needs to be done once.

Execute these commands to install Java and Nextflow:

sudo apt update
# Install Java
sudo apt install default-jre # When prompted, enter Y
# Install Nextflow
curl -s https://get.nextflow.io | bash
# Set Java memory limit
# Add to .bashrc directly; "export" makes the variable visible to ./nextflow
echo 'export NXF_OPTS="-Xms1g -Xmx4g"' >> ~/.bashrc
# Source .bashrc to apply changes
source ~/.bashrc
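
Before launching a pipeline, it can help to confirm from the notebook's Python kernel that the tools are reachable. The following is a minimal sketch, not part of the assignment. The helper name `check_tools` is hypothetical, and it assumes the Nextflow launcher was downloaded into the current directory as `nextflow`, which is what the install script above does.

```python
import shutil
from pathlib import Path


def check_tools(launcher="nextflow", on_path=("java", "docker")):
    """Report which required tools can be found (hypothetical helper).

    `launcher` is looked up in the current directory because the install
    script downloads Nextflow there; the other tools are searched on PATH.
    """
    status = {name: shutil.which(name) is not None for name in on_path}
    status[launcher] = Path(launcher).is_file()
    return status


for tool, found in check_tools().items():
    print(f"{tool}: {'found' if found else 'MISSING'}")
```

If `nextflow` is reported as missing, the notebook is probably running in a different folder from the one where the install script was executed.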

Multiple FASTQ files per sample

Run in terminal:

./nextflow run nf-core/fetchngs \
  --input [filepath] \
  --outdir ~/scratch/lab_2/fetchngs \
  --max_cpus 32 --max_memory 128.GB \
  --download_method sratools \
  --nf_core_pipeline rnaseq \
  -w ~/scratch/work/lab_2/fetchngs \
  -profile docker

RNAseq

Input ID file: ~/Assignment_2/data/id.csv

Run the following in the terminal:

→ downloads the SRA records and FASTQ files, and prepares a samplesheet for rnaseq

→ output goes to ~/Assignment_2/scratch/fetchngs

%%bash
./nextflow run nf-core/fetchngs \
  --input ~/Assignment_2/data/id.csv \
  --outdir ~/Assignment_2/scratch/fetchngs \
  --max_cpus 32 --max_memory 128.GB \
  --download_method sratools \
  --nf_core_pipeline rnaseq \
  -w ~/Assignment_2/scratch/work/fetchngs \
  -profile docker

# Remember to clean your work directory before proceeding
# Example: ./nextflow clean nice_gilbert -f
./nextflow clean [run name] -f
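
The `id.csv` file passed to `--input` in the fetchngs command is, as far as the fetchngs documentation describes it, a plain list with one accession per line and no header. The sketch below shows one way to write such a file from Python while dropping duplicates and malformed entries. The regular expression is a loose illustrative check rather than the pipeline's own validation, the helper name is hypothetical, and the accessions are placeholders.

```python
import re
import tempfile
from pathlib import Path

# Loose pattern for run/experiment/sample accessions (SRR, ERR, DRR, SRX, GSM, ...).
# Illustrative only; this is not the pipeline's own validation.
ID_PATTERN = re.compile(r"^(?:[SED]R[RXSP]\d+|GS[EM]\d+)$")


def write_id_file(ids, path):
    """Write unique, well-formed IDs one per line, with no header row."""
    seen, kept = set(), []
    for raw in ids:
        acc = raw.strip()
        if ID_PATTERN.match(acc) and acc not in seen:
            seen.add(acc)
            kept.append(acc)
    Path(path).write_text("\n".join(kept) + "\n")
    return kept


# Placeholder accessions for demonstration only; replace them with the IDs
# obtained from the SRA Run Selector for GSE180869.
example_ids = ["SRR0000001", "SRR0000002 ", "SRR0000001", "not_an_id"]
out_file = Path(tempfile.mkdtemp()) / "id.csv"
kept = write_id_file(example_ids, out_file)
print(kept)
```

The demonstration writes to a temporary directory so that it cannot overwrite the real `~/Assignment_2/data/id.csv`.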

Index genome / Adapter trimming

%%bash
./nextflow run nf-core/rnaseq \
--input ~/Assignment_2/scratch/fetchngs/samplesheet/samplesheet.csv \
--outdir ~/Assignment_2/scratch/index_run \
--fasta ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.fna.gz \
--gtf ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
--skip_alignment --skip_pseudo_alignment \
--trimmer fastp \
--save_reference true \
-w ~/Assignment_2/scratch/work/index_run \
-profile docker

# Remember to clean your work directory before proceeding
# Example: ./nextflow clean nice_gilbert -f
./nextflow clean [run name] -f

Navigate to ~/Assignment_2/scratch/index_run/fast .

Copy the first SRX########.fastp.html file to your Assignment_2 folder on the Home server. Make sure to select your Home server.

Copy index_run/pipeline_info/execution_report_{YYYY-MM-DD_HH:MM:SS}.html to your Assignment_2 folder.
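
The copy steps can also be done from Python. The sketch below is one possible approach, not the required method. It assumes "first" means first in alphabetical order, the helper name `copy_first_report` is hypothetical, and the demonstration uses empty placeholder files in temporary directories instead of the real report folder.

```python
import shutil
import tempfile
from pathlib import Path


def copy_first_report(report_dir, dest_dir, pattern="SRX*.fastp.html"):
    """Copy the alphabetically first file matching `pattern`; return its new path."""
    reports = sorted(Path(report_dir).glob(pattern))
    if not reports:
        raise FileNotFoundError(f"no files matching {pattern} in {report_dir}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(reports[0], dest))


# Demonstration with placeholder report names in temporary directories.
src_dir = Path(tempfile.mkdtemp())
dst_dir = Path(tempfile.mkdtemp()) / "Assignment_2"
for name in ("SRX0000002.fastp.html", "SRX0000001.fastp.html"):
    (src_dir / name).write_text("<html></html>")

copied = copy_first_report(src_dir, dst_dir)
print(copied.name)
```

For the real run, `report_dir` would be the fastp output folder of `index_run` and `dest_dir` your Assignment_2 folder.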

Alignment and quantification

# Paths are left unquoted so that the shell expands "~"
./nextflow run nf-core/rnaseq \
--input ~/Assignment_2/scratch/fetchngs/samplesheet/samplesheet.csv \
--outdir ~/Assignment_2/scratch/alignment_run \
--fasta ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.fna.gz \
--gtf ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
--salmon_index ~/Assignment_2/scratch/index_run/genome/index/salmon \
--trimmer fastp \
--aligner hisat2 \
--pseudo_aligner salmon \
--extra_salmon_quant_args "--gcBias --seqBias" \
--deseq2_vst true \
-w ~/scratch/work/lab_2/alignment_run \
-profile docker

input_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/soft/GSE180869_family.soft.gz"

matrix_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/matrix/GSE180869_series_matrix.txt.gz"

transcript_length_matrix_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/suppl/GSE180869_RAW.tar"

differentialabundance

Create contrasts.csv file

We will perform the following comparisons:

control and NFIA_gRNA1 (blocking: replicate)

control and NFIA_gRNA2 (blocking: replicate)

control and NFIX_gRNA1 (blocking: replicate)

control and NFIX_gRNA2 (blocking: replicate)

control and NFIA/X_gRNA1 (blocking: replicate)
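
The comparison list can be turned into a contrasts file programmatically. The sketch below assumes the column layout `id,variable,reference,target,blocking` described in the nf-core/differentialabundance documentation, and that the samplesheet has `condition` and `replicate` columns. The `id` naming scheme and the helper names are hypothetical. Slashes in condition names are replaced with underscores so that the labels are safe to use in file names.

```python
import csv
import io

# Target conditions taken from the comparison list above.
targets = ["NFIA_gRNA1", "NFIA_gRNA2", "NFIX_gRNA1", "NFIX_gRNA2", "NFIA/X_gRNA1"]
FIELDS = ["id", "variable", "reference", "target", "blocking"]


def build_contrasts(targets, reference="control", variable="condition",
                    blocking="replicate"):
    """Return one contrast row per target, each compared against `reference`."""
    rows = []
    for target in targets:
        safe = target.replace("/", "_")  # avoid slashes in labels
        rows.append({
            "id": f"{safe}_vs_{reference}",  # hypothetical naming scheme
            "variable": variable,
            "reference": reference,
            "target": safe,
            "blocking": blocking,
        })
    return rows


def to_csv_text(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = build_contrasts(targets)
print(to_csv_text(rows))
```

The `reference` and `target` values must match the values in the samplesheet's `condition` column exactly, so pass `reference="Negative_gRNA"` if that is how the control samples are labelled.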

In [0]:

Files required:

input = "~/Assignment_2/data/samplesheet.csv"
contrasts = "~/Assignment_2/data/contrasts.csv"

matrix = "https://www.ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE180869&format=file&file=GSE180869_raw_counts_GRCh38.p13_NCBI.tsv.gz"

transcript_length_matrix = "https://www.ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE180869&format=file&file=GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv.gz"

gtf = "~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz" 

gsea_permute = "~/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt"
Resource: https://github.com/ELTEbioinformatics/GMT_files_for_mulea/tree/main/GMT_files/Homo_sapiens_9606

In [0]:

./nextflow run nf-core/differentialabundance \
--input ~/Assignment_2/data/samplesheet.csv \
--contrasts ~/Assignment_2/data/contrasts.csv \
--matrix ~/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv \
--transcript_length_matrix ~/Assignment_2/reference_genome/GSE180869_raw_counts_GRCh38.p13_NCBI.tsv \
--gtf ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
--filtering_min_proportion 0.3 \
--filtering_grouping_var condition \
--deseq2_cores 4 \
--gsea_run true \
--gsea_permute ~/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
--max_cpus 8 --max_memory 8.GB \
--outdir ~/Assignment_2/scratch/homo_analysis_filtered \
-w ~/Assignment_2/scratch/work/homo_analysis \
-profile rnaseq,docker

In [0]:

# Remember to clean your work directory before proceeding
# Example: ./nextflow clean nice_gilbert -f
./nextflow clean [run name] -f

In [0]:

./nextflow run nf-core/differentialabundance \
--input /home/user/Assignment_2/data/samplesheet.csv \
--contrasts /home/user/Assignment_2/data/contrasts.csv \
--matrix /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv \
--transcript_length_matrix /home/user/Assignment_2/reference_genome/GSE180869_raw_counts_GRCh38.p13_NCBI.tsv \
--gtf /home/user/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
--filtering_min_proportion 0.3 \
--filtering_grouping_var condition \
--deseq2_cores 4 \
--gsea_run true \
--gsea_permute /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
--gprofiler2_run true \
--gprofiler2_organism hsapiens \
--gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
--gprofiler2_correction_method gSCS \
--shinyngs_build_app true \
--max_cpus 8 --max_memory 8.GB \
--outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
-w /home/user/Assignment_2/scratch/work/homo_analysis \
-profile rnaseq,docker

In [1]:

#new
./nextflow run nf-core/differentialabundance \
--input /home/user/Assignment_2/data/samplesheet.csv \
--contrasts /home/user/Assignment_2/data/contrasts.csv \
--matrix /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv \
--transcript_length_matrix /home/user/Assignment_2/reference_genome/GSE180869_raw_counts_GRCh38.p13_NCBI.tsv \
--gtf /home/user/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf \
--filtering_min_proportion 0.3 \
--filtering_grouping_var condition \
--deseq2_cores 4 \
--gsea_run true \
--gene_sets_files /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
--gsea_permute gene_set \
--gprofiler2_run true \
--gprofiler2_organism hsapiens \
--gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
--gprofiler2_correction_method gSCS \
--shinyngs_build_app true \
--max_cpus 8 --max_memory 8.GB \
--outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
-w /home/user/Assignment_2/scratch/work/homo_analysis \
-profile rnaseq,docker

Out[1]:

  Cell In[1], line 7
    --gtf /home/user/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
                                                                   ^
SyntaxError: invalid decimal literal
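
The SyntaxError arises because the cell holds a shell command but the kernel parsed it as Python. Two ways around this are to start the cell with `%%bash`, as earlier cells in this notebook do, or to run the command in a terminal. A third option, sketched below under stated assumptions, is to assemble the command as an argument list in Python. The helper `build_nextflow_command` is hypothetical, only a few of the parameters from the cell are shown, and the command is printed rather than executed.

```python
import shlex


def build_nextflow_command(pipeline, params, work_dir, profile,
                           launcher="./nextflow"):
    """Return the command as an argument list (hypothetical helper).

    `params` maps pipeline parameter names (without the leading `--`)
    to values; booleans are rendered as `true` / `false`.
    """
    cmd = [launcher, "run", pipeline]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd += [f"--{name}", str(value)]
    cmd += ["-w", str(work_dir), "-profile", profile]
    return cmd


cmd = build_nextflow_command(
    "nf-core/differentialabundance",
    {
        "input": "/home/user/Assignment_2/data/samplesheet.csv",
        "contrasts": "/home/user/Assignment_2/data/contrasts.csv",
        "gsea_run": True,
        "outdir": "/home/user/Assignment_2/scratch/homo_analysis_filtered",
    },
    work_dir="/home/user/Assignment_2/scratch/work/homo_analysis",
    profile="rnaseq,docker",
)
print(shlex.join(cmd))
# To launch it from Python: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` sidesteps the line-continuation and quoting problems that a multi-line shell command can run into.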

In [3]:

head /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv

Out[3]:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 head /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv
NameError: name 'head' is not defined
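
`head` is a shell utility, so the Python kernel reports a NameError. In a Jupyter cell the shell version can be reached with `!head <path>`. The sketch below is a pure-Python equivalent, and the `head` function defined here is a hypothetical helper. The demonstration file is synthetic and its values are made up.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head(path, n=10):
    """Return the first `n` lines of a text file without reading all of it."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demonstration on a small synthetic TSV (values are made up).
demo = Path(tempfile.mkdtemp()) / "demo_counts.tsv"
demo.write_text("GeneID\tsample_a\tsample_b\n1\t10\t12\n2\t0\t3\n")
for line in head(demo, n=2):
    print(line)
```

For the real file, call `head` with the path to the TSV used in the cell above.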

In [0]:

./nextflow run nf-core/differentialabundance \
--input /home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv \
--contrasts /home/user/Assignment_2/data/contrasts.csv \
--matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_counts.tsv \
--transcript_length_matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_lengths.tsv \
--genome GRCh38 \
--filtering_min_proportion 0.3 \
--filtering_grouping_var condition \
--deseq2_cores 4 \
--gsea_run true \
--gsea_permute gene_set \
--gsea_make_sets true \
--gsea_permute_n 1000 \
--gsea_metric Diff_of_Classes \
--gene_sets_files /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
--max_cpus 8 --max_memory 8.GB \
--outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
-w /home/user/Assignment_2/scratch/work/homo_analysis \
-profile docker

In [4]:

import pandas as pd

samplesheet = pd.read_csv("/home/user/labs/data/case_study/GSE180869/samplesheet.csv")

mapping = pd.read_csv("/home/user/Assignment_2/data/SraRunTable.csv", usecols=["Run", "GEO_Accession (exp)"])
mapping.columns = ["sample", "sample_gsm"]  # Rename columns to match your samplesheet

# Merge the mapping into your sample sheet
merged = pd.merge(samplesheet, mapping, on="sample", how="left")

# Save the updated file
merged.to_csv("/home/user/Assignment_2/data/samplesheet_with_gsm.csv", index=False)
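
Because the merge uses `how="left"`, any run without an entry in the mapping table keeps its row but gets an empty `sample_gsm`, and no error is raised. The sketch below shows the idea of checking for such rows with plain dictionaries. The helper name and the identifiers are placeholders. With pandas, passing `indicator=True` to `pd.merge` gives similar information in a `_merge` column.

```python
def find_unmapped(samples, run_to_gsm):
    """Return the sample IDs that have no GSM accession in the mapping."""
    return [s for s in samples if not run_to_gsm.get(s)]


# Synthetic example: the identifiers below are placeholders, not real accessions.
samples = ["RUN_A", "RUN_B", "RUN_C"]
run_to_gsm = {"RUN_A": "GSM_1", "RUN_B": "GSM_2"}

missing = find_unmapped(samples, run_to_gsm)
print(missing)
```

An empty list means every sample received a GSM accession.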

In [5]:

import pandas as pd

# Load the original sample sheet
samplesheet_df = pd.read_csv("samplesheet.csv")

# Define a helper function to extract the condition from sample title
def extract_condition(title):
    title = title.lower()
    # The guide number applies to every knockdown; default to gRNA2 as before
    guide = 'gRNA1' if 'grna1' in title else 'gRNA2'
    if 'negative' in title:
        return 'Negative_gRNA'
    elif 'nfia' in title and 'x' in title:
        # Double knockdown; use '_' rather than '/' so the label is safe in file names
        return f'NFIA_X_{guide}'
    elif 'nfia' in title:
        return f'NFIA_{guide}'
    elif 'nfix' in title:
        return f'NFIX_{guide}'
    else:
        return 'Unknown'

# Apply the function to create a new 'condition' column
samplesheet_df['condition'] = samplesheet_df['sample_title'].apply(extract_condition)

# Save the updated DataFrame to a new CSV file
samplesheet_df.to_csv("samplesheet_with_condition.csv", index=False)

Out[5]:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[5], line 4
import pandas as pd
# Load the original sample sheet
----> 4 samplesheet_df = pd.read_csv("samplesheet.csv")
# Define a helper function to extract the condition from sample title
def extract_condition(title):
File /usr/local/lib/python3.12/dist-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
kwds_defaults = _refine_defaults_read(
   dialect,
   delimiter,
   (...)
   dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File /usr/local/lib/python3.12/dist-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
_validate_names(kwds.get("names", None))
# Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
if chunksize or iterator:
   return parser
File /usr/local/lib/python3.12/dist-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   self.options["has_index_names"] = kwds["has_index_names"]
self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)
File /usr/local/lib/python3.12/dist-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   if "b" not in mode:
       mode += "b"
-> 1880 self.handles = get_handle(
   f,
   mode,
   encoding=self.options.get("encoding", None),
   compression=self.options.get("compression", None),
   memory_map=self.options.get("memory_map", False),
   is_text=is_text,
   errors=self.options.get("encoding_errors", "strict"),
   storage_options=self.options.get("storage_options", None),
)
assert self.handles is not None
f = self.handles.handle
File /usr/local/lib/python3.12/dist-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
elif isinstance(handle, str):
   # Check whether the filename is to be opened in binary mode.
   # Binary mode does not support 'encoding' and 'newline'.
   if ioargs.encoding and "b" not in ioargs.mode:
       # Encoding
--> 873         handle = open(
           handle,
           ioargs.mode,
           encoding=ioargs.encoding,
           errors=errors,
           newline="",
       )
   else:
       # Binary mode
       handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'samplesheet.csv'
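
The traceback comes down to a relative path: `"samplesheet.csv"` is resolved against the kernel's current working directory, which is not the folder that holds the file. The sketch below is a small guard that reports the resolved location before pandas is called. The helper name `resolve_input` is hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_input(path):
    """Expand `~`, make the path absolute, and fail early with a clear message."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(
            f"{resolved} not found (current directory: {Path.cwd()})"
        )
    return resolved


# Demonstration with a temporary file standing in for the samplesheet.
demo_sheet = Path(tempfile.mkdtemp()) / "samplesheet.csv"
demo_sheet.write_text("sample,sample_title\n")
print(resolve_input(demo_sheet))

try:
    resolve_input(demo_sheet.with_name("missing.csv"))
except FileNotFoundError as err:
    print("caught:", err)
```

Using the absolute path of the samplesheet, as the other cells in this notebook do, avoids the problem.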

In [0]:

--gprofiler2_run true \
--gprofiler2_organism hsapiens \
--gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
--gprofiler2_max_qval 0.05 \
--gprofiler2_correction_method gSCS \
--shinyngs_build_app true \

In [6]:

import pandas as pd

# Replace the slash in the double-knockdown label so it is safe to use in file names
df = pd.read_csv("/home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv")
df['condition'] = df['condition'].str.replace('NFIA/X_gRNA1', 'NFIA_X_gRNA1', regex=False)
df.to_csv("/home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv", index=False)

# Load the original contrast file
contrasts = pd.read_csv("/home/user/Assignment_2/data/contrasts.csv")

# Replace slashes with underscores in both columns
contrasts["reference"] = contrasts["reference"].str.replace("/", "_", regex=False)
contrasts["target"] = contrasts["target"].str.replace("/", "_", regex=False)
contrasts.to_csv("/home/user/Assignment_2/data/contrasts.csv", index=False)
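
After renaming, it is worth confirming that every `reference` and `target` in the contrasts file matches a label in the samplesheet's `condition` column, since a mismatch leaves a contrast with no samples on one side. The sketch below is a hypothetical helper run on hand-written data. It shows what would happen if the reference were written as `control`, as in the comparison list, while the samplesheet labels the control samples `Negative_gRNA`.

```python
def check_contrasts(conditions, contrasts):
    """Return the contrast levels that are absent from the samplesheet."""
    known = set(conditions)
    missing = set()
    for row in contrasts:
        for key in ("reference", "target"):
            if row[key] not in known:
                missing.add(row[key])
    return sorted(missing)


# Condition labels as they appear in the samplesheet.
conditions = ["Negative_gRNA", "NFIA_gRNA1", "NFIA_gRNA2",
              "NFIX_gRNA1", "NFIX_gRNA2", "NFIA_X_gRNA1", "NFIA_X_gRNA2"]

# Hand-written contrasts for illustration; "control" is deliberately mismatched.
contrasts_rows = [
    {"reference": "control", "target": "NFIA_gRNA1"},
    {"reference": "control", "target": "NFIA_X_gRNA1"},
]

missing = check_contrasts(conditions, contrasts_rows)
print(missing)
```

An empty list means every contrast level is present in the samplesheet.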

In [8]:

import pandas as pd

df = pd.read_csv("/home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv")
print(df["condition"].value_counts())

Out[8]:

condition
Negative_gRNA    2
NFIA_gRNA1       2
NFIA_gRNA2       2
NFIX_gRNA1       2
NFIX_gRNA2       2
NFIA_X_gRNA1     2
NFIA_X_gRNA2     2
Name: count, dtype: int64

In [0]:

# Run this first to install Java and Nextflow
sudo apt update
# Install Java
sudo apt install default-jre # When prompted, enter Y
# Install Nextflow
curl -s https://get.nextflow.io | bash
# Set Java memory limit
# Add to .bashrc directly; "export" makes the variable visible to ./nextflow
echo 'export NXF_OPTS="-Xms1g -Xmx4g"' >> ~/.bashrc
# Source .bashrc to apply changes
source ~/.bashrc

In [3]:

#FINAL CODE
./nextflow run nf-core/differentialabundance \
--input /home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv \
--contrasts /home/user/Assignment_2/data/contrasts.csv \
--matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_counts.tsv \
--transcript_length_matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_lengths.tsv \
--genome GRCh38 \
--filtering_min_proportion 0.3 \
--filtering_grouping_var condition \
--deseq2_cores 4 \
--gsea_run true \
--gsea_permute gene_set \
--gsea_make_sets true \
--gsea_permute_n 1000 \
--gsea_metric Diff_of_Classes \
--gene_sets_files /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
--gprofiler2_run true \
--gprofiler2_organism hsapiens \
--gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
--gprofiler2_max_qval 0.05 \
--gprofiler2_correction_method gSCS \
--shinyngs_build_app true \
--max_cpus 8 --max_memory 8.GB \
--outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
-w /home/user/Assignment_2/scratch/work/homo_analysis \
-profile docker
