
Assignment 2:

Prerequisites and Dependencies

Software Tools:

Nextflow: nf-core/fetchngs, nf-core/rnaseq

Documentation: fetchngs, rnaseq

Reference Genome: NCBI RefSeq (GRCh38.p14)

FASTA: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz

GTF: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz

Sequence data:

Sequence IDs: run accessions for GEO series GSE180869, downloaded from the SRA Run Selector

Hardware:

32 vCPUs

128 GB RAM

Either:

50 GB SSD disk (if you are running the full pipeline using a subset of the RNASeq data)

(optional) or 150 GB SSD disk (if you are running the full pipeline using the full set of RNASeq data)

Note: There is no difference in learning experience or marks received in your choice of dataset. The only additional lesson you will learn is how expensive computing is.

Installation

Let's start by installing the core software Nextflow.

Create a terminal in your Lab 2 folder.

Run the shell script below.

IMPORTANT: This step only needs to be done once.

Execute these commands to install Java and Nextflow:

sudo apt update
# Install Java
sudo apt install default-jre
# When prompted, enter Y
# Install Nextflow
curl -s https://get.nextflow.io | bash
# Set Java memory limit
# Add to .bashrc directly (exported so that Nextflow inherits the variable)
echo 'export NXF_OPTS="-Xms1g -Xmx4g"' >> ~/.bashrc
# Source .bashrc to apply changes
source ~/.bashrc
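Before moving on, it can help to confirm that both tools are reachable. The sketch below is only an illustration: `check_tools` is a hypothetical helper, and it assumes the `nextflow` launcher was downloaded into the current folder, which is what the `./nextflow` calls in this notebook rely on.

```python
import shutil
from pathlib import Path


def check_tools(names, local_dir="."):
    """Return {name: True/False} for tools found on PATH or inside local_dir."""
    found = {}
    for name in names:
        on_path = shutil.which(name) is not None
        in_local_dir = (Path(local_dir) / name).is_file()
        found[name] = on_path or in_local_dir
    return found


if __name__ == "__main__":
    for tool, ok in check_tools(["java", "nextflow"]).items():
        print(f"{tool}: {'found' if ok else 'missing'}")
```

If either tool is reported as missing, rerun the installation commands from the folder where you intend to launch the pipelines.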

Multiple FASTQ files per sample

Run in terminal:

./nextflow run nf-core/fetchngs \
  --input [filepath] \
  --outdir ~/scratch/lab_2/fetchngs \
  --max_cpus 32 --max_memory 128.GB \
  --download_method sratools \
  --nf_core_pipeline rnaseq \
  -w ~/scratch/work/lab_2/fetchngs \
  -profile docker

RNAseq

Input file: ~/Assignment_2/data/id.csv

Run the following in the terminal:

→ fetches the SRA records and FASTQ data and prepares a samplesheet for nf-core/rnaseq

→ writes the output to ~/Assignment_2/scratch/fetchngs

%%bash
./nextflow run nf-core/fetchngs \
  --input ~/Assignment_2/data/id.csv \
  --outdir ~/Assignment_2/scratch/fetchngs \
  --max_cpus 32 --max_memory 128.GB \
  --download_method sratools \
  --nf_core_pipeline rnaseq \
  -w ~/Assignment_2/scratch/work/fetchngs \
  -profile docker

# Remember to clean your work directory before proceeding
# Example: ./nextflow clean nice_gilbert -f
./nextflow clean [run name] -f
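The `id.csv` passed to `--input` is a plain list of accessions, one per line. One way to build it, sketched here with the standard library, is to pull the `Run` column out of the `SraRunTable.csv` exported from the SRA Run Selector. The function name is an assumption for illustration, and the fetchngs documentation should be checked for the ID types it accepts.

```python
import csv


def write_id_list(run_table_csv, id_csv, column="Run"):
    """Copy the accession column of an SRA run table into a one-ID-per-line file."""
    with open(run_table_csv, newline="") as src:
        ids = [row[column].strip() for row in csv.DictReader(src) if row.get(column)]
    # Drop duplicates while preserving the original order.
    unique_ids = list(dict.fromkeys(ids))
    with open(id_csv, "w", newline="") as dst:
        dst.write("\n".join(unique_ids) + "\n")
    return unique_ids
```

For example, `write_id_list("/home/user/Assignment_2/data/SraRunTable.csv", "/home/user/Assignment_2/data/id.csv")` would regenerate the input file. Note that Python's `open` does not expand `~`, so full paths are used here.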

Index genome / Adapter trimming

%%bash
./nextflow run nf-core/rnaseq \
  --input ~/Assignment_2/scratch/fetchngs/samplesheet/samplesheet.csv \
  --outdir ~/Assignment_2/scratch/index_run \
  --fasta ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.fna.gz \
  --gtf ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
  --skip_alignment --skip_pseudo_alignment \
  --trimmer fastp \
  --save_reference true \
  -w ~/Assignment_2/scratch/work/index_run \
  -profile docker

# Remember to clean your work directory before proceeding
# Example: ./nextflow clean nice_gilbert -f
./nextflow clean [run name] -f

Navigate to ~/Assignment_2/scratch/index_run/fast .

Copy the first SRX########.fastp.html file to your Assignment_2 folder on the Home server. Make sure to select your Home server.

Copy index_run/pipeline_info/execution_report_{YYYY-MM-DD_HH:MM:SS}.html to your Assignment_2 folder.
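The copy steps above can also be scripted. This is a rough sketch rather than part of the lab: `copy_first_report` is a made-up helper, and the source and destination folders have to be supplied by you.

```python
import shutil
from pathlib import Path


def copy_first_report(src_dir, dest_dir, pattern="SRX*.fastp.html"):
    """Copy the alphabetically first file matching pattern from src_dir to dest_dir."""
    matches = sorted(Path(src_dir).expanduser().glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no file matching {pattern!r} in {src_dir}")
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    # copy2 keeps the file's timestamps and returns the new file's path.
    return Path(shutil.copy2(matches[0], dest))
```

Pass the report folder from the step above as `src_dir`. The same function can pick up the execution report by calling it on `index_run/pipeline_info` with `pattern="execution_report_*.html"`.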

Alignment

./nextflow run nf-core/rnaseq \
  --input ~/Assignment_2/scratch/fetchngs/samplesheet/samplesheet.csv \
  --outdir ~/Assignment_2/scratch/alignment_run \
  --fasta ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.fna.gz \
  --gtf ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
  --salmon_index ~/Assignment_2/scratch/index_run/genome/index/salmon \
  --trimmer fastp \
  --aligner hisat2 \
  --pseudo_aligner salmon \
  --extra_salmon_quant_args "--gcBias --seqBias" \
  --deseq2_vst true \
  -w ~/scratch/work/lab_2/alignment_run \
  -profile docker

input_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/soft/GSE180869_family.soft.gz"

matrix_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/matrix/GSE180869_series_matrix.txt.gz"

transcript_length_matrix_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/suppl/GSE180869_RAW.tar"
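The three GEO paths above share one layout: the series folder is the accession with its last three digits replaced by `nnn`. A small sketch that builds such URLs follows. The helper name is hypothetical and the pattern is inferred only from the URLs listed here.

```python
import re

GEO_FTP = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def geo_series_url(accession, subdir, filename):
    """Build a GEO URL such as .../GSE180nnn/GSE180869/matrix/<filename>."""
    match = re.fullmatch(r"(GSE)(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    prefix, digits = match.groups()
    # The parent folder replaces the last three digits with 'nnn'.
    stub = f"{prefix}{digits[:-3]}nnn"
    return f"{GEO_FTP}/{stub}/{accession}/{subdir}/{filename}"


print(geo_series_url("GSE180869", "matrix", "GSE180869_series_matrix.txt.gz"))
```

Building the paths from the accession avoids copy-paste slips such as mixing in a folder from a different series.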

differentialabundance

Create contrasts.csv file

We will perform the following comparisons:

control and NFIA_gRNA1 (blocking: replicate)

control and NFIA_gRNA2 (blocking: replicate)

control and NFIX_gRNA1 (blocking: replicate)

control and NFIX_gRNA2 (blocking: replicate)

control and NFIA/X_gRNA1 (blocking: replicate)

control and NFIA/X_gRNA2 (blocking: replicate)
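The comparisons listed above can be written to contrasts.csv programmatically. The sketch below assumes the column layout `id, variable, reference, target, blocking` described in the nf-core/differentialabundance documentation (worth verifying against the version you run), six distinct targets (two guides each for NFIA, NFIX and NFIA/X), and the underscore-style condition labels that appear later in this notebook. The `id` naming scheme and function names are arbitrary choices.

```python
import csv

FIELDS = ["id", "variable", "reference", "target", "blocking"]


def build_contrasts(reference, targets, variable="condition", blocking="replicate"):
    """Return one contrast row per target, each compared against the reference."""
    return [
        {
            "id": f"{target}_vs_{reference}",
            "variable": variable,
            "reference": reference,
            "target": target,
            "blocking": blocking,
        }
        for target in targets
    ]


def write_contrasts(rows, path):
    """Write the contrast rows as a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


targets = ["NFIA_gRNA1", "NFIA_gRNA2", "NFIX_gRNA1", "NFIX_gRNA2",
           "NFIA_X_gRNA1", "NFIA_X_gRNA2"]
rows = build_contrasts("Negative_gRNA", targets)
```

Calling `write_contrasts(rows, "contrasts.csv")` then produces the file. The reference and target labels must be spelled exactly as in the samplesheet's condition column.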

Files required:

input = "~/Assignment_2/data/samplesheet.csv"
contrasts = "~/Assignment_2/data/contrasts.csv"
matrix = "https://www.ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE180869&format=file&file=GSE180869_raw_counts_GRCh38.p13_NCBI.tsv.gz"
transcript_length_matrix = "https://www.ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE180869&format=file&file=GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv.gz"
gtf = "~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz"
gsea_permute = "~/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt"

Resource: https://github.com/ELTEbioinformatics/GMT_files_for_mulea/tree/main/GMT_files/Homo_sapiens_9606
./nextflow run nf-core/differentialabundance \
  --input ~/Assignment_2/data/samplesheet.csv \
  --contrasts ~/Assignment_2/data/contrasts.csv \
  --matrix ~/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv \
  --transcript_length_matrix ~/Assignment_2/reference_genome/GSE180869_raw_counts_GRCh38.p13_NCBI.tsv \
  --gtf ~/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
  --filtering_min_proportion 0.3 \
  --filtering_grouping_var condition \
  --deseq2_cores 4 \
  --gsea_run true \
  --gsea_permute ~/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
  --max_cpus 8 --max_memory 8.GB \
  --outdir ~/Assignment_2/scratch/homo_analysis_filtered \
  -w ~/Assignment_2/scratch/work/homo_analysis \
  -profile rnaseq,docker

# Remember to clean your work directory before proceeding
# Example: ./nextflow clean nice_gilbert -f
./nextflow clean [run name] -f
./nextflow run nf-core/differentialabundance \
  --input /home/user/Assignment_2/data/samplesheet.csv \
  --contrasts /home/user/Assignment_2/data/contrasts.csv \
  --matrix /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv \
  --transcript_length_matrix /home/user/Assignment_2/reference_genome/GSE180869_raw_counts_GRCh38.p13_NCBI.tsv \
  --gtf /home/user/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
  --filtering_min_proportion 0.3 \
  --filtering_grouping_var condition \
  --deseq2_cores 4 \
  --gsea_run true \
  --gsea_permute /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
  --gprofiler2_run true \
  --gprofiler2_organism hsapiens \
  --gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
  --gprofiler2_correction_method gSCS \
  --shinyngs_build_app true \
  --max_cpus 8 --max_memory 8.GB \
  --outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
  -w /home/user/Assignment_2/scratch/work/homo_analysis \
  -profile rnaseq,docker
# new
./nextflow run nf-core/differentialabundance \
  --input /home/user/Assignment_2/data/samplesheet.csv \
  --contrasts /home/user/Assignment_2/data/contrasts.csv \
  --matrix /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv \
  --transcript_length_matrix /home/user/Assignment_2/reference_genome/GSE180869_raw_counts_GRCh38.p13_NCBI.tsv \
  --gtf /home/user/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf \
  --filtering_min_proportion 0.3 \
  --filtering_grouping_var condition \
  --deseq2_cores 4 \
  --gsea_run true \
  --gene_sets_files /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
  --gsea_permute gene_set \
  --gprofiler2_run true \
  --gprofiler2_organism hsapiens \
  --gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
  --gprofiler2_correction_method gSCS \
  --shinyngs_build_app true \
  --max_cpus 8 --max_memory 8.GB \
  --outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
  -w /home/user/Assignment_2/scratch/work/homo_analysis \
  -profile rnaseq,docker
Cell In[1], line 7
    --gtf /home/user/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
    ^
SyntaxError: invalid decimal literal
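The `SyntaxError` above is what the Python kernel reports when a shell command is run as Python code. One option is to start the cell with `%%bash`, as the earlier cells do. Another, sketched below, is to assemble the argument list in Python and hand it to `subprocess`. `build_nextflow_cmd` and `run` are hypothetical helpers, and the parameter values in the example are copied from the cells above.

```python
import shlex
import subprocess


def build_nextflow_cmd(pipeline, params, profile, work_dir, launcher="./nextflow"):
    """Return the argument list for a `nextflow run` call."""
    cmd = [launcher, "run", pipeline]
    for name, value in params.items():
        # Pipeline parameters take a double dash; booleans become true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd += [f"--{name}", str(value)]
    # Nextflow's own options take a single dash.
    cmd += ["-w", work_dir, "-profile", profile]
    return cmd


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


cmd = build_nextflow_cmd(
    "nf-core/differentialabundance",
    {"gsea_run": True, "deseq2_cores": 4, "filtering_grouping_var": "condition"},
    profile="docker",
    work_dir="/home/user/Assignment_2/scratch/work/homo_analysis",
)
run(cmd)  # dry run: only prints the assembled command
```

Keeping the parameters in a dictionary also makes it easier to compare runs than editing a long backslash-continued command.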
head /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 head /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv

NameError: name 'head' is not defined
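`head` is a shell command, so the Python kernel treats it as an undefined name. In a notebook, prefixing the line with `!` sends it to the shell instead. A pure-Python alternative is sketched below, where `head` is a small illustrative function, not a built-in.

```python
from itertools import islice


def head(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `print("\n".join(head("/home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv")))` shows the header and first rows of the matrix without loading the whole file.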
./nextflow run nf-core/differentialabundance \
  --input /home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv \
  --contrasts /home/user/Assignment_2/data/contrasts.csv \
  --matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_counts.tsv \
  --transcript_length_matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_lengths.tsv \
  --genome GRCh38 \
  --filtering_min_proportion 0.3 \
  --filtering_grouping_var condition \
  --deseq2_cores 4 \
  --gsea_run true \
  --gsea_permute gene_set \
  --gsea_make_sets true \
  --gsea_permute_n 1000 \
  --gsea_metric Diff_of_Classes \
  --gene_sets_files /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
  --max_cpus 8 --max_memory 8.GB \
  --outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
  -w /home/user/Assignment_2/scratch/work/homo_analysis \
  -profile docker
import pandas as pd

samplesheet = pd.read_csv("/home/user/labs/data/case_study/GSE180869/samplesheet.csv")
mapping = pd.read_csv("/home/user/Assignment_2/data/SraRunTable.csv", usecols=["Run", "GEO_Accession (exp)"])
mapping.columns = ["sample", "sample_gsm"]  # Rename columns to match your samplesheet

# Merge the mapping into your sample sheet
merged = pd.merge(samplesheet, mapping, on="sample", how="left")

# Save the updated file
merged.to_csv("/home/user/Assignment_2/data/samplesheet_with_gsm.csv", index=False)
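A left merge keeps every row of the samplesheet, so a `sample` with no matching `Run` ends up with an empty `sample_gsm` rather than raising an error. That could happen, for instance, if the samplesheet's `sample` column holds experiment accessions (SRX...) while `Run` holds run accessions (SRR...). A quick check is sketched below in plain Python so that it runs without pandas. The function name and the toy identifiers are invented.

```python
def unmatched_samples(samples, run_to_gsm):
    """Return the sample IDs that have no entry in the Run -> GSM mapping."""
    return [sample for sample in samples if sample not in run_to_gsm]


# Toy data: one sample matches a run accession, one does not.
toy_mapping = {"SRR0000001": "GSM0000001"}
print(unmatched_samples(["SRR0000001", "SRX0000002"], toy_mapping))
```

With the real tables, the pandas equivalent is to inspect `merged[merged["sample_gsm"].isna()]` after the merge.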
import pandas as pd

# Load the original sample sheet
samplesheet_df = pd.read_csv("samplesheet.csv")

# Define a helper function to extract the condition from sample title
def extract_condition(title):
    title = title.lower()
    if 'negative' in title:
        return 'Negative_gRNA'
    elif 'nfia' in title and 'x' in title:
        # Special case based on your contrast ID; keep the two guides apart
        return 'NFIA/X_gRNA1' if 'grna1' in title else 'NFIA/X_gRNA2'
    elif 'nfia' in title:
        if 'grna1' in title:
            return 'NFIA_gRNA1'
        else:
            return 'NFIA_gRNA2'
    elif 'nfix' in title:
        if 'grna1' in title:
            return 'NFIX_gRNA1'
        else:
            return 'NFIX_gRNA2'
    else:
        return 'Unknown'

# Apply the function to create a new 'condition' column
samplesheet_df['condition'] = samplesheet_df['sample_title'].apply(extract_condition)

# Save the updated DataFrame to a new CSV file
samplesheet_df.to_csv("samplesheet_with_condition.csv", index=False)
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[5], line 4
      1 import pandas as pd
      3 # Load the original sample sheet
----> 4 samplesheet_df = pd.read_csv("samplesheet.csv")
      6 # Define a helper function to extract the condition from sample title
      7 def extract_condition(title):
FileNotFoundError: [Errno 2] No such file or directory: 'samplesheet.csv'
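The relative name `samplesheet.csv` is resolved against the notebook's current working directory, which is why the read fails when the file lives elsewhere. A minimal sketch of a lookup that tries several folders and reports where it looked is given below. `find_input` is a hypothetical helper, and the folder in the usage note is the one used by the other cells in this notebook.

```python
from pathlib import Path


def find_input(filename, search_dirs):
    """Return the first existing search_dir/filename, or raise listing what was tried."""
    tried = []
    for directory in search_dirs:
        candidate = Path(directory).expanduser() / filename
        if candidate.is_file():
            return candidate
        tried.append(str(candidate))
    raise FileNotFoundError(f"{filename} not found; looked in: {', '.join(tried)}")
```

For example, `pd.read_csv(find_input("samplesheet.csv", [".", "/home/user/labs/data/case_study/GSE180869"]))` reads the file from whichever of the two folders contains it.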
  --gprofiler2_run true \
  --gprofiler2_organism hsapiens \
  --gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
  --gprofiler2_max_qval 0.05 \
  --gprofiler2_correction_method gSCS \
  --shinyngs_build_app true \
import pandas as pd

df = pd.read_csv("/home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv")
# Replace slashes with underscores in the condition labels (e.g. NFIA/X_gRNA1 -> NFIA_X_gRNA1)
df["condition"] = df["condition"].str.replace("/", "_", regex=False)
df.to_csv("/home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv", index=False)

# Load the original contrast file
contrasts = pd.read_csv("/home/user/Assignment_2/data/contrasts.csv")

# Replace slashes with underscores in both columns
contrasts["reference"] = contrasts["reference"].str.replace("/", "_", regex=False)
contrasts["target"] = contrasts["target"].str.replace("/", "_", regex=False)
contrasts.to_csv("/home/user/Assignment_2/data/contrasts.csv", index=False)
import pandas as pd

df = pd.read_csv("/home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv")
print(df["condition"].value_counts())
condition
Negative_gRNA    2
NFIA_gRNA1       2
NFIA_gRNA2       2
NFIX_gRNA1       2
NFIX_gRNA2       2
NFIA_X_gRNA1     2
NFIA_X_gRNA2     2
Name: count, dtype: int64
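The counts above list seven condition labels. Each `reference` and `target` in contrasts.csv needs to match one of them exactly, so a label such as `control` or one that still contains a slash would not be found. The check below is a sketch with a made-up function name; it works on plain lists, so the values can come from pandas or anywhere else.

```python
def missing_levels(conditions, contrasts):
    """Return contrast levels (reference or target) that are absent from the conditions."""
    known = set(conditions)
    missing = []
    for reference, target in contrasts:
        for level in (reference, target):
            if level not in known and level not in missing:
                missing.append(level)
    return missing


conditions = ["Negative_gRNA", "NFIA_gRNA1", "NFIA_gRNA2", "NFIX_gRNA1",
              "NFIX_gRNA2", "NFIA_X_gRNA1", "NFIA_X_gRNA2"]
example_contrasts = [("Negative_gRNA", "NFIA_X_gRNA1"), ("control", "NFIA/X_gRNA2")]
print(missing_levels(conditions, example_contrasts))
```

With the real files, `conditions` could be `df["condition"].unique()` and the pairs could be `zip(contrasts["reference"], contrasts["target"])`. An empty result means every contrast level is present in the samplesheet.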
# FINAL CODE
./nextflow run nf-core/differentialabundance \
  --input /home/user/labs/data/case_study/GSE180869/samplesheet_with_condition.csv \
  --contrasts /home/user/Assignment_2/data/contrasts.csv \
  --matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_counts.tsv \
  --transcript_length_matrix /home/user/labs/data/case_study/GSE180869/salmon.merged.gene_lengths.tsv \
  --genome GRCh38 \
  --filtering_min_proportion 0.3 \
  --filtering_grouping_var condition \
  --deseq2_cores 4 \
  --gsea_run true \
  --gsea_permute gene_set \
  --gsea_make_sets true \
  --gsea_permute_n 1000 \
  --gsea_metric Diff_of_Classes \
  --gene_sets_files /home/user/Assignment_2/reference_genome/GO_All_Homo_sapiens_GeneSymbol.gmt \
  --gprofiler2_run true \
  --gprofiler2_organism hsapiens \
  --gprofiler2_sources "GO,GO:MF,GO:BP,GO:CC" \
  --gprofiler2_max_qval 0.05 \
  --gprofiler2_correction_method gSCS \
  --shinyngs_build_app true \
  --max_cpus 8 --max_memory 8.GB \
  --outdir /home/user/Assignment_2/scratch/homo_analysis_filtered \
  -w /home/user/Assignment_2/scratch/work/homo_analysis \
  -profile docker