Assignment 2:
Prerequisites and Dependencies
Software Tools:
Nextflow: nf-core/fetchngs, nf-core/rnaseq
Documentation: fetchngs, rnaseq
Reference Genome: Ensembl
Sequence data:
Sequence IDs to download from the SRA Run Selector: GSE180869
Hardware:
32 vCPUs
128 GB RAM
Either:
50 GB SSD disk (if you run the full pipeline on a subset of the RNA-seq data)
or, optionally, 150 GB SSD disk (if you run the full pipeline on the full set of RNA-seq data)
Note: Your choice of dataset makes no difference to the learning experience or to the marks you receive. The only additional lesson from the full dataset is how expensive computing is.
Installation
Let's start by installing the core software, Nextflow.
Open a terminal in your Lab 2 folder.
Run the shell script below.
IMPORTANT: This step only needs to be done once.
Execute these commands to install Java:
Multiple FASTQ files per sample
Run in terminal:
RNAseq
~/Assignment_2/data/id.csv
Run the following in the terminal:
→ fetch the SRA metadata and FASTQ data, then run rnaseq
→ output into ~/Assignment_2/scratch/fetchngs
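The fetch step above can be sketched from a notebook cell as follows. This is only an illustration: the helper names write_id_file and fetchngs_command are hypothetical, the docker profile is an assumption about your environment, and the command is printed rather than executed so you can paste it into the terminal yourself. The --input and --outdir options are the ones nf-core/fetchngs documents; check the pipeline documentation for your installed version.

```python
import shlex
from pathlib import Path


def write_id_file(ids, path):
    """Write one accession per line (hypothetical helper for the id.csv file)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(ids) + "\n")
    return path


def fetchngs_command(id_file, outdir, profile="docker"):
    """Build the argument list for a fetchngs run; the profile is an assumption."""
    return [
        "nextflow", "run", "nf-core/fetchngs",
        "--input", str(id_file),
        "--outdir", str(outdir),
        "-profile", profile,
    ]


base = Path.home() / "Assignment_2"
cmd = fetchngs_command(base / "data" / "id.csv", base / "scratch" / "fetchngs")

# Print a shell-ready command line instead of launching the pipeline here.
print(shlex.join(cmd))
```

Calling write_id_file(["GSE180869"], base / "data" / "id.csv") beforehand would create the ID file; it is left out of the script so that nothing is written until you decide to.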
Index genome / Adapter trimming
Navigate to ~/Assignment_2/scratch/index_run/fast.
Copy the first SRX########.fastp.html file to your Assignment_2 folder on the Home server. Make sure to select your Home server.
Copy index_run/pipeline_info/execution_report_{YYYY-MM-DD_HH:MM:SS}.html to your Assignment_2 folder.
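If both folders are reachable from the same machine, the two copy steps above could be scripted roughly like this. It is a sketch under assumptions: the helper names first_match and copy_reports are hypothetical, the reports are found by searching the whole index_run folder rather than a specific subfolder, and "first" is taken to mean alphabetically first.

```python
import shutil
from pathlib import Path


def first_match(root, pattern):
    """Return the alphabetically first file under root matching pattern, or None."""
    matches = sorted(Path(root).rglob(pattern))
    return matches[0] if matches else None


def copy_reports(run_dir, dest_dir):
    """Copy the first fastp report and the first execution report into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in ("SRX*.fastp.html", "execution_report_*.html"):
        source = first_match(run_dir, pattern)
        if source is not None:
            copied.append(Path(shutil.copy2(source, dest)))
    return copied


run_dir = Path.home() / "Assignment_2" / "scratch" / "index_run"
if run_dir.exists():
    for path in copy_reports(run_dir, Path.home() / "Assignment_2"):
        print("copied", path)
```

If the Home server is a separate mount or a remote machine, the destination path would need to point there instead.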
Adapter trimming
input_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/soft/GSE180869_family.soft.gz"
matrix_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/matrix/GSE180869_series_matrix.txt.gz"
transcript_length_matrix_path = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180869/suppl/GSE180869_RAW.tar"
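The three GEO paths above share one pattern: the series folder replaces the last three digits of the accession with "nnn". A minimal sketch that derives them from the accession, so a pasted URL from a different series cannot slip in, might look like this. The function names geo_series_base and geo_url are hypothetical, and only the three file kinds used above are covered.

```python
def geo_series_base(accession):
    """Return the GEO FTP folder for a series accession such as GSE180869."""
    prefix, digits = accession[:3], accession[3:]
    # GEO groups series by thousands: GSE180869 lives under GSE180nnn.
    stem = prefix + digits[:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stem}/{accession}"


def geo_url(accession, kind):
    """Build the URL for the 'soft', 'matrix', or 'suppl' file of a series."""
    suffixes = {
        "soft": f"soft/{accession}_family.soft.gz",
        "matrix": f"matrix/{accession}_series_matrix.txt.gz",
        "suppl": f"suppl/{accession}_RAW.tar",
    }
    return f"{geo_series_base(accession)}/{suffixes[kind]}"


input_path = geo_url("GSE180869", "soft")
matrix_path = geo_url("GSE180869", "matrix")
transcript_length_matrix_path = geo_url("GSE180869", "suppl")
print(input_path)
print(matrix_path)
print(transcript_length_matrix_path)
```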
differentialabundance
Create the contrasts.csv file
We will perform the following comparisons:
control and NFIA_gRNA1 (blocking: replicate)
control and NFIA_gRNA2 (blocking: replicate)
control and NFIX_gRNA1 (blocking: replicate)
control and NFIX_gRNA2 (blocking: replicate)
control and NFIA/X_gRNA1 (blocking: replicate)
control and NFIA/X_gRNA2 (blocking: replicate)
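One way to generate the contrasts file for these comparisons is sketched below. Several things here are assumptions to check against your own sample sheet and the nf-core/differentialabundance documentation: the column layout (id, variable, reference, target, blocking), the sample-sheet column name "condition", the exact spelling of each condition, and that the sixth comparison targets NFIA/X_gRNA2. The helper names are hypothetical. The slash in "NFIA/X" is replaced in the id column only, since ids often end up in file names.

```python
import csv
import io

# Target conditions compared against the control (spelling is an assumption).
TARGETS = [
    "NFIA_gRNA1", "NFIA_gRNA2",
    "NFIX_gRNA1", "NFIX_gRNA2",
    "NFIA/X_gRNA1", "NFIA/X_gRNA2",
]

FIELDS = ["id", "variable", "reference", "target", "blocking"]


def build_contrasts(targets, variable="condition", reference="control",
                    blocking="replicate"):
    """Return one contrast row per target, each blocked on the replicate."""
    rows = []
    for target in targets:
        safe = target.replace("/", "_")  # keep the id free of path separators
        rows.append({
            "id": f"{reference}_vs_{safe}",
            "variable": variable,
            "reference": reference,
            "target": target,
            "blocking": blocking,
        })
    return rows


def write_contrasts(rows, handle):
    """Write the rows as CSV to any open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_contrasts(build_contrasts(TARGETS), buffer)
print(buffer.getvalue())
```

To produce the actual file, pass an open file handle instead of the buffer, for example inside with open("contrasts.csv", "w", newline="") as handle.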
Cell In[1], line 7
--gtf /home/user/Assignment_2/reference_genome/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz \
^
SyntaxError: invalid decimal literal
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[3], line 1
----> 1 head /home/user/Assignment_2/reference_genome/GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv
NameError: name 'head' is not defined
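The NameError above arises because head is a shell command, not a Python name, so a Python cell cannot call it directly. A rough Python equivalent, written as a sketch with a hypothetical helper that happens to share the name, reads only the first few lines of the file:

```python
from itertools import islice
from pathlib import Path


def head(path, n=10):
    """Return the first n lines of a text file, similar to the shell `head`."""
    with Path(path).open("r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


tsv = Path("/home/user/Assignment_2/reference_genome/"
           "GSE180869_norm_counts_TPM_GRCh38.p13_NCBI.tsv")
if tsv.exists():
    for line in head(tsv, 5):
        print(line)
```

Because islice stops after n lines, the whole counts table is never loaded into memory.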
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[5], line 4
1 import pandas as pd
3 # Load the original sample sheet
----> 4 samplesheet_df = pd.read_csv("samplesheet.csv")
6 # Define a helper function to extract the condition from sample title
7 def extract_condition(title):
FileNotFoundError: [Errno 2] No such file or directory: 'samplesheet.csv'
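The FileNotFoundError above means the notebook's working directory does not contain samplesheet.csv, so the relative path fails. A small sketch for locating the file before loading it follows; find_samplesheet is a hypothetical helper, and the search root is an assumption (fetchngs normally writes its sample sheet somewhere under its output folder, but confirm where yours ended up).

```python
from pathlib import Path


def find_samplesheet(name="samplesheet.csv", search_root=None):
    """Return the first file called `name` under search_root, or raise clearly."""
    root = Path(search_root) if search_root else Path.cwd()
    matches = sorted(root.rglob(name))
    if not matches:
        raise FileNotFoundError(
            f"{name} not found under {root.resolve()}; "
            "check the working directory or pass an absolute path"
        )
    return matches[0]


root = Path.home() / "Assignment_2" / "scratch" / "fetchngs"
if root.exists():
    try:
        print(find_samplesheet(search_root=root))
    except FileNotFoundError as error:
        print(error)
```

The returned path can then be passed to pd.read_csv in place of the bare file name.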