Exercise: Retrieving data from bio data repositories
This exercise covers retrieving data from various commonly used bio data repositories.
We will do these exercises in an interactive session launched using the sinteractive command:
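A minimal sketch of starting the session. It assumes that your project is called `project_2001234` (a hypothetical placeholder) and that `sinteractive` takes the billing project through `--account`. The surrounding check only keeps the snippet harmless on machines where the command is not installed.

```shell
# Hypothetical project name: replace it with your own CSC project.
project="project_2001234"

if command -v sinteractive >/dev/null 2>&1; then
    # Opens a shell on a compute node, billed to the given project.
    sinteractive --account "$project"
else
    echo "sinteractive not found: run this on a Puhti login node"
fi
```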
Alternatively, open a compute node shell through the Puhti web interface.
To access the applications in parts 2 and 3, we will need to load the `biokit` module:

Create a directory for yourself under the `/scratch` directory of your project and move there:
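The two preparation steps could look like the sketch below. The project name and the `bio-data` subdirectory are assumptions made for illustration, and the fallback to a temporary directory is only there so that the snippet can also be tried outside the cluster.

```shell
# Assumptions: the project name and the bio-data subdirectory are placeholders.
project="project_2001234"
user_name="${USER:-$(id -un)}"

# The module command only exists on the cluster, so skip it elsewhere.
if command -v module >/dev/null 2>&1; then
    module load biokit || echo "biokit module not available here"
fi

# Use the project's scratch area when present; otherwise fall back to a
# temporary directory so the sketch can be tried off-cluster too.
scratch_root="/scratch/$project"
[ -d "$scratch_root" ] || scratch_root="$(mktemp -d)"

# One subdirectory per user and task, as recommended for shared scratch.
work_dir="$scratch_root/$user_name/bio-data"
mkdir -p "$work_dir"
cd "$work_dir"
pwd
```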
💭 Everyone in a project shares the same `/scratch` directory, so it is a good idea to use subdirectories for each user and task to avoid accidentally deleting or overwriting others' files.
🗯 In normal usage it may be a good idea to use the `chmod` command to alter file access rights so that only you have write access to your own subfolder, but please do not do this if you are using a CSC course project, as it will make clean-up after the course harder.
💡 You can find more information about this on the Disk areas page in Docs CSC.
1. Downloading data with curl
`curl` and `wget` are general tools to download data from a URL.

Download a dataset from the internet using `curl` and uncompress it. The dataset contains some Pythium genomes with related BWA indexes.
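As a sketch of the download and uncompress steps, the helper below wraps both in one function. The function name `fetch_and_unpack` is made up for this example, and it assumes that the dataset is a gzip-compressed tar archive. The dataset URL itself is not reproduced here, so the usage line keeps a placeholder.

```shell
# fetch_and_unpack URL ARCHIVE: download URL to ARCHIVE with curl, then
# extract the gzip-compressed tar archive into the current directory.
fetch_and_unpack() {
    url="$1"
    archive="$2"
    # -f: fail on HTTP errors, -L: follow redirects, -o: output file name
    curl -fL -o "$archive" "$url" || return 1
    tar -xzf "$archive"
}

# Usage (replace the placeholder with the dataset address):
# fetch_and_unpack "<dataset-url>" pythium.tgz
```

Saving the archive under an explicit name with `-o` keeps the compressed file around, so the extraction can be repeated without downloading again.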
2. Downloading data with NCBI edirect
Next we use the NCBI edirect tool to retrieve some data. First create a directory `cellulose_synthase` and move to this new directory:
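Creating the directory and moving into it could look like this:

```shell
# -p makes mkdir succeed even if the directory already exists.
mkdir -p cellulose_synthase
cd cellulose_synthase
```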
Check how many proteins are found in the NCBI protein database for Pythium species (`count` row in the results):

Check the number of proteins for cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 that are found for Pythium species.
For cellulose synthase 1 this can be done with:
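A minimal sketch of the two searches so far, assuming that the organism is restricted with the Entrez `[ORGN]` field tag and that the protein name is added as free text. The exact query wording and the helper `protein_query` are assumptions of this example. `esearch` prints a short XML summary, and the count is one of its elements.

```shell
# Organism restriction, using the Entrez [ORGN] field tag.
organism="Pythium [ORGN]"

# protein_query NAME: build the query for one protein name within the organism.
protein_query() {
    printf '%s AND %s' "$organism" "$1"
}

if command -v esearch >/dev/null 2>&1; then
    # All Pythium proteins: read the count from the XML that esearch prints.
    esearch -db protein -query "$organism"
    # Pythium proteins matching "cellulose synthase 1".
    esearch -db protein -query "$(protein_query "cellulose synthase 1")"
else
    echo "esearch not found: load the biokit module first"
fi
```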
Do the same for the other proteins.
Retrieve the cellulose synthase 3 sequences in FASTA format.
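One way to do the retrieval is to pipe the search result into `efetch`. This is a sketch, not necessarily the command used in the course. The query wording and the output file name `cellulose_synthase_3.fasta` are assumptions.

```shell
query="Pythium [ORGN] AND cellulose synthase 3"
fasta_file="cellulose_synthase_3.fasta"   # hypothetical output name

if command -v esearch >/dev/null 2>&1 && command -v efetch >/dev/null 2>&1; then
    # esearch finds the records; efetch downloads them as FASTA.
    esearch -db protein -query "$query" | efetch -format fasta > "$fasta_file"
    # Each FASTA record starts with ">", so this counts the sequences.
    grep -c '^>' "$fasta_file"
else
    echo "edirect tools not found: load the biokit module first"
fi
```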
Run an `esearch` command that tells how many cellulose synthase 3 sequences there are in total in the NCBI protein database.

Extra exercise for fast ones: Align the cellulose synthase 3 set with `mafft`.

Study the results:
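The alignment and a quick look at its result might be sketched as follows. The input and output file names are hypothetical, and `count_fasta_records` is a small helper written for this example. `mafft` is run with its default settings and writes the alignment to standard output.

```shell
# count_fasta_records FILE: number of sequences (header lines) in FILE.
count_fasta_records() {
    grep -c '^>' "$1"
}

input="cellulose_synthase_3.fasta"            # hypothetical file names
aligned="cellulose_synthase_3_aligned.fasta"

if command -v mafft >/dev/null 2>&1 && [ -s "$input" ]; then
    # mafft writes the alignment to standard output.
    mafft "$input" > "$aligned"
    # The alignment should hold as many records as the input.
    count_fasta_records "$input"
    count_fasta_records "$aligned"
    # Show the beginning of the alignment.
    head -n 20 "$aligned"
else
    echo "mafft or $input not available, skipping the alignment"
fi
```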
3. Downloading with enaDataGet
Check the options of `enaDataGet` with the command:

Download a file (Pythium iwayamai genome assembly).
Extra exercise for fast ones: Study the downloaded file:
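The three steps of this part could be sketched as below. The assembly accession is not reproduced here, so the example reads it from a hypothetical `ENA_ACCESSION` variable and skips the download when it is empty. It also assumes that `-f fasta` is the format wanted, and `peek_fasta` is a helper made up for this example.

```shell
# The accession is deliberately left empty: set ENA_ACCESSION to the
# assembly accession given in the course material.
accession="${ENA_ACCESSION:-}"

# peek_fasta FILE: print the first header lines of a FASTA file,
# decompressing on the fly when the name ends in .gz.
peek_fasta() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep '^>' | head -n 5
}

if command -v enaDataGet >/dev/null 2>&1 && [ -n "$accession" ]; then
    enaDataGet -h                       # list the available options
    enaDataGet -f fasta "$accession"    # download in FASTA format
    ls -lh                              # see what was downloaded
else
    echo "enaDataGet not found or no accession set, skipping the download"
fi
```

After the download, `peek_fasta` can be pointed at the file that appears in the `ls -lh` listing to see the names of the first sequences.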
4. Finishing up
Close the interactive session when you are done by typing `exit`.