GitHub Repository: galaxyproject/training-material
Path: blob/main/faqs/galaxy/analysis_differential_expression_help.md
---
redirect_from: [/faqs/galaxy/analysis_extended_Extended_help_differential_expression_analysis_tools]
title: Extended Help for Differential Expression Analysis Tools
area: analysis
box_type: tip
layout: faq
contributors: [jennaj, Melkeb]
---

The error and usage help in this FAQ applies to most if not all Bioconductor tools.

  • DESeq2

  • Limma

  • edgeR

  • goseq

  • DiffBind

  • StringTie

  • featureCounts

  • HTSeq-count

  • HTSeq-clip

  • Kallisto

  • Salmon

  • Sailfish

  • DEXSeq

  • DEXSeq-count

  • IsoformSwitchAnalyzeR

{% icon galaxy-info %} Review your error messages and you'll find some clues about what may be going wrong and what needs to be adjusted in your rerun. If you are getting a message from R, that usually means the underlying tool could not read in or understand your inputs. This can be a labeling problem (what was typed on the form) or a content problem (data within the files).

Expect odd errors or content problems if any of the usage requirements below are not met.

General

  • Are your reference genome, reference transcriptome, and reference annotation all based on the same genome assembly?

    • Check the identifiers in all inputs and adjust as needed.

    • These all may mean the same thing to a person but not to a computer or tool: chr1, Chr1, 1, chr1.1

  • Differential expression tools all require count data from replicate samples. See the rationale from two of the DESeq tool authors.

    • At least two factor levels/groups/conditions with two samples each.

    • All samples must contain unique content for valid scientific results.

  • Factor/Factor level names should only contain alphanumeric characters and optionally underscores.

    • Avoid starting these with a number and do not include spaces.

    • Galaxy may be able to normalize these values for you, but if you are getting an error, standardize the format yourself.

  • DEXSeq additionally requires that the first Condition is labeled as Condition.

  • If your count inputs have a header line, set the option Files have header? to Yes. If they have no header line, set it to No.

    • If your files have more than one header line: keep the sample header line, remove all extra line(s).

  • Make sure that tool form settings match your annotation content or the tool cannot match up the inputs!

    • If you are counting by gene_id, your annotation should contain gene_id attributes (9th column)

    • If you are summarizing by exon, your annotation should contain exon features (3rd column)

  • Sometimes these tools do not understand transcript_id.N and gene_id.N notation (where N is a version number).

    • This notation could be in fasta or tabular inputs.

    • Try [removing .N from all inputs]({% link search2.html %}?query=olympics), and check for the accidental creation of new duplicates!

  • Errors? The job log messages can be confusing, but they are accessible and worth reviewing. See [Understanding the job log messages]({% link faqs/galaxy/analysis_troubleshooting.md %}).

    • The good news is that usage in Galaxy produces the same error messages as direct usage.

    • This means that a search at the Bioconductor Support website can provide useful clues! Come back to the Galaxy Help forum with any remaining questions.
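The suggestion above to remove the .N version suffix and then check for accidental duplicates can be sketched in Python. This is a minimal illustration rather than a Galaxy tool: the function names and the example identifiers are hypothetical, and it assumes the version is a trailing dot followed only by digits.

```python
import re
from collections import Counter

# Assumption: a version suffix is a trailing "." followed by digits, e.g. "TX0001.2".
VERSION_SUFFIX = re.compile(r"\.\d+$")

def strip_version(identifier):
    """Return the identifier without a trailing .N version suffix."""
    return VERSION_SUFFIX.sub("", identifier)

def duplicates_after_stripping(identifiers):
    """Return identifiers that occur more than once once the suffix is removed."""
    counts = Counter(strip_version(i) for i in identifiers)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical identifiers, for illustration only.
ids = ["TX0001.1", "TX0002.1", "TX0001.2"]
print([strip_version(i) for i in ids])   # ['TX0001', 'TX0002', 'TX0001']
print(duplicates_after_stripping(ids))   # ['TX0001']
```

If the second list is not empty, two versions of the same identifier collapsed into one name, and the duplicates need to be resolved before the data is used with a tool.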

{% icon tip %} Remember, for any value in your inputs that is not a number, using only alphanumeric characters and optionally underscores _ with no spaces is what the authors recommend. Check your factor names, sample names, gene identifiers, transcript identifiers, and header lines in files.
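The naming rule in the tip above (alphanumeric characters and underscores only, no spaces, no leading number) can be checked before a job is submitted. The sketch below is one possible check, not something Galaxy runs for you: the function names are hypothetical, and the `X` prefix used to repair a leading digit is an arbitrary choice made for this example.

```python
import re

# Letters, digits, and underscores only, and the first character is not a digit.
SAFE_NAME = re.compile(r"(?!\d)[A-Za-z0-9_]+")

def is_safe_name(name):
    """True if the name follows the recommended pattern."""
    return SAFE_NAME.fullmatch(name) is not None

def make_safe_name(name):
    """Replace runs of other characters with '_' and avoid a leading digit."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip()).strip("_")
    if not cleaned:
        raise ValueError("name has no usable characters")
    if cleaned[0].isdigit():
        cleaned = "X" + cleaned  # hypothetical prefix, chosen for this sketch
    return cleaned

print(is_safe_name("treated_24h"))     # True
print(is_safe_name("wild type"))       # False
print(make_safe_name("wild type"))     # wild_type
print(make_safe_name("2h-control"))    # X2h_control
```

After renaming, confirm that two different original names did not collapse into the same cleaned name.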

Reference genome (fasta)

  • Can be a server reference genome (hosted index in the pull down menu) or a custom reference genome (fasta from the history).

  • Custom reference genomes must be [formatted correctly]({% link faqs/galaxy/reference_genomes_custom_genomes.md %}).

  • If you are using Salmon or Kallisto, you probably don't need a reference genome but a reference transcriptome instead!

  • More about understanding and [working with large fasta datasets]({% link faqs/galaxy/datasets_working_with_fasta.md %}).
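As noted under General, chr1, Chr1, and 1 are different identifiers to a tool. One way to spot a mismatch is to compare the sequence names on the fasta title lines with the first column of the annotation. The sketch below assumes both inputs are already available as lists of text lines; the function names and the example data are hypothetical.

```python
def fasta_identifiers(lines):
    """Collect the first word after '>' from each fasta title line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }

def annotation_seqnames(lines):
    """Collect column 1 from GTF/GFF lines, skipping comments and blank lines."""
    return {
        line.split("\t")[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }

def missing_from_genome(fasta_lines, annotation_lines):
    """Sequence names used in the annotation but absent from the fasta."""
    return sorted(annotation_seqnames(annotation_lines) - fasta_identifiers(fasta_lines))

# Hypothetical example: the annotation uses "1" where the genome uses "chr1".
genome = [">chr1", "ACGT", ">chr2", "GGCC"]
annotation = [
    "#comment line",
    '1\thypo\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr2\thypo\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
print(missing_from_genome(genome, annotation))  # ['1']
```

The comparison is case sensitive on purpose, because the tools compare identifiers the same way.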

Reference transcriptome (fasta)

  • Fasta file containing assembled transcripts.

  • Unassembled short or long reads will not work as a substitute.

  • The transcript identifiers on the >seq fasta lines must exactly match the transcript_id values in your annotation or tabular mapping file.
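The exact-match requirement above can be tested in both directions. The sketch below compares the identifiers on the fasta title lines with the first column of a two-column transcript <tab> gene table. It is an illustration under assumptions: the function names and example data are hypothetical, and the identifier is taken to be the first word after `>`.

```python
def ids_from_fasta(lines):
    """First word after '>' on each fasta title line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }

def ids_from_mapping(lines):
    """First column of a two-column transcript<tab>gene table."""
    return {line.split("\t")[0].strip() for line in lines if line.strip()}

def compare_transcript_ids(fasta_lines, mapping_lines):
    """Report identifiers that appear in only one of the two inputs."""
    in_fasta = ids_from_fasta(fasta_lines)
    in_mapping = ids_from_mapping(mapping_lines)
    return {
        "only_in_fasta": sorted(in_fasta - in_mapping),
        "only_in_mapping": sorted(in_mapping - in_fasta),
    }

# Hypothetical example: the fasta carries a version suffix, the table does not.
transcripts = [">tx1.1 example transcript", "ACGT", ">tx2", "GGCC"]
tx_to_gene = ["tx1\tgeneA", "tx2\tgeneB"]
print(compare_transcript_ids(transcripts, tx_to_gene))
# {'only_in_fasta': ['tx1.1'], 'only_in_mapping': ['tx1']}
```

Both lists should be empty before the inputs are used together.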

Reference annotation (tabular, GTF, GFF3)

  • Reference annotation [in GTF format]({% link faqs/galaxy/datasets_working_with_reference_annotation.md %}) works best.

  • If a GTF dataset is not available for your genome, a two-column tabular dataset containing transcript <tab> gene can be used instead with most of these tools.

  • HTSeq-count requires GTF attributes. featureCounts is an alternative tool choice.

  • Sometimes the tool gffread is used to transform GFF3 data to GTF.

  • DO use UCSC's reference annotation (GTF) and reference transcriptome (fasta) data from their Downloads area.

    • These are a match for the UCSC genomes indexed at public Galaxy servers.

    • Links can be directly copy/pasted into the Upload tool.

    • Allow Galaxy to autodetect the datatype to produce an uncompressed dataset in your history ready to use with tools.

  • Avoid GTF data from the UCSC Table Browser, because it leads to scientific problems: these GTFs have the same content populated for both the transcript_id and gene_id values, so transcripts cannot be grouped by gene. See the note at UCSC for more about why.

  • Still have problems? Try removing all GTF header lines with the tool Remove beginning of a file.

  • More about understanding and [working with GTF/GFF/GFF3 reference annotation]({% link faqs/galaxy/datasets_working_with_reference_annotation.md %})
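Two of the annotation problems described in this FAQ can be counted directly: lines with no gene_id attribute in the 9th column, and lines where transcript_id and gene_id hold the same value (the UCSC Table Browser pattern). The sketch below is a rough check, not a GTF validator: the function names and example lines are hypothetical, and it assumes attributes are written as `key "value";` pairs.

```python
import re

# Assumption: attributes look like: gene_id "g1"; transcript_id "t1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(column9):
    """Turn the GTF attribute column into a dict of key -> value."""
    return dict(ATTRIBUTE.findall(column9))

def check_gtf(lines):
    """Count data lines, lines missing gene_id, and lines where both IDs match."""
    report = {"data_lines": 0, "missing_gene_id": 0, "same_transcript_and_gene_id": 0}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a 9-column line; ignored in this sketch
        report["data_lines"] += 1
        attrs = parse_attributes(fields[8])
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            report["missing_gene_id"] += 1
        elif attrs.get("transcript_id") == gene_id:
            report["same_transcript_and_gene_id"] += 1
    return report

# Hypothetical lines, for illustration only.
example = [
    "#header line",
    'chr1\thypo\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\thypo\texon\t1\t100\t.\t+\t.\tgene_id "t2"; transcript_id "t2";',
    'chr1\thypo\texon\t1\t100\t.\t+\t.\ttranscript_id "t3";',
]
print(check_gtf(example))
# {'data_lines': 3, 'missing_gene_id': 1, 'same_transcript_and_gene_id': 1}
```

If same_transcript_and_gene_id equals data_lines, the annotation probably cannot be used to summarize counts by gene.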