Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression | Matches |
---|---|
abc | an occurrence of abc within your data |
`(abc\|def)` | an occurrence of either abc or def within your data |
[abc] | a single character which is either a, b, or c |
[^abc] | a single character which is NOT a, b, or c |
[a-z] | any lowercase letter |
[a-zA-Z] | any letter (upper or lower case) |
[0-9] | numbers 0-9 |
\d | any digit (same as [0-9] ) |
\D | any non-digit character |
\w | any alphanumeric character or underscore |
\W | any character that is not alphanumeric or an underscore |
\s | any whitespace |
\S | any non-whitespace character |
. | any character (in most tools, except a newline) |
\. | literal . (period) |
{x,y} | between x and y repetitions of the preceding element |
^ | the beginning of the line |
$ | the end of the line |
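
To make the table above concrete, here is a minimal sketch using Python's built-in `re` module. Python is only an assumption for illustration; Galaxy tools may use a slightly different regular expression flavour, and the sample line is made up.

```python
import re

# Hypothetical sample line, invented for this illustration.
line = "chr14 start=1024 Gene_A"

# [0-9] and \d both match a single digit; findall returns every match.
digits_class = re.findall(r"[0-9]", line)
digits_short = re.findall(r"\d", line)

# [a-z] matches one lowercase letter at a time.
lowercase = re.findall(r"[a-z]", "aBc")

# ^ and $ anchor the pattern to the beginning and end of the line.
starts_with_chr = re.search(r"^chr", line) is not None
ends_with_digit = re.search(r"\d$", line) is not None

# {x,y} repeats the preceding element between x and y times.
two_to_four_digits = re.findall(r"\d{2,4}", line)

print(digits_class == digits_short)  # True
print(lowercase)                     # ['a', 'c']
print(starts_with_chr)               # True
print(ends_with_digit)               # False
print(two_to_four_digits)            # ['14', '1024']
```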
Note: characters such as `*`, `?`, `.`, and `+` have a special meaning in a regular expression. If you want to match those characters literally, you can escape them with a backslash. For example, `\?` matches the question mark character exactly.
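
The effect of escaping can be sketched in Python as follows. The `re` module and the sample text are assumptions for illustration; `re.escape` is a Python helper, not something Galaxy provides.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Is this gene v1.2 or v1x2?"

# Unescaped, "." matches any character, so "1.2" also matches "1x2".
unescaped = re.findall(r"1.2", text)

# Escaped with a backslash, "\." matches only a literal period.
escaped = re.findall(r"1\.2", text)

# "\?" matches the question mark character itself.
question_marks = re.findall(r"\?", text)

# re.escape adds the backslashes for you when the search text is literal.
literal_pattern = re.escape("v1.2")
literal_hits = re.findall(literal_pattern, text)

print(unescaped)       # ['1.2', '1x2']
print(escaped)         # ['1.2']
print(question_marks)  # ['?']
print(literal_hits)    # ['v1.2']
```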
Examples
Regular expression | Matches |
---|---|
\d{4} | 4 digits (e.g. a year) |
chr\d{1,2} | chr followed by 1 or 2 digits |
.*abc$ | anything with abc at the end of the line |
^$ | empty line |
^>.* | a line starting with > (e.g. a FASTA header) |
^[^>].* | a line not starting with > (e.g. a FASTA sequence) |
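
As a rough sketch, the expressions in this table can be tried out in Python on a few made-up lines. The FASTA-like records below are hypothetical, and Python's `re` module is assumed only for illustration.

```python
import re

# Hypothetical FASTA-like lines, invented for this illustration.
lines = [">seq1 chr7", "ACGT", "", ">seq2 chr12", "TTGA"]

# ^>.* keeps lines starting with ">" (headers).
headers = [line for line in lines if re.search(r"^>.*", line)]

# ^[^>].* keeps lines whose first character is not ">" (sequences).
# An empty line has no first character, so it is not matched.
sequences = [line for line in lines if re.search(r"^[^>].*", line)]

# ^$ matches only empty lines.
empty = [line for line in lines if re.search(r"^$", line)]

# chr\d{1,2} finds "chr" followed by 1 or 2 digits.
chromosomes = [m for line in lines for m in re.findall(r"chr\d{1,2}", line)]

# \d{4} finds exactly 4 digits in a row, e.g. a year.
years = re.findall(r"\d{4}", "24 July 1984")

print(headers)      # ['>seq1 chr7', '>seq2 chr12']
print(sequences)    # ['ACGT', 'TTGA']
print(empty)        # ['']
print(chromosomes)  # ['chr7', 'chr12']
print(years)        # ['1984']
```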
Replacing
Sometimes you need to capture the exact value you matched in order to use it in your replacement. We do this using capture groups `(...)`, which we can refer to using `\1`, `\2`, etc. for the first and second captured values. If you want to refer to the whole match, use `&`.
Regular expression | Input | Captures |
---|---|---|
chr(\d{1,2}) | chr14 | \1 = 14 |
(\d{2}) July (\d{4}) | 24 July 1984 | \1 = 24 , \2 = 1984 |
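
The captures in this table can be reproduced with a short Python sketch. This assumes Python's `re` module, where captured values are read with `group(1)`, `group(2)`, and so on, and the whole match is `group(0)`.

```python
import re

# chr(\d{1,2}) captures the digits after "chr".
chrom = re.search(r"chr(\d{1,2})", "chr14")
chrom_number = chrom.group(1)

# Two capture groups: the day and the year.
date = re.search(r"(\d{2}) July (\d{4})", "24 July 1984")
day = date.group(1)
year = date.group(2)

# group(0) is the whole match, regardless of capture groups.
whole_match = date.group(0)

print(chrom_number)  # 14
print(day, year)     # 24 1984
print(whole_match)   # 24 July 1984
```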
An expression like `s/find/replacement/g` is a replacement expression: it will substitute (`s`) any occurrence of `find` with `replacement`. It will do this globally (`g`), which means it doesn't stop after the first match.
Example: `s/chr(\d{1,2})/CHR\1/g` will replace `chr14` with `CHR14`, etc.
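
For comparison, the same replacement can be sketched in Python, where the find and replacement expressions are passed separately to `re.sub`. Python is an assumption for illustration here; `re.sub` replaces every match by default (like `g`), and `count=1` stops after the first match. The sample text is made up.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "chr1 chr14 chrX"

# Equivalent of s/chr(\d{1,2})/CHR\1/g: replace every match.
replaced_all = re.sub(r"chr(\d{1,2})", r"CHR\1", text)

# Equivalent of the same expression without g: replace only the first match.
replaced_first = re.sub(r"chr(\d{1,2})", r"CHR\1", text, count=1)

print(replaced_all)    # CHR1 CHR14 chrX
print(replaced_first)  # CHR1 chr14 chrX
```

Note that `chrX` is left untouched because `\d{1,2}` requires at least one digit.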
You can also use replacement modifiers, such as `\L` to convert to lower case or `\U` to convert to upper case. Example: `s/.*/\U&/g` will convert the whole text to upper case.
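
Not every flavour supports these modifiers. As one example, Python's `re` module has no `\U`, `\L`, or `&`, so a sketch of the same idea in Python uses a replacement function, and `\g<0>` to refer to the whole match. The sample strings are made up.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "acgt nnn"

# Rough equivalent of s/.*/\U&/g: upper-case whatever was matched.
upper = re.sub(r".*", lambda match: match.group(0).upper(), text)

# \g<0> in a Python replacement string plays the role of & (the whole match).
bracketed = re.sub(r"chr\d+", r"[\g<0>]", "chr14")

print(upper)      # ACGT NNN
print(bracketed)  # [chr14]
```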
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don't have to use the `s/../../g` structure.
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions. It will even explain a regular expression that you paste in.
Tip: Cyrilex is a visual regular expression tester.