Roman E. Reggiardo, Vikas Peddu
16 July, 2023
You work for a molecular diagnostics laboratory.
As their lead bioinformatician, you need to process DNA sequencing data generated from patients who might have Lung Cancer.
The laboratory has decided to identify disease using a set of genes with known Lung Cancer-causing mutations: KRAS, EGFR, and TP53.
The lab team is sending data that might need to be processed further before it can be used.
Is to build a computational approach that
Eventually, we’re going to build a tool that Identifies patient DNA with Lung Cancer-causing mutations
The file you worked with in 01a, talking.txt
, is in FASTA
format.
> Roman
'Doing a PhD is fun'
> Vikas
'Its free for you, get paid to do science'
While this file has .txt
as its extension, FASTA
files usually have .fa
or .fasta
– we’ll use .fa
What are some things we could do with/to FASTA
files using the command line
?
data
Your team is going to provide with a couple different types of data
This data is reference data
Given your experience with the FASTA
format, what do you expect the head
and tail
commands to print if you run them on a larger .fa
file?
Mutations are changes to the sequence of a gene, at least compared to what its supposed to be (lots to say on ‘supposed’ sequences…)
But in order to find the right mutations, we need to know where the mutation happens in the gene, and where the gene is in the genome
In this scenario, the Lab has sent you a .fa
file that contains a number of gene headers
and their corresponding sequence
.
gene_panel_database.fa
has headers with 5 pieces of information
Gene Name, chromosome, start position, stop position, strand
The ,
used here are called delimeters
, we’ll use them to separate data into tables
gene_panel_database.fa
Your Lab team has uploaded the gene_panel_database.fa
file to /media/fileshare/
gene_panel_database.fa
to your BSCC_2023_dir/data
directory (using the command line)take a few moments to explore its contents, are there any problems with the data that we need to address?
gene_panel_database.fa
contains genes you do not want to check for mutations.
Extract the headers for KRAS, EGFR, and TP53
print the headers to a file final_gene_panel.csv
these are the ‘columns’ of our data
csv
– ‘comma-separated values’
>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,-
You’re trying to avoid running the same code over and over to extract the information you need from gene_panel_database.fa
for loops
to ‘iterate’ over our genesThe for loop
command structure is as follows
for variable name
in list of variable values
;
tool/command
$variable name
;done
in Bash
, placing $
in front of a word tells the computer its a variable
Variables ($variable_name
) allow us to apply the same operation to multiple inputs
$gene
is the variable that we use to represent KRAS, EGFR, and finally TP53 in the for loop
Try:
Output:
bash: -c: line 0: syntax error near unexpected token `echo'
bash: -c: line 0: `for gene in KRAS EGFR TP53; echo $gene; done'
How do we fix each of these?
Fixed:
gene_panel_database.fa
contains genes you do not want to check for mutations.
Iteratively Extract the headers for KRAS, EGFR, and TP53
print the headers to a file final_gene_panel.csv
these are the ‘columns’ of our data
csv
– ‘comma-separated values’
accomplished in one for loop
>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,-
cut
Your final_gene_panel.csv
should look something like with rows on each line and columns separated by ,
:
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
Which is fine, but still has the >
from the FASTA
format – not something we need anymore
cut
, explainedcut
can separate files based on delimeters
that represent columns. It can also return only columns you want:
# cut -d [delimeter] -f [fields/columns] [input_file]
cut -d'>' -f2 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csv
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
>
and keep only the right (second) slice of data: -f2
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
delimeter
with ,
what does setting -f2
return
-f1,4,5
?
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
2.
chromosome
chr12
chr7
chr17
3.
gene,stop,strand
KRAS,25245384,-
EGFR,55170544,+
TP53,7676594,-
What we’ve just seen with grep
and cut
has split the one thing we want to do (making a csv) into two steps:
grep
and print to the csvcut
the file to remove the >
leftoverinstead, we can combine these commands with the pipe
operator: |
Pipes
Imagine a literal piece of plumbing connecting the output
of one command to the input
of another. This is the |
operator’s function
|
takes the output from the left side and uses it as input on the right:gene_panel_database.fa
contains genes you do not want to check for mutations.
Iteratively Extract the headers for KRAS, EGFR, and TP53 and trim off the carrot >
append the headers to a file final_gene_panel.csv
with the first line:
gene,chromosome,start,stop,strand
these are the ‘columns’ of our data
csv
– ‘comma-separated values’
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
We just did a lot, what stands out to you as the most generally useful thing?
What is the most frustrating/slow/annoying aspect of how we’re doing things now?
How can we use for
, pipe
and $variables
to make our Bioinformatics job smoother?
for
loops enable iteration over multiple values represented by a variable
in
, do
, and done
cut
separated files based on delimeters
(-d
) and allows you to select fields/columns (-f
) to return
|
pipe operator the end-to-end connection of commands by supplying the output from the left side of |
as input to the right side.
cmd1 | cmd2 {output from 1} | cmd3 {output from 2} …
Your for loop
works well and you could see your Bioinformatics job getting a lot easier if you could just save this kind of stuff and use it on the other noisy data the Lab sends you….
Before that, let’s work with the patient data the Lab has also sent over
Each patient comes into the Lab and fills out a card:
patient_id,gene,age,sex
1,KRAS,45,M
These are each turned into a file
and uploaded to the database as a compressed archive
tar/gzip
You might have seen these before: directories collapsed into a single, movable, compressed file:
Laboratory patient data:
We won’t go into detail here, safe to use the following command to unzip
the archive into the original multiple-file format
wildcards
*
is called ‘grob’ and it matches or completes to anythingMatching everything that begins with 4
or 12
:
*
Matching everything with healthy
in it
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//12_TP53healthy_card.csv
Matching everything that ends in .csv
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
Print all the file names that contain mut
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
???
?
is used when the number unknown of characters is knownOnly double digit patient ID numbers
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
???
Only single digit patient ID numbers
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
?
Print all file names that contain mut
without explicitly including mut
in the command
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
[]
Ranges of numbers:
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
List of characters:
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
[]
Print all file names that start with digits 1 - 9
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
Alright, now we have the tools to do a bit more work with files and directories
patient_data.csv
with column names.on my mac (zsh
), [1-9] include 10,11,12, this doesn’t seem to generalize
patient_id,gene,age,sex
10,TP53,56,M
11,TP53,56,M
12,TP53,56,M
1,KRAS,56,M
2,KRAS,74,F
3,KRAS,74,F
4,KRAS,74,F
5,EGFR,63,F
6,EGFR,63,F
7,EGFR,63,F
8,EGFR,63,F
9,EGFR,63,F
We can operate on contents files
: rows, columns
We can operate on the contents of directories
: files
themselves
wildcards
One thing is missing from making this whole approach fast, reproducible, and easy – what do you think it is?