Roman E. Reggiardo, Vikas Peddu
16 July, 2023
You work for a molecular diagnostics laboratory.
As their lead bioinformatician, you need to process DNA sequencing data generated from patients who might have Lung Cancer.
The laboratory has decided to identify disease using a set of genes with known Lung Cancer-causing mutations: KRAS, EGFR, and TP53.
The lab team is sending data that might need to be processed further before it can be used.
Is to build a computational approach that
Eventually, we’re going to build a tool that Identifies patient DNA with Lung Cancer-causing mutations
The file you worked with in 01a, talking.txt, is in FASTA format.
> Roman
'Doing a PhD is fun'
> Vikas
'Its free for you, get paid to do science'
While this file has .txt as its extension, FASTA files usually have .fa or .fasta – we’ll use .fa
What are some things we could do with/to FASTA files using the command line?
dataYour team is going to provide with a couple different types of data
This data is reference data
Given your experience with the FASTA format, what do you expect the head and tail commands to print if you run them on a larger .fa file?
Mutations are changes to the sequence of a gene, at least compared to what its supposed to be (lots to say on ‘supposed’ sequences…)
But in order to find the right mutations, we need to know where the mutation happens in the gene, and where the gene is in the genome
In this scenario, the Lab has sent you a .fa file that contains a number of gene headers and their corresponding sequence.
gene_panel_database.fa has headers with 5 pieces of information
Gene Name, chromosome, start position, stop position, strand
The , used here are called delimeters , we’ll use them to separate data into tables
gene_panel_database.faYour Lab team has uploaded the gene_panel_database.fa file to /media/fileshare/
gene_panel_database.fa to your BSCC_2023_dir/data directory (using the command line)take a few moments to explore its contents, are there any problems with the data that we need to address?
gene_panel_database.fa contains genes you do not want to check for mutations.
Extract the headers for KRAS, EGFR, and TP53
print the headers to a file final_gene_panel.csv
these are the ‘columns’ of our data
csv – ‘comma-separated values’
>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,-
You’re trying to avoid running the same code over and over to extract the information you need from gene_panel_database.fa
for loops to ‘iterate’ over our genesThe for loop command structure is as follows
for variable name
in list of variable values ;
tool/command $variable name ;done
in Bash, placing $ in front of a word tells the computer its a variable
Variables ($variable_name) allow us to apply the same operation to multiple inputs
$gene is the variable that we use to represent KRAS, EGFR, and finally TP53 in the for loopTry:
Output:
bash: -c: line 0: syntax error near unexpected token `echo'
bash: -c: line 0: `for gene in KRAS EGFR TP53; echo $gene; done'
How do we fix each of these?
Fixed:
gene_panel_database.fa contains genes you do not want to check for mutations.
Iteratively Extract the headers for KRAS, EGFR, and TP53
print the headers to a file final_gene_panel.csv
these are the ‘columns’ of our data
csv – ‘comma-separated values’
accomplished in one for loop
>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,-
cutYour final_gene_panel.csv should look something like with rows on each line and columns separated by ,:
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
Which is fine, but still has the > from the FASTA format – not something we need anymore
cut, explainedcut can separate files based on delimeters that represent columns. It can also return only columns you want:
# cut -d [delimeter] -f [fields/columns] [input_file]
cut -d'>' -f2 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csvgene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
> and keep only the right (second) slice of data: -f2gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
delimeter with ,
what does setting -f2 return
-f1,4,5 ?
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
2.
chromosome
chr12
chr7
chr17
3.
gene,stop,strand
KRAS,25245384,-
EGFR,55170544,+
TP53,7676594,-
What we’ve just seen with grep and cut has split the one thing we want to do (making a csv) into two steps:
grep and print to the csvcut the file to remove the > leftoverinstead, we can combine these commands with the pipe operator: |
PipesImagine a literal piece of plumbing connecting the output of one command to the input of another. This is the | operator’s function
| takes the output from the left side and uses it as input on the right:gene_panel_database.fa contains genes you do not want to check for mutations.
Iteratively Extract the headers for KRAS, EGFR, and TP53 and trim off the carrot >
append the headers to a file final_gene_panel.csv with the first line:
gene,chromosome,start,stop,strandthese are the ‘columns’ of our data
csv – ‘comma-separated values’
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-
We just did a lot, what stands out to you as the most generally useful thing?
What is the most frustrating/slow/annoying aspect of how we’re doing things now?
How can we use for , pipe and $variables to make our Bioinformatics job smoother?
for loops enable iteration over multiple values represented by a variable
in , do , and donecut separated files based on delimeters (-d ) and allows you to select fields/columns (-f) to return
| pipe operator the end-to-end connection of commands by supplying the output from the left side of | as input to the right side.
cmd1 | cmd2 {output from 1} | cmd3 {output from 2} …Your for loop works well and you could see your Bioinformatics job getting a lot easier if you could just save this kind of stuff and use it on the other noisy data the Lab sends you….
Before that, let’s work with the patient data the Lab has also sent over
Each patient comes into the Lab and fills out a card:
patient_id,gene,age,sex
1,KRAS,45,M
These are each turned into a file and uploaded to the database as a compressed archive
tar/gzipYou might have seen these before: directories collapsed into a single, movable, compressed file:
Laboratory patient data:
We won’t go into detail here, safe to use the following command to unzip the archive into the original multiple-file format
wildcards* is called ‘grob’ and it matches or completes to anythingMatching everything that begins with 4 or 12:
*Matching everything with healthy in it
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//12_TP53healthy_card.csv
Matching everything that ends in .csv
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
Print all the file names that contain mut
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
???? is used when the number unknown of characters is knownOnly double digit patient ID numbers
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
???Only single digit patient ID numbers
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
?Print all file names that contain mut without explicitly including mut in the command
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
[]Ranges of numbers:
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
List of characters:
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv
[]Print all file names that start with digits 1 - 9
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
Alright, now we have the tools to do a bit more work with files and directories
patient_data.csv with column names.on my mac (zsh), [1-9] include 10,11,12, this doesn’t seem to generalize
patient_id,gene,age,sex
10,TP53,56,M
11,TP53,56,M
12,TP53,56,M
1,KRAS,56,M
2,KRAS,74,F
3,KRAS,74,F
4,KRAS,74,F
5,EGFR,63,F
6,EGFR,63,F
7,EGFR,63,F
8,EGFR,63,F
9,EGFR,63,F
We can operate on contents files : rows, columns
We can operate on the contents of directories : files themselves
wildcardsOne thing is missing from making this whole approach fast, reproducible, and easy – what do you think it is?