01b: From Command Line to Bash

Roman E. Reggiardo, Vikas Peddu

16 July, 2023

Scenario: Mutations

  • You work for a molecular diagnostics laboratory.

  • As their lead bioinformatician, you need to process DNA sequencing data generated from patients who might have Lung Cancer.

  • The laboratory has decided to identify disease using a set of genes with known Lung Cancer-causing mutations: KRAS, EGFR, and TP53.

  • The lab team is sending data that might need to be processed further before it can be used.

Scenario: Your First Job

Is to build a computational approach that

  1. Builds your panel of potentially mutated genes
  2. Builds the patient manifest

Eventually, we’re going to build a tool that Identifies patient DNA with Lung Cancer-causing mutations

Reflection

The file you worked with in 01a, talking.txt, is in FASTA format.

head -4 ../../data/talking.txt
> Roman
'Doing a PhD is fun'
> Vikas
'Its free for you, get paid to do science'

While this file has .txt as its extension, FASTA files usually have .fa or .fasta – we’ll use .fa

What are some things we could do with/to FASTA files using the command line?

FASTA files contain sequence data

  • Your team is going to provide with a couple different types of data

  • This data is reference data

    • the sequences and locations we expect and understand the genes of interest to have

Prediction:

Given your experience with the FASTA format, what do you expect the head and tail commands to print if you run them on a larger .fa file?

What data do we need to identify mutations?

  • Mutations are changes to the sequence of a gene, at least compared to what its supposed to be (lots to say on ‘supposed’ sequences…)

  • But in order to find the right mutations, we need to know where the mutation happens in the gene, and where the gene is in the genome

Looking at data from the Laboratory

In this scenario, the Lab has sent you a .fa file that contains a number of gene headers and their corresponding sequence.

  • gene_panel_database.fa has headers with 5 pieces of information

    • Gene Name, chromosome, start position, stop position, strand

    • The , used here are called delimeters , we’ll use them to separate data into tables

Building the diagnostic panel: gene_panel_database.fa

Your Lab team has uploaded the gene_panel_database.fa file to /media/fileshare/

  • Copy gene_panel_database.fa to your BSCC_2023_dir/data directory (using the command line)

take a few moments to explore its contents, are there any problems with the data that we need to address?

Job #1: Extract the gene panel

gene_panel_database.fa contains genes you do not want to check for mutations.

  • Extract the headers for KRAS, EGFR, and TP53

  • print the headers to a file final_gene_panel.csv

  • these are the ‘columns’ of our data

  • csv – ‘comma-separated values’

Job #1 output

>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,- 

3 genes means we do everything 3x….but not by hand

You’re trying to avoid running the same code over and over to extract the information you need from gene_panel_database.fa

  • To do this, we can use for loops to ‘iterate’ over our genes
for gene in KRAS EGFR TP53;do echo $gene; done
KRAS
EGFR
TP53

Iteration: For Loops and Variables

The for loop command structure is as follows

for gene in KRAS EGFR TP53;do echo $gene; done
  • for variable name

    • in list of variable values ;

      • do tool/command $variable name ;
  • done

in Bash, placing $ in front of a word tells the computer its a variable

Variables in For Loops

Variables ($variable_name) allow us to apply the same operation to multiple inputs

  • In this examples, $gene is the variable that we use to represent KRAS, EGFR, and finally TP53 in the for loop

Try:

for gene in KRAS EGFR TP53;do echo gene; done
for genetic_element in KRAS EGFR TP53;do echo $gene; done
for gene in KRAS EGFR TP53; echo $gene; done
for gene in KRAS EGFR TP53;do echo $gene

Variables in For Loops

Output:

for gene in KRAS EGFR TP53;do echo gene; done
gene
gene
gene
for genetic_element in KRAS EGFR TP53;do echo $gene; done
for gene in KRAS EGFR TP53; echo $gene; done
bash: -c: line 0: syntax error near unexpected token `echo'
bash: -c: line 0: `for gene in KRAS EGFR TP53; echo $gene; done'
for gene in KRAS EGFR TP53;do echo $gene
bash: -c: line 1: syntax error: unexpected end of file

How do we fix each of these?

Variables in For Loops

Fixed:

for gene in KRAS EGFR TP53;do echo $gene; done
KRAS
EGFR
TP53
for genetic_element in KRAS EGFR TP53;do echo $genetic_element; done
KRAS
EGFR
TP53
for gene in KRAS EGFR TP53;do echo $gene; done
KRAS
EGFR
TP53
for gene in KRAS EGFR TP53;do echo $gene; done
KRAS
EGFR
TP53

Job #1.5: Do it with iteration!

gene_panel_database.fa contains genes you do not want to check for mutations.

  • Iteratively Extract the headers for KRAS, EGFR, and TP53

  • print the headers to a file final_gene_panel.csv

  • these are the ‘columns’ of our data

  • csv – ‘comma-separated values’

Job #1.5 output

accomplished in one for loop

>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,- 

A tool to work with columns: cut

  • Your final_gene_panel.csv should look something like with rows on each line and columns separated by ,:

    gene,chromosome,start,stop,strand
    KRAS,chr12,25215441,25245384,-
    EGFR,chr7,55019278,55170544,+
    TP53,chr17,7661939,7676594,- 
  • Which is fine, but still has the > from the FASTA format – not something we need anymore

    # cut -d [delimeter] -f [fields/columns] [input_file]
    cut -d '>' -f2 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csv
    gene,chromosome,start,stop,strand
    KRAS,chr12,25215441,25245384,-
    EGFR,chr7,55019278,55170544,+
    TP53,chr17,7661939,7676594,- 

cut, explained

cut can separate files based on delimeters that represent columns. It can also return only columns you want:

# cut -d [delimeter] -f [fields/columns] [input_file]
cut -d'>' -f2 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csv
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,- 
  • In this case, we want to cut the file by > and keep only the right (second) slice of data: -f2

Practice 05:

  1. Run:
cut -d '>' -f1 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csv
gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,- 
  • why doesn’t this return anything?
  1. replace the delimeter with ,
    1. what does setting -f2 return

    2. -f1,4,5 ?

Practice 05 output:

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,- 

2.

chromosome
chr12
chr7
chr17

3.

gene,stop,strand
KRAS,25245384,-
EGFR,55170544,+
TP53,7676594,- 

One product, two steps

What we’ve just seen with grep and cut has split the one thing we want to do (making a csv) into two steps:

  1. grep and print to the csv
  2. cut the file to remove the > leftover

instead, we can combine these commands with the pipe operator: |

grep 'KRAS' ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/gene_panel_database.fa | cut -d'>' -f2
KRAS,chr12,25215441,25245384,-

Complex commands: Pipes

Imagine a literal piece of plumbing connecting the output of one command to the input of another. This is the | operator’s function

  • | takes the output from the left side and uses it as input on the right:
# cmd 1 | cmd 2 [input is derived from | ]
echo 'gene,chromosome,start,stop,strand' | cut -d, -f1,3,5
gene,start,strand
# cmd 1 | cmd 2 [input derived from | ] | cmd 3 [input derived from | ]
pwd
pwd | cut -d/ -f8 | cut -d_ -f5
/Users/vikas/Documents/UCSC/teaching/ucsc_scbc_2022/slides/01_command_line_and_bash
slides

Job #1.75: Putting it all together

gene_panel_database.fa contains genes you do not want to check for mutations.

  • Iteratively Extract the headers for KRAS, EGFR, and TP53 and trim off the carrot >

  • append the headers to a file final_gene_panel.csv with the first line:

    • gene,chromosome,start,stop,strand
  • these are the ‘columns’ of our data

  • csv – ‘comma-separated values’

Job #1.75 output:

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,- 

Reflection:

We just did a lot, what stands out to you as the most generally useful thing?

What is the most frustrating/slow/annoying aspect of how we’re doing things now?

Prediction:

How can we use for , pipe and $variables to make our Bioinformatics job smoother?

Summary Job #1

  • for loops enable iteration over multiple values represented by a variable

    • uses in , do , and done
  • cut separated files based on delimeters (-d ) and allows you to select fields/columns (-f) to return

  • | pipe operator the end-to-end connection of commands by supplying the output from the left side of | as input to the right side.

    • cmd1 | cmd2 {output from 1} | cmd3 {output from 2} …

Back to work

Your for loop works well and you could see your Bioinformatics job getting a lot easier if you could just save this kind of stuff and use it on the other noisy data the Lab sends you….

Before that, let’s work with the patient data the Lab has also sent over

Job #2: Figuring out who the patients are

Each patient comes into the Lab and fills out a card:

patient_id,gene,age,sex
1,KRAS,45,M

These are each turned into a file and uploaded to the database as a compressed archive

Moving directories as files: tar/gzip

You might have seen these before: directories collapsed into a single, movable, compressed file:

  • Laboratory patient data:

    • link (right click and copy)
  • We won’t go into detail here, safe to use the following command to unzip the archive into the original multiple-file format

# located in /media/fileshare/patient_database.tar.gz
tar -zxvf patient_database.tar.gz

Working with multiple files in directory: wildcards

  • * is called ‘grob’ and it matches or completes to anything

Matching everything that begins with 4 or 12:

cat ../../data/patient_database//4*
patient_id,gene,age,sex
4,KRAS,74,F
ls ../../data/patient_database//12*
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv

Grob continued *

Matching everything with healthy in it

ls ../../data/patient_database//*healthy*
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//12_TP53healthy_card.csv

Matching everything that ends in .csv

# and truncating a little since there are alot
ls ../../data/patient_database//*.csv | head
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv

Practice 06: Grob

Print all the file names that contain mut

output

../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

More limited matching: number of characters ???

  • ? is used when the number unknown of characters is known

Only double digit patient ID numbers

ls ../../data/patient_database//??_*.csv
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv

More limited matching: number of characters ???

Only single digit patient ID numbers

ls ../../data/patient_database//?_*.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

Practice 07: ?

Print all file names that contain mut without explicitly including mut in the command

output

../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

More specific matches: ranges []

Ranges of numbers:

ls ../../data/patient_database//[2-4]_*.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv

List of characters:

ls ../../data/patient_database//*_[T,K]*.csv
../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

Practice 08: []

Print all file names that start with digits 1 - 9

output

../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv

Back to work…Job #2

Alright, now we have the tools to do a bit more work with files and directories

  • take the files you captured Practice 08 and combine them into a single patient_data.csv with column names.

Job #2 Output

on my mac (zsh), [1-9] include 10,11,12, this doesn’t seem to generalize

patient_id,gene,age,sex
10,TP53,56,M
11,TP53,56,M
12,TP53,56,M
1,KRAS,56,M
2,KRAS,74,F
3,KRAS,74,F
4,KRAS,74,F
5,EGFR,63,F
6,EGFR,63,F
7,EGFR,63,F
8,EGFR,63,F
9,EGFR,63,F

Reflection:

  • We can operate on contents files : rows, columns

  • We can operate on the contents of directories : files themselves

    • this is greatly facilitated by wildcards
  • One thing is missing from making this whole approach fast, reproducible, and easy – what do you think it is?