01b: From Command Line to Bash

Roman E. Reggiardo, Vikas Peddu

16 July, 2023

Scenario: Mutations

You work for a molecular diagnostics laboratory.
As their lead bioinformatician, you need to process DNA sequencing data generated from patients who might have Lung Cancer.
The laboratory has decided to identify disease using a set of genes with known Lung Cancer-causing mutations: KRAS, EGFR, and TP53.
The lab team is sending data that might need to be processed further before it can be used.

Scenario: Your First Job

Is to build a computational approach that

Builds your panel of potentially mutated genes
Builds the patient manifest

Eventually, we’re going to build a tool that Identifies patient DNA with Lung Cancer-causing mutations

Reflection

The file you worked with in 01a, talking.txt, is in FASTA format.

head -4 ../../data/talking.txt

> Roman
'Doing a PhD is fun'
> Vikas
'Its free for you, get paid to do science'

While this file has .txt as its extension, FASTA files usually have .fa or .fasta – we’ll use .fa

What are some things we could do with/to FASTA files using the command line?

FASTA files contain sequence `data`

Your team is going to provide with a couple different types of data
This data is reference data
- the sequences and locations we expect and understand the genes of interest to have

Prediction:

Given your experience with the FASTA format, what do you expect the head and tail commands to print if you run them on a larger .fa file?

What data do we need to identify mutations?

Mutations are changes to the sequence of a gene, at least compared to what its supposed to be (lots to say on ‘supposed’ sequences…)
But in order to find the right mutations, we need to know where the mutation happens in the gene, and where the gene is in the genome

Looking at data from the Laboratory

In this scenario, the Lab has sent you a .fa file that contains a number of gene headers and their corresponding sequence.

gene_panel_database.fa has headers with 5 pieces of information
- Gene Name, chromosome, start position, stop position, strand
- The , used here are called delimeters , we’ll use them to separate data into tables

Building the diagnostic panel: `gene_panel_database.fa`

Your Lab team has uploaded the gene_panel_database.fa file to /media/fileshare/

Copy gene_panel_database.fa to your BSCC_2023_dir/data directory (using the command line)

take a few moments to explore its contents, are there any problems with the data that we need to address?

Job #1: Extract the gene panel

gene_panel_database.fa contains genes you do not want to check for mutations.

Extract the headers for KRAS, EGFR, and TP53
print the headers to a file final_gene_panel.csv
these are the ‘columns’ of our data
csv – ‘comma-separated values’

Job #1 output

>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,-

3 genes means we do everything 3x….but not by hand

You’re trying to avoid running the same code over and over to extract the information you need from gene_panel_database.fa

To do this, we can use for loops to ‘iterate’ over our genes

for gene in KRAS EGFR TP53;do echo $gene; done

KRAS
EGFR
TP53

Iteration: For Loops and Variables

The for loop command structure is as follows

for gene in KRAS EGFR TP53;do echo $gene; done

for variable name
- in list of variable values ;
  - do tool/command $variable name ;
done

in Bash, placing $ in front of a word tells the computer its a variable

Variables in For Loops

Variables ($variable_name) allow us to apply the same operation to multiple inputs

In this examples, $gene is the variable that we use to represent KRAS, EGFR, and finally TP53 in the for loop

Try:

for gene in KRAS EGFR TP53;do echo gene; done

for genetic_element in KRAS EGFR TP53;do echo $gene; done

for gene in KRAS EGFR TP53; echo $gene; done

for gene in KRAS EGFR TP53;do echo $gene

Variables in For Loops

Output:

for gene in KRAS EGFR TP53;do echo gene; done

gene
gene
gene

for genetic_element in KRAS EGFR TP53;do echo $gene; done

for gene in KRAS EGFR TP53; echo $gene; done

bash: -c: line 0: syntax error near unexpected token `echo'
bash: -c: line 0: `for gene in KRAS EGFR TP53; echo $gene; done'

for gene in KRAS EGFR TP53;do echo $gene

bash: -c: line 1: syntax error: unexpected end of file

How do we fix each of these?

Variables in For Loops

Fixed:

for gene in KRAS EGFR TP53;do echo $gene; done

KRAS
EGFR
TP53

for genetic_element in KRAS EGFR TP53;do echo $genetic_element; done

KRAS
EGFR
TP53

for gene in KRAS EGFR TP53;do echo $gene; done

KRAS
EGFR
TP53

for gene in KRAS EGFR TP53;do echo $gene; done

KRAS
EGFR
TP53

Job #1.5: Do it with iteration!

gene_panel_database.fa contains genes you do not want to check for mutations.

Iteratively Extract the headers for KRAS, EGFR, and TP53
print the headers to a file final_gene_panel.csv
these are the ‘columns’ of our data
csv – ‘comma-separated values’

Job #1.5 output

accomplished in one for loop

>KRAS,chr12,25215441,25245384,-
>EGFR,chr7,55019278,55170544,+
>TP53,chr17,7661939,7676594,-

A tool to work with columns: `cut`

Your final_gene_panel.csv should look something like with rows on each line and columns separated by ,:

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-

Which is fine, but still has the > from the FASTA format – not something we need anymore

# cut -d [delimeter] -f [fields/columns] [input_file]
cut -d '>' -f2 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csv

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-

`cut`, explained

cut can separate files based on delimeters that represent columns. It can also return only columns you want:

# cut -d [delimeter] -f [fields/columns] [input_file]
cut -d'>' -f2 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csv

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-

In this case, we want to cut the file by > and keep only the right (second) slice of data: -f2

Practice 05:

Run:

cut -d '>' -f1 ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/final_gene_panel.csv

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-

why doesn’t this return anything?

replace the delimeter with ,
1. what does setting -f2 return
2. -f1,4,5 ?

Practice 05 output:

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-

chromosome
chr12
chr7
chr17

gene,stop,strand
KRAS,25245384,-
EGFR,55170544,+
TP53,7676594,-

One product, two steps

What we’ve just seen with grep and cut has split the one thing we want to do (making a csv) into two steps:

grep and print to the csv
cut the file to remove the > leftover

instead, we can combine these commands with the pipe operator: |

grep 'KRAS' ../..//slides/01_command_line_and_bash/BSCC_2023_dir/data/gene_panel_database.fa | cut -d'>' -f2

KRAS,chr12,25215441,25245384,-

Complex commands: `Pipes`

Imagine a literal piece of plumbing connecting the output of one command to the input of another. This is the | operator’s function

| takes the output from the left side and uses it as input on the right:

# cmd 1 | cmd 2 [input is derived from | ]
echo 'gene,chromosome,start,stop,strand' | cut -d, -f1,3,5

gene,start,strand

# cmd 1 | cmd 2 [input derived from | ] | cmd 3 [input derived from | ]
pwd
pwd | cut -d/ -f8 | cut -d_ -f5

/Users/vikas/Documents/UCSC/teaching/ucsc_scbc_2022/slides/01_command_line_and_bash
slides

Job #1.75: Putting it all together

gene_panel_database.fa contains genes you do not want to check for mutations.

Iteratively Extract the headers for KRAS, EGFR, and TP53 and trim off the carrot >
append the headers to a file final_gene_panel.csv with the first line:
- gene,chromosome,start,stop,strand
these are the ‘columns’ of our data
csv – ‘comma-separated values’

Job #1.75 output:

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-

Reflection:

We just did a lot, what stands out to you as the most generally useful thing?

What is the most frustrating/slow/annoying aspect of how we’re doing things now?

Prediction:

How can we use for , pipe and $variables to make our Bioinformatics job smoother?

Summary Job #1

for loops enable iteration over multiple values represented by a variable
- uses in , do , and done
cut separated files based on delimeters (-d ) and allows you to select fields/columns (-f) to return
| pipe operator the end-to-end connection of commands by supplying the output from the left side of | as input to the right side.
- cmd1 | cmd2 {output from 1} | cmd3 {output from 2} …

Back to work

Your for loop works well and you could see your Bioinformatics job getting a lot easier if you could just save this kind of stuff and use it on the other noisy data the Lab sends you….

Before that, let’s work with the patient data the Lab has also sent over

Job #2: Figuring out who the patients are

Each patient comes into the Lab and fills out a card:

patient_id,gene,age,sex
1,KRAS,45,M

These are each turned into a file and uploaded to the database as a compressed archive

Moving directories as files: `tar/gzip`

You might have seen these before: directories collapsed into a single, movable, compressed file:

Laboratory patient data:
- link (right click and copy)
We won’t go into detail here, safe to use the following command to unzip the archive into the original multiple-file format

# located in /media/fileshare/patient_database.tar.gz
tar -zxvf patient_database.tar.gz

Working with multiple files in directory: `wildcards`

* is called ‘grob’ and it matches or completes to anything

Matching everything that begins with 4 or 12:

cat ../../data/patient_database//4*

patient_id,gene,age,sex
4,KRAS,74,F

ls ../../data/patient_database//12*

../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv

Grob continued `*`

Matching everything with healthy in it

ls ../../data/patient_database//*healthy*

../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//12_TP53healthy_card.csv

Matching everything that ends in .csv

# and truncating a little since there are alot
ls ../../data/patient_database//*.csv | head

../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv

Practice 06: Grob

Print all the file names that contain mut

output

../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

More limited matching: number of characters `???`

? is used when the number unknown of characters is known

Only double digit patient ID numbers

ls ../../data/patient_database//??_*.csv

../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv

More limited matching: number of characters `???`

Only single digit patient ID numbers

ls ../../data/patient_database//?_*.csv

../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

Practice 07: `?`

Print all file names that contain mut without explicitly including mut in the command

output

../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv
../../data/patient_database//t_EGFRmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

More specific matches: ranges `[]`

Ranges of numbers:

ls ../../data/patient_database//[2-4]_*.csv

../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv

List of characters:

ls ../../data/patient_database//*_[T,K]*.csv

../../data/patient_database//10_TP53healthy_card.csv
../../data/patient_database//10_TP53mut_card.csv
../../data/patient_database//11_TP53healthy_card.csv
../../data/patient_database//11_TP53mut_card.csv
../../data/patient_database//12_TP53healthy_card.csv
../../data/patient_database//12_TP53mut_card.csv
../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//t_KRASmut_card.csv
../../data/patient_database//t_TP53mut_card.csv

Practice 08: `[]`

Print all file names that start with digits 1 - 9

output

../../data/patient_database//1_KRASmut_card.csv
../../data/patient_database//2_KRASmut_card.csv
../../data/patient_database//3_KRASmut_card.csv
../../data/patient_database//4_KRASmut_card.csv
../../data/patient_database//5_EGFRmut_card.csv
../../data/patient_database//6_EGFRmut_card.csv
../../data/patient_database//7_EGFRmut_card.csv
../../data/patient_database//8_EGFRmut_card.csv
../../data/patient_database//9_EGFRmut_card.csv

Back to work…Job #2

Alright, now we have the tools to do a bit more work with files and directories

take the files you captured Practice 08 and combine them into a single patient_data.csv with column names.

Job #2 Output

on my mac (zsh), [1-9] include 10,11,12, this doesn’t seem to generalize

patient_id,gene,age,sex
10,TP53,56,M
11,TP53,56,M
12,TP53,56,M
1,KRAS,56,M
2,KRAS,74,F
3,KRAS,74,F
4,KRAS,74,F
5,EGFR,63,F
6,EGFR,63,F
7,EGFR,63,F
8,EGFR,63,F
9,EGFR,63,F

Reflection:

We can operate on contents files : rows, columns
We can operate on the contents of directories : files themselves
- this is greatly facilitated by wildcards
One thing is missing from making this whole approach fast, reproducible, and easy – what do you think it is?

01b: From Command Line to Bash

Scenario: Mutations

Scenario: Your First Job

Reflection

FASTA files contain sequence data

Prediction:

What data do we need to identify mutations?

Looking at data from the Laboratory

Building the diagnostic panel: gene_panel_database.fa

Job #1: Extract the gene panel

Job #1 output

3 genes means we do everything 3x….but not by hand

Iteration: For Loops and Variables

Variables in For Loops

Variables in For Loops

Variables in For Loops

Job #1.5: Do it with iteration!

Job #1.5 output

A tool to work with columns: cut

cut, explained

Practice 05:

Practice 05 output:

One product, two steps

Complex commands: Pipes

Job #1.75: Putting it all together

Job #1.75 output:

Reflection:

Prediction:

Summary Job #1

Back to work

Job #2: Figuring out who the patients are

Moving directories as files: tar/gzip

Working with multiple files in directory: wildcards

Grob continued *

Practice 06: Grob

output

More limited matching: number of characters ???

More limited matching: number of characters ???

Practice 07: ?

output

More specific matches: ranges []

Practice 08: []

output

Back to work…Job #2

Job #2 Output

Reflection:

FASTA files contain sequence `data`

Building the diagnostic panel: `gene_panel_database.fa`

A tool to work with columns: `cut`

`cut`, explained

Complex commands: `Pipes`

Moving directories as files: `tar/gzip`

Working with multiple files in directory: `wildcards`

Grob continued `*`

More limited matching: number of characters `???`

More limited matching: number of characters `???`

Practice 07: `?`

More specific matches: ranges `[]`

Practice 08: `[]`