01c: Scripting for Automation

Roman E. Reggiardo, Vikas Peddu

18 July, 2023

Scenario: Automating molecular diagnosis

Yep, you’re still a Bioinformatician working for a molecular diagnostics lab

The only thing is, you need to start thinking about running these kinds of analyses again and again and again and again ……
The lab says they might send hundreds, if not thousands, of samples through your approach

Prediction:

What’s a script?

Job #3: Automate your approaches in a Bash script

But first…..what’s a script?

Kinda like a tool but you make it yourself

Really, it’s any assembly of code that accomplishes one or more tasks.

When you combine a bunch of tools that already exist, you might call it a pipeline
Similar to the pipe | we just learned about, this means outputs flowing into inputs for the next set of commands
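
As a minimal sketch of outputs flowing into inputs, the lines below chain three existing tools with pipes. The gene list is made up for illustration, and the variable names genes and unique_count are hypothetical.

```bash
#!/usr/bin/env bash
# hypothetical gene list, invented for this example
genes=$(printf 'KRAS\nTP53\nKRAS\nEGFR\n')

# sort the names, collapse repeats, then count the lines that remain
unique_count=$(echo "$genes" | sort | uniq | wc -l)

echo "unique genes:" $unique_count
```

Each tool only does one small job; the pipeline is what turns them into something closer to a custom tool.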

From command-line to command…code?

A bash script lives in a plain text file

by convention, these files use the .sh extension
and need to be executable to run directly (more on this in a minute)

navigate into your `/media/fileshare/` directory

and copy first_script.sh to BSCC_2023_dir/code/

So, what do we do with new files?

Take a look! Remember, # means the following is a comment

cat ../../code/first_script.sh

#!/usr/bin/env bash
# first_script.sh
# rreggiar@ucsc.edu
# 2022-07-18

script_name='first_script.sh' # variable_name = value
input_userID='1000' 

echo "The name of this script is:" $script_name 
# echo can print combinations of text and variables
echo "Your user ID is:" $input_userID
# some values, like PWD, are stored in 'global' variables
echo "Your present working directory is:" $PWD
# to execute cmdline tools, wrap them in $()
echo "The contents of $PWD are:" $(ls)

What do we do with scripts?

Execute them!

cd ../../code
./first_script.sh

bash: line 1: ./first_script.sh: Permission denied

uh oh…permission denied? this is our computer!!

We need to change the permissions on the file to allow execution

Quick aside: permissions for execution

Files are protected from being used incorrectly by permissions

we can view permissions with

ls -l ../../code

total 32
-rw-r--r--  1 vikas  staff   957 Jul  9 12:48 call_variant.sh
-rw-r--r--  1 vikas  staff   477 Jul  9 12:48 first_script.sh
-rwxr-xr-x  1 vikas  staff  1786 Jul  9 12:48 process_lab_data.sh
-rwxr-xr-x  1 vikas  staff   635 Jul  9 12:48 second_script.sh
-rw-r--r--  1 vikas  staff     0 Jul  9 12:48 skeleton.sh

Viewing permissions

three main types of permission are available:

r - read
w - write
x - execute
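
One way to connect these letters to the ls -l output: after the leading file-type character, the nine permission characters come in three groups of rwx, for the file's owner, its group, and everyone else. The sketch below uses a throwaway file from mktemp and hypothetical variable names to ask the shell which permissions the current user has. The [ -r ], [ -w ] and [ -x ] tests and the && chaining are an assumption on my part about what is useful here; conditionals are covered later in this material.

```bash
#!/usr/bin/env bash
demo_file=$(mktemp)            # throwaway file, not part of the course data
chmod u=rw,go=r "$demo_file"   # same pattern as rw-r--r--

can_read=no; can_write=no; can_exec=no
[ -r "$demo_file" ] && can_read=yes    # -r: readable by the current user?
[ -w "$demo_file" ] && can_write=yes   # -w: writable?
[ -x "$demo_file" ] && can_exec=yes    # -x: executable?

echo "read: $can_read, write: $can_write, execute: $can_exec"
rm "$demo_file"
```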

what types of permission does first_script.sh have?

Let’s just add some execution permissions and move on…

To make a file executable

# chmod -- change file modes, +x adds exec to file
# chmod [mode change] [input file]
chmod +x ../../code/first_script.sh

now, what does this look like?

ls -l ../../code

total 32
-rw-r--r--  1 vikas  staff   957 Jul  9 12:48 call_variant.sh
-rwxr-xr-x  1 vikas  staff   477 Jul  9 12:48 first_script.sh
-rwxr-xr-x  1 vikas  staff  1786 Jul  9 12:48 process_lab_data.sh
-rwxr-xr-x  1 vikas  staff   635 Jul  9 12:48 second_script.sh
-rw-r--r--  1 vikas  staff     0 Jul  9 12:48 skeleton.sh

notice the added x’s
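
The whole cycle can be sketched end to end in a scratch directory, so nothing in your code/ directory is touched. The file name hello.sh and the variable names are hypothetical, and the <<'EOF' block (a "here document") is just one convenient way to write a small file from the command line.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)                 # scratch directory for the demo

# write a two-line script into the scratch directory
cat > "$workdir/hello.sh" <<'EOF'
#!/usr/bin/env bash
echo "hello from a script"
EOF

ls -l "$workdir/hello.sh"            # no x in the permissions yet
chmod +x "$workdir/hello.sh"         # add execute permission
ls -l "$workdir/hello.sh"            # the x's are now present

output=$("$workdir/hello.sh")        # run it and capture what it prints
echo "$output"
rm -r "$workdir"                     # clean up
```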

Reflection:

How could we use permission to modify the role and use of files we create and use?

Exploring `first_script.sh` and reviewing commands

Take a look at the first four lines, the shebang and boilerplate

head -4 ../../code/first_script.sh

#!/usr/bin/env bash
# first_script.sh
# rreggiar@ucsc.edu
# 2022-07-18

The shebang tells the computer to run the script with bash; /usr/bin/env looks bash up on your PATH. Try which bash on the command line to see where it lives

#!/usr/bin/env bash

The rest is just useful information
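
To check where bash lives on your own machine from inside a script, a small sketch: command -v is the shell's built-in way to look a command up on the PATH, similar in spirit to which. The variable name bash_path is hypothetical.

```bash
#!/usr/bin/env bash
bash_path=$(command -v bash)     # look bash up on the PATH
echo "bash found at:" "$bash_path"
```

The path printed will differ between machines, which is the reason the shebang asks env to find bash instead of hard-coding one location.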

Exploring `first_script.sh`: variable assignment

grep '=' ../../code/first_script.sh

script_name='first_script.sh' # variable_name = value
input_userID='1000'

Assigning a variable is just variable=value, with no spaces around the =

Try: Run the code block above

Run echo $script_name on the command line, what do you get?
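
A minimal sketch of the assignment rule, using made-up names and values (sample_id and n_samples are hypothetical, not part of first_script.sh):

```bash
#!/usr/bin/env bash
sample_id='patient_10'        # hypothetical value; note: no spaces around =
n_samples=3

echo "sample:" "$sample_id"
echo "number of samples:" "$n_samples"

# sample_id = 'patient_10'    # with spaces, bash would try to run a command called sample_id
```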

Exploring `first_script.sh`: operations

tail -7 ../../code/first_script.sh

# echo can print combinations of text and variables
echo "Your user ID is:" $input_userID
# some values, like PWD, are stored in 'global' variables
echo "Your present working directory is:" $PWD
# to execute cmdline tools, wrap them in $()
echo "The contents of $PWD are:" $(ls)

Three echo commands, each using either a variable or a command along with text

Prediction:

Where could echo with text and variables be useful going forward?

Exploring `first_script.sh`: using cmd line tools

tail -2 ../../code/first_script.sh

echo "The contents of $PWD are:" $(ls)

Sometimes we need to explicitly mark a tool for execution with $(); otherwise, echo would just print the text ls here
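
The difference can be sketched side by side in a scratch directory. The directory and the file names a.txt and b.txt are created only for this example, and the variable names are hypothetical.

```bash
#!/usr/bin/env bash
demo_dir=$(mktemp -d)                     # scratch directory for the demo
touch "$demo_dir/a.txt" "$demo_dir/b.txt"

# without $(): echo treats ls as plain text and prints the word itself
echo "without substitution:" ls

# with $(): ls runs first, and its output replaces the $() expression
listing=$(ls "$demo_dir")
echo "with substitution:" $listing

rm -r "$demo_dir"
```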

Practice 10:

On the command line, run each echo line from first_script.sh, what do you get?

echo "The name of this script is:" $script_name 
echo "Your user ID is:" $input_userID
echo "Your present working directory is:" $PWD
echo "The contents of $PWD are:" $(ls)

why do some work and others don’t?

Practice 10 output:

The name of this script is:

Your user ID is:

Your present working directory is: /Users/vikas/Documents/UCSC/teaching/ucsc_scbc_2022/code

The contents of /Users/vikas/Documents/UCSC/teaching/ucsc_scbc_2022/code are: call_variant.sh first_script.sh process_lab_data.sh second_script.sh skeleton.sh

Reflection:

Why do you think the script is structured like this:

Shebang
Variable assignment
Operations

Prediction:

Since we can’t see the output of commands in a script like we can at the command line, how can we test our work to make sure it’s doing what we expect?

Remember…Job #3! (pt. 1)

Automate your gene and patient extraction in a Bash script

create a bash script, process_lab_data.sh, and open it in a text editor
  1. make it executable

within process_lab_data.sh
  1. write shebang/boilerplate code in the first lines
  2. make a gene_db variable that stores the path to your gene_panel_database.fa
  3. make a patient_db variable that stores the path to your patient_database/ directory

Job #3 (pt.2)

Test your code by introducing echo commands that:

print the paths to your data
print the contents of the patient_database directory

Output:

../../code/process_lab_data.sh: line 7: [: =: unary operator expected
initializing gene and patient databases...
gene database:  /home/jovyan/SCBC_2022_dir/data/gene_panel_database.fa 

patient database:  /home/jovyan/SCBC_2022_dir/data/patient_database 

ls: /home/jovyan/SCBC_2022_dir/data/patient_database: No such file or directory
patient database contents:

Job #3 (pt.3)

Introduce the commands that generate final_gene_panel.csv by operating on the variable you’ve set to gene_panel_database.fa

gene,chromosome,start,stop,strand
KRAS,chr12,25215441,25245384,-
EGFR,chr7,55019278,55170544,+
TP53,chr17,7661939,7676594,-

Introduce the commands that generate patient_data.csv by operating on the variable you’ve set to patient_database/ (output is just head -3)
```
patient_id,gene,age,sex
10,TP53,56,M
11,TP53,56,M
```

Reflection:

Compare my first_script.sh to your process_lab_data.sh – what’s different?

Job #3 (pt.4)

Add comments to your code to explain what you’re doing on each line
Add echo commands to report to the user which step the code is on

Prediction:

What’s something else our tool might be able to use that command line tools already use?

Defining script variables at the cmd line

Bash scripts have special values that correspond to positions on the command line:

$0 $1 $2 $3 ….

that let us implement command line arguments
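
A quick way to experiment with positional values without writing a separate file is the shell builtin set --, which replaces the current positional parameters. This is only a sketch: my_script.sh and the argument values are hypothetical.

```bash
#!/usr/bin/env bash
# simulate running: ./my_script.sh sample_A 42   (my_script.sh is hypothetical)
set -- sample_A 42

first_arg=$1      # first value after the script name
second_arg=$2     # second value after the script name

echo "first argument:" "$first_arg"
echo "second argument:" "$second_arg"
```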

to see more, copy second_script.sh from /media/fileshare/

Checking out `second_script.sh`

cd ../../code
tail -16 second_script.sh # for some reason only this many lines fit

# 2022-07-18

cmd_recieved=$0 # $0 stores the command entered to the cmd line
script_name=$(basename $0) # basename extracts the last entry in a path
input_var=${1:-10} # $1 stores the first cmd line argument
# :-VAL sets VAL to the default value of the argument
print_info=${2:-"TRUE"} # $2 stores the second cmd line argument

if [ $print_info = "TRUE" ]; then
    echo "command: " $cmd_recieved
    # echo can print combinations of text and variables
    echo "The name of this script is: " $script_name 
    # bash can do math inside $(())
    echo $input_var / 2 = $((input_var/2))
fi

Running `second_script.sh`

remember to make second_script.sh executable

../../code/second_script.sh

command:  ../../code/second_script.sh
The name of this script is:  second_script.sh
10 / 2 = 5

../../code/second_script.sh 640 TRUE

command:  ../../code/second_script.sh
The name of this script is:  second_script.sh
640 / 2 = 320

../../code/second_script.sh 1200 FALSE

what does each command line argument appear to do?

Breaking down `second_script.sh`

The top bit looks the same except for minor details

head -4 ../../code/second_script.sh

#!/usr/bin/env bash
# second_script.sh
# rreggiar@ucsc.edu
# 2022-07-18

This repetition is why it’s sometimes called “boilerplate”

Variables look similar but values are different

Bash scripts have special values that correspond to positions on the command line:

$0 $1 $2 $3 ….

head -10 ../../code/second_script.sh | tail -5

cmd_recieved=$0 # $0 stores the command entered to the cmd line
script_name=$(basename $0) # basename extracts the last entry in a path
input_var=${1:-10} # $1 stores the first cmd line argument
# :-VAL sets VAL to the default value of the argument
print_info=${2:-"TRUE"} # $2 stores the second cmd line argument

`$0` is the command itself

head -10 ../../code/second_script.sh | tail -5 | head -2

cmd_recieved=$0 # $0 stores the command entered to the cmd line
script_name=$(basename $0) # basename extracts the last entry in a path

basename is a tool that takes the last entry in a path

cd ../../code
basename $PWD

code
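
basename works on any path-like text, whether or not the path exists. The sketch below uses a hypothetical path and also shows the optional second argument, which strips a matching suffix from the result.

```bash
#!/usr/bin/env bash
script_path='/home/user/project/code/second_script.sh'   # hypothetical path

file_name=$(basename "$script_path")       # keeps only the last entry
stem=$(basename "$script_path" .sh)        # same, with the .sh suffix removed

echo "file name:" "$file_name"
echo "without extension:" "$stem"
```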

`$1` is the first positional argument

head -10 ../../code/second_script.sh | tail -3 | head -2

input_var=${1:-10} # $1 stores the first cmd line argument
# :-VAL sets VAL to the default value of the argument

Here, ${1:-10} is being used to set a default value of 10 for the first positional value $1
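
The default can be watched in action with a small sketch that uses set -- to simulate a call with no arguments and then a call with one. The variable names no_arg_value and with_arg_value are hypothetical.

```bash
#!/usr/bin/env bash
set --                      # simulate a call with no arguments
input_var=${1:-10}          # $1 is unset, so the default 10 is used
no_arg_value=$input_var

set -- 640                  # simulate a call with one argument
input_var=${1:-10}          # $1 is set, so the default is ignored
with_arg_value=$input_var

echo "no argument given:" "$no_arg_value"
echo "argument given:" "$with_arg_value"
```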

Practice 11:

Get second_script.sh to return 21436 as the output value

`$2` is the second positional argument

head -10 ../../code/second_script.sh | tail -1

print_info=${2:-"TRUE"} # $2 stores the second cmd line argument

As before, we set a default value of TRUE for $2. What happens if we change it?

`$2` -> `print_info` operates in an `if` statement

tail -8 ../../code/second_script.sh

if [ $print_info = "TRUE" ]; then
    echo "command: " $cmd_recieved
    # echo can print combinations of text and variables
    echo "The name of this script is: " $script_name 
    # bash can do math inside $(())
    echo $input_var / 2 = $((input_var/2))
fi
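
One detail of the math inside $(()) is worth a sketch: it works on whole numbers only, so division drops any remainder. The values and the variable names half and doubled are made up for illustration.

```bash
#!/usr/bin/env bash
input_var=7                     # hypothetical input

half=$((input_var / 2))         # integer division: the remainder is dropped
doubled=$((input_var * 2))

echo "$input_var / 2 =" "$half"
echo "$input_var * 2 =" "$doubled"
```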

Another critical tool: `IF` statements

Much like for loops, if statements are multi-part commands that enable complex logic

if
- [ condition to check for ]
  - then (equivalent to do)
    - do something
  - fi (equivalent to done)

if statements allow us to check something before running
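
A minimal sketch of the structure above, with a made-up setting and a hypothetical status variable. One deliberate difference from second_script.sh: the variable inside [ ] is wrapped in double quotes. If an unquoted variable is empty, bash sees [ = "TRUE" ] and reports "unary operator expected"; quoting avoids that.

```bash
#!/usr/bin/env bash
print_info="TRUE"            # hypothetical setting
status="skipped"

# quoting "$print_info" keeps the test valid even if the variable is empty
if [ "$print_info" = "TRUE" ]; then
    status="printed"
fi

echo "status:" "$status"
```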

Practice 12:

Try if on the command line

change tmp_var around until you can figure out what condition we are “satisfying”

tmp_var=7
if [ $(($tmp_var%2)) = 0 ]; then echo "satisfied"; fi

Prediction:

What is missing from the current if statement that might be useful to adapt to different challenges?

Job #3 (pt.5)

Add two command line arguments to process_lab_data.sh
1. proc_genes
2. proc_patients
Use these variables to enable command-line control over whether final_gene_panel.csv and patient_data.csv are generated

Reflection:

How has writing scripts changed your view on command line and Bash programming?
