01c: Scripting for Automation

Roman E. Reggiardo, Vikas Peddu

18 July, 2023

Scenario: Automating molecular diagnosis

Yep, you’re still a Bioinformatician working for a molecular diagnostics lab

  • The only thing is, you need to start thinking about doing these kinds of analyses again and again and again and again ……

  • The lab says they might send hundreds, if not thousands, of samples through your approach

Prediction:

What’s a script?

Job #3: Automate your approaches in a Bash script

But first…..what’s a script?

Kinda like a tool but you make it yourself

Really, it’s any assembly of code that accomplishes one or more tasks.

  • When you combine a bunch of tools that already exist, you might call it a pipeline

  • Similar to the pipe | we just learned about, this means outputs flowing into inputs for the next set of commands

From command-line to command…code?

A bash script is just a plain text file

  • by convention, it has the extension .sh

  • it needs to be executable (more on this in a minute)

Copy first_script.sh to BSCC_2023_dir/code/

So, what do we do with new files?

Take a look! Remember, # means the following is a comment

cat ../../code/first_script.sh
#!/usr/bin/env bash
# first_script.sh
# rreggiar@ucsc.edu
# 2022-07-18

script_name='first_script.sh' # variable_name = value
input_userID='1000' 

echo "The name of this script is:" $script_name 
# echo can print combinations of text and variables
echo "Your user ID is:" $input_userID
# some values, like PWD, are stored in 'global' variables
echo "Your present working directory is:" $PWD
# to execute cmdline tools, wrap them in $()
echo "The contents of $PWD are:" $(ls)

What do we do with scripts?

Execute them!

cd ../../code
./first_script.sh
bash: line 1: ./first_script.sh: Permission denied

uh oh…permission denied? this is our computer!!

  • We need to change the permissions on the file to allow execution

Quick aside: permissions for execution

Files are protected from being used incorrectly by permissions

we can view permissions with

ls -l ../../code
total 32
-rw-r--r--  1 vikas  staff   957 Jul  9 12:48 call_variant.sh
-rw-r--r--  1 vikas  staff   477 Jul  9 12:48 first_script.sh
-rwxr-xr-x  1 vikas  staff  1786 Jul  9 12:48 process_lab_data.sh
-rwxr-xr-x  1 vikas  staff   635 Jul  9 12:48 second_script.sh
-rw-r--r--  1 vikas  staff     0 Jul  9 12:48 skeleton.sh

Viewing permissions

three main types of permission are available:

  1. r - read
  2. w - write
  3. x - execute

what types of permission does first_script.sh have?

Let’s just add some execution permissions and move on…

To make a file executable

# chmod -- change file modes, +x adds exec to file
# chmod [mode change] [input file]
chmod +x ../../code/first_script.sh

now, what does this look like?

ls -l ../../code
total 32
-rw-r--r--  1 vikas  staff   957 Jul  9 12:48 call_variant.sh
-rwxr-xr-x  1 vikas  staff   477 Jul  9 12:48 first_script.sh
-rwxr-xr-x  1 vikas  staff  1786 Jul  9 12:48 process_lab_data.sh
-rwxr-xr-x  1 vikas  staff   635 Jul  9 12:48 second_script.sh
-rw-r--r--  1 vikas  staff     0 Jul  9 12:48 skeleton.sh

notice the added x’s
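
To watch the same change on a file you can safely throw away, here is a small sketch. It assumes mktemp is available and uses a temporary file (demo_file, a made-up name) as a stand-in for first_script.sh.

```shell
# Throwaway file for illustration (a stand-in for first_script.sh)
demo_file=$(mktemp)

# The first 10 characters of an ls -l line are the file type plus permissions
before=$(ls -l "$demo_file" | cut -c1-10)
chmod +x "$demo_file"
after=$(ls -l "$demo_file" | cut -c1-10)

echo "before chmod +x: $before"
echo "after chmod +x:  $after"

rm -f "$demo_file"   # clean up the throwaway file
```

The exact letters depend on your system’s defaults, but the owner’s x should appear only in the second line.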

Reflection:

How could we use permissions to control how the files we create are used?

Exploring first_script.sh and reviewing commands

Take a look at the first four lines, the shebang and boilerplate

head -4 ../../code/first_script.sh
#!/usr/bin/env bash
# first_script.sh
# rreggiar@ucsc.edu
# 2022-07-18

The shebang tells the computer that we’re using bash and where to find it in order to run the script. Try which bash on the command line.

#!/usr/bin/env bash
  • The rest is just useful information
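
To see where bash lives on your own machine, a small sketch: command -v is the shell’s built-in lookup and gives similar information to which bash. The path it prints will differ between systems, and bash_path is just a made-up variable name.

```shell
# Ask the shell where it finds bash; this is the program that
# "#!/usr/bin/env bash" hands the script to
bash_path=$(command -v bash)
echo "bash found at: $bash_path"
```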

Exploring first_script.sh: variable assignment

grep '=' ../../code/first_script.sh
script_name='first_script.sh' # variable_name = value
input_userID='1000' 

Assigning a variable is just variable=value, with no spaces around the =

Try: Run the code block above

Run echo $script_name on the command line, what do you get?
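
A small sketch of assignment to try alongside this, using made-up names (sample_id, sample_label) rather than anything from first_script.sh. Note that bash does not allow spaces around the = sign.

```shell
sample_id='patient_10'              # assign: name=value, no spaces
sample_label="sample: $sample_id"   # double quotes let $sample_id expand

echo "$sample_label"

# Writing  sample_id = 'patient_10'  (with spaces) would fail:
# bash would try to run a command called sample_id
```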

Exploring first_script.sh: operations

tail -7 ../../code/first_script.sh
# echo can print combinations of text and variables
echo "Your user ID is:" $input_userID
# some values, like PWD, are stored in 'global' variables
echo "Your present working directory is:" $PWD
# to execute cmdline tools, wrap them in $()
echo "The contents of $PWD are:" $(ls)

Three echo commands, each using either a variable or a command along with text

Prediction:

Where could echo with text and variables be useful going forward?

Exploring first_script.sh: using cmd line tools

tail -2 ../../code/first_script.sh
echo "The contents of $PWD are:" $(ls)

Sometimes we need to explicitly mark a tool for execution by wrapping it in $(); otherwise, echo would just print the text ls here
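
The difference is easy to see side by side. This sketch swaps in date +%Y (a stand-in, not part of first_script.sh) because its output is short; current_year is a made-up variable name.

```shell
# Without $(), echo just prints the words "date +%Y"
echo "The year is:" date +%Y

# With $(), bash runs date first and echo prints its result
current_year=$(date +%Y)
echo "The year is:" $current_year
```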

Practice 10:

On the command line, run each echo line from first_script.sh, what do you get?

echo "The name of this script is:" $script_name 
echo "Your user ID is:" $input_userID
echo "Your present working directory is:" $PWD
echo "The contents of $PWD are:" $(ls)

why do some work and others don’t?

Practice 10 output:

1.

The name of this script is:

2.

Your user ID is:

3.

Your present working directory is: /Users/vikas/Documents/UCSC/teaching/ucsc_scbc_2022/code

4.

The contents of /Users/vikas/Documents/UCSC/teaching/ucsc_scbc_2022/code are: call_variant.sh first_script.sh process_lab_data.sh second_script.sh skeleton.sh

Reflection:

Why do you think the script is structured like this:

  1. Shebang
  2. Variable assignment
  3. Operations

?

Prediction:

Since we can’t see the output of commands in a script like we can at the command line, how can we test our work to make sure it’s doing what we expect?

Remember…Job #3! (pt. 1)

Automate your gene and patient extraction in a Bash script

  1. create a bash script: process_lab_data.sh , open it in text editor
    1. make it executable
  2. within process_lab_data.sh
    1. write shebang/boilerplate code in the first lines

    2. make a gene_db variable that stores the path to your gene_panel_database.fa

    3. make a patient_db variable that stores the path to your patient_database/ directory

Job #3 (pt.2)

Test your code by introducing echo commands that:

  1. print the paths to your data
  2. print the contents of the patient_database directory

Output:

../../code/process_lab_data.sh: line 7: [: =: unary operator expected
initializing gene and patient databases...
gene database:  /home/jovyan/SCBC_2022_dir/data/gene_panel_database.fa 

patient database:  /home/jovyan/SCBC_2022_dir/data/patient_database 

ls: /home/jovyan/SCBC_2022_dir/data/patient_database: No such file or directory
patient database contents:  

Job #3 (pt.3)

  1. Introduce the commands that generate final_gene_panel.fa by operating on the variable you’ve set to gene_panel_database.fa

    gene,chromosome,start,stop,strand
    KRAS,chr12,25215441,25245384,-
    EGFR,chr7,55019278,55170544,+
    TP53,chr17,7661939,7676594,- 
  2. Introduce the commands that generate patient_data.csv by operating on the variable you’ve set to patient_database/ (output is just head -3)

    patient_id,gene,age,sex
    10,TP53,56,M
    11,TP53,56,M

Reflection:

Compare my first_script.sh to your process_lab_data.sh – what’s different?

Job #3 (pt.4)

  1. Add comments to your code to explain what you’re doing on each line
  2. Add echo commands to report to the user which step the code is on

Prediction:

What’s something else our tool might be able to use that command line tools already use?

Defining script variables at the cmd line

Bash scripts have special values that correspond to positions on the command line:

$0 $1 $2 $3 ….

that let us implement command line arguments

  • to see more, copy second_script.sh from /media/fileshare/
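
Before opening the file, here is a quick way to watch the positions fill in. This sketch relies on bash -c, which runs a one-line script given as text; the first word after that text becomes $0 and the rest become $1, $2, and so on. The name demo_script.sh and the values KRAS and 42 are made up for illustration.

```shell
# bash -c 'script text' name arg1 arg2
#   $0 = demo_script.sh, $1 = KRAS, $2 = 42
args_report=$(bash -c 'echo "0: $0, 1: $1, 2: $2"' demo_script.sh KRAS 42)
echo "$args_report"
# prints: 0: demo_script.sh, 1: KRAS, 2: 42
```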

Checking out second_script.sh

cd ../../code
tail -16 second_script.sh # for some reason only this many lines fit
# 2022-07-18

cmd_recieved=$0 # $0 stores the command entered to the cmd line
script_name=$(basename $0) # basename extracts the last entry in a path
input_var=${1:-10} # $1 stores the first cmd line argument
# :-VAL sets VAL to the default value of the argument
print_info=${2:-"TRUE"} # $2 stores the second cmd line argument

if [ $print_info = "TRUE" ]; then
    echo "command: " $cmd_recieved
    # echo can print combinations of text and variables
    echo "The name of this script is: " $script_name 
    # bash can do math inside $(())
    echo $input_var / 2 = $((input_var/2))
fi

Running second_script.sh

remember to make second_script.sh executable

../../code/second_script.sh
command:  ../../code/second_script.sh
The name of this script is:  second_script.sh
10 / 2 = 5
../../code/second_script.sh 640 TRUE
command:  ../../code/second_script.sh
The name of this script is:  second_script.sh
640 / 2 = 320
../../code/second_script.sh 1200 FALSE
  • what does each command line argument appear to do?

Breaking down second_script.sh

The top bit looks the same except for minor details

head -4 ../../code/second_script.sh
#!/usr/bin/env bash
# second_script.sh
# rreggiar@ucsc.edu
# 2022-07-18
  • this is why it is sometimes called “boilerplate”

Variables look similar but values are different

Bash scripts have special values that correspond to positions on the command line:

$0 $1 $2 $3 ….

head -10 ../../code/second_script.sh | tail -5
cmd_recieved=$0 # $0 stores the command entered to the cmd line
script_name=$(basename $0) # basename extracts the last entry in a path
input_var=${1:-10} # $1 stores the first cmd line argument
# :-VAL sets VAL to the default value of the argument
print_info=${2:-"TRUE"} # $2 stores the second cmd line argument

$0 is the command itself

head -10 ../../code/second_script.sh | tail -5 | head -2
cmd_recieved=$0 # $0 stores the command entered to the cmd line
script_name=$(basename $0) # basename extracts the last entry in a path

basename is a tool that takes the last entry in a path

cd ../../code
basename $PWD
code
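
basename works on any path, whether or not the file exists. As a sketch, here it is applied to the gene database path from the Job #3 output; the second form, which also strips a given suffix, is a standard basename feature that second_script.sh does not use.

```shell
# Path taken from the Job #3 (pt.2) example output
gene_db='/home/jovyan/SCBC_2022_dir/data/gene_panel_database.fa'

file_name=$(basename "$gene_db")   # last entry in the path
stem=$(basename "$gene_db" .fa)    # same, with the .fa suffix removed

echo "file: $file_name"   # prints: file: gene_panel_database.fa
echo "stem: $stem"        # prints: stem: gene_panel_database
```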

$1 is the first positional argument

head -10 ../../code/second_script.sh | tail -3 | head -2
input_var=${1:-10} # $1 stores the first cmd line argument
# :-VAL sets VAL to the default value of the argument

Here, ${1:-10} is being used to set a default value of 10 for the first positional value $1
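
A sketch of the default in action, without touching second_script.sh. It stores a one-line script in a made-up variable (halve) and runs it with bash -c, where the word demo_script.sh fills $0 and anything after it becomes $1. Note that $(( )) does integer math, so odd numbers round down.

```shell
# Single quotes keep ${1:-10} from expanding until bash -c runs it
halve='input_var=${1:-10}; echo $((input_var / 2))'

no_arg=$(bash -c "$halve" demo_script.sh)         # no $1, default 10 is used
with_arg=$(bash -c "$halve" demo_script.sh 640)   # $1 is 640
odd_arg=$(bash -c "$halve" demo_script.sh 7)      # integer division

echo "default: $no_arg"   # prints: default: 5
echo "640: $with_arg"     # prints: 640: 320
echo "7: $odd_arg"        # prints: 7: 3
```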

Practice 11:

Get second_script.sh to return 21436 as the output value

$2 is the second positional argument

head -10 ../../code/second_script.sh | tail -1
print_info=${2:-"TRUE"} # $2 stores the second cmd line argument
  • As before, we set a default value of TRUE for $2. What happens if we change it?

Another critical tool: IF statements

Much like for loops, if statements are multi-part commands that enable complex logic

  • if

    • [ condition to check for ]

      • then (equivalent to do)

        • do something
      • fi (equivalent to done)

if statements allow us to check something before running
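
A minimal sketch of that structure, using made-up variables (run_step, status). Quoting the variable inside [ ] is a useful habit: if an unquoted variable is empty, the test collapses to [ = "TRUE" ] and bash reports an error like the "unary operator expected" message in the Job #3 output.

```shell
run_step="TRUE"    # made-up variable for illustration
status="skipped"   # what we report if the condition is not met

# The quotes around $run_step keep the test valid even if it is empty
if [ "$run_step" = "TRUE" ]; then
    status="ran"
fi

echo "step status: $status"   # prints: step status: ran
```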

Practice 12:

Try if on the command line

  • change tmp_var around until you can figure out what condition we are “satisfying”
tmp_var=7
if [ $(($tmp_var%2)) = 0 ]; then echo "satisfied"; fi

Prediction:

What is missing from the current if statement that might be useful to adapt to different challenges?

Job #3 (pt.5)

  1. Add two command line arguments to process_lab_data.sh
    1. proc_genes

    2. proc_patients

  2. Use these variables to enable command-line control over whether final_gene_panel.csv and patient_data.csv are generated

Reflection:

How has writing scripts changed your view on command line and Bash programming?