Welcome to the BRITE-REU Programming Workshop!

Instructions

During the ten weeks you are at BU, we will introduce some useful programming tools and skills during six weekly programming workshop sessions. These cover using Linux, the BU Shared Computing Cluster (SCC), sharing your code and work on GitHub using git, coding in R and Python, machine learning software, and the SQL database language. Each session has two parts: a roughly 1-hour offline document or video that you study before attending, and a 2-hour hands-on session during which we walk you through different commands and programs.

In order to participate, you will need to install the required tools and applications before coming to the classes. Below are the instructions to install an SSH client, git, R and RStudio, Python and Anaconda, RapidMiner, and MySQL. Please have them installed and running on your laptops so we can make the most of the sessions. These instructions have been tested, but we cannot run through every installation step on every operating system. If you encounter a problem or need more clarification, please make a note and send it to us, or write up your own modification of these instructions, and contact us so we can fix any potential problems before the corresponding session starts.

Instructors:

Workshop schedule:

  1. Week 2: Linux and Bash
  2. Week 3: Git and SCC
  3. Week 4: Python
  4. Week 6: R
  5. Week 7: Machine learning
  6. Week 8: SQL

All sessions meet Tuesday 1-3 pm in LSE 904

Linux/Unix and Bash

Windows 10 (Windows Subsystem for Linux)

  1. Control Panel -> Programs and Features and click “Turn Windows Features on or off” on the top left pane
  2. In the dialog box, check “Windows Subsystem for Linux” and click Ok
    unix_1
  3. Reboot machine
  4. Start Menu -> Microsoft store -> Search for “Linux”
  5. Click “Get the apps” under the “Linux on Windows?” banner
    unix_2
  6. You’ll see a list of every Linux distribution currently available in the Windows Store. We recommend that you select “Ubuntu” (unless you know what you’re doing)
  7. To open the Linux environment you installed, just open the Start menu and search for whatever distribution you installed. For example, if you installed Ubuntu, launch the Ubuntu shortcut
  8. After installation it will ask to create a UNIX username and password
  9. You are now ready to use the bash shell

Windows 8 and earlier (Installing Babun)

Babun is just a customized and pre-configured Cygwin. You can install Cygwin if you’re familiar with that.

  1. Just download the dist file from here, unzip it and run the install.bat script.
  2. After a few minutes Babun starts automatically.
  3. The application will be installed to the %USERPROFILE%\.babun directory.
  4. Use the /target option to install babun to a custom directory.

Mac / Linux

Your default shell should already be bash. Otherwise run: chsh -s /bin/bash
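
To check which shell you are currently running, you can print the SHELL environment variable (a quick sanity check; the exact path may differ between systems):

echo $SHELL
# on most systems this prints /bin/bash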

Git

Git command-line

Linux

If you’re on an RPM-based distribution:

$ sudo dnf install git-all

If you’re on a Debian-based distribution:

$ sudo apt install git-all
Mac

A macOS Git installer is maintained and available for download at the Git website

Windows

Install git for windows

Note

This is a project called Git for Windows, which is separate from Git itself

Git GUI (cross-platform)

Git is best used from the command line. Although git GUIs are not as powerful as the command line, it is still nice to be able to visualize your commit history. Git for Windows comes with a GUI, but I recommend trying GitKraken - it’s cleaner and comes with a light and dark UI.

gitkraken

SSH Client

In order to do anything on the Shared Computing Cluster (SCC) from your local computer, you first need to connect to the SCC using an SSH (Secure SHell) client. This task varies greatly based on your local operating system. Follow the SSH client installation instructions below depending on your operating system.

Windows

Option 1 - Mobaxterm

SSH client with x-forwarding capabilities for graphic sharing

  1. Download and install Mobaxterm from here
  2. Launch Mobaxterm and start a new session
    ssh_1
  3. Select SSH as the session type
    ssh_2
  4. Specify scc1.bu.edu as the remote host and click “OK”
    ssh_3
  5. Your connection will be saved on the left sidebar, so the next time you can start your session by clicking the “scc1.bu.edu [SSH]” link. In the terminal window you will get a prompt to enter your BU login information and password
    ssh_4
Option 2 - PuTTY
  1. Download and Install PuTTY

  2. Enter your connection settings
    • Host name: scc1.bu.edu
    • Port: 22 (leave as default)
    • Connection type: SSH (leave as default)

    ssh_5

  3. Click Open to start SSH session

  4. If this is your first time connecting to the server from this computer, you will see the following output. Accept the connection by clicking Yes

    ssh_6

  5. Once the SSH Connection is open, you should see a terminal prompt asking for your username: username@scc1.bu.edu and then your password

Mac and Linux

  1. Use built-in terminal for both
  2. Mac option: For X11-forwarding download XQuartz
  3. To sign in to the SCC from Mac or Linux, open a terminal
  4. Use ssh in the terminal to connect to the SCC with your login credentials: your_local_machine% ssh username@scc1.bu.edu
  5. Enter BU kerberos password when prompted
  6. Type exit to close session

Python

Anaconda

To install Python, it is recommended to use the Anaconda distribution. Anaconda is a cross-platform Python distribution that packages useful tools for scientific programming in Python, such as IDEs/text editors (Spyder/VSCode), package management tools (pip/conda), interactive notebooks (Jupyter), and other useful tools. To install Anaconda, use the following steps:

  1. Go to https://www.anaconda.com/download/
  2. It’s 2018, so make sure to download the Python 3.6 version. Python 2 is rapidly being dropped from many important libraries, so Python 3 is preferred.
  3. During installation on Windows, you may be asked if you would like to add Anaconda to your PATH. This will make Anaconda packages/Python available across your computer, so it’s up to you whether this is something you want. Installation on macOS/Linux should be straightforward.
  4. Once installation is successful, you will have access to all the tools we need. To ensure everything installed properly, look for Anaconda Navigator in your applications. Launch the application; you should see a window that looks like this:
    python
  5. If the button under Jupyter Notebook reads “Install” please click it to ensure Jupyter Notebooks are installed.
  6. That’s it! You’re done!

R and RStudio

R is a programming language which is commonly used in bioinformatics and statistics.

Mac

  1. Install R:
    1. Open an internet browser and go to www.r-project.org.
    2. Click the “download R” link in the middle of the page under “Getting Started.”
    3. Select a CRAN location (a mirror site) and click the corresponding link.
    4. Click on the “Download R for (Mac) OS X” link at the top of the page.
    5. Click on the file containing the latest version of R under “Files.”
    6. Save the .pkg file, double-click it to open, and follow the installation instructions.
    7. Now that R is installed, you need to download and install RStudio.
  2. To install R Studio:
    1. Go to www.rstudio.com and click on the “Download RStudio” button.
    2. Click on “Download RStudio Desktop.”
    3. Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.
  3. To Install the SDSFoundations Package
    1. Download SDSFoundations to your desktop (make sure it has the .tgz extension).
    2. Open RStudio.
    3. Click on the Packages tab in the bottom right window.
    4. Click “Install.”
    5. Select install from “Package Archive File.”
    6. Select the SDSFoundations package file from your desktop.
    7. Click install. You are done! You can now delete the SDSpackage file from your desktop.

Windows Users

  1. To Install R:
    1. Open an internet browser and go to www.r-project.org.
    2. Click the “download R” link in the middle of the page under “Getting Started.”
    3. Select a CRAN location (a mirror site) and click the corresponding link.
    4. Click on the “Download R for Windows” link at the top of the page.
    5. Click on the “install R for the first time” link at the top of the page.
    6. Click “Download R for Windows” and save the executable file somewhere on your computer. Run the .exe file and follow the installation instructions.
    7. Now that R is installed, you need to download and install RStudio.
  2. To Install RStudio
    1. Go to www.rstudio.com and click on the “Download RStudio” button.
    2. Click on “Download RStudio Desktop.”
    3. Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions.
  3. To Install the SDSFoundations Package
    1. Download SDSFoundations to your desktop (make sure it has the .zip extension).
    2. Open RStudio.
    3. Click on the Packages tab in the bottom right window.
    4. Click “Install.”
    5. Select install from “Package Archive File.”
    6. Select the SDSFoundations package file from your desktop.
    7. Click install. You are done! You can now delete the SDSpackage file from your desktop.

Machine Learning

RapidMiner Studio

For the machine learning session we will use R and RapidMiner Studio. You can download RapidMiner Studio 8.2. For this course we will not use a license, but you can register to get feedback from other users. You can find installation guides here

SQL

SQLite

Install SQLite

(Note: the SQLite program is called sqlite3)

Windows:

See this very useful video to install SQLite https://www.youtube.com/watch?v=zOJWL3oXDO8

Mac and Linux:

Should already be preinstalled. To check, open a terminal window and type “sqlite3”. To quit the program after you’ve started it, type “.quit”.

GUI:

There are two reasonably good GUIs for using SQLite. I haven’t used either extensively, so can’t make a strong recommendation, but I preferred SQLite Studio. These are not required for the workshop, but may be beneficial if you use the program after this summer.

Test:

Follow these steps:

  • In all three major operating systems, open a command terminal, create a directory called “sqlitedb”, and move to that directory

  • Create a new database file called test.db by typing: “sqlite3 test.db”

  • You should see something like this. If so, it’s working.

    SQLite version 3.13.0 2016-05-18 10:57:30
    Enter ".help" for usage hints.
    sqlite>
    
  • Exit the program by typing “.quit”
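
If you want to go a step further, you can run a couple of SQL statements against the test database directly from the shell to confirm everything works. This is only a sketch; the table and column names are made up for illustration:

sqlite3 test.db "CREATE TABLE greetings (id INTEGER PRIMARY KEY, message TEXT);"
sqlite3 test.db "INSERT INTO greetings (message) VALUES ('hello world');"
sqlite3 test.db "SELECT * FROM greetings;"
# expected output: 1|hello world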

Tutorial and Documentation:

This is a good introduction: https://www.tutorialspoint.com/sqlite/index.htm

This is the official SQLite Documentation: https://sqlite.org

MySQL – We are not using MySQL this year!

For Mac, use the DMG archive. https://dev.mysql.com/downloads/mysql/5.7.html#downloads

For Windows: https://dev.mysql.com/downloads/mysql/5.7.html#downloads Choose (mysql-installer-web-community-8.0.11.0.msi)

For linux: https://dev.mysql.com/doc/refman/5.7/en/linux-installation.html

Install a mysql database interface

This gives a GUI interface to databases and their contents, as well as a window to write SQL commands.

On Mac use Sequel Pro: https://www.sequelpro.com/

On linux and Windows, use phpmyadmin: https://www.phpmyadmin.net/

See this wiki page for installing on Windows: https://www.wikihow.com/Install-phpMyAdmin-on-Your-Windows-PC

Workshop 1: Linux and Bash

UNIX and Linux

UNIX is an operating system which was first developed in the 1960s, and has been under constant development ever since. It is a stable, multi-user, multi-tasking system for servers, desktops and laptops. There are many different versions of UNIX, although they share common features. The most popular varieties of UNIX are Sun Solaris, GNU/Linux, and macOS.

The UNIX operating system is made up of three parts: the kernel, the shell and the programs.

  1. The kernel: The kernel of UNIX is the hub of the operating system: it allocates time and memory to programs and handles the filestore and communications in response to system calls.
  2. The shell: The shell acts as an interface between the user and the kernel. The shell is a command line interpreter (CLI). It interprets the commands the user types in and arranges for them to be carried out. The commands are themselves programs.
  3. The programs: Programs are instructions that tell the computer what to do.

Everything in UNIX is either a file or a process. A process is an executing program identified by a unique PID (process identifier). A file is a collection of data; files are created by users with text editors, by running compilers, etc.

Introduction to UNIX and Linux. Runtime ~ 5min

In a GNU/Linux system, Linux is the kernel component. The rest of the system consists of other programs, many of which were written by or for the GNU Project. Because the Linux kernel alone does not form a working operating system, the term “GNU/Linux” is more appropriate for systems that many people casually refer to as “Linux”. The GNU Project has developed a comprehensive set of free software tools for use with Unix™ and Unix-like operating systems such as Linux. These tools enable users to perform tasks ranging from the mundane (such as copying or removing files from the system) to the arcane (such as writing and compiling programs or doing sophisticated editing in a variety of document formats).

Linux is also less likely to crash, better able to run more than one program at the same time, and more secure than many operating systems. With these advantages, Linux is the fastest growing operating system in the server market. More recently, Linux has begun to be popular among home and business users as well.


Bash

Bash is a UNIX shell and command language developed by the GNU Project as a replacement for the Bourne shell. It is the default login shell for most Linux distributions and macOS. Bash is a command processor that typically runs in a text window, where the user types commands that cause actions. Bash can also read and execute commands from a file, called a shell script. Like all Unix shells, it supports filename globbing (wildcard matching), piping, here documents, command substitution, variables, and control structures for condition-testing and iteration. We will explore some of these concepts in the subsequent tutorial sections.
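
As a small taste of what is coming, the snippet below combines a few of these features (a variable, command substitution, globbing and a pipe); it assumes the current directory contains at least one .txt file:

count=$(ls *.txt | wc -l)    # glob all .txt files and pipe the listing into wc
echo "This directory contains $count text files"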

The terminal vs. the shell. Runtime ~ 1min

What is the Bash Shell. Runtime ~ 1min


Tutorials

1. CLI intro

Runtime ~ 5 min

If you were able to open a terminal, you should see something like this:

_images/terminal.png

Most often, you will see your username, your current position in the file system, the “$” or “#” symbol and then a cursor.

Given this is a “command” line, type a command and then press enter!

_images/ls.png

I gave the ls command, short for “list”. This lists all of the files in the current directory. Because of my personal settings, directories are colored blue and regular files white. If you ever play around with your terminal settings, enabling colors will prove useful. Another command is the cd command:

_images/cd.png

With cd example, I am changing my current directory to the one named example (cd stands for “change directory”). Using ls, I see that the example directory contains two files called poem.txt and prose.txt and a directory called there_is_nothing_in_here. In the command cd example, cd is the command name and example is considered its first argument.

The next command is the man command, and it stands for manual. It takes a single argument, the name of a command:

_images/man.png _images/man_ls.png

You can press the down arrow, the space bar or Page Down to read down the manual, and the up arrow or Page Up to scroll back up. When you are done reading the manual, just press q and you’ll be brought back to the command line. Most manuals have several sections:

Name - Gives the name of the command

Synopsis - The usage of the command is written in a short hand

Description - Gives a description of the purpose of the command

Options - Optional flags that the command uses

Examples - Example commands are given

Author - The people who wrote the command’s source code

Bugs - Known bugs/where to report bugs

Copyright - Who actually owns the source code

If we go down far enough on the ls manual, we’ll see that the -l flag gives the output in long list format. Here’s how you use flags:

_images/man_ls-l.png

This lets you see a whole bunch of information about the files in the directory. Flags are just arguments and are often separated from other arguments by whitespace, such as a space. However, flags can be combined into a single argument as such:

_images/man_ls-la.png

The -a flag lets you see hidden files and directories. Files are hidden when the first character of their name is a dot (.).
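
To try the flags described above yourself, the commands are:

ls -l     # long listing: permissions, owner, size, modification time
ls -a     # also show hidden files (names starting with a dot)
ls -la    # both flags combined into a single argument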

Also, pressing Tab will auto-complete what you’ve typed if it is unique and in the right spot. For example, in the example directory, typing cd and hitting Tab is enough to autocomplete there_is_nothing_in_here, as there is no other directory in the current working directory to cd into. This works with commands as well, but commands tend to be short.

Woopdie doo, you can see files in a directory, go to nearby directories and look at manuals. You can do that with the file explorer GUI too. However, you’ll come to see that the terminal can do a lot of the things that the GUI cannot do. If you are still having trouble, there are resources online that can help bolster your knowledge, such as this tutorial series: https://youtu.be/MmHcOPJEjGA .

3. Grep/Awk/Sed


Grep

Grep (an acronym for “Global Regular Expression Print”) finds a string in a given file or input.

Grep format:

grep [options] [regexp] [filename]

Grep usecases:

  1. Case-insensitive search (grep -i):
grep -i 'mary' mary-lamb.txt
  2. Whole-word search (grep -w):
grep -w 'as' mary-lamb.txt
  3. Recursively search through sub-folders (grep -r <pattern> <path>):
grep -r '456' /<your_working_directory>/
  4. Inverted search (grep -v):
grep -v 'the' mary-lamb.txt
  5. Print additional (trailing) context lines after the match (grep -A <NUM>):
grep -A1 'School' mary-lamb.txt
  6. Print additional (leading) context lines before the match (grep -B <NUM>):
grep -B2 'School' mary-lamb.txt
  7. Print additional (leading and trailing) context lines before and after the match (grep -C <NUM>):
grep -C3 'School' mary-lamb.txt
  8. Print the filename for each match (grep -H <pattern> filename):
grep -H 'School' mary-lamb.txt

Regexp or regular expression:

Regexp is how we specify the particular pattern (words or characters) that we want to find.

  • The period . matches any single character.
  • The asterisk * matches the previous pattern zero or more times.
grep 'M.a' mary-lamb.txt
grep 'M*y' Mary_Lamb_lyrics.txt

AWK:

awk [options] [filename]

Named after the authors: Aho, Weinberger, Kernighan

  • Print everything in the text file:
awk '{print}' BRITE_students.txt

  • Now, let’s get more specific. Let’s ask for names only:
awk '{print $1}' BRITE_students.txt

  • What if we want to see two columns at the same time, let’s say name and GPA?
awk '{print $1" "$3}' BRITE_students.txt

  • Now let’s see what your info is (exact match):
awk '$1=="Ali"' BRITE_students.txt

  • How can we see a particular pattern in our cohort (not an exact match):
awk '/Kat/ {print $0}' BRITE_students.txt

  • Question for you: How do you print the name and favorite sport of students whose names contain the letter “u”?
<insert code here>

  • How many students are there whose name begins with “Kat”?
awk '/Kat/{++cnt} END {print "Count = ", cnt}' BRITE_students.txt

  • You can also run loops in awk; for example, print running sums until they exceed 50:
awk 'BEGIN {
   sum = 0; for (i = 0; i < 20; ++i) {
       sum += i; if (sum > 50) exit(10); else print "Sum =", sum
   }
}'

SED:

sed [options] [filename]

SED stands for “Stream EDitor”. It is a widely used text processing Linux tool.

  • I want to read the first three lines of a text file:
cat BRITE_students.txt | sed -n '1,3p'

  • What if we want to replace one word with another:
cat mary-lamb.txt | sed 's/Mary/Maria/g'

  • Let’s remove the 1st, 2nd and 5th lines from a text file:
sed -e '1d' -e '2d' -e '5d' BRITE_students.txt

  • But what if we had a much longer list and wanted to remove more lines?
echo -e "1d\n2d\n5d" > my_lines.txt
cat my_lines.txt
sed -f my_lines.txt BRITE_students.txt

  • Now let’s print the 2nd line to last:
cat BRITE_students.txt | sed -n '2,$p'

4. File editing in the terminal

Typically, you’d want to turn on X11 forwarding or mount the server onto your machine to work in some sort of GUI or IDE, but sometimes you just want to make a quick change to a file that you have write permissions for. I will describe vim, because it is our most used terminal editor. Vim is a very popular text editor these days, and its predecessor vi has been around since the 1970s.

You activate vim by typing vim followed by a file name. If the file doesn’t already exist, it will be created when you save your work.

When you first enter vim, you will be in normal mode. Here, you can go into other modes to perform commands, or you can go into editor mode by pressing i once. Once in editor mode, you can type your code. You can only move the cursor in the four directions, no mouse. However, if you go back into normal mode, you can do some navigating tricks:

  • ^ - This brings you to the beginning of a line
  • $ - This brings you to the end of a line
  • G - This brings you to the end of the file
  • gg- This brings you to the beginning of the file
  • 0 - Often synonymous with ^

  • / - the find function. After typing a forward slash, you may write anything. Upon pressing Enter, vim will search the document and bring your cursor to the first instance of that string.

You can also do some editing tricks with vim:

  • [0-9]* dd - type a number and then dd. This will delete that many lines below you.
  • D - delete until the end of a line
  • u - undo the last action
  • [0-9]* y - “yank.” It means to copy. Regular select control-c works.
  • p - Paste. Regular control-p works too, not in conjunction with yank, though.

If you ever type control-s, vim will keep recording your actions but not display them, appearing to be stuck. Press control-q to get out of that jam.

To leave vim, go into normal mode from editor mode by pressing escape. Then press :. You can follow this with:

  • q - quit. No changes made.
  • q! - quit. Discard changes made.
  • w - save. returns you to normal mode afterwards.
  • wq - save and return to CLI.

Other text editors exist, such as emacs and nano. Find the one that works best for you and learn all of their tricks!

5. Piping and Redirection

I/O redirection and concepts covered in the video:

  1. Standard output (stdout), standard input (stdin), standard error (stderr)
  2. >, >> - redirect stdout and replace/append to file
  3. 2>, 2>> - redirect stderr and replace/append to file
  4. &>, &>> - redirect both stdout and stderr and replace/append to file
  5. < - stdin redirect from file
  6. | - pipe stdout from one program to another
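
Here are a few minimal examples of these operators; output.txt, errors.txt, all.txt and input.txt are placeholder file names:

ls > output.txt                  # write stdout to output.txt, overwriting it
ls >> output.txt                 # append stdout to output.txt
ls missing_file 2> errors.txt    # write stderr to errors.txt
ls missing_file &> all.txt       # write both stdout and stderr to all.txt
sort < input.txt                 # read stdin from input.txt
ls | wc -l                       # pipe the stdout of ls into wc to count entries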

Piping and redirection. Runtime ~ 18min

6. Processes

Concepts covered in the video:

  1. PID - Process identifier
  2. init - Initial process that starts all the other processes
  3. Parent PID
  4. UID - User identifier
  5. Process priority
  6. ps - print out information on running processes
  7. TTY - terminal process is associated with
  8. fg - brings background job to foreground
  9. kill - Terminate a process
  10. top - Browse processes

Introduction to processes and ps. Runtime ~ 7min

Jobs and Killing Processes. Runtime ~ 8min

7. Bash scripting

A bash script is a file containing commands that can run on the bash shell. They usually have the .sh extension.

A minimal example

#!/usr/bin/env bash
# A simple bash script

echo "Hello World"

Note

#! is called the “shebang”. It indicates the path to the program/interpreter that would be used to execute the script.


Execution

$ ./myscript.sh
bash: ./myscript.sh: Permission denied
$ ls -l myscript.sh
-rw-r--r-- 18 dkishore users 4096 Jun 10 09:12 myscript.sh
  • The permissions on the script need to be changed to allow for execution - chmod +x myscript.sh
  • The file can be executed either as ./myscript.sh or bash myscript.sh

Variables

#!/bin/bash
cp $1 $2
# Verification
echo Details for $2
ls -lh $2

$1 and $2 are the first and second arguments to the script.

Note

  • $0 refers to the name of the bash script
  • $# refers to the number of arguments passed to the script
  • $@ refers to all the arguments supplied
  • $? refers to the exit status of the most recent process
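
As a small sketch, a script like the following (the file name args_demo.sh is arbitrary) prints these special variables:

#!/bin/bash
# args_demo.sh - print the special argument variables
echo "Script name: $0"
echo "Number of arguments: $#"
echo "All arguments: $@"
ls /etc > /dev/null
echo "Exit status of the last command: $?"

Running bash args_demo.sh one two will report 2 arguments.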

Setting your own variables:

#!/bin/bash
myvariable=Hello
anothervar=Fred
echo $myvariable $anothervar
sampledir=/etc
echo $sampledir

Note

  • Use quotes if your value has a space. Eg: myvar="Hello World!"
  • When referring to or reading a variable we place a $ sign before the variable name
  • When setting a variable we leave out the $ sign
  • Do not use white-space around the =
    • var=23, that’s the correct variable assignment syntax: a word that consists of unquoted letters followed by an unquoted = that appears before a command argument (here it’s on its own)
    • var =23, the var command with =23 as argument (except in zsh, where =something is a special operator that expands to the path of the something command; here, you’d likely get an error as 23 is unlikely to be a valid command name).
    • var= 23, an assignment var= followed by a command name 23. That’s meant to execute 23 with var= passed to its environment (var environment variable with an empty value).
    • var = 23, var command with = and 23 as argument. Try with echo = 23 for instance.

Command Substitution

myvar=$( ls /etc | wc -l )
echo $myvar

Command substitution is nice and simple if the output of the command is a single word or line. If the output goes over several lines, the newlines are collapsed when the variable is used unquoted and all the output ends up on a single line.
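
For example (assuming /etc exists on your system):

myvar=$( ls /etc )
echo $myvar      # unquoted: newlines collapsed, everything on one line
echo "$myvar"    # quoted: newlines preserved, one entry per line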


Exporting variables

Scripts run in their own process, hence you cannot use a variable assigned outside of the script inside the script. To use external variables, they need to be exported.

script1.sh

#!/bin/bash
# demonstrate variable scope 1.
var1=blah
var2=foo
# Let's verify their current value
echo $0 :: var1 : $var1, var2 : $var2
export var1
./script2.sh
# Let's see what they are now
echo $0 :: var1 : $var1, var2 : $var2

script2.sh

#!/bin/bash
# demonstrate variable scope 2
# Let's verify their current value
echo $0 :: var1 : $var1, var2 : $var2
# Let's change their values
var1=flop
var2=bleh

result

./script1.sh
script1.sh :: var1 : blah, var2 : foo
script2.sh :: var1 : blah, var2 :
script1.sh :: var1 : blah, var2 : foo

Input

#!/bin/bash
# Ask the user for their name
echo Hello, who am I talking to?
read varname
echo It\'s nice to meet you $varname

The read command saves the user’s response into the variable varname.


Arithmetic

  1. let
#!/bin/bash
# Basic arithmetic using let
let a=5+4
echo $a # 9
let "a = 5 + 4"
echo $a # 9
let a++
echo $a # 10
let "a = 4 * 5"
echo $a # 20
let "a = $1 + 30"
echo $a # 30 + first command line argument
  1. expr
#!/bin/bash
# Basic arithmetic using expr
expr 5 + 4 # 9
expr "5 + 4" # 5 + 4
expr 5+4 # 5+4
expr 5 \* $1
expr 11 % 2
a=$( expr 10 - 3 )
echo $a # 7
  1. double parentheses
#!/bin/bash
# Basic arithmetic using double parentheses
a=$(( 4 + 5 ))
echo $a # 9
a=$((3+5))
echo $a # 8
b=$(( a + 3 ))
echo $b # 11
b=$(( $a + 4 ))
echo $b # 12
(( b++ ))
echo $b # 13
(( b += 3 ))
echo $b # 16
a=$(( 4 * 5 ))
echo $a # 20

If statements

  1. If statements
#!/bin/bash
# Basic if statement
if [ $1 -gt 100 ]
then
    echo "Hey that\'s a large number."
    pwd
fi
  • The square brackets in the if statement are a reference to the test command.
  • -gt means “greater than”; similarly there are -ge, -lt, -le, -eq, -ne, and string tests such as !, -n, -z, =, != and many more.
  • Alternatively, you can use the test command directly, e.g. test 001 = 1. (This won’t print anything; you can check the exit status using $?, where 0 means TRUE and non-zero means FALSE.)
  1. If-else
#!/bin/bash
# else example
if [ $# -eq 1 ]
then
    nl $1
else
    nl /dev/stdin
fi
  1. If-else-if
#!/bin/bash
# elif statements
if [ $1 -ge 18 ]
then
    echo "You may go to the party."
elif [ $2 == 'yes' ]
then
    echo "You may go to the party but be back before midnight."
else
    echo "You may not go to the party."
fi
  1. Case statements
#!/bin/bash
# case example
case $1 in
    start)
    echo starting
    ;;
    stop)
    echo stopping
    ;;
    restart)
    echo restarting
    ;;
    *)
    echo "don\'t know"
    ;;
esac

Note

The ;; are used as break statements


Loops

  1. while loop
#!/bin/bash
# Basic while loop
counter=1
while [ $counter -le 10 ]
do
    echo $counter
    ((counter++))
done
echo "All done"
  1. until loop
#!/bin/bash
# Basic until loop
counter=1
until [ $counter -gt 10 ]
do
    echo $counter
    ((counter++))
done
echo "All done"

The until loop is the exact opposite of the while loop

  1. for loops
#!/bin/bash
# Basic for loop
names='Stan Kyle Cartman Kenny'  # one way to define a list
for name in $names
do
    echo $name
done
echo All done
  1. Ranges and Iterators
#!/bin/bash
# Basic range in for loop
for value in {1..5}
do
    echo $value
done
echo All done

Note

  1. You can have a custom range by providing a step, e.g. {10..0..2}
  2. GNU seq can also be used to create custom iterators, e.g. seq 10 -2 0 (see the example below)
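
For example, both of the following loops print the even numbers from 10 down to 0; the first uses a brace range with a step, the second uses seq:

#!/bin/bash
# countdown with a brace range and a step of 2
for value in {10..0..2}
do
    echo $value
done
# the same countdown using GNU seq
for value in $(seq 10 -2 0)
do
    echo $value
done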

Functions

  • Simple example
#!/bin/bash
# Basic function
print_something () {
    echo Hello I am a function
}
print_something
  • Passing arguments
#!/bin/bash
# Passing arguments to a function
print_something () {
    echo Hello $1
}
print_something Mars
print_something Jupiter
  • Return values

Bash functions don’t allow for return values; however, they do allow for a return status.

#!/bin/bash
# Setting a return status for a function
print_something () {
    echo Hello $1
    return 5
}
print_something Mars
print_something Jupiter
echo The previous function has a return value of $?
  • Variable scope
#!/bin/bash
# Experimenting with variable scope
var_change () {
    local var1='local 1'
    echo Inside function: var1 is $var1 : var2 is $var2
    var1='changed again'
    var2='2 changed again'
}
var1='global 1'
var2='global 2'
echo Before function call: var1 is $var1 : var2 is $var2
var_change
echo After function call: var1 is $var1 : var2 is $var2

result

Before function call: var1 is global 1 : var2 is global 2
Inside function: var1 is local 1 : var2 is global 2
After function call: var1 is global 1 : var2 is 2 changed again

8. Tips and tricks

Customizing your .bashrc

.bash_profile and .bashrc files. Runtime ~ 8min

Environment variables and aliases. Runtime ~ 14min

Brace expansion

mkdir {a..c}-{1..2}.txt
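
The expansion happens before mkdir runs (the line above creates six directories named a-1.txt through c-2.txt); you can preview any brace expansion with echo:

echo {a..c}-{1..2}.txt
# a-1.txt a-2.txt b-1.txt b-2.txt c-1.txt c-2.txt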

Aliasing

alias gst="git status"
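An alias defined at the prompt only lasts for the current session; to keep it, append it to your .bashrc (assumed to live at ~/.bashrc) and reload the file:

echo 'alias gst="git status"' >> ~/.bashrc
source ~/.bashrc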

Processes

# SIGKILL
kill -9 pid
# SIGTERM
kill -15 pid

Bash shortcuts. Runtime ~ 9min

History Substitutions

Bash stores your history. If you find yourself needing to reuse parts of code, as you often do, use these tricks.

A long, complicated command has a typo in the middle? Feel like you have to pull the command back down, go to the error and correct it? Well, just use this syntax:

^[mistake]^[correction]

Doing this, bash will pull the previous command and make that single substitution then run it.
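
For example, if you misspell a file name, a quick substitution reruns the corrected command:

cat mary-lamp.txt    # oops, typo in the file name
^lamp^lamb           # reruns the line as: cat mary-lamb.txt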

The ! symbol refers back to your command history. Therefore:

  • !! - run the last line again. Useful if you need to rerun things with more permissions, e.g. sudo !!
  • !$ - replaced by the last word of the last line. Useful when you mkdir a directory and then want to quickly cd into it
  • !^ - replaced by the first argument of the last line. Useful when that argument is long, complicated and not able to autocomplete
  • !-[0-9] - indicates not the last line, but the line [0-9] commands beforehand. By itself, it will run that earlier line again.
  • ![string] - run the most recent command that began with this character string. Useful when that command was many commands back, or took many parameters that would be tedious to type again or find and copy.

More found here: https://www.thegeekstuff.com/2011/08/bash-history-expansion/

Command Substitutions

If you have multiple commands on a single line, you may put backticks around the ones that you want to run first. They act like parentheses do in math.
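
For example, the inner command runs first and its output is handed to echo; the $( ) form from the Command Substitution section does the same and is easier to read:

echo "Today is `date`"
echo "Today is $(date)"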

Shell tricks. Runtime ~ 11min

Workshop 2: SCC and git

In this online workshop you will learn about the Shared Computing Cluster (SCC) that we use at BU. You will also learn version control using git.

You are expected to study the following content:

In the hands-on workshop we will work on SSH login, loading modules and job submission. Then we will clone a sample repository and work on forking, resolving conflicts and more advanced git commands.

Tutorials

1. The SCC

Shared Computing Cluster

The Shared Computing Cluster (SCC) at Boston University is a Linux cluster with over 690 nodes, 14,000 processors, 324 GPUs, and currently over 4.2 petabytes of disk storage. The SCC is located in Holyoke, MA at the Massachusetts Green High Performance Computing Center (MGHPCC), a collaboration between 5 major universities and the Commonwealth of Massachusetts.

The SCC is suitable for high-performance computing in various areas of research such as bioinformatics and is commonly used to:

  • Share and store data
  • Run code that exceeds workstation capability (RAM, Network, Disk)
  • Run code that runs for long periods of time (hours, days, weeks)
  • Run code in highly parallelized formats (use 100 machines simultaneously)
  • Access specialized software packages

SCC Architecture

The SCC uses the Linux command line environment. To use the SCC, you must log in to one of several login nodes. Everyone who has a BU ID can log in to SCC1. If you are on a project, you can log in to SCC2. Users in the Earth and Environmental departments use the GEO login node. BUMC users and anyone working with sensitive data log in to SCC4.

Note: SCC4 is only accessible through the BU network. To work remotely, you will need to use a virtual private network (VPN) to connect to the BU network.

For more information:

https://www.bu.edu/tech/support/research/system-usage/getting-started/connect-ssh/

http://www.bu.edu/tech/services/cccs/remote/vpn/use/

File Storage

1. Home Directory

This directory is entirely controlled by you. The default permissions are set so that no other user can see or access your files. Home directories have a quota of 10 GB.

2. Backed Up Project Disk Space

Research projects are by default granted 50GB of backed-up space (/project/project_name/). Files that cannot be replaced and source code should be stored in this space.

3. Not Backed Up Project Disk Space

Projects are by default granted 50 GB of space (/projectnb/project_name/). Files generally stored in this space include output files, downloaded data sets, and large quantities of data that you could recreate in the unlikely event of data loss.

4. Scratch Space

Each node has a directory called /scratch stored on a local hard drive. This can be used by batch jobs to quickly write temporary files. If you wish to keep these files, you should copy them to your own space when the job completes.

Note: Scratch files are kept for 30 days, with no guarantees.

http://www.bu.edu/tech/support/research/system-usage/running-jobs/resources-jobs/local_scratch/

Recovering lost files using snapshots

You can retrieve lost files by using snapshots. Snapshots are copies of files from home directories and Project Disk Space that are stored within the file system. This is convenient when you want to retrieve a file that was lost or accidentally deleted.

For more information:

http://www.bu.edu/tech/support/research/computing-resources/file-storage/

SSH Login

To connect to the SCC, first you will need to download an SSH Client or use a terminal application depending on your operating system. An SSH client is a software program which uses the secure shell protocol to connect to a remote computer.

Go to the Instructions section to download the appropriate software to connect to the SCC.

To login to the SCC:

Windows/MobaXterm

local_prompt% ssh tanyatk@scc1.bu.edu

Mac

local_prompt% ssh -Y tanyatk@scc1.bu.edu

Linux

local_prompt% ssh -X tanyatk@scc1.bu.edu

Once you login, you will find yourself in your home directory.

2. Running Jobs

The SCC is also a space to run code or other jobs that require resources unavailable on a local workstation, including jobs that exceed workstation capability (RAM, network, disk), run for long periods of time (hours, days, weeks), or need specialized software.

Types of Jobs

Interactive job – running an interactive shell: GUI applications, code debugging, benchmarking of serial and parallel code performance.

Interactive graphics job – for running interactive software with advanced graphics.

Batch job – execution of a program without manual intervention.

During the workshop, we will focus on batch jobs.

Submitting jobs using qsub

To submit a non-interactive job, you will use the qsub command.

scc1 % qsub [options] command [arguments]

Example: Submit code printenv using qsub to print the environment variables

scc1 % qsub -b y printenv
Your job #jobID ("printenv") has been submitted

The option -b y tells the batch system that the following command is a binary executable. The output message of the qsub command will print the job ID, which you can use to monitor the job’s status within the queue.

While the job is running, the batch system creates stdout and stderr files in the job’s working directory. These files are named after the job (i.e. printenv) followed by the job’s ID number.

For example,

printenv.o#jobID - the output of the command

printenv.e#jobID - list of errors, if any, that occurred while the job was running

Another way to submit a job using qsub is through a bash script (job_script.sh) that specifies the options, commands, and arguments required to run the job.

scc1 % qsub job_script.sh

Resources required to run a job

There are a number of directives or options that the user can pass to the batch system. These are provided as arguments to the qsub command or added as lines with symbols #$ in the job script.

Example: Qsub command arguments

scc1% qsub -l h_rt=24:00:00 -N myjob -j y printenv

Example: Job Script arguments

#!/bin/bash
#$ -l h_rt=24:00:00   # Specify the hard time limit for the job
#$ -N myjob           # Give job a name
#$ -j y               # Merge the error and output streams into a single file

printenv

The table below shows general directives that are commonly used including specifying the number of processes or project name required to run a job.

_images/Qsub_Resources.png

To request other resources besides the general directives, the scc website details available directives that can be requested. http://www.bu.edu/tech/support/research/system-usage/running-jobs/submitting-jobs/

Job status and deletion

After submitting a job, you can check the status of your job using the qstat command.

Checking status of a batch job:

scc1% qstat -u <userID>

List running jobs only:

scc1% qstat -s r -u <userID>

Checking information about a job:

scc1% qstat -j <jobID>


Display resources requested by the user’s jobs:

scc1% qstat -r -u <userID>

To retrieve information about a past job, you can use the qacct command.

Information about a specific job:

scc1% qacct -j <jobID>

Information about all jobs that were run in the past number of days:

scc1% qacct -o <userID> -d <number of days> -j

To delete a job, you can use the qdel command.

Delete all jobs of user:

scc1% qdel -u <userID>

Delete specific job:

scc1% qdel <jobID>

3. Version control with git

During your career as a researcher, you will write code and create documents over time, go back and edit them, reuse parts of them, share your code with other people, or collaborate with others to make tools and documents.

Have you ever had the problem that you lost files that weren’t saved? Or have you gone to a conference or interview and met someone interested in your work and realized you don’t have the files on your laptop? On a gloomy day, have you changed some part of your code when suddenly everything broke and you wished you could just go back to the previous working version, but alas there is no backup and you have tens of folders with misleading names?

Or are you familiar with the scenario in which you are working with a group, writing a function, and then notice another person simultaneously making changes to the same file, and you don’t know how to merge the changes? Or someone makes changes to your working version and now when you run it, everything crashes? Have you experienced these or a million other situations where you felt frustrated and stressed, spent hours trying to fix things, and wished there was a time machine to go back in time? The time machine has been invented: version control.

What is version control?

  • You can save all your code, data, and documents on the cloud, as you develop a project.
  • You can manage the versions throughout time and see which changes were made at which time, and by whom.
  • You can find other projects, import their scripts and modify them to reuse them for your purpose.
  • You can share your code online: it’s good for science and it’s good for your resume.
  • If you are a PhD student, you can start saving your files early on, and by the time you finish, you will have all your analyses documented and easily accessible, which will help a lot when you’re writing your thesis.

How does all the magic happen? Usually using git. So what is git? git is an open source tool with functionality to make repositories, download them, and fetch and push updates. It allows teams to work on the same project, manage conflicts, monitor changes and track issues.

Version control platforms

The most widely used version control platforms supporting git are Bitbucket and GitHub.

  • Repositories on Bitbucket are by default private and only viewable by you and your team.
  • Repositories on GitHub are by default public (everyone can see them), and to make them private you need to pay.

For a more comprehensive comparison of the two platforms see this comparison by UpGuard. When choosing a platform you must consider the limitations of each tool, and if you are employed in research, most likely, you will have to use the platform preferred by your research institute or company. Note that Bitbucket has a limitation on the number of teams one can make for free, and after some point you will need to pay.

Another platform for git is Gitlab.

So go on, start by signing up and making a username on these platforms. Use Bitbucket for in-progress work and GitHub for publishing. Choose a professional ID for your GitHub account and put it on your CV.

Installation and configuration

How will you run git on your system? If you prefer terminals, just install git and you are good to go. You can install git on a Debian system using:

sudo apt-get install git

or on a Red Hat based system

sudo yum install git

and on Mac

brew install git

For Windows, to get a git shell you can install TortoiseGit. If you prefer to work with a GUI, you can install GitKraken on all three operating systems.

If you are using a terminal, the first thing to do is configure git with your username and email. The username will be printed on the commits and changes you make. The email will be used to log in. You will be prompted for your password when pushing and pulling from the server.

git config --global user.name "[your_username]"
git config --global user.email "[your_email]"

git is pretty simple. You can find a list of git commands here. The most typical use of git consists of: git init to initialize a new repository, git clone to copy a repository onto your local computer, git add to stage the changes you made locally, git commit to make a log of your changes, git push to send the changes to the online repository, and git pull to get changes. A typical workflow is sketched below.
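
Put together, a first session with an existing repository might look like the sketch below (the URL and the file name are placeholders):

git clone https://github.com/[the_author]/[repository].git
cd [repository]
# edit a file, then stage, record and share the change
git add README.md
git commit -m "Describe the change you made"
git push
# later, fetch the changes made by your collaborators
git pull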

SSH vs HTTPS

The connection to the server is secured with SSH or HTTPS. It is recommended to use HTTPS, but you can use SSH if you prefer. GitHub explains which URL to use. Typically when cloning, you will see that sometimes the repository URL has a .git at the end and sometimes it doesn’t. If you use SSH you will need an SSH key; read here to learn how to connect to GitHub with SSH. Bitbucket gives you the option of using either one.

_images/HTTPS_SSH_bitbucket.png

When using your_username to clone/fetch a repository from the_author, an SSH url will look like:

git@bitbucket.org:[the_author]/[repository].git

and HTTPS will look like:

https://[your_username]@bitbucket.org/[the_author]/[repository].git.

git tutorial

There are plenty of nice tutorials to learn git on the web. The best may be the GitHub tutorial, which features a built-in terminal that you can use to walk through the commands step by step. The Bitbucket tutorial from Atlassian is a very comprehensive and detailed tutorial and, overall, a good resource to find what you need.

Try this at home

Start with the GitHub tutorial and finish the 15 modules of level 1. Notice the folder structure and the hidden .git folder. For the workshop, we expect you to know how to clone a repository, add and commit changes, push to, and pull from the repository.

Useful tips

Let’s go over some standards to keep in mind when using git.

commit messages

When you are committing your changes always use meaningful messages.

git commit -m "[a brief meaningful message explaining what the change was about]"

Avoid vague messages such as changed file x and fixed function y. The commit itself shows which files have been changed. The message should explain the functionality of the change.

Another important concept is that each commit should address one piece of functionality. It is not good practice to make a lot of progress and then push all the changes at once. The server will not run out of space if you make several commits, and commits are very useful to track the jobs you have completed.

When you find a conflict or something is not working, do not make duplicate files. For example, having main.tex and then creating main1.tex is confusing and defeats the purpose of version control.

Commits can be undone. Conflicts can be resolved. And we will learn how to fix mistakes.

Semantic versioning

Have you ever wondered how developers decide how to number the different versions of their software? Do they just randomly come up with numbers? No, the version number consists of three numbers, x.y.z, where x is a major change, y is a minor change and z a patch. There is official documentation on this, which you can read if you are interested. But assume you have a tool that reads some data and performs some function on it. If you find a bug and fix it, you publish the fix by incrementing z. If you add a small piece of functionality, for example support for compressed input or compatibility with other tools, increase y. If you add a whole new function, increase x.
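
For example, going from 1.4.2 to 1.4.3 would signal a bug fix, 1.4.3 to 1.5.0 a small backwards-compatible addition, and 1.5.0 to 2.0.0 a major, possibly breaking, change.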

GNU licensing

git is open source. If you use GitHub and/or Bitbucket, you can publish your tool under a GNU license. GNU licenses are open source, and open source does not mean free. Whenever you use code under a GNU license, you must credit the authors/developers. For more information on the license, check the GNU organization’s documentation.

Readme and MarkDown syntax

It’s good practice to include a Readme. The Readme file can be made online using the editors GitHub and Bitbucket provide. Typically Readmes are written in Markdown syntax, which is very simple. You might have heard about R Markdown, which builds on the same Markdown syntax. Again, there are many tutorials to learn Markdown. You can check the syntax on the Atlassian website.

A Readme should include information about:

  • name of the tool and the version,
  • what is this tool about,
  • who are the authors,
  • requirements and dependencies,
  • how to install /clone it,
  • how to run it,
  • what is the input and output,
  • licensing,
  • how to cite it.

Look at this nice outline for a standard Readme file in MarkDown syntax. To get the source code click the Raw button on the top left.

Issue tracking

Both Bitbucket and GitHub allow for issue tracking. Members of a team can create an issue, assign it to a developer or admin, and comment on it. An issue can be marked according to its importance and type, for example fixing a bug or adding functionality, and the issue can be resolved once it has been taken care of. Issues can be linked to commits, to show which commit resolved an issue.

When a repository is publicly accessible, you can create issues to inform the developers there is a bug or a functionality you would be interested in. So, the next time you find an issue with some tool that you can’t resolve after trying for a few days, just post an issue on their GitHub repository.

Workshop 2: SCC and git

Part 2.1: SCC and qsub

Login into SCC
ssh [username]@scc1.bu.edu
Loading modules

SCC has many preinstalled programs and utilities which we refer to as modules. You can search for different modules using:

module avail [pattern]

To load a specific module use module load:

module load [module_name]

For example let’s check for all the available JAVA versions on SCC and load version 9.

module avail java
module load java/9.0.1
java -version
Submitting jobs to the SCC

When you ssh to the SCC you are connected to a head node. Head nodes are the only nodes on the cluster that are connected to the internet (so that you can access the SCC). They are the busiest and maintain all user connections. You should not run any program on the head node. By default your program will be killed if it runs for more than 20 minutes, but even if you have some code that runs in 5 minutes, do not run it on the head node; always submit it as a job.

qsub -P [project_name] -N [name_of_job] [bash_script]

When a job is running, its standard output will be saved in a file named [job_name].o[job_ID] and its standard error output will go to [job_name].e[job_ID] in the directory from which you called qsub. To merge them use -j y (join=yes). For a full list of parameters and options for qsub see here. You can see how to allocate more memory, assign multiple processes to multi-threaded jobs, send notification emails upon the completion of your job, and more. Note the machines and resources available.

Useful parameters

  • Send an email upon ending: -m e -M [email]
  • Get multiple processes/slots: -pe omp [#processes]
  • Set the maximum (hard) running time: -l h_rt=hh:mm:ss

Once the job is given the resources it requires you can check the status of your ongoing jobs using qstat:

qstat -u [username]

This will return all the running jobs with their job_ID, name, starting time, and status (qw=waiting, r=running).

To delete/stop a job use qdel.

qdel [job_ID]
Useful tips
  • Always use a meaningful name for your jobs. In general always use meaningful names!
  • Do not allocate more resources than you need; this will not make your program run faster. If your program is not multi-threaded, asking for multiple processes just makes you wait longer in the queue, and the same goes for requesting too much memory when your program is not memory intensive. In general, don’t be wasteful!
  • You can use -hold_jid to make one job wait for another to finish before it runs (see the example below). If a job is running, the machine associated with it is shown too; you can ssh to that machine and see the status of the job. Use top -u [username] to see your ongoing processes and the amount of resources they use.
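
For example, a minimal job dependency (script and job names here are placeholders) could look like:

qsub -N download_data download.sh
qsub -N analyze_data -hold_jid download_data analyze.sh    # waits for download_data to finish
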
Hands on activity

Here we will do an activity using the SRA toolkit, a useful tool for downloading sequencing data from GEO. We will use the toolkit to download some RNA-seq data.

Log on to SCC.

ssh [username]@scc1.bu.edu

Choose some RNA-seq data. First, query for a series on GEO. For example, the GSE113476 series contains human breast cancer PDX samples. Get the SRA project (SRP) accession ID (SRP141444) from the Relations box. To download this project, we need to get all the sample files (with SRR accession IDs). To do so, use the SRA Run Selector and search for project SRP141444. Get the accession list (an SRR for each sample). Save it as a file on the SCC either with FileZilla or just vim.

vim SRR_Acc_List.txt

Copy the first four SRR IDs into the file and save it:

SRR7050666
SRR7050667
SRR7050668
SRR7050669

Load sra toolkit. See what versions of the toolkit are available.

module avail sra

Load the default version.

module load sratoolkit

Make a bash script to download the data: it should read the SRR accession IDs one by one and fastq-dump them into an output directory.

vim dl_sra.sh

Copy the following code into the bash script file.

#!/bin/bash
ACCESSION_LIST_FILE=$1
OUTPUT_DIR=$2
while read SRR_ID; do
   fastq-dump --gzip --split-files --outdir $OUTPUT_DIR $SRR_ID
done < $ACCESSION_LIST_FILE

Submit the code.

qsub -N SRA_example dl_sra.sh SRR_Acc_List.txt SRP141444

Check if your code is running:

qstat -u [username]

This will download each SRR one by one. That is slow. Let’s kill it (qdel) and make it faster.

Make your code multi-processed to run faster.

One way to make it multi-process (especially when you need to use large numbers of processes) is to use multiple jobs. Try that on your own. Make a bash script that submits a job (qsub) for each SRR accession.

Part 2.2: Version control with git

Forking a repository

On Bitbucket you can fork from the left menu:

_images/fork_atlassian.gif

On GitHub, you can find the Fork button at the top right of the repository page.

_images/fork_github.png

You will be divided into groups. One person from each team forks the repository.

  • Log in to Bitbucket and fork the repository REU_workshop2_git.
  • Go to your copy of the repository.
  • Click on Send invitation and then Manage this repository on Bitbucket, or Settings and Collaborators on GitHub.
  • Add your team member/members and give them Admin access.

Each team member will clone the repository on their own computer or SCC.

Editing from the server

Go to your copy of the repository. Go to Source and open the Readme file. Click Edit to make changes to the Readme, and write your name. Click the Commit button to save your changes.

Running the code

Read the Readme file. You will need to have Python3 and all the required modules installed. If you don’t already have a conda environment, use:

module load anaconda2
conda create --name [env_name] python=3.6.2
source activate [env_name]

We are going to use Python3, so make sure you create an environment accordingly. You can check your Python version using:

python -V

We will need to install some modules in order to run the code.

# install the required libraries
conda install scikit-learn
conda install matplotlib
pip install textblob
python -m textblob.download_corpora

You can run the code now and play around.

python src/digit_recognition_game.py
python src/predict_sentiment.py

Untracked directory

When you run the code, a log file called human_vs_machine.cvs is created, which stores information for each run. You do not want the content of your runs to be uploaded to the repository. To avoid this, you can make a .gitignore file in the data folder (a minimal example is shown below).

vim .gitignore
ls -a data
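
A minimal data/.gitignore only needs the log file name mentioned above (a wildcard pattern would also work):

# contents of data/.gitignore
human_vs_machine.cvs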

Make some changes to digit_recognition_game.py

src/digit_recognition_game.py runs a small program that learns handwritten digits from low-resolution pictures. Then it will compete with you to see who can do better! You will make the following changes to improve the code:

Start by entering your own name in line 2 of the src/digit_recognition_game.py file and commit your changes:

You can see which files you have changed by:

git status

and you can see the difference between the files, e.g. the lines that were changed by:

git diff

Push it to the server:

git add src/*
git commit -m "[your message]"
git push

Did some of your team members get an error message?

! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to 'https://[your_username]@bitbucket.org/[owner_repository]/bub_workshop07_git.git'
To prevent you from losing history, non-fast-forward updates were rejected
Merge the remote changes before pushing again.  See the 'Note about
fast-forwards' section of 'git push --help' for details.

Resolving conflicts

If you got a conflict message, try to pull the recent changes made by others.

git pull

This will try to automatically merge the changes that do not conflict. However, if there is a conflict, you will get an error message and the conflicts will be marked in the affected file(s), such as:

<<<<<<< HEAD
aaaaaaa
=======
bbbbbb
>>>>>>> 38b76457af9eba704534f7293817653888c03fc5
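To resolve the conflict, edit the file so it keeps only the lines you want, delete the conflict markers, and then stage and commit the merge. For example (the file path is just the one used in this workshop):

git add src/digit_recognition_game.py
git commit -m "resolve merge conflict"
git push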

If you don’t want to merge and just want to set aside all the changes you have made, you can use git stash. Your local changes will disappear from the working directory (they can be recovered later with git stash pop, or discarded with git stash drop).

Try this on your own

Now let’s improve the code a bit.

  • Change A1. Allow the user to choose the learning algorithm. Currently the program supports Support Vector Machine (SVM), Naive Bayes (NB), and K-Nearest Neighbors (KNN) classifiers. Prompt the user for a number 1-3 to pick the classifier (see the sketch after this list).

    def set_classifier(clf='KNN'):
        """ Set the type of classifier to use"""

        # Define which ML classifier algorithm to use, based on the `clf` argument
        if clf == "SVM":
            classifier = svm.SVC(kernel="linear")  # Support Vector Machine
        elif clf == "KNN":
            classifier = neighbors.KNeighborsClassifier(9)  # K-Nearest Neighbors
        elif clf == "NB":
            classifier = naive_bayes.GaussianNB()  # Naive Bayes
        else:
            classifier = neighbors.KNeighborsClassifier(9)  # default: K-Nearest Neighbors

        return classifier
    
  • Change A2. Check that the user enters a digit between 0 and 9. If the input is not a one-digit number, warn the user and prompt for another number.

    def get_human_prediction():
        """
        Function: Prompts the user for the number they are guessing.
        Returns: (int) number user guessed
        """
        human_prediction = None
        while human_prediction is None:
            try:
                human_prediction = int(input("Type the number you saw: "))
                if not 0 <= human_prediction <= 9:  # warn and prompt again for out-of-range input
                    print("Please enter a single digit between 0 and 9.")
                    human_prediction = None
            except ValueError:  # non-numeric input: warn and reset so the user is prompted again
                print("Please enter a single digit between 0 and 9.")
                human_prediction = None
        return human_prediction
    
  • Change A3. As you can see the image of the figure opens in a large size. Can you change this so it opens in a smaller size?

    fig, ax = plt.subplots(figsize=(3, 3))
    

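For Change A1, a minimal sketch of prompting the user; the menu mapping and variable names below are just one possible approach that reuses the set_classifier() function shown above:

# map the user's menu choice to a classifier name, defaulting to KNN
classifier_names = {'1': 'SVM', '2': 'NB', '3': 'KNN'}
choice = input("Pick a classifier: 1) SVM  2) Naive Bayes  3) KNN: ")
classifier = set_classifier(classifier_names.get(choice, 'KNN'))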
Try to push and resolve your conflicts again.

Revert changes (undoing the commit)

git reset HEAD~1

You can do a --soft or a --hard reset. Oh no, did all your changes disappear? We can move back and forth with git: get the ID of any commit and you can time travel.

git reset d656972
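To find the commit IDs (and their messages), you can list the history with:

git log --oneline
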
Branching and merging

You can make branches to work separately on different functionalities of a tool. This is useful for big teams of developers where each one works on a different module. This is how you make a branch:

# make a branch for your team
git branch [your_branch]
git checkout [your_branch]

Or you could make a branch and checkout at the same time.

git checkout -b [your_branch]

See what branch you are on:

git branch

You have your own local copy on a separate branch.

Make some changes on predict_sentiment.py

src/predict_sentiment.py runs a small program that learns the sentiment (positive or negative) of a sentence from a set of training sentences and tests on another set. Make the following changes:

  • Change B1. The train and test sentences are currently hardcoded in the code. Save them into two text files, train.txt and test.txt, and make the program read the data from disk.

    def load_data_from_csv(filename):
        """
        Load data from a 2 column CSV file.
        The data should have a sentence in column 1
        and the sentiment "positive" or "negative" in column 2.
        """
        data = []
        with open(filename, 'r', encoding='latin-1') as f:
            for line in f:
                line = line.strip()
                sentence, sentiment = line.split(',')
                data.append((sentence, sentiment))
        return data
    
  • Change B2. After learning the sentiments, make the program prompt sentences from the user and guess the sentiment.

    input_sentence  = input("Type a new sentence: ")
    new_sentence = TextBlob(input_sentence,classifier=cl)
    print ("New sentence: %s" % new_sentence)
    print ("Predicted Connotation: %s" % new_sentence.classify())
    

Push your changes on your own branch. There should be no conflicts.

Merge the branch into master

git checkout master
git merge [your_branch]

Hopefully you won’t have conflicts. If you do, you know how to resolve them.

Pull requests

You can inform others of your magnificent changes and accomplishments by making pull requests. This way you let everyone know that you made some changes and that they need to pull.

Go to the repository and, from the left side menu, click on Pull requests. Create a new pull request. Note: it is better to send pull requests from the branches on which you have been making your changes.

Workshop 3: Python

This workshop will serve as an introduction to Python. The workshop breaks into two sections: a brief overview of Python as a programming language (with quick examples and explanations of common functionality), and a problem-based workshop where students will create a Python script to perform protein synthesis in silico. The introduction should be completed before the in-person workshop. The workshop should be done in pairs, with both students alternating who “drives”.

Sections

Python 3

Welcome to this supa-quick, supa-dope Python 3 tutorial. Python is a general purpose programming language created in the early 1990s by Guido van Rossum. Today, Python is one of the most popular languages and enjoys particular success in statistics/data science and scientific computing. This tutorial will serve as a brief introduction to the capabilities of Python and its syntax.

Getting Started

To get started we will likely need to install Python. While there are many ways to install Python on your system, I recommend using the Anaconda Distribution (https://www.continuum.io/downloads). Anaconda is a cross-platform (OSX, Linux, Windows) distribution manager that simplifies installing and managing packages. While this tutorial only makes use of the base Python packages, installing via Anaconda will also install several scientific libraries that you will likely find useful later. Further, Jupyter is also included in the Anaconda install, giving you access to Jupyter Notebooks.

Interacting with Python

Once Python is installed on our system, there are two main ways we can interact with Python: 1) opening a python interpreter using the terminal, 2) creating a python script file.

Accessing a Python Interpreter

To access a Python Interpreter simply open a terminal window, and type ‘python’. This will create an interactive Python session where we can write and test Python code. If you are on a Windows machine, instead of the normal command prompt, barring specific installation steps, you will need to open an Anaconda Prompt. This is a special terminal that will give you access to your Python/Anaconda installation.

Writing a Python Script

A python script is a file with the ‘.py’ extension and can be written using your favorite text editor or IDE. If you have Anaconda installed on your computer, you will have access to the Spyder IDE, which is a popular and useful IDE for writing scripts in Python. A python file can be run by typing ‘python *script_name*.py’ into the terminal.

Basic Python Variables and Operations

Mathematical Operators

Unsurprisingly, Python can do math! The basic mathematical operators are +, -, *, and / for addition, subtraction, multiplication, and division.

# The print function takes a value or expression and displays the output to the screen.
# The hash symbol denotes the proceeding text as a comment, and thus is not evaluated
# by the interpreter.

print(2 + 2)
print(2 - 2)
print(2*2)
print(2/2)
4
0
4
1.0
# Negative values are demonstrated with a '-'
print(-3 + 2)
-1
# Exponents use the double star operator '**'
print(2**3)
8
# The percent symbol, '%', is used as the modulo operator for calculating remainders.
print(6 % 4)  # 6 = 4*1 + 2
2
# Mathematical expressions follow the order of operations.
print((2+3)*(-1)**2/2)
2.5
Mathematical Variables

There are two basic numerical data types in Python: integers and floating point numbers. Integers are whole numbers, positive or negative, while floating point numbers contain decimal values.

# The data type of a value can be determined using the 'type()' function.
print(type(2))
print(type(2.0))
<class 'int'>
<class 'float'>
# Values in Python can be assigned to variables with different names for later access.
# Variable assignment is done using the '=' symbol.
x = 2
y = 3.0
print(x)
print(y)
print(y*x)
2
3.0
6.0
# Variables can be cast to compatible data types using the desired data type function.
print(y)
print(type(y))

z = int(y)
print(z)
print(type(z))
3.0
<class 'float'>
3
<class 'int'>
# While we instantiated 'z' by casting 'y', the value of 'y' remains unchanged.
print(y)
3.0
Boolean Values and Operations

Boolean values are values that determine the truth value of a specific statement. In Python, these take the form of the keywords True and False. There are several useful operators such as <, >, <=, >=, and == for assessing relationships between numerical values. Each of these operators returns a boolean value representing the truth value of the given statement. All the previously listed operators expect to be sandwiched between two values, one to the left and one to the right, and are evaluated left to right.

# The less than operator '<'
x = 3
y = 6
z = 10
print(x < 5)
# The greater than operator '>'
print(z > x)
# the less than or equal to operator '<='
print(x <= 5)
print(x <= 3)
# the greater than or equal to operator '>='
print(x >= 5)
print(x >= 3)
# the equality operator '=='
print(y == 6)
print(y == 7)
True
True
True
True
False
True
True
False

Boolean statements (e.g. 3 < 5) can be strung together and manipulated using the and, or, and not keywords. All keywords follow their formal logic definitions: the and keyword is true if both statements are true, the or keyword is true if at least one of the statements is true, and not negates the original truth value.

print(y > x and y < z)
print(y < x or y < z)
print(not y > x)
True
True
False
String Variables and Operations

Strings are data types used to represent text data. They can be instantiated by placing expressions between single (‘[expression]’) or double (“[expression]”) quotes.

string_1 = 'dog'
string_2 = "cat"
print(string_1)
print(string_2)
dog
cat
# strings can be concatenated using the '+' operator
string_3 = string_2 + string_1
print("What do you mean you've never seen a " + string_3 + "?!")
What do you mean you've never seen a catdog?!
String Substitution

Values can be substituted into a string using string substitution. This is done using the .format() method available to string objects.

# the second single or double quote mark can be escaped using a backslash: \
statement = 'What do you mean you\'ve never seen a {0}?!'
print(statement.format(string_3))
What do you mean you've never seen a catdog?!
# strings can be compared using boolean operators;
# comparisons are lexicographic (alphabetical order), not by length
print(string_1 == string_2)  # are they the same string?
print(string_1 < string_2)  # does string_1 come before string_2 alphabetically?
print(string_3 > string_2)  # does string_3 come after string_2 alphabetically?

# strings are case sensitive
print('cat' == 'Cat')
False
False
True
False
# String case can be changed using the .upper() and .lower() string methods.

print(string_2.upper())
print(string_2.upper() == 'CAT')
print(string_2 == 'CAT'.lower())
CAT
True
True
# The length of a string can be accessed using the built-in len() function.
print("The string '{0}' is {1} characters long.".format(string_1, len(string_1)))
The string 'dog' is 3 characters long.
# Characters in a string can be accessed by position.
# Python indexing starts at 0.

print("The first character in '{0}' is: {1}.".format(string_1, string_1[0]))

# Due to zero indexing, the last element is the n - 1 element.
print("The last character in '{0}' is: {1}.".format(string_1, string_1[len(string_1) - 1]))

# Negative indexing also works (e.g. -1 accesses the last element):
print("The second to last character in '{0}' is: {1}.".format(string_1, string_1[-2]))
The first character in 'dog' is: d.
The last character in 'dog' is: g.
The second to last character in 'dog' is: o.
# If a string is of a numerical value, the string can be converted to an integer or float.

float_string = '2.5'
int_string = '2'
print_msg = 'Converted {0} to {1} from type {2} to type {3}'

int_num = int(int_string)
print(print_msg.format(int_string, int_num, type(int_string), type(int_num)))

float_num = float(float_string)
print(print_msg.format(float_string, float_num, type(float_string), type(float_num)))

# Likewise, numbers can easily be converted to strings
num = 3.5
print(print_msg.format(num, str(num), type(num), type(str(num))))

# It is important to note that if a string represents a floating point number,
# Python is unable to convert that number to an integer.
Converted 2 to 2 from type <class 'str'> to type <class 'int'>
Converted 2.5 to 2.5 from type <class 'str'> to type <class 'float'>
Converted 3.5 to 3.5 from type <class 'float'> to type <class 'str'>
Container Variables and Operations

There are three main container data structures in base Python: lists, sets, and dictionaries.

Lists

Lists are arbitrarily long collections of objects. They are instantiated by placing comma-separated values within square brackets [ ].

my_list = [1, 2, 3, 4]
print(my_list)
[1, 2, 3, 4]
# Like strings, elements within lists can be accessed via their position.
print('The first element of my_list is {0}'.format(my_list[0]))
The first element of my_list is 1
# Access and assign list value by accessing an indexed element,
# and assigning it to a new value.
new_list = [1, 2, 3]
print(new_list)
new_list[2] = 5
print(new_list)
[1, 2, 3]
[1, 2, 5]
# A range of objects within a list can be selected using ':'
print(my_list[1:3])

# Another ':' can be used to define step size for the selection range.
print(my_list[1:4:2])
[2, 3]
[2, 4]
# element membership within a list can be tested using the 'in' keyword.

print(5 in my_list)
print(3 in my_list)
False
True
# The length of a list is also assessed using the len() function.
print(len(my_list))
4
# An empty list can be constructed using empty square brackets
x = []
print(len(x))
print(x)
0
[]
# Elements can added onto the end of a list using the .append() list method.

x.append('Hi')
print(x)
['Hi']
# Lists can have mixed-type variables (e.g. a list can contain both integers and strings)
my_list.append('String!')
print(my_list)
[1, 2, 3, 4, 'String!']
# incremental lists up to a defined number can be created using the built-in range() function.
# The range function outputs a 'range' object. However, it can be cast to a list
# using the list() function.

n = 10
# Create list of length 10 ranging from 0 - 9
range_list = list(range(n))
print(range_list)

# The list doesn't need to start at 0
m = 3
print(list(range(m, n)))

# Likewise, we can specify our own step size
step = 2
print(list(range(m, n, step)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[3, 4, 5, 6, 7, 8, 9]
[3, 5, 7, 9]
# Lists can be concatenated using the '+' operator
string_list = ['I', 'Love', 'Dogs']
print(my_list + string_list)
[1, 2, 3, 4, 'String!', 'I', 'Love', 'Dogs']
Sets

Sets are container objects that can only contain unique elements. If you are familiar with Set Theory in Mathematics, Python sets are simply an implementation of such a structure. Sets are constructed by passing a list to the ‘set()’ function or directly via curly brackets { }.

# Sets can only contain unique elements.
set_1 = set([1, 1, 2, 2, 3, 4, 5])
print(set_1)

set_2 = {3, 4, 6, 7, 7, 8 , 9, 10}
print(set_2)
{1, 2, 3, 4, 5}
{3, 4, 6, 7, 8, 9, 10}
# add elements to a set using the .add set method
set_1.add(6)
print(set_1)
{1, 2, 3, 4, 5, 6}
# still only unique elements
set_1.add(5)
print(set_1)
{1, 2, 3, 4, 5, 6}
# Remove elements using the .remove set method
set_1.remove(6)
print(set_1)
{1, 2, 3, 4, 5}
# retrieve union of two sets using the .union set method
print(set_1.union(set_2))

# retrieve set difference of two sets using the .difference method
print(set_2.difference(set_1))

# retrieve set intersection using the .intersection method
print(set_1.intersection(set_2))
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
{8, 9, 10, 6, 7}
{3, 4}
# Unlike lists, sets are unordered and thus don't support indexing.
print(set_1[0])
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-37-c17aa407af1e> in <module>()
      1 # Unlike lists, sets are unordered and thus don't support indexing.
----> 2 print(set_1[0])


TypeError: 'set' object does not support indexing
Dictionaries

Dictionaries are collections with key-value pairs. They are constructed by matching a key with an associated value. The value can then be retrieved at a later time using the provided key. In Python, values can be of arbitrary data types, while keys must be immutable (e.g. strings, numbers, or tuples). Similar to sets, dictionaries are constructed using curly brackets { }, though each entry must follow the key:value syntax.

# Construct dictionaries by separating keys and values using ':'
# Separate key-value pairs using ','
my_dict = {'a': 1, 'b': 2, 'c': 3}
print(my_dict)
# Look up values using keys
print(my_dict['a'])
# Create an empty dictionary using {}
empty_dict = {}

# add elements by 'indexing' with a given key and providing an associated
# value as an assignment.
empty_dict['key'] = 'value'
print(empty_dict)
# Retrieve keys of a dictionary using .keys() dictionary method
print(my_dict.keys())
# Retrieve values of a dictionary using .values() dictionary method
print(my_dict.values())

If, Else, and Elif Statements

Sometimes when writing a program, you need to execute different code snippets depending on the value of a specific variable. In Python, we do this by employing the three boolean keywords: if, else, and elif.

An if statement uses the following syntax:

if (boolean statement):

run this code
# if statements must be followed by a colon.
# Likewise, the next line MUST be indented using either a tab or 4 spaces.
if True:
    print("It's true!")

x = 3
if (x < 10):
    print('{0} is less than 10'.format(x))
# An else statement must follow an if statement and is executed
# if the statement in the if statement is not met.
x = 11
if (x < 10):
    print('{0} is less than 10'.format(x))
else:
    print('{0} is greater than or equal to 10'.format(x))
# Like an else statement, an elif statement must follow a preceding if statement.
# However, like an if statement, an elif must also have its own boolean statement
# that must be met in order for its snippets to be run.

if (x < 10):
    print('{0} is less than 10'.format(x))
elif (x < 15):
    print('{0} is greater than 9, but less than 15'.format(x))
else:
    print('{0} is greater than 14'.format(x))

Iteration and Looping

While programming, it is common to want to execute a code snippet multiple times, or to execute the same line over a set of values. For this, we use looping. There are two different types of loops we can use in Python: for loops and while loops. For loops iterate through a set of values; a while loop iterates until a specific condition is met.

For loops

For loops employ the following syntax:

for each in list:

run code

The variable each is defined in the loop statement. Similarly, the variable list can be any iterable data type: not just a list. Like if, else, and elif statements, loop statements end with a colon and must be followed by a new line and an indentation.

# iterate through a list
my_list = [1, 'hi', 'yellow', 'pizza', 4.5]
for each in my_list:
    print(each)
# use the range() function to iterate through integer values
for i in range(5):
    print(i)

Nested For Loops

We can nest loops within other loops for loop-ception. In a nested loop, the outer loop will run with the first value specified by the iterator (e.g. i = 0) until the inner loop has gone to completion (e.g. executed for j = 0 and j = 1). Once the inner loop is completed, the outer loop then moves on to the next value, and the process is repeated.

for i in range(5):
    for j in range(2):
        print('(i={0}, j={1})'.format(i, j))

While Loops

While loops execute until a boolean statement returns False. While loops employ the following syntax:

while boolean_statement:

execute code
count = 0
while count < 5:
    print(count)
    count += 1  # the += operator increments the value of a variable by the right value

Nested While Loops

Like for loops, while loops can also be nested; however, in order to fully iterate through each loop, values used in the boolean statement of the inner loop must be set in the outer loop. This ensures the inner loop's counter is reset for each new iteration of the outer loop.

count = 0
while count < 3:
    num = 5
    while num > 3:
        print('num: ' + str(num))
        num -= 1  # the -= operator decrements a variable by the right value.
    print('count: ' + str(count))
    count += 1

Functions

It is often a good idea to modularize your programming: that is, break your code into smaller parts that can be run together to complete your task. This is often done by declaring functions. In Python, functions take a defined set of inputs, perform some set of operations using the inputs, and usually return some value. Functions are defined using the following syntax:

def function_name(input_1, …):

run code

Like loops and control statements, function definitions end with a colon followed by a new line and an indentation.

def add(x, y):
    return(x + y)

print(add(1, 2))
# It is common to have doc-strings, denoted by three sets of quotation marks,
# after a function definition to define the use of the function.
def multiply(x, y):
    """
    Multiplies two numbers together.

    Arguments:
        x (float or int): a numeric value.
        y (float or int): a numeric value.

    Returns:
        (float or int): the product of `x` and `y`.
    """
    return(x*y)

print(multiply(3, 2))
# It is possible to include optional parameters in functions.
# These are defined by setting an arguments name and giving
# a default value using '='

def increment(x, step=1):
    """
    Increments a value by specified value.

    Arguments:
        x (float or int): a numeric value.
        step (float, optional): a numeric value to increment `x` by.
            Default value is 1.
    Returns:
        (float or int): sum of `x` and `step`.
    """
    return(x + step)
print(increment(2))
print(increment(2, 3))

Scope

When discussing functions, it is important to also talk about the scope of a variable. The scope of a variable is the environment in which the variable is defined. If a variable is defined within a function, its scope is local and unique to that function: the variable cannot be accessed outside of the function. If a variable is defined outside of a function, at the first indentation level, the scope is global: the variable can be accessed anywhere within the Python file.

global_var = 20
def scope_function():
    """Scope example."""
    local_var = 3
    print(global_var + local_var)  # global_var has global scope
# local_var was defined only within scope_function(). Thus,
# it does not exist outside of the function.
print(local_var)  # this raises a NameError because local_var is not defined here

File Input and Output

Often when writing a program, it is necessary to read or write to a file. Reading and writing can be done in a variety of ways and we’ll go over the most useful here.

Reading a file

To read a file, we must first create a connection to the file. The most basic way to do this is with the open command, then reading lines with the readline io method.

# The open command creates a TextIOWrapper object that is used to read
# lines in a file. The first argument is the file to open, while the
# second argument specifies that the file should be opened in "read-mode".

read_file = open('input_file.txt', 'r')  # open the file
file_string = ""
line = read_file.readline()  # read a line using the readline TextIOWrapper method.
while len(line) > 0:  # read lines until no lines are left in the file.
    file_string += line
    line = read_file.readline()
print(file_string)
read_file.close()  # close the connection to the file.
Using with to simplify file reading

The above method requires we create a separate file object and remember to open and close it. This can be simplified by using the with and as keywords:

with open('input_file.txt') as f:
    for line in f:
        print(line)
Writing Files

We write to files analogously to the way we first read a file: creating a connection, iterating through the lines we want to write, and finally closing the file.

write_list = ['This is a line',
              'This is also a line.',
              'In case you didn\'t know,',
              'You can have line breaks',
              'in between list elements',
              'and really any bounded element.']

f = open('output_file.txt', 'w')  # the 'w' parameter specifies "write-mode"
for each in write_list:
    f.write(each + '\n')  # write() does not add a newline, so append one explicitly
f.close()  # Look in your present working directory and you'll notice an output_file.txt file.

Importing Modules

In Python, a module is an external library that provides functionality that extends past the built-in functionality. However, there are several standard libraries/modules that are included in the base Python install, such as math, sys, os and other modules. These, and any other module, must be brought into the python environment using the import keyword.

On a basic import, any method, data structure, or value provided by the module must be accessed by prefixing it with the module name (e.g. to use the sin function in the math module, we type math.sin).

import math
# find the sin of 1, 0, and pi
print(math.sin(1))
print(math.sin(0))
print(math.sin(math.pi))

It is possible to import specific methods or sub-modules from libraries. This is done by combining the from keyword with the import keyword. Depending on the level of import, the syntax for accessing the imported methods changes.

from math import cos
print(cos(math.pi))  # no `math.cos` necessary because we imported `cos` directly.
from os import path
# import the 'path' submodule from the 'os' module to gain access to the 'realpath' method.
# When executing, os.path.realpath' not necessary because 'path' sub-module imported.
# However, path.realpath necessary because 'realpath' is in the 'path' sub-module.
print(path.realpath('input_file.txt'))
# You can re-name modules using the 'as' keyword on import
import math as m
print(m.pi)

Conclusion

This concludes our brief introduction to Python 3. This document simply serves as a primer to first getting acquainted with the syntax and data structures in Python. Many concepts, techniques, and capabilities were left out. Feel free to explore more of Python’s capabilities on your own if you so desire. Looking into external libraries such as numpy and scipy will be incredibly beneficial for anyone looking to continue to perform analysis in Python.

BRITE REU Python Workshop

Instructor: Dakota Hawkins

Overview

Protein synthesis generally follows what has been termed “The Central Dogma of Molecular Biology.” That is, DNA codes for RNA, and RNA then makes protein. Here is a useful source if you need a quick refresher (https://www.nature.com/scitable/topicpage/translation-dna-to-mrna-to-protein-393). In today’s workshop we will be writing a small Python script to simulate this process by reading a DNA sequence from a FASTA file, transcribing the sequence to mRNA, translating the computed mRNA strand to amino acids, and finally writing the protein sequence to another FASTA file. This workshop is intended to synthesize the information we learned in the Python Introduction notebook.

For this workshop you will be working with a partner in small teams. The groups will be used as a means to facilitate discussion (e.g. “How can we structure this function?”), while you and your partner will help each other implement the code. Partners should choose a single computer to write the code with. While a single person will be “driving” at a time, both partners are expected to converse and contribute. Likewise, no one person should be driving for the entire workshop: make sure to switch semi-regularly to ensure each person is getting the same out of the workshop. Please ensure each partner has a working copy of the completed Jupyter Notebook after the workshop is complete.

This notebook includes skeleton methods for all of the different Python functions we’ll need: ``read_fasta()``, ``write_fasta()``, ``read_codon_table()``, ``transcribe()``, ``translate()``, and ``main()``. While these functions should encompass all of the functions we’ll need, feel free to write your own helper functions if you deem it necessary. Similarly, if you’d rather eschew the structure I provided – whether combining previously separated functions, changing passed arguments, etc. – feel free to do so. The only requirement is that both partners are on board with the change and the final product produces the same output. The skeleton code is mainly used to provide a starting structure so the code is easier to jump into.

Helpful Tips and Files
  1. The provided file, ‘codon_table.csv’, contains information on which codons produce which amino acids.
  2. The ``re`` python module contains a ``sub`` method to perform regular expression substitution.
  3. FASTA files are text files with standardized format for storing biological sequence. Generally, the first line in FASTA files is a description demarked by ``>`` (or less frequently ``;``) as the first character. The next lines contain the actual biological sequence. Generally each line is either 60 or 70 characters long before a line break. An example FASTA file (human_notch.fasta) has been included. For more information: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
  4. Helpful functions
Library   Function     Description                                                            Example Call
base      open()       Access a file in Python                                                read_file = open(file_name, 'r')
base      readline()   Read the current line from a file object                               read_file.readline()
base      write()      Write a string to a file                                               write_file.write("Hi there.")
base      strip()      Remove leading and trailing whitespace and formatting characters       "\n Hi there    ".strip()
base      split()      Separate a string into disjoint sections given a specified delimiter   "1,2,3,4".split(',')
re        sub()        Substitute one given pattern with another                              re.sub("F", "J", "Function")

Read FASTA Files:

def read_fasta(fasta_file):
    """
    Retrieve a DNA or protein sequence data from a FASTA file.

    Arguments:
        fasta_file (string): path to FASTA file.
    Returns:
        (string): DNA or protein sequence found in `fasta_file`.
    """
    return('')

Write FASTA Files:

def write_fasta(sequence, output_file, desc=''):
    """
    Write a DNA or protein sequence to a FASTA file.

    Arguments:
        sequence (string): sequence to write to file.
        output_file (string): path designating where to write the sequence.
        desc (string, optional): description of sequence. Default is empty.
    Returns:
        None.
    """
    return(None)

Read codon_table.csv:

def read_codon_table(codon_table='codon_table.csv'):
    """
    Create a dictionary that maps RNA codons to amino acids.

    Constructs dictionary by reading a .csv file containing codon to amino
    acid mappings.

    Arguments:
        codon_table (string, optional): path to the .csv file containing codon
            to amino acid mappings. Assumed column structure is 'Codon',
            'Amino Acid Abbreviation', 'Amino Acid Code', and 'Amino Acid Name'.
            Default is 'codon_table.csv'
    Returns:
        (dictionary, string:string): dictionary with codons as keys and amino acid codes
            as values.
    """
    return({'': ''})

Transcribe DNA to RNA:

def transcribe(dna_seq, direction='-'):
    """
    Transcribe a DNA sequence to an RNA sequence.

    Arguments:
        dna_seq (string): DNA sequence to transcribe to RNA.
        direction (string, optional): Direction of sequence. The symbol '+'
            denotes forward/template strand while '-' denotes reverse/coding strand.
            Default is '-'.
    Returns:
        (string): transcribed RNA sequence from `dna_seq`.
    """

    return('')

Translate RNA to Protein:

def translate(rna_seq, codon_to_amino):
    """
    Translate an RNA sequence to an amino acid sequence.

    Arguments:
        rna_seq (string): RNA sequence to translate to amino acid sequence.
        codon_to_amino (dict string:string): mapping of three-nuceleotide-long codons to
            amino acid codes.
    Returns:
        (string): amino acid sequence of translated `rna_seq` codons.
    """

    return('')

Tie the Steps Together:

def main(dna_seq, output_fasta):
    """
    Return the first protein synthesized by a DNA sequence.

    Arguments:
        dna_seq (string): DNA sequence to parse.
        output_fasta (string): fasta file to write translated amino acid sequence to.
    Returns:
        None.
    """

    return(None)
If You Finish Early

If you finish early, here are some suggestions to extend the functionality of your script:

  • System Arguments: Using the ``sys`` Python module it is possible to access command-line arguments passed by a user. Specifically, the ``sys.argv`` variable stores user-passed information. Implement command line functionality that takes a user-provided FASTA file, converts the DNA sequence to amino acids, and outputs to another user-provided FASTA file.
  • Defensive Programming: When you’re creating a program, usually you have a pretty good idea of its use and how it works. However, sometimes we’re not the only ones using our programs. Therefore, it’s a good idea to protect against user and input error. For example, what happens if non-recognized letters, whitespace, or special characters (``*``, ``-``) are included in the input sequence? Ensure your program is able to handle these, but remember some characters may have special meanings.
  • Calculating Statistics: Higher GC content in genomic regions is related to many important biological functions such as protein coding. Discuss with your partner the best way to measure the GC content of a DNA sequence. Once you’ve agreed on the best way, implement a function that will calculate the percentage along a provided sequence. Using the Python module ``matplotlib``, plot the output of this function to visualize how the measure changes along the sequence. In order to easily identify areas of high and low GC content, make sure to include a line that plots the mean level across the sequence.
  • Simulating Single Nucleotide Polymorphisms: Single nucleotide polymorphisms (SNPs) are single-point mutations that change the nucleotide of a single base in a strand of DNA. SNPs are incredibly important in genome-wide association studies (GWAS) that look to infer the relationship between specific genotypes and phenotypic outcomes such as disease status. Using a numerical library, such as numpy/scipy, create a function to randomly select a base for mutation. Apply some function that determines the identity of the newly mutated base. How biologically reasonable is your model? What biological phenomena should we consider to create an accurate simulation?

For some exercises, you will likely need to look for, and read, library specific documentation in order to implement the functions. This alone is a helpful exercise, as throughout your coding career you will continually need to reference documentation.

Workshop 4: R and RStudio

In this online workshop you will learn the R programming language, the RStudio interface for programming in R, and useful tips for exploring and working with data.

You are expected to study the following content:

In the workshop, we will work with RNA-Seq data and perform differential analysis.

Tutorials

R and RStudio: Introduction and Data Structures

R is a free programming language for statistical computing and graphics. It is an implementation of the S programming language and was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is currently developed by the R Development Core Team. RStudio is an Integrated Development Environment (IDE) for R.

To start, download the latest versions of R and RStudio following the instructions provided here

Getting Started

Open RStudio locally and learn how to use the RStudio interface.

Basic Operations in R

We can use R as a calculator to do simple math operations such as addition (+), subtraction (-), multiplication (*), division (/), and raising a value to a power (^). We can type these calculations in the console or run them in an R script, a file whose name ends in the .R extension.

#We can use hashtags to make comments about our code
#Basic calculations in R
4 + 5
4 - 5
4 * 5
4/5
#Outputs of calculations
[1] 9
[1] -1
[1] 20
[1] 0.8
#Calculate exponents using ^
4^5
#Output of exponent
[1] 1024
Data Structures

R has many data structures and types that we can use, depending on the information we want to work with.

The major data types include:

  • character
  • numeric (real or decimal)
  • integer
  • logical
  • double
  • complex

The major data structures include:

  • Scalars
  • Atomic Vectors
  • Factors
  • Lists
  • Matrices and Arrays
  • Dataframes
Scalars

The simplest type of object is a scalar, which is an object with one value. We can assign a value or calculation to a variable using the assignment operator "<-".

Note: The equals sign "=" can also assign values at the top level, but by convention it is reserved for specifying function arguments (e.g. by = 1 in seq()), so we will use "<-" for assignment.

To create scalar data objects x and y:

#Set x and y as values
x <- 4
y <- 5

The objects x and y were set as numeric data types.

We can manipulate these objects in R and perform different calculations together. To print the value of these variables, we can use the print() function or call the variable itself.

#Calculations with numeric variables

z <- x+y

z

print(z)

x*y/z
#Output of calculations

[1] 9

[1] 9

[1] 2.222222

As stated above, we can also create data objects of other data types such as logical and character mode.

For logical data, we use TRUE (T) and FALSE (F)

Logical <- T

Logical

[1] TRUE

For character data, we use single or double quotation marks to enclose the data

Character_Data <- "T"

Character_Data

[1] "T"

We can use available functions in R to determine the mode or type of data we are working with.

#Use mode function
mode(x)
[1] "numeric"

mode(Logical)
[1] "logical"

mode(Character_Data)
[1] "character"
#Use is.<type>() functions such as is.numeric() and is.logical()
is.numeric(x)
[1] TRUE

is.logical(Logical)
[1] TRUE

is.numeric(Character_Data)
[1] FALSE
Vectors

A vector is a basic data structure in R. It is a set of scalars of the same data type.

We can create vectors in different ways.

One of the main ways is to use the function c() to concatenate multiple scalars together.

x <- c(1, 5, 4, 9, 0)

x

[1] 1  5  4  9  0

We can use the function typeof() to determine the data type of a vector, and we can check the length of the vector using the function length().

typeof(x)

[1] "double"

length(x)

[1] 5

If we set x to have elements of different data types, the elements will be coerced to the same type.

x <- c(1, 5, FALSE, 9, "help")

x

[1] "1"  "5"  "FALSE"  "9"  "help"

typeof(x)

[1] "character"

Instead of reassigning the elements of x using the function c(), we could reassign specific elements based on the index number.

#Reassign third and fifth elements back to original values
x

[1] "1"  "5"  "FALSE"  "9"  "help"

x[3] <- 4

x[5] <- 0

x

[1] "1"  "5"  "4"  "9"  "0"

#The elements are still characters; convert the vector back with as.numeric()
x <- as.numeric(x)

typeof(x)

[1] "double"

Other ways to create vectors are to use other operators and functions such as the “:” operator, the seq() function, and the rep() function.

#Create vector of consecutive numbers

y <- 1:10

y

[1] 1  2  3  4  5  6  7  8  9  10

#Create vector of a sequence of numbers
#Defining number of points in an interval or step size

seq(1, 10, by = 1)

[1]  1  2  3  4  5  6  7  8  9 10

seq(1, 10, length.out = 10)

[1]  1  2  3  4  5  6  7  8  9 10

#Create vector of the same values

rep(3, 5)  # A set of 5 numbers with value set as 3

[1] 3 3 3 3 3
Factors

A factor is a special type of character vector. Factors are qualitative or categorical variables that are often used in statistical modeling. To create a factor data structure, we will first create a character vector and convert it to a factor using the factor() function.

temperature <- c("High","Medium","Low")
temperature <- factor(temperature)

Converting temperature character vector to a factor type creates “levels” based on the factor values (these are the values of categorical variables).

temperature

[1] High Medium Low
Levels: High Low Medium
Matrices and Arrays

So far we have discussed one-dimensional objects. We can also create objects that hold multidimensional data. Matrices are data structures that contain data values in two dimensions. An array is a matrix with more than two dimensions. Matrices and arrays are used to perform calculations in a computationally fast and efficient manner.

To create a matrix, we can use the matrix() function, which takes as arguments a data vector and parameters for the number of rows and columns.

We can determine the dimensions of a matrix using the dim() function.

#Create a simple 2 by 2 matrix.

mat<-matrix(c(2,6,3,8),nrow=2,ncol=2)

mat

     [,1] [,2]
[1,]    2    3
[2,]    6    8

dim(mat)

[1] 2 2

We can also choose to add row names and column names to the matrix.

#Add row names and column names

rownames(mat) <- c("a", "b")

colnames(mat) <- c("c", "d")

mat

  c d
a 2 3
b 6 8

#Add row names and column names through the matrix function

mat <- matrix(c(2,6,3,8), nrow=2, ncol=2,
            dimnames = list(
                c("a","b"),
                c("c","d")
            )
            )

mat

  c d
a 2 3
b 6 8

We can also create a matrix by concatenating vectors together using rbind() function to concatenate by rows or cbind() function to concatenate by columns.

x <- 1:3

y <- 4:6

# Combine by rows
a <- rbind(x,y)

a
   [,1] [,2] [,3]
x    1    2    3
y    4    5    6
# Combined by columns
b <- cbind(x,y)

b
     x y
[1,] 1 4
[2,] 2 5
[3,] 3 6

To create an array, we can use the function array(), which takes vectors as input and uses the dim parameter to specify the dimensions of the array.

vector1 <- c(1,2,3)
vector2 <- c(5,6,7,8,9,10)

# Create an array with dimensions (3,3,2): 2 matrices, each with 3 rows and 3 columns.

array1 <- array(c(vector1,vector2),dim = c(3,3,2))

array1
, , 1

     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    2    6    9
[3,]    3    7   10

, , 2

     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    2    6    9
[3,]    3    7   10
Lists

Lists are data objects which contain elements of different types including numbers, strings, vectors, and other lists. A list can also contain a matrix or even a function as its elements.

#Create a list of different data types

list_data <- list(c(2,4,6,8), "Hello", matrix(c(11,12,13,14),nrow=2,ncol=2),TRUE, 62.13, FALSE)
print(list_data)

# Give names to the elements in the list

names(list_data) <- c("Vector1", "Character1", "Matrix1", "Logical1", "Numeric", "Logical2")

list_data
$Vector1
[1] 2 4 6 8

$Character1
[1] "Hello"

$Matrix1
     [,1] [,2]
[1,]   11   13
[2,]   12   14

$Logical1
[1] TRUE

$Numeric
[1] 62.13

$Logical2
[1] FALSE

We can use the function str() to list the underlying structure of the data object.

str(list_data)
  List of 6
$ Vector1   : num [1:4] 2 4 6 8
$ Character1: chr "Hello"
$ Matrix1   : num [1:2, 1:2] 11 12 13 14
$ Logical1  : logi TRUE
$ Numeric   : num 62.1
$ Logical2  : logi FALSE
Data Frames

A data frame is a table in which each column contains values of one variable or vector and each row contains one set of values from each column. Within each column, all data elements must be of the same data type. However, different columns can be of different data types. The data stored in a data frame can be of numeric, factor or character type. In addition, each column should contain the same number of data elements.

To create a data frame, we can use the function data.frame():

#Create a data frame with employee ID, salaries, and start dates

emp.data <- data.frame(
 emp_id = c("U974","U503","U298","U545","U612"),
 salary = c(623.3,515.2,611.0,729.0,843.25),
 start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
    "2015-03-27")),
 stringsAsFactors = FALSE
)

emp.data
    emp_id salary start_date
1     U974 623.30 2012-01-01
2     U503 515.20 2013-09-23
3     U298 611.00 2014-11-15
4     U545 729.00 2014-05-11
5     U612 843.25 2015-03-27

We can use the function str() to list the underlying structure of the data object.

str(emp.data)
 'data.frame': 5 obs. of  3 variables:
$ emp_id    : chr  "U974" "U503" "U298" "U545" ...
$ salary    : num  623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" ...

We can extract data from the data frame and also add data to the data frame.

#Extract salary information
emp.data$salary
[1] 623.30 515.20 611.00 729.00 843.25
#Add column vector
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
emp.data
    emp_id salary start_date       dept
1     U974 623.30 2012-01-01         IT
2     U503 515.20 2013-09-23 Operations
3     U298 611.00 2014-11-15         IT
4     U545 729.00 2014-05-11         HR
5     U612 843.25 2015-03-27    Finance
More Examples of Data Structures and Types

To learn more about data types and structures and see more examples, watch these available videos below. Part 1 Part 2

Conditional Statements and Looping

Logical and relational operators

Logical and relational operators can be used to execute code based on certain conditions. Common operators include:

_images/Logical_Operators.png
If statements
q <- 3
t <- 5

#if else conditional statement

if (q < t) {
  w <- q + t
} else {
  w <- q - t
}
w

[1] 8
a <- 2
b <- 3
c <- 4

#Using & (and) to test two conditions, both true
if (a < b & b < c) x <- a + b + c
x
[1] 9
Looping

We can use looping to efficiently repeat code without having to write the same code over and over.

The while loop repeats a block of code while the expression in parentheses holds true, and takes the form:

while (condition controlling flow is true)
      perform task
x<-0
while(x<=5){x<-x+1}
x
[1] 6

For loops are used to iterate through a process a specified number of times. A counter variable such as “i” is used to count the number of times the loop is executed:

for (i in start:finish)
    execute task

An example is to add values 1 to 10 to vector y using a for loop.

#Create empty vector
y<-vector(mode="numeric")

#Loop through 1 to 10 to add values to y
for(i in 1:10){
   y[i]<-i
   }
y

[1] 1 2 3 4 5 6 7 8 9 10

To learn more about if statements and logical operators, check out this video

Alternatives to using looping and conditional statements include using the apply function in R. A quick introduction to apply function is provided here.
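For example, a minimal sketch using sapply() to apply a function to every element of a vector, instead of writing a for loop:

#Square each value from 1 to 10 with sapply()
sapply(1:10, function(i) i^2)

[1]   1   4   9  16  25  36  49  64  81 100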

Exploring Data in R

In this section we will go into more detail on how to import and explore data using different packages, functions, and graphics.

R packages and libraries

R packages are collections of functions and data sets developed by the R community. The main repository used in R is CRAN, which has over 10,000 published packages, and more are publicly available elsewhere.

To install most packages, the function install.packages(“package_name”) can be used.

There are other repositories such as Bioconductor that are used in Bioinformatics and other fields.

To learn how to install packages, read a quick description about package installation and watch a tutorial here.

Loading Data

Importing downloaded data

Based on the data object, there are different functions available to import datasets into R. A cheatsheet from RStudio community is provided below with useful functions to load in data.

_images/Import_Cheatsheet.png

An alternative way to import downloaded data is to also click on “Import Dataset” on the upper right hand side under Environment.

_images/RStudio_Console.png

In addition, datasets that are available online can be imported into R using their url.

For example

#install and load data.table library
install.packages("data.table")
library(data.table)

#Use fread function to download data set under the variable mydat
mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
head(mydat)

   V1  V2   V3    V4 V5
1:  1 307  930 36.58  0
2:  2 307  940 36.73  0
3:  3 307  950 36.93  0
4:  4 307 1000 37.15  0
5:  5 307 1010 37.23  0
6:  6 307 1020 37.24  0
Available data sets in R

R has many available datasets that can be loaded using the function data(). Typing data() in the console provides a list of datasets and their descriptions.

_images/Dataset.png

We can load one of these data sets with the function data(), for example data("mtcars"). To look at the first few lines of the data set, we can use the function head(). To see the last few lines of the data set, we can use the function tail().

_images/Load_Dataset.png
Saving Data Object and Files

We can save objects using the save() function.

For example, if we loaded the mtcars dataset from data() function in R, we can then save mtcars object by specifying the object (mtcars) and the file path with an .RData extension. Note, we can save more than one data object in a .RData file.

#load mtcars data set
data("mtcars")

#View mtcars dataset
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#save mtcars in .RData extension
save(mtcars, file = "mtcars.RData")

To load this file into R, we can use the load() function.

load(file = "mtcars.RData")

Another way to save one data object is to save it using a .RDS extension. To save and load a .RDS extension, we can use saveRDS() function and readRDS() function.

#save mtcars to a .RDS file
saveRDS(mtcars, file = "mtcars.rds")

#read in .RDS file and save under mtcars variable name
mtcars <- readRDS(file = "mtcars.rds")

To write an R object or variable to a file, we can use existing functions to write mtcars to a csv file and txt file.

write.csv(mtcars, file = "mtcars.csv")

write.table(mtcars, file = "mtcars.txt", sep="\t")

Data Exploration

Common functions used to initially explore data include functions for mean and standard deviation. In addition, we can use the summary() function to give us some descriptive statistics about a data set.

_images/Data_Exploration.png
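For example, using the mtcars data set loaded in the previous section:

mean(mtcars$mpg)
[1] 20.09062

sd(mtcars$mpg)
[1] 6.026948

summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90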
Manipulating Data

We can use packages to reshape or clean our data prior to analysis. Two main packages that are used are tidyr and reshape2.

To learn more about how to use these package to tidy and reshape data, read this page. In addition, an example of using reshape2 on a cancer data set is shown here.

Plotting and visualizations in R

R supports a variety of graphics in the base package, and numerous other packages provide additional graphics.

For example, we can use a simple plot() function to plot specific variables of the mtcars data set.

plot(mtcars$wt, mtcars$mpg)
_images/Plot_Example.png

Other plot functions include:

_images/Plot_Functions.png

Graphical parameters can be added to these plots including:

_images/Graphical_Parameters.png

Many plot functions also include graphical parameter arguments.

For example, we can add a title and axis labels and change the point size using arguments in the plot function.

plot(mtcars$wt, mtcars$mpg, main="Scatterplot", xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)
_images/Plot_Parameters.png

An alternative way to generate plots is to use ggplot2 package.

install.packages("ggplot2")
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point(size=2) + xlab("Car Weight") + ylab("Miles Per Gallon")
_images/ggplot2_example.png

With ggplot2, we can add other features and variables to our plot.

p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point(aes(colour=factor(cyl), size = qsec)) + xlab("Car Weight") + ylab("Miles Per Gallon")
_images/ggplot2_variable.png

To learn more advanced uses of ggplot2, look at this more detailed step by step tutorial.

R Workshop: RNA-seq Airway Data and Differential Expression Analysis

In this workshop, we will focus on learning how to load packages, import data, perform exploratory analysis with built-in functions as well as functions from installed packages, perform differential expression analysis of RNA-seq data with the DESeq2 package, and visualize the results using ggplot2.

We will work in R Markdown, a .Rmd file written in markdown that contains chunks of embedded R code.

The R Markdown file and two csv files containing the count data (airway_scaledcounts.csv) and the meta data (airway_metadata.csv) are in the R_Workshop folder, which you can download here.

Load Packages

We will begin by loading the necessary packages:

Go ahead and install these packages using install.packages():

  • readr
  • ggplot2
  • dplyr
  • magrittr

We will use packages from the bioconductor repository, which provides tools for analysis of high-throughput genomic data.

source("https://bioconductor.org/biocLite.R")

Use the biocLite("package_name") function to install the packages SummarizedExperiment, DESeq2, and airway.

Note: If package base is not already installed, please install that as well.

packages <- c("readr", "ggplot2", "dplyr", "magrittr")
install.packages(packages, dependencies = TRUE)

source("https://bioconductor.org/biocLite.R")
biocLite("SummarizedExperiment", dependencies = TRUE)
biocLite("DESeq2", dependencies = TRUE)
biocLite("airway", dependencies = TRUE)

Load these libraries using library(“package_name”) function:

#library(base) in case it's not loaded
library(readr)
library(dplyr)
library(ggplot2)
library(magrittr)
library(SummarizedExperiment)
library(DESeq2)
library(airway)

Import Airway Data

If you have not downloaded the R_Workshop folder already, please do that now.

Let’s begin first by setting our working directory. Set your working directory to where the R_Workshop folder is located on your computer.

#Find working directory
getwd()

#Set working directory path
setwd("/Users/tanyatk/Desktop/R_Workshop/")

#Check working directory again
getwd()

Today we will work with the airway dataset. This data set comes from an RNA-Seq experiment, a high throughput sequencing method, on four human airway smooth muscle cell lines treated and untreated with dexamethasone. We will work with the read counts, or expression matrix, for this dataset (i.e. the processed files).

Note: The sequencing files of this experiment are available on the GEO database with GEO Series Number GSE52778, and can be downloaded using SRA toolkit.

Use the read_csv(“file”) function from package readr to import the airway_scalecounts.csv (count data) and airway_metadata.csv (meta data) files from the downloaded folder R_Workshop.

#Use the read_csv() function to import airway_scaledcounts.csv and airway_metadata.csv files into R
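One possible sketch (the variable names scaledcounts and metadata match those used later in this tutorial):

scaledcounts <- read_csv("airway_scaledcounts.csv")
metadata <- read_csv("airway_metadata.csv")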

Use base functions to describe and look at the airway data: scaledcounts and metadata.

  • dim() - Dimensions
  • head() - Print first lines of data
  • tail() - Print last few lines of data
  • str() - Describe data object structure and information
#Use base functions to gain an initial view of the data
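For example, a few quick checks (assuming the variable names above):

dim(scaledcounts)
head(scaledcounts)
str(metadata)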

This data set is also available in a package called “airway” in bioconductor. It is saved as an S4 object (object oriented programming) that contains the count data, meta data, and other information important to the data in fields or slots in the object. To load the airway data we can use the data(“data_name”) function and call airway to add the dataset to our workspace.

You’ll notice that the class is called RangedSummarizedExperiment (i.e. an S4 object), which is used to store matrices of experimental results such as the count data and meta data. This class is from the SummarizedExperiment package which is used often to store sequencing and microarray data.

#call airway data using data() and print airway data to save to workspace
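A minimal sketch:

data("airway")
airway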

Since we imported the same data set twice, we can remove data from our workspace using the rm() function.

Let’s remove the variables scaledcounts and metadata from our workspace. We’ll keep the airway object since it will be easier to work with for downstream analysis.

#remove scaledcounts and metadata variable

Explore Airway Dataset

Let’s first do some preliminary work with the airway dataset. The sample/metadata information is saved under the slot colData which can be extracted using airway@colData or colData(airway).

First check the data structure of the colData(airway) dataset.

Hint: Built in functions to check data structure

Let’s set colData(airway) as a data frame.

Hint: We will use the as.data.frame() function to do this.

#Check mode of colData(airway) and make change the structure to a data frame.

The count data is saved under the slot assay. We can extract the count matrix by calling airway@assay or assay(airway). We can also use descriptive statistics to look at the expression acrosss samples. We will sum the expression of each column and scale by 1e6 to get scaled expression value. We will then use the summary() function to look at the range of expression between the samples.

Determine a way to sum the expression of each column.

Hint: You can use a for loop, apply function, or base functions such as colSums()

#Sum the expression of each column, divide by 1e6
#Use summary function to see the range of values between each sample

Differential Expression Analysis using DESeq2

We will use DESeq2 package for differential expression analysis of the airway data set to find differentially expressed genes between untreated and treated samples. We will first load DESeq2 and set up the data to be compatible with DESeq by using the function DESeqDataSet().

We can use the help(“function_name”) or ?function_name to look up the function to get a description.

A description or help pages will show up under the Help tab in the bottom right corner.

#Look up DESeqDataSet() function description

We can also go to the bioconductor page for DESeq2 and look at the manual for functions as well as a tutorial of using the package itself. Click here to see the page.

The function DESeqDataSet includes an argument called design which asks for a formula that expresses how the counts for each gene depends on the variables in colData. In this case we choose variables cell and dex because we care about the cell line and which samples are treated with dexamethasone versus which samples are untreated controls.

DE_airway <- DESeqDataSet(airway, design = ~ cell + dex)

DE_airway

Before we continue, we must set our control group as our reference level for comparison in our differential expression analysis.

DE_airway@colData$dex <- relevel(DE_airway@colData$dex, ref = "untrt")

Now we wil run the differential expression analysis steps through the function DESeq(). Again we can look up the function to learn more about what it does and the arguments needed to run it. We use the results() function to generate a results table with log2 fold changes, p values and adjusted p values for each gene. The log2 fold change and the Wald test p value is based on the last variable in the design formula, in this case variable dex. Therefore our results will show which genes are differentially expressed between the untreated and treated groups.

help("DESeq")

DE_airway <- DESeq(DE_airway)
res <- results(DE_airway)

res

How do we order the results table (res) based on the p-value? There are already available functions in R that we can use to sort the dataframe. Hint: Use function order() to order the rows based on p-value

#Use order() to order the results table based on the p-value

In DESeq2, the function plotMA generates an MA Plot commonly used to visualize the differential expression results. The plot shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet. Points represent genes and will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.

plotMA(res, ylim=c(-2,2))

Manipulate and Visualize Results

Let’s add a column that tell us whether each gene is significant. Using the mutate() function from library dplyr, we can add a column showing whether the significance is TRUE or FALSE based on cutoff padj < 0.01.

#change res to a tibble format to work with dplyr

res <- tbl_df(res)

#add sig column to show which genes are significant or not by using mutate() from dplyr

res <- mutate(res, sig=padj<0.01)

# We can use the symbol %>% from library magrittr to represent a pipe. Pipes take the output from one function and feed it to the first argument of the next function. You may have seen something similar in unix with |

res <- res %>% mutate(sig=padj<0.01)

head(res)

Let’s use the filter() function from dplyr to filter out results based on padj < 0.01, and write this to a csv file using write_csv() function from readr.

Try using piping format %>% to do this!

# Filter res based on cutoff padj < 0.01 and save this result into a csv file called significant_results.csv

What if we want to generate our own plots? We can use ggplot2 to create our own volcano plot of the differential expression results between the untreated and treated groups.

Now let’s try generating a volcano plot using ggplot2?

Hint: log2FoldChange for x-axis, -1*log10(pvalue) for y-axis, sig to color the points.

Make sure to include argument for points and include the title “Volcano plot”

Bonus: Change the axis titles to something more readable and change the point shapes, or play around with any other parameters to get a feel for how ggplot2 works.

#Create Volcano plot using ggplot2

How would you generate the same MA plot above using ggplot2? Hint: Use baseMean for x-axis, log2FoldChange for y-axis, sig for color.

Make sure to have points and to use a log10 scale for the x-axis (i.e. scale_x_log10() ).

Add the title “MA plot” to your plot as well.

#Create MA plot using ggplot2

We can look at our session information including the packages we loaded and worked with.

sessionInfo()

R Workshop: RNA-seq Airway Data and Differential Expression Analysis

In this workshop, we will focus on learning how to load packages, import data, perform exploratory analysis with built in functions as well as functions from packages installed, performing differential expression analysis of RNA-seq data with the DESeq2 package, and visualizing the results using ggplot2.

We will work in R Markdown, a .Rmd file written in Markdown that contains chunks of embedded R code.

The R Markdown file and two csv files, one containing count data (airway_scaledcounts.csv) and one containing metadata (airway_metadata.csv), are in the R_Workshop folder, which you can download here.

Load Packages

We will begin by loading the necessary packages:

Go ahead and install these packages using install.packages():

  • readr
  • ggplot2
  • dplyr
  • magrittr

We will use packages from the bioconductor repository, which provides tools for analysis of high-throughput genomic data.

source("https://bioconductor.org/biocLite.R")

Use the biocLite("package_name") function to install the packages SummarizedExperiment, DESeq2, and airway.

Note: the base package is included with every R installation, so it does not need to be installed separately.

packages <- c("readr", "ggplot2", "dplyr", "magrittr")
install.packages(packages, dependencies = TRUE)

source("https://bioconductor.org/biocLite.R")
biocLite("SummarizedExperiment", dependencies = TRUE)
biocLite("DESeq2", dependencies = TRUE)
biocLite("airway", dependencies = TRUE)

Load these libraries using the library("package_name") function:

# base is attached automatically when R starts, so it does not need to be loaded here
library(readr)
library(dplyr)
library(ggplot2)
library(magrittr)
library(SummarizedExperiment)
library(DESeq2)
library(airway)

Import Airway Data

If you have not downloaded the R_Workshop folder already, please do that now.

Let’s begin first by setting our working directory. Set your working directory to where the R_Workshop folder is located on your computer.

#Find working directory
getwd()

#Set working directory path
setwd("/Users/tanyatk/Desktop/R_Workshop/")

#Check working directory again
getwd()

Today we will work with the airway dataset. This data set comes from an RNA-Seq experiment, a high-throughput sequencing method, on four human airway smooth muscle cell lines, treated and untreated with dexamethasone. We will work with the read counts (the expression matrix) for this dataset, i.e. the processed files.

Note: The sequencing files of this experiment are available on the GEO database with GEO Series Number GSE52778, and can be downloaded using SRA toolkit.

Use the read_csv("file") function from the readr package to import the airway_scaledcounts.csv (count data) and airway_metadata.csv (metadata) files from the downloaded R_Workshop folder.

#Use the read_csv() function to import the airway_scaledcounts.csv and airway_metadata.csv files into R

scaledcounts <- read_csv("airway_scaledcounts.csv")
metadata <-  read_csv("airway_metadata.csv")

Use base functions to describe and look at the airway data: scaledcounts and metadata.

  • dim() - Dimensions
  • head() - Print first lines of data
  • tail() - Print last few lines of data
  • str() - Describe data object structure and information
#Use base functions to gain an initial view of the data

#Look at scaledcounts variable
dim(scaledcounts)

head(scaledcounts)

tail(scaledcounts)

str(scaledcounts)

#Look at metadata variable
dim(metadata)

head(metadata)

tail(metadata)

str(metadata)

This data set is also available in a Bioconductor package called “airway”. It is saved as an S4 object (object-oriented programming) that contains the count data, metadata, and other important information in fields, or slots, of the object. To load the airway data we can use the data("data_name") function and then call airway to add the dataset to our workspace.

You’ll notice that the class is called RangedSummarizedExperiment (i.e. an S4 object), which is used to store matrices of experimental results such as the count data and meta data. This class is from the SummarizedExperiment package which is used often to store sequencing and microarray data.

#call airway data using data() and print airway data to save to workspace

data("airway")
airway

Since we imported the same data set twice, we can remove data from our workspace using the rm() function.

Let’s remove the variables scaledcounts and metadata from our workspace. We’ll keep the airway object since it will be easier to work with for downstream analysis.

#remove scaledcounts and metadata variable
rm(scaledcounts)

rm(metadata)

Explore Airway Dataset

Let’s first do some preliminary work with the airway dataset. The sample/metadata information is saved under the slot colData which can be extracted using airway@colData or colData(airway).

First check the data structure of the colData(airway) dataset.

Hint: Built in functions to check data structure

Let’s set colData(airway) as a data frame.

Hint: We will use the as.data.frame() function to do this.

#Check the mode of colData(airway) and change the structure to a data frame.

mode(colData(airway))

dat_airway <- as.data.frame(colData(airway))

dat_airway

The count data is saved under the assays slot. We can extract the count matrix with the accessor assay(airway). We can also use descriptive statistics to look at the expression across samples. We will sum the expression of each column and divide by 1e6 to get scaled expression values. We will then use the summary() function to look at the range of expression between the samples.

Determine a way to sum the expression of each column.

Hint: You can use a for loop, apply function, or base functions such as colSums()

#Sum the expression of each column, divide by 1e6
#Use summary function to see the range of values between each sample

 head(assay(airway))

 summary(colSums(assay(airway))/1e6)

Differential Expression Analysis using DESeq2

We will use DESeq2 package for differential expression analysis of the airway data set to find differentially expressed genes between untreated and treated samples. We will first load DESeq2 and set up the data to be compatible with DESeq by using the function DESeqDataSet().

We can use help("function_name") or ?function_name to look up a function and get a description.

A description or help pages will show up under the Help tab in the bottom right corner.

#Look up DESeqDataSet() function description

 help("DESeqDataSet")

 ?DESeqDataSet

We can also go to the bioconductor page for DESeq2 and look at the manual for functions as well as a tutorial of using the package itself. Click here to see the page.

The function DESeqDataSet includes an argument called design which asks for a formula that expresses how the counts for each gene depend on the variables in colData. In this case we choose the variables cell and dex because we care about the cell line and which samples are treated with dexamethasone versus which samples are untreated controls.

DE_airway <- DESeqDataSet(airway, design = ~ cell + dex)

DE_airway

Before we continue, we must set our control group as our reference level for comparison in our differential expression analysis.

DE_airway@colData$dex <- relevel(DE_airway@colData$dex, ref = "untrt")

Now we will run the differential expression analysis steps through the function DESeq(). Again, we can look up the function to learn more about what it does and the arguments needed to run it. We use the results() function to generate a results table with log2 fold changes, p values, and adjusted p values for each gene. The log2 fold change and the Wald test p value are based on the last variable in the design formula, in this case the variable dex. Therefore our results will show which genes are differentially expressed between the untreated and treated groups.

help("DESeq")

DE_airway <- DESeq(DE_airway)
res <- results(DE_airway)

res

How do we order the results table (res) based on the p-value? There are already available functions in R that we can use to sort the dataframe. Hint: Use function order() to order the rows based on p-value

#Use order() to order the results table based on the p-value

res[order(res$pvalue),]

In DESeq2, the function plotMA generates an MA Plot commonly used to visualize the differential expression results. The plot shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet. Points represent genes and will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.

plotMA(res, ylim=c(-2,2))

Manipulate and Visualize Results

Let’s add a column that tells us whether each gene is significant. Using the mutate() function from the dplyr library, we can add a column showing whether each gene is significant (TRUE or FALSE) based on the cutoff padj < 0.01.

#change res to a tibble format to work with dplyr

res <- tbl_df(res)

#add sig column to show which genes are significant or not by using mutate() from dplyr

res <- mutate(res, sig=padj<0.01)

# We can use the symbol %>% from library magrittr to represent a pipe. Pipes take the output from one function and feed it to the first argument of the next function. You may have seen something similar in unix with |

res <- res %>% mutate(sig=padj<0.01)

head(res)

Let’s use the filter() function from dplyr to filter out results based on padj < 0.01, and write this to a csv file using write_csv() function from readr.

Try using piping format %>% to do this!

# Filter res based on cutoff padj < 0.01 and save this result into a csv file called significant_results.csv

res %>%
  filter(padj < 0.01) %>%
  write_csv("significant_results.csv")

What if we want to generate our own plots? We can use ggplot2 to create our own volcano plot of the differential expression results between the untreated and treated groups.

Now let’s try generating a volcano plot using ggplot2.

Hint: log2FoldChange for x-axis, -1*log10(pvalue) for y-axis, sig to color the points.

Make sure to include a geom for the points (geom_point()) and add the title “Volcano plot”.

Bonus: Change the axis titles to something more readable and change the point shapes, or play around with any other parameters to get a feel for how ggplot2 works.

#Create Volcano plot using ggplot2

ggplot(res, aes(log2FoldChange, -1*log10(padj), col=sig)) + geom_point() + ggtitle("Volcano plot")

res %>% ggplot(aes(log2FoldChange, -1*log10(padj), col=sig)) + geom_point() + ggtitle("Volcano plot")

How would you generate the same MA plot above using ggplot2? Hint: Use baseMean for x-axis, log2FoldChange for y-axis, sig for color.

Make sure to have points and to use a log10 scale for the x-axis (i.e. scale_x_log10() ).

Add the title “MA plot” to your plot as well.

#Create MA plot using ggplot2

ggplot(res, aes(baseMean, log2FoldChange, col=sig)) + geom_point() + scale_x_log10() + ggtitle("MA plot")

We can look at our session information including the packages we loaded and worked with.

sessionInfo()

Workshop 5: Machine Learning

This workshop provides a basic introduction to machine learning. First we will talk about data preparation and exploration. Then we will introduce a general pipeline for unsupervised and supervised learning.

You are expected to study the following content:

In the workshop, we will do some basic data exploration and modeling.

Tutorials

Data preparation

The data in machine learning is presented in the form of a matrix (data frame in R) consisting of N rows (samples or instances) and M columns (features).

Features

Features can be different types:

  • Binominal (TRUE/FALSE)
  • Categorical/nominal (different classes)
  • Text - not supported by all applications
  • Numeric (Real numbers)
  • Integer (Natural numbers)

Note that a common mistake is to mischaracterize features. Nominal values are often represented by numbers but cannot be ordered or compared. If the feature type is not set properly, errors can occur.

Example: Many choose to set education status as 1: no college, 2: college degree, 3: post-graduate. Ask yourself, does 1.5 have a meaning?

Example: The stage of a tumor is represented by numbers (1, 2, ...) but is nominal.

Special features include the ID, label, batch, etc. They are treated differently. The ID is excluded from models and must be unique. The label is used for supervised learning (classification). A dataset can only have one label (per run).

Meta-features are features derived from the measured features. For example, BMI is derived from weight and height, and PCA vectors are derived from all features.

Data exploration

Explore the data to validate it and potentially find new patterns. Data validation means showing that relations known from the literature are present in your data. For example, if a gene has been strongly associated with your case condition, you should observe the same pattern in your data, or if you have blood pressure and heart disease, they should be correlated. In addition, you should show your data is representative of the real population. For example, if you are working on samples from the USA and you have an obesity feature (not a label), you should check that the fraction of obese people is comparable to the fraction in the USA, and that the ratio of female to male samples is about 50%.

Draw many plots. Show how the features are distributed and whether the distributions match what you expect. Show your data is balanced (e.g. female and male have similar ratios) and, if not, be aware of the imbalance (for example with rare diseases). Skewed distributions in your data can lead to misinterpretation.

In order to validate your data you can look for correlations between numeric features and associations between nominal features. The known relations in the literature should be validated by your data. You might find new relations and associations which can further be studied.

Let’s look at an example on a subset of features from patients with type 2 diabetes.

     age       sex      education   living     smoking      weight           height           LDL             HDL
Min.   : 4.0   F:1688   1   :976   1   :1319   0:2148   Min.   : 16.00   Min.   : 83.0   Min.   : 11.0   Min.   : 16.0
1st Qu.:45.0   M: 757   2   :753   2   : 872   1: 130   1st Qu.: 61.00   1st Qu.:151.0   1st Qu.: 56.0   1st Qu.: 45.0
Median :52.0            3   :377   NA's: 254   2:  23   Median : 71.00   Median :157.0   Median :106.0   Median : 60.0
Mean   :51.6            NA's:339               3: 144   Mean   : 73.73   Mean   :157.9   Mean   :104.3   Mean   :106.8
3rd Qu.:60.0                                            3rd Qu.: 87.00   3rd Qu.:165.0   3rd Qu.:141.0   3rd Qu.:160.0
Max.   :89.0                                            Max.   :161.00   Max.   :196.0   Max.   :700.0   Max.   :665.0
NA's   :15                                              NA's   :332      NA's   :620     NA's   :587     NA's   :404

Education is the level of education (1=finished high school, 2=college degree, 3=post-graduate) and living is the type of living environment (1=city/town, 2=village). Smoking is another nominal feature (0=non-smoker, 1=occasional smoker, 2=light smoker, 3=heavy smoker). HDL and LDL are the good and bad cholesterol levels in blood, respectively. Missing data is represented with NAs.

1. Correlation matrix: calculate the correlation between the features and draw a heatmap. Look at the highly correlated features. Make sure the correlations are supported by the literature and note whether they are direct or indirect. A minimal R sketch follows the figure below.

_images/corr_matrix.png
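
A minimal correlation-heatmap sketch in base R, assuming the features summarized above are in a data frame called diabetes (the name is an assumption):

# Keep only the numeric features, compute pairwise correlations, and draw a heatmap
num  <- diabetes[, sapply(diabetes, is.numeric)]
cors <- cor(num, use = "pairwise.complete.obs")   # ignore NAs pairwise
heatmap(cors, symm = TRUE)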

2. Association rules: find associations between sets of feature values and another feature value. Each rule associates a set of feature values (the left-hand side) with another feature value (the right-hand side). The rule certainty is measured using two parameters: support (how much of the data contains the rule) and confidence (out of all data points matching the left-hand side, how many also follow the rule). Association rules apply to nominal features or discretized values.

> library(arules)
> rules <- apriori(data, parameter=list(support=0.10, confidence=0.50))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.5    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 244

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[26 item(s), 2445 transaction(s)] done [0.00s].
sorting and recoding items ... [23 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [477 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
> rules
set of 477 rules
    > inspect(head(rules, n = 10, by ="lift"))
     lhs                                             rhs                support   confidence lift     count
[1]  {sex=M,smoking=0,weight=[80,161]}            => {height=[162,196]} 0.1132924 0.7527174  2.997384 277
[2]  {sex=M,weight=[80,161]}                      => {height=[162,196]} 0.1525562 0.7474950  2.976588 373
[3]  {sex=M,smoking=0,height=[162,196]}           => {weight=[80,161]}  0.1132924 0.8683386  2.920341 277
[4]  {sex=M,height=[162,196]}                     => {weight=[80,161]}  0.1525562 0.8477273  2.851022 373
[5]  {weight=[80,161],height=[162,196]}           => {sex=M}            0.1525562 0.8555046  2.763156 373
[6]  {smoking=0,weight=[80,161],height=[162,196]} => {sex=M}            0.1132924 0.8195266  2.646952 277
[7]  {height=[162,196]}                           => {weight=[80,161]}  0.1783231 0.7100977  2.388155 436
[8]  {weight=[80,161]}                            => {height=[162,196]} 0.1783231 0.5997249  2.388155 436
[9]  {sex=M,smoking=0}                            => {height=[162,196]} 0.1304703 0.5885609  2.343699 319
[10] {smoking=0,height=[162,196]}                 => {weight=[80,161]}  0.1382413 0.6954733  2.338971 338

You should make sure that all the top rules are meaningful. For example, {age=[57,89]} => {education=1} makes sense since the data was collected in a medium-sized city in the south of Iran, and the older people were most likely less educated.

3. Cognitive map: a cognitive map shows the relations already known for your data together with the new ones you found.

_images/cognitive_map.png

Data preparation

The most important but neglected part of machine learning and data mining is preparing the data. If your data is invalid, no matter what skills you have, the results will be invalid. The goal of data preparation is to make sure the data is representative and correct.

1. Typos are the most common error in data. Most datasets are collected over time and manually entered by operators. For any nominal feature you should check the levels in the data; for example, for sex make sure you only have 2 levels (F/M or female/male). For numeric values draw boxplots and histograms. Make sure the data follows the expected distribution and estimates (the mean and standard deviation are as expected). If you have nominal features, make sure the numeric values within each level are sensibly distributed; for example, if you have sex and age in your data, make sure the age distributions for females and males are comparable.

2. Missing data is common. Make sure missing values are presented in a format recognized by the tool and code you use; some tools take NA or blanks as missing, some use “?”. Make a table of which data points are missing and how often. Try to understand why, and whether values are missing at random or with a pattern. Decide how to handle them: some methods accept missing values and some don’t, so understand how missing values are interpreted. If you remove them, have a clear explanation of your criteria. Some might choose to impute missing data from nearby data points when possible. A minimal sketch for tallying missing values follows.
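
A minimal sketch for counting missing values, again assuming a data frame called diabetes:

colSums(is.na(diabetes))          # number of missing values per feature
mean(!complete.cases(diabetes))   # fraction of samples with at least one missing value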

3. Normalization is an important step to make samples and features comparable within and between datasets. Choose an appropriate normalization method and explain how it was applied. In the case of classification, the test set has to be normalized in the same way as, but independently of, the train data to avoid leaking train information into the test set.

Expression data is usually log2 transformed and then quantile normalized. RMA and frozen-RMA are versions of quantile normalization common for microarray datasets that handle outliers better. Z-score normalization is an intuitive method but flattens the data (it rescales every feature to mean 0 and standard deviation 1), while range normalization keeps the shape of the distribution but is very sensitive to outliers. Centering numeric values around zero is good practice for some models. It is also good practice to put features on the same scale so that the weights a model assigns to each feature can be compared; for example, if you have one feature on the order of thousands and another on the order of 10, the weights might seem smaller for the former, while in truth the weights cannot be directly compared. Note that normalization can be applied to features (normalizing a measurement over all samples) or to samples (e.g. correcting for batch effects). A minimal sketch of log2, z-score, and range normalization follows.
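
A minimal normalization sketch, assuming a numeric expression matrix expr with samples in rows and features in columns (the name and orientation are assumptions):

log_expr <- log2(expr + 1)     # log2 transform with a pseudo-count
z_expr   <- scale(log_expr)    # z-score each feature: mean 0, standard deviation 1
range01  <- apply(log_expr, 2, function(x) (x - min(x)) / (max(x) - min(x)))   # range normalization to [0, 1]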

4. Feature selection and reduction is used to choose relevant features. Note that the number of features should be significantly smaller than the sample size (M << N). In general a model with fewer parameters is a better model and is less likely to overfit. Redundant features (usually very highly correlated features) should be removed for some models (any model that takes a determinant of the data matrix). Principal Component Analysis is a good way to reduce the number of features while maintaining the variability. Feature selection can be done based on variability (keeping highly variable features), fold changes (differences in means between label classes, such as differentially expressed genes in gene expression data), or recursively by fitting a classification model and using the feature weights (choosing the features with the highest importance). Feature reduction can be done based on correlation (removing highly correlated features) or invariability (removing features that have similar distributions between classes). Note that in the case of classification, feature selection should be done only on the train data and not on the test data. A minimal variance-based selection sketch follows.
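
A minimal variance-based selection sketch on the same hypothetical expr matrix:

vars <- apply(expr, 2, var)                     # variance of each feature
top  <- order(vars, decreasing = TRUE)[1:100]   # indices of the 100 most variable features
expr_selected <- expr[, top]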

After data preparation, you should be able to explain the data in terms of what features there are and what distributions they follow. You should show your data is representative and balanced. You should handle missing data in a rational way. You should have a well established method for choosing features.

Workshop 5: Learning

After preparing the data you can learn from it. There are two main learning approaches: supervised, when we have labels and want to learn a model that uses them to predict future samples (classification), and unsupervised, when we don’t know the labels and simply want to find patterns and groups (clustering).

You can choose to code your models in R or Python or use specialized interactive platforms such as RapidMiner, KNIME, or Weka.

Unsupervised learning

“Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.” read more…

_images/unsupervised_flowchart.png

Dimensionality reduction

It is hard to interpret data in the full feature space (M dimensions). A first exploration step is to reduce the dimensionality to 2 dimensions and plot them in x-y coordinates, which are human readable.

  • Principal Component Analysis (PCA) is the most commonly used dimensionality reduction method. Each principal component is a linear combination of the features, \(\sum{weights \cdot features}\), with the weights chosen to capture the most variability across samples. The components are uncorrelated and can be used as features for further clustering or classification. Common practice is to plot the first principal components and color the samples by different features to see how the samples separate. Keep in mind that while PCA shows the variability in the data, that variability is not always meaningful and might reflect noise or batch effects. A minimal prcomp() sketch follows this list.
_images/grz_tissues_PCA.png
  • Singular-value decomposition (SVD) factorizes the data matrix A into orthogonal matrices U and V and a diagonal matrix S of singular values:
\[A_{N \times M} = U_{N \times N} S_{N \times M} V^{T}_{M \times M}\]

where N is the sample size (rows) and M is the feature size (columns). Read more here.
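
A minimal PCA sketch, assuming a numeric matrix X with samples in rows and a vector groups of sample labels (both names are assumptions):

pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                        # proportion of variance explained per component
plot(pca$x[, 1], pca$x[, 2],
     col = as.factor(groups),       # color samples by the hypothetical group labels
     xlab = "PC1", ylab = "PC2")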

Clustering

Clustering algorithms try to divide the data samples based on some sort of similarity into different clusters. An example is to cluster single cell gene expression data to find tissue types.

  • Hierarchical (agglomerative) clustering is the most basic clustering algorithm. Each sample starts in its own cluster, and the most similar clusters are iteratively merged into super-clusters. This can be done bottom-up (agglomerative) or top-down (divisive) to form a dendrogram, which you can cut at any level to obtain different numbers of clusters. Hierarchical clustering is usually shown as a heatmap. A minimal hclust()/kmeans() sketch follows this list.
_images/hierarchical_clustering.png
  • K-means is a method to cluster the samples into K groups with different means. The algorithm starts by choosing K random points as means in the feature space and assigns each sample to the cluster with the closest mean (by some similarity measure). The means are then iteratively moved to best fit the samples until no further improvements are made. This method is not deterministic and depends on the initial random guesses; bad guesses will result in poor clustering.
_images/k-means.png
  • X-means is a generalization of K-means that tries different values of K to find the K that best represents the data.
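
A minimal clustering sketch on the same hypothetical matrix X:

hc <- hclust(dist(X))              # hierarchical clustering on Euclidean distances
plot(hc)                           # dendrogram
clusters_hc <- cutree(hc, k = 4)   # cut the dendrogram into 4 clusters

km <- kmeans(scale(X), centers = 4, nstart = 25)   # K-means with 4 clusters and 25 random starts
table(clusters_hc, km$cluster)     # compare the two clusterings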

The following image shows different clustering algorithms run on different datasets (2 dimensions).

_images/clustering.png

Note that all these clustering methods rely on some notion of similarity, which can be defined in different ways. Euclidean distance is the most common measure; it is simply the straight-line distance between two samples in the feature space. For a detailed comparison read this article:

Shirkhorshidi, Ali Seyed, Saeed Aghabozorgi, and Teh Ying Wah. “A comparison study on similarity and dissimilarity measures in clustering continuous data.” PloS one 10.12 (2015): e0144059.

Keep in mind that any data will produce clusters; clusters obtained by any method are simply a way to explore the data. You can measure the fitness of a clustering by comparing the within-cluster similarity to the between-cluster dissimilarity. Commonly used measures are the Silhouette coefficient and the Davies–Bouldin index; a small silhouette sketch follows.
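
A minimal silhouette check, assuming the cluster package is installed and reusing km and X from the sketch above:

library(cluster)
sil <- silhouette(km$cluster, dist(scale(X)))   # silhouette width of each sample
summary(sil)                                    # average silhouette width per cluster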

Supervised learning

Another common learning approach is to learn a predictive model on labeled data to make predictions on new, unknown samples. This is referred to as classification. The model is learned on a subset of the data (the train set) and tested on a smaller subset (the test set). So the first step after data preparation is to randomly split the data into two sets (usually 75% train and 25% test). Check that your train and test sets are not biased (by age, sex, or label). The purpose of holding out a test set is to later verify the model and ensure we are not over-fitting. The most critical part of classification is to make sure no information from the train data leaks into the test data, whether during normalization, feature selection, or when learning the model. A minimal train/test split sketch follows the figure below.

_images/supervised_flowchart.png
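
A minimal train/test split sketch, assuming a data frame dat with one row per sample and a label column (both names are assumptions):

set.seed(1)                                                 # for reproducibility
train_idx <- sample(nrow(dat), size = round(0.75 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
table(train$label)                                          # check the class balance in each split
table(test$label)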

Classification models

Probability based

These models use probability to predict the label.

  • Naive Bayes uses Bayes’ theorem on the feature distributions and class probabilities. It is usually used as a baseline model that other models should at least match. Applies to nominal labels.
  • K nearest neighbors (KNN) predicts each sample based on a majority vote of its K nearest neighbors (the K most similar samples). It works well for tumor/tissue samples. The choice of distance (or kernel) determines which similarity measure is applied. Applies to nominal labels.
Regression

This group of classifiers separates the samples using regression on the numeric features. The separation can be a single line (a weighted sum of the features estimating the label) or a hyperplane in M-dimensional space (where M is the number of features). A minimal glm() sketch follows this list.

  • Linear regression is a simple model used to find the effect size of features. It’s the first model you would run to check whether there is a significant relation between your features and the label; it gives you weights (coefficients and an intercept) and applies to numeric labels.
  • Logistic regression is similar to linear regression but applies to nominal labels.
  • Support Vector Machine (SVM) is one of the strongest and most popular learning models. It finds support vectors that define a hyperplane separating the samples; adjusting the kernel can make it very powerful. Applies to binominal labels. The prediction output can be expressed as a confidence in [0, 1] for each class.
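
A minimal logistic regression sketch on the hypothetical train/test split above (the feature and label column names are assumptions):

fit <- glm(label ~ feature1 + feature2, data = train, family = binomial)   # label assumed binary (0/1)
summary(fit)                                               # coefficients (weights) and p-values
pred <- predict(fit, newdata = test, type = "response")    # predicted probabilities on the test set
table(predicted = pred > 0.5, actual = test$label)         # simple confusion table
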
Trees

At each iteration a tree randomly samples the train data (bootstrap sampling, also called bagging) and chooses a random subset of the features. It then finds the feature and threshold that split the sampled data so that the labels are best separated. At each node a decision is made, and by following the branches we reach a leaf node, which is marked with the most probable label. The label should be nominal for trees.

Below is a decision tree to predict whether a sample is a vampire. Each branch asks a question and divides the samples accordingly. Following the branches you get to a leaf, which is labeled with the majority label of the train samples that end up there.

_images/vampire-decsion-tree.jpg

Trees are very popular for gene expression since they show which genes are most important (top nodes) and what the splitting criteria are (if expression > threshold it’s a case, otherwise a control), so the results can be tested in the lab. Tree parameters matter a lot. Trees are randomized algorithms (random sampling). Unlike regression models, trees are not linear. Trees can also be used for feature selection.

  • Decision tree is a binary tree with each node being a decision on a feature.
  • Random forest is a collection of decision trees, each tree capturing one aspect of the data. It works well when there are distinct sub-groups (e.g. tissue, race, sex) within the labeled classes. A minimal randomForest sketch follows this list.
  • Decision Stump is a 1 level decision tree. Gives you the best feature to separate the data.
  • Gradient boosted trees use boosting to combine weak decision trees. Each leaf has a score, and the final label is predicted from the sum of the leaf scores across the trees. This way, if by chance one tree is weak, the other trees make up for it.
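
A minimal random forest sketch on the same hypothetical split, assuming the randomForest package is installed:

library(randomForest)
train$label <- as.factor(train$label)                 # the label must be a factor for classification
test$label  <- as.factor(test$label)
rf <- randomForest(label ~ ., data = train, ntree = 500)
rf                                                    # prints the out-of-bag error estimate
varImpPlot(rf)                                        # feature importance
rf_pred <- predict(rf, newdata = test)
table(predicted = rf_pred, actual = test$label)
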
Rule induction

Rule induction algorithms are similar to decision trees in that they find rules that split the data and build a hierarchy of rules. Unlike decision trees they use all the data and do not apply sampling or bagging.

Neural networks

Inspired by the way human brains work, neural networks are commonly used in Natural Language Processing (NLP), speech recognition, and Artificial Intelligence (AI). They are black-box learning models, meaning we do not know exactly how the model works internally; we are mainly interested in their prediction ability. Neural networks can learn extremely complex functions: a 4-layer network can learn to drive a car or recognize faces, and a 2-layer network can learn to detect shadows and read handwriting. Despite their power, neural networks are less favored in bioinformatics because of their black-box nature.

Parameters

It is very important that you completely understand how the model you are using works. You must be able to justify choosing that model and be fully aware of the algorithm and the parameters it takes. R will allow you to run any model with default parameters, but setting the correct parameters is key to finding the best model. In order to find the best parameter combination, you can use optimization grids: try every possible value of each parameter at some resolution. For example, for two parameters of a decision tree, minimal confidence and maximal gain, you can try all combinations {0.00, 0.01, 0.02, …, 0.50} x {0.00, 0.01, 0.02, …, 1.00}, which is roughly 50 x 100 = 5,000 combinations. With 10-fold cross-validation that would be about 50,000 models to learn for just 2 parameters.

Kernels

Every classification model uses a pattern analysis method to find patterns within label classes. This pattern analysis can be shaped by different kernels. A kernel is, loosely, a function of the features that shapes the decision boundary. For example, a linear kernel is defined as \(\sum{weights \cdot X}\), a polynomial kernel of degree \(d\) as \(\sum{weights \cdot X^d}\), and the radial basis function as \(exp(- \frac{|X-X'|^2}{2\sigma ^2})\). Understand the kernel you choose, and in general prefer simpler kernels (if linear works fine, don’t go to an ANOVA kernel). Complicated models put you at risk of over-fitting. Whatever choice you make in life, you should be able to justify it.

Cross validation

When learning the model on the train set, you should always use cross-validation. Cross-validation is a method that divides the train data into X portions (folds). In 10-fold cross-validation, the data is divided into 10 portions; each time, 90% of the data is used to learn the model and the model is tested on the remaining 10%. The final performance is the average over all 10 models. A minimal sketch follows the figure below.

_images/cross_validation.png
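
A minimal manual 10-fold cross-validation sketch, reusing the hypothetical train data and logistic regression model from above:

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(train)))   # randomly assign each row to one of 10 folds
acc <- numeric(10)
for (k in 1:10) {
  fit  <- glm(label ~ feature1 + feature2, data = train[folds != k, ], family = binomial)
  prob <- predict(fit, newdata = train[folds == k, ], type = "response")
  acc[k] <- mean((prob > 0.5) == (train$label[folds == k] == 1))   # assumes a 0/1 label
}
mean(acc)   # average performance over the folds
sd(acc)     # the spread shows how robust the model is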

Cross-validation helps to avoid overfitting. Also, by calculating the standard deviation of the performances, we can see how robust the model is. The following figure illustrates how cross-validation helps find the best fit. The top left model is overfitted (while the average performance will be good, the standard deviation will be high). The bottom right model is under-fitting, where the average performance will be low.

_images/CV_fit.gif

Note that cross-validation is applied while learning a model on the train set. It is a good approach to build a model, but afterwards we still need to test the model on independent test data. Why? Because the cross-validation splits are correlated with each other, cross-validation measures a training performance rather than a test performance.

Fitness of the model

A classification model is measured by its fit: how similar the predicted labels are to the actual labels. We can obtain a very high apparent fit by increasing the number of features (M); this situation is referred to as overfitting. It means that instead of learning general patterns in the data we are learning noise and sample-specific patterns, so that although we do relatively well on the train set, the model fails to perform well on new data (the test set) due to a lack of generalization. In contrast, underfitting is when our model over-generalizes and cannot perform well even on the train set. Underfitting is easier to detect because the model has low performance (low accuracy or precision), while overfitting can be tempting because you see inflated results.

Nominal labels

A confusion matrix is a table showing how the samples were classified: the columns show the actual labels and the rows show the predicted labels.

_images/confusion_matrix.png
TN=true negative (samples predicted to be in class negative and that was correct)
TP=true positive (samples predicted to be in class positive and that was correct)
FN=false negative (samples predicted to be in class negative and that was incorrect)
FP=false positive (samples predicted to be in class positive and that was incorrect)

If you show the performance of the model as a confusion matrix, fitness can be measured by 4 main criteria:

  • Accuracy
\[\frac{TP + TN}{TP + FP + TN + FN}\]
  • Precision
\[\frac{TP}{TP + FP}\]
  • Recall
\[\frac{TP}{TP + FN}\]
  • Area Under the Curve (AUC): I will not go into detail, but AUC measures “the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative one (assuming positive ranks higher than negative)”.
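
These criteria can be computed directly from a confusion table. A minimal sketch, reusing the predicted probabilities pred from the logistic regression example above and assuming the negative class is the first column of the table:

cm <- table(predicted = pred > 0.5, actual = test$label)   # 2 x 2 confusion table
TP <- cm["TRUE", 2];  FP <- cm["TRUE", 1]
TN <- cm["FALSE", 1]; FN <- cm["FALSE", 2]
accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
c(accuracy = accuracy, precision = precision, recall = recall)
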
Numeric labels

In the case of numeric labels, we have to measure the error of the prediction. Here the prediction is not binary, so we need to measure how close the predicted values are to the real values. The fitness measures for numeric labels are:

  • Mean Squared Error (MSE)
\[MSE = \frac{1}{N} \sum{(label_{predicted} - label_{actual})^2}\]
  • Root Mean Squared Deviation (RMSD)
\[RMSD = \sqrt{\frac{\sum{(label_{predicted} - label_{actual})^2}}{N}}\]
Learning curves

A receiver operating characteristic (ROC) curve illustrates the performance of a model. The true positive rate (sensitivity) is plotted as a function of the false positive rate for different decision cutoffs. The area under this curve is the AUC measure mentioned above.

_images/roc_curve.png

Learning curves plot the performance of the model for different sample sizes. They are used to show that the model is general and not overfitting. Note that in the following figure, if the train and test error lines do not converge, we are underfitting; if the lines cross, we are overfitting.

_images/learning_curve.png

Semi-supervised learning

Semi-supervised learning is applied to data that is partially labeled. First using a clustering algorithm you find clusters, then you use the known labels and propagate them to the nearby samples.

Summary notes

  1. Do not skip the data preparation step. Never trust the data you are working on. You might end up working for months trying to improve a dataset and then notice there were some mislabeled samples, or get unexpectedly good results and then notice you had redundant samples.
  2. Check for imbalances in your data. If 90% of your data is control and 10% case, a model that classifies everything as control will show 90% accuracy. If the nature of your data is imbalanced, make sure you specify a balanced loss criterion while learning the model (so that errors on the positive and negative classes are weighted accordingly). When splitting the data into train and test, make sure the splits are balanced.
  3. The main mistake in learning arises from leakage from train to test. If the normalization method uses information from other samples (e.g. quantile normalization), it should be done separately on train and test. Feature selection/reduction should be done only on the train data; the selected features are then simply extracted from the test data. No cheating.
  4. When applying a model, understand how it works. Know its parameters. Make rational choices and optimize the parameters.
  5. Test multiple models and draw ROC curves to compare their performance.
  6. Show more than one performance measurement. Do not rely on accuracy alone. Know what the expected performance of your model is: the accuracy of a model related to flight safety should be 99.99%, but for human disease, with so much variability, 80% can be a good prediction.
  7. When possible use simpler models. If your model performs at 90% with 100 features and at 89% with 10 features, the latter is the better model. The same goes for complexity, e.g. the degree of a kernel; in general avoid using a degree greater than 2.5.
  8. Always draw learning curves to check for under/overfitting. Keep in mind every model is overfitting to some extent.
  9. After you have tested your model on the test set and shown that it is correct and generalizes, combine all the data and build a final model with cross-validation.
  10. The power of a model is in its sample size and good feature selection. More samples make a better model.

Machine learning workshop

In this workshop, we will study the GSE53987 dataset on bipolar disorder (BD), major depressive disorder (MDD), and schizophrenia:

Lanz TA, Joshi JJ, Reinhart V, Johnson K et al. STEP levels are unchanged in pre-frontal cortex and associative striatum in post-mortem human brain samples from subjects with schizophrenia, bipolar disorder and major depressive disorder. PLoS One 2015;10(3):e0121744. PMID: 25786133

This is a microarray data on platform GPL570 (HG-U133_Plus_2, Affymetrix Human Genome U133 Plus 2.0 Array) consisting of 54675 probes.

The raw CEL files of the GEO series were downloaded and frozen-RMA normalized, and the probes were converted to HUGO gene symbols using the annotate package, averaging over probes per gene. The sample clinical data (meta-data) was parsed from the series matrix file. You can download the combined file here:

https://github.com/BRITE-REU/programming-workshops/blob/master/source/workshops/05_Machine_learning/data/GSE53987_combined.csv

In total there are 205 rows, covering 19 individuals diagnosed with BPD, 19 with MDD, 19 with schizophrenia, and 19 controls. Each patient has gene expression from up to 3 tissues (post-mortem brain). There are a total of 13768 genes (numeric features), 10 meta-features, and 1 ID (the GEO sample accession).

  • Age
  • Race (W for white and B for black)
  • Gender is F for female and M for male
  • Ph is the pH of the brain tissue
  • Pmi is the post-mortem interval
  • Rin is the RNA integrity number
  • Patient is unique for each patient. Each patient has up to 3 tissue samples. The patient ID is written as disease followed by a number from 1 to 19
  • Tissue is the tissue the expression was obtained from.
  • Disease.state is the class of disease the patient belongs to: bipolar, schizophrenia, depression or control.
  • source.name is the combination of the tissue and disease.state

Data exploration

  1. Check all the features: which ones are numeric, which are binominal, etc.

Sex is binominal and balanced; age can be centered; check whether the measurements are normally distributed and match what is expected in the literature; race is very unbalanced.

What would the labels be? They could be the disease state (MDD/BPD/schizophrenia/control), the tissue, or the combination of the two.

What is the mean age for each disease? Draw boxplots.

  2. Plot PCA (code for plotting is provided); color the samples by disease, tissue, race, sex, etc.

Repeat for each tissue (which tissue is more predictive?).

Repeat for each disease.

  3. Feature selection

By differential expression, by PCA weights, by variability (limma), or by Fisher's criterion.

  4. Correlation matrix (a minimal exploration sketch in R follows this list).
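
A minimal exploration sketch, assuming the combined csv has been downloaded, the first column is the GEO sample ID, and the meta-feature column names match the list above (all of these are assumptions):

gse <- read.csv("GSE53987_combined.csv", row.names = 1)
table(gse$Disease.state, gse$Tissue)            # class balance per tissue
boxplot(Age ~ Disease.state, data = gse)        # age distribution per disease

# PCA on the gene columns only (all columns except the meta-features)
meta_cols <- c("Age", "Race", "Gender", "Ph", "Pmi", "Rin",
               "Patient", "Tissue", "Disease.state", "source.name")
gene_cols <- setdiff(colnames(gse), meta_cols)
pca <- prcomp(gse[, gene_cols], scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], col = as.factor(gse$Disease.state), xlab = "PC1", ylab = "PC2")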

Unsupervised learning

  1. clustering: hierarchical (heatmap)
  2. association rules

Supervised learning

  1. learning models (on 2 sets of labels), logistic regression

Workshop 6: SQL

Tutorials

Workshop 6: An Introduction to Relational Databases

This workshop provides a basic introduction to Relational Databases using the SQLite program.

There are three main aspects of database usage:

  • database design and construction
  • loading data
  • querying the data

Below I discuss the main points of each and introduce use of the SQL language in the context of the sqlite3 database management program. This document contains the following sections:

Database Design

Relational databases, the most common type, are designed around entities and relationships between entities. Database design deals with these.

For example, a movie database might contain information on movies and actors. These are entities. The relationship that ties certain actors to certain movies can be called the cast. The figure below is part of the design of such a database. In it, the rectangles are entities and the diamond is a relationship. The lines connect the entities to the relationship.

_images/movies.actors.cast.er.diagram.png

Relational databases consist of tables of data. Each table consists of rows. In an entity table, each row contains data about one instance of that entity. For example, in a movie table, each row has information about one movie. The following is a description of a table to hold movie data, written in SQL. This description is used to create the movies table.

CREATE TABLE movies (
    mid integer primary key,
    title text,
    year integer,
    genres text
);

The data in a row is divided into fields. Each field holds a particular piece of data. In our movie rows, the individual fields are:

  • mid – a unique integer identifier for the row
  • title – the movie title, stored as a text string
  • year – the year the movie came out, stored as an integer
  • genres – a list of classification labels for the movie content, stored as a text string

The primary key notation on the mid field indicates that this field uniquely identifies each row and is indexed for fast lookup. The following are a few rows of data from the movies table. This data comes from the publicly available IMDb (Internet Movie Database) datasets at https://www.imdb.com/interfaces/.

mid         title                           year  genres
----------  ------------------------------  ----  --------------------
369610      Jurassic World                  2015  Action,Adventure,Sci
1326190     Aliens: Zone-X                  2015  Sci-Fi
1392190     Mad Max: Fury Road              2015  Action,Adventure,Sci
1828251     Journey to Mt. Fuji             2015  Adventure,Family,Sci
2395427     Avengers: Age of Ultron         2015  Action,Adventure,Sci
2577662     The Rise of the Robots          2015  Sci-Fi
2651352     Ratpocalypse                    2015  Fantasy,Sci-Fi

Similarly, each row in an actors table holds data about an actor. The following is a description of a table to hold actor data, written in SQL. Again, this description is used to create the actors table.

CREATE TABLE actors (
    aid integer primary key,
    name text
);

Rows in this table hold only two values:

  • aid – a unique integer identifier for the row
  • name – the name of the actor, stored as a text string

The following are a few rows of data from the actors table.

aid         name
----------  ------------------------------
1           Fred Astaire
2           Lauren Bacall
3           Brigitte Bardot
4           John Belushi
5           Ingmar Bergman
6           Ingrid Bergman
7           Humphrey Bogart
8           Marlon Brando
9           Richard Burton
10          James Cagney

Relationship tables are different. They hold values that tie the entities together. Instead of using actual data, the identifiers are used in a relationship table. The following is a description of the cast table.

CREATE TABLE cast (
    mid integer,
    aid integer,
    role text
);

The fields are:

  • mid – an integer identifier from the movies table
  • aid – an integer identifier from the actors table
  • role – a description of the actor’s role in the movie, stored as a text string

Movies typically have more than one actor, so the cast table will typically have more than one row for the same movie, each with a different actor. For example, the movie “Wonder Woman” has the following row in the movies table:

mid         title         year        genres
----------  ------------  ----------  ------------------------
451279      Wonder Woman  2017        Action,Adventure,Fantasy

Note the movie row identifier mid = 451279. In order to tie the movie to its actors, the same identifier is used in the cast table.

mid         aid         role
----------  ----------  ------------------------------
451279      2933757     ["Diana"]
451279      1517976     ["Steve Trevor"]
451279      705         ["Antiope"]
451279      205063      ["Etta"]

Who are these actors? The only way to find out is to go to the actors table and look for the rows with the corresponding aid identifiers.

aid         name
----------  --------------------
2933757     Gal Gadot
1517976     Chris Pine
705         Robin Wright
205063      Lucy Davis

Adding Data

In sqlite3, the easiest way to add data to a table is to load it from a file. sqlite3 has a special command for this called .import that is one of a series of commands that start with a period and are called Dot Commands.

The file should:

  • contain rows of data
  • have in each row,
    • one value for each field
    • fields in the same order as the create table statement
  • all fields separated by the same character, such as
    • a tab “\t” (a tab separated file or tsv)
    • a comma “,” (a comma separated file or csv)

For example, importing movie data into the movies table can be done as follows. First set the field separator; this can be done with the .mode csv or .mode tabs command. Then import the data from the file movies.tsv into the movies table. Note that the prompt sqlite> appears when the sqlite3 program is running.

sqlite>.mode tabs
sqlite>.import movies.tsv movies
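
Assuming similarly formatted files exist for the other two tables (the file names here are hypothetical), they can be imported the same way:

sqlite>.import actors.tsv actors
sqlite>.import cast.tsv cast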

Querying Data

Data is queried with SQL SELECT statements. The basic form of an SQL (Structured Query Language) query for a single table is:

SELECT field name, field name, ...
FROM table name
WHERE condition [AND|OR condition etc.]
GROUP BY field name
HAVING condition
ORDER BY field name [asc|desc] ...
LIMIT integer

The individual query parts are referred to as clauses. The Select and From clauses are required, all others are optional.

  • Select – lists the fields in the output, any order
  • From – lists the table(s) where the data is stored
  • Where – gives boolean condition(s) (true/false) limiting the rows used
  • Group by – used with aggregates like count(*)
  • Having – gives boolean conditions limiting output after a GROUP BY
  • Order by – sorts the output by field(s), either ascending (ASC) or descending (DESC)
  • Limit – restricts the output to a maximum number of rows

The simplest query returns the whole table. Limit is used because the table contains over 100,000 rows. Here, “*” means “all fields.”

SELECT *
FROM Movies
LIMIT 10
mid     title           year  genres
------  --------------  ----  ----------
35423   Kate & Leopold  2001  Comedy,Fan
66853   Na Boca da Noi  2016  Drama
69049   The Other Side  2018  Drama
88751   The Naked Mons  2005  Comedy,Hor
94859   Chief Zabu      2016  Comedy
96056   Crime and Puni  2002  Drama
97540   Responso        2004  \N
100275  The Wandering   2017  Comedy,Dra
102362  Istota          2000  Drama,Roma
107706  Stupid Lovers   2000  \N

Note that \N means NULL or no value.

To restrict the fields, use field names:

SELECT title, genres, year
FROM Movies
LIMIT 10
title           genres                year
--------------  --------------------  ----
Kate & Leopold  Comedy,Fantasy,Roman  2001
Na Boca da Noi  Drama                 2016
The Other Side  Drama                 2018
The Naked Mons  Comedy,Horror,Sci-Fi  2005
Chief Zabu      Comedy                2016
Crime and Puni  Drama                 2002
Responso        \N                    2004
The Wandering   Comedy,Drama,Fantasy  2017
Istota          Drama,Romance         2000
Stupid Lovers   \N                    2000

To restrict records, impose a condition

SELECT title, genres, year
FROM Movies
WHERE year = 2018
LIMIT 10
title                       genres                year
--------------------------  --------------------  ----
The Other Side of the Wind  Drama                 2018
T.G.M. - osvoboditel        \N                    2018
To Chase a Million          Action,Drama          2018
Fahrenheit 451              Drama,Sci-Fi,Thrille  2018
Nappily Ever After          Comedy,Drama,Romance  2018
Alita: Battle Angel         Action,Adventure,Rom  2018
Surviving in L.A.           Comedy,Drama,Romance  2018
Escape from Heaven          Comedy,Fantasy        2018
The Last Full Measure       Drama,War             2018
Caravaggio and My Mother t  Comedy,Drama          2018

For string comparison, several options are available:

  • = – strings must match exactly (usage: field = “pattern”)

    • case sensitive
  • LIKE – strings must match a pattern (usage: field LIKE “pattern”)

    • can use wildcards in the pattern
    • ‘%’ for zero or more “I don’t care” letters
    • ‘_’ for exactly one letter
    • not case sensitive
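
For example, the two queries below differ only in the comparison. The first returns rows only when the capitalization matches “Frozen” exactly, while the second also matches “frozen” or “FROZEN” (compare the “Frozen” solutions near the end of this page; whether any rows are returned depends on the data):

select title from movies where title = "Frozen" limit 5;
select title from movies where title like "frozen" limit 5;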

The following example combines conditions on the year, title, and genres to restrict the output to movies from 2017 whose titles begin with “star” and whose genres field contains “sci-fi”.

sqlite> select title, genres, year
   ...> from movies
   ...> where year = 2017 and title like "star%" and genres like "%sci-fi%"
   ...> limit 10;
title                          genres                year
-----------------------------  --------------------  ----
Star Wars: The Fallen Brother  Action,Fantasy,Sci-F  2017
Starwatch                      Action,Drama,Sci-Fi   2017
Star Wars: The Dark Reckoning  Sci-Fi                2017
Star Trek: The Paradise Maker  Adventure,Animation,  2017

Joins

When you want to combine data from different tables, joins are used. This is how to retrieve information on both actors and movies in the same query. Joins occur in the FROM clause. All the required tables are listed and the columns that should be used to join the rows are specified. Recall the actors – cast – movies diagram from above. Now it’s labeled with the columns that join the entity and relationship tables.

[Figure: actors – cast – movies ER diagram, labeled with the joining columns mid and aid]

Going back to the Wonder Woman example, here is a query that returns the actors by looking up the movie by name. The results are shown after the query.

sqlite> select mid, title, aid, name, role
   ...> from movies join cast using(mid) join actors using (aid)
   ...> where title like "wonder woman";
mid         title         aid         name        role
----------  ------------  ----------  ----------  --------------------
451279      Wonder Woman  2933757     Gal Gadot   ["Diana"]
451279      Wonder Woman  1517976     Chris Pine  ["Steve Trevor"]
451279      Wonder Woman  705         Robin Wrig  ["Antiope"]
451279      Wonder Woman  205063      Lucy Davis  ["Etta"]

Notice the joins in the from clause. The first one is

movies join cast using(mid)

This indicates that rows from movies should be combined with rows from cast when they share the same mid value. In effect, this produces an intermediate table with the fields mid, title, year, genres, aid, and role, as can be seen in the following query.

sqlite> select *
   ...> from movies join cast using (mid)
   ...> limit 10;
mid     title           year  genres      aid       role
------  --------------  ----  ----------  --------  --------------
35423   Kate & Leopold  2001  Comedy,Fan  212       ["Kate McKay"]
35423   Kate & Leopold  2001  Comedy,Fan  413168    ["Leopold"]
35423   Kate & Leopold  2001  Comedy,Fan  630       ["Stuart Besse
35423   Kate & Leopold  2001  Comedy,Fan  5227      ["Charlie McKa
66853   Na Boca da Noi  2016  Drama       180878    ["Vítor Hugo"
66853   Na Boca da Noi  2016  Drama       206883    ["Hugo"]
66853   Na Boca da Noi  2016  Drama       94426     \N
66853   Na Boca da Noi  2016  Drama       138681    \N
69049   The Other Side  2018  Drama       1379      ["Jake Hannafo
69049   The Other Side  2018  Drama       709947    ["John Dale"]

The second join is:

X join actors using (aid)

where X is the result of the first join. This indicates that rows from the first join should be combined with rows from actors when they share the same aid. Again, this has the effect of producing an intermediate table with one additional field, name.

sqlite> select *
   ...> from movies join cast using (mid) join actors using (aid)
   ...> limit 10;
mid     title           year  genres      aid       role            name
------  --------------  ----  ----------  --------  --------------  --------------------
35423   Kate & Leopold  2001  Comedy,Fan  212       ["Kate McKay"]  Meg Ryan
35423   Kate & Leopold  2001  Comedy,Fan  413168    ["Leopold"]     Hugh Jackman
35423   Kate & Leopold  2001  Comedy,Fan  630       ["Stuart Besse  Liev Schreiber
35423   Kate & Leopold  2001  Comedy,Fan  5227      ["Charlie McKa  Breckin Meyer
66853   Na Boca da Noi  2016  Drama       180878    ["Vítor Hugo"   Rubens Correia
66853   Na Boca da Noi  2016  Drama       206883    ["Hugo"]        Ivan de Albuquerque
66853   Na Boca da Noi  2016  Drama       94426     \N              Roberto Bonfim
66853   Na Boca da Noi  2016  Drama       138681    \N              Marilia Carneiro
69049   The Other Side  2018  Drama       1379      ["Jake Hannafo  John Huston
69049   The Other Side  2018  Drama       709947    ["John Dale"]   Robert Random

To obtain the results we’re interested in, sqlite searches the rows in the final intermediate table for those whose titles match “wonder woman”.

SQL Workshop

Tasks

In the workshop, we’ll do the following. See the instructions below for guidance in each task.

  • Task 1: Create tables for movies, actors, and cast.
  • Task 2: Add data to the tables using the files movies.tsv, actors.tsv, cast.tsv.
  • Task 3: Write queries to get answers for the following.

Using SELECT and WHERE in a single table

  1. Pick a movie you know from year 2000 or later and find out its mid. (answer is mid)
  2. Pick an actor you know and find out her or his aid. (answer is aid)
  3. Pick a year and list the first five movies in the year you picked with titles that start with a “b” and with “comedy” in the genres column. (answer is five rows, each containing year, title, genre)

Using count()

  1. How many actors are listed in the actors table? (answer is a count)
  2. How many movies are in the movies table? (answer is a count)
  3. How many movies have the word “bride” in the title? “groom” in the title? (answer for each is a count)
  4. How many actors have a first name that starts with “Amy”? (answer is a count)

Using Group By

  1. List the number of movies in each year. (answer is multiple rows, each containing year and count)

Using joins

  1. Pick a favorite actor and list all titles and years of the movies that person appears in. (answer is multiple rows, each containing name, title, year)
  2. Pick a movie and find all the actors that appeared in it. (answer is multiple rows, each containing title, name)

Using ORDER BY

  1. List the top ten actors with the most roles. (answer is multiple rows, each containing name, count of roles)

Using the same table more than once in a join

  1. Find two actors that appear together in two different movies (harder). (answer is multiple rows, each containing actor1, actor2, movie1, movie2)

Task 1

Starting and stopping sqlite.

The following starts sqlite and creates a database file mydatabase.db or uses that file if it already exists. Note that % is used below as an arbitrary symbol for your system prompt.

%sqlite3 mydatabase.db

The following stops sqlite. Note that “sqlite>” is the sqlite prompt.

sqlite> .quit

Create a file create.txt in an editor and enter the CREATE TABLE statements for movies, actors, and cast. Also add the following two lines to your create.txt file. They create indexes which sort the data in the cast table for fast lookup. This is necessary because the cast table doesn’t have a primary key. A sketch of a complete create.txt appears after the index statements below.

CREATE INDEX mid_aid_index on cast (mid, aid);
CREATE INDEX aid_mid_index on cast (aid, mid);
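
For reference, a complete create.txt would look something like the sketch below. The column names follow the queries used throughout this page (mid, title, year, genres; aid, name; mid, aid, role), but the exact types and any constraints should match the CREATE TABLE statements from the earlier part of this SQL section, so treat them as assumptions.

-- entity tables; mid and aid are the primary keys shown in the ER diagram
CREATE TABLE movies (
    mid integer primary key,
    title text,
    year integer,
    genres text
);

CREATE TABLE actors (
    aid integer primary key,
    name text
);

-- relationship table: no primary key, so the two indexes below provide fast lookup
CREATE TABLE cast (
    mid integer,
    aid integer,
    role text
);

CREATE INDEX mid_aid_index on cast (mid, aid);
CREATE INDEX aid_mid_index on cast (aid, mid);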

Use .read to read in the file create.txt and execute the statements in sqlite.

sqlite> .read create.txt

Use .schema to see that all the tables were created. This will list the CREATE TABLE and CREATE INDEX statements.

sqlite> .schema

Task 2

Copy the files “movies.tsv”, “actors.tsv”, and “cast.tsv” into your directory and load their data into the tables you’ve created. Use something similar to the following for each file.

sqlite> .mode tabs
sqlite> .import movies.tsv movies

Confirm that data has been loaded into each table using commands like the following, which list the first 10 lines from a table. Note that the .mode and .headers commands make the output easy to read. select * means output all fields of each row.

sqlite> .mode column
sqlite> .headers on
sqlite> select * from movies limit 10;

Note that if you get the continuation symbol ...> it means you hit return before the command was complete. Either continue typing or add the missing semicolon (;) at the end.

sqlite> select * from movies limit 10
...>;

Confirm the number of rows of data in the table. select count(*) means count the number of rows in the table.

sqlite> select count(*) from movies;

Task 3

Write SQL select statements to get the answers to the listed questions. Use the format shown below.

SELECT field name, field name, ...
FROM table name
WHERE condition [AND|OR condition etc.]
GROUP BY field name
HAVING condition
ORDER BY field name [asc|desc] ...
LIMIT integer

Try It At Home

Follow these steps to add movie ratings to your database.

  • Create a ratings table (a sketch of the create statement follows this list). It should have three fields:
    • mid – a unique integer identifier for the movie (set this as the primary key)
    • rating – a floating point value for the movie rating (datatype: real)
    • votes – an integer value for the number of votes received by the movie
  • Download the data file “ratings.tsv”
  • Import the data into your table
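
A minimal sketch of the create statement, using the field names and types listed above (put it in a file and load it with .read, or type it at the sqlite> prompt):

CREATE TABLE ratings (
    mid integer primary key,   -- movie identifier, matches movies.mid
    rating real,               -- movie rating
    votes integer              -- number of votes received
);

Then import the data as before:

sqlite> .mode tabs
sqlite> .import ratings.tsv ratings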

Answer these queries:

  1. How many movies are rated?
  2. How many movies have more than 5000 votes?
  3. What are the top ten rated movies with at least 5000 votes? With at least 50,000 votes? With less than 5000 votes?
  4. What is the range of ratings (use min() for low and max() for high)?
  5. Show the ratings, votes, and year of all Star Wars movies, ordered from highest to lowest rating.
  6. What is the distribution of ratings in bins of size 1 (i.e., how many are rated from 0 to 0.999, from 1 to 1.999, etc). To do this you can use 1) the round( ) function on the ratings and 2) GROUP BY.
  7. What is the distribution of votes in bins of size 1000?

SQLite Dot Commands

sqlite3 dot commands

.quit                   Exit sqlite3
.headers on|off         Turn display of field names on or off
.help                   Show this message
.import FILE TABLE      Import data from FILE into TABLE
.mode OPTION            Set output mode where OPTION is one of:
                            csv           Comma-separated values
                            tabs          Tab-separated values
                            list          Values delimited by .separator strings
                            column        Left-aligned columns for display (use with .width)
.open FILE              Close existing database and open FILE database
.output FILE|stdout     Send output (such as result of SQL query) to FILE or screen
.read FILE              Execute SQL in FILE
.schema                 Show the CREATE statements in this database
.separator "x"          Change the column separator to x for both .import and output
.show                   Show the current values for various settings
.width n1 n2 …          Set column widths for "column" mode, 0 means auto set column,
                            negative values right-justify
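
For example, the following hypothetical session writes the per-year movie counts to a file named year_counts.csv (an arbitrary name used only for illustration) and then switches output back to the screen:

sqlite> .headers on
sqlite> .mode csv
sqlite> .output year_counts.csv
sqlite> select year, count(*) from movies group by year;
sqlite> .output stdout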

Solutions to Common Queries

Number of movies

select count(*)
from movies;

Number of actors

select count(*)
from actors;

Number of rows in cast

select count(*)
from cast;

Movies in a range of mid values

select *
from movies
where mid > 112303 and mid < 114000
limit 10;

Movies named “Frozen” (case sensitive)

select *
from movies
where title = "Frozen"
limit 10;

Movies named “frozen” (case insensitive)

select *
from movies
where title like "frozen"
limit 10;

Movies with title containing “star”.

select *
from movies
where title like "%star%"
limit 10;

Movies with “adventure” in genres

select *
from movies
where genres like "%adventure%"
limit 10;

Minimum year of movies in database

select min(year)
from movies;

Maximum year of movies in database

select max(year)
from movies;

Count of movies per year

select year, count(year)
from movies
group by year
limit 20;

Average number of actors per movie (uses subquery)

select avg(n)
from (
        select count(aid) as n
        from cast
        group by mid
);

Actors in movies titled “Frozen”

select mid, title, year, name, role, aid
from movies join cast using(mid) join actors using(aid)
where title like "Frozen";

Movies for Emma Stone sorted descending by year

select name, title, year
from movies join cast using(mid) join actors using(aid)
where name="Emma Stone"
order by year desc;

Movies for Chris Evans sorted by title

select name, title, year
from movies join cast using(mid) join actors using(aid)
where name="Chris Evans"
order by title;

Movies for George Clooney sorted by title

select name, title, year
from movies join cast using(mid) join actors using(aid)
where name="George Clooney"
order by title;

Top actors (most movies) with at least 30 movies

select name, count(mid) as c
from cast join actors using(aid)
group by name
having c >= 30
order by c desc
limit 10;

Top actors (most movies) since 2015

select name, count(mid)
from movies join cast using(mid) join actors using(aid)
where year >= 2015
group by name
order by count(mid) desc
limit 10;

Same two actors in two movies (complete version; note the less-than (<) comparison on the movie ids, rather than not-equal (<>), in the final part of the WHERE clause, which prevents listing each pair of movies twice in reverse order)

[Figure: diagram of two actors appearing together in two different movies]
select a1.name, a2.name, m1.title, m2.title
from actors a1 join cast c1 using (aid)
        join cast as c2 using(mid)
                join cast as c3 on c1.aid=c3.aid
                        join cast as c4 on c2.aid = c4.aid and c3.mid=c4.mid
                                join actors a2 on c4.aid=a2.aid
                                        join movies as m1 on m1.mid=c1.mid
                                                join movies as m2 on m2.mid=c4.mid
where c1.aid <> c2.aid and c1.mid < c3.mid
limit 10;

Same two actors in two movies, one of whom is Emma Stone

select a1.name, a2.name, m1.title, m2.title
from actors a1 join cast c1 using (aid)
        join cast as c2 using(mid)
                join cast as c3 on c1.aid=c3.aid
                        join cast as c4 on c2.aid = c4.aid and c3.mid=c4.mid
                                join actors a2 on c4.aid=a2.aid
                                        join movies as m1 on m1.mid=c1.mid
                                                join movies as m2 on m2.mid=c4.mid
where c1.aid <> c2.aid and c1.mid < c3.mid and a1.name like "Emma Stone"
limit 10;

Posters

Abstracts

Abstract Draft Guidelines

Include:
  • Title
  • Motivation and Background
  • Question or Goal
  • Methods and Data
  • Results
  • Discussion and Future work
Avoid:
  • undefined words or phrases
  • vagueness
  • sloppy grammar

From the ABRCMS National Conference website: “Your abstract must contain a hypothesis or statement about the problem under investigation, a statement of the experimental methods/methodology used, essential results provided in summary form (even if preliminary), and a conclusion that explains how the work contributes to the hypothesis or statement of problem.”

Posters

Poster Templates

The following are suggested poster templates.

Poster Draft Guidelines

  • Title
  • Authors
  • Affiliations of Authors
    • Yours should be your school and “Boston University Bioinformatics BRITE REU Program, Summer 2018”
  • Abstract
  • Text for
    • Motivation and Background
    • Methods
    • Results
    • Discussion and Future Work
  • Figures
  • References
  • Acknowledgements
    • Include: “This work was funded, in part, by NSF grant DBI-1559829, awarded to the Boston University Bioinformatics BRITE REU program, [and if other grants] and <grant agency, like NSF or NIH> grant <grant number>.”

What to look for when reading a poster

Contents
  1. Does the abstract say briefly what the authors did, why they did it (including its importance), and what results they got?
  2. Is there an introduction to basic concepts?
  3. Does it use diagrams or flowcharts to increase clarity?
  4. Does the methods section briefly describe what was done and how? What data was used?
  5. Does the results section make clear what the outcome was?
  6. Are graphs and figures clear, well labeled, and described? Are important results highlighted?
  7. Does the conclusion discuss the importance of the results and what further work needs to be done?
  8. Is there a reference section with relevant articles and books?
  9. Is there an acknowledgement section that contains grant support information?
Appearance and overall effect
  1. Is it interesting?
  2. Is there a good mix of text and figures?
  3. Was there a good flow in the story?
  4. Did you learn something from it?

Submission Deadlines

2018 ABRCMS National Conference
November 14-17, 2018
Indianapolis, Indiana

Travel Award

Deadline Aug 22 at 11:59 p.m. PDT

Apply for a travel award

Abstract

Deadline Sept 7 at 11:59 p.m. PDT

Submit an abstract

Note

This is a Work In Progress and is actively being updated.
