Introduction to the Command Line for Genomics
The command line interface (CLI) and the graphical user interface (GUI) are different ways of
interacting with a computer’s operating system. When using the GUI, you see visual
representations of files, folders, applications, etc. When using the CLI, you work largely with
text representations of files, folders, input, output, etc. The shell is a program that
presents a command line interface, which lets you control your computer by typing
instructions with a keyboard.
Essentials
$ man command = show the manual page for a command
Ctrl+C cancels the command you are writing and gives you a fresh prompt.
Ctrl+R does a reverse search through your command history.
Ctrl+L clears your screen.
You can also review your recent commands with the history command. Each entry is numbered,
so you could repeat command #260 by entering: $ !260
echo is a built-in shell command that writes its arguments, as a line of text, to standard
output.
Navigating Directories
$ pwd = print working directory
$ cd subdirectory1 = change location to subdirectory1
$ cd - = return to the previous directory
$ cd = go to your home directory
$ ls = list directory contents
$ ls ~ = list home directory (~ = shortcut for home directory)
$ ls -l = list in long format (additional information: name of the owner, when
the file was last modified, and whether the current user has permission to read
and write to the file)
$ ls -F = list and classify (append indicator)
$ ls -a = list all content (including hidden content)
$ ls *.fastq = * matches any run of characters, including none. Thus, *.fastq
matches every file that ends with .fastq.
$ ls *977.fastq = lists only the file that ends with 977.fastq.
$ ls /usr/bin/*[ac]* = list all the files in /usr/bin that contain the letter ‘a’ or the
letter ‘c’
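The wildcard patterns above can be tried safely in a scratch directory. This is a minimal sketch: the temporary directory and the empty files in it are stand-ins created only for illustration, not real sequencing data.

```shell
# Hypothetical sandbox: these empty files only imitate real data files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch SRR097977.fastq SRR098026.fastq notes.txt

# * matches any run of characters, so both FASTQ files are listed.
all_fastq=$(ls *.fastq)
echo "$all_fastq"

# Only one file name ends with 977.fastq.
one_fastq=$(ls *977.fastq)
echo "$one_fastq"

# [ac] matches a single 'a' or 'c'; notes.txt contains neither letter.
with_a_or_c=$(ls *[ac]*)
echo "$with_a_or_c"
```

The shell expands each pattern into matching file names before ls runs, so ls itself never sees the * or the brackets.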
Navigating Files
$ cat * = print the contents of all the files in the current directory
$ cat SRR098026.fastq = print all the content of the SRR…026.fastq file
$ less SRR097977.fastq = open the file in an interactive pager (read-only view)
Space = go forward one page
b = go backward one page
g = go to the beginning
G = go to the end
/pattern = search forward for pattern
q = quit
head and tail let you look at the beginning and end of a file, respectively.
$ head -n 1 SRR098026.fastq
$ tail -n 1 SRR098026.fastq
= print the first or last n lines of a file (here, n = 1)
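As a small illustration of head and tail, the sketch below writes one four-line read to a temporary file (the file name example.fastq is made up for this example) and then pulls out its first and last lines.

```shell
# Hypothetical one-read file, using the read shown in the FASTQ section.
workdir=$(mktemp -d)
cat > "$workdir/example.fastq" <<'EOF'
@SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
TATTCTGCCATAATGAAATTCGCCACTTGTTAGTGT
+SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
CCCCCCCCCCCCCCC>CCCCC7CCCCCCACA?5A5<
EOF

# head -n 1 keeps the first line (the read header).
first_line=$(head -n 1 "$workdir/example.fastq")
echo "$first_line"

# tail -n 1 keeps the last line (the quality string).
last_line=$(tail -n 1 "$workdir/example.fastq")
echo "$last_line"
```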
FASTQ format
Each read in a FASTQ file is described by four lines, for example:
@SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
TATTCTGCCATAATGAAATTCGCCACTTGTTAGTGT
+SRR097977.1 209DTAAXX_Lenski2_1_7:8:3:710:178 length=36
CCCCCCCCCCCCCCC>CCCCC7CCCCCCACA?5A5<
Line  Description
1     Always begins with ‘@’ and then information about the read
2     The actual DNA sequence
3     Always begins with a ‘+’ and sometimes the same info as in line 1
4     Has a string of characters which represent the quality scores; must have the same number
      of characters as line 2
Quality = a score derived from the probability of an incorrect base call (equivalently, from the base call accuracy).
The numerical score is converted into a code where each individual character represents the
numerical quality score for an individual nucleotide.
For example:
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
The # character and each of the ! characters represent the encoded quality for an individual
nucleotide.
The numerical value assigned to each of these characters depends on the sequencing
platform that generated the reads. The sequencing machine used to generate our data uses
the standard Sanger quality PHRED score encoding, as used by Illumina version 1.8 onwards. Each
character is assigned a quality score between 0 and 42 as shown in the chart below:
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
                  |         |         |         |         |
Quality score:    0........10........20........30........40..
Each quality score represents the probability that the corresponding nucleotide call is
incorrect. The score is logarithmic: a quality score of 10 reflects a base
call accuracy of 90%, while a quality score of 20 reflects a base call accuracy of 99%. These
probability values come from the base calling algorithm and depend on how
much signal was captured for the base incorporation.
Looking back at our read:
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
we can now see that the quality of each of the Ns is 0 and the quality of the only nucleotide
call (C) is also very poor (# = a quality score of 2). This is indeed a very bad read.
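The decoding described above can be reproduced at the command line. This is a rough sketch that assumes the Sanger (Phred+33) encoding named in the text, where the score is the character's ASCII code minus 33 and the error probability is 10^(-Q/10). The function names phred_scores and error_prob are made up for this example.

```shell
# Convert each character of a quality string to its Phred score (ASCII code - 33).
phred_scores() {
    out=""
    # od prints the decimal byte value of every character in the string.
    for code in $(printf '%s' "$1" | od -An -v -tu1); do
        out="$out $((code - 33))"
    done
    echo "${out# }"
}

# Error probability for a quality score Q: P = 10^(-Q/10).
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_scores '!#C'   # three characters that appear in the reads above
error_prob 10        # accuracy 90%
error_prob 20        # accuracy 99%
```

The first call prints 0 2 34, which matches the statement that ! encodes a score of 0 and # a score of 2.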
Working with Files
$ cp SRR098026.fastq SRR098026-copy.fastq = copy a file
$ mkdir backup = create a directory
$ mv SRR098026-copy.fastq backup = move the file into the backup directory
$ mv SRR098026-copy.fastq SRR098026-backup.fastq = rename the file
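The copy, move, and rename steps can be chained as in the sketch below. It runs in a temporary directory with a placeholder file standing in for SRR098026.fastq, and it assumes the rename happens inside backup, since the copy has already been moved there.

```shell
# Hypothetical sandbox with a placeholder instead of real read data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "placeholder read data" > SRR098026.fastq

cp SRR098026.fastq SRR098026-copy.fastq      # copy: the original stays in place
mkdir backup                                 # create the target directory
mv SRR098026-copy.fastq backup               # move the copy into backup/
# mv with a new file name as the destination renames the file.
mv backup/SRR098026-copy.fastq backup/SRR098026-backup.fastq

backup_listing=$(ls backup)
echo "$backup_listing"
```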
File Permissions
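$ ls -l = list in long format; the first column shows the permissions of each file
user = u
group = g
other = o
all = a
chmod {u/g/o/a}{+/-/=}{rwx} file_name = add (+), remove (-), or set (=) read, write, and execute permissions
chmod 777 file_name = set permissions with three octal digits, one each for user, group, and other
Each digit is the sum of:
read = 4
write = 2
execute = 1
(so 7 = 4 + 2 + 1 = rwx)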
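To see how the symbolic and numeric forms of chmod relate, the sketch below changes the mode of an empty temporary file (a stand-in, not real data) and reads back the permission string from the first ten characters of ls -l.

```shell
# Hypothetical empty file used only to demonstrate permission changes.
workdir=$(mktemp -d)
file="$workdir/SRR098026-backup.fastq"
touch "$file"

# Numeric form: 6 = 4 + 2 (read + write) for the user, 4 (read) for group and other.
chmod 644 "$file"
mode_numeric=$(ls -l "$file" | cut -c1-10)
echo "$mode_numeric"

# Symbolic form: remove write permission for everyone (a = all).
chmod a-w "$file"
mode_readonly=$(ls -l "$file" | cut -c1-10)
echo "$mode_readonly"

# Several symbolic clauses can be combined with commas.
chmod u+w,g=r,o= "$file"
mode_mixed=$(ls -l "$file" | cut -c1-10)
echo "$mode_mixed"
```

The three lines printed are -rw-r--r--, -r--r--r--, and -rw-r-----: one leading character for the file type, then three rwx triplets for user, group, and other.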
Removing
$ rm SRR098026-backup.fastq
If the file is write-protected, you’ll be asked to confirm that you want to override its permissions:
rm: remove write-protected regular file ‘SRR098026-backup.fastq’?
Enter y for yes or n for no.
Important: The rm command permanently removes the file.
By default, rm will not delete directories.
To delete a directory, you have to use the -r (recursive) option.
$ rm -r name_directory
This will delete not only the directory, but all files within the directory.
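The difference between rm and rm -r can be checked without risk in a throwaway directory. The sketch below uses made-up file names inside a temporary directory, so nothing outside it is touched.

```shell
# Hypothetical sandbox: everything here is disposable.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir backup
touch backup/a.fastq backup/b.fastq single.fastq

rm single.fastq                  # removes one file, permanently

# Without -r, rm refuses to remove a directory and returns a non-zero status.
if rm backup 2>/dev/null; then
    plain_rm="removed"
else
    plain_rm="refused"
fi

rm -r backup                     # removes the directory and every file in it
remaining=$(ls -A)
echo "plain rm on a directory: $plain_rm; entries left: '$remaining'"
```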
Searching Files
$ less SRR097977.fastq = open the file in an interactive pager
/pattern = search forward for pattern
or
$ grep [options] 'pattern' [file] = search for 'pattern' in the file
Options: -n = show the line number of each matching line
-c = count the matching lines
-v = print only the lines that do not match the pattern
-A n = print n lines after each matching line
-B n = print n lines before each matching line
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | grep -v '^--'
^ = beginning of line
$ = end of line
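The grep options listed above can be seen working on a tiny file. In this sketch the three reads and the file name example.fastq are invented for illustration; only the first and third reads contain a run of ten Ns.

```shell
# Hypothetical three-read FASTQ file (reads 1 and 3 are "bad").
workdir=$(mktemp -d)
cat > "$workdir/example.fastq" <<'EOF'
@read1
NNNNNNNNNNNNNNNNCNNN
+read1
!!!!!!!!!!!!!!!!#!!!
@read2
TATTCTGCCATAATGAAATT
+read2
CCCCCCCCCCCCCCC>CCCC
@read3
ACNNNNNNNNNNNNNNNNGT
+read3
CC!!!!!!!!!!!!!!!!CC
EOF

# -c counts matching lines; -n prefixes each match with its line number.
n_matches=$(grep -c NNNNNNNNNN "$workdir/example.fastq")
numbered=$(grep -n NNNNNNNNNN "$workdir/example.fastq")
echo "$n_matches matching lines:"
echo "$numbered"

# -B1 -A2 pulls in the full four-line record around each matching sequence.
# grep separates non-adjacent groups with a "--" line, which the second grep drops.
bad_reads=$(grep -B1 -A2 NNNNNNNNNN "$workdir/example.fastq" | grep -v '^--')
echo "$bad_reads"
```

One caveat: grep -v '^--' would also drop a real quality line that happens to begin with two hyphens, so the result is worth a quick check on real data.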
Redirecting output
> = redirect output to a file; the new output replaces anything that was already
present in the file
>> = append redirect; the new output is added to the end of the file
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
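The difference between the two redirects is easiest to see with echo. This sketch writes to a temporary file; the file name matches the one used above, but the text written to it is only placeholder output.

```shell
# Hypothetical output file in a temporary directory.
workdir=$(mktemp -d)
out="$workdir/bad_reads.txt"

echo "first run" > "$out"
echo "second run" > "$out"     # > replaces the previous contents
after_overwrite=$(cat "$out")
echo "$after_overwrite"

echo "third run" >> "$out"     # >> adds to the end of the file
after_append=$(cat "$out")
echo "$after_append"
```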
Lines, words, and bytes count
wc = count lines, words, and bytes
Options: -l = print the line count
-w = print the word count
-c = print the byte count
$ wc bad_reads.txt
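A short check of what each wc option reports, using a two-line placeholder file. Reading the file through < keeps the file name out of the output, and tr strips the padding that some versions of wc add.

```shell
# Hypothetical two-line file: 2 lines, 3 words, 15 bytes.
workdir=$(mktemp -d)
printf 'ACGT ACGT\nNNNN\n' > "$workdir/bad_reads.txt"

n_lines=$(wc -l < "$workdir/bad_reads.txt" | tr -d ' ')
n_words=$(wc -w < "$workdir/bad_reads.txt" | tr -d ' ')
n_bytes=$(wc -c < "$workdir/bad_reads.txt" | tr -d ' ')
echo "lines=$n_lines words=$n_words bytes=$n_bytes"
```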
The pipe operator |
| = take the output of one command and use it as input to another command
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less
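Pipes can also feed non-interactive commands. The sketch below, run on an invented two-read file, counts the lines that grep extracts without writing an intermediate file.

```shell
# Hypothetical two-read file; only read1 contains a run of Ns.
workdir=$(mktemp -d)
printf '@read1\nNNNNNNNNNNNN\n+\n!!!!!!!!!!!!\n@read2\nACGTACGTACGT\n+\nCCCCCCCCCCCC\n' \
    > "$workdir/example.fastq"

# grep's output becomes wc's input; tr removes any padding around the number.
n_bad_lines=$(grep -B1 -A2 NNNNNNNNNN "$workdir/example.fastq" | wc -l | tr -d ' ')
echo "$n_bad_lines"

# Pipes can be chained: keep only the header of the first bad read.
first_header=$(grep -B1 -A2 NNNNNNNNNN "$workdir/example.fastq" | head -n 1)
echo "$first_header"
```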
For Loop
Define a variable
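A minimal sketch of defining and using a shell variable. The name foo and its value are arbitrary placeholders.

```shell
# Assign with name=value; there must be no spaces around the = sign.
foo=abc

# Prefix the name with $ to read the value.
echo "foo is $foo"

# Braces mark where the variable name ends when text follows it directly.
message="foo is ${foo}EFG"
echo "$message"
```

Without the braces, the shell would look for a variable named fooEFG, which is not set, and substitute an empty string.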
Examples for loop:
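One possible for loop, sketched on two invented FASTQ files in a temporary directory: it prints each file name and appends the first two lines of every file to a summary file (seq_info.txt is a name chosen for this example).

```shell
# Hypothetical sandbox with two one-read files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '@read1\nACGT\n+\nCCCC\n' > SRR097977.fastq
printf '@read2\nNNNN\n+\n!!!!\n' > SRR098026.fastq

# The loop body runs once per matching file, with the name stored in $filename.
for filename in *.fastq
do
    echo "$filename"
    head -n 2 "$filename" >> seq_info.txt
done

seq_info=$(cat seq_info.txt)
echo "$seq_info"
```

The >> redirect matters here: with > each pass through the loop would overwrite the output of the previous one.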
Basename
basename = strip directory and suffix from filenames
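For instance, with a made-up path (the file does not need to exist, because basename only manipulates the text of the name):

```shell
# Hypothetical path used only to show the string handling.
path=/home/user/data/SRR097977.fastq

# With one argument, basename strips the directory part.
file_only=$(basename "$path")
echo "$file_only"

# With a second argument, it also strips that suffix.
sample=$(basename "$path" .fastq)
echo "$sample"
```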
Rename files
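A for loop combined with basename can rename many files at once. This is a sketch on empty placeholder files; the _2019 suffix is an arbitrary choice for illustration.

```shell
# Hypothetical sandbox with two empty files to rename.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch SRR097977.fastq SRR098026.fastq

for filename in *.fastq
do
    # Strip the extension, then rebuild the name with the new suffix.
    name=$(basename "$filename" .fastq)
    mv "$filename" "${name}_2019.fastq"
done

renamed=$(ls)
echo "$renamed"
```

The list of files is expanded once, before the loop starts, so the renamed files are not picked up and renamed a second time.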
Writing Files
$ nano README.txt = create a file called README.txt and open the text editor called nano
Ctrl-O = write the data to disk (save the file)
Ctrl-X = quit the nano editor
Ctrl-G = get help
We can also use other programs, such as Notepad++ on Windows.
On Unix systems (Linux and macOS), many programmers use Emacs or Vim, or a graphical
editor such as Gedit.
Writing Scripts
Scripts let you save commands so that you can run them again, and let you combine multiple
commands in one file.
bash bad-reads-scripts.sh = execute bad-reads-scripts.sh
We had to type bash because we needed to tell the computer what program to use to run
this script.
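The sketch below shows one way such a script could look and how it is run with bash. The contents of bad-reads-scripts.sh here are an assumption made for illustration (the script body is not shown above), and the FASTQ file is a one-read placeholder.

```shell
# Hypothetical sandbox with a single "bad" read.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '@read1\nNNNNNNNNNNNN\n+\n!!!!!!!!!!!!\n' > SRR098026.fastq

# Write the script: the same commands you would otherwise type at the prompt.
cat > bad-reads-scripts.sh <<'EOF'
grep -B1 -A2 NNNNNNNNNN *.fastq > scripted_bad_reads.txt
echo "Script finished!"
EOF

# Run it by handing the file to bash.
script_output=$(bash bad-reads-scripts.sh)
echo "$script_output"

n_lines=$(wc -l < scripted_bad_reads.txt | tr -d ' ')
echo "$n_lines"
```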
Making the script into a program
The x tells us we can run it as a program.
Execute:
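A sketch of these two steps, using a one-line placeholder script. The first line (#!/usr/bin/env bash) is an addition assumed here so that the system knows which program should run the file, and it presumes bash is installed.

```shell
# Hypothetical sandbox with a minimal script.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '#!/usr/bin/env bash\necho "Script finished!"\n' > bad-reads-scripts.sh

# Add execute permission, then check the permission string for an x.
chmod +x bad-reads-scripts.sh
mode=$(ls -l bad-reads-scripts.sh | cut -c1-10)
echo "$mode"

# ./ tells the shell to run the file from the current directory.
run_output=$(./bad-reads-scripts.sh)
echo "$run_output"
```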
Moving and Downloading Data
Getting data from the cloud
wget = “world wide web get” = download web pages or data at a web address
cURL = “see URL” = display webpages or data at a web address
Examples:
wget ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
curl ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
curl -O = download the page and save the file using the same name it had on the server
which = program that searches the directories listed in your PATH and tells you which
folder a command is installed in
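A quick way to check which download tool is available is sketched below. The output depends on the machine, and the sketch assumes the which command itself is installed.

```shell
# which prints the full path of a command found in PATH.
sh_path=$(which sh)
echo "sh is installed at: $sh_path"

# which exits with a non-zero status when the command is not found.
for tool in curl wget
do
    if which "$tool" > /dev/null 2>&1; then
        echo "$tool: $(which "$tool")"
    else
        echo "$tool: not found in PATH"
    fi
done
```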
Transferring Data Between your Local Machine and the Cloud