Regular Expressions and
Finite State Automata
E Jembere
What we will cover in this Unit
• Regular Expressions (RE)
• Finite State Automata (FSA)
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 2
Introductory Problem
• How can we create a price list for the
laptops given some information in a
text file?
08/17/2025 3
4
Regular Expressions
• A regular expression is a pattern to
be matched against a string. For
example, the pattern $299.99.
• Matching either succeeds or fails.
• Sometimes you may want to replace
a matched pattern with another
string.
• Regular expressions are used by
many other Unix commands and
programs, such as grep, sed, awk,
vi, emacs, nano, etc.
© Jurafsky and Martin, Speech and Language Processing
08/17/2025 5
There is more to RE
• Formally, a Regular Expression is an
algebraic notation for characterising
a set of Strings
• A regular expression:
Can specify search strings
Define a language in a formal way.
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 6
Disjunction
• Letters in square Brackets “[]”
• E.g. Find all the instances of the word
“regular expression” in a text.
A regular expression is a pattern to be matched against a
string. For example, the pattern $299.99. Matching either
succeeds or fails. Sometimes you may want to replace a
matched pattern with another string. Regular expressions
are used by many other Unix commands and programs,
such as grep, sed, awk, vi, emacs, nano and even some
shells
regular expression
[Rr]egular expression
© Jurafsky and Martin, Speech and Language Processing
08/17/2025 7
Ranges
Pattern Matches
[A-Z] An upper case letter Matching either succeeds or fails
[a-z] A lower case letter Matching either succeeds or fails
[0-9] A single digit There is no digit larger than 9
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 8
Negation in Disjunction
Pattern Matches
[^A-Z] Not an upper case letter
[^Ss] Neither ‘S’ nor ‘s’
[^e] Not ‘e’
a\^b The pattern ‘a^b’
[^\.] Not a period
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 9
More on Disjunctions
• Suppose we want to search for the string ‘cats’ or string
‘dogs’ in a given text
Cats & Dogs is a 2001 American-Australian action-comedy film, directed
by Lawrence Guterman. The screenplay by John Requa and Glenn Ficarra
centers on the relationships between cats and dogs, depicting the
relationship as an intense rivalry in which both sides use organizations and
tactics that mirror those used in human espionage. John starred in the
movie.
• We can use the pipe ‘|‘ for disjunction
• We can also combine the | and []
E.g [Jj]ohn|[Rr]ichard
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 10
Special Characters
• Special characters: ? * + .
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 11
Anchors
• ^ matches the beginning of a line;
• $ the end of a line
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 12
Operator Precedence Hierarchy
Group Symbols
Parenthesis ()
Counters *, +, ?, {}
Sequences and the, ^my, end$
Anchors
Disjunctions I
• E.g. ‘the*’ matches ‘theeeeee’ and not
‘thethe’
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 13
The Sheep Language
• baa!, baaa!, baaaa!, baaaaa!...
Any string from this infinite set is in the sheep
language.
• It has three characters in its alphabet; a, b
and !.
• Which of the following is a correct Regex
for the sheep language?
a) ba*!
b) ba+!
c) baa*!
d) baa+!
e) baaa*!
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 14
15
Regex Substitution and
Memory
• Substitution in Java
String pattern = "(\\w)(\\s+)([\\.,])";
System.out.println(EXAMPLE_TEST.replaceAll(pattern,
"$1$3"));
Replaces white space between a word character and a .
• Regex in Perl
Syntax: s/regexp1/modifier/
E.g.
s/I am feeling (.+)/You are feeling \1?/
s/I gave the (.+) to (.+)/Why would you give \2 the \1?/
16
/g specifies that a
regex should perform
a global search.
17
ELIZA: Substitutions Using
Memory
User: Men are all alike.
ELIZA: IN WHAT WAY
s/.* all .*/IN WHAT WAY/
They’re always bugging us about something or other.
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
User: My boyfriend says I’m depressed.
ELIZA: I AM SORRY TO HEAR YOU ARE DEPRESSED
s/.* I’m (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
18
Finite State Automata
• Regular expressions can be viewed
as a textual way of specifying the
structure of finite-state automata.
• Any regular expression can be
represented as an FSA.
• FSA capture significant aspects of
what linguists say we need for
Natural Language Processing.
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 19
FSAs as Graphs
• Let’s revisit the sheep language
/baa+!/
• Directed graph with labeled nodes and arc
transitions
• Five states: q0 the start state, q4 the final state,
5 transitions
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 20
Sheep Language FSA
• We can say the following things about
this machine:
It has 5 states
b, a, and ! are in its alphabet
q0 is the start state
q4 is an accept state
It has 5 transitions
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 21
But Note
• There are other machines that
correspond to this same language
• More on this one later
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 22
More Formally
• FSA is a 5-tuple consisting of
Q: set of states {q0, q1, q2, q3, q4}
: an alphabet of symbols {a,b,!}
q0: a start state
F: a set of final states in Q {q4}
(q,i): a transition function mapping Q x to
Q
23
About Alphabets
• Don’t take term alphabet too
narrowly; it just means we need a
finite set of symbols in the input.
• These symbols can and will stand for
bigger objects that can have internal
structure.
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 24
Dollars and Cents
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 25
State Transition Table
• An FSA can also b a ! ϵ
be represented 0 1
as table/matrix
1 2
2 2,3
If you’re in state 1
and you’re looking 3 4
at an a, go to state
2 4:
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 26
Recognition
• Recognition is the process of
determining if a string should be
accepted by the machine
• Or… it’s the process of determining if a
string is in the language defined by the
machine
• Or… it’s the process of determining if a
regular expression matches a string
• These all amount to the same thing.
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 27
Recognition…..
• Traditionally, (Turing’s notion) this process
is depicted with a tape.
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 28
Recognition…..
• FSA recognizes (accepts) strings of a
regular language
baa!
baaa!
baaa!
…
• Tape metaphor: a rejected input,
q0
a b a ! b
29
Recognition….
• Simply a process of starting in the
start state
• Examining the current input
• Consulting the table
• Going to a new state and updating
the tape pointer.
• Until you run out of tape.
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 30
D-Recognize
31
q0 q1 q2 q3 q3 q3 q4
b a a a a !
Input
State
b a !
0 1 Ø Ø
1 Ø 2 Ø
2 Ø 3 Ø
3 Ø 3 4
4: Ø Ø Ø
32
Key Points
• Deterministic means that at each point in
processing there is always one unique thing
to do (no choices).
• D-recognize is a simple table-driven
interpreter
• The algorithm is universal for all
unambiguous regular languages.
To change the machine, you simply change the
table.
• Crudely therefore… matching strings with
regular expressions (e.g Perl, grep, etc.) is a
matter of:
translating
08/17/2025
the
© Jurafsky and Martin, Speech regular
and Language Processing expression into a FSM (a 33
Generative Formalisms
• FSAs can be viewed from two
perspectives:
Acceptors that can tell you if a string is
in the language
Generators to produce all and only the
strings in the language
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 34
Non-Determinism
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 35
Non-Determinism….
• Epsilon-transitions
Key point: these transitions do not
examine or advance the tape during
recognition
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 36
ND Recognition
• Two basic approaches (used in all
major implementations of regular
expressions)
1. Either take a ND machine and convert
it to a D machine and then do
recognition with that.
2. Or explicitly manage the process of
recognition as a state-space search
(leaving the machine as it is).
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 37
ND Recognition…
• In a ND FSA there exists at least one path
directed through the machine by a string
that is in the language defined by the
machine that leads to an accept condition.
• But not all paths directed through the
machine by an accepted string lead to an
accept state. It is OK for some paths to lead
to a reject condition.
• In a ND FSA no path directed through the
machine by a string outside the language
leads to an accept condition. 38
ND Recognition…
• So success in a non-deterministic recognition occurs
when a path is found through the machine that ends
in an accept.
• However, being driven to a reject condition by an
input does not imply it should be rejected.
• Failure occurs only when none of the possible paths
lead to an accept state.
• This means that the problem of non-deterministic
recognition can be thought of as a standard search
problem.
39
The Problem of Choice
• Choice in non-deterministic models comes
up again and again in NLP.
• Several Standard Solutions exist:
1. Backup
Save input/state of machine at choice points
If wrong choice, use this saved state to back up and try
another choice
2. Lookahead
Look ahead in the input to help make a choice
3. Parallelism
Look at all choices in parallel
40
Backup
• After a wrong choice leads to a dead-end
(either no input left in a non-accept state,
or no legal transitions), return to a previous
choice point to pursue another unexplored
choice.
• Thus, at each choice point, the search
process needs to remember the
(unexplored) choices.
• Standard State Space Search.
State = (FSA node or machine state, tape-
41
Example
b a a a ! \
q0 q1 q2 q2 q3 q4
42
ND-Recognize Code
43
Example
Agenda:
44
Example
45
Example
Agenda:
46
Example
47
Example
Agenda:
48
Example
49
Example
Agenda:
50
Example
Agenda:
51
Example
Agenda:
52
Example
53
Example
Agenda:
54
Example
Agenda:
55
Example
Agenda:
56
Example
Agenda:
57
Recognition as Search
• You can view this algorithm as a trivial kind of state-
space search.
• States are pairings of tape positions and state
numbers.
• Operators are compiled into the table
• Goal state is a pairing with the end of tape position
and a final accept state
• Different search strategies can be used for the
state-space serach in ND-recognize.
Depth-First Search,
Breadth-First Search,
etc
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 58
Summing Up
• Regular expressions and FSAs can represent
subsets of natural language as well as regular
languages
Both representations may be impossible for
humans to understand for any real subset of
a language
But they are very easy to use for smaller
subsets
They are extremely useful for certain
subproblems (e.g., morphological analysis).
59