KEMBAR78
RegexFSA | PDF | Regular Expression | Theoretical Computer Science
0% found this document useful (0 votes)
3 views59 pages

RegexFSA

Regex notes

Uploaded by

arkham2462
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views59 pages

RegexFSA

Regex notes

Uploaded by

arkham2462
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 59

Regular Expressions and

Finite State Automata

E Jembere
What we will cover in this Unit

• Regular Expressions (RE)


• Finite State Automata (FSA)

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 2


Introductory Problem
• How can we create a price list for the
laptops given some information in a
text file?

08/17/2025 3
4
Regular Expressions

• A regular expression is a pattern to


be matched against a string. For
example, the pattern $299.99.
• Matching either succeeds or fails.
• Sometimes you may want to replace
a matched pattern with another
string.
• Regular expressions are used by
many other Unix commands and
programs, such as grep, sed, awk,
vi, emacs, nano, etc.
© Jurafsky and Martin, Speech and Language Processing
08/17/2025 5
There is more to RE

• Formally, a Regular Expression is an


algebraic notation for characterising
a set of Strings
• A regular expression:
 Can specify search strings
 Define a language in a formal way.

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 6


Disjunction
• Letters in square Brackets “[]”
• E.g. Find all the instances of the word
“regular expression” in a text.

A regular expression is a pattern to be matched against a


string. For example, the pattern $299.99. Matching either
succeeds or fails. Sometimes you may want to replace a
matched pattern with another string. Regular expressions
are used by many other Unix commands and programs,
such as grep, sed, awk, vi, emacs, nano and even some
shells

 regular expression
 [Rr]egular expression
© Jurafsky and Martin, Speech and Language Processing
08/17/2025 7
Ranges

Pattern Matches
[A-Z] An upper case letter Matching either succeeds or fails
[a-z] A lower case letter Matching either succeeds or fails
[0-9] A single digit There is no digit larger than 9

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 8


Negation in Disjunction

Pattern Matches
[^A-Z] Not an upper case letter

[^Ss] Neither ‘S’ nor ‘s’

[^e] Not ‘e’

a\^b The pattern ‘a^b’

[^\.] Not a period

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 9


More on Disjunctions

• Suppose we want to search for the string ‘cats’ or string


‘dogs’ in a given text

Cats & Dogs is a 2001 American-Australian action-comedy film, directed


by Lawrence Guterman. The screenplay by John Requa and Glenn Ficarra
centers on the relationships between cats and dogs, depicting the
relationship as an intense rivalry in which both sides use organizations and
tactics that mirror those used in human espionage. John starred in the
movie.

• We can use the pipe ‘|‘ for disjunction


• We can also combine the | and []
 E.g [Jj]ohn|[Rr]ichard

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 10


Special Characters
• Special characters: ? * + .

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 11


Anchors
• ^ matches the beginning of a line;
• $ the end of a line

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 12


Operator Precedence Hierarchy

Group Symbols
Parenthesis ()
Counters *, +, ?, {}
Sequences and the, ^my, end$
Anchors
Disjunctions I

• E.g. ‘the*’ matches ‘theeeeee’ and not


‘thethe’

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 13


The Sheep Language
• baa!, baaa!, baaaa!, baaaaa!...
 Any string from this infinite set is in the sheep
language.
• It has three characters in its alphabet; a, b
and !.
• Which of the following is a correct Regex
for the sheep language?
a) ba*!
b) ba+!
c) baa*!
d) baa+!
e) baaa*!
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 14
15
Regex Substitution and
Memory
• Substitution in Java
String pattern = "(\\w)(\\s+)([\\.,])";
System.out.println(EXAMPLE_TEST.replaceAll(pattern,
"$1$3"));
 Replaces white space between a word character and a .

• Regex in Perl
 Syntax: s/regexp1/modifier/
 E.g.
 s/I am feeling (.+)/You are feeling \1?/
 s/I gave the (.+) to (.+)/Why would you give \2 the \1?/

16
/g specifies that a
regex should perform
a global search.

17
ELIZA: Substitutions Using
Memory
User: Men are all alike.
ELIZA: IN WHAT WAY
s/.* all .*/IN WHAT WAY/

They’re always bugging us about something or other.


ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

User: My boyfriend says I’m depressed.


ELIZA: I AM SORRY TO HEAR YOU ARE DEPRESSED
s/.* I’m (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

18
Finite State Automata
• Regular expressions can be viewed
as a textual way of specifying the
structure of finite-state automata.
• Any regular expression can be
represented as an FSA.
• FSA capture significant aspects of
what linguists say we need for
Natural Language Processing.

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 19


FSAs as Graphs
• Let’s revisit the sheep language
 /baa+!/

• Directed graph with labeled nodes and arc


transitions
• Five states: q0 the start state, q4 the final state,
5 transitions

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 20


Sheep Language FSA

• We can say the following things about


this machine:
 It has 5 states
 b, a, and ! are in its alphabet
 q0 is the start state
 q4 is an accept state
 It has 5 transitions

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 21


But Note

• There are other machines that


correspond to this same language

• More on this one later

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 22


More Formally
• FSA is a 5-tuple consisting of
 Q: set of states {q0, q1, q2, q3, q4}
 : an alphabet of symbols {a,b,!}
 q0: a start state
 F: a set of final states in Q {q4}
 (q,i): a transition function mapping Q x  to
Q

23
About Alphabets
• Don’t take term alphabet too
narrowly; it just means we need a
finite set of symbols in the input.
• These symbols can and will stand for
bigger objects that can have internal
structure.

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 24


Dollars and Cents

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 25


State Transition Table

• An FSA can also b a ! ϵ


be represented 0 1
as table/matrix
1 2
2 2,3
If you’re in state 1
and you’re looking 3 4
at an a, go to state
2 4:

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 26


Recognition

• Recognition is the process of


determining if a string should be
accepted by the machine
• Or… it’s the process of determining if a
string is in the language defined by the
machine
• Or… it’s the process of determining if a
regular expression matches a string
• These all amount to the same thing.
08/17/2025 © Jurafsky and Martin, Speech and Language Processing 27
Recognition…..
• Traditionally, (Turing’s notion) this process
is depicted with a tape.

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 28


Recognition…..
• FSA recognizes (accepts) strings of a
regular language
 baa!
 baaa!
 baaa!
 …
• Tape metaphor: a rejected input,
q0

a b a ! b
29
Recognition….
• Simply a process of starting in the
start state
• Examining the current input
• Consulting the table
• Going to a new state and updating
the tape pointer.
• Until you run out of tape.

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 30


D-Recognize

31
q0 q1 q2 q3 q3 q3 q4

b a a a a !
Input
State
b a !

0 1 Ø Ø

1 Ø 2 Ø

2 Ø 3 Ø

3 Ø 3 4

4: Ø Ø Ø
32
Key Points
• Deterministic means that at each point in
processing there is always one unique thing
to do (no choices).
• D-recognize is a simple table-driven
interpreter
• The algorithm is universal for all
unambiguous regular languages.
 To change the machine, you simply change the
table.
• Crudely therefore… matching strings with
regular expressions (e.g Perl, grep, etc.) is a
matter of:
 translating
08/17/2025
the
© Jurafsky and Martin, Speech regular
and Language Processing expression into a FSM (a 33
Generative Formalisms
• FSAs can be viewed from two
perspectives:
 Acceptors that can tell you if a string is
in the language
 Generators to produce all and only the
strings in the language

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 34


Non-Determinism

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 35


Non-Determinism….
• Epsilon-transitions
 Key point: these transitions do not
examine or advance the tape during
recognition

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 36


ND Recognition
• Two basic approaches (used in all
major implementations of regular
expressions)
1. Either take a ND machine and convert
it to a D machine and then do
recognition with that.
2. Or explicitly manage the process of
recognition as a state-space search
(leaving the machine as it is).

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 37


ND Recognition…
• In a ND FSA there exists at least one path
directed through the machine by a string
that is in the language defined by the
machine that leads to an accept condition.

• But not all paths directed through the


machine by an accepted string lead to an
accept state. It is OK for some paths to lead
to a reject condition.

• In a ND FSA no path directed through the


machine by a string outside the language
leads to an accept condition. 38
ND Recognition…
• So success in a non-deterministic recognition occurs
when a path is found through the machine that ends
in an accept.

• However, being driven to a reject condition by an


input does not imply it should be rejected.

• Failure occurs only when none of the possible paths


lead to an accept state.

• This means that the problem of non-deterministic


recognition can be thought of as a standard search
problem.
39
The Problem of Choice

• Choice in non-deterministic models comes


up again and again in NLP.
• Several Standard Solutions exist:
1. Backup
 Save input/state of machine at choice points
 If wrong choice, use this saved state to back up and try
another choice
2. Lookahead
 Look ahead in the input to help make a choice

3. Parallelism
 Look at all choices in parallel

40
Backup
• After a wrong choice leads to a dead-end
(either no input left in a non-accept state,
or no legal transitions), return to a previous
choice point to pursue another unexplored
choice.

• Thus, at each choice point, the search


process needs to remember the
(unexplored) choices.

• Standard State Space Search.


 State = (FSA node or machine state, tape-
41
Example

b a a a ! \

q0 q1 q2 q2 q3 q4

42
ND-Recognize Code

43
Example
Agenda:

44
Example

45
Example

Agenda:

46
Example

47
Example

Agenda:

48
Example

49
Example

Agenda:

50
Example

Agenda:

51
Example

Agenda:

52
Example

53
Example

Agenda:

54
Example

Agenda:

55
Example

Agenda:

56
Example

Agenda:

57
Recognition as Search

• You can view this algorithm as a trivial kind of state-


space search.
• States are pairings of tape positions and state
numbers.
• Operators are compiled into the table
• Goal state is a pairing with the end of tape position
and a final accept state
• Different search strategies can be used for the
state-space serach in ND-recognize.
 Depth-First Search,
 Breadth-First Search,
 etc

08/17/2025 © Jurafsky and Martin, Speech and Language Processing 58


Summing Up
• Regular expressions and FSAs can represent
subsets of natural language as well as regular
languages
 Both representations may be impossible for
humans to understand for any real subset of
a language
 But they are very easy to use for smaller
subsets
 They are extremely useful for certain
subproblems (e.g., morphological analysis).

59

You might also like