KEMBAR78
Regular Expression | PDF | Regular Expression | Bracket
0% found this document useful (0 votes)
11 views32 pages

Regular Expression

Uploaded by

shriyansh363
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views32 pages

Regular Expression

Uploaded by

shriyansh363
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 32

REGULAR EXPRESSION Dr.

Jitendra Kumar Rout


Dept. of CSE, NIT Raipur
REGULAR EXPRESSION
A regular expression is an algebraic notation for characterizing a set of strings.
useful for searching in texts, when we have a pattern to search for and a corpus of
texts to search through.
A regular expression search function will search through the corpus, returning all
texts that match the pattern.
A search can be designed to return every match on a line, if there are more than
one, or just the first match.
In the following examples we generally underline the exact part of the pattern that
matches the regular expression and show only the first match.
We’ll show regular expressions delimited by slashes but note that slashes are not
part of the regular expressions.
BASIC REGULAR EXPRESSION PATTERNS

Regular expressions are case sensitive; lower case /s/ is distinct from upper case
/S/.
The pattern /woodchucks/ will not match the string Woodchucks.
BASIC REGULAR EXPRESSION PATTERNS
What if we need both upper case and lower case?
Solution: Use of use the square braces [ and ].
The pattern /[wW]/ matches patterns containing either w or W.
BASIC REGULAR EXPRESSION PATTERNS
The regular expression /[1234567890]/ specifies any single digit.
it’s inconvenient to specify /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ to mean “any
capital letter”
In cases where there is a well-defined sequence associated with a set of characters,
the brackets can be used with the dash (-) to specify any one character in a range.
The pattern /[2-5]/ specifies any one of the characters 2, 3, 4, or 5.
 The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or g.
BASIC REGULAR EXPRESSION PATTERNS
The square braces can also be used to specify what a single character cannot be,
by use of the caret ^. If the caret ^ is the first symbol after the open square brace [,
the resulting pattern is negated.
For example, the pattern /[^a]/ matches any single character (including special
characters) except a.
This is only true when the caret is the first symbol after the open square brace. If it
occurs anywhere else, it usually stands for a caret.
BASIC REGULAR EXPRESSION PATTERNS
How can we talk about optional elements, like an optional s in woodchuck and
woodchucks?
We can’t use the square brackets, because while they allow us to say “s or S”, they
don’t allow us to say “s or nothing”.
For this we use the question mark /?/, which means “the preceding character or
nothing”.
BASIC REGULAR EXPRESSION PATTERNS
How to deal with “zero or one instances of the previous character”?

This language consists of strings with a b, followed by at least two a’s, followed by an
exclamation point.
Kleene * (generally pronounced “cleany star”).
The Kleene star means “zero or more occurrences of the immediately previous character
or regular expression”.
So /a*/ means “any string of zero or more as”. This will match a or aaaaaa, but it will also
match Off Minor since the string Off Minor has zero a’s. So the regular expression for
matching one or more a is /aa*/, meaning one a followed by zero or more as.
BASIC REGULAR EXPRESSION PATTERNS
More complex patterns can also be repeated. So /[ab]*/ means “zero or more a’s or b’s”
(not “zero or more right square braces”).
This will match strings like aaaa or ababab or bbbb.
For specifying multiple digits (useful for finding prices) we can extend /[0-9]/, the regular
expression for a single digit.
An integer (a string of digits) is thus /[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
Sometimes it’s annoying to have to write the regular expression for digits twice, so there is a
shorter way to specify “at least one” of some character.
This is the Kleene +, which means “one or more occurrences of the immediately preceding
character or regular expression”.
Thus, the expression /[0-9]+/ is the normal way specify “a sequence of digits”.
There are thus two ways to specify the sheep language: /baaa*!/ or /baa+!/.
BASIC REGULAR EXPRESSION PATTERNS
One very important special character is the period (/./), a wildcard expression that
matches any single character (except a carriage return).

The wildcard is often used together with the Kleene star to mean “any string of
characters”.
 For example, suppose we want to find any line in which a particular word, for
example, aardvark, appears twice.
We can specify this with the regular expression /aardvark.*aardvark/.
ANCHORS IN REGULAR EXPRESSIONS
Anchors are special characters that anchor regular expressions to particular places in a string.
The caret ^ matches the start of a line. The pattern /^The/ matches the word The only at the start
of a line.
Thus, the caret ^ has three uses:
to match the start of a line,
to indicate a negation inside of square brackets,
and just to mean a caret.

The dollar sign $ matches the end of a line.


So the pattern $ is a useful pattern for matching a space at the end of a line, and /^The
dog\.$/ matches a line that contains only the phrase The dog.
(We have to use the backslash here since we want the . to mean “period” and not the wildcard.)
ANCHORS IN REGULAR EXPRESSIONS

There are also two other anchors: \b matches a word boundary, and \B matches a non-boundary.
Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any sequence of
digits, underscores, or letters; this is based on the definition of “words” in programming languages.
For example, /\b99\b/ will match the string 99 in “There are 99 bottles of beer on the wall”
(because 99 follows a space) but not 99 in “There are 299 bottles of beer on the wall” (since 99
follows a number). But it will match 99 in “$99” (since 99 follows a dollar sign ($), which is not a digit,
underscore, or letter).
DISJUNCTION, GROUPING, AND PRECEDENCE
we want to search for a pet i.e. either the string cat or the string dog.
Since we can’t use the square brackets to search for “cat or dog” (why can’t we say
/[catdog]/?)
a new operator i.e. the disjunction operator, denoted as the pipe symbol | is used for
this purpose.
The pattern /cat|dog/ matches either the string cat or the string dog.
How can I specify both guppy and guppies?
We cannot simply say /guppy|ies/, because that would match only the strings guppy
and ies.
This is because sequences like guppy take precedence over the disjunction operator |.
DISJUNCTION, GROUPING, AND PRECEDENCE
To make the disjunction operator apply only to a specific pattern, we need to use the
parenthesis operators ( and ).
So the pattern /gupp(y|ies)/ would specify that we meant the disjunction only to apply to the
suffixes y and ies.
Unlike the | operator, the Kleene* operator applies by default only to a single character,
not to a whole sequence.
Suppose we want to match repeated instances of a string.
Perhaps we have a line that has column labels of the form Column 1 Column 2 Column 3.
The expression will not match any number of columns; instead, it will
match a single column followed by any number of spaces! The star here applies only to the
space that precedes it, not to the whole sequence.
With the parentheses, we could write the expression to match the
word Column, followed by a number and optional spaces, the whole pattern repeated zero or
more times.
OPERATOR PRECEDENCE HIERARCHY FOR REGULAR
EXPRESSIONS

The table gives the order of RE operator precedence, from highest precedence to
lowest precedence.
Thus, because counters have a higher precedence than sequences, /the*/ matches
theeeee but not thethe.
 Because sequences have a higher precedence than disjunction, /the|any/ matches
the or any but not thany or theny.
GREEDY VS NON-GREEDY MATCHING
Consider the expression /[a-z]*/ when matching against the text once upon a time.
Since /[a-z]*/ matches zero or more letters, this expression could match nothing, or
just the first letter o, on, onc, or once.
In these cases regular expressions always match the largest string they can; we
say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning
of the ? qualifier.
The operator *? is a Kleene star that matches as little text as possible.
The operator +? is a Kleene plus that matches as little text as possible.
A SIMPLE EXAMPLE
Suppose we wanted to write a RE to find cases of the English article the.
A simple (but incorrect) pattern might be: /the/
One problem is that this pattern will miss the word when it begins a sentence and
hence is capitalized (i.e., The).
 This might lead us to the following pattern: /[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g.,
other or theology).
So we need to specify that we want instances with a word boundary on both sides:
/\b[tT]he\b/
A SIMPLE EXAMPLE
Suppose we wanted to do this without the use of /\b/.
We might want this since /\b/ won’t treat underscores and numbers as word boundaries; but we might
want to find the in some context where it might also have underlines or numbers nearby (the_ or
the25).
 We need to specify that we want instances in which there are no alphabetic letters on either side of
the the: /[^a-zA-Z][tT]he[^a-zA-Z]/
But there is still one more problem with this pattern: it won’t find the word the when it begins a line.
This is because the regular expression [^a-zA-Z], which we used to avoid embedded instances of the,
implies that there must be some single (although non-alphabetic) character before the the.
We can avoid this by specifying that before the the we require either the beginning-of-line or a non-
alphabetic character, and the same at the end of the line:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
A SIMPLE EXAMPLE
The process we just went through was based on fixing two kinds of errors: false positives,
strings that we incorrectly matched like other or there, and false negatives, strings that we
incorrectly missed, like The.
Addressing these two kinds of errors comes up again and again in implementing speech
and language processing systems.
Reducing the overall error rate for an application thus involves two antagonistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
MORE OPERATORS: SOME ALIASES FOR COMMON RANGES
Besides the Kleene * and Kleene + we can also use explicit numbers as counters, by
enclosing them in curly brackets.
The regular expression /{3}/ means “exactly 3 occurrences of the previous character or
expression”.
So /a\.{24}z/ will match a followed by 24 dots followed by z (but not a followed by 23
or 25 dots followed by a z).
REGULAR EXPRESSION OPERATORS FOR COUNTING
.
SPECIAL CHARACTERS
To refer to characters that are special themselves (like ., *, [, and \), precede them
with a backslash, (i.e., /\./, /\*/, /\[/, and /\\/).
A MORE COMPLEX EXAMPLE: POWER OF RE S
Suppose we want to build an application to help a user buy a computer on the Web.
The user might want “any machine with at least 6 GHz and 500 GB of disk space for
less than $1000”.
To do this kind of retrieval, we first need to be able to look for expressions like 6 GHz
or 500 GB or Mac or $999.99.
First, let’s complete our regular expression for prices. Here’s a regular expression for a
dollar sign followed by a string of digits: /$[0-9]+/ ($ here doesn’t mean end-of-line. )
Now we just need to deal with fractions of dollars. We’ll add a decimal point and two
digits afterwards: /$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents optional
and to make sure we’re at a word boundary: /(^|\W)$[0-9]+(\.[0-9][0-9])?\b/
A MORE COMPLEX EXAMPLE: POWER OF RE S
One last catch! This pattern allows prices like $199999.99 which would be far too
expensive! We need to limit the dollars: /(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
How about disk space? We’ll need to allow for optional fractions again (5.5 GB); note the
use of ? for making the final s optional, and the of to mean “zero or more spaces”
since there might always be extra spaces lying around:

Modify this regular expression so that it only matches more than 500 GB. (Assignment).
SUBSTITUTION, CAPTURE GROUPS
For example, the substitution operator s/regexp1/pattern/ used in Python and in Unix
commands like vim or sed allows a string characterized by a regular expression to be
replaced by another string: s/colour/color/
It is often useful to be able to refer to a particular subpart of the string matching the first
pattern.
 For example, suppose we wanted to put angle brackets around all integers in a text, for
example, changing the 35 boxes to the <35> boxes.
We’d like a way to refer to the integer we’ve found so that we can easily add the brackets.
To do this, we put parentheses ( and ) around the first pattern and use the number operator \1
in the second pattern to refer back.
Here’s how it looks: s/([0-9]+)/<\1>/
SUBSTITUTION, CAPTURE GROUPS
The parenthesis and number operators can also specify that a certain string or expression must
occur twice in the text.
For example, suppose we are looking for the pattern “the Xer they were, the Xer they will be”,
where we want to constrain the two X’s to be the same string.
We do this by surrounding the first X with the parenthesis operator, and replacing the second X
with the number operator \1, as follows: /the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in parentheses.
So this will match the bigger they were, the bigger they will be but not the bigger they were, the faster
they will be.
SUBSTITUTION, CAPTURE GROUPS
This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the
resulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group.
Thus /the (.*)er they (.*), the \1er we \2/ will match the faster they ran, the faster
we ran but not the faster they ran, the faster we ate.
Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
Parentheses thus have a double function in regular expressions; they are used to
group terms for specifying the order in which operators should apply, and they are
used to capture something in a register.
SUBSTITUTION, CAPTURE GROUPS
Occasionally we might want to use parentheses for grouping, but don’t want to
capture the resulting pattern in a register.
In that case we use a non-capturing group, which is specified by putting the
commands ?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few.
Substitutions and capture groups are very useful in implementing simple chatbots
like ELIZA(Weizenbaum, 1966).
LOOKAHEAD
There will be times when we need to predict the future: look ahead in the text to see if
some pattern matches, but not advance the match cursor, so that we can then deal with the
pattern if it occurs.
These lookahead assertions make use of the (? syntax for non-capture groups.
The operator (?= pattern) is true if pattern occurs, but is zero-width, i.e. the match
pointer doesn’t advance.
The operator (?! pattern) only returns true if a pattern does not match, but again is zero-
width and doesn’t advance the cursor.
 Negative lookahead is commonly used when we are parsing some complex pattern but
want to rule out a special case.
For example suppose we want to match, at the beginning of a line, any single word that
doesn’t start with “Volcano”. We can use negative lookahead to do this:
/^(?!Volcano)[A-Za-z]+/
CORPUS, WORDS, LEMMA, TYPES, TOKENS
corpus (plural corpora), a computer-readable collection of text or speech.
How about inflected forms like cats versus cat? These two words have the same lemma cat but are
different wordforms.
A lemma is a set of lexical forms having he same stem, the same major part-of-speech, and the
same word sense.
The wordform is the full inflected or derived form of the word.
For morphologically complex languages like Arabic, we often need to deal with lemmatization. For
many tasks in English, however, wordforms are sufficient.
How many words are there in English? To answer this question we need to distinguish two ways of
talking about words.
Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the
number of types is the vocabulary size |V|.
 Tokens are the total number N of running words.
SOME ENGLISH LANGUAGE CORPRA
HERDAN’S LAW OR HEAPS’ LAW
The larger the corpora we look at, the more word types we find, and in fact this relationship
between the number of types |V| and number of tokens N is called Herdan’s Law
(Herdan, 1960)or Heaps’ Law (Heaps, 1978) after its discoverers (in linguistics and
information retrieval respectively).
 It is shown in Eq.2.1, where k and β are positive constants, and 0 < β < 1.

The value of β depends on the corpus size and the genre, but at least for the large corpora
in Fig.2.11, β ranges from .67 to .75. Roughly then we can say that the vocabulary size for a
text goes up significantly faster than the square root of its length in words.

You might also like