Module 1: NLP
Aarti Dharmani
Estimate bigram probabilities
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>
P(I|<s>) =
P(Sam|<s>) =
P(am|I) =
P(</s>|Sam) =
P(Sam|am) =
P(do|I) =
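The probabilities asked for above can be estimated with the maximum-likelihood formula P(w | prev) = C(prev w) / C(prev), that is, the count of the bigram divided by the count of its first word. The following is a minimal Python sketch, assuming whitespace tokenization and treating <s> and </s> as ordinary tokens; the function name bigram_prob is illustrative and not from the slides.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()                 # assumption: whitespace tokenization
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))   # adjacent token pairs


def bigram_prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]


queries = [("I", "<s>"), ("Sam", "<s>"), ("am", "I"),
           ("</s>", "Sam"), ("Sam", "am"), ("do", "I")]
for word, prev in queries:
    print(f"P({word}|{prev}) = {bigrams[(prev, word)]}/{unigrams[prev]}")
```

For this corpus the sketch gives P(I|<s>) = 2/3, P(Sam|<s>) = 1/3, P(am|I) = 2/3, P(</s>|Sam) = 1/2, P(Sam|am) = 1/2 and P(do|I) = 1/3.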
Given the bigram and unigram counts of a dataset
Bigram counts (row: first word, column: following word)
         i        want     to       eat      chinese  food     lunch    spend
i        5        827      0        9        0        0        0        2
want     2        0        608      1        6        6        5        1
to       2        0        4        686      2        0        6        211
eat      0        0        2        0        16       2        42       0
chinese  1        0        0        0        0        82       1        0
food     15       0        15       0        1        4        0        0
lunch    2        0        0        0        0        1        0        0
spend    1        0        1        0        0        0        0        0

Unigram counts
i        want     to       eat      chinese  food     lunch    spend
2533     927      2417     746      158      1093     341      278
Calculate the probability of a sentence
• P(I want chinese food to eat) = ?
• P(I) × P(want|I) × P(chinese|want) × P(food|chinese) × P(to|food) × P(eat|to) = ?
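The product above can be evaluated from the count tables, since each conditional term is a bigram count divided by the unigram count of the first word, for example P(want|I) = 827/2533. The sketch below is one hedged way to do this in Python. The first factor P(I) cannot be derived from the tables as given (the total token count is not shown), so it is left as a caller-supplied argument p_first; that parameter and the function names are assumptions made for illustration.

```python
import math

words = ["i", "want", "to", "eat", "chinese", "food", "lunch", "spend"]

# Rows: first word; columns: following word (same order as `words`).
bigram_rows = [
    [5, 827, 0, 9, 0, 0, 0, 2],
    [2, 0, 608, 1, 6, 6, 5, 1],
    [2, 0, 4, 686, 2, 0, 6, 211],
    [0, 0, 2, 0, 16, 2, 42, 0],
    [1, 0, 0, 0, 0, 82, 1, 0],
    [15, 0, 15, 0, 1, 4, 0, 0],
    [2, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
unigram_counts = dict(zip(words, [2533, 927, 2417, 746, 158, 1093, 341, 278]))
bigram_counts = {
    (prev, nxt): count
    for prev, row in zip(words, bigram_rows)
    for nxt, count in zip(words, row)
}


def cond_prob(word, prev):
    """P(word | prev) = C(prev word) / C(prev), with no smoothing."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def sentence_prob(sentence, p_first):
    """Bigram probability of a sentence; p_first is the assumed P(first word)."""
    tokens = sentence.lower().split()
    prob = p_first
    for prev, word in zip(tokens, tokens[1:]):
        prob *= cond_prob(word, prev)
    return prob


# With p_first=1.0 this prints only the product of the conditional terms.
print(sentence_prob("I want chinese food to eat", p_first=1.0))
```

Multiplying the printed value by whatever estimate of P(I) is available gives the full sentence probability.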
Regular Expressions
Regular expressions provide a powerful, flexible, and efficient method
for processing text.
The extensive pattern-matching notation of regular expressions
enables you to quickly parse large amounts of text to:
• Find specific character patterns.
• Validate text to ensure that it matches a predefined pattern (such as
an email address).
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
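The two uses listed above, finding patterns and validating text, can be sketched with Python's re module. The choice of Python is an assumption (the slides do not name a regex engine), and the helper names is_valid_email and find_digit_runs are illustrative.

```python
import re

# The email pattern shown above, compiled once for reuse.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)


def is_valid_email(text):
    """Validate: True if text matches the email pattern."""
    return EMAIL_PATTERN.match(text) is not None


def find_digit_runs(text):
    """Find: return every run of one or more digits in text."""
    return re.findall(r"[0-9]+", text)


print(is_valid_email("nlp123@gmail.com"))   # True
print(is_valid_email("nlp123gmail.com"))    # False (no @ symbol)
print(find_digit_runs("room 12, floor 3"))  # ['12', '3']
```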
Elements of Regular Expressions
1. Repeaters ( *, +, and { } )
These symbols act as repeaters (quantifiers): they tell the computer how many times the preceding character may occur.
2. The asterisk symbol ( * )
It tells the computer to match the preceding character (or set of characters) zero or more times, with no upper limit.
3. The plus symbol ( + )
It tells the computer to match the preceding character (or set of characters) one or more times, with no upper limit.
4. The curly braces { … }
It tells the computer to repeat the preceding character (or set of characters) exactly as many times as the number inside the braces. A pair of numbers gives a range: {2,5} means two to five times, and {2,} means at least two times.
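The three repeaters can be compared side by side. The sketch below assumes Python's re module; the helper matches is illustrative and checks whether the whole string fits the pattern.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# * : zero or more of the preceding character
print(matches(r"ab*c", "ac"))          # True  (zero b's)
print(matches(r"ab*c", "abbbc"))       # True  (three b's)

# + : one or more of the preceding character
print(matches(r"ab+c", "ac"))          # False (at least one b is required)
print(matches(r"ab+c", "abc"))         # True

# {n}, {m,n}, {m,} : a fixed count or a range of counts
print(matches(r"ab{2}c", "abbc"))      # True  (exactly two b's)
print(matches(r"ab{2}c", "abc"))       # False
print(matches(r"ab{2,3}c", "abbbc"))   # True  (two to three b's)
print(matches(r"ab{2,}c", "abbbbbc"))  # True  (two or more b's)
```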
5. Wildcard ( . )
The dot symbol matches any single character (by default, any character except a newline), which is why it is called the wildcard character.
6. Optional character ( ? )
This symbol tells the computer that the preceding character may or may not be present in the string to be matched, that is, it may occur zero times or once.
7. The caret ( ^ ) symbol ( Setting position for the match )
The caret symbol tells the computer that the match must start at the beginning of the string or line.
8. The dollar ( $ ) symbol
It tells the computer that the match must occur at the end of the string or before \n at the end of the line or
string.
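The wildcard, the optional marker, and the two anchors can be tried out as follows. This is a sketch assuming Python's re module with its default flags; the helper found is illustrative and reports whether the pattern occurs anywhere in the text.

```python
import re


def found(pattern, text):
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None


# . : any single character except a newline (with default flags)
print(found(r"^c.t$", "cat"))      # True
print(found(r"^c.t$", "ct"))       # False (the dot needs one character)

# ? : the preceding character is optional
print(found(r"^colou?r$", "color"))   # True
print(found(r"^colou?r$", "colour"))  # True

# ^ : the match must start at the beginning of the string
print(found(r"^Sam", "Sam I am"))  # True
print(found(r"^Sam", "I am Sam"))  # False

# $ : the match must end at the end of the string,
#     or just before a newline that ends the string
print(found(r"Sam$", "I am Sam"))    # True
print(found(r"Sam$", "I am Sam\n"))  # True
print(found(r"Sam$", "Sam I am"))    # False
```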
9. Character Classes
A character class matches any one of a set of characters. It is used to match the
most basic element of a language like a letter, a digit, a space, a symbol, etc.
10. [^set_of_characters] Negation:
Matches any single character that is not in set_of_characters. By default, the
match is case-sensitive.
11. [first-last] Character range:
Matches any single character in the range from first to last.
12. The Escape Symbol ( \ )
If you want to match a literal ‘+’, ‘.’, or other special character, add a backslash ( \ ) before that character. This tells the computer to treat the following character as an ordinary character to search for, not as a symbol with a special meaning.
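Character classes, negation, ranges, and the escape symbol can be illustrated with re.findall, which returns every non-overlapping match. The sketch assumes Python's re module; the sample strings are made up for illustration.

```python
import re

# Character class: any one character from the set
print(re.findall(r"[aeiou]", "regex"))    # ['e', 'e']

# Negation: any one character NOT in the set
print(re.findall(r"[^aeiou]", "regex"))   # ['r', 'g', 'x']

# Ranges: digits and uppercase letters (matching is case-sensitive by default)
print(re.findall(r"[0-9]", "nlp123"))     # ['1', '2', '3']
print(re.findall(r"[A-Z]", "I am Sam"))   # ['I', 'S']

# Escape: without the backslash, + is a repeater, so "1+1" means
# "one or more 1s followed by a 1"; with it, + is a literal plus sign.
print(re.findall(r"1+1", "1+1=2 and 11"))   # ['11']
print(re.findall(r"1\+1", "1+1=2 and 11"))  # ['1+1']
```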
13. Grouping Characters ( )
Several symbols of a regular expression can be grouped together to act as a single unit. To do this, wrap that part of the regular expression in parentheses ( ).
14. Vertical Bar ( | )
Matches any one of the alternatives separated by the vertical bar (|) character.
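Grouping and alternation are easiest to see next to the same pattern without them. The sketch below assumes Python's re module; the helper matches is illustrative and checks the whole string.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# Grouping: the repeater applies to the whole group, not just one character
print(matches(r"(ab)+", "ababab"))   # True  (the unit "ab" repeated)
print(matches(r"ab+", "ababab"))     # False (only "b" is repeated here)

# Vertical bar: any one of the alternatives
print(matches(r"cat|dog", "dog"))    # True
print(matches(r"cat|dog", "cow"))    # False

# Both together: alternatives limited to one part of the pattern
print(matches(r"gr(a|e)y", "gray"))  # True
print(matches(r"gr(a|e)y", "grey"))  # True
print(matches(r"gr(a|e)y", "groy"))  # False
```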
Write Regular Expressions for the following cases
1. Mobile number: should start with 8 or 9; total number of digits: 10
• A mobile number that starts with 8 or 9 and has a total of 10 digits can be matched with the following regular expression:
• ^[89]\d{9}$
• Explanation:
• ^[89]: The caret (^) asserts the start of the string. [89] means the first digit should be 8 or 9.
• \d{9}: \d represents any digit, and {9} specifies that there should be exactly 9 digits following the first one.
• $: The dollar sign asserts the end of the string.
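The mobile-number pattern can be checked against a few sample inputs. This is a sketch assuming Python's re module; the function name is_valid_mobile and the sample numbers are illustrative. One caveat: on Python 3 strings, \d also accepts digits from other scripts unless the re.ASCII flag is set, so the flag is used here as a design choice.

```python
import re

# re.ASCII restricts \d to the digits 0-9.
MOBILE_PATTERN = re.compile(r"^[89]\d{9}$", re.ASCII)


def is_valid_mobile(number):
    """True if number is 10 digits long and starts with 8 or 9."""
    return MOBILE_PATTERN.match(number) is not None


print(is_valid_mobile("9876543210"))   # True
print(is_valid_mobile("7876543210"))   # False (starts with 7)
print(is_valid_mobile("98765"))        # False (too short)
print(is_valid_mobile("98765432101"))  # False (11 digits)
```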
2. Email ID: should have the format "nlp123@gmail.com"
• ^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$
• Explanation:
• ^[a-zA-Z0-9]+: Starts with one or more alphanumeric characters.
• @: Contains the "@" symbol.
• [a-zA-Z0-9]+: Followed by one or more alphanumeric characters for the
domain name.
• \.: Contains a dot before the top-level domain.
• [a-zA-Z]{2,}$: Ends with at least two alphabetic characters for the top-level
domain.
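A short check of this email pattern, again as a sketch assuming Python's re module with an illustrative function name. It also shows a limit of the pattern: because only alphanumeric characters are allowed before and after the @, addresses containing dots, underscores, or hyphens in those parts are rejected.

```python
import re

SIMPLE_EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$")


def is_simple_email(text):
    """True if text has the form name@domain.tld with alphanumeric parts."""
    return SIMPLE_EMAIL_PATTERN.match(text) is not None


print(is_simple_email("nlp123@gmail.com"))   # True
print(is_simple_email("nlp123@gmail"))       # False (no top-level domain)
print(is_simple_email("nlp.123@gmail.com"))  # False (dot in the name part)
```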
3. First character uppercase, contains lowercase letters, only one digit allowed in between
• ^[A-Z][a-z]*\d?[a-z]*$
• Explanation:
• ^[A-Z]: The caret (^) asserts the start of the string. [A-Z] means the first
character should be an uppercase letter.
• [a-z]*: Matches zero or more lowercase letters.
• \d?: Optionally matches one digit.
• [a-z]*$: Matches zero or more lowercase letters until the end of the string.
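• This regular expression ensures that the first character is uppercase and that the rest of the string consists of lowercase letters with at most one digit among them. Note that it also accepts a digit in the last position and a single uppercase letter on its own.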
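The behaviour of this pattern, including its edge cases, can be confirmed with a few inputs. This is a sketch assuming Python's re module; the function name is_valid_name and the sample strings are illustrative.

```python
import re

NAME_PATTERN = re.compile(r"^[A-Z][a-z]*\d?[a-z]*$", re.ASCII)


def is_valid_name(text):
    """True if text is an uppercase letter, then lowercase letters
    with at most one digit among them."""
    return NAME_PATTERN.match(text) is not None


print(is_valid_name("Sam"))    # True  (no digit)
print(is_valid_name("Sa1m"))   # True  (one digit in between)
print(is_valid_name("S1a2m"))  # False (two digits)
print(is_valid_name("sam"))    # False (first character not uppercase)
print(is_valid_name("Sam1"))   # True  (digit in the last position is accepted)
print(is_valid_name("A"))      # True  (lowercase letters are not required)
```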