What are Regular Expressions?
Regular Expression (regex) is an algebraic notation for describing regular languages. It provides a compact, human-readable way to specify patterns in strings.
Simple Analogy
Think of regular expressions like a search pattern recipe:
- DFA/NFA: Step-by-step instructions (state machine)
- Regex: Recipe description (“ends with ‘a’ OR contains ‘b’”)
Example: Finding phone numbers, email addresses, or specific text patterns
Formal Definition
A regular expression over alphabet Σ is defined recursively:
Base Cases:
- ∅ is a regular expression (empty language)
- ε is a regular expression (language {ε})
- a is a regular expression for any a ∈ Σ (language {a})
Recursive Cases: If r and s are regular expressions: 4. (r|s) is a regular expression (union) 5. (r·s) or (rs) is a regular expression (concatenation) 6. (r*) is a regular expression (Kleene star)
Precedence: * (highest) > concatenation > | (lowest)
Basic Operations
1. Union (|)
Notation: r₁ | r₂ or r₁ + r₂
Meaning: Strings matching either r₁ or r₂
Examples:
a|b matches: "a" or "b"
0|1 matches: "0" or "1"
cat|dog matches: "cat" or "dog"
2. Concatenation (·)
Notation: r₁·r₂ or simply r₁r₂
Meaning: String from r₁ followed by string from r₂
Examples:
ab matches: "ab"
(a|b)c matches: "ac" or "bc"
01(0|1) matches: "010" or "011"
3. Kleene Star (*)
Notation: r*
Meaning: Zero or more repetitions of r
Examples:
a* matches: "", "a", "aa", "aaa", ...
(ab)* matches: "", "ab", "abab", "ababab", ...
(0|1)* matches: all binary strings (including ε)
4. Additional Operators
Plus (+): One or more repetitions
r+ = rr* (at least one)
a+ matches: "a", "aa", "aaa", ... (but NOT "")
Optional (?): Zero or one occurrence
r? = r|ε
a? matches: "" or "a"
Exact count ({n}):
a{3} = aaa (exactly 3 a's)
a{2,5} = aa, aaa, aaaa, or aaaaa (2 to 5 a's)
Language Denoted by Regex
L(r) = language (set of strings) described by regex r
Examples
| Regex | L(r) | Description | | ----- | -------------------------------- | --------------------------- | ----------------- | | a | {a} | Single string “a” | | ε | {ε} | Empty string only | | ∅ | ∅ | No strings (empty language) | | a | b | {a, b} | Either “a” or “b” | | ab | {ab} | String “ab” | | a* | {ε, a, aa, aaa, …} | Zero or more a’s | | (a | b)* | All strings over {a,b} | Any combination | | ab | {ε, a, b, aa, bb, aab, abb, …} | a’s followed by b’s |
Regex Examples
Example 1: Strings Ending with ‘1’
Language: L = {w | w ends with ‘1’}
Regex: (0|1)*1
Explanation:
- (0|1)*: Any string of 0’s and 1’s
- 1: Must end with ‘1’
Matches: 1, 01, 11, 001, 101, 111, 0001, …
Example 2: Strings Containing ‘01’
Language: L = {w | w contains “01”}
Regex: (0|1)01(0|1)
Explanation:
- (0|1)*: Any prefix
- 01: The required substring
- (0|1)*: Any suffix
Matches: 01, 001, 010, 101, 0011, …
Example 3: Even Number of a’s
Language: L = {w | w has even number of a’s}
Regex: (baba)b
Explanation:
- (baba)*: Pairs of a’s separated by b’s
- b*: Final b’s
Matches: ε, b, bb, aa, bab, aba, aabb, …
Alternative: b*(abab)*
Example 4: Strings with Length ≥ 3
Language: L = {w | |w| ≥ 3}
Regex: (0|1)(0|1)(0|1)(0|1)*
Or: Σ³Σ* where Σ = (0|1)
Matches: 000, 001, 010, 011, 100, 101, 110, 111, 0000, …
Example 5: Binary Numbers Divisible by 3
Regex: (0|1(01*0)*1)*
Complex pattern: Represents DFA for mod 3 arithmetic
Example 6: Valid Identifiers
Language: Starts with letter, then letters/digits
Regex: [a-zA-Z][a-zA-Z0-9]*
Explanation:
- [a-zA-Z]: First character is letter
- [a-zA-Z0-9]*: Followed by letters or digits
Matches: x, var1, MyVariable, test_123
Example 7: Email Pattern (Simplified)
Regex: [a-z]+@[a-z]+.[a-z]+
Matches: user@example.com, test@domain.org
Example 8: All Strings Except ε
Language: L = Σ⁺
Regex: (0|1)(0|1)* or (0|1)⁺
Matches: All non-empty strings
Example 9: Strings Not Containing ‘11’
Language: L = {w | “11” is not a substring of w}
Regex: 0*(10+)*1?
Explanation:
- 0*: Leading zeros
- (10+)*: Pattern “1” followed by one or more “0”s
- 1?: Optional final “1”
Matches: ε, 0, 1, 00, 01, 10, 001, 010, 100, 101, …
Example 10: Exactly Three a’s
Language: L = {w | w has exactly three a’s}
Regex: bababab
Explanation: Three a’s with any number of b’s between/around
Matches: aaa, baaa, ababa, bababab, …
Precedence Rules
Priority (highest to lowest):
- * (Kleene star) - highest precedence
- · (concatenation)
- | (union) - lowest precedence
Examples
| Expression | Interpretation | | ---------- | --------------- | ------------------------------------ | ------------ | ---- | | ab* | a(b*) not (ab)* | | a | bc | a | (bc) not (a | b)c | | ab | c | (ab) | c not a(b | c) | | a | b* | a | (b*) not (a | b)* | | ab*c | a(b*)c | | (a | b)*c | All strings over {a,b} ending with c |
Parentheses override precedence:
- ab* ≠ (ab)*
- a|b* ≠ (a|b)*
Equivalence of Regular Expressions
Definition
Two regex r₁ and r₂ are equivalent if:
L(r₁) = L(r₂)
They describe the same language
Algebraic Laws
Commutativity
r|s = s|r (union is commutative)
rs ≠ sr (concatenation is NOT commutative)
Associativity
(r|s)|t = r|(s|t) (union)
(rs)t = r(st) (concatenation)
Identity Elements
r|∅ = r (∅ is identity for union)
rε = εr = r (ε is identity for concatenation)
Annihilator
r∅ = ∅r = ∅ (∅ annihilates concatenation)
Idempotence
r|r = r (union is idempotent)
Distributivity
r(s|t) = rs|rt (concatenation distributes over union)
(s|t)r = sr|tr
Kleene Star Properties
r* = ε|r|rr|rrr|... (definition)
r** = r* (star is idempotent)
∅* = ε (important!)
ε* = ε
(r*)* = r*
r*r* = r*
rr* = r*r
r* = (r+)|ε (star = plus or epsilon)
(r|ε)* = r* (including ε doesn't change)
Equivalence Examples
Example 1:
(0|1)* = (0*1*)*
Both represent: All binary strings
Example 2:
a*b* ≠ (a|b)*
Left: a's followed by b's only
Right: Any mix of a's and b's
Example 3:
(a|b)*a = a|ba|(a|b)*aa|b*ba
Both: Strings ending with 'a'
Kleene’s Theorem
The Fundamental Result
Kleene’s Theorem: The following are equivalent:
- Language is recognized by a DFA
- Language is recognized by an NFA
- Language is described by a regular expression
Regular Languages = DFA = NFA = Regex
Three Conversions
DFA ←→ NFA
↕ ↕
Regex ←→ NFA
All three representations have equal expressive power!
Regex to NFA Conversion
Thompson’s Construction
Algorithm: Recursively build NFA from regex using ε-transitions
Base Cases:
1. Empty Language ∅
NFA: q₀ (start, no accept state)
Accepts: Nothing
2. Empty String ε
NFA: q₀ (start and accept)
Accepts: ε only
3. Single Symbol a
NFA:
a
q₀ → q₁ (accept)
Accepts: "a" only
Recursive Cases
Union: r₁ | r₂
Strategy: Create new start state with ε-transitions to both NFAs
ε
┌────────→ NFA(r₁) ────────┐
│ │ ε
q₀ ─┤ ├──→ qf
│ │ ε
└────────→ NFA(r₂) ────────┘
ε
Steps:
- Create new start q₀ and accept qf
- ε-transition from q₀ to both r₁ and r₂
- ε-transitions from both accepts to qf
Concatenation: r₁r₂
Strategy: Connect output of NFA(r₁) to input of NFA(r₂)
q₀ → NFA(r₁) → [ε] → NFA(r₂) → qf
Steps:
- Remove accept status from r₁’s accept
- Add ε-transition to r₂’s start
- r₂’s accept becomes final accept
Kleene Star: r*
Strategy: Add loops with ε-transitions
┌───────────── ε ────────────┐
↓ │
q₀ → [ε] → NFA(r) → [ε] → qf
│ ↑
└────────── ε ───────────┘
Steps:
- Create new start q₀ and accept qf
- ε from q₀ to old start (for repetitions)
- ε from old accept back to old start (loop)
- ε from q₀ to qf (for zero repetitions)
- ε from old accept to qf (end repetitions)
Complete Example: (a|b)*a
Step-by-step construction:
1. Build NFA for ‘a’:
a
q₁ → q₂
2. Build NFA for ‘b’:
b
q₃ → q₄
3. Union: (a|b):
ε
┌───────→ q₁ ─a→ q₂ ────┐
│ │ ε
q₀ ─┤ ├──→ q₅
│ │ ε
└───────→ q₃ ─b→ q₄ ────┘
ε
4. Star: (a|b)*:
┌──────────────── ε ────────────────┐
↓ │
q₆ → [ε] → (a|b union NFA) → [ε] → q₇
│ ↑
└──────────────── ε ───────────────┘
5. Concatenate with ‘a’:
q₈ → q₉ (accepting)
a
6. Final NFA for (a|b)*a: Connect star NFA to ‘a’ NFA with ε
Result: NFA with ~10 states and many ε-transitions
NFA to Regex Conversion
State Elimination Method
Algorithm: Progressively remove states, updating transitions with regex labels
Steps:
- Ensure single start and single accept state
- Add ε-transitions if needed
- Eliminate states one by one (except start/accept)
- Update transition labels with regex expressions
- Final regex is label from start to accept
Elimination Rule
When eliminating state q:
For each pair (predecessor p, successor s):
Old transitions:
p →r₁→ q →r₂→ s
q →r₃→ q (self-loop)
New transition:
p →r₁r₃*r₂→ s
Formula: r₁r₃*r₂ (go through q via self-loop)
Example: Convert Simple DFA to Regex
DFA: Strings ending with ‘1’
0 1
→(A) ───→ ((B))
↺ 0 ↺ 1
Step 1: Add new start/accept with ε
ε 0 1 ε
qₛ → (A) ───→ ((B)) → qₐ
↺ 0 ↺ 1
Step 2: Eliminate A
Transition from qₛ to B through A:
ε · 0*1 = 0*1
Self-loop on A: 0*
Path to B: qₛ →0*1→ B
Step 3: Eliminate B
Transition from qₛ to qₐ through B:
0*1 · 1* · ε = 0*11* = 0*1 · 1*
Self-loop on B: 1*
Final Regex: 01·1 = 0*1⁺ or (0|1)*1
Result: (0|1)*1
DFA to Regex Directly
Arden’s Theorem
For equation: X = AX | B (where ε ∉ L(A))
Solution: X = A*B
Used to solve: Systems of equations from DFA states
Example Using Arden’s Theorem
DFA: Same as before
State equations:
A = εA | 0A | 1B (start from A, or loop, or go to B)
B = 1A | 1B | ε (accept state)
Simplify B:
B = 1B | (1A | ε)
B = 1*(1A | ε) (by Arden's)
B = 1*1A | 1*ε
B = 1⁺A | 1*
Substitute in A:
A = (0|1)A | 1B
A = (0|1)A | 1(1⁺A | 1*)
A = (0|1)A | 11⁺A | 11*
A = ((0|1) | 11⁺)A | 11*
A = ((0|1) | 11⁺)*11* (by Arden's)
Final regex for A (start state): Complex but equivalent to (0|1)*1
Practical Applications
1. Text Processing
grep, sed, awk (Unix tools)
grep "error|warning" logfile.txt
sed 's/[0-9]+/NUM/g' file.txt
2. Programming Languages
Python:
import re
pattern = r'^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+$'
re.match(pattern, email)
JavaScript:
const regex = /^[A-Z][a-z]*/;
regex.test(name);
3. Lexical Analysis
Compiler tokens:
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*
NUMBER: [0-9]+(\.[0-9]+)?
STRING: "([^"\\]|\\.)*"
4. Input Validation
Form validation:
- Email:
[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,} - Phone:
\d{3}-\d{3}-\d{4} - URL:
https?://[^\s]+
5. Data Extraction
Log parsing:
\[(\d{4}-\d{2}-\d{2})\] (ERROR|WARN|INFO): (.*)
Captures: date, level, message
6. Search and Replace
Code refactoring:
Find: function\s+(\w+)\s*\(
Replace: const $1 = (
Extended Regex Features (Practical)
Modern regex engines add features beyond regular languages:
1. Character Classes
\d = [0-9] (digits)
\w = [a-zA-Z0-9_] (word characters)
\s = [ \t\n\r] (whitespace)
. = any character
2. Anchors
^ = start of string
$ = end of string
\b = word boundary
3. Quantifiers
{n} = exactly n times
{n,} = n or more times
{n,m} = between n and m times
? = {0,1}
+ = {1,}
* = {0,}
4. Groups and Backreferences
(pattern) = capturing group
(?:pattern) = non-capturing
\1, \2 = backreference
Example: (\w)\1 matches doubled letters (aa, bb, cc)
Note: Backreferences make regex more powerful than regular languages!
5. Lookahead/Lookbehind
(?=pattern) = positive lookahead
(?!pattern) = negative lookahead
(?<=pattern) = positive lookbehind
(?<!pattern) = negative lookbehind
These go beyond regular languages (not expressible in DFA/NFA)
Important Exam Points
- ✓ Regex = algebraic notation for regular languages
- ✓ Three operations: Union (|), Concatenation (·), Star (*)
- ✓ Precedence: * > concatenation > |
- ✓ Equivalence: L(r₁) = L(r₂) means same language
- ✓ Kleene’s Theorem: DFA = NFA = Regex (equivalent)
- ✓ Thompson’s Construction: Regex → NFA with ε-transitions
- ✓ State Elimination: NFA → Regex by removing states
- ✓ Arden’s Theorem: Solve X = AX | B as X = A*B
- ✓ Base cases: ∅, ε, a (single symbol)
- ✓ Laws: r* = ε|r⁺, ∅* = ε, (r*)* = r*
Common Mistakes to Avoid
✗ Mistake 1: ab* means (ab)*
✓ Correct: ab* = a(b*), use parentheses for (ab)*
✗ Mistake 2: a|b* means (a|b)*
✓ Correct: a|b* = a|(b*), union has lowest precedence
✗ Mistake 3: Concatenation is commutative
✓ Correct: rs ≠ sr in general (e.g., ab ≠ ba)
✗ Mistake 4: ∅* = ∅
✓ Correct: ∅* = ε (star of empty language is epsilon)
✗ Mistake 5: ε* = ∅
✓ Correct: ε* = ε (star of epsilon is epsilon)
✗ Mistake 6: L(ε) = ∅
✓ Correct: L(ε) = {ε} (set containing empty string)
✗ Mistake 7: Regex can describe all languages
✓ Correct: Regex describes only regular languages
Practice Questions
Q1: What is the language of (0|1)*0?
Answer: All binary strings ending with ‘0’
Q2: Write regex for “contains at least two a’s”
Answer: babab* or (a|b)*a(a|b)a(a|b)
Q3: Are (a*)* and a* equivalent?
Answer: Yes, (a*)* = a* (star is idempotent)
Q4: What is ∅*?
Answer: ε (important edge case!)
Q5: Convert (a|b)*a to NFA using Thompson’s Construction
Answer: Build union of a and b, apply star, concatenate with a
Q6: What is precedence of ab*|c?
Answer: (a(b*))|c = ab* or c
Q7: Can regex with backreferences describe non-regular languages?
Answer: Yes! Backreferences (like \1) go beyond regular languages
Q8: Write regex for valid C identifiers
Answer: [a-zA-Z_][a-zA-Z0-9_]*
Summary
- Regular expressions provide algebraic notation for regular languages
- Three operations: Union (|), Concatenation (·), Kleene Star (*)
- Precedence: * (highest) > concatenation > | (lowest)
- Equivalence: Two regex equivalent if they describe same language
- Kleene’s Theorem: DFA = NFA = Regex (same expressive power)
- Thompson’s Construction: Regex → NFA using ε-transitions
- State Elimination: NFA/DFA → Regex by removing states
- Arden’s Theorem: X = AX | B solves to X = A*B
- Laws: Many algebraic identities (∅, ε, distributivity, etc.)
- Applications: Text processing, lexical analysis, pattern matching, validation
- Practical regex: Modern engines add features beyond regular languages
Regular expressions bridge human-readable patterns and machine-executable automata, making them essential in computer science!