Regular Expressions and Equivalence

What are Regular Expressions?

Regular Expression (regex) is an algebraic notation for describing regular languages. It provides a compact, human-readable way to specify patterns in strings.

Simple Analogy

Think of regular expressions like a search pattern recipe:

DFA/NFA: Step-by-step instructions (state machine)
Regex: Recipe description (“ends with ‘a’ OR contains ‘b’”)

Example: Finding phone numbers, email addresses, or specific text patterns

Formal Definition

A regular expression over alphabet Σ is defined recursively:

Base Cases:

∅ is a regular expression (empty language)

ε is a regular expression (language {ε})

a is a regular expression for any a ∈ Σ (language {a})

Recursive Cases: If r and s are regular expressions: 4. (r|s) is a regular expression (union) 5. (r·s) or (rs) is a regular expression (concatenation) 6. (r*) is a regular expression (Kleene star)

Precedence: * (highest) > concatenation > | (lowest)

Basic Operations

1. Union (|)

Notation: r₁ | r₂ or r₁ + r₂

Meaning: Strings matching either r₁ or r₂

Examples:

a|b          matches: "a" or "b"
0|1          matches: "0" or "1"
cat|dog      matches: "cat" or "dog"

2. Concatenation (·)

Notation: r₁·r₂ or simply r₁r₂

Meaning: String from r₁ followed by string from r₂

Examples:

ab           matches: "ab"
(a|b)c       matches: "ac" or "bc"
01(0|1)      matches: "010" or "011"

3. Kleene Star (*)

Notation: r*

Meaning: Zero or more repetitions of r

Examples:

a*           matches: "", "a", "aa", "aaa", ...
(ab)*        matches: "", "ab", "abab", "ababab", ...
(0|1)*       matches: all binary strings (including ε)

4. Additional Operators

Plus (+): One or more repetitions

r+ = rr* (at least one)
a+ matches: "a", "aa", "aaa", ... (but NOT "")

Optional (?): Zero or one occurrence

r? = r|ε
a? matches: "" or "a"

Exact count ({n}):

a{3} = aaa (exactly 3 a's)
a{2,5} = aa, aaa, aaaa, or aaaaa (2 to 5 a's)

Language Denoted by Regex

L(r) = language (set of strings) described by regex r

Examples

| Regex | L(r) | Description | | ----- | -------------------------------- | --------------------------- | ----------------- | | a | {a} | Single string “a” | | ε | {ε} | Empty string only | | ∅ | ∅ | No strings (empty language) | | a | b | {a, b} | Either “a” or “b” | | ab | {ab} | String “ab” | | a* | {ε, a, aa, aaa, …} | Zero or more a’s | | (a | b)* | All strings over {a,b} | Any combination | | ab | {ε, a, b, aa, bb, aab, abb, …} | a’s followed by b’s |

Regex Examples

Example 1: Strings Ending with ‘1’

Language: L = {w | w ends with ‘1’}

Regex: (0|1)*1

Explanation:

(0|1)*: Any string of 0’s and 1’s
1: Must end with ‘1’

Matches: 1, 01, 11, 001, 101, 111, 0001, …

Example 2: Strings Containing ‘01’

Language: L = {w | w contains “01”}

Regex: (0|1)01(0|1)

Explanation:

(0|1)*: Any prefix
01: The required substring
(0|1)*: Any suffix

Matches: 01, 001, 010, 101, 0011, …

Example 3: Even Number of a’s

Language: L = {w | w has even number of a’s}

Regex: (baba)b

Explanation:

(baba)*: Pairs of a’s separated by b’s
b*: Final b’s

Matches: ε, b, bb, aa, bab, aba, aabb, …

Alternative: b*(abab)*

Example 4: Strings with Length ≥ 3

Language: L = {w | |w| ≥ 3}

Regex: (0|1)(0|1)(0|1)(0|1)*

Or: Σ³Σ* where Σ = (0|1)

Matches: 000, 001, 010, 011, 100, 101, 110, 111, 0000, …

Example 5: Binary Numbers Divisible by 3

Regex: (0|1(01*0)*1)*

Complex pattern: Represents DFA for mod 3 arithmetic

Example 6: Valid Identifiers

Language: Starts with letter, then letters/digits

Regex: [a-zA-Z][a-zA-Z0-9]*

Explanation:

[a-zA-Z]: First character is letter
[a-zA-Z0-9]*: Followed by letters or digits

Matches: x, var1, MyVariable, test_123

Example 7: Email Pattern (Simplified)

Regex: [a-z]+@[a-z]+.[a-z]+

Matches: user@example.com, test@domain.org

Example 8: All Strings Except ε

Language: L = Σ⁺

Regex: (0|1)(0|1)* or (0|1)⁺

Matches: All non-empty strings

Example 9: Strings Not Containing ‘11’

Language: L = {w | “11” is not a substring of w}

Regex: 0*(10+)*1?

Explanation:

0*: Leading zeros
(10+)*: Pattern “1” followed by one or more “0”s
1?: Optional final “1”

Matches: ε, 0, 1, 00, 01, 10, 001, 010, 100, 101, …

Example 10: Exactly Three a’s

Language: L = {w | w has exactly three a’s}

Regex: bababab

Explanation: Three a’s with any number of b’s between/around

Matches: aaa, baaa, ababa, bababab, …

Precedence Rules

Priority (highest to lowest):

* (Kleene star) - highest precedence
· (concatenation)
| (union) - lowest precedence

Examples

| Expression | Interpretation | | ---------- | --------------- | ------------------------------------ | ------------ | ---- | | ab* | a(b*) not (ab)* | | a | bc | a | (bc) not (a | b)c | | ab | c | (ab) | c not a(b | c) | | a | b* | a | (b*) not (a | b)* | | ab*c | a(b*)c | | (a | b)*c | All strings over {a,b} ending with c |

Parentheses override precedence:

ab* ≠ (ab)*
a|b* ≠ (a|b)*

Equivalence of Regular Expressions

Definition

Two regex r₁ and r₂ are equivalent if:

L(r₁) = L(r₂)

They describe the same language

Algebraic Laws

Commutativity

r|s = s|r               (union is commutative)
rs ≠ sr                 (concatenation is NOT commutative)

Associativity

(r|s)|t = r|(s|t)       (union)
(rs)t = r(st)           (concatenation)

Identity Elements

r|∅ = r                 (∅ is identity for union)
rε = εr = r             (ε is identity for concatenation)

Annihilator

r∅ = ∅r = ∅             (∅ annihilates concatenation)

Idempotence

r|r = r                 (union is idempotent)

Distributivity

r(s|t) = rs|rt          (concatenation distributes over union)
(s|t)r = sr|tr

Kleene Star Properties

r* = ε|r|rr|rrr|...     (definition)
r** = r*                (star is idempotent)
∅* = ε                  (important!)
ε* = ε
(r*)* = r*
r*r* = r*
rr* = r*r
r* = (r+)|ε             (star = plus or epsilon)
(r|ε)* = r*             (including ε doesn't change)

Equivalence Examples

Example 1:

(0|1)* = (0*1*)*
Both represent: All binary strings

Example 2:

a*b* ≠ (a|b)*
Left: a's followed by b's only
Right: Any mix of a's and b's

Example 3:

(a|b)*a = a|ba|(a|b)*aa|b*ba
Both: Strings ending with 'a'

Kleene’s Theorem

The Fundamental Result

Kleene’s Theorem: The following are equivalent:

Language is recognized by a DFA

Language is recognized by an NFA

Language is described by a regular expression

Regular Languages = DFA = NFA = Regex

Three Conversions

     DFA ←→ NFA
      ↕       ↕
    Regex ←→ NFA

All three representations have equal expressive power!

Regex to NFA Conversion

Thompson’s Construction

Algorithm: Recursively build NFA from regex using ε-transitions

Base Cases:

1. Empty Language ∅

NFA: q₀ (start, no accept state)
Accepts: Nothing

2. Empty String ε

NFA: q₀ (start and accept)
Accepts: ε only

3. Single Symbol a

NFA:
    a
q₀ → q₁ (accept)

Accepts: "a" only

Recursive Cases

Union: r₁ | r₂

Strategy: Create new start state with ε-transitions to both NFAs

         ε
    ┌────────→ NFA(r₁) ────────┐
    │                          │ ε
q₀ ─┤                          ├──→ qf
    │                          │ ε
    └────────→ NFA(r₂) ────────┘
         ε

Steps:

Create new start q₀ and accept qf
ε-transition from q₀ to both r₁ and r₂
ε-transitions from both accepts to qf

Concatenation: r₁r₂

Strategy: Connect output of NFA(r₁) to input of NFA(r₂)

q₀ → NFA(r₁) → [ε] → NFA(r₂) → qf

Steps:

Remove accept status from r₁’s accept
Add ε-transition to r₂’s start
r₂’s accept becomes final accept

Kleene Star: r*

Strategy: Add loops with ε-transitions

       ┌───────────── ε ────────────┐
       ↓                            │
q₀ → [ε] → NFA(r) → [ε] → qf
 │                        ↑
 └────────── ε ───────────┘

Steps:

Create new start q₀ and accept qf
ε from q₀ to old start (for repetitions)
ε from old accept back to old start (loop)
ε from q₀ to qf (for zero repetitions)
ε from old accept to qf (end repetitions)

Complete Example: (a|b)*a

Step-by-step construction:

1. Build NFA for ‘a’:

    a
q₁ → q₂

2. Build NFA for ‘b’:

    b
q₃ → q₄

3. Union: (a|b):

         ε
    ┌───────→ q₁ ─a→ q₂ ────┐
    │                       │ ε
q₀ ─┤                       ├──→ q₅
    │                       │ ε
    └───────→ q₃ ─b→ q₄ ────┘
         ε

4. Star: (a|b)*:

       ┌──────────────── ε ────────────────┐
       ↓                                   │
q₆ → [ε] → (a|b union NFA) → [ε] → q₇
 │                                  ↑
 └──────────────── ε ───────────────┘

5. Concatenate with ‘a’:

q₈ → q₉ (accepting)
       a

6. Final NFA for (a|b)*a: Connect star NFA to ‘a’ NFA with ε

Result: NFA with ~10 states and many ε-transitions

NFA to Regex Conversion

State Elimination Method

Algorithm: Progressively remove states, updating transitions with regex labels

Steps:

Ensure single start and single accept state
Add ε-transitions if needed
Eliminate states one by one (except start/accept)
Update transition labels with regex expressions
Final regex is label from start to accept

Elimination Rule

When eliminating state q:

For each pair (predecessor p, successor s):

Old transitions:
  p →r₁→ q →r₂→ s
  q →r₃→ q (self-loop)

New transition:
  p →r₁r₃*r₂→ s

Formula: r₁r₃*r₂ (go through q via self-loop)

Example: Convert Simple DFA to Regex

DFA: Strings ending with ‘1’

       0         1
→(A) ───→ ((B))
  ↺ 0       ↺ 1

Step 1: Add new start/accept with ε

       ε         0         1         ε
qₛ → (A) ───→ ((B)) → qₐ
         ↺ 0       ↺ 1

Step 2: Eliminate A

Transition from qₛ to B through A:
  ε · 0*1 = 0*1

Self-loop on A: 0*
Path to B: qₛ →0*1→ B

Step 3: Eliminate B

Transition from qₛ to qₐ through B:
  0*1 · 1* · ε = 0*11* = 0*1 · 1*

Self-loop on B: 1*

Final Regex: 01·1 = 0*1⁺ or (0|1)*1

Result: (0|1)*1

DFA to Regex Directly

Arden’s Theorem

For equation: X = AX | B (where ε ∉ L(A))

Solution: X = A*B

Used to solve: Systems of equations from DFA states

Example Using Arden’s Theorem

DFA: Same as before

State equations:
  A = εA | 0A | 1B     (start from A, or loop, or go to B)
  B = 1A | 1B | ε      (accept state)

Simplify B:

B = 1B | (1A | ε)
B = 1*(1A | ε)      (by Arden's)
B = 1*1A | 1*ε
B = 1⁺A | 1*

Substitute in A:

A = (0|1)A | 1B
A = (0|1)A | 1(1⁺A | 1*)
A = (0|1)A | 11⁺A | 11*
A = ((0|1) | 11⁺)A | 11*
A = ((0|1) | 11⁺)*11*    (by Arden's)

Final regex for A (start state): Complex but equivalent to (0|1)*1

Practical Applications

1. Text Processing

grep, sed, awk (Unix tools)

grep "error|warning" logfile.txt
sed 's/[0-9]+/NUM/g' file.txt

2. Programming Languages

Python:

import re
pattern = r'^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+$'
re.match(pattern, email)

JavaScript:

const regex = /^[A-Z][a-z]*/;
regex.test(name);

3. Lexical Analysis

Compiler tokens:

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*
NUMBER: [0-9]+(\.[0-9]+)?
STRING: "([^"\\]|\\.)*"

4. Input Validation

Form validation:

Email: [a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}
Phone: \d{3}-\d{3}-\d{4}
URL: https?://[^\s]+

5. Data Extraction

Log parsing:

\[(\d{4}-\d{2}-\d{2})\] (ERROR|WARN|INFO): (.*)

Captures: date, level, message

6. Search and Replace

Code refactoring:

Find: function\s+(\w+)\s*\(
Replace: const $1 = (

Extended Regex Features (Practical)

Modern regex engines add features beyond regular languages:

1. Character Classes

\d = [0-9]           (digits)
\w = [a-zA-Z0-9_]    (word characters)
\s = [ \t\n\r]       (whitespace)
. = any character

2. Anchors

^ = start of string
$ = end of string
\b = word boundary

3. Quantifiers

{n} = exactly n times
{n,} = n or more times
{n,m} = between n and m times
? = {0,1}
+ = {1,}
* = {0,}

4. Groups and Backreferences

(pattern) = capturing group
(?:pattern) = non-capturing
\1, \2 = backreference

Example: (\w)\1 matches doubled letters (aa, bb, cc)

Note: Backreferences make regex more powerful than regular languages!

5. Lookahead/Lookbehind

(?=pattern) = positive lookahead
(?!pattern) = negative lookahead
(?<=pattern) = positive lookbehind
(?<!pattern) = negative lookbehind

These go beyond regular languages (not expressible in DFA/NFA)

Important Exam Points

✓ Regex = algebraic notation for regular languages
✓ Three operations: Union (|), Concatenation (·), Star (*)
✓ Precedence: * > concatenation > |
✓ Equivalence: L(r₁) = L(r₂) means same language
✓ Kleene’s Theorem: DFA = NFA = Regex (equivalent)
✓ Thompson’s Construction: Regex → NFA with ε-transitions
✓ State Elimination: NFA → Regex by removing states
✓ Arden’s Theorem: Solve X = AX | B as X = A*B
✓ Base cases: ∅, ε, a (single symbol)
✓ Laws: r* = ε|r⁺, ∅* = ε, (r*)* = r*

Common Mistakes to Avoid

✗ Mistake 1: ab* means (ab)*
✓ Correct: ab* = a(b*), use parentheses for (ab)*

✗ Mistake 2: a|b* means (a|b)*
✓ Correct: a|b* = a|(b*), union has lowest precedence

✗ Mistake 3: Concatenation is commutative
✓ Correct: rs ≠ sr in general (e.g., ab ≠ ba)

✗ Mistake 4: ∅* = ∅
✓ Correct: ∅* = ε (star of empty language is epsilon)

✗ Mistake 5: ε* = ∅
✓ Correct: ε* = ε (star of epsilon is epsilon)

✗ Mistake 6: L(ε) = ∅
✓ Correct: L(ε) = {ε} (set containing empty string)

✗ Mistake 7: Regex can describe all languages
✓ Correct: Regex describes only regular languages

Practice Questions

Q1: What is the language of (0|1)*0?

Answer: All binary strings ending with ‘0’

Q2: Write regex for “contains at least two a’s”

Answer: babab* or (a|b)*a(a|b)a(a|b)

Q3: Are (a*)* and a* equivalent?

Answer: Yes, (a*)* = a* (star is idempotent)

Q4: What is ∅*?

Answer: ε (important edge case!)

Q5: Convert (a|b)*a to NFA using Thompson’s Construction

Answer: Build union of a and b, apply star, concatenate with a

Q6: What is precedence of ab*|c?

Answer: (a(b*))|c = ab* or c

Q7: Can regex with backreferences describe non-regular languages?

Answer: Yes! Backreferences (like \1) go beyond regular languages

Q8: Write regex for valid C identifiers

Answer: [a-zA-Z_][a-zA-Z0-9_]*

Summary

Regular expressions provide algebraic notation for regular languages
Three operations: Union (|), Concatenation (·), Kleene Star (*)
Precedence: * (highest) > concatenation > | (lowest)
Equivalence: Two regex equivalent if they describe same language
Kleene’s Theorem: DFA = NFA = Regex (same expressive power)
Thompson’s Construction: Regex → NFA using ε-transitions
State Elimination: NFA/DFA → Regex by removing states
Arden’s Theorem: X = AX | B solves to X = A*B
Laws: Many algebraic identities (∅, ε, distributivity, etc.)
Applications: Text processing, lexical analysis, pattern matching, validation
Practical regex: Modern engines add features beyond regular languages

Regular expressions bridge human-readable patterns and machine-executable automata, making them essential in computer science!

Regular Expressions and Equivalence

What are Regular Expressions?

Simple Analogy

Formal Definition

Basic Operations

1. Union (|)

2. Concatenation (·)

3. Kleene Star (*)

4. Additional Operators

Language Denoted by Regex

Examples

Regex Examples

Example 1: Strings Ending with ‘1’

Example 2: Strings Containing ‘01’

Example 3: Even Number of a’s

Example 4: Strings with Length ≥ 3

Example 5: Binary Numbers Divisible by 3

Example 6: Valid Identifiers

Example 7: Email Pattern (Simplified)

Example 8: All Strings Except ε

Example 9: Strings Not Containing ‘11’

Example 10: Exactly Three a’s

Precedence Rules

Examples

Equivalence of Regular Expressions

Definition

Algebraic Laws

Commutativity

Associativity

Identity Elements

Annihilator

Idempotence

Distributivity

Kleene Star Properties

Equivalence Examples

Kleene’s Theorem

The Fundamental Result

Three Conversions

Regex to NFA Conversion

Thompson’s Construction

1. Empty Language ∅

2. Empty String ε

3. Single Symbol a

Recursive Cases

Union: r₁ | r₂

Concatenation: r₁r₂

Kleene Star: r*

Complete Example: (a|b)*a

NFA to Regex Conversion

State Elimination Method

Elimination Rule

Example: Convert Simple DFA to Regex

DFA to Regex Directly

Arden’s Theorem

Example Using Arden’s Theorem

Practical Applications

1. Text Processing

2. Programming Languages

3. Lexical Analysis

4. Input Validation

5. Data Extraction

6. Search and Replace

Extended Regex Features (Practical)

1. Character Classes

2. Anchors

3. Quantifiers

4. Groups and Backreferences

5. Lookahead/Lookbehind

Important Exam Points

Common Mistakes to Avoid

Practice Questions

Summary

Install Notes App