Regular Expressions and Equivalence

What are Regular Expressions?

Regular Expression (regex) is an algebraic notation for describing regular languages. It provides a compact, human-readable way to specify patterns in strings.

Simple Analogy

Think of regular expressions like a search pattern recipe:

  • DFA/NFA: Step-by-step instructions (state machine)
  • Regex: Recipe description (“ends with ‘a’ OR contains ‘b’”)

Example: Finding phone numbers, email addresses, or specific text patterns

Formal Definition

A regular expression over alphabet Σ is defined recursively:

Base Cases:

  1. is a regular expression (empty language)
  2. ε is a regular expression (language {ε})
  3. a is a regular expression for any a ∈ Σ (language {a})

Recursive Cases: If r and s are regular expressions: 4. (r|s) is a regular expression (union) 5. (r·s) or (rs) is a regular expression (concatenation) 6. (r*) is a regular expression (Kleene star)

Precedence: * (highest) > concatenation > | (lowest)

Basic Operations

1. Union (|)

Notation: r₁ | r₂ or r₁ + r₂

Meaning: Strings matching either r₁ or r₂

Examples:

a|b          matches: "a" or "b"
0|1          matches: "0" or "1"
cat|dog      matches: "cat" or "dog"

2. Concatenation (·)

Notation: r₁·r₂ or simply r₁r₂

Meaning: String from r₁ followed by string from r₂

Examples:

ab           matches: "ab"
(a|b)c       matches: "ac" or "bc"
01(0|1)      matches: "010" or "011"

3. Kleene Star (*)

Notation: r*

Meaning: Zero or more repetitions of r

Examples:

a*           matches: "", "a", "aa", "aaa", ...
(ab)*        matches: "", "ab", "abab", "ababab", ...
(0|1)*       matches: all binary strings (including ε)

4. Additional Operators

Plus (+): One or more repetitions

r+ = rr* (at least one)
a+ matches: "a", "aa", "aaa", ... (but NOT "")

Optional (?): Zero or one occurrence

r? = r|ε
a? matches: "" or "a"

Exact count ({n}):

a{3} = aaa (exactly 3 a's)
a{2,5} = aa, aaa, aaaa, or aaaaa (2 to 5 a's)

Language Denoted by Regex

L(r) = language (set of strings) described by regex r

Examples

| Regex | L(r) | Description | | ----- | -------------------------------- | --------------------------- | ----------------- | | a | {a} | Single string “a” | | ε | {ε} | Empty string only | | ∅ | ∅ | No strings (empty language) | | a | b | {a, b} | Either “a” or “b” | | ab | {ab} | String “ab” | | a* | {ε, a, aa, aaa, …} | Zero or more a’s | | (a | b)* | All strings over {a,b} | Any combination | | ab | {ε, a, b, aa, bb, aab, abb, …} | a’s followed by b’s |

Regex Examples

Example 1: Strings Ending with ‘1’

Language: L = {w | w ends with ‘1’}

Regex: (0|1)*1

Explanation:

  • (0|1)*: Any string of 0’s and 1’s
  • 1: Must end with ‘1’

Matches: 1, 01, 11, 001, 101, 111, 0001, …

Example 2: Strings Containing ‘01’

Language: L = {w | w contains “01”}

Regex: (0|1)01(0|1)

Explanation:

  • (0|1)*: Any prefix
  • 01: The required substring
  • (0|1)*: Any suffix

Matches: 01, 001, 010, 101, 0011, …

Example 3: Even Number of a’s

Language: L = {w | w has even number of a’s}

Regex: (baba)b

Explanation:

  • (baba)*: Pairs of a’s separated by b’s
  • b*: Final b’s

Matches: ε, b, bb, aa, bab, aba, aabb, …

Alternative: b*(abab)*

Example 4: Strings with Length ≥ 3

Language: L = {w | |w| ≥ 3}

Regex: (0|1)(0|1)(0|1)(0|1)*

Or: Σ³Σ* where Σ = (0|1)

Matches: 000, 001, 010, 011, 100, 101, 110, 111, 0000, …

Example 5: Binary Numbers Divisible by 3

Regex: (0|1(01*0)*1)*

Complex pattern: Represents DFA for mod 3 arithmetic

Example 6: Valid Identifiers

Language: Starts with letter, then letters/digits

Regex: [a-zA-Z][a-zA-Z0-9]*

Explanation:

  • [a-zA-Z]: First character is letter
  • [a-zA-Z0-9]*: Followed by letters or digits

Matches: x, var1, MyVariable, test_123

Example 7: Email Pattern (Simplified)

Regex: [a-z]+@[a-z]+.[a-z]+

Matches: user@example.com, test@domain.org

Example 8: All Strings Except ε

Language: L = Σ⁺

Regex: (0|1)(0|1)* or (0|1)⁺

Matches: All non-empty strings

Example 9: Strings Not Containing ‘11’

Language: L = {w | “11” is not a substring of w}

Regex: 0*(10+)*1?

Explanation:

  • 0*: Leading zeros
  • (10+)*: Pattern “1” followed by one or more “0”s
  • 1?: Optional final “1”

Matches: ε, 0, 1, 00, 01, 10, 001, 010, 100, 101, …

Example 10: Exactly Three a’s

Language: L = {w | w has exactly three a’s}

Regex: bababab

Explanation: Three a’s with any number of b’s between/around

Matches: aaa, baaa, ababa, bababab, …

Precedence Rules

Priority (highest to lowest):

  1. * (Kleene star) - highest precedence
  2. · (concatenation)
  3. | (union) - lowest precedence

Examples

| Expression | Interpretation | | ---------- | --------------- | ------------------------------------ | ------------ | ---- | | ab* | a(b*) not (ab)* | | a | bc | a | (bc) not (a | b)c | | ab | c | (ab) | c not a(b | c) | | a | b* | a | (b*) not (a | b)* | | ab*c | a(b*)c | | (a | b)*c | All strings over {a,b} ending with c |

Parentheses override precedence:

  • ab* ≠ (ab)*
  • a|b* ≠ (a|b)*

Equivalence of Regular Expressions

Definition

Two regex r₁ and r₂ are equivalent if:

L(r₁) = L(r₂)

They describe the same language

Algebraic Laws

Commutativity

r|s = s|r               (union is commutative)
rs ≠ sr                 (concatenation is NOT commutative)

Associativity

(r|s)|t = r|(s|t)       (union)
(rs)t = r(st)           (concatenation)

Identity Elements

r|∅ = r                 (∅ is identity for union)
rε = εr = r             (ε is identity for concatenation)

Annihilator

r∅ = ∅r = ∅             (∅ annihilates concatenation)

Idempotence

r|r = r                 (union is idempotent)

Distributivity

r(s|t) = rs|rt          (concatenation distributes over union)
(s|t)r = sr|tr

Kleene Star Properties

r* = ε|r|rr|rrr|...     (definition)
r** = r*                (star is idempotent)
∅* = ε                  (important!)
ε* = ε
(r*)* = r*
r*r* = r*
rr* = r*r
r* = (r+)|ε             (star = plus or epsilon)
(r|ε)* = r*             (including ε doesn't change)

Equivalence Examples

Example 1:

(0|1)* = (0*1*)*
Both represent: All binary strings

Example 2:

a*b* ≠ (a|b)*
Left: a's followed by b's only
Right: Any mix of a's and b's

Example 3:

(a|b)*a = a|ba|(a|b)*aa|b*ba
Both: Strings ending with 'a'

Kleene’s Theorem

The Fundamental Result

Kleene’s Theorem: The following are equivalent:

  1. Language is recognized by a DFA
  2. Language is recognized by an NFA
  3. Language is described by a regular expression

Regular Languages = DFA = NFA = Regex

Three Conversions

     DFA ←→ NFA
      ↕       ↕
    Regex ←→ NFA

All three representations have equal expressive power!

Regex to NFA Conversion

Thompson’s Construction

Algorithm: Recursively build NFA from regex using ε-transitions

Base Cases:

1. Empty Language ∅

NFA: q₀ (start, no accept state)
Accepts: Nothing

2. Empty String ε

NFA: q₀ (start and accept)
Accepts: ε only

3. Single Symbol a

NFA:
    a
q₀ → q₁ (accept)

Accepts: "a" only

Recursive Cases

Union: r₁ | r₂

Strategy: Create new start state with ε-transitions to both NFAs

         ε
    ┌────────→ NFA(r₁) ────────┐
    │                          │ ε
q₀ ─┤                          ├──→ qf
    │                          │ ε
    └────────→ NFA(r₂) ────────┘
         ε

Steps:

  1. Create new start q₀ and accept qf
  2. ε-transition from q₀ to both r₁ and r₂
  3. ε-transitions from both accepts to qf

Concatenation: r₁r₂

Strategy: Connect output of NFA(r₁) to input of NFA(r₂)

q₀ → NFA(r₁) → [ε] → NFA(r₂) → qf

Steps:

  1. Remove accept status from r₁’s accept
  2. Add ε-transition to r₂’s start
  3. r₂’s accept becomes final accept

Kleene Star: r*

Strategy: Add loops with ε-transitions

       ┌───────────── ε ────────────┐
       ↓                            │
q₀ → [ε] → NFA(r) → [ε] → qf
 │                        ↑
 └────────── ε ───────────┘

Steps:

  1. Create new start q₀ and accept qf
  2. ε from q₀ to old start (for repetitions)
  3. ε from old accept back to old start (loop)
  4. ε from q₀ to qf (for zero repetitions)
  5. ε from old accept to qf (end repetitions)

Complete Example: (a|b)*a

Step-by-step construction:

1. Build NFA for ‘a’:

    a
q₁ → q₂

2. Build NFA for ‘b’:

    b
q₃ → q₄

3. Union: (a|b):

         ε
    ┌───────→ q₁ ─a→ q₂ ────┐
    │                       │ ε
q₀ ─┤                       ├──→ q₅
    │                       │ ε
    └───────→ q₃ ─b→ q₄ ────┘
         ε

4. Star: (a|b)*:

       ┌──────────────── ε ────────────────┐
       ↓                                   │
q₆ → [ε] → (a|b union NFA) → [ε] → q₇
 │                                  ↑
 └──────────────── ε ───────────────┘

5. Concatenate with ‘a’:

q₈ → q₉ (accepting)
       a

6. Final NFA for (a|b)*a: Connect star NFA to ‘a’ NFA with ε

Result: NFA with ~10 states and many ε-transitions

NFA to Regex Conversion

State Elimination Method

Algorithm: Progressively remove states, updating transitions with regex labels

Steps:

  1. Ensure single start and single accept state
  2. Add ε-transitions if needed
  3. Eliminate states one by one (except start/accept)
  4. Update transition labels with regex expressions
  5. Final regex is label from start to accept

Elimination Rule

When eliminating state q:

For each pair (predecessor p, successor s):

Old transitions:
  p →r₁→ q →r₂→ s
  q →r₃→ q (self-loop)

New transition:
  p →r₁r₃*r₂→ s

Formula: r₁r₃*r₂ (go through q via self-loop)

Example: Convert Simple DFA to Regex

DFA: Strings ending with ‘1’

       0         1
→(A) ───→ ((B))
  ↺ 0       ↺ 1

Step 1: Add new start/accept with ε

       ε         0         1         ε
qₛ → (A) ───→ ((B)) → qₐ
         ↺ 0       ↺ 1

Step 2: Eliminate A

Transition from qₛ to B through A:
  ε · 0*1 = 0*1

Self-loop on A: 0*
Path to B: qₛ →0*1→ B

Step 3: Eliminate B

Transition from qₛ to qₐ through B:
  0*1 · 1* · ε = 0*11* = 0*1 · 1*

Self-loop on B: 1*

Final Regex: 01·1 = 0*1⁺ or (0|1)*1

Result: (0|1)*1

DFA to Regex Directly

Arden’s Theorem

For equation: X = AX | B (where ε ∉ L(A))

Solution: X = A*B

Used to solve: Systems of equations from DFA states

Example Using Arden’s Theorem

DFA: Same as before

State equations:
  A = εA | 0A | 1B     (start from A, or loop, or go to B)
  B = 1A | 1B | ε      (accept state)

Simplify B:

B = 1B | (1A | ε)
B = 1*(1A | ε)      (by Arden's)
B = 1*1A | 1*ε
B = 1⁺A | 1*

Substitute in A:

A = (0|1)A | 1B
A = (0|1)A | 1(1⁺A | 1*)
A = (0|1)A | 11⁺A | 11*
A = ((0|1) | 11⁺)A | 11*
A = ((0|1) | 11⁺)*11*    (by Arden's)

Final regex for A (start state): Complex but equivalent to (0|1)*1

Practical Applications

1. Text Processing

grep, sed, awk (Unix tools)

grep "error|warning" logfile.txt
sed 's/[0-9]+/NUM/g' file.txt

2. Programming Languages

Python:

import re
pattern = r'^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+$'
re.match(pattern, email)

JavaScript:

const regex = /^[A-Z][a-z]*/;
regex.test(name);

3. Lexical Analysis

Compiler tokens:

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*
NUMBER: [0-9]+(\.[0-9]+)?
STRING: "([^"\\]|\\.)*"

4. Input Validation

Form validation:

  • Email: [a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}
  • Phone: \d{3}-\d{3}-\d{4}
  • URL: https?://[^\s]+

5. Data Extraction

Log parsing:

\[(\d{4}-\d{2}-\d{2})\] (ERROR|WARN|INFO): (.*)

Captures: date, level, message

6. Search and Replace

Code refactoring:

Find: function\s+(\w+)\s*\(
Replace: const $1 = (

Extended Regex Features (Practical)

Modern regex engines add features beyond regular languages:

1. Character Classes

\d = [0-9]           (digits)
\w = [a-zA-Z0-9_]    (word characters)
\s = [ \t\n\r]       (whitespace)
. = any character

2. Anchors

^ = start of string
$ = end of string
\b = word boundary

3. Quantifiers

{n} = exactly n times
{n,} = n or more times
{n,m} = between n and m times
? = {0,1}
+ = {1,}
* = {0,}

4. Groups and Backreferences

(pattern) = capturing group
(?:pattern) = non-capturing
\1, \2 = backreference

Example: (\w)\1 matches doubled letters (aa, bb, cc)

Note: Backreferences make regex more powerful than regular languages!

5. Lookahead/Lookbehind

(?=pattern) = positive lookahead
(?!pattern) = negative lookahead
(?<=pattern) = positive lookbehind
(?<!pattern) = negative lookbehind

These go beyond regular languages (not expressible in DFA/NFA)

Important Exam Points

  1. Regex = algebraic notation for regular languages
  2. Three operations: Union (|), Concatenation (·), Star (*)
  3. Precedence: * > concatenation > |
  4. Equivalence: L(r₁) = L(r₂) means same language
  5. Kleene’s Theorem: DFA = NFA = Regex (equivalent)
  6. Thompson’s Construction: Regex → NFA with ε-transitions
  7. State Elimination: NFA → Regex by removing states
  8. Arden’s Theorem: Solve X = AX | B as X = A*B
  9. Base cases: ∅, ε, a (single symbol)
  10. Laws: r* = ε|r⁺, ∅* = ε, (r*)* = r*

Common Mistakes to Avoid

Mistake 1: ab* means (ab)*
Correct: ab* = a(b*), use parentheses for (ab)*

Mistake 2: a|b* means (a|b)*
Correct: a|b* = a|(b*), union has lowest precedence

Mistake 3: Concatenation is commutative
Correct: rs ≠ sr in general (e.g., ab ≠ ba)

Mistake 4: ∅* = ∅
Correct: ∅* = ε (star of empty language is epsilon)

Mistake 5: ε* = ∅
Correct: ε* = ε (star of epsilon is epsilon)

Mistake 6: L(ε) = ∅
Correct: L(ε) = {ε} (set containing empty string)

Mistake 7: Regex can describe all languages
Correct: Regex describes only regular languages

Practice Questions

Q1: What is the language of (0|1)*0?

Answer: All binary strings ending with ‘0’

Q2: Write regex for “contains at least two a’s”

Answer: babab* or (a|b)*a(a|b)a(a|b)

Q3: Are (a*)* and a* equivalent?

Answer: Yes, (a*)* = a* (star is idempotent)

Q4: What is ∅*?

Answer: ε (important edge case!)

Q5: Convert (a|b)*a to NFA using Thompson’s Construction

Answer: Build union of a and b, apply star, concatenate with a

Q6: What is precedence of ab*|c?

Answer: (a(b*))|c = ab* or c

Q7: Can regex with backreferences describe non-regular languages?

Answer: Yes! Backreferences (like \1) go beyond regular languages

Q8: Write regex for valid C identifiers

Answer: [a-zA-Z_][a-zA-Z0-9_]*

Summary

  • Regular expressions provide algebraic notation for regular languages
  • Three operations: Union (|), Concatenation (·), Kleene Star (*)
  • Precedence: * (highest) > concatenation > | (lowest)
  • Equivalence: Two regex equivalent if they describe same language
  • Kleene’s Theorem: DFA = NFA = Regex (same expressive power)
  • Thompson’s Construction: Regex → NFA using ε-transitions
  • State Elimination: NFA/DFA → Regex by removing states
  • Arden’s Theorem: X = AX | B solves to X = A*B
  • Laws: Many algebraic identities (∅, ε, distributivity, etc.)
  • Applications: Text processing, lexical analysis, pattern matching, validation
  • Practical regex: Modern engines add features beyond regular languages

Regular expressions bridge human-readable patterns and machine-executable automata, making them essential in computer science!