Simplification of Context-Free Grammars

Why Simplify CFGs?

Context-Free Grammars can be messy! They may contain:

  • Useless symbols: Variables/terminals that never appear in derivations
  • ε-productions: Rules like A → ε (complicates parsing)
  • Unit productions: Rules like A → B (adds unnecessary steps)
  • Redundant rules: Multiple ways to derive same strings

Goal: Transform CFG into simpler equivalent form while preserving the language

Simple Analogy

Think of simplifying CFG like cleaning code:

  • Remove unused variables (dead code)
  • Eliminate unnecessary indirection (unit productions)
  • Simplify empty cases (ε-productions)
  • Keep functionality same (same language)

Result: Cleaner, more efficient grammar!

Types of Problematic Productions

1. Useless Symbols

Definition: Symbols that don’t contribute to generating terminal strings

Two types:

Type 1: Non-Terminating Symbols

Definition: Variables that cannot derive any terminal string

Example:

S → AB
A → a
B → bB    (B never reaches terminal!)

Problem: B loops forever, never produces terminal string

Type 2: Unreachable Symbols

Definition: Symbols that cannot be reached from start symbol

Example:

S → AB
A → a
B → b
C → c     (C unreachable from S!)

Problem: C is isolated, never used in any derivation from S

2. ε-Productions (Nullable Productions)

Definition: Rules of form A → ε

Example:

S → AB
A → aA | ε
B → bB | ε

Problem: Complicates parsing, makes grammar analysis harder

Note: If ε ∈ L(G), we need S → ε where S is start symbol

3. Unit Productions

Definition: Rules of form A → B (single variable on right)

Example:

S → A
A → B
B → a

Problem: Unnecessary indirection, adds derivation steps

Can simplify to: S → a, A → a, B → a

Simplification Algorithm Overview

Order matters! Follow this sequence:

1. Remove useless symbols
   a. First: Remove non-terminating symbols
   b. Then: Remove unreachable symbols

2. Remove ε-productions (except S → ε if needed)

3. Remove unit productions

Why this order?

  • Removing useless symbols first reduces grammar size
  • Must remove non-terminating before unreachable
  • ε-elimination before unit production removal

Step 1: Removing Useless Symbols

Step 1a: Remove Non-Terminating Symbols

Goal: Keep only variables that can derive terminal strings

Algorithm:

1. Mark all terminals as "terminating"
2. Repeat until no new variables marked:
   - If A → α and all symbols in α are terminating
   - Mark A as terminating
3. Remove all non-terminating variables and their rules

Example 1

Original Grammar:

S → AB | a
A → b
B → bB

Step 1: Mark terminals: a, b are terminating

Step 2: Check rules

  • S → a: right side has only terminal a → S is terminating ✓
  • A → b: right side has only terminal b → A is terminating ✓
  • B → bB: B appears on right, not yet marked → skip

Step 3: Check again

  • B → bB: b is terminating, but B is not → skip

Step 4: No new variables marked → done

Terminating: {S, A} Non-terminating: {B}

Simplified Grammar:

S → a
A → b

(Rules with B removed)

Example 2

Original Grammar:

S → AB | C
A → a
B → BC | b
C → c | D
D → dD

Analysis:

  • Terminals: a, b, c, d (terminating)
  • A → a: A is terminating ✓
  • B → b: B is terminating ✓
  • C → c: C is terminating ✓
  • D → dD: D loops, never terminating ✗
  • S → AB: A and B terminating → S is terminating ✓

Terminating: {S, A, B, C} Non-terminating: {D}

Simplified Grammar:

S → AB | C
A → a
B → BC | b
C → c

(D and rules with D removed)

Step 1b: Remove Unreachable Symbols

Goal: Keep only symbols reachable from start symbol

Algorithm:

1. Mark S (start symbol) as reachable
2. Repeat until no new symbols marked:
   - For each rule A → α where A is reachable
   - Mark all symbols in α as reachable
3. Remove all unreachable symbols and their rules

Example 1

Grammar (after removing non-terminating):

S → AB
A → a
B → b
C → c
D → d

Step 1: S is reachable

Step 2: Check S → AB

  • A is reachable ✓
  • B is reachable ✓

Step 3: Check A → a

  • a is reachable ✓

Step 4: Check B → b

  • b is reachable ✓

Reachable: {S, A, B, a, b} Unreachable: {C, D, c, d}

Simplified Grammar:

S → AB
A → a
B → b

Example 2

Grammar:

S → AB | C
A → a
B → b | C
C → c
D → d

Analysis:

  • S reachable (start)
  • From S → AB | C: A, B, C reachable
  • From A → a: a reachable
  • From B → b | C: b reachable (C already marked)
  • From C → c: c reachable
  • D never reached ✗

Reachable: {S, A, B, C, a, b, c} Unreachable: {D, d}

Simplified Grammar:

S → AB | C
A → a
B → b | C
C → c

Complete Example: Removing Useless Symbols

Original Grammar:

S → AB | a
A → b
B → C
C → D
D → E
E → a
F → f

Step 1a: Remove non-terminating

  • S → a: S terminating ✓
  • A → b: A terminating ✓
  • E → a: E terminating ✓
  • D → E: E terminating → D terminating ✓
  • C → D: D terminating → C terminating ✓
  • B → C: C terminating → B terminating ✓
  • F → f: F terminating ✓

All variables terminate!

Step 1b: Remove unreachable

  • S reachable (start)
  • From S → AB | a: A, B, a reachable
  • From A → b: b reachable
  • From B → C: C reachable
  • From C → D: D reachable
  • From D → E: E reachable
  • F never reached ✗

Final Grammar:

S → AB | a
A → b
B → C
C → D
D → E
E → a

(F removed)

Step 2: Removing ε-Productions

Goal: Eliminate A → ε rules (except S → ε if ε ∈ L(G))

Nullable Variables

Definition: Variable A is nullable if A ⇒* ε

Finding nullable variables:

1. If A → ε, then A is nullable
2. If A → B₁B₂...Bₙ and all Bᵢ are nullable, then A is nullable
3. Repeat until no new nullable variables found

Algorithm to Remove ε-Productions

1. Find all nullable variables
2. For each rule A → X₁X₂...Xₙ:
   - Add new rules with all combinations of nullable variables present/absent
   - But don't add A → ε (unless A is start and ε ∈ L(G))
3. Remove all A → ε rules

Example 1: Simple ε-Removal

Original Grammar:

S → AB
A → aA | ε
B → bB | ε

Step 1: Find nullable variables

  • A → ε: A nullable ✓
  • B → ε: B nullable ✓
  • S → AB: Both nullable → S nullable ✓

Nullable: {S, A, B}

Step 2: Generate new rules

For S → AB:

  • Both present: AB (original)
  • A absent: B
  • B absent: A
  • Both absent: ε (only if S is start and ε ∈ L)

New rules for S: S → AB | A | B | ε

For A → aA:

  • A present: aA (original)
  • A absent: a

New rules for A: A → aA | a

For B → bB:

  • B present: bB (original)
  • B absent: b

New rules for B: B → bB | b

Step 3: Remove ε-productions

Final Grammar:

S → AB | A | B | ε
A → aA | a
B → bB | b

Note: Kept S → ε because S is start and ε was in original language

Example 2: More Complex

Original Grammar:

S → ABC
A → aA | ε
B → bB | ε
C → c

Step 1: Nullable variables

  • A → ε: A nullable ✓
  • B → ε: B nullable ✓
  • C → c: not nullable ✗
  • S → ABC: A, B nullable, but C not → S not nullable ✗

Nullable: {A, B}

Step 2: Generate new rules

For S → ABC:

  • All present: ABC
  • A absent: BC
  • B absent: AC
  • A and B absent: C

New rules for S: S → ABC | BC | AC | C

For A → aA:

  • A present: aA
  • A absent: a

New rules for A: A → aA | a

For B → bB:

  • B present: bB
  • B absent: b

New rules for B: B → bB | b

Step 3: Remove ε

Final Grammar:

S → ABC | BC | AC | C
A → aA | a
B → bB | b
C → c

Special Case: Start Symbol

If ε ∈ L(G), need special handling:

1. Remove all ε-productions as above
2. If S was nullable, add new rule: S₀ → S | ε
3. S₀ becomes new start symbol

Example:

Original: S → A, A → a | ε
ε ∈ L(G) because S ⇒ A ⇒ ε

After removal:
S₀ → S | ε    (new start)
S → A
A → a

Step 3: Removing Unit Productions

Goal: Eliminate rules of form A → B

Algorithm

1. For each variable A, find all variables B such that A ⇒* B using only unit productions
2. For each such B, if B → α (non-unit production), add A → α
3. Remove all unit productions

Example 1: Simple Unit Removal

Original Grammar:

S → A
A → B
B → a | b

Step 1: Find unit production chains

  • S ⇒* A ⇒* B (using S → A, A → B)

Step 2: Add direct rules

  • B → a, B → b (non-unit)
  • Since S ⇒* B: add S → a | b
  • Since A ⇒* B: add A → a | b

Step 3: Remove unit productions

Final Grammar:

S → a | b
A → a | b
B → a | b

Example 2: More Complex

Original Grammar:

S → A | a
A → B | b
B → C | c
C → S | d

Step 1: Find unit production closure

  • S ⇒* A ⇒* B ⇒* C ⇒* S (cycle!)
  • So S ⇒* {S, A, B, C}
  • A ⇒* {A, B, C, S}
  • B ⇒* {B, C, S, A}
  • C ⇒* {C, S, A, B}

Step 2: For each variable, add non-unit productions from closure

For S (reaches {S, A, B, C}):

  • S → a (from S)
  • S → b (from A)
  • S → c (from B)
  • S → d (from C)

For A (reaches {A, B, C, S}):

  • A → b (from A)
  • A → c (from B)
  • A → d (from C)
  • A → a (from S)

Similarly for B and C

Final Grammar:

S → a | b | c | d
A → a | b | c | d
B → a | b | c | d
C → a | b | c | d

Example 3: Realistic Grammar

Original (after ε-removal):

E → E + T | T
T → T * F | F
F → (E) | id

Step 1: Unit production chains

  • E ⇒* T ⇒* F
  • T ⇒* F

Step 2: Add non-unit productions

For E (reaches T, F):

  • E → E + T (already there)
  • From T: E → T * F
  • From F: E → (E) | id

For T (reaches F):

  • T → T * F (already there)
  • From F: T → (E) | id

Final Grammar:

E → E + T | T * F | (E) | id
T → T * F | (E) | id
F → (E) | id

Complete Simplification Example

Original Grammar:

S → AB | ε
A → aA | ε
B → Bb | C
C → c
D → d

Step 1: Remove Useless Symbols

Step 1a: Non-terminating

  • All variables can reach terminals ✓

Step 1b: Unreachable

  • D is unreachable ✗

After Step 1:

S → AB | ε
A → aA | ε
B → Bb | C
C → c

Step 2: Remove ε-Productions

Nullable variables: {S, A}

New productions:

  • S → AB | B | ε (A can be absent)
  • A → aA | a (A can be absent)
  • B → Bb | C (unchanged)
  • C → c (unchanged)

After Step 2:

S → AB | B | ε
A → aA | a
B → Bb | C
C → c

Step 3: Remove Unit Productions

Unit productions: B → C

Closure:

  • B ⇒* C
  • Add B → c (from C)

After Step 3:

S → AB | B | ε
A → aA | a
B → Bb | c
C → c

Wait, S → B is also unit production!

Closure: S ⇒* B

  • Add S → Bb | c (from B)

Final Simplified Grammar:

S → AB | Bb | c | ε
A → aA | a
B → Bb | c
C → c

Actually, C is now unreachable! Remove it:

Final Final Grammar:

S → AB | Bb | c | ε
A → aA | a
B → Bb | c

Why Simplification Matters

1. Efficiency

Smaller grammar = faster parsing

  • Fewer rules to check
  • Fewer derivation steps
  • Less memory usage

2. Clarity

Cleaner grammar = easier to understand

  • No dead code (useless symbols)
  • No unnecessary indirection (unit productions)
  • Clearer structure

3. Normal Forms

Simplified grammar is prerequisite for:

  • Chomsky Normal Form
  • Greibach Normal Form
  • Other canonical forms

4. Parsing Algorithms

Many parsing algorithms require:

  • No ε-productions
  • No unit productions
  • No useless symbols

Examples: CYK algorithm, LL parsing, LR parsing

Important Properties

Language Preservation

Theorem: Simplification preserves the language

L(G_original) = L(G_simplified)

Except: May need to handle ε specially

Canonical Form

Simplified grammar is not unique!

Different simplification orders can give different (but equivalent) grammars

Complexity

Simplification is efficient:

  • Useless symbol removal: O(n²)
  • ε-production removal: O(2ⁿ) worst case, but usually practical
  • Unit production removal: O(n²)

Where n = grammar size

Common Mistakes to Avoid

Mistake 1: Wrong order (unreachable before non-terminating)
Correct: Non-terminating first, then unreachable

Mistake 2: Removing S → ε when ε ∈ L(G)
Correct: Keep S → ε if ε is in the language

Mistake 3: Missing combinations in ε-removal
Correct: Generate ALL combinations of nullable variables

Mistake 4: Not finding all unit production chains
Correct: Use closure to find all A ⇒* B relationships

Mistake 5: Forgetting to remove newly unreachable symbols
Correct: After each step, check for new useless symbols

Mistake 6: Applying steps in wrong order
Correct: Useless → ε-productions → Unit productions

Important Exam Points

  1. Three types of problems: Useless symbols, ε-productions, unit productions
  2. Order matters: Non-terminating → Unreachable → ε → Unit
  3. Useless symbols: Both non-terminating and unreachable
  4. Nullable variables: Variables that derive ε
  5. ε-removal: Generate all combinations, keep S → ε if needed
  6. Unit productions: A → B (single variable)
  7. Unit removal: Find closure, add direct rules, remove units
  8. Language preserved: L(G) unchanged (except ε handling)
  9. Not unique: Different simplifications possible
  10. Prerequisite: Needed for normal forms

Practice Questions

Q1: What’s the difference between non-terminating and unreachable?

Answer: Non-terminating can’t derive terminals; unreachable can’t be reached from S.

Q2: Why remove non-terminating before unreachable?

Answer: Removing non-terminating may make new symbols unreachable.

Q3: Can we remove S → ε?

Answer: Only if ε ∉ L(G). If ε ∈ L(G), must keep it or use S₀ → S | ε.

Q4: What’s a nullable variable?

Answer: Variable A where A ⇒* ε.

Q5: In S → AB with A, B nullable, what new rules?

Answer: S → AB | A | B (and S → ε if S is start and ε ∈ L).

Q6: Why remove unit productions?

Answer: Unnecessary indirection, adds derivation steps.

Q7: How to find unit production closure?

Answer: Find all B such that A ⇒* B using only unit productions.

Q8: Does simplification change the language?

Answer: No, L(G_original) = L(G_simplified).

Summary

  • Goal: Simplify CFG while preserving language
  • Three steps (in order):
    1. Remove useless symbols (non-terminating, then unreachable)
    2. Remove ε-productions (except S → ε if needed)
    3. Remove unit productions
  • Useless symbols: Can’t reach terminals or can’t be reached from S
  • ε-productions: A → ε (find nullable, generate combinations)
  • Unit productions: A → B (find closure, add direct rules)
  • Language preserved: L(G) unchanged
  • Benefits: Cleaner, faster parsing, prerequisite for normal forms
  • Applications: Parser optimization, normal form conversion
  • Key insight: Systematic cleanup makes grammar more manageable

Simplification is essential preparation for converting CFGs to normal forms!