Why Simplify CFGs?
Context-Free Grammars can be messy! They may contain:
- Useless symbols: Variables/terminals that never appear in derivations
- ε-productions: Rules like A → ε (complicates parsing)
- Unit productions: Rules like A → B (adds unnecessary steps)
- Redundant rules: Multiple ways to derive same strings
Goal: Transform CFG into simpler equivalent form while preserving the language
Simple Analogy
Think of simplifying CFG like cleaning code:
- Remove unused variables (dead code)
- Eliminate unnecessary indirection (unit productions)
- Simplify empty cases (ε-productions)
- Keep functionality same (same language)
Result: Cleaner, more efficient grammar!
Types of Problematic Productions
1. Useless Symbols
Definition: Symbols that don’t contribute to generating terminal strings
Two types:
Type 1: Non-Terminating Symbols
Definition: Variables that cannot derive any terminal string
Example:
S → AB
A → a
B → bB (B never reaches terminal!)
Problem: B loops forever, never produces terminal string
Type 2: Unreachable Symbols
Definition: Symbols that cannot be reached from start symbol
Example:
S → AB
A → a
B → b
C → c (C unreachable from S!)
Problem: C is isolated, never used in any derivation from S
2. ε-Productions (Nullable Productions)
Definition: Rules of form A → ε
Example:
S → AB
A → aA | ε
B → bB | ε
Problem: Complicates parsing, makes grammar analysis harder
Note: If ε ∈ L(G), we need S → ε where S is start symbol
3. Unit Productions
Definition: Rules of form A → B (single variable on right)
Example:
S → A
A → B
B → a
Problem: Unnecessary indirection, adds derivation steps
Can simplify to: S → a, A → a, B → a
Simplification Algorithm Overview
Order matters! Follow this sequence:
1. Remove useless symbols
a. First: Remove non-terminating symbols
b. Then: Remove unreachable symbols
2. Remove ε-productions (except S → ε if needed)
3. Remove unit productions
Why this order?
- Removing useless symbols first reduces grammar size
- Must remove non-terminating before unreachable
- ε-elimination before unit production removal
Step 1: Removing Useless Symbols
Step 1a: Remove Non-Terminating Symbols
Goal: Keep only variables that can derive terminal strings
Algorithm:
1. Mark all terminals as "terminating"
2. Repeat until no new variables marked:
- If A → α and all symbols in α are terminating
- Mark A as terminating
3. Remove all non-terminating variables and their rules
Example 1
Original Grammar:
S → AB | a
A → b
B → bB
Step 1: Mark terminals: a, b are terminating
Step 2: Check rules
- S → a: right side has only terminal a → S is terminating ✓
- A → b: right side has only terminal b → A is terminating ✓
- B → bB: B appears on right, not yet marked → skip
Step 3: Check again
- B → bB: b is terminating, but B is not → skip
Step 4: No new variables marked → done
Terminating: {S, A} Non-terminating: {B}
Simplified Grammar:
S → a
A → b
(Rules with B removed)
Example 2
Original Grammar:
S → AB | C
A → a
B → BC | b
C → c | D
D → dD
Analysis:
- Terminals: a, b, c, d (terminating)
- A → a: A is terminating ✓
- B → b: B is terminating ✓
- C → c: C is terminating ✓
- D → dD: D loops, never terminating ✗
- S → AB: A and B terminating → S is terminating ✓
Terminating: {S, A, B, C} Non-terminating: {D}
Simplified Grammar:
S → AB | C
A → a
B → BC | b
C → c
(D and rules with D removed)
Step 1b: Remove Unreachable Symbols
Goal: Keep only symbols reachable from start symbol
Algorithm:
1. Mark S (start symbol) as reachable
2. Repeat until no new symbols marked:
- For each rule A → α where A is reachable
- Mark all symbols in α as reachable
3. Remove all unreachable symbols and their rules
Example 1
Grammar (after removing non-terminating):
S → AB
A → a
B → b
C → c
D → d
Step 1: S is reachable
Step 2: Check S → AB
- A is reachable ✓
- B is reachable ✓
Step 3: Check A → a
- a is reachable ✓
Step 4: Check B → b
- b is reachable ✓
Reachable: {S, A, B, a, b} Unreachable: {C, D, c, d}
Simplified Grammar:
S → AB
A → a
B → b
Example 2
Grammar:
S → AB | C
A → a
B → b | C
C → c
D → d
Analysis:
- S reachable (start)
- From S → AB | C: A, B, C reachable
- From A → a: a reachable
- From B → b | C: b reachable (C already marked)
- From C → c: c reachable
- D never reached ✗
Reachable: {S, A, B, C, a, b, c} Unreachable: {D, d}
Simplified Grammar:
S → AB | C
A → a
B → b | C
C → c
Complete Example: Removing Useless Symbols
Original Grammar:
S → AB | a
A → b
B → C
C → D
D → E
E → a
F → f
Step 1a: Remove non-terminating
- S → a: S terminating ✓
- A → b: A terminating ✓
- E → a: E terminating ✓
- D → E: E terminating → D terminating ✓
- C → D: D terminating → C terminating ✓
- B → C: C terminating → B terminating ✓
- F → f: F terminating ✓
All variables terminate!
Step 1b: Remove unreachable
- S reachable (start)
- From S → AB | a: A, B, a reachable
- From A → b: b reachable
- From B → C: C reachable
- From C → D: D reachable
- From D → E: E reachable
- F never reached ✗
Final Grammar:
S → AB | a
A → b
B → C
C → D
D → E
E → a
(F removed)
Step 2: Removing ε-Productions
Goal: Eliminate A → ε rules (except S → ε if ε ∈ L(G))
Nullable Variables
Definition: Variable A is nullable if A ⇒* ε
Finding nullable variables:
1. If A → ε, then A is nullable
2. If A → B₁B₂...Bₙ and all Bᵢ are nullable, then A is nullable
3. Repeat until no new nullable variables found
Algorithm to Remove ε-Productions
1. Find all nullable variables
2. For each rule A → X₁X₂...Xₙ:
- Add new rules with all combinations of nullable variables present/absent
- But don't add A → ε (unless A is start and ε ∈ L(G))
3. Remove all A → ε rules
Example 1: Simple ε-Removal
Original Grammar:
S → AB
A → aA | ε
B → bB | ε
Step 1: Find nullable variables
- A → ε: A nullable ✓
- B → ε: B nullable ✓
- S → AB: Both nullable → S nullable ✓
Nullable: {S, A, B}
Step 2: Generate new rules
For S → AB:
- Both present: AB (original)
- A absent: B
- B absent: A
- Both absent: ε (only if S is start and ε ∈ L)
New rules for S: S → AB | A | B | ε
For A → aA:
- A present: aA (original)
- A absent: a
New rules for A: A → aA | a
For B → bB:
- B present: bB (original)
- B absent: b
New rules for B: B → bB | b
Step 3: Remove ε-productions
Final Grammar:
S → AB | A | B | ε
A → aA | a
B → bB | b
Note: Kept S → ε because S is start and ε was in original language
Example 2: More Complex
Original Grammar:
S → ABC
A → aA | ε
B → bB | ε
C → c
Step 1: Nullable variables
- A → ε: A nullable ✓
- B → ε: B nullable ✓
- C → c: not nullable ✗
- S → ABC: A, B nullable, but C not → S not nullable ✗
Nullable: {A, B}
Step 2: Generate new rules
For S → ABC:
- All present: ABC
- A absent: BC
- B absent: AC
- A and B absent: C
New rules for S: S → ABC | BC | AC | C
For A → aA:
- A present: aA
- A absent: a
New rules for A: A → aA | a
For B → bB:
- B present: bB
- B absent: b
New rules for B: B → bB | b
Step 3: Remove ε
Final Grammar:
S → ABC | BC | AC | C
A → aA | a
B → bB | b
C → c
Special Case: Start Symbol
If ε ∈ L(G), need special handling:
1. Remove all ε-productions as above
2. If S was nullable, add new rule: S₀ → S | ε
3. S₀ becomes new start symbol
Example:
Original: S → A, A → a | ε
ε ∈ L(G) because S ⇒ A ⇒ ε
After removal:
S₀ → S | ε (new start)
S → A
A → a
Step 3: Removing Unit Productions
Goal: Eliminate rules of form A → B
Algorithm
1. For each variable A, find all variables B such that A ⇒* B using only unit productions
2. For each such B, if B → α (non-unit production), add A → α
3. Remove all unit productions
Example 1: Simple Unit Removal
Original Grammar:
S → A
A → B
B → a | b
Step 1: Find unit production chains
- S ⇒* A ⇒* B (using S → A, A → B)
Step 2: Add direct rules
- B → a, B → b (non-unit)
- Since S ⇒* B: add S → a | b
- Since A ⇒* B: add A → a | b
Step 3: Remove unit productions
Final Grammar:
S → a | b
A → a | b
B → a | b
Example 2: More Complex
Original Grammar:
S → A | a
A → B | b
B → C | c
C → S | d
Step 1: Find unit production closure
- S ⇒* A ⇒* B ⇒* C ⇒* S (cycle!)
- So S ⇒* {S, A, B, C}
- A ⇒* {A, B, C, S}
- B ⇒* {B, C, S, A}
- C ⇒* {C, S, A, B}
Step 2: For each variable, add non-unit productions from closure
For S (reaches {S, A, B, C}):
- S → a (from S)
- S → b (from A)
- S → c (from B)
- S → d (from C)
For A (reaches {A, B, C, S}):
- A → b (from A)
- A → c (from B)
- A → d (from C)
- A → a (from S)
Similarly for B and C
Final Grammar:
S → a | b | c | d
A → a | b | c | d
B → a | b | c | d
C → a | b | c | d
Example 3: Realistic Grammar
Original (after ε-removal):
E → E + T | T
T → T * F | F
F → (E) | id
Step 1: Unit production chains
- E ⇒* T ⇒* F
- T ⇒* F
Step 2: Add non-unit productions
For E (reaches T, F):
- E → E + T (already there)
- From T: E → T * F
- From F: E → (E) | id
For T (reaches F):
- T → T * F (already there)
- From F: T → (E) | id
Final Grammar:
E → E + T | T * F | (E) | id
T → T * F | (E) | id
F → (E) | id
Complete Simplification Example
Original Grammar:
S → AB | ε
A → aA | ε
B → Bb | C
C → c
D → d
Step 1: Remove Useless Symbols
Step 1a: Non-terminating
- All variables can reach terminals ✓
Step 1b: Unreachable
- D is unreachable ✗
After Step 1:
S → AB | ε
A → aA | ε
B → Bb | C
C → c
Step 2: Remove ε-Productions
Nullable variables: {S, A}
New productions:
- S → AB | B | ε (A can be absent)
- A → aA | a (A can be absent)
- B → Bb | C (unchanged)
- C → c (unchanged)
After Step 2:
S → AB | B | ε
A → aA | a
B → Bb | C
C → c
Step 3: Remove Unit Productions
Unit productions: B → C
Closure:
- B ⇒* C
- Add B → c (from C)
After Step 3:
S → AB | B | ε
A → aA | a
B → Bb | c
C → c
Wait, S → B is also unit production!
Closure: S ⇒* B
- Add S → Bb | c (from B)
Final Simplified Grammar:
S → AB | Bb | c | ε
A → aA | a
B → Bb | c
C → c
Actually, C is now unreachable! Remove it:
Final Final Grammar:
S → AB | Bb | c | ε
A → aA | a
B → Bb | c
Why Simplification Matters
1. Efficiency
Smaller grammar = faster parsing
- Fewer rules to check
- Fewer derivation steps
- Less memory usage
2. Clarity
Cleaner grammar = easier to understand
- No dead code (useless symbols)
- No unnecessary indirection (unit productions)
- Clearer structure
3. Normal Forms
Simplified grammar is prerequisite for:
- Chomsky Normal Form
- Greibach Normal Form
- Other canonical forms
4. Parsing Algorithms
Many parsing algorithms require:
- No ε-productions
- No unit productions
- No useless symbols
Examples: CYK algorithm, LL parsing, LR parsing
Important Properties
Language Preservation
Theorem: Simplification preserves the language
L(G_original) = L(G_simplified)
Except: May need to handle ε specially
Canonical Form
Simplified grammar is not unique!
Different simplification orders can give different (but equivalent) grammars
Complexity
Simplification is efficient:
- Useless symbol removal: O(n²)
- ε-production removal: O(2ⁿ) worst case, but usually practical
- Unit production removal: O(n²)
Where n = grammar size
Common Mistakes to Avoid
✗ Mistake 1: Wrong order (unreachable before non-terminating)
✓ Correct: Non-terminating first, then unreachable
✗ Mistake 2: Removing S → ε when ε ∈ L(G)
✓ Correct: Keep S → ε if ε is in the language
✗ Mistake 3: Missing combinations in ε-removal
✓ Correct: Generate ALL combinations of nullable variables
✗ Mistake 4: Not finding all unit production chains
✓ Correct: Use closure to find all A ⇒* B relationships
✗ Mistake 5: Forgetting to remove newly unreachable symbols
✓ Correct: After each step, check for new useless symbols
✗ Mistake 6: Applying steps in wrong order
✓ Correct: Useless → ε-productions → Unit productions
Important Exam Points
- ✓ Three types of problems: Useless symbols, ε-productions, unit productions
- ✓ Order matters: Non-terminating → Unreachable → ε → Unit
- ✓ Useless symbols: Both non-terminating and unreachable
- ✓ Nullable variables: Variables that derive ε
- ✓ ε-removal: Generate all combinations, keep S → ε if needed
- ✓ Unit productions: A → B (single variable)
- ✓ Unit removal: Find closure, add direct rules, remove units
- ✓ Language preserved: L(G) unchanged (except ε handling)
- ✓ Not unique: Different simplifications possible
- ✓ Prerequisite: Needed for normal forms
Practice Questions
Q1: What’s the difference between non-terminating and unreachable?
Answer: Non-terminating can’t derive terminals; unreachable can’t be reached from S.
Q2: Why remove non-terminating before unreachable?
Answer: Removing non-terminating may make new symbols unreachable.
Q3: Can we remove S → ε?
Answer: Only if ε ∉ L(G). If ε ∈ L(G), must keep it or use S₀ → S | ε.
Q4: What’s a nullable variable?
Answer: Variable A where A ⇒* ε.
Q5: In S → AB with A, B nullable, what new rules?
Answer: S → AB | A | B (and S → ε if S is start and ε ∈ L).
Q6: Why remove unit productions?
Answer: Unnecessary indirection, adds derivation steps.
Q7: How to find unit production closure?
Answer: Find all B such that A ⇒* B using only unit productions.
Q8: Does simplification change the language?
Answer: No, L(G_original) = L(G_simplified).
Summary
- Goal: Simplify CFG while preserving language
- Three steps (in order):
- Remove useless symbols (non-terminating, then unreachable)
- Remove ε-productions (except S → ε if needed)
- Remove unit productions
- Useless symbols: Can’t reach terminals or can’t be reached from S
- ε-productions: A → ε (find nullable, generate combinations)
- Unit productions: A → B (find closure, add direct rules)
- Language preserved: L(G) unchanged
- Benefits: Cleaner, faster parsing, prerequisite for normal forms
- Applications: Parser optimization, normal form conversion
- Key insight: Systematic cleanup makes grammar more manageable
Simplification is essential preparation for converting CFGs to normal forms!