What is Chomsky Normal Form?
Chomsky Normal Form (CNF) is a restricted form of context-free grammar where every production has a very specific, simple structure.
Formal Definition
A CFG is in Chomsky Normal Form if every production is of one of these two forms:
- A → BC (two variables on right)
- A → a (single terminal on right)
Exception: S → ε is allowed if S doesn’t appear on any right side
That’s it! Only these forms allowed!
Simple Analogy
Think of CNF like standardized LEGO blocks:
- Regular CFG: Mix of different block sizes and shapes
- CNF: Only 2 types - “2-stud blocks” (A → BC) and “1-stud blocks” (A → a)
- Easier to work with - uniform structure
Why CNF?
Advantages:
- Parsing efficiency: CYK algorithm runs in O(n³)
- Uniform structure: Easy to analyze
- Theoretical proofs: Simplifies many proofs
- Parse tree structure: Always binary tree
Applications:
- CYK parsing algorithm
- Natural language processing
- Compiler design
- Grammar analysis
CNF Rules Explained
Rule Type 1: A → BC
Form: Variable produces exactly two variables
Examples:
S → AB ✓
A → BC ✓
E → TF ✓
NOT allowed:
S → ABC ✗ (three variables)
S → A ✗ (one variable - unit production)
S → ε ✗ (epsilon - unless S is start)
Rule Type 2: A → a
Form: Variable produces exactly one terminal
Examples:
A → a ✓
B → b ✓
T → 0 ✓
NOT allowed:
A → ab ✗ (two terminals)
A → Aa ✗ (mix of variable and terminal)
A → abc ✗ (multiple terminals)
Special Case: S → ε
Allowed ONLY if:
- S is the start symbol
- S does not appear on the right side of any production
Example:
S → AB | ε ✓ (if S not on right side)
S → SS | ε ✗ (S appears on right side!)
Conversion Algorithm
Input: Any CFG G
Output: Equivalent CFG G’ in CNF
Steps:
1. Simplify grammar (remove useless, ε, unit productions)
2. Handle start symbol appearing on right side
3. Remove terminals from mixed productions
4. Break long productions into binary form
Step 1: Simplify Grammar
Apply simplification (from previous topic):
- Remove useless symbols
- Remove ε-productions (except S → ε if needed)
- Remove unit productions
Result: Cleaner grammar ready for CNF conversion
Step 2: Handle Start Symbol
If S appears on right side of any production:
1. Create new start symbol S₀
2. Add production S₀ → S
3. S₀ never appears on right side
Example:
Before:
S → SS | a
After:
S₀ → S (new start)
S → SS | a
Step 3: Remove Terminals from Mixed Productions
For each production with both variables and terminals:
Replace each terminal a with new variable Uₐ
Add production Uₐ → a
Example:
Before:
S → aSb
After:
S → UₐSUᵦ
Uₐ → a
Uᵦ → b
Another Example:
Before:
A → aB | Ab | a
After:
A → UₐB | AUᵦ | a (a alone is fine)
Uₐ → a
Uᵦ → b
Step 4: Break Long Productions
For productions A → B₁B₂…Bₙ where n ≥ 3:
Replace with:
A → B₁C₁
C₁ → B₂C₂
C₂ → B₃C₃
...
Cₙ₋₂ → Bₙ₋₁Bₙ
Example:
Before:
A → BCDE
After:
A → BC₁
C₁ → CD₁
D₁ → DE
Another Example:
Before:
S → ABCD
After:
S → AX₁
X₁ → BX₂
X₂ → CD
Complete Conversion Examples
Example 1: Simple Grammar
Original Grammar:
S → AB
A → a
B → b
Check CNF:
- S → AB ✓ (two variables)
- A → a ✓ (one terminal)
- B → b ✓ (one terminal)
Already in CNF! ✓
Example 2: Adding ε
Original Grammar:
S → aSb | ε
Step 1: Simplify (remove ε except S → ε)
- Nullable: {S}
- S → aSb becomes: S → aSb | ab | ε
Step 2: S on right side? No ✓
Step 3: Remove terminals from mixed
- S → aSb becomes S → UₐSUᵦ
- S → ab becomes S → UₐUᵦ
Step 4: Break long productions
- S → UₐSUᵦ: three symbols → break
- S → UₐX
- X → SUᵦ
- S → UₐUᵦ: two symbols → OK
Final CNF:
S → UₐX | UₐUᵦ | ε
X → SUᵦ
Uₐ → a
Uᵦ → b
Example 3: More Complex
Original Grammar:
S → AB | a
A → aAb | ab
B → b
Step 1: Already simplified
Step 2: S not on right ✓
Step 3: Remove terminals from mixed
- A → aAb becomes A → UₐAUᵦ
- A → ab becomes A → UₐUᵦ
Step 4: Break long productions
- A → UₐAUᵦ: three symbols
- A → UₐX
- X → AUᵦ
- A → UₐUᵦ: two symbols OK
Final CNF:
S → AB | a
A → UₐX | UₐUᵦ
X → AUᵦ
B → b
Uₐ → a
Uᵦ → b
Example 4: Arithmetic Expressions
Original Grammar:
E → E + T | T
T → T * F | F
F → (E) | id
Step 1: Remove unit productions
- E → T becomes E → T * F | (E) | id
- T → F becomes T → (E) | id
Grammar after Step 1:
E → E + T | T * F | (E) | id
T → T * F | (E) | id
F → (E) | id
Step 2: S (E) not on right… wait, E is on right! Actually, that’s OK - only problem if START symbol on right
Step 3: Remove terminals from mixed
- E + T: E → EU₊T
- T * F: T → TU*F
- (E): F → U₍EU₎
Step 4: Break long productions
- E → EU₊T: three symbols
- E → EX₁
- X₁ → U₊T
- Similarly for others
Final CNF (simplified notation):
E → EX₁ | TX₂ | U₍X₃ | id
X₁ → U₊T
X₂ → U*F
X₃ → EU₎
T → TX₂ | U₍X₃ | id
F → U₍X₃ | id
U₊ → +
U* → *
U₍ → (
U₎ → )
Example 5: Balanced Parentheses
Original Grammar:
S → (S) | SS | ε
Step 1: Remove ε
- Nullable: {S}
- S → (S) becomes: (S) | ()
- S → SS becomes: SS | S
- Keep S → ε (start symbol)
After Step 1:
S → (S) | () | SS | S | ε
Remove unit S → S: (already handled)
Step 2: S on right side in SS! Add S₀ → S
Step 3: Remove terminals
- (S) becomes U₍SU₎
- () becomes U₍U₎
Step 4: Break long
- U₍SU₎: three symbols
- S → U₍X
- X → SU₎
- U₍U₎: two symbols OK
Final CNF:
S₀ → S
S → U₍X | U₍U₎ | SS | ε
X → SU₎
U₍ → (
U₎ → )
Parse Trees in CNF
CNF parse trees are binary trees (except terminal leaves)
Example: Parse “aabb”
Grammar (CNF):
S → UₐX | UₐUᵦ | ε
X → SUᵦ
Uₐ → a
Uᵦ → b
Derivation:
S ⇒ UₐX ⇒ UₐSUᵦ ⇒ UₐUₐXUᵦ ⇒ UₐUₐSUᵦUᵦ ⇒ UₐUₐUᵦUᵦ ⇒ aabb
Parse Tree:
S
/ \
Uₐ X
| / \
a S Uᵦ
/\ |
Uₐ X b
| /\
a S Uᵦ
/\ |
Uₐ Uᵦ b
| |
a b
Binary tree structure!
CYK Parsing Algorithm
CYK (Cocke-Younger-Kasami) algorithm uses CNF for efficient parsing
Algorithm Overview
Input: String w, Grammar G in CNF
Output: Whether w ∈ L(G)
Method: Dynamic programming
Time Complexity: O(n³|G|) where n = |w|, |G| = grammar size
Key Idea
Build table bottom-up:
- Table[i,j] = variables that derive substring from position i, length j
- Start with length 1 (terminals)
- Build up to length n (full string)
Example: CYK for “aabb”
Grammar:
S → AB | AA
A → a
B → BB | AB
Wait, B → BB not in CNF! Need two variables, not two same variables… Actually that IS CNF!
Let me use proper example:
Grammar in CNF:
S → AB
A → BC | a
B → AC | b
C → AB | a
String: “aaab”
CYK Table: (complex, see textbook for details)
Result: Check if S in Table[1, 4] (position 1, length 4)
Properties of CNF
1. Binary Parse Trees
Every internal node has exactly 2 children (except terminals)
Height: O(n) for string of length n
2. Derivation Length
For string of length n: Exactly 2n - 1 steps in derivation
Why?
- Each A → BC creates 2 variables
- Each A → a creates 1 terminal
- Need n terminals, so n-1 binary splits
- Total: n + (n-1) = 2n - 1
3. Unique Structure
CNF is not unique for a language
Multiple CNF grammars can generate same language
4. Language Equivalence
Theorem: Every CFL without ε has equivalent CNF grammar
Proof: Conversion algorithm constructs it!
Applications
1. CYK Parsing
Most important application
Benefits:
- O(n³) time
- Handles ambiguous grammars
- Easy to implement
2. Natural Language Processing
Parse natural language sentences:
- Convert grammar to CNF
- Use CYK for efficient parsing
- Find all possible parse trees
3. Compiler Design
Syntax analysis:
- Some parsing algorithms prefer CNF
- Uniform structure simplifies implementation
4. Theoretical Proofs
CNF simplifies many proofs:
- Pumping lemma for CFLs
- Closure properties
- Decidability results
Limitations of CNF
1. Size Increase
CNF grammar is typically larger than original
Why?: Adding new variables for
- Terminal replacement
- Breaking long productions
Example: n-length right side → n-1 new variables
2. Readability
CNF is less readable than natural grammar
Original: S → if E then S else S
CNF: Many small rules, hard to understand
3. No ε (mostly)
Can’t have ε productions (except S → ε)
Problem: Some languages naturally include ε
4. Parsing Overhead
More rules = more work in parsing
But: O(n³) is still good for general CFGs
CNF vs Greibach Normal Form
CNF: A → BC | a
GNF: A → aα (terminal followed by variables)
Comparison:
| Feature | CNF | GNF |
|---|---|---|
| Form | A → BC or A → a | A → aα |
| Parse tree | Binary | n-ary |
| Left recursion | Possible | Not possible |
| Parsing | CYK (bottom-up) | Top-down |
| Conversion | Easier | Harder |
Both are useful for different purposes!
Important Exam Points
- ✓ CNF form: A → BC or A → a (only these two types)
- ✓ Exception: S → ε allowed if S not on right side
- ✓ Conversion steps: Simplify → Start symbol → Terminals → Break long
- ✓ Terminal replacement: a in mixed rule → Uₐ with Uₐ → a
- ✓ Break long: A → B₁B₂…Bₙ → series of binary rules
- ✓ Every CFL: Has equivalent CNF (except ε)
- ✓ Parse trees: Binary structure in CNF
- ✓ CYK algorithm: Uses CNF, runs in O(n³)
- ✓ Derivation length: 2n-1 steps for string length n
- ✓ Not unique: Multiple CNF grammars for same language
Common Mistakes to Avoid
✗ Mistake 1: Allowing A → ABC in CNF
✓ Correct: Must break into binary: A → AB₁, B₁ → BC
✗ Mistake 2: Mixing terminals and variables: A → aB
✓ Correct: Replace: A → UₐB, Uₐ → a
✗ Mistake 3: Forgetting to handle start symbol on right
✓ Correct: Add S₀ → S if S appears on right
✗ Mistake 4: Allowing unit productions: A → B
✓ Correct: Remove unit productions first
✗ Mistake 5: Wrong order in breaking long productions
✓ Correct: Break left-to-right systematically
✗ Mistake 6: Allowing A → ab (two terminals)
✓ Correct: Must be A → UₐUᵦ with Uₐ → a, Uᵦ → b
Practice Questions
Q1: Is S → ABC in CNF?
Answer: No, must break to S → AB₁, B₁ → BC.
Q2: Is A → aB in CNF?
Answer: No, mixed terminal and variable. Need A → UₐB, Uₐ → a.
Q3: Can CNF have S → ε?
Answer: Yes, only if S is start symbol and doesn’t appear on right side.
Q4: Convert A → abcd to CNF
Answer: A → UₐX₁, X₁ → UᵦX₂, X₂ → UᵤUᵈ, Uₐ → a, Uᵦ → b, Uᵤ → c, Uᵈ → d.
Q5: Why is CNF useful?
Answer: Enables CYK parsing (O(n³)), uniform structure, simplifies proofs.
Q6: Are all CFLs expressible in CNF?
Answer: Yes, except ε may need special handling with S → ε.
Q7: What’s the derivation length for string “abc” in CNF?
Answer: 2n-1 = 2(3)-1 = 5 steps.
Q8: How many children do internal nodes have in CNF parse tree?
Answer: Exactly 2 (binary tree).
Summary
- Chomsky Normal Form: Restricted CFG with only A → BC or A → a
- Exception: S → ε allowed if S not on right side
- Conversion: Simplify → Start symbol → Terminals → Break long
- Every CFL: Has equivalent CNF grammar
- Properties: Binary parse trees, 2n-1 derivation length
- Applications: CYK parsing (O(n³)), NLP, compiler design
- Advantages: Uniform structure, efficient parsing, theoretical proofs
- Limitations: Larger grammar, less readable, no ε (mostly)
- Key insight: Trade readability for parsing efficiency
CNF is the foundation for the CYK parsing algorithm and many theoretical results!