Chomsky Normal Form

What is Chomsky Normal Form?

Chomsky Normal Form (CNF) is a restricted form of context-free grammar where every production has a very specific, simple structure.

Formal Definition

A CFG is in Chomsky Normal Form if every production is of one of these two forms:

A → BC (two variables on right)

A → a (single terminal on right)

Exception: S → ε is allowed if S doesn’t appear on any right side

That’s it! Only these forms allowed!

Simple Analogy

Think of CNF like standardized LEGO blocks:

Regular CFG: Mix of different block sizes and shapes
CNF: Only 2 types - “2-stud blocks” (A → BC) and “1-stud blocks” (A → a)
Easier to work with - uniform structure

Why CNF?

Advantages:

Parsing efficiency: CYK algorithm runs in O(n³)
Uniform structure: Easy to analyze
Theoretical proofs: Simplifies many proofs
Parse tree structure: Always binary tree

Applications:

CYK parsing algorithm
Natural language processing
Compiler design
Grammar analysis

CNF Rules Explained

Rule Type 1: A → BC

Form: Variable produces exactly two variables

Examples:

S → AB ✓
A → BC ✓
E → TF ✓

NOT allowed:

S → ABC ✗ (three variables)
S → A ✗ (one variable - unit production)
S → ε ✗ (epsilon - unless S is start)

Rule Type 2: A → a

Form: Variable produces exactly one terminal

Examples:

A → a ✓
B → b ✓
T → 0 ✓

NOT allowed:

A → ab ✗ (two terminals)
A → Aa ✗ (mix of variable and terminal)
A → abc ✗ (multiple terminals)

Special Case: S → ε

Allowed ONLY if:

S is the start symbol
S does not appear on the right side of any production

Example:

S → AB | ε ✓ (if S not on right side)
S → SS | ε ✗ (S appears on right side!)

Conversion Algorithm

Input: Any CFG G
Output: Equivalent CFG G’ in CNF

Steps:

1. Simplify grammar (remove useless, ε, unit productions)
2. Handle start symbol appearing on right side
3. Remove terminals from mixed productions
4. Break long productions into binary form

Step 1: Simplify Grammar

Apply simplification (from previous topic):

Remove useless symbols
Remove ε-productions (except S → ε if needed)
Remove unit productions

Result: Cleaner grammar ready for CNF conversion

Step 2: Handle Start Symbol

If S appears on right side of any production:

1. Create new start symbol S₀
2. Add production S₀ → S
3. S₀ never appears on right side

Example:

Before:
S → SS | a

After:
S₀ → S    (new start)
S → SS | a

Step 3: Remove Terminals from Mixed Productions

For each production with both variables and terminals:

Replace each terminal a with new variable Uₐ
Add production Uₐ → a

Example:

Before:
S → aSb

After:
S → UₐSUᵦ
Uₐ → a
Uᵦ → b

Another Example:

Before:
A → aB | Ab | a

After:
A → UₐB | AUᵦ | a  (a alone is fine)
Uₐ → a
Uᵦ → b

Step 4: Break Long Productions

For productions A → B₁B₂…Bₙ where n ≥ 3:

Replace with:
A → B₁C₁
C₁ → B₂C₂
C₂ → B₃C₃
...
Cₙ₋₂ → Bₙ₋₁Bₙ

Example:

Before:
A → BCDE

After:
A → BC₁
C₁ → CD₁
D₁ → DE

Another Example:

Before:
S → ABCD

After:
S → AX₁
X₁ → BX₂
X₂ → CD

Complete Conversion Examples

Example 1: Simple Grammar

Original Grammar:

S → AB
A → a
B → b

Check CNF:

S → AB ✓ (two variables)
A → a ✓ (one terminal)
B → b ✓ (one terminal)

Already in CNF! ✓

Example 2: Adding ε

Original Grammar:

S → aSb | ε

Step 1: Simplify (remove ε except S → ε)

Nullable: {S}
S → aSb becomes: S → aSb | ab | ε

Step 2: S on right side? No ✓

Step 3: Remove terminals from mixed

S → aSb becomes S → UₐSUᵦ
S → ab becomes S → UₐUᵦ

Step 4: Break long productions

S → UₐSUᵦ: three symbols → break
- S → UₐX
- X → SUᵦ
S → UₐUᵦ: two symbols → OK

Final CNF:

S → UₐX | UₐUᵦ | ε
X → SUᵦ
Uₐ → a
Uᵦ → b

Example 3: More Complex

Original Grammar:

S → AB | a
A → aAb | ab
B → b

Step 1: Already simplified

Step 2: S not on right ✓

Step 3: Remove terminals from mixed

A → aAb becomes A → UₐAUᵦ
A → ab becomes A → UₐUᵦ

Step 4: Break long productions

A → UₐAUᵦ: three symbols
- A → UₐX
- X → AUᵦ
A → UₐUᵦ: two symbols OK

Final CNF:

S → AB | a
A → UₐX | UₐUᵦ
X → AUᵦ
B → b
Uₐ → a
Uᵦ → b

Example 4: Arithmetic Expressions

Original Grammar:

E → E + T | T
T → T * F | F
F → (E) | id

Step 1: Remove unit productions

E → T becomes E → T * F | (E) | id
T → F becomes T → (E) | id

Grammar after Step 1:

E → E + T | T * F | (E) | id
T → T * F | (E) | id
F → (E) | id

Step 2: S (E) not on right… wait, E is on right! Actually, that’s OK - only problem if START symbol on right

Step 3: Remove terminals from mixed

E + T: E → EU₊T
T * F: T → TU*F
(E): F → U₍EU₎

Step 4: Break long productions

E → EU₊T: three symbols
- E → EX₁
- X₁ → U₊T
Similarly for others

Final CNF (simplified notation):

E → EX₁ | TX₂ | U₍X₃ | id
X₁ → U₊T
X₂ → U*F
X₃ → EU₎
T → TX₂ | U₍X₃ | id
F → U₍X₃ | id
U₊ → +
U* → *
U₍ → (
U₎ → )

Example 5: Balanced Parentheses

Original Grammar:

S → (S) | SS | ε

Step 1: Remove ε

Nullable: {S}
S → (S) becomes: (S) | ()
S → SS becomes: SS | S
Keep S → ε (start symbol)

After Step 1:

S → (S) | () | SS | S | ε

Remove unit S → S: (already handled)

Step 2: S on right side in SS! Add S₀ → S

Step 3: Remove terminals

(S) becomes U₍SU₎
() becomes U₍U₎

Step 4: Break long

U₍SU₎: three symbols
- S → U₍X
- X → SU₎
U₍U₎: two symbols OK

Final CNF:

S₀ → S
S → U₍X | U₍U₎ | SS | ε
X → SU₎
U₍ → (
U₎ → )

Parse Trees in CNF

CNF parse trees are binary trees (except terminal leaves)

Example: Parse “aabb”

Grammar (CNF):

S → UₐX | UₐUᵦ | ε
X → SUᵦ
Uₐ → a
Uᵦ → b

Derivation:

S ⇒ UₐX ⇒ UₐSUᵦ ⇒ UₐUₐXUᵦ ⇒ UₐUₐSUᵦUᵦ ⇒ UₐUₐUᵦUᵦ ⇒ aabb

Parse Tree:

Binary tree structure!

CYK Parsing Algorithm

CYK (Cocke-Younger-Kasami) algorithm uses CNF for efficient parsing

Algorithm Overview

Input: String w, Grammar G in CNF
Output: Whether w ∈ L(G)

Method: Dynamic programming

Time Complexity: O(n³|G|) where n = |w|, |G| = grammar size

Key Idea

Build table bottom-up:

Table[i,j] = variables that derive substring from position i, length j
Start with length 1 (terminals)
Build up to length n (full string)

Example: CYK for “aabb”

Grammar:

S → AB | AA
A → a
B → BB | AB

Wait, B → BB not in CNF! Need two variables, not two same variables… Actually that IS CNF!

Let me use proper example:

Grammar in CNF:

S → AB
A → BC | a
B → AC | b
C → AB | a

String: “aaab”

CYK Table: (complex, see textbook for details)

Result: Check if S in Table[1, 4] (position 1, length 4)

Properties of CNF

1. Binary Parse Trees

Every internal node has exactly 2 children (except terminals)

Height: O(n) for string of length n

2. Derivation Length

For string of length n: Exactly 2n - 1 steps in derivation

Why?

Each A → BC creates 2 variables
Each A → a creates 1 terminal
Need n terminals, so n-1 binary splits
Total: n + (n-1) = 2n - 1

3. Unique Structure

CNF is not unique for a language

Multiple CNF grammars can generate same language

4. Language Equivalence

Theorem: Every CFL without ε has equivalent CNF grammar

Proof: Conversion algorithm constructs it!

Applications

1. CYK Parsing

Most important application

Benefits:

O(n³) time
Handles ambiguous grammars
Easy to implement

2. Natural Language Processing

Parse natural language sentences:

Convert grammar to CNF
Use CYK for efficient parsing
Find all possible parse trees

3. Compiler Design

Syntax analysis:

Some parsing algorithms prefer CNF
Uniform structure simplifies implementation

4. Theoretical Proofs

CNF simplifies many proofs:

Pumping lemma for CFLs
Closure properties
Decidability results

Limitations of CNF

1. Size Increase

CNF grammar is typically larger than original

Why?: Adding new variables for

Terminal replacement
Breaking long productions

Example: n-length right side → n-1 new variables

2. Readability

CNF is less readable than natural grammar

Original: S → if E then S else S
CNF: Many small rules, hard to understand

3. No ε (mostly)

Can’t have ε productions (except S → ε)

Problem: Some languages naturally include ε

4. Parsing Overhead

More rules = more work in parsing

But: O(n³) is still good for general CFGs

CNF vs Greibach Normal Form

CNF: A → BC | a
GNF: A → aα (terminal followed by variables)

Comparison:

Feature	CNF	GNF
Form	A → BC or A → a	A → aα
Parse tree	Binary	n-ary
Left recursion	Possible	Not possible
Parsing	CYK (bottom-up)	Top-down
Conversion	Easier	Harder

Both are useful for different purposes!

Important Exam Points

✓ CNF form: A → BC or A → a (only these two types)
✓ Exception: S → ε allowed if S not on right side
✓ Conversion steps: Simplify → Start symbol → Terminals → Break long
✓ Terminal replacement: a in mixed rule → Uₐ with Uₐ → a
✓ Break long: A → B₁B₂…Bₙ → series of binary rules
✓ Every CFL: Has equivalent CNF (except ε)
✓ Parse trees: Binary structure in CNF
✓ CYK algorithm: Uses CNF, runs in O(n³)
✓ Derivation length: 2n-1 steps for string length n
✓ Not unique: Multiple CNF grammars for same language

Common Mistakes to Avoid

✗ Mistake 1: Allowing A → ABC in CNF
✓ Correct: Must break into binary: A → AB₁, B₁ → BC

✗ Mistake 2: Mixing terminals and variables: A → aB
✓ Correct: Replace: A → UₐB, Uₐ → a

✗ Mistake 3: Forgetting to handle start symbol on right
✓ Correct: Add S₀ → S if S appears on right

✗ Mistake 4: Allowing unit productions: A → B
✓ Correct: Remove unit productions first

✗ Mistake 5: Wrong order in breaking long productions
✓ Correct: Break left-to-right systematically

✗ Mistake 6: Allowing A → ab (two terminals)
✓ Correct: Must be A → UₐUᵦ with Uₐ → a, Uᵦ → b

Practice Questions

Q1: Is S → ABC in CNF?

Answer: No, must break to S → AB₁, B₁ → BC.

Q2: Is A → aB in CNF?

Answer: No, mixed terminal and variable. Need A → UₐB, Uₐ → a.

Q3: Can CNF have S → ε?

Answer: Yes, only if S is start symbol and doesn’t appear on right side.

Q4: Convert A → abcd to CNF

Answer: A → UₐX₁, X₁ → UᵦX₂, X₂ → UᵤUᵈ, Uₐ → a, Uᵦ → b, Uᵤ → c, Uᵈ → d.

Q5: Why is CNF useful?

Answer: Enables CYK parsing (O(n³)), uniform structure, simplifies proofs.

Q6: Are all CFLs expressible in CNF?

Answer: Yes, except ε may need special handling with S → ε.

Q7: What’s the derivation length for string “abc” in CNF?

Answer: 2n-1 = 2(3)-1 = 5 steps.

Q8: How many children do internal nodes have in CNF parse tree?

Answer: Exactly 2 (binary tree).

Summary

Chomsky Normal Form: Restricted CFG with only A → BC or A → a
Exception: S → ε allowed if S not on right side
Conversion: Simplify → Start symbol → Terminals → Break long
Every CFL: Has equivalent CNF grammar
Properties: Binary parse trees, 2n-1 derivation length
Applications: CYK parsing (O(n³)), NLP, compiler design
Advantages: Uniform structure, efficient parsing, theoretical proofs
Limitations: Larger grammar, less readable, no ε (mostly)
Key insight: Trade readability for parsing efficiency

CNF is the foundation for the CYK parsing algorithm and many theoretical results!

Chomsky Normal Form

What is Chomsky Normal Form?

Formal Definition

Simple Analogy

Why CNF?

CNF Rules Explained

Rule Type 1: A → BC

Rule Type 2: A → a

Special Case: S → ε

Conversion Algorithm

Step 1: Simplify Grammar

Step 2: Handle Start Symbol

Step 3: Remove Terminals from Mixed Productions

Step 4: Break Long Productions

Complete Conversion Examples

Example 1: Simple Grammar

Example 2: Adding ε

Example 3: More Complex

Example 4: Arithmetic Expressions

Example 5: Balanced Parentheses

Parse Trees in CNF

Example: Parse “aabb”

CYK Parsing Algorithm

Algorithm Overview

Key Idea

Example: CYK for “aabb”

Properties of CNF

1. Binary Parse Trees

2. Derivation Length

3. Unique Structure

4. Language Equivalence

Applications

1. CYK Parsing

2. Natural Language Processing

3. Compiler Design

4. Theoretical Proofs

Limitations of CNF

1. Size Increase

2. Readability

3. No ε (mostly)

4. Parsing Overhead

CNF vs Greibach Normal Form

Important Exam Points

Common Mistakes to Avoid

Practice Questions

Summary

Install Notes App