Chomsky Normal Form

What is Chomsky Normal Form?

Chomsky Normal Form (CNF) is a restricted form of context-free grammar where every production has a very specific, simple structure.

Formal Definition

A CFG is in Chomsky Normal Form if every production is of one of these two forms:

  1. A → BC (two variables on right)
  2. A → a (single terminal on right)

Exception: S → ε is allowed if S doesn’t appear on any right side

That’s it! Only these forms allowed!

Simple Analogy

Think of CNF like standardized LEGO blocks:

  • Regular CFG: Mix of different block sizes and shapes
  • CNF: Only 2 types - “2-stud blocks” (A → BC) and “1-stud blocks” (A → a)
  • Easier to work with - uniform structure

Why CNF?

Advantages:

  1. Parsing efficiency: CYK algorithm runs in O(n³)
  2. Uniform structure: Easy to analyze
  3. Theoretical proofs: Simplifies many proofs
  4. Parse tree structure: Always binary tree

Applications:

  • CYK parsing algorithm
  • Natural language processing
  • Compiler design
  • Grammar analysis

CNF Rules Explained

Rule Type 1: A → BC

Form: Variable produces exactly two variables

Examples:

S → AB ✓
A → BC ✓
E → TF ✓

NOT allowed:

S → ABC ✗ (three variables)
S → A ✗ (one variable - unit production)
S → ε ✗ (epsilon - unless S is start)

Rule Type 2: A → a

Form: Variable produces exactly one terminal

Examples:

A → a ✓
B → b ✓
T → 0 ✓

NOT allowed:

A → ab ✗ (two terminals)
A → Aa ✗ (mix of variable and terminal)
A → abc ✗ (multiple terminals)

Special Case: S → ε

Allowed ONLY if:

  • S is the start symbol
  • S does not appear on the right side of any production

Example:

S → AB | ε ✓ (if S not on right side)
S → SS | ε ✗ (S appears on right side!)

Conversion Algorithm

Input: Any CFG G
Output: Equivalent CFG G’ in CNF

Steps:

1. Simplify grammar (remove useless, ε, unit productions)
2. Handle start symbol appearing on right side
3. Remove terminals from mixed productions
4. Break long productions into binary form

Step 1: Simplify Grammar

Apply simplification (from previous topic):

  • Remove useless symbols
  • Remove ε-productions (except S → ε if needed)
  • Remove unit productions

Result: Cleaner grammar ready for CNF conversion

Step 2: Handle Start Symbol

If S appears on right side of any production:

1. Create new start symbol S₀
2. Add production S₀ → S
3. S₀ never appears on right side

Example:

Before:
S → SS | a

After:
S₀ → S    (new start)
S → SS | a

Step 3: Remove Terminals from Mixed Productions

For each production with both variables and terminals:

Replace each terminal a with new variable Uₐ
Add production Uₐ → a

Example:

Before:
S → aSb

After:
S → UₐSUᵦ
Uₐ → a
Uᵦ → b

Another Example:

Before:
A → aB | Ab | a

After:
A → UₐB | AUᵦ | a  (a alone is fine)
Uₐ → a
Uᵦ → b

Step 4: Break Long Productions

For productions A → B₁B₂…Bₙ where n ≥ 3:

Replace with:
A → B₁C₁
C₁ → B₂C₂
C₂ → B₃C₃
...
Cₙ₋₂ → Bₙ₋₁Bₙ

Example:

Before:
A → BCDE

After:
A → BC₁
C₁ → CD₁
D₁ → DE

Another Example:

Before:
S → ABCD

After:
S → AX₁
X₁ → BX₂
X₂ → CD

Complete Conversion Examples

Example 1: Simple Grammar

Original Grammar:

S → AB
A → a
B → b

Check CNF:

  • S → AB ✓ (two variables)
  • A → a ✓ (one terminal)
  • B → b ✓ (one terminal)

Already in CNF!

Example 2: Adding ε

Original Grammar:

S → aSb | ε

Step 1: Simplify (remove ε except S → ε)

  • Nullable: {S}
  • S → aSb becomes: S → aSb | ab | ε

Step 2: S on right side? No ✓

Step 3: Remove terminals from mixed

  • S → aSb becomes S → UₐSUᵦ
  • S → ab becomes S → UₐUᵦ

Step 4: Break long productions

  • S → UₐSUᵦ: three symbols → break
    • S → UₐX
    • X → SUᵦ
  • S → UₐUᵦ: two symbols → OK

Final CNF:

S → UₐX | UₐUᵦ | ε
X → SUᵦ
Uₐ → a
Uᵦ → b

Example 3: More Complex

Original Grammar:

S → AB | a
A → aAb | ab
B → b

Step 1: Already simplified

Step 2: S not on right ✓

Step 3: Remove terminals from mixed

  • A → aAb becomes A → UₐAUᵦ
  • A → ab becomes A → UₐUᵦ

Step 4: Break long productions

  • A → UₐAUᵦ: three symbols
    • A → UₐX
    • X → AUᵦ
  • A → UₐUᵦ: two symbols OK

Final CNF:

S → AB | a
A → UₐX | UₐUᵦ
X → AUᵦ
B → b
Uₐ → a
Uᵦ → b

Example 4: Arithmetic Expressions

Original Grammar:

E → E + T | T
T → T * F | F
F → (E) | id

Step 1: Remove unit productions

  • E → T becomes E → T * F | (E) | id
  • T → F becomes T → (E) | id

Grammar after Step 1:

E → E + T | T * F | (E) | id
T → T * F | (E) | id
F → (E) | id

Step 2: S (E) not on right… wait, E is on right! Actually, that’s OK - only problem if START symbol on right

Step 3: Remove terminals from mixed

  • E + T: E → EU₊T
  • T * F: T → TU*F
  • (E): F → U₍EU₎

Step 4: Break long productions

  • E → EU₊T: three symbols
    • E → EX₁
    • X₁ → U₊T
  • Similarly for others

Final CNF (simplified notation):

E → EX₁ | TX₂ | U₍X₃ | id
X₁ → U₊T
X₂ → U*F
X₃ → EU₎
T → TX₂ | U₍X₃ | id
F → U₍X₃ | id
U₊ → +
U* → *
U₍ → (
U₎ → )

Example 5: Balanced Parentheses

Original Grammar:

S → (S) | SS | ε

Step 1: Remove ε

  • Nullable: {S}
  • S → (S) becomes: (S) | ()
  • S → SS becomes: SS | S
  • Keep S → ε (start symbol)

After Step 1:

S → (S) | () | SS | S | ε

Remove unit S → S: (already handled)

Step 2: S on right side in SS! Add S₀ → S

Step 3: Remove terminals

  • (S) becomes U₍SU₎
  • () becomes U₍U₎

Step 4: Break long

  • U₍SU₎: three symbols
    • S → U₍X
    • X → SU₎
  • U₍U₎: two symbols OK

Final CNF:

S₀ → S
S → U₍X | U₍U₎ | SS | ε
X → SU₎
U₍ → (
U₎ → )

Parse Trees in CNF

CNF parse trees are binary trees (except terminal leaves)

Example: Parse “aabb”

Grammar (CNF):

S → UₐX | UₐUᵦ | ε
X → SUᵦ
Uₐ → a
Uᵦ → b

Derivation:

S ⇒ UₐX ⇒ UₐSUᵦ ⇒ UₐUₐXUᵦ ⇒ UₐUₐSUᵦUᵦ ⇒ UₐUₐUᵦUᵦ ⇒ aabb

Parse Tree:

           S
          / \
         Uₐ  X
         |  / \
         a S  Uᵦ
          /\  |
         Uₐ X  b
         | /\
         a S Uᵦ
          /\  |
         Uₐ Uᵦ b
         |  |
         a  b

Binary tree structure!

CYK Parsing Algorithm

CYK (Cocke-Younger-Kasami) algorithm uses CNF for efficient parsing

Algorithm Overview

Input: String w, Grammar G in CNF
Output: Whether w ∈ L(G)

Method: Dynamic programming

Time Complexity: O(n³|G|) where n = |w|, |G| = grammar size

Key Idea

Build table bottom-up:

  • Table[i,j] = variables that derive substring from position i, length j
  • Start with length 1 (terminals)
  • Build up to length n (full string)

Example: CYK for “aabb”

Grammar:

S → AB | AA
A → a
B → BB | AB

Wait, B → BB not in CNF! Need two variables, not two same variables… Actually that IS CNF!

Let me use proper example:

Grammar in CNF:

S → AB
A → BC | a
B → AC | b
C → AB | a

String: “aaab”

CYK Table: (complex, see textbook for details)

Result: Check if S in Table[1, 4] (position 1, length 4)

Properties of CNF

1. Binary Parse Trees

Every internal node has exactly 2 children (except terminals)

Height: O(n) for string of length n

2. Derivation Length

For string of length n: Exactly 2n - 1 steps in derivation

Why?

  • Each A → BC creates 2 variables
  • Each A → a creates 1 terminal
  • Need n terminals, so n-1 binary splits
  • Total: n + (n-1) = 2n - 1

3. Unique Structure

CNF is not unique for a language

Multiple CNF grammars can generate same language

4. Language Equivalence

Theorem: Every CFL without ε has equivalent CNF grammar

Proof: Conversion algorithm constructs it!

Applications

1. CYK Parsing

Most important application

Benefits:

  • O(n³) time
  • Handles ambiguous grammars
  • Easy to implement

2. Natural Language Processing

Parse natural language sentences:

  • Convert grammar to CNF
  • Use CYK for efficient parsing
  • Find all possible parse trees

3. Compiler Design

Syntax analysis:

  • Some parsing algorithms prefer CNF
  • Uniform structure simplifies implementation

4. Theoretical Proofs

CNF simplifies many proofs:

  • Pumping lemma for CFLs
  • Closure properties
  • Decidability results

Limitations of CNF

1. Size Increase

CNF grammar is typically larger than original

Why?: Adding new variables for

  • Terminal replacement
  • Breaking long productions

Example: n-length right side → n-1 new variables

2. Readability

CNF is less readable than natural grammar

Original: S → if E then S else S
CNF: Many small rules, hard to understand

3. No ε (mostly)

Can’t have ε productions (except S → ε)

Problem: Some languages naturally include ε

4. Parsing Overhead

More rules = more work in parsing

But: O(n³) is still good for general CFGs

CNF vs Greibach Normal Form

CNF: A → BC | a
GNF: A → aα (terminal followed by variables)

Comparison:

FeatureCNFGNF
FormA → BC or A → aA → aα
Parse treeBinaryn-ary
Left recursionPossibleNot possible
ParsingCYK (bottom-up)Top-down
ConversionEasierHarder

Both are useful for different purposes!

Important Exam Points

  1. CNF form: A → BC or A → a (only these two types)
  2. Exception: S → ε allowed if S not on right side
  3. Conversion steps: Simplify → Start symbol → Terminals → Break long
  4. Terminal replacement: a in mixed rule → Uₐ with Uₐ → a
  5. Break long: A → B₁B₂…Bₙ → series of binary rules
  6. Every CFL: Has equivalent CNF (except ε)
  7. Parse trees: Binary structure in CNF
  8. CYK algorithm: Uses CNF, runs in O(n³)
  9. Derivation length: 2n-1 steps for string length n
  10. Not unique: Multiple CNF grammars for same language

Common Mistakes to Avoid

Mistake 1: Allowing A → ABC in CNF
Correct: Must break into binary: A → AB₁, B₁ → BC

Mistake 2: Mixing terminals and variables: A → aB
Correct: Replace: A → UₐB, Uₐ → a

Mistake 3: Forgetting to handle start symbol on right
Correct: Add S₀ → S if S appears on right

Mistake 4: Allowing unit productions: A → B
Correct: Remove unit productions first

Mistake 5: Wrong order in breaking long productions
Correct: Break left-to-right systematically

Mistake 6: Allowing A → ab (two terminals)
Correct: Must be A → UₐUᵦ with Uₐ → a, Uᵦ → b

Practice Questions

Q1: Is S → ABC in CNF?

Answer: No, must break to S → AB₁, B₁ → BC.

Q2: Is A → aB in CNF?

Answer: No, mixed terminal and variable. Need A → UₐB, Uₐ → a.

Q3: Can CNF have S → ε?

Answer: Yes, only if S is start symbol and doesn’t appear on right side.

Q4: Convert A → abcd to CNF

Answer: A → UₐX₁, X₁ → UᵦX₂, X₂ → UᵤUᵈ, Uₐ → a, Uᵦ → b, Uᵤ → c, Uᵈ → d.

Q5: Why is CNF useful?

Answer: Enables CYK parsing (O(n³)), uniform structure, simplifies proofs.

Q6: Are all CFLs expressible in CNF?

Answer: Yes, except ε may need special handling with S → ε.

Q7: What’s the derivation length for string “abc” in CNF?

Answer: 2n-1 = 2(3)-1 = 5 steps.

Q8: How many children do internal nodes have in CNF parse tree?

Answer: Exactly 2 (binary tree).

Summary

  • Chomsky Normal Form: Restricted CFG with only A → BC or A → a
  • Exception: S → ε allowed if S not on right side
  • Conversion: Simplify → Start symbol → Terminals → Break long
  • Every CFL: Has equivalent CNF grammar
  • Properties: Binary parse trees, 2n-1 derivation length
  • Applications: CYK parsing (O(n³)), NLP, compiler design
  • Advantages: Uniform structure, efficient parsing, theoretical proofs
  • Limitations: Larger grammar, less readable, no ε (mostly)
  • Key insight: Trade readability for parsing efficiency

CNF is the foundation for the CYK parsing algorithm and many theoretical results!