Context-Free Grammars

What are Context-Free Grammars?

Context-Free Grammar (CFG) is a formal system for describing the syntax of languages. CFGs are more powerful than regular expressions and can describe languages that regular languages cannot, such as balanced parentheses and nested structures.

Simple Analogy

Think of CFG like construction rules for sentences:

  • Sentence → Subject + Verb + Object
  • Subject → “The cat” | “A dog” | “John”
  • Verb → “ate” | “chased” | “saw”
  • Object → “food” | “the mouse” | “her”

Just like grammar rules build valid sentences, CFG rules build valid strings!

Why Context-Free?

“Context-Free” means:

  • A variable can be replaced regardless of context
  • Don’t need to check what’s around it
  • Simpler than context-sensitive grammars

Example:

  • A → aA (can always replace A with aA, no matter where A appears)

Formal Definition

A Context-Free Grammar is a 4-tuple G = (V, Σ, R, S) where:

  • V: Finite set of variables (non-terminals)
  • Σ: Finite set of terminals (alphabet)
  • R: Finite set of production rules (V → (V ∪ Σ)*)
  • S: Start symbol (S ∈ V)

Constraint: V ∩ Σ = ∅ (variables and terminals are disjoint)

Components Explained

1. Variables (Non-terminals)

Notation: Usually capital letters A, B, C, S, …

Purpose: Placeholders that can be expanded using rules

Example: S, A, B, Expr, Term, Factor

2. Terminals

Notation: Usually lowercase letters a, b, c, … or symbols

Purpose: Actual symbols in the final string

Example: a, b, 0, 1, +, *, (, )

3. Production Rules

Form: Variable → String of variables and/or terminals

Notation:

  • A → α (A produces α)
  • Multiple rules: A → α | β | γ

Example:

S → aSb
S → ε

4. Start Symbol

Purpose: Where derivation begins

Usually: S (by convention)

Example: All strings derived starting from S

Simple Examples

Example 1: {0ⁿ1ⁿ | n ≥ 0}

Language: Equal number of 0’s and 1’s in sequence

CFG:

S → 0S1 | ε

How it works:

  • S → ε gives: ε
  • S → 0S1 → 0ε1 gives: 01
  • S → 0S1 → 00S11 → 00ε11 gives: 0011
  • S → 0S1 → 00S11 → 000S111 → 000ε111 gives: 000111

Pattern: Each application adds one 0 to left, one 1 to right

Note: This language is NOT regular, but IS context-free!

Example 2: Balanced Parentheses

Language: Properly nested parentheses

CFG:

S → (S) | SS | ε

Derivations:

S → ε gives: ε
S → (S) → (ε) gives: ()
S → (S) → ((S)) → ((ε)) gives: (())
S → SS → (S)S → ()S → ()(S) → ()() gives: ()()
S → (S) → (SS) → ((S)S) → (()S) → (()()) gives: (()())

Examples: (), (()), ()(), (())(), (()()), …

Example 3: Palindromes over {a, b}

Language: Strings that read the same forwards and backwards

CFG:

S → aSa | bSb | a | b | ε

Derivations:

S → ε gives: ε
S → a gives: a
S → b gives: b
S → aSa → aεa gives: aa
S → aSa → abSba → abεba gives: abba
S → aSa → aaSaa → aabaa gives: aabaa

Example 4: {aⁿbᵐ | n ≠ m}

Language: Unequal number of a’s and b’s (in sequence)

CFG:

S → AB | BA
A → aA | a
B → bBb | b

Wait, that’s wrong! Let me fix:

Correct CFG:

S → A | B
A → aAb | aA | a    (more a's than b's)
B → aBb | bB | b    (more b's than a's)

Better approach:

S → AB | BA
A → aAb | aA | a     (at least one more a)
B → aBb | bB | b     (at least one more b)

Example 5: {0ⁿ1²ⁿ | n ≥ 0}

Language: Twice as many 1’s as 0’s

CFG:

S → 0S11 | ε

Derivations:

S → ε gives: ε
S → 0S11 → 0ε11 gives: 011
S → 0S11 → 00S1111 → 00ε1111 gives: 001111

Language Generated by CFG

Definition

L(G) = {w ∈ Σ* | S ⇒* w}

The language generated by grammar G is the set of all terminal strings derivable from the start symbol.

Derivation Notation

Single step: α ⇒ β (α derives β in one step)

Multiple steps: α ⇒* β (α derives β in zero or more steps)

One or more steps: α ⇒⁺ β (α derives β in one or more steps)

Sentential Forms

Definition: Any string derivable from start symbol (may contain variables)

Examples (for S → 0S1 | ε):

  • S (sentential form)
  • 0S1 (sentential form)
  • 00S11 (sentential form)
  • 0011 (sentential form AND sentence)

Sentence: Sentential form with only terminals (w ∈ Σ*)

Types of Production Rules

1. Recursive Rules

Self-referencing on right side

Left Recursion:

A → Aα

Example: A → Aa | b

  • Generates: b, ba, baa, baaa, …

Right Recursion:

A → αA

Example: A → aA | b

  • Generates: b, ab, aab, aaab, …

Both Sides:

A → αAβ

Example: A → aAb | ε

  • Generates: ε, ab, aabb, aaabbb, …

2. Epsilon Productions

Form: A → ε

Purpose: Generate empty string or make variable optional

Example:

S → aSb | ε

3. Unit Productions

Form: A → B (single variable on right)

Purpose: Pass control from one variable to another

Example:

S → A
A → B
B → a

4. Useless Productions

Definition: Productions that never appear in valid derivations

Types:

  • Unreachable from start symbol
  • Cannot derive terminal string

Example:

S → aA
A → bB
B → cB     (useless! B loops forever, never reaches terminal)
C → d      (useless! C unreachable from S)

Derivations

Leftmost Derivation

Rule: Always expand leftmost variable first

Notation: ⇒ₗₘ

Example: S → aSb | ε, derive “aabb”

S ⇒ₗₘ aSb ⇒ₗₘ aaSbb ⇒ₗₘ aabb
   └─expand S (leftmost variable)
       └─expand S (leftmost variable)
           └─apply ε to S

Rightmost Derivation

Rule: Always expand rightmost variable first

Notation: ⇒ᵣₘ

Example: Same grammar, derive “aabb”

S ⇒ᵣₘ aSb ⇒ᵣₘ aaSbb ⇒ᵣₘ aabb
   └─expand S (rightmost variable)
       └─expand S (rightmost variable)
           └─apply ε to S

Note: Same result, different order!

Example: Arithmetic Expressions

Grammar:

E → E + T | T
T → T * F | F
F → (E) | id

Derive: id + id * id

Leftmost:

E ⇒ₗₘ E + T
  ⇒ₗₘ T + T
  ⇒ₗₘ F + T
  ⇒ₗₘ id + T
  ⇒ₗₘ id + T * F
  ⇒ₗₘ id + F * F
  ⇒ₗₘ id + id * F
  ⇒ₗₘ id + id * id

Rightmost:

E ⇒ᵣₘ E + T
  ⇒ᵣₘ E + T * F
  ⇒ᵣₘ E + T * id
  ⇒ᵣₘ E + F * id
  ⇒ᵣₘ E + id * id
  ⇒ᵣₘ T + id * id
  ⇒ᵣₘ F + id * id
  ⇒ᵣₘ id + id * id

Designing CFGs

Strategy 1: Balanced Structures

Pattern: Wrap terminals around recursive variable

Examples:

Equal a's and b's:  S → aSb | ε
Balanced parens:    S → (S) | SS | ε
Palindromes:        S → aSa | bSb | a | b | ε

Strategy 2: Concatenation

Pattern: Use multiple variables for different parts

Example: {aⁿbᵐ | n, m ≥ 0}

S → AB
A → aA | ε
B → bB | ε

Strategy 3: Union

Pattern: Multiple alternatives from start symbol

Example: {aⁿ | n ≥ 0} ∪ {bⁿ | n ≥ 0}

S → A | B
A → aA | ε
B → bB | ε

Strategy 4: Nesting

Pattern: Recursive in the middle

Example: {aⁿbⁿcᵐdᵐ | n, m ≥ 0}

S → AB
A → aAb | ε
B → cBd | ε

Strategy 5: Conditions

Pattern: Different rules for different conditions

Example: {aⁿbᵐ | n ≠ m}

S → A | B
A → aAb | aA | a    (more a's)
B → aBb | bB | b    (more b's)

CFG vs Regular Languages

What CFG Can Do That Regular Can’t

1. Counting and Matching:

{0ⁿ1ⁿ}: CFG ✓, Regular ✗
S → 0S1 | ε

2. Nested Structures:

{(ⁿ)ⁿ}: CFG ✓, Regular ✗
S → (S) | ε

3. Palindromes:

{w | w = wᴿ}: CFG ✓, Regular ✗
S → aSa | bSb | a | b | ε

What Regular Can Do That CFG Also Can

Every regular language is context-free!

Proof: Convert regex/DFA to CFG

Example: (a|b)*

Regex: (a|b)*
CFG:
S → aS | bS | ε

Example: ab

Regex: a*b*
CFG:
S → AB
A → aA | ε
B → bB | ε

Conversion: Regular → CFG

From DFA:

  • Variable for each state
  • Rule qᵢ → a qⱼ for each transition δ(qᵢ, a) = qⱼ
  • Rule qf → ε for each accept state qf

Example DFA: States {q₀, q₁}, δ(q₀, a) = q₁, F = {q₁}

CFG:

q₀ → aq₁
q₁ → ε

Applications of CFGs

1. Programming Language Syntax

Example: Java/C assignment statement

Statement → Variable = Expression ;
Variable → id
Expression → Expression + Term | Term
Term → Term * Factor | Factor
Factor → ( Expression ) | id | number

2. XML/HTML Structure

Example: Nested tags

Document → <tag> Content </tag>
Content → Text | Document | Document Content

3. Natural Language Processing

Example: Simple English sentences

Sentence → NP VP
NP → Det Noun | Noun
VP → Verb NP | Verb
Det → the | a
Noun → cat | dog | mouse
Verb → chased | ate | saw

4. Mathematical Expressions

Example: Arithmetic with precedence

Expr → Expr + Term | Term
Term → Term * Factor | Factor
Factor → ( Expr ) | number

Handles precedence: * before +

5. Data Formats

Example: JSON-like structure

Object → { Members }
Members → Pair | Pair , Members
Pair → String : Value
Value → String | number | Object

Ambiguity (Preview)

A grammar is ambiguous if a string has multiple parse trees

Example: E → E + E | E * E | id

Derive: id + id * id

Parse Tree 1: (id + id) _ id Parse Tree 2: id + (id _ id)

Two different meanings!

Solution: Rewrite grammar to be unambiguous (next topic!)

Important Properties

Closure Properties

CFLs are closed under:

  • ✓ Union
  • ✓ Concatenation
  • ✓ Kleene Star
  • ✗ Intersection (NOT closed!)
  • ✗ Complement (NOT closed!)

Decision Problems

Decidable:

  • ✓ Is w ∈ L(G)? (membership)
  • ✓ Is L(G) = ∅? (emptiness)

Undecidable:

  • ✗ Is L(G₁) = L(G₂)? (equivalence)
  • ✗ Is L(G) = Σ*? (universality)
  • ✗ Is G ambiguous? (ambiguity)

Important Exam Points

  1. CFG Definition: G = (V, Σ, R, S)
  2. Components: Variables, terminals, rules, start symbol
  3. Rules: A → α (A ∈ V, α ∈ (V ∪ Σ)*)
  4. Language: L(G) = {w ∈ Σ* | S ⇒* w}
  5. Derivation: S ⇒ α₁ ⇒ α₂ ⇒ … ⇒ w
  6. Leftmost/Rightmost: Order of variable expansion
  7. More powerful: CFG ⊃ Regular languages
  8. Key examples: {0ⁿ1ⁿ}, balanced parentheses, palindromes
  9. Applications: Programming languages, parsers, compilers
  10. Limitations: Not closed under intersection/complement

Common Mistakes to Avoid

Mistake 1: Confusing variables and terminals
Correct: Variables (capitals) can be expanded, terminals (lowercase) are final

Mistake 2: Thinking all CFLs are ambiguous
Correct: Many CFLs have unambiguous grammars

Mistake 3: Forgetting ε productions
Correct: S → ε allows empty string in language

Mistake 4: Assuming CFG = Regular
Correct: CFG ⊃ Regular (strictly more powerful)

Mistake 5: Multiple start symbols
Correct: Exactly one start symbol S

Mistake 6: Wrong notation for rules
Correct: A → α (not A := α or A = α)

Practice Questions

Q1: Write CFG for {aⁿbⁿ | n ≥ 1}

Answer: S → aSb | ab

Q2: What language does S → SS | (S) | ε generate?

Answer: Balanced parentheses

Q3: Is every regular language context-free?

Answer: Yes! Regular ⊂ Context-Free

Q4: What’s the difference between S ⇒ α and S ⇒* α?

Answer: ⇒ is one step, ⇒* is zero or more steps

Q5: Can CFG generate {aⁿbⁿcⁿ}?

Answer: No, this requires context-sensitive grammar

Q6: What’s a sentential form?

Answer: Any string derivable from S (may contain variables)

Q7: Write CFG for palindromes over {0, 1}

Answer: S → 0S0 | 1S1 | 0 | 1 | ε

Q8: Why is CFG called “context-free”?

Answer: Variable can be replaced regardless of surrounding context

Summary

  • CFG: G = (V, Σ, R, S) - formal system for describing languages
  • Components: Variables, terminals, production rules, start symbol
  • Language: L(G) = all terminal strings derivable from S
  • More powerful: Can express {0ⁿ1ⁿ}, balanced parentheses, palindromes
  • Derivations: Leftmost (expand left variable first) vs rightmost
  • Design strategies: Balanced structures, concatenation, union, nesting
  • Applications: Programming language syntax, parsers, natural language, data formats
  • Properties: Closed under union/concatenation/star, NOT under intersection/complement
  • Key advantage: Can count and match unbounded nested structures
  • Next level: Parse trees, ambiguity, normal forms

Context-Free Grammars are the foundation of syntax analysis in compilers and formal language theory!