Formal Languages and Automata: Study Notes (Part 2)

Source: Internet. Editor: 程序博客网. Time: 2024/06/14 01:52

Regular Expressions
- Operations on Languages
- k-paths
- Algebraic Laws for REs
Decision Properties of Regular Languages
- The Pumping Lemma
  - Decision Property Equivalence
  - Decision Property Containment
Closure Properties of Regular Languages
Context-Free Grammars
- CFG Formalism
- BNF Notation
- Leftmost and Rightmost Derivations
  - Leftmost Derivations
  - Rightmost Derivations
Parse Trees
- Yield of a Parse Tree
- Generalization of Parse Trees
- Trees, leftmost, and rightmost derivations correspond
- Ambiguous Grammars
  - Equivalent definitions of ambiguous grammar
  - LL(1) Grammars
  - Inherent Ambiguity
Normal Forms for CFG’s
- Eliminate Variables That Derive Nothing
- Eliminating Useless Symbols
- Cleaning Up a Grammar
- Chomsky Normal Form

Regular Expressions

Operations on Languages

RE’s use three operations: union, concatenation, and Kleene star.
Equivalence of RE’s and Finite Automata
Converting an RE to an ε-NFA
DFA-to-RE

k-paths

A k-path is a path through the graph of the DFA that goes through no state numbered higher than k.
Endpoints are not restricted; they can be any state.
n-paths are unrestricted, where n is the number of states.
The RE for the DFA is the union of the RE’s for the n-paths from the start state to each final state.
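
The usual recurrence behind this construction can be sketched as follows. The symbol R_{ij}^{(k)} is introduced here for illustration (it does not appear in the notes) and stands for an RE describing the labels of all k-paths from state i to state j. States are assumed to be numbered 1 to n, with state 1 the start state and F the set of final states.

```latex
% Basis (k = 0): no intermediate states are allowed
R_{ij}^{(0)} = \sum_{a \,:\, \delta(i,a) = j} a \quad (\text{plus } \varepsilon \text{ if } i = j)

% Induction: a k-path either avoids state k, or passes through it one or more times
R_{ij}^{(k)} = R_{ij}^{(k-1)} + R_{ik}^{(k-1)} \left( R_{kk}^{(k-1)} \right)^{*} R_{kj}^{(k-1)}

% Final answer: union over all final states
\mathrm{RE} = \sum_{f \in F} R_{1f}^{(n)}
```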

Algebraic Laws for RE’s

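Union and concatenation behave somewhat like addition and multiplication: union is commutative and associative, and concatenation is associative and distributes over union.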
Exception: Concatenation is not commutative.

Decision Properties of Regular Languages


The Pumping Lemma


Decision Property: Equivalence

Product DFA
Given DFA’s for the regular languages L and M, let them have sets of states Q and R, respectively.

Product DFA has set of states Q × R.
I.e., pairs [q, r] with q in Q, r in R.
Start state = [q0, r0] (the start states of the DFA’s for L, M).
Transitions: δ([q, r], a) = [δL(q, a), δM(r, a)]
- δL, δM are the transition functions for the DFA’s of L, M.
- That is, we simulate the two DFA’s in the two state components of the product DFA.
Final states: [q, r] is final iff exactly one of q and r is final, so L = M iff the language of the product DFA is empty.
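
A minimal Python sketch of the equivalence test, under some assumptions: each DFA is a hypothetical triple (start, finals, delta) with a total transition dict, and a pair [q, r] counts as final when exactly one of q, r is final. The two languages are equal exactly when no such pair is reachable from [q0, r0]. The function name and the sample DFAs are illustrative only.

```python
from collections import deque

def dfas_equivalent(dfa_l, dfa_m, alphabet):
    """Each DFA is (start, finals, delta), with delta[(state, symbol)] -> state."""
    start_l, finals_l, delta_l = dfa_l
    start_m, finals_m, delta_m = dfa_m
    start = (start_l, start_m)
    seen = {start}
    queue = deque([start])
    while queue:
        q, r = queue.popleft()
        # A reachable pair in which exactly one component accepts
        # means some string is in one language but not the other.
        if (q in finals_l) != (r in finals_m):
            return False
        for a in alphabet:
            nxt = (delta_l[(q, a)], delta_m[(r, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Two DFAs for "even number of 1s"; the second has a redundant state p2.
even_a = ("e", {"e"}, {("e", "0"): "e", ("e", "1"): "o",
                       ("o", "0"): "o", ("o", "1"): "e"})
even_b = ("p", {"p", "p2"}, {("p", "0"): "p2", ("p", "1"): "q",
                             ("p2", "0"): "p", ("p2", "1"): "q",
                             ("q", "0"): "q", ("q", "1"): "p"})
# A DFA for "ends in 1".
ends_in_1 = ("s", {"t"}, {("s", "0"): "s", ("s", "1"): "t",
                          ("t", "0"): "s", ("t", "1"): "t"})

print(dfas_equivalent(even_a, even_b, "01"))     # True
print(dfas_equivalent(even_a, ends_in_1, "01"))  # False
```

Only reachable pairs are explored, so the search visits at most |Q| × |R| product states.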

Decision Property: Containment

Product DFA
Build the product DFA of the DFA’s for L and M, and define the final states [q, r] of the product so that its language is empty iff L ⊆ M.
Answer: [q, r] is final iff q is final and r is not.
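
The containment test can be sketched the same way. This is a minimal sketch that assumes each DFA is a hypothetical triple (start, finals, delta) with a total transition dict; it searches the reachable pairs [q, r] for one where q is final and r is not. The names and sample DFAs are illustrative.

```python
from collections import deque

def is_contained(dfa_l, dfa_m, alphabet):
    """Return True iff L(dfa_l) is a subset of L(dfa_m)."""
    start_l, finals_l, delta_l = dfa_l
    start_m, finals_m, delta_m = dfa_m
    start = (start_l, start_m)
    seen = {start}
    queue = deque([start])
    while queue:
        q, r = queue.popleft()
        # Final state of the product: q accepts but r does not,
        # so some string is in L but not in M.
        if q in finals_l and r not in finals_m:
            return False
        for a in alphabet:
            nxt = (delta_l[(q, a)], delta_m[(r, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# "ends in 1" is contained in "contains at least one 1", but not vice versa.
ends_in_1 = ("s", {"t"}, {("s", "0"): "s", ("s", "1"): "t",
                          ("t", "0"): "s", ("t", "1"): "t"})
has_a_1 = ("n", {"y"}, {("n", "0"): "n", ("n", "1"): "y",
                        ("y", "0"): "y", ("y", "1"): "y"})

print(is_contained(ends_in_1, has_a_1, "01"))  # True
print(is_contained(has_a_1, ends_in_1, "01"))  # False
```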
State Minimization
Algorithm is a recursion on the length of the shortest distinguishing string.
Constructing the Minimum-State DFA
Eliminating Unreachable States
Proof: No Unrelated, Smaller DFA
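
One common way to realize the minimization idea is the table-filling method, sketched below as an assumption about which algorithm the notes intend. Pairs distinguished by the empty string are marked first, then pairs whose successors on some symbol are already marked. Unmarked pairs are equivalent and can be merged. The function name and the sample DFA are hypothetical.

```python
from itertools import combinations

def distinguishable_pairs(states, alphabet, delta, finals):
    """Return the set of distinguishable state pairs, each as a frozenset."""
    # Basis: the empty string distinguishes a final state from a non-final one.
    marked = {frozenset((p, q)) for p, q in combinations(states, 2)
              if (p in finals) != (q in finals)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for a in alphabet:
                succ = frozenset((delta[(p, a)], delta[(q, a)]))
                # If the successors are distinguished by w, then p and q
                # are distinguished by the longer string a followed by w.
                if len(succ) == 2 and succ in marked:
                    marked.add(pair)
                    changed = True
                    break
    return marked

# DFA for "even number of 1s" with a redundant state p2.
states = ["p", "p2", "q"]
delta = {("p", "0"): "p2", ("p", "1"): "q",
         ("p2", "0"): "p", ("p2", "1"): "q",
         ("q", "0"): "q", ("q", "1"): "p"}
marked = distinguishable_pairs(states, "01", delta, {"p", "p2"})

print(frozenset(("p", "p2")) in marked)  # False: p and p2 can be merged
print(frozenset(("p", "q")) in marked)   # True
```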

Closure Properties of Regular Languages

Union
Intersection
Difference
Concatenation
Kleene Closure
Reversal
Homomorphism
Inverse Homomorphism
TODO:

Context-Free Grammars

CFG Formalism

Terminals = symbols of the alphabet of the language being defined.
Variables = nonterminals = a finite set of other symbols, each of which represents a language.
Start symbol = the variable whose language is the one being defined.
A production has the form variable (head) -> string of variables and terminals (body).

Iterated Derivation
=>* means “zero or more derivation steps.”
Any string of variables and/or terminals derived from the start symbol is called a sentential form.

Formally, α is a sentential form iff S =>* α.
If G is a CFG, then L(G), the language of G, is {w | S =>* w}.
A language that is defined by some CFG is called a context-free language.
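
To make the definition of L(G) concrete, here is a small Python sketch that enumerates the strings of L(G) up to a length bound by exploring derivations breadth-first. It rests on stated assumptions: the grammar S -> 0S1 | 01 is an example chosen for illustration, terminals are single characters, and every body contains at least one terminal, which guarantees that the search terminates. The function name is hypothetical.

```python
from collections import deque

def language_up_to(productions, start, max_len):
    """productions maps each variable to a list of bodies (tuples of symbols)."""
    words = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost variable in the sentential form, if any.
        idx = next((i for i, s in enumerate(form) if s in productions), None)
        if idx is None:
            words.add("".join(form))  # all terminals: a string of L(G)
            continue
        for body in productions[form[idx]]:
            new_form = form[:idx] + tuple(body) + form[idx + 1:]
            n_terminals = sum(1 for s in new_form if s not in productions)
            # Prune forms that already contain too many terminals.
            if n_terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words

grammar = {"S": [("0", "S", "1"), ("0", "1")]}
print(sorted(language_up_to(grammar, "S", 6)))  # ['000111', '0011', '01']
```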

BNF Notation

Variables are words in <…>;
Terminals are often multicharacter strings indicated by boldface or underline;
Symbol ::= is often used for ->.
Symbol | is used for “or.”
Symbol … is used for “one or more.”
Surround one or more symbols by […] to make them optional.
Use {…} to surround a sequence of symbols that need to be treated as a unit.

Leftmost and Rightmost Derivations

Derivations allow us to replace any of the variables in a string.
This leads to many different derivations of the same string.
By forcing the leftmost variable (or alternatively, the rightmost variable) to be replaced, we avoid these “distinctions without a difference.”

Leftmost Derivations

Say wAα =>lm wβα if w is a string of terminals only and A -> β is a production.
Also, α =>*lm β if α becomes β by a sequence of 0 or more =>lm steps.

Rightmost Derivations

Say αAw =>rm αβw if w is a string of terminals only and A -> β is a production.
Also, α =>*rm β if α becomes β by a sequence of 0 or more =>rm steps.

Parse Trees

Parse trees are trees labeled by symbols of a particular CFG.
Leaves: labeled by a terminal or ε.
Interior nodes: labeled by a variable.
Children are labeled by the body of a production for the parent.
Root: must be labeled by the start symbol.

Yield of a Parse Tree

The concatenation of the labels of the leaves in left-to-right order (that is, in the order of a preorder traversal) is called the yield of the parse tree.
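
A minimal sketch of computing the yield, assuming a parse tree is represented as a hypothetical (label, children) pair in which a leaf has an empty child list. The example tree is for the string 0011 under the grammar S -> 0S1 | 01, which is an assumed example.

```python
def tree_yield(node):
    """Concatenate the leaf labels of a parse tree from left to right."""
    label, children = node
    if not children:
        # A leaf labeled ε contributes the empty string.
        return "" if label == "ε" else label
    return "".join(tree_yield(child) for child in children)

# Parse tree for 0011: the root uses S -> 0S1, the inner S uses S -> 01.
tree = ("S", [("0", []),
              ("S", [("0", []), ("1", [])]),
              ("1", [])])

print(tree_yield(tree))  # 0011
```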

Generalization of Parse Trees

We sometimes talk about trees that are not exactly parse trees, but only because the root is labeled by some variable A that is not the start symbol.
Call these parse trees with root A.

Trees, leftmost, and rightmost derivations correspond

If there is a parse tree with root labeled A and yield w, then A =>∗lm w.
If A =>∗lm w, then there is a parse tree with root A and yield w.

Ambiguous Grammars

A CFG is ambiguous if there is a string in the language that is the yield of two or more parse trees.
If there are two different parse trees, they must produce two different leftmost derivations by the construction given in the proof.
Conversely, two different leftmost derivations produce different parse trees by the other part of the proof.
Likewise for rightmost derivations.

Equivalent definitions of “ambiguous grammar”

There is a string in the language that has two different leftmost derivations.
There is a string in the language that has two different rightmost derivations.

Ambiguity is a Property of Grammars, not Languages.

LL(1) Grammars

As an aside, a grammar such as B -> (RB | ε, R -> ) | (RR, where you can always figure out the production to use in a leftmost derivation by scanning the given string left to right and looking only at the next symbol, is called LL(1).
Most programming languages have LL(1) grammars.
LL(1) grammars are never ambiguous.

Inherent Ambiguity

A language is inherently ambiguous if every grammar for the language is ambiguous.

Normal Forms for CFG’s

Eliminate Variables That Derive Nothing

Variables That Derive Nothing

Discover all variables that derive terminal strings.

For all other variables, remove all productions in which they appear in either the head or body.

Consider: S -> AB, A -> aA | a, B -> AB
Although A derives all strings of a’s, B derives no terminal strings.
Why? The only production for B leaves a B in the sentential form.
Thus, S derives nothing, and the language is empty.

S -> AB | C, A -> aA | a, B -> bB, C -> c

Basis: A and C are discovered because of A -> a and C -> c.
Induction: S is discovered because of S -> C.
Nothing else can be discovered.
Result: S -> C, A -> aA | a, C -> c
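
The basis and induction above amount to a fixed-point computation. Here is a minimal Python sketch, assuming a grammar is a dict from each variable to a list of bodies written as strings of single-character symbols. The function names are hypothetical.

```python
def generating_variables(productions, terminals):
    """Return the variables that derive at least one terminal string."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            for body in bodies:
                # A body works if every symbol is a terminal or already discovered.
                if all(s in terminals or s in generating for s in body):
                    generating.add(head)
                    changed = True
                    break
    return generating

def drop_nongenerating(productions, terminals):
    """Remove every production that mentions a non-generating variable."""
    keep = generating_variables(productions, terminals)
    return {head: [b for b in bodies
                   if all(s in terminals or s in keep for s in b)]
            for head, bodies in productions.items() if head in keep}

grammar = {"S": ["AB", "C"], "A": ["aA", "a"], "B": ["bB"], "C": ["c"]}
terminals = {"a", "b", "c"}

print(sorted(generating_variables(grammar, terminals)))  # ['A', 'C', 'S']
print(drop_nongenerating(grammar, terminals))
# {'S': ['C'], 'A': ['aA', 'a'], 'C': ['c']}
```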

Eliminating Useless Symbols

Unreachable Symbols
Another way a terminal or variable deserves to be eliminated is if it cannot appear in any derivation from the start symbol.

Eliminate symbols that derive no terminal string.
Eliminate unreachable symbols.

Remove from the grammar all symbols not discovered to be reachable from S and all productions that involve these symbols.
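
Reachability is a graph search from the start symbol. This is a minimal Python sketch, again assuming a grammar is a dict from variables to lists of string bodies; the function name and the example grammar S -> C, A -> aA | a, C -> c are illustrative.

```python
def reachable_symbols(productions, start):
    """Return all variables and terminals that appear in some derivation from start."""
    reachable = {start}
    stack = [start]
    while stack:
        head = stack.pop()
        for body in productions.get(head, []):
            for s in body:
                if s not in reachable:
                    reachable.add(s)
                    if s in productions:  # only variables can be expanded further
                        stack.append(s)
    return reachable

grammar = {"S": ["C"], "A": ["aA", "a"], "C": ["c"]}
reachable = reachable_symbols(grammar, "S")
cleaned = {head: bodies for head, bodies in grammar.items() if head in reachable}

print(sorted(reachable))  # ['C', 'S', 'c']
print(cleaned)            # {'S': ['C'], 'C': ['c']}
```

Here A and the terminal a are never reached from S, so the productions for A are dropped.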

Epsilon Productions
Theorem: If L is a CFL, then L-{ε} has a CFG with no ε-productions.

Nullable Symbols
nullable symbols = variables A such that A =>* ε.

S -> AB, A -> aA | ε, B -> bB | A

Basis: A is nullable because of A -> ε.
Induction: B is nullable because of B -> A.
Then, S is nullable because of S -> AB.
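
The same basis-and-induction pattern can be sketched in Python. The sketch assumes a grammar is a dict from variables to lists of string bodies, with the empty string standing for an ε body; the function name is hypothetical.

```python
def nullable_variables(productions):
    """Return the variables A such that A =>* ε."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            # A body is nullable if all its symbols are nullable variables;
            # the empty body "" (an ε-production) satisfies this trivially.
            if any(all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable

grammar = {"S": ["AB"], "A": ["aA", ""], "B": ["bB", "A"]}
print(sorted(nullable_variables(grammar)))  # ['A', 'B', 'S']
```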

Unit Productions
A unit production is one whose body consists of exactly one variable.
These productions can be eliminated.
Key idea:

If A =>* B by a series of unit productions,
and B -> α is a non-unit production, then add production A -> α.
Then, drop all unit productions.
To do this, find all pairs (A, B) such that A =>* B by a sequence of unit productions only.
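
A minimal Python sketch of this idea, under stated assumptions: a grammar is a dict from variables to lists of string bodies over single-character symbols, and every variable has an entry in the dict. The expression grammar used as the example and both function names are hypothetical.

```python
def unit_pairs(productions):
    """Return all (A, B) with A =>* B using unit productions only."""
    pairs = {(a, a) for a in productions}  # basis: zero steps
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for body in productions[b]:
                if len(body) == 1 and body in productions:  # B -> C is a unit production
                    if (a, body) not in pairs:
                        pairs.add((a, body))
                        changed = True
    return pairs

def eliminate_unit_productions(productions):
    """For each unit pair (A, B), give A every non-unit body of B."""
    result = {a: set() for a in productions}
    for a, b in unit_pairs(productions):
        for body in productions[b]:
            is_unit = len(body) == 1 and body in productions
            if not is_unit:
                result[a].add(body)
    return result

grammar = {"E": ["T", "E+T"], "T": ["F", "T*F"], "F": ["a", "(E)"]}
new_grammar = eliminate_unit_productions(grammar)
print(sorted(new_grammar["E"]))  # ['(E)', 'E+T', 'T*F', 'a']
```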

TODO: Proof That We Find Exactly the Right Pairs
TODO: Proof That the Unit-Production-Elimination Algorithm Works

Cleaning Up a Grammar

Theorem: if L is a CFL, then there is a CFG for L – {ε} that has:

No useless symbols.
No ε-productions.
No unit productions.

i.e., every body is either a single terminal or has length ≥ 2.

Perform the following steps in order:

Eliminate ε-productions.

Eliminate unit productions.

Eliminate variables that derive no terminal string.

Eliminate variables not reached from the start symbol.

Chomsky Normal Form

A CFG is said to be in Chomsky Normal Form if every production is of one of these two forms:

A -> BC (body is two variables).
A -> a (body is a single terminal).

Theorem: If L is a CFL, then L – {ε} has a CFG in CNF.
Step 1: “Clean” the grammar, so every body is either a single terminal or of length at least 2.
Step 2: For each body that is not a single terminal, make the right side all variables.

For each terminal a, create a new variable A_a and production A_a -> a.
Replace a by A_a in bodies of length ≥ 2.
Consider production A -> BcDe.
We need variables A_c and A_e, with productions A_c -> c and A_e -> e.
- Note: you create at most one variable for each terminal, and use it everywhere it is needed.
Replace A -> BcDe by A -> B A_c D A_e.

Step 3: Break right sides longer than 2 into a chain of productions with right sides of two variables.

Example: A -> BCDE is replaced by A -> BF, F -> CG, and G -> DE.
- F and G must be used nowhere else.
In the new grammar, A => BF => BCG => BCDE.
More importantly: once we choose to replace A by BF, we must continue to BCG and BCDE.
- Because F and G each have only one production.
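
Steps 2 and 3 can be sketched together in Python. This is a minimal sketch, assuming the grammar has already been cleaned (Step 1) and is given as a list of (head, body) pairs with bodies as lists of symbols. The fresh variable names A_c, X1, X2, ... are hypothetical and assumed not to clash with variables already in the grammar.

```python
def to_cnf(productions, terminals):
    """Apply Steps 2 and 3 to a cleaned grammar; return the new production list."""
    result = []
    term_var = {}  # at most one new variable per terminal
    step2 = []
    for head, body in productions:
        if len(body) >= 2:
            new_body = []
            for s in body:
                if s in terminals:
                    if s not in term_var:
                        term_var[s] = "A_" + s
                        result.append((term_var[s], [s]))
                    new_body.append(term_var[s])
                else:
                    new_body.append(s)
            step2.append((head, new_body))
        else:
            step2.append((head, list(body)))  # a single terminal: already in CNF
    counter = 0
    for head, body in step2:
        # Step 3: peel off the first variable and chain the rest through a
        # fresh variable that is used nowhere else.
        while len(body) > 2:
            counter += 1
            fresh = "X%d" % counter
            result.append((head, [body[0], fresh]))
            head, body = fresh, body[1:]
        result.append((head, body))
    return result

cnf = to_cnf([("A", ["B", "c", "D", "e"])], {"c", "e"})
for head, body in cnf:
    print(head, "->", " ".join(body))
# A_c -> c
# A_e -> e
# A -> B X1
# X1 -> A_c X2
# X2 -> D A_e
```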
