Formal Languages and Automata: Notes (Part 2)
Source: Internet. Editor: 程序博客网. Time: 2024/06/14 01:52
- Regular Expressions
- Operations on Languages
- k-paths
- Algebraic Laws for REs
- Decision Properties of Regular Languages
- The Pumping Lemma
- Decision Property: Equivalence
- Decision Property: Containment
- Closure Properties of Regular Languages
- Context-Free Grammars
- CFG Formalism
- BNF Notation
- Leftmost and Rightmost Derivations
- Leftmost Derivations
- Rightmost Derivations
- Parse Trees
- Yield of a Parse Tree
- Generalization of Parse Trees
- Trees, leftmost, and rightmost derivations correspond
- Ambiguous Grammars
- Equivalent definitions of ambiguous grammar
- LL(1) Grammars
- Inherent Ambiguity
- Normal Forms for CFGs
- Eliminate Variables That Derive Nothing
- Eliminating Useless Symbols
- Cleaning Up a Grammar
- Chomsky Normal Form
Regular Expressions
Operations on Languages
REs use three operations: union, concatenation, and Kleene star.
Equivalence of REs and Finite Automata
Converting an RE to an ε-NFA
DFA-to-RE
k-paths
A k-path is a path through the graph of the DFA that goes through no state numbered higher than k.
Endpoints are not restricted; they can be any state.
n-paths are unrestricted, where n is the number of states.
The RE for the DFA's language is the union of the REs for the n-paths from the start state to each final state.
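The k-path construction can be sketched in Python as follows. This is only an illustration under stated assumptions: states are numbered 1..n, `delta` is a dict from (state, symbol) to state, and the result is emitted in Python `re` syntax rather than textbook notation. All function names (`alt`, `union`, `concat`, `star`, `dfa_to_regex`) are hypothetical. Each intermediate RE is kept as a pair (eps, body), where eps records whether ε belongs to the language and body is a pattern string or None for the empty language.

```python
import re
from itertools import product

EMPTY = (False, None)  # the empty language: no ε, no pattern

def alt(*parts):
    """Union of pattern strings; None stands for the empty language."""
    parts = sorted({p for p in parts if p is not None})
    if not parts:
        return None
    return parts[0] if len(parts) == 1 else "(?:" + "|".join(parts) + ")"

def union(r, s):
    return (r[0] or s[0], alt(r[1], s[1]))

def concat(r, s):
    (e1, b1), (e2, b2) = r, s
    both = b1 + b2 if b1 is not None and b2 is not None else None
    return (e1 and e2, alt(both, b1 if e2 else None, b2 if e1 else None))

def star(r):
    # The star of ∅ or of {ε} is {ε}; otherwise wrap the body in a starred group.
    return (True, None) if r[1] is None else (False, "(?:" + r[1] + ")*")

def dfa_to_regex(n, delta, start, finals):
    states = range(1, n + 1)
    # Basis (k = 0): single arcs from i to j, plus ε when i == j.
    R = {(i, j): (i == j, alt(*(re.escape(a) for (q, a), p in delta.items()
                                if q == i and p == j)))
         for i, j in product(states, repeat=2)}
    # Induction: a k-path either avoids state k or visits it one or more times.
    for k in states:
        R = {(i, j): union(R[i, j],
                           concat(concat(R[i, k], star(R[k, k])), R[k, j]))
             for i, j in product(states, repeat=2)}
    # Union of the n-path REs from the start state to each final state.
    eps, body = EMPTY
    for f in finals:
        eps, body = union((eps, body), R[start, f])
    if body is None:
        return "" if eps else "(?!)"  # "(?!)" matches nothing
    return "(?:" + body + ")?" if eps else body

# Hypothetical example: state 1 = even number of 1s (start, final), state 2 = odd.
even_ones = {(1, "0"): 1, (1, "1"): 2, (2, "0"): 2, (2, "1"): 1}
pattern = dfa_to_regex(2, even_ones, 1, {1})
```

The resulting pattern is correct but far from minimal; the construction makes no attempt to simplify, so the expression can grow quickly with the number of states.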
Algebraic Laws for RE’s
Union and concatenation behave sort of like addition and multiplication.
Exception: Concatenation is not commutative.
Decision Properties of Regular Languages
The Pumping Lemma
Decision Property: Equivalence
Product DFA
Let the DFAs for L and M have sets of states Q and R, respectively.
- Product DFA has set of states Q × R, i.e., pairs [q, r] with q in Q, r in R.
- Start state = [q0, r0] (the start states of the DFAs for L, M).
- Transitions: δ([q, r], a) = [δL(q, a), δM(r, a)], where δL, δM are the transition functions for the DFAs of L, M.
- That is, we simulate the two DFAs in the two state components of the product DFA.
- Make [q, r] final iff exactly one of q and r is final. Then L = M iff the product DFA's language is empty.
Decision Property: Containment
Product DFA
How do we define the final states [q, r] of the product so its language is empty iff L ⊆ M?
Answer: q is final; r is not.
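Both decision procedures reduce to an emptiness test on the product DFA, which the sketch below illustrates. It assumes each DFA is given as a triple (delta, start, finals) with a total transition dict over the same alphabet. The names `product_language_is_empty`, `equivalent`, `contained` and `counter_dfa` are made up for this example. Only the reachable pairs [q, r] are explored.

```python
def product_language_is_empty(dfa_l, dfa_m, is_final):
    """Explore the reachable pairs [q, r]; the product language is empty
    iff no reachable pair is final according to is_final."""
    (delta_l, start_l, finals_l), (delta_m, start_m, finals_m) = dfa_l, dfa_m
    alphabet = {a for (_, a) in delta_l}
    start = (start_l, start_m)
    seen, stack = {start}, [start]
    while stack:
        q, r = stack.pop()
        if is_final(q in finals_l, r in finals_m):
            return False
        for a in alphabet:
            # Simulate both DFAs in the two components of the pair.
            nxt = (delta_l[q, a], delta_m[r, a])
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return True

def equivalent(dfa_l, dfa_m):
    # Final iff exactly one component is final.
    return product_language_is_empty(dfa_l, dfa_m, lambda fq, fr: fq != fr)

def contained(dfa_l, dfa_m):
    # Final iff q is final and r is not.
    return product_language_is_empty(dfa_l, dfa_m, lambda fq, fr: fq and not fr)

def counter_dfa(modulus, finals):
    """Hypothetical helper: count 1s modulo `modulus`, ignoring 0s."""
    delta = {}
    for i in range(modulus):
        delta[i, "0"] = i
        delta[i, "1"] = (i + 1) % modulus
    return delta, 0, set(finals)

even = counter_dfa(2, {0})      # even number of 1s
mult4 = counter_dfa(4, {0})     # number of 1s divisible by 4
even4 = counter_dfa(4, {0, 2})  # even number of 1s, with redundant states
```

Here `contained(mult4, even)` holds because every multiple of 4 is even, while `equivalent(even, even4)` holds even though the two DFAs have different numbers of states.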
State Minimization
Algorithm is a recursion on the length of the shortest distinguishing string.
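One common way to realize this recursion is the table-filling procedure sketched below. It is an illustration, not necessarily the exact formulation intended by the notes, and the function name and DFA encoding are assumptions. The basis marks pairs distinguished by ε (one state final, the other not). The induction marks {p, q} whenever some symbol a leads to an already marked pair.

```python
from itertools import combinations

def distinguishable_pairs(states, alphabet, delta, finals):
    # Basis: ε distinguishes a final state from a non-final one.
    marked = {frozenset((p, q)) for p, q in combinations(states, 2)
              if (p in finals) != (q in finals)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            # Induction: if a leads to a distinguishable pair, then aw
            # distinguishes p from q for the shorter string w.
            if any(frozenset((delta[p, a], delta[q, a])) in marked
                   for a in alphabet):
                marked.add(pair)
                changed = True
    return marked

# Hypothetical example: count 1s modulo 4, accept when the count is even.
states = [0, 1, 2, 3]
delta = {}
for i in states:
    delta[i, "0"] = i
    delta[i, "1"] = (i + 1) % 4
marked = distinguishable_pairs(states, "01", delta, {0, 2})
```

The unmarked pairs ({0, 2} and {1, 3} in this example) are the states that can be merged when constructing the minimum-state DFA.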
Constructing the Minimum-State DFA
Eliminating Unreachable States
Proof: No Unrelated, Smaller DFA
Closure Properties of Regular Languages
- Union
- Intersection
- Difference
- Concatenation
- Kleene Closure
- Reversal
- Homomorphism
- Inverse Homomorphism
TODO:
Context-Free Grammars
CFG Formalism
Terminals = symbols of the alphabet of the language being defined.
Variables = nonterminals = a finite set of other symbols, each of which represents a language.
Start symbol = the variable whose language is the one being defined.
A production has the form variable (head) -> string of variables and terminals (body).
Iterated Derivation
=>* means “zero or more derivation steps.”
Any string of variables and/or terminals derived from the start symbol is called a sentential form.
Formally, α is a sentential form iff S =>* α.
If G is a CFG, then L(G), the language of G, is {w | S =>* w}.
A language that is defined by some CFG is called a context-free language.
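To make the definition L(G) = {w | S =>* w} concrete, the following sketch enumerates the short strings of a small grammar by exploring sentential forms. The grammar S -> 0S1 | 01 is a hypothetical example that does not appear in these notes. The dict encoding and the function name are also assumptions. The sketch assumes single-character symbols and no ε-productions, so a sentential form never shrinks and forms longer than the bound can be pruned.

```python
from collections import deque

def language_up_to(productions, start, max_len):
    """Return {w | start =>* w and len(w) <= max_len} by breadth-first
    search over sentential forms, always expanding the leftmost variable."""
    words = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        idx = next((i for i, s in enumerate(form) if s in productions), None)
        if idx is None:
            words.add("".join(form))  # no variables left: a terminal string
            continue
        for body in productions[form[idx]]:
            new = form[:idx] + tuple(body) + form[idx + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words

grammar = {"S": ["0S1", "01"]}
short_words = language_up_to(grammar, "S", 6)
```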
BNF Notation
Variables are words in <…>;
Terminals are often multicharacter strings indicated by boldface or underline;
Symbol ::= is often used for ->.
Symbol | is used for “or.”
Symbol … is used for “one or more.”
Surround one or more symbols by […] to make them optional.
Use {…} to surround a sequence of symbols that need to be treated as a unit.
Leftmost and Rightmost Derivations
Derivations allow us to replace any of the variables in a string.
Leads to many different derivations of the same string.
By forcing the leftmost variable (or alternatively, the rightmost variable) to be replaced, we avoid these “distinctions without a difference.”
Leftmost Derivations
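Say wAα =>lm wβα if w is a string of terminals only and A -> β is a production.
Also, α =>*lm β if α becomes β by a sequence of zero or more =>lm steps.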
Rightmost Derivations
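Say αAw =>rm αβw if w is a string of terminals only and A -> β is a production.
Also, α =>*rm β if α becomes β by a sequence of zero or more =>rm steps.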
Parse Trees
Parse trees are trees labeled by symbols of a particular CFG.
Leaves: labeled by a terminal or ε.
Interior nodes: labeled by a variable.
Children are labeled by the body of a production for the parent.
Root: must be labeled by the start symbol.
Yield of a Parse Tree
The concatenation of the labels of the leaves in left-to-right order (that is, in the order of a preorder traversal) is called the yield of the parse tree.
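A minimal sketch of computing the yield, assuming a tree is encoded as a pair (label, children) and that an ε leaf carries the label "ε". The encoding, the function name and the example tree (for the hypothetical grammar S -> 0S1 | 01) are illustrative choices only.

```python
def tree_yield(tree):
    """Concatenate the leaf labels from left to right."""
    label, children = tree
    if not children:
        return "" if label == "ε" else label
    return "".join(tree_yield(child) for child in children)

# Parse tree for 0011: the root uses S -> 0S1, the inner S uses S -> 01.
inner = ("S", [("0", []), ("1", [])])
tree = ("S", [("0", []), inner, ("1", [])])
```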
Generalization of Parse Trees
We sometimes talk about trees that are not exactly parse trees, but only because the root is labeled by some variable A that is not the start symbol.
Call these parse trees with root A.
Trees, leftmost, and rightmost derivations correspond
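If there is a parse tree with root labeled A and yield w, then A =>*lm w (a leftmost derivation of w from A).
If A =>*lm w, then there is a parse tree with root A and yield w.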
Ambiguous Grammars
A CFG is ambiguous if there is a string in the language that is the yield of two or more parse trees.
If there are two different parse trees, they must produce two different leftmost derivations by the construction given in the proof.
Conversely, two different leftmost derivations produce different parse trees by the other part of the proof.
Likewise for rightmost derivations.
Equivalent definitions of “ambiguous grammar”
- There is a string in the language that has two different leftmost derivations.
- There is a string in the language that has two different rightmost derivations.
Ambiguity is a Property of Grammars, not Languages.
LL(1) Grammars
As an aside, a grammar such as B -> (RB | ε, R -> ) | (RR, where you can always figure out the production to use in a leftmost derivation by scanning the given string left-to-right and looking only at the next one symbol, is called LL(1).
Most programming languages have LL(1) grammars.
LL(1) grammars are never ambiguous.
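For the grammar B -> (RB | ε, R -> ) | (RR, the one-symbol lookahead rule can be sketched as a small stack-based predictive parser. The function name and the choice to return the list of productions used are assumptions made for illustration. With next symbol "(" both B and R must use their production starting with "(". With ")" only R -> ) applies. Otherwise B can only use B -> ε.

```python
def ll1_parse(w):
    """Return the productions of the leftmost derivation of w from B,
    or None if w is not in the language."""
    stack, pos, used = ["B"], 0, []
    while stack:
        top = stack.pop()
        nxt = w[pos] if pos < len(w) else None  # one symbol of lookahead
        if top == "B":
            if nxt == "(":
                used.append("B -> (RB")
                stack.extend(reversed("(RB"))  # leftmost symbol ends on top
            else:
                used.append("B -> ε")
        elif top == "R":
            if nxt == ")":
                used.append("R -> )")
                stack.append(")")
            elif nxt == "(":
                used.append("R -> (RR")
                stack.extend(reversed("(RR"))
            else:
                return None
        else:
            # A terminal on the stack must match the next input symbol.
            if nxt != top:
                return None
            pos += 1
    return used if pos == len(w) else None
```

Because the lookahead symbol determines the production at every step, each accepted string has exactly one leftmost derivation, which matches the claim that LL(1) grammars are unambiguous.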
Inherent Ambiguity
A CFL is inherently ambiguous if every grammar for the language is ambiguous.
Normal Forms for CFGs
Eliminate Variables That Derive Nothing
Variables That Derive Nothing
- Discover all variables that derive terminal strings.
- For all other variables, remove all productions in which they appear in either the head or body.
Consider: S -> AB, A -> aA | a, B -> AB
Although A derives all strings of a's, B derives no terminal strings.
Why? The only production for B leaves a B in the sentential form.
Thus, S derives nothing, and the language is empty.
Example: S -> AB | C, A -> aA | a, B -> bB, C -> c
Basis: A and C are discovered because of A -> a and C -> c.
Induction: S is discovered because of S -> C.
Nothing else can be discovered.
Result: S -> C, A -> aA | a, C -> c
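The discovery procedure is a fixed-point computation, sketched below on the same example. The grammar encoding (a dict from each variable to a list of bodies, with single-character symbols) and the function names are assumptions. Every variable is assumed to appear as a key, so any other symbol is treated as a terminal.

```python
def generating_variables(productions):
    """Variables that derive some terminal string."""
    variables = set(productions)
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            # A head is discovered once some body consists only of
            # terminals and already discovered variables.
            if any(all(s not in variables or s in generating for s in body)
                   for body in bodies):
                generating.add(head)
                changed = True
    return generating

def remove_nongenerating(productions):
    variables = set(productions)
    generating = generating_variables(productions)
    return {head: [body for body in bodies
                   if all(s not in variables or s in generating for s in body)]
            for head, bodies in productions.items() if head in generating}

grammar = {"S": ["AB", "C"], "A": ["aA", "a"], "B": ["bB"], "C": ["c"]}
cleaned = remove_nongenerating(grammar)
```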
Eliminating Useless Symbols
Unreachable Symbols
Another way a terminal or variable deserves to be eliminated is if it cannot appear in any derivation from the start symbol.
- Eliminate symbols that derive no terminal string.
- Eliminate unreachable symbols.
Remove from the grammar all symbols not discovered reachable from S and all productions that involve these symbols.
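Reachability is a plain graph search from the start symbol, as in the sketch below. It reuses the hypothetical dict encoding with single-character symbols, and the function names are illustrative. The example grammar S -> C, A -> aA | a, C -> c is the result of the previous elimination step.

```python
def reachable_symbols(productions, start):
    """Symbols (variables and terminals) that appear in some derivation from start."""
    reached, stack = {start}, [start]
    while stack:
        head = stack.pop()
        for body in productions.get(head, []):  # terminals have no productions
            for symbol in body:
                if symbol not in reached:
                    reached.add(symbol)
                    stack.append(symbol)
    return reached

def remove_unreachable(productions, start):
    reached = reachable_symbols(productions, start)
    return {head: bodies for head, bodies in productions.items()
            if head in reached}

grammar = {"S": ["C"], "A": ["aA", "a"], "C": ["c"]}
trimmed = remove_unreachable(grammar, "S")
```

In this example A derives terminal strings but can never appear in a derivation from S, so its productions are dropped.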
Epsilon Productions
Theorem: If L is a CFL, then L-{ε} has a CFG with no ε-productions.
Nullable Symbols
nullable symbols = variables A such that A =>* ε.
Example: S -> AB, A -> aA | ε, B -> bB | A
Basis: A is nullable because of A -> ε.
Induction: B is nullable because of B -> A.
Then, S is nullable because of S -> AB.
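The basis and induction above amount to another fixed-point computation. In this sketch an ε body is encoded as the empty string, which is an assumption of the encoding. The basis case then falls out of `all()` being true on an empty body.

```python
def nullable_variables(productions):
    """Variables A such that A =>* ε."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            # A body qualifies when every symbol in it is already nullable;
            # the empty body (ε) qualifies immediately.
            if any(all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable

grammar = {"S": ["AB"], "A": ["aA", ""], "B": ["bB", "A"]}
```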
Unit Productions
A unit production is one whose body consists of exactly one variable.
These productions can be eliminated.
Key idea:
If A =>* B by a series of unit productions,
and B -> α is a non-unit production, then add production A -> α.
Then, drop all unit productions.
Find all pairs (A, B) such that A =>* B by a sequence of unit productions only.
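A sketch of both steps, finding the unit pairs and then rebuilding the productions. The expression grammar E -> E+T | T, T -> T*F | F, F -> (E) | a is a hypothetical example not taken from these notes. The encoding assumes single-character symbols, and the function names are illustrative.

```python
def unit_pairs(productions):
    """All (A, B) with A =>* B using unit productions only."""
    variables = set(productions)
    pairs = {(a, a) for a in variables}  # zero steps: A =>* A
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for body in productions[b]:
                is_unit = len(body) == 1 and body in variables
                if is_unit and (a, body) not in pairs:
                    pairs.add((a, body))
                    changed = True
    return pairs

def eliminate_unit_productions(productions):
    variables = set(productions)
    result = {head: [] for head in productions}
    for a, b in sorted(unit_pairs(productions)):
        for body in productions[b]:
            is_unit = len(body) == 1 and body in variables
            # For each pair (A, B) and non-unit B -> α, add A -> α.
            if not is_unit and body not in result[a]:
                result[a].append(body)
    return result

grammar = {"E": ["E+T", "T"], "T": ["T*F", "F"], "F": ["(E)", "a"]}
no_units = eliminate_unit_productions(grammar)
```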
TODO: Proof That We Find Exactly the Right Pairs
TODO: Proof That the Unit-Production-Elimination Algorithm Works
Cleaning Up a Grammar
Theorem: if L is a CFL, then there is a CFG for L – {ε} that has:
- No useless symbols.
- No ε-productions.
- No unit productions.
i.e., every body is either a single terminal or has length ≥ 2.
Perform the following steps in order:
- Eliminate ε-productions.
- Eliminate unit productions.
- Eliminate variables that derive no terminal string.
- Eliminate variables not reached from the start symbol.
Chomsky Normal Form
A CFG is said to be in Chomsky Normal Form if every production is of one of these two forms:
A -> BC (body is two variables).
A -> a (body is a single terminal).
Theorem: If L is a CFL, then L – {ε} has a CFG in CNF.
Step 1: “Clean” the grammar, so every body is either a single terminal or of length at least 2.
Step 2: For each body that is not a single terminal, make the right side all variables.
- For each terminal a create new variable A_a and production A_a -> a.
- Replace a by A_a in bodies of length ≥ 2.
- Consider production A -> BcDe.
- We need variables A_c and A_e, with productions A_c -> c and A_e -> e.
- Note: you create at most one variable for each terminal, and use it everywhere it is needed.
- Replace A -> BcDe by A -> B A_c D A_e.
Step 3: Break right sides longer than 2 into a chain of productions with right sides of two variables.
- Example: A -> BCDE is replaced by A -> BF, F -> CG, and G -> DE.
- F and G must be used nowhere else.
- Recall A -> BCDE is replaced by A -> BF, F -> CG, and G -> DE.
- In the new grammar, A => BF => BCG => BCDE.
- More importantly: Once we choose to replace A by BF, we must continue to BCG and BCDE.
- Because F and G have only one production.
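Steps 2 and 3 can be sketched as follows, assuming the grammar has already been cleaned in Step 1, so every body is a single terminal or has length at least 2. Bodies are tuples of symbol names here because the new variables need multi-character names. The names `A_<terminal>` and `X<number>` are hypothetical and are assumed not to clash with existing variables.

```python
def to_cnf(productions):
    variables = set(productions)
    cnf = {}

    def add(head, body):
        bodies = cnf.setdefault(head, [])
        if body not in bodies:
            bodies.append(body)

    counter = 0
    for head, bodies in productions.items():
        for body in bodies:
            if len(body) == 1:  # A -> a is already in CNF
                add(head, tuple(body))
                continue
            # Step 2: replace each terminal a by A_a, with production A_a -> a.
            symbols = []
            for s in body:
                if s in variables:
                    symbols.append(s)
                else:
                    add("A_" + s, (s,))
                    symbols.append("A_" + s)
            # Step 3: break bodies longer than 2 into a chain of productions.
            current = head
            while len(symbols) > 2:
                counter += 1
                fresh = f"X{counter}"  # fresh variable, used nowhere else
                add(current, (symbols[0], fresh))
                current, symbols = fresh, symbols[1:]
            add(current, tuple(symbols))
    return cnf

# The production A -> BcDe from the text, plus hypothetical B -> b and D -> d.
grammar = {"A": [("B", "c", "D", "e")], "B": [("b",)], "D": [("d",)]}
cnf = to_cnf(grammar)
```

For A -> BcDe this yields A -> B X1, X1 -> A_c X2 and X2 -> D A_e, the same chain shape as A -> BF, F -> CG, G -> DE with X1 and X2 in the roles of F and G.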