2021-03-25

CS143 Lectur...

note, cs143

CS143 Lecture03: Lexical Analysis

词法分析。

Goal: Partition input string into substring called tokens

Lexical analysis produces a stream of tokens, which is input to the parser.

这是编译的第1/4步，词法分析产生词元序列并作为下一阶段(parse，翻译)的输入。

Token

词元。

A syntactic category，也就是对句法的分类

In English:

noun, verb, adjective, …

In a programming language:

Identifier, Integer, Keyword, Whitespace, …

A token class corresponds to a set of strings.

Examples

Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: “else” or “if” or “begin” or …
Whitespace: a non-empty sequence of blanks, newlines, and tabs

Design

噢对，这有个缩写N.B.，原版是Nota Bene（拉丁语），也就是划重点的意思。另外lexeme中文翻译为词素。

Define a finite set of tokens.
Describe which strings belong to each token.

但是我们应该考虑到歧义。如下例子：

i/if；=/==；Foo<Bar>/cin >> var;/Foo<Bar<Bazz>>；

而要消除歧义显然就没办法在线处理了——我们需要“Lookahead”。

现在我们需要的，是描述词元的词素并找到一种消除歧义的方法。

Regular Languages

中文叫做正则语言或是正规语言。正则语言是描述词元的最常用的形式。

Language Def. : Let S be a set of characters. A language over S is a set of strings of characters drawn from S.

定义指出，字符串S上的语言就是一组S上的子串。$$L(x)$$表示“x的语言”。

所以，

语言是一组字符串。
需要一些记号来指定词法分析器所需的字符串组。
正则语言使用的标准记号叫做正则表达式。

Atomic Regular Expression

Single character

$$‘c’={“c”}$$
Epsilon

$$\epsilon = {“”}$$

Compound Regular Expression

Union 并

$$A+B={s|s \in A \ \ or\ \ s\in B}$$

也就是对$$A$$和$$B$$所表示的语言取并集。
Concatenation 串接

$$AB={ab|a\in A\ \ and\ \ b \in B}$$
Iteration 克林闭包

$$A^*=\cup_{i\ge 0}A^i$$ where $$A^i=A…i \ \ times\ \ …A$$

表示$$A$$中的字符串没有出现或是多次出现。

注意以上运算的优先级是$$闭包(最高)-串接-并(最低)$$。

Regular Expression Def. : The regular expressions over S are the smallest set of expressions including

$$\epsilon$$
$$‘c’$$ where $$c\in \sum$$
$$A+B$$ where $$A,B$$ are regex over $$\sum$$
$$AB$$
$$A^*$$ where $$A$$ is regex over $$\sum$$

Syntax vs. Semantics

To be careful, we should distinguish syntax and semantics.

$$L(\epsilon) = {“”}$$

$$L(‘c’)={“c”}$$

$$L(A+B)=L(A)\cup L(B)$$

$$L(AB)={ab|a\in L(A)\ \ and\ \ b \in L(B)}$$

$$L(A^*)=\cup_{i\ge 0}L(A^i)$$

Example

Keyword

Keyword: “else” or “if” or “begin” or …

$$‘else’+‘if’+‘begin’+…$$

Note: $$‘else’$$ abbreviates $$‘e’\ ‘l’\ ‘s’\ ‘e’$$

Integer

In order to describe integers, here is an abbreviation:

$$A^+=AA^$$
$$
⇒ digit=‘0’+‘1’+‘2’+‘3’+‘4’+‘5’+‘6’+‘7’+‘8’+‘9’\
⇒ integer=digit^+
$$
Identifier
$$
letter=‘A’+…+‘Z’+‘a’+…+‘z’\
identifier=letter(letter+digit)^
$$
Whitespace
$$
(‘\ ‘+’\backslash n’ + ‘\backslash t’)
$$

Summary

Regular languages are a language specification
- We still need an implementation
Given a string s and a rexp R, is
- $$s \in L®$$ ?