Studying note of GCC-3.4.6 source (77)


5.6. Prepare the parser

Now the compiler is about to parse the source file, and the parser's input is the stream of C++ tokens. The component that fetches these tokens is called the lexer. It is worth noting that GCC has no separate preprocessing pass: functions such as cpp_get_token return already-preprocessed tokens directly, and are therefore themselves an important part of the lexer. As the first step of parsing, c_parse_file prepares the parser and its associated lexer.

 

15112 void
15113 c_parse_file (void)                                                        in parser.c
15114 {
15115   bool error_occurred;
15116
15117   the_parser = cp_parser_new ();
15118   push_deferring_access_checks (flag_access_control
15119                             ? dk_no_deferred : dk_no_check);
15120   error_occurred = cp_parser_translation_unit (the_parser);
15121   the_parser = NULL;
15122 }

 

cp_parser, the data structure representing the C++ parser, is defined below. Data of this type is managed by the GCC garbage collector, and a new parser is created for every translation unit (a small sketch of this GC convention follows the definition).

 

1170 typedef struct cp_parser GTY(())                                           in parser.c
1171 {
1172   /* The lexer from which we are obtaining tokens.  */
1173   cp_lexer *lexer;
1174
1175   /* The scope in which names should be looked up. If NULL_TREE, then
1176     we look up names in the scope that is currently open in the
1177     source program. If non-NULL, this is either a TYPE or
1178     NAMESPACE_DECL for the scope in which we should look.
1179
1180     This value is not cleared automatically after a name is looked
1181     up, so we must be careful to clear it before starting a new look
1182     up sequence. (If it is not cleared, then `X::Y' followed by `Z'
1183     will look up `Z' in the scope of `X', rather than the current
1184     scope.) Unfortunately, it is difficult to tell when name lookup
1185     is complete, because we sometimes peek at a token, look it up,
1186     and then decide not to consume it.  */
1187   tree scope;
1188
1189   /* OBJECT_SCOPE and QUALIFYING_SCOPE give the scopes in which the
1190     last lookup took place. OBJECT_SCOPE is used if an expression
1191     like "x->y" or "x.y" was used; it gives the type of "*x" or "x",
1192     respectively. QUALIFYING_SCOPE is used for an expression of the
1193     form "X::Y"; it refers to X.  */
1194   tree object_scope;
1195   tree qualifying_scope;
1196
1197   /* A stack of parsing contexts. All but the bottom entry on the
1198     stack will be tentative contexts.
1199
1200     We parse tentatively in order to determine which construct is in
1201     use in some situations. For example, in order to determine
1202     whether a statement is an expression-statement or a
1203     declaration-statement we parse it tentatively as a
1204     declaration-statement. If that fails, we then reparse the same
1205     token stream as an expression-statement.  */
1206   cp_parser_context *context;
1207
1208   /* True if we are parsing GNU C++. If this flag is not set, then
1209     GNU extensions are not recognized.  */
1210   bool allow_gnu_extensions_p;
1211
1212   /* TRUE if the `>' token should be interpreted as the greater-than
1213     operator. FALSE if it is the end of a template-id or
1214     template-parameter-list.  */
1215   bool greater_than_is_operator_p;
1216
1217   /* TRUE if default arguments are allowed within a parameter list
1218     that starts at this point. FALSE if only a gnu extension makes
1219     them permissible.  */
1220   bool default_arg_ok_p;
1221
1222   /* TRUE if we are parsing an integral constant-expression. See
1223     [expr.const] for a precise definition.  */
1224   bool integral_constant_expression_p;
1225
1226   /* TRUE if we are parsing an integral constant-expression -- but a
1227     non-constant expression should be permitted as well. This flag
1228     is used when parsing an array bound so that GNU variable-length
1229     arrays are tolerated.  */
1230   bool allow_non_integral_constant_expression_p;
1231
1232   /* TRUE if ALLOW_NON_CONSTANT_EXPRESSION_P is TRUE and something has
1233     been seen that makes the expression non-constant.  */
1234   bool non_integral_constant_expression_p;
1235
1236   /* TRUE if we are parsing the argument to "__offsetof__".  */
1237   bool in_offsetof_p;
1238
1239   /* TRUE if local variable names and `this' are forbidden in the
1240     current context.  */
1241   bool local_variables_forbidden_p;
1242
1243   /* TRUE if the declaration we are parsing is part of a
1244     linkage-specification of the form `extern string-literal
1245     declaration'.  */
1246   bool in_unbraced_linkage_specification_p;
1247
1248   /* TRUE if we are presently parsing a declarator, after the
1249     direct-declarator.  */
1250   bool in_declarator_p;
1251
1252   /* TRUE if we are presently parsing a template-argument-list.  */
1253   bool in_template_argument_list_p;
1254
1255   /* TRUE if we are presently parsing the body of an
1256     iteration-statement.  */
1257   bool in_iteration_statement_p;
1258
1259   /* TRUE if we are presently parsing the body of a switch
1260     statement.  */
1261   bool in_switch_statement_p;
1262
1263   /* TRUE if we are parsing a type-id in an expression context. In
1264     such a situation, both "type (expr)" and "type (type)" are valid
1265     alternatives.  */
1266   bool in_type_id_in_expr_p;
1267
1268   /* If non-NULL, then we are parsing a construct where new type
1269     definitions are not permitted. The string stored here will be
1270     issued as an error message if a type is defined.  */
1271   const char *type_definition_forbidden_message;
1272
1273   /* A list of lists. The outer list is a stack, used for member
1274     functions of local classes. At each level there are two sub-list,
1275     one on TREE_VALUE and one on TREE_PURPOSE. Each of those
1276     sub-lists has a FUNCTION_DECL or TEMPLATE_DECL on their
1277     TREE_VALUE's. The functions are chained in reverse declaration
1278     order.
1279
1280     The TREE_PURPOSE sublist contains those functions with default
1281     arguments that need post processing, and the TREE_VALUE sublist
1282     contains those functions with definitions that need post
1283     processing.
1284
1285     These lists can only be processed once the outermost class being
1286     defined is complete.  */
1287   tree unparsed_functions_queues;
1288
1289   /* The number of classes whose definitions are currently in
1290     progress.  */
1291   unsigned num_classes_being_defined;
1292
1293   /* The number of template parameter lists that apply directly to the
1294     current declaration.  */
1295   unsigned num_template_parameter_lists;
1296 } cp_parser;
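
Since cp_parser is GTY-marked, instances live in the garbage-collected heap and are kept alive through a GC root; in parser.c that root is the static variable the_parser assigned in c_parse_file above, and setting it back to NULL at the end of c_parse_file is what lets the whole parser (with its lexer and token buffer) be reclaimed at a later collection. Below is a minimal stand-alone sketch of that root convention only; the toy_* names are invented, and the real mechanics (gengtype-generated walkers, ggc_alloc_cleared and friends) are far richer.

/* Conceptual sketch only: how a GTY(()) root such as `the_parser' keeps a
   garbage-collected object alive.  The toy collector below just scans a
   registered root table; none of these names exist in GCC.  */
#include <stdio.h>
#include <stdlib.h>

struct toy_parser { int dummy; };          /* stands in for cp_parser        */

static struct toy_parser *the_parser;      /* the GC root (a GTY(()) static) */

/* Root table: the real collector gets this from gengtype-generated code.  */
static void **roots[] = { (void **) &the_parser };

static void toy_collect (void)
{
  size_t i, live = 0;
  for (i = 0; i < sizeof roots / sizeof roots[0]; i++)
    if (*roots[i] != NULL)
      live++;                              /* still reachable through a root */
  printf ("collection: %zu object(s) still reachable\n", live);
}

int main (void)
{
  /* c_parse_file: the_parser = cp_parser_new ();  */
  struct toy_parser *p = calloc (1, sizeof *p);
  the_parser = p;
  toy_collect ();                          /* parser survives collections    */

  /* ... cp_parser_translation_unit (the_parser) ... */

  the_parser = NULL;                       /* end of c_parse_file            */
  toy_collect ();                          /* nothing keeps it alive now     */
  free (p);
  return 0;
}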

 

cp_parser_new, the function used to create cp_parser instances, is defined as follows:

 

2230 static cp_parser *
2231 cp_parser_new (void)                                                       in parser.c
2232 {
2233   cp_parser *parser;
2234   cp_lexer *lexer;
2235
2236   /* cp_lexer_new_main is called before calling ggc_alloc because
2237     cp_lexer_new_main might load a PCH file.  */
2238   lexer = cp_lexer_new_main ();

 

The lexer created by cp_lexer_new_main has the following definition; it, too, is a GC-controlled type. Note that most of the token pointers in it are marked GTY ((skip (""))) — they point into buffer, which the collector already walks via its length option — while the next slot at line 212 is an ordinary GC-managed pointer. This next slot links lexers into a list, which reveals that, unlike the parser, which is unique within a translation unit, additional lexers will be created temporarily. (A sketch of the token-buffer bookkeeping follows the definition.)

 

166  typedef struct cp_lexer GTY (())                                           in parser.c
167  {
168    /* The memory allocated for the buffer.  Never NULL.  */
169    cp_token * GTY ((length ("(%h.buffer_end - %h.buffer)"))) buffer;
170    /* A pointer just past the end of the memory allocated for the buffer.  */
171    cp_token * GTY ((skip (""))) buffer_end;
172    /* The first valid token in the buffer, or NULL if none.  */
173    cp_token * GTY ((skip (""))) first_token;
174    /* The next available token. If NEXT_TOKEN is NULL, then there are
175      no more available tokens.  */
176    cp_token * GTY ((skip (""))) next_token;
177    /* A pointer just past the last available token. If FIRST_TOKEN is
178      NULL, however, there are no available tokens, and then this
179      location is simply the place in which the next token read will be
180      placed. If LAST_TOKEN == FIRST_TOKEN, then the buffer is full.
181      When the LAST_TOKEN == BUFFER, then the last token is at the
182      highest memory address in the BUFFER.  */
183    cp_token * GTY ((skip (""))) last_token;
184
185    /* A stack indicating positions at which cp_lexer_save_tokens was
186      called. The top entry is the most recent position at which we
187      began saving tokens. The entries are differences in token
188      position between FIRST_TOKEN and the first saved token.
189
190      If the stack is non-empty, we are saving tokens. When a token is
191      consumed, the NEXT_TOKEN pointer will move, but the FIRST_TOKEN
192      pointer will not. The token stream will be preserved so that it
193      can be reexamined later.
194
195      If the stack is empty, then we are not saving tokens. Whenever a
196      token is consumed, the FIRST_TOKEN pointer will be moved, and the
197      consumed token will be gone forever.  */
198    varray_type saved_tokens;
199
200    /* The STRING_CST tokens encountered while processing the current
201      string literal.  */
202    varray_type string_tokens;
203
204    /* True if we should obtain more tokens from the preprocessor; false
205      if we are processing a saved token cache.  */
206    bool main_lexer_p;
207
208    /* True if we should output debugging information.  */
209    bool debugging_p;
210
211    /* The next lexer in a linked list of lexers.  */
212    struct cp_lexer *next;
213  } cp_lexer;
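
The comments above describe a circular buffer whose occupancy is tracked through first_token, next_token and last_token. Below is a stand-alone sketch of just the wrap-around arithmetic, with int standing in for cp_token; the helper names are invented, and growing a full buffer (which the real lexer does when last_token catches up with first_token) is omitted.

/* Conceptual sketch of the cp_lexer circular token buffer (stand-alone,
   with int standing in for cp_token; helper names are made up).  */
#include <assert.h>
#include <stdio.h>

#define BUF_SIZE 4            /* the real CP_TOKEN_BUFFER_SIZE is larger */

struct toy_lexer
{
  int buffer[BUF_SIZE];
  int *buffer_end;            /* one past the last slot                  */
  int *first_token;           /* oldest token still kept                 */
  int *next_token;            /* next token to hand to the parser        */
  int *last_token;            /* one past the newest token               */
};

/* Step a pointer forward, wrapping at the end of the buffer.  */
static int *toy_advance (struct toy_lexer *lex, int *p)
{
  ++p;
  return p == lex->buffer_end ? lex->buffer : p;
}

/* Append one token read from the preprocessor (no overflow handling).  */
static void toy_put (struct toy_lexer *lex, int tok)
{
  *lex->last_token = tok;
  lex->last_token = toy_advance (lex, lex->last_token);
}

/* Consume one token; when no tokens are being saved, FIRST_TOKEN moves
   along with NEXT_TOKEN and the consumed token is gone forever.  */
static int toy_consume (struct toy_lexer *lex)
{
  int tok = *lex->next_token;
  lex->next_token = toy_advance (lex, lex->next_token);
  lex->first_token = lex->next_token;
  return tok;
}

int main (void)
{
  struct toy_lexer lex;
  lex.buffer_end = lex.buffer + BUF_SIZE;
  lex.first_token = lex.next_token = lex.last_token = lex.buffer;

  toy_put (&lex, 10);
  toy_put (&lex, 20);
  assert (toy_consume (&lex) == 10);
  assert (toy_consume (&lex) == 20);
  printf ("wrap-around buffer works\n");
  return 0;
}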

 

In previous sections we have seen that tokens are represented by the type cpp_token; however, that type is designed for the preprocessor. After preprocessing, elements such as macros, assertions, and #include directives no longer exist, so cpp_token is no longer a good fit. It is replaced by cp_token, which represents preprocessed tokens.

 

69    typedef struct cp_token GTY (())                                          in parser.c
70    {
71      /* The kind of token.  */
72      ENUM_BITFIELD (cpp_ttype) type : 8;
73      /* If this token is a keyword, this value indicates which keyword.
74        Otherwise, this value is RID_MAX.  */
75      ENUM_BITFIELD (rid) keyword : 8;
76      /* Token flags.  */
77      unsigned char flags;
78      /* The value associated with this token, if any.  */
79      tree value;
80      /* The location at which this token was found.  */
81      location_t location;
82    } cp_token;

 

By comparison, the two definitions are quite similar: cp_token essentially replaces the preprocessor-specific value union with a tree value and adds the keyword classification.
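
One small detail in cp_token worth a note: ENUM_BITFIELD lets the enum-typed members be declared as 8-bit bitfields, keeping each token small (the main lexer buffers a lot of them). As far as I recall, GCC's macro expands to the enum type itself when compiled by GCC and to unsigned int for other compilers; the stand-alone snippet below only imitates that idea with invented names.

/* Sketch of the ENUM_BITFIELD idea (mirrors GCC's macro only roughly).  */
#include <stdio.h>

#if defined (__GNUC__)
#define ENUM_BITFIELD(TYPE) enum TYPE
#else
#define ENUM_BITFIELD(TYPE) unsigned int
#endif

enum toy_ttype { TOY_NAME, TOY_NUMBER, TOY_EOF };
enum toy_rid   { TOY_RID_IF, TOY_RID_ELSE, TOY_RID_MAX };

struct toy_token
{
  ENUM_BITFIELD (toy_ttype) type : 8;    /* packed into one byte each, */
  ENUM_BITFIELD (toy_rid) keyword : 8;   /* as in cp_token             */
  unsigned char flags;
};

int main (void)
{
  struct toy_token t = { TOY_NAME, TOY_RID_MAX, 0 };
  printf ("sizeof (struct toy_token) = %zu\n", sizeof t);
  printf ("type = %d, keyword = %d\n", (int) t.type, (int) t.keyword);
  return 0;
}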

5.6.1. Create main Lexer

Every translation unit has a main lexer to go with the parser. This main lexer is created by the function below.

 

301  static cp_lexer *
302  cp_lexer_new_main (void)                                                   in parser.c
303  {
304    cp_lexer *lexer;
305    cp_token first_token;
306
307    /* It's possible that lexing the first token will load a PCH file,
308      which is a GC collection point. So we have to grab the first
309      token before allocating any memory.  */
310    cp_lexer_get_preprocessor_token (NULL, &first_token);
311    c_common_no_more_pch ();
312
313    /* Allocate the memory.  */
314    lexer = ggc_alloc_cleared (sizeof (cp_lexer));
315
316    /* Create the circular buffer.  */
317    lexer->buffer = ggc_calloc (CP_TOKEN_BUFFER_SIZE, sizeof (cp_token));
318    lexer->buffer_end = lexer->buffer + CP_TOKEN_BUFFER_SIZE;
319
320    /* There is one token in the buffer.  */
321    lexer->last_token = lexer->buffer + 1;
322    lexer->first_token = lexer->buffer;
323    lexer->next_token = lexer->buffer;
324    memcpy (lexer->buffer, &first_token, sizeof (cp_token));
325
326    /* This lexer obtains more tokens by calling c_lex.  */
327    lexer->main_lexer_p = true;
328
329    /* Create the SAVED_TOKENS stack.  */
330    VARRAY_INT_INIT (lexer->saved_tokens, CP_SAVED_TOKENS_SIZE, "saved_tokens");
331
332    /* Create the STRINGS array.  */
333    VARRAY_TREE_INIT (lexer->string_tokens, 32, "strings");
334
335    /* Assume we are not debugging.  */
336    lexer->debugging_p = false;
337
338    return lexer;
339  }

 

Up to this point we have read in the main input file and any header files named by -include options, but no token in the file has been parsed yet. So cp_lexer_get_preprocessor_token, called at line 310 above, lexes the first token. Under GCC's current implementation and requirements, a source file may use at most one precompiled header, and that precompiled header must be the first file included. Hence, if the current source file uses a precompiled header, lexing the first token causes it to be read in (recall the chain: the #include directive is seen, the do_include handler runs, it calls _cpp_stack_include, which in turn has c_common_read_pch read in the PCH file). And within ggc_pch_read, invoked by c_common_read_pch, a garbage collection is triggered if the host system uses paging for memory management.
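
At the user level, the constraint just described (at most one precompiled header, and it has to be the very first include) looks as follows. This is only an illustrative sketch: the file name common.h and the commands in the comment are invented, not taken from the text above.

/* Illustration of the PCH convention described above (file names invented).

   Step 1: precompile the header once, producing common.h.gch:
       g++ -x c++-header common.h

   Step 2: every source file that wants the PCH must include it first;
   the PCH is only considered when it is the first thing included.  */

#include "common.h"     /* first include: common.h.gch is loaded, if valid */
#include <stdio.h>      /* later includes are processed normally           */

int main (void)
{
  printf ("compiled against the precompiled header\n");
  return 0;
}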

 

580  static void
581  cp_lexer_get_preprocessor_token (cp_lexer *lexer ATTRIBUTE_UNUSED,        in parser.c
582                              cp_token *token)
583  {
584    bool done;
585
586    /* If this is not the main lexer, return a terminating CPP_EOF token.  */
587    if (lexer != NULL && !lexer->main_lexer_p)
588    {
589      token->type = CPP_EOF;
590      token->location.line = 0;
591      token->location.file = NULL;
592      token->value = NULL_TREE;
593      token->keyword = RID_MAX;
594
595      return;
596    }
597
598    done = false;
599    /* Keep going until we get a token we like.  */
600    while (!done)
601    {
602      /* Get a new token from the preprocessor.  */
603      token->type = c_lex_with_flags (&token->value, &token->flags);
604      /* Issue messages about tokens we cannot process.  */
605      switch (token->type)
606      {
607        case CPP_ATSIGN:
608        case CPP_HASH:
609        case CPP_PASTE:
610          error ("invalid token");
611          break;
612
613        default:
614          /* This is a good token, so we exit the loop.  */
615          done = true;
616          break;
617      }
618    }
619    /* Now we've got our token.  */
620    token->location = input_location;
621
622    /* Check to see if this token is a keyword.  */
623    if (token->type == CPP_NAME
624        && C_IS_RESERVED_WORD (token->value))
625    {
626      /* Mark this token as a keyword.  */
627      token->type = CPP_KEYWORD;
628      /* Record which keyword.  */
629      token->keyword = C_RID_CODE (token->value);
630      /* Update the value. Some keywords are mapped to particular
631        entities, rather than simply having the value of the
632        corresponding IDENTIFIER_NODE. For example, `__const' is
633        mapped to `const'.  */
634      token->value = ridpointers[token->keyword];
635    }
636    else
637      token->keyword = RID_MAX;
638  }

 

cp_lexer_get_preprocessor_token is the low-level routine that feeds the lexer with preprocessed tokens. Naturally, '#', '##', and '@' (line 607; '@' is used in Objective-C) are invalid tokens at this point. Further, a preprocessed token should be an identifier or a constant; C++ reserves certain identifiers as keywords, and they are recognized here (refer to the section Initialize reserved words for C++).
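
The keyword check at lines 623-629 works because the reserved words were tagged on their identifier nodes during front-end initialization (the ridpointers table and the C_IS_RESERVED_WORD / C_RID_CODE macros), so the lexer only performs a constant-time test rather than a string comparison. The stand-alone sketch below models that tagging step with a tiny table; every name in it is invented.

/* Conceptual sketch of reserved-word tagging (names invented).  */
#include <stdio.h>
#include <string.h>

enum toy_rid { TOY_RID_IF, TOY_RID_CLASS, TOY_RID_MAX };

/* Stand-in for an IDENTIFIER_NODE: interned once per spelling.  */
struct toy_ident
{
  const char *name;
  int is_reserved;          /* C_IS_RESERVED_WORD                 */
  enum toy_rid rid;         /* C_RID_CODE, TOY_RID_MAX otherwise  */
};

/* A tiny "identifier table"; the real one is the cpplib hash table.  */
static struct toy_ident idents[] = {
  { "if",    0, TOY_RID_MAX },
  { "class", 0, TOY_RID_MAX },
  { "x",     0, TOY_RID_MAX },
};

/* init_reswords-style pass: mark the identifiers that are keywords.  */
static void toy_init_reswords (void)
{
  static const struct { const char *name; enum toy_rid rid; } reswords[] = {
    { "if", TOY_RID_IF }, { "class", TOY_RID_CLASS },
  };
  size_t i, j;
  for (i = 0; i < sizeof reswords / sizeof reswords[0]; i++)
    for (j = 0; j < sizeof idents / sizeof idents[0]; j++)
      if (strcmp (idents[j].name, reswords[i].name) == 0)
        {
          idents[j].is_reserved = 1;
          idents[j].rid = reswords[i].rid;
        }
}

int main (void)
{
  size_t j;
  toy_init_reswords ();
  for (j = 0; j < sizeof idents / sizeof idents[0]; j++)
    printf ("%-5s -> %s\n", idents[j].name,
            idents[j].is_reserved ? "keyword" : "plain identifier");
  return 0;
}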

5.6.1.1. Get preprocessed token

5.6.1.1.1. Case of identifier

A preprocessed token is represented by cp_token. For numeric constants, the classification flags computed by cpp_classify_number (used at line 334 further below) are built from the following values.

 

619  #define CPP_N_CATEGORY   0x000F                                            in cpplib.h
620  #define CPP_N_INVALID    0x0000
621  #define CPP_N_INTEGER    0x0001
622  #define CPP_N_FLOATING   0x0002
623
624  #define CPP_N_WIDTH      0x00F0
625  #define CPP_N_SMALL      0x0010    /* int, float.  */
626  #define CPP_N_MEDIUM     0x0020    /* long, double.  */
627  #define CPP_N_LARGE      0x0040    /* long long, long double.  */
628
629  #define CPP_N_RADIX      0x0F00
630  #define CPP_N_DECIMAL    0x0100
631  #define CPP_N_HEX        0x0200
632  #define CPP_N_OCTAL      0x0400
633
634  #define CPP_N_UNSIGNED   0x1000    /* Properties.  */
635  #define CPP_N_IMAGINARY  0x2000
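
Before looking at how the front end consumes these flags, here is a small stand-alone sketch of how the masks are combined and tested; only the mask values come from cpplib.h above, while the describe_number helper is made up for illustration.

/* Sketch: combining and testing cpplib's number-classification flags.
   Mask values mirror cpplib.h; describe_number is invented.  */
#include <stdio.h>

#define CPP_N_CATEGORY  0x000F
#define CPP_N_INTEGER   0x0001
#define CPP_N_FLOATING  0x0002
#define CPP_N_WIDTH     0x00F0
#define CPP_N_SMALL     0x0010
#define CPP_N_RADIX     0x0F00
#define CPP_N_HEX       0x0200
#define CPP_N_UNSIGNED  0x1000

static void describe_number (unsigned int flags)
{
  switch (flags & CPP_N_CATEGORY)
    {
    case CPP_N_INTEGER:  printf ("integer");  break;
    case CPP_N_FLOATING: printf ("floating"); break;
    default:             printf ("invalid\n"); return;
    }
  if ((flags & CPP_N_WIDTH) == CPP_N_SMALL)
    printf (", small (int/float)");
  if ((flags & CPP_N_RADIX) == CPP_N_HEX)
    printf (", hexadecimal");
  if (flags & CPP_N_UNSIGNED)
    printf (", unsigned");
  printf ("\n");
}

int main (void)
{
  /* Roughly what cpp_classify_number reports for the literal 0x50u.  */
  describe_number (CPP_N_INTEGER | CPP_N_SMALL | CPP_N_HEX | CPP_N_UNSIGNED);
  return 0;
}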

 

Five groups of bits can be set in these flags. For example, for the literal 0x50u, flags would be CPP_N_INTEGER | CPP_N_SMALL | CPP_N_HEX | CPP_N_UNSIGNED. The type, value, and flags of the preprocessed token are obtained by c_lex_with_flags.

 

315  int
316  c_lex_with_flags (tree *value, unsigned char *cpp_flags)                   in c-lex.c
317  {
318    const cpp_token *tok;
319    location_t atloc;
320    static bool no_more_pch;
321
322  retry:
323    tok = get_nonpadding_token ();

 

The core of get_nonpadding_token is cpp_get_token. As we saw in previous sections, this function is where preprocessing takes place: macro definitions are digested into instances of cpp_macro, macro invocations are expanded via argument replacement (when expansion is expected), the other directives are covered by their various handlers, and the preprocessor operators are evaluated.

 

302  static inline const cpp_token *
303  get_nonpadding_token (void)                                                in c-lex.c
304  {
305    const cpp_token *tok;
306    timevar_push (TV_CPP);
307    do
308      tok = cpp_get_token (parse_in);
309    while (tok->type == CPP_PADDING);
310    timevar_pop (TV_CPP);
311
312    return tok;
313  }

 

Note that get_nonpadding_token still returns a cpp_token rather than a cp_token.

 

c_lex_with_flags (continue)

 

325  retry_after_at:
326    switch (tok->type)
327    {
328      case CPP_NAME:
329        *value = HT_IDENT_TO_GCC_IDENT (HT_NODE (tok->val.node));
330        break;
331
332      case CPP_NUMBER:
333      {
334        unsigned int flags = cpp_classify_number (parse_in, tok);
335
336        switch (flags & CPP_N_CATEGORY)
337        {
338          case CPP_N_INVALID:
339            /* cpplib has issued an error.  */
340            *value = error_mark_node;
341            break;
342
343          case CPP_N_INTEGER:
344            *value = interpret_integer (tok, flags);
345            break;
346
347          case CPP_N_FLOATING:
348            *value = interpret_float (tok, flags);
349            break;
350
351          default:
352            abort ();
353        }
354      }
355      break;

 

At line 328, a token of type CPP_NAME is an identifier, and HT_IDENT_TO_GCC_IDENT converts the corresponding hash node into the associated IDENTIFIER_NODE tree.
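
A simple macro suffices here because GCC's identifier tree nodes physically embed the preprocessor's hash-table entry (struct ht_identifier), so going from the hash node back to the tree node is just pointer arithmetic over the containing structure. The stand-alone sketch below shows that "recover the container from an embedded member" pattern; the structure layouts and names are invented for illustration and are not GCC's actual ones.

/* Sketch of the embedded-node trick behind HT_IDENT_TO_GCC_IDENT
   (layout and names invented; only the idea matches GCC).  */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct toy_ht_identifier        /* what the preprocessor's hash table stores */
{
  const char *str;
  unsigned int len;
};

struct toy_tree_identifier      /* what the front end wants: a "tree" node   */
{
  int tree_code;                        /* stand-in for struct tree_common   */
  struct toy_ht_identifier id;          /* the cpplib node embedded inside   */
};

/* Recover the tree node from a pointer to its embedded hash node.  */
#define TOY_HT_IDENT_TO_TREE(NODE) \
  ((struct toy_tree_identifier *) \
   ((char *) (NODE) - offsetof (struct toy_tree_identifier, id)))

int main (void)
{
  struct toy_tree_identifier ident = { 42, { "foo", 3 } };
  struct toy_ht_identifier *ht_node = &ident.id;   /* what cpplib hands back */

  struct toy_tree_identifier *back = TOY_HT_IDENT_TO_TREE (ht_node);
  assert (back == &ident);
  printf ("recovered tree node for \"%s\" (code %d)\n", back->id.str,
          back->tree_code);
  return 0;
}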

 
