# Path: Tools/c-analyzer/c_parser/parser/__init__.py
"""A simple non-validating parser for C99.12The functions and regex patterns here are not entirely suitable for3validating C syntax. Please rely on a proper compiler for that.4Instead our goal here is merely matching and extracting information from5valid C code.67Furthermore, the grammar rules for the C syntax (particularly as8described in the K&R book) actually describe a superset, of which the9full C language is a proper subset. Here are some of the extra10conditions that must be applied when parsing C code:1112* ...1314(see: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf)1516We have taken advantage of the elements of the C grammar that are used17only in a few limited contexts, mostly as delimiters. They allow us to18focus the regex patterns confidently. Here are the relevant tokens and19in which grammar rules they are used:2021separators:22* ";"23+ (decl) struct/union: at end of each member decl24+ (decl) declaration: at end of each (non-compound) decl25+ (stmt) expr stmt: at end of each stmt26+ (stmt) for: between exprs in "header"27+ (stmt) goto: at end28+ (stmt) continue: at end29+ (stmt) break: at end30+ (stmt) return: at end31* ","32+ (decl) struct/union: between member declators33+ (decl) param-list: between params34+ (decl) enum: between enumerators35+ (decl) initializer (compound): between initializers36+ (expr) postfix: between func call args37+ (expr) expression: between "assignment" exprs38* ":"39+ (decl) struct/union: in member declators40+ (stmt) label: between label and stmt41+ (stmt) case: between expression and stmt42+ (stmt) default: between "default" and stmt43* "="44+ (decl) declaration: between decl and initializer45+ (decl) enumerator: between identifier and "initializer"46+ (expr) assignment: between "var" and expr4748wrappers:49* "(...)"50+ (decl) declarator (func ptr): to wrap ptr/name51+ (decl) declarator (func ptr): around params52+ (decl) declarator: around sub-declarator (for readability)53+ (expr) postfix (func call): 
around args54+ (expr) primary: around sub-expr55+ (stmt) if: around condition56+ (stmt) switch: around source expr57+ (stmt) while: around condition58+ (stmt) do-while: around condition59+ (stmt) for: around "header"60* "{...}"61+ (decl) enum: around enumerators62+ (decl) func: around body63+ (stmt) compound: around stmts64* "[...]"65* (decl) declarator: for arrays66* (expr) postfix: array access6768other:69* "*"70+ (decl) declarator: for pointer types71+ (expr) unary: for pointer deref727374To simplify the regular expressions used here, we've takens some75shortcuts and made certain assumptions about the code we are parsing.76Some of these allow us to skip context-sensitive matching (e.g. braces)77or otherwise still match arbitrary C code unambiguously. However, in78some cases there are certain corner cases where the patterns are79ambiguous relative to arbitrary C code. However, they are still80unambiguous in the specific code we are parsing.8182Here are the cases where we've taken shortcuts or made assumptions:8384* there is no overlap syntactically between the local context (func85bodies) and the global context (other than variable decls), so we86do not need to worry about ambiguity due to the overlap:87+ the global context has no expressions or statements88+ the local context has no function definitions or type decls89* no "inline" type declarations (struct, union, enum) in function90parameters ~(including function pointers)~91* no "inline" type decls in function return types92* no superfluous parentheses in declarators93* var decls in for loops are always "simple" (e.g. no inline types)94* only inline struct/union/enum decls may be anonymous (without a name)95* no function pointers in function pointer parameters96* for loop "headers" do not have curly braces (e.g. 
compound init)97* syntactically, variable decls do not overlap with stmts/exprs, except98in the following case:99spam (*eggs) (...)100This could be either a function pointer variable named "eggs"101or a call to a function named "spam", which returns a function102pointer that gets called. The only differentiator is the103syntax used in the "..." part. It will be comma-separated104parameters for the former and comma-separated expressions for105the latter. Thus, if we expect such decls or calls then we must106parse the decl params.107"""108109"""110TODO:111* extract CPython-specific code112* drop include injection (or only add when needed)113* track position instead of slicing "text"114* Parser class instead of the _iter_source() mess115* alt impl using a state machine (& tokenizer or split on delimiters)116"""117118from ..info import ParsedItem119from ._info import SourceInfo120121122def parse(srclines, **srckwargs):123if isinstance(srclines, str): # a filename124raise NotImplementedError125126anon_name = anonymous_names()127for result in _parse(srclines, anon_name, **srckwargs):128yield ParsedItem.from_raw(result)129130131# XXX Later: Add a separate function to deal with preprocessor directives132# parsed out of raw source.133134135def anonymous_names():136counter = 1137def anon_name(prefix='anon-'):138nonlocal counter139name = f'{prefix}{counter}'140counter += 1141return name142return anon_name143144145#############################146# internal impl147148import logging149150151_logger = logging.getLogger(__name__)152153154def _parse(srclines, anon_name, **srckwargs):155from ._global import parse_globals156157source = _iter_source(srclines, **srckwargs)158for result in parse_globals(source, anon_name):159# XXX Handle blocks here instead of in parse_globals().160yield result161162163# We use defaults that cover most files. 
Files with bigger declarations164# are covered elsewhere (MAX_SIZES in cpython/_parser.py).165166def _iter_source(lines, *, maxtext=10_000, maxlines=200, showtext=False):167maxtext = maxtext if maxtext and maxtext > 0 else None168maxlines = maxlines if maxlines and maxlines > 0 else None169filestack = []170allinfo = {}171# "lines" should be (fileinfo, data), as produced by the preprocessor code.172for fileinfo, line in lines:173if fileinfo.filename in filestack:174while fileinfo.filename != filestack[-1]:175filename = filestack.pop()176del allinfo[filename]177filename = fileinfo.filename178srcinfo = allinfo[filename]179else:180filename = fileinfo.filename181srcinfo = SourceInfo(filename)182filestack.append(filename)183allinfo[filename] = srcinfo184185_logger.debug(f'-> {line}')186srcinfo._add_line(line, fileinfo.lno)187if srcinfo.too_much(maxtext, maxlines):188break189while srcinfo._used():190yield srcinfo191if showtext:192_logger.debug(f'=> {srcinfo.text}')193else:194if not filestack:195srcinfo = SourceInfo('???')196else:197filename = filestack[-1]198srcinfo = allinfo[filename]199while srcinfo._used():200yield srcinfo201if showtext:202_logger.debug(f'=> {srcinfo.text}')203yield srcinfo204if showtext:205_logger.debug(f'=> {srcinfo.text}')206if not srcinfo._ready:207return208# At this point either the file ended prematurely209# or there's "too much" text.210filename, lno, text = srcinfo.filename, srcinfo._start, srcinfo.text211if len(text) > 500:212text = text[:500] + '...'213raise Exception(f'unmatched text ({filename} starting at line {lno}):\n{text}')214215216
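
# A minimal usage sketch (not part of the module) of the anonymous_names()
# closure defined above. The function body is copied as-is so the snippet
# runs on its own; the names "first", "second", "names" and "restarted" are
# illustrative assumptions. It shows that each factory call gets its own
# counter and that the prefix changes only the label, not the count.

```python
def anonymous_names():
    # Copied from the module above: returns a closure with a private counter.
    counter = 1
    def anon_name(prefix='anon-'):
        nonlocal counter
        name = f'{prefix}{counter}'
        counter += 1
        return name
    return anon_name


# One generator, three calls: the counter advances on every call,
# regardless of which prefix is requested.
first = anonymous_names()
names = [first(), first(), first('struct-')]

# A second generator starts counting from 1 again.
second = anonymous_names()
restarted = second()

print(names)      # ['anon-1', 'anon-2', 'struct-3']
print(restarted)  # anon-1
```

# Since parse() calls anonymous_names() once per invocation, the generated
# names are numbered per parse() call rather than globally.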