Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/cpython
Path: blob/main/Tools/c-analyzer/c_parser/parser/__init__.py
12 views
1
"""A simple non-validating parser for C99.
2
3
The functions and regex patterns here are not entirely suitable for
4
validating C syntax. Please rely on a proper compiler for that.
5
Instead our goal here is merely matching and extracting information from
6
valid C code.
7
8
Furthermore, the grammar rules for the C syntax (particularly as
9
described in the K&R book) actually describe a superset, of which the
10
full C language is a proper subset. Here are some of the extra
11
conditions that must be applied when parsing C code:
12
13
* ...
14
15
(see: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf)
16
17
We have taken advantage of the elements of the C grammar that are used
18
only in a few limited contexts, mostly as delimiters. They allow us to
19
focus the regex patterns confidently. Here are the relevant tokens and
20
in which grammar rules they are used:
21
22
separators:
23
* ";"
24
+ (decl) struct/union: at end of each member decl
25
+ (decl) declaration: at end of each (non-compound) decl
26
+ (stmt) expr stmt: at end of each stmt
27
+ (stmt) for: between exprs in "header"
28
+ (stmt) goto: at end
29
+ (stmt) continue: at end
30
+ (stmt) break: at end
31
+ (stmt) return: at end
32
* ","
33
+ (decl) struct/union: between member declators
34
+ (decl) param-list: between params
35
+ (decl) enum: between enumerators
36
+ (decl) initializer (compound): between initializers
37
+ (expr) postfix: between func call args
38
+ (expr) expression: between "assignment" exprs
39
* ":"
40
+ (decl) struct/union: in member declators
41
+ (stmt) label: between label and stmt
42
+ (stmt) case: between expression and stmt
43
+ (stmt) default: between "default" and stmt
44
* "="
45
+ (decl) declaration: between decl and initializer
46
+ (decl) enumerator: between identifier and "initializer"
47
+ (expr) assignment: between "var" and expr
48
49
wrappers:
50
* "(...)"
51
+ (decl) declarator (func ptr): to wrap ptr/name
52
+ (decl) declarator (func ptr): around params
53
+ (decl) declarator: around sub-declarator (for readability)
54
+ (expr) postfix (func call): around args
55
+ (expr) primary: around sub-expr
56
+ (stmt) if: around condition
57
+ (stmt) switch: around source expr
58
+ (stmt) while: around condition
59
+ (stmt) do-while: around condition
60
+ (stmt) for: around "header"
61
* "{...}"
62
+ (decl) enum: around enumerators
63
+ (decl) func: around body
64
+ (stmt) compound: around stmts
65
* "[...]"
66
* (decl) declarator: for arrays
67
* (expr) postfix: array access
68
69
other:
70
* "*"
71
+ (decl) declarator: for pointer types
72
+ (expr) unary: for pointer deref
73
74
75
To simplify the regular expressions used here, we've takens some
76
shortcuts and made certain assumptions about the code we are parsing.
77
Some of these allow us to skip context-sensitive matching (e.g. braces)
78
or otherwise still match arbitrary C code unambiguously. However, in
79
some cases there are certain corner cases where the patterns are
80
ambiguous relative to arbitrary C code. However, they are still
81
unambiguous in the specific code we are parsing.
82
83
Here are the cases where we've taken shortcuts or made assumptions:
84
85
* there is no overlap syntactically between the local context (func
86
bodies) and the global context (other than variable decls), so we
87
do not need to worry about ambiguity due to the overlap:
88
+ the global context has no expressions or statements
89
+ the local context has no function definitions or type decls
90
* no "inline" type declarations (struct, union, enum) in function
91
parameters ~(including function pointers)~
92
* no "inline" type decls in function return types
93
* no superfluous parentheses in declarators
94
* var decls in for loops are always "simple" (e.g. no inline types)
95
* only inline struct/union/enum decls may be anonymous (without a name)
96
* no function pointers in function pointer parameters
97
* for loop "headers" do not have curly braces (e.g. compound init)
98
* syntactically, variable decls do not overlap with stmts/exprs, except
99
in the following case:
100
spam (*eggs) (...)
101
This could be either a function pointer variable named "eggs"
102
or a call to a function named "spam", which returns a function
103
pointer that gets called. The only differentiator is the
104
syntax used in the "..." part. It will be comma-separated
105
parameters for the former and comma-separated expressions for
106
the latter. Thus, if we expect such decls or calls then we must
107
parse the decl params.
108
"""
109
110
"""
111
TODO:
112
* extract CPython-specific code
113
* drop include injection (or only add when needed)
114
* track position instead of slicing "text"
115
* Parser class instead of the _iter_source() mess
116
* alt impl using a state machine (& tokenizer or split on delimiters)
117
"""
118
119
from ..info import ParsedItem
120
from ._info import SourceInfo
121
122
123
def parse(srclines, **srckwargs):
124
if isinstance(srclines, str): # a filename
125
raise NotImplementedError
126
127
anon_name = anonymous_names()
128
for result in _parse(srclines, anon_name, **srckwargs):
129
yield ParsedItem.from_raw(result)
130
131
132
# XXX Later: Add a separate function to deal with preprocessor directives
133
# parsed out of raw source.
134
135
136
def anonymous_names():
137
counter = 1
138
def anon_name(prefix='anon-'):
139
nonlocal counter
140
name = f'{prefix}{counter}'
141
counter += 1
142
return name
143
return anon_name
144
145
146
#############################
147
# internal impl
148
149
import logging
150
151
152
_logger = logging.getLogger(__name__)
153
154
155
def _parse(srclines, anon_name, **srckwargs):
156
from ._global import parse_globals
157
158
source = _iter_source(srclines, **srckwargs)
159
for result in parse_globals(source, anon_name):
160
# XXX Handle blocks here instead of in parse_globals().
161
yield result
162
163
164
# We use defaults that cover most files. Files with bigger declarations
165
# are covered elsewhere (MAX_SIZES in cpython/_parser.py).
166
167
def _iter_source(lines, *, maxtext=10_000, maxlines=200, showtext=False):
168
maxtext = maxtext if maxtext and maxtext > 0 else None
169
maxlines = maxlines if maxlines and maxlines > 0 else None
170
filestack = []
171
allinfo = {}
172
# "lines" should be (fileinfo, data), as produced by the preprocessor code.
173
for fileinfo, line in lines:
174
if fileinfo.filename in filestack:
175
while fileinfo.filename != filestack[-1]:
176
filename = filestack.pop()
177
del allinfo[filename]
178
filename = fileinfo.filename
179
srcinfo = allinfo[filename]
180
else:
181
filename = fileinfo.filename
182
srcinfo = SourceInfo(filename)
183
filestack.append(filename)
184
allinfo[filename] = srcinfo
185
186
_logger.debug(f'-> {line}')
187
srcinfo._add_line(line, fileinfo.lno)
188
if srcinfo.too_much(maxtext, maxlines):
189
break
190
while srcinfo._used():
191
yield srcinfo
192
if showtext:
193
_logger.debug(f'=> {srcinfo.text}')
194
else:
195
if not filestack:
196
srcinfo = SourceInfo('???')
197
else:
198
filename = filestack[-1]
199
srcinfo = allinfo[filename]
200
while srcinfo._used():
201
yield srcinfo
202
if showtext:
203
_logger.debug(f'=> {srcinfo.text}')
204
yield srcinfo
205
if showtext:
206
_logger.debug(f'=> {srcinfo.text}')
207
if not srcinfo._ready:
208
return
209
# At this point either the file ended prematurely
210
# or there's "too much" text.
211
filename, lno, text = srcinfo.filename, srcinfo._start, srcinfo.text
212
if len(text) > 500:
213
text = text[:500] + '...'
214
raise Exception(f'unmatched text ({filename} starting at line {lno}):\n{text}')
215
216