GitHub Repository: allendowney/cpython
Path: blob/main/Tools/i18n/pygettext.py
#! /usr/bin/env python3
# -*- coding: iso-8859-1 -*-
# Originally written by Barry Warsaw <[email protected]>
#
# Minimally patched to make it even more xgettext compatible
# by Peter Funk <[email protected]>
#
# 2002-11-22 Jürgen Hermann <[email protected]>
# Added checks that _() only contains string literals, and
#   command line args are resolved to module lists, i.e. you
#   can now pass a filename, a module or package name, or a
#   directory (including globbing chars, important for Win32).
# Made docstring fit in 80 chars wide displays using pydoc.
#

# for selftesting
try:
    import fintl
    _ = fintl.gettext
except ImportError:
    _ = lambda s: s

__doc__ = _("""pygettext -- Python equivalent of xgettext(1)

Many systems (Solaris, Linux, GNU) provide extensive tools that ease the
internationalization of C programs. Most of these tools are independent of
the programming language and can be used from within Python programs.
Martin von Loewis' work[1] helps considerably in this regard.

There's one problem though; xgettext is the program that scans source code
looking for message strings, but it groks only C (or C++). Python
introduces a few wrinkles, such as dual quoting characters, triple quoted
strings, and raw strings. xgettext understands none of this.

Enter pygettext, which uses Python's standard tokenize module to scan
Python source code, generating .pot files identical to what GNU xgettext[2]
generates for C and C++ code. From there, the standard GNU tools can be
used.

A word about marking Python strings as candidates for translation. GNU
xgettext recognizes the following keywords: gettext, dgettext, dcgettext,
and gettext_noop. But those can be a lot of text to include all over your
code. C and C++ have a trick: they use the C preprocessor. Most
internationalized C source includes a #define for gettext() to _() so that
what has to be written in the source is much less. Thus these are both
translatable strings:

    gettext("Translatable String")
    _("Translatable String")

Python of course has no preprocessor so this doesn't work so well. Thus,
pygettext searches only for _() by default, but see the -k/--keyword flag
below for how to augment this.

 [1] https://www.python.org/workshops/1997-10/proceedings/loewis.html
 [2] https://www.gnu.org/software/gettext/gettext.html

NOTE: pygettext attempts to be option and feature compatible with GNU
xgettext wherever possible. However, some options are still missing or are
not fully implemented. Also, xgettext's use of command line switches with
option arguments is broken, and in these cases, pygettext just defines
additional switches.

Usage: pygettext [options] inputfile ...

Options:

    -a
    --extract-all
        Extract all strings.

    -d name
    --default-domain=name
        Rename the default output file from messages.pot to name.pot.

    -E
    --escape
        Replace non-ASCII characters with octal escape sequences.

    -D
    --docstrings
        Extract module, class, method, and function docstrings. These do
        not need to be wrapped in _() markers, and in fact cannot be for
        Python to consider them docstrings. (See also the -X option).

    -h
    --help
        Print this help message and exit.

    -k word
    --keyword=word
        Keywords to look for in addition to the default set, which are:
        %(DEFAULTKEYWORDS)s

        You can have multiple -k flags on the command line.

    -K
    --no-default-keywords
        Disable the default set of keywords (see above). Any keywords
        explicitly added with the -k/--keyword option are still recognized.

    --no-location
        Do not write filename/lineno location comments.

    -n
    --add-location
        Write filename/lineno location comments indicating where each
        extracted string is found in the source. These lines appear before
        each msgid. The style of comments is controlled by the -S/--style
        option. This is the default.

    -o filename
    --output=filename
        Rename the default output file from messages.pot to filename. If
        filename is `-' then the output is sent to standard out.

    -p dir
    --output-dir=dir
        Output files will be placed in directory dir.

    -S stylename
    --style=stylename
        Specify which style to use for location comments. Two styles are
        supported:

        Solaris  # File: filename, line: line-number
        GNU      #: filename:line

        The style name is case insensitive. GNU style is the default.

    -v
    --verbose
        Print the names of the files being processed.

    -V
    --version
        Print the version of pygettext and exit.

    -w columns
    --width=columns
        Set width of output to columns.

    -x filename
    --exclude-file=filename
        Specify a file that contains a list of strings that are not to be
        extracted from the input files. Each string to be excluded must
        appear on a line by itself in the file.

    -X filename
    --no-docstrings=filename
        Specify a file that contains a list of files (one per line) that
        should not have their docstrings extracted. This is only useful in
        conjunction with the -D option above.

If `inputfile' is -, standard input is read.
""")

import os
import importlib.machinery
import importlib.util
import sys
import glob
import time
import getopt
import ast
import token
import tokenize

__version__ = '1.5'

default_keywords = ['_']
DEFAULTKEYWORDS = ', '.join(default_keywords)

EMPTYSTRING = ''


# The normal pot-file header. msgmerge and Emacs's po-mode work better if it's
# there.
pot_header = _('''\
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\\n"
"POT-Creation-Date: %(time)s\\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\\n"
"Language-Team: LANGUAGE <[email protected]>\\n"
"MIME-Version: 1.0\\n"
"Content-Type: text/plain; charset=%(charset)s\\n"
"Content-Transfer-Encoding: %(encoding)s\\n"
"Generated-By: pygettext.py %(version)s\\n"

''')


def usage(code, msg=''):
    print(__doc__ % globals(), file=sys.stderr)
    if msg:
        print(msg, file=sys.stderr)
    sys.exit(code)


def make_escapes(pass_nonascii):
    global escapes, escape
    if pass_nonascii:
        # Allow non-ascii characters to pass through so that e.g. 'msgid
        # "Höhe"' would not result in 'msgid "H\366he"'. Otherwise we
        # escape any character outside the 32..126 range.
        mod = 128
        escape = escape_ascii
    else:
        mod = 256
        escape = escape_nonascii
    escapes = [r"\%03o" % i for i in range(mod)]
    for i in range(32, 127):
        escapes[i] = chr(i)
    escapes[ord('\\')] = r'\\'
    escapes[ord('\t')] = r'\t'
    escapes[ord('\r')] = r'\r'
    escapes[ord('\n')] = r'\n'
    escapes[ord('\"')] = r'\"'


def escape_ascii(s, encoding):
    return ''.join(escapes[ord(c)] if ord(c) < 128 else c for c in s)

def escape_nonascii(s, encoding):
    return ''.join(escapes[b] for b in s.encode(encoding))


def is_literal_string(s):
    return s[0] in '\'"' or (s[0] in 'rRuU' and s[1] in '\'"')


def safe_eval(s):
    # unwrap quotes, safely
    return eval(s, {'__builtins__':{}}, {})


def normalize(s, encoding):
    # This converts the various Python string types into a format that is
    # appropriate for .po files, namely much closer to C style.
    lines = s.split('\n')
    if len(lines) == 1:
        s = '"' + escape(s, encoding) + '"'
    else:
        if not lines[-1]:
            del lines[-1]
            lines[-1] = lines[-1] + '\n'
        for i in range(len(lines)):
            lines[i] = escape(lines[i], encoding)
        lineterm = '\\n"\n"'
        s = '""\n"' + lineterm.join(lines) + '"'
    return s


def containsAny(str, set):
    """Check whether 'str' contains ANY of the chars in 'set'"""
    return 1 in [c in str for c in set]


def getFilesForName(name):
    """Get a list of module files for a filename, a module or package name,
    or a directory.
    """
    if not os.path.exists(name):
        # check for glob chars
        if containsAny(name, "*?[]"):
            files = glob.glob(name)
            list = []
            for file in files:
                list.extend(getFilesForName(file))
            return list

        # try to find module or package
        try:
            spec = importlib.util.find_spec(name)
            # find_spec() returns None when no module of that name is found
            name = spec.origin if spec is not None else None
        except ImportError:
            name = None
        if not name:
            return []

    if os.path.isdir(name):
        # find all python files in directory
        list = []
        # get extension for python source files
        _py_ext = importlib.machinery.SOURCE_SUFFIXES[0]
        for root, dirs, files in os.walk(name):
            # don't recurse into CVS directories
            if 'CVS' in dirs:
                dirs.remove('CVS')
            # add all *.py files to list
            list.extend(
                [os.path.join(root, file) for file in files
                 if os.path.splitext(file)[1] == _py_ext]
                )
        return list
    elif os.path.exists(name):
        # a single file
        return [name]

    return []


class TokenEater:
    def __init__(self, options):
        self.__options = options
        self.__messages = {}
        self.__state = self.__waiting
        self.__data = []
        self.__lineno = -1
        self.__freshmodule = 1
        self.__curfile = None
        self.__enclosurecount = 0

    def __call__(self, ttype, tstring, stup, etup, line):
        # dispatch
##        import token
##        print('ttype:', token.tok_name[ttype], 'tstring:', tstring,
##              file=sys.stderr)
        self.__state(ttype, tstring, stup[0])

    def __waiting(self, ttype, tstring, lineno):
        opts = self.__options
        # Do docstring extractions, if enabled
        if opts.docstrings and not opts.nodocstrings.get(self.__curfile):
            # module docstring?
            if self.__freshmodule:
                if ttype == tokenize.STRING and is_literal_string(tstring):
                    self.__addentry(safe_eval(tstring), lineno, isdocstring=1)
                    self.__freshmodule = 0
                    return
                if ttype in (tokenize.COMMENT, tokenize.NL, tokenize.ENCODING):
                    return
                self.__freshmodule = 0
            # class or func/method docstring?
            if ttype == tokenize.NAME and tstring in ('class', 'def'):
                self.__state = self.__suiteseen
                return
        if ttype == tokenize.NAME and tstring in opts.keywords:
            self.__state = self.__keywordseen
            return
        if ttype == tokenize.STRING:
            maybe_fstring = ast.parse(tstring, mode='eval').body
            if not isinstance(maybe_fstring, ast.JoinedStr):
                return
            for value in filter(lambda node: isinstance(node, ast.FormattedValue),
                                maybe_fstring.values):
                for call in filter(lambda node: isinstance(node, ast.Call),
                                   ast.walk(value)):
                    func = call.func
                    if isinstance(func, ast.Name):
                        func_name = func.id
                    elif isinstance(func, ast.Attribute):
                        func_name = func.attr
                    else:
                        continue

                    if func_name not in opts.keywords:
                        continue
                    if len(call.args) != 1:
                        print(_(
                            '*** %(file)s:%(lineno)s: Seen unexpected amount of'
                            ' positional arguments in gettext call: %(source_segment)s'
                            ) % {
                            'source_segment': ast.get_source_segment(tstring, call) or tstring,
                            'file': self.__curfile,
                            'lineno': lineno
                            }, file=sys.stderr)
                        continue
                    if call.keywords:
                        print(_(
                            '*** %(file)s:%(lineno)s: Seen unexpected keyword arguments'
                            ' in gettext call: %(source_segment)s'
                            ) % {
                            'source_segment': ast.get_source_segment(tstring, call) or tstring,
                            'file': self.__curfile,
                            'lineno': lineno
                            }, file=sys.stderr)
                        continue
                    arg = call.args[0]
                    if not isinstance(arg, ast.Constant):
                        print(_(
                            '*** %(file)s:%(lineno)s: Seen unexpected argument type'
                            ' in gettext call: %(source_segment)s'
                            ) % {
                            'source_segment': ast.get_source_segment(tstring, call) or tstring,
                            'file': self.__curfile,
                            'lineno': lineno
                            }, file=sys.stderr)
                        continue
                    if isinstance(arg.value, str):
                        self.__addentry(arg.value, lineno)

    def __suiteseen(self, ttype, tstring, lineno):
        # skip over any enclosure pairs until we see the colon
        if ttype == tokenize.OP:
            if tstring == ':' and self.__enclosurecount == 0:
                # we see a colon and we're not in an enclosure: end of def
                self.__state = self.__suitedocstring
            elif tstring in '([{':
                self.__enclosurecount += 1
            elif tstring in ')]}':
                self.__enclosurecount -= 1

    def __suitedocstring(self, ttype, tstring, lineno):
        # ignore any intervening noise
        if ttype == tokenize.STRING and is_literal_string(tstring):
            self.__addentry(safe_eval(tstring), lineno, isdocstring=1)
            self.__state = self.__waiting
        elif ttype not in (tokenize.NEWLINE, tokenize.INDENT,
                           tokenize.COMMENT):
            # there was no class docstring
            self.__state = self.__waiting

    def __keywordseen(self, ttype, tstring, lineno):
        if ttype == tokenize.OP and tstring == '(':
            self.__data = []
            self.__lineno = lineno
            self.__state = self.__openseen
        else:
            self.__state = self.__waiting

    def __openseen(self, ttype, tstring, lineno):
        if ttype == tokenize.OP and tstring == ')':
            # We've seen the last of the translatable strings. Record the
            # line number of the first line of the strings and update the list
            # of messages seen. Reset state for the next batch. If there
            # were no strings inside _(), then just ignore this entry.
            if self.__data:
                self.__addentry(EMPTYSTRING.join(self.__data))
            self.__state = self.__waiting
        elif ttype == tokenize.STRING and is_literal_string(tstring):
            self.__data.append(safe_eval(tstring))
        elif ttype not in [tokenize.COMMENT, token.INDENT, token.DEDENT,
                           token.NEWLINE, tokenize.NL]:
            # warn if we see anything else than STRING or whitespace
            print(_(
                '*** %(file)s:%(lineno)s: Seen unexpected token "%(token)s"'
                ) % {
                'token': tstring,
                'file': self.__curfile,
                'lineno': self.__lineno
                }, file=sys.stderr)
            self.__state = self.__waiting

    def __addentry(self, msg, lineno=None, isdocstring=0):
        if lineno is None:
            lineno = self.__lineno
        if msg not in self.__options.toexclude:
            entry = (self.__curfile, lineno)
            self.__messages.setdefault(msg, {})[entry] = isdocstring

    def set_filename(self, filename):
        self.__curfile = filename
        self.__freshmodule = 1

    def write(self, fp):
        options = self.__options
        timestamp = time.strftime('%Y-%m-%d %H:%M%z')
        encoding = fp.encoding if fp.encoding else 'UTF-8'
        print(pot_header % {'time': timestamp, 'version': __version__,
                            'charset': encoding,
                            'encoding': '8bit'}, file=fp)
        # Sort the entries. First sort each particular entry's keys, then
        # sort all the entries by their first item.
        reverse = {}
        for k, v in self.__messages.items():
            keys = sorted(v.keys())
            reverse.setdefault(tuple(keys), []).append((k, v))
        rkeys = sorted(reverse.keys())
        for rkey in rkeys:
            rentries = reverse[rkey]
            rentries.sort()
            for k, v in rentries:
                # If the entry was gleaned out of a docstring, then add a
                # comment stating so. This is to aid translators who may wish
                # to skip translating some unimportant docstrings.
                isdocstring = any(v.values())
                # k is the message string, v is a dictionary-set of (filename,
                # lineno) tuples. We want to sort the entries in v first by
                # file name and then by line number.
                v = sorted(v.keys())
                if not options.writelocations:
                    pass
                # location comments are different b/w Solaris and GNU:
                elif options.locationstyle == options.SOLARIS:
                    for filename, lineno in v:
                        d = {'filename': filename, 'lineno': lineno}
                        print(_(
                            '# File: %(filename)s, line: %(lineno)d') % d, file=fp)
                elif options.locationstyle == options.GNU:
                    # fit as many locations on one line, as long as the
                    # resulting line length doesn't exceed 'options.width'
                    locline = '#:'
                    for filename, lineno in v:
                        d = {'filename': filename, 'lineno': lineno}
                        s = _(' %(filename)s:%(lineno)d') % d
                        if len(locline) + len(s) <= options.width:
                            locline = locline + s
                        else:
                            print(locline, file=fp)
                            locline = "#:" + s
                    if len(locline) > 2:
                        print(locline, file=fp)
                if isdocstring:
                    print('#, docstring', file=fp)
                print('msgid', normalize(k, encoding), file=fp)
                print('msgstr ""\n', file=fp)


def main():
    global default_keywords
    try:
        opts, args = getopt.getopt(
            sys.argv[1:],
            'ad:DEhk:Kno:p:S:Vvw:x:X:',
            # long options that take an argument must end with '=';
            # --no-docstrings takes a filename, like its short form -X
            ['extract-all', 'default-domain=', 'escape', 'help',
             'keyword=', 'no-default-keywords',
             'add-location', 'no-location', 'output=', 'output-dir=',
             'style=', 'verbose', 'version', 'width=', 'exclude-file=',
             'docstrings', 'no-docstrings=',
             ])
    except getopt.error as msg:
        usage(1, msg)

    # for holding option values
    class Options:
        # constants
        GNU = 1
        SOLARIS = 2
        # defaults
        extractall = 0 # FIXME: currently this option has no effect at all.
        escape = 0
        keywords = []
        outpath = ''
        outfile = 'messages.pot'
        writelocations = 1
        locationstyle = GNU
        verbose = 0
        width = 78
        excludefilename = ''
        docstrings = 0
        nodocstrings = {}

    options = Options()
    locations = {'gnu' : options.GNU,
                 'solaris' : options.SOLARIS,
                 }

    # parse options
    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage(0)
        elif opt in ('-a', '--extract-all'):
            options.extractall = 1
        elif opt in ('-d', '--default-domain'):
            options.outfile = arg + '.pot'
        elif opt in ('-E', '--escape'):
            options.escape = 1
        elif opt in ('-D', '--docstrings'):
            options.docstrings = 1
        elif opt in ('-k', '--keyword'):
            options.keywords.append(arg)
        elif opt in ('-K', '--no-default-keywords'):
            default_keywords = []
        elif opt in ('-n', '--add-location'):
            options.writelocations = 1
        elif opt in ('--no-location',):
            options.writelocations = 0
        elif opt in ('-S', '--style'):
            options.locationstyle = locations.get(arg.lower())
            if options.locationstyle is None:
                usage(1, _('Invalid value for --style: %s') % arg)
        elif opt in ('-o', '--output'):
            options.outfile = arg
        elif opt in ('-p', '--output-dir'):
            options.outpath = arg
        elif opt in ('-v', '--verbose'):
            options.verbose = 1
        elif opt in ('-V', '--version'):
            print(_('pygettext.py (xgettext for Python) %s') % __version__)
            sys.exit(0)
        elif opt in ('-w', '--width'):
            try:
                options.width = int(arg)
            except ValueError:
                usage(1, _('--width argument must be an integer: %s') % arg)
        elif opt in ('-x', '--exclude-file'):
            options.excludefilename = arg
        elif opt in ('-X', '--no-docstrings'):
            fp = open(arg)
            try:
                while 1:
                    line = fp.readline()
                    if not line:
                        break
                    options.nodocstrings[line[:-1]] = 1
            finally:
                fp.close()

    # calculate escapes
    make_escapes(not options.escape)

    # calculate all keywords
    options.keywords.extend(default_keywords)

    # initialize list of strings to exclude
    if options.excludefilename:
        try:
            with open(options.excludefilename) as fp:
                # strip the trailing newline so that each line can match
                # an extracted message exactly
                options.toexclude = [line.rstrip('\n') for line in fp]
        except IOError:
            print(_(
                "Can't read --exclude-file: %s") % options.excludefilename, file=sys.stderr)
            sys.exit(1)
    else:
        options.toexclude = []

    # resolve args to module lists
    expanded = []
    for arg in args:
        if arg == '-':
            expanded.append(arg)
        else:
            expanded.extend(getFilesForName(arg))
    args = expanded

    # slurp through all the files
    eater = TokenEater(options)
    for filename in args:
        if filename == '-':
            if options.verbose:
                print(_('Reading standard input'))
            fp = sys.stdin.buffer
            closep = 0
        else:
            if options.verbose:
                print(_('Working on %s') % filename)
            fp = open(filename, 'rb')
            closep = 1
        try:
            eater.set_filename(filename)
            try:
                tokens = tokenize.tokenize(fp.readline)
                for _token in tokens:
                    eater(*_token)
            except tokenize.TokenError as e:
                print('%s: %s, line %d, column %d' % (
                    e.args[0], filename, e.args[1][0], e.args[1][1]),
                    file=sys.stderr)
        finally:
            if closep:
                fp.close()

    # write the output
    if options.outfile == '-':
        fp = sys.stdout
        closep = 0
    else:
        if options.outpath:
            options.outfile = os.path.join(options.outpath, options.outfile)
        fp = open(options.outfile, 'w')
        closep = 1
    try:
        eater.write(fp)
    finally:
        if closep:
            fp.close()


if __name__ == '__main__':
    main()
    # some more test strings
    # this one creates a warning
    _('*** Seen unexpected token "%(token)s"') % {'token': 'test'}
    _('more' 'than' 'one' 'string')
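

# Illustration only, not part of pygettext: a minimal, self-contained sketch
# of the token-driven state machine that TokenEater implements above. It
# walks the tokenize stream, waits for a keyword name, expects '(' next, and
# then collects adjacent plain string literals until the closing ')'.
# Assumptions: the helper name _sketch_extract_messages is hypothetical,
# the source is passed as a str rather than read from a file, and
# docstrings, f-strings, warnings and location tracking are left out.
import ast
import io
import tokenize


def _sketch_extract_messages(source, keywords=('_',)):
    messages = []
    parts = []
    state = 'waiting'
    # tokens that may appear between string literals without ending the call
    skippable = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                 tokenize.INDENT, tokenize.DEDENT)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if state == 'waiting':
            if tok.type == tokenize.NAME and tok.string in keywords:
                state = 'keyword'
        elif state == 'keyword':
            if tok.type == tokenize.OP and tok.string == '(':
                parts = []
                state = 'open'
            else:
                state = 'waiting'
        elif state == 'open':
            s = tok.string
            if tok.type == tokenize.OP and s == ')':
                # adjacent literals are joined into a single message
                if parts:
                    messages.append(''.join(parts))
                state = 'waiting'
            elif tok.type == tokenize.STRING and (
                    s[0] in '\'"' or (s[0] in 'rRuU' and s[1] in '\'"')):
                # literal_eval unwraps the quotes without running any code
                parts.append(ast.literal_eval(s))
            elif tok.type not in skippable:
                # anything else (a name, an operator, ...) abandons the call
                state = 'waiting'
    return messages

# Usage, under the assumptions above:
#   _sketch_extract_messages('print(_("hello"))\nx = _("a" "b")\n')
# returns ['hello', 'ab']; a call such as _(name) contributes nothing.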