GAP 4.8.9 installation with standard packages -- copy to your CoCalc project to get it

6 String and Text Utilities

6.1 Text Utilities

This section describes some utility functions for handling texts within GAP.
They are used by the functions in the GAPDoc package but may be useful for
other purposes as well. We start with some variables containing useful
strings and go on with functions for parsing and reformatting text.

6.1-1 WHITESPACE

WHITESPACE global variable
CAPITALLETTERS global variable
SMALLLETTERS global variable
LETTERS global variable
DIGITS global variable
HEXDIGITS global variable
BOXCHARS global variable

These variables contain sets of characters which are useful for text
processing. They are defined as follows.

WHITESPACE
    " \n\t\r"

CAPITALLETTERS
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

SMALLLETTERS
    "abcdefghijklmnopqrstuvwxyz"

LETTERS
    concatenation of CAPITALLETTERS and SMALLLETTERS

DIGITS
    "0123456789"

HEXDIGITS
    "0123456789ABCDEFabcdef"

BOXCHARS
    "─│┌┬┐├┼┤└┴┘━┃┏┳┓┣╋┫┗┻┛═║╔╦╗╠╬╣╚╩╝"; these are in UTF-8 encoding, so
    the i-th unicode character is BOXCHARS{[3*i-2..3*i]}.
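
Since GAP strings are lists of characters, these variables can be used
directly in membership tests. The following is a small sketch; the helper
names LooksLikeHex and BoxChar are made up for this illustration and are not
part of GAPDoc.

```gap
# Keep only the decimal digits of a string.
s := "Order 2^5 * 3 = 96";;
Filtered(s, c -> c in DIGITS);        # "25396"

# Hypothetical helper: is str a non-empty string of hexadecimal digits?
LooksLikeHex := str -> Length(str) > 0 and ForAll(str, c -> c in HEXDIGITS);;
LooksLikeHex("1A3f");                 # true
LooksLikeHex("1G");                   # false

# Hypothetical helper: the i-th box character (three bytes each in UTF-8).
BoxChar := i -> BOXCHARS{[3*i-2 .. 3*i]};;
Print(BoxChar(3), BoxChar(1), BoxChar(5), "\n");
```

On a UTF-8 terminal the last line should print a small top border made of
the third, first and fifth box characters.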

6.1-2 TextAttr

TextAttr global variable

The record TextAttr contains strings which can be printed to change the
terminal attribute for the following characters. This only works with
terminals which understand basic ANSI escape sequences. Try the following
example to see if this is the case for the terminal you are using. It shows
the effect of the foreground and background color attributes and of the
attributes .bold, .blink, .normal, .reverse and .underscore, which can
partly be combined.

Example
  extra := ["CSI", "reset", "delline", "home"];;
  for t in Difference(RecNames(TextAttr), extra) do
    Print(TextAttr.(t), "TextAttr.", t, TextAttr.reset,"\n");
  od;

The suggested defaults for colors 0..7 are black, red, green, brown, blue,
magenta, cyan, white. But this may be different for your terminal
configuration.

The escape sequence .delline deletes the content of the current line and
.home moves the cursor to the beginning of the current line.

Example
  for i in [1..5] do
    Print(TextAttr.home, TextAttr.delline, String(i,-6), "\c");
    Sleep(1);
  od;

Whenever you use this in some printing routines you should make it optional.
Use these attributes only when UserPreference("UseColorsInTerminal")
returns true.
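
A minimal sketch of how a printing routine might follow this advice. The
function name MaybeColored is an assumption made for this example; only
TextAttr and UserPreference are taken from the text above.

```gap
# Hypothetical helper: wrap str in the escape sequence attr only if the
# user has enabled colors, otherwise return str unchanged.
MaybeColored := function(str, attr)
  if UserPreference("UseColorsInTerminal") = true then
    return Concatenation(attr, str, TextAttr.reset);
  fi;
  return str;
end;;

Print(MaybeColored("Warning:", TextAttr.1), " something looks odd\n");
```

Comparing with "= true" keeps the helper on the safe side if the preference
is unset.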

6.1-3 WrapTextAttribute

WrapTextAttribute( str, attr )  function
Returns: a string with markup

The argument str must be a text as GAP string, possibly with markup by
escape sequences as in TextAttr (6.1-2). This function returns a string
which is wrapped by the escape sequences attr and TextAttr.reset. It takes
care of markup in the given string by also appending attr after each
TextAttr.reset that occurs in str.

Example
  gap> str := Concatenation("XXX",TextAttr.2, "BLUB", TextAttr.reset,"YYY");
  "XXX\033[32mBLUB\033[0mYYY"
  gap> str2 := WrapTextAttribute(str, TextAttr.1);
  "\033[31mXXX\033[32mBLUB\033[0m\033[31m\027YYY\033[0m"
  gap> str3 := WrapTextAttribute(str, TextAttr.underscore);
  "\033[4mXXX\033[32mBLUB\033[0m\033[4m\027YYY\033[0m"
  gap> # use Print(str); and so on to see what it looks like.

6.1-4 FormatParagraph

FormatParagraph( str[, len][, flush][, attr][, widthfun] )  function
Returns: the formatted paragraph as string

This function formats a text given in the string str as a paragraph. The
optional arguments have the following meaning:

len
    the length of the lines of the formatted text; the default is 78 (not
    counting the visible length of the strings given in the attr argument)

flush
    can be "left", "right", "center" or "both", telling that lines should
    be flushed left, flushed right, centered or left-right justified,
    respectively; the default is "both"

attr
    is a list of two strings; the first is prepended and the second
    appended to each line of the result (this can for example be used for
    indenting, [" ", ""], or for some markup, [TextAttr.bold,
    TextAttr.reset]); the default is ["", ""]

widthfun
    must be a function which returns the display width of text in str. The
    default is Length, assuming that each byte corresponds to a character
    of width one. If str is given in UTF-8 encoding one can use
    WidthUTF8String (6.2-3) here.

This function tries to handle markup with the escape sequences explained in
TextAttr (6.1-2) correctly.

Example
  gap> str := "One two three four five six seven eight nine ten eleven.";;
  gap> Print(FormatParagraph(str, 25, "left", ["/* ", " */"]));
  /* One two three four five */
  /* six seven eight nine ten */
  /* eleven. */
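
Two further calls, as a sketch of how the optional arguments can be combined.
The sample text is arbitrary, and the exact line breaks depend on it, so no
output is shown.

```gap
text := Concatenation("GAP is a system for computational discrete algebra, ",
          "with particular emphasis on computational group theory.");;

# 40 columns of text, flushed left, each line indented by four blanks.
Print(FormatParagraph(text, 40, "left", ["    ", ""]));

# Centered lines, each wrapped in bold markup.
Print(FormatParagraph(text, 40, "center", [TextAttr.bold, TextAttr.reset]));
```

The second call only makes sense on a terminal that understands the escape
sequences described in TextAttr (6.1-2).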

6.1-5 SubstitutionSublist

SubstitutionSublist( list, sublist, new[, flag] )  function
Returns: the changed list

This function looks for (non-overlapping) occurrences of a sublist sublist
in a list list (compare PositionSublist (Reference: PositionSublist)) and
returns a list where these are substituted with the list new.

The optional argument flag can either be "all" (this is the default if not
given) or "one". In the second case only the first occurrence of sublist is
substituted.

If sublist does not occur in list then list itself is returned (and not a
ShallowCopy(list)).

Example
  gap> SubstitutionSublist("xababx", "ab", "a");
  "xaax"
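
The following sketch illustrates the "one" flag and the remark about the
returned object, using only the behaviour described above. The sample strings
are arbitrary.

```gap
# Only the first occurrence is replaced.
SubstitutionSublist("xababx", "ab", "a", "one");        # "xaabx"

# A typical use: escape every '&' of a text as an entity.
SubstitutionSublist("R&D & more", "&", "&amp;");        # "R&amp;D &amp; more"

# No occurrence: the identical object comes back, not a copy.
s := "unchanged";;
IsIdenticalObj(s, SubstitutionSublist(s, "zz", "y"));   # true
```

Because the original object may be returned, a caller who wants to modify
the result in place should take a ShallowCopy first.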

6.1-6 StripBeginEnd

StripBeginEnd( list, strip )  function
Returns: changed string

Here list and strip must be lists. This function returns the sublist of list
which does not contain the leading and trailing entries which are entries of
strip. If the result is equal to list then list itself is returned.

Example
  gap> StripBeginEnd(" ,a, b,c, ", ", ");
  "a, b,c"

6.1-7 StripEscapeSequences

StripEscapeSequences( str )  function
Returns: string without escape sequences

This function returns the string one gets from the string str by removing
all escape sequences which are explained in TextAttr (6.1-2). If str does
not contain such a sequence then str itself is returned.
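
As a brief illustration (a sketch based on the description above, with an
arbitrary sample string), stripping the markup recovers the plain text, whose
Length is then the visible length:

```gap
marked := Concatenation(TextAttr.bold, "Result:", TextAttr.reset, " ok");;
plain := StripEscapeSequences(marked);;

plain = "Result: ok";                 # expected to be true
Length(marked) > Length(plain);       # the markup adds invisible bytes

# A string without markup is returned as the same object.
IsIdenticalObj(plain, StripEscapeSequences(plain));
```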

6.1-8 RepeatedString

RepeatedString( c, len )  function
RepeatedUTF8String( c, len )  function

Here c must be either a character or a string and len is a non-negative
integer. Then RepeatedString returns a string of length len consisting of
copies of c.

In the variant RepeatedUTF8String the argument c is considered as string in
UTF-8 encoding, and it can also be specified as unicode string or character,
see Unicode (6.2-1). The result is a string in UTF-8 encoding which has
visible width len as explained in WidthUTF8String (6.2-3).

Example
  gap> RepeatedString('=',51);
  "==================================================="
  gap> RepeatedString("*=",51);
  "*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*"
  gap> s := "bäh";;
  gap> enc := GAPInfo.TermEncoding;;
  gap> if enc <> "UTF-8" then s := Encode(Unicode(s, enc), "UTF-8"); fi;
  gap> l := RepeatedUTF8String(s, 8);;
  gap> u := Unicode(l, "UTF-8");;
  gap> Print(Encode(u, enc), "\n");
  bähbähbä

6.1-9 NumberDigits

NumberDigits( str, base )  function
Returns: integer

DigitsNumber( n, base )  function
Returns: string

The argument str of NumberDigits must be a string consisting only of an
optional leading '-' and characters in 0123456789abcdefABCDEF, describing an
integer in base base with 2 ≤ base ≤ 16. This function returns the
corresponding integer.

The function DigitsNumber does the reverse.

Example
  gap> NumberDigits("1A3F",16);
  6719
  gap> DigitsNumber(6719, 16);
  "1A3F"
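
A few more calls in other bases, as a sketch. The values in the comments
follow from the description above; the round-trip check at the end is an
illustration and not part of the GAPDoc test suite.

```gap
NumberDigits("11111111", 2);          # 255
NumberDigits("-ff", 16);              # -255, lower case digits are accepted
DigitsNumber(255, 2);                 # "11111111"

# Converting to base 7 and back should give the original number.
ForAll([1..1000], n -> NumberDigits(DigitsNumber(n, 7), 7) = n);
```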

6.1-10 LabelInt

LabelInt( n, type, pre, post )  function
Returns: string

The argument n must be an integer in the range from 1 to 5000, while pre and
post must be strings.

The argument type can be one of "Decimal", "Roman", "roman", "Alpha",
"alpha".

The function returns a string that starts with pre, followed by n written
as a decimal number, as a Roman numeral or as an alphabetic label (in
capital or small letters, according to type), followed by post.

Example
  gap> List([1,2,3,4,5,691], i-> LabelInt(i,"Decimal","","."));
  [ "1.", "2.", "3.", "4.", "5.", "691." ]
  gap> List([1,2,3,4,5,691], i-> LabelInt(i,"alpha","(",")"));
  [ "(a)", "(b)", "(c)", "(d)", "(e)", "(zo)" ]
  gap> List([1,2,3,4,5,691], i-> LabelInt(i,"Alpha","",".)"));
  [ "A.)", "B.)", "C.)", "D.)", "E.)", "ZO.)" ]
  gap> List([1,2,3,4,5,691], i-> LabelInt(i,"roman","","."));
  [ "i.", "ii.", "iii.", "iv.", "v.", "dcxci." ]
  gap> List([1,2,3,4,5,691], i-> LabelInt(i,"Roman","",""));
  [ "I", "II", "III", "IV", "V", "DCXCI" ]
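
A typical use is labelling the items of an enumerated list. The following
sketch uses an arbitrary list of item texts.

```gap
items := ["load the data", "run the computation", "write the report"];;
for i in [1..Length(items)] do
  Print(LabelInt(i, "roman", "(", ") "), items[i], "\n");
od;
```

This should print the three items with the labels (i), (ii) and (iii).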

6.1-11 PositionMatchingDelimiter

PositionMatchingDelimiter( str, delim, pos )  function
Returns: position as integer or fail

Here str must be a string and delim a string with two different characters.
This function searches for the smallest position r of the character
delim[2] in str such that the number of occurrences of delim[2] in str
between positions pos+1 and r exceeds the corresponding number of
occurrences of delim[1] by one.

If such an r exists, it is returned. Otherwise fail is returned.

Example
  gap> PositionMatchingDelimiter("{}x{ab{c}d}", "{}", 0);
  fail
  gap> PositionMatchingDelimiter("{}x{ab{c}d}", "{}", 1);
  2
  gap> PositionMatchingDelimiter("{}x{ab{c}d}", "{}", 6);
  11
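
In practice pos is usually the position of an opening delimiter, so that r
is the position of the closing delimiter that matches it. A small sketch
with an arbitrary sample string:

```gap
str := "\\textbf{a{b}c} rest";;
open := Position(str, '{');;                             # 8
close := PositionMatchingDelimiter(str, "{}", open);;    # 14
str{[open+1 .. close-1]};                                # "a{b}c"
```

The nested pair of braces is skipped because its opening and closing
characters cancel in the count.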

6.1-12 WordsString

WordsString( str )  function
Returns: list of strings containing the words

This returns the list of words of a text stored in the string str. All
non-letters are considered as word boundaries and are removed.

Example
  gap> WordsString("one_two \n three!?");
  [ "one", "two", "three" ]
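
Combined with standard list functions from the GAP library, this gives, for
instance, a simple word count. This is a sketch with an arbitrary sample
text.

```gap
text := "The cat and the hat. The end!";;
words := List(WordsString(text), LowercaseString);;
Collected(words);
# [ [ "and", 1 ], [ "cat", 1 ], [ "end", 1 ], [ "hat", 1 ], [ "the", 3 ] ]
```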

6.1-13 Base64String

Base64String( str )  function
StringBase64( bstr )  function
Returns: a string

The first function translates arbitrary binary data given as a GAP string
into a base 64 encoded string. This encoded string contains only printable
ASCII characters and is used in various data transfer protocols (MIME
encoded emails, weak password encryption, ...). We use the specification in
RFC 2045 (http://tools.ietf.org/html/rfc2045).

The second function has the reverse functionality. Here we also accept the
characters -_ instead of +/ as last two characters. Whitespace is ignored.

Example
  gap> b := Base64String("This is a secret!");
  "VGhpcyBpcyBhIHNlY3JldCEA="
  gap> StringBase64(b);
  "This is a secret!"
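
The following sketch checks the round trip and the statement that whitespace
in the encoded text is ignored. The place where the line break is inserted
is arbitrary.

```gap
msg := "This is a secret!";;
enc := Base64String(msg);;
StringBase64(enc) = msg;              # round trip

# Insert a line break and some blanks into the encoded text.
wrapped := Concatenation(enc{[1..8]}, "\n  ", enc{[9..Length(enc)]});;
StringBase64(wrapped) = msg;          # still decodes to the same string
```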

6.2 Unicode Strings

The GAPDoc package provides some tools to deal with unicode characters and
strings. These can be used for recoding text strings between various
encodings.

6.2-1 Unicode Strings and Characters

Unicode( list[, encoding] )  operation
UChar( num )  operation
IsUnicodeString filter
IsUnicodeCharacter filter
IntListUnicodeString( ustr )  function

Unicode characters are described by their codepoint, an integer in the range
from 0 to 2^21-1. For details about unicode, see http://www.unicode.org.

The function UChar wraps an integer num into a GAP object lying in the
filter IsUnicodeCharacter. Use Int to get the codepoint back. The argument
num can also be a GAP character which is then translated to an integer via
IntChar (Reference: IntChar).

Unicode produces a GAP object in the filter IsUnicodeString. This is a
wrapped list of integers for the unicode characters in the string. The
function IntListUnicodeString gives access to this list of integers. Basic
list functionality is available for IsUnicodeString elements. The entries
are in IsUnicodeCharacter. The argument list for Unicode is either a list of
integers or a GAP string. In the latter case an encoding can be specified as
a string; its default is "UTF-8".

Currently supported encodings can be found in
UNICODE_RECODE.NormalizedEncodings (ASCII, ISO-8859-X, UTF-8 and aliases).
The encoding "XML" means an ASCII encoding in which non-ASCII characters are
specified by XML character entities. The encoding "URL" is for URL-encoded
(also called percent-encoded) strings, as specified in RFC 3986
(http://www.ietf.org/rfc/rfc3986.txt). The listed encodings "LaTeX" and
aliases cannot be used with Unicode. See the operation Encode (6.2-2) for
mapping a unicode string to a GAP string.

Example
  gap> ustr := Unicode("a and \366", "latin1");
  Unicode("a and ö")
  gap> ustr = Unicode("a and &#246;", "XML");
  true
  gap> IntListUnicodeString(ustr);
  [ 97, 32, 97, 110, 100, 32, 246 ]
  gap> ustr[7];
  'ö'
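
A short sketch of the most common use, recoding a byte string from one
encoding to another. It relies on Encode (6.2-2); the sample string is the
one from the example above.

```gap
latin := "a and \366";;                  # 7 bytes in ISO-8859-1
ustr := Unicode(latin, "latin1");;
utf := Encode(ustr, "UTF-8");;

Length(latin);                           # 7
Length(utf);                             # 8, since ö needs two bytes in UTF-8
Int(ustr[7]);                            # 246, the codepoint of ö

# A unicode string can also be built from a list of codepoints.
ustr = Unicode([97, 32, 97, 110, 100, 32, 246]);
```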

6.2-2 Encode

Encode( ustr[, encoding] )  operation
Returns: a GAP string

SimplifiedUnicodeString( ustr[, encoding][, "single"] )  function
Returns: a unicode string

LowercaseUnicodeString( ustr )  function
Returns: a unicode string

UppercaseUnicodeString( ustr )  function
Returns: a unicode string

LaTeXUnicodeTable global variable
SimplifiedUnicodeTable global variable
LowercaseUnicodeTable global variable

The operation Encode translates a unicode string ustr into a GAP string in
some specified encoding. The default encoding is "UTF-8".

Supported encodings can be found in UNICODE_RECODE.NormalizedEncodings.
Except for some cases mentioned below, characters which are not available in
the target encoding are substituted by '?' characters.

If the encoding is "URL" (see Unicode (6.2-1)) then an optional argument
encreserved can be given; it must be a list of reserved characters which
should be percent encoded. The default is to encode only the % character.

The encoding "LaTeX" substitutes non-ASCII characters and LaTeX special
characters by LaTeX code as given in an ordered list LaTeXUnicodeTable of
pairs [codepoint, string]. If you have a unicode character for which no
substitution is contained in that list, you will get a warning and the
translation is Unicode(nr). In this case find a substitution and add a
corresponding [codepoint, string] pair to LaTeXUnicodeTable using AddSet
(Reference: AddSet). Also, please tell the GAPDoc authors about your
addition, so that we can extend the list LaTeXUnicodeTable. (Most of the
initial entries were generated from lists in the TeX projects encTeX and
ucs.) There are some variants of this encoding:

"LaTeXleavemarkup" does the same translations for non-ASCII characters but
leaves the LaTeX special characters (e.g., any LaTeX commands) as they are.

"LaTeXUTF8" does not give a warning about unicode characters without
explicit translation; instead it translates the character to its UTF-8
encoding. Make sure to set up your LaTeX document such that all these
characters are understood.

"LaTeXUTF8leavemarkup" is a combination of the last two variants.

Note that the "LaTeX" encoding can only be used with Encode but not for the
opposite translation with Unicode (6.2-1) (which would need far too
complicated heuristics).

The function SimplifiedUnicodeString can be used to substitute many
non-ASCII characters by related ASCII characters or strings (e.g., by a
corresponding character without accents). The argument ustr and the result
are unicode strings. If encoding is "ASCII" then all non-ASCII characters
are translated, otherwise only the non-latin1 characters. If the string
"single" is given as an argument then only substitutions are considered
which don't make the result string longer. The translations are stored in a
sorted list SimplifiedUnicodeTable. Its entries are of the form [codepoint,
trans1, trans2, ...]. Here trans1 and so on are each either an integer for
the codepoint of a substitution character or a list of codepoint integers.
If you are missing characters in this list and know a sensible ASCII
approximation, then add an entry (with AddSet (Reference: AddSet)) and tell
the GAPDoc authors about it. (The initial content of SimplifiedUnicodeTable
was mainly generated from the transtab tables by Markus Kuhn.)

The function LowercaseUnicodeString gets and returns a unicode string and
translates each uppercase character to its corresponding lowercase version.
This function uses a list LowercaseUnicodeTable of pairs of codepoint
integers. This list was generated using the file UnicodeData.txt from the
unicode definition (field 14 in each row).

The function UppercaseUnicodeString does the analogous translation to
uppercase characters.

Example
  gap> ustr := Unicode("a and &#246;", "XML");
  Unicode("a and ö")
  gap> SimplifiedUnicodeString(ustr, "ASCII");
  Unicode("a and oe")
  gap> SimplifiedUnicodeString(ustr, "ASCII", "single");
  Unicode("a and o")
  gap> ustr2 := UppercaseUnicodeString(ustr);;
  gap> Print(Encode(ustr2, GAPInfo.TermEncoding), "\n");
  A AND Ö
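
The following sketch encodes the same unicode string in several target
encodings. The comments describe what the text above leads one to expect;
the exact form of the "XML" and "LaTeX" results is not shown because it
depends on the tables of the installed GAPDoc version.

```gap
ustr := Unicode("a and &#246;", "XML");;

Encode(ustr);                  # UTF-8 is the default, 8 bytes
Encode(ustr, "ISO-8859-1");    # one byte per character, 7 bytes
Encode(ustr, "ASCII");         # ö is not available and becomes '?'
Encode(ustr, "XML");           # ö as an XML character entity
Encode(ustr, "LaTeX");         # ö as LaTeX code from LaTeXUnicodeTable

# Simplify first to get a readable pure ASCII version.
Encode(SimplifiedUnicodeString(ustr, "ASCII"), "ASCII");    # "a and oe"
```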

6.2-3 Lengths of UTF-8 strings

WidthUTF8String( str )  function
NrCharsUTF8String( str )  function
Returns: an integer

Let str be a GAP string with text in UTF-8 encoding. There are three lengths
of such a string which must be distinguished. The operation Length
(Reference: Length) returns the number of bytes and so the memory occupied
by str. The function NrCharsUTF8String returns the number of unicode
characters in str, that is the length of Unicode(str).

In many applications the function WidthUTF8String is more interesting; it
returns the number of columns needed by the string if printed to a terminal.
This takes into account that some unicode characters are combining
characters and that there are wide characters which need two columns (e.g.,
for Chinese or Japanese). (To be precise: This implementation assumes that
there are no control characters in str and uses the character width returned
by the wcwidth function in the GNU C-library called with UTF-8 locale.)

Example
  gap> # A, German umlaut u, B, zero width space, C, newline
  gap> str := Encode( Unicode( "A&#xFC;B&#x200B;C\n", "XML" ) );;
  gap> Print(str);
  AüB​C
  gap> # umlaut u needs two bytes and the zero width space three
  gap> Length(str);
  9
  gap> NrCharsUTF8String(str);
  6
  gap> # zero width space and newline don't contribute to width
  gap> WidthUTF8String(str);
  4

6.2-4 InitialSubstringUTF8String

InitialSubstringUTF8String( str, maxwidth )  function
Returns: UTF-8 encoded string

The argument str must be a GAP string with text in UTF-8 encoding or a
unicode string. The function returns the longest initial substring of str
which has at most visible width maxwidth, as UTF-8 encoded GAP string.

Example
  gap> # A, German umlaut u, B, zero width space, C, newline
  gap> str := Encode( Unicode( "A&#xFC;B&#x200B;C\n", "XML" ) );;
  gap> ini := InitialSubstringUTF8String(str, 3);;
  gap> WidthUTF8String(ini);
  3
  gap> IntListUnicodeString(Unicode(ini));
  [ 65, 252, 66, 8203 ]
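
Together with WidthUTF8String (6.2-3) this can be used to fit text into
table columns of a fixed width. The helper FitToWidthUTF8 below is a
hypothetical name chosen for this sketch; the sample names are built from
XML entities so that the code does not depend on the encoding of the file
it is read from.

```gap
# Hypothetical helper: cut or pad a UTF-8 string to exactly width columns.
FitToWidthUTF8 := function(str, width)
  local cut;
  cut := InitialSubstringUTF8String(str, width);
  return Concatenation(cut,
           RepeatedString(' ', width - WidthUTF8String(cut)));
end;;

names := List(["Z&#xFC;rich", "K&#xF8;benhavn", "Rome"],
              s -> Encode(Unicode(s, "XML"), "UTF-8"));;
for n in names do
  Print(FitToWidthUTF8(n, 8), "|\n");
od;
```

On a UTF-8 terminal the vertical bars should line up, although the three
strings occupy different numbers of bytes.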

6.3 Print Utilities

The following printing utilities turned out to be useful for interactive
work with texts in GAP. But they are more general and so we document them
here.

6.3-1 PrintTo1

PrintTo1( filename, fun )  function
AppendTo1( filename, fun )  function

The argument fun must be a function without arguments. Everything which is
printed by a call fun() is printed into the file filename. As with PrintTo
(Reference: PrintTo) and AppendTo (Reference: AppendTo) this overwrites or
appends to, respectively, a previous content of filename.

These functions can be particularly efficient when many small pieces of text
are to be written to a file, because the file does not have to be reopened
for each piece.

Example
  gap> f := function() local i;
  >  for i in [1..100000] do Print(i, "\n"); od; end;;
  gap> PrintTo1("nonsense", f); # now check the local file `nonsense'
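
A smaller variant that also shows AppendTo1 and reads the result back with
StringFile (6.3-5). This is a sketch; it writes into a temporary directory
obtained from DirectoryTemporary, and the file name is arbitrary.

```gap
fname := Filename(DirectoryTemporary(), "squares.txt");;

# Overwrite the file with the first five squares, one per line.
PrintTo1(fname, function()
  local i;
  for i in [1..5] do Print(i^2, "\n"); od;
end);

# Append one more line.
AppendTo1(fname, function() Print("done\n"); end);

StringFile(fname);    # "1\n4\n9\n16\n25\ndone\n"
```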

6.3-2 StringPrint

StringPrint( obj1[, obj2[, ...]] )  function
StringView( obj )  function

These functions return a string containing the output of a Print or ViewObj
call with the same arguments.

This should be considered as a (temporary?) hack. It would be better to have
String (Reference: String) methods for all GAP objects and to have a generic
Print (Reference: Print)-function which just interprets these strings.
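
A brief sketch of both functions. The group is an arbitrary example object,
and the precise string returned by StringView depends on how GAP displays
that object.

```gap
g := SymmetricGroup(4);;

# Capture what Print would write for several arguments.
s := StringPrint("Order: ", Size(g), "\n");;
s = "Order: 24\n";                    # true

# Capture what the interactive loop would show for g.
v := StringView(g);;
IsString(v);                          # true
```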

6.3-3 PrintFormattedString

PrintFormattedString( str )  function

This function prints a string str. The difference from Print(str); is that
no additional line breaks are introduced by GAP's standard printing
mechanism. This can be used to print lines which are longer than the current
screen width. In particular one can print text which contains escape
sequences like those explained in TextAttr (6.1-2), where lines may have
more characters than visible characters.

6.3-4 Page

Page( ... )  function
PageDisplay( obj )  function

These functions are similar to Print (Reference: Print) and Display
(Reference: Display), respectively. The difference is that the output is not
sent directly to the screen, but is piped into the current pager; see Pager
(Reference: Pager).

Example
  gap> Page([1..1421]+0);
  gap> PageDisplay(CharacterTable("Symmetric", 14));

6.3-5 StringFile

StringFile( filename )  function
FileString( filename, str[, append] )  function

The function StringFile returns the content of file filename as a string.
This works efficiently with arbitrary (binary or text) files. If something
goes wrong, this function returns fail.

Conversely the function FileString writes the content of a string str into
the file filename. If the optional third argument append is given and equals
true then the content of str is appended to the file. Otherwise previous
content of the file is deleted. This function returns the number of bytes
written or fail if something goes wrong.

Both functions are quite efficient, even with large files.
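
A round trip through a temporary file, as a sketch. The directory comes from
DirectoryTemporary and the file names are arbitrary.

```gap
dir := DirectoryTemporary();;
fname := Filename(dir, "demo.txt");;

FileString(fname, "first line\n");           # 11 bytes written
FileString(fname, "second line\n", true);    # append instead of overwrite

StringFile(fname) = "first line\nsecond line\n";    # true

# A file that does not exist cannot be read.
StringFile(Filename(dir, "missing.txt"));    # fail
```

Checking the result against fail before using it avoids confusing error
messages later on.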