Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
godotengine
GitHub Repository: godotengine/godot
Path: blob/master/thirdparty/pcre2/src/pcre2_dfa_match.c
21797 views
1
/*************************************************
2
* Perl-Compatible Regular Expressions *
3
*************************************************/
4
5
/* PCRE is a library of functions to support regular expressions whose syntax
6
and semantics are as close as possible to those of the Perl 5 language.
7
8
Written by Philip Hazel
9
Original API code Copyright (c) 1997-2012 University of Cambridge
10
New API code Copyright (c) 2016-2024 University of Cambridge
11
12
-----------------------------------------------------------------------------
13
Redistribution and use in source and binary forms, with or without
14
modification, are permitted provided that the following conditions are met:
15
16
* Redistributions of source code must retain the above copyright notice,
17
this list of conditions and the following disclaimer.
18
19
* Redistributions in binary form must reproduce the above copyright
20
notice, this list of conditions and the following disclaimer in the
21
documentation and/or other materials provided with the distribution.
22
23
* Neither the name of the University of Cambridge nor the names of its
24
contributors may be used to endorse or promote products derived from
25
this software without specific prior written permission.
26
27
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
28
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
29
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
30
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
31
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
32
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
33
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
34
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
35
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
36
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
37
POSSIBILITY OF SUCH DAMAGE.
38
-----------------------------------------------------------------------------
39
*/
40
41
42
/* This module contains the external function pcre2_dfa_match(), which is an
43
alternative matching function that uses a sort of DFA algorithm (not a true
44
FSM). This is NOT Perl-compatible, but it has advantages in certain
45
applications. */
46
47
48
/* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
49
the performance of his patterns greatly. I could not use it as it stood, as it
50
was not thread safe, and made assumptions about pattern sizes. Also, it caused
51
test 7 to loop, and test 9 to crash with a segfault.
52
53
The issue is the check for duplicate states, which is done by a simple linear
54
search up the state list. (Grep for "duplicate" below to find the code.) For
55
many patterns, there will never be many states active at one time, so a simple
56
linear search is fine. In patterns that have many active states, it might be a
57
bottleneck. The suggested code used an indexing scheme to remember which states
58
had previously been used for each character, and avoided the linear search when
59
it knew there was no chance of a duplicate. This was implemented when adding
60
states to the state lists.
61
62
I wrote some thread-safe, not-limited code to try something similar at the time
63
of checking for duplicates (instead of when adding states), using index vectors
64
on the stack. It did give a 13% improvement with one specially constructed
65
pattern for certain subject strings, but on other strings and on many of the
66
simpler patterns in the test suite it did worse. The major problem, I think,
67
was the extra time to initialize the index. This had to be done for each call
68
of internal_dfa_match(). (The supplied patch used a static vector, initialized
69
only once - I suspect this was the cause of the problems with the tests.)
70
71
Overall, I concluded that the gains in some cases did not outweigh the losses
72
in others, so I abandoned this code. */
73
74
75
#include "pcre2_internal.h"
76
77
78
79
#define NLBLOCK mb /* Block containing newline information */
80
#define PSSTART start_subject /* Field containing processed string start */
81
#define PSEND end_subject /* Field containing processed string end */
82
83
#define PUBLIC_DFA_MATCH_OPTIONS \
84
(PCRE2_ANCHORED|PCRE2_ENDANCHORED|PCRE2_NOTBOL|PCRE2_NOTEOL|PCRE2_NOTEMPTY| \
85
PCRE2_NOTEMPTY_ATSTART|PCRE2_NO_UTF_CHECK|PCRE2_PARTIAL_HARD| \
86
PCRE2_PARTIAL_SOFT|PCRE2_DFA_SHORTEST|PCRE2_DFA_RESTART| \
87
PCRE2_COPY_MATCHED_SUBJECT)
88
89
90
/*************************************************
91
* Code parameters and static tables *
92
*************************************************/
93
94
/* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
95
into others, under special conditions. A gap of 20 between the blocks should be
96
enough. The resulting opcodes don't have to be less than 256 because they are
97
never stored, so we push them well clear of the normal opcodes. */
98
99
#define OP_PROP_EXTRA 300
100
#define OP_EXTUNI_EXTRA 320
101
#define OP_ANYNL_EXTRA 340
102
#define OP_HSPACE_EXTRA 360
103
#define OP_VSPACE_EXTRA 380
104
105
106
/* This table identifies those opcodes that are followed immediately by a
107
character that is to be tested in some way. This makes it possible to
108
centralize the loading of these characters. In the case of Type * etc, the
109
"character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
110
small value. Non-zero values in the table are the offsets from the opcode where
111
the character is to be found. ***NOTE*** If the start of this table is
112
modified, the three tables that follow must also be modified. */
113
114
static const uint8_t coptable[] = {
115
0, /* End */
116
0, 0, 0, 0, 0, /* \A, \G, \K, \B, \b */
117
0, 0, 0, 0, 0, 0, /* \D, \d, \S, \s, \W, \w */
118
0, 0, 0, /* Any, AllAny, Anybyte */
119
0, 0, /* \P, \p */
120
0, 0, 0, 0, 0, /* \R, \H, \h, \V, \v */
121
0, /* \X */
122
0, 0, 0, 0, 0, 0, /* \Z, \z, $, $M, ^, ^M */
123
1, /* Char */
124
1, /* Chari */
125
1, /* not */
126
1, /* noti */
127
/* Positive single-char repeats */
128
1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
129
1+IMM2_SIZE, 1+IMM2_SIZE, /* upto, minupto */
130
1+IMM2_SIZE, /* exact */
131
1, 1, 1, 1+IMM2_SIZE, /* *+, ++, ?+, upto+ */
132
1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
133
1+IMM2_SIZE, 1+IMM2_SIZE, /* upto I, minupto I */
134
1+IMM2_SIZE, /* exact I */
135
1, 1, 1, 1+IMM2_SIZE, /* *+I, ++I, ?+I, upto+I */
136
/* Negative single-char repeats - only for chars < 256 */
137
1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
138
1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto, minupto */
139
1+IMM2_SIZE, /* NOT exact */
140
1, 1, 1, 1+IMM2_SIZE, /* NOT *+, ++, ?+, upto+ */
141
1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
142
1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto I, minupto I */
143
1+IMM2_SIZE, /* NOT exact I */
144
1, 1, 1, 1+IMM2_SIZE, /* NOT *+I, ++I, ?+I, upto+I */
145
/* Positive type repeats */
146
1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
147
1+IMM2_SIZE, 1+IMM2_SIZE, /* Type upto, minupto */
148
1+IMM2_SIZE, /* Type exact */
149
1, 1, 1, 1+IMM2_SIZE, /* Type *+, ++, ?+, upto+ */
150
/* Character class & ref repeats */
151
0, 0, 0, 0, 0, 0, /* *, *?, +, +?, ?, ?? */
152
0, 0, /* CRRANGE, CRMINRANGE */
153
0, 0, 0, 0, /* Possessive *+, ++, ?+, CRPOSRANGE */
154
0, /* CLASS */
155
0, /* NCLASS */
156
0, /* XCLASS - variable length */
157
0, /* ECLASS - variable length */
158
0, /* REF */
159
0, /* REFI */
160
0, /* DNREF */
161
0, /* DNREFI */
162
0, /* RECURSE */
163
0, /* CALLOUT */
164
0, /* CALLOUT_STR */
165
0, /* Alt */
166
0, /* Ket */
167
0, /* KetRmax */
168
0, /* KetRmin */
169
0, /* KetRpos */
170
0, 0, /* Reverse, Vreverse */
171
0, /* Assert */
172
0, /* Assert not */
173
0, /* Assert behind */
174
0, /* Assert behind not */
175
0, /* NA assert */
176
0, /* NA assert behind */
177
0, /* Assert scan substring */
178
0, /* ONCE */
179
0, /* SCRIPT_RUN */
180
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
181
0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
182
0, 0, /* CREF, DNCREF */
183
0, 0, /* RREF, DNRREF */
184
0, 0, /* FALSE, TRUE */
185
0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
186
0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
187
0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
188
0, 0, /* COMMIT, COMMIT_ARG */
189
0, 0, 0, /* FAIL, ACCEPT, ASSERT_ACCEPT */
190
0, 0, 0, /* CLOSE, SKIPZERO, DEFINE */
191
0, 0, /* \B and \b in UCP mode */
192
};
193
194
/* This table identifies those opcodes that inspect a character. It is used to
195
remember the fact that a character could have been inspected when the end of
196
the subject is reached. ***NOTE*** If the start of this table is modified, the
197
two tables that follow must also be modified. */
198
199
static const uint8_t poptable[] = {
200
0, /* End */
201
0, 0, 0, 1, 1, /* \A, \G, \K, \B, \b */
202
1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */
203
1, 1, 1, /* Any, AllAny, Anybyte */
204
1, 1, /* \P, \p */
205
1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */
206
1, /* \X */
207
0, 0, 0, 0, 0, 0, /* \Z, \z, $, $M, ^, ^M */
208
1, /* Char */
209
1, /* Chari */
210
1, /* not */
211
1, /* noti */
212
/* Positive single-char repeats */
213
1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
214
1, 1, 1, /* upto, minupto, exact */
215
1, 1, 1, 1, /* *+, ++, ?+, upto+ */
216
1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
217
1, 1, 1, /* upto I, minupto I, exact I */
218
1, 1, 1, 1, /* *+I, ++I, ?+I, upto+I */
219
/* Negative single-char repeats - only for chars < 256 */
220
1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
221
1, 1, 1, /* NOT upto, minupto, exact */
222
1, 1, 1, 1, /* NOT *+, ++, ?+, upto+ */
223
1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
224
1, 1, 1, /* NOT upto I, minupto I, exact I */
225
1, 1, 1, 1, /* NOT *+I, ++I, ?+I, upto+I */
226
/* Positive type repeats */
227
1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
228
1, 1, 1, /* Type upto, minupto, exact */
229
1, 1, 1, 1, /* Type *+, ++, ?+, upto+ */
230
/* Character class & ref repeats */
231
1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
232
1, 1, /* CRRANGE, CRMINRANGE */
233
1, 1, 1, 1, /* Possessive *+, ++, ?+, CRPOSRANGE */
234
1, /* CLASS */
235
1, /* NCLASS */
236
1, /* XCLASS - variable length */
237
1, /* ECLASS - variable length */
238
0, /* REF */
239
0, /* REFI */
240
0, /* DNREF */
241
0, /* DNREFI */
242
0, /* RECURSE */
243
0, /* CALLOUT */
244
0, /* CALLOUT_STR */
245
0, /* Alt */
246
0, /* Ket */
247
0, /* KetRmax */
248
0, /* KetRmin */
249
0, /* KetRpos */
250
0, 0, /* Reverse, Vreverse */
251
0, /* Assert */
252
0, /* Assert not */
253
0, /* Assert behind */
254
0, /* Assert behind not */
255
0, /* NA assert */
256
0, /* NA assert behind */
257
0, /* Assert scan substring */
258
0, /* ONCE */
259
0, /* SCRIPT_RUN */
260
0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
261
0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
262
0, 0, /* CREF, DNCREF */
263
0, 0, /* RREF, DNRREF */
264
0, 0, /* FALSE, TRUE */
265
0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
266
0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
267
0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
268
0, 0, /* COMMIT, COMMIT_ARG */
269
0, 0, 0, /* FAIL, ACCEPT, ASSERT_ACCEPT */
270
0, 0, 0, /* CLOSE, SKIPZERO, DEFINE */
271
1, 1, /* \B and \b in UCP mode */
272
};
273
274
/* Compile-time check that these tables have the correct size. */
275
STATIC_ASSERT(sizeof(coptable) == OP_TABLE_LENGTH, coptable);
276
STATIC_ASSERT(sizeof(poptable) == OP_TABLE_LENGTH, poptable);
277
278
/* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
279
and \w */
280
281
static const uint8_t toptable1[] = {
282
0, 0, 0, 0, 0, 0,
283
ctype_digit, ctype_digit,
284
ctype_space, ctype_space,
285
ctype_word, ctype_word,
286
0, 0 /* OP_ANY, OP_ALLANY */
287
};
288
289
static const uint8_t toptable2[] = {
290
0, 0, 0, 0, 0, 0,
291
ctype_digit, 0,
292
ctype_space, 0,
293
ctype_word, 0,
294
1, 1 /* OP_ANY, OP_ALLANY */
295
};
296
297
298
/* Structure for holding data about a particular state, which is in effect the
299
current data for an active path through the match tree. It must consist
300
entirely of ints because the working vector we are passed, and which we put
301
these structures in, is a vector of ints. */
302
303
typedef struct stateblock {
304
int offset; /* Offset to opcode (-ve has meaning) */
305
int count; /* Count for repeats */
306
int data; /* Some use extra data */
307
} stateblock;
308
309
#define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int))
310
311
312
/* Before version 10.32 the recursive calls of internal_dfa_match() were passed
313
local working space and output vectors that were created on the stack. This has
314
caused issues for some patterns, especially in small-stack environments such as
315
Windows. A new scheme is now in use which sets up a vector on the stack, but if
316
this is too small, heap memory is used, up to the heap_limit. The main
317
parameters are all numbers of ints because the workspace is a vector of ints.
318
319
The size of the starting stack vector, DFA_START_RWS_SIZE, is in bytes, and is
320
defined in pcre2_internal.h so as to be available to pcre2test when it is
321
finding the minimum heap requirement for a match. */
322
323
#define OVEC_UNIT (sizeof(PCRE2_SIZE)/sizeof(int))
324
325
#define RWS_BASE_SIZE (DFA_START_RWS_SIZE/sizeof(int)) /* Stack vector */
326
#define RWS_RSIZE 1000 /* Work size for recursion */
327
#define RWS_OVEC_RSIZE (1000*OVEC_UNIT) /* Ovector for recursion */
328
#define RWS_OVEC_OSIZE (2*OVEC_UNIT) /* Ovector in other cases */
329
330
/* This structure is at the start of each workspace block. */
331
332
typedef struct RWS_anchor {
333
struct RWS_anchor *next;
334
uint32_t size; /* Number of ints */
335
uint32_t free; /* Number of ints */
336
} RWS_anchor;
337
338
#define RWS_ANCHOR_SIZE (sizeof(RWS_anchor)/sizeof(int))
339
340
341
342
/*************************************************
343
* Process a callout *
344
*************************************************/
345
346
/* This function is called to perform a callout.
347
348
Arguments:
349
code current code pointer
350
offsets points to current capture offsets
351
current_subject start of current subject match
352
ptr current position in subject
353
mb the match block
354
extracode extra code offset when called from condition
355
lengthptr where to return the callout length
356
357
Returns: the return from the callout
358
*/
359
360
static int
361
do_callout_dfa(PCRE2_SPTR code, PCRE2_SIZE *offsets, PCRE2_SPTR current_subject,
362
PCRE2_SPTR ptr, dfa_match_block *mb, PCRE2_SIZE extracode,
363
PCRE2_SIZE *lengthptr)
364
{
365
pcre2_callout_block *cb = mb->cb;
366
367
*lengthptr = (code[extracode] == OP_CALLOUT)?
368
(PCRE2_SIZE)PRIV(OP_lengths)[OP_CALLOUT] :
369
(PCRE2_SIZE)GET(code, 1 + 2*LINK_SIZE + extracode);
370
371
if (mb->callout == NULL) return 0; /* No callout provided */
372
373
/* Fixed fields in the callout block are set once and for all at the start of
374
matching. */
375
376
cb->offset_vector = offsets;
377
cb->start_match = (PCRE2_SIZE)(current_subject - mb->start_subject);
378
cb->current_position = (PCRE2_SIZE)(ptr - mb->start_subject);
379
cb->pattern_position = GET(code, 1 + extracode);
380
cb->next_item_length = GET(code, 1 + LINK_SIZE + extracode);
381
382
if (code[extracode] == OP_CALLOUT)
383
{
384
cb->callout_number = code[1 + 2*LINK_SIZE + extracode];
385
cb->callout_string_offset = 0;
386
cb->callout_string = NULL;
387
cb->callout_string_length = 0;
388
}
389
else
390
{
391
cb->callout_number = 0;
392
cb->callout_string_offset = GET(code, 1 + 3*LINK_SIZE + extracode);
393
cb->callout_string = code + (1 + 4*LINK_SIZE + extracode) + 1;
394
cb->callout_string_length = *lengthptr - (1 + 4*LINK_SIZE) - 2;
395
}
396
397
return (mb->callout)(cb, mb->callout_data);
398
}
399
400
401
402
/*************************************************
403
* Expand local workspace memory *
404
*************************************************/
405
406
/* This function is called when internal_dfa_match() is about to be called
407
recursively and there is insufficient working space left in the current
408
workspace block. If there's an existing next block, use it; otherwise get a new
409
block unless the heap limit is reached.
410
411
Arguments:
412
rwsptr pointer to block pointer (updated)
413
ovecsize space needed for an ovector
414
mb the match block
415
416
Returns: 0 rwsptr has been updated
417
!0 an error code
418
*/
419
420
static int
421
more_workspace(RWS_anchor **rwsptr, unsigned int ovecsize, dfa_match_block *mb)
422
{
423
RWS_anchor *rws = *rwsptr;
424
RWS_anchor *new;
425
426
if (rws->next != NULL)
427
{
428
new = rws->next;
429
}
430
431
/* Sizes in the RWS_anchor blocks are in units of sizeof(int), but
432
mb->heap_limit and mb->heap_used are in kibibytes. Play carefully, to avoid
433
overflow. */
434
435
else
436
{
437
uint32_t newsize = (rws->size >= UINT32_MAX/(sizeof(int)*2))? UINT32_MAX/sizeof(int) : rws->size * 2;
438
uint32_t newsizeK = newsize/(1024/sizeof(int));
439
440
if (newsizeK + mb->heap_used > mb->heap_limit)
441
newsizeK = (uint32_t)(mb->heap_limit - mb->heap_used);
442
newsize = newsizeK*(1024/sizeof(int));
443
444
if (newsize < RWS_RSIZE + ovecsize + RWS_ANCHOR_SIZE)
445
return PCRE2_ERROR_HEAPLIMIT;
446
new = mb->memctl.malloc(newsize*sizeof(int), mb->memctl.memory_data);
447
if (new == NULL) return PCRE2_ERROR_NOMEMORY;
448
mb->heap_used += newsizeK;
449
new->next = NULL;
450
new->size = newsize;
451
rws->next = new;
452
}
453
454
new->free = new->size - RWS_ANCHOR_SIZE;
455
*rwsptr = new;
456
return 0;
457
}
458
459
460
461
/*************************************************
462
* Match a Regular Expression - DFA engine *
463
*************************************************/
464
465
/* This internal function applies a compiled pattern to a subject string,
466
starting at a given point, using a DFA engine. This function is called from the
467
external one, possibly multiple times if the pattern is not anchored. The
468
function calls itself recursively for some kinds of subpattern.
469
470
Arguments:
471
mb the match_data block with fixed information
472
this_start_code the opening bracket of this subexpression's code
473
current_subject where we currently are in the subject string
474
start_offset start offset in the subject string
475
offsets vector to contain the matching string offsets
476
offsetcount size of same
477
workspace vector of workspace
478
wscount size of same
479
rlevel function call recursion level
480
481
Returns: > 0 => number of match offset pairs placed in offsets
482
= 0 => offsets overflowed; longest matches are present
483
-1 => failed to match
484
< -1 => some kind of unexpected problem
485
486
The following macros are used for adding states to the two state vectors (one
487
for the current character, one for the following character). */
488
489
#define ADD_ACTIVE(x,y) \
490
if (active_count++ < wscount) \
491
{ \
492
next_active_state->offset = (x); \
493
next_active_state->count = (y); \
494
next_active_state++; \
495
} \
496
else return PCRE2_ERROR_DFA_WSSIZE
497
498
#define ADD_ACTIVE_DATA(x,y,z) \
499
if (active_count++ < wscount) \
500
{ \
501
next_active_state->offset = (x); \
502
next_active_state->count = (y); \
503
next_active_state->data = (z); \
504
next_active_state++; \
505
} \
506
else return PCRE2_ERROR_DFA_WSSIZE
507
508
#define ADD_NEW(x,y) \
509
if (new_count++ < wscount) \
510
{ \
511
next_new_state->offset = (x); \
512
next_new_state->count = (y); \
513
next_new_state++; \
514
} \
515
else return PCRE2_ERROR_DFA_WSSIZE
516
517
#define ADD_NEW_DATA(x,y,z) \
518
if (new_count++ < wscount) \
519
{ \
520
next_new_state->offset = (x); \
521
next_new_state->count = (y); \
522
next_new_state->data = (z); \
523
next_new_state++; \
524
} \
525
else return PCRE2_ERROR_DFA_WSSIZE
526
527
/* And now, here is the code */
528
529
static int
530
internal_dfa_match(
531
dfa_match_block *mb,
532
PCRE2_SPTR this_start_code,
533
PCRE2_SPTR current_subject,
534
PCRE2_SIZE start_offset,
535
PCRE2_SIZE *offsets,
536
uint32_t offsetcount,
537
int *workspace,
538
int wscount,
539
uint32_t rlevel,
540
int *RWS)
541
{
542
stateblock *active_states, *new_states, *temp_states;
543
stateblock *next_active_state, *next_new_state;
544
const uint8_t *ctypes, *lcc, *fcc;
545
PCRE2_SPTR ptr;
546
PCRE2_SPTR end_code;
547
dfa_recursion_info new_recursive;
548
int active_count, new_count, match_count;
549
550
/* Some fields in the mb block are frequently referenced, so we load them into
551
independent variables in the hope that this will perform better. */
552
553
PCRE2_SPTR start_subject = mb->start_subject;
554
PCRE2_SPTR end_subject = mb->end_subject;
555
PCRE2_SPTR start_code = mb->start_code;
556
557
#ifdef SUPPORT_UNICODE
558
BOOL utf = (mb->poptions & PCRE2_UTF) != 0;
559
BOOL utf_or_ucp = utf || (mb->poptions & PCRE2_UCP) != 0;
560
#else
561
BOOL utf = FALSE;
562
#endif
563
564
BOOL reset_could_continue = FALSE;
565
566
if (mb->match_call_count++ >= mb->match_limit) return PCRE2_ERROR_MATCHLIMIT;
567
if (rlevel++ > mb->match_limit_depth) return PCRE2_ERROR_DEPTHLIMIT;
568
offsetcount &= (uint32_t)(-2); /* Round down */
569
570
wscount -= 2;
571
wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
572
(2 * INTS_PER_STATEBLOCK);
573
574
ctypes = mb->tables + ctypes_offset;
575
lcc = mb->tables + lcc_offset;
576
fcc = mb->tables + fcc_offset;
577
578
match_count = PCRE2_ERROR_NOMATCH; /* A negative number */
579
580
active_states = (stateblock *)(workspace + 2);
581
next_new_state = new_states = active_states + wscount;
582
new_count = 0;
583
584
/* The first thing in any (sub) pattern is a bracket of some sort. Push all
585
the alternative states onto the list, and find out where the end is. This
586
makes is possible to use this function recursively, when we want to stop at a
587
matching internal ket rather than at the end.
588
589
If we are dealing with a backward assertion we have to find out the maximum
590
amount to move back, and set up each alternative appropriately. */
591
592
if (*this_start_code == OP_ASSERTBACK || *this_start_code == OP_ASSERTBACK_NOT)
593
{
594
size_t max_back = 0;
595
size_t gone_back;
596
597
end_code = this_start_code;
598
do
599
{
600
size_t back = (size_t)GET2(end_code, 2+LINK_SIZE);
601
if (back > max_back) max_back = back;
602
end_code += GET(end_code, 1);
603
}
604
while (*end_code == OP_ALT);
605
606
/* If we can't go back the amount required for the longest lookbehind
607
pattern, go back as far as we can; some alternatives may still be viable. */
608
609
#ifdef SUPPORT_UNICODE
610
/* In character mode we have to step back character by character */
611
612
if (utf)
613
{
614
for (gone_back = 0; gone_back < max_back; gone_back++)
615
{
616
if (current_subject <= start_subject) break;
617
current_subject--;
618
ACROSSCHAR(current_subject > start_subject, current_subject,
619
current_subject--);
620
}
621
}
622
else
623
#endif
624
625
/* In byte-mode we can do this quickly. */
626
627
{
628
size_t current_offset = (size_t)(current_subject - start_subject);
629
gone_back = (current_offset < max_back)? current_offset : max_back;
630
current_subject -= gone_back;
631
}
632
633
/* Save the earliest consulted character */
634
635
if (current_subject < mb->start_used_ptr)
636
mb->start_used_ptr = current_subject;
637
638
/* Now we can process the individual branches. There will be an OP_REVERSE at
639
the start of each branch, except when the length of the branch is zero. */
640
641
end_code = this_start_code;
642
do
643
{
644
uint32_t revlen = (end_code[1+LINK_SIZE] == OP_REVERSE)? 1 + IMM2_SIZE : 0;
645
size_t back = (revlen == 0)? 0 : (size_t)GET2(end_code, 2+LINK_SIZE);
646
if (back <= gone_back)
647
{
648
int bstate = (int)(end_code - start_code + 1 + LINK_SIZE + revlen);
649
ADD_NEW_DATA(-bstate, 0, (int)(gone_back - back));
650
}
651
end_code += GET(end_code, 1);
652
}
653
while (*end_code == OP_ALT);
654
}
655
656
/* This is the code for a "normal" subpattern (not a backward assertion). The
657
start of a whole pattern is always one of these. If we are at the top level,
658
we may be asked to restart matching from the same point that we reached for a
659
previous partial match. We still have to scan through the top-level branches to
660
find the end state. */
661
662
else
663
{
664
end_code = this_start_code;
665
666
/* Restarting */
667
668
if (rlevel == 1 && (mb->moptions & PCRE2_DFA_RESTART) != 0)
669
{
670
do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
671
new_count = workspace[1];
672
if (!workspace[0])
673
memcpy(new_states, active_states, (size_t)new_count * sizeof(stateblock));
674
}
675
676
/* Not restarting */
677
678
else
679
{
680
int length = 1 + LINK_SIZE +
681
((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
682
*this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
683
? IMM2_SIZE:0);
684
do
685
{
686
ADD_NEW((int)(end_code - start_code + length), 0);
687
end_code += GET(end_code, 1);
688
length = 1 + LINK_SIZE;
689
}
690
while (*end_code == OP_ALT);
691
}
692
}
693
694
workspace[0] = 0; /* Bit indicating which vector is current */
695
696
/* Loop for scanning the subject */
697
698
ptr = current_subject;
699
for (;;)
700
{
701
int i, j;
702
int clen, dlen;
703
uint32_t c, d;
704
BOOL partial_newline = FALSE;
705
BOOL could_continue = reset_could_continue;
706
reset_could_continue = FALSE;
707
708
if (ptr > mb->last_used_ptr) mb->last_used_ptr = ptr;
709
710
/* Make the new state list into the active state list and empty the
711
new state list. */
712
713
temp_states = active_states;
714
active_states = new_states;
715
new_states = temp_states;
716
active_count = new_count;
717
new_count = 0;
718
719
workspace[0] ^= 1; /* Remember for the restarting feature */
720
workspace[1] = active_count;
721
722
/* Set the pointers for adding new states */
723
724
next_active_state = active_states + active_count;
725
next_new_state = new_states;
726
727
/* Load the current character from the subject outside the loop, as many
728
different states may want to look at it, and we assume that at least one
729
will. */
730
731
if (ptr < end_subject)
732
{
733
clen = 1; /* Number of data items in the character */
734
#ifdef SUPPORT_UNICODE
735
GETCHARLENTEST(c, ptr, clen);
736
#else
737
c = *ptr;
738
#endif /* SUPPORT_UNICODE */
739
}
740
else
741
{
742
clen = 0; /* This indicates the end of the subject */
743
c = NOTACHAR; /* This value should never actually be used */
744
}
745
746
/* Scan up the active states and act on each one. The result of an action
747
may be to add more states to the currently active list (e.g. on hitting a
748
parenthesis) or it may be to put states on the new list, for considering
749
when we move the character pointer on. */
750
751
for (i = 0; i < active_count; i++)
752
{
753
stateblock *current_state = active_states + i;
754
BOOL caseless = FALSE;
755
PCRE2_SPTR code;
756
uint32_t codevalue;
757
int state_offset = current_state->offset;
758
int rrc;
759
int count;
760
761
/* A negative offset is a special case meaning "hold off going to this
762
(negated) state until the number of characters in the data field have
763
been skipped". If the could_continue flag was passed over from a previous
764
state, arrange for it to passed on. */
765
766
if (state_offset < 0)
767
{
768
if (current_state->data > 0)
769
{
770
ADD_NEW_DATA(state_offset, current_state->count,
771
current_state->data - 1);
772
if (could_continue) reset_could_continue = TRUE;
773
continue;
774
}
775
else
776
{
777
current_state->offset = state_offset = -state_offset;
778
}
779
}
780
781
/* Check for a duplicate state with the same count, and skip if found.
782
See the note at the head of this module about the possibility of improving
783
performance here. */
784
785
for (j = 0; j < i; j++)
786
{
787
if (active_states[j].offset == state_offset &&
788
active_states[j].count == current_state->count)
789
goto NEXT_ACTIVE_STATE;
790
}
791
792
/* The state offset is the offset to the opcode */
793
794
code = start_code + state_offset;
795
codevalue = *code;
796
797
/* If this opcode inspects a character, but we are at the end of the
798
subject, remember the fact for use when testing for a partial match. */
799
800
if (clen == 0 && poptable[codevalue] != 0)
801
could_continue = TRUE;
802
803
/* If this opcode is followed by an inline character, load it. It is
804
tempting to test for the presence of a subject character here, but that
805
is wrong, because sometimes zero repetitions of the subject are
806
permitted.
807
808
We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
809
argument that is not a data character - but is always one byte long because
810
the values are small. We have to take special action to deal with \P, \p,
811
\H, \h, \V, \v and \X in this case. To keep the other cases fast, convert
812
these ones to new opcodes. */
813
814
if (coptable[codevalue] > 0)
815
{
816
dlen = 1;
817
#ifdef SUPPORT_UNICODE
818
if (utf) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
819
#endif /* SUPPORT_UNICODE */
820
d = code[coptable[codevalue]];
821
if (codevalue >= OP_TYPESTAR)
822
{
823
switch(d)
824
{
825
case OP_ANYBYTE: return PCRE2_ERROR_DFA_UITEM;
826
case OP_NOTPROP:
827
case OP_PROP: codevalue += OP_PROP_EXTRA; break;
828
case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
829
case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
830
case OP_NOT_HSPACE:
831
case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
832
case OP_NOT_VSPACE:
833
case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
834
default: break;
835
}
836
}
837
}
838
else
839
{
840
dlen = 0; /* Not strictly necessary, but compilers moan */
841
d = NOTACHAR; /* if these variables are not set. */
842
}
843
844
845
/* Now process the individual opcodes */
846
847
switch (codevalue)
848
{
849
/* ========================================================================== */
850
/* Reached a closing bracket. If not at the end of the pattern, carry
851
on with the next opcode. For repeating opcodes, also add the repeat
852
state. Note that KETRPOS will always be encountered at the end of the
853
subpattern, because the possessive subpattern repeats are always handled
854
using recursive calls. Thus, it never adds any new states.
855
856
At the end of the (sub)pattern, unless we have an empty string and
857
PCRE2_NOTEMPTY is set, or PCRE2_NOTEMPTY_ATSTART is set and we are at the
858
start of the subject, save the match data, shifting up all previous
859
matches so we always have the longest first. */
860
861
case OP_KET:
862
case OP_KETRMIN:
863
case OP_KETRMAX:
864
case OP_KETRPOS:
865
if (code != end_code)
866
{
867
ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
868
if (codevalue != OP_KET)
869
{
870
ADD_ACTIVE(state_offset - (int)GET(code, 1), 0);
871
}
872
}
873
else
874
{
875
if (ptr > current_subject ||
876
((mb->moptions & PCRE2_NOTEMPTY) == 0 &&
877
((mb->moptions & PCRE2_NOTEMPTY_ATSTART) == 0 ||
878
current_subject > start_subject + mb->start_offset)))
879
{
880
if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
881
else if (match_count > 0 && ++match_count * 2 > (int)offsetcount)
882
match_count = 0;
883
count = ((match_count == 0)? (int)offsetcount : match_count * 2) - 2;
884
if (count > 0) (void)memmove(offsets + 2, offsets,
885
(size_t)count * sizeof(PCRE2_SIZE));
886
if (offsetcount >= 2)
887
{
888
offsets[0] = (PCRE2_SIZE)(current_subject - start_subject);
889
offsets[1] = (PCRE2_SIZE)(ptr - start_subject);
890
}
891
if ((mb->moptions & PCRE2_DFA_SHORTEST) != 0) return match_count;
892
}
893
}
894
break;
895
896
/* ========================================================================== */
897
/* These opcodes add to the current list of states without looking
898
at the current character. */
899
900
/*-----------------------------------------------------------------*/
901
case OP_ALT:
902
do { code += GET(code, 1); } while (*code == OP_ALT);
903
ADD_ACTIVE((int)(code - start_code), 0);
904
break;
905
906
/*-----------------------------------------------------------------*/
907
case OP_BRA:
908
case OP_SBRA:
909
do
910
{
911
ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
912
code += GET(code, 1);
913
}
914
while (*code == OP_ALT);
915
break;
916
917
/*-----------------------------------------------------------------*/
918
case OP_CBRA:
919
case OP_SCBRA:
920
ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE + IMM2_SIZE), 0);
921
code += GET(code, 1);
922
while (*code == OP_ALT)
923
{
924
ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
925
code += GET(code, 1);
926
}
927
break;
928
929
/*-----------------------------------------------------------------*/
930
case OP_BRAZERO:
931
case OP_BRAMINZERO:
932
ADD_ACTIVE(state_offset + 1, 0);
933
code += 1 + GET(code, 2);
934
while (*code == OP_ALT) code += GET(code, 1);
935
ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
936
break;
937
938
/*-----------------------------------------------------------------*/
939
case OP_SKIPZERO:
940
code += 1 + GET(code, 2);
941
while (*code == OP_ALT) code += GET(code, 1);
942
ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
943
break;
944
945
/*-----------------------------------------------------------------*/
946
case OP_CIRC:
947
if (ptr == start_subject && (mb->moptions & PCRE2_NOTBOL) == 0)
948
{ ADD_ACTIVE(state_offset + 1, 0); }
949
break;
950
951
/*-----------------------------------------------------------------*/
952
case OP_CIRCM:
953
if ((ptr == start_subject && (mb->moptions & PCRE2_NOTBOL) == 0) ||
954
((ptr != end_subject || (mb->poptions & PCRE2_ALT_CIRCUMFLEX) != 0 )
955
&& WAS_NEWLINE(ptr)))
956
{ ADD_ACTIVE(state_offset + 1, 0); }
957
break;
958
959
/*-----------------------------------------------------------------*/
960
case OP_EOD:
961
if (ptr >= end_subject)
962
{
963
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
964
return PCRE2_ERROR_PARTIAL;
965
else { ADD_ACTIVE(state_offset + 1, 0); }
966
}
967
break;
968
969
/*-----------------------------------------------------------------*/
970
case OP_SOD:
971
if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
972
break;
973
974
/*-----------------------------------------------------------------*/
975
case OP_SOM:
976
if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
977
break;
978
979
980
/* ========================================================================== */
981
/* These opcodes inspect the next subject character, and sometimes
982
the previous one as well, but do not have an argument. The variable
983
clen contains the length of the current character and is zero if we are
984
at the end of the subject. */
985
986
/*-----------------------------------------------------------------*/
987
case OP_ANY:
988
if (clen > 0 && !IS_NEWLINE(ptr))
989
{
990
if (ptr + 1 >= mb->end_subject &&
991
(mb->moptions & (PCRE2_PARTIAL_HARD)) != 0 &&
992
NLBLOCK->nltype == NLTYPE_FIXED &&
993
NLBLOCK->nllen == 2 &&
994
c == NLBLOCK->nl[0])
995
{
996
could_continue = partial_newline = TRUE;
997
}
998
else
999
{
1000
ADD_NEW(state_offset + 1, 0);
1001
}
1002
}
1003
break;
1004
1005
/*-----------------------------------------------------------------*/
1006
case OP_ALLANY:
1007
if (clen > 0)
1008
{ ADD_NEW(state_offset + 1, 0); }
1009
break;
1010
1011
/*-----------------------------------------------------------------*/
1012
case OP_EODN:
1013
if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - mb->nllen))
1014
{
1015
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
1016
return PCRE2_ERROR_PARTIAL;
1017
ADD_ACTIVE(state_offset + 1, 0);
1018
}
1019
break;
1020
1021
/*-----------------------------------------------------------------*/
1022
case OP_DOLL:
1023
if ((mb->moptions & PCRE2_NOTEOL) == 0)
1024
{
1025
if (clen == 0 && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
1026
could_continue = TRUE;
1027
else if (clen == 0 ||
1028
((mb->poptions & PCRE2_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
1029
(ptr == end_subject - mb->nllen)
1030
))
1031
{ ADD_ACTIVE(state_offset + 1, 0); }
1032
else if (ptr + 1 >= mb->end_subject &&
1033
(mb->moptions & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
1034
NLBLOCK->nltype == NLTYPE_FIXED &&
1035
NLBLOCK->nllen == 2 &&
1036
c == NLBLOCK->nl[0])
1037
{
1038
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
1039
{
1040
reset_could_continue = TRUE;
1041
ADD_NEW_DATA(-(state_offset + 1), 0, 1);
1042
}
1043
else could_continue = partial_newline = TRUE;
1044
}
1045
}
1046
break;
1047
1048
/*-----------------------------------------------------------------*/
1049
case OP_DOLLM:
1050
if ((mb->moptions & PCRE2_NOTEOL) == 0)
1051
{
1052
if (clen == 0 && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
1053
could_continue = TRUE;
1054
else if (clen == 0 ||
1055
((mb->poptions & PCRE2_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
1056
{ ADD_ACTIVE(state_offset + 1, 0); }
1057
else if (ptr + 1 >= mb->end_subject &&
1058
(mb->moptions & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
1059
NLBLOCK->nltype == NLTYPE_FIXED &&
1060
NLBLOCK->nllen == 2 &&
1061
c == NLBLOCK->nl[0])
1062
{
1063
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
1064
{
1065
reset_could_continue = TRUE;
1066
ADD_NEW_DATA(-(state_offset + 1), 0, 1);
1067
}
1068
else could_continue = partial_newline = TRUE;
1069
}
1070
}
1071
else if (IS_NEWLINE(ptr))
1072
{ ADD_ACTIVE(state_offset + 1, 0); }
1073
break;
1074
1075
/*-----------------------------------------------------------------*/
1076
1077
case OP_DIGIT:
1078
case OP_WHITESPACE:
1079
case OP_WORDCHAR:
1080
if (clen > 0 && c < 256 &&
1081
((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
1082
{ ADD_NEW(state_offset + 1, 0); }
1083
break;
1084
1085
/*-----------------------------------------------------------------*/
1086
case OP_NOT_DIGIT:
1087
case OP_NOT_WHITESPACE:
1088
case OP_NOT_WORDCHAR:
1089
if (clen > 0 && (c >= 256 ||
1090
((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
1091
{ ADD_NEW(state_offset + 1, 0); }
1092
break;
1093
1094
/*-----------------------------------------------------------------*/
1095
case OP_WORD_BOUNDARY:
1096
case OP_NOT_WORD_BOUNDARY:
1097
case OP_NOT_UCP_WORD_BOUNDARY:
1098
case OP_UCP_WORD_BOUNDARY:
1099
{
1100
int left_word, right_word;
1101
1102
if (ptr > start_subject)
1103
{
1104
PCRE2_SPTR temp = ptr - 1;
1105
if (temp < mb->start_used_ptr) mb->start_used_ptr = temp;
1106
#if defined SUPPORT_UNICODE && PCRE2_CODE_UNIT_WIDTH != 32
1107
if (utf) { BACKCHAR(temp); }
1108
#endif
1109
GETCHARTEST(d, temp);
1110
#ifdef SUPPORT_UNICODE
1111
if (codevalue == OP_UCP_WORD_BOUNDARY ||
1112
codevalue == OP_NOT_UCP_WORD_BOUNDARY)
1113
{
1114
int chartype = UCD_CHARTYPE(d);
1115
int category = PRIV(ucp_gentype)[chartype];
1116
left_word = (category == ucp_L || category == ucp_N ||
1117
chartype == ucp_Mn || chartype == ucp_Pc);
1118
}
1119
else
1120
#endif
1121
left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
1122
}
1123
else left_word = FALSE;
1124
1125
if (clen > 0)
1126
{
1127
if (ptr >= mb->last_used_ptr)
1128
{
1129
PCRE2_SPTR temp = ptr + 1;
1130
#if defined SUPPORT_UNICODE && PCRE2_CODE_UNIT_WIDTH != 32
1131
if (utf) { FORWARDCHARTEST(temp, mb->end_subject); }
1132
#endif
1133
mb->last_used_ptr = temp;
1134
}
1135
#ifdef SUPPORT_UNICODE
1136
if (codevalue == OP_UCP_WORD_BOUNDARY ||
1137
codevalue == OP_NOT_UCP_WORD_BOUNDARY)
1138
{
1139
int chartype = UCD_CHARTYPE(c);
1140
int category = PRIV(ucp_gentype)[chartype];
1141
right_word = (category == ucp_L || category == ucp_N ||
1142
chartype == ucp_Mn || chartype == ucp_Pc);
1143
}
1144
else
1145
#endif
1146
right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
1147
}
1148
else right_word = FALSE;
1149
1150
if ((left_word == right_word) ==
1151
(codevalue == OP_NOT_WORD_BOUNDARY ||
1152
codevalue == OP_NOT_UCP_WORD_BOUNDARY))
1153
{ ADD_ACTIVE(state_offset + 1, 0); }
1154
}
1155
break;
1156
1157
1158
/*-----------------------------------------------------------------*/
1159
/* Check the next character by Unicode property. We will get here only
1160
if the support is in the binary; otherwise a compile-time error occurs.
1161
*/
1162
1163
#ifdef SUPPORT_UNICODE
1164
case OP_PROP:
1165
case OP_NOTPROP:
1166
if (clen > 0)
1167
{
1168
BOOL OK;
1169
int chartype;
1170
const uint32_t *cp;
1171
const ucd_record * prop = GET_UCD(c);
1172
switch(code[1])
1173
{
1174
case PT_LAMP:
1175
chartype = prop->chartype;
1176
OK = chartype == ucp_Lu || chartype == ucp_Ll ||
1177
chartype == ucp_Lt;
1178
break;
1179
1180
case PT_GC:
1181
OK = PRIV(ucp_gentype)[prop->chartype] == code[2];
1182
break;
1183
1184
case PT_PC:
1185
OK = prop->chartype == code[2];
1186
break;
1187
1188
case PT_SC:
1189
OK = prop->script == code[2];
1190
break;
1191
1192
case PT_SCX:
1193
OK = (prop->script == code[2] ||
1194
MAPBIT(PRIV(ucd_script_sets) + UCD_SCRIPTX_PROP(prop), code[2]) != 0);
1195
break;
1196
1197
/* These are specials for combination cases. */
1198
1199
case PT_ALNUM:
1200
chartype = prop->chartype;
1201
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
1202
PRIV(ucp_gentype)[chartype] == ucp_N;
1203
break;
1204
1205
/* Perl space used to exclude VT, but from Perl 5.18 it is included,
1206
which means that Perl space and POSIX space are now identical. PCRE
1207
was changed at release 8.34. */
1208
1209
case PT_SPACE: /* Perl space */
1210
case PT_PXSPACE: /* POSIX space */
1211
switch(c)
1212
{
1213
HSPACE_CASES:
1214
VSPACE_CASES:
1215
OK = TRUE;
1216
break;
1217
1218
default:
1219
OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z;
1220
break;
1221
}
1222
break;
1223
1224
case PT_WORD:
1225
chartype = prop->chartype;
1226
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
1227
PRIV(ucp_gentype)[chartype] == ucp_N ||
1228
chartype == ucp_Mn || chartype == ucp_Pc;
1229
break;
1230
1231
case PT_CLIST:
1232
#if PCRE2_CODE_UNIT_WIDTH == 32
1233
if (c > MAX_UTF_CODE_POINT)
1234
{
1235
OK = FALSE;
1236
break;
1237
}
1238
#endif
1239
cp = PRIV(ucd_caseless_sets) + code[2];
1240
for (;;)
1241
{
1242
if (c < *cp) { OK = FALSE; break; }
1243
if (c == *cp++) { OK = TRUE; break; }
1244
}
1245
break;
1246
1247
case PT_UCNC:
1248
OK = c == CHAR_DOLLAR_SIGN || c == CHAR_COMMERCIAL_AT ||
1249
c == CHAR_GRAVE_ACCENT || (c >= 0xa0 && c <= 0xd7ff) ||
1250
c >= 0xe000;
1251
break;
1252
1253
case PT_BIDICL:
1254
OK = UCD_BIDICLASS(c) == code[2];
1255
break;
1256
1257
case PT_BOOL:
1258
OK = MAPBIT(PRIV(ucd_boolprop_sets) +
1259
UCD_BPROPS_PROP(prop), code[2]) != 0;
1260
break;
1261
1262
/* Should never occur, but keep compilers from grumbling. */
1263
1264
default:
1265
OK = codevalue != OP_PROP;
1266
break;
1267
}
1268
1269
if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1270
}
1271
break;
1272
#endif
1273
1274
1275
1276
/* ========================================================================== */
1277
/* These opcodes likewise inspect the subject character, but have an
1278
argument that is not a data character. It is one of these opcodes:
1279
OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_SPACE,
1280
OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1281
1282
case OP_TYPEPLUS:
1283
case OP_TYPEMINPLUS:
1284
case OP_TYPEPOSPLUS:
1285
count = current_state->count; /* Already matched */
1286
if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1287
if (clen > 0)
1288
{
1289
if (d == OP_ANY && ptr + 1 >= mb->end_subject &&
1290
(mb->moptions & (PCRE2_PARTIAL_HARD)) != 0 &&
1291
NLBLOCK->nltype == NLTYPE_FIXED &&
1292
NLBLOCK->nllen == 2 &&
1293
c == NLBLOCK->nl[0])
1294
{
1295
could_continue = partial_newline = TRUE;
1296
}
1297
else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1298
(c < 256 &&
1299
(d != OP_ANY || !IS_NEWLINE(ptr)) &&
1300
((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1301
{
1302
if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1303
{
1304
active_count--; /* Remove non-match possibility */
1305
next_active_state--;
1306
}
1307
count++;
1308
ADD_NEW(state_offset, count);
1309
}
1310
}
1311
break;
1312
1313
/*-----------------------------------------------------------------*/
1314
case OP_TYPEQUERY:
1315
case OP_TYPEMINQUERY:
1316
case OP_TYPEPOSQUERY:
1317
ADD_ACTIVE(state_offset + 2, 0);
1318
if (clen > 0)
1319
{
1320
if (d == OP_ANY && ptr + 1 >= mb->end_subject &&
1321
(mb->moptions & (PCRE2_PARTIAL_HARD)) != 0 &&
1322
NLBLOCK->nltype == NLTYPE_FIXED &&
1323
NLBLOCK->nllen == 2 &&
1324
c == NLBLOCK->nl[0])
1325
{
1326
could_continue = partial_newline = TRUE;
1327
}
1328
else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1329
(c < 256 &&
1330
(d != OP_ANY || !IS_NEWLINE(ptr)) &&
1331
((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1332
{
1333
if (codevalue == OP_TYPEPOSQUERY)
1334
{
1335
active_count--; /* Remove non-match possibility */
1336
next_active_state--;
1337
}
1338
ADD_NEW(state_offset + 2, 0);
1339
}
1340
}
1341
break;
1342
1343
/*-----------------------------------------------------------------*/
1344
case OP_TYPESTAR:
1345
case OP_TYPEMINSTAR:
1346
case OP_TYPEPOSSTAR:
1347
ADD_ACTIVE(state_offset + 2, 0);
1348
if (clen > 0)
1349
{
1350
if (d == OP_ANY && ptr + 1 >= mb->end_subject &&
1351
(mb->moptions & (PCRE2_PARTIAL_HARD)) != 0 &&
1352
NLBLOCK->nltype == NLTYPE_FIXED &&
1353
NLBLOCK->nllen == 2 &&
1354
c == NLBLOCK->nl[0])
1355
{
1356
could_continue = partial_newline = TRUE;
1357
}
1358
else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1359
(c < 256 &&
1360
(d != OP_ANY || !IS_NEWLINE(ptr)) &&
1361
((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1362
{
1363
if (codevalue == OP_TYPEPOSSTAR)
1364
{
1365
active_count--; /* Remove non-match possibility */
1366
next_active_state--;
1367
}
1368
ADD_NEW(state_offset, 0);
1369
}
1370
}
1371
break;
1372
1373
/*-----------------------------------------------------------------*/
1374
case OP_TYPEEXACT:
1375
count = current_state->count; /* Number already matched */
1376
if (clen > 0)
1377
{
1378
if (d == OP_ANY && ptr + 1 >= mb->end_subject &&
1379
(mb->moptions & (PCRE2_PARTIAL_HARD)) != 0 &&
1380
NLBLOCK->nltype == NLTYPE_FIXED &&
1381
NLBLOCK->nllen == 2 &&
1382
c == NLBLOCK->nl[0])
1383
{
1384
could_continue = partial_newline = TRUE;
1385
}
1386
else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1387
(c < 256 &&
1388
(d != OP_ANY || !IS_NEWLINE(ptr)) &&
1389
((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1390
{
1391
if (++count >= (int)GET2(code, 1))
1392
{ ADD_NEW(state_offset + 1 + IMM2_SIZE + 1, 0); }
1393
else
1394
{ ADD_NEW(state_offset, count); }
1395
}
1396
}
1397
break;
1398
1399
/*-----------------------------------------------------------------*/
1400
case OP_TYPEUPTO:
1401
case OP_TYPEMINUPTO:
1402
case OP_TYPEPOSUPTO:
1403
ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0);
1404
count = current_state->count; /* Number already matched */
1405
if (clen > 0)
1406
{
1407
if (d == OP_ANY && ptr + 1 >= mb->end_subject &&
1408
(mb->moptions & (PCRE2_PARTIAL_HARD)) != 0 &&
1409
NLBLOCK->nltype == NLTYPE_FIXED &&
1410
NLBLOCK->nllen == 2 &&
1411
c == NLBLOCK->nl[0])
1412
{
1413
could_continue = partial_newline = TRUE;
1414
}
1415
else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1416
(c < 256 &&
1417
(d != OP_ANY || !IS_NEWLINE(ptr)) &&
1418
((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1419
{
1420
if (codevalue == OP_TYPEPOSUPTO)
1421
{
1422
active_count--; /* Remove non-match possibility */
1423
next_active_state--;
1424
}
1425
if (++count >= (int)GET2(code, 1))
1426
{ ADD_NEW(state_offset + 2 + IMM2_SIZE, 0); }
1427
else
1428
{ ADD_NEW(state_offset, count); }
1429
}
1430
}
1431
break;
1432
1433
/* ========================================================================== */
1434
/* These are virtual opcodes that are used when something like
1435
OP_TYPEPLUS has OP_PROP, OP_NOTPROP, OP_ANYNL, or OP_EXTUNI as its
1436
argument. It keeps the code above fast for the other cases. The argument
1437
is in the d variable. */
1438
1439
#ifdef SUPPORT_UNICODE
1440
case OP_PROP_EXTRA + OP_TYPEPLUS:
1441
case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1442
case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1443
count = current_state->count; /* Already matched */
1444
if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1445
if (clen > 0)
1446
{
1447
BOOL OK;
1448
int chartype;
1449
const uint32_t *cp;
1450
const ucd_record * prop = GET_UCD(c);
1451
switch(code[2])
1452
{
1453
case PT_LAMP:
1454
chartype = prop->chartype;
1455
OK = chartype == ucp_Lu || chartype == ucp_Ll || chartype == ucp_Lt;
1456
break;
1457
1458
case PT_GC:
1459
OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1460
break;
1461
1462
case PT_PC:
1463
OK = prop->chartype == code[3];
1464
break;
1465
1466
case PT_SC:
1467
OK = prop->script == code[3];
1468
break;
1469
1470
case PT_SCX:
1471
OK = (prop->script == code[3] ||
1472
MAPBIT(PRIV(ucd_script_sets) + UCD_SCRIPTX_PROP(prop), code[3]) != 0);
1473
break;
1474
1475
/* These are specials for combination cases. */
1476
1477
case PT_ALNUM:
1478
chartype = prop->chartype;
1479
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
1480
PRIV(ucp_gentype)[chartype] == ucp_N;
1481
break;
1482
1483
/* Perl space used to exclude VT, but from Perl 5.18 it is included,
1484
which means that Perl space and POSIX space are now identical. PCRE
1485
was changed at release 8.34. */
1486
1487
case PT_SPACE: /* Perl space */
1488
case PT_PXSPACE: /* POSIX space */
1489
switch(c)
1490
{
1491
HSPACE_CASES:
1492
VSPACE_CASES:
1493
OK = TRUE;
1494
break;
1495
1496
default:
1497
OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z;
1498
break;
1499
}
1500
break;
1501
1502
case PT_WORD:
1503
chartype = prop->chartype;
1504
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
1505
PRIV(ucp_gentype)[chartype] == ucp_N ||
1506
chartype == ucp_Mn || chartype == ucp_Pc;
1507
break;
1508
1509
case PT_CLIST:
1510
#if PCRE2_CODE_UNIT_WIDTH == 32
1511
if (c > MAX_UTF_CODE_POINT)
1512
{
1513
OK = FALSE;
1514
break;
1515
}
1516
#endif
1517
cp = PRIV(ucd_caseless_sets) + code[3];
1518
for (;;)
1519
{
1520
if (c < *cp) { OK = FALSE; break; }
1521
if (c == *cp++) { OK = TRUE; break; }
1522
}
1523
break;
1524
1525
case PT_UCNC:
1526
OK = c == CHAR_DOLLAR_SIGN || c == CHAR_COMMERCIAL_AT ||
1527
c == CHAR_GRAVE_ACCENT || (c >= 0xa0 && c <= 0xd7ff) ||
1528
c >= 0xe000;
1529
break;
1530
1531
case PT_BIDICL:
1532
OK = UCD_BIDICLASS(c) == code[3];
1533
break;
1534
1535
case PT_BOOL:
1536
OK = MAPBIT(PRIV(ucd_boolprop_sets) +
1537
UCD_BPROPS_PROP(prop), code[3]) != 0;
1538
break;
1539
1540
/* Should never occur, but keep compilers from grumbling. */
1541
1542
default:
1543
OK = codevalue != OP_PROP;
1544
break;
1545
}
1546
1547
if (OK == (d == OP_PROP))
1548
{
1549
if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1550
{
1551
active_count--; /* Remove non-match possibility */
1552
next_active_state--;
1553
}
1554
count++;
1555
ADD_NEW(state_offset, count);
1556
}
1557
}
1558
break;
1559
1560
/*-----------------------------------------------------------------*/
1561
case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1562
case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1563
case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1564
count = current_state->count; /* Already matched */
1565
if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1566
if (clen > 0)
1567
{
1568
int ncount = 0;
1569
if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1570
{
1571
active_count--; /* Remove non-match possibility */
1572
next_active_state--;
1573
}
1574
(void)PRIV(extuni)(c, ptr + clen, mb->start_subject, end_subject, utf,
1575
&ncount);
1576
count++;
1577
ADD_NEW_DATA(-state_offset, count, ncount);
1578
}
1579
break;
1580
#endif
1581
1582
/*-----------------------------------------------------------------*/
1583
case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1584
case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1585
case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1586
count = current_state->count; /* Already matched */
1587
if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1588
if (clen > 0)
1589
{
1590
int ncount = 0;
1591
switch (c)
1592
{
1593
case CHAR_VT:
1594
case CHAR_FF:
1595
case CHAR_NEL:
1596
#ifndef EBCDIC
1597
case 0x2028:
1598
case 0x2029:
1599
#endif /* Not EBCDIC */
1600
if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
1601
goto ANYNL01;
1602
1603
case CHAR_CR:
1604
if (ptr + 1 < end_subject && UCHAR21TEST(ptr + 1) == CHAR_LF) ncount = 1;
1605
/* Fall through */
1606
1607
ANYNL01:
1608
case CHAR_LF:
1609
if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1610
{
1611
active_count--; /* Remove non-match possibility */
1612
next_active_state--;
1613
}
1614
count++;
1615
ADD_NEW_DATA(-state_offset, count, ncount);
1616
break;
1617
1618
default:
1619
break;
1620
}
1621
}
1622
break;
1623
1624
/*-----------------------------------------------------------------*/
1625
case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1626
case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1627
case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1628
count = current_state->count; /* Already matched */
1629
if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1630
if (clen > 0)
1631
{
1632
BOOL OK;
1633
switch (c)
1634
{
1635
VSPACE_CASES:
1636
OK = TRUE;
1637
break;
1638
1639
default:
1640
OK = FALSE;
1641
break;
1642
}
1643
1644
if (OK == (d == OP_VSPACE))
1645
{
1646
if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1647
{
1648
active_count--; /* Remove non-match possibility */
1649
next_active_state--;
1650
}
1651
count++;
1652
ADD_NEW_DATA(-state_offset, count, 0);
1653
}
1654
}
1655
break;
1656
1657
/*-----------------------------------------------------------------*/
1658
case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1659
case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1660
case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1661
count = current_state->count; /* Already matched */
1662
if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1663
if (clen > 0)
1664
{
1665
BOOL OK;
1666
switch (c)
1667
{
1668
HSPACE_CASES:
1669
OK = TRUE;
1670
break;
1671
1672
default:
1673
OK = FALSE;
1674
break;
1675
}
1676
1677
if (OK == (d == OP_HSPACE))
1678
{
1679
if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1680
{
1681
active_count--; /* Remove non-match possibility */
1682
next_active_state--;
1683
}
1684
count++;
1685
ADD_NEW_DATA(-state_offset, count, 0);
1686
}
1687
}
1688
break;
1689
1690
/*-----------------------------------------------------------------*/
1691
#ifdef SUPPORT_UNICODE
1692
case OP_PROP_EXTRA + OP_TYPEQUERY:
1693
case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1694
case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1695
count = 4;
1696
goto QS1;
1697
1698
case OP_PROP_EXTRA + OP_TYPESTAR:
1699
case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1700
case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1701
count = 0;
1702
1703
QS1:
1704
1705
ADD_ACTIVE(state_offset + 4, 0);
1706
if (clen > 0)
1707
{
1708
BOOL OK;
1709
int chartype;
1710
const uint32_t *cp;
1711
const ucd_record * prop = GET_UCD(c);
1712
switch(code[2])
1713
{
1714
case PT_LAMP:
1715
chartype = prop->chartype;
1716
OK = chartype == ucp_Lu || chartype == ucp_Ll || chartype == ucp_Lt;
1717
break;
1718
1719
case PT_GC:
1720
OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1721
break;
1722
1723
case PT_PC:
1724
OK = prop->chartype == code[3];
1725
break;
1726
1727
case PT_SC:
1728
OK = prop->script == code[3];
1729
break;
1730
1731
case PT_SCX:
1732
OK = (prop->script == code[3] ||
1733
MAPBIT(PRIV(ucd_script_sets) + UCD_SCRIPTX_PROP(prop), code[3]) != 0);
1734
break;
1735
1736
/* These are specials for combination cases. */
1737
1738
case PT_ALNUM:
1739
chartype = prop->chartype;
1740
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
1741
PRIV(ucp_gentype)[chartype] == ucp_N;
1742
break;
1743
1744
/* Perl space used to exclude VT, but from Perl 5.18 it is included,
1745
which means that Perl space and POSIX space are now identical. PCRE
1746
was changed at release 8.34. */
1747
1748
case PT_SPACE: /* Perl space */
1749
case PT_PXSPACE: /* POSIX space */
1750
switch(c)
1751
{
1752
HSPACE_CASES:
1753
VSPACE_CASES:
1754
OK = TRUE;
1755
break;
1756
1757
default:
1758
OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z;
1759
break;
1760
}
1761
break;
1762
1763
case PT_WORD:
1764
chartype = prop->chartype;
1765
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
1766
PRIV(ucp_gentype)[chartype] == ucp_N ||
1767
chartype == ucp_Mn || chartype == ucp_Pc;
1768
break;
1769
1770
case PT_CLIST:
1771
#if PCRE2_CODE_UNIT_WIDTH == 32
1772
if (c > MAX_UTF_CODE_POINT)
1773
{
1774
OK = FALSE;
1775
break;
1776
}
1777
#endif
1778
cp = PRIV(ucd_caseless_sets) + code[3];
1779
for (;;)
1780
{
1781
if (c < *cp) { OK = FALSE; break; }
1782
if (c == *cp++) { OK = TRUE; break; }
1783
}
1784
break;
1785
1786
case PT_UCNC:
1787
OK = c == CHAR_DOLLAR_SIGN || c == CHAR_COMMERCIAL_AT ||
1788
c == CHAR_GRAVE_ACCENT || (c >= 0xa0 && c <= 0xd7ff) ||
1789
c >= 0xe000;
1790
break;
1791
1792
case PT_BIDICL:
1793
OK = UCD_BIDICLASS(c) == code[3];
1794
break;
1795
1796
case PT_BOOL:
1797
OK = MAPBIT(PRIV(ucd_boolprop_sets) +
1798
UCD_BPROPS_PROP(prop), code[3]) != 0;
1799
break;
1800
1801
/* Should never occur, but keep compilers from grumbling. */
1802
1803
default:
1804
OK = codevalue != OP_PROP;
1805
break;
1806
}
1807
1808
if (OK == (d == OP_PROP))
1809
{
1810
if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1811
codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1812
{
1813
active_count--; /* Remove non-match possibility */
1814
next_active_state--;
1815
}
1816
ADD_NEW(state_offset + count, 0);
1817
}
1818
}
1819
break;
1820
1821
/*-----------------------------------------------------------------*/
1822
case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1823
case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1824
case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1825
count = 2;
1826
goto QS2;
1827
1828
case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1829
case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1830
case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1831
count = 0;
1832
1833
QS2:
1834
1835
ADD_ACTIVE(state_offset + 2, 0);
1836
if (clen > 0)
1837
{
1838
int ncount = 0;
1839
if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1840
codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1841
{
1842
active_count--; /* Remove non-match possibility */
1843
next_active_state--;
1844
}
1845
(void)PRIV(extuni)(c, ptr + clen, mb->start_subject, end_subject, utf,
1846
&ncount);
1847
ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1848
}
1849
break;
1850
#endif
1851
1852
/*-----------------------------------------------------------------*/
1853
case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1854
case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1855
case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1856
count = 2;
1857
goto QS3;
1858
1859
case OP_ANYNL_EXTRA + OP_TYPESTAR:
1860
case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1861
case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1862
count = 0;
1863
1864
QS3:
1865
ADD_ACTIVE(state_offset + 2, 0);
1866
if (clen > 0)
1867
{
1868
int ncount = 0;
1869
switch (c)
1870
{
1871
case CHAR_VT:
1872
case CHAR_FF:
1873
case CHAR_NEL:
1874
#ifndef EBCDIC
1875
case 0x2028:
1876
case 0x2029:
1877
#endif /* Not EBCDIC */
1878
if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
1879
goto ANYNL02;
1880
1881
case CHAR_CR:
1882
if (ptr + 1 < end_subject && UCHAR21TEST(ptr + 1) == CHAR_LF) ncount = 1;
1883
/* Fall through */
1884
1885
ANYNL02:
1886
case CHAR_LF:
1887
if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1888
codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1889
{
1890
active_count--; /* Remove non-match possibility */
1891
next_active_state--;
1892
}
1893
ADD_NEW_DATA(-(state_offset + (int)count), 0, ncount);
1894
break;
1895
1896
default:
1897
break;
1898
}
1899
}
1900
break;
1901
1902
/*-----------------------------------------------------------------*/
1903
case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1904
case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1905
case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1906
count = 2;
1907
goto QS4;
1908
1909
case OP_VSPACE_EXTRA + OP_TYPESTAR:
1910
case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1911
case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1912
count = 0;
1913
1914
QS4:
1915
ADD_ACTIVE(state_offset + 2, 0);
1916
if (clen > 0)
1917
{
1918
BOOL OK;
1919
switch (c)
1920
{
1921
VSPACE_CASES:
1922
OK = TRUE;
1923
break;
1924
1925
default:
1926
OK = FALSE;
1927
break;
1928
}
1929
if (OK == (d == OP_VSPACE))
1930
{
1931
if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1932
codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1933
{
1934
active_count--; /* Remove non-match possibility */
1935
next_active_state--;
1936
}
1937
ADD_NEW_DATA(-(state_offset + (int)count), 0, 0);
1938
}
1939
}
1940
break;
1941
1942
/*-----------------------------------------------------------------*/
1943
case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1944
case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1945
case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1946
count = 2;
1947
goto QS5;
1948
1949
case OP_HSPACE_EXTRA + OP_TYPESTAR:
1950
case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1951
case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1952
count = 0;
1953
1954
QS5:
1955
ADD_ACTIVE(state_offset + 2, 0);
1956
if (clen > 0)
1957
{
1958
BOOL OK;
1959
switch (c)
1960
{
1961
HSPACE_CASES:
1962
OK = TRUE;
1963
break;
1964
1965
default:
1966
OK = FALSE;
1967
break;
1968
}
1969
1970
if (OK == (d == OP_HSPACE))
1971
{
1972
if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1973
codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1974
{
1975
active_count--; /* Remove non-match possibility */
1976
next_active_state--;
1977
}
1978
ADD_NEW_DATA(-(state_offset + (int)count), 0, 0);
1979
}
1980
}
1981
break;
1982
1983
/*-----------------------------------------------------------------*/
1984
#ifdef SUPPORT_UNICODE
1985
case OP_PROP_EXTRA + OP_TYPEEXACT:
1986
case OP_PROP_EXTRA + OP_TYPEUPTO:
1987
case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1988
case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1989
if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1990
{ ADD_ACTIVE(state_offset + 1 + IMM2_SIZE + 3, 0); }
1991
count = current_state->count; /* Number already matched */
1992
if (clen > 0)
1993
{
1994
BOOL OK;
1995
int chartype;
1996
const uint32_t *cp;
1997
const ucd_record * prop = GET_UCD(c);
1998
switch(code[1 + IMM2_SIZE + 1])
1999
{
2000
case PT_LAMP:
2001
chartype = prop->chartype;
2002
OK = chartype == ucp_Lu || chartype == ucp_Ll || chartype == ucp_Lt;
2003
break;
2004
2005
case PT_GC:
2006
OK = PRIV(ucp_gentype)[prop->chartype] == code[1 + IMM2_SIZE + 2];
2007
break;
2008
2009
case PT_PC:
2010
OK = prop->chartype == code[1 + IMM2_SIZE + 2];
2011
break;
2012
2013
case PT_SC:
2014
OK = prop->script == code[1 + IMM2_SIZE + 2];
2015
break;
2016
2017
case PT_SCX:
2018
OK = (prop->script == code[1 + IMM2_SIZE + 2] ||
2019
MAPBIT(PRIV(ucd_script_sets) + UCD_SCRIPTX_PROP(prop),
2020
code[1 + IMM2_SIZE + 2]) != 0);
2021
break;
2022
2023
/* These are specials for combination cases. */
2024
2025
case PT_ALNUM:
2026
chartype = prop->chartype;
2027
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
2028
PRIV(ucp_gentype)[chartype] == ucp_N;
2029
break;
2030
2031
/* Perl space used to exclude VT, but from Perl 5.18 it is included,
2032
which means that Perl space and POSIX space are now identical. PCRE
2033
was changed at release 8.34. */
2034
2035
case PT_SPACE: /* Perl space */
2036
case PT_PXSPACE: /* POSIX space */
2037
switch(c)
2038
{
2039
HSPACE_CASES:
2040
VSPACE_CASES:
2041
OK = TRUE;
2042
break;
2043
2044
default:
2045
OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z;
2046
break;
2047
}
2048
break;
2049
2050
case PT_WORD:
2051
chartype = prop->chartype;
2052
OK = PRIV(ucp_gentype)[chartype] == ucp_L ||
2053
PRIV(ucp_gentype)[chartype] == ucp_N ||
2054
chartype == ucp_Mn || chartype == ucp_Pc;
2055
break;
2056
2057
case PT_CLIST:
2058
#if PCRE2_CODE_UNIT_WIDTH == 32
2059
if (c > MAX_UTF_CODE_POINT)
2060
{
2061
OK = FALSE;
2062
break;
2063
}
2064
#endif
2065
cp = PRIV(ucd_caseless_sets) + code[1 + IMM2_SIZE + 2];
2066
for (;;)
2067
{
2068
if (c < *cp) { OK = FALSE; break; }
2069
if (c == *cp++) { OK = TRUE; break; }
2070
}
2071
break;
2072
2073
case PT_UCNC:
2074
OK = c == CHAR_DOLLAR_SIGN || c == CHAR_COMMERCIAL_AT ||
2075
c == CHAR_GRAVE_ACCENT || (c >= 0xa0 && c <= 0xd7ff) ||
2076
c >= 0xe000;
2077
break;
2078
2079
case PT_BIDICL:
2080
OK = UCD_BIDICLASS(c) == code[1 + IMM2_SIZE + 2];
2081
break;
2082
2083
case PT_BOOL:
2084
OK = MAPBIT(PRIV(ucd_boolprop_sets) +
2085
UCD_BPROPS_PROP(prop), code[1 + IMM2_SIZE + 2]) != 0;
2086
break;
2087
2088
/* Should never occur, but keep compilers from grumbling. */
2089
2090
default:
2091
OK = codevalue != OP_PROP;
2092
break;
2093
}
2094
2095
if (OK == (d == OP_PROP))
2096
{
2097
if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
2098
{
2099
active_count--; /* Remove non-match possibility */
2100
next_active_state--;
2101
}
2102
if (++count >= (int)GET2(code, 1))
2103
{ ADD_NEW(state_offset + 1 + IMM2_SIZE + 3, 0); }
2104
else
2105
{ ADD_NEW(state_offset, count); }
2106
}
2107
}
2108
break;
2109
2110
/*-----------------------------------------------------------------*/
2111
case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
2112
case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
2113
case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
2114
case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
2115
if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
2116
{ ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2117
count = current_state->count; /* Number already matched */
2118
if (clen > 0)
2119
{
2120
PCRE2_SPTR nptr;
2121
int ncount = 0;
2122
if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
2123
{
2124
active_count--; /* Remove non-match possibility */
2125
next_active_state--;
2126
}
2127
nptr = PRIV(extuni)(c, ptr + clen, mb->start_subject, end_subject, utf,
2128
&ncount);
2129
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
2130
reset_could_continue = TRUE;
2131
if (++count >= (int)GET2(code, 1))
2132
{ ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
2133
else
2134
{ ADD_NEW_DATA(-state_offset, count, ncount); }
2135
}
2136
break;
2137
#endif
2138
2139
/*-----------------------------------------------------------------*/
2140
case OP_ANYNL_EXTRA + OP_TYPEEXACT:
2141
case OP_ANYNL_EXTRA + OP_TYPEUPTO:
2142
case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
2143
case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
2144
if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
2145
{ ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2146
count = current_state->count; /* Number already matched */
2147
if (clen > 0)
2148
{
2149
int ncount = 0;
2150
switch (c)
2151
{
2152
case CHAR_VT:
2153
case CHAR_FF:
2154
case CHAR_NEL:
2155
#ifndef EBCDIC
2156
case 0x2028:
2157
case 0x2029:
2158
#endif /* Not EBCDIC */
2159
if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
2160
goto ANYNL03;
2161
2162
case CHAR_CR:
2163
if (ptr + 1 < end_subject && UCHAR21TEST(ptr + 1) == CHAR_LF) ncount = 1;
2164
/* Fall through */
2165
2166
ANYNL03:
2167
case CHAR_LF:
2168
if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
2169
{
2170
active_count--; /* Remove non-match possibility */
2171
next_active_state--;
2172
}
2173
if (++count >= (int)GET2(code, 1))
2174
{ ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
2175
else
2176
{ ADD_NEW_DATA(-state_offset, count, ncount); }
2177
break;
2178
2179
default:
2180
break;
2181
}
2182
}
2183
break;
2184
2185
/*-----------------------------------------------------------------*/
2186
case OP_VSPACE_EXTRA + OP_TYPEEXACT:
2187
case OP_VSPACE_EXTRA + OP_TYPEUPTO:
2188
case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
2189
case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
2190
if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
2191
{ ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2192
count = current_state->count; /* Number already matched */
2193
if (clen > 0)
2194
{
2195
BOOL OK;
2196
switch (c)
2197
{
2198
VSPACE_CASES:
2199
OK = TRUE;
2200
break;
2201
2202
default:
2203
OK = FALSE;
2204
}
2205
2206
if (OK == (d == OP_VSPACE))
2207
{
2208
if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
2209
{
2210
active_count--; /* Remove non-match possibility */
2211
next_active_state--;
2212
}
2213
if (++count >= (int)GET2(code, 1))
2214
{ ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2215
else
2216
{ ADD_NEW_DATA(-state_offset, count, 0); }
2217
}
2218
}
2219
break;
2220
2221
/*-----------------------------------------------------------------*/
2222
case OP_HSPACE_EXTRA + OP_TYPEEXACT:
2223
case OP_HSPACE_EXTRA + OP_TYPEUPTO:
2224
case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
2225
case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
2226
if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
2227
{ ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2228
count = current_state->count; /* Number already matched */
2229
if (clen > 0)
2230
{
2231
BOOL OK;
2232
switch (c)
2233
{
2234
HSPACE_CASES:
2235
OK = TRUE;
2236
break;
2237
2238
default:
2239
OK = FALSE;
2240
break;
2241
}
2242
2243
if (OK == (d == OP_HSPACE))
2244
{
2245
if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
2246
{
2247
active_count--; /* Remove non-match possibility */
2248
next_active_state--;
2249
}
2250
if (++count >= (int)GET2(code, 1))
2251
{ ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2252
else
2253
{ ADD_NEW_DATA(-state_offset, count, 0); }
2254
}
2255
}
2256
break;
2257
2258
/* ========================================================================== */
2259
/* These opcodes are followed by a character that is usually compared
2260
to the current subject character; it is loaded into d. We still get
2261
here even if there is no subject character, because in some cases zero
2262
repetitions are permitted. */
2263
2264
/*-----------------------------------------------------------------*/
2265
case OP_CHAR:
2266
if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
2267
break;
2268
2269
/*-----------------------------------------------------------------*/
2270
case OP_CHARI:
2271
if (clen == 0) break;
2272
2273
#ifdef SUPPORT_UNICODE
2274
if (utf_or_ucp)
2275
{
2276
if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
2277
{
2278
unsigned int othercase;
2279
if (c < 128)
2280
othercase = fcc[c];
2281
else
2282
othercase = UCD_OTHERCASE(c);
2283
if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2284
}
2285
}
2286
else
2287
#endif /* SUPPORT_UNICODE */
2288
/* Not UTF or UCP mode */
2289
{
2290
if (TABLE_GET(c, lcc, c) == TABLE_GET(d, lcc, d))
2291
{ ADD_NEW(state_offset + 2, 0); }
2292
}
2293
break;
2294
2295
2296
#ifdef SUPPORT_UNICODE
2297
/*-----------------------------------------------------------------*/
2298
/* This is a tricky one because it can match more than one character.
2299
Find out how many characters to skip, and then set up a negative state
2300
to wait for them to pass before continuing. */
2301
2302
case OP_EXTUNI:
2303
if (clen > 0)
2304
{
2305
int ncount = 0;
2306
PCRE2_SPTR nptr = PRIV(extuni)(c, ptr + clen, mb->start_subject,
2307
end_subject, utf, &ncount);
2308
if (nptr >= end_subject && (mb->moptions & PCRE2_PARTIAL_HARD) != 0)
2309
reset_could_continue = TRUE;
2310
ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2311
}
2312
break;
2313
#endif
2314
2315
/*-----------------------------------------------------------------*/
2316
/* This is a tricky like EXTUNI because it too can match more than one
2317
character (when CR is followed by LF). In this case, set up a negative
2318
state to wait for one character to pass before continuing. */
2319
2320
case OP_ANYNL:
2321
if (clen > 0) switch(c)
2322
{
2323
case CHAR_VT:
2324
case CHAR_FF:
2325
case CHAR_NEL:
2326
#ifndef EBCDIC
2327
case 0x2028:
2328
case 0x2029:
2329
#endif /* Not EBCDIC */
2330
if (mb->bsr_convention == PCRE2_BSR_ANYCRLF) break;
2331
PCRE2_FALLTHROUGH /* Fall through */
2332
2333
case CHAR_LF:
2334
ADD_NEW(state_offset + 1, 0);
2335
break;
2336
2337
case CHAR_CR:
2338
if (ptr + 1 >= end_subject)
2339
{
2340
ADD_NEW(state_offset + 1, 0);
2341
if ((mb->moptions & PCRE2_PARTIAL_HARD) != 0)
2342
reset_could_continue = TRUE;
2343
}
2344
else if (UCHAR21TEST(ptr + 1) == CHAR_LF)
2345
{
2346
ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2347
}
2348
else
2349
{
2350
ADD_NEW(state_offset + 1, 0);
2351
}
2352
break;
2353
}
2354
break;
2355
2356
/*-----------------------------------------------------------------*/
2357
case OP_NOT_VSPACE:
2358
if (clen > 0) switch(c)
2359
{
2360
VSPACE_CASES:
2361
break;
2362
2363
default:
2364
ADD_NEW(state_offset + 1, 0);
2365
break;
2366
}
2367
break;
2368
2369
/*-----------------------------------------------------------------*/
2370
case OP_VSPACE:
2371
if (clen > 0) switch(c)
2372
{
2373
VSPACE_CASES:
2374
ADD_NEW(state_offset + 1, 0);
2375
break;
2376
2377
default:
2378
break;
2379
}
2380
break;
2381
2382
/*-----------------------------------------------------------------*/
2383
case OP_NOT_HSPACE:
2384
if (clen > 0) switch(c)
2385
{
2386
HSPACE_CASES:
2387
break;
2388
2389
default:
2390
ADD_NEW(state_offset + 1, 0);
2391
break;
2392
}
2393
break;
2394
2395
/*-----------------------------------------------------------------*/
2396
case OP_HSPACE:
2397
if (clen > 0) switch(c)
2398
{
2399
HSPACE_CASES:
2400
ADD_NEW(state_offset + 1, 0);
2401
break;
2402
2403
default:
2404
break;
2405
}
2406
break;
2407
2408
/*-----------------------------------------------------------------*/
2409
/* Match a negated single character casefully. */
2410
2411
case OP_NOT:
2412
if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2413
break;
2414
2415
/*-----------------------------------------------------------------*/
2416
/* Match a negated single character caselessly. */
2417
2418
case OP_NOTI:
2419
if (clen > 0)
2420
{
2421
uint32_t otherd;
2422
#ifdef SUPPORT_UNICODE
2423
if (utf_or_ucp && d >= 128)
2424
otherd = UCD_OTHERCASE(d);
2425
else
2426
#endif /* SUPPORT_UNICODE */
2427
otherd = TABLE_GET(d, fcc, d);
2428
if (c != d && c != otherd)
2429
{ ADD_NEW(state_offset + dlen + 1, 0); }
2430
}
2431
break;
2432
2433
/*-----------------------------------------------------------------*/
2434
case OP_PLUSI:
2435
case OP_MINPLUSI:
2436
case OP_POSPLUSI:
2437
case OP_NOTPLUSI:
2438
case OP_NOTMINPLUSI:
2439
case OP_NOTPOSPLUSI:
2440
caseless = TRUE;
2441
codevalue -= OP_STARI - OP_STAR;
2442
2443
PCRE2_FALLTHROUGH /* Fall through */
2444
case OP_PLUS:
2445
case OP_MINPLUS:
2446
case OP_POSPLUS:
2447
case OP_NOTPLUS:
2448
case OP_NOTMINPLUS:
2449
case OP_NOTPOSPLUS:
2450
count = current_state->count; /* Already matched */
2451
if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2452
if (clen > 0)
2453
{
2454
uint32_t otherd = NOTACHAR;
2455
if (caseless)
2456
{
2457
#ifdef SUPPORT_UNICODE
2458
if (utf_or_ucp && d >= 128)
2459
otherd = UCD_OTHERCASE(d);
2460
else
2461
#endif /* SUPPORT_UNICODE */
2462
otherd = TABLE_GET(d, fcc, d);
2463
}
2464
if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2465
{
2466
if (count > 0 &&
2467
(codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2468
{
2469
active_count--; /* Remove non-match possibility */
2470
next_active_state--;
2471
}
2472
count++;
2473
ADD_NEW(state_offset, count);
2474
}
2475
}
2476
break;
2477
2478
/*-----------------------------------------------------------------*/
2479
case OP_QUERYI:
2480
case OP_MINQUERYI:
2481
case OP_POSQUERYI:
2482
case OP_NOTQUERYI:
2483
case OP_NOTMINQUERYI:
2484
case OP_NOTPOSQUERYI:
2485
caseless = TRUE;
2486
codevalue -= OP_STARI - OP_STAR;
2487
PCRE2_FALLTHROUGH /* Fall through */
2488
case OP_QUERY:
2489
case OP_MINQUERY:
2490
case OP_POSQUERY:
2491
case OP_NOTQUERY:
2492
case OP_NOTMINQUERY:
2493
case OP_NOTPOSQUERY:
2494
ADD_ACTIVE(state_offset + dlen + 1, 0);
2495
if (clen > 0)
2496
{
2497
uint32_t otherd = NOTACHAR;
2498
if (caseless)
2499
{
2500
#ifdef SUPPORT_UNICODE
2501
if (utf_or_ucp && d >= 128)
2502
otherd = UCD_OTHERCASE(d);
2503
else
2504
#endif /* SUPPORT_UNICODE */
2505
otherd = TABLE_GET(d, fcc, d);
2506
}
2507
if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2508
{
2509
if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2510
{
2511
active_count--; /* Remove non-match possibility */
2512
next_active_state--;
2513
}
2514
ADD_NEW(state_offset + dlen + 1, 0);
2515
}
2516
}
2517
break;
2518
2519
/*-----------------------------------------------------------------*/
2520
case OP_STARI:
2521
case OP_MINSTARI:
2522
case OP_POSSTARI:
2523
case OP_NOTSTARI:
2524
case OP_NOTMINSTARI:
2525
case OP_NOTPOSSTARI:
2526
caseless = TRUE;
2527
codevalue -= OP_STARI - OP_STAR;
2528
PCRE2_FALLTHROUGH /* Fall through */
2529
case OP_STAR:
2530
case OP_MINSTAR:
2531
case OP_POSSTAR:
2532
case OP_NOTSTAR:
2533
case OP_NOTMINSTAR:
2534
case OP_NOTPOSSTAR:
2535
ADD_ACTIVE(state_offset + dlen + 1, 0);
2536
if (clen > 0)
2537
{
2538
uint32_t otherd = NOTACHAR;
2539
if (caseless)
2540
{
2541
#ifdef SUPPORT_UNICODE
2542
if (utf_or_ucp && d >= 128)
2543
otherd = UCD_OTHERCASE(d);
2544
else
2545
#endif /* SUPPORT_UNICODE */
2546
otherd = TABLE_GET(d, fcc, d);
2547
}
2548
if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2549
{
2550
if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2551
{
2552
active_count--; /* Remove non-match possibility */
2553
next_active_state--;
2554
}
2555
ADD_NEW(state_offset, 0);
2556
}
2557
}
2558
break;
2559
2560
/*-----------------------------------------------------------------*/
2561
case OP_EXACTI:
2562
case OP_NOTEXACTI:
2563
caseless = TRUE;
2564
codevalue -= OP_STARI - OP_STAR;
2565
PCRE2_FALLTHROUGH /* Fall through */
2566
case OP_EXACT:
2567
case OP_NOTEXACT:
2568
count = current_state->count; /* Number already matched */
2569
if (clen > 0)
2570
{
2571
uint32_t otherd = NOTACHAR;
2572
if (caseless)
2573
{
2574
#ifdef SUPPORT_UNICODE
2575
if (utf_or_ucp && d >= 128)
2576
otherd = UCD_OTHERCASE(d);
2577
else
2578
#endif /* SUPPORT_UNICODE */
2579
otherd = TABLE_GET(d, fcc, d);
2580
}
2581
if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2582
{
2583
if (++count >= (int)GET2(code, 1))
2584
{ ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2585
else
2586
{ ADD_NEW(state_offset, count); }
2587
}
2588
}
2589
break;
2590
2591
/*-----------------------------------------------------------------*/
2592
case OP_UPTOI:
2593
case OP_MINUPTOI:
2594
case OP_POSUPTOI:
2595
case OP_NOTUPTOI:
2596
case OP_NOTMINUPTOI:
2597
case OP_NOTPOSUPTOI:
2598
caseless = TRUE;
2599
codevalue -= OP_STARI - OP_STAR;
2600
PCRE2_FALLTHROUGH /* Fall through */
2601
case OP_UPTO:
2602
case OP_MINUPTO:
2603
case OP_POSUPTO:
2604
case OP_NOTUPTO:
2605
case OP_NOTMINUPTO:
2606
case OP_NOTPOSUPTO:
2607
ADD_ACTIVE(state_offset + dlen + 1 + IMM2_SIZE, 0);
2608
count = current_state->count; /* Number already matched */
2609
if (clen > 0)
2610
{
2611
uint32_t otherd = NOTACHAR;
2612
if (caseless)
2613
{
2614
#ifdef SUPPORT_UNICODE
2615
if (utf_or_ucp && d >= 128)
2616
otherd = UCD_OTHERCASE(d);
2617
else
2618
#endif /* SUPPORT_UNICODE */
2619
otherd = TABLE_GET(d, fcc, d);
2620
}
2621
if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2622
{
2623
if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2624
{
2625
active_count--; /* Remove non-match possibility */
2626
next_active_state--;
2627
}
2628
if (++count >= (int)GET2(code, 1))
2629
{ ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2630
else
2631
{ ADD_NEW(state_offset, count); }
2632
}
2633
}
2634
break;
2635
2636
2637
/* ========================================================================== */
2638
/* These are the class-handling opcodes */
2639
2640
case OP_CLASS:
2641
case OP_NCLASS:
2642
#ifdef SUPPORT_WIDE_CHARS
2643
case OP_XCLASS:
2644
case OP_ECLASS:
2645
#endif
2646
{
2647
BOOL isinclass = FALSE;
2648
int next_state_offset;
2649
PCRE2_SPTR ecode;
2650
2651
#ifdef SUPPORT_WIDE_CHARS
2652
/* An extended class may have a table or a list of single characters,
2653
ranges, or both, and it may be positive or negative. There's a
2654
function that sorts all this out. */
2655
2656
if (codevalue == OP_XCLASS)
2657
{
2658
ecode = code + GET(code, 1);
2659
if (clen > 0)
2660
isinclass = PRIV(xclass)(c, code + 1 + LINK_SIZE,
2661
(const uint8_t*)mb->start_code, utf);
2662
}
2663
2664
/* A nested set-based class has internal opcodes for performing
2665
set operations. */
2666
2667
else if (codevalue == OP_ECLASS)
2668
{
2669
ecode = code + GET(code, 1);
2670
if (clen > 0)
2671
isinclass = PRIV(eclass)(c, code + 1 + LINK_SIZE, ecode,
2672
(const uint8_t*)mb->start_code, utf);
2673
}
2674
2675
else
2676
#endif /* SUPPORT_WIDE_CHARS */
2677
2678
/* For a simple class, there is always just a 32-byte table, and we
2679
can set isinclass from it. */
2680
2681
{
2682
ecode = code + 1 + (32 / sizeof(PCRE2_UCHAR));
2683
if (clen > 0)
2684
{
2685
isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2686
((((const uint8_t *)(code + 1))[c/8] & (1u << (c&7))) != 0);
2687
}
2688
}
2689
2690
/* At this point, isinclass is set for all kinds of class, and ecode
2691
points to the byte after the end of the class. If there is a
2692
quantifier, this is where it will be. */
2693
2694
next_state_offset = (int)(ecode - start_code);
2695
2696
switch (*ecode)
2697
{
2698
case OP_CRSTAR:
2699
case OP_CRMINSTAR:
2700
case OP_CRPOSSTAR:
2701
ADD_ACTIVE(next_state_offset + 1, 0);
2702
if (isinclass)
2703
{
2704
if (*ecode == OP_CRPOSSTAR)
2705
{
2706
active_count--; /* Remove non-match possibility */
2707
next_active_state--;
2708
}
2709
ADD_NEW(state_offset, 0);
2710
}
2711
break;
2712
2713
case OP_CRPLUS:
2714
case OP_CRMINPLUS:
2715
case OP_CRPOSPLUS:
2716
count = current_state->count; /* Already matched */
2717
if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2718
if (isinclass)
2719
{
2720
if (count > 0 && *ecode == OP_CRPOSPLUS)
2721
{
2722
active_count--; /* Remove non-match possibility */
2723
next_active_state--;
2724
}
2725
count++;
2726
ADD_NEW(state_offset, count);
2727
}
2728
break;
2729
2730
case OP_CRQUERY:
2731
case OP_CRMINQUERY:
2732
case OP_CRPOSQUERY:
2733
ADD_ACTIVE(next_state_offset + 1, 0);
2734
if (isinclass)
2735
{
2736
if (*ecode == OP_CRPOSQUERY)
2737
{
2738
active_count--; /* Remove non-match possibility */
2739
next_active_state--;
2740
}
2741
ADD_NEW(next_state_offset + 1, 0);
2742
}
2743
break;
2744
2745
case OP_CRRANGE:
2746
case OP_CRMINRANGE:
2747
case OP_CRPOSRANGE:
2748
count = current_state->count; /* Already matched */
2749
if (count >= (int)GET2(ecode, 1))
2750
{ ADD_ACTIVE(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2751
if (isinclass)
2752
{
2753
int max = (int)GET2(ecode, 1 + IMM2_SIZE);
2754
2755
if (*ecode == OP_CRPOSRANGE && count >= (int)GET2(ecode, 1))
2756
{
2757
active_count--; /* Remove non-match possibility */
2758
next_active_state--;
2759
}
2760
2761
if (++count >= max && max != 0) /* Max 0 => no limit */
2762
{ ADD_NEW(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2763
else
2764
{ ADD_NEW(state_offset, count); }
2765
}
2766
break;
2767
2768
default:
2769
if (isinclass) { ADD_NEW(next_state_offset, 0); }
2770
break;
2771
}
2772
}
2773
break;
2774
2775
/* ========================================================================== */
2776
/* These are the opcodes for fancy brackets of various kinds. We have
2777
to use recursion in order to handle them. The "always failing" assertion
2778
(?!) is optimised to OP_FAIL when compiling, so we have to support that,
2779
though the other "backtracking verbs" are not supported. */
2780
2781
case OP_FAIL:
2782
break;
2783
2784
case OP_ASSERT:
2785
case OP_ASSERT_NOT:
2786
case OP_ASSERTBACK:
2787
case OP_ASSERTBACK_NOT:
2788
{
2789
int rc;
2790
int *local_workspace;
2791
PCRE2_SIZE *local_offsets;
2792
PCRE2_SPTR endasscode = code + GET(code, 1);
2793
RWS_anchor *rws = (RWS_anchor *)RWS;
2794
2795
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
2796
{
2797
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
2798
if (rc != 0) return rc;
2799
RWS = (int *)rws;
2800
}
2801
2802
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
2803
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
2804
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
2805
2806
while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2807
2808
rc = internal_dfa_match(
2809
mb, /* static match data */
2810
code, /* this subexpression's code */
2811
ptr, /* where we currently are */
2812
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
2813
local_offsets, /* offset vector */
2814
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
2815
local_workspace, /* workspace vector */
2816
RWS_RSIZE, /* size of same */
2817
rlevel, /* function recursion level */
2818
RWS); /* recursion workspace */
2819
2820
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
2821
2822
if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc;
2823
if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2824
{ ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2825
}
2826
break;
2827
2828
/*-----------------------------------------------------------------*/
2829
case OP_COND:
2830
case OP_SCOND:
2831
{
2832
int codelink = (int)GET(code, 1);
2833
PCRE2_UCHAR condcode;
2834
2835
/* Because of the way auto-callout works during compile, a callout item
2836
is inserted between OP_COND and an assertion condition. This does not
2837
happen for the other conditions. */
2838
2839
if (code[LINK_SIZE + 1] == OP_CALLOUT
2840
|| code[LINK_SIZE + 1] == OP_CALLOUT_STR)
2841
{
2842
PCRE2_SIZE callout_length;
2843
rrc = do_callout_dfa(code, offsets, current_subject, ptr, mb,
2844
1 + LINK_SIZE, &callout_length);
2845
if (rrc < 0) return rrc; /* Abandon */
2846
if (rrc > 0) break; /* Fail this thread */
2847
code += callout_length; /* Skip callout data */
2848
}
2849
2850
condcode = code[LINK_SIZE+1];
2851
2852
/* Back reference conditions and duplicate named recursion conditions
2853
are not supported */
2854
2855
if (condcode == OP_CREF || condcode == OP_DNCREF ||
2856
condcode == OP_DNRREF)
2857
return PCRE2_ERROR_DFA_UCOND;
2858
2859
/* The DEFINE condition is always false, and the assertion (?!) is
2860
converted to OP_FAIL. */
2861
2862
if (condcode == OP_FALSE || condcode == OP_FAIL)
2863
{ ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2864
2865
/* There is also an always-true condition */
2866
2867
else if (condcode == OP_TRUE)
2868
{ ADD_ACTIVE(state_offset + LINK_SIZE + 2, 0); }
2869
2870
/* The only supported version of OP_RREF is for the value RREF_ANY,
2871
which means "test if in any recursion". We can't test for specifically
2872
recursed groups. */
2873
2874
else if (condcode == OP_RREF)
2875
{
2876
unsigned int value = GET2(code, LINK_SIZE + 2);
2877
if (value != RREF_ANY) return PCRE2_ERROR_DFA_UCOND;
2878
if (mb->recursive != NULL)
2879
{ ADD_ACTIVE(state_offset + LINK_SIZE + 2 + IMM2_SIZE, 0); }
2880
else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2881
}
2882
2883
/* Otherwise, the condition is an assertion */
2884
2885
else
2886
{
2887
int rc;
2888
int *local_workspace;
2889
PCRE2_SIZE *local_offsets;
2890
PCRE2_SPTR asscode = code + LINK_SIZE + 1;
2891
PCRE2_SPTR endasscode = asscode + GET(asscode, 1);
2892
RWS_anchor *rws = (RWS_anchor *)RWS;
2893
2894
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
2895
{
2896
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
2897
if (rc != 0) return rc;
2898
RWS = (int *)rws;
2899
}
2900
2901
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
2902
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
2903
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
2904
2905
while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2906
2907
rc = internal_dfa_match(
2908
mb, /* fixed match data */
2909
asscode, /* this subexpression's code */
2910
ptr, /* where we currently are */
2911
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
2912
local_offsets, /* offset vector */
2913
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
2914
local_workspace, /* workspace vector */
2915
RWS_RSIZE, /* size of same */
2916
rlevel, /* function recursion level */
2917
RWS); /* recursion workspace */
2918
2919
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
2920
2921
if (rc < 0 && rc != PCRE2_ERROR_NOMATCH) return rc;
2922
if ((rc >= 0) ==
2923
(condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2924
{ ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2925
else
2926
{ ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2927
}
2928
}
2929
break;
2930
2931
/*-----------------------------------------------------------------*/
2932
case OP_RECURSE:
2933
{
2934
int rc;
2935
int *local_workspace;
2936
PCRE2_SIZE *local_offsets;
2937
RWS_anchor *rws = (RWS_anchor *)RWS;
2938
PCRE2_SPTR callpat = start_code + GET(code, 1);
2939
uint32_t recno = (callpat == mb->start_code)? 0 :
2940
GET2(callpat, 1 + LINK_SIZE);
2941
2942
/* Argument list has not been supported yet. */
2943
if (code[1 + LINK_SIZE] == OP_CREF) return PCRE2_ERROR_DFA_UITEM;
2944
2945
if (rws->free < RWS_RSIZE + RWS_OVEC_RSIZE)
2946
{
2947
rc = more_workspace(&rws, RWS_OVEC_RSIZE, mb);
2948
if (rc != 0) return rc;
2949
RWS = (int *)rws;
2950
}
2951
2952
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
2953
local_workspace = ((int *)local_offsets) + RWS_OVEC_RSIZE;
2954
rws->free -= RWS_RSIZE + RWS_OVEC_RSIZE;
2955
2956
/* Check for repeating a recursion without advancing the subject
2957
pointer or last used character. This should catch convoluted mutual
2958
recursions. (Some simple cases are caught at compile time.) */
2959
2960
for (dfa_recursion_info *ri = mb->recursive;
2961
ri != NULL;
2962
ri = ri->prevrec)
2963
{
2964
if (recno == ri->group_num && ptr == ri->subject_position &&
2965
mb->last_used_ptr == ri->last_used_ptr)
2966
return PCRE2_ERROR_RECURSELOOP;
2967
}
2968
2969
/* Remember this recursion and where we started it so as to
2970
catch infinite loops. */
2971
2972
new_recursive.group_num = recno;
2973
new_recursive.subject_position = ptr;
2974
new_recursive.last_used_ptr = mb->last_used_ptr;
2975
new_recursive.prevrec = mb->recursive;
2976
mb->recursive = &new_recursive;
2977
2978
rc = internal_dfa_match(
2979
mb, /* fixed match data */
2980
callpat, /* this subexpression's code */
2981
ptr, /* where we currently are */
2982
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
2983
local_offsets, /* offset vector */
2984
RWS_OVEC_RSIZE/OVEC_UNIT, /* size of same */
2985
local_workspace, /* workspace vector */
2986
RWS_RSIZE, /* size of same */
2987
rlevel, /* function recursion level */
2988
RWS); /* recursion workspace */
2989
2990
rws->free += RWS_RSIZE + RWS_OVEC_RSIZE;
2991
mb->recursive = new_recursive.prevrec; /* Done this recursion */
2992
2993
/* Ran out of internal offsets */
2994
2995
if (rc == 0) return PCRE2_ERROR_DFA_RECURSE;
2996
2997
/* For each successful matched substring, set up the next state with a
2998
count of characters to skip before trying it. Note that the count is in
2999
characters, not bytes. */
3000
3001
if (rc > 0)
3002
{
3003
for (rc = rc*2 - 2; rc >= 0; rc -= 2)
3004
{
3005
PCRE2_SIZE charcount = local_offsets[rc+1] - local_offsets[rc];
3006
#if defined SUPPORT_UNICODE && PCRE2_CODE_UNIT_WIDTH != 32
3007
if (utf)
3008
{
3009
PCRE2_SPTR p = start_subject + local_offsets[rc];
3010
PCRE2_SPTR pp = start_subject + local_offsets[rc+1];
3011
while (p < pp) if (NOT_FIRSTCU(*p++)) charcount--;
3012
}
3013
#endif
3014
if (charcount > 0)
3015
{
3016
ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0,
3017
(int)(charcount - 1));
3018
}
3019
else
3020
{
3021
ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
3022
}
3023
}
3024
}
3025
else if (rc != PCRE2_ERROR_NOMATCH) return rc;
3026
}
3027
break;
3028
3029
/*-----------------------------------------------------------------*/
3030
case OP_BRAPOS:
3031
case OP_SBRAPOS:
3032
case OP_CBRAPOS:
3033
case OP_SCBRAPOS:
3034
case OP_BRAPOSZERO:
3035
{
3036
int rc;
3037
int *local_workspace;
3038
PCRE2_SIZE *local_offsets;
3039
PCRE2_SIZE charcount, matched_count;
3040
PCRE2_SPTR local_ptr = ptr;
3041
RWS_anchor *rws = (RWS_anchor *)RWS;
3042
BOOL allow_zero;
3043
3044
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
3045
{
3046
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
3047
if (rc != 0) return rc;
3048
RWS = (int *)rws;
3049
}
3050
3051
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
3052
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
3053
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
3054
3055
if (codevalue == OP_BRAPOSZERO)
3056
{
3057
allow_zero = TRUE;
3058
++code; /* The following opcode will be one of the above BRAs */
3059
}
3060
else allow_zero = FALSE;
3061
3062
/* Loop to match the subpattern as many times as possible as if it were
3063
a complete pattern. */
3064
3065
for (matched_count = 0;; matched_count++)
3066
{
3067
rc = internal_dfa_match(
3068
mb, /* fixed match data */
3069
code, /* this subexpression's code */
3070
local_ptr, /* where we currently are */
3071
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
3072
local_offsets, /* offset vector */
3073
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
3074
local_workspace, /* workspace vector */
3075
RWS_RSIZE, /* size of same */
3076
rlevel, /* function recursion level */
3077
RWS); /* recursion workspace */
3078
3079
/* Failed to match */
3080
3081
if (rc < 0)
3082
{
3083
if (rc != PCRE2_ERROR_NOMATCH) return rc;
3084
break;
3085
}
3086
3087
/* Matched: break the loop if zero characters matched. */
3088
3089
charcount = local_offsets[1] - local_offsets[0];
3090
if (charcount == 0) break;
3091
local_ptr += charcount; /* Advance temporary position ptr */
3092
}
3093
3094
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
3095
3096
/* At this point we have matched the subpattern matched_count
3097
times, and local_ptr is pointing to the character after the end of the
3098
last match. */
3099
3100
if (matched_count > 0 || allow_zero)
3101
{
3102
PCRE2_SPTR end_subpattern = code;
3103
int next_state_offset;
3104
3105
do { end_subpattern += GET(end_subpattern, 1); }
3106
while (*end_subpattern == OP_ALT);
3107
next_state_offset =
3108
(int)(end_subpattern - start_code + LINK_SIZE + 1);
3109
3110
/* Optimization: if there are no more active states, and there
3111
are no new states yet set up, then skip over the subject string
3112
right here, to save looping. Otherwise, set up the new state to swing
3113
into action when the end of the matched substring is reached. */
3114
3115
if (i + 1 >= active_count && new_count == 0)
3116
{
3117
ptr = local_ptr;
3118
clen = 0;
3119
ADD_NEW(next_state_offset, 0);
3120
}
3121
else
3122
{
3123
PCRE2_SPTR p = ptr;
3124
PCRE2_SPTR pp = local_ptr;
3125
charcount = (PCRE2_SIZE)(pp - p);
3126
#if defined SUPPORT_UNICODE && PCRE2_CODE_UNIT_WIDTH != 32
3127
if (utf) while (p < pp) if (NOT_FIRSTCU(*p++)) charcount--;
3128
#endif
3129
ADD_NEW_DATA(-next_state_offset, 0, (int)(charcount - 1));
3130
}
3131
}
3132
}
3133
break;
3134
3135
/*-----------------------------------------------------------------*/
3136
case OP_ONCE:
3137
{
3138
int rc;
3139
int *local_workspace;
3140
PCRE2_SIZE *local_offsets;
3141
RWS_anchor *rws = (RWS_anchor *)RWS;
3142
3143
if (rws->free < RWS_RSIZE + RWS_OVEC_OSIZE)
3144
{
3145
rc = more_workspace(&rws, RWS_OVEC_OSIZE, mb);
3146
if (rc != 0) return rc;
3147
RWS = (int *)rws;
3148
}
3149
3150
local_offsets = (PCRE2_SIZE *)(RWS + rws->size - rws->free);
3151
local_workspace = ((int *)local_offsets) + RWS_OVEC_OSIZE;
3152
rws->free -= RWS_RSIZE + RWS_OVEC_OSIZE;
3153
3154
rc = internal_dfa_match(
3155
mb, /* fixed match data */
3156
code, /* this subexpression's code */
3157
ptr, /* where we currently are */
3158
(PCRE2_SIZE)(ptr - start_subject), /* start offset */
3159
local_offsets, /* offset vector */
3160
RWS_OVEC_OSIZE/OVEC_UNIT, /* size of same */
3161
local_workspace, /* workspace vector */
3162
RWS_RSIZE, /* size of same */
3163
rlevel, /* function recursion level */
3164
RWS); /* recursion workspace */
3165
3166
rws->free += RWS_RSIZE + RWS_OVEC_OSIZE;
3167
3168
if (rc >= 0)
3169
{
3170
PCRE2_SPTR end_subpattern = code;
3171
PCRE2_SIZE charcount = local_offsets[1] - local_offsets[0];
3172
int next_state_offset, repeat_state_offset;
3173
3174
do { end_subpattern += GET(end_subpattern, 1); }
3175
while (*end_subpattern == OP_ALT);
3176
next_state_offset =
3177
(int)(end_subpattern - start_code + LINK_SIZE + 1);
3178
3179
/* If the end of this subpattern is KETRMAX or KETRMIN, we must
3180
arrange for the repeat state also to be added to the relevant list.
3181
Calculate the offset, or set -1 for no repeat. */
3182
3183
repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
3184
*end_subpattern == OP_KETRMIN)?
3185
(int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
3186
3187
/* If we have matched an empty string, add the next state at the
3188
current character pointer. This is important so that the duplicate
3189
checking kicks in, which is what breaks infinite loops that match an
3190
empty string. */
3191
3192
if (charcount == 0)
3193
{
3194
ADD_ACTIVE(next_state_offset, 0);
3195
}
3196
3197
/* Optimization: if there are no more active states, and there
3198
are no new states yet set up, then skip over the subject string
3199
right here, to save looping. Otherwise, set up the new state to swing
3200
into action when the end of the matched substring is reached. */
3201
3202
else if (i + 1 >= active_count && new_count == 0)
3203
{
3204
ptr += charcount;
3205
clen = 0;
3206
ADD_NEW(next_state_offset, 0);
3207
3208
/* If we are adding a repeat state at the new character position,
3209
we must fudge things so that it is the only current state.
3210
Otherwise, it might be a duplicate of one we processed before, and
3211
that would cause it to be skipped. */
3212
3213
if (repeat_state_offset >= 0)
3214
{
3215
next_active_state = active_states;
3216
active_count = 0;
3217
i = -1;
3218
ADD_ACTIVE(repeat_state_offset, 0);
3219
}
3220
}
3221
else
3222
{
3223
#if defined SUPPORT_UNICODE && PCRE2_CODE_UNIT_WIDTH != 32
3224
if (utf)
3225
{
3226
PCRE2_SPTR p = start_subject + local_offsets[0];
3227
PCRE2_SPTR pp = start_subject + local_offsets[1];
3228
while (p < pp) if (NOT_FIRSTCU(*p++)) charcount--;
3229
}
3230
#endif
3231
ADD_NEW_DATA(-next_state_offset, 0, (int)(charcount - 1));
3232
if (repeat_state_offset >= 0)
3233
{ ADD_NEW_DATA(-repeat_state_offset, 0, (int)(charcount - 1)); }
3234
}
3235
}
3236
else if (rc != PCRE2_ERROR_NOMATCH) return rc;
3237
}
3238
break;
3239
3240
3241
/* ========================================================================== */
3242
/* Handle callouts */
3243
3244
case OP_CALLOUT:
3245
case OP_CALLOUT_STR:
3246
{
3247
PCRE2_SIZE callout_length;
3248
rrc = do_callout_dfa(code, offsets, current_subject, ptr, mb, 0,
3249
&callout_length);
3250
if (rrc < 0) return rrc; /* Abandon */
3251
if (rrc == 0)
3252
{ ADD_ACTIVE(state_offset + (int)callout_length, 0); }
3253
}
3254
break;
3255
3256
3257
/* ========================================================================== */
3258
default: /* Unsupported opcode */
3259
return PCRE2_ERROR_DFA_UITEM;
3260
}
3261
3262
NEXT_ACTIVE_STATE: continue;
3263
3264
} /* End of loop scanning active states */
3265
3266
/* We have finished the processing at the current subject character. If no
3267
new states have been set for the next character, we have found all the
3268
matches that we are going to find. If partial matching has been requested,
3269
check for appropriate conditions.
3270
3271
The "could_continue" variable is true if a state could have continued but
3272
for the fact that the end of the subject was reached. */
3273
3274
if (new_count <= 0)
3275
{
3276
if (could_continue && /* Some could go on, and */
3277
( /* either... */
3278
(mb->moptions & PCRE2_PARTIAL_HARD) != 0 /* Hard partial */
3279
|| /* or... */
3280
((mb->moptions & PCRE2_PARTIAL_SOFT) != 0 && /* Soft partial and */
3281
match_count < 0) /* no matches */
3282
) && /* And... */
3283
(
3284
partial_newline || /* Either partial NL */
3285
( /* or ... */
3286
ptr >= end_subject && /* End of subject and */
3287
( /* either */
3288
ptr > mb->start_used_ptr || /* Inspected non-empty string */
3289
mb->allowemptypartial /* or pattern has lookbehind */
3290
) /* or could match empty */
3291
)
3292
))
3293
match_count = PCRE2_ERROR_PARTIAL;
3294
break; /* Exit from loop along the subject string */
3295
}
3296
3297
/* One or more states are active for the next character. */
3298
3299
ptr += clen; /* Advance to next subject character */
3300
} /* Loop to move along the subject string */
3301
3302
/* Control gets here from "break" a few lines above. If we have a match and
3303
PCRE2_ENDANCHORED is set, the match fails. */
3304
3305
if (match_count >= 0 &&
3306
((mb->moptions | mb->poptions) & PCRE2_ENDANCHORED) != 0 &&
3307
ptr < end_subject)
3308
match_count = PCRE2_ERROR_NOMATCH;
3309
3310
return match_count;
3311
}
3312
3313
3314
3315
/*************************************************
3316
* Match a pattern using the DFA algorithm *
3317
*************************************************/
3318
3319
/* This function matches a compiled pattern to a subject string, using the
3320
alternate matching algorithm that finds all matches at once.
3321
3322
Arguments:
3323
code points to the compiled pattern
3324
subject subject string
3325
length length of subject string
3326
startoffset where to start matching in the subject
3327
options option bits
3328
match_data points to a match data structure
3329
gcontext points to a match context
3330
workspace pointer to workspace
3331
wscount size of workspace
3332
3333
Returns: > 0 => number of match offset pairs placed in offsets
3334
= 0 => offsets overflowed; longest matches are present
3335
-1 => failed to match
3336
< -1 => some kind of unexpected problem
3337
*/
3338
3339
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
3340
pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length,
3341
PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data,
3342
pcre2_match_context *mcontext, int *workspace, PCRE2_SIZE wscount)
3343
{
3344
int rc;
3345
3346
const pcre2_real_code *re = (const pcre2_real_code *)code;
3347
uint32_t original_options = options;
3348
3349
PCRE2_UCHAR null_str[1] = { 0xcd };
3350
PCRE2_SPTR original_subject = subject;
3351
PCRE2_SPTR start_match;
3352
PCRE2_SPTR end_subject;
3353
PCRE2_SPTR bumpalong_limit;
3354
PCRE2_SPTR req_cu_ptr;
3355
3356
BOOL utf, anchored, startline, firstline;
3357
BOOL has_first_cu = FALSE;
3358
BOOL has_req_cu = FALSE;
3359
3360
#if PCRE2_CODE_UNIT_WIDTH == 8
3361
PCRE2_SPTR memchr_found_first_cu = NULL;
3362
PCRE2_SPTR memchr_found_first_cu2 = NULL;
3363
#endif
3364
3365
PCRE2_UCHAR first_cu = 0;
3366
PCRE2_UCHAR first_cu2 = 0;
3367
PCRE2_UCHAR req_cu = 0;
3368
PCRE2_UCHAR req_cu2 = 0;
3369
3370
const uint8_t *start_bits = NULL;
3371
3372
/* We need to have mb pointing to a match block, because the IS_NEWLINE macro
3373
is used below, and it expects NLBLOCK to be defined as a pointer. */
3374
3375
pcre2_callout_block cb;
3376
dfa_match_block actual_match_block;
3377
dfa_match_block *mb = &actual_match_block;
3378
3379
/* Set up a starting block of memory for use during recursive calls to
3380
internal_dfa_match(). By putting this on the stack, it minimizes resource use
3381
in the case when it is not needed. If this is too small, more memory is
3382
obtained from the heap. At the start of each block is an anchor structure.*/
3383
3384
int base_recursion_workspace[RWS_BASE_SIZE];
3385
RWS_anchor *rws = (RWS_anchor *)base_recursion_workspace;
3386
rws->next = NULL;
3387
rws->size = RWS_BASE_SIZE;
3388
rws->free = RWS_BASE_SIZE - RWS_ANCHOR_SIZE;
3389
3390
/* Recognize NULL, length 0 as an empty string. */
3391
3392
if (subject == NULL && length == 0) subject = null_str;
3393
3394
/* Plausibility checks */
3395
3396
if (match_data == NULL) return PCRE2_ERROR_NULL;
3397
if (re == NULL || subject == NULL || workspace == NULL)
3398
{ rc = PCRE2_ERROR_NULL; goto EXIT; }
3399
if ((options & ~PUBLIC_DFA_MATCH_OPTIONS) != 0)
3400
{ rc = PCRE2_ERROR_BADOPTION; goto EXIT; }
3401
3402
if (length == PCRE2_ZERO_TERMINATED)
3403
{
3404
length = PRIV(strlen)(subject);
3405
}
3406
3407
if (wscount < 20) { rc = PCRE2_ERROR_DFA_WSSIZE; goto EXIT; }
3408
if (start_offset > length) { rc = PCRE2_ERROR_BADOFFSET; goto EXIT; }
3409
3410
/* Partial matching and PCRE2_ENDANCHORED are currently not allowed at the same
3411
time. */
3412
3413
if ((options & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) != 0 &&
3414
((re->overall_options | options) & PCRE2_ENDANCHORED) != 0)
3415
{ rc = PCRE2_ERROR_BADOPTION; goto EXIT; }
3416
3417
/* Invalid UTF support is not available for DFA matching. */
3418
3419
if ((re->overall_options & PCRE2_MATCH_INVALID_UTF) != 0)
3420
{ rc = PCRE2_ERROR_DFA_UINVALID_UTF; goto EXIT; }
3421
3422
/* Check that the first field in the block is the magic number. If it is not,
3423
return with PCRE2_ERROR_BADMAGIC. */
3424
3425
if (re->magic_number != MAGIC_NUMBER)
3426
{ rc = PCRE2_ERROR_BADMAGIC; goto EXIT; }
3427
3428
/* Check the code unit width. */
3429
3430
if ((re->flags & PCRE2_MODE_MASK) != PCRE2_CODE_UNIT_WIDTH/8)
3431
{ rc = PCRE2_ERROR_BADMODE; goto EXIT; }
3432
3433
/* PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART are match-time flags in the
3434
options variable for this function. Users of PCRE2 who are not calling the
3435
function directly would like to have a way of setting these flags, in the same
3436
way that they can set pcre2_compile() flags like PCRE2_NO_AUTO_POSSESS with
3437
constructions like (*NO_AUTOPOSSESS). To enable this, (*NOTEMPTY) and
3438
(*NOTEMPTY_ATSTART) set bits in the pattern's "flag" function which can now be
3439
transferred to the options for this function. The bits are guaranteed to be
3440
adjacent, but do not have the same values. This bit of Boolean trickery assumes
3441
that the match-time bits are not more significant than the flag bits. If by
3442
accident this is not the case, a compile-time division by zero error will
3443
occur. */
3444
3445
#define FF (PCRE2_NOTEMPTY_SET|PCRE2_NE_ATST_SET)
3446
#define OO (PCRE2_NOTEMPTY|PCRE2_NOTEMPTY_ATSTART)
3447
options |= (re->flags & FF) / ((FF & (~FF+1)) / (OO & (~OO+1)));
3448
#undef FF
3449
#undef OO
3450
3451
/* If restarting after a partial match, do some sanity checks on the contents
3452
of the workspace. */
3453
3454
if ((options & PCRE2_DFA_RESTART) != 0)
3455
{
3456
if ((workspace[0] & (-2)) != 0 || workspace[1] < 1 ||
3457
workspace[1] > (int)((wscount - 2)/INTS_PER_STATEBLOCK))
3458
{ rc = PCRE2_ERROR_DFA_BADRESTART; goto EXIT; }
3459
}
3460
3461
/* Set some local values */
3462
3463
utf = (re->overall_options & PCRE2_UTF) != 0;
3464
start_match = subject + start_offset;
3465
end_subject = subject + length;
3466
req_cu_ptr = start_match - 1;
3467
anchored = (options & (PCRE2_ANCHORED|PCRE2_DFA_RESTART)) != 0 ||
3468
(re->overall_options & PCRE2_ANCHORED) != 0;
3469
3470
/* The "must be at the start of a line" flags are used in a loop when finding
3471
where to start. */
3472
3473
startline = (re->flags & PCRE2_STARTLINE) != 0;
3474
firstline = !anchored && (re->overall_options & PCRE2_FIRSTLINE) != 0;
3475
bumpalong_limit = end_subject;
3476
3477
/* Initialize and set up the fixed fields in the callout block, with a pointer
3478
in the match block. */
3479
3480
mb->cb = &cb;
3481
cb.version = 2;
3482
cb.subject = subject;
3483
cb.subject_length = (PCRE2_SIZE)(end_subject - subject);
3484
cb.callout_flags = 0;
3485
cb.capture_top = 1; /* No capture support */
3486
cb.capture_last = 0;
3487
cb.mark = NULL; /* No (*MARK) support */
3488
3489
/* Get data from the match context, if present, and fill in the remaining
3490
fields in the match block. It is an error to set an offset limit without
3491
setting the flag at compile time. */
3492
3493
if (mcontext == NULL)
3494
{
3495
mb->callout = NULL;
3496
mb->memctl = re->memctl;
3497
mb->match_limit = PRIV(default_match_context).match_limit;
3498
mb->match_limit_depth = PRIV(default_match_context).depth_limit;
3499
mb->heap_limit = PRIV(default_match_context).heap_limit;
3500
}
3501
else
3502
{
3503
if (mcontext->offset_limit != PCRE2_UNSET)
3504
{
3505
if ((re->overall_options & PCRE2_USE_OFFSET_LIMIT) == 0)
3506
{ rc = PCRE2_ERROR_BADOFFSETLIMIT; goto EXIT; }
3507
bumpalong_limit = subject + mcontext->offset_limit;
3508
}
3509
mb->callout = mcontext->callout;
3510
mb->callout_data = mcontext->callout_data;
3511
mb->memctl = mcontext->memctl;
3512
mb->match_limit = mcontext->match_limit;
3513
mb->match_limit_depth = mcontext->depth_limit;
3514
mb->heap_limit = mcontext->heap_limit;
3515
}
3516
3517
if (mb->match_limit > re->limit_match)
3518
mb->match_limit = re->limit_match;
3519
3520
if (mb->match_limit_depth > re->limit_depth)
3521
mb->match_limit_depth = re->limit_depth;
3522
3523
if (mb->heap_limit > re->limit_heap)
3524
mb->heap_limit = re->limit_heap;
3525
3526
mb->start_code = (PCRE2_SPTR)((const uint8_t *)re + re->code_start);
3527
mb->tables = re->tables;
3528
mb->start_subject = subject;
3529
mb->end_subject = end_subject;
3530
mb->start_offset = start_offset;
3531
mb->allowemptypartial = (re->max_lookbehind > 0) ||
3532
(re->flags & PCRE2_MATCH_EMPTY) != 0;
3533
mb->moptions = options;
3534
mb->poptions = re->overall_options;
3535
mb->match_call_count = 0;
3536
mb->heap_used = 0;
3537
3538
/* Process the \R and newline settings. */
3539
3540
mb->bsr_convention = re->bsr_convention;
3541
mb->nltype = NLTYPE_FIXED;
3542
switch(re->newline_convention)
3543
{
3544
case PCRE2_NEWLINE_CR:
3545
mb->nllen = 1;
3546
mb->nl[0] = CHAR_CR;
3547
break;
3548
3549
case PCRE2_NEWLINE_LF:
3550
mb->nllen = 1;
3551
mb->nl[0] = CHAR_NL;
3552
break;
3553
3554
case PCRE2_NEWLINE_NUL:
3555
mb->nllen = 1;
3556
mb->nl[0] = CHAR_NUL;
3557
break;
3558
3559
case PCRE2_NEWLINE_CRLF:
3560
mb->nllen = 2;
3561
mb->nl[0] = CHAR_CR;
3562
mb->nl[1] = CHAR_NL;
3563
break;
3564
3565
case PCRE2_NEWLINE_ANY:
3566
mb->nltype = NLTYPE_ANY;
3567
break;
3568
3569
case PCRE2_NEWLINE_ANYCRLF:
3570
mb->nltype = NLTYPE_ANYCRLF;
3571
break;
3572
3573
/* LCOV_EXCL_START */
3574
default:
3575
PCRE2_DEBUG_UNREACHABLE();
3576
rc = PCRE2_ERROR_INTERNAL;
3577
goto EXIT;
3578
/* LCOV_EXCL_STOP */
3579
}
3580
3581
/* Check a UTF string for validity if required. For 8-bit and 16-bit strings,
3582
we must also check that a starting offset does not point into the middle of a
3583
multiunit character. We check only the portion of the subject that is going to
3584
be inspected during matching - from the offset minus the maximum back reference
3585
to the given length. This saves time when a small part of a large subject is
3586
being matched by the use of a starting offset. Note that the maximum lookbehind
3587
is a number of characters, not code units. */
3588
3589
#ifdef SUPPORT_UNICODE
3590
if (utf && (options & PCRE2_NO_UTF_CHECK) == 0)
3591
{
3592
PCRE2_SPTR check_subject = start_match; /* start_match includes offset */
3593
3594
if (start_offset > 0)
3595
{
3596
#if PCRE2_CODE_UNIT_WIDTH != 32
3597
unsigned int i;
3598
if (start_match < end_subject && NOT_FIRSTCU(*start_match))
3599
{ rc = PCRE2_ERROR_BADUTFOFFSET; goto EXIT; }
3600
for (i = re->max_lookbehind; i > 0 && check_subject > subject; i--)
3601
{
3602
check_subject--;
3603
while (check_subject > subject &&
3604
#if PCRE2_CODE_UNIT_WIDTH == 8
3605
(*check_subject & 0xc0) == 0x80)
3606
#else /* 16-bit */
3607
(*check_subject & 0xfc00) == 0xdc00)
3608
#endif /* PCRE2_CODE_UNIT_WIDTH == 8 */
3609
check_subject--;
3610
}
3611
#else /* In the 32-bit library, one code unit equals one character. */
3612
check_subject -= re->max_lookbehind;
3613
if (check_subject < subject) check_subject = subject;
3614
#endif /* PCRE2_CODE_UNIT_WIDTH != 32 */
3615
}
3616
3617
/* Validate the relevant portion of the subject. After an error, adjust the
3618
offset to be an absolute offset in the whole string. */
3619
3620
rc = PRIV(valid_utf)(check_subject,
3621
length - (PCRE2_SIZE)(check_subject - subject), &(match_data->startchar));
3622
if (rc != 0)
3623
{
3624
match_data->startchar += (PCRE2_SIZE)(check_subject - subject);
3625
goto EXIT;
3626
}
3627
}
3628
#endif /* SUPPORT_UNICODE */
3629
3630
/* Set up the first code unit to match, if available. If there's no first code
3631
unit there may be a bitmap of possible first characters. */
3632
3633
if ((re->flags & PCRE2_FIRSTSET) != 0)
3634
{
3635
has_first_cu = TRUE;
3636
first_cu = first_cu2 = (PCRE2_UCHAR)(re->first_codeunit);
3637
if ((re->flags & PCRE2_FIRSTCASELESS) != 0)
3638
{
3639
first_cu2 = TABLE_GET(first_cu, mb->tables + fcc_offset, first_cu);
3640
#ifdef SUPPORT_UNICODE
3641
#if PCRE2_CODE_UNIT_WIDTH == 8
3642
if (first_cu > 127 && !utf && (re->overall_options & PCRE2_UCP) != 0)
3643
first_cu2 = (PCRE2_UCHAR)UCD_OTHERCASE(first_cu);
3644
#else
3645
if (first_cu > 127 && (utf || (re->overall_options & PCRE2_UCP) != 0))
3646
first_cu2 = (PCRE2_UCHAR)UCD_OTHERCASE(first_cu);
3647
#endif
3648
#endif /* SUPPORT_UNICODE */
3649
}
3650
}
3651
else
3652
if (!startline && (re->flags & PCRE2_FIRSTMAPSET) != 0)
3653
start_bits = re->start_bitmap;
3654
3655
/* There may be a "last known required code unit" set. */
3656
3657
if ((re->flags & PCRE2_LASTSET) != 0)
3658
{
3659
has_req_cu = TRUE;
3660
req_cu = req_cu2 = (PCRE2_UCHAR)(re->last_codeunit);
3661
if ((re->flags & PCRE2_LASTCASELESS) != 0)
3662
{
3663
req_cu2 = TABLE_GET(req_cu, mb->tables + fcc_offset, req_cu);
3664
#ifdef SUPPORT_UNICODE
3665
#if PCRE2_CODE_UNIT_WIDTH == 8
3666
if (req_cu > 127 && !utf && (re->overall_options & PCRE2_UCP) != 0)
3667
req_cu2 = (PCRE2_UCHAR)UCD_OTHERCASE(req_cu);
3668
#else
3669
if (req_cu > 127 && (utf || (re->overall_options & PCRE2_UCP) != 0))
3670
req_cu2 = (PCRE2_UCHAR)UCD_OTHERCASE(req_cu);
3671
#endif
3672
#endif /* SUPPORT_UNICODE */
3673
}
3674
}
3675
3676
/* If the match data block was previously used with PCRE2_COPY_MATCHED_SUBJECT,
3677
free the memory that was obtained. */
3678
3679
if ((match_data->flags & PCRE2_MD_COPIED_SUBJECT) != 0)
3680
{
3681
match_data->memctl.free((void *)match_data->subject,
3682
match_data->memctl.memory_data);
3683
match_data->flags &= ~PCRE2_MD_COPIED_SUBJECT;
3684
}
3685
3686
/* Fill in fields that are always returned in the match data. */
3687
3688
match_data->code = re;
3689
match_data->subject = NULL; /* Default for match error */
3690
match_data->mark = NULL;
3691
match_data->matchedby = PCRE2_MATCHEDBY_DFA_INTERPRETER;
3692
match_data->options = original_options;
3693
3694
/* Call the main matching function, looping for a non-anchored regex after a
3695
failed match. If not restarting, perform certain optimizations at the start of
3696
a match. */
3697
3698
for (;;)
3699
{
3700
/* ----------------- Start of match optimizations ---------------- */
3701
3702
/* There are some optimizations that avoid running the match if a known
3703
starting point is not found, or if a known later code unit is not present.
3704
However, there is an option (settable at compile time) that disables
3705
these, for testing and for ensuring that all callouts do actually occur.
3706
The optimizations must also be avoided when restarting a DFA match. */
3707
3708
if ((re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0 &&
3709
(options & PCRE2_DFA_RESTART) == 0)
3710
{
3711
/* If firstline is TRUE, the start of the match is constrained to the first
3712
line of a multiline string. That is, the match must be before or at the
3713
first newline following the start of matching. Temporarily adjust
3714
end_subject so that we stop the optimization scans for a first code unit
3715
immediately after the first character of a newline (the first code unit can
3716
legitimately be a newline). If the match fails at the newline, later code
3717
breaks this loop. */
3718
3719
if (firstline)
3720
{
3721
PCRE2_SPTR t = start_match;
3722
#ifdef SUPPORT_UNICODE
3723
if (utf)
3724
{
3725
while (t < end_subject && !IS_NEWLINE(t))
3726
{
3727
t++;
3728
ACROSSCHAR(t < end_subject, t, t++);
3729
}
3730
}
3731
else
3732
#endif
3733
while (t < end_subject && !IS_NEWLINE(t)) t++;
3734
end_subject = t;
3735
}
3736
3737
/* Anchored: check the first code unit if one is recorded. This may seem
3738
pointless but it can help in detecting a no match case without scanning for
3739
the required code unit. */
3740
3741
if (anchored)
3742
{
3743
if (has_first_cu || start_bits != NULL)
3744
{
3745
BOOL ok = start_match < end_subject;
3746
if (ok)
3747
{
3748
PCRE2_UCHAR c = UCHAR21TEST(start_match);
3749
ok = has_first_cu && (c == first_cu || c == first_cu2);
3750
if (!ok && start_bits != NULL)
3751
{
3752
#if PCRE2_CODE_UNIT_WIDTH != 8
3753
if (c > 255) c = 255;
3754
#endif
3755
ok = (start_bits[c/8] & (1u << (c&7))) != 0;
3756
}
3757
}
3758
if (!ok) break;
3759
}
3760
}
3761
3762
/* Not anchored. Advance to a unique first code unit if there is one. */
3763
3764
else
3765
{
3766
if (has_first_cu)
3767
{
3768
if (first_cu != first_cu2) /* Caseless */
3769
{
3770
/* In 16-bit and 32_bit modes we have to do our own search, so can
3771
look for both cases at once. */
3772
3773
#if PCRE2_CODE_UNIT_WIDTH != 8
3774
PCRE2_UCHAR smc;
3775
while (start_match < end_subject &&
3776
(smc = UCHAR21TEST(start_match)) != first_cu &&
3777
smc != first_cu2)
3778
start_match++;
3779
#else
3780
/* In 8-bit mode, the use of memchr() gives a big speed up, even
3781
though we have to call it twice in order to find the earliest
3782
occurrence of the code unit in either of its cases. Caching is used
3783
to remember the positions of previously found code units. This can
3784
make a huge difference when the strings are very long and only one
3785
case is actually present. */
3786
3787
PCRE2_SPTR pp1 = NULL;
3788
PCRE2_SPTR pp2 = NULL;
3789
PCRE2_SIZE searchlength = end_subject - start_match;
3790
3791
/* If we haven't got a previously found position for first_cu, or if
3792
the current starting position is later, we need to do a search. If
3793
the code unit is not found, set it to the end. */
3794
3795
if (memchr_found_first_cu == NULL ||
3796
start_match > memchr_found_first_cu)
3797
{
3798
pp1 = memchr(start_match, first_cu, searchlength);
3799
memchr_found_first_cu = (pp1 == NULL)? end_subject : pp1;
3800
}
3801
3802
/* If the start is before a previously found position, use the
3803
previous position, or NULL if a previous search failed. */
3804
3805
else pp1 = (memchr_found_first_cu == end_subject)? NULL :
3806
memchr_found_first_cu;
3807
3808
/* Do the same thing for the other case. */
3809
3810
if (memchr_found_first_cu2 == NULL ||
3811
start_match > memchr_found_first_cu2)
3812
{
3813
pp2 = memchr(start_match, first_cu2, searchlength);
3814
memchr_found_first_cu2 = (pp2 == NULL)? end_subject : pp2;
3815
}
3816
3817
else pp2 = (memchr_found_first_cu2 == end_subject)? NULL :
3818
memchr_found_first_cu2;
3819
3820
/* Set the start to the end of the subject if neither case was found.
3821
Otherwise, use the earlier found point. */
3822
3823
if (pp1 == NULL)
3824
start_match = (pp2 == NULL)? end_subject : pp2;
3825
else
3826
start_match = (pp2 == NULL || pp1 < pp2)? pp1 : pp2;
3827
3828
#endif /* 8-bit handling */
3829
}
3830
3831
/* The caseful case is much simpler. */
3832
3833
else
3834
{
3835
#if PCRE2_CODE_UNIT_WIDTH != 8
3836
while (start_match < end_subject && UCHAR21TEST(start_match) !=
3837
first_cu)
3838
start_match++;
3839
#else /* 8-bit code units */
3840
start_match = memchr(start_match, first_cu, end_subject - start_match);
3841
if (start_match == NULL) start_match = end_subject;
3842
#endif
3843
}
3844
3845
/* If we can't find the required code unit, having reached the true end
3846
of the subject, break the bumpalong loop, to force a match failure,
3847
except when doing partial matching, when we let the next cycle run at
3848
the end of the subject. To see why, consider the pattern /(?<=abc)def/,
3849
which partially matches "abc", even though the string does not contain
3850
the starting character "d". If we have not reached the true end of the
3851
subject (PCRE2_FIRSTLINE caused end_subject to be temporarily modified)
3852
we also let the cycle run, because the matching string is legitimately
3853
allowed to start with the first code unit of a newline. */
3854
3855
if ((mb->moptions & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) == 0 &&
3856
start_match >= mb->end_subject)
3857
break;
3858
}
3859
3860
/* If there's no first code unit, advance to just after a linebreak for a
3861
multiline match if required. */
3862
3863
else if (startline)
3864
{
3865
if (start_match > mb->start_subject + start_offset)
3866
{
3867
#ifdef SUPPORT_UNICODE
3868
if (utf)
3869
{
3870
while (start_match < end_subject && !WAS_NEWLINE(start_match))
3871
{
3872
start_match++;
3873
ACROSSCHAR(start_match < end_subject, start_match, start_match++);
3874
}
3875
}
3876
else
3877
#endif
3878
while (start_match < end_subject && !WAS_NEWLINE(start_match))
3879
start_match++;
3880
3881
/* If we have just passed a CR and the newline option is ANY or
3882
ANYCRLF, and we are now at a LF, advance the match position by one
3883
more code unit. */
3884
3885
if (start_match[-1] == CHAR_CR &&
3886
(mb->nltype == NLTYPE_ANY || mb->nltype == NLTYPE_ANYCRLF) &&
3887
start_match < end_subject &&
3888
UCHAR21TEST(start_match) == CHAR_NL)
3889
start_match++;
3890
}
3891
}
3892
3893
/* If there's no first code unit or a requirement for a multiline line
3894
start, advance to a non-unique first code unit if any have been
3895
identified. The bitmap contains only 256 bits. When code units are 16 or
3896
32 bits wide, all code units greater than 254 set the 255 bit. */
3897
3898
else if (start_bits != NULL)
3899
{
3900
while (start_match < end_subject)
3901
{
3902
uint32_t c = UCHAR21TEST(start_match);
3903
#if PCRE2_CODE_UNIT_WIDTH != 8
3904
if (c > 255) c = 255;
3905
#endif
3906
if ((start_bits[c/8] & (1u << (c&7))) != 0) break;
3907
start_match++;
3908
}
3909
3910
/* See comment above in first_cu checking about the next line. */
3911
3912
if ((mb->moptions & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) == 0 &&
3913
start_match >= mb->end_subject)
3914
break;
3915
}
3916
} /* End of first code unit handling */
3917
3918
/* Restore fudged end_subject */
3919
3920
end_subject = mb->end_subject;
3921
3922
/* The following two optimizations are disabled for partial matching. */
3923
3924
if ((mb->moptions & (PCRE2_PARTIAL_HARD|PCRE2_PARTIAL_SOFT)) == 0)
3925
{
3926
PCRE2_SPTR p;
3927
3928
/* The minimum matching length is a lower bound; no actual string of that
3929
length may actually match the pattern. Although the value is, strictly,
3930
in characters, we treat it as code units to avoid spending too much time
3931
in this optimization. */
3932
3933
if (end_subject - start_match < re->minlength) goto NOMATCH_EXIT;
3934
3935
/* If req_cu is set, we know that that code unit must appear in the
3936
subject for the match to succeed. If the first code unit is set, req_cu
3937
must be later in the subject; otherwise the test starts at the match
3938
point. This optimization can save a huge amount of backtracking in
3939
patterns with nested unlimited repeats that aren't going to match.
3940
Writing separate code for cased/caseless versions makes it go faster, as
3941
does using an autoincrement and backing off on a match. As in the case of
3942
the first code unit, using memchr() in the 8-bit library gives a big
3943
speed up. Unlike the first_cu check above, we do not need to call
3944
memchr() twice in the caseless case because we only need to check for the
3945
presence of the character in either case, not find the first occurrence.
3946
3947
The search can be skipped if the code unit was found later than the
3948
current starting point in a previous iteration of the bumpalong loop.
3949
3950
HOWEVER: when the subject string is very, very long, searching to its end
3951
can take a long time, and give bad performance on quite ordinary
3952
patterns. This showed up when somebody was matching something like
3953
/^\d+C/ on a 32-megabyte string... so we don't do this when the string is
3954
sufficiently long, but it's worth searching a lot more for unanchored
3955
patterns. */
3956
3957
p = start_match + (has_first_cu? 1:0);
3958
if (has_req_cu && p > req_cu_ptr)
3959
{
3960
PCRE2_SIZE check_length = end_subject - start_match;
3961
3962
if (check_length < REQ_CU_MAX ||
3963
(!anchored && check_length < REQ_CU_MAX * 1000))
3964
{
3965
if (req_cu != req_cu2) /* Caseless */
3966
{
3967
#if PCRE2_CODE_UNIT_WIDTH != 8
3968
while (p < end_subject)
3969
{
3970
uint32_t pp = UCHAR21INCTEST(p);
3971
if (pp == req_cu || pp == req_cu2) { p--; break; }
3972
}
3973
#else /* 8-bit code units */
3974
PCRE2_SPTR pp = p;
3975
p = memchr(pp, req_cu, end_subject - pp);
3976
if (p == NULL)
3977
{
3978
p = memchr(pp, req_cu2, end_subject - pp);
3979
if (p == NULL) p = end_subject;
3980
}
3981
#endif /* PCRE2_CODE_UNIT_WIDTH != 8 */
3982
}
3983
3984
/* The caseful case */
3985
3986
else
3987
{
3988
#if PCRE2_CODE_UNIT_WIDTH != 8
3989
while (p < end_subject)
3990
{
3991
if (UCHAR21INCTEST(p) == req_cu) { p--; break; }
3992
}
3993
3994
#else /* 8-bit code units */
3995
p = memchr(p, req_cu, end_subject - p);
3996
if (p == NULL) p = end_subject;
3997
#endif
3998
}
3999
4000
/* If we can't find the required code unit, break the matching loop,
4001
forcing a match failure. */
4002
4003
if (p >= end_subject) break;
4004
4005
/* If we have found the required code unit, save the point where we
4006
found it, so that we don't search again next time round the loop if
4007
the start hasn't passed this code unit yet. */
4008
4009
req_cu_ptr = p;
4010
}
4011
}
4012
}
4013
}
4014
4015
/* ------------ End of start of match optimizations ------------ */
4016
4017
/* Give no match if we have passed the bumpalong limit. */
4018
4019
if (start_match > bumpalong_limit) break;
4020
4021
/* OK, now we can do the business */
4022
4023
mb->start_used_ptr = start_match;
4024
mb->last_used_ptr = start_match;
4025
mb->recursive = NULL;
4026
4027
rc = internal_dfa_match(
4028
mb, /* fixed match data */
4029
mb->start_code, /* this subexpression's code */
4030
start_match, /* where we currently are */
4031
start_offset, /* start offset in subject */
4032
match_data->ovector, /* offset vector */
4033
(uint32_t)match_data->oveccount * 2, /* actual size of same */
4034
workspace, /* workspace vector */
4035
(int)wscount, /* size of same */
4036
0, /* function recurse level */
4037
base_recursion_workspace); /* initial workspace for recursion */
4038
4039
/* Anything other than "no match" means we are done, always; otherwise, carry
4040
on only if not anchored. */
4041
4042
if (rc != PCRE2_ERROR_NOMATCH || anchored)
4043
{
4044
if (rc == PCRE2_ERROR_NOMATCH) goto NOMATCH_EXIT;
4045
4046
if (rc == PCRE2_ERROR_PARTIAL && match_data->oveccount > 0)
4047
{
4048
match_data->ovector[0] = (PCRE2_SIZE)(start_match - subject);
4049
match_data->ovector[1] = (PCRE2_SIZE)(end_subject - subject);
4050
}
4051
4052
if (rc >= 0 || rc == PCRE2_ERROR_PARTIAL)
4053
{
4054
match_data->subject_length = length;
4055
match_data->start_offset = start_offset;
4056
match_data->leftchar = (PCRE2_SIZE)(mb->start_used_ptr - subject);
4057
match_data->rightchar = (PCRE2_SIZE)(mb->last_used_ptr - subject);
4058
match_data->startchar = (PCRE2_SIZE)(start_match - subject);
4059
}
4060
4061
if (rc >= 0 && (options & PCRE2_COPY_MATCHED_SUBJECT) != 0)
4062
{
4063
if (length != 0)
4064
{
4065
match_data->subject = match_data->memctl.malloc(CU2BYTES(length),
4066
match_data->memctl.memory_data);
4067
if (match_data->subject == NULL)
4068
{ rc = PCRE2_ERROR_NOMEMORY; goto EXIT; }
4069
memcpy((void *)match_data->subject, subject, CU2BYTES(length));
4070
}
4071
else
4072
match_data->subject = NULL;
4073
match_data->flags |= PCRE2_MD_COPIED_SUBJECT;
4074
}
4075
else if (rc >= 0 || rc == PCRE2_ERROR_PARTIAL)
4076
{
4077
match_data->subject = original_subject;
4078
}
4079
goto EXIT;
4080
}
4081
4082
/* Advance to the next subject character unless we are at the end of a line
4083
and firstline is set. */
4084
4085
if (firstline && IS_NEWLINE(start_match)) break;
4086
start_match++;
4087
#ifdef SUPPORT_UNICODE
4088
if (utf)
4089
{
4090
ACROSSCHAR(start_match < end_subject, start_match, start_match++);
4091
}
4092
#endif
4093
if (start_match > end_subject) break;
4094
4095
/* If we have just passed a CR and we are now at a LF, and the pattern does
4096
not contain any explicit matches for \r or \n, and the newline option is CRLF
4097
or ANY or ANYCRLF, advance the match position by one more character. */
4098
4099
if (UCHAR21TEST(start_match - 1) == CHAR_CR &&
4100
start_match < end_subject &&
4101
UCHAR21TEST(start_match) == CHAR_NL &&
4102
(re->flags & PCRE2_HASCRORLF) == 0 &&
4103
(mb->nltype == NLTYPE_ANY ||
4104
mb->nltype == NLTYPE_ANYCRLF ||
4105
mb->nllen == 2))
4106
start_match++;
4107
4108
} /* "Bumpalong" loop */
4109
4110
NOMATCH_EXIT:
4111
match_data->subject = original_subject;
4112
match_data->subject_length = length;
4113
match_data->start_offset = start_offset;
4114
rc = PCRE2_ERROR_NOMATCH;
4115
4116
EXIT:
4117
while (rws->next != NULL)
4118
{
4119
RWS_anchor *next = rws->next;
4120
rws->next = next->next;
4121
mb->memctl.free(next, mb->memctl.memory_data);
4122
}
4123
4124
match_data->rc = rc;
4125
return rc;
4126
}
4127
4128
/* These #undefs are here to enable unity builds with CMake. */
4129
4130
#undef NLBLOCK /* Block containing newline information */
4131
#undef PSSTART /* Field containing processed string start */
4132
#undef PSEND /* Field containing processed string end */
4133
4134
/* End of pcre2_dfa_match.c */
4135
4136