Path: blob/aarch64-shenandoah-jdk8u272-b10/jdk/src/share/classes/sun/text/normalizer/UnicodeSet.java
38830 views
/*1* Copyright (c) 2005, 2011, Oracle and/or its affiliates. All rights reserved.2* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.3*4* This code is free software; you can redistribute it and/or modify it5* under the terms of the GNU General Public License version 2 only, as6* published by the Free Software Foundation. Oracle designates this7* particular file as subject to the "Classpath" exception as provided8* by Oracle in the LICENSE file that accompanied this code.9*10* This code is distributed in the hope that it will be useful, but WITHOUT11* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or12* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License13* version 2 for more details (a copy is included in the LICENSE file that14* accompanied this code).15*16* You should have received a copy of the GNU General Public License version17* 2 along with this work; if not, write to the Free Software Foundation,18* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.19*20* Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA21* or visit www.oracle.com if you need additional information or have any22* questions.23*/24/*25*******************************************************************************26* (C) Copyright IBM Corp. and others, 1996-2009 - All Rights Reserved *27* *28* The original version of this source code and documentation is copyrighted *29* and owned by IBM, These materials are provided under terms of a License *30* Agreement between IBM and Sun. This technology is protected by multiple *31* US and International patents. This notice and attribution to IBM may not *32* to removed. *33*******************************************************************************34*/3536package sun.text.normalizer;3738import java.text.ParsePosition;39import java.util.Iterator;40import java.util.TreeSet;4142/**43* A mutable set of Unicode characters and multicharacter strings. Objects of this class44* represent <em>character classes</em> used in regular expressions.45* A character specifies a subset of Unicode code points. Legal46* code points are U+0000 to U+10FFFF, inclusive.47*48* <p>The UnicodeSet class is not designed to be subclassed.49*50* <p><code>UnicodeSet</code> supports two APIs. The first is the51* <em>operand</em> API that allows the caller to modify the value of52* a <code>UnicodeSet</code> object. It conforms to Java 2's53* <code>java.util.Set</code> interface, although54* <code>UnicodeSet</code> does not actually implement that55* interface. All methods of <code>Set</code> are supported, with the56* modification that they take a character range or single character57* instead of an <code>Object</code>, and they take a58* <code>UnicodeSet</code> instead of a <code>Collection</code>. The59* operand API may be thought of in terms of boolean logic: a boolean60* OR is implemented by <code>add</code>, a boolean AND is implemented61* by <code>retain</code>, a boolean XOR is implemented by62* <code>complement</code> taking an argument, and a boolean NOT is63* implemented by <code>complement</code> with no argument. In terms64* of traditional set theory function names, <code>add</code> is a65* union, <code>retain</code> is an intersection, <code>remove</code>66* is an asymmetric difference, and <code>complement</code> with no67* argument is a set complement with respect to the superset range68* <code>MIN_VALUE-MAX_VALUE</code>69*70* <p>The second API is the71* <code>applyPattern()</code>/<code>toPattern()</code> API from the72* <code>java.text.Format</code>-derived classes. Unlike the73* methods that add characters, add categories, and control the logic74* of the set, the method <code>applyPattern()</code> sets all75* attributes of a <code>UnicodeSet</code> at once, based on a76* string pattern.77*78* <p><b>Pattern syntax</b></p>79*80* Patterns are accepted by the constructors and the81* <code>applyPattern()</code> methods and returned by the82* <code>toPattern()</code> method. These patterns follow a syntax83* similar to that employed by version 8 regular expression character84* classes. Here are some simple examples:85*86* <blockquote>87* <table>88* <tr align="top">89* <td nowrap valign="top" align="left"><code>[]</code></td>90* <td valign="top">No characters</td>91* </tr><tr align="top">92* <td nowrap valign="top" align="left"><code>[a]</code></td>93* <td valign="top">The character 'a'</td>94* </tr><tr align="top">95* <td nowrap valign="top" align="left"><code>[ae]</code></td>96* <td valign="top">The characters 'a' and 'e'</td>97* </tr>98* <tr>99* <td nowrap valign="top" align="left"><code>[a-e]</code></td>100* <td valign="top">The characters 'a' through 'e' inclusive, in Unicode code101* point order</td>102* </tr>103* <tr>104* <td nowrap valign="top" align="left"><code>[\\u4E01]</code></td>105* <td valign="top">The character U+4E01</td>106* </tr>107* <tr>108* <td nowrap valign="top" align="left"><code>[a{ab}{ac}]</code></td>109* <td valign="top">The character 'a' and the multicharacter strings "ab" and110* "ac"</td>111* </tr>112* <tr>113* <td nowrap valign="top" align="left"><code>[\p{Lu}]</code></td>114* <td valign="top">All characters in the general category Uppercase Letter</td>115* </tr>116* </table>117* </blockquote>118*119* Any character may be preceded by a backslash in order to remove any special120* meaning. White space characters, as defined by UCharacterProperty.isRuleWhiteSpace(), are121* ignored, unless they are escaped.122*123* <p>Property patterns specify a set of characters having a certain124* property as defined by the Unicode standard. Both the POSIX-like125* "[:Lu:]" and the Perl-like syntax "\p{Lu}" are recognized. For a126* complete list of supported property patterns, see the User's Guide127* for UnicodeSet at128* <a href="http://www.icu-project.org/userguide/unicodeSet.html">129* http://www.icu-project.org/userguide/unicodeSet.html</a>.130* Actual determination of property data is defined by the underlying131* Unicode database as implemented by UCharacter.132*133* <p>Patterns specify individual characters, ranges of characters, and134* Unicode property sets. When elements are concatenated, they135* specify their union. To complement a set, place a '^' immediately136* after the opening '['. Property patterns are inverted by modifying137* their delimiters; "[:^foo]" and "\P{foo}". In any other location,138* '^' has no special meaning.139*140* <p>Ranges are indicated by placing two a '-' between two141* characters, as in "a-z". This specifies the range of all142* characters from the left to the right, in Unicode order. If the143* left character is greater than or equal to the144* right character it is a syntax error. If a '-' occurs as the first145* character after the opening '[' or '[^', or if it occurs as the146* last character before the closing ']', then it is taken as a147* literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same148* set of three characters, 'a', 'b', and '-'.149*150* <p>Sets may be intersected using the '&' operator or the asymmetric151* set difference may be taken using the '-' operator, for example,152* "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters153* with values less than 4096. Operators ('&' and '|') have equal154* precedence and bind left-to-right. Thus155* "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to156* "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for157* difference; intersection is commutative.158*159* <table>160* <tr valign=top><td nowrap><code>[a]</code><td>The set containing 'a'161* <tr valign=top><td nowrap><code>[a-z]</code><td>The set containing 'a'162* through 'z' and all letters in between, in Unicode order163* <tr valign=top><td nowrap><code>[^a-z]</code><td>The set containing164* all characters but 'a' through 'z',165* that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF166* <tr valign=top><td nowrap><code>[[<em>pat1</em>][<em>pat2</em>]]</code>167* <td>The union of sets specified by <em>pat1</em> and <em>pat2</em>168* <tr valign=top><td nowrap><code>[[<em>pat1</em>]&[<em>pat2</em>]]</code>169* <td>The intersection of sets specified by <em>pat1</em> and <em>pat2</em>170* <tr valign=top><td nowrap><code>[[<em>pat1</em>]-[<em>pat2</em>]]</code>171* <td>The asymmetric difference of sets specified by <em>pat1</em> and172* <em>pat2</em>173* <tr valign=top><td nowrap><code>[:Lu:] or \p{Lu}</code>174* <td>The set of characters having the specified175* Unicode property; in176* this case, Unicode uppercase letters177* <tr valign=top><td nowrap><code>[:^Lu:] or \P{Lu}</code>178* <td>The set of characters <em>not</em> having the given179* Unicode property180* </table>181*182* <p><b>Warning</b>: you cannot add an empty string ("") to a UnicodeSet.</p>183*184* <p><b>Formal syntax</b></p>185*186* <blockquote>187* <table>188* <tr align="top">189* <td nowrap valign="top" align="right"><code>pattern := </code></td>190* <td valign="top"><code>('[' '^'? item* ']') |191* property</code></td>192* </tr>193* <tr align="top">194* <td nowrap valign="top" align="right"><code>item := </code></td>195* <td valign="top"><code>char | (char '-' char) | pattern-expr<br>196* </code></td>197* </tr>198* <tr align="top">199* <td nowrap valign="top" align="right"><code>pattern-expr := </code></td>200* <td valign="top"><code>pattern | pattern-expr pattern |201* pattern-expr op pattern<br>202* </code></td>203* </tr>204* <tr align="top">205* <td nowrap valign="top" align="right"><code>op := </code></td>206* <td valign="top"><code>'&' | '-'<br>207* </code></td>208* </tr>209* <tr align="top">210* <td nowrap valign="top" align="right"><code>special := </code></td>211* <td valign="top"><code>'[' | ']' | '-'<br>212* </code></td>213* </tr>214* <tr align="top">215* <td nowrap valign="top" align="right"><code>char := </code></td>216* <td valign="top"><em>any character that is not</em><code> special<br>217* | ('\\' </code><em>any character</em><code>)<br>218* | ('\u' hex hex hex hex)<br>219* </code></td>220* </tr>221* <tr align="top">222* <td nowrap valign="top" align="right"><code>hex := </code></td>223* <td valign="top"><em>any character for which224* </em><code>Character.digit(c, 16)</code><em>225* returns a non-negative result</em></td>226* </tr>227* <tr>228* <td nowrap valign="top" align="right"><code>property := </code></td>229* <td valign="top"><em>a Unicode property set pattern</td>230* </tr>231* </table>232* <br>233* <table border="1">234* <tr>235* <td>Legend: <table>236* <tr>237* <td nowrap valign="top"><code>a := b</code></td>238* <td width="20" valign="top"> </td>239* <td valign="top"><code>a</code> may be replaced by <code>b</code> </td>240* </tr>241* <tr>242* <td nowrap valign="top"><code>a?</code></td>243* <td valign="top"></td>244* <td valign="top">zero or one instance of <code>a</code><br>245* </td>246* </tr>247* <tr>248* <td nowrap valign="top"><code>a*</code></td>249* <td valign="top"></td>250* <td valign="top">one or more instances of <code>a</code><br>251* </td>252* </tr>253* <tr>254* <td nowrap valign="top"><code>a | b</code></td>255* <td valign="top"></td>256* <td valign="top">either <code>a</code> or <code>b</code><br>257* </td>258* </tr>259* <tr>260* <td nowrap valign="top"><code>'a'</code></td>261* <td valign="top"></td>262* <td valign="top">the literal string between the quotes </td>263* </tr>264* </table>265* </td>266* </tr>267* </table>268* </blockquote>269* <p>To iterate over contents of UnicodeSet, use UnicodeSetIterator class.270*271* @author Alan Liu272* @stable ICU 2.0273* @see UnicodeSetIterator274*/275public class UnicodeSet implements UnicodeMatcher {276277private static final int LOW = 0x000000; // LOW <= all valid values. ZERO for codepoints278private static final int HIGH = 0x110000; // HIGH > all valid values. 10000 for code units.279// 110000 for codepoints280281/**282* Minimum value that can be stored in a UnicodeSet.283* @stable ICU 2.0284*/285public static final int MIN_VALUE = LOW;286287/**288* Maximum value that can be stored in a UnicodeSet.289* @stable ICU 2.0290*/291public static final int MAX_VALUE = HIGH - 1;292293private int len; // length used; list may be longer to minimize reallocs294private int[] list; // MUST be terminated with HIGH295private int[] rangeList; // internal buffer296private int[] buffer; // internal buffer297298// NOTE: normally the field should be of type SortedSet; but that is missing a public clone!!299// is not private so that UnicodeSetIterator can get access300TreeSet<String> strings = new TreeSet<>();301302/**303* The pattern representation of this set. This may not be the304* most economical pattern. It is the pattern supplied to305* applyPattern(), with variables substituted and whitespace306* removed. For sets constructed without applyPattern(), or307* modified using the non-pattern API, this string will be null,308* indicating that toPattern() must generate a pattern309* representation from the inversion list.310*/311private String pat = null;312313private static final int START_EXTRA = 16; // initial storage. Must be >= 0314private static final int GROW_EXTRA = START_EXTRA; // extra amount for growth. Must be >= 0315316/**317* A set of all characters _except_ the second through last characters of318* certain ranges. These ranges are ranges of characters whose319* properties are all exactly alike, e.g. CJK Ideographs from320* U+4E00 to U+9FA5.321*/322private static UnicodeSet INCLUSIONS[] = null;323324//----------------------------------------------------------------325// Public API326//----------------------------------------------------------------327328/**329* Constructs an empty set.330* @stable ICU 2.0331*/332public UnicodeSet() {333list = new int[1 + START_EXTRA];334list[len++] = HIGH;335}336337/**338* Constructs a set containing the given range. If <code>end >339* start</code> then an empty set is created.340*341* @param start first character, inclusive, of range342* @param end last character, inclusive, of range343* @stable ICU 2.0344*/345public UnicodeSet(int start, int end) {346this();347complement(start, end);348}349350/**351* Constructs a set from the given pattern. See the class description352* for the syntax of the pattern language. Whitespace is ignored.353* @param pattern a string specifying what characters are in the set354* @exception java.lang.IllegalArgumentException if the pattern contains355* a syntax error.356* @stable ICU 2.0357*/358public UnicodeSet(String pattern) {359this();360applyPattern(pattern, null, null, IGNORE_SPACE);361}362363/**364* Make this object represent the same set as <code>other</code>.365* @param other a <code>UnicodeSet</code> whose value will be366* copied to this object367* @stable ICU 2.0368*/369public UnicodeSet set(UnicodeSet other) {370list = other.list.clone();371len = other.len;372pat = other.pat;373strings = (TreeSet)other.strings.clone();374return this;375}376377/**378* Modifies this set to represent the set specified by the given pattern.379* See the class description for the syntax of the pattern language.380* Whitespace is ignored.381* @param pattern a string specifying what characters are in the set382* @exception java.lang.IllegalArgumentException if the pattern383* contains a syntax error.384* @stable ICU 2.0385*/386public final UnicodeSet applyPattern(String pattern) {387return applyPattern(pattern, null, null, IGNORE_SPACE);388}389390/**391* Append the <code>toPattern()</code> representation of a392* string to the given <code>StringBuffer</code>.393*/394private static void _appendToPat(StringBuffer buf, String s, boolean escapeUnprintable) {395for (int i = 0; i < s.length(); i += UTF16.getCharCount(i)) {396_appendToPat(buf, UTF16.charAt(s, i), escapeUnprintable);397}398}399400/**401* Append the <code>toPattern()</code> representation of a402* character to the given <code>StringBuffer</code>.403*/404private static void _appendToPat(StringBuffer buf, int c, boolean escapeUnprintable) {405if (escapeUnprintable && Utility.isUnprintable(c)) {406// Use hex escape notation (<backslash>uxxxx or <backslash>Uxxxxxxxx) for anything407// unprintable408if (Utility.escapeUnprintable(buf, c)) {409return;410}411}412// Okay to let ':' pass through413switch (c) {414case '[': // SET_OPEN:415case ']': // SET_CLOSE:416case '-': // HYPHEN:417case '^': // COMPLEMENT:418case '&': // INTERSECTION:419case '\\': //BACKSLASH:420case '{':421case '}':422case '$':423case ':':424buf.append('\\');425break;426default:427// Escape whitespace428if (UCharacterProperty.isRuleWhiteSpace(c)) {429buf.append('\\');430}431break;432}433UTF16.append(buf, c);434}435436/**437* Append a string representation of this set to result. This will be438* a cleaned version of the string passed to applyPattern(), if there439* is one. Otherwise it will be generated.440*/441private StringBuffer _toPattern(StringBuffer result,442boolean escapeUnprintable) {443if (pat != null) {444int i;445int backslashCount = 0;446for (i=0; i<pat.length(); ) {447int c = UTF16.charAt(pat, i);448i += UTF16.getCharCount(c);449if (escapeUnprintable && Utility.isUnprintable(c)) {450// If the unprintable character is preceded by an odd451// number of backslashes, then it has been escaped.452// Before unescaping it, we delete the final453// backslash.454if ((backslashCount % 2) == 1) {455result.setLength(result.length() - 1);456}457Utility.escapeUnprintable(result, c);458backslashCount = 0;459} else {460UTF16.append(result, c);461if (c == '\\') {462++backslashCount;463} else {464backslashCount = 0;465}466}467}468return result;469}470471return _generatePattern(result, escapeUnprintable, true);472}473474/**475* Generate and append a string representation of this set to result.476* This does not use this.pat, the cleaned up copy of the string477* passed to applyPattern().478* @param includeStrings if false, doesn't include the strings.479* @stable ICU 3.8480*/481public StringBuffer _generatePattern(StringBuffer result,482boolean escapeUnprintable, boolean includeStrings) {483result.append('[');484485int count = getRangeCount();486487// If the set contains at least 2 intervals and includes both488// MIN_VALUE and MAX_VALUE, then the inverse representation will489// be more economical.490if (count > 1 &&491getRangeStart(0) == MIN_VALUE &&492getRangeEnd(count-1) == MAX_VALUE) {493494// Emit the inverse495result.append('^');496497for (int i = 1; i < count; ++i) {498int start = getRangeEnd(i-1)+1;499int end = getRangeStart(i)-1;500_appendToPat(result, start, escapeUnprintable);501if (start != end) {502if ((start+1) != end) {503result.append('-');504}505_appendToPat(result, end, escapeUnprintable);506}507}508}509510// Default; emit the ranges as pairs511else {512for (int i = 0; i < count; ++i) {513int start = getRangeStart(i);514int end = getRangeEnd(i);515_appendToPat(result, start, escapeUnprintable);516if (start != end) {517if ((start+1) != end) {518result.append('-');519}520_appendToPat(result, end, escapeUnprintable);521}522}523}524525if (includeStrings && strings.size() > 0) {526Iterator<String> it = strings.iterator();527while (it.hasNext()) {528result.append('{');529_appendToPat(result, it.next(), escapeUnprintable);530result.append('}');531}532}533return result.append(']');534}535536// for internal use, after checkFrozen has been called537private UnicodeSet add_unchecked(int start, int end) {538if (start < MIN_VALUE || start > MAX_VALUE) {539throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(start, 6));540}541if (end < MIN_VALUE || end > MAX_VALUE) {542throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(end, 6));543}544if (start < end) {545add(range(start, end), 2, 0);546} else if (start == end) {547add(start);548}549return this;550}551552/**553* Adds the specified character to this set if it is not already554* present. If this set already contains the specified character,555* the call leaves this set unchanged.556* @stable ICU 2.0557*/558public final UnicodeSet add(int c) {559return add_unchecked(c);560}561562// for internal use only, after checkFrozen has been called563private final UnicodeSet add_unchecked(int c) {564if (c < MIN_VALUE || c > MAX_VALUE) {565throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(c, 6));566}567568// find smallest i such that c < list[i]569// if odd, then it is IN the set570// if even, then it is OUT of the set571int i = findCodePoint(c);572573// already in set?574if ((i & 1) != 0) return this;575576// HIGH is 0x110000577// assert(list[len-1] == HIGH);578579// empty = [HIGH]580// [start_0, limit_0, start_1, limit_1, HIGH]581582// [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH]583// ^584// list[i]585586// i == 0 means c is before the first range587588if (c == list[i]-1) {589// c is before start of next range590list[i] = c;591// if we touched the HIGH mark, then add a new one592if (c == MAX_VALUE) {593ensureCapacity(len+1);594list[len++] = HIGH;595}596if (i > 0 && c == list[i-1]) {597// collapse adjacent ranges598599// [..., start_k-1, c, c, limit_k, ..., HIGH]600// ^601// list[i]602System.arraycopy(list, i+1, list, i-1, len-i-1);603len -= 2;604}605}606607else if (i > 0 && c == list[i-1]) {608// c is after end of prior range609list[i-1]++;610// no need to chcek for collapse here611}612613else {614// At this point we know the new char is not adjacent to615// any existing ranges, and it is not 10FFFF.616617618// [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH]619// ^620// list[i]621622// [..., start_k-1, limit_k-1, c, c+1, start_k, limit_k, ..., HIGH]623// ^624// list[i]625626// Don't use ensureCapacity() to save on copying.627// NOTE: This has no measurable impact on performance,628// but it might help in some usage patterns.629if (len+2 > list.length) {630int[] temp = new int[len + 2 + GROW_EXTRA];631if (i != 0) System.arraycopy(list, 0, temp, 0, i);632System.arraycopy(list, i, temp, i+2, len-i);633list = temp;634} else {635System.arraycopy(list, i, list, i+2, len-i);636}637638list[i] = c;639list[i+1] = c+1;640len += 2;641}642643pat = null;644return this;645}646647/**648* Adds the specified multicharacter to this set if it is not already649* present. If this set already contains the multicharacter,650* the call leaves this set unchanged.651* Thus "ch" => {"ch"}652* <br><b>Warning: you cannot add an empty string ("") to a UnicodeSet.</b>653* @param s the source string654* @return this object, for chaining655* @stable ICU 2.0656*/657public final UnicodeSet add(String s) {658int cp = getSingleCP(s);659if (cp < 0) {660strings.add(s);661pat = null;662} else {663add_unchecked(cp, cp);664}665return this;666}667668/**669* @return a code point IF the string consists of a single one.670* otherwise returns -1.671* @param string to test672*/673private static int getSingleCP(String s) {674if (s.length() < 1) {675throw new IllegalArgumentException("Can't use zero-length strings in UnicodeSet");676}677if (s.length() > 2) return -1;678if (s.length() == 1) return s.charAt(0);679680// at this point, len = 2681int cp = UTF16.charAt(s, 0);682if (cp > 0xFFFF) { // is surrogate pair683return cp;684}685return -1;686}687688/**689* Complements the specified range in this set. Any character in690* the range will be removed if it is in this set, or will be691* added if it is not in this set. If <code>end > start</code>692* then an empty range is complemented, leaving the set unchanged.693*694* @param start first character, inclusive, of range to be removed695* from this set.696* @param end last character, inclusive, of range to be removed697* from this set.698* @stable ICU 2.0699*/700public UnicodeSet complement(int start, int end) {701if (start < MIN_VALUE || start > MAX_VALUE) {702throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(start, 6));703}704if (end < MIN_VALUE || end > MAX_VALUE) {705throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(end, 6));706}707if (start <= end) {708xor(range(start, end), 2, 0);709}710pat = null;711return this;712}713714/**715* This is equivalent to716* <code>complement(MIN_VALUE, MAX_VALUE)</code>.717* @stable ICU 2.0718*/719public UnicodeSet complement() {720if (list[0] == LOW) {721System.arraycopy(list, 1, list, 0, len-1);722--len;723} else {724ensureCapacity(len+1);725System.arraycopy(list, 0, list, 1, len);726list[0] = LOW;727++len;728}729pat = null;730return this;731}732733/**734* Returns true if this set contains the given character.735* @param c character to be checked for containment736* @return true if the test condition is met737* @stable ICU 2.0738*/739public boolean contains(int c) {740if (c < MIN_VALUE || c > MAX_VALUE) {741throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(c, 6));742}743744/*745// Set i to the index of the start item greater than ch746// We know we will terminate without length test!747int i = -1;748while (true) {749if (c < list[++i]) break;750}751*/752753int i = findCodePoint(c);754755return ((i & 1) != 0); // return true if odd756}757758/**759* Returns the smallest value i such that c < list[i]. Caller760* must ensure that c is a legal value or this method will enter761* an infinite loop. This method performs a binary search.762* @param c a character in the range MIN_VALUE..MAX_VALUE763* inclusive764* @return the smallest integer i in the range 0..len-1,765* inclusive, such that c < list[i]766*/767private final int findCodePoint(int c) {768/* Examples:769findCodePoint(c)770set list[] c=0 1 3 4 7 8771=== ============== ===========772[] [110000] 0 0 0 0 0 0773[\u0000-\u0003] [0, 4, 110000] 1 1 1 2 2 2774[\u0004-\u0007] [4, 8, 110000] 0 0 0 1 1 2775[:all:] [0, 110000] 1 1 1 1 1 1776*/777778// Return the smallest i such that c < list[i]. Assume779// list[len - 1] == HIGH and that c is legal (0..HIGH-1).780if (c < list[0]) return 0;781// High runner test. c is often after the last range, so an782// initial check for this condition pays off.783if (len >= 2 && c >= list[len-2]) return len-1;784int lo = 0;785int hi = len - 1;786// invariant: c >= list[lo]787// invariant: c < list[hi]788for (;;) {789int i = (lo + hi) >>> 1;790if (i == lo) return hi;791if (c < list[i]) {792hi = i;793} else {794lo = i;795}796}797}798799/**800* Adds all of the elements in the specified set to this set if801* they're not already present. This operation effectively802* modifies this set so that its value is the <i>union</i> of the two803* sets. The behavior of this operation is unspecified if the specified804* collection is modified while the operation is in progress.805*806* @param c set whose elements are to be added to this set.807* @stable ICU 2.0808*/809public UnicodeSet addAll(UnicodeSet c) {810add(c.list, c.len, 0);811strings.addAll(c.strings);812return this;813}814815/**816* Retains only the elements in this set that are contained in the817* specified set. In other words, removes from this set all of818* its elements that are not contained in the specified set. This819* operation effectively modifies this set so that its value is820* the <i>intersection</i> of the two sets.821*822* @param c set that defines which elements this set will retain.823* @stable ICU 2.0824*/825public UnicodeSet retainAll(UnicodeSet c) {826retain(c.list, c.len, 0);827strings.retainAll(c.strings);828return this;829}830831/**832* Removes from this set all of its elements that are contained in the833* specified set. This operation effectively modifies this834* set so that its value is the <i>asymmetric set difference</i> of835* the two sets.836*837* @param c set that defines which elements will be removed from838* this set.839* @stable ICU 2.0840*/841public UnicodeSet removeAll(UnicodeSet c) {842retain(c.list, c.len, 2);843strings.removeAll(c.strings);844return this;845}846847/**848* Removes all of the elements from this set. This set will be849* empty after this call returns.850* @stable ICU 2.0851*/852public UnicodeSet clear() {853list[0] = HIGH;854len = 1;855pat = null;856strings.clear();857return this;858}859860/**861* Iteration method that returns the number of ranges contained in862* this set.863* @see #getRangeStart864* @see #getRangeEnd865* @stable ICU 2.0866*/867public int getRangeCount() {868return len/2;869}870871/**872* Iteration method that returns the first character in the873* specified range of this set.874* @exception ArrayIndexOutOfBoundsException if index is outside875* the range <code>0..getRangeCount()-1</code>876* @see #getRangeCount877* @see #getRangeEnd878* @stable ICU 2.0879*/880public int getRangeStart(int index) {881return list[index*2];882}883884/**885* Iteration method that returns the last character in the886* specified range of this set.887* @exception ArrayIndexOutOfBoundsException if index is outside888* the range <code>0..getRangeCount()-1</code>889* @see #getRangeStart890* @see #getRangeEnd891* @stable ICU 2.0892*/893public int getRangeEnd(int index) {894return (list[index*2 + 1] - 1);895}896897//----------------------------------------------------------------898// Implementation: Pattern parsing899//----------------------------------------------------------------900901/**902* Parses the given pattern, starting at the given position. The character903* at pattern.charAt(pos.getIndex()) must be '[', or the parse fails.904* Parsing continues until the corresponding closing ']'. If a syntax error905* is encountered between the opening and closing brace, the parse fails.906* Upon return from a successful parse, the ParsePosition is updated to907* point to the character following the closing ']', and an inversion908* list for the parsed pattern is returned. This method909* calls itself recursively to parse embedded subpatterns.910*911* @param pattern the string containing the pattern to be parsed. The912* portion of the string from pos.getIndex(), which must be a '[', to the913* corresponding closing ']', is parsed.914* @param pos upon entry, the position at which to being parsing. The915* character at pattern.charAt(pos.getIndex()) must be a '['. Upon return916* from a successful parse, pos.getIndex() is either the character after the917* closing ']' of the parsed pattern, or pattern.length() if the closing ']'918* is the last character of the pattern string.919* @return an inversion list for the parsed substring920* of <code>pattern</code>921* @exception java.lang.IllegalArgumentException if the parse fails.922*/923UnicodeSet applyPattern(String pattern,924ParsePosition pos,925SymbolTable symbols,926int options) {927928// Need to build the pattern in a temporary string because929// _applyPattern calls add() etc., which set pat to empty.930boolean parsePositionWasNull = pos == null;931if (parsePositionWasNull) {932pos = new ParsePosition(0);933}934935StringBuffer rebuiltPat = new StringBuffer();936RuleCharacterIterator chars =937new RuleCharacterIterator(pattern, symbols, pos);938applyPattern(chars, symbols, rebuiltPat, options);939if (chars.inVariable()) {940syntaxError(chars, "Extra chars in variable value");941}942pat = rebuiltPat.toString();943if (parsePositionWasNull) {944int i = pos.getIndex();945946// Skip over trailing whitespace947if ((options & IGNORE_SPACE) != 0) {948i = Utility.skipWhitespace(pattern, i);949}950951if (i != pattern.length()) {952throw new IllegalArgumentException("Parse of \"" + pattern +953"\" failed at " + i);954}955}956return this;957}958959/**960* Parse the pattern from the given RuleCharacterIterator. The961* iterator is advanced over the parsed pattern.962* @param chars iterator over the pattern characters. Upon return963* it will be advanced to the first character after the parsed964* pattern, or the end of the iteration if all characters are965* parsed.966* @param symbols symbol table to use to parse and dereference967* variables, or null if none.968* @param rebuiltPat the pattern that was parsed, rebuilt or969* copied from the input pattern, as appropriate.970* @param options a bit mask of zero or more of the following:971* IGNORE_SPACE, CASE.972*/973void applyPattern(RuleCharacterIterator chars, SymbolTable symbols,974StringBuffer rebuiltPat, int options) {975// Syntax characters: [ ] ^ - & { }976977// Recognized special forms for chars, sets: c-c s-s s&s978979int opts = RuleCharacterIterator.PARSE_VARIABLES |980RuleCharacterIterator.PARSE_ESCAPES;981if ((options & IGNORE_SPACE) != 0) {982opts |= RuleCharacterIterator.SKIP_WHITESPACE;983}984985StringBuffer patBuf = new StringBuffer(), buf = null;986boolean usePat = false;987UnicodeSet scratch = null;988Object backup = null;989990// mode: 0=before [, 1=between [...], 2=after ]991// lastItem: 0=none, 1=char, 2=set992int lastItem = 0, lastChar = 0, mode = 0;993char op = 0;994995boolean invert = false;996997clear();998999while (mode != 2 && !chars.atEnd()) {1000if (false) {1001// Debugging assertion1002if (!((lastItem == 0 && op == 0) ||1003(lastItem == 1 && (op == 0 || op == '-')) ||1004(lastItem == 2 && (op == 0 || op == '-' || op == '&')))) {1005throw new IllegalArgumentException();1006}1007}10081009int c = 0;1010boolean literal = false;1011UnicodeSet nested = null;10121013// -------- Check for property pattern10141015// setMode: 0=none, 1=unicodeset, 2=propertypat, 3=preparsed1016int setMode = 0;1017if (resemblesPropertyPattern(chars, opts)) {1018setMode = 2;1019}10201021// -------- Parse '[' of opening delimiter OR nested set.1022// If there is a nested set, use `setMode' to define how1023// the set should be parsed. If the '[' is part of the1024// opening delimiter for this pattern, parse special1025// strings "[", "[^", "[-", and "[^-". Check for stand-in1026// characters representing a nested set in the symbol1027// table.10281029else {1030// Prepare to backup if necessary1031backup = chars.getPos(backup);1032c = chars.next(opts);1033literal = chars.isEscaped();10341035if (c == '[' && !literal) {1036if (mode == 1) {1037chars.setPos(backup); // backup1038setMode = 1;1039} else {1040// Handle opening '[' delimiter1041mode = 1;1042patBuf.append('[');1043backup = chars.getPos(backup); // prepare to backup1044c = chars.next(opts);1045literal = chars.isEscaped();1046if (c == '^' && !literal) {1047invert = true;1048patBuf.append('^');1049backup = chars.getPos(backup); // prepare to backup1050c = chars.next(opts);1051literal = chars.isEscaped();1052}1053// Fall through to handle special leading '-';1054// otherwise restart loop for nested [], \p{}, etc.1055if (c == '-') {1056literal = true;1057// Fall through to handle literal '-' below1058} else {1059chars.setPos(backup); // backup1060continue;1061}1062}1063} else if (symbols != null) {1064UnicodeMatcher m = symbols.lookupMatcher(c); // may be null1065if (m != null) {1066try {1067nested = (UnicodeSet) m;1068setMode = 3;1069} catch (ClassCastException e) {1070syntaxError(chars, "Syntax error");1071}1072}1073}1074}10751076// -------- Handle a nested set. This either is inline in1077// the pattern or represented by a stand-in that has1078// previously been parsed and was looked up in the symbol1079// table.10801081if (setMode != 0) {1082if (lastItem == 1) {1083if (op != 0) {1084syntaxError(chars, "Char expected after operator");1085}1086add_unchecked(lastChar, lastChar);1087_appendToPat(patBuf, lastChar, false);1088lastItem = op = 0;1089}10901091if (op == '-' || op == '&') {1092patBuf.append(op);1093}10941095if (nested == null) {1096if (scratch == null) scratch = new UnicodeSet();1097nested = scratch;1098}1099switch (setMode) {1100case 1:1101nested.applyPattern(chars, symbols, patBuf, options);1102break;1103case 2:1104chars.skipIgnored(opts);1105nested.applyPropertyPattern(chars, patBuf, symbols);1106break;1107case 3: // `nested' already parsed1108nested._toPattern(patBuf, false);1109break;1110}11111112usePat = true;11131114if (mode == 0) {1115// Entire pattern is a category; leave parse loop1116set(nested);1117mode = 2;1118break;1119}11201121switch (op) {1122case '-':1123removeAll(nested);1124break;1125case '&':1126retainAll(nested);1127break;1128case 0:1129addAll(nested);1130break;1131}11321133op = 0;1134lastItem = 2;11351136continue;1137}11381139if (mode == 0) {1140syntaxError(chars, "Missing '['");1141}11421143// -------- Parse special (syntax) characters. If the1144// current character is not special, or if it is escaped,1145// then fall through and handle it below.11461147if (!literal) {1148switch (c) {1149case ']':1150if (lastItem == 1) {1151add_unchecked(lastChar, lastChar);1152_appendToPat(patBuf, lastChar, false);1153}1154// Treat final trailing '-' as a literal1155if (op == '-') {1156add_unchecked(op, op);1157patBuf.append(op);1158} else if (op == '&') {1159syntaxError(chars, "Trailing '&'");1160}1161patBuf.append(']');1162mode = 2;1163continue;1164case '-':1165if (op == 0) {1166if (lastItem != 0) {1167op = (char) c;1168continue;1169} else {1170// Treat final trailing '-' as a literal1171add_unchecked(c, c);1172c = chars.next(opts);1173literal = chars.isEscaped();1174if (c == ']' && !literal) {1175patBuf.append("-]");1176mode = 2;1177continue;1178}1179}1180}1181syntaxError(chars, "'-' not after char or set");1182break;1183case '&':1184if (lastItem == 2 && op == 0) {1185op = (char) c;1186continue;1187}1188syntaxError(chars, "'&' not after set");1189break;1190case '^':1191syntaxError(chars, "'^' not after '['");1192break;1193case '{':1194if (op != 0) {1195syntaxError(chars, "Missing operand after operator");1196}1197if (lastItem == 1) {1198add_unchecked(lastChar, lastChar);1199_appendToPat(patBuf, lastChar, false);1200}1201lastItem = 0;1202if (buf == null) {1203buf = new StringBuffer();1204} else {1205buf.setLength(0);1206}1207boolean ok = false;1208while (!chars.atEnd()) {1209c = chars.next(opts);1210literal = chars.isEscaped();1211if (c == '}' && !literal) {1212ok = true;1213break;1214}1215UTF16.append(buf, c);1216}1217if (buf.length() < 1 || !ok) {1218syntaxError(chars, "Invalid multicharacter string");1219}1220// We have new string. Add it to set and continue;1221// we don't need to drop through to the further1222// processing1223add(buf.toString());1224patBuf.append('{');1225_appendToPat(patBuf, buf.toString(), false);1226patBuf.append('}');1227continue;1228case SymbolTable.SYMBOL_REF:1229// symbols nosymbols1230// [a-$] error error (ambiguous)1231// [a$] anchor anchor1232// [a-$x] var "x"* literal '$'1233// [a-$.] error literal '$'1234// *We won't get here in the case of var "x"1235backup = chars.getPos(backup);1236c = chars.next(opts);1237literal = chars.isEscaped();1238boolean anchor = (c == ']' && !literal);1239if (symbols == null && !anchor) {1240c = SymbolTable.SYMBOL_REF;1241chars.setPos(backup);1242break; // literal '$'1243}1244if (anchor && op == 0) {1245if (lastItem == 1) {1246add_unchecked(lastChar, lastChar);1247_appendToPat(patBuf, lastChar, false);1248}1249add_unchecked(UnicodeMatcher.ETHER);1250usePat = true;1251patBuf.append(SymbolTable.SYMBOL_REF).append(']');1252mode = 2;1253continue;1254}1255syntaxError(chars, "Unquoted '$'");1256break;1257default:1258break;1259}1260}12611262// -------- Parse literal characters. This includes both1263// escaped chars ("\u4E01") and non-syntax characters1264// ("a").12651266switch (lastItem) {1267case 0:1268lastItem = 1;1269lastChar = c;1270break;1271case 1:1272if (op == '-') {1273if (lastChar >= c) {1274// Don't allow redundant (a-a) or empty (b-a) ranges;1275// these are most likely typos.1276syntaxError(chars, "Invalid range");1277}1278add_unchecked(lastChar, c);1279_appendToPat(patBuf, lastChar, false);1280patBuf.append(op);1281_appendToPat(patBuf, c, false);1282lastItem = op = 0;1283} else {1284add_unchecked(lastChar, lastChar);1285_appendToPat(patBuf, lastChar, false);1286lastChar = c;1287}1288break;1289case 2:1290if (op != 0) {1291syntaxError(chars, "Set expected after operator");1292}1293lastChar = c;1294lastItem = 1;1295break;1296}1297}12981299if (mode != 2) {1300syntaxError(chars, "Missing ']'");1301}13021303chars.skipIgnored(opts);13041305if (invert) {1306complement();1307}13081309// Use the rebuilt pattern (pat) only if necessary. Prefer the1310// generated pattern.1311if (usePat) {1312rebuiltPat.append(patBuf.toString());1313} else {1314_generatePattern(rebuiltPat, false, true);1315}1316}13171318private static void syntaxError(RuleCharacterIterator chars, String msg) {1319throw new IllegalArgumentException("Error: " + msg + " at \"" +1320Utility.escape(chars.toString()) +1321'"');1322}13231324//----------------------------------------------------------------1325// Implementation: Utility methods1326//----------------------------------------------------------------13271328private void ensureCapacity(int newLen) {1329if (newLen <= list.length) return;1330int[] temp = new int[newLen + GROW_EXTRA];1331System.arraycopy(list, 0, temp, 0, len);1332list = temp;1333}13341335private void ensureBufferCapacity(int newLen) {1336if (buffer != null && newLen <= buffer.length) return;1337buffer = new int[newLen + GROW_EXTRA];1338}13391340/**1341* Assumes start <= end.1342*/1343private int[] range(int start, int end) {1344if (rangeList == null) {1345rangeList = new int[] { start, end+1, HIGH };1346} else {1347rangeList[0] = start;1348rangeList[1] = end+1;1349}1350return rangeList;1351}13521353//----------------------------------------------------------------1354// Implementation: Fundamental operations1355//----------------------------------------------------------------13561357// polarity = 0, 3 is normal: x xor y1358// polarity = 1, 2: x xor ~y == x === y13591360private UnicodeSet xor(int[] other, int otherLen, int polarity) {1361ensureBufferCapacity(len + otherLen);1362int i = 0, j = 0, k = 0;1363int a = list[i++];1364int b;1365if (polarity == 1 || polarity == 2) {1366b = LOW;1367if (other[j] == LOW) { // skip base if already LOW1368++j;1369b = other[j];1370}1371} else {1372b = other[j++];1373}1374// simplest of all the routines1375// sort the values, discarding identicals!1376while (true) {1377if (a < b) {1378buffer[k++] = a;1379a = list[i++];1380} else if (b < a) {1381buffer[k++] = b;1382b = other[j++];1383} else if (a != HIGH) { // at this point, a == b1384// discard both values!1385a = list[i++];1386b = other[j++];1387} else { // DONE!1388buffer[k++] = HIGH;1389len = k;1390break;1391}1392}1393// swap list and buffer1394int[] temp = list;1395list = buffer;1396buffer = temp;1397pat = null;1398return this;1399}14001401// polarity = 0 is normal: x union y1402// polarity = 2: x union ~y1403// polarity = 1: ~x union y1404// polarity = 3: ~x union ~y14051406private UnicodeSet add(int[] other, int otherLen, int polarity) {1407ensureBufferCapacity(len + otherLen);1408int i = 0, j = 0, k = 0;1409int a = list[i++];1410int b = other[j++];1411// change from xor is that we have to check overlapping pairs1412// polarity bit 1 means a is second, bit 2 means b is.1413main:1414while (true) {1415switch (polarity) {1416case 0: // both first; take lower if unequal1417if (a < b) { // take a1418// Back up over overlapping ranges in buffer[]1419if (k > 0 && a <= buffer[k-1]) {1420// Pick latter end value in buffer[] vs. list[]1421a = max(list[i], buffer[--k]);1422} else {1423// No overlap1424buffer[k++] = a;1425a = list[i];1426}1427i++; // Common if/else code factored out1428polarity ^= 1;1429} else if (b < a) { // take b1430if (k > 0 && b <= buffer[k-1]) {1431b = max(other[j], buffer[--k]);1432} else {1433buffer[k++] = b;1434b = other[j];1435}1436j++;1437polarity ^= 2;1438} else { // a == b, take a, drop b1439if (a == HIGH) break main;1440// This is symmetrical; it doesn't matter if1441// we backtrack with a or b. - liu1442if (k > 0 && a <= buffer[k-1]) {1443a = max(list[i], buffer[--k]);1444} else {1445// No overlap1446buffer[k++] = a;1447a = list[i];1448}1449i++;1450polarity ^= 1;1451b = other[j++]; polarity ^= 2;1452}1453break;1454case 3: // both second; take higher if unequal, and drop other1455if (b <= a) { // take a1456if (a == HIGH) break main;1457buffer[k++] = a;1458} else { // take b1459if (b == HIGH) break main;1460buffer[k++] = b;1461}1462a = list[i++]; polarity ^= 1; // factored common code1463b = other[j++]; polarity ^= 2;1464break;1465case 1: // a second, b first; if b < a, overlap1466if (a < b) { // no overlap, take a1467buffer[k++] = a; a = list[i++]; polarity ^= 1;1468} else if (b < a) { // OVERLAP, drop b1469b = other[j++]; polarity ^= 2;1470} else { // a == b, drop both!1471if (a == HIGH) break main;1472a = list[i++]; polarity ^= 1;1473b = other[j++]; polarity ^= 2;1474}1475break;1476case 2: // a first, b second; if a < b, overlap1477if (b < a) { // no overlap, take b1478buffer[k++] = b; b = other[j++]; polarity ^= 2;1479} else if (a < b) { // OVERLAP, drop a1480a = list[i++]; polarity ^= 1;1481} else { // a == b, drop both!1482if (a == HIGH) break main;1483a = list[i++]; polarity ^= 1;1484b = other[j++]; polarity ^= 2;1485}1486break;1487}1488}1489buffer[k++] = HIGH; // terminate1490len = k;1491// swap list and buffer1492int[] temp = list;1493list = buffer;1494buffer = temp;1495pat = null;1496return this;1497}14981499// polarity = 0 is normal: x intersect y1500// polarity = 2: x intersect ~y == set-minus1501// polarity = 1: ~x intersect y1502// polarity = 3: ~x intersect ~y15031504private UnicodeSet retain(int[] other, int otherLen, int polarity) {1505ensureBufferCapacity(len + otherLen);1506int i = 0, j = 0, k = 0;1507int a = list[i++];1508int b = other[j++];1509// change from xor is that we have to check overlapping pairs1510// polarity bit 1 means a is second, bit 2 means b is.1511main:1512while (true) {1513switch (polarity) {1514case 0: // both first; drop the smaller1515if (a < b) { // drop a1516a = list[i++]; polarity ^= 1;1517} else if (b < a) { // drop b1518b = other[j++]; polarity ^= 2;1519} else { // a == b, take one, drop other1520if (a == HIGH) break main;1521buffer[k++] = a; a = list[i++]; polarity ^= 1;1522b = other[j++]; polarity ^= 2;1523}1524break;1525case 3: // both second; take lower if unequal1526if (a < b) { // take a1527buffer[k++] = a; a = list[i++]; polarity ^= 1;1528} else if (b < a) { // take b1529buffer[k++] = b; b = other[j++]; polarity ^= 2;1530} else { // a == b, take one, drop other1531if (a == HIGH) break main;1532buffer[k++] = a; a = list[i++]; polarity ^= 1;1533b = other[j++]; polarity ^= 2;1534}1535break;1536case 1: // a second, b first;1537if (a < b) { // NO OVERLAP, drop a1538a = list[i++]; polarity ^= 1;1539} else if (b < a) { // OVERLAP, take b1540buffer[k++] = b; b = other[j++]; polarity ^= 2;1541} else { // a == b, drop both!1542if (a == HIGH) break main;1543a = list[i++]; polarity ^= 1;1544b = other[j++]; polarity ^= 2;1545}1546break;1547case 2: // a first, b second; if a < b, overlap1548if (b < a) { // no overlap, drop b1549b = other[j++]; polarity ^= 2;1550} else if (a < b) { // OVERLAP, take a1551buffer[k++] = a; a = list[i++]; polarity ^= 1;1552} else { // a == b, drop both!1553if (a == HIGH) break main;1554a = list[i++]; polarity ^= 1;1555b = other[j++]; polarity ^= 2;1556}1557break;1558}1559}1560buffer[k++] = HIGH; // terminate1561len = k;1562// swap list and buffer1563int[] temp = list;1564list = buffer;1565buffer = temp;1566pat = null;1567return this;1568}15691570private static final int max(int a, int b) {1571return (a > b) ? a : b;1572}15731574//----------------------------------------------------------------1575// Generic filter-based scanning code1576//----------------------------------------------------------------15771578private static interface Filter {1579boolean contains(int codePoint);1580}15811582// VersionInfo for unassigned characters1583static final VersionInfo NO_VERSION = VersionInfo.getInstance(0, 0, 0, 0);15841585private static class VersionFilter implements Filter {1586VersionInfo version;15871588VersionFilter(VersionInfo version) { this.version = version; }15891590public boolean contains(int ch) {1591VersionInfo v = UCharacter.getAge(ch);1592// Reference comparison ok; VersionInfo caches and reuses1593// unique objects.1594return v != NO_VERSION &&1595v.compareTo(version) <= 0;1596}1597}15981599private static synchronized UnicodeSet getInclusions(int src) {1600if (INCLUSIONS == null) {1601INCLUSIONS = new UnicodeSet[UCharacterProperty.SRC_COUNT];1602}1603if(INCLUSIONS[src] == null) {1604UnicodeSet incl = new UnicodeSet();1605switch(src) {1606case UCharacterProperty.SRC_PROPSVEC:1607UCharacterProperty.getInstance().upropsvec_addPropertyStarts(incl);1608break;1609default:1610throw new IllegalStateException("UnicodeSet.getInclusions(unknown src "+src+")");1611}1612INCLUSIONS[src] = incl;1613}1614return INCLUSIONS[src];1615}16161617/**1618* Generic filter-based scanning code for UCD property UnicodeSets.1619*/1620private UnicodeSet applyFilter(Filter filter, int src) {1621// Walk through all Unicode characters, noting the start1622// and end of each range for which filter.contain(c) is1623// true. Add each range to a set.1624//1625// To improve performance, use the INCLUSIONS set, which1626// encodes information about character ranges that are known1627// to have identical properties, such as the CJK Ideographs1628// from U+4E00 to U+9FA5. INCLUSIONS contains all characters1629// except the first characters of such ranges.1630//1631// TODO Where possible, instead of scanning over code points,1632// use internal property data to initialize UnicodeSets for1633// those properties. Scanning code points is slow.16341635clear();16361637int startHasProperty = -1;1638UnicodeSet inclusions = getInclusions(src);1639int limitRange = inclusions.getRangeCount();16401641for (int j=0; j<limitRange; ++j) {1642// get current range1643int start = inclusions.getRangeStart(j);1644int end = inclusions.getRangeEnd(j);16451646// for all the code points in the range, process1647for (int ch = start; ch <= end; ++ch) {1648// only add to the unicodeset on inflection points --1649// where the hasProperty value changes to false1650if (filter.contains(ch)) {1651if (startHasProperty < 0) {1652startHasProperty = ch;1653}1654} else if (startHasProperty >= 0) {1655add_unchecked(startHasProperty, ch-1);1656startHasProperty = -1;1657}1658}1659}1660if (startHasProperty >= 0) {1661add_unchecked(startHasProperty, 0x10FFFF);1662}16631664return this;1665}16661667/**1668* Remove leading and trailing rule white space and compress1669* internal rule white space to a single space character.1670*1671* @see UCharacterProperty#isRuleWhiteSpace1672*/1673private static String mungeCharName(String source) {1674StringBuffer buf = new StringBuffer();1675for (int i=0; i<source.length(); ) {1676int ch = UTF16.charAt(source, i);1677i += UTF16.getCharCount(ch);1678if (UCharacterProperty.isRuleWhiteSpace(ch)) {1679if (buf.length() == 0 ||1680buf.charAt(buf.length() - 1) == ' ') {1681continue;1682}1683ch = ' '; // convert to ' '1684}1685UTF16.append(buf, ch);1686}1687if (buf.length() != 0 &&1688buf.charAt(buf.length() - 1) == ' ') {1689buf.setLength(buf.length() - 1);1690}1691return buf.toString();1692}16931694/**1695* Modifies this set to contain those code points which have the1696* given value for the given property. Prior contents of this1697* set are lost.1698* @param propertyAlias1699* @param valueAlias1700* @param symbols if not null, then symbols are first called to see if a property1701* is available. If true, then everything else is skipped.1702* @return this set1703* @stable ICU 3.21704*/1705public UnicodeSet applyPropertyAlias(String propertyAlias,1706String valueAlias, SymbolTable symbols) {1707if (valueAlias.length() > 0) {1708if (propertyAlias.equals("Age")) {1709// Must munge name, since1710// VersionInfo.getInstance() does not do1711// 'loose' matching.1712VersionInfo version = VersionInfo.getInstance(mungeCharName(valueAlias));1713applyFilter(new VersionFilter(version), UCharacterProperty.SRC_PROPSVEC);1714return this;1715}1716}1717throw new IllegalArgumentException("Unsupported property: " + propertyAlias);1718}17191720/**1721* Return true if the given iterator appears to point at a1722* property pattern. Regardless of the result, return with the1723* iterator unchanged.1724* @param chars iterator over the pattern characters. Upon return1725* it will be unchanged.1726* @param iterOpts RuleCharacterIterator options1727*/1728private static boolean resemblesPropertyPattern(RuleCharacterIterator chars,1729int iterOpts) {1730boolean result = false;1731iterOpts &= ~RuleCharacterIterator.PARSE_ESCAPES;1732Object pos = chars.getPos(null);1733int c = chars.next(iterOpts);1734if (c == '[' || c == '\\') {1735int d = chars.next(iterOpts & ~RuleCharacterIterator.SKIP_WHITESPACE);1736result = (c == '[') ? (d == ':') :1737(d == 'N' || d == 'p' || d == 'P');1738}1739chars.setPos(pos);1740return result;1741}17421743/**1744* Parse the given property pattern at the given parse position.1745* @param symbols TODO1746*/1747private UnicodeSet applyPropertyPattern(String pattern, ParsePosition ppos, SymbolTable symbols) {1748int pos = ppos.getIndex();17491750// On entry, ppos should point to one of the following locations:17511752// Minimum length is 5 characters, e.g. \p{L}1753if ((pos+5) > pattern.length()) {1754return null;1755}17561757boolean posix = false; // true for [:pat:], false for \p{pat} \P{pat} \N{pat}1758boolean isName = false; // true for \N{pat}, o/w false1759boolean invert = false;17601761// Look for an opening [:, [:^, \p, or \P1762if (pattern.regionMatches(pos, "[:", 0, 2)) {1763posix = true;1764pos = Utility.skipWhitespace(pattern, pos+2);1765if (pos < pattern.length() && pattern.charAt(pos) == '^') {1766++pos;1767invert = true;1768}1769} else if (pattern.regionMatches(true, pos, "\\p", 0, 2) ||1770pattern.regionMatches(pos, "\\N", 0, 2)) {1771char c = pattern.charAt(pos+1);1772invert = (c == 'P');1773isName = (c == 'N');1774pos = Utility.skipWhitespace(pattern, pos+2);1775if (pos == pattern.length() || pattern.charAt(pos++) != '{') {1776// Syntax error; "\p" or "\P" not followed by "{"1777return null;1778}1779} else {1780// Open delimiter not seen1781return null;1782}17831784// Look for the matching close delimiter, either :] or }1785int close = pattern.indexOf(posix ? ":]" : "}", pos);1786if (close < 0) {1787// Syntax error; close delimiter missing1788return null;1789}17901791// Look for an '=' sign. If this is present, we will parse a1792// medium \p{gc=Cf} or long \p{GeneralCategory=Format}1793// pattern.1794int equals = pattern.indexOf('=', pos);1795String propName, valueName;1796if (equals >= 0 && equals < close && !isName) {1797// Equals seen; parse medium/long pattern1798propName = pattern.substring(pos, equals);1799valueName = pattern.substring(equals+1, close);1800}18011802else {1803// Handle case where no '=' is seen, and \N{}1804propName = pattern.substring(pos, close);1805valueName = "";18061807// Handle \N{name}1808if (isName) {1809// This is a little inefficient since it means we have to1810// parse "na" back to UProperty.NAME even though we already1811// know it's UProperty.NAME. If we refactor the API to1812// support args of (int, String) then we can remove1813// "na" and make this a little more efficient.1814valueName = propName;1815propName = "na";1816}1817}18181819applyPropertyAlias(propName, valueName, symbols);18201821if (invert) {1822complement();1823}18241825// Move to the limit position after the close delimiter1826ppos.setIndex(close + (posix ? 2 : 1));18271828return this;1829}18301831/**1832* Parse a property pattern.1833* @param chars iterator over the pattern characters. Upon return1834* it will be advanced to the first character after the parsed1835* pattern, or the end of the iteration if all characters are1836* parsed.1837* @param rebuiltPat the pattern that was parsed, rebuilt or1838* copied from the input pattern, as appropriate.1839* @param symbols TODO1840*/1841private void applyPropertyPattern(RuleCharacterIterator chars,1842StringBuffer rebuiltPat, SymbolTable symbols) {1843String patStr = chars.lookahead();1844ParsePosition pos = new ParsePosition(0);1845applyPropertyPattern(patStr, pos, symbols);1846if (pos.getIndex() == 0) {1847syntaxError(chars, "Invalid property pattern");1848}1849chars.jumpahead(pos.getIndex());1850rebuiltPat.append(patStr.substring(0, pos.getIndex()));1851}18521853//----------------------------------------------------------------1854// Case folding API1855//----------------------------------------------------------------18561857/**1858* Bitmask for constructor and applyPattern() indicating that1859* white space should be ignored. If set, ignore characters for1860* which UCharacterProperty.isRuleWhiteSpace() returns true,1861* unless they are quoted or escaped. This may be ORed together1862* with other selectors.1863* @stable ICU 3.81864*/1865public static final int IGNORE_SPACE = 1;18661867}1868186918701871