Path: blob/aarch64-shenandoah-jdk8u272-b10/jdk/src/share/classes/sun/text/normalizer/UCharacter.java
/*
 * Copyright (c) 2009, 2013, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation. Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */
/*
 *******************************************************************************
 * (C) Copyright IBM Corp. and others, 1996-2009 - All Rights Reserved         *
 *                                                                             *
 * The original version of this source code and documentation is copyrighted   *
 * and owned by IBM, These materials are provided under terms of a License     *
 * Agreement between IBM and Sun. This technology is protected by multiple     *
 * US and International patents. This notice and attribution to IBM may not    *
 * to removed.                                                                 *
 *******************************************************************************
 */

package sun.text.normalizer;

import java.io.IOException;
import java.util.MissingResourceException;

/**
 * <p>
 * The UCharacter class provides extensions to the
 * <a href="https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html">
 * java.lang.Character</a> class. These extensions provide support for
 * more Unicode properties and, together with the <a href=../text/UTF16.html>UTF16</a>
 * class, provide support for supplementary characters (those with code
 * points above U+FFFF).
 * Each ICU release supports the latest version of Unicode available at that time.
 * </p>
 * <p>
 * Code points are represented in these APIs using ints. While it would be
 * more convenient in Java to have a separate primitive datatype for them,
 * ints suffice in the meantime.
 * </p>
 * <p>
 * To use this class please add the jar file icu4j.jar to the
 * class path, since it contains data files which supply the information used
 * by this file.<br>
 * E.g.
 * on Windows:<br>
 * <code>set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar</code>.<br>
 * Otherwise, another method would be to copy the files uprops.dat and
 * unames.icu from the icu4j source subdirectory
 * <i>$ICU4J_SRC/src/com.ibm.icu.impl.data</i> to your class directory
 * <i>$ICU4J_CLASS/com.ibm.icu.impl.data</i>.
 * </p>
 * <p>
 * Aside from the additions for UTF-16 support, and the updated Unicode
 * properties, the main differences between UCharacter and Character are:
 * <ul>
 * <li> UCharacter is not designed to be a char wrapper and does not have
 *      APIs that involve management of that single char.<br>
 *      These include:
 *      <ul>
 *        <li> char charValue(),
 *        <li> int compareTo(java.lang.Character, java.lang.Character), etc.
 *      </ul>
 * <li> UCharacter does not include Character APIs that are deprecated, nor
 *      does it include the Java-specific character information, such as
 *      boolean isJavaIdentifierPart(char ch).
 * <li> Character maps characters 'A' - 'Z' and 'a' - 'z' to the numeric
 *      values '10' - '35'. UCharacter also does this in digit and
 *      getNumericValue, to adhere to the Java semantics of these
 *      methods. The new methods unicodeDigit and
 *      getUnicodeNumericValue do not treat the above code points
 *      as having numeric values.
 *      This is a semantic change from ICU4J 1.3.1.
 * </ul>
 * <p>
 * Further detailed differences can be determined from the program
 * <a href="http://source.icu-project.org/repos/icu/icu4j/trunk/src/com/ibm/icu/dev/test/lang/UCharacterCompare.java">
 * com.ibm.icu.dev.test.lang.UCharacterCompare</a>
 * </p>
 * <p>
 * In addition to Java compatibility functions, which calculate derived properties,
 * this API provides low-level access to the Unicode Character Database.
 * </p>
 * <p>
 * Unicode assigns each code point (not just each assigned character) values for
 * many properties.
 * Most of them are simple boolean flags, or constants from a small enumerated list.
 * For some properties, values are strings or other relatively more complex types.
 * </p>
 * <p>
 * For more information see
 * "About the Unicode Character Database" (http://www.unicode.org/ucd/)
 * and the ICU User Guide chapter on Properties (http://www.icu-project.org/userguide/properties.html).
 * </p>
 * <p>
 * There are also functions that provide easy migration from C/POSIX functions
 * like isblank(). Their use is generally discouraged because the C/POSIX
 * standards do not define their semantics beyond the ASCII range, which means
 * that different implementations exhibit very different behavior.
 * Instead, Unicode properties should be used directly.
 * </p>
 * <p>
 * There are also only a few, broad C/POSIX character classes, and they tend
 * to be used for conflicting purposes.
 * For example, the "isalpha()" class
 * is sometimes used to determine word boundaries, while a more sophisticated
 * approach would at least distinguish initial letters from continuation
 * characters (the latter including combining marks).
 * (In ICU, BreakIterator is the most sophisticated API for word boundaries.)
 * Another example: There is no "istitle()" class for titlecase characters.
 * </p>
 * <p>
 * ICU 3.4 and later provides API access for all twelve C/POSIX character classes.
 * ICU implements them according to the Standard Recommendations in
 * Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions
 * (http://www.unicode.org/reports/tr18/#Compatibility_Properties).
 * </p>
 * <p>
 * API access for C/POSIX character classes is as follows:
 * - alpha:  isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC)
 * - lower:  isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE)
 * - upper:  isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE)
 * - punct:  ((1<<getType(c)) & ((1<<DASH_PUNCTUATION)|(1<<START_PUNCTUATION)|(1<<END_PUNCTUATION)|(1<<CONNECTOR_PUNCTUATION)|(1<<OTHER_PUNCTUATION)|(1<<INITIAL_PUNCTUATION)|(1<<FINAL_PUNCTUATION)))!=0
 * - digit:  isDigit(c) or getType(c)==DECIMAL_DIGIT_NUMBER
 * - xdigit: hasBinaryProperty(c, UProperty.POSIX_XDIGIT)
 * - alnum:  hasBinaryProperty(c, UProperty.POSIX_ALNUM)
 * - space:  isUWhiteSpace(c) or hasBinaryProperty(c, UProperty.WHITE_SPACE)
 * - blank:  hasBinaryProperty(c, UProperty.POSIX_BLANK)
 * - cntrl:  getType(c)==CONTROL
 * - graph:  hasBinaryProperty(c, UProperty.POSIX_GRAPH)
 * - print:  hasBinaryProperty(c, UProperty.POSIX_PRINT)
 * </p>
 * <p>
 * The C/POSIX character classes are also available in UnicodeSet patterns,
 * using patterns like [:graph:] or \p{graph}.
 * </p>
 * <p>
 * Note: There are several ICU (and Java) whitespace functions.
 * Comparison:
 * - isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property;
 *       most of
 *       general categories "Z" (separators) + most whitespace ISO controls
 *       (including no-break spaces, but excluding IS1..IS4 and ZWSP)
 * - isWhitespace: Java isWhitespace; Z + whitespace ISO controls but excluding no-break spaces
 * - isSpaceChar: just Z (including no-break spaces)
 * </p>
 * <p>
 * This class is not subclassable.
 * </p>
 * @author Syn Wee Quek
 * @stable ICU 2.1
 * @see com.ibm.icu.lang.UCharacterEnums
 */

public final class UCharacter
{

    /**
     * Numeric Type constants.
     * @see UProperty#NUMERIC_TYPE
     * @stable ICU 2.4
     */
    public static interface NumericType
    {
        /**
         * @stable ICU 2.4
         */
        public static final int DECIMAL = 1;
    }

    // public data members -----------------------------------------------

    /**
     * The lowest Unicode code point value.
     * @stable ICU 2.1
     */
    public static final int MIN_VALUE = UTF16.CODEPOINT_MIN_VALUE;

    /**
     * The highest Unicode code point value (scalar value) according to the
     * Unicode Standard.
     * This is a 21-bit value (21 bits, rounded up).<br>
     * Up-to-date Unicode implementation of java.lang.Character.MAX_VALUE
     * @stable ICU 2.1
     */
    public static final int MAX_VALUE = UTF16.CODEPOINT_MAX_VALUE;

    /**
     * The minimum value for Supplementary code points
     * @stable ICU 2.1
     */
    public static final int SUPPLEMENTARY_MIN_VALUE =
        UTF16.SUPPLEMENTARY_MIN_VALUE;

    // public methods ----------------------------------------------------

    /**
     * Retrieves the numeric value of a decimal digit code point.
     * <br>This method observes the semantics of
     * <code>java.lang.Character.digit()</code>.
     * Note that this
     * will return positive values for code points for which isDigit
     * returns false, just like java.lang.Character.
     * <br><em>Semantic Change:</em> In release 1.3.1 and
     * prior, this did not treat the European letters as having a
     * digit value, and also treated numeric letters and other numbers as
     * digits.
     * This has been changed to conform to the java semantics.
     * <br>A code point is a valid digit if and only if:
     * <ul>
     *   <li>ch is a decimal digit or one of the european letters, and
     *   <li>the value of ch is less than the specified radix.
     * </ul>
     * @param ch the code point to query
     * @param radix the radix
     * @return the numeric value represented by the code point in the
     * specified radix, or -1 if the code point is not a decimal digit
     * or if its value is too large for the radix
     * @stable ICU 2.1
     */
    public static int digit(int ch, int radix)
    {
        // when ch is out of bounds getProperty == 0
        int props = getProperty(ch);
        int value;
        if (getNumericType(props) == NumericType.DECIMAL) {
            value = UCharacterProperty.getUnsignedValue(props);
        } else {
            value = getEuropeanDigit(ch);
        }
        return (0 <= value && value < radix) ?
            value : -1;
    }

    /**
     * Returns the bidirectional property of a code point.
     * For example, 0x0041 (letter A) has the LEFT_TO_RIGHT directional
     * property.<br>
     * Result returned belongs to the interface
     * <a href=UCharacterDirection.html>UCharacterDirection</a>
     * @param ch the code point whose direction is to be determined
     * @return direction constant from UCharacterDirection.
     * @stable ICU 2.1
     */
    public static int getDirection(int ch)
    {
        return gBdp.getClass(ch);
    }

    /**
     * Returns a code point corresponding to the two UTF16 characters.
     * @param lead the lead char
     * @param trail the trail char
     * @return code point if surrogate characters are valid.
     * @exception IllegalArgumentException thrown when argument characters do
     *            not form a valid codepoint
     * @stable ICU 2.1
     */
    public static int getCodePoint(char lead, char trail)
    {
        if (UTF16.isLeadSurrogate(lead) && UTF16.isTrailSurrogate(trail)) {
            return UCharacterProperty.getRawSupplementary(lead, trail);
        }
        throw new IllegalArgumentException("Illegal surrogate characters");
    }

    /**
     * <p>Get the "age" of the code point.</p>
     * <p>The "age" is the Unicode version when the code point was first
     * designated (as a non-character or for Private Use) or assigned a
     * character.</p>
     * <p>This can be useful to avoid emitting code points to receiving
     * processes that do not accept newer characters.</p>
     * <p>The data is from the UCD file DerivedAge.txt.</p>
     * @param ch The code point.
     * @return the Unicode version number
     * @stable ICU 2.6
     */
    public static VersionInfo getAge(int ch)
    {
        if (ch < MIN_VALUE || ch > MAX_VALUE) {
            throw new IllegalArgumentException("Codepoint out of bounds");
        }
        return PROPERTY_.getAge(ch);
    }

    // private variables -------------------------------------------------

    /**
     * Database storing the sets of character property
     */
    private static final UCharacterProperty PROPERTY_;
    /**
     * For
     * optimization
     */
    private static final char[] PROPERTY_TRIE_INDEX_;
    private static final char[] PROPERTY_TRIE_DATA_;
    private static final int PROPERTY_INITIAL_VALUE_;

    private static final UBiDiProps gBdp;

    // block to initialise character property database
    static
    {
        try
        {
            PROPERTY_ = UCharacterProperty.getInstance();
            PROPERTY_TRIE_INDEX_ = PROPERTY_.m_trieIndex_;
            PROPERTY_TRIE_DATA_ = PROPERTY_.m_trieData_;
            PROPERTY_INITIAL_VALUE_ = PROPERTY_.m_trieInitialValue_;
        }
        catch (Exception e)
        {
            throw new MissingResourceException(e.getMessage(),"","");
        }

        UBiDiProps bdp;
        try {
            bdp=UBiDiProps.getSingleton();
        } catch(IOException e) {
            bdp=UBiDiProps.getDummy();
        }
        gBdp=bdp;
    }

    /**
     * Shift to get numeric type
     */
    private static final int NUMERIC_TYPE_SHIFT_ = 5;
    /**
     * Mask to get numeric type
     */
    private static final int NUMERIC_TYPE_MASK_ = 0x7 << NUMERIC_TYPE_SHIFT_;

    // private methods ---------------------------------------------------

    /**
     * Getting the digit values of characters like 'A' - 'Z', normal,
     * half-width and full-width. This method assumes that the other digit
     * characters are checked by the calling method.
     * @param ch character to test
     * @return -1 if ch is not a character of the form 'A' - 'Z', otherwise
     *         its corresponding digit will be returned.
     */
    private static int getEuropeanDigit(int ch) {
        if ((ch > 0x7a && ch < 0xff21)
            || ch < 0x41 || (ch > 0x5a && ch < 0x61)
            || ch > 0xff5a || (ch > 0xff3a && ch < 0xff41)) {
            return -1;
        }
        if (ch <= 0x7a) {
            // ch is in 0x41..0x5a ('A'-'Z') or 0x61..0x7a ('a'-'z')
            return ch + 10 - ((ch <= 0x5a) ?
                0x41 : 0x61);
        }
        // ch >= 0xff21
        if (ch <= 0xff3a) {
            return ch + 10 - 0xff21;
        }
        // ch >= 0xff41 && ch <= 0xff5a
        return ch + 10 - 0xff41;
    }

    /**
     * Gets the numeric type of the property argument
     * @param props 32 bit property
     * @return the numeric type
     */
    private static int getNumericType(int props)
    {
        return (props & NUMERIC_TYPE_MASK_) >> NUMERIC_TYPE_SHIFT_;
    }

    /**
     * Gets the property value at the index.
     * This is optimized.
     * Note this is a little different from CharTrie: the index m_trieData_
     * is never negative.
     * This is a duplicate of UCharacterProperty.getProperty. For optimization
     * purposes, this method calls the trie data directly instead of through
     * UCharacterProperty.getProperty.
     * @param ch code point whose property value is to be retrieved
     * @return property value of code point
     * @stable ICU 2.6
     */
    private static final int getProperty(int ch)
    {
        if (ch < UTF16.LEAD_SURROGATE_MIN_VALUE
            || (ch > UTF16.LEAD_SURROGATE_MAX_VALUE
                && ch < UTF16.SUPPLEMENTARY_MIN_VALUE)) {
            // BMP codepoint 0000..D7FF or DC00..FFFF
            try { // using try for ch < 0 is faster than using an if statement
                return PROPERTY_TRIE_DATA_[
                    (PROPERTY_TRIE_INDEX_[ch >> 5] << 2)
                    + (ch & 0x1f)];
            } catch (ArrayIndexOutOfBoundsException e) {
                return PROPERTY_INITIAL_VALUE_;
            }
        }
        if (ch <= UTF16.LEAD_SURROGATE_MAX_VALUE) {
            // lead surrogate D800..DBFF
            return PROPERTY_TRIE_DATA_[
                (PROPERTY_TRIE_INDEX_[(0x2800 >> 5) + (ch >> 5)] << 2)
                + (ch & 0x1f)];
        }
        // for optimization
        if (ch <= UTF16.CODEPOINT_MAX_VALUE) {
            // supplementary code point 10000..10FFFF
            // look at the construction of supplementary characters
            // trail forms the ends of it.
            return PROPERTY_.m_trie_.getSurrogateValue(
                UTF16.getLeadSurrogate(ch),
                (char)(ch & 0x3ff));
        }
        // return m_dataOffset_ if there is an error, in this case we return
        // the default value: m_initialValue_
        // we cannot assume that m_initialValue_ is at
        // offset 0
        // this is for optimization.
        return PROPERTY_INITIAL_VALUE_;
    }

}
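The letter-to-digit mapping used by digit() and getEuropeanDigit() above can be tried in isolation. The following is only a minimal sketch, not part of the file: the class name EuropeanDigitSketch and its method names are hypothetical, and decimal digits are recognised for ASCII '0' - '9' only, whereas the real method consults the Unicode property trie for every decimal digit.

```java
/**
 * Standalone sketch of the "European digit" mapping described above.
 * Hypothetical helper, not part of sun.text.normalizer.
 */
final class EuropeanDigitSketch {

    /**
     * Maps 'A'-'Z' and 'a'-'z' (ASCII and fullwidth forms) to 10..35.
     * Returns -1 for any other code point.
     */
    static int europeanDigit(int ch) {
        if (ch >= 0x41 && ch <= 0x5a) {         // 'A'..'Z'
            return ch - 0x41 + 10;
        }
        if (ch >= 0x61 && ch <= 0x7a) {         // 'a'..'z'
            return ch - 0x61 + 10;
        }
        if (ch >= 0xff21 && ch <= 0xff3a) {     // fullwidth 'A'..'Z'
            return ch - 0xff21 + 10;
        }
        if (ch >= 0xff41 && ch <= 0xff5a) {     // fullwidth 'a'..'z'
            return ch - 0xff41 + 10;
        }
        return -1;
    }

    /**
     * Simplified digit(): ASCII decimal digits only (an assumption of this
     * sketch), followed by the same radix check as the method above.
     */
    static int digit(int ch, int radix) {
        int value = (ch >= '0' && ch <= '9') ? ch - '0' : europeanDigit(ch);
        return (0 <= value && value < radix) ? value : -1;
    }

    public static void main(String[] args) {
        System.out.println(digit('f', 16));     // value of 'f' in radix 16
        System.out.println(digit('g', 16));     // 'g' is too large for radix 16
        System.out.println(digit(0xff21, 36));  // fullwidth 'A' in radix 36
    }
}
```

The range checks are written positively here, one per letter block, rather than as the single combined rejection test used in getEuropeanDigit; the accepted ranges are the same.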