Path: blob/aarch64-shenandoah-jdk8u272-b10/jdk/src/share/classes/sun/text/normalizer/UTF16.java
/*
 * Copyright (c) 2005, 2009, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */
/*
 *******************************************************************************
 * (C) Copyright IBM Corp. and others, 1996-2009 - All Rights Reserved         *
 *                                                                             *
 * The original version of this source code and documentation is copyrighted   *
 * and owned by IBM, These materials are provided under terms of a License     *
 * Agreement between IBM and Sun. This technology is protected by multiple     *
 * US and International patents. This notice and attribution to IBM may not    *
 * to removed.                                                                 *
 *******************************************************************************
 */

package sun.text.normalizer;

/**
 * <p>Standalone utility class providing UTF16 character conversions and
 * indexing conversions.</p>
 * <p>Code that uses strings alone rarely needs modification.
 * By design, UTF-16 does not allow overlap, so searching for strings is a safe
 * operation. Similarly, concatenation is always safe. Substringing is safe if
 * the start and end are both on UTF-32 boundaries. In normal code, the values
 * for start and end are on those boundaries, since they arose from operations
 * like searching. If not, the nearest UTF-32 boundaries can be determined
 * using <code>bounds()</code>.</p>
 * <strong>Examples:</strong>
 * <p>The following examples illustrate use of some of these methods.
 * <pre>
 * // iteration forwards: Original
 * for (int i = 0; i < s.length(); ++i) {
 *     char ch = s.charAt(i);
 *     doSomethingWith(ch);
 * }
 *
 * // iteration forwards: Changes for UTF-32
 * int ch;
 * for (int i = 0; i < s.length(); i+=UTF16.getCharCount(ch)) {
 *     ch = UTF16.charAt(s,i);
 *     doSomethingWith(ch);
 * }
 *
 * // iteration backwards: Original
 * for (int i = s.length() -1; i >= 0; --i) {
 *     char ch = s.charAt(i);
 *     doSomethingWith(ch);
 * }
 *
 * // iteration backwards: Changes for UTF-32
 * // (the loop condition is i >= 0 so that the character at index 0 is visited)
 * int ch;
 * for (int i = s.length() -1; i >= 0; i-=UTF16.getCharCount(ch)) {
 *     ch = UTF16.charAt(s,i);
 *     doSomethingWith(ch);
 * }
 * </pre>
 * <strong>Notes:</strong>
 * <ul>
 *   <li>
 *   <strong>Naming:</strong> For clarity, High and Low surrogates are called
 *   <code>Lead</code> and <code>Trail</code> in the API, which gives a better
 *   sense of their ordering in a string. <code>offset16</code> and
 *   <code>offset32</code> are used to distinguish offsets to UTF-16
 *   boundaries vs offsets to UTF-32 boundaries. <code>int char32</code> is
 *   used to contain UTF-32 characters, as opposed to <code>char16</code>,
 *   which is a UTF-16 code unit.
 *   </li>
 *   <li>
 *   <strong>Roundtripping Offsets:</strong> You can always roundtrip from a
 *   UTF-32 offset to a UTF-16 offset and back. Because of the difference in
 *   structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and
 *   back if and only if <code>bounds(string, offset16) != TRAIL</code>.
 *   </li>
 *   <li>
 *   <strong>Exceptions:</strong> The error checking will throw an exception
 *   if indices are out of bounds. Other than that, all methods will
 *   behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32
 *   values are present. <code>UCharacter.isLegal()</code> can be used to check
 *   for validity if desired.
 *   </li>
 *   <li>
 *   <strong>Unmatched Surrogates:</strong> If the string contains unmatched
 *   surrogates, then these are counted as one UTF-32 value. This matches
 *   their iteration behavior, which is vital. It also matches common display
 *   practice as missing glyphs (see the Unicode Standard Section 5.4, 5.5).
 *   </li>
 *   <li>
 *   <strong>Optimization:</strong> The method implementations may need
 *   optimization if the compiler doesn't fold static final methods. Since
 *   surrogate pairs will form an exceedingly small percentage of all the text
 *   in the world, the singleton case should always be optimized for.
 *   </li>
 * </ul>
 * @author Mark Davis, with help from Markus Scherer
 * @stable ICU 2.1
 */

public final class UTF16
{
    // public variables ---------------------------------------------------

    /**
     * The lowest Unicode code point value.
     * @stable ICU 2.1
     */
    public static final int CODEPOINT_MIN_VALUE = 0;
    /**
     * The highest Unicode code point value (scalar value) according to the
     * Unicode Standard.
     * @stable ICU 2.1
     */
    public static final int CODEPOINT_MAX_VALUE = 0x10ffff;
    /**
     * The minimum value for Supplementary code points
     * @stable ICU 2.1
     */
    public static final int SUPPLEMENTARY_MIN_VALUE = 0x10000;
    /**
     * Lead surrogate minimum value
     * @stable ICU 2.1
     */
    public static final int LEAD_SURROGATE_MIN_VALUE = 0xD800;
    /**
     * Trail surrogate minimum value
     * @stable ICU 2.1
     */
    public static final int TRAIL_SURROGATE_MIN_VALUE = 0xDC00;
    /**
     * Lead surrogate maximum value
     * @stable ICU 2.1
     */
    public static final int LEAD_SURROGATE_MAX_VALUE = 0xDBFF;
    /**
     * Trail surrogate maximum value
     * @stable ICU 2.1
     */
    public static final int TRAIL_SURROGATE_MAX_VALUE = 0xDFFF;
    /**
     * Surrogate minimum value
     * @stable ICU 2.1
     */
    public static final int SURROGATE_MIN_VALUE = LEAD_SURROGATE_MIN_VALUE;

    // public method ------------------------------------------------------

    /**
     * Extract a single UTF-32 value from a string.
     * Used when iterating forwards or backwards (with
     * <code>UTF16.getCharCount()</code>), as well as random access. If a
     * validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">
     * UCharacter.isLegal()</a></code> on the return value.
     * If the char retrieved is part of a surrogate pair, its supplementary
     * character will be returned. If a complete supplementary character is
     * not found the incomplete character will be returned.
     * @param source string of UTF-16 chars
     * @param offset16 UTF-16 offset to the start of the character.
     * @return UTF-32 value of the code point that contains the char at
     *         offset16. The boundaries of that codepoint are the same as in
     *         <code>bounds32()</code>.
     * @exception IndexOutOfBoundsException thrown if offset16 is out of
     *            bounds.
     * @stable ICU 2.1
     */
    public static int charAt(String source, int offset16) {
        char single = source.charAt(offset16);
        if (single < LEAD_SURROGATE_MIN_VALUE) {
            return single;
        }
        return _charAt(source, offset16, single);
    }

    private static int _charAt(String source, int offset16, char single) {
        if (single > TRAIL_SURROGATE_MAX_VALUE) {
            return single;
        }

        // Convert the UTF-16 surrogate pair if necessary.
        // For simplicity in usage, and because the frequency of pairs is
        // low, look both directions.

        if (single <= LEAD_SURROGATE_MAX_VALUE) {
            ++offset16;
            if (source.length() != offset16) {
                char trail = source.charAt(offset16);
                if (trail >= TRAIL_SURROGATE_MIN_VALUE && trail <= TRAIL_SURROGATE_MAX_VALUE) {
                    return UCharacterProperty.getRawSupplementary(single, trail);
                }
            }
        } else {
            --offset16;
            if (offset16 >= 0) {
                // single is a trail surrogate so
                char lead = source.charAt(offset16);
                if (lead >= LEAD_SURROGATE_MIN_VALUE && lead <= LEAD_SURROGATE_MAX_VALUE) {
                    return UCharacterProperty.getRawSupplementary(lead, single);
                }
            }
        }
        return single; // return unmatched surrogate
    }

    /**
     * Extract a single UTF-32 value from a substring.
     * Used when iterating forwards or backwards (with
     * <code>UTF16.getCharCount()</code>), as well as random access. If a
     * validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">UCharacter.isLegal()
     * </a></code> on the return value.
     * If the char retrieved is part of a surrogate pair, its supplementary
     * character will be returned. If a complete supplementary character is
     * not found the incomplete character will be returned.
     * @param source array of UTF-16 chars
     * @param start offset to substring in the source array for analyzing
     * @param limit offset to substring in the source array for analyzing
     * @param offset16 UTF-16 offset relative to start
     * @return UTF-32 value of the code point that contains the char at
     *         offset16. The boundaries of that codepoint are the same as in
     *         <code>bounds32()</code>.
     * @exception IndexOutOfBoundsException thrown if offset16 is not within
     *            the range of start and limit.
     * @stable ICU 2.1
     */
    public static int charAt(char source[], int start, int limit,
                             int offset16)
    {
        offset16 += start;
        if (offset16 < start || offset16 >= limit) {
            throw new ArrayIndexOutOfBoundsException(offset16);
        }

        char single = source[offset16];
        if (!isSurrogate(single)) {
            return single;
        }

        // Convert the UTF-16 surrogate pair if necessary.
        // For simplicity in usage, and because the frequency of pairs is
        // low, look both directions.
        if (single <= LEAD_SURROGATE_MAX_VALUE) {
            offset16 ++;
            if (offset16 >= limit) {
                return single;
            }
            char trail = source[offset16];
            if (isTrailSurrogate(trail)) {
                return UCharacterProperty.getRawSupplementary(single, trail);
            }
        }
        else { // isTrailSurrogate(single), so
            if (offset16 == start) {
                return single;
            }
            offset16 --;
            char lead = source[offset16];
            if (isLeadSurrogate(lead))
                return UCharacterProperty.getRawSupplementary(lead, single);
        }
        return single; // return unmatched surrogate
    }

    /**
     * Determines how many chars this char32 requires.
     * If a validity check is required, use <code>
     * <a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code> on
     * char32 before calling.
     * @param char32 the input codepoint.
     * @return 2 if is in supplementary space, otherwise 1.
     * @stable ICU 2.1
     */
    public static int getCharCount(int char32)
    {
        if (char32 < SUPPLEMENTARY_MIN_VALUE) {
            return 1;
        }
        return 2;
    }

    /**
     * Determines whether the code value is a surrogate.
     * @param char16 the input character.
     * @return true iff the input character is a surrogate.
     * @stable ICU 2.1
     */
    public static boolean isSurrogate(char char16)
    {
        return LEAD_SURROGATE_MIN_VALUE <= char16 &&
            char16 <= TRAIL_SURROGATE_MAX_VALUE;
    }

    /**
     * Determines whether the character is a trail surrogate.
     * @param char16 the input character.
     * @return true iff the input character is a trail surrogate.
     * @stable ICU 2.1
     */
    public static boolean isTrailSurrogate(char char16)
    {
        return (TRAIL_SURROGATE_MIN_VALUE <= char16 &&
                char16 <= TRAIL_SURROGATE_MAX_VALUE);
    }

    /**
     * Determines whether the character is a lead surrogate.
     * @param char16 the input character.
     * @return true iff the input character is a lead surrogate
     * @stable ICU 2.1
     */
    public static boolean isLeadSurrogate(char char16)
    {
        return LEAD_SURROGATE_MIN_VALUE <= char16 &&
            char16 <= LEAD_SURROGATE_MAX_VALUE;
    }

    /**
     * Returns the lead surrogate.
     * If a validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param char32 the input character.
     * @return lead surrogate if the getCharCount(ch) is 2; <br>
     *         and 0 otherwise (note: 0 is not a valid lead surrogate).
     * @stable ICU 2.1
     */
    public static char getLeadSurrogate(int char32)
    {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            return (char)(LEAD_SURROGATE_OFFSET_ +
                          (char32 >> LEAD_SURROGATE_SHIFT_));
        }

        return 0;
    }

    /**
     * Returns the trail surrogate.
     * If a validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param char32 the input character.
     * @return the trail surrogate if the getCharCount(ch) is 2; <br>otherwise
     *         the character itself
     * @stable ICU 2.1
     */
    public static char getTrailSurrogate(int char32)
    {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            return (char)(TRAIL_SURROGATE_MIN_VALUE +
                          (char32 & TRAIL_SURROGATE_MASK_));
        }

        return (char)char32;
    }

    /**
     * Convenience method corresponding to String.valueOf(char). Returns a one
     * or two char string containing the UTF-32 value in UTF16 format. If a
     * validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param char32 the input character.
     * @return string value of char32 in UTF16 format
     * @exception IllegalArgumentException thrown if char32 is an invalid
     *            codepoint.
     * @stable ICU 2.1
     */
    public static String valueOf(int char32)
    {
        if (char32 < CODEPOINT_MIN_VALUE || char32 > CODEPOINT_MAX_VALUE) {
            throw new IllegalArgumentException("Illegal codepoint");
        }
        return toString(char32);
    }

    /**
     * Append a single UTF-32 value to the end of a StringBuffer.
     * If a validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param target the buffer to append to
     * @param char32 value to append.
     * @return the updated StringBuffer
     * @exception IllegalArgumentException thrown when char32 does not lie
     *            within the range of the Unicode codepoints
     * @stable ICU 2.1
     */
    public static StringBuffer append(StringBuffer target, int char32)
    {
        // Check for irregular values
        if (char32 < CODEPOINT_MIN_VALUE || char32 > CODEPOINT_MAX_VALUE) {
            throw new IllegalArgumentException("Illegal codepoint: " + Integer.toHexString(char32));
        }

        // Write the UTF-16 values
        if (char32 >= SUPPLEMENTARY_MIN_VALUE)
        {
            target.append(getLeadSurrogate(char32));
            target.append(getTrailSurrogate(char32));
        }
        else {
            target.append((char)char32);
        }
        return target;
    }

    //// for StringPrep
    /**
     * Shifts offset16 by the argument number of codepoints within a subarray.
     * @param source char array
     * @param start position of the subarray to be performed on
     * @param limit position of the subarray to be performed on
     * @param offset16 UTF16 position to shift relative to start
     * @param shift32 number of codepoints to shift
     * @return new shifted offset16 relative to start
     * @exception IndexOutOfBoundsException if the new offset16 is out of
     *            bounds with respect to the subarray or the subarray bounds
     *            are out of range.
     * @stable ICU 2.1
     */
    public static int moveCodePointOffset(char source[], int start, int limit,
                                          int offset16, int shift32)
    {
        int size = source.length;
        int count;
        char ch;
        int result = offset16 + start;
        if (start<0 || limit<start) {
            throw new StringIndexOutOfBoundsException(start);
        }
        if (limit>size) {
            throw new StringIndexOutOfBoundsException(limit);
        }
        if (offset16<0 || result>limit) {
            throw new StringIndexOutOfBoundsException(offset16);
        }
        if (shift32 > 0 ) {
            if (shift32 + result > size) {
                throw new StringIndexOutOfBoundsException(result);
            }
            count = shift32;
            while (result < limit && count > 0)
            {
                ch = source[result];
                if (isLeadSurrogate(ch) && (result+1 < limit) &&
                    isTrailSurrogate(source[result+1])) {
                    result ++;
                }
                count --;
                result ++;
            }
        } else {
            if (result + shift32 < start) {
                throw new StringIndexOutOfBoundsException(result);
            }
            for (count=-shift32; count>0; count--) {
                result--;
                if (result<start) {
                    break;
                }
                ch = source[result];
                if (isTrailSurrogate(ch) && result>start && isLeadSurrogate(source[result-1])) {
                    result--;
                }
            }
        }
        if (count != 0) {
            throw new StringIndexOutOfBoundsException(shift32);
        }
        result -= start;
        return result;
    }

    // private data members -------------------------------------------------

    /**
     * Shift value for lead surrogate to form a supplementary character.
     */
    private static final int LEAD_SURROGATE_SHIFT_ = 10;

    /**
     * Mask to retrieve the significant value from a trail surrogate.
     */
    private static final int TRAIL_SURROGATE_MASK_ = 0x3FF;

    /**
     * Value that all lead surrogates start with
     */
    private static final int LEAD_SURROGATE_OFFSET_ =
        LEAD_SURROGATE_MIN_VALUE -
        (SUPPLEMENTARY_MIN_VALUE
         >> LEAD_SURROGATE_SHIFT_);

    // private methods ------------------------------------------------------

    /**
     * <p>Converts argument code point and returns a String object representing
     * the code point's value in UTF16 format.</p>
     * <p>This method does not check for the validity of the codepoint; the
     * results are not guaranteed if an invalid codepoint is passed as
     * argument.</p>
     * <p>The result is a string whose length is 1 for non-supplementary code
     * points, 2 otherwise.</p>
     * @param ch code point
     * @return string representation of the code point
     */
    private static String toString(int ch)
    {
        if (ch < SUPPLEMENTARY_MIN_VALUE) {
            return String.valueOf((char)ch);
        }

        StringBuffer result = new StringBuffer();
        result.append(getLeadSurrogate(ch));
        result.append(getTrailSurrogate(ch));
        return result.toString();
    }
}
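
The lead/trail arithmetic used by `getLeadSurrogate` and `getTrailSurrogate` above can be tried outside the JDK-internal `sun.text.normalizer` package. The following is a minimal standalone sketch, not part of the original file: the class name `SurrogateSketch` and its method names are hypothetical, and it relies only on `java.lang` (`String.codePointAt`, `Character.charCount`, `Character.toCodePoint`) as a stand-in for `UTF16.charAt` and `UTF16.getCharCount`.

```java
// Hypothetical illustration class; it is not part of sun.text.normalizer.
public final class SurrogateSketch {
    private static final int SUPPLEMENTARY_MIN = 0x10000;
    private static final int LEAD_MIN = 0xD800;
    private static final int TRAIL_MIN = 0xDC00;
    private static final int SHIFT = 10;
    private static final int TRAIL_MASK = 0x3FF;
    // Same derivation as LEAD_SURROGATE_OFFSET_: 0xD800 - (0x10000 >> 10) = 0xD7C0
    private static final int LEAD_OFFSET = LEAD_MIN - (SUPPLEMENTARY_MIN >> SHIFT);

    /** Lead surrogate for a supplementary code point, 0 otherwise. */
    public static char leadSurrogate(int cp) {
        if (cp >= SUPPLEMENTARY_MIN) {
            return (char) (LEAD_OFFSET + (cp >> SHIFT));
        }
        return (char) 0;
    }

    /** Trail surrogate for a supplementary code point, else the char itself. */
    public static char trailSurrogate(int cp) {
        if (cp >= SUPPLEMENTARY_MIN) {
            return (char) (TRAIL_MIN + (cp & TRAIL_MASK));
        }
        return (char) cp;
    }

    /** Forward iteration by code point, in the style of the class Javadoc. */
    public static int countCodePoints(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // standard-library analogue of UTF16.charAt
            i += Character.charCount(cp);   // analogue of UTF16.getCharCount
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char lead = leadSurrogate(cp);
        char trail = trailSurrogate(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) lead, (int) trail);
        // Round trip through the standard library's own pair decoding.
        System.out.println(Character.toCodePoint(lead, trail) == cp);
        System.out.println(countCodePoints("a" + lead + trail + "b"));
    }
}
```

For U+1F600 the shift gives `0x1F600 >> 10 = 0x7D`, so the lead surrogate is `0xD7C0 + 0x7D = 0xD83D`, and the mask gives `0x1F600 & 0x3FF = 0x200`, so the trail surrogate is `0xDC00 + 0x200 = 0xDE00`.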