[ACCEPTED]-NVARCHAR storing characters not supported by UCS-2 encoding on SQL Server-ucs2

Accepted answer
Score: 10

There are several clarifications to make 174 here regarding the MS documentation snippets 173 posted in the question, and for the sample 172 code, for the questions themselves, and 171 for statements made in the comments on the 170 question. Most of the confusion can be cleared 169 up, I believe, by the information provided 168 in the following post of mine:

How Many Bytes Per Character in SQL Server: a Completely Complete Guide

First things 167 first (which is the only way it can be, right?): I'm 166 not insulting the people who wrote the MS 165 documentation as SQL Server alone is a huge product 164 and there is a lot to cover, etc, but for 163 the moment (until I get a chance to update 162 it), please read the "official" documentation 161 with a sense of caution. There are several 160 misstatements regarding Collations / Unicode.

  1. UCS-2 159 is an encoding that handles a subset of 158 the Unicode character set. It works in 2-byte 157 units. With 2 bytes, you can encode values 156 0 - 65535. This range of code points is 155 known as the BMP (Basic Multilingual Plane). The 154 BMP is all of the characters that are not Supplementary 153 Characters (because those are supplementary 152 to the BMP), but it does contain a set of code 151 points that are exclusively used to encode 150 Supplementary Characters in UTF-16 (i.e. the 149 2048 surrogate code points). This is a complete 148 subset of UTF-16.

  2. UTF-16 is an encoding that 147 handles all of the Unicode character set. It 146 also works in 2-byte units. In fact, there 145 is no difference between UCS-2 and UTF-16 144 regarding the BMP code points and characters. The 143 difference is that UTF-16 makes use of those 142 2048 surrogate code points in the BMP to 141 create surrogate pairs which are the encodings 140 for all Supplementary Characters. While 139 Supplementary Characters are 4-bytes (in 138 UTF-8, UTF-16, and UTF-32), they are really 137 two 2-byte code units when encoding in UTF-16 (likewise, they are 136 four 1-byte units in UTF-8, and one 4-byte 135 in UTF-32).

  3. Since UTF-16 merely extends what 134 can be done with UCS-2 (by actually defining 133 the usage of the surrogate code points), there 132 is absolutely no difference in the byte sequences 131 that can be stored in either case. All 2048 surrogate 130 code points used to create Supplementary 129 Characters in UTF-16 are valid code points 128 in UCS-2, they just don't have any defined 127 usage (i.e. interpretation) in UCS-2.

  4. NVARCHAR, NCHAR, and 126 the deprecated-so-do-NOT-use-it-NTEXT datatypes 125 all store Unicode characters encoded in 124 UCS-2 / UTF-16. From a storage perspective 123 there is absolutely NO difference. So, it 122 doesn't matter if something (even outside 121 of SQL Server) says that it can store UCS-2. If 120 it can do that, then it can inherently store 119 UTF-16. In fact, while I have not had a 118 chance to update the post linked above, I 117 have been able to store and retrieve, as 116 expected, emojis (most of which are Supplementary 115 Characters) in SQL Server 2000 running on 114 Windows XP. There were no Supplementary 113 Characters defined until 2003, I think, and 112 certainly not in 1999 when SQL Server 2000 111 was being developed. In fact (again), UCS-2 110 was only used in Windows / SQL Server because 109 Microsoft pushed ahead with development 108 prior to UTF-16 being finalized and published 107 (and as soon as it was, UCS-2 became obsolete).

  5. The 106 only difference between UCS-2 and UTF-16 105 is that UTF-16 knows how to interpret surrogate pairs 104 (comprised of a pair of surrogate code points, so 103 at least they're appropriately named). This 102 is where the _SC collations (and, starting 101 in SQL Server 2017, also version _140_ collations 100 which include support for Supplementary 99 Characters so none of them have the _SC in 98 their name) come in: they allow the built-in 97 SQL Server functions to correctly interpret 96 Supplementary Characters. That's it! Those 95 collations have nothing to do with storing and 94 retrieving Supplementary Characters, nor 93 do they even have anything to do with sorting or 92 comparing them (even though the "Collation 91 and Unicode Support" documentation 90 says specifically that this is what those 89 collations do — another item on my "to 88 do" list to fix). For collations that 87 have neither _SC nor _140_ in their name (though 86 the new-as-of-SQL Server 2019 Latin1_General_100_BIN2_UTF8 might be grey-area, at 85 least, I remember there being some inconsistency 84 either there or with the Japanese_*_140_BIN2 collations), the 83 built-in functions only handle BMP code 82 points (i.e. UCS-2).

  6. Not "handling" Supplementary 81 Characters means not interpreting a valid 80 sequence of two surrogate code points as 79 actually being a singular supplementary 78 code point. So, for non-"SC" collations, BMP 77 surrogate code point 1 (B1) and BMP surrogate 76 code point 2 (B2) are just those two code 75 points, neither one of which is defined, hence 74 they appear as two "nothing"s 73 (i.e. B1 followed by B2). This is why it 72 is possible to split a Supplementary Character 71 in two using SUBSTRING / LEFT / RIGHT because they won't know 70 to keep those two BMP code points together. But 69 an "SC" collation will read those 68 code points B1 and B2 from disk or memory 67 and see a single Supplementary code point 66 S. Now it can be handled correctly via SUBSTRING / CHARINDEX / etc.

  7. The 65 NCHAR() function (not the datatype; yes, poorly 64 named function ;) is also sensitive to whether 63 or not the default collation of the current database supports 62 Supplementary Characters. If yes, then passing 61 in a value between 65536 and 1114111 (the 60 Supplementary Character range) will return 59 a non-NULL value. If not, then passing in any 58 value above 65535 will return NULL. (Of course, it 57 would be far better if NCHAR() just always worked, given 56 that storing / retrieving always works, so 55 please vote for this suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation ).

  8. Fortunately, you 54 don't need an "SC" collation to 53 output a Supplementary Character. You can 52 either paste in the literal character, or 51 convert the UTF-16 Little Endian encoded 50 surrogate pair, or use the NCHAR() function to 49 output the surrogate pair. The following 48 works in SQL Server 2000 (using SSMS 2005) running 47 on Windows XP:

    SELECT N'💩', -- 💩
    CONVERT(VARBINARY(4), N'💩'), -- 0x3DD8A9DC
    CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- 💩 (regardless of DB Collation)
    NCHAR(0xD83D) + NCHAR(0xDCA9) -- 💩 (regardless of DB Collation)

    For more details on creating 46 Supplementary Characters when using non-"SC" collations, please 45 see my answer to the following DBA.SE question: How do I set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character?

  9. None 44 of this affects what you see. If you store 43 a code point, then it's there. How it behaves 42 — sorting, comparison, etc — is controlled 41 by collations. But, how it appears is controlled 40 by fonts and the OS. No font can contain 39 all characters, so different fonts contain 38 different sets of characters, with a lot 37 of overlap on the more widely used characters. However, if 36 a font has a particular byte sequence mapped, then 35 it can display that character. This is why 34 the only work required to get Supplementary 33 Characters displaying correctly in SQL Server 32 2000 (using SSMS 2005) running on Windows 31 XP was to add a font containing the characters 30 and doing one or two minor registry edits 29 (no changes to SQL Server).

  10. Supplementary 28 Characters in SQL_* collations and collations 27 without a version number in their name have 26 no sort weights. Hence, they all equate 25 to each other as well as to any other BMP 24 code points that have no sort weights (including 23 "space" (U+0020) and "null" (U+0000)). They 22 started to fix this in the version _90_ collations.

  11. SSMS 21 has nothing to do with any of this, outside 20 of possibly needing the font used for the 19 query editor and/or grid results and/or 18 errors + messages changed to one that has 17 the desired characters. (SSMS doesn't render 16 anything outside of maybe spatial data; characters 15 are rendered by the display driver + font 14 definitions + maybe something else).

Therefore, the 13 following statement in the documentation 12 (from the question):

If a non-SC collation 11 is specified, then these data types store 10 only the subset of character data supported 9 by the UCS-2 character encoding.

is both 8 nonsensical and incorrect. They were probably 7 intending to say the datatypes would only 6 store a subset of the UTF-16 encoding (since UCS-2 5 is the subset). Also, even if it said "UTF-16 4 character encoding" it would still 3 be wrong because the bytes that you pass 2 in will be stored (assuming enough free 1 space in the column or variable).

More Related questions