Dyalog’08
Migrating to Unicode
Morten Kromberg
Workshop at Dyalog’08 - Elsinore

Agenda
• What is Unicode?
• V.12 Design Goals
• Key Unicode Features
• Language Differences
  – ⎕DR, ⍋ of char data
  – Space & Performance
• “Interop”: Classic vs Unicode
  – WSs & Component Files
  – TCP Sockets & Conga
  – External Vars, Mapped Files
  – Own DLLs and APs
• Native Files
  – Unicode Text Files (UTF-8)
• External Interfaces
  – COM/OLE, Microsoft.NET
  – ODBC / SQAPL
  – ⎕NA: A & W win32 calls
• Source Code Management
  – SALT, SubVersion, Diff Tools
• Planning Migrations

What is Unicode?
Wikipedia: An industry standard allowing computers to consistently represent and manipulate text expressed in any of the world's writing systems.
• It assigns a number, or code point, to each of approximately 100,000 characters
  – Including the APL character set.
• The first version of the standard appeared in 1991; support is now becoming “common” on all platforms

Why do we want Unicode?
• Obviously: It allows us to write applications which use text from all the world’s written languages…
• Less obviously, but perhaps more important in the short term:
  – APL no longer needs its own character set (“Atomic Vector”)
  – Characters no longer need to be translated on the way in and out of APL
  – APL Source Code can be stored in “ordinary” text files and be handled by “standard” management tools

What is Unicode in practice?
Char  Name                                 HEX      DEC     UTF-8
A     Latin capital letter A               00041    65      65
Æ     Latin capital letter AE              000C6    198     195 134
α     Greek small letter alpha             003B1    945     206 177
ؤ     Arabic letter waw with hamza above   00624    1572    216 164
⍺     APL functional symbol alpha          0237A    9082    226 141 186
𠀁    CJK ideograph extension B, second    20001    131073  240 160 128 129
• Most often, when someone tells you the data “is Unicode”, they mean “UTF-8 encoded”.

Use Google...
Wikipedia too ...

Encodings
• “Unicode” assigns unique numbers to characters. Encodings are ways to represent these numbers on file.
Encoding  Description
UCS-4     4 bytes per character (= Dyalog ⎕DR type 320). Often used as internal representation on Unix systems.
UCS-2     2 bytes per character (= type 160). The internal format for “wide” chars under Windows until Win2000.
UTF-8     THE most popular encoding for text files. Identical to ASCII for range 0-127 (= good for Americans). 2 bytes/char from 128-2047, 3 bytes 2048-65535, 4 bytes after that. The only encoding which is independent of “endian-ness”.
UTF-16    Identical to UCS-2 for most of first plane, but can encode all characters. Replaced UCS-2 on Windows after Win2000.
• UCS (Universal Character Set) encodings have a fixed width; UTF (Unicode Transformation Format) encodings are variable width.

Version 12.0 Design Goals
• To allow users to develop Unicode applications (containing all the world's symbols)
• To make the Dyalog IDE a Unicode application
  – No more “translate tables”!
• Avoid having to explain ⎕AV to future generations
  – Only one “kind” of characters
• Design should encourage migration
  – Controlled migration with “interop” between old & new apps
  – No “Big Bang” data conversion events
  – Classic & Unicode editions allow “parallel runs”

Unicode vs Classic
• Unicode Edition:
  – Character data is defined as Unicode code points
  – No translation of data as it moves in & out of APL
• Classic Edition:
  – Character data is defined as indices into ⎕AV
  – Translate tables used for keyboard, display and file I/O
• Classic will be available so long as a single major customer has not been able to migrate
  – The price may increase at some point

Key Unicode Features (1)
• New Character Data Types 80, 160, 320: 1-, 2-, 4-byte representations of Code Points.
      ⎕DR 'Hello'
80
      ⎕DR '{⍺+⍵}'
160
      ⎕DR '𠀁𠀂𠀃'
320
• NB: One character = one array element!

Key Unicode Features (2)
• Monadic ⎕UCS converts to and from code points (self inverse):
      ⎕UCS 'Hello'
72 101 108 108 111
      ⎕UCS '{⍺+⍵}'
123 9082 43 9077 125
      ⎕UCS (2*17)+⍳3
𠀁𠀂𠀃

Key Unicode Features (3)
• Dyadic ⎕UCS encodes and decodes data as UTF-8, UTF-16 or UTF-32:
      'UTF-8' ⎕UCS 'ABCÆØÅ'
65 66 67 195 134 195 152 195 133
      'UTF-8' ⎕UCS 240 160 128 129, 240 160 128 130, 240 160 128 131
𠀁𠀂𠀃
      'UTF-16' ⎕UCS '𠀁𠀂𠀃'
55360 56321 55360 56322 55360 56323

Demo 1 ...
(key features)

Language Differences
• If you are only using APL workspaces and component files, most code from earlier versions will just load & run
• Potential problems are:
  – Monadic ⍋ (only real language difference)
  – ⎕DR to test for character data
  – Dyadic use of ⎕DR to “cast” data
  – Space usage (char arrays can be larger)

Monadic ⍋
• Due to differences in the internal representation, grade up (⍋) without a collation sequence may return different results:
Classic                 Unicode
      ⍋'aA'                   ⍋'aA'
1 2                     2 1
      ⎕AV⍳'aA'                ⎕UCS 'aA'
18 66                   97 65
• Give ⍋ a left argument of ⎕AV to maintain the current behaviour
• In many cases where monadic ⍋ is used, the order does not matter

Testing for Character Data
• This no longer works as expected:
      82=⎕DR X
• Dyalog recommends:
      (10|⎕DR ⍵)∊0 2
  – The latter is correct in all versions

Dyadic ⎕DR for “Casting”
• Classic (and previous versions):
      83 ⎕DR '⍋'          ⍝ ⎕AV[⎕IO+198]
¯109                      ⍝ Via APL+Win tables
• Unicode:
      83 ⎕DR '⍋'          ⍝ ⎕UCS 9035
75 35                     ⍝ 9035 = 256⊥⌽75 35
• The internal representation is different, and Unicode does NO TRANSLATION
• Code which (e.g.) reads characters from native files and then “casts” to number using ⎕DR needs work

More on ⎕DR ... (and ⎕UCS)
• Unicode Edition still recognises 82 as a left argument:
      82 ⎕DR ¯109
⍋
• This returns the same character as in Classic. But:
      ⎕DR 82 ⎕DR ¯109     ⍝ Type 82 cannot exist in Unicode
160
• Conversely, ⎕UCS exists in Classic:
      ⎕UCS 9035           ⍝ But must return elements of ⎕AV
⍋
      ⎕UCS 180
TRANSLATION ERROR         ⍝ Cannot convert to type 82

Space and Time
• Character data will require 2 bytes per element in the Unicode Edition if it contains APL symbols. No existing APL arrays can need 4 bytes per element.
• Primitives which manipulate or search this data may run more slowly (more data to sift through).
• Comments and character constants in code, and the script form of namespaces and classes, are also affected

Time and Space
• When copying functions between Classic and Unicode, the format needs to be converted. This can be expensive.
• The same applies when reading a ⎕OR “across the line”.
• It is not recommended to dynamically import functions across the Classic/Unicode boundary in production applications.
• Some VERY LARGE functions which could fix in v11.0 may not fix in the Unicode Edition: Lists of names and constants in a function share space with comments.
  – A proposal to relax all limits on functions may be implemented in version 12.1

Unicode vs Classic
• Use the Unicode Edition if:
  – You want to develop new applications
  – You need to manage characters not in ⎕AV now.
• Use the Classic Edition if:
  – You need other v12+ enhancements, but are not ready to convert to Unicode yet
  – Classic is upwards compatible with v11.0 (as usual)
• UE and CE are maintained from a single source, and are “identical” except for character arrays.
• Start planning your migration now! (please!)

So you want to migrate soon...
• If you “only use APL” (workspaces, component files, sockets), applications SHOULD just load & run
• If you
  – Fell for the temptation to use any external tools or storage media as part of your application
  – Wrote your own APs or DLLs
  – Or want to start using data not in ⎕AV
  ... you may have a little work to do. Let’s take a look!
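
Before turning to the external interfaces, the two language-level adjustments from the Language Differences slides can be sketched so that the same code behaves the same way in Classic and Unicode. IsChar and SortAV are hypothetical helper names chosen for illustration; only the expressions inside them come from the slides.

```apl
⍝ Hypothetical helpers; the expressions inside are the ones recommended on the slides.
IsChar←{(10|⎕DR ⍵)∊0 2}     ⍝ true for types 80, 160, 320 (Unicode) and 82 (Classic)
SortAV←{⍵[⎕AV⍋⍵]}           ⍝ sort a character vector using ⎕AV as the collation sequence

IsChar 'Hello'              ⍝ 1: character data
IsChar 1 2 3                ⍝ 0: numeric data
SortAV 'aAbB'               ⍝ ⎕AV order rather than code point order
```

Giving ⍋ an explicit left argument is only needed where the resulting order is visible to users or stored; where grade is used purely to group equal items, monadic ⍋ can stay.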

“Interop”
• Unicode and Classic editions are designed to inter-operate seamlessly, also with v11 & v10.1
• 12.0 Classic can read and translate Unicode character data found in files, workspaces and on TCP sockets
• Unicode editions will translate data to type 82 when using TCP Sockets and Component files flagged as non-Unicode (for interop with v11 & v10.1)
• If Unicode data contains characters not in ⎕AV => TRANSLATION ERROR
• Unicode editions still recognise 82 as a valid argument to ⎕DR and native file functions, and are able to map data in old native files to “the same character”.

“Interop”
• The intention is that users should be able to perform controlled experiments when migrating to Unicode
• No “Big Bang” data conversion events; old files and workspaces can still be read
• We hope that users will “reciprocate” by moving as quickly as possible; it is as easy as we could make it!

Workspaces
• Classic and Unicode editions can load each other's workspaces, but:
  – Classic cannot load (or COPY from) a workspace containing characters not in ⎕AV (TRANSLATION ERROR)
• The contents of ⎕AV are defined by ⎕AVU, a list of 256 Unicode Code Points:
      ⎕AV[97+⍳26]               ⍝ By default in v12.0, "Dyalog Alt"
ÁÂÃÇÈÊËÌÍÎÏÐÒÓÔÕÙÚÛÝþãìðòõ
      ⎕AVU[97+⍳26]←9397+⍳26     ⍝ Underscored alphabet (sort of)
      ⎕AV[97+⍳26]               ⍝ Now we have "Dyalog Std" mapping
ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• When )COPYing from a pre-v12 workspace, ⎕AVU in the target namespace decides how incoming character data is translated. So code written using Alt & Std can be merged and maintain the original looks.
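
The TRANSLATION ERROR rule on the Interop slides suggests a simple pre-check before character data is handed to a Classic system or written to a non-Unicode file. The following is a sketch only: CanShare is a hypothetical helper name, it assumes a simple (non-nested) character array, and its answer depends on the ⎕AVU currently in effect.

```apl
⍝ Hypothetical helper: 1 if every character of a simple character array is in ⎕AV
CanShare←{∧/(,⍵)∊⎕AV}

CanShare '{⍺+⍵}'      ⍝ APL symbols are in ⎕AV
CanShare '𠀁𠀂𠀃'      ⍝ CJK Extension B characters are not in the default ⎕AV
```

Because ⎕AVU has namespace scope, the check should be run in the namespace whose ⎕AVU will be used for the translation.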

More on ⎕AVU
• The Dyalog Std font is still used in some older (“anglo”) applications
• Dyalog Alt is used across Western Europe
• Some countries use fonts created by local distributors:
      )copy avu Russian.⎕AVU
C:\...avu saved Fri Jun 27 10:00:52 2008
      3 50⍴65↓⎕AV
ABCDEFGHIJKLMNOPQRSTUVWXYZАБВГД⍙ЕЖЗИЙКЛМНОПРСТУФХЦ
ЧШЩЪЫЬЭЮ{€}⊣⌷¨Яабв⍨гдежзийклмнопрстуфхцч[/⌿\⍀<≤=≥>
≠∨∧-+÷×?∊⍴~↑↓⍳○*⌈⌊∇∘(⊂⊃∩∪⊥⊤|;,⍱⍲⍒⍋⍉⌽⊖⍟⌹!⍕⍎⍫⍪≡≢шщъы
• The translate table is also used when reading component files and APL data arriving on TCP Sockets
• It has namespace scope, so classes or namespaces can be defined to read data from Classic systems using different languages if necessary

Underscores Must Die!
• There is no Underscored alphabet in Unicode. Underscoring is a form of “emphasis” (like bold or italic). The underscored alphabet is the ONLY incompatibility with the rest of the world and should be phased OUT.
• The APL385 Unicode font incorrectly displays underscores for code points 9398-9423 (decimal). The positions should really display as Ⓐ..Ⓩ.
• (Don’t ask why circled alphabetics ARE in Unicode while underscores are not, but Dyalog decided to map underscores to this range)

⎕AV: Just another variable
• In the Unicode Edition, the Atomic Vector is only used to define how to inter-operate with Classic systems. Only characters in ⎕AV can be shared. Assuming the default (Alt) setting:
      'ÁⒶ'∊⎕AV
1 0
• System variable ⎕Ⓐ (name now displays as ⎕Á) should no longer be used.
  It continues to exist and returns ⎕AV[97+⍳26]

Chars Allowed in Names
• The list has not been extended; the following are allowed:
0123456789 (but not as the 1st character in a name)
ABCDEFGHIJKLMNOPQRSTUVWXYZ_
abcdefghijklmnopqrstuvwxyz
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
àáâãäåæçèéêëìíîïðñòóôõöøùúûüþ
∆⍙ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• In a standard font, underscores display as Ⓐ to Ⓩ
• In Unicode, all of the above can now be used simultaneously (previously, the available set depended on whether the Alt or Std font was selected). Russian letters are NOT allowed.

Component File Interop
• Like workspaces, Component Files can be shared between Classic and Unicode editions.
• The same restriction applies: Classic cannot read arrays containing characters not in ⎕AV.
• Files can be marked as non-Unicode, in which case Unicode cannot write characters not in ⎕AV.
  – All “small” (32-bit) component files are non-Unicode
• For ordinary APL arrays (no ⎕ORs), the Unicode edition can share files with old versions of APL too.

File Properties
• New system function ⎕FPROPS allows you to control whether a file may contain Unicode data:
      'c:\temp\smallfile' ⎕FCREATE 32 32
      'EJSU' ⎕FPROPS 32       ⍝ Endian, Journaled, Size, Unicode
0 0 32 0
      'c:\temp\bigfile' ⎕FCREATE 64 64
      'EJSU' ⎕FPROPS 64
0 0 64 1
• Size defaults to 64 from v12.0 (new startup flag -F32/-F64)
• Small address size (32-bit) files are limited to 4 GB in size and can NOT have the Unicode bit set
• Setting Journaling on prevents sharing with v11.0 or earlier

Translation Error on Write
• Unicode edition can write to non-Unicode component files:
      '{⍺+⍵}' ⎕FAPPEND 32      ⍝ ∧/'{⍺+⍵}'∊⎕AV – fine!
      'U' 0 ⎕FPROPS 64         ⍝ Switch Unicode OFF
      '𠀁𠀂𠀃' ⎕FAPPEND 64      ⍝ Chars not in ⎕AV
TRANSLATION ERROR
      'U' 1 ⎕FPROPS 32         ⍝ Not allowed for small files
TRANSLATION ERROR
• If non-Unicode files do not contain namespaces or ⎕ORs, v10.1 and v11.0 can use them
• Note: Large files (64-bit) cannot be used with versions 10.0 or earlier.

TCP Socket / Conga Interop
• TCPSocket objects have an Encoding property:
Encoding  Style  Meaning
None      Char   No translation, characters must be in range 0-255.
UTF-8     Char   To UTF-8 on send, from UTF-8 on receive
Classic   APL    Chars transmitted encoded as elements of ⎕AV
Unicode   APL    Types 80, 160 or 320 used as required
• The default is None for Char, and Classic for APL
• APL sockets are non-Unicode by default to avoid crashing downversion APL interpreters receiving Unicode data
• Conga always sends data in “native” form; receive will fail with a TRANSLATION ERROR if data cannot be represented

External Variables
• External Variables are implemented as small span component files (32-bit files) and can thus NOT contain Unicode data:
      'c:\temp\xvar' ⎕XT 'x'
      x
Hello World
      x←'𠀁𠀂𠀃'
TRANSLATION ERROR
• External Variables should be seen as a “deprecated” feature: You will still be able to use existing external variables, but should plan to convert to component files or mapped files at your convenience.

Mapped Files
• Like external variables, the use of APL mapped files (containing APL arrays with header information) should be seen as a deprecated feature.
  – Convert to using other mechanisms at your earliest convenience.
• Support for RAW mapped files (where type information is provided when mapping) remains core functionality (and will probably get more important in a world of multicore machines):
      32↓102↑80 ¯1 ⎕MAP 'c:\Program Files\ComfortKeyboard\changes.txt'
Added new interface languages: Latvian, Brazilian Portuguese, Italian.
• Type 82 is NOT supported in the Unicode Edition: Mapped variables are “in the workspace” and cannot be translated on access.
• To read a raw file written using data type 82, map it with data type 83 and extract the characters by indexing into ⎕AVU.

(Own) DLLs and APs
• The format for passing APL arrays to Libraries and Auxiliary Processors is unchanged, except that a Unicode Edition will pass character arrays of type 80, 160 or 320
• Dyalog-provided libraries have been upgraded. A number of old APs like PREFECT are no longer shipped, but v11 versions will continue to work fine with the Classic Edition.
• If you have written your own APs or DLLs which handle character data, these need to be updated to deal with the new data types.
• You can return any of the Classic or Unicode character types; they will be translated (subject to the usual TRANSLATION ERROR limitations).

Native Files
• Unicode Edition also still supports type 82, so that old files containing APL characters can be used. They map to the “same characters” but with a different internal representation:
V11:  'c:\temp\plus' ⎕NCREATE ¯1
      '{⍺+⍵}' ⎕NAPPEND ¯1
V12:  ⎕DR ⎕←⎕NREAD ¯1 82 5 0
{⍺+⍵}
160

Native Files & Unicode
• Unicode Edition supports new data types 80, 160, 320, reading or writing 1, 2 or 4 bytes at a time (file is UCS-1, -2 or -4 encoded).
• Code Change Possibly Required: The DEFAULT TYPE when appending character arrays is now 80 (was 82):
      'plus:' ⎕NAPPEND ¯2       ⍝ Type 80 (all ANSI)
      '{⍺+⍵}' ⎕NAPPEND ¯1       ⍝ Type 160 (APL chars)
DOMAIN ERROR                    ⍝ Data cannot be narrowed
• Early Beta versions of 12.0 used the type of the left argument, but this led to variable numbers of bytes being used when writing, depending on the content of an array (160 if a non-ANSI character was included).
• If you need to write text containing APL to a native file, use type 160, or perhaps better, use UTF-8!

Native Files & UTF-8
• The most common way to store Unicode data in text files is to encode it using UTF-8: This is a format understood by “most” web applications and other Unicode-enabled applications.
      text←'plus←{⍺+⍵}'
      'UTF-8' ⎕UCS 'plus'
112 108 117 115
      'c:\temp\plus.txt' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS 'plus') ⎕NAPPEND ¯1
      ⎕CMD 'notepad c:\temp\plus.txt' 'normal'
• Windows Notepad is able to detect that the file is UTF-8 encoded and displays the text correctly.
• The monadic ⎕UCS on the left converts integers in the range 0-255 into one-byte Unicode characters before appending. Left as integers, values above 127 would become type 163 (2 bytes per element).

Native Files & UTF-8
• UCS-2 (2 bytes per character) is supported by many Microsoft apps (like Visual Studio). UCS-2 was the standard until Windows 2000; it has now been replaced by UTF-16, which is identical to UCS-2 for most data, but expands to 4 bytes when required.
• Applications need to know which encoding has been used. Two common methods of indicating this are “Byte Order Marks” at the beginning of the file, and (for web pages) HTTP tags.

Byte Order Mark
• By convention, the first few bytes of text files are sometimes (but not always) an encoding of U+FEFF, the “Byte Order Mark”, also known as “Zero width no-break space”. This convention allows applications to “guess” the encoding used:
1st bytes are...   Encoding is therefore probably
EF BB BF           UTF-8
FF FE              UTF-16 or UCS-2, written by little endian CPU (Intel)
FE FF              UTF-16 or UCS-2, big endian
FF FE 00 00        UTF-32 / UCS-4, little endian
00 00 FE FF        UTF-32 / UCS-4, big endian
• The convention is more common under Windows than Unix/Linux. Sometimes writing the BOM makes things worse...

Reading Text Files
    ∇ Chars←ReadFile name;nid;signature;nums
      ⍝ Read ANSI or Unicode character file (Windows)
      nid←name ⎕NTIE 0
      signature←3↑⎕NREAD nid 83 3 0
      :If signature≡¯17 ¯69 ¯65           ⍝ UTF-8 (EF BB BF)
          Chars←⎕NREAD nid 80(¯3+⎕NSIZE nid)3
          Chars←'UTF-8' ⎕UCS ⎕UCS Chars
      :ElseIf (2↑signature)≡¯1 ¯2         ⍝ LittleEnd UTF-16 (FF FE)
          Chars←⎕NREAD nid 160(¯1+⎕NSIZE nid)2
      :Else                               ⍝ ANSI
          Chars←⎕NREAD nid 80(⎕NSIZE nid)0
      :EndIf
      ⎕NUNTIE nid
    ∇

Writing Text Files

Writing a UTF-8 Web Page
      ⍝ NL is assumed to hold the line ending
      html←'<html>',NL,' <head>',NL
      html,←' <meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
      html,←' </head>',NL,'<body>',NL
      html,←' <font face="APL385 Unicode">'
      html,←'plus←{⍺+⍵}</font>',NL
      html,←'</body>',NL,'</html>',NL
      'c:\temp\plus.htm' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS html) ⎕NAPPEND ¯1
      ⎕NUNTIE ¯1
      ⎕CMD 'iexplore c:\temp\plus.htm' ''

Web Page: Results

UTF-8 Files with .NET
      apltxt←⎕SE.SALT.New 'C:\..\UTF8File' 'c:\temp\apl.txt'
      apltxt.Text
Compute average in APL:
avg←{(+/⍵)÷⍴⍵}
      apltxt.Text,←⊂'⍝ Morten was here'
      System.Text.Encoding.⎕NL ¯2
ASCII  BigEndianUnicode  Default  Unicode  UTF32  UTF7  UTF8

External Interfaces: COM/.NET
• COM/OLE, Microsoft.Net: No problem
  – Have been translating chars to UCS-2/UTF-16 “always”
  – Translation code removed in v12 Unicode
• We already saw it in
  action:
      ↑System.IO.File.ReadAllLines ⊂'c:\temp\apl.txt'
Compute average in APL:
avg←{(+/⍵)÷⍴⍵}

SQAPL / ODBC & Unicode
(not all results displayed in the following)
      SQA.Connect 'B' 'MS SQL Server' 'pass' 'user'
0
      SQA.Columns 'B' 'idioms'
COLUMN_NAME  ..  DATA_TYPE  TYPE_NAME     COLUMN_SIZE
id           ..  4          int identity  10
exp          ..  ¯9         nvarchar      400
      ⎕←data←3 1⊃SQA.Do 'B' 'select * from idioms'
1 {(+/⍵)÷⍴⍵}
2 {⍵/⍳⍴⍵}
3 {(<\⍵)⍳1}
      data[;2]←{⎕UCS 'UTF-8' ⎕UCS ⍵}¨data[;2]   ⍝ Make UTF8

SQAPL Example (continued)
      SQA.Do 'B' 'alter table idioms add utf8exp varbinary(100)'
      SQA.Prepare 'B.U1' 'update idioms set utf8exp=:<X20: where id=:<I:' ('Bulk' 20)
      SQA.X 'B.U1' (⌽data)   ⍝ Store UTF8
      ⎕←data←3 1⊃SQA.Do 'B' 'select id,exp,utf8exp from idioms'
1 {(+/⍵)÷⍴⍵} {(+/â• µ)÷╠´â• µ}
2 {⍵/⍳⍴⍵}    {â• µ/â• ³â• ´â• µ}
3 {(<\⍵)⍳1}  {(<\â• µ)â• ³1}
      data[;2]≡¨{'UTF-8' ⎕UCS (⎕UCS ⍵)~0}¨data[;3]   ⍝ It works!
1 1 1

ODBC / SQAPL Summary
• SQAPL 6.0 supports ODBC Unicode data types:
ODBC Type     SQAPL Type  Description
WCHAR         U           “Wide” fixed-length string
WVARCHAR      W           “Wide” variable-length
WLONGVARCHAR  Q           “Wide” unlimited-length
• These can be used in the same way as the single-byte types. In most cases, the choice is automatic (as we have seen).
• Note: The above applies to databases which have Unicode data types. However, Unicode data is often stored in single-byte types, UTF-8 encoded.
• Most of the work will be understanding how to store Unicode in your database and converting the data (see your Database Manual).

External Interfaces: ⎕NA
• In Classic & previous editions, parameter type C meant untranslated bytes and T meant “text”, translated to ANSI.
• In Unicode, both are untranslated.
• T without a width specification now means “wide characters according to the host convention”
• Thus: T means T1 in Classic, T2 in Unicode for Windows, and T4 under Unicode for Unix/Linux
• This means that the use of type T (<0T, >0T, =T) should be portable across Classic/Unicode systems
• Some (typically Unix/Linux) system calls expect data to be UTF-8 encoded: You must use dyadic ⎕UCS to do the translation.
• Future extensions to ⎕NA may provide UTF-8 encoding.

Selection of A or W Functions
• Under Windows, Win32 library calls which handle text are generally available in two variants:
  – An ANSI (narrow) version with a name ending in A
  – A Unicode (wide) version with a name ending in W
• For example, the function to display a message box is available as MessageBoxA and MessageBoxW.
• If you specify the character * at the end of a name, this will be replaced by A in Classic and W in the Unicode Edition.
• The intention is to allow you to write code which will work now under Classic and continue to work under Unicode, to facilitate parallel code testing and a controlled migration.

Portable ⎕NA Example
• The following function is portable between Classic and Unicode:
    ∇ ok←title MsgBox msg;MessageBox
      ⎕NA 'I user32|MessageBox* I <0T <0T I'
      ok←1=MessageBox 0 msg title 1   ⍝ 1=OK, 2=Cancel.
    ∇
• The function MessageBoxA will be selected by Classic, MessageBoxW by Unicode.
• <0T will mean 1-byte (translated) text under Classic, and 2-byte (untranslated) text under Unicode
  – Strictly speaking, text should be translated to UTF-16 in Classic, but this is only required for “a few” special chars

APL Source in Unicode Files
• SALT (Simple APL Library Toolkit) supports storage of functions, namespaces and classes in UTF-8 files with a .dyalog extension.
• You can also very easily write your own storage mechanism using Unicode text files.
  Under .Net it is trivial:
Save: System.IO.File.WriteAllText 'c:\temp\foo.txt' (⎕VR 'foo') System.Text.Encoding.UTF8
Load: ⎕FX System.IO.File.ReadAllText ⊂'c:\temp\foo.txt'
• Without .Net it requires a wee bit more work (as we have seen earlier)

Source Code Management
• Storing APL source in Unicode text files may seem less convenient to the seasoned APL programmer, but there are very significant advantages:
• High quality tools (both free and “commercial”) built for other languages can be used to edit, compare, manage source, and build systems without further ado
• Not only does this make it easier to position APL as a tool for “professional” software development, many of these tools are actually useful (there are some smart people “out there”)
• Young developers joining your APL team will already be familiar with these tools and feel “at home” more quickly
• The quality of life of the APL developer need not be sacrificed!

Demo of Source Code Mgt

Source Code Mgt Demo
• All tools shown here were downloaded from the internet; none of them knew about APL in any way.

Demo: Working with MyApp

Keyboarding
• Discuss IME vs new Keyboards
• Demo new Console Unix/Linux APLs

Migration Check List
• Are you migrating in order to simplify and stay current, or because you want to support “foreign” text in your application?
  – Probably, you should do the former first (or at least experiment with it), before trying the latter
• For the former, you only need to make sure that your interfaces to external systems (native files, databases etc) work the same way as before
  – You may need to add checks to prevent the inadvertent entry of Unicode characters that your external interfaces cannot handle
• For the latter, you need to be sure that external systems ALSO support Unicode, and how they want to exchange data with your application

Think about ...
• (Dyadic) ⎕DR
• Monadic ⍋ of char data
• Interop required with earlier versions?
• External Vars
• Mapped Files
• Own DLLs and APs
• Native Files
  – Need non-⎕AV/ANSI data
  – Convert to UTF-8?
• APL style TCP Sockets
• Win32 or other system calls via ⎕NA
• Underscores(!)
• Switching to SALT / SubVersion?

Suggested Strategy
• Migrate to v12 Classic, write code which works in both Classic & Unicode.
• Wait until the entire user base has upgraded to v12.
• Move the application to the Unicode Edition.
• Suggested timeframe for a large application with many interfaces might be 2-4 years.
• Start thinking now!
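
For the “Convert to UTF-8?” item on the check list, the native-file technique shown earlier can be wrapped in one small function. This is a minimal sketch for the Unicode Edition, not a definitive implementation: WriteUTF8 is a hypothetical name, the file name is only an example, and it assumes that ⎕NCREATE accepts 0 to mean “choose a free tie number” in the same way the deck's ReadFile uses ⎕NTIE 0.

```apl
    ∇ name WriteUTF8 text;nid
      ⍝ Hypothetical helper (Unicode Edition): write a character vector as UTF-8
      nid←name ⎕NCREATE 0                      ⍝ assumption: 0 = let APL pick the tie number
      (⎕UCS 'UTF-8' ⎕UCS text) ⎕NAPPEND nid    ⍝ encode to bytes, then append as type 80
      ⎕NUNTIE nid
    ∇

      'c:\temp\plus.txt' WriteUTF8 'plus←{⍺+⍵}'
```

No Byte Order Mark is written here; whether to add one depends on the consuming application, as noted on the Byte Order Mark slide. ⎕NCREATE will signal an error if the file already exists, so a production version would need to decide how to handle that case.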