Characters and strings

Rationale for Ada 2005: Predefined library

5. Characters and strings

@ An important improvement in Ada 2005 is the ability to deal with 16- and 32-bit characters both in the program text and in the executing program.

@ The fine details of the changes to the program text are perhaps for the language lawyer. The purpose is to permit the use of all relevant characters of the entire ISO/IEC 10646:2003 repertoire. The most important effect is that we can write programs using Cyrillic, Greek and other character sets.

@ A good example is provided by the addition of the constant

        π : constant := Pi;

@ to the package Ada.Numerics. This enables us to write mathematical programs in a more natural notation thus

        Circumference : Float := 2.0 * π * Radius;

@ Other examples might be for describing polar coordinates thus

        R : Float := Sqrt (X*X + Y*Y);
        φ : Angle := Arctan (Y, X);

@ and of course in France we can now declare a decent set of ingredients for breakfast

        type Breakfast_Stuff is (Croissant, Café, Œuf, Beurre);

@ Curiously, although the ligature æ is in Latin-1 and thus available in Ada 95 in identifiers, the ligature œ is not (for reasons we need not go into). However, in Ada 95, œ is a character of the type Wide_Character and so even in Ada 95 one can order breakfast thus

        Put ("Deux œufs easy-over avec jambon"); -- wide string

@ In order to manipulate 32-bit characters, Ada 2005 includes types Wide_Wide_Character and Wide_Wide_String in the package Standard and the appropriate operations to manipulate them in packages such as

Ada.Strings.Wide_Wide_Bounded
Ada.Strings.Wide_Wide_Fixed
Ada.Strings.Wide_Wide_Maps
Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants
Ada.Strings.Wide_Wide_Unbounded
Ada.Wide_Wide_Text_IO
Ada.Wide_Wide_Text_IO.Text_Streams
Ada.Wide_Wide_Text_IO.Complex_IO
Ada.Wide_Wide_Text_IO.Editing

@ There are also new attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width and so on.
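
@ As a small illustration of these attributes, the sketch below applies them to an enumeration type. The type Colour and the procedure name are invented for the example; only the attributes themselves and Ada.Wide_Wide_Text_IO come from the language.

```ada
with Ada.Wide_Wide_Text_IO;

procedure Wide_Wide_Attributes is
   type Colour is (Red, Green, Blue);   --  hypothetical type for the example

   Image : constant Wide_Wide_String := Colour'Wide_Wide_Image (Green);
begin
   --  The image of an enumeration literal is its name in upper case.
   Ada.Wide_Wide_Text_IO.Put_Line (Image);

   --  Wide_Wide_Value is the inverse of Wide_Wide_Image.
   pragma Assert (Colour'Wide_Wide_Value (Image) = Green);

   --  Wide_Wide_Width is the length of the longest image ("GREEN").
   pragma Assert (Colour'Wide_Wide_Width = 5);
end Wide_Wide_Attributes;
```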

@ The addition of wide-wide characters and strings introduces many additional possibilities for conversions. Just adding these directly to the existing package Ada.Characters.Handling could cause ambiguities in existing programs when using literals. So a new package Ada.Characters.Conversions has been added. This contains conversions in all combinations between Character, Wide_Character and Wide_Wide_Character and similarly for strings. The existing functions from Is_Character to To_Wide_String in Ada.Characters.Handling have been banished to Annex J.
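
@ A minimal sketch of how the conversions in Ada.Characters.Conversions might be used follows. The procedure name and the sample text are invented for the illustration; the euro sign is built with 'Val so that the example does not depend on the encoding of the source file.

```ada
with Ada.Characters.Conversions; use Ada.Characters.Conversions;
with Ada.Text_IO;

procedure Conversions_Demo is
   Narrow : constant String := "Price: ";

   --  U+20AC (euro sign): fits in Wide_Character but not in Character.
   Euro : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#20AC#);

   Text : constant Wide_Wide_String := To_Wide_Wide_String (Narrow) & Euro;
begin
   pragma Assert (not Is_Character (Euro));
   pragma Assert (Is_Wide_Character (Euro));
   pragma Assert (not Is_String (Text));

   --  Characters with no narrow equivalent are replaced by Substitute.
   Ada.Text_IO.Put_Line (To_String (Text, Substitute => '?'));
end Conversions_Demo;
```

@ Note that the argument of To_Wide_Wide_String is a typed constant. A bare literal such as "abc" could be either a String or a Wide_String, so a call with a literal would need a qualified expression such as String'("abc") to resolve the overloading.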

@ The introduction of more complex writing systems makes the definition of the case insensitivity of identifiers (the equivalence between upper and lower case) much more complicated.

@ In some systems, such as the ideographic system used by Chinese, Japanese and Korean, there is only one case, so things are easy. But in other systems, like the Latin, Greek and Cyrillic alphabets, upper and lower case characters have to be considered. Their equivalence is usually straightforward but there are some interesting exceptions such as

• Greek has two forms for lower case sigma (the normal form σ and the final form ς which is used at the end of a word). These both convert to the one upper case letter Σ.
• German has the lower case letter ß whose upper case form is made of two letters, namely SS.
• Slovenian has a grapheme LJ which is considered a single letter and has three forms: LJ, Lj and lj.

@ The Greek situation used to apply in English where the long s was used in the middle of words (where it looked like an f but without a cross stroke) and the familiar short s only at the end. To modern eyes this makes poetic lines such as "Where the bee sucks, there suck I" somewhat dubious. (This is sung by Ariel in Act V Scene I of The Tempest by William Shakespeare.) The definition chosen for Ada 2005 closely follows those provided by ISO/IEC 10646:2003 and by the Unicode Consortium; this hopefully means that all users should find that the case insensitivity of identifiers works as expected in their own language.

@ Of interest to all users whatever their language is the addition of a few more subprograms in the string handling packages. As explained in the Introduction, Ada 95 requires rather too many conversions between bounded and unbounded strings and the raw type String and, moreover, multiple searching is inconvenient.

@ The additional subprograms in the packages are as follows.

@ In the package Ada.Strings.Fixed (assuming use Maps; for brevity)

        function Index
          (Source  : String;
           Pattern : String;
           From    : Positive;
           Going   : Direction := Forward;
           Mapping : Character_Mapping := Identity) return Natural;
        function Index
          (Source  : String;
           Pattern : String;
           From    : Positive;
           Going   : Direction := Forward;
           Mapping : Character_Mapping_Function) return Natural;
        function Index
          (Source : String;
           Set    : Character_Set;
           From   : Positive;
           Test   : Membership := Inside;
           Going  : Direction := Forward) return Natural;
        function Index_Non_Blank
          (Source : String;
           From   : Positive;
           Going  : Direction := Forward) return Natural;

@ The difference between these and the existing functions is that these have an additional parameter From. This makes it much easier to search for all the occurrences of some pattern in a string.
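
@ As a sketch of how the From parameter helps, the following program picks the blank-separated words out of a String using Index_Non_Blank and the set form of Index. The procedure name, the sample line and the variable names are invented for the illustration.

```ada
with Ada.Strings.Fixed; use Ada.Strings.Fixed;
with Ada.Strings.Maps;  use Ada.Strings.Maps;
with Ada.Text_IO;

procedure Scan_Words is
   Line  : constant String := "  alpha  beta";
   Blank : constant Character_Set := To_Set (' ');

   --  Index of the first character of the current word, 0 when none is left.
   Start : Natural := Index_Non_Blank (Line, From => Line'First);
   Stop  : Natural;
begin
   while Start /= 0 loop
      --  First blank at or after the start of the word.
      Stop := Index (Line, Blank, From => Start);
      if Stop = 0 then
         Stop := Line'Last + 1;   --  the word runs to the end of the line
      end if;

      Ada.Text_IO.Put_Line (Line (Start .. Stop - 1));

      exit when Stop > Line'Last;
      Start := Index_Non_Blank (Line, From => Stop);
   end loop;
end Scan_Words;
```

@ Each search resumes where the previous one stopped, so no slices or copies of Line are needed. The guard on Stop matters because From must lie within the range of the source string.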

@ Similar functions are also added to the packages Ada.Strings.Bounded and Ada.Strings.Unbounded.

@ Thus suppose we want to find all the occurrences of "bar" in the string "barbara barnes" held in the variable BS of type Bounded_String. (I have put my wife into lower case for convenience.) There are 3 of course. The existing function Count can be used to determine this fact quite easily

        N := Count (BS, "bar"); -- is 3

@ But we really need to know where they are; we want the corresponding index values. The first is easy in Ada 95

        I := Index (BS, "bar"); -- is 1

@ But to find the next one in Ada 95 we have to do something such as take a slice by removing the first three characters and then search again. This would destroy the original string so we need to make a copy of at least part of it thus

        Part := Delete (BS, I, I+2); -- 2 is length "bar" - 1
        I := Index (Part, "bar") + 3; -- is 4

@ and so on in the not-so-obvious loop. (There are other ways, such as making a complete copy first; this could either be in another bounded string or, perhaps most simply, in a normal String. But whatever we do it is messy.) In Ada 2005, having found the index of the first in I, we can find the second by writing

        I := Index (BS, "bar", From => I+3);

@ and so on. This is clearly much easier.
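
@ Written out in full, the search for every occurrence might look like the sketch below. The instantiation B20, the procedure name and the variable Start are assumptions made for the example; the string and the pattern are those used above.

```ada
with Ada.Strings.Bounded;
with Ada.Text_IO;

procedure Find_All_Bars is
   --  Hypothetical instantiation; any Max of at least 14 would do.
   package B20 is new Ada.Strings.Bounded.Generic_Bounded_Length (Max => 20);
   use B20;

   Pattern : constant String := "bar";
   BS      : constant Bounded_String := To_Bounded_String ("barbara barnes");

   Start : Positive := 1;   --  where the next search begins
   I     : Natural;
begin
   --  From must not exceed the length of the string being searched.
   while Start <= Length (BS) loop
      I := Index (BS, Pattern, From => Start);
      exit when I = 0;
      Ada.Text_IO.Put_Line ("found at" & Integer'Image (I));
      Start := I + Pattern'Length;
   end loop;
end Find_All_Bars;
```

@ The loop reports the index values 1, 4 and 9, and BS itself is never copied or modified.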

@ The following are also added to Ada.Strings.Bounded

        procedure Set_Bounded_String
          (Target : out Bounded_String;
           Source : in String;
           Drop   : in Truncation := Error);
        function Bounded_Slice
          (Source : Bounded_String;
           Low    : Positive;
           High   : Natural) return Bounded_String;
        procedure Bounded_Slice
          (Source : in Bounded_String;
           Target : out Bounded_String;
           Low    : in Positive;
           High   : in Natural);

@ The procedure Set_Bounded_String is similar to the existing function To_Bounded_String. Thus rather than

        BS := To_Bounded_String ("A Bounded String");

@ we can equally write

        Set_Bounded_String (BS, "A Bounded String");

@ The slice subprograms avoid conversion to and from the type String. Thus to extract the characters from 3 to 9 we can write

        BS := Bounded_Slice (BS, 3, 9); -- "Bounded"

@ whereas in Ada 95 we have to write something like

        BS := To_Bounded_String (Slice (BS, 3, 9));
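
@ Put together as a complete program, the new subprograms might be exercised as in the following sketch. The instantiation B80 and the variable Word are invented for the example.

```ada
with Ada.Strings.Bounded;
with Ada.Text_IO;

procedure Slice_Demo is
   --  Hypothetical instantiation with room for 80 characters.
   package B80 is new Ada.Strings.Bounded.Generic_Bounded_Length (Max => 80);
   use B80;

   BS, Word : Bounded_String;
begin
   Set_Bounded_String (BS, "A Bounded String");

   --  Procedure form: the slice goes straight into Word.
   Bounded_Slice (Source => BS, Target => Word, Low => 3, High => 9);
   Ada.Text_IO.Put_Line (To_String (Word));

   --  Function form gives the same result.
   pragma Assert (Bounded_Slice (BS, 3, 9) = Word);
end Slice_Demo;
```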

@ Similar subprograms are added to Ada.Strings.Unbounded. These are even more valuable because unbounded strings are typically implemented with controlled types and the use of a procedure such as Set_Unbounded_String is much more efficient than the function To_Unbounded_String because it avoids assignment and thus calls of Adjust.
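
@ The unbounded versions are used in just the same way. The following is a sketch only, with an invented procedure name and sample text; it shows the procedure forms, which write directly into their Target parameter.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Unbounded_Demo is
   US, Word : Unbounded_String;
begin
   --  No function result is assigned here, only the out parameter is set.
   Set_Unbounded_String (US, "An Unbounded String");

   --  Characters 4 to 12 are copied directly into Word.
   Unbounded_Slice (Source => US, Target => Word, Low => 4, High => 12);

   Ada.Text_IO.Put_Line (To_String (Word));
end Unbounded_Demo;
```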

@ Input and output of bounded and unbounded strings in Ada 95 can only be done by converting to or from the type String. This is both slow and untidy. This problem is particularly acute with unbounded strings and so Ada 2005 provides the following additional package (we have added a use clause for brevity as usual)

        with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
        package Ada.Text_IO.Unbounded_IO is
           procedure Put (File : in File_Type; Item : in Unbounded_String);
           procedure Put (Item : in Unbounded_String);
           procedure Put_Line (File : in File_Type; Item : in Unbounded_String);
           procedure Put_Line (Item : in Unbounded_String);
           function Get_Line (File : File_Type) return Unbounded_String;
           function Get_Line return Unbounded_String;
           procedure Get_Line (File : in File_Type; Item : out Unbounded_String);
           procedure Get_Line (Item : out Unbounded_String);
        end Ada.Text_IO.Unbounded_IO;

@ The behaviour is as expected.
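
@ For instance, lines of any length can be written and read back without declaring a buffer. The sketch below uses an unnamed temporary file so that it is self-contained; the procedure name, the renamings TIO and UIO and the sample text are invented for the illustration.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;
with Ada.Text_IO.Unbounded_IO;

procedure Unbounded_IO_Demo is
   package TIO renames Ada.Text_IO;
   package UIO renames Ada.Text_IO.Unbounded_IO;

   F    : TIO.File_Type;
   Line : Unbounded_String;
begin
   --  Create with no name gives a temporary file.
   TIO.Create (F, TIO.Out_File);
   UIO.Put_Line (F, To_Unbounded_String ("first line"));
   UIO.Put_Line (F, To_Unbounded_String ("a second, rather longer line"));

   --  Read the lines back; Line grows to whatever length is needed.
   TIO.Reset (F, TIO.In_File);
   while not TIO.End_Of_File (F) loop
      UIO.Get_Line (F, Line);
      TIO.Put_Line (Natural'Image (Length (Line)) & ": " & To_String (Line));
   end loop;

   TIO.Close (F);
end Unbounded_IO_Demo;
```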

@ There is a similar package for bounded strings but it is generic. It has to be generic because the package Generic_Bounded_Length within Strings.Bounded is itself generic and has to be instantiated with the maximum string size. So the specification is

        with Ada.Strings.Bounded; use Ada.Strings.Bounded;
        generic
           with package Bounded is new Generic_Bounded_Length (<>);
           use Bounded;
        package Ada.Text_IO.Bounded_IO is
           procedure Put (File : in File_Type; Item : in Bounded_String);
           procedure Put (Item : in Bounded_String);
           ... -- etc as for Unbounded_IO
        end Ada.Text_IO.Bounded_IO;
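
@ Using this package therefore takes two instantiations, one for the bounded strings and one for their input-output. A minimal sketch follows; the names Line_80, Line_80_IO and Greeting are invented for the example.

```ada
with Ada.Strings.Bounded;
with Ada.Text_IO.Bounded_IO;

procedure Bounded_IO_Demo is
   --  First fix the maximum length of the strings ...
   package Line_80 is
     new Ada.Strings.Bounded.Generic_Bounded_Length (Max => 80);

   --  ... then instantiate the I/O package with that instance.
   package Line_80_IO is new Ada.Text_IO.Bounded_IO (Line_80);

   Greeting : constant Line_80.Bounded_String :=
     Line_80.To_Bounded_String ("Hello from a bounded string");
begin
   Line_80_IO.Put_Line (Greeting);
end Bounded_IO_Demo;
```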

@ It will be noticed that these packages include functions Get_Line as well as procedures Put_Line and Get_Line corresponding to those in Text_IO. The reason is that procedures Get_Line are not entirely satisfactory.

@ If we do successive calls of the procedure Text_IO.Get_Line using a string of length 80 on a series of lines of length 80 (we are reading a nice old deck of punched cards), then it does not work as expected. Alternate calls return a line of characters and a null string (the history of this behaviour goes back to early Ada 83 days and is best left dormant).

@ Ada 2005 accordingly adds corresponding functions Get_Line to the package Ada.Text_IO itself thus

        function Get_Line (File : File_Type) return String;
        function Get_Line return String;

@ Successive calls of a function Get_Line then neatly return the text on the cards one by one without bother.
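
@ The contrast can be tried out with the sketch below, which shrinks the cards to 8 columns and keeps them in an unnamed temporary file. The procedure name, the constant Card_Width and the card contents are invented for the illustration. On an implementation that follows the rules for the procedure Get_Line, the first loop is expected to show an empty string between the two cards, whereas the second loop returns just the two cards.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Card_Reader is
   Card_Width : constant := 8;   --  stands in for the 80 columns of a card

   F    : File_Type;
   Card : String (1 .. Card_Width);
   Last : Natural;
begin
   --  Write two "cards", each exactly Card_Width characters long.
   Create (F, Out_File);
   Put_Line (F, "CARD0001");
   Put_Line (F, "CARD0002");

   --  Procedure form: the buffer fills before the end of line is reached,
   --  so the line terminator is left for the next call.
   Reset (F, In_File);
   while not End_Of_File (F) loop
      Get_Line (F, Card, Last);
      Put_Line ("procedure: """ & Card (1 .. Last) & """");
   end loop;

   --  Function form: each call returns one whole line.
   Reset (F, In_File);
   while not End_Of_File (F) loop
      Put_Line ("function:  """ & Get_Line (F) & """");
   end loop;

   Close (F);
end Card_Reader;
```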

2010-10-24 00:26:58
