Rendering Recommendations For Indic Languages

Rahul Bhalerao

Dated 2nd January 2008

Contents

Prologue

Introductory Illustration

Devanagari

Oriya

Malayalam

References

Disclaimer

Prologue

This document discusses issues in the rendering of Indic languages on modern computers. Its main focus is on OpenType fonts and on rendering engines based on the OpenType specification, so wherever the context is unclear it should be assumed to be that of OpenType. Several frequently used terms carry their OpenType meanings. For example, a font is generally assumed to be an OpenType font. Similarly, the terms Layout Engine and Rendering Engine are used interchangeably to mean the same software component, such as Pango, Uniscribe or ICU.

Along with the issues, the document discusses possible solutions and recommends the most appropriate one, stating what it requires at each stage: Unicode, the rendering (layout) engine and the font.

The introductory illustration is only a means of explaining the OpenType mechanisms and should be adapted to the context of each issue. The final reference is always the original OpenType specification.

The intent of this document is to discuss these issues in detail and to provide solutions that can and should be adopted publicly on a large scale, either by making them a standard or by suggesting modifications to existing standards.

Introductory Illustration

A general equation for most combinations of consonants is:

Consonant1 + Halant + Consonant2

This equation can be extended to any number of consonants, with a Halant between each pair.

In OpenType such combinations are generally implemented in two steps:

1. Consonant1 + Halant = HalfConsonant1, where HalfConsonant1 is the half form of Consonant1

2. HalfConsonant1 + Consonant2 = the combination.

In most cases step 2 is not required, since HalfConsonant1 already has a glyph shape that attaches easily to the following consonant. Step 2 is used only for special cases where the default shape is not acceptable.

Combinations of this type are called "Pre-Base Substitutions", since the substitution happens before the base glyph, i.e. Consonant2. These combinations do not need any reordering.
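The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular layout engine or font is implemented: the glyph names (Ka.half, KaSsa.conjunct and so on) and both rule tables are hypothetical, and a real font would express them as GSUB lookups.

```python
# Hypothetical glyph names; a real font defines such rules in its GSUB table.
HALANT = "Halant"

# Step 1: Consonant + Halant -> half form of that consonant.
HALF_FORMS = {("Ka", HALANT): "Ka.half", ("Sa", HALANT): "Sa.half"}

# Step 2 (optional): half form + following consonant -> dedicated conjunct glyph.
CONJUNCTS = {("Ka.half", "Ssa"): "KaSsa.conjunct"}


def substitute_pairs(glyphs, rules):
    """Replace every adjacent pair found in `rules`, scanning left to right."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in rules:
            out.append(rules[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


def shape_pre_base(glyphs):
    """Apply step 1 (half forms), then step 2 (special conjuncts)."""
    return substitute_pairs(substitute_pairs(glyphs, HALF_FORMS), CONJUNCTS)
```

With these assumed tables, shape_pre_base(["Sa", "Halant", "Ka"]) stops after step 1 because no special conjunct is defined, while shape_pre_base(["Ka", "Halant", "Ssa"]) goes through both steps and ends in a single conjunct glyph.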

Other types of combinations require the substitution to happen after the base glyph. In these combinations it is Consonant2, not Consonant1, that takes a half form. They include the 'Post-base Substitution/Form' (psts/pstf) and the 'Below-base Substitution/Form' (blws/blwf). In OpenType a Post-base combination, for example, is implemented in the following steps:

1. Consonant1 + Halant + Consonant2 => The logical sequence of characters as the user inputs them

2. Consonant1 + (Consonant2 + Halant)[pstf] => Reordering in the Layout Engine, with a 'pstf' feature tag

3. (Consonant1 + HalfConsonant2)[psts] => Final substitution, where HalfConsonant2 is Consonant2 in its post-base half form.

A similar mechanism is used for Below-base form substitutions.
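The post-base steps can be sketched in the same style. The code below is a simplified model under stated assumptions: the set POST_BASE, the tuple used to carry the 'pstf' tag and the ".pstf" glyph suffix are all hypothetical, and a real engine performs many more checks before reordering.

```python
HALANT = "Halant"
POST_BASE = {"Ya"}  # assumption: consonants the engine treats as post-base


def reorder_post_base(chars):
    """Step 2: turn C1 + Halant + C2 into C1 + (C2 + Halant) tagged 'pstf'."""
    out = []
    i = 0
    while i < len(chars):
        if (out and i + 1 < len(chars)
                and chars[i] == HALANT and chars[i + 1] in POST_BASE):
            # Swap the pair and remember which feature applies to it.
            out.append((chars[i + 1], HALANT, "pstf"))
            i += 2
        else:
            out.append(chars[i])
            i += 1
    return out


def substitute_post_base(items):
    """Step 3: replace each tagged cluster by its post-base half-form glyph."""
    return [f"{item[0]}.pstf" if isinstance(item, tuple) else item
            for item in items]
```

For the input ["Ka", "Halant", "Ya"] the reordering step yields Ka followed by the tagged (Ya + Halant) cluster, and the substitution step then yields Ka followed by the hypothetical glyph Ya.pstf. A sequence whose second consonant is not in POST_BASE is left untouched.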

For this type of substitution to take place, the Layout Engine must be capable of reordering the particular consonants. In parallel, the font must have substitution rules that match the final reordered sequence and carry the proper feature tag. The forms for such consonants are documented in the OpenType Specification for Indic scripts here: http://www.microsoft.com/typography/otfntdev/indicot/appen.aspx#forms

Special forms such as reph are documented, along with blws, psts and the others, here: http://www.microsoft.com/typography/OpenType%20Dev/indic/intro.mspx

Please refer to the given links for the formal specifications. What is given here is only an illustration of the original specifications.

Let us look at the languages and scripts one by one.

1. Devanagari:

ISSUE 1:

Problem: The dependent vowel sign U+0945 (ॅ) needs to be used with the vowel U+0905 (अ), i.e. the sequence [U+0905 + U+0945]. This form is mainly used in Marathi.

However, a dotted circle automatically appears between the two characters, disallowing the sequence.

Unicode: There is no independent vowel associated with the dependent vowel sign U+0945. Unicode also does not allow sequences of a vowel and a vowel sign to be used to create some of the independent vowels, but it does not explicitly mention this particular sequence.

Severity: Occurs on almost all rendering systems that follow the Unicode and OpenType guidelines.

Solution:

Font: No change

Rendering engine: Should allow the sequence U+0905 + U+0945, but not any other vowel followed by U+0945. (Implementations may differ, depending on the rendering engine.)

Unicode: An explanation of this case should be proposed for inclusion in the Unicode standard.
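The rendering engine rule can be illustrated with a small Python sketch. It is an assumption about how such a check might look, not the code of any actual engine: the function name and the allow-list are hypothetical, and only the single rule discussed here is modelled.

```python
DOTTED_CIRCLE = "\u25CC"
CANDRA_E_SIGN = "\u0945"  # DEVANAGARI VOWEL SIGN CANDRA E

# Devanagari independent vowels U+0904..U+0914.
INDEPENDENT_VOWELS = {chr(cp) for cp in range(0x0904, 0x0915)}

# The only vowel + U+0945 sequence that should be let through.
ALLOWED_VOWEL_PLUS_SIGN = {("\u0905", CANDRA_E_SIGN)}


def insert_dotted_circles(text):
    """Insert a dotted circle before U+0945 after a vowel, except after U+0905."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        if (ch == CANDRA_E_SIGN and prev in INDEPENDENT_VOWELS
                and (prev, ch) not in ALLOWED_VOWEL_PLUS_SIGN):
            out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)
```

In this sketch the sequence U+0905 + U+0945 passes through unchanged, whereas U+0906 + U+0945 still receives the dotted circle.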

Latest Update: The independent vowel for this form (named DEVANAGARI LETTER CANDRA A in the standard) has been proposed and included in the Unicode 5.1.0 beta version at code point U+0972. The final Unicode 5.1.0 standard will be released in March. Rendering engines will thereafter only have to treat this character as an independent vowel, and fonts will need to have a glyph for it.

ISSUE 2:

Problem: For writing eyelash-Ra (mainly used in Marathi and Newari), some fonts support the sequence [0931+094D] (ऱ + ्) while others support [0930+094D+ZWJ] (र + ् + ZWJ). This creates a problem when the font used for writing follows a different method from the font used for reading.

Unicode reference: In Unicode 5.0, Section-9.1 the rules R5 and R5a allow both of these methods. R5 allows [0931+094d] in conformance with ISCII and R5a allows [0930+094d+ZWJ] for compatibility with Unicode 2.0.

Severity: Mostly on cross system data interchange e.g. web pages.

Solution:

Font: The font should support both sequences. In the OpenType context, it should have a glyph substitution rule for each of them.

Rendering Engine: No change

Unicode: No change
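The font-side recommendation amounts to mapping both character sequences to the same glyph. A minimal Python sketch of that idea follows; the glyph name ra.eyelash is hypothetical, and a real font would express the two rules as OpenType substitution lookups.

```python
ZWJ = "\u200D"

# Both encodings of eyelash-Ra allowed by rules R5 and R5a.
EYELASH_SEQUENCES = ("\u0931\u094D", "\u0930\u094D" + ZWJ)
EYELASH_GLYPH = "ra.eyelash"  # hypothetical glyph name


def shape_eyelash_ra(text):
    """Return a glyph list in which either sequence becomes the same glyph."""
    out = []
    i = 0
    while i < len(text):
        for seq in EYELASH_SEQUENCES:
            if text.startswith(seq, i):
                out.append(EYELASH_GLYPH)
                i += len(seq)
                break
        else:
            # No eyelash sequence starts here; pass the character through.
            out.append(text[i])
            i += 1
    return out
```

Text written with either method then produces the same output, which is what removes the mismatch between the font used for writing and the font used for reading.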

ISSUE 3:

Problem: Unicode contains redundant Nukta forms for a few consonants. Since Unicode has encoded the Nukta sign (093C), it is intended to be usable with any of the consonants. Still, Unicode includes 0929, 0931, 0934, 0958, 0959, 095A, 095B, 095C, 095D, 095E and 095F as explicit Nukta forms of the consonants 0928, 0930, 0933, 0915, 0916, 0917, 091C, 0921, 0922, 092B and 092F respectively. This is redundant, since they can all be formed using the sequence Consonant + Nukta (093C).

Severity: Affects every platform that supports Unicode. It may result in non-unique data: some data may use the Nukta sign (093C) while other data directly uses the encoded Nukta characters.

Solution: No proper solution can exist unless Unicode makes the correction, which is very unlikely since it may violate Unicode's policies on backward compatibility.

Font: An OpenType font for Devanagari should internally have ligature rules that substitute each [Consonant+093C] sequence listed above with the glyph of the corresponding encoded Nukta form.

Rendering Engine: No change

Unicode: Guidelines should come from Unicode itself.
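The pairing between the encoded Nukta forms and their Consonant + Nukta sequences can be read from the canonical decompositions in the Unicode character database. The Python sketch below uses the standard unicodedata module for that; the helper names are only illustrative, and comparing normalized text is shown as one possible way to cope with the non-unique data, not as a replacement for the font rules recommended above.

```python
import unicodedata

NUKTA = "\u093C"

# The eleven precomposed Nukta characters discussed above.
PRECOMPOSED = ["\u0929", "\u0931", "\u0934"] + [
    chr(cp) for cp in range(0x0958, 0x0960)
]


def nukta_pairs():
    """Map each precomposed character to its Consonant + Nukta sequence."""
    return {ch: unicodedata.normalize("NFD", ch) for ch in PRECOMPOSED}


def same_text(a, b):
    """Compare two strings after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

A font developer could use a table like the one returned by nukta_pairs() as the source for the ligature rules, one rule per entry.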

2. Oriya:

ISSUE 1:

Problem: Most consonants, when typed after certain combinations, cause the earlier combination to break and an unexpected new combination to be formed with the later consonant.

Technical analysis:

At first sight this looks like a problem caused by the reorderings done by the layout engine. These reorderings are in accordance with the OpenType specification, so the rest of the analysis is done in the context of OpenType.

The two sets to be considered here are:

1. The combinations which break:

These are generally the combinations that involve the half form of the Oriya consonant Ya, i.e. U+0B2F (ଯ). The equation is:

Consonant + Halant + Ya

2. The consonants that cause the breakage:

Many consonants break the combination when they follow the above equation. These include Ka, Kha, Ga, Gha etc. Others, including Ba, Bha, Ma etc., do not break it.

With reference to the 'Introductory Illustration' section and the table at http://www.microsoft.com/typography/otfntdev/indicot/appen.aspx#forms , it can be seen that Ya in Oriya, i.e. U+0B2F, is a post-base consonant, while many others are below-base. The problem described above happens only with consonants that are not defined as below-base, yet it is clear from the script requirements and from many of the glyphs in popular Oriya fonts that these consonants do appear in below-base form in many cases.

To implement the below-base forms in the font without support for them in OpenType, and thus in the layout engine, the font has to define a substitution rule for the below-base forms without reordering, i.e. the logical sequence

Consonant1 + Halant + Consonant2

is used directly, as

Consonant1 + (Halant + Consonant2) => Consonant1 + HalfConsonant2, where

HalfConsonant2 = (Halant + Consonant2)

This has a side effect whenever Consonant2 comes after a post-base or any other below-base consonant. The mechanism is as follows:

Character sequence => Consonant1 + Halant + Ya + Consonant2, where Consonant2 is a consonant that needs a below-base form but is not defined so in OpenType.

Expected result => (Consonant1 + Ya[post-base form]) + Consonant2

Actual result => Reordering for post-base Ya gives Consonant1 + Ya + Halant, which renders as Consonant1 + Ya[post-base form].

Then, when Consonant2 follows, the sequence of characters becomes:

Consonant1 + Ya + Halant + Consonant2

According to the rules defined in the font, the cluster (Halant + Consonant2) is then substituted by the below-base glyph of Consonant2, resulting in

Consonant1 + Ya + Consonant2[below-base form]

This breaks the intended combination of (Consonant1 + post-base Ya).
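The breakage can be reproduced with a toy model of the two components. This is a sketch under simplifying assumptions, not the behaviour of a specific engine or font: POST_BASE holds only Ya, FONT_BELOW_BASE stands for the font's unreordered (Halant + Consonant2) rule, and the glyph name Ka.below is hypothetical.

```python
HALANT = "Halant"
POST_BASE = {"Ya"}                    # post-base per the specification
FONT_BELOW_BASE = {"Ka": "Ka.below"}  # font workaround: (Halant + Ka) -> Ka.below


def engine_reorder(chars):
    """Swap Halant + post-base consonant, as the engine does for Ya."""
    out = list(chars)
    i = 0
    while i + 1 < len(out):
        if out[i] == HALANT and out[i + 1] in POST_BASE:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def font_substitute(glyphs):
    """Apply the font's unreordered rule (Halant + Consonant2) -> below-base."""
    out = []
    i = 0
    while i < len(glyphs):
        if (glyphs[i] == HALANT and i + 1 < len(glyphs)
                and glyphs[i + 1] in FONT_BELOW_BASE):
            out.append(FONT_BELOW_BASE[glyphs[i + 1]])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out
```

For the input ["Pa", "Halant", "Ya", "Ka"], engine_reorder moves the Halant after Ya, and font_substitute then consumes that Halant together with Ka. The Halant that belonged to the Ya combination is gone, so Ya can no longer take its post-base form.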

Solutions:

Unicode: No Change

OpenType/Rendering Engine: The only apt solution is to define all the required consonants as below-base forms. Any OpenType Layout Engine should then reorder the below-base consonants whenever their below-base forms are needed.

Consonants in Oriya that need to be below-base but are not defined so in OpenType include:

U+0B15 Ka

U+0B16 Kha

U+0B17 Ga

U+0B18 Gha

U+0B19 Nga

U+0B1A Ca

U+0B1B Cha

U+0B1C Ja

U+0B1D Jha

U+0B1F Tta

U+0B21 Dda

U+0B22 Ddha

U+0B23 Nna

U+0B25 Tha

U+0B26 Da

U+0B27 Dha

U+0B2A Pa

U+0B2B Pha

U+0B35 Va

U+0B36 Sha

U+0B37 Ssa

U+0B38 Sa

U+0B39 Ha

U+0B71 Wa

Fonts: Along with the support in the OpenType Layout Engine, Oriya fonts should take into account the reordering done by the Layout Engine. Thus for all of the consonants listed above, and also for those mentioned in the OpenType specification, the font should include a below-base glyph, with a substitution rule as follows:

(Consonant + Halant)[blwf], i.e. glyph => Consonant + Halant with the feature tag 'blwf'

To distinguish this from the actual half-form glyphs (i.e. non-reordered sequences), the font should also include the half-form glyphs, generally a glyph containing the shape of the consonant followed by the shape of the halant, with a substitution rule that uses 'half' as its feature tag. Some conjuncts may have shapes that do not depend on the below-base or half forms; in such cases separate substitution rules should be provided, incorporating the reordering wherever applicable.
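How the same (Consonant + Halant) pair can lead to two different glyphs depending on the feature tag can be sketched as follows. This is a heavily simplified model: the set BELOW_BASE is only a small assumed subset of the list above, the glyph names are hypothetical, and the decision of when 'half' applies is an assumption made for the sake of the example.

```python
HALANT = "Halant"
BELOW_BASE = {"Ka", "Ga"}  # assumption: subset of the consonants listed above

# The same consonant has one glyph per feature tag.
RULES = {
    ("Ka", "blwf"): "Ka.below", ("Ka", "half"): "Ka.half",
    ("Ga", "blwf"): "Ga.below", ("Ga", "half"): "Ga.half",
}


def shape(chars):
    """Use 'blwf' for reordered clusters and 'half' for natural-order ones."""
    out = []
    i = 0
    while i < len(chars):
        cur = chars[i]
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        after = chars[i + 2] if i + 2 < len(chars) else None
        if cur == HALANT and out and nxt in BELOW_BASE:
            # Engine reorders to (Consonant + Halant)[blwf]; font substitutes.
            out.append(RULES[(nxt, "blwf")])
            i += 2
        elif nxt == HALANT and after not in BELOW_BASE and (cur, "half") in RULES:
            # Natural, non-reordered Consonant + Halant uses the 'half' rule.
            out.append(RULES[(cur, "half")])
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

In this model Pa + Halant + Ka yields Pa followed by Ka.below, while Ka + Halant + Ma yields Ka.half followed by Ma, so the two uses of (Ka + Halant) no longer collide.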

ISSUE 2:

Problem: The consonant Va U+0B35 (ଵ) is not a proper Oriya consonant according to most linguists and users. Instead they use Ba U+0B2C (ବ) whenever needed.

There also exists a consonant Wa U+0B71 (ୱ) which is similar to these.

Solution:

This is not necessarily a technical problem but more a matter of preference. Users should be free to use any of these three. Since they are all already in Unicode, it is not appropriate to remove them.

For the conjunct formation, Va and Ba should have their normal below-base forms present in the font. Additionally, Wa i.e. U+0B71 should also have a below-base form and it should appear identical to that of Ba U+0B2C.

Unicode: No Change

OpenType/Layout Engine: Should declare all three consonants as below-base forms.

Fonts: The font should have appropriate substitution rules as explained in the solution for ISSUE 1. The blwf glyphs of Ba and Wa should be identical.

ISSUE 3:

Problem: Oriya uses the consonant Yya U+0B5F (ୟ) often as an alternative to the consonant Ya U+0B2F (ଯ) in their post-base form. But combinations with U+0B5F are not rendered properly.

Solution:

Unicode: No Change

OpenType/Layout Engine: U+0B5F is used as a post-base form but is not defined so in OpenType, so the solution is similar to that of ISSUE 1. U+0B5F should be defined as a post-base form in OpenType and thus in the Layout Engine.

Fonts: The font should have a general post-base substitution rule for U+0B5F, similar to that of U+0B2F and taking into account the reordering done in the Layout Engine. Since Yya is used as an alternative to Ya in some cases, the shape of the post-base form of U+0B5F should be identical to that of the post-base form of U+0B2F.

Note: All the listed Oriya issues are addressed in the Open Source layout engine 'Pango' and in popular Open Source fonts such as Lohit, Utkal and Samyak. They have been tested and work correctly, and they can be adopted by other rendering engines and fonts and used as a reference in support of the arguments made here.

3. Malayalam:

ISSUE 1:

Problem: Rendering of the conjuncts in Malayalam of the form (Consonant + Halant + Ra(U+0D30)) (e.g. പ + ് + ര) is not correct on Open Source platforms and fonts.

Description: This problem depends on the two variations of the Malayalam script, Traditional and Reformed. In the reformed script, the later Ra is transformed to its post-base form and moved in front of the earlier consonant. In the traditional script, by contrast, the entire cluster is replaced by a new shape, which needs a separate glyph for the whole cluster.

Technical Analysis:

For the old (traditional) script, a simple pre-base substitution is enough. The new script, however, requires reordering, since the pstf form of Ra moves in front of the consonant. Both scripts need to be supported on a single platform, so let us look at the feasibility.

Since the reformed script needs the reordering to take place, it is not possible to use a plain substitution for the traditional-style combination.

If reordering is used, the substitution in the old-script font cannot be simple, but it is still possible, since the reordering can be incorporated into the pre-base substitution.

The OpenType specification for handling this case is found here: http://www.microsoft.com/typography/otfntdev/indicot/appen.aspx#forms

As seen there, the consonant Ra in the Malayalam reformed script is defined as a post-base form, with a special case denoted by the asterisk (*). The footnote says 'will be reordered at syllable start'. This certainly suggests the need to reorder at the beginning, yet the actual intended reordering is not clear.

Solution:

Unicode: No Change

OpenType/Rendering Engine:

OpenType should be clearer about the expected reordering in this case, and about how to implement both the new and old scripts on the same platform.

The alternative is to derive a reordering that best suits the problem and resolves it efficiently.

The logical sequence of characters:

Consonant + Halant + Ra

The physical sequence of glyphs:

(post base form of Ra) + Consonant

Thus the reordered sequence of glyphs is:

Consonant + Halant + Ra == Reordered as == (Ra + Halant)pstf + Consonant

The feature tag 'pstf' is necessary to distinguish this form of Ra from the natural sequence of half Ra, i.e. (Ra + Halant)half

To support the old-script combination, the following sequence of substitutions has to be used:

Ra + Halant = (HalfRa)pstf and

[(HalfRa)pstf + Consonant]pres

thus incorporating the reordered sequence as a pre-base substitution.

In conclusion, the Rendering Engine should reorder the characters as explained above and define Ra as a post-base form consonant.
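The proposed reordering, and the way one engine behaviour can serve both script styles, can be sketched in Python. The sketch is illustrative only: the glyph names Ra.pstf and PaRa.conjunct and both rule tables are hypothetical, and the `traditional` flag merely stands in for the choice between a reformed-script font and an old-script font.

```python
HALANT = "Halant"
RA = "Ra"

# Reformed-script font: (Ra + Halant)pstf -> post-base glyph of Ra.
PSTF_RULES = {(RA, HALANT, "pstf"): "Ra.pstf"}

# Old-script font adds: [(HalfRa)pstf + Consonant]pres -> whole-cluster glyph.
PRES_RULES = {("Ra.pstf", "Pa"): "PaRa.conjunct"}


def reorder_ra(chars):
    """Consonant + Halant + Ra -> (Ra + Halant)[pstf] + Consonant."""
    out = []
    i = 0
    while i < len(chars):
        if (i + 2 < len(chars) and chars[i] != HALANT
                and chars[i + 1] == HALANT and chars[i + 2] == RA):
            out.append((RA, HALANT, "pstf"))
            out.append(chars[i])
            i += 3
        else:
            out.append(chars[i])
            i += 1
    return out


def shape(chars, traditional=False):
    """Reorder, apply 'pstf', then 'pres' only for an old-script font."""
    glyphs = [PSTF_RULES.get(item, item) for item in reorder_ra(chars)]
    if not traditional:
        return glyphs
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in PRES_RULES:
            out.append(PRES_RULES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out
```

With the same reordering in both cases, shape(["Pa", "Halant", "Ra"]) gives the reformed result (Ra.pstf before Pa), and shape(["Pa", "Halant", "Ra"], traditional=True) collapses the reordered pair into one cluster glyph. The natural sequence Ra + Halant + Pa is not reordered and remains available for the 'half' rule.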

Font:

Part of the solution lies in the font file.

For Reformed Script:

As already explained, a font should have a glyph with the post-base form of Ra and a substitution rule, (Ra+Halant)pstf.

It should also have another glyph for the half form of Ra, with the substitution rule (Ra + Halant)half, i.e. using the 'half' tag. This is to distinguish the reordered sequence of (Ra+Halant) from the natural sequence of (Ra + Halant).

For Old Script :

The final glyph should have substitution rule as,

(HalfRa.pstf + Consonant ) pres i.e. using pre-base substitution, where,

HalfRa.pstf => (Ra + Halant)pstf

And a half form of Ra as usual i.e. Ra + Halant with 'half' as a feature tag.

* Why is a new script without reordering not acceptable?

The new script is intended to minimize the number of glyphs. Without reordering, substitution rules similar to those of an old-script font have to be used, where an entire cluster is substituted by a new glyph. This means a new glyph for every consonant combination, and there can be any number of such combinations. The purpose of the new script is then not served, and the font is loaded with a huge number of glyphs, resulting in redundancy, overload and high resource consumption.

Since reordering for the new script does not affect the use of the old script, the solution described above is appropriate for a system that should support both scripts.

* Why support both the traditional and reformed scripts?

The traditional script is considered to be the original script, used for hundreds of years for writing manuscripts. It is therefore also considered the correct form of the script, and it is necessary to support it.

The new script emerged as a result of technological changes in writing systems, e.g. the typewriter, where only a minimal number of glyphs can be used. Because of these technical needs it has become a common form of the script, used by many people and found in government, offices and many books. Although digital computers today have no technical limitation on the number of glyphs, it is always good practice to keep things as light as possible. Since developing an old-script font needs a huge collection of glyphs to be created, it can be discouraging for a font developer to design such a font. A minimal set of glyphs will always encourage new developers.

Thus a font for the reformed script, with a minimal, non-redundant glyph set, should also be supported.

ISSUE 2:

Problem: Malayalam uses a form of syllable called samvruthokaram, which is a sequence of U+0D41, i.e. Vowel sign U, followed by the Halant U+0D4D. The sequence is not allowed and is blocked by a dotted circle.

Technical analysis: The dotted circle appears because the Halant is not supposed to follow any vowel sign. This sequence is the only exception to that rule.

Solution:

Unicode: No Change, but a guideline should be added.

OpenType/Layout Engine: Layout engine should catch this particular sequence and allow it without inserting the dotted circle in between.

Font: No Change.
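The layout engine exception can be sketched as a single special case in the check that inserts the dotted circle. As before, this is a hypothetical illustration rather than code from any engine: the function name is invented, and only the Halant-after-vowel-sign rule is modelled.

```python
DOTTED_CIRCLE = "\u25CC"
HALANT = "\u0D4D"        # MALAYALAM SIGN VIRAMA
VOWEL_SIGN_U = "\u0D41"  # MALAYALAM VOWEL SIGN U

# Malayalam dependent vowel signs lie in U+0D3E..U+0D4C
# (the unassigned code points in that range are harmless here).
VOWEL_SIGNS = {chr(cp) for cp in range(0x0D3E, 0x0D4D)}


def insert_dotted_circles(text):
    """Block Halant after a vowel sign, except after Vowel sign U."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        if ch == HALANT and prev in VOWEL_SIGNS and prev != VOWEL_SIGN_U:
            out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)
```

The samvruthokaram sequence U+0D41 + U+0D4D passes through untouched, while a Halant after any other vowel sign still gets the dotted circle.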

ISSUE 3:

Problem: Many of the combinations of La U+0D32 (ല) and Va U+0D35 (വ) do not work properly, and they also break other combinations on Open Source platforms such as Pango.

Technical analysis: This case is similar to the Oriya ISSUE 1. Here La and Va appear in their below-base and post-base forms respectively in most cases, and they are defined so in the OpenType specification. The problem occurs because other forms of them are also in use: in some cases it is necessary to use them in pre-base forms.

Thus a few Open Source implementations do not employ the required reordering and work around it by using the akhand (akhn) form of ligature substitution.

Solution:

Unicode: No Change

Layout Engine: Layout engines should strictly define La (U+0D32) and Va (U+0D35) as below-base and post-base forms respectively.

Fonts: Fonts should avoid using 'akhn' forms for combinations involving such reordering. Instead they should incorporate the reordering and use 'pre-base' form of substitution whenever possible.

ISSUE 4:

Problem: Should the combination described in ISSUE 1 also exist for U+0D31 (റ)?

Illustration:

This is a matter of preference. It appears that in some cases U+0D31, rather than U+0D30, is the implied consonant, and grammatically some cases do have this variation. However, since U+0D31 was never used on keyboards to type the combination, usage of U+0D30 alone became widespread and accepted. On modern computer keyboards there is no difficulty in using multiple keys, and a single key press can produce the same outcome, so the combination should be supported with U+0D31 as well. A solution similar to that for U+0D30 should therefore be implemented for U+0D31.

ISSUE 5:

Problem: Chillaksharam: a few Malayalam consonants have special alternative half forms called chillaksharam. They are currently implemented using ZWJ, but Unicode has now included them in Unicode version 5.1 as chillu characters.

Illustrations: The inclusion of these characters in Unicode should not affect present implementations, but implementations should move to the new Unicode characters as soon as possible. Meanwhile, so that users notice no change, the same substitution rule can be applied to the glyphs at the code points of the chillu characters in the new version of Unicode. This should not be a major issue.
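The interim arrangement, in which the ZWJ sequence and the new chillu character lead to the same glyph, can be sketched as below. The pairing of base consonants with chillu code points is an assumption based on the Unicode 5.1 code chart names and should be checked against the standard; the glyph naming scheme is hypothetical.

```python
ZWJ = "\u200D"
VIRAMA = "\u0D4D"

# Assumed pairing: base consonant -> atomic chillu character (Unicode 5.1).
BASE_TO_CHILLU = {
    "\u0D23": "\u0D7A",  # NNA -> CHILLU NN
    "\u0D28": "\u0D7B",  # NA  -> CHILLU N
    "\u0D30": "\u0D7C",  # RA  -> CHILLU RR
    "\u0D32": "\u0D7D",  # LA  -> CHILLU L
    "\u0D33": "\u0D7E",  # LLA -> CHILLU LL
}

# One hypothetical glyph per chillu, reachable from both encodings.
GLYPHS = {}
for base, chillu in BASE_TO_CHILLU.items():
    name = f"chillu.{ord(chillu):04X}"
    GLYPHS[base + VIRAMA + ZWJ] = name  # existing ZWJ sequence
    GLYPHS[chillu] = name               # new atomic character


def shape_chillu(text):
    """Map ZWJ sequences and atomic chillu characters to the same glyph."""
    out = []
    i = 0
    while i < len(text):
        for length in (3, 1):
            key = text[i:i + length]
            if len(key) == length and key in GLYPHS:
                out.append(GLYPHS[key])
                i += length
                break
        else:
            out.append(text[i])
            i += 1
    return out
```

Under these assumptions, text stored with the older ZWJ sequences and text stored with the new code points produce identical glyph output, so users see no change while data is migrated.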

References:

Disclaimer:

This document is only a recommendation. It is not claimed to be perfect in every situation. The conclusions should be understood in context: for example, most of the discussion refers to OpenType rendering and should not be taken to apply to any other kind of rendering unless explicitly mentioned. Implementations based on these conclusions may differ from system to system and need to be tested thoroughly.