Because they're not there! Why character classes at all? The whole point of RE languages is to make pattern matching simpler than writing your own matching algorithms. Standardized RE languages are also easier to read, write, maintain and share than the alternatives. Syntax for character classes exists to exploit, in a simple notation, the shared properties of a group of characters.
So what properties do Ethiopic characters have that RE languages do not detect? Each Ethiopic letter carries two properties that should be matched independently. Each letter is a syllable, a "CV" pattern, so a means to detect either the "C" part or the "V" part is highly desirable. There are 7 basic "V" forms shared by 37 "C" bases. This gives us 7 form classes of 37 members each and, inversely, 37 family classes of 7 members each. The number of elements per class is indeed larger than in a number of the already defined character classes.
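The two kinds of class can be sketched directly from code points. This is only an illustration, not the package's implementation: it assumes the layout of the Unicode Ethiopic block (each base starts a row of 8 code points beginning at U+1200, with six rows reserved for labiovelar forms), and the helper names `family_members` and `form_members` are hypothetical. Only the 7 basic forms are covered.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Rows of the Ethiopic block that hold labiovelar forms rather than a new
# base; they are skipped when listing the 37 bases (assumed block layout).
my %labiovelar_row = map { $_ => 1 }
    (0x1248, 0x1258, 0x1288, 0x12B0, 0x12C0, 0x1310);

# The 37 base code points: every 8th code point from U+1200 to U+1350.
my @bases = grep { !$labiovelar_row{$_} } map { 0x1200 + 8 * $_ } 0 .. 42;

# Hypothetical helper: the 7 basic syllables sharing the "C" of $syllable.
sub family_members {
    my ($syllable) = @_;
    my $base = ord($syllable) & ~7;    # clear the low 3 bits: start of the row
    return map { chr($base + $_) } 0 .. 6;
}

# Hypothetical helper: the 37 syllables sharing the "V" of order $order (1..7).
sub form_members {
    my ($order) = @_;
    return map { chr($_ + $order - 1) } @bases;
}

# Build ordinary bracket classes from the member lists.
my $family_re = do { my $set = join '', family_members("ከ"); qr/[$set]/ };
my $form_re   = do { my $set = join '', form_members(2);      qr/[$set]/ };

print "ኩ is in the family of ከ\n" if "ኩ" =~ $family_re;
print "ኩ and ሉ share form 2\n"    if "ኩ" =~ $form_re && "ሉ" =~ $form_re;
```

Written out by hand, each form class would be a 37-character bracket expression, which is the inconvenience a dedicated notation is meant to remove.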
But isn't this just a matter of convenience? Yes, but we like convenience, that's why we have character classes and REs to begin with. We have both \d and [:digit:] when [0-9] would suffice.

Family | [#1#] | [#2#] | [#3#] | [#4#] | [#5#] | [#6#] | [#7#] | [#8#] | [#9#] | [#10#] | [#11#] | [#12#]
---|---|---|---|---|---|---|---|---|---|---|---|---|
[#ሀ#] | ሀ | ሁ | ሂ | ሃ | ሄ | ህ | ሆ | |||||
[#ለ#] | ለ | ሉ | ሊ | ላ | ሌ | ል | ሎ | ሏ | ||||
[#ሐ#] | ሐ | ሑ | ሒ | ሓ | ሔ | ሕ | ሖ | ሗ | ||||
[#መ#] | መ | ሙ | ሚ | ማ | ሜ | ም | ሞ | ሟ | ||||
[#ሠ#] | ሠ | ሡ | ሢ | ሣ | ሤ | ሥ | ሦ | ሧ | ||||
[#ረ#] | ረ | ሩ | ሪ | ራ | ሬ | ር | ሮ | ሯ | ||||
[#ሰ#] | ሰ | ሱ | ሲ | ሳ | ሴ | ስ | ሶ | ሷ | ||||
[#ሸ#] | ሸ | ሹ | ሺ | ሻ | ሼ | ሽ | ሾ | ሿ | ||||
[#ቀ#] | ቀ | ቁ | ቂ | ቃ | ቄ | ቅ | ቆ | ቈ | ቍ | ቊ | ቋ | ቌ |
[#ቐ#] | ቐ | ቑ | ቒ | ቓ | ቔ | ቕ | ቖ | ቘ | ቝ | ቚ | ቛ | ቜ |
[#በ#] | በ | ቡ | ቢ | ባ | ቤ | ብ | ቦ | ቧ | ||||
[#ቨ#] | ቨ | ቩ | ቪ | ቫ | ቬ | ቭ | ቮ | ቯ | ||||
[#ተ#] | ተ | ቱ | ቲ | ታ | ቴ | ት | ቶ | ቷ | ||||
[#ቸ#] | ቸ | ቹ | ቺ | ቻ | ቼ | ች | ቾ | ቿ | ||||
[#ኀ#] | ኀ | ኁ | ኂ | ኃ | ኄ | ኅ | ኆ | ኈ | ኍ | ኊ | ኋ | ኌ |
[#ነ#] | ነ | ኑ | ኒ | ና | ኔ | ን | ኖ | ኗ | ||||
[#ኘ#] | ኘ | ኙ | ኚ | ኛ | ኜ | ኝ | ኞ | ኟ | ||||
[#አ#] | አ | ኡ | ኢ | ኣ | ኤ | እ | ኦ | ኧ | ||||
[#ከ#] | ከ | ኩ | ኪ | ካ | ኬ | ክ | ኮ | ኰ | ኵ | ኲ | ኳ | ኴ |
[#ኸ#] | ኸ | ኹ | ኺ | ኻ | ኼ | ኽ | ኾ | ዀ | ዅ | ዂ | ዃ | ዄ |
[#ወ#] | ወ | ዉ | ዊ | ዋ | ዌ | ው | ዎ | |||||
[#ዐ#] | ዐ | ዑ | ዒ | ዓ | ዔ | ዕ | ዖ | |||||
[#ዘ#] | ዘ | ዙ | ዚ | ዛ | ዜ | ዝ | ዞ | ዟ | ||||
[#ዠ#] | ዠ | ዡ | ዢ | ዣ | ዤ | ዥ | ዦ | ዧ | ||||
[#የ#] | የ | ዩ | ዪ | ያ | ዬ | ይ | ዮ | |||||
[#ደ#] | ደ | ዱ | ዲ | ዳ | ዴ | ድ | ዶ | ዷ | ||||
[#ዸ#] | ዸ | ዹ | ዺ | ዻ | ዼ | ዽ | ዾ | ዿ | ||||
[#ጀ#] | ጀ | ጁ | ጂ | ጃ | ጄ | ጅ | ጆ | ጇ | ||||
[#ገ#] | ገ | ጉ | ጊ | ጋ | ጌ | ግ | ጎ | ጐ | ጕ | ጒ | ጓ | ጔ |
[#ጘ#] | ጘ | ጙ | ጚ | ጛ | ጜ | ጝ | ጞ | |||||
[#ጠ#] | ጠ | ጡ | ጢ | ጣ | ጤ | ጥ | ጦ | ጧ | ||||
[#ጨ#] | ጨ | ጩ | ጪ | ጫ | ጬ | ጭ | ጮ | ጯ | ||||
[#ጰ#] | ጰ | ጱ | ጲ | ጳ | ጴ | ጵ | ጶ | ጷ | ||||
[#ጸ#] | ጸ | ጹ | ጺ | ጻ | ጼ | ጽ | ጾ | ጿ | ||||
[#ፀ#] | ፀ | ፁ | ፂ | ፃ | ፄ | ፅ | ፆ | |||||
[#ፈ#] | ፈ | ፉ | ፊ | ፋ | ፌ | ፍ | ፎ | ፏ | ||||
[#ፐ#] | ፐ | ፑ | ፒ | ፓ | ፔ | ፕ | ፖ | ፗ |

Equivalence in Phono-Orthography

Equivalence of Families

Overloading Perl's regular expression mechanism is the preferred way to use the Regexp::Ethiopic package. However, the overloading mechanism applies only to the constant part of an RE. The following would not be handled by the Regexp::Ethiopic package as expected:

    use Regexp::Ethiopic 'overload';

    my $x = "ከ";
      :
      :
    if ( /[#$x#]/ ) {
      :
      :
    }

The package never gets to see the value of $x, so it cannot perform the RE expansion. The workaround is to call the package's function interface directly:

    use Regexp::Ethiopic 'overload';

    my $x = "ከ";
      :
      :
    my $re = Regexp::Ethiopic::getRe ( "[#$x#]" );
    if ( /$re/ ) {
      :
      :
    }

This works as expected at the cost of one extra step. The overloading and functional modes of the Regexp::Ethiopic package may be used together without conflict.
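The limitation is a property of Perl's constant overloading in general, and can be reproduced with a toy package. The sketch below is not Regexp::Ethiopic code: the `ToyRe` package, its `expand` function and the made-up `<v>` token are all hypothetical stand-ins, and only the core `overload::constant` hook for `qr` is real Perl.

```perl
use strict;
use warnings;

package ToyRe;
use overload ();

# Hypothetical expander: rewrite the made-up token <v> into a vowel class.
sub expand {
    my ($source) = @_;
    $source =~ s/<v>/[aeiou]/g;
    return $source;
}

# Install the expander for the constant pieces of regular expressions
# compiled in the importing scope.
sub import {
    overload::constant( 'qr' => sub { expand( $_[0] ) } );
}

package main;
BEGIN { ToyRe->import }    # what "use ToyRe;" would do for a separate file

my $token = '<v>';

# Constant pattern: the handler sees "<v>" and expands it.
my $constant_hit = 'a' =~ /\A<v>\z/ ? 1 : 0;

# Interpolated pattern: the handler only sees "\A" and "\z", never "<v>".
my $variable_hit = 'a' =~ /\A$token\z/ ? 1 : 0;

# Workaround: expand explicitly, then interpolate the result.
my $expanded       = ToyRe::expand($token);
my $workaround_hit = 'a' =~ /\A$expanded\z/ ? 1 : 0;

print "constant: $constant_hit, variable: $variable_hit, ",
      "workaround: $workaround_hit\n";
```

The first and third matches succeed while the second fails, which mirrors the difference between writing the class literally in the pattern and going through an explicit expansion call such as getRe.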
The initial philosophy applied to syllabic character class development was to stick with existing POSIX definitions and notation ([=x=], [:x:], etc.) and simply apply them in the context of a syllabary. Shoe-horning syllabic classes into POSIX norms has proven at times to be both awkward and confusing. As this package is experimental, a clean break is made at this time from previously proposed notations, and class symbols are used that appear to be intuitive and easy to type.
In large part, what has complicated working with Ethiopic character classes is the mismatch between the larger number of Ethiopic classes and the POSIX abstractions that are available (and only somewhat applicable). There are four types of character equivalence that are of interest in Ethiopic regular expressions:
The syllable x is:

The choice of # has been made at this time for no other reason than that the symbol itself looks like the grid in which the syllables are invariably presented. A character between #s is interpreted from its context as either a letter or a numeral. This may prove to be a good mnemonic device.
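How a token might be interpreted from its context can be sketched as follows. This is an assumption-laden illustration rather than the package's parser: `expand_token` and `expand_pattern` are hypothetical names, only the numerals 1 to 7 and the 7 basic forms are handled, and the code again assumes the 8-code-point row layout of the Unicode Ethiopic block.

```perl
use strict;
use warnings;
use utf8;

# Row starts that carry labiovelar forms, not new bases (assumed layout).
my %skip  = map { $_ => 1 } (0x1248, 0x1258, 0x1288, 0x12B0, 0x12C0, 0x1310);
my @bases = grep { !$skip{$_} } map { 0x1200 + 8 * $_ } 0 .. 42;

# Hypothetical expander for the text between the #s of one [#x#] token.
sub expand_token {
    my ($inner) = @_;
    if ( $inner =~ /\A[1-7]\z/ ) {
        # Numeral: a form class, one vowel order across all 37 bases.
        return '[' . join( '', map { chr( $_ + $inner - 1 ) } @bases ) . ']';
    }
    my $cp = ord $inner;
    if ( length($inner) == 1 && $cp >= 0x1200 && $cp <= 0x1357 ) {
        # Letter: a family class, the 7 basic orders of one base.
        my $base = $cp & ~7;
        return sprintf '[\x{%04X}-\x{%04X}]', $base, $base + 6;
    }
    die "unrecognised class token: $inner";
}

# Replace every [#x#] token in a pattern string with an ordinary class.
sub expand_pattern {
    my ($pattern) = @_;
    $pattern =~ s/\[#(.+?)#\]/expand_token($1)/ge;
    return $pattern;
}

# Any syllable of the ከ family followed by any syllable of the 6th form.
my $re = expand_pattern('[#ከ#][#6#]');
print "matched\n" if "ከብ" =~ /\A$re\z/;
```

Because a numeral and a letter can never be confused, a single pair of delimiters is enough for both kinds of class.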
Notation is of course subject to change. Putting theory into practice (and code!) and experimenting are the only real way to shake out the wilburries. Once settled, GUS notation will be updated accordingly and I'll get back to IPA-based pattern matching with GUS (aka "folding script"). Honest, for real this time... ;-)