Internet Engineering Task Force                              P. Cordell
Internet Draft                                       Tech-Know-Ware Ltd
draft-cordell-lumas-05.txt                             February 1, 2007
Expires: August 1, 2007

Lumas - Language for Universal Message Abstraction and Specification

STATUS OF THIS MEMO

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on August 1, 2007.

Copyright Notice

Copyright (C) The IETF Trust (2007).

Abstract

A number of methods and tools are available for defining the format of messages used for application protocols. However, many of these methods and tools have been designed for purposes other than message definition, and have been adopted on the basis that they are available rather than being ideally suited to the task. This often means that the methods make it difficult to get definitions correct, or result in unnecessary complexity and verbosity both in the definition and on the wire.

Lumas - Language for Universal Message Abstraction and Specification - has been custom designed for the purpose of message definition. It is thus easy to specify messages in a compact, extensible format that is readily machine manipulated to produce a compact encoding on the wire.

Table of Contents

1. Introduction
2. About Lumas
3. Lumas and Other Message Definition Languages
4. Terminology
5. Example Lumas Message Definition and Message Encoding
   5.1 Principles of the Message Definition
   5.2 An Example Message Definition
6. Formal Message Definition Syntax
   6.1 Lumas Keywords
   6.2 Lumas Parameters
   6.3 Simple Parameters
   6.4 The Simple Types
   6.5 Simple Type Definition
   6.6 The Pattern Constraint
   6.7 The Name
   6.8 Cardinality
   6.9 Tagging
   6.10 The Plugin Extension Mechanism
   6.11 Reference Parameters
   6.12 Compound Parameters
   6.13 Struct Parameters
   6.14 Union Parameters
   6.15 Combined Parameters
   6.16 Referenced Parameters
   6.17 External Extensions - Plug and Pluggable
   6.18 Module Definition and Directives
   6.19 The Top Level Definition
   6.20 Locating Lumas within a Specification
7. On-the-Wire Representation
   7.1 Principles of the default On-the-Wire Encoding
   7.2 Formal On-the-Wire Representation
   7.3 Marking Message Boundaries
   7.4 Examples of Encoded Types
8. Common ABNF Definitions
9. Notes on Comments
10. Locating Lumas Modules
11. Mandatory to Understand
12. Security Considerations
13. Normative References
14. Informative References
15. Author's Address

1. Introduction

Lumas is a lightweight message definition language that is both flexible and highly extensible. This document defines the Lumas message definition language, and the default text encoding method for messages defined in this way.

2. About Lumas

Lumas - Language for Universal Message Abstraction and Specification - is a simple message definition language that can be used to define the messages used by protocols. In this context, a message is defined as a collection of data used to convey information between two or more machines (or processes).
Typically Lumas is used to define application layer messages (e.g. at the layer at which the likes of SMTP [SMTP] is defined), but there is no practical reason why Lumas should not be used at other layers.

The design objectives of Lumas are simplicity, ease of use, efficiency, and extensibility.

Lumas provides a high-level method for defining messages and a default set of encoding rules for character based protocols. The encoding rules describe how instances of messages that conform to the defined high-level definition are represented on the wire. It is also possible to define alternative encoding rules that could be used to define representations of messages in binary form, or other character based forms; e.g. XML [XML] or JSON [JSON].

In general Lumas is not able to describe messages with arbitrary sequences of characters and bytes, any more than a C compiler is able to specify arbitrary sequences of assembler instructions.

Lumas recognises that message definition is a small part of the overall development process and thus should not warrant a disproportionately large investment in learning the language. Lumas uses the 80/20 principle to keep it simple.

Lumas is designed to readily allow the use of Lumas aware software tools to aid in the development process.

Lumas messages are text-encoded by default so that they are easy to read, and it is easy to create test messages for debugging.

Using Lumas in applications is designed to be simple and efficient.

Lumas addresses a number of different types of extensibility, including versioning, external extensions, and component based architectures.

This makes Lumas an ideal definition language to use where simplicity, efficiency, compactness and/or a high degree of extensibility is required, especially where the extensibility involves plugging external modules into the base syntax.

3. Lumas and Other Message Definition Languages

Over the years a number of message definition methods have been developed. These include XDR [XDR], ASN.1 [ASN1], various flavours of IDL (such as OMG IDL [OMGIDL]), 'bit pictures,' various flavours of BNF (e.g. ABNF [ABNF]), and XML [XML]. It is therefore worthwhile considering how Lumas relates to these other message definition languages.

Lumas differs from XDR in that Lumas is primarily a language for defining text-encoded messages. XDR is fixed to defining binary messages of very specific types.

ASN.1 is also primarily a language for defining binary messages, although recently there have been XML encoding rules defined. ASN.1 information object classes are difficult to understand and a deterrent to its use. The complexity of some of the encoding rules, such as BER and PER, makes the method difficult to use without special tools. ASN.1 has found uses in the IETF, notably in the areas of cryptography (CMS [CMS] etc) and SNMP [SNMP]. However, it is not much loved, and efforts such as SMING have been undertaken to replace its usage (although at the time of writing this effort seems to have stalled).

The IDL languages such as OMG IDL have similarities with message definition languages, but are subtly different. IDLs define a collection of objects, each of which describes a remote procedure call. They also define a return value for the procedure call. A protocol message set is typically a single object that can have a number of variants. A protocol will typically send another message in response to a message rather than sending a return value.

Perhaps for the reasons mentioned, the above methods have not received wide usage within the IETF. The main workhorses for message definition in the IETF have been 'bit pictures,' various types of BNF and more recently XML.
The term 'bit pictures' is used to refer to the pictures of bits and bytes that are used to capture the layout of parameters within a message, such as those used to define IP [IP], UDP [UDP] and TCP [TCP]. This is very low-level and really only suitable for protocols containing a few parameters which ideally have fixed positions. At a level higher than pure 'bit pictures' is the scheme used in TLS [TLS], but this again is specific to defining binary messages. Diameter [DIAMETER] presents another variation on this approach.

A number of types of BNF have been defined over the years, most recently ABNF. Until recently, the BNFs have been the main workhorse of IETF application level protocol definition. ABNF is very low-level, and is much like programming in assembler when high-level languages would be more useful. It is very difficult to get definitions correct, and issues such as ensuring extensibility have to be addressed not only for each message definition, but also for each parameter within the definition. The implementation route from ABNF can also be long as there is typically not enough high level information in the specification for tools to extract the important elements.

This leaves XML. XML is a comprehensive and powerful way of defining messages. It would be a long and unproductive exercise to list all the things that XML gets right. Instead, the focus here is on the areas that a developer may wish to consider when choosing between Lumas and XML.

The main differences between Lumas and XML are in the areas of simplicity and efficiency. Whether these differences are significant will depend on the application.

There are two parts of the XML route: XML itself, and the method used to define the XML messages.

Some of the less significant issues to consider are to do with XML itself. For example, it has long been recognised that the format of XML messages, with its start and end tags, is inefficient. (It is the author's belief that the extra tagging also makes the messages harder to read, because the message is dominated by tags rather than the important part, which is the values. Hence, what works well when there is a high ratio of PCDATA to tags, is detrimental when that ratio is significantly reduced.) The separation of parameters into attributes and elements adds complexity, but adds no real value in a protocol, and is an artefact of markup use. The provision for multiple character encodings (such as UTF-8, UTF-16BE, UTF-16LE, ISO-8859-1 etc) places demands on a parser, as does the implementation of namespaces (where in a start tag the namespace is defined after the first use of the namespace), which requires double parsing or significant intermediate storage. The task of converting a namespace prefix to a namespace is potentially an area involving significant lookup effort. Once expanded, the effective tag is a long sequence of characters on which comparison operations are performed, the size of which potentially reduces efficiency. User definable general entities and parameter entities are additional burdens that have little value for message definition, as is the white space handling, which is a hangover from XML as a markup language. While these are surmountable problems, the consideration for a developer has to be 'why pay for it if I don't need it?'

The second issue is how to define the XML messages. Arguably the current favourite is W3C XML Schema, although there are other methods including RELAX NG [RELAX] and Schematron [STRON]. First of all, it has to be admitted that this is currently a controversial area and the existence of the latter two is largely due to concerns about the former. The main concern with XML Schema is again complexity. Maybe in the future one of the other methods will prevail.
Keeping with XML Schema for now, firstly the language can be very difficult to learn. The specification is some 350 pages long (ignoring XML itself, and XML namespaces etc), and uses a formal language that is very confusing to interpret. In a number of areas there is even debate among the experts about what is intended. The constructs can be confusing and apparently contradictory in a number of areas, such as the notion of complexType with simpleContent and so on.

While XML Schema is touted as being extensible, in practice there are a number of traps for the unwary to fall into. For example, incorporated attribute and element groups, especially those from different schemas, can easily result in name clashes when they are extended independently. Enumerated strings cannot be extended without careful consideration. Indeed, the Unique Particle Attribution Constraint makes defining an extensible schema messy and not something that happens by accident [XMLVER].

There is no support for capturing what has changed from one version of a schema to the next, other than doing a diff operation on two files. This again makes it difficult for tools. Other features also make it difficult for tools, such as the ability to use patterns to restrict the format of basic types such as floating point numbers.

XML Schema has no concise way of specifying short tag names while at the same time specifying descriptive formal names. For example, the most common XML like syntax, HTML [HTML], has an abundance of short tags. This makes it easy for the expert to type, and it must be assumed that the approach has some merit otherwise it wouldn't have been done that way. But XML Schema does not readily support this.

Verbosity is even more of an issue when it comes to XML Schema, in a number of cases requiring five or more lines of text when only one would do. This means extra scrolling or page turning when editing and viewing, which makes a schema harder to write, harder to check, and harder for a third-party to understand.

Many of these problems are subjective. Some can also be avoided by defining style guides and best practices for using XML Schema (for example [XMLBCP]). Compression can be used to reduce the size of messages. However, this really just addresses the complexity by adding more complexity. Not only does this make it harder to learn, it is important to remember that where there is complexity, there is the potential for bugs. And bugs not only affect the integrity of the code, but can also affect the security of the system on which the code runs.

Complexity is also a barrier to implementation. It could be argued that the Internet has been successful because of its use of simple protocols. Using XML Schema would seem to be at odds with that principle.

By being designed to be simple, Lumas avoids these problems.

In summary, currently the main tools used for message definition in the IETF are ABNF and XML Schema. In many respects these represent two extremes, one simple and very low-level, and the other complex and high-level. Lumas is a data point between these two extremes, giving much of the flexibility of XML with the ease of understanding and compactness of ABNF. As such it is a useful extra tool that allows protocol developers to better tailor protocols to their needs.
On another level, although message definition languages have been around for many years now, the relative paucity of options available, and the fact that XML is being trumpeted as a breakthrough in inter-platform communication, suggests that in terms of evolution the field is in its infancy. It's easy to see why this might be. Message definition has not been seen as a core activity, and developers simply make do by borrowing what is already available in other fields, even if it is not an ideal fit for their requirements. This would suggest that there is scope for much development, and it may transpire that XML turns out to be the FORTRAN or COBOL of the message definition world, and there is much more exciting stuff to come. It is hoped that Lumas can play a part in that story.

4. Terminology

The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" in this document are to be interpreted as described in [KWORDS].

For the purposes of this document, a "tag" is a fixed sequence of characters used on the wire as an identifier for the value or values that it is associated with. Thus identified, the value can be interpreted and processed in the right way.

5. Example Lumas Message Definition and Message Encoding

As the Lumas message definition syntax is C-like it is felt that many will immediately understand the majority of a message definition. For this reason the basic principles of the definition language and a short example are presented before describing the format in detail.

5.1 Principles of the Message Definition

Following the C language format, the basic format of a parameter definition is:

    type name ;

'Type' specifies things like integers, booleans, ASCII strings, Unicode strings and so on. The 'name' is the name of the parameter. Thus a parameter definition might be:

    ascii rfc-name ;

This says that 'rfc-name' is an ASCII string. In addition, a parameter definition can express constraints on the type, constraints on the cardinality (how many instances of the type are valid in a message), and the tag to be used for the value on the wire. (A tag is a fixed sequence of characters that is used to identify the value or values that it is associated with.) For example, an integer may be limited to the values 0 to 255, and an ASCII string may be limited to a maximum size.

The fuller format of a parameter definition has the form:

    type <constraints> name [cardinality] tagging ;

For example:

    int <1..30000> referenced-rfcs [0..255] as refers ;

This defines an integer that can have values between 1 and 30000. The name of the parameter is 'referenced-rfcs', but is tagged on-the-wire using the character sequence 'refers'. The parameter can consist of between 0 and 255 instances of the integer in a valid message.

Two main types of compound parameter are possible, these being 'struct' and 'union'. Having much the same meaning as they have in C, a struct specifies a group of parameters, all of which may be used in a particular instance of the struct. A union similarly specifies a group of parameters, but in this case only one of the parameters can be used in any one instance of the union. An example of a struct is:

    struct rfc-info
    {
        ascii rfc-name;
        int <1..30000> referenced-rfcs[0..255] as refers ;
    };

A third form of compound type called 'combi' is also available. The name is short for 'combined' and the type allows a number of values to be concatenated together into what looks like a single value. Hence it can be used to define constructs like the character sequence 'HTTP/1.0', in which the '1' and the '0' are the major and minor version numbers.
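
The informal parameter form described above can be illustrated with a short Python sketch. This is not part of Lumas and does not implement the formal ABNF of Section 6; the regular expression, the 'Param' class and the 'parse_param' function are hypothetical names introduced only for illustration, and the pattern accepts just the simple one-line forms shown in this section. The tag handling follows the rules used in the example of Section 5.2: no 'as' clause means the name is used as the tag, and 'as ?' means the parameter is untagged.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Simplified pattern for "type <constraint> name [min..max] as tag ;".
# Illustrative assumption only: the normative grammar is the ABNF in Section 6.
PARAM_RE = re.compile(
    r"""
    ^\s*
    (?P<type>[A-Za-z][A-Za-z0-9-]*)                     # type keyword or type name
    \s*(?:<\s*(?P<constraint>[^>]*?)\s*>)?              # optional <constraint>
    \s+(?P<name>[A-Za-z][A-Za-z0-9-]*)                  # parameter name
    \s*(?:\[\s*(?P<min>\d+)\s*\.\.\s*(?P<max>\d+)\s*\])?  # optional [min..max]
    (?:\s+as\s+(?P<tag>\S+?))?                          # optional explicit tag
    \s*;\s*$
    """,
    re.VERBOSE,
)


@dataclass
class Param:
    type: str
    constraint: Optional[str]              # raw constraint text, if any
    name: str
    cardinality: Optional[Tuple[int, int]]  # (min, max), or None if not given
    tag: Optional[str]                     # None means untagged ("as ?")


def parse_param(text: str) -> Param:
    """Split one simple parameter definition into its parts."""
    m = PARAM_RE.match(text)
    if m is None:
        raise ValueError(f"not a simple parameter definition: {text!r}")
    card = None
    if m.group("min") is not None:
        card = (int(m.group("min")), int(m.group("max")))
    explicit = m.group("tag")
    if explicit is None:
        tag = m.group("name")   # no 'as' clause: the name serves as the tag
    elif explicit == "?":
        tag = None              # 'as ?': the value is untagged on the wire
    else:
        tag = explicit
    return Param(m.group("type"), m.group("constraint"),
                 m.group("name"), card, tag)


p = parse_param("int <1..30000> referenced-rfcs [0..255] as refers ;")
print(p.type, p.constraint, p.name, p.cardinality, p.tag)
```

Keeping the cardinality as None when it is omitted avoids the sketch asserting a default that this section does not state.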
5.2 An Example Message Definition and How it is Encoded

The following is an example message definition that is intended to represent a very crude meeting controller:

    lumas module com.tech-know-ware.my-example;

    /* An example Lumas definition */

    import com.tech-know-ware.general as tkwg;

    struct my-example
    {
        int <0..255> participant-id as ?;
        Action action as ?;
        struct my-addition[0..1] as new.tech-know-ware.com plugin
        {
            bool tkw-app-capable as ?;
        };
    };

    union Action
    {
        Join join;
        Message message as msg;
        void leave;
    };

    struct Join
    {
        unicode<0..63> name;
    };

    struct Message
    {
        int <0..255> to-participants[1..127] as to;
        unicode<1..255> message as msg;
        [   // Version 2 additions
            tkwg::Priority priority;
        ]
        [   // Version 5 additions
            ascii<0..16> font-name[0..1] as font;
            void bold[0..1];
            void italic[0..1];
            void underlined[0..1] as ul;
        ]
    };

The first construct (in this case the struct my-example) is the root of all messages for the protocol. Each message identifies a participant using an integer in the range 0 to 255, called 'participant-id'. When encoded on the wire, this parameter will be untagged due to the 'as ?' specification.

On-the-wire, the default encoding generally encodes parameters in the form:

    tag = value

where 'value' is a textual representation of the parameter's value. However, if a parameter is marked as untagged, then it is represented simply as:

    value

Hence, if in a message an instance of participant-id is to have a value of, say, 12, then, due to being marked as untagged, it is encoded simply as:

    12

Rather than the following, which would be the case if it was not marked as untagged:

    participant-id = 12

In this example, each message then has an action, which is also untagged. The type of the action parameter is not immediately specified, and instead references the 'Action' definition.
The Action definition is a union in which only one of the specified parameters may appear in an instance of the Action construct. This effectively represents a fork in the semantics of any given message. In this case the options within Action can indicate that somebody has joined the meeting, left the meeting, or is sending a message to other participants. There is no explicit tag for the 'join' and 'leave' options, so these will be tagged on-the-wire by the parameters' names, 'join' and 'leave' respectively. Conversely, an explicit tag for the 'message' parameter is specified, and hence the message option will be tagged by 'msg' on-the-wire.

The join parameter also has a referenced definition; the struct named Join. For the purposes of this example, when a person joins a meeting, all the other participants are informed of their name. The name member in the struct is a UTF-8 encoded Unicode string that has a minimum length of 0 characters and a maximum length of 63 characters. Hence an example of the join parameter encoded on the wire is:

    join = { name = "Alice" }

Here, the braces delimit the extent of the members in the struct and the double quotes delimit the characters representing the name.

The message option is also a referenced definition. Conceptually, to send a message, the 'participant-id' is used to identify the sender, and the 'to-participants' field contains the participant ids of all the people to whom the message is being sent. On-the-wire, the to-participants parameter will be tagged with 'to'. Between 1 and 127 (inclusive) instances of the to-participants parameter may appear in a message. For efficiency, Lumas allows multiple occurrences of the same parameter to be represented as a comma separated list. Hence an example of the on-the-wire encoding of the to-participants parameter would be:

    to = 2, 5, 8, 58

Also, the message itself is included.
The message will consist of Unicode characters and can be between 1 and 255 Unicode characters long. On-the-wire, the message parameter will have the tag 'msg'. An example of the on-the-wire format is thus:

    msg = "Where are we going for dinner"

The priority field within the message struct has been added in a later version of the protocol. This is indicated by the square brackets in which the parameter is wrapped. Similarly, font-name and the associated parameters have, according to the comment, been added in version 5 of the protocol.

The type of the 'priority' parameter is defined in an external module that has the alias 'tkwg'. The 'import' directive at the beginning of the example indicates that the 'tkwg' alias corresponds to the module 'com.tech-know-ware.general', and it is in this module that the definition of 'Priority' is located.

The definition indicates that 'font-name' is an ASCII string. The reader should already understand enough of the definition language to understand the meaning of the other fields.

Returning to the 'my-example' root, a third-party has added an extension to the protocol in the form of the 'my-addition' parameter. It is identified as not being part of the base specification by the keyword 'plugin'. On-the-wire, the additional parameter will be identified by the tag 'new.tech-know-ware.com' to differentiate it from additions that may be made by other third parties.

In summary, the following are complete examples of the default on-the-wire representation of the example message definition:

    12
    join = {
        name = "Alice"
    }
    new.tech-know-ware.com = { True }

and:

    12
    msg = {
        to = 2, 5, 8, 58
        msg = "Where are we going for dinner"
        font = 'Arial'
    }

and:

    12
    leave

Note that the placing of each parameter on a separate line is not significant. Lumas is free form with respect to white space. Hence, the first message above could equally be represented as:

    12 join={name="Alice"} new.tech-know-ware.com={True}

6. Formal Message Definition Syntax

The sections below describe the Lumas message definition syntax. The 'top-level' production is 'lumas-definition', which is defined in 6.19, "The Top Level Definition". The following sections define the components of the message definition language building up to the top-level production.

The Lumas syntax is defined using ABNF [ABNF].

6.1 Lumas Keywords

Lumas keywords are case-sensitive. Therefore "AS" can not be used in place of "as". As ABNF literal strings are case-insensitive, this section defines the Lumas keywords in a case-sensitive way.

    as-kw = %x61.73 ; as in lowercase
    ascii-kw = %x61.73.63.69.69 ; ascii in lowercase
    b = %x62
    bool-kw = %x62.6F.6F.6C ; bool in lowercase
    bytes-kw = %x62.79.74.65.73 ; bytes in lowercase
    combi-kw = %x63.6F.6D.62.69 ; combi in lowercase
    const-kw = %x63.6F.6E.73.74 ; const in lowercase
    d-upper = %x44 ; Uppercase D
    d = %x64
    date-kw = %x64.61.74.65 ; date in lowercase
    double-kw = %x64.6F.75.62.6C.65 ; double in lowercase
    embedded-kw = %x65.6D.62.65.64.64.65.64 ; embedded in lowercase
    endmodule-kw = %x65.6E.64.6D.6F.64.75.6C.65 ; endmodule in lowercase
    extends-kw = %x65.78.74.65.6E.64.73 ; extends in lowercase
    f = %x66
    float-kw = %x66.6C.6F.61.74 ; float in lowercase
    import-kw = %x69.6D.70.6F.72.74 ; import in lowercase
    int-kw = %x69.6E.74 ; int in lowercase
    into-kw = %x69.6E.74.6F ; into in lowercase
    ipv4-kw = %x69.70.76.34 ; ipv4 in lowercase
    ipv6-kw = %x69.70.76.36 ; ipv6 in lowercase
    lumas-kw = %x6C.75.6D.61.73 ; lumas in lowercase
    module-kw = %x6D.6F.64.75.6C.65 ; module in lowercase
    n = %x6E
    oid-kw = %x6F.69.64 ; oid in lowercase
    plug-kw = %x70.6C.75.67 ; plug in lowercase
    pluggable-kw = %x70.6C.75.67.67.61.62.6C.65 ; pluggable in lowercase
    plugin-kw = %x70.6C.75.67.69.6E ; plugin in lowercase
    r = %x72
    s-upper = %x53 ; Uppercase S
    s = %x73
    single-kw = %x73.69.6E.67.6C.65 ; single in lowercase
    struct-kw = %x73.74.72.75.63.74 ; struct in lowercase
    t = %x74
    time-kw = %x74.69.6D.65 ; time in lowercase
    unicode-kw = %x75.6E.69.63.6F.64.65 ; unicode in lowercase
    union-kw = %x75.6E.69.6F.6E ; union in lowercase
    unquoted-ascii-kw = %x75.6E.71.75.6F.74.65.64.2D.61.73.63.69.69
                                    ; unquoted-ascii in lowercase
    void-kw = %x76.6F.69.64 ; void in lowercase
    w = %x77
    w-upper = %x57 ; Uppercase W
    x = %x78
    z = %x7A

6.2 Lumas Parameters

The main building block of a Lumas message definition is the parameter. There are three classes of parameter in Lumas: simple parameters, compound parameters and reference parameters, which are defined as:

    lumas-parameter = simple-param / compound-param / reference-param

A simple parameter typically describes a simple value such as a string, integer or date. It may represent, for example, a name, a temperature or a birthday. Compound parameters are collections of simple parameters and other compound parameters, similar to how Java and C++ classes group together simple variables and other classes. Reference parameters allow a parameter to be defined in terms of a type (either simple, compound or reference) that is defined elsewhere in the message definition.

6.3 Simple Parameters

The ABNF definition of a simple parameter is:

    simple-param = simple-type WS name
                   [ OWS cardinality ]
                   [ WS as-kw WS explicit-tag ]
                   [ WS plugin-kw ]
                   OWS ";" OWS

where 'WS' represents white space, and 'OWS' represents optional white space. ('WS' and 'OWS' are defined in Section 8 - 'Common ABNF Definitions'. Generally, comments can be included wherever white space is allowed.)

As can be seen, the main parts of the definition of a simple parameter are the simple type and the name. Additional specification allows further control of the message contents. These fields are discussed below.
6.4 The Simple Types

Simple parameters have simple types such as integers, booleans etc. Each of Lumas' simple types is listed and described below. How these simple types are specified in a message definition is described in the following section.

The Lumas simple types are:

void
    A parameter that has no value. This is most useful in unions (where it converts a union into an enumerated type), and can also be used in a struct to represent boolean events wherein the absence of the parameter indicates false, and the presence of the parameter indicates true. It is more useful than you might at first think!

bool
    A Boolean value. Can be true or false.

int
    An integer value.

float
    A floating point value. The constraints of a float specify the float to be either in accordance with a single precision value or a double precision value as specified in IEEE 754 [IEEE754]. The absence of a constraint indicates a single precision value.

ipv4
    Represents an IPv4 address, but not the port.

ipv6
    Represents an IPv6 address, but not the port.

date
    Date according to the Gregorian calendar, with year, month and day of month. Other calendar types may be constructed from primitive types if required.

time
    Represents the time in hours, minutes and seconds using the 24 hour clock notation. By default the time MUST be adjusted to UTC, unless the time can be guaranteed to have only local significance.

oid
    This is an ASN.1 style Object Identifier. This is primarily included to enable identification of security protocols.

ascii
    A string made up of ASCII characters, limited to the values 0 to 127.

unquoted-ascii
    An ascii string usually has quote marks around it. This type does not have quotes around it. Consequently it can not have any white space, or include any special characters (such as "=", ")", and "}") that would confuse the parser.

unicode
    A string representing Unicode characters.
const This type allows a constant value to be inserted into the encoded message. It will typically be untagged. One thing it might be used for is identifying the protocol of the message definition. For example: const protocol as ?; bytes An array of bytes. Also useful for carriage of opaque data. embedded The value is an embedded Lumas message. This allows layering Cordell Expires August 1, 2007 [Page 15] Internet Draft Lumas February 2007 of message definitions. 6.5 Simple Type Definition Lumas simple types are specified in a Lumas message as described in this section. The 'simple-type' construct represents the type of the parameter. It has the following form: simple-type = void-kw / bool-kw / integer-type / float-type / ipv4-kw / ipv6-kw / date-kw / time-kw / oid-kw / string-type / const-type / bytes-type / embedded-type As can be seen, many of the types are specified using a single keyword. Other types such as integers and strings allow the specification of additional constraints (such as the maximum value that an integer is allowed to have). The definition of these types are as follows: integer-type = int-kw OWS "<" OWS int-constraint OWS ">" float-type = float-kw OWS [ "<" OWS float-constraint OWS ">" ] string-type = ( ascii-kw / unquoted-ascii-kw / unicode-kw ) [ OWS "<" OWS string-constraint OWS ">" ] const-type = const-kw OWS "<" first-safe-char *( safe-char ) ">" ; See the section 'Notes on Comments' below bytes-type = bytes-kw [ OWS "<" OWS length-constraint OWS ">" ] embedded-type = embedded-kw [OWS "<" OWS embed-constraint OWS ">"] The constraints for the numerical types are specified as follows: int-constraint = min-int-constraint OWS ".." 
                    OWS max-int-constraint [ OWS use-leading-zero-marker ]
   min-int-constraint = ["-"] pos-number
   max-int-constraint = ["-"] pos-number
   use-leading-zero-marker = z ; lower case z
   float-constraint = single-kw / double-kw

The constraints for the string, const, bytes and embedded types are as follows:

   string-constraint = [ length-constraint ] [ OWS pattern-constraint ]
   embed-constraint = [ length-constraint ]
                      [ OWS embedded-module-constraint ]
   embedded-module-constraint = "(" OWS module-name OWS ")"
   length-constraint = [ min-len-constraint OWS ".." OWS ]
                       max-len-constraint
   min-len-constraint = pos-number
   max-len-constraint = pos-number / unlimited-length-token
   unlimited-length-token = "*"

These constraints use the following definition:

   pos-number = 1*DIGIT          ; Decimal number
              / "0"x 1*HEXDIG    ; Hex number
              / 1*DIGIT b        ; Specifies number of binary bits

In the case of 'integer-type', the mandatory constraint specifies the minimum and maximum permissible values that the integer can take. If the 'use-leading-zero-marker' character ('z') is included in the constraint, then where necessary the integer MUST be represented on the wire with leading zeros to make the value fixed width. (This is primarily applicable to combined types.)

The 'pos-number' construct used to specify the integer value constraint has a form that can specify the number of binary bits. The number of bits specified does not include any sign bits. Hence an unsigned 32 bit number can be represented as 0..32b, whereas a signed 32 bit number can be represented as -31b..31b (although this will actually exclude the most negative value of a signed 32 bit number).

A float is either a single precision IEEE 754 number or a double precision IEEE 754 number [IEEE754]. The absence of a constraint indicates single precision.
(Developers are advised that in a number of cases a binary IEEE 754 number cannot be exactly represented in a text-based base 10 format. Hence the decoder's binary representation of a floating-point number may differ from the encoder's binary representation of the number. If such discrepancies are not acceptable, developers should use an alternative representation for floating-point numbers.)

In the case of 'string-type', the optional constraint specifies the minimum and maximum number of characters that are allowed to be represented in a valid encoding and optionally a valid pattern of characters. The minimum and maximum character constraint specifies the minimum and maximum number of characters at the application level, not the actual number of characters that are used to represent the application level characters on the wire. The format of the pattern constraint is designed to simplify regular expression evaluation by preventing the need for the trial and error type processing of general regular expressions. Thus, in accordance with Lumas' 80/20 principle, valid patterns MUST NOT require the regular expression evaluator to do backtracking. The pattern constraint is described further in Section 6.6.

In the case of 'bytes-type', the optional constraint specifies the minimum and maximum number of bytes that are allowed to be represented in a valid encoding. The constraint specifies the minimum and maximum number of bytes at the application level, not the number of characters that are used to encode those bytes on the wire.

The optional constraint in 'embedded-type' MAY specify the permitted length of the embedded message and/or the Lumas module name of the message that is to be embedded. For example:

   embedded<(com.tech-know-ware.scp)> embedded-scp;

In the constraint syntax, a maximum value '*' means infinite or unbounded.
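The integer constraint rules of Section 6.5 can be illustrated with a minimal Python sketch. It is not part of Lumas: the function names (pos_number, int_bounds, encode_int) are hypothetical, and it assumes that "Nb" denotes the largest value that fits in N binary bits and that the fixed width implied by 'z' is the number of digits in the maximum value.

```python
import re


def pos_number(text):
    """Return the value of a 'pos-number': decimal, 0x hex, or Nb (N binary bits)."""
    if re.fullmatch(r"0x[0-9A-Fa-f]+", text):
        return int(text[2:], 16)
    if re.fullmatch(r"[0-9]+b", text):
        # Assumption: Nb is the largest value representable in N bits.
        return (1 << int(text[:-1])) - 1
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    raise ValueError(f"not a pos-number: {text!r}")


def int_bounds(constraint):
    """Parse an int-constraint such as '0..99z' into (min, max, leading_zeros)."""
    body = constraint.strip()
    leading_zeros = body.endswith("z")
    if leading_zeros:
        body = body[:-1].rstrip()
    low_text, high_text = (part.strip() for part in body.split(".."))

    def signed(text):
        return -pos_number(text[1:]) if text.startswith("-") else pos_number(text)

    return signed(low_text), signed(high_text), leading_zeros


def encode_int(value, constraint):
    """Check the range and apply leading zeros when the 'z' marker is present."""
    low, high, fixed_width = int_bounds(constraint)
    if not low <= value <= high:
        raise ValueError(f"{value} is outside {low}..{high}")
    text = str(value)
    # Assumption: the fixed width equals the digit count of the maximum value.
    return text.zfill(len(str(high))) if fixed_width else text
```

For instance, int_bounds("-31b..31b") yields the symmetric range noted above, which omits the most negative value of a signed 32 bit number.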
6.6 The Pattern Constraint

The pattern-constraint has the following form:

   pattern-constraint = "/" sub-pattern *( "|" sub-pattern ) "/"
   sub-pattern = *pattern-element
   pattern-element = pattern-char [ quantifier ]
   pattern-char = %x20-29 / %x2C-2E / %x30-3E / %x40-5A /
                  %x5D-7A / %x7D-FF ;not \/|[?*+{
                / escaped-char / special-char / character-class
   escaped-char = "\\" ; Matches \
                / "\/" ; Matches /
                / "\|" ; Matches |
                / "\[" ; Matches [
                / "\?" ; Matches ?
                / "\*" ; Matches *
                / "\+" ; Matches +
                / "\{" ; Matches {
                / "\." ; Matches .
   special-char = "\" r ; Matches the return character
                / "\" n ; Matches the new line character
                / "\" t ; Matches the tab character
                / "\" f ; Matches the form feed character
                / "\" s ; Matches white space [ \t\r\n\f]
                / "\" d ; Matches any digit [0-9]
                / "\" w ; Matches any word character [a-zA-Z_0-9]
                / "\" s-upper ; \S Matches anything not matched by \s
                / "\" d-upper ; \D Matches anything not matched by \d
                / "\" w-upper ; \W Matches anything not matched by \w
                / "." ; Matches any character
   character-class = matching-character-class / inverse-character-class
   matching-character-class = "[" *(class-char / class-range) "]"
                ; For a successful match, the character in the string
                ; being matched must be one of the characters
                ; specified in the matching-character-class.
   inverse-character-class = "[^" *(class-char / class-range) "]"
                ; For a successful match, the character in the string
                ; being matched must NOT be one of the characters
                ; specified in the inverse-character-class.
   class-char = class-single-char / class-escaped-char /
                escaped-char / special-char
   class-single-char = %x20-2C / %x2E-5B / %x5E-FF ; not - ] \
   class-escaped-char = "\-" ; Matches -
                      / "\]" ; Matches ]
                ; /|[?*+{.
                ; need not be escaped within character-class
   class-range = first-range-char "-" last-range-char
                ; The class-range matches all characters that have
                ; an ASCII value greater than or equal to that of
                ; first-range-char and less than or equal to
                ; last-range-char.
   first-range-char = class-single-char / class-escaped-char /
                      escaped-char
   last-range-char = class-single-char / class-escaped-char /
                     escaped-char
   quantifier = "?" / "*" / "+"
              / "{" quant-min-occurs [ "," [ quant-max-occurs ] ] "}"
                ; The absence of a quantifier indicates once and only
                ; once
   quant-min-occurs = 1*DIGIT
   quant-max-occurs = 1*DIGIT

The 'pattern-constraint' allows a number of 'sub-pattern's to be defined, any one of which may match the string value. In each 'sub-pattern' there are no grouping or alternation constructs. This removes the need for backtracking and is suitable for 80% (or more) of applications. The pattern matching uses a "greedy" match.

Each 'sub-pattern' can be viewed as a concatenation of 'pattern-element's. Each 'pattern-element' is a 'pattern-char' and an optional 'quantifier'. The 'pattern-char' may actually match multiple characters. The 'quantifier' indicates how many times the associated 'pattern-char' may appear in a valid pattern. If the 'quantifier' is '?', the 'pattern-char' may appear 0 or 1 times. If the 'quantifier' is '*', the 'pattern-char' may appear 0 or more times. If the 'quantifier' is '+', the 'pattern-char' may appear 1 or more times. If the quantifier is of the form '{n,m}', the 'pattern-char' may appear a minimum of n times, and a maximum of m times. If the quantifier is of the form '{n}', the 'pattern-char' must appear exactly n times. If the quantifier is of the form '{n,}', the 'pattern-char' may appear n or more times.
To ensure that a string is in a suitable form to represent the value, the application, subject to the quantifier of a pattern-element, MUST, starting with the first character, keep matching successive characters of the string with the first pattern-element until the match fails. The application MUST then try to match the unmatched character of the string along with subsequent characters in the string with the next pattern-element, again taking into account the quantifier for that pattern-element. If a pattern-element has a quantifier that allows zero matches, then if the unmatched character of the previous pattern-element does not match the current pattern-element, the application should attempt to match the unmatched character against the next pattern-element, and so on. The process is repeated until the whole string is matched, or the application is unable to match the current string character with an appropriate pattern-element. If the application is unable to match the current input character with an appropriate pattern-element, the whole sub-pattern match is deemed to have failed. The application MUST NOT backtrack to a previous pattern-element in order to attempt to find a match.

This process is repeated for each of the sub-patterns until one of the sub-patterns matches the string, or all sub-patterns fail to match the string. The message MUST NOT be encoded if none of the patterns matches the string.

Example patterns include /\d{4} \d{4} \d{4} \d{4}/ for a (UK) credit card number, or /\d{4}-\d{2}-\d{2}T\d+:\d+:\d+Z/ for a date & time matching the form 2003-03-03T12:45:32Z. The pattern / ?\d+| ?\d+\.\d+| ?\d+\.\d+[eE][+\ ]?\d+/ matches a floating point number that can be represented as either an integer, a decimal without exponent, or full 'scientific' format. This pattern illustrates some of the impact of not allowing pattern groupings.

For more information on regular expressions, see [PERL].
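The greedy, non-backtracking matching procedure of Section 6.6 can be sketched in Python as follows. This is an illustrative sketch only, not a reference implementation: the names compile_pattern and matches are hypothetical, the pattern is assumed to be well formed, and error handling is omitted.

```python
import re
import string

SETS = {"d": string.digits, "s": " \t\r\n\f",
        "w": string.ascii_letters + string.digits + "_"}
CONTROLS = {"r": "\r", "n": "\n", "t": "\t", "f": "\f"}
QUANTIFIERS = {"?": (0, 1), "*": (0, None), "+": (1, None)}


def _escape(ch):
    """Predicate for the character that follows a backslash."""
    if ch in SETS:                                   # \d \s \w
        return lambda c, s=SETS[ch]: c in s
    if ch.isupper() and ch.lower() in SETS:          # \D \S \W
        return lambda c, s=SETS[ch.lower()]: c not in s
    return lambda c, x=CONTROLS.get(ch, ch): c == x  # \r \n \t \f or a literal


def _parse_class(pat, i):
    """Parse a character class starting just after '['; return (predicate, next index)."""
    negate = pat[i] == "^"
    i += negate
    tests = []
    while pat[i] != "]":
        if pat[i] == "\\":
            first, test, i = pat[i + 1], _escape(pat[i + 1]), i + 2
        else:
            first, test, i = pat[i], (lambda c, x=pat[i]: c == x), i + 1
        if pat[i] == "-" and pat[i + 1] != "]":      # class-range
            j = i + 2 if pat[i + 1] == "\\" else i + 1
            test, i = (lambda c, lo=first, hi=pat[j]: lo <= c <= hi), j + 1
        tests.append(test)
    return (lambda c: any(t(c) for t in tests) != negate), i + 1


def compile_pattern(constraint):
    """Turn '/a|b/' into a list of sub-patterns, each a list of (test, min, max)."""
    pat, i, subs, elems = constraint[1:-1], 0, [], []
    while i < len(pat):
        ch = pat[i]
        if ch == "|":                                # start the next sub-pattern
            subs.append(elems)
            elems, i = [], i + 1
            continue
        if ch == "\\":
            test, i = _escape(pat[i + 1]), i + 2
        elif ch == "[":
            test, i = _parse_class(pat, i + 1)
        elif ch == ".":
            test, i = (lambda c: True), i + 1
        else:
            test, i = (lambda c, x=ch: c == x), i + 1
        low, high = 1, 1                             # no quantifier: exactly once
        braces = re.match(r"\{(\d+)(,(\d*))?\}", pat[i:])
        if braces:
            low = int(braces.group(1))
            high = low if braces.group(2) is None else (
                int(braces.group(3)) if braces.group(3) else None)
            i += braces.end()
        elif pat[i:i + 1] in QUANTIFIERS:
            (low, high), i = QUANTIFIERS[pat[i]], i + 1
        elems.append((test, low, high))
    subs.append(elems)
    return subs


def matches(constraint, text):
    """Greedy match of each sub-pattern in turn, never backtracking."""
    for elems in compile_pattern(constraint):
        pos = 0
        for test, low, high in elems:
            count = 0
            while (pos < len(text) and (high is None or count < high)
                   and test(text[pos])):
                pos, count = pos + 1, count + 1
            if count < low:
                break                                # this sub-pattern fails
        else:
            if pos == len(text):
                return True
    return False
```

Note that under these rules matches("/a*ab/", "aab") is false, because the first element greedily consumes both 'a' characters and the matcher does not backtrack; a general regular expression engine would accept the same input.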
6.7 The Name

Referring back to the simple-param definition, 'name' is the name of the parameter. It has the format:

   name = ALPHA *( ALPHA / DIGIT / "-" / "_" )

If there is no explicitly defined tag, then, in the case of character based protocols, the name is also used as the parameter's tag on-the-wire. In this case, the name MUST NOT exceed 63 characters in length. See Section 6.9 for more on tagging.

6.8 Cardinality

The cardinality of a parameter specifies how many times a particular parameter can appear in a message. The format mirrors a C-like array specification, but uses UML style ranges rather than the single values used in C. If the cardinality field is absent, then one and only one instance of the parameter must occur in a valid message.

The format of the cardinality specification is:

   cardinality = "[" ( cardinality-range / "?" / "*" / "+" ) "]"
                 ; [?] short hand for [0..1]
                 ; [*] short hand for [0..*]
                 ; [+] short hand for [1..*]
   cardinality-range = [ min-occurrences ".." ] max-occurrences
   min-occurrences = 1*DIGIT
   max-occurrences = 1*DIGIT / unbounded-token
   unbounded-token = "*"

Once again, the '*' in max-occurrences represents infinite or unbounded. If in the 'cardinality-range' only 'max-occurrences' is present and it has a numerical value, the containing struct MUST have exactly 'max-occurrences' instances of the parameter.

Example cardinalities are as follows:

   [0..1]  ; Zero or one time
   [?]     ; Short hand for zero or one time
   [0..*]  ; Zero or more times
   [*]     ; Same as above, zero or more times
   [1..*]  ; One or more times
   [+]     ; Same as above, one or more times
   [2..*]  ; Two or more times
   [5]     ; Exactly five times

6.9 Tagging

A parameter can have a tag associated with it. A tag is a fixed sequence of characters used on the wire to enable a parser to identify the value or values that it is associated with. By default, the name of the parameter is used as the tag.
If the name of the parameter is used as the tag, the name MUST NOT exceed 63 characters in length.

Alternatively an explicit tag can be specified. It can be any sequence of characters that do not have special significance to the parser. To facilitate buffer management, an explicit tag MUST NOT exceed 63 characters in length. If the tag definition begins with a "?", the "?" is discarded. Thus to specify that "?" should be used as the tag on-the-wire, 'explicit-tag' should be specified as "??".

   explicit-tag = [ "?" ] tag ; tag defined in common definitions

In certain constructs a parameter may also be untagged. This is discussed in the relevant sections below.

6.10 The Plugin Extension Mechanism

Marking a parameter as 'plugin' indicates to the developer and the tools that this parameter is (probably) not part of the original message definition. For example, it might be a proprietary extension. It also indicates that the parameter may not be present in all received messages.

A parameter that is marked as 'plugin' MUST have an explicit-tag defined for it. The explicit-tag MUST be constructed from a domain name [DOMAINS] owned by the entity defining the parameter, plus a sequence of characters that differentiates the explicit-tag from other explicit-tags defined by the defining entity. The component parts of the explicit-tag are presented in the normal domain name order so that the most variable part of the string is at the beginning, thus improving parsing efficiency. An example explicit-tag for tech-know-ware.com might be:

   my-tag.tech-know-ware.com

6.11 Reference Parameters

In a struct or union, it is also possible to reference types that are defined elsewhere.
The format of a 'reference-param' is:

   reference-param = reference-name WS name [ OWS cardinality ]
                     [ WS as-kw WS explicit-tag ] [ WS plugin-kw ]
                     OWS ";" OWS
   reference-name = [ module-name "::" ] name

Other forms of reference-parameter are defined in the sections below.

6.12 Compound Parameters

The compound types are struct, union and combi. For a struct, depending on the various parameters' cardinality specifications, any, all or none of the parameters that a struct groups together may appear in a valid encoding. In the case of a union, only one of the parameters may be encoded in a valid instance. The combi form is effectively a compact encoding of a struct, but is subject to a number of additional constraints, which are described below.

The definition format of each of the compound parameters is similar to the simple parameters. The 'compound-param' has the form:

   compound-param = struct-param / union-param / combined-param

6.13 Struct Parameters

The definition of a 'struct-param' is:

   struct-param = struct-kw WS name [ OWS cardinality ]
                  [ WS as-kw WS explicit-tag ]
                  [ WS pluggable-kw ] [ WS plugin-kw ]
                  WS "{" struct-body "}" OWS ";" OWS

'Cardinality' and 'explicit-tag' have the same meaning as for the simple types. The 'pluggable' keyword is defined in Section 6.17.

The format of the 'struct-body' is:

   struct-body = *( untagged-lumas-parameter )
                 *( lumas-parameter )
                 *( struct-extension )

The struct body starts with all the untagged parameters. Untagged parameters may have a cardinality other than one. Note that if the cardinality of an untagged parameter allows it to be absent and the parameter is absent from an encoded message, then all subsequent parameters, including tagged parameters, MUST also be absent. Thus great care is recommended when defining a message syntax that allows for an untagged parameter to be absent.

The tagged parameters follow the untagged parameters.
When the message definition is subsequently extended, an instance of the 'struct-extension' construct MUST be added to the end of the struct definition for each version in which the struct is extended. The 'struct-extension' construct wraps the added parameters within square brackets to indicate that they are added in a new version. This not only allows a developer to see what has been added in a new version, but also allows a parser to do the same. This is important because a parser must always consider absence of the new parameters to be a valid encoding so that it can receive messages from entities that are using an earlier version of the protocol. (To do this manually would dictate that all extension parameters would have to have a cardinality specification that included zero. This would be tedious, potentially error prone, and loses some expressiveness.)

During the extension process, all new parameters MUST be added onto the end of an existing construct, and the order of parameters MUST NOT be rearranged from one version to the next. Note that 'struct-extension' does not allow the specification of untagged parameters.

All of the constructs in a struct body have a similar format to the types already defined, except that in some cases they may be untagged. To make the ABNF definition accurate it is therefore necessary to repeat the above basic definitions with the appropriate tagging specifications.

The definition of the untagged struct parameters is:

   untagged-lumas-parameter = untagged-simple-param /
                              untagged-compound-param /
                              untagged-reference-param
   untagged-simple-param = simple-type WS name [ OWS cardinality ]
                           WS as-kw WS "?" OWS ";" OWS
   untagged-compound-param = untagged-struct-param /
                             untagged-union-param /
                             untagged-combined-param
   untagged-struct-param = struct-kw WS name [ OWS cardinality ]
                           WS as-kw WS "?"
                           [ WS pluggable-kw ]
                           WS "{" struct-body "}" OWS ";" OWS
   untagged-union-param = union-kw WS name [ OWS cardinality ]
                          WS as-kw WS "?" [ WS pluggable-kw ]
                          WS "{" union-body "}" OWS ";" OWS
   untagged-combined-param = combi-kw WS name [ OWS cardinality ]
                             WS as-kw WS "?"
                             WS "{" combined-body "}" OWS ";" OWS
   untagged-reference-param = reference-name WS name
                              [ OWS cardinality ] OWS ";" OWS

Note that the 'plugin' keyword is not applicable to untagged parameters.

The tagged parameters have the basic parameter definition that was initially presented, i.e. lumas-parameter.

The struct body extension fields have the format:

   struct-extension = "[" OWS 1*( lumas-parameter ) "]" OWS

6.14 Union Parameters

A union parameter has the following definition:

   union-param = union-kw WS name [ OWS cardinality ]
                 [ WS as-kw WS explicit-tag ]
                 [ WS pluggable-kw ] [ WS plugin-kw ]
                 WS "{" union-body "}" OWS ";" OWS

'Cardinality' and 'explicit-tag' have the same meaning as for the simple types. The 'pluggable' keyword is defined in Section 6.17.

A union-body MAY have a single untagged integer parameter. All other parameters MUST be tagged and have a cardinality of one and only one. Other than the cardinality constraints of a union, a union can be extended in the same way as a struct.

The untagged integer parameter allows integers to be defined that have wild-carding options. For example, a union might be defined as:

   union select {
      int<0..65535> numbered as ?;
      void any as *;
   };

Examples of the encoded form might be:

   select = 12
   select = *

The parameters within a union are only allowed unary cardinality to avoid ambiguity in the on-the-wire encoding.
If multiple instances of a parameter must be included as an option in a union, it is necessary to wrap the parameters within a struct, using something similar to:

   struct X {
      X x[1..*] as ?;
   };

The definition of a union-body is as follows:

   union-body = [ integer-type WS name WS as-kw WS "?" OWS ";" OWS ]
                *( singular-lumas-parameter )
                *( union-extension )

As mentioned previously, most of the parameters within a union are tagged and have a cardinality of one. Their definition is:

   singular-lumas-parameter = singular-simple-param /
                              singular-compound-param /
                              singular-reference-param
   singular-simple-param = simple-type WS name
                           [ WS as-kw WS explicit-tag ]
                           [ WS plugin-kw ] OWS ";" OWS
   singular-compound-param = singular-struct-param /
                             singular-union-param /
                             singular-combined-param
   singular-struct-param = struct-kw WS name
                           [ WS as-kw WS explicit-tag ]
                           [ WS pluggable-kw ] [ WS plugin-kw ]
                           OWS "{" struct-body "}" OWS ";" OWS
   singular-union-param = union-kw WS name
                          [ WS as-kw WS explicit-tag ]
                          [ WS pluggable-kw ] [ WS plugin-kw ]
                          OWS "{" union-body "}" OWS ";" OWS
   singular-combined-param = combi-kw WS name
                             [ WS as-kw WS explicit-tag ]
                             [ WS plugin-kw ]
                             OWS "{" combined-body "}" OWS ";" OWS
   singular-reference-param = reference-name WS name
                              [ WS as-kw WS explicit-tag ]
                              [ WS plugin-kw ] OWS ";" OWS

The union extension operates in a similar fashion to that of a struct, but references singular-lumas-parameters. Its definition is:

   union-extension = "[" OWS 1*( singular-lumas-parameter ) "]" OWS

6.15 Combined Parameters

A combined parameter has the following definition:

   combined-param = combi-kw WS name [ OWS cardinality ]
                    [ WS as-kw WS explicit-tag ] [ WS plugin-kw ]
                    WS "{" combined-body "}" OWS ";" OWS

The combined compound type provides a simple mechanism for defining new combined types similar to that used for date and time.
All the members of a combined type are encoded on the wire using their untagged form and concatenated together with no intervening white space. The result of the encoding MUST meet all the constraints of an unquoted-ascii value. In addition, the parameters that make up the combined type are subject to the following constraints:

   - Each unquoted-ascii parameter that is part of a combined body MUST have a fixed number of characters,

   - The first character of unquoted-ascii and const parameters MUST NOT be a digit,

   - integer values MUST NOT be adjacent.

The form of the combined body is:

   combined-body = *( combined-simple-type WS name ";" )
   combined-simple-type = integer-type / const-type /
                          unquoted-ascii-kw OWS "<" 1*DIGIT ">"

In many respects the combined type simply makes the encoded form look prettier, and anything that can be encoded with the combined type can also be represented with the struct type. The combined type should also not be used for defining patterns of ASCII or Unicode characters. Note also that a combined type is not pluggable and hence cannot be extended. It is therefore recommended that the combined type be used sparingly.

An example of a combined type is:

   combi protocol as ? {
      const <HTTP/> const1;
      int<0..99> major-version;
      const <.> const2;
      int<0..99> minor-version;
   };

Which might be encoded as:

   HTTP/1.1

Combined types also allow you to define numbers that contain decimal points. An example of such is:

   union currency as ? {
      void dollars as US$;
      void pounds as GBP;
      void francs as FFr;
   };
   combi amount as ? {
      int<-31b..31b> main-denomination;
      const <.> const2;
      int<0..99z> sub-denomination;
   };

Which might be encoded as:

   US$ 100.05

6.16 Referenced Parameters

It was mentioned previously that structs and unions can reference types that are defined elsewhere.
Referenced types do not have a cardinality specification, and do not specify an explicit tag. This is because the cardinality and tagging of the type are defined in the item that does the referencing, rather than where the referenced type is defined. (If a referenced type needs a cardinality other than one, it is recommended that the technique for giving a parameter within a union a non-unary cardinality be used.)

The definitions of the referenced types are:

   referenced-lumas-parameter = referenced-simple-param /
                                referenced-compound-param /
                                referenced-reference-param
   referenced-simple-param = simple-type WS name OWS ";" OWS
   referenced-compound-param = referenced-struct-param /
                               referenced-union-param /
                               referenced-combined-param
   referenced-struct-param = struct-kw WS name [ WS pluggable-kw ]
                             OWS "{" struct-body "}" OWS ";" OWS
   referenced-union-param = union-kw WS name [ WS pluggable-kw ]
                            OWS "{" union-body "}" OWS ";" OWS
   referenced-combined-param = combi-kw WS name
                               OWS "{" combined-body "}" OWS ";" OWS
   referenced-reference-param = reference-name WS name OWS ";" OWS

6.17 External Extensions - Plug and Pluggable

A protocol may be extended via an external specification without directly modifying the original definition. This may be to define a proprietary extension, or to define an external profile of the base protocol. The specification for this type of extension is:

   external-extension = plug-kw WS
                        ( external-struct-extension /
                          external-union-extension )
                        WS into-kw WS into-name
                        *( OWS COMMA OWS into-name ) OWS ";" OWS
   into-name = [ module-name "::" ] hierarchical-name
   hierarchical-name = *( name "." ) name
   external-struct-extension = 1*lumas-parameter
   external-union-extension = 1*singular-lumas-parameter

This specifies a parameter that is to be plugged into an existing construct.
For example, if the following is defined:

   plug
      ascii cookie as cookie.tech-know-ware.com;
   into my-example.my-addition;

The resultant definition would be treated as if it were:

   struct my-example {
      int <0..255> participant-id as ?;
      Action action as ?;
      struct my-addition[0..1] as new.tech-know-ware.com plugin {
         bool tkw-app-capable as ?;
         ascii cookie as cookie.tech-know-ware.com plugin;
      };
   };

The 'into-name' field indicates the name of the construct that the item is to be plugged into. The optional 'module-name' part of the name specifies the name of the module that contains the parameter into which the extension is to be plugged. The 'hierarchical-name' specifies the name of the parameter within the module that the extensions are to be plugged into. The name is hierarchical because parameters can be locally defined within structs and unions. The hierarchical name is made up of the names of each of the parameter's ancestors plus the name of the parameter itself, joined together by the '.' character. If the parameter to be extended is contained within another parameter, the first name is the name of the outer-most parameter that contains the parameter to be extended (i.e. one that is not contained within any other parameter), and the second name is the name of the next outer-most parameter that contains the parameter to be extended (if present), and so on until the parameter itself is named. An illustration of the naming is shown in the example above.

In a struct and union the 'pluggable' keyword is used to indicate that the construct is a location that the message designers have formally declared as extensible using the 'plug' mechanism. Lumas compilers SHOULD emit warnings when extra material is plugged into locations that are not marked as pluggable, but MUST NOT consider it an error. Combined types are not pluggable.
If a party other than the original message designers uses the plug mechanism to define an extension, each added parameter MUST have an explicit-tag constructed according to the rules described in Section 6.10.

6.18 Module Definition and Directives

A single protocol may be defined in a number of message definition files. This might be for the purpose of accessing predefined libraries, or specifying a definition that the current definition extends. A message definition therefore begins with a set of optional directives expressing this information. They have the form:

   lumas-directives =
         [ lumas-kw WS module-kw WS module-name OWS ";" OWS ]
         [ extends-kw WS module-name [ WS as-kw WS alias ] OWS ";" OWS ]
         *( import-kw WS module-name [ WS as-kw WS alias ] OWS ";" OWS )
   module-name = [ "+" ] name *( "." name )
   alias = name

The 'module' directive specifies the name of the module. The 'extends' directive is used in a definition that contains an external extension. The module-name in the extends specification indicates the message definition that is being extended. The 'import' statement indicates a library message definition that contains referenced types that are referenced within the message definition.

The 'module-name' is a hierarchical namespace that is based on the name of the protocol, combined with a domain name [DOMAINS] owned by the entity defining the protocol. The parts of the module-name are combined together so that it looks like a regular domain name. The order in which the domain levels are written is then reversed, so that the top-level domain becomes the first written domain, and the second level domain becomes the second written domain and so on.
For example, if a protocol called the Simple Conference Protocol (SCP) was defined by Tech-Know-Ware Ltd with a domain name of tech-know-ware.com, the module name might be:

   com.tech-know-ware.scp

It is the responsibility of the entity owning the domain name to ensure that the module names it creates using its domain name are unique.

Lumas defines a number of pseudo top level domains for its own purposes. These are currently as follows:

+ietf
   A pseudo top level domain for the Internet Engineering Task Force.

+iso
   A pseudo top level domain for the International Organization for Standardization. The sub-domains of this domain follow the structure of ISO defined Object Identifiers. All spaces must be removed and numbers in brackets should be ignored when parsing this domain. E.g. iso(1) member-body(2) us(840) rsadsi(113549) digestAlgorithm(2) 5 is represented as +iso(1).member-body(2).us(840).rsadsi(113549).digestAlgorithm(2).5 and looked up as +iso.member-body.us.rsadsi.digestAlgorithm.5.

+itu
   A pseudo top level domain for the International Telecommunication Union. The sub-domains of this domain follow the structure of ITU defined Object Identifiers. Processing of such identifiers follows that defined for processing of ISO Object Identifiers.

+lms
   A pseudo top level domain for defining Lumas extensions and libraries.

+uuid
   A pseudo top level domain that uses Universally Unique Identifiers for identification. An example is: +uuid.4d36e96c-e325-11ce-bfc1-08002be10318

National standards bodies such as ANSI and BSI are defined under their national top-level domain.
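The module naming rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names (module_name, oid_module_name, lookup_name) are hypothetical, and the sketch covers only the reversed domain name form and the +iso style Object Identifier form described in this section.

```python
import re


def module_name(protocol, domain):
    """Combine a protocol name with a domain name and reverse the label order."""
    labels = f"{protocol}.{domain}".split(".")
    return ".".join(reversed(labels))


def oid_module_name(oid_text):
    """Write an ASN.1 style OID such as 'iso(1) member-body(2) ...' as a module name."""
    # Spaces between the OID components become '.' separators.
    return "+" + ".".join(oid_text.split())


def lookup_name(name):
    """Form used for look up: spaces removed and bracketed numbers ignored."""
    without_spaces = re.sub(r"\s+", "", name)
    return re.sub(r"\(\d+\)", "", without_spaces)
```

For instance, module_name("scp", "tech-know-ware.com") gives the SCP module name used in the example above.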
The 'alias' part of the import and extends statements is used as an alias of the 'module-name', so that items within 'module-name' can be referenced in the abbreviated form of:

   alias::item

For example, if a parameter definition called 'id' is contained in the module 'com.tech-know-ware.scp', and the following import statement is specified:

   import com.tech-know-ware.scp as tkwscp;

Then 'id' can be referenced by:

   tkwscp::id

6.19 The Top Level Definition

Finally, we are in a position to describe a complete Lumas message definition. This is:

   lumas-definition = OWS lumas-directives
                      *external-extension
                      *referenced-lumas-parameter
                      [ OWS endmodule-kw OWS ";" ] OWS

The first parameter defined within the message definition is the root of the message definition tree, and is thus the outer-most construct of an encoded message.

The end of a Lumas definition MAY be marked with the 'endmodule' keyword. Marking the end of a module in this way allows multiple Lumas definitions to be included in a single file or document. If the 'endmodule' keyword is not present, the definition ends at the end of the file or document.

6.20 Locating Lumas within a Specification

It is not sufficient to use Lumas alone to define a protocol. Additional narrative is required to define the semantics of a protocol in addition to the syntax defined by Lumas. Thus Lumas and narrative typically need to be combined in a single document.

The issue here is that at some point the Lumas must be extracted from the document to be useful. If the Lumas is intermingled with the narrative, it can be manually removed using cut and paste; however, this is tedious and error-prone. An alternative is to put all the Lumas in a separate section so that it can be easily extracted. However, this distances the Lumas specification from the narrative that explains it, which is undesirable.
A third option is to do both - interleave one copy of the Lumas with the narrative and a separate copy that can be used for compiling. This approach makes it difficult to keep the two versions in step, and errors can easily creep in.

Lumas compilers MUST implement a fourth option. Before parsing a file, a compiler MUST first look for a line of text on which the first non-white space text is lumas*/ and only has white space after it. If such a line is found, compilation starts at the following line. Subsequent narrative is then included in /* */ comment marks. If no such line is found, then compilation begins at the beginning of the file.

For example, if any */ character sequences that follow this example are removed (which have been included to discuss how they are used and hence not properly matched), a Lumas compiler must be able to find and process the following Lumas syntax:

   lumas*/
   // The first 'official' line of Lumas
   struct top {
      not-much not-much;
   };
   /* This is narrative. */
   int <0..1> not-much;
   /**

For a fuller description of Lumas comments, see Section 9.

7. On-the-Wire Representation

This section describes the default character based on-the-wire encoding of Lumas messages. Messages defined using the Lumas message definition language may be represented using other character encoding forms or even binary forms.

7.1 Principles of the default On-the-Wire Encoding

The basic format of the default text based on-the-wire encoding is to use the format:

   tag = value

The tag is a fixed sequence of characters that identifies the parameter with which a particular value (or values) is associated. For example, there may be multiple parameters that have integer values within a struct, that might specify, say, width and height. The tags are used to identify which integer value belongs to which parameter.
If there are multiple instances of a parameter, then they may either be conveyed as multiple instances of the above construct, or as a comma separated list, as in:

   tag = value, value, value

If a tag is explicitly specified in the message definition, then this is used on the wire. If no tag is explicitly specified, then the name of the parameter is used as the tag.

Tagged items may appear in any order within a struct, and do not have to be in the same order as they are defined in the struct definition.

It is also possible to specify that no tag should be used on the wire by specifying 'as ?'. All untagged items MUST appear in a struct in the same order that they are defined in the message definition, and MUST appear before any tagged items within a struct definition. Untagged parameters that have more than one instance MUST be constructed as a comma separated list. Thus untagged values have the format:

   value

or:

   value, value, value

If an untagged parameter has a cardinality that allows it to be absent from an encoded message, then all subsequent parameters in the enclosing struct, including tagged parameters, MUST also be absent. Consequently, great care should be taken when writing a message definition that allows untagged parameters to be absent.

For the examples quoted earlier, that is:

   ascii rfc-name ;
   int <1..30000> referenced-rfcs [0..255] as refers;

The format on the wire would be something like (depending on the actual values in question):

   rfc-name = 'Lumas'
   refers = 2234, 791, 2045

7.2 Formal On-the-Wire Representation

The principal representation of a Lumas defined message on the wire is text based.

The top-level construct of a Lumas definition is a referenced type, which essentially has no tag associated with it. (Indeed, the presence of such a tag would not convey any information.)
The top-level construct on the wire is therefore either a struct body, or a union body, as in:

   lumas-text-message = (struct-body / union-body) OWS

A struct body can contain untagged and tagged parameters. All untagged parameters MUST appear before any tagged parameters. The values of untagged parameters that have non-singular cardinality MUST be comma separated. Tagged parameters that have non-singular cardinality may either have a tag followed by a comma separated list of values, have multiple instances of the "tag = value" form, or some combination of the two. All parameters in a struct body are separated by white space, but white space is optional either before or after the struct body. (This logical specification of where white space is used leads to an unfortunately complex ABNF definition for a struct body.)

The definition of a struct-body is therefore:

   struct-body = OWS ( struct-untagged-set /
                       struct-tagged-set /
                       (struct-untagged-set WS struct-tagged-set) )

   struct-untagged-param = value *( COMMA value )

   struct-untagged-set = struct-untagged-param
                         *(WS struct-untagged-param)

   struct-tagged-param = tag        ; For a void parameter
                         / (tag EQUAL value *( COMMA value ))

   struct-tagged-set = struct-tagged-param *(WS struct-tagged-param)

Except for a single integer parameter that may be untagged, all items of a union body MUST be tagged. Also, parameters must only have a cardinality of one in the encoding to avoid ambiguities in the encoded message. Therefore a union body has the form:

   union-body = OWS ( integer-value
                      / tag         ; For a void parameter
                      / ( tag EQUAL value ) )

The definition for 'tag' is defined in the common definitions section, Section 8.
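The ordering rules for a struct body can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference encoder: the names encode_param and encode_struct_body are hypothetical, values are assumed to be already rendered in their on-the-wire text form (or to be integers), and an absent untagged parameter is rejected rather than handled.

```python
def encode_param(values, tag=None):
    """Encode one parameter as 'value, value' or 'tag = value, value'."""
    joined = ", ".join(str(v) for v in values)
    if tag is None:
        if not values:
            # Assumption of this sketch: absent untagged parameters are
            # not supported, because they force later parameters to be
            # absent as well.
            raise ValueError("absent untagged parameter not handled")
        return joined
    if not values:
        return tag  # a void parameter is conveyed by its tag alone
    return f"{tag} = {joined}"


def encode_struct_body(untagged, tagged):
    """Build a struct body.

    untagged: list of value lists, in definition order.
    tagged:   list of (tag, value list) pairs, in any order.
    """
    # Untagged parameters come first, then the tagged ones; parameters
    # are separated by white space.
    parts = [encode_param(values) for values in untagged]
    parts += [encode_param(values, tag) for tag, values in tagged]
    return " ".join(parts)


print(encode_struct_body([], [("rfc-name", ["'Lumas'"]),
                              ("refers", [2234, 791, 2045])]))
print(encode_struct_body([[5434], ["All"]], [("time", [98787654654])]))
```

The sketch always uses the comma separated list form for repeated tagged values; the grammar equally allows repeating the "tag = value" form, so a decoder has to accept both.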
'value' has the following definition:

   value = simple-value / compound-value

   simple-value = bool-value / integer-value / float-value /
                  ipv4-value / ipv6-value /
                  date-value / time-value / oid-value /
                  ascii-value / unquoted-ascii-value / unicode-value /
                  const-value / bytes-value / embedded-value

Which in turn are defined as follows:

   bool-value = True-kw / False-kw / T / F

   integer-value = [ "-" ] 1*DIGIT

   float-value = float-number
                 / NaN-kw        ; IEEE 754 Not a Number
                 / INF-kw        ; Positive infinity
                 / "-" INF-kw    ; Negative infinity
                 ; Note that "-0" is included in float-number

   float-number = float-mantissa [ (e/E) float-exponent ]
   float-mantissa = ["-"] 1*DIGIT ["." 1*DIGIT]
   float-exponent = ["-"/"+"] 1*DIGIT

   True-kw = %x54.72.75.65       ; 'True'
   False-kw = %x46.61.6C.73.65   ; 'False'
   T = %x54                      ; 'T'
   F = %x46                      ; 'F'
   NaN-kw = %x4E.61.4E           ; 'NaN'
   INF-kw = %x49.4E.46           ; 'INF'
   E = %x45                      ; 'E'
   e = %x65                      ; 'e'

The value encoding of a float is the base 10 representation of a base 2 number. There will typically be a degree of error introduced when the conversion is made. Hence the float type should be looked upon as a convenient way to convey floating point information where bit level accuracy between the encoder's base 2 representation of the number and the decoder's base 2 representation of the number is not required. If this is not acceptable, then implementers should seek other ways of presenting floating point numbers that do not suffer from this loss of accuracy.

The 'float-mantissa' part of the number is NOT restricted to the range 1.0 to 9.9.

An 'oid-value' is represented as:

   oid-value = 1*DIGIT *( "~" 1*DIGIT )

As can be seen, only the oid's numerical values are encoded.

The IP address values are:

   ipv4-value = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT

   ipv6-value = hexseq / hexseq "::" [ hexseq ] / "::" [ hexseq ]
   hexseq = hex4 *( ":" hex4)
   hex4 = 1*4HEXDIG

Note that the IPv4 address within an IPv6 address format is not supported.

Date and time parameters have fixed width to aid parsing. As such the various fields have leading zeros if required. (They adopt one of the ISO-8601 formats.)

Dates are according to the Gregorian calendar. Other calendar types may be constructed from other types if required.

Unless the time can be guaranteed to have only local significance, the time MUST be converted to UTC prior to including it in a message. The time uses 24-hour clock notation. The absence of the 'time-seconds' field is interpreted as meaning seconds = 0.

   date-value = date-year "-" date-month "-" date-day-of-month
   date-year = 4DIGIT            ; e.g. 2002
   date-month = 2DIGIT           ; With leading zeros, 01 to 12
   date-day-of-month = 2DIGIT    ; With leading zeros, 01 to 31

   time-value = time-hours ":" time-minutes [ ":" time-seconds ]
   time-hours = 2DIGIT           ; With leading zeros, e.g. 00 to 23
   time-minutes = 2DIGIT         ; With leading zeros, e.g. 00 to 59
   time-seconds = 2DIGIT         ; With leading zeros, e.g. 00 to 59

   unquoted-ascii-value = first-safe-char *( safe-char )
                          ; See the section 'Notes on Comments' below

The string types have the format:

   ascii-value = "'" *( %x00-26 / %x28-5B / %x5D-7F /
                        "\\" / "\'" ) "'"

   unicode-value = DQUOTE *( %x00-21 / %x23-5B / %x5D-FF /
                             "\\" / "\" DQUOTE ) DQUOTE
                   ; DQUOTE defined in [ABNF]

For 'unicode-value', each Unicode character is represented on the wire using the UTF-8 transform [UTF8].
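The string escaping and the fixed-width date and time fields described above can be sketched in Python as follows. The function names are hypothetical and chosen for this illustration only. The sketch assumes that the caller passes a timezone-aware datetime (Python treats a naive datetime as local time when converting), and it always emits the optional 'time-seconds' field.

```python
from datetime import datetime, timedelta, timezone


def encode_ascii(text):
    """Quote an ascii-value, escaping backslash and single quote."""
    if any(ord(ch) > 0x7F for ch in text):
        raise ValueError("ascii-value is limited to %x00-7F")
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def encode_unicode(text):
    """Quote a unicode-value and apply the UTF-8 transform."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return ('"' + escaped + '"').encode("utf-8")


def encode_date_time(moment):
    """Return (date-value, time-value) in UTC with leading zeros."""
    utc = moment.astimezone(timezone.utc)
    date_value = f"{utc.year:04d}-{utc.month:02d}-{utc.day:02d}"
    time_value = f"{utc.hour:02d}:{utc.minute:02d}:{utc.second:02d}"
    return date_value, time_value


# A time expressed one hour ahead of UTC is converted before encoding.
local = datetime(2002, 2, 28, 13, 0, 0, tzinfo=timezone(timedelta(hours=1)))
print(encode_ascii("Lumas"))
print(encode_unicode("Lumas"))
print(encode_date_time(local))
```

Backslashes are escaped before quotes so that the backslash introduced by a quote escape is not itself doubled.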
The 'bytes-value' encodes binary data using the Base64 transform [BASE64], and is defined as:

   bytes-value = "[" OWSNC base64-line *( WSNC base64-line ) OWSNC "]"

   base64-line = 0*18( 4BASE64-CHAR )
                 ( ( 4BASE64-CHAR ) /
                   ( 3BASE64-CHAR "=" ) /
                   ( 2BASE64-CHAR "=" "=" ) )

   BASE64-CHAR = ALPHA / DIGIT / "+" / "/"

The white space between base64-lines should include characters to move to a new line as specified in [BASE64].

   const-value = first-safe-char *( safe-char )
                 ; See the section 'Notes on Comments' below

   embedded-value = "(" *(%x00-FF) ")"

Any occurrence of '(' within an embedded message that is not part of a string must be matched by a corresponding ')'.

Illustrating the recursiveness of the message format, we have:

   compound-value = struct-value / union-value / combined-value

   struct-value = "{" struct-body "}"

   union-value = union-body

   combined-value = first-safe-char *( safe-char )

   EQUAL = OWS "=" OWS

7.3 Marking Message Boundaries

Before a message is parsed it is necessary to know the boundary of the message. There are many ways in which this can be done, and the method adopted should be specified in the protocol specification. However, in the absence of any other way, Lumas parsers should take the presence of an unmatched closing brace to be the end of message marker. Hence, the definition of a message delimited in this way becomes:

   delimited-lumas-text-message = lumas-text-message ( "}" / ")" )

7.4 Examples of Encoded Types

This section illustrates how the types look once they have been encoded according to the syntax above. The tag of each item has the format 'my-XXXX'. Except in the case of the 'void' example, the XXXX part indicates the type that is encoded to the right of the equals sign.
   my-void        // Tag only for a void parameter
   my-bool = True
   my-int = 5643
   my-float = 102.4519
   my-ipv4 = 192.0.2.1
   my-ipv6 = 2001:DB8::1
   my-date = 2002-02-28
   my-time = 12:00:00
   my-oid = 1~2~840~113549~2~5
   my-ascii = 'Lumas'
   my-unquoted-ascii = Lumas
   my-unicode = "Lumas"
   my-const = Lumas
   my-bytes = [ 01AF3C== ]
   my-embedded = ( my-other-int=5 single-closing-bracket-text=')' )
   my-struct = { 5434 All time=98787654654 }
   my-union = 5434
   my-union = Switch
   my-union = Volume = 11

8. Common ABNF Definitions

The following definitions are common to both the message definition syntax and the on-the-wire representation.

   tag = first-tag-safe-char 0*62( safe-char )
         ; Tag MUST NOT exceed 63 characters in length

   first-tag-safe-char = %x21 /       ; Not "
                         %x23-26 /    ; Not ' ( )
                         %x2A-2B /    ; Not , -
                         %x2E-2F /    ; Not 0 1 2 3 4 5 6 7 8 9
                         %x3A-3C /    ; Not =
                         %x3E-5A /    ; Not [
                         %x5C-7A /    ; Not {
                         %x7C /       ; Not }
                         %x7E-7F
                         ; Visible characters except = , " ' { } ( ) [ -
                         ; and digits (tags must not get confused with
                         ; integers)

   first-safe-char = first-tag-safe-char / DIGIT / "-"

   safe-char = first-safe-char / DQUOTE / "'" / "{" / "(" / "["
               ; Not = } ) ,

   WS = 1*( comment / SP / HTAB / CR / LF )
        ; HTAB, CR, LF defined in [ABNF]

   OWS = [ WS ]                       ; Optional white space

   WSNC = 1*( SP / HTAB / CR / LF )   ; Whitespace - no comment

   OWSNC = [ WSNC ]                   ; Optional white space - no comment

   COMMA = OWS "," OWS

   ; See section 'Notes on Comments' below for more on comments
   comment = c-comment / cpp-comment / narrative-comment

   c-comment = "/*" (nested-end / hard-end )
   nested-end = "*/"
   hard-end = "**/"

   cpp-comment = "//" *( HTAB / %x20-7F ) ( CR / LF )

   narrative-comment = "/**" "lumas*/"

   ; A comment is treated as a single space during parsing

ALPHA, DIGIT, HEXDIG and DQUOTE are defined in [ABNF].

9. Notes on Comments

To aid development, Lumas allows comments to appear in both a message definition and on the wire.

On the wire, const and unquoted-ascii values MUST NOT begin with comment start markers ('//' and '/*'). However, if the values contain comment start marker characters, the characters MUST be interpreted as part of the value, and do not indicate the start of a comment. For example, in the first of the examples below, the text "This-is-a-comment" MUST be treated as a comment, whereas in the second example the text "this-is-part-of-the-value" MUST be treated as part of the value.

   ascii-value = /*This-is-a-comment*/This-is-the-value
   ascii-value = and-//this-is-part-of-the-value

In a message definition (but not on the wire) the ABNF c-comment production allows nesting of comments. In a nested comment, each occurrence of the '/*' character sequence MUST be matched by a corresponding occurrence of the '*/' character sequence before the comment ends. Alternatively, the end of the comment can be forced by the hard end of comment marker, defined as '**/', which overrides the nesting. (This provision allows the commenting out of headers and footers in text only message definition documents.)

To further support Lumas embedded in specification documents, Lumas supports a 'narrative-comment'. These are comments whose content, such as C example code, may coincidentally contain Lumas end of comment markers. The narrative comment begins with the symbol '/**', and ends with the symbol 'lumas*/'.

A comment is treated as a single space for the purposes of parsing.

10. Locating Lumas Modules

It is not intended that applications should find Lumas modules 'on-the-fly'. It is expected that some human involvement will be required to locate and interpret a Lumas definition. A Lumas definition does not therefore have any way of specifying the physical location from where a referenced definition can be acquired.
Instead, the strategy is to exploit the fact that a module definition can begin with the text "lumas module" followed by the module name. By entering this text (e.g. "lumas module org.lumas.mine") into a web search engine (either one that covers the whole Internet, or one that is limited to a specific site) a user can locate a particular Lumas module.

Determining whether a Lumas module so located is authentic is beyond the scope of this document.

11. Mandatory to Understand

Many protocols require the capability to signal that certain extension parameters are mandatory to understand, and if they are not understood the message should be rejected in some way. Lumas provides no in-built mechanism for this feature. Instead implementers are recommended to use a feature similar to SIP's 'Require' header [SIP], which presents a list of feature identifiers that must be understood. Naturally, provision for this mechanism must be included in the first version of the protocol, as it is not possible to define such semantics at a later time.

An example of such a construct might be:

   union require [*] pluggable { };

And could be populated using:

   plug void my-feature; into require;

12. Security Considerations

Lumas itself does not have any security issues related to it, but the security requirements of a protocol must be borne in mind when writing a Lumas message definition. Common advice is that it is difficult to add security to a protocol once it has been released, and hence security issues must be considered from the outset. This is relevant to a Lumas message definition as it may affect the format of messages. This is particularly the case for integrity check values that are effectively appended to the end of the message once it is encoded.
This may mean that it is appropriate to define both a main message definition and a message definition that is a wrapper that can provide cryptographic services for the main message definition.

For example, a message definition wrapper might look like:

   struct my-protocol-wrapper
   {
       embedded main-definition as ?;
       bytes<1..64> signature as signed;
       oid signature-algorithm as sig-alg;
   };

13. Normative References

[ABNF] D. Crocker and P. Overell, "Augmented BNF for Syntax Specifications: ABNF," Internet Engineering Task Force, RFC 4234, October 2005.

[BASE64] N. Freed and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies," Internet Engineering Task Force, RFC 2045, November 1996.

[DOMAINS] J. Postel, "Domain Name System Structure and Delegation," Internet Engineering Task Force, RFC 1591, March 1994.

[IEEE754] "IEEE Standard for Binary Floating-Point Arithmetic," IEEE 754-1985, IEEE, 1985.

[KWORDS] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels," RFC 2119, March 1997.

[PERL] L. Wall, T. Christiansen, and J. Orwant, "Programming Perl," O'Reilly, ISBN 0-596-00027-8.

[UTF8] F. Yergeau, "UTF-8, a transformation format of ISO 10646," RFC 2279, January 1998.

14. Informative References

[ASN1] International Organization for Standardization, "Information Processing Systems - Open Systems Interconnection - Specification of Abstract Syntax Notation One (ASN.1)," ISO Standard 8824, December 1990.

[CMS] R. Housley, "Cryptographic Message Syntax," RFC 2630, June 1999.

[DIAMETER] Pat R. Calhoun, John Loughney, Erik Guttman, Glen Zorn, Jari Arkko, "Diameter Base Protocol," draft-ietf-aaa-diameter-xx, Work in Progress.

[IP] "Internet Protocol," RFC 791, September 1981.

[JSON] "Introducing JSON," http://www.json.org/.

[OMGIDL] "Common Object Request Broker Architecture: Core Specification," Object Management Group, December 2002. (Accessible via: http://www.omg.org/technology/documents/corba_spec_catalog.htm)

[RELAX] OASIS Technical Committee: RELAX NG, "RELAX NG Specification," December 2001.

[SCHEMA] Thompson, H., Beech, D., Maloney, M. and N. Mendelsohn, "XML Schema Part 1: Structures," W3C REC-xmlschema-1, May 2001, and Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes," W3C REC-xmlschema-2, May 2001.

[SIP] J. Rosenberg et al., "SIP: Session Initiation Protocol," Internet Engineering Task Force, RFC 3261, June 2002.

[SMTP] Klensin, J. (Ed.), "Simple Mail Transfer Protocol," RFC 2821, April 2001.

[SNMP] J. Case, M. Fedor, M. Schoffstall, J. Davin, "A Simple Network Management Protocol (SNMP)," RFC 1157, May 1990.

[STRON] Jelliffe, R., "The Schematron," November 2001.

[TCP] "Transmission Control Protocol," RFC 793, September 1981.

[TLS] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0," RFC 2246, January 1999.

[UDP] "User Datagram Protocol," RFC 768, August 1980.

[XDR] R. Srinivasan, "XDR: External Data Representation Standard," RFC 1832, August 1995.

[XML] "Extensible Markup Language (XML) 1.0 (Second Edition)," W3C REC-xml, October 2000.

[XMLBCP] S. Hollenbeck, M. Rose, and L. Masinter, "Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols," RFC 3470, January 2003.

[XMLVER] David Orchard, "Versioning XML Vocabularies," XML.com, December 03, 2003, http://www.xml.com/pub/a/2003/12/03/versioning.html

15. Author's Address

Pete Cordell
Tech-Know-Ware Ltd
P.O. Box 30
Ipswich IP5 2WY
UK
pete@tech-know-ware.com
http://www.tech-know-ware.com

Full Copyright Statement

Copyright (C) The IETF Trust (2007).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgment

Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).