Interscript(ID:5455/int025)

Began: 1985

Xerox PARC 1985

Device independent document description language

Superset of Interpress

People:

Butler Lampson

References:

Interscript [reference manual]

Xerox PARC "Introduction to Interscript" Extract: Introduction
Interscript provides a means of representing editable documents. This representation is independent of any particular editor and can therefore be used to interchange documents among editors.
1.1    Rationale for an interchange standard
As office systems proliferate, being able to interchange documents among different editing systems is becoming more and more important. Customers need document compatibility to avoid being trapped in evolutionary cul-de-sacs and having to pay the awful price of converting documents from one product's format to another's.
Typically, an editing program uses a private, highly-encoded representation when operating on a document to enable it to provide good performance. Generally, this means that different editors use different, incompatible private formats, and a user can conveniently edit a document only with the editor used to create it. This problem can be solved by providing programs to convert between one editor's private (or file) format and another's. However, a set of different editors with N different document representations requires N(N-1) conversion routines to be able to convert directly from each format to every other.
This N(N-1) problem can be reduced to 2(N-1) by noticing that we could write N-1 conversion routines to go from F1 (format for editor1) to F2, ..., FN, and another N-1 routines to convert from F2, ..., FN to F1. Except when converting from or to F1, this scheme requires two conversions to go from Fi to Fj (j ≠ i). This is a minor drawback. Choosing which editor should be editor1 is the critical issue, however, since the capabilities of that editor will determine how general a class of documents can be interchanged among the editors.
This presents a truly difficult problem in the case that there is no single functionally dominant editor1 in the set. If the pivotal editor1 doesn't incorporate all of the structures, formats, and content types used by editor2, ..., editorN, then it will not be possible to faithfully convert documents containing them. Even if there were a single, functionally dominant editor, it would place an upper bound on the functionality of all future compatible editors.
Since there are no actual candidates for a totally dominant editor, this standard has been developed by examining, in general, what information editors need and how that information can be organized to represent general documents. It provides an external representation that is capable of conveying the content, form, and structure of editable documents. That external representation has only one purpose: to enable the interchange of documents among different editors. It must be easy to convert between real editors' formats and this interchange encoding.
When represented by this interchange encoding, we call a document a script and reserve the term document for the representation that an editing system uses to enable editing it. Using a standard interchange encoding has the additional advantage that much of the input and output conversion algorithms will be common to all conforming editors. For example, when a new version of an existing editor is released, the only differences in the new version's conversion routines will be in the areas in which its internal document format has changed from its previous form; this represents a significant saving of programming.
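The routine counts in this section can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the third count (one internalizer plus one externalizer per editor, 2N in total) is an inference from the paragraph above rather than a figure stated in the extract.

```python
def direct_converters(n: int) -> int:
    """Routines needed to convert directly between every ordered pair of formats."""
    return n * (n - 1)


def pivot_converters(n: int) -> int:
    """Routines needed when every conversion goes through editor1's format F1."""
    return 2 * (n - 1)


def interchange_converters(n: int) -> int:
    """Assumed count for a neutral interchange encoding: each editor supplies
    one routine to internalize a script and one to externalize a document."""
    return 2 * n


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, direct_converters(n), pivot_converters(n), interchange_converters(n))
```

The direct scheme grows quadratically while both alternatives grow linearly; the neutral encoding costs two more routines than the pivot scheme but does not make any one editor's capabilities the ceiling for all the others.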
1.2    Properties that any interchange standard must have
An interchange encoding for editable documents must satisfy a number of constraints. Among these are the following:
1.2.1  Encoding efficiency
Since editable documents may be stored as scripts, may be transmitted over a network, and must certainly be processed to convert them to various editors' private formats, it is important that the encoding be space-efficient.
Similarly, the cost in time of converting between the interchange encoding and private formats must be reasonably low, since it will have a significant effect on how useful the interchange standard is.
1.2.2  Open-ended representation
Scripts must be capable of describing virtually all editable documents, including those containing formatted text, synthetic graphics, scanned images, animated images, etc., and mixtures of these various modes. Nor may the standard foreclose future options for documents that exploit additional media (e.g., audio) or require rich structures (e.g., VLSI circuit diagrams, database views). Thus, a standard must be capable of incremental extension and any extension must have the same guarantees and be able to employ the same mechanisms as the most basic parts of the standard.
For the same reasons, the standard must not be tied to particular hardware or to a file format since documents will be stored and transmitted using a variety of media.
1.2.3  Document content and form
The complete description of a component of a document usually requires more than a list of its explicit contents; e.g., paragraphs have margins, leading between lines, default fonts, etc. Scripts must record the association between attributes (e.g., margins) and pieces of content.
Both the contents and attributes of typical documents require a rich value space containing scalar numbers, strings, vectors, and record-like constructs in order to describe items as varied as distances, text, coefficients of curves, graphical constraints, digital audio, scanned images, transistors, etc.
1.2.4  Document structure
Many documents have hierarchical structure: e.g., a book is made of chapters containing sections, each of which is a sequence of paragraphs; a figure is embedded in a frame on a page and in turn contains a textual caption and embedded graphics; and the description of an integrated circuit has levels corresponding to modular or repeated subcircuits. This standard exploits such structure, without imposing any particular hierarchy on all documents.
Hierarchy is not sufficient, however. Parts of documents must often be related in other ways; e.g., graphics components must often be related geometrically, which may defy hierarchical structuring, and it must be possible to indicate a reference from some part of a document to a figure, footnote, or section in a way that cuts across the dominant hierarchy of the document.
Documents often contain structure in the form of indirection. For instance, a set of paragraphs may all have a common "style," which must be referred to indirectly so that changing the style alone is sufficient to change the characteristics of all the paragraphs using it. Or a document may be incorporated "by reference" as a part of more than one document and may need to "inherit" many of its properties from the document into which it is being incorporated at a given time.
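The style indirection described above can be pictured with a small sketch. It is not Interscript notation: the dictionary layout, the `resolve` helper, and the attribute names are all hypothetical, chosen only to show why a paragraph should hold the name of a style rather than a copy of its values.

```python
# Hypothetical data: one named style shared by two paragraphs.
styles = {"body": {"font": "Times", "left_margin": 72}}

paragraphs = [
    {"text": "First paragraph.", "style": "body"},
    {"text": "Second paragraph.", "style": "body", "overrides": {"left_margin": 90}},
]


def resolve(paragraph: dict, styles: dict) -> dict:
    """Look the style up by name at the time of use, then apply local overrides."""
    attributes = dict(styles[paragraph["style"]])
    attributes.update(paragraph.get("overrides", {}))
    return attributes


if __name__ == "__main__":
    print([resolve(p, styles) for p in paragraphs])
    # Changing the style alone changes every paragraph that refers to it.
    styles["body"]["font"] = "Helvetica"
    print([resolve(p, styles) for p in paragraphs])
```

Because each paragraph stores only the style's name, the single assignment to `styles["body"]["font"]` is enough to change both paragraphs, while the second paragraph's local margin override is left alone.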
1.2.5  Transcription fidelity
It must be possible to convert any document from any editor's private format to a script and reconvert it back to the same editor's private format with no observable effect on the document's content, form, or structure. This characteristic is called transcription fidelity, and is a sine qua non for an interchange encoding; if it is not possible to accomplish this, the interchange encoding or the conversion routines (or both) must be defective. It must, of course, also be possible to test that an editor does transcribe scripts faithfully, which in turn requires that it be possible to test if two scripts are equivalent (section 2.3.4).
Even complicated documents have simple pieces. A simple editor should be able to display parts of documents that it is capable of displaying, even in the presence of parts that it cannot. More precisely, an editor must, in the course of internalizing a script (converting it from a script to its private, editable format), be able to discover all the information necessary to recognize and to display the parts that it understands. This must work despite the fact that different editors may well use different data structures to represent the content, form, and structure of a document.
At a minimum, this requires that a script contain information by which an editor can easily determine whether or not it understands a component well enough to edit it, and that it be able to interpret the effect that components which it does not understand have on the ones it does. For example, if an editor does not understand figures, it might still be possible for it to display their embedded textual captions correctly, even though a figure might well dictate some of its caption's content or attributes such as margins, font, etc.
This constraint requires that an interchange encoding must have a simple syntax and semantics that can be interpreted readily, even by low-capability editors.
1.2.7 Regeneration
Processing a script to internalize it correctly is only half the problem. It is equally important that an editor, in externalizing a script from its private internal format, be able to regenerate the content, form, and structure carried by the script from which the document was internalized. It should be possible to retain the structure in parts of the original script that were not affected by editing operations. For example, an editor that understands text but not figures should be able to edit the text in a document (although editing a caption may be unsafe without understanding figures) while faithfully retaining and then regenerating the figures when externalizing it.
This problem is much less severe when an editor is transcribing a document that it "understands" completely, e.g., because the entire document was generated using that same editor.
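The regeneration requirement can be illustrated with a toy editor that understands paragraphs but not figures. This is a minimal sketch under stated assumptions: the `Node` class, the tag names, and the `UNDERSTOOD` set are hypothetical and do not come from the Interscript standard.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Node:
    """Hypothetical tree node: a tag plus ordered children (nodes or text)."""
    tag: str
    children: List[Union["Node", str]] = field(default_factory=list)


# Tags this toy editor claims to understand; everything else is opaque to it.
UNDERSTOOD = {"DOCUMENT", "PARAGRAPH"}


def edit_text(node: Node, transform: Callable[[str], str]) -> Node:
    """Apply an edit to text inside understood nodes only.

    Nodes with unknown tags are returned unchanged, so they can be
    regenerated exactly when the document is externalized again.
    """
    if node.tag not in UNDERSTOOD:
        return node
    new_children: List[Union[Node, str]] = []
    for child in node.children:
        if isinstance(child, str):
            new_children.append(transform(child))
        else:
            new_children.append(edit_text(child, transform))
    return Node(node.tag, new_children)


if __name__ == "__main__":
    figure = Node("FIGURE", [Node("CAPTION", ["A caption"])])
    document = Node("DOCUMENT", [Node("PARAGRAPH", ["some text"]), figure])
    edited = edit_text(document, str.upper)
    print(edited)
```

In this sketch the caption is left untouched because it sits inside a node the editor does not understand, which mirrors the caution in the text that editing a caption may be unsafe without understanding figures.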
1.3    What the Interscript standard does not do
There are a number of issues that the Interscript standard specifically does not discuss. Each of these issues is important in its own right, but is separable from the design of an interchange representation.
1.3.1  Interscript is not a file format
This standard is not concerned with how scripts are held in files on various media (floppy disks, hard disks, tapes, etc.), or with how they are transmitted over communications media (local area network, telephone lines, etc.).
1.3.2  Interscript is not a standard for editing
A script is not intended as a directly editable representation. It is not part of its function to make editing of various constructs easier, more efficient, or more compact: that is the purview of editors and their associated private document formats. A script is intended to be internalized before being edited. This might be done by the editor, by a utility program on the editing workstation, or by a completely separate service.
1.3.3  Combining documents is not an interchange function
This exclusion is really a corollary of the statement, "A script is not intended as a directly editable representation." In general, it is no easier to "glue" two arbitrary documents together than it is to edit them.
1.3.4  Interscript does not overlap with other standards
There are a number of standards issues that are closely related to the representation of editable documents, but which are not part of the Interscript standard because they are also closely related to other standards. For example, the issues of specifying encodings for characters in documents, or how fonts should be named or described are not part of this work.
1.4    Concepts and guiding principles
1.4.1  Layers
The Interscript standard is presented in layers:
Layer 0 defines the syntax of the base language for scripts; parsing reveals the dominant structure of the documents they represent (sections 2.1-2.2).
Layer 1 defines the semantics of the base language, particularly the treatment of bindings and environments (section 2.3, chapter 3).
Layer 2 defines the semantics of properties and attributes that are expected to have a uniform interpretation across all editors (chapters 4-5).
1.4.2  Externalization and internalization
A script represents a document in the Interscript format. Its sole purpose is to enable the interchange of documents among editors in a manner that is independent of any one editor.
A script is not the editable form of a document. The editable form is created by an editor by internalizing a script according to the rules (semantics) of Interscript. The reverse operation of converting a document in an editor's internal, editable format to a valid script is called externalization.
It is important that any document prepared by any editor can be externalized as a script that will then be (re)internalized by the editor without "loss of information". Ease of internalization requires that the Interscript base language contain only relatively few (and simple) constructs. This apparent paradox has been resolved by including within the base language a simple, yet powerful, mechanism for abbreviation and extension.
A script may be considered to be a "program" that is "compiled" to convert the script to the private representation of a particular editor, ready for further editing. The Interscript language has been designed so that internalizing scripts into typical editors' representations can be performed in a single pass over the script by maintaining a few simple data structures.
1.4.3  Content, form, value, and structure
Most editors deal with both the content of a document (or piece of a document), and its form. The former is thought of as "what" is in the document, the latter as "how" it is to be viewed; e.g., "ABC" has a sequence of character codes as its contents; its format may include font and position information. Interscript maintains this distinction.
The distinction between the value and the structure of both content and form within a document is also important. When viewing a document, only the value is of concern, but the structure that leads to that value may be essential to convenient editing. An example of structure in content is the grouping of text into paragraphs. An example of structure in form is associating a named "style" with a paragraph.
Content: may be represented by structures built from character strings, numbers, Booleans, identifiers, and nodes, which are structured objects containing simpler ones.
Form: Interscript provides for open-ended sets of properties and attributes. Properties are associated with content by means of tags. Attributes are bindings between names and values that apply over some scope. The way the contents of a document are to be "understood" is determined by its properties; Interscript makes it straightforward to determine what these properties are without having to understand them.
Structure: Most editors structure the content of a document somehow: into paragraphs, sections, and chapters, or into lines, pages, and signatures, for example. This assists in obtaining private efficiency, but, more importantly, provides a conceptual structure for the user.
The most important, and most frequent, structuring mechanism between values is logical adjacency (sequentiality), which is represented by simply putting them one after another in the script.
Most editors that structure contents have a "dominant" hierarchy that maps well into trees whose arcs are implicitly labelled by order. (Different editors use these trees to represent different hierarchies). Interscript provides a simple linear notation for such trees, delimiting node values by braces ("{" and "}"). If an editor maintains multiple hierarchies, the dominant one is the one transcribed into the primary tree structure and used to control the inheritance of attributes.
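The brace-delimited linear notation for trees lends itself to a single pass with a stack. The sketch below is a toy, not the Interscript grammar: it assumes a simplified syntax in which the only tokens are "{", "}", and whitespace-separated words, and the function name `parse_nodes` is hypothetical.

```python
from typing import List, Union

Tree = List[Union[str, "Tree"]]


def parse_nodes(text: str) -> Tree:
    """Parse a simplified brace notation into nested lists in one pass.

    Each "{" opens a new node, each "}" closes the current node, and any
    other token is appended, in order, to the node that is currently open.
    """
    tokens = text.replace("{", " { ").replace("}", " } ").split()
    root: Tree = []
    stack: List[Tree] = [root]
    for token in tokens:
        if token == "{":
            child: Tree = []
            stack[-1].append(child)
            stack.append(child)
        elif token == "}":
            if len(stack) == 1:
                raise ValueError("unmatched '}'")
            stack.pop()
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unmatched '{'")
    return root


if __name__ == "__main__":
    print(parse_nodes("{book {chapter one two} {chapter three}}"))
```

The order of children in each nested list plays the role of the implicitly labelled arcs: nothing but position distinguishes the first chapter from the second.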
Structures recorded for form use explicit indirection by means of names. Interscript allows expressions composed of literals, identifiers, and operators, and permits the use of identifiers to represent expressions.
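Attributes were described above as bindings between names and values that apply over some scope, with inheritance controlled by the dominant hierarchy. One way to picture this is as nested environments, sketched here with Python's `collections.ChainMap`. The node layout, the `bindings` key, and the attribute names are hypothetical and are not Interscript syntax.

```python
from collections import ChainMap
from typing import List, Mapping, Optional, Tuple


def effective_attributes(
    node: dict, inherited: Optional[Mapping] = None
) -> List[Tuple[str, dict]]:
    """Walk the tree, giving each node its own bindings layered over its parent's.

    A binding made in a node applies to that node and its descendants
    unless a descendant rebinds the same name.
    """
    outer = inherited if inherited is not None else {}
    env = ChainMap(node.get("bindings", {}), outer)
    result = [(node["name"], dict(env))]
    for child in node.get("children", []):
        result.extend(effective_attributes(child, env))
    return result


if __name__ == "__main__":
    document = {
        "name": "document",
        "bindings": {"font": "Times", "left_margin": 72},
        "children": [
            {"name": "paragraph1"},
            {"name": "paragraph2", "bindings": {"left_margin": 36}},
        ],
    }
    for name, attributes in effective_attributes(document):
        print(name, attributes)
```

Here the first paragraph inherits both attributes from the document, while the second rebinds the margin for its own scope and still inherits the font.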
Resources

Lampson-designed systems
Interscript (1982-83): With Bob Ayers, Jim Horning and Jim Mitchell, I designed this standard for describing editable documents. The main innovations are semantics which allow editing of parts of the document by an editor which doesn't understand other parts (e.g., captions within figures), provision for what-you-see-is-what-you-get editing, a fully integrated mechanism for style sheets, and a layout model based on regular expressions.