Chapter 9. GRS-1 Record Model and Filter Modules

Note

The functionality of this record model has been improved and replaced by the DOM XML record model. See Chapter 7, DOM XML Record Model and Filter Module.

The record model described in this chapter applies to the fundamental, structured record type grs, introduced in the section called “GRS-1 Record Model and Filter Modules”.

GRS-1 Record Filters

Many basic subtypes of the grs type are currently available:

grs.sgml

This is the canonical input format described in the section called “GRS-1 Canonical Input Format”. It uses a simple SGML-like syntax.

grs.marc.type

This allows Zebra to read records in the ISO2709 (MARC) encoding standard. The last parameter, type, names the .abs file (see below) that describes the specific MARC structure of the input record as well as the indexing rules.

The grs.marc filter uses an internal representation that is not XML conformant. In particular, MARC tags are presented as elements with the same name, and XML element names may not start with digits. This filter is therefore only suitable for systems returning GRS-1 and MARC records. For XML, use the grs.marcxml filter instead (see below).

The loadable grs.marc filter module is packaged in the GNU/Debian package libidzebra2.0-mod-grs-marc

grs.marcxml.type

This allows Zebra to read ISO2709 encoded records. The last parameter, type, names the .abs file (see below) that describes the specific MARC structure of the input record as well as the indexing rules.

The internal representation for grs.marcxml is the same as for MARCXML. It is slightly more complicated to work with than grs.marc, but it is XML conformant.

The loadable grs.marcxml filter module is also contained in the GNU/Debian package libidzebra2.0-mod-grs-marc

grs.xml

This filter reads XML records, using Expat to parse them and convert them into IDZebra's internal grs record model. Only one record per file is supported, because XML does not allow two documents to "follow" each other (there is no way to know when a document is finished). This filter is only available if Zebra is compiled with Expat support.

The loadable grs.xml filter module is packaged in the GNU/Debian package libidzebra2.0-mod-grs-xml

grs.regx.filter

This enables a user-supplied regular expression input filter, described in the section called “GRS-1 REGX And TCL Input Filters”.

The loadable grs.regx filter module is packaged in the GNU/Debian package libidzebra2.0-mod-grs-regx

grs.tcl.filter

Similar to grs.regx but using Tcl for rules, described in the section called “GRS-1 REGX And TCL Input Filters”.

The loadable grs.tcl filter module is also packaged in the GNU/Debian package libidzebra2.0-mod-grs-regx
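
Two of the remarks above (that element names may not start with digits, and that XML documents cannot simply follow each other in one file) can be checked with any Expat binding. The following is a minimal sketch using Python's standard-library Expat module. It is not how Zebra itself calls Expat, and the sample strings, including the MARCXML-style datafield element, are illustrative assumptions only.

```python
import xml.parsers.expat


def is_well_formed(document: str) -> bool:
    """Return True if Expat accepts the text as one well-formed XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# A MARC tag used directly as an element name: not legal XML.
marc_style = "<245>Zen and the Art of Motorcycle Maintenance</245>"

# MARCXML-style markup carries the tag in an attribute instead.
marcxml_style = (
    '<datafield tag="245">Zen and the Art of Motorcycle Maintenance</datafield>'
)

# Two documents following each other in one input: also rejected.
two_documents = "<gils><title>One</title></gils><gils><title>Two</title></gils>"

samples = [
    ("marc_style", marc_style),
    ("marcxml_style", marcxml_style),
    ("two_documents", two_documents),
]
for label, text in samples:
    print(label, is_well_formed(text))
```

Only the MARCXML-style sample is accepted. The other two are rejected by the parser, which matches the restrictions described for grs.marc and grs.xml.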

GRS-1 Canonical Input Format

Although input data can take any form, it is sometimes useful to describe the record processing capabilities of the system in terms of a single, canonical input format that gives access to the full spectrum of structure and flexibility in the system. In Zebra, this canonical format is an "SGML-like" syntax.

To use the canonical format, specify grs.sgml as the record type.

Consider a record describing an information resource (such a record is sometimes known as a locator record). It might contain a field describing the distributor of the information resource, which might in turn be partitioned into various fields providing details about the distributor, like this:

      <Distributor>
        <Name> USGS/WRD </Name>
        <Organization> USGS/WRD </Organization>
        <Street-Address>
          U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
        </Street-Address>
        <City> ALBUQUERQUE </City>
        <State> NM </State>
        <Zip-Code> 87102 </Zip-Code>
        <Country> USA </Country>
        <Telephone> (505) 766-5560 </Telephone>
      </Distributor>

The keywords surrounded by <...> are tags, while the sections of text in between are the data elements. A data element is characterized by its location in the tree that is made up by the nested elements. Each element is terminated by a closing tag, beginning with </ and containing the same symbolic tag-name as the corresponding opening tag. The general closing tag, </>, terminates the element started by the last opening tag. The structuring of elements is significant. The element Telephone, for instance, may be indexed and presented to the client differently, depending on whether it appears inside the Distributor element or inside some other structured data element, such as a Supplier element.
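
The tree structure and the general closing tag can be illustrated with a small parser. The sketch below is a simplified Python illustration, not Zebra's grs.sgml reader: it handles only plain opening tags, named closing tags and the general closing tag, and it ignores variant tags. The function names parse and paths, the Supplier element and its telephone number are made up for the example.

```python
import re

# An opening tag, a named closing tag, or the general closing tag </>.
TOKEN = re.compile(r"<([A-Za-z][\w-]*)>|</([A-Za-z][\w-]*)?>")


def parse(text):
    """Parse SGML-like text into a list of (tag, children) tuples.

    Children are stripped data strings or nested (tag, children) tuples.
    """
    root = ("#root", [])
    stack = [root]
    pos = 0
    for match in TOKEN.finditer(text):
        data = text[pos:match.start()].strip()
        if data:
            stack[-1][1].append(data)
        pos = match.end()
        open_name, close_name = match.group(1), match.group(2)
        if open_name is not None:
            node = (open_name, [])
            stack[-1][1].append(node)
            stack.append(node)
        else:
            if len(stack) == 1:
                raise ValueError("closing tag without an open element")
            # close_name is None for the general closing tag </>.
            if close_name is not None and close_name != stack[-1][0]:
                raise ValueError(f"expected </{stack[-1][0]}>, got </{close_name}>")
            stack.pop()
    if len(stack) != 1:
        raise ValueError(f"unclosed element <{stack[-1][0]}>")
    return root[1]


def paths(nodes, prefix=()):
    """Yield (path, data) pairs; path is the tuple of enclosing tag names."""
    for node in nodes:
        if isinstance(node, str):
            yield prefix, node
        else:
            tag, children = node
            yield from paths(children, prefix + (tag,))


record = """
<gils>
  <Distributor>
    <Name> USGS/WRD </>
    <Telephone> (505) 766-5560 </Telephone>
  </Distributor>
  <Supplier>
    <Telephone> (505) 555-0100 </>
  </Supplier>
</gils>
"""

for path, data in paths(parse(record)):
    print("/".join(path), "=", data)
```

The two Telephone elements end up with different paths (gils/Distributor/Telephone and gils/Supplier/Telephone), which is the information a system needs in order to treat them differently.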

Record Root

The first tag in a record describes the root node of the tree that makes up the total record. In the canonical input format, the root tag should contain the name of the schema that lends context to the elements of the record (see the section called “GRS-1 Internal Record Representation”). The following is a GILS record that contains only a single element. Strictly speaking, that makes it an illegal GILS record, since the GILS profile includes several mandatory elements. However, Zebra does not validate the contents of a record against the Z39.50 profile; it merely attempts to match up elements of a local representation with the given schema:

       <gils>
          <title>Zen and the Art of Motorcycle Maintenance</title>
       </gils>

Variants

Zebra allows you to provide individual data elements in a number of variant forms. Examples of variant forms are textual data elements which might appear in different languages, and images which may appear in different formats or layouts. The variant system in Zebra is essentially a representation of the variant mechanism of Z39.50-1995.

The following is an example of a title element which occurs in two different languages.

       <title>
       <var lang lang "eng">
       Zen and the Art of Motorcycle Maintenance</>
       <var lang lang "dan">
       Zen og Kunsten at Vedligeholde en Motorcykel</>
       </title>

The syntax of the variant element is <var class type value>. The available values for the class and type fields are given by the variant set that is associated with the current schema (see the section called “Variants”).

Variant elements are terminated by the general end-tag </>, by the variant end-tag </var>, by the appearance of another variant tag with the same class and value settings, or by the appearance of another, normal tag. In other words, the end-tags for the variants used in the example above could have been omitted.

Variant elements can be nested. The element

       <title>
       <var lang lang "eng"><var body iana "text/plain">
       Zen and the Art of Motorcycle Maintenance
       </title>

associates two variant components with the variant list for the title element.

Given the nesting rules described above, we could write

       <title>
       <var body iana "text/plain">
       <var lang lang "eng">
       Zen and the Art of Motorcycle Maintenance
       <var lang lang "dan">
       Zen og Kunsten at Vedligeholde en Motorcykel
       </title>

The title element above comes in two variants. Both have the IANA body type "text/plain", but one is in English, and the other in Danish. The client, using the element selection mechanism of Z39.50, can retrieve information about the available variant forms of data elements, or it can select specific variants based on the requirements of the end-user.
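
The effect of the nesting rules can be sketched in a few lines of Python. This is only an illustration of the outcome described above, not Zebra's internal data structure. It assumes that a new variant tag closes the most recent open variant with the same class and type, together with anything nested inside it. The function names variant_forms and select, and the event tuples, are hypothetical.

```python
def variant_forms(events):
    """Turn a flat sequence of events into (components, data) pairs.

    Each event is ("var", class, type, value) or ("data", text).
    """
    active = []  # currently open variant components, outermost first
    forms = []
    for event in events:
        if event[0] == "var":
            _, var_class, var_type, value = event
            # Assumption: a tag with the same class and type closes the
            # earlier one, and everything nested inside it, first.
            for index, (cls, typ, _value) in enumerate(active):
                if (cls, typ) == (var_class, var_type):
                    del active[index:]
                    break
            active.append((var_class, var_type, value))
        else:
            forms.append((tuple(active), event[1]))
    return forms


def select(forms, var_class, var_type, value):
    """Return the data of every form that carries the given component."""
    wanted = (var_class, var_type, value)
    return [data for components, data in forms if wanted in components]


# The title element from the example above, as a flat event list.
title_events = [
    ("var", "body", "iana", "text/plain"),
    ("var", "lang", "lang", "eng"),
    ("data", "Zen and the Art of Motorcycle Maintenance"),
    ("var", "lang", "lang", "dan"),
    ("data", "Zen og Kunsten at Vedligeholde en Motorcykel"),
]

forms = variant_forms(title_events)
print(select(forms, "lang", "lang", "dan"))
print(select(forms, "body", "iana", "text/plain"))
```

Selecting on the language component returns a single title, while selecting on the body type returns both, since the Danish variant replaces only the English language component and stays nested inside the same body variant.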

GRS-1 REGX And TCL Input Filters

In order to handle general input formats, Zebra allows the operator to define filters which read individual records in their native format and produce an internal representation that the system can work with.

Input filters are ASCII files, generally with the suffix .flt. The system looks for the files in the directories given in the profilePath setting in the zebra.cfg file. The record type for the filter is grs.regx.filter-filename (fundamental type grs, file read type regx, argument filter-filename).

Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The expressions are evaluated against the contents of the input record, and the actions normally contribute to the generation of an internal representation of the record.

An expression can be one of the following:

INIT: The action associated with this expression is evaluated exactly once in the lifetime of the application, before any records are read. It can be used in conjunction with an action that initializes tables or other resources that are used in the processing of input records.
BEGIN: Matches the beginning of the record. It can be used to initialize variables, etc. Typically, the BEGIN rule is also used to establish the root node of the record.
END: Matches the end of the record, when all of the contents of the record have been processed.
/reg/: Matches regular expression pattern reg from the input record. The operators supported are the same as for regular expression queries. Refer to the section called “Zebra Regular Expressions in Truncation Attribute (type = 5)”.
BODY: This keyword may only be used between two patterns. It matches everything between (not including) those patterns.
FINISH: The action associated with this expression is evaluated once, before the application terminates. It can be used to release system resources, typically ones allocated in the INIT step.

An action is surrounded by curly braces ({...}), and consists of a sequence of statements. Statements may be separated by newlines or semicolons (;). Within actions, the strings that matched the expressions immediately preceding the action can be referred to as $0, $1, $2, etc.

The available statements are:

begin type [parameter ... ]

Begin a new data element. The type is one of the following:

record: Begin a new record. The following parameter should be the name of the schema that describes the structure of the record, e.g. gils or wais (see below). The begin record call should precede any other use of the begin statement.
element: Begin a new tagged element. The parameter is the name of the tag. If the tag is not matched anywhere in the tagsets referenced by the current schema, it is treated as a local string tag.
variant: Begin a new node in a variant tree. The parameters are class type value.

data parameter

Create a data element. The concatenated arguments make up the value of the data element. The option -text signals that the layout (whitespace) of the data should be retained for transmission. The option -element tag wraps the data up in the tag. The use of the -element option is equivalent to preceding the command with a begin element command, and following it with the end command.

end [type]

Close a tagged element. If no parameter is given, the last element on the stack is terminated. The first parameter, if any, is a type name, similar to the begin statement. For the element type, a tag name can be provided to terminate a specific tag.

unread no

Move the input pointer to the offset of the first character that matched the expression given by no. Expressions in a rule are numbered from left to right: the first is number 0, the second is number 1, and so on.

The following input filter reads a Usenet news file, producing a record in the WAIS schema. Note that the body of a news posting is separated from the list of headers by a blank line (or rather, a sequence of two newline characters).

      BEGIN                { begin record wais }

      /^From:/ BODY /$/    { data -element name $1 }
      /^Subject:/ BODY /$/ { data -element title $1 }
      /^Date:/ BODY /$/    { data -element lastModified $1 }
      /\n\n/ BODY END      {
         begin element bodyOfDisplay
         begin variant body iana "text/plain"
         data -text $1
         end record
      }
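
To make the behaviour of the filter above concrete, here is a rough Python emulation of what it extracts from a news posting. It does not use Zebra at all, and Python's regular expression syntax differs from Zebra's. The function name news_to_record, the dictionary layout and the sample message are assumptions made for the illustration. As in the filter, the text matched by BODY corresponds to $1, which is capture group 1 here.

```python
import re

# One pattern per header rule; group 1 plays the role of BODY ($1).
HEADER_RULES = [
    ("name", re.compile(r"^From:(.*)$", re.MULTILINE)),
    ("title", re.compile(r"^Subject:(.*)$", re.MULTILINE)),
    ("lastModified", re.compile(r"^Date:(.*)$", re.MULTILINE)),
]


def news_to_record(message: str) -> dict:
    """Split a news posting into header elements and a plain-text body."""
    # Headers and body are separated by the first pair of newlines.
    headers, _, body = message.partition("\n\n")
    record = {"schema": "wais"}
    for element, pattern in HEADER_RULES:
        match = pattern.search(headers)
        if match:
            # Without -text, layout is not retained, so trim whitespace.
            record[element] = match.group(1).strip()
    # With -text, the layout of the body is kept exactly as read.
    record["bodyOfDisplay"] = {
        "variant": ("body", "iana", "text/plain"),
        "data": body,
    }
    return record


message = (
    "From: reader@example.org\n"
    "Subject: Zen and the Art of Motorcycle Maintenance\n"
    "Date: 1 Jan 1999 12:00:00 GMT\n"
    "\n"
    "First line of the posting.\n"
    "  Indented second line.\n"
)

record = news_to_record(message)
for key, value in record.items():
    print(key, "=", repr(value))
```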

If Zebra is compiled with support for Tcl enabled, the statements described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation mechanisms for modifying the elements of a record.