Virtual Solar Observatory

Data Description Schemata

A Critique

Our experience in defining and attempting to use XML schemata to describe the various data archives in the VSO testbed suggests to me the following clarification of the problems to be addressed.
  1. Each archive needs to be able to describe its searchable data in an abstract schema that can be uniquely mapped to its own search procedures. For a relational database this may involve describing all the searchable keywords and the meaningful ranges of associated values, or at least those values supporting VSO. (It is possible that only part of an archive may be revealed to VSO, with private components of the archive unindexed as it were, for whatever reasons apply.) For a flat database the schema may involve simply the range of syntactically meaningful names. Problems specific to such descriptions include:
  2. In order to support a common search tool and user interface, the syntactic meanings of the components of the individual schemata need to be coordinated. This is of course the heart of the interoperability problem. It is clear that similar or identical terms may be used by various archives, or even within a single archive in different contexts, with different syntactic meanings, and that different terms may be used with the same syntactic meaning. Obvious examples are certain of the keywords reserved by the FITS standard, defined in contexts not wholly appropriate to the particular data set in question. It is not clear to me how the translation to a common schema with a well-defined dictionary can be accomplished except by bilateral negotiation of definitions. We hope that by building a core vocabulary from representative testbed archives, we can minimize the number of terms to be added, or worse, redefined, as additional schemata are ingested.
  3. A separate schema may be required to describe the nature and functionality of the archive itself: data formats, access protocols, data accessibility, costs in both literal and network senses, sources of information, conditions of use, and so forth. These may also include `hidden' keywords such as the identity of the archive (always matched for queries directed to that archive). The same remarks as above about negotiating syntactic meanings apply to this problem as well.
  4. A schema based on the common dictionaries constructed above needs to be defined so that independent search tools can be constructed, or their capabilities integrated into existing search services. This is a comparatively trivial problem, as it should really only be necessary to construct a single simple search tool. Integration into existing services only involves either reflection of the translations involved in problem (2), or mirroring of the search services of the VSO example for terms without local meaning.
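
The translation step in problems (2) through (4) can be illustrated with a minimal sketch in Python. It is not a description of the VSO implementation. It assumes that each archive publishes a negotiated mapping from common-dictionary terms to its own local keywords. The class name ArchiveSchema, the common terms (observation_start, instrument, wavelength), the archive names, and the use of "archive" as the hidden keyword are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveSchema:
    """Hypothetical per-archive description for illustration only."""
    name: str
    # common-dictionary term -> local keyword (the negotiated translation)
    term_map: dict = field(default_factory=dict)

    def matches(self, query):
        """The hidden 'archive' keyword matches when absent or equal to our name."""
        return query.get("archive", self.name) == self.name

    def translate(self, query):
        """Rewrite a common-term query into local keywords.

        Returns (local_query, unmapped_terms). Terms with no local
        meaning are reported rather than silently dropped, so a caller
        could hand them to some other search service.
        """
        local, unmapped = {}, []
        for term, value in query.items():
            if term == "archive":
                continue  # hidden keyword, handled by matches()
            if term in self.term_map:
                local[self.term_map[term]] = value
            else:
                unmapped.append(term)
        return local, unmapped


# Two hypothetical archives that use different local keywords
# for the same common term.
archive_a = ArchiveSchema("archive_a", {"observation_start": "DATE-OBS",
                                        "instrument": "INSTRUME"})
archive_b = ArchiveSchema("archive_b", {"observation_start": "T_START"})

query = {"observation_start": "2003-01-01", "wavelength": 171}
for archive in (archive_a, archive_b):
    if archive.matches(query):
        print(archive.name, archive.translate(query))
```

Reporting the unmapped terms separately is one possible way to support the mirroring described in problem (4), where terms without local meaning are referred to another search service. Whether such terms should instead cause the archive to be skipped is a design choice this sketch leaves open.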