Virtual Solar Observatory
Data Description Schemata
A Critique
Our experience with defining and attempting to use XML schemata for
description of the various data archives in the VSO testbed suggests
to me the following clarification of the problems to be addressed.
- Each archive needs to be able to describe its searchable data in
an abstract schema that can be uniquely mapped to its own search procedures.
For a relational database this may involve describing all the searchable
keywords and the meaningful ranges of associated values, or at least those
values supporting VSO. (It is possible that only part of an archive may
be revealed to VSO, with private components of the archive unindexed as
it were, for whatever reasons apply.) For a flat database the schema
may involve simply the range of syntactically meaningful names. Problems
specific to such descriptions include:
  - Distinguishing continuous and discrete ranges of key values.
  - Describing dependencies: the ranges of values associated with
    certain keywords may depend on the values associated with other
    keywords. In some cases a particular keyword may make no sense
    when combined with another keyword-value combination.
  - There must be a means of describing ranges that change dynamically.
    An obvious example is the ending time of an ongoing series of
    observations.
- In order to support a common search tool and user interface, the
syntactic meanings of the components of the individual schemata need
to be coordinated. This is of course the heart of the interoperability
problem. It is clear that similar or identical terms may be used by
various archives, or even within a single archive in different contexts,
with different syntactic meanings, and different terms may be used with
the same syntactic meaning. Obvious examples are certain of the keywords
reserved by the FITS standard, defined in contexts not wholly appropriate
to the particular data set in question. It is not clear to me how the
translation to a common schema with a well-defined dictionary can be
accomplished except by bilateral negotiation of definitions. We hope
that by building a core vocabulary from representative testbed archives,
we can minimize the number of terms to be added, or worse, redefined,
as additional schemata are ingested.
- A separate schema may be required to describe the nature and functionality
of the archive itself: data formats, access protocols, data accessibility,
costs in both literal and network senses, sources of information, conditions of
use, and so forth. These may also include 'hidden' keywords such as the
identity of the archive (always matched for queries directed to that archive).
The same remarks as above about negotiating syntactic meanings apply to this
problem as well.
- A schema based on the common dictionaries constructed above needs to
be defined so that independent search tools can be constructed, or their
capabilities integrated into existing search services. This is a
comparatively trivial problem, as it should really only be necessary to
construct a single simple search tool. Integration into existing services
only involves either reflection of the translations worked out for the second
problem above (coordinating the meanings of terms across schemata),
or mirroring of the search services of the VSO example for terms without
local meaning.
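
The first two problems above can be made concrete with a small sketch.
What follows is only an illustration under stated assumptions, not a
proposed VSO design: the class names, the example archive, its keyword
ranges, and the mapping from common terms to local keywords are all
hypothetical. It shows one way to record continuous and discrete ranges,
a dependency that makes a keyword meaningless in some combinations, an
open-ended range for an ongoing series, and the translation of a query
phrased in common-dictionary terms into an archive's own keywords.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional, Union


@dataclass(frozen=True)
class ContinuousRange:
    """Closed interval of key values; upper=None leaves the range open-ended."""
    lower: Any
    upper: Optional[Any] = None

    def contains(self, value: Any) -> bool:
        return value >= self.lower and (self.upper is None or value <= self.upper)


@dataclass(frozen=True)
class DiscreteRange:
    """Enumerated set of permitted key values."""
    values: frozenset

    def contains(self, value: Any) -> bool:
        return value in self.values


Range = Union[ContinuousRange, DiscreteRange]


@dataclass
class Keyword:
    """A searchable keyword, its default range, and any dependent overrides."""
    name: str
    default: Optional[Range]
    # Each entry pairs a condition on other keywords with the range that then
    # applies; a range of None means the keyword makes no sense in that case.
    overrides: list[tuple[dict, Optional[Range]]] = field(default_factory=list)

    def range_for(self, query: dict) -> Optional[Range]:
        for condition, rng in self.overrides:
            if all(query.get(k) == v for k, v in condition.items()):
                return rng
        return self.default


# Hypothetical archive description, keyed by the archive's own keyword names.
EXAMPLE_SCHEMA = {
    "INSTRUME": Keyword("INSTRUME", DiscreteRange(frozenset({"imager", "magnetograph"}))),
    "WAVELNTH": Keyword(
        "WAVELNTH",
        ContinuousRange(171.0, 1700.0),
        overrides=[({"INSTRUME": "magnetograph"}, None)],
    ),
    # Ongoing series: no upper bound is recorded, so any time on or after
    # the start date passes the range check.
    "DATE-OBS": Keyword("DATE-OBS", ContinuousRange(datetime(1996, 1, 1))),
}

# Hypothetical mapping from common-dictionary terms to this archive's keywords.
LOCAL_TERMS = {"instrument": "INSTRUME", "wavelength": "WAVELNTH", "time": "DATE-OBS"}


def translate(query: dict) -> dict:
    """Rewrite a common-dictionary query into local keywords (KeyError if unmapped)."""
    return {LOCAL_TERMS[term]: value for term, value in query.items()}


def accepts(schema: dict, query: dict) -> bool:
    """True if every keyword in a local-keyword query is exposed and in range."""
    for name, value in query.items():
        keyword = schema.get(name)
        if keyword is None:
            return False  # keyword not revealed to the search service
        rng = keyword.range_for(query)
        if rng is None or not rng.contains(value):
            return False
    return True
```

With these definitions, accepts(EXAMPLE_SCHEMA, translate({"instrument":
"imager", "wavelength": 195.0})) is true, while the same query with
"magnetograph" is rejected because the wavelength keyword has no meaning
in that combination. A term with no entry in LOCAL_TERMS raises KeyError,
which corresponds to the case of a term without local meaning.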