Questionnaire Metadata Validation

This ticket is primarily about establishing policies for checking the validity of metadata produced by the CMIP5 questionnaire. In addition, it establishes policies for the partial or full generation of metadata outside of the questionnaire. The implementation of these policies is documented in ticket 254 and the associated wiki page.

Ticket 252 provides diagrams of the questionnaire development process, which help to illustrate where validation is required.

Questionnaire Content

The content and structure of the questionnaire are being driven by the following configuration metadata:

  • "Software" vocabulary defined by the scientists. This vocabulary consists of a model component hierarchy, parameter/property names for each component and the associated values of those parameters/properties. This information is being (heroically) extracted from a number of key scientists by the Metafor team, and in particular by Marie-Pierre. The information is captured in mindmaps which are stored and versioned in the metafor repository. In some cases there are parameters/properties that can only exist if certain other parameters/properties have a particular value. As it was considered too complex to support hierarchical parameters in the questionnaire, it was decided to also create and support a set of "flattened" mindmaps which do not have any hierarchical parameters/properties. These are also stored and versioned in the metafor repository. There is therefore an issue of how to keep these two sets of mindmaps consistent. Marie-Pierre is generating the flattened mindmaps from the original ones by hand, so consistency is currently ensured manually. The mindmaps use visual keys, e.g. bold font or a particular icon type, to encode the type of information that they capture. An agreed set of visual keys has been documented in ticket 244. As it is difficult to ensure that the flattened mindmaps conform to these rules, a validator has been written. Further, a translator has also been written that takes the flattened mindmap files (stored as xml) and translates them to a more formal xml structure. When new flattened mindmaps are generated they are checked using the validator and then translated to xml suitable for reading by the questionnaire.
  • "Activity" vocabulary and constraints. There is a set of defined CMIP5 experiments. These experiments and their relationships with each other (e.g. overlaps between experiments) are being captured in xml by Gerry et al, see ticket 250. I believe that this xml is being used to automatically configure the questionnaire.
    • Gerry, Sebastien and Charlotte are writing CMIP5 conformance documents, see ticket 251. These are CIM instances which describe the initial conditions and boundary conditions (numerical requirements) required for each CMIP5 experiment for it to conform to the CMIP5 specifications. The questionnaire will be used to associate each numerical requirement in the conformance document with either an external file or some code modification to a software component. (added by Charlotte 3rd Aug 2009)
  • "Shared" concepts. There are a number of concepts that are independent of component, in particular coupling. Marie-Pierre is responsible for capturing this information. I am not sure how these concepts are going to be stored and/or used to configure the questionnaire, see ticket 260.
  • Grid information. I'm not sure who is currently looking at this, if anyone.
  • Any remaining metadata is being taken directly from the CIM and manually embedded in the questionnaire.

For information, here is the protocol for generating the questionnaire software vocabulary. I think it is out of scope for this ticket but it is probably useful to capture in any case.

  • Mindmaps are used to capture the software CV from meetings with key scientists. A second set of "flattened" mindmaps is manually created. The mindmaps are versioned so that people know which version a flattened mindmap relates to. The flattened mindmaps are checked for conformance to some basic rules and modified if they do not conform. Conformant flattened mindmaps are translated into a more generic xml representation, the structure of which has been agreed with the questionnaire developers. The generic xml representation is read in by the questionnaire.
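As an illustration of the check-then-translate step in this protocol, here is a minimal sketch. It is not the validator or translator that has actually been written. It assumes the flattened mindmaps are FreeMind-style xml (nested node elements with a TEXT attribute), that a bold font marks a parameter, and that the target xml uses component, parameter and value elements. All of these are assumptions made for the example; the agreed visual keys are the ones in ticket 244.

```python
import xml.etree.ElementTree as ET


def is_parameter(node):
    """Assumed visual key: a bold node is a parameter; other nodes are components or values."""
    font = node.find("font")
    return font is not None and font.get("BOLD") == "true"


def validate(node, path=""):
    """Return a list of rule violations found at or below `node`."""
    errors = []
    name = node.get("TEXT", "")
    here = f"{path}/{name}"
    if not name.strip():
        errors.append(f"{here}: node has no text")
    children = node.findall("node")
    if is_parameter(node):
        # Assumed flattening rule: the children of a parameter are plain leaf values.
        for child in children:
            if is_parameter(child) or child.findall("node"):
                errors.append(f"{here}: value '{child.get('TEXT')}' is not a leaf")
    else:
        for child in children:
            errors.extend(validate(child, here))
    return errors


def translate(node):
    """Translate a conformant component node into the assumed target xml."""
    component = ET.Element("component", name=node.get("TEXT"))
    for child in node.findall("node"):
        if is_parameter(child):
            param = ET.SubElement(component, "parameter", name=child.get("TEXT"))
            for value in child.findall("node"):
                ET.SubElement(param, "value").text = value.get("TEXT")
        else:
            component.append(translate(child))
    return component


# Hypothetical flattened mindmap, for illustration only.
MINDMAP = """
<map>
  <node TEXT="Atmosphere">
    <node TEXT="Radiation">
      <node TEXT="SchemeType"><font BOLD="true"/>
        <node TEXT="two-stream"/>
        <node TEXT="other"/>
      </node>
    </node>
  </node>
</map>
"""

root = ET.fromstring(MINDMAP.strip()).find("node")
problems = validate(root)
if problems:
    print("\n".join(problems))
else:
    print(ET.tostring(translate(root), encoding="unicode"))
```

Translation is only attempted when validation reports no problems, which mirrors the order of the steps in the protocol.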

Checking Consistency with the CIM

Some of the above configuration information is being created, and used by the questionnaire, independently of the current version of the CIM. Therefore this configuration metadata should be checked to make sure that there is a one-to-one mapping between the concepts represented in the questionnaire configuration metadata and the concepts represented in the CIM and that the questionnaire captures all of the mandatory concepts in the CIM.

Any such checks will need to be done manually. In the first instance these checks can be based on the xml that is being generated to configure the questionnaire. If these are not consistent with the CIM then results of the questionnaire will not provide consistent xml. When versions of the questionnaire are available then, in addition, output of the questionnaire can be checked against the requirements of the CIM. Where there is a mismatch this should be reported to the group and resolved by either modifying the questionnaire concepts or the CIM concepts.

  • "Software" vocabulary.  Ticket 249 (Rupert) is responsible for checking that the software xml is consistent with the CIM.
  • "Activity" vocabulary.  Ticket 251 (Charlotte) is responsible for checking that the activity xml is consistent with the CIM.
  • "Shared" concepts. No ticket
  • "Grid" information. No ticket

Once the questionnaire is able to output data and is in its testing phase then we can look at translating questionnaire instances into CIM instances. At this point we will be able to check that it is possible to perform the translation and can feed back any issues.  Ticket 249 (Rupert) is responsible for translating the questionnaire output into CIM instances and within this ticket will also be responsible for checking consistency.

Checking Questionnaire Output Validity

The questionnaire will be responsible for ensuring that completed questionnaires are valid, in the sense that all required fields have been completed to make a consistent instance. Note that the questionnaire will probably also allow partially completed instances to be output (which will not be valid). Constraints such as the component hierarchy and parameter (property) names are enforced by the structure of the questionnaire. In general it is therefore not possible to provide invalid data through the questionnaire interface, only incomplete data.

However, some constraints cannot be directly enforced by the questionnaire. An example is where a parameter (property) can only exist if another parameter (property) has a certain value. As things stand, such constraints are encoded in notes in the Software Mindmaps. These constraints need to be enforced somehow. The proposed solution is to structure these constraints so that they can be automatically translated into schematron constraints (and/or some other technology if required). Such checks may be integrated into the questionnaire itself or may need to be performed after the fact. I do not know if the same issue occurs for the Activity, Shared or Grid information. This needs to be confirmed.
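To make the proposal concrete, here is a minimal sketch of how one structured "parameter A may only exist if parameter B has value V" constraint could be turned into a schematron rule. The Dependency record, the parameter names in the usage example and the component/parameter/value element layout are all hypothetical, since the questionnaire xml output format is not yet defined. The sketch also assumes that names contain no apostrophes, as they are placed directly into XPath string literals.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Dependency:
    """Parameter `dependent` may only appear when `controller` has value `required_value`."""
    component: str
    dependent: str
    controller: str
    required_value: str


def to_schematron_rule(dep):
    """Build one sch:rule that fires on the dependent parameter (assumed xml layout)."""
    context = f"component[@name='{dep.component}']/parameter[@name='{dep.dependent}']"
    # Evaluated with the dependent parameter as context node; '..' is its component.
    test = f"../parameter[@name='{dep.controller}']/value = '{dep.required_value}'"
    message = escape(
        f"{dep.dependent} is only allowed when {dep.controller} is '{dep.required_value}'"
    )
    return (
        f"<sch:rule context={quoteattr(context)}>\n"
        f"  <sch:assert test={quoteattr(test)}>{message}</sch:assert>\n"
        f"</sch:rule>"
    )


def to_schematron(deps):
    """Wrap the generated rules in a single-pattern ISO Schematron schema."""
    rules = "\n".join(to_schematron_rule(d) for d in deps)
    return (
        '<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">\n'
        f"<sch:pattern>\n{rules}\n</sch:pattern>\n"
        "</sch:schema>"
    )


# Hypothetical constraint, for illustration only.
print(to_schematron([Dependency("Radiation", "AerosolTypes", "AerosolsIncluded", "yes")]))
```

Keeping the constraints as structured records rather than free-text notes is what makes this kind of generation possible; the same records could equally drive some other technology if schematron turns out not to fit.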

Checking Questionnaire Input Validity

Assuming that the questionnaire will support the output of partially completed instances, it must also support the ability to input partially completed instances. This opens up the possibility of someone hand editing partially completed instances outside of the questionnaire. A more likely scenario (probably one to be used by the Met Office) is the automatic generation of partially completed instances from another representation.

Therefore it must be possible to validate partially completed instances before they are input into the questionnaire. The proposed solution is to define the structure of the questionnaire (xml) output as a schema. This will allow some basic structural and type checking. Any remaining constraints imposed by the configuration metadata described above should be used to automatically generate appropriate (probably schematron) constraints. Note that the configuration metadata constraints include checking for a valid component hierarchy, valid parameters, valid parameter values etc.
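The kind of check implied here can be sketched as follows. This is plain Python standing in for the generated constraints, not schematron, and both the VOCAB dictionary and the component/parameter/value layout are hypothetical stand-ins for the configuration metadata and the (as yet undefined) questionnaire xml. Missing parameters are deliberately not reported, because a partially completed instance is allowed to be incomplete.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: allowed sub-components and parameter values per component.
VOCAB = {
    "Atmosphere": {"children": {"Radiation"}, "parameters": {}},
    "Radiation": {
        "children": set(),
        "parameters": {"SchemeType": {"two-stream", "other"}},
    },
}


def check_partial(component, vocab, parent=None):
    """Return errors for anything present but invalid; absent items are not errors."""
    name = component.get("name")
    spec = vocab.get(name)
    if spec is None:
        return [f"unknown component '{name}'"]
    errors = []
    # Component hierarchy check.
    if parent is not None and name not in vocab[parent]["children"]:
        errors.append(f"'{name}' is not a valid child of '{parent}'")
    # Parameter name and parameter value checks.
    for param in component.findall("parameter"):
        pname = param.get("name")
        allowed = spec["parameters"].get(pname)
        if allowed is None:
            errors.append(f"{name}: unknown parameter '{pname}'")
            continue
        for value in param.findall("value"):
            if value.text not in allowed:
                errors.append(f"{name}.{pname}: invalid value '{value.text}'")
    for child in component.findall("component"):
        errors.extend(check_partial(child, vocab, parent=name))
    return errors


# Hypothetical partially completed instance containing one invalid value.
PARTIAL = """<component name="Atmosphere">
  <component name="Radiation">
    <parameter name="SchemeType"><value>three-stream</value></parameter>
  </component>
</component>"""

for problem in check_partial(ET.fromstring(PARTIAL), VOCAB):
    print(problem)
```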

The validity checks may be integrated into the questionnaire if the questionnaire supports this. If not they can be performed as a separate external step.

Checking Validity of Metadata Generated Outside the Questionnaire

It is possible that someone might want to modify a completed questionnaire instance without loading it back into the questionnaire, or even generate a valid instance completely independently of the questionnaire. For example, someone might decide to use GeoNetwork as the editing tool. Most of the checks that would need to be done to confirm that a valid instance had been created have been documented in the previous sections. However, in this case there would need to be the additional "is the document complete" check. Again the proposed strategy is to automatically generate schematron constraints; these will mostly be identical to those described above. However, in the first instance the constraints will be generated directly for CIM instances, as these are well defined, rather than for the as yet undefined questionnaire xml output format. These schematron constraints will be integrated into GeoNetwork.