CIM Differencing Tool Design Document

This page provides documentation for ticket:79.

It also gives an overview of one of the main components of WP5, the differencing tool. It is both a milestone for the METAFOR project and a deliverable to the Met Office. (A version of this document was delivered to the Met Office, but the document has continued to evolve since then.)

From the description of Met Office Deliverable 4:

Upon discovering multiple metadata instances that satisfy the search criteria, users should be able to compare those instances. This tool will report the relevant differences between 2 metadata instances. It is anticipated that the tool will support a hierarchical approach where high-level differences can be explored in greater detail.

From the description of METAFOR Deliverable D5.2:

Users may want to use the information within the CIM to understand the numerical/scientific differences between two configurations of a model, or they may want to understand differences in provenance between two models or datasets. This is akin to running a "diff" on the CIM.

Both these descriptions were written before work on the CIM Query Tool had begun. The Differencing Tool will be tightly integrated into the Query Tool, so the design choices made for the latter may constrain the design of the former and slightly shift the focus of the tool. This is explained in greater detail below.

1. Overview

One of the stated goals of the METAFOR project is to "direct scientists, policy makers and other data users to the correct information within the available repositories through better metadata description of model data [and] provide the users of data and models with the information they need to put those resources to optimum use, again through metadata." Over the course of the project so far, this has meant building a structured metadata language, the CIM, that can be used to describe the sort of information that the users of data and models need. The CIM Query Tool is essentially a search engine built on top of a repository of such metadata instances so that users can be "directed" to the correct information. Taking the search-engine analogy further, the Differencing Tool is a feature comparison tool for the results of a search, such as one finds in many online stores. There are, however, certain key points that distinguish the CIM Differencing Tool from a feature comparison across items found in, say, an online bookshop or a tyre distributor:

  • Firstly, there are likely to be several variants of comparisons depending upon the type of CIM instances being compared, the type of user doing the comparison, and the type of information being requested.
  • Secondly, the set of features being compared is potentially an order of magnitude larger than that typically found in online stores (hundreds versus tens). This large set of features may also change dynamically (as a result of specific constraints on the results added by the user, or simply the availability of features within the contents of the data repository) rather than being hard-coded into the system.
  • Thirdly, CIM instances have a very rich structure to draw upon. Sometimes, this is a liability; it may be preferable to simply display the presence or absence of a particular feature or to provide the user with a short bit of descriptive text about a particular feature, but finding that sort of high-level summary information from within the very detailed CIM can be difficult or even impossible. Other times, though, this structure may prove beneficial; it may be preferable to display several detailed and hierarchical points about a particular feature in order to show a user exactly how multiple instances differ.

This document will describe the scope of the Differencing Tool: what it is meant to do and for whom. It will then describe, at a high level (the level of the interface), how it is meant to work. Finally, it will describe, at a lower level, some specific implementation details.

2. Scope

The CIM Differencing Tool will allow users to compare a set of CIM instances. It will be built into the CIM Query Tool; users will compare instances chosen from the result set of a CIM Query. The Query Tool and, by extension, the Differencing Tool will operate on a single CIM Portal. Currently, during prototyping, this portal contains a single local CIM Repository and so all queries/comparisons are performed on the local server. A goal of the METAFOR project's Work Package 4 is to eventually implement a distributed server. The details of this distributed structure are still emerging. Once an interface has been formalised and agreed between WP4 ("services") and WP5 ("tools"), the Differencing Tool (and the rest of the Query Tool) code may have to be altered to support multiple portals.

The results of a comparison will be displayed as a table, mimicking the product comparison sites mentioned earlier. It would be convenient for these tables to be exportable to a printable format, but because no use-cases justify it, this is considered a very low priority and not essential for the tool. The comparison should be limited to a small number of instances due to screen/print real estate limitations. In fact, the prototype may initially only support a comparison between two instances. This number may be increased later if use-cases warrant it.

It should also be possible to limit the number of features being compared, both for reasons of real estate and because of the sheer complexity of analysing tables with hundreds of rows. It would be useful to present the user with some type of widget for adding and removing features in real time.

Since it would control the formatting of the results table, this widget may be a good place to allow a user to specify the complexity of what gets displayed in each table cell. As mentioned earlier, the CIM has the potential to provide very detailed structured information but this may be overwhelming or confusing for certain users and/or situations. Sometimes it will suffice to simply summarise. This widget should allow a user to switch between a simple view and a detailed view. Alternatively, this same variable complexity may be accomplished by having tree structures inside each table cell that can be expanded (for the detailed view) or collapsed (for the simple view). This allows a user to individually specify the complexity displayed for every feature, which is both an advantage and disadvantage - an advantage because it provides more control, a disadvantage because it requires more work.

The CIM Differencing Tool will not support a "generic" comparison, which ignores the structure of CIM instances and treats each one as a simple (albeit long and hierarchical) text file; such a comparison would be of dubious value to the end-user. Instead, the tool will compare only a subset of the features of a CIM instance. This subset will necessarily vary according to the types of CIM instances being compared (different document types have different XML structures) and the reasons that the comparison is being performed.

The different document types that users might want to compare include:

  1. simulations and ensembles of simulations
  2. models and model components
  3. datasets

For the prototype, it will be assumed that all instance documents were written as part of CMIP5. This will ensure a reasonable amount of similarity between CIM documents.

Some of the reasons that (Met Office) users might perform a comparison are listed as use-cases below. Once a query result set has been generated - results having been conveniently pre-screened for suitability (some repository holdings may be incomplete or have access restrictions) and pre-sorted by document type - users will be presented with the option to perform only one of the following types of comparisons on only the appropriate type of documents:

2.1 Use-Cases

Use Case 1: Diagnostic Availability Compare

This use-case allows users to display available diagnostics from a set of simulations (being run in support of the same experiment?). The rows of the comparison table should be made up of particular diagnostics. The cells should provide varying levels of information depending on what level of complexity has been selected from the Formatting Widget. The simplest sort of information would be a boolean value indicating whether or not the simulation provides that diagnostic. More complicated information would describe how it is provided: what grid it uses, how many vertical levels it spans, what units it is in, in what file format it is archived, etc. The sequence of events for this use-case is as follows:

  1. The user issues a query for simulations run in support of a particular experiment.
  2. The user is presented with a set of query results.
  3. The user selects the simulations they wish to compare and clicks on the "diagnostics compare" button.
  4. This brings up a comparison table with a default set of diagnostics and the default level of complexity.
  5. The user can add/remove diagnostics or change the level of complexity by using the Formatting Widget at the top of the comparison page.
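The simplest form of this table (a boolean per cell) might be sketched as follows. This is only an illustration under stated assumptions: each simulation is reduced to a hypothetical set of diagnostic names, and the simulation and diagnostic names are invented rather than taken from any CIM repository.

```python
def availability_table(simulations, diagnostics):
    """Return a header plus one row per diagnostic, with a boolean per simulation.

    `simulations` maps a (hypothetical) simulation name to the set of
    diagnostic names it provides; `diagnostics` lists the rows to display.
    """
    names = sorted(simulations)
    header = ["diagnostic"] + names
    rows = [[d] + [d in simulations[n] for n in names] for d in diagnostics]
    return header, rows


# Invented example data; real values would come from CIM simulation documents.
simulations = {
    "sim-A": {"tas", "pr", "psl"},
    "sim-B": {"tas", "psl"},
}
header, rows = availability_table(simulations, ["tas", "pr"])
for row in [header] + rows:
    print(row)
```

Adding or removing a diagnostic in the widget then amounts to changing the list passed as `diagnostics` and rebuilding the rows.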

Use Case 2: Conformance Compare

This use-case allows users to compare how different simulations have conformed to experimental requirements. This means that the difference table should include as one dimension each of the numerical requirements belonging to a CMIP5 experiment. The other dimension should include the method of conformance by which a selected simulation satisfied that requirement. The sequence of events for this use-case is as follows:

  1. The user issues a query for simulations run in support of a particular experiment.
  2. The user is presented with a set of query results.
  3. The user selects the simulations they wish to compare and clicks on the "conformance compare" button.
  4. This brings up a comparison table with a default set of numerical requirements and the default level of complexity.
  5. The user can add/remove requirements or change the level of complexity by using the Formatting Widget at the top of the comparison page.
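A minimal sketch of the two-dimensional conformance table described above, with requirements in one dimension and simulations in the other. The requirement names, simulation names and conformance methods are all invented for illustration, and the flat dictionary layout is an assumption rather than the CIM's actual structure.

```python
def conformance_table(requirements, conformances):
    """Map each requirement to the conformance method used by each simulation.

    `conformances` maps simulation name -> {requirement name: method}.
    A missing entry becomes None, meaning no conformance was recorded.
    """
    return {
        req: {sim: methods.get(req) for sim, methods in conformances.items()}
        for req in requirements
    }


def differing_requirements(table):
    """List the requirements that the simulations did not satisfy identically."""
    return [req for req, cells in table.items() if len(set(cells.values())) > 1]


# Invented example data.
requirements = ["initial conditions", "CO2 forcing"]
conformances = {
    "sim-A": {"initial conditions": "input file", "CO2 forcing": "code change"},
    "sim-B": {"initial conditions": "input file"},
}
table = conformance_table(requirements, conformances)
print(differing_requirements(table))
```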

Use Case 3: Model Property Compare

This use-case allows users to display model properties - both those being simulated like grid dimensions ("scientific" properties) and those affecting how the others are simulated like G ("numerical" properties). The act of generating a comparison table is the same as with the previous two use-cases. This differencing use-case follows on very closely from the query use-case "find all models with a particular (sub)component or (sub)property." A possible sequence of events is as follows:

  1. The user issues some query on model components: "Find all models run in support of the CMIP5 experiment RCP8.5 that have an OceanBioGeoChemistry component"
  2. The user is presented with a set of query results.
  3. This set could be further constrained by refining the query to include only those models which have, for example, "dynamic vegetation modelling."
  4. The user selects the models they wish to compare and clicks on the "property compare" button.
  5. This brings up a comparison table with a default set of properties and the default level of complexity.
  6. The user can add/remove properties or change the level of complexity by using the Options Widget at the top of the comparison page. In this example, the user would ensure that properties relating to vegetation modelling were included so that the comparison could indicate things like what vegetation types were supported and how they were modelled.
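The "(sub)component" part of this sequence can be sketched as a recursive search followed by a property lookup. The nested-dictionary layout, the model names and the property names and values below are hypothetical stand-ins for CIM model documents; only the component name OceanBioGeoChemistry comes from the example above.

```python
def find_component(component, name):
    """Depth-first search for a (sub)component called `name`."""
    if component["name"] == name:
        return component
    for child in component.get("children", []):
        found = find_component(child, name)
        if found is not None:
            return found
    return None


def property_table(models, component_name, properties):
    """One row per property, one cell per model that has the component."""
    table = {p: {} for p in properties}
    for model in models:
        component = find_component(model, component_name)
        if component is None:
            continue  # such a model would not be in the query results
        for p in properties:
            table[p][model["name"]] = component.get("properties", {}).get(p)
    return table


# Invented example data.
models = [
    {"name": "model-A", "children": [
        {"name": "OceanBioGeoChemistry", "properties": {"tracers": "12"}}]},
    {"name": "model-B", "children": [{"name": "Atmosphere"}]},
    {"name": "model-C", "children": [
        {"name": "Ocean", "children": [
            {"name": "OceanBioGeoChemistry",
             "properties": {"tracers": "7", "scheme": "X"}}]}]},
]
table = property_table(models, "OceanBioGeoChemistry", ["tracers", "scheme"])
print(table)
```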

It is clear here that the Query Tool and the Differencing Tool complement each other and work together to direct a user to relevant information.

This set of use-cases may grow as the prototype is tested and feedback is generated.

Regardless of the number of use-cases to be supported, and despite what was said earlier about not supporting "generic" comparisons, each type of comparison should have the same basic look-and-feel about it to make it easier to use. That is, each comparison result page should contain a brief high-level summary of relevant differences, an "options widget" to limit or otherwise constrain the set of features being compared, and a grid-like viewer with features in one dimension and instances in the other.

[Figure: a mockup user interface]

3. High-Level Design (what functionality should be supported)

Initially, I anticipated writing a "dumb" Differencing Tool. Such a tool would treat CIM instances as generic XML documents; no intelligence about the design of the CIM nor the meaning/relevance of the elements and attributes within the CIM would be incorporated into the tool. This could be implemented as a clickable tree structure embedded within a "tkdiff-like" interface.

[Figure: a tkdiff-like interface; the tool would look like this but with trees]

However, it's not clear how useful a text-based difference of two complete CIM instances would be. It seems to miss the point of all of the meaningful structure embedded within metadata. It would also be difficult to navigate through a tree because of the great length of CIM instances. And it only allows users to difference two instances at a time. Furthermore, it does not map well onto the use-cases described above. Finally, there are standard tools (such as Oxygen) that already exist which can offer some of this functionality.
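To make the limitation concrete, a generic comparison of this kind might look like the sketch below. It uses Python's standard `xml.etree.ElementTree` module purely for illustration (it is not the technology proposed in this document), and the element names in the sample documents are invented rather than taken from the CIM schema.

```python
import xml.etree.ElementTree as ET


def flatten(element, path=""):
    """Yield (path, text) pairs for every element, ignoring any CIM semantics.

    Limitation of this sketch: repeated sibling elements with the same tag
    share a path, so only the last one survives when collected into a dict.
    """
    here = f"{path}/{element.tag}"
    yield here, (element.text or "").strip()
    for child in element:
        yield from flatten(child, here)


def generic_diff(xml_a, xml_b):
    """Return {path: (text in a, text in b)} for every path that differs."""
    a = dict(flatten(ET.fromstring(xml_a)))
    b = dict(flatten(ET.fromstring(xml_b)))
    return {
        p: (a.get(p), b.get(p))
        for p in sorted(a.keys() | b.keys())
        if a.get(p) != b.get(p)
    }


# Invented example documents.
doc_a = "<model><name>A</name><grid><levels>38</levels></grid></model>"
doc_b = ("<model><name>B</name><grid><levels>38</levels>"
         "<type>regular</type></grid></model>")
diff = generic_diff(doc_a, doc_b)
print(diff)
```

The result is a flat list of changed paths with no indication of which differences matter scientifically, which is the weakness noted above.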

And yet, it may be useful to compare known fragments of XML in cases where a more "intelligent" comparison (like the structured table at the bottom of the mockups shown in this document) is lacking. If there is a way to easily integrate a lightweight XML comparison into the Differencing Tool, it will be used alongside the main approach: a set of small, focused differences corresponding to the identified use-cases will be implemented, each returning a table of relevant differences.

Above you can see what a table of relevant differences might look like.

The structure of the comparison page should remain constant even as the types of comparisons being performed differ. This structure will include high-level summary information, an "Options Widget" to allow users to constrain the set of features being compared and the complexity of the results being reported, and the comparison table itself.

There are two possible approaches for displaying summary information. The first simply lists all the relevant features a document has and then provides a percentage of commonality across all documents.
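One possible reading of "percentage of commonality" is the share of all listed features that every document has. The sketch below assumes that definition, which this document does not fix, and uses invented document and feature names.

```python
def commonality(documents):
    """Percentage of all listed features that are present in every document.

    `documents` maps a document name to an iterable of feature names.
    The definition (intersection over union) is an assumption of this sketch.
    """
    feature_sets = [set(features) for features in documents.values()]
    if not feature_sets:
        return 0.0
    union = set.union(*feature_sets)
    shared = set.intersection(*feature_sets)
    return 100.0 * len(shared) / len(union) if union else 100.0


# Invented example data: 2 of the 4 distinct features are shared.
documents = {
    "doc-A": {"tas", "pr", "psl"},
    "doc-B": {"tas", "psl", "ua"},
}
print(commonality(documents))
```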

The other approach allows users to select a "master" document against which the other documents are compared.

This latter approach is useful for more complex differencing types. For example, consider the conformance differencing where simulations are compared with respect to how they conform to the requirements of a particular experiment. In this case both a master simulation and experiment can be selected to make the comparisons more useful.
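The "master" approach could be sketched as a comparison of each document against one chosen reference. The feature sets and names below are invented, and reducing a document to a flat set of features is a simplifying assumption.

```python
def compare_to_master(master, others):
    """For each other document, list the features it lacks or adds
    relative to the master document's feature set."""
    return {
        name: {
            "missing": sorted(master - features),
            "extra": sorted(features - master),
        }
        for name, features in others.items()
    }


# Invented example data.
master = {"tas", "pr", "psl"}
others = {"sim-B": {"tas", "psl", "ua"}, "sim-C": {"tas", "pr", "psl"}}
summary = compare_to_master(master, others)
print(summary)
```

For conformance differencing, the same shape would apply with the selected experiment's requirements in place of the feature set.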

Part of the Options Widget should allow users to select from available features which ones to display in the table. A pair of listboxes with swappable items, one with the features to display and one with the features to hide, can implement this.

Notice that details of the selected feature (diagnostic20, in this case) can be viewed in the textarea in the middle of the widget. This rightfully assumes that just providing the name of the feature is not enough for users to make an informed decision about whether or not to include it.
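The state behind such a pair of listboxes is small. The sketch below models it in Python for illustration only (the actual widget is expected to be built with JavaScript); the class and method names are hypothetical, and the description text is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSelector:
    """State of two swappable listboxes plus the detail text for each feature."""
    shown: list = field(default_factory=list)
    hidden: list = field(default_factory=list)
    descriptions: dict = field(default_factory=dict)

    def show(self, feature):
        """Move a feature from the 'hide' listbox to the 'display' listbox."""
        self.hidden.remove(feature)
        self.shown.append(feature)

    def hide(self, feature):
        """Move a feature from the 'display' listbox to the 'hide' listbox."""
        self.shown.remove(feature)
        self.hidden.append(feature)

    def describe(self, feature):
        """Text for the detail area when a feature is selected."""
        return self.descriptions.get(feature, "")


selector = FeatureSelector(
    shown=["diagnostic1"],
    hidden=["diagnostic20"],
    descriptions={"diagnostic20": "placeholder description"},
)
selector.show("diagnostic20")
print(selector.shown, selector.hidden)
```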

Another part of the Options Widget (shown in the right pane in the above sample) could allow users to set how much information is displayed in each cell of the comparison table. Showing the simple presence or absence of a feature is straightforward. Showing the complete hierarchical structure of that part of the CIM corresponding to a feature is feasible. However, showing something in between is more complicated. For that reason, for the prototype, users will be allowed to select between these two extremes of a simple view and a detailed view only (think of a pair of radio-buttons rather than a slider bar).
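The two extremes could be rendered from the same underlying value, as in this sketch. It assumes a feature is held as a nested dictionary, or `None` when absent; the keys and values shown are invented.

```python
def render_cell(value, detailed=False, indent=0):
    """Render one table cell in either the simple or the detailed view."""
    if value is None:
        return "no"
    if not detailed:
        return "yes"  # simple view: presence only
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            if isinstance(child, dict):
                # Nested structure: print the key, then its children indented.
                lines.append(" " * indent + f"{key}:")
                lines.append(render_cell(child, True, indent + 2))
            else:
                lines.append(" " * indent + f"{key}: {child}")
        return "\n".join(lines)
    return " " * indent + str(value)


# Invented example feature.
feature = {"grid": "regular", "levels": {"count": 38, "units": "hPa"}}
print(render_cell(feature))
print(render_cell(feature, detailed=True))
```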

4. Low-Level Design (what technologies should be used)

The Differencing Tool will be tightly integrated into the Query Tool. It is as a result of running queries against the CIM archives that users choose the set of CIM instances to difference. The Query Tool is written in a combination of jQuery and XQuery (and, of course, HTML and CSS). The jQuery code builds the web-forms and handles the user interaction. Queries to the backend database are performed by XQuery. CIM Instances are stored as XML documents in an eXist database.

Thus far, this document has concentrated on the user interface for performing and viewing differences. Ideally, the UI will be separate from the services driving the differencing. It is expected that the front-end of the Differencing Tool will be written in JavaScript (jQuery), and that XML technologies (XQuery, XSLT) will be used to transform [subsets of] CIM instances into tables.
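The transformation step can be illustrated with a small stand-in. The sketch below uses Python's standard `xml.etree.ElementTree` rather than the XQuery/XSLT named above, simply to show the shape of the operation: selecting a subset of each instance by path and emitting an HTML table. The element names and paths are invented and do not reflect the CIM schema.

```python
import xml.etree.ElementTree as ET


def instances_to_table(instances, feature_paths):
    """Build an HTML table: one row per feature path, one column per instance.

    `instances` maps an instance name to its XML text; `feature_paths`
    are ElementTree paths relative to each document's root element.
    """
    roots = {name: ET.fromstring(xml) for name, xml in instances.items()}
    table = ET.Element("table")
    header = ET.SubElement(table, "tr")
    ET.SubElement(header, "th").text = "feature"
    for name in roots:
        ET.SubElement(header, "th").text = name
    for path in feature_paths:
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "td").text = path
        for root in roots.values():
            # An absent feature yields an empty cell.
            ET.SubElement(row, "td").text = root.findtext(path, default="")
    return ET.tostring(table, encoding="unicode")


# Invented example instances.
instances = {
    "model-A": "<model><grid><levels>38</levels></grid></model>",
    "model-B": "<model><grid><levels>60</levels></grid></model>",
}
html = instances_to_table(instances, ["grid/levels"])
print(html)
```

Keeping this step separate from the page logic matches the stated aim of separating the UI from the services that drive the differencing.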

5. References / Notes