
CIM sequence document & diagrams

notes for ticket 167

Allyn

summary

Here are my working notes for coming up with a sequence document & diagrams for the CIM.


Sequencing Report

This report describes the different stages of experiment formation, software modeling, and data production/processing at which CIM documents can be created, extended, and/or modified. A CIM document is illustrated here as an XML document conforming to the CIM Schema (although it may be stored in a different format, such as database records, word-processor documents, or text files). Creating a new CIM document is a relatively straightforward process. Modifying an existing document is slightly more complicated because users may want to record the fact that an instance has changed and highlight the differences between the new and old versions.

The bulk of this document is organized as a series of use-cases. First, however, an overview of the structure of the CIM is provided, including one possible sequence of creating and modifying documents.

Overview

A “complete” description of a climate modeling process is represented as a set of linked documents within the CIM. Each type of CIM document corresponds to a high-level primary artifact used in describing climate modeling activities, software, and datasets.

The CIM can support the following documents:

  • NumericalExperiment
  • SimulationComposite
  • SimulationRun
  • Ensemble
  • DataProcessing
  • DataObject
  • GridSpec
  • ModelComponent
  • ProcessorComponent
  • Deployment

Every document has a standard set of features: a unique id, a simple integer version, an author, and a creation date. These are all properties of the CIM document, not the artifact that the document is describing.

In addition, each document can expose its genealogy via the Shared::Genealogy class. This includes one or more references to other documents and a choice of relationship types. Relationship types vary according to the type of document. Sample relationship types are “previousVersion” or “extensionOf.” [A complete set of relationship types for all the different documents is still lacking as of CIM v1.1.]

Finally, each document can have a number of quality records associated with it. These provide qualitative or quantitative commentary about the artifact being described by the document.
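
The standard features described above can be pictured with a small Python sketch. This is only an illustration, not the CIM schema: the class names CimDocument and Relationship, the field names, the author, and the dates are all hypothetical, and the sketch assumes that a new version of a document keeps the id of its predecessor.

```python
from dataclasses import dataclass, field
from datetime import date
from uuid import uuid4


@dataclass
class Relationship:
    """One genealogy entry (hypothetical shape, not the CIM schema)."""
    kind: str            # e.g. "previousVersion" or "extensionOf"
    target_id: str       # id of the related document
    target_version: int  # version of the related document


@dataclass
class CimDocument:
    """Features shared by every document; they describe the document itself."""
    author: str
    creation_date: date
    version: int = 1
    doc_id: str = field(default_factory=lambda: str(uuid4()))
    genealogy: list = field(default_factory=list)
    quality: list = field(default_factory=list)  # free-form quality records


# A first version, then a second version that points back at the first.
v1 = CimDocument(author="A. Modeller", creation_date=date(2010, 1, 15))
v2 = CimDocument(
    author="A. Modeller",
    creation_date=date(2010, 3, 1),
    version=2,
    doc_id=v1.doc_id,  # assumption: the id is stable across versions
    genealogy=[Relationship("previousVersion", v1.doc_id, v1.version)],
)
```

Keeping the id stable and incrementing only the version is one possible design; it lets a reference pin a document with an (id, version) pair.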

Where appropriate, documents can be linked to one another via references. In addition to the id and version mentioned above, each document should have a particular URI identifying it within a CIM Repository. This can be referred to using the XML Linking Language (XLink). Sub-elements within documents can be referenced by appending XPath expressions. The structure of a reference is shown below:

<xs:element name="someDocument" type="someDocumentType">
  <xs:complexType>
    <xs:sequence>
      …
      <!-- here is a reference to another element -->
      <xs:element name="referenceToSomeOtherDocumentElement">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" minOccurs="0" type="xs:string"/>
            <xs:element name="description" minOccurs="0" type="xs:string"/>
          </xs:sequence>
          <xs:attribute name="version" use="optional" type="xs:integer"/>
          <xs:attribute ref="xlink:href" use="optional"/>
        </xs:complexType>
      </xs:element>
      …
    </xs:sequence>
  </xs:complexType>
</xs:element>

An example reference might look something like:

<referenceToSomeOtherDocument xlink:href="http://www.cim.org/URI-to-some-other-document/path-to-sub-element" version="1">
  <name>referencedThingy</name>
  <description>this reference to a referencedThingy documents the relationship between someDocument and someOtherDocument</description>
</referenceToSomeOtherDocument>
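
As a rough sketch of how a tool might read such a reference, the following Python uses the standard library's ElementTree. The helper name read_reference is hypothetical, and the sample assumes the xlink prefix is bound to the standard XLink namespace (the example above omits the namespace declaration).

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"  # standard XLink namespace

# Same shape as the example reference, with the namespace declared explicitly.
SAMPLE = f"""
<referenceToSomeOtherDocument xmlns:xlink="{XLINK_NS}"
    xlink:href="http://www.cim.org/URI-to-some-other-document/path-to-sub-element"
    version="1">
  <name>referencedThingy</name>
  <description>relationship between someDocument and someOtherDocument</description>
</referenceToSomeOtherDocument>
"""


def read_reference(xml_text):
    """Return the href, version, name and description of one reference element."""
    elem = ET.fromstring(xml_text.strip())
    version = elem.get("version")
    return {
        # Namespaced attributes are addressed as "{namespace}localname".
        "href": elem.get(f"{{{XLINK_NS}}}href"),
        "version": int(version) if version is not None else None,
        "name": elem.findtext("name"),
        "description": elem.findtext("description"),
    }


ref = read_reference(SAMPLE)
```

Both attributes are optional in the schema, so the helper returns None for whichever is absent rather than raising an error.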

Sequence

The use-cases discussed later can occur in any order; creating a series of CIM documents is not an inherently ordered process. In fact, multiple documents can be created and updated independently of one another. However, before looking at the specific use-cases, I will describe one possible sequence of document manipulation as a way of providing an overview of how (and when) individual documents are related to each other.

Fig 1: high-level sequence diagram

A climate prediction experiment description exists independently of the simulations being run in support of that experiment, the software models used to implement those simulations, and the data generated by that software. For instance, CMIP5 has prescribed a set of experiments to be run in the near future. To record this in the CIM, instances of Activity::NumericalExperiment will be created. This includes descriptions of the Activity::NumericalRequirements.

Some of those requirements will be to use particular datasets as input (i.e., Forcings with a prescribed frequency) or to produce particular datasets as output. These can be described as Data::DataObjects. Some of those DataObjects can already exist in the CIM. Others may need to be created. Some may refer to actual archived data, though this is not necessary.

Once experimental requirements have been produced and published, scientists at various modeling centers begin to think about the types of simulations they are going to run for those experiments. Instances of Activity::Simulation may be created at this point. Depending on the nature of the simulation, this may be completely new, or it may be a new run for an existing collection/set of simulations. Describing a Simulation includes describing (via the Activity::Conformance class) how the experimental requirements will be satisfied – generally, this is by using particular DataObjects as mentioned above.

At some point software will be configured to implement those simulations. Rather than starting from scratch, this will typically consist of tweaking an existing model. If a CIM instance exists for that previous model then it will likely be used as the basis for the current model description. Depending on the circumstances, it may simply be used as a template and then changed as needed, or else it may be explicitly referenced as an ancestor model and then a new instance created (so that end-users of the CIM can easily trace the evolution of a model). Additionally, climate models are typically made up of several separate model components. Each of these is represented as a separate document in the CIM, and those documents are linked together into a hierarchy with the root document representing the entire coupled model (and being the point-of-reference for the simulation that the model is implementing).

Eventually, models begin to generate data. If the Data::DataObject describing that data already exists in the CIM, then it can simply be pointed to by the model and/or simulation. The content of the dataset is described in a Data::DataContent element. This includes the type of aggregation, frequency, units, and topic (i.e., climate variable) being recorded. If appropriate (if the dataset is being archived), then a DataStorage element should be added to the DataObject giving the format and location of the dataset. Datasets can be organized into logical hierarchies in two ways. First, a single DataObject can contain multiple instances of DataContent. This supports cases where a single stored artifact contains several climate variables. Second, a DataObject can contain other nested DataObjects. This is a way of grouping logically related datasets – for instance, all DataObjects used by a particular SimulationRun.

Some of the data that is generated by models may need to be transformed (by a post-processor) before satisfying the original NumericalRequirements. That transformation can itself be represented as a separate Software::ProcessorComponent, which is described in its own CIM document.

Standard grid descriptions can be stored as Grids::GridSpec documents. These are semi-formal descriptions of grid coordinate systems. Once a GridSpec document can be located in a CIM repository, it may be referenced by other CIM documents: SoftwareComponents can refer to the grids that they support and individual ComponentProperties can point to the grids that they are mapped onto.

None of the instances described above have to be “finished” before they are archived into a CIM repository. Presumably, METAFOR will recommend specific processes for creating CIM instances, such as the CMIP5 questionnaire, to ensure that the content of those instances conforms to some minimum standard.

A general guideline is that if there is published work based on the simulation/software/data described by a CIM document, then refinements to that document should be implemented as a new version of the document. Examples of published work include archived DataObjects (i.e., objects with fully-specified DataStorage elements) and scientific articles which refer to CIM documents.

Use-Cases

Use-Case 1: Describe a Climate Experiment

Scenario 1: Describe an Experiment from Scratch

In this scenario, the user will create an Activity::NumericalExperiment document in order to describe either a personal experiment run by a scientist or group of scientists or a published experiment such as those prescribed by CMIP5. A NumericalExperiment provides a rationale for different NumericalActivities later on.

  1. The user fills in the standard CIM document information: an author, a creation date (the date the document was created, not the date the experiment was created/run), and a version number. A unique ID is generated for the document.
  2. The user describes the shortName (i.e., the abbreviation that identifies it) and the longName (i.e., a meaningful text description of it) of the experiment. Additionally, the user should describe in text why the experiment is being run including, if appropriate, what hypotheses are being tested.
  3. The user describes the time period that the experiment is meant to cover. This is done by simply specifying a start and end time (both are xs:dateTime elements).
  4. Next, the user describes the various requirements of this experiment. These are implemented as a set of Activity::NumericalRequirements. Related NumericalRequirements may be bundled together hierarchically. Generally, a NumericalRequirement may be to use particular SpatioTemporalConstraints – spatial and temporal resolution, output frequency, and aggregation method – with particular input or output datasets.
    1. If the dataset specified by the NumericalRequirement already exists as a Data::DataObject in the CIM, then the requirement can include a fully-specified reference to that dataset (a “fully-specified” reference is one whose xlink:href and version attributes point to an existing CIM element).
    2. If the dataset does not already exist then the reference will be under-specified, which means that the name and description elements should try to describe the dataset. This situation is to be discouraged; ideally, a user who encounters this situation should describe the dataset as a new DataObject [Use-Case 5].
  5. Finally, the user submits the NumericalExperiment document to a CIM Repository.
    1. At the very least, a NumericalExperiment must have all its document-specific attributes, plus its text names and description, a timing profile, and one NumericalRequirement; an experiment without at least one requirement cannot meaningfully constrain any activities run in support of it.
    2. It is expected that – as the lifecycles of experiments, the simulations run in support of them, and the data generated by the components implementing those simulations can span several months – a given experiment document will evolve over time and will therefore need to be updated as more information about that experiment becomes available.
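
The minimum content listed in step 5 could be checked before submission with something like the following Python sketch. It assumes the experiment is held as a plain dictionary; the key names and the sample values are hypothetical and are not taken from the CIM schema.

```python
# Keys are illustrative only; they mirror the minimum content listed above.
REQUIRED_FIELDS = [
    "id", "version", "author", "creationDate",   # document-specific attributes
    "shortName", "longName", "description",      # text names and description
    "startTime", "endTime",                      # timing profile
]


def missing_experiment_fields(experiment):
    """Return the names of the minimum items that are absent or empty."""
    missing = [key for key in REQUIRED_FIELDS if not experiment.get(key)]
    if not experiment.get("requirements"):       # at least one requirement
        missing.append("requirements")
    return missing


draft = {
    "id": "exp-001", "version": 1, "author": "A. Modeller",
    "creationDate": "2010-01-15",
    "shortName": "ctl", "longName": "control experiment",
    "description": "baseline against which perturbed runs are compared",
    "startTime": "1850-01-01T00:00:00", "endTime": "2000-01-01T00:00:00",
    "requirements": [],
}
problems = missing_experiment_fields(draft)
```

Here the draft is complete except for its requirements, so the check reports only that item; a repository could refuse the submission until the list comes back empty.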

Scenario 2: Update an Experiment

An experiment document requires updating whenever the requirements increase and/or become better-specified.

  1. First, the user locates the CIM document describing the experiment they wish to modify.
  2. Then the user adds the desired content to that document.
    1. If the user is simply adding more detail as described above, then there is no need to re-version the document. It can be updated and then replaced in the repository.
    2. If, however, the user is changing detail about the experiment, and other CIM documents reference that experiment, then a new version should be created.
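
The decision in step 2 can be written down as a small Python predicate. This is a sketch of the guideline only; the function and argument names are hypothetical.

```python
def experiment_update_needs_new_version(changes_existing_detail, referencing_documents):
    """Apply the guideline for updating an experiment document.

    Adding detail never forces a new version. Changing existing detail does,
    but only when at least one other CIM document references the experiment.
    """
    return bool(changes_existing_detail and referencing_documents)


# Adding detail to a referenced experiment: update and replace in the repository.
adding = experiment_update_needs_new_version(False, ["simulation-42"])
# Changing detail of a referenced experiment: create a new version.
changing = experiment_update_needs_new_version(True, ["simulation-42"])
```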

Use-Case 2: Describe a Climate Simulation

Scenario 1: Describe a Simulation from Scratch

In this scenario, the user will create an Activity::Simulation document in order to describe a climate simulation. Typically, a simulation is run in support of an Activity::NumericalExperiment and implemented by a Software::ModelComponent.

A Simulation can be either a SimulationComposite or a SimulationRun – the former being an aggregation of simulations, and the latter being a “leaf node” with a one-to-one correspondence to Software::ModelComponents.

  1. First, the user fills in the standard CIM document information: an author, a creation date, a version number, and an ID.
  2. Then the user describes the timing profile of the simulation. As with the experiment, this is simply a description of the time period that the simulation models. This may be realized by a combination of SimulationComposites and SimulationRuns, each of which has a different startPoint and endPoint.
  3. Then the user describes the inputs and outputs of this simulation. This includes the ancillary forcings, restart dumps, and other initial conditions, as well as the climate variables that are to be produced. [The current status of the activity package has a mixture of externally referenced DataObjects, locally described DataSets, and abstract Forcings (which associate a frequency with a DataObject reference); these need to be rationalised into a consistent structure.]
    1. These may already exist as DataObjects in the CIM, in which case they can be included as fully-specified references. Otherwise new DataObjects ought to be created as per Use-Case 5.
  4. The user may either be describing a SimulationComposite or a SimulationRun. The content model of both is the same, except that composites can include other composites and runs while runs cannot.
    1. It is expected that an initial SimulationComposite will be created and then separate child composites and runs will be added which correspond to how the overarching simulation is represented at the modeling center. For example, a single simulation (a single response to an experiment) may actually be run as multiple segments: a spinup phase, and then separate runs for distinct time periods.
    2. Each separate run or composite is itself a simulation document and so all the steps of this scenario (and other simulation-related scenarios) apply. For instance, each simulation run can require its own input (say, the output of a previous run) and generate its own output.
  5. Very few simulations are really “created from scratch.” Apart from the above-mentioned aggregation pattern between composites and runs, most simulations are part of a genealogy of simulations; they are an extension of or a response to some earlier simulation.
    1. If that “ancestor” is already archived in the CIM, then the genealogy class of the simulation should refer to it. If not, then the reference will be under-specified and the user should consider creating the ancestor simulation document in the near future.
  6. Finally, the user submits the simulation document to a CIM Repository.
    1. At the very least, a simulation must have all its document-specific attributes, plus its text names and description, a timing profile, and one output dataset; a climate simulation that does not output at least one set of climate variables has no use.
    2. As before, it is expected that a simulation will be updated several times during its lifecycle.
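
The aggregation pattern from step 4 (composites may contain composites and runs, while runs are leaf nodes) can be sketched in Python as below. The class and attribute names are hypothetical, and start and end points are represented as plain model years purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimulationRun:
    """Leaf node: cannot contain other simulations."""
    name: str
    start_point: int  # model year, for illustration only
    end_point: int


@dataclass
class SimulationComposite:
    """Aggregation of composites and runs."""
    name: str
    children: list = field(default_factory=list)

    def runs(self):
        """Yield every SimulationRun below this composite, depth first."""
        for child in self.children:
            if isinstance(child, SimulationComposite):
                yield from child.runs()
            else:
                yield child

    def period(self):
        """Overall period covered by all runs below this composite."""
        runs = list(self.runs())
        return min(r.start_point for r in runs), max(r.end_point for r in runs)


# One response to an experiment, run as a spinup phase plus two segments.
response = SimulationComposite("response", [
    SimulationRun("spinup", 1800, 1850),
    SimulationComposite("historical", [
        SimulationRun("segment-1", 1850, 1950),
        SimulationRun("segment-2", 1950, 2000),
    ]),
])
run_names = [r.name for r in response.runs()]
covered = response.period()
```

In this sketch the composite's timing profile is derived from its runs; the CIM itself lets each composite and run carry its own startPoint and endPoint.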

Scenario 2: Updating a Simulation

A Simulation should be updated as more detail becomes known about it. This includes how it will conform to a NumericalExperiment [see Scenario 3].

This also includes adding new child simulations as new composites and runs are set up.

  1. First, the user locates the CIM document describing the simulation they wish to modify.
    1. If the user is adding a child simulation, then they should first proceed according to Scenario 1, and then modify the containing simulation by adding a new reference to the child. [Note that it would probably be cleaner to have children reference their parents rather than the other way around in order to avoid unnecessary modifications to CIM documents; the logic of the data package needs to be reviewed to see if this is appropriate.]
  2. Then the user adds the desired content to that document.
    1. As before, versioning is required if other CIM documents reference that simulation.
    2. For simulations, versioning is also required if DataObjects have already been generated (and described in the CIM) by them.

Scenario 3: Creating a Simulation in Response to an Experiment

This is really an extension of Scenario 1 – it follows on from step 5. Simulations are usually run in response to a particular NumericalExperiment. In this case, the Activity::Conformance class is used to describe the relationship between the experiment and the simulation. [The data package does not include an explicit reference from a simulation to an experiment. Instead the relationship is inferred via instances of Conformance. This seems inappropriate and should be reviewed.]

  1. The user adds a set of Conformance elements to the appropriate simulation(s).
    1. Each instance of Conformance requires a free-text description of how the simulation satisfies particular NumericalRequirements. Then a reference to a given NumericalRequirement is provided (remembering that multiple requirements can be contained by a common parent requirement). Then a reference to a Shared::DataSource is provided. DataSource is an abstract class that includes DataObjects as well as Software::ComponentProperties. This allows requirements to be satisfied either by using particular “external” data (DataObject) or “internal” coupling data (ComponentProperty).
  2. All an experiment's requirements ought to be satisfied by a simulation. However, given the length of time it takes to configure a simulation, it is unrealistic to expect users to fully-specify all instances of Conformance straight away. Therefore, “incomplete” simulations are allowed. A general guideline, however, is that by the time a simulation has been deployed [see Use-Case x], it should fully conform to the experiment it is running.
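
A simple way to track how far a simulation is from full conformance is sketched below in Python. The dictionary keys and the sample requirement names are hypothetical, and the sketch ignores the hierarchical bundling of requirements: it treats every requirement as a flat name.

```python
def unsatisfied_requirements(requirement_names, conformances):
    """Return the requirements that no Conformance entry yet satisfies.

    A Conformance entry counts only if it names a data source, i.e. a
    DataObject or a ComponentProperty (both represented here as strings).
    """
    satisfied = {c["requirement"] for c in conformances if c.get("data_source")}
    return [name for name in requirement_names if name not in satisfied]


requirements = ["co2-forcing", "monthly-sst-output"]
conformances = [
    {
        "requirement": "co2-forcing",
        "description": "prescribed CO2 concentrations are read from a file",
        "data_source": "DataObject:co2-concentrations",
    },
]
remaining = unsatisfied_requirements(requirements, conformances)
```

An “incomplete” simulation is one for which this list is non-empty; under the guideline above it should be empty by the time the simulation is deployed.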

Use-Case 3: Describe Model Component(s)

Scenario 1: Describe a Model Component from Scratch

A Software::ModelComponent is the software implementation of a simulation. The CIM allows a very high level of detail to be recorded for a component – and it is expected that certain external “CIM-aware” tools (like future versions of BFG or OASIS) will require such a high level – however most elements of a component are optional and so, in all likelihood, initial component documents will not be very detailed.

  1. First, the user fills in the standard CIM document information: an author, a creation date, a version number, and an ID.
  2. As with simulations, it is unlikely that a component has truly been “created from scratch.” An ancestor component should be identified (or created if need be) and referred to in the genealogy.
  3. Then the user specifies the shortName and longName of the component.
  4. Then the user describes the timing profile of the component. This includes the period of time that the component is meant to simulate (the start and end time and, importantly, the rate).
  5. Then the user chooses the type of component being described. These include terms like “atmosphere,” “radiation,” “sea-ice,” etc. The choices will be stored in a controlled vocabulary. It is expected that there will be different controlled vocabularies for different user communities (for instance, one for those using the CIM for CMIP5).
  6. Then the user can refer to one or more Grids::GridSpec documents to specify the grids that the component uses.
  7. Next the user records all (or all of the relevant, or most of the relevant) variables that a component simulates or otherwise uses. These are stored as Software::ComponentProperties. Examples of ComponentProperties include sea surface temperature, air pressure, etc.
    1. For each ComponentProperty, the user records its name and description, and optionally its type and value (it doesn't make sense for all properties to have a value – coupling fields, for instance). The type is simply used to group similar properties together. [A ComponentPropertyType is implemented as an "open" codelist. This means that users are provided a choice from a controlled vocabulary, but they can also choose "other" and list a property type of their own.]
    2. ComponentProperties may also refer to a GridSpec to specify how the field they describe is mapped onto a geographic grid.
  8. A component can include nested “child” components. For instance, a coupled model that simulates the ocean and the atmosphere would have two child components – one for the ocean and one for the atmosphere. The atmosphere itself may have several nested components (radiation, etc.).
  9. Each child component is itself a document and should be described as per this use-case.
  10. Once a hierarchy of coupled components has been described, their coupling may be described in further detail. A Software::Composition is associated with a component. Each composition consists of a set of Software::Couplings which in turn consist of a set of Software::Connections. A Coupling represents an input/output link between two components; for example, the atmosphere couples to the ocean. A Connection represents an input/output link between two properties; for example, the atmosphere SST couples to the ocean SST. If a Coupling is marked as “fullySpecified” then its set of connections is a complete list of all connections between the two components. Otherwise, it is a partial list or else there are no connections listed at all. Couplings and Connections have timing information associated with them to describe the frequency of the connections.
    1. The properties specified by a component's composition must be owned by that component or a child of that component; child components cannot couple together their parents' properties.
    2. Note that the “source” of a connection is a Shared::DataSource. This means that an input to a coupling can either be a DataObject (i.e., an external forcing), or a property from another coupling (i.e., an internal coupling). In fact, a fully-specified Composition might describe a connection in two steps: once from a dataset to a component for the first timestep, and once from another component to that component for all future timesteps.
  11. A ModelComponent needs only the standard set of CIM document information and a name, type, and timing information.
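
The ownership rule in step 10 (a composition may only couple properties owned by its own component or by components nested beneath it) can be sketched in Python as follows. The class and function names are hypothetical. The sketch assumes that "a child" extends to any nested descendant, and it only checks property endpoints: a connection whose source is a DataObject would need separate handling.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    properties: set = field(default_factory=set)
    children: list = field(default_factory=list)

    def visible_properties(self):
        """(component, property) pairs owned by this component or any descendant."""
        props = {(self.name, p) for p in self.properties}
        for child in self.children:
            props |= child.visible_properties()
        return props


def invalid_connections(component, connections):
    """Return the connections that use a property outside the component's subtree.

    Each connection is a (source, target) pair of (component, property) tuples.
    """
    allowed = component.visible_properties()
    return [c for c in connections if c[0] not in allowed or c[1] not in allowed]


atmosphere = Component("atmosphere", {"sst"})
ocean = Component("ocean", {"sst"})
coupled = Component("coupled-model", children=[atmosphere, ocean])

# The ocean SST feeds the atmosphere SST.
link = (("ocean", "sst"), ("atmosphere", "sst"))
at_root = invalid_connections(coupled, [link])      # declared on the coupled model
at_child = invalid_connections(atmosphere, [link])  # declared on the atmosphere
```

Declared on the coupled model, the connection is acceptable because both properties sit below it; declared on the atmosphere, it is rejected because the ocean property belongs to a sibling.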

Scenario 2: Associating a Component with an Activity

Most model components are run as an implementation of a particular simulation. The only additional steps beyond Scenario 1 are as follows:

  1. The user creates a reference to the Activity::SimulationRun (or Activity::DataProcessing in the case of a ProcessorComponent) that the ModelComponent is implementing.
  2. The user then checks (assuming the activity is being run in support of an experiment) whether any of the experiment requirements can be satisfied by the particular way that the component is configured – specifically, the set of ComponentProperties it contains.
    1. If they can, then either a new conformance element is added or an existing conformance element is extended with a reference to the newly-discovered DataSource.
  3. If the updated simulation had other CIM documents associated with it, it may need to be versioned.

Use-Case 4: Describe Processor Component(s)

A Software::ProcessorComponent is a component which, unlike a ModelComponent, does not model any physical phenomena. It still processes data, but it is not a “scientific model” in the strict sense. Examples of ProcessorComponents include transformers and post-processors.

Some ProcessorComponents may be embedded within coupled component hierarchies. For example, a transformer component may transform the output of one component to a format suitable to be the input of another component (an “AtoO” subroutine is an example of this). Such ProcessorComponents are treated just like the components in the previous use-case and can be used within the parent model's Composition.

This use-case concerns itself with ProcessorComponents that are not being run to implement simulations. They can be run instead to implement Activity::DataProcessing – for example, the processing of observation data or the post-processing of data from a simulation. In this case the procedure is exactly the same except that no timing profile is specified. [This does not provide a way to specify that a ProcessorComponent helps conform to a simulation. If it is being used to transform model output such that it meets the requirements of a simulation, then I suggest creating a "virtual" SoftwareComponent at one level higher than the actual coupled model, which has the ProcessorComponent in question as a sibling of the coupled model. This will allow the variables it manipulates to be specified in a top-level "virtual" composition and hence referenced by a Conformance element.]

Use-Case 5: Describe Climate Data

A Data::DataObject represents a set of climate variables. Multiple variables may be grouped together in some logical way into a single data structure.

Scenario 1: Describe Climate Data from Scratch

It is unlikely that a DataObject will really be created in isolation. Rather it will be added to a CIM Repository as the requirement of a NumericalExperiment, and/or the input or output datasets for a Simulation, and/or the ComponentProperties (i.e., IO variables) of a SoftwareComponent.

  1. First, the user fills in the standard CIM document information: an author, a creation date, a version number, and an ID.
  2. Then the user starts to fill in the content of the DataObject. This identifies the climate variable by name, geographical (horizontal and vertical) and temporal extent [Extent is currently described using a structure other than the Grids::GridSpec. This in itself is not problematic, but it seems odd that there is no longer any link between the data and grids packages. (There used to be.)], and aggregationType (sum, mean, min, max, none) and frequencyType (hourly, daily, monthly, etc.).
  3. DataObjects can have nested “child” DataObjects. The logical way that particular DataObjects are grouped together is recorded in a Data::DataProperty. A DataProperty includes an enumerated type to name the relationship – for example, at the Met Office, data is grouped into logical streams so an object property might have the name “stream” and the value “APM” (the name of one particular stream used by the Met Office).
    1. In addition, DataObjects can have multiple instances of DataContent. This means that a single DataObject can be used to contain multiple climate variables. Users need to decide what the most appropriate combination of DataObjects and DataContent is for the way that their data is generated/stored.
  4. These are the only attributes that are required in order to submit a DataObject to a CIM Repository. Note that the actual raw data is not needed. In fact, users can specify the status of a DataObject as “metadataOnly” (or “complete” or “continuouslySupplemented”).
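
The two grouping mechanisms from step 3 (nested child DataObjects, and multiple DataContent instances in one DataObject) are sketched below in Python. The class and attribute names are hypothetical simplifications, and the object names and topics are invented for illustration; only the three status values and the “stream”/“APM” property come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    METADATA_ONLY = "metadataOnly"
    COMPLETE = "complete"
    CONTINUOUSLY_SUPPLEMENTED = "continuouslySupplemented"


@dataclass
class DataContent:
    topic: str        # the climate variable being recorded
    aggregation: str  # sum, mean, min, max, none
    frequency: str    # hourly, daily, monthly, ...


@dataclass
class DataObject:
    name: str
    status: Status = Status.METADATA_ONLY  # raw data is not needed to submit
    properties: dict = field(default_factory=dict)  # e.g. {"stream": "APM"}
    contents: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def topics(self):
        """All climate variables described by this object and its children."""
        found = [c.topic for c in self.contents]
        for child in self.children:
            found.extend(child.topics())
        return found


stream = DataObject(
    "monthly-means",
    properties={"stream": "APM"},
    children=[
        # One stored artifact holding two variables.
        DataObject("surface-fields", contents=[
            DataContent("sea surface temperature", "mean", "monthly"),
            DataContent("air pressure", "mean", "monthly"),
        ]),
        DataObject("precipitation", contents=[
            DataContent("precipitation", "sum", "monthly"),
        ]),
    ],
)
described = stream.topics()
```

Whether two variables share one DataObject or sit in separate children is exactly the choice that users are asked to make according to how their data is generated and stored.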

Scenario 2: Update Climate Data

There are other attributes of DataObjects that are likely to change throughout the course of a climate simulation activity. The most obvious one is when the dataset described by the DataObject is finally generated and archived.

  1. First, the user locates the CIM document describing the DataObject they wish to modify.
  2. The DataStorage class is used to describe where/how data is archived. This includes its format, location, size, and any medium-specific attributes that might be needed (such as filename or hostname).
    1. If this is newly archived data that is being described, then the document can simply be updated and resubmitted to a CIM Repository. If the DataStorage instance is changing, however, then the document must be versioned.
  3. Additional attributes that might change are DataDistribution, DataRestriction, and DataCitation. The former two describe any constraints on accessing the dataset specified by the associated DataStorage element. This includes contact information for the data owner. A DataCitation describes a published article that refers to this DataObject.
    1. Great care should be taken when updating a DataObject with one or more existing DataCitations; in general, the changes should result in a new version. [The cited articles should ideally include the CIM document id and version when referencing the DataObject.]
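
The versioning rules in this scenario can be summarised in a short Python sketch. The function name, argument names, and sample storage values are hypothetical, and the rule for cited objects is deliberately conservative: the text says only that changes should "in general" result in a new version.

```python
def data_update_needs_new_version(old_storage, new_storage, citations):
    """Decide whether an updated DataObject should be stored as a new version.

    old_storage is None when the data had not been archived before.
    """
    if citations:
        # Published work refers to this document: treat any change as a new version.
        return True
    if old_storage is None:
        # Newly archived data: update and resubmit the existing document.
        return False
    # An existing DataStorage description is being changed.
    return new_storage != old_storage


archive = {"format": "NetCDF", "location": "archive-host:/data/run1"}
moved = {"format": "NetCDF", "location": "other-host:/data/run1"}

first_archive = data_update_needs_new_version(None, archive, [])
relocated = data_update_needs_new_version(archive, moved, [])
cited = data_update_needs_new_version(None, archive, ["some published article"])
```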

Use-Case 6: Describe a Deployment

A Simulation or Component is said to be deployed when the software component has begun to run on computing resources. Once that has happened, a Software::Deployment can be created.

  1. First, the user fills in the standard CIM document information: an author, a creation date, a version number, and an ID.
  2. Then the user specifies the date that the run was deployed. Note that this is different from the date that the document was created.
  3. Then the user describes the Machine that the run was deployed onto, the Compiler that the run used, and the method of Parallelization [The exact way to describe a Parallelization method has yet to be decided (the class is currently unused in the APPCIM).] that was used.
  4. Finally the user should add a reference to the deployment from the relevant SoftwareComponent and/or Simulation. Note that both the software and activity package can refer to the same Deployment.

Conclusions, Issues, and Recommendations

This report has highlighted the following issues:

  • The recursive nature of some of the "key" classes allows for too many CIM documents; at the very least, their complexity should be reduced by specifying a minimal set of mandatory elements for each document; also, it is worth considering ways to enforce a single document (i.e., a component with embedded, rather than referenced, sub-components) being used to describe coupled models, etc.
  • The simplistic versioning system described above may run into problems (for example, consider a coupled model where the version of one of the sub-components is updated; how does that affect the version of the parent component?)
  • It seems that there ought to be a relationship between the grids package and the data package; where is it?
  • There seems to be no real structural difference between a SimulationComposite and SimulationRun; why the two classes then?

TODO:

  1. add sample XML snippets to use-cases
