Discussion of Ensemble Handling Issues

For ticket:684 and ticket:533

Ensemble Types

It's clear that we have the following ensemble types to consider:

  • Initial Condition Ensembles
    • Ensemble members are initialised at the same date and time, but the data used at that date and time varied in some way. A useful attribute of the ensemble description is how they differed, which would require an overarching description and a per-member mapping to some varying attribute.
  • Staggered Start Ensembles
    • The same analysis method has been used to create initial conditions, but the simulations actually start at different times. It would not be possible to produce an ensemble average for the entire period from the ensemble members. Some might argue that this is not an ensemble, even though logically consistent ensembles can be constructed from portions of the output. Certainly one could calculate ensemble statistics for many things without worrying about the start dates, but the methodology would be to calculate time independent statistics and average those.
  • Perturbed Physics Ensembles
    • The ensemble members have modified physics, but the same start date.
  • Modified Boundary Condition Ensembles
    • The boundary conditions differ between ensemble members.
  • Cross-Model Ensembles
    • The ensemble members are from different models running different simulations. It may be that they didn't run the simulations for the same period, and almost certainly not for all variables. It's likely that one couldn't then generate an ensemble for all output. Logically (abstractly) this is the same situation as the staggered start, in that ensemble averages can only be produced for some of the output.
  • Grand Ensembles: Where we have a mixture of the other types ...

Two primary use cases on the table

  1. Decadal hindcasts: Multiple start dates, with multiple initial condition ensembles. So this is a grand ensemble of initial condition ensembles nested inside staggered start dates, with a dose of multiple initial condition strategies for good measure.
  2. TAMIP: We only know about the first 64 of the 128. These form four tranches of 16 experiments with staggered start dates (and times): the four sets are separated by intervals of order months, while the sixteen members within each set are staggered at 30 hour intervals.

What the questionnaire currently does

As of mid-March 2010, the questionnaire doesn't handle these as well as one might like ...


  1. The situation is probably ok for physical and boundary condition ensembles, but none of the rest are well catered for.
  2. It's not clear how to map onto the realisation numbers that will appear in the DRS
  3. It's not obvious how to do conformances into numerical experiments that define staggered starts, without replicating the experiments themselves (which is nearly ok for the decadal case, but not ok for TAMIP). We can't have folk replicating all that metadata, even with simulation copying available.
  4. Initial condition ensemble members don't work logically with initial conditions at the simulation level.
  5. We want to try and expose the fact that numerical requirement instances are aggregated, not composed, into experiments. That implies deleting an experiment shouldn't delete its requirements (allowing reuse), but that's not the case at the moment.

The Inputs Issue

Ideally we want a sensible way of an initial condition ensemble member referring to the initial conditions in such a way that input replication is minimised.

Consider (in the following, everywhere you see date, consider date/time):

  • All the inputs for a single simulation can be considered as an aggregation. (The CIM can handle this: the DataObject has a hierarchy level)
    • We could introduce this concept to the questionnaire
  • We can handle the input modification as a modification to that aggregation (as opposed to the individual inputs)
    • Allowing one to suggest either a file name change to one or more of the members, or a global date change to apply after input (we have the concept of time transformation in coupling, so we probably need it here too), or a global date change to the date looked for in the files, or a date change to the date looked for in some files ...
      • Doing all that could be pretty offputting ... we might want to think hard about how much information we are asking about inputs ...
    • We could help simplify things, by
      • Adding a date attribute to all inputs (not b.c.s nor ancillary files)
      • not allowing dates to be associated with input definition when we ask at the component level, and
      • default the input date to the simulation start date for all inputs when asking at the sim level
      • And somehow only show all the input names and dates in the mod pages if folks actually want to do it (somehow not make it the default)
    • Need to ensure that we match these mods onto the realisation numbers and make sure these things end up in the XML output

So, what would an ensemble i.c. mod look like:

  • It points to either an aggregation or a set of inputs
  • It allows a date change, or in the second case, a file change and/or a date change
  • And it recognises the distinction between these various date options:
    • Date in file is replaced with new date at runtime
    • Different date is extracted from file
    • Different start date is used.
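
As a minimal sketch only, the description above could be captured in a structure like the following. Every name here (InputMod, DateChangeKind, the field names and the example input names) is hypothetical and comes from neither the questionnaire nor the CIM; the rule that a file change is only allowed when pointing at a set of inputs is taken from the second bullet above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class DateChangeKind(Enum):
    """The three date options distinguished in the list above."""
    REPLACE_DATE_AT_RUNTIME = "date in file is replaced with new date at runtime"
    EXTRACT_DIFFERENT_DATE = "different date is extracted from file"
    DIFFERENT_START_DATE = "different start date is used"


@dataclass
class InputMod:
    """One ensemble i.c. modification (hypothetical structure)."""
    realisation: int                    # realisation number this mod maps onto
    targets: List[str]                  # names of the inputs (or of the aggregation)
    targets_aggregation: bool = False   # True if the mod points to an aggregation
    date_change: Optional[DateChangeKind] = None
    new_date: Optional[datetime] = None
    new_file: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the mod is consistent."""
        problems = []
        if self.targets_aggregation and self.new_file is not None:
            problems.append("file change needs a set of inputs, not an aggregation")
        if (self.date_change is None) != (self.new_date is None):
            problems.append("date_change and new_date must be given together")
        if self.date_change is None and self.new_file is None:
            problems.append("mod changes nothing")
        return problems


# Example: member 2 uses a different start date for one (made-up) input.
mod = InputMod(
    realisation=2,
    targets=["atmos_start_dump"],
    date_change=DateChangeKind.DIFFERENT_START_DATE,
    new_date=datetime(2008, 10, 16, 6),
)
print(mod.validate())
```

Keeping the kind of date change as an explicit enumeration, rather than a free-text note, is one way to make sure the distinction survives into the XML output.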

CIM revisions needed

The CIM actually has the flexibility to handle the situation, but it's not very helpful.

We recommend that we modify the CIM

To handle ensembles more cleanly

  1. Move requiredDuration from being an attribute of the numerical experiment to being an attribute of the SpatioTemporalConstraint?
  2. Make it clearer how we expect to use aggregations of numerical requirements within themselves (strictly we don't need to do this, but it'll make it much clearer for those who have to use this UML):
    1. Remove the consistOf relationship from NumericalRequirement
    2. Add a new class, a NumericalRequirementSet, being a specialisation of NumericalRequirement and add an association from NumericalRequirement to the new class, with source role of belongsTo (multiplicity 0..1) and target role of ensembleRequirements (multiplicity 0..n).
  3. We now want to make it possible for ensemble members to conform to the ensembleRequirements members. Details TBD, but ideally we want to exploit an ensemble members class, rather than go via simulation and then the conformances ... FIXME
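
To make the proposed association in item 2 concrete, here is a rough Python rendering of the UML. It is an illustration under assumptions, not CIM code: only the class names NumericalRequirement and NumericalRequirementSet and the role names belongsTo and ensembleRequirements come from the text above; the name attribute, the add method and the example requirement names are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class NumericalRequirement:
    name: str  # hypothetical attribute, for illustration only
    # Source role belongsTo, multiplicity 0..1: at most one owning set.
    belongs_to: Optional["NumericalRequirementSet"] = field(default=None, repr=False)


@dataclass(eq=False)
class NumericalRequirementSet(NumericalRequirement):
    """A specialisation of NumericalRequirement that groups other requirements."""
    # Target role ensembleRequirements, multiplicity 0..n.
    ensemble_requirements: List[NumericalRequirement] = field(default_factory=list)

    def add(self, requirement: NumericalRequirement) -> None:
        """Attach a requirement, enforcing the 0..1 multiplicity of belongsTo."""
        if requirement.belongs_to is self:
            return  # already a member of this set
        if requirement.belongs_to is not None:
            raise ValueError(f"{requirement.name} already belongs to a set")
        requirement.belongs_to = self
        self.ensemble_requirements.append(requirement)


# Example: a staggered start expressed as a set of start-date requirements.
starts = NumericalRequirementSet(name="staggered start dates")
starts.add(NumericalRequirement(name="start 2008-10-15 00Z"))
starts.add(NumericalRequirement(name="start 2008-10-16 06Z"))
print([r.name for r in starts.ensemble_requirements])
```

Because the set is itself a NumericalRequirement, a conformance could in principle point at the set as a whole or at one of its members.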

At the same time, we recognised that the way we handle inputs and ensemble members needs improvement.

  1. TBD FIXME. What are the CIM consequences of the discussion about input handling above?

What could the questionnaire do differently

  1. Add StaggeredStart to the ensemble types for modifications
  2. Allow a start date per member in this case (and modify the simulation main page to show the set, not a single one), and
  3. Allow conformances to point to the set of start dates (rather than the individual ones) and allow 'not complete' as an option (allowing one to submit some of the required ensemble members)
  4. Handle the rest of the inputs better TBD FIXME once the CIM section above is done.

Issues Not Resolved

The question of how CMIP5 should handle this was not really addressed, but the pragmatic approach we would recommend is to treat the TAMIP 64 as 4 experiments, not 16, and to leave the decadal experiments as they currently are. This means export should be straightforward for Rupert.

Scientific basis for describing TAMIP as 4 experiments rather than 1 or 64

Extract from TAMIP experiment design:
4 sets of 16 hindcasts are to be run, the first in each set starting at 00Z on the 15th of the following months and then subsequently at 30 hour intervals: October 2008, January 2009, April 2009 and July 2009. This ensures sampling throughout the annual and diurnal cycles for each grid-point for given lead times. These periods have been chosen to tie in with the Year of Tropical Convection (YOTC) and the hindcast periods are aligned with one or more of the IOPs (Intensive Observing Periods) for VOCALS (VAMOS Ocean-Cloud-Atmosphere-Land Study), AMY (Asian Monsoon Years) and T-PARC (THORPEX Pacific Asian Regional Campaign).

  • The TAMIP experiment design yields 4 (25 day) staggered start ensembles separated by approx. 50 days. In other words there are 4 distinct TAMIP hindcast periods.
  • The TAMIP hindcast periods are aligned with the Intensive Observing Periods (IOPs) listed above.
  • We should expect TAMIP users to focus searches on one IOP rather than all of them.
    • Keith Williams (MOHC) "I think it is fairly likely (as a number out of the air, I'd say 30%) that people retrieving the data would be interested in particular hindcasts rather than the whole set."
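
The start times implied by the extract can be enumerated with a short script. This is a sketch that assumes only what the extract states (first start at 00Z on the 15th of each listed month, 16 hindcasts per set, 30 hour spacing); the function and constant names are made up.

```python
from datetime import datetime, timedelta

# First start of each set: 00Z on the 15th of the months named in the extract.
FIRST_STARTS = [
    datetime(2008, 10, 15, 0),
    datetime(2009, 1, 15, 0),
    datetime(2009, 4, 15, 0),
    datetime(2009, 7, 15, 0),
]
MEMBERS_PER_SET = 16
STAGGER = timedelta(hours=30)


def tamip_start_times():
    """Return four lists of 16 start times, one list per hindcast set."""
    return [
        [first + i * STAGGER for i in range(MEMBERS_PER_SET)]
        for first in FIRST_STARTS
    ]


sets = tamip_start_times()
for starts in sets:
    # Print the first and last start time of each set.
    print(starts[0].isoformat(), "to", starts[-1].isoformat())

# Distinct start hours across all 64 hindcasts (sampling of the diurnal cycle).
print(sorted({t.hour for starts in sets for t in starts}))
```

Since 30 hours is one day plus 6 hours, successive members start at 00Z, 06Z, 12Z and 18Z in rotation, and the last member of each set starts 15 × 30 = 450 hours (18 days 18 hours) after the first. This is the spread of start times only and does not include the length of the hindcasts themselves.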

Proposal for CMIP5 DRS handling of TAMIP

4 TAMIP experiment names

  • tamip200810
  • tamip200901
  • tamip200904
  • tamip200907

Directory Path


where the temporal range has the form yyyymmddhh.

Comment from Mark Elkington:
"I notice that you are proposing a DRS reference something like

Wouldn't the activity be labelled as TAMIP - or is TAMIP going to become part of CMIP5 rather than just using our resources?
/TAMIP/MOHC/HadGEM2-AO/tamip_200810/3hr/atmos/tas/r1/v1/ "