ccc-gistemp release 0.3.0

I am pleased to announce ccc-gistemp release 0.3.0. This includes a number of bug fixes and features in our framework and tools, and a great deal of clarification work especially in steps 1 (station combination) and 2 (peri-urban adjustment). Really, it’s much better. Give it a go.

Much of GISTEMP was concerned with generating and consuming intermediate files, to separate phases and to avoid keeping the whole dataset in memory at once (an important consideration when GISTEMP was originally written). In 0.3.0 this has largely been replaced by an iterator-based approach, which is clearer, automatically pipelines all the processing where possible, and avoids all code concerned with serialization and deserialization.
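
To illustrate the general idea (this is a minimal sketch, not code from ccc-gistemp), the following chains Python generators so that each record flows through every stage without an intermediate file. The record shape and the function names `read_records`, `drop_short_records` and `to_anomalies` are hypothetical, chosen only for this example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: (station id, list of temperatures).
Record = Tuple[str, List[float]]


def read_records(rows: Iterable[Record]) -> Iterator[Record]:
    """Yield records one at a time rather than loading them all."""
    for row in rows:
        yield row


def drop_short_records(records: Iterable[Record],
                       min_length: int = 3) -> Iterator[Record]:
    """Pass through only records with at least min_length values."""
    for station_id, temps in records:
        if len(temps) >= min_length:
            yield station_id, temps


def to_anomalies(records: Iterable[Record]) -> Iterator[Record]:
    """Replace each value by its difference from the record's own mean."""
    for station_id, temps in records:
        mean = sum(temps) / len(temps)
        yield station_id, [round(t - mean, 2) for t in temps]


raw = [
    ("A", [10.0, 12.0, 14.0]),
    ("B", [5.0]),
    ("C", [1.0, 2.0, 3.0, 6.0]),
]

# Nothing is computed until the pipeline is consumed; each record then
# passes through all three stages in turn.
pipeline = to_anomalies(drop_short_records(read_records(raw)))
result = list(pipeline)
print(result)
```

Because each stage only ever holds one record, the stages stay separate in the source without needing a file on disk between them.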

We have retained intermediate files between the distinct steps of the GISTEMP algorithm, for compatibility with GISTEMP and for testing purposes. We have also retained some code to round or truncate some data at the points where Fortran truncates it for serialization. This will be removed in future.

Some of the original GISS code was already in Python, and survived almost unchanged in 0.2.0. Much of the rest of 0.2.0, especially the more complex arithmetical processing in step 2, was more-or-less transliterated from the Fortran. A lot of this code has been rewritten in 0.3.0, especially improving the clarity of the station-combining code (in step1.py) and the peri-urban adjustment (now in step2.py).

There has been a rearrangement of the code: the code/ directory now only contains code which we consider part of the GISTEMP algorithm. Everything else – input data fetching, run framework, testing, debugging utilities – is in the tool/ directory. This division will continue, to allow us to add useful tools while still reducing and clarifying the core code.

There is better code for comparing results, and a regression test against genuine GISTEMP results.

10 Responses to “ccc-gistemp release 0.3.0”

  1. Open Knowledge Foundation Blog » Blog Archive » Clear Climate Code, and Data Says:

    […] but does the same thing. We have taken great steps forward towards this goal: We have recently released a version which is all in Python and which reproduces GISS’s results exactly. We think much of this code is already a great deal clearer than the starting material, but we […]

  2. Carrick Says:

    Thanks for your efforts. I was able to download and run, with a complete build in less than 30 minutes.

    Have you guys thought about how to go about making it smooth to turn on or off different parts of the package externally?

    In particular (at the moment), I’m interested in comparing the output with and without the UHI correction. I imagine I’ll be able to do this with a few minutes of work, but it would be pretty awesome if there were a way to specify the output file (have the default be same as now), but to be able to change things like whether homogenization and UHI get performed, as well as be able to change the grid spacing and radius in step 3 and so forth.

    The main point of these is to look at magnitude of the effect of the various manipulations. Knowing how important they are tells you something about where to spend your QC effort first. I imagine for example homogenization is a big effect, UHI not so much. Anyway, in my experience, this sort of manipulation of the algorithm has always been an important part of the software validation itself.

    I could go in and add flags to the code to turn these on and off, change values and so forth, but that seems a bit like duplicate effort. I will probably do so anyway, but I’d love to see the evolution of this code include automating tweaks like this.

    Thanks again for your effort.

    Carrick

  3. Nick.Barnes Says:

    @Carrick:

    Firstly, I am glad you were able to easily use our code. I hope you will find it fairly easy to understand as well, and to develop the details of any experimental changes you want to try.

    Your suggestions are exactly the sort of thing we are considering for future work. The next version (0.4.x) will have a common data set structure: i.e. the data objects output by (for instance) the “STEP1” station-combination step will have exactly the same form as the data objects output by (for instance) the “STEP2” peri-urban adjustment step (at present step1.py and step2.py follow the Fortran in producing, respectively, a plain-text Ts.txt file and a Fortran binary Ts.GHCN.CL.PA file). All the I/O is going to come out of the core code and move into the tool/ directory. Paul Ollis is working on that right now. Once that is done, it will be trivial for third parties to, for instance, switch steps on or off, or to modify the code for a step, or even to reorder steps.

    After that version, or possibly the one after, all the core algorithms will be a set of parameterized iterator functions – filters – each of which takes a data set and returns a modified data set. These functions won’t exactly correspond to the Fortran STEPs: for instance the part of STEP0 which combined USHCN data with GHCN data would be one such function; STEP1 currently has a total of 5 functions. STEP2 has one – the peri-urban adjustment – but preparative to that is the calculation of annual anomaly series, from which it is separable. And so on. My hope is to clearly expose the parameters to each of these functions (e.g. the peri-urban radius) and also to make each of these functions switchable (so, for instance, one could choose to omit the peri-urban adjustment entirely, or the St-Helena-adjustment part of STEP1). This will make the sort of experiment you describe a simple matter of changing a line in a configuration file and re-running the code.
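
The filter design described in the comment above could look roughly like the sketch below. This is an illustration only, not the planned ccc-gistemp interface: the filters `add_offset` and `drop_station`, the `CONFIG` list and the `run` helper are all hypothetical, and the toy filters stand in for real steps such as the peri-urban adjustment.

```python
def add_offset(records, offset=0.0):
    """Toy filter: add a constant to every value in every record."""
    for name, values in records:
        yield name, [v + offset for v in values]


def drop_station(records, station=""):
    """Toy filter: omit the record for one named station."""
    for name, values in records:
        if name != station:
            yield name, values


# Hypothetical configuration: (filter, parameters, enabled flag).
# Flipping a flag or changing a parameter is a one-line edit.
CONFIG = [
    (add_offset, {"offset": 1.0}, True),
    (drop_station, {"station": "B"}, False),
]


def run(records, config):
    """Chain every enabled filter, in order, over the records."""
    for func, params, enabled in config:
        if enabled:
            records = func(records, **params)
    return records


data = [("A", [1.0, 2.0]), ("B", [3.0])]
result = list(run(data, CONFIG))
print(result)
```

Since every filter takes a data set and returns one of the same form, switching a filter off, changing its parameters or reordering the list does not require touching the filters themselves.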

  4. Zeke Hausfather Says:

    One nice result of CCC is making it trivial to dispel some of the more spurious accusations about GISTemp. WUWT today is a case in point: http://wattsupwiththat.com/2010/02/01/chiefio-asks-why-does-giss-make-us-see-red/

  5. Nick.Barnes Says:

    @Zeke: We are very happy for ccc-gistemp to be used in discussions about GISTEMP. This was exactly its purpose. We hope that, as CCC increases knowledge and understanding of the GISTEMP algorithm, so future criticisms of that algorithm will identify any actual weaknesses – which can then be fixed.

  6. Carrick Says:

    Nick, if you haven’t seen it, you may want to take a look at what Chad has been doing over at Trees for the Forest. This is exactly the sort of testing I have been thinking about in terms of algorithm validation.

    What would be ideal would be to migrate to a truly object-oriented version of the code, in which different homogenization codes could be written and “dropped” in as different subclasses as class functions.

    And of course the ability to “drop in” Monte Carlo tests in place of real data. I don’t know if you guys have thought of that, but a means for doing that would be really slick.

    [I imagine starting with the homogenized code, feeding this back in to an earlier stage of the analysis, but with Monte Carlo errors introduced, and using that to study the efficacy of the various corrections that have been applied.]

  7. Nick.Barnes Says:

    @Carrick: Thanks for that link to Trees for the Forest. It’s very interesting stuff, especially the experiments with different gridding algorithms.

    Regarding “a truly object-oriented version of the code”: I don’t think that’s the best direction for ccc-gistemp. OO is a powerful style, but it can also be very obscure for newcomers. If we go down that road then before we know it we will have factory classes, I/O monads, and so forth, which would add hundreds or thousands of lines of code which would be impenetrable to any non-programmers. It also wouldn’t make switching modules of code any easier. Python has first-class functions and modules, which give just as much flexibility for a much lower cost in clarity. We do use objects in a few places.

    In terms of the simplest experiments, this month I am changing the code to lift all the numerical parameters out to parameters.py.

  8. Carrick Says:

    Nick, when I brought up OO, I was thinking more along the lines of data encapsulation than formal OO methods or development patterns. You could get the same result by thinking structures + functions divided into semantic categories.

    As to it being “obscure”…I’ve actually found it easier for people to use the OO approach over a procedural one that utilizes arrays and global data.

    I’ve found this useful even when delivering software to other people for their use in particular types of data analysis. The particular application I am thinking of involved using classes in the MATLAB language, which as you know has very crude support for the OO paradigm.

    In this particular case, it has been successfully reused by other people who had no previous experience in OO code, but who were very rapidly able to incorporate my software package into their existing software framework.

    [Once you give them examples of how to create objects and manipulate them, it’s not that different conceptually from creating arrays, loading them with data, then supplying them to a function… it’s just a lot easier and a lot more bullet proof.]

  9. Nick.Barnes Says:

    @Carrick: I recommend you browse our current sources, or download them. Or you might like to wait for release 0.4.0, which will be along in a few days. Then you can see the extent to which we use OO, and other kinds of encapsulation. Of course it’s a work in progress, gradually moving from Fortran, via Fortran-in-Python, to a simpler, clearer, and more versatile code base.

    Certainly no global data. As for arrays, in any application like this they will always be there under the skin.

  10. Carrick Says:

    Thanks for the comments Nick. I certainly hope you weren’t taking anything I was saying as negative or critical, and certainly am looking forward to the release of 0.4.0.
