Trendy!

tool/vischeck.py has recently been updated so that it computes and draws trends (the work was done by Nick Barnes and me). Here are some recent comparisons redrawn with trends:

The “before 1992 / after 1992 stations” from “The 1990s station dropout does not have a warming effect”:


The short trends use the last 30 years of data for each series (which, since one series ends in 1991, is a different period for each). Notice how similar the recent trends are.
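For readers who want to see what a "last 30 years" trend amounts to, here is a minimal sketch of an ordinary least-squares fit over the final 30 years of valid data in a series. It is not the vischeck.py code: the function name, the use of None for missing years, and the return values are all assumptions made for illustration.

```python
def last_n_years_trend(years, anomalies, n=30):
    """Least-squares slope and intercept over the last n years of valid data.

    Hypothetical helper: `None` marks a missing value in this sketch.
    """
    # Keep only the years that actually have a value.
    points = [(y, a) for y, a in zip(years, anomalies) if a is not None]
    if len(points) < 2:
        raise ValueError("need at least two valid data points")
    # The window ends at the last year with data, so a series that stops
    # in 1991 gets a different 30-year period from one that runs to today.
    last_year = max(y for y, _ in points)
    window = [(y, a) for y, a in points if y > last_year - n]
    if len(window) < 2:
        raise ValueError("need at least two points in the trend window")
    count = len(window)
    mean_x = sum(y for y, _ in window) / count
    mean_y = sum(a for _, a in window) / count
    sxx = sum((y - mean_x) ** 2 for y, _ in window)
    sxy = sum((y - mean_x) * (a - mean_y) for y, a in window)
    slope = sxy / sxx                      # anomaly units per year
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Multiplying the slope by 10 or 100 gives the trend per decade or per century.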

Reprising the Urban Adjustment post:

I don’t think I’ve done a combined land and ocean chart comparing hemispheres for the blog before, but here it is now:

Nick Barnes added the calculation of R² whilst I was writing this post, causing me to redraw all the charts.
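For a straight-line fit with an intercept, R² is the square of the correlation between the year and the anomaly. The sketch below computes it from scratch; the function name is hypothetical and this is not necessarily how the value is calculated in the ccc-gistemp tools.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line through (xs, ys)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R squared is undefined for a constant series")
    # For simple linear regression, R squared equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)
```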

Nick has also been exploiting ccc-gistemp’s new parameters.py module, and did a run with the somewhat experimental 250km smoothing rather than the traditional 1200km smoothing. The parameter is named gridding_radius and it affects gridding in Step 3; setting it to 250km essentially reduces each station’s influence to very roughly the size of the cell used in gridding.
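To give a feel for what the radius does, here is a rough sketch that assumes a station's weight falls off linearly from 1 at the cell centre to 0 at the gridding radius. The parameter name gridding_radius comes from parameters.py as described above; the function names, the haversine distance, and the Earth radius constant are illustrative assumptions, not the Step 3 code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2)
    # Guard against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def station_weight(distance_km, gridding_radius=1200.0):
    """Linear taper: 1 at the cell centre, 0 at or beyond the radius."""
    return max(0.0, 1.0 - distance_km / gridding_radius)
```

With these assumptions a station 600 km from a cell centre gets weight 0.5 at the traditional 1200 km radius, and weight 0 at 250 km.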

The effect on the trends is most visible in the Northern Hemisphere:

Trends are just one minor example of the way in which the ccc-gistemp code can be continuously improved. We don’t just draw trends for one graph, we improve the code so that all graphs can have trends.

ccc-gistemp release 0.4.0

I am pleased to announce ccc-gistemp release 0.4.0. This release is much clearer than previous releases. Give it a go.

  • Almost all of our code has now been rewritten to remove the Fortran style which remained from the original conversion from GISTEMP. Previous releases had greatly improved steps 0-2; this release continues the improvement work there and also carries those improvements through steps 3-5. Almost all of the code now has sensible variable and function names, clearer data handling, and helpful comments. Many unused variables and functions have been removed. The current core algorithm has 3740 lines of code, of which more than half are either comments, documentation strings, or blank.
  • Rounding has been completely eliminated from the system. Previously, rounding and truncation code was used to exactly emulate GISTEMP. Rounding made the code less clear, and Dr Reto Ruedy of NASA GISS confirmed that rounding was not important to the algorithm, so it has been removed. All temperature data is now handled internally as floating point degrees Celsius (previously it was a mixture of integer tenths, floating point tenths, and floating point degrees) and all location information is handled as floating point degrees latitude and longitude (previously it was a mixture of floating point degrees and integer hundredths).
  • In a normal run of ccc-gistemp, no data passes through intermediate files. Much of GISTEMP is concerned with generating and consuming intermediate files, to separate phases and to avoid keeping the whole dataset in memory at once (an important consideration when GISTEMP was originally written). We have now completely replaced this with an in-memory pipeline, which is clearer, automatically pipelines all the processing where possible, and avoids all code concerned with serialization and deserialization.
    We now have separate code to generate data files between the distinct steps of the GISTEMP algorithm, and to allow running a step from a data file instead of in a pipeline. This allows the running of single steps, and is useful for testing purposes.
  • Parameters such as the 1200 km radius used when gridding and the number (3) of rural stations required to adjust an urban station were previously scattered throughout the code. They are now all to be found, with explanatory comments, in code/parameters.py.
  • It’s now possible to omit Step 4 and produce a land-only index, which closely matches GISTEMP.
  • It’s also possible to omit Step 2, and run the algorithm without the urban heat-island adjustment.
  • GISTEMP recently switched to using nighttime brightness to determine urban/rural stations. We made the corresponding change, which is switchable.

Note that none of these changes altered any of our results by more than 0.01 degrees C, except for the change to urban station identification, for which the changes in our results (none greater than 0.03 degrees C) closely match the changes in the GISTEMP results.
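As an illustration of the parameters module mentioned in the list above, a file of this kind might look something like the sketch below. Only the name gridding_radius is taken from ccc-gistemp; the other name and the helper function are hypothetical, and the real code/parameters.py should be consulted for the actual contents.

```python
# Hypothetical excerpt in the spirit of code/parameters.py.

# Radius, in km, within which a station influences a grid cell (Step 3).
gridding_radius = 1200.0

# Minimum number of rural stations needed to adjust an urban station
# (Step 2). The name is invented for this sketch.
urban_adjustment_min_rural_stations = 3


def has_enough_rural_stations(n_rural):
    """True if an urban station has enough rural neighbours to be adjusted."""
    return n_rural >= urban_adjustment_min_rural_stations
```

Keeping such values in one place means an experiment is a one-line change rather than a hunt through the code.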

The work for this release has been done by David Jones, Paul Ollis, and Nick Barnes.

GISTEMP Land Index

GISS publish a land-only temperature anomaly (referred to as their “traditional analysis”).

As I pointed out in an earlier article, ccc-gistemp can now create a land index by omitting Step 4: python tool/run.py -s0-3,5.

Here’s how we compare with official GISTEMP:

GISTEMP Urban Adjustment

After some recent tweaks by me to the ccc-gistemp sources it is now possible to run a pipeline of the GISTEMP process with some of the steps omitted. An earlier post shows how I can omit Step 4 to create a land-only index. My recent changes allow Step 2 to be omitted. Step 2 is the urban adjustment step (in which stations marked as urban have their trend adjusted).

Omitting Step 2 will therefore give us an idea of the magnitude of the effect of the urban adjustment. It so happens that my writing this blog post overlaps with Nick Barnes implementing GISTEMP’s new scheme for identifying urban stations (corresponding to GISTEMP’s update of 2010-01-16). That gives me an opportunity to show both the new and old adjustment schemes against a “no adjustment” baseline:

In making this graph Step 4 has been omitted, giving us a land index. This is primarily to amplify the differences: land covers the lesser fraction of the Earth; so including the ocean data (which does not require an urban adjustment) makes the difference smaller.

And for each hemisphere:

Northern:

Southern:

To make a “no urban adjustment” run of ccc-gistemp: «python tool/run.py -s 0,1,3,5»; and to make an “urban adjustment” land-index: «python tool/run.py -s 0,1,2,3,5».
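The step lists passed to -s, such as 0,1,3,5 or 0-3,5, mix single steps and ranges. The following sketch shows one way such a specification could be expanded into a list of step numbers. The function parse_steps is hypothetical and is not taken from tool/run.py.

```python
def parse_steps(spec):
    """Expand a step specification like '0-3,5' into a sorted list of ints."""
    steps = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '0-3' includes both end points.
            low, high = part.split("-")
            steps.update(range(int(low), int(high) + 1))
        else:
            steps.add(int(part))
    return sorted(steps)
```

Under this reading, a list that lacks 2 skips the urban adjustment and a list that lacks 4 skips the ocean data.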

The 1990s station dropout does not have a warming effect

Tamino gives his results for his GHCN-based temperature reconstruction. It is well worth reading. He also compares stations that are still reporting after 1992 with those that “dropped out” before 1992, and concludes that there is no significant difference in the overall trend. In other words, his results refute the claim that the 1990s station dropout has a warming effect. His results are preliminary and for the Northern Hemisphere only.

Tamino’s analysis uses only the land stations; in order to write this blog post I tweaked ccc-gistemp so that we can produce a land index (python tool/run.py -s 1-3,5 now skips step 4, avoids merging in the ocean data, and effectively produces a global average based only on land data).

It is very easy to subset the input to ccc-gistemp and run it with smaller input datasets. So in this case I can split the input data into stations reporting since 1992, and those that have no records since 1992, and run ccc-gistemp separately on each input. I created tool/v2split.py to split the input data. Specifically I ran step 0 (which merges USHCN, Antarctic, and Hohenpeissenberg data into the GHCN data) to create work/v2.mean_comb then split that file into those stations reporting in 1992 and after, and those not reporting after the cutoff. Then I ran steps 1,2,3, and 5 of ccc-gistemp to create a land index:
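The split itself is simple. The sketch below is not tool/v2split.py; it assumes each line of the v2.mean-style file starts with an 11-character station identifier, one duplicate digit, and a 4-digit year, and it treats any record for a year as "reporting" without checking for missing monthly values. The function name is hypothetical.

```python
def split_by_cutoff(lines, cutoff_year=1992):
    """Partition records into (still reporting, dropped out) by station.

    `lines` must be a list, because it is traversed twice.
    Assumed layout: columns 0-10 station id, column 11 duplicate digit,
    columns 12-15 year.
    """
    # First pass: find the most recent year seen for each station.
    last_year = {}
    for line in lines:
        station = line[:11]
        year = int(line[12:16])
        last_year[station] = max(year, last_year.get(station, year))
    # Second pass: send every record of a station to the same side.
    reporting, dropped = [], []
    for line in lines:
        if last_year[line[:11]] >= cutoff_year:
            reporting.append(line)
        else:
            dropped.append(line)
    return reporting, dropped
```

Each output list can then be written to its own file and used as the input for the later steps.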

It is certainly not the case that the warming trend is stronger in the data from the post-cutoff stations.

The differences between these results and Tamino’s are interesting. Both show good agreement for most of the 20th century. These data show more divergence than Tamino’s in the 1800s. Is that because we’re using Southern Hemisphere data as well, or is it because of the difference in station combining? Further investigation is merited.

We hope to make “experiments” of this sort easier to perform using ccc-gistemp and encourage anyone interested to download the code and play with it.

Update: Nick B obliges with a graph of the differences:

On integers, floating-point numbers, and rounding

Progress continues on the ccc-gistemp project. Anyone interested is welcome to go on over to the source code browse page and peruse it.

  • Paul Ollis has done excellent work separating all the I/O code from the main algorithm, and refactoring it so that data can flow through the entire program without passing through several intermediate data files.
  • David Jones has made a tool for indexing plain-text data files for random access, and has been working on SVG-based visualisation tools. Together, one day these will let us provide a snappy graphical interface for answering questions like “how did the peri-urban adjustment on this station work?”
  • I have been working on removing rounding from the whole system. Until now we have often found ourselves having to round values in order to maintain exact equivalence with GISS results (which may have been rounded for output to an intermediate data file which is read by a later phase). For example, rounding temperatures to the nearest tenth degree Celsius, or latitude and longitude values to the nearest tenth degree. I mentioned this in email with Dr Reto Ruedy of GISS, and he assured me that all such rounding is incidental to the algorithm – an accident of history. So we are removing it from our version, to help clarify the algorithm. We will end up with the only explicit rounding in the system being done in order to write the final result files.
  • Next I am hoping we will extract the main numerical parameters of the algorithm – for instance, the 1200km station radius for gridding, the 4 rural stations required for peri-urban adjustment – to a separate module, where they can be easily modified by anyone interested in experimenting with different values.

We are aiming for a release 0.4.0 of ccc-gistemp to happen around the end of February or in early March, time permitting. The specification of this version is something like “no I/O, no rounding, and explicit parameters”, and we’re pretty close to that now.

Rounding in GISTEMP has prompted a lot of discussion in the blogosphere, and since I have been working in that area in ccc-gistemp, I thought I could write a few words here to clarify it. There is a lot of general misunderstanding of computer arithmetic, even among professional programmers. I have dealt with the nitty-gritty of it in various capacities in the past, and hopefully can convey some of my expertise.
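As a small taste of the issue, the sketch below shows two separate effects: binary floating point cannot represent most decimal fractions exactly, and rounding values before averaging them can shift the result by far more than that. The helper names and the sample values are invented for illustration and have nothing to do with real station data.

```python
def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def mean_of_rounded(values, ndigits=1):
    """Mean of values that were each rounded first, as an intermediate
    file holding tenths of a degree would force."""
    return mean([round(v, ndigits) for v in values])


# Effect 1: representation error, around 1e-16 in size.
representation_error = abs((0.1 + 0.2) - 0.3)

# Effect 2: rounding error, here 0.04 degrees on invented values.
samples = [0.14, 0.14, 0.14]
rounding_error = abs(mean(samples) - mean_of_rounded(samples))
```

The first effect is many orders of magnitude smaller than any temperature measurement error; the second is the one that intermediate rounding introduces.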

NASA GISS wants to use our code

After the release of ccc-gistemp 0.3.0, I contacted Dr Reto Ruedy of NASA GISS to ask him to try out the release and have a look through it.
Dr Ruedy responded, thanking us for our effort, and saying “I hope to switch to your version of that program”. After some further discussion, he clarified this:
When GISS has the resources: “Ideally, we would like to replace our whole code.”

They are busy with other things, and won’t have the resources for quite some time. Also, we will need to do some more work, to interface our code with various GISS tools (such as the station data web page). Nonetheless this is very much to the credit of the whole ccc-gistemp team. Well done, everybody.

ccc-gistemp release 0.3.0

I am pleased to announce ccc-gistemp release 0.3.0. This includes a number of bug fixes and features in our framework and tools, and a great deal of clarification work especially in steps 1 (station combination) and 2 (peri-urban adjustment). Really, it’s much better. Give it a go.

Much of GISTEMP was concerned with generating and consuming intermediate files, to separate phases and to avoid keeping the whole dataset in memory at once (an important consideration when GISTEMP was originally written). In 0.3.0 this has largely been replaced by an iterator-based approach, which is clearer, automatically pipelines all the processing where possible, and avoids all code concerned with serialization and deserialization.
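The idea of an iterator-based pipeline can be shown with Python generators. This is a toy sketch, not the ccc-gistemp code: the stage names, the two-column input format, and the -9999 missing-value marker are assumptions made for the example.

```python
def read_records(lines):
    """Parse 'station value' lines, value in integer tenths of a degree."""
    for line in lines:
        station, value = line.split()
        yield station, int(value)


def drop_missing(records, missing=-9999):
    """Skip records that carry the (assumed) missing-value marker."""
    for station, tenths in records:
        if tenths != missing:
            yield station, tenths


def to_celsius(records):
    """Convert integer tenths to floating point degrees Celsius."""
    for station, tenths in records:
        yield station, tenths / 10.0


def pipeline(lines):
    """Chain the stages; nothing is computed until the result is consumed."""
    return to_celsius(drop_missing(read_records(lines)))
```

Each record flows through every stage before the next one is read, so no stage needs to write a file or hold the whole dataset in memory.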

We have retained intermediate files between the distinct steps of the GISTEMP algorithm, for compatibility with GISTEMP and for testing purposes. We have also retained some code to round or truncate some data at the points where Fortran truncates it for serialization. This will be removed in future.

Some of the original GISS code was already in Python, and survived almost unchanged in 0.2.0. Much of the rest of 0.2.0, especially the more complex arithmetical processing in step 2, was more-or-less transliterated from the Fortran. A lot of this code has been rewritten in 0.3.0, especially improving the clarity of the station-combining code (in step1.py) and the peri-urban adjustment (now in step2.py).

There has been a rearrangement of the code: the code/ directory now only contains code which we consider part of the GISTEMP algorithm. Everything else – input data fetching, run framework, testing, debugging utilities – is in the tool/ directory. This division will continue, to allow us to add useful tools while still reducing and clarifying the core code.

There is better code for comparing results, and a regression test against genuine GISTEMP results.
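At its simplest, comparing two sets of results means finding the largest disagreement over the years they have in common. The sketch below illustrates that idea only; it is not the comparison tool shipped with ccc-gistemp, and the function name and the dictionary representation are assumptions.

```python
def max_abs_difference(ours, theirs):
    """Return (year, difference) for the largest absolute disagreement.

    Both arguments map year -> annual anomaly; only years present in
    both series are compared.
    """
    common = sorted(set(ours) & set(theirs))
    if not common:
        raise ValueError("the two series have no years in common")
    diffs = {year: abs(ours[year] - theirs[year]) for year in common}
    worst_year = max(diffs, key=diffs.get)
    return worst_year, diffs[worst_year]
```

A regression test can then assert that the returned difference stays below a chosen tolerance.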

What do we mean when we say “Fortran”?

A visitor named “Dan” recently left this comment:

[...] I’m not sure why Python has been described in some associated project documents as easier or friendlier than Fortran. They are both pretty simple in that regard. I agree there are other reasons that Python (like some other new languages) is a better choice for a new project with many contributors and users.

Thank you for raising that point, Dan.  Fortran and Python themselves are really ciphers in this discussion, standing for “obscure twisty code” and “clean clear code”.

As the old saw has it, “you can write Fortran in any language” – indeed, GISTEMP includes Fortran written in C, ksh, Fortran and even in Python.  The reverse is also true: with all the features that have been added to Fortran in the last few decades, you can write any language in Fortran[*].  However, I’d invite you, or anyone, to compare:

1. padjust.f, as it is in GISTEMP.  This is a tiny corner of GISTEMP, used for applying computed heat-island adjustments to urban stations, certainly nothing like as twisty as most of the code around it (e.g. PApars.f).

2. padjust.py, as it is in ccc-gistemp 0.2.0.  This is a fairly routine translation of padjust.f into Python. It was a key step in the road to our 0.2 all-Python milestone, but one could, unkindly, characterize it as Fortran-in-Python.

3. apply_adjustments(), as it is today in the ccc-gistemp sources. The relevant code here is lines 690 to 791 inclusive.

There are several points to be made here. Firstly, version 3 is not exactly clear. The function adj() is not well-documented. There are a lot of slightly mysterious variables. There is some unnecessary messing-around with metadata entries, and there are substantial opportunities for using helpful little functions such as max() and min(). Code clarification is a process of gradual improvement, and is certainly not finished here.

Secondly, I expect something much like version 3 could have been written in modern Fortran. The main reason we’re not using a modern Fortran is that I set up the project. Like each team member I brought my own skills and preferences to the project, and my favourite language, at least for writing small pieces of clear, simple code, is Python. I have had very little professional experience in Fortran, and essentially none for 20 years.

Thirdly, most science Fortran, even newly written science Fortran, is like version 1: FORTRAN 77 with aspects of Fortran 90 (e.g. free-form source, long identifiers, dynamic arrays). Some is still in FORTRAN 66. Furthermore, it is big blobs of Fortran with cryptic variable and function names, very occasional comments, aliasing through COMMON blocks, and a lot of unused functions and/or variables. What is the variable iy1e, and how can I find out? Why are we comparing nameo(31:32) to ‘ R’? This is what I mean when I say “Fortran”, and it is typical of GISTEMP, and other bodies of code we have seen, and friends and colleagues tell us it is true elsewhere. It is the natural consequence of the way in which science is done. Scientists are paid to do science, not to write code. As long as the code does what it ought to do, for long enough to plot the charts for the paper for publication, then it’s good enough. There is no pressure to write code which is clear, maintainable and flexible, and so scientists mostly don’t develop or retain the skills to do so. That is one of the points of this project: to show what such code might look like, how to write it, and how it can be beneficial to science.

Fourthly, in the specific case of some code such as GISTEMP, the results of the code are being used to argue important public policy questions which will affect all of us. Something that I, personally, can do to turn some of the heat of that debate into light, and to help us all reach good decisions, is to make GISTEMP accessible to the public. All-Python is simply better than Fortran-ksh-Python-C for that purpose, for various reasons but primarily that it is easier to install, browse, and run on a random PC. Consider how many people downloaded the GISTEMP sources and ran into the sand very early. That should not happen with ccc-gistemp.

So, in short, yes we are converting from “Fortran” to “Python”, but some of the “Fortran” was already Python and some of the “Python” is decidedly Fortran-like.

For more on the pros and cons of Fortran and Python, please visit the Software Carpentry project. No affiliation; I just like what they do.

[*] This isn’t quite true – as far as I know Fortran still doesn’t have the meta-object protocol or introspective facilities of some languages, and pretty much no other language has the macro facilities of Lisp – but features like that play no part in this project anyway.

GISTEMP 2009 anomaly anomaly

In a previous article I predicted that the 2009 GISTEMP anomaly would be +0.58. In fact when it was published it came in at +0.57. This 0.01 K difference is well within any reasonable error bounds and typical of the sort of error you get from rounding. Still, it bothered me. How unlucky was I to get agreement for all the years except the most recent one?

Today I realised that although I was using up-to-date land data I wasn’t using up-to-date ocean data. I have just fetched fresh ocean data and rerun ccc-gistemp. Of course the 2009 anomaly comes out as +0.57 K, same as GISS: