GISTEMP Urban Adjustment

After some recent tweaks of mine to the ccc-gistemp sources, it is now possible to run the GISTEMP pipeline with some of the steps omitted. An earlier post shows how omitting Step 4 creates a land-only index. My recent changes allow Step 2 to be omitted as well. Step 2 is the urban adjustment step (in which stations marked as urban have their trend adjusted).

Omitting Step 2 therefore gives us an idea of the magnitude of the urban adjustment’s effect. It so happens that the writing of this blog post overlapped with Nick Barnes implementing GISTEMP’s new scheme for identifying urban stations (corresponding to GISTEMP’s update of 2010-01-16). That gives me an opportunity to show both the new and old adjustment schemes against a “no adjustment” baseline:

In making this graph Step 4 has been omitted, giving us a land index. This is primarily to amplify the differences: land covers the smaller fraction of the Earth’s surface, so including the ocean data (which receives no urban adjustment) would make the difference smaller.

And for each hemisphere:

Northern:

Southern:

To make a “no urban adjustment” run of ccc-gistemp: «python tool/run.py -s 0,1,3,5»; to make an “urban adjustment” land index: «python tool/run.py -s 0,1,2,3,5».

The 1990s station dropout does not have a warming effect

Tamino gives the results of his GHCN-based temperature reconstruction. It is well worth reading. He also gives a comparison between stations still reporting after 1992 and those that “dropped out” before 1992, and concludes that there is no significant difference in the overall trend. In other words, he refutes the claim that the 1990s station dropout has a warming effect. His results are preliminary and for the Northern Hemisphere only.

Tamino’s analysis uses only the land stations; in order to write this blog post I tweaked ccc-gistemp so that we can produce a land index (python tool/run.py -s 1-3,5 now skips Step 4, avoids merging in the ocean data, and effectively produces a global average based only on land data).

It is very easy to subset the input to ccc-gistemp and run it on smaller input datasets. So in this case I can split the input data into stations reporting in 1992 or later and those with no records since 1992, and run ccc-gistemp separately on each input. I created tool/v2split.py to split the input data. Specifically, I ran Step 0 (which merges USHCN, Antarctic, and Hohenpeissenberg data into the GHCN data) to create work/v2.mean_comb, then split that file into the stations reporting in 1992 and after, and those not reporting after the cutoff. Then I ran Steps 1, 2, 3, and 5 of ccc-gistemp to create a land index:

It is certainly not the case that the warming trend is stronger in the data from the post-cutoff stations. [edit 2010-03-22: In a subsequent post I add trend lines to this chart]
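
For the curious, here is a minimal sketch of the kind of split tool/v2split.py performs (an illustration of the idea, not the actual tool; it assumes the GHCN v2.mean layout of a 12-character station/duplicate identifier followed by a 4-digit year):

    # Sketch only: split a v2.mean-format file according to whether a
    # record's station/duplicate reports in or after a cutoff year.
    from collections import defaultdict

    CUTOFF = 1992

    def last_year_by_record(path):
        # Map each 12-character station/duplicate ID to its last year of data.
        last = defaultdict(int)
        for line in open(path):
            last[line[:12]] = max(last[line[:12]], int(line[12:16]))
        return last

    def split(path, post_path, pre_path):
        last = last_year_by_record(path)
        post = open(post_path, 'w')
        pre = open(pre_path, 'w')
        for line in open(path):
            out = post if last[line[:12]] >= CUTOFF else pre
            out.write(line)
        post.close()
        pre.close()

    split('work/v2.mean_comb', 'work/v2.mean_post1992', 'work/v2.mean_pre1992')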

The differences between these results and Tamino’s are interesting. Both show good agreement for most of the 20th century. These data show more divergence than Tamino’s in the 1800s. Is that because we’re using Southern Hemisphere data as well, or is it because of the difference in station combining? Further investigation is merited.

We hope to make “experiments” of this sort easier to perform using ccc-gistemp and encourage anyone interested to download the code and play with it.

Update: Nick B obliges with a graph of the differences:

On integers, floating-point numbers, and rounding

Progress continues on the ccc-gistemp project. Anyone interested is welcome to go on over to the source code browse page and peruse it.

  • Paul Ollis has done excellent work separating all the I/O code from the main algorithm, and refactoring it so that data can flow through the entire program without passing through several intermediate data files.
  • David Jones has made a tool for indexing plain-text data files for random access, and has been working on SVG-based visualisation tools. Together, one day these will let us provide a snappy graphical interface for answering questions like “how did the peri-urban adjustment on this station work?”
  • I have been working on removing rounding from the whole system. Until now we have often found ourselves having to round values in order to maintain exact equivalence with GISS results (which may have been rounded for output to an intermediate data file which is read by a later phase). For example, rounding temperatures to the nearest tenth degree Celsius, or latitude and longitude values to the nearest tenth degree. I mentioned this in email with Dr Reto Ruedy of GISS, and he assured me that all such rounding is incidental to the algorithm – an accident of history. So we are removing it from our version, to help clarify the algorithm. We will end up with the only explicit rounding in the system being done in order to write the final result files.
  • Next I am hoping we will extract the main numerical parameters of the algorithm – for instance, the 1200km station radius for gridding, the 4 rural stations required for peri-urban adjustment – to a separate module, where they can be easily modified by anyone interested in experimenting with different values (see the sketch after this list).
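
To give a flavour of what such a module might look like, here is a hypothetical sketch (the names are invented; the values are the ones quoted above):

    # parameters.py -- hypothetical sketch of an extracted-parameters
    # module; names invented, values as quoted in the post.

    # Radius (in km) within which a station contributes to a grid cell.
    gridding_radius = 1200.0

    # Number of rural stations required to compute a peri-urban
    # adjustment for an urban station.
    urban_adjustment_required_rural_stations = 4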

We are aiming for a release 0.4.0 of ccc-gistemp to happen around the end of February or in early March, time permitting. The specification of this version is something like “no I/O, no rounding, and explicit parameters”, and we’re pretty close to that now.

Rounding in GISTEMP has prompted a lot of discussion in the blogosphere, and since I have been working on that area of ccc-gistemp, I thought I would write a few words here to clarify it. There is a lot of general misunderstanding of computer arithmetic, even among professional programmers. I have dealt with the nitty-gritty of it in various capacities in the past, and hope I can convey some of my expertise.
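
To give a flavour of the pitfalls, here is a tiny Python illustration (nothing to do with GISTEMP itself, just the arithmetic):

    # Decimal fractions such as 0.1 have no exact binary floating-point
    # representation, so apparently innocuous arithmetic drifts away
    # from the exact answer.
    total = sum(0.1 for _ in range(10))
    print(repr(total))             # 0.9999999999999999
    print(total == 1.0)            # False
    # Rounding the inexact value to one decimal place hides the drift,
    # but bakes a small error into anything computed downstream.
    print(round(total, 1) == 1.0)  # True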

NASA GISS wants to use our code

After the release of ccc-gistemp 0.3.0, I contacted Dr Reto Ruedy of NASA GISS to ask him to try out the release and have a look through it.
Dr Ruedy responded, thanking us for our effort, and saying “I hope to switch to your version of that program”. After some further discussion, he clarified this:
When GISS has the resources:

Ideally, we would like to replace our whole code.

They are busy with other things, and won’t have the resources for quite some time. Also, we will need to do some more work, to interface our code with various GISS tools (such as the station data web page). Nonetheless this is very much to the credit of the whole ccc-gistemp team. Well done, everybody.

ccc-gistemp release 0.3.0

I am pleased to announce ccc-gistemp release 0.3.0. This includes a number of bug fixes and features in our framework and tools, and a great deal of clarification work, especially in steps 1 (station combination) and 2 (peri-urban adjustment). Really, it’s much better. Give it a go.

Much of GISTEMP was concerned with generating and consuming intermediate files, to separate phases and to avoid keeping the whole dataset in memory at once (an important consideration when GISTEMP was originally written). In 0.3.0 this has largely been replaced by an iterator-based approach, which is clearer, automatically pipelines all the processing where possible, and avoids all code concerned with serialization and deserialization.
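
A minimal sketch of the style (invented names and toy data, not our actual code): each step consumes an iterator of records and yields records, so the whole chain pipelines lazily with no intermediate files:

    # Invented names and toy data; illustrates the iterator style only.
    def step1(records):
        for rec in records:
            yield rec.strip()        # e.g. tidy each record

    def step2(records):
        for rec in records:
            if rec:                  # e.g. drop empty records
                yield rec.upper()

    raw = iter(['alpha \n', '\n', ' beta\n'])
    for rec in step2(step1(raw)):
        print(rec)                   # prints ALPHA, then BETA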

We have retained intermediate files between the distinct steps of the GISTEMP algorithm, for compatibility with GISTEMP and for testing purposes. We have also retained some code to round or truncate some data at the points where Fortran truncates it for serialization. This will be removed in future.

Some of the original GISS code was already in Python, and survived almost unchanged in 0.2.0. Much of the rest of 0.2.0, especially the more complex arithmetical processing in step 2, was more-or-less transliterated from the Fortran. A lot of this code has been rewritten in 0.3.0, especially improving the clarity of the station-combining code (in step1.py) and the peri-urban adjustment (now in step2.py).

There has been a rearrangement of the code: the code/ directory now only contains code which we consider part of the GISTEMP algorithm. Everything else – input data fetching, run framework, testing, debugging utilities – is in the tool/ directory. This division will continue, to allow us to add useful tools while still reducing and clarifying the core code.

There is better code for comparing results, and a regression test against genuine GISTEMP results.

What do we mean when we say “Fortran”?

A visitor named “Dan” recently left this comment:

[…] I’m not sure why Python has been described in some associated project documents as easier or friendlier than Fortran. They are both pretty simple in that regard. I agree there are other reasons that Python (like some other new languages) is a better choice for a new project with many contributors and users.

Thank you for raising that point, Dan.  Fortran and Python themselves are really ciphers in this discussion, standing for “obscure twisty code” and “clean clear code”.

As the old saw has it, “you can write Fortran in any language” – indeed, GISTEMP includes Fortran written in C, ksh, Fortran and even in Python.  The reverse is also true: with all the features that have been added to Fortran in the last few decades, you can write any language in Fortran[*].  However, I’d invite you, or anyone, to compare:

1. padjust.f, as it is in GISTEMP.  This is a tiny corner of GISTEMP, used for applying computed heat-island adjustments to urban stations, certainly nothing like as twisty as most of the code around it (e.g. PApars.f).

2. padjust.py, as it is in ccc-gistemp 0.2.0.  This is a fairly routine translation of padjust.f into Python. It was a key step on the road to our 0.2 all-Python milestone, but one could, unkindly, characterize it as Fortran-in-Python.

3. apply_adjustments(), as it is today in the ccc-gistemp sources. The relevant code here is lines 690 to 791 inclusive.

There are several points to be made here. Firstly, version 3 is not exactly clear. The function adj() is not well-documented. There are a lot of slightly mysterious variables. There is some unnecessary messing-around with metadata entries, and there are substantial opportunities for using helpful little functions such as max() and min(). Code clarification is a process of gradual improvement, and is certainly not finished here.
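
To illustrate the flavour of the difference, here is a made-up example (not from ccc-gistemp): the same clamping operation written Fortran-in-Python style and then idiomatically:

    # A made-up illustration, not ccc-gistemp code.

    # Fortran-in-Python: explicit indices and branchy in-place mutation.
    def clamp_fortranish(xs, lo, hi):
        i = 0
        while i < len(xs):
            if xs[i] < lo:
                xs[i] = lo
            if xs[i] > hi:
                xs[i] = hi
            i = i + 1
        return xs

    # Idiomatic Python: the helpful little functions min() and max().
    def clamp(xs, lo, hi):
        return [min(max(x, lo), hi) for x in xs]

    print(clamp([1, 5, 9], 2, 8))    # [2, 5, 8]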

Secondly, I expect something much like version 3 could have been written in modern Fortran. The main reason we’re not using a modern Fortran is that I set up the project. Like each team member I brought my own skills and preferences to the project, and my favourite language, at least for writing small pieces of clear, simple code, is Python. I have had very little professional experience in Fortran, and essentially none for 20 years.

Thirdly, most science Fortran, even newly written science Fortran, is like version 1: FORTRAN 77 with aspects of Fortran 90 (e.g. free-form source, long identifiers, dynamic arrays). Some is still in FORTRAN 66.  Furthermore, it comes in big blobs of Fortran with cryptic variable and function names, very occasional comments, aliasing through COMMON blocks, and a lot of unused functions and/or variables.  What is the variable iy1e, and how can I find out? Why are we comparing nameo(31:32) to ‘ R’? This is what I mean when I say “Fortran”. It is typical of GISTEMP and other bodies of code we have seen, and friends and colleagues tell us it is true elsewhere. It is the natural consequence of the way in which science is done. Scientists are paid to do science, not to write code.  As long as the code does what it ought to do, for long enough to plot the charts for the paper for publication, then it’s good enough. There is no pressure to write code which is clear, maintainable, and flexible, and so scientists mostly don’t develop or retain the skills to do so. That is one of the points of this project: to show what such code might look like, how to write it, and how it can be beneficial to science.

Fourthly, in the specific case of some code such as GISTEMP, the results of the code are being used to argue important public policy questions which will affect all of us. Something that I, personally, can do to turn some of the heat of that debate into light, and to help us all reach good decisions, is to make GISTEMP accessible to the public. All-Python is simply better than Fortran-ksh-Python-C for that purpose, for various reasons but primarily that it is easier to install, browse, and run on a random PC. Consider how many people downloaded the GISTEMP sources and ran into the sand very early. That should not happen with ccc-gistemp.

So, in short, yes we are converting from “Fortran” to “Python”, but some of the “Fortran” was already Python and some of the “Python” is decidedly Fortran-like.

For more on the pros and cons of Fortran and Python, please visit the Software Carpentry project. No affiliation; I just like what they do.

[*] This isn’t quite true – as far as I know Fortran still doesn’t have the meta-object protocol or introspective facilities of some languages, and pretty much no other language has the macro facilities of Lisp – but features like that play no part in this project anyway.

GISTEMP 2009 anomaly anomaly

In a previous article I predicted that the 2009 GISTEMP anomaly would be +0.58. In fact when it was published it came in at +0.57. This 0.01 K difference is well within any reasonable error bounds and typical of the sort of error you get from rounding. Still, it bothered me. How unlucky was I to get agreement for all the years except the most recent one?

Today I realised that although I was using up-to-date land data, I wasn’t using up-to-date ocean data. I have just fetched fresh ocean data and rerun ccc-gistemp. Of course the 2009 anomaly now comes out as +0.57 K, the same as GISS:

Overview of GISTEMP intermediate files

When ccc-gistemp runs, the data files in the input/ directory are processed in a number of steps to produce the results in the result/ directory. On the way many intermediate files are written to the work/ directory. Generally the intermediate files are written by one stage of the process and consumed by a later stage. GISS’s GISTEMP works in a broadly similar way, but the details are slightly different. One of the first things we did when working with GISTEMP was to reorganise where the intermediate files were written.

Many of the intermediate files were only written because the computers on which GISTEMP was originally intended to run were extremely resource-poor, and the whole dataset could not always be processed in working memory. Hence, data was written into several intermediate files and processed piece by piece. This is no longer necessary, and since ccc-gistemp release 0.2.0 we have made changes that mean far fewer intermediate files are written.

A consequence is that there are now a manageable number of files in the work/ directory, and a listing of them tells a story about how GISTEMP works:

         5 Jan 20 13:27 GHCN.last_year
  44716518 Jan 20 13:28 v2.mean_comb
  29802728 Jan 20 13:30 Ts.txt
  39696368 Jan 20 13:31 Ts.bin
  20853712 Jan 20 13:31 Ts.GHCN.CL
   2107900 Jan 20 13:31 ANN.dTs.GHCN.CL
    354106 Jan 20 13:33 PApars.pre-flags
    371742 Jan 20 13:33 PApars.list
  19233584 Jan 20 13:34 Ts.GHCN.CL.PA
         0 Jan 20 13:34 BX.Ts.GHCN.CL.PA.1200
  50240120 Jan 20 13:50 SBBX1880.Ts.GHCN.CL.PA.1200
  34001576 Jan 20 13:50 SBBX.HadR2
    176152 Jan 20 13:51 ZON.Ts.ho2.GHCN.CL.PA.1200.step1
    176152 Jan 20 13:51 ZON.Ts.ho2.GHCN.CL.PA.1200
     15974 Jan 20 13:51 ANNZON.Ts.ho2.GHCN.CL.PA.1200

The above is a listing of my work/ directory having done a fresh run using subversion revision 199 sources. Each row lists: file size in bytes, timestamp, file name.

The first thing to note is the timestamps. The first file is written at 13:27 and the last at 13:51. On my machine ccc-gistemp took about 25 minutes for this run.

I’ll go through the files in order and try to explain what each one is. Bear in mind that some of these files will probably disappear in future versions as we reduce the number of times data is written to disk and read back in again.

GHCN.last_year

This file is used to pass the highest year found in the GHCN input data (input/v2.mean) from step0 (where the file is created) to step2 (where it is read). The contents are just that year, which is not a very elegant way to pass the information. It’s needed because step2 creates a Fortran binary file with fixed record lengths, and the length of a record is related to the highest year that has data, so that year needs to be known before the binary file is created.

v2.mean_comb

This large file is the output of step0. It contains all the temperature data that GISTEMP will go on to use, combined into one file. The temperature data are combined from: GHCN, USHCN, SCAR, and one input file for Hohenpeissenberg. The combining process is not entirely trivial: data from USHCN do not simply replace data from GHCN; they are adjusted by the mean monthly difference (see the function include_US in step0.py).
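
The idea of that adjustment, in sketch form (an illustration of the principle only, not the actual include_US code):

    # Illustration of the principle only, not the actual include_US code.
    # Shift the USHCN series, per calendar month, by the mean difference
    # from GHCN over the years where both report (assumed non-empty here).
    def adjust_ushcn(ushcn, ghcn):
        # ushcn, ghcn: dicts mapping (year, month) -> temperature.
        adjusted = {}
        for month in range(1, 13):
            overlap = [k for k in ushcn if k in ghcn and k[1] == month]
            offset = sum(ghcn[k] - ushcn[k] for k in overlap) / len(overlap)
            for k in ushcn:
                if k[1] == month:
                    adjusted[k] = ushcn[k] + offset
        return adjusted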

Ts.txt

This is the output of step1. Duplicate records (multiple series for the same weather station) are combined, if possible; an adjustment is made for the St Helena record (listed in the file config/combine_pieces_helena.in); records and partial records listed in the file config/Ts.strange.RSU.list.IN are removed; an adjustment is made for a discontinuity in the Lihue record (listed in the file config/Ts.discont.RS.alter.IN).

The output format of this file is different from the v2.mean format used in the previous step. Metadata from the file input/v2.inv is included in this file.

Ts.bin

At the beginning of step2 the Ts.txt file from step1 is converted to this Fortran binary file. The binary file is easier to access using a Fortran program. At one time in the past the binary file would have been significantly quicker as well, but I doubt that matters these days. Since ccc-gistemp is now entirely in Python it’s likely that we’ll remove the binary file, preserving it only as an option to match the GISTEMP intermediate files.

In the GISS code, this file is then split into 6 files so that the gridding, in step3, can proceed by using only a subset of the station data held in memory at once. Keeping all the data in one file greatly reduced the number of intermediate files created in ccc-gistemp.

The next few files are all internal to step2, the urban adjustment.

Ts.GHCN.CL

The same data as Ts.bin but trimmed to make the file size slightly smaller. Even more pointless in this day and age.

ANN.dTs.GHCN.CL

Annual anomalies for each station, computed in step2. Each of the 12 months has an average computed, and the anomaly for a year is computed from the difference between each month in that year and that month’s average.

There is far less data in this file than in the monthly series in Ts.txt, and it is this data that is used to make the urban adjustment.
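
In sketch form (an illustration of the calculation described above, not the step2 code itself):

    # Illustration of the calculation described above, not the step2 code.
    def annual_anomalies(series):
        # series: dict mapping (year, month) -> temperature.
        # Average for each of the 12 calendar months, across all years.
        month_avg = {}
        for month in range(1, 13):
            vals = [t for (y, m), t in series.items() if m == month]
            month_avg[month] = sum(vals) / len(vals)
        # A year's anomaly is the mean of its monthly departures.
        anomalies = {}
        for year in set(y for (y, m) in series):
            deps = [t - month_avg[m]
                    for (y, m), t in series.items() if y == year]
            anomalies[year] = sum(deps) / len(deps)
        return anomalies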

PApars.pre-flags

By analysing urban and rural stations, step2 creates this table of parameters that control what adjustments are going to be made (to urban stations).

PApars.list

The parameters in PApars.pre-flags are annotated with a flag that affects the exact adjustment made.

Ts.GHCN.CL.PA

The data from Ts.GHCN.CL are read in and urban stations are adjusted according to the parameters previously computed and stored in PApars.list. This file contains the adjusted data and is the output of step2.

BX.Ts.GHCN.CL.PA.1200

An empty file created by step3. The GISS version of GISTEMP creates a gridded dataset (SBBX1880.Ts.GHCN.CL.PA.1200, see below) with 8000 cells (subboxes) and from this also creates a gridded dataset with 80 cells (boxes), which is what this file would be. Currently ccc-gistemp does not create the 80-cell version.

SBBX1880.Ts.GHCN.CL.PA.1200

This is the gridded output of step3. It’s created by considering each grid cell in turn, and combining (usually several) station records into one record for each cell. From this point on only gridded data is used.
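
The combining weights each contributing station by its distance from the cell; in sketch form (the 1200 km linear weighting is the well-known scheme from the GISTEMP papers, but the names here are invented and the details simplified):

    # Sketch of distance weighting for gridding; invented names,
    # simplified details. A station's weight falls linearly from 1 at
    # the cell centre to 0 at the 1200 km radius.
    def station_weight(distance_km, radius_km=1200.0):
        return max(0.0, 1.0 - distance_km / radius_km)

    print(station_weight(0))      # 1.0
    print(station_weight(600))    # 0.5
    print(station_weight(1500))   # 0.0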

SBBX.HadR2

This file contains sea surface data from Hadley and Reynolds version 2 (the “Had” and “R2” in its name). It’s the result of step4, which takes the input file of the same name (in the input/ directory) and adds in any updates that have been downloaded. Usually we run without any updates, and in this case step4 simply copies this file from the input/ directory.

ZON.Ts.ho2.GHCN.CL.PA.1200.step1

This is a temporary file used by step5. Step5 takes the two subbox files, SBBX1880.Ts.GHCN.CL.PA.1200 containing land data and SBBX.HadR2 containing ocean data, merges the land and ocean data together, and creates a gridded file with 80 boxes. The gridded file appears in the result/ directory: BX.Ts.ho2.GHCN.CL.PA.1200.

The data in the gridded file are combined to produce a data series for each of 8 latitudinal zones; then from those 8 zones another 6 are produced for large-scale regions, including the 3 for the Northern Hemisphere, Southern Hemisphere, and Global average. That zonal data is stored in this file.

ZON.Ts.ho2.GHCN.CL.PA.1200

The zonal data are read and an alternative computation is done in which the global average is the simple average of the northern and southern hemispheres, as opposed to the previous calculation, which uses an area-weighted average.
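
For flavour, the area weighting looks like this in miniature (a toy sketch, not GISTEMP code; a latitude band’s area is proportional to the difference of the sines of its edge latitudes):

    from math import sin, radians

    # Toy sketch, not GISTEMP code: combine latitude bands into one
    # mean, weighting each band by its share of the Earth's surface.
    def band_weight(lat1, lat2):
        return abs(sin(radians(lat2)) - sin(radians(lat1)))

    # (edges in degrees, anomaly) for three northern-hemisphere bands:
    bands = [((0, 30), 0.4), ((30, 60), 0.6), ((60, 90), 0.9)]
    total = sum(band_weight(a, b) for (a, b), _ in bands)
    nh = sum(band_weight(a, b) * t for (a, b), t in bands) / total
    print(round(nh, 3))    # 0.54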

ANNZON.Ts.ho2.GHCN.CL.PA.1200

Annual anomalies are computed from the monthly data series for each zone.

Whilst the data for this and the previous file are being computed, the text files that hold summaries of this data are written to the result/ directory.

All-Python ccc-gistemp release

I am proud to announce release 0.2.0 of ccc-gistemp.  This is an all-Python reimplementation of GISTEMP, the NASA GISS surface temperature analysis.  Please feel free to download and play with it.  It will automatically fetch input data across the internet, and produce textual and graphical result files.

This release works on Windows, Linux, Mac OS X, FreeBSD, and probably anywhere else you can get Python to work.  The only dependency is on Python (2.5.2 or later, as we discovered today that the code to fetch input data trips over a bug in earlier Python libraries).

The results of running this release match GISTEMP results very closely indeed:

Comparison of ccc-gistemp with GISTEMP, on common input data

In fact, the annual global, northern hemisphere, and southern hemisphere anomaly results are identical, as are the southern hemisphere monthly anomalies.  The global monthly anomalies differ 7 times, out of more than 1000, each time by one digit in the least-significant place.

This ends phase 1 of the CCC-GISTEMP project.  However, although there is no remaining Fortran, ksh, or C source code, much of step1.py is still GISS code, and a lot of the large-scale structure of the code is still dictated by its 1980s Fortran heritage.  For instance, the data is broken up into pieces because it couldn’t all fit into memory at once [ed: 2010-01-19: this particular instance is Issue 25 and it’s now fixed].  This obscures the underlying algorithms being applied.  Phase 2 of CCC-GISTEMP will refactor the code to eliminate this obscurity.  We expect one side-effect to be an increase in speed.

Thanks to all who have contributed, including David Jones, Paul Ollis, Gareth Rees, John Keyes, and Richard Hendricks. Thanks also to Reto Ruedy at GISS, who has been helpful and responsive.

GISTEMP 2009 anomaly

GISS haven’t published a 2009 anomaly yet (as of writing, 2010-01-11T14:30Z), but new GHCN records were made available on 2010-01-07. I’ve just made a fresh run of our ccc-gistemp code with all fresh inputs to produce this graph:
global historical temperature anomaly

Because I’m using fully up to date inputs, this run of ccc-gistemp produces an anomaly for 2009. That red tick at the end is the extra year, 2009, that we produce.

I predict that when GISTEMP publish their 2009 anomaly, it will be +0.58 K.

[minor edits: screwed up year in opening paragraph, and colour of labels in graph]