
Just 440 stations

Over at his blog, Nick Stokes selects rural stations with more than 90 years of data and that report in 2010. In GHCN he counts just 61 stations.

We can do something similar with ccc-gistemp. We get a result that is unreasonably similar to the official GISTEMP result computed with the full series.

The major differences in methodology are:

Here’s the short (but not very clear) Python script that identifies long rural stations:

import itertools

# Rural stations: 11-digit IDs whose field at columns 102:106 of
# input/v2.inv is at most 10 (the script's rurality test).
rural = set(row[:11] for row in open('input/v2.inv')
            if int(row[102:106]) <= 10)

def id11(x): return x[:11]   # 11-digit station identifier
def id12(x): return x[:12]   # 12-digit record identifier

# Keep each record that belongs to a rural station, has more than 90
# years of data, and reports in 2010 (columns 12:16 hold the year).
rurlong = [group for group, rows in
           ((g, list(r)) for g, r in
            itertools.groupby(open('work/v2.step1.out'), id12))
           if id11(group) in rural and len(rows) > 90
           and rows[-1][12:16] == '2010']
print(len(rurlong))

There are 440 stations. With these 12-digit record identifiers in hand it is a trivial matter to create a new v2.mean file («open('v2.longrural', 'w').writelines((row for row in open('work/v2.step1.out') if row[:12] in rurlong))» if you must).
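Spelled out a little more readably (my own restatement of the same one-liner, converting rurlong to a set since membership tests against a list are slow):

keep = set(rurlong)   # set membership is much faster than list membership
with open('v2.longrural', 'w') as out:
    out.writelines(row for row in open('work/v2.step1.out')
                   if row[:12] in keep)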

As I said before, the results are pretty close to the standard analysis:

The new v2.mean file is less than a tenth of the size (uncompressed) of the official GHCN v2.mean file. The analysis is correspondingly about 10 times quicker to run (a few minutes on my old laptop). In case you want to use this file, to replicate my results or run it through your own analysis, I’ve made it available from our googlecode repository.

Byrd

In this comment Bob Koss notices something a little odd about some of the 1980 anomalies near Byrd station (-80.0-119.4). He points out that there is a patch of suspiciously zero anomalies north of Byrd, illustrated by this graphic (the dot is Byrd, roughly):

Here’s an extract from ccc-gistemp’s Step 3 output (work/v2.step3.out) showing that several cells have flat 0 anomalies for 1980:

-77.2-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-77.2-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-77.2-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-77.2-094.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-130.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-094.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-139.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-130.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-139.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-130.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-69.2-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0

(This file is a v2.mean-style file where the 11-digit station identifiers are in fact the centre of the grid cell, rounded to the nearest 0.1 degree in latitude and longitude.)
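For instance, the first row above can be picked apart like this (my own parsing sketch, relying only on the layout just described; row[11], here ‘C’, completes the 12-digit record identifier):

row = '-77.2-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0'
cell = row[:11]          # grid-cell centre, to the nearest 0.1 degree
year = int(row[12:16])   # 1980
# Twelve 5-character monthly values follow; -9999 means missing.
months = [int(row[16 + 5*i : 21 + 5*i]) for i in range(12)]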

It turns out that all of those cells have a time series starting in 1980, which leads me to explain a curiosity in the algorithm used to produce anomalies.

The anomalies produced by the gridding step, Step 3, have a baseline of 1951 to 1980 (see gridding_reference_period in parameters.py). What this means is that just before the end of Step 3 each cell’s time series is adjusted: for each of the 12 months in the year, an average is computed for the period 1951 to 1980, and this average is subtracted from that month’s time series.

If there is a cell that is unlucky enough to have its series start in 1980, then the average for the period 1951 to 1980 is simply the 1980 value. That’s subtracted from the series, so after Step 3 all the anomalies for 1980 will be 0 (see list above).

This doesn’t happen a lot. Recall that the GISTEMP analysis uses a grid with 8000 cells. We can see that the vast majority of the cells (that have any series at all) have 30 years of data for the period 1951 to 1980:

(The chart above represents a simplification, but a reasonable one: a year is marked as “present” if it has any valid monthly data.)

The vast majority of the cells have a full 30 years for the base period.

What if there are no values between 1951 and 1980? In that case the algorithm uses an average over the entire series for its baseline. There are 39 such cells (that’s the tiny blip on the left of the graph).
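Putting those two rules together, here is a minimal sketch of the baselining step (the names are mine, not those of the ccc-gistemp sources; assume a cell series is a dict mapping year to a 12-element list of monthly values, with None for missing months):

BASE_PERIOD = range(1951, 1981)   # gridding_reference_period

def to_anomalies(series):
    # Convert a cell series, in place, to anomalies about its baseline.
    for month in range(12):
        # This month's values within the base period ...
        base = [series[y][month] for y in BASE_PERIOD
                if y in series and series[y][month] is not None]
        if not base:
            # ... falling back to the whole series if there are none.
            base = [row[month] for row in series.values()
                    if row[month] is not None]
        if not base:
            continue   # no data at all for this month
        mean = sum(base) / float(len(base))
        for row in series.values():
            if row[month] is not None:
                row[month] -= mean

A cell whose series starts in 1980 has a single-value baseline for each month, so all of its 1980 anomalies come out as exactly 0, as in the list above.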

So some cells will have a different baseline, and some cells will have a pathetically small number of values to average over in the baseline (the 1980 example above is the extreme case). This is unfortunate if you’re the sort of person who worries about individual years in localised regions, but it doesn’t affect the larger scale averages at all.

Koss also asks why the cell just north and west of Byrd does not have a yearly anomaly for 1980. A reasonable question since it’s close to Byrd and Byrd has data for 1980.

Here’s a Google Earth screenshot, showing stations south of -70. Taken from roughly above the South Pole with 0 degrees longitude to the right.

Note that this shows stations after Step 2 (I ran the command «tool/kml.py --lat -90,-70 work/v2.step2.out > step2.kml»), and Step 2 discards all stations with a short series, which means throwing out several Antarctic stations.

Let’s zoom in a little on Byrd and add the gridded cells (a select few of the gridded cells from Step 3):

For the cells (with locations of the form -NN.N-EEE.E), the 4-digit number is the first year with data (technically, the first year with enough data to produce a yearly anomaly, but that’s the same in these cases).

So why does cell -77.2-121.5 (just north of Byrd) have a series that starts in 1980? Answer: Because it uses data from Byrd, and Byrd’s data starts in 1980. The little squiggle inside the icon is the yearly anomaly series and you can just about see that -77.2-121.5 is the same squiggle as Byrd.

What about cell -77.2-130.5 (just north and west of Byrd)? Why does that have no anomaly for 1980? Because its series starts in 1985. Why is that? Because this cell uses data from Gill and not Byrd (again, compare the squiggle to Gill’s squiggle). Why is that? Because there are 3 stations within 1200 km of this cell. I recently added a bit of logging to Step 3 and it’s now possible to see for each cell what stations contributed and the weights of their contribution:

$ grep -e -77.2-130.5C log/step3.log
-77.2-130.5C stations [('700893760009', 0.10965896795479568), ('700893240009', 0.0), ('700891250000', 0.0)]

70089376000, Gill, -80.00-178.60
70089324000, Byrd, -80.00-119.40
70089125000, Byrd Station, -80.02-119.53

Byrd and Byrd Station are presumably different stations in more or less the same location. Byrd Station has a temperature record starting in 1957 (yay for International Geophysical Year) running continuously up to 1971, and then sporadic southern summers up to 1987. Byrd is presumably the modern station; its record starts in 1980 and runs up to 2010 (not quite continuously).

When considering the order in which to combine stations to produce a series for a cell, stations are considered “longest record first” by a simple count of the total number of months of valid data. It turns out that Gill, 70089376000, has 246 months of data, while Byrd, 70089324000, has 216 months of data. So Gill comes first (even though Byrd is much closer).
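As a toy illustration of that ordering (the month counts are the ones just quoted; everything else is made up for the example):

month_counts = {'70089376000 (Gill)': 246,
                '70089324000 (Byrd)': 216}

# Records are combined longest first, regardless of distance.
order = sorted(month_counts, key=month_counts.get, reverse=True)
print(order)   # Gill sorts ahead of Byrd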

Referring back to the log output for the three stations, we see that records 700893240009 and 700891250000 contributed with 0.0 weight. In other words, not at all. Why is that?

In order for a station to combine with a cell series it has to have a certain number of years of common overlap: parameters.gridding_min_overlap. This is normally 20. Each month of the year is considered separately; it turns out that February is the month for which Gill and Byrd have the greatest overlap, with 18 years in common. Just falling short of the cut. If both Gill and Byrd continue to have temperature records then in 2012 their February record will be combined. This still won’t be enough to compute a yearly anomaly for 1980 for that cell, because only February will combine, not the other months of the year. And one month is not enough to compute a yearly anomaly.
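A sketch of that overlap test (my own illustration, not the ccc-gistemp code; a record here maps year to a 12-element list of monthly values, None for missing):

GRIDDING_MIN_OVERLAP = 20   # parameters.gridding_min_overlap

def overlapping_years(a, b, month):
    # Number of years for which both records have valid data for month.
    return sum(1 for y in a
               if y in b
               and a[y][month] is not None and b[y][month] is not None)

def combinable_months(cell, station):
    # The calendar months of station that may combine into the cell.
    return [m for m in range(12)
            if overlapping_years(cell, station, m) >= GRIDDING_MIN_OVERLAP]

Gill and Byrd share at most 18 years for any single month (February), so no month of Byrd combines, and it contributes with weight 0.0.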

For completeness we should consider Byrd Station, 70089125000, as well. Gill starts in 1985, and Byrd Station has only 4 monthly values since 1985 (and in fact, only 8 for the whole of the 1980s). So no chance of enough overlap.

However, for the cell containing Byrd itself, -80.1-121.5, we can see that its record starts in 1957. Again grepping station contributors out of the log:

$ grep -e -80.1-121.5 log/step3.log
-80.1-121.5C stations [('700890090008', 0.079848008479036836),
('700893760009', 0.12098360649264683),
('700893240009', 0.9657843990261038),
('700891250000', 0.96815637481310624)]

The first station is Amundsen-Scott, 70089009000, with a record that starts in 1957 (International Geophysical Year again). We have increased overlap, and so it turns out that Gill and both records from Byrd contribute. You can think of Amundsen-Scott as being like a catalyst that allows Gill and Byrd to combine when they ordinarily would not. Amundsen-Scott is only just within range to contribute to this cell. Cells further north will not use Amundsen-Scott, so Gill and Byrd will not combine (unless there is some other station in range to provide an overlapping series).

Conclusion: Some individual cells can have slightly unintuitive series, but they are merely the result of the documented computation (the key facts, that records are combined longest first, that stations from up to 1200 km away are used, and that records require an overlap for combining, are all documented in Hansen and Lebedeff 1987). A tiny fraction of cells in the anomaly maps either have misleadingly short anomaly periods or entirely different anomaly periods. This makes no difference to zonal averages. When browsing the maps, such cells should probably be marked or eliminated.

Opening up the IPCC

Updated 2010-06-29: I have now submitted this comment to the IAC. Thank you, all signatories.

We have a rare opportunity to affect the conduct and perception of climate science. If you believe this is important, please read on, and comment.

The Intergovernmental Panel on Climate Change (IPCC) produces reports which review and summarize the science of climate change. These reports are then used by inter-governmental treaties, bodies, conferences, and national governments, as the basis for international and national policies on climate change. In other words, it is vitally important. The Clear Climate Code project has the goal of “increasing public confidence in climate science results”, and the perception of IPCC reports directly affects this goal.

There has been a lot of controversy about the accuracy and balance of IPCC reports. In response, in March the UN asked the InterAcademy Council (representing the national science academies of many different countries) to conduct a review of the IPCC processes and procedures. A committee has been established and the review is underway. The committee is now soliciting public comment. This is a rare opportunity to influence the way in which the science of climate change is conducted, reviewed, synthesized, and communicated.

I have written the following comment, and am hereby soliciting signatures. If you agree with this comment and would like to be added as a signatory, please either contact me directly, or post a comment to this blog post, giving your name and affiliation, as you wish it to appear in the list of signatories. Please also spread the word about this blog post, and encourage your friends, colleagues, and contacts to sign it.

[edited to add: as people send me their endorsements, I will update the list of signatories here in the post. I cannot make other changes, since this is now receiving signatures.]

Comment to the InterAcademy Council Review of the Intergovernmental Panel on Climate Change.

1. Summary

The IPCC procedures should be amended to increase the transparency of
the science and of the IPCC process itself. The proposed amendments
are small, but would have a large effect on confidence in IPCC
reports.

“Sunlight is said to be the best of disinfectants” – Louis D. Brandeis, 1913.

2. The Problem

IPCC reports contribute to global public policy debates and processes,
which may have major effects on the daily lives of every person in the
world. Every government and large enterprise has already been
affected. As the century continues, the effects of policies based on
IPCC work will increase in their scope and impact: they will create
whole new industrial sectors, thousands of businesses, and many ways
of life.

For this reason, the IPCC reports and the processes which create them
have been under increasing scrutiny. Questions are asked and doubts
are raised, both about the IPCC process and about the underlying
scientific research. Both the research, and the processes of review
and synthesis, have been criticised for opacity. Very serious
accusations have been made: of a lack of rigor, of group-think, of
conflicts of interest, of deception, and even of conspiracy and fraud.

This has led to doubts about the validity of IPCC conclusions, and to
serious difficulty in making national and international policy
regarding climate change.

All this is well-known and need not be rehearsed further here.
Indeed, the recognition of these problems has led directly to the
United Nations request for a review, and the establishment of this IAC
review committee.

3. The Solution

A key part of any solution to these problems is to increase the
transparency of the research underlying IPCC reports, and of the IPCC
process itself. While the research and the process remain closed and
opaque to commentators and to the public, doubts will flourish and
will impede progress.

3.1. Bibliography

The IPCC AR4 WG1 report included references to around 5000 items of
peer-reviewed research. Thousands more were referred to by the WG2 and
WG3 reports. To assess or fully understand any part of an IPCC
report, an interested reader will want to follow the bibliographic
references and read the underlying research. For this reason the
bibliographic function of an IPCC report is very important. However,
the IPCC AR4 bibliography does not perform it well.

Each chapter of each report of AR4 has its own separate bibliography.
These bibliographies are not linked together, within a report or
between reports. The formats of these bibliographies vary. There
is no way to see whether any given paper is referred to in more than
one working-group report, in more than one chapter, or at all. In the
online published text of each chapter of AR4, citations do not link
to the matching references in that chapter’s bibliography. In
turn, in each chapter’s bibliography, references do not link to
any online materials relating to that piece of research.

AR5 should have a single unified bibliography, containing all
references in all working group reports. Each citation in the body of
a report should link to the matching entry in the bibliography. If a
reference is to material which is published online, the bibliography
should link to that publication. The bibliography should also
reproduce whatever part of the publication and supporting materials is
available for reproduction (possibly just the abstract, but see
below). To protect these references against future change or loss,
wherever possible the IPCC should also archive copies of any online
publication on its own server (for instance, at the IPCC Data
Distribution Centre http://www.ipcc-data.org/).

There are many free tools available for managing online bibliographic
databases and repositories such as this. Such tools allow
collaborative enterprises such as the IPCC to readily create,
populate, update, search, and publish bibliographic data. The IPCC
should adopt such a tool, and mandate its use by lead authors and
contributing lead authors.

3.2. Underlying Research

Each piece of research lies somewhere on a spectrum of transparency
and openness. Some publications are open-access: freely available
for anyone to read and assess. For instance, some are published in
open-access journals. Many are not open-access, but describe results
such as datasets which are publicly available. Still more may have
some additional materials, such as computer source code used to
produce or analyse the datasets, freely available for download.
Finally, a great deal of research is entirely closed: only the
abstract is available, and neither the scientific paper, nor the data
described in the paper, nor the computer source code (or other
processing details), is generally open.

In recent years, and especially since AR4, it has become clear that
public confidence in research is directly connected to this spectrum
of transparency. The more open the research, the less vulnerable it
is to criticism, and especially to the more serious accusations of
fabrication and fraud. As argued above, this criticism seriously
damages the reputation of the IPCC and impedes progress in the use of
the IPCC reports.

For this reason, all contributors to AR5 should be encouraged to open
their work as much as possible: to make their contributed papers
available online, to publish their datasets and supporting materials
such as computer source code, design documents, and additional text,
images, and charts. This can be very simply done by the IPCC
routinely gathering and publishing information about the transparency
of each piece of underlying research. This information can easily be
stored in the IPCC bibliographic database.

As noted above, whenever possible a publication, and/or supporting
material, should be copied to an IPCC repository, to protect against
change or loss. As publications in climate science become more open,
such reproduction should be increasingly possible.

3.3. The IPCC Process

Much of the IPCC process itself is already open. Draft reports,
review comments, and responses are all published. However, the IPCC
reports themselves are not open. It is not possible to freely
reproduce and disseminate them. The IPCC should immediately change
this, and adopt an open licensing policy. All IPCC reports, past and
future, should be freely available under a license which conforms to
the Open Knowledge Definition http://www.opendefinition.org/, for
example the Creative Commons Attribution Share-Alike license CC-BY-SA http://www.opendefinition.org/licenses/cc-by-sa/.

The existing transparency should also be increased. There have been
prominent recent calls for the review and synthesis process to take
place in public, for instance by adopting a wiki-style drafting
mechanism. Such a move would protect the IPCC against certain
accusations of group-think (or even conspiracy). However, such a move
is somewhat outside the scope of the detailed recommendations below.

4. Recommendations

This is a series of concrete recommendations for amendments to the
document “Principles Governing IPCC Work, Appendix A – Procedures for
the preparation, review, acceptance, adoption, approval and
publication of IPCC Reports”, with the effect of implementing the
solutions described above.

In section 4.1, “Introduction to Review Process”, this paragraph should
be added:
The IPCC Secretariat should identify, implement, and provide a
bibliographic system and repository for the use of Coordinating
Lead Authors, Lead Authors, and Review Editors.
The content of this bibliographic system and repository shall be
shared between all the Working Groups and the Task Force on
National Greenhouse Gas Inventories, and shall be publicly
available on or before completion of the Report for a period of at
least five years.

In section 4.2.3, “Preparation of Draft Report”, this sentence should
be added to the first paragraph:
Contributions should include, wherever possible, access
instructions for any original data, supplementary materials,
computer source code used for analysis or processing, and an
indication of the public availability and licensing of such
materials.

In Annex 1, under “Lead Authors”, this paragraph should be added:
Lead Authors shall record all contributed material in the IPCC
bibliographic system. Where any access to original data,
supplementary materials, or computer source code is provided, Lead
Authors shall record such access in the IPCC bibliographic system
and, wherever possible, copy such material to the IPCC repository.

In section 4.2, “Reports Accepted by Working Groups and Reports
prepared by the Task Force on National Greenhouse Gas Inventories”,
this paragraph should be added:
Reports accepted by Working Groups, or prepared by the Task Force
on National Greenhouse Gas Inventories, shall be made publicly
available under the Creative Commons Attribution Share-Alike
license CC-BY-SA.

In section 4.4, “Reports Approved and/or Adopted by the Panel”, this
paragraph should be added:
The Synthesis Report shall be made publicly available under the
Creative Commons Attribution Share-Alike license CC-BY-SA.

Furthermore, the IPCC should make its existing reports publicly available
under the same CC-BY-SA license.

5. Conclusion

The IPCC reports have been questioned and attacked on many fronts, and
this has been a source of great difficulty in making national and
international policy regarding climate change. A principal ground for
complaint has been the transparency of the underlying science and of
the IPCC process of review and synthesis. Progress can be enabled by
addressing these complaints: by making the science and the process far
more open.

The IPCC doesn’t have a direct influence on the working practices of
the thousands of researchers who contribute work to its reports.
However, it can shine a bright light on those practices by the simple
and cheap step of requesting and recording certain information in its
bibliography, and by making that bibliography readily available to the
public.

Finally, by making its own processes more open, and by making its own
reports more freely available, the IPCC can both avoid any further
criticism on these grounds and set a leading example for the research
community from which it is drawn.


Signatories

  • Nicholas Barnes, Founder, Clear Climate Code project
  • David Jones, Founder, Open Climate Code project
  • Richard Drake, Founder, Open Climate Initiative
  • Rufus Pollock, Founder, Open Knowledge Foundation
  • Jonathan Gray, Community Coordinator, Open Knowledge Foundation
  • Joshua Halpern, Professor of Chemistry, Howard University
  • Tim Lambert, School of Computer Science and Engineering, University of New South Wales
  • Peter Murray-Rust, University of Cambridge and Open Knowledge Foundation
  • Andrew Montford, author of The Hockey Stick Illusion
  • Subbiah Arunachalam, Distinguished Fellow, Centre for Internet and Society, Bangalore, India
  • Dave Berry, ex Deputy Director of the UK National e-Science Centre
  • Peter Suber, Berkman Fellow, Harvard University
  • Lucia Liljegren of the Blackboard
  • Carrick Talmadge, Senior Scientist, University of Mississippi
  • Ivo Grigorov (Centre National de la Recherche Scientifique/DTU-Aqua)
  • William Eichinger, William Ashton Professor of Engineering, University of Iowa
  • Nick Levine
  • Philippa Davey
  • Leif Burrough
  • David L. Hagen
  • Scott McKay
  • Ronald Broberg
  • Ted Lemon
  • Martin Brumby
  • Gerry Morrow
  • David Bishop
  • Conrad Taylor
  • John Shade
  • Allen McMahon
  • Robert Thomson
  • Eamon Watters
  • Bruce Cunningham
  • Greg Freemyer
  • Chad Herman
  • Barry Woods
  • Jack Mosevich
  • Stephen L. Jones
  • Zeke Hausfather
  • Daniel Godet
  • Laurence Childs
  • Peter O’Neil
  • Phillip Bratby
  • Colin Brooks
  • Andrew Smith
  • Peter Walsh
  • Louis Hooffstetter
  • Steve Fitzpatrick
  • Stephen Gaalema
  • Charles Minning
  • Brian Crounse

Airport Warming

More or less on a whim I split the GHCN data into two sets: those stations marked as being at an airport, and those not. This is easy to do because the v2.inv file puts an ‘A’ in column 81 (counting from 0) for airport stations.
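A minimal sketch of that split (my own illustration; it relies only on the column-81 flag just described):

# Partition 11-digit station IDs by the airport flag in input/v2.inv.
airport, other = set(), set()
for row in open('input/v2.inv'):
    (airport if row[81] == 'A' else other).add(row[:11])

Two v2.mean-style input files, one per set, can then be written by filtering rows on their first 11 characters, much as in the “Just 440 stations” post above.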

Here’s the airport versus non-airport comparison for ccc-gistemp:

Certainly for the most recent 50 years it doesn’t seem to matter much whether you use exclusively airport-based measurements or exclude them entirely (considering the global anomaly).

My earlier post about the 1990s station “dropout” used a similar technique of splitting the input data into two sets.

OKCon CCC Presentation

Saturday past was OKCon 2010 and we were in London to give a presentation about Clear Climate Code (well, Nick, Paul, and I were). Specifically, I was there to monkey the slides, and Nick was there to stand up and talk.

A PDF of the slides (3.5e6 octets) is available from our googlecode download page; you can also find a zip of PNGs there if you need it.

It was an interesting conference; thanks to Open Knowledge Foundation for organising, and everyone else for attending.

GISTEMP Urban Adjustment

After some recent tweaks by me to the ccc-gistemp sources it is now possible to run a pipeline of the GISTEMP process with some of the steps omitted. An earlier post shows how I can omit Step 4 to create a land-only index. My recent changes allow Step 2 to be omitted. Step 2 is the urban adjustment step (in which stations marked as urban have their trend adjusted).

Omitting Step 2 will therefore give us an idea of the magnitude of the effect of the urban adjustment. It so happens that my writing this blog post overlaps with Nick Barnes implementing GISTEMP’s new scheme for identifying urban stations (corresponding to GISTEMP’s update of 2010-01-16). That gives me an opportunity to show both the new and old adjustment schemes against a “no adjustment” baseline:

In making this graph Step 4 has been omitted, giving us a land index. This is primarily to amplify the differences: land covers the lesser fraction of the Earth, so including the ocean data (which does not require an urban adjustment) makes the difference smaller.

And for each hemisphere:

Northern:

Southern:

To make a “no urban adjustment” run of ccc-gistemp: «python tool/run.py -s 0,1,3,5»; and to make an “urban adjustment” land-index: «python tool/run.py -s 0,1,2,3,5».

NASA GISS wants to use our code

After the release of ccc-gistemp 0.3.0, I contacted Dr Reto Ruedy of NASA GISS to ask him to try out the release and have a look through it.
Dr Ruedy responded, thanking us for our effort, and saying “I hope to switch to your version of that program”. After some further discussion, he clarified this: when GISS has the resources, “Ideally, we would like to replace our whole code.”

They are busy with other things, and won’t have the resources for quite some time. Also, we will need to do some more work, to interface our code with various GISS tools (such as the station data web page). Nonetheless this is very much to the credit of the whole ccc-gistemp team. Well done, everybody.

What do we mean when we say “Fortran”?

A visitor named “Dan” recently left this comment:

[...] I’m not sure why Python has been described in some associated project documents as easier or friendlier than Fortran. They are both pretty simple in that regard. I agree there are other reasons that Python (like some other new languages) is a better choice for a new project with many contributors and users.

Thank you for raising that point, Dan.  Fortran and Python themselves are really ciphers in this discussion, standing for “obscure twisty code” and “clean clear code”.

As the old saw has it, “you can write Fortran in any language” – indeed, GISTEMP includes Fortran written in C, ksh, Fortran and even in Python.  The reverse is also true: with all the features that have been added to Fortran in the last few decades, you can write any language in Fortran[*].  However, I’d invite you, or anyone, to compare:

1. padjust.f, as it is in GISTEMP.  This is a tiny corner of GISTEMP, used for applying computed heat-island adjustments to urban stations, certainly nothing like as twisty as most of the code around it (e.g. PApars.f).

2. padjust.py, as it is in ccc-gistemp 0.2.0.  This is a fairly routine translation of padjust.f into Python. It was a key step on the road to our 0.2 all-Python milestone, but one could, unkindly, characterize it as Fortran-in-Python.

3. apply_adjustments(), as it is today in the ccc-gistemp sources. The relevant code here is lines 690 to 791 inclusive.

There are several points to be made here. Firstly, version 3 is not exactly clear. The function adj() is not well-documented. There are a lot of slightly mysterious variables. There is some unnecessary messing-around with metadata entries, and there are substantial opportunities for using helpful little functions such as max() and min(). Code clarification is a process of gradual improvement, and is certainly not finished here.

Secondly, I expect something much like version 3 could have been written in modern Fortran. The main reason we’re not using a modern Fortran is that I set up the project. Like each team member I brought my own skills and preferences to the project, and my favourite language, at least for writing small pieces of clear, simple code, is Python. I have had very little professional experience in Fortran, and essentially none for 20 years.

Thirdly, most science Fortran, even newly written science Fortran, is like version 1: FORTRAN 77 with aspects of Fortran 90 (e.g. free-form source, long identifiers, dynamic arrays). Some is still in FORTRAN 66.  Furthermore, it is big blobs of Fortran with cryptic variable and function names, very occasional comments, aliasing through COMMON blocks, and a lot of unused functions and/or variables.  What is the variable iy1e, and how can I find out? Why are we comparing nameo(31:32) to ‘ R’? This is what I mean when I say “Fortran”, and it is typical of GISTEMP, and other bodies of code we have seen, and friends and colleagues tell us it is true elsewhere. It is the natural consequence of the way in which science is done. Scientists are paid to do science, not to write code.  As long as the code does what it ought to do, for long enough to plot the charts for the paper for publication, then it’s good enough. There is no pressure to write code which is clear, maintainable and flexible, and so scientists mostly don’t develop or retain the skills to do so. That is one of the points of this project: to show what such code might look like, how to write it, and how it can be beneficial to science.

Fourthly, in the specific case of some code such as GISTEMP, the results of the code are being used to argue important public policy questions which will affect all of us. Something that I, personally, can do to turn some of the heat of that debate into light, and to help us all reach good decisions, is to make GISTEMP accessible to the public. All-Python is simply better than Fortran-ksh-Python-C for that purpose, for various reasons but primarily that it is easier to install, browse, and run on a random PC. Consider how many people downloaded the GISTEMP sources and ran into the sand very early. That should not happen with ccc-gistemp.

So, in short, yes we are converting from “Fortran” to “Python”, but some of the “Fortran” was already Python and some of the “Python” is decidedly Fortran-like.

For more on the pros and cons of Fortran and Python, please visit the Software Carpentry project. No affiliation; I just like what they do.

[*] This isn’t quite true – as far as I know Fortran still doesn’t have the meta-object protocol or introspective facilities of some languages, and pretty much no other language has the macro facilities of Lisp – but features like that play no part in this project anyway.

GISTEMP 2009 anomaly anomaly

In a previous article I predicted that the 2009 GISTEMP anomaly would be +0.58. In fact when it was published it came in at +0.57. This 0.01 K difference is well within any reasonable error bounds and typical of the sort of error you get from rounding. Still, it bothered me. How unlucky was I to get agreement for all the years except the most recent one?

Today I realised that although I was using up to date land data I wasn’t using up to date ocean data. I have just fetched fresh ocean data and rerun ccc-gistemp. Of course the 2009 anomaly comes out as +0.57 K, same as GISS:

Overview of GISTEMP intermediate files

When ccc-gistemp runs, the data files in the input/ directory are processed in a number of steps to produce the results in the result/ directory. On the way many intermediate files are written to the work/ directory. Generally the intermediate files are written by one stage of the process and consumed by a later stage. GISS’s GISTEMP works in a broadly similar way, but the details are slightly different. One of the first things we did when working with GISTEMP was to reorganise where the intermediate files were written.

Many of the intermediate files were only written because the computers on which GISTEMP was originally intended to run were extremely resource-poor and the whole dataset could not always be processed in working memory. Hence, data was written into several intermediate files and processed piece by piece. This is no longer necessary, and since ccc-gistemp release 0.2.0 we have made changes that mean that far fewer intermediate files are written.

A consequence is that there is now a manageable number of files in the work/ directory, and a listing of them tells a story about how GISTEMP works:

         5 Jan 20 13:27 GHCN.last_year
  44716518 Jan 20 13:28 v2.mean_comb
  29802728 Jan 20 13:30 Ts.txt
  39696368 Jan 20 13:31 Ts.bin
  20853712 Jan 20 13:31 Ts.GHCN.CL
   2107900 Jan 20 13:31 ANN.dTs.GHCN.CL
    354106 Jan 20 13:33 PApars.pre-flags
    371742 Jan 20 13:33 PApars.list
  19233584 Jan 20 13:34 Ts.GHCN.CL.PA
         0 Jan 20 13:34 BX.Ts.GHCN.CL.PA.1200
  50240120 Jan 20 13:50 SBBX1880.Ts.GHCN.CL.PA.1200
  34001576 Jan 20 13:50 SBBX.HadR2
    176152 Jan 20 13:51 ZON.Ts.ho2.GHCN.CL.PA.1200.step1
    176152 Jan 20 13:51 ZON.Ts.ho2.GHCN.CL.PA.1200
     15974 Jan 20 13:51 ANNZON.Ts.ho2.GHCN.CL.PA.1200

The above is a listing of my work/ directory having done a fresh run using subversion revision 199 sources. Each row lists: file size in bytes, timestamp, file name.

The first thing to note is the timestamps. The first file is written at 13:27 and the last file at 13:51. On my machine ccc-gistemp took about 25 minutes for this run.

I’ll go through the files in order and try to explain what each one is. Bear in mind that some of these files will probably disappear in future versions as we reduce the number of times data is written to disk and read back in again.

GHCN.last_year

This file is used to pass the highest year found in the GHCN input data (input/v2.mean) from step0 (where this file is created) to step2 (where it is read). The contents are just that highest year. This is not a very elegant way to pass this information. It’s needed because step2 creates a Fortran binary file with fixed record lengths, and the length of each record is related to the highest year that has data, so the highest year needs to be known before the binary file is created.
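A sketch of that handoff (my own illustration, not the step0/step2 code; it relies on the year sitting at columns 12:16 of each v2.mean row, as used earlier):

# step0: record the highest year present in the GHCN input ...
last = max(int(line[12:16]) for line in open('input/v2.mean'))
open('work/GHCN.last_year', 'w').write('%d\n' % last)

# ... step2: read it back, so the fixed-length binary records can be
# sized before any of them are written.
last_year = int(open('work/GHCN.last_year').read())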

v2.mean_comb

This large file is the output from step0. It contains all the temperature data that GISTEMP will go on to use, combined into one file. The temperature data are combined from: GHCN, USHCN, SCAR, and one input file for Hohenpeissenberg. The combining process is not entirely trivial: data from USHCN do not simply replace data from GHCN; they are adjusted by the mean monthly difference (see the function include_US in step0.py).
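The idea behind that adjustment looks something like this sketch (my own simplification, not the code of include_US; assume each series maps year to a 12-element list of monthly values, None for missing):

def shift_ushcn(ushcn, ghcn):
    # Shift each calendar month of a USHCN series by its mean difference
    # from the corresponding GHCN series, so the USHCN data can stand in
    # for the GHCN data without introducing a step change.
    for month in range(12):
        pairs = [(ushcn[y][month], ghcn[y][month])
                 for y in ushcn
                 if y in ghcn
                 and ushcn[y][month] is not None
                 and ghcn[y][month] is not None]
        if not pairs:
            continue
        diff = sum(u - g for u, g in pairs) / float(len(pairs))
        for y in ushcn:
            if ushcn[y][month] is not None:
                ushcn[y][month] -= diff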

Ts.txt

This is the output of step1. Duplicate records (multiple series for the same weather station) are combined, if possible; an adjustment is made for the St Helena record (listed in the file config/combine_pieces_helena.in); records and partial records listed in the file config/Ts.strange.RSU.list.IN are removed; an adjustment is made for a discontinuity in the Lihue record (listed in the file config/Ts.discont.RS.alter.IN).

The output format of this file is different from the v2.mean format used in the previous step. Metadata from the file input/v2.inv is included in this file.

Ts.bin

At the beginning of step2 the Ts.txt file from step1 is converted to this Fortran binary file. The binary file is easier to access using a Fortran program. At one time in the past the binary file would have been significantly quicker as well, but I doubt that matters these days. Since ccc-gistemp is now entirely in Python it’s likely that we’ll remove the binary file, preserving it only as an option to match the GISTEMP intermediate files.

In the GISS code, this file is then split into 6 files so that the gridding, in step3, can proceed by using only a subset of the station data held in memory at once. Keeping all the data in one file greatly reduced the number of intermediate files created in ccc-gistemp.

The next few files are all internal to step2, the urban adjustment.

Ts.GHCN.CL

The same data as Ts.bin but trimmed to make the file size slightly smaller. Even more pointless in this day and age.

ANN.dTs.GHCN.CL

Annual anomalies for each station, computed in step2. Each of the 12 months has an average computed, and the anomaly for a year is computed from the difference between each month in that year and that month’s average.

There is far less data in this file than in the monthly series of Ts.txt, and it is this data that is used to make the urban adjustment.
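A sketch of that computation (my own illustration; I aggregate a year’s departures with a simple mean over its available months, which may differ in detail from the real step2 code):

def annual_anomalies(series):
    # series maps year -> 12 monthly values (None for missing);
    # returns year -> annual anomaly.
    means = {}
    for m in range(12):
        vals = [series[y][m] for y in series if series[y][m] is not None]
        if vals:
            means[m] = sum(vals) / float(len(vals))
    result = {}
    for y in series:
        departures = [series[y][m] - means[m] for m in means
                      if series[y][m] is not None]
        if departures:
            result[y] = sum(departures) / float(len(departures))
    return result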

PApars.pre-flags

By analysing urban and rural stations, step2 creates this table of parameters that control what adjustments are going to be made (to urban stations).

PApars.list

The parameters in PApars.pre-flags are annotated with a flag that affects the exact adjustment made.

Ts.GHCN.CL.PA

The data from Ts.GHCN.CL are read in and urban stations are adjusted according to the parameters previously computed and stored in PApars.list. This file contains the adjusted data and is the output of step2.

BX.Ts.GHCN.CL.PA.1200

An empty file created by step3. The GISS version of GISTEMP creates a gridded dataset (SBBX1880.Ts.GHCN.CL.PA.1200, see below) with 8000 cells (subboxes) and from this also creates a gridded dataset with 80 cells (boxes), which is what this file would be. Currently ccc-gistemp does not create the 80-cell version.

SBBX1880.Ts.GHCN.CL.PA.1200

This is the gridded output of step3. It’s created by considering each grid cell in turn, and combining (usually several) station records into one record for each cell. From this point on only gridded data is used.

SBBX.HadR2

This file contains sea surface data from Hadley and Reynolds version 2 (the “Had” and “R2” in its name). It’s the result of step4, which takes the input file of the same name (in the input/ directory) and adds in any updates that have been downloaded. Usually we run without any updates, and in this case step4 simply copies this file from the input/ directory.

ZON.Ts.ho2.GHCN.CL.PA.1200.step1

This is a temporary file used by step5. Step5 takes the two subbox files, SBBX1880.Ts.GHCN.CL.PA.1200 containing land data and SBBX.HadR2 containing ocean data, merges the land and ocean data together, and creates a gridded file with 80 boxes. The gridded file appears in the result/ directory: BX.Ts.ho2.GHCN.CL.PA.1200.

The data in the gridded file are combined to produce a data series for each of 8 latitudinal zones; from those 8 zones another 6 are produced for large-scale regions, including the 3 for Northern Hemisphere, Southern Hemisphere, and global average. That zonal data is stored in this file.

ZON.Ts.ho2.GHCN.CL.PA.1200

The zonal data are read and an alternative computation is done in which the global average is the simple average of the northern and southern hemispheres, as opposed to the previous calculation, which uses an area-weighted average.
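As a toy contrast of the two computations (all numbers made up for illustration):

north, south = 0.62, 0.48   # hemispheric means

# Alternative computation: simple average of the two hemispheres.
global_simple = (north + south) / 2.0

# Previous computation: area-weighted average over latitudinal zones,
# each weight proportional to the zone's share of the Earth's surface.
zones   = [0.70, 0.55, 0.50, 0.45]   # made-up zonal means
weights = [0.10, 0.20, 0.40, 0.30]   # made-up area weights (sum to 1)
global_weighted = sum(w * z for w, z in zip(weights, zones))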

ANNZON.Ts.ho2.GHCN.CL.PA.1200

Annual anomalies are computed from the monthly data series for each zone.

Whilst the data for this and the previous file are being computed, the text files that hold summaries of this data are written to the result/ directory.