Google Summer of Code 2011

[Cross-posted from the Foundation blog]

Google have announced their Summer of Code, and we intend to be a mentoring organisation. If you’re a student, this is an opportunity to work on our open source code and earn a bit of money doing so (Google give a stipend of USD 5000 qualifying students, and an honorarium of USD 500 to the mentoring organisation).

The Climate Code Foundation has an ideas page, most of which revolves around the Clear Climate Code ccc-gistemp project. Ideas range from improving ccc-gistemp in various ways, through novel reconstructions, to clear implementations of other climate codes. If you have ideas of your own, we’d like to hear about those too.

If you are interested in participating as a student, then please get in touch.

We have not been a Summer of Code mentor before, but we bring many years (decades even!) of experience to the table: experience in computer science, software engineering, project management, and so on. We hope to help students make a success of their projects!

Gavin’s bug

(I call it Gavin’s bug because he found it, not because he created it).

Gavin’s bug concerns Step 3, and in particular the part where for each cell a list of stations within 1200 km is created. Creating this list of “nearby” stations means potentially consulting all the stations. All the stations could be quite a lot. So for each of the 80 boxes (see overview.txt), the code restricts the search list to an extended box. Here’s a typical box (in white) and the extended box (in green):

In principle it’s faster to first cull the entire station list by rejecting those stations outside of an enclosing rectangle based on latitude and longitude, because we can do that quickly without needing to do the trigonometry for the proper “within 1200 km” test.

The bug is that the rectangle chosen for the culling is not large enough. Here’s the code from revision 665 of step3.py:

# Extend box, by half a box east and west and by arc north
# and south.
extent = [box[0] - arcdeg,
box[1] + arcdeg,
box[2] - 0.5 * (box[3] - box[2]),
box[3] + 0.5 * (box[3] - box[2])]
if box[0] <= -90 or box[1] >= 90:
# polar
extent[2] = -180.0
extent[3] = +180.0

Note that the comment says that the box is extended by half a width east and west. The polar boxes (there are 4 for the North Pole, and they all meet at the pole) have to be special-cased because otherwise stations close to the pole would be missed.

The bug concerns the boxes that span latitudes 44 to 64 (there are 8 of these in each hemisphere, see overview.txt. The extended box isn’t quite wide enough:

But wait… that circle (above) is centred on the northwest corner of the box. When it comes to cells, it’s the centre of the cell that used to select the stations within 1200 km, and the northwest cell is set inside the box a little bit. Is it enough?

The white circle is 1200 km around the centre of the northwest cell. It just fits. Phew!

So even though there are point within 1200km of the box that are not inside the extended box, no stations are missed, because the cells inside the box are too large to be close enough to the edge.

Perhaps at one time computers were slow enough for this “optimisation” to make a difference, but it’s been irrelevant for probably at least 20 years. Whilst the code seems to be correct, it’s not clearly correct.

It would be simpler and clearer without it. So it’s gone.

It does change the results by a tiny tiny bit. The, essentially bogus, reasons for which I may writeup in another post.

Analysis of Canada data

In an earlier post I describe the trials and tribulations of tracking down some station data from Environment Canada’s website.

The obvious question to ask is, how does this affect the ccc-gistemp analysis?

For starters, how much extra data do we get, once we’ve merged all the duplicates and rejected short records and so on? Here’s the station count by year for the GHCN stations (dark lines), and the extra Environment Canada stations (lighter lines):

This count is made after ccc-gistemp Step 2 processing, so duplicate records for the same station have been merged, short records have been discarded, and urban stations that could not be adjusted have been dropped (the log tells me there are 18 such stations). New Environment Canada stations (recall that some of the Environment Canada data is for stations that are not in GHCN) do not get any brightness information in the v2.inv file; it so happens that in ccc-gistemp this means they get marked as rural, more by accident by design. I should probably fix this (by calculating brightnesses for the new stations, and rejecting stations with no brightness data), but this will certainly do for a preliminary analysis.

The 1990s still don’t reach the dizzying peaks of the 1960s (in terms of station count), but the Environment Canada data is certainly a welcome contribution. More than doubling the number of stations for recent decades.

The effect of this on the analysis? Here’s the arctic zone:

The first thing to note if you haven’t seen one of these before, is the scale. The swings in this zone are much larger than the global average (this zone is 5% of the Earth’s surface); the recent warming in this zone is over 5 °C per century! The remaining points of note are the slight differences here and there in the very recent period. That large dip in the 2000s is 2004, and the new analysis has the anomaly some 0.16 °C colder (+0.57 versus +0.73). A warm spike is 1995 is 0.09 °C warmer. The same blips are also just about visibly different on the Northern Hemisphere analysis, but the differences smaller.

The additional Environment Canada is welcome, and does affect the result just enough to be visible, but the trends and any conclusion one could derive are not affected at all.

Appendix

The data are available here, but you don’t need to download that if you’re using ccc-gistemp. Run «python tool/cascrape.py» to download the data, and then run «python tool/camatch.py» to generate a mapping table. «python tool/run.py -d ‘data_sources=ghcn ushcn scar hohenpeissenberg ca.v2′» will then run the analysis.

Canada

It’s easy to note that GHCN is a bit lacking when it comes to recent temperature data for Canada:

Canada is of course just a particular aspect of the global case, which is well known and discussed in the literature ([Jones and Moberg 2003] for example).

Note in particular the drop off at around 1990, shortly before GHCN v2 was compiled in 1997. Canada is of some interest because: it has many relatively high latitude stations; and, it’s an industrialised nation that ought to have a good weather station network. The recent periods of GHCN are updated via CLIMAT messages. Who knows why Canada’s data doesn’t make it into GHCN, perhaps they prohibit NOAA from republishing the data, perhaps their BUFR format isn’t quote what NOAA expect.

Environment Canada publish (some of) their monthly temperature data on the web. Using ScraperWiki I wrote a scraper to collect it, and another program to format it into GHCN v2 format so that it could be used as an input for ccc-gistemp. Whilst there are far fewer stations in the scraped data, there is more data in the 2 most recent decades (note different y-axis scale compared to above):

GHCN is bumping along with record counts in the 30s and 40s for the last 2 decades, Environment Canada has 100 or more records for most of those years. Until 2007 in fact (change of government?).

Quite a few of the stations in the scraped data have fewer than 20 years. I exclude those because they’re unlikely to be of use in the ccc-gistemp analysis (in principle they could be combined in Step 1 with existing GHCN duplicates even when they’re short, but meh). It turns out this also avoids some thorny and potentially embarrassing data cleansing war stories (such as why are the reported locations for WMO station 71826 13 degrees apart?)

There are 106 stations from the scraped data with 20 or more years of monthly data.

So, the first question is how do we merge this dataset with GHCN? I suppose I could simply add all the scraped data alongside the GHCN data, but that would be wrong. The scraped data has some stations which are in GHCN, and those would be “double counted”. We ought to try and identify which stations are the same in the two sources; so how do we do that? Most of the scraped stations are WMO stations (87 out of 106) and they come with WMO identifiers. So it’s easy to match those against a possible counterpart in GHCN (WMO stations in GHCN have their 11-digit identifier ending with “000″). Both GHCN and the scraped data come with location information, so we can try matching on that as well. In the case of the matched WMO stations (67), 5 have locations that were more than 0.015 degrees apart, but they’re all within 0.03 degrees.

Sometimes a scraped WMO station will not match a GHCN station by WMO identifier, but will have a location match. This is probably a station in GHCN that should be marked as a WMO station, but by mistake isn’t.

WMO station 71808 represents a good win. The existing GHCN data (station 40371808000) runs from 1967 to 1982. The scraped data (Environment Canada identifier 7040813) runs from 1982 to 2010. In fact there is no overlap at all, GHCN finishes in 1982-07 and Environment Canada starts in 2010-08. Curious.

WMO station 71808: Blanc Sablon

We need a rule of thumb to decide whether to add the scraped data as a new station or as a duplicate of an existing GHCN station. Scraped WMO stations that match a WMO station in GHCN will always be added as a duplicate. For each remaining scraped station I select the nearest GHCN station as a comparison (this is in Canada in all but one case). An overlap and q score is computed for each GHCN duplicate, and the “best” duplicate is picked. If there are duplicates with “long” overlaps (meaing >= 120 months) then the one with lowest q score is chosen for “best”, otherwise (all the overlaps must be short) the duplicate with longest overlap is chosen. q is sqrt(sum(d[i]**2)/n) where d[i] is the difference between the two series at time i; the sum is over the common period; n is the number of elements (months) in the common period.

So for the non WMO stations (some of these will be WMO stations, just without an identified WMO counterpart in GHCN) the rule of thumb is: if s+q < 1, then the scraped data is added as a duplicate, otherwise it is added as a new station. s is the separation between the scraped station and the “best” station in degrees; q is the q score of the “best” duplicate.

The added value of this dataset, compared to GHCN, is recent data for high latitude stations. Alert is a classic case, North of 80:

The scraped data brings almost 20 years of recent data to the table.

So what does this all add up to when we incorporate this new data into the GISTEMP analysis? Well, I’ve done that, but I’ll have to leave the results for another post.

REFERENCES

[Jones and Moberg 2003] Jones, P. D., A. Moberg, 2003: Hemispheric and Large-Scale Surface Air Temperature Variations: An Extensive Revision and an Update to 2001. J. Climate, 16, 206–223.
«doi:10.1175/1520-0442(2003)016<0206:HALSSA>2.0.CO;2».

ccc-gistemp release 0.6.0

[ed: 2010-10-29: This was an announcement for release 0.6.0, but that has a bug (see comments). Please use release 0.6.1 instead. I've edited the article to update the links]

I’m pleased to announce ccc-gistemp release 0.6.1 (making the previous buggy 0.6.0 obsolete). ccc-gistemp is our project to reimplement the NASA GISS GISTEMP algorithm in clearer code (in Python). This release is the first release made under the aegis of the Climate Code Foundation.

Many of the significant changes in this release have already been previewd in earlier blog posts:

Further details are available in the release notes.

We intend to carry on our work on ccc-gistemp, and we urge you to download our code, try it, and read it. We welcome contributions.

The work for release 0.6.0 was carried out by David Jones and Nick Barnes, and the work for release 0.6.1 was carried out by David Jones.

GHCN-M V3

Why bother with words when abbreviations are so much more cryptic? As pd points out, there is a new version of Global Historical Climate Network, version 3. There isn’t an official announcement yet, but others have noticed.

GHCN-M is the monthly datasets. Version 3 is still in beta, so we’re all still learning.

The file format is different. More like USHCNv2. And like USHCN each datum has a set of flags that indicate quality checks (isolated value, inconsistent with climatology, month has missing days, and so on). One of the flags is a source flag, each monthly datum is tagged with its source: UK Met Office, CLIMAT report, MCDW, and so on.

Unlike GHCN v2 there is only one record for each station in GHCN v3. There are no “duplicates”. This makes one job (the job of Step 0) easier, we don’t have to decide how to select or combine multiple records for the same station: that’s been done for us. On the other hand, we may have wanted to combine records in a different way.

I’ve been modifying ccc-gistemp to experiment with GHCN v3. At first I thought I could use the v2.inv file supplied by GISTEMP, but the GHCN station identifiers for the contiguous US have changed (so that they’re based on their USHCN station identifiers—probably a good thing). Writing code to parse the new v3 .inv file is straightforward enough.

Of course the v3 .inv file doesn’t have the night-time satellite brightness that GISTEMP uses in its analysis (globally, since 2010-01-16). So I also added a parameter to use the GHCN population index (POPCLS in the documentation) globally.

This result should be considered preliminary.

When making comparisons with official GISTEMP there are several caveats:

  • Only GHCN v3 data is used. No SCAR READER (and no Hohenpeissenberg correction).
  • In Step 2, urban adjustment, the GHCN v3 analysis uses the POPCLS field for the rural/urban designation. The field has three values, R/S/U, for Rural/Semi-Rural/Urban. R maps to rural in the analysis, the others count as urban. The current GISTEMP analysis uses night-time satellite brightnesses.
  • Each GHCN v3 station is treated as a single record. GISTEMP using v2 data combines duplicate records for the same station into one record (sometimes more than one); this record may not be the same as the GHCN v3 record. And in particular…
  • (because I appended a ‘G’ to all the 11-digit v3 station identifiers) the “hand picked” list of deletions and adjustments is not used. The most obvious example of where this matters is St Helena, 14761901000.

I changed ccc-gistemp to use GHCN v3 and wrote this post ages ago, but when I met Jay Lawrimore at the Exeter workshop, he said I should probably hold off posting. Here’s the record of my GHCN v3 changes in googlecode (made on 2010-09-04).

Surface Temperatures Workshop

David Jones and I attended the Surface Temperatures workshop at the Met Office in Exeter this week. This is the kickoff meeting for an ambitious new project to produce a far more comprehensive databank of surface temperature records than currently exists, especially at finer time resolution (daily and sub-daily) and incorporating many station records which are not currently available.

There were around 80 attendees from around the world, including climate scientists, meteorologists, computer people, statisticians, metrologists, and ourselves. This was the first outing for our new Climate Code Foundation, although many people there were aware of Clear Climate Code. This was the first time either of us had attended a climate science meeting. We were made welcome, our motivation and focus was respected, and our voices were heard. The project principles established at the meeting include a strong commitment to openness and transparency, and although some scientists don’t share our conviction of the importance of code publication, the project is committed to publishing all its code.

We were not paid for our participation or for our expenses. In the final meeting we were asked to contribute to software aspects of the project, and said that this may be possible depending on resources.

A mind-boggling side-light: estimates of the volume of non-digitized or hard-copy data range in the hundreds of millions of pages; NCDC alone has a digital archive of 56 million page images, and literally thousands of boxes of unscanned hard-copy in their basement. Many national weather services, and other governmental, non-governmental, and commercial organisations also have large paper or imaged archives; the Galaxy Zoo people are working with the Met Office and the National Maritime Museum on an amazingly cool new crowd-sourced project to recover weather records from millions of imaged pages of Royal Navy log books. There was a strong emphasis at the meeting on the need to retain original data and to make any dataset fully traceable to that original data (I imagine a web interface in which one can drill down to page images of the original weather station hard-copy records). It was clear to the meeting that this traceability requirement implies software publication.

The Climate Code Foundation

We are pleased to announce the creation of the Climate Code Foundation. The Foundation is a non-profit organisation founded by David Jones, Nick Barnes, and Philippa Davey to promote public understanding of climate science. The Foundation will continue work on the Clear Climate Code project, and also related activities, encouraging climate scientists to improve and publish their software.

The Foundation intends to work with climate scientists, funding bodies, national and international organisations, and science publishers. We hope to establish climate science in the forefront of science software quality and transparency.

Members of the Foundation are attending the Surface Temperatures workshop at the UK Met Office in September 2010, to promote better and more open software practices within that project.

If you support the Foundation’s goals, there are many ways to contribute to its work.

A real land mask

As mentioned in the previous post, I now have a real land mask from ISLSCP Initiative II.

The format of the files in refreshingly simple. 180 rows with 360 space separated numbers. It takes nothing more than a few dozen lines of Python to convert this into the 8000 cell format that ccc-gistemp uses for its step5mask.

Here’s the 8000 GISTEMP cells that contain any land (according to the ISLSCP file). Drawn on a Plate Carrée projection:

Even at this blobby scale there are some suspicious features (missing). Where is Ascension or Cocos Islands? I don’t think it’s an artefact of my processing, because they don’t appear on the ISLSCP file either; that’s probably a bug. So accepting that their might be some minor issues with islands and so on, we can go on to do a run of ccc-gistemp using this land mask:

As mentioned in the previous article, restricting to land de-emphasises coastal stations which generally warm less slowly. So the restricted version of the analysis has a stronger warming trend.

Many of the partial land cells will have ocean data and will not have a nearby land station to count as a land cell in the usual GISTEMP analysis. So we find that excluding cells that have ocean data (and no nearby land station) results in an even stronger trend. The red graph is reprised from the previous article:

(pedants should note that in the above graph “cells with ocean data” means “cells with ocean data and where the nearest land station to the cell’s centre is more than 100 km away”)

There are countless variations one could try such as weighting cells by their coverage, or using a threshold of 50% land cover. As far as an estimate of land temperatures go, such variations amount to interpolating between those two curves on the previous chart. Roughly.

Masking GISTEMP

[update: 2010-08-26: the "no arctic" masks were wrong, fixed now]

I added a land mask feature to ccc-gistemp. The land mask is used In Step 5, where land and ocean series are combined into zonal averages. For each cell, ocean data is used unless: there is very little ocean data (fewer than 240 months); or, there is a land station within 100 km. With the new land mask feature an external file (input/step5mask which is the same format as the newly output work/step5mask) specifies whether to use land or ocean data for each cell.

This enables me to do something new: a run of ccc-gistemp using only land-data but restricted to locations where there is no ocean data:

I don’t actually have a true land mask to hand (a list of cells that cover the Earth’s land surface), but I can do a normal run of ccc-gistemp that uses both land and ocean data and use the generated mask. That mask tells me which cells have no ocean data (and therefore land data is used, if there is any).

That’s the mask I use above. This reduces the effects of the 1200 km smoothing radius, by preventing land series from being interpolated over the ocean (unless there is no ocean data). This mask has 3077 cells (8000 in total), so clearly some ocean cells are still using land data. It looks like this:

Note that black is not land, but it is where land data is used if there is any. Note the arctic ocean has no ocean data so land data (generally interpolated) is used. Also some of southern ocean will pick up land data.

The fact that the land series when restricted to exclude cells with ocean data shows greater warming can be explained by considering coastal stations. In the usual ccc-gistemp land-only analysis every station has an influence that extends out to a disc of radius 1200 km. Coastal stations will have a disc of influence that extends into the ocean. When excluding cells that have ocean data that means that a coastal station has its disc chopped in half (roughly), and an island station is almost entirely excluded. The overall effect is to weight stations in the continental interior more strongly. And warming is greater in the continent interior.

[edit: Chad also shows a version of the same effect (scroll down the very long post until you reach the red and black graphs): de-emphasising coastal stations increases trend]

We can show the effect of excluding Arctic and Antarctic zones (everything north of 60N and south of 60S is excluded):

Only a small effect. The broken mask version had an even smaller effect, but I’m still a little surprised by how small this difference is.

Again we can exclude cells with ocean data, and again this shows increased warming. This more than cancels out the cooling by excluding the high latitude zones:

Anyone got a real land mask? [edit: Yes, Steven Mosher pointed me at one; so expect a post using a real land mask soon]