Posts Tagged ‘gistemp’

Gavin’s bug

(I call it Gavin’s bug because he found it, not because he created it).

Gavin’s bug concerns Step 3, and in particular the part where for each cell a list of stations within 1200 km is created. Creating this list of “nearby” stations means potentially consulting all the stations. All the stations could be quite a lot. So for each of the 80 boxes (see overview.txt), the code restricts the search list to an extended box. Here’s a typical box (in white) and the extended box (in green):

In principle it’s faster to first cull the entire station list by rejecting those stations outside of an enclosing rectangle based on latitude and longitude, because we can do that quickly without needing to do the trigonometry for the proper “within 1200 km” test.

The bug is that the rectangle chosen for the culling is not large enough. Here’s the code from revision 665 of step3.py:

# Extend box, by half a box east and west and by arc north
# and south.
extent = [box[0] - arcdeg,
          box[1] + arcdeg,
          box[2] - 0.5 * (box[3] - box[2]),
          box[3] + 0.5 * (box[3] - box[2])]
if box[0] <= -90 or box[1] >= 90:
    # polar
    extent[2] = -180.0
    extent[3] = +180.0

Note that the comment says that the box is extended by half a width east and west. The polar boxes (there are 4 for the North Pole, and they all meet at the pole) have to be special-cased because otherwise stations close to the pole would be missed.

The bug concerns the boxes that span latitudes 44 to 64 (there are 8 of these in each hemisphere, see overview.txt). The extended box isn’t quite wide enough:

But wait… that circle (above) is centred on the northwest corner of the box. When it comes to cells, it’s the centre of the cell that is used to select the stations within 1200 km, and the northwest cell is set inside the box a little bit. Is that enough?

The white circle is 1200 km around the centre of the northwest cell. It just fits. Phew!

So even though there are points within 1200 km of the box that are not inside the extended box, no stations are missed, because the cells inside the box are too large to be close enough to the edge.

Perhaps at one time computers were slow enough for this “optimisation” to make a difference, but it’s been irrelevant for probably at least 20 years. Whilst the code seems to be correct, it’s not clearly correct.

It would be simpler and clearer without it. So it’s gone.
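For the record, what remains once the cull is gone is just the straightforward great-circle test against every station. A minimal sketch of that kind of selection (the names here are mine, for illustration, not the actual ccc-gistemp code):

import math

EARTH_RADIUS_KM = 6371.0

def stations_within(cell_lat, cell_lon, stations, radius_km=1200.0):
    # stations is an iterable of (lat, lon) pairs in degrees.
    # Great-circle distance via the spherical law of cosines;
    # no bounding-box cull, every station gets the proper test.
    clat, clon = math.radians(cell_lat), math.radians(cell_lon)
    nearby = []
    for lat, lon in stations:
        slat, slon = math.radians(lat), math.radians(lon)
        cos_d = (math.sin(clat) * math.sin(slat) +
                 math.cos(clat) * math.cos(slat) * math.cos(slon - clon))
        # Clamp to [-1, 1] to guard against floating-point drift.
        d = EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_d)))
        if d <= radius_km:
            nearby.append((lat, lon))
    return nearby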

It does change the results by a tiny, tiny bit, for essentially bogus reasons which I may write up in another post.

Analysis of Canada data

In an earlier post I described the trials and tribulations of tracking down some station data from Environment Canada’s website.

The obvious question to ask is, how does this affect the ccc-gistemp analysis?

For starters, how much extra data do we get, once we’ve merged all the duplicates and rejected short records and so on? Here’s the station count by year for the GHCN stations (dark lines), and the extra Environment Canada stations (lighter lines):

This count is made after ccc-gistemp Step 2 processing, so duplicate records for the same station have been merged, short records have been discarded, and urban stations that could not be adjusted have been dropped (the log tells me there are 18 such stations). New Environment Canada stations (recall that some of the Environment Canada data is for stations that are not in GHCN) do not get any brightness information in the v2.inv file; it so happens that in ccc-gistemp this means they get marked as rural, more by accident than by design. I should probably fix this (by calculating brightnesses for the new stations, and rejecting stations with no brightness data), but this will certainly do for a preliminary analysis.
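Incidentally, a count like this is easy to make directly from the Step 2 output. A rough sketch, assuming the v2.mean record layout (11-character station id, duplicate digit, 4-digit year):

from collections import defaultdict

def station_count_by_year(path):
    # Count distinct stations (11-digit ids) reporting in each year
    # of a v2.mean-format file.
    years = defaultdict(set)
    for line in open(path):
        years[int(line[12:16])].add(line[:11])
    return dict((year, len(ids)) for year, ids in years.items())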

The 1990s still don’t reach the dizzying peaks of the 1960s (in terms of station count), but the Environment Canada data is certainly a welcome contribution, more than doubling the number of stations for recent decades.

The effect of this on the analysis? Here’s the arctic zone:

The first thing to note, if you haven’t seen one of these before, is the scale. The swings in this zone are much larger than the global average (this zone is 5% of the Earth’s surface); the recent warming in this zone is over 5 °C per century! The remaining points of note are the slight differences here and there in the very recent period. That large dip in the 2000s is 2004, and the new analysis has the anomaly some 0.16 °C colder (+0.57 versus +0.73). A warm spike in 1995 is 0.09 °C warmer. The same blips are also just about visibly different on the Northern Hemisphere analysis, but the differences are smaller.

The additional Environment Canada data is welcome, and does affect the result just enough to be visible, but the trends and any conclusion one could derive are not affected at all.

Appendix

The data are available here, but you don’t need to download that if you’re using ccc-gistemp. Run «python tool/cascrape.py» to download the data, and then run «python tool/camatch.py» to generate a mapping table. «python tool/run.py -d 'data_sources=ghcn ushcn scar hohenpeissenberg ca.v2'» will then run the analysis.

Canada

It’s easy to note that GHCN is a bit lacking when it comes to recent temperature data for Canada:

Canada is of course just a particular aspect of the global case, which is well known and discussed in the literature ([Jones and Moberg 2003] for example).

Note in particular the drop off at around 1990, shortly before GHCN v2 was compiled in 1997. Canada is of some interest because it has many relatively high latitude stations, and because it’s an industrialised nation that ought to have a good weather station network. The recent periods of GHCN are updated via CLIMAT messages. Who knows why Canada’s data doesn’t make it into GHCN; perhaps they prohibit NOAA from republishing the data, perhaps their BUFR format isn’t quite what NOAA expect.

Environment Canada publish (some of) their monthly temperature data on the web. Using ScraperWiki I wrote a scraper to collect it, and another program to format it into GHCN v2 format so that it could be used as an input for ccc-gistemp. Whilst there are far fewer stations in the scraped data, there is more data in the 2 most recent decades (note different y-axis scale compared to above):

GHCN is bumping along with record counts in the 30s and 40s for the last 2 decades, while Environment Canada has 100 or more records for most of those years. Until 2007 in fact (change of government?).
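As for the reformatting into GHCN v2 format mentioned above, it is mostly a matter of fixed-width fields. A sketch of the sort of record writer involved (my reading of the v2.mean layout: 12-character record id, 4-digit year, then twelve 5-character monthly values in tenths of a degree Celsius, with -9999 for missing; this is illustrative, not the actual converter):

def v2_mean_line(id12, year, monthly_celsius):
    # monthly_celsius: 12 values in degrees Celsius, None for missing.
    fields = ''.join('%5d' % (-9999 if t is None else int(round(t * 10)))
                     for t in monthly_celsius)
    return '%s%04d%s\n' % (id12, year, fields)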

Quite a few of the stations in the scraped data have fewer than 20 years of data. I exclude those because they’re unlikely to be of use in the ccc-gistemp analysis (in principle they could be combined in Step 1 with existing GHCN duplicates even when they’re short, but meh). It turns out this also avoids some thorny and potentially embarrassing data cleansing war stories (such as why the reported locations for WMO station 71826 are 13 degrees apart).

There are 106 stations from the scraped data with 20 or more years of monthly data.

So, the first question is how do we merge this dataset with GHCN? I suppose I could simply add all the scraped data alongside the GHCN data, but that would be wrong. The scraped data has some stations which are in GHCN, and those would be “double counted”. We ought to try and identify which stations are the same in the two sources; so how do we do that? Most of the scraped stations are WMO stations (87 out of 106) and they come with WMO identifiers. So it’s easy to match those against a possible counterpart in GHCN (WMO stations in GHCN have their 11-digit identifier ending with “000”). Both GHCN and the scraped data come with location information, so we can try matching on that as well. In the case of the matched WMO stations (67), 5 have locations that were more than 0.015 degrees apart, but they’re all within 0.03 degrees.
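The identifier match itself is simple. A sketch (relying on the GHCN v2 convention that an 11-digit identifier is a 3-digit country code, the 5-digit WMO number, then a 3-digit modifier, which is '000' for the WMO station itself):

def ghcn_wmo_index(inv_path='input/v2.inv'):
    # Map WMO number -> GHCN 11-digit id for GHCN's WMO stations,
    # identified by an 11-digit id ending in '000'.
    index = {}
    for row in open(inv_path):
        id11 = row[:11]
        if id11.endswith('000'):
            index[id11[3:8]] = id11
    return index

# For example, ghcn_wmo_index().get('71808') should give '40371808000'.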

Sometimes a scraped WMO station will not match a GHCN station by WMO identifier, but will have a location match. This is probably a station in GHCN that should be marked as a WMO station, but by mistake isn’t.

WMO station 71808 represents a good win. The existing GHCN data (station 40371808000) runs from 1967 to 1982. The scraped data (Environment Canada identifier 7040813) runs from 1982 to 2010. In fact there is no overlap at all: GHCN finishes in 1982-07 and Environment Canada starts in 1982-08. Curious.

WMO station 71808: Blanc Sablon

We need a rule of thumb to decide whether to add the scraped data as a new station or as a duplicate of an existing GHCN station. Scraped WMO stations that match a WMO station in GHCN will always be added as a duplicate. For each remaining scraped station I select the nearest GHCN station as a comparison (this is in Canada in all but one case). An overlap and q score is computed for each GHCN duplicate, and the “best” duplicate is picked. If there are duplicates with “long” overlaps (meaning >= 120 months) then the one with lowest q score is chosen as “best”, otherwise (all the overlaps must be short) the duplicate with longest overlap is chosen. q is sqrt(sum(d[i]**2)/n), where d[i] is the difference between the two series at time i; the sum is over the common period; n is the number of elements (months) in the common period.

So for the non-WMO stations (some of these will be WMO stations, just without an identified WMO counterpart in GHCN) the rule of thumb is: if s+q < 1, then the scraped data is added as a duplicate, otherwise it is added as a new station. s is the separation between the scraped station and the “best” station in degrees; q is the q score of the “best” duplicate.
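Put together, the q score and the rule of thumb look something like this (a sketch straight from the description above; representing each series as a dict mapping month to value is my choice, not ccc-gistemp’s):

import math

def q_score(a, b):
    # Root-mean-square difference over the common period.
    # a and b map month -> temperature; returns (q, overlap in months).
    common = set(a) & set(b)
    n = len(common)
    if n == 0:
        return float('inf'), 0
    ss = sum((a[m] - b[m]) ** 2 for m in common)
    return math.sqrt(ss / float(n)), n

def add_as_duplicate(separation_degrees, q):
    # For non-WMO-matched stations: add as a duplicate if s + q < 1.
    return separation_degrees + q < 1.0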

The added value of this dataset, compared to GHCN, is recent data for high latitude stations. Alert is a classic case, north of 80°N:

The scraped data brings almost 20 years of recent data to the table.

So what does this all add up to when we incorporate this new data into the GISTEMP analysis? Well, I’ve done that, but I’ll have to leave the results for another post.

REFERENCES

[Jones and Moberg 2003] Jones, P. D., A. Moberg, 2003: Hemispheric and Large-Scale Surface Air Temperature Variations: An Extensive Revision and an Update to 2001. J. Climate, 16, 206–223.
«doi:10.1175/1520-0442(2003)016<0206:HALSSA>2.0.CO;2».

ccc-gistemp release 0.6.0

[ed: 2010-10-29: This was an announcement for release 0.6.0, but that has a bug (see comments). Please use release 0.6.1 instead. I’ve edited the article to update the links]

I’m pleased to announce ccc-gistemp release 0.6.1 (making the previous buggy 0.6.0 obsolete). ccc-gistemp is our project to reimplement the NASA GISS GISTEMP algorithm in clearer code (in Python). This release is the first release made under the aegis of the Climate Code Foundation.

Many of the significant changes in this release have already been previewed in earlier blog posts:

Further details are available in the release notes.

We intend to carry on our work on ccc-gistemp, and we urge you to download our code, try it, and read it. We welcome contributions.

The work for release 0.6.0 was carried out by David Jones and Nick Barnes, and the work for release 0.6.1 was carried out by David Jones.

Masking GISTEMP

[update: 2010-08-26: the “no arctic” masks were wrong, fixed now]

I added a land mask feature to ccc-gistemp. The land mask is used in Step 5, where land and ocean series are combined into zonal averages. For each cell, ocean data is used unless: there is very little ocean data (fewer than 240 months); or, there is a land station within 100 km. With the new land mask feature an external file (input/step5mask, which is the same format as the newly output work/step5mask) specifies whether to use land or ocean data for each cell.
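The cell rule is simple enough to state as code. A sketch of the decision as described (the names are mine):

def cell_uses_land(ocean_months, nearest_land_station_km):
    # Use ocean data unless the ocean series is very short
    # (fewer than 240 months) or a land station is within 100 km.
    return ocean_months < 240 or nearest_land_station_km <= 100.0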

This enables me to do something new: a run of ccc-gistemp using only land-data but restricted to locations where there is no ocean data:

I don’t actually have a true land mask to hand (a list of cells that cover the Earth’s land surface), but I can do a normal run of ccc-gistemp that uses both land and ocean data and use the generated mask. That mask tells me which cells have no ocean data (and therefore land data is used, if there is any).

That’s the mask I use above. This reduces the effects of the 1200 km smoothing radius, by preventing land series from being interpolated over the ocean (unless there is no ocean data). This mask selects land data for 3077 of the 8000 cells, so clearly some ocean cells are still using land data. It looks like this:

Note that black is not land, but it is where land data is used if there is any. Note the Arctic Ocean has no ocean data, so land data (generally interpolated) is used. Also some of the Southern Ocean will pick up land data.

The fact that the land series, when restricted to exclude cells with ocean data, shows greater warming can be explained by considering coastal stations. In the usual ccc-gistemp land-only analysis every station has an influence that extends out to a disc of radius 1200 km. Coastal stations will have a disc of influence that extends into the ocean. Excluding cells that have ocean data means that a coastal station has its disc chopped in half (roughly), and an island station is almost entirely excluded. The overall effect is to weight stations in the continental interior more strongly. And warming is greater in the continental interior.

[edit: Chad also shows a version of the same effect (scroll down the very long post until you reach the red and black graphs): de-emphasising coastal stations increases trend]

We can show the effect of excluding Arctic and Antarctic zones (everything north of 60N and south of 60S is excluded):

Only a small effect. The broken mask version had an even smaller effect, but I’m still a little surprised by how small this difference is.

Again we can exclude cells with ocean data, and again this shows increased warming. This more than cancels out the cooling by excluding the high latitude zones:

Anyone got a real land mask? [edit: Yes, Steven Mosher pointed me at one; so expect a post using a real land mask soon]

No US temperatures

Along with other changes I’ve been making to ccc-gistemp I recently added a feature to remove GHCN stations that are in the contiguous US. I did that partly because GISTEMP’s Fortran has a similar feature, and partly because it allows me to do the following whimsical experiment:

Running ccc-gistemp with no USHCN data and dropping contiguous US stations from GHCN:

tool/run.py -p 'retain_contiguous_US=False;data_sources=ghcn,scar,hohenpeissenberg'

The -p option allows a parameter from parameters.py to be set without editing that file. retain_contiguous_US and data_sources are recently added parameters.

Onto the graphs!

Global:

Hmm, not much difference.

What about removing the ocean? Removing it should bring the differences due to land stations into sharper focus:

Tiny differences, not easily distinguishable from rounding.

What about zooming into the Northern Hemisphere, where the US is located?

Aha, some slightly less tiny differences appear which are not due to rounding. US citizens may be interested to note that their contribution to the recent trend is to lower it by a tiny amount (the trend is lower when US stations are included).

ccc-gistemp produces zonal anomaly files (because GISTEMP does). I’ve not really looked at them much before, but we can zoom in further on Zone 24N to 44N where most of the contiguous US is found. Note I had to rescale the next chart to fit the higher peaks in:

What about the next (GISTEMP) zone north, which includes the remaining part of the contiguous US but which has a slightly less dense station network and more volatile temperature swings:

The differences are all lost in the noise?

The US is only about 1½ percent of the Earth’s surface after all.

Just 440 stations

Over at his blog, Nick Stokes selects rural stations with more than 90 years of data and that report in 2010. In GHCN he counts just 61 stations.

We can do something similar with ccc-gistemp. We get a result that is unreasonably similar to the official GISTEMP result with the full series.

The major differences in methodology are:

Here’s the short (but not very clear) Python script that identifies long rural stations:

import itertools

# Rural stations: brightness index <= 10 in the v2.inv station file.
rural = set(row[:11] for row in open('input/v2.inv')
            if int(row[102:106]) <= 10)

def id11(x): return x[:11]  # 11-digit station identifier
def id12(x): return x[:12]  # 12-digit record (station + duplicate) identifier

# Group the Step 1 output by record; keep rural records with more
# than 90 years of data whose final year is 2010.
rurlong = [group for group, rows in
           ((g, list(r)) for g, r in
            itertools.groupby(open('work/v2.step1.out'), id12))
           if id11(group) in rural and len(rows) > 90
           and rows[-1][12:16] == '2010']
print len(rurlong)

There are 440 stations. With these 12-digit record identifiers in hand it is a trivial matter to create a new v2.mean file («open('v2.longrural', 'w').writelines((row for row in open('work/v2.step1.out') if row[:12] in rurlong))» if you must).

As I said before the results are pretty close to the standard analysis:

The new v2.mean file is over 10 times smaller (uncompressed) than the official GHCN v2.mean file. The analysis is correspondingly about 10 times quicker to run (a few minutes on my old laptop). In case you want to use this file, to replicate my results or run it through your own analysis, I’ve made it available from our googlecode repository.

GISTEMP Land Index

GISS publish a land-only temperature anomaly (referred to as their “traditional analysis”).

As I pointed out in an earlier article ccc-gistemp can now create a land index by omitting Step 4: python tool/run.py -s0-3,5.

Here’s how we compare with official GISTEMP:

GISTEMP Urban Adjustment

After some recent tweaks by me to the ccc-gistemp sources it is now possible to run a pipeline of the GISTEMP process with some of the steps omitted. An earlier post shows how I can omit Step 4 to create a land-only index. My recent changes allow Step 2 to be omitted. Step 2 is the urban adjustment step (in which stations marked as urban have their trend adjusted).

Omitting Step 2 will therefore give us an idea of the magnitude of the effect of the urban adjustment. It so happens that my writing this blog post overlaps with Nick Barnes implementing GISTEMP’s new scheme for identifying urban stations (corresponding to GISTEMP’s update of 2010-01-16). That gives me an opportunity to show both the new and old adjustment schemes against a “no adjustment” baseline:

In making this graph Step 4 has been omitted, giving us a land index. This is primarily to amplify the differences: land covers the lesser fraction of the Earth; so including the ocean data (which does not require an urban adjustment) makes the difference smaller.

And for each hemisphere:

Northern:

Southern:

To make a “no urban adjustment” run of ccc-gistemp: «python tool/run.py -s 0,1,3,5»; and to make an “urban adjustment” land-index: «python tool/run.py -s 0,1,2,3,5».

The 1990s station dropout does not have a warming effect

Tamino gives his results for his GHCN-based temperature reconstruction. It is well worth reading. He also gives a comparison between stations that are reporting after 1992 and those that “dropped out” before 1992. He concludes that there is no significant difference in the overall trend, in other words refuting the claim that the 1990s station dropout has a warming effect. His results are preliminary and for the Northern Hemisphere only.

Tamino’s analysis uses only the land stations; in order to write this blog post I tweaked ccc-gistemp so that we can produce a land index (python tool/run.py -s 1-3,5 now skips Step 4, avoids merging in the ocean data, and effectively produces a global average based only on land data).

It is very easy to subset the input to ccc-gistemp and run it with smaller input datasets. So in this case I can split the input data into stations reporting since 1992 and those with no records since 1992, and run ccc-gistemp separately on each input. I created tool/v2split.py to split the input data (sketched below, after the chart). Specifically I ran Step 0 (which merges USHCN, Antarctic, and Hohenpeissenberg data into the GHCN data) to create work/v2.mean_comb, then split that file into those stations reporting in 1992 and after, and those not reporting after the cutoff. Then I ran Steps 1, 2, 3, and 5 of ccc-gistemp to create a land index:

It is certainly not the case that the warming trend is stronger in the data from the post-cutoff stations. [edit 2010-03-22: In a subsequent post I add trend lines to this chart]
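As for the split itself, tool/v2split.py is essentially a two-pass filter. A sketch of the idea (illustrative, not the actual script; it assumes the v2.mean layout of 11-digit station id, duplicate digit, 4-digit year):

def split_by_cutoff(infile='work/v2.mean_comb', cutoff=1992):
    # First pass: stations (11-digit ids) with any data in or after cutoff.
    recent = set(line[:11] for line in open(infile)
                 if int(line[12:16]) >= cutoff)
    # Second pass: route each whole station to one of two output files.
    with open('v2.recent', 'w') as kept, open('v2.dropped', 'w') as dropped:
        for line in open(infile):
            (kept if line[:11] in recent else dropped).write(line)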

The differences between these results and Tamino’s are interesting. Both show good agreement for most of the 20th century. These data show more divergence than Tamino’s in the 1800s. Is that because we’re using Southern Hemisphere data as well, or is it because of the difference in station combining? Further investigation is merited.

We hope to make “experiments” of this sort easier to perform using ccc-gistemp and encourage anyone interested to download the code and play with it.

Update: Nick B obliges with a graph of the differences: