Just 440 stations

Over at his blog, Nick Stokes selects rural stations with more than 90 years of data and that report in 2010. In GHCN he counts just 61 stations.

We can do something similar with ccc-gistemp. We get a result that is unreasonably similar to the official GISTEMP with full series.

The major differences in methodology are:

Here’s the short (but not very clear) Python script that identifies long rural stations:

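import itertools
# Rural: the value in columns 102-105 of v2.inv is at most 10.
rural = set(row[:11] for row in open('input/v2.inv')
  if int(row[102:106]) <= 10)
def id11(x): return x[:11]  # 11-digit station identifier
def id12(x): return x[:12]  # 12-digit record identifier
# Long: more than 90 rows for the record, the last of them for 2010.
rurlong = [group for group,rows in
  ((g,list(r)) for g,r in itertools.groupby(open('work/v2.step1.out'), id12))
  if id11(group) in rural and len(rows) > 90 and rows[-1][12:16] == '2010']
print(len(rurlong))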

There are 440 stations. With these 12-digit record identifiers in hand it is a trivial matter to create a new v2.mean file («open('v2.longrural', 'w').writelines((row for row in open('work/v2.step1.out') if row[:12] in rurlong))» if you must).
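The one-liner above can be spelled out a little more readably. What follows is only a sketch: the function name write_subset and its arguments are made up for illustration, and it turns the identifiers into a set so that each membership test is quick.

```python
def write_subset(src_path, dst_path, record_ids):
    """Copy the rows of a v2.mean-format file whose 12-digit record
    identifier (the first 12 characters of the row) is in record_ids.
    Returns the number of rows written.
    The name and signature are hypothetical, not part of ccc-gistemp."""
    wanted = set(record_ids)  # set lookup avoids scanning a list per row
    count = 0
    with open(src_path) as src:
        with open(dst_path, 'w') as dst:
            for row in src:
                if row[:12] in wanted:
                    dst.write(row)
                    count += 1
    return count
```

With the names from the script above, the call would be write_subset('work/v2.step1.out', 'v2.longrural', rurlong).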

As I said before, the results are pretty close to the standard analysis:

The new v2.mean file is over 10 times smaller (uncompressed) than the official GHCN v2.mean file. The analysis is correspondingly about 10 times quicker to run (a few minutes on my old laptop). In case you want to use this file, to replicate my results or run it through your own analysis, I’ve made it available from our googlecode repository.

16 Responses to “Just 440 stations”

  1. Nick.Barnes Says:

    Splendid. Where are the stations? Is this “unreasonably similar”?

  2. Zeke Hausfather Says:

    Seems pretty well in line with similar analyses I’ve run in the past using GHCN data: http://i81.photobucket.com/albums/j237/hausfath/Picture18-4.png

  3. Bob Koss Says:

    I came across two stations which do not appear in the v2.mean_comb file and therefore aren’t used. Both stations appear in the v2.inv and the v2.mean files and have extensive records.

    42572406006 1817-1999
    42572523004 1890-2005

    I checked the GISS station data page at http://data.giss.nasa.gov/gistemp/station_data/

    GISS shows adjusted data files for those stations and is evidently using them in its analysis.

    I noticed this a couple days ago while using version 0.4.1 and find they are still not being passed on since downloading version 0.5.1.

    You might want to investigate.

  4. drj Says:

    @Bob Koss: Those two GHCN stations correspond to USHCN stations (180470 and 300448 respectively). Slightly mysteriously, neither of those USHCN stations has records in the F52 temperature dataset. ccc-gistemp eliminates all GHCN records that correspond to USHCN records, even if there is no actual USHCN data. This is a little odd and I was wondering whether it was intended the other day when I was going over the code. I’ll see what the GISTEMP code does.

    Presumably the USHCN stations are used in another of the USHCN datasets (just not the F52 one), but I haven’t checked.

    The correspondence between GHCN and USHCN stations comes from the files ushcn2.tbl and ushcnV2_cmb.tbl which ccc-gistemp gets from the GISTEMP tarball.

  5. Bob Koss Says:


    Thanks for the prompt response.

    If it is only the two stations I wouldn’t expect a noticeable change. Though it would be nice to fully resolve the why of the discrepancy.

    One other oddity I came across.

    I noticed version 0.4.1 has 1093 empty subboxes through March data. Version 0.5.1 has 1109 empty subboxes through July data.
    The largest difference is 11 in the region (+00/+24 -180/-158), with the other differences close to that area.

    Can I assume this is due to a reordering of stations by record length as you described for Byrd?

  6. Craig Allen Says:

    Is there any possibility of calculating trends in the setting of record highs and record lows across the dataset? It would be of particular interest in discussions of the Russian heatwave.

  7. drj Says:

    @Bob Koss: Looks like dropping US GHCN stations that have no USHCN data is a bug. I’ll file and fix.

    As to your empty subbox thing, I have no idea.

  8. drj Says:

    @Craig Allen: Sorry no! There’s not quite enough spare time around to keep ccc-gistemp ticking over, and since record highs would be an additional project we won’t be taking it on. However, if you fancy a bit of programming (or can find someone who does) and join the mailing list we can probably provide a bit of advice on how to proceed.

  9. Bob Koss Says:


    Thanks for looking into it. I checked a little closer and there seems to be a total of five US stations which fall into the category of not being passed along.

    42572406006 1817,1999
    42572523004 1890,2005

  10. drj Says:

    @Bob Koss: Agreed. It is a bug, and I just fixed it.

    Thanks for reporting it!

  11. Bob Koss Says:


    Your prompt attention was great. Thank you.

    I copied the two modified files and ran it. First time it failed. My version of parameters.py doesn’t seem to be the latest one. Added in the data_sources parameter to the file and it worked fine. Those extra five stations of additional data only amounted to a trivial drop of 0.01 °C when running land only.

  12. GMcKee Says:


    I am using your “Just 440 stations” as a springboard into trying to understand the CCC step2 UHI implementation. Based on a modification of your Python script, I have loaded the input/v2.inv fields of current interest to me into a database table.

    I also have a Python text processing script that is parsing and summarizing information from log/step2.log. So far, I have summarized the ‘action’ and ‘neighbour’ portions of the log file into two additional tables that I am attempting to understand before moving on to the time series stored in the other portions of the log.

    Assuming that neighbours are used in doing the step 2 UHI adjustment, I did a marginal distribution on the number of neighbours. The tail extends to almost two hundred. I decided to start with three sites with the smallest number of neighbors.

    The second site that turned up with only three neighbours is Abadan. I was surprised to learn that one Abadan neighbour is Baghdad. Checking v2.inv, according to “Just 440 stations”, Baghdad is close enough (495 km) and has a low enough night brightness to qualify as a “rural site for UHI adjustment”.

    I have extended this analysis to learn that Baghdad was “cited” 26 times as a “neighbour” during step 2. Looking for other non-airport sites with over 500,000 in population and low night brightness yields, in addition, Nanchang, Los Gatos, Wroclaw, and Izmir. Below is a small CSV table summarizing how many times each has been “cited” as a neighbour.

    13761641001,Goree Senegal,14.4,-19.5,U,799,0,0
    22227995000,Samara (bezen,52.98,49.43,U,1216,7,0
    22228900001,Kujbysev (bezencuk),53.25,50.45,U,1216,9,8
    23248826000,Phu Lien,20.8,106.63,U,1279,0,15
    42574509001,Los Gatos Usa,37.2,-122.0,U,6253,7,61
    63512424000,Wroclaw Ii,51.1,16.88,U,523,8,77
    64917220001,Izmir Turkey,38.4,27.3,U,637,0,50

    I understand this is an input data QC issue rather than a CCC code issue.

    I am not organized enough or far enough along in the step 2 code (or the CCC code in general) to determine for myself if Baghdad’s being cited as a “neighbour” means that it was used in the UHI adjustment process.

    Can you enlighten me on this concern?

  13. drj Says:

    @GMcKee: It is excellent that you have found ccc-gistemp to be useful for this work, and just the sort of thing we would like people to do.

    By the way, lengthy and detailed discussions like this are probably best done on the mailing list (please join!).

    Yes, all the stations listed on the “neighbours” log record are used to construct the combined rural reference record. Note that each neighbour station is assigned a weight based on its distance (see get_neighbours() in step2.py). This weight falls from 1 at the urban station to 0 at the edge of the 500 km threshold (or 1000 km threshold if that is used). The station’s weight is used in producing the combined record.

    In this case the contribution from station 209406500000 will be tiny.
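As a rough illustration of why the contribution is tiny, here is a sketch of a distance weight that falls from 1 to 0. It assumes the fall-off is linear, which the description above does not actually state, and the function name is invented for this example; see get_neighbours() in step2.py for what ccc-gistemp really does. The 495 km distance is the figure quoted earlier in this thread.

```python
def neighbour_weight(distance_km, radius_km=500.0):
    """Weight of a rural neighbour at distance_km from the urban station.

    ASSUMPTION: the weight falls linearly from 1 at the urban station
    to 0 at radius_km and beyond. Hypothetical helper, not ccc-gistemp code.
    """
    if distance_km >= radius_km:
        return 0.0
    return 1.0 - distance_km / float(radius_km)

# A neighbour 495 km away with the 500 km threshold:
print(neighbour_weight(495))          # about 0.01
# The same neighbour if the 1000 km threshold is used:
print(neighbour_weight(495, 1000.0))  # about 0.505
```

Under that assumption a neighbour right at the edge of the radius counts for about a hundredth of one sitting next to the urban station.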

    The “adjustment” log record gives the combined rural reference record: the ‘series’ member is the combined rural record as annual anomalies; the ‘year’ member is the start year (which is 1880); the ‘difference’ member gives the difference between the urban and rural annual anomaly series. It is the difference record that is used to produce the two leg fit.

    Currently the stuff that is logged is the stuff that I found useful for debugging. Step 2 should probably log a little more. Notably, the adjusted years and the slope parameters of the fit are not logged. Also, the neighbour weights should be logged (like they are in Step 3).

    If you’re writing tools to parse the log you should note that we have an unwritten plan that the logs have the form:


    Where OBJECT is some JSON object. They don’t quite have this form at the moment (they are Python literals rather than JSON objects).

    I will likely change the log format to match JSON in the next release.
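For anyone writing a parser in the meantime: a payload written as a Python literal can be read with the standard library's ast.literal_eval, and one written as JSON with json.loads. The sketch below tries JSON first and falls back to the Python literal reader. The function name and the sample payloads are invented for illustration and are not real log output; only the member names 'year' and 'series' come from the description above.

```python
import ast
import json

def parse_payload(text):
    """Parse the OBJECT part of a log record (hypothetical helper).

    Tries JSON first, then falls back to a Python literal.
    """
    try:
        return json.loads(text)
    except ValueError:
        # ast.literal_eval only accepts literals (dicts, lists, numbers,
        # strings and so on); it does not run arbitrary code.
        return ast.literal_eval(text)

# Invented payloads, not real log output:
as_python = "{'year': 1880, 'series': [0.1, -0.2]}"
as_json = '{"year": 1880, "series": [0.1, -0.2]}'
print(parse_payload(as_python) == parse_payload(as_json))  # True
```

A parser written this way should keep working if the log format does switch to JSON.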

  14. drj Says:


    Perhaps you would like to help produce the “whizzy tool” mentioned in this Kathmandu thread on the discussion list:


  15. drj Says:


    Your Los Gatos station shows a clear win for satellite brightness. Los Gatos does not have a population of 6.3 million!

    Yes, the GHCN metadata is wrong.

  16. Peter O'Neill Says:

    This post illustrates the potential pitfalls of using nightlights for station classification based on the GHCN inventory – Los Gatos is not alone in having incorrect metadata.

    The 440 stations consist of 338 US stations, and 102 stations covering the rest of the world. It is these 102 stations I consider here in greater detail (which is not to say that the 338 US stations are problem free, but they cover a small part of the world, and the greater station density may offset some erroneous data).

    The non-US stations are also almost all WMO stations, as indicated by IDs ending in 000. (GISS indicates that these are “probably” WMO stations, with values greater than 000 indicating stations near WMO stations. Here there may be some doubt about the WMO status of two of these 102 stations). This provides an alternative source of station coordinates: the WMO metadata, which is also currently being updated to accommodate station coordinates with higher precision.
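The US / non-US split and the "ends in 000" check can be sketched from the identifiers alone. This is only an illustration: the function names are invented, the 425 prefix is taken from the US identifiers quoted in this thread, and the sample identifiers are ones already mentioned in the comments rather than the actual list of 102.

```python
def is_us(station_id):
    """True if the identifier starts with 425, the prefix carried by the
    US stations quoted in this thread (hypothetical helper)."""
    return station_id.startswith('425')

def is_probably_wmo(station_id):
    """True if characters 9-11 of the identifier are 000.
    Works for 11-digit station and 12-digit record identifiers alike."""
    return station_id[8:11] == '000'

ids = ['42574509001', '63512424000', '64917220001', '41476458000']
non_us = [i for i in ids if not is_us(i)]
print(len(non_us))                                # 3
print([i for i in non_us if is_probably_wmo(i)])  # ['63512424000', '41476458000']
```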

    Correcting the GHCN coordinates, using some judgement where appropriate (all non-matching locations were compared in Google Earth, and errors can still be found in the WMO metadata, even for those stations already updated with higher precision), 24 of these 102 stations are reclassified as non-rural according to the GISS nightlights criterion. The GHCN R/S/U classification shows 44 of these stations as peri-urban or urban – something which should have been shown for the full station set in Hansen et al (2010), and which would have indicated the need for caution in moving to a nightlights classification. The mean location difference was 5.1 km (excluding stations showing no difference), with a maximum difference of 291 km. In this case I have used the recommended nightlights image rather than the deprecated version used by GISS (examination of nightlights contours for familiar locations, particularly small towns set in rural surroundings, shows why the latter image should be regarded as deprecated). This does lead to a small number of reclassifications even without coordinate corrections, but I have found that use of the deprecated image applied to the full station set gives broadly similar reclassification results.

    Unfortunately, metadata correction is not so easy for the full station set, which contains many non-WMO stations. But an error rate of 24 of 102 for this sensibly defined subset should be considered disquieting. Stations currently reporting, and with long records, might be assumed to be those where the true location would be more likely to be known accurately.

    A side issue for Clear Climate Code: while I obtain broadly similar results I have noted some differences, and then compared your subset data with the archived GISS Ts.txt for the same month (data up to May 2010). For example, your record for 301872220004 CATAMARCA AER is missing values for 1941-1950, data which is present in the GISS version (and had already been present in earlier versions, so is unlikely to have resulted from the addition of these values to v2.mean a few days after you took your copy). I also found one additional rural long-record station, 41476458000 MAZATLAN, with data from 1880 to 2010 with 9 missing years from 1912-1920. This station may be missing additional years in your version, similar to CATAMARCA (and I think other stations, but I only checked as far as CATAMARCA as there are very many minor rounding differences in the full comparison).
