domingo, 21 de octubre de 2012

Analyzing John Snow's cholera dataset with QGIS and SEXTANTE

To start this blog, I have taken the well-known dataset that John Snow used in his groundbreaking work and I have been playing with it in SEXTANTE to get some results. Not really new things to come out (the analysis is pretty obvious and there is no need for sofisticated GIS techniques to end up with good results and conclusions), but I think it is a good way of showing how these spatial problems can be analyzed and solved by using different tools included in SEXTANTE.

The dataset I have used contains shapefiles with cholera deaths and pump locations, and an OSM rendered map in TIFF format. They have been adapted from this dataset . There were a couple of missing pumps and I added some values to the pump layer attributes table. Also, I had to modify the OSM image to be able to process it in SAGA, which does not allow for raster layer with different x and y cellsizes (I have to see how to do that automatically in SEXTANTE...but it doesn't look like an easy task...)

The first thing to start with is calculting the Voronoi diagram (a.k.a. Thyessen polygons) of the pumps layer.
The Voronoi Diagram tool in MMQGIS can be used for that.

Pretty easy, but it will already give us interesting information.

Clearly, most cases are within one of the polygons

To get a more quantitative result, we can count the number of deaths in each polygon. Since each point represents a building where deaths occured, and the number of deaths is stored in an attribute, we cannot just count the points. We need a weighted count, so we will use the Count points in polygon (weighted) tool in the fTools group.

The new field will be called DEATHS, and we use the COUNT field as weighting field. The resulting table clearly reflects that the number of deaths in the polygon corresponding to the first pump is much larger than the other ones.

Another good way of visualizing the dependence of each point in the Cholera_deaths layer with a point in the Pumps layer is to draw a line to the closest one. This can be done with the Distance to closest hub tool in the MMQGIS group, and using the configuration shown next.

The result looks like this:

Although the number of lines is larger in the case of the central pump, do not forget that this does not represent the number of deaths, but the number of locations where cholera cases were found. It is a representative parameter, but it is not considering that some locations might have more cases than other.

A density layer will also give us a very clear view of what is happening. We can create it with the SAGA Kernel density algorithm. Using the Cholera_deaths layer and its COUNT field as weight field, and with a radius of 100, we get something like this.

Combining with the pumps layer, we see that there is one pump clearly in the hotspot where the maximum density of death cases is found.

All these analysis are using euclidean distance, but people did not walk in straight lines to go to the pump to get some water. Let's try to take that into account.

There are many sources of spatial data where we could get vector roads data, but let's imagine that we cannot have other data apart from the OSM image. How could we use it?

First, let's classify the map, so as to have a new map with a given value (1 seems like a good one, we will later see why) in those pixels representing streets, and the no-data value in the rest of them. All non-streetpixels correspond to buildings, and are rendered in a light-orange dark. This is a paletted image, and the index for that color is 204. To check it, you just have to select the information tool, select the image to query and click on any of those orange pixels. The dialog that shows up will tell you the band value.

So what we have to do is to change those pixels with a value of 204 to no-data (you cannot walk through buildings), and the rest of them to 1 (you can walk there). There are several ways of doing that. Using the Reclassify algorithms is one of them. The raster calculator is another one, and we have two of them, one in SAGA and one in GRASS. I will be doing it using the SAGA one.

Here is how the parameters dialog has to be filled.

You have to select the OSMMap layer in the grids field. The formula ifelse(a=204, -99999,1) tells SAGA that if the value in first grid in the Grids field (grids are referred as a, b, c, etc...or g1, g2, g3, etc, and in this case we just have one), the resulting value will be -99999 (no-data), otherwise it will be 1.

(I have to think about a better way of setting no data values, but in the meantime it is a good thing to know that SAGA uses -99999 as the default no-data value, so you can use it directly like that.)

The resulting layer looks like this.

Roads are rendered in grey, buildings are transparent. Rename it to "streets"

You might notice that this is a quick and dirty classification, as it does not consider other colors in the map, like those of labels, but in the center area where the interesting stuff is happening, it represents a very good solution, so there is no need for more refinement.

Let's turn this layer into a new one where no-data vlaues will be preserved, but all street pixels, instead of having a value of 1, will have the walking distance to the closest pump, considering, of course, the streets.  This is going to be a raster analysis, so let's first convert the pumps vector layer into a raster one.

Select the Shapes to Grid tool in SAGA and fill it as shown next.

To get the output extent, you do not have to type it. Click on the button on the right-hand side and select Use layer/canvas extent.

Since we will be using this rasterized layer along with the streets one, select the streets layer and its extent will be automatically added to the text field. You can do the same with the cellsize, but it is not needed, since the cellsize of the streets image (and the original OSMMap layer) is 1, which happens to be the default value (just a coincidence...)

The resulting layer looks mostly transparent, as it is all comprised of no-data values, except for those pixels where a point representing a pump was found. You might not see them, but those pixels are there, do not worry. We will use them in the next step. Rename the layer to "pumps"

Select the SAGA  Accumulated cost (isotropic) algorithm and set the following values.

Execute the algorithm and you will get two new layers, one with the cost, and another with the index of the closest destination point (the closest pump). This last one is really more intersting for us. Changing the styling to use the Pseudocolor color map, and adding the vector Pumps and Cholera_deaths layers, you should get something like this.

There is not much difference when compared with the Voronoi polygons, but it it illustrates a different way of getting to the same result. Other parameters (i.e. steepness of streets) can be incorporated this way, just changing the cost layer (now cost is 1 through all cells).

The cholera_deaths points layer can be extended sampling the closest point raster layer, adding a new attribute with the index of the closest pump. I'll let you find out how to do that with SEXTANTE.

I have a few more ideas, and will probably write some more examples based on this dataset. The results are not going to be much more better than this ones (these are more than enough to "solve" the problem), but will be useful for showing different techniques to use SEXTANTE in QGIS.

If you want to go a bit further, you can try creating a model with all these steps, or even a script in the QGIS console.


Note 1: I developed the SAGA cost algorithms quite a few years ago.  I think now that they can be improved, and I'm thinking about doing some improvements with new ideas. This looks to me  like a good excuse to go back to a bit of good old c++ coding :-)

Note 2: I fixed a couple of bugs, and improved some SEXTANTE little things while writing this, so you need  to work with the github version to be able to replicate this.

Note 3: Of course, the street classification can be done as well using the grayscale OSM image.

2 comentarios:

  1. Could you please provide a link to the dataset used in this analysis? Thank you very much.

  2. Ivan, I will add an entry about how I prepared the data, and put the altered dataset somewhere. In the meanwhile, you can try downloading the original data from the link already in the text...and figure out how to adapt it :-) It's not hard and it is a good exercise, I guess.

    Thanks for the interest