# Introduction

## Context

Spatially continuous data play a significant role in planning, risk assessment and decision making in environmental management (Li et al. 2011). However, these data are not always available and often difficult or expensive to acquire. The acquisition of environmental data such as groundwater temperature, hydraulic head, substance concentration of soil are usually collected by point sampling. Then, geoscientists often require spatial interpolation methods to get spatially continuous data over a region of interest: here comes the Geostatistics.

## What is geostatistics?

Geostatistics is a branch of statistics focusing on spatial or spatiotemporal datasets. Developed originally to predict probability distributions of ore grades for mining operations, it is currently applied in diverse disciplines including petroleum geology, hydrogeology, hydrology, meteorology, oceanography, geochemistry, geometallurgy, geography, forestry, environmental control, landscape ecology, soil science, and agriculture (wikipedia definition).

The principle of geostatistic is well resumed on the documentation webpage:

The basic idea of geostatistics is to describe and estimate spatial correlations in a set of point data. The typical application is geostatistics is an interpolation. Therefore, although using point data, a basic concept is to understand these point data as a sample of a (spatially) continuous variable that can be described as a random field, or to be more precise, a Gaussian random field in many cases. The most fundamental assumption in geostatistics is that any two values xi and xi+h are more similar, the smaller h is, which is a separating distance on the random field. In other words: close observation points will show higher covariances than distant points. In case this most fundamental conceptual assumption does not hold for a specific variable, geostatistics will not be the correct tool to analyze and interpolate this variable.

## What we do here…

The aim of this notebook is to build a piezometric map using the Python Scikit-GStat library and to compare the results with other standard interpolation techniques. After some setups, we will download a dataset of points giving hydraulic head of a groundwater body located in the area of Lyon (France) and we will clean this dataset to eliminate all points outside of our groundwater body of interest.

In a second part, we will apply two standard interpolation methods (i.e. linear and cubic) given by the *griddata* function (from *scipy.interpolate*) to map the hydraulic head across our area of interest. The limits of both interpolation methods will be discussed.

Finally, we will explore the ordinary kriging method given by the Scikit-GStat library:

- We will first build a semi-variogramm exploring different parameters to better understand the relationship between measurements variability and distance between measurements.
- Secondly, we will build an ordinary kriging model to interpolate the hydraulic head across our area of interest.
- We will finally see how to plot the error estimation across our area of interest.

Please note the reference of this library and the full documentation:

*Mirko Mälicke, Helge David Schneider, Sebastian Müller, & Egil Möller. (2021, April 20). mmaelicke/scikit-gstat: A scipy flavoured geostatistical variogram analysis toolbox (Version v0.5.0). Zenodo. http://doi.org/10.5281/zenodo.4704356*,- the full documentation of this Python library can be downloaded here.

Finally, please note that the name kriging refers to its inventor Dave Krige who published the method in 1951:

*Krige, D. G. (1951). A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Southern African Institute of Mining and Metallurgy, 52(6), 119-139.*

# Run me first

Two options are proposed to install scikit-gstat. Please select the one you prefer.

## Option 1: Install scikit-gstat with PyPi

Run the following command in you terminal:

## Option2: Install the most recent version from GitHub

To run option 2, please copy/past following instructions into your powershell/terminal:

## Install/Import other libraries

We also need to install and import other libraries as follow:

## Import our hydraulic head dataset

The hydraulic head dataset we are working with is available on the github repository of this notebook. Please note that this dataset is derived from a dataset provided by the french geological survey. Please visit a previous notebook to see how to explore french hydrogeological data.

To download information at the scale of a department, we need some a geographical dataset with administrative units of France. We can use the openstreetmap shapefile of departments of France. We download this shapefile and put it in a local directory as follow:

We can display the location of our points on a static map using `matplotlib`

:

Now, let’s display the location of our points on an interactive map using `folium`

. We also add geological tile layers of France using the `folium.WmsTileLayer`

method (please note that we work with longitudes and latitudes with `folium`

):

## Cleaning our dataset

Before any interpolation, we need to clean our dataset. Particularly, we need to exclude all points which are not located in our groundwater body of interest. In this area there are several groundwater bodies and we make the choice to work on the shallow alluvial groundwater body. It includes groundwater flowing through fuvio-glacial corridors (on the eastern part of our available dataset), and groundwater flowing through the modern alluvial deposit.

To do so, we need to make a spatial join between our dataset and the shape of our groundwater body. The shape of the relevant groundwater body can be downloaded as follow (please note that provided file is an incomplete shape of the real groundwater body):

We then make the spatial join and display the location of both files:

We can also display some histograms using the `hist`

method and boxplots using the `boxplot`

method to have a closer look on how our values (i.e. hydraulic head `hh`

and ground elevation `z`

) are distributed:

The hydraulic head boxplot put in evidence a very low outlier (around 130 m). This point can easily be identified in yellow on the previous map, in the south of our area of interest. Considering the very low hydraulic gradient of the groundwater body we are studying (about 0.2%) and other measurements around, this value is clearly an error and should be removed from our analysis (or maybe not an error but a hydraulic head measurement inside a pumping well). Then, let’s remove points where hydraulic head is lower than 145 m and display our cleaned dataset (please note that we could have gone further to clean this dataset):

# Application of two standard interpolation techniques using `griddata`

## Creation of a grid over the area of interest

First, we need to build a meshgrid over the area of interest. A mesh-grid is a grid of coordinates, in which each value simply holds the indices of the position in the grid. In our case, the meshgrid is two dimensional, therefore the two grid-arrays (for x and y) will hold all coordinates present in the grid. We need to create this grid, as the coordinates are the input data for the `griddata`

function. To do so, we need to specify the resolution of the grid we want (the number of values between our x-min and our x-max and similarly in the y-direction) and build the grid using the `numpy.mgrid`

method:

## Linear and cubic interpolations

`griddata`

function from the `scipy`

library:We can now plot the results given by both techniques:

## Limits of standard interpolation techniques

`scipy.interpolate.griddata`

documentation:- the
`linear`

method tessellates the input point set to N-D simplices, and interpolate linearly on each simplex. - the
`cubic`

method returns the value determined from a cubic spline.

Of course, both techniques can gives pretty good results for dense and regularly spaced datasets, but these conditions barely happen when handling environmental measurements/sampling which are often scattered and non-uniformly distributed. Under such conditions, these techniques have several drawbacks:

- the weighted model is completely arbitrary and doest not account for spatial variability of measurements,
- we have no idea of the variance/error associated to the interpolation method,
- the cubic method trend to emphasis on diverging values (we can easily check that with the multiple circular zones on the figure)
- it is not possible to make extrapolations with these methods.
- both methods use splines and can therefore be highly influenced by outliers

# Application of a geostatistical approach with `scikit-gstat`

## Variography study of our dataset

The first step of our geostatistical analysis consists in analysis the variability of our mesurements using a variogram. We can now import to tools we are going to work with (if you need a quick recap on main geostatistic concepts, please visit this video on YouTube):

We can now build our variogram using the `Variogram`

function. We need to specify some arguments:

- the coordinates zipped into a list of tuples,
- the hydraulic head values of our dataset associated to the coordinates we specified earlier,
- the
`normalize`

parameter which is`True`

or`False`

: used to normalize distances and semi-variances. - the
`model`

representing the theoretical variogram function to be used to describe the experimental variogram. It can takes the following possibilities:`spherical`

,`exponential`

,`gaussian`

,`cubic`

,`stable`

, and`matern`

, - the
`use_nugget`

parameter which is`True`

or`False`

: the nugget represents the intrinsec variability of measurements (occuring even if the distance between measurements is zero). In our case, wa can expect a nugget effect because our hydraulic head measurements are not synchronous and there is of course a seasonal fluctuation of hydraulic head. - the
`maxlag`

parameter representing the maximal distance used between points to calculate semi-variances. - the
`n_lags`

parameter representing the number of intervals in which we divide`maxlag`

to calculate intermediate semi-variances.

This variogram gives us following informations:

- at the origin, we observe the nugget effect: this effet is about 2-4m wich is in line with natural hydrualic head fluctuation in our area of interest,
- of course when the distance increases, the semivariance increases too,

`scikit-gstat`

also has a powerful fitting parameterization. You can switch between a still growing set of fitting algorithms, and apply different weighing functions. These can be used to put more weight on shorter lag bins. This can be helpful in cases where outliers at larger distances than the effective range influence the sill. Remember, that a good fit of the theoretical variogram will have a higher impact on the Kriging quality, than on large lags. With `fit_sigma`

you can either pass weights, or the name of a weighing function. With `'linear'`

the weights of each lag bin will decrease linearly.

You can see, that a linear decrease of weights completely under-estimates the sill. But the `fit_sigma='sqrt'`

produces a slightly better fit on the first 5 bins. We can of course play with all parameters to find the more appropriate model.

## Building our ordinary kriging model

Now that we have a fitting model for our variogram, we can build our ordinary kriging model using the `OrdinaryKriging`

function of the `scikit-gstat`

library. We must specify some parameters such as:

- the variogram we want to use,
`min_points`

: the minimum number of points we want to take into account to make our calculation,`max_points`

: the maximal number of points we want to take into account to make our calculation,`mode`

: has to be one of`exact`

or`estimate`

. In`exact`

mode (default) the variogram matrix will be calculated from scratch in each iteration. This gives an exact solution, but it is also slower. In estimate mode, a set of semivariances is pre-calculated and the closest value will be used. This is significantly faster, but the estimation quality is dependent on the given precision.`precision`

(integer): Only needed if`mode="estimate"`

. This is the number of pre-calculated in-range semivariances. If chosen too low, the estimation will be off, if toohigh the performance gain is limited.

Now we can display the result of our ordinary kriging estimation and its associated error:

As we can see:

- we are able to make extrapolation with our ordinary kriging model,
- the error increase in areas with a lower density of measurements.

Now, considering that interpolation/extrapolation has no meaning outside the groundwater body of interest, we can make a mask to display the result on relevant locations:

We finally display the result:

## Export the result as a shapefile

Now we may want to export our ordinary kriging estimation as a shapefile. So, let’s convert resulting array into a geodataframe:

We can finally save the the shapefile using the `to_file`

method:

## Export the result as a geoTIFF raster

To export the kriging estimation as a geoTIFF please follow the procedure below:

# Discussion

In this notebook, we applied an ordinary kriging method to determine the hydraulic head across an area of interest. The result is in great accordance with the regional understanding of groundwater flow systems. However, our dataset is based on non-synchronous measurements leading to a significative variability which does not allow us to describe draw conclusions regarding local flow systems behavior.

There are many other geostatistical methods derived from the variography study (e.g. simple kriging, ordinary kriging, universal kriging, co-kriging, etc.)

Please note that kriging algorithms can be combine with machine learning and/or deep learning algorithm to improve local estimations: see Li et al., (2011) and Li & Heap (2014).

# Play with the script…

This post is available as a notebook and can be run on Goolge Colab.

Please note that all my notebooks are available on my GitHub repository PythonForGeosciences.