Felicísimo, Angel M. (1994): Parametric statistical method for error detection in digital elevation models. ISPRS Journal of Photogrammetry and Remote Sensing, 49(4): 29-33.
We propose an automatic method for the detection and correction of anomalous values in matrix digital elevation models. The method uses statistical criteria to estimate the error probability of each point from parameters derived from the model itself, so that the test adapts to the relief characteristics of the area. It has been programmed for the Arc/Info Geographical Information System environment, and the program is presented both in Arc Macro Language and in a generic programming language.
The Digital Elevation Model (DEM) is the basic source of information for developing other models that depend, totally or partially, on topography. The usefulness and validity of the results obtained are therefore intimately associated with the quality of the original model, where quality is measured in terms of the kind and magnitude of its errors. However, DEMs are very frequently created, distributed and used without any reference to the magnitude of their errors or to the methods applied for their detection and correction. Extending the scope to Geographical Information Systems, Veregin (1989) points out that critical analyses of error sources are dramatically absent from many publications, and that derived products are presented without any estimate of their accuracy.
In the case of matrix DEMs, the errors are of an attributive type, that is, they imply an incorrect assignment of altitude and modify the values of the Z-axis. These errors commonly appear during the creation of DEMs, by both automatic and manual procedures. It is therefore necessary to apply methods systematically for their detection, measurement and correction.
Despite the importance of these aspects in the area of Digital Terrain Models (DTM) and Geographical Information Systems (GIS), the literature is sparse. This paper presents an error detection method directly suited to matrix DEMs. The method is based on the evaluation of transition probabilities, that is, the probability that the difference between a value and its neighbours reaches a given magnitude. It can be applied automatically, as no operator intervention is needed, and it can be complemented by correction routines once conflictive data are detected.
Although the statistical principles employed are not new, we have found no bibliographical reference applying the criteria we propose here to digital terrain models. The method described has been used systematically in our research centre for the revision of digital elevation models, both self-developed and commercial. The results obtained and its low application cost recommend its routine use in research centres in order to increase confidence in the DEMs in use.
Large errors in matrix models are a special case in which some data have clearly underestimated or overestimated values. Such errors may even be generated by automatic stereo-correlation methods, which can run into operational problems caused by low image contrast or by ambiguities due to repeated objects or periodic patterns. When the models are built by transforming pre-existing vector files, conflicts can originate in earlier errors or in the behaviour of the interpolation algorithms in problematic zones. These conflicts are typically local: they usually appear in isolated form and follow a random distribution. Detection therefore cannot be achieved by sampling methods that compare DEM values with true values; the DEM must be checked intensively, with exhaustive analyses, to guarantee that the errors are located.
Despite the evident significance of this kind of error, detection methods appear to have received very little attention from researchers. In practice, detection of such errors in DEMs is in many cases carried out by visual inspection of perspective or shaded relief displays (Weibel and Heller, 1991), whereas systematic and exhaustive analytical methods should be applied.
Generally, errors tend to be detected under the assumption that the land surface is essentially continuous (Chrisman, 1991), and hence that comparison of close values is a valid tool for the detection of anomalous data.
The usual reference is Hannah (1981), who proposes error detection tests based on comparing each point with its neighbouring points, using both slope values and slope-change values. For both parameters, threshold values are used in order to detect conflictive points.
More recently, Nagao et al. (1988) have put forward more complex tests based on discriminant analysis to separate correct points from erroneous ones. In this case, threshold values must be selected for two parameters (a variance value and a correlation value).
Despite their different character, these tests present problems in practice. They are not statistical in nature, in the sense that they do not provide an error probability for the altitude being checked, and no criteria are suggested for defining threshold values adequate for each case. The second method additionally requires reference values for both data groups (correct and erroneous), which implies obtaining prior data that are difficult to come by.
The method we propose in this paper corrects these problems: it provides a statistical error probability for each point, derives its parameters from the model itself so that the test adapts to the relief of the area, and can be applied automatically, without operator intervention.
This test is based on the analysis of the difference between two altitude values at each point: the first is the one stored in the DEM (right or wrong); the second is a value obtained by interpolation from the neighbouring points. The process begins by calculating, for each point of the model, an altitude estimate from its closest neighbours. The choice of interpolation method is not critical, and a bilinear interpolation using the four closest neighbours may be enough. In that case, the altitude assigned to a point placed in row i and column j is given by the following expression:
(1) $\hat{z}(i,j) = \tfrac{1}{4}\left[z(i,j-1) + z(i,j+1) + z(i-1,j) + z(i+1,j)\right]$
The difference between the estimated height and that of the DEM is:
(2) $\delta(i,j) = \hat{z}(i,j) - z(i,j)$
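As an illustration (not part of the original program), Eqs. (1) and (2) can be computed for all interior points at once. Python with numpy is our choice here; the treatment of border cells, which lack four neighbours, is left open by the paper and simply skipped in this sketch:

import numpy as np

def interpolation_differences(dem):
    z = np.asarray(dem, dtype=float)
    # Eq. (1): mean of the four closest neighbours of each interior cell.
    zest = 0.25 * (z[1:-1, :-2] + z[1:-1, 2:] + z[:-2, 1:-1] + z[2:, 1:-1])
    # Eq. (2): difference between the estimate and the stored altitude.
    return zest - z[1:-1, 1:-1]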
If the process is applied to all the points of the DEM, we can obtain the arithmetic mean and the standard deviation of the differences:
(3) $\bar{\delta} = \dfrac{1}{N} \sum_{i} \sum_{j} \delta(i,j)$

(4) $s_{\delta} = \sqrt{\dfrac{1}{N} \sum_{i} \sum_{j} \bigl(\delta(i,j) - \bar{\delta}\bigr)^{2}}$

where $N$ is the total number of points processed.
The previous values define the distribution of the differences, assumed to be Gaussian: $N(\bar{\delta}, s_{\delta})$. As the DEM constitutes the whole population, the values defining the distribution can be considered population parameters rather than sample statistics. Knowing the parameters of the distribution allows us to perform a significance test on the individual values $\delta(i,j)$, with which we can accept or reject the hypothesis that an observed individual deviation belongs to the population of deviations. This is implemented as a Student t test, applied to every point according to the following expression:
(5) $t(i,j) = \dfrac{\delta(i,j) - \bar{\delta}}{s_{\delta}}$
This value is a standardized deviation and, as the number of data in the model is normally very high, its magnitude is compared with the tabulated value $t_{\alpha,\infty}$. The probability $\alpha$ of a Type-I error can be set so that only large deviations are detected. The error condition (represented as $e$) is therefore given by an expression such as:
(6) $e \Leftarrow |t(i,j)| \geq t_{\alpha,\infty}$
For $\alpha = 0.001$, the statistic $t_{\alpha,\infty}$ has a value of 3.291 in a two-tailed test, where the null hypothesis is $H_0\!: \delta(i,j) = \bar{\delta}$ and the alternative is $H_1\!: \delta(i,j) \neq \bar{\delta}$; in practice it is convenient to use even stricter thresholds (a smaller $\alpha$, hence a larger critical $t$) in order to detect only the most significant differences. A point with a significantly high $t(i,j)$ is not necessarily an error, but it is an excellent alarm signal.
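A minimal sketch of the test of Eqs. (3)-(6), under the same assumptions as the fragment above (delta is the matrix of differences; since the DEM is treated as the population, the population standard deviation is used):

import numpy as np

def suspect_mask(delta, t_crit=3.291):
    # Eqs. (3)-(5): standardize the differences with the population
    # mean and standard deviation.
    t = (delta - delta.mean()) / delta.std()
    # Eq. (6): flag the cells whose deviation is significantly large.
    return np.abs(t) >= t_crit

For $\alpha = 0.001$ (two-tailed) the critical value is 3.291; the program discussed below uses the stricter condition t > 10.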
The error detection stage is logically followed by the correction of the erroneous data. In the case of vector digital models, the correction can occasionally be limited to deleting the wrong datum (a point in a line, for instance) without significant loss of information. In matrix DEMs, conversely, the wrong value must be replaced by a right one, as the matrix structure does not allow voids.

The correction of local errors may be performed manually or automatically. In the first case, the right value is entered once the conflictive point has been detected. Automatic correction uses algorithms to estimate an acceptable (though not necessarily true) value to substitute for the erroneous one. In matrix DEMs, automatic substitution can be useful when dealing with isolated errors, or with very large models where manual editing would be too cumbersome. Substitution methods use interpolation algorithms that derive a coherent value from the context of the problem point; the target is that error detection tests, applied afterwards, accept the new value.

Interpolation can take very different forms. The simplest method substitutes the erroneous value with the mean of the surrounding data, taking into account that any neighbour itself flagged as conflictive must be excluded from the calculation. The price of this simplicity is a theoretically less realistic result than more complex methods can provide. Among these, the best option would probably be kriging: when certain properties of the variable's distribution hold, it is the optimal interpolation method, since it provides unbiased estimates with minimum and known variance (Royle et al., 1981). This is not the place to explain the theoretical bases of kriging, which are covered in specific publications (e.g. Davis, 1973; Oliver and Webster, 1990); we shall only point out some basic aspects relevant to DEM correction, the aim of the present paper. Kriging estimates the value of the variable by means of a weighted linear sum of the data $z_k$:
(7) $\hat{z}(i,j) = \sum_{k=1}^{n} \lambda_k z_k$
where the $\lambda_k$ represent the weights assigned to each datum. These are estimated according to the positions of the data, both relative to the unknown point and to one another. In the case of DEMs, it is reasonable to suppose that the altitude of a given point is somehow related to the values of the neighbouring points, distributed at variable distances, and that the "influence" of farther points is smaller than that of nearer ones. In kriging, this dependence is estimated empirically by means of the covariance between data separated by different distances. To this end it uses the semivariance of the differences, whose value is calculated by the expression:
(8) $\gamma(h) = \tfrac{1}{2}\,\mathrm{var}\left[z(p) - z(p+h)\right]$
in which var stands for variance, $z(p)$ is the value of the variable at a point $p$, and $z(p+h)$ is the value at a point placed at a distance $h$ from the former. The function relating $\gamma$ to the distance $h$ is called the semivariogram, and it shows how the correlation between data varies with distance. The next step is to calculate the $n$ weighting coefficients $\lambda_k$, which are chosen so that the estimate is unbiased ($\sum \lambda_k = 1$) and the expected variance is minimized. This is performed through the following equation, expressed in matrix notation:
(9) $\left[\gamma(h_{ij})\right]\left[\lambda_i\right] = \left[\gamma(h_{ip})\right]$
where $[\gamma(h_{ij})]$ is a matrix with the semivariance values for the distances between the points $i$ and $j$ ($i = 1,\dots,n$; $j = 1,\dots,n$); $[\lambda_i]$ is a vector of $n$ elements containing the values of the weighting coefficients ($i = 1,\dots,n$); and $[\gamma(h_{ip})]$ is another vector with the semivariance values for the distances between each point $i$ ($i = 1,\dots,n$) and the problem point $p$ (see Royle et al., 1981 for greater detail).
Although kriging is an optimal estimator (provided that certain conditions are met), it has some disadvantages. The most important one is that it requires intensive numerical calculation, and its execution is therefore considerably slow, at least in comparative terms. In this particular case, however, with regularly spaced data, kriging is relatively simple: using the 8 closest neighbours of the problem point, only two distances are involved, that of the 4 closest neighbours (one interval between rows or columns) and that of the diagonal data (the previous interval multiplied by the square root of 2). Kriging can then be a reasonable alternative to the mean altitude method. Both methods are sufficient for the specific task of obtaining a "filling" value in a DEM, since the aim is simply to avoid manual editing; the substituted value provides no additional information, as it is deduced from the surrounding altitudes. Our experience is that the mean altitude method is enough to obtain an adequate substitution value and, for the sake of simplicity, it is the one used in the following section.
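For the regular-grid case just described, the whole computation fits in a short sketch. The following Python fragment is an illustration of ours, not part of the original program: it estimates Eq. (8) at a few row-wise lags, fits a linear semivariogram through the origin (a modelling assumption), and solves Eq. (9) in its augmented ordinary-kriging form, with a Lagrange multiplier enforcing $\sum \lambda_k = 1$, for the 8 neighbours of a cell:

import numpy as np

def semivariance(z, h):
    # Eq. (8) along rows, at a lag of h grid intervals: half the
    # variance of the differences between values h columns apart.
    d = z[:, h:] - z[:, :-h]
    return 0.5 * d.var()

def kriging_weights_8(dem, interval=1.0):
    z = np.asarray(dem, dtype=float)
    # Fit a linear semivariogram gamma(h) = b*h through the origin,
    # using the first few empirical lags (an assumption of this sketch).
    lags = np.arange(1, 5)
    gammas = np.array([semivariance(z, h) for h in lags])
    b = (lags * gammas).sum() / (lags * lags).sum()
    gamma = lambda h: b * h
    # The 8 neighbours: four axial points at one interval and four
    # diagonal points at interval*sqrt(2), as noted in the text.
    offs = interval * np.array([(-1, -1), (-1, 0), (-1, 1), (0, -1),
                                (0, 1), (1, -1), (1, 0), (1, 1)], float)
    n = len(offs)
    # Augmented system: semivariances between the neighbours, plus one
    # extra row and column carrying the unbiasedness constraint.
    A = np.ones((n + 1, n + 1))
    A[-1, -1] = 0.0
    for i in range(n):
        for j in range(n):
            A[i, j] = gamma(np.linalg.norm(offs[i] - offs[j]))
    rhs = np.ones(n + 1)
    rhs[:n] = [gamma(np.linalg.norm(p)) for p in offs]
    return np.linalg.solve(A, rhs)[:n]   # drop the Lagrange multiplier

By symmetry, the solution contains only two distinct weight values, one shared by the axial neighbours and one by the diagonal ones, which is why only two coefficients need to be handled in practice.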
The error detection method has been developed in the context of the Arc/Info® Geographical Information System, using the AML (Arc Macro Language) computing language (Environmental Systems Research Institute, 1992). Though this language is not universally known, it has some advantages, especially its implicitly recursive nature and, therefore, its suitability for matrix data structures; the Geographical Information System that uses it has also become enormously popular among research groups. The AML program has been developed following the basic principles of cartographic modelling put forward by Tomlin (1990), where complete references to its foundations may be found. In this context, the program carries out the test on grid data structures (the matrix DEMs used by the GIS), within the module of the same name. If the DEM is named, for instance, dem, the basic commands perform the operations described step by step below.
The first command constructs a matrix named mddif containing the difference between each point of the DEM and the mean altitude of its four closest neighbours, thus applying Eqs. (1) and (2). In AML the command is implicitly recursive and sequentially affects all elements of the DEM. In a generic language, it is equivalent to the following line (the operation being repeated for each DEM element):
mddif(row,col) ← dem(row,col) - (dem(row-1,col) + dem(row,col-1) + dem(row,col+1) + dem(row+1,col))/4
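In terms of the numpy sketch given earlier (a hypothetical helper covering interior cells only), this is the same quantity up to sign, which is irrelevant because only the absolute value of t is used later:

mddif = -interpolation_differences(dem)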
Command 2 executes a statistical routine that calculates the arithmetic mean and the standard deviation of the mddif values, assigning them to two predefined system variables (grd$mean and grd$stdv) and thereby executing Eqs. (3) and (4). In a different programming language, this command should be replaced by suitable subroutines for both statistical calculations, such as, for instance, those given by Lebart et al. (1985). In the 3rd and 4th steps, the mean and standard deviation values are stored in the zmean and zdesv variables for use in command 5, which performs a new operation on the mddif matrix, calculating the matrix of absolute values of the t statistic by applying Eq. (5). In a generic language, it is equivalent to the following line:
valts(row,col) ← abs((mddif(row,col) - zmean)/zdesv)
Command 6 generates a matrix named demtmp, similar to the original DEM, but in which the probably wrong data, identified by the error condition (t > 10), are eliminated by the setnull command, which assigns them a special predefined value (called nodata). This flag value excludes an element from arithmetic operations. Its equivalent in a generic programming language is:
if valts(row,col) > 10 then
    demtmp(row,col) ← nodata
else
    demtmp(row,col) ← dem(row,col)
where "nodata" can be equivalent to a very unusual value such as, for example, 9999. Finally, command 7 executes directly the error correction by generating the corrected DEM (demc). This command makes use of a conditional statement that modifies the erroneous values of the matrix, substituting them by the arithmetic mean of their closest neighbours (focalmean function). In a programming generic language, it could adopt a form similar to:
if valts(row,col) > 10 then
begin
    focalmean(demtmp, value)
    demc(row,col) ← value
end
else demc(row,col) ← demtmp(row,col)
The focalmean function must be defined so that it executes the following operation for each element to be corrected:
define focalmean(demtmp, value)
    total ← 0
    ndata ← 0
    for r = -1 to 1
        for c = -1 to 1
            if demtmp(row+r,col+c) <> nodata then
            begin
                total ← total + demtmp(row+r,col+c)
                ndata ← ndata + 1
            end
        end {c}
    end {r}
    value ← total/ndata
end {define}
This routine calculates the arithmetic mean of the neighbours of a DEM element, excluding null values (identified by the nodata flag; the problem point itself holds nodata in demtmp, so it is automatically left out), and returns the estimate through the value variable.
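For readers outside the Arc/Info environment, the whole correction stage (commands 6 and 7 together with the focalmean routine) can be reproduced as follows. This Python fragment is a sketch of ours, not the original program: the boolean suspect matrix stands in for the t > 10 condition and plays the role of nodata, and leaving a cell unchanged when no valid neighbour exists is our assumption:

import numpy as np

def correct_dem(dem, suspect):
    # Replace each suspect cell by the arithmetic mean of its
    # non-suspect 3x3 neighbours, mirroring the focalmean routine.
    z = np.asarray(dem, dtype=float)
    demc = z.copy()
    rows, cols = z.shape
    for r, c in zip(*np.nonzero(suspect)):
        total, ndata = 0.0, 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and not suspect[rr, cc]:
                    total += z[rr, cc]   # suspect cells act as nodata
                    ndata += 1
        if ndata > 0:   # our assumption: skip cells with no valid neighbour
            demc[r, c] = total / ndata
    return demc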
The method proposed in this paper allows the automatic detection and correction of anomalous values in matrix DEMs using probabilistic criteria that adapt to the morphological characteristics of each area. The method has already been tested in our laboratory on digital models, both commercial and home-made, and has proved a useful tool for refining them, being much more effective than mere visual inspection. Its integration in the Arc/Info environment enables us to run it systematically within one of the most widespread Geographical Information Systems currently in use.
Chrisman, N.R., 1991. The error component in spatial data. In: D.J. Maguire, M.F. Goodchild and D.W. Rhind (Editors), Geographical Information Systems. Principles and Applications. Longman, London, Vol. 1, pp. 165-174.
Davis, J.C., 1973. Statistics and Data Analysis in Geology. Kansas Geological Survey. John Wiley and Sons, New York, N.Y., 550 pp.
Environmental Systems Research Institute, 1991. AML Users Guide. Applications programming language reference. Redlands, Calif.
Hannah, M.J., 1981. Error detection and correction in Digital Terrain Models. Photogramm. Eng. Remote Sensing, 47(1): 63-69.
Lebart, L., Morineau, A. and Fénelon, J.-P., 1985. Tratamiento estadístico de datos. Métodos y programas. Marcombo, Barcelona, 525 pp. (Original title: Traitement des données statistiques. Méthodes et programmes. Bordas, Paris.)
Nagao, M., Mukai, Y., Ayabe, K., Arai, K. and Nakazawa, T., 1988. A study of reducing abnormal elevations in automatic computations of elevations from satellite data. Arch. Photogramm. Remote Sensing, 27 (B4): 280-288.
Oliver, M.A. and Webster, R., 1990. Kriging: a method of interpolation for geographical information systems. Int. J. Geogr. Inf. Syst., 4(3): 313-332.
Royle, A.G., Clausen, F.L. and Frederiksen, P., 1981. Practical universal kriging and automatic contouring. GeoProcessing, 1: 377-394.
Tomlin, C.D., 1990. Geographic Information Systems and Cartographic Modeling. Prentice-Hall, Englewood Cliffs, N.J., 249 pp.
Veregin, H., 1989. A taxonomy of error in spatial databases. NCGIA, National Center for Geographic Information and Analysis, Santa Barbara, Calif., Tech. Pap. 89-12, 115 pp.
Weibel, R. and Heller, M., 1991. Digital terrain modelling. In: D.J. Maguire, M.F. Goodchild and D.W. Rhind (Editors), Geographical Information Systems. Principles and Applications. Longman, London, Vol. 2, pp. 269-297.