Geostatistics Workshop
Discussion
Using just forested plots
The sampling intensity of the FIA plots plays
a large part in determining what kinds of information these data
pick up and can depict accurately in space.
For species distributions, the FIA plots frequently do appear
to pick up a large part of the spatial variation present (illus-variograms,
illus-table). However,
forest/nonforest cover changes over very short
distances in the northeast, largely as a result of current and past
land-use history, and the FIA plots rarely pick up much information about
its spatial structure (illus). The result
is a flat, pure-nugget variogram or correlogram, because the spatial
continuity that is there occurs over distances smaller than the
spacing between plots. As a result, in this study we model and interpolate
the variables using only the forested plots, and use the spatial continuity
between forested areas, without any reference to the nonforest areas,
which only add noise to the data. The occurrence
and spatial distribution of nonforest areas are better obtained from
other sources, such as those derived from AVHRR or TM, like the
GAP, MRLC (now NLCD), or the SO-FIA maps. Any of these sources,
with varying degrees of accuracy, can be used to mask out the nonforest
areas in your final map.
Looking for trends, interesting patterns, average point spacings
Normal-scoring
Normal-scoring the data involves applying a one-to-one,
invertible transform to the original data to convert it from its
(probably skewed) distribution to a standard normal
distribution with a mean of 0 and a variance of 1. In order
to do this, any duplicate values (as opposed to duplicate locations,
which were removed earlier in the error-checking step) must be dealt
with, i.e. each value must be made unique so that it can be
translated unambiguously both forward into the normal-score transform
and backward into the original data values again. In the case
of %ba/acre values, and most other forest variables we'll
be interested in, the biggest example of this is all the 0's, where
the plot is forested but that particular species or variable does
not occur. The goal is to add to each duplicate a value large enough
to differentiate it from the others, but small
enough not to make any appreciable difference to its
interpretation as 0, even if you have a large number of duplicates.
In this case, we have been adding multiples of the extremely tiny value .00000001
to the 0 values. This is sufficiently small that even when
you have 5000 0's, the first will be getting .00000001 added to
it, while the 4999th will be getting 4999 x .00000001 (about .00005) added to it,
which is still essentially 0. The two procedures we have used
to order the duplicates are:
- a) sort the duplicate values in a random order, and add
the tiny distinguishing value to each in that order.
- b) in a slightly more sophisticated manner, sort the duplicate
values by the magnitude of the values of their neighbors, and
add the .00000001's to the duplicates in that order (there
is a Fortran routine, rankdupe.f, available for doing this).
We have used both options, and there really isn't any difference
between the two in the final output, so using option a), which involves
only some manipulation in Excel, is probably just fine.
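As a rough illustration of option a), the Python sketch below breaks ties in random order by adding successive multiples of .00000001 and then maps the ranks of the now-unique values to standard normal quantiles. The function name normal_score, the eps and seed parameters, and the (rank + 0.5)/n plotting position are assumptions made for this example; they are not taken from rankdupe.f or from any particular software's normal-score routine.

```python
import random
from statistics import NormalDist


def normal_score(values, eps=1e-8, seed=0):
    """Despike duplicate values in random order, then normal-score them.

    Returns (despiked, scores), both in the original order of `values`.
    Assumes eps is small enough that a despiked value never collides
    with another genuine data value.
    """
    rng = random.Random(seed)

    # Group record indices by value so duplicates can be found.
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(v, []).append(i)

    # Option a): shuffle each group of duplicates and add 1*eps, 2*eps, ...
    despiked = list(values)
    for v, idx in groups.items():
        if len(idx) > 1:
            rng.shuffle(idx)
            for k, i in enumerate(idx, start=1):
                despiked[i] = v + k * eps

    # Rank the unique values and map each rank to a standard normal quantile.
    n = len(despiked)
    order = sorted(range(n), key=lambda i: despiked[i])
    std_normal = NormalDist()  # mean 0, variance 1
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return despiked, scores
```

Keeping the despiked values alongside their scores gives the lookup table needed to translate scores back into original data values later.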
Modeling the variogram/correlogram calculated from the normal-scored
data is a necessary step for running the Sequential Gaussian Conditional
Simulation (SGCS). Checking the normal-scored variogram
can also be useful when just exploring the data, because it can reveal
spatial structure that is otherwise hidden in the variogram
of the original data by the strong univariate characteristics
(generally the very large skew in the distribution) of that data.
Some examples of this are illus
or illus.
If this is the case, you may want to consider using normal-scored data
even when you are doing ordinary kriging (OK). Some cautions apply, since you
will be backtransforming only the final mean estimate at the end
(unlike the SGCS, which backtransforms each realization *before*
the summary stats are calculated), but the cost may be worth the
gain in some cases.
Dividing the area into separate regions
- When to do this: When exploratory data analysis indicates
the existence of several populations with significantly different
statistics, one should then consider the possibility of subdividing
the area into more homogeneous subzones, each modeled and interpolated
separately (Deutsch and Journel, 1998, p. 71). This is
advisable primarily because you want to make the model as appropriate
as possible to the data in each local area. If the two areas
are substantially different, then the variogram, and thus the
model, will be just an average of the spatial continuity in
the two areas and not entirely appropriate to either.
Dividing the data into separate regions can also make
the assumption of stationarity more reasonable (see the Stationarity section below).
- Using ecoregions or hand-drawn areas.
- Standard areas such as ecoregions are handy in that they
typically (or hopefully) have some relevance to the ecosystems
on the ground and thus to the spatial distributions of forest
variables such as species relative importance. They
already exist, so you just have to call them up as a coverage/layer,
and you would use the same regions for all species.
They are also already well-defined and well-documented, and
thus completely objective when used this way. Their
disadvantage is that they may not relate well
to the particular variable of interest you are looking at.
- Hand-drawn areas require you to take a look at the distribution
of the data in a map (as you should be doing anyway) and draw polygons
around those areas/populations that appear to be different.
Such hand-drawn areas are thus more subjective, and
they are very species-specific, requiring you to redraw different
areas for each species.
- In either case, first check that each area contains enough
points to calculate a realistic variogram/correlogram. Then take a look
at the variogram/correlogram in each area to determine whether
they are different enough to warrant the extra effort of modeling
and interpolating them separately (doing two
subareas roughly doubles the work over one).
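The two checks just described (enough points per area, then a look at each area's variogram) could be sketched in Python roughly as follows. This is only an illustrative, omnidirectional experimental semivariogram; the function names, the (region, x, y, value) record layout, and the min_points threshold are assumptions for the example, and the threshold has to be chosen by the analyst.

```python
import math
from collections import defaultdict


def semivariogram(points, lag_width, n_lags):
    """Omnidirectional experimental semivariogram.

    points: list of (x, y, value).
    Returns [(mean_distance, gamma, n_pairs), ...], skipping empty lags.
    """
    sq_diff = [0.0] * n_lags
    dist = [0.0] * n_lags
    count = [0] * n_lags
    for i in range(len(points)):
        xi, yi, vi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, vj = points[j]
            h = math.hypot(xi - xj, yi - yj)
            k = int(h // lag_width)  # index of the distance class
            if k < n_lags:
                sq_diff[k] += (vi - vj) ** 2
                dist[k] += h
                count[k] += 1
    return [(dist[k] / count[k], sq_diff[k] / (2 * count[k]), count[k])
            for k in range(n_lags) if count[k] > 0]


def variograms_by_region(records, lag_width, n_lags, min_points):
    """records: iterable of (region, x, y, value).

    Regions with fewer than min_points plots get None instead of a
    variogram, flagging them as too sparse to model on their own.
    """
    by_region = defaultdict(list)
    for region, x, y, v in records:
        by_region[region].append((x, y, v))
    return {region: (semivariogram(pts, lag_width, n_lags)
                     if len(pts) >= min_points else None)
            for region, pts in by_region.items()}
```

Plotting the per-region results side by side is then a quick way to judge whether the regions differ enough to be modeled separately.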
Use of the correlogram over the variogram
Stationarity
Stationarity comes up as a condition/assumption
for many geostatistical interpolation procedures. Basically,
it describes the situation in which the correlation between points
depends only on the separation distance between them and not on
their locations (Isaaks and Srivastava, 1989, p. 221). The presence
of a long-distance structure in your variogram/correlogram is an
indication of a trend, which *can be* an indication of a lack of
stationarity. Differences between the variogram and the
correlogram (autocorrelation) can also indicate a lack of stationarity.
A severe lack of stationarity is probably worthy of special attention.
However, the stationarity assumption is primarily important within
the search radius (Isaaks and Srivastava, 1989, p. 530), and with
FIA data we are typically working with a large number of ground
data points (often called the conditioning data). In that case
we have enough data that our effective search area (determined
by the radius and/or the max number of points set in the
parameter file) is limited to a 'local' area over which there
isn't any trend (Verly, 1993), i.e. it is 'locally stationary'.
In addition, separating the population into smaller, more homogeneous
units (as recommended above, when time is available) and modeling
and interpolating these separately will also substantially reduce
any lack of stationarity.
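One possible way to look at the difference between the variogram and the correlogram is sketched below in Python. Under second-order stationarity the two are linked by gamma(h) = sigma^2 x (1 - rho(h)), so the standardized semivariogram gamma(h)/sigma^2 and 1 - rho(h) should roughly agree at every lag; a gap that grows with distance suggests that the local mean or variance is changing. Here the correlogram uses lag-specific means and variances, the function names and the gap measure are assumptions for this example, and the result is an exploratory diagnostic rather than a formal test.

```python
import math


def lag_statistics(points, lag_width, n_lags):
    """Per-lag semivariogram and correlogram for (x, y, value) points.

    The correlogram is standardized by the mean and variance of the
    values that actually fall in each lag, not by the global ones.
    """
    pairs = [[] for _ in range(n_lags)]
    for i in range(len(points)):
        xi, yi, vi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, vj = points[j]
            k = int(math.hypot(xi - xj, yi - yj) // lag_width)
            if k < n_lags:
                # Store each pair in both directions so that head and
                # tail statistics coincide (omnidirectional case).
                pairs[k].append((vi, vj))
                pairs[k].append((vj, vi))
    stats = []
    for k, pk in enumerate(pairs):
        if not pk:
            continue
        n = len(pk)
        gamma = sum((a - b) ** 2 for a, b in pk) / (2 * n)
        mean = sum(a for a, _ in pk) / n
        var = sum((a - mean) ** 2 for a, _ in pk) / n
        cov = sum((a - mean) * (b - mean) for a, b in pk) / n
        rho = cov / var if var > 0 else float("nan")
        stats.append({"lag": k, "pairs": n // 2, "gamma": gamma, "rho": rho})
    return stats


def standardized_gap(points, lag_width, n_lags):
    """Return [(lag, gamma/sigma^2 - (1 - rho)), ...] per non-empty lag.

    sigma^2 is the global variance of the values (assumed non-zero).
    """
    values = [v for _, _, v in points]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(s["lag"], s["gamma"] / var - (1.0 - s["rho"]))
            for s in lag_statistics(points, lag_width, n_lags)]
```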
The assumption of multi-normality...
- Checking for bivariate normality of the input data
- Checking for performance of the algorithm in the output.
- What is usually done.
Creation of general datasets vs. use of the data and choice of an estimate for a specific purpose
Choosing the percentile to use as the estimate
Choosing the range (upper and lower bounds) to use as the uncertainty
The importance of uncertainty
An estimate should come with some measure of its uncertainty, ideally
at each cell or for each local area. This makes the datasets
more widely and generally useful. For example, the uncertainty
can be in the form of a probability (in this case of BF occurrence
at that location), i.e. the probability represents the certainty
that we *know* BF can be found at that particular location, or
conversely, the certainty that we know BF is not there…
Or the uncertainty can be in the form of a +/- value attached
to the estimate of BF %ba/acre values.
Our uncertainty about whether our modeled estimates are correct is important
information because it affects how much weight we give to different
sources of data/information in both our decisions and our analyses.
When we have a measure of which areas have an uncertainty that
is unacceptable for our particular use of the data, it gives us
an idea of how much we don’t know, and which areas might benefit
most from additional sampling or the use of additional ancillary
data. And in the case of datasets at coarser resolutions
(e.g. the 2km x 2km cell size), the uncertainty also includes
the variation that occurs within each summary area (i.e.
below the level of resolution being presented in the map).
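To make the two forms of uncertainty concrete, the Python sketch below summarizes the simulated values at one cell into a percentile estimate, lower and upper bounds, and a probability of occurrence. The function names, the default percentiles (50, 10 and 90), and the presence threshold of 0 are placeholders chosen for the example; which percentile and which bounds to use depends on the purpose of the map.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of an ascending list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize_cell(realizations, estimate_q=50, lower_q=10, upper_q=90,
                   presence_threshold=0.0):
    """Summarize the back-transformed simulated values at a single cell.

    p_occurrence is the fraction of realizations in which the variable
    exceeds presence_threshold at this cell.
    """
    vals = sorted(realizations)
    return {
        "estimate": percentile(vals, estimate_q),
        "lower": percentile(vals, lower_q),
        "upper": percentile(vals, upper_q),
        "p_occurrence": sum(v > presence_threshold for v in vals) / len(vals),
    }
```

Applying summarize_cell to every cell gives one map for the estimate, one for each bound, and one for the probability of occurrence.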
The importance of some sense of local spatial variability
Understanding the variability that really exists in
a landscape can also be very important information for management
and analysis. Making this variability explicit, as an additional
characteristic of the output, can be important
for the user, because a highly variable area may be treated differently
from a more homogeneous one, and it may have different implications
for the types of management that are appropriate under those conditions.
How much variability is desired in an output dataset depends in
part on the intended use of the data, the output resolution of
the interpolated/estimated/simulated dataset, and the inherent
variability in the data itself. For example, in a single
realization it may actually be harder to see the overall spatial
patterns in the landscape because there is so much local variability
that our eyes cannot pick up the larger patterns. In this
case, if we are interested in the overall pattern, blurring the
data somewhat, as in a moving window average, can bring out the
larger pattern, of course at the expense of local variability…
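A moving window average of a gridded realization can be sketched in a few lines of Python. The grid is assumed to be a list of rows, with None marking masked (e.g. nonforest) cells; the function name and the square window of half-width radius are choices made for this example only.

```python
def moving_window_average(grid, radius=1):
    """Smooth a 2-D grid with a square (2*radius + 1) moving window.

    Masked cells (None) are left out of every average and stay masked
    in the output. Windows are truncated at the edges of the grid.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None:
                continue
            window = [grid[rr][cc]
                      for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                      if grid[rr][cc] is not None]
            out[r][c] = sum(window) / len(window)
    return out
```

A larger radius brings out broader patterns but removes more of the local variability.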
illustration: the raw MSN output vs. the same after a moving
window average
or illustration: a single realization vs. the percentile ‘estimate’
comparison of the results of simulation vs. OK vs. MWA...
the opportunities and limitations of this dataset
when to use simpler methods or more sophisticated methods...,
illus