USDA Forest Service
 Northeastern Forest Inventory & Analysis

Contact: Andrew Lister

 

Forest Inventory & Analysis Program
11 Campus Blvd.
Suite 200
Newtown Square, PA 19073-3294

(610)557-4075
(610)557-4250 FAX
(610)557-4132 TTY/TDD

 

GIS /Spatial Statistics

Geostatistics Workshop

Discussion

Using just forested plots
    The sampling intensity of the FIA plots largely determines what kinds of information these data capture and can depict accurately in space.  For species distributions, the FIA plots frequently do appear to capture a large part of the spatial variation present (illus-variograms, illus-table).  Forest/nonforest cover, however, changes over very short distances in the Northeast, largely as a result of current and past land-use history, and the FIA plots rarely capture much information about its spatial structure (illus).  The result is a flat, pure-nugget variogram or correlogram, because the spatial continuity that exists occurs over distances smaller than the plot spacing.  In this study we therefore model and interpolate the variables only in forested areas, using the spatial continuity between forested areas without any reference to the nonforest areas, which only add noise to the data.  That is, only forested plots are used in the modeling and interpolation.  The occurrence and spatial distribution of nonforest areas is better obtained from other sources, such as maps derived from AVHRR or TM imagery like the GAP, MRLC (now NLCD), or SO-FIA maps.  Any of these sources, with varying degrees of accuracy, can be used to mask out the nonforest areas in your final map.
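
    The two steps described above (keeping only forested plots, then masking nonforest cells in the final map) might be sketched as follows.  This is only an illustration in plain Python: the "forested" key, the list-of-rows grid layout, and both function names are assumptions made for the example, not part of the workshop materials.

```python
def forested_only(plots):
    """Keep only plots flagged as forested (the 'forested' key is hypothetical)."""
    return [p for p in plots if p["forested"]]


def apply_forest_mask(estimates, forest_mask, nodata=None):
    """Blank out nonforest cells in an interpolated grid.

    estimates and forest_mask are lists of rows with the same shape;
    forest_mask holds True for forest cells (e.g. taken from a land-cover map).
    """
    return [
        [est if is_forest else nodata for est, is_forest in zip(est_row, mask_row)]
        for est_row, mask_row in zip(estimates, forest_mask)
    ]
```

    The mask itself would come from whichever land-cover source you trust most for your area, resampled to the same cell size as the interpolated map.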

Looking for trends, interesting patterns, average point spacings

Normal-scoring
    Normal-scoring the data involves applying a one-to-one, invertible transform that converts the original data from their (probably skewed) distribution to a standard normal distribution with a mean of 0 and a variance of 1.  To do this, any duplicate values (as opposed to duplicate locations, which were removed earlier in the error-checking step) must be dealt with: each value must be made unique so that it can be translated unambiguously both forward into normal scores and backward into the original data values.  For %ba/acre values, and for most other forest variables we'll be interested in, the biggest source of duplicates is the 0's, where the plot is forested but that particular species or variable does not occur.  The goal is to add a value large enough to distinguish each duplicate from the others, but small enough that it makes no appreciable difference to its interpretation as 0, even when there are a large number of duplicates.  In this case, we have been adding multiples of the extremely tiny value .00000001 to the 0 values.  This is sufficiently small that even with 5000 0's, the first gets .00000001 added to it while the 4999th gets 4999 x .00000001 added to it, which is still essentially 0.  The two procedures we have used to do this are:

  • a)  sort the duplicate values in a random order, and add the tiny distinguishing value to each in that order.
     
  • b)  in a slightly more sophisticated manner, sort the duplicate values by the magnitude of the values of their neighbors, and add the .00000001's to the duplicates in that order (a Fortran routine, rankdupe.f, is available for doing this).

We have used both options and found no real difference between the two in the final output, so option a), which involves only some manipulation in Excel, is probably just fine.
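
For readers who would rather script option a) than do it in Excel, here is a minimal sketch in plain Python.  It is not the rankdupe.f routine and not necessarily the exact transform used in this study: the function names, the fixed random seed, and the (rank - 0.5)/n plotting position are all assumptions made for illustration.

```python
import random
from bisect import bisect_left
from statistics import NormalDist


def break_ties_randomly(values, eps=1e-8, seed=0):
    """Option a): add eps, 2*eps, ... to duplicate values, taken in random order."""
    rng = random.Random(seed)
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(v, []).append(i)
    out = list(values)
    for v, idxs in groups.items():
        if len(idxs) > 1:
            rng.shuffle(idxs)
            for k, i in enumerate(idxs, start=1):
                out[i] = v + k * eps
    return out


def normal_scores(values):
    """Rank-based normal scores; assumes the values are already unique.

    Returns (scores, table), where table holds (original value, score)
    pairs sorted by value, for use when back-transforming.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    nd = NormalDist()  # standard normal: mean 0, variance 1
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = nd.inv_cdf((rank - 0.5) / n)  # assumed plotting position
    table = [(values[i], scores[i]) for i in order]
    return scores, table


def back_transform(score, table):
    """Map a normal score back to data units by linear interpolation in the table."""
    table_scores = [s for _, s in table]
    k = bisect_left(table_scores, score)
    if k == 0:
        return table[0][0]
    if k == len(table):
        return table[-1][0]
    (v0, s0), (v1, s1) = table[k - 1], table[k]
    return v0 + (v1 - v0) * (score - s0) / (s1 - s0)
```

A typical call would be unique = break_ties_randomly(ba_values) followed by scores, table = normal_scores(unique).  Scores outside the range of the table are simply clamped to the smallest or largest data value here; how the tails are treated is a modeling choice this sketch does not address.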

Modeling the variogram/correlogram calculated from the normal-scored data is a necessary step for running the Sequential Gaussian Conditional Simulation (SGCS).  Examining the normal-scored variogram can also be useful when just exploring the data, because it can reveal spatial structure that is hidden in the variogram of the original data by the strong univariate characteristics of those data (generally the very large skew in the distribution).  Some examples of this are illus or illus.  If this is the case, you may want to consider using normal-scored data even when you are doing ordinary kriging (OK).  Some cautions apply, since you will be backtransforming only the final mean estimate at the end (unlike the SGCS, which backtransforms each realization *before* the summary stats are calculated), but the cost may be worth the gain in some cases.
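
As a rough illustration of what "calculating the variogram" involves, the sketch below computes an omnidirectional experimental semivariogram from (x, y, z) triples, where z could be either the original values or their normal scores.  It is a plain-Python sketch with assumed names and a brute-force pair loop, not the software used in the workshop; lag tolerances and directional variograms are left out.

```python
import math


def experimental_variogram(points, lag_width, n_lags):
    """Omnidirectional experimental semivariogram.

    points is a list of (x, y, z) tuples.  Pairs are binned by separation
    distance into lags [k*lag_width, (k+1)*lag_width).  Returns a list of
    (mean pair distance, semivariance, pair count) for each non-empty lag.
    """
    gamma_sum = [0.0] * n_lags
    dist_sum = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            h = math.hypot(xi - xj, yi - yj)
            k = int(h // lag_width)
            if k < n_lags:
                gamma_sum[k] += 0.5 * (zi - zj) ** 2
                dist_sum[k] += h
                counts[k] += 1
    return [
        (dist_sum[k] / counts[k], gamma_sum[k] / counts[k], counts[k])
        for k in range(n_lags)
        if counts[k] > 0
    ]
```

Running this once on the raw values and once on the normal scores, and plotting semivariance against distance for each, is one way to see whether the transform brings out structure that the skewed data hide.  The pair counts are returned so that lags with too few pairs can be treated with suspicion.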

Dividing the area into separate regions

  • When to do this:  When exploratory data analysis indicates the existence of several populations with significantly different statistics, one should consider subdividing the area into more homogeneous subzones, each modeled and interpolated separately (Deutsch and Journel, 1998, p. 71).  This is advisable primarily because you want the model to be as appropriate as possible to the data in each local area.  If two areas are substantially different, the variogram, and thus the model, will be just an average of the spatial continuity in the two areas and so not entirely appropriate to either.  Dividing the data into separate regions can also make the assumption of stationarity more reasonable (see also the Stationarity section below).

    • Making sure they have enough plots in each
                 
    • Determining if they are different enough to warrant the extra effort of modeling and interpolating them separately. 

  • Using ecoregions or hand-drawn areas.
    • Standard areas such as ecoregions are handy in that they typically/hopefully have some relevance to the ecosystems on the ground and thus the spatial distributions of forest variables such as species relative importance.  They already exist, so you just have to call them up as a coverage/layer, and you would use the same regions for all species.  They are also already well-defined and well-documented, and thus completely objective in this use of them.  Their disadvantage is that they may not actually perfectly relate to the particular variable of interest you are looking at.
    • Hand-drawn areas require you to look at the distribution of the data in a map (as you should be doing anyway) and draw polygons around those areas/populations that appear to be different.  Such hand-drawn areas are thus more subjective.  They are also very species-specific, requiring you to draw different areas for each species.  In either case, first check whether each area contains enough points to calculate a realistic variogram/correlogram.  Then look at the variogram/correlogram in each area to determine whether they are different enough to warrant the extra effort of modeling and interpolating them separately (doing two subareas roughly doubles the work over one).
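
The first check above (are there enough plots in each candidate region, and do the regions look different?) can be roughed out with a few summary statistics before any variograms are computed.  The sketch below is plain Python with assumed names; in particular, the min_plots default is a placeholder, not a recommended threshold.

```python
from statistics import mean, pvariance


def summarize_regions(plots, min_plots=100):
    """Per-region plot count, mean and variance of the variable of interest.

    plots is an iterable of (region_label, value) pairs.
    min_plots is an illustrative cutoff only; choose your own.
    """
    groups = {}
    for region, value in plots:
        groups.setdefault(region, []).append(value)
    return {
        region: {
            "n": len(vals),
            "mean": mean(vals),
            "variance": pvariance(vals),
            "enough_plots": len(vals) >= min_plots,
        }
        for region, vals in groups.items()
    }
```

Similar means and variances do not prove that the spatial continuity is the same in each region, so the per-region variograms/correlograms still need to be compared.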

Use of the correlogram over the variogram

Stationarity
    Stationarity comes up as a condition/assumption for many geostatistical interpolation procedures.  Basically, it describes the situation in which the correlation between points depends only on the separation distance between them and not on their locations (Isaaks and Srivastava, 1989, p. 221).  The presence of a long-distance structure in your variogram/correlogram is an indication of a trend, which *can be* an indication of a lack of stationarity.  Differences between the variogram and the correlogram (autocorrelation) can also indicate a lack of stationarity.  A severe lack of stationarity is probably worthy of special attention.  However, the stationarity assumption is primarily important within the search radius (Isaaks and Srivastava, 1989, p. 530).  With FIA data we are typically working with a large number of ground data points (often called the conditioning data), so our effective search area (determined by the radius and/or the maximum number of points set in the parameter file) is limited to a 'local' area over which there isn't any trend (Verly, 1993), i.e. the data are 'locally stationary'.  In addition, separating the population into smaller, more homogeneous units and modeling and interpolating these separately, as recommended above when time is available, will also substantially reduce the problem.
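
    To make the idea of an "effective search area" concrete, the sketch below selects the conditioning data for one target location using both a search radius and a maximum number of points.  It is a brute-force illustration in plain Python with assumed names, not the search strategy of any particular geostatistics package.

```python
import math


def search_neighbors(target, points, radius, max_points):
    """Return up to max_points conditioning data within radius of target.

    target is (x, y); points is a list of (x, y, z).  The result is a list
    of (distance, x, y, z) tuples, nearest first.
    """
    tx, ty = target
    within = []
    for x, y, z in points:
        d = math.hypot(x - tx, y - ty)
        if d <= radius:
            within.append((d, x, y, z))
    within.sort(key=lambda item: item[0])
    return within[:max_points]
```

    With dense data, the max_points limit is usually reached well inside the radius, so the distance to the farthest point actually used gives a sense of how 'local' the neighborhood really is.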

The assumption of multi-normality...

  • Checking for bivariate normality of the input data
  • Checking for performance of the algorithm in the output.
  • What is usually done.

Creation of a general dataset vs. use of the data and choice of an estimate for a specific purpose

Choosing the percentile to use as the estimate

Choosing the range (upper and lower bounds) to use as the uncertainty

The importance of uncertainty

    A useful dataset includes not only an estimate but also some measure of the uncertainty of that estimate, ideally at each cell or for each local area.  This makes the datasets more widely and generally useful.  For example, the uncertainty can be in the form of a probability (in this case, of BF occurrence at that location), i.e. the probability represents the certainty that we *know* BF can be found at that particular location or, conversely, the certainty that we know BF is not there.  Or the uncertainty can be in the form of a +/- value attached to the estimate of BF %ba/acre values.
    Our uncertainty about whether our modeled estimates are correct is important information because it affects how much weight we give to different sources of data/information in both our decisions and our analyses.  When we know which areas have an uncertainty that is unacceptable for our particular use of the data, we get an idea of how much we don't know, and of which areas might benefit most from additional sampling or the use of additional ancillary data.  And in the case of datasets at larger spatial scales (e.g. the 2km x 2km cell size), the uncertainty also includes the variation within that summary area (i.e. below the level of resolution being presented in the map).
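
    When a set of simulated realizations is available, the quantities discussed above (a percentile used as the estimate, upper and lower bounds used as the uncertainty, and a probability of occurrence) can all be read off the values simulated at a single cell.  The sketch below is plain Python; the default percentiles and the presence threshold are placeholders chosen for illustration, not the choices made in this study.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in 0..100) of an already sorted list."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def summarize_cell(realized_values, est_pct=50, lower_pct=5, upper_pct=95,
                   presence_threshold=0.0):
    """Summarize the values simulated at one cell across all realizations.

    All default arguments are illustrative assumptions.
    """
    vals = sorted(realized_values)
    return {
        "estimate": percentile(vals, est_pct),
        "lower": percentile(vals, lower_pct),
        "upper": percentile(vals, upper_pct),
        # share of realizations in which the variable exceeds the threshold
        "prob_present": sum(v > presence_threshold for v in vals) / len(vals),
    }
```

    The width of the interval between "lower" and "upper" gives the +/- style of uncertainty, while "prob_present" gives the probability style; mapping either one alongside the estimate shows where the estimate is least trustworthy.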

The importance of some sense of local spatial variability

    Understanding the variability that really exists in a landscape can also be very important information for management and analysis.  Making this variability explicit, as an additional characteristic of the output, can be valuable to the user, because a highly variable area may be treated differently from a more homogeneous one, and it may have different implications for the types of management that are appropriate under those conditions.  How much variability is desired in an output dataset depends in part on the intended use of the data, the output resolution of the interpolated/estimated/simulated dataset, and the inherent variability in the data itself.  For example, in a single realization it may actually be harder to see the overall spatial patterns in the landscape, because there is so much local variability that our eyes cannot pick up the larger patterns.  In this case, if we are interested in the overall pattern, blurring the data somewhat, as in a moving-window average, can bring out the larger pattern, of course at the expense of local variability.
    illustration:  the raw MSN output vs. the same after a moving window average
    ? or:        a single realization vs. the percentile ‘estimate’    -- illustration
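
    A moving-window average of the kind mentioned above can be sketched in a few lines of plain Python.  The list-of-rows grid, the use of None for no-data (e.g. masked nonforest) cells, and the square window are assumptions made for this illustration.

```python
def moving_window_average(grid, half_width=1):
    """Smooth a grid with a square moving-window mean.

    grid is a list of rows; None marks no-data cells, which are skipped
    in the averages and stay None in the output.  The window is
    (2*half_width + 1) cells on a side and is truncated at the grid edges.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[None] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if grid[r][c] is None:
                continue
            total, count = 0.0, 0
            for rr in range(max(0, r - half_width), min(n_rows, r + half_width + 1)):
                for cc in range(max(0, c - half_width), min(n_cols, c + half_width + 1)):
                    value = grid[rr][cc]
                    if value is not None:
                        total += value
                        count += 1
            out[r][c] = total / count
    return out
```

    A larger half_width brings out broader patterns but removes more of the local variability, which is exactly the trade-off described above.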

comparison of the results of simulation vs. OK vs. MWA...
the opportunities and limitations of this dataset
when to use simpler methods or more sophisticated methods..., illus