Areal Weighted Interpolation

Christopher Prener, Ph.D.

2018-12-20

Areal weighted interpolation is a technique for estimating the values for overlapping but incongruent polygon features. This article describes the areal package’s approach to areal weighted interpolation. After providing a quick introduction to the technique, the options for aw_interpolate() are discussed and an example of interpolating data using the manual workflow is provided.

Introduction to Areal Weighted Interpolation

Areal weighted interpolation is the simplest approach to estimating population values for overlapping polygons. It makes a significant and important assumption - that individuals are spread out evenly within the source features. This assumption quickly breaks down in the real world - areas that have commercial developments mixed in with residential housing, for example, or neighborhoods with a large city park. We do not always have access to this type of contextual data, however, and so areal weighted interpolation remains a popular choice.

Areal weighted interpolation is a multi-step process. The areal package contains a number of example data sets that can be used to illustrate this process:

library(areal)

# load data into enviornment
race <- ar_stl_race                 # census tract population estimates
asthma <- ar_stl_asthma             # census tract asthma rate estimates
wards <- ar_stl_wards               # political boundaries
wardsClipped <- ar_stl_wardsClipped # political boundaries clipped to river

The boundaries for the race and asthma the data are the same - census tracts. When mapped, this is what the census tract and ward features look like:

Step 1: Intersection

The first step with areal weighted interpolation is to intersect the data. Imagine one shapefile (we’ll call this the “target”) acting as a cookie cutter - subdividing the features of the other (which we’ll call the “source”) based on areas of overlap such that only those overlapping areas remain (this is important - if these shapefiles do not cover identical areas, those areas only covered by one shapefile will be lost). The number of new features created is entirely dependent on the shapes of the features in the source and target data sets:

By intersecting these two data sets, we get a new data set with n = 287 features. The resulting sf object looks like so:

One by-product of the intersection process is that each intersected feature takes on the attributes of both the source and target data. The population value of interest from each source feature (for example, total population per tract or TOTAL_E), therefore exists as an attribute for each intersected feature. The identification numbers from both the source (GEOID) and the target data (WARD) are also applied:

First Four Rows of Intersected Data
GEOID TOTAL_E WARD
29510101100 2510 11
29510101100 2510 12
29510101200 3545 12
29510101200 3545 13

Step 2: Areal Weights

We then calculate an areal weight for each intersected feature. Let:

\[ {W}_{i} = \frac { {A}_{i} }{ {A}_{j} } \]

Since \({A}_{j}\) is calculated using the source identification number, the first two observations from table above with the first four rows of intersected data would have the same value for \({A}_{j}\), and the second two observations would also share the same \({A}_{j}\). The resulting values for \({W}_{i}\) would therefore be:

First Four Rows of Intersected Data
GEOID TOTAL_E WARD Ai Aj Wi
29510101100 2510 11 355702.9 1257034 0.282970
29510101100 2510 12 901331.1 1257034 0.717030
29510101200 3545 12 875554.7 1084167 0.807583
29510101200 3545 13 208612.1 1084167 0.192417

Step 3: Estimate Population

Next, we need to estimate the share of the population value that occupies the intersected feature. Let:

\[ {E}_{i} = {V}_{j}*{W}_{i} \]

Using our sample data, we therefore multiply the value (TOTAL_E) by the weight (Wi) to produce our EST estimate column:

First Four Rows of Intersected Data
GEOID TOTAL_E WARD Ai Aj Wi EST
29510101100 2510 11 355702.9 1257034 0.282970 710.2547
29510101100 2510 12 901331.1 1257034 0.717030 1799.7450
29510101200 3545 12 875554.7 1084167 0.807583 2862.8820
29510101200 3545 13 208612.1 1084167 0.192417 682.1182

Step 4: Summarize Data

Finally, we summarize the data based on the target identification number. Let:

\[ {G}_{k} = \sum{{E}_{ik}} \]

With our hypothetical data, the resulting table would therefore look like:

Resulting Target Data
WARD EST
11 710.2547
12 4662.6270
13 682.1182

This process is repeated for each of the n = 287 observations in the intersected data - areal weights are calculated, and the product of the areal weight the source value is summed based on the target identification number.

Extensive and Intensive Interpolations

Extensive Interpolations

The example above is a spatially extensive interpolation because it involves count data. In areal, these estimates are obtained using the aw_interpolate() function:

For spatially extensive interpolations, a list of variable names should be supplied for the argument extensive. This can be a single variable name, such as in the example above, or a vector of variable names:

This ability is a key feature of areal - iteration is built into the package by default, eliminating the need for repeated table joins after interpolations are calculated.

Calculating Weights for Extensive Interpolations

The aw_interpolate function also uses an argument weight. There are two options, "sum" and "total". Each makes a different assumption about the nature of the data and the relationship between the source and target features. For perfectly overlapping data, the distinction between these two options should not matter. In practice, however, there are often deviations in our data even between features that should be perfectly congruous.

The "sum" approach to calculating weights assumes that 100% of the source data should be divided among the target features. When \({A}_{j}\) is calculated (see previous section), it is done by taking the sum of the areas for all intersected features (\(i\)) within a given source feature (\(j\)). Let:

\[ {A}_{j} = \sum{{A}_{ij}} \]

On the other hand, the "total" approach to calculating weights assumes that, if a source feature is only covered by 99.88% of the target features, only 99.88% of the source target’s data should be allocated to target features in the interpolation. When \({A}_{j}\) is created, the actual area of source feature \(j\) is used.

Weights Example 1: Non-Overlap Due to Data Quality

In the example above, race and wards are products of two different agencies. The aw_stl_wards data is a product of the City of the St. Louis and is quite close to fully overlapping with the U.S. Census Bureau’s TIGER boundaries for the city. However, there are a number of very small deviations at the edges where the ward boundaries are smaller than the tracts (but only just so). These deviations result in small portions of census tracts not fitting into any ward.

We can see this in the weights that are used by aw_interpolate(). The aw_preview_weights() function can be used to return a preview of these areal weights.

The first tract listed above has a total estimated population of 2510. The practical impact of the weights is that only 2507.072085 individuals will be allocated to wards if the "total" approach to calculating areal weights is used. If "sum" is used, on the other hand, all 2510 individuals would be allocated to wards. In this scenario, the "sum" approach makes more sense because, while the race and ward data do not overlap in practice, they should overlap since no tracts extend out of the city’s boundaries. We therefore want to ensure that all individuals within each tract are allocated out to wards.

With spatially extensive interpolations that utilize the "sum" approach, the sum of the interpolated column should equal the sum of the original source data’s column that was interpolated. This can be verified with aw_verify():

This check does not work with the "total" approach to areal weights:

Weights Example 2: Non-Overlap Due to Differing Boundaries

We can use the aw_stl_wardsClipped data to illustrate a more extreme disparity between source and target data. The aw_stl_wardsClipped data have been modified so that the ward boundaries do not extend past the Mississippi River shoreline, which runs along the entire eastern boundary of the city. When we overlay them on the city’s census tracts, all of the census tracts on the eastern side of the city extend outwards.

The difference in weights in this example is more extreme:

Only 72.31% of tract 29510101800, for example, falls within a ward. In many American cities that lie within larger counties, tract boundaries do not stop at the municipal boundaries in a way that is similar to the difference between tracts and the clipped wards here. In this scenario, we do not want to allocate every individual into our city of interest and the "total" approach to weights is appropriate. Not using "total" would result in an over-count of individuals in our city.

If, on the other hand, we believe that all of the individuals should be allocated into wards, using "total" in this case would result in a severe under-count of individuals.

Intensive Interpolations

Spatially intensive operations are used when the data to be interpolated are a ratio. An example of these data can be found in ar_stl_asthma, which contains asthma rates for each census tract in the city. The interpolation process is very similar to the spatially extensive workflow, except with how the areal weight is calculated. Instead of using the source data’s area for reference, the target data is used in the denominator. Let:

\[ {W}_{i} = \frac { {A}_{i} }{ \sum{{A}_{ik}} } \]

Like spatially extensive interpolations that use the "sum" approach, the weights for intensive interpolations should always be equal to 1 as well.

We can calculate the intensive interpolation by specifying a variable name for the intensive argument in aw_interpolate() and omitting the extensive argument:

This gives us an estimate of the asthma rates at the ward level.

Mixed Interpolations

areal also provides support for “mixed” interpolations where both spatially extensive and intensive interpolations need to be calculated. We specify a variable name or a vector of variable names for both the intensive and extensive arguments:

Output Options

All of the above examples have created a tibble for output, but areal also supports the creation of sf objects as well:

aw_interpolate(wards, tid = WARD, source = asthma, sid = GEOID, 
               weight = "sum", output = "sf", intensive = "ASTHMA")
#> Simple feature collection with 28 features and 4 fields
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: 733361.8 ymin: 4268336 xmax: 746157.7 ymax: 4295504
#> epsg (SRID):    26915
#> proj4string:    +proj=utm +zone=15 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> First 10 features:
#>    OBJECTID WARD      AREA    ASTHMA                       geometry
#> 1         1    1  46138761 13.379577 POLYGON ((740184.2 4286431,...
#> 2         2    2 267817711 13.164879 POLYGON ((742392.1 4289178,...
#> 3         3    3  66291644 14.108107 POLYGON ((742956.1 4284113,...
#> 4         4    4  53210707 13.613989 POLYGON ((739557.6 4284080,...
#> 5         5    5  60462396 13.789757 POLYGON ((744883.8 4281632,...
#> 6         6    6  64337271 11.747642 POLYGON ((742501.6 4279976,...
#> 7         7    7 101268146  9.722839 POLYGON ((745618.6 4279867,...
#> 8         8    8  45966410  9.823627 POLYGON ((739842.8 4277724,...
#> 9         9    9  73993891 11.820622 POLYGON ((742619.4 4276734,...
#> 10       10   10  62915358  9.439929 POLYGON ((737257.7 4277050,...

Other Features of aw_interpolate

The sf option will include all of the variables that were included in the original target data. The aw_interpolate() function is pipe-able, allowing for existing tidyverse workflows to be integrated into the interpolation process. For example, if we wanted to remove the OBJECTID and AREA columns because they are not needed, this can be accomplished easily with areal and dplyr:

wards %>%
  select(-OBJECTID, -AREA) %>%
  aw_interpolate(tid = WARD, source = asthma, sid = GEOID, 
                 weight = "sum", output = "tibble", intensive = "ASTHMA")
#> # A tibble: 28 x 2
#>     WARD ASTHMA
#>  * <int>  <dbl>
#>  1     1  13.4 
#>  2     2  13.2 
#>  3     3  14.1 
#>  4     4  13.6 
#>  5     5  13.8 
#>  6     6  11.7 
#>  7     7   9.72
#>  8     8   9.82
#>  9     9  11.8 
#> 10    10   9.44
#> # ... with 18 more rows

All of the areal functions that are exported support non-standard evaluation, meaning that inputs can be either unquoted as they are above or quoted:

wards %>%
  select(-OBJECTID, -AREA) %>%
  aw_interpolate(tid = "WARD", source = asthma, sid = "GEOID", 
                 weight = "sum", output = "tibble", intensive = "ASTHMA")
#> # A tibble: 28 x 2
#>     WARD ASTHMA
#>  * <int>  <dbl>
#>  1     1  13.4 
#>  2     2  13.2 
#>  3     3  14.1 
#>  4     4  13.6 
#>  5     5  13.8 
#>  6     6  11.7 
#>  7     7   9.72
#>  8     8   9.82
#>  9     9  11.8 
#> 10    10   9.44
#> # ... with 18 more rows

This functionality is not available for the intensive and extensive arguments at this time.

Manual Workflow

areal purposely exports the sub-functions that are called by aw_interpolate() so that the interpolation process is not a “black box” but rather can be recreated manually. This is envisioned as a diagnostic toolkit, with the final interpolations estimated using the simpler aw_interpolate() function once any issues have been identified and ameliorated

First, we’ll prepare the data but retaining only the columns we are interested in from the source data using the select() function from dplyr:

race <- select(ar_stl_race, GEOID, TOTAL_E)
wards <- select(wards, -OBJECTID, -AREA)

We want to be careful to retain both a column with a value to be interpolated (total population in this case, TOTAL_E) and a column with a unique identification number (GEOID in this case).

Intersect Data

As we noted above, the interpolation process begins with calculating the intersection between the source and target data. We use the function aw_intersect() to accomplish this:

Note that aw_intersect() automatically calculates the area of the intersected feature.

Calculate Total Area

Next, we want to apply the total area of our source features to our data using aw_total(). This will implement the correct areal weighting approach based on the type and weight arguments. We’ll use the "sum" approach to areal weights here:

Changing type to "intensive" would be necessary for spatially intensive interpolations. Likewise, changing weight to "total" is necessary if areas that lack overlap should not be allocated into the target features.

Calculate Areal Weight

With the total weight in hand, we are ready to calculate the areal weight itself using aw_weight().

Calculate Estimated Population

We can then multiply the value (TOTAL_E) by the weight (areaWeight) to get a population estimate for each intersected feature using aw_calculate():

There is an optional newVar argument that can be used to store the estimates in a new column rather than in the existing value column.

Aggregate Estimated Population by Target ID

Finally, we aggregate the estimated values by target features using aw_aggregate():