# Measuring District Map Similarity

While working on the various Florida redistricting challenges over the past couple of years, one of my major tasks was tracing district shapes in enacted maps back to earlier introduced maps and publicly submitted maps. This was not something that could be done, at least not easily, through visual inspection – the Florida Senate redistricting site has over 100 maps for download for the congressional plan alone. Additionally, appearances can be deceiving. An extreme example – fictional, but based on something that actually came up – is presented below.

These districts differ massively in shape, but contain the same individuals, save for 20 people – as you can imagine, the population density of the Everglades is pretty low. One can imagine the reverse, as well, where minor geographic changes in dense city areas can have large impacts on the makeup of a district.

One measure I used as a first cut at map similarity is a relatively straightforward one, based on overlapping populations between two plans. The idea is simple: for each district in the first plan, find the district in the second plan with the greatest population overlap. Sum those overlap populations across all districts in the first plan, and divide by the total state population. In informal discussions, I referred to this as a correlation, since it has a similar feel: values fall between 0 and 1, higher values indicate more overlap, and the maximum value of 1 occurs with identical maps.
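Written out as a formula, the measure looks like the following. The notation is introduced here for illustration only: $A_1, \dots, A_m$ are the districts of the first plan, $B_1, \dots, B_n$ the districts of the second, $\operatorname{pop}(S)$ is the population of a set of census blocks $S$, and $P$ is the total state population.

$$
s(A, B) = \frac{1}{P} \sum_{i=1}^{m} \max_{1 \le j \le n} \operatorname{pop}(A_i \cap B_j)
$$

When the two plans are identical, each $A_i$ coincides with some $B_j$, the inner maximum is the full district population, and the sum equals $P$, which gives the maximum value of 1.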

It’s not like a correlation in other ways, though. The theoretical minimum is not 0 but the inverse of the number of districts (assuming equal district populations), reached when the population of each district in the first map is divided evenly among every district in the second map. For obvious reasons, real maps never come close to this minimum; the lowest figure I can remember coming across was in the 0.55-0.60 range.
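The arithmetic behind that floor is short. As a sketch, suppose both plans have $k$ equally populated districts and the state population is $P$ (symbols chosen here for illustration). Each district then holds $P/k$ people, and if every first-plan district is split evenly across all $k$ second-plan districts, each overlap piece holds $P/k^2$ people. The largest piece for each district is therefore also $P/k^2$, and

$$
s_{\min} = \frac{1}{P} \sum_{i=1}^{k} \frac{P}{k^2} = \frac{1}{P} \cdot \frac{P}{k} = \frac{1}{k}
$$

With 10 districts, for example, the floor would be 0.1.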

The way my SQL database was laid out made this easy to calculate. Each table row represented a census block, and columns held information on population (total, VAP, race, etc.), disaggregated election results, and district assignment in various plans. Luckily, one standard file type the state offered for district plans was the .doj, which is simply a text file where each row has a census block ID number (the Census GEOID) followed by the district assignment. Since districts are defined by census blocks, there was no issue with split blocks.
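Reading such a file into a block-to-district lookup might look roughly like the sketch below. It assumes the two fields are comma-separated, which the description above does not state, so the delimiter is left as a parameter. The function name and the sample GEOIDs are made up for illustration.

```python
import csv
import io


def read_block_assignments(lines, delimiter=","):
    """Map census block GEOID -> district label from block-assignment rows.

    `lines` is any iterable of text lines (an open file works).
    The delimiter is an assumption; adjust it to match the actual file.
    """
    assignments = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        geoid, district = row[0].strip(), row[1].strip()
        # Keep both fields as strings so leading zeros in GEOIDs survive.
        assignments[geoid] = district
    return assignments


# Made-up sample rows standing in for a real plan file.
sample = io.StringIO(
    "120010002001000,3\n"
    "120010002001001,3\n"
    "120110101001000,20\n"
)
assignments = read_block_assignments(sample)
```

Each plan loaded this way can then become one more district-assignment column on the block table.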

Thus, calculating the figure was achieved by running the query:

`SELECT plan1, plan2, SUM(population) AS piecepop FROM blocks GROUP BY plan1, plan2 ORDER BY plan1, piecepop DESC;`

(I stored demographic and plan information in separate tables, but the command above assumes they’re in a single table, here called `blocks`, for simplicity.)

The query returns one row for each pair of districts across the two plans that share any population, sorted by plan 1 district number, so that the first row listed for each plan 1 district is the one with the largest overlap. A simple script pulled the top row for each district, summed the populations, and divided by the state population. By running this for each combination of plans you are interested in, you can output something that looks very much like a correlation table.
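A minimal sketch of such a script, using Python's built-in `sqlite3` module, is below. It is not the script I actually used. The table name `blocks`, the column names, and the toy data are assumptions for illustration, and it takes the maximum per district in Python instead of relying on the sort order.

```python
import sqlite3
from collections import defaultdict


def plan_overlap(conn, plan_a, plan_b, table="blocks", pop_col="population"):
    """Share of state population living in each plan_a district's
    best-matching plan_b district."""
    # Identifiers are interpolated directly, so only pass trusted names.
    query = (
        f"SELECT {plan_a}, {plan_b}, SUM({pop_col}) AS piecepop "
        f"FROM {table} GROUP BY {plan_a}, {plan_b}"
    )
    best = defaultdict(int)  # largest overlap seen for each plan_a district
    total = 0                # the pieces add up to the state population
    for dist_a, _dist_b, piecepop in conn.execute(query):
        total += piecepop
        best[dist_a] = max(best[dist_a], piecepop)
    return sum(best.values()) / total


# Toy data: four blocks, two districts in each plan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks (geoid TEXT, population INTEGER, plan1 INTEGER, plan2 INTEGER)"
)
conn.executemany(
    "INSERT INTO blocks VALUES (?, ?, ?, ?)",
    [("b1", 40, 1, 1), ("b2", 10, 1, 2), ("b3", 30, 2, 2), ("b4", 20, 2, 1)],
)
score = plan_overlap(conn, "plan1", "plan2")
```

Here district 1 of `plan1` keeps 40 of its 50 people together and district 2 keeps 30 of its 50, so the score is 70 out of 100. Looping `plan_overlap` over every ordered pair of plan columns fills in the correlation-style table.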

Again, this measure is useful as a first look at map similarity, but it is not perfect. Where sections of a state are highly dissimilar, a district in plan 2 can be matched with more than one district in plan 1. Even in more nuanced analysis, this problem is difficult to remedy. It is useful in some cases to give side-by-side stats for matching districts between two plans, but it is not always possible to make satisfactory pairings in which each district is matched uniquely; doing so can sometimes require pairing districts with very little overlap. Because the matching is not one-to-one, the measure is also not symmetric: reversing the order of the two plans can produce a different value.
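A small made-up example shows both effects at once. The overlap numbers below are hypothetical: three districts of 100 people in each plan, with `overlap[i][j]` holding the population shared by district `i` of plan 1 and district `j` of plan 2.

```python
# Hypothetical overlap populations (rows: plan 1 districts, columns: plan 2 districts).
overlap = [
    [50, 30, 20],
    [50, 30, 20],
    [0, 40, 60],
]


def plan_score(matrix):
    """Sum each row's largest overlap and divide by the total population."""
    total = sum(sum(row) for row in matrix)
    return sum(max(row) for row in matrix) / total


def best_matches(matrix):
    """Index of the best-matching column for each row (first one wins ties)."""
    return [row.index(max(row)) for row in matrix]


# Swapping the roles of the two plans is just a transpose of the matrix.
transposed = [list(col) for col in zip(*overlap)]

forward = plan_score(overlap)      # plan 1 matched against plan 2
reverse = plan_score(transposed)   # plan 2 matched against plan 1
matches = best_matches(overlap)    # two plan 1 districts pick the same plan 2 district
```

The first two plan 1 districts both pick the first plan 2 district as their best match, and the two directions give 160/300 and 150/300 respectively.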

This method gives you a plan-wide figure, but it can mask variation in similarity across the state: for instance, many of the maps the state produced were identical in the panhandle, while the Orlando area saw a lot of different approaches. We got more mileage out of a variation of this statistic: instead of aggregating up to the plan level, we output district-by-district figures. For each district in plan 1, we calculate the largest overlap population divided by the average of the two matching districts’ populations (taking the average is important if there’s a possibility of different district populations). Instead of a correlation-table-style output, then, we fixed a single plan as plan 1, varied plan 2 over all the maps that could have potentially contributed to plan 1, and printed a spreadsheet with plan 1 districts as the rows and the different plan 2’s as columns. Finding a lot of 1’s in a column meant a lot of identical districts between the two plans.
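The district-level version can be sketched along these lines. The input rows have the same shape as the SQL output, (plan 1 district, plan 2 district, overlap population), but the function name and the numbers are invented for illustration.

```python
from collections import defaultdict


def district_scores(rows):
    """For each plan 1 district, return (best plan 2 match, similarity).

    Similarity is the largest overlap divided by the average population
    of the two matched districts.
    """
    pop1 = defaultdict(int)  # total population of each plan 1 district
    pop2 = defaultdict(int)  # total population of each plan 2 district
    best = {}                # plan 1 district -> (plan 2 district, overlap)
    for d1, d2, pop in rows:
        pop1[d1] += pop
        pop2[d2] += pop
        if d1 not in best or pop > best[d1][1]:
            best[d1] = (d2, pop)
    # District totals are complete only after the loop, so divide here.
    return {
        d1: (d2, pop / ((pop1[d1] + pop2[d2]) / 2))
        for d1, (d2, pop) in best.items()
    }


# Made-up overlap rows: district 1 is unchanged, districts 2 and 3 traded people.
rows = [(1, 1, 100), (2, 2, 80), (2, 3, 20), (3, 3, 90), (3, 2, 10)]
scores = district_scores(rows)
```

Running this once per candidate plan 2 and collecting the similarity values by plan 1 district gives the columns of the spreadsheet; an untouched district such as district 1 here shows up as exactly 1.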

The rows in the table output by the SQL query that don’t represent the greatest overlap are also of interest, especially when you know that one plan is a descendant of the other; these represent changes that were made by map drawers. In a future post I’ll look at what sort of information you can glean from these.