High-Volume Offline Scoring
A command-line interface provides a set of scripts that enable you to “score” large collections of plans in bulk.
This is useful for scoring ensembles of plans, e.g., generated by ReCom.
These scripts are in the scripts directory.
Input Data
The data & shapes used by this scoring toolchain are the GeoJSON and adjacency graph files published by Dave’s Redistricting (DRA) in the dra2020/vtd_data GitHub repository. The specifics of that format are described there. The front-end data-processing scripts here depend on that custom GeoJSON format.
You can use this data in two different ways.
One is to clone the dra2020/vtd_data GitHub repository:
cd /path/to/your/dev/root
git clone https://github.com/dra2020/vtd_data
This copies all the data in the repository to your local machine. This repository is quite large, so you may want to use the alternative method below.
Another way to use this data is to download the data for a state temporarily as you need it. For example, on a Mac or Linux:
scripts/GET-GEOJSON.sh \
--state NC \
--output /tmp/NC_Geojson.zip \
--version v06
This downloads the v06 NC GeoJSON file and adjacency graph to a temporary file in /tmp.
From there, you can either unzip the file manually or use the UNZIP-GEOJSON.sh script.
For example:
scripts/UNZIP-GEOJSON.sh \
--input /tmp/NC_Geojson.zip \
--output /tmp/NC
This example unzips the downloaded file to a directory in /tmp.
Either way, the unzipped GeoJSON directory will contain four files:
- A license file
- A README file
- A GeoJSON file, e.g., NC_2020_VD_tabblock.vtd.datasets.geojson, and
- An adjacency graph, e.g., NC_2020_graph.json
which you can use as input to the scoring scripts.
For Windows, you’ll have to convert those bash scripts.
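Alternatively, on any platform with Python, a minimal sketch along the following lines can stand in for GET-GEOJSON.sh and UNZIP-GEOJSON.sh. The URL below is a placeholder, not the real location of the data; substitute the actual download link for the state and version you want from the dra2020/vtd_data repository.

import urllib.request
import zipfile
from pathlib import Path

# NOTE: placeholder URL -- substitute the actual download link for the
# state/version you want from the dra2020/vtd_data repository.
url = "https://example.com/NC_Geojson.zip"
zip_path = Path("NC_Geojson.zip")
out_dir = Path("NC")

urllib.request.urlretrieve(url, zip_path)   # download the zip archive

out_dir.mkdir(exist_ok=True)
with zipfile.ZipFile(zip_path) as zf:       # unzip it, like UNZIP-GEOJSON.sh
    zf.extractall(out_dir)

print("Extracted:", sorted(p.name for p in out_dir.iterdir()))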
Scores (Metrics)
This toolchain computes several dozen metrics described in Scores (Metrics). It produces a CSV file with one row of plan-level scores per plan and a JSONL file with one row of district-level aggregates per plan.
The plan-level scores include most of the analytics computed in the DRA app. A few are not included for various reasons:
- Jon Eguia's geographic baseline & advantage measures are implemented in the DRA client, as opposed to in the dra-analytics package (because of the information needed to compute them), so they are not included here.
- "Know it when you see it" (KIWYSI) compactness (see Compactness) is very expensive, so it's not included here. The straightforward way to calculate compactness metrics for a plan is, of course, to first create district shapes based on the precinct assignments and then compute the metrics using those shapes. Unfortunately, the naive approach to creating district shapes, "dissolving" precinct shapes into district shapes, is a very expensive operation. Even with just precinct shapes (i.e., not blocks), that can take ~60 seconds for a North Carolina plan. Traditional compactness measures, Reock and Polsby-Popper, can be computed quickly using pre-computed shape attributes without actually having district shapes (see the sketch after this list).
- Gamma and Global Symmetry aren't included, as they are legacy measures that are not widely used.
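For reference, those two traditional measures have simple closed forms. The sketch below just restates the standard definitions from per-district area, perimeter, and minimum-bounding-circle radius; it is not the toolchain's implementation, and the attribute values shown are made up.

import math

def polsby_popper(area: float, perimeter: float) -> float:
    # Polsby-Popper: 4 * pi * area / perimeter^2 (1.0 for a circle).
    return 4.0 * math.pi * area / (perimeter ** 2)

def reock(area: float, bounding_circle_radius: float) -> float:
    # Reock: district area / area of its smallest enclosing circle.
    return area / (math.pi * bounding_circle_radius ** 2)

# Hypothetical per-district attributes (consistent units, e.g., meters):
district = {"area": 1.2e10, "perimeter": 6.0e5, "bounding_circle_radius": 9.0e4}
print(round(polsby_popper(district["area"], district["perimeter"]), 3))
print(round(reock(district["area"], district["bounding_circle_radius"]), 3))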
The scores here also include several metrics not yet in the DRA app, including:
- In the proportionality/partisan category, there are two efficiency gap variations, efficiency_gap_wasted_votes and efficiency_gap_FPTP, to complement the statewide fractional-seats version in DRA (see the sketch after this list).
- In the competitiveness category, there is a simple count of the number of districts in the 0.45-0.55 range, competitive_district_count, and the average margin of victory, average_margin.
- In the minority opportunity category, there are majority-minority district counts for Blacks alone, Hispanics alone, and Black & Hispanic coalition districts: mmd_black, mmd_hispanic, and mmd_coalition.
- In the compactness category, cut_score is a discrete geometry measure of compactness. The sibling spanning_tree_score is not computed though, as it is also very expensive. There's also a "population compactness" or "energy" score.
- Finally, in the county-district splitting category, there are simple counts of the number of counties split as well as the number of times counties are split.
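To make the new partisan and competitiveness metrics concrete, here is a hedged sketch that computes a wasted-votes efficiency gap, a count of districts with two-party Democratic share in the 0.45-0.55 range, and the average margin of victory from per-district two-party vote counts. It only illustrates the standard definitions; the toolchain's efficiency_gap_wasted_votes, efficiency_gap_FPTP, competitive_district_count, and average_margin may differ in detail.

def efficiency_gap_wasted_votes(dem, rep):
    # dem, rep: two-party vote counts by district (parallel lists).
    wasted_dem = wasted_rep = total = 0.0
    for d, r in zip(dem, rep):
        votes = d + r
        total += votes
        threshold = votes / 2.0          # votes needed to win
        if d > r:
            wasted_dem += d - threshold  # winner's surplus votes
            wasted_rep += r              # all of the loser's votes
        else:
            wasted_rep += r - threshold
            wasted_dem += d
    return (wasted_dem - wasted_rep) / total

def competitive_district_count(dem, rep, lo=0.45, hi=0.55):
    # Districts whose Democratic two-party share falls in [lo, hi].
    return sum(1 for d, r in zip(dem, rep) if lo <= d / (d + r) <= hi)

def average_margin(dem, rep):
    # Mean margin of victory as a share of the two-party vote.
    return sum(abs(d - r) / (d + r) for d, r in zip(dem, rep)) / len(dem)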
SCORE.sh
Continuing the example above, you can score an ensemble of plans, generating a CSV file of scores and a JSONL file of by-district measures. For example, on Mac or Linux:
scripts/score/SCORE.sh \
--state NC \
--plan-type congress \
--geojson /tmp/NC/NC_2020_VD_tabblock.vtd.datasets.geojson \
--census T_20_CENS \
--vap V_20_VAP \
--cvap V_20_CVAP \
--elections E_16-20_COMP \
--graph /tmp/NC/NC_2020_graph.json \
--plans testdata/plans/NC_congress_plans.tagged.jsonl \
--scores path/to/scores.csv \
--by-district path/to/by-district.jsonl
where:
- The state is a two-character state code.
- The plan-type is congress, upper, or lower, for congressional plans and upper and lower state house plans.
- The geojson is a DRA precinct GeoJSON file with data coded by dataset.
- The graph is a JSON file that contains the node/list-of-neighbors adjacency graph of the precincts.
- The census, vap, and cvap are the dataset keys for the census, VAP, and CVAP datasets in the GeoJSON. The example uses the 2020 census, VAP, and CVAP data from the GeoJSON, as well as the 2016-2020 election composite.
- The plans is a JSONL file that contains the ensemble of plans to be scored. The plans can be simple dictionaries of geoid:district assignments, or they can be 'tagged' plan records. An example of this is provided in testdata/plans/NC_congress_plans.tagged.jsonl.
- The optional precomputed argument is a JSON file that contains pre-computed geographic baselines for states and chambers (plan types). If provided, scoring includes the geographic advantage measure.
The script writes a set of plan-level scores to a CSV file, a set of by-district measures to a JSONL file, and metadata for the scores in a JSON file.
Examples of these files can be found in testdata/examples/.
The plan-level scores are described in Scores (Metrics).
By default, this script calculates all metrics (“scores”) for all plans in an input ensemble. If your ensembles are very large though, you can increase scoring throughput by breaking the overall process down into pieces and running them in parallel.
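One simple way to do that is to split the plans JSONL file into chunks, score each chunk concurrently, and concatenate the resulting CSV files. The sketch below shows only the splitting step; the chunk size and file names are arbitrary choices, not something the toolchain prescribes.

from pathlib import Path

plans_path = Path("testdata/plans/NC_congress_plans.tagged.jsonl")
chunk_size = 1000  # plans per chunk; tune to your ensemble size and core count

with plans_path.open() as f:
    lines = f.readlines()

for i in range(0, len(lines), chunk_size):
    chunk_file = plans_path.with_name(f"chunk_{i // chunk_size:03d}.jsonl")
    chunk_file.write_text("".join(lines[i:i + chunk_size]))
    # Score each chunk separately (e.g., SCORE.sh with --plans <chunk_file>),
    # then concatenate the per-chunk CSV files when all runs finish.

If your plans file mixes metadata records in with the plan records, you may also need to copy those metadata records into each chunk.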
Alternatively, or on Windows, you can use the Python version of SCORE.sh, scripts/score/SCORE-PYTHON.py, which takes the same arguments as the bash script and produces the same output files.
Component Scripts
If you want more fine-grained control over the scoring process, you can use the constituent component scripts directly.
Mapping Scoring Data to a DRA GeoJSON
This script maps the data needed for scoring plans to the data in a DRA GeoJSON file.
scripts/data/map_scoring_data.py \
--geojson path/to/DRA.geojson \
--data-map path/to/data_map.json
The specific datasets used can be specified as optional arguments.
The default datasets are the 2020 census, VAP, and CVAP data,
and the composite election dataset for 2016-2020 elections.
By default, composite elections are not expanded to include the constituent elections, but you can expand them with the --expand-composites option.
Extracting Data from a DRA GeoJSON
This script extracts data from a DRA GeoJSON file and writes it to a JSONL file, using the data map to determine what data to extract.
scripts/data/extract_data.py \
--geojson path/to/DRA.geojson \
--data-map path/to/data_map.json \
--graph path/to/adjacency_graph.json \
--data path/to/input_data.jsonl
Aggregating Plans by District
This script reads a stream of plans from STDIN, aggregates data & shapes by district, and writes the plan and the aggregates to STDOUT.
cat path/to/plans.jsonl \
| scripts/score/aggregate.py \
--state xx \
--plan-type congress \
--data path/to/input_data.jsonl \
--graph path/to/adjacency_graph.json > path/to/plans_plus_aggregates.jsonl
It reads plans as JSONL from the input stream.
Each plan can be a simple dictionary of geoid:district assignments, or it can use a tagged format with the "_tag_" tag equal to "plan" and the "plan" key containing the geoid:district pairs.
Examples of these formats can be found in testdata/plans/, in NC_congress_plans.naked.jsonl and NC_congress_plans.tagged.jsonl, respectively.
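As an illustration, a few lines of Python can produce plans in the tagged format; the geoids and district assignments below are hypothetical placeholders.

import json

# Two tiny, made-up plans: each maps precinct geoids to district numbers.
plans = [
    {"370010000001": 1, "370010000002": 2},
    {"370010000001": 2, "370010000002": 1},
]

with open("plans.tagged.jsonl", "w") as f:
    for plan in plans:
        record = {"_tag_": "plan", "plan": plan}  # tagged plan record
        f.write(json.dumps(record) + "\n")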
If the JSON records are in tagged format, metadata records are passed through unchanged, as are any other non-plan records.
In addition to any records simply passed through, the output stream contains a record for each plan, with the geoid:district assignments in the "plan" key and the district-level aggregates in the "aggregates" key.
Aggregates hierarchically encode the type of dataset (census, vap, cvap, election, shape),
the dataset key (defined by DRA in the README for the GeoJSON), the aggregate name (e.g., “dem_by_district”), and
the aggregates by district. For example:
{"election": {"E_16-20_COMP": {"dem_by_district": [...] ...} ...}
.
The first item in each list of values is a state-level aggregate, and the rest are district-level aggregates for districts 1 to N.
You can see an example in testdata/examples/NC_congress_aggs.100.jsonl
.
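Given that record layout, a short sketch like the following could pull the Democratic vote totals by district out of each aggregated plan. It assumes the E_16-20_COMP dataset key from the earlier example; a real file may use different keys, and checking for the "aggregates" key is just one way to skip passed-through records.

import json

with open("path/to/plans_plus_aggregates.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "aggregates" not in record:
            continue  # skip metadata and other passed-through records
        dem = record["aggregates"]["election"]["E_16-20_COMP"]["dem_by_district"]
        statewide, by_district = dem[0], dem[1:]  # first value is statewide
        print(statewide, by_district)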
There are some helper scripts to convert alternative formats into the tagged format that can be ingested by the aggregate.py script.
Scoring Plans
This script reads a stream of plans with district aggregates from STDIN, scores the plans, and writes the plan-level scores to STDOUT along with the district-level aggregates.
cat path/to/plans_plus_aggregates.jsonl \
| scripts/score/score.py \
--state xx \
--plan-type congress \
--data path/to/input_data.jsonl \
--graph path/to/adjacency_graph.json > path/to/scores_plus_aggregates.jsonl
Analogous to the aggregate.py script output, scoring writes plan-level scores in a hierarchical JSONL format:
the type of dataset (census, vap, cvap, election, shape), the dataset key, and the metric name and value.
For example: {"election": {"E_16-20_COMP": {"estimated_vote_pct": 0.4943, ...}, ...}, ...}.
Write Scores to Disk
This script reads a stream of scores with district aggregates from STDIN, and writes the plan-level scores to a CSV file and the district-level aggregates to a JSONL file.
cat path/to/scores_plus_aggregates.jsonl \
| scripts/write.py \
--data path/to/input_data.jsonl \
--scores path/to/scores.csv \
--by-district path/to/by-district.jsonl
To keep the plan-level scores a simple CSV, this script “flattens” the hierarchical JSONL format into a CSV file.
By default, the field names are simply the names of the metrics. However, if you specify the --prefixes option, or if there are multiple election datasets scored, this script prefixes the metric names with the dataset key, e.g., E_16-20_COMP.estimated_seats.
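To make the flattening concrete, here is a hedged sketch of the idea: it walks the hierarchical scores for one plan and produces flat column names, optionally prefixed with the dataset key. It illustrates the behavior described above, not the write.py implementation itself.

def flatten_scores(scores, use_prefixes=False):
    # scores: {dataset_type: {dataset_key: {metric_name: value, ...}, ...}, ...}
    flat = {}
    for datasets in scores.values():
        for dataset_key, metrics in datasets.items():
            for name, value in metrics.items():
                column = f"{dataset_key}.{name}" if use_prefixes else name
                flat[column] = value
    return flat

example = {"election": {"E_16-20_COMP": {"estimated_vote_pct": 0.4943}}}
print(flatten_scores(example, use_prefixes=True))
# -> {'E_16-20_COMP.estimated_vote_pct': 0.4943}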
If you want output in a different format, you can process the output of the score.py script with a different script-let or directly, e.g., using jq.