High-Volume Offline Scoring
A command-line interface provides a set of scripts that enable you to “score” large collections of plans in bulk.
This is useful for scoring ensembles of plans, e.g., generated by ReCom.
These scripts are in the scripts directory.
Input Data
The data & shapes used by this scoring toolchain are the GeoJSON and adjacency graph files published by Dave’s Redistricting (DRA) in the dra2020/vtd_data GitHub repository. The specifics of that format are described there. The front-end data-processing scripts here depend on that custom GeoJSON format.
You can use this data in two different ways.
One is to clone the dra2020/vtd_data GitHub repository:
cd /path/to/your/dev/root
git clone https://github.com/dra2020/vtd_data
This copies all the data in the repository to your local machine. This repository is quite large, so you may want to use the alternative method below.
Another way to use this data is to download the data for a state temporarily as you need it. For example, on a Mac or Linux:
scripts/GET-GEOJSON.sh \
--state NC \
--output /tmp/NC_Geojson.zip \
--version v06
This downloads the v06 NC GeoJSON file and adjacency graph to a temporary file in /tmp.
From there, you can either unzip the file manually or use the UNZIP-GEOJSON.sh script.
For example:
scripts/UNZIP-GEOJSON.sh \
--input /tmp/NC_Geojson.zip \
--output /tmp/NC
This example unzips the downloaded file to a directory in /tmp.
Either way, the unzipped GeoJSON directory will contain four files:
- A license file
- A README file
- A GeoJSON file, e.g., NC_2020_VD_tabblock.vtd.datasets.geojson, and
- An adjacency graph, e.g., NC_2020_graph.json
which you can use as input to the scoring scripts.
For Windows, you’ll have to convert those bash scripts.
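Alternatively, on any platform with Python, a minimal sketch along the following lines can stand in for GET-GEOJSON.sh and UNZIP-GEOJSON.sh. The URL below is a placeholder, not the real location of the data; substitute the actual download link for the state and version you want from the dra2020/vtd_data repository.

import urllib.request
import zipfile
from pathlib import Path

# NOTE: placeholder URL -- substitute the actual download link for the
# state/version you want from the dra2020/vtd_data repository.
url = "https://example.com/NC_Geojson.zip"
zip_path = Path("NC_Geojson.zip")
out_dir = Path("NC")

urllib.request.urlretrieve(url, zip_path)   # download the zip archive

out_dir.mkdir(exist_ok=True)
with zipfile.ZipFile(zip_path) as zf:       # unzip it, like UNZIP-GEOJSON.sh
    zf.extractall(out_dir)

print("Extracted:", sorted(p.name for p in out_dir.iterdir()))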
Scores (Metrics)
This toolchain computes several dozen metrics described in Scores (Metrics). It produces a CSV file with one row of plan-level scores per plan and a JSONL file with one row of district-level aggregates per plan.
The plan-level scores include most of the analytics computed in the DRA app. A few are not included for various reasons:
- Jon Eguia's geographic baseline & advantage measures are implemented in the DRA client, as opposed to in the dra-analytics package (because of the information needed to compute them), so they are not included here.
- "Know it when you see it" (KIWYSI) compactness (see Compactness) is very expensive, so it's not included here. The straightforward way to calculate compactness metrics for a plan is, of course, to first create district shapes based on the precinct assignments and then compute the metrics using those shapes. Unfortunately, the naive approach to creating district shapes, "dissolving" precinct shapes into district shapes, is a very expensive operation. Even with just precinct shapes (i.e., not blocks), that can take ~60 seconds for a North Carolina plan. Traditional compactness measures, Reock and Polsby-Popper, can be computed quickly using pre-computed shape attributes without actually having district shapes (see the sketch after this list).
- Gamma and Global Symmetry aren't included, as they are legacy measures that are not widely used.
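For reference, those two traditional measures have simple closed forms. The sketch below just restates the standard definitions from per-district area, perimeter, and minimum-bounding-circle radius; it is not the toolchain's implementation, and the attribute values shown are made up.

import math

def polsby_popper(area: float, perimeter: float) -> float:
    # Polsby-Popper: 4 * pi * area / perimeter^2 (1.0 for a circle).
    return 4.0 * math.pi * area / (perimeter ** 2)

def reock(area: float, bounding_circle_radius: float) -> float:
    # Reock: district area / area of its smallest enclosing circle.
    return area / (math.pi * bounding_circle_radius ** 2)

# Hypothetical per-district attributes (consistent units, e.g., meters):
district = {"area": 1.2e10, "perimeter": 6.0e5, "bounding_circle_radius": 9.0e4}
print(round(polsby_popper(district["area"], district["perimeter"]), 3))
print(round(reock(district["area"], district["bounding_circle_radius"]), 3))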
The scores here also include several metrics not yet in the DRA app, including:
- In the proportionality/partisan category, there are two efficiency gap variations, efficiency_gap_wasted_votes and efficiency_gap_FPTP, to complement the statewide fractional-seats version in DRA (see the sketch after this list).
- In the competitiveness category, there is a simple count of the number of districts in the 0.45-0.55 range, competitive_district_count, and the average margin of victory, average_margin.
- In the minority opportunity category, there are majority-minority district counts for Blacks alone, Hispanics alone, and Black & Hispanic coalition districts: mmd_black, mmd_hispanic, and mmd_coalition.
- In the compactness category, cut_score is a discrete geometry measure of compactness. The sibling spanning_tree_score is not computed though, as it is also very expensive. There's also a "population compactness" or "energy" score.
- Finally, in the county-district splitting category, there are simple counts of the number of counties split as well as the number of times counties are split.
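To make the new partisan and competitiveness metrics concrete, here is a hedged sketch that computes a wasted-votes efficiency gap, a count of districts with two-party Democratic share in the 0.45-0.55 range, and the average margin of victory from per-district two-party vote counts. It only illustrates the standard definitions; the toolchain's efficiency_gap_wasted_votes, efficiency_gap_FPTP, competitive_district_count, and average_margin may differ in detail.

def efficiency_gap_wasted_votes(dem, rep):
    # dem, rep: two-party vote counts by district (parallel lists).
    wasted_dem = wasted_rep = total = 0.0
    for d, r in zip(dem, rep):
        votes = d + r
        total += votes
        threshold = votes / 2.0          # votes needed to win
        if d > r:
            wasted_dem += d - threshold  # winner's surplus votes
            wasted_rep += r              # all of the loser's votes
        else:
            wasted_rep += r - threshold
            wasted_dem += d
    return (wasted_dem - wasted_rep) / total

def competitive_district_count(dem, rep, lo=0.45, hi=0.55):
    # Districts whose Democratic two-party share falls in [lo, hi].
    return sum(1 for d, r in zip(dem, rep) if lo <= d / (d + r) <= hi)

def average_margin(dem, rep):
    # Mean margin of victory as a share of the two-party vote.
    return sum(abs(d - r) / (d + r) for d, r in zip(dem, rep)) / len(dem)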
SCORE.sh
Continuing the example above, you can score an ensemble of plans, generating a CSV file of scores and a JSONL file of by-district measures. For example, on Mac or Linux:
scripts/score/SCORE.sh \
--state NC \
--plan-type congress \
--geojson /tmp/NC/NC_2020_VD_tabblock.vtd.datasets.geojson \
--census T_20_CENS \
--vap V_20_VAP \
--cvap V_20_CVAP \
--elections E_16-20_COMP \
--graph /tmp/NC/NC_2020_graph.json \
--plans testdata/plans/NC_congress_plans.tagged.jsonl \
--scores path/to/scores.csv \
--by-district path/to/by-district.jsonl
where:
- The state is a two-character state code.
- The plan-type is congress, upper, or lower, for congressional plans and upper and lower state house plans.
- The geojson is a DRA precinct GeoJSON file with data coded by dataset.
- The graph is a JSON file that contains the node/list-of-neighbors adjacency graph of the precincts.
- The census, vap, and cvap are the dataset keys for the census, VAP, and CVAP datasets in the GeoJSON. The example uses the 2020 census, VAP, and CVAP data from the GeoJSON, as well as the 2016-2020 election composite.
- The plans is a JSONL file that contains the ensemble of plans to be scored. The plans can be simple dictionaries of geoid:district assignments, or they can be 'tagged' plan records. An example of this is provided in testdata/plans/NC_congress_plans.tagged.jsonl.
- The optional precomputed argument is a JSON file that contains pre-computed geographic baselines for states and chambers (plan types). If provided, scoring includes the geographic advantage measure.
The script writes a set of plan-level scores to a CSV file, a set of by-district measures to a JSONL file, and metadata for the scores in a JSON file.
Examples of these files can be found in testdata/examples/.
The plan-level scores are described in Scores (Metrics).
By default, this script calculates all metrics (“scores”) for all plans in an input ensemble. If your ensembles are very large though, you can increase scoring throughput by breaking the overall process down into pieces and running them in parallel.
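One simple way to do that is to split the plans JSONL file into chunks, score each chunk concurrently, and concatenate the resulting CSV files. The sketch below shows only the splitting step; the chunk size and file names are arbitrary choices, not something the toolchain prescribes.

from pathlib import Path

plans_path = Path("testdata/plans/NC_congress_plans.tagged.jsonl")
chunk_size = 1000  # plans per chunk; tune to your ensemble size and core count

with plans_path.open() as f:
    lines = f.readlines()

for i in range(0, len(lines), chunk_size):
    chunk_file = plans_path.with_name(f"chunk_{i // chunk_size:03d}.jsonl")
    chunk_file.write_text("".join(lines[i:i + chunk_size]))
    # Score each chunk separately (e.g., SCORE.sh with --plans <chunk_file>),
    # then concatenate the per-chunk CSV files when all runs finish.

If your plans file mixes metadata records in with the plan records, you may also need to copy those metadata records into each chunk.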
Alternatively, or on Windows, you can use the Python version of SCORE.sh, scripts/score/SCORE-PYTHON.py, which takes the same arguments as the bash script and produces the same output files.
Component Scripts
If you want more fine-grained control over the scoring process, you can use the constituent component scripts directly.
Mapping Scoring Data to a DRA GeoJSON
This script maps the data needed for scoring plans to the data in a DRA GeoJSON file.
scripts/data/map_scoring_data.py \
--geojson path/to/DRA.geojson \
--data-map path/to/data_map.json
The specific datasets used can be specified as optional arguments.
The default datasets are the 2020 census, VAP, and CVAP data,
and the composite election dataset for 2016-2020 elections.
By default, composite elections are not expanded to include the constituent elections, but you can expand them with the --expand-composites option.
Extracting Data from a DRA GeoJSON
This script extracts data from a DRA GeoJSON file and writes it to a JSONL file, using the data map to determine what data to extract.
scripts/data/extract_data.py \
--geojson path/to/DRA.geojson \
--data-map path/to/data_map.json \
--graph path/to/adjacency_graph.json \
--data path/to/input_data.jsonl
Aggregating Plans by District
This script reads a stream of plans from STDIN, aggregates data & shapes by district, and writes the plan and the aggregates to STDOUT.
cat path/to/plans.jsonl \
| scripts/score/aggregate.py \
--state xx \
--plan-type congress \
--data path/to/input_data.jsonl \
--graph path/to/adjacency_graph.json > path/to/plans_plus_aggregates.jsonl
It reads plans as JSONL from the input stream.
Each plan can be a simple dictionary of geoid:district assignments, or it can use a tagged format with the "_tag_" tag equal to "plan" and the "plan" key containing the geoid:district pairs.
Examples of these formats can be found in testdata/plans/, in NC_congress_plans.naked.jsonl and NC_congress_plans.tagged.jsonl, respectively.
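As an illustration, a few lines of Python can produce plans in the tagged format; the geoids and district assignments below are hypothetical placeholders.

import json

# Two tiny, made-up plans: each maps precinct geoids to district numbers.
plans = [
    {"370010000001": 1, "370010000002": 2},
    {"370010000001": 2, "370010000002": 1},
]

with open("plans.tagged.jsonl", "w") as f:
    for plan in plans:
        record = {"_tag_": "plan", "plan": plan}  # tagged plan record
        f.write(json.dumps(record) + "\n")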
If the JSON records are in tagged format, metadata records are passed through unchanged, as are any other non-plan records.
In addition to any records simply passed through, the output stream contains a record for each plan, with the geoid:district assignments in the "plan" key and the district-level aggregates in the "aggregates" key.
Aggregates hierarchically encode the type of dataset (census, vap, cvap, election, shape),
the dataset key (defined by DRA in the README for the GeoJSON), the aggregate name (e.g., “dem_by_district”), and
the aggregates by district. For example:
{"election": {"E_16-20_COMP": {"dem_by_district": [...] ...} ...}
.
The first item in each list of values is a state-level aggregate, and the rest are district-level aggregates for districts 1 to N.
You can see an example in testdata/examples/NC_congress_aggs.100.jsonl
.
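Given that record layout, a short sketch like the following could pull the Democratic vote totals by district out of each aggregated plan. It assumes the E_16-20_COMP dataset key from the earlier example; a real file may use different keys, and checking for the "aggregates" key is just one way to skip passed-through records.

import json

with open("path/to/plans_plus_aggregates.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "aggregates" not in record:
            continue  # skip metadata and other passed-through records
        dem = record["aggregates"]["election"]["E_16-20_COMP"]["dem_by_district"]
        statewide, by_district = dem[0], dem[1:]  # first value is statewide
        print(statewide, by_district)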
There are some helper scripts to convert alternative formats into the tagged format that can be ingested by the aggregate.py script.
Scoring Plans
This script reads a stream of plans with district aggregates from STDIN, scores the plans, and writes the plan-level scores to STDOUT along with the district-level aggregates.
cat path/to/plans_plus_aggregates.jsonl \
| scripts/score/score.py \
--state xx \
--plan-type congress \
--data path/to/input_data.jsonl \
--graph path/to/adjacency_graph.json > path/to/scores_plus_aggregates.jsonl
Analogous to the aggregate.py script output, scoring writes plan-level scores in a hierarchical JSONL format:
the type of dataset (census, vap, cvap, election, shape), the dataset key, and the metric name and value.
For example: {"election": {"E_16-20_COMP": {"estimated_vote_pct": 0.4943, ...}, ...}, ...}.
Write Scores to Disk
This script reads a stream of scores with district aggregates from STDIN, and writes the plan-level scores to a CSV file and the district-level aggregates to a JSONL file.
cat path/to/scores_plus_aggregates.jsonl \
| scripts/write.py \
--data path/to/input_data.jsonl \
--scores path/to/scores.csv \
--by-district path/to/by-district.jsonl
To keep the plan-level scores a simple CSV, this script “flattens” the hierarchical JSONL format into a CSV file.
By default, the field names are simply the names of the metrics. However, if you specify the --prefixes option, or if there are multiple election datasets scored, this script prefixes the metric names with the dataset key, e.g., E_16-20_COMP.estimated_seats.
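To make the flattening concrete, here is a hedged sketch of the idea: it walks the hierarchical scores for one plan and produces flat column names, optionally prefixed with the dataset key. It illustrates the behavior described above, not the write.py implementation itself.

def flatten_scores(scores, use_prefixes=False):
    # scores: {dataset_type: {dataset_key: {metric_name: value, ...}, ...}, ...}
    flat = {}
    for datasets in scores.values():
        for dataset_key, metrics in datasets.items():
            for name, value in metrics.items():
                column = f"{dataset_key}.{name}" if use_prefixes else name
                flat[column] = value
    return flat

example = {"election": {"E_16-20_COMP": {"estimated_vote_pct": 0.4943}}}
print(flatten_scores(example, use_prefixes=True))
# -> {'E_16-20_COMP.estimated_vote_pct': 0.4943}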
If you want output in a different format, you can process the output of the score.py script with a different script-let or directly, e.g., using jq.