High-Volume Offline Scoring

The command-line interface provides a set of scripts that let you “score” large collections of plans in bulk. This is useful for scoring ensembles of plans, e.g., those generated by ReCom. These scripts live in the scripts directory.

Input Data

The data & shapes used by this scoring toolchain are the GeoJSON and adjacency graph files published by Dave’s Redistricting (DRA) in the dra2020/vtd_data GitHub repository. The specifics of that format are described there. The front-end data-processing scripts here depend on that custom GeoJSON format.

You can use this data in two different ways. One is to clone the dra2020/vtd_data GitHub repository:

cd /path/to/your/dev/root
git clone https://github.com/dra2020/vtd_data

This copies all the data in the repository to your local machine. This repository is quite large, so you may want to use the alternative method below.

Another way to use this data is to download the data for a state temporarily as you need it. For example, on a Mac or Linux:

scripts/GET-GEOJSON.sh \
--state NC \
--output /tmp/NC_Geojson.zip \
--version v06

This downloads the v06 NC GeoJSON file and adjacency graph to a temporary file in /tmp. From there, you can either manually unzip or use the UNZIP-GEOJSON.sh script. For example:

scripts/UNZIP-GEOJSON.sh \
--input /tmp/NC_Geojson.zip \
--output /tmp/NC

This example unzips the downloaded file to a directory in /tmp.

Either way, the unzipped GeoJSON directory will contain four files, which you can use as input to the scoring scripts.

On Windows, you will need to adapt these bash scripts yourself.

Scores (Metrics)

This toolchain computes several dozen metrics described in Scores (Metrics). It produces a CSV file with one row of plan-level scores per plan and a JSONL file with one row of district-level aggregates per plan.

The plan-level scores include most of the analytics computed in the DRA app. A few are not included, for various reasons.

The scores here also include several metrics not yet in the DRA app.

SCORE.sh

Continuing the example above, you can score an ensemble of plans, generating a CSV file of plan-level scores and a JSONL file of by-district measures. For example, on Mac or Linux:

scripts/score/SCORE.sh \
--state NC \
--plan-type congress \
--geojson /tmp/NC/NC_2020_VD_tabblock.vtd.datasets.geojson \
--census T_20_CENS \
--vap V_20_VAP \
--cvap V_20_CVAP \
--elections E_16-20_COMP \
--graph /tmp/NC/NC_2020_graph.json \
--plans testdata/plans/NC_congress_plans.tagged.jsonl \
--scores path/to/scores.csv \
--by-district path/to/by-district.jsonl

where:

- --state is the two-letter state code,
- --plan-type is the type of plan being scored (e.g., congress),
- --geojson is the path to the unzipped DRA GeoJSON file,
- --census, --vap, --cvap, and --elections are the DRA dataset keys to use,
- --graph is the path to the adjacency graph file,
- --plans is the JSONL file of plans to score,
- --scores is the output CSV file of plan-level scores, and
- --by-district is the output JSONL file of by-district measures.

The script writes a set of plan-level scores to a CSV file, a set of by-district measures to a JSONL file, and metadata for the scores to a JSON file. Examples of these files can be found in testdata/examples/.

The plan-level scores are described in Scores (Metrics).

By default, this script calculates all metrics (“scores”) for all plans in an input ensemble. If your ensembles are very large, though, you can increase scoring throughput by breaking the overall process into pieces and running them in parallel.
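For example, one simple way to split the work, assuming your plans are stored one per line in a JSONL file, is to shard the ensemble and score each shard independently. A minimal sketch (split_jsonl is a hypothetical helper, not part of the toolchain):

```python
import itertools


def split_jsonl(path, shard_size):
    """Split a JSONL ensemble into fixed-size shards named
    <path>.shard000, <path>.shard001, etc."""
    with open(path) as f:
        for i in itertools.count():
            # Take the next shard_size lines; stop when the file is exhausted.
            shard = list(itertools.islice(f, shard_size))
            if not shard:
                break
            with open(f"{path}.shard{i:03d}", "w") as out:
                out.writelines(shard)
```

Each shard can then be passed to SCORE.sh (via --plans) in a separate process, and the resulting CSV files concatenated afterwards.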

Alternatively, or on Windows, you can use the Python version of the script, scripts/score/SCORE-PYTHON.py, which takes the same arguments as the bash script and produces the same output files.

Component Scripts

If you want more fine-grained control over the scoring process, you can use the constituent component scripts directly.

Mapping Scoring Data to a DRA GeoJSON

This script maps the data needed for scoring plans to the data in a DRA GeoJSON file.

scripts/data/map_scoring_data.py \
--geojson path/to/DRA.geojson \
--data-map path/to/data_map.json

The specific datasets used can be specified as optional arguments. The defaults are the 2020 census, VAP, and CVAP datasets and the composite election dataset for the 2016-2020 elections. By default, composite elections are not expanded into their constituent elections; you can expand them with the --expand-composites option.

Extracting Data from a DRA GeoJSON

This script extracts data from a DRA GeoJSON file and writes it to a JSONL file, using the data map to determine what data to extract.

scripts/data/extract_data.py \
--geojson path/to/DRA.geojson \
--data-map path/to/data_map.json \
--graph path/to/adjacency_graph.json \
--data path/to/input_data.jsonl

Aggregating Plans by District

This script reads a stream of plans from STDIN, aggregates data & shapes by district, and writes the plan and the aggregates to STDOUT.

cat path/to/plans.jsonl \
| scripts/score/aggregate.py \
--state xx \
--plan-type congress \
--data path/to/input_data.jsonl \
--graph path/to/adjacency_graph.json > path/to/plans_plus_aggregates.jsonl

It reads plans as JSONL from the input stream. Each plan can be a simple dictionary of geoid:district assignments, or a tagged format with the "_tag_" tag equal to "plan" and the "plan" key containing the geoid:district pairs. Examples of these formats can be found in testdata/plans/ in NC_congress_plans.naked.jsonl and NC_congress_plans.tagged.jsonl, respectively.

If the JSON records are in tagged format, metadata records are passed through unchanged, as are any other non-plan records.
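Converting a naked record to the tagged format is straightforward. A minimal sketch (tag_plan is a hypothetical helper, not one of the shipped converters):

```python
import json


def tag_plan(plan):
    """Wrap a naked geoid:district mapping in the tagged record format:
    "_tag_" is set to "plan" and the assignments go under the "plan" key."""
    return {"_tag_": "plan", "plan": plan}


# Geoids and district numbers here are made up for illustration.
naked = {"37001000101": 1, "37001000102": 2}
print(json.dumps(tag_plan(naked)))
```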

In addition to any records simply passed through, the output stream contains a record for each plan with the geoid:district assignments in the "plan" key and the district-level aggregates in the "aggregates" key. Aggregates hierarchically encode the type of dataset (census, vap, cvap, election, shape), the dataset key (defined by DRA in the README for the GeoJSON), the aggregate name (e.g., “dem_by_district”), and the aggregates by district. For example:

{"election": {"E_16-20_COMP": {"dem_by_district": [...], ...} ...} ...}.

The first item in each list of values is a state-level aggregate, and the rest are district-level aggregates for districts 1 to N.

You can see an example in testdata/examples/NC_congress_aggs.100.jsonl.
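Since index 0 of each list holds the statewide aggregate, a small helper can pull out just the district-level values. A minimal sketch (district_values is a hypothetical helper; the record layout follows the example above):

```python
def district_values(aggregates, kind, dataset, name):
    """Return the district-level entries for one aggregate, dropping the
    leading statewide entry at index 0."""
    values = aggregates[kind][dataset][name]
    return values[1:]


# Toy aggregates: statewide total first, then districts 1..N.
aggs = {"election": {"E_16-20_COMP": {"dem_by_district": [100, 40, 60]}}}
print(district_values(aggs, "election", "E_16-20_COMP", "dem_by_district"))
```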

There are some helper scripts to convert alternative formats into the tagged format that can be ingested by the aggregate.py script.

Scoring Plans

This script reads a stream of plans with district aggregates from STDIN, scores the plans, and writes the plan-level scores to STDOUT along with the district-level aggregates.

cat path/to/plans_plus_aggregates.jsonl \
| scripts/score/score.py \
--state xx \
--plan-type congress \
--data path/to/input_data.jsonl \
--graph path/to/adjacency_graph.json > path/to/scores_plus_aggregates.jsonl

Analogous to the aggregate.py script output, scoring writes plan-level scores in a hierarchical JSONL format: the type of dataset (census, vap, cvap, election, shape), the dataset key, and the metric name and value. For example:

{"election": {"E_16-20_COMP": {"estimated_vote_pct": 0.4943, ...} ...} ...}.

Writing Scores to Disk

This script reads a stream of scores with district aggregates from STDIN, and writes the plan-level scores to a CSV file and the district-level aggregates to a JSONL file.

cat path/to/scores_plus_aggregates.jsonl \
| scripts/write.py \
--data path/to/input_data.jsonl \
--scores path/to/scores.csv \
--by-district path/to/by-district.jsonl

To keep the plan-level scores a simple CSV, this script “flattens” the hierarchical JSONL format into flat CSV fields. By default, the field names are simply the metric names. However, if you specify the --prefixes option, or if multiple election datasets are scored, the script prefixes each metric name with its dataset key, e.g., E_16-20_COMP.estimated_seats.
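The flattening described above can be sketched roughly as follows (this is an illustrative flatten_scores helper, not the actual implementation in the write script):

```python
def flatten_scores(scores, prefixes=False):
    """Flatten {kind: {dataset_key: {metric: value}}} into flat CSV fields,
    optionally prefixing each metric name with its dataset key."""
    flat = {}
    for kind, datasets in scores.items():
        for dataset_key, metrics in datasets.items():
            for metric, value in metrics.items():
                name = f"{dataset_key}.{metric}" if prefixes else metric
                flat[name] = value
    return flat


row = {"election": {"E_16-20_COMP": {"estimated_vote_pct": 0.4943}}}
print(flatten_scores(row, prefixes=True))
```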

If you want output in a different format, you can post-process the output of the score.py script with a small script of your own or directly, e.g., using jq.