Boosting Scoring Throughput
By default, the SCORE.sh script calculates all metrics (“scores”) for all plans in an input ensemble.
When the ensemble is large, this can take a long time.
You can use a combination of two techniques to increase scoring throughput:
- “Shard” the ensemble of plans into files with fewer plans, e.g., divide it into 10 smaller files, and score the shards in parallel.
- Score independent categories of metrics—“general”, “partisan”, “minority”, “compactness”, and “splitting”—separately and in parallel.
Used together, these techniques can substantially boost throughput.
The basic process is as follows:
- Shard the ensemble of plans, e.g., into 10 shards.
- Score each category of metrics for each shard separately, using the --mode argument to SCORE.sh.
- “Vertically” concatenate each set of shards back into one file of scores or by-district measurements for each category.
- Optionally, “horizontally” join all the category files into one overall scores or by-district file.
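The fan-out in the second step can be sketched in bash roughly as follows. This is a minimal illustration, not part of the repository: the shard file names are hypothetical (use whatever SHARD.sh actually produces), the assumption that --mode accepts the category names exactly as written above is unverified, and score_one is a placeholder that only prints what it would do, because the other arguments SCORE.sh requires are not described in this section.

```shell
#!/usr/bin/env bash
# Sketch only: launch one scoring job per (shard, category) pair.
set -euo pipefail

# Category names as listed above. That --mode accepts exactly these
# strings is an assumption.
modes=(general partisan minority compactness splitting)

# Hypothetical shard file names; substitute the files SHARD.sh produces.
shards=()
for i in {0..9}; do
  shards+=("plans_shard_${i}.jsonl")
done

# Placeholder for the real scoring call. Replace the echo with your
# SCORE.sh invocation, passing --mode "$mode" plus whatever input and
# output arguments the script requires.
score_one() {
  local shard="$1" mode="$2"
  echo "would score ${shard} with --mode ${mode}"
}

job_count=0
for shard in "${shards[@]}"; do
  for mode in "${modes[@]}"; do
    score_one "$shard" "$mode" &   # run each job in the background
    job_count=$((job_count + 1))
  done
done

wait   # block until every background job has finished
echo "launched ${job_count} jobs"
```

With 10 shards and 5 categories this starts 50 jobs at once. On a machine with fewer cores you may want to cap how many run concurrently, for example by calling wait after each batch.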
There are three utility bash scripts in the scripts/throughput directory to support this:
- SHARD.sh – e.g., scripts/throughput/SHARD.sh /path/to/plans.jsonl
- CONCAT_FILES.sh – e.g., scripts/throughput/CONCAT_FILES.sh /path/to/csvs "NC_congress_scores_compactness_*.csv" or scripts/throughput/CONCAT_FILES.sh /path/to/csvs "NC_congress_by-district_compactness_*.jsonl" --no-header
- JOIN_CSVS.sh – e.g., scripts/throughput/JOIN_CSVS.sh /path/to/csvs "NC_congress_scores_*.csv"
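To make “vertical” concatenation concrete, here is a small self-contained illustration using awk. It is not the implementation of CONCAT_FILES.sh, only a sketch of the idea: stack per-shard CSVs and keep the header row once. The file names, column names, and values are made up for the demo.

```shell
#!/usr/bin/env bash
# Illustration only: stack shard CSVs, keeping a single header row.
set -euo pipefail

workdir="$(mktemp -d)"

# Two made-up shard files that share the same header.
printf 'map,score\n1,0.50\n2,0.61\n' > "${workdir}/demo_scores_compactness_0.csv"
printf 'map,score\n3,0.47\n' > "${workdir}/demo_scores_compactness_1.csv"

# FNR is the line number within the current file, NR across all files.
# Skip line 1 of every file except the very first one.
combined="$(awk 'FNR == 1 && NR != 1 { next } { print }' \
  "${workdir}"/demo_scores_compactness_*.csv)"

echo "$combined"
rm -rf "$workdir"
```

For files with no header row, such as the by-district .jsonl files in the --no-header example above, stacking reduces to plain concatenation with cat.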
The SCORE.sh script takes an optional --mode argument to specify which category of metrics to calculate.
The default is to calculate all categories.