
Election Data
This project has two interrelated parts. The first aims to collect and standardize all of the election data on state websites that is “easily” available, most of which dates from 1992 and later. The second is to collect precinct level election results for key states for periods earlier than those posted on state websites, often going back to the late 1960s. This involves optical character recognition of images of election returns, a process that is also necessary for many later years.
IMPORTANCE
There is a wealth of information on state government websites, including
- Precinct level election results
- Voter registration and participation data, sometimes broken down by demographic characteristics
- Candidate filing information
Maintaining public records of elections is the responsibility of the states, and as a result those records come in a bewildering variety of structures, naming conventions and formats. Some are trivially easy for an expert to work with, while others are very time consuming (e.g., Mississippi 2018 precinct results).
Such data should be fully standardized across states and ready for frictionless use by interested parties. Whether candidates are the same individual running in different years should also be established.
Existing election datasets are:
- Missing for many states and years
- In multiple disparate sources, which must be cobbled together for large scale analyses
- Not sufficiently standardized
- Almost entirely absent for voter registration and participation data
- Entirely absent for candidate filing information
- Heavily skewed toward national over state offices, and almost always missing local offices
All of the important uses of the data mentioned in the overview apply to this specific project.
MY WORK UP TO NOW
I’ve created a suite of commands—Election Hoard Software—that radically expedites the collection, restructuring and standardization of election data. It automates the process between “decision nodes” where human judgment is necessary. Commands take relevant election laws into account and are also driven by extensive dictionaries. The code behind these commands comes to 69 pages, and its codebook is 34 pages.
In early 2019, I went through the 2018 election results for all states three separate times as I created and improved the software (854 hours).
My log shows 3,350 hours working on election data from Sep 3, 2014 to April 26, 2020—which understates the amount and doesn’t include time spent on the State Legislative Election Returns database.
Other work already completed includes:
- downloading a large portion of what is on state websites
- compiling many states and years of data (the lion’s share of work)
- acquisition of many primary sources (see “state government primary sources collection”)
Another measure of the extent of my work is that the folder holding much of my election data contains 52,937 files.
Criticism of Alternative Models of Data Collection
Alternative #1: The Absentee P.I. Model
The absentee principal investigator model is exemplified by the M.I.T. Election Data Science Lab (MEDSL), which has been funded to post the precinct results for the 2018 general elections, among many other things.
Under this model, an “absentee P.I.” heads the organization, while primary responsibility for overseeing data compilation lies with a post-doc who manages a number of graduate students, some of whom work as little as ten hours a week on the project. This duplication of effort, with different people solving the same problems over and over, creates enormous inefficiencies.
One sign of this inefficiency is that 18 months after the November 2018 elections (as of May 4, 2020), MEDSL still has not posted precinct results for six states (IN, KY, ME, MS, SD and UT). I finished compiling two of these (MS and SD) 10 months ago with no financial support.
Another large inefficiency is caused by only working on one election year at a time. Election results are often presented in identical or at least very similar ways by one state over time. At the extreme, the data for ten election years for one state can be compiled in under twice the time it would take to do one. An expert who can work quickly with the data can broaden the scope of what they’re doing and take advantage of economies of scale.
What is most wrong with the absentee P.I. model is that none of the parties involved are particularly interested in this aspect of what they’re doing. In Political Science, data collection is often perceived as trivial and academics are rewarded for publishing research examining the relationship between phenomena.
Let MEDSL do what MEDSL is most interested in doing—analyzing the data that a specialist compiles for the community at large. They are over-extended and funders will get a higher return on their dollar backing a different model of data collection. If MEDSL wishes to post the data I collect on their website, I think that would be great, as long as I have sufficient funding to continue my data collection.
Alternative #2: The Volunteer Model
This model is exemplified by the volunteer-driven OpenElections.
The idea of citizen volunteers contributing to the common good is inspiring but the reality isn’t. Volunteers are not able to put the time in to process disparately presented election data in a timely manner. Many of their volunteers are technically adept, but still have not acquired specialist tools and skills that enable high levels of productivity. Another inefficiency is that concessions must be made in order to keep them motivated.
OpenElections is also subject to the inefficiency mentioned about MEDSL: volunteers generally work with one state for one election year. Since volunteers’ time is limited, they can’t take advantage of economies of scale.
The group has been slow to post the results of the 2018 elections. Looking at their website, very few states and years of precinct results have been posted. Their GitHub does indicate that they’ve now almost completed data collection for 2018. It is their goal to post data going back to 2000, but a very small proportion of precinct results have been collected since the beginning of the project four or more years ago, even on their GitHub.
There are many tasks that volunteers are well suited for. Contacting local election offices to obtain precinct results for states that do not post them (e.g., Maine, New York), physically going to nearby offices to obtain results, and contacting election offices to resolve discrepancies are all things volunteers would excel at. If volunteers’ time were redirected in this way, their full potential and motivation could be utilized. They would probably even have a lot more fun.
OpenElections already has a high-quality website and a well-known presence, and it would be a good thing for them to post the data I compile on their website. There is enough room for everyone to play a role and get credit. My main concern is that I want to collect these data and make them public, and for that I need funding.
Historic Precinct Results
This portion of the project involves the collection of precinct results back to the late 1960s for several key states, and back to 1992 for many other states. Because the primary sources for this are old, there is more emphasis on optical character recognition of images (often low quality) of election data.
IMPORTANCE
All of the important uses of the data mentioned in the overview are relevant to this project.
Analyses of elections should meet three conditions:
- They shouldn’t depend on survey data
- The unit of analysis should be as small as possible, which generally means precinct level data
- Samples should span many times and places
The first and second points are now widely understood, but the third isn’t. Many believe they are engaged in “big data” analysis if the N of their study is large. But if such large N studies only include the last decade or so, they omit a diversity of background conditions, which in turn results in less accurate forecasts. This is why precinct level election data over long time periods is needed.
Another aspect of this project, which has only been completed for one state, is to code which state legislative district (both house and senate) each precinct is in. A side benefit is that this will often enable census geographies to be matched to state legislative districts for years prior to the 1990s. Combining this information with precinct results for statewide offices will then enable analyses of partisan gerrymandering.
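To make the last step concrete, here is a minimal sketch of how a precinct-to-district coding could be combined with statewide precinct results to get district-level totals, which is the usual starting point for gerrymandering measures. The function names, the record layout (`precinct`, `party`, `votes`), and the district labels are assumptions made for illustration, not the format of the actual files.

```python
from collections import defaultdict


def aggregate_by_district(precinct_results, precinct_to_district):
    """Sum statewide-office votes by legislative district.

    precinct_results: iterable of dicts with 'precinct', 'party', 'votes'.
    precinct_to_district: dict mapping precinct name -> district label.
    Returns (totals, unmatched), where unmatched lists precincts
    that have no district assignment yet.
    """
    totals = defaultdict(lambda: defaultdict(int))
    unmatched = []
    for row in precinct_results:
        district = precinct_to_district.get(row["precinct"])
        if district is None:
            unmatched.append(row["precinct"])
            continue
        totals[district][row["party"]] += row["votes"]
    return {d: dict(by_party) for d, by_party in totals.items()}, unmatched


def two_party_share(totals, party="DEM", other="REP"):
    """Share of the two-party vote won by `party` in each district."""
    shares = {}
    for district, votes in totals.items():
        denom = votes.get(party, 0) + votes.get(other, 0)
        if denom:
            shares[district] = votes.get(party, 0) / denom
    return shares


# Made-up numbers for a single statewide contest.
results = [
    {"precinct": "P1", "party": "DEM", "votes": 60},
    {"precinct": "P1", "party": "REP", "votes": 40},
    {"precinct": "P2", "party": "DEM", "votes": 30},
    {"precinct": "P2", "party": "REP", "votes": 70},
    {"precinct": "P3", "party": "DEM", "votes": 10},
]
crosswalk = {"P1": "HD-1", "P2": "HD-1"}

totals, unmatched = aggregate_by_district(results, crosswalk)
shares = two_party_share(totals)
```

Precincts without a district assignment are returned separately rather than silently dropped, since gaps in the coding are exactly what still needs manual work.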
MY WORK UP TO NOW
The years 1984-1990 haven’t been targeted because they are already covered by an existing dataset (ROAD). The scope of targeted offices varies: some states include only president and governor, while others include as many as 12 offices.
These data are clean for the vast majority of state-years, although discrepancies in sources still have to be resolved.
The tables for the seven states requiring the most work for this project encompassed 1.14 miles of pages.
The 11 states below account for the vast majority of time spent, which was 1,412 hours.
Workers also spent several hundred hours cleaning these files (documented in logs).
Code and cleaning routines were developed to make workers’ time most effective.
My time admittedly involved experimentation, adding to it substantially.
Hours spent on states most intensively worked on were as follows:
State A 1966-2002: 233
State B 1976-1996: 72
State C 1976-1982: 186
State D 1972-1994: 133
State E 1992-2002: 41
State F 1992-2002: 29
State G 1966-1982: 272
State G 1966-1982 assigning state legislative districts to precincts: 79
State H 1968-1994: 121
State I 1992-1994: 34
State J 1966-1998: 197
State K 1992-1994: 11
Total of above: 1,412