Data Sources & Methodology

This methodology page explains how Digital Democracy works: how our data is collected and where it comes from, the time periods reflected in the data, our efforts to establish the quality of the data, and the limitations in using the data for analysis. For information about the purpose and mission of Digital Democracy, please see our “About” page.

Also, we are committed to the transparency and integrity of this project, so if you have a comment, correction or feedback, please share your input here.

Data sources

The database created by Digital Democracy is unprecedented. It combines a variety of information and data from throughout state government that is publicly available, but isolated and difficult to access. This database combines these related data sources so they can be compared to reveal relationships, patterns and aberrations in the Legislature and the policy-making process. 

The database includes four categories of information:

  • Transcripts: Every word uttered in a public hearing or floor session is captured in the database.
    • This data is often available on Digital Democracy within 48 hours of a hearing. 
    • By April 2024, the archive of transcripts will span from January 2023 to present. 
  • Bill information: The database includes the text of every bill along with its amendments, analyses, votes, supporters and opponents.
    • This data is available within hours or days of when a bill is introduced or amended and as it proceeds through committee and floor votes.
    • Digital Democracy includes bill information dating back to 2013.
  • Financial information including campaign donations, independent expenditures, political party spending, gifts, travel and behests.
    • Data for campaign donations, independent expenditures and political party spending cover the period from 2013 to 2023, with periodic updates from OpenSecrets.
    • Data for gifts to legislators and sponsored travel is filed annually with the Fair Political Practices Commission. Digital Democracy contains this data dating back to 2022.
    • Behests, or payments made to third parties at the request of a legislator, are required to be posted within a month of the payment. The Digital Democracy database is updated daily and includes a record of payments dating back to 2013.
  • District data including voter registration, election results and demographics.
    • Data about voter registration is provided by the California Secretary of State. Digital Democracy includes data from 2022.
    • Election results are updated after each election.
    • Demographic data is drawn from the U.S. Census Bureau’s American Community Survey. The Digital Democracy data is from 2022.

Understanding our data

One of the greatest challenges in using data to create transparency in state government is that there is no standardized identification required when a person, organization or company testifies in a public hearing or donates money to a campaign. 

As a result, it’s difficult to know if the “Bob Smith” who testified in 2023 is the same “Bob Smith” who testified in 2024. Similarly, we may have information under different versions of the same entity, like “ACLU” versus “American Civil Liberties Union,” or under different subgroups, such as “Association for Commuter Transportation” versus “Association for Commuter Transportation, Southern California Chapter.” We treat these entities as the same in our data analysis. Conversely, entities such as “US Chamber of Commerce” and “San Luis Obispo Chamber of Commerce” are not the same organization and do not have a parent/chapter relationship, even if they sound similar.

Digital Democracy uses technology and human judgment to discern how these data points should be recorded and linked to other records. For example, we use facial recognition technology and human review to help us understand if “Bob Smith” in 2023 is the same person as “Bob Smith” in 2024. Artificial intelligence combined with human review also helps us understand which entities are related and which are not when they have similar names.
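To make the matching problem concrete, here is a minimal sketch in Python of how names like those above could be normalized and compared. It is an illustration only, not the system Digital Democracy actually runs: the alias table, the function names and the 0.6 similarity threshold are all hypothetical, and the final step simply flags look-alike names for a human to review.

```python
import re
from difflib import SequenceMatcher

# Hypothetical alias table mapping known short forms to a canonical name.
ALIASES = {"aclu": "american civil liberties union"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, apply known aliases."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    cleaned = " ".join(cleaned.split())
    return ALIASES.get(cleaned, cleaned)


def is_chapter_of(child: str, parent: str) -> bool:
    """Treat 'Parent, Some Chapter' as a subgroup of 'Parent'."""
    c, p = normalize(child), normalize(parent)
    return c != p and c.startswith(p + " ")


def same_entity(a: str, b: str) -> bool:
    """True when two names normalize identically or one is a chapter of the other."""
    return normalize(a) == normalize(b) or is_chapter_of(a, b) or is_chapter_of(b, a)


def needs_review(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag names that look alike but were not matched, so a person can decide."""
    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return not same_entity(a, b) and score >= threshold
```

Under these assumptions, “ACLU” and “American Civil Liberties Union” resolve to the same entity, the Southern California chapter is linked to its parent association, and the two chambers of commerce are kept separate. A simple prefix rule like this would miss many real-world variations, which is why human review remains part of the process.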

Even with the technology and human review, the database is still not precise. For that reason, a name typed into a search bar may still return no results or too many similar results. As we develop Digital Democracy, we will continue to improve the quality of the database. The technology we use is also evolving rapidly and becoming more efficient. But without an identification system for donors or those who testify, the database will always be imprecise and the results of a search or analysis should be considered with that context.

The problems with identification are found throughout the Digital Democracy database and website. In the transcripts, the name of a speaker may not be captured accurately if it is mumbled by the speaker, misspelled by the transcription program or improperly captured due to human error. The inconsistency in names — for people, organizations or companies — also means that some financial transactions may not be captured correctly. 

The challenge is important for Digital Democracy because this unique database is designed to capture and compare all of the interactions of a person or organization that are recorded in a variety of isolated and separate sources. For example, a Digital Democracy search for Chevron should reveal information from the Secretary of State’s office about lobbyists the company employs and donations the company has made to legislators; testimony by a company representative from the transcript of a hearing; data about gifts, travel or behests involving a legislator from forms at the Fair Political Practices Commission; and positions on bills from the California Legislative Information website. That broad analysis requires an accurate match within the Digital Democracy database of dozens of Chevron data points created over several years by separate offices.

Transcripts

Under Proposition 54, approved by voters in 2016, the state Senate and Assembly are required to videotape public hearings and floor sessions and to post the video online within 24 hours. Digital Democracy sends the video to an online transcription service that uses artificial intelligence to transcribe the audio within a few hours. Speech-to-text technology is far from perfect, however, even with the latest advances in AI. In many applications, a 10% or 20% error rate is common.

To significantly improve the quality of the transcript, a team of Digital Democracy contractors reviews each transcript to correct errors. Speakers are also identified with facial recognition technology, but human editors confirm the identification and link speakers to related information about that person in the database. To increase accuracy, the names of legislators and registered lobbyists are hard-coded into the program. To a limited extent, we use internet searches to identify some other speakers.

Videos are also cut into smaller segments corresponding to bill discussions. Each segment of video and the corresponding transcript are indexed so that a search can precisely identify bill discussions, speaker quotes or keywords.

The human editing process takes at least two hours of review for every hour of video. In all, when Digital Democracy launched in March 2024, more than 1,600 hours of video had been processed and more than 12,000 individuals had been identified and stored in the database.

Legislation

The database captures information about the policy-making process from the California Legislative Information website, also known as “Leginfo.” That data includes the text of bills and amendments, committee and floor votes, bill analysis, supporters and opponents who registered official positions, governor vetoes, a history of a bill’s progress and the current status of the bill.

The Leginfo data can be seen in several places on the Digital Democracy website:

  • Legislator pages: Leginfo data is used to create a “bill activity” graphic that displays how many bills a legislator has authored and how many have passed or failed. It is also used to display all of the bills authored by each legislator.
  • Bill pages: There is a web page for every bill introduced in the Legislature that draws data from Leginfo including text, status, analysis, votes, supporters and opponents.
  • Hearing pages: Web pages for each public hearing include data from Leginfo about legislation considered in the hearing.
  • Topic pages: The web pages focused on six major state topics include data from Leginfo about all of the current bills related to each topic.
  • Search directories: The data from Leginfo also appears on directory pages produced by search queries.

Financial transactions

We consider two categories of financial information: money given to help a candidate win an election and financial transactions involving an incumbent legislator, which we describe as “influence.” 

Election money

Political campaigns in California have to disclose their contributors to the Secretary of State. The data contains some information about the donor, the date of the payment, and the amount of money. However, categorizing political donations by economic sector can be difficult. What categories should be used? Is a company like Tesla a car company or a tech company? What about delivery services like DoorDash? Amazon?

CalMatters uses categories identified by OpenSecrets, a national nonprofit dedicated to comprehensive, nonpartisan analysis of political donations to state and federal officeholders. OpenSecrets, previously known as Follow the Money, has been a trusted source of campaign finance data for decades and is widely cited by major media organizations. In California, OpenSecrets gets updates at least twice a year from the Secretary of State, processes the data, and then gives it to CalMatters.

The categorization system divides the entire economy into 20 sectors. Each of those sectors is divided into industries, which are further segmented into 438 total business categories. There is also a catch-all sector called “Uncoded” for contributions that have yet to be categorized.

Because the data can have nuances (such as different name spellings, the inclusion of middle initials, or a slightly different version of the company name) all of this categorization is done by a person, either at OpenSecrets or at CalMatters. We go contributor by contributor and do our best to accurately capture the main economic interest of that person, company, or organization.

Influence

We describe financial transactions with an incumbent legislator as influence. Influence is divided into three subcategories: gifts, sponsored travel and behests. The data is contained on the Form 700, which legislators submit annually to the Fair Political Practices Commission. They are required to disclose stock, property, and business interests as well as any gifts they received or any trips they took at somebody else’s expense.

We display data about influence money on the pages for each legislator including: 

  • Personal gifts: Legislators are not allowed to accept gifts of more than $10 per month from registered lobbyists. Gifts from any other single source are limited to $590 in a calendar year.
  • Sponsored travel: Legislators are allowed to have their expenses paid for domestic and international travel that is related to their work including, for example, trips to conferences or to tour places facing issues similar to California. There is no limit on the amount of travel expenses a legislator can accept. The trips are often funded by companies or organizations with interest in state policy.
  • Behests: These are unlimited payments made “at the behest” — or request — of a legislator to a third party. The behest might involve, for example, a legislator requesting a donation be made to a school district or a job fair. Sometimes the behested payments are made to a charity run by the legislator that might host networking events, retreats, conferences and other activities. Behests must be reported if they exceed $5,000 from a single source in a year.
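As a rough illustration of how the thresholds above could be applied to disclosure records, the sketch below totals payments per source and reports any total that exceeds a limit. The record layout and the sample names and amounts are invented for the example; only the dollar thresholds come from the descriptions above. This is not the code behind the legislator pages.

```python
from collections import defaultdict

# Thresholds as described above, in dollars.
LOBBYIST_GIFT_LIMIT_PER_MONTH = 10
GIFT_LIMIT_PER_YEAR = 590
BEHEST_REPORTING_THRESHOLD = 5_000


def totals_over(records, threshold):
    """Sum amounts per key and keep only the keys whose total exceeds the threshold."""
    totals = defaultdict(float)
    for key, amount in records:
        totals[key] += amount
    return {key: total for key, total in totals.items() if total > threshold}


# Hypothetical records for one legislator in one calendar year: (source, amount).
gifts = [("Acme Corp", 300), ("Acme Corp", 300), ("Beta LLC", 500)]
behests = [("Acme Corp", 3000), ("Acme Corp", 2500), ("Beta LLC", 5000)]
# Lobbyist gifts are keyed by (lobbyist, month) because that limit is monthly.
lobbyist_gifts = [(("Lobbyist X", "2023-03"), 8), (("Lobbyist X", "2023-03"), 5)]

over_annual = totals_over(gifts, GIFT_LIMIT_PER_YEAR)
reportable = totals_over(behests, BEHEST_REPORTING_THRESHOLD)
over_monthly = totals_over(lobbyist_gifts, LOBBYIST_GIFT_LIMIT_PER_MONTH)
```

Grouping by source before comparing matters because the limits apply to the combined total from a single source, not to each individual payment.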

Ideology

Is it really possible to depict a person’s political ideology in all of its nuance and complexity with a single number? Of course it isn’t. But by looking at how often certain legislators vote with one another, we’ve come up with a starting point to give readers a better sense of how lawmakers stack up.

To do it, we gathered all the “aye” and “no” votes from every assembly member and state senator from the last legislative session. That includes floor session votes, but also votes in committee. After excluding resolutions, which are just procedural or symbolic, we were left with 18,616 separate roll calls to analyze. We then fed that long list of votes into software written by political scientists at UCLA, USC, the University of Georgia and Rice to come up with a measure of ideological “distance” — how close or far apart different lawmakers are to one another based on their voting behavior.

An example: San Francisco’s Phil Ting and San Diego’s Chris Ward, both liberal Democrats, voted together about 96% of the time in a previous session. Contrast that with Fresno Republican Jim Patterson, whose votes overlapped Ting’s only 63% of the time. Feed those patterns through the software and Ting and Ward are ideological neighbors whereas Patterson lives on the far side of town from both of them.
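The “voted together” percentages are simple to compute. The sketch below shows one plausible way to do it, assuming each legislator’s votes are stored as a mapping from a roll call identifier to “aye” or “no”. That data layout and the function name are assumptions for illustration. The sketch covers only the pairwise comparison, not the scaling software itself.

```python
def agreement_rate(votes_a, votes_b):
    """Share of shared roll calls on which two legislators cast the same vote.

    Each argument maps a roll call id to "aye" or "no". Roll calls where
    either legislator did not vote are ignored. Returns None if the two
    never voted on the same roll call.
    """
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return None
    same = sum(1 for roll_call in shared if votes_a[roll_call] == votes_b[roll_call])
    return same / len(shared)


# Hypothetical votes: the two legislators share roll calls 1, 2 and 3.
legislator_a = {1: "aye", 2: "aye", 3: "no", 4: "aye"}
legislator_b = {1: "aye", 2: "no", 3: "no", 5: "aye"}
rate = agreement_rate(legislator_a, legislator_b)  # 2 of 3 shared votes match
```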

That “distance” is assigned a number between -1 and 1, but we converted it to a scale from 0 to 100, with all the liberals clustered around 0 and the conservatives at the top of the range.
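The conversion can be sketched as follows, assuming it is a simple linear rescaling. The exact formula is not stated here, so treat this as one plausible reading rather than the confirmed method.

```python
def rescale(score: float) -> float:
    """Map a score in [-1, 1] onto [0, 100], assuming a linear conversion."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be between -1 and 1")
    return (score + 1.0) * 50.0
```

Under this assumption, a score of -1 becomes 0, a score of 0 becomes 50 and a score of 1 becomes 100.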

Political scientists have been tinkering with some version of this method, called NOMINATE, since the early 1980s. You can read more about how we’ve used this method in the past here.

District data

District demographics for race/ethnicity, median household income, median age, poverty rate and educational attainment come from the U.S. Census Bureau American Community Survey.

Election results data come from the California Secretary of State’s Elections Statistics.

District map boundary data come from the Statewide Database.

District partisanship — whether a district is considered safe Democrat, safe Republican or competitive — is based on an analysis by CalMatters reporters looking at voter registration data, election results from the Secretary of State and other information.

Legislator information

Contact information comes from the official California state Senate and Assembly websites.

Social media information comes from CalMatters data collection.

Legislator gender, race/ethnicity, sexual orientation, birth date, birthplace and residence data come from CalMatters data collection from elected officials’ offices, the California State Library and Political Data Inc.

Legislator background data come from official ballots on the California Secretary of State Prior Elections page.