Methodology

A portion of the research for “Pretty Lady Cadres” draws on a dataset that ChinaFile constructed, consisting of biographical information about the top Party and state leaders at the provincial, municipal, and county levels of China’s government. Each administrative level has both a Chinese Communist Party Secretary and a head of government (a governor, a mayor, or their equivalent, depending on the level and the specific province or region). In the administrative hierarchy, there are multiple municipalities under each province, and multiple counties under each municipality.

This information was hosted in fairly complete form on a government website maintained by the Party’s flagship newspaper, The People’s Daily, until sometime in 2018, when the site was overhauled and information about county-level leaders was removed. To construct our dataset, ChinaFile therefore relied on the Internet Archive, a non-profit that archives select webpages over time.

By scraping the Internet Archive’s records, we were able to obtain information about the top government and Party leaders at the county, municipal, and provincial levels. Though we were unable to reconstruct the entire database, our dataset includes nearly 4,300 county or county-equivalent localities throughout China.

The available data varied by leader and locality, but usually included some mix of the person’s name, age, gender, ethnic background, education, hometown, and previous work history as of 2017.

We then wrote a computer program to parse the leaders’ biographies and extract any stated gender, allowing us to see how many leaders’ resumes included this information.
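
A minimal sketch of this kind of parsing, in Python, follows. It is illustrative only and is not the program ChinaFile used. It assumes, hypothetically, that a biography lists gender as a standalone field ("男" for male, "女" for female) set off by commas or whitespace; the function name and the sample biographies in the usage note are invented for the example.

```python
import re

# Assumption: gender appears as its own field, e.g. "张三，男，汉族，1965年生".
# The character must be bounded by a separator (or the string edge) on both
# sides, so a "男" or "女" inside a longer word or name is not matched.
GENDER_FIELD = re.compile(r"(?:^|[，,、\s])(男|女)(?=[，,、\s。]|$)")


def extract_gender(bio):
    """Return "male", "female", or None if the biography states no gender."""
    match = GENDER_FIELD.search(bio)
    if match is None:
        return None
    return "male" if match.group(1) == "男" else "female"


def count_missing_gender(bios):
    """Count the biographies in which no gender field could be found."""
    return sum(1 for bio in bios if extract_gender(bio) is None)
```

Calling `count_missing_gender` on a list of biography strings gives the number of entries that would need a name-based guess instead.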

Of the roughly 9,900 leaders listed in our dataset, about 2,900 had entries that did not include gender information. For these entries, we turned to the existing program “ngender,” which uses historical name data to guess a person’s gender based on their name. The program also gives a “confidence rate” for each of its guesses. Running ngender against the entries for which we already had gender information, we determined that the program guessed correctly in nearly 93 percent of cases, and tended to make more accurate guesses for names we knew to be associated with men.
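
The validation step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual code: `guess` stands in for any name-based guesser that returns a (gender, confidence) pair, and `evaluate_guesser` is a hypothetical helper name.

```python
from collections import defaultdict


def evaluate_guesser(labeled, guess):
    """Measure how often a guesser agrees with known genders.

    labeled: iterable of (name, true_gender) pairs with known gender.
    guess:   callable taking a name and returning (gender, confidence);
             an assumed interface, supplied by the caller.
    Returns (overall_accuracy, accuracy_by_true_gender).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for name, truth in labeled:
        predicted, _confidence = guess(name)
        total[truth] += 1
        if predicted == truth:
            correct[truth] += 1

    n = sum(total.values())
    if n == 0:
        return None, {}  # nothing to evaluate
    overall = sum(correct.values()) / n
    by_gender = {g: correct[g] / total[g] for g in total}
    return overall, by_gender
```

Reporting accuracy separately for each known gender is what makes it possible to see whether guesses are more reliable for men's names than for women's.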

The average confidence rate for correctly guessed female entries was 85 percent. We then ran ngender against those entries for which we did not have gender information, manually checking any guess that had a confidence rate below 85 percent. This means that we cannot be sure every entry in our dataset has the correct gender, but we have manually corrected for the most likely incorrect guesses, and we feel confident that the overall trends are accurate.
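
As a rough sketch of the thresholding logic, assuming the same hypothetical `guess` interface (a callable returning a gender and a confidence between 0 and 1), the two steps might look like this. The function names are invented for illustration.

```python
def mean_confidence_when_correct(labeled, guess, gender="female"):
    """Average confidence over entries of `gender` that were guessed correctly.

    labeled: iterable of (name, true_gender) pairs with known gender.
    Returns None if no entry of that gender was guessed correctly.
    """
    confidences = []
    for name, truth in labeled:
        predicted, confidence = guess(name)
        if truth == gender and predicted == truth:
            confidences.append(confidence)
    if not confidences:
        return None
    return sum(confidences) / len(confidences)


def split_for_review(names, guess, threshold=0.85):
    """Separate guesses into accepted ones and ones needing a manual check.

    A guess goes to manual review when its confidence is below `threshold`.
    Each returned item is a (name, predicted_gender, confidence) tuple.
    """
    accepted, needs_review = [], []
    for name in names:
        predicted, confidence = guess(name)
        target = needs_review if confidence < threshold else accepted
        target.append((name, predicted, confidence))
    return accepted, needs_review
```

Guesses at or above the threshold are accepted without review in this sketch, which is why some residual error can remain in the final dataset.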

Jessica Batke and Shen Lu