Nonprofit Mission Statements

In an earlier post, I shared some Python code that compiles nonprofit data from the Urban Institute’s National Center for Charitable Statistics. But those data are relatively limited. To get more detailed data about nonprofits, researchers are increasingly using the tons of information you can get from a nonprofit’s IRS filing (e.g., 990, 990EZ, 990PF, etc.). This has been made easier since the IRS started releasing machine-readable filings (of organizations filing their forms electronically). There are now a number of organizations and researchers, including Open990, where you can get the raw data and, in some cases, tabular datasets.

I’m sharing some code below that shows you how to 1) scrape the files from (what I can tell is) the best place to get the raw IRS electronic filings (in XML format), and 2) get the data from the XML files and put them into a tabular dataset that can be used for analysis. In this case, I’m only focusing on 990 filings from 2019, getting a relatively small subset of organizations, and getting only the mission statements of those nonprofits. The XML files have tons of other information, including executive compensation, grant recipients, an organization’s activities, etc.

# import libraries for scraping XML files from AWS site
import pandas as pd
import requests
import os
pd.set_option('display.max_colwidth', 100)
# import libraries for parsing XML files
from lxml import etree
from collections import Counter
import itertools
import glob

Step 1: Download Index File from AWS site

# URL of IRS 990 filings index file from 2019
irs_url_2019 = 'https://s3.amazonaws.com/irs-form-990/index_2019.csv'
# read in the index file and keep some variables as strings
orgs_2019 = pd.read_csv(irs_url_2019, dtype={'EIN' : str, 'OBJECT_ID' : str, 'RETURN_TYPE' : str})
# shape of index file
orgs_2019.shape
(396187, 9)
orgs_2019.head()
RETURN_IDFILING_TYPEEINTAX_PERIODSUB_DATETAXPAYER_NAMERETURN_TYPEDLNOBJECT_ID
016285381EFILE1330858922018095/10/2019 6:06:12 AMLOGOS ENCOUNTER INC99093493091012069201910919349301206
116279505EFILE6404118472018055/8/2019 9:46:22 PMMISSISSIPPI CHRISTIAN FOUNDATION99093493101010839201931019349301083
216279502EFILE8702135292018055/8/2019 9:46:20 PMINTL SOC DAUGHTERS OF UT PIONEERS99093493101010539201931019349301053
316279501EFILE2042234372018065/8/2019 9:46:19 PMSCHOOLHOUSE SUPPLIES INC99093493101010189201931019349301018
416279248EFILE4750668192018065/8/2019 9:13:09 PMMINDFUL LIFE PROJECT99093493099005419201910999349300541
# look at how many of each kind of return type there are
orgs_2019['RETURN_TYPE'].value_counts()
990      165201
990EZ     90003
990PF     61998
990O      50643
990EO     28342
Name: RETURN_TYPE, dtype: int64
# subset to just 990 filings (i.e., exclude EZ, PF, etc.)
orgs_2019_990 = orgs_2019[orgs_2019['RETURN_TYPE'] == '990']
### make sure there aren't duplicate object IDs, which will be used to scrape the actual XML files
# print number of rows in 990 dataframe
print (orgs_2019_990.shape[0])
# print number of unique object IDs
print (orgs_2019_990['OBJECT_ID'].nunique())
165201
165201
# make a list of all object IDs to iterate through later
ids_990 = orgs_2019_990['OBJECT_ID'].tolist()
len(ids_990)
165201

Create a folder to put all the XML files in

os.mkdir('returns_2019_990')

Main function to download files from AWS site and put them in the just-created folder

# Note: This just scrapes 100 files. Remove `[:100]` from first line of loop to download all 165K files
# Based on the output from the timeit module, this goes pretty quickly
%%timeit
for i in ids_990[:100]:
    url = 'https://s3.amazonaws.com/irs-form-990/{}_public.xml'.format(i)
    r = requests.get(url)
    with open('returns_2019_990/{}.xml'.format(i), 'wb') as f:
        f.write(r.content)
19.5 s ± 1.14 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Step 2: Parse Lots of XML Files

# put all downloaded file names into a list
xml_file_names = glob.glob('returns_2019_990/*.xml')
# 100 files
len(xml_file_names)
100
# example file name
xml_file_names[0]
'returns_2019_990\\201900989349301350.xml'
# create new list that takes out ".xml" from each file name
files_edited = []

for file in xml_file_names:
    files_edited.append(file[:-4])
# edited file name
files_edited[0]
'returns_2019_990\\201900989349301350'
# namespace used in XML parsing
ns = {'irs' : 'http://www.irs.gov/efile'}

Main function to parse a bunch of XML files and put text into lists

# This gets the text of a few fields (e.g., EIN, ActivityOrMissionDesc, MissionDesc) in the XML files
# and puts them, first, in a dictionary, and then, in a list. This will then make it easy to convert
# that into a dataframe.
allxmls_dfs = []
allxmls_df_lengths = []

for i in files_edited:
	root = etree.parse('{}.xml'.format(i)).getroot()
	
	# for child in root:
		# print(child.tag, child.attrib)
	
	full_list = []
	
	var_dict = {}
	
	ein = root.find('irs:ReturnHeader', ns).find('irs:Filer', ns).find('irs:EIN', ns)
	var_dict['ein'] = ein.text
	# print (ein.text)	
	act = root.find('irs:ReturnData', ns).find('irs:IRS990', ns).find('irs:ActivityOrMissionDesc', ns)
	if act is not None:
		var_dict['activity'] = act.text
	
	act2 = root.find('irs:ReturnData', ns).find('irs:IRS990', ns).find('irs:ActivityOrMissionDescription', ns)
	if act2 is not None:
		var_dict['activity2'] = act2.text
		
	mission = root.find('irs:ReturnData', ns).find('irs:IRS990', ns).find('irs:MissionDesc', ns)
	if mission is not None:
		var_dict['mission'] = mission.text
	
	mission2 = root.find('irs:ReturnData', ns).find('irs:IRS990', ns).find('irs:MissionDescription', ns)
	if mission2 is not None:
		var_dict['mission2'] = mission2.text
	
	full_list.append(var_dict)
		
	# print ('List length of {} xml is {}'.format(i, len(full_list)))
	
	tmp_df = pd.DataFrame(full_list)
    # create object ID without the file name extras
	tmp_df['OBJECT_ID'] = i[17:]
	# tmp_df = tmp_df.dropna(subset=['ein'])
	# print ('Dataframe shape of {} xml is {}'.format(i, tmp_df.shape))
	allxmls_dfs.append(tmp_df)
	allxmls_df_lengths.append(len(tmp_df))
# print out what one dictionary looks like
var_dict
{'ein': '753004503',
 'activity': 'To promote hockey for the youth and young adults in the Pikes Peak Region.',
 'mission': 'To promote hockey for the youth and young adults in the Pikes Peak Region.'}
# print out what above dictionary looks like as one-row dataframe
allxmls_dfs[99]
einactivitymissionOBJECT_ID
0753004503To promote hockey for the youth and young adults in the Pikes Peak Region.To promote hockey for the youth and young adults in the Pikes Peak Region.201940939349300424
# Make sure length of what's been parsed is equal to number of files
print ('Total length of all xmls is {}'.format(sum(allxmls_df_lengths)))
Total length of all xmls is 100

Step 3: Convert List to a Tabular Dataset

irs_2019_990 = pd.concat(allxmls_dfs, ignore_index = True)
# Make sure dataset shape is what we expect and that there are no duplicate object IDs
print ('Shape of full dataset is {}'.format(irs_2019_990.shape))
print ('Number of unique OBJECT_ID numbers is {}'.format(irs_2019_990['OBJECT_ID'].nunique()))
Shape of full dataset is (100, 4)
Number of unique OBJECT_ID numbers is 100
# Preview of dataset!
irs_2019_990
einactivitymissionOBJECT_ID
0942967138Support the educational activities of Grattan Elementary School - students, teachers & administr...Support the educational activities of Grattan Elementary School - students, teachers & administr...201900989349301350
1261095856To seek better treatment options for women with ovarian cancer by providing diagnostic services ...The mission is to seek better treatment options for women with ovarian cancer by providing diagn...201900989349301400
2990110027WE EDUCATE YOUNG WOMEN TO LEAD A LIFE OF ACHIEVEMENT. TO REALIZE OUR VISION WE WILL PROVIDE A VI...WE EDUCATE YOUNG WOMEN TO LEAD A LIFE OF ACHIEVEMENT. TO REALIZE OUR VISION WE WILL PROVIDE A VI...201900989349301410
3200725426SERAPHIC FIRE AIMS TO PRESENT HIGH-QUALITY PERFORMANCES OF UNDER-PERFORMED MUSIC WITH CULTURAL S...SERAPHIC FIRE AIMS TO PRESENT HIGH-QUALITY PERFORMANCES OF UNDER-PERFORMED MUSIC WITH CULTURAL S...201900989349301510
4521891697We advocate and embody independence and equality for all people with disabilities.Independence Now, Inc strives to facilitate independent thought and action by people with disabi...201900989349301520
...............
95133712927TO SUPPORT THE NEW ROCHELLE LIBRARY.THE NEW ROCHELLE PUBLIC LIBRARY FOUNDATION, A NON-PROFIT ORGANIZATION, RAISES RESOURCES AND PROV...201940939349300209
96931021970THE MISSION OF THE LAGUNA BEACH EDUCATION ENDOWMENT AND CAPITAL FUND IS TO PROMOTE EDUCATIONAL E...THE MISSION OF THE LAGUNA BEACH EDUCATION ENDOWMENT AND CAPITAL FUND IS TO PROMOTE EDUCATIONAL E...201940939349300214
97205595877TO SERVE IN THE AREA OF FIRE SAFETY AND SAVE LIFES AND PROPERTY DURING A TIMES OF DISASTERS AND ...TO SERVE IN THE AREA OF FIRE SAFETY AND SAVE LIFES AND PROPERTY DURING A TIMES OF DISASTERS AND ...201940939349300224
98561576543TO SERVE, EMPOWER AND MINISTER TO CLIENTS IN ORDER TO PREVENT AND END HUNGER AND HOMELESSNESS TH...TO SERVE, EMPOWER AND MINISTER TO CLIENTS IN ORDER TO PREVENT AND END HUNGER AND HOMELESSNESS TH...201940939349300304
99753004503To promote hockey for the youth and young adults in the Pikes Peak Region.To promote hockey for the youth and young adults in the Pikes Peak Region.201940939349300424

100 rows × 4 columns

# Export to .csv file
irs_2019_990.to_csv('irs_2019_990.csv', index = False)