Scraping ESG Scores
Getting Company and Mutual Fund ESG Scores from Yahoo Finance
This is a coding (R language) walk-through on scraping sustainability/ESG data (scores, ratings, and rankings) from Yahoo Finance, for both individual companies/stocks and mutual funds. These coding posts are more informal than my research articles and are about process, written for an audience of fellow sustainability/ESG coders and data analysts.
Legal/Ethics
Always scrape legally and ethically! Data scraping is ubiquitous. Websites know this and are generally permissive, albeit within reasonable bounds. From a legal standpoint, read the site’s “Terms of Service” (TOS) before writing a single line of code. From an ethics standpoint, it’s mainly about not being a burden and other common-sense courtesies.
- Be transparent. Send a custom HTTP header and user agent string containing your name, intent, and contact info.
- Don’t tax servers or smell like a DDoS attack. Send requests at a modest rate (smell like a human) and use timeouts.
- If there’s a public API use it even if that means a learning curve. The site invested in that API so that we don’t scrape.
- Honor robots.txt re: the specific path you’re scraping. A site might permit crawling in some areas but not others.
- Read the TOS; robots.txt is for crawlers but there’s a fine line between crawling and data mining, which is often prohibited.
- Be aware of downstream legalities. A site that permits scraping could be hosting third-party copyrighted material.
- Download only what you need and create something new from it. Don’t hoard data or re-package it untransformed.
- Show gratitude. If you’re scraping for a blog, give proper and prominent attribution and drive traffic with a backlink.
I. Context
ESG Investing in Theory
ESG investing1 is a specific strategy under the sustainable/socially-responsible investing (SRI) umbrella. ESG investing incorporates Environmental, Social, and Governance (ESG) factors into investment decisions. Typical examples are:
- E: Energy consumption and efficiency, emissions and pollutants, water usage, waste generation and disposal
- S: Employee diversity, LGBTQ+ rights, labor practices, community involvement, charitable giving, human rights
- G: Board and executive diversity, executive compensation, shareholder rights, disclosure and transparency
Early on, ESG investing was a niche equated with “principles over profits.” Recently, it’s gone more mainstream. Non-financial ESG data complements traditional financial metrics; together they can lead to higher risk-adjusted returns.
ESG Investing in Practice
Unfortunately, like a lot in the investing world, the cards are stacked in favor of institutional investors when it comes to SRI/ESG investing. ESG options are limited for the individual investor, either cost-prohibitive or time-prohibitive.
- Cost-Prohibitive: Ironically, the mainstreaming of ESG investing has led to extreme commercialization (and borderline monopolization) of ESG information. Third-party ESG data and products are insanely expensive.
- Time-Prohibitive: The DIY approach just isn’t practical for the casual investor. Freely-available raw ESG data — which commercial products re-package — isn’t standardized, requiring a steep learning curve to interpret.
Morningstar/Sustainalytics
Options are limited, but they do exist. Particularly accessible is Morningstar/Sustainalytics ESG data on Yahoo Finance. It’s freely available for both individual companies/stocks and mutual funds under the “Sustainability” tab on the quote page. Here is a company example, and here’s a fund example. The data is similar for each, with a few differences.
Note: I’m treating Morningstar/Sustainalytics interchangeably throughout this post. It’s more nuanced than that, but basically a few years ago Morningstar and Sustainalytics partnered to provide ESG data for mutual funds and ETFs on Yahoo. Sustainalytics was independently supplying company/stock ESG data. But fund ESG scores are based on a portfolio “roll-up” of Sustainalytics’ company scores, and in mid-2017 Morningstar acquired a 40% stake in Sustainalytics. So while there are differences, they’re basically the same re: ESG data.
This ESG data is basic (i.e., summary-level scores, ratings, rankings) but it’s a godsend for individual investors. First, it’s free and doesn’t require registration. Second, the underlying data comes from Sustainalytics and Morningstar, two reputable sources of ESG and fund research respectively. Usually I wouldn’t recommend Yahoo for investing, but this is different.
II. Process
Background
I assume some knowledge of R programming, HTML/CSS, and HTTP requests. This post won’t get too technical but it helps to know how I got the element, attribute, and value (table, id, constituents) when you see something like:
html_nodes("table[id='constituents']")
I’ll go over that more later. Another thing is that while I adhere to R language best practices (naming, styling, etc.) — i.e., code should be legible and self-explanatory — I do have two naming idiosyncrasies to be aware of:
- I preface object names with a three-letter descriptor such as lkp_ (for lookup tables), dat_ (data tables), fun_ (functions), var_ (variables), and tmp_ (temporary objects). It really helps keep track of things, particularly for large projects.
- Within the occasional unavoidable loop I suffix names using the iterator; e.g., “data.i” where “i” is the iterator. This lets me use the same root name in nested loops (data.i, data.x) and avoids accidental naming conflicts.
Be aware of how you name things. Careless naming can “delete” existing objects and break things. I cringe when coding tutorials use “data” for a data frame; it masks the base function data(). Lazy and irresponsible from a teaching standpoint.
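Purely for illustration, here’s what the prefix convention looks like on a few hypothetical objects (none of these are used later in the post):
lkp_sectors <- c(E = "Energy", U = "Utilities")       # lkp_ = lookup table
dat_prices  <- data.frame(ticker = "MMM", price = NA) # dat_ = data table
fun_square  <- function(x) x^2                        # fun_ = function
var_today   <- Sys.Date()                             # var_ = variable
tmp_check   <- nrow(dat_prices) > 0                   # tmp_ = temporary object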
Step 1: Preparation
The first few steps (all sub-steps under this section) are preparatory and generic; that is, they are the same for scraping companies as for scraping mutual funds.
Step 1a: Load Libraries
Here, we simply load the libraries (aka packages) that we’ll be using throughout:
- urltools (for a custom function)
- httr (to fetch website content)
- robotstxt (to scrape ethically)
- rvest (to parse HTML and XML)
- stringr (for string manipulation)
- dplyr (for data manipulation)
- tibble (to store data efficiently)
Note: The rvest, stringr, dplyr, and tibble libraries are part of the awesome tidyverse collection and could all be loaded via a single library (tidyverse) but for demonstration purposes I like to make the underlying libraries very clear.
library(urltools)
library(httr)
library(robotstxt)
library(rvest)
library(stringr)
library(dplyr)
library(tibble)
If you don’t have a particular library installed, it’s just a matter of install.packages("name"), where "name" is the name of the library/package, in quotes. Note that quotes are required for install.packages() but not library().
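If you’d rather not do that one package at a time, here’s a small sketch that installs only whatever is missing from the list above (pkgs and miss are just throwaway names):
# install any of the required packages that aren't already present
pkgs <- c("urltools", "httr", "robotstxt", "rvest", "stringr", "dplyr", "tibble")
miss <- setdiff(pkgs, rownames(installed.packages()))
if (length(miss) > 0) install.packages(miss)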
Step 1b: Write Functions
Following the DRY (“don’t repeat yourself”) principle of programming, I’ll first create a few functions. Doing so makes subsequent code a lot cleaner and reduces (virtually eliminates) the chance of error.
The first function (fun_parse) is the workhorse. For each iteration of a loop we’ll receive an XML document (a web page) as a response. This function extracts ESG data from that page. What data it extracts is determined by the XPath plugged into the function as an argument. The function’s workflow is as follows:
- extracts the content (initially a text string) of the node identified by the “xpath” argument
- if the string is empty and the XPath is for Controversy Rating return “None” (XPath quirks)
- if the string contains “% AUM” all non-numeric characters are stripped and a percent (as a decimal) is returned
- if the string contains all letters, it’s a rating (e.g., “Outperformer”); I only trim white space
- if the string contains any numbers at all it’s either a score, a percentile ranking, or a date
- if the string contains “percentile” it’s a ranking; numerals are extracted and divided by 100
- if the string contains “updated on” it’s a date; the date string is coerced into a date object
- anything left at this point is a score (1-100 scale); the text integer is converted to numeric
fun_parse <- function(xpath, xmldoc = page.i) {
  x <- xmldoc %>%
    html_nodes(xpath = xpath) %>%
    html_text(trim = TRUE)
  # special case: empty Controversy Rating node means "None"
  if (length(x) == 0 & xpath == '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div/div[2]/div[1]/span/span/span') {
    return("None")
  }
  if (grepl("% AUM", x)) {
    return(as.numeric(sub("% AUM", "", sub("based on ", "", x))) / 100)
  }
  if (!grepl("\\d", x)) {
    return(trimws(x))
  } else if (grepl("percentile", x)) {
    return(x %>% str_replace_all("[^0-9\\.]", "") %>% as.numeric() / 100)
  } else if (grepl("updated on", x)) {
    r <- sub("Last updated on ", "", x)
    r <- paste(unlist(strsplit(r, "/"))[2], unlist(strsplit(r, "/"))[1], sep = "-")
    return(anytime::anydate(r))  # anytime is called via ::, not loaded above
  } else {
    return(as.numeric(x))
  }
}
The above extracts most of what we want. The function below finishes the job. It parses the “Product Involvement Areas” table — which indicates involvement in controversial activities — and returns a list of those marked “Yes.” This data is only available for companies, not mutual funds.
fun_lists <- function() {
  x <- page.i %>%
    html_nodes(xpath = '//*[@id="Col2-3-InvolvementAreas-Proxy"]/section/table') %>%
    html_table() %>%
    data.frame()
  n <- sum(grepl("Yes", x[, 2]))
  if (n == 0) return(NA)
  if (n == 1) return(x[grep("Yes", x[, 2]), 1])
  if (n >= 2) return(list(x[grep("Yes", x[, 2]), 1]))
}
The last one is just a helper function, a convenience wrapper around paths_allowed() from the robotstxt library. It takes any URL as an argument and parses out the info needed for the remaining arguments. With it, we can use a single function to test robots.txt regardless of the target URL. It’s especially time-saving when scraping multiple sites (see the quick example after the function).
fun_robots <- function(url = link.i) {
  base_url <- paste0(url_parse(url)$scheme, "://", domain(url))
  paths_allowed(
    paths = sub(base_url, "", url),
    domain = domain(url),
    bot = "*"
  )
}
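For example, a couple of hypothetical spot-checks (any URL works, which is the point; I’m not hard-coding what they return):
fun_robots("https://finance.yahoo.com/quote/MMM/sustainability")
fun_robots("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")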
Step 1c: Write User Agent
Transparency is the first bullet point in my introductory sermon on scraping ethics. In this step I create a custom user agent string. It can be anything but should at least include a name and contact info. It’s also nice to state something about what you’re doing (i.e., purpose of scraping). Doing this accomplishes two main things:
- It doesn’t absolve you of responsibility but it starts you off on better footing. Sort of like a cop uncle giving you a sticker for your car window; you’re still going down if you murder but you might get a break for rolling through a stop sign.
- It humanizes your activity and gives the website’s system administrators incentive and means to contact you if there’s an issue. Otherwise, it’s too easy for them to just blacklist your IP address, even if your activity isn’t really blacklist worthy.
Just for kicks, here’s my default user agent (yours could be different)…
httr:::default_ua()
## [1] "libcurl/7.59.0 r-curl/3.3 httr/1.4.0"
And here I establish a custom user agent string variable to be used in a later step…
var_agent <- "Kyle Rudden (me@kylerudden.com). Doing personal research."
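With the functions and user agent in place, a quick single-ticker smoke test ties Step 1 together (just a sketch; MMM is an arbitrary ticker and the XPath is the total-ESG-score path reused in Step 3):
link.i <- "https://finance.yahoo.com/quote/MMM/sustainability"
if (fun_robots(link.i)) {
  page.i <- GET(link.i, user_agent(var_agent)) %>% content()
  fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]')  # total ESG score
}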
Step 2: Create Data Tables
This is where things start to diverge a little depending on whether we’re scraping ESG data for companies or funds.
Before scraping, we’ll create a table — one each for companies and funds — containing pre-populated meta data (e.g., company or fund name) and empty (placeholder) columns for ESG data to be downloaded.
Step 2a: Create Data Table (Companies)
The S&P 500 is a good universe to work with — broad enough for meaningful follow-up analyses and its constituents are large enough to have likely been scored by Morningstar/Sustainalytics (i.e., there will be ESG data to scrape).
The code below fetches a table of S&P 500 constituents from Wikipedia. I usually avoid Wikipedia but: 1) everyone can access it, 2) it’s another scraping example, and 3) accuracy doesn’t matter for this post (any tickers will do).
# Note: ^GSPC is the symbol/ticker for the S&P 500 Index
dat_stocks <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
html_nodes("table[id='constituents']") %>%
html_table() %>%
data.frame() %>%
as_tibble()
Notice that I employ forward-piping using the “%>%” operator from the magrittr library (also from the tidyverse, and loaded as an rvest dependency). Piping results in less code and greater readability. You can read about it here.
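For comparison, here’s the same chain written without pipes (identical result, just read from the inside out):
# equivalent nested-call version of the pipeline above
dat_stocks <- as_tibble(data.frame(html_table(html_nodes(
  read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"),
  "table[id='constituents']"
))))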
I now want to do three things with the raw table: 1) clean up the column names, 2) select only the columns we’ll use, and 3) rename two tickers (i.e., substitute the period in BRK.B and BF.B with a dash; Yahoo uses BRK-B and BF-B).
Inspect current column names (the Wikipedia headers include spaces, which R replaces with periods)…
colnames(dat_stocks)
## [1] "Symbol" "Security" "SEC.filings"
## [4] "GICS.Sector" "GICS.Sub.Industry" "Headquarters.Location"
## [7] "Date.first.added" "CIK" "Founded"
Rename columns, reduce table to just those we want (also re-ordering them in the same step), and fix two tickers…
# rename columns (Symbol is the ticker, Security is the company name)
colnames(dat_stocks) <- c("ticker", "company", "filings", "sector", "industry", "location", "added", "cik", "founded")
# select columns
dat_stocks <- dat_stocks[, c("ticker", "company", "sector", "industry")]
# rename tickers
dat_stocks$ticker <- gsub("[.]", "-", dat_stocks$ticker)
And a quick inspection (just the first five rows)…
head(dat_stocks, 5)
## # A tibble: 5 x 4
##   ticker company             sector                 industry
##   <chr>  <chr>               <chr>                  <chr>
## 1 MMM    3M Company          Industrials            Industrial Conglomerates
## 2 ABT    Abbott Laboratories Health Care            Health Care Equipment
## 3 ABBV   AbbVie Inc.         Health Care            Pharmaceuticals
## 4 ABMD   ABIOMED Inc         Health Care            Health Care Equipment
## 5 ACN    Accenture plc       Information Technolo~  IT Consulting & Other Servi~
Below I add placeholder columns for ESG data that we’ll acquire shortly. I should note that creating these empty ESG columns isn’t necessary, nor is defining their column types (e.g., integer) as I do. You could just let the loop create the new columns. Doing it this way just makes it easier to see what’s coming.
dat_stocks$esgRating <- as.character(NA) # ESG Rating
dat_stocks$esgScore.tot <- as.integer(NA) # ESG Score (Total/Overall)
dat_stocks$esgScore.env <- as.integer(NA) # ESG Score (Environmental)
dat_stocks$esgScore.soc <- as.integer(NA) # ESG Score (Social)
dat_stocks$esgScore.gov <- as.integer(NA) # ESG Score (Governance)
dat_stocks$esgRank.tot <- as.numeric(NA) # Percentile Rank (Total/Overall)
dat_stocks$esgRank.env <- as.numeric(NA) # Percentile Rank (Environmental)
dat_stocks$esgRank.soc <- as.numeric(NA) # Percentile Rank (Social)
dat_stocks$esgRank.gov <- as.numeric(NA) # Percentile Rank (Governance)
dat_stocks$conRating <- as.character(NA) # Controversy Rating
dat_stocks$conLevel <- as.integer(NA) # Controversy Level
dat_stocks$conAreas <- as.character(NA) # Controversy Areas (Products)
dat_stocks$asOf <- Sys.Date() # Last Updated date
Step 2b: Create Data Table (Mutual Funds)
For mutual funds I’ll use a group of 50 pre-selected funds (download a CSV file here). They’re mostly SRI/ESG funds but I threw in a few that aren’t — either explicitly or aren’t rated — for the full spectrum of scraping results.
dat_funds <- read.csv("dat_funds.csv", stringsAsFactors = FALSE) %>% as_tibble()
head(dat_funds)
## # A tibble: 6 x 3
## ticker familyName fundName
## <chr> <chr> <chr>
## 1 ABEMX Aberdeen Aberdeen Emerging Markets Fund
## 2 ABALX American Funds American Funds American Balanced Fund
## 3 AMRMX American Funds American Funds American Mutual Fund
## 4 AEPGX American Funds American Funds EuroPacific Growth Fund
## 5 AGTHX American Funds American Funds The Growth Fund of America
## 6 AWSHX American Funds American Funds Washington Mutual Investors Fund
Similar to above, I add placeholder columns for ESG data that we’ll acquire shortly. The columns for mutual funds are slightly different than for stocks/companies.
dat_funds$esgRating <- as.character(NA) # ESG Rating
dat_funds$esgScore.tot <- as.integer(NA) # ESG Score (Total/Portfolio)
dat_funds$esgScore.env <- as.integer(NA) # ESG Score (Environmental)
dat_funds$esgScore.soc <- as.integer(NA) # ESG Score (Social)
dat_funds$esgScore.gov <- as.integer(NA) # ESG Score (Governance)
dat_funds$esgScore.aum <- as.numeric(NA) # ESG Score (% AUM basis)
dat_funds$esgScore.raw <- as.integer(NA) # ESG Score (Raw)
dat_funds$esgScore.ded <- as.integer(NA) # ESG Score (Controversy Deduction)
dat_funds$susMandate <- as.character(NA) # Sustainability Mandate
dat_funds$susRank.pct <- as.numeric(NA) # Sustainability Rank (Percentile)
dat_funds$susRank.cat <- as.numeric(NA) # Sustainability Rank (Category)
dat_funds$asOf <- Sys.Date() # Last Updated date
Step 3: Download ESG Data
Just as with creating data tables I’ll break the actual scraping into sub-steps; one each for companies and funds. There are minor differences but the process is basically the same: iterate through the data table, use the ticker to construct an HTTP request URI, retrieve the entire page, then extract specific XML nodes and their data using XPaths.
A note on XPaths… XPath is a syntax used to navigate XML hierarchies. Think of an XPath as directions (a path) to a data item. XPaths are convenient but have a downside: if the site structure changes, the XPaths in scraping code must be updated. More change-resistant options exist, but the cost is up-front coding time. For a large project it would be worth it, but for a handful of ESG items XPaths are just too easy not to use.
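As a toy illustration (reusing the Wikipedia page from Step 2a rather than Yahoo, and a throwaway wiki object), the same table node can be located with either a CSS selector or an equivalent XPath:
wiki <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
wiki %>% html_nodes("table[id='constituents']")             # CSS selector
wiki %>% html_nodes(xpath = "//table[@id='constituents']")  # equivalent XPath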
A few points about the scripts’ flow (same for companies and funds):
- it first forms a request URL (based on the ticker) and checks it with fun_robots(), which wraps robotstxt::paths_allowed(), storing the result in a bots.i variable
- if bots.i is true (allowed) the script continues, ironically by pausing (for 0.5 to 3.0 seconds, determined randomly)
- the page contents are fetched and each of the ESG data items are extracted using the two functions created earlier
- the extra steps for Controversy Level are because if the level is “None,” a different XPath applies than for other levels
- everything is wrapped in a tryCatch() so that if there’s a problem with one iteration the whole process doesn’t crash
Step 3a: Download ESG Data (Companies)
Scraping all S&P 500 companies should only take a few minutes depending on computer, internet speed, and the hand you’re dealt with the random pauses. For me it was 2.5 minutes. If you want to speed things up, or otherwise don’t want the entire index, you could always subset dat_stocks. A few super quick examples:
By sector…
sort(unique(dat_stocks$sector))
## [1] "Communication Services" "Consumer Discretionary" "Consumer Staples"
## [4] "Energy" "Financials" "Health Care"
## [7] "Industrials" "Information Technology" "Materials"
## [10] "Real Estate" "Utilities"
# energy and utilities sectors
dat_stocks <- subset(dat_stocks, sector %in% c("Energy", "Utilities"))
By sector and industry…
sort(unique(dat_stocks$industry[dat_stocks$sector == "Utilities"]))
## [1] "Electric Utilities"
## [2] "Gas Utilities"
## [3] "Independent Power Producers & Energy Traders"
## [4] "Multi-Utilities"
## [5] "Water Utilities"
# utilities but not water utilities
dat_stocks <- subset(dat_stocks, sector == "Utilities" & industry != "Water Utilities")
Anyway, here’s the full S&P 500 download…
i <- 1
for (i in 1:nrow(dat_stocks)) {
  message(paste0(i, " of ", nrow(dat_stocks)))
  tryCatch({
    tick.i <- dat_stocks$ticker[i]
    link.i <- paste0("https://finance.yahoo.com/quote/", tick.i, "/sustainability")
    bots.i <- suppressMessages(fun_robots(link.i))
    if (bots.i) {
      Sys.sleep(runif(1, 0.5, 3.0))
      page.i <- GET(link.i, user_agent(var_agent)) %>% content()
      dat_stocks$esgRating[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[3]/div/span')
      dat_stocks$esgScore.tot[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]')
      dat_stocks$esgScore.env[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[2]/div/div[2]/div[1]')
      dat_stocks$esgScore.soc[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[3]/div/div[2]/div[1]')
      dat_stocks$esgScore.gov[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[4]/div/div[2]/div[1]')
      dat_stocks$esgRank.tot[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[2]/span/span')
      dat_stocks$esgRank.env[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[2]/div/div[2]/div[2]/span/span')
      dat_stocks$esgRank.soc[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[3]/div/div[2]/div[2]/span/span')
      dat_stocks$esgRank.gov[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[4]/div/div[2]/div[2]/span/span')
      dat_stocks$conRating[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div/div[2]/div[1]/span/span/span')
      dat_stocks$conLevel[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div/div[2]/div[1]/div')
      dat_stocks$conAreas[i] <- fun_lists()
      dat_stocks$asOf[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/span[2]/span')
    }
  }, error = function(e) {})
}
dat_stocks$asOf[which(is.na(dat_stocks$esgRating))] <- NA
A quick inspection of the results shows that 95.4% of S&P 500 constituents are rated by Morningstar/Sustainalytics…
scales::percent(sum(!is.na(dat_stocks$esgRating)) / nrow(dat_stocks))
## [1] "95.4%"
Now that I think about it, I want to know the percentage of market capitalization (versus company count). First I have to download market caps (using the quantmod library). I suppress warnings due to an irrelevant time zone warning.
dat_stocks$mktCap <- suppressWarnings(
quantmod::getQuote(dat_stocks$ticker, what = "marketCap")$marketCap
)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## downloading set: 1 , 2 , 3 , ...done
It’s about the same — 95.5% of total S&P 500 market cap is rated by Morningstar/Sustainalytics…
scales::percent(sum(dat_stocks$mktCap[which(!is.na(dat_stocks$esgRating))]) / sum(dat_stocks$mktCap))
## [1] "95.5%"
Step 3b: Download ESG Data (Mutual Funds)
Again, it’s basically the same process — some slight variation in data points and XPaths and one additional check. The new “if” statement checks a particular XPath for the presence of “ESG” as a test; it’ll return FALSE if there’s no ESG data.
Often there is more recent ESG info on Morningstar’s site. However, Morningstar’s robots.txt asks that we not scrape it. Just be aware of the occasional lag between morningstar.com and yahoo.com. This seems to be more pronounced for funds than for companies.
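If you want to confirm that for yourself, the robots.txt helper from Step 1b works on any URL (the Morningstar path below is purely hypothetical; check against whatever fund page you’d actually be tempted to scrape):
fun_robots("https://www.morningstar.com/funds/xnas/ABALX/quote")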
i <- 1
for (i in 1:nrow(dat_funds)) {
  message(paste0(i, " of ", nrow(dat_funds)))
  tryCatch({
    tick.i <- dat_funds$ticker[i]
    link.i <- paste0("https://finance.yahoo.com/quote/", tick.i, "/sustainability")
    bots.i <- suppressMessages(fun_robots(link.i))
    if (bots.i) {
      Sys.sleep(runif(1, 0.5, 3.0))
      page.i <- GET(link.i, user_agent(var_agent)) %>% content()
      if (grepl("ESG", fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/h3/span'))) {
        dat_funds$esgRating[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[3]/div/span')
        dat_funds$esgScore.tot[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]')
        dat_funds$esgScore.env[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[2]/div/div[2]/div[1]')
        dat_funds$esgScore.soc[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[3]/div/div[2]/div[1]')
        dat_funds$esgScore.gov[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[4]/div/div[2]/div[1]')
        dat_funds$esgScore.aum[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div[2]/div[1]/div[2]/span')
        dat_funds$esgScore.raw[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div[2]/div[3]')
        dat_funds$esgScore.ded[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div[2]/div[3]/div[3]')
        dat_funds$susMandate[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/div/span/span/span')
        dat_funds$susRank.pct[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/p[1]/span/span') / 100
        dat_funds$susRank.cat[i] <- page.i %>%
          html_nodes(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/p[2]/span/span') %>%
          html_text(trim = TRUE)
        dat_funds$asOf[i] <- fun_parse('//*[@id="Col1-0-Sustainability-Proxy"]/section/div[4]/span[2]/span')
      }
    }
  }, error = function(e) {})
}
dat_funds$asOf[which(is.na(dat_funds$esgRating))] <- NA
There are two fields in need of explanation. The sum of esgScore.raw and esgScore.ded (zero or a negative number for the controversy deduction) should equal esgScore.tot. However, it’s often off by one due to rounding, and occasionally off by a lot more for who knows what reason. You can see that below (mostly -1 to +1, except for two extremes).
dat_funds$esgScore.tot - (dat_funds$esgScore.raw + dat_funds$esgScore.ded)
## [1] 5 -1 0 0 0 0 0 0 1 0 0 1 -1 0 0 0 0 0 0 0 1 -1 0 0 1
## [26] 0 0 1 0 0 1 -1 -1 0 0 6 -1 0 0 0 0 0 0 0 0 0 NA NA 0 0
Step 4: Analysis
This post is about getting the data. The analysis is up to you. I’ll leave it at a couple of rough summarizations. The first is a simple roll-up of dat_stocks by sector (ESG scores are simple averages versus market-cap weighted; a cap-weighted variant is sketched after the output)…
dat_look <- subset(dat_stocks, !is.na(esgRating)) %>%
  group_by(sector) %>%
  summarise(
    esgScore.tot = ceiling(mean(esgScore.tot)),
    esgScore.env = ceiling(mean(esgScore.env)),
    esgScore.soc = ceiling(mean(esgScore.soc)),
    esgScore.gov = ceiling(mean(esgScore.gov))
  ) %>%
  ungroup()
dat_look <- dat_look[order(dat_look$esgScore.tot, decreasing = TRUE), ]
dat_look
## # A tibble: 11 x 5
## sector esgScore.tot esgScore.env esgScore.soc esgScore.gov
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Utilities 67 59 71 74
## 2 Consumer Staples 63 66 58 65
## 3 Information Technology 63 65 61 64
## 4 Materials 62 60 62 67
## 5 Energy 61 55 64 65
## 6 Financials 58 55 61 57
## 7 Health Care 58 67 51 60
## 8 Industrials 58 57 55 64
## 9 Real Estate 58 61 51 62
## 10 Consumer Discretionary 57 55 55 64
## 11 Communication Services 54 57 51 59
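And since market caps were downloaded in Step 3a, the same roll-up can be weighted by market cap instead of the simple mean (a sketch; dat_look_wt is a hypothetical name and I’m only weighting the total score, output not shown):
dat_look_wt <- subset(dat_stocks, !is.na(esgRating)) %>%
  group_by(sector) %>%
  summarise(esgScore.tot = ceiling(weighted.mean(esgScore.tot, w = mktCap))) %>%
  ungroup() %>%
  arrange(desc(esgScore.tot))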
Or a similar summary of dat_funds — rolled up to the fund family level and simple score averages…
dat_look <- subset(dat_funds, !is.na(esgRating)) %>%
  group_by(familyName) %>%
  summarise(
    esgScore.tot = ceiling(mean(esgScore.tot)),
    esgScore.env = ceiling(mean(esgScore.env)),
    esgScore.soc = ceiling(mean(esgScore.soc)),
    esgScore.gov = ceiling(mean(esgScore.gov))
  ) %>%
  ungroup()
dat_look <- dat_look[order(dat_look$esgScore.tot, decreasing = TRUE), ]
head(dat_look, 10)
## # A tibble: 10 x 5
## familyName esgScore.tot esgScore.env esgScore.soc esgScore.gov
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aberdeen 53 54 52 53
## 2 Calvert Research and Man~ 50 55 55 52
## 3 Morgan Stanley 50 54 55 49
## 4 Jensen 49 60 56 53
## 5 JPMorgan 49 56 55 53
## 6 Parnassus 48 54 53 52
## 7 TIAA Investments 48 55 54 53
## 8 American Funds 47 56 53 52
## 9 MainStay 47 56 54 49
## 10 Pioneer Investments 47 58 53 52
III. Conclusion
Invest responsibly. Scrape ethically. Feel free to email me with questions or just to discuss.
For good introductions to SRI/ESG investing see MSCI’s ESG 101: What is ESG Investing? and the Forum for Sustainable and Responsible Investment’s SRI Basics: What is Sustainable, Responsible, and Impact Investing?