Monday 20 August 2018
Many of us were 5 College students at one point, so we love to team up with professors to support local data science events, especially ones for students. You may have seen us Junior Data Scientists poking around at 5 College Datafest or giving R/Python tutorials at UMass Amherst. Ever since the induction of our fearless program manager, Christine Pfeil, we’ve stepped up our outreach game; on January 19, we threw our first conference! We invited undergraduate and graduate students, professors, and industry professionals to join us at the Museum of Science in Boston for a celebration of women leaders in data science across industry, research, and education.
The conference was broken into industry, research, and education segments, with speakers and a panel for each segment. In case you missed out, here’s a quick recap of the talks:
Two of our own MassMutual data scientists, Wenjing Lu and Sara Saperstein, presented in the industry segment about their career trajectories from academia into data science. Unlike older, established sciences, data science doesn’t have a dedicated place in many academic programs, so it’s quite common to meet data scientists who originally studied bioinformatics or statistics (like Wenjing) or computational neuroscience (like Sara). We got some really good advice from Monika Wahi (Founder of DethWench) about how to rise up in industry and eventually own a company.
Melinda Han Williams (Vice President of Dstillery) walked us through a method they’ve used to give companies a clearer picture of their customer base. For a particular watch company, they studied the browsing patterns of users who went onto the watch company’s webpage. Then, they collected the URLs of other webpages that were browsed concurrently, which gave information about the users’ interests in general, as well as direct information about competing brands. Through this exercise, they were able to discover that some Jewish communities had a preference for this particular watch brand, particularly as gifts for major life celebrations. Melinda told us that the watch company had no idea about this niche market, and that this is the level of insight they hope to bring to their clients.
We had both Una-May O’Reilly (AnyScale Learning For All, ALFA) and Mine Cetinkaya-Rundel (RStudio, Duke University) speak about Massive Open Online Courses (MOOCs), which could revolutionize education. As the leader and principal research scientist at ALFA, Una-May talked about ongoing research efforts to optimize MOOCs by studying the output of popular platforms such as edX and Coursera. As a professional educator at RStudio and professor at Duke University, Mine aims to optimize the MOOC experience at a more granular level, often by using observations from classes or focus groups. She spoke about an intro course to data science she designed, “We’ll start with Data Science”. Rather than beginning with difficult statistical concepts and building up to machine learning from there, Mine led with interesting applications of data science. By lifting the computer science and math prerequisites for the intro course, Mine prevented many students from leaving the data science pipeline prematurely. Not only did the course help students uncover passions for data science, but it also supported diversity in the pool of people going through the STEM pipeline at Duke University.
John Tukey once said, “The best thing about being a statistician is that you get to play in everyone’s backyard.” I think that this trait of data science was particularly exemplified by the range of topics presented by the speakers during the research segment. We started with Una-May O’Reilly on MOOCs, went to deep space with Laura Cadonati (Georgia Tech), then to sampling methodologies with Krista Gile (UMass Amherst), and finally landed in the biomedical field with Raji Balasubramanian (UMass Amherst). Laura talked about how the large-scale Laser Interferometer Gravitational-Wave Observatory (LIGO) experiment is being used to detect black holes and neutron stars. Krista and Raji are both conducting research on statistical methodologies, particularly for data that is difficult to work with but nonetheless crucial for studying health issues. Krista talked about how link-tracing sampling methods are a viable way to get information about elusive populations who are at risk for HIV infection, such as people who inject drugs intravenously. Raji is currently focused on designing experiments to study women’s health around the understanding that self-reported data is incredibly noisy. Some interesting datasets that were mentioned include the Nurses’ Health Study from 1976 as well as the Women’s Health Initiative from 1991. Both researchers may occasionally work with groups such as the American Statistical Association or Westat to establish these methods, so that they become standards across the field.
I’ll end this blog post by re-telling my favorite story of the day, a real heavy-hitter from our keynote speaker, Megan Price.
Megan is the executive director of the Human Rights Data Analysis Group (HRDAG), a non-profit, non-partisan organization that aims to support human rights as stated in international treaties such as the “Universal Declaration of Human Rights” or the “International Covenant on Civil and Political Rights”. As scientists who aren’t advocates themselves, they are tasked with providing unbiased and accurate analyses that give advocates and courts insight into instances of human rights violations.
Their investigations often take place in countries outside of the US and may take a decade to complete. Hearing this as a data scientist, I guessed that 70%, or 7 years, on such a project might be dedicated to a grueling data wrangling process, and I think I was right... Megan shared a particularly fascinating story about salvaging the Guatemalan National Police Archive (GNPA) after the National Police was disbanded in 1996 at the end of a 36-year civil war. Tens of millions of unorganized decaying documents were found in an abandoned building. These are important pieces of paper – some contain the only evidence of a victim’s suffering – but they were left behind to become housing for rats.
Instead of giving up on the stories because they lacked a traditional sample frame, the team from HRDAG used the physical topology of the building to develop an iterative, multi-stage probability-proportional-to-size (PPS) sampling frame. This allowed them to generate 3D coordinates within the building from which to sample data! As of April 2009, they had sampled 20k documents and securely stored them in Martus, a secure data warehouse commonly used by defenders of human rights.
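For the curious, here is a rough sketch of what a two-stage PPS draw over physical space could look like. This is not HRDAG's actual procedure, just a toy illustration under my own assumptions: the room names, the pile dimensions, and the `volume` and `sample_points` helpers are all hypothetical, the "size" measure is taken to be the volume of paper in each room, and rooms are drawn with replacement.

```python
import random

# Hypothetical inventory: room name -> (width, depth, height) in meters
# of the document pile in that room. These numbers are made up.
ROOMS = {
    "room_a": (4.0, 3.0, 1.5),
    "room_b": (2.0, 2.0, 1.0),
    "room_c": (6.0, 5.0, 2.0),
}


def volume(dims):
    """Size measure for PPS: the volume of the pile."""
    width, depth, height = dims
    return width * depth * height


def sample_points(rooms, n, seed=0):
    """Two-stage draw, with replacement.

    Stage 1: pick a room with probability proportional to its volume.
    Stage 2: pick a uniform random 3D coordinate inside that room.
    """
    rng = random.Random(seed)
    names = list(rooms)
    weights = [volume(rooms[name]) for name in names]
    points = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        width, depth, height = rooms[name]
        point = (
            name,
            rng.uniform(0, width),
            rng.uniform(0, depth),
            rng.uniform(0, height),
        )
        points.append(point)
    return points


if __name__ == "__main__":
    for room, x, y, z in sample_points(ROOMS, 5):
        print(f"{room}: x={x:.2f} y={y:.2f} z={z:.2f}")
```

Weighting rooms by volume means a room holding ten times as much paper is roughly ten times as likely to be visited, so every document ends up with about the same chance of selection even though nobody has a list of documents to sample from.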
This data has been crucial in prosecuting military leaders who committed acts of genocide against the Ixil Mayan communities during the 1980s as well as other crimes against humanity, such as state-sanctioned murder of labor leaders.
Overall, I think the conference was really inspiring. Within any data analyst, there could exist a latent hero; it’s 2018, and there’s never been a better time to unleash her into the world!