Infrastructure Built for DS by DS — Saving Time with Custom R Packages

Peng Wang
Thursday 18 July 2019

I still remember when Senior Data Scientist Xiangdong Gu told me that R could communicate directly with Teradata and Vertica, our database analytics and management systems. I was thrilled! R was one of my favorite tools, and this meant I could remain productive, in R, without spending time on external tools simply to move data around. I quickly devised a workflow with my own settings and configurations.
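For readers unfamiliar with that setup, the connection itself looks roughly like the sketch below, which uses the RJDBC package. The driver class is Vertica's standard JDBC driver, while the .jar path, host, database, and credentials are placeholders rather than our actual configuration.

```r
# A minimal sketch of talking to Vertica from R over JDBC.
# The .jar path, host, database, and credentials are placeholders.
library(RJDBC)

# Point RJDBC at the vendor-supplied Vertica JDBC driver
drv <- JDBC(
  driverClass = "com.vertica.jdbc.Driver",
  classPath   = "/path/to/vertica-jdbc.jar"
)

# Open a connection and run an ordinary SQL query
conn   <- dbConnect(drv, "jdbc:vertica://vertica-host:5433/analytics",
                    "my_user", "my_password")
result <- dbGetQuery(conn, "SELECT COUNT(*) FROM my_schema.my_table")
dbDisconnect(conn)
```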

Word spread. Different workflows were adopted, and individual features were added each time someone branched out. Data Scientist Grace Yoo even set up a tutorial for the team. What could be more fulfilling than solving a problem and adding your own personal touch to it? We all love doing this.

However, it wasn't long before the headaches started. I couldn't run anybody else's code on my laptop. Jenkins servers weren't working as intended; furthermore, secret config files and Vertica .jar files were disorganized. On my own laptop, I had a folder with over 30 different settings files for different projects and environments (Vertica, Jenkins, etc.). It was exhausting to keep track of all these files.

It was clear we needed to standardize workflows and increase code readability. We needed to identify common procedures and build shared libraries.

As a first attempt, Xiangdong created mmlib, a package that connected databases to R, and added a few utility functions. It was a great start, but still far from a complete, easy-to-use, and well-documented solution. Paying off technical debt takes time, and so does building new tools to do the work. Initially, the team was slow to adopt the new package, and mmlib sat unused for a while after its launch. The tipping point finally came when everyone had a different solution to the same problem; after a few discussions with the team, we realized a single, well-tested, stable solution was needed.

The RJDBC package, which we had used earlier to connect to databases from R, was painfully slow; if we used VSQL instead, special characters in strings such as commas or quotation marks could wreak havoc on our queries. Someone started to use |- or ~-separated formats. Kudos to that engineering genius, but this method wasn't fool-proof either. And what if we had a table with a thousand columns? Creating such a table in Vertica by hand is simply intimidating. We started devoting time and effort to designing a reliable solution. Lizzie and Jordan, two junior data scientists, helped a lot in these early stages. The first function was called upload_vertica_table(). It offered a one-stop-shop solution for connecting to Vertica and uploading a data frame efficiently. It was environment agnostic: the same code could run in any environment without modification. It was blisteringly fast: the upload speed was limited almost only by network bandwidth. And it was easy to use: a single function call could automatically create a table and upload the values.
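The real upload_vertica_table() isn't reproduced here, but a minimal sketch of the bulk-loading idea, writing the data frame to a file with a delimiter that can't collide with commas or quotes and then streaming it into Vertica via vsql's COPY, might look like the following. The function name upload_vertica_table_sketch(), the deliberately naive type mapping, and the reliance on a preconfigured vsql client are all illustrative assumptions, not mmlib's actual interface.

```r
# A simplified sketch of what a function like upload_vertica_table() can do:
# derive a CREATE TABLE statement from the data frame, dump the data to a
# delimited temp file, and bulk-load it with vsql COPY instead of row-by-row
# JDBC inserts. Connection details are assumed to come from vsql's own config.
upload_vertica_table_sketch <- function(df, table_name, vsql = "vsql") {
  # Map R column types to (very rough) Vertica types
  sql_type <- function(x) {
    if (is.numeric(x)) "FLOAT"
    else if (inherits(x, "Date")) "DATE"
    else "VARCHAR(65000)"
  }
  cols <- paste(sprintf("%s %s", names(df), vapply(df, sql_type, character(1))),
                collapse = ", ")
  create_sql <- sprintf("CREATE TABLE IF NOT EXISTS %s (%s);", table_name, cols)

  # Write the data using a control character as delimiter, which sidesteps
  # the escaping problems that commas and quotation marks caused
  delim <- "\x01"
  tmp <- tempfile(fileext = ".dat")
  on.exit(unlink(tmp))
  write.table(df, tmp, sep = delim, quote = FALSE,
              row.names = FALSE, col.names = FALSE, na = "")

  # Bulk-load the file; E'\001' matches the delimiter written above
  copy_sql <- sprintf("COPY %s FROM LOCAL '%s' DELIMITER E'\\001' NULL AS '';",
                      table_name, tmp)
  system2(vsql, args = c("-c", shQuote(create_sql)))
  system2(vsql, args = c("-c", shQuote(copy_sql)))
  invisible(table_name)
}
```

The delimiter choice is the whole trick: a control character almost never appears in real data, so the upload no longer breaks on punctuation inside string columns.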

mmlib finally took off. Almost every R user on our data science team started to use and rely on it. The package was a success! Adam Fox, Head of Data Science, even suggested that maybe we should do the same thing in Python. Thus the idea for mmpac was born.

This time, we took a big step back and thought about how we'd like to design a package from the ground up, including all the features we'd want it to have. Careful consideration went into the overall design, modularizing the different components, and, of course, settling on a single naming convention. The outcome was very satisfying, and we eventually revamped mmlib with the same kind of thoughtfulness and design.

Tam, a junior data scientist, worked tirelessly to rebuild the mmlib package with principles of robust package design in mind, adding more functionality and better documentation along the way. We iterated, iterated, and iterated. More people got involved. Maddie added S3 capabilities to the package so that users could interact with S3 directly from RStudio. Evan, another junior data scientist, worked to add more functionality to mmpac.
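Maddie's S3 layer isn't reproduced here either, but the flavor of it, pushing and pulling objects without leaving RStudio, can be sketched with a couple of thin wrappers. The function names below (s3_put_sketch, s3_get_sketch) and the use of the AWS CLI are my illustration, not mmlib's interface; the sketch assumes the CLI is installed and credentials are already configured.

```r
# Rough sketch of S3 convenience helpers built on the AWS CLI ("aws s3 cp").
s3_put_sketch <- function(local_path, s3_uri) {
  status <- system2("aws", c("s3", "cp", shQuote(local_path), shQuote(s3_uri)))
  if (status != 0) stop("Upload to ", s3_uri, " failed")
  invisible(s3_uri)
}

s3_get_sketch <- function(s3_uri, local_path = tempfile()) {
  status <- system2("aws", c("s3", "cp", shQuote(s3_uri), shQuote(local_path)))
  if (status != 0) stop("Download from ", s3_uri, " failed")
  local_path
}

# Example usage: push a model artifact up, then read a CSV back into R
# s3_put_sketch("model.rds", "s3://my-bucket/models/model.rds")
# df <- read.csv(s3_get_sketch("s3://my-bucket/data/scores.csv"))
```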

We kept improving, fixing bugs, adding new features. It's a never-ending process and a labor of love. In the end, we believe that the best product is built in small steps, not by waiting for a perfect final product to be created.

About a month ago, I was talking with a junior data scientist in the team about how to further enhance mmlib. She looked at me and asked, "What is mmlib?" I was startled, and then quickly learned that its functions were so integrated into her project work that it had slipped her attention. She was using it without even realizing it!

We took it as a compliment.