Hermetic R builds: The solution to unwanted R updates
Article by: Jason Harenski
Recently, we have been working on a program to make the work of data scientists discoverable and reusable, and I wanted to share a bit about it with a wider audience. As part of our application, we built a DevOps system that enforces hermetic R and Python container builds, and we have seen lower error counts and shorter build cycle times as a result.
What is a hermetic build?
The goal of a hermetic build system is to operate as a pure function, or in build terms, to always return the same output assets given the same input source code. To do this, the system must provide a way to isolate the build from changes to the state of the host system, and a way to ensure that the inputs are identical from one build to the next.
- Source Identity. Local source assets are protected by git, which identifies each set of code changes with a unique hash. External assets, however, can arrive in many forms, depending on the development idiom and culture of a language or tool. In 2020, a significant percentage of external code assets is hosted on services whose APIs provide a stable, discoverable URL for downloading a specific version of an asset as of a given change set, at the commit scale, the tagged-release scale, or both. A hermetic build system uses these APIs to specify exactly which versions of an asset to download and build with. It checks the hashes of these assets against stored values so it can detect any unexpected changes. If a retrieved external asset does not match its stored hash, the build fails.
- Isolation. The other half of the hermetic build goal is to specify the versions of the tools required to build the source. These build systems must also prevent infection (timely!) through accidental use of unchecked source code resources, locally modified shell symbols, environment variables, et cetera. In their fullest form, hermetic build systems treat tools such as compilers like source code, downloading their own copies of the tools and managing their storage and use inside managed file trees that serve as secure sandboxes. The isolation from the host machine and local user, including installed versions of languages, should be total.
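To make the two halves concrete, here is a minimal Python sketch, not the system described in this article. It assumes pinned SHA-256 hashes are kept in a simple dictionary and that isolation is approximated by an allowlist of environment variables. The names verify_asset, clean_env, and HashMismatchError are hypothetical, chosen only for illustration.

```python
import hashlib
from typing import Dict, Iterable, Mapping


class HashMismatchError(Exception):
    """Raised when a fetched asset does not match its pinned hash."""


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the asset bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_asset(name: str, data: bytes, pinned: Mapping[str, str]) -> bytes:
    """Source identity: accept an asset only if it matches its stored hash.

    `pinned` is an assumed name -> sha256 table checked into source control.
    """
    expected = pinned.get(name)
    if expected is None:
        # An asset with no stored hash is unchecked input, so the build fails.
        raise HashMismatchError(f"{name}: no pinned hash, refusing to use it")
    actual = sha256_hex(data)
    if actual != expected:
        raise HashMismatchError(f"{name}: expected {expected}, got {actual}")
    return data


def clean_env(host_env: Mapping[str, str], allowlist: Iterable[str]) -> Dict[str, str]:
    """Isolation: pass only explicitly allowed variables on to build steps."""
    allowed = set(allowlist)
    return {key: value for key, value in host_env.items() if key in allowed}
```

A build step would call verify_asset on every downloaded file before using it, and would launch tools with the result of clean_env(os.environ, [...]) rather than the full host environment. A real hermetic build system goes much further, for example by sandboxing the file system as well.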
Building with R
The R language presents interesting challenges to these goals. R is used by data scientists precisely because it wraps arbitrarily complex behind-the-scenes tool use in simple, semantically clean functional wrappers, allowing data scientists to focus solely on Getting Things to Work. For the would-be hermetic builder, however, it’s complicated:
1. Underlying dependencies. Most of the R authors in our application are blissfully unaware of which underlying technologies a given R library uses. In practice, anything is possible, including Java, C++, Python, Tcl/Tk, or any other tool stack. These tools depend on underlying system libraries that are not specified anywhere. The sadly non-hermetic R idiom is to assume that the user can install any missing packages locally, and that transitive dependencies will be resolved by a host OS package manager.
2. Version control. R’s idiomatic development culture predates universal access to free source control, so code releases are typically identified by a version number in the name of an asset bundle. R is not alone in this, but version numbers alone are not hermetic. Nothing enforces version number changes in lockstep with modifications to source code. Version numbers can be updated, left unchanged, or arbitrarily changed for any reason, independent of source modification. Further, while dependencies between R libraries are provided in text form, no package management tools exist to warn users about version incompatibilities.
3. Poor location stability. R does have its own website for package distribution, located at https://cran.r-project.org, commonly called CRAN. CRAN maintains an archive of current and recent versions of R libraries. Unfortunately, it uses one URL format for the current version, another for recent historical versions, and returns a 404 for more ancient historical versions. A sudden burst of new releases can cause specific versions of packages to change from the “current” URL format to the “historical” one without notice.
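The location problem can be worked around by trying both URL formats in order. The sketch below is illustrative only and is not the solution described in the full article. It assumes the CRAN source layout as I understand it, src/contrib/ for current tarballs and src/contrib/Archive/<package>/ for older ones. The helpers cran_candidate_urls and fetch_first_available are hypothetical names, and the fetch callable is injected so the lookup logic can be exercised without network access.

```python
from typing import Callable, List, Optional, Tuple

CRAN = "https://cran.r-project.org"


def cran_candidate_urls(package: str, version: str, base: str = CRAN) -> List[str]:
    """List the places a pinned source tarball may live, current format first."""
    tarball = f"{package}_{version}.tar.gz"
    return [
        f"{base}/src/contrib/{tarball}",                    # "current" format
        f"{base}/src/contrib/Archive/{package}/{tarball}",  # "historical" format
    ]


def fetch_first_available(
    urls: List[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Tuple[str, bytes]:
    """Return (url, data) for the first URL that `fetch` can retrieve.

    `fetch` is an assumed callable that returns the response body,
    or None when the server answers 404.
    """
    for url in urls:
        data = fetch(url)
        if data is not None:
            return url, data
    # Neither location has the tarball: fail the build instead of guessing.
    raise LookupError(f"not found at any of: {', '.join(urls)}")
```

Because version numbers alone are not hermetic, whatever this lookup returns would still need to be checked against a stored hash before the build uses it.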
The problem
We needed to standardize, simplify and hermetically seal the build process around R libraries.
Our project enables model authors (presumed to be data scientists with varying degrees of cloud engineering knowledge) to develop models in their preferred coding language on their desktop as they have always done, and then upload these models to a cloud-hosted central registry for discovery. Our user interface then enables other users to run these models against their own data sets.
For model authors who choose to write R scripts, we offer hundreds of hosted R libraries to write against, and we constantly add more on request, so we badly needed a way to keep our builds stable yet open to additions.
To see how we solved this problem, view the full article on our site here: https://www.logic2020.com/insight/tactical/hermetic-r-builds-solution-unwanted-r-updates?utm_source=social&utm_medium=Medium&utm_campaign=Hermetic_R_Builds