Introducing Splitgraph: a diary of a data scientist
Table of Contents
- TL;DR / executive summary
- A diary of a data scientist
- A diary of a software engineer
- State of the art in software engineering
- Towards composable, maintainable, reproducible data science
It's been two years since project Morrowind (which apparently now has been made an official speedrun category). During that time, I've been working on another exciting project and it's time to finally announce it.
Today, we are delighted to launch Splitgraph, a tool to build, extend, query and share datasets that works on top of PostgreSQL and integrates seamlessly with anything that uses PostgreSQL. It brings the best parts of Git and Docker, tools well-known and loved by developers, to data science and data engineering, and allows users to build and manipulate datasets directly on their database using familiar commands and paradigms.
Splitgraph launches with first-class support for multiple data analytics tools and access to over 40000 open government datasets on the Socrata platform. Analyze coronavirus data with Jupyter and scikit-learn, plot nearby marijuana dispensaries with Metabase and PostGIS or just explore Chicago open data with DBeaver and do so from the comfort of a battle-tested RDBMS with a mature feature set and a rich ecosystem of integrations.
Let's consider the life of a more and more prominent type of worker in the industry: a data scientist/engineer. Data is the new oil and this person is the new oil rig worker, plumber, manager, owner and operator. This is partially based on my own professional experience and partially on over-exaggerated horror stories from the Internet, just like a good analogy should be.
Came to work to a small crisis: a dashboard that our marketing team uses to help them direct their door hinge (yes, we sell door hinges. It's a very niche but a very lucrative business) sales efforts is outputting weird numbers. Obviously, they're not happy. I had better things to do today but oh well. I look at the data that populates the dashboard and start going up the chain through a few dozen of random ETL jobs and processes that people before me wrote.
By lunchtime, I trace this issue down to a fault in one of our data vendors (that we buy timber price data from: apparently it's a great predictor of door hinge sales): overnight, they decided to change their conventions and publish values of 999999 where they used to push out a NULL. I raise a support ticket and wait for it to be answered. In the meantime, I enlist a few other colleagues and we manage to repair the damage, rerunning parts of the pipeline and patching data where needed.
Support ticket still unanswered (well, they acknowledged it but said they are dealing with a sudden influx of support tickets, I wonder why) but at least we have a temporary fix.
In the meantime, I start work on a project that I wanted to do yesterday. I read a paper recently that showed that another predictor of door hinge sales is council planning permissions. The author had scraped some data from a few council websites and made the dataset available on his Web page as a CSV dump. Great! I download it and, well, it's pretty much what I expected it to be: no explanation of what each column means and what its ranges are. But I've seen worse. I fire up my trusty Pandas toolchain and get to work.
By the evening, there's nothing left of the old dataset: I did some data patching and interpolation, removed some columns and combined some other ones. I also combined the data with our own historical data for door hinge sales in given postcodes. In conjunction with this data, the planning permission dataset indeed gives an amazing prediction accuracy. I send the results to my boss and go home happy.
This is the happiest I'll be this week.
The timber sales data vendor has answered our support ticket. In fact, our query made them inspect the data closer at which point they realised they had some historical errors in the data which they decided to rectify. The problem was that they couldn't send us just the rows that were changed and instead linked us to an SQL dump of the whole dataset.
I spend the rest of the day downloading it (turns out, there's a lot of timber around) and then hand-crafting SQL queries to backfill the data into our store as well as all the downstream components.
In the meantime, marketing together with my boss has reviewed my results and is really excited about council planning permission data. They would like to put it into production as soon as possible.
I send some e-mails to the author of the paper to find out how they generated the data and if they would be interested in sharing their software, whilst also trying to figure out how to plumb it into our pipeline so that the projections can make it into their daily reports.
Boss is also unhappy about our current timber data vendor and is wondering if I could try out a dataset provided by another vendor. Easier said than done, as now I have to somehow reproduce the current pipeline on my machine, swap the new dataset in and rerun all our historical predictions to see how they would have fared.
The council planning permission data project is probably not happening. Firstly, it's because the per-postcode sales data that I used in my research is in a completely different database engine that we can't directly import into our prediction pipeline. But in worse news, the author of the paper doesn't really remember how he produced the data and whether his scraping software still works.
After a whole day of searching, I did manage to find a data vendor that seems to be doing similar things, with no pricing data, no nothing. I drop them an e-mail and go home.
Come to work to learn about an overnight production issue that the operators managed to mitigate but now I have to actually fix. Oh well. I get the tag of the container we had running in production and do a
docker pull. I fire it up locally and use a debugger (and the container's Dockerfile) to locate the issue: it's a bug in an open-source library that we're using. I do a quick scan of GitHub issues to see if it's been reported before. Nope. I raise an issue and also submit a pull request that I think should fix it.
In the meantime, the tests I run locally for that library pass with my fix so I change the Dockerfile to build the image from my patched fork. I do a
git push on the Dockerfile, our CI system builds it and pushes the image out to staging. We redirect some real-world traffic to staging and it works. We do a rolling upgrade of the prod service. It works.
I spend the rest of the day reading Reddit.
Github issue still unanswered, but we didn't have any problems overnight anyway.
I have some more exciting things to do: caching. Some guys from Hooli have open-sourced this pretty cool load-balancing and caching proxy that they wrote in Go and it fits our use case perfectly for an internal service that has always had performance issues.
They provide a Docker container for their proxy, so I quickly get the
docker-compose.yml file for our service, add the image to the stack and fiddle around with its configuration (exposed via environment variables) to wire it up to the service workers. I run the whole stack up locally and rerun the integration tests to hit the proxy instead. They pass, so I push the whole thing out to staging. We redirect some requests to hit the staging service in order to compare the performance and correctness.
I spend the rest of the day reading Reddit.
The Github issue has been answered and my PR has been accepted. The developer also found a couple of other bugs that my fix exposed which have also been fixed now. I change our service to build against the latest tag, build on CI, tests pass.
I look at the dashboards to see how my version of the service did overnight: turns out, the caching proxy reduced the request latency by about a half. We agree to push it to prod.
I spend the rest of the week reading Reddit.
There is a lot of tools, workflows and frameworks in software engineering that made developers' lives easier and that paradoxically haven't been applied to the problem of data processing.
In software, you do a
git pull and bring the source code up to date by having a series of diffs delivered to you. This ability to treat new versions as patches on top of old versions has opened up more opportunities like rebasing, pull requests and branching, as well as inspecting history and merge conflict resolution.
None of this exists in the world of data. Updating a local copy of the dataset involves downloading the whole image again, which is crazy. Proposing patches to datasets, having them applied and merging several branches is unspoken of and yet is a common workflow in data science: why can't I maintain a fork of data from a vendor with my own fixes on top and then do an occasional
git pull --rebase to have my fork up to date?
In software, we have learned to use unique identifiers to refer to various artifacts, be it Git commit hashes, Docker image hashes or library version numbers. When someone says "there's a bug in version 3.14 (commit 6ff3e105) of this library", we know exactly which codebase they refer to and how we can get and inspect it.
This doesn't happen with data pipelines: most of the time we hear "the data we downloaded last night was faulty but we overwrote chunks of it and it's propagated downstream so I've no idea what it looks like now". It would be cool to be able to refer to datasets as single, self-contained images and for any ETL job to be just a function between images: if it's given the same input image, then it will produce the same output image.
To expand on that, Docker has made this "image" abstraction even more robust by packaging all of the dependencies of a service together with that service. This means that this container can be run from anywhere: on a developer's machine, on a CI server, or in production. By giving the developers tools that make replicating the production experience easier, we have decreased the distance between development and production.
I used to work in quant trading and one insight I got from that is that getting a cool dataset and finding out that it can predict the returns on some asset is only half of the job. The other half, less talked about and much more tedious, is productionizing your findings: setting up batch jobs to import this dataset and clean it, making sure the operators are familiar with the import process (and can override it if it goes wrong), writing monitoring tools. There's the ongoing overhead of supporting it.
And despite that, there is still a large distance between research and production in data science. Preparing data for research involves cleaning it, importing it into say Pandas, figuring out what every column means, potentially hand-crafting some patches. This is very similar to old-school service set up: do a
sudo apt-get install of the service, spend time setting up its configuration files, spend time installing other libraries and by the end of the day don't remember exactly what you did and how to reproduce it.
Docker made this easier by isolating every service and mandating that all of its dependencies (be it Linux packages, configuration files or any other binaries) are specified explicitly in a Dockerfile. It's a painful process to begin with but it results in something very useful: everyone now knows exactly how an image is made and its configuration can be experimented on. One can swap out a couple of
apt-get statements in a Dockerfile to install, say, an experimental version of
libc and get another version of the same service that they can compare against the current one.
In an ideal world, that's what would happen with data: I would write a Dockerfile that grabs some data from a few upstream repositories, runs an SQL JOIN on tables and produces a new image. Even better, I should be able to have this new image kept up to date and rebase itself on any new changes in the upstream. I should be able to rerun this image against other sources and then feed it into a locally-run version of my pipeline to compare, say, the prediction performance of the different source datasets.
We are slowly coming to a set of standards on how to distribute software which has reduced onboarding friction and allowed people to quickly prototype their ideas. One can do a
docker pull, add an extra service to their software stack and run everything locally to see how it behaves within minutes. One can search for some software on GitHub and
git clone it, knowing that it probably has fairly reproducible build instructions. Most operating systems now have package managers which provide an index of software that can be installed on that system as well as allow the administrator to keep those packages up to date.
There is a ton of open data out there, with a lot of potential hidden value, and most of it is unindexed: there's no way to find out what it is, where it is, who maintains it, how often it's updated and what does each column mean. In addition, all of it is in various ad hoc formats, from CSVs to SQL dumps, from HDF5 files to unscrapeable PDF documents. For each one of these datasets, an importer has to be written. This raises the friction of onboarding new datasets and innovating.
One thing that Git and Docker are popular for is that they're unopinionated: they don't care about what is actually being versioned or run inside of the container. If Git only worked with a certain folder structure or required one to execute system calls to perform checkouts or commits, it would never have taken off. That is,
git doesn't care whether what it's versioning is written in Go, Java, Rust, Python or is just a text file.
Similarly with Docker, if it only worked on artifacts produced by a certain programming language or required every program to be rewritten to use Docker, that would slow down adoption a lot if not outright kill the tool.
Both of these tools build up on an abstraction that has been around for a while and that other tools use: the filesystem. Git enhances tools that use the filesystem (such as the IDE or the compiler) by adding versioning to the source code. Docker enhances the applications that use the filesystem (that is, all of them) by isolating them and presenting each one with its own version of reality.
Such an abstraction also exists in the world of data: it's SQL. A lot of software is built on top of it and a lot of people, including non-technical ones, understand SQL and can write it. And yet most tools around want users to learn their own custom query language.
All these anecdotes and comparisons show that there are a lot of practices that data scientists can borrow from software engineering. They can be combined into three core concepts:
- Composability: Much like with Docker, where it's easy to take existing containers and extend them or compose multiple containers into a single service, it should be straightforward to extend or apply patches to datasets or create derivative datasets. Most of the hidden value in data comes from joining multiple, sometimes seemingly unrelated datasets together.
- Maintainability: Coming up with a predictive model by doing some exploratory work in a Jupyter notebook is only half of the battle. Maintaining the data pipeline, keeping the derivative data up to date and being able to quickly locate, fix and propagate fixes to issues in upstream data should be made easier. What-if analyses should be a matter of changing several lines of configuration, not manually tracking changes through several layers of ETL jobs. Updating data should be done by
git pull, not by downloading a new data dump and rerunning one's import scripts.
- Reproducibility: The same Dockerfile and the same build context will result in the same Docker container which will behave the same no matter where it is. One should always know how to rebuild a data artifact from scratch by only relying on its dependencies and a build recipe and sharing datasets should be as easy as pushing out a new Docker image or a new set of Git commits.
Over the past two years, I and Miles Richardson have been building something in line with this philosophy.
Splitgraph is a data management tool and a sharing platform that is inspired by Docker and Git. It currently is based on PostgreSQL and allows users to create, share and extend SQL schema images. In particular:
- It supports basic git-like operations, including
pulletc to produce and inspect new commits to a database. Whilst an image is checked out into a schema, it's no different from a set of ordinary PostgreSQL tables: any other tool that speaks SQL can interact with it, with changes captured and then packaged up into new images.
- It uses a Dockerfile-like format for defining images, called Splitfiles. Much like Dockerfiles, it uses command-based image hashing so that if the execution results in an image that already exists, the image will be checked out instead of being recreated. Furthermore, the Splitfile executor does provenance tracking so that any image created by it has its dependencies recorded and can be updated when the dependencies update.
- Data ingestion is done by either writing to the checked-out image (since it's just another PostgreSQL schema) or using Postgres Foreign Data Wrappers that allow to mount any database (currently we include the open-source PostgreSQL, MySQL and MongoDB FDWs in our engine as well as an FDW for the Socrata Open Data platform) as a set of local SQL tables.
- There's more exciting features designed to improve data ingestion, storage, research, transformation and [https://www.splitgraph.com/product/data-lifecycle/sharing].
We have already done a couple of talks about Splitgraph: a short one at a local Docker meetup in Cambridge (slides) talking about parallels between Docker and Splitgraph and a longer one at a quantitative hedge fund AHL (slides) discussing the philosophy and the implementation of Splitgraph in-depth. A lot has changed since then but it still is a good introduction to our philosophy.