Mega consultancy McKinsey has made its first foray into the open source world, offering up a machine learning framework built at its QuantumBlack analytics unit.
The company describes Kedro as a Python library that can be used to construct data and machine learning pipelines, streamlining the way data scientists and engineers work when collaborating on “large scale analytics deployments”.
As the company’s announcement diplomatically puts it, “every data scientist follows their own workflow…distractions and shifting deadlines may introduce friction, ultimately resulting in incoherent code.”
Product manager Yetunde Dada said machine learning projects “often don’t have the application of good software engineering principles. So when you take that code and you try and make it work for a very solid business case so it starts to generate value for your company, typically you have to do a lot of engineering work to fix the code to make it run as it should.”
The unruliness of data scientists has spurred a number of initiatives. MLflow has pitched itself as a framework for tracking experiments, but also for packaging code, and managing and deploying models.
Dada described MLflow as “the application of one software engineering principle…this whole ability to do versioning.”
Kedro, she said, “is essentially the application of modularity, which is being able to split your code base into small chunks so it’s easy to test.”
It covered “versioning, reproducibility, and the ability to log what’s happening in your pipeline…We essentially cover the spectrum of software engineering principles needed to take machine learning into production.”
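The modularity Dada describes can be illustrated with a minimal sketch in plain Python. This is not Kedro's API: the `Step` type, the `run_pipeline` helper and the two example functions are hypothetical names invented here, on the assumption that a pipeline is simply a list of small functions wired together by named inputs and outputs.

```python
# Illustrative only: plain Python, NOT Kedro's actual API.
# All names here (Step, run_pipeline, drop_missing, mean) are hypothetical.
from typing import Any, Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    func: Callable[..., Any]  # a small, independently testable function
    inputs: List[str]         # names of the datasets the function reads
    output: str               # name under which the result is stored


def drop_missing(rows: List[Any]) -> List[Any]:
    """Remove None entries from a list of raw values."""
    return [row for row in rows if row is not None]


def mean(values: List[float]) -> float:
    """Return the arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def run_pipeline(steps: List[Step], data: Dict[str, Any]) -> Dict[str, Any]:
    """Run each step in order, passing results along by name."""
    results = dict(data)  # copy so the caller's dict is not mutated
    for step in steps:
        args = [results[name] for name in step.inputs]
        results[step.output] = step.func(*args)
    return results


STEPS = [
    Step(drop_missing, ["raw"], "clean"),
    Step(mean, ["clean"], "average"),
]

if __name__ == "__main__":
    print(run_pipeline(STEPS, {"raw": [2.0, None, 4.0]})["average"])
```

Because each chunk is an ordinary function, it can be tested on its own without running the rest of the pipeline, which is the property the quote points to.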
She said it “typically does very well for large teams of data scientists – upwards of three data scientists that have to collaborate on a single code base.”
Dada said the tool was “data source agnostic” and able to handle anything from “a few GB to TBs of data”. Kedro promises “seamless packaging, allowing you to ship your products into production”, using the likes of Docker or Airflow – though plug-ins for these appear to be listed as “coming soon” on the tool’s GitHub page.
Dada also highlighted its pipeline visualisation capabilities: “It’s quite an impactful thing when you see 100s of data sources coming together to make one machine learning result. Clients get that visibility of what is happening in their pipeline because of the way they’ve written their Kedro code.”
While McKinsey is ubiquitous amongst mega companies, its software is not often seen in the wild. Michele Battelli, head of engineering and product at QuantumBlack, said its technology was usually geared towards its internal teams or external clients.
However, he continued, “We want clients to continue using the technology we build after the project is complete. That is why we’re open sourcing some of this technology.” This would ensure clients’ independence, and reduce technical debt, he said.
Kedro is available on GitHub.