Thanks to advances in hardware technology and the amount of data available to many companies, machine learning has become more and more compelling to businesses and developers alike.
Unfortunately, it isn’t as easy as working through a TensorFlow tutorial and voilà, you’re a machine learning expert. Even if the technical foundations are clear, the field comes with a few challenges many software developers without practical ML experience aren’t even aware of. At Cloudera’s DataWorks Summit, we got the chance to talk to the company’s GM of Artificial Intelligence, Hilary Mason, who knows first-hand, from her years as a data scientist and ML researcher, the troubles people getting into machine learning face.
“Many of the things we have in the mature practice of software engineering, we do not really have yet in the practice of data science and machine learning. Using version control in a robust and collaborative way, much less being able to test models and look for bias in models – these are all issues where there is no one right way to do it, which is surprising, because in software engineering there is more or less a right way to do it and we sort of know what it is.”
Until this is figured out and some sort of best practices exist, developers in this area will have to make do. “Today, data science and machine learning practices tend to get shoved into those software engineering workflows where it works, and where it doesn’t there is no standard. So some organisations basically have no testing of their machine learning code, others use very rigorous software engineering practices, but most exist in some sort of happy medium.”
Placing ML specialists for purpose and impact
While this kind of environment is great for adventurous individuals, they might hit some walls later on, as the placement of data science and machine learning teams within a company determines their impact. “When the company is really small it is really easy, but even at a medium-sized company there is the question of whether this team reports up to the CFO/COO, whether it reports into IT, or engineering, or product, or a research and development/innovation group.”
Putting your data scientists into the IT department, for example, might not be the best idea if you want your product team to come up with something AI-related. “If you put them in the COO/CFO/operations space they will only ever do business metrics and predictions and forecasting and they will never impact a product. If you put them in product, they probably wouldn’t impact the business and they certainly won’t be able to do R&D, because there’s always something for a product shipping today that is more important than an R&D effort for something you might someday get to do.”
As soon as purpose and therefore positioning is figured out, a long hard look at the products at hand might give pointers as to which would be a good first project to introduce machine learning technologies into. “Typically their [Cloudera customers’] first couple of projects are ones where they have a clear ROI. They already pay for a process and they want to add automation to make that process cheaper – so that they can also point to it and say ‘oh we saved this many dollars from this investment in automation’. That’s typically the first project, and when they’ve done that, sometimes they sort of say, ‘ok – let’s think of new revenue opportunities off this data and this capability’.”
Nowadays, neural networks, especially those of the deep learning variety, seem to be the solution of choice judging by the number of mentions alone. However, a look at what practitioners are actually doing reveals there’s still a place for classic techniques such as Support Vector Machines.
Basics can be interesting
“For the vast majority of machine learning problems that people want to solve, you should never start with a neural network. You should start with some sort of straightforward, generally interpretable approach and work through the complexities of the classical machine learning approaches, because these things are much easier to design, faster to iterate, much faster to run, computationally much more tractable, and easier to maintain.”
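The “straightforward, generally interpretable approach” Mason recommends can be as small as a logistic regression fitted with plain gradient descent. A minimal pure-Python sketch (the toy data, learning rate, and epoch count below are illustrative choices, not from the interview):

```python
import math

def train_logistic(data, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression classifier with per-sample gradient descent.
    data: list of feature lists; labels: list of 0/1 ints.
    Returns the weight vector; the last entry is the bias term."""
    n_features = len(data[0])
    w = [0.0] * (n_features + 1)  # feature weights + bias
    for _ in range(epochs):
        for x, y in zip(data, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - y                     # gradient of log loss w.r.t. z
            for i in range(n_features):
                w[i] -= lr * err * x[i]
            w[-1] -= lr * err
    return w

def predict(w, x):
    """Classify a point with the learned weights: 1 if above the boundary."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1 if z > 0 else 0
```

Because the model is just a weighted sum, you can read the learned weights directly to see which features drive a prediction – exactly the interpretability that a neural network gives up.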
Once that’s achieved, you can still try your hand at deep learning: “Only once you have something that works fairly well, or you decide it’s impossible to get to something that works fairly well in that classical approach, should you then try implementing some sort of neural network or deep learning approach. And in that case you obviously get a lot of power from not having to do the manual feature engineering, but you also sacrifice a fair bit of manageability in the sense that instead of designing the features you now design the nature of the network. And you take on the huge burden of the amount of computation required to train and maintain it, and then you also need to have extremely large clean datasets to even have a hope of that working, which actually a lot of our customers don’t have.”
A look at the numbers speaks for itself: “I do keep a sort of mental tally of the percentage of our customers where I actually see them using deep learning in production and it’s still around 5 per cent. The vast majority of use cases really should take that classical machine learning approach.”
“Now there is a world, there is a potential future path, where maybe in more than 50 per cent you would use deep learning. And if we continue to see rapid progress in AutoML and in multi-task learning, I could see that happening, but I think it’s some number of years away and has a low probability of happening in any case.”
Looking for the next useful thing
To get to that future world however, there is still a lot of research to be done and problems to be tackled. Mason does her part in that at Fast Forward Labs, a machine intelligence research company she founded in 2014. In 2017, FFL got bought by Cloudera, which is how Mason became part of the software company.
Today, Cloudera FFL’s quarterly reports on new technologies and releases of prototypes provide insight into what could prove useful in the coming months. “We had a recent report on federated learning, […] machine learning at the edge, where you can’t move the data from the edge to some central data lake to do the analysis, because there are either regulatory or privacy reasons – GDPR is a good example. In an internet of things context the data stream might just be much too large to move, or maybe you are limited on bandwidth. So what you do is learn in that edge environment and share what is learned from that particular model back upstream and then out with all the other implementations, so that’s one example.”
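The core loop Mason describes – learn locally, ship only what was learned back upstream – can be sketched as federated averaging. This is a simplified illustration, not FFL’s prototype; the function names and the plain weight-list model are assumptions for the sake of a runnable example:

```python
def local_update(weights, gradients, lr=0.01):
    """One local training step on an edge device:
    move the weights a small step against the local gradient."""
    return [w - lr * g for w, g in zip(weights, gradients)]

def federated_average(client_weights):
    """Server step: average the model weights each edge client sends up.
    The raw data never leaves the devices; only learned weights travel."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]
```

The averaged model is then pushed back “out with all the other implementations”, and the round repeats – the data stream itself never crosses the network.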
The lab also works as an outsourced R&D department for other companies, which means it is quick to find out what practitioners are struggling with and to get working on it. One of those areas is learning with limited labeled data – mostly because not everyone interested in machine learning has enough clean material to train a good model.
“So this [active learning or learning with limited data] is a set of techniques for prioritising the most important examples in your data set to pay for a label for – whether you’re paying a human to label it or you’re paying a more expensive classifier to label it so that you can improve the accuracy of the model you’re trying to train off that data.”
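One common way to “prioritise the most important examples to pay for a label for” is uncertainty sampling: ask the current model for a probability on each unlabeled example and send the ones it is least sure about (closest to 0.5) to the labeler. A minimal sketch – the function name and toy scoring are illustrative, not from the interview:

```python
def uncertainty_sample(unlabeled, predict_proba, budget):
    """Rank unlabeled examples by how unsure the current model is
    (predicted probability closest to 0.5) and return the `budget`
    most uncertain ones -- the examples worth paying a labeler for."""
    ranked = sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))
    return ranked[:budget]
```

Each labeling round then retrains the model on the newly labeled examples, so the label budget goes where it improves accuracy most.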
Multi-tasking and getting there
On the neural network side of things, multi-task learning, which is the ability to train one neural network for multiple objective functions, is a long-term research interest the lab has right now. “Let’s say in our prototype you have a news article and you want to know the topic and you want to know significant sentences and I think you want to know the sentiment, like three related text classification tasks. The way you would architect that today would be to build three separate classifiers, even though it is the same piece of data getting classified, and they would be done independently.”
“But it turns out in multi-task learning, and this totally blew my mind when I first heard about it, if you train one neural network with three objective functions, it actually learns an overlap in the feature representations, so the accuracy on all three tasks goes up. This is particularly useful for our customers who might have smaller datasets that are related and have labels, versus extremely large clean datasets.”
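Structurally, the idea is one shared trunk feeding several task-specific heads, so the hidden representation is computed once and all three losses train it jointly. A pure-Python forward pass as a sketch – the tiny layer sizes, weights, and head names here are made up for illustration:

```python
def relu(v):
    """Elementwise rectifier: zero out negative activations."""
    return [max(0.0, x) for x in v]

def linear(x, weights, bias):
    """A dense layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def multi_task_forward(x, trunk, heads):
    """One forward pass: the shared trunk runs once, and every task head
    (e.g. topic, key sentences, sentiment) reads the same hidden
    representation. Training all the objectives jointly is what lets
    the tasks share features."""
    w, b = trunk
    hidden = relu(linear(x, w, b))
    return {name: linear(hidden, hw, hb) for name, (hw, hb) in heads.items()}
```

Contrast this with “three separate classifiers”: there, each network learns its own features from scratch; here, the trunk’s parameters are shared, so each task effectively benefits from the other tasks’ labels.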
While these approaches might be helpful for some kinds of applications, the whole sector could profit from more people looking into machine learning from a more ops-centric perspective, since integrating machine learning into a lifecycle with continuous delivery and integration is still tricky, to say the least. “This is something that is an unsolved problem – it’s unsolved from a process point of view and there are actually some unsolved computer science problems in there. Because once you deploy a model you’re not done.”
“Any model that touches the real world will decay over time, so it needs to be maintained; you need ways to monitor it, to tell when it has decayed and find out what decay looks like. We have really rudimentary ways of doing this right now, and we don’t have great tooling around these rudimentary ways of doing this – the tooling we have doesn’t integrate with standard software or even monitoring capabilities. So there’s a lot of maturity we yet have to develop in this area of the world.”
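One rudimentary way to “find out what decay looks like” is to compare the distribution of live prediction scores against a baseline recorded at training time. A crude mean-shift check as a sketch – the threshold and function name are illustrative assumptions, not a recommended production monitor:

```python
def mean_shift_alert(baseline, live, threshold=2.0):
    """Crude decay check: alert if the mean of live prediction scores
    has drifted more than `threshold` baseline standard deviations
    from the mean observed at training time."""
    n = len(baseline)
    mu = sum(baseline) / n
    var = sum((x - mu) ** 2 for x in baseline) / n
    sigma = var ** 0.5 or 1e-9  # guard against a zero-variance baseline
    live_mu = sum(live) / len(live)
    return abs(live_mu - mu) / sigma > threshold
```

Real monitoring would look at full distributions (and at accuracy on fresh labels, when available), but even a check this simple catches the common failure mode of a model silently drifting away from the data it was trained on.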
If you think you could help with that, but have no idea about how to get started on the actual basics, Mason has some quite practical – and most importantly self-tested – advice: “Actually start with classical machine learning; don’t jump into neural networks even though they’re really cool. What I would say is pick a project that is something you are personally interested in, so get some public data.”
Every tech starts with a first line of code
For Mason’s own trials, this data came in the form of thousands of crawled recipes and scraped and parsed menu data from whatever the local neighbourhood had to offer – so get creative. “Find something that you don’t mind looking at for a while and then try to come up with a project where you can start with a really straightforward approach. Start with a regression, start with naive Bayes or k-means if you’re clustering – something where you can implement the algorithm yourself, you can see it work, you can understand how it works, and then work your way up.”
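Taking the k-means suggestion literally, the whole algorithm fits in a few lines: pick k starting centers, then alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. A pure-Python sketch (the iteration count and seeding are arbitrary choices for reproducibility):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means clustering on a list of numeric tuples:
    repeatedly assign points to the nearest center, then move each
    center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k random data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                dim = len(members[0])
                centers[i] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(dim))
    return centers
```

This is exactly the kind of implementation Mason is advocating: it is short enough to watch work step by step, and rough enough (random initialisation, fixed iteration count) that you learn where the algorithm’s quirks live.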
It’s not about a tidy, well documented piece of code either, says Mason. “I’m always a fan of implementing by yourself – even if your code is messy and it doesn’t scale and maybe it only works four out of five times – so that you understand the mathematics and you understand how you get that really great fundamental intuition for what is happening and how it works, so that you can bring that into a set of problems.”
The knowledge of how these systems work is crucial, because without it, developers can’t focus on challenges like formulating the actual machine learning problem. “In applied machine learning where you’re trying to solve the most useful problem, often we do a trick, where we say ‘here is the problem we think we want to solve, but that is going to be intractable or going to take us six months. But here is a problem we can approach in two weeks, so let’s try that’”.
“Speaking to the software engineers, there’s also a different style of writing. I’ve written a great deal of web application code or distributed systems code and I’ve also written plenty of data science and machine learning code. You write the code differently because you’re writing code to learn from a dataset that you do not yourself control. So there’s also a mindset shift from deterministic programming to a sort of probabilistic, almost playful mode of trying to understand what comes out of it.”
And if you’re still set on neural networks, you can try reworking your finished classic ML project as soon as it’s stable. “Because then you get the full experience of developing it, from feature engineering, to model selection, to the result. And then you also have the experience of using neural networks, maybe even PyTorch or whatever tool you’re excited about, to actually get to a result. And you’ll know potentially if that result is any good because you have something to compare it to and you might even know what you want to change about the dataset you’re feeding in or you have a bunch of better intuition for how it’s working. But there is no substitute for picking a project and just actually trying to write some code.”