Searching for data sets? Right this way, says Google

Google has launched Dataset Search to give data scientists a hand at discovering data sets – wherever they may lie. But providers will have to get active themselves to ensure they are not overlooked.

Remember the bad old days, when you had to tediously scrape together data to get a good machine learning training set? Times have changed. Though you might still have to wait for your project partner to give you access to specific sets, open data initiatives (Hello NASA!) and changes in the way research is published help to make data available for all sorts of purposes, for example building a machine learning model to predict your next local natural disaster. The remaining problem is finding those data sets when you need them.

Since Google seems to be the epitome of search, its AI division has now taken on this problem and launched Dataset Search. The service lets data enthusiasts find data sets in all sorts of places (personal websites, research sites, digital libraries, etc.), provided the publishers actually took the time to add metadata in the Google-endorsed way, which is based on an open standard Google is also involved in.

Data scientists who’d like their sets to be discovered more often can find a guide to describing them in a search-engine-friendly way on Google’s product pages. Necessary information includes the creator of the data set, when it was published, how the data was collected, and what the terms for using the data are. The search engine collects and links that information, trying to find different versions of a set as well as publications mentioning it.
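To give a rough idea of what such a description might look like, here is a minimal Python sketch that builds a metadata record and wraps it for embedding in a web page. It assumes the open standard in question is the schema.org Dataset vocabulary expressed as JSON-LD; the function names and all sample values are hypothetical, and Google's guide remains the authority on which fields are actually required.

```python
import json


def build_dataset_metadata(name, description, creator, date_published,
                           license_url, collection_method):
    """Return a schema.org-style Dataset description as a plain dict.

    The fields mirror the information listed above: creator, publication
    date, how the data was collected, and the terms of use.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Organization", "name": creator},
        "datePublished": date_published,
        "license": license_url,                      # terms of use
        "measurementTechnique": collection_method,   # how the data was collected
    }


def to_jsonld_script(metadata):
    """Wrap the metadata in a script tag that can be placed in a page's HTML."""
    body = json.dumps(metadata, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # All values below are made up for illustration.
    record = build_dataset_metadata(
        name="Example river level readings",
        description="Hourly water level readings from a fictional gauge.",
        creator="Example Hydrology Lab",
        date_published="2018-09-01",
        license_url="https://creativecommons.org/licenses/by/4.0/",
        collection_method="Automated pressure sensor, hourly sampling",
    )
    print(to_jsonld_script(record))
```

The resulting snippet would go into the HTML of the page that describes the data set, where a crawler can pick it up.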

According to Google AI Research Scientist Natasha Noy, the currently available version of Dataset Search should find references to “most data sets in environmental and social sciences, as well as data from other disciplines including government data and data provided by news organizations”. To improve coverage in other areas, the team behind the project has to rely on more data providers adding good metadata.