GitHub digs into machine-learning repos, uncovers a lot of Python


GitHub has shed light on what is happening in the machine learning and data science development worlds by doing its own data dig on repos across its platform.

The figures covered the period from January 1 to December 31, 2018, and tracked “contributions” such as pushing code, opening issues or pull requests, commenting, etc.

Unsurprisingly, perhaps, Python was the most common language among machine learning repositories, and was also the third most common language on GitHub overall, as it has been since 2015.

C++, JavaScript, Java, C# and Shell were also among the leading languages, while Julia came in at 6, R at 8 and Scala at 10. “Julia, R, and Scala all appear in the top 10 for machine learning projects but not for GitHub overall,” GitHub said.


When it comes to the Python packages imported by machine learning or data science projects, numpy was the leader, imported by 74 per cent of them. The top five was rounded out by scipy (47 per cent), pandas (41 per cent), matplotlib (40 per cent) and scikit-learn (38 per cent). TensorFlow was in 7th place, with 24 per cent.

By contrast, TensorFlow topped the charts when it came to machine-learning projects by contributions. Apparently, it clocked five times as many contributors as second-placed scikit-learn. Third place went to explosion/spaCy, with Julia in fourth.

There was also some interesting action outside the top tens shown on GitHub’s slides.

Scala is becoming increasingly common for interacting with big data systems like Apache Spark. Also, GitHub revealed in its 2018 State of the Octoverse report that pytorch was one of the fastest growing projects, second only to Microsoft's Azure docs, though it didn’t make any of this week’s lists.
