After almost 12 years of work, the team behind the data analysis and manipulation library pandas is finally able to celebrate its first major release.
Pandas comes in the form of a Python package and aims to be “the fundamental high-level building block for doing practical, real world data analysis” in that same language. It provides flexible data structures that help in dealing with relational and labeled data, which has made the library quite popular amongst machine learning practitioners.
Version 1.0 splashes out on new features, providing devs with a function to convert data frames into markdown tables, for example. It also adds an engine keyword to rolling.apply and expanding.apply, so that users can choose Numba instead of Cython.
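As a rough illustration of the engine keyword, the sketch below passes it through to rolling.apply. The helper names value_range and rolling_range and the sample prices are made up for this example. It defaults to the Cython engine so that it runs without extra dependencies; passing engine="numba" assumes the optional Numba package is installed.

```python
import numpy as np
import pandas as pd


def value_range(window: np.ndarray) -> float:
    """Spread between the largest and smallest value in one window."""
    return window.max() - window.min()


def rolling_range(series: pd.Series, window: int, engine: str = "cython") -> pd.Series:
    # raw=True hands each window to the function as a NumPy array,
    # which is also what the Numba engine expects.
    return series.rolling(window).apply(value_range, raw=True, engine=engine)


prices = pd.Series([10.0, 12.0, 9.0, 15.0, 11.0])  # illustrative data
spread = rolling_range(prices, window=3)
print(spread.tolist())
```

The first two entries of the result are missing values, because a window of three rows is not yet full at those positions.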
The Numba engine is meant to speed up processing of larger data sets, but only after the first time the function is run using the engine. The first call is bound to incur some compilation overhead, but since the compiled function is then cached, subsequent calls return results faster. Another addition to rolling operations is a pandas.api.indexers.BaseIndexer() class. Analysts can use it to define how start and end indices for a window are created, if a custom approach is needed.
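A custom window could look roughly like the following sketch, which sums each row together with the rows that follow it. The class name ForwardIndexer is hypothetical. The get_window_bounds signature shown here includes the step parameter used by more recent pandas versions (1.5 and later, as far as I am aware); the 1.0 release itself did not have it, so the argument list may need adjusting to match the installed version.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class ForwardIndexer(BaseIndexer):
    """Hypothetical indexer: each window starts at the current row and looks forward."""

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        # Window i covers rows [i, i + window_size), clipped at the end of the data.
        start = np.arange(num_values, dtype=np.int64)
        end = np.minimum(start + self.window_size, num_values).astype(np.int64)
        return start, end


s = pd.Series([1.0, 2.0, 3.0, 4.0])
# min_periods=1 keeps the final, shorter window instead of returning a missing value.
forward_sum = s.rolling(ForwardIndexer(window_size=2), min_periods=1).sum()
print(forward_sum.tolist())
```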
Those interested in seeing which direction the library is heading can take a look at the experimental features that made it into the 1.0 release. Amongst other things, it includes a pd.NA singleton, which can be used as an indicator for missing data across types (as opposed to np.nan and pd.NaT, which are tied to float, object-dtype or datetime-like data).
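A small sketch of how pd.NA shows up in practice, assuming the nullable "Int64" and "boolean" extension dtypes; the sample values are invented for illustration.

```python
import pandas as pd

# The same missing-value marker is used for nullable integers and booleans.
ints = pd.Series([1, None, 3], dtype="Int64")
flags = pd.Series([True, None, False], dtype="boolean")

print(ints[1] is pd.NA)
print(flags[1] is pd.NA)

# Logical operations follow three-valued logic: the result is only
# missing when it genuinely depends on the unknown value.
either = pd.NA | True
both = pd.NA & False

# Reductions skip missing values by default.
total = ints.sum()
```

Since the feature is flagged as experimental, its behaviour may still change between releases.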
On top of that, there’s an experimental StringDtype, extending string data to tackle some issues with object-dtype NumPy arrays. Once the details are figured out, the string extension type will prevent the accidental mixing of strings and non-strings in such arrays, help select just text for certain operations and clarify contents during reading. New methods like DataFrame.convert_dtypes() and Series.convert_dtypes() are meant to encourage use of the new dtypes.
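To give an idea of how the new dtypes are used, here is a minimal sketch with made-up column names and values. It requests the string dtype explicitly for one Series and lets convert_dtypes() pick the extension types for a small DataFrame.

```python
import pandas as pd

# Opt in to the dedicated string dtype instead of generic object dtype.
names = pd.Series(["ada", None, "grace"], dtype="string")
upper = names.str.upper()

# convert_dtypes() moves columns to the best matching extension dtypes.
df = pd.DataFrame(
    {"name": ["ada", "grace"], "age": [36, 45], "active": [True, False]}
)
converted = df.convert_dtypes()
print(converted.dtypes)
```

With the default options, the integer column should end up as Int64 and the boolean column as boolean, while the text column becomes a string extension dtype.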
Devs who made use of older pandas versions are advised to upgrade to pandas 0.25 first and check that their code runs without warnings before making the leap to 1.0, since the team has removed a lot of deprecated features.
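One way to run such a check is to record the warnings a piece of code emits, as in the sketch below. It uses only the standard library warnings module. The helper deprecation_warnings and the two sample functions are hypothetical stand-ins for real analysis code running against pandas 0.25.

```python
import warnings


def deprecation_warnings(func):
    """Run func and return the deprecation-style warnings it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        func()
    return [
        w for w in caught
        if issubclass(w.category, (FutureWarning, DeprecationWarning))
    ]


def legacy_step():
    # Stand-in for code that calls a deprecated feature.
    warnings.warn("this keyword will be removed", FutureWarning)


def clean_step():
    return sum([1, 2, 3])


print(len(deprecation_warnings(legacy_step)))
print(len(deprecation_warnings(clean_step)))
```

A stricter alternative is warnings.simplefilter("error", FutureWarning), which turns each such warning into an exception so a test suite fails at the offending call.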
Starting with the current release, pandas also switches to a variant of semantic versioning for its releases. This largely means that API-breaking changes will only be part of major releases (2.0.0, 3.0.0, …), experimental features aside. Meanwhile, deprecations will be introduced in minor releases (1.1.0, 1.2.0, …) and enforced in major ones.