Apache Lucene goes full steam ahead on performance with 9.0 release

The team behind the search library Apache Lucene has released version 9.0 of the open source project, bringing its users performance improvements and first steps towards Java module system support.

Lucene, which serves as the basis for projects such as Elasticsearch and MongoDB Atlas’ full-text search, tries to keep up with the times in version 9.0 by looking into ways of supporting new usage scenarios and Java features. It is, for instance, the first release to provide JARs with automatically generated module names, which the team behind the engine hopes will help to enable work with the Java module system at a later stage.

The Lucene team has also been busy exploring the indexing of high-dimensional numeric vectors, and v9.0 can use such vectors to perform nearest-neighbor search. The resulting implementation uses the Hierarchical Navigable Small World (HNSW) graph algorithm and has been added to answer a growing demand from data scientists working in machine learning, who need to index documents containing vectors.
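
To give a sense of what nearest-neighbor search over vectors computes, the following Java sketch finds the closest vectors by comparing the query against every stored vector. It is an illustration only: it uses neither Lucene's API nor an HNSW graph, which is designed to approximate this kind of result without scanning everything. The class name BruteForceKnn, its methods, and the sample vectors are made up for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Exact k-nearest-neighbor search by linear scan; illustration only, not Lucene code. */
final class BruteForceKnn {

    /** Squared Euclidean distance between two vectors of equal dimension. */
    static double squaredDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the indices of the k vectors closest to the query, nearest first. */
    static int[] nearest(float[][] vectors, float[] query, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        Integer[] ids = new Integer[vectors.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Sort all candidate ids by their distance to the query vector.
        Arrays.sort(ids, Comparator.comparingDouble(
                (Integer id) -> squaredDistance(vectors[id], query)));
        int n = Math.min(k, ids.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical two-dimensional "document" vectors.
        float[][] docs = {
            {0.0f, 0.0f},
            {1.0f, 1.0f},
            {5.0f, 5.0f},
            {0.9f, 1.2f}
        };
        float[] query = {1.0f, 1.0f};
        System.out.println(Arrays.toString(nearest(docs, query, 2)));
    }
}
```

With the sample data, main prints [1, 3]: vector 1 is identical to the query and vector 3 is the next closest. A linear scan like this costs one distance computation per stored vector for every query, which is the cost that graph-based approaches such as HNSW try to avoid on large indexes.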

However, the focus of the new major release seems to have been largely placed on performance, as the update’s announcement highlights speed-ups in areas like taxonomy faceting, sorting, and indexing of multi-dimensional points. And there’s still more to come, as the release also includes some foundational work on taking system statistics into account when running queries concurrently, with the aim of making the most of the resources available.

Apart from that, Lucene comes with reworked ConcurrentMergeScheduler settings, which assume modern I/O to improve indexing performance and prevent systems from running into seemingly random JDK issues. RegExp queries have become stricter, following the Java Pattern policy for rejecting illegal syntax, and now know how to handle \w, \W, \d, \D, \s, and \S expressions.
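
For readers unfamiliar with the shorthand classes, the following Java sketch shows what they stand for and how a malformed expression gets rejected. It uses the JDK's own java.util.regex.Pattern rather than Lucene's RegExp class, on the assumption that the shorthand classes carry roughly the same meaning in both. The class name ShorthandClassDemo and its helper methods are invented for the example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Demonstrates shorthand character classes with java.util.regex, not with Lucene's RegExp. */
final class ShorthandClassDemo {

    /** True if the entire input matches the expression. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** True if the expression compiles, false if it is rejected as malformed. */
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \d matches a digit, \D anything that is not a digit.
        System.out.println(fullMatch("\\d+", "2021"));              // true
        System.out.println(fullMatch("\\D+", "v9"));                // false, "9" is a digit

        // \w matches a word character, \s whitespace, \S non-whitespace.
        System.out.println(fullMatch("\\w+\\s\\d+", "lucene 9"));   // true
        System.out.println(fullMatch("\\S+", "a b"));               // false, contains a space

        // An unclosed group is illegal syntax and is refused at compile time.
        System.out.println(isValid("(unclosed"));                   // false
    }
}
```

The isValid helper reflects the stricter behaviour described above in spirit only: an expression that is not well formed is refused up front instead of being interpreted leniently.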

With the new release, the Lucene team decided to update the project to use version 2.0 of Snowball, a processing language for stemming algorithms. Thanks to the change, users now have analysers for Serbian, Nepali, and Tamil at their disposal. Lucene 9.0 is also the first release to provide a minimal stemmer for Swedish (more complex versions were already available), as well as a JapaneseCompletionFilter for Input Method-aware auto-completion.

To make the new version work, developers need to have JDK 11 or newer installed. Under-the-hood changes in component handling also mean that authors of custom analysis factories need to add a default constructor to those factories to keep them functional. It’s also generally recommended to check imports: Lucene 9 doesn’t use split packages anymore, and some non-core JARs have been renamed as a result.
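
What adding a default constructor might look like is sketched below in plain Java. The class AnalysisFactorySketch is a hypothetical stand-in that does not extend any Lucene class, and the "ignoreCase" parameter is invented. The sketch also assumes that the no-argument constructor exists only so that service-loading machinery can discover the class, and is never used to build a working factory. The project's migration notes are the place to check the exact requirements.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for a custom analysis factory; it does not extend any Lucene class.
 * A real factory would normally be a public class in its own source file.
 */
final class AnalysisFactorySketch {

    private final boolean ignoreCase;

    /** Constructor used in practice: configuration arrives as string key/value pairs. */
    public AnalysisFactorySketch(Map<String, String> args) {
        Map<String, String> remaining = new HashMap<>(args);
        this.ignoreCase = Boolean.parseBoolean(remaining.remove("ignoreCase"));
        if (!remaining.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + remaining);
        }
    }

    /** Default constructor, present only for discovery (assumption: it is never meant to be called). */
    public AnalysisFactorySketch() {
        throw new UnsupportedOperationException(
                "Use the Map<String, String> constructor to create a configured factory");
    }

    boolean ignoreCase() {
        return ignoreCase;
    }

    /** Checks via reflection whether a class declares a public no-argument constructor. */
    static boolean hasPublicNoArgConstructor(Class<?> type) {
        for (Constructor<?> constructor : type.getConstructors()) {
            if (constructor.getParameterCount() == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("ignoreCase", "true");
        AnalysisFactorySketch factory = new AnalysisFactorySketch(config);
        System.out.println("ignoreCase = " + factory.ignoreCase());
        System.out.println("public no-arg constructor present: "
                + hasPublicNoArgConstructor(AnalysisFactorySketch.class));
    }
}
```

A reflection check along the lines of hasPublicNoArgConstructor could be dropped into a unit test to flag custom factories that still lack the extra constructor before an upgrade.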

More details on renamings and other backwards incompatibilities can be found in the project’s changelog.