Alphabetic index search

by Günter Hipler (UB Leipzig, https://twitter.com/vog61) and Silvia Witzig (UB Basel)


In this article we will detail how we implemented the alphabetic index search for swisscollections based on VuFind and Apache Solr.

The past

Alphabetic browse indices were very popular in OPACs based on integrated library systems like Aleph 500. They were mainly used by expert users or library staff and were an outstanding feature in specialist catalogs like the one for "Handschriften, Archive und Nachlässe" (manuscripts, archives and estates) - the predecessor of swisscollections.

Over the years, the swissbib team was repeatedly confronted with requests for an alphabetical index search in the swissbib discovery system. Mainly because of the time constraints of a small development team, and because of recurring controversial discussions in the library community about the sense and purpose of an alphabetical index, the development of such functionality was repeatedly postponed. Even commercial providers of search-engine-based discovery systems did not offer a solution for this customer request. We think the main reason for this restraint is that the alphabetic browse functionality of the OPAC relies on a relational database as the core component of an integrated library system, and it was not easy to implement the same functionality on top of search engine technology.

From our point of view the main reason for this difficulty is that browsing is not the same as searching. When browsing, people know exactly at which point of an alphabetic list (formerly: card catalogs) to start and then follow the entries one by one manually. When searching, the search engine performs all kinds of sophisticated lookups on index terms that have usually been analyzed. While it might be possible to simulate an "entry point" into a term posting list (the various A-Z list functionalities are presumably based on this), it is difficult to simulate a user browsing entry by entry, in alphabetic order, through the analyzed terms of an optimized search index - without the lookups performed by search requests. This was frankly never the purpose of the search index design.

We think this is the reason why the rare implementations of browse functionality nowadays use, in some way, a relational database in the background to simulate the step-by-step behavior of users.

Currently available solutions

Our team is currently aware of two implementations of alphabetical index search within a search-engine-based discovery solution. One is the open source solution of VuFind, available in a demo version and as part of productive environments. The other one is part of the commercial Primo product.

As the front-end of swisscollections is based on VuFind, we planned to use the VuFind implementation. First we analyzed this implementation in detail and evaluated whether it would meet our requirements.

Description and evaluation of the VuFind implementation

Because of its open source character it is easy to analyze how the functionality is implemented; the main documentation is available online.

In summary, our findings and conclusions:

  • "VuFind Alphabetical Heading Browse" ("VAHB") is mainly designed for a Solr Index with a single shard.

  • Solr is often used as a single-node server, mostly deployed side by side with the VuFind web application.

  • There are shell scripts which can be used for term extraction from an existing Solr core. These terms are then analyzed (via provided custom analyzers) and stored in a relational, file-based database (SQLite).

  • The database with the (sorted) terms of the index is then deployed side by side with the Solr server, which makes it possible for a custom VuFind Solr plugin to access the sorted terms in the database.

  • VAHB solutions known to us offer only a small number of indices (e.g. four in the Villanova discovery).

  • The implementation is somewhat outdated: e.g. Ant is used as the build tool, and we haven't seen any possibility of using managed dependencies.

As part of the evaluation we implemented two prototypes to get a deeper understanding of if and how it could match our requirements:

  • A Gradle multi-project build to overcome the unmanaged dependencies. This was possible with the exception of MarcImporter.jar, for which we couldn't find a suitable published component, so we had to define a file-based lib.

  • Because it would be very unnatural (if not impossible) to use the script-based workflows within our Kafka-based microservice environment, we implemented a quick prototype: a Kafka consumer that inserts new or updated values into a relational SQL database (in our case MariaDB, accessed over the network as a shared database). A minimal sketch of this idea follows below.
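
To illustrate the second prototype, here is a minimal sketch of such a consumer; the topic name, table layout and connection settings are hypothetical, not our actual configuration:

```scala
import java.sql.DriverManager
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object BrowseTermsConsumer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "browse-terms-loader")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List("browse-terms").asJava) // hypothetical topic with prepared terms

  // A shared, network-accessible MariaDB replaces the local SQLite file of "VAHB".
  val conn = DriverManager.getConnection("jdbc:mariadb://dbhost:3306/browse", "user", "secret")
  val upsert = conn.prepareStatement(
    // hypothetical table: analyzed, sortable key plus the original display value
    "INSERT INTO browse_terms (sort_key, display_value) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE display_value = VALUES(display_value)")

  while (true) { // endless poll loop, sufficient for a prototype
    for (record <- consumer.poll(Duration.ofSeconds(1)).asScala) {
      upsert.setString(1, record.key())   // analyzed form, used for sorting
      upsert.setString(2, record.value()) // original value, used for display
      upsert.executeUpdate()
    }
    consumer.commitSync()
  }
}
```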

Perhaps such prototypes might be helpful for the further development of the original VuFind artifacts. Later in 2021 (when we had already completed our solution) we saw that others are also working on the integration into multi-node SolrCloud environments.

Alphabetic index search for swisscollections

Requirements for swisscollections

An alphabetic index search solution for swisscollections has to fit into our environment and meet certain requirements:

  • Workflows should be fully automatable. Whenever possible, a service should be based on event-based mechanisms and obtain its information from a log-based system (in our case Apache Kafka).

  • We are using a SolrCloud cluster with multiple nodes and collections with multiple shards and replicas.

  • We are trying to avoid any manual work. The cluster is set up on remote servers, collections are created via the Solr API, Solr configurations are stored in Zookeeper ensembles, and we run a productive, a fallback, and a separate test cluster. All this would make it difficult to use file-based databases and customized SolrHandler jars on the classpath of the Solr clusters.

  • For swisscollections, 79 different indices have to be provided. We doubt that such a requirement is compatible with a database-supported solution.

  • Updates and deletes are processed several times a day. We doubt that it would be possible to update the sorted terms in a database that often.

Possible solutions for swisscollections

From the evaluation of "VAHB" we realised that we should avoid a database-supported solution as far as possible. This was in line with our original wish to implement completely search-engine-based functionality. Some of the possibilities we discussed:

The Solr Terms Component seemed a natural fit for the requirements of alphabetic browsing. You can easily pick a starting point in the list of terms and then retrieve the next terms step by step in the (alphabetic) sort order of the index. Additionally, the terms to be indexed can be analyzed with the desired analyzers.
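
As an illustration (not our production code), a terms request can be parameterized like this; the collection and field names are hypothetical, and a /terms request handler is assumed to be configured:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// terms.lower picks the entry point into the sorted term list,
// terms.sort=index keeps the natural (alphabetic) index order.
val url = "http://localhost:8983/solr/browse/terms" +
  "?terms=true&terms.fl=browse_term" +
  "&terms.lower=mozart&terms.lower.incl=true" +
  "&terms.limit=20&terms.sort=index&wt=json"

val response = HttpClient.newHttpClient().send(
  HttpRequest.newBuilder(URI.create(url)).GET().build(),
  HttpResponse.BodyHandlers.ofString())
println(response.body()) // each term is returned together with its document frequency
```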

However, we saw two serious difficulties with this solution:

  • Although a user browses through the analyzed terms, they are only interested in seeing the original values. Is it possible to create a relation between the analyzed terms in the Solr terms list and the values to be shown to the user? In the "VAHB" solution this relation is established via two separate columns in the SQLite database.

  • The Solr Terms Component doesn't provide any analysis functionality for the values a user sends to the search engine. The aim of the terms component is a very fast lookup of terms in a list.

It quickly became clear that we could not adopt this initially obvious option one-to-one.

Components and details of the implemented solution

The solution we decided to implement consists of four parts:

  1. A Solr Index queried via the Terms Component, allowing a quick entry into the list and the retrieval of the following terms in alphabetical order.

  2. A service (browse-index-values, further details below) for preparing the terms for all indices. These terms also contain a reference to the value used for display and to identifiers from the authority file GND. Additionally, terms are created for all the name variants listed in the GND.

  3. A service (termsquery-analysis-api, further details below) to connect the front-end and the Solr Index. This service analyzes the user's input in the same way as is done for the preparation of the terms (2) and queries the index. From the result it creates a simple JSON structure containing the values for display and the number of documents found.

  4. The section Index Search on the swisscollections front-end where a user can browse alphabetically through 79 indices.


Preparation of the terms in browse-index-values

This service prepares the terms for the index. The input is bibliographic records in MARC21, which are processed with XSLT scripts to extract the values for the alphabetic index search from the relevant fields.

From each extracted value a term is created which contains:

  • the analyzed value

  • the original value

  • optionally the identifier from the GND for this value

For example: mozart anna maria 1720 1778##########Mozart, Anna Maria (1720-1778)##########(DE-588)119285428

Values which have a GND identifier are processed further: an additional term is created for each variant name present in the GND. These terms additionally contain the preferred name.

For example: pertl maria anna 1720 1778##########Pertl, Maria Anna (1720-1778)##########(DE-588)119285428##########Mozart, Anna Maria (1720-1778)

The complete term is stored in the index. Because the first part is analyzed with the standard tokenizer and filters of the Lucene library, the terms are sorted in correct alphabetic order. The following parts of the term are used for display purposes in the front-end. Conceptually this is the same principle as the two columns in the SQLite database used by VuFind.
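
A minimal sketch of this term construction, assuming Lucene's StandardAnalyzer as an approximation of our actual analyzer chain (the method names, the field name and the separator constant are ours):

```scala
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable.ListBuffer

val Separator = "##########"

// Tokenizes and lowercases, e.g. "Mozart, Anna Maria (1720-1778)"
// becomes "mozart anna maria 1720 1778" - the sortable first part of the term.
def analyze(value: String): String = {
  val analyzer = new StandardAnalyzer()
  val stream   = analyzer.tokenStream("browse", value) // field name is arbitrary here
  val termAttr = stream.addAttribute(classOf[CharTermAttribute])
  val tokens   = ListBuffer.empty[String]
  stream.reset()
  while (stream.incrementToken()) tokens += termAttr.toString
  stream.end(); stream.close(); analyzer.close()
  tokens.mkString(" ")
}

// Concatenates the analyzed value, the display value, the optional GND
// identifier and, for GND variant names, the preferred name.
def buildTerm(original: String, gndId: Option[String] = None,
              preferredName: Option[String] = None): String =
  (Seq(analyze(original), original) ++ gndId ++ preferredName).mkString(Separator)

// buildTerm("Mozart, Anna Maria (1720-1778)", Some("(DE-588)119285428"))
// => "mozart anna maria 1720 1778##########Mozart, Anna Maria (1720-1778)##########(DE-588)119285428"
```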

Connecting index and front-end with the termsquery-analysis-api

This service receives the query performed on the Index Search section of swisscollections. Because the API is implemented in a JVM-based language (in our case Scala with the Play framework), we are able to integrate the same Lucene analyzers as we use on the index creation side. Thus the query is analyzed according to the same configuration as during the preparation of the terms in browse-index-values. With the analyzed query a search is performed on the index.

The result is transformed into a simple JSON structure which contains the values for display and the number of documents found. To create the JSON, the term is split on ########## and an object with up to four fields is generated.

For example: { "fieldvalue": "Pertl, Maria Anna (1720-1778)", "frequency": 4, "references": "(DE-588)119285428", "preferredName": "Mozart, Anna Maria (1720-1778)" }
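
A sketch of this transformation using the Play JSON library; the frequency value is assumed to be taken from the document frequency in the Solr terms response:

```scala
import play.api.libs.json.{JsObject, JsString, Json}

val Separator = "##########"

// Splits a stored term and maps its parts onto the response object;
// "references" and "preferredName" are only present for some terms.
def toJson(storedTerm: String, frequency: Long): JsObject = {
  val parts = storedTerm.split(Separator).map(_.trim)
  val base = Json.obj("fieldvalue" -> parts(1), "frequency" -> frequency)
  val withRefs =
    if (parts.length > 2) base + ("references" -> JsString(parts(2))) else base
  if (parts.length > 3) withRefs + ("preferredName" -> JsString(parts(3))) else withRefs
}
```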

The API is open and could be used by any client. Two examples for a more exotic index “browsezurichperson”:

For the well-known Basel name “sarasin” the index obviously doesn’t provide an entry: https://alphabrowse.swisscollections.ch/browsing/browsezurichperson?term=sarasin

Whereas “spyri” is found in the zurichperson index as an entry: https://alphabrowse.swisscollections.ch/browsing/browsezurichperson?term=spyri

If someone is interested in the complete structure of the stored term (perhaps out of pure curiosity) the API provides it: https://alphabrowse.swisscollections.ch/browsing/browsezurichperson?term=sarasin&raw=true

Problems of the solution and how we solved them

Using the terms component for search engine based alphabetic browse functionality seems to be an ideal solution – at first glance.

The swisscollections service harvests the underlying data sources every hour. Besides creating new documents, this produces a lot of updates and deletes on the existing dataset. Lucene (the library Solr uses for its search functionality) stores its data structures, for optimisation reasons, in so-called immutable segments. Updated or deleted documents are written to new segments, and older entries in old segments are not immediately physically deleted. This behaviour does not affect the Lucene standard search. However, the same does not apply to searches with the terms component: outdated terms are still returned until Lucene internally decides to merge segments. This affects our use case. Users want to know how many documents are available for an entry in the browsing list, and because terms might be spread over different segments, the number of documents reported for each term might be inconsistent with the real number of documents in our search index. This is not acceptable.

We “solved” the problem by starting a manual segment merge every night (as simple as executing the following Solr command: http://localhost:8080/solr/collectionname/update?optimize=true). Users who have been working with Solr for a long time know this as “optimisation”, which is no longer generally recommended. For an index with fewer than 5 million documents this is an acceptable solution, but we cannot recommend it in general, because the resource usage for merging segments outside the algorithms provided by Lucene might be too high.
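
The nightly merge is essentially just one scheduled HTTP request; a minimal sketch (host, port and collection name are placeholders, as in the command above):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Forces a merge down to a single segment so the terms component
// no longer sees entries from deleted or updated documents.
val optimize = "http://localhost:8080/solr/collectionname/update?optimize=true&maxSegments=1"
HttpClient.newHttpClient().send(
  HttpRequest.newBuilder(URI.create(optimize)).GET().build(),
  HttpResponse.BodyHandlers.ofString())
```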

But open source to the rescue! Lucene version 8.7 introduced a new functionality called «merge on refresh». The interesting story of how this feature was implemented is told in a blog post by Michael McCandless. So far we haven't analysed the new possibilities, but we are confident that this extension can solve our problem with the terms component for alphabetical browsing, even for larger indices.

Another, smaller problem with the terms component might occur in distributed clusters, where the indicated number of documents for each term may not always be 100% correct. To tackle this problem we use only one shard for the collection of the browse index values. This shard could be replicated in case of performance issues with really large collections.
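
Creating such a collection boils down to a single Collections API call; a sketch with hypothetical collection and configset names. numShards=1 keeps the term statistics exact, while replicationFactor only adds read capacity:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// numShards=1 avoids the distributed counting issue of the terms component;
// replicationFactor can be raised if the query load requires it.
val create = "http://localhost:8983/solr/admin/collections" +
  "?action=CREATE&name=browse&numShards=1&replicationFactor=2" +
  "&collection.configName=browseConf"
HttpClient.newHttpClient().send(
  HttpRequest.newBuilder(URI.create(create)).GET().build(),
  HttpResponse.BodyHandlers.ofString())
```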

Conclusion

About three years before the end of life of swissbib (April 2021) we started to implement the first distributed services for our data processing, using technologies like Apache Kafka, Apache Flink and Apache Spark. The experience we gained served as a stable basis for the productive services Memobase and swisscollections.
Distributed services, or distinct (micro)services communicating via an API (or a message log like Kafka), are nowadays a widely used architecture for data analysis and processing.
The same tendency towards distribution can also be observed in the area of front-end development, especially through the use of JavaScript frameworks for the user interface and back-end APIs that can be implemented with different technologies, so you can pick the technology that fits the application best. In the case of the API for swisscollections, it had to be based on the JVM (usually Scala, Java or Kotlin) because we include the Lucene analyzers library.

With our solution for the swisscollections browse index we have taken a step in this direction, but we still use the classic VuFind as the front-end. We think that a further development in a more distributed direction should also be possible for VuFind, with its broad and experienced user community.

The development of swisscollections has been a lot of fun. We thank all who have participated:
Siegfried Ballmer, outermedia, Berlin
Christoph Böhm, outermedia, Berlin
Silvia Witzig, former swissbib team, UB Basel
Lionel Walter, former swissbib team, now https://arbim.ch
Günter Hipler, former swissbib team, now UB Leipzig