Bibliographic data and authorities

swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library and the IDS partner libraries. Because of the organizational structure of the Swiss library system, the data is stored in different databases. The proportion of duplicates in this data is high because the libraries have similar collections. The data is catalogued according to four sets of rules: AACR2, Recaro and KIDS (IDS libraries until 2015), and RDA (IDS libraries from 2016). The data is structured in the formats MARC21 and IDSMARC. Multilingualism is a further factor, which mainly affects the recording of authorities and subject headings.

Whereas the formats can be mapped rather easily, the different cataloguing rules and, to a greater extent, the different cataloguing practices pose problems.

In addition, swissbib contains the bibliographic data of the institutional repositories of the Swiss universities, of the e-codices and e-periodica collections, and the metadata for articles purchased under national licences. This data is structured in Dublin Core or MODS.

swissbib uses CBS as a data hub and for all processes that normalize, cluster and merge this data.

Study of the deduplication of bibliographic data and authorities

To get a picture of the difficulties involved in eliminating duplicates in catalogue data, the swissbib project commissioned Pierre Gavin and Jean-Bertrand Gonin in 2007 to study the feasibility of deduplicating bibliographic and name authority data. Their findings are summarized on the page deduplication study.

swissbib matching and merging

swissbib record matching is completely index-based. To increase the chance that the elements match, the data is normalized and transformed.
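
The kind of normalization meant here can be illustrated with a minimal Python sketch. The actual rules applied in CBS are not described on this page, so the function name normalize and the specific steps (removing diacritics, lowercasing, dropping punctuation) are assumptions chosen for illustration only.

```python
import re
import unicodedata


def normalize(value: str) -> str:
    """Reduce a field value to a comparison key (illustrative rules only)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, then replace every run of characters other than a-z and 0-9
    # by a single space. Note: this discards non-Latin scripts entirely.
    cleaned = re.sub(r"[^0-9a-z]+", " ", stripped.casefold())
    # Trim and collapse whitespace.
    return " ".join(cleaned.split())


# Two differently catalogued titles end up with the same key.
key_a = normalize("Die Zürcher Bibel :")
key_b = normalize("  DIE ZURCHER  BIBEL")
```

With such a key, differences in capitalization, punctuation and diacritics no longer prevent two index entries from matching.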

match elements

The following elements of a record are used to generate match indexes:

ID (ISSN, ISBN)
year
decade
century
title (245$a,$b,$n,$p / 246 $a)
edition (250$a)
(corporate) authors
imprint: publisher name (260$b / 264$b)
extent (300$a, +/- 1 page)
volume number
coordinates and scale
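
As an illustration of how some of these elements could be derived, the following Python sketch builds year, decade and century keys from a publication year and compares page counts with the tolerance of one page mentioned above. The function names, the key format and the rule of taking the largest number in the extent statement as the page count are assumptions, not the swissbib implementation.

```python
import re
from typing import Dict, Optional


def date_keys(year: int) -> Dict[str, str]:
    """Derive year, decade and century keys from a four-digit year."""
    text = f"{year:04d}"
    return {"year": text, "decade": text[:3], "century": text[:2]}


def page_count(extent: str) -> Optional[int]:
    """Take the largest number in an extent statement (300$a) as page count.

    This is a simplifying assumption; returns None if no number is present.
    """
    numbers = [int(n) for n in re.findall(r"\d+", extent)]
    return max(numbers) if numbers else None


def pages_within_tolerance(a: int, b: int, tolerance: int = 1) -> bool:
    """True if two page counts differ by at most `tolerance` pages."""
    return abs(a - b) <= tolerance


keys = date_keys(1987)
pages = page_count("xii, 245 p. : ill.")
```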

match threshold

The match threshold is 1, so all elements that can be compared must match.
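
The effect of a threshold of 1 can be sketched in Python as follows. swissbib matches via indexes, so this pairwise comparison is only a hypothetical illustration of the rule; the names match_score and is_match, the dictionary layout, and the decision to treat a pair with no comparable element as a non-match are assumptions.

```python
from typing import Dict, Optional

MATCH_THRESHOLD = 1.0


def match_score(a: Dict[str, Optional[str]],
                b: Dict[str, Optional[str]]) -> Optional[float]:
    """Share of comparable elements that agree, or None if none is comparable.

    An element is comparable only if both records carry a value for it.
    """
    comparable = [
        key for key in a.keys() & b.keys()
        if a[key] is not None and b[key] is not None
    ]
    if not comparable:
        return None
    agreeing = sum(1 for key in comparable if a[key] == b[key])
    return agreeing / len(comparable)


def is_match(a: Dict[str, Optional[str]], b: Dict[str, Optional[str]]) -> bool:
    """With a threshold of 1, every comparable element has to agree."""
    score = match_score(a, b)
    return score is not None and score >= MATCH_THRESHOLD


rec_a = {"title": "die zurcher bibel", "year": "1987", "edition": None}
rec_b = {"title": "die zurcher bibel", "year": "1987", "edition": "2"}
rec_c = {"title": "die zurcher bibel", "year": "1988", "edition": "2"}
```

In this sketch rec_a and rec_b match, because the edition is missing in rec_a and therefore not compared, whereas rec_a and rec_c do not match, because the years differ.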

merging

Classical merging combines two or more matching records into a new one and is a destructive process. swissbib instead builds a cluster of matching records. From each cluster a display record (“master record”) is built, based on the “richest” record in the cluster and supplemented with additional information from the other records. This record is temporary: it is rebuilt or updated each time a record in the cluster is updated or deleted.
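
The non-destructive approach can be sketched in Python. How swissbib measures the “richest” record is not stated here, so counting populated fields is an assumption, as are the function names and the flat dictionary representation of a record.

```python
from typing import Dict, List

Record = Dict[str, str]


def richness(record: Record) -> int:
    """Assumed measure of richness: the number of populated fields."""
    return sum(1 for value in record.values() if value)


def build_master(cluster: List[Record]) -> Record:
    """Build a temporary display record without modifying the cluster."""
    if not cluster:
        raise ValueError("a cluster needs at least one record")
    # Start from a copy of the richest record ...
    master = dict(max(cluster, key=richness))
    # ... and add fields that only the other records provide.
    for record in cluster:
        for field, value in record.items():
            if value and not master.get(field):
                master[field] = value
    return master


cluster = [
    {"id": "A1", "title": "Die Zürcher Bibel", "year": "1987"},
    {"id": "B7", "title": "Die Zürcher Bibel", "year": "1987",
     "edition": "2. Aufl.", "extent": "245 p."},
    {"id": "C3", "title": "Die Zürcher Bibel", "subject": "Bibel"},
]
master = build_master(cluster)
```

Because the source records stay untouched, the master record can simply be rebuilt by calling build_master again whenever a record in the cluster is updated or deleted.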

However, some rules prevent merging:

a document that a library has marked with a “nomerge” code is not merged
two or more matching documents from the same source are not merged
documents that are still in process (and are quite likely to cause wrong merges) are not merged
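
These rules can be expressed as a simple check, sketched here in Python. The class Doc, its field names and the function may_merge are hypothetical; they only illustrate the three rules listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str               # originating database or library network
    nomerge: bool = False     # library has set a "nomerge" code
    in_process: bool = False  # cataloguing is not yet finished


def may_merge(a: Doc, b: Doc) -> bool:
    """Return False if any of the three rules forbids merging a and b."""
    if a.nomerge or b.nomerge:
        return False
    if a.source == b.source:
        return False
    if a.in_process or b.in_process:
        return False
    return True


d1 = Doc("A1", source="RERO")
d2 = Doc("B7", source="IDS")
d3 = Doc("B8", source="IDS", in_process=True)
d4 = Doc("B9", source="IDS")
d5 = Doc("A2", source="RERO", nomerge=True)
```

Here only d1 and d2 may be merged: d3 is still in process, d2 and d4 come from the same source, and d5 carries a “nomerge” code.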

Normalization, Clustering and Merging