Normalization, Clustering and Merging
Bibliographic data and authorities
swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library and the IDS partner libraries. Due to the organizational structure of the Swiss library system, the data is stored in different databases. The number of duplicates in this data is high because the collections of the different libraries are structured similarly. The data is catalogued according to four sets of rules: AACR2, Recaro and KIDS (IDS libraries until 2015), and RDA (IDS libraries from 2016). The data is structured in the MARC21 and IDSMARC formats. Multilingualism is a further factor, which mostly influences the recording of authorities and subject headings.
Whereas the formats can be mapped rather easily, the different cataloguing rules and, to an even greater extent, the different cataloguing practices pose problems.
In addition, swissbib contains the bibliographic data of the institutional repositories of the Swiss universities, of the e-codices and e-periodica collections, and the metadata of articles purchased as national licences. This data is structured in Dublin Core or MODS.
swissbib uses CBS as a data hub and for all processes to normalize, cluster and merge this data.
Preliminary study of the deduplication of bibliographic data and authorities
In order to get a picture of the difficulties involved in eliminating duplicates in catalogue data, the swissbib project mandated Pierre Gavin and Jean-Bertrand Gonin in 2007 to conduct a feasibility study on the deduplication of bibliographic and name-authority data. They checked the feasibility of deduplicating multilingual content in different MARC flavours (IDSMARC and MARC21), catalogued according to different rule sets (AACR2, KIDS).
This is a summary of the findings.
Deduplication of bibliographic records
Proceeding
A sample of 80'000 records was taken from seven databases of two networks (RERO and IDS) and the Swiss National Library. To ensure a high density of duplicates, the records were extracted by keyword search via Z39.50 rather than just sequentially by record number. To evaluate the accuracy of the deduplication, a group of 700 records was first checked by the algorithm and then rechecked manually.
Limitations
So as not to distort the findings, media types with special cataloguing requirements were excluded:
serials
analytics
historic books
dummies (should not be taken into account at all)
The algorithm
The algorithm takes into account the content of the following fields:
ISBN
title
author
editor
pagination
media type
For each field the algorithm assigns a number that quantifies the similarity of its content. The fields mentioned above are of different importance, so the assigned numbers are weighted differently.
A total between 0 and 10 signifies a duplicate.
Totals of 11 and 12 are still strong indicators of a duplicate; whether a record is finally classed as a duplicate depends on which fields these values come from.
Totals over 20 indicate a non-duplicate.
The algorithm still has potential for refinement: the series field (490) and the language code of field 008 or 040 could be included in the analysis.
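As an illustration, the scoring scheme can be sketched in code. This is a minimal sketch, not the study's implementation: the field weights and the string-similarity measure are assumptions, and only the thresholds come from the study.

```python
from difflib import SequenceMatcher

# Assumed field weights -- the study does not publish the exact values.
WEIGHTS = {"isbn": 10, "title": 8, "author": 6, "editor": 3,
           "pagination": 3, "media_type": 5}

def field_penalty(a: str, b: str) -> float:
    """0.0 for identical field values, 1.0 for completely different ones."""
    if not a or not b:
        return 0.5                       # missing data is only weak evidence
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def duplicate_score(rec_a: dict, rec_b: dict) -> float:
    return sum(weight * field_penalty(rec_a.get(field, ""), rec_b.get(field, ""))
               for field, weight in WEIGHTS.items())

def classify(score: float) -> str:
    if score <= 10:
        return "duplicate"
    if score <= 12:
        return "strong duplicate candidate"   # depends on the fields involved
    if score > 20:
        return "non-duplicate"
    return "unclear"
```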
Findings
The preliminary study shows that deduplication of multilingual content by algorithm is feasible.
Accuracy of algorithm
In its present state, the accuracy of the algorithm shows the following characteristics. It produces approximately:
5.3% of false duplicates
2.2% of false non-duplicates
9.2% of probable duplicates
3.9% of probable non-duplicates
As noted above, there is still potential to refine the algorithm.
Positive factors concerning deduplication
ISBD block is very similar, sometimes identical
ISBN is very frequent
MARC21 and IDSMARC are in large parts identical and applied in a standard compliant manner
the catalogue records are generally of good quality
multilingualism poses practically no problems within fields 300 and 500
multilingualism poses practically no problems with personal names; exceptions are popes and kings
Problems (to be solved)
Some of the points mentioned below are not very problematic. The issues that are complicated to solve are marked with ***.
inconsistent cataloguing levels for the same title
converted records in various grades of quality
large differences in the number and order of 7xx fields
different approaches to pagination, mainly in old or recatalogued records
typos (not very frequent in the data set)
flawed cataloguing
differences in MARC-Codification:
no author in field 1xx (problems for FRBRization)***
different grades of multilevel cataloguing***
field 020:
the ISBN has to be validated
the ISBN has to be normalized, either 10 to 13 or 13 to 10 (see the sketch after this list)
field 1xx and 7xx - author: unfortunately the allocation of an author can differ from one site to another***
field 245 - title: the algorithm must be refined to cope with minor differences
field 260 - editor: the algorithm must be refined to cope with differences
field 300 - collation: different treatment of pagination results in rather weak significance for the overall deduplication process
transliteration: RERO uses different transliteration methods than the IDS and the national library***
multilingualism:
names of corporate bodies pose problems
subject headings are difficult to handle
uniform titles: ***
it is not completely clear in which cases a uniform title was set (this differs from site to site)
as an example: different location in RERO (7xx$t) and IDS (240)
original titles in the case of translations: field 509 in IDS, field 500 in RERO***
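The ISBN handling mentioned in the list above (validation and 10/13 normalization) could look roughly like this. A hedged sketch: the check-digit arithmetic is the standard ISBN algorithm, everything else is an assumption.

```python
# Sketch of the ISBN handling described above: validate the check digit,
# then normalize ISBN-10 to ISBN-13 so that both forms compare as equal.
def isbn10_check_digit(body9: str) -> str:
    total = sum((10 - i) * int(d) for i, d in enumerate(body9))
    rest = (11 - total % 11) % 11
    return "X" if rest == 10 else str(rest)

def isbn13_check_digit(body12: str) -> str:
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body12))
    return str((10 - total % 10) % 10)

def normalize_isbn(raw: str):
    """Return a validated ISBN-13, or None if the ISBN is invalid."""
    isbn = raw.replace("-", "").replace(" ", "").upper()
    if len(isbn) == 10 and isbn[:9].isdigit():
        if isbn[9] != isbn10_check_digit(isbn[:9]):
            return None                      # failed validation
        body = "978" + isbn[:9]              # Bookland prefix
        return body + isbn13_check_digit(body)
    if len(isbn) == 13 and isbn.isdigit():
        return isbn if isbn[12] == isbn13_check_digit(isbn[:12]) else None
    return None
```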
Data concerning the deduplication of 80'000 records
Numbers of duplicates found in the sample
N = Nebis | Z = IDS Zürich | B = IDS Basel/Bern | S = IDS St. Gallen (UniSG) | L = IDS Luzern | R = RERO | H = Helveticat

|   | *     | N     | Z    | B     | S    | L    | R     | H    |
|---|-------|-------|------|-------|------|------|-------|------|
| * | 37904 | 13746 | 8761 | 17044 | 4016 | 5413 | 16732 | 9318 |
| N | 6679  | 480   | 1872 | 4008  | 827  | 1199 | 3346  | 1756 |
| Z | 3782  | 1859  | 191  | 2527  | 469  | 690  | 2070  | 802  |
| B | 9088  | 4124  | 2619 | 1040  | 1076 | 1483 | 4866  | 2127 |
| S | 1526  | 804   | 464  | 1038  | 48   | 261  | 897   | 375  |
| L | 2088  | 1180  | 684  | 1400  | 264  | 128  | 1069  | 534  |
| R | 9533  | 3514  | 2115 | 4873  | 935  | 1094 | 1613  | 2864 |
| H | 5208  | 1785  | 816  | 2158  | 397  | 558  | 2871  | 860  |
Numbers of retrieved sample records in proportion to total catalogue size

| Site  | Catalogue size (kRec) | Records retrieved | Proportion (‰) |
|-------|-----------------------|-------------------|----------------|
| N     | 3800                  | 14731             | 3.88           |
| Z     | 1700                  | 6076              | 3.57           |
| B     | 4400                  | 17371             | 3.95           |
| S     | 500                   | 2357              | 4.71           |
| L     | 500                   | 2973              | 5.95           |
| R     | 4300                  | 24871             | 5.78           |
| H     | 1500                  | 12756             | 8.5            |
| TOTAL | 16700                 | 81135             |                |
|
Clustering and merging in swissbib
The clustering algorithm used by swissbib to detect duplicate records compares the data in two steps. In the first step, the data is split into preliminary clusters. In the second step, the records within these preliminary clusters are compared with each other. The comparison is based on indexed and, for some elements, heavily normalized data.
The result of this comparison is either a single record or a cluster of records which are considered duplicates. These clusters of records are the input for the merging process which creates the merged records used in swissbib.
Every update of a record triggers a re-clustering of the records and, if a cluster gains or loses members, an update of the merged record.
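In code terms, the two-step approach corresponds to the classic blocking pattern for deduplication. The following is a hypothetical sketch of that pattern, not the actual CBS implementation:

```python
from collections import defaultdict

def cluster(records, block_key, is_duplicate):
    # step 1: split the data into preliminary clusters via a cheap key
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    # step 2: compare the records within each preliminary cluster
    clusters = []
    for members in blocks.values():
        groups = []
        for rec in members:
            for group in groups:
                if is_duplicate(group[0], rec):
                    group.append(rec)
                    break
            else:
                groups.append([rec])
        clusters.extend(groups)
    # each result is a single record or a cluster of duplicates
    return clusters
```

As described above, singleton clusters pass through unchanged, while clusters with two or more members become input for the merging process.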
Preliminary comparison
For the first step two indexes are compared:
Matchkey:
Based on title (n letters from word x)
Form of item and material type
Exact date, if present
URL, if published before 1900
ID (ISBN/ISSN):
Based on ISBN/ISSN
Form of item and material type
Exact date, if present
URL, if published before 1900
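A matchkey of this kind could be built roughly as follows. This is a sketch under assumptions: the text does not specify n and x, so the truncation parameters and field names below are hypothetical.

```python
import re

def matchkey(record: dict) -> str:
    # "n letters from word x": assume the first 4 letters of the first
    # two title words -- the real parameters are not documented here.
    words = re.sub(r"[^a-z0-9 ]", "", record.get("title", "").lower()).split()
    parts = ["".join(w[:4] for w in words[:2]),
             record.get("form", ""),           # form of item
             record.get("material", "")]       # material type
    if record.get("date"):
        parts.append(record["date"])           # exact date, if present
    if record.get("url") and record.get("year", 9999) < 1900:
        parts.append(record["url"])            # URL, if published before 1900
    return "|".join(parts)
```

The ID index would be built the same way, with the normalized ISBN/ISSN in place of the truncated title.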
Detailed comparison
The following elements of a record are used to generate match indexes:
ID (ISSN, ISBN)
title (245$a,$b,$n,$p / 246 $a)
authors (corporate, 110 / 111 / 710 / 711)
authors (person, 100 / 700)
year of publication (008)
decade of publication (008)
century of publication (008)
edition (250$a)
part (490$v / 830$v)
publisher's initials, for all publications (260$b / 264$b)
publisher's name, only for serials (260$b / 264$b)
pages (300$a, +/- 1 page)
volumes (300 $a)
coordinates
scale
source system, only for non-textual material
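To illustrate, here is a hedged sketch of how two of these elements might be compared: pagination with the ±1-page tolerance, and the year/decade/century levels derived from field 008. The field names are assumptions.

```python
def pages_match(a: dict, b: dict) -> bool:
    # 300$a: pagination matches within a tolerance of +/- 1 page
    try:
        return abs(int(a["pages"]) - int(b["pages"])) <= 1
    except (KeyError, ValueError):
        return False

def date_agreement(a: dict, b: dict):
    """Return the finest level on which the 008 dates agree, if any."""
    ya, yb = a.get("year"), b.get("year")
    if ya is None or yb is None:
        return None
    if ya == yb:
        return "year"
    if ya // 10 == yb // 10:
        return "decade"
    if ya // 100 == yb // 100:
        return "century"
    return None
```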
Merging
Classical merging combines two or more matching records into a new one and is a destructive process. Instead, swissbib builds a cluster of matching records. Out of such a cluster, a merged record is built, based on the “richest” record in the cluster. Additional information from the other records in the cluster is compared with the information already present and added accordingly. This merged record is temporary and is rebuilt or updated each time a record in the cluster is updated or deleted.
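A non-destructive merge of this kind could look roughly like the sketch below. The notion of the “richest” record is approximated here by the number of populated fields, which is an assumption.

```python
def richness(record: dict) -> int:
    # assumption: "richest" = the record with the most populated fields
    return sum(1 for value in record.values() if value)

def build_merged(cluster: list) -> dict:
    base = max(cluster, key=richness)
    merged = dict(base)                  # the original records stay untouched
    for rec in cluster:
        for field, value in rec.items():
            if value and not merged.get(field):
                merged[field] = value    # add missing information only
    return merged
```

Because the originals are kept, the merged record can simply be rebuilt whenever a cluster member is updated or deleted.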
However, there are some rules that prevent merging (see the predicate sketched below):
a document is marked with a “nomerge” code by a library
two or more matching documents come from the same source
documents are still in process (and would be quite likely to cause wrong merges)
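These rules could be expressed as a simple guard predicate checked before two records are merged. A sketch, with the field names nomerge, source and in_process as assumptions:

```python
def may_merge(a: dict, b: dict) -> bool:
    if a.get("nomerge") or b.get("nomerge"):
        return False          # a library vetoed merging for this document
    if a.get("source") == b.get("source"):
        return False          # matches from the same source are not merged
    if a.get("in_process") or b.get("in_process"):
        return False          # unfinished records would cause wrong merges
    return True
```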