Normalization, Clustering and Merging

Bibliographic data and authorities

swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library and the IDS partner libraries. Due to the organizational structure of the Swiss library system, the data is stored in different databases. The amount of duplicates in this data is high because the collections of the different libraries are structured similarly. The data is catalogued according to four sets of rules: AACR2, Recaro and KIDS (IDS libraries until 2015), and RDA (IDS libraries from 2016). The data is structured in the MARC21 and IDSMARC formats. Multilingualism is another factor, one which mostly influences the recording of authorities and subject headings.

Whereas the formats can be mapped rather easily, the different cataloguing rules and, to an even greater extent, the different cataloguing practices pose problems.

In addition, swissbib contains the bibliographic data of the institutional repositories of the Swiss universities, of the collections of e-codices and e-periodica, and the metadata for articles purchased as national licences. This data is structured in Dublin Core or MODS.

swissbib uses CBS as a data hub and for all processes that normalize, cluster and merge this data.

Preliminary study of the deduplication of bibliographic data and authorities

In order to get a picture of the difficulties involved in eliminating duplicates in catalogue data, the swissbib project commissioned Pierre Gavin and Jean-Bertrand Gonin in 2007 to conduct a feasibility study on the deduplication of bibliographic and name-authority data. They checked the feasibility of deduplicating multilingual content in different MARC flavours (IDSMARC and MARC21), catalogued according to different rule sets (AACR2, KIDS).

This is a summary of the findings.

Deduplication of bibliographic records

Procedure

A sample of 80'000 records was taken out of seven databases from two networks (RERO and IDS) and the Swiss National Library. To ensure a high density of duplicates, the records were extracted by keyword search via z39.50 rather than sequentially by record number. To evaluate the accuracy of the deduplication, a group of 700 records was first checked by the algorithm and then rechecked manually.

Limitations

So as not to distort the accuracy of the findings, media types with special cataloguing requirements were excluded:

  • serials

  • analytics

  • historic books

  • dummies (should not be taken into account at all)

The algorithm

The algorithm takes into account the content of the following fields:

  • ISBN

  • title

  • author

  • publisher

  • pagination

  • media type

For each field, the algorithm assigns a number that signifies the similarity of its content. The fields mentioned above are of different importance, and the numbers assigned to them are therefore weighted differently.

  • A number between 0 and 10 signifies a duplicate.

  • 11 and 12 are still strong indicators of a duplicate. Whether a record is finally classed as a duplicate depends on the assignment of these values to specific fields.

  • Numbers over 20 indicate that the records cannot be duplicates.

The algorithm still has potential to be refined. The series statement field 490 and the language code in field 008 or 040 could be included in the analytical framework.
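
The study does not publish the concrete weights, so the following sketch is illustrative only: the field weights and the simple string distance are assumptions, while the thresholds are the ones reported above.

    # Illustrative sketch of the field-weighted comparison described above.
    # The weights and the string distance are assumptions; only the
    # thresholds are taken from the study. A low total score indicates
    # a duplicate.

    from difflib import SequenceMatcher

    # Assumed weights: more important fields contribute larger penalties
    # when their contents differ.
    FIELD_WEIGHTS = {
        "isbn": 10,
        "title": 8,
        "author": 6,
        "publisher": 3,
        "pagination": 2,
        "media_type": 4,
    }

    def field_distance(a: str, b: str) -> float:
        """0.0 for identical field contents, 1.0 for completely different."""
        if not a and not b:
            return 0.0
        return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def compare_records(rec_a: dict, rec_b: dict) -> float:
        """Sum of the weighted per-field distances of two records."""
        return sum(
            weight * field_distance(rec_a.get(field, ""), rec_b.get(field, ""))
            for field, weight in FIELD_WEIGHTS.items()
        )

    def classify(score: float) -> str:
        """Apply the thresholds reported by the study."""
        if score <= 10:
            return "duplicate"
        if score <= 12:
            return "strong duplicate indicator"  # depends on the fields involved
        if score > 20:
            return "non-duplicate"
        return "unclear"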

Findings

The preliminary study shows that deduplication of multilingual content by algorithm is feasible.

Accuracy of algorithm

In its present state, the algorithm produces approximately:

  • 5.3% of false duplicates

  • 2.2% of false non-duplicates

  • 9.2% of probable duplicates

  • 3.9% of probable non-duplicates

As mentioned, there is still potential to refine the algorithm.

Positive factors concerning deduplication

  • ISBD block is very similar, sometimes identical

  • ISBN is very frequent

  • MARC21 and IDSMARC are in large parts identical and applied in a standard compliant manner

  • the catalogue records are generally of good quality

  • multilingualism poses practically no problems within fields 300 and 500

  • multilingualism poses practically no problems with personal names; the exceptions are popes and kings

Problems (to be solved)

Some of the points mentioned below are not very problematic. The issues that are complicated to solve are marked with ***.

  • inconsistent cataloguing levels for the same title

  • converted records of varying quality

  • large differences in the number and order of 7xx fields

  • different approaches to pagination, mainly in old or recatalogued records

  • typos (not very frequent in the data set)

  • flawed cataloguing

  • differences in MARC codification:

    • no author in field 1xx (problems for FRBRization)***

    • different grades of multilevel cataloguing***

  • field 020:

    • the ISBN has to be validated

    • the ISBN has to be normalized, either from 10 to 13 digits or from 13 to 10 (see the sketch after this list)

  • field 1xx and 7xx - author: unfortunately the attribution of an author can differ from one site to another***

  • field 245 - title: the algorithm must be refined to cope with minor differences

  • field 260 - publisher: the algorithm must be refined to cope with differences

  • field 300 - collation: different treatment of pagination results in rather weak significance for the overall deduplication process

  • transliteration: RERO uses different methods of transliteration than the IDS and the national library***

  • multilingualism:

    • names of corporate bodies pose problems

    • subject headings are difficult to handle

  • uniform titles: ***

    • it is not completely clear in which cases a uniform title was set (this differs from site to site)

    • for example, the uniform title is located in different fields: 7xx$t in RERO and 240 in IDS

  • original titles in the case of translations: field 509 in IDS and field 500 in RERO***
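
The ISBN handling mentioned for field 020 is standard check-digit arithmetic. A minimal sketch of validation and of the 10-to-13 normalization follows; the study does not prescribe a concrete implementation, and the reverse direction simply drops the 978 prefix and recomputes the ISBN-10 check digit.

    # Sketch of the ISBN validation and normalization mentioned for
    # field 020. Standard check-digit arithmetic, not swissbib code.

    def is_valid_isbn10(isbn: str) -> bool:
        """An ISBN-10 is valid if its weighted digit sum is 0 modulo 11."""
        digits = isbn.replace("-", "").upper()
        if len(digits) != 10:
            return False
        total = 0
        for i, ch in enumerate(digits):
            if ch == "X" and i == 9:       # "X" stands for 10, last position only
                value = 10
            elif ch.isdigit():
                value = int(ch)
            else:
                return False
            total += (10 - i) * value
        return total % 11 == 0

    def isbn10_to_isbn13(isbn10: str) -> str:
        """Normalize an ISBN-10 to ISBN-13: prefix 978, recompute check digit."""
        core = "978" + isbn10.replace("-", "")[:9]
        total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)

    # Example: isbn10_to_isbn13("0-306-40615-2") == "9780306406157"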

Data concerning the deduplication of 80'000 records

Numbers of duplicates found in the sample

N = Nebis

Z = IDS Zürich

B = IDS Basel/Bern

S = IDS St. Gallen (UniSG)

L = IDS Luzern

R = RERO

H = Helveticat

 

       *      N      Z      B      S      L      R      H
*  37904  13746   8761  17044   4016   5413  16732   9318
N   6679    480   1872   4008    827   1199   3346   1756
Z   3782   1859    191   2527    469    690   2070    802
B   9088   4124   2619   1040   1076   1483   4866   2127
S   1526    804    464   1038     48    261    897    375
L   2088   1180    684   1400    264    128   1069    534
R   9533   3514   2115   4873    935   1094   1613   2864
H   5208   1785    816   2158    397    558   2871    860

Number of retrieved sample records in proportion to total catalogue size

 

Site    kRec   retrieved   proportion (‰)
N       3800       14731             3.88
Z       1700        6076             3.57
B       4400       17371             3.95
S        500        2357             4.71
L        500        2973             5.95
R       4300       24871             5.78
H       1500       12756             8.5
TOTAL  16700       81135

kRec is the catalogue size in thousands of records; the proportion is the number of retrieved records per 1000 catalogue records.

 

 

Clustering and merging in swissbib

The clustering algorithm used by swissbib for the detection of duplicate records compares the data in two steps. In a first step, the data is split into preliminary clusters. In the second step, the records within these preliminary clusters are compared with each other. The comparison is based on indexed and, for some elements, heavily normalized data.

The result of this comparison is either a single record or a cluster of records which are considered duplicates. These clusters of records are the input for the merging process which creates the merged records used in swissbib.

Every update of a record triggers the clustering of the records and, if a cluster gains or loses members, an update of the merged record.
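
Schematically, the flow could look as follows. The concrete matchkey and comparison functions are sketched in the next two subsections; here they are taken as parameters, and the sketch is not swissbib's actual implementation.

    # Schematic sketch of the two-step clustering flow described above.
    # The matchkey and comparison functions are passed in as parameters.

    from collections import defaultdict
    from typing import Callable

    def cluster(records: list[dict],
                make_matchkey: Callable[[dict], str],
                detailed_match: Callable[[dict, dict], bool]) -> list[list[dict]]:
        # Step 1: split the data into preliminary clusters keyed by a
        # cheap matchkey, so that only plausible candidates are compared.
        blocks: dict[str, list[dict]] = defaultdict(list)
        for record in records:
            blocks[make_matchkey(record)].append(record)

        # Step 2: compare the records within each preliminary cluster in
        # detail; only confirmed matches end up in the same final cluster.
        clusters: list[list[dict]] = []
        for block in blocks.values():
            confirmed: list[list[dict]] = []
            for record in block:
                for candidate in confirmed:
                    # compare against the cluster representative
                    if detailed_match(record, candidate[0]):
                        candidate.append(record)
                        break
                else:
                    confirmed.append([record])
            clusters.extend(confirmed)
        # Singleton clusters are single records; larger clusters are
        # considered duplicates and feed the merging process.
        return clusters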

Preliminary comparison

For the first step, two indexes are compared (a sketch of such a matchkey follows the list):

  • Matchkey:
    Based on title (n letters from word x)
    Form of item and material type
    Exact date, if present
    URL, if published before 1900

  • ID (ISBN/ISSN):
    Based on ISBN/ISSN
    Form of item and material type
    Exact date, if present
    URL, if published before 1900
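
A sketch of how such a matchkey could be built. The concrete recipe (which title words, how many letters) is swissbib-internal and not spelled out here; the choices and field names below are assumptions for illustration.

    # Illustrative matchkey in the spirit of the description above.
    # Assumption: 5 letters each from the first two title words; the
    # field names of the input dict are made up for this sketch.

    import re

    def make_matchkey(record: dict) -> str:
        words = re.findall(r"\w+", record.get("title", "").lower())
        title_part = "".join(word[:5] for word in words[:2])
        parts = [
            title_part,
            record.get("form_of_item", ""),
            record.get("material_type", ""),
            record.get("date", ""),                # exact date, if present
        ]
        year = record.get("year")
        if year is not None and year < 1900:       # URL only for pre-1900 items
            parts.append(record.get("url", ""))
        return "|".join(parts)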

Detailed comparison

The following elements of a record are used to generate match indexes (a partial comparison sketch follows the list):

  • ID (ISSN, ISBN)

  • title (245$a,$b,$n,$p / 246 $a)

  • authors (corporate, 110 / 111 / 710 / 711)

  • authors (person, 100 / 700)

  • year of publication (008)

  • decade of publication (008)

  • century of publication (008)

  • edition (250$a)

  • part (490$v / 830$v)

  • publisher's initials, for all publications (260$b / 264$b)

  • publisher's name, only for serials (260$b / 264$b)

  • pages (300$a, +/- 1 page)

  • volumes (300 $a)

  • coordinates

  • scale

  • source system, only for non-textual material
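
A partial sketch of such a detailed comparison, covering only a few of the elements listed above. The field names and most matching rules are assumptions; the one-page tolerance is taken from the list.

    # Partial sketch of the detailed comparison; field names are
    # illustrative and only a few of the listed elements are covered.

    def pages_match(a: int | None, b: int | None) -> bool:
        """Pages may differ by one page, as noted in the list above."""
        if a is None or b is None:
            return False
        return abs(a - b) <= 1

    def detailed_match(rec_a: dict, rec_b: dict) -> bool:
        # Identifiers, if present on both sides, must agree exactly.
        if rec_a.get("isbn") and rec_b.get("isbn") and rec_a["isbn"] != rec_b["isbn"]:
            return False
        # Normalized titles and the year of publication must agree.
        if rec_a.get("title_norm") != rec_b.get("title_norm"):
            return False
        if rec_a.get("year") != rec_b.get("year"):
            return False
        # Pagination may differ by at most one page.
        return pages_match(rec_a.get("pages"), rec_b.get("pages"))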

Merging

Classical merging combines two or more matching records into a new one and is a destructive process. Instead, swissbib builds a cluster of matching records. Out of a cluster of matched records, a merged record is built, based on the “richest” record in the cluster. Additional information from the other records in the cluster is compared with the information already present and added accordingly. This merged record is temporary and is rebuilt or updated each time a record in the cluster is updated or deleted.
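
A minimal sketch of this non-destructive merge, assuming that the “richest” record can be approximated by the number of filled fields (the criterion swissbib actually uses is not specified here):

    # Sketch of the non-destructive merge: the richest record serves as
    # the base, missing information is supplemented from the other
    # cluster members, and the source records remain untouched.
    # "Richness" is approximated here by the number of filled fields.

    def merge_cluster(cluster: list[dict]) -> dict:
        richest = max(cluster, key=lambda rec: sum(1 for v in rec.values() if v))
        merged = dict(richest)               # copy; the source record stays intact
        for record in cluster:
            for field, value in record.items():
                if value and not merged.get(field):
                    merged[field] = value    # add information the base lacks
        return merged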

There are, however, some rules that prevent merging (see the check sketched after this list):

  • documents marked with a “nomerge” code by a library are not merged

  • two or more matching documents from the same source are not merged

  • documents which are still in process (and would quite likely cause wrong merges) are not merged
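
One possible reading of these rules as a single pre-merge check, with illustrative field names:

    # One reading of the merge-prevention rules above; the field names
    # ("nomerge", "source", "in_process") are illustrative.

    def may_merge(cluster: list[dict]) -> bool:
        if any(rec.get("nomerge") for rec in cluster):        # "nomerge" code set
            return False
        sources = [rec.get("source") for rec in cluster]
        if len(set(sources)) != len(sources):                 # same source twice
            return False
        if any(rec.get("in_process") for rec in cluster):     # records in process
            return False
        return True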