Bibliographic data and authorities

swissbib contains the bibliographic data of RERO, the IDS libraries, the Swiss National Library and the IDS partner libraries. Due to the organizational structure of the Swiss library system, the data is stored in different databases. The amount of duplicates in this data is high because the collections of the different libraries are similar in structure. The data is catalogued according to four sets of rules: AACR2, the RERO cataloguing rules, KIDS (IDS libraries until 2015) and RDA (IDS libraries from 2016). The data is structured in the MARC21 and IDSMARC formats. Multilingualism is another factor, one that mostly influences the recording of authorities and subject headings.

Whereas the formats can be mapped rather easily, the different cataloguing rules and, to an even greater extent, the different cataloguing practices pose problems.

In addition, swissbib contains the bibliographic data of the institutional repositories of the Swiss universities, of the e-codices and e-periodica collections, and the metadata for articles purchased under national licences. This data is structured in Dublin Core or MODS.

swissbib uses CBS as a data hub and for all processes to normalize, cluster and merge this data.

Preliminary study of the deduplication of bibliographic data and authorities

In order to get a picture of the difficulties involved in eliminating duplicates in catalogue data, the swissbib project mandated Pierre Gavin and Jean-Bertrand Gonin in 2007 to conduct a feasibility study on the deduplication of bibliographic and name-authority data. They checked the feasibility of deduplicating multilingual content in different MARC flavours (IDSMARC and MARC21), catalogued according to different rule sets (AACR2, KIDS).

This is a summary of the findings.

Deduplication of bibliographic records

Proceeding

A sample of 80'000 records was taken from seven databases of two networks (RERO and IDS) and the Swiss National Library. To ensure a high density of duplicates, the records were extracted by keyword search via Z39.50 rather than sequentially by record number. To evaluate the accuracy of the deduplication, a group of 700 records was first checked by the algorithm and then rechecked manually.

Limitations

To avoid distorting the accuracy of the findings, media types with special cataloguing requirements were excluded:

The algorithm

The algorithm takes into account the content of the following fields:

For each field, the algorithm assigns a number signifying the similarity of its content. Since the fields listed above are of different importance, the assigned numbers carry different weights.

The algorithm still has potential for refinement: the series statement (field 490) and the language code in field 008 or 040 could be included in the analytical framework.
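The weighting idea described above can be sketched in Python. The fields, weights and similarity measure used here are illustrative assumptions, not the actual parameters of the study:

```python
from difflib import SequenceMatcher

# Hypothetical field weights: title and author matter more than pagination.
# The real study's fields and weights are not reproduced here.
FIELD_WEIGHTS = {"title": 5, "author": 4, "year": 3, "publisher": 2, "pages": 1}

def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values, between 0.0 and 1.0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities, normalized to [0, 1]."""
    total = sum(FIELD_WEIGHTS.values())
    weighted = sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )
    return weighted / total

a = {"title": "Der Zauberberg", "author": "Mann, Thomas", "year": "1924"}
b = {"title": "Der Zauberberg", "author": "Thomas Mann", "year": "1924"}
print(round(record_score(a, b), 2))
```

A pair whose score exceeds a chosen threshold would be treated as a candidate duplicate; tuning that threshold is exactly where the manual recheck of the 700-record sample comes in.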

Findings

The preliminary study shows that deduplication of multilingual content by algorithm is feasible.

Accuracy of algorithm

The accuracy of the algorithm in its present state shows the following characteristics. It produces approximately:

As noted above, there is still potential to refine the algorithm.

Positive factors concerning deduplication

Problems (to be solved)

Note that some of the points mentioned below are not very problematic; the issues that are complicated to solve are marked with ***.

Data concerning the deduplication of 80'000 records

Numbers of duplicates found in the sample

N = Nebis

Z = IDS Zürich

B = IDS Basel/Bern

S = IDS St. Gallen (UniSG)

L = IDS Luzern

R = RERO

H = Helveticat

Numbers of duplicates found

 

       *      N      Z      B      S      L      R      H
*  37904  13746   8761  17044   4016   5413  16732   9318
N   6679    480   1872   4008    827   1199   3346   1756
Z   3782   1859    191   2527    469    690   2070    802
B   9088   4124   2619   1040   1076   1483   4866   2127
S   1526    804    464   1038     48    261    897    375
L   2088   1180    684   1400    264    128   1069    534
R   9533   3514   2115   4873    935   1094   1613   2864
H   5208   1785    816   2158    397    558   2871    860

Numbers of retrieved sample data in proportion to total catalogue size

Catalogue size in proportion to sample

 

Site    kRec   retrieved   proportion (x100)
N       3800       14731                3.88
Z       1700        6076                3.57
B       4400       17371                3.95
S        500        2357                4.71
L        500        2973                5.95
R       4300       24871                5.78
H       1500       12756                8.5
TOTAL  16700       81135

Clustering and merging in swissbib

The clustering algorithm used by swissbib to detect duplicate records compares the data in two steps. In a first step, the data is split into preliminary clusters. In a second step, the records within these preliminary clusters are compared with each other. The comparison is based on indexed data that is, for some elements, heavily normalized.

The result of this comparison is either a single record or a cluster of records which are considered duplicates. These clusters of records are the input for the merging process which creates the merged records used in swissbib.

Every update of a record triggers the clustering of the records and, if a cluster gains or loses members, an update of the merged record.
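The two-step approach can be sketched as follows. The blocking key, the normalization and the comparison function here are invented stand-ins for swissbib's actual match indexes:

```python
from collections import defaultdict

def normalize(text: str) -> str:
    """Heavy normalization (illustrative): lowercase, strip punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def blocking_key(rec: dict) -> str:
    """Step 1: coarse key; records sharing a key form a preliminary cluster.
    Year plus first title word is an assumption, not swissbib's real key."""
    words = normalize(rec.get("title", "")).split()
    return f"{rec.get('year', '')}:{words[0] if words else ''}"

def cluster(records, is_duplicate):
    """Step 2: pairwise comparison, but only within each preliminary cluster."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    all_clusters = []
    for block in blocks.values():
        block_clusters = []
        for rec in block:
            for c in block_clusters:
                if is_duplicate(rec, c[0]):
                    c.append(rec)
                    break
            else:
                block_clusters.append([rec])
        all_clusters.extend(block_clusters)
    return all_clusters

same_title = lambda r1, r2: normalize(r1["title"]) == normalize(r2["title"])
records = [
    {"title": "Der Zauberberg", "year": "1924"},
    {"title": "Der Zauberberg.", "year": "1924"},
    {"title": "Buddenbrooks", "year": "1901"},
]
print([len(c) for c in cluster(records, same_title)])
```

The point of the preliminary step is efficiency: the expensive detailed comparison is never run across the whole database, only within each small preliminary cluster.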

Preliminary comparison

For the first step two indexes are compared:

Detailed comparison

The following elements of a record are used to generate match indexes:

Merging

Classical merging combines two or more matching records into a new one and is a destructive process. swissbib instead builds a cluster of matching records. From such a cluster, a merged record is built based on the “richest” record in the cluster. Additional information from the other records in the cluster is compared to the information already present and added accordingly. This merged record is temporary: it is rebuilt or updated each time a record in the cluster is updated or deleted.
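This non-destructive merge can be sketched as follows. Treating the “richest” record as the one with the most non-empty fields is an assumption for illustration; swissbib's actual richness criterion is not specified here:

```python
def merge_cluster(duplicates):
    """Build a merged record from a cluster of duplicate records.
    The originals are untouched; the merged record is temporary and
    would be rebuilt whenever the cluster changes."""
    # Base: the "richest" record (here: most non-empty fields, an assumption).
    richest = max(duplicates, key=lambda rec: sum(1 for v in rec.values() if v))
    merged = dict(richest)
    # Add information the base record lacks from the other cluster members.
    for rec in duplicates:
        for field, value in rec.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged

cluster_of_duplicates = [
    {"title": "Der Zauberberg", "year": "1924", "publisher": ""},
    {"title": "Der Zauberberg", "year": "1924", "publisher": "S. Fischer"},
]
print(merge_cluster(cluster_of_duplicates))
```

Because the source records are kept, deleting one member of the cluster simply triggers a rebuild of the merged record from the remaining members, which a destructive merge could not do.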

However, there are some rules that prevent merging: