A Beginner’s Guide to MARC and SPARQL

I’ve been retired for a whole week now, but one thought remains unsettled. As my last act of employment, I wrote a blog post that described an odd idea…

“Believe it or not, MARC can be indexed using SPARQL…. [It’s] not that hard.”

Hanging Together, “How MARC can SPARQL!”
https://hangingtogether.org/how-marc-can-sparql/

I forfeited access to the records and code when I retired, but as the pull quote says, “[It’s] not that hard.” Fortunately, OCLC Research published a “Library Science” subset of WorldCat in 2016 that anyone can download and experiment with. The exploration can start there.

The original post left implementation as an exercise for the reader. As amends, I created a GitHub repository named “smarcql” (get it?) and started over with the code. Some details need to be fleshed out, but the bones should be good. Cloning this repository and loading data locally should allow developers to follow along for themselves. The setup instructions aren’t trivial, but should be manageable for people willing to install Java, Git, sbt, and Spark in their own environments. Most of the heavy lifting is done by an XSL stylesheet that converts MARC/XML to SMARCQL/RDF.
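To give a feel for what that conversion produces, here is a minimal Python sketch. It is a stand-in for illustration only, not the repository's XSL stylesheet. The only things taken from this post are the two namespaces and the naming pattern visible in the test query (`tag:bd100`, `code:sa`). Everything else is an assumption: the sample record, the `_:rec` and `_:f0` blank-node labels, and the decision to ignore indicators and control fields.

```python
import xml.etree.ElementTree as ET

# Standard MARC/XML namespace, plus the two smarcql namespaces used in the query.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"
TAG = "https://w3id.org/smarcql/tag/"
CODE = "https://w3id.org/smarcql/code/"

# Hypothetical sample record, made up for this sketch.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Cockerell, Douglas.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Bookbinding, and the care of books :</subfield>
  </datafield>
</record>"""


def record_to_triples(xml_text, subject="_:rec"):
    """Return (subject, predicate, object) tuples for one MARC/XML record.

    Each data field becomes a blank node hanging off the record, and each
    subfield becomes a literal hanging off that blank node. The "bd" and "s"
    name prefixes are inferred from the query, not from the real stylesheet.
    """
    record = ET.fromstring(xml_text)
    triples = []
    for i, field in enumerate(record.iter(f"{MARC_NS}datafield")):
        node = f"_:f{i}"
        triples.append((subject, f"<{TAG}bd{field.get('tag')}>", node))
        for sub in field.iter(f"{MARC_NS}subfield"):
            predicate = f"<{CODE}s{sub.get('code')}>"
            triples.append((node, predicate, sub.text or ""))
    return triples


if __name__ == "__main__":
    for s, p, o in record_to_triples(SAMPLE):
        print(s, p, o)
```

The nesting is the point: a record points at a field node, and the field node points at its subfield values, which is the shape the bracketed patterns in the query walk through.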

That’s it on the setup for now. Let me know how it goes in the comments. Assuming the instructions work out, here is a test query to confirm the result:

Most frequently referenced author/title (100$a/245$a)

PREFIX tag: <https://w3id.org/smarcql/tag/>
PREFIX code: <https://w3id.org/smarcql/code/>

SELECT ?author ?title (COUNT(DISTINCT ?rec) AS ?recCount)
WHERE {
  # Match records that have both a 100 $a (author) and a 245 $a (title).
  ?rec tag:bd100 [
    code:sa ?author
  ] ;
  tag:bd245 [
    code:sa ?title
  ] .
}
GROUP BY ?author ?title
ORDER BY DESC(?recCount)
LIMIT 5

| author              | title                                             | recCount |
|---------------------|---------------------------------------------------|----------|
| Cockerell, Douglas. | Bookbinding, and the care of books :              | 64       |
| Waal, Henri van de, | Iconclass :                                       | 58       |
| Dewey, Melvil,      | Dewey decimal classification and relative index.  | 56       |
| Dewey, Melvil,      | Dewey decimal classification and relative index / | 51       |
| Chan, Lois Mai.     | Library of Congress subject headings :            | 50       |

In other words, there are 64 MARC records in the dataset with “Cockerell, Douglas.” as the creator (100 $a) and “Bookbinding, and the care of books :” as the title (245 $a). People looking at this result from an entity perspective might be surprised that Melvil Dewey’s title appears twice with different record counts, but people steeped in the legacy of MARC probably feel the pain of inconsistencies, big and small. Details like capitalization and punctuation (not to mention typos, languages and encoding rules) often vary in the MARC realm in ways that stick out like a sore thumb in SPARQL. Text searching, as opposed to querying, can mask some of these differences, but that and other topics can be addressed another day.
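As a rough illustration of how much of that variation is just trailing punctuation, the Python sketch below takes the five rows from the result above, strips trailing ISBD-style punctuation, folds case, and re-totals. The `normalize` and `merge_rows` helpers are hypothetical names invented for this example, not part of smarcql. Adding the counts together assumes no record was counted in both Dewey rows, which seems plausible for a single 245 $a per record but is still an assumption.

```python
from collections import Counter

# Rows copied from the query result: (100 $a, 245 $a, record count).
ROWS = [
    ("Cockerell, Douglas.", "Bookbinding, and the care of books :", 64),
    ("Waal, Henri van de,", "Iconclass :", 58),
    ("Dewey, Melvil,", "Dewey decimal classification and relative index.", 56),
    ("Dewey, Melvil,", "Dewey decimal classification and relative index /", 51),
    ("Chan, Lois Mai.", "Library of Congress subject headings :", 50),
]

# Characters treated as disposable when they end a subfield value.
TRAILING = " .,:;/="


def normalize(value):
    """Strip trailing punctuation and whitespace, then fold case."""
    return value.rstrip(TRAILING).casefold()


def merge_rows(rows):
    """Sum record counts for rows whose normalized author/title match."""
    totals = Counter()
    for author, title, count in rows:
        totals[(normalize(author), normalize(title))] += count
    return totals.most_common()


if __name__ == "__main__":
    for (author, title), count in merge_rows(ROWS):
        print(f"{count:4d}  {author} | {title}")
```

With this crude rule the two Dewey rows collapse into one entry with 107 records, which moves it ahead of Cockerell. The rule is deliberately blunt: it would also strip a period that belongs to an initial or abbreviation, so treat it as a way to spot near-duplicates rather than as a cleanup recipe.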