QuxFarm

A blog about data

  • MARC 880 ≠6 and “Multilingual” SMARCQL

    I’m almost two months into retirement and decided to give my video game controller a break. Just to mix it up a bit, I decided to add a new MARC+SPARQL=SMARCQL feature.

    One of the nice things about RDF is support for multiple languages and scripts. For example, here is a simple Wikidata query that lists Henry Petroski’s name in various scripts:

    https://w.wiki/565Q

    MARC also has the ability to capture information in multiple scripts, but the mechanism isn’t as straightforward. Instead, fields cataloged with alternate scripts are paired using an 880 field via a structured ≠6 “Linkage” subfield. Here, then, are examples of Henry Petroski’s name as they might be recorded in MARC records:

    100 ≠6 880-01 ≠a Petroski, Henry.
    880 ≠6 100-01/$1 ≠a 佩特罗斯基.

    or possibly

    880 ≠6 100-01/(2/r ≠a פטרוסקי, הנרי.

    The ≠6 structure contains up to 4 components described in detail here:

    ≠6 [linking tag]-[occurrence number]/[script identification code]/[field orientation code]
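
    As a rough illustration of that structure, here is a minimal Python sketch that splits a ≠6 value into its components. The names Linkage and parse_linkage are hypothetical and this is not SMARCQL's own code; the script labels are the ones listed in MARC 21 Appendix A.

```python
from typing import NamedTuple, Optional

# Script identification codes and labels as listed in MARC 21 Appendix A.
SCRIPT_LABELS = {
    "(3": "Arabic",
    "(B": "Latin",
    "$1": "Chinese, Japanese, Korean",
    "(N": "Cyrillic",
    "(S": "Greek",
    "(2": "Hebrew",
}


class Linkage(NamedTuple):
    """Hypothetical container for the parsed parts of a subfield 6 value."""
    linking_tag: str
    occurrence: str
    script: Optional[str] = None
    orientation: Optional[str] = None


def parse_linkage(value: str) -> Linkage:
    """Split '[tag]-[occurrence]/[script]/[orientation]' into its parts."""
    head, *rest = value.strip().split("/")
    tag, dash, occurrence = head.partition("-")
    if not dash or len(tag) != 3 or len(occurrence) != 2:
        raise ValueError(f"not a linkage value: {value!r}")
    script = rest[0] if len(rest) > 0 else None
    orientation = rest[1] if len(rest) > 1 else None
    return Linkage(tag, occurrence, script, orientation)


if __name__ == "__main__":
    for raw in ("880-01", "100-01/$1", "100-01/(2/r"):
        link = parse_linkage(raw)
        print(raw, "->", link, SCRIPT_LABELS.get(link.script or "", ""))
```

    The script and orientation parts are optional, which is why the regular 100 field above carries only a tag and an occurrence number.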

    Earlier SMARCQL releases left the ≠6 parsing and lookup as an exercise for users. The latest release, however, parses the ≠6 components into RDF triples and adds skos:exactMatch triples to connect the corresponding fields. Including these in the triplestore allows for some new analysis possibilities. For example, here is a query that lists variant scripts for the Latin-script name “Petroski, Henry.” as they exist in the MARC test set:

    PREFIX code: <https://w3id.org/smarcql/code/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT DISTINCT ?latinName ?altName ?script
    WHERE {
      VALUES ?latinName { "Petroski, Henry." }

      ?field code:sa ?latinName ;    # field holding the Latin-script name
        skos:exactMatch [            # its paired field in another script
          code:s6_scriptIdentificationCode [ rdfs:label ?script ] ;
          code:sa ?altName
        ]
    }

    latinName | altName | script
    Petroski, Henry. | 佩特罗斯基. | Chinese, Japanese, Korean
    Petroski, Henry. | פטרוסקי, הנרי. | Hebrew
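
    The pairing rule behind those skos:exactMatch triples can be sketched in a few lines of Python. This only illustrates the linking logic (an 880 field points back at a regular field with the same tag and occurrence number); the pair_fields name, the record keys, and the tuple layout are hypothetical and not part of SMARCQL.

```python
# Two hypothetical records, each a list of (tag, subfield 6, subfield a)
# tuples built from the example values in this post.
RECORDS = {
    "rec1": [
        ("100", "880-01", "Petroski, Henry."),
        ("880", "100-01/$1", "佩特罗斯基."),
    ],
    "rec2": [
        ("100", "880-01", "Petroski, Henry."),
        ("880", "100-01/(2/r", "פטרוסקי, הנרי."),
    ],
}


def pair_fields(fields):
    """Return (regular value, alternate-script value) pairs for one record."""
    regular = {}  # (tag, occurrence) -> subfield a of the regular field
    for tag, linkage, value in fields:
        if tag != "880":
            occurrence = linkage.split("/")[0].partition("-")[2]
            regular[(tag, occurrence)] = value

    pairs = []
    for tag, linkage, value in fields:
        if tag == "880":
            linked_tag, _, occurrence = linkage.split("/")[0].partition("-")
            # Occurrence "00" marks an 880 field with no linked regular field.
            if occurrence != "00" and (linked_tag, occurrence) in regular:
                pairs.append((regular[(linked_tag, occurrence)], value))
    return pairs


if __name__ == "__main__":
    for rec_id, fields in RECORDS.items():
        print(rec_id, pair_fields(fields))
```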

    Keep in mind that the ability to query terms across multiple scripts using SMARCQL doesn’t imply there is one and only one Henry Petroski being referred to. For that, the query would have to take ≠0 or ≠1 identifiers into account or else adapt the query to include contextual information for downstream reconciliation using a tool like OpenRefine.

    realworldobject

    April 23, 2022
    MARC, SMARCQL, SPARQL
  • Treatment of the MARC Leader in SMARCQL

    This post describes the latest release of SMARCQL, which adds handling for the MARC Leader. Note that SMARCQL’s initial release only accounted for MARC “Variable Data Fields” (01X-9XX), which characteristically contain “coded” subfields. For reference, the basic query pattern looks something like this:

    PREFIX tag: <https://w3id.org/smarcql/tag/>
    PREFIX code: <https://w3id.org/smarcql/code/>
    
    SELECT ?rec ?title
    WHERE {
      ?rec tag:bd245 ?field . # 245 - Title Statement
      ?field code:sa ?title . # $a - Title
    }
    LIMIT 5

    This rec/tag/field/code/subfield pattern doesn’t work well for the MARC Leader, though, because values are parsed from character positions rather than code delimiters. In the new release, leader positions can now be addressed in queries using a pattern like this:

    PREFIX tag: <https://w3id.org/smarcql/tag/>
    PREFIX pstn: <https://w3id.org/smarcql/position/>
    
    SELECT *
    WHERE {
      ?rec tag:bdleader ?bdleader .
      ?bdleader pstn:bdleader_06 ?bdleader_06 .
    }
    LIMIT 5

    A few things to point out:

    • The position: namespace (abbreviated pstn: in the query above) is new, and will come into play again later when the 00X-Control Fields are addressed.
    • Strictly speaking, the MARC Leader isn’t a “tag”, even though I assigned “bdleader” to the tag: namespace. The SMARCQL Ontology defines a variety of namespaces, but none seems appropriate for this situation.
    • The justification for using tag: is symmetry with the positional 00X-Control Fields. The Leader/06 and 07 influence how 00X fields are parsed, though, so those essentials are being addressed first.
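
    To make the “character positions rather than code delimiters” point concrete, here is a small Python sketch that reads Leader/06 and Leader/07 by slicing the fixed-length leader string. The function name and the sample leader are made up for illustration; the dictionary keys simply reuse the bdleader_06 and bdleader_07 local names from the query above.

```python
LEADER_LENGTH = 24  # the MARC 21 leader is a fixed 24-character string


def leader_positions(leader: str) -> dict:
    """Return Leader/06 and Leader/07, keyed by SMARCQL-style local names."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError(f"expected {LEADER_LENGTH} characters, got {len(leader)}")
    return {
        "bdleader_06": leader[6],  # Type of record
        "bdleader_07": leader[7],  # Bibliographic level
    }


if __name__ == "__main__":
    sample = "00000nam a2200000 a 4500"  # made-up leader, not from the dataset
    print(leader_positions(sample))
```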

    SMARCQL Ontology Treatment

    In order to adequately demonstrate queries involving the leader, we need to start fleshing out the ontology. The SMARCQL Ontology is still minimalistic, but it now contains details to help exercise leader-based queries. To start, I updated my Blazegraph instance to index the ontology triples alongside the sample SMARCQL/RDF test data.

    wget https://realworldobject.github.io/smarcql/ontology/smarcql.ttl
    
    java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader src/main/resources/fastload.properties smarcql.ttl

    Here, then, is a query that lists the leader fields along with potential values and the number of records associated with each.

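    PREFIX tag: <https://w3id.org/smarcql/tag/>
    PREFIX class: <https://w3id.org/smarcql/class/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?positionLabel ?positionValue ?positionValueLabel (COUNT(?rec) AS ?numRecs)
    WHERE {
      # Leader positions defined in the ontology
      ?position rdfs:domain class:BdLeader ;
                rdfs:range ?class ;
                rdfs:label ?positionLabel .
      # Possible values for each position
      ?individual a ?class ;
                  rdfs:label ?positionValueLabel ;
                  rdf:value ?positionValue .
      # Records using each value (optional, so unused values count as 0)
      OPTIONAL {
        ?rec tag:bdleader [
          ?position ?individual
        ]
      }
    }
    GROUP BY ?positionLabel ?positionValue ?positionValueLabel
    ORDER BY ?positionLabel ?positionValue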

    The result looks like this:

    positionLabel | positionValue | positionValueLabel | numRecs
    Bibliographic level | a | Monographic component part | 3304
    Bibliographic level | b | Serial component part | 14
    Bibliographic level | c | Collection | 600
    Bibliographic level | d | Subunit | 354
    Bibliographic level | i | Integrating resource | 1322
    Bibliographic level | m | Monograph/Item | 707498
    Bibliographic level | s | Serial | 66287
    Type of record | a | Language material | 747970
    Type of record | c | Notated music | 12
    Type of record | d | Manuscript notated music | 0
    Type of record | e | Cartographic material | 86
    Type of record | f | Manuscript cartographic material | 1
    Type of record | g | Projected medium | 4757
    Type of record | i | Nonmusical sound recording | 2320
    Type of record | j | Musical sound recording | 291
    Type of record | k | Two-dimensional nonprojected graphic | 249
    Type of record | m | Computer file | 5409
    Type of record | o | Kit | 92
    Type of record | p | Mixed materials | 70
    Type of record | r | Three-dimensional artifact or naturally occurring object | 15
    Type of record | t | Manuscript language material | 18107

    Here are some observations:

    • In MARC, leader values are encoded as character codes. In SMARCQL, these codes (strings) have been upgraded to URIs (things) that are defined in the ontology. For example, notice that the output includes labels for the leader positions and values. Those came from the ontology, based on the Library of Congress MARC Bibliographic Format specification.
    • The Leader/06 (“Type of record”) and 07 (“Bibliographic level”) values are crude, but they start to give a sense of the type of library resource that the MARC record describes.
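
    The “string to thing” upgrade can be pictured with a short Python sketch. The URI pattern follows the one individual that appears in this post's queries (bdleader_06-c in the individual namespace); applying the same pattern to Leader/07 is my assumption, and the small label table is copied from the result above rather than read from the ontology.

```python
INDIVIDUAL_NS = "https://w3id.org/smarcql/individual/"

# A few labels copied from the query result; the ontology holds the full set.
LABELS = {
    ("bdleader_06", "a"): "Language material",
    ("bdleader_06", "c"): "Notated music",
    ("bdleader_07", "m"): "Monograph/Item",
    ("bdleader_07", "s"): "Serial",
}


def individual_uri(position: str, code: str) -> str:
    """Turn a position name and a one-character code into a URI."""
    return f"{INDIVIDUAL_NS}{position}-{code}"


def describe(position: str, code: str):
    """Return (URI, label); the label is None if it is not in LABELS."""
    return individual_uri(position, code), LABELS.get((position, code))


if __name__ == "__main__":
    print(describe("bdleader_06", "c"))
    print(describe("bdleader_07", "m"))
```

    Once the code is a URI, anything else known about it (labels, classes, notes) can hang off that URI instead of being looked up by hand.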

    Digging Deeper into the Result

    The WorldCat-derived dataset used in this example has a “library science” theme, so the record counts in the result reflect that. Eyeballing the list, though, makes me wonder how the twelve hits on Leader/06 “c” (Type of record: Notated music) passed the selection criteria. To look closer, here is a query to list them and see:

    PREFIX tag: <https://w3id.org/smarcql/tag/>
    PREFIX code: <https://w3id.org/smarcql/code/>
    PREFIX pstn: <https://w3id.org/smarcql/position/>
    PREFIX ind: <https://w3id.org/smarcql/individual/>
    
    SELECT ?wc ?creatorLabel ?title
    WHERE {
      ?rec tag:bd001 ?bd001 ;
           tag:bdleader [
             pstn:bdleader_06 ind:bdleader_06-c
           ] .
      BIND(URI(CONCAT("http://www.worldcat.org/oclc/", ?bd001)) AS ?wc)
      OPTIONAL {?rec tag:bd100 [code:sa ?creatorLabel] }
      OPTIONAL {?rec tag:bd245 [code:sa ?title ] }
    }
    ORDER BY ?title

    The result looks like this:

    wc | creatorLabel | title
    <http://www.worldcat.org/oclc/681584959> |  | ABC Coast FM Gutenberg experience
    <http://www.worldcat.org/oclc/458544284> |  | Books and libraries in the Americas.
    <http://www.worldcat.org/oclc/643257385> |  | Buchkultur :
    <http://www.worldcat.org/oclc/820384936> |  | Bulletin de la Société des Amis de la Bibliothèque et de l’Histoire de l’Ecole Polytechnique
    <http://www.worldcat.org/oclc/906787592> | Marín Fernández, Josefa. | Estadística aplicada a las ciencias de la documentación/
    <http://www.worldcat.org/oclc/906785372> | Kerr, George D. | Fidelizar clientes en la biblioteca pública :
    <http://www.worldcat.org/oclc/473663877> |  | Fun songs /
    <http://www.worldcat.org/oclc/643845859> |  | Health information and libraries journal :
    <http://www.worldcat.org/oclc/946097862> |  | Informetrics 89/90 :
    <http://www.worldcat.org/oclc/45253549> |  | Just for kids NOT!
    <http://www.worldcat.org/oclc/34835157> |  | Just for kids NOT!
    <http://www.worldcat.org/oclc/830140233> |  | Toward a theory of librarianship :

    Why “Books and libraries in the Americas.” is cataloged as a musical score, or what “Fun songs” has to do with “library science”, may be interesting questions, but SMARCQL can only help find potential issues like these, not correct them. Also, to be fair, this dataset is from 2016, so some of the problems it reveals may have since been corrected.

    SMARCQL Ontology Namespaces

    Now that the SMARCQL Ontology is getting fleshed out, it’s worth discussing the namespaces that it declares:

    tag: <https://w3id.org/smarcql/tag/>

    The tag: namespace represents the notion of a MARC tag, which SMARCQL treats as owl:ObjectProperty. For example, tag:bd245 refers to the “245 – Title Statement” in MARC Bibliographic Format. The “bd” prefix is an abbreviation of “bibliographic data” and serves two purposes. 1) It differentiates the tag’s meaning from other MARC formats. 2) It allows SMARCQL/RDF to be serialized as RDF/XML, which doesn’t allow element names to start with a digit.

    code: <https://w3id.org/smarcql/code/>

    The code: namespace represents the notion of a MARC subfield code, which SMARCQL treats as owl:DatatypeProperty. For example, code:sa refers to “$a” whose meaning varies based on the tagged field where it occurs. This treatment of code: elements as RDF properties is admittedly awkward because the property’s meaning varies from tag to tag. The problem isn’t insurmountable, but remains a topic for the future.

    position: <https://w3id.org/smarcql/position/>

    The position: namespace represents positional elements that occur in the MARC Leader and 006/007/008 fields. SMARCQL treats these as owl:ObjectProperty, even though the values in MARC/XML are generally codes. As described above, this upgrade from string to thing allows for enhanced functionality such as label assignment. For now, only two position elements are accounted for in SMARCQL (position:bdleader_06 and position:bdleader_07). The others will come in due course.

    individual: <https://w3id.org/smarcql/individual/>

    The individual: namespace represents string values in positional MARC/XML elements that have been upgraded to URI-identified things. For now, the only examples are values for Leader/06 and 07. Note that SMARCQL only intends to upgrade positional values that are defined by the MARC Format itself. In particular, SMARCQL makes no effort to upgrade values for the code: properties. Those elements will inevitably be owl:DatatypeProperty and remain so. URIs that correspond to subfield literals might be available in the field’s $0/$1, but if so the queries must look there explicitly.

    class: <https://w3id.org/smarcql/class/>

    The class: namespace is used to coordinate the domain and range aspects of the SMARCQL properties as well as group identified individuals in a class taxonomy. Details can be discussed later.
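
    As a quick recap of the naming conventions, here is a Python sketch that builds tag: and code: URIs from plain MARC values. The helper names are hypothetical, and the “s” prefix for subfield codes is inferred from the code:sa, code:sb, and code:sc names used in the queries on this blog, so treat it as an assumption for anything beyond alphabetic codes.

```python
NAMESPACES = {
    "tag": "https://w3id.org/smarcql/tag/",
    "code": "https://w3id.org/smarcql/code/",
}


def tag_uri(marc_tag: str) -> str:
    """'245' -> .../tag/bd245; 'bd' marks the Bibliographic format."""
    return NAMESPACES["tag"] + "bd" + marc_tag


def code_uri(subfield_code: str) -> str:
    """'a' -> .../code/sa (assumed pattern, see the note above)."""
    return NAMESPACES["code"] + "s" + subfield_code


def starts_with_digit(local_name: str) -> bool:
    """True for names such as '245' that cannot be XML element names."""
    return local_name[:1].isdigit()


if __name__ == "__main__":
    print(tag_uri("245"), code_uri("a"))
    print(starts_with_digit("245"), starts_with_digit("bd245"))
```

    The last function shows the second reason for the “bd” prefix: a bare “245” starts with a digit, while “bd245” does not.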

    realworldobject

    March 16, 2022
    MARC, SMARCQL, SPARQL
  • Fulltext Searching Against MARC+SPARQL (SMARCQL)

    In my initial SMARCQL post, I included a SPARQL query for the most frequently referenced author/title and pointed out how inconsistencies in MARC records “stick out like a sore thumb in SPARQL”. I also hinted that text searching, as opposed to querying, can help deal with some of those difficulties. To facilitate this in practice, Blazegraph supports a hybrid FullTextSearch extension that is easily enabled.

    To illustrate, here is a hybrid query that uses the text search extension to find references to the terms “lois chan”, regardless of order, adjacency, capitalization, and punctuation. The remainder of the query groups the resulting subfields by tag and code so the context and variances are a little more obvious.

    PREFIX tag: <https://w3id.org/smarcql/tag/>
    PREFIX code: <https://w3id.org/smarcql/code/>
    PREFIX bds: <http://www.bigdata.com/rdf/search#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?tag ?code (COUNT(DISTINCT ?rec) AS ?numRecs) ?subfield
    WHERE {
      ?rec ?tag ?field .
      ?field ?code ?subfield .
      ?subfield bds:search "lois chan" .      # full-text search terms
      ?subfield bds:matchAllTerms "true" .    # require every term to match

      FILTER(?code != rdfs:label)             # skip label triples
    }
    GROUP BY ?tag ?code ?subfield
    ORDER BY DESC(?numRecs)
    LIMIT 10

    tag | code | numRecs | subfield
    tag:bd100 | code:sa | 176 | Chan, Lois Mai.
    tag:bd245 | code:sc | 75 | Lois Mai Chan.
    tag:bd700 | code:sa | 48 | Chan, Lois Mai.
    tag:bd776 | code:sa | 25 | Chan, Lois Mai.
    tag:bd245 | code:sc | 17 | by Lois Mai Chan.
    tag:bd245 | code:sc | 13 | prepared by Lois Mai Chan for the Library of Congress.
    tag:bd700 | code:sa | 12 | Chan, Lois Mai
    tag:bd245 | code:sc | 10 | Lois Mai Chan and Richard Pollard.
    tag:bd250 | code:sb | 9 | by Lois Mai Chan.
    tag:bd245 | code:sc | 9 | Sharon Chien Lin ; forewords by Lois Mai Chan and Ching-Chih Chen.
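
    For anyone wondering why all of those variants match, here is a tiny Python approximation of “match all terms regardless of order, adjacency, capitalization, and punctuation”. It is only a sketch of the idea with hypothetical function names; Blazegraph's own text analyzer may tokenize differently.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word tokens, with punctuation and word order discarded."""
    return set(re.findall(r"\w+", text.casefold()))


def matches_all_terms(query: str, subfield: str) -> bool:
    """True when every query term occurs somewhere in the subfield."""
    return tokens(query) <= tokens(subfield)


if __name__ == "__main__":
    for value in ("Chan, Lois Mai.", "by Lois Mai Chan.", "Chan, Mai"):
        print(value, "->", matches_all_terms("lois chan", value))
```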

    The FullTextSearch extension also supports relevance (cosine similarity) and ranking (ordinal position), which can be useful for reconciliation or autosuggestion purposes. Once the SMARCQL mapping and ontology are a little more fleshed out, I have some applications in mind to demonstrate more of these possibilities.

    realworldobject

    March 12, 2022
    MARC, SMARCQL, SPARQL
  • A Beginner’s Guide to MARC and SPARQL

    I’ve been retired for a whole week now, but one thought remains unsettled. As my last act of employment, I wrote a blog post that described an odd idea…

    “Believe it or not, MARC can be indexed using SPARQL…. [It’s] not that hard.”

    Hanging Together, “How MARC can SPARQL!”
    https://hangingtogether.org/how-marc-can-sparql/

    I forfeited access to the records and code when I retired, but as the drop quote says “[It’s] not that hard.” Fortunately, OCLC Research published a “Library Science” subset of WorldCat in 2016 that anyone can download and experiment with. The exploration can start there.

    The original post left implementation as an exercise for the reader. As amends, I created a GitHub repository named “smarcql” (get it?) and started over with the code. Some details need to be fleshed out, but the bones should be good. Cloning this repository and loading data locally should allow developers to follow along for themselves. The setup instructions aren’t trivial, but should be manageable for people willing to install Java, Git, sbt, and Spark in their own environments. Most of the heavy lifting is done by an XSL stylesheet that converts MARC/XML to SMARCQL/RDF.

    That’s it on the setup for now. Let me know how it goes in the comments. Assuming the instructions work out, here is a test query to confirm the result:

    Most frequently referenced author/title (100$a/245$a)

    PREFIX tag: <https://w3id.org/smarcql/tag/>
    PREFIX code: <https://w3id.org/smarcql/code/>
    
    SELECT ?author ?title (COUNT(DISTINCT ?rec) AS ?recCount)
    WHERE {
      
      ?rec tag:bd100 [
        code:sa ?author
      ];
      tag:bd245 [
          code:sa ?title
      ] .
    }
    GROUP BY ?author ?title
    ORDER BY DESC(?recCount)
    LIMIT 5

    author | title | recCount
    Cockerell, Douglas. | Bookbinding, and the care of books : | 64
    Waal, Henri van de, | Iconclass : | 58
    Dewey, Melvil, | Dewey decimal classification and relative index. | 56
    Dewey, Melvil, | Dewey decimal classification and relative index / | 51
    Chan, Lois Mai. | Library of Congress subject headings : | 50

    In other words, there are 64 MARC records in the dataset with “Cockerell, Douglas.” as the creator (100 $a) and “Bookbinding, and the care of books :” as the title (245 $a). People looking at this result from an entity perspective might be surprised that Melvil Dewey’s title appears twice with different record counts, but people steeped in the legacy of MARC probably feel the pain of inconsistencies, big and small. Details like capitalization and punctuation (not to mention typos, languages and encoding rules) often vary in the MARC realm in ways that stick out like a sore thumb in SPARQL. Text searching, as opposed to querying, can mask some of these differences, but that and other topics can be addressed another day.
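
    To show how small those differences are, here is a naive Python sketch that folds case and trailing punctuation into a match key and merges the two Dewey rows from the result above. The match_key and merge_counts names are hypothetical, and the rule is deliberately crude: stripping a trailing period would also clip initials and abbreviations, so this is an illustration rather than a reconciliation method.

```python
import unicodedata
from collections import Counter

TRAILING_PUNCTUATION = " .,:;/"

# The two Dewey rows from the query result above.
ROWS = [
    ("Dewey, Melvil,", "Dewey decimal classification and relative index.", 56),
    ("Dewey, Melvil,", "Dewey decimal classification and relative index /", 51),
]


def match_key(value: str) -> str:
    """Normalize Unicode, drop trailing punctuation, and fold case."""
    normalized = unicodedata.normalize("NFKC", value)
    return normalized.rstrip(TRAILING_PUNCTUATION).casefold()


def merge_counts(rows) -> Counter:
    """Sum record counts for rows whose author/title keys coincide."""
    totals = Counter()
    for author, title, count in rows:
        totals[(match_key(author), match_key(title))] += count
    return totals


if __name__ == "__main__":
    for key, total in merge_counts(ROWS).items():
        print(key, total)
```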

    realworldobject

    March 12, 2022
    MARC, SPARQL
