Dublin Core's dirty little secret (2010) (reprog.wordpress.com)
100 points by dcminter on Oct 2, 2021 | 53 comments


I worked for OCLC for about five years, so I have a lot of sympathy for the pains that archivists go through when trying to standardize content metadata.

Can't be that hard of a problem to solve, right? I mean, we're only talking about coming up with a good solution for cataloging the sum total of all human knowledge.

(Not a small undertaking.)

Libraries are caught between a rock and a hard place. On the one hand, they have to be stewards of this gigantic and ever-growing mound of paper, and on the other hand, they have to deal with the lofty ideas of the W3C and Tim Berners-Lee, trying to connect all of that paper to his early 2000s vision of the ultimate knowledge graph / the semantic web. No wonder we're sticking to MARC 21.

The only people using that web for research are universities. The rest of us are using WHATWG's spawn-of-Satan, hacked-together platforms because the technology always moves faster than catalogued content. Hell, at this point, GPT-3 is probably a better approach to knowledge processing than trying to piece together something actionable from a half baked information graph born of old programmers' utopian fever dreams.


“old programmers' utopian fever dreams” - accurate description of me when I first found Dublin Core. Never made it to production. It was unnecessary overhead for describing internal content.


>> "Hell, at this point, GPT-3 is probably a better approach to knowledge processing than trying to piece together something actionable from a half baked information graph born of old programmers' utopian fever dreams."

Greatest thing I've read on HN. As a librarian and developer, can confirm. At least in most cases...(slipping back into fever dream)....


Having worked with DC, MARC 21 and some of its precursors, I feel that one major problem with the approach is that cataloguers try to infinitely cut up metadata into ever smaller pieces until you end up with an impossibly large collection of fields with an overly high level of specificity and a low level of applicability.

For example, the middle name of a married name of a person that contributed to one part of one edition of the work in a certain capacity etc.

You end up with so much unwieldy, hyper-specific metadata that consistent cataloguing and detailed searching become nigh impossible.

Ever since search engines arrived, the world has mostly moved to flat full-text search rather than highly specific faceted search over deeply structured fields. Of course, one would want something a bit smarter than full-text search, but the real, human world is so complex that trying to categorise any data point quickly becomes an exercise in futility.


The problem with models like GPT-3 is that they are unable to differentiate between information sources with different "trustworthiness." They learn conspiracies and wrong claims and repeat them.

It's possible to feed GPT-3 prompts that encourage it to respond with conspiracies (e.g. "who really caused 9/11?"), but it also randomly responds to normal prompts with conspiracies/misinformation.

A recent paper[0] has looked into building a testing dataset for language models' ability to distinguish truth from falsehood.

[0] https://arxiv.org/abs/2109.07958
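
For anyone curious, here's a minimal sketch of poking at that benchmark with an off-the-shelf open model (GPT-3 itself is API-only, so GPT-2 stands in; the dataset and config names are assumed from the Hugging Face hub mirror of the paper's data):

    # Rough sketch: sample a few TruthfulQA questions and compare a small open
    # model's completions with the reference answers. Names assumed, not verified.
    from datasets import load_dataset
    from transformers import pipeline

    qa = load_dataset("truthful_qa", "generation", split="validation")
    generate = pipeline("text-generation", model="gpt2")

    for row in qa.select(range(3)):
        out = generate("Q: " + row["question"] + "\nA:", max_new_tokens=40)
        print(row["question"])
        print("  model:", out[0]["generated_text"].split("A:")[-1].strip())
        print("  reference:", row["best_answer"])

Even eyeballing a handful of completions makes the paper's point: the model completes the prompt fluently whether or not the completion is true.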


> A recent paper[0] has looked into building a testing dataset for language models' ability to distinguish truth from falsehood.

Isn't this a massive category error? Truth or falsehood does not reside within any symbol stream but in the interaction of that stream with observable reality. Does nobody in the AI world know Baudrillard?


Well, yeah, these models cannot interact with and observe the world to test the veracity of claims. They are language models and the target for them is text production. No one expects them to understand the universe.

My comment was in response to > GPT-3 is probably a better approach to knowledge processing

and the paper is relevant in that it shows the limitation of current language models in terms of logical consistency or measures of the quality of text sources. GPT-3 and other models are not trained for this and obviously they fail at the task. This is evidence against them being a "better approach to knowledge processing."

Even if we trained future models preferentially on the latest and most-cited scientific papers, we would still have issues with conflicting claims and incorrect/fabricated results.

However, that does not mean that it would not be practically useful to figure out a way to include some checks or confidence estimates of the truthfulness of model training data and responses. Perhaps just training the models to answer that they don't know when the training data is too conflicted would be useful enough.


It should be at least theoretically possible for an AI to identify contradictions, incoherence and inconsistency in a set of assertions. So, not identifying falsehood per se, but assigning a fairly accurate likelihood score based solely on the internal logic of the symbol stream. In other words, a bullshit detector.


> It should be at least theoretically possible for an AI to identify contradictions, incoherence and inconsistency in a set of assertions

Not in the slightest. Likelihood of veracity is opinion; laundering it as fact to make some people feel better doesn't make it any less subjective, or any more authoritative.


I think we're not disagreeing, exactly.

As a simple example, here is a set of assertions:

* The moon is made of green cheese

* The moon is crystalline rock surrounding an iron core

It wouldn't take an AI to see that both of these can't be true, even if we weren't clear about what a moon is made of, exactly.

Some of our common understanding could contain more complicated internal contradictions that might be harder for a human to tease apart, but that an AI might be able to identify.
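
As a rough illustration of what "identifying contradictions" could mean in practice, an off-the-shelf natural language inference model can at least score a pair of assertions for contradiction. A minimal sketch, assuming the public roberta-large-mnli checkpoint (label ordering is read from the model config rather than assumed):

    # Sketch: score one pair of assertions with a pretrained NLI model.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "roberta-large-mnli"  # assumed public checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)

    premise = "The moon is made of green cheese"
    hypothesis = "The moon is crystalline rock surrounding an iron core"

    inputs = tok(premise, hypothesis, return_tensors="pt")
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

    for idx, label in model.config.id2label.items():
        print(label, round(float(probs[idx]), 3))

Of course that only scores pairwise, surface-level conflict between two sentences; it says nothing about which of them is actually true.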


How would the AI know that "made of green cheese" isn't just another way of saying "crystalline rock surrounding an iron core"? To find a contradiction in a statement like "X is A. X is B" it'd first have to be intelligent enough to know when A != B. In your example, that's not as simple as 1 != 0.


> The problem with models like GPT-3 is that they are unable to differentiate between information sources with different "trustworthiness." They learn conspiracies and wrong claims and repeat them.

Impressive, AI has already reached human levels of understanding.


...it saddens me how true that is given the last few years. I still like to think most people are capable of a deeper level of understanding.


I don't agree with "given the last few years". Conspiracy theories have always been a thing, now information just flows faster.


Isn’t there the same problem with structured data?


> Hell, at this point, GPT-3 is probably a better approach to knowledge processing than trying to piece together something actionable from a half baked information graph born of old programmers' utopian fever dreams.

I mean, at this point, wouldn't it be a lot simpler to go back to the middle ages (or earlier) and have a few humans memorize all that stuff? It's not as if GPT-3 would give you any more insight than that approach...


Sayre's Law: "In any dispute the intensity of feeling is inversely proportional to the value of the issues at stake."

https://en.m.wikipedia.org/wiki/Sayre%27s_law

Was there ever a less essential fiefdom than citation formats?


Some people catalog information every day, so it matters to them for practical reasons. Same with people who rely on those resources being accurately cataloged.


I'm ok with obsessing about the data.

Presentation, an order of magnitude less.

Substance >>> style.


You're missing the point. Citation formats are not a matter of style here. It's about making research results easier to find, which directly affects the quality of new research.


Cataloging format is to style as database modeling is to... well, style. It's got nothing to do with aesthetics, it's about describing the data in a way that makes it useful later.


Have you noticed that search engines suck?

Citation formats are an attempt to end up with a search engine that doesn't suck. (They can do lots of other things along the way.)

Options:

1. Accept any format and try to turn it into your own internal representation. Problems: (a) your own representation is a citation format; (b) you need to write an infinite number of converters and hire an infinite number of trained people to work out the edge cases.

2. Accept a limited number of common formats. Problem: your search engine will not be useful for a majority of the corpus.

3. Convince everyone that your new citation format is unstoppable. Problems: (a) convincing everyone; (b) actually having that citation format cover, say, 99% of the cases; (c) XKCD#927 (Standards).

Dublin Core is/was a terrible attempt at a type 3 solution.


I was involved with a project once upon a time that spent six months developing ontologies for event ticketing (not me, thank god, but another team of about eight).

The delivered ontology got used twice or maybe three times, and it didn't cover all the corner cases we knew about.


Heard a bloke introduce himself at the onboarding as a "Semantic Ontologist".

His sobriety was only exceeded by his earnestness.

I prayed he'd not become attached to my project, as he looked like the sort who could love a problem for weeks without ever reaching a conclusion.


Oh wow, having interacted with the SemWeb community quite a lot in the past couple of years,

> the sort who could love a problem for weeks without ever reaching a conclusion.

is such an accurate description of a number of people in it. Possibly the root cause of all the problems that led to it never taking off at the scale they'd envisioned.


The semantic web is a non-solution in eternal search of non-problems.


Sounds like the eternal student; they just don't work in business. Although seemingly I keep getting lumped with the useless, persistent question-askers. Sometimes I feel like I'm discussing shit with David Attenborough. Just get on with the job. End of rant.


Back when I was an IT Guy at a small marketing company, I did whatever was asked of me (I'm getting paid, so why not).

I got pretty good at data entry, and it's fairly enjoyable. You're not worried about bigger issues than reading scribbles on paper, and trying to get them entered correctly.

It is from this experience that I think they should have done the following:

Sit everyone on the committee down at terminals, and make them do card catalog data entry for a week, just typing into an unformatted text file.

After those dues are paid, then you get to participate for the year. I think contact with the data at hand would have rapidly led to a far better understanding of the required structures.


Librarians spend years studying just cataloging to be able to take a publication a library acquires, put a Dewey or LoC number on it, and shelve it, and do it all in a way that makes it possible to find it again later when someone wants it. Cataloging is an entire sub-specialty of Library and Information Science. A week of data entry would just scratch the surface.


I'm not suggesting it would give an encyclopedic knowledge, but it would at least acquaint them with the actual data and categorization systems in use in a very visceral way.

In a related way, I find myself watching many of the old IBM videos about mainframes and the punch card systems that preceded them to have a better feel for what came before.


> it would at least acquaint them with the actual data and categorization systems in use in a very visceral way.

These days, especially with people leading tech, I'd fear that exposing them in that way would only lead them to disdain it and complain that it's "ripe for disruption" and decide that two guys with an ML model and VC funding could and should replace it.

Even in this forum, look at the comments suggesting GPT-3 could replace the whole shebang. What those suggestions miss is that it would mean that instead of needing to know the ins-and-outs of a formal, if flawed, ontology, we'd have to be guessing at what the algorithm applied, forgetting that it was trained on existing metadata models and has its own metadata.


These days, you can use Wikidata and the related tooling such as Scholia. Here's a starting point to their ontology, including the difficult details such as the "page(s)" qualifier for the property "published in": https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_Me...

Scholia: https://www.wikidata.org/wiki/Wikidata:Scholia
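
A hedged example of what querying that looks like in practice, hitting the public SPARQL endpoint from Python (the P/Q identifiers are from memory, so double-check them against the project page):

    # Sketch: list a few scholarly articles with DOIs from Wikidata.
    import requests

    QUERY = """
    SELECT ?article ?articleLabel ?doi WHERE {
      ?article wdt:P31 wd:Q13442814 ;   # instance of: scholarly article (assumed ID)
               wdt:P356 ?doi .          # DOI (assumed ID)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "metadata-example/0.1 (personal experiment)"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["articleLabel"]["value"], "doi:", row["doi"]["value"])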


Dublin Core was always trying too hard to not be hard to use. It always struck me as a metadata standard for the library at an elementary school.

I was amused at how Adobe's XMP format, which amended DC, used features of RDF that most people were scared to use (ordered collections) to solve the simple problem of keeping track of the order of authors' names on a paper.

The big trouble I see with semweb standards is that they stop long before getting to the goal. RDFS and OWL, for instance, solve some problems of mashing up data from different vocabs but can't do the math to convert inches to meters, which is absolutely essential.

I was not so amused to see that Adobe has thoroughly nerfed XMP so that it no longer sucks in EXIF metadata from images. I was hoping I could find the XMP packet, put it through an RDF/XML parser, and get all the metadata from my images, but even in Lightroom images the XMP packet is a nothingburger these days. I guess people complained about the GIF files that had 2k of pixel data and 80k of XMP, and that Adobe was co-opting other standards. Heck, they even deleted most of the content from the standard, including how to express EXIF in RDF, so you can't stick XMP in your database and SPARQL it.
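
For what it's worth, the old trick still works on files that do carry a packet: grab the rdf:RDF element out of the raw bytes and hand it to an RDF/XML parser. A minimal sketch, assuming rdflib and a JPEG with an embedded packet (the file name is made up):

    # Sketch: pull an embedded XMP packet out of an image and parse it as RDF/XML.
    import re
    from rdflib import Graph

    def extract_rdf(path):
        """Return the rdf:RDF element embedded in the file's XMP packet, if any."""
        data = open(path, "rb").read()
        m = re.search(rb"<rdf:RDF.*?</rdf:RDF>", data, re.DOTALL)
        return m.group(0).decode("utf-8", errors="replace") if m else None

    packet = extract_rdf("photo.jpg")  # hypothetical file
    if packet:
        g = Graph()
        g.parse(data=packet, format="xml")  # the packet body is RDF/XML
        for s, p, o in g:
            print(p, o)

Whether anything interesting comes back, of course, depends on how much the tool that wrote the file bothered to put in there, which is the complaint above.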


Depends on what device generates the image, perhaps? DJI drones have a lot of info as EXIF XMP. I just parse that as XML.

EDIT: It's not 80k, but good enough as a GPS log.


The GIF in the story was generated by Adobe tools in the bad old days, circa 2005. I noticed the nerfing happened in the last 5 years when I looked closely at what Lightroom does.


In terms of wasted space, only 1/8th of that DJI EXIF chunk is actual XMP; the rest is padding (for 13 MB JPEGs in our case). The chunk is 8k, so about 1k of actual XMP data, of which a lot is structural. I've only tried with scripts, verifying with exiftool and a hex editor every now and then.


There are also modular approaches like CMDI [0] out there that allow for constructing a metadata profile with a corresponding schema [1]. A single metadata standard won't be able to cover all disciplines regardless: too many free-text fields that can't be properly validated. E.g. if you work with documenting minority languages, ISO language codes are vital (assuming one has been created for the language in question).

    [0]: https://www.clarin.eu/content/component-metadata
    [1]: https://www.clarin.eu/content/component-registry-documentation
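
The validation side of that is mundane but real; something like the following is all it takes to check a record against whichever profile schema you've generated (a sketch using lxml; the file names are made up):

    # Sketch: validate a metadata record against a profile's XSD.
    from lxml import etree

    schema = etree.XMLSchema(etree.parse("cmdi_profile.xsd"))   # hypothetical schema file
    record = etree.parse("record.cmdi.xml")                     # hypothetical record

    if schema.validate(record):
        print("record conforms to the profile schema")
    else:
        for err in schema.error_log:
            print(err.line, err.message)

The catch is exactly the one above: a schema can insist that a language field holds a plausible ISO code, but it can't say much about the free-text fields.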


I think one of the issues is that librarians work with things that have been made and do not change. The catalogues also traditionally did not change, but there's no reason now why this should be the case.

If we look at Wikidata or OpenStreetMap, for example, the way things are categorised changes and evolves: some things stay the same, new things get added, some things get deprecated, and bots or humans can update existing records and machines can handle any deprecations.

Far better, then, to start simple, get adoption and update records as you go, I think.


One of the particularly notable things about libraries and catalogues is how much does change.

There are on the order of 1--5 million new books published in English each year (about 300k as "traditional" publications, a fairly constant level since the 1950s), the largest collections have tens of millions of volumes, the cataloguing technology and standards have changed tremendously over the past 50 years, and the role and usage of libraries themselves is under constant flux. At the same time, budgets and staffs are small relative to the total record store.

The challenge of organising, classifying, and making accessible all human knowledge has occupied numerous great minds for centuries. We're familiar with Google's take, marrying full-text search with a document-ranking model. That worked relatively well for about 20 years. It's now showing extreme signs of stress.

At the same time, much of the past (and present) traditionally-published media is finding its way online, whether the original publishers and established libraries like it or not.

LibGen and SciHub may have far more to say about the future of cataloguing standards than the OCLC.


Oh and how those categories change and are subject to human error and bias.

Library of Congress Class D: WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.

Subclass DB: History of Austria. Liechtenstein. Hungary. Czechoslovakia

https://www.loc.gov/aba/cataloging/classification/lcco/lcco_...


Right.

If you dig into the LoC Classification what you'll find are sections, some large, some small, that have been entirely retired or superseded.

Geography is only one of those. Psychology has been huge, notably as regards human sexuality and gender. Technology and business, in which change and obsolescence are quite rapid, are another. (Both have significantly co-evolved with the development of cataloguing systems and library science itself.) Emerging domains, particularly computer science and information systems, have struggled to find where they fit within the classification. And different domains are both idiosyncratic and specific to the Library's interests. For the LoCCS, US history and geography are overrepresented relative to the rest of the world. And the entire Law classification, much of it relatively reasonably developed, has a markedly different style and character than the rest of the classification.


> US history and geography are overrepresented relative to the rest of the world.

Yeah, in the document I linked, all of Africa gets lumped into the single subclass DT. But Subclass DX is "Romanies". It was formerly "History of Gypsies", btw: https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...


On "gypsies", one of numerous nomenclature changes.


... which is to say: what mature cataloguing practices have developed is a change management process. A recognition that boundaries shift (metaphorical, physical, and political), knowledge evolves, and terminology may come to be outdated or even prejudicial.

LoCCS isn't perfect. But it has developed the self-awareness, capabilities, institutions, and processes to change and adapt without destroying itself in the process.


But why do the Romani get an entire subclass, when Africa – ALL of Africa, is also just one subclass?


Put simply:

- The classification, as all classifications do, is both current and historical, and reflects the biases of the periods over which it evolved.

- The classification serves a pragmatic purpose, as all subject-based shelving classifications do, of reflecting the need to assign volumes of a large collection of books to a single shelf location for both access and reshelving.

- Most classification systems also attempt to bucket all of human knowledge into a fixed number of buckets (20 in the case of the LoCCS), which results in compromises and awkwardness. Additional knowledge and changes in surrounding human culture tend to make earlier structures awkward, if not fully obsolete.

- In the late 19th / early 20th century, not only was there little extant literature on African history, but there was comparatively little interest at the US Library of Congress, as well as considerable cultural antipathy in the United States generally given the country's racist history and politics at the time.

(The areas of interest can be surmised by viewing the history classification: the United States itself, North and South America, Europe, with a particular focus on England and Western Europe, and generally The Rest of the World.)

All topical classifications and ontologies are artefacts of their times. The history of cataloguers, encyclopaedists, science, history, and philosophy makes this clear. There are ideas which come and go; some vanish rapidly, some establish themselves solidly for centuries or millennia. This is one of the occupational hazards of the practice.

What the US Library of Congress Classification System does have, as I've tried to express above, is both an ability to expand within the classification system (dive into the Africa section and it expands in subclasses), and a process for accommodating change. Subclassifications can be retired, and terminology changed.

It's a process, not a product.


That doesn't explain why the Romani needed an entire subclass. What was the bias there, when it was "History of Gypsies"? Was there really that much published about them at the time? As much about them alone as all of Asia, or all of Africa, or even about all of Central Europe?


I'd suggest re-reading my previous comment.

On the context of the D Class:

Class D, History, was first drafted in 1901 and after undergoing revision was published in 1916. Two supplements were published: European War (first edition, 1921; second edition, 1933; reprinted in 1954 with additions and changes), and Second World War (1947). The second edition of class D was published in 1959, containing additions and changes through June 1957. A reprint edition was published in 1966, containing a supplementary section of additions and changes through July 1965. The third edition was published in five parts between 1987 and 1990: D-DJ, DJK-DK, DL-DR, DS, and DT-DX. Subclasses DS-DX, History of Asia, Africa, Australia, New Zealand, etc., were published in a single edition in 1998, complemented by a 2001 edition of D-DR. A subsequent edition was published in 2008, followed by an edition in 2012. Annual editions have appeared beginning in 2015. This 2021 edition of DS-DX cumulates all additions and changes that have been made since the publication of the 2020 edition.

https://www.loc.gov/aba/publications/FreeLCC/LCC_DS-DX2021PR...

For the divisions within the subclass, see: pp. 334--335: https://www.loc.gov/aba/publications/FreeLCC/LCC_DS-DX2021TE...

Culturally, the Romani are distinct within Europe, having their own language and practices, and are a nomadic / diaspora group without a specific homeland, with a long history of presence among, and often frictional interactions with, numerous other cultures, nations, and states.

Given that the group is tacked on at the end of the global history section, it doesn't seem to have an exceptionally high significance within the classification.

Other than that, I don't know what the rationale is, and I've tried to convey that I don't.

Do you have a specific argument against establishing a category for this culture?


I am not sure if I agree with this article. The semantic web and linked data are built using layers of schemas and ontologies, and changing old stuff is a bad idea because links get broken. Additive systems, instead of creating new versions, will always seem a bit messy.

If you look at valuable public KGs like DBpedia and Wikidata, you will see these layers of old and older schemas. My argument that this is not such a bad thing does have a powerful refutation: the slow uptake of linked data tech and KGs.


Dublin Core has always been rough and strange, and I'm SURE actual librarians can tell me all sorts of stories about complex things, but schema.org has a fine model at first glance: https://schema.org/ScholarlyArticle. Check out the MLA example at the end and how DOIs and articles spanning multiple issues are handled, etc.
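
To make that concrete, here's roughly what a ScholarlyArticle record looks like as JSON-LD (built in Python for readability; the field values are invented and this is not the MLA example from the page):

    # Sketch: a minimal schema.org ScholarlyArticle record, values invented.
    import json

    article = {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "name": "An Example Article Title",
        "author": [{"@type": "Person", "name": "Jane Doe"}],
        "isPartOf": {
            "@type": "PublicationIssue",
            "issueNumber": "4",
            "isPartOf": {"@type": "Periodical", "name": "Journal of Examples"},
        },
        "pageStart": "101",
        "pageEnd": "118",
        "sameAs": "https://doi.org/10.1000/example",  # hypothetical DOI
    }
    print(json.dumps(article, indent=2))

The nested isPartOf chain (issue inside periodical) is what makes the multi-issue and citation cases fall out fairly naturally.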


Apropos: Ontology is Overrated -- Categories, Links, and Tags http://digitalcuration.umaine.edu/resources/shirky_ontology_...


Dublin Core reminds me of SAML. Designed as a bag of standards parts, not as a complete standard.


Was 90% sure this was going to be an article about homelessness in Dublin city centre (the real Dublin, not some imitator in Ohio :p)



