The Mess of Bibliographic Metadata

For sale: author, title, and description

Dec 19, 2024

You walk into a bookstore and pick up a book that catches your eye. You like the cover art, or maybe you recognize the author or title. If you're a real bookhead, you'll pick up a book because of who published it. You notice the book is blurbed by one of your favorite writers, and then you read the description on the back. You're hooked. What's the price? It's right there, next to the ISBN.

These bits of information (author, title, cover art, description, ISBN, list price, etc.) are considered bibliographic metadata. You can freely access this information on the physical copy of the book because the publisher prints it there. It's in their interest to share this information, since it helps them sell the book.

This bibliographic metadata isn't as freely shared online. Nope, if you want to use this metadata for your bookstore's e-commerce site, you'll have to pay companies like Ingram Book Group, Nielsen BookData, or Bowker for a data license. Or you can pay a programmer to ingest the data feeds from the publishers, but in that case you might be paying money only to miss a lot of books. Alternatively, ISBNdb scrapes this metadata from online retailers and resells it, presumably including some of the data from the other providers, and their data is bad. There are more data providers, but these are the bigger ones.

The license terms stipulate how you can use the data. You must expunge their data from your system once you stop paying them. They might prevent you from using the data in the marketing emails where you promote new releases, and you can't always use it in your store's Christmas catalog either. That's right, you're technically not allowed to copy/paste a book's description from Edelweiss into your email newsletter, according to their terms. Whoops.

It's a mess!

From a business perspective, it makes sense that these data providers resell the publishers' data. They noticed an opportunity, and now they're profiting from it. You're paying for what it costs to source and maintain the database, and they claim to clean and curate the data. Ingram sells "enhanced" metadata, and they have an extreme edge: they are the primary book distributor in the United States, arguably a monopoly, and they require a publisher to give them metadata if the publisher wishes to distribute with them (which, as we’ve seen with the fall of Small Press Distribution, is about the only option).

For a bookstore in an industry with thin margins, this can be a small tragedy. It's possible, and probable, that you're breaking these terms when you create your marketing materials. And for a programmer who is trying to build software in this space with limited funds, this feels like a sour situation. Take this comment on Hacker News:

I'm happy to pay. Ingram content provides potentially suitable web service for consumption, but if you're not a library or physical book retailer access to their metadata costs at minimum $500/month with very strict usage terms.

Not much room for bootstrapped innovation there!

If you build an application that uses this data, all of these providers charge a flat monthly fee plus a fee for each of your application's customers. This means that if you have an IndieCommerce e-commerce website, or a store on Bookshop.org, then you're an indirect customer of Ingram's Data Services. But there is no way in hell that Bookshop.org is paying the fee for each store that uses their platform, as this would be a very high cost for them. And Nielsen BookData, another data supplier, uses Ingram's data feed for parts of their database, and I doubt they pay under that pricing structure either. My suspicion is that these companies used their insider status to get better terms, which is good for them, but this isn't possible for a small business like mine, an outsider building software for independent bookstores.
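To make the arithmetic concrete, here is a rough sketch of how a flat-plus-per-customer license scales. The $500 flat fee is borrowed from the Hacker News comment above (where it's quoted as a minimum); the $25 per-customer fee and the names `LicenseTerms` and `monthly_cost` are made up for illustration, not any provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Hypothetical data-license terms, in whole dollars per month."""
    flat_monthly_fee: int   # charged regardless of how many stores you serve
    per_customer_fee: int   # charged for each store using your application


def monthly_cost(terms: LicenseTerms, customers: int) -> int:
    """Total monthly license cost for an application with `customers` stores."""
    if customers < 0:
        raise ValueError("customers must be non-negative")
    return terms.flat_monthly_fee + terms.per_customer_fee * customers


# Assumed numbers: $500 flat (from the comment above), $25 per store (invented).
terms = LicenseTerms(flat_monthly_fee=500, per_customer_fee=25)

if __name__ == "__main__":
    for stores in (0, 10, 100, 1000):
        print(f"{stores:>5} stores: ${monthly_cost(terms, stores):,}/month")
```

Under these assumed numbers, a platform serving a thousand stores would owe far more in per-store fees than in the flat fee, which is why it seems unlikely that a large platform pays on these terms.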

And then there's the reality that the metadata is sometimes bad, thanks to the publishers who create it (I don’t have a ton of insight into this, and I'd be curious to hear more if you work in publishing). I heard a podcast episode (start at 7:45) where a writer with a forthcoming book said that Edelweiss had the wrong biography for them.

For context about Edelweiss:

This ^ tweet originated from this discussion:

Like I said, a mess.

There's a parallel with libraries, which solved the metadata problem long before the book industry. I could really get into the weeds here but I'll keep it simple (see MARC, FRBR, Z39.50, and ONIX). OCLC started in 1967 to centralize bibliographic metadata, since its founders realized it would be better for libraries to have all of this data in one place rather than have each library painstakingly create the bibliographic data by hand. This led to an amazing repository of bibliographic data about everything in a library's collection. Libraries would submit their bibliographic records to OCLC, which then turned around and resold this data back to the libraries.

Over the years, OCLC has become more protective of this metadata. If you'd like more background, the article "Let the Metadata Wars Begin" is a good place to start. Also, check out this blog post by the late Aaron Swartz.

The tension over access feels more prominent between libraries and OCLC than between bookstores and their data providers, because libraries are essentially owned by the community, with the goal of providing open access to their holdings. Not to mention, OCLC’s data was created by librarians. The book trade is different because bookstores are mostly privately owned, profit-driven enterprises, and their number one goal is to stay in business by selling books (as a means to fulfill other missions like promoting literature, providing a community space, supporting writers, etc.).

Open Library has attempted to free this data, but two recent lawsuits might put this in jeopardy: OCLC v. Clarivate and Hachette v. Internet Archive. Post45 Data Collective is another interesting project in this space. There are a lot of other projects like this, and most of them are based in academia, where open data is more existential than it is in the book trade. And don't forget the pirates, like Anna's Archive (if you're reading this Anna, you can buy books at Bookshop.org).

To bring it back to Ingram and the book trade, the situation with Small Press Distribution makes me worry about how Ingram might treat small businesses that use their data services, like a small independent bookstore or a software company. And if you consider how the internet is becoming more closed, with fights over copyright thanks to AI companies, then it's not far-fetched to think that bibliographic metadata will become another casualty.

Reclaiming the data could be an interesting strategy for somebody who wanted to take on Ingram in the book business. But how could a business position itself so that the publishers who rely on Ingram would submit their data feeds to other providers? What would happen if Ingram's terms of service disallowed publishers from doing that? What if the publishers shared the data with an open data license? Do authors have any control over how their book metadata is used?

bookhead life
