Some president guy meets a singer dude. (image via Chronophobic)
The breakdown: How does a major announcement by the New York Times to make a massive digital index available to the public change the landscape for reliable content topics and metadata? Rachel Lovinger explores why Wikipedia shouldn’t be our one-stop shop when it comes to significant events.
A few months ago the New York Times announced its intention to make its entire index available in a structured digital format. The Index was first published in bound volumes in 1913 and has grown to include over 500,000 terms that have been used to tag articles going all the way back to 1851. That’s 500,000 significant people, places, things, organizations, and concepts. To be clear, the Index includes the tagged terms, not the articles themselves.
OK, so that’s a big list of words, but why does it matter? As we move towards a more data-driven digital world, there’s a strong need for online services to have a reliable, accurate, common frame of reference that covers all the major topics, people, and things of interest. Let’s say you’re a big fan of the movie Up and you want to subscribe to a service that pulls in any news, media, and conversations about the animated movie. In order to be sure that content is related to the film, and not to one of the many other uses of the word “up,” automated services will need to use some kind of unique identifier. This can be an alphanumeric code (like an AMG ID, licensed from All Media Guide) or a URL (like http://www.imdb.com/title/tt1049413/), but it has to be something that the service and the content providers can both share.
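To make the difference concrete, here is a minimal sketch in Python of a service matching on a shared identifier versus matching on the word itself. Everything in it is an assumption made for illustration: the `Item` class, the sample headlines, and the function names are hypothetical and not part of any real service. The only thing taken from above is the IMDb URL, used here as the shared ID.

```python
from dataclasses import dataclass, field

# The shared identifier: the IMDb URL for the movie Up, as given above.
UP_MOVIE_ID = "http://www.imdb.com/title/tt1049413/"


@dataclass
class Item:
    """A hypothetical piece of content tagged by its provider."""
    headline: str
    topic_ids: set = field(default_factory=set)


def items_mentioning(items, word):
    """Naive approach: keep any item whose headline contains the word."""
    return [i for i in items if word.lower() in i.headline.lower().split()]


def items_about(items, topic_id):
    """Identifier approach: keep only items tagged with the shared ID."""
    return [i for i in items if topic_id in i.topic_ids]


# Made-up sample feed, for illustration only.
FEED = [
    Item("Pixar's Up wins best animated feature", {UP_MOVIE_ID}),
    Item("Gas prices go up again"),
    Item("Balloon house from animated film goes on tour", {UP_MOVIE_ID}),
]

by_word = items_mentioning(FEED, "up")
by_id = items_about(FEED, UP_MOVIE_ID)

for item in by_id:
    print(item.headline)
```

In this toy feed, matching on the word pulls in the gas-price story and misses the balloon-house story, while matching on the identifier returns exactly the two items about the film. That only works, of course, if the provider and the service agree on the same ID.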
Many experimental projects have tried using Wikipedia as this kind of database of knowledge. In some ways, this makes sense. If you strip out the content of the pages, you’re left with a taxonomy of nearly 3 million page names. This list of terms is well-structured, because of Wikipedia’s use of links and categories, and it covers a huge body of human knowledge.
But one could argue that Wikipedia has an unhealthy emphasis on pop culture and internet memes. How valuable are those 3 million page names when they include a huge number of topics like The Hampster Dance (an animation of rodents dancing), Chrismukkah (a blending of Christmas and Hanukkah, popularized by a TV show called The O.C.), Brfxxccxxmnpcccclllmmnprxvclmnckssqlbb11116 (a name given to a Swedish child born in 1991), More cowbell (a popular phrase from a Saturday Night Live sketch starring Christopher Walken), and nearly 500 pages devoted to the creatures of Pokémon (a media franchise about battling monsters)? And suppose you mention Elvis. Does Wikipedia know if you mean Elvis Presley, Élvis Alves Pereira, the TV miniseries, the album, the film, the TV special, the text editor, the comic strip, the character in the movie Cars, the pinball machine, the helicopter, or the other album?
The New York Times Index would offer the Web of Data another option for a structured, digital, open representation of human knowledge: one that comes from a trusted brand known for its depth and breadth of coverage, and coverage that’s been researched and fact-checked by professionals.