Content Modularity: More Than Just Data Normalization

Guest Author, October 21st, 2009

This guest post comes from Daniel Jacobson, Director of Application Development for NPR. Daniel leads NPR’s content management solutions, is the creator of the NPR API and is a frequent contributor to the Inside NPR.org blog.

As discussed in my previous post, COPE (Create Once, Publish Everywhere) is a fundamental philosophy that drives NPR’s digital publishing and distribution strategy and is the foundation of the NPR API. Supporting it all is a single system that manages all incoming content and funnels it out through a single distribution pipe, regardless of content type or destination. A key principle that supports COPE is ensuring that content is stored in a modular way.

Modular storage of content is more than just database normalization. It requires strategic design of the data model to ensure that discrete objects are stored in distinct locations. To create the right design, you must truly understand your system and the assets that it stores. That is, you need to be able to identify and represent the object (or series of objects) that is at the core of your system. For NPR, the core of the system is a story. We then attach “resources” to the story, each of which is its own object in the database (examples of resources include full text with each paragraph stored as a distinct record, audio, video, images, related links, and a range of other object types). Then stories get attached to lists, which are essentially a series of taxonomies that help our systems slice through the stories.

[Figure: NPR entity diagram]

The diagram above is a basic entity diagram of how NPR manages data for a story, some related resources and the list to which the stories are assigned. This is a conceptual model that represents how these entities relate to each other and does not include all resource or list entities in the system. The physical model, obviously, is much more complex. Click here for a larger and more complete view of this diagram (PDF).
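
As a rough illustration of the story/resources/lists breakdown, here is a minimal Python sketch. The class and field names (`Resource`, `Story`, `StoryList`, `kind`, `payload`) are assumptions made for this example and are not NPR's actual schema; the point is only that each resource, including each paragraph of text, is its own object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """Anything attached to a story; each resource is its own object."""
    resource_id: int
    kind: str      # hypothetical values: "paragraph", "image", "audio", "video"
    payload: str   # text for a paragraph, a URL for media (illustrative only)


@dataclass
class Story:
    """The core object of the system."""
    story_id: int
    title: str
    resources: List[Resource] = field(default_factory=list)


@dataclass
class StoryList:
    """A taxonomy entry (for example a topic) that groups stories."""
    list_id: int
    name: str
    stories: List[Story] = field(default_factory=list)


story = Story(1, "Example headline")
story.resources.append(Resource(10, "paragraph", "First paragraph of text."))
story.resources.append(Resource(11, "paragraph", "Second paragraph of text."))
story.resources.append(Resource(12, "image", "https://example.org/photo.jpg"))

news = StoryList(100, "News", [story])

# Paragraphs are separate records, so the full text is assembled on demand.
full_text = "\n\n".join(
    r.payload for r in story.resources if r.kind == "paragraph"
)
print(full_text)
```

Because the image is a sibling of the paragraphs rather than markup inside them, the text stays clean and the image can be handled on its own.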

NPR’s system is much more complicated than this, but the breakdown of story/resources/lists is the foundation of it all. Accordingly, storage of this information in the database needs to ensure that all of these objects can be manipulated independently. With this approach, NPR is able to create a list of all images in the system, or all stories that have video, or all stories in the News topic, or any number of other combinations of stories or resources. The power of this modularity is that we have tremendous control over what gets distributed to each destination. And content for all of these scenarios is distributed through the same simple REST-based API, requiring no special coding to generate the content for the different destinations.
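
To make the "manipulated independently" point concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`stories`, `resources`, `lists`, `story_lists`) and the sample rows are hypothetical, chosen only to mirror the conceptual model; they are not NPR's physical schema. Each of the three example queries from the paragraph above becomes a simple filter or join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stories     (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE resources   (id INTEGER PRIMARY KEY, story_id INTEGER,
                              kind TEXT, payload TEXT);
    CREATE TABLE lists       (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE story_lists (story_id INTEGER, list_id INTEGER);

    INSERT INTO stories VALUES (1, 'Story A'), (2, 'Story B');
    INSERT INTO resources VALUES
        (10, 1, 'image', 'a.jpg'),
        (11, 1, 'video', 'a.mp4'),
        (12, 2, 'image', 'b.jpg');
    INSERT INTO lists VALUES (100, 'News');
    INSERT INTO story_lists VALUES (2, 100);
""")

# All images in the system, regardless of which story they belong to.
all_images = [row[0] for row in conn.execute(
    "SELECT payload FROM resources WHERE kind = 'image' ORDER BY id")]

# All stories that have at least one video resource.
stories_with_video = [row[0] for row in conn.execute(
    "SELECT DISTINCT s.title FROM stories s "
    "JOIN resources r ON r.story_id = s.id "
    "WHERE r.kind = 'video' ORDER BY s.id")]

# All stories assigned to the News list.
news_stories = [row[0] for row in conn.execute(
    "SELECT s.title FROM stories s "
    "JOIN story_lists sl ON sl.story_id = s.id "
    "JOIN lists l ON l.id = sl.list_id "
    "WHERE l.name = 'News' ORDER BY s.id")]

print(all_images, stories_with_video, news_stories)
```

None of these queries has to look inside a block of story text, which is what keeps them simple.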

[Figure: excerpt of XML output from the NPR API]

The above is an excerpt of XML output from the NPR API. Clean, effective storage of the content makes it a simpler and more flexible process to manage it differently as it gets distributed to different destinations. Click here to see an expanded view of the XML with annotation detailing how it maps to the entity diagram.
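
The sketch below shows, under simplified assumptions, how one modular story could be rendered differently for two destinations. The element names, the dictionary layout, and the functions `to_xml` and `to_plain_text` are invented for this example; they are not NPRML or the NPR API's actual output format.

```python
import xml.etree.ElementTree as ET

# A story held as separate resources (hypothetical structure).
story = {
    "id": 1,
    "title": "Example headline",
    "resources": [
        {"kind": "paragraph", "value": "First paragraph."},
        {"kind": "image", "value": "https://example.org/photo.jpg"},
        {"kind": "paragraph", "value": "Second paragraph."},
    ],
}


def to_xml(story):
    """Render every resource as its own XML element."""
    root = ET.Element("story", id=str(story["id"]))
    ET.SubElement(root, "title").text = story["title"]
    for res in story["resources"]:
        ET.SubElement(root, res["kind"]).text = res["value"]
    return ET.tostring(root, encoding="unicode")


def to_plain_text(story):
    """Render a text-only destination: title plus paragraphs, no media."""
    paragraphs = [r["value"] for r in story["resources"]
                  if r["kind"] == "paragraph"]
    return story["title"] + "\n\n" + "\n\n".join(paragraphs)


xml_output = to_xml(story)
text_output = to_plain_text(story)
print(xml_output)
print(text_output)
```

The text-only rendering needs no cleanup step, because the image was never mixed into the paragraphs in the first place.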

Conversely, WPTs (Web Publishing Tools) tend to store objects in a way that enables the building of a web page. As a result, the content may be bundled together in database fields, with the actual references to images, video and audio stored entirely within the story’s content text. It is still possible that WPTs adhere to some form of data normalization in their storage techniques, but that does not mean that these systems embrace COPE.

There are two significant problems with the WPT approach to data storage. First, the image references within the block of text will contain HTML and possibly other markup, making the text block dirty. Any distribution to other platforms could then require special treatment to prepare the content for that destination. More important, however, is the fact that these same images are very difficult to repurpose because they are embedded in text. So, it would be quite a challenge to make a feed of images, to identify only those posts that contain images, to resize some or all images in the system, or to consistently restrict distribution of images whose rights have not been cleared.
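
For contrast, here is a sketch of what image handling looks like when references are embedded in the story text. The `body` string and the `ImageFinder` class are made up for this example; it uses Python's standard `html.parser` module. Simply listing the images means parsing markup out of every story body.

```python
from html.parser import HTMLParser

# A story body as a WPT might store it: text and image markup in one field.
body = (
    '<p>Opening paragraph.</p>'
    '<img src="a.jpg" alt="Photo A">'
    '<p>Closing paragraph.</p>'
)


class ImageFinder(HTMLParser):
    """Collect the src attribute of every <img> tag in a blob of HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.extend(
                value for name, value in attrs if name == "src"
            )


finder = ImageFinder()
finder.feed(body)
print(finder.sources)
```

Even after this parsing step, all that comes back is a file reference. Anything else about the image, such as whether its rights have been cleared, would have to be tracked somewhere outside the text block.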

Building systems that manage content in a modular way and separate it from display sets that content up well to be distributed on a range of platforms. The final piece of the puzzle, however, is content portability. Content portability ensures that the content can actually live and thrive in all platforms to which it gets distributed (even those that do not yet exist). Building a distribution channel, like an API, is simply not enough anymore. Content portability must be applied at the CMS level, which will be the topic of my next article.


16 Responses to “Content Modularity: More Than Just Data Normalization”

October 21st, 2009
at 1:54 pm
Comment by: COPE: Create Once, Publish Everywhere

[...] to our changing landscape, they will need to focus more on content modularity and portability. In my next post, I will go into more detail about NPR’s approach to content modularity and why our approach is [...]

October 22nd, 2009
at 11:24 pm
Comment by: Bookmarks for October 22nd < fugaz

[...] Content Modularity: More Than Just Data Normalization – Modular storage of content is more than just database normalization. It requires strategic design of the data model to ensure that discreet objects are stored in distinct locations. To create the right design, you must truly understand your system and the assets that it stores. That is, you need to be able to identify and represent the object (or series of objects) that is at the core of your system. For NPR, the core of the system is a story. We then attach “resources” to the story, each of which is its own object in the database (examples of resources include full text with each paragraph stored as distinct records, audio, video, images, related links, and a range of other object types). Then stories get attached to lists, which are essentially a series of taxonomies that help our systems slice through the stories. [...]

October 30th, 2009
at 12:02 am
Comment by: From Older Entrepreneurs to the Do’s and Don’ts of Effective Web Design « The Product Guy

[...] http://blog.programmableweb.com/2009/10/21/content-modularity-more-than-just-data-normalization/ On the importance of content modularity to Modular Innovation. [...]

November 3rd, 2009
at 3:07 pm
Comment by: Mark Kennedy

I love to see what NPR is doing. One quibble with the article: the acronym “WPT” is referenced several times, but never defined. A search on your site returns no hits for that term. I’m guessing “Web Publishing Tool?”

November 5th, 2009
at 10:20 am
Comment by: Justin Cormack

Good set of articles, although I think this extreme normalization is too much for most applications, see my response:

http://blog.technologyofcontent.com/2009/11/id-love-to-stay-here-and-be-normal-but-its-just-so-overrated-or-how-i-learned-to-stop-worrying-and-love-html/

November 5th, 2009
at 10:49 am
Comment by: Daniel Jacobson

Mark,
Thanks for the comment. This article is actually the second in a three-part series. WPT is defined in the first article, which can be found at http://blog.programmableweb.com/2009/10/13/cope-create-once-publish-everywhere/.

You are correct, though, that in my previous article, I make the distinction between WPT (Web Publishing Tool) and CMS, where a WPT focuses on publishing content to a single platform and a CMS focuses on platform-agnostic content management.

November 9th, 2009
at 5:54 pm
Comment by: Rahul Dev

Daniel, thanks for providing your invaluable insights via the article series.

I was wondering if you have considered the standard published by IDEAlliance available here

http://www.idealliance.org/industry_resources/intelligent_content_informed_workflow

I found that PRISM metadata standard and PAM format are similar in concept to NPRML.

November 10th, 2009
at 2:24 pm
Comment by: Daniel Jacobson

Rahul,
Thanks for the comment. We haven’t focused on PRISM yet, but it is something that we are aware of. Right now, we are actively planning to incorporate PBCore as an output, which is also partially based on the Dublin Core standard. As more people adopt PRISM, however, and as those people request it from the NPR API, we will certainly consider adding it as an output type.

November 12th, 2009
at 3:58 pm
Comment by: Unbundling the Magazine « Ergo McHenceforth

[...] "purified" asset repositories, free from platform-specific formatting, thereby achieving content modularity.  Furthermore, they need to imbue those assets with a rich set of meta-data/context in order [...]

November 12th, 2009
at 4:07 pm
Comment by: Unbundling the Magazine « Ergo McHenceforth

[...] asset repositories, free from platform-specific formatting, thereby achieving content modularity.  Furthermore, they need to imbue those assets with a rich set of meta-data/context in order [...]

November 12th, 2009
at 4:49 pm
Comment by: Unbundling the Magazine « Ergo McHenceforth

[...] asset repositories, free from platform-specific formatting, thereby achieving content modularity.  Furthermore, they need to imbue those assets with a rich set of meta-data/context in order to [...]

November 12th, 2009
at 6:09 pm
Comment by: The NPR Model Is Correct | Tech Startups

[...] Create Once, Publish Everywhere Content Modularity: More Than Just Data Normalization Content Portability: Building an API is Not [...]

December 7th, 2009
at 4:36 pm
Comment by: fuzzyjay sent a spelling edit. | gooseGrade

[...] model to ensure that discrete objects are stored in distinct locations. Status: Pending http://blog.programmableweb.com/2009/10/21/content-modularity-more-than-just-data-normalization/ [...]

March 20th, 2010
at 11:41 pm
Comment by: Daniel Jacobson's Blog » Content Modularity: More Than Just Data Normalization

[...] This post first appeared on ProgrammableWeb.com [...]

April 22nd, 2010
at 12:03 pm
Comment by: We’re in the information business

[...] story consists of a narrative and a bunch of assets or resources: photographs, infoboxes, rating cards. What are yours? These are entities [...]

April 18th, 2011
at 11:27 am
Comment by: What We Did Wrong: NPR Improves its API Architecture

[...] easiest to just use XML throughout the entire API architecture. Our API starts with a NO-SQL like XML repository that closely resembled the data schema in our database. We then used this XML throughout the entire process, creating a super XML document [...]
