Structured Content Authoring for All - Intelligent Information Blog

Guest Author Jan Benedictus

Many content professionals in publishing or technical documentation are aware of the benefits of structured content. However, in the greater scheme of things, structured content authoring is only a fraction of all content authoring being done. But that is today, and things are changing. Structured content might well be the predominant form of content in the future. This post defines structured content as boxed, interpretable and connected content. It also describes why and how authoring in a structured way will eventually become mainstream.

Structured content
Structured content is planned, developed and connected outside of an interface so it’s ready for any interface. Content is related as data in a way that makes sense to people and computers. The main goal is to provide dynamic delivery and to engage users with the information they need.

Content authoring today

Although concrete data isn’t available, we do have estimates of how many people are authoring in the various formats: unstructured formats such as MS Word, semi-structured HTML, and fully structured XML. Based on our research, we arrive at the estimates below. Visualized in a pie chart, XML authoring would be a hairline segment, which shows that structured content authoring is still very much a niche area.

750 million users of Word or similar
60 million users of some form of HTML editing
500,000 users of structured XML editors

The Authors’ Key Challenge: Creating Discoverable Content

Structured content may not be of value in all circumstances, but trends in the way content is published and discovered make a compelling case for structured content authoring as the predominant writing paradigm in the foreseeable future. The main driver is the ever-growing amount of content available online, which shows no signs of slowing down. This implies that getting attention has become the key challenge for anyone who creates content. In other words, it is increasingly difficult to write something that gets noticed and valued.

Publishing and discovery tools are shaping this landscape:

1. Search engines are becoming more powerful in ranking the content that is relevant for an individual. This ranking is based not only on the explicit search query, but also on many other factors such as device, earlier reads, and a profile holding personal preferences.

2. Recommendation engines take this one step further and proactively suggest content that is likely to be relevant, again for this person at this time, without an explicit search. Examples are “related content”, content alerts or “interesting for you”. The engines driving these platforms learn from the reader’s behavior which of the recommended content he or she ignores or engages with, then train the algorithm to propose more relevant content next time.

3. Dynamic publication systems reuse and repurpose content and present it in new forms. Examples include a ‘personal publication’, ‘content dashboard’, ‘personal digest’, all dynamically generated. Repurposing implies that content is taken out of its original context and presented in a different one: “split, mix and match”.

4. Knowledge systems are an advanced form of this: they use and repurpose content to compose an answer to a question they are asked. Combining self-training algorithms with natural language processing enables systems to assemble an answer in natural language by combining fragments from various content sources. This can take the form of a chatbot or a speech-based interface. Essentially, fragments of content form the answer, without references to whole documents.

Authors no longer determine how their content is consumed

All the above trends fundamentally change the relationship between an author and the way his or her content is consumed. For starters, the route through which content is discovered and accessed by readers is no longer under the author’s influence. In traditional publishing, an author chooses the publication target, specifies the audience, and suggests related content. In modern publishing and discovery tools, however, algorithms determine which content is shown, when, and to whom. They do so by matching the information they have about the content with the profile of the reader. As an author, the only way to influence when and by whom the content is seen is to increase the level of information available to the matching algorithm.

Increasingly, the author also has no control over the form in which content is consumed. This concerns not only the visual formatting, but also the context in which content is shown. More and more, textual fragments will become the main form of content in circulation, even if they were originally written as part of a larger document. The content is not shown in its original form, but used ‘as data’ or as ‘assets’ that form part of a solution for an end user.

Structured Content: Boxed, Interpretable, and Connected

Why do future publishing and discovery tools need structured content?
And what does that mean?

Natural Language Generation and Artificial Intelligence may be able to extract information about the subject of a random piece of content, but they will not, at least in the foreseeable future, be able to know with certainty what the intention of an author is, unless the author provides that information. In structured content, every piece of content lives within a context, and is enriched with semantic information that an author adds to the content.

For instance, the sentence “In 2014 Value Added Tax percentage over computers is 21%” takes on a very different significance when used as a definition as opposed to being used as an example.

When generating an overview of VAT percentages over a range of years, a publication engine may choose to ignore the text fragment if its intended use is an ‘example’, while including it when it is a ‘definition’. To be able to treat content fragments as assets, engines need information about which parts belong together, how they are grouped or nested, and what should be considered ‘context’ for other fragments.
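The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fragment` class, the `intent` and `year` fields, and the `vat_overview` function are hypothetical names invented for this sketch, not part of any particular publication engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A content fragment plus the intent its author attached to it (hypothetical model)."""
    text: str
    intent: str  # assumed labels: "definition" or "example"
    year: int


def vat_overview(fragments):
    """Keep only fragments marked as definitions, ordered by year."""
    definitions = [f for f in fragments if f.intent == "definition"]
    return sorted(definitions, key=lambda f: f.year)


SENTENCE = "In 2014 Value Added Tax percentage over computers is 21%"

fragments = [
    Fragment(SENTENCE, intent="example", year=2014),     # used to illustrate a point
    Fragment(SENTENCE, intent="definition", year=2014),  # used as the authoritative statement
]

overview = vat_overview(fragments)
```

The two fragments carry identical text; only the author-supplied intent lets the engine tell them apart, which is the point of the example.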

Structured content – what do we mean?

All the trends above set the tone for structured content. We group them under “boxed, interpretable, and connected”.

Boxed: content is well-structured

Boxed means that a strict hierarchical structure is applied to the text. Documents do not have a ‘flow’ of text, and instead are composed of ‘boxes’. Structures are always nested and cannot overlap. Markup is used solely to define structure, not formatting. Structural markup drives the way content is formatted, but doesn’t decide on the formatting per se.

For instance, a list-item is always part of a list, and a list cannot contain any text that isn’t contained within a list-item. Any inline marking that starts within a list-item cannot extend beyond its boundaries. Publishing engines then decide how the list-items are visualized.
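As a rough sketch, the list rules above could be checked with Python's standard XML parser. The element names `list`, `item` and `em`, and the function `list_violations`, are assumptions made for this illustration; real structured-content schemas define their own vocabulary and are normally validated against a schema rather than by hand-written checks.

```python
import xml.etree.ElementTree as ET


def list_violations(xml_text):
    """Return a list of messages for content that breaks the 'boxed' list rules."""
    # Overlapping markup is not well-formed XML, so parsing raises ET.ParseError.
    root = ET.fromstring(xml_text)
    problems = []

    for lst in root.iter("list"):
        # A list may not contain text outside its list-items.
        if lst.text and lst.text.strip():
            problems.append("text directly inside <list>")
        for child in lst:
            if child.tag != "item":
                problems.append(f"<{child.tag}> directly inside <list>")
            if child.tail and child.tail.strip():
                problems.append("text between list items")

    # A list-item is always part of a list.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for item in root.iter("item"):
        parent = parent_of.get(item)
        if parent is None or parent.tag != "list":
            problems.append("<item> outside <list>")

    return problems


good = "<doc><list><item>One</item><item>Two <em>inline</em></item></list></doc>"
bad = "<doc><list>loose text<item>One</item></list><item>stray</item></doc>"

print(list_violations(good))  # []
print(list_violations(bad))   # ['text directly inside <list>', '<item> outside <list>']
```

Inline marking that starts inside a list-item and ends outside it never reaches these checks: the parser rejects it as not well-formed.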

Interpretable: content is semantically tagged, expressing ‘aboutness’

Interpretable means that information is added to any hierarchical level in the content. This ‘aboutness’ tagging gives information about the intention of a phrase, a paragraph or any other hierarchical level in the structure. Adding it is valuable if the intention cannot be clearly understood from the text itself.

Any aboutness that is tagged defines context for information on a deeper nested level. For example:

a word is marked as ‘company’;
which appears in a paragraph marked ‘licensee’;
which is itself part of a section marked ‘parties’;
all within a document marked ‘license contract’.

The context in which this company is mentioned is now easily interpreted by humans and machines.
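A minimal sketch of how a machine might read that context, assuming the aboutness is stored in a hypothetical `about` attribute and that the elements are named `document`, `section`, `paragraph` and `term` (none of these names come from a real schema):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the license contract example above.
DOC = """
<document about="license contract">
  <section about="parties">
    <paragraph about="licensee">The licensee is <term about="company">XYZ Industries</term>.</paragraph>
  </section>
</document>
"""


def context_chains(xml_text, about):
    """For every element tagged `about`, return its text and the chain of enclosing tags."""
    root = ET.fromstring(xml_text.strip())
    chains = []

    def walk(element, trail):
        label = element.get("about")
        if label:
            trail = trail + [label]  # each nesting level adds context for deeper levels
        if label == about:
            chains.append(("".join(element.itertext()).strip(), trail))
        for child in element:
            walk(child, trail)

    walk(root, [])
    return chains


print(context_chains(DOC, "company"))
# [('XYZ Industries', ['license contract', 'parties', 'licensee', 'company'])]
```

The chain reads from the outermost box inward, so a consumer can tell that this company is named as licensee among the parties of a license contract.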

Connected: semantic tagging is not free, but linked to the semantic web

Connecting semantic tagging to any form of semantic web adds a dimension to the ‘aboutness’. Connecting content to the semantic web is done by referencing semantic tags to a taxonomy. Within organizations, taxonomies are used to make content and references unambiguous by linking terms to a standard definition. Ontologies define relationships between terms and the nature of this relationship. Both taxonomies and ontologies are built up and maintained by organizations internally, or are available on the web.

For instance:

A word marked as ‘company’ gets a reference: ‘company=taxonomy/companies/company12345’.

The taxonomy defines: /company/12345=“XYZ Industries Netherlands BV, Amsterdam, Netherlands”.

The ontology adds: /company/12345 is in industry/semiconductors

Using this information, a knowledge system can now find the content when we ask for all companies in the Netherlands and/or in the semiconductor industry with whom we have a license contract, even if ‘Netherlands’, ‘semiconductors’, or ‘license’ isn’t mentioned explicitly in the text.
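The lookup can be sketched with plain Python dictionaries. Everything here is a simplification for illustration: the `TAXONOMY`, `ONTOLOGY` and `DOCUMENTS` structures and the `find_companies` function are hypothetical, and a real system would more likely query a triple store or a taxonomy service.

```python
# Taxonomy: links a reference to a standard definition (values taken from the example above).
TAXONOMY = {
    "/company/12345": {
        "name": "XYZ Industries Netherlands BV",
        "city": "Amsterdam",
        "country": "Netherlands",
    },
}

# Ontology: relationships between terms, stored as (subject, relation, object) triples.
ONTOLOGY = {
    ("/company/12345", "is in", "industry/semiconductors"),
}

# Tagged content: the document type and the company references found in it.
DOCUMENTS = [
    {"type": "license contract", "companies": ["/company/12345"]},
]


def find_companies(doc_type, country=None, industry=None):
    """Names of companies referenced in documents of `doc_type`, filtered via taxonomy and ontology."""
    names = []
    for doc in DOCUMENTS:
        if doc["type"] != doc_type:
            continue
        for ref in doc["companies"]:
            entry = TAXONOMY[ref]
            if country and entry["country"] != country:
                continue
            if industry and (ref, "is in", industry) not in ONTOLOGY:
                continue
            names.append(entry["name"])
    return names


print(find_companies("license contract", country="Netherlands",
                     industry="industry/semiconductors"))
# ['XYZ Industries Netherlands BV']
```

The document itself only carries the reference; country and industry are resolved through the taxonomy and ontology, which is why the text never needs to mention them.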

Shifting Towards Mass-ready Tools to Write Boxed, Interpretable and Connected Content

We are witnessing a shift from page formatting to ‘writing for relevance’, and that is a major change. The book “Track Changes: A Literary History of Word Processing” (2016) describes how authoring developed from the first word processor in the mid-1980s to the WYSIWYG-based dominance of MS Word in the past 15 years. The book also describes how this is currently changing.

Despite the often-heard “authors just want MS Word”, writers, especially those of the younger generations, embrace other tools for writing, editing, collaborating, and spreading content, and choose a specific tool depending on the eventual purpose. Examples are WordPress for blogging, Medium.com for direct publishing, Scrivener for creativity and organizing, and Markdown for writing documentation the way programming code is written.

This agility opens the way for structured content authoring, but habits are strong. So even if the need to write content that is boxed, interpretable and connected is evident, we will have to follow conventions that have formed in 25 years of using MS Word. There is a need for word-processing software that is based on structured content, yet respects the flexibility and interface we have gotten used to. Concretely, this means that the way content is presented, the way the cursor behaves, and the flexibility to move our text around when writing are not hampered by the structure or cluttered by the semantic tagging present ‘under the hood’.

Furthermore, adding ‘aboutness’ to text is an extra task. Experience with CMS users who are asked to add metadata tells us that this task isn’t adopted easily. Authoring tools will have to be proactive and learn to assist with this task. This means that much of the intelligence that is developed for dynamic content consumption will be integrated into the authoring process as well.

As tool developers, we are doing our best to play our part in this new way of content creation. Together with others in this field, we pursue one mission: let’s make structured content authoring easy to do for anyone.

About the Author:

Jan Benedictus is the founder of FontoXML, a web-based editor for structured content. He has worked in the structured content and online publishing world since the late 1990s and is a recognized expert on structured content, online engagement and UX design.

@JanBenedictus