How to Create and Deliver Intelligent Information

IT Information Management Optimisation

In many companies, the condition of the IT documentation can be summarised like this: lots of information is missing or simply inaccessible. The information that does exist is often ambiguous and unreliable, and anyone dealing with IR (Information Retrieval) soon becomes lost in a quagmire of unsorted information spread across all sorts of documents.

A completely new approach is needed to get on top of this “digital landfill”.

There are two basic methods available to reduce the masses of information to a usable quantity: “create” and “capture”. The company covers its need for information by either “creating” it again from scratch or by sifting through the digital landfill to find and “capture” what is needed. Analytics primarily supports the “capture” approach, which we will focus on here.

Evaluation in Capturing

Even if sufficient documents/information exist and are made available, analysing them by sequential manual processing is rarely feasible. Can analytics be the answer here? Yes – specifically by way of evaluation, which is an essential component of capturing. In the following, we present the strengths and weaknesses of three model approaches for using analytics for the purpose of evaluation.

The strengths and weaknesses of the models will be determined according to how well they answer the following question: What makes a document highly relevant, relevant or irrelevant in terms of a specific element of information? In our analysis, a document is relevant only if it contains relevant information.

In the first step, the effort involved in converting the document into elements of information is ignored. Analytics should therefore initially be limited to “evaluation” and should reduce the number of documents that then flow completely or partially into an information portal in a second, automatic/semi-automatic/manual step.

Results of the Automated Evaluation

What does the first step of automated evaluation entail? Analytics is used to automatically differentiate between potentially relevant and irrelevant information. Particularly in environments where a high number of documents has “grown historically”, we can expect significant results.

The final result will be a corpus of documents whose content will at least in part be relevant and should be transferred completely or partially into the information portal. 

In the subsequent automated/manual step, the information from the relevant documents is captured, adapted to the desired IT Information Management (ITIM) structure and reallocated. The next step entails the documents being checked by subject matter experts (SMEs) for correctness of content, and then forwarded to the responsible information manager for publication.
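
To make this workflow tangible, here is a minimal sketch of the capture pipeline in Python; the status names and data structure are our own illustration and not part of any specific ITIM tool.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Status(Enum):
    CAPTURED = auto()      # information extracted from the source document
    RESTRUCTURED = auto()  # adapted to the target ITIM structure
    SME_APPROVED = auto()  # content checked by a subject matter expert
    PUBLISHED = auto()     # released by the responsible information manager

@dataclass
class InformationElement:
    source_document: str
    content: str
    status: Status = Status.CAPTURED

def advance(element: InformationElement) -> None:
    """Move an element one step further through the capture workflow."""
    order = list(Status)
    idx = order.index(element.status)
    if idx < len(order) - 1:
        element.status = order[idx + 1]
```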

Analytics Methods

Three analytics methods can be used to check and evaluate information for specific criteria:

  1. Information Retrieval
  2. Supervised Machine Learning
  3. Unsupervised Machine Learning

Information Retrieval

Information Retrieval (IR) involves the creation of an index against which queries with various criteria can be run. A query is not just a collection of search terms, but a set of values for the criteria specified in the criteria catalogue. Priorities can be set and criteria combined in various ways, so that a document may be irrelevant in one context but relevant in another. If the criteria catalogue is changed, the index only has to be partially updated. After the evaluation, the information elements/documents are automatically ranked by hit probability.
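
As an illustration, here is a minimal sketch of criteria-based ranking, assuming scikit-learn is available; the sample documents, criteria terms and weights are invented for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Backup schedule and retention policy for the database cluster",
    "Minutes of the 2014 team outing",
    "Restore procedure after database failure, reviewed quarterly",
]

# Hypothetical entries from a criteria catalogue: each criterion
# contributes query terms, weighted according to its priority.
criteria = {"backup restore procedure": 2.0, "review quarterly": 1.0}

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Build one weighted query by repeating higher-priority terms.
query = " ".join(
    " ".join([terms] * int(weight)) for terms, weight in criteria.items()
)
query_vector = vectorizer.transform([query])

# Rank the documents by hit probability (here: cosine similarity).
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.2f}  {doc}")
```

Note that the output is a ranking, not a classification – the cut-off between “relevant” and “irrelevant” still has to be drawn manually, as described under the weaknesses below.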

Strengths:

  • Flexibility: This method can easily be adapted if documents are added, requirements change or new findings are made in terms of criteria
  • Scalability: Queries can be defined for all documents or tailored to individual subsets
  • Ease of creation: No training necessary, and transferring the criteria into appropriate representations is comparatively easy
  • Transferability: Engine and ranking algorithms can be applied to various indices, provided the criteria match

Weaknesses:

  • No clear response: There is no classification into “relevant” or “irrelevant”. In other words: a cut-off line must be drawn in the ranked results, based on defined guidelines or at the discretion of the SME.
  • Application workload: The appropriate query must be developed – depending not only on content, but also structure in some cases; this requires expert knowledge for an appropriate weighting of the criteria.

Supervised Machine Learning

For the evaluation of documents with the help of “Supervised Machine Learning”, some of the documents are manually assigned to predefined categories, forming a training corpus. A threshold between the categories is then calculated, and all other documents are assigned automatically. A significance level can be calculated for each assignment.

Should the catalogue of criteria or the definition of a true positive change, the threshold must be recalculated. Rule of thumb: high variance in the analysed documents has a negative impact on precision. Subdividing the corpus into subcorpora can counteract that effect; however, this subdivision requires a lot of effort and increases the risk of “overfitting”.
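
A minimal sketch of this approach, assuming scikit-learn; the tiny training corpus and its labels are invented here and would in practice come from manual SME categorisation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Training corpus categorised manually by SMEs (1 = relevant, 0 = irrelevant).
training_docs = [
    "Current restore procedure for the production database",
    "Firewall configuration, last reviewed this quarter",
    "Invitation to the 2015 summer party",
    "Outdated network diagram, owner unknown",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(training_docs)
model = LogisticRegression().fit(X_train, labels)

# All remaining documents are then categorised automatically; the predicted
# probability serves as the significance level of each assignment.
new_docs = ["Quarterly review checklist for backup jobs"]
X_new = vectorizer.transform(new_docs)
for doc, prob in zip(new_docs, model.predict_proba(X_new)[:, 1]):
    print(f"P(relevant) = {prob:.2f}: {doc}")
```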

Strengths:

  • Clear classifications: Exact limits and significance levels
  • No application workload: The entire corpus is categorised automatically
  • Approach optimisation: Irrelevant criteria are quite easily detected
  • Overview of the inventory: Criteria can also be viewed individually, e.g. “70% of the documents are checked too rarely to be reliable.”

Weaknesses:

  • Creation workload: A training corpus must be collated (expert knowledge required!) and then categorised manually by SMEs. 
  • Corpus-specific: The threshold must be recalculated individually for each corpus and after any major change to the corpus/definition of true positive; only individual elements may be transferable
  • CPU-intensive: ML (Machine Learning) requires a lot of processing power

Unsupervised Machine Learning

For analysis with the help of “Unsupervised Machine Learning”, a way must be found to teach the computer to recognise the difference between relevant and irrelevant documents on its own. The criteria from the criteria catalogue serve as points of reference, so a representation suitable for ML must be found for each criterion. Examples: What do relevant documents have in common? Are they reviewed with similar frequency? Do they contain no personal contact information?…
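
As a minimal sketch, assuming scikit-learn and NumPy, each criterion might be turned into a numeric feature per document and the documents then clustered; all feature values below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per document, one column per criterion representation:
# [months since last review, personal contact details found]
features = np.array([
    [3, 0],    # recently reviewed, no personal contacts
    [2, 1],
    [48, 5],   # stale and full of personal contact information
    [60, 4],
])

# Scale the features so that no single criterion dominates the
# distance measure used for clustering.
X = StandardScaler().fit_transform(features)

# Let the algorithm find groups on its own; the resulting clusters
# still have to be interpreted manually (which group is "relevant"?).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 1 1]
```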

Strengths:

  • No training required: Unlike Supervised Machine Learning, no training corpus has to be collated and manually categorised
  • Precise classifications: Differentiation according to level/type/reason/area of relevance…

Weaknesses:

  • High creation workload: The development of an appropriate representation of the relevance criteria is complex, time-consuming and requires experience & expert knowledge
  • High interpretation workload: The recognised groups must be interpreted manually and will change with each run
  • No transferability: Categorisation thresholds and interpretation cannot be transferred to other corpora
  • CPU usage: Requires even more computing power than Supervised Machine Learning

Conclusion

The use of analytics for the evaluation of documents in capturing is recommended whenever there is a high number of documents to review – and particularly when documentation has “grown historically” without maintenance and has therefore become unmanageable. The prerequisite is always that sufficient documents are available at the time of analysis. In order to avoid nasty surprises, companies should always check whether one of the following cases applies before implementing analytics:  

  • Low number of documents
  • Low number of documents, with those available spread across multiple languages
  • Mix of various document types: text, table, diagram, audio, video

If one of these cases applies, reliable thresholds for the computer-aided analysis cannot be calculated, no matter which analytics method is selected.

The conclusion after comparing the three approaches: Supervised Machine Learning is in many cases the most suitable one. Its clear classifications and the comparatively low level of expert knowledge needed during project implementation make it preferable to IR (Information Retrieval), and the effort of collating a training corpus is still significantly lower than that of checking all documents manually.

 


Author: avato (Isabell Bachmann)
