IT Information Management Optimisation
In many companies, the condition of the IT documentation can be summarised like this: much of the information is missing or simply inaccessible. The information that does exist is often ambiguous and unreliable, and anyone attempting Information Retrieval (IR) soon becomes lost in a quagmire of unsorted documents of all kinds.
A completely new approach is needed to get on top of this “digital landfill”.
There are two basic methods available to reduce the masses of information to a usable quantity: “create” and “capture”. The company covers its need for information by either “creating” it again from scratch or by sifting through the digital landfill to find and “capture” what is needed. Analytics primarily supports the “capture” approach, which we will focus on here.
Evaluation in Capturing
Even if sufficient documents/information exist and are made available, an analysis via sequential manual processing is rarely feasible. Can analytics be the answer here? Yes – specifically by way of evaluation, which is an essential component of capturing. In the following, we present the strengths and weaknesses of three model approaches for using analytics for the purpose of evaluation: Information Retrieval, Supervised Machine Learning and Unsupervised Machine Learning.
The strengths and weaknesses of the models will be determined according to how well they answer the following question: What makes a document highly relevant, relevant or irrelevant in terms of a specific element of information? In our analysis, a document is relevant only if it contains relevant information.
In the first step, the effort involved in converting the document into elements of information is ignored. Analytics should therefore initially be limited to “evaluation” and should reduce the number of documents that then flow completely or partially into an information portal in a second, automatic/semi-automatic/manual step.
Results of the Automated Evaluation
What does the first step of automated evaluation entail? Analytics is used to automatically differentiate between potentially relevant and irrelevant information. In environments with a “historically grown” high number of documents in particular, we can expect significant results.
The final result will be a corpus of documents whose content will at least in part be relevant and should be transferred completely or partially into the information portal.
In the subsequent automated/manual step, the information from the relevant documents is captured, adapted to the desired IT Information Management (ITIM) structure and reallocated. Subject matter experts (SMEs) then check the documents for correctness of content and forward them to the responsible information manager for publication.
Three analytics methods can be used to check and evaluate information for specific criteria:
- Information Retrieval
- Supervised Machine Learning
- Unsupervised Machine Learning
Information Retrieval
Information Retrieval (IR) involves creating an index against which queries with various criteria can be run. A query is not just a collection of search terms, but a set of values for the criteria specified in the criteria catalogue. Criteria can be prioritised and combined differently, so that a document may be irrelevant in one context but relevant in another. If the criteria catalogue changes, the index only has to be partially updated. After the check, the information elements/documents are automatically sorted by hit probability.
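The index-and-query idea can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the sample documents, the criteria catalogues and the scoring rule (weighted term frequency divided by document length) are all assumptions made for the example. The article itself does not specify a ranking function.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index (term -> {doc_id: count}) and record document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for term, count in Counter(tokens).items():
            index[term][doc_id] = count
    return index, lengths


def rank(index, lengths, criteria):
    """Score documents against weighted criteria terms; best match first."""
    scores = defaultdict(float)
    for term, weight in criteria.items():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += weight * count / lengths[doc_id]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical corpus and criteria catalogues, for illustration only.
documents = {
    "runbook": "Restart procedure for the backup server and backup schedule",
    "minutes": "Meeting minutes about the summer party and catering",
    "howto": "How to restore a database backup on the server",
}
backup_criteria = {"backup": 2.0, "restore": 1.0, "server": 0.5}
party_criteria = {"party": 1.0}

index, lengths = build_index(documents)
backup_ranking = rank(index, lengths, backup_criteria)
party_ranking = rank(index, lengths, party_criteria)
```

The same index serves both queries: "minutes" does not appear at all in `backup_ranking` but leads `party_ranking`, which mirrors the point that relevance depends on the context set by the criteria. Changing a catalogue only changes the query, not the index.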
Supervised Machine Learning
For the evaluation of documents with the help of “Supervised Machine Learning”, some of the documents are manually assigned to predefined categories. A threshold between the categories is then calculated and all other documents are automatically assigned. A significance level can be calculated for this assignment.
Should the catalogue of criteria or the definition of a true positive change, the threshold must be recalculated. Rule of thumb: High variance in the analysed documents will impact negatively on precision. A subdivision into subcorpora can counteract that effect; this subdivision, however, requires a lot of effort and increases the risk of “overfitting”.
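As a rough sketch of the procedure described above, the following Python example labels a handful of documents by hand, derives a threshold from them and then assigns unlabelled documents automatically. The single feature (share of keyword tokens), the keyword set, the sample texts and the midpoint rule for the threshold are all assumptions chosen for brevity; a real project would typically use a proper classifier and many more labelled documents.

```python
import re


def keyword_score(text, keywords):
    """Fraction of the text's tokens that belong to the keyword set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in keywords for token in tokens) / len(tokens)


def fit_threshold(labelled, keywords):
    """Place the threshold midway between the mean scores of the two categories."""
    relevant = [keyword_score(text, keywords) for text, label in labelled if label]
    irrelevant = [keyword_score(text, keywords) for text, label in labelled if not label]
    if not relevant or not irrelevant:
        raise ValueError("need at least one labelled example per category")
    return (sum(relevant) / len(relevant) + sum(irrelevant) / len(irrelevant)) / 2


def classify(text, keywords, threshold):
    """Return (is_relevant, margin); the margin is the distance to the threshold."""
    score = keyword_score(text, keywords)
    return score >= threshold, abs(score - threshold)


# Hypothetical keyword set and manually labelled training documents.
keywords = {"server", "backup", "restore", "firewall"}
labelled = [
    ("Backup and restore steps for the mail server", True),
    ("Firewall rules for the backup server", True),
    ("Lunch menu for the canteen this week", False),
    ("Parking rules near the server room entrance", False),
]

threshold = fit_threshold(labelled, keywords)
is_relevant, margin = classify("Restore the server from the nightly backup", keywords, threshold)
is_relevant_2, margin_2 = classify("Agenda for the team outing", keywords, threshold)
```

The margin is only a crude stand-in for the significance of an assignment. Note that `fit_threshold` has to be rerun whenever the keyword set or the labels change, which corresponds to the recalculation mentioned above.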
Unsupervised Machine Learning
For analysis with the help of “Unsupervised Machine Learning”, a way must be found to teach the computer to recognise the difference between relevant and irrelevant documents itself. The criteria from the criteria catalogue should serve as points of reference. A representation that enables ML must therefore be found for each criterion. Examples: What do relevant documents have in common? Are they reviewed with similar frequency? Do they contain no personal contact information?…
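To make the idea of "a representation for each criterion" more concrete, here is a minimal Python sketch that turns two of the example questions into numbers (review frequency and presence of personal contact data) and groups the documents with a plain two-cluster k-means. The `reviews_per_year` field, the e-mail pattern used to detect contact data and the sample documents are hypothetical, and k-means is just one possible clustering method, not one named in the article.

```python
import re

# Simplified e-mail pattern, used here as a proxy for "personal contact information".
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def features(doc):
    """Represent two criteria numerically: review frequency and contact data (0 or 1)."""
    has_contact = 1.0 if EMAIL.search(doc["text"]) else 0.0
    return (float(doc["reviews_per_year"]), has_contact)


def distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    return tuple(sum(component) / len(points) for component in zip(*points))


def two_means(points, iterations=10):
    """k-means with k=2, started from the first point and the point farthest from it."""
    centres = [points[0], max(points, key=lambda p: distance(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min((0, 1), key=lambda k: distance(p, centres[k])) for p in points]
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centres[k] = mean(members)
    return labels


# Hypothetical documents with an assumed review-frequency metadata field.
docs = [
    {"text": "Backup runbook, reviewed by operations", "reviews_per_year": 4},
    {"text": "Firewall change procedure", "reviews_per_year": 3},
    {"text": "Old notes, ask bob@example.com", "reviews_per_year": 0},
    {"text": "Draft from 2009, contact alice@example.com", "reviews_per_year": 0},
]

labels = two_means([features(doc) for doc in docs])
```

The clustering only separates the documents into two groups. Which group counts as relevant still has to be decided by a person, and the features are left unscaled here, which would need attention with real data.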
The use of analytics for the evaluation of documents in capturing is recommended whenever there is a high number of documents to review – and particularly when documentation has “grown historically” without maintenance and has therefore become unmanageable. The prerequisite is always that sufficient documents are available at the time of analysis. In order to avoid nasty surprises, companies should always check whether one of the following cases applies before implementing analytics:
- Low number of documents
- Low number of documents and the ones available are in multiple languages
- Mix of various document types: text, table, diagram, audio, video
If one of these cases applies, the thresholds for computer-aided analysis cannot be reliably calculated, and it becomes difficult to select the appropriate analytics method.
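The three cases listed above can be checked before a project starts. The following Python sketch assumes that each document carries `language` and `type` metadata and uses an arbitrary minimum of 500 documents; both the field names and the limit are illustrative assumptions, not values from the article.

```python
def preflight_warnings(documents, min_documents=500):
    """Return warnings for corpus conditions that make threshold calculation unreliable."""
    warnings = []
    if len(documents) < min_documents:
        warnings.append("low number of documents")
        # Multiple languages are flagged only in combination with a small corpus.
        languages = {doc["language"] for doc in documents}
        if len(languages) > 1:
            warnings.append("few documents spread over multiple languages")
    types = {doc["type"] for doc in documents}
    if len(types) > 1:
        warnings.append("mix of document types")
    return warnings


# Hypothetical corpora: one small and mixed, one larger and uniform.
small_mixed = [
    {"language": "de", "type": "text"},
    {"language": "en", "type": "video"},
    {"language": "en", "type": "table"},
]
large_uniform = [{"language": "en", "type": "text"} for _ in range(20)]

small_result = preflight_warnings(small_mixed, min_documents=10)
large_result = preflight_warnings(large_uniform, min_documents=10)
```

An empty list means that none of the three cases was detected; it does not by itself guarantee that the corpus is large enough for a particular method.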
The conclusion after comparing the three approaches: Supervised Machine Learning is in many cases the most suitable approach. Clear classifications and the low levels of expert knowledge needed for project implementation mean that this method is preferable to IR (Information Retrieval). The effort needed to collate a training corpus is still significantly lower than checking all documents manually.
Author: avato (Isabell Bachmann)