
A theory of presentation and its implications for the design of online technical documentation
©1997 Detlev Fischer, Coventry University, VIDe (Visual and Information Design) centre

Appendix V-Evaluation

This appendix outlines context, stages and methods of the cinegram evaluation and specific methodological problems that account for changes of evaluation method. General methodological issues are discussed in chapter 2–Methodology, while samples demonstrating the grounded theory method can be found in appendix VI–Grounded theory applied. The evaluation results are not presented here; they are woven into the theory and have been worked into the cinegram prototype.

Originally, the aim of the evaluation had been to expose the cinegram prototype to different types of users in order to validate and improve the cinegram design. Later the evaluations became an instrument contributing evidence for the emerging theory of presentation.

The evaluation took place in four partly overlapping stages. Users were peers, students of technical communication, students of aerospace systems engineering, and experienced practitioners in the Service Engineering section at Rolls Royce.

All stages contributed to the pool of evidence that became the ground for theory development. The evaluation with professional service engineers was most decisive for changes to the cinegram prototype. However, the evaluations with novices and students revealed aspects hidden by the expertise of professionals. The fact that many of the users in the first three stages were learners both in terms of the cinegram prototype and in terms of its substantive domain made it possible to witness the emergence of situational and operational protocols as they began to shape the evaluation setting.

Stage 1: Think-aloud trials

The first stage consisted of a series of informal evaluation sessions with peers and acquaintances. Most users were novices in terms of both the cinegram and the domain. However, many had experience of multimedia systems, and some were experienced multimedia designers—a fact that substantially informed expectations regarding the cinegram interface. One aim of the trials was to eliminate as many obvious cinegram bugs, implementation problems [1], and design flaws as possible. A secondary aim was to trial different types of question during the preparation of the question catalogue for stage 2.

Method

The think-aloud method (also called protocol study) is widely claimed to be very effective (cf. Tognazzini 1992, p84). It is also very cost-effective, which can become a powerful argument for including evaluation in the design process from the very beginning (a financial argument is put forward by Nielsen & Mack 1994; cf. also Brown 1994).

Procedure

Sessions were arranged informally and held between the author and individual users. The usual informal evaluation message was 'have a go, explore the system, and tell me what you experience'. While some users were happy to do just that, others were at a loss, 'didn't know what to do or where to start' and asked to be given a task [MT 1.3 P2]. In some cases the task merely gave users initial momentum and was soon brushed aside. The author frequently took the liberty of eliciting users' opinions regarding their understanding of particular interface features. Colleagues worked through the cinegram following early versions of the developing question catalogue, commenting on both questions and cinegram design.

Methodological problems

Stage 1 of the evaluation revealed methodological problems which can be grouped under situational protocol and power differential. These problems are fundamental and recurred in the later stages, where they are not mentioned again explicitly.

Situational protocol. In situations where the evaluator invites a peer, student, or colleague into his or her den to ‘have a go’ at the prototype, the situational protocol of the setting instantiates a tacit social code that often distorts or filters users' responses. This is because users perceive the designer's stake in his or her design, with its implications for self-esteem and career rewards. That stake clearly shines through every passionate avowal of a self-critical attitude and professed willingness to learn from the user's most honest and scathing critique.

Beyond mere politeness and respect for the other's work, the user does not want to enter a level of critique or debate which might produce insult or embarrassment. By the same token, however, the user knows that something must be commented on and criticised. What is said is then mostly complimentary on the surface. The designer develops an ear to infer what is actually meant. He may indeed magnify the implicit critique and ruthlessly criticise his own design, only to be stopped by the user's apologetic qualification and mitigation.

Through the presence of the user the designer suddenly sees his or her design through someone else's eyes, even before any comment is made. It is difficult to account for this experience—its distancing effect, however, is very real and can lead to experiences such as ‘suddenly seeing the obvious’.

Power differential. The protocol implies that whatever control the evaluator exerts during the session is only granted within temporal tolerances. The protocol tacitly includes a clause allowing users' retreat (simply not turning up, for example) without the threat of reprisal. Once the user has entered the evaluation setting, the power differential leans heavily towards the evaluator, who owns the method and the object to be evaluated, both of which are virtually unknown to the user. All this results in an evaluation protocol of playful submission to the evaluator's directives, even more so when these directives encourage the user to ‘take over’.

The dialectic between ‘observe what the user discovers’ and ‘find out what the user thinks about this feature’ shows the problem of testing in a nutshell: letting things happen produces discoveries that cannot be measured, whereas constraining through prompts and set questions enables measurement but largely eliminates the emergence of discovery, blind spots and errors. Constraining ensures that the designer will miss the problems he or she cannot see—which are often more critical than the problems on the check list.

Since the oil system was completely unfamiliar to most users, it was difficult for them to imagine the framing referent domain: for example, a question asking users to name the location of a component on the engine presupposed that they knew what the engine was. This indicated the importance of a substantive introduction to the engine for stage 2 of the evaluation.

One purpose of the early think-aloud evaluations was to test versions of the question catalogue. Cinegram users commented that some of the questions (e.g. the first one, ‘What is the purpose of the oil system?’) were too obvious since they afforded local matches to the text on the opening screen: ‘The purpose of the oil system is to…’. Some users suggested asking for a description instead of putting questions.

Move towards dialogical method. The planned think-aloud protocol allocated the role of quiet observer to the evaluator, while the user was supposed to comment on his or her actions during use. Rigorously following this protocol proved very difficult, and it was often abandoned, particularly when user and evaluator were personal acquaintances. Quite naturally, users asked questions and expected answers, even if they had initially agreed to forego answers as part of the protocol (cf. Rubin 1994).

Stage 2: Comparative evaluation

Stage 2 consisted of a number of comparative evaluations [2] with second-year students of Technical Communication from a variety of backgrounds. None of the users had seen or used the cinegram prototype before. The students worked in pairs; each pair was asked to work through a catalogue of initially 18 questions about the Trent oil system, using either chapter 79 of Rolls Royce's Trent 700 Maintenance Manual (in the following simply referred to as the manual) or the cinegram prototype.

All users had had some exposure to computer-based document systems in projects carried out as part of their course. Many had insights into the process of designing technical documentation from their first year industrial placements. Some of the mature students had been in jobs in which they had picked up some engineering knowledge. A certain preoccupation with technical documentation and a potential bias towards more recent computer-based document systems could therefore be expected.

In this stage of the evaluation, the primary aim was to compare the use of the two different document systems. I hoped to find the cinegram architecture and display techniques workable and wanted observations and comments on how they could be further improved. I also expected the results from the manual group to be complementary to the investigation of usability problems carried out in the field.

A secondary aim was to investigate the secondary process of learning to use a novel interface while being engaged in the primary presentation activity of answering a series of factual questions [3]. For this reason, the amount of introduction to cinegram and domain that users received before the sessions was consciously kept to a minimum. Users had no hands-on experience of either document system before the start of the session.

Both evaluation protocol and cinegram were changed after each round in response to flaws surfacing during the sessions or pointed out by users and witnesses. The variables kept constant at any one time across both groups of users were the evaluation protocol (including the type of introduction and written instruction), the composition of the setting (two users and one witness), and the question catalogue.

Method

The use of a question catalogue allowed evaluation on two different levels. On the primary level, question-asking [4] revealed cinegram usability problems which could be compared to the manual usability problems. In many respects, novice users resemble non-experts occasionally consulting a rarely used manual. In an organisation like Rolls Royce, many people's work is quite remote from technical engineering tasks, but involves activities which require some understanding of physical systems. Conventional training documents are not ideal for such emergent purposes since they are usually designed according to a scheduled model of learning as transfer [5]. A system so easy to use that it did not require any training would be ideal for the occasional irregular user.

On a secondary level, question-asking allowed close observation of the development of users' understanding of both prototype and referent system. In a sense, the primary level functioned as a prop to make the secondary possible [6].

Observations are subjective transformations. The observer makes real-time selections from the raw reality of ‘what happens’. Articulation then transforms these selections into a condensed and qualified description. Through dialogue with users, the observer can turn observation into a recursive instrument. Dialogue reveals actions, assumptions, and anticipations which would otherwise have remained hidden and are unlikely to be recoverable from recorded material [7].

Question design. The initial question catalogue was suggested by a subject expert from Rolls Royce and then discussed and refined between the author, the subject expert, and an expert in technical communication. The questions resemble those which might arise in flight line maintenance. We used factual questions (e.g.: ‘What is the mesh size of the scavenge filter?’) and conceptual questions (e.g.: ‘Why are there two filters in the oil system?’).

Our first objective was to make sure that all the questions were answerable with both document systems. We decided to start the catalogue with a range of questions covering function and location of the major components. The idea was that working through these questions, the users would gradually build up an overall view of the referent system, after which we would place a small number of conceptual questions.

Procedure

The original plan was to involve 24 students, which would have resulted in 12 pairs, 6 for each type of document system. Unfavourable circumstances did not allow the successful completion of the second stage with all users. However, including two pairs of users from a pilot run, altogether 10 pairs of users took part in this type of evaluation.

At the beginning of each round of sessions, all participants received a substantive introduction to the referent domain. Drawing lots determined the pairing, and one evaluator was assigned to each pair. The evaluation sessions took place in separate rooms. The author spent 5 minutes demonstrating the cinegram interface while a Rolls Royce expert did the same for the manual. Then the users received the question catalogue and paper for note-taking. They were encouraged to ask for clarification of unclear questions.

Using the respective document system, the users then worked through the catalogue and wrote down their answers below each question. A note-taking evaluator sat in with each pair. After the session each user was asked to complete a questionnaire to allow qualification of the results according to variations in educational and professional background.

Methodological problems

Methodological problems added in stage 2 can be divided into evaluation protocol, comparison, and indirection.

The evaluation protocol, particularly the question design, question order, and the design of preparatory material and report sheets, turned out to have a great influence on the type of answers. The range of non-directive questions located in an introductory paragraph of the user report sheet had the effect that users consciously or unconsciously conformed to the author's expectations. Also, users translated the space allocated for answers into assumptions about the importance of the respective question.

It was often difficult to tell whether the given answers indicated understanding since users, lacking knowledge of the referent system, had often simply copied parts of the document system text. Although both users and witnesses reported some problems with aspects of the interface, the reports did not yield much information about the way users' understanding or misunderstanding had emerged, how affordances had been perceived, or how system reactions had been correctly or incorrectly anticipated.

The given answers differed so much in terms of level, precision and style that a comparison between the two document systems by measuring the ‘correctness’ of answers seemed methodologically dubious. The reason for this diversity was not that the questions had lacked precision. Rather, the type of document system used had instantiated specific validation contexts which had already influenced the very interpretation of the questions (cf. section 3.3 Validation context in chapter 3–Context). Many answers therefore bypassed the ideal professional answer we had expected when designing the questions.

Another set of methodological problems relates to the comparison between manual and cinegram. There are substantial differences between the maintenance manual and the cinegram. Although both cover the ground necessary to answer the question catalogue [8], the two document systems are very different in terms of boundaries, media, form of presentation and style of writing. Since all our questions referred to the oil system, we had decided to single out the oil system section and to present it to users as a manual [9] in its own right. However, in the field, the oil system chapter is only a small part of a file containing several other chapters, which in turn is just one of four files which comprise the whole Trent maintenance manual [10].

The maintenance manual contains a number of detailed installation/removal and fault isolation procedures for which there is no equivalent in the cinegram prototype. The cinegram, on the other hand, contains many views (e.g. photos and animated cross sections) which are not available in the manual.

Indirection through witness reports also affected the comparability of evidence. Most of the comparative evaluation sessions were documented only by the users' completed question catalogues and the accompanying witness reports.

Although all witnesses had been introduced to the document systems and given the opportunity to use them themselves, they often lacked the familiarity needed to detect and describe users' situated conditions and navigation.

However, using witnesses also brought advantages. It increased the number of different perspectives on the problem (cf. Kleining 1994). It probably increased frankness since users knew that the witness had no personal stake in the prototype. Finally, it produced comments by witnesses on both evaluation method and cinegram design.

Since the author had little prior experience in conducting evaluations, most of the reported difficulties had been unforeseen. They became apparent through critical comments from users and witnesses and through recurrent analyses of the setting, behaviour, debriefing, and written reports of users and witnesses. Critical reflection led to a progressive re-orientation of the methodology. This began with changes to the evaluation protocol, and finally led to abandoning the comparative approach altogether.

Stage 3: Dialogical evaluation

Stage 3 had not been planned at the outset, but was added in response to methodological problems found in the earlier stages. It went back to the more informal type of think-aloud evaluation. The analysis of results so far had shown the impact of prior experience in the referent domain on document system use. This informed the choice of a group of third-year aerospace systems engineering students as users for stage 3 of the evaluation [11]. All users had some general knowledge of turbine engines, so a general introduction to the subject was unnecessary. All users were familiar with flow systems, but they all stated that they knew nothing about oil systems. However, it became evident that they could infer system behaviour from their accumulated experience with other flow systems. In contrast to the users in stage 2, the aerospace systems engineering students were less sensitised to issues of technical documentation and had little or no experience of multimedia systems.

The methodological problems that had appeared during the comparative evaluations indicated that a more informal, dialogical evaluation method would reveal more about users' presentations. The aim was now to ‘turn up the magnification’ by creating a situation in which the evaluator could at any time probe the user to shed light on activities which are not directly observable and likely to remain unnoticed by users themselves. The limited number of volunteers and the dubiousness of comparing cinegram and manual suggested a concentration of resources on evaluating the cinegram prototype.

Method

The method applied in stage 3 was developed from the evaluation of the methods used in stages 1 and 2. In order to focus on understanding users' emergent presentation, the number of questions in the catalogue was reduced, while the exploration of the way questions were being answered took up much of the time. Evaluators explained to users that the aim was to understand their answering strategies in order to see how well these strategies were supported by the system. They emphasised that the aim was not to see how well users performed. I decided against handing out written questions to avoid an atmosphere of testing. Also, there was no emphasis on getting through all questions and no need to prune digressions. Instead, evaluators read out one question at a time and waited until presentation close-out before reading the next question. The sessions were recorded on audio tape to allow a more fine-grained analysis of the emergent presentation [12].

The omission of an introduction to the cinegram was motivated by the wish to capture users' emergent presentation from a walk-up-and-use perspective. The stance of starting with a clean slate to see how users cope was informed by the question-asking method described by Johnson & Briggs (1994), in which users are encouraged to ask evaluators for help whenever they feel the need to do so. The evaluator records the question and takes it as an indication of situated information needs that reveal which aspects of the interface are ambiguous, misleading, or not understood at all.

Procedure

Prior to the sessions, users in stage 3 had received a short written description of the aims of the evaluation. Of the 10 originally planned sessions, only 4 took place, one of them with 2 users. The reason for this was that students were busy completing their final-year essays.

All but one session involved the author as both participant in dialogue and observer. One session had two users who wanted to do the evaluation jointly. On this occasion, the author took the opportunity to hand out the same question catalogue used in stage 2 to see how users' engineering training would influence the quality of answers. This session was recorded through note-taking. The other three single-user evaluations were taped and later partly transcribed [13].

Methodological problems

The dialogical style of evaluation allowed significant qualitative changes of situational and operational protocol during the sessions. This can be explained by the lack of presented constraints, such as a written evaluation protocol or question catalogue, and by the transient articulation of answers. Users' responding presentation often led to problem drift and serendipitous exploration. This resulted in rich material revealing details of users' problem aggregation and opening up aspects such as the negotiation of close-out (cf. section 4.4 Aggregation in chapter 5–Problem).

The price of this lack of constraints was initial insecurity on the part of users, who could not retreat to a presented protocol. The other evaluator commented:

‘It seemed to me that my asking the questions and questioning his decisions was a hindrance to the process of exploration that has occurred in previous trials. This symptom evaporated eventually, but the session started off in a noticeably more stilted manner than others have done’ [LF 3.2 P2]

It seems that a material evaluation protocol, e.g. the presence of written questions as a resource, can have a useful role in mediating the distance between two people who have never met before and are merely brought together by the evaluation, in which both may have little intrinsic interest. The author-as-evaluator experienced the same problem, but in a somewhat milder form. Perhaps the clarity of his motivation to find out about the use of his design made the situation less ambiguous for users.

A possible argument against the validity of dialogical evaluation is that the designer-as-evaluator may unwittingly coerce users to take a more positive attitude to the system than they would take in a more neutral setting. To me this seems disproved by the fact that in this stage users were more open in their critique, pointed out more problems, and gave more unsolicited suggestions than in any other stage before.

Stage 4: Trials with professionals

Stage 4 involved service engineers at Rolls Royce. A series of informal presentations to various people at Rolls Royce who commented on early versions of the cinegram prototype was followed by 4 informal evaluation sessions, each with a different service engineer. All users were intimately familiar with this type of oil system, but most were unfamiliar with the particular Trent oil system. The system's novelty made users focus on differences from older systems; for them, the evaluation was a welcome opportunity to find out more about the Trent system.

Method

The position of the author as tolerated observer and interviewer in the oil system group (cf. section 2.3 Fieldwork in chapter 2–Methodology) suggested informal evaluations which would not disrupt normal activities. Choice of user, time and length of the session depended on individuals' availability.

The general approach was to let the user decide what to focus on and comment upon, and to explore the subjects raised through probing. The method was therefore dialogical and receptive. Sessions were recorded through note-taking by the author, who also handed a draft of the evaluation report to users for comments and corrections.

Procedure

The evaluation procedure was driven by engineers' interest and curiosity. During the evaluations engineers freely commented and suggested improvements. A slightly more formal approach involving a question catalogue was planned for only one trial, but it turned out that the actual evaluation protocol was dominated more by the engineer's curiosity, exploration, and explanations of the domain. Dialogue and observation were interleaved, e.g. observed navigation problems occasioned dialogues between user and evaluator. The sessions were not taped; note-taking seemed the least intrusive way of recording data. Users read and commented on the resulting evaluation reports, often clarifying important points. This had the added effect of making it clear to users that the author had no hidden agenda.

Methodological problems

In stage 4, the major difference from the prior stages was the engineers' familiarity with the domain pattern. This dominated the evaluation protocol and validation context of presentation, since it created a constant pull away from discussing the document system towards discussing the substantive problem implied in or triggered by the evaluation questions. For example, in one evaluation, a troubleshooting question led to the point where the engineer abandoned the cinegram prototype in order to explain a fault with reference to a case he had been working on. This triggered navigation to the resources the engineer had used for presentation—a file containing reports, test data, correspondence, and an ETG. The cinegram evaluation was resumed a little later. The engineer's focus on the problem meant that for him the whole episode had not been an interruption, but part of one situated and transparent presentation.


Footnotes to appendix V-Evaluation

[1] Because of implementation-related problems such as the slow response of pop-up menus and pointers embedded in animation, users often missed or misapprehended navigational features. This sometimes made the assessment of design decisions difficult.

[2] Cf. Walker et al's (1989) investigation of the use of an on-line manual.

[3] One could argue that as a result of this approach it is impossible to distinguish methodologically, as is common in the HCI literature, between 'ease of learning' and 'ease of use'. But the fallacy of studying 'ease' is expressed in the paradox that such studies must, qua research design, exclude non-users who have decided against using the system on grounds of its reputation, the experiences of colleagues, pressing workload, or simply at the first off-putting glance.

[4] Wright & Lickorish (1990) used multiple-choice questions for their comparison of two different hypertext navigation methods. Mynatt et al (1992) used question-answering to compare the use of a paper and a hypertext encyclopaedia. Seidler & Wickens (1992) used questions in their ‘treasure hunt’ method to compare two different navigational structures in a hierarchical hypertext. All these experiments focused on straightforward fact-finding and did not address questions of interpretation.

[5] For a critique of the concept of learning as transfer, see Brown & Duguid (1992 pp166). The authors quote a number of sources that have contributed to this critique and endorse a view of learning as construction. For a recent view summarising the behaviourist and cognitive tradition and discussing recent constructivist theories of learning, see Driscoll (1994).

[6] ‘Knowing what answer people get in solving problems is much less informative than catching even fragmentary glimpses of the complex processes by which they arrive at the answer’ (Hayes & Flower quoted after Lewis & Waller 1993). See also Jones (1992/1970 p236) on the value of user observations for system design.

[7] This does not imply that recordings are without value, although the methodological difficulties and the sheer workload of analysing recordings are often emphasised in the literature. Cf. Tognazzini (1992 p81), or Johnson & Briggs (1994 p60).

[8] We accidentally missed the fact that one of the questions concerning the anti-leak mechanism of the pressure pump was not yet included in the maintenance manual and could therefore not be answered properly by the users in the manual group.

[9] In the pilot of the comparative evaluation, both manual users had their own copy of the manual, while the cinegram users worked on a single system. In the last third of the manual session, this led to a 'division of labour' where users decided to answer questions separately while watching out for material that the other might possibly need.

[10] It was not possible to obtain and take away a copy of the full set of Trent maintenance manuals for the evaluation period. Since we could not hope to re-create the richness and complexity of the resource base available to service engineers, we decided to draw a boundary and concentrate on the oil system section. The existence of numerous cross-references within maintenance manuals and between manual and IPC would no doubt have complicated navigation issues, particularly for non-engineer users.

[11] This is in line with the strategy of theoretical sampling suggested by Glaser & Strauss (1967). Cf. also Glaser (1978) and Strauss (1987).

[12] Cf. Pinnington's (1990) classification of evaluation in coarse-, mid-, and fine-grained.

[13] One tape with a still untranscribed session drowned in olive oil. The bottle broke when I slipped on the stairs of my local Sainsbury supermarket.
