Willpower Information
Information Management Consultants
Thesaurus principles and practice
This paper was originally presented at a workshop
"Thesauri for museum documentation" held at the Science Museum,
London, on 24th February 1992. The proceedings of the workshop have
been published by the mda (formerly the
Museum Documentation Association).
1. Why do we need a thesaurus?
2. A limited list of indexing terms
3. Hierarchical relationships
4. Related terms
5. Definitions and scope notes
6. Form of the thesaurus
7. Special factors relating to museum objects
8. Use of a thesaurus when cataloguing
9. Use and modification of existing thesauri
10. Thesaurus maintenance
11. What sort of fields is a thesaurus appropriate for?
12. Other subject retrieval techniques
13. Bibliography on thesaurus construction and use
One of the reasons for documenting our collections is that
we wish to be able to find objects of a particular kind. We may
ask "What thermometers do we have in the collection?", "What
arrowheads?", "What frocks?", "What whales?" or "What textile
machinery?"
The simple answer is that we give each item a
"name", and then we can create a file of index cards, or a
computer file, in which we can search for these names and
expect to find all the appropriate items. This is the concept
of the simple name field in the MDA data structure.
It is straightforward at first, and seems intuitive, but once
you have documentation which has been built up over time,
perhaps by many different people, problems creep in unless
there are rules and guidelines to maintain consistency.
The word thesaurus is a rather fancy name, which has
acquired a certain mystique, because it is often bandied
about as something necessary for effective information
retrieval, but something which sounds as though it will
involve a lot of work. I have often heard curators say
"That's all very well if you have the time and resources, but
I have this great backlog of cataloguing to do, and I would
never get through the half of it if I had to spend time
setting up anything as complicated as a thesaurus. What I
need is a simple list of names which I can use to index my
objects."
My main purpose in this paper is to make three points:
- A simple name list without some rules will rapidly become
a mess.
- Only three simple rules are needed; using them will make
life easier for you, not harder.
- So long as you stick to these rules, you can take an
existing thesaurus and adapt it to your needs; you are not
limited to using the terms which are listed in it already,
and you are not obliged to use more detail than you need.
What are these rules?
- Use a limited list of indexing terms, but plenty of entry
terms; link these with USE and USE FOR (UF) relationships.
- Structure terms of the same type into hierarchies; link
these with BROADER TERM/NARROWER TERM (BT/NT) relationships.
- Remind users of other terms to consider; link these with
RELATED TERM/RELATED TERM (RT/RT) relationships.
I shall consider each of these rules in turn.
A major purpose of a thesaurus is to match the terms
brought to the system by an enquirer with the terms used by the
indexer. Whenever there are alternative names for a type of
item, we have to choose one to use for indexing, and provide an
entry under each of the others saying what the preferred term
is. If we index all full-length ladies' garments as
dresses, then someone who searches for frocks
must be told that they should look for dresses
instead.
This is no problem if the two words are really synonyms, and
even if they do differ slightly in meaning it may still be
preferable to choose one and index everything under that. I
do not know the difference between dresses and
frocks but I am fairly sure that someone searching a
modern clothing collection who was interested in the one
would also want to see what had been indexed under the other.
We normally do this by linking the terms with the relationships
USE and USE FOR, thus:
Dresses  USE FOR  Frocks
Frocks   USE      Dresses
This may be shown in a printed list, or it may be held in a
computer system, which can make the substitution
automatically. If an indexer assigns the term Frocks, the
computer will change it to Dresses, and if someone searches
for Frocks the computer will search for Dresses instead, so
that the same items will be retrieved whichever term is used.
A friendly computer will explain what it is doing, so that
the user is not puzzled by being given items with terms
different from those asked for.
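The automatic substitution described above can be sketched in a few lines of Python. This is only an illustration, not the behaviour of any particular system: the `USE` table and the function names `preferred` and `use_for` are assumptions made for the example, and the entries are taken from the examples in this paper.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred term (USE).
USE = {
    "frocks": "dresses",
    "prams": "perambulators",
    "baby carriages": "perambulators",
}


def preferred(term):
    """Return the preferred term, plus a message explaining any substitution."""
    key = term.strip().lower()
    target = USE.get(key, key)
    if target == key:
        return target, ""
    return target, f"'{key}' is not an indexing term; using '{target}' instead"


def use_for(term):
    """Derive the reciprocal USE FOR list for a preferred term."""
    return sorted(entry for entry, target in USE.items() if target == term)


print(preferred("Frocks"))
print(use_for("perambulators"))
```

The same `preferred` function would be applied both when an indexer assigns a term and when an enquirer searches, so that both are led to the same indexing term; the returned message is what lets a "friendly" system explain the change.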
USE and USE FOR relationships are
thus used between synonyms or pairs of terms which are so
nearly the same that they do not need to be distinguished in
the context of a particular collection. Other examples might
be:
Cloaks          USE      Capes
Capes           USE FOR  Cloaks

Nuclear energy  USE      Nuclear power
Nuclear power   USE FOR  Nuclear energy

Baby carriages  USE      Perambulators
Perambulators   USE FOR  Baby carriages
Perambulators   USE FOR  Prams
Prams           USE      Perambulators
If we name objects, we want
to be as specific as possible. If we have worked hard to
discern subtle distinctions in nature, type or style, we
certainly want to record these. The point is that the
thesaurus is not the place to do this. Detailed
description of an object is the job of the catalogue
record; the job of the thesaurus, and the index which is
built by allocating thesaurus terms to objects, is to provide
useful access points by which that record can be
retrieved.
USE and USE FOR relationships can
also be used to group similar items together, because too
much specificity is as bad as too little. If we have a small
clothing collection, containing ten jackets, it is more
useful to give them all the index term jackets than
to create many specific categories. Anyone searching our
catalogue will then be able to search on the single term
jackets and see a list of the ten items, each with a
description of exactly what kind of jacket it is, as follows:
Jackets:
1. Anorak in green cotton, England, 1985.
2. Tweed sports jacket, Hawick, Scotland.
3. Silk bolero with floral embroidery, Spanish, 1930.
If we used all the possible specific names, each of which
would have only one or two items in it, such as blazers,
dinner jackets, boleros, donkey jackets, anoraks, flying
jackets, sports jackets, and so on, enquirers would have
to search the catalogue under each name in turn in order to
find all the jackets in the collection, and they would never
be sure that there was not a kind of jacket that they had
overlooked.
To help enquirers who approach the system by one of these
terms, we therefore create the references:
Blazers         USE  Jackets
Dinner jackets  USE  Jackets
and so on.
If we have a hundred jackets, a list under a single term
will be too long to look through easily, and we should use the
more specific terms. In that case, we have to make sure that a
user will know what terms there are. We do this by writing a
list of them under the general heading, thus:
Jackets
  NT  Anoraks
      Blazers
      Boleros
      Dinner jackets
      Donkey jackets
      Flying jackets
      Kagouls
      Sports jackets
We could just invert terms and rely on the alphabet to bring
them together, in a list such as
Jackets, dinner
Jackets, donkey
Jackets, flying
Jackets, sports
but this is unreliable and subject to the vagaries of the
language, which does not always describe a specific type of
item by an adjective preceding the generic name. We have to
accommodate types of jacket which have their own distinctive
names such as Anoraks or Blazers.
In both the above cases, it is important that the terms
which are linked are of the same type. That is to say
that any narrower term must be a specific case of the
broader term, and able to inherit its characteristics.
(The developers of Object Oriented Programming have
recently discovered this idea, which has been known to
the worlds of information science and biological taxonomy
for a very long time.) Thus if we say that
Blazers is a narrower term of Jackets,
we mean that every blazer is, whatever else it may be,
inherently a jacket, and that it has the characteristics
which define a jacket.
Mice can properly be said to be a narrower
term of Rodents, because all mice are
inherently rodents, but it is not correct to list
Mice as a narrower term of Pests,
because some mice, such as laboratory mice and pet
mice, are not pests. The idea is to have relationships
in the thesaurus which are always true, irrespective of
context. In the same way, it would not be correct to
list Buses as a narrower term of
Diesel-engined vehicles, although many of them
are; if we have a diesel-engined bus in our collection,
we should show this by giving it the two terms
Buses and Diesel-engined vehicles.
Broader and narrower terms: hierarchical relationships
- Relationships must be independent of context
- Terms must represent the same type of entity

Valid, because every mouse is a rodent and all shoes are footwear:
Mice        BT  Rodents
Rodents     NT  Mice
Shoes       BT  Footwear
Footwear    NT  Shoes

Not valid, because the relationship depends on context (mice and
pests) or links terms of different types (shoes and shoemaking):
Mice        BT  Pests
Pests       NT  Mice
Shoes       BT  Shoemaking
Shoemaking  NT  Shoes
Good computer software should allow you to search for
"Jackets and all its narrower terms" as a single
operation, so that it will not be necessary to type in all
the possibilities if you want to do a generic search.
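A generic search of this kind amounts to expanding the starting term into itself plus everything below it in the hierarchy, and then retrieving any object indexed by one of those terms. The following Python sketch shows one way this might be done; the `NT` table is a small subset of the clothing examples, and the `CATALOGUE` of object numbers is invented purely for illustration.

```python
# Narrower-term (NT) links, a subset of the clothing examples in this paper.
NT = {
    "outerwear": ["coats", "jackets", "rainwear"],
    "coats": ["raincoats"],
    "rainwear": ["raincoats"],
    "jackets": ["anoraks", "blazers", "dinner jackets"],
}

# Hypothetical catalogue: object number -> indexing terms.
CATALOGUE = {
    "1992-1": {"anoraks"},
    "1992-2": {"jackets"},
    "1992-3": {"raincoats"},
    "1992-4": {"clocks"},
}


def expand(term):
    """Return the term together with all of its narrower terms, at any depth."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(NT.get(current, []))
    return found


def generic_search(term):
    """Retrieve objects indexed by the term or by any of its narrower terms."""
    wanted = expand(term)
    return sorted(obj for obj, terms in CATALOGUE.items() if terms & wanted)


print(generic_search("jackets"))    # ['1992-1', '1992-2']
print(generic_search("outerwear"))  # ['1992-1', '1992-2', '1992-3']
```

Because `expand` keeps track of the terms it has already visited, a term such as raincoats, which has two broader terms, is reached twice but counted once.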
If we restrict the hierarchical relationship to true
specific/generic relationships, we need another mechanism to
draw attention to other terms which an indexer and a searcher
should consider. These are RELATED TERMS of the
starting term. Related terms may be of several kinds:
- Objects and the discipline in which they are studied,
such as Animals and Zoology.
- Processes and their products, such as Weaving and Cloth.
- Tools and the processes in which they are used, such as
Paint brushes and Painting.
It is also possible to use the RELATED TERM
relationship between terms which are of the same kind, not
hierarchically related, but where someone looking for one ought
also to consider searching under the other, e.g. Beds RT Bedding;
Quilts RT Feathers; Floors RT Floor coverings.
A thesaurus is not a dictionary, and it does not normally
contain authoritative definitions of the terms which it
lists. It could perfectly well do this, but a lot more
work would be required to develop it in this way. In an
automated system, however, the thesaurus would be a
logical place to record information which is common to
all objects to which a term might be applied, for example
notes on the history and origin of Anoraks or the
identifying characteristics and lifestyle of Mice (or
perhaps Mus musculus in a taxonomic thesaurus).
Where there is any doubt about the meaning of a term,
or the types of objects which it is to represent, a
SCOPE NOTE (SN) is attached to it. For
example,
Fruit
  SN  distinguish from Fruits as an anatomical term
  BT  Foods

Preserves
  SN  includes jams

Neonates
  SN  covers children up to the age of about 4 weeks;
      includes premature infants
A list based on these relationships can be arranged
in various ways; alphabetical and hierarchical sequences
are usually required, and thesaurus software is generally
designed to give both forms of output from a single
input. A typical simple thesaurus of a few clothing terms
is shown in Tables 1 and 2.
Table 1: Sample thesaurus - hierarchical sequence

knitwear
> cardigans
> pullovers
outerwear
> blouses
> cardigans
> coats
> > raincoats
> dresses
> jackets
> > anoraks
> > blazers
> > dinner jackets
> > donkey jackets
> > reefer jackets
> leggings
> pullovers
> rainwear
> > raincoats
> shawls
> shirts
> skirts
> suits
> trousers
> > jeans
> > shorts
> > slacks

Table 2: Sample thesaurus - alphabetical sequence

anoraks
  BT   jackets

blazers
  BT   jackets

blouses
  UF   smocks
  BT   outerwear

breeches
  USE  trousers

capes
  USE  coats

cardigans
  SN   knitted jackets with front opening
  BT   knitwear
       outerwear

cloaks
  USE  coats

coats
  UF   capes
       cloaks
       overcoats
  BT   outerwear
  NT   raincoats

dinner jackets
  BT   jackets

donkey jackets
  BT   jackets

dresses
  UF   frocks
  BT   outerwear

duffel jackets
  USE  reefer jackets

frocks
  USE  dresses

jackets
  BT   outerwear
  NT   anoraks
       blazers
       dinner jackets
       donkey jackets
       reefer jackets

jeans
  BT   trousers

jumpers
  USE  pullovers

knitwear
  NT   cardigans
       pullovers

leggings
  BT   outerwear

outerwear
  NT   blouses
       cardigans
       coats
       dresses
       jackets
       leggings
       pullovers
       rainwear
       shawls
       shirts
       skirts
       suits
       trousers

overcoats
  USE  coats

pullovers
  UF   jumpers
       sweaters
  BT   knitwear
       outerwear

raincoats
  BT   coats
       rainwear

rainwear
  BT   outerwear
  NT   raincoats

reefer jackets
  UF   duffel jackets
  BT   jackets

shawls
  UF   wraps (clothing)
  BT   outerwear

shirts
  BT   outerwear

shorts
  BT   trousers

skirts
  BT   outerwear

slacks
  BT   trousers

smocks
  USE  blouses

suits
  BT   outerwear

sweaters
  USE  pullovers

trousers
  UF   breeches
  BT   outerwear
  NT   jeans
       shorts
       slacks

wraps (clothing)
  USE  shawls

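To illustrate how both sequences can be produced from a single input, here is a minimal Python sketch. It assumes that only the BT links are recorded for each preferred term and that the NT links are derived as their reciprocals. The `BT` table is a small subset of the sample thesaurus, the function names are hypothetical, and the layout only approximates that of Tables 1 and 2.

```python
# Single input: each preferred term with its broader terms (BT).
# A small subset of the sample thesaurus, for illustration only.
BT = {
    "knitwear": [],
    "outerwear": [],
    "cardigans": ["knitwear", "outerwear"],
    "coats": ["outerwear"],
    "rainwear": ["outerwear"],
    "raincoats": ["coats", "rainwear"],
}


def narrower(term):
    """Derive NT links as the reciprocals of the recorded BT links."""
    return sorted(t for t, broader in BT.items() if term in broader)


def hierarchical(term, depth=0):
    """Return display lines for a term and everything beneath it."""
    lines = ["> " * depth + term]
    for child in narrower(term):
        lines.extend(hierarchical(child, depth + 1))
    return lines


def alphabetical():
    """Return display lines listing each term with its BT and NT entries."""
    lines = []
    for term in sorted(BT):
        lines.append(term)
        lines.extend(f"  BT  {b}" for b in sorted(BT[term]))
        lines.extend(f"  NT  {n}" for n in narrower(term))
    return lines


# Hierarchical sequence: start from the terms which have no broader term.
for top in sorted(t for t in BT if not BT[t]):
    print("\n".join(hierarchical(top)))

# Alphabetical sequence.
print("\n".join(alphabetical()))
```

A term with two broader terms, such as raincoats, is simply displayed under each of them in the hierarchical sequence, as it is in Table 1.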
Many thesauri have been created with the intention of
being used to index documentary material, and thus they include
many terms which relate to abstract concepts, disciplines and
areas of discussion, as well as the names of concrete objects
which are of primary interest to museums. We have to be careful
to be consistent in how we use these terms. The most
straightforward way is to concentrate first on what objects
actually are - spades are Spades and should be given
this term, rather than the area in which they are used, whether
it is gardening or gravedigging.
You may well wish to allocate abstract and discipline terms
to objects too, so that you can retrieve all the objects to
do with Dentistry, Laundry, Warfare or Food
preparation. These terms can also be included in the
thesaurus, so long as they are not given hierarchical
relationships to names of objects. They should be given
RT relationships to an appropriate level of
object terms.
Some thesauri, such as ROOT, interfile terms of different
types in their hierarchical display. Indentation in such
cases does not necessarily indicate a BT/NT
relationship. The relationships are shown in ROOT's
alphabetical sequence, and it is unfortunate that they are
not distinguished in the hierarchical one.
Because these abstract terms do not describe what the object
is, they could be put into a field in the catalogue
record labelled concept or subject,
distinct from the field containing terms which name
the object. I do not think that such a distinction will
generally be helpful to users, however, and there seems to be
no disadvantage in putting both types of term into a single
field so that they can easily be searched as alternatives or
in combination. Such a field would not be correctly called
name and I therefore prefer to call it simply
indexing terms or subject indexing terms.
There has been much discussion on whether thesaurus terms
should be expressed in the singular or the plural. I believe
that the difficulty arises from different views of what is
being done when a term is assigned to an object record. If a
cataloguer thinks that (s)he is naming the object in hand,
(s)he will naturally use the singular: "This is a clock". If
(s)he is assigning the object to a category of similar objects,
the thought will be "This belongs in the category of clocks".
An enquirer will normally ask for a category, so the latter
form will be more natural and logical.
The point is not a trivial one, because as discussed in
section 2 above there is a conceptual
difference between naming or describing an object and
grouping it with others so that it can be found. Both are
essential steps, but an information retrieval thesaurus is
primarily concerned with grouping.
Singular or plural terms?
The cataloguer thinks: "This is a clock".
The enquirer asks: "What clocks do you have?"
Prefer plural terms because:
- We should design the catalogue to fit the way the
user thinks.
- Clocks is the name of a category, including many types,
so plural is more logical.
The British Standard for
thesaurus construction recommends that plural terms
should be used, except for a few well-defined cases, and my
view is that this practice should be followed. Unfortunately,
there are many records in museum collections which have been
given singular "object names", and the work of changing these
to plurals in a move to a thesaurus structure may be so great
as to require some compromise.
The British Standard recommends that when indexing parts
or components, separate terms should be assigned for the
component and for the object of which it forms part, so that
aircraft engines would be indexed by the two terms
Aircraft and Engines. This causes problems in
a museum collection, however, because items indexed in this way
would be retrieved in a search for Aircraft, when only
whole aircraft were being sought. It therefore seems preferable
to use a term such as Aircraft components. A
particular engine may well be an aircraft component, but it is
not an aircraft. Similarly a timer from a cooker can be indexed
by the terms Timers and Cooker components,
and a handle broken from a vase might be indexed as
Handles and Vase fragments. There needs to be
local agreement on how this approach is to be applied to a
particular collection.
In the thesaurus, BT/NT relationships can be
used for parts and wholes in only four special cases: parts
of the body, places, disciplines and hierarchical social
structures.
As shown in the sample thesaurus above, a term can have
several broader terms, if it belongs to several broader
categories. The thesaurus is then said to be polyhierarchical.
Cardigans, for example, are simultaneously
Knitwear and Outerwear, and should be retrieved
whenever either of these categories is being searched for.
With a polyhierarchical thesaurus it would take more space to
repeat full hierarchies under each of several broader terms
in a printed version, but this can be overcome by using
references, as ROOT does.
There is no difficulty in displaying polyhierarchies in a
computerised version of a thesaurus.
A thesaurus is an essential tool which must be at hand
when indexing a collection of objects, whether by writing
catalogue cards by hand or by entering details directly into a
computer. The general principles to be followed are:
- Consider whether a searcher will be able to retrieve the
item by a combination of the terms you allocate.
- Use as many terms as are needed to provide required
access points.
- If you allocate a specific term, do not also allocate
that term's broader terms.
- Make sure that you include terms to express what the
object is, irrespective of what it might have been used for.
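The third principle, that a specific term should not be accompanied by its own broader terms, is one which software could check at the time of indexing. The Python sketch below is one possible approach, assuming a `BT` table like the one in the sample thesaurus; the function names `all_broader` and `tidy` are invented for the example.

```python
# Broader-term (BT) links, a subset of the sample thesaurus.
BT = {
    "raincoats": ["coats", "rainwear"],
    "coats": ["outerwear"],
    "rainwear": ["outerwear"],
    "jackets": ["outerwear"],
}


def all_broader(term):
    """Return every broader term of a term, following BT links to the top."""
    found = set()
    stack = list(BT.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(BT.get(current, []))
    return found


def tidy(terms):
    """Drop any allocated term that is a broader term of another allocated term."""
    redundant = set()
    for term in terms:
        redundant |= all_broader(term)
    return sorted(set(terms) - redundant)


print(tidy(["raincoats", "coats", "outerwear"]))  # ['raincoats']
print(tidy(["jackets", "raincoats"]))             # ['jackets', 'raincoats']
```

Nothing is lost by dropping the broader terms, provided that searches are expanded to include narrower terms: a search for coats will still retrieve an object indexed only as raincoats.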
If you have a computerised thesaurus, with good software,
this can give you a lot of direct help. Ideally it should
provide pop-up windows displaying thesaurus terms which the
cataloguer can choose from and then "paste" directly into the
catalogue record without re-typing. It should be possible to
browse around the thesaurus, following its chain of
relationships or displaying tree structures, without having to
exit the current catalogue record, and non-preferred terms
should automatically be replaced by their preferred
equivalents. A cataloguer should be able to "force" new terms
onto the thesaurus, flagged for review later by the thesaurus
editor. When editing thesaurus relationships, reciprocals
should be maintained automatically, and it should not be
possible to create inconsistent structures.
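As an indication of what automatic maintenance of reciprocals and prevention of inconsistent structures might involve, here is a minimal Python sketch. The `Thesaurus` class and its methods are hypothetical, and the only inconsistency it guards against is a loop in the hierarchy; a real system would need further checks.

```python
from collections import defaultdict


class Thesaurus:
    """Minimal store that keeps BT/NT links reciprocal and free of loops."""

    def __init__(self):
        self.bt = defaultdict(set)  # term -> its broader terms
        self.nt = defaultdict(set)  # term -> its narrower terms

    def ancestors(self, term):
        """Return every term reachable by following BT links upwards."""
        seen = set()
        stack = list(self.bt[term])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.bt[current])
        return seen

    def add_bt(self, term, broader):
        """Record 'term BT broader' together with 'broader NT term'."""
        if term == broader or term in self.ancestors(broader):
            raise ValueError(f"{term} BT {broader} would create a loop")
        self.bt[term].add(broader)
        self.nt[broader].add(term)


thesaurus = Thesaurus()
thesaurus.add_bt("raincoats", "coats")
thesaurus.add_bt("coats", "outerwear")
print(sorted(thesaurus.nt["outerwear"]))  # ['coats']

try:
    thesaurus.add_bt("outerwear", "raincoats")
except ValueError as error:
    print(error)
```

The editor only ever enters one half of each relationship; the other half is written by `add_bt`, so the two can never disagree.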
As there are many thesauri in existence already, it is
worth considering seriously whether one of these can be used
before embarking on the job of creating a new one for a
particular museum or collection. So long as the general
principles are followed, you should be able to expand a
thesaurus to give you more detail if you need it, or truncate
some sections at a high level if they contain more detail than
your collections justify. So long as the relationships are
universally true, it should be possible to combine sections of
thesauri developed by different museums and thus avoid
duplication of work.
Even when using an authoritative thesaurus, some care is
needed, and I have mentioned some limitations of ROOT and AAT
in 7.1 and 7.4 above. It is still much easier to base your
work on something like these than to build your own from
scratch, unless you have a very specialised collection.
Someone has to be responsible for maintaining the thesaurus. New terms can be
suggested, and temporarily "forced" into the thesaurus by
cataloguers as they catalogue objects, but someone has to
review these terms regularly and either accept them and build
them into the thesaurus structure, or else decide that they are
not appropriate for use as indexing terms. In that case they
should generally be retained as non-preferred terms with
USE references to the preferred terms, so that
people who seek them will not be frustrated. An encouraging
thought is that once the initial work of setting up the
thesaurus has been done, the number of new terms to be assessed
each week should decrease, and many systems have operated
successfully in the past with printed thesauri, which are quite
difficult to keep up to date.
A thesaurus is not a panacea which will meet all subject
retrieval needs. It is particularly appropriate for fields
which have a hierarchical structure, such as names of objects,
subjects, places, materials and disciplines, and it might also
be used for styles and periods. A thesaurus proper would not
normally be used for names of people and organisations, but a
similar tool, called an authority file, is usually used for
these. The difference is that while an authority file has
preferred and non-preferred relationships, it does not have
hierarchies.
[Authority files and thesauri are two examples of a
generalised data structure which can allow the indication of
any type of relationship between two entries, and modern
computer software should allow different types of
relationship to be included if needed.]
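One way of picturing such a generalised structure is as a set of typed links between entries, each with a defined reciprocal. The Python sketch below is an illustration only: the `RECIPROCAL` table and the `link` and `related` functions are assumptions, and the authority file entry is simply the name change mentioned at the start of this paper.

```python
# Reciprocal of each relationship type; RT is its own reciprocal.
RECIPROCAL = {"USE": "UF", "UF": "USE", "BT": "NT", "NT": "BT", "RT": "RT"}


def link(store, source, relation, target):
    """Add a typed link and its reciprocal to the store (a set of triples)."""
    store.add((source, relation, target))
    store.add((target, RECIPROCAL[relation], source))


def related(store, source, relation):
    """List the entries linked from a source by a given relationship type."""
    return sorted(t for s, r, t in store if s == source and r == relation)


# A thesaurus uses all of the relationship types.
thesaurus = set()
link(thesaurus, "mice", "BT", "rodents")
link(thesaurus, "weaving", "RT", "cloth")

# An authority file has the same structure, but only USE/UF links are entered.
authority_file = set()
link(authority_file, "Museum Documentation Association", "USE", "mda")

print(related(thesaurus, "rodents", "NT"))
print(related(authority_file, "mda", "UF"))
```

Adding a further type of relationship would then mean adding a pair of entries to `RECIPROCAL` and nothing more.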
A thesaurus is an essential component for reliable
information retrieval, but it can usefully be complemented by
two other types of subject retrieval mechanism.
While a thesaurus inherently contains a classification of
terms in its hierarchical relationships, it is intended for
specific retrieval, and it is often useful to have another way
of grouping objects. This may relate to administrative
distribution of responsibility for "collections" within a
museum, or to subdivisions of these collections into groups
which depend on local emphasis. It is also often necessary to
be able to print a list of objects arranged by subject in a way
which differs from the alphabetical order of thesaurus terms.
Each subject group may be expressed as a compound phrase, and
given a classification number or code to make sorting possible.
It is highly desirable to be able to search for specific
words or phrases which occur in object descriptions. These may
identify individual items by unique words such as trade names
which do not occur often enough to justify inclusion in the
thesaurus. A computer system may "invert" some or all fields of
the record, i.e. make all the words in them available for
searching through a free-text index, or it may be possible to
scan records by reading them sequentially while looking for
particular words. The latter process is fairly slow, but is a
useful way of refining a search once an initial group has been
selected by using thesaurus terms.
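The two free-text techniques can be contrasted in a short Python sketch. It is a simplified illustration, not a description of any particular system: the record numbers are invented, the descriptions are those of the jackets listed earlier, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical catalogue records: object number -> (indexing terms, description).
RECORDS = {
    "A1": ({"jackets"}, "Anorak in green cotton, England, 1985"),
    "A2": ({"jackets"}, "Tweed sports jacket, Hawick, Scotland"),
    "A3": ({"jackets"}, "Silk bolero with floral embroidery, Spanish, 1930"),
}


def words(text):
    """Split a description into lower-case words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(records):
    """'Invert' the description field: map each word to the records containing it."""
    index = defaultdict(set)
    for number, (_, description) in records.items():
        for word in words(description):
            index[word].add(number)
    return index


def scan(numbers, word):
    """Read the given records one by one, keeping those whose description has the word."""
    return sorted(n for n in numbers if word in words(RECORDS[n][1]))


# Direct look-up through the free-text index.
index = build_inverted_index(RECORDS)
print(sorted(index["tweed"]))  # ['A2']

# Sequential scan to refine a group first selected by a thesaurus term.
selected = [n for n, (terms, _) in RECORDS.items() if "jackets" in terms]
print(scan(selected, "silk"))  # ['A3']
```

The index is built once and then consulted directly, whereas the scan reads every selected record each time, which is why it is best kept for refining a group that has already been narrowed down.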
This document is at
http://willpowerinfo.co.uk/thesprin.htm
Revised
2008-11-23 23:05
Comments and feedback on content or presentation are
welcome and should be sent to Leonard Will at
L.Will@willpowerinfo.co.uk
Copyright © Willpower Information, 1998-2008.