The indexing and searching process is divided into two phases, each made of two parts, as follows:

A. Construct the index.
   (1) Find keys: turn each input file into a sequence of tags and keys.
   (2) Hash and sort: prepare the inverted index files from the keys.

B. Retrieve items in response to a query.
   (3) Search: given some query keys, look through the index files and find the tags of the matching items.
   (4) Deliver: given the tags, fetch and print the original items.
The first phase, making the index, is presumably done relatively infrequently. It should, of course, be done whenever the data being indexed change. In contrast, the second phase, retrieving items, is presumably done often, and must be rapid.
An effort is made to separate code which depends on the data being handled from code which depends on the searching procedure. The search algorithm is involved only in steps (2) and (3), while knowledge of the actual data files is needed only by steps (1) and (4). Thus it is easy to adapt to different data files or different search algorithms.
To start with, it is necessary to have some way of selecting or generating keys from input files. For dealing with files that are basically English, we have a key-making program which automatically selects words and passes them to the hashing and sorting program (step 2). The format used has one line for each input item, arranged as follows:

    file:start,length (tab) key1 key2 key3 ...
These lines are the only input used to make the index. The first field (the file name, byte position, and byte count) is the tag of the item and can be used to retrieve it quickly. Normally, an item is either a whole file or a section of a file delimited by blank lines. After the tab, the second field contains the keys. The keys, if selected by the automatic program, are any alphanumeric strings which are not among the 100 most frequent words in English and which are not entirely numeric (except for four-digit numbers beginning 19, which are accepted as dates). Keys are truncated to six characters and converted to lower case. Some selection is needed if the original items are very large. We normally just take the first n keys, with n less than 100 or so; this replaces any attempt at intelligent selection. One file in our system is a complete English dictionary; it would presumably be retrieved for all queries.
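The selection rule just described is simple enough to sketch in a few lines of C. The fragment below is a minimal illustration, not the actual program: the is_common lookup is assumed to consult the common-word list, and the names are invented.

    #include <ctype.h>

    int is_common(const char *w);  /* assumed: lookup in the common-word list */

    /* Sketch of the key acceptance rule: lowercase the word, reject
     * common words and purely numeric strings (except 19xx dates),
     * and truncate survivors to six characters. */
    int accept_key(char *w)
    {
        int i, all_digits = 1;

        for (i = 0; w[i] != '\0'; i++) {
            if (!isdigit((unsigned char)w[i]))
                all_digits = 0;
            w[i] = tolower((unsigned char)w[i]);
        }
        if (all_digits && !(i == 4 && w[0] == '1' && w[1] == '9'))
            return 0;          /* entirely numeric and not a date */
        if (is_common(w))
            return 0;          /* among the most frequent English words */
        if (i > 6)
            w[6] = '\0';       /* truncate to six characters */
        return 1;
    }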
To generate an inverted index to the list of record tags and keys, the keys are hashed and sorted to produce an index. What is wanted, ideally, is a series of lists showing the tags associated with each key. To condense this, what is actually produced is a list showing the tags associated with each hash code, and thus with some set of keys. To speed up access and further save space, a set of three or possibly four files is produced. These files are:
    File        Contents
    entry       Pointers to posting file for each hash code
    posting     Lists of tag pointers for each hash code
    tag         Tags for each item
    key         Keys for each item (optional)

The posting file comprises the real data: it contains a sequence of lists of items posted under each hash code. To speed up searching, the entry file is an array of pointers into the posting file, one per potential hash code. Furthermore, the items in the lists in the posting file are not referred to by their complete tag, but just by an address in the tag file, which gives the complete tags. The key file is optional and contains a copy of the keys used in the indexing.
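In outline, a lookup through these files proceeds as in the following sketch. The C fragment is purely illustrative: the array names, the in-memory layout, and the sentinel pointer are assumptions made for clarity, not the actual file formats.

    #define NHASH   997        /* possible hash codes (see the -h option of inv) */
    #define MAXPOST 100000     /* illustrative capacity for the posting lists */

    long entry[NHASH + 1];     /* entry file: one pointer per hash code, plus
                                  a sentinel, giving offsets into the postings */
    long postings[MAXPOST];    /* posting file: for each hash code, a list of
                                  offsets into the tag file */

    /* Return the posting list for hash code h, and its length in *n. */
    long *postings_for(int h, int *n)
    {
        *n = (int)(entry[h + 1] - entry[h]);  /* length from adjacent pointers */
        return &postings[entry[h]];
    }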
The searching process starts with a query containing several keys. The goal is to obtain all items which were indexed under these keys. The query keys are hashed, and the pointers in the entry file are used to access the lists in the posting file. These lists are addresses in the tag file of documents posted under the hash codes derived from the query. The common items from all lists are determined; this set must include the items indexed by every key, but may also contain some false drops, since items referenced by the correct hash codes need not actually have contained the correct keys. Normally, if there are several keys in the query, there are not likely to be many false drops in the final combined list, even though each hash code is somewhat ambiguous. The actual tags are then obtained from the tag file, and to guard against the possibility that an item has false-dropped on some hash code in the query, the original items are normally obtained from the delivery program (4) and the query keys checked against them by string comparison.
Usually, therefore, the check for false drops is made against the original file. However, if the key derivation procedure is complex, it may be preferable to check against the keys fed to program (2). In this case the optional key file, which contains the keys associated with each item, is generated, and each item's tag is supplemented by a string of its keys, against which the query is checked.
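Whichever file is used for verification, the heart of the search is an intersection of the posting lists retrieved for the query keys, followed by the false-drop check. A sketch in C, under the assumption (not guaranteed by the real program) that each posting list is kept sorted by tag address:

    /* Intersect two sorted posting lists a[0..na-1] and b[0..nb-1];
     * the result in out[] still requires the false-drop check. */
    int intersect(const long *a, int na, const long *b, int nb, long *out)
    {
        int i = 0, j = 0, k = 0;

        while (i < na && j < nb) {
            if (a[i] < b[j])
                i++;
            else if (a[i] > b[j])
                j++;
            else {
                out[k++] = a[i];   /* posted under both hash codes */
                i++;
                j++;
            }
        }
        return k;
    }

A query with several keys applies this pairwise, starting from the shortest list; each additional key can only shrink the candidate set.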
There is also an option (-Cn) for coordination level searching. This retrieves items which match all but n of the query keys. The items are retrieved in the order of the number of keys that they match. Of course, n must be less than the number of query keys (nothing is retrieved unless it matches at least one key).
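Coordination-level searching replaces the strict intersection by a count: an item survives if it is posted under at least all-but-n of the query keys. A hypothetical sketch (the counting array and names are invented for illustration):

    /* Keep candidate items posted under at least (nkeys - n) query keys.
     * counts[i] is the number of query keys posting item i. */
    int coord_match(const int counts[], int nitems, int nkeys, int n, int keep[])
    {
        int i, k = 0;

        for (i = 0; i < nitems; i++)
            if (counts[i] >= nkeys - n)   /* item misses at most n keys */
                keep[k++] = i;
        return k;   /* the survivors are then delivered in decreasing
                       order of counts[i] */
    }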
As an example, consider one set of 4377 references, comprising 660,000 bytes. This included 51,000 keys, of which 5,900 were distinct. The hash table is kept full to save space (at the expense of time); 995 of 997 possible hash codes were used. The total set of index files (no key file) occupied 171,000 bytes, about 26% of the original file size. It took 8 minutes of processor time to hash, sort, and write the index. To search for a single query with the resulting index took 1.9 seconds of processor time, while finding the same paper with a sequential linear search using grep (reading all of the tags and keys) took 12.3 seconds of processor time.
We have also used this software to index all of the English text stored on our UNIX system. This is the index searched by the lookall command. On a typical day there were 29,000 files in our user file system, containing about 152,000,000 bytes. Of these, 5,300 files, containing 32,000,000 bytes (about 21%), were English text. The total number of `words' (determined mechanically) was 5,100,000. Of these, 227,000 were selected as keys; 19,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes. The resulting inverted file indexes used 845,000 bytes, or about 2.6% of the size of the original files. The indexes are particularly small because keys are taken from only the first 50 non-common words of some very long input files.
Even this large lookall index can be searched quickly. For example, to find this document by looking for the keys ``lesk inverted indexes'' required 1.7 seconds of processor plus system time. By comparison, just searching the 800,000-byte dictionary (smaller than even the inverted indexes, let alone the 32,000,000 bytes of text files) with grep takes 29 seconds of processor time. The lookall program is thus useful when looking for a document which you believe is stored on-line, but do not know where. For example, many memos from the Computing Science Research Center are in its UNIX file system, but it is often difficult to guess where a particular memo might be (it might have several authors, each with many directories, and have been worked on by a secretary with yet more directories). Instructions for the use of the lookall command are given in the manual section shown in the appendix to this memorandum.
The only indexes maintained routinely are those of publication lists and all English files. To make other indexes, the programs for making keys, sorting them, searching the indexes, and delivering answers must be used. Since they are usually invoked as parts of higher-level commands, they are not in the default command directory, but are available to any user in the directory /usr/lib/refer. Three programs are of interest: mkey, which isolates keys from input files; inv, which makes an index from a set of keys; and hunt, which searches the index and delivers the items. Note that the two parts of the retrieval phase are combined into one program, to avoid the excessive system work and delay which would result from running these as separate processes.
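As a hypothetical session (the file names are examples only; the options are described in the rest of this section), an index might be built and then searched like this:

    /usr/lib/refer/mkey papers | /usr/lib/refer/inv Index
    echo 'lesk inverted indexes' | /usr/lib/refer/mkey -s | /usr/lib/refer/hunt Index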
These three commands have a large number of options to adapt to different kinds of input. The user not interested in the detailed description that now follows may skip to section 3, which describes the refer program, a packaged-up version of these tools specifically oriented towards formatting references.
Make Keys. The program mkey is the key-making program corresponding to step (1) in phase A. Normally, it reads its input from the file names given as arguments; if there are no arguments, it reads from the standard input. It assumes that blank lines in the input delimit separate items, for each of which a different line of keys should be generated. The lines of keys are written on the standard output. Keys are any alphanumeric strings in the input not among the most frequent words in English and not entirely numeric (except that all-numeric strings are acceptable if they are between 1900 and 1999). In the output, keys are translated to lower case and truncated to six characters in length; any associated punctuation is removed. The following flag arguments are recognized by mkey:
    -c name   Name of the file of common words; the default is /usr/lib/eign.
    -f name   Read a list of files from name and take each as an input argument.
    -i chars  Ignore all lines which begin with `%' followed by any character in chars.
    -kn       Use at most n keys per input item.
    -ln       Ignore items shorter than n letters long.
    -nm       Ignore as a key any word in the first m words of the list of common
              English words. The default is 100.
    -s        Remove the labels from the output; just give the keys. Used when
              searching rather than indexing.
    -w        Each whole file is a separate item; blank lines in files are irrelevant.
The normal arguments for indexing references are the defaults, which are -c /usr/lib/eign, -n100, and -l3. For searching, the -s option is also needed. When the big lookall index of all English files is run, the options are -w, -k50, and -f (filelist). When running on textual input, the mkey program processes about 1000 English words per processor second. Unless the -k option is used (and the input files are long enough for it to take effect) the output of mkey is comparable in size to its input.
Hash and invert. The inv program computes the hash codes and writes the inverted files. It reads the output of mkey and writes the set of files described earlier in this section. It expects one argument, which is used as the base name for the three (or four) files to be written. Assuming an argument of Index (the default) the entry file is named Index.ia, the posting file Index.ib, the tag file Index.ic, and the key file (if present) Index.id. The inv program recognizes the following options:
    -a        Append the new keys to a previous set of inverted files, making new
              files if there is no old set using the same base name.
    -d        Write the optional key file. This is needed when you cannot check for
              false drops by looking for the keys in the original inputs, i.e. when
              the key derivation procedure is complicated and the output keys are
              not words from the input files.
    -hn       The hash table size is n (default 997); n should be prime. Making n
              bigger saves search time and spends disk space.
    -i[u] name  Take input from the file name instead of the standard input; if u is
              present, name is unlinked when the sort is started. Using this option
              permits the sort scratch space to overlap the disk space used for
              input keys.
    -n        Make a completely new set of inverted files, ignoring previous files.
    -p        Pipe into the sort program, rather than writing a temporary input
              file. This saves disk space and spends processor time.
    -v        Verbose mode; print a summary of the number of keys which finished
              indexing.
About half the time used in inv is in the contained sort. Assuming the sort is roughly linear, however, a guess at the total timing for inv is 250 keys per second. The space used is usually of more importance: the entry file uses four bytes per possible hash code (note the -h option), and the tag file around 15-20 bytes per item indexed. Roughly, the posting file contains one item for each key instance and one item for each possible hash code; the items are two bytes long if the tag file is less than 65,536 bytes long, and four bytes long otherwise. To minimize storage, the hash tables should be over-full; for most of the files indexed in this way, there is no other real choice, since the entry file must fit in memory.
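As a rough check (assuming 15 bytes per tag), these figures can be applied to the 4377-item reference file described earlier, with its 51,000 keys and 997 hash codes. The entry file takes 997 × 4, or about 4,000 bytes; the tag file about 4,377 × 15, or 65,000 bytes, which is just under the 65,536-byte limit, so posting items are two bytes; the posting file then takes roughly (51,000 + 997) × 2, or about 104,000 bytes. The total, about 173,000 bytes, agrees reasonably well with the 171,000 bytes reported above.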
Searching and Retrieving. The hunt program retrieves items from an index. It combines, as mentioned above, the two parts of phase (B): search and delivery. The reason why it is efficient to combine delivery and search is partly to avoid starting unnecessary processes, and partly because the delivery operation must be a part of the search operation in any case. Because of the hashing, the search part takes place in two stages: first items are retrieved which have the right hash codes associated with them, and then the actual items are inspected to determine false drops, i.e. to determine if anything with the right hash codes doesn't really have the right keys. Since the original item is retrieved to check on false drops, it is efficient to present it immediately, rather than only giving the tag as output and later retrieving the item again. If there were a separate key file, this argument would not apply, but separate key files are not common.
Input to hunt is taken from the standard input, one query per line. Each query should be in mkey -s output format: all lower case, no punctuation. The hunt program takes one argument which specifies the base name of the index files to be searched. Only one set of index files can be searched at a time, although many text files may be indexed as a group, of course. If one of the text files has been changed since the index was made, that file is searched with fgrep; this may occasionally slow down the searching, and care should be taken to avoid having many out-of-date files. The following option arguments are recognized by hunt:
    -a        Give all output; ignore checking for false drops.
    -Cn       Coordination level n: retrieve items with not more than n terms of the
              input missing; the default is -C0, implying that each search term must
              be in the output items.
    -F[ynd]   ``-Fy'' gives the text of all the items found; ``-Fn'' suppresses
              them. ``-Fd'' where d is an integer gives the text of the first d
              items. The default is -Fy.
    -g        Do not use fgrep to search files changed since the index was made;
              print an error comment instead.
    -i string Take string as input, instead of reading the standard input.
    -l n      The maximum length of internal lists of candidate items is n;
              default 1000.
    -o string Put text output (``-Fy'') in the file string; of use when invoked
              from another program.
    -p        Print hash code frequencies; mostly for use in optimizing hash
              table sizes.
    -T[ynd]   ``-Ty'' gives the tags of the items found; ``-Tn'' suppresses them.
              ``-Td'' where d is an integer gives the first d tags. The default
              is -Tn.
    -t string Put tag output (``-Ty'') in the file string; of use when invoked
              from another program.
The timing of hunt is complex. Normally the hash table is overfull, so that there will be many false drops on any single term; but a multi-term query will have few false drops on all terms. Thus if a query is underspecified (one search term), many potential items will be examined and discarded as false drops, wasting time. If the query is overspecified (a dozen search terms), many keys will be examined only to verify that the single item under consideration has that key posted. The variation of search time with number of keys is shown in the table below. Queries of varying length were constructed to retrieve a particular document from the file of references. In the sequence on the left, search terms were chosen so as to select the desired paper as quickly as possible. In the sequence on the right, terms were chosen inefficiently, so that the query did not uniquely select the desired document until four keys had been used. The same document was the target in each case, and the final set of eight keys is also identical; the differences at five, six, and seven keys are produced by measurement error, not by the slightly different key lists.
                 Efficient Keys             |            Inefficient Keys
    No.   Total drops   Retrieved  Search   |  No.   Total drops   Retrieved  Search
    keys  (incl. false) documents  time     |  keys  (incl. false) documents  time
                                   (sec.)   |                                 (sec.)
     1        15            3       1.27    |   1        68           55       5.96
     2         1            1       0.11    |   2        29           29       2.72
     3         1            1       0.14    |   3         8            8       0.95
     4         1            1       0.17    |   4         1            1       0.18
     5         1            1       0.19    |   5         1            1       0.21
     6         1            1       0.23    |   6         1            1       0.22
     7         1            1       0.27    |   7         1            1       0.26
     8         1            1       0.29    |   8         1            1       0.29

As would be expected, the optimal search is achieved when the query just specifies the answer; however, overspecification is quite cheap. Roughly, the time required by hunt can be approximated as 30 milliseconds per search key plus 75 milliseconds per dropped document (whether it is a false drop or a real answer). In general, overspecification can be recommended; it protects the user against additions to the data base which turn previously uniquely-answered queries into ambiguous queries.
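This approximation can be checked against the table: the one-key query on the left drops 15 documents, predicting 1 × 30 + 15 × 75, or about 1.2 seconds, against the 1.27 seconds measured; the eight-key queries drop one document, predicting 8 × 30 + 1 × 75, or about 0.3 seconds, against the 0.29 seconds observed.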
The careful reader will have noted an enormous discrepancy between these times and the earlier quoted time of around 1.9 seconds for a search. The times here are purely for the search and retrieval: they are measured by running many searches through a single invocation of the hunt program alone. Usually, the UNIX command processor (the shell) must start both the mkey and hunt processes for each query, and arrange for the output of mkey to be fed to the hunt program. This adds a fixed overhead of about 1.7 seconds of processor time to any single search. Furthermore, remember that all these times are processor times: on a typical morning on our PDP 11/70 system, with about one dozen people logged on, to obtain 1 second of processor time for the search program took between 2 and 12 seconds of real time, with a median of 3.9 seconds and a mean of 4.8 seconds. Thus, although the work involved in a single search may be only 200 milliseconds, after you add the 1.7 seconds of startup processor time and then assume a 4:1 elapsed/processor time ratio, it will be 8 seconds before any response is printed.