The indexing and searching process is divided into two phases, each made of two parts, as follows:

A. Construct the index.
   (1) Find keys: turn each input file into a sequence of tags and keys.
   (2) Hash and sort: prepare the inverted index files from the keys.

B. Retrieve items in response to a query.
   (3) Search: given some query keys, look through the index files and find the tags of the matching items.
   (4) Deliver: given the tags, fetch and print the original items.
The first phase, making the index, is presumably done relatively infrequently. It should, of course, be done whenever the data being indexed change. In contrast, the second phase, retrieving items, is presumably done often, and must be rapid.
An effort is made to separate code which depends on the data being handled from code which depends on the searching procedure. The search algorithm is involved only in steps (2) and (3), while knowledge of the actual data files is needed only by steps (1) and (4). Thus it is easy to adapt to different data files or different search algorithms.
To start with, it is necessary to have some way of selecting or generating keys from input files. For dealing with files that are basically English, we have a key-making program which automatically selects words and passes them to the hashing and sorting program (step 2). The format used has one line for each input item, arranged as follows:

    file:start,length (tab) key1 key2 key3 ...
These lines are the only input used to make the index. The first field (the file name, byte position, and byte count) is the tag of the item and can be used to retrieve it quickly. Normally, an item is either a whole file or a section of a file delimited by blank lines. After the tab, the second field contains the keys. The keys, if selected by the automatic program, are any alphanumeric strings which are not among the 100 most frequent words in English and which are not entirely numeric (except for four-digit numbers beginning 19, which are accepted as dates). Keys are truncated to six characters and converted to lower case. Some selection is needed if the original items are very large. We normally just take the first n keys, with n less than 100 or so; this replaces any attempt at intelligent selection. One file in our system is a complete English dictionary; it would presumably be retrieved for all queries.
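The selection rule just described is simple enough to sketch in a few lines of C. The fragment below is a minimal illustration, not the actual program: the is_common lookup is assumed to consult the common-word list, and the names are invented.

    #include <ctype.h>

    int is_common(const char *w);  /* assumed: lookup in the common-word list */

    /* Sketch of the key acceptance rule: lowercase the word, reject
     * common words and purely numeric strings (except 19xx dates),
     * and truncate survivors to six characters. */
    int accept_key(char *w)
    {
        int i, all_digits = 1;

        for (i = 0; w[i] != '\0'; i++) {
            if (!isdigit((unsigned char)w[i]))
                all_digits = 0;
            w[i] = tolower((unsigned char)w[i]);
        }
        if (all_digits && !(i == 4 && w[0] == '1' && w[1] == '9'))
            return 0;          /* entirely numeric and not a date */
        if (is_common(w))
            return 0;          /* among the most frequent English words */
        if (i > 6)
            w[6] = '\0';       /* truncate to six characters */
        return 1;
    }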
To generate an inverted index to the list of record tags and keys, the keys are hashed and sorted to produce an index. What is wanted, ideally, is a series of lists showing the tags associated with each key. To condense this, what is actually produced is a list showing the tags associated with each hash code, and thus with some set of keys. To speed up access and further save space, a set of three or possibly four files is produced. These files are:
    File        Contents
    entry       Pointers to posting file for each hash code
    posting     Lists of tag pointers for each hash code
    tag         Tags for each item
    key         Keys for each item (optional)

The posting file comprises the real data: it contains a sequence of lists of items posted under each hash code. To speed up searching, the entry file is an array of pointers into the posting file, one per potential hash code. Furthermore, the items in the lists in the posting file are not referred to by their complete tag, but just by an address in the tag file, which gives the complete tags. The key file is optional and contains a copy of the keys used in the indexing.
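In outline, a lookup through these files proceeds as in the following sketch. The C fragment is purely illustrative: the array names, the in-memory layout, and the sentinel pointer are assumptions made for clarity, not the actual file formats.

    #define NHASH   997        /* possible hash codes (see the -h option of inv) */
    #define MAXPOST 100000     /* illustrative capacity for the posting lists */

    long entry[NHASH + 1];     /* entry file: one pointer per hash code, plus
                                  a sentinel, giving offsets into the postings */
    long postings[MAXPOST];    /* posting file: for each hash code, a list of
                                  offsets into the tag file */

    /* Return the posting list for hash code h, and its length in *n. */
    long *postings_for(int h, int *n)
    {
        *n = (int)(entry[h + 1] - entry[h]);  /* length from adjacent pointers */
        return &postings[entry[h]];
    }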
The searching process starts with a query containing several keys. The goal is to obtain all items which were indexed under these keys. The query keys are hashed, and the pointers in the entry file are used to access the lists in the posting file. These lists are addresses in the tag file of documents posted under the hash codes derived from the query. The common items from all lists are determined; this set must include the items indexed by every key, but may also contain some false drops, since items referenced by the correct hash codes need not actually have contained the correct keys. Normally, if there are several keys in the query, there are not likely to be many false drops in the final combined list, even though each hash code is somewhat ambiguous. The actual tags are then obtained from the tag file, and to guard against the possibility that an item has false-dropped on some hash code in the query, the original items are normally obtained from the delivery program (4) and the query keys checked against them by string comparison.
Usually, therefore, the check for false drops is made against the original file. However, if the key derivation procedure is complex, it may be preferable to check against the keys fed to program (2). In this case the optional key file, which contains the keys associated with each item, is generated, and each item's tag is supplemented by a string of its keys, against which the query is checked.
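Whichever file is used for verification, the heart of the search is an intersection of the posting lists retrieved for the query keys, followed by the false-drop check. A sketch in C, under the assumption (not guaranteed by the real program) that each posting list is kept sorted by tag address:

    /* Intersect two sorted posting lists a[0..na-1] and b[0..nb-1];
     * the result in out[] still requires the false-drop check. */
    int intersect(const long *a, int na, const long *b, int nb, long *out)
    {
        int i = 0, j = 0, k = 0;

        while (i < na && j < nb) {
            if (a[i] < b[j])
                i++;
            else if (a[i] > b[j])
                j++;
            else {
                out[k++] = a[i];   /* posted under both hash codes */
                i++;
                j++;
            }
        }
        return k;
    }

A query with several keys applies this pairwise, starting from the shortest list; each additional key can only shrink the candidate set.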
There is also an option (-Cn) for coordination level searching. This retrieves items which match all but n of the query keys. The items are retrieved in the order of the number of keys that they match. Of course, n must be less than the number of query keys (nothing is retrieved unless it matches at least one key).
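Coordination-level searching replaces the strict intersection by a count: an item survives if it is posted under at least all-but-n of the query keys. A hypothetical sketch (the counting array and names are invented for illustration):

    /* Keep candidate items posted under at least (nkeys - n) query keys.
     * counts[i] is the number of query keys posting item i. */
    int coord_match(const int counts[], int nitems, int nkeys, int n, int keep[])
    {
        int i, k = 0;

        for (i = 0; i < nitems; i++)
            if (counts[i] >= nkeys - n)   /* item misses at most n keys */
                keep[k++] = i;
        return k;   /* the survivors are then delivered in decreasing
                       order of counts[i] */
    }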
As an example, consider one set of 4377 references, comprising 660,000 bytes. This included 51,000 keys, of which 5,900 were distinct. The hash table is kept full to save space (at the expense of time); 995 of 997 possible hash codes were used. The total set of index files (no key file) occupied 171,000 bytes, about 26% of the original file size. It took 8 minutes of processor time to hash, sort, and write the index. To search for a single query with the resulting index took 1.9 seconds of processor time, while finding the same paper with a sequential linear search using grep (reading all of the tags and keys) took 12.3 seconds of processor time.
We have also used this software to index all of the English text stored on our UNIX system. This is the index searched by the lookall command. On a typical day there were 29,000 files in our user file system, containing about 152,000,000 bytes. Of these, 5,300 files, containing 32,000,000 bytes (about 21%), were English text. The total number of `words' (determined mechanically) was 5,100,000. Of these, 227,000 were selected as keys; 19,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes. The resulting inverted file indexes used 845,000 bytes, or about 2.6% of the size of the original files. The indexes are particularly small because keys are taken from only the first 50 non-common words of some very long input files.
Even this large lookall index can be searched quickly. For example, to find this document by looking for the keys ``lesk inverted indexes'' required 1.7 seconds of processor plus system time. By comparison, just searching the 800,000-byte dictionary (smaller than even the inverted indexes, let alone the 32,000,000 bytes of text files) with grep takes 29 seconds of processor time. The lookall program is thus useful when looking for a document which you believe is stored on-line, but do not know where. For example, many memos from the Computing Science Research Center are in its UNIX file system, but it is often difficult to guess where a particular memo might be (it might have several authors, each with many directories, and have been worked on by a secretary with yet more directories). Instructions for the use of the lookall command are given in the manual section shown in the appendix to this memorandum.
The only indexes maintained routinely are those of publication lists and all English files. To make other indexes, the programs for making keys, sorting them, searching the indexes, and delivering answers must be used. Since they are usually invoked as parts of higher-level commands, they are not in the default command directory, but are available to any user in the directory /usr/lib/refer. Three programs are of interest: mkey, which isolates keys from input files; inv, which makes an index from a set of keys; and hunt, which searches the index and delivers the items. Note that the two parts of the retrieval phase are combined into one program, to avoid the excessive system work and delay which would result from running these as separate processes.
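As a hypothetical session (the file names are examples only; the options are described in the rest of this section), an index might be built and then searched like this:

    /usr/lib/refer/mkey papers | /usr/lib/refer/inv Index
    echo 'lesk inverted indexes' | /usr/lib/refer/mkey -s | /usr/lib/refer/hunt Index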
These three commands have a large number of options to adapt to different kinds of input. The user not interested in the detailed description that now follows may skip to section 3, which describes the refer program, a packaged-up version of these tools specifically oriented towards formatting references.
Make Keys. The program mkey is the key-making program corresponding to step (1) in phase A. Normally, it reads its input from the file names given as arguments; if there are no arguments, it reads from the standard input. It assumes that blank lines in the input delimit separate items, for each of which a different line of keys should be generated. The lines of keys are written on the standard output. Keys are any alphanumeric strings in the input not among the most frequent words in English and not entirely numeric (except that all-numeric strings are acceptable if they are between 1900 and 1999). In the output, keys are translated to lower case and truncated to six characters in length; any associated punctuation is removed. The following flag arguments are recognized by mkey:
    -c name   Name of the file of common words; the default is /usr/lib/eign.
    -f name   Read a list of files from name and take each as an input argument.
    -i chars  Ignore all lines which begin with `%' followed by any character in chars.
    -kn       Use at most n keys per input item.
    -ln       Ignore items shorter than n letters long.
    -nm       Ignore as a key any word in the first m words of the list of common
              English words. The default is 100.
    -s        Remove the labels from the output; just give the keys. Used when
              searching rather than indexing.
    -w        Each whole file is a separate item; blank lines in files are irrelevant.
The normal arguments for indexing references are the defaults, which are -c /usr/lib/eign, -n100, and -l3. For searching, the -s option is also needed. When the big lookall index of all English files is run, the options are -w, -k50, and -f (filelist). When running on textual input, the mkey program processes about 1000 English words per processor second. Unless the -k option is used (and the input files are long enough for it to take effect) the output of mkey is comparable in size to its input.
Hash and invert. The inv program computes the hash codes and writes the inverted files. It reads the output of mkey and writes the set of files described earlier in this section. It expects one argument, which is used as the base name for the three (or four) files to be written. Assuming an argument of Index (the default) the entry file is named Index.ia, the posting file Index.ib, the tag file Index.ic, and the key file (if present) Index.id. The inv program recognizes the following options:
    -a        Append the new keys to a previous set of inverted files, making new
              files if there is no old set using the same base name.
    -d        Write the optional key file. This is needed when you cannot check for
              false drops by looking for the keys in the original inputs, i.e. when
              the key derivation procedure is complicated and the output keys are
              not words from the input files.
    -hn       The hash table size is n (default 997); n should be prime. Making n
              bigger saves search time and spends disk space.
    -i[u] name  Take input from the file name instead of the standard input; if u is
              present, name is unlinked when the sort is started. Using this option
              permits the sort scratch space to overlap the disk space used for
              input keys.
    -n        Make a completely new set of inverted files, ignoring previous files.
    -p        Pipe into the sort program, rather than writing a temporary input
              file. This saves disk space and spends processor time.
    -v        Verbose mode; print a summary of the number of keys which finished
              indexing.
About half the time used in inv is in the contained sort. Assuming the sort is roughly linear, however, a guess at the total timing for inv is 250 keys per second. The space used is usually of more importance: the entry file uses four bytes per possible hash code (note the -h option), and the tag file around 15-20 bytes per item indexed. Roughly, the posting file contains one item for each key instance and one item for each possible hash code; the items are two bytes long if the tag file is less than 65,536 bytes long, and four bytes long otherwise. To minimize storage, the hash tables should be over-full; for most of the files indexed in this way, there is no other real choice, since the entry file must fit in memory.
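As a rough check (assuming 15 bytes per tag), these figures can be applied to the 4377-item reference file described earlier, with its 51,000 keys and 997 hash codes. The entry file takes 997 × 4, or about 4,000 bytes; the tag file about 4,377 × 15, or 65,000 bytes, which is just under the 65,536-byte limit, so posting items are two bytes; the posting file then takes roughly (51,000 + 997) × 2, or about 104,000 bytes. The total, about 173,000 bytes, agrees reasonably well with the 171,000 bytes reported above.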
Searching and Retrieving. The hunt program retrieves items from an index. It combines, as mentioned above, the two parts of phase (B): search and delivery. The reason why it is efficient to combine delivery and search is partly to avoid starting unnecessary processes, and partly because the delivery operation must be a part of the search operation in any case. Because of the hashing, the search part takes place in two stages: first items are retrieved which have the right hash codes associated with them, and then the actual items are inspected to determine false drops, i.e. to determine if anything with the right hash codes doesn't really have the right keys. Since the original item is retrieved to check on false drops, it is efficient to present it immediately, rather than only giving the tag as output and later retrieving the item again. If there were a separate key file, this argument would not apply, but separate key files are not common.
Input to hunt is taken from the standard input, one query per line. Each query should be in mkey -s output format: all lower case, no punctuation. The hunt program takes one argument which specifies the base name of the index files to be searched. Only one set of index files can be searched at a time, although many text files may be indexed as a group, of course. If one of the text files has been changed since the index was made, that file is searched with fgrep; this may occasionally slow down the searching, and care should be taken to avoid having many out-of-date files. The following option arguments are recognized by hunt:
    -a        Give all output; ignore checking for false drops.
    -Cn       Coordination level n: retrieve items with not more than n terms of the
              input missing; the default is -C0, implying that each search term must
              be in the output items.
    -F[ynd]   ``-Fy'' gives the text of all the items found; ``-Fn'' suppresses
              them. ``-Fd'' where d is an integer gives the text of the first d
              items. The default is -Fy.
    -g        Do not use fgrep to search files changed since the index was made;
              print an error comment instead.
    -i string Take string as input, instead of reading the standard input.
    -l n      The maximum length of internal lists of candidate items is n;
              default 1000.
    -o string Put text output (``-Fy'') in the file string; of use when invoked
              from another program.
    -p        Print hash code frequencies; mostly for use in optimizing hash
              table sizes.
    -T[ynd]   ``-Ty'' gives the tags of the items found; ``-Tn'' suppresses them.
              ``-Td'' where d is an integer gives the first d tags. The default
              is -Tn.
    -t string Put tag output (``-Ty'') in the file string; of use when invoked
              from another program.
The timing of hunt is complex. Normally the hash table is overfull, so that there will be many false drops on any single term; but a multi-term query will have few false drops on all terms. Thus if a query is underspecified (one search term), many potential items will be examined and discarded as false drops, wasting time. If the query is overspecified (a dozen search terms), many keys will be examined only to verify that the single item under consideration has that key posted. The variation of search time with number of keys is shown in the table below. Queries of varying length were constructed to retrieve a particular document from the file of references. In the sequence on the left, search terms were chosen so as to select the desired paper as quickly as possible. In the sequence on the right, terms were chosen inefficiently, so that the query did not uniquely select the desired document until four keys had been used. The same document was the target in each case, and the final set of eight keys is also identical; the differences at five, six, and seven keys are produced by measurement error, not by the slightly different key lists.
                 Efficient Keys             |            Inefficient Keys
    No.   Total drops   Retrieved  Search   |  No.   Total drops   Retrieved  Search
    keys  (incl. false) documents  time     |  keys  (incl. false) documents  time
                                   (sec.)   |                                 (sec.)
     1        15            3       1.27    |   1        68           55       5.96
     2         1            1       0.11    |   2        29           29       2.72
     3         1            1       0.14    |   3         8            8       0.95
     4         1            1       0.17    |   4         1            1       0.18
     5         1            1       0.19    |   5         1            1       0.21
     6         1            1       0.23    |   6         1            1       0.22
     7         1            1       0.27    |   7         1            1       0.26
     8         1            1       0.29    |   8         1            1       0.29

As would be expected, the optimal search is achieved when the query just specifies the answer; however, overspecification is quite cheap. Roughly, the time required by hunt can be approximated as 30 milliseconds per search key plus 75 milliseconds per dropped document (whether it is a false drop or a real answer). In general, overspecification can be recommended; it protects the user against additions to the data base which turn previously uniquely-answered queries into ambiguous queries.
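This approximation can be checked against the table: the one-key query on the left drops 15 documents, predicting 1 × 30 + 15 × 75, or about 1.2 seconds, against the 1.27 seconds measured; the eight-key queries drop one document, predicting 8 × 30 + 1 × 75, or about 0.3 seconds, against the 0.29 seconds observed.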
The careful reader will have noted an enormous discrepancy between these times and the earlier quoted time of around 1.9 seconds for a search. The times here are purely for the search and retrieval: they are measured by running many searches through a single invocation of the hunt program alone. Usually, the UNIX command processor (the shell) must start both the mkey and hunt processes for each query, and arrange for the output of mkey to be fed to the hunt program. This adds a fixed overhead of about 1.7 seconds of processor time to any single search. Furthermore, remember that all these times are processor times: on a typical morning on our PDP 11/70 system, with about one dozen people logged on, to obtain 1 second of processor time for the search program took between 2 and 12 seconds of real time, with a median of 3.9 seconds and a mean of 4.8 seconds. Thus, although the work involved in a single search may be only 200 milliseconds, after you add the 1.7 seconds of startup processor time and then assume a 4:1 elapsed/processor time ratio, it will be 8 seconds before any response is printed.