Updated 15 Feb 2001

WIRKSWORTH Parish Records 1600-1900



It is hoped that this article will shortly appear in the magazine
published by SOIGG (Society of Indexers Genealogical Group).

Compiled, indexed, formatted and copyright © 2001, . All Rights Reserved.

Large Scale Indexing on the Web using Hyperlinks

by John Palmer, Dorset, England


Sometimes very large text documents contain a large number of names. The Bible is an example. To find the name you want, it is essential to have an Index, with the names in alphabetical order, each name having the page where it can be found. It is left to the user to find a name in the Index, read the page number, go to that page in the Listing, then find the name in that page. Such Indexes have been compiled and ordered by hand, but it is very tedious and prone to error. The Listing may fill a large book, and the Index at the back makes it even larger. The User finds the whole procedure tiring and is not encouraged to browse or return.

Computers offer an alternative solution. These days all computers are sold with Web browser software (Explorer or Navigator) which can read HTML (the language of the Web). HTML uses hyperlinks, which are what make HTML special and not just another display system. Computers can also read CDs, which are cheap (30p) and can store 650 MB of text, enough to hold the Bible many times over. Finally, computers can run database software, which is able to search the Bible for one word, or put all its words into alphabetical order, in a few seconds.

These five elements used together - computer, browser, database, CD and hyperlink - enable Large Scale Indexing (LSI) to be done quickly, cheaply and reliably. One example of LSI compiled by the Author is given in detail below. The same general idea can be applied to most large text listings. The Author has hyperlinked the following LSIs, which can be viewed on www.wirksworth.org.uk and on CD.

         Subject                   Index             Listing
         Parish Registers           6,200 surnames   88,000 records
         Census 1841               15,500 names      18,500 lines
         Census 1851               16,000 names      19,500 lines
         Census 1881               17,500 names      21,500 lines
         Memorial Inscriptions      3,500 names       1,300 records
         Ince's Pedigrees          20,000 names      50,000 lines
         Churchwardens Accounts     4,500 names      10,000 lines

Each hyperlink has two parts: one at the location to jump from (the anchor) and one at the location to jump to (the target). The anchor contains the location of the target (file and line), and needs about 27 characters of HTML code. The target contains the line number within a file, and needs about 17. The anchor is placed around a name in the Index; the target is placed next to the matching name in the Listing. For example:
Anchor code:      (A HREF=file1.htm#line)DOXEY William(/A)
Target code:      (A NAME=line)(/A)William DOXEY
                  (for actual code, replace round brackets with angle brackets)
HTML does the rest. The hyperlink characters are invisible in the display, but the Index anchor turns the name blue and underlines it. If the anchor is clicked, the display jumps directly to the Listing target, which is just what's wanted.
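
The anchor and target pair can be sketched in a few lines of Python. This is only an illustration, not the Author's DPL code: the function names and the line number 1234 are invented for the example, and the attribute values are quoted, which adds two characters to each of the lengths given above.

```python
from html import escape

def make_anchor(filename, line_no, label):
    """Index side: wrap a name in a link that points at file#line."""
    return f'<A HREF="{filename}#{line_no}">{escape(label)}</A>'

def make_target(line_no, text):
    """Listing side: put an empty named marker in front of the line."""
    return f'<A NAME="{line_no}"></A>{escape(text)}'

# Hypothetical line number, chosen only for the example.
anchor = make_anchor("file1.htm", 1234, "DOXEY William")
target = make_target(1234, "William DOXEY")
print(anchor)
print(target)
```

Clicking the text produced by make_anchor in the Index would take the browser to the line produced by make_target in file1.htm.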

Hyperlinks in very large numbers can be encoded and placed in position within the text by databases. This is the crucial facility which makes Large Scale Indexing on the Web possible. Hyperlinks in small numbers can easily be edited and relocated using a Word Processor. This enables web pages to be corrected when the inevitable error is uncovered. So it is easy to keep web pages up-to-date.

"Pedigrees" is an 1850s text work containing around 50,000 lines, half a million words and 20,000 names. An alphabetical list of all names was required, enabling the user to scroll down the list, click on a chosen name, and immediately jump to the same name in the Listing in its context.

"Pedigrees" was imported directly into a "List" database, with each of the 50,000 lines numbered in sequence and the display indexed on the line number. Since all names were in upper case, it was possible to locate each name and identify its line number. A straightforward program was written in DPL (Data Processing Language) within the database to do this task. Each of the 20,000 names found, with its line number, was then copied and exported to a separate "Index" database, which was indexed alphabetically.
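
The name-finding step might look something like the following Python sketch. It is not the DPL program itself, which is not reproduced here. It assumes that a name is any word of two or more capital letters, and the three sample lines are made up for the example.

```python
import re

# Assumption: a name is a whole word of two or more capital letters (e.g. DOXEY).
NAME_PATTERN = re.compile(r"\b[A-Z]{2,}\b")

def build_index(lines):
    """Return (name, line_number) pairs in alphabetical order."""
    entries = []
    for line_no, text in enumerate(lines, start=1):  # lines numbered from 1
        for name in NAME_PATTERN.findall(text):
            entries.append((name, line_no))
    return sorted(entries)  # by name, then by line number

# Invented sample lines, not taken from "Pedigrees".
listing = [
    "William DOXEY married Ann BUXTON",
    "their children were baptised at Wirksworth",
    "John DOXEY, son of William",
]
index = build_index(listing)
for name, line_no in index:
    print(name, line_no)
```

A real listing would need a stricter rule than this, since any other word printed in capitals (a Roman numeral, for instance) would also be picked up as a name.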

Each line in the List containing a name was given a target bearing the line number. Each name in the Index was given an anchor bearing the target details. DPL programs using string functions were used for this task. Then each line in the List database was exported to a word processor file in line number order, and each record in the Index database was exported to a word processor file in name order. Exporting was done via ASCII text files, and the List and Index HTML files were then produced directly from the word processor files.

Because the Index and Listing HTML files were planned for use on a website, with inherent phoneline downloading delays and user patience limited to around 10 seconds, it was necessary to have a maximum web file size of about 60k. This meant breaking the Listing into 25 separate files (each with a different filename) and the Index into 5 separate files. This affected the anchor code, which had to contain the target line number as well as the name of the file holding that line. This was handled in the database, where each line number was stored together with the name of the file containing it, using the database's order, search and edit facilities. Bulk editing ensured no errors.
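
As an illustration of how a line number can carry its filename, the Python sketch below assumes that the 50,000 lines are divided equally, 2,000 to a file, and that the files are called list01.htm to list25.htm. Both the equal split and the filenames are assumptions made for the example. The actual split was governed by file size, and the filenames used are not given here.

```python
LINES_PER_FILE = 2000  # 50,000 lines / 25 files, assuming equal-sized files

def file_for_line(line_no, lines_per_file=LINES_PER_FILE):
    """Return the (hypothetical) filename of the Listing file holding a line."""
    if line_no < 1:
        raise ValueError("line numbers start at 1")
    file_no = (line_no - 1) // lines_per_file + 1
    return f"list{file_no:02d}.htm"

def anchor_for(name, line_no):
    """Build an Index anchor that names both the file and the line."""
    return f'<A HREF="{file_for_line(line_no)}#{line_no}">{name}</A>'

example = anchor_for("DOXEY William", 2001)
print(example)
```

Line 2000 falls in the first file and line 2001 in the second, so the anchor for a name on line 2001 points into list02.htm.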

Finally, an HTML file was constructed containing the letters of the alphabet A to Z, each with an anchor. The matching target was written into the Index file, above the section carrying the surnames beginning with that letter.
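
The letter page and its targets could be produced along the following lines. This Python sketch simplifies matters by assuming a single Index file called index.htm, whereas the real Index was split into 5 files. The filename, function names and sample names are all invented for the example.

```python
from itertools import groupby

def letter_bar(letters, index_file="index.htm"):
    """Build one anchor per letter, each pointing into the Index file."""
    return " ".join(
        f'<A HREF="{index_file}#{letter}">{letter}</A>' for letter in letters
    )

def add_letter_targets(sorted_names):
    """Put a target line above each run of names sharing an initial letter."""
    out = []
    for letter, group in groupby(sorted_names, key=lambda name: name[0]):
        out.append(f'<A NAME="{letter}"></A>')
        out.extend(group)
    return out

# Invented sample entries, already in alphabetical order.
names = ["BUXTON Ann", "DOXEY John", "DOXEY William"]
bar = letter_bar("BD")
index_lines = add_letter_targets(names)
print(bar)
print("\n".join(index_lines))
```

With the Index split across several files, each letter would also need to be matched to the file holding its section, in the same way as line numbers are matched to Listing files.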

The result is that the User sees the letters A to Z. If he clicks on D, he immediately sees an alphabetical list of all the names whose surname begins with D. If he scrolls down until he finds "William DOXEY" and clicks on that name, he goes straight to the List at the position where that name appears. On the Internet, there is a delay of up to 10 seconds after each click while a new 60k file is downloaded. But if the web files are stored on a CD being read by the computer, the jump appears instantaneous. Clicking on the browser's "Back" icon reverses the display sequence.

On CD or on the web, this speed of access makes the List appear much smaller than its real size, and the User is happy and astonished at the new information he keeps finding. He does not get tired so easily, and is encouraged to browse through the Index and List, making unexpected discoveries and connections. His concentration is better, his attention span is longer, and he is encouraged to return. All these plus points seem to justify the effort of hyperlinking.
