Search the Rodmaker's Archives - Documentation
I made a design decision to conserve disk space at the expense
of search ease and efficiency. Basically, searches of the
message subject and author fields are fairly fast and flexible,
but searches of the message text are slow and limited.
This is how the search works:
-
Since July 1996, the raw monthly archives have been available
from the Listserv site. These and some earlier archives are
available online from Jerry Foster's
Rodmakers web site. (Hats off to Jerry for having the foresight
to do this!) In addition, I kept a few messages
from months Jerry missed. These were the sources for this
searchable archive.
For more information on the Archives, try a search on
subject "Archives" ;-)
-
Every month (when I get around to it) I grab the last month's raw
archive file and process it as follows:
- The separate
messages are split from the file.
- Each message is
processed, identifying the date, author and subject line.
Other headers are discarded.
- Any attachments or contents that are not ASCII text (e.g.,
graphics or spreadsheets) are discarded. If you want
these you will have to go to the raw archives. HTML is converted
to plain text.
- Quoted material that
starts with ">" is reduced to its first 5 lines.
Text quoted with other identifiers is not reduced.
- Some messages are edited by hand to make them parse.
Since there are over 20000 messages (as of January 1999),
I had to try to automate these steps, and some weird
things happened along the way. One common problem is
that quoted material gets "wrapped", so that bare words are
left without the ">" in front. These come out looking
like little nonsense poems in the message body. I'm sure
some other mangling of messages occasionally occurs.
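
The processing steps above could be sketched roughly as follows. This is only an illustration in Python, not the actual archive script. It assumes the raw file separates messages mbox-style with lines starting with "From ", and that the 5-line limit applies to each unbroken run of ">" lines; the function names are hypothetical.

```python
import re
from email import message_from_string

# Assumption: messages in the raw monthly file are separated mbox-style,
# by lines starting with "From ". The real Listserv format may differ.
SEPARATOR = re.compile(r"^From ", re.MULTILINE)
MAX_QUOTED_LINES = 5  # quoted lines kept per run (assumed to be per run)


def split_messages(raw):
    """Split one month's raw archive text into individual message strings."""
    starts = [m.start() for m in SEPARATOR.finditer(raw)]
    ends = starts[1:] + [len(raw)]
    return [raw[s:e] for s, e in zip(starts, ends)]


def trim_quotes(body):
    """Keep only the first MAX_QUOTED_LINES lines of each run of '>' lines."""
    kept, run = [], 0
    for line in body.splitlines():
        run = run + 1 if line.startswith(">") else 0
        if run <= MAX_QUOTED_LINES:
            kept.append(line)
    return "\n".join(kept)


def process_message(text):
    """Return the date, author and subject plus the trimmed body."""
    msg = message_from_string(text)
    payload = msg.get_payload()
    # Multipart payloads (attachments) are simply dropped in this sketch.
    body = payload if isinstance(payload, str) else ""
    return {
        "date": msg.get("Date", ""),
        "author": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "body": trim_quotes(body),
    }
```

Note that a wrapped quote line without its ">" resets the run counter here, which is one way the "nonsense poems" described above can slip through.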
-
Two files are then generated for each month's archives: a
database file consisting of the subject, author, and date
for each message plus a pointer to a file in a ZIP archive,
and the ZIP archive itself, which contains the body of the messages for that month.
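
As a rough illustration of this two-file layout, the sketch below writes a tab-separated index and a ZIP archive for one month. The column order, the tab-separated format, the member naming scheme and the name `write_month` are all assumptions made for the example; the actual file formats are not described here.

```python
import csv
import zipfile


def write_month(messages, index_path, zip_path):
    """Write one month's index file and ZIP archive of message bodies.

    `messages` is a list of dicts with "date", "author", "subject" and
    "body" keys. The layout and member names are assumptions.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive, \
            open(index_path, "w", newline="", encoding="utf-8") as index_file:
        writer = csv.writer(index_file, delimiter="\t")
        for number, msg in enumerate(messages, start=1):
            member = f"{number:05d}.txt"  # the "pointer" stored in the index
            archive.writestr(member, msg["body"])
            writer.writerow([msg["subject"], msg["author"], msg["date"], member])
```

Keeping the index small and uncompressed is what lets subject and author lookups avoid touching the ZIP file at all.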
-
Searches of the Subject and Author fields can be done
fairly quickly on the database file for the month or
months. But searches of the message text are slow because
each individual message must be unzipped before searching.
If you search more than a few months at a time it will be
very slow. Searches of the Text will stop after one
year, just to show you what you are getting.
"Author" is just the message's "From:" field. This usually
contains the author's real name, but not always. Sometimes it
just contains an email name, and sometimes
the "From:" field is missing altogether.
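
The difference between the two kinds of search can be sketched like this. The example assumes a tab-separated index whose columns are subject, author, date and ZIP member name, and uses a simple case-insensitive substring match; both are assumptions for illustration, and the function names are hypothetical.

```python
import csv
import zipfile


def search_index(index_path, field, term):
    """Fast path: scan the small index file; no message body is opened."""
    column = {"subject": 0, "author": 1}[field]
    with open(index_path, newline="", encoding="utf-8") as index_file:
        return [row for row in csv.reader(index_file, delimiter="\t")
                if term.lower() in row[column].lower()]


def search_text(index_path, zip_path, term):
    """Slow path: every message body is decompressed before it is searched."""
    hits = []
    with open(index_path, newline="", encoding="utf-8") as index_file, \
            zipfile.ZipFile(zip_path) as archive:
        for row in csv.reader(index_file, delimiter="\t"):
            body = archive.read(row[3]).decode("utf-8", errors="replace")
            if term.lower() in body.lower():
                hits.append(row)
    return hits
```

A message with a missing "From:" field simply has an empty author column in this sketch, so it can never match an author search.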
-
Searches proceed in roughly chronological order.
If a search generates 100 "hits", the remainder of
that month will be processed but then the search will be
terminated. The search can easily be resumed where it left off
by changing the "Beginning" month as specified. Since there
have been over 1500 messages a month recently, the list of matches
may still be long.
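
The stopping rule might look something like the sketch below: months are searched in order, the month that reaches the limit is always finished, and the next month is reported so the search can be resumed from there. The names `run_search` and `search_month` are hypothetical; only the limit of 100 hits comes from the description above.

```python
HIT_LIMIT = 100  # from the text: stop once a search reaches 100 hits


def run_search(months, search_month, limit=HIT_LIMIT):
    """Search months in order; finish the month that reaches the limit.

    `months` is an ordered list of month labels and `search_month` is a
    hypothetical callable returning the hits for one month. Returns the hits
    and the month to use as the new "Beginning" month (None when done).
    """
    hits = []
    for position, month in enumerate(months):
        hits.extend(search_month(month))  # the whole month is always processed
        if len(hits) >= limit:
            remaining = months[position + 1:]
            return hits, (remaining[0] if remaining else None)
    return hits, None
```

Because the last month is never cut short, a busy month can push the result list well past the limit.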
-
I hope to be able to keep adding to the archives as long as
disk space is not a problem. If someone has a way to
separate out uninformative messages completely, without
reading every one, send it my way.
Send comments and suggestions to
stetzer@uwm.edu.