data, e.g. for forensics search
purposes (see below).
After indexing, a full-text
search, even across millions of documents, typically takes less than a second.
While indexing a very large collection of documents for the first time may be
time-consuming, subsequent updates of the index are usually much faster. dtSearch,
for example, simply checks the file modification dates of all indexed files, and
only reindexes those files that have been added, deleted or changed since the
last index update. (While the text retrieval terminology here relies on the dtSearch
product line, the concepts in this article are generally applicable.)
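
The modification-date check described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not dtSearch's actual implementation; the function name plan_index_update and the dictionary of recorded modification times are assumptions made for the example.

```python
import os

def plan_index_update(indexed_mtimes, root):
    """Decide which files need attention since the last index update.

    indexed_mtimes: dict mapping file path -> modification time recorded
                    at the last update (hypothetical bookkeeping structure).
    Returns (added, changed, deleted, current), where `current` is the new
    snapshot to store for the next update.
    """
    current = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            current[path] = os.path.getmtime(path)

    # New files: present now, unknown to the index.
    added = sorted(p for p in current if p not in indexed_mtimes)
    # Changed files: known to the index, but with a different modification time.
    changed = sorted(p for p in current
                     if p in indexed_mtimes and current[p] != indexed_mtimes[p])
    # Deleted files: known to the index, no longer on disk.
    deleted = sorted(p for p in indexed_mtimes if p not in current)
    return added, changed, deleted, current
```

Only the files in the added and changed lists would need to be reindexed, and the entries in deleted removed, which is why an update is usually much cheaper than the first full indexing pass.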
In addition to enabling precision boolean searching, an index can also store such
information as word positions, enabling word or phrase proximity searching. An
index can also hold information about word frequency and distribution, enabling
computation of natural language relevancy rankings across a document collection.
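
To make the role of word positions and word frequency concrete, here is a minimal Python sketch of a positional index. It is a toy under stated assumptions, not how any particular product stores its index: the function names (build_index, within, rarity_weight) are hypothetical, and the weighting shown is the common inverse document frequency formula, used here as one plausible stand-in for a relevancy calculation.

```python
import math
import re
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def within(index, first, second, max_distance):
    """Ids of documents where the two words occur within max_distance words."""
    hits = set()
    shared = index.get(first, {}).keys() & index.get(second, {}).keys()
    for doc_id in shared:
        if any(abs(p - q) <= max_distance
               for p in index[first][doc_id]
               for q in index[second][doc_id]):
            hits.add(doc_id)
    return hits

def rarity_weight(index, word, total_docs):
    """Inverse document frequency: words found in fewer documents weigh more."""
    doc_count = len(index.get(word, {}))
    return math.log(total_docs / doc_count) if doc_count else 0.0
```

With a weight of this kind, a word that appears in every document contributes nothing to the ranking, while a word that appears in only a handful of documents contributes a great deal.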
If the company name appears in two million documents, it would get a low relevancy
ranking. If the latest marketing terminology appears in only four documents, it
would get a much higher relevancy ranking. In that way, PR could,
for example, enter a whole paragraph of proposed text for a press release as
a natural language search, and zoom right in on the most relevant documents. But
full-text searching, whether boolean, natural language, or otherwise, is only
part of the text retrieval answer. Suppose HR wants to limit its search to documents
with an HR executive designation. This type of fielded data classification can
result from fields or metadata inside a document, or from an overlaying document
management-type application. With the latter, fielded data classification can
rely on associated database entries, such as SQL or XML, or the addition of fields
"on the fly" during the indexing process.
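
As a rough illustration of combining a full-text match with a fielded restriction, consider the Python sketch below. The document layout, the field name "designation" and the function search_with_field are all hypothetical, chosen only to mirror the HR example; a real system would typically draw the field values from document metadata or an associated database.

```python
def search_with_field(documents, word, field=None, value=None):
    """Return ids of documents containing `word`, optionally limited by a field.

    documents: dict of doc_id -> {"text": str, "fields": dict}
    """
    results = []
    for doc_id, doc in documents.items():
        # Full-text part: the word must occur in the document text.
        if word.lower() not in doc["text"].lower().split():
            continue
        # Fielded part: if a field is given, its value must match as well.
        if field is not None and doc["fields"].get(field) != value:
            continue
        results.append(doc_id)
    return results

documents = {
    "memo1": {"text": "Revised bonus policy", "fields": {"designation": "HR executive"}},
    "memo2": {"text": "Bonus announcement draft", "fields": {"designation": "PR"}},
    "memo3": {"text": "Parking policy", "fields": {}},
}
hr_hits = search_with_field(documents, "bonus", "designation", "HR executive")
```

Here a search for "bonus" alone would match both memo1 and memo2, while the same search limited to the HR executive designation would return only memo1.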
in Security Classifications
The goal is to enable searching organization-wide, but to keep the wrong documents
out of the wrong hands. For example, suppose documents that bear certain