Tips For Building an Effective Keyword List

Have you ever received a production or collection of documents from your client and struggled to effectively narrow the document set for relevance?

Contrary to the belief of some, keyword searching is not dead. Even with the advancement of technology and analytic tools, a well-refined keyword list can be an efficient way to narrow the focus of your review. Keywords can be used at multiple stages of the EDRM (Electronic Discovery Reference Model) if you can master a few necessary search techniques.

Searching for documents based on a list of keywords can be both rewarding and frustrating.

Building an effective list of terms begins with an understanding of how search engines and indexed databases function in order to give you the desired results. An indexed database is a way of cataloging metadata from documents into specified fields, such as creation date, author, sender and even text from the body of an email, for purposes of searching and organization.

Fortunately, you don’t need to be an expert in coding or database languages to begin building successful keyword searches.

Here are some tips and recommendations to improve your keyword list and search results.

Limit the number of hits with proximity searching.

Proximity searching finds two words that appear within a specified distance of each other. As an example, if you want to search for the name John Stone, the proximity search John w/2 Stone will return multiple iterations of the name, such as Stone, John; John T. Stone; or John Stone, but will exclude stand-alone instances of John or Stone.
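To make the idea concrete, here is a minimal Python sketch of what a w/2 search does behind the scenes. It is an illustration only, not how any particular review platform implements proximity searching; the function names tokenize and within are hypothetical, and the sketch assumes the search is case-insensitive and works in either word order.

```python
import re

def tokenize(text):
    # Lowercase the text and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def within(text, first, second, distance):
    """Return True if `first` and `second` appear within `distance` words
    of each other, in either order."""
    words = tokenize(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= distance
               for i in first_positions
               for j in second_positions)

# Equivalent of the search: John w/2 Stone
for sample in ["Stone, John", "John T. Stone", "John threw a heavy gray stone"]:
    print(sample, "->", within(sample, "John", "Stone", 2))
```

The first two samples are hits because the two words sit one and two positions apart; the last is not, because five words separate them.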

Use all caps for connectors.

A connector is a command used to narrow a search by defining the relationship between terms, such as OR, AND or NOT. It’s recommended to use all caps for connectors because some search engines require it, and it makes connectors easier to distinguish from search terms.

Many characters can index as spaces.

Characters such as the period (.) and the at sign (@) are often treated as spaces. This means that if you are searching for SamJones@LSILegal.com, it will be indexed as three distinct terms: “SamJones” and “LSILegal” and “com”. Other characters that traditionally index as spaces include: ! & ” # ’ ( ) * + , / : ; < = > ? [ \ ] ^ ` { | } ~.
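The effect is easy to see in a short Python sketch. This is a simplified illustration, assuming the index treats exactly the characters listed above as spaces; the actual set varies by platform and index settings, and the names SPACE_CHARS and index_terms are hypothetical.

```python
# Characters assumed to index as spaces, taken from the list above.
# The real set depends on the review platform's index configuration.
SPACE_CHARS = ".@!&\"#'()*+,/:;<=>?[\\]^`{|}~"

def index_terms(text):
    """Replace each space-like character with a space, then split into terms."""
    cleaned = "".join(" " if ch in SPACE_CHARS else ch for ch in text)
    return cleaned.split()

print(index_terms("SamJones@LSILegal.com"))  # ['SamJones', 'LSILegal', 'com']
print(index_terms("R&D"))                    # ['R', 'D']
```

Because the address is stored as three separate terms, a search for it behaves like a phrase search for those three terms in sequence.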

Be aware when searching for numbers.

The number 1,000,000 is seen as three separate items in the index: (1), (000) and (000). This is due to the “,” being read as a space.
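The practical consequence can be sketched in a few lines of Python. This is an illustration under the assumption stated above (the comma is read as a space); number_tokens and contains_phrase are hypothetical helper names, not functions from any search product.

```python
def number_tokens(text):
    # Assumption: the comma is indexed as a space.
    return text.replace(",", " ").split()

def contains_phrase(tokens, phrase):
    """Return True if the `phrase` tokens appear consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

target = number_tokens("1,000,000")  # ['1', '000', '000']
doc_a = number_tokens("The settlement was 1,000,000 dollars")
doc_b = number_tokens("The invoice total was 2,000,000 dollars")

print(contains_phrase(doc_a, target))  # True
print(contains_phrase(doc_b, target))  # False
print("000" in doc_b)                  # True
```

Searching for the three pieces in sequence finds only the intended number, while searching for a single piece such as 000 hits any large number in the collection.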

Know when to use parentheses or quotes.

Quotes are generally used to search for a phrase or literal combination of words, like “Litigation Solutions”. Parentheses are used to separate or order unique items within the search command, like ((Rich OR Rick) w/2 Adams).

What’s in a name?

Try expanding first names with nicknames. As an example, the name Richard Adams could be searched as ((Richard OR Rick OR Dick) w/2 Adams). It may also be valuable to gather insight from your client for more specific nicknames unrelated to the given name (only a close acquaintance would know “Boomer” is Rich Adams’s nickname).
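If you have many custodians, building these strings by hand gets tedious. The short Python sketch below assembles a search string in the same w/2 style used in this article; name_query is a hypothetical helper, and the exact proximity syntax your platform expects may differ.

```python
def name_query(first_names, last_name, distance=2):
    """Build a proximity search string from name variants and a surname."""
    alternatives = " OR ".join(first_names)
    return f"(({alternatives}) w/{distance} {last_name})"

print(name_query(["Richard", "Rick", "Dick"], "Adams"))
# ((Richard OR Rick OR Dick) w/2 Adams)

# Adding a client-supplied nickname is a one-word change.
print(name_query(["Richard", "Rick", "Dick", "Boomer"], "Adams"))
```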

Beware of false positives.

A false positive is a word that shows up as a returned hit, but was not used for the meaning that was intended by the search. The term “IT” (Information Technology) would return results for each occurrence of the word “it”. Likewise, the term “Conduct” (as in conduct unbecoming) would also return hits for “Conduct” (as in conduct a survey or conduct electricity).
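The “IT” problem can be illustrated with a few lines of Python. This sketch assumes the default search ignores case, which is common but not universal, and that a case-sensitive option is available; check your platform before relying on either. The function name count_hits and the sample sentence are made up for illustration.

```python
import re

def count_hits(text, term, case_sensitive=False):
    """Count whole-word occurrences of `term` in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return len(re.findall(rf"\b{re.escape(term)}\b", text, flags))

sample = "IT said it would fix it once IT had the budget."

print(count_hits(sample, "IT"))                       # 4
print(count_hits(sample, "IT", case_sensitive=True))  # 2
```

Half of the case-insensitive hits in this sentence are the ordinary word “it”, which is exactly the kind of noise a hit report will surface.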

Use wildcards with caution.

A wildcard uses the asterisk (*) to search for a sequential string of characters rather than an exact match, typically at the end of the root word. As an example, searching for educat* returns results for educate, educated, education, educational and educator. It may be advisable to list out “education” and “educator” if you want to avoid false positives from “educate,” “educated” and “educational”.
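One way to preview what a wildcard will pull in is to expand it against the words that actually appear in your collection. The Python sketch below does this with the standard library’s fnmatch module; the vocabulary shown is a made-up sample, and expand_wildcard is a hypothetical helper rather than a feature of any review tool.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, vocabulary):
    """Return every word in `vocabulary` that matches the wildcard pattern."""
    return sorted(w for w in vocabulary if fnmatchcase(w.lower(), pattern.lower()))

# Hypothetical sample of words found in a document collection.
vocabulary = {"educate", "educated", "education", "educational",
              "educator", "edict", "reeducate"}

print(expand_wildcard("educat*", vocabulary))
# ['educate', 'educated', 'education', 'educational', 'educator']
```

Reviewing the expanded list before running the search makes it easier to decide whether to keep the wildcard or spell out only the variants you want.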

Test the results and refine the list.

Using a “hit report” to measure the search results of each keyword is a good way to identify false positives and overly broad terms. Narrowing the number of returned documents without excluding documents that are potentially responsive can require multiple revisions to your keyword list, but the effort can certainly pay off by limiting the number of documents necessary for review.
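Most review platforms generate hit reports for you, but the underlying idea is simple enough to sketch in Python. The example below is a rough illustration that assumes case-insensitive, whole-word matching; the three documents and the hit_report function are hypothetical.

```python
import re

def hit_report(documents, terms):
    """Count how many documents contain each term as a whole word."""
    report = {}
    for term in terms:
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        report[term] = sum(1 for text in documents.values() if pattern.search(text))
    return report

# Hypothetical three-document collection, for illustration only.
documents = {
    "DOC-001": "Please conduct a survey of the education budget.",
    "DOC-002": "His conduct at the meeting was unbecoming.",
    "DOC-003": "The educator asked IT to reset the password.",
}

for term, count in hit_report(documents, ["conduct", "educator", "it"]).items():
    print(f"{term}: {count} of {len(documents)} documents")
```

A term that hits a large share of the collection is a candidate for tightening with a proximity search or a more specific phrase.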