How to Find Similar Documents

Finding Textually Related Patents

To find patents with similar specifications, use Analyze>Find Similar Documents>Use Full Text (1).  AcclaimIP finds patents whose specifications are similar to the source document's specification.  The algorithm uses a technique called TF/IDF, which stands for Term Frequency (in the source document) and Inverse Document Frequency (in the patent corpus).  In a nutshell, TF/IDF preferentially weights terms that are important in your source document and rare in the corpus.  As a result, common terms such as prepositions, articles, conjunctions, and common patent terms (e.g., system, method, or apparatus) are virtually ignored when determining which documents are similar.

AcclaimIP searches the entire corpus and returns a very large result set, sorted with the best match on top and descending to patents that match on only one relatively common term.
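To make the weighting idea concrete, here is a minimal sketch of TF/IDF ranking in plain Python.  It is an illustration only, not AcclaimIP's actual implementation: the whitespace tokenizer, the unsmoothed log(N/df) formula, the cosine scoring, and the four toy "specifications" are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would do much more.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Terms found in every document get idf = log(1) = 0, so they are ignored.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(source_index, docs):
    """Score every other document against the source, best match first."""
    vectors = tfidf_vectors(docs)
    source = vectors[source_index]
    scored = [(cosine(source, vec), i)
              for i, vec in enumerate(vectors) if i != source_index]
    return sorted(scored, reverse=True)


# Hypothetical toy corpus; document 0 plays the role of the source patent.
docs = [
    "a method and system for streaming video to a handset",
    "a method and system for streaming video to a display",
    "a method and system for brewing coffee",
    "a system and method for mixing paint",
]
vectors = tfidf_vectors(docs)
ranked = rank_similar(0, docs)
```

In this toy corpus "method" and "system" occur in every document, so their weight is zero and they contribute nothing to the score.  Only the rarer terms ("streaming", "video") pull document 1 to the top of the ranked list.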

Notice there are options for both full text and claims only.  When you choose Use Claims Text you find documents with similar claims in the corpus.  So the options either compare full text to full text OR claims to claims.

Analyzing the Results

The results appear as a ranked list showing all of the "similar" patents, which will often number in the millions.  For instance, the results in this example include over 24 million documents, but don't worry.  It's easier to deal with than you think.

You may have seen other systems that chop the list of related patents to the most similar 50 or 100 documents, but with AcclaimIP what is important is the sort order!  The most similar document is listed first, then the next most similar, and so on (you can change this by sorting by, say, the Document No. or Title column, but why would you?).  You'll find that at a certain point in the list, the similarity will drop to the point where it isn't really useful to you.  You might then consider refining your search results further, or taking another approach.  But note that the point where the documents aren't useful to you anymore very well might be more than 50 or 100 documents.

To narrow down my list somewhat, I typed some text in the query field to show only utility patents (PT:U, which means Patent Type is Utility).  This query removes any design patents from my result set.  Design patents have very short full text descriptions, so they often appear high on the list because each matching term carries more relative weight in a short description.  Here I have not run the query yet, but once I ran it, the search narrowed to just over 2 million documents.

Refining TF/IDF Results

Refining the search by requiring both "Audio" and "Video" and at least one of the other terms/strings in the parentheses reduces the set to 95,789 documents.  Note that they are still ranked by their degree of similarity to the source text.
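The key property here is that refining acts as a filter and does not re-sort.  The sketch below shows that behavior on a small ranked list.  The `refine` function, the hit records, and the "any of" terms are hypothetical and chosen for illustration; they are not the actual query used above or AcclaimIP's query engine.

```python
def refine(ranked_hits, required, any_of):
    """Keep hits that contain every required term and at least one
    any_of term, preserving the original similarity order."""
    kept = []
    for hit in ranked_hits:
        words = set(hit["text"].lower().split())
        if required <= words and words & any_of:
            kept.append(hit)
    return kept


# Hypothetical hits, already sorted from most to least similar.
ranked_hits = [
    {"doc": "A", "text": "audio video stream sent to wireless display"},
    {"doc": "B", "text": "video codec for handset"},
    {"doc": "C", "text": "audio and video conference over network"},
    {"doc": "D", "text": "audio video recorder"},
]

refined = refine(
    ranked_hits,
    required={"audio", "video"},
    any_of={"wireless", "network"},  # illustrative terms only
)
refined_docs = [hit["doc"] for hit in refined]
```

Documents B and D drop out, and A still comes before C because the filter never touches the ranking.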

Filtering Using SuperFacets

These types of results normally can't be faceted in a useful way, because with large result sets the facet counts in the filters are almost the same as faceting the full corpus.  If you filtered by Assignee, for example, IBM, Samsung, and Canon would always be at the top of the list, simply because they own the most patents.

AcclaimIP uses SuperFacets, which show counts from only the 200 most related documents.  For instance, looking in the CPC Complete filter results, we see that 20 of the 200 most textually related documents are in H04L29/06027.
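The difference between ordinary facets and counting over only the top of the list can be sketched in a few lines.  This is an assumption-laden illustration of the idea, not how SuperFacets are implemented: `facet_counts`, the `top_n` parameter, and the assignee names are all made up for the example.

```python
from collections import Counter


def facet_counts(ranked_hits, field, top_n=None):
    """Count values of `field`, optionally over only the top_n hits."""
    subset = ranked_hits if top_n is None else ranked_hits[:top_n]
    return Counter(hit[field] for hit in subset)


# Hypothetical ranked list: the closest matches belong to a small company,
# followed by a long tail of loosely related patents from a large filer.
ranked_hits = (
    [{"assignee": "NicheCo"} for _ in range(3)]
    + [{"assignee": "BigCorp"} for _ in range(10)]
)

full_counts = facet_counts(ranked_hits, "assignee")
top_counts = facet_counts(ranked_hits, "assignee", top_n=3)
```

Counting over the whole list puts the large filer first simply because of portfolio size, while counting over only the top hits surfaces the assignee that actually owns the most similar documents.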

Combine TF/IDF with Class Searching

I picked this patent because this is really a video conferencing invention using a mobile phone wirelessly connected to a big screen.  The patent:

  1. Doesn't have the term "conference" or "conferencing"
  2. Shares lots of terms and concepts with other mobile device/handset patents.  

Teleconferencing patents are mostly found in the CPC class H04M3, with some in H04N7 (and their child classes).  Here I refined my search results to only the most similar patents in these likely classes, and I got much better results.
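As a rough sketch of what combining similarity ranking with class searching does, the snippet below keeps only the ranked hits that carry at least one CPC code under H04M3 or H04N7, without changing their order.  The hit records and helper names are hypothetical, and matching on the part of the code before the slash is an assumption about how child classes would be included.

```python
def in_cpc_groups(hit, groups):
    # "H04M3/56" -> "H04M3"; any code under a wanted group qualifies.
    return any(code.split("/")[0] in groups for code in hit["cpc"])


def restrict_to_classes(ranked_hits, groups):
    """Filter a similarity-ranked list by CPC group, preserving rank order."""
    return [hit for hit in ranked_hits if in_cpc_groups(hit, groups)]


# Hypothetical hits, already sorted from most to least similar.
ranked_hits = [
    {"doc": "A", "cpc": ["H04W4/00"]},
    {"doc": "B", "cpc": ["H04M3/56", "H04W4/00"]},
    {"doc": "C", "cpc": ["H04N7/15"]},
    {"doc": "D", "cpc": ["H04N21/41"]},
]

conferencing = restrict_to_classes(ranked_hits, {"H04M3", "H04N7"})
conferencing_docs = [hit["doc"] for hit in conferencing]
```

Document A ranks highest on text alone but falls outside the conferencing classes, so it is dropped.  B and C remain in their original similarity order.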

The point is, these algorithms can be huge timesavers, and take you exactly where you want your search to go, but they can be misleading if you don't understand the patented invention first and combine that knowledge with class searching.

Full Text vs. Claims Only Text

You have the option to use this technique with either the full text of the specification or the text of the claims only.  You'll find that one will often be better than the other depending on the source patent.  The claims are targeted towards the invention, but similar patents will often use completely different claims language depending on how the author defined a term or phrase in the specification.  So a claims-only comparison may not return the results that you want.  Of course, it might be a good idea to use the system both ways, and see where it takes you.

