IBM’s Automatic Classification Module

If you're interested in the classification of electronic information, IBM's Classification Module is worth knowing about.

Assigning categories to electronic material is useful mainly for archiving, ECM, regulatory compliance, e-discovery, and retention. Today, governments and military organizations are the main users, but classification is likely to become mainstream over the next year or two.

The module can work independently of other software, but is most commonly used with IBM's ECM tools: Content Collector, FileNet, and Content Manager. The main features are:

  • Scans emails, documents, and many other types of information.
  • Supports a very large number of file types.
  • Users and admins can define classifications using traditional Boolean logic and regular-expression pattern matching on data and metadata.
  • Bayesian n-tuple word sequences are also used to infer classifications.
  • Bayesian technology also learns and can adjust its judgments in use.
  • The system presents a series of weighted options, from which the user can select correct classifications.
  • The module is normally intended to be accessed by other programs via an API.
  • However, preprogrammed interfaces are available for SharePoint and MS Office, where users pick from a series of suggested classifications.
  • Many languages are supported.
  • Pricing is around $120K to $150K, depending on the number of processors.
  • The latest release is v8.6.
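
To make the rule-based and Bayesian features above concrete, here is a minimal Python sketch of the general technique. It is not IBM's API or implementation: the `RULES` table, the `NaiveBayes` class, and the category names are all hypothetical, and word bigrams stand in for the n-tuple word sequences mentioned above. It shows explicit regular-expression rules alongside a naive Bayes model that learns from labeled examples and returns weighted suggestions.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical rule table: category -> regular expression applied to the text.
RULES = {
    "legal": re.compile(r"\b(contract|subpoena)\b", re.IGNORECASE),
    "finance": re.compile(r"\binvoice\b.*\bpayment\b", re.IGNORECASE | re.DOTALL),
}


def rule_matches(text):
    """Return the categories whose rule matches the text, in sorted order."""
    return sorted(cat for cat, pattern in RULES.items() if pattern.search(text))


def features(text, n=2):
    """Single words plus word n-grams (bigrams by default)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return words + grams


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()                # category -> documents seen
        self.feature_counts = defaultdict(Counter) # category -> feature counts
        self.vocab = set()

    def learn(self, text, category):
        """Update the model with one labeled example, e.g. a user's choice."""
        self.doc_counts[category] += 1
        for f in features(text):
            self.feature_counts[category][f] += 1
            self.vocab.add(f)

    def suggest(self, text):
        """Return (category, weight) pairs, best first; weights sum to 1."""
        if not self.doc_counts:
            return []
        feats = features(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for cat, n_docs in self.doc_counts.items():
            counts = self.feature_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for f in feats:
                score += math.log((counts[f] + 1) / denom)
            log_scores[cat] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return sorted(((c, v / z) for c, v in exp.items()),
                      key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.learn("please pay the attached invoice", "finance")
    nb.learn("the contract was signed by counsel", "legal")
    print(rule_matches("Please sign the contract"))
    print(nb.suggest("overdue invoice"))
```

Calling `learn` again each time a user confirms or corrects a suggestion is one simple way to model the "learns in use" behavior; a production system would be considerably more elaborate.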

The system's main use is to provide machine assistance to users who manually classify material. It can also be used for purely automated classification.

Purely automated classification is right roughly 70% to 80% of the time, which is about on par with the state of the art. Hence the emphasis on machine-assisted human classification.
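
One common way to combine the two modes is to auto-classify only when the top suggestion is very confident and otherwise hand a short list to a person. The Python sketch below illustrates that idea only; the `route` function, the 0.9 threshold, and the three-option limit are assumptions for illustration, not settings taken from IBM's product.

```python
def route(suggestions, auto_threshold=0.9, max_options=3):
    """Decide between automatic filing and human review.

    suggestions: list of (category, weight) pairs, weights in [0, 1].
    auto_threshold and max_options are hypothetical tuning values.
    """
    ranked = sorted(suggestions, key=lambda pair: pair[1], reverse=True)
    if ranked and ranked[0][1] >= auto_threshold:
        # Confident enough: accept the top category without asking.
        return ("auto", [ranked[0]])
    # Otherwise show the user the best few weighted options.
    return ("review", ranked[:max_options])


if __name__ == "__main__":
    print(route([("legal", 0.95), ("finance", 0.05)]))
    print(route([("legal", 0.55), ("finance", 0.45)]))
```

Raising the threshold sends more items to review and fewer to automatic filing, which is the trade-off an accuracy rate in the 70% to 80% range forces.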

... David Ferris
