IBM’s Content Collection and Archiving

IBM announced its Content Collection and Archiving products. In summary:

The Problem

  • Today, there is typically a series of different archiving systems, each corresponding to a different type of electronic information and/or a different organizational group.
  • As a result, there is little consistency in the archives. This is costly. It also increases the chances of noncompliance and makes it harder to actually use the information in the archives.

IBM's Solution

  • IBM proposes its Content Collection and Archiving architecture.
    • Data to be archived is extracted from a wide variety of systems and applications.
    • Then it's run through a central point of processing and control.
    • Then it's deposited in an IBM ECM repository or repositories.
    • If information is originally in a secure and auditable repository, such as an ECM system, then it can remain in place rather than being deposited into an IBM repository.
  • By way of analogy, for readers familiar with the old SoftSwitch approach to integrating heterogeneous email systems, think of an email switch.
  • The system provides the following main functions:
    • Data ingestion. For example, from Exchange, Notes, SharePoint, Windows file system.
    • Classification, using rules and policies based on content and metadata. The optional IBM Classification module adds advanced contextual (Bayesian) classification.
    • Data transformation. For example, conversion of documents into PDFs or TIFFs.
    • Extraction of metadata. For example, date of document creation, authors, and users.
    • A common data model for information in different sources and repositories.
    • A federated, virtual view of information spread across different repositories.
    • Filtering and application of policy. For example: the system determines that material is tax-related, so it should be archived for seven years; or that it is product design material, so it should be kept for 25 years and be viewable only by certain classes of user; or that it should be maintained on cheap, slow-access storage.
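The classification and policy steps above can be sketched in a few lines. This is a minimal illustration of rules-based classification followed by policy application, not IBM's implementation; all rule predicates, category names, retention periods, and storage tiers are invented for the example.

```python
# Hypothetical sketch: rules-based classification on content/metadata,
# then a policy lookup (retention period, storage tier). All rules and
# policies below are illustrative assumptions, not IBM's.

from dataclasses import dataclass, field

@dataclass
class Item:
    text: str
    metadata: dict = field(default_factory=dict)

# Each rule is a (predicate, category) pair, checked in order.
RULES = [
    (lambda it: "tax" in it.text.lower(), "tax"),
    (lambda it: it.metadata.get("department") == "engineering", "product-design"),
]

# Policy table: category -> (retention in years, storage tier).
POLICIES = {
    "tax": (7, "standard"),
    "product-design": (25, "restricted"),
    "default": (3, "cheap-slow"),
}

def classify(item: Item) -> str:
    for predicate, category in RULES:
        if predicate(item):
            return category
    return "default"

def apply_policy(item: Item) -> dict:
    category = classify(item)
    retention, tier = POLICIES[category]
    return {"category": category, "retention_years": retention, "storage_tier": tier}

invoice = Item("2008 tax filing for Q4", {"department": "finance"})
print(apply_policy(invoice))
# {'category': 'tax', 'retention_years': 7, 'storage_tier': 'standard'}
```

A real deployment would replace the rule predicates with the product's rule engine and, optionally, a Bayesian classifier for ambiguous content.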


Analysis

  • Providing a centralized point of administration and control in this way makes great sense. Overall, this is an excellent step toward rationalizing and coordinating today's dog's dinner of different archiving systems.
  • As with any switch that converts between formats, the devil's in the details and this won't work as neatly as architecture diagrams suggest. Three examples spring to mind:
    • The Bayesian and rules-based automatic classification, while valuable, will be insufficiently accurate for many types of source information.
    • A variety of specialized search tools will be required for fine drill down, depending on the type of ESI.
    • Federated views of data can have problems. For example, slow search times, or search criteria that produce counterintuitive results across repositories.
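The federated-view pitfall is easy to see in a toy example. The sketch below, with invented repository names and fields, shows how a naive merge across repositories with differing metadata conventions can silently drop matches — exactly the kind of counterintuitive result noted above.

```python
# Illustrative sketch of a naive federated search across two repositories
# with inconsistent metadata conventions. Repository contents and field
# names are invented for the example.

repo_a = [  # stores the author under "author", lowercase
    {"id": "a1", "author": "smith", "title": "Tax memo"},
]
repo_b = [  # stores the author under "creator", capitalized
    {"id": "b7", "creator": "Smith", "title": "Design note"},
]

def federated_search(author: str) -> list:
    """Merge results by querying each repository with its own field name."""
    hits = [d for d in repo_a if d.get("author") == author]
    hits += [d for d in repo_b if d.get("creator") == author]
    return hits

# A lowercase query silently misses repo_b's capitalized "Smith":
print(len(federated_search("smith")))  # 1, not the 2 a user might expect
```

This is why the common data model matters: without normalizing field names and values at ingestion, the virtual view is incomplete in ways users won't notice.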

... David Ferris
