April 20, 2007


Chris M Evans

Chuck, I think what you describe sounds like an ideal world. Unfortunately, in the real world, most of what you describe is not practical or cost-effective. For instance, even if it is possible to scan for data that gives some kind of financial benefit, if you don't know what you're looking for, how are you going to evaluate the value of that information when you locate it? In addition, hopefully all, or at least 90%+, of relevant corporate data should be in databases, which you've excluded from the discussion. Whilst having the tools will be essential, I think we will also need a new breed of data analysts who know how to use such data and what to look for, rather than speculative data trawling.

Chuck Hollis

Hi Chris -- thanks for the comment, but I disagree with many of your assertions.

Your first assertion is that scanning (or classifying) information won't be practical or cost-effective.

I would offer as counterpoint the thousands of IT shops that I know of that routinely scan email for archiving or other purposes, and the growing number of shops that are starting to scan file systems for similar purposes. Yes, scanning and classification are somewhat primitive today, but I think they will improve in time.

As far as having to know what you're looking for in order to find value, the jury is out on that one, in two regards. First, I know of several customers that have created lists of keywords and templates to scan for (e.g. looks like an account number, or a project name, or maybe a customer ID). Second, it is possible to create indexes of all keywords scanned for (à la Google), albeit at an additional storage cost. And both are being done today.
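To make those two approaches concrete, here is a minimal Python sketch. The template formats (`account_number`, `customer_id`), the sample documents, and the function names are all hypothetical, invented for illustration; real patterns would come from the business, and production classifiers are considerably more involved.

```python
import re
from collections import defaultdict

# Hypothetical templates: assumed formats, not taken from any real system.
TEMPLATES = {
    "account_number": re.compile(r"\b\d{4}-\d{4}-\d{4}\b"),
    "customer_id": re.compile(r"\bCUST-\d{6}\b"),
}


def classify(text):
    """Return {template name: sorted unique matches} for one document."""
    hits = {}
    for name, pattern in TEMPLATES.items():
        found = sorted(set(pattern.findall(text)))
        if found:
            hits[name] = found
    return hits


def build_index(docs):
    """Build an inverted index: lowercase word -> set of document names."""
    index = defaultdict(set)
    for doc_name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_name)
    return index


# Made-up sample documents.
docs = {
    "memo.txt": "Project Falcon budget for CUST-004217 is approved.",
    "mail.txt": "Please debit account 1234-5678-9012 for Falcon.",
}

for name, text in docs.items():
    print(name, classify(text))
print(sorted(build_index(docs)["falcon"]))
```

The first function corresponds to scanning for things that "look like" a known pattern; the second corresponds to indexing every keyword, which is where the extra storage cost comes from, since the index grows with the vocabulary of the corpus.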

As far as your assertion that 90+% of relevant corporate information should be in structured databases, well, that just doesn't seem to be the case anymore.

Email does not fall into that category, nor do all the reports, spreadsheets, presentations, memos, etc. generated from that database information.

Ask anyone who manages a large shop where the information growth is coming from, and they probably won't say "databases".

Furthermore, the problem is worse for unstructured information: a DBMS at least has some basic information management concepts that file systems etc. do not have.

So, how do you recreate database-like properties from unstructured information? Well, you scan them, and put the metadata in a repository ...
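As a rough sketch of that idea, the following Python snippet stores scan results in a small SQLite table so that files can be queried by attribute, much like rows in a database. The table name `file_metadata`, the tag names, and the file paths are assumptions made up for this example; it is not a description of any particular product.

```python
import sqlite3


def open_repository():
    """Create an in-memory metadata repository (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_metadata ("
        " path TEXT, tag TEXT, value TEXT,"
        " PRIMARY KEY (path, tag, value))"
    )
    return conn


def record(conn, path, tags):
    """Store the (tag, value) pairs that a scan discovered for one file."""
    conn.executemany(
        "INSERT OR IGNORE INTO file_metadata VALUES (?, ?, ?)",
        [(path, tag, value) for tag, value in tags],
    )


def find(conn, tag, value):
    """Return the paths of all files carrying the given tag and value."""
    rows = conn.execute(
        "SELECT path FROM file_metadata"
        " WHERE tag = ? AND value = ? ORDER BY path",
        (tag, value),
    )
    return [row[0] for row in rows]


repo = open_repository()
# Made-up scan results for two files.
record(repo, "/share/q1_report.xls", [("project", "Falcon"), ("type", "spreadsheet")])
record(repo, "/share/falcon_memo.doc", [("project", "Falcon"), ("type", "memo")])
print(find(repo, "project", "Falcon"))
```

The files themselves stay where they are; only the metadata is centralized, which is what allows database-style queries over otherwise unstructured content.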

I do agree with your last point -- we will need a new class of specialists who understand what they're looking for ("informationists"?) in the context of the broader corporate agenda. And I'm sure they'll wield big words like "taxonomies" and such ...

Thanks for writing ... I enjoy the discussion!

Albert Carriere

Hi Chuck,

Very interesting topic!

I've been working in the Records & Information Management field for over 30 years, and I have always found that auto classification or auto categorization does not work very well in today’s environment because there are so many variables associated with having a system determine where structured, semi-structured and/or unstructured records are to reside in a corporate file classification scheme. Over the years, I have found that organizations that have a sound records management program in place, with a well established file classification system and appropriate policies and training, have greater success in having their employees properly file their documents in the appropriate places within the corporate filing scheme than organizations that have no such records management program in place.

For example, our Legal Department deals with contracts, and therefore we have a contract cabinet created in Documentum with an alphabetical listing of customers, which includes specific contract type folders under each customer folder. DCO is used by our legal users to file their emails and documents. They themselves know what contracts they are working on, and any other employees working on these contracts are provided with the proper access rights so that they can use the file breakdown as well. Having records management policies and training in place simplifies everyone's life, as they know how the corporate filing scheme works and the importance of properly filing any type of record within our standard corporate file classification system.

I’m not saying that information classification tools are not great; however, I do find that if you, as an employee, know what projects you are working on and a file classification scheme was created for those projects, then I see no problem in your filing documents in the appropriate files, provided the files are displayed to you for quick access.

Those are my thoughts... thanks for reading!

Peter Quirk

One of the best ways for users to support automatic classification is to create more structure (context) in their documents. Unfortunately, the majority of knowledge workers fail to do this because of two related institutional behaviors.

Firstly, we don't train kids in school to use document templates when they first learn how to use word processing tools, spreadsheets, etc. Universities don't correct the behavior, because they assume students have either learned how to use the tools in high school, or can pick up the techniques during writing assignments.

Secondly, most businesses don't invest in the creation of relevant structured templates and in training their staff to use them. The result is that most workers create new information content from the blank Word document, or the blank Excel spreadsheet. Next-generation schema-based tools like InfoPath have seen very slow adoption, and little or no investment in the creation of relevant schemas.

I think it could take another 30 years to flush the unstructured content generation out of the educational and corporate spheres and replace them with a new generation that understands the value of creating or acquiring schemas before embarking on a lifetime of content creation.

Chuck Hollis

Wonderful point, Peter.

And, as I think about it, there is a new skill that we all need to learn about assigning tags, keywords, metadata and so on to the things we create or touch.

And I know that I'm really bad at that particular skill. I'm probably not alone in that regard.

However, I would offer that we'll see activity on this front long before our learning on this improves and is eventually institutionalized.

First, any sort of corporate categorization scheme will have to be owned by a central authority. Individuals and business units can influence and extend it, but the core, I would offer, will have to be centrally governed.

Second, I think that the pressure to classify unstructured information after-the-fact will be so severe that we'll see rapid adoption and extension of these tools in a relatively short period of time.

Email classification, as an example, is not only widely used, it's become very sophisticated as well. I would think files are next in line.

Good thought, Peter. Thanks for the comment!

Chuck Hollis

  • Chuck Hollis
    SVP, Oracle Converged Infrastructure Systems

    Chuck now works for Oracle, and is now deeply embroiled in IT infrastructure.

    Previously, he was with VMware for 2 years, and EMC for 18 years before that, most of them great.

    He enjoys speaking to customer and industry audiences about a variety of technology topics, and -- of course -- enjoys blogging.

    Chuck lives in Vero Beach, FL with his wife and four dogs when he's not traveling. In his spare time, Chuck is working on his second career as an aging rock musician.

    Warning: do not ever buy him a drink when there is a piano nearby.

    Note: these are my personal views, and aren't reviewed or approved by my employer.