Monday, November 19, 2007

On Turning Document into Database

Imagine you have your contacts - names, addresses, phone numbers, spouses, children, etc. - done in Microsoft Word. Periodically you update it, print it and put it in a binder which you keep by the phone. You started doing this 20 years ago, and you have a lot of names and numbers. The information was entered over time without a lot of regard for standards and consistency. How it printed and how it looked when printed were the only considerations. Entries in the document look like this.
Shakespeare, William
Stratford on Avon
or Globe Theatre
London, UK
phone: 707-727-9999)
Tel. 800-555-1212
Fax: (866) 555-4321
(Bill & Stacey. Stacey's cell

Now imagine that you want to migrate this information to Microsoft Outlook or other database application. This was the situation presented to us recently by a client. Unfortunately, it turned out that this was not a trivial piece of work; the Word document contained over 2,000 contacts.

The document had information in record-layout form, without consistent fields/delimeters. Before we could import the data into Outlook, we needed to identify and label fields in each and every record.

One of the File/Import options available in Outlook is a VCard format file. This allows one to import data in record-by-record form, where the records comply with the VCard specs. Initially, we went down this road only to find that the fields available in the VCard spec are too limited -- no spouses, for example; and Outlook only imports one(!?) record per VCard file. There are third-party apps that let you import into Outlook more than one record per VCard file, but the field limits were a deal-breaker.

So, as the job evolved, we ended up identifying fields in each and every record and converting the whole thing into tabluar form, saving it as a Comma Separated Values (CSV) text file, and finally importing that into Outlook. Piece of cake? I wish!

The keys to success here are XML, Regular Expressions and XSL. You know how to use XML/XSL and Regular Expressions, don't you?

First, identify all the fields you are going to use. Hint: Use fields like the ones used in Outlook. Then establish your XML tags for each field: <lname>, <fname>, <mname>, <adr1>, ...

In Microsoft Word, clean up the file as much as possible; then, save it as a text file. We need it as a text file in order to use a Regular Expressions tool to perform complex search-and-replace functions, i.e., identify fields and insert our XML tags. Regexxer, a free, Linux tool, is ideal for this part of the job. But, depending upon the lack of standards and consistency in the document, this part of the job takes HOURS! We drastically underestimated the time involved.

When the long job of tagging the data document is completed, the text file should be made into an XML file and associated with an XSL file which you will create. This XSL file will transform the XML file into a CSV table in a browser. The final step is to copy and save the table and import it into Outlook.

Take it from me, this process works. Unfortunately, the time it takes makes it cost-prohibitive for all but the most cost-insensitive clients.

No comments: