Editing Word Files Outside of Word

One of the biggest problems with editing a Word document from a program that you wrote yourself is that Word uses a set of proprietary markup tags defined by Microsoft, and the full list of tags and how they work is known only to Microsoft staff. The people behind some other programs, such as OpenOffice, have done a reasonable job of reverse engineering what most of the tags mean, but even they can't get their program to render everything exactly as Word does, because of the tags that they haven't properly identified.

If a team of programmers working on another word processor can't get it 100% right, then what chance does any individual have of editing a Word document from their own program? Even a simple search and replace runs the risk of accidentally matching one of the markup tags and totally destroying the page layout. To get that to work with minimal risk of a mismatch, you would need to define fairly sophisticated tags of your own to mark the text that you intend to search for.

If you are looking to edit the Word document on the web then there are additional privacy issues that apply even if you don't perform any edits at all, because the file can carry information that is not visible on the page (such as the author name and revision history that Word can store in the document). Word documents in Word format are totally unsuited for web use.

Fortunately Microsoft themselves have solved both of these problems for us. Microsoft have created their own proprietary markup language based on HTML. It uses Microsoft's proprietary conditional comments to hold the information needed not only to recreate the original Word document from this "HTML" version but also to import it into other Microsoft programs with appropriate formatting. This format converts all of the page markup into plain text enclosed inside < > tags (beyond that, the similarity to real HTML is almost non-existent).

What this format does do is make it safe to run whatever edits we require on the text content of the page. With the markup in a more readable format, it is also slightly easier to modify the actual markup tags yourself without totally stuffing up the page. All of the information that you don't want in the page for privacy reasons is either removed automatically when you convert to this format or becomes more easily visible, so that you can delete it manually before uploading to the web.
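To illustrate why having the markup inside < > tags makes a search and replace safer, here is a minimal Python sketch that changes only the text between tags and leaves tags, comments (including conditional comments) and style or script blocks untouched. The function name replace_in_text_only and the sample markup are made up for this illustration; the regular expression is a rough tokenizer under simple assumptions, not a full HTML parser.

```python
import re

# Anything matched here is treated as markup and copied through unchanged:
# comments (which include conditional comments), <style> and <script>
# blocks, and ordinary tags. This is a simplification, not a real parser.
_MARKUP = re.compile(
    r"(<!--.*?-->|<style\b.*?</style\s*>|<script\b.*?</script\s*>|<[^>]*>)",
    re.DOTALL | re.IGNORECASE,
)


def replace_in_text_only(html: str, old: str, new: str) -> str:
    """Replace `old` with `new` in the text between tags only."""
    # With one capturing group, re.split alternates text and markup:
    # even indexes hold text, odd indexes hold the matched markup.
    parts = _MARKUP.split(html)
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(old, new)
    return "".join(parts)


if __name__ == "__main__":
    sample = "<p class=MsoNormal><span lang=EN-AU>The span of the bridge</span></p>"
    # A plain sample.replace("span", "width") would also rewrite the tags.
    print(replace_in_text_only(sample, "span", "width"))
    # prints: <p class=MsoNormal><span lang=EN-AU>The width of the bridge</span></p>
```

One limitation to expect: if the phrase you are searching for is split across several tags in the saved file, or contains characters stored as entities, a simple match like this will not find it.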

The biggest advantage of this proprietary HTML format (apart from being much easier to edit outside of Word) is that it retains all of the information needed to recreate the document in Word format. To do so, all you need to do is open the HTML file in Word.

One final thing to remember is that you want to save your Word document in the unfiltered version of HTML (the one that adds all the "garbage" that you would need to remove in order to use it as a web page). In recent versions of Word this should be the "Web Page" save type rather than "Web Page, Filtered". In this instance we need all of that information to be retained, as that is what enables Word to convert the HTML back into the original Word document.
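The round trip described above (save as unfiltered HTML, edit outside Word, reopen in Word) could be scripted along the following lines. This is only a sketch under stated assumptions: the helper names detect_charset and edit_saved_html are hypothetical, the file is assumed to declare its character set in a meta tag, and windows-1252 is used as a guess when no declaration is found.

```python
import re
from pathlib import Path
from typing import Callable

# Looks for a declaration such as charset=windows-1252 near the top of the file.
_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def detect_charset(raw: bytes, default: str = "windows-1252") -> str:
    """Return the declared charset, or `default` (an assumption) if none is found."""
    match = _CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii") if match else default


def edit_saved_html(path: Path, transform: Callable[[str], str]) -> None:
    """Apply `transform` to the saved HTML, keeping its encoding and a .bak copy."""
    raw = path.read_bytes()
    encoding = detect_charset(raw)
    text = raw.decode(encoding)

    # Keep the untouched original next to the edited file.
    path.with_name(path.name + ".bak").write_bytes(raw)

    # Characters the original encoding cannot represent are written as
    # numeric character references instead of raising an error.
    edited = transform(text)
    path.write_bytes(edited.encode(encoding, errors="xmlcharrefreplace"))
```

A call such as edit_saved_html(Path("report.htm"), lambda s: s.replace("draft", "final")) (with a made-up file name) rewrites the file in place, after which it can be opened in Word again. Writing the file back in its original encoding matters because the charset declaration inside the file is left unchanged.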


This article written by Stephen Chapman, Felgall Pty Ltd.
