Understanding the Legacy or Binary File Format of MS-Word

Microsoft Word documents come in two formats: the legacy file format, also called the Binary format, used from Office 97 to Office 2003, and Office Open XML (OOXML), which was introduced in Office 2007 and has been the standard ever since. When interacting with Microsoft Office products programmatically, it is vital to understand how Microsoft Office documents format their data. Apache POI provides the Horrible Word Processor Format (HWPF) component for reading and writing Word documents in the legacy format.

Legacy or Binary file format
MS-Word document data is structured into different streams. The main streams are:

  • Word Document Stream
  • 1Table Stream or 0Table Stream

Word Document Stream
This is the main stream, where character data is stored; the character is the basic unit of data in an MS-Word document. The Word Document stream begins with a structure called the File Information Block (FIB), which stores the locations of the document's data as pairs of integers: the first integer in each pair gives the location of the data and the second specifies its size.
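The idea of a (location, size) pair can be sketched in a few lines of Java. This is a minimal illustration, not Apache POI code: the class and method names are hypothetical, and the sketch assumes (based on the [MS-DOC] specification rather than on this article) that integers are stored little-endian and that the stream begins with the 16-bit identifier 0xA5EC. The position of each pair inside the FIB is defined by the specification, so here it is simply passed in by the caller.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only: these names are hypothetical and are not Apache POI classes.
final class FibSketch {

    // One (location, size) pair: two unsigned 32-bit integers.
    static final class Location {
        final long offset;
        final long size;

        Location(long offset, long size) {
            this.offset = offset;
            this.size = size;
        }
    }

    // Reads the 16-bit identifier stored in the first two bytes of the stream.
    static int readIdentifier(byte[] wordDocumentStream) {
        ByteBuffer buf = ByteBuffer.wrap(wordDocumentStream).order(ByteOrder.LITTLE_ENDIAN);
        return Short.toUnsignedInt(buf.getShort(0));
    }

    // Reads one (location, size) pair that starts at pairOffset.
    // Assumption: pairOffset is looked up in the specification by the caller.
    static Location readLocation(byte[] wordDocumentStream, int pairOffset) {
        ByteBuffer buf = ByteBuffer.wrap(wordDocumentStream).order(ByteOrder.LITTLE_ENDIAN);
        long offset = Integer.toUnsignedLong(buf.getInt(pairOffset));
        long size = Integer.toUnsignedLong(buf.getInt(pairOffset + 4));
        return new Location(offset, size);
    }

    public static void main(String[] args) {
        // A made-up 16-byte buffer standing in for the start of a stream.
        byte[] stream = new byte[16];
        stream[0] = (byte) 0xEC;
        stream[1] = (byte) 0xA5;
        stream[9] = 0x04;   // location 0x00000400 stored at bytes 8..11
        stream[12] = 0x20;  // size 0x00000020 stored at bytes 12..15

        Location loc = readLocation(stream, 8);
        System.out.printf("ident=0x%04X offset=%d size=%d%n",
                readIdentifier(stream), loc.offset, loc.size);
    }
}
```

In real code the HWPF component does this parsing for you; the sketch is only meant to make the pair-of-integers layout concrete.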

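The FIB also gives the location of the Clx structure, which, according to the [MS-DOC] specification, is stored in the Table stream rather than in the Word Document stream. The Clx structure is an array of Prc structures, which contain property information, followed by a Pcdt structure. The Pcdt structure contains a PlcPcd structure.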

Adjacent characters in the document text are not always adjacent in the document stream. Characters in the document text are positioned by a Character Position (CP), which is an unsigned 32-bit integer, while characters in the document stream are positioned by Pcd structures. The PlcPcd structure maps the positions of characters in the document text to their positions in the document stream.
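The mapping from a CP to a position in the document stream can be sketched as a lookup over a list of pieces. This is a simplified model, not the on-disk layout: the Piece class, its fields and the sample offsets are hypothetical, and the real Pcd encoding (including how compressed 8-bit text is flagged) is defined by the [MS-DOC] specification.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a simplified, hypothetical model of a piece table.
final class PieceTableSketch {

    // One piece: a run of consecutive CPs stored contiguously in the stream.
    static final class Piece {
        final long cpStart;       // first CP covered by this piece (inclusive)
        final long cpEnd;         // end of the CP range (exclusive)
        final long streamOffset;  // byte offset of the first character in the stream
        final int bytesPerChar;   // 1 for 8-bit text, 2 for 16-bit text

        Piece(long cpStart, long cpEnd, long streamOffset, int bytesPerChar) {
            this.cpStart = cpStart;
            this.cpEnd = cpEnd;
            this.streamOffset = streamOffset;
            this.bytesPerChar = bytesPerChar;
        }
    }

    // Returns the byte offset in the stream of the character at the given CP.
    static long toStreamOffset(List<Piece> pieces, long cp) {
        for (Piece p : pieces) {
            if (cp >= p.cpStart && cp < p.cpEnd) {
                return p.streamOffset + (cp - p.cpStart) * p.bytesPerChar;
            }
        }
        throw new IllegalArgumentException("CP " + cp + " is outside the piece table");
    }

    public static void main(String[] args) {
        // Two made-up pieces: CPs 0..4 live late in the stream, CPs 5..9 earlier.
        List<Piece> pieces = Arrays.asList(
                new Piece(0, 5, 2048, 1),
                new Piece(5, 10, 1024, 2));

        // CPs 4 and 5 are adjacent in the text but far apart in the stream.
        System.out.println(toStreamOffset(pieces, 4));
        System.out.println(toStreamOffset(pieces, 5));
    }
}
```

A linear scan keeps the sketch short; since the CP ranges are sorted, a binary search would be the natural choice for long piece tables.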

1Table Stream or 0Table Stream
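These are the streams where Word stores its table structures, that is, the internal data referenced from the FIB rather than the tables a user draws in a document. A Word document must contain at least one of these two streams, and only one of them is in use at any given time.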
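Which of the two streams is in use is recorded in the FIB. The sketch below shows one way to read that choice; it assumes, following the [MS-DOC] specification rather than this article, that the FIB keeps a 16-bit little-endian flags field at byte offset 0x0A and that bit 0x0200 of it (fWhichTblStm) selects the 1Table stream when set. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only: names are hypothetical and are not Apache POI classes.
final class TableStreamSketch {

    // Assumed position of the 16-bit flags field inside the FIB.
    static final int FLAGS_OFFSET = 0x0A;

    // Assumed mask of the bit that selects the table stream (fWhichTblStm).
    static final int WHICH_TABLE_STREAM_MASK = 0x0200;

    // Returns the name of the table stream that the document uses.
    static String tableStreamName(byte[] wordDocumentStream) {
        ByteBuffer buf = ByteBuffer.wrap(wordDocumentStream).order(ByteOrder.LITTLE_ENDIAN);
        int flags = Short.toUnsignedInt(buf.getShort(FLAGS_OFFSET));
        return (flags & WHICH_TABLE_STREAM_MASK) != 0 ? "1Table" : "0Table";
    }

    public static void main(String[] args) {
        // A made-up buffer standing in for the start of a Word Document stream.
        byte[] stream = new byte[12];
        stream[0x0B] = 0x02; // high byte of the flags field, so flags == 0x0200
        System.out.println(tableStreamName(stream));
    }
}
```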

Apache POI – Poor Obfuscation Implementation

Apache POI (Poor Obfuscation Implementation) is an open source Java API for processing Microsoft Office documents. Using POI, it is possible to read, modify or create office documents. Apache POI provides components that work with different office documents:

  • XSSF and HSSF for Excel
  • XWPF and HWPF for Word
  • XSLF and HSLF for PowerPoint
  • HSMF for Outlook
  • HDGF for Visio
  • HPBF for Publisher
  • and more

Apache POI handles both the Microsoft legacy file format and the Office Open XML (OOXML) file format. For example, the XML SpreadSheet Format (XSSF) component handles the OOXML file format of Excel, while the Horrible SpreadSheet Format (HSSF) component handles the legacy file format. The OOXML file format was introduced with Office 2007 and has been the default file format ever since; the corresponding Word, Excel and PowerPoint extensions are docx, xlsx and pptx. Before Office 2007, Microsoft Office used the legacy file format, formally known as the Binary file format; the corresponding Word, Excel and PowerPoint extensions are doc, xls and ppt.
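The pairing of file extensions and POI components described above can be summarised as a small lookup. This is a plain-Java sketch with hypothetical class and method names; it only returns the component name as a string and does not call Apache POI itself.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a hypothetical helper, not part of Apache POI.
final class PoiComponentSketch {

    private static final Map<String, String> COMPONENT_BY_EXTENSION = new HashMap<>();

    static {
        // Legacy (Binary) file formats
        COMPONENT_BY_EXTENSION.put("doc", "HWPF");
        COMPONENT_BY_EXTENSION.put("xls", "HSSF");
        COMPONENT_BY_EXTENSION.put("ppt", "HSLF");
        // Office Open XML (OOXML) file formats
        COMPONENT_BY_EXTENSION.put("docx", "XWPF");
        COMPONENT_BY_EXTENSION.put("xlsx", "XSSF");
        COMPONENT_BY_EXTENSION.put("pptx", "XSLF");
    }

    // Returns the name of the POI component that handles the given file name.
    static String componentFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No extension in: " + fileName);
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String component = COMPONENT_BY_EXTENSION.get(extension);
        if (component == null) {
            throw new IllegalArgumentException("Unsupported extension: " + extension);
        }
        return component;
    }

    public static void main(String[] args) {
        System.out.println(componentFor("report.doc"));
        System.out.println(componentFor("report.docx"));
    }
}
```

Choosing by extension is the simplest approach, but an extension can be wrong, so a more careful program would inspect the file contents before picking a component.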