Microsoft Word documents come in two formats: the legacy format, also called the binary format, used from Office 97 to Office 2003, and Office Open XML (OOXML), introduced in Office 2007 and the standard ever since. When working with Microsoft Office products programmatically, it is vital to understand how Office documents structure their data. Apache POI provides the Horrible Word Processor Format (HWPF) component for reading and writing legacy-format Word documents.
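Before choosing a parser, it can help to check which of the two formats a file actually uses. The sketch below is not part of Apache POI. It relies on the fact that binary .doc files are stored in an OLE2 compound file while OOXML .docx files are ZIP packages, and compares the leading bytes against those two well-known signatures. The class and method names are hypothetical, and a match only identifies the container (other Office files and ordinary ZIP archives share the same signatures).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from the file header.
class WordFormatSniffer {
    // First 8 bytes of an OLE2 compound file (container of binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // First 4 bytes of a ZIP archive (container of OOXML .docx files).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static String detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return "binary (OLE2)";
        if (startsWith(header, ZIP_MAGIC)) return "OOXML (ZIP)";
        return "unknown";
    }

    // Reads only the first 8 bytes of the file; requires Java 11 or later.
    static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A file reported as "binary (OLE2)" is a candidate for HWPF, while an "OOXML (ZIP)" file needs an OOXML-aware parser instead.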
Legacy or Binary file format
The data in an MS Word document is organized into several streams. The main streams are:
- Word Document Stream
- 1Table Stream or 0Table Stream
Word Document Stream
This is the main stream, where character data is stored. The character is the basic unit of data in an MS Word document. The Word Document Stream begins with a structure called the File Information Block (FIB), which records where data is located using pairs of integers: the first gives the offset of the data and the second gives its size.
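The offset-and-size pairs in the FIB can be read as two consecutive little-endian 32-bit unsigned integers. The following is a minimal sketch under that assumption. The class name `FcLcbPair` and the `pairOffset` parameter are hypothetical; the real position of any particular pair inside the FIB has to be taken from the format specification.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical value type for one (offset, size) pair stored in the FIB.
class FcLcbPair {
    final long fc;   // offset of the referenced data
    final long lcb;  // size of the referenced data, in bytes

    FcLcbPair(long fc, long lcb) {
        this.fc = fc;
        this.lcb = lcb;
    }

    // pairOffset is supplied by the caller; it is NOT looked up here.
    static FcLcbPair readAt(byte[] fib, int pairOffset) {
        ByteBuffer buf = ByteBuffer.wrap(fib).order(ByteOrder.LITTLE_ENDIAN);
        // Java ints are signed, so widen to long to keep the unsigned value.
        long fc = Integer.toUnsignedLong(buf.getInt(pairOffset));
        long lcb = Integer.toUnsignedLong(buf.getInt(pairOffset + 4));
        return new FcLcbPair(fc, lcb);
    }
}
```

Keeping the two values together in one small object makes it harder to pair an offset with the wrong size when many such pairs are read.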
The FIB also locates the Clx structure, which is stored in the Table Stream. A Clx consists of an array of Prc structures, which contain property information, followed by a single Pcdt structure. The Pcdt in turn contains a PlcPcd structure.
Characters that are adjacent in the document text are not always adjacent in the document stream. A position in the document text is given by a Character Position (CP), an unsigned 32-bit integer that is the zero-based index of a character. A position in the document stream is given by a structure called a Pcd. The PlcPcd maps ranges of CPs to the Pcds that locate the corresponding characters in the document stream.
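The CP-to-stream mapping can be sketched as a lookup over "pieces". This is a simplified model rather than a full parser: it assumes the CP boundaries and the Pcd fields have already been decoded, and it applies the rule from the [MS-DOC] specification, as understood here, that a piece is stored either as 16-bit characters starting at offset fc or as 8-bit characters starting at offset fc / 2. The class and field names are hypothetical.

```java
// Hypothetical, simplified piece table: piece i covers CPs [cps[i], cps[i + 1]).
class PieceTable {
    private final int[] cps;            // n + 1 ascending CP boundaries
    private final int[] fcs;            // n raw fc values, one per piece
    private final boolean[] compressed; // n flags: true = 8-bit characters

    PieceTable(int[] cps, int[] fcs, boolean[] compressed) {
        if (cps.length != fcs.length + 1 || fcs.length != compressed.length) {
            throw new IllegalArgumentException("inconsistent piece table");
        }
        this.cps = cps;
        this.fcs = fcs;
        this.compressed = compressed;
    }

    // Returns the byte offset in the document stream of the character at cp.
    long streamOffset(int cp) {
        for (int i = 0; i < fcs.length; i++) {
            if (cp >= cps[i] && cp < cps[i + 1]) {
                int delta = cp - cps[i];
                return compressed[i]
                    ? fcs[i] / 2 + delta   // 8-bit text: one byte per character
                    : fcs[i] + 2L * delta; // 16-bit text: two bytes per character
            }
        }
        throw new IllegalArgumentException("CP out of range: " + cp);
    }
}
```

With boundaries {0, 5, 8}, fc values {0x800, 0x400} and flags {false, true}, CP 4 maps to offset 2056 while the next character, CP 5, maps to offset 512, so two neighbours in the text sit far apart in the stream.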
1Table Stream or 0Table Stream
This is the stream where Word stores its internal tables, the structures that the FIB points to. A Word document must contain at least one of these two streams, and only one of them is in use at any time; a flag in the FIB indicates which.
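Choosing between the two streams comes down to testing a single bit in the FIB. The sketch below assumes, based on a reading of the [MS-DOC] specification, that the flag (named fWhichTblStm there) is bit 9 of the little-endian 16-bit flags word at byte offset 0x0A of the Word Document Stream. Both constants should be verified against the specification before relying on them, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: picks the table stream name from the start of the FIB.
class TableStreamSelector {
    static final int FLAGS_OFFSET = 0x0A;      // assumed offset of the FIB flags word
    static final int F_WHICH_TBL_STM = 0x0200; // assumed mask for the selector bit

    static String tableStreamName(byte[] wordDocumentStream) {
        int flags = ByteBuffer.wrap(wordDocumentStream)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getShort(FLAGS_OFFSET) & 0xFFFF;  // treat the short as unsigned
        // Bit set selects "1Table"; bit clear selects "0Table".
        return (flags & F_WHICH_TBL_STM) != 0 ? "1Table" : "0Table";
    }
}
```

The returned name can then be used to open the matching stream from the compound file, ignoring the other stream if it happens to be present.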