IMO, most of history, though not all of it, is a lie. The question then is: how can one find the truth?
Every author writes from their own point of view, some biased, others not so much. In such cases one can deduce the truth to some extent.
Then there are times when only the winners have written the history. In that case we had better make peace with it.
Microsoft Word documents come in two formats: the legacy format, also called the binary format, used from Office 97 to Office 2003, and Office Open XML (OOXML), introduced in Office 2007 and the default ever since. When working with Microsoft Office products programmatically, it is vital to understand how Office documents structure their data. Apache POI provides the Horrible Word Processor Format (HWPF) component for reading and writing legacy Word documents.
Legacy or Binary file format
Data in a legacy Word document is organized into streams. The main ones are:
- Word Document Stream
- 1Table Stream or 0Table Stream
Word Document Stream
This is the main stream, where character data is stored. The character is the basic unit of data in a Word document. The stream begins with a structure called the File Information Block (FIB), which records where data is located using pairs of integers: the first gives the offset and the second gives the size.
The FIB also locates the Clx structure, which per the MS-DOC specification is stored in the Table Stream. A Clx is an array of Prc structures, which hold property information, followed by a Pcdt structure. The Pcdt in turn contains the PlcPcd structure.
Adjacent characters in the document text are not always adjacent in the document stream. A character's position in the document text is given by a Character Position (CP), an unsigned 32-bit integer, while its location in the document stream is described by a structure called a Pcd. The PlcPcd maps character positions in the document text to locations in the document stream.
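The CP-to-stream mapping can be sketched in plain Java. This is a minimal illustration, not POI's implementation: the class `PieceTableSketch` and its fields are hypothetical, and the offset arithmetic follows my reading of the MS-DOC text retrieval rules (a compressed piece stores one byte per character starting at fc / 2, an uncompressed piece stores two bytes per character starting at fc). It assumes the PlcPcd has already been parsed into arrays and that the CPs are strictly ascending.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Apache POI: resolves a CP to a byte offset. */
final class PieceTableSketch {

    /** Where a character lives in the document stream and how it is stored. */
    static final class Location {
        final long byteOffset;
        final boolean compressed; // true: 1 byte per character, false: 2 bytes per character

        Location(long byteOffset, boolean compressed) {
            this.byteOffset = byteOffset;
            this.compressed = compressed;
        }
    }

    private final long[] aCp;            // n + 1 ascending CPs; piece i covers [aCp[i], aCp[i + 1])
    private final long[] fc;             // n raw fc values, one per Pcd
    private final boolean[] fCompressed; // n compression flags, one per Pcd

    PieceTableSketch(long[] aCp, long[] fc, boolean[] fCompressed) {
        if (aCp.length != fc.length + 1 || fc.length != fCompressed.length) {
            throw new IllegalArgumentException("expected n + 1 CPs for n pieces");
        }
        this.aCp = aCp.clone();
        this.fc = fc.clone();
        this.fCompressed = fCompressed.clone();
    }

    /** Finds the piece containing cp and converts cp into a stream offset. */
    Location locate(long cp) {
        if (aCp.length < 2 || cp < aCp[0] || cp >= aCp[aCp.length - 1]) {
            throw new IllegalArgumentException("CP outside document text: " + cp);
        }
        int found = Arrays.binarySearch(aCp, cp);
        // Exact hit: cp starts piece 'found'. Otherwise take the piece before the insertion point.
        int piece = found >= 0 ? found : -found - 2;
        long delta = cp - aCp[piece];
        return fCompressed[piece]
                ? new Location(fc[piece] / 2 + delta, true)
                : new Location(fc[piece] + 2 * delta, false);
    }

    public static void main(String[] args) {
        // Made-up piece table: piece 0 is uncompressed, piece 1 is compressed.
        PieceTableSketch table = new PieceTableSketch(
                new long[] {0, 5, 12}, new long[] {0x800, 0x1000}, new boolean[] {false, true});
        Location loc = table.locate(7);
        System.out.println("CP 7 -> offset " + loc.byteOffset + ", compressed=" + loc.compressed);
    }
}
```

The binary search is a design choice for the sketch; a linear scan over the CP array would work just as well for small documents.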
1Table Stream or 0Table Stream
These are the streams where Word stores its internal tables, the data structures that the FIB points to. A Word document must contain one of these two streams, and only one of them is used at a time.
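Which of the two streams is in use is recorded in the FIB. The sketch below shows one way to read that flag from the first bytes of the Word Document Stream. The class name `TableStreamSketch` is hypothetical, and the constants (the 0xA5EC identifier at offset 0, and the fWhichTblStm bit 0x0200 in the 16-bit flags field at offset 10) come from my reading of the MS-DOC specification, so verify them against the spec before relying on them.

```java
/** Hypothetical helper, not part of Apache POI: picks the table stream named by the FIB. */
final class TableStreamSketch {

    static final int WIDENT_WORD = 0xA5EC;      // assumed value of FibBase.wIdent
    static final int FLAGS_OFFSET = 10;         // assumed offset of the FibBase flags field
    static final int F_WHICH_TBL_STM = 0x0200;  // assumed bit mask for fWhichTblStm

    /** Returns "1Table" or "0Table" given the first bytes of the Word Document Stream. */
    static String tableStreamName(byte[] fibStart) {
        if (fibStart.length < FLAGS_OFFSET + 2) {
            throw new IllegalArgumentException("need at least 12 bytes of the FIB");
        }
        if (readUInt16(fibStart, 0) != WIDENT_WORD) {
            throw new IllegalArgumentException("not a Word binary document stream");
        }
        int flags = readUInt16(fibStart, FLAGS_OFFSET);
        return (flags & F_WHICH_TBL_STM) != 0 ? "1Table" : "0Table";
    }

    /** Reads a little-endian unsigned 16-bit integer. */
    static int readUInt16(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    public static void main(String[] args) {
        // Made-up header: identifier, then zeros, then flags with the fWhichTblStm bit set.
        byte[] fib = new byte[12];
        fib[0] = (byte) 0xEC;
        fib[1] = (byte) 0xA5;
        fib[11] = 0x02;
        System.out.println(tableStreamName(fib));
    }
}
```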
Apache POI (Poor Obfuscation Implementation) is an open source Java API for processing Microsoft Office documents. With POI it is possible to read, modify, or create Office documents. Apache POI provides components that work with different Office document types:
- XSSF and HSSF for Excel
- XWPF and HWPF for Word
- XSLF and HSLF for PowerPoint
- HSMF for Outlook
- HDGF for Visio
- HPBF for Publisher
- and more
Apache POI handles both the legacy file format and the Office Open XML (OOXML) file format. For example, the XML SpreadSheet Format (XSSF) component handles Excel's OOXML format, while the Horrible SpreadSheet Format (HSSF) component handles the legacy format. OOXML was introduced with Office 2007 and has been the default ever since; the corresponding Word, Excel, and PowerPoint extensions are docx, xlsx, and pptx. Before Office 2007, Microsoft Office used the legacy format, formally known as the binary file format, with the extensions doc, xls, and ppt.
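File extensions can be wrong, so it can help to check the first bytes of a file before choosing between the legacy components (HWPF, HSSF, HSLF) and the OOXML ones (XWPF, XSSF, XSLF). The following is a minimal stdlib-only sketch: the class `OfficeFormatSniffer` is hypothetical, and it relies on the assumption that legacy files are OLE2 compound files and OOXML files are ZIP archives. It only identifies the container, not whether the file is Word, Excel, or PowerPoint.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Apache POI: guesses the container format from a file header. */
final class OfficeFormatSniffer {

    enum Container { OLE2_LEGACY, ZIP_OOXML, UNKNOWN }

    // Signature at the start of an OLE2 compound file (doc, xls, ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a ZIP archive (docx, xlsx, pptx).
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Classifies a file by its leading bytes; pass at least the first 8 bytes. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2_LEGACY;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP_OOXML;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHeader = {'P', 'K', 3, 4, 0, 0, 0, 0};
        System.out.println(sniff(zipHeader));
    }
}
```

In a real program the header would come from reading the first eight bytes of the file. POI ships its own detection helper, `FileMagic` in the `org.apache.poi.poifs.filesystem` package, which is probably the better choice when POI is already on the classpath.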