File Forensics: Unzipping Word Docs to See XML Source

Sunday, October 16, 2011

Dan Dieterle




Have you ever tried to open a Word .docx file in Notepad? If so, then you know that you get a screen full of what looks like random garbage.


A .docx document is stored as XML, so you would expect to see formatted, readable text. So why don’t you? The key is the first two readable characters that show up in that jumble: “PK”.

The answer is that Word's .docx files are zipped! Since the DOS days, every ZIP file viewed as text has started with the characters PK, the initials of Phil Katz, who created the format.
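If you would rather check for the signature in code than squint at Notepad, here is a minimal sketch. The function name looks_like_zip is made up for this example, and it only compares the first two bytes, so treat it as a quick hint rather than proof.

```python
ZIP_MAGIC = b"PK"  # the two characters every ZIP file starts with


def looks_like_zip(path):
    """Return True if the file at `path` begins with the 'PK' signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == ZIP_MAGIC
```

Python's standard library also offers zipfile.is_zipfile(), which is a stricter test because it looks for the archive's directory record rather than the first two bytes.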

All you need to do is run the .docx file through an unzip program, and you will get several files and folders full of XML data.
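Any unzip tool will do, but the same step can be scripted. The sketch below uses Python's zipfile module; the helper name unpack_docx and the file names in the usage note are placeholders, not anything from a real case.

```python
import zipfile


def unpack_docx(docx_path, out_dir):
    """Extract every member of a .docx archive into out_dir.

    Returns the member names in the order they are stored in the archive.
    """
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

Calling unpack_docx("report.docx", "report_unzipped") on a typical Word document should list entries such as [Content_Types].xml, docProps/core.xml and word/document.xml, though the exact set varies from file to file.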


The files can now be opened in Notepad, but if you just double-click them, they will open in your web browser and be a bit more readable. Browse through the newly created folders and you will find a ton of formatting information and the complete text of the document.
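To pull out just the document text without wading through the formatting markup, something like the following should work. It assumes the body lives in word/document.xml and that text sits in w:t elements inside w:p paragraphs, which is the usual layout for Word files; docx_paragraphs is a name invented for this sketch.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the w: prefix in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_path):
    """Return the plain text of each paragraph in a .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Note that this reads the XML straight out of the archive, so nothing has to be unzipped to disk first.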

But you will also find information that could be very useful for forensics, including the file revision number, creation and modification dates, the document's creator, and who last modified the document.
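In the Word files I am aware of, these details sit in docProps/core.xml inside the archive. The sketch below reads that file and returns the fields mentioned above; the function name core_properties and the dictionary keys are my own choices, and any field missing from a given document comes back as None.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespaces used by the core properties part (docProps/core.xml)
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Friendly key -> element name inside core.xml
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def core_properties(docx_path):
    """Return the author, editor, revision and date fields of a .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    return {key: root.findtext(tag, namespaces=NS) for key, tag in FIELDS.items()}
```

Keep in mind that these values are written by the editing software and can be changed or stripped by anyone who knows where to look, so they are a lead to follow up rather than proof on their own.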


Apparently, this type of forensics was used to catch the man who put a collar bomb on a high school student in Australia. Forensic examiners found the bomber's name hidden in documents on a USB drive draped around the victim's neck.

For more information, including a forensics recreation, check out “Forensic Examinations 5 – File Signatures, Metadata And The Collar Bomber – Part 2”.

Cross-posted from Cyber Arms

Chris Kimmel: Solid link for most file headers:

For instance, in hex all JPEGs start with FFD8 and end with FFD9.

Or, if you can't tell what a file actually is, just use Unix and run the file command.
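As a rough illustration of the signature idea from the comment above, here is a tiny lookup that knows only the two signatures mentioned on this page. The names sniff and has_jpeg_markers are invented for this sketch, and the real file command recognizes far more formats with far more care.

```python
# Leading bytes -> description, limited to the signatures discussed here
SIGNATURES = [
    (b"\xff\xd8", "JPEG image"),
    (b"PK", "ZIP archive (includes .docx)"),
]


def sniff(data):
    """Guess a file type from the first bytes of its content."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def has_jpeg_markers(data):
    """Check for the FFD8 start marker and the FFD9 end marker."""
    return data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")
```

Some JPEGs in the wild carry extra bytes after the FFD9 marker, so a failed end check is worth a closer look rather than an automatic verdict.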