Technology Is Not Dull: Analysing file content

Friday, 9 August 2019

Analysing file content - a simple Python program

I needed to describe the structure of a simple PDF file as part of a presentation, so I thought it would be interesting to see how an actual (working) example was laid out.

The file contents were created by copying the text of the example in Annex H.3 of the ISO 32000 reference document (available from the Adobe site) and pasting it into an empty file. Some tidying up was needed because the copy included some extraneous characters.

This is how the file displayed.

This is the Python program used to read and display the file contents and offsets.

# Open and read the contents of a newline-delimited file and display
# each line with its line number and starting byte offset
filename = "/home/pi/Documents/simplepdf.pdf"
# Open file for read as bytes
with open(filename, "rb") as f:
    # Initialise variables
    addr = 0            # Current byte count
    line_number = 0     # Line number displayed
    lineoftext = ""     # Accumulated text for a line
    startlineaddr = 0   # Offset for start of accumulated text

    byte = f.read(1)    # Read a byte from the file
    while byte != b"":  # While something has been read
        addr += 1       # Increment byte count
        if byte == b"\n":     # If byte is newline character
            line_number += 1  # Increment line number
            # Print line number, offset and line of text
            print("{:03d}".format(line_number)+"-"+"{:05d}".format(startlineaddr)+":"+lineoftext)
            lineoftext = ""       # Clear line of text
            startlineaddr = addr  # Set offset of new line of text
        else:
            # Decode byte as a utf-8 character,
            # if it fails use default string conversion
            try:
                lineoftext = lineoftext + (byte.decode("utf-8"))
            except UnicodeDecodeError:
                lineoftext = lineoftext + str(byte)
        byte = f.read(1)
    # If there is any undisplayed text left, display it
    if lineoftext != "":
        line_number += 1
        print("{:03d}".format(line_number)+"-"+"{:05d}".format(startlineaddr)+":"+lineoftext)
    # Display file length
    print(str(addr) + " Bytes")
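
The same kind of listing can be produced without reading one byte at a time. The sketch below is an alternative, not the program used for this post. It relies on the fact that iterating over a file opened in binary mode yields lines split on the newline byte. The function name line_offsets and the sample bytes are made up for illustration, and undecodable bytes are shown as backslash escapes rather than via str(byte).

```python
import io


def line_offsets(stream):
    """Yield (line_number, start_offset, text) for each newline-delimited line."""
    offset = 0
    # Iterating over a binary stream yields lines split on b"\n" only
    for line_number, raw in enumerate(stream, start=1):
        # Drop the trailing newline; show undecodable bytes as escapes
        text = raw.rstrip(b"\n").decode("utf-8", errors="backslashreplace")
        yield line_number, offset, text
        offset += len(raw)  # Next line starts after this one, newline included


# Hypothetical sample data standing in for the real file
sample = io.BytesIO(b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj")
for number, start, text in line_offsets(sample):
    print("{:03d}-{:05d}:{}".format(number, start, text))
```

To run it against a real file, pass the object returned by open(filename, "rb") instead of the io.BytesIO sample.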

This is the output.

001-00000:%PDF-1.4
002-00010:1 0 obj
003-00018: << /Type /Catalog
004-00038: /Outlines 2 0 R
005-00055: /Pages 3 0 R
006-00070: >>
007-00074:endobj
008-00081:
009-00082:2 0 obj
010-00090: << /Type /Outlines
011-00110: /Count 0
012-00121: >>
013-00125:endobj
014-00132:
015-00133:3 0 obj
016-00141: << /Type /Pages
017-00158: /Kids [4 0 R]
018-00173: /Count 1
019-00184: >>
020-00188:endobj
021-00195:
022-00196:4 0 obj
023-00204: << /Type /Page
024-00220: /Parent 3 0 R
025-00235: /MediaBox [0 0 612 792]
026-00261: /Contents 5 0 R
027-00279: /Resources << /ProcSet 6 0 R
028-00309: /Font << /F1 7 0 R
029-00332: >>
030-00339: >>
031-00343:>> endobj
032-00353:5 0 obj
033-00361:<< /Length 73 >>
034-00378:stream
035-00386: BT
036-00390: /F1 24 Tf
037-00401: 100 100 Td
038-00413: (Hello World) Tj
039-00431: ET
040-00436:endstream
041-00446:endobj
042-00453:
043-00454:6 0 obj
044-00462: [/PDF /Text]
045-00476:endobj
046-00483:
047-00484:7 0 obj
048-00492: << /Type /Font
049-00508: /Subtype /Type1
050-00525: /Name /F1
051-00536: /BaseFont /Helvetica
052-00558: /Encoding /MacRomanEncoding
053-00587: >>
054-00592:endobj
055-00599:
056-00600:xref
057-00605:0 8
058-00609:0000000000 65535 f
059-00629:0000000009 00000 n
060-00649:0000000074 00000 n
061-00669:0000000120 00000 n
062-00689:0000000179 00000 n
063-00709:0000000364 00000 n
064-00729:0000000466 00000 n
065-00749:0000000496 00000 n
066-00768:
067-00769:trailer
068-00777:<< /Size 8
069-00788: /Root 1 0 R
070-00802: >>
071-00806:startxref
072-00817:625
073-00822:%%EOF
827 Bytes

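Interestingly, XPDF, Chrome and Apple Preview happily display the source file, even though the offsets in the xref table (line 58 onwards) do not line up with the objects in the file. For example, the table gives offset 9 for object 1 and 74 for object 2, whereas the output shows those objects starting at bytes 10 and 82. Likewise, startxref points at byte 625 while the xref keyword actually sits at byte 600.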
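
As a rough illustration of that mismatch, a check along the following lines compares each xref entry with the place where the matching "N G obj" header really starts. It is a minimal sketch and not a real PDF parser: it assumes a single xref section starting at object 0, and the function name check_xref and the sample bytes are invented for the example.

```python
import re


def check_xref(data):
    """Return (object_number, claimed_offset, actual_offset) for in-use entries.

    Simplified: assumes one xref section whose subsection starts at
    object 0, with entries of the form 'nnnnnnnnnn ggggg n'.
    """
    # Actual offsets: object number -> offset of its 'N G obj' header
    actual = {int(m.group(1)): m.start()
              for m in re.finditer(rb"^(\d+) \d+ obj", data, re.MULTILINE)}
    # Claimed offsets: the entries between the 'xref' and 'trailer' keywords
    table = data[data.index(b"xref"):data.index(b"trailer")]
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", table)
    results = []
    for obj_number, (offset, _gen, kind) in enumerate(entries):
        if kind == b"n":  # Skip free entries such as object 0
            results.append((obj_number, int(offset), actual.get(obj_number)))
    return results


# Hypothetical sample with a deliberately wrong offset for object 1
sample = (b"%PDF-1.4\n"
          b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
          b"xref\n0 2\n"
          b"0000000000 65535 f \n"
          b"0000000005 00000 n \n"
          b"trailer\n<< /Size 2 /Root 1 0 R >>\n")

for number, claimed, found in check_xref(sample):
    status = "ok" if claimed == found else "MISMATCH"
    print(number, claimed, found, status)
```

In the sample, object 1 really starts at byte 9 but the table claims byte 5, so the loop reports a mismatch. Passing the bytes of a real file (read with open(filename, "rb").read()) would list every in-use entry the same way.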

More on that later.

References:

https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf