Friday 9 August 2019

Analysing file content - a simple Python program

I needed to describe the structure of a simple PDF file as part of a presentation, so I thought it would be interesting to see how an actual (working) example is laid out.

The file contents were created by copying the text of the example in appendix H3 of the ISO 32000 reference document (available from the Adobe site) and pasting it into an empty file. Some tidying up was needed because the copy included extraneous characters.

This is how the file displayed.

This is the Python program used to read and display the file contents and offsets.

# Open a newline-delimited file, read it byte by byte and display each
# line with its line number and starting byte offset
filename="/home/pi/Documents/simplepdf.pdf"
# Open file for read as bytes
with open(filename, "rb") as f:
    # Initialise variables
    addr=0                  # Current byte count
    line_number = 0         # Line number displayed
    lineoftext = ""         # Accumulated text for a line
    startlineaddr = 0       # Offset of start of accumulated text
    
    byte = f.read(1)        # Read a byte from the file
    while byte != b"":      # While something has been read
        addr+=1                # Increment byte count
        if byte==b"\n":        # If byte is newline character
            line_number += 1        # Increment line number
                                    # Print line number, offset and line of text
            print("{:03d}".format(line_number)+"-"+"{:05d}".format(startlineaddr)+":"+lineoftext)
            lineoftext=""           # Clear line of text
            startlineaddr=addr      # set offset of new line of text
        else:
            try:                # Decode byte as a utf-8 character,
                                # if it fails use default string conversion 
                lineoftext = lineoftext + (byte.decode("utf-8"))
            except UnicodeDecodeError:   
                lineoftext = lineoftext + str(byte)
        byte = f.read(1)
    # If there is any undisplayed text left, display it
    if lineoftext !="":
        line_number += 1
        print("{:03d}".format(line_number)+"-"+"{:05d}".format(startlineaddr)+":"+lineoftext)
    # Display file length
    print(str(addr) + " Bytes")

This is the output.

001-00000:%PDF-1.4 
002-00010:1 0 obj
003-00018: << /Type /Catalog 
004-00038: /Outlines 2 0 R
005-00055: /Pages 3 0 R 
006-00070: >>
007-00074:endobj
008-00081:
009-00082:2 0 obj
010-00090: << /Type /Outlines
011-00110: /Count 0 
012-00121: >>
013-00125:endobj
014-00132:
015-00133:3 0 obj
016-00141: << /Type /Pages
017-00158: /Kids [4 0 R]
018-00173: /Count 1 
019-00184: >>
020-00188:endobj
021-00195:
022-00196:4 0 obj
023-00204: << /Type /Page
024-00220: /Parent 3 0 R
025-00235: /MediaBox [0 0 612 792] 
026-00261: /Contents 5 0 R 
027-00279: /Resources << /ProcSet 6 0 R
028-00309:   /Font << /F1 7 0 R 
029-00332:   >> 
030-00339: >>
031-00343:>> endobj
032-00353:5 0 obj
033-00361:<< /Length 73 >>
034-00378:stream 
035-00386: BT
036-00390: /F1 24 Tf
037-00401: 100 100 Td
038-00413: (Hello World) Tj
039-00431: ET 
040-00436:endstream
041-00446:endobj
042-00453:
043-00454:6 0 obj
044-00462: [/PDF /Text]
045-00476:endobj
046-00483:
047-00484:7 0 obj
048-00492: << /Type /Font
049-00508: /Subtype /Type1
050-00525: /Name /F1
051-00536: /BaseFont /Helvetica
052-00558: /Encoding /MacRomanEncoding
053-00587: >> 
054-00592:endobj
055-00599:
056-00600:xref
057-00605:0 8
058-00609:0000000000 65535 f 
059-00629:0000000009 00000 n 
060-00649:0000000074 00000 n 
061-00669:0000000120 00000 n 
062-00689:0000000179 00000 n 
063-00709:0000000364 00000 n 
064-00729:0000000466 00000 n 
065-00749:0000000496 00000 n
066-00768:
067-00769:trailer
068-00777:<< /Size 8
069-00788: /Root 1 0 R 
070-00802: >>
071-00806:startxref 
072-00817:625 
073-00822:%%EOF
827 Bytes
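
The same listing can also be produced without reading one byte at a time. What follows is only a sketch, not the program used above. The helper name list_lines is hypothetical, the whole file is assumed to fit in memory, and undecodable bytes are shown as backslash escapes rather than via str(byte).

```python
def list_lines(data):
    """Return 'NNN-OOOOO:text' strings for newline-delimited bytes.

    Hypothetical helper: NNN is the line number, OOOOO the byte offset
    at which the line starts, as in the listing above.
    """
    chunks = data.split(b"\n")
    # A trailing newline (or an empty file) leaves an empty final chunk
    if chunks[-1] == b"":
        chunks.pop()
    rows = []
    offset = 0
    for number, raw in enumerate(chunks, start=1):
        # Assumption: show bytes that are not valid UTF-8 as \xNN escapes
        text = raw.decode("utf-8", errors="backslashreplace")
        rows.append("{:03d}-{:05d}:{}".format(number, offset, text))
        offset += len(raw) + 1      # +1 for the newline that was split off
    return rows


if __name__ == "__main__":
    sample = b"%PDF-1.4 \n1 0 obj\n << /Type /Catalog \n"
    for row in list_lines(sample):
        print(row)
```

To run it against a real file, read the bytes with open(filename, "rb") and pass the result of f.read() to list_lines.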

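Interestingly, XPDF, Chrome and Apple Preview happily display the source file, even though the offsets in the xref table (line 58 onwards) do not match the positions of the objects in the file. For example, the table lists object 1 at offset 9, but the listing shows "1 0 obj" starting at offset 10.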
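
One way to check the mismatch is to compare each in-use xref entry with the offset at which the matching "N G obj" line actually starts. The sketch below assumes a single xref subsection and objects that begin at the start of a line, as in this file. The function name check_xref is hypothetical and this is not a general PDF parser.

```python
import re


def check_xref(data):
    """Return (object number, offset listed in xref, actual offset) tuples.

    Hypothetical helper: handles one xref subsection only and reports
    in-use ('n') entries. The actual offset is None if no such object is found.
    """
    # Where each "N G obj" line really starts in the file
    actual = {int(m.group(1)): m.start(1)
              for m in re.finditer(rb"^(\d+) (\d+) obj", data, re.MULTILINE)}

    # The "xref" keyword followed by "<first object> <entry count>"
    head = re.search(rb"^xref\s+(\d+) (\d+)\s+", data, re.MULTILINE)
    if head is None:
        return []
    first, count = int(head.group(1)), int(head.group(2))

    # Each entry is a 10-digit offset, a 5-digit generation and 'n' or 'f'
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", data[head.end():])[:count]

    report = []
    for number, (offset, _generation, kind) in enumerate(entries, start=first):
        if kind == b"n":
            report.append((number, int(offset), actual.get(number)))
    return report


if __name__ == "__main__":
    sample = (b"%PDF-1.4 \n1 0 obj\n<< >>\nendobj\n"
              b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n")
    for number, listed, found in check_xref(sample):
        print(number, listed, found)
```

Any row in which the listed and actual offsets differ is an entry that a strict reader could not follow directly to its object.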

More on that later.

References: