Object model

This section covers the object model pikepdf uses in more detail.

A pikepdf.Object is a Python wrapper around a C++ QPDFObjectHandle which, as the name suggests, is a handle (or pointer) to a data structure in memory, or possibly a reference to data that exists in a file. Importantly, an object can be a scalar quantity (like a string) or a compound quantity (like a list or dict, that contains other objects). The fact that the C++ class involved here is an object handle is an implementation detail; it shouldn’t matter for a pikepdf user.

The simplest types in PDFs are directly represented as Python types: int, bool, and None stand for PDF integers, booleans and the “null”. Decimal is used for floating point numbers in PDFs. If a value in a PDF is assigned a Python float, pikepdf will convert it to Decimal.

Types that are not directly convertible to Python are represented as pikepdf.Object, a compound object that offers a superset of possible methods, some of which only if the underlying type is suitable. Use the EAFP idiom, or isinstance to determine the type more precisely. This partly reflects the fact that the PDF specification allows many data fields to be one of several types.

Explicit conversion mode

By default, pikepdf automatically converts PDF scalar types to Python native types (int, bool, Decimal). This is convenient but can make type checking difficult, especially when handling potentially malformed PDFs where a field might contain an unexpected type.

pikepdf provides an explicit conversion mode that preserves PDF type information by returning pikepdf.Integer, pikepdf.Boolean, and pikepdf.Real objects instead of native Python types.

>>> import pikepdf
>>> with pikepdf.explicit_conversion():
...     pdf = pikepdf.open("example.pdf")
...     count = pdf.Root.Pages.Count
...     isinstance(count, pikepdf.Integer)  # True
...     int(count)  # Convert to Python int
5

You can enable explicit mode globally with pikepdf.set_object_conversion_mode():

>>> pikepdf.set_object_conversion_mode('explicit')
>>> pikepdf.get_object_conversion_mode()
'explicit'

Safe accessor methods

When working with PDF objects, you often need to extract scalar values while handling type mismatches gracefully. The as_* methods provide type-safe access with optional defaults:

>>> with pikepdf.explicit_conversion():
...     d = pikepdf.Dictionary(Width=100, Name=pikepdf.Name.Foo)
...     d.Width.as_int()  # Returns 100
...     d.Name.as_int(default=0)  # Returns 0 (Name is not an integer)
...     d.Name.as_int()  # Raises TypeError

Available methods:

as_int() - convert to int, or return default
as_bool() - convert to bool, or return default
as_float() - convert to float, or return default
as_decimal() - convert to Decimal (for Real only), or return default

Arithmetic with scalar types

pikepdf.Integer and pikepdf.Real support arithmetic operations with both int and float operands:

>>> with pikepdf.explicit_conversion():
...     d = pikepdf.Dictionary(Value=10)
...     d.Value + 5      # Returns 15 (int)
...     d.Value + 2.5    # Returns 12.5 (float)
...     d.Value / 4      # Returns 2.5 (float, true division)
...     d.Value // 3     # Returns 3 (int, floor division)

For convenience, the repr() of a pikepdf.Object will display a Python expression that replicates the existing object (when possible), so it will say:

>>> catalog_name = pdf.Root.Type
pikepdf.Name("/Catalog")
>>> isinstance(catalog_name, pikepdf.Name)
True
>>> isinstance(catalog_name, pikepdf.Object)
True

Making PDF objects

You may construct a new object with one of the classes:

pikepdf.Array
pikepdf.Dictionary
pikepdf.Name - the type used for keys in PDF Dictionary objects
pikepdf.String - a text string (treated as bytes and str depending on context)
pikepdf.Integer - a PDF integer (explicit mode)
pikepdf.Boolean - a PDF boolean (explicit mode)
pikepdf.Real - a PDF real/floating-point number (explicit mode)

These may be thought of as subclasses of pikepdf.Object. (Internally they are pikepdf.Object.)

There are a few other classes for special PDF objects that don’t map to Python as neatly.

pikepdf.Operator - a special object involved in processing content streams
pikepdf.Stream - a special object similar to a Dictionary with binary data attached
pikepdf.InlineImage - an image that is embedded in content streams

The great news is that it’s often unnecessary to construct pikepdf.Object objects when working with pikepdf. Python types are transparently converted to the appropriate pikepdf object when passed to pikepdf APIs – when possible. However, pikepdf sends pikepdf.Object types back to Python on return calls, in most cases, because pikepdf needs to keep track of objects that came from PDFs originally.

Accessing nested objects

For accessing deeply nested structures, pikepdf.NamePath provides ergonomic syntax with helpful error messages. See Accessing nested objects with NamePath for details.

Object lifecycle and memory management

As mentioned above, a pikepdf.Object may reference data that is lazily loaded from its source pikepdf.Pdf. Closing the Pdf with pikepdf.Pdf.close() will invalidate some objects, depending on whether or not the data was loaded, and other implementation details that may change. Generally speaking, a pikepdf.Pdf should be held open until it is no longer needed, and objects that were derived from it may or may not be usable after it is closed.

Simple objects (booleans, integers, decimals, None) are copied directly to Python as pure Python objects.

For PDF stream objects, use pikepdf.Object.read_bytes() to obtain a copy of the object as pure bytes data, if this information is required after closing a PDF.

When objects are copied from one pikepdf.Pdf to another, the underlying data is copied immediately into the target. As such it is possible to merge hundreds of Pdf into one, keeping only a single source at a time and the target file open.

Indirect objects

PDF has two ways to represented a PDF dictionary that contains another dictionary: it can contain the inner dictionary, or provide a reference to another object. In the PDF file itself, most objects have an object number that is for referencing.

pikepdf hides the details about whether an object is directly or indirectly referenced, since in many situations it does not matter and manually testing each object to see if it needs to be dereferenced before accessing it is tedious. However, you may need to create indirect references. Sometimes, the PDF 1.7 Reference Manual specifically requires that a value be an indirect object.

You can use pikepdf.Object.is_indirect to check if an object is actually an indirect reference. If you require an indirect object, use pikepdf.Pdf.make_indirect() to attach the dictionary to a Pdf and return an indirect copy of it. Direct objects are not attached to any particular Pdf and can be copied from one to another, just like scalars. Indirect objects must be attached.

Stream objects are always indirect objects, and must always be attached to a PDF.

Object helpers

pikepdf also provides pikepdf.ObjectHelper and various subclasses of this class. Usually these are wrappers around a pikepdf.Dictionary with special rules applicable to that type of dictionary. pikepdf.Page is an example of an object helper. The underlying object can be accessed with pikepdf.ObjectHelper.obj.