Support models

Support models are abstractions over “raw” objects within a Pdf. For example, a page in a PDF is a Dictionary with /Type set to /Page. The Dictionary in that case is the “raw” object. Upon establishing what type of object it is, we can wrap it with a support model that adds features to ensure consistency with the PDF specification.

pikepdf does not currently apply support models to “raw” objects automatically, but might do so in a future release (this would break backward compatibility).

For example, to initialize a Page support model:

from pikepdf import Pdf, Page

pdf = Pdf.open(...)
page_support_model = Page(pdf.pages[0])
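
The wrapping idea can be illustrated without pikepdf at all. The following is a toy sketch, not pikepdf's implementation: ToyPage and the plain dict standing in for a “raw” Dictionary are hypothetical names invented for this example.

```python
class ToyPage:
    """Toy support model: wraps a raw dictionary after checking its type."""

    def __init__(self, raw):
        # Establish what kind of object this is before wrapping it.
        if raw.get("/Type") != "/Page":
            raise TypeError("raw object is not a /Page dictionary")
        self.obj = raw  # keep a reference to the underlying raw object

    @property
    def rotation(self):
        # The wrapper can supply a default that the raw dictionary omits.
        return self.obj.get("/Rotate", 0)


raw_page = {"/Type": "/Page", "/MediaBox": [0, 0, 612, 792]}
page = ToyPage(raw_page)
```

The wrapper holds a reference to the raw object rather than a copy, which mirrors the obj attribute documented for the real Page class.
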
class pikepdf.Page
add_content_token_filter(self: pikepdf.Page, tf: pikepdf.TokenFilter) → None

Attach a pikepdf.TokenFilter to a page’s content stream.

This function applies token filters lazily, if and when the page’s content stream is read for any reason, such as when the PDF is saved. If the content stream is never accessed, the token filter is not applied.

Multiple token filters may be added to a page/content stream.

If the page’s contents is an array of streams, it is coalesced.

as_form_xobject(self: pikepdf.Page, handle_transformations: bool = True) → pikepdf.Object

Return a form XObject that draws this page.

This is useful for n-up operations, underlay, overlay, thumbnail generation, or any other case in which it is useful to replicate the contents of a page in some other context. The dictionaries are shallow copies of the original page dictionary, and the contents are coalesced from the page’s contents. The resulting object handle is not referenced anywhere.

Parameters:handle_transformations (bool) – If True, the resulting form XObject’s /Matrix will be set to replicate rotation (/Rotate) and scaling (/UserUnit) in the page’s dictionary. In this way, the page’s transformations will be preserved when placing this object on another page.
contents_coalesce(self: pikepdf.Page) → None

Coalesce a page’s content streams.

A page’s content may be a stream or an array of streams. If this page’s content is an array, concatenate the streams into a single stream. This can be useful when working with files that split content streams in arbitrary spots, such as in the middle of a token, as that can confuse some software.

externalize_inline_images(self: pikepdf.Page, min_size: int = 0) → None

Convert inline images to normal (external) images.

Parameters:min_size (int) – minimum size in bytes
get_filtered_contents(self: pikepdf.Page, tf: TokenFilter) → bytes

Apply a pikepdf.TokenFilter to a content stream, without modifying it.

This may be used when the results of a token filter do not need to be applied, such as when filtering is being used to retrieve information rather than edit the content stream.

Note that it is possible to create a subclassed TokenFilter that saves information of interest to its object attributes; it is not necessary to return data in the content stream.

To modify the content stream, use pikepdf.Page.add_content_token_filter().

Returns:the filtered content stream (the page’s own content stream is left unchanged)
Return type:bytes
obj

Get the underlying pikepdf.Object.

parse_contents(self: pikepdf.Page, arg0: pikepdf._qpdf.StreamParser) → None

Parse a page’s content streams using a pikepdf.StreamParser.

The content stream may be interpreted by the StreamParser but is not altered.

If the page’s contents is an array of streams, it is coalesced.

remove_unreferenced_resources(self: pikepdf.Page) → None

Removes from the resources dictionary any object not referenced in the content stream.

A page’s resources dictionary maps names to objects elsewhere in the file. This method walks through a page’s contents and keeps track of which resources are referenced somewhere in the contents. Then it removes from the resources dictionary any object that is not referenced in the contents. This method is used by page splitting code to avoid copying unused objects in files that use shared resource dictionaries across multiple pages.

rotate(self: pikepdf.Page, angle: int, relative: bool) → None

Rotate a page.

If relative is False, set the rotation of the page to angle. Otherwise, add angle to the rotation of the page. angle must be a multiple of 90. Adding 90 to the rotation rotates clockwise by 90 degrees.
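
The relative/absolute rule above can be sketched in plain Python. This is an illustration only, not pikepdf's code: resulting_rotation is a hypothetical helper, and normalising the result into the range 0 to 359 is an assumption of this sketch.

```python
def resulting_rotation(current, angle, relative):
    """Return the page rotation after a rotate(angle, relative) call."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    # relative=True adds to the existing rotation; relative=False replaces it.
    new_rotation = current + angle if relative else angle
    # Assumption: keep the stored value within one full turn.
    return new_rotation % 360
```

For a page already rotated by 90 degrees, a relative rotation of 90 yields 180, while an absolute rotation of 90 leaves it at 90.
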

class pikepdf.PdfMatrix(*args)

Support class for PDF content stream matrices

PDF content stream matrices are 3x3 matrices summarized by a shorthand (a, b, c, d, e, f) which correspond to the first two column vectors. The final column vector is always (0, 0, 1) since this is using homogeneous coordinates.

PDF uses row vectors. That is, vr @ A' gives the effect of transforming a row vector vr=(x, y, 1) by the matrix A'. Most textbook treatments use A @ vc where the column vector vc=(x, y, 1)'.

(@ is the Python matrix multiplication operator added in Python 3.5.)
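
The row-vector convention can be worked through in plain Python. This is a minimal sketch that does not use PdfMatrix; expand and row_times_matrix are hypothetical helper names chosen for this example.

```python
def expand(shorthand):
    """Expand (a, b, c, d, e, f) into the full 3x3 matrix, as rows."""
    a, b, c, d, e, f = shorthand
    # The last column is always (0, 0, 1) in homogeneous coordinates.
    return ((a, b, 0), (c, d, 0), (e, f, 1))


def row_times_matrix(row, matrix):
    """Compute row @ matrix for a 3-element row vector."""
    return tuple(
        sum(row[k] * matrix[k][j] for k in range(3)) for j in range(3)
    )


translate = expand((1, 0, 0, 1, 10, 20))  # shift by (10, 20)
scale = expand((2, 0, 0, 3, 0, 0))        # scale x by 2 and y by 3

moved = row_times_matrix((2, 3, 1), translate)
stretched = row_times_matrix((2, 3, 1), scale)
```

Written out, the point (x, y) maps to (a*x + c*y + e, b*x + d*y + f), so e and f are the translation components.
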

Addition and other operations are not implemented because they’re not that meaningful in a PDF context (they can be defined and are mathematically meaningful in general).

PdfMatrix objects are immutable. All transformations on them produce a new matrix.

a
b
c
d
e
f

Return one of the six “active values” of the matrix.

encode()

Encode this matrix in binary suitable for including in a PDF

static identity()

Constructs and returns an identity matrix

rotated(angle_degrees_ccw)

Concatenates a rotation matrix on this matrix

scaled(x, y)

Concatenates a scaling matrix on this matrix

shorthand

Return the 6-tuple (a,b,c,d,e,f) that describes this matrix

translated(x, y)

Translates this matrix

class pikepdf.PdfImage(obj)

Support class to provide a consistent API for manipulating PDF images

The data structure for images inside PDFs is irregular and flexible, making it difficult to work with without introducing errors for less typical cases. This class addresses these difficulties by providing a regular, Pythonic API similar in spirit to (and convertible to) the Python Pillow imaging library.

as_pil_image()

Extract the image as a Pillow Image, using decompression as necessary

Returns:PIL.Image.Image
extract_to(*, stream=None, fileprefix='')

Attempt to extract the image directly to a usable image file

If possible, the compressed data is extracted and inserted into a compressed image file format without transcoding the compressed content. If this is not possible, the data will be decompressed and extracted to an appropriate format.

Because it is not known until attempted what image format will be extracted, users should not assume what format they are getting back. When saving the image to a file, use a temporary filename, and then rename the file to its final name based on the returned file extension.

Examples

>>> im.extract_to(stream=bytes_io)
'.png'
>>> im.extract_to(fileprefix='/tmp/image00')
'/tmp/image00.jpg'
Parameters:
  • stream – Writable stream to write data to.
  • fileprefix (str or Path) – The path to write the extracted image to, without the file extension.
Returns:

If fileprefix was provided, then the fileprefix with the appropriate extension. If no fileprefix, then an extension indicating the file type.

Return type:
str
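
The temporary-name-then-rename advice for extract_to can be sketched as a small helper. This is an illustration under stated assumptions: extract_with_final_name is a hypothetical function, and image is assumed to be any object whose extract_to(fileprefix=...) writes a file and returns its full path, as described above.

```python
import os
import tempfile
from pathlib import Path


def extract_with_final_name(image, final_stem):
    """Extract to a temporary name, then rename using the real extension."""
    final_stem = Path(final_stem)
    # Work in a temporary directory next to the destination so the final
    # rename stays on the same filesystem.
    with tempfile.TemporaryDirectory(dir=final_stem.parent) as tmp:
        written = Path(image.extract_to(fileprefix=str(Path(tmp) / "image")))
        # Only now is the file type known; reuse the returned extension.
        final_path = final_stem.with_name(final_stem.name + written.suffix)
        os.replace(written, final_path)
    return final_path
```

The caller passes the destination without an extension and receives the final path, whatever format was actually produced.
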
get_stream_buffer(decode_level=StreamDecodeLevel.specialized)

Access this image with the buffer protocol

icc

If an ICC profile is attached, return a Pillow object that describes it.

Most of the information may be found in icc.profile.

Returns:PIL.ImageCms.ImageCmsProfile
is_inline

False for image XObject

read_bytes(decode_level=StreamDecodeLevel.specialized)

Decompress this image and return it as unencoded bytes

show()

Show the image however PIL wants to

class pikepdf.PdfInlineImage(*, image_data, image_object: tuple)

Support class for PDF inline images

class pikepdf.models.PdfMetadata(pdf, pikepdf_mark=True, sync_docinfo=True, overwrite_invalid_xml=True)

Read and edit the metadata associated with a PDF

The PDF specification contains two types of metadata: the newer XMP (Extensible Metadata Platform, XML-based) and the older DocumentInformation dictionary. The PDF 2.0 specification removes the DocumentInformation dictionary.

This primarily works with XMP metadata, but includes methods to generate XMP from DocumentInformation and will also coordinate updates to DocumentInformation so that the two are kept consistent.

XMP metadata fields may be accessed using the full XML namespace URI or the short name. For example metadata['dc:description'] and metadata['{http://purl.org/dc/elements/1.1/}description'] both refer to the same field. Several common XML namespaces are registered automatically.
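
The relationship between the two key forms can be sketched in plain Python. This is an illustration only: to_full_name and the NAMESPACES table are hypothetical, and the table does not claim to match the set of namespaces pikepdf registers.

```python
# Illustrative prefix table; not pikepdf's actual registry.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}


def to_full_name(key):
    """Convert 'prefix:name' into '{namespace-uri}name'."""
    if key.startswith("{"):
        return key  # already in the full form
    prefix, _, local_name = key.partition(":")
    return "{" + NAMESPACES[prefix] + "}" + local_name


full = to_full_name("dc:description")
```

Both spellings resolve to the same field because the short prefix is only an alias for the namespace URI.
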

See the XMP specification for details of allowable fields.

To update metadata, use a with block.

Example

>>> with pdf.open_metadata() as records:
        records['dc:title'] = 'New Title'
load_from_docinfo(docinfo, delete_missing=False, raise_failure=False)

Populate the XMP metadata object with DocumentInfo

Parameters:
  • docinfo – a DocumentInfo, e.g. pdf.docinfo
  • delete_missing – if the entry is not in DocumentInfo, delete the equivalent from XMP
  • raise_failure – if True, raise any failure to convert docinfo; otherwise warn and continue

A few entries in the deprecated DocumentInfo dictionary are considered approximately equivalent to certain XMP records. This method copies those entries into the XMP metadata.

pdfa_status

Returns the PDF/A conformance level claimed by this PDF, or an empty string if none is claimed

A PDF may claim to be PDF/A compliant without this being true. Use an independent verifier such as veraPDF to test if a PDF is truly conformant.

Returns:The conformance level of the PDF/A, or an empty string if the PDF does not claim PDF/A conformance. Possible valid values are: 1A, 1B, 2A, 2B, 2U, 3A, 3B, 3U.
Return type:str
pdfx_status

Returns the PDF/X conformance level claimed by this PDF, or an empty string if none is claimed

A PDF may claim to be PDF/X compliant without this being true. Use an independent verifier such as veraPDF to test if a PDF is truly conformant.

Returns:The conformance level of the PDF/X, or an empty string if the PDF does not claim PDF/X conformance.
Return type:str
class pikepdf.models.Encryption(*, owner, user, R=6, allow=Permissions(accessibility=True, extract=True, modify_annotation=True, modify_assembly=False, modify_form=True, modify_other=True, print_highres=True, print_lowres=True), aes=True, metadata=True)

Specify the encryption settings to apply when a PDF is saved.

Parameters:
  • owner (str) – The owner password to use. This allows full control of the file. If blank, the PDF will be encrypted and present as “(SECURED)” in PDF viewers. If the owner password is blank, the user password should be as well.
  • user (str) – The user password to use. With this password, some restrictions will be imposed by a typical PDF reader. If blank, the PDF can be opened by anyone, but only modified as allowed by the permissions in allow.
  • R (int) – Select the security handler algorithm to use. Choose from: 2, 3, 4 or 6. By default, the highest version is selected (6). 5 is a deprecated algorithm that should not be used.
  • allow (pikepdf.Permissions) – The permissions to set. If omitted, all permissions are granted to the user.
  • aes (bool) – If True, request the AES algorithm. If False, use RC4. If omitted, AES is selected whenever possible (R >= 4).
  • metadata (bool) – If True, also encrypt the PDF metadata. If False, metadata is not encrypted. Reading document metadata without decryption may be desirable in some cases. Requires aes=True. If omitted, metadata is encrypted whenever possible.
class pikepdf.models.Outline(pdf, max_depth=15, strict=False)

Maintains an intuitive interface for creating and editing PDF document outlines, according to the PDF reference manual (ISO32000:2008) section 12.3.

Parameters:
  • pdf – PDF document object.
  • max_depth – Maximum recursion depth to consider when reading the outline.
  • strict – If set to False (default), structural errors are silently ignored. Setting it to True raises an OutlineStructureError if any object references re-occur while the outline is being read or written.
class pikepdf.models.OutlineItem(title: str, destination: (int, str, pikepdf.objects.Object) = None, page_location: (PageLocation, str) = None, action: pikepdf.objects.Dictionary = None, obj: pikepdf.objects.Dictionary = None, **kwargs)

Manages a single item in a PDF document outlines structure, including nested items.

Parameters:
  • title – Title of the outlines item.
  • destination – Page number, destination name, or any other PDF object to be used as a reference when clicking on the outlines entry. Note this should be None if an action is used instead. If set to a page number, it will be resolved to a reference at the time of writing the outlines back to the document.
  • page_location – Supplemental page location for a page number in destination, e.g. PageLocation.Fit. May also be a simple string such as 'FitH'.
  • action – Action to perform when clicking on this item. Will be ignored during writing if destination is also set.
  • obj – Dictionary object representing this outlines item in a Pdf. May be None for creating a new object. If present, an existing object is modified in-place during writing and original attributes are retained.
  • kwargs – Additional keyword arguments. Any of left, top, bottom, right, or zoom may be given; they are processed for use with extended page location types, e.g. /XYZ.

This object does not contain any information about higher-level or neighboring elements.

classmethod from_dictionary_object(obj: pikepdf.objects.Dictionary)

Creates an OutlineItem from a PDF document’s Dictionary object. Does not process nested items.

Parameters:obj – Dictionary object representing a single outline node.
to_dictionary_object(pdf, create_new=False) → pikepdf.objects.Dictionary

Creates a Dictionary object from this outline node’s data, or updates the existing object. Page numbers are resolved to a page reference on the input Pdf object.

Parameters:
  • pdf – PDF document object.
  • create_new – If set to True, creates a new object instead of modifying an existing one in-place.
class pikepdf.Permissions(accessibility=True, extract=True, modify_annotation=True, modify_assembly=False, modify_form=True, modify_other=True, print_lowres=True, print_highres=True)

Stores the permissions for an encrypted PDF.

Unencrypted PDFs implicitly have all permissions allowed. pikepdf does not enforce the restrictions in any way. Permissions can only be changed when a PDF is saved.

accessibility

Whether the owner of the PDF permits screen readers and accessibility tools to access the PDF.

extract

Whether the owner of the PDF permits software to extract content from the PDF.

modify_annotation
modify_assembly
modify_form
modify_other

Whether the owner of the PDF permits modification of various parts of the PDF.

print_lowres
print_highres

Whether the owner of the PDF permits printing at low or high resolution.

class pikepdf.models.EncryptionMethod

Describes which encryption method was used on a particular part of a PDF. These values are returned by pikepdf.EncryptionInfo but are not currently used to specify how encryption is requested.

none

Data was not encrypted.

unknown

An unknown algorithm was used.

rc4

The RC4 encryption algorithm was used (obsolete).

aes

The AES-based algorithm was used as described in the PDF 1.7 reference manual.

aesv3

An improved version of the AES-based algorithm was used as described in the Adobe Supplement to the ISO 32000, requiring PDF 1.7 extension level 3. This algorithm still uses AES, but allows both AES-128 and AES-256, and improves how the key is derived from the password.

class pikepdf.models.EncryptionInfo(encdict)

Reports encryption information for an encrypted PDF.

This information may not be changed, except when a PDF is saved. This object is not used to specify the encryption settings to save a PDF, due to non-overlapping information requirements.

P

Encoded permission bits.

See Pdf.allow instead.

R

Revision number of the security handler.

V

Version of PDF password algorithm.

bits

The number of encryption bits.

encryption_key

The RC4 or AES encryption key used for this file.

file_method

Encryption method used to encode the whole file.

stream_method

Encryption method used to encode streams.

string_method

Encryption method used to encode strings.

user_password

If possible, return the user password.

The user password can only be retrieved when a PDF is opened with the owner password and when older versions of the encryption algorithm are used.

The password is always returned as bytes even if it has a clear Unicode representation.