Support models

Support models are abstractions over “raw” objects within a Pdf. For example, a page in a PDF is a Dictionary with /Type set to /Page. The Dictionary in that case is the “raw” object. Upon establishing what type of object it is, we can wrap it with a support model that adds features to ensure consistency with the PDF specification.

In version 2.x, pikepdf did not apply support models to “raw” objects automatically. Version 3.x automatically applies support models to /Page objects.

class pikepdf.ObjectHelper

Base class for wrapper/helper around an Object.

Used to expose additional functionality specific to that object type.

pikepdf.Page is an example of an object helper. The actual page object in a PDF is a Dictionary. The helper provides additional methods specific to pages.

property obj: Dictionary

Get the underlying PDF object (typically a Dictionary).

Return type:: Dictionary

class pikepdf.Page(arg0: Object, /)

Support model wrapper around a page dictionary object.

add_content_token_filter(tf)

Attach a pikepdf.TokenFilter to a page’s content stream.

This function applies token filters lazily, if/when the page’s content stream is read for any reason, such as when the PDF is saved. If the content stream is never accessed, the token filter is not applied.

Multiple token filters may be added to a page/content stream.

Token filters may not be removed after being attached to a Pdf. Close and reopen the Pdf to remove token filters.

If the page’s contents is an array of streams, it is coalesced.

Parameters:: tf (TokenFilter) – The token filter to attach.
Return type:: None

add_overlay(other, rect, *, push_stack=...)

Overlay another object on this page.

Overlays will be drawn after all previous content, potentially drawing on top of existing content.

Parameters:

other (Object | Page) – A Page or Form XObject to render as an overlay on top of this page.
rect (Rectangle | None) – The PDF rectangle (in PDF units) in which to draw the overlay. If omitted, this page’s trimbox, cropbox or mediabox (in that order) will be used.
push_stack (bool | None) – If True (default), push the graphics stack of the existing content stream to ensure that the overlay is rendered correctly. Officially PDF limits the graphics stack depth to 32. Most viewers will tolerate more, but excessive pushes may cause problems. Multiple content streams may also be coalesced into a single content stream where this parameter is True, since the PDF specification permits PDF writers to coalesce streams as they see fit.
shrink – If True (default), allow the object to shrink to fit inside the rectangle. The aspect ratio will be preserved.
expand – If True (default), allow the object to expand to fit inside the rectangle. The aspect ratio will be preserved.

Returns:

The name of the Form XObject that contains the overlay.

Added in version 2.14.

Changed in version 4.0.0: Added the push_stack parameter. Previously, this method behaved as if push_stack were False.

Changed in version 4.2.0: Added the shrink and expand parameters. Previously, this method behaved as if shrink=True, expand=False.

Changed in version 4.3.0: Returns the name of the overlay in the resources dictionary instead of returning None.

add_resource(res, res_type, name=None, *, prefix='', replace_existing=True)

Add a new resource to the page’s Resources dictionary.

If the Resources dictionaries do not exist, they will be created.

Parameters:

self – The object to add to the resources dictionary.
res (Object) – The dictionary object to insert into the resources dictionary.
res_type (Name) – Should be one of the following Resource dictionary types: ExtGState, ColorSpace, Pattern, Shading, XObject, Font, Properties.
name (Name | None) – The name of the object. If omitted, a random name will be generated with enough randomness to be globally unique.
prefix (str) – A prefix for the name of the object. Allows convenient namespacing when using random names, e.g. prefix="Im" for images. Mutually exclusive with name parameter.
replace_existing (bool) – If the name already exists in one of the resource dictionaries, remove it.

Return type:

Name

Example

>>> import pikepdf
>>> from pikepdf import Name
>>> pdf = pikepdf.Pdf.new()
>>> pdf.add_blank_page(page_size=(100, 100))
<pikepdf.Page({
  "/Contents": pikepdf.Stream(owner=<...>, data=<...>, {

  }),
  "/MediaBox": [ 0, 0, 100, 100 ],
  "/Parent": <reference to /Pages>,
  "/Resources": {

  },
  "/Type": "/Page"
})>
>>> formxobj = pikepdf.Dictionary(
...     Type=Name.XObject,
...     Subtype=Name.Form
... )
>>> resource_name = pdf.pages[0].add_resource(formxobj, Name.XObject)

Added in version 2.3.

Changed in version 2.14: If res does not belong to the same Pdf that owns this page, a copy of res is automatically created and added instead. In previous versions, it was necessary to handle this case manually.

Changed in version 4.3.0: Returns the name of the resource in the resources dictionary instead of returning None.

add_underlay(other, rect)

Underlay another object beneath this page.

Underlays will be drawn before all other content, so they may be overdrawn partially or completely.

There is no push_stack parameter for this function, since adding an underlay can be done without manipulating the graphics stack.

Parameters:

other (Object | Page) – A Page or Form XObject to render as an underlay underneath this page.
rect (Rectangle | None) – The PDF rectangle (in PDF units) in which to draw the underlay. If omitted, this page’s trimbox, cropbox or mediabox (in that order) will be used.
shrink – If True (default), allow the object to shrink to fit inside the rectangle. The aspect ratio will be preserved.
expand – If True (default), allow the object to expand to fit inside the rectangle. The aspect ratio will be preserved.

Returns:

The name of the Form XObject that contains the underlay.

Added in version 2.14.

Changed in version 4.2.0: Added the shrink and expand parameters. Previously, this method behaved as if shrink=True, expand=False. Fixed issue with wrong page rect being selected.

property artbox: Array

Return page’s effective /ArtBox, in PDF units.

According to the PDF specification: “The art box defines the page’s meaningful content area, including white space.”

If the /ArtBox is not defined, the /CropBox is returned.

Return type:: Array

as_form_xobject(handle_transformations=...)

Return a form XObject that draws this page.

This is useful for n-up operations, underlay, overlay, thumbnail generation, or any other case in which it is useful to replicate the contents of a page in some other context. The dictionaries are shallow copies of the original page dictionary, and the contents are coalesced from the page’s contents. The resulting object handle is not referenced anywhere.

Parameters:: handle_transformations (bool) – If True (default), the resulting form XObject’s /Matrix will be set to replicate rotation (/Rotate) and scaling (/UserUnit) in the page’s dictionary. In this way, the page’s transformations will be preserved when placing this object on another page.
Return type:: Object

property bleedbox: Array

Return page’s effective /BleedBox, in PDF units.

According to the PDF specification: “The bleed box defines the region to which the contents of the page should be clipped when output in a print production environment.”

If the /BleedBox is not defined, the /CropBox is returned.

Return type:: Array

calc_form_xobject_placement(formx, name, rect, *, invert_transformations, allow_shrink, allow_expand)

Generate content stream segment to place a Form XObject on this page.

The content stream segment must then be added to the page’s content stream.

The default keyword parameters will preserve the aspect ratio.

Parameters:

formx (Object) – The Form XObject to place.
name (Name) – The name of the Form XObject in this page’s /Resources dictionary.
rect (Rectangle) – Rectangle describing the desired placement of the Form XObject.
invert_transformations (bool) – Apply /Rotate and /UserUnit scaling when determining Form XObject placement.
allow_shrink (bool) – Allow the Form XObject to take less than the full dimensions of rect.
allow_expand (bool) – Expand the Form XObject to occupy all of rect.

Return type:

bytes

Added in version 2.14.

contents_add(contents, *, prepend=...)

Append or prepend to an existing page’s content stream.

Parameters:

contents (Stream | bytes) – An existing content stream to append or prepend.
prepend (bool) – Prepend if true, append if false (default).

Return type:

None

Added in version 2.14.

contents_coalesce()

Coalesce a page’s content streams.

A page’s content may be a stream or an array of streams. If this page’s content is an array, concatenate the streams into a single stream. This can be useful when working with files that split content streams in arbitrary spots, such as in the middle of a token, as that can confuse some software.

Return type:: None

copy_annotations(from_page, matrix=...)

Copy annotations from another page onto this page.

The other page may belong to the same or a different pikepdf.Pdf. Each annotation’s rectangle is transformed by the given matrix. If an annotation is a form field widget, the form field is copied into this document’s AcroForm as well.

Parameters:

from_page (Page) – The page to copy annotations from.
matrix (Matrix) – A transformation matrix applied to each annotation’s rectangle. Defaults to the identity matrix.

Return type:

None

Added in version 10.9.

property cropbox: Array

Return page’s effective /CropBox, in PDF units.

According to the PDF specification: “The crop box defines the region to which the contents of the page shall be clipped (cropped) when displayed or printed. It has no defined meaning in the context of the PDF imaging model; it merely imposes clipping on the page contents.”

If the /CropBox is not defined, the /MediaBox is returned.

Return type:: Array

emplace(other, retain=...)

Parameters:

other (Page)
retain (collections.abc.Iterable[Name])

Return type:

None

externalize_inline_images(min_size=..., shallow=...)

Convert inline images to normal (external) images.

Parameters:

min_size (int) – minimum size in bytes
shallow (bool) – If False, recurse into nested Form XObjects. If True, do not recurse.

Return type:

None

flatten_rotation()

Bake this page’s /Rotate value into its content stream.

If a page is rotated using /Rotate in the page dictionary, instead rotate the page by the same amount by altering the content stream and removing the /Rotate key, adjusting the page bounding boxes so the page has the same appearance. This can work around problems with PDF applications that cannot properly handle rotated pages.

Added in version 10.9.

Return type:: None

form_xobjects()

Return all Form XObjects associated with this page.

This method does not recurse into nested Form XObjects.

Added in version 7.0.0.

Return type:: _ObjectMapping

get(key: str | Name, /) → Object | None

get_filtered_contents(tf)

Apply a pikepdf.TokenFilter to a content stream.

This may be used when the results of a token filter do not need to be applied, such as when filtering is being used to retrieve information rather than edit the content stream.

Note that it is possible to create a subclassed TokenFilter that saves information of interest to its object attributes; it is not necessary to return data in the content stream.

To modify the content stream, use pikepdf.Page.add_content_token_filter().

Returns:: The result of modifying the content stream with tf. The existing content stream is not modified.
Parameters:: tf (TokenFilter)
Return type:: bytes

get_images(recursive=...)

Return the images used by this page.

Parameters:: recursive (bool) – If True (the default), also report images nested inside form XObjects referenced by this page, recursing to any depth. This is usually what you want, since a page’s visible content is often drawn through one or more form XObjects. If two images in different XObject scopes share a resource name, only one is reported. If False, report only images referenced directly by this page’s resources.
Return type:: _ObjectMapping

Added in version 10.9.

get_matrix_for_form_xobject_placement(fo, rect, *, invert_transformations=..., allow_shrink=..., allow_expand=...)

Return the matrix that places a Form XObject within a rectangle.

This is the transformation matrix used by calc_form_xobject_placement(). The parameters have the same meaning as for that method.

Added in version 10.9.

Parameters:

fo (Object)
rect (Rectangle)
invert_transformations (bool)
allow_shrink (bool)
allow_expand (bool)

Return type:

Matrix

get_matrix_for_transformations(invert=...)

Return the matrix equivalent to this page’s /Rotate and /UserUnit.

Parameters:: invert (bool) – If True, return the inverse matrix (suitable for placing something else onto this page). If False (default), return the matrix suitable for taking content from this page elsewhere.
Return type:: Matrix

Added in version 10.9.

property images: _ObjectMapping

Return images directly referenced by this page’s resources.

This property does not search Form XObjects that contain images, and does not attempt to find inline images.

Deprecated since version 10.9: Use get_images() instead, which recurses into Form XObjects by default. Because it is not visually obvious when a page’s content is wrapped in a Form XObject, this property often appears as if a page “has no images” when it clearly does.

Return type:: _ObjectMapping

index()

Returns the zero-based index of this page in the pages list.

That is, returns n such that pdf.pages[n] == this_page. A ValueError exception is thrown if the page is not attached to this Pdf.

Added in version 2.2.

Return type:: int

label()

Returns the page label for this page, accounting for section numbers.

For example, if the PDF defines a preface with lower case Roman numerals (i, ii, iii…), followed by standard numbers, followed by an appendix (A-1, A-2, …), this function returns the appropriate label as a string.

It is possible for a PDF to define page labels such that multiple pages have the same labels. Labels are not guaranteed to be unique.

Added in version 2.2.

Changed in version 2.9: Returns the ordinary page number if no special rules for page numbers are defined.

Return type:: str

property mediabox: Array

Return page’s /MediaBox, in PDF units.

According to the PDF specification: “The media box defines the boundaries of the physical medium on which the page is to be printed.”

Return type:: Array

property obj: Dictionary

Return type:: Dictionary

parse_contents(stream_parser)

Parse a page’s content streams using a pikepdf.StreamParser.

The content stream may be interpreted by the StreamParser but is not altered.

If the page’s contents is an array of streams, it is coalesced.

Parameters:: stream_parser (StreamParser) – A pikepdf.StreamParser instance.
Return type:: None

remove_unreferenced_resources()

Removes resources not referenced by content stream.

A page’s resources (page.resources) dictionary maps names to objects. This method walks through a page’s contents and keeps track of which resources are referenced somewhere in the contents. Then it removes from the resources dictionary any object that is not referenced in the contents. This method is used by page splitting code to avoid copying unused objects in files that use shared resource dictionaries across multiple pages.

Return type:: None

property resources: Dictionary

Return this page’s resources dictionary.

Changed in version 7.0.0: If the resources dictionary does not exist, an empty one will be created. A TypeError is raised if a page has a /Resources key but it is not a dictionary.

Return type:: Dictionary

rotate(angle, *, relative=False)

Rotate a page.

If relative is False (the default), set the rotation of the page to angle. Otherwise, add angle to the rotation of the page. angle must be a multiple of 90. Adding 90 to the rotation rotates clockwise by 90 degrees.

Parameters:

angle (int) – Rotation angle in degrees.
relative (bool) – If True, add angle to the current rotation. If False, set the rotation of the page to angle.

Return type:

None

Deprecated since version 10.9: Passing relative as a positional argument is deprecated; pass it as a keyword argument instead, e.g. page.rotate(90, relative=True). Positional support will be removed in pikepdf 11.

property rotation: int

The page’s clockwise rotation in degrees, normalized to [0, 360).

Unlike the raw page.Rotate attribute, this property reports the effective rotation: it resolves a /Rotate value inherited from the page tree and reports 0 when no rotation is set, instead of raising. Assigning to this property sets the absolute rotation; to rotate relative to the current value, use rotate() with relative=True.

Added in version 10.9.

Return type:: int

property trimbox: Array

Return page’s effective /TrimBox, in PDF units.

According to the PDF specification: “The trim box defines the intended dimensions of the finished page after trimming. It may be smaller than the media box to allow for production-related content, such as printing instructions, cut marks, or color bars.”

If the /TrimBox is not defined, the /CropBox is returned (and if /CropBox is not defined, /MediaBox is returned).

Return type:: Array

class pikepdf.PageCopyResult

Facts about a pikepdf.Pdf.add_pages_from() operation.

dropped_dests: list[str] = []

fields_added: int = 0

forms: Literal['preserve', 'strip']

named_dests_added: int = 0

pages_added: int

partial_fields: list[str] = []

renamed_dests: dict[str, str]

renamed_fields: dict[str, str]

class pikepdf.PdfImage(obj)

Support class to provide a consistent API for manipulating PDF images.

The data structure for images inside PDFs is irregular and complex, making it difficult to use without introducing errors for less typical cases. This class addresses these difficulties by providing a regular, Pythonic API similar in spirit to (and convertible to) the Python Pillow imaging library.

Parameters:: obj (pikepdf.objects.Stream)

MAIN_COLORSPACES

PRINT_COLORSPACES

SIMPLE_COLORSPACES

as_pil_image(apply_decode_array=True, apply_mask=True)

Extract the image as a Pillow Image, using decompression as necessary.

Parameters:

apply_decode_array (bool) – If True (default), the image’s /Decode array is applied so the result matches how a PDF viewer would render the image. Set to False to obtain the raw sample values as stored, e.g. for forensic inspection of the underlying image data.
apply_mask (bool) – If True (default), an attached soft mask (/SMask), explicit mask or colour-key mask (/Mask) is composited into an alpha channel, so an image with transparency is returned as LA/RGBA. Set to False to obtain the opaque base image only. Images without a mask are unaffected.

Return type:

PIL.Image.Image

Caller must close the image.

property bits_per_component: int

Bits per component of this image.

Return type:: int

property colorspace: str | None

PDF name of the colorspace that best describes this image.

Return type:: str | None

property decode_parms: list

List of the /DecodeParms, arguments to filters.

Return type:: list

extract_to(*, stream=None, fileprefix='', apply_decode_array=True, apply_mask=True)

Extract the image directly to a usable image file.

If possible, the compressed data is extracted and inserted into a compressed image file format without transcoding the compressed content. If this is not possible, the data will be decompressed and extracted to an appropriate format.

Because it is not known until attempted what image format will be extracted, users should not assume what format they are getting back. When saving the image to a file, use a temporary filename, and then rename the file to its final name based on the returned file extension.

Images might be saved as any of .png, .jpg, or .tiff.

Examples

>>> im.extract_to(stream=bytes_io)
'.png'

>>> im.extract_to(fileprefix='/tmp/image00')
'/tmp/image00.jpg'

Parameters:

stream (BinaryIO | None) – Writable stream to write data to.
fileprefix (str or Path) – The path to write the extracted image to, without the file extension.
apply_decode_array (bool) – If True (default), the extracted image reflects the image’s /Decode array, matching how a PDF viewer renders it. Note that for a JPEG/JPX image carrying a non-identity /Decode, honoring it requires transcoding, so the result is a .png/.tiff rather than the original .jpg/.jp2. Set to False to copy the stored image data with the least processing (the raw, possibly inverted, samples), e.g. for forensic use.
apply_mask (bool) – If True (default), an attached soft/explicit mask is composited into an alpha channel, forcing a transparency-capable format (.png) instead of a direct .jpg/.jp2 copy. Set to False to extract the opaque base image only.

Returns:

If fileprefix was provided, then the fileprefix with the appropriate extension. If no fileprefix, then an extension indicating the file type.

Return type:

str

property filter_decodeparms: list

Return the normalized Filter and DecodeParms data.

PDF has a lot of possible data structures concerning /Filter and /DecodeParms. /Filter can be absent or a name or an array, /DecodeParms can be absent or a dictionary (if /Filter is a name) or an array (if /Filter is an array). When both are arrays the lengths match.

Normalize this into: [(/FilterName, {/DecodeParmName: Value, …}), …]

The order of /Filter matters as it indicates the encoding/decoding sequence.

Return type:: list
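To make the normalization concrete, here is an illustrative pure-Python sketch. It is not pikepdf's implementation: the function name is hypothetical, and plain strings and dicts stand in for PDF names and dictionaries.

```python
def normalize_filter_decodeparms(filters=None, decode_parms=None):
    """Return [(filter_name, parms_dict), ...] in decoding order."""
    if filters is None:
        # No /Filter at all: nothing to decode.
        return []
    if isinstance(filters, str):
        # Single filter: /DecodeParms is absent or one dictionary.
        filters, decode_parms = [filters], [decode_parms]
    elif decode_parms is None:
        # Array of filters with no parameters supplied.
        decode_parms = [None] * len(filters)
    return [(name, dict(parms or {})) for name, parms in zip(filters, decode_parms)]


single = normalize_filter_decodeparms("/FlateDecode", {"/Predictor": 15})
chain = normalize_filter_decodeparms(["/ASCII85Decode", "/DCTDecode"])
```

Every input shape ends up as the same list of pairs, so callers can iterate over filters without checking which form the PDF used.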

property filters: list

List of names of the filters that were applied to encode this image.

Return type:: list

get_stream_buffer(decode_level=StreamDecodeLevel.specialized)

Access this image with the buffer protocol.

Parameters:: decode_level (pikepdf._core.StreamDecodeLevel)
Return type:: pikepdf._core.Buffer

property height: int

Height of the image data in pixels.

Return type:: int

property icc: PIL.ImageCms.ImageCmsProfile | None

If an ICC profile is attached, return a Pillow object that describes it.

Most of the information may be found in icc.profile.

Return type:: PIL.ImageCms.ImageCmsProfile | None

property image_mask: bool

Return True if this is an image mask.

Return type:: bool

property indexed: bool

Check if the image has a defined color palette.

Return type:: bool

property is_device_n: bool

Check if image has a /DeviceN (complex printing) colorspace.

Return type:: bool

property is_separation: bool

Check if image has a /Separation colorspace.

Return type:: bool

property mode: str

PIL.Image.mode equivalent for this image, where possible.

If an ICC profile is attached to the image, we still attempt to resolve a Pillow mode.

Return type:: str

obj

property palette: pikepdf.models.image._shared.PaletteData | None

Retrieve the color palette for this image if applicable.

Return type:: pikepdf.models.image._shared.PaletteData | None

read_bytes(decode_level=StreamDecodeLevel.specialized)

Decompress this image and return it as unencoded bytes.

Parameters:: decode_level (pikepdf._core.StreamDecodeLevel)
Return type:: bytes

show()

Show the image however PIL wants to.

Return type:: None

property size: tuple[int, int]

Size of image as (width, height).

Return type:: tuple[int, int]

property width: int

Width of the image data in pixels.

Return type:: int

class pikepdf.PdfInlineImage(*, image_data, image_object, resources=None)

Support class for PDF inline images.

Parameters:

image_data (pikepdf.objects.Object)
image_object (tuple)
resources (pikepdf.objects.Object | None)

class pikepdf.models.PdfMetadata(pdf, pikepdf_mark=True, sync_docinfo=True, overwrite_invalid_xml=True)

Read and edit the metadata associated with a PDF.

The PDF specification contains two types of metadata, the newer XMP (Extensible Metadata Platform, XML-based) and older DocumentInformation dictionary. The PDF 2.0 specification removes the DocumentInformation dictionary.

This primarily works with XMP metadata, but includes methods to generate XMP from DocumentInformation and will also coordinate updates to DocumentInformation so that the two are kept consistent.

XMP metadata fields may be accessed using the full XML namespace URI or the short name. For example metadata['dc:description'] and metadata['{http://purl.org/dc/elements/1.1/}description'] both refer to the same field. Several common XML namespaces are registered automatically.

See the XMP specification for details of allowable fields.

To update metadata, use a with block.

Example

>>> with pdf.open_metadata() as records:
...     records['dc:title'] = 'New Title'

See also

pikepdf.Pdf.open_outline()

add(title, destination)

Add an item to the outline.

Parameters:

title (str) – Title of the outline item.
destination (pikepdf.objects.Array | int | None) – Destination to jump to when the item is selected.

Returns:

The newly created OutlineItem.

Return type:

OutlineItem

property root: list[OutlineItem]

Return the root node of the outline.

Return type:: list[OutlineItem]

class pikepdf.models.OutlineItem(title, destination=None, page_location=None, action=None, obj=None, *, left=None, top=None, right=None, bottom=None, zoom=None)

Manage a single item in a PDF document outlines structure.

Includes nested items.

Parameters:

title (str) – Title of the outlines item.
destination (pikepdf.objects.Array | pikepdf.objects.String | pikepdf.objects.Name | int | None) – Page number, destination name, or any other PDF object to be used as a reference when clicking on the outlines entry. Note this should be None if an action is used instead. If set to a page number, it will be resolved to a reference at the time of writing the outlines back to the document.
page_location (PageLocation | str | None) – Supplemental page location for a page number in destination, e.g. PageLocation.Fit. May also be a simple string such as 'FitH'.
action (pikepdf.objects.Dictionary | None) – Action to perform when clicking on this item. Will be ignored during writing if destination is also set.
obj (pikepdf.objects.Dictionary | None) – Dictionary object representing this outlines item in a Pdf. May be None for creating a new object. If present, an existing object is modified in-place during writing and original attributes are retained.
left (float | None) – Describes the viewport position associated with a destination.
top (float | None) – Describes the viewport position associated with a destination.
bottom (float | None) – Describes the viewport position associated with a destination.
right (float | None) – Describes the viewport position associated with a destination.
zoom (float | None) – Describes the viewport position associated with a destination.

This object does not contain any information about higher-level or neighboring elements.

Valid destination arrays:: [page /XYZ left top zoom] generally [page, PageLocationEntry, 0 to 4 ints]

action = None

children: list[OutlineItem] = []

destination = None

classmethod from_dictionary_object(obj, *, strict=False)

Create an OutlineItem from a Dictionary.

Does not process nested items.

Parameters:

obj (pikepdf.objects.Dictionary) – Dictionary object representing a single outline node.
strict (bool) – If True, raise OutlineStructureError on any structural problem (such as a missing required /Title). If False (default), quietly correct such problems where the repair is known; a missing /Title becomes an empty string.

is_closed = False

obj = None

page_location = None

page_location_kwargs

title

to_dictionary_object(pdf, create_new=False)

Create/update a Dictionary object from this outline node.

Page numbers are resolved to a page reference on the input Pdf object.

Parameters:

pdf (pikepdf._core.Pdf) – PDF document object.
create_new (bool) – If set to True, creates a new object instead of modifying an existing one in-place.

Return type:

pikepdf.objects.Dictionary

class pikepdf.Permissions

Stores the user-level permissions for an encrypted PDF.

A compliant PDF reader/writer should enforce these restrictions on people who have the user password and not the owner password. In practice, either password is sufficient to decrypt all document contents. A person who has the owner password should be allowed to modify the document in any way. pikepdf does not enforce the restrictions in any way; it is up to application developers to enforce them as they see fit.

Unencrypted PDFs implicitly have all permissions allowed. Permissions can only be changed when a PDF is saved.

accessibility: bool = True

Deprecated in PDF 2.0. Formerly used to block accessibility tools.

In older versions of the PDF specification, it was possible to request a PDF reader to block a user’s right to use accessibility tools. Modern PDF readers do not support this archaic feature and always allow accessibility tools to be used. The only purpose of this permission is to provide testing of this deprecated feature.

extract: bool = True: Can users extract contents?

modify_annotation: bool = True: Can users modify annotations?

modify_assembly: bool = False: Can users arrange document contents?

modify_form: bool = True: Can users fill out forms?

modify_other: bool = True: Can users modify the document?

print_highres: bool = True: Can users print the document at high resolution?

print_lowres: bool = True: Can users print the document at low resolution?

class pikepdf.models.EncryptionMethod(*args, **kwds)

PDF encryption methods.

Describes which encryption method was used on a particular part of a PDF. These values are returned by pikepdf.EncryptionInfo but are not currently used to specify how encryption is requested.

aes: Ellipsis: The AES-based algorithm was used as described in the {{ pdfrm }}.

aesv3: Ellipsis: An improved version of the AES-based algorithm was used as described in the Adobe Supplement to the ISO 32000, requiring PDF 1.7 extension level 3. This algorithm still uses AES, but allows both AES-128 and AES-256, and improves how the key is derived from the password.

none: Ellipsis: Data was not encrypted.

rc4: Ellipsis: The RC4 encryption algorithm was used (obsolete).

unknown: Ellipsis: An unknown algorithm was used.

class pikepdf.models.EncryptionInfo(encdict)

Reports encryption information for an encrypted PDF.

This information may not be changed, except when a PDF is saved. This object is not used to specify the encryption settings to save a PDF, due to non-overlapping information requirements.

Parameters:: encdict (dict[str, Any])

property P: int

Return encoded permission bits.

See Pdf.allow() instead.

Return type:: int

property R: int

Revision number of the security handler.

Return type:: int

property V: int

Version of PDF password algorithm.

Return type:: int

property bits: int

Return the number of bits in the encryption algorithm.

e.g. if the algorithm is AES-256, this returns 256.

Return type:: int

property encryption_key: bytes

Return the RC4 or AES encryption key used for this file.

Return type:: bytes

property file_method: pikepdf._core.EncryptionMethod

Encryption method used to encode the whole file.

Return type:: pikepdf._core.EncryptionMethod

property stream_method: pikepdf._core.EncryptionMethod

Encryption method used to encode streams.

Return type:: pikepdf._core.EncryptionMethod

property string_method: pikepdf._core.EncryptionMethod

Encryption method used to encode strings.

Return type:: pikepdf._core.EncryptionMethod

property user_password: bytes

If possible, return the user password.

The user password can only be retrieved when a PDF is opened with the owner password and when older versions of the encryption algorithm are used.

The password is always returned as bytes even if it has a clear Unicode representation.

Return type:: bytes

class pikepdf.AcroForm

A helper for working with PDF interactive forms.

add_and_rename_fields(fields)

Add a collection of form fields.

Ensures that their fully qualified names don’t conflict with already present form fields.

Fields within the collection of new fields that have the same name as each other will continue to do so.

Parameters:: fields (collections.abc.Sequence[AcroFormField])

add_field(field)

Add a form field.

Initializes the document’s AcroForm dictionary if needed, and updates the cache if necessary.

Note that if you are adding fields that are copies of other fields, this method may result in multiple fields existing with the same qualified name, which can have unexpected side effects. In that case, you should use add_and_rename_fields() instead.

Parameters:: field (AcroFormField)

disable_digital_signatures()

Disables digital signature fields.

This method removes all digital signature fields from the document, leaving any annotation showing the content of the field intact.

Return type:: None

property exists: bool

True if the current document has an interactive form.

Return type:: bool

property fields: collections.abc.Sequence[AcroFormField]

A list of all terminal fields in this interactive form.

Terminal fields are fields that have no children that are also fields. Terminal fields should have children that are annotations, or be annotations themselves. Only terminal fields are displayed as actual widgets in the PDF document; non-terminal fields exist only for grouping.

Intermediate nodes in the fields tree are not included in this list, but you can still reach them through the pikepdf.AcroFormField.parent and pikepdf.AcroFormField.top_level_field properties.

Return type:: collections.abc.Sequence[AcroFormField]
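
The notion of a terminal field can be pictured with a small pure-Python model. This is a sketch only: plain dicts stand in for field dictionaries, and a kid is treated as a field when it carries a partial name (/T), which is a simplification. The helper terminal_fields is hypothetical and not a pikepdf function.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]


def terminal_fields(node: Node) -> List[Node]:
    """Collect fields that have no kids which are themselves fields."""
    field_kids = [kid for kid in node.get("Kids", []) if "T" in kid]
    if not field_kids:
        # Only annotation kids (or no kids at all): this field is terminal.
        return [node]
    found: List[Node] = []
    for kid in field_kids:
        found.extend(terminal_fields(kid))
    return found


widget = {"Subtype": "Widget"}           # an annotation, not a field
city = {"T": "city", "Kids": [widget]}   # terminal: its only kid is a widget
zip_code = {"T": "zip"}                  # terminal: field and widget merged
address = {"T": "address", "Kids": [city, zip_code]}  # grouping node

names = [field["T"] for field in terminal_fields(address)]
```

Here names contains "city" and "zip", while the grouping node "address" is left out, mirroring how intermediate nodes are excluded from this property.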

fix_copied_annotations(to_page, from_page, from_acroform)

Copy form fields and annotations from one page to another.

This would typically be called after copying a new page in order to add field/annotation awareness. When just copying the page by itself, annotations end up being shared, and fields end up being omitted because there is no reference to the field from the page. This method ensures that each separate copy of a page has private annotations and that fields and annotations are properly updated to resolve conflicts that may occur from common resource and field names across documents.

Parameters:

to_page (Page) – The page to copy to.
from_page (Page) – The page to copy from. May be in a different PDF or in the same PDF.
from_acroform (AcroForm) – The acroform object for the source PDF.

Return type:

collections.abc.Sequence[AcroFormField]

Returns a list of newly created fields.

generate_appearances_if_needed()

Generate appearance streams for all form fields that need them.

For checkbox and radio button fields, this method ensures that appearance state is consistent with the field’s value and uses any pre-existing appearance streams.

If needs_appearances is False, this method does nothing.

This method uses the underlying QPDF implementation, which has several limitations:

Only supports ASCII characters in text fields

Does not support multi-line text

Ignores quadding (alignment)

Return type:: None

get_annotations_for_field(field)

Given a form field, return the associated annotation(s).

Typically, interactive forms store field information and annotation information in the same dictionary, meaning this method will often return a single pikepdf.Annotation which refers to the same underlying pikepdf.Dictionary. However, this is not necessarily always the case and should not be relied on. A field may store annotation data in its own dictionary, and may even have multiple annotations.

Parameters:: field (AcroFormField)
Return type:: collections.abc.Sequence[Annotation]

get_field_for_annotation(annotation)

Given an annotation for a widget, return the associated form field.

Typically, interactive forms store field information and annotation information in the same dictionary, meaning this method will often return a pikepdf.AcroFormField which refers to the same underlying pikepdf.Dictionary. However, this is not necessarily always the case and should not be relied on. A field may store annotation data in its own dictionary.

Parameters:: annotation (Annotation)
Return type:: AcroFormField

get_fields_with_qualified_name(name)

Get a list of all fields with the given qualified name.

Generally, this list will contain only one member, as having multiple fields with the same name is discouraged (but not impossible).

This will only return elements that have an explicit name (/T) in the field dictionary. In practice, this means that it should return the highest-level matching field, but not any children. (For example, this method will return a radio group rather than individual radio buttons.)

Parameters:: name (str)
Return type:: collections.abc.Sequence[AcroFormField]

get_form_fields_for_page(page)

Find all the interactive form fields on a page.

In many PDFs, you may find that this returns a list that perfectly corresponds to that returned by get_widget_annotations_for_page. However, you should not rely on this behavior, because it will not always be the case. Use this method to get all the fields, then use the get_annotations_for_field method for each to get the corresponding annotations.

Parameters:: page (Page)
Return type:: collections.abc.Sequence[AcroFormField]

get_widget_annotations_for_page(page)

Find all the interactive form widgets on a page.

In many PDFs, you may find that this returns a list that perfectly corresponds to that returned by get_form_fields_for_page. However, you should not rely on this behavior, because it will not always be the case. Use this method to get the annotations, then use the get_field_for_annotation method for each to get the corresponding field.

Parameters:: page (Page)
Return type:: collections.abc.Sequence[Annotation]

invalidate_cache()

Mark the internal field/annotation/page cache invalid.

This class lazily caches the mapping among form fields, annotations, and pages. If you modify pages’ annotation dictionaries, the /AcroForm dictionary, or form fields manually in a way that alters these associations, call this to force the cache to be regenerated.

Added in version 10.9.

Return type:: None

property needs_appearances: bool

Indicates whether appearance streams must be regenerated.

This should be set to True if you modify any field values in the interactive form, unless you also generate the appearance streams for the modified fields.

Return type:: bool

remove_fields(fields)

Remove fields from the fields list.

Parameters:: fields (collections.abc.Sequence[AcroFormField])

set_field_name(field, name)

Set partial name of a field, updating internal records of field names.

Parameters:

field (AcroFormField)
name (str)

transform_annotations(old_annots, matrix=..., from_qpdf=..., from_acroform=...)

Transform a set of annotations by a matrix, copying form fields.

This is the low-level primitive underlying pikepdf.Page.copy_annotations(). For each annotation in old_annots, a new annotation is created with the matrix applied to its rectangle. If an annotation is associated with a form field, a new form field pointing at the new annotation is created. The new annotations and fields are not added to any page or to the document; the caller must do that.

Parameters:

old_annots (Object) – An array of annotations to transform. May belong to a different Pdf, in which case pass from_qpdf.
matrix (Matrix) – The transformation matrix. Defaults to the identity matrix.
from_qpdf (Pdf | None) – The source Pdf if old_annots is foreign.
from_acroform (AcroForm | None) – An optional source AcroForm, for efficiency when copying many annotations from the same source document.

Returns:

A tuple (new_annots, new_fields, old_fields) where new_annots and new_fields are lists of newly created objects, and old_fields is a list of (object_number, generation) for the source fields that were transformed.

Return type:

tuple[list[Object], list[Object], list[tuple[int, int]]]

Added in version 10.9.

validate(repair=...)

Re-validate the AcroForm structure.

Useful if you have modified the structure of the AcroForm dictionary in a way that would invalidate the internal cache.

Parameters:: repair (bool) – If True (default), the document will be repaired if possible when validation encounters errors.
Return type:: None

Added in version 10.9.

class pikepdf.AcroFormField

An AcroForm field. Wrapper around a PDF dictionary.

property alternate_name: str

The alternative field name (/TU), the field name presented to users.

If a value is not present in the underlying field, this property falls back to the fully qualified name.

Return type:: str

property choices: collections.abc.Sequence[str]

Available choices for this field, if this is a choice field.

This does not contain choices for radio buttons. For radio buttons, traverse the /Kids of the top-level field and inspect the individual buttons.

This also only works for choice fields where the options are represented as an array of strings. However, some PDFs represent choices as an array of [export_value, display_value] pairs. This is a limitation of the underlying QPDF library. See QPDF Issue 1433. To get options for such fields, use field.obj.Opt instead.

Return type:: collections.abc.Sequence[str]
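
The two option layouts mentioned above can be reconciled with a small normalizing step. The following is a sketch on plain Python values. With real pikepdf objects, the entries of field.obj.Opt would first need converting to str and list, and that conversion is not shown. The helper normalize_options is hypothetical.

```python
from typing import List, Sequence, Tuple, Union

# An option is either a plain string or an [export_value, display_value] pair.
OptEntry = Union[str, Sequence[str]]


def normalize_options(opt: Sequence[OptEntry]) -> List[Tuple[str, str]]:
    """Return every option as an (export_value, display_value) tuple."""
    pairs = []
    for entry in opt:
        if isinstance(entry, str):
            # A bare string serves as both export and display value.
            pairs.append((entry, entry))
        else:
            export_value, display_value = entry
            pairs.append((export_value, display_value))
    return pairs


options = normalize_options(["Red", ["g", "Green"]])
```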

property default_appearance: bytes

Default appearance string, inheriting from ancestor fields if needed.

This property will contain an empty string if the default appearance string is not available (because it’s erroneously absent or because this is not a variable text field). If the string is not found in the field hierarchy, the document-level /AcroForm dictionary is consulted.

Return type:: bytes

property default_resources: Dictionary

The default resource dictionary for the field.

This comes not from the field but from the document-level /AcroForm dictionary. While several PDF generators put a /DR key in the form field’s dictionary, experimentation suggests that many popular readers, including Adobe Acrobat and Acrobat Reader, ignore any /DR item on the field.

Return type:: Dictionary

property default_value: The default value of the form field.

property default_value_as_string: str

The field’s default value as a string.

If the value is not a string, this property will hold an empty string.

Return type:: str

property field_type: str

The raw value of /FT if present, otherwise an empty string.

Return type:: str

property flags: int

Field flags from /Ff.

Return type:: int

property fully_qualified_name: str

The field’s fully qualified name.

This is defined as being the /T (partial_name) value of this and all ancestors, concatenated together with dots.

Return type:: str
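
As an illustration of this rule, the sketch below builds a qualified name from a chain of plain dicts linked by a "Parent" key. The dicts are stand-ins for field dictionaries, and the helper fully_qualified_name is a hypothetical model, not the library’s implementation.

```python
from typing import Any, Dict, Optional

Node = Dict[str, Any]


def fully_qualified_name(field: Node) -> str:
    """Join the partial names (/T) from the root ancestor down to the field."""
    parts = []
    node: Optional[Node] = field
    while node is not None:
        if "T" in node:  # nodes without a partial name contribute nothing
            parts.append(node["T"])
        node = node.get("Parent")
    return ".".join(reversed(parts))


form = {"T": "form"}
address = {"T": "address", "Parent": form}
city = {"T": "city", "Parent": address}

name = fully_qualified_name(city)
```

With these three nodes, name is "form.address.city".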

generate_appearance(annot)

Generate an appearance stream for this field.

Parameters:: annot (Annotation)

get_inheritable_field_value(name)

Get field value, possibly inheriting the value from ancestor node.

Parameters:: name (str)

get_inheritable_field_value_as_name(name)

Get an inherited field value as a Name object.

If the value is not a name, this method returns an empty name.

Parameters:: name (str)
Return type:: Name

get_inheritable_field_value_as_string(name)

Get an inherited field value as a string.

If the value is not a string, this method returns an empty string.

Parameters:: name (str)
Return type:: str
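
The inheritance lookup behind these three methods can be sketched as a walk up the parent chain that stops at the first node defining the key. The model below uses plain dicts rather than pikepdf objects, and both helper names are hypothetical.

```python
from typing import Any, Dict, Optional

Node = Dict[str, Any]


def inheritable_value(field: Node, key: str) -> Any:
    """Return the nearest definition of key, searching the field then ancestors."""
    node: Optional[Node] = field
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")
    return None


def inheritable_value_as_string(field: Node, key: str) -> str:
    """Like inheritable_value, but fall back to '' for non-string values."""
    value = inheritable_value(field, key)
    return value if isinstance(value, str) else ""


group = {"T": "group", "FT": "/Tx", "Q": 1}
child = {"T": "child", "Parent": group}

field_type = inheritable_value(child, "FT")            # inherited from group
quadding_text = inheritable_value_as_string(child, "Q")  # 1 is not a string
```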

property is_checkbox: bool

True if field is type /Btn and flags do not indicate other type of button.

Return type:: bool

property is_checked: bool

True if field is a checkbox and is checked.

Return type:: bool

property is_choice: bool

True if field is of type /Ch.

Return type:: bool

property is_null: bool

True if the field is null.

Return type:: bool

property is_pushbutton: bool

True if field is of type /Btn and flags indicate that it is a pushbutton.

Return type:: bool

property is_radio_button: bool

True if field is of type /Btn and flags indicate that it is a radio button.

Return type:: bool
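
The way /Ff flags separate the three kinds of /Btn field can be sketched as follows. The bit positions (radio at bit 16, pushbutton at bit 17) come from the button field flags table in the PDF specification. The constants and the helper button_kind are hypothetical names for this illustration.

```python
from typing import Optional

FF_RADIO = 1 << 15       # bit 16 of /Ff
FF_PUSHBUTTON = 1 << 16  # bit 17 of /Ff


def button_kind(field_type: str, flags: int) -> Optional[str]:
    """Classify a button field from its raw /FT value and /Ff flags."""
    if field_type != "/Btn":
        return None  # not a button field at all
    if flags & FF_PUSHBUTTON:
        return "pushbutton"
    if flags & FF_RADIO:
        return "radio"
    # A /Btn field with neither flag set is a checkbox.
    return "checkbox"


kinds = [
    button_kind("/Btn", 0),
    button_kind("/Btn", FF_RADIO),
    button_kind("/Btn", FF_PUSHBUTTON),
    button_kind("/Tx", 0),
]
```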

property is_text: bool

True if field is of type /Tx.

Return type:: bool

property mapping_name: str

Return the mapping field name (/TM).

If a value is not present in the underlying field, this property falls back to the fully qualified name.

Return type:: str

property parent: AcroFormField

This field’s parent.

If there is no parent, an AcroFormField where field.is_null is True is returned.

Return type:: AcroFormField

property partial_name: str

The field’s partial name (/T).

Return type:: str

property quadding: int

The quadding value, inheriting from ancestor fields if needed.

This will be 0 if the quadding is not specified. If the value is not found in the field hierarchy, the document-level /AcroForm dictionary is consulted.

Return type:: int

set_value(value, need_appearance=True)

Set the value property.

If need_appearance is true, and this is a text or choice field, pikepdf.AcroForm.needs_appearances will also be set.

Parameters:: need_appearance (bool)

property top_level_field: AcroFormField

The top-level field for this field.

This will be the field itself, or one of its ancestors (often the immediate parent).

Note that the top-level field may not itself be a “real” field. Fields may be nested underneath one another at any arbitrary level, with the outer fields forming groups or sets of fields. This property references the highest field in this field’s hierarchy.

Return type:: AcroFormField

property value: The current value of the form field.

property value_as_string: str

The field’s value as a string.

If the value is not a string, this property will hold an empty string.

Return type:: str

class pikepdf.Annotation(obj)

A PDF annotation. Wrapper around a PDF dictionary.

Describes an annotation in a PDF, such as a comment, underline, copy editing marks, interactive widgets, redactions, 3D objects, sound and video clips.

See the PDF reference manual, section 12.5.6, for the full list of annotation types and definition of terminology.

Added in version 2.12.

Parameters:: obj (Object)

property appearance_dict: Object

Returns the annotation’s appearance dictionary.

Return type:: Object

property appearance_state: Object

Returns the annotation’s appearance state (or None).

For a checkbox or radio button, the appearance state may be pikepdf.Name.On or pikepdf.Name.Off.

Return type:: Object

property flags: int

Returns the annotation’s flags.

Return type:: int

get_appearance_stream(which, state=...)

Returns one of the appearance streams associated with an annotation.

Parameters:

which (Object) – Usually one of pikepdf.Name.N, pikepdf.Name.R or pikepdf.Name.D, indicating the normal, rollover or down appearance stream, respectively. If any other name is passed, an appearance stream with that name is returned.
state (Object | None) – The appearance state. For checkboxes or radio buttons, the appearance state is usually whether the button is on or off.

Return type:

Object

get_page_content_for_appearance(name, rotate, required_flags=..., forbidden_flags=...)

Generate content stream text that draws this annotation as a Form XObject.

Parameters:

name (Name) – What to call the object we create.
rotate (int) – Should be set to the page’s /Rotate value or 0.
required_flags (int) – The required appearance flags. See PDF reference manual.
forbidden_flags (int) – The forbidden appearance flags. See PDF reference manual.

Return type:

bytes

Note

This method is implemented mainly by qpdf. Its behavior may change when different qpdf versions are used.

property obj: Object

Return type:: Object

property rect: Rectangle

Returns a rectangle defining the location of the annotation.

Return type:: Rectangle

property subtype: str

Returns the subtype of this annotation.

Return type:: str

class pikepdf._core.Attachments(*args, **kwargs)

Exposes files attached to a PDF.

If a file is attached to a PDF, it is exposed through this interface. For example p.attachments['readme.txt'] would return a pikepdf._core.AttachedFileSpec that describes the attached file, if a file were attached under that name. p.attachments['readme.txt'].get_file() would return a pikepdf._core.AttachedFile, an archaic intermediate object to support different versions of the file for different platforms. Typically one just calls p.attachments['readme.txt'].read_bytes() to get the contents of the file.

This interface provides access to any files that are attached to this PDF, exposed as a Python collections.abc.MutableMapping interface.

The keys (virtual filenames) are always str, and values are always pikepdf.AttachedFileSpec.

To create a new attached file, use pikepdf._core.AttachedFileSpec.from_filepath() to create a pikepdf._core.AttachedFileSpec and then assign it to the pikepdf.Pdf.attachments mapping. If the file is in memory, use p.attachments['test.pdf'] = b'binary data'.

Use this interface through pikepdf.Pdf.attachments.

Added in version 3.0.

Changed in version 8.10.1: Added convenience interface for directly loading attached files, e.g. pdf.attachments['/test.pdf'] = b'binary data'. Prior to this release, there was no way to attach data in memory as a file.

class pikepdf.AttachedFileSpec(pdf, data, *, description, filename, mime_type, creation_date, mod_date)

In a PDF, a file specification provides name and metadata for a target file.

Most file specifications are simple file specifications, and contain only one attached file. Call get_file() to get the attached file:

from pikepdf import Pdf

pdf = Pdf.open(...)

fs = pdf.attachments['example.txt']
stream = fs.get_file()

To attach a new file to a PDF, you may construct an AttachedFileSpec:

from pathlib import Path

from pikepdf import AttachedFileSpec, Pdf

pdf = Pdf.open(...)

fs = AttachedFileSpec.from_filepath(pdf, Path('somewhere/spreadsheet.xlsx'))

pdf.attachments['spreadsheet.xlsx'] = fs

PDF supports the concept of having multiple, platform-specialized versions of the attached file (similar to resource forks on some operating systems). In theory, this attachment ought to be the same file, but encoded in different ways. For example, perhaps a PDF includes a text file encoded with Windows line endings (\r\n) and a different one with POSIX line endings (\n). Similarly, PDF allows for the possibility that you need to encode platform-specific filenames. pikepdf cannot directly create these, because they are arguably obsolete; it can provide access to them, however.

If you have to deal with platform-specialized versions, use get_all_filenames() to enumerate those available.

Described in the PDF reference manual, section 7.11.3.

Added in version 3.0.

Parameters:

pdf (Pdf)
data (bytes)
description (str)
filename (str)
mime_type (str)
creation_date (str)
mod_date (str)

__init__(pdf, data, *, description, filename, mime_type, creation_date, mod_date)

Construct an attached file spec from data in memory.

To construct a file spec from a file on the computer’s file system, use from_filepath().

Parameters:

pdf (Pdf) – The Pdf to attach this file specification to.
data (bytes) – Resource to load.
description (str) – Any description text for the attachment. May be shown in PDF viewers.
filename (str) – Filename to display in PDF viewers.
mime_type (str) – Helps PDF viewers decide how to display the information.
creation_date (str) – PDF date string for when this file was created.
mod_date (str) – PDF date string for when this file was last modified.
relationship – A pikepdf.Name indicating the relationship of this file to the document. Canonically, this should be a name from the PDF specification: Source, Data, Alternative, Supplement, EncryptedPayload, FormData, Schema, Unspecified. If omitted, Unspecified is used.

Return type:

None

property description: str

Description text associated with the embedded file.

Return type:: str

property filename: str

The main filename for this file spec.

In priority order, getting this returns the first of /UF, /F, /Unix, /DOS, /Mac if multiple filenames are set. Setting this will set a UTF-8 encoded Unicode filename and write it to /UF.

Return type:: str
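
The priority order used when reading this property can be modeled in a few lines. This sketch uses a plain dict in place of the file specification dictionary, and the helper main_filename is hypothetical.

```python
from typing import Dict, Optional

# Keys are checked in this order; the first one present wins.
FILENAME_PRIORITY = ("UF", "F", "Unix", "DOS", "Mac")


def main_filename(filespec: Dict[str, str]) -> Optional[str]:
    """Return the highest-priority filename, or None if no key is set."""
    for key in FILENAME_PRIORITY:
        if key in filespec:
            return filespec[key]
    return None


legacy = {"DOS": "REPORT~1.TXT", "F": "report.txt"}
modern = {"F": "report.txt", "UF": "r\u00e9sum\u00e9.txt"}

legacy_name = main_filename(legacy)  # /F outranks /DOS
modern_name = main_filename(modern)  # /UF outranks /F
```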

static from_filepath(pdf, path, *, description='')

Construct a file specification from a file path.

This function will automatically add a creation and modified date using the file system, and a MIME type inferred from the file’s extension.

If the data required for the attachment is in memory, use pikepdf.AttachedFileSpec() instead.

Parameters:

pdf (Pdf) – The Pdf to attach this file specification to.
path (pathlib.Path | str) – A file path for the file to attach to this Pdf.
description (str) – An optional description. May be shown to the user in PDF viewers.
relationship – An optional relationship type. May be used to indicate the type of attachment, e.g. Name.Source or Name.Data. Canonically, this should be a name from the PDF specification: Source, Data, Alternative, Supplement, EncryptedPayload, FormData, Schema, Unspecified. If omitted, Unspecified is used.

Return type:

AttachedFileSpec

get_all_filenames()

Return a Python dictionary that describes all filenames.

The returned dictionary is not a pikepdf Object.

Multiple filenames are generally a holdover from the pre-Unicode era. Modern PDFs can generally set UTF-8 filenames and avoid using punctuation or other marks that are forbidden in filenames.

Return type:: dict

get_file(name=...)

Return an attached file.

Typically, only one file is attached to an attached file spec. When multiple files are attached, use the name parameter to specify which one to return.

Parameters:: name (Name) – Typical names would be /UF and /F. See the PDF reference manual for other obsolete names.
Return type:: AttachedFile

property obj: Dictionary

Get the underlying PDF object (typically a Dictionary).

Return type:: Dictionary

property relationship: Name | None

Describes the relationship of this attached file to the PDF.

Return type:: Name | None

class pikepdf._core.AttachedFile

An object that contains an actual attached file.

These objects do not need to be created manually; they are normally part of an AttachedFileSpec.

Added in version 3.0.

creation_date: datetime.datetime | None

property md5: bytes

Get the MD5 checksum of attached file according to the PDF creator.

Return type:: bytes

mime_type: str: Get the MIME type of the attached file according to the PDF creator.

mod_date: datetime.datetime | None

property obj: Object

Return type:: Object

read_bytes()

Return type:: bytes

property size: int

Get length of the attached file in bytes according to the PDF creator.

Return type:: int

class pikepdf.NameTree(obj, *, auto_repair=...)

An object for managing name tree data structures in PDFs.

A name tree is a key-value data structure. The keys are any binary strings (that is, Python bytes). If a str is provided as a key, the UTF-8 encoding of that string is used. Name trees are (confusingly) not indexed by pikepdf.Name objects. They behave like Dict[bytes, pikepdf.Object].

The keys are sorted; pikepdf will ensure that the order is preserved.
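
A rough model of the key handling described above: str keys are converted to their UTF-8 bytes and entries are kept in sorted key order. The sketch uses a plain dict and assumes simple bytewise ordering. The helper as_tree_key is hypothetical and is not how pikepdf or libqpdf implement the tree.

```python
from typing import Dict, Union

Key = Union[bytes, str]


def as_tree_key(key: Key) -> bytes:
    """Normalize a key to bytes, encoding str keys as UTF-8."""
    return key.encode("utf-8") if isinstance(key, str) else bytes(key)


entries: Dict[bytes, str] = {}
for key, value in [("zeta", "last"), (b"alpha", "first"), ("caf\u00e9", "accented")]:
    entries[as_tree_key(key)] = value

# Bytewise sorting of the normalized keys.
ordered_keys = sorted(entries)

# A str lookup and a bytes lookup reach the same entry.
same_entry = entries[as_tree_key("alpha")] == entries[b"alpha"]
```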

The value may be any PDF object. Typically it will be a dictionary or array.

Internally in the PDF, a name tree can be a fairly complex tree data structure implemented with many dictionaries and arrays. pikepdf (using libqpdf) will automatically read, repair and maintain this tree for you. There should not be any reason to access the internal nodes of a name tree; use this interface instead.

NameTrees are used to store certain objects like file attachments in a PDF. Where a more specific interface exists, use that instead, and it will manipulate the name tree in a semantically correct manner for you.

Do not modify the internal structure of a name tree while you have a NameTree referencing it. Access it only through the NameTree object.

Name trees are described in the PDF reference manual, section 7.9.6. See section 7.7.4 for a list of PDF objects that are stored in name trees.

Added in version 3.0.

Parameters:

obj (Object)
auto_repair (bool)

static new(pdf, *, auto_repair=True)

Create a new NameTree in the provided Pdf.

You will probably need to insert the name tree in the PDF’s catalog. For example, to insert this name tree in /Root /Names /Dests:

nt = NameTree.new(pdf)
pdf.Root.Names.Dests = nt.obj

Parameters:

pdf (Pdf)
auto_repair (bool)

Return type:

NameTree

property obj: Object

Returns the underlying root object for this name tree.

Return type:: Object

class pikepdf.NumberTree(obj, *, auto_repair=...)

An object for managing number tree data structures in PDFs.

A number tree is a key-value data structure, like name trees, except that the key is an integer. It behaves like Dict[int, pikepdf.Object].

The keys can be sparse: not all integer positions will be populated. Keys are also always sorted; pikepdf will ensure that the order is preserved.

The value may be any PDF object. Typically it will be a dictionary or array.

Internally in the PDF, a number tree can be a fairly complex tree data structure implemented with many dictionaries and arrays. pikepdf (using libqpdf) will automatically read, repair and maintain this tree for you. There should not be any reason to access the internal nodes of a number tree; use this interface instead.

NumberTrees are not used much in PDF. The main thing they provide is a mapping between 0-based page numbers and user-facing page numbers (which pikepdf also exposes as Page.label). The /PageLabels number tree is where the page numbering rules are defined.

Number trees are described in the PDF reference manual, section 7.9.7. See section 12.4.2 for a description of the page labels number tree. Here is an example of modifying an existing page labels number tree:

from pikepdf import Dictionary, Name, NumberTree

pagelabels = NumberTree(pdf.Root.PageLabels)
# Label pages starting at index 0 (the first page) with lowercase Roman numerals
pagelabels[0] = Dictionary(S=Name.r)
# Label pages starting at index 6 (the seventh page) with decimal numbers
pagelabels[6] = Dictionary(S=Name.D)

# Page labels will now be:
# i, ii, iii, iv, v, vi, 1, 2, 3, ...
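
To make the mapping from number tree keys to labels concrete, here is a pure-Python sketch of how a label could be derived from such ranges. A plain dict stands in for the number tree, only the /r and /D styles and the optional /St start value are handled, and the helpers to_roman and page_label are hypothetical. Because keys are 0-based page indices, a range that starts at key 6 leaves six pages (indices 0 to 5) in the preceding range.

```python
import bisect
from typing import Any, Dict

ROMAN = [
    (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"), (90, "xc"),
    (50, "l"), (40, "xl"), (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
]


def to_roman(number: int) -> str:
    """Format a positive integer as a lowercase Roman numeral."""
    out = []
    for value, symbol in ROMAN:
        while number >= value:
            out.append(symbol)
            number -= value
    return "".join(out)


def page_label(ranges: Dict[int, Dict[str, Any]], page_index: int) -> str:
    """Label a 0-based page index using the range that covers it."""
    starts = sorted(ranges)
    pos = bisect.bisect_right(starts, page_index) - 1
    if pos < 0:
        return str(page_index + 1)  # no range applies: plain 1-based number
    start = starts[pos]
    rule = ranges[start]
    number = rule.get("St", 1) + (page_index - start)
    if rule["S"] == "r":
        return to_roman(number)
    if rule["S"] == "D":
        return str(number)
    raise ValueError(f"unsupported numbering style: {rule['S']}")


ranges = {0: {"S": "r"}, 6: {"S": "D"}}
labels = [page_label(ranges, index) for index in range(9)]
```

For these ranges, labels runs i through vi for the first six pages and then restarts at 1 for the seventh page.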

Do not modify the internal structure of a number tree while you have a NumberTree referencing it. Access it only through the NumberTree object.

Added in version 5.4.

Parameters:

obj (Object)
auto_repair (bool)

static new(pdf, *, auto_repair=True)

Create a new NumberTree in the provided Pdf.

You will probably need to insert the number tree in the PDF’s catalog. For example, to insert this number tree in /Root /PageLabels:

nt = NumberTree.new(pdf)
pdf.Root.PageLabels = nt.obj

Parameters:

pdf (Pdf)
auto_repair (bool)

Return type:

NumberTree

property obj: Object

Return type:: Object