Main objects

class pikepdf.Pdf(*args, **kwargs)

property Root: Object

Return type:: Object

property acroform: AcroForm

Returns a helper object for working with interactive forms.

Tip

This creates a new AcroForm helper object each time this property is used. If you’re planning on doing multiple form-related operations, keep a reference to this object. The helper has an internal cache that can speed up certain operations.

Return type:: AcroForm

add_blank_page(*, page_size=...)

Add a blank page to this PDF.

If pages already exist, the page will be added to the end. Pages may be reordered using Pdf.pages.

The caller may add content to the page by modifying its objects after creating it.

Parameters:: page_size (tuple) – The size of the page in PDF units (1/72 inch or 0.35mm). Default size is set to a US Letter 8.5” x 11” page.
Return type:: Page

add_pages_from(src, pages=None, *, forms='preserve')

Append pages from another Pdf, preserving interactive form fields.

Unlike pdf.pages.extend(src.pages), this carries the document’s AcroForm form fields so they remain functional in Adobe Acrobat. Fields whose fully-qualified names collide with existing fields are automatically renamed; the mapping is available on the returned pikepdf.PageCopyResult. The original→new name mapping in renamed_fields is best-effort (it pairs source and destination page fields positionally).

Independent top-level fields are only carried over when a widget is on a copied page, so unrelated forms on separate pages are not imported. However, a field sharing a top-level ancestor with a copied field is carried as an entire subtree; such partially-represented fields are listed in the result’s partial_fields. Use forms='strip' for a hard guarantee of no form data.

Parameters:

src (Pdf) – Source Pdf to copy pages from.
pages (collections.abc.Iterable[int] | range | slice | None) – Zero-based indices (iterable, range, or slice) of pages in src to copy. None copies all pages. A slice is clamped to the document length; explicit indices (including range) must be valid or IndexError is raised.
forms (Literal['preserve', 'strip']) – 'preserve' (default) carries AcroForm fields along with the pages; 'strip' removes widget annotations from the copied pages so no form data is imported.

Returns:

A pikepdf.PageCopyResult describing the operation, including which fields were added and any automatic renames.

Return type:

pikepdf._page_copy.PageCopyResult

property allow: pikepdf.models.encryption.Permissions

Report permissions associated with this PDF.

By default these permissions will be replicated when the PDF is saved. Permissions may also only be changed when a PDF is being saved, and are only available for encrypted PDFs. If a PDF is not encrypted, all operations are reported as allowed.

pikepdf has no way of enforcing permissions.

Return type:: pikepdf.models.encryption.Permissions

property attachments: Attachments

Returns a mapping that provides access to all files attached to this PDF.

PDF supports attaching (or embedding, if you prefer) any other type of file, including other PDFs. This property provides read and write access to these objects by filename.

Return type:: Attachments

check_linearization(stream=...)

Reports information on the PDF’s linearization.

Parameters:: stream (object) – A stream to write this information to; must implement the .write() and .flush() methods. Defaults to sys.stderr.
Returns:: True if the file is correctly linearized, and False if the file is linearized but the linearization data contains errors or was incorrectly generated.
Raises:: RuntimeError – If the PDF in question is not linearized at all.
Return type:: bool

check_pdf_syntax(progress=...)

Check if PDF is syntactically well-formed.

Similar to qpdf --check, checks for syntax or structural problems in the PDF. This is mainly useful to PDF developers and may not be informative to the average user. PDFs with these problems still render correctly, if PDF viewers are capable of working around the issues they contain. In many cases, pikepdf can also fix the problems.

Unlike qpdf --check, this function does not check for linearization issues (see check_linearization()) and some other issues. To replicate the exact behavior of qpdf’s check in pikepdf, use pikepdf.Job(['pikepdf', '--check', 'input.pdf']).run().

An example problem found by this function is an xref table that is missing an object reference. A page dictionary with a value of the wrong type, such as a string instead of an array of integers for its mediabox, is not the sort of issue checked for. If this were an XML checker, it would tell you if the XML is well-formed, but could not tell you if the XML is valid XHTML or if it can be rendered as a usable web page.

This function also attempts to decompress all streams in the PDF. If no JBIG2 decoder is available and JBIG2 images are present, a warning will be issued that JBIG2 cannot be checked.

This function returns a list of strings describing the issues. The text is subject to change and should not be treated as a stable API.

Parameters:: progress (collections.abc.Callable[[int], None] | None) – A function to call with progress updates, from 0 to 100. If None (default), no progress will be reported.
Returns:: Empty list if no issues were found. List of issues as text strings if issues were found.
Return type:: list[str]

close()

Close a Pdf object and release resources acquired by pikepdf.

If pikepdf opened the file handle it will close it (e.g. when opened with a file path). If the caller opened the file for pikepdf, the caller must close the file. with blocks call close() on exit.

pikepdf lazily loads data from PDFs, so some pikepdf.Object may implicitly depend on the pikepdf.Pdf being open. This is always the case for pikepdf.Stream but can be true for any object. Do not close the Pdf object if you might still be accessing content from it.

When an Object is copied from one Pdf to another, the Object is copied into the destination Pdf immediately, so after accessing all desired information from the source Pdf it may be closed.

Changed in version 3.0: In pikepdf 2.x, this function actually worked by resetting to a very short empty PDF. Code that relied on this quirk may not function correctly.

Return type:: None

copy_foreign(h)

Copy an Object from a foreign Pdf and return a copy.

The object must be owned by a different Pdf from this one.

If the object has previously been copied, return a reference to the existing copy, even if that copy has been modified in the meantime.

If you want to copy a page from one PDF to another, use: pdf_b.pages[0] = pdf_a.pages[0]. That interface accounts for the complexity of copying pages.

This function is used to copy a pikepdf.Object that is owned by some other Pdf into this one. It performs a deep (recursive) copy and preserves all references that may exist in the foreign object. For example, if

>>> object_a = pdf.copy_foreign(object_x)
>>> object_b = pdf.copy_foreign(object_y)
>>> object_c = pdf.copy_foreign(object_z)

and object_z is a shared descendant of both object_x and object_y in the foreign PDF, then object_c is a shared descendant of both object_a and object_b in this PDF. If object_x and object_y refer to the same object, then object_a and object_b are the same object.

It also copies all pikepdf.Stream objects. Since this may copy a large amount of data, it is not done implicitly. This function does not copy references to pages in the foreign PDF - it stops at page boundaries. Thus, if you use copy_foreign() on a table of contents (/Outlines dictionary), you may have to update references to pages.

Direct objects, including dictionaries, do not need copy_foreign(). pikepdf will automatically convert and construct them.

Note

pikepdf automatically treats incoming pages from a foreign PDF as foreign objects, so Pdf.pages does not require this treatment.

See also

pikepdf.Object.is_indirect()

make_stream(data, d=None, **kwargs)

Create a new pikepdf.Stream object that is attached to this PDF.

See:: pikepdf.Stream.__new__()

Parameters:: data (bytes)
Return type:: Stream

classmethod new()

Create a new, empty PDF.

This is best when you are constructing a PDF from scratch.

In most cases, if you are working from an existing PDF, you should open the PDF using pikepdf.Pdf.open() and transform it, instead of creating a new one, to preserve metadata and structural information. For example, if you want to split a PDF into two parts, you should open the PDF and transform it into the desired parts, rather than creating a new PDF and copying pages into it.

Return type:: Pdf

property objects: _ObjectList

Return an iterable list of all objects in the PDF.

After deleting content from a PDF such as pages, objects related to that page, such as images on the page, may still be present in this list.

Return type:: _ObjectList

static open(filename_or_stream, *, password='', hex_password=False, ignore_xref_streams=False, suppress_warnings=True, attempt_recovery=True, inherit_page_attributes=True, access_mode=AccessMode.default, allow_overwriting_input=False)

Open an existing file at filename_or_stream.

If filename_or_stream is path-like, the file will be opened for reading. The file should not be modified by another process while it is open in pikepdf, or undefined behavior may occur. This is because the file may be lazily loaded. When .close() is called, the file handle that pikepdf opened will be closed.

If filename_or_stream is a stream, the data will be accessed as a readable binary stream, from the current position in that stream. When pdf = Pdf.open(stream) is called on a stream, pikepdf will not call stream.close(); the caller must call both pdf.close() and stream.close(), in that order, when the Pdf and stream are no longer needed. Using with blocks will call .close() automatically.

Whether a file or stream is opened, you must ensure that the data is not modified by another thread or process, or undefined behavior will occur. You also may not overwrite the input file using .save(), unless allow_overwriting_input=True. This is because data may be lazily loaded.

If you intend to edit the file in place, or want to protect the file against modification by another process, use allow_overwriting_input=True. This tells pikepdf to make a private copy of the file.

Any changes to the file must be persisted by using .save().

Examples

>>> with Pdf.open("test.pdf") as pdf:
...     pass

>>> pdf = Pdf.open("test.pdf", password="rosebud")

Parameters:

filename_or_stream (pathlib.Path | str | BinaryIO) – Filename or Python readable and seekable file stream of PDF to open.
password (str | bytes) – User or owner password to open an encrypted PDF. If the type of this parameter is str it will be encoded as UTF-8. If the type is bytes it will be saved verbatim. Passwords are always padded or truncated to 32 bytes internally. Use ASCII passwords for maximum compatibility.
hex_password (bool) – If True, interpret the password as a hex-encoded version of the exact encryption key to use, without performing the normal key computation. Useful in forensics.
ignore_xref_streams (bool) – If True, ignore cross-reference streams. See qpdf documentation.
suppress_warnings (bool) – If True (default), warnings are not printed to stderr. Use pikepdf.Pdf.get_warnings() to retrieve warnings.
attempt_recovery (bool) – If True (default), attempt to recover from PDF parsing errors.
inherit_page_attributes (bool) – If True (the default), push attributes that are set on a group of pages in the /Pages tree (/MediaBox, /CropBox, /Resources and /Rotate) down onto each individual page, so that every page carries its own copy. This simplifies most PDF work, since these attributes can then be read directly from a page. If False, pikepdf leaves the page tree as stored and does not push inherited attributes down, so a page may lack these keys on its own dictionary; in that case use the managed accessors (mediabox, rotation, etc.), which resolve inheritance, rather than raw access, which may find the key absent. Disable this when you need to inspect or construct the page tree exactly as stored – for example, when building a test fixture that exercises attribute inheritance.
access_mode (AccessMode) – If .default, pikepdf will decide how to access the file. Currently, it will always select stream access. To attempt memory mapping and fall back to stream access if memory mapping fails, use .mmap. Use .mmap_only to require memory mapping or fail (this is expected to only be useful for testing). Applications should be prepared to handle the SIGBUS signal on POSIX in the event that the file is successfully mapped but later goes away.
allow_overwriting_input (bool) – If True, allows calling .save() to overwrite the input file. This is performed by loading the entire input file into memory at open time; this will use more memory and may reduce performance, especially when the opened file will not be modified.

Raises:

pikepdf.PasswordError – If the password failed to open the file.
pikepdf.PdfError – If for other reasons we could not open the file.
TypeError – If the type of filename_or_stream is not usable.
FileNotFoundError – If the file was not found.

Return type:

Pdf

Note

When filename_or_stream is a stream and the stream is located on a network, pikepdf assumes that the stream uses buffering and read caches to achieve reasonable performance. Streams that fetch data over a network in response to every read or seek request, no matter how small, will perform poorly. It may be easier to download a PDF from network to temporary local storage (such as io.BytesIO), manipulate it, and then re-upload it.

Changed in version 3.0: Keyword arguments now mandatory for everything except the first argument.

open_metadata(set_pikepdf_as_editor=True, update_docinfo=True, strict=False)

Open the PDF’s XMP metadata for editing.

There is no .close() function on the metadata object, since this is intended to be used inside a with block only.

For historical reasons, certain parts of PDF metadata are stored in two different locations and formats. This feature coordinates edits so that both types of metadata are updated consistently and “atomically” (assuming single threaded access). It operates on the Pdf in memory, not any file on disk. To persist metadata changes, you must still use Pdf.save().

Example

>>> pdf = pikepdf.Pdf.open("../tests/resources/graph.pdf")
>>> with pdf.open_metadata() as meta:
...     meta['dc:title'] = 'Set the Dublin Core Title'
...     meta['dc:description'] = 'Put the Abstract here'

Parameters:

set_pikepdf_as_editor (bool) – Automatically update the metadata pdf:Producer to show that this version of pikepdf is the most recent software to modify the metadata, and xmp:MetadataDate to timestamp the update. Recommended, except for testing.
update_docinfo (bool) – Update the standard fields of DocumentInfo (the old PDF metadata dictionary) to match the corresponding XMP fields. The mapping is described in PdfMetadata.DOCINFO_MAPPING. Nonstandard DocumentInfo fields and XMP metadata fields with no DocumentInfo equivalent are ignored.
strict (bool) – If False (the default), we aggressively attempt to recover from any parse errors in XMP, and if that fails we overwrite the XMP with an empty XMP record. If True, raise errors when either metadata bytes are not valid and well-formed XMP (and thus, XML). Some trivial cases that are equivalent to empty or incomplete “XMP skeletons” are never treated as errors, and always replaced with a proper empty XMP block. Certain errors may be logged.

Return type:

pikepdf.models.metadata.PdfMetadata

open_outline(max_depth=15, strict=False)

Open the PDF outline (“bookmarks”) for editing.

Recommended for use in a with block. Changes are committed to the PDF when the block exits. (The Pdf must still be open.)

Example

>>> pdf = pikepdf.open('../tests/resources/outlines.pdf')
>>> with pdf.open_outline() as outline:
...     outline.root.insert(0, pikepdf.OutlineItem('Intro', 0))

Parameters:

max_depth (int) – Maximum recursion depth of the outline to be imported and re-written to the document. 0 means only considering the root level, 1 the first-level sub-outline of each root element, and so on. Items beyond this depth will be silently ignored. Default is 15.
strict (bool) – When False (the default), pikepdf quietly corrects minor structural problems in the outline where the correct repair is known, recovering the valid parts of the document outline without raising an exception. For example, a missing required /Title is treated as an empty string; a structural error such as a reference loop cancels processing of further nodes on that level; and outline objects that have been accidentally duplicated are reproduced as new objects. When set to True, any such structural problem raises an OutlineStructureError.

Return type:

pikepdf.models.outlines.Outline

property owner_password_matched: bool

Returns True if the owner password matched when the Pdf was opened.

It is possible for both the user and owner passwords to match.

Added in version 2.10.

Return type:: bool

property pages: PageList

Returns the list of pages.

Return type:: PageList

property pdf_version: str

The version of the PDF specification used for this file, such as ‘1.7’.

More precise information about the PDF version can be obtained from the Pdf’s XMP metadata.

Return type:: str

remove_unreferenced_resources()

Remove from /Resources any object not referenced in page’s contents.

PDF pages may share resource dictionaries with other pages. If pikepdf is used for page splitting, pages may reference resources in their /Resources dictionary that are not actually required. This purges all unnecessary resource entries.

For clarity, if all references to any type of object are removed, that object will be excluded from the output PDF on save. (Conversely, only objects that are discoverable from the PDF’s root object are included.) This function removes objects that are referenced from the page /Resources dictionary, but never called for in the content stream, making them unnecessary.

Suggested before saving, if content streams or /Resources dictionaries are edited.

Return type:: None

property root: Object

The /Root object of the PDF.

Return type:: Object

save(filename_or_stream=None, *, static_id=False, preserve_pdfa=True, min_version='', force_version='', fix_metadata_version=True, compress_streams=True, stream_decode_level=None, object_stream_mode=ObjectStreamMode.preserve, normalize_content=False, linearize=False, qdf=False, progress=None, encryption=None, recompress_flate=False, deterministic_id=False)

Save all modifications to this pikepdf.Pdf.

Parameters:

filename_or_stream (pathlib.Path | str | BinaryIO | None) – Where to write the output. If a file exists in this location it will be overwritten. If the file was opened with allow_overwriting_input=True, then it is permitted to overwrite the original file, and this parameter may be omitted to implicitly use the original filename. Otherwise, the filename may not be the same as the input file, as overwriting the input file would corrupt data since pikepdf uses lazy loading.
static_id (bool) – Indicates that the /ID metadata, normally calculated as a hash of certain PDF contents and metadata including the current time, should instead be set to a static value. Only use this for debugging and testing. Use deterministic_id if you want to get the same /ID for the same document contents.
preserve_pdfa (bool) – Ensures that the file is generated in a manner compliant with PDF/A and other stricter variants. This should be True, the default, in most cases.
min_version (str | tuple[str, int]) – Sets the minimum version of PDF specification that should be required. If left alone qpdf will decide. If a tuple, the second element is an integer, the extension level. If the version number is not a valid format, qpdf will decide what to do.
force_version (str | tuple[str, int]) – Override the version recommended by qpdf, potentially creating an invalid file that does not display in old versions. See qpdf manual for details. If a tuple, the second element is an integer, the extension level.
fix_metadata_version (bool) – If True (default) and the XMP metadata contains the optional PDF version field, ensure the version in metadata is correct. If the XMP metadata does not contain a PDF version field, none will be added. To ensure that the field is added, edit the metadata and insert a placeholder value in pdf:PDFVersion. If XMP metadata does not exist, it will not be created regardless of the value of this argument.
object_stream_mode (ObjectStreamMode) – disable prevents the use of object streams. preserve keeps object streams from the input file. generate uses object streams wherever possible, creating the smallest files but requiring PDF 1.5+.
compress_streams (bool) –
Enables or disables the compression of uncompressed stream objects. By default this is set to True, and the only reason to set it to False is for debugging or inspecting PDF contents.

When enabled, uncompressed stream objects will be compressed whether they were uncompressed in the PDF when it was opened, or when the user creates new pikepdf.Stream objects attached to the PDF. Stream objects can also be created indirectly, such as when content from another PDF is merged into the one being saved.

Only stream objects that have no compression will be compressed when this object is set. If the object is compressed, compression will be preserved.

Setting compress_streams=False does not trigger decompression unless decompression is specifically requested by setting both compress_streams=False and stream_decode_level to the desired decode level (e.g. .generalized will decompress most non-image content).

This option does not trigger recompression of existing compressed streams. For that, use recompress_flate.

The XMP metadata stream object, if present, is never compressed, to facilitate metadata reading by parsers that don’t understand the full structure of PDF.
stream_decode_level (StreamDecodeLevel | None) – Specifies how to encode stream objects. See documentation for pikepdf.StreamDecodeLevel.
recompress_flate (bool) – When disabled (the default), qpdf does not uncompress and recompress streams compressed with the Flate compression algorithm. If True, pikepdf will instruct qpdf to do this, which may be useful if recompressing streams to a higher compression level.
normalize_content (bool) – Enables parsing and reformatting the content stream within PDFs. This may make debugging PDFs easier.
linearize (bool) – Enables creating linearized or “fast web view” files, where the file’s contents are organized sequentially so that a viewer can begin rendering before it has the whole file. As a drawback, it tends to make files larger.
qdf (bool) – Save output in QDF mode. QDF mode is a special output mode in qpdf to allow editing of PDFs in a text editor. Use the program fix-qdf to convert back to a standard PDF.
progress (collections.abc.Callable[[int], None] | None) – Specify a callback function that is called as the PDF is written. The function will be called with an integer between 0-100 as the sole parameter, the progress percentage. This function may not access or modify the PDF while it is being written, or data corruption will almost certainly occur.
encryption (pikepdf.models.encryption.Encryption | bool | None) – If False or omitted, existing encryption will be removed. If True encryption settings are copied from the originating PDF. Alternately, an Encryption object may be provided that sets the parameters for new encryption.
deterministic_id (bool) – Indicates that the /ID metadata, normally calculated as a hash of certain PDF contents and metadata including the current time, should instead be computed using only deterministic data like the file contents. At a small runtime cost, this enables generation of the same /ID if the same inputs are converted in the same way multiple times. Does not work for encrypted files.

Raises:

PdfError –
ForeignObjectError –
ValueError –

Return type:

None

You may call .save() multiple times with different parameters to generate different versions of a file, and you may continue to modify the file after saving it. .save() does not modify the Pdf object in memory, except possibly by updating the XMP metadata version with fix_metadata_version.

Note

pikepdf.Pdf.remove_unreferenced_resources() before saving may eliminate unnecessary resources from the output file if there are any objects (such as images) that are referenced in a page’s Resources dictionary but never called in the page’s content stream.

Note

pikepdf can read PDFs with incremental updates, but always coalesces any incremental updates into a single non-incremental PDF file when saving.

Note

If filename_or_stream is a stream and the process is interrupted during writing, the stream may be left in a corrupt state. It is the responsibility of the caller to manage the stream in this case.

Changed in version 2.7: Added recompress_flate.

Changed in version 3.0: Keyword arguments now mandatory for everything except the first argument.

Changed in version 8.1: If filename_or_stream is a filename and that file exists, the new file is written to a temporary file in the same directory and then moved into place. This prevents the existing destination file from being corrupted if the process is interrupted during writing; previously, corrupting the destination file was possible. If no file exists at the destination, output is written directly to the destination, but the destination will be deleted if errors occur during writing. Prior to 8.1, the file was always written directly to the destination, which could result in a corrupt destination file if the process was interrupted during writing.

Changed in version 9.1: When opened with allow_overwriting_input=True, we now attempt to restore the original file permissions, ownership and creation time. The modified time is always set to the time of saving. An unusual umask or other settings changes still cause a failure to restore permissions.

show_xref_table()

Pretty-print the Pdf’s xref (cross-reference table).

The xref table will be written to the pikepdf._core module’s logger with a logging level of logging.INFO. You may need to adjust the logging level to see the output.

This function is mainly for debugging or curiosity. In practice, pikepdf does not trust the xref table; it instead reads the PDF to determine the position of objects, and recalculates the xref table when a PDF is saved. One could use this function to locate objects within a PDF using a hex editor, assuming the PDF is well-formed.

Return type:: None

property trailer: Object

Provides access to the PDF trailer object.

See the PDF 1.7 reference manual (ISO 32000-1), section 7.5.5. Generally speaking, the trailer should not be modified with pikepdf, and modifying it may not work. Some of the values in the trailer are automatically changed when a file is saved.

Return type:: Object

update_from_qpdf_json(filename_or_stream)

Update this Pdf from qpdf JSON, as written by write_qpdf_json().

Objects present in this Pdf but absent from the JSON are left unchanged. See from_qpdf_json() to create a new Pdf instead.

Parameters:: filename_or_stream (pathlib.Path | str | BinaryIO) – A filename or readable binary stream containing qpdf JSON.
Return type:: None

Added in version 10.9.

property user_password_matched: bool

Returns True if the user password matched when the Pdf was opened.

It is possible for both the user and owner passwords to match.

Added in version 2.10.

Return type:: bool

write_qpdf_json(filename_or_stream, *, decode_level=..., json_stream_data=..., file_prefix=...)

Write this PDF as qpdf JSON (the qpdf --json-output format, v2).

This is the whole-document JSON serialization, distinct from pikepdf.Object.to_json() which serializes a single object. The output can be read back with from_qpdf_json().

Parameters:

filename_or_stream (pathlib.Path | str | BinaryIO) – A filename or writable binary stream.
decode_level (StreamDecodeLevel) – How much to decode (uncompress) stream data in the JSON. Use StreamDecodeLevel.none to preserve stream data exactly.
json_stream_data (JSONStreamData) – How stream data is represented; see pikepdf.JSONStreamData.
file_prefix (str) – Required when json_stream_data is JSONStreamData.file; each stream is written to a file named {file_prefix}-{object_number}. If not given and a filename was supplied, the filename is used as the prefix.

Return type:

None

Added in version 10.9.

pikepdf.open(): Alias for pikepdf.Pdf.open().

pikepdf.new(): Alias for pikepdf.Pdf.new().

Access modes

class pikepdf.ObjectStreamMode(*args, **kwds)

Options for saving object streams within PDFs.

Object streams are a more compact way of saving certain types of data, added in PDF 1.5. All modern PDF viewers support object streams, but some third party tools and libraries cannot read them.

disable: Ellipsis

Disable the use of object streams.

If any object streams exist in the file, remove them when the file is saved.

generate: Ellipsis: Generate object streams.

preserve: Ellipsis

Preserve any existing object streams in the original file.

This is the default behavior.

class pikepdf.StreamDecodeLevel(*args, **kwds)

Options for decoding streams within PDFs.

all: Ellipsis: In addition to generalized and non-lossy specialized filters, supported lossy compression filters will be applied. At present, this includes DCTDecode (JPEG) compression. Note that compressing the resulting data with DCTDecode again will accumulate loss, so avoid multiple compression and decompression cycles. This is mostly useful for (low-level) retrieving image data; see pikepdf.PdfImage for the preferred method.

generalized: Ellipsis: This is the default. libqpdf will apply LZWDecode, ASCII85Decode, ASCIIHexDecode, and FlateDecode filters on the input. When saved with compress_streams=True, the default, the effect of this is that streams filtered with these older and less efficient filters will be recompressed with the Flate filter. As a special case, if a stream is already compressed with FlateDecode and compress_streams=True, the original compressed data will be preserved.

none: Ellipsis: Do not attempt to apply any filters. Streams remain as they appear in the original file. Note that uncompressed streams may still be compressed on output. You can disable that by saving with .save(..., compress_streams=False).

specialized: Ellipsis: In addition to uncompressing the generalized compression formats, supported non-lossy compression will also be decoded. At present, this includes the RunLengthDecode filter.

class pikepdf.Encryption

Specify the encryption settings to apply when a PDF is saved.

R: Literal[2, 3, 4, 5, 6] = 6: Select the security handler algorithm to use. Choose from: 2, 3, 4 or 6. By default, the highest version is selected (6). 5 is a deprecated algorithm that should not be used.

aes: bool = True: If True, request the AES algorithm. If False, use RC4. If omitted, AES is selected whenever possible (R >= 4).

allow: Permissions: The permissions to set. If omitted, all permissions are granted to the user.

metadata: bool = True: If True, also encrypt the PDF metadata. If False, metadata is not encrypted. Reading document metadata without decryption may be desirable in some cases. Requires aes=True. If omitted, metadata is encrypted whenever possible.

owner: str = '': The owner password to use. This allows full control of the file. If blank, the PDF will be encrypted and present as “(SECURED)” in PDF viewers. If the owner password is blank, the user password should be as well.

user: str = '': The user password to use. With this password, some restrictions will be imposed by a typical PDF reader. If blank, the PDF can be opened by anyone, but only modified as allowed by the permissions in allow.

Object construction

class pikepdf.Object

__abs__()

Return type:: int

__add__(other, /)

Parameters:: other (int)
Return type:: int

__bool__()

Return type:: bool

__bytes__()

Return type:: bytes

__contains__(obj, /)

Parameters:: obj (Object | str)
Return type:: bool

__copy__()

Return type:: Object

__delattr__(name, /)

Parameters:: name (str)
Return type:: None

__delitem__(name, /)

Parameters:: name (str | Name | int)
Return type:: None

__dir__()

Return type:: list

__eq__(other, /)

Parameters:: other (Any)
Return type:: bool

__float__()

Return type:: float

__floordiv__(other, /)

Parameters:: other (int)
Return type:: int

__getattr__(name, /)

Parameters:: name (str)
Return type:: Object

__getitem__(name: str | Name | int, /) → Object

__hash__()

Return type:: int

__index__()

Return type:: int

__int__()

Return type:: int

__iter__()

Return type:: collections.abc.Iterator[Object]

__len__()

Return type:: int

__mod__(other, /)

Parameters:: other (int)
Return type:: int

__mul__(other, /)

Parameters:: other (int)
Return type:: int

__neg__()

Return type:: int

__pos__()

Return type:: int

__radd__(other, /)

Parameters:: other (int)
Return type:: int

__rfloordiv__(other, /)

Parameters:: other (int)
Return type:: int

__rmod__(other, /)

Parameters:: other (int)
Return type:: int

__rmul__(other, /)

Parameters:: other (int)
Return type:: int

__rsub__(other, /)

Parameters:: other (int)
Return type:: int

__setattr__(name, value, /)

Parameters:

name (str)
value (Any)

Return type:

None

__setitem__(name: str | Name | int, value: Any, /) → None

__sub__(other, /)

Parameters:: other (int)
Return type:: int

append(value, /)

Append another object to an array; fails if the object is not an array.

Parameters:: value (Any)
Return type:: None

as_bool() → bool

as_decimal() → decimal.Decimal

as_dict()

Return type:: _ObjectMapping

as_float() → float

as_int() → int

as_list()

Return type:: _ObjectList

copy()

Return type:: Object

emplace(other, retain=...)

Copy all items from other without making a new object.

Particularly when working with pages, it may be desirable to remove all of the existing page’s contents and emplace (insert) a new page on top of it, in a way that preserves all links and references to the original page. (Or similarly, for other Dictionary objects in a PDF.)

Any Dictionary keys in the iterable retain are preserved. By default, /Parent is retained.

When a page is assigned (pdf.pages[0] = new_page), only the application knows if references to the original page are still valid. For example, a PDF optimizer might restructure a page object into another visually similar one, and references would be valid; but for a program that reorganizes page contents, such as an N-up compositor, references may not be valid anymore.

This method takes precautions to ensure that child objects in common with self and other are not inadvertently deleted.

Example

>>> pdf = pikepdf.Pdf.open('../tests/resources/fourpages.pdf')
>>> pdf.pages[0].objgen
(3, 0)
>>> pdf.pages[0].emplace(pdf.pages[1])
>>> pdf.pages[0].objgen
(3, 0)
>>> # Same object

Changed in version 2.11.1: Added the retain argument.

Parameters:

other (Object)
retain (collections.abc.Iterable[Name])

Return type:

None
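
The key handling described for emplace() can be pictured with plain Python dictionaries. The following is only a rough sketch of the semantics under that analogy, not pikepdf's implementation; emplace_sketch and the sample keys and values are hypothetical names introduced for illustration.

```python
def emplace_sketch(target: dict, other: dict, retain=("/Parent",)) -> None:
    """Overwrite target in place with other's items, keeping retained keys."""
    retained = set(retain)
    # Copy every non-retained item from other into target.
    for key, value in other.items():
        if key not in retained:
            target[key] = value
    # Drop keys that other does not have, unless they are retained.
    for key in set(target) - set(other) - retained:
        del target[key]


page = {"/Parent": "pages-tree-A", "/Contents": "old", "/Rotate": 90}
new_page = {"/Parent": "pages-tree-B", "/Contents": "new"}
same_object = page  # a second reference to the original object

emplace_sketch(page, new_page)
print(page)                 # {'/Parent': 'pages-tree-A', '/Contents': 'new'}
print(page is same_object)  # True
```

Because the target is modified rather than replaced, anything that already refers to it keeps pointing at the updated object, which is the property the objgen example above demonstrates.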

extend(iter, /)

Extend a pikepdf.Array with an iterable of other pikepdf.Object.

Parameters:: iter (collections.abc.Iterable[Object])
Return type:: None

get(key: int | str | Name, /) → Object | None

get_raw_stream_buffer()

Return a buffer protocol buffer describing the raw, encoded stream.

Return type:: Buffer

get_stream_buffer(decode_level=...)

Return a buffer protocol buffer describing the decoded stream.

Parameters:: decode_level (StreamDecodeLevel)
Return type:: Buffer

property images: _ObjectMapping

Return type:: _ObjectMapping

property is_indirect: bool

Returns True if the object is an indirect object.

Return type:: bool

is_owned_by(possible_owner)

Test if this object is owned by the indicated possible_owner.

Parameters:: possible_owner (Pdf)
Return type:: bool

property is_rectangle: bool

Returns True if the object is a rectangle (an array of 4 numbers).

Return type:: bool

items()

Return type:: collections.abc.Iterable[tuple[str, Object]]

keys()

Get the keys of the object, if it is a Dictionary or Stream.

Return type:: set[str]

property objgen: tuple[int, int]

Return the object-generation number pair for this object.

If this is a direct object, then the returned value is (0, 0). By definition, if this is an indirect object, it has an “objgen”, and can be looked up using this in the cross-reference (xref) table. Direct objects cannot necessarily be looked up.

The generation number is usually 0, except for PDFs that have been incrementally updated. Incrementally updated PDFs are now uncommon, since it does not take too long for modern CPUs to reconstruct an entire PDF. pikepdf will consolidate all incremental updates when saving.

Return type:: tuple[int, int]

static parse(stream, description=...)

Parse PDF binary representation into PDF objects.

Parameters:

stream (bytes)
description (str)

Return type:

Object

read_bytes(decode_level=...)

Decode and read the content stream associated with this object.

Parameters:: decode_level (StreamDecodeLevel)
Return type:: bytes

read_raw_bytes()

Read the content stream associated with a Stream, without decoding.

Return type:: bytes

same_owner_as(other)

Test if two objects are owned by the same pikepdf.Pdf.

Parameters:: other (Object)
Return type:: bool

property stream_dict: Dictionary

Access the dictionary key-values for a pikepdf.Stream.

Return type:: Dictionary

to_json(dereference=..., schema_version=...)

Convert to a qpdf JSON representation of the object.

See the qpdf manual for a description of its JSON representation. https://qpdf.readthedocs.io/en/stable/json.html#qpdf-json-format

Not necessarily compatible with other PDF-JSON representations that exist in the wild.

Names are encoded as UTF-8 strings
Indirect references are encoded as strings containing obj gen R
Strings are encoded as UTF-8 strings with unrepresentable binary characters encoded as \uHHHH
Encoding streams just encodes the stream’s dictionary; the stream data is not represented
Object types that are only valid in content streams (inline image, operator) as well as “reserved” objects are not representable and will be serialized as null.

Parameters:

dereference (bool) – If True, dereference the object if this is an indirect object.
schema_version (int) – The version of the JSON schema. Defaults to 2.

Returns:

JSON bytestring of object. The object is UTF-8 encoded and may be decoded to a Python str that represents the binary values \x00-\xFF as U+0000 to U+00FF; that is, it may contain mojibake.

Return type:

bytes

Changed in version 6.0: Added schema_version.

unparse(resolved=...)

Convert PDF objects into their binary representation.

Set resolved=True to dereference indirect objects where possible.

If you want to unparse content streams, which are a collection of objects that need special treatment, use pikepdf.unparse_content_stream() instead.

Returns bytes() that can be used with Object.parse() to reconstruct the pikepdf.Object. If reconstruction is not possible, a relative object reference is returned, such as 4 0 R.

Parameters:: resolved (bool) – If True, dereference indirect objects where possible.
Return type:: bytes

update(other)

Parameters:: other (collections.abc.Mapping[Any, Any] | Object)
Return type:: None

with_same_owner_as(arg0)

Returns an object that is owned by the same Pdf that owns the other object.

If the objects already have the same owner, this object is returned. If the other object has a different owner, then a copy is created that is owned by other’s owner. If this object is a direct object (no owner), then an indirect object is created that is owned by other. An exception is thrown if other is a direct object.

This method may be convenient when a reference to the Pdf is not available.

Added in version 2.14.

Parameters:: arg0 (Object)
Return type:: Object

wrap_in_array()

Return the object wrapped in an array if not already an array.

Return type:: Array

write(data, *, filter=..., decode_parms=..., type_check=...)

Replace stream object’s data with new (possibly compressed) data.

filter and decode_parms describe any compression that is already present on the input data. For example, if your data is already compressed with the Deflate algorithm, you would set filter=Name.FlateDecode.

When writing the PDF in pikepdf.Pdf.save(), pikepdf may change the compression or apply compression to data that was not compressed, depending on the parameters given to that function. It will never change lossless to lossy encoding.

PNG and TIFF images, even if compressed, cannot be directly inserted into a PDF and displayed as images.

Parameters:

data (bytes) – the new data to use for replacement
filter (Name | Array | list[Name] | None) – The filter(s) with which the data is (already) encoded
decode_parms (Dictionary | Array | None) – Parameters for the filters with which the object is encoded
type_check (bool) – Check arguments; use False only if you want to intentionally create malformed PDFs.

Return type:

None

If only one filter is specified, it may be a name such as Name('/FlateDecode'). If there are multiple filters, then an array of names should be given.

If there is only one filter, decode_parms is a Dictionary of parameters for that filter. If there are multiple filters, then decode_parms is an Array of Dictionary, where each array index corresponds to the filter at the same index.
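
As a small illustration of what "data that is already compressed" means for the filter argument, the sketch below prepares zlib (Deflate) data with the standard library only. The sample content and variable names are made up, and the pikepdf calls appear only in comments, where stream is assumed to be an existing pikepdf.Stream.

```python
import zlib

# Some repetitive sample bytes standing in for stream data.
raw = b"BT /F1 12 Tf (Hello) Tj ET\n" * 50

# zlib-format Deflate data, which is what the FlateDecode filter decodes.
compressed = zlib.compress(raw)

# With pikepdf, the pre-compressed bytes would be passed along with the
# filter that describes them, roughly:
#   stream.write(compressed, filter=pikepdf.Name.FlateDecode)
# Uncompressed bytes would be written without naming a filter:
#   stream.write(raw)

print(len(raw), len(compressed))
print(zlib.decompress(compressed) == raw)  # True
```

The point of the filter argument is to describe the bytes being handed over, not to request compression; whether the stream is compressed in the saved file is decided by the options given to pikepdf.Pdf.save().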

class pikepdf.Name

Construct a PDF Name object.

Names can be constructed with two notations:

Name.Resources

Name('/Resources')

The two are semantically equivalent. The former is preferred for names that are normally expected to be in a PDF. The latter is preferred for dynamic names and attributes.

__new__(name)

Parameters:: name (str | Name)
Return type:: Name

classmethod random(len_=16, prefix='')

Generate a cryptographically strong, random, valid PDF Name.

If you are inserting a new name into a PDF (for example, a name for a new image), you can use this function to generate a cryptographically strong random name that is almost certainly not already in the PDF, and does not collide with other existing names.

This function uses Python’s secrets.token_urlsafe, which returns a URL-safe encoded random number of the desired length. An optional prefix may be prepended. (The encoding is ultimately done with base64.urlsafe_b64encode().) Serendipitously, URL-safe is also PDF-safe.

When the length parameter is 16 (16 random bytes or 128 bits), the result is probably globally unique and can be treated as never colliding with other names.

The length of the returned string may vary because it is encoded, but will always have 8 * len_ random bits.

Parameters:

len_ (int) – The length of the random string.
prefix (str) – A prefix to prepend to the random string.

Return type:

Name
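
The description of random() above can be approximated with the standard library alone. This is a hedged sketch of the idea, not pikepdf's code: random_name_sketch is a hypothetical helper, and it returns a plain str beginning with "/" rather than a pikepdf.Name.

```python
import secrets


def random_name_sketch(len_: int = 16, prefix: str = "") -> str:
    """Return a '/'-prefixed name built from len_ random bytes."""
    # token_urlsafe(n) encodes n random bytes as URL-safe base64 text.
    return "/" + prefix + secrets.token_urlsafe(len_)


name = random_name_sketch(prefix="Im")
print(name)       # differs on every call
print(len(name))  # 25: '/', 'Im', and 22 characters encoding 16 bytes
```

The 22 characters come from base64-encoding 16 bytes with the padding removed, which is why the string length is not the same as len_.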

class pikepdf.NamePath

Path for accessing nested Dictionary/Stream values.

NamePath provides ergonomic access to deeply nested PDF structures with a single access operation and helpful error messages when keys are not found.

Usage examples:

# Shorthand syntax - most common
obj[NamePath.Resources.Font.F1]

# With array indices
obj[NamePath.Pages.Kids[0].MediaBox]

# Chained access - supports non Python-identifier names
NamePath['/A']['/B'].C[0]  # equivalent to NamePath.A.B.C[0]

# Alternate syntax to support lists
obj[NamePath(Name.Resources, Name.Font)]

# Using string objects
obj[NamePath('/Resources', '/Weird-Name')]

# Empty path returns the object itself
obj[NamePath()]

# Setting nested values (all parents must exist)
obj[NamePath.Root.Info.Title] = pikepdf.String("Test")

# With default value
obj.get(NamePath.Root.Metadata, None)

When a key is not found, the KeyError message identifies the exact failure point, e.g.: “Key /C not found; traversed NamePath.A.B”

Added in version 10.1.

class pikepdf.String

Construct a PDF String object.

__new__(s)

Parameters:: s (str | bytes)
Return type:: String

class pikepdf.Array

Construct a PDF Array object.

__new__(a=None)

Parameters:: a (collections.abc.Iterable | Rectangle | Matrix | None)
Return type:: Array

class pikepdf.Dictionary

Construct a PDF Dictionary object.

__new__(d=None, **kwargs)

Parameters:

d (collections.abc.Mapping | None)
kwargs (Any)

Return type:

Dictionary

class pikepdf.Stream

Construct a PDF Stream object.

__new__(owner, data=None, d=None, **kwargs)

Parameters:

owner (Pdf)
data (bytes | None)
d (Any)
kwargs (Any)

Return type:

Stream

class pikepdf.Operator

Construct an operator for use in a content stream.

An Operator is one of a limited set of commands that can appear in PDF content streams (roughly the mini-language that draws objects, lines and text on a virtual PDF canvas). The commands parse_content_stream() and unparse_content_stream() create and expect Operators respectively, along with their operands.

pikepdf uses the special Operator “INLINE IMAGE” to denote an inline image in a content stream.

__new__(name)

Parameters:: name (str)
Return type:: Operator

Common PDF data structures

class pikepdf.Matrix

A 2D affine matrix for PDF transformations.

PDF uses matrices to transform document coordinates to screen/device coordinates.

PDF matrices are encoded as pikepdf.Array with exactly six numeric elements, ordered as a b c d e f.

\[\begin{split}\begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \\ \end{bmatrix}\end{split}\]

The approximate interpretation of these six parameters is documented below. The values (0, 0, 1) in the third column are fixed, so a general 3×3 matrix cannot be converted to a PDF matrix.

PDF transformation matrices are the transpose of most textbook treatments. In a textbook, typically A × vc is used to transform a column vector vc=(x, y, 1) by the affine matrix A. In PDF, the matrix is the transpose of that in the textbook, and vr × A' is used to transform a row vector vr=(x, y, 1).

Transformation matrices specify the transformation from the new (transformed) coordinate system to the original (untransformed) coordinate system. x’ and y’ are the coordinates in the untransformed coordinate system, and x and y are the coordinates in the transformed coordinate system.

PDF order:

\[\begin{bmatrix} x' & y' & 1 \end{bmatrix} = \begin{bmatrix} x & y & 1 \end{bmatrix} \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix}\]
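
Multiplying out the row-vector product above gives x' = a·x + c·y + e and y' = b·x + d·y + f. The following plain-Python sketch spells out that arithmetic using a 6-tuple in a b c d e f order; transform_point is a hypothetical helper written for illustration and does not use pikepdf.

```python
def transform_point(m, point):
    """Apply the PDF matrix (a, b, c, d, e, f) to the row vector (x, y, 1)."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Scale by 2 in both directions, then shift by (10, 20).
scale_and_shift = (2, 0, 0, 2, 10, 20)
print(transform_point(scale_and_shift, (1, 1)))     # (12, 22)

# The identity matrix leaves a point unchanged.
print(transform_point((1, 0, 0, 1, 0, 0), (3, 4)))  # (3, 4)
```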

To concatenate transformations, use the matrix multiplication (@) operator to pre-multiply the next transformation onto existing transformations.

Alternatively, use the .translated(), .scaled(), and .rotated() methods to chain transformation operations.

Addition and other operations are not implemented because they’re not that meaningful in a PDF context.

Matrix objects are immutable. All transformation methods return new matrix objects.

Added in version 8.7.

__array__(dtype=None, copy=True)

Convert this matrix to a NumPy array of type dtype.

If copy is True, a copy is made. If copy is False, an exception is raised.

If numpy is not installed, this will throw an exception.

Parameters:

dtype (Any)
copy (bool | None)

Return type:

numpy.ndarray

__init__()

__matmul__(other, /)

Return the matrix product of two matrices.

Can be used to concatenate transformations. Transformations should be composed by pre-multiplying matrices. For example, to apply a scaling transform, one could do:

scale = pikepdf.Matrix(2, 0, 0, 2, 0, 0)
scaled = scale @ matrix

Parameters:: other (Matrix)
Return type:: Matrix
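
To show why the order of pre-multiplication matters, here is a plain-Python sketch of the product of two matrices written as 6-tuples in a b c d e f order. It mirrors the 3×3 row-vector convention described for this class but is not pikepdf's implementation; matmul_sketch is a hypothetical name.

```python
def matmul_sketch(m1, m2):
    """Product m1 @ m2 of two PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


scale = (2, 0, 0, 2, 0, 0)
translate = (1, 0, 0, 1, 10, 20)

# scale @ translate: points are scaled first, then translated.
print(matmul_sketch(scale, translate))  # (2, 0, 0, 2, 10, 20)

# translate @ scale: points are translated first, so the offset is scaled too.
print(matmul_sketch(translate, scale))  # (2, 0, 0, 2, 20, 40)
```

With row vectors, the left-hand operand is the transformation applied to a point first, which is why the two products differ.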

property a: float

a is the horizontal scaling factor.

Return type:: float

as_array()

Convert this matrix to a pikepdf.Array.

A Matrix cannot be inserted into a PDF directly. Use this function to convert a Matrix to a pikepdf.Array, which can be inserted.

Return type:: Array

property b: float

b is horizontal skewing.

Return type:: float

property c: float

c is vertical skewing.

Return type:: float

property d: float

d is the vertical scaling factor.

Return type:: float

property e: float

e is the horizontal translation.

Return type:: float

encode()

Encode matrix to bytes suitable for including in a PDF content stream.

Return type:: bytes

property f: float

f is the vertical translation.

Return type:: float

classmethod identity()

Construct an identity matrix.

More explicit than the constructor.

Added in version 9.7.0.

Return type:: Matrix

inverse()

Return the inverse of the matrix.

The inverse matrix reverses the transformation of the original matrix.

In rare situations, the inverse may not exist. In that case, an exception is thrown. The PDF will likely have rendering problems.

Return type:: Matrix

rotated(angle_degrees_ccw)

Return a rotated copy of this matrix.

Calculates Matrix(cos(angle), sin(angle), -sin(angle), cos(angle), 0, 0) @ self.

Parameters:: angle_degrees_ccw – angle in degrees counterclockwise
Return type:: Matrix
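
The rotation matrix named in the formula above can be checked numerically with the standard library. This sketch only builds that matrix from an angle in degrees and applies it to a point; it does not reproduce the pre-multiplication onto self, and both helper names are hypothetical.

```python
import math


def rotation_matrix(angle_degrees_ccw):
    """Return (cos, sin, -sin, cos, 0, 0) for a counterclockwise angle."""
    t = math.radians(angle_degrees_ccw)
    return (math.cos(t), math.sin(t), -math.sin(t), math.cos(t), 0.0, 0.0)


def transform_point(m, point):
    """Apply the PDF matrix (a, b, c, d, e, f) to the row vector (x, y, 1)."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Rotating the point (1, 0) by 90 degrees counterclockwise lands on (0, 1),
# up to floating-point rounding.
x, y = transform_point(rotation_matrix(90), (1, 0))
print(round(x, 9), round(y, 9))  # 0.0 1.0
```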

scaled(sx, sy)

Return a scaled copy of this matrix.

Calculates Matrix(sx, 0, 0, sy, 0, 0) @ self.

Parameters:

sx – horizontal scaling
sy – vertical scaling

Return type:

Matrix

property shorthand: tuple[float, float, float, float, float, float]

Return the 6-tuple (a,b,c,d,e,f) that describes this matrix.

Return type:: tuple[float, float, float, float, float, float]

transform(point: tuple[float, float]) → tuple[float, float]

translated(tx, ty)

Return a translated copy of this matrix.

Calculates Matrix(1, 0, 0, 1, tx, ty) @ self.

Parameters:

tx – horizontal translation
ty – vertical translation

Return type:

Matrix

class pikepdf.Rectangle(llx: float, lly: float, urx: float, ury: float, /)

A PDF rectangle.

Typically this will be a rectangle in PDF units (points, 1/72”). Unlike raster graphics, the rectangle is defined by the lower left and upper right points.

Rectangles in PDF are encoded as pikepdf.Array with exactly four numeric elements, ordered as llx lly urx ury. See the PDF reference manual, section 7.9.5.

The rectangle may be considered degenerate if the lower left corner is not strictly less than the upper right corner.

Added in version 2.14.

Changed in version 8.5: Added operators to test whether rectangle a is contained in rectangle b (a <= b) and to calculate their intersection (a & b).

__and__(other, /)

Return the bounding Rectangle of the common area of self and other.

Parameters:: other (Rectangle)
Return type:: Rectangle
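
The geometry behind the intersection (a & b) and containment (a <= b) operators can be sketched with plain (llx, lly, urx, ury) tuples. This is an illustration of the arithmetic for overlapping rectangles only, not pikepdf's code; intersect and is_contained are hypothetical helpers, and what the real operator returns for non-overlapping rectangles is not covered here.

```python
def intersect(r1, r2):
    """Common area of two overlapping (llx, lly, urx, ury) rectangles."""
    return (
        max(r1[0], r2[0]),
        max(r1[1], r2[1]),
        min(r1[2], r2[2]),
        min(r1[3], r2[3]),
    )


def is_contained(inner, outer):
    """True if inner lies entirely within outer."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


page = (0, 0, 612, 792)     # US Letter in points (1/72 inch)
box = (500, 700, 700, 900)  # partly off the page

common = intersect(page, box)
print(common)                                        # (500, 700, 612, 792)
print(is_contained(common, page))                    # True
print(is_contained(box, page))                       # False
print(common[2] - common[0], common[3] - common[1])  # 112 92
```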

__init__(llx: float, lly: float, urx: float, ury: float, /) → None: Construct a new rectangle.

as_array()

Returns this rectangle as a pikepdf.Array.

Return type:: Array

property height: float

The height of the rectangle.

Return type:: float

llx: float = Ellipsis: The lower left corner on the x-axis.

lly: float = Ellipsis: The lower left corner on the y-axis.

property lower_left: tuple[float, float]

A point for the lower left corner.

Return type:: tuple[float, float]

property lower_right: tuple[float, float]

A point for the lower right corner.

Return type:: tuple[float, float]

to_bbox()

Returns the origin-centred bounding box that encloses this rectangle.

Return type:: Rectangle

property upper_left: tuple[float, float]

A point for the upper left corner.

Return type:: tuple[float, float]

property upper_right: tuple[float, float]

A point for the upper right corner.

Return type:: tuple[float, float]

urx: float = Ellipsis: The upper right corner on the x-axis.

ury: float = Ellipsis: The upper right corner on the y-axis.

property width: float

The width of the rectangle.

Return type:: float

Content stream elements

class pikepdf.ContentStreamInstruction(operands: _ObjectList, operator: Operator, /)

Represents one complete instruction inside a content stream.

property operands: _ObjectList

Return type:: _ObjectList

property operator: Operator

Return type:: Operator

class pikepdf.ContentStreamInlineImage

Represents an instruction to draw an inline image.

pikepdf consolidates the BI-ID-EI sequence of operators, as appears in a PDF to declare an inline image, and replaces them with a single virtual content stream instruction with the operator “INLINE IMAGE”.

property iimage: pikepdf.models.image.PdfInlineImage

Return type:: pikepdf.models.image.PdfInlineImage

property operands: _ObjectList

Return type:: _ObjectList

property operator: Operator

Return type:: Operator

Internal objects

These objects are returned by other pikepdf objects. They are part of the API, but not intended to be created explicitly.

class pikepdf._core.PageList

For accessing pages in a PDF.

A list-like object enumerating a range of pages in a pikepdf.Pdf. It may be all of the pages or a subset. Obtain using pikepdf.Pdf.pages.

See pikepdf.Page for accessing individual pages.

append(page, /)

Add another page to the end.

While this method copies pages from one document to another, it does not copy certain metadata such as annotations, form fields, bookmarks or structural tree elements. Copying these is a more complex, application specific operation.

Parameters:: page (Page)
Return type:: None

extend(other, /)

Extend the Pdf by adding pages from an iterable of pages.

While this method copies pages from one document to another, it does not copy certain metadata such as annotations, form fields, bookmarks or structural tree elements. Copying these is a more complex, application specific operation.

Parameters:: other (PageList | collections.abc.Iterable[Page])
Return type:: None

from_objgen(objgen: tuple[int, int]) → Page

Given an objgen (object ID, generation), return the page.

Raises an exception if no page matches.

index(page, /)

Given a page, find the index.

That is, returns n such that pdf.pages[n] == this_page. A ValueError exception is thrown if the page does not belong to this Pdf. The first page has index 0.

Parameters:: page (Page)
Return type:: int

insert(index, obj, /)

Insert a page at the specified location.

Parameters:

index (int) – location at which to insert page, 0-based indexing
obj (Page) – page object to insert

Return type:

None

p(pnum, /)

Look up page number in ordinal numbering, where 1 is the first page.

This is provided for convenience in situations where ordinal numbering is more natural. It is equivalent to .pages[pnum - 1]. .p(0) is an error and negative indexing is not supported.

If the PDF defines custom page labels (such as labeling front matter with Roman numerals and the main body with Arabic numerals), this function does not account for that. Use pikepdf.Page.label to get the page label for a page.

Parameters:: pnum (int)
Return type:: Page
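
The relationship between ordinal page numbers and zero-based indexing can be sketched with an ordinary list. This is not pikepdf's implementation: p_sketch is a hypothetical helper, the list of strings stands in for a page list, and raising IndexError for numbers below 1 is an assumption made for the sketch.

```python
def p_sketch(pages, pnum):
    """Look up a page by ordinal number, where 1 is the first page."""
    if pnum < 1:
        # Assumed error type; 0 and negative numbers are rejected.
        raise IndexError("page numbers start at 1")
    return pages[pnum - 1]


pages = ["cover", "contents", "chapter 1"]
print(p_sketch(pages, 1))  # cover
print(p_sketch(pages, 3))  # chapter 1
```

Rejecting 0 and negative numbers avoids silently wrapping around to the end of the list, which is what pages[pnum - 1] alone would do for pnum = 0.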

remove(page=None, *, p)

Remove a page.

Parameters:

page (Page | None) – If page is not None, remove that page.
p (int) – 1-based page number to remove, if page is None.

Return type:

None

reverse()

Reverse the order of pages.

Return type:: None

class pikepdf._core._ObjectList

A list whose elements are always pikepdf.Object.

In all other respects, this object behaves like a standard Python list.

append(x, /)

Parameters:: x (Object)
Return type:: None

clear()

Return type:: None

count(x, /)

Parameters:: x (Object)
Return type:: int

extend(L: _ObjectList, /) → None

insert(i, x, /)

Parameters:

i (int)
x (Object)

Return type:

None

pop() → Object

remove(x, /)

Parameters:: x (Object)
Return type:: None

class pikepdf.ObjectType(*args, **kwds)

Enumeration of PDF object types.

These values are used to implement pikepdf’s instance type checking. In the vast majority of cases it is more pythonic to use isinstance(obj, pikepdf.Stream) or issubclass.

These values are low-level and documented for completeness. They are exposed through pikepdf.Object._type_code.

array: Ellipsis: A PDF array, meaning the object is a pikepdf.Array.

boolean: Ellipsis: A PDF boolean. In most cases, booleans are automatically converted to bool, so this should not appear.

dictionary: Ellipsis: A PDF dictionary, meaning the object is a pikepdf.Dictionary.

inlineimage: Ellipsis: A PDF inline image, meaning the object is the data stream of an inline image. It would be necessary to combine this with the implicit dictionary to interpret the image correctly. pikepdf automatically packages inline images into a more useful class, so this will not generally appear.

integer: Ellipsis: A PDF integer. In most cases, integers are automatically converted to int, so this should not appear. Unlike Python integers, PDF integers are 32-bit signed integers.

name_: Ellipsis: A PDF name, meaning the object is a pikepdf.Name.

null: Ellipsis: A PDF null. In most cases, nulls are automatically converted to None, so this should not appear.

operator: Ellipsis: A PDF operator, meaning the object is a pikepdf.Operator.

real: Ellipsis: A PDF real. In most cases, reals are automatically converted to decimal.Decimal.

reserved: Ellipsis: A temporary object used in creating circular references. Should not appear in most cases.

stream: Ellipsis: A PDF stream, meaning the object is a pikepdf.Stream (and it also has a dictionary).

string: Ellipsis: A PDF string, meaning the object is a pikepdf.String.

uninitialized: Ellipsis: An uninitialized object. If this appears, it is probably a bug.

Jobs

class pikepdf.Job(json: str)

Provides access to the qpdf job interface.

All of the functionality of the qpdf command line program is now available to pikepdf through jobs.

For further details:: https://qpdf.readthedocs.io/en/stable/qpdf-job.html

EXIT_CORRECT_PASSWORD: ClassVar[int] = 3

EXIT_ERROR: ClassVar[int] = 2: Exit code for a job that had an error.

EXIT_IS_NOT_ENCRYPTED: ClassVar[int] = 2: Exit code for a job that provided a password when the input was not encrypted.

EXIT_WARNING: ClassVar[int] = 3: Exit code for a job that had a warning.

LATEST_JOB_JSON: ClassVar[int]: Version number of the most recent job-JSON schema.

LATEST_JSON: ClassVar[int]: Version number of the most recent qpdf-JSON schema.

__init__(json: str) → None

Create a Job from command line arguments to the qpdf program.

The first item in the args list should be equal to progname, whose default is "pikepdf".

Example

job = Job(['pikepdf', '--check', 'input.pdf'])
job.run()

check_configuration()

Checks if the configuration is valid; raises an exception if not.

Return type:: None

create_pdf(): Executes the first stage of the job.

property creates_output: bool

Returns True if the Job will create some sort of output file.

Return type:: bool

property encryption_status: dict[str, bool]

Returns a Python dictionary describing the encryption status.

Return type:: dict[str, bool]

property exit_code: int

After run(), returns an integer exit code.

The meaning of exit code depends on the details of the Job that was run. Details are subject to change in libqpdf. Use properties has_warnings and encryption_status instead.

Return type:: int

property has_warnings: bool

After run(), returns True if there were warnings.

Return type:: bool

static job_json_schema(*, schema)

For reference, the qpdf job command line schema is built-in.

Parameters:: schema (int)
Return type:: str

static json_out_schema(*, schema)

For reference, the qpdf JSON output schema is built-in.

Parameters:: schema (int)
Return type:: str

property message_prefix: str

Allows manipulation of the prefix in front of all output messages.

Return type:: str

run()

Executes the job.

Return type:: None

write_pdf(pdf)

Executes the second stage of the job.

Parameters:: pdf (Pdf)

class pikepdf.JobBuilder

Fluently assemble a qpdf job and run it.

Each method records part of the job specification and returns self so calls can be chained. List-valued sections (pages, attachments, overlay/underlay) use repeatable add_* methods. Terminal methods build(), run() and create_pdf() hand the assembled specification to pikepdf.Job.

The builder performs minimal local validation; qpdf is the source of truth and will raise pikepdf.JobUsageError (or RuntimeError for malformed JSON) for invalid configurations when the job is built.

add_attachment(file, *, key=None, filename=None, mimetype=None, description=None, creationdate=None, moddate=None, replace=False)

Attach (embed) a file in the output.

Parameters:

file (Any) – Path to the file to attach.
key (str | None) – Attachment key; defaults to the filename.
filename (str | None) – Displayed filename of the attachment.
mimetype (str | None) – MIME type, e.g. 'application/pdf'.
description (str | None) – Human-readable description.
creationdate (str | None) – Creation date (PDF date string).
moddate (str | None) – Modification date (PDF date string).
replace (bool) – Replace an existing attachment with the same key.

Return type:

JobBuilder

add_overlay(file, *, to=None, from_=None, repeat=None, password=None)

Overlay pages from another file on top of the output pages.

Parameters:

file (Any) – Source PDF for the overlay.
to (str | None) – Destination page range in the output.
from_ (str | None) – Source page range in file.
repeat (str | None) – Source pages to repeat across remaining destination pages.
password (str | None) – Password for an encrypted source file.

Return type:

JobBuilder

add_pages(file, page_range=None, *, password=None)

Append pages from a file to the page-selection (merge/split) operation.

Parameters:

file (Any) – Source PDF. Use '.' to refer to the primary input file.
page_range (str | None) – qpdf page range, e.g. '1-5' or 'z-1' (reversed). Omit to use all pages.
password (str | None) – Password for an encrypted source file.

Return type:

JobBuilder

add_underlay(file, *, to=None, from_=None, repeat=None, password=None)

Underlay pages from another file beneath the output pages.

See add_overlay() for argument descriptions.

Parameters:

file (Any)
to (str | None)
from_ (str | None)
repeat (str | None)
password (str | None)

Return type:

JobBuilder

allow_weak_crypto()

Permit writing files with weak (RC4) cryptography.

Return type:: JobBuilder

build()

Construct a pikepdf.Job from the specification.

The job is validated by qpdf during construction but not executed.

Raises:

pikepdf.JobUsageError – If the configuration is semantically invalid.
RuntimeError – If the JSON is malformed or contains unknown keys.

Return type:

pikepdf._core.Job

check(*, linearization=False)

Check the PDF for problems, reporting via the job’s output.

Parameters:: linearization (bool) – Also check the linearization (hint) tables.
Return type:: JobBuilder

coalesce_contents()

Combine a page’s multiple content streams into one.

Return type:: JobBuilder

collate(n=None)

Collate rather than concatenate the page selection.

Parameters:: n (int | None) – Number of pages to take from each file per round.
Return type:: JobBuilder

compress(*, compress_streams=None, object_streams=None, recompress_flate=False, compression_level=None, decode_level=None, stream_data=None)

Configure stream and object-stream compression.

Parameters:

compress_streams (bool | None) – Compress uncompressed streams.
object_streams (Literal['generate', 'preserve', 'disable'] | None) – Control use of object streams.
recompress_flate (bool) – Uncompress and recompress flate streams.
compression_level (int | None) – Flate compression level (1-9).
decode_level (Literal['none', 'generalized', 'specialized', 'all'] | None) – Which streams to uncompress before recompressing.
stream_data (Literal['compress', 'preserve', 'uncompress'] | None) – Legacy combined stream compression control.

Return type:

JobBuilder

copy_attachments_from(file, *, prefix=None, password=None)

Copy all attachments from another PDF.

Parameters:

file (Any) – Source PDF to copy attachments from.
prefix (str | None) – Prefix to disambiguate keys that collide with existing ones.
password (str | None) – Password for an encrypted source file.

Return type:

JobBuilder

create_pdf()

Build the job and run only its first stage, returning a pikepdf.Pdf.

This is the staged workflow: the returned PDF can be modified before calling pikepdf.Job.write_pdf() on the same job. Because this method returns only the PDF, use build() instead if you need to keep a reference to the job for the write stage.

Return type:: pikepdf._core.Pdf

decrypt()

Remove encryption from the input file.

Return type:: JobBuilder

deterministic_id()

Generate the document ID deterministically from the output contents.

Return type:: JobBuilder

empty()

Use an empty PDF as input instead of an input file.

Return type:: JobBuilder

encrypt(encryption=None, *, owner_password=None, user_password=None, bits=256, allow=None, metadata=True, aes=None, force_v4=False, allow_insecure=False, force_r5=False)

Encrypt the output.

Either pass a pikepdf.Encryption object positionally, or use the keyword arguments. The two forms are mutually exclusive.

Parameters:

encryption (pikepdf.models.encryption.Encryption | None) – A pikepdf.Encryption describing passwords, permissions and algorithm. If given, the keyword arguments must not be used.
owner_password (str | None) – Owner password (full access).
user_password (str | None) – User password (restricted access).
bits (Literal[40, 128, 256]) – Key length: 40, 128 or 256 (default). 40 and 128-bit RC4 are weak and require allow_weak_crypto().
allow (pikepdf.models.encryption.Permissions | None) – A pikepdf.Permissions describing what the user password is permitted to do. Permissions default to pikepdf.models.encryption.DEFAULT_PERMISSIONS.
metadata (bool) – If False, document metadata is left unencrypted (128/256-bit only).
aes (bool | None) – For 128-bit, request AES rather than RC4.
force_v4 (bool) – For 128-bit, force V=4 in the encryption dictionary.
allow_insecure (bool) – For 256-bit, allow an insecure empty owner password.
force_r5 (bool) – For 256-bit, use the deprecated R=5 algorithm.

Return type:

JobBuilder

externalize_inline_images(*, min_bytes=None)

Convert inline images to regular (external) images.

Parameters:: min_bytes (int | None) – Only externalize inline images at least this large.
Return type:: JobBuilder

flatten_annotations(mode='all')

Push annotations into page content streams.

Parameters:: mode (Literal['all', 'print', 'screen']) – Which annotations to flatten: 'all' (default), 'print' (only those marked for printing) or 'screen' (exclude those marked hidden on screen).
Return type:: JobBuilder

flatten_rotation()

Bake each page’s /Rotate value into its content stream.

Return type:: JobBuilder

force_version(version)

Force the output PDF version, even if features require a higher one.

Parameters:: version (str) – A PDF version such as '1.7', optionally with an extension level as '1.7:8'.
Return type:: JobBuilder

generate_appearances()

Generate appearance streams for form fields that lack them.

Return type:: JobBuilder

input(file, *, password=None)

Set the input file, optionally with a password.

Parameters:

file (Any) – Path to the input PDF.
password (str | None) – Password for an encrypted input file.

Return type:

JobBuilder

limits(*, no_default_limits=None, parser_max_container_size=None, parser_max_container_size_damaged=None, parser_max_errors=None, parser_max_nesting=None, max_stream_filters=None)

Set global parser limits (useful for hardening against malicious PDFs).

Parameters:

no_default_limits (bool | None) – Disable qpdf’s optional default limits.
parser_max_container_size (int | None) – Maximum container size while parsing.
parser_max_container_size_damaged (int | None) – Maximum container size while parsing damaged files.
parser_max_errors (int | None) – Maximum number of errors before giving up.
parser_max_nesting (int | None) – Maximum object nesting depth.
max_stream_filters (int | None) – Maximum number of filters when filtering a stream.

Return type:

JobBuilder

linearize()

Linearize (web-optimize) the output.

Return type:: JobBuilder

min_version(version)

Set the minimum PDF version of the output.

Parameters:: version (str) – A PDF version such as '1.7', optionally with an extension level as '1.7:8'.
Return type:: JobBuilder

normalize_content()

Normalize newlines in content streams (for readability/inspection).

Return type:: JobBuilder

optimize_images(*, min_area=None, min_width=None, min_height=None, keep_inline_images=False, jpeg_quality=None)

Recompress images using more efficient compression where possible.

Parameters:

min_area (int | None) – Skip images smaller than this many pixels in area.
min_width (int | None) – Skip images narrower than this many pixels.
min_height (int | None) – Skip images shorter than this many pixels.
keep_inline_images (bool) – Exclude inline images from optimization (they are included by default).
jpeg_quality (int | None) – JPEG quality level to use when recompressing.

Return type:

JobBuilder

output(file)

Set the output file.

Parameters:: file (Any) – Path to write. For split_pages(), include a %d placeholder for the page group number.
Return type:: JobBuilder

qdf()

Produce QDF output suitable for inspection in a text editor.

Return type:: JobBuilder

remove_acroform()

Remove the interactive form (AcroForm) dictionary.

Return type:: JobBuilder

remove_attachment(key)

Remove an embedded file by key.

Parameters:: key (str) – Attachment key to remove.
Return type:: JobBuilder

remove_info()

Remove the document information dictionary.

Return type:: JobBuilder

remove_metadata()

Remove the document’s XMP metadata stream.

Return type:: JobBuilder

remove_page_labels()

Remove explicit page labels (page numbering).

Return type:: JobBuilder

remove_restrictions()

Remove security restrictions associated with digitally signed files.

Return type:: JobBuilder

remove_structure()

Remove the document’s structure (tagging) tree.

Return type:: JobBuilder

replace_input()

Overwrite the input file with the output (in place).

Return type:: JobBuilder

rotate(angle, page_range=None)

Rotate pages.

Parameters:

angle (int | str) – Rotation in degrees: 90, 180, 270; prefix with + or - to rotate relative to the current rotation.
page_range (str | None) – Page range to rotate. Omit to rotate all pages.

Return type:

JobBuilder
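
The absolute versus relative forms of angle can be sketched as follows. This is a model of the rule described above, not pikepdf or qpdf code: the helper name resulting_rotation is hypothetical, and treating a negative integer the same way as a string with a leading minus sign is an assumption.

```python
def resulting_rotation(current, angle):
    """Return the page rotation after applying an angle spec.

    A leading '+' or '-' makes the angle relative to `current`;
    otherwise the angle replaces the current rotation.
    Assumption: ints are interpreted via their string form.
    """
    spec = str(angle)
    if spec[0] in "+-":
        new = current + int(spec)  # int() accepts a leading sign
    else:
        new = int(spec)
    return new % 360


print(resulting_rotation(90, "+90"))
print(resulting_rotation(90, 180))
```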

run(*, validate=True)

Build and run the job.

Parameters:: validate (bool) – If True (default), call check_configuration() before running to fail fast on invalid configurations.
Returns:: The pikepdf.Job after running, so callers can inspect exit_code, has_warnings and encryption_status.
Return type:: pikepdf._core.Job
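
Because the configuration methods return the builder, a job can be written as one chain ending in run(). The sketch below uses only methods documented in this section; the helper name linearize_copy is hypothetical, and the builder is passed in as an argument because how a JobBuilder is constructed is not covered here.

```python
def linearize_copy(builder, src, dst):
    """Configure and run a job that writes a linearized copy of src to dst.

    `builder` is expected to behave like a JobBuilder: input(), output()
    and linearize() return the builder, and run() returns the finished job.
    """
    job = (
        builder
        .input(src)    # set the input file
        .output(dst)   # set the output file
        .linearize()   # web-optimize the output
        .run()         # validate=True by default
    )
    return job
```

The returned job can then be inspected, for example through exit_code or has_warnings.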

set(**kwargs)

Set arbitrary scalar top-level options not covered by other methods.

Keyword names are snake_case Python aliases for qpdf’s camelCase job keys (e.g. no_warn -> noWarn). A boolean True enables a flag (emitted as an empty string); other values are stringified.

Parameters:: **kwargs (Any) – Option names and values.
Raises:: ValueError – If a keyword does not correspond to a known option.
Return type:: JobBuilder
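
The naming and value rules can be illustrated with a small stand-alone sketch. It is not pikepdf's implementation: both function names are hypothetical, some_count is a made-up option used only to show stringification, and the handling of False is not described above, so the sketch does not model it.

```python
def snake_to_camel(name):
    """Convert a snake_case keyword to a camelCase job key."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def options_to_job_keys(**kwargs):
    """Map keyword options to job keys and string values.

    True becomes an empty string (a bare flag); other values are
    stringified. False is not modelled in this sketch.
    """
    return {
        snake_to_camel(key): "" if value is True else str(value)
        for key, value in kwargs.items()
    }


# some_count is a hypothetical option name, not a real qpdf key.
print(options_to_job_keys(no_warn=True, some_count=3))
```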

set_page_labels(labels)

Set page labels (explicit page numbering) for the whole document.

Parameters:: labels (list[str]) – A list of qpdf label specs of the form first-page:[type][/start[/prefix]], e.g. ['1:r', '5:D/1'] to number the first four pages with lower-case Roman numerals and restart with Arabic numerals from page 5. The first spec must start at page 1.
Return type:: JobBuilder
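
To show what a list of label specs means, the sketch below expands specs into the label each page would carry. It is a simplified model, not qpdf's parser: the function names are hypothetical, only the D, R and r types are handled, and a default start of 1 is assumed when the spec omits it.

```python
def to_roman(n):
    """Upper-case Roman numeral for a positive integer."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def parse_spec(spec):
    """Split 'first-page:[type][/start[/prefix]]' into its parts."""
    first_page, _, rest = spec.partition(":")
    parts = rest.split("/", 2)
    style = parts[0]
    # Assumption: the numbering starts at 1 when no start is given.
    start = int(parts[1]) if len(parts) > 1 and parts[1] else 1
    prefix = parts[2] if len(parts) > 2 else ""
    return int(first_page), style, start, prefix


def labels_for(specs, page_count):
    """Return the label of every page, 1-based, for the given specs."""
    ranges = sorted(parse_spec(s) for s in specs)
    labels = []
    for page in range(1, page_count + 1):
        # The applicable spec is the last one starting at or before this page.
        first, style, start, prefix = [r for r in ranges if r[0] <= page][-1]
        number = start + (page - first)
        if style == "D":
            text = str(number)
        elif style == "R":
            text = to_roman(number)
        elif style == "r":
            text = to_roman(number).lower()
        else:
            text = ""  # other types are not modelled here
        labels.append(prefix + text)
    return labels


print(labels_for(["1:r", "5:D/1"], 6))
```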

split_pages(group=None)

Write pages to separate files.

Parameters:: group (int | None) – Number of pages per output file. The output filename should contain a %d placeholder.
Return type:: JobBuilder
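
How group divides a document can be sketched as below. The helper name page_groups is hypothetical, and the sketch assumes that group=None means one page per output file; it models only the grouping, not how output filenames are generated.

```python
def page_groups(page_count, group=None):
    """Return (first, last) 1-based page ranges, one per output file.

    Assumption: group=None means one page per file.
    """
    size = group or 1
    return [
        (start, min(start + size - 1, page_count))
        for start in range(1, page_count + 1, size)
    ]


print(page_groups(10, group=3))
```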

static_id()

Use a fixed document ID (for testing; not for production output).

Return type:: JobBuilder

to_json()

Return a deep copy of the assembled job specification as a dict.

Return type:: dict[str, Any]

to_json_str()

Return the assembled job specification as a JSON string.

Return type:: str

qpdf JSON

class pikepdf.JSONStreamData(*args, **kwds)

How stream data is represented when writing a PDF as qpdf JSON.

Used by pikepdf.Pdf.write_qpdf_json().

file: Ellipsis: Stream data is written to external files. Each stream is written to a file named {file_prefix}-{object_number}, where file_prefix is the argument given to pikepdf.Pdf.write_qpdf_json().

inline: Ellipsis: Stream data is included inline in the JSON, base64-encoded.

none: Ellipsis: Stream data is omitted from the JSON output.

class pikepdf.XrefEntry

Represents one entry in a PDF’s cross-reference (xref) table.

Returned by pikepdf.Pdf.get_xref_table(). The meaning of the other properties depends on type.

property obj_stream_index: int | None

Index of the object within its object stream; None unless type == 2.

Return type:: int | None

property obj_stream_number: int | None

Object number of the containing object stream; None unless type == 2.

Return type:: int | None

property offset: int | None

Byte offset of the object in the file; None unless type == 1.

Return type:: int | None

property type: int

The entry type: 0 = free, 1 = uncompressed, 2 = compressed.

For type 1 (uncompressed), offset is meaningful. For type 2 (compressed, i.e. stored in an object stream), obj_stream_number and obj_stream_index are meaningful.

Return type:: int
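
Since the meaningful properties depend on type, code that reads an entry usually branches on it first. The sketch below is one way to do that; the helper name describe_entry is hypothetical, and it relies only on the four properties documented above, so it works with any object that exposes them.

```python
def describe_entry(entry):
    """Summarize an xref entry using only the properties valid for its type."""
    if entry.type == 0:
        return "free"
    if entry.type == 1:
        # Uncompressed: only offset is meaningful.
        return f"uncompressed at byte offset {entry.offset}"
    if entry.type == 2:
        # Compressed: stored inside an object stream.
        return (
            f"compressed: index {entry.obj_stream_index} "
            f"in object stream {entry.obj_stream_number}"
        )
    raise ValueError(f"unexpected xref entry type: {entry.type}")
```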