Main objects
- class pikepdf.Pdf
In-memory representation of a PDF
- property Root
The /Root object of the PDF.
- add_blank_page(*, page_size=(612.0, 792.0))
Add a blank page to this PDF.
If pages already exist, the page will be added to the end. Pages may be reordered using Pdf.pages.
The caller may add content to the page by modifying its objects after creating it.
- property allow: Permissions
Report permissions associated with this PDF.
By default these permissions will be replicated when the PDF is saved. Permissions can only be changed when a PDF is being saved, and are only available for encrypted PDFs. If a PDF is not encrypted, all operations are reported as allowed.
pikepdf has no way of enforcing permissions.
- property attachments
Returns a mapping that provides access to all files attached to this PDF.
PDF supports attaching (or embedding, if you prefer) any other type of file, including other PDFs. This property provides read and write access to these objects by filename.
- Returns
pikepdf.Attachments
- check()
Check if PDF is syntactically well-formed.
Similar to qpdf --check, this checks for syntax or structural problems in the PDF. This is mainly useful to PDF developers and may not be informative to the average user. PDFs with these problems still render correctly, if PDF viewers are capable of working around the issues they contain. In many cases, pikepdf can also fix the problems.
An example problem found by this function is an xref table that is missing an object reference. A page dictionary with the wrong type of key, such as a string instead of an array of integers for its mediabox, is not the sort of issue checked for. If this were an XML checker, it would tell you if the XML is well-formed, but could not tell you if the XML is valid XHTML or if it can be rendered as a usable web page.
This function also attempts to decompress all streams in the PDF. If no JBIG2 decoder is available and JBIG2 images are present, a warning will occur that JBIG2 cannot be checked.
This function returns a list of strings describing the issues. The text is subject to change and should not be treated as a stable API.
- check_linearization(self: pikepdf.Pdf, stream: object = sys.stderr) bool
Reports information on the PDF’s linearization.
- Parameters
stream – A stream to write this information to; must implement .write() and .flush() methods. Defaults to sys.stderr.
- Returns
True if the file is correctly linearized, and False if the file is linearized but the linearization data contains errors or was incorrectly generated.
- Raises
RuntimeError – If the PDF in question is not linearized at all.
- close()
Close a Pdf object and release resources acquired by pikepdf.
If pikepdf opened the file handle it will close it (e.g. when opened with a file path). If the caller opened the file for pikepdf, the caller must close the file. with blocks will call close() when they exit.
pikepdf lazily loads data from PDFs, so some pikepdf.Object instances may implicitly depend on the pikepdf.Pdf being open. This is always the case for pikepdf.Stream but can be true for any object. Do not close the Pdf object if you might still be accessing content from it.
When an Object is copied from one Pdf to another, the Object is copied into the destination Pdf immediately, so after accessing all desired information from the source Pdf it may be closed.
Changed in version 3.0: In pikepdf 2.x, this function actually worked by resetting to a very short empty PDF. Code that relied on this quirk may not function correctly.
- Return type
None
- copy_foreign(*args, **kwargs)
Overloaded function.
copy_foreign(self: pikepdf.Pdf, h: pikepdf.Object) -> pikepdf.Object
Copy an Object from a foreign Pdf and return a reference to the copy.
The object must be owned by a different Pdf from this one.
If the object has previously been copied, return a reference to the existing copy, even if that copy has been modified in the meantime.
If you want to copy a page from one PDF to another, use: pdf_b.pages[0] = pdf_a.pages[0]. That interface accounts for the complexity of copying pages.
This function is used to copy a pikepdf.Object that is owned by some other Pdf into this one. It performs a deep (recursive) copy and preserves all references that may exist in the foreign object. For example, if
>>> object_a = pdf.copy_foreign(object_x)
>>> object_b = pdf.copy_foreign(object_y)
>>> object_c = pdf.copy_foreign(object_z)
and object_z is a shared descendant of both object_x and object_y in the foreign PDF, then object_c is a shared descendant of both object_a and object_b in this PDF. If object_x and object_y refer to the same object, then object_a and object_b are the same object.
It also copies all pikepdf.Stream objects. Since this may copy a large amount of data, it is not done implicitly. This function does not copy references to pages in the foreign PDF - it stops at page boundaries. Thus, if you use copy_foreign() on a table of contents (/Outlines dictionary), you may have to update references to pages.
Direct objects, including dictionaries, do not need copy_foreign(). pikepdf will automatically convert and construct them.
- Note:
pikepdf automatically treats incoming pages from a foreign PDF as foreign objects, so Pdf.pages does not require this treatment.
- See also:
Changed in version 2.1: Error messages improved.
copy_foreign(self: pikepdf.Pdf, arg0: pikepdf.Page) -> pikepdf.Page
- property docinfo: Dictionary
Access the (deprecated) document information dictionary.
The document information dictionary is a brief metadata record that can store some information about the origin of a PDF. It is deprecated and removed in the PDF 2.0 specification (not deprecated from the perspective of pikepdf). Use the .open_metadata() API instead, which will edit the modern (and unfortunately, more complicated) XMP metadata object and synchronize changes to the document information dictionary.
This property simplifies access to the actual document information dictionary and ensures that it is created correctly if it needs to be created.
A new, empty dictionary will be created if this property is accessed and the dictionary does not exist. (This is to ensure that convenient code like pdf.docinfo[Name.Title] = "Title" will work when the dictionary does not exist at all.)
You can delete the document information dictionary by deleting this property, del pdf.docinfo. Note that accessing the property after deleting it will re-create it with a new, empty dictionary.
Changed in version 2.4: Added support for del pdf.docinfo.
- property encryption: EncryptionInfo
Report encryption information for this PDF.
Encryption settings may only be changed when a PDF is saved.
- property filename
The source filename of an existing PDF, when available.
- flatten_annotations(self: pikepdf.Pdf, mode: str = 'all') None
Flattens all PDF annotations into regular PDF content.
Annotations are markup such as review comments, highlights, and proofreading marks. User data entered into interactive form fields also counts as an annotation.
When annotations are flattened, they are “burned into” the regular content stream of the document and the fact that they were once annotations is deleted. This can be useful when preparing a document for printing, to ensure annotations are printed, or to finalize a form that should no longer be changed.
- Parameters
mode – One of the strings 'all', 'screen', 'print'. If omitted or set to empty, treated as 'all'. 'screen' flattens all except those marked with the PDF flag /NoView. 'print' flattens only those marked for printing.
New in version 2.11.
- generate_appearance_streams(self: pikepdf.Pdf) None
Generates appearance streams for AcroForm forms and form fields.
Appearance streams describe exactly how annotations and form fields should appear to the user. If omitted, the PDF viewer is free to render the annotations and form fields according to its own settings, as needed.
For every form field in the document, this generates appearance streams, subject to the limitations of QPDF’s ability to create appearance streams.
When invoked, this method will modify the Pdf in memory. It may be best to do this after the Pdf is opened, or before it is saved, because it may modify objects that the user does not expect to be modified.
If Pdf.Root.AcroForm.NeedAppearances is False or not present, no action is taken (because no appearance streams need to be generated). If True, the appearance streams are generated, and the NeedAppearances flag is set to False.
New in version 2.11.
- get_object(*args, **kwargs)
Overloaded function.
get_object(self: pikepdf.Pdf, objgen: Tuple[int, int]) -> pikepdf.Object
Look up an object by ID and generation number
- Return type:
pikepdf.Object
get_object(self: pikepdf.Pdf, objid: int, gen: int) -> pikepdf.Object
Look up an object by ID and generation number
- Return type:
pikepdf.Object
- get_warnings(self: pikepdf.Pdf) list
- property is_encrypted
Returns True if the PDF is encrypted.
For information about the nature of the encryption, see Pdf.encryption.
- property is_linearized
Returns True if the PDF is linearized.
Specifically returns True iff the file starts with a linearization parameter dictionary. Does no additional validation.
- make_indirect(*args, **kwargs)
Overloaded function.
make_indirect(self: pikepdf.Pdf, h: pikepdf.Object) -> pikepdf.Object
Attach an object to the Pdf as an indirect object.
Direct objects appear inline in the binary encoding of the PDF. Indirect objects appear inline as references (in English, “look up object 4 generation 0”) and are then read from another location in the file. The PDF specification requires that certain objects are indirect - consult the PDF specification to confirm.
Generally a resource that is shared should be attached as an indirect object. pikepdf.Stream objects are always indirect, and creating them will automatically attach them to the Pdf.
- See Also:
pikepdf.Object.is_indirect()
- Return type:
pikepdf.Object
make_indirect(self: pikepdf.Pdf, obj: object) -> pikepdf.Object
Encode a Python object and attach to this Pdf as an indirect object.
- Return type:
pikepdf.Object
- make_stream(data, d=None, **kwargs)
Create a new pikepdf.Stream object that is attached to this PDF.
- static new() pikepdf.Pdf
Create a new, empty PDF.
This is best when you are constructing a PDF from scratch.
In most cases, if you are working from an existing PDF, you should open the PDF using pikepdf.Pdf.open() and transform it, instead of creating a new one, to preserve metadata and structural information. For example, if you want to split a PDF into two parts, you should open the PDF and transform it into the desired parts, rather than creating a new PDF and copying pages into it.
- property objects
Return an iterable list of all objects in the PDF.
After deleting content from a PDF such as pages, objects related to that page, such as images on the page, may still be present.
- Return type:
pikepdf._ObjectList
- open(filename_or_stream, *, password='', hex_password=False, ignore_xref_streams=False, suppress_warnings=True, attempt_recovery=True, inherit_page_attributes=True, access_mode=<AccessMode.default: 0>, allow_overwriting_input=False)
Open an existing file at filename_or_stream.
If filename_or_stream is path-like, the file will be opened for reading. The file should not be modified by another process while it is open in pikepdf, or undefined behavior may occur. This is because the file may be lazily loaded. Despite this restriction, pikepdf does not try to use any OS services to obtain an exclusive lock on the file. Some applications may want to attempt this or copy the file to a temporary location before editing. This behaviour changes if allow_overwriting_input is set: the whole file is then read and copied to memory, so that pikepdf can overwrite it when calling .save().
When this function is called with a stream-like object, you must ensure that the data it returns cannot be modified, or undefined behavior will occur.
Any changes to the file must be persisted by using .save().
If filename_or_stream has .read() and .seek() methods, the file will be accessed as a readable binary stream. pikepdf will read the entire stream into a private buffer.
.open() may be used in a with-block; .close() will be called when the block exits, if applicable.
Whenever pikepdf opens a file, it will close it. If you open the file for pikepdf or give it a stream-like object to read from, you must release that object when appropriate.
Examples
>>> with Pdf.open("test.pdf") as pdf:
...     ...
>>> pdf = Pdf.open("test.pdf", password="rosebud")
- Parameters
filename_or_stream (pathlib.Path | str | BinaryIO) – Filename or Python readable and seekable file stream of PDF to open.
password (str | bytes) – User or owner password to open an encrypted PDF. If the type of this parameter is str it will be encoded as UTF-8. If the type is bytes it will be saved verbatim. Passwords are always padded or truncated to 32 bytes internally. Use ASCII passwords for maximum compatibility.
hex_password (bool) – If True, interpret the password as a hex-encoded version of the exact encryption key to use, without performing the normal key computation. Useful in forensics.
ignore_xref_streams (bool) – If True, ignore cross-reference streams. See qpdf documentation.
suppress_warnings (bool) – If True (default), warnings are not printed to stderr. Use pikepdf.Pdf.get_warnings() to retrieve warnings.
attempt_recovery (bool) – If True (default), attempt to recover from PDF parsing errors.
inherit_page_attributes (bool) – If True (default), push attributes set on a group of pages to individual pages.
access_mode (AccessMode) – If .default, pikepdf will decide how to access the file. Currently, it will always select stream access. To attempt memory mapping and fall back to stream access if memory mapping fails, use .mmap. Use .mmap_only to require memory mapping or fail (this is expected to only be useful for testing). Applications should be prepared to handle the SIGBUS signal on POSIX in the event that the file is successfully mapped but later goes away.
allow_overwriting_input (bool) – If True, allows calling .save() to overwrite the input file. This is performed by loading the entire input file into memory at open time; this will use more memory and may reduce performance, especially when the opened file will not be modified.
- Raises
pikepdf.PasswordError – If the password failed to open the file.
pikepdf.PdfError – If for other reasons we could not open the file.
TypeError – If the type of filename_or_stream is not usable.
FileNotFoundError – If the file was not found.
- Return type
pikepdf.Pdf
Note
When filename_or_stream is a stream and the stream is located on a network, pikepdf assumes that the stream uses buffering and read caches to achieve reasonable performance. Streams that fetch data over a network in response to every read or seek request, no matter how small, will perform poorly. It may be easier to download a PDF from the network to temporary local storage (such as io.BytesIO), manipulate it, and then re-upload it.
Changed in version 3.0: Keyword arguments now mandatory for everything except the first argument.
- open_metadata(set_pikepdf_as_editor=True, update_docinfo=True, strict=False)
Open the PDF’s XMP metadata for editing.
There is no .close() function on the metadata object, since this is intended to be used inside a with block only.
For historical reasons, certain parts of PDF metadata are stored in two different locations and formats. This feature coordinates edits so that both types of metadata are updated consistently and “atomically” (assuming single threaded access). It operates on the Pdf in memory, not any file on disk. To persist metadata changes, you must still use Pdf.save().
Example
>>> with pdf.open_metadata() as meta:
...     meta['dc:title'] = 'Set the Dublin Core Title'
...     meta['dc:description'] = 'Put the Abstract here'
- Parameters
set_pikepdf_as_editor (bool) – Automatically update the metadata pdf:Producer to show that this version of pikepdf is the most recent software to modify the metadata, and xmp:MetadataDate to timestamp the update. Recommended, except for testing.
update_docinfo (bool) – Update the standard fields of DocumentInfo (the old PDF metadata dictionary) to match the corresponding XMP fields. The mapping is described in PdfMetadata.DOCINFO_MAPPING. Nonstandard DocumentInfo fields and XMP metadata fields with no DocumentInfo equivalent are ignored.
strict (bool) – If False (the default), we aggressively attempt to recover from any parse errors in XMP, and if that fails we overwrite the XMP with an empty XMP record. If True, raise errors when the metadata bytes are not valid and well-formed XMP (and thus, XML). Some trivial cases that are equivalent to empty or incomplete “XMP skeletons” are never treated as errors, and always replaced with a proper empty XMP block. Certain errors may be logged.
- Return type
PdfMetadata
- open_outline(max_depth=15, strict=False)
Open the PDF outline (“bookmarks”) for editing.
Recommended for use in a with block. Changes are committed to the PDF when the block exits. (The Pdf must still be open.)
Example
>>> with pdf.open_outline() as outline:
...     outline.root.insert(0, OutlineItem('Intro', 0))
- Parameters
max_depth (int) – Maximum recursion depth of the outline to be imported and re-written to the document. 0 means only considering the root level, 1 the first-level sub-outline of each root element, and so on. Items beyond this depth will be silently ignored. Default is 15.
strict (bool) – With the default behavior (set to False), structural errors (e.g. reference loops) in the PDF document will only cancel processing further nodes on that particular level, recovering the valid parts of the document outline without raising an exception. When set to True, any such error will raise an OutlineStructureError, leaving the invalid parts in place. Similarly, outline objects that have been accidentally duplicated in the Outline container will be silently fixed (i.e. reproduced as new objects) or raise an OutlineStructureError.
- Return type
Outline
- property owner_password_matched
Returns True if the owner password matched when the Pdf was opened.
It is possible for both the user and owner passwords to match.
New in version 2.10.
- property pages
Returns the list of pages.
- Return type:
pikepdf.PageList
- property pdf_version
The version of the PDF specification used for this file, such as ‘1.7’.
- remove_unreferenced_resources(self: pikepdf.Pdf) None
Remove from /Resources of each page any object not referenced in the page’s contents.
PDF pages may share resource dictionaries with other pages. If pikepdf is used for page splitting, pages may reference resources in their /Resources dictionary that are not actually required. This purges all unnecessary resource entries.
For clarity, if all references to any type of object are removed, that object will be excluded from the output PDF on save. (Conversely, only objects that are discoverable from the PDF’s root object are included.) This function removes objects that are referenced from the page /Resources dictionary, but never called for in the content stream, making them unnecessary.
Suggested before saving, if content streams or /Resources dictionaries are edited.
- save(filename_or_stream=None, *, static_id=False, preserve_pdfa=True, min_version='', force_version='', fix_metadata_version=True, compress_streams=True, stream_decode_level=None, object_stream_mode=<ObjectStreamMode.preserve: 1>, normalize_content=False, linearize=False, qdf=False, progress=None, encryption=None, recompress_flate=False, deterministic_id=False)
Save all modifications to this pikepdf.Pdf.
- Parameters
filename_or_stream (pathlib.Path | str | BinaryIO | None) – Where to write the output. If a file exists in this location it will be overwritten. If the file was opened with allow_overwriting_input=True, then it is permitted to overwrite the original file, and this parameter may be omitted to implicitly use the original filename. Otherwise, the filename may not be the same as the input file, as overwriting the input file would corrupt data since pikepdf uses lazy loading.
static_id (bool) – Indicates that the /ID metadata, normally calculated as a hash of certain PDF contents and metadata including the current time, should instead be set to a static value. Only use this for debugging and testing. Use deterministic_id if you want to get the same /ID for the same document contents.
preserve_pdfa (bool) – Ensures that the file is generated in a manner compliant with PDF/A and other stricter variants. This should be True, the default, in most cases.
min_version (str | tuple[str, int]) – Sets the minimum version of PDF specification that should be required. If left alone QPDF will decide. If a tuple, the second element is an integer, the extension level. If the version number is not a valid format, QPDF will decide what to do.
force_version (str | tuple[str, int]) – Override the version recommended by QPDF, potentially creating an invalid file that does not display in old versions. See QPDF manual for details. If a tuple, the second element is an integer, the extension level.
fix_metadata_version (bool) – If True (default) and the XMP metadata contains the optional PDF version field, ensure the version in metadata is correct. If the XMP metadata does not contain a PDF version field, none will be added. To ensure that the field is added, edit the metadata and insert a placeholder value in pdf:PDFVersion. If XMP metadata does not exist, it will not be created regardless of the value of this argument.
object_stream_mode (ObjectStreamMode) – disable prevents the use of object streams. preserve keeps object streams from the input file. generate uses object streams wherever possible, creating the smallest files but requiring PDF 1.5+.
compress_streams (bool) – Enables or disables the compression of uncompressed stream objects. By default this is set to True, and the only reason to set it to False is for debugging or inspecting PDF contents.
When enabled, uncompressed stream objects will be compressed whether they were uncompressed in the PDF when it was opened, or when the user creates new pikepdf.Stream objects attached to the PDF. Stream objects can also be created indirectly, such as when content from another PDF is merged into the one being saved.
Only stream objects that have no compression will be compressed when this option is set. If the object is compressed, compression will be preserved.
Setting compress_streams=False does not trigger decompression unless decompression is specifically requested by setting both compress_streams=False and stream_decode_level to the desired decode level (e.g. .generalized will decompress most non-image content).
This option does not trigger recompression of existing compressed streams. For that, use recompress_flate.
The XMP metadata stream object, if present, is never compressed, to facilitate metadata reading by parsers that don’t understand the full structure of PDF.
stream_decode_level (pikepdf._core.StreamDecodeLevel | None) – Specifies how to encode stream objects. See documentation for pikepdf.StreamDecodeLevel.
recompress_flate (bool) – When disabled (the default), qpdf does not uncompress and recompress streams compressed with the Flate compression algorithm. If True, pikepdf will instruct qpdf to do this, which may be useful if recompressing streams to a higher compression level.
normalize_content (bool) – Enables parsing and reformatting the content stream within PDFs. This may make debugging PDFs easier.
linearize (bool) – Enables creating linear or “fast web view” files, where the file’s contents are organized sequentially so that a viewer can begin rendering before it has the whole file. As a drawback, it tends to make files larger.
qdf (bool) – Save output in QDF mode. QDF mode is a special output mode in QPDF to allow editing of PDFs in a text editor. Use the program fix-qdf to convert back to a standard PDF.
progress (Callable[[int], None]) – Specify a callback function that is called as the PDF is written. The function will be called with an integer between 0-100 as the sole parameter, the progress percentage. This function may not access or modify the PDF while it is being written, or data corruption will almost certainly occur.
encryption (pikepdf.models.encryption.Encryption | bool | None) – If False or omitted, existing encryption will be removed. If True, encryption settings are copied from the originating PDF. Alternately, an Encryption object may be provided that sets the parameters for new encryption.
deterministic_id (bool) – Indicates that the /ID metadata, normally calculated as a hash of certain PDF contents and metadata including the current time, should instead be computed using only deterministic data like the file contents. At a small runtime cost, this enables generation of the same /ID if the same inputs are converted in the same way multiple times. Does not work for encrypted files.
- Raises
- Return type
None
You may call .save() multiple times with different parameters to generate different versions of a file, and you may continue to modify the file after saving it. .save() does not modify the Pdf object in memory, except possibly by updating the XMP metadata version with fix_metadata_version.
Note
Calling pikepdf.Pdf.remove_unreferenced_resources() before saving may eliminate unnecessary resources from the output file if there are any objects (such as images) that are referenced in a page’s Resources dictionary but never called in the page’s content stream.
Note
pikepdf can read PDFs with incremental updates, but always coalesces any incremental updates into a single non-incremental PDF file when saving.
Note
If filename_or_stream is a stream and the process is interrupted during writing, the stream may be left in a corrupt state. It is the responsibility of the caller to manage the stream in this case.
Changed in version 2.7: Added recompress_flate.
Changed in version 3.0: Keyword arguments now mandatory for everything except the first argument.
Changed in version 8.1: If filename_or_stream is a filename and that file exists, the new file is written to a temporary file in the same directory and then moved into place. This prevents the existing destination file from being corrupted if the process is interrupted during writing; previously, corrupting the destination file was possible. If no file exists at the destination, output is written directly to the destination, but the destination will be deleted if errors occur during writing. Prior to 8.1, the file was always written directly to the destination, which could result in a corrupt destination file if the process was interrupted during writing.
- show_xref_table(self: pikepdf.Pdf) None
Pretty-print the Pdf’s xref (cross-reference table)
- property trailer
Provides access to the PDF trailer object.
See PDF 1.7 Reference Manual section 7.5.5. Generally speaking, the trailer should not be modified with pikepdf, and modifying it may not work. Some of the values in the trailer are automatically changed when a file is saved.
- property user_password_matched
Returns True if the user password matched when the Pdf was opened.
It is possible for both the user and owner passwords to match.
New in version 2.10.
- pikepdf.open()
Alias for pikepdf.Pdf.open().
- pikepdf.new()
Alias for pikepdf.Pdf.new().
- class pikepdf.ObjectStreamMode
Options for saving object streams within PDFs, which are a more compact way of saving certain types of data that was added in PDF 1.5. All modern PDF viewers support object streams, but some third party tools and libraries cannot read them.
- disable
Disable the use of object streams. If any object streams exist in the file, remove them when the file is saved.
- preserve
Preserve any existing object streams in the original file. This is the default behavior.
- generate
Generate object streams.
- class pikepdf.StreamDecodeLevel
Options for decoding streams within PDFs.
- none
Do not attempt to apply any filters. Streams remain as they appear in the original file. Note that uncompressed streams may still be compressed on output. You can disable that by saving with .save(..., compress_streams=False).
- generalized
This is the default. libqpdf will apply LZWDecode, ASCII85Decode, ASCIIHexDecode, and FlateDecode filters on the input. When saved with compress_streams=True, the default, the effect of this is that streams filtered with these older and less efficient filters will be recompressed with the Flate filter. As a special case, if a stream is already compressed with FlateDecode and compress_streams=True, the original compressed data will be preserved.
- specialized
In addition to uncompressing the generalized compression formats, supported non-lossy compression will also be decoded. At present, this includes the RunLengthDecode filter.
- all
In addition to generalized and non-lossy specialized filters, supported lossy compression filters will be applied. At present, this includes DCTDecode (JPEG) compression. Note that compressing the resulting data with DCTDecode again will accumulate loss, so avoid multiple compression and decompression cycles. This is mostly useful for (low-level) retrieving image data; see pikepdf.PdfImage for the preferred method.
- class pikepdf.Encryption(owner='', user='', R=6, allow=Permissions(accessibility=True, extract=True, modify_annotation=True, modify_assembly=False, modify_form=True, modify_other=True, print_lowres=True, print_highres=True), aes=True, metadata=True)
Specify the encryption settings to apply when a PDF is saved.
Object construction
- class pikepdf.Object
- append(self: pikepdf.Object, arg0: object) None
Append another object to an array; fails if the object is not an array.
- as_dict(self: pikepdf.Object) pikepdf._ObjectMapping
- as_list(self: pikepdf.Object) pikepdf._ObjectList
- emplace(other, retain=(pikepdf.Name('/Parent'),))
Copy all items from other without making a new object.
Particularly when working with pages, it may be desirable to remove all of the existing page’s contents and emplace (insert) a new page on top of it, in a way that preserves all links and references to the original page. (Or similarly, for other Dictionary objects in a PDF.)
Any Dictionary keys in the iterable retain are preserved. By default, /Parent is retained.
When a page is assigned (pdf.pages[0] = new_page), only the application knows if references to the original page are still valid. For example, a PDF optimizer might restructure a page object into another visually similar one, and references would be valid; but for a program that reorganizes page contents such as an N-up compositor, references may not be valid anymore.
This method takes precautions to ensure that child objects in common with self and other are not inadvertently deleted.
Example
>>> pdf.pages[0].objgen
(16, 0)
>>> pdf.pages[0].emplace(pdf.pages[1])
>>> pdf.pages[0].objgen
(16, 0)  # Same object
Changed in version 2.11.1: Added the retain argument.
- Parameters
other (Object) –
- extend(self: pikepdf.Object, arg0: Iterable) None
Extend a pikepdf.Array with an iterable of other objects.
- get(*args, **kwargs)
Overloaded function.
get(self: pikepdf.Object, key: str, default: object = None) -> object
For pikepdf.Dictionary or pikepdf.Stream objects, behave as dict.get(key, default=None)
get(self: pikepdf.Object, key: pikepdf.Object, default: object = None) -> object
For pikepdf.Dictionary or pikepdf.Stream objects, behave as dict.get(key, default=None)
- get_raw_stream_buffer(self: pikepdf.Object) pikepdf.Buffer
Return a buffer protocol buffer describing the raw, encoded stream
- get_stream_buffer(self: pikepdf.Object, decode_level: pikepdf.StreamDecodeLevel = <StreamDecodeLevel.generalized: 1>) pikepdf.Buffer
Return a buffer protocol buffer describing the decoded stream.
- is_owned_by(self: pikepdf.Object, possible_owner: pikepdf.Pdf) bool
Test if this object is owned by the indicated possible_owner.
- property is_rectangle
Returns True if the object is a rectangle (an array of 4 numbers)
- items(self: pikepdf.Object) Iterable
- keys(self: pikepdf.Object) Set[str]
For pikepdf.Dictionary or pikepdf.Stream objects, obtain the keys.
- property objgen
Return the object-generation number pair for this object.
If this is a direct object, then the returned value is (0, 0). By definition, if this is an indirect object, it has an “objgen”, and can be looked up using this in the cross-reference (xref) table. Direct objects cannot necessarily be looked up.
The generation number is usually 0, except for PDFs that have been incrementally updated. Incrementally updated PDFs are now uncommon, since it does not take too long for modern CPUs to reconstruct an entire PDF. pikepdf will consolidate all incremental updates when saving.
- static parse(stream: str, description: str = '') pikepdf.Object
Parse PDF binary representation into PDF objects.
- read_bytes(self: pikepdf.Object, decode_level: pikepdf.StreamDecodeLevel = <StreamDecodeLevel.generalized: 1>) bytes
Decode and read the content stream associated with this object.
- read_raw_bytes(self: pikepdf.Object) bytes
Read the content stream associated with this object without decoding
- same_owner_as(self: pikepdf.Object, arg0: pikepdf.Object) bool
Test if two objects are owned by the same pikepdf.Pdf.
- property stream_dict
Access the dictionary key-values for a pikepdf.Stream.
- to_json(self: pikepdf.Object, dereference: bool = False, schema_version: int = 2) bytes
Convert to a QPDF JSON representation of the object.
See the QPDF manual for a description of its JSON representation. https://qpdf.readthedocs.io/en/stable/json.html#qpdf-json-format
Not necessarily compatible with other PDF-JSON representations that exist in the wild.
Names are encoded as UTF-8 strings.
Indirect references are encoded as strings containing obj gen R.
Strings are encoded as UTF-8 strings with unrepresentable binary characters encoded as \uHHHH.
Encoding streams just encodes the stream’s dictionary; the stream data is not represented.
Object types that are only valid in content streams (inline image, operator) as well as “reserved” objects are not representable and will be serialized as null.
- Parameters
- Returns
JSON bytestring of object. The object is UTF-8 encoded and may be decoded to a Python str that represents the binary values \x00-\xFF as U+0000 to U+00FF; that is, it may contain mojibake.
Changed in version 6.0: Added schema_version.
- unparse(self: pikepdf.Object, resolved: bool = False) bytes
Convert PDF objects into their binary representation, optionally resolving indirect objects.
If you want to unparse content streams, which are a collection of objects that need special treatment, use pikepdf.unparse_content_stream() instead.
Returns bytes() that can be used with Object.parse() to reconstruct the pikepdf.Object. If reconstruction is not possible, a relative object reference is returned, such as 4 0 R.
- Parameters
resolved – If True, dereference indirect objects where possible.
- with_same_owner_as(self: pikepdf.Object, arg0: pikepdf.Object) pikepdf.Object
Returns an object that is owned by the same Pdf that owns the other object.
If the objects already have the same owner, this object is returned. If the other object has a different owner, then a copy is created that is owned by other’s owner. If this object is a direct object (no owner), then an indirect object is created that is owned by other. An exception is thrown if other is a direct object.
This method may be convenient when a reference to the Pdf is not available.
New in version 2.14.
- wrap_in_array(self: pikepdf.Object) pikepdf.Object
Return the object wrapped in an array if not already an array.
- write(data, *, filter=None, decode_parms=None, type_check=True)
Replace stream object’s data with new (possibly compressed) data.
filter and decode_parms describe any compression that is already present on the input data. For example, if your data is already compressed with the Deflate algorithm, you would set filter=Name.FlateDecode.
When writing the PDF in pikepdf.Pdf.save(), pikepdf may change the compression or apply compression to data that was not compressed, depending on the parameters given to that function. It will never change lossless to lossy encoding.
PNG and TIFF images, even if compressed, cannot be directly inserted into a PDF and displayed as images.
- Parameters
data (bytes) – the new data to use for replacement
filter (pikepdf.objects.Name | pikepdf.objects.Array | None) – The filter(s) with which the data is (already) encoded
decode_parms (pikepdf.objects.Dictionary | pikepdf.objects.Array | None) – Parameters for the filters with which the object is encoded
type_check (bool) – Check arguments; use False only if you want to intentionally create malformed PDFs.
If only one filter is specified, it may be a name such as Name('/FlateDecode'). If there are multiple filters, then an array of names should be given.
If there is only one filter, decode_parms is a Dictionary of parameters for that filter. If there are multiple filters, then decode_parms is an Array of Dictionary, where each array index corresponds to the filter.
- class pikepdf.Name(name)
Construct a PDF Name object.
Names can be constructed with two notations:
Name.Resources
Name('/Resources')
The two are semantically equivalent. The former is preferred for names that are normally expected to be in a PDF. The latter is preferred for dynamic names and attributes.
- static __new__(cls, name)
Construct a PDF Name.
- Parameters
name (str | pikepdf.objects.Name) –
- Return type
- class pikepdf.String(s)
Construct a PDF String object.
- class pikepdf.Array(a=None)
Construct a PDF Array object.
- class pikepdf.Dictionary(d=None, **kwargs)
Construct a PDF Dictionary object.
- Parameters
d (Mapping | None) –
- Return type
- static __new__(cls, d=None, **kwargs)
Construct a PDF Dictionary.
Works from either a Python dict or keyword arguments.
These two examples are equivalent:
pikepdf.Dictionary({'/NameOne': 1, '/NameTwo': 'Two'})
pikepdf.Dictionary(NameOne=1, NameTwo='Two')
In either case, the keys must be strings, and the strings correspond to the desired Names in the PDF Dictionary. The values must all be convertible to pikepdf.Object.
- Return type:
pikepdf.Dictionary
- class pikepdf.Stream(owner, data=None, d=None, **kwargs)
Construct a PDF Stream object.
- static __new__(cls, owner, data=None, d=None, **kwargs)
Create a new stream object.
A stream stores arbitrary binary data and may or may not be compressed. It also may or may not be a page or Form XObject’s content stream.
A stream dictionary is like a pikepdf.Dictionary or Python dict, except it has a binary payload of data attached. The dictionary describes how the data is compressed or encoded.
The dictionary may be initialized just like pikepdf.Dictionary is initialized, using a mapping object or keyword arguments.
- Parameters
owner (Pdf) – The Pdf to which this stream shall be attached.
data (bytes | None) – The data bytes for the stream.
d – An optional mapping object that will be used to construct the stream’s dictionary.
kwargs – Keyword arguments that will define the stream dictionary. Do not set /Length here as pikepdf will manage this value. Set /Filter if the data is already encoded in some format.
- Return type
Examples
- Using kwargs:
>>> s1 = pikepdf.Stream(pdf, b"uncompressed image data", BitsPerComponent=8, ColorSpace=Name.DeviceRGB, ...)
- Using dict:
>>> d = pikepdf.Dictionary(...)
>>> s2 = pikepdf.Stream(pdf, b"data", d)
Changed in version 2.2: Support creation of pikepdf.Stream from existing dictionary.
Changed in version 3.0: obj argument was removed; use data.
- class pikepdf.Operator(name)
Construct an operator for use in a content stream.
An Operator is one of a limited set of commands that can appear in PDF content streams (roughly the mini-language that draws objects, lines and text on a virtual PDF canvas). The commands parse_content_stream() and unparse_content_stream() create and expect Operators respectively, along with their operands.
pikepdf uses the special Operator “INLINE IMAGE” to denote an inline image in a content stream.
Common PDF data structures
- class pikepdf.Matrix
A 2D affine matrix for PDF transformations.
PDF uses matrices to transform document coordinates to screen/device coordinates.
PDF matrices are encoded as pikepdf.Array with exactly six numeric elements, ordered as a b c d e f.
\[\begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix}\]
The parameters mean approximately the following:
a is the horizontal scaling factor.
b is horizontal skewing.
c is vertical skewing.
d is the vertical scaling factor.
e is the horizontal translation.
f is the vertical translation.
The values (0, 0, 1) in the third column are fixed, so some general matrices cannot be converted to affine matrices.
PDF transformation matrices are the transpose of most textbook treatments. In a textbook, typically A × vc is used to transform a column vector vc=(x, y, 1) by the affine matrix A. In PDF, the matrix is the transpose of that in the textbook, and vr × A' is used to transform a row vector vr=(x, y, 1).
Transformation matrices specify the transformation from the new (transformed) coordinate system to the original (untransformed) coordinate system. x’ and y’ are the coordinates in the untransformed coordinate system, and x and y are the coordinates in the transformed coordinate system.
PDF order:
\[\begin{bmatrix} x' & y' & 1 \end{bmatrix} = \begin{bmatrix} x & y & 1 \end{bmatrix} \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix}\]
To concatenate transformations, use the matrix multiplication (@) operator to pre-multiply the next transformation onto existing transformations.
Alternatively, use the .translated(), .scaled(), and .rotated() methods to chain transformation operations.
Addition and other operations are not implemented because they’re not that meaningful in a PDF context.
Matrix objects are immutable. All transformation methods return new matrix objects.
New in version 8.7.
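The row-vector convention can be worked through in plain Python, without pikepdf. This is only an arithmetic sketch of the formulas above, not pikepdf's implementation; transform_point and multiply are hypothetical helpers operating on six-number tuples (a, b, c, d, e, f).

```python
def transform_point(m, point):
    """Compute [x y 1] × M for the six-number matrix m."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def multiply(m, n):
    """Return the 3x3 matrix product m × n in six-number shorthand."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


translate = (1, 0, 0, 1, 10, 20)
scale = (2, 0, 0, 2, 0, 0)

# With row vectors, the left-hand matrix acts on the point first.
translate_then_scale = multiply(translate, scale)
scale_then_translate = multiply(scale, translate)

p1 = transform_point(translate_then_scale, (1, 1))  # (22, 42)
p2 = transform_point(scale_then_translate, (1, 1))  # (12, 22)
```

The two results differ, which is why the order of concatenation matters.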
- __array__(self: pikepdf.Matrix) object
Convert this matrix to a NumPy array.
If numpy is not installed, this will throw an exception.
- __init__(*args, **kwargs)
Overloaded function.
__init__(self: pikepdf.Matrix) -> None
Construct an identity matrix.
__init__(self: pikepdf.Matrix, a: float, b: float, c: float, d: float, e: float, f: float) -> None
__init__(self: pikepdf.Matrix, other: pikepdf.Matrix) -> None
__init__(self: pikepdf.Matrix, h: pikepdf.Object) -> None
__init__(self: pikepdf.Matrix, arg0: pikepdf._ObjectList) -> None
__init__(self: pikepdf.Matrix, t6: tuple) -> None
- __matmul__(self: pikepdf.Matrix, other: pikepdf.Matrix) pikepdf.Matrix
Return the matrix product of two matrices.
Can be used to concatenate transformations. Transformations should be composed by pre-multiplying matrices.
- as_array(self: pikepdf.Matrix) pikepdf.Object
Convert this matrix to a pikepdf.Array.
A Matrix cannot be inserted into a PDF directly. Use this function to convert a Matrix to a pikepdf.Array, which can be inserted.
- encode(self: pikepdf.Matrix) bytes
Encode this matrix in bytes suitable for including in a PDF content stream.
- inverse(self: pikepdf.Matrix) pikepdf.Matrix
Return the inverse of the matrix.
The inverse matrix reverses the transformation of the original matrix.
In rare situations, the inverse may not exist. In that case, an exception is thrown. The PDF will likely have rendering problems.
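Why an inverse may not exist can be seen from the arithmetic. The sketch below is plain Python, independent of pikepdf's own implementation; invert and transform_point are hypothetical helpers on six-number tuples. The inverse fails exactly when the determinant a·d − b·c is zero.

```python
def transform_point(m, point):
    """Compute [x y 1] × M for the six-number matrix m."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def invert(m):
    """Invert an affine matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # The translation part of the inverse is -(e, f) × A⁻¹.
    ie = -(e * ia + f * ic)
    if_ = -(e * ib + f * id_)
    return (ia, ib, ic, id_, ie, if_)


m = (2, 0, 0, 4, 10, 20)
moved = transform_point(m, (3, 5))
restored = transform_point(invert(m), moved)

# Both rows of the 2x2 part are proportional, so the determinant is zero.
try:
    invert((1, 2, 2, 4, 0, 0))
    singular_detected = False
except ValueError:
    singular_detected = True
```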
- rotated(self: pikepdf.Matrix, angle_degrees_ccw: float) pikepdf.Matrix
Return a rotated copy of a matrix.
- Parameters
angle_degrees_ccw – angle in degrees counterclockwise
- scaled(self: pikepdf.Matrix, arg0: float, arg1: float) pikepdf.Matrix
Return a scaled copy of a matrix.
- property shorthand
Return the 6-tuple (a,b,c,d,e,f) that describes this matrix.
- transform(*args, **kwargs)
Overloaded function.
transform(self: pikepdf.Matrix, point: Tuple[float, float]) -> tuple
Transform a point by this matrix.
Computes [x y 1] @ self.
transform(self: pikepdf.Matrix, rect: pikepdf.Rectangle) -> pikepdf.Rectangle
Transform a rectangle by this matrix.
The new rectangle tightly bounds the polygon resulting from transforming the four corners.
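What “tightly bounds” means can be sketched in plain Python: transform all four corners, then take the minimum and maximum of each coordinate. This illustrates the idea only; transform_rect is a hypothetical helper, with rectangles as (llx, lly, urx, ury) tuples.

```python
def transform_rect(m, rect):
    """Return the axis-aligned bounding box of rect after applying m."""
    a, b, c, d, e, f = m
    llx, lly, urx, ury = rect
    corners = [(llx, lly), (urx, lly), (urx, ury), (llx, ury)]
    moved = [(a * x + c * y + e, b * x + d * y + f) for x, y in corners]
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    return (min(xs), min(ys), max(xs), max(ys))


# 90 degrees counterclockwise: (cos, sin, -sin, cos, 0, 0).
rot90 = (0, 1, -1, 0, 0, 0)
result = transform_rect(rot90, (0, 0, 200, 100))  # (-100, 0, 0, 200)
```

For rotations that are not multiples of 90 degrees, the bounding rectangle is larger than the rotated shape itself.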
- translated(self: pikepdf.Matrix, arg0: float, arg1: float) pikepdf.Matrix
Return a translated copy of a matrix.
- class pikepdf.Rectangle
A PDF rectangle.
Typically this will be a rectangle in PDF units (points, 1/72 inch). Unlike raster graphics, the rectangle is defined by the lower left and upper right points.
Rectangles in PDF are encoded as pikepdf.Array with exactly four numeric elements, ordered as llx lly urx ury. See PDF 1.7 Reference Manual section 7.9.5.
The rectangle may be considered degenerate if the lower left corner is not strictly less than the upper right corner.
New in version 2.14.
Changed in version 8.5: Added operators to test whether rectangle a is contained in rectangle b (a <= b) and to calculate their intersection (a & b).
- as_array(self: pikepdf.Rectangle) pikepdf.Object
Returns this rectangle as a pikepdf.Array.
- property height
The height of the rectangle.
- property llx
The lower left corner on the x-axis.
- property lly
The lower left corner on the y-axis.
- property lower_left
A point for the lower left corner.
- property lower_right
A point for the lower right corner.
- property upper_left
A point for the upper left corner.
- property upper_right
A point for the upper right corner.
- property urx
The upper right corner on the x-axis.
- property ury
The upper right corner on the y-axis.
- property width
The width of the rectangle.
Content stream elements
- class pikepdf.ContentStreamInstruction
Represents one complete instruction inside a content stream.
- property operands
The operands (parameters) supplied to the operator.
- property operator
The operator used in this instruction.
- class pikepdf.ContentStreamInlineImage
Represents an instruction to draw an inline image inside a content stream.
pikepdf consolidates the BI-ID-EI sequence of operators, as appears in a PDF to declare an inline image, and replaces them with a single virtual content stream instruction with the operator “INLINE IMAGE”.
- property iimage
Returns the inline image itself.
- property operands
Returns a list of operands, whose sole entry is the inline image.
- property operator
Always return the fictitious operator ‘INLINE IMAGE’.
Internal objects
These objects are returned by other pikepdf objects. They are part of the API, but not intended to be created explicitly.
- class pikepdf._core.PageList
A list-like object enumerating a range of pages in a pikepdf.Pdf. It may be all of the pages or a subset.
- append(*args, **kwargs)
Overloaded function.
append(self: pikepdf.PageList, page: pikepdf.Page) -> None
Add another page to the end.
While this method copies pages from one document to another, it does not copy certain metadata such as annotations, form fields, bookmarks or structural tree elements. Copying these is a more complex, application specific operation.
append(self: pikepdf.PageList, page: handle) -> None
Add another page to the end.
While this method copies pages from one document to another, it does not copy certain metadata such as annotations, form fields, bookmarks or structural tree elements. Copying these is a more complex, application specific operation.
- extend(*args, **kwargs)
Overloaded function.
extend(self: pikepdf.PageList, other: pikepdf.PageList) -> None
Extend the Pdf by adding pages from another Pdf.pages.
While this method copies pages from one document to another, it does not copy certain metadata such as annotations, form fields, bookmarks or structural tree elements. Copying these is a more complex, application specific operation.
extend(self: pikepdf.PageList, iterable: Iterable) -> None
Extend the Pdf by adding pages from an iterable of pages.
While this method copies pages from one document to another, it does not copy certain metadata such as annotations, form fields, bookmarks or structural tree elements. Copying these is a more complex, application specific operation.
- from_objgen(*args, **kwargs)
Overloaded function.
from_objgen(self: pikepdf.PageList, arg0: int, arg1: int) -> pikepdf.Page
Given an “objgen” (object ID, generation), return the page.
Raises an exception if no page matches.
from_objgen(self: pikepdf.PageList, arg0: Tuple[int, int]) -> pikepdf.Page
Given an “objgen” (object ID, generation), return the page.
Raises an exception if no page matches.
- index(*args, **kwargs)
Overloaded function.
index(self: pikepdf.PageList, arg0: pikepdf.Object) -> int
Given a pikepdf.Object that is a page, find the index number.
That is, returns n such that pdf.pages[n] == this_page. A ValueError exception is thrown if the page does not belong to this Pdf. The first page has index 0.
index(self: pikepdf.PageList, arg0: pikepdf.Page) -> int
Given a pikepdf.Page (page helper), find the index.
That is, returns n such that pdf.pages[n] == this_page. A ValueError exception is thrown if the page does not belong to this Pdf. The first page has index 0.
- insert(self: pikepdf.PageList, index: int, obj: object) None
Insert a page at the specified location.
- Parameters
index (int) – location at which to insert page, 0-based indexing
obj (pikepdf.Object) – page object to insert
- p(self: pikepdf.PageList, pnum: int) pikepdf.Page
Look up a page by ordinal number; .p(1) is the first page.
This is provided for convenience in situations where ordinal numbering is more natural. It is equivalent to .pages[pnum - 1]. .p(0) is an error and negative indexing is not supported.
If the PDF defines custom page labels (such as labeling front matter with Roman numerals and the main body with Arabic numerals), this function does not account for that. Use pikepdf.Page.label to get the page label for a page.
- class pikepdf._core._ObjectList
A list-like object containing multiple pikepdf.Object.
- append(self: pikepdf._ObjectList, x: pikepdf.Object) None
Add an item to the end of the list
- count(self: pikepdf._ObjectList, x: pikepdf.Object) int
Return the number of times x appears in the list
- extend(*args, **kwargs)
Overloaded function.
extend(self: pikepdf._ObjectList, L: pikepdf._ObjectList) -> None
Extend the list by appending all the items in the given list
extend(self: pikepdf._ObjectList, L: Iterable) -> None
Extend the list by appending all the items in the given list
- insert(self: pikepdf._ObjectList, i: int, x: pikepdf.Object) None
Insert an item at a given position.
- pop(*args, **kwargs)
Overloaded function.
pop(self: pikepdf._ObjectList) -> pikepdf.Object
Remove and return the last item
pop(self: pikepdf._ObjectList, i: int) -> pikepdf.Object
Remove and return the item at index i
- remove(self: pikepdf._ObjectList, x: pikepdf.Object) None
Remove the first item from the list whose value is x. It is an error if there is no such item.
- class pikepdf.ObjectType
Enumeration of object types. These values are used to implement pikepdf’s instance type checking. In the vast majority of cases it is more pythonic to use isinstance(obj, pikepdf.Stream) or issubclass.
These values are low-level and documented for completeness. They are exposed through pikepdf.Object._type_code.
- uninitialized
An uninitialized object. If this appears, it is probably a bug.
- reserved
A temporary object used in creating circular references. Should not appear in most cases.
- null
A PDF null. In most cases, nulls are automatically converted to None, so this should not appear.
- boolean
A PDF boolean. In most cases, booleans are automatically converted to bool, so this should not appear.
- integer
A PDF integer. In most cases, integers are automatically converted to int, so this should not appear. Unlike Python integers, PDF integers are 32-bit signed integers.
- real
A PDF real. In most cases, reals are automatically converted to decimal.Decimal.
- string
A PDF string, meaning the object is a pikepdf.String.
- name_
A PDF name, meaning the object is a pikepdf.Name.
- array
A PDF array, meaning the object is a pikepdf.Array.
- dictionary
A PDF dictionary, meaning the object is a pikepdf.Dictionary.
- stream
A PDF stream, meaning the object is a pikepdf.Stream (and it also has a dictionary).
- operator
A PDF operator, meaning the object is a pikepdf.Operator.
- inlineimage
A PDF inline image, meaning the object is the data stream of an inline image. It would be necessary to combine this with the implicit dictionary to interpret the image correctly. pikepdf automatically packages inline images into a more useful class, so this will not generally appear.
Jobs
- class pikepdf.Job
Provides access to the QPDF job interface.
All of the functionality of the qpdf command line program is now available to pikepdf through jobs.
- For further details:
- __init__(*args, **kwargs)
Overloaded function.
__init__(self: pikepdf.Job, json: str) -> None
Create a Job from a string containing QPDF job JSON.
__init__(self: pikepdf.Job, json_dict: dict) -> None
Create a Job from a dict in QPDF job JSON schema.
__init__(self: pikepdf.Job, args: List[str], *, progname: str = 'pikepdf') -> None
Create a Job from command line arguments to the qpdf program.
The first item in the args list should be equal to progname, whose default is "pikepdf".
- Example:
job = Job(['pikepdf', '--check', 'input.pdf'])
job.run()
- check_configuration(self: pikepdf.Job) None
Checks if the configuration is valid; raises an exception if not.
- create_pdf(self: pikepdf.Job) pikepdf.Pdf
Executes the first stage of the job.
- property creates_output
Returns True if the Job will create some sort of output file.
- property encryption_status
Returns a Python dictionary describing the encryption status.
- property exit_code
After run(), returns an integer exit code.
The meaning of exit code depends on the details of the Job that was run. Details are subject to change in libqpdf. Use properties has_warnings and encryption_status instead.
- property has_warnings
After run(), returns True if there were warnings.
- static job_json_schema(*, schema: int = 1) str
For reference, the QPDF job command line schema is built-in.
- static json_out_schema(*, schema: int = 2) str
For reference, the QPDF JSON output schema is built-in.
- property message_prefix
Allows manipulation of the prefix in front of all output messages.
- run(self: pikepdf.Job) None
Executes the job.
- write_pdf(self: pikepdf.Job, pdf: pikepdf.Pdf) None
Executes the second stage of the job.