Metadata

PDF has two different types of metadata: XMP metadata, and DocumentInfo, which is deprecated as of PDF 2.0 but still relevant. For backward compatibility, both should contain the same content. pikepdf provides a convenient interface that coordinates edits to both, but is limited to the most common metadata features.

XMP (Extensible Metadata Platform) is a metadata specification in XML format that is used in many formats other than PDF. For full information on XMP, see Adobe’s XMP Developer Center. The XMP Specification also provides useful information.

pikepdf can read compound metadata quantities, but can only modify scalars. For more complex changes consider using the python-xmp-toolkit library and its libexempi dependency; but note that it is not capable of synchronizing changes to the older DocumentInfo metadata.

Automatic metadata updates

By default pikepdf will create an XMP metadata block and set pdf:PDFVersion to a value that matches the PDF version declared elsewhere in the PDF, whenever a PDF is saved. To suppress this behavior, save with pdf.save(..., fix_metadata_version=False).

Also by default, Pdf.open_metadata() will synchronize the XMP metadata with the older document information dictionary. This behavior can also be adjusted using keyword arguments.

Accessing metadata

The XMP metadata stream is attached to the PDF’s root object, but to simplify management of this, use pikepdf.Pdf.open_metadata(). The returned pikepdf.models.PdfMetadata object may be used for reading, or entered with a with block to modify and commit changes. If you use this interface, pikepdf will synchronize changes to new and old metadata.

A PDF must still be saved after metadata is changed.

>>> pdf = pikepdf.open('../tests/resources/sandwich.pdf')

>>> meta = pdf.open_metadata()

>>> meta['xmp:CreatorTool']
'ocrmypdf 5.3.3 / Tesseract OCR-PDF 3.05.01'

If no XMP metadata exists, an empty XMP metadata container will be created.

Open metadata in a with block to open it for editing. When the block is exited, changes are committed (updating XMP and the Document Info dictionary) and attached to the PDF object. The PDF must still be saved. If an exception occurs in the block, changes are discarded.

>>> with pdf.open_metadata() as meta:
...     meta['dc:title'] = "Let's change the title"
...

The list of available metadata fields may be found in the XMP Specification.

Copying metadata between documents

When merging documents or rebuilding a PDF, you may want to carry some metadata from a source document into a new one. pikepdf intentionally does not provide an automatic “copy all metadata” operation, and you should not write one: you are responsible for deciding which fields are meaningful in the new document.

Many XMP fields assert something about one specific document and become false or misleading when copied verbatim into a different one, for example:

  • Standards conformance claims such as PDF/A (pdfaid:part, pdfaid:conformance), PDF/X, and PDF/UA. A merged or rebuilt document almost certainly does not satisfy these claims unless it was independently produced and verified to do so.

  • Unique identifiers such as xmpMM:DocumentID and xmpMM:InstanceID, which are meant to identify a particular document or rendition.

  • Timestamps such as xmp:CreateDate and xmp:ModifyDate.

  • pdf:Producer, which pikepdf by default sets to itself when metadata edits are committed.

Copy only the descriptive fields you have determined are appropriate to carry over, such as title, author, description, and subject:

>>> source = pikepdf.open('../tests/resources/sandwich.pdf')

>>> target = pikepdf.new()

>>> safe_to_copy = ['dc:title', 'dc:creator', 'dc:description', 'dc:subject']

>>> with source.open_metadata() as src, target.open_metadata() as dst:
...     for key in safe_to_copy:
...         if key in src:
...             dst[key] = src[key]
...

>>> target.open_metadata()['dc:title']
'Untitled'

Warning

Do not blindly copy every field from one document to another, and do not copy the raw XMP stream wholesale (for example with Pdf.copy_foreign). Doing so can import conformance claims and identifiers that are not true of the merged or rebuilt document.

To copy the older Document Info dictionary into XMP instead, see load_from_docinfo().

Removing metadata items

After opening metadata, use del meta['dc:title'] to delete a metadata entry.

To remove all of a PDF’s metadata records, don’t use pdf.open_metadata. Instead, use del pdf.Root.Metadata and del pdf.docinfo to remove the XMP and document info metadata, respectively.

Checking PDF/A conformance

The metadata interface can also test if a file claims to be conformant to the PDF/A specification.

>>> pdf = pikepdf.open('../tests/resources/veraPDF test suite 6-2-10-t02-pass-a.pdf')

>>> meta = pdf.open_metadata()

>>> meta.pdfa_status
'1B'

Note

This property merely tests whether the file claims to be conformant to the PDF/A standard. Use a tool such as veraPDF (official tool), or third-party web services such as PDFEN or 3-HEIGHTS™ PDF VALIDATOR to verify conformance.

Notice for application developers

If you are using pikepdf to create some kind of PDF application, you should update the fields xmp:CreatorTool and pdf:Producer. You could, for example, set xmp:CreatorTool to your application’s name and version, and pdf:Producer to pikepdf. Refer to Adobe’s documentation to decide which values best describe your circumstances.

This will help PDF developers identify the application that generated a particular PDF and is valuable debugging information.

Low-level XMP metadata access

You can read the raw XMP metadata if desired. For example, one could extract it and edit it using the full-featured python-xmp-toolkit library.

>>> xmp = pdf.Root.Metadata.read_bytes()

>>> type(xmp)
<class 'bytes'>

>>> print(xmp.decode()[:len("<?xpacket")] + "...")
<?xpacket...

Editing XMP with a generic XML library is probably not worth the trouble; the semantics are fairly complex.

Warning

Manual changes to the XMP stream object will not be synchronized with a live PdfMetadata object or the DocumentInfo block.

The Document Info dictionary

The Document Info block is an older, now deprecated object in which metadata may be stored. The Document Info is not attached to the /Root object. It may be accessed using the .docinfo property. If no Document Info exists, accessing .docinfo will initialize an empty one.

Here is an example of a Document Info block.

>>> pdf = pikepdf.open('../tests/resources/sandwich.pdf')

>>> pdf.docinfo
pikepdf.Dictionary({
  "/CreationDate": "D:20170911132748-07'00'",
  "/Creator": "ocrmypdf 5.3.3 / Tesseract OCR-PDF 3.05.01",
  "/ModDate": "D:20170911132748-07'00'",
  "/Producer": "GPL Ghostscript 9.21"
})

It is permitted in pikepdf to directly interact with Document Info as with other PDF dictionaries. However, it is better to use .open_metadata() because that interface will apply changes to both XMP and Document Info in a consistent manner.

You may copy data from a Document Info object in the current PDF or another PDF into XMP metadata using load_from_docinfo().