Sanitization

The pikepdf.sanitize module provides curated, low-risk helpers for removing active or auxiliary content from a PDF. See Sanitizing PDFs for a discussion of when to use them and the limits of programmatic sanitization.

Helpers for removing potentially unwanted content from a PDF.

What is “safe” to remove from a PDF depends entirely on your use case and threat model. The functions in this module each perform one narrowly scoped, low-risk operation: they remove active or auxiliary content (scripts, embedded files, actions that reach the network or filesystem, multimedia and rich-media content, thumbnails, search indexes, Web Capture information, private application data, and the portfolio view) while leaving the standard page content, page geometry, and document metadata in place.

These operations are not guaranteed to leave a document’s appearance unchanged. PDF JavaScript, for example, can alter how a document renders, so removing it may change the result — although in practice most PDFs are designed to display correctly without it, since many viewers do not run PDF JavaScript.

They deliberately do not strip XFA, AcroForm, annotations, the document /ID, or metadata wholesale, because those operations frequently destroy legitimate document content. See the Sanitizing PDFs topic for a discussion of the tradeoffs and the limits of what programmatic sanitization can accomplish.

pikepdf.sanitize.remove_javascript(pdf)

Remove all JavaScript from a PDF, in place.

Purges document-level named JavaScript (the /Root/Names/JavaScript name tree) and every /JavaScript action reachable from document, page, annotation, form-field, and outline (bookmark) action slots, including actions chained via /Next.
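The /Next handling can be pictured with a small model. The sketch below is not pikepdf's implementation: it uses plain Python dicts as stand-ins for PDF action dictionaries, and the helper name prune_actions is hypothetical. It assumes /Next holds either a single action or a list of actions, and that a removed action is replaced by whatever survives from its own /Next chain; the text above does not say how chains are spliced.

```python
def prune_actions(action, unwanted=frozenset({"/JavaScript"})):
    """Return a copy of an action chain with unwanted action types removed.

    `action` is a dict, a list of dicts, or None (plain-Python stand-ins for
    PDF objects). Returns None when nothing survives.
    """
    if action is None:
        return None
    if isinstance(action, list):
        kept = []
        for item in action:
            pruned = prune_actions(item, unwanted)
            if isinstance(pruned, list):
                kept.extend(pruned)
            elif pruned is not None:
                kept.append(pruned)
        return kept or None
    # Clean the rest of the chain first, then decide about this action.
    next_actions = prune_actions(action.get("/Next"), unwanted)
    if action.get("/S") in unwanted:
        # Assumption: surviving successors take the removed action's place.
        return next_actions
    result = {key: value for key, value in action.items() if key != "/Next"}
    if next_actions is not None:
        result["/Next"] = next_actions
    return result


chain = {
    "/S": "/GoTo",
    "/D": "page1",
    "/Next": {
        "/S": "/JavaScript",
        "/JS": "app.alert('hi')",
        "/Next": {"/S": "/Named", "/N": "/NextPage"},
    },
}
cleaned = prune_actions(chain)
```

Here cleaned keeps the /GoTo action and links it directly to the /Named action; running prune_actions on the result again changes nothing, which mirrors the idempotence described for this function.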

Page content, annotations (minus their scripts), form fields, and metadata are left in place. Note that PDF JavaScript can alter how a document renders, so removing it may change the result; in practice most documents are designed to display correctly without it.

The main legitimate use of PDF JavaScript is interactive form validation; removing it may break that. Most PDF viewers other than Acrobat do not fully execute PDF JavaScript; they either warn about it or disable it.

This operation is idempotent and safe to call on a PDF that contains no JavaScript.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

Note

To scrub document metadata, use pikepdf.Pdf.open_metadata() with set_pikepdf_as_editor=False instead; this function does not touch metadata.

pikepdf.sanitize.remove_attachments(pdf)

Remove all embedded files (attachments) from a PDF, in place.

Clears the /Root/Names/EmbeddedFiles name tree (the pikepdf.Pdf.attachments mapping) and removes /AF (associated files) references from every object that carries one, including the catalog, pages, annotations, XObjects, and structure elements (PDF 2.0, section 14.13). As a precaution, an /AF reference is only removed if it points to an embedded file specification (one with an /EF entry), so an unrelated key that happens to be named /AF is left untouched. FileAttachment annotations are defanged by removing their embedded /FS file specification; the annotation itself is retained so page geometry is unchanged.
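The /AF precaution can be illustrated with a small model. This is a sketch only, using plain dicts in place of PDF objects; the helper name strip_associated_files is hypothetical. It assumes /AF normally holds a list of file specifications and requires every entry to carry /EF before deleting the key; the text above does not say how a mixed array is treated.

```python
def strip_associated_files(obj):
    """Delete obj["/AF"] only when it refers to embedded file specifications.

    Returns True if the key was removed, False otherwise.
    """
    af = obj.get("/AF")
    if af is None:
        return False
    entries = af if isinstance(af, list) else [af]
    # Assumption: every entry must look like a file spec with an /EF entry.
    looks_like_filespecs = bool(entries) and all(
        isinstance(entry, dict) and "/EF" in entry for entry in entries
    )
    if looks_like_filespecs:
        del obj["/AF"]
    return looks_like_filespecs


page = {
    "/Type": "/Page",
    "/AF": [{"/Type": "/Filespec", "/F": "data.xml", "/EF": {"/F": "<stream>"}}],
}
# An unrelated dictionary that merely happens to use the key /AF.
other = {"/AF": "/SomeCustomValue"}

removed_from_page = strip_associated_files(page)
removed_from_other = strip_associated_files(other)
```

In this model the page loses its /AF entry, while the unrelated dictionary is left exactly as it was.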

Embedded files can be integral to a document, especially in digital-signing workflows, so remove them deliberately.

This operation is idempotent and safe to call on a PDF that has no attachments.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

pikepdf.sanitize.remove_external_access(pdf)

Neutralize actions that reach the network or filesystem, in place.

Removes /URI, /Launch, /GoToR (remote go-to), /GoToE (embedded go-to), /SubmitForm, and /ImportData actions wherever they are reachable from document, page, annotation, form-field, and outline (bookmark) action slots, including actions chained via /Next.

Link annotations are retained (so any visible underline or box is preserved) but their triggering action is removed, rendering them inert. Visible content and metadata are left intact.

URI actions are usually benign hyperlinks; this function is a separate opt-in so callers can decide whether to sever external access.

This operation is idempotent and safe to call on a PDF that contains no such actions.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

Note

To scrub document metadata, use pikepdf.Pdf.open_metadata() with set_pikepdf_as_editor=False instead; this function does not touch metadata.

pikepdf.sanitize.remove_thumbnails(pdf)

Remove embedded page thumbnails from a PDF, in place.

Deletes the /Thumb thumbnail image stream from every page. Thumbnails are an optional convenience that viewers can regenerate on the fly, so removing them is safe; doing so reduces file size and avoids stale thumbnails that some editors fail to keep in sync with edited pages. A stale thumbnail can also leak the prior appearance of a page you intended to edit or redact.

This operation is idempotent and safe to call on a PDF that has no thumbnails.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

pikepdf.sanitize.remove_search_index(pdf)

Remove an embedded full-text search index from a PDF, in place.

Adobe Acrobat can embed a full-text search index in a document to speed up searching. It is stored as a /SearchIndex entry in the catalog’s /PieceInfo dictionary. This function removes that entry (and the /PieceInfo dictionary itself if it becomes empty); the index’s data streams become unreferenced and are dropped when the PDF is saved.
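The "remove the entry, then remove /PieceInfo if it becomes empty" rule can be sketched as follows. This is an illustrative model with plain dicts standing in for the catalog, not pikepdf's code; drop_search_index and the /OtherApp key are hypothetical names.

```python
def drop_search_index(catalog):
    """Remove /PieceInfo/SearchIndex, and /PieceInfo itself if left empty."""
    piece_info = catalog.get("/PieceInfo")
    if not isinstance(piece_info, dict):
        return  # nothing to do, so repeated calls are harmless
    piece_info.pop("/SearchIndex", None)
    if not piece_info:
        del catalog["/PieceInfo"]


# Catalog whose /PieceInfo holds only the search index.
index_only = {"/Type": "/Catalog", "/PieceInfo": {"/SearchIndex": {"/Indexes": []}}}
# Catalog where another application also stores private data.
shared = {
    "/Type": "/Catalog",
    "/PieceInfo": {"/SearchIndex": {"/Indexes": []}, "/OtherApp": {"/Private": 1}},
}

drop_search_index(index_only)
drop_search_index(shared)
drop_search_index(shared)  # second call is a no-op
```

After these calls the first catalog has no /PieceInfo at all, while the second keeps /PieceInfo with only the other application's entry.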

Removing the index reduces file size, re-enables Fast Web View (which an embedded index precludes), and avoids a stale index leaking content you intended to edit or redact. Non-Acrobat viewers do not use it.

This operation is idempotent and safe to call on a PDF that has no embedded search index.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

pikepdf.sanitize.remove_multimedia(pdf)

Remove multimedia and rich-media content from a PDF, in place.

Neutralizes /Rendition, /Movie, /Sound, and /RichMediaExecute actions (wherever reachable from document, page, annotation, outline, and form-field action slots, including /Next chains), drops the document-level /Root/Names/Renditions name tree, and defangs media-bearing annotations by stripping their media references:

Movie annotations lose their /Movie dictionary;
Sound annotations lose their /Sound stream;
RichMedia annotations lose /RichMediaContent and /RichMediaSettings;
3D annotations lose their /3DD 3D-data reference.

Screen annotations are defanged by removing their /Rendition action via the action walk above. In every case the annotation itself is retained so page geometry is unchanged.
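The per-subtype rules listed above amount to a lookup table. The sketch below models them with plain dicts rather than real PDF objects; MEDIA_KEYS_BY_SUBTYPE and defang_annotation are hypothetical names, and the code is not taken from pikepdf.

```python
# Media-bearing keys stripped from each annotation subtype, per the list above.
# Screen annotations are absent: they are handled by the action walk instead.
MEDIA_KEYS_BY_SUBTYPE = {
    "/Movie": ("/Movie",),
    "/Sound": ("/Sound",),
    "/RichMedia": ("/RichMediaContent", "/RichMediaSettings"),
    "/3D": ("/3DD",),
}


def defang_annotation(annot):
    """Strip media references in place; the annotation itself is kept."""
    for key in MEDIA_KEYS_BY_SUBTYPE.get(annot.get("/Subtype"), ()):
        annot.pop(key, None)
    return annot


movie = {"/Subtype": "/Movie", "/Rect": [0, 0, 100, 100], "/Movie": {"/F": "clip.avi"}}
link = {"/Subtype": "/Link", "/Rect": [0, 0, 10, 10]}

defang_annotation(movie)
defang_annotation(link)
```

The movie annotation keeps its /Subtype and /Rect, so the page layout is unaffected, and the unrelated link annotation is untouched.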

Multimedia handlers (Flash, embedded video, U3D/PRC 3D) are historically a source of parser vulnerabilities, and the underlying media can reference external URLs or files. Sound and Movie are deprecated in PDF 2.0.

This operation is idempotent and safe to call on a PDF that contains no multimedia content.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

pikepdf.sanitize.remove_web_capture(pdf)

Remove Web Capture (spider) information from a PDF, in place.

Deletes the catalog’s /SpiderInfo dictionary, which Adobe Acrobat records when content is captured from the web. The dictionary stores source URLs and capture settings, so removing it drops potentially sensitive provenance that is otherwise invisible in the document. Viewers that do not implement Web Capture ignore it.

This operation is idempotent and safe to call on a PDF that has no Web Capture information.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

pikepdf.sanitize.remove_private_app_data(pdf)

Remove private application data (page-piece dictionaries), in place.

Deletes every /PieceInfo page-piece dictionary, both at the document catalog level and on every page. PDF processors use /PieceInfo to store private, application-specific data (for example, an editor’s own editable representation of the page) that the PDF specification does not interpret.

Such data can fall out of sync with the visible document and leak content you intended to edit or redact. Removing it does not change how the document renders, but applications that wrote it lose their private editing state.

This is a broader operation than remove_search_index(), which removes only the catalog’s /PieceInfo/SearchIndex entry; this function removes all page-piece data wherever it appears.

This operation is idempotent and safe to call on a PDF that has no private application data.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

pikepdf.sanitize.remove_collection(pdf)

Remove the PDF portfolio (collection) presentation, in place.

Deletes the catalog’s /Collection dictionary, which marks a document as a PDF portfolio (also called a PDF package) and configures how its embedded files are presented in a navigator UI. Removing it causes the document to be presented as an ordinary PDF showing its cover sheet.

This does not remove the embedded files themselves; pair it with remove_attachments() if you want the attachments gone as well. A portfolio’s navigator can also be driven by JavaScript, so consider remove_javascript() too.

This operation is idempotent and safe to call on a PDF that is not a portfolio.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Return type: None

class pikepdf.sanitize.Sanitizer

A fluent builder that accumulates sanitization operations.

Each remove_* method records an operation and returns self so calls can be chained. Nothing happens until apply() is called with a PDF; this lets a single Sanitizer be configured once and reused across many documents. The action-based removals (JavaScript, external access) are coalesced into a single traversal of the document when applied.

The methods correspond to the module-level functions of the same name and have the same scope and caveats. Like those functions, the operations are deliberately limited to the curated, low-risk set; there is no “remove everything” option, because blanket removal of forms, annotations, or XFA usually destroys legitimate content.

Example

Configure once, apply to many files:

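from pathlib import Path

import pikepdf.sanitize  # also binds the top-level pikepdf package

# Placeholder locations; substitute your own.
untrusted_paths = sorted(Path("incoming").glob("*.pdf"))
out_dir = Path("sanitized")
out_dir.mkdir(exist_ok=True)

scrubber = (
    pikepdf.sanitize.Sanitizer()
    .remove_javascript()
    .remove_external_access()
    .remove_attachments()
)
for path in untrusted_paths:
    with pikepdf.open(path) as pdf:
        scrubber.apply(pdf).save(out_dir / path.name)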

apply(pdf)

Apply the recorded operations to pdf, in place.

The action-based removals run as a single combined traversal, followed by the structural removals in the order they were recorded.

Parameters: pdf (pikepdf.Pdf) – The PDF to modify in place.
Returns: The same pdf, to allow further chaining, e.g. .apply(pdf).save(...).
Return type: pikepdf.Pdf
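The record-then-apply pattern, with action removals merged into one pass and structural removals run afterwards in recorded order, might look roughly like this. RecordingSanitizer is a hypothetical toy class operating on plain dicts, not pikepdf's Sanitizer; it models only the action part of JavaScript removal and ignores the named JavaScript name tree.

```python
EXTERNAL_ACTIONS = {"/URI", "/Launch", "/GoToR", "/GoToE", "/SubmitForm", "/ImportData"}


def _drop_thumbnails(doc):
    for page in doc["pages"]:
        page.pop("/Thumb", None)


class RecordingSanitizer:
    """Toy builder: methods only record work; apply() performs it."""

    def __init__(self):
        self._unwanted_actions = set()  # merged into a single traversal
        self._structural_ops = []       # run afterwards, in recorded order
        self.traversals = 0             # bookkeeping for this demo only

    def remove_javascript(self):
        self._unwanted_actions.add("/JavaScript")
        return self

    def remove_external_access(self):
        self._unwanted_actions |= EXTERNAL_ACTIONS
        return self

    def remove_thumbnails(self):
        self._structural_ops.append(_drop_thumbnails)
        return self

    def apply(self, doc):
        if self._unwanted_actions:
            self.traversals += 1  # one pass, however many action types
            doc["actions"] = [
                a for a in doc["actions"] if a["/S"] not in self._unwanted_actions
            ]
        for op in self._structural_ops:
            op(doc)
        return doc  # same object, so further calls can be chained


scrubber = (
    RecordingSanitizer()
    .remove_thumbnails()
    .remove_javascript()
    .remove_external_access()
)
doc = {
    "actions": [{"/S": "/JavaScript"}, {"/S": "/URI"}, {"/S": "/GoTo"}],
    "pages": [{"/Thumb": "<image>", "/Contents": "<stream>"}],
}
result = scrubber.apply(doc)
```

Although remove_thumbnails was recorded first, the toy still performs the combined action pass before it, matching the ordering described for apply(); the same configured object can then be applied to further documents.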

remove_attachments()

Record removal of embedded files. See remove_attachments().

Return type: Sanitizer

remove_collection()

Record removal of the portfolio view. See remove_collection().

Return type: Sanitizer

remove_external_access()

Record removal of external-access actions.

See remove_external_access().

Return type: Sanitizer

remove_javascript()

Record removal of all JavaScript. See remove_javascript().

Return type: Sanitizer

remove_multimedia()

Record removal of multimedia content. See remove_multimedia().

Return type: Sanitizer

remove_private_app_data()

Record removal of private application data.

See remove_private_app_data().

Return type: Sanitizer

remove_search_index()

Record removal of an embedded search index.

See remove_search_index().

Return type: Sanitizer

remove_thumbnails()

Record removal of page thumbnails. See remove_thumbnails().

Return type: Sanitizer

remove_web_capture()

Record removal of Web Capture info. See remove_web_capture().

Return type: Sanitizer