Sanitization
The pikepdf.sanitize module provides curated, low-risk helpers for
removing active or auxiliary content from a PDF. See Sanitizing PDFs for a
discussion of when to use them and the limits of programmatic sanitization.
Helpers for removing potentially unwanted content from a PDF.
What is “safe” to remove from a PDF depends entirely on your use case and threat model. The functions in this module each perform one narrowly scoped, low-risk operation: they remove active or auxiliary content (scripts, embedded files, actions that reach the network or filesystem, multimedia and rich-media content, thumbnails, search indexes, Web Capture information, private application data, and the portfolio view) while leaving the standard page content, page geometry, and document metadata in place.
These operations are not guaranteed to leave a document’s appearance unchanged. PDF JavaScript, for example, can alter how a document renders, so removing it may change the result — although in practice most PDFs are designed to display correctly without it, since many viewers do not run PDF JavaScript.
They deliberately do not strip XFA, AcroForm, annotations, the document
/ID, or metadata wholesale, because those operations frequently destroy
legitimate document content. See the Sanitizing PDFs
topic for a discussion of the tradeoffs and the limits of what programmatic
sanitization can accomplish.
- pikepdf.sanitize.remove_javascript(pdf)
Remove all JavaScript from a PDF, in place.
Purges document-level named JavaScript (the /Root/Names/JavaScript name tree) and every /JavaScript action reachable from document, page, annotation, form-field, and outline (bookmark) action slots, including actions chained via /Next.
Page content, annotations (minus their scripts), form fields, and metadata are left in place. Note that PDF JavaScript can alter how a document renders, so removing it may change the result; in practice most documents are designed to display correctly without it.
The main legitimate use of PDF JavaScript is interactive form validation; removing it may break that. Most PDF viewers other than Acrobat do not fully execute PDF JavaScript and warn about or disable it.
This operation is idempotent and safe to call on a PDF that contains no JavaScript.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
Note
To scrub document metadata, use pikepdf.Pdf.open_metadata() with set_pikepdf_as_editor=False instead; this function does not touch metadata.
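The handling of /Next chains described for remove_javascript() can be pictured with a toy model. The sketch below is not pikepdf's implementation: actions are plain Python dicts (an assumption, so the example runs without any PDF library), and the helper name prune_actions is hypothetical. Keeping the actions that follow a dropped one is also a choice made for this sketch, not documented behavior.

```python
def prune_actions(action, unwanted=frozenset({"/JavaScript"})):
    """Return the surviving actions of a chain as a list (possibly empty).

    Toy model: an action is a dict with an "/S" type and an optional "/Next"
    entry holding one action dict or a list of them. Real PDFs may contain
    cyclic chains; this sketch assumes there are none.
    """
    nxt = action.get("/Next")
    if nxt is None:
        children = []
    elif isinstance(nxt, dict):
        children = [nxt]
    else:
        children = list(nxt)

    # Prune the rest of the chain first, so removals deeper down are handled.
    survivors = []
    for child in children:
        survivors.extend(prune_actions(child, unwanted))

    if action.get("/S") in unwanted:
        # Drop this action but keep the actions that followed it.
        return survivors

    kept = {k: v for k, v in action.items() if k != "/Next"}
    if len(survivors) == 1:
        kept["/Next"] = survivors[0]
    elif survivors:
        kept["/Next"] = survivors
    return [kept]


chain = {
    "/S": "/GoTo",
    "/D": "page2",
    "/Next": {
        "/S": "/JavaScript",
        "/JS": "app.alert('hi')",
        "/Next": {"/S": "/Named", "/N": "/NextPage"},
    },
}
cleaned = prune_actions(chain)
```

Running the helper a second time on its own output changes nothing, which mirrors the idempotence the function documents.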
- pikepdf.sanitize.remove_attachments(pdf)
Remove all embedded files (attachments) from a PDF, in place.
Clears the /Root/Names/EmbeddedFiles name tree (the pikepdf.Pdf.attachments mapping) and removes /AF (associated files) references from every object that carries one: the catalog, pages, annotations, XObjects, structure elements, and so on (PDF 2.0 14.13). As a precaution, an /AF reference is only removed if it points to an embedded file specification (one with an /EF entry), so an unrelated key that happens to be named /AF is left untouched. FileAttachment annotations are defanged by removing their embedded /FS file specification; the annotation itself is retained so page geometry is unchanged.
Embedded files can be integral to a document, especially in digital-signing workflows, so remove them deliberately.
This operation is idempotent and safe to call on a PDF that has no attachments.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
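The /AF precaution described for remove_attachments() can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not pikepdf's code: objects are plain dicts, /AF is modeled as a list of entries, and the helper name strip_embedded_af is hypothetical. Filtering entry by entry is one possible reading of the rule that only references to embedded file specifications are removed.

```python
def strip_embedded_af(obj):
    """Remove /AF entries that point at embedded file specifications.

    Toy model: `obj` is a plain dict and "/AF" holds a list of entries. An
    entry counts as an embedded file specification only if it is a dict with
    an "/EF" key. Returns True if anything was removed.
    """
    af = obj.get("/AF")
    if not isinstance(af, list):
        # Absent, or an unrelated key that merely shares the name.
        return False
    remaining = [e for e in af if not (isinstance(e, dict) and "/EF" in e)]
    if len(remaining) == len(af):
        return False
    if remaining:
        obj["/AF"] = remaining
    else:
        del obj["/AF"]
    return True


page = {
    "/Type": "/Page",
    "/AF": [{"/Type": "/Filespec", "/F": "data.xml", "/EF": {"/F": "stream"}}],
}
odd = {"/Type": "/Custom", "/AF": "unrelated value"}

changed_page = strip_embedded_af(page)
changed_odd = strip_embedded_af(odd)
```

The page loses its /AF key, while the unrelated /AF value on the second object is left alone.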
- pikepdf.sanitize.remove_external_access(pdf)
Neutralize actions that reach the network or filesystem, in place.
Removes /URI, /Launch, /GoToR (remote go-to), /GoToE (embedded go-to), /SubmitForm, and /ImportData actions wherever they are reachable from document, page, annotation, form-field, and outline (bookmark) action slots, including actions chained via /Next.
Link annotations are retained (so any visible underline or box is preserved) but their triggering action is removed, rendering them inert. Visible content and metadata are left intact.
URI actions are usually benign hyperlinks; this function is a separate opt-in so callers can decide whether to sever external access.
This operation is idempotent and safe to call on a PDF that contains no such actions.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
Note
To scrub document metadata, use pikepdf.Pdf.open_metadata() with set_pikepdf_as_editor=False instead; this function does not touch metadata.
- pikepdf.sanitize.remove_thumbnails(pdf)
Remove embedded page thumbnails from a PDF, in place.
Deletes the /Thumb thumbnail image stream from every page. Thumbnails are an optional convenience that viewers can regenerate on the fly, so removing them is safe; doing so reduces file size and avoids stale thumbnails that some editors fail to keep in sync with edited pages. A stale thumbnail can also leak the prior appearance of a page you intended to edit or redact.
This operation is idempotent and safe to call on a PDF that has no thumbnails.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
- pikepdf.sanitize.remove_search_index(pdf)
Remove an embedded full-text search index from a PDF, in place.
Adobe Acrobat can embed a full-text search index in a document to speed up searching. It is stored as a /SearchIndex entry in the catalog’s /PieceInfo dictionary. This function removes that entry (and the /PieceInfo dictionary itself if it becomes empty); the index’s data streams become unreferenced and are dropped when the PDF is saved.
Removing the index reduces file size, re-enables Fast Web View (which an embedded index precludes), and avoids a stale index leaking content you intended to edit or redact. Non-Acrobat viewers do not use it.
This operation is idempotent and safe to call on a PDF that has no embedded search index.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
- pikepdf.sanitize.remove_multimedia(pdf)
Remove multimedia and rich-media content from a PDF, in place.
Neutralizes /Rendition, /Movie, /Sound, and /RichMediaExecute actions (wherever reachable from document, page, annotation, outline, and form-field action slots, including /Next chains), drops the document-level /Root/Names/Renditions name tree, and defangs media-bearing annotations by stripping their media references: Movie annotations lose their /Movie dictionary; Sound annotations lose their /Sound stream; RichMedia annotations lose /RichMediaContent and /RichMediaSettings; 3D annotations lose their /3DD 3D-data reference.
Screen annotations are defanged by removing their /Rendition action via the action walk above. In every case the annotation itself is retained so page geometry is unchanged.
Multimedia handlers (Flash, embedded video, U3D/PRC 3D) are historically a source of parser vulnerabilities, and the underlying media can reference external URLs or files. Sound and Movie are deprecated in PDF 2.0.
This operation is idempotent and safe to call on a PDF that contains no multimedia content.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
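The per-subtype stripping listed for remove_multimedia() lends itself to a table-driven sketch. The model below uses plain dicts instead of pikepdf objects, and the names MEDIA_KEYS and defang_annotation are hypothetical; it only illustrates the idea that the annotation stays while its media references go.

```python
# Keys stripped per annotation subtype, as listed in the description above.
MEDIA_KEYS = {
    "/Movie": ("/Movie",),
    "/Sound": ("/Sound",),
    "/RichMedia": ("/RichMediaContent", "/RichMediaSettings"),
    "/3D": ("/3DD",),
}


def defang_annotation(annot):
    """Strip media references from one annotation dict, in place.

    The annotation and its /Rect are kept, so page geometry is unchanged.
    Returns the list of keys that were removed.
    """
    removed = []
    for key in MEDIA_KEYS.get(annot.get("/Subtype"), ()):
        if key in annot:
            del annot[key]
            removed.append(key)
    return removed


annot = {
    "/Subtype": "/RichMedia",
    "/Rect": [0, 0, 100, 100],
    "/RichMediaContent": {"/Assets": "placeholder"},
    "/RichMediaSettings": {"/Activation": "placeholder"},
}
removed = defang_annotation(annot)
```

Annotations of other subtypes pass through untouched because the lookup falls back to an empty tuple.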
- pikepdf.sanitize.remove_web_capture(pdf)
Remove Web Capture (spider) information from a PDF, in place.
Deletes the catalog’s /SpiderInfo dictionary, which Adobe Acrobat records when content is captured from the web. It stores source URLs and capture settings, so removing it drops potentially sensitive provenance that is otherwise invisible in the document. Viewers that do not implement Web Capture ignore the dictionary.
This operation is idempotent and safe to call on a PDF that has no Web Capture information.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
- pikepdf.sanitize.remove_private_app_data(pdf)
Remove private application data (page-piece dictionaries), in place.
Deletes every /PieceInfo page-piece dictionary, both at the document catalog level and on every page. PDF processors use /PieceInfo to store private, application-specific data (for example, an editor’s own editable representation of the page) that the PDF specification does not interpret.
Such data can fall out of sync with the visible document and leak content you intended to edit or redact. Removing it does not change how the document renders, but applications that wrote it lose their private editing state.
This is a broader operation than remove_search_index(), which removes only the catalog’s /PieceInfo/SearchIndex entry; this function removes all page-piece data wherever it appears.
This operation is idempotent and safe to call on a PDF that has no private application data.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
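The difference in scope between remove_search_index() and remove_private_app_data() can be made concrete with a toy model. The sketch below uses plain dicts for the catalog and pages, and the helper names (drop_search_index, drop_all_piece_info, make_doc) are hypothetical; it is not how pikepdf implements either function.

```python
def drop_search_index(catalog):
    """Remove only /PieceInfo/SearchIndex; drop /PieceInfo if it ends up empty."""
    piece_info = catalog.get("/PieceInfo")
    if not isinstance(piece_info, dict):
        return
    piece_info.pop("/SearchIndex", None)
    if not piece_info:
        del catalog["/PieceInfo"]


def drop_all_piece_info(catalog, pages):
    """Remove every /PieceInfo dictionary, on the catalog and on each page."""
    for obj in [catalog, *pages]:
        obj.pop("/PieceInfo", None)


def make_doc():
    """Build a fresh toy document: a catalog dict and a list of page dicts."""
    catalog = {
        "/PieceInfo": {
            "/SearchIndex": {"/Private": "index"},
            "/Editor": {"/Private": "state"},
        }
    }
    pages = [{"/Type": "/Page", "/PieceInfo": {"/Editor": {"/Private": "layers"}}}]
    return catalog, pages


# Narrow operation: only the search index entry goes away.
narrow_catalog, narrow_pages = make_doc()
drop_search_index(narrow_catalog)

# Broad operation: all page-piece data goes away, catalog and pages alike.
broad_catalog, broad_pages = make_doc()
drop_all_piece_info(broad_catalog, broad_pages)
```

After the narrow call the editor's private entry and the page-level data are still present; after the broad call neither remains.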
- pikepdf.sanitize.remove_collection(pdf)
Remove the PDF portfolio (collection) presentation, in place.
Deletes the catalog’s /Collection dictionary, which marks a document as a PDF portfolio (also called a PDF package) and configures how its embedded files are presented in a navigator UI. Removing it causes the document to be presented as an ordinary PDF showing its cover sheet.
This does not remove the embedded files themselves; pair it with remove_attachments() if you want the attachments gone as well. A portfolio’s navigator can also be driven by JavaScript, so consider remove_javascript() too.
This operation is idempotent and safe to call on a PDF that is not a portfolio.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Return type:
None
- class pikepdf.sanitize.Sanitizer
A fluent builder that accumulates sanitization operations.
Each remove_* method records an operation and returns self so calls can be chained. Nothing happens until apply() is called with a PDF; this lets a single Sanitizer be configured once and reused across many documents. The action-based removals (JavaScript, external access) are coalesced into a single traversal of the document when applied.
The methods correspond to the module-level functions of the same name and have the same scope and caveats. Like those functions, the operations are deliberately limited to the curated, low-risk set; there is no “remove everything” option, because blanket removal of forms, annotations, or XFA usually destroys legitimate content.
Example
Configure once, apply to many files:
    import pikepdf.sanitize

    # untrusted_paths is an iterable of pathlib.Path; out_dir is a pathlib.Path.
    scrubber = (
        pikepdf.sanitize.Sanitizer()
        .remove_javascript()
        .remove_external_access()
        .remove_attachments()
    )
    for path in untrusted_paths:
        with pikepdf.open(path) as pdf:
            scrubber.apply(pdf).save(out_dir / path.name)
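The record-then-apply behavior of the builder can be modeled in a few lines of plain Python. The class below is a toy, not pikepdf's Sanitizer: the name ToySanitizer, the dict-based document layout ("actions" and "pages" keys), and the helper _drop_thumbnails are all assumptions made for illustration. It shows action types being merged into one pass, structural operations running afterwards in recorded order, and one configured instance being reused.

```python
def _drop_thumbnails(doc):
    """Structural operation: remove /Thumb from every page of the toy document."""
    for page in doc["pages"]:
        page.pop("/Thumb", None)


class ToySanitizer:
    """Records operations and applies them later (toy model only)."""

    def __init__(self):
        self._action_types = set()  # action types to drop in one traversal
        self._structural = []       # callables, kept in recorded order

    def remove_javascript(self):
        self._action_types.add("/JavaScript")
        return self

    def remove_external_access(self):
        self._action_types.update(
            {"/URI", "/Launch", "/GoToR", "/GoToE", "/SubmitForm", "/ImportData"}
        )
        return self

    def remove_thumbnails(self):
        self._structural.append(_drop_thumbnails)
        return self

    def apply(self, doc):
        if self._action_types:
            # One pass over the actions covers every recorded action type.
            doc["actions"] = [
                a for a in doc["actions"] if a["/S"] not in self._action_types
            ]
        for op in self._structural:
            op(doc)
        return doc  # returned so the caller can keep chaining


scrubber = (
    ToySanitizer()
    .remove_javascript()
    .remove_external_access()
    .remove_thumbnails()
)
docs = [
    {
        "actions": [{"/S": "/GoTo"}, {"/S": "/JavaScript"}, {"/S": "/URI"}],
        "pages": [{"/Thumb": "image"}],
    },
    {"actions": [{"/S": "/Launch"}], "pages": [{}]},
]
results = [scrubber.apply(doc) for doc in docs]
```

Nothing is modified while the builder is being configured; all changes happen inside apply(), which is why the same scrubber can be applied to both toy documents.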
- apply(pdf)
Apply the recorded operations to pdf, in place.
The action-based removals run as a single combined traversal, followed by the structural removals in the order they were recorded.
- Parameters:
pdf (pikepdf.Pdf) – The PDF to modify in place.
- Returns:
The same pdf, to allow further chaining, e.g. .apply(pdf).save(...).
- Return type:
pikepdf.Pdf
- remove_attachments()
Record removal of embedded files. See remove_attachments().
- Return type:
Sanitizer
- remove_collection()
Record removal of the portfolio view. See remove_collection().
- Return type:
Sanitizer
- remove_external_access()
Record removal of external-access actions.
- Return type:
Sanitizer
- remove_javascript()
Record removal of all JavaScript. See remove_javascript().
- Return type:
Sanitizer
- remove_multimedia()
Record removal of multimedia content. See remove_multimedia().
- Return type:
Sanitizer
- remove_private_app_data()
Record removal of private application data. See remove_private_app_data().
- Return type:
Sanitizer
- remove_search_index()
Record removal of an embedded search index.
- Return type:
Sanitizer
- remove_thumbnails()
Record removal of page thumbnails. See remove_thumbnails().
- Return type:
Sanitizer
- remove_web_capture()
Record removal of Web Capture info. See remove_web_capture().
- Return type:
Sanitizer