Sanitizing PDFs
Added in version 10.9.
If you accept PDFs from untrusted sources, you may want to strip out active or risky content before processing or redistributing them. pikepdf can be one layer in such a pipeline.
There is no universal notion of a “safe” PDF. What to sanitize depends entirely on your use case and threat model. Many of the advanced features people are tempted to strip — interactive forms, embedded files, annotations — are used to do real work, so removing them indiscriminately breaks documents that other users care about. Decide what you are defending against, and what you are willing to break, before reaching for any of these tools.
pikepdf provides a small set of curated, low-risk helpers in the
pikepdf.sanitize module. Each performs one narrowly scoped operation and
leaves the standard page content, page geometry, and document metadata untouched.
Removing JavaScript
PDFs can carry JavaScript that runs when the document is opened, when a page is viewed, or when a form field changes. The main legitimate use is interactive form validation. Most PDF viewers other than Adobe Acrobat do not fully execute PDF JavaScript, and typically warn about or disable it.
pikepdf.sanitize.remove_javascript() purges the document-level JavaScript
name tree and every JavaScript action, wherever it is reachable — including from
the document catalog, pages, annotations, form fields, and outline (bookmark)
items, and including actions chained via /Next:
>>> import pikepdf
>>> pdf = pikepdf.open('../tests/resources/pal.pdf')
>>> pikepdf.sanitize.remove_javascript(pdf)
Removing JavaScript may break form validation and, because JavaScript can alter the appearance of a PDF, may change how the document looks. In practice, because many PDF viewers don’t implement JavaScript, even PDFs that use it are typically designed to function and display correctly without it.
Removing embedded files
PDFs can embed arbitrary files (attachments). These are sometimes integral to the document — for example in some digital-signing workflows — so remove them deliberately, not reflexively.
pikepdf.sanitize.remove_attachments() clears the embedded files, removes
/AF (associated files) references, and defangs FileAttachment annotations by
removing their embedded file while keeping the annotation in place (so page
geometry is unchanged):
>>> pikepdf.sanitize.remove_attachments(pdf)
Removing external access
A PDF can contain actions that reach out to the network or filesystem: URI links,
Launch actions that start an external program, GoToR (remote go-to), GoToE
(embedded go-to, which opens content in an embedded file), SubmitForm, and
ImportData. URI actions are usually benign hyperlinks, so removing external
access is a separate opt-in step.
pikepdf.sanitize.remove_external_access() removes all of these actions,
wherever they are reachable (document, pages, annotations, form fields, and
outline items). Link annotations are kept — so any visible underline or box is
preserved — but their triggering action is removed, rendering them inert:
>>> pikepdf.sanitize.remove_external_access(pdf)
Removing thumbnails
A PDF may store a small preview image (/Thumb) for each page. Viewers can
regenerate these on the fly, so removing them is safe. Doing so reduces file
size and avoids stale thumbnails — some editors fail to keep them in sync with
edited pages, so a thumbnail can leak the prior appearance of a page you
intended to change or redact.
pikepdf.sanitize.remove_thumbnails() deletes the thumbnail from every
page:
>>> pikepdf.sanitize.remove_thumbnails(pdf)
Removing an embedded search index
Adobe Acrobat can embed a full-text search index in a document to speed up
searching. It is stored as a /SearchIndex entry in the catalog’s /PieceInfo
dictionary, and is ignored by non-Acrobat viewers. Like thumbnails, an embedded
index can fall out of sync with the document and leak content you intended to
edit or redact. Dropping it also reduces file size and re-enables Fast Web
View (which an embedded index precludes).
pikepdf.sanitize.remove_search_index() removes the index; its data streams
become unreferenced and are dropped when you save:
>>> pikepdf.sanitize.remove_search_index(pdf)
Removing multimedia and rich-media content
PDFs can embed sound, video, Flash, and 3D (U3D/PRC) content, played through
Screen, Movie, Sound, RichMedia, and 3D annotations and driven by
Rendition, Movie, Sound, and RichMediaExecute actions. These handlers
have historically been a source of parser vulnerabilities, and the underlying
media can reference external URLs or files. Sound and Movie are deprecated
in PDF 2.0.
pikepdf.sanitize.remove_multimedia() neutralizes the multimedia actions,
drops the document-level /Renditions name tree, and defangs media-bearing
annotations by stripping their media references — the annotation rectangle is
kept so page geometry is unchanged:
>>> pikepdf.sanitize.remove_multimedia(pdf)
Removing Web Capture information
When Adobe Acrobat captures content from the web, it records a /SpiderInfo
dictionary in the catalog holding the source URLs and capture settings. This
provenance is invisible in the rendered document but can leak where the content
came from. pikepdf.sanitize.remove_web_capture() deletes it:
>>> pikepdf.sanitize.remove_web_capture(pdf)
Removing private application data
PDF processors can stash private, application-specific data in /PieceInfo
page-piece dictionaries — for example, an editor’s own editable representation of
a page. Like thumbnails and search indexes, this data can fall out of sync with
the visible document and leak content you intended to edit or redact. Removing it
does not change how the document renders, but applications that wrote it lose
their private editing state.
pikepdf.sanitize.remove_private_app_data() removes every /PieceInfo
dictionary, at both the document and page level. It is a broader version of
remove_search_index (which removes only the catalog’s /SearchIndex entry):
>>> pikepdf.sanitize.remove_private_app_data(pdf)
Removing a PDF portfolio view
A PDF portfolio (or package) is a document whose embedded files are presented
through a navigator UI, configured by a /Collection dictionary in the catalog.
pikepdf.sanitize.remove_collection() removes that dictionary, so the
document is presented as an ordinary PDF showing its cover sheet. This does
not remove the embedded files themselves — pair it with remove_attachments
for that, and with remove_javascript, since a portfolio’s navigator can be
driven by JavaScript:
>>> pikepdf.sanitize.remove_collection(pdf)
Chaining operations
If you apply several of these operations together, pikepdf.sanitize.Sanitizer
offers a fluent alternative to calling the functions one at a time. You record
the operations by chaining remove_* methods, then call apply() on a PDF.
This lets you configure a sanitizer once and reuse it across many documents, and
it coalesces the action-based removals (JavaScript, external access) into a
single pass over the document:
>>> scrubber = (
... pikepdf.sanitize.Sanitizer()
... .remove_javascript()
... .remove_external_access()
... .remove_attachments()
... )
>>> pdf = pikepdf.open('../tests/resources/pal.pdf')
>>> sanitized = scrubber.apply(pdf)
apply() returns the same PDF, so you can chain straight into a save, and a
single Sanitizer can be applied to file after file:
# untrusted_paths (an iterable of pathlib.Path) and out_dir (a Path) are placeholders
scrubber = pikepdf.sanitize.Sanitizer().remove_javascript().remove_attachments()
for path in untrusted_paths:
    with pikepdf.open(path) as pdf:
        scrubber.apply(pdf).save(out_dir / path.name)
By design there is no “remove everything” method — blanket removal of forms, annotations, or XFA usually destroys legitimate content (see below).
What not to strip blindly
The ChatGPT-style “sanitizers” circulating online often go much further, and in doing so destroy legitimate content. pikepdf deliberately does not offer one-click equivalents for the following, because they are usually the wrong thing to do:
Warning
XFA forms. XFA is a deprecated, Adobe-only form technology, but the form’s contents live inside the XFA packet. Removing XFA typically reduces the document to a single blank page with an error message — destroying everything the document was for.
All annotations / the whole AcroForm. Wholesale removal discards links, comments, and every form field, not just the risky parts. Prefer the targeted helpers above.
The document /ID. Erasing the trailer /ID does not improve security; pikepdf will simply generate a new one when saving.
Flattening dynamic content with OCR
The helpers above are surgical: they remove specific structures while leaving the rest of the document as-is. If instead you want to strip out essentially all dynamic and interactive content in one pass — and you can accept rendering the document down to images — a middleweight option is to rasterize every page and rebuild a fresh PDF with a clean OCR text layer using OCRmyPDF (which is built on pikepdf):
ocrmypdf --force-ocr input.pdf output.pdf
--force-ocr rasterizes all pages to images and then re-OCRs them. In the
process it discards JavaScript, embedded files, form fields, annotations, the
original (possibly inaccurate or maliciously crafted) text layer, and any
hidden or off-page content — because none of it survives the trip through a
bitmap. The output contains the visible appearance of each page plus a freshly
generated, searchable text layer.
The trade-off is that the text layer is now only as accurate as OCR, vector text becomes a raster image (larger files, no longer perfectly sharp), and genuinely interactive features are gone. But for “I want this PDF to be inert and contain nothing but what a human can see on the page,” this is often a cleaner road than trying to enumerate and remove every kind of active content by hand.
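When flattening is one step in a larger Python pipeline, the same command can be run through subprocess. This is a minimal sketch that assumes the ocrmypdf executable is on PATH; the function names build_flatten_command and flatten_with_ocr are hypothetical, and only the --force-ocr flag shown above is used.

```python
import subprocess
from pathlib import Path


def build_flatten_command(src: Path, dst: Path) -> list[str]:
    """Build the argument list for the ocrmypdf invocation shown above."""
    return ["ocrmypdf", "--force-ocr", str(src), str(dst)]


def flatten_with_ocr(src: Path, dst: Path) -> None:
    """Rasterize and re-OCR src into dst by running ocrmypdf."""
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_flatten_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell interpretation of untrusted file names.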
Scrubbing metadata
To remove personal information from metadata, do not blindly delete the DocumentInfo dictionary and the XMP metadata stream: the two store overlapping fields and must be kept in sync. Use pikepdf’s coordinated metadata API instead, which edits both:
with pikepdf.open(...) as pdf, pdf.open_metadata(set_pikepdf_as_editor=False) as meta:
    del meta['dc:creator']
By default, pikepdf.Pdf.save() and pikepdf.Pdf.open_metadata() record
pikepdf as the document’s producer/most-recent editor. This is a courtesy to other
PDF developers that helps with tracking down bugs. Pass
set_pikepdf_as_editor=False to pikepdf.Pdf.open_metadata() to suppress it.
See Metadata for the full metadata API.
The limits of programmatic redaction
Warning
pikepdf cannot reliably redact text or images from a PDF, and neither can any purely programmatic tool that operates on the file’s structure.
Removing a visible word from a page is far harder than it looks. Text in a PDF can be:
split across many drawing operators, so the string you are searching for never appears contiguously;
drawn and then hidden by a clipping path, an overlapping white rectangle, or pushed off the visible page — visually gone but still in the byte stream;
duplicated in an invisible OCR text layer placed behind a scanned image;
duplicated in an embedded search index (tools such as Acrobat can build these to speed up searching);
present in page thumbnails, form XObjects, or alternate representations.
pikepdf works on PDF structure, not rendered appearance, so it cannot guarantee that a phrase is gone from every place it might be stored.
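The first point in the list above can be illustrated without any PDF library. The content stream below is a made-up example in which one TJ operator draws the word “Secret” as two fragments with a kerning adjustment between them.

```python
import re

# Hypothetical page content: "Secret" is drawn as (Sec) and (ret).
content = b"BT /F1 12 Tf 72 720 Td [(Sec) -20 (ret)] TJ ET"

# A naive byte search misses the word, although a viewer would render it.
naive_hit = b"Secret" in content

# Joining the literal string fragments recovers it in this simple case only.
# Real streams also use hex strings, escapes, and font-specific encodings,
# which this regular expression does not handle.
joined = b"".join(re.findall(rb"\(([^()]*)\)", content))
```

Here naive_hit is False while joined equals b"Secret", which is why searching the bytes for a phrase proves nothing about whether it has been removed.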
For genuine redaction:
Use a graphical PDF editor with a dedicated redaction tool, which removes the underlying content rather than merely drawing a black box over it. Then verify the result by searching and by inspecting any OCR layer.
For truly sensitive documents, redact physically: print the document, black out the sensitive parts with a marker, then scan (and, if needed, OCR) the result. This severs any digital link to the original bytes.
Defense in depth
pikepdf is one layer, not a complete solution. For untrusted input, combine it with other measures appropriate to your threat model: malware scanning, rendering the PDF to images and rebuilding it, sandboxing, and size/structure limits. And always validate the result against the threat you are actually trying to defend against.
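As one example of a size/structure limit, an inexpensive pre-check can run before a file is ever parsed. This is a sketch under stated assumptions: the 50 MiB limit is arbitrary, the function name precheck_pdf_bytes is hypothetical, and the header search window of 1024 bytes reflects a common reader tolerance rather than a strict rule.

```python
# Assumed limit for illustration; choose one that suits your workload.
MAX_BYTES = 50 * 1024 * 1024


def precheck_pdf_bytes(data: bytes, max_bytes: int = MAX_BYTES) -> list[str]:
    """Return a list of problems found by cheap checks on the raw bytes."""
    problems = []
    if len(data) > max_bytes:
        problems.append("file exceeds size limit")
    # Look for the PDF header marker near the start of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    return problems
```

Passing these checks says nothing about whether the file is safe; it only rejects input that is obviously oversized or not a PDF.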
See also PDF security for notes on PDF password security and content restrictions.