(sanitize)=

# Sanitizing PDFs

:::{versionadded} 10.9
:::

If you accept PDFs from untrusted sources, you may want to strip out active or
risky content before processing or redistributing them. pikepdf can be one layer
in such a pipeline.

There is no universal notion of a "safe" PDF. **What to sanitize depends entirely
on your use case and threat model.** Many of the advanced features people are
tempted to strip — interactive forms, embedded files, annotations — are used to
do real work, so removing them indiscriminately breaks documents that other users
care about. Decide what you are defending against, and what you are willing to
break, before reaching for any of these tools.

pikepdf provides a small set of curated, low-risk helpers in the
{mod}`pikepdf.sanitize` module. Each performs one narrowly scoped operation and
leaves the standard page content, page geometry, and document metadata untouched.

## Removing JavaScript

PDFs can carry JavaScript that runs when the document is opened, when a page is
viewed, or when a form field changes. The main legitimate use is interactive form
validation. Most PDF viewers other than Adobe Acrobat do not fully execute PDF
JavaScript, and typically warn about or disable it.

{func}`pikepdf.sanitize.remove_javascript` purges the document-level JavaScript
name tree and every JavaScript action, wherever it is reachable — including from
the document catalog, pages, annotations, form fields, and outline (bookmark)
items, and including actions chained via `/Next`:

```{eval-rst}
.. doctest::

    >>> import pikepdf

    >>> pdf = pikepdf.open('../tests/resources/pal.pdf')

    >>> pikepdf.sanitize.remove_javascript(pdf)
```

Removing JavaScript may break form validation, and because JavaScript can alter
the appearance of a PDF, a document may also look different afterward. In
practice, because many PDF viewers don't implement JavaScript, even PDFs that use
it are typically designed to function and display correctly without it.

## Removing embedded files

PDFs can embed arbitrary files (attachments). These are sometimes integral to the
document — for example in some digital-signing workflows — so remove them
deliberately, not reflexively.

{func}`pikepdf.sanitize.remove_attachments` clears the embedded files, removes
`/AF` (associated files) references, and defangs FileAttachment annotations by
removing their embedded file while keeping the annotation in place (so page
geometry is unchanged):

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_attachments(pdf)
```

## Removing external access

A PDF can contain actions that reach out to the network or filesystem: URI links,
`Launch` actions that start an external program, `GoToR` (remote go-to), `GoToE`
(embedded go-to, which opens content in an embedded file), `SubmitForm`, and
`ImportData`. URI actions are usually benign hyperlinks, so removing external
access is a separate opt-in.

{func}`pikepdf.sanitize.remove_external_access` removes all of these actions,
wherever they are reachable (document, pages, annotations, form fields, and
outline items). Link annotations are kept — so any visible underline or box is
preserved — but their triggering action is removed, rendering them inert:

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_external_access(pdf)
```

## Removing thumbnails

A PDF may store a small preview image (`/Thumb`) for each page. Viewers can
regenerate these on the fly, so removing them is safe. Doing so reduces file
size and avoids *stale* thumbnails — some editors fail to keep them in sync with
edited pages, so a thumbnail can leak the prior appearance of a page you
intended to change or redact.

{func}`pikepdf.sanitize.remove_thumbnails` deletes the thumbnail from every
page:

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_thumbnails(pdf)
```

## Removing an embedded search index

Adobe Acrobat can embed a full-text search index in a document to speed up
searching. It is stored as a `/SearchIndex` entry in the catalog's `/PieceInfo`
dictionary, and is ignored by non-Acrobat viewers. Like thumbnails, an embedded
index can fall out of sync with the document and leak content you intended to
edit or redact. Dropping it also reduces file size and re-enables Fast Web View
(which an embedded index precludes).

{func}`pikepdf.sanitize.remove_search_index` removes the index; its data streams
become unreferenced and are dropped when you save:

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_search_index(pdf)
```

## Removing multimedia and rich-media content

PDFs can embed sound, video, Flash, and 3D (U3D/PRC) content, played through
`Screen`, `Movie`, `Sound`, `RichMedia`, and `3D` annotations and driven by
`Rendition`, `Movie`, `Sound`, and `RichMediaExecute` actions. These handlers
have historically been a source of parser vulnerabilities, and the underlying
media can reference external URLs or files. `Sound` and `Movie` are deprecated
in PDF 2.0.

{func}`pikepdf.sanitize.remove_multimedia` neutralizes the multimedia actions,
drops the document-level `/Renditions` name tree, and defangs media-bearing
annotations by stripping their media references — the annotation rectangle is
kept so page geometry is unchanged:

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_multimedia(pdf)
```

## Removing Web Capture information

When Adobe Acrobat captures content from the web, it records a `/SpiderInfo`
dictionary in the catalog holding the source URLs and capture settings. This
provenance is invisible in the rendered document but can leak where the content
came from. {func}`pikepdf.sanitize.remove_web_capture` deletes it:

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_web_capture(pdf)
```

## Removing private application data

PDF processors can stash private, application-specific data in `/PieceInfo`
page-piece dictionaries — for example, an editor's own editable representation of
a page. Like thumbnails and search indexes, this data can fall out of sync with
the visible document and leak content you intended to edit or redact. Removing it
does not change how the document renders, but applications that wrote it lose
their private editing state.

{func}`pikepdf.sanitize.remove_private_app_data` removes every `/PieceInfo`
dictionary, at both the document and page level. It is a broader version of
`remove_search_index` (which removes only the catalog's `/SearchIndex` entry):

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_private_app_data(pdf)
```

## Removing a PDF portfolio view

A *PDF portfolio* (or package) is a document whose embedded files are presented
through a navigator UI, configured by a `/Collection` dictionary in the catalog.
{func}`pikepdf.sanitize.remove_collection` removes that dictionary, so the
document is presented as an ordinary PDF showing its cover sheet. This does
**not** remove the embedded files themselves — pair it with `remove_attachments`
for that, and with `remove_javascript`, since a portfolio's navigator can be
driven by JavaScript:

```{eval-rst}
.. doctest::

    >>> pikepdf.sanitize.remove_collection(pdf)
```

## Chaining operations

If you apply several of these operations together, {class}`pikepdf.sanitize.Sanitizer`
offers a fluent alternative to calling the functions one at a time. You record
the operations by chaining `remove_*` methods, then call `apply()` on a PDF.
This lets you configure a sanitizer once and reuse it across many documents, and
it coalesces the action-based removals (JavaScript, external access) into a
single pass over the document:

```{eval-rst}
.. doctest::

    >>> scrubber = (
    ...     pikepdf.sanitize.Sanitizer()
    ...     .remove_javascript()
    ...     .remove_external_access()
    ...     .remove_attachments()
    ... )

    >>> pdf = pikepdf.open('../tests/resources/pal.pdf')

    >>> sanitized = scrubber.apply(pdf)
```

`apply()` returns the same PDF, so you can chain straight into a save, and a
single `Sanitizer` can be applied to file after file:

```python
scrubber = pikepdf.sanitize.Sanitizer().remove_javascript().remove_attachments()
for path in untrusted_paths:
    with pikepdf.open(path) as pdf:
        scrubber.apply(pdf).save(out_dir / path.name)
```

By design there is no "remove everything" method — blanket removal of forms,
annotations, or XFA usually destroys legitimate content (see below).

## What not to strip blindly

The ChatGPT-style "sanitizers" circulating online often go much further, and in
doing so destroy legitimate content. pikepdf deliberately does **not** offer
one-click equivalents for the following, because they are usually the wrong thing
to do:

:::{warning}
- **XFA forms.** XFA is a deprecated, Adobe-only form technology, but the form's
  contents live inside the XFA packet. Removing XFA typically reduces the document
  to a single blank page with an error message — destroying everything the
  document was for.
- **All annotations / the whole AcroForm.** Wholesale removal discards links,
  comments, and every form field, not just the risky parts. Prefer the targeted
  helpers above.
- **The document `/ID`.** Erasing the trailer `/ID` does not improve security;
  pikepdf will simply generate a new one when saving.
:::

## Flattening dynamic content with OCR

The helpers above are surgical: they remove specific structures while leaving the
rest of the document as-is. If instead you want to strip out *essentially all*
dynamic and interactive content in one pass — and you can accept rendering the
document down to images — a middleweight option is to rasterize every page and
rebuild a fresh PDF with a clean OCR text layer using
[OCRmyPDF](https://ocrmypdf.readthedocs.io/) (which is built on pikepdf):

```bash
ocrmypdf --force-ocr input.pdf output.pdf
```

`--force-ocr` rasterizes all pages to images and then re-OCRs them. In the
process it discards JavaScript, embedded files, form fields, annotations, the
original (possibly inaccurate or maliciously crafted) text layer, and any
hidden or off-page content — because none of it survives the trip through a
bitmap. The output contains the visible appearance of each page plus a freshly
generated, searchable text layer.

The trade-off is that the text layer is now only as accurate as OCR, vector text
becomes a raster image (larger files, no longer perfectly sharp), and genuinely
interactive features are gone. But for "I want this PDF to be inert and contain
nothing but what a human can see on the page," this is often the cleaner road
than trying to enumerate and remove every kind of active content by hand.
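If the rest of your pipeline is in Python, one option is to invoke the same
command through `subprocess`. This is a minimal sketch that assumes the
`ocrmypdf` executable is installed and on `PATH`. The function names are
hypothetical, and only the `--force-ocr` invocation shown above is used.

```python
import subprocess
from pathlib import Path


def build_flatten_command(src, dst):
    """Build the argument list for the ocrmypdf invocation shown above."""
    return ['ocrmypdf', '--force-ocr', str(src), str(dst)]


def flatten_with_ocr(src, dst):
    """Rasterize and re-OCR src into dst; raises CalledProcessError on failure."""
    subprocess.run(build_flatten_command(src, dst), check=True)


command = build_flatten_command(Path('input.pdf'), Path('output.pdf'))
```

Passing the arguments as a list, rather than a shell string, avoids quoting
problems with untrusted file names.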

## Scrubbing metadata

To remove personal information from metadata, do **not** blindly delete the
DocumentInfo dictionary and the XMP metadata stream — they are redundant and must
be kept in sync. Use pikepdf's coordinated metadata API instead, which edits both:

```python
with pikepdf.open(...) as pdf, pdf.open_metadata(set_pikepdf_as_editor=False) as meta:
    del meta['dc:creator']
```

By default, {meth}`pikepdf.Pdf.save` and {meth}`pikepdf.Pdf.open_metadata` record
pikepdf as the document's producer/most-recent editor. This is a courtesy to other
PDF developers that helps with tracking down bugs. Pass
`set_pikepdf_as_editor=False` to {meth}`pikepdf.Pdf.open_metadata` to suppress it.
See {ref}`metadata` for the full metadata API.

## The limits of programmatic redaction

:::{warning}
**pikepdf cannot reliably redact text or images from a PDF, and neither can any
purely programmatic tool that operates on the file's structure.**
:::

Removing a visible word from a page is far harder than it looks. Text in a PDF
can be:

- split across many drawing operators, so the string you are searching for never
  appears contiguously;
- drawn and then hidden by a clipping path or an overlapping white rectangle, or
  pushed off the visible page, so it is visually gone but still in the byte stream;
- duplicated in an **invisible OCR text layer** placed behind a scanned image;
- duplicated in an **embedded search index** (tools such as Acrobat can build
  these to speed up searching);
- present in page thumbnails, form XObjects, or alternate representations.

pikepdf works on PDF *structure*, not rendered *appearance*, so it cannot
guarantee that a phrase is gone from every place it might be stored.
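The first point in the list above is easy to demonstrate with plain Python. The
content-stream fragment below is a simplified, made-up example in which one word
is shown as three string pieces separated by kerning adjustments. Real streams
are usually compressed, and their strings may be hex-encoded or use a custom
font encoding, which makes matching harder still.

```python
import re

# Simplified, uncompressed content-stream fragment (hypothetical). The TJ
# operator shows an array of string pieces with kerning numbers between them.
content = b"BT /F1 12 Tf 72 720 Td [(Conf) -20 (iden) 15 (tial)] TJ ET"

# A naive byte search never sees the word as one contiguous string.
naive_hit = b"Confidential" in content

# Joining the literal string pieces recovers the word a reader sees on the page.
pieces = re.findall(rb"\(([^()]*)\)", content)
joined = b"".join(pieces)
```

Here `naive_hit` is `False` even though `joined` is `b"Confidential"`, so a
search-and-delete approach would report the page as clean while the word is
still drawn.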

For genuine redaction:

- Use a graphical PDF editor with a dedicated **redaction** tool, which removes
  the underlying content rather than merely drawing a black box over it. Then
  verify the result by searching and by inspecting any OCR layer.
- For **truly sensitive** documents, redact **physically**: print the document,
  black out the sensitive parts with a marker, then scan (and, if needed, OCR) the
  result. This severs any digital link to the original bytes.

## Defense in depth

pikepdf is one layer, not a complete solution. For untrusted input, combine it
with other measures appropriate to your threat model: malware scanning, rendering
the PDF to images and rebuilding it, sandboxing, and size/structure limits. And
always validate the result against the threat you are actually trying to defend
against.

See also {ref}`security` for notes on PDF password security and content
restrictions.