(jobs)=

# Batch operations with JobBuilder

qpdf, the library pikepdf is built on, ships a powerful command line program.
Most of what that command line tool can do is exposed through qpdf's *job*
interface: a single, declarative description of an operation -- encrypt, decrypt,
merge or split pages, linearize, recompress, optimize images, manage attachments,
overlay or underlay content, and so on. pikepdf binds this as
{class}`pikepdf.Job`, and {class}`pikepdf.JobBuilder` provides a fluent, Pythonic
way to assemble one.

## When to use a job

A job is the right tool for **high-level, whole-document tasks that you might
otherwise run from the `qpdf` command line**, especially when you want to apply
the same recipe to many PDFs:

- Encrypt or decrypt a batch of files.
- Merge several PDFs, or split one into per-page files.
- Linearize ("web-optimize") or recompress files to shrink them.
- Recompress images, flatten annotations, or strip metadata across a directory.

Because a job is just a specification, it is easy to build once and run against
thousands of files. The operation runs entirely inside qpdf's optimized C++
code, with no per-object round trips into Python.

A job is **not** the right tool for surgical, object-level edits. Jobs operate at
the granularity qpdf's command line offers -- whole pages, whole documents, whole
streams. They cannot reach inside a content stream to move a single text run,
rewrite one dictionary key, splice an object graph, or make a change that depends
on inspecting the PDF's contents first. For that, open the file as a
{class}`pikepdf.Pdf` and manipulate the object model directly. The two approaches
compose: you can run a job to produce an intermediate file, then open it for
fine-grained work, or vice versa.

:::{note}
`JobBuilder` is a convenience layer. Anything it can express, you could also
express by hand-writing qpdf's job JSON and passing it to {class}`pikepdf.Job`.
The builder exists so you do not have to: it translates familiar, snake_case
Python into qpdf's camelCase JSON, and lets you describe encryption with the same
{class}`pikepdf.Permissions` and {class}`pikepdf.Encryption` models used
elsewhere in pikepdf.
:::

## A first job

Every job needs an input and an output. Methods return the builder, so calls
chain:

```python
from pikepdf import JobBuilder

JobBuilder().input('in.pdf').output('out.pdf').linearize().run()
```

This is equivalent to running `qpdf --linearize in.pdf out.pdf`.

Use {meth}`~pikepdf.JobBuilder.empty` instead of `input()` to start from a blank
PDF (the equivalent of qpdf's `--empty`), and
{meth}`~pikepdf.JobBuilder.replace_input` to overwrite the input file in place.

## Encryption

Encryption permissions in qpdf's JSON are expressed as *restrictions* with a
specialized vocabulary that differs per key length. `JobBuilder` lets you use
pikepdf's allow-oriented {class}`pikepdf.Permissions` and
{class}`pikepdf.Encryption` instead:

```python
from pikepdf import JobBuilder, Permissions

JobBuilder().input('in.pdf').output('out.pdf').encrypt(
    owner_password='secret',
    user_password='',
    allow=Permissions(extract=False, modify_annotation=False),
).run()
```

You may also pass a fully-formed {class}`pikepdf.Encryption` object positionally,
which is convenient if you already construct one elsewhere:

```python
from pikepdf import Encryption

enc = Encryption(owner='secret', user='', allow=Permissions(extract=False))
JobBuilder().input('in.pdf').output('out.pdf').encrypt(enc).run()
```

40- and 128-bit RC4 encryption are weak and additionally require
{meth}`~pikepdf.JobBuilder.allow_weak_crypto`. To go the other way and remove
encryption, use {meth}`~pikepdf.JobBuilder.decrypt`.

## Merging and splitting pages

{meth}`~pikepdf.JobBuilder.add_pages` is repeatable; each call appends a source
file (and optional page range) to the selection. The special filename `'.'`
refers to the primary input file.

```python
# Concatenate the first 5 pages of a.pdf with all of b.pdf
JobBuilder().empty().output('merged.pdf') \
    .add_pages('a.pdf', '1-5') \
    .add_pages('b.pdf') \
    .run()
```

To split a file into one output per page, use
{meth}`~pikepdf.JobBuilder.split_pages` with a `%d` placeholder in the output
filename:

```python
JobBuilder().input('book.pdf').output('page-%d.pdf').split_pages().run()
```

:::{note}
qpdf's `--pages` operation (which `add_pages` drives) is **form-aware**: when the
sources contain interactive AcroForm fields, qpdf carries them across. This makes
{class}`pikepdf.Job`/`JobBuilder` a good choice for merging whole files from
disk. For in-memory, page-level form-aware copying on a `Pdf` you are actively
editing, use {meth}`pikepdf.Pdf.add_pages_from` instead -- see
{ref}`interactive_forms`.
:::

## Compression, images and content transforms

`JobBuilder` groups qpdf's many tuning knobs into a handful of methods:

```python
JobBuilder().input('in.pdf').output('out.pdf') \
    .compress(object_streams='generate', recompress_flate=True) \
    .optimize_images(min_width=100, jpeg_quality=85) \
    .run()
```

Other transforms each have a dedicated method, including
{meth}`~pikepdf.JobBuilder.flatten_annotations`,
{meth}`~pikepdf.JobBuilder.flatten_rotation`,
{meth}`~pikepdf.JobBuilder.generate_appearances`,
{meth}`~pikepdf.JobBuilder.coalesce_contents`,
{meth}`~pikepdf.JobBuilder.externalize_inline_images`, the content-removal
helpers ({meth}`~pikepdf.JobBuilder.remove_metadata`,
{meth}`~pikepdf.JobBuilder.remove_info`,
{meth}`~pikepdf.JobBuilder.remove_acroform`,
{meth}`~pikepdf.JobBuilder.remove_structure`,
{meth}`~pikepdf.JobBuilder.remove_page_labels`), page labels
({meth}`~pikepdf.JobBuilder.set_page_labels`), version pinning
({meth}`~pikepdf.JobBuilder.min_version`,
{meth}`~pikepdf.JobBuilder.force_version`), and reproducibility helpers
({meth}`~pikepdf.JobBuilder.deterministic_id`,
{meth}`~pikepdf.JobBuilder.static_id`).

## Attachments and overlays

Attachments and overlay/underlay sections are list-valued, so their `add_*`
methods are repeatable:

```python
JobBuilder().input('report.pdf').output('out.pdf') \
    .add_attachment('data.csv', mimetype='text/csv') \
    .add_overlay('watermark.pdf', repeat='1') \
    .run()
```

## The escape hatch

`JobBuilder` covers the common options with typed methods, but qpdf has a long
tail of scalar flags. {meth}`~pikepdf.JobBuilder.set` reaches any of them using
the same snake_case-to-camelCase convention. A boolean `True` enables a flag;
any other value is stringified:

```python
JobBuilder().input('in.pdf').output('out.pdf') \
    .set(no_warn=True, keep_files_open=False) \
    .run()
```

If you pass a name that is not a recognized qpdf job option, `set()` raises
`ValueError` immediately rather than producing JSON that qpdf would reject.

## Running, building, and inspecting

There are three terminal methods:

- {meth}`~pikepdf.JobBuilder.run` builds the job, validates the configuration
  (unless `validate=False`), and runs it. It returns the underlying
  {class}`pikepdf.Job`, so you can inspect `exit_code`, `has_warnings`, and
  `encryption_status` afterwards.
- {meth}`~pikepdf.JobBuilder.build` returns the {class}`pikepdf.Job` without
  running it. qpdf validates the specification during construction.
- {meth}`~pikepdf.JobBuilder.create_pdf` runs only the first stage and returns a
  {class}`pikepdf.Pdf`, for the staged workflow where you modify the PDF and then
  call {meth}`pikepdf.Job.write_pdf`.

`JobBuilder` performs only minimal local validation; qpdf is the source of truth
and raises {class}`pikepdf.JobUsageError` (or `RuntimeError` for malformed JSON)
for invalid configurations.

To see what a builder will send to qpdf -- handy for debugging, logging, or
caching a recipe -- use {meth}`~pikepdf.JobBuilder.to_json` (a `dict`) or
{meth}`~pikepdf.JobBuilder.to_json_str` (a string):

```python
>>> JobBuilder().input('in.pdf').output('out.pdf').linearize().to_json()
{'inputFile': 'in.pdf', 'outputFile': 'out.pdf', 'linearize': ''}
```

## Relationship to the qpdf command line

A `JobBuilder` specification maps almost one-to-one onto a `qpdf` command line,
because both funnel through the same qpdf job machinery. If you already know the
`qpdf` invocation you want, you can translate it directly, or skip the builder
entirely and pass an argv list to {class}`pikepdf.Job`:

```python
from pikepdf import Job

Job(['pikepdf', '--linearize', 'in.pdf', 'out.pdf']).run()
```

(The first list element is the program-name slot, like `argv[0]`; qpdf ignores
it. This runs in-process and does not shell out to a `qpdf` binary.)

For the full catalogue of options, see qpdf's own documentation on the
[command-line tool](https://qpdf.readthedocs.io/en/stable/cli.html) and the
[QPDFJob JSON format](https://qpdf.readthedocs.io/en/stable/qpdf-job.html).