Batch operations with JobBuilder

qpdf, the library pikepdf is built on, ships a powerful command-line program. Most of what that command-line tool can do is exposed through qpdf’s job interface: a single, declarative description of an operation – encrypt, decrypt, merge or split pages, linearize, recompress, optimize images, manage attachments, overlay or underlay content, and so on. pikepdf binds this as pikepdf.Job, and pikepdf.JobBuilder provides a fluent, Pythonic way to assemble one.

When to use a job

A job is the right tool for high-level, whole-document tasks that you might otherwise run from the qpdf command line, especially when you want to apply the same recipe to many PDFs:

  • Encrypt or decrypt a batch of files.

  • Merge several PDFs, or split one into per-page files.

  • Linearize (“web-optimize”) or recompress files to shrink them.

  • Recompress images, flatten annotations, or strip metadata across a directory.

Because a job is just a specification, it is easy to build once and run against thousands of files. The operation runs entirely inside qpdf’s optimized C++ code, with no per-object round trips into Python.
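
As a rough sketch of the batch pattern, the helper below pairs every PDF in a source directory with an output path, so that each pair can be fed to a fresh builder. The name plan_batch and the directory names are hypothetical and not part of pikepdf; the helper only computes paths.

```python
from pathlib import Path


def plan_batch(src_dir, dst_dir, suffix=".pdf"):
    """Pair each PDF in src_dir with a same-named output path in dst_dir.

    Hypothetical helper for illustration: it only computes paths and never
    touches pikepdf or the contents of the files.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    return [
        (str(p), str(dst / p.name))
        for p in sorted(src.iterdir())
        if p.is_file() and p.suffix.lower() == suffix
    ]


# Intended use, assuming the JobBuilder methods described in this document:
#
#     for infile, outfile in plan_batch("incoming", "optimized"):
#         JobBuilder().input(infile).output(outfile).linearize().run()
```

Keeping the path planning separate from the job itself means the recipe stays a single chained expression, and the list of pairs can be logged or filtered before any file is written.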

A job is not the right tool for surgical, object-level edits. Jobs operate at the granularity qpdf’s command line offers – whole pages, whole documents, whole streams. They cannot reach inside a content stream to move a single text run, rewrite one dictionary key, splice an object graph, or make a change that depends on inspecting the PDF’s contents first. For that, open the file as a pikepdf.Pdf and manipulate the object model directly. The two approaches compose: you can run a job to produce an intermediate file, then open it for fine-grained work, or vice versa.

Note

JobBuilder is a convenience layer. Anything it can express, you could also express by hand-writing qpdf’s job JSON and passing it to pikepdf.Job. The builder exists so you do not have to: it translates familiar, snake_case Python into qpdf’s camelCase JSON, and lets you describe encryption with the same pikepdf.Permissions and pikepdf.Encryption models used elsewhere in pikepdf.
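
The naming translation can be pictured with a small function. This is an illustrative sketch of the snake_case-to-camelCase convention only, not pikepdf's actual implementation; snake_to_camel is a hypothetical name, and real option names should be checked against qpdf's job JSON documentation.

```python
def snake_to_camel(name):
    """Convert a snake_case option name to camelCase.

    Illustrative only: the first word is kept as-is and each later word
    is capitalized, e.g. 'input_file' becomes 'inputFile'.
    """
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


print(snake_to_camel("input_file"))   # inputFile
print(snake_to_camel("output_file"))  # outputFile
print(snake_to_camel("linearize"))    # linearize
```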

A first job

Every job needs an input and an output. Methods return the builder, so calls chain:

from pikepdf import JobBuilder

JobBuilder().input('in.pdf').output('out.pdf').linearize().run()

This is equivalent to running qpdf --linearize in.pdf out.pdf.

Use empty() instead of input() to start from a blank PDF (the equivalent of qpdf’s --empty), and replace_input() to overwrite the input file in place.

Encryption

Encryption permissions in qpdf’s JSON are expressed as restrictions with a specialized vocabulary that differs per key length. JobBuilder lets you use pikepdf’s allow-oriented pikepdf.Permissions and pikepdf.Encryption instead:

from pikepdf import JobBuilder, Permissions

JobBuilder().input('in.pdf').output('out.pdf').encrypt(
    owner_password='secret',
    user_password='',
    allow=Permissions(extract=False, modify_annotation=False),
).run()

You may also pass a fully-formed pikepdf.Encryption object positionally, which is convenient if you already construct one elsewhere:

from pikepdf import Encryption

enc = Encryption(owner='secret', user='', allow=Permissions(extract=False))
JobBuilder().input('in.pdf').output('out.pdf').encrypt(enc).run()

40-bit and 128-bit RC4 encryption are weak; selecting either additionally requires a call to allow_weak_crypto(). To go the other way and remove encryption, use decrypt().

Merging and splitting pages

add_pages() is repeatable; each call appends a source file (and optional page range) to the selection. The special filename '.' refers to the primary input file.

# Concatenate the first 5 pages of a.pdf with all of b.pdf
JobBuilder().empty().output('merged.pdf') \
    .add_pages('a.pdf', '1-5') \
    .add_pages('b.pdf') \
    .run()

To split a file into one output per page, use split_pages() with a %d placeholder in the output filename:

JobBuilder().input('book.pdf').output('page-%d.pdf').split_pages().run()

Note

qpdf’s --pages operation (which add_pages drives) is form-aware: when the sources contain interactive AcroForm fields, qpdf carries them across. This makes pikepdf.Job/JobBuilder a good choice for merging whole files from disk. For in-memory, page-level form-aware copying on a Pdf you are actively editing, use pikepdf.Pdf.add_pages_from() instead – see Working with interactive forms.

Compression, images and content transforms

JobBuilder groups qpdf’s many tuning knobs into a handful of methods:

JobBuilder().input('in.pdf').output('out.pdf') \
    .compress(object_streams='generate', recompress_flate=True) \
    .optimize_images(min_width=100, jpeg_quality=85) \
    .run()

Other transforms each have a dedicated method, including flatten_annotations(), flatten_rotation(), generate_appearances(), coalesce_contents(), externalize_inline_images(), the content-removal helpers (remove_metadata(), remove_info(), remove_acroform(), remove_structure(), remove_page_labels()), page labels (set_page_labels()), version pinning (min_version(), force_version()), and reproducibility helpers (deterministic_id(), static_id()).

Attachments and overlays

Attachments and overlay/underlay sections are list-valued, so their add_* methods are repeatable:

JobBuilder().input('report.pdf').output('out.pdf') \
    .add_attachment('data.csv', mimetype='text/csv') \
    .add_overlay('watermark.pdf', repeat='1') \
    .run()

The escape hatch

JobBuilder covers the common options with typed methods, but qpdf has a long tail of scalar flags. set() reaches any of them using the same snake_case-to-camelCase convention. A boolean True enables a flag; any other value is stringified:

JobBuilder().input('in.pdf').output('out.pdf') \
    .set(no_warn=True, keep_files_open=False) \
    .run()

If you pass a name that is not a recognized qpdf job option, set() raises ValueError immediately rather than producing JSON that qpdf would reject.

Running, building, and inspecting

There are three terminal methods:

  • run() builds the job, validates the configuration (unless validate=False), and runs it. It returns the underlying pikepdf.Job, so you can inspect exit_code, has_warnings, and encryption_status afterwards.

  • build() returns the pikepdf.Job without running it. qpdf validates the specification during construction.

  • create_pdf() runs only the first stage and returns a pikepdf.Pdf, for the staged workflow where you modify the PDF and then call pikepdf.Job.write_pdf().

JobBuilder performs only minimal local validation; qpdf is the source of truth and raises pikepdf.JobUsageError (or RuntimeError for malformed JSON) for invalid configurations.

To see what a builder will send to qpdf – handy for debugging, logging, or caching a recipe – use to_json() (a dict) or to_json_str() (a string):

>>> JobBuilder().input('in.pdf').output('out.pdf').linearize().to_json()
{'inputFile': 'in.pdf', 'outputFile': 'out.pdf', 'linearize': ''}
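
Because the specification is a plain dict, it can be stored and reused with the standard json module. The following is a minimal sketch of caching a recipe and pointing a copy at different files; retarget is a hypothetical helper, and the step of handing the result to pikepdf.Job is not shown here.

```python
import json

# The same specification shown above, written by hand as a plain dict.
recipe = {"inputFile": "in.pdf", "outputFile": "out.pdf", "linearize": ""}


def retarget(recipe, infile, outfile):
    """Return a copy of a job recipe pointed at different files.

    Hypothetical helper: the original dict is left unmodified.
    """
    return {**recipe, "inputFile": infile, "outputFile": outfile}


# Round-trip through a JSON string, e.g. for a log line or a cache file.
serialized = json.dumps(recipe, sort_keys=True)
restored = json.loads(serialized)

other = retarget(restored, "a.pdf", "a-linearized.pdf")
```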

Relationship to the qpdf command line

A JobBuilder specification maps almost one-to-one onto a qpdf command line, because both funnel through the same qpdf job machinery. If you already know the qpdf invocation you want, you can translate it directly, or skip the builder entirely and pass an argv list to pikepdf.Job:

from pikepdf import Job

Job(['pikepdf', '--linearize', 'in.pdf', 'out.pdf']).run()

(The first list element is the program-name slot, like argv[0]; qpdf ignores it. This runs in-process and does not shell out to a qpdf binary.)
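
If the invocation already exists as a single shell-style string, the standard library's shlex module can split it into such a list. This is a minimal sketch; qpdf_cli_to_argv is a hypothetical helper, not a pikepdf function, and it performs no validation of the options themselves.

```python
import shlex


def qpdf_cli_to_argv(command_line):
    """Split a qpdf command line into an argv-style list.

    Hypothetical helper: quoting is handled by shlex (POSIX rules), and
    the leading program name is kept so it fills the argv[0] slot.
    """
    argv = shlex.split(command_line)
    if not argv:
        raise ValueError("empty command line")
    return argv


argv = qpdf_cli_to_argv("qpdf --linearize 'my report.pdf' out.pdf")
print(argv)  # ['qpdf', '--linearize', 'my report.pdf', 'out.pdf']
```

The resulting list has the same shape as the one passed to pikepdf.Job above, with quoted filenames preserved as single elements.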

For the full catalogue of options, see qpdf’s own documentation on the command-line tool and the QPDFJob JSON format.