Working with interactive forms
pikepdf provides two interfaces for working with interactive forms. There is a low-level
interface, pikepdf.AcroForm, which is exposed as the
pikepdf.Pdf.acroform property. There is also a higher-level interface available
in the pikepdf.form module, which provides several abstractions to make usage
easier.
The interactive forms discussed here are AcroForm forms – the standard, widely-supported form technology defined by the PDF specification. PDF also defines a second, incompatible form technology called XFA; see XFA forms below for why pikepdf (and open source PDF tooling generally) does not support it.
Extracting Form Data
It is relatively easy to extract basic form data from a PDF.
>>> from pikepdf.form import Form
>>> form = Form(pdf)
>>> data = {}
>>> for field_name, field in form.items():
... if field.is_text or field.is_choice or field.is_radio_button:
... data[field_name] = field.value
... elif field.is_checkbox:
... data[field_name] = field.checked
Inspecting the Form
The form allows retrieving specific named fields via dict-like access. There are several useful properties common to all fields. The most useful of these are:
alternate_name, which is a human-readable label for the field.fully_qualified_name, which is the machine-readable key which identifies this fieldis_requiredis_textis_checkboxis_radio_buttonis_pushbuttonis_choice
>>> field = form['MyField']
>>> field.fully_qualified_name
"MyField"
>>> field.alternate_name
"Applicant's first given name"
>>> field.is_text
True
>>> field.is_required
False
Fields with duplicate names are supported. Accessing them by name returns a list of fields
instead of a single field. Accessing attributes directly on this list (e.g. field.value)
will proxy to the first field in the list.
Filling Form Data
Before filling a form, you will need to determine how you will deal with appearance streams. In addition to merely holding values, PDF form fields must explicitly declare how the filled-in value should look. This is known as the appearance stream. There are several options available.
First, you may choose not to generate appearance streams at all. Most full-fat PDF readers
are capable of generating these appearance streams themselves, so depending on your use
case it may be acceptable to leave appearance stream generation to the end-user
application. This is the default behavior of the pikepdf.form.Form class.
If you do need or want to generate appearance streams, you must provide the class you wish
to use to accomplish this task. There are two possible implementations provided with
pikepdf: pikepdf.form.DefaultAppearanceStreamGenerator and
pikepdf.form.ExtendedAppearanceStreamGenerator. To use either of these, simply pass
the class as the second argument to the constructor:
>>> from pikepdf.form import Form, DefaultAppearanceStreamGenerator
>>> form = Form(pdf, DefaultAppearanceStreamGenerator)
The differences between these two options is explained in the documentation for each class.
Lastly, you may implement your own class for generating appearance streams that better
fits your specific use case. It must implement the interface provided by
pikepdf.form.AppearanceStreamGenerator.
After filling a form, you may also wish to flatten it. This converts the interactive form fields into normal, un-editable text. This can be done as follows:
pdf.flatten_annotations()
Generating appearance streams is required if you wish to flatten the form.
Text Fields
Text fields can either resemble an HTML text input, or an HTML textarea, as well as a
password field, file upload, or rich text input. pikepdf supports only the first two
options, which can be distinguished from one another using the is_multiline property.
The underlying value of the text field is stored in the value property. The field
may also have a default_value which should be used when resetting the form.
>>> text_field = form['MyTextField']
>>> text_field.is_multiline
False
>>> text_field.default_value
''
>>> text_field.value
''
>>> text_field.max_length
75
>>> text_field.value = "Hello World!"
Checkbox Fields
Checkbox fields behave somewhat similarly to what you might be familiar with working with
HTML forms in JavaScript. There is a checked property which will tell you if the box
is checked or not. If access to the underlying value is needed, it can be fetched via the
value property.
Unlike HTML checkboxes, however, there is a value for both the on and off states, and
thus value will return different values depending on if the box is checked or not. The
value for an off state will be a pikepdf.Name with the value “/Off”. The value for the
on state is variable, and can be retrieved from the on_value property.
>>> checkbox_field = form['MyCheckbox']
>>> checkbox.checked
False
>>> checkbox.value
pikepdf.Name("/Off")
>>> checkbox.on_value
pikepdf.Name("/Yes")
>> checkbox.states
(pikepdf.Name("/Yes"), pikepdf.Name("/Off"))
>>> checkbox.checked = True
>>> checkbox.value
pikepdf.Name("/Yes")
Choice Fields
Choice fields may be either list boxes or comboboxes, as determined by the is_combobox
property. If the field is a combobox, it may optionally have an editable text box attached
to it, as shown by the allows_edit property. Editable choice fields may store
arbitrary values, but otherwise choice fields are limited to those options which are
returned via the options property.
>>> field = form['MyChoiceField']
>>> field.is_combobox
True
>>> field.allows_edit
False
>>> field.options[0].display_name
"Pike"
>>> field.options[2].select()
>>> field.value
"Trout"
>>> field.value = "Pike"
Signature Fields
pikepdf does not support signature fields, but does include a utility function to stamp an image over the top of the field’s bounding box. The stamped image must be a PDF.
>>> form_pdf = Pdf.open(...)
>>> sig_pdf = Pdf.open(...)
>>> form = Form(form_pdf)
>>> form['MySigField'].stamp_overlay(sig_pdf.pages[0])
To stamp an image that is not already a PDF, you will need to use an image processing library, such as Pillow to convert it:
>>> from PIL import Image
>>> img = Image.open(img).convert('RGB')
>>> img_as_pdf = BytesIO()
>>> img.save(img_as_pdf, 'pdf')
>>> img_as_pdf.seek(0)
>>> sig_pdf = Pdf.open(img_as_pdf)
Copying pages between documents (form-aware)
A form field is more than the box you see on the page. The visible box is a
widget annotation that lives on the page, but the field it belongs to is
registered in the document-level /AcroForm dictionary and may participate in a
tree of related fields. Copying the page alone leaves the field behind.
Copying pages with pdf.pages.extend(src.pages) therefore does not carry
interactive form fields: the widget annotations land on the pages, but the
document-level /AcroForm is not updated, so the fields show blank in Adobe
Acrobat (they may still render in some lenient viewers). To make this
easy-to-miss data loss visible, pikepdf emits a pikepdf.FormCopyWarning
both when you extend() across documents with forms and when you save() a
document that has orphaned form widgets.
Use pikepdf.Pdf.add_pages_from() to copy pages and their form fields:
from pikepdf import Pdf
dest = Pdf.new()
with Pdf.open('a.pdf') as a, Pdf.open('b.pdf') as b:
result = dest.add_pages_from(a)
result = dest.add_pages_from(b)
print(result.renamed_fields) # fields auto-renamed to avoid collisions
dest.save('merged.pdf')
add_pages_from returns a pikepdf.PageCopyResult describing what
happened to the forms:
Fields whose names collide with fields already in the destination are renamed automatically (form field names must be unique within a document). The original→new mapping is reported in
renamed_fields.Independent top-level fields are only carried over when one of their widgets is on a copied page, so unrelated forms elsewhere in the source are not imported.
A field that shares a top-level ancestor with a copied field is carried as an entire subtree, even if some of its widgets live on pages you did not copy. Such partially-represented fields are listed in
partial_fieldsso you can decide whether the result is acceptable.
Stripping forms instead of copying them
Sometimes you want the pages but explicitly not the form data – for
example when flattening a document into a static archive, or to avoid importing
form structure you would otherwise have to reason about. Pass forms='strip'
for a hard guarantee that no form fields are carried across:
dest.add_pages_from(src, forms='strip')
With forms='strip', widgets are removed from the copied pages and nothing is
added to the destination /AcroForm, so no FormCopyWarning is emitted.
Whole-file merges
For merging whole files from disk, pikepdf.Job (the qpdf --pages job,
see Batch operations with JobBuilder) is also form-aware and may be more convenient. Reserve
add_pages_from for in-memory, page-level work on a Pdf you are actively
editing.
Repairing forms after manual edits
If you build or restructure a form by hand – adding, removing, or moving widget
annotations and field dictionaries directly – the pikepdf.AcroForm
object may hold a stale view of the form, and the document may end up internally
inconsistent. A few low-level helpers on pikepdf.AcroForm help recover:
invalidate_cache()discards qpdf’s cached field/annotation mapping, forcing it to be rebuilt from the current object graph after you have edited it.validate()checks the form for structural problems, and can repair some of them.transform_annotations()applies a transformation matrix to a page’s form annotations, which is needed when you reposition or rescale a page that carries form fields.
XFA forms
Not every “form” in a PDF is an AcroForm. Adobe also defined XFA (XML Forms
Architecture), a completely different, XML-based form technology that can be
embedded in a PDF inside the /AcroForm dictionary’s /XFA entry. pikepdf does
not support XFA forms, and this is a deliberate, structural limitation rather
than a missing feature.
A few things distinguish XFA from the AcroForm forms described above:
It is a separate, parallel representation. XFA describes the form’s fields, layout, calculations and validation as an XML document that is largely independent of the PDF page objects. So-called dynamic XFA forms can even regenerate their pages at render time, meaning the PDF page content you see in pikepdf may be little more than a placeholder. The PDF object model that pikepdf exposes simply is not where an XFA form’s logic lives.
It is Adobe-proprietary. XFA was developed by Adobe and was never part of the core ISO PDF object model. Fully processing an XFA form requires Adobe’s proprietary XFA engine; there is no complete open specification that an open source library could implement against.
It is deprecated. XFA was deprecated in PDF 2.0 (ISO 32000-2:2017) and removed from the 2020 revision of that standard. New PDFs should not use it, and tooling support for it is actively shrinking – even Adobe’s own viewers have curtailed it.
XFA documents often rely on Adobe-only cryptographic enablement. Many XFA forms are Reader-enabled: Adobe signs a usage-rights/Reader-Extensions block with a private key so that Adobe Reader unlocks features (such as saving filled data) that are otherwise disabled. Third-party software cannot reproduce or honor those signatures, and any edit invalidates them. There is no lawful or technical path for an open source tool to provide the full XFA experience.
In practice, if you encounter an XFA form, your realistic options are to ignore
the XFA layer and work only with whatever AcroForm fields coexist with it (many
static XFA forms also carry a usable AcroForm representation), or to strip the
/XFA entry to force viewers to fall back to the AcroForm fields. pikepdf treats
the XFA data as an opaque blob; it does not parse, render, or fill it.