PDF split, merge, and document assembly

This section discusses working with PDF pages: splitting, merging, copying, deleting. We’re treating pages as a unit, rather than working with the content of individual pages.

Let’s continue with fourpages.pdf from the Tutorial.

In [1]: from pikepdf import Pdf

In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

Split a PDF into one page PDFs

All we need is a new PDF to hold the destination page.

In [3]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

In [4]: for n, page in enumerate(pdf.pages):
   ...:     dst = Pdf.new()
   ...:     dst.pages.append(page)
   ...:     dst.save('{:02d}.pdf'.format(n))


This example transfers the data associated with each page, so that every page stands on its own. It does not transfer some metadata associated with the PDF as a whole, such as the list of bookmarks.

Merge (concatenate) PDF from several PDFs

We create an empty Pdf which will be the container for all the others.

In [5]: from glob import glob

In [6]: pdf = Pdf.new()

In [7]: for file in glob('*.pdf'):
   ...:     src = Pdf.open(file)
   ...:     pdf.pages.extend(src.pages)

In [8]: pdf.save('merged.pdf')

This code sample is enough to merge most PDFs, but a more sophisticated function might do more. One could call pikepdf.Pdf.remove_unreferenced_resources() to remove unreferenced resources. It may also be necessary to choose the most recent PDF version among all the source PDFs. Here is a more sophisticated example:

In [9]: from glob import glob

In [10]: pdf = Pdf.new()

In [11]: version = pdf.pdf_version

In [12]: for file in glob('*.pdf'):
   ....:     src = Pdf.open(file)
   ....:     version = max(version, src.pdf_version)
   ....:     pdf.pages.extend(src.pages)

In [13]: pdf.remove_unreferenced_resources()

In [14]: pdf.save('merged.pdf', min_version=version)

This improved example would still leave metadata blank. It’s up to you to decide how to combine metadata from multiple PDFs.

Reversing the order of pages

Suppose the file was scanned backwards, a common problem with automatic document scanners. We can easily reverse it in place.

In [15]: pdf.pages.reverse()

In [16]: pdf
Out[16]: <pikepdf.Pdf description='../tests/resources/fourpages.pdf'>

Pretty nice, isn’t it? But the pages in this file already were in correct order, so let’s put them back.

In [17]: pdf.pages.reverse()

Copying pages from other PDFs

Now, let’s add some content from another file. Because pdf.pages behaves like a list, we can use pages.extend() on another file’s pages.

In [18]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

In [19]: appendix = Pdf.open('../tests/resources/sandwich.pdf')

In [20]: pdf.pages.extend(appendix.pages)

We can use pages.insert() to insert one or more pages at a specific position, bumping everything else ahead.

Copying pages between Pdf objects will create a shallow copy of the source page within the target Pdf, rather than the typical Python behavior of creating a reference. As such, modifying pdf.pages[-1] will not affect appendix.pages[0]. (Normally, assigning objects between Python lists creates a reference, so that the two objects are identical, list[0] is list[1].)

In [21]: graph = Pdf.open('../tests/resources/graph.pdf')

In [22]: pdf.pages.insert(1, graph.pages[0])

In [23]: len(pdf.pages)
Out[23]: 6

We can also replace specific pages with assignment (or slicing).

In [24]: congress = Pdf.open('../tests/resources/congress.pdf')

In [25]: pdf.pages[2].objgen
Out[25]: (4, 0)

In [26]: pdf.pages[2] = congress.pages[0]

In [27]: pdf.pages[2].objgen
Out[27]: (33, 0)

The method above will break any indirect references (such as table of contents entries and hyperlinks) within pdf to pdf.pages[2], as the change in pikepdf.Object.objgen shows. Perhaps that is the behavior you want, if the replacement means those references are no longer valid.

Emplacing pages

To preserve indirect references, use pikepdf.Object.emplace(), which will (conceptually) delete all of the content of target and replace it with the content of source, thus preserving indirect references to the page. (Think of this as demolishing the interior of a house, but keeping it at the same address.)

In [28]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

In [29]: congress = Pdf.open('../tests/resources/congress.pdf')

In [30]: pdf.pages[2].objgen
Out[30]: (5, 0)

In [31]: pdf.pages.append(congress.pages[0])  # Transfer page to new pdf

In [32]: pdf.pages[2].emplace(pdf.pages[-1])

In [33]: del pdf.pages[-1]  # Remove donor page

In [34]: pdf.pages[2].objgen
Out[34]: (5, 0)

Copying pages within a PDF

As you may have guessed, we can assign pages to copy them within a Pdf:

In [35]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

In [36]: pdf.pages[3] = pdf.pages[0]  # Replace the last page with a copy of the first

As above, copying a page creates a shallow copy rather than a Python object reference.

Also as above, pikepdf.Object.emplace() can be used to create a copy that preserves the functionality of indirect references within the PDF.

Using counting numbers

Because PDF pages are usually numbered in counting numbers (1, 2, 3…), pikepdf provides a convenience accessor .p() that uses counting numbers:

In [37]: pdf.pages.p(1)        # The first page in the document

In [38]: pdf.pages[0]          # Also the first page in the document

In [39]: pdf.pages.remove(p=1)   # Remove first page in the document

To avoid confusion, the .p() accessor does not accept Python slices, and .p(0) raises an exception. It is also not possible to delete a page through .p(); use pdf.pages.remove(p=...) instead, as shown above.

PDFs may define their own numbering scheme or different numberings for different sections, such as using Roman numerals for an introductory section. .pages does not look up this information.

Pages information from Root


It’s possible to obtain page information through the pikepdf.Pdf.Root object, but this is not recommended. (In PDF parlance, this is the /Root object.)

The internal consistency of the various /Page and /Pages objects is not guaranteed when accessed in this manner, and in some PDFs the data structure for these is fairly complex. Use the .pages interface instead.