Content streams

In PDF, drawing operations are all performed in content streams that describe the positioning and drawing order of all graphics (including text, images and vector drawing).

pikepdf (and libqpdf) provide two tools for interpreting content streams: a parser and a token filter. The parser returns higher-level information, conveniently grouping each command with its operands. It is useful for retrieving information from a content stream, such as determining the position of an element. The parser should not be used to edit or reconstruct the content stream because some subtleties are lost in parsing.

The token filter works at a lower level, considering each token including comments, and distinguishing different types of spaces. This allows modifying content streams. A TokenFilter must be subclassed; the specialized version describes how it should transform the stream of tokens.

Content stream parsers

pikepdf.parse_content_stream(page_or_stream, operators='')

Parse a PDF content stream into a sequence of instructions.

A PDF content stream is a list of instructions that describe where to render the text and graphics in a PDF. This is the starting point for analyzing PDFs.

If the input is a page and page.Contents is an array, then the content stream is automatically treated as one coalesced stream.

Each instruction contains at least one operator and zero or more operands.
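To make the grouping of operands with their operator concrete, here is a toy sketch in plain Python. It does not use pikepdf and is not how pikepdf parses content streams: it splits on whitespace only and ignores strings, arrays, dictionaries, comments, and inline images. The helper names is_number and group_instructions are hypothetical.

```python
def is_number(token: str) -> bool:
    """Return True if the token parses as a number (toy check only)."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def group_instructions(stream: str) -> list[tuple[list[str], str]]:
    """Group whitespace-separated tokens into (operands, operator) pairs.

    Numbers and names (tokens starting with "/") are treated as operands.
    Any other token is treated as the operator that closes the instruction.
    """
    instructions = []
    operands = []
    for token in stream.split():
        if token.startswith("/") or is_number(token):
            operands.append(token)
        else:
            instructions.append((operands, token))
            operands = []
    return instructions


example = "q 100 0 0 100 50 50 cm /Im0 Do Q"
grouped = group_instructions(example)
for operands, operator in grouped:
    print(operator, operands)
```

The operands accumulate until an operator arrives, which mirrors the postfix order of PDF content streams.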

This function does not have anything to do with opening a PDF file itself or processing data from a whole PDF. It is for processing a specific object inside a PDF that is already opened.

Parameters
  • page_or_stream (Union[pikepdf.objects.Object, pikepdf._qpdf.Page]) – A page object, or the content stream attached to another object such as a Form XObject.

  • operators (str) – A space-separated string of operators to whitelist. For example ‘q Q cm Do’ will return only operators that pertain to drawing images. Use ‘BI ID EI’ for inline images. All other operators and associated tokens are ignored. If blank, all tokens are accepted.

Return type

List[Union[pikepdf._qpdf.ContentStreamInstruction, pikepdf._qpdf.ContentStreamInlineImage]]

Example

>>> import pikepdf
>>> with pikepdf.Pdf.open(input_pdf) as pdf:
...     page = pdf.pages[0]
...     for operands, command in pikepdf.parse_content_stream(page):
...         print(command)

Changed in version 3.0: Returns a list of ContentStreamInstructions instead of a list of (operand, operator) tuples. The returned items are duck-type compatible with the previous returned items.

pikepdf.unparse_content_stream(instructions)

Convert a parsed list of instructions (or operand-operator tuples) to bytes suitable for embedding in a PDF. In PDF the operator always follows the operands.

Parameters

instructions (Collection[Union[pikepdf._qpdf.ContentStreamInstruction, pikepdf._qpdf.ContentStreamInlineImage, Tuple[Collection[Union[pikepdf.objects.Object, pikepdf.models.image.PdfInlineImage]], pikepdf.objects.Operator]]]) – collection of instructions such as is returned by parse_content_stream()

Returns

A binary content stream, suitable for attaching to a Pdf. To attach to a Pdf, use Pdf.make_stream().

Return type

bytes

Changed in version 3.0: Now accept collections that contain any mixture of ContentStreamInstruction, ContentStreamInlineImage, and the older operand-operator tuples from pikepdf 2.x.

Content stream token filters

class pikepdf.Token
property raw_value

The binary representation of a token.

Return type:

bytes

property type_

Returns the type of token.

Return type:

pikepdf.TokenType

property value

Interprets the token as a string.

Return type:

str or bytes

class pikepdf.TokenType

When filtering content streams, each token is labeled according to the role it plays.

Standard tokens

array_open
array_close
brace_open
brace_close
dict_open
dict_close

These tokens mark the start and end of an array ([ and ]), a brace-delimited group ({ and }), and a dictionary (<< and >>), respectively.

integer
real
null
bool

The token data represents an integer, real number, null or boolean, respectively.

name_

The token is the name (pikepdf.Name) of an object. In practice, these are among the most interesting tokens.

Changed in version 3.0: In versions older than 3.0, .name was used instead. This interfered with the semantics of the Enum object, so it was renamed.

inline_image

An inline image in the content stream. The whole inline image is represented by the single token.

Lexical tokens

comment

Signifies a comment that appears in the content stream.

word

Otherwise uncategorized bytes are returned as word tokens. PDF operators are words.

bad

An invalid token.

space

Whitespace within the content stream.

eof

Denotes the end of the tokens in this content stream.

class pikepdf.TokenFilter
handle_token(self: pikepdf.TokenFilter, token: pikepdf.Token = pikepdf.Token()) -> object

Handle a pikepdf.Token.

This is an abstract method that must be defined in a subclass of TokenFilter. The method will be called for each token. The implementation may return either None to discard the token, the original token to include it, a new token, or an iterable containing zero or more tokens. An implementation may also buffer tokens and release them in groups (for example, it could collect an entire PDF command with all of its operands, and then return all of it).

The final token will always be a token of type TokenType.eof (unless an exception is raised).

If this method raises an exception, the exception will be caught by C++, consumed, and replaced with a less informative exception. Use pikepdf.Pdf.get_warnings() to view the original.

Return type:

None or list or pikepdf.Token