Content streams

In PDF, drawing operations are all performed in content streams that describe the positioning and drawing order of all graphics (including text, images and vector drawing).

Content stream parsers

pikepdf.parse_content_stream(page_or_stream, operators='')

Parse a PDF content stream into a sequence of instructions.

A PDF content stream is a list of instructions that describe where to render the text and graphics in a PDF. This is the starting point for analyzing PDFs.

If the input is a page and page.Contents is an array, then the content stream is automatically treated as one coalesced stream.

Each instruction contains at least one operator and zero or more operands.

This function does not have anything to do with opening a PDF file itself or processing data from a whole PDF. It is for processing a specific object inside a PDF that is already opened.

Parameters:

page_or_stream (pikepdf._core.Object | pikepdf._core.Page) – A page object, or the content stream attached to another object such as a Form XObject.
operators (str) – A space-separated string of operators to whitelist. For example ‘q Q cm Do’ will return only operators that pertain to drawing images. Use ‘BI ID EI’ for inline images. All other operators and associated tokens are ignored. If blank, all tokens are accepted.

Return type:

list[ContentStreamInstructions]

Example

>>> with pikepdf.Pdf.open("../tests/resources/pal-1bit-trivial.pdf") as pdf:
...     page = pdf.pages[0]
...     for operands, command in pikepdf.parse_content_stream(page):
...         print(command)
q
cm
Do
Q

Changed in version 3.0: Returns a list of ContentStreamInstructions instead of a list of (operand, operator) tuples. The returned items are duck-type compatible with the previous returned items.

pikepdf.unparse_content_stream(instructions)

Convert a collection of instructions to bytes suitable for storing in a PDF.

Given a parsed list of instructions/operand-operators, convert to bytes suitable for embedding in a PDF. In PDF the operator always follows the operands.

Parameters:: instructions (collections.abc.Collection[UnparseableContentStreamInstructions]) – collection of instructions such as is returned by parse_content_stream()
Returns:: A binary content stream, suitable for attaching to a Pdf. To attach to a Pdf, use Pdf.make_stream().
Return type:: bytes

Changed in version 3.0: Now accepts collections that contain any mixture of ContentStreamInstruction, ContentStreamInlineImage, and the older operand-operator tuples from pikepdf 2.x.

class pikepdf.models.ctm.MatrixStack(initial_matrix=None)

Tracks the CTM (current transformation matrix) in a PDF content stream.

The CTM starts as the initial matrix and can be changed via the ‘cm’ (concatenate matrix) operator: CTM = CTM x CM, where CTM and CM are 3x3 matrices. The initial matrix is the identity matrix unless overridden.

The CTM can also be saved to the stack via the ‘q’ operator. This saves the CTM, and subsequent ‘cm’ operators change a copy of it. For example, ‘q 1 0 0 1 0 0 cm’ copies the CTM onto the stack and then changes the copy via ‘cm’.

With the ‘Q’ operator the current CTM is replaced with the previous one from the stack.

Error handling:

1. Popping from an empty stack results in the CTM being set to the initial matrix.
2. Multiplying with invalid operands sets the CTM to invalid.
3. Multiplying an invalid CTM with a valid CM results in an invalid CTM.
4. Stacking an invalid CTM pushes a copy of that invalid CTM onto the stack.

Consequently, all operations with an invalid CTM result in an invalid CTM, and the CTM becomes valid again once all invalid CTMs have been popped off the stack.

Parameters:: initial_matrix (pikepdf._core.Matrix | None)
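The stacking and error-handling rules can be modelled in a few lines of plain Python. This is a hypothetical sketch, not the implementation or method names of MatrixStack: the class CtmStackSketch, its methods, and the helper concat are invented for illustration, matrices are 6-tuples in PDF operand order [a b c d e f], and None stands for an invalid CTM. The sketch composes matrices so that the ‘cm’ operand is applied before the existing CTM, which is an assumption about how the product is to be read.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)  # PDF matrix operands [a b c d e f]


def concat(cm, ctm):
    """Compose two affine matrices so that cm is applied first, then ctm."""
    a1, b1, c1, d1, e1, f1 = cm
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


class CtmStackSketch:
    """Hypothetical stand-in for the rules above; None marks an invalid CTM."""

    def __init__(self, initial=IDENTITY):
        self._initial = initial
        self._stack = [initial]  # the last entry is the current CTM

    @property
    def ctm(self):
        return self._stack[-1]

    def save(self):
        """'q': push a copy of the current CTM, valid or not."""
        self._stack.append(self._stack[-1])

    def restore(self):
        """'Q': drop the current CTM; reset to the initial matrix if empty."""
        self._stack.pop()
        if not self._stack:
            self._stack.append(self._initial)

    def apply_cm(self, operands):
        """'cm': concatenate, or invalidate the CTM on bad operands."""
        valid = len(operands) == 6 and all(
            isinstance(v, (int, float)) for v in operands
        )
        if self.ctm is None or not valid:
            self._stack[-1] = None
        else:
            self._stack[-1] = concat(tuple(operands), self.ctm)


stack = CtmStackSketch()
trace = []

stack.save()
stack.apply_cm([1, 0, 0, 1, 10, 20])  # valid translation
trace.append(stack.ctm)

stack.save()
stack.apply_cm(["bad"])  # invalid operands invalidate the CTM
trace.append(stack.ctm)

stack.apply_cm([1, 0, 0, 1, 5, 5])  # a valid CM cannot repair it
trace.append(stack.ctm)

stack.restore()  # the invalid CTM is popped, the saved one returns
trace.append(stack.ctm)

stack.restore()  # back to the initial matrix
trace.append(stack.ctm)

stack.restore()  # unbalanced 'Q': reset to the initial matrix
trace.append(stack.ctm)

print(trace)
```

Keeping the invalid state as a value on the stack, rather than raising an exception, lets a caller continue through a damaged stream and resume tracking once the matching ‘Q’ operators restore a valid matrix.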

pikepdf.models.ctm.get_objects_with_ctm(page, initial_matrix=None)

Determines the current transformation matrix (CTM) for each drawn object.

Filters objects with an invalid CTM.

Parameters:

page (pikepdf._core.Page)
initial_matrix (pikepdf._core.Matrix | None)

Return type:

list[tuple[str, pikepdf._core.Matrix]]

Content stream token filters

class pikepdf.Token(arg0, arg1, /)

Parameters:

arg0 (TokenType)
arg1 (bytes)

property error_msg: str

If the token is an error, this returns the error message.

Return type:: str

property raw_value: bytes

The binary representation of a token.

Return type:: bytes

property type_: TokenType

Returns the type of token.

Return type:: TokenType

property value: str

Interprets the token as a string.

Return type:: str

class pikepdf.TokenType(*args, **kwds)

Type of a token that appeared in a PDF content stream.

When filtering content streams, each token is labeled according to the role it plays.

array_close: Ellipsis: The token data represents the end of an array.

array_open: Ellipsis: The token data represents the start of an array.

bad: Ellipsis: An invalid token.

bool: Ellipsis: The token data represents an integer, real number, null or boolean, respectively.

brace_close: Ellipsis: The token data represents the end of a brace.

brace_open: Ellipsis: The token data represents the start of a brace.

comment: Ellipsis: Signifies a comment that appears in the content stream.

dict_close: Ellipsis: The token data represents the end of a dictionary.

dict_open: Ellipsis: The token data represents the start of a dictionary.

eof: Ellipsis: Denotes the end of the tokens in this content stream.

inline_image: Ellipsis: An inline image in the content stream. The whole inline image is represented by the single token.

integer: Ellipsis: The token data represents an integer.

name_: Ellipsis: The token is the name (pikepdf.Name) of an object. In practice, these are among the most interesting tokens.

Changed in version 3.0: In versions older than 3.0, .name was used instead. This interfered with semantics of the Enum object, so this was fixed.

null: Ellipsis: The token data represents a null.

real: Ellipsis: The token data represents a real number.

space: Ellipsis: Whitespace within the content stream.

string: Ellipsis: The token data represents a string. The encoding is unclear and situational.

word: Ellipsis: Otherwise uncategorized bytes are returned as word tokens. PDF operators are words.

class pikepdf.TokenFilter

handle_token(token=...)

Handle a pikepdf.Token.

This is an abstract method that must be defined in a subclass of TokenFilter. The method will be called for each token. The implementation may return either None to discard the token, the original token to include it, a new token, or an iterable containing zero or more tokens. An implementation may also buffer tokens and release them in groups (for example, it could collect an entire PDF command with all of its operands, and then return all of it).

The final token will always be a token of type TokenType.eof (unless an exception is raised).

If this method raises an exception, the exception will be caught by C++, consumed, and replaced with a less informative exception. Use pikepdf.Pdf.get_warnings() to view the original.

Parameters:: token (Token)
Return type:: None | Token | collections.abc.Iterable[Token]