Reading
Content extraction
Text extraction
use PhpPdf\Reader\PdfTextExtractor;
$doc = PdfDocumentReader::open('/path/to/file.pdf');
$extractor = new PdfTextExtractor($doc);
for ($i = 0; $i < $doc->getPageCount(); $i++) {
$text = $extractor->getTextForPage($i); // string
echo "--- Page " . ($i + 1) . " ---\n";
echo $text . "\n";
}
Font support
| Font type | Decoding method |
|---|---|
| Type 0 / CID | ToUnicode CMap |
| Type 1, TrueType (WinAnsi) | WinAnsiEncoding / Latin-1 fallback |
Glyphs with purely custom or glyph-substituted encodings may not extract correctly.
Text operators recognised
Tj, TJ, ', " — show text.
Td, TD, Tm, T* — position and line breaks.
Large negative kerning values in TJ arrays (below −200) are treated as word breaks.
Image extraction
use PhpPdf\Reader\PdfImageExtractor;
$extractor = new PdfImageExtractor($doc);
// Images on a specific page (0-based)
$images = $extractor->getImagesForPage(0);
// All unique images across the whole document
$images = $extractor->getAllImages();
foreach ($images as $image) {
echo $image->name . "\n"; // resource name, e.g. "Im1"
echo $image->width . "×" . $image->height . "\n";
echo $image->colorSpace . "\n"; // e.g. "DeviceRGB"
echo $image->bitsPerComponent . "\n";
// Raw decoded pixel bytes (or JPEG bytes for DCTDecode images)
$bytes = $image->data;
// Write to file
file_put_contents('/tmp/' . $image->name . '.' . $image->getFileExtension(), $image->toFileBytes());
}
getAllImages() deduplicates shared images — an image referenced from multiple pages appears only once.
PdfExtractedImage methods:
| Method | Returns |
|---|---|
isJpeg() | true if the data is a JPEG byte stream |
getFileExtension() | 'jpg' or 'png' |
toFileBytes() | Raw JPEG bytes or a valid PNG file |
toPng() | Always returns a PNG (wraps raw pixels; converts RGBA with SMask) |
Annotation extraction
use PhpPdf\Reader\PdfAnnotationExtractor;
$extractor = new PdfAnnotationExtractor($doc);
// Annotations on a single page
$annotations = $extractor->getAnnotationsForPage(0);
// All annotations in the document
$annotations = $extractor->getAllAnnotations();
foreach ($annotations as $ann) {
echo $ann->type->value; // e.g. "Link", "Text", "Highlight"
echo $ann->x . ', ' . $ann->y . ' ' . $ann->width . '×' . $ann->height . "\n";
if ($ann->isUriLink()) {
echo 'URL: ' . $ann->uri . "\n";
}
}
PdfAnnotation properties
| Property | Type | Description |
|---|---|---|
type | PdfAnnotationType | Annotation subtype |
x, y | float | Bottom-left corner |
width, height | float | Bounding box size |
contents | ?string | Annotation text/tooltip |
title | ?string | Popup title |
color | ?array{float,float,float} | RGB color |
interiorColor | ?array{float,float,float} | Interior fill RGB |
borderWidth | float | Border line width |
quadPoints | ?list<float> | Quad points for text markup |
uri | ?string | URI for Link annotations |
open | ?bool | Whether the popup is open |