I am working on a project where I have to extract a drawing from a PDF file. The page is extracted and converted, but I don't want the complete page as an image. I want only the drawing (the screw picture, i.e. Type: Panhead, Thread pitch = 0.35 mm) with all its labels, and everything else removed.
I tried to crop that part of the PDF, but cropping only hides the rest of the page: the canvas size and the other content stay as they are and are simply not visible, which leaves a lot of blank space around the target image.
My aim is to process multiple PDF files, extract the drawing from each while removing everything else, and save the drawing as an SVG file. The PDF used for this explanation is here (https://www.es.co.th/Schemetic/PDF/SCREW-M3-XX.PDF), and I am also uploading the target image that I want. Below is the part of the code that might give you the idea:
import fitz  # legacy import name of PyMuPDF

doc = fitz.open(pdf_path)
page = doc[page_number]
# Attempted crop (commented out): it only hides the rest of the page
# crop_rect = fitz.Rect(60, 80, 595, 695)
# page.set_cropbox(crop_rect)
# Convert the page to SVG
svg_content = page.get_svg_image(matrix=fitz.Matrix(1, 1))
PyMuPDF says that currently, when converting from PDF to SVG, the whole page will be converted. That is understandable, as all vectors are positioned relative to the PDF origin, which therefore needs to be included.
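To illustrate why a crop alone leaves the unwanted content in the output, here is a minimal sketch using only the Python standard library. The tiny SVG string is a hypothetical stand-in for a converted page, not output from PyMuPDF; shrinking its `viewBox` changes what is visible but every path is still stored in the file.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical stand-in for a converted page: one path for the drawing
# and one unwanted path far away from it.
svg_text = (
    f'<svg xmlns="{SVG_NS}" width="600" height="800" viewBox="0 0 600 800">'
    '<path id="drawing" d="M 100 100 L 200 200"/>'
    '<path id="unwanted" d="M 500 700 L 550 750"/>'
    "</svg>"
)

root = ET.fromstring(svg_text)
paths_before = len(root.findall(f"{{{SVG_NS}}}path"))

# "Crop" by shrinking the visible window to the drawing only.
root.set("viewBox", "90 90 120 120")
root.set("width", "120")
root.set("height", "120")

cropped = ET.tostring(root, encoding="unicode")
paths_after = len(ET.fromstring(cropped).findall(f"{{{SVG_NS}}}path"))

# The unwanted path is hidden by the new viewBox but still in the file.
print(paths_before, paths_after)
print("unwanted" in cropped)
```

The same applies to a PDF crop box: the window changes, the stored objects do not, so the objects have to be removed separately.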
As with most PDF croppers, simply masking the area outside a region will not remove the objects there. What is needed is a two-step approach, redaction followed by a crop, to get a reduced file, like this.
I use a batch file with portable Python, but you can see the file size drops from 1542 KB down to either a 46 KB .PDF or a 56 KB .SVG.
import pymupdf # NOT FitZ
import sys
import os

def apply_redactions(pdf_path, x1, y1, x2, y2, output_pdf):
    doc = pymupdf.open(pdf_path)
    page = doc[0]
    # May help to flatten annotations or not ?
    page.wrap_contents()
    # Define ROI (Region Of Interest) with page voids
    void = page.rect
    ROI = [
        pymupdf.Rect(void.x0, void.y0, void.x1, void.y1-y2), pymupdf.Rect(x2, void.y0, void.x1, void.y1),
        pymupdf.Rect(void.x0, void.y1-y1, void.x1, void.y1), pymupdf.Rect(void.x0, void.y0, x1, void.y1),
    ]
    # For testing RedActions above use as = for rect in ROI: page.add_redact_annot(rect, cross_out=True, fill=(1, 0, 0))
    for rect in ROI:
        page.add_redact_annot(rect)
    page.apply_redactions(
        images=pymupdf.PDF_REDACT_IMAGE_PIXELS | 2,
        graphics=pymupdf.PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED | 2,
        text=pymupdf.PDF_REDACT_TEXT_REMOVE | 0,
    )
    # Flush RedActions. Is this overkill as reload page should do similar, or could we simply iterate using Annot.clean_contents(sanitize=True)
    temp_redacted_pdf = output_pdf.replace(".pdf", "_temp_redacted.pdf")
    # Does this need to be only 3, since 4 is slow per documentation ?
    doc.save(temp_redacted_pdf, garbage=4, deflate=True)
    doc = pymupdf.open(temp_redacted_pdf)
    page = doc[0]
    new_mediabox = pymupdf.Rect(x1, y1, x2, y2)
    page.set_mediabox(new_mediabox)
    doc.save(output_pdf, garbage=1)
    print(f"Reduced PDF saved: {output_pdf}")
    output_svg = output_pdf.replace(".pdf", ".svg")
    svg_content = page.get_svg_image(matrix=pymupdf.Matrix(1, 1))
    with open(output_svg, "w") as f:
        f.write(svg_content)
    print(f"Reduced SVG saved: {output_svg}")
    doc.close()
    os.remove(temp_redacted_pdf)

if __name__ == "__main__":
    if len(sys.argv) != 7:
        print("Usage: python reduce.py input.pdf x1 y1 x2 y2 output.pdf")
    else:
        apply_redactions(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]), int(sys.argv[4]), int(sys.argv[5]), sys.argv[6])
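The four redaction rectangles are probably the least obvious part of the script, so here is a small sketch of the same arithmetic using plain tuples instead of `pymupdf.Rect`. The helper name `border_rects` is hypothetical. The `void.y1 - y2` and `void.y1 - y1` terms suggest that the y values on the command line are measured from the bottom of the page and flipped to a top-left origin for the redactions; that is my reading of the code, not something stated above.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def border_rects(page: Rect, x1: float, y1: float, x2: float, y2: float) -> List[Rect]:
    """Return four strips covering the page area outside the kept region.

    Mirrors the ROI list in the script: x1, y1, x2, y2 are the region to
    keep, with y assumed to be measured from the bottom of the page.
    The returned strips use the page rectangle's own (top-left) axes.
    """
    px0, py0, px1, py1 = page
    return [
        (px0, py0, px1, py1 - y2),  # strip above the region
        (x2, py0, px1, py1),        # strip right of the region
        (px0, py1 - y1, px1, py1),  # strip below the region
        (px0, py0, x1, py1),        # strip left of the region
    ]


# Example with an A4-sized page (595 x 842 points) and a made-up region.
strips = border_rects((0, 0, 595, 842), 60, 80, 535, 700)
for strip in strips:
    print(strip)
```

The strips overlap in the corners, which should be harmless for redaction since each one only marks an area for removal.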
If you need to ensure the SVG is well reconstructed, you can, as suggested by John Whitington, run PDF2SVG on the reduced PDF, but the output will likely be a bigger file (95 KB here).
So a command-line cropper such as MuTool followed by a command-line conversion may be better, as it is just 2 commands.
mutool trim
is not intuitive but
mutool convert -o out.svg SCREW-M3-XX-reduced.pdf
works well to produce the 56 KB output from the PyMuPDF reduction.
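Since the question mentions multiple PDF files, the conversion command above could be generated per file. The following is a rough sketch only: the helper names are hypothetical, it assumes `mutool` is on the PATH, and it defaults to a dry run that just returns the commands. The exact name of the written SVG may vary with the MuPDF version, so check the output folder after a first run.

```python
import subprocess
from pathlib import Path
from typing import List


def mutool_convert_cmd(pdf_path: Path) -> List[str]:
    """Build the 'mutool convert -o out.svg in.pdf' command for one file."""
    svg_path = pdf_path.with_suffix(".svg")
    return ["mutool", "convert", "-o", str(svg_path), str(pdf_path)]


def convert_folder(folder: Path, dry_run: bool = True) -> List[List[str]]:
    """Build (and optionally run) one conversion command per PDF in folder.

    Note: the glob is case-sensitive on some systems, so files ending
    in .PDF may need their own pattern.
    """
    cmds = [mutool_convert_cmd(p) for p in sorted(folder.glob("*.pdf"))]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # raises if mutool fails
    return cmds


example_cmd = mutool_convert_cmd(Path("SCREW-M3-XX-reduced.pdf"))
print(" ".join(example_cmd))
```

Calling `convert_folder(Path("reduced"), dry_run=False)` would then run the conversion for every reduced PDF in a folder named `reduced` (a made-up name).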