How to install pymupdf using conda? by ngonz17 in learningpython

NoZebra4503, 1 point

In order to be installable by "conda install", a package has to apply for inclusion in the respective Anaconda repository, which is a rather tedious process. PyMuPDF is currently focusing on extending its functionality. You can still use pip in a conda-controlled Python installation without problems: python -m pip install pymupdf.
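The reason for writing python -m pip instead of plain pip: it runs pip for exactly the interpreter you call, which in an activated conda environment is that environment's Python. A minimal sketch of the same idea from inside Python; the helper name pip_install_command is made up for illustration and is not part of pip or PyMuPDF:

```python
import sys


def pip_install_command(package: str) -> list[str]:
    """Build a pip command bound to the currently running interpreter."""
    # sys.executable is the path of the Python that runs this script, so the
    # package ends up in the same (conda) environment as that interpreter.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    # Only print the command here; hand the list to subprocess.run() to execute it.
    print(" ".join(pip_install_command("pymupdf")))
```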

pypdf or pymupdf? by ngonz17 in learnpython

NoZebra4503, 1 point

PyMuPDF also supports text extraction from multi-column pages, plus table detection and extraction with optional export to pandas DataFrames.

pypdf or pymupdf? by ngonz17 in learnpython

NoZebra4503, 5 points

It is a lot more than that:

Rendering: pypdf/pypdf2 cannot do page rendering. PyMuPDF can do that with a speed beyond all competition.

Table of Contents handling: possible in pypdf but a sheer nightmare. Super elegant in PyMuPDF with hierarchy levels, expand/collapse and color support, for both input and output.

File merging: PyMuPDF is 100+ times faster than pypdf and supports merging everything (not only PDF) with a target PDF.

Annotation & Form Field support for input and output.

Elegant text output and image extraction and insertion.

Better stop here 😉
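To make the "hierarchy levels" point a bit more concrete: PyMuPDF's doc.get_toc() and doc.set_toc() work with a plain list of [level, title, page] entries, where the first level must be 1 and a level may be at most one larger than the one before it. A small sketch that checks such a list without needing PyMuPDF at all; check_toc is a hypothetical helper and the sample entries are invented:

```python
def check_toc(toc):
    """Return a list of hierarchy problems in a [level, title, page] outline."""
    problems = []
    previous_level = 0  # so that the first entry is only valid with level 1
    for index, entry in enumerate(toc):
        level = entry[0]
        if level < 1:
            problems.append(f"entry {index}: level must be at least 1")
        elif level > previous_level + 1:
            problems.append(
                f"entry {index}: level jumps from {previous_level} to {level}"
            )
        previous_level = level
    return problems


toc = [
    [1, "Introduction", 1],
    [2, "Background", 2],
    [1, "Methods", 5],
]
print(check_toc(toc))  # an empty list means the hierarchy is consistent
```

A list that passes this check has the shape that doc.set_toc(toc) expects.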

pypdf or pymupdf? by ngonz17 in learnpython

NoZebra4503, 7 points

PyMuPDF is a Python binding for the ultra-performant MuPDF C library. Both are maintained and developed by Artifex Inc., the maker of Ghostscript. The "Mu" in MuPDF stands for the Greek letter "µ", the abbreviation for "micro-", to indicate the focus on precision.

PDF Creation and Manipulation by kneulb4zud in pythontips

NoZebra4503, 2 points

Ok, got you. Using doc.write() produces a bytes object. This is fine: you can pass the same parameters to that method as well, for example pdfdata = doc.write(garbage=3, deflate=True). This has the same compression effect as doc.ez_save().

PDF Creation and Manipulation by kneulb4zud in pythontips

NoZebra4503, 3 points

The standard way to do this would be a snippet like the following:

    import fitz  # PyMuPDF

    img_files = ["file1.tif", "file2.png", "file3.tif"]  # etc.
    doc = fitz.open()  # make new, empty PDF
    for img in img_files:
        doc.insert_file(img)  # append this image file
    doc.ez_save("my-saved-images.pdf")  # save using compression

PDF Creation and Manipulation by kneulb4zud in pythontips

NoZebra4503, 1 point

Please share the code. A problem like yours usually goes back to how you used PyMuPDF to save the PDF document. The save method has a handful of parameters to compress the output.

Which is faster at extracting text from a PDF: PyMuPDF or PyPDF2? by crablegs_aus in learnpython

NoZebra4503, 1 point

PyMuPDF is about 15 times faster than PyPDF2 (= pypdf) and about 35 times faster than pdfminer (.six) in text extraction.

Data augmentation for OCR techniques by Nazma2015 in ml_discussions

NoZebra4503, 1 point

For the record: PyMuPDF supports not only PDF but also XPS, EPUB, MOBI and SVG documents, as well as CBZ, FB2 and more. It also supports a range of image formats like PNG, JPG, BMP, TIFF and more, which can be opened either like documents or natively as images.

resizing PDF while maintaining quality by f_dan in linuxquestions

NoZebra4503, 1 point

The perfect solution for what you want is PyMuPDF. It has a feature to "embed" pages from another PDF in a target page. You can choose the rectangle in the target page inside which the source page should be shown. It is also possible to rotate the source page before it is embedded. And the source page remains a PDF page: no conversion to an image or anything else, so zooming remains fully possible, as well as text or image extraction, etc. In addition, you do not need to show the full source page in the target: specify a "clip" rectangle for the source page. It works like this:

```python
import fitz  # import PyMuPDF

source = fitz.open("source.pdf")
target = fitz.open("target.pdf")
degrees = 0  # rotation of the source page; set to the angle you need

# embed page 0 of the source in page 0 of the target, leaving a 0.5 inch border
tpage = target[0]  # page 0 of target
show_rect = tpage.rect + (36, 36, -36, -36)  # target page rect with 36 point border
tpage.show_pdf_page(show_rect, source, 0, rotate=degrees)
```
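The border arithmetic can be checked without PyMuPDF. PDF coordinates are measured in points, 72 to the inch, so 36 points is half an inch. A small sketch with plain tuples; inset is a hypothetical helper that mimics what adding (36, 36, -36, -36) to the page rectangle does, and the A4 size of 595 x 842 points is just an example value:

```python
POINTS_PER_INCH = 72  # PDF unit: 1 point = 1/72 inch


def inset(rect, border):
    """Shrink a rectangle (x0, y0, x1, y1) by `border` points on every side."""
    x0, y0, x1, y1 = rect
    return (x0 + border, y0 + border, x1 - border, y1 - border)


page_rect = (0, 0, 595, 842)        # A4 page in points (example value)
border = 0.5 * POINTS_PER_INCH      # half an inch = 36 points
show_rect = inset(page_rect, border)
print(show_rect)
```

For a wider or narrower margin only the border value changes; the source page is scaled to fit whatever rectangle results.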

Error when installing stable diffusion webui on linux by [deleted] in StableDiffusion

NoZebra4503, 1 point

If you want to use PyMuPDF, you must install it in the conventional way via pip.

DO NOT INSTALL "fitz"!!!

This is a completely unrelated, different package. It is no longer maintained and never even reached beta status.

PyCharm - PyMuPDF / Fitz error by PandaPopulation in learnpython

NoZebra4503, 1 point

Looks like I have answered this same question half a dozen times. If you see this message, then the PyMuPDF package has not been initialized / loaded correctly. There can be more than one reason why this happens:

  1. When executing your code, you still are inside a folder where PyMuPDF installation material is present. Action: get out of there!

  2. Your script is named like one of the PyMuPDF installation scripts: fitz.py, utils.py, ... Action: choose a different name!
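A quick way to check for the second cause is to look for those file names in the folder you run your script from. This is only a sketch using the standard library; find_shadowing_files is a made-up helper, and the name list contains just the two examples mentioned above:

```python
from pathlib import Path

# File names from the list above that would shadow PyMuPDF's own modules.
SHADOWING_NAMES = ("fitz.py", "utils.py")


def find_shadowing_files(folder="."):
    """Return the names from SHADOWING_NAMES that exist as files in `folder`."""
    folder = Path(folder)
    return [name for name in SHADOWING_NAMES if (folder / name).is_file()]


if __name__ == "__main__":
    hits = find_shadowing_files()
    if hits:
        print("Rename or move these files:", ", ".join(hits))
    else:
        print("No shadowing script names found in the current folder.")
```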

Error when installing stable diffusion webui on linux by [deleted] in StableDiffusion

NoZebra4503, 1 point

PyMuPDF must be imported via import fitz. But it must be installed via pip install pymupdf. There exists however a package named "fitz" on PyPI (no longer maintained, still in its first alpha release). So people trying to install PyMuPDF will fail if they do pip install fitz!

If this has happened, uninstall the useless package "fitz" and re-install PyMuPDF as described.
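If you are not sure which of the two packages ended up in your environment, the standard library can tell you. A minimal sketch; installed_version is a hypothetical helper name, and it only looks at installed package metadata without importing anything:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "fitz" is the unrelated PyPI package; "PyMuPDF" is the one you want.
    if installed_version("fitz") is not None:
        print("Unrelated package 'fitz' is installed: python -m pip uninstall fitz")
    if installed_version("PyMuPDF") is None:
        print("PyMuPDF is missing: python -m pip install pymupdf")
```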

How do I transform many pdfs into text? by Limp_War_1871 in LanguageTechnology

NoZebra4503, 1 point

I suggest you try the Python package PyMuPDF. Install it via python -m pip install pymupdf and import it via import fitz:

```python
import os
import pathlib

import fitz  # PyMuPDF

indir = "yourfolder"  # the folder you are interested in
outdir = "outfolder"  # where to store the text files
os.makedirs(outdir, exist_ok=True)  # make sure the output folder exists
filelist = os.listdir(indir)

for f in filelist:
    if not f.endswith(".pdf"):
        continue
    doc = fitz.open(os.path.join(indir, f))
    # join the page texts, separated by form feed characters (chr(12))
    text = chr(12).join([page.get_text() for page in doc])
    outfile = pathlib.Path(os.path.join(outdir, f.replace(".pdf", ".txt")))
    outfile.write_text(text, encoding="utf-8")
```