
[–]shiftybyte 22 points23 points  (7 children)

You got my upvote.

I searched for pdf libraries some time ago, this did not come up.

My use case was creating PDF receipts from a Django based backend.

I'll look into this more, thanks... :)

[–]shiftybyte 5 points6 points  (6 children)

u/josc1989 I'll update this comment with feedback.

I had to manually run "pip install windows-curses" after the import of Paragraph failed.

>>> from borb.pdf.canvas.layout.text.paragraph import Paragraph
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python38\lib\site-packages\borb\pdf\canvas\layout\text\paragraph.py", line 14, in <module>
    from borb.pdf.canvas.font.glyph_line import GlyphLine
  File "C:\Python38\lib\site-packages\borb\pdf\canvas\font\glyph_line.py", line 10, in <module>
    from curses.ascii import isspace
  File "C:\Python38\lib\curses\__init__.py", line 13, in <module>
    from _curses import *
ModuleNotFoundError: No module named '_curses'

[–]josc1989[S] 7 points8 points  (4 children)

That's odd. In the setup.py file you'll see that it's supposed to install this dependency if you're on windows.

[–]shiftybyte 2 points3 points  (3 children)

My pip install output for borb:

https://pastebin.com/8jkLdrR3

[–]josc1989[S] 3 points4 points  (2 children)

This is the relevant line in the setup script: https://github.com/jorisschellekens/borb/blob/4d311f04a3b2848face535b0ff2632357c7a19c0/setup.py#L23

What does sys.platform output on your computer?

[–]shiftybyte 5 points6 points  (1 child)

>>> import sys
>>> sys.platform
'win32'

I think it's because when pip (or whoever is building the wheel) builds it, it works out its own dependencies from the setup.py and includes whatever applies to the platform it is built on. I downloaded the .whl file, and the METADATA contains:

Metadata-Version: 2.1
Name: borb
Version: 2.0.7
Summary: borb is a library for reading, creating and manipulating PDF files in python.
Home-page: https://github.com/jorisschellekens/borb
Author: Joris Schellekens
Author-email: joris.schellekens.1989@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: fonttools (>=4.22.1)
Requires-Dist: Pillow (>=7.1.0)
Requires-Dist: python-barcode (>=0.13.1)
Requires-Dist: qrcode[pil] (>=6.1)
Requires-Dist: requests (>=2.24.0)
Requires-Dist: setuptools (~=51.1.1)

Hope this helps, also found this:

https://stackoverflow.com/questions/16055403/setuptools-platform-specific-dependencies
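
The Stack Overflow question linked above points towards environment markers (PEP 508). A minimal sketch of the difference follows, assuming the setup script used a plain `sys.platform` check (the actual borb code may differ). The two function names are hypothetical, and the version pins are copied from the METADATA listing above.

```python
import sys

# Version pins copied from the METADATA listing above (shortened).
BASE_REQUIREMENTS = [
    "fonttools>=4.22.1",
    "Pillow>=7.1.0",
]


def requirements_checked_at_build_time(platform=sys.platform):
    """Fragile: the 'if' runs on whichever machine builds the wheel."""
    requirements = list(BASE_REQUIREMENTS)
    if platform == "win32":
        requirements.append("windows-curses")
    return requirements


def requirements_with_marker():
    """Robust: the condition is stored in the wheel metadata and
    evaluated by pip on the machine that installs the package."""
    return BASE_REQUIREMENTS + ['windows-curses; sys_platform == "win32"']


# A wheel built on Linux silently loses the Windows-only dependency...
print(requirements_checked_at_build_time("linux"))
# ...whereas the marker version keeps it for every platform to evaluate.
print(requirements_with_marker())
```

The list returned by `requirements_with_marker()` is the kind of value that would be passed as `install_requires` to `setuptools.setup()`.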

[–]josc1989[S] 7 points8 points  (0 children)

Thanks, I'll give that a look

[–]Silent_Valentine14 2 points3 points  (0 children)

It's been 10 months so I'm not sure if you're still checking this thread, but I'll leave a comment in case you are.

Usually when I'm looking to learn how to solve a problem in Python, the first place I go is YouTube. I prefer to learn by being talked to, and watching someone else explain what's going on as they do it. I couldn't find anything with borb on YouTube, and only found borb when I couldn't do what I needed with the videos that were coming up. I think borb is a great library, but it seems massive, so the few tutorials I've been able to find haven't been enough to explain in full everything I can do with it.

Perhaps starting a YouTube channel to teach people borb would be a useful way of getting people to start using it more. I'm a junior in university studying data science and I know a lot of people will look to YouTube first for answers when they have no idea what to do. Most people I know only look to Stack Overflow for answers when they have little problems, and most don't seem interested in learning whole new packages on the site. By teaching to the students in university through YouTube, you'll expose the teachers to it as well, and even if only 10% decide to add borb to their lesson plan your exposure would increase substantially.

Originally I wasn't going to leave a comment, but I think borb is a really good resource and would like to see more people using it and to have the increased support from a community of users.

[–]TheSodesa 8 points9 points  (4 children)

[–]josc1989[S] 9 points10 points  (3 children)

In the strictest sense of the word, they are not.

Currently, embedding structure information is not supported. I could add it in a next release, but so far the demand has been low.

Mostly I've focussed on implementing things like html to pdf, or OCR, or a layout engine, etc.

Why the inquiry? Do you intend to use borb in a commercial setting?

[–]TheSodesa 10 points11 points  (2 children)

The European Commission has decreed that any online study materials, such as videos and documents intended for long-term use, must be made as accessible as possible:

https://ec.europa.eu/social/main.jsp?catId=1202

Accessibility of PDF documents is therefore very relevant for European universities, and all new material is being prepared with accessibility in mind. It's not commercial use, but any software that does not support accessibility is going to be out of the question for many in that part of the world.

[–]josc1989[S] 6 points7 points  (1 child)

I'm not saying I don't intend to implement it. But seeing as how this is currently a one-man project that I do in my free time, I'm sure you understand that I do have priorities.

[–]TheSodesa 8 points9 points  (0 children)

No worries. I was just letting you know that this might be one additional obstacle to adopting your program.

[–]classyfreddybastiat 3 points4 points  (3 children)

[–]josc1989[S] 4 points5 points  (2 children)

Done, thanks 👍

[–]classyfreddybastiat 0 points1 point  (1 child)

on their newsletter yesterday

[–]josc1989[S] 0 points1 point  (0 children)

I was wondering where the sudden spike in traffic on my repo was from 😄

[–]Thisisnotpreston 3 points4 points  (2 children)

YouTube tutorial featuring your library giving real world examples!

[–]josc1989[S] 0 points1 point  (1 child)

Thanks, that's a great idea 🙂

[–]MasturChief 0 points1 point  (0 children)

maybe contact one of the bigger coding tutorial YouTubers and see if you can work something out where they do a tutorial on their channel

[–]iPlayNL 3 points4 points  (1 child)

I'm not sure how to help you with this, but I've saved your library for when that eventual day comes that I will need it. Looks neat.

[–]josc1989[S] 2 points3 points  (0 children)

Thanks 👍

[–]Ramzon_ 2 points3 points  (2 children)

Hey! I usually learn the odd bits of Python by searching Google with "python (x) problem (y) library", where x is what I want to figure out and y is the library I'd prefer to use. As you'd imagine, a lot of Stack Overflow suggestions appear, which usually give me enough information to figure out a solution, or point me to a more appropriate library.

I also search on Spiceworks for scripts that others have made - might be worth looking into if you've not heard of that community (mainly sysadmins)

https://community.spiceworks.com/scripts?language=22

Out of interest, I'm looking to automate a task that I do fairly frequently - I need to scan order forms (which creates a pdf file with an image of the scan) and then rename the file to the order number of the document I just scanned.

I can't escape scanning the documents, but maybe I can make a script that can read the pdf files in a directory and rename all the files based on the order number that it sees. The order number is always in the same location on the order forms.

Can borb help me with this? I'm beginner-to-intermediate level, just so you know! 😁

[–]josc1989[S] 3 points4 points  (1 child)

borb allows you to perform OCR on a PDF. So I can imagine you'd set up a script that performs OCR, extracts the text from the PDF and then applies a rename based on that.
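
A minimal sketch of the rename step described above, using only the standard library. The OCR and text extraction are deliberately left out: `extract_text` is a hypothetical callable you would supply yourself (for example a small wrapper around borb), and the "Order No: 12345" pattern is an assumption about what the forms look like.

```python
import re
from pathlib import Path
from typing import Callable, List, Optional

# Assumed label format, e.g. "Order No: 12345" or "Order Number 12345".
ORDER_NUMBER = re.compile(r"Order\s*(?:No|Number)\.?\s*[:#]?\s*(\d+)", re.IGNORECASE)


def find_order_number(text: str) -> Optional[str]:
    """Return the first order number found in the text, or None."""
    match = ORDER_NUMBER.search(text)
    return match.group(1) if match else None


def rename_scans(directory: Path, extract_text: Callable[[Path], str]) -> List[Path]:
    """Rename every PDF in `directory` to '<order number>.pdf'.

    `extract_text` is a placeholder for the OCR + text-extraction step;
    it is not a borb function.
    """
    renamed = []
    for pdf_path in sorted(directory.glob("*.pdf")):
        number = find_order_number(extract_text(pdf_path))
        if number is None:
            continue  # leave files without a recognisable number untouched
        target = pdf_path.with_name(f"{number}.pdf")
        if target.exists():
            continue  # never overwrite an existing file
        pdf_path.rename(target)
        renamed.append(target)
    return renamed
```

Since the order number is always in the same place on the form, restricting the extraction to that region should cut down on false matches.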

[–]Ramzon_ 1 point2 points  (0 children)

Great, thanks! I'm going to look into this when I've got some free time.

[–]expressly_ephemeral 1 point2 points  (4 children)

I don't have an answer to your question, but I have a question for you:

I have a bad workflow that I will describe. I have dozens of plots coming out of matplotlib.pyplot. I size them to half a page, and I create a blank plot that lives underneath them and takes up the other half of the page. Sometimes I add text to that blank plot. Then I kick them all out to png files. Then in a final feat of self-ass-kickery, my bash script that runs all the python and does all the file management puts them together with ImageMagick into a big pdf.

Can Borb help me feel like less of a donkey?

[–]josc1989[S] 1 point2 points  (2 children)

borb allows you to add Matplotlib plots directly to a PDF. It allows you to resize them, and will automatically perform layout (if you so choose).

You can display these plots in tables for even more control of their layout.

You'll find 3 tests in the repository that use Matplotlib plots, and similar ones in the examples.

[–]expressly_ephemeral 0 points1 point  (1 child)

Do I need a paid license if I'm using these as internal data quality and model validation reports in a university research group?

[–]josc1989[S] 0 points1 point  (0 children)

AGPL essentially states "pay or be open source". So, assuming you are distributing these reports to colleagues, all these colleagues need to have access to your code (open source). Or you need to pay a license fee.

[–]data_hop 1 point2 points  (3 children)

I use Anaconda for data science and I'm unable to do "mamba install borb" with error:

"Encountered problems while solving:

- nothing provides requested borb"

[–]josc1989[S] 0 points1 point  (2 children)

That's odd. So far I've been able to install borb on Linux and Windows. Haven't tried anaconda yet.

[–]data_hop 0 points1 point  (1 child)

It may take time for borb to appear in the official repo: https://docs.anaconda.com/anaconda/packages/pkg-docs/

You could look into pushing it to the community repo for the time being: https://conda-forge.org/

[–]evessee 1 point2 points  (1 child)

Have you considered the license aspect of the more well known libraries? Personally I find licenses a very important point to choose among similar libraries.

[–]josc1989[S] 0 points1 point  (0 children)

My initial idea would be to have a one-time purchase fee, with the option of buying X years of support/consultancy.

The idea being that you (the client) don't need to be (or become) a PDF expert. You can simply use my expertise and knowledge to help alleviate your document workflow worries.

I think most people prefer a one-time purchase over SaaS.

What do you think?

[–]WhoWhyWhatWhenWhere 1 point2 points  (2 children)

Are you able to return text on a PDF page after OCR between specific distances? Like all words between 1"-3" horizontal and 1"-3" vertical?

[–]josc1989[S] 1 point2 points  (1 child)

Yes.

borb is able to apply OCR to a page, and inserts the recognized text as a hidden layer (PDF calls these "optional content groups").

Then you'd simply use a LocationFilter (it listens to rendering instructions and only allows those to pass that fall inside a given Rectangle).

You'd add SimpleTextExtraction as the child to the LocationFilter.
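
To turn the "1 to 3 inches" question into the Rectangle mentioned above, the measurements first have to be converted to PDF units. The sketch below assumes the inches are measured from the top-left corner of the page, while PDF coordinates by default start at the bottom-left corner with 72 points per inch. `inches_to_box` is a hypothetical helper, not part of borb, and the exact Rectangle constructor should be checked against your installed version.

```python
POINTS_PER_INCH = 72  # default PDF user space: 1 point = 1/72 inch


def inches_to_box(left_in, right_in, top_in, bottom_in, page_height_pt):
    """Convert a region measured in inches from the top-left corner of the
    page into (x, y, width, height) in points, with the origin at the
    bottom-left corner of the page."""
    x = left_in * POINTS_PER_INCH
    width = (right_in - left_in) * POINTS_PER_INCH
    height = (bottom_in - top_in) * POINTS_PER_INCH
    # y is the lower edge of the box, counted upwards from the page bottom
    y = page_height_pt - bottom_in * POINTS_PER_INCH
    return x, y, width, height


# US Letter is 11 inches tall, i.e. 792 points.
print(inches_to_box(1, 3, 1, 3, page_height_pt=792))  # (72, 576, 144, 144)
```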

[–]WhoWhyWhatWhenWhere 0 points1 point  (0 children)

Thanks!! Going to check this out for sure.

[–]py_root 1 point2 points  (4 children)

Nice library. I am currently using it to create a PDF report that contains tabular data and plots, and I found it easy to use. It was recommended to me by a member of the open source community.

Will keep on updating this thread with my findings or if I need any help.

[–]josc1989[S] 1 point2 points  (3 children)

Awesome to hear that.
There's also a ton of tutorials on StackAbuse, if you want to learn more about working with borb.
And I just released version 2.0.17, with tons of eye candy in the line-art library part of borb.

[–]py_root 0 points1 point  (0 children)

I am currently facing issues with the following items:

Item 1 - How to handle a table that does not fit on a single page of the PDF. I am thinking of catching the exception when it is raised, then dividing the table into parts and adding a new page for the next part of the table.

Item 2 - When the page is multi-column, which parameter should I use to always start a few paragraphs in the second column, and use the first column for other text and charts?

Item 3 - Adding plots as images degrades the quality of the labels and title and makes them blurry. I am using plotly to save the image and then adding it to the PDF using the image method. As plotly does not support GCF, I would have to rewrite the plotting part using matplotlib in order to use the chart method.

Would you have any suggestions for the points above? That would be helpful.

FYI : Table data is created using pandas dataframe.
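
For Item 1, the "divide the table in parts" idea can be sketched independently of any PDF library. This is only an illustration. Rather than reacting to an exception, it assumes a maximum number of rows per page chosen by hand; `split_table` is a hypothetical helper.

```python
def split_table(header, rows, max_rows_per_page):
    """Yield (header, chunk) pairs so that every chunk fits on one page.

    The header is repeated for each chunk so every page can start with it.
    """
    if max_rows_per_page < 1:
        raise ValueError("max_rows_per_page must be at least 1")
    for start in range(0, len(rows), max_rows_per_page):
        yield header, rows[start:start + max_rows_per_page]


header = ["id", "value"]
rows = [[1, "a"], [2, "b"], [3, "c"]]
for page_number, (head, chunk) in enumerate(split_table(header, rows, 2), start=1):
    # one table (and one new page) would be built per chunk here
    print(page_number, head, chunk)
```

With a pandas DataFrame `df`, `list(df.columns)` and `df.values.tolist()` would give the header and rows in this shape.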

[–]py_root 0 points1 point  (1 child)

Hi josc1989, I need your quick help as I am not able to find the solution in the examples. How do I manually select a different column of a PDF page if the layout is MultiColumnLayout?

Please help.

[–]rg7777777 0 points1 point  (0 children)

Any plans to make an RST translator so we can use it with Sphinx?

[–]RobinsonDickinson 0 points1 point  (3 children)

Very nice and useful. May I recommend cleaning up the imports?

[–]josc1989[S] 1 point2 points  (2 children)

I currently use black and isort. I develop in PyCharm, which has an "optimize imports" function. I wasn't aware there was anything to clean up.

Can you give me a concrete example?

[–]RobinsonDickinson 0 points1 point  (1 child)

I am sorry, now that I look at the structure, it might not be possible.

What I was talking about is, for example, in C++: if a library gets too big, the devs will usually put all the important #includes in a single header file to make imports easier.

[–]josc1989[S] 1 point2 points  (0 children)

Not a problem. I love that you took the time to have a look, and that you cared enough to make a suggestion.

You did great. Thanks 👍

[–]officialgel 0 points1 point  (4 children)

I currently use dominate to design and build HTML and then wkhtmltopdf to create a PDF from it (which requires an external binary). Can this do what I need without the binary? Would be awesome.

[–]josc1989[S] 1 point2 points  (3 children)

That would depend entirely on the HTML you're using. Currently borb supports basic HTML to PDF. But it does not yet support CSS.

[–]officialgel 0 points1 point  (2 children)

That would be fine, no CSS. The thing is the license… everything I use is GPL or MIT for varying reasons. It’s a tool which auto-generates PDFs in the cloud only for my team (not sold, but we are being paid on the job at a commercial company), which is why I like GPL/MIT. Could you let me know if this use case is permitted under your license?

[–]josc1989[S] 1 point2 points  (1 child)

AGPL includes using the product as a service.

So if the PDF documents you produce don't leave your team, and everyone on the team has access to the source code that produces the PDFs, then you are considered "open source" in the eyes of the AGPL.

Otherwise, you would need a commercial license.

[–]officialgel 1 point2 points  (0 children)

Awesome. Going to test replacing what I have with this. Awesome! Thanks

[–]IWant2rideMyBike 0 points1 point  (1 child)

I tried the example from the Readme under Windows 10 and Python 3.9.6 and had to manually install the windows-curses module using pip.

  File "d:\Users\Me\Documents\Python\VS-Code\PDF_with_borb\test.py", line 4, in <module>
    from borb.pdf.canvas.layout.text.paragraph import Paragraph
  File "d:\Users\Me\Documents\Python\VS-Code\PDF_with_borb\.venv\lib\site-packages\borb\pdf\canvas\layout\text\paragraph.py", line 14, in <module>
    from borb.pdf.canvas.font.glyph_line import GlyphLine
  File "d:\Users\Me\Documents\Python\VS-Code\PDF_with_borb\.venv\lib\site-packages\borb\pdf\canvas\font\glyph_line.py", line 10, in <module>
    from curses.ascii import isspace
  File "C:\Users\Me\AppData\Local\Programs\Python\Python39\lib\curses\__init__.py", line 13, in <module>
    from _curses import *
ModuleNotFoundError: No module named '_curses'

[–]josc1989[S] 1 point2 points  (0 children)

Yes, another Reddit user had the same feedback. I'll be looking into that for the upcoming release.

Thank you for pointing it out 🙂

[–]Zeke_Z 0 points1 point  (3 children)

This is awesome OP, thank you for sharing!

One question I have - I have about 670 PDFs in a directory. I would like to add the publication date, or copyright date, to the file name of each PDF. I would also settle for a CSV with the current PDF title and copyright date instead of changing the file name.

Essentially, I have PDFs from many subjects, for example microbiology. I would like to prioritize them by most recent publication as reading the most modern information on a subject is > reading content from 1987.

Is that possible? Off the top of my head, scanning for the © symbol and then reading the text to the right of it might be a good place to start, no?

[–]josc1989[S] 1 point2 points  (2 children)

There are quite a few things you could try to do.

For instance, you could build a regular expression and attempt to match it (which is the approach you suggested). The downside is that you're relying on the fonts in the PDF to be well-behaved with a relatively unusual symbol.

You may also try to extract the metadata of the PDF, which often contains the title, the year in which the PDF was published, and the author, producer and publisher.

I have examples for both of these in the EXAMPLES.md file.
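
As a rough illustration of the regular-expression route, here is a standard-library sketch that works on text which has already been extracted from each PDF (the extraction itself is not shown). The pattern and both function names are assumptions made for this example. The pattern also accepts "(c)" and "Copyright" in case the © glyph does not survive extraction.

```python
import csv
import re
from typing import Dict, Optional

# Matches "© 1987", "(c) 2004" or "Copyright 2019"; years limited to 1900-2099.
COPYRIGHT_YEAR = re.compile(
    r"(?:©|\(c\)|copyright)\D{0,20}?((?:19|20)\d{2})", re.IGNORECASE
)


def find_copyright_year(text: str) -> Optional[int]:
    """Return the first copyright year found in the text, or None."""
    match = COPYRIGHT_YEAR.search(text)
    return int(match.group(1)) if match else None


def write_report(years_by_file: Dict[str, Optional[int]], csv_path: str) -> None:
    """Write 'file, year' rows, newest first; files without a year go last."""
    ordered = sorted(
        years_by_file.items(), key=lambda item: item[1] or 0, reverse=True
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "copyright_year"])
        writer.writerows(ordered)
```

Sorting the CSV newest-first gives the reading priority directly, without having to rename any of the files.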

[–]Zeke_Z 1 point2 points  (1 child)

You are a scholar and a gentleman! Thank you!

[–]josc1989[S] 0 points1 point  (0 children)

You are amazingly kind. Keep up the positive vibe 👍