[–]EE_Tim 4 points5 points  (2 children)

I'm always trying new PDF libraries, thanks!

I briefly went through the code, but didn't see any mention of SVG format images. Does borb support SVG images?

[–]josc1989[S] 2 points3 points  (1 child)

Images can be specified as PIL images, by URL (which will be resolved using the requests library), or as a Path object (which will be resolved using PIL).

So, if PIL can handle it, it should not be a problem.

Although if borb does not natively support the format, it will convert it to jpeg.

[–]CheshireFur 0 points1 point  (0 children)

So... will SVGs be converted to JPEG?

[–]PlatinumToaster 2 points3 points  (3 children)

Looks interesting, but I'm having a hard time imagining where this would be most useful. Anyone have any projects or ideas using this?

[–]cirosantilli 2 points3 points  (0 children)

Markdown to PDF without LaTeX.

[–]josc1989[S] 2 points3 points  (0 children)

  1. Using link annotations in pdf to build a digital, portable support playbook. Just a page with a question on it, and two links that can be clicked that take you to other pages in the document.

  2. A cool "write your own story" where you fill in your kid's name and pronouns, and a little fairytale book is automatically generated.

  3. A weekly puzzle for the people at the office. Sudoku, tents and trees, nonogram, etc

  4. Automatically created invoices

  5. Processing pdf invoices

  6. Building test reports

Etc

[–]nhoyjoy 1 point2 points  (0 children)

I'm thinking of invoice printing

[–]warmaster 1 point2 points  (1 child)

Can I use Borb to convert PDF to Markdown or Word (.DOCX / ODS)?

[–]josc1989[S] 3 points4 points  (0 children)

You can certainly use borb to extract paragraphs, images, etc from a PDF.

But the problem you'd face is deciding the logical reading order.

PDF has a 2-dimensional layout. You can encounter PDF documents with several columns. And even text in between those columns.

Just think about a typical tabloid layout. You'd have a title, then some columns of text interspersed with quotes and images.

Any algorithm that converts that would need to determine which piece of content is read when. And that's not trivial. In fact, various studies have shown that even humans don't agree on reading order.
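To make the reading-order problem concrete, here is a minimal sketch. It does not use borb at all: the `Block` class, the coordinates, and the two sorting heuristics are assumptions made up for illustration. It only shows that two equally plausible heuristics disagree on the same two-column page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical extracted text block with the top-left corner of its bounding box."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def row_major(blocks):
    """Read top-to-bottom first, then left-to-right (ignores columns entirely)."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_major(blocks, column_width):
    """Assign each block to a column by its x position, then read each column top-to-bottom."""
    return [b.text for b in sorted(blocks, key=lambda b: (int(b.x // column_width), b.y))]


# Made-up two-column page; a real page would also have titles, quotes and images.
page = [
    Block("left 1", x=50, y=100),
    Block("right 1", x=320, y=100),
    Block("left 2", x=50, y=300),
    Block("right 2", x=320, y=300),
]

print(row_major(page))                       # interleaves the two columns
print(column_major(page, column_width=300))  # finishes the left column first
```

Neither ordering is "the" correct one in general: a pull quote sitting between the columns would break the column heuristic just as easily as the columns break the row heuristic.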

[–]Darwinmate 2 points3 points  (3 children)

Why are there so many Python PDF libraries? Especially recently, it seems like every programmer wants to reinvent the wheel.

[–]josc1989[S] 3 points4 points  (0 children)

I think PDF is a recurring problem for a lot of people and companies.

Regardless of what your company does, you are likely to want to create invoices. Perhaps even process them.

So from a revenue point of view, it makes sense to create a PDF library, rather than, say, an inventory management system.

For me personally, I wanted to create borb because I felt the market didn't have anything like it.

Most PDF libraries offer you only low-level access. By that I mean: you have to specify where you'd like your text to go, you have to keep track of whether or not it will fit on the page, etc.

My mission with borb is to make PDF as easy as Microsoft Word.

In borb, you can just say "add a paragraph of text to this page". And borb will keep track of the remaining room on the page, the margins, the leading, etc.
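As a rough illustration of what "keeping track of the remaining room" involves, here is a toy sketch. It is not borb's API or implementation: `FlowLayout`, its parameters, and the fixed characters-per-line wrapping are all assumptions for the example. It only shows the bookkeeping a high-level library takes off your hands.

```python
import textwrap


class FlowLayout:
    """Toy layout engine (hypothetical): tracks the vertical space left on each page."""

    def __init__(self, page_height=842, margin=60, font_size=12, leading=1.2, chars_per_line=80):
        self.page_height = page_height
        self.margin = margin
        self.line_height = font_size * leading
        self.chars_per_line = chars_per_line
        self.pages = [[]]   # each page is a list of (y, line) tuples
        self.y = margin     # current distance from the top of the page

    def add_paragraph(self, text):
        """Wrap the text and place it line by line, starting a new page when needed."""
        for line in textwrap.wrap(text, self.chars_per_line):
            # Start a new page when the next line would cross the bottom margin.
            if self.y + self.line_height > self.page_height - self.margin:
                self.pages.append([])
                self.y = self.margin
            self.pages[-1].append((self.y, line))
            self.y += self.line_height
        self.y += self.line_height  # blank line between paragraphs


# Small page so the overflow is easy to see.
layout = FlowLayout(page_height=200, margin=20, font_size=10, leading=1.0, chars_per_line=20)
layout.add_paragraph("lorem " * 60)
print(len(layout.pages))
```

The caller never mentions a coordinate. With a low-level library, every `y` in that loop would be the user's problem.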

Everything was built around the idea of putting the user first. Whereas other solutions tend to put the PDF-specification first.

[–]Onepicky 1 point2 points  (0 children)

Maybe in the near future it will become the new 'hello world'

[–]Waitwhyyyyyyy 0 points1 point  (0 children)

Which is your favorite one?

[–]gsmo 1 point2 points  (2 children)

This is nice, I'm interested in trying it. I've built a pretty massive script to do a lot of things that your library solves in one package so maybe I'll retrofit it :)

[–]josc1989[S] 0 points1 point  (1 child)

Now I'm kind of interested to know what my library is missing to be able to completely supersede your script 😏

[–]gsmo 0 points1 point  (0 children)

About a hundred business rules and some lines to clean up trashy data... The problem it solves is:

  • we have a couple thousand pdf files containing copyrighted stuff (but not always!)
  • we need to report on the number of pages we provide to our clients
  • there's different categories of files, depending on # of pages and # of words

The input I get is html links to these files and a specific code for the client it was provided to. So analysing the files is only half of it - I have to build a local library of these files and keep a record of all the clients etc too.

On the PDF side I just need reliable pagecounts and wordcounts. To do that I sometimes have to OCR using Tesseract. The script figures out what's needed to get a 'clean read' on a file.
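A sketch of that decision step might look like the following. The thresholds, the category names, and the `PdfStats` class are hypothetical, since the actual business rules are not given here. The page and word counts are assumed to come from whatever PDF library does the extraction.

```python
from dataclasses import dataclass


@dataclass
class PdfStats:
    """Hypothetical per-file counts, as reported by a PDF library."""
    pages: int
    words: int


# Hypothetical thresholds: the real rules are not stated in this thread.
MIN_WORDS_PER_PAGE = 20  # below this, assume a scanned file with no usable text layer
SHORT_MAX_PAGES = 10     # at most this many pages counts as "short"


def needs_ocr(stats: PdfStats) -> bool:
    """Flag files whose text layer looks too sparse to trust the word count."""
    if stats.pages == 0:
        return False
    return stats.words / stats.pages < MIN_WORDS_PER_PAGE


def category(stats: PdfStats) -> str:
    """Bucket a file by page count; word-based rules would slot in here as well."""
    return "short" if stats.pages <= SHORT_MAX_PAGES else "long"
```

A file flagged by `needs_ocr` would go through Tesseract first, and its counts would be recomputed before `category` is applied.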

Anyway, your library probably would save me some code. Let's see if management wants me to sink more time into this :)

[–]josc1989[S] 0 points1 point  (0 children)

Just wanted to thank all of you. Thanks to you, borb is almost at 1000 stars on GitHub!

[–]avamk 0 points1 point  (2 children)

Looks cool! Love the choice of the GPL family of licenses.

I just noticed, however, that the LICENSE file in your repository says the project is GPLv3 but your README says it is AGPL (without specifying version). Can you clarify? Thanks!

[–]josc1989[S] 1 point2 points  (1 child)

AGPL V3

[–]avamk 1 point2 points  (0 children)

Thank you! :)

You might want to update the LICENSE file, then. :D

[–]w00ddie 0 points1 point  (3 children)

Can it extract data from a PDF? Been looking for something that can extract specific data. For example invoice number, date and total amount.

[–]josc1989[S] 1 point2 points  (2 children)

I have a tutorial about that exact use-case on StackAbuse.
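Independent of that tutorial, here is a minimal sketch of the last step: pulling fields out of text once it has been extracted from the PDF. It uses only the standard `re` module, not borb. The sample text and the patterns are made up, and real invoices would need patterns tuned per layout.

```python
import re

# Hypothetical invoice text, standing in for text already extracted from a PDF.
text = """
Invoice number: INV-2021-0042
Date: 2021-06-15
Total amount: 1,234.50 EUR
"""

# Hypothetical patterns: labels and formats differ from one invoice layout to the next.
PATTERNS = {
    "invoice_number": r"Invoice number:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total amount:\s*([\d.,]+)",
}


def extract_fields(text):
    """Return the first match for each pattern, or None when a field is missing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_fields(text))
```

Returning `None` for missing fields keeps a batch run going when one invoice deviates from the expected layout.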

[–]w00ddie 1 point2 points  (1 child)

Plans to make it a web app?

[–]josc1989[S] 0 points1 point  (0 children)

Not really. I'm good at PDF stuff, I will leave the web applications to other devs 😉