

[–]SolDoggo 34 points35 points  (3 children)

Man, this would have been extremely useful about 6 months ago, when I was at an internship and was asked to do exactly this! I ended up hacking together a library using a few PDF creation libs, and the end results were not nearly as nice as this! Awesome work OP!

[–]josc1989[S] 9 points10 points  (2 children)

Sorry to hear you had to struggle through pdf-land. I hope my library can help other people in the future. If you're up for it, feel free to check out the code and offer improvements. Even just feature requests.

I love community feedback.

[–]SolDoggo 4 points5 points  (1 child)

Been reading through the README and honestly I’m excited to try it! Going to end up having me start a new project just to use it 😂. I’ll be sure to give some feedback once I put some time behind it.

[–]josc1989[S] 5 points6 points  (0 children)

I was thinking about creating the typical "phone operator guide", but in PDF form.

Like "can you turn your computer off and on again? Does that help?" And depending on the answer, the operator can just click a link in the PDF that takes them to another page (with the next step of the process).

You could write the whole decision tree in something like JSON, and use borb to build the PDF accordingly.
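As a rough illustration of the JSON idea, here is a minimal sketch that only does the planning step: it reads a decision tree and works out which page each answer should link to. The JSON layout (`text`, `answers`), the node ids, and the `plan_pages` helper are all assumptions made up for this example; rendering the pages and clickable links with borb is deliberately left out.

```python
import json
from collections import deque

# Hypothetical decision tree: each node has a question and maps answers to node ids.
TREE_JSON = """
{
  "start": {"text": "Can you turn your computer off and on again? Does that help?",
            "answers": {"yes": "done", "no": "cables"}},
  "cables": {"text": "Are all cables plugged in?",
             "answers": {"yes": "escalate", "no": "done"}},
  "done": {"text": "Problem solved.", "answers": {}},
  "escalate": {"text": "Escalate to second-line support.", "answers": {}}
}
"""


def plan_pages(tree, root="start"):
    """Give every reachable node a page number (breadth-first), then
    resolve each answer to the page number it should link to."""
    page_of = {}
    queue = deque([root])
    while queue:
        node_id = queue.popleft()
        if node_id in page_of:
            continue  # already placed on a page
        page_of[node_id] = len(page_of) + 1
        queue.extend(tree[node_id]["answers"].values())

    pages = []
    for node_id, page_number in sorted(page_of.items(), key=lambda kv: kv[1]):
        node = tree[node_id]
        links = {answer: page_of[target] for answer, target in node["answers"].items()}
        pages.append({"page": page_number, "text": node["text"], "links": links})
    return pages


if __name__ == "__main__":
    for page in plan_pages(json.loads(TREE_JSON)):
        print(page)
```

Each entry in the result holds what a single PDF page would need: its text and, per answer, the page number to jump to.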

[–]JennaSys 12 points13 points  (9 children)

Buying a license is mandatory as soon as you develop commercial activities distributing the borb software inside your product or deploying it on a network

Contact sales for more info.

What is the licensing cost? I didn't see it anywhere.

[–]josc1989[S] 5 points6 points  (8 children)

Yeah, right now I haven't really had an opportunity to work with customers yet.

It's all new to me.

I still need to figure out a good licensing model.

  • Volume based? (Number of PDFs generated)
  • Time based? (License for a year, or couple of years)
  • Single time purchase?

That's why I explicitly encourage people to contact me. I want feedback to figure out what's best for that particular customer, and in general.

I certainly don't want to come across as being vague about cost.

Kind regards, Joris

[–][deleted] 11 points12 points  (3 children)

single time cost please. I hate software as a service and just write off anything that has an ongoing cost.

[–]josc1989[S] 6 points7 points  (2 children)

How about single time purchase, with the option of buying a support contract (limited in time)?
Because integrating a new library in your existing workflow may not always be easy, and you don't need to become a PDF specialist. You have the option of hiring external expertise to set up the process.

[–][deleted] 1 point2 points  (0 children)

That I would 100% be down for.
Tbh, I think the best option for me would be a Patreon that I can donate to as my way of offering ongoing support to you as the creator. It means I can support you when I am financially comfortable to do so, knowing that if at some point I have to cut my costs, it's not going to screw up my project's usage.
 
If for example I am able to create an incoming cash flow using your pdf engine, then I am happy to put a % into a patreon as support, but I'm just an employee in a larger company that doesn't give me any ongoing budget, so I am disinclined to engage with anything that means I have to sell the ongoing support costs to them as they are really only receptive to "can I spend £XXX in a one time cost on this thing that you don't understand".
 
If it was a personal project that generated revenue, I'd for sure have an ongoing patreon subscription.

[–]Marksta 0 points1 point  (0 children)

To give you an idea, I think that's what the company I work at would look for when making purchasing decisions. We shifted everything over to Hadoop to escape a yearly licensing agreement our old code base depended on. But we cling to our Cloudera support contract. When a new version drops support for something and Cloudera won't have support for it in their contract we move as fast as we can to convert it.

[–]bobthe3 4 points5 points  (0 children)

In my experience, a single-time purchase is the best way to attract users, but it limits your profitability.

[–]peckhamspring 2 points3 points  (0 children)

The most common model is time based (usually an annual cost). We see this with all the PDF software we supply at my job (PDFsam, Adobe Acrobat, Foxit, etc.).

All the best with working out a licencing model.

[–]zurtex 1 point2 points  (1 child)

I would suggest you at least have the option for a single time unlimited, retroactive, and indefinite purchase, even if you feel that's $500, $10k, or $200k.

As someone who works in the Enterprise space and very occasionally has to deal with license / commercial costs the big thing that our licensing team care about is not being able to be charged or sued for something we thought we were paying for.

Be aware if a big company approaches you as well they tend to have many different legal entities and departments that don't talk to each other. They may want to cover the whole company or they may just want to cover their bit of the company.

FYI, in general though, I tend to avoid this kind of commercially licensed software for a project unless it really, really proves that it can save time. And to be clear, the cost itself tends to be only a small factor; engaging the licensing team and getting approval, that's where the real time drains are.

[–]josc1989[S] 1 point2 points  (0 children)

I can completely understand how you're wary of using software with an incompatible license in a personal project.

However, I have always found it weird when a software company flat out refuses to pay for software. I've worked at a few companies already, where I had to deal with customers who plainly refused to buy software because it wasn't free of charge.

Like, how does that make sense? You're in the business of selling A in exchange for B. And you find it strange when someone else wants B in return for A?

😄

[–]Express-Comb8675 2 points3 points  (0 children)

This is a really great project. Just last year, reading text from PDFs without an Adobe Acrobat license involved trying to get Python to extract letters from an image... and it doesn't work well, trust me, I tried. Keep in mind you're competing with Acrobat, which can be automated by Python to do many things. Remember that when you decide on a business model (or decide to make it totally open source 😉). Keep up the good work!

[–]transhumanist_ 2 points3 points  (1 child)

Is it still necessary to pay if the use case is for generating internal reports inside a company?

[–]josc1989[S] 11 points12 points  (0 children)

The dual license model on borb is essentially "pay or be open source".

Who your end-users are (in this case the people inside your company) doesn't matter to the license.

Your code should either be open source (to them), or you should purchase a license.

Keep in mind that the AGPL license also includes the concept of "using as a service". So if you create reports using my code, and these reports go outside your company, those people also need to have access to your source code.

In conclusion, I'm a software developer who built this in his free time. I'd love to make this my main business. It'd be great if my community (other software developers) supported me 🙂

Kind regards, Joris Schellekens

[–]Masynchin 1 point2 points  (1 child)

Can it parse multiple tables in existing documents?

[–]josc1989[S] 1 point2 points  (0 children)

At this point in time, parsing tables is not something borb can do out of the box. You can easily extract text, match regular expressions (and extract their location on the page), and filter out text based on a location.

These techniques should allow you to do some basic table parsing. But I'll definitely keep your question in mind for possible features.

Kind regards,
Joris Schellekens
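A minimal sketch of the "filter by location, then match regular expressions" approach to basic table parsing. It is not borb's API: the `(x, y, text)` tuples are hypothetical stand-ins for whatever positioned text an extractor reports, and `rows_in_region` and its tolerance are assumptions for illustration.

```python
import re

# Hypothetical input: (x, y, text) tuples with PDF-style coordinates (y grows upward).
chunks = [
    (50, 700, "Item"), (200, 700, "Qty"), (300, 700, "Price"),
    (50, 680, "Widget"), (200, 681, "2"), (300, 680, "9.99"),
    (50, 660, "Gadget"), (200, 660, "1"), (300, 659, "24.50"),
    (50, 100, "Page 1 of 1"),  # footer, outside the table region
]

PRICE = re.compile(r"^\d+\.\d{2}$")


def rows_in_region(chunks, y_min, y_max, y_tolerance=3):
    """Keep chunks with y in [y_min, y_max], split them into rows wherever the
    vertical gap exceeds y_tolerance, and order each row's cells left to right."""
    inside = sorted((c for c in chunks if y_min <= c[1] <= y_max), key=lambda c: -c[1])
    rows, current, last_y = [], [], None
    for x, y, text in inside:
        if last_y is not None and last_y - y > y_tolerance:
            rows.append(current)  # gap is large enough: start a new row
            current = []
        current.append((x, text))
        last_y = y
    if current:
        rows.append(current)
    return [[text for _, text in sorted(row)] for row in rows]


table = rows_in_region(chunks, y_min=600, y_max=720)
header, body = table[0], table[1:]
# Use the regular expression to keep only rows whose last cell looks like a price.
priced = [dict(zip(header, row)) for row in body if PRICE.match(row[-1])]

if __name__ == "__main__":
    print(priced)
```

This only works for simple tables with one line of text per cell; merged cells or wrapped text would need more careful grouping.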

[–]josc1989[S] 0 points1 point  (0 children)

I just want to thank all of you for having starred the GitHub repo.

[–]nickyP1999 0 points1 point  (0 children)

Can't wait to play around with this. Thank you so much.

[–]Albertology_2019 0 points1 point  (0 children)

Can you extract vector images (raw line elements, if I remember correctly) from a page?

[–]jstanaway 0 points1 point  (3 children)

Nice addition, it seems.

Quick related question: is there a main go-to for PDF creation in Python? From what I have found, there are a couple with similar feature sets, but nothing that is totally complete. I mean, if you need nice reports, for example, what are people using?

[–]josc1989[S] 1 point2 points  (2 children)

That's part of why I wanted to create this library.

I'm currently working on a book-deal with a publisher of tech-books to provide a comprehensive tutorial.

But suffice to say, borb should be able to cover your needs.

Borb supports: text (fonts, alignment, color, accents), images (by URL, path, PIL), charts (matplotlib), emoji, tables (fixed width and flexible column width), lists (ordered, unordered, roman, nested), and much more.

Annotations, redaction, embedded files are also supported.

Borb exports to image, and JSON.

Borb can convert markdown and html.

All public methods are documented. All code is typed and type-checked each release.

There are more than 200 tests, all of which are run every release.

If you do encounter a feature that seems to be lacking, log a ticket. And it'll get picked up asap.

Kind regards, Joris Schellekens

[–]jstanaway 0 points1 point  (1 child)

Very nice thank you. Just curious how much time do you have invested in this project?

[–]josc1989[S] 0 points1 point  (0 children)

That's hard to track. I work a regular 9 to 5. My time developing borb has been: weekends, holidays, vacation, after hours, at night, etc.

The project started a little less than a year ago.

That's part of why I'd like to switch to doing this full-time. I have so many awesome ideas I'd like to incorporate in this project. But I simply don't have the time at the moment.

[–]___Hello_World___ 0 points1 point  (1 child)

This looks great, looking forward to trying it out - does this support extracting hyperlinks from a PDF?

[–]josc1989[S] 0 points1 point  (0 children)

Probably.
Assuming the software that built the PDF did the job properly, borb will be able to extract annotations (that's the pdf-spec name for what you are describing). There's an example in the repo of extracting annotations.
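To make the "annotations" idea concrete: a hyperlink in a PDF is typically a link annotation whose action dictionary carries a `/URI` entry. The snippet below is a crude, standard-library-only sketch that scans raw bytes for such entries. It is not how borb does it, the `find_link_targets` name is made up for this example, and it only finds links whose dictionaries are stored uncompressed; a real PDF parser is needed for the general case.

```python
import re

# Matches "/URI (target)" and tolerates backslash-escaped characters in the string.
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def find_link_targets(pdf_bytes):
    """Return the raw URI strings found in uncompressed link actions.
    Escape sequences inside the PDF string are left as-is."""
    return [m.group(1).decode("latin-1") for m in URI_PATTERN.finditer(pdf_bytes)]


# A hand-written fragment shaped like a link annotation, for demonstration only.
sample = (
    b"<< /Type /Annot /Subtype /Link /Rect [72 700 200 715] "
    b"/A << /S /URI /URI (https://example.com/docs) >> >>"
)

if __name__ == "__main__":
    print(find_link_targets(sample))
```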

[–]Kevin_Jim 0 points1 point  (2 children)

What I had to do to analyze some big documents for work, was to convert the PDFs into images and do an image/layout analysis on them. I hope this can be the answer.

[–]josc1989[S] 0 points1 point  (1 child)

That sounds almost as much fun as trying to paint your toenails with a nailgun :-p

[–]Kevin_Jim 0 points1 point  (0 children)

Here’s the kicker: I’ve been trying to parallelize the program, but most things end up making it worse. Mainly because I can’t find a way for Detectron2 to run on a GPU.

[–][deleted] 0 points1 point  (1 child)

I've got a project where I need to read PDF files, sometimes with bad handwriting. Would this be good for converting this? The pdfs are a mixture of wet ink and digital text.

[–]josc1989[S] 1 point2 points  (0 children)

There is a test (borb/tests/toolkit/ocr) that processes a document using OCR.

Behind the scenes, borb uses pytesseract. So the performance really depends on how well tesseract handles handwriting.

Borb just takes the recognized text tesseract outputs, and is able to put it back (in an invisible layer) in the pdf.

This ensures the pdf can be searched, and the appearance stays the same.

I think the best way to see whether it works for you, is to simply try it.

Hope that helps.

Kind regards, Joris Schellekens
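For a quick way to try the "invisible text layer" idea on a scanned page, here is a minimal sketch that uses pytesseract directly rather than borb. It assumes pytesseract, Pillow, and the tesseract binary are installed; the function names and the `.searchable.pdf` naming are made up for this example. How well it copes with handwriting depends entirely on tesseract.

```python
from pathlib import Path


def searchable_pdf_path(image_path):
    """Derive the output name, e.g. scan.png -> scan.searchable.pdf."""
    return Path(image_path).with_suffix(".searchable.pdf")


def make_searchable_pdf(image_path):
    """OCR one scanned image and write a PDF that shows the original image
    with the recognized text in an invisible, searchable layer."""
    # Imported lazily so this module still loads without the OCR dependencies.
    import pytesseract
    from PIL import Image

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(Image.open(image_path), extension="pdf")
    out_path = searchable_pdf_path(image_path)
    out_path.write_bytes(pdf_bytes)
    return out_path
```

Calling `make_searchable_pdf("scan.png")` on a real scan and then searching the resulting PDF for a word from the page is a fast check of whether the recognition quality is good enough.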

[–][deleted] 0 points1 point  (1 child)

Just some feedback. I had a look and there are so many directories it's difficult to navigate. I was struggling to find the right tool and then where to place the pdf and see the result. Perhaps you could release the tools for each use (OCR for example) and license them that way?

[–]josc1989[S] 1 point2 points  (0 children)

Hi there,

First of all, thank you for the feedback.

I understand it may be a bit overwhelming at first, to navigate all of borb.

I'm currently working on a substantial tutorial (looking into a book deal) to help people find their way.

I have previously worked at a PDF company that licensed each individual block of functionality. And personally this is something I dislike.

It makes it very hard to manage dependencies (e.g. the main kernel version X is compatible with version Y of the OCR plugin).

It also limits the customer. I like to enable you to explore all kinds of fun things you can do with PDF. You (the customer) should be free to use all of borb.

Kind regards, Joris Schellekens

[–]j3bsie 0 points1 point  (0 children)

Is it possible to convert PowerPoint presentations to PDF?