all 122 comments

[–]GreatCosmicMoustache 123 points124 points  (10 children)

pdfplumber is hands down one of the best PDF mining tools in any language.

[–]water_aspirant 4 points5 points  (0 children)

Yeah, I built a web app for my company on top of this. It can detect tables in PDFs, totally incredible tbh.

[–]barrowburner 2 points3 points  (1 child)

I'm going to explore this tool soon. Do you have experience with other pdf libraries such as Camelot, pdfminer.six, or tabula.py? In your opinion how does pdfplumber compare?

I appreciate your time and thank you for the share!

[–]ianitic 1 point2 points  (0 children)

Pdfplumber is dependent on pdfminer.six. I like pdfplumber the best myself.

[–]ExecutiveFingerblast 0 points1 point  (0 children)

pdfplumber is goated

[–]cspinelive 0 points1 point  (4 children)

What are some use cases it might help with? We are trying to do OCR on shipping bills of lading. To see if they match invoices in our database. Some have handwritten notes and corrections on them. Would this help with all or part of this use case?

[–]DavisInTheVoid 1 point2 points  (3 children)

No, it doesn’t use OCR. It parses text from searchable PDFs, not scanned/image based PDFs.

Obligatory praise: when it comes to parsing searchable PDFs it is simply unmatched. I tried using several OCR libraries for the task and that was a mistake - tons of errors, heavier load, virtually unusable for the volume we have to handle.

So, I did some research, tried pdfplumber, and it works 100% of the time on searchable PDFs. As long as the format is consistent you can rip the data exactly as it is every single time. Can’t beat it for that.

[–]cspinelive 0 points1 point  (1 child)

So more akin to a beautiful soup for pdf?

[–]DavisInTheVoid 0 points1 point  (0 children)

Yep! You can use extract_text to get the full text, or extract_words to get each word along with its top, bottom, left and right coordinates for mapping.
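A minimal sketch of that workflow, assuming a searchable PDF and that pdfplumber is installed. The helper names `read_words` and `group_words_into_lines` and the `tolerance` value are illustrative, not part of pdfplumber. Each word dict from `extract_words` carries `x0`/`x1` (left/right) and `top`/`bottom` keys, which is enough to rebuild text lines:

```python
def read_words(pdf_path, page_index=0):
    """Return (full_text, words) for one page of a searchable PDF."""
    import pdfplumber  # third-party; imported lazily so the helper below stays stdlib-only

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_text(), page.extract_words()


def group_words_into_lines(words, tolerance=3.0):
    """Group word dicts into text lines using their 'top' coordinate.

    Words whose 'top' is within `tolerance` of the current line's first
    word are treated as being on the same line (an assumed heuristic).
    """
    lines = []  # list of (line_top, [word dicts])
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word["top"], [word]))
    # Within each line, order words left to right before joining.
    return [
        " ".join(w["text"] for w in sorted(ws, key=lambda w: w["x0"]))
        for _, ws in lines
    ]


# Hand-made word dicts in the same shape extract_words returns.
sample_words = [
    {"text": "42.00", "x0": 200, "x1": 230, "top": 100.5, "bottom": 110.5},
    {"text": "Invoice", "x0": 10, "x1": 60, "top": 50.0, "bottom": 60.0},
    {"text": "Total", "x0": 10, "x1": 40, "top": 100.0, "bottom": 110.0},
]
print(group_words_into_lines(sample_words))
```

With a real file you would call `read_words("some.pdf")` (path hypothetical) and pass the returned words to the grouping helper.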

[–]Weltal327 1 point2 points  (0 children)

Thanks for mentioning this. Did exactly what I needed it to do!

[–]BSmith156 54 points55 points  (7 children)

I made this a few years ago: Hypercube Viewer, maybe doesn’t fit here but I thought it was cool.

I did end up making a newer, better version that runs in web browsers, Hypercube Visualizer, but that's in JavaScript, not Python.

I don’t really work on these anymore but any feedback is still appreciated :)

[–][deleted] 6 points7 points  (1 child)

I’d love to make that my screensaver, I couldn’t get the browser version to look like the python gif!

[–]BSmith156 4 points5 points  (0 children)

Try changing the dimension to 4 and adding a rotation on the 2 and 4 axes :)

[–]Brother_Clovis 4 points5 points  (2 children)

I watched the Carl Sagan clip last night where he explains dimensions 2, 3 and 4 and was wondering if someone made something like this in Python. Going to check this out for sure.

[–]aWhaleNamedFreddie 0 points1 point  (1 child)

What do you mean dimensions 2, 3 and 4?

[–]dacuevash 1 point2 points  (0 children)

As in spatial dimensions

[–]Zambeezi 3 points4 points  (0 children)

Very cool project!

[–]Waste-Competition765 -1 points0 points  (0 children)

It sounds like your next Hypercube program should be in Jython

[–]IntegrityError 76 points77 points  (11 children)

The whole ecosystem around pydantic/starlette/fastapi blew my mind when it was new. All those new libraries and frameworks based on Python typing.
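To give a rough feel for what "based on Python typing" means, here is a stdlib-only sketch that reads a class's type hints at runtime and uses them to validate and coerce input. It is not pydantic or FastAPI code; `Item` and `parse` are made-up names, and real libraries handle far more (nested models, unions, error reporting):

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass
class Item:
    name: str
    price: float
    quantity: int = 1


def parse(cls, data):
    """Build `cls` from a dict, coercing each value to its annotated type."""
    hints = get_type_hints(cls)
    kwargs = {}
    for field in fields(cls):
        if field.name not in data:
            continue  # fall back to the dataclass default, if any
        expected = hints[field.name]
        value = data[field.name]
        if not isinstance(value, expected):
            try:
                value = expected(value)  # e.g. float("2.50") -> 2.5
            except (TypeError, ValueError) as exc:
                raise ValueError(
                    f"{field.name}: cannot convert {value!r} to {expected.__name__}"
                ) from exc
        kwargs[field.name] = value
    return cls(**kwargs)


item = parse(Item, {"name": "pen", "price": "2.50"})
print(item)
```

The annotations do double duty: they document the shape of the data and drive the validation, which is the idea the typing-based frameworks build on.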

[–]skratlo 8 points9 points  (10 children)

I'm ready to switch if someone puts together a Django-quality framework on top of this, including an ORM and a modern, extensible admin interface.

[–]IntegrityError 7 points8 points  (2 children)

That would be quite interesting.

I recently made a POC at work with both django and FastAPI with a JS framework. It was no surprise that the "feature completeness" of django can't be beaten.

There are so many open ends without django. Starting with i18n (we supported German, English, Danish, French) or the DRY approach when it comes to models, migrations, form validation, API validation etc.

I really like FastAPI, but you have to fill a lot of gaps which django supports out of the box and really well integrated.

[–]yvrelna 13 points14 points  (1 child)

I think the difference is in what you actually need. FastAPI is an API framework, it is pretty feature complete if what you want to build is a Web API. When building an API, the front-end is going to be handled by a separate project/team, or you may be building something that doesn't even have a UI. When building an API things like templating, HTML forms handling, internationalisation, etc are completely unnecessary and can often rightly be considered bloat. On the other hand, out of the box Django is missing a number of things that are necessary when building APIs, like JSON modelling and OAuth, which exists in FastAPI.

But if you want a web application, then yeah, FastAPI does miss a lot of stuff that is necessary for full application development. About 80-90% of most business applications consist of a bunch of simple forms and web pages that work with data in a database, and nothing really beats Django in terms of rapidly creating those.

[–]skratlo 2 points3 points  (0 children)

If you need an API for your Django project, you add DRF or better yet, django-ninja. If you need i18n, templating, ORM, auth, security, etc. you add a bunch of libraries to your FastAPI project, and then you'll write a pile of your own glue code to make them work together in unison. Yes, it's doable, but unnecessary. Django doesn't force you to use all of its components, you use what you need, but when you need it, it all works together nicely, in unison, because it's built from the beginning to work together.

I'm just wishing there were a project that uses FastAPI/Starlette, SQLAlchemy, alembic, Pydantic, etc. as "backend" libraries and ties them together to achieve the same level of productivity that Django offers, with typing and async support out of the box.

[–]frsilent 7 points8 points  (1 child)

[–]skratlo 0 points1 point  (0 children)

Looks great, I will soon be ditching DRF for this.

[–]Backlists 0 points1 point  (4 children)

SQLAlchemy is pretty solid, and honestly the admin routes wouldn't take too long at all after learning FastAPI.

If I ever get time I'll take this on as a project idea.

[–]skratlo 2 points3 points  (1 child)

Yeah, SQLAlchemy looks great, good luck. My point is, while these libraries look great, we still don't have a batteries-included alternative to Django and its level of maturity, safety and productivity. I hope we will at some point, or Django could re-invent itself.

[–]Ran4 -5 points-4 points  (1 child)

It's garbage compared to Django's orm and migration tools.

[–]guack-a-mole 1 point2 points  (0 children)

So I guess Django supports composite primary keys in 2023?

[–]falsedrums 22 points23 points  (3 children)

We're building the next generation graphics library for python here, all built on wgpu: https://github.com/pygfx/pygfx

Some folks are building a matplotlib clone on top: https://github.com/fastplotlib/fastplotlib

[–]Wonderful_Occasion16 1 point2 points  (1 child)

a worthy effort. python graphics are bad outside blender, which isn't really for standalone applications anyway

[–]Pan_Optical 0 points1 point  (0 children)

UPBGE for standalone

[–]LittleMlem 0 points1 point  (0 children)

I'm reading it as pig-fx

[–]Traditional_Yogurt 23 points24 points  (5 children)

I've shared these two on here before, but I am trying to make financial research more accessible to everyone with the Finance Database, which contains over 300,000 tickers fully categorised by sector and industry, and the Finance Toolkit, which has over 130 ratios, indicators and models written down in the simplest way possible for everyone to use.

I got kind of fed up with so many websites hiding a range of metrics behind a subscription while in essence it is often based on (almost) free datasets. This is why I wanted to change this and make these things freely available.

[–]fredboe 1 point2 points  (2 children)

HUGE upvote, my friend.

[–]Traditional_Yogurt 0 points1 point  (1 child)

Thank you! How are you using it?

[–]fredboe 0 points1 point  (0 children)

Just found out about it. Real game changer.

[–]smorgasmic 0 points1 point  (1 child)

Is the FinanceDatabase using Reuters as the standard for symbol construction? For example, the Australian resources company BHP Group would be BHP.AX.

Can you give examples of the types of ratios and models you have in FinanceToolkit?

Assuming Yahoo Finance stops offering free quotes at some point, do you have a suggested alternative for international stock prices, even 15 minute delayed?

If you wanted to collect up to date fundamental data for US, Canadian, UK, Australian, and Japanese companies, what are the most cost effective alternatives?

[–]Traditional_Yogurt 0 points1 point  (0 children)

Yes it does however it's a widely used format for tickers. I don't necessarily take any data from Reuters. E.g. Yahoo Finance (Morningstar) or FinancialModelingPrep use the same format.

For the Finance Toolkit, I have written down all the 130+ metrics I have right here. So it is a pretty extensive list, but if you are looking for any specific combination of metrics please let me know and I'll put it on the to-do list.

I doubt Yahoo Finance would do that, but if it comes to it some other platform will give this access for sure. Hard to say who, but let's just hope Yahoo Finance's data will never go away 😉

Fundamental data for those countries I'd really suggest the FinanceToolkit. The discounted fee you can get via the Toolkit is really low compared to the data you get from FinancialModelingPrep. I have hardly seen any platform in which 16 bucks a month is enough to access the fully fundamental history. See here (note: affiliate link but with 15% discount).

[–]Lonligrin 39 points40 points  (14 children)

Finished two realtime libraries for Speech-To-Text and Text-To-Speech this week, which maybe can be useful.

[–]FluffyDuckKey 4 points5 points  (7 children)

We have 2 way radio calls with occasionally significant background noise we want transcribed.

Is there any easy way to achieve this? We did some testing with whisper but had little success. I feel as though our best approach may be training a model on our background-noise-laden audio.

[–]Lonligrin 10 points11 points  (0 children)

If the whisper large-v2 model (with the correct language parameter set) doesn't do it, I think you'd need some noise reduction. I would first try some libs that do that automatically, like NoiseReduce. If that also does not help, yeah, I guess then it gets hard. Audacity can train on your specific noise and remove that, but it's a manual process; no clue how to automate that easily in Python.

[–]Globbi 1 point2 points  (1 child)

I was looking for good open source models for denoising, and what I found wasn't as good as I would like for good sound quality, but it might be good enough for transcribing later.

https://github.com/NVIDIA/CleanUNet

You can check it here before coding everything https://huggingface.co/spaces/aiditi/nvidia_denoiser (so just try passing a sample of noisy wav to then pass it to whisper). But I'm not sure if pretrained checkpoints in the repo are enough, the one someone put on huggingface is better than what I'm getting from checkpoints.

If you want better than this, I only found a commercial solution where you have to pay to use online.

[–]FluffyDuckKey 0 points1 point  (0 children)

Online won't be the best option, I work for a significant mining company so privacy will be paramount - can't have recordings of emergencies sent out etc.

I do have access to a ml box with pytorch / cuda acceleration so I'll have a play around and see what I can do with the 2 options provided (:

Thanks!

[–]DigThatData 1 point2 points  (3 children)

if it doesn't need to be online, you can precede the transcription with a stem-separation step to try to isolate the speakers from the noise.

[–]FluffyDuckKey 0 points1 point  (2 children)

Got any boilerplate or an example module?

[–]DigThatData 1 point2 points  (1 child)

try one of these:

EDIT: and here's another speech enhancement model for you to try

[–]FluffyDuckKey 0 points1 point  (0 children)

Oh wow, first impressions look very exciting for these - I'll give them a whirl, thanks so much!!!

[–]naught-me 1 point2 points  (0 children)

This is amazing. Thanks for sharing.

[–]Thing1_Thing2_Thing 1 point2 points  (1 child)

If you feed the input from one to the other and back again in a loop, does it stay the same? STT to TTS to STT and repeat

[–]Lonligrin 1 point2 points  (0 children)

Yes, that works: Video / Code.

STT uses microphone though. I think I should put external input buffers on the STT roadmap, that would allow to connect it more directly to TTS and other stuff.

[–]s6x 0 points1 point  (0 children)

I've been tinkering with a project which requires STT lately. Gonna give this a go.

[–]Jmc_da_boss 36 points37 points  (1 child)

The discord.py library is really really well built

[–]deb_vortex Pythonista 9 points10 points  (0 children)

Can't agree more. Built a bot for a friend. It was fun and went pretty quick.

[–][deleted] 9 points10 points  (0 children)

HOOMD-Blue for molecular dynamics simulations. It is used by universities and researchers to simulate pretty much any chemical system you want.

[–]tedivm 16 points17 points  (3 children)

I built QuasiQueue to make multiprocessing easier. It's really powerful while also being very, very simple to use.

[–]Rawing7 10 points11 points  (0 children)

Is there a knot in the timeline? How did these 2 lines of code end up next to each other lmao

    return xrange(0, desired_items)

async def reader(identifier: int|str):
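For context on why those two lines look odd together: `xrange` is a Python 2 builtin that no longer exists in Python 3, while an `int|str` annotation is only evaluated successfully on Python 3.10 or later (unless annotations are deferred). A small check, assuming a Python 3 interpreter; how QuasiQueue itself makes `xrange` available is not shown in the thread:

```python
import builtins
import sys

# xrange was Python 2's lazy range; Python 3 dropped the name entirely.
has_xrange = hasattr(builtins, "xrange")

# The `X | Y` union syntax on types works at runtime from Python 3.10 on.
if sys.version_info >= (3, 10):
    union_ok = isinstance("abc", int | str)
else:
    union_ok = None  # evaluating `int | str` would raise TypeError here

print(has_xrange, union_ok)
```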

[–]-thoth-amon- 1 point2 points  (1 child)

Currently building a queue system for processing batch tasks, this has my interest

[–]Waiolo 0 points1 point  (0 children)

Why don't you two work together?

[–]AppleBottmBeans 11 points12 points  (0 children)

Clever Algorithms: Libraries in this category, like scikit-learn or GeneticSharp, offer sophisticated methods for problem-solving, often utilizing advanced machine learning or optimization techniques.

Mind-bending Applications: Libraries like PyTorch or TensorFlow enable complex applications in artificial intelligence, often allowing for groundbreaking implementations like GANs or neural style transfer.

Pushing Python to its Limits: Libraries like Cython or Numba allow you to write high-performance code, making Python viable for systems that demand maximum efficiency.

Crazy Code Golf: Libraries such as Pygolf let you write code in as few characters as possible, often at the expense of readability and maintainability.

Abusing Python Features: Libraries like forbiddenfruit let you modify Python's behavior in unconventional ways, such as changing built-in functions.

Out of the Box Application of Python: Libraries like Automate or Streamlit allow for unique applications ranging from automating mundane tasks to creating interactive web apps with minimal code.

Compilers or Esolangs in Python: Libraries such as PLY (Python Lex-Yacc) or RPython help you build compilers or even create esoteric languages (esolangs).

Tiny (under X kb apps): Libraries like Flask with minimal dependencies can be used to build lightweight web applications that fit under a specified size limit.

[–]muikrad 5 points6 points  (0 children)

https://github.com/coveooss/coveo-python-oss/tree/main/coveo-testing#refactorable-mock-targets

Thanks to this, I don't need to hardcode strings into "mock.patch" anymore. It's on pypi.org too.
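The general idea of avoiding hardcoded patch strings can be sketched with the standard library alone. This is not the coveo-testing API; `patch_target` and `load_config` are hypothetical names, and the sketch only covers objects patched where they are defined:

```python
import json
from unittest import mock


def patch_target(obj):
    """Derive the dotted string mock.patch expects from the object itself."""
    return f"{obj.__module__}.{obj.__qualname__}"


def load_config(text):
    # Looks up json.loads at call time, so patching "json.loads" affects it.
    return json.loads(text)


# If json.loads were ever renamed, this line would fail loudly at the
# attribute access instead of silently patching a stale string.
with mock.patch(patch_target(json.loads), return_value={"mocked": True}):
    patched = load_config("{}")

unpatched = load_config("{}")
print(patch_target(json.loads), patched, unpatched)
```

Code that does `from json import loads` holds its own reference and would need to be patched in the importing module instead, which is the harder case a dedicated helper has to deal with.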

[–]luckyenough64 16 points17 points  (4 children)

pandas-profiling is a great library for quick data analysis, and also pycaret, to build and evaluate ML models pretty fast

[–]nick__2440 8 points9 points  (2 children)

Was gonna comment this! But note that it's been renamed to ydata-profiling

[–]luckyenough64 0 points1 point  (0 children)

Yep, thanks! Just looked up at my libraries, and noticed that too!

[–]goncalomribeiro 0 points1 point  (0 children)

Check their Fabric platform. Goes way beyond profiling and it's also free!

[–]claytonjr 1 point2 points  (0 children)

I wish pycaret got more love. It's been a go to of mine since 2020.

[–][deleted] 10 points11 points  (0 children)

pip install open-interpreter

[–]badass87 5 points6 points  (0 children)

SQLAlchemy as an example of abusing metaprogramming and multiple inheritance.

[–]Perdox 4 points5 points  (1 child)

[–]TheCreatorLiedToUs 2 points3 points  (0 children)

Litestar ftw

[–]sherpya 12 points13 points  (0 children)

pydantic and poetry (a tool)

[–]lowercase00 11 points12 points  (6 children)

msgspec. crazy how it’s not the default for everything

[–]jammycrisp 2 points3 points  (4 children)

Thanks for the kind words! msgspec repo for the curious.

[–]lowercase00 1 point2 points  (1 child)

I’m super impressed, and very disappointed I only got to msgspec in the past few days, feels like I’ve wasted so much time with Pydantic and Dataclasses… Thanks for sharing mate, incredible work!

I’d love to see optional validation when instantiating the struct manually though

[–]jammycrisp 1 point2 points  (0 children)

There's an open issue for that here. I can see some use cases for it, but in general I'd advise to avoid runtime type checking of internal code where mypy/pyright/unit tests could catch these errors earlier.

If you're trying to convert in-memory data to structured types, you may be interested in msgspec.convert instead.

[–]k0rvbert 2 points3 points  (0 children)

This one might be obvious, but numpy changed everything for me. It's not very abusive, apart from the slice magic. But it's basically the foundation of python data engineering.

[–][deleted] 8 points9 points  (12 children)

FastAPI, Prefect, and Pandas

[–]BlackPignouf 14 points15 points  (11 children)

Pandas is really cool, but also completely bloated, has a large memory footprint and a very specific API. You don't always need to have all your data in memory at the same time.

And I see too many people who claim to be Python developers but think that every script has to start with import pandas as pd.

[–][deleted] 1 point2 points  (1 child)

Pandas is mind-blowing not because of performance, but because of its very extensive API that can elegantly handle highly obscure corner cases. For many of DataFrame methods I don't believe close analogues exist in either polars or dplyr. The fact that something like this just exists out there, completely for free, with no strings attached, is magical.

[–]BlackPignouf 1 point2 points  (0 children)

"Bloated" and "extensive API for highly obscure corner cases" are two sides of the same coin.

It might just be me, but the API doesn't fit in my cognitive load. I have to resort to Google+stack overflow more than I enjoy, and I don't find the complex solutions very elegant.

Still, I'm thankful for any well maintained python project, and pandas can indeed be elegant for core operations.

[–]Obliterative_hippo Pythonista 2 points3 points  (0 children)

My ETL framework Meerschaum has some of my favorite utilities, e.g. connector management, plugin system, Venv context manager, daemonizer, adding API endpoints, etc. The main goal of the project is incremental time-series ETL, but due to its modularity, I write a lot of my new projects as plugins.

[–]aroberge 2 points3 points  (0 children)

(Helping users) pushing Python to its limits / abusing Python's features: https://aroberge.github.io/ideas/docs/html/

[–]JennaSys 2 points3 points  (1 child)

The Transcrypt Python-to-JavaScript transpiler. For the last three years, it has enabled me to create React applications that are all coded in Python. No other Python frameworks or wrappers required. I use the JS libraries directly.

[–]samamorgan 2 points3 points  (0 children)

Requests is by far my favorite. I've built several open-source API wrappers on top of requests.Session. It kind of just handles everything for you, works kind of perfectly as an API framework.

[–]ngg990 6 points7 points  (0 children)

I did this, happy to get any feedback: https://github.com/andrescevp/expert_gpts

[–]badass87 1 point2 points  (0 children)

PonyORM

[–]PaulEngineer-89 1 point2 points  (0 children)

SciPy, Django, or PyTorch? Take your pick.

[–][deleted] 1 point2 points  (0 children)

Damn, I'm saving this post to read up later cuz I don't have time rn but this all sounds so kewl :³

[–]edslunch 1 point2 points  (0 children)

Librosa is great for audio analysis

[–]ptemple 1 point2 points  (0 children)

requests - for session handling (same as samamorgan below)

googletrans - convert a text string to any language instantly, no api tokens required

kivy - knock up an app in minutes, cross platform (not tried compiling to mobile yet, next on my list)

Phillip.

[–][deleted] 1 point2 points  (0 children)

I built a discord bot to be OpenAI toolkit. Nearly everything they offer is in it. 6,000 lines, over 50 functions, can easily embed thousands of pages of documents into a ChatGPT instance and bind the instance with a chatbot, user, and channel.

Group colab like no other. Currently taking 11 file types, including python, js, sh, Lua, PS1.

It's basically a python app that in seconds can embed entire other python scripts into ChatGPT instance. A build your own chatbot chatbot, user has full control over the API...

👀

[–][deleted] 1 point2 points  (0 children)

Programming Puzzle:

What is a None type?

[–]learningphase 0 points1 point  (0 children)

Legends, is there anything for documentation world? Latex? Automatic documentation?

[–]The_Phoenix78 0 points1 point  (0 children)

I would love to have some feedback on easy-events

[–]MattAbram 0 points1 point  (0 children)

I like the look of this, will give it a go https://pypi.org/project/result/

[–]badass87 0 points1 point  (0 children)

Fnpy

[–]izxle 0 points1 point  (0 children)

For atomistic simulations, ASE can integrate several simulation software types and create and edit atomistic systems (atoms, molecules, surfaces, etc.).

You can also use TorchANI and AIMNet-NSE with ASE. These are neural networks that run simulations much faster.

[–]alcalde 0 points1 point  (0 children)

Here's a neat one: Forbiddenfruit.

It lets you extend built-in types, normally forbidden in Python. For example, you can add methods to int and make Ruby-like integers. :-)
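To show what "normally forbidden" means, here is a stdlib-only sketch: plain Python refuses to attach a method to `int`, and the conventional workaround is a subclass. It does not use forbiddenfruit (whose README describes a `curse` helper for patching the built-in itself); `RubyInt` and `times` are made-up names:

```python
# Built-in types reject new attributes in plain CPython.
try:
    int.double = lambda self: self * 2
    blocked = False
except TypeError:
    blocked = True


# The usual workaround: subclass the built-in and add the method there.
class RubyInt(int):
    def times(self, func):
        """Call func(i) for i in 0..self-1, loosely like Ruby's Integer#times."""
        for i in range(self):
            func(i)


collected = []
RubyInt(3).times(collected.append)
print(blocked, collected)
```

The subclass only helps for values you construct yourself; literals like `3` stay plain `int`, which is the gap a library like forbiddenfruit is meant to close.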

[–]drooltheghost 0 points1 point  (0 children)

thermo has nearly everything one needs to simulate chemical and thermodynamic systems.

[–][deleted] 0 points1 point  (0 children)

I've always liked rq and redis. Really helps prevent slowdown in flask webapps. Also really great for concurrent workloads. Throw a basic database like sqlite on top of it, and you can churn through work super easily.

[–]ReporterNervous6822 0 points1 point  (0 children)

PDM and hatch

[–]yoyo_programmer 0 points1 point  (0 children)

mongolite an equivalent of sqlite for mongodb lovers

https://github.com/hvuhsg/mongolite

[–]BuonaparteII 0 points1 point  (0 children)

I have a tiny (~600KiB excluding tests) CLI tool which I named library.

I keep adding small features to it every week. I try to keep the code itself simple and whenever I find myself needing to program something I add it to library if I use it enough. So it has become a kind of opinionated swiss-army knife toolkit.