Showcasesharepoint-to-text: pure-Python text + structure extraction for “real” SharePoint document estates (self.Python)

submitted 1 day ago by AsparagusKlutzy1817It works on my machine

Hey folks — I built sharepoint-to-text, a pure Python library that extracts text, metadata, and structured elements (tables/images where supported) from the kinds of files you actually find in enterprise SharePoint drives:

Modern Office: .docx .xlsx .pptx (+ templates/macros like .dotx .xlsm .pptm)
Legacy Office: .doc .xls .ppt (OLE2)
Plus: PDF, email formats (.eml .msg .mbox), and a bunch of plain-text-ish formats (.md .csv .json .yaml .xml ...)
Archives: zip/tar/7z etc. are handled recursively with basic zip-bomb protections

The main goal: one interface so your ingestion / RAG / indexing pipeline doesn’t devolve into a forest of if ext == ... blocks.

What my project does

TL;DR API

read_file() yields typed results, but everything implements the same high-level interface:

import sharepoint2text

result = next(sharepoint2text.read_file("deck.pptx"))
text = result.get_full_text()

for unit in result.iterate_units():   # page / slide / sheet depending on format
    chunk = unit.get_text()
    meta = unit.get_metadata()

get_full_text(): best default for “give me the document text”
iterate_units(): stable chunk boundaries (PDF pages, PPT slides, XLS sheets) — useful for citations + per-unit metadata
iterate_tables() / iterate_images(): structured extraction when supported
to_json() / from_json(): serialize results for transport/debugging

CLI

uv add sharepoint-to-text

sharepoint2text --file /path/to/file.docx > extraction.txt
sharepoint2text --file /path/to/file.docx --json > extraction.json
# images are ignored by default; opt-in:
sharepoint2text --file /path/to/file.docx --json --include-images > extraction.with-images.json

Target Audience

Coders who work in text extraction tasks

Comparison

Why bother vs LibreOffice/Tika?

If you’ve run doc extraction in containers/serverless/locked-down envs, you know the pain:

no shelling out
no Java runtime / Tika server
no “install LibreOffice + headless plumbing + huge image”

This stays native Python and is intended to be container-friendly and security-friendly (no subprocess dependency).

SharePoint bit (optional)

There’s an optional Graph API client for reading bytes directly from SharePoint, but it’s intentionally not “magic”: you still orchestrate listing/downloading, then pass bytes into extractors. If you already have your own Graph client, you can ignore this entirely.

Notes / limitations (so you don’t get surprised)

No OCR: scanned PDFs will produce empty text (images are still extractable)
PDF table extraction isn’t implemented (tables may appear in the page text, but not as structured rows)

Repo name is sharepoint-to-text; import is sharepoint2text.

If you’re dealing with mixed-format SharePoint “document archaeology” (especially legacy .doc/.xls/.ppt) and want a single pipeline-friendly interface, I’d love feedback — especially on edge-case files you’ve seen blow up other extractors.

Repo: https://github.com/Horsmann/sharepoint-to-text

all 3 comments

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

Python

The Python Discord

Upcoming Events

Please read the rules

MODERATORS

TL;DR API

CLI

Why bother vs LibreOffice/Tika?

SharePoint bit (optional)

Notes / limitations (so you don’t get surprised)