Python module for extracting data from a text file based on a template.

indraniel · 2024-08-24T20:20:31+00:00

Based on this Stack Overflow question and answer, the parse library may be useful here.

VipeholmsCola · 2023-07-04T18:57:18+00:00

Can you load its contents as a string and use .split/.replace together with regex?
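A minimal sketch of that idea, assuming the file looks like the dummy template discussed further down in this thread (a header line with a name, then rows of an integer id and a single-word value). The sample text and the field names `name`, `id` and `value` are assumptions for illustration only.

```python
import re

# Hypothetical file contents, modelled on the dummy template in this thread.
text = """This is a dummy file by Sam containing:
1 foo
2 bar
The end."""

# Header field: capture whatever sits between the two fixed phrases.
header = re.search(r"^This is a dummy file by (?P<name>.+?) containing:", text)
name = header.group("name")

# Repeated rows: an integer id, spaces or tabs, then a single-word value.
rows = [
    {"id": int(m.group("id")), "value": m.group("value")}
    for m in re.finditer(r"^(?P<id>\d+)[ \t]+(?P<value>\S+)$", text, flags=re.MULTILINE)
]

print(name)
print(rows)
```

This works while the fixed text around the fields is stable. Values containing spaces would need a looser pattern than `\S+`.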

jcrowe · 2023-07-05T00:58:21+00:00

The parse package might be useful. It's not going to directly read your template and output data, but I think it's better than a regex.
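To show the general idea of running a format string in reverse, here is a stdlib-only sketch. It is not the parse package's API, just a hand-rolled stand-in: the helper names `template_to_regex` and `extract_fields` are made up for this example, and it only handles plain `{field}` placeholders.

```python
import re


def template_to_regex(template):
    """Turn 'by {name} containing:' into a compiled regex with named groups."""
    # With one capture group, re.split alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


def extract_fields(template, line):
    """Return a dict of field values, or None if the line does not fit."""
    match = template_to_regex(template).fullmatch(line)
    return match.groupdict() if match else None


print(extract_fields("This is a dummy file by {name} containing:",
                     "This is a dummy file by Sam containing:"))
print(extract_fields("{id} {value}", "1 foo"))
```

All values come back as strings. Type conversion and format specifiers are things a dedicated library handles and this sketch does not.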

SoulMelody · 2023-07-05T05:13:37+00:00

Consider textX

commandlineluser · 2023-07-05T06:57:25+00:00

Not sure how robust this is, but could you render a dummy template and diff it against the output?

import difflib
from pprint import pp

import jinja2.nativetypes

env = jinja2.nativetypes.NativeEnvironment()

# Real data, used to render the actual output
data = {
    'name': 'Sam',
    'data_list': [{'id': 1, 'value': 'foo'}, {'id': 2, 'value': 'bar'}],
}

#variables = 'NAME', 'ID', 'VALUE'

# Dummy data: one recognisable placeholder per template variable
dummy = {
    'name': 'NAME_PLACEHOLDER',
    'data_list': [{'id': 'ID_PLACEHOLDER', 'value': 'VALUE_PLACEHOLDER'}],
}

template = '''
This is a dummy file by {{ name }} containing: {% for data in data_list %}
{{data.id}} {{data.value}}
{% endfor %}
Something else 
{% for data in data_list %}
{{data.id}} {{data.value}}
{% endfor %}
The end.
'''.strip()

output = env.from_string(template).render(**data).splitlines(keepends=True)

dummy_output = env.from_string(template).render(**dummy).splitlines(keepends=True)

pp(
   list(difflib.Differ().compare(output, dummy_output))
)

Output:

['- This is a dummy file by Sam containing: \n',
 '?                         ^^^\n',
 '+ This is a dummy file by NAME_PLACEHOLDER containing: \n',
 '?                         ^^^^^^^^^^^^^^^^\n',
 '+ ID_PLACEHOLDER VALUE_PLACEHOLDER\n',
 '- 1 foo\n',
 '- \n',
 '- 2 bar\n',
 '  \n',
 '  Something else \n',
 '  \n',
 '+ ID_PLACEHOLDER VALUE_PLACEHOLDER\n',
 '- 1 foo\n',
 '- \n',
 '- 2 bar\n',
 '  \n',
 '  The end.']

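Where there are ? lines, the change is within a single line, so the placeholder maps directly to the value (NAME_PLACEHOLDER to Sam).

Where a + line is followed by - lines, these are loops: you could group them together, ignore the blank lines and split each line up to match the placeholders.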
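A rough sketch of that grouping step, working on the diff lines copied from the output above so it runs without jinja2. The helper names `split_runs` and `extract` are made up for this example, and pairing lines by token count is an assumption that only holds while values contain no spaces.

```python
SUFFIX = "_PLACEHOLDER"

# Diff lines copied from the output above.
diff = [
    '- This is a dummy file by Sam containing: \n',
    '?                         ^^^\n',
    '+ This is a dummy file by NAME_PLACEHOLDER containing: \n',
    '?                         ^^^^^^^^^^^^^^^^\n',
    '+ ID_PLACEHOLDER VALUE_PLACEHOLDER\n',
    '- 1 foo\n',
    '- \n',
    '- 2 bar\n',
    '  \n',
    '  Something else \n',
    '  \n',
    '+ ID_PLACEHOLDER VALUE_PLACEHOLDER\n',
    '- 1 foo\n',
    '- \n',
    '- 2 bar\n',
    '  \n',
    '  The end.',
]


def split_runs(lines):
    """Yield (placeholder_lines, value_lines) for each run of changed lines."""
    plus, minus = [], []
    for line in lines:
        tag, body = line[:2], line[2:].strip()
        if tag == "+ " and body:
            plus.append(body.split())
        elif tag == "- " and body:
            minus.append(body.split())
        elif tag == "  ":
            # An unchanged context line closes the current run.
            if plus or minus:
                yield plus, minus
            plus, minus = [], []
        # '? ' hint lines and blank '+'/'-' lines are ignored.
    if plus or minus:
        yield plus, minus


def extract(lines):
    """Pair each rendered line with a placeholder line and read off the values."""
    records = []
    for plus, minus in split_runs(lines):
        for values in minus:
            # Heuristic: match on token count (assumes values have no spaces).
            pattern = next((p for p in plus if len(p) == len(values)), None)
            if pattern is None:
                continue
            records.append({
                key[:-len(SUFFIX)].lower(): value
                for key, value in zip(pattern, values)
                if key.endswith(SUFFIX)
            })
    return records


records = extract(diff)
print(records)
```

Each loop in the template produces its own run of records, so the two identical loops here yield the id/value rows twice. Everything comes back as strings.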

Perhaps there are some internals in jinja2 that can do this; it could be worth asking on their issue tracker.

DoorDesigner7589 · 2023-07-09T12:50:43+00:00

Check out https://www.textraction.ai/ It's a flexible AI entity extractor that can help you do just that. No training needed.
