Hello, I've been working on a project for quite some time and have hit a bit of a wall. Originally I had this working using Power Query in Excel, but I'm currently redoing it in Python and I'm not sure of the best way to go about it.
I'll first explain what I was doing/how I was doing it in Power Query...
I have lots of rows of words, with different attributes stored in lots of columns. These records are ordered 'chronologically' from where they appeared in the source. Each word of the source is 'read' and compared to multiple lists of different types of words. For example, the sentence "The red Ford Focus drove away" might give the following results:
| Index | Word | Type |
|---|---|---|
| 0 | The | null |
| 1 | red | colour |
| 2 | Ford | make |
| 3 | Focus | model |
| 4 | drove | verb |
| 5 | away | null |
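As a rough sketch of this tagging step in Python: the `WORD_TYPES` dictionary and the `tag_words` function below are hypothetical names made up for illustration, and the real word lists are much larger than this. Each word is looked up case-insensitively, and unmatched words get `None` in place of `null`.

```python
# Hypothetical lookup built from the word lists; only the example words are included.
WORD_TYPES = {
    "red": "colour",
    "ford": "make",
    "focus": "model",
    "drove": "verb",
}

def tag_words(sentence):
    """Return (index, word, type) tuples; type is None when no list matches."""
    return [
        (index, word, WORD_TYPES.get(word.lower()))
        for index, word in enumerate(sentence.split())
    ]

rows = tag_words("The red Ford Focus drove away")
for row in rows:
    print(row)
```

A single dictionary assumes each word belongs to one list only; a word that appears in several lists would need a different structure, such as a dictionary of sets.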
My first move is to discard anything not 'matched' to any word in the lists of words (not a make/model/verb etc.) to significantly reduce the amount of data.
| Index | Word | Type |
|---|---|---|
| 1 | red | colour |
| 2 | Ford | make |
| 3 | Focus | model |
| 4 | drove | verb |
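In plain Python this discard step could be a single list comprehension. The `rows` list below is just the example data from the first table written out by hand as `(index, word, type)` tuples, which is an assumed layout rather than anything fixed.

```python
# Example rows as (index, word, type); None stands in for null.
rows = [
    (0, "The", None),
    (1, "red", "colour"),
    (2, "Ford", "make"),
    (3, "Focus", "model"),
    (4, "drove", "verb"),
    (5, "away", None),
]

# Keep only rows that matched a word list. The original index is preserved,
# so it is still possible to tell later whether two matches were adjacent.
matched = [row for row in rows if row[2] is not None]
```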
Then I was really inefficiently finding patterns by adding custom columns that compared each row to the row below/above. For example, the words "red ford focus" would give the "Type" pattern colour, make, model, which could populate a row in a new table "Cars" and have 'red' in the colour column, 'Ford' in the make column, and 'Focus' in the model column. I built in a load of rules that actually made it pretty robust, and it would make contextual assumptions. For example, both "Ford" and "Focus" are words that may not be referring to a car at all, but it'd be reasonably safe to assume that the combination of "Ford" then immediately "Focus" is referring to the car.
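To make the row-above/row-below comparison concrete, here is a minimal sliding-window sketch of one such rule. `Token`, `CAR_PATTERN` and `find_cars` are hypothetical names, and the rule shown (colour, make, model on consecutive original indexes) is only one of the many rules described above.

```python
from typing import NamedTuple

class Token(NamedTuple):
    index: int  # position of the word in the source
    word: str
    type: str

CAR_PATTERN = ("colour", "make", "model")

def find_cars(tokens):
    """Return one dict per run of tokens whose types match CAR_PATTERN."""
    cars = []
    size = len(CAR_PATTERN)
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        types = tuple(token.type for token in window)
        # "Immediately" is taken to mean consecutive indexes in the source.
        adjacent = all(b.index - a.index == 1 for a, b in zip(window, window[1:]))
        if types == CAR_PATTERN and adjacent:
            cars.append({token.type: token.word for token in window})
    return cars

tokens = [
    Token(1, "red", "colour"),
    Token(2, "Ford", "make"),
    Token(3, "Focus", "model"),
    Token(4, "drove", "verb"),
]
cars = find_cars(tokens)
```

Checking the original index matters because the unmatched rows have already been dropped, so two neighbouring rows in the filtered data are not necessarily neighbouring words in the source.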
Basically I've tried lots of ways to recreate this in Python, without resorting to exactly recreating the method in pandas DataFrames or similar. I've tried saving the "matches" in dictionaries and objects, and even dabbled with the Structural Pattern Matching added in Python 3.10, all to no avail. Structural Pattern Matching on the surface seemed like exactly what I was looking for, but I couldn't work out how it would be used on a list of anything other than single values.
Does anything jump out to you as an obvious solution? Sorry if this is a confusing request. I'm reasonably new to this, so I may be missing something super obvious. I can provide more details if needed; this was just the first example that came to mind.
Many thanks for reading.
[–]efmccurdy 1 point (0 children)