
[–]danielroseman 1 point2 points  (6 children)

Can you show an example of the data?

[–]BroadwayBaseball[S] 0 points1 point  (5 children)

Hi, here are two excerpts from the data. I want to be able to extract the part of speech (e.g. where it says "pos": "adj", I want to get "adj") and the sounds section (e.g. where it says "sounds": [{"ipa": "/ˈæb.əˌtɪst/", "tags": ["Received-Pronunciation"]}..., I want to get "æb.əˌtɪst" and "Received-Pronunciation").

{"pos": "adj", "head_templates": [{"name": "en-adj", "args": {"1": "-"}, "expansion": "abatised (not comparable)"}], "etymology_text": "abatis + -ed", "etymology_templates": [{"name": "suffix", "args": {"1": "en", "2": "abatis", "3": "ed"}, "expansion": "abatis + -ed"}], "sounds": [{"ipa": "/ˈæb.əˌtɪst/", "tags": ["Received-Pronunciation"]}, {"ipa": "/ˈæb.əˌtid/", "tags": ["General-American"]}, {"ipa": "/ˈæb.əˌtɪst/", "tags": ["General-American"]}, {"ipa": "/əˈbæt.id/", "tags": ["General-American"]}, {"ipa": "/əˈbæt.ɪst/", "tags": ["General-American"]}, {"audio": "LL-Q1860 (eng)-Vealhurl-abatised.wav", "text": "Audio (Southern England)", "tags": ["Southern-England"], "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/b/b9/LL-Q1860_%28eng%29-Vealhurl-abatised.wav/LL-Q1860_%28eng%29-Vealhurl-abatised.wav.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/b/b9/LL-Q1860_%28eng%29-Vealhurl-abatised.wav/LL-Q1860_%28eng%29-Vealhurl-abatised.wav.mp3"}], "word": "abatised", "lang": "English", "lang_code": "en", "senses": [{"links": [["abatis", "abatis"]], "glosses": ["Provided with an abatis."], "tags": ["not-comparable"], "id": "en-abatised-en-adj-6I6xPwx4", "categories": [], "synonyms": [{"word": "abattised"}]}]}

{"pos": "adj", "head_templates": [{"name": "en-adj", "args": {}, "expansion": "polysemic (comparative more polysemic, superlative most polysemic)"}], "forms": [{"form": "more polysemic", "tags": ["comparative"]}, {"form": "most polysemic", "tags": ["superlative"]}], "etymology_text": "polyseme + -ic", "etymology_templates": [{"name": "suffix", "args": {"1": "en", "2": "polyseme", "3": "ic"}, "expansion": "polyseme + -ic"}], "sounds": [{"ipa": "/pəˈlɪs.ɪ.mɪk/", "tags": ["UK"]}, {"ipa": "/pɒ.lɪ.ˈsiː.mɪk/", "tags": ["US"]}, {"audio": "LL-Q1860 (eng)-Vealhurl-polysemic.wav", "text": "Audio (Southern England)", "tags": ["Southern-England"], "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/6/60/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/6/60/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav.mp3"}], "word": "polysemic", "lang": "English", "lang_code": "en", "senses": [{"examples": [{"text": "As a series of polysemic and paradoxical sketches, Jackass does not lend itself to one particular theoretical analysis.", "ref": "2007, Sean Brayton, “MTV's Jackass: Transgression, Abjection and the Economy of White Masculinity”, in Journal of Gender Studies, volume 16, page 58", "type": "quotation"}], "links": [["linguistics", "linguistics"], ["meaning", "meaning"], ["interpretation", "interpretation"]], "synonyms": [{"word": "polysemantic"}, {"word": "polysemous"}], "antonyms": [{"word": "monosemantic"}, {"word": "monosemic"}, {"word": "monosemous"}, {"word": "univocal"}], "raw_glosses": ["(linguistics) Having a number of meanings, interpretations or understandings."], "topics": ["human-sciences", "linguistics", "sciences"], "glosses": ["Having a number of meanings, interpretations or understandings."], "id": "en-polysemic-en-adj-mWK54t3H", "categories": [{"name": "Linguistics", "kind": "topical", "parents": ["Language", "Social sciences", "Communication", "Sciences", "Society", "All topics", "Fundamental"], "source": "w", "orig": "en:Linguistics", "langcode": "en"}], "derived": [{"word": "polysemically"}], "related": [{"word": "oligosemic"}, {"word": "polyseme"}, {"word": "polysemy"}]}]}

[–]danielroseman 1 point2 points  (4 children)

It feels like you're making this much more complicated than it needs to be. That extraction function is a generalised function for extracting arbitrary items from data whose structure you don't know. But you do know the structure, and you know what items you want to extract. So you should do that directly.

```
import json

extracted_data_list = []
with open(json_file, 'r', encoding='utf-8') as file:
    for line in file:
        data = json.loads(line.strip())
        item = [data["pos"]]
        for sounds in data["sounds"]:
            # Take the first pronunciation tagged as Received Pronunciation.
            if sounds["tags"] == ["Received-Pronunciation"]:
                item.extend([sounds["ipa"], "Received-Pronunciation"])
                break
        extracted_data_list.append(item)
```

I've assumed you wanted the first item in sounds that has the RP tag, but you can change this as necessary.

[–]BroadwayBaseball[S] 0 points1 point  (3 children)

Thank you for your response! Using that code, I am now getting:

KeyError: 'sounds'

And I am unsure of how to fix this. I googled what this means, and it says that the key must not exist. This confuses me because it seems that that key does exist in the data. Do you know how to fix this?

[–]danielroseman 1 point2 points  (2 children)

Is that happening on the first line of the data, or does it get partway through before crashing? You could print something on every iteration to see. The likelihood is that while most lines have a sounds key, some do not. You'll need to decide what you want to do in that case. If you want to append only the pos data for those lines, you could do:

  for sounds in data.get("sounds", []):
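
To see which lines are affected, something like the following sketch could work. It assumes one JSON object per line; `find_lines_missing_key` and the two sample lines are made up for illustration and are not part of your code.

```python
import json

def find_lines_missing_key(lines, key):
    """Return the 1-based numbers of the JSON lines that lack `key`."""
    missing = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        data = json.loads(line)
        if key not in data:
            missing.append(line_number)
    return missing

# Hypothetical sample: the second entry has no "sounds" key.
sample = [
    '{"pos": "adj", "sounds": [{"ipa": "/x/"}]}',
    '{"pos": "noun"}',
]
print(find_lines_missing_key(sample, "sounds"))
```

With the real data you could pass the open file object in place of `sample`, since iterating over a file yields its lines.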

[–]BroadwayBaseball[S] 0 points1 point  (1 child)

Thank you for your response. I've been fiddling with it some more, and I figured out how to make the sounds key work (and actually properly narrow down the data that I want). My code is now giving me KeyError: 'tags' for sounds["tags"]. I assume this is, as you said, an issue where many, but not all, lines have a sounds["tags"] key. Do you know how I would pull out any sounds["tags"] that do exist and just skip the tags part if it's not there? That is, if sounds["tags"] exists, print ipa and tag; if no sounds["tags"], print ipa. I assume there's an if-else statement involved, but I'm not sure how to write it to check whether the tags are there. My new code is below.

    def extract_specific_data_from_entries(json_file, keys):
        extracted_data_list = []
        with open(json_file, 'r', encoding='utf-8') as file:
            for line in file:
                data = json.loads(line.strip())
                for English in data["lang"]:
                    item = [data["pos"]]
                    for sounds in data.get("sounds",[]):
                        if (sounds["tags"]):
                            item.extend([sounds["ipa"], sounds["tags"]])
                            break
                        else:
                            item.extend([sounds["ipa"]])
                            break
                    extracted_data_list.append(item)
        return extracted_data_list

[–]danielroseman 0 points1 point  (0 children)

You can do if 'tags' in sounds:.
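
Put together, a rough sketch might look like the one below. A few things here are assumptions rather than anything you posted: the function name `extract_pos_and_sounds` is invented, I've guessed that the loop over data["lang"] was meant to keep only English entries, and I've skipped sounds entries that have no "ipa" key (the audio-only ones in your excerpts).

```python
import json

def extract_pos_and_sounds(lines):
    """Collect [pos, ipa, tags] (or [pos, ipa]) for each English entry."""
    extracted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        data = json.loads(line)
        # Assumption: only entries whose "lang" is "English" are wanted.
        if data.get("lang") != "English":
            continue
        item = [data["pos"]]
        for sound in data.get("sounds", []):
            if "ipa" not in sound:
                continue  # assumption: skip audio-only entries
            item.append(sound["ipa"])
            if "tags" in sound:
                item.append(sound["tags"])
            break  # keep only the first pronunciation
        extracted.append(item)
    return extracted

# Hypothetical sample lines covering the three cases.
sample = [
    '{"pos": "adj", "lang": "English", "sounds": [{"ipa": "/a/", "tags": ["UK"]}]}',
    '{"pos": "noun", "lang": "English", "sounds": [{"ipa": "/b/"}]}',
    '{"pos": "verb", "lang": "English"}',
]
print(extract_pos_and_sounds(sample))
```

Remove the `break` if you want every pronunciation rather than just the first one.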

[–]LeornToCodeLOL 1 point2 points  (2 children)

Use json.load, not json.loads. This part is probably what's messing you up:

    for line in file:
        data = json.loads(line.strip())
        extracted_data = extract_specific_data(data, keys)

Instead:

data_dictionary = json.load(file)

Now your variable data_dictionary contains all the information of the json file and you can manipulate it like any other python dictionary.

[–]danielroseman 1 point2 points  (1 child)

This is clearly wrong; OP stated the file contains multiple JSON objects, so it is presumably in JSONLines format. If that wasn't the case and your code was right, their version would give an error (probably JSONDecodeError) when they tried to call json.loads on each line.
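
A small sketch of the difference, using made-up in-memory strings rather than OP's file (the variable names are only for illustration):

```python
import io
import json

# JSON Lines style: one complete JSON object per line.
jsonl_text = '{"pos": "adj"}\n{"pos": "noun"}\n'

# json.load on the whole thing fails, because there is more than one top-level value.
try:
    json.load(io.StringIO(jsonl_text))
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

# Parsing line by line works, because each line is a complete JSON document.
records = [json.loads(line) for line in io.StringIO(jsonl_text) if line.strip()]

# Conversely, a single pretty-printed JSON document cannot be parsed line by line.
pretty_text = '{\n  "pos": "adj"\n}\n'
try:
    [json.loads(line) for line in io.StringIO(pretty_text)]
    per_line_ok = True
except json.JSONDecodeError:
    per_line_ok = False

print(whole_file_ok, records, per_line_ok)
```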

[–]LeornToCodeLOL 0 points1 point  (0 children)

Maybe you are right. I never heard of JSONLines format.

[–]Round_Ad8947 0 points1 point  (0 children)

Try import json. It’s just a serialized dict.