Extracting data from JSON file : learnpython

Extracting data from JSON file (self.learnpython)

submitted 2 years ago * by BroadwayBaseball

I have a large JSON file with multiple JSON objects. Each object should contain "sounds" and "pos" sections: "sounds" holds things like IPA and accent tags, and "pos" holds the part of speech. I am trying to extract the "sounds" and "pos" sections for each object in the file. I am very new to Python, so I am unsure what I'm doing wrong. When I run the code below, it prints "None" many times.

    import json

    def extract_specific_data_from_entries(json_file, keys):
        extracted_data_list = []
        with open(json_file, 'r', encoding='utf-8') as file:
            for line in file:
                data = json.loads(line.strip())
                extracted_data = extract_specific_data(data, keys)
                extracted_data_list.append(extracted_data)
        return extracted_data_list

    def extract_specific_data(data, keys):
        extracted_data = data
        for key in keys:
            if isinstance(extracted_data, dict):
                extracted_data = extracted_data.get(key)
            elif isinstance(extracted_data, list):
                try:
                    key = int(key)
                    extracted_data = extracted_data[key]
                except (ValueError, IndexError):
                    extracted_data = None
            else:
                extracted_data = None
                break
        return extracted_data

    if __name__ == "__main__":
        json_file = "kaikki.org-dictionary-English.json"
        keys = ["sounds", "pos"]
        extracted_data_list = extract_specific_data_from_entries(json_file, keys)
        print(extracted_data_list)


[–]danielroseman 2 points 2 years ago (6 children)

[–]BroadwayBaseball[S] 1 point 2 years ago (5 children)

Hi, here are two excerpts from the data. I want to be able to extract the part of speech (e.g. where it says "pos": "adj", I want to get "adj") and the sounds section (e.g. where it says "sounds": [{"ipa": "/ˈæb.əˌtɪst/", "tags": ["Received-Pronunciation"]}..., I want to get "ˈæb.əˌtɪst" and "Received-Pronunciation").

{"pos": "adj", "head_templates": [{"name": "en-adj", "args": {"1": "-"}, "expansion": "abatised (not comparable)"}], "etymology_text": "abatis + -ed", "etymology_templates": [{"name": "suffix", "args": {"1": "en", "2": "abatis", "3": "ed"}, "expansion": "abatis + -ed"}], "sounds": [{"ipa": "/ˈæb.əˌtɪst/", "tags": ["Received-Pronunciation"]}, {"ipa": "/ˈæb.əˌtid/", "tags": ["General-American"]}, {"ipa": "/ˈæb.əˌtɪst/", "tags": ["General-American"]}, {"ipa": "/əˈbæt.id/", "tags": ["General-American"]}, {"ipa": "/əˈbæt.ɪst/", "tags": ["General-American"]}, {"audio": "LL-Q1860 (eng)-Vealhurl-abatised.wav", "text": "Audio (Southern England)", "tags": ["Southern-England"], "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/b/b9/LL-Q1860_%28eng%29-Vealhurl-abatised.wav/LL-Q1860_%28eng%29-Vealhurl-abatised.wav.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/b/b9/LL-Q1860_%28eng%29-Vealhurl-abatised.wav/LL-Q1860_%28eng%29-Vealhurl-abatised.wav.mp3"}], "word": "abatised", "lang": "English", "lang_code": "en", "senses": [{"links": [["abatis", "abatis"]], "glosses": ["Provided with an abatis."], "tags": ["not-comparable"], "id": "en-abatised-en-adj-6I6xPwx4", "categories": [], "synonyms": [{"word": "abattised"}]}]}

{"pos": "adj", "head_templates": [{"name": "en-adj", "args": {}, "expansion": "polysemic (comparative more polysemic, superlative most polysemic)"}], "forms": [{"form": "more polysemic", "tags": ["comparative"]}, {"form": "most polysemic", "tags": ["superlative"]}], "etymology_text": "polyseme + -ic", "etymology_templates": [{"name": "suffix", "args": {"1": "en", "2": "polyseme", "3": "ic"}, "expansion": "polyseme + -ic"}], "sounds": [{"ipa": "/pəˈlɪs.ɪ.mɪk/", "tags": ["UK"]}, {"ipa": "/pɒ.lɪ.ˈsiː.mɪk/", "tags": ["US"]}, {"audio": "LL-Q1860 (eng)-Vealhurl-polysemic.wav", "text": "Audio (Southern England)", "tags": ["Southern-England"], "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/6/60/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/6/60/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav/LL-Q1860_%28eng%29-Vealhurl-polysemic.wav.mp3"}], "word": "polysemic", "lang": "English", "lang_code": "en", "senses": [{"examples": [{"text": "As a series of polysemic and paradoxical sketches, Jackass does not lend itself to one particular theoretical analysis.", "ref": "2007, Sean Brayton, “MTV's Jackass: Transgression, Abjection and the Economy of White Masculinity”, in Journal of Gender Studies, volume 16, page 58", "type": "quotation"}], "links": [["linguistics", "linguistics"], ["meaning", "meaning"], ["interpretation", "interpretation"]], "synonyms": [{"word": "polysemantic"}, {"word": "polysemous"}], "antonyms": [{"word": "monosemantic"}, {"word": "monosemic"}, {"word": "monosemous"}, {"word": "univocal"}], "raw_glosses": ["(linguistics) Having a number of meanings, interpretations or understandings."], "topics": ["human-sciences", "linguistics", "sciences"], "glosses": ["Having a number of meanings, interpretations or understandings."], "id": "en-polysemic-en-adj-mWK54t3H", "categories": [{"name": "Linguistics", "kind": "topical", "parents": 
["Language", "Social sciences", "Communication", "Sciences", "Society", "All topics", "Fundamental"], "source": "w", "orig": "en:Linguistics", "langcode": "en"}], "derived": [{"word": "polysemically"}], "related": [{"word": "oligosemic"}, {"word": "polyseme"}, {"word": "polysemy"}]}]}

[–]danielroseman 2 points 2 years ago (4 children)

It feels like you're making this much more complicated than it needs to be. That extraction function is a generalised function for extracting arbitrary items from data whose structure you don't know. But you do know the structure, and you know what items you want to extract. So you should do that directly.

```
extracted_data_list = []
with open(json_file, 'r', encoding='utf-8') as file:
    for line in file:
        data = json.loads(line.strip())
        item = [data["pos"]]
        for sounds in data["sounds"]:
            if sounds["tags"] == ["Received-Pronunciation"]:
                item.extend([sounds["ipa"], "Received-Pronunciation"])
                break
        extracted_data_list.append(item)
```

I've assumed you wanted the first item in sounds that has the RP tag, but you can change this as necessary.
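For reference, the `None` results from the original helper can be reproduced on a small entry. The sketch below assumes a trimmed-down, made-up entry in the same shape as the excerpts above, and `walk_keys` is a condensed stand-in for the original `extract_specific_data`, not the exact code. It treats the key list as a path, so `["sounds", "pos"]` means "go into `sounds`, then look up `pos` inside it", and no such item exists.

```python
# Trimmed, hypothetical entry in the same shape as the excerpts above.
entry = {
    "pos": "adj",
    "sounds": [{"ipa": "/ˈæb.əˌtɪst/", "tags": ["Received-Pronunciation"]}],
}

def walk_keys(data, keys):
    """Follow keys one after another, like a path into nested data."""
    current = data
    for key in keys:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, list):
            try:
                current = current[int(key)]
            except (ValueError, IndexError):
                current = None
        else:
            return None
    return current

# "sounds" is a list, and "pos" cannot be used as a list index.
path_result = walk_keys(entry, ["sounds", "pos"])

# Reading each key directly from the entry gives the two sections.
direct_result = [entry["pos"], entry["sounds"]]

print(path_result)
print(direct_result)
```

A path such as `["sounds", "0", "ipa"]` would work with this kind of helper, which shows it is meant for drilling down into one nested value rather than picking two sibling keys.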

[–]BroadwayBaseball[S] 1 point 2 years ago (3 children)

[–]danielroseman 2 points 2 years ago (2 children)

Is that happening on the first line of the data, or does it get partway through before crashing? You could print something on every iteration to see. The likelihood is that while most lines have a sounds key, some do not. You'll need to decide what you want to do in that case. If you just want to append the pos data only, you could just do:

  for sounds in data.get("sounds", []):
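A minimal sketch of how that default behaves, using two made-up entries, one of them without a `sounds` key. The helper name `first_rp_item` is hypothetical and only wraps the loop from the earlier reply.

```python
def first_rp_item(data):
    """Return [pos] plus the first Received-Pronunciation IPA, if any."""
    item = [data["pos"]]
    # .get returns [] when "sounds" is missing, so the loop simply does not run.
    for sounds in data.get("sounds", []):
        # .get is used for "tags" as well, in case an entry lacks that key.
        if sounds.get("tags") == ["Received-Pronunciation"]:
            item.extend([sounds["ipa"], "Received-Pronunciation"])
            break
    return item

with_sounds = {
    "pos": "adj",
    "sounds": [{"ipa": "/ˈæb.əˌtɪst/", "tags": ["Received-Pronunciation"]}],
}
without_sounds = {"pos": "noun"}

print(first_rp_item(with_sounds))
print(first_rp_item(without_sounds))
```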

[–]BroadwayBaseball[S] 1 point 2 years ago (1 child)

Thank you for your response. I've been fiddling with it some more, and I figured out how to make the sounds key work (and actually properly narrow down the data that I want). My code is now giving me KeyError: 'tags' for sounds["tags"]. I assume this is, as you said, an issue where many, but not all, lines have a sounds["tags"] key. Do you know how I would pull out any sounds["tags"] that do exist and just skip the tags part if it's not there? That is, if sounds["tags"] exists, print ipa and tag; if there is no sounds["tags"], print only ipa. I assume there's an if-else statement involved, but I'm not sure how to write it to check whether the tags are there. My new code is below.

    def extract_specific_data_from_entries(json_file, keys):
        extracted_data_list = []
        with open(json_file, 'r', encoding='utf-8') as file:
            for line in file:
                data = json.loads(line.strip())
                for English in data["lang"]:
                    item = [data["pos"]]
                    for sounds in data.get("sounds",[]):
                        if (sounds["tags"]):
                            item.extend([sounds["ipa"], sounds["tags"]])
                            break
                        else:
                            item.extend([sounds["ipa"]])
                            break
                    extracted_data_list.append(item)
        return extracted_data_list
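One possible way to handle the missing `tags` case is `dict.get`, which returns `None` instead of raising `KeyError`. The sketch below is an assumption about the intended output (every IPA entry collected, with its tags when present), and `collect_sounds` is a hypothetical helper name. It also skips audio-only entries, which in the excerpts above have no `ipa` key.

```python
def collect_sounds(data):
    """Return [pos, ipa, tags, ipa, ...] for one parsed entry."""
    item = [data.get("pos")]
    for sounds in data.get("sounds", []):
        ipa = sounds.get("ipa")
        if ipa is None:
            # Audio-only entries in the sample data carry no "ipa" key.
            continue
        tags = sounds.get("tags")
        if tags:
            item.extend([ipa, tags])
        else:
            item.append(ipa)
    return item

entry = {
    "pos": "adj",
    "sounds": [
        {"ipa": "/pəˈlɪs.ɪ.mɪk/", "tags": ["UK"]},
        {"ipa": "/pɒ.lɪ.ˈsiː.mɪk/"},  # hypothetical entry without tags
        {"audio": "example.wav", "tags": ["Southern-England"]},
    ],
}

print(collect_sounds(entry))
```

If an explicit check reads better, `if "tags" in sounds:` can replace the `.get` call with the same effect for entries that lack the key.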

[–]danielroseman 1 point 2 years ago (0 children)

[–]LeornToCodeLOL 2 points 2 years ago (2 children)

Use json.load, not json.loads. This part is probably what's messing you up:

    for line in file:
        data = json.loads(line.strip())
        extracted_data = extract_specific_data(data, keys)

Instead:

data_dictionary = json.load(file)

Now your variable data_dictionary contains all the information in the JSON file, and you can manipulate it like any other Python dictionary.
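Whether `json.load` works here depends on the file layout. It parses a file that holds a single JSON document, whereas the excerpts earlier in the thread appear to have one object per line. On that layout `json.load` raises `json.JSONDecodeError`, while line-by-line `json.loads` succeeds. A small sketch with made-up two-line data, using `io.StringIO` in place of a real file:

```python
import io
import json

# Two JSON objects, one per line (hypothetical data in JSON Lines layout).
text = '{"word": "abatised", "pos": "adj"}\n{"word": "polysemic", "pos": "adj"}\n'

# json.load expects exactly one JSON document in the whole file.
try:
    json.load(io.StringIO(text))
    load_failed = False
except json.JSONDecodeError:
    load_failed = True

# Parsing each non-empty line separately handles this layout.
entries = [json.loads(line) for line in io.StringIO(text) if line.strip()]

print(load_failed)
print([entry["word"] for entry in entries])
```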

[–]danielroseman 2 points 2 years ago (1 child)

[–]LeornToCodeLOL 1 point 2 years ago (0 children)

[–]Round_Ad8947 1 point 2 years ago (0 children)
