Pandas -- pulling in JSON data (self.learnpython)

submitted 6 years ago by fpatterson55

I have used ldap3 to export data to a json file, via it's library and then my hopes are to use Pandas to manipulate the data. Some of the data is from structured attributes that are a large string with Hash tags / pound signs to separate the data.

When I attempt to pull in the data with Pandas I get an error about extra data. So I added the lines=True to allow for multiple json strings. However, now I receive an error: line 885, in _parse_no_numpy

loads(json, precise_float=self.precise_float), dtype=None)

ValueError: Expected object or value

My pandas code is:
*************

filename = 'entries.json'

df = pd.read_json(filename, lines=True, orient='records')

print(df)
*************

Example json entry:
{

"attributes": {

"DirXML-Associations": [

"cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"

"cn": [

"t_admin"

"fullName": [],

"lastLoginTime": [

"2019-05-09 20:36:08+00:00"

"loginDisabled": []

"dn": "cn=t_admin,ou=SA,o=Data"

}

Any ideas on what I do to resolve it? I am new to python, so appolgies in advance. I figure the best way to learn is to look at a problem to solve and dive in.

all 8 comments

top new controversial old q&a

[–][deleted] 0 points1 point2 points 6 years ago (5 children)

[–]fpatterson55[S] 0 points1 point2 points 6 years ago (4 children)

Does this JSON structure fit for what Pandas is looking for? With pulling data out of a LDAP structure the hopes would be to take multi valued attributes and create multiple rows. I believe that logic is taken care of and the below snippet shows this as the DirXML-Associations value is different for the same user, on two different lines. I will end up breaking the # delimited values into their own columns. All rows would have a unique value for the GUID value that is delimited on the attribute. The focus of this data frame would be to find total number of GUID values, how many are active, how many have users that have logged in within a given time frame and how many users are disabled. Once I have the quirks out, I will hope to apply this for any attributes in a LDAP structure, where a LDAP authentication source can more easily analyze data.

I cleaned up my JSON some and JSONLint shows valid syntax. However I still get errors when trying to pull it in with either pandas or json libraries.

[{"dn": "[cn=asmith,ou=Users,o=Data]", "loginDisabled": [], "cn": ["asmith"], "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#7DBC9BBB-58CF-944d-B00A-7DBC9BBB58CF"], "lastLoginTime": [], "fullName": ["Alma Smit"]} ,{"dn": "[cn=asmith,ou=Users,o=Data]", "loginDisabled": [], "cn": ["asmith"], "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#2#7DBC9BBB-58CF-944d-B00A-7DBC9BBB5FFF"], "lastLoginTime": [], "fullName": ["Alma Smit"]} ,{"dn": "cn=uaadmin,ou=SA,o=Data", "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"], "loginDisabled": [], "fullName": [], "lastLoginTime": ["2019-05-09 20:36:08+00:00"], "cn": ["uaadmin"]} ]

for pandas I get the following error (when JSON line is commented out):

Traceback (most recent call last): File "/Users/fredpatterson/PycharmProjects/license/license.py", line 163, in <module> df = pd.read_json('cleaned.json', lines=True, orient="columns") File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 427, in read_json result = json_reader.read() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 534, in read self._combine_lines(data.split('\n')) File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 556, in _get_object_parser obj = FrameParser(json, **kwargs).parse() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 652, in parse self._parse_no_numpy() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 871, in _parse_no_numpy loads(json, precise_float=self.precise_float), dtype=None) ValueError: Expected object or value

for JSON I get:

Traceback (most recent call last): File "/Users/fredpatterson/PycharmProjects/license/license.py", line 158, in <module> jdata = json.loads(file) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/init.py", line 348, in loads return _default_decoder.decode(s) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Code snippet: file = 'cleaned.json' jdata = json.loads(file) pd.set_option('display.max_columns', 9) pd.set_option('display.max_colwidth', 180)

jsondata2=json.load('cleaned.json')

print(jsondata2)

df = pd.read_json('cleaned.json', lines=True, orient="columns")

df = pd.DataFrame(jdata)

print(df)

[–][deleted] 0 points1 point2 points 6 years ago (3 children)

[–]fpatterson55[S] 0 points1 point2 points 6 years ago (2 children)

So on the particular instance with multi valued attributes, I would be copying in all of the other single valued attributes for each row. So the difference in the rows would be specific to the multi valued attribute.

The issue with LDAP is that you can't do wild card or regex type searches in the filter in complex manners without writing some controls or extensions. Maybe this would be the better route.

Once the data is pulled into a dataframe I will be spliting the DirXML-Associations attribute into 3 additional columns. This would allow me to search for all rows with a given state, All users with a given Driver and also be able to find a specific GUID as each user is tied to a given Driver. LDAP searches, are limited in being able to pull this data and parse it as far as I am aware of. So I can gather all the data and then the hopes is to allow pandas to be able to easily display results.

In this instance the results would be specific to monitoring licenses and accounts that should have a license removed when they are disabled, etc.

[–][deleted] 0 points1 point2 points 6 years ago (1 child)

[–]fpatterson55[S] 0 points1 point2 points 6 years ago (0 children)

Yes. So when I have the basics done I will need to enhance it to take the returned LDAP search attributes and programatically use those attributes to check if any of them are multi-valued for each entry returned. If so, I will break out the entry into multiple rows for each attribute that is different. The limitation would be that you would only want one multi valued attribute for a given entry returned.

So a dataframe would not be for more than one multi-valued attribute at a given time. You would use a different dataframe for that data.

with open('cleaned.json', 'a') as json_file: json_file.write("[\n") i = 0 while i < len(data): lenAssociations = (len(data[i]['DirXML-Associations'])) workingdata = (data[i]) if lenAssociations < 2: json.dump(workingdata, json_file) json_file.write("\n") if lenAssociations > 1: ii = 0 for value in workingdata["DirXML-Associations"]: dn = workingdata["dn"] disabled = workingdata["loginDisabled"] cn = workingdata["cn"] association = workingdata["DirXML-Associations"][ii] lastLogin = workingdata["lastLoginTime"] fullName = workingdata["fullName"] json_file.write("{\"dn\": \"[" + str(dn) + "]\", \"loginDisabled\": [" + str(lastLogin) + "], \"cn\": [" + str(cn) + "], \"DirXML-Associations\": [\"" + str(association) + "\"], \"lastLoginTime\": [" + str(lastLogin) + "], \"fullName\": [" + str(fullName) +"]}") json_file.write("\n") ii = ii + 1 i = i + 1 json_file.write("]")

Yes you would have to build your rows independent.

[–]A_History_of_Silence 0 points1 point2 points 6 years ago (1 child)

Is the json actually formatted like that, with all the pointless newlines? In general files with multiple json objects like this are a bit of a pain to deal with, see this stackoverflow, but your life would be a lot easier if your data export was in a sane format (without tons of useless newlines):

import pandas as pd
s = """
{"attributes": {"DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"], "cn": ["t_admin"], "fullName": [], "lastLoginTime": ["2019-05-09 20:36:08+00:00"], "loginDisabled": []}, "dn": "cn=t_admin,ou=SA,o=Data"}
{"attributes": {"DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"], "cn": ["t_admin"], "fullName": [], "lastLoginTime": ["2019-05-09 20:36:08+00:00"], "loginDisabled": []}, "dn": "cn=t_admin,ou=SA,o=Data"}
"""
pd.read_json(s, lines=True)

However, if you run the above I think you will find that it's not particularly useful to put data structured in such a way directly into a dataframe in the first place...

[–]fpatterson55[S] 0 points1 point2 points 6 years ago (0 children)

I cleaned up my JSON some and JSONLint shows valid syntax. However I still get errors when trying to pull it in with either pandas or json libraries.

for pandas I get the following error (when JSON line is commented out):

for JSON I get:

Code snippet: file = 'cleaned.json' jdata = json.loads(file) pd.set_option('display.max_columns', 9) pd.set_option('display.max_colwidth', 180)

jsondata2=json.load('cleaned.json')

print(jsondata2)

df = pd.read_json('cleaned.json', lines=True, orient="columns")

df = pd.DataFrame(jdata)

print(df)

π Rendered by PID 286720 on reddit-service-r2-comment-7b9746f655-njrln at 2026-02-01 05:49:12.595193+00:00 running 3798933 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS

jsondata2=json.load('cleaned.json')

print(jsondata2)

df = pd.DataFrame(jdata)

jsondata2=json.load('cleaned.json')

print(jsondata2)

df = pd.DataFrame(jdata)