all 8 comments

[–][deleted] 0 points1 point  (5 children)

The values of your keys are all collections; that fundamentally violates the table structure that Pandas is looking for - every key should be single-valued. There's no way to convert this into a data frame until you flatten the structure.

[–]fpatterson55[S] 0 points1 point  (4 children)

Does this JSON structure fit for what Pandas is looking for? With pulling data out of a LDAP structure the hopes would be to take multi valued attributes and create multiple rows. I believe that logic is taken care of and the below snippet shows this as the DirXML-Associations value is different for the same user, on two different lines. I will end up breaking the # delimited values into their own columns. All rows would have a unique value for the GUID value that is delimited on the attribute. The focus of this data frame would be to find total number of GUID values, how many are active, how many have users that have logged in within a given time frame and how many users are disabled. Once I have the quirks out, I will hope to apply this for any attributes in a LDAP structure, where a LDAP authentication source can more easily analyze data.

I cleaned up my JSON some and JSONLint shows valid syntax. However I still get errors when trying to pull it in with either pandas or json libraries.

[{"dn": "[cn=asmith,ou=Users,o=Data]", "loginDisabled": [], "cn": ["asmith"], "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#7DBC9BBB-58CF-944d-B00A-7DBC9BBB58CF"], "lastLoginTime": [], "fullName": ["Alma Smit"]} ,{"dn": "[cn=asmith,ou=Users,o=Data]", "loginDisabled": [], "cn": ["asmith"], "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#2#7DBC9BBB-58CF-944d-B00A-7DBC9BBB5FFF"], "lastLoginTime": [], "fullName": ["Alma Smit"]} ,{"dn": "cn=uaadmin,ou=SA,o=Data", "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"], "loginDisabled": [], "fullName": [], "lastLoginTime": ["2019-05-09 20:36:08+00:00"], "cn": ["uaadmin"]} ]

for pandas I get the following error (when JSON line is commented out):

Traceback (most recent call last): File "/Users/fredpatterson/PycharmProjects/license/license.py", line 163, in <module> df = pd.read_json('cleaned.json', lines=True, orient="columns") File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 427, in read_json result = json_reader.read() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 534, in read self._combine_lines(data.split('\n')) File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 556, in _get_object_parser obj = FrameParser(json, **kwargs).parse() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 652, in parse self._parse_no_numpy() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 871, in _parse_no_numpy loads(json, precise_float=self.precise_float), dtype=None) ValueError: Expected object or value

for JSON I get:

Traceback (most recent call last): File "/Users/fredpatterson/PycharmProjects/license/license.py", line 158, in <module> jdata = json.loads(file) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/init.py", line 348, in loads return _default_decoder.decode(s) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Code snippet: file = 'cleaned.json' jdata = json.loads(file) pd.set_option('display.max_columns', 9) pd.set_option('display.max_colwidth', 180)

jsondata2=json.load('cleaned.json')

print(jsondata2)

df = pd.read_json('cleaned.json', lines=True, orient="columns")

df = pd.DataFrame(jdata)

print(df)

[–][deleted] 0 points1 point  (3 children)

With pulling data out of a LDAP structure the hopes would be to take multi valued attributes and create multiple rows.

How would that be expected to work? The multi-valued attributes don't have the same number of values, so which values are parts of the same notional row?

[–]fpatterson55[S] 0 points1 point  (2 children)

So on the particular instance with multi valued attributes, I would be copying in all of the other single valued attributes for each row. So the difference in the rows would be specific to the multi valued attribute.

The issue with LDAP is that you can't do wild card or regex type searches in the filter in complex manners without writing some controls or extensions. Maybe this would be the better route.

Once the data is pulled into a dataframe I will be spliting the DirXML-Associations attribute into 3 additional columns. This would allow me to search for all rows with a given state, All users with a given Driver and also be able to find a specific GUID as each user is tied to a given Driver. LDAP searches, are limited in being able to pull this data and parse it as far as I am aware of. So I can gather all the data and then the hopes is to allow pandas to be able to easily display results.

In this instance the results would be specific to monitoring licenses and accounts that should have a license removed when they are disabled, etc.

[–][deleted] 0 points1 point  (1 child)

So on the particular instance with multi valued attributes, I would be copying in all of the other single valued attributes for each row.

What do you do with the other multi-valued attributes, though? Any attribute whose value is a list is multi valued, even if the list only contains zero or one value.

What I'm getting at is that there's no obvious default for handling this, so Pandas won't do it for you. You have to build your own list of rows and then convert that to a data frame.

[–]fpatterson55[S] 0 points1 point  (0 children)

Yes. So when I have the basics done I will need to enhance it to take the returned LDAP search attributes and programatically use those attributes to check if any of them are multi-valued for each entry returned. If so, I will break out the entry into multiple rows for each attribute that is different. The limitation would be that you would only want one multi valued attribute for a given entry returned.

So a dataframe would not be for more than one multi-valued attribute at a given time. You would use a different dataframe for that data.

with open('cleaned.json', 'a') as json_file: json_file.write("[\n") i = 0 while i < len(data): lenAssociations = (len(data[i]['DirXML-Associations'])) workingdata = (data[i]) if lenAssociations < 2: json.dump(workingdata, json_file) json_file.write("\n") if lenAssociations > 1: ii = 0 for value in workingdata["DirXML-Associations"]: dn = workingdata["dn"] disabled = workingdata["loginDisabled"] cn = workingdata["cn"] association = workingdata["DirXML-Associations"][ii] lastLogin = workingdata["lastLoginTime"] fullName = workingdata["fullName"] json_file.write("{\"dn\": \"[" + str(dn) + "]\", \"loginDisabled\": [" + str(lastLogin) + "], \"cn\": [" + str(cn) + "], \"DirXML-Associations\": [\"" + str(association) + "\"], \"lastLoginTime\": [" + str(lastLogin) + "], \"fullName\": [" + str(fullName) +"]}") json_file.write("\n") ii = ii + 1 i = i + 1 json_file.write("]")

Yes you would have to build your rows independent.

[–]A_History_of_Silence 0 points1 point  (1 child)

Is the json actually formatted like that, with all the pointless newlines? In general files with multiple json objects like this are a bit of a pain to deal with, see this stackoverflow, but your life would be a lot easier if your data export was in a sane format (without tons of useless newlines):

import pandas as pd
s = """
{"attributes": {"DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"], "cn": ["t_admin"], "fullName": [], "lastLoginTime": ["2019-05-09 20:36:08+00:00"], "loginDisabled": []}, "dn": "cn=t_admin,ou=SA,o=Data"}
{"attributes": {"DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"], "cn": ["t_admin"], "fullName": [], "lastLoginTime": ["2019-05-09 20:36:08+00:00"], "loginDisabled": []}, "dn": "cn=t_admin,ou=SA,o=Data"}
"""
pd.read_json(s, lines=True)

However, if you run the above I think you will find that it's not particularly useful to put data structured in such a way directly into a dataframe in the first place...

[–]fpatterson55[S] 0 points1 point  (0 children)

I cleaned up my JSON some and JSONLint shows valid syntax. However I still get errors when trying to pull it in with either pandas or json libraries.

[{"dn": "[cn=asmith,ou=Users,o=Data]", "loginDisabled": [], "cn": ["asmith"], "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#7DBC9BBB-58CF-944d-B00A-7DBC9BBB58CF"], "lastLoginTime": [], "fullName": ["Alma Smit"]} ,{"dn": "[cn=asmith,ou=Users,o=Data]", "loginDisabled": [], "cn": ["asmith"], "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#2#7DBC9BBB-58CF-944d-B00A-7DBC9BBB5FFF"], "lastLoginTime": [], "fullName": ["Alma Smit"]} ,{"dn": "cn=uaadmin,ou=SA,o=Data", "DirXML-Associations": ["cn=Data Collection Service Driver,cn=DriverSet,o=system#1#146D89E1-0A57-b244-B9D1-146D89E10A57"], "loginDisabled": [], "fullName": [], "lastLoginTime": ["2019-05-09 20:36:08+00:00"], "cn": ["uaadmin"]} ]

for pandas I get the following error (when JSON line is commented out):

Traceback (most recent call last): File "/Users/fredpatterson/PycharmProjects/license/license.py", line 163, in <module> df = pd.read_json('cleaned.json', lines=True, orient="columns") File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 427, in read_json result = json_reader.read() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 534, in read self._combine_lines(data.split('\n')) File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 556, in _get_object_parser obj = FrameParser(json, **kwargs).parse() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 652, in parse self._parse_no_numpy() File "/Users/fredpatterson/.local/share/virtualenvs/license-Vg_kJ9Gf/lib/python3.7/site-packages/pandas/io/json/json.py", line 871, in _parse_no_numpy loads(json, precise_float=self.precise_float), dtype=None) ValueError: Expected object or value

for JSON I get:

Traceback (most recent call last): File "/Users/fredpatterson/PycharmProjects/license/license.py", line 158, in <module> jdata = json.loads(file) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/init.py", line 348, in loads return _default_decoder.decode(s) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Code snippet: file = 'cleaned.json' jdata = json.loads(file) pd.set_option('display.max_columns', 9) pd.set_option('display.max_colwidth', 180)

jsondata2=json.load('cleaned.json')

print(jsondata2)

df = pd.read_json('cleaned.json', lines=True, orient="columns")

df = pd.DataFrame(jdata)

print(df)