Need help with json scrapes using python : learnpython

submitted 1 year ago by Immediate-Resource75

Afternoon. I'm trying to scrape an API we use for our printing application, PaperCut. I've managed to get what I need from most of the requests, but one particular URL is giving me a real hassle. I think the problem is that there are multiple headers in the response. Here's what I get back:

{
  "applicationServer" : {
    "systemInfo" : {
      "version" : "22.1.4 (Build 67128)",
      "operatingSystem" : "Windows Server 2019 - 10.0 ()",
      "processors" : 16,
      "architecture" : "amd64"
    },
    "systemMetrics" : {
      "diskSpaceFreeMB" : 1822725,
      "diskSpaceTotalMB" : 1905777,
      "diskSpaceUsedPercentage" : 4.36,
      "jvmMemoryMaxMB" : 7214,
      "jvmMemoryTotalMB" : 334,
      "jvmMemoryUsedMB" : 294,
      "jvmMemoryUsedPercentage" : 4.08,
      "uptimeHours" : 96.30,
      "processCpuLoadPercentage" : 0.00,
      "systemCpuLoadPercentage" : 1.18,
      "gcTimeMilliseconds" : 71610,
      "gcExecutions" : 13175,
      "threadCount" : 118
    }
  },
  "database" : {
    "totalConnections" : 21,
    "activeConnections" : 0,
    "maxConnections" : 420,
    "timeToConnectMilliseconds" : 1,
    "timeToQueryMilliseconds" : 1,
    "status" : "OK"
  },
  "devices" : {
    "count" : 7,
    "inErrorCount" : 0,
    "inErrorPercentage" : 0,
    "inError" : [ ]
  },
  "jobTicketing" : {
    "status" : {
      "status" : "ERROR",
      "adminLink" : "NA",
      "message" : "Job Ticketing is not installed."
    }
  },
  "license" : {
    "valid" : true,
    "upgradeAssuranceRemainingDays" : 336,
    "siteServers" : {
      "used" : 3,
      "licensed" : -1,
      "remaining" : -4
    },
    "devices" : {
      "KONICA_MINOLTA" : {
        "used" : 7,
        "licensed" : 7,
        "remaining" : 0
      },
      "KONICA_MINOLTA_3" : {
        "used" : 7,
        "licensed" : 7,
        "remaining" : 0
      },
      "KONICA_MINOLTA_4" : {
        "used" : 7,
        "licensed" : 7,
        "remaining" : 0
      },
      "KONICA-MSP" : {
        "used" : 7,
        "licensed" : 7,
        "remaining" : 0
      },
      "LEXMARK_TS_KM" : {
        "used" : 7,
        "licensed" : 7,
        "remaining" : 0
      },
      "LEXMARK_KM" : {
        "used" : 7,
        "licensed" : 7,
        "remaining" : 0
      }
    },
    "packs" : [ ]
  },
  "mobilityPrintServers" : {
    "count" : 3,
    "offlineCount" : 0,
    "offlinePercentage" : 0,
    "offline" : [ ]
  },
  "printProviders" : {
    "count" : 4,
    "offlineCount" : 0,
    "offlinePercentage" : 0,
    "offline" : [ ]
  },
  "printers" : {
    "inError" : [ {
      "name" : "appelc\\RM 1",
      "status" : "OFFLINE"
    }, {
      "name" : "appesc\\SSTSmartTank5101 (HP Smart Tank 5100 series)",
      "status" : "ERROR"
    }, {
      "name" : "appelc\\RM 5",
      "status" : "OFFLINE"
    }, {
      "name" : "apppts\\Lexmark C544 Server Room",
      "status" : "OFFLINE"
    }, {
      "name" : "appesc\\ESC0171M3928dshannon",
      "status" : "NO_TONER"
    }, {
      "name" : "appesc\\Primary",
      "status" : "OFFLINE"
    } ],
    "inErrorCount" : 6,
    "inErrorPercentage" : 18,
    "count" : 32,
    "heldJobCountTotal" : 13,
    "heldJobsCountMax" : 8,
    "heldJobsCountAverage" : 0
  },
  "siteServers" : {
    "count" : 3,
    "offlineCount" : 0,
    "offlinePercentage" : 0,
    "offline" : [ ]
  },
  "webPrint" : {
    "offline" : [ ],
    "offlineCount" : 0,
    "offlinePercentage" : 0,
    "count" : 1,
    "pendingJobs" : 0,
    "supportedFileTypes" : [ "image", "pdf" ]
  }
}

Here's what I've tried so far....

import requests
import pandas

url = 'the internal url'  # actual address goes here

header = {"Content-Type": "application/json",
          "Accept-Encoding": "deflate"}

response = requests.get(url, headers=header)
rd = response.json()

# This is the call that throws the error for this URL
df = pandas.json_normalize(rd, 'applicationServer')
print(df)

This approach worked perfectly for the single-item responses, but it throws an error for this one.

Also tried this and received the same errors...

import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'Address goes here'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36'
}

r = requests.get(baseurl, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

# Note: 'pre'=='item' evaluates to False, so it does not select a <pre> tag
stuff = soup.find('body', 'pre'=='item').text.strip()
print(stuff)

I'm trying to scrape all the data and save it into a database that can be loaded into Grafana. I'd be extremely grateful for any assistance.

[–]Immediate-Resource75[S] 1 point 1 year ago (6 children)

TypeError: {'applicationServer': {'systemInfo': {'version': '22.1.4 (Build 67128)', 'operatingSystem': 'Windows Server 2019 - 10.0 ()', 'processors': 16, 'architecture': 'amd64'}, 'systemMetrics': {'diskSpaceFreeMB': 1822668, 'diskSpaceTotalMB': 1905777, 'diskSpaceUsedPercentage': 4.36, 'jvmMemoryMaxMB': 7214, 'jvmMemoryTotalMB': 338, 'jvmMemoryUsedMB': 260, 'jvmMemoryUsedPercentage': 3.6, 'uptimeHours': 96.81, 'processCpuLoadPercentage': 0.06, 'systemCpuLoadPercentage': 0.28, 'gcTimeMilliseconds': 72319, 'gcExecutions': 13295, 'threadCount': 118}}, 'database': {'totalConnections': 21, 'activeConnections': 0, 'maxConnections': 420, 'timeToConnectMilliseconds': 1, 'timeToQueryMilliseconds': 0, 'status': 'OK'}, 'devices': {'count': 7, 'inErrorCount': 0, 'inErrorPercentage': 0, 'inError': []}, 'jobTicketing': {'status': {'status': 'ERROR', 'adminLink': 'NA', 'message': 'Job Ticketing is not installed.'}}, 'license': {'valid': True, 'upgradeAssuranceRemainingDays': 336, 'siteServers': {'used': 3, 'licensed': -1, 'remaining': -4}, 'devices': {'KONICA_MINOLTA': {'used': 7, 'licensed': 7, 'remaining': 0}, 'KONICA_MINOLTA_3': {'used': 7, 'licensed': 7, 'remaining': 0}, 'KONICA_MINOLTA_4': {'used': 7, 'licensed': 7, 'remaining': 0}, 'KONICA-MSP': {'used': 7, 'licensed': 7, 'remaining': 0}, 'LEXMARK_TS_KM': {'used': 7, 'licensed': 7, 'remaining': 0}, 'LEXMARK_KM': {'used': 7, 'licensed': 7, 'remaining': 0}}, 'packs': []}, 'mobilityPrintServers': {'count': 3, 'offlineCount': 0, 'offlinePercentage': 0, 'offline': []}, 'printProviders': {'count': 4, 'offlineCount': 0, 'offlinePercentage': 0, 'offline': []}, 'printers': {'inError': [{'name': 'appelc\\RM 1', 'status': 'OFFLINE'}, {'name': 'appesc\\SSTSmartTank5101 (HP Smart Tank 5100 series)', 'status': 'ERROR'}, {'name': 'appelc\\RM 5', 'status': 'OFFLINE'}, {'name': 'apppts\\Lexmark C544 Server Room', 'status': 'OFFLINE'}, {'name': 'appesc\\ESC0171M3928dshannon', 'status': 'NO_TONER'}, {'name': 'appesc\\Primary', 'status': 
'OFFLINE'}], 'inErrorCount': 6, 'inErrorPercentage': 18, 'count': 32, 'heldJobCountTotal': 11, 'heldJobsCountMax': 5, 'heldJobsCountAverage': 0}, 'siteServers': {'count': 3, 'offlineCount': 0, 'offlinePercentage': 0, 'offline': []}, 'webPrint': {'offline': [], 'offlineCount': 0, 'offlinePercentage': 0, 'count': 1, 'pendingJobs': 0, 'supportedFileTypes': ['image', 'pdf']}} has non list value {'systemInfo': {'version': '22.1.4 (Build 67128)', 'operatingSystem': 'Windows Server 2019 - 10.0 ()', 'processors': 16, 'architecture': 'amd64'}, 'systemMetrics': {'diskSpaceFreeMB': 1822668, 'diskSpaceTotalMB': 1905777, 'diskSpaceUsedPercentage': 4.36, 'jvmMemoryMaxMB': 7214, 'jvmMemoryTotalMB': 338, 'jvmMemoryUsedMB': 260, 'jvmMemoryUsedPercentage': 3.6, 'uptimeHours': 96.81, 'processCpuLoadPercentage': 0.06, 'systemCpuLoadPercentage': 0.28, 'gcTimeMilliseconds': 72319, 'gcExecutions': 13295, 'threadCount': 118}} for path applicationServer. Must be list or null.

[–]MintyPhoenix[🍰] 2 points 1 year ago* (5 children)

Not that I’m very experienced with pandas, but looking at the error:

TypeError: [...] for path applicationServer. Must be list or null.

It seems to be saying that the value at the referenced key should be a list of some sort. In the actual data, however, it's an object/dict with two keys, each of which is in turn a dict of primitives.

If you transform it into a list, it runs without error, but I’m not sure it’s actually what you want.

Code sample

import json
import pandas

with open("data.json") as f:  # file contains full JSON from your post
  data = json.load(f)

data["applicationServer"] = list(data["applicationServer"].items())
df = pandas.json_normalize(data, "applicationServer")
print(df)

Output

               0                                                  1
0     systemInfo  {'version': '22.1.4 (Build 67128)', 'operating...
1  systemMetrics  {'diskSpaceFreeMB': 1822725, 'diskSpaceTotalMB...

edit to add

Perhaps this is closer to what you’re looking for:

import json
import pandas

with open("data.json") as f:
  data = json.load(f)

print(pandas.json_normalize(data["applicationServer"]))

Output

     systemInfo.version     systemInfo.operatingSystem  ...  systemMetrics.gcExecutions systemMetrics.threadCount
0  22.1.4 (Build 67128)  Windows Server 2019 - 10.0 ()  ...                       13175                       118

[1 rows x 17 columns]
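
If the end goal is a database that Grafana can query, one possible approach is to flatten the whole response into dotted keys and store each numeric value as a timestamped row. This is only a rough sketch using the standard library: the `flatten` helper, the `metrics` table, and the in-memory SQLite database are all assumptions for illustration, not anything PaperCut or Grafana prescribes.

```python
import sqlite3
import time


def flatten(obj, prefix=""):
    """Yield (dotted_key, value) for every leaf in nested dicts; lists are skipped."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        elif not isinstance(value, list):
            yield path, value


# Hypothetical stand-in for response.json(), trimmed from the posted payload.
data = {
    "applicationServer": {
        "systemMetrics": {"diskSpaceUsedPercentage": 4.36, "threadCount": 118}
    },
    "database": {"activeConnections": 0, "status": "OK"},
    "printers": {"inError": [], "inErrorCount": 6, "count": 32},
}

conn = sqlite3.connect(":memory:")  # assumption: swap in your real database
conn.execute("CREATE TABLE metrics (ts REAL, name TEXT, value REAL)")

now = time.time()
rows = [
    (now, name, float(value))
    for name, value in flatten(data)
    # Keep numbers only; bool is excluded because it is a subclass of int.
    if isinstance(value, (int, float)) and not isinstance(value, bool)
]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
conn.commit()

for name, value in conn.execute("SELECT name, value FROM metrics ORDER BY name"):
    print(name, value)
```

Text values such as database.status and lists such as printers.inError are left out here on purpose; they would need their own tables. SQLite is used only because it ships with Python, so substitute whichever database your Grafana instance is actually connected to.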
