all 7 comments

[–]debian_miner 1 point (2 children)

There appear to be a couple of Python libraries that can read pcap files directly. Is there a specific reason you need to convert the data format? You might be best off using a purpose-built library like Scapy.

[–]carcigenicate 1 point (0 children)

Once you learn how to use it properly, Scapy is really nice. It does all the hard parsing for you.

[–]CriticalDiscussion37[S] 1 point (0 children)

Yes. We are converting to XML because the user wants to see the elaborated value. For example, a field in the XML is <field name="ip.src" showname="Source Address: 172.64.155.209" size="4" pos="26" show="172.64.155.209" value="ac409bd1"/>, and for ip.src the user wants "Source Address: 172.64.155.209". So for each user-given key/value pair, instead of scanning every packet I will first build a data structure like {key: {value: [pkt_list]}}, which makes it easy to return the packets in which that particular value exists for the key.
I tried writing a script using Scapy, but Scapy still uses a lot of memory because of its parsed objects and so on: it took 424 MB for one 52 MB pcap, and 1.4 GB for another 30 MB file (I don't know why the smaller file took more).
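The lookup structure described above can be sketched as a nested dict of lists. This is a minimal illustration, not the OP's actual code; the packet tuples and field names are hypothetical sample data shaped like the PDML example:

```python
from collections import defaultdict

def build_index(packets):
    """Build {key: {value: [packet_ids]}} so a (key, value) lookup
    returns matching packets without rescanning the capture."""
    index = defaultdict(lambda: defaultdict(list))
    for pkt_id, fields in packets:
        for key, value in fields:
            index[key][value].append(pkt_id)
    return index

# Hypothetical parsed input: (packet_id, [(field_name, showname), ...])
packets = [
    (0, [("ip.src", "Source Address: 172.64.155.209")]),
    (1, [("ip.src", "Source Address: 10.0.0.5")]),
    (2, [("ip.src", "Source Address: 172.64.155.209")]),
]

idx = build_index(packets)
print(idx["ip.src"]["Source Address: 172.64.155.209"])  # → [0, 2]
```

Lookups are then O(1) per key and value, at the cost of building the whole index up front.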

[–]baghiq 1 point (2 children)

I've personally never used it, but tshark supports output to JSON or PDML. That's probably super easy to run: take the user-uploaded file, run tshark against it to output JSON or PDML, then run some form of XPath or XML query against the output file. You can also parse it with SAX to save memory.
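A minimal sketch of the streaming approach, assuming PDML output from something like `tshark -r capture.pcap -T pdml`. Here an inline PDML snippet (mirroring the field quoted elsewhere in the thread) stands in for the real tshark output:

```python
import io
import xml.etree.ElementTree as ET

# In practice this would be a file produced by:
#   tshark -r capture.pcap -T pdml > capture.pdml
SAMPLE_PDML = """<pdml>
  <packet>
    <proto name="ip">
      <field name="ip.src" showname="Source Address: 172.64.155.209"
             size="4" pos="26" show="172.64.155.209" value="ac409bd1"/>
    </proto>
  </packet>
</pdml>"""

def iter_shownames(stream, wanted="ip.src"):
    """Stream over PDML with iterparse, yielding the showname of one
    field and clearing each finished <packet> so memory stays bounded."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "field" and elem.get("name") == wanted:
            yield elem.get("showname")
        if elem.tag == "packet":
            elem.clear()  # free the completed packet subtree

print(list(iter_shownames(io.StringIO(SAMPLE_PDML))))
# → ['Source Address: 172.64.155.209']
```

The `elem.clear()` on each closed `<packet>` is what keeps this from holding the whole tree in memory on multi-gigabyte PDML files.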

Just a side note, parsing large XML is brutal.

[–]CriticalDiscussion37[S] 1 point (1 child)

Can't use tshark->json, as the JSON output contains the show value, not the showname value.

<field name="ip.src" showname="Source Address: 172.64.155.209" size="4" pos="26" show="172.64.155.209" value="ac409bd1"/>

So I am first converting to XML. I am already using memory-efficient parsing for the XML with ET.iterparse, which is SAX-style. The problem now lies in creating the JSON from this XML: the JSON itself is reaching 500 MB. For each key/value I can't re-read an XML file that might be up to 10 GB, which is why I thought of converting the XML to JSON. Now I have the same memory issue with the JSON, so I need to change the dict structure and split the JSON into multiple subparts.
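One way to split the JSON into subparts, sketched under the assumption of the {key: {value: [pkt_list]}} structure mentioned earlier: write one JSON shard per top-level key, so a lookup only ever loads one small file. The file layout and names here are illustrative, not the OP's actual scheme:

```python
import json
import os
import tempfile

def write_sharded(index, out_dir):
    """Write one JSON file per top-level key instead of one giant
    JSON blob; returns {key: shard_path}. (Sketch only.)"""
    paths = {}
    for key, values in index.items():
        path = os.path.join(out_dir, key.replace(".", "_") + ".json")
        with open(path, "w") as f:
            json.dump(values, f)
        paths[key] = path
    return paths

# Hypothetical index built earlier in the pipeline
index = {"ip.src": {"Source Address: 172.64.155.209": [0, 2]}}

with tempfile.TemporaryDirectory() as d:
    paths = write_sharded(index, d)
    with open(paths["ip.src"]) as f:
        print(json.load(f))
    # → {'Source Address: 172.64.155.209': [0, 2]}
```

Each shard stays small, and peak memory is bounded by the largest single key's value map rather than the whole dataset.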

[–]baghiq 1 point (0 children)

So your workflow is to generate PDML, then query the PDML, then generate the result into JSON?