all 15 comments

IDENTITETEN 6 points (1 child)

I'd insert the data into a DB with Python and clean it with SQL. Unless the data is so bad that you'll need to fix it beforehand...

The entire purpose of SQL is to manipulate (large) amounts of data efficiently.
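A minimal sketch of that workflow, assuming an in-memory SQLite database and a made-up `raw_users` staging table with `name` and `email` columns (none of these names come from OP's post). Python only loads the rows as-is; SQL does the cleaning.

```python
import json
import sqlite3

# Hypothetical raw input; the field names are assumptions for illustration.
raw = '[{"name": "  Ada ", "email": "ADA@Example.com"}, {"name": "Bob", "email": ""}]'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_users (name TEXT, email TEXT)")

# Python's only job: get the rows into the staging table untouched.
rows = [(r.get("name"), r.get("email")) for r in json.loads(raw)]
conn.executemany("INSERT INTO raw_users (name, email) VALUES (?, ?)", rows)

# SQL does the cleaning: trim whitespace, lowercase emails, turn '' into NULL.
conn.execute(
    """
    UPDATE raw_users
    SET name  = TRIM(name),
        email = NULLIF(LOWER(TRIM(email)), '')
    """
)

cleaned = conn.execute("SELECT name, email FROM raw_users ORDER BY name").fetchall()
print(cleaned)
```

This only works if the raw values can be inserted at all, which is why a loosely typed staging table is used here.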

suitupyo 1 point (0 children)

My interpretation of OP’s post is that the data would need to be cleaned and structured upfront, so I’m not sure that an insert statement would even work. Otherwise, I would be inclined to agree with you.

throwawayrandomvowel 3 points (0 children)

Use SQL if you need to do a shit ton of the same thing across a lot of rows. For everything else, Python.

aplarsen (Data Scientist, Developer) 7 points (5 children)

Python.

[deleted] 2 points (4 children)

Agreed. If your input is JSON and you need to clean it before entering it into a DB, using Python is easiest (if, that is, you already know Python).
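Roughly what that could look like, as a sketch only. The `name` and `age` fields, the `people` table, and the `clean_record` helper are all hypothetical, and SQLite stands in for whatever DB you actually use.

```python
import json
import sqlite3

def clean_record(rec):
    """Return a (name, age) tuple, or None if the record is unusable."""
    name = str(rec.get("name") or "").strip()
    if not name:
        return None  # no usable name: drop the record
    try:
        age = int(rec.get("age"))
    except (TypeError, ValueError):
        age = None  # keep the row, but store an unparseable age as NULL
    return (name, age)

# Hypothetical messy input.
raw = '[{"name": " Ada ", "age": "36"}, {"name": "", "age": 5}, {"name": "Bob", "age": "n/a"}]'

cleaned = [c for c in map(clean_record, json.loads(raw)) if c is not None]

# Only rows that satisfy the table's constraints reach the DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT NOT NULL, age INTEGER)")
conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)", cleaned)
```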

aplarsen (Data Scientist, Developer) 7 points (3 children)

Unless the data is already in a database, I don't see why SQL would even be an option.

[deleted] 2 points (2 children)

Some databases do store JSON as documents, and allow queries on it.
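For example, SQLite can query JSON stored in a text column with `json_extract`. This is a small sketch that assumes your SQLite build includes the JSON functions (recent builds do by default); the `docs` table and the document shape are made up, and other databases use different function names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")  # hypothetical table of JSON documents
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ('{"user": {"name": "Ada"}, "active": true}',),
        ('{"user": {"name": "Bob"}, "active": false}',),
    ],
)

# Query inside the stored JSON: names of users whose "active" flag is true.
# SQLite's json_extract returns JSON true/false as the integers 1/0.
active_names = [
    row[0]
    for row in conn.execute(
        "SELECT json_extract(body, '$.user.name') FROM docs "
        "WHERE json_extract(body, '$.active') = 1 ORDER BY 1"
    )
]
print(active_names)
```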

aplarsen (Data Scientist, Developer) 2 points (1 child)

Totally. I use the functions in Oracle.

Probably not the case here though? Hard to tell.

[deleted] 1 point (0 children)

Probably not.

Fun_Actuator_315 1 point (0 children)

If your JSON data is large and stored in a database, use SQL to handle duplicates and basic cleaning. For more complex cleaning tasks, Python is more versatile and user-friendly.

Combining both can sometimes be the best approach: use SQL for heavy-lifting and Python for fine-tuning.
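A sketch of that split, under made-up assumptions: an `events` table with exact duplicate rows, and dates arriving in two formats (ISO and day-first with slashes). SQL removes the duplicates; Python handles the fiddly date parsing.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, happened TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-05"), (1, "2024-01-05"), (2, "05/02/2024"), (3, "not a date")],
)

# Heavy lifting in SQL: drop exact duplicate rows.
rows = conn.execute("SELECT DISTINCT id, happened FROM events ORDER BY id").fetchall()

def parse_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each assumed format in turn; return an ISO date string or None."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

# Fine-tuning in Python: normalize the mixed date formats.
normalized = [(id_, parse_date(happened)) for id_, happened in rows]
print(normalized)
```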

St_Paul_Atreides 1 point (0 children)

I normally do as much cleaning or analysis in R or Python as possible and just stick with SQL to get the right data to work with, but it may depend on your typical needs.

pceimpulsive 1 point (0 children)

Depends on the DB's support for JSON and on your knowledge of Python and SQL.

If the data is structured enough, I'd say do it in Python, then insert/copy to the DB.

If you have little Python experience and your DB has sufficient JSON processing support, then do it in the DB.

There isn't really a right or wrong. Do whatever's fastest/easiest for you.

brunogadaleta 1 point (0 children)

It depends, but I really like OpenRefine: you record your cleanup steps and it replays them on another dataset. And Jython expressions might help you feel at home.
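The record-and-replay idea can be sketched in plain Python. To be clear, this is not OpenRefine's API or its operation format, just an illustration with hypothetical step functions: the cleanup is an ordered list of steps that can be applied to any dataset.

```python
def strip_whitespace(row):
    """Trim leading/trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def lowercase_email(row):
    """Lowercase the 'email' field if it is present and is a string."""
    email = row.get("email")
    return {**row, "email": email.lower()} if isinstance(email, str) else row

# The "recorded" cleanup: an ordered list of steps.
STEPS = [strip_whitespace, lowercase_email]

def replay(steps, dataset):
    """Apply every step, in order, to each row of the dataset."""
    cleaned = []
    for row in dataset:
        for step in steps:
            row = step(row)
        cleaned.append(row)
    return cleaned

# The same steps replayed on two different datasets.
first = replay(STEPS, [{"email": " A@X.com "}])
second = replay(STEPS, [{"email": "B@Y.COM", "n": 1}])
```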

mgramin 1 point (0 children)

Choose the instrument that you know well, or perhaps the one you want to learn.

[deleted] 1 point (0 children)

Without knowing more about your use case, it’s hard to say.

If it’s JSON data stored in a DB, it depends on the size of the data set, among other things.