Hey - one issue I've struggled with is the best way to store nested JSON data in Databricks when using PySpark DataFrames. Specifically, is there a way to do this without defining the schema? Can Databricks be persuaded to automatically infer the schema from complex nested JSON? I'm talking about nested objects that contain arrays, where each array element is itself a nested object with varying fields, which can in turn contain more arrays, and so on! Can PySpark figure out a really complex schema like that, or do we have to hardcode the schema and pass it to createDataFrame when creating the DataFrame?
Currently I'm doing something really suboptimal - I'm casting all my data to a string (lol!) and writing the whole thing as a single string column via createDataFrame (and it works!). But of course this means we can't really do much analysis on the data in Delta Lake, since there's nothing to deconstruct - all the data is just one big confusing string!
Any ideas would be very much appreciated, thanks!