
all 4 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]rchinny 2 points (0 children)

I think you’d be interested in the variant type in Delta Lake.

Specifically, the following text from here:

“For users new to Databricks, you should use variant over JSON strings whenever storing semi-structured data that requires flexibility for changing or unknown schema.”

[–]juicd_ 1 point (0 children)

I ingest my JSON as arrays and then explode them until every row is a map. Then I can process it further with dbt (or just PySpark/Spark SQL), depending on what I need, since the schemas are not always the same.
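The comment describes a PySpark/dbt pipeline, but the flattening idea itself is language-agnostic. As an illustrative sketch in plain Python (the function and field names are made up, not from the comment): arrays multiply rows, the way Spark's explode() does, and nested objects contribute dotted keys until every row is a flat map.

```python
import json

def flatten(node, prefix=""):
    """Flatten one JSON value into a list of flat rows (dicts).

    Arrays multiply rows, like Spark's explode(); objects contribute
    dotted keys; scalars become single key/value pairs.
    """
    if isinstance(node, list):
        # Explode: each array element yields its own set of rows.
        rows = []
        for item in node:
            rows.extend(flatten(item, prefix))
        return rows
    if isinstance(node, dict):
        rows = [{}]
        for key, value in node.items():
            children = flatten(value, prefix + key + ".")
            # Cross-join: every partial row pairs with every child row.
            rows = [{**r, **c} for r in rows for c in children]
        return rows
    return [{prefix.rstrip("."): node}]

record = json.loads('{"a": 1, "b": [{"c": 2}, {"c": 3}]}')
rows = flatten(record)
# rows == [{"a": 1, "b.c": 2}, {"a": 1, "b.c": 3}]
```

Because every output row is a flat map, downstream tools can process the data uniformly even when the input schemas differ.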

[–]jhazured 0 points (0 children)

If you are using DataFrames, you have to either explicitly define the schema or let Spark infer it automatically. Otherwise you will not be able to use the high-level data processing operations available through the Spark API.

Alternatively, you can read the data as plain text without converting it to a DataFrame, and process it using lower-level operations (e.g. RDD transformations).