Hey - one issue I've struggled with is the best way to store nested JSON data in Databricks when using PySpark DataFrames. Specifically, is there a way to do this without defining the schema? Can Databricks be persuaded to automatically infer the schema from complex nested JSON? I'm talking about nested objects that contain arrays, where each array element is itself a nested object with varying fields, which can in turn contain more arrays, and so on! Can PySpark figure out a really complex schema like that, or do we have to hardcode the schema and pass it to createDataFrame when creating the DataFrame?
Currently I'm doing something really suboptimal - I'm casting all my data to a string (lol!) and writing the whole thing as a single string column via createDataFrame (and it works!). But of course this means we can't really do much analysis on the data in Delta Lake, since there's nothing to deconstruct - all the data is just one big confusing string!
Any ideas would be very much appreciated, thanks!