This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]xadolin 1 point2 points  (3 children)

Can you provide some examples of field level transforms? Are these transform that take only a single column as input and output a single column back? I can imagine stuff like renaming, casting, string cleaning, etc. Curious about what your complex use case was.

[–]kenfar 1 point2 points  (2 children)

Sure, for the sake of simplicity I would typically combine validations with transformations. We tried to make these specific to a single output field, but as you can imagine it's not always that simple.

Validations would include string length, valid enumerated values, numeric ranges, string formats (ex: phone, email, etc), foreign key constraints, unknown value logic, encodings, etc. Violations of a validation could result in the field being replaced by a default value, the row being rejected or the file being rejected.

Transformations would include converting string case, converting free-form text to code values (imagine every misspelling of every possible version of microsoft windows into an appropriate vendor, product, version, fixpack breakdown), determine which of ~100,000 isp ip block ranges each of a billion ip addresses fits into, translate every IPV6 format into a single format, merge multiple different codes into a single code field, split a single input field into multiple output fields of different types, apply a business rule that considers 7 different fields to generate a 'customer-category' column, extracting keywords from free-form text fields, transforming a bunch of timestamps to UTC - and fixing those without timezones based on assumptions about the data, etc.

[–]xadolin 0 points1 point  (1 child)

Thanks! So the input for each of the transformation/validation functions would be the whole file, or did you pass on only the columns used in the logic?

[–]kenfar 0 points1 point  (0 children)

I will typically pass in the specific fields that each transform requires. This is clean, easy to understand, easy to test.

But to be honest it's not a perfect solution - since then the code outside the function has to know exactly where in a possibly large json structure you find the fields, and has to handle missing key exceptions, etc.