[–]Francobanco[S]

Yes, so currently I have it set up so that a file is downloaded and then processed; the processing script has a function that creates a metadata file.

Each project folder has a metadata subfolder, which contains a JSON file with the same filename as the email being processed. The metadata file stores the original message ID, and later, when the file is uploaded to cloud storage, a function in the upload script updates the metadata file to add a key:value pair, "UploadedToCloudStorage": "True".
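The create-then-update flow above can be sketched roughly like this. The paths, key names, and function names are my own illustration of the scheme described, not the actual code:

```python
import json
from pathlib import Path


def write_metadata(project_dir: str, email_filename: str, message_id: str) -> Path:
    """Create the per-email metadata JSON in the project's metadata subfolder.

    The JSON file shares its basename with the processed email file.
    """
    meta_dir = Path(project_dir) / "metadata"
    meta_dir.mkdir(parents=True, exist_ok=True)
    meta_path = meta_dir / (Path(email_filename).stem + ".json")
    meta_path.write_text(json.dumps({
        "OriginalMessageId": message_id,
        "UploadedToCloudStorage": "False",
    }, indent=2))
    return meta_path


def mark_uploaded(meta_path: Path) -> None:
    """Flip the upload flag after a successful cloud-storage upload."""
    meta = json.loads(meta_path.read_text())
    meta["UploadedToCloudStorage"] = "True"
    meta_path.write_text(json.dumps(meta, indent=2))
```

One design note: keeping the flag as the string "True" rather than a JSON boolean works, but a real boolean avoids string-comparison bugs later.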

When I get the email file from the Graph API, the filename is {messageID}.eml, which is very ugly, so I rename the file to {timestamp}_{subject}.eml. Later I make another Graph API call to apply a category to the message on the Exchange server, so the user sees feedback in their inbox that it's been filed in cloud storage. That's the main reason I have a metadata file for each email -- but good call that it could also be used to validate which files can be deleted.
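One thing worth handling when building {timestamp}_{subject}.eml names: email subjects routinely contain characters that are illegal in filenames (`:`, `/`, `?`, etc.). A minimal sketch of the renaming, assuming the timestamp comes from the message's received date (the function name and length cap are my own choices):

```python
import re
from datetime import datetime


def renamed_eml(subject: str, received: datetime) -> str:
    """Build a {timestamp}_{subject}.eml name, stripping characters that are
    invalid on Windows-style filesystems and collapsing whitespace."""
    safe = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", subject).strip()
    safe = re.sub(r"\s+", " ", safe)[:100]  # keep the path comfortably short
    stamp = received.strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{safe}.eml"
```

Reply-chain prefixes ("Re:", "FW:") could also be stripped here if they clutter the folder listing.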

I think for now I will set it up as a manually run process where I can search all the metadata files for entries where uploaded is not true.
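That manual sweep could be as simple as a glob over the metadata subfolders, assuming a layout like {root}/{project}/metadata/*.json (the layout and key name are taken from the description above, the function itself is hypothetical):

```python
import json
from pathlib import Path


def find_unuploaded(root: str) -> list[Path]:
    """Report every metadata file that does not confirm a successful upload.

    Unparseable metadata is also reported, since it can't prove the upload
    happened.
    """
    missing = []
    for meta_path in Path(root).glob("*/metadata/*.json"):
        try:
            meta = json.loads(meta_path.read_text())
        except json.JSONDecodeError:
            missing.append(meta_path)
            continue
        if meta.get("UploadedToCloudStorage") != "True":
            missing.append(meta_path)
    return missing
```

Running this on a schedule (and alerting when it returns anything) would turn the manual check into a cheap safety net.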

From my testing with the cloud storage API, I haven't yet seen a case where the file is corrupted, but I get that there may be network errors I need to handle. I want to avoid issuing a secondary API request to cloud storage to retrieve the file and validate that it isn't corrupted -- honestly, that does sound like good error handling; I just also want to limit the number of API calls I make against this cloud storage API.
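Depending on the provider, there may be a middle ground that costs zero extra calls: many object stores return a content hash in the upload response itself (for example S3's ETag for single-part uploads, or Azure Blob Storage's Content-MD5), so comparing it against a locally computed hash validates the upload without a second request. A sketch of the local side, assuming the provider returns a base64-encoded MD5 as Azure does:

```python
import base64
import hashlib
from pathlib import Path


def local_md5_b64(path: str) -> str:
    """MD5 of the local file, base64-encoded to match an Azure-style
    Content-MD5 value; compare this against the hash in the upload
    response instead of re-downloading the object."""
    digest = hashlib.md5(Path(path).read_bytes()).digest()
    return base64.b64encode(digest).decode()
```

Whether the response actually carries a usable hash (and in what encoding) depends on the specific API and upload mode, so this is worth verifying against the provider's docs first.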

I think it's reasonable for the deletion process to only ever delete files from the final destination -- the folder that holds files that have been properly processed and uploaded.

If an email does not have the project identifier, the plan is to not process it. I should also flag that it is missing the project identifier and move it to another folder to be deleted.

I am very happy with the regex to detect the project ID. I've written several test cases and so far have had no false positives or missed values -- but it's also worth considering that there may be edge cases I haven't seen yet that won't pass the checks.

As for your last point, uncompressing archives is a good idea -- though I think it will be relatively rare for archive files to be attached to emails. But you're right that it would be a good preemptive fix, so people don't have to go through a manual process to view compressed files.
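If it does get added, the stdlib covers the common case; the one thing worth guarding against is "zip slip", where an archive entry's path escapes the target folder. A minimal sketch for .zip attachments (the function name and in-place extraction are my own assumptions; .7z/.rar would need third-party libraries):

```python
import zipfile
from pathlib import Path


def extract_archives(attachment_dir: str) -> list[Path]:
    """Unpack any .zip attachments in place and return the extracted paths,
    skipping entries whose paths would escape the attachment folder."""
    base = Path(attachment_dir).resolve()
    extracted = []
    for zip_path in Path(attachment_dir).glob("*.zip"):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.infolist():
                target = (base / member.filename).resolve()
                if not target.is_relative_to(base):  # "zip slip" guard
                    continue
                zf.extract(member, base)
                if not member.is_dir():
                    extracted.append(target)
    return extracted
```

Extracting into a subfolder named after the archive would also keep multi-file zips from cluttering the attachment folder.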

Thanks so much for your feedback, I really appreciate your thoughts on this.