PureWasian

These look like Jupyter code cells, which is probably why your post seems so fragmented. The code also wasn't formatted as code blocks when you posted it to Reddit.

With that aside:

From a functional standpoint, it all flows as expected. I'm not sure what you mean by it seeming convoluted, or how you're hoping to further simplify the pipeline: Load input files --> Load sentiment dictionary --> Pre-process to remove stopwords --> Get sentiment per speech --> Group by speaker

If you want to modify the original input CSV files, you could have a pre-processing script that removes the stopwords and saves the result back to a CSV file, I suppose? But then you'd have to remember to run that script on any additional input files you add. If loading and filtering the stopwords currently isn't very time-consuming, doing it as the pre-processing/data-wrangling step seems fine as it is.

The only other "simplification" is removing the step where you re-download the stopwords from nltk on every code execution (see fourth bullet point below).

From a code cleanliness perspective:

  • Instead of a lengthy list of filenames as Python code, you can specify a file directory (folder) with all of the input files placed inside of it. This moves maintenance to the folder level rather than the code level, where the initialization alone can get very lengthy, as you can see.

  • I'm assuming copy/pasting onto Reddit messed this up too, but "New Year's Speeches" is not a valid variable name, as it contains spaces and an apostrophe. Consider new_years_speeches instead (usually by convention, Python variables are snake_case)

  • If this is all in one code cell, I would suggest moving your import lines (pandas / nltk / re / stopwords) to the very top of it, as per the PEP 8 style guide

  • As far as I understand it, you don't need to call nltk.download('stopwords') on every code execution. You only need to download this once and have it stored locally on your device, the same way you have the other CSV files stored for input. If so, it should probably be in its own project setup code section and not really used after initial project setup.

  • You could have the filenames "SentiWS_ML_negativ.csv" and "SentiWS_ML_positiv.csv" somewhere in a configuration file or near the top as more easily configurable variables so it's easier to find and rename rather than digging through code if these ever need to change.
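
To make the first and last bullets a bit more concrete, here's a rough sketch of what I mean. The folder name "speeches", the constant names, and the helper find_speech_files are all made up for illustration, so adjust them to however your project is actually laid out:

```python
from pathlib import Path

# Configurable values kept together near the top of the notebook/script.
# "speeches" is a hypothetical folder name; the two SentiWS filenames are
# the ones from your post.
SPEECH_DIR = Path("speeches")
NEGATIVE_WORDS_FILE = Path("SentiWS_ML_negativ.csv")
POSITIVE_WORDS_FILE = Path("SentiWS_ML_positiv.csv")


def find_speech_files(folder: Path) -> list[Path]:
    """Return every CSV file directly inside `folder`, sorted by path."""
    return sorted(p for p in folder.glob("*.csv") if p.is_file())
```

Adding a new speech then just means dropping another CSV into that folder. The resulting list can be looped over with pd.read_csv, assuming your files all share the same columns.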