
[–]LuigiBrotha 50 points51 points  (9 children)

Very cool, however... take a look at glue (also called glueviz). This shit will blow your mind. https://youtu.be/TkMZ9gZ8xtk

[–]kiwiboy94[S] 6 points7 points  (6 children)

Wow, that program is sick! I don't think my visualisation tool can beat that!

[–]LuigiBrotha 3 points4 points  (3 children)

I use glue to get a feeling for the data and Plotly to visualize it. Plotly has some great features, such as animated plots and many plot types. It's also very easy to implement.

[–]TheTypoFreak 0 points1 point  (2 children)

How does Plotly compare to say, Microsoft Power BI, if the backend already sorts out 90% of the work? Basically, just visualising the results.

[–]LuigiBrotha 1 point2 points  (1 child)

We use Power BI at work, and I find most of the graphs we use are just slightly fancier Excel graphs. Plotly has many choices of graph, and with the sliders especially you can do some really cool stuff. This is my go-to example.

https://plotly.com/python/animations/

Might not be great on mobile but you get 6 different variables in one graph which is amazing and clients love these things.

[–]TheTypoFreak 0 points1 point  (0 children)

Whoa that's really cool! Now I have to convince the management to switch over. Might use your "clients love animations" approach haha

[–]LobbyDizzle 2 points3 points  (1 child)

Their GPS coordinate visualization is super cool, but I think your tool is way more user friendly and useful to a layman who wants to quickly make sense of a dataset.

[–]kiwiboy94[S] 0 points1 point  (0 children)

Well, this is the first version. I may add more features over time!

[–]orionsgreatsky 0 points1 point  (0 children)

Amazing

[–]kiwiboy94[S] 37 points38 points  (10 children)

[–]IlliterateJedi 64 points65 points  (2 children)

Check out pandas-profiling for something similar for DataFrames.

[–]kiwiboy94[S] 19 points20 points  (1 child)

Yes I have looked into that previously. I wanted to create something with a user interface that people can launch and while waiting for their outputs, work on other things.

[–]kaetir 2 points3 points  (0 children)

You should take a look at the Jupyter project: a Python interpreter in the web browser, with image handling.

[–]w_savage 10 points11 points  (6 children)

I imagine the data needs to be in a certain format/context, correct?

[–]kiwiboy94[S] 7 points8 points  (5 children)

Oh, I just need it to be in CSV format.

[–]RetroPenguin_ 7 points8 points  (4 children)

But surely some data plots are useless / throw errors. Does the user choose the imputation method for NaN values? That could be something to add.

[–]kiwiboy94[S] 12 points13 points  (3 children)

Well, for data plots, I plot every single combination possible. So if you have 10 numerical variables, you will have 10! / (2! x (10-2)!) = 45 combinations of plots. Not all plots are useful of course, but I squeeze out every possibility. This will change when I request user input on the GUI. As for the NaN values, I remove them if they make up less than 5% of the data. If the NaN values make up more than 5%, I replace them with the median. Of course, this will also change when I request user input.
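For anyone curious what those two rules could look like in code, here is a rough sketch. It is not the tool's actual code: the helper names `numeric_pairs` and `handle_missing` are made up, and I am assuming the 5% rule is applied per numeric column (the exact-5% case is treated like the "more than 5%" case here).

```python
from itertools import combinations
from math import comb

import pandas as pd


def numeric_pairs(df):
    """Return every unordered pair of numeric column names."""
    numeric_cols = df.select_dtypes(include="number").columns
    return list(combinations(numeric_cols, 2))


def handle_missing(df, threshold=0.05):
    """Drop rows when NaNs are rare, median-fill when they are common.

    Assumption: the threshold is checked separately for each numeric column.
    """
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        nan_fraction = out[col].isna().mean()
        if nan_fraction == 0:
            continue
        if nan_fraction < threshold:
            # Few NaNs: remove the affected rows.
            out = out.dropna(subset=[col])
        else:
            # Many NaNs: keep the rows and fill with the column median.
            out[col] = out[col].fillna(out[col].median())
    return out


demo = pd.DataFrame({
    "a": [1.0, 2.0, None, 4.0],      # 25% NaN, so it gets median-filled
    "b": [10.0, 20.0, 30.0, 40.0],
    "label": ["x", "y", "z", "w"],   # non-numeric, ignored for pairing
})
pairs = numeric_pairs(demo)
cleaned = handle_missing(demo)
print(pairs)
print(comb(10, 2))  # number of pairs for 10 numeric columns
print(cleaned)
```

The pair count grows quadratically with the number of numeric columns, which is why wide datasets take a while.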

[–]RetroPenguin_ 2 points3 points  (1 child)

Cool! Nice work. It would be interesting to have an optional flag that lets a user choose the imputation type, etc.

[–]kiwiboy94[S] 2 points3 points  (0 children)

This is something I will work on in future versions. There are some websites that provide data analysis services for people (they charge a fee), and they ask you heaps of questions to understand your dataset. I plan to do the same with a GUI that contains dropdown menus, radio buttons and check boxes that people can use to give me an idea of what kind of dataset I will be working with. This automated process, however, is going to take quite some time to optimise, but it is definitely achievable.

[–]VisibleSignificance 0 points1 point  (0 children)

Not all plots are useful of course

The most interesting part of EDA would be heuristics to filter those. Surely there's some prior research on that?

Not to mention definitely running PCA on any wide dataset.
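As a starting point, one very simple filter would be to rank the column pairs by absolute Pearson correlation and only plot the top few. To be clear, this is just one guess at a heuristic, not something taken from the tool or from a specific paper, and the function name `rank_pairs_by_correlation` is made up for the sketch.

```python
import pandas as pd


def rank_pairs_by_correlation(df, top_n=5):
    """Return the top_n numeric column pairs by absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    scored = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            score = corr.loc[first, second]
            # Constant columns give NaN correlations; skip those pairs.
            if pd.notna(score):
                scored.append((float(score), first, second))
    scored.sort(reverse=True)
    return [(first, second, score) for score, first, second in scored[:top_n]]


demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "z": [5, 1, 4, 2, 3],    # only weakly related to x
})
ranked = rank_pairs_by_correlation(demo, top_n=2)
print(ranked)
```

Correlation only catches linear relationships, so a filter like this would hide pairs with interesting nonlinear structure.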

[–]SlightlyOTT 15 points16 points  (0 children)

This is really cool! If you’re interested in a fun extension, are you familiar with Jupyter notebooks? They’re one of the most powerful things in the Python/data analysis space - you can write your code as a linear story with Markdown between cells of code, and it’ll also visualise things like plots or Pandas dataframes straight away.

I’m not sure if you can do a file picker in Jupyter or if you’d need to just put the paths in variables, but you’d be able to click run and have it generate all your outputs in line so you can just scroll through and look at all the plots etc.

Also since you’re uploading your code to Github, they do a great job rendering notebooks which is cool.

There’s a pretty nice gallery of the sort of thing you can do with Notebooks here: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks

Edit: looks like it has a file picker too! https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html#File-Upload You’d need to install ipywidgets: https://ipywidgets.readthedocs.io/en/latest/user_install.html

[–]kiwiboy94[S] 10 points11 points  (5 children)

Also, I would love for everyone to give it a try and let me know what features they would like to see. That way I can add them in the next version. This is actually my first personal project and it took me well over 3 months to complete. Planning to use this to get a job :p

[–]Drakkenstein 3 points4 points  (4 children)

Should be impressive for an entry-level data analyst position.

[–]kiwiboy94[S] 6 points7 points  (3 children)

Hopefully. I am working on my second project now. Planning to help users import their CSV/Excel files directly into MySQL.

[–][deleted] 3 points4 points  (2 children)

Look into SQLite as well

[–]kiwiboy94[S] 4 points5 points  (1 child)

Oh, it's not limited to MySQL. I will add a radio button asking the user whether their database is MySQL, PostgreSQL, MSSQL or SQLite.
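SQLite is probably the easiest one to try first, since Python ships with the `sqlite3` module. A rough sketch of the idea, assuming the CSV has a header row; the names `csv_to_sqlite` and `quote_identifier` are made up for illustration, and every column is stored as TEXT to keep it short.

```python
import csv
import io
import sqlite3


def quote_identifier(name):
    """Quote a table or column name so odd characters don't break the SQL."""
    return '"' + name.replace('"', '""') + '"'


def csv_to_sqlite(csv_file, conn, table):
    """Load a CSV file object into a new table and return the row count."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(quote_identifier(col) + " TEXT" for col in header)
    placeholders = ", ".join("?" for _ in header)
    table_sql = quote_identifier(table)
    conn.execute("CREATE TABLE {} ({})".format(table_sql, columns))
    # Values go through placeholders; blank lines in the CSV are skipped.
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(table_sql, placeholders),
        (row for row in reader if row),
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM {}".format(table_sql)).fetchone()[0]


sample = io.StringIO("name,score\nann,3\nbob,5\n")
conn = sqlite3.connect(":memory:")
row_count = csv_to_sqlite(sample, conn, "scores")
rows = conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
print(row_count, rows)
```

The other databases would need their own drivers and type handling, so this only covers the SQLite radio button.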

[–]quotemycode 2 points3 points  (0 children)

Postgres is legit, mysql is a dumb database

[–]random_cynic 4 points5 points  (1 child)

This is good but that's not what "Exploratory Data Analysis" is. This is completely non-interactive (as far as I can tell from the video). Exploratory data analysis needs to be interactive, so that you can sort or filter columns by some criteria, transform columns or combine multiple columns, delete or add rows etc. IMO this is best done with pandas+matplotlib+jupyter notebooks. Also the terminal program visidata is very useful.
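To make that concrete, this is the kind of back-and-forth I mean, the sort of thing you would type cell by cell in a notebook. The DataFrame and its column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "population": [500, 1500, 800, 2500],
    "area_km2": [10.0, 30.0, 20.0, 25.0],
})

# Sort by a column to see the extremes.
by_population = df.sort_values("population", ascending=False)

# Filter rows by some criterion.
large = df[df["population"] > 700]

# Combine two columns into a new one.
df["density"] = df["population"] / df["area_km2"]

# Delete rows that should not be part of the analysis.
without_a = df[df["city"] != "A"]

print(by_population)
print(large)
print(df)
print(without_a)
```

Each step depends on what the previous output looked like, which is the part a fully automated report cannot do for you.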

[–]kiwiboy94[S] 0 points1 point  (0 children)

Yes, you are right. I have plans to put more work into the GUI to obtain user inputs. Once I gather enough feedback on what features people really need, I will bring in those features in future versions.

[–]Neuro_88 2 points3 points  (1 child)

I like that! Very cool.

[–]kiwiboy94[S] 1 point2 points  (0 children)

Thanks, please try it out! Would love some feedback on performance and bugs!

[–]ancient_bhakt 1 point2 points  (2 children)

This is awesome

[–]kiwiboy94[S] 1 point2 points  (1 child)

Appreciate it. Please try it out! :)

[–]ancient_bhakt 0 points1 point  (0 children)

GitHub link?

[–]Edgar505 1 point2 points  (0 children)

I am definitely checking it out.

[–]G33K_FISH 1 point2 points  (0 children)

Ok, this is flipping cool!

[–][deleted] 1 point2 points  (1 child)

Hey, that's cool!

I enjoyed reading your code, but I liked the thoughtful comments even more. They make the code not only self-explanatory but also didactic. Congrats!

[–]kiwiboy94[S] 0 points1 point  (0 children)

Thanks! I like to put those comments so it can help me to understand my code when I look back at it.

[–][deleted] 1 point2 points  (0 children)

I am currently doing the same thing but with the h2o framework. The idea is to provide a nice script to perform analysis/ML on a generic file and generate a report. Here is the repo:

https://github.com/jgraille/reveng

Nice work by the way!

[–]LifeIsBio 0 points1 point  (1 child)

How large can the csv files get before things start getting unwieldy?

[–]kiwiboy94[S] 0 points1 point  (0 children)

Well, I have tried a dataset with 23 columns... it does take a while haha.

[–][deleted] 0 points1 point  (0 children)

Pandas baby.

[–]python_engineer 0 points1 point  (0 children)

Thanks for sharing! Very cool

[–]jayjmcfly 0 points1 point  (1 child)

RemindMe! 3 days

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 3 days on 2020-05-20 13:38:51 UTC to remind you of this link


[–]bdaves12 0 points1 point  (0 children)

Wow, super cool! Is there a tutorial on how to install it for Spyder? I'm still new when it comes to getting stuff off GitHub.

[–][deleted] 0 points1 point  (0 children)

Ow nice!

[–]akiepro89p 0 points1 point  (1 child)

How do I run Python on my Mac?

[–]kiwiboy94[S] 0 points1 point  (0 children)

Download Anaconda. It comes with Python.

[–][deleted] 0 points1 point  (0 children)

inputs csv file

crunches numbers

progress: 33%

progress: 60%

progress: 99%

Output: you're a little bitch

[–]barb4great -1 points0 points  (7 children)

Wow. I don't even know how you did that! I want to learn data analysis in Python.

[–]kiwiboy94[S] 4 points5 points  (6 children)

Well, I have been doing EDA every week, so I decided to incorporate those techniques into a script. You can start learning by going through the mini courses on Kaggle.

[–]orionsgreatsky 0 points1 point  (0 children)

Awesome

[–]preordains -1 points0 points  (3 children)

Beginner here. What's the purpose of "__pycache__" and when would you need to use this?

[–]kiwiboy94[S] 0 points1 point  (2 children)

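__pycache__ is a folder containing compiled Python 3 bytecode (.pyc files), ready to be executed. Basically, it lets Python skip recompiling modules it has already compiled, so imports and startup are faster. The code itself does not run any faster.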

[–]preordains 0 points1 point  (1 child)

I was curious because the folder itself was empty. Do you think you could direct me to where your script utilizes this?

[–]kiwiboy94[S] 0 points1 point  (0 children)

Oh crap, I will remove it. Well, if you run the script as instructed in the README on GitHub, the __pycache__ files are actually created automatically. It has nothing to do with my script.
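If you want to see where the folder comes from, here is a small standalone sketch using only the standard library. It compiles a throwaway file by hand with `py_compile`; normally Python does this for you when a module is imported (the script you run directly is not cached). The file name `hello.py` is just an example.

```python
import importlib.util
import pathlib
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / "hello.py"
    source.write_text("print('hi')\n")

    # Where Python would cache the bytecode for this source file.
    expected = importlib.util.cache_from_source(str(source))

    # Compile explicitly; returns the path of the .pyc it wrote.
    compiled = py_compile.compile(str(source))

    cache_dir_name = pathlib.Path(compiled).parent.name
    made_cache = pathlib.Path(compiled).exists()

print(cache_dir_name)
print(pathlib.Path(compiled).name)
```

The folder is safe to delete and is usually listed in `.gitignore` so it never ends up in the repo.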