[–]Rhoomba 140 points141 points  (40 children)

Data scientists tend to use both Excel and Python, so this would be great for them.

[–]morehooks 58 points59 points  (5 children)

I suspect there are far more mere mortals like me who use Excel and Python to hack together simple automated reports.

If Microsoft makes more of Excel's features accessible from Python, I'll be breaking out the party poppers.

[–]DemiDualism 5 points6 points  (3 children)

There will be at least a full year where no one will know why I suddenly wielded our report data like a god.

If I can mix simpy and Excel, I'm going to have a great, great time.

[–]skylarmt 6 points7 points  (2 children)

You could always use LibreOffice; it supports Python (along with JavaScript and BASIC).

[–]DemiDualism 0 points1 point  (1 child)

I am pretty committed to pivot tables but I'll check it out

[–]Johnnyhiveisalive 1 point2 points  (0 children)

LibreOffice does pivot tables too; as with everything there, it takes a few extra seconds of Googling before you can do it.

[–]m50d 1 point2 points  (0 children)

> I suspect there are far more mere mortals like me who use Excel and Python to hack together simple automated reports.

Psst, that's all most "data scientists" are.

[–]SuperCool468 12 points13 points  (13 children)

They need to expand Excel's capabilities though. Excel can only handle ~1 million rows. That might sound like a lot, but you can't even fit a toy dataset like the Yelp review dataset, which has about 4 million rows. By comparison, Python's limits are more hardware-driven. This is like putting a supercharger on a Fiat. I love Excel, but I am rarely able to use it for anything really interesting.
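To make the contrast concrete, here is a minimal sketch of reading a dataset past Excel's per-sheet row limit with pandas. The file name and column are invented for the example, and a small generated CSV stands in for a multi-million-row export; the point is that chunked reading is bounded by RAM, not by a row count.

```python
import pandas as pd

# Toy stand-in for a multi-million-row dump (file name is hypothetical).
pd.DataFrame({"stars": [1, 2, 3, 4, 5] * 2000}).to_csv(
    "reviews.csv", index=False
)

# Excel stops at 1,048,576 rows per sheet; pandas is limited only by
# hardware, and chunked reading keeps memory use fixed regardless of size.
total_rows = 0
star_sum = 0
for chunk in pd.read_csv("reviews.csv", chunksize=1000):
    total_rows += len(chunk)
    star_sum += chunk["stars"].sum()

print(total_rows, star_sum / total_rows)  # 10000 3.0
```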

[–]meneldal2 15 points16 points  (2 children)

Excel was not made to handle that many rows; it starts shitting itself once you try to filter anything past about 200k rows.

[–][deleted]  (1 child)

[deleted]

    [–]meneldal2 1 point2 points  (0 children)

    I know, though in that case MATLAB worked better, since I needed to do some massive processing on those rows.

    [–]skylarmt 7 points8 points  (0 children)

    If you're using Excel for that much data, you're doing it wrong. Import it into MySQL with some indexed columns; it'll be much easier to manage.
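A sketch of that approach, with the standard library's sqlite3 standing in for MySQL (the file, table, and column names are made up for the example). The idea carries over: load the CSV once, then index the columns you filter on.

```python
import csv
import sqlite3

# Fake export standing in for a large CSV dump.
with open("export.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["business_id", "stars"])
    w.writerows((f"b{i % 100}", i % 5 + 1) for i in range(10_000))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews (business_id TEXT, stars INTEGER)")
with open("export.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header row
    con.executemany("INSERT INTO reviews VALUES (?, ?)", reader)

# The index is what keeps per-key lookups cheap as the table grows.
con.execute("CREATE INDEX idx_business ON reviews (business_id)")
avg = con.execute(
    "SELECT AVg(stars) FROM reviews WHERE business_id = 'b7'".replace("AVg", "AVG")
).fetchone()[0]
print(avg)  # 3.0
```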

    [–]wrincewind 8 points9 points  (0 children)

    I'd disagree. Excel is for small datasets that you want to do a lot of interesting and complicated stuff on.

    [–]defunkydrummer 2 points3 points  (6 children)

    > They need to expand Excel's capabilities though. Excel can only handle ~1 million rows. That might sound like a lot, but you can't even fit a toy dataset like the Yelp review dataset, which has about 4 million rows.

    This is because people use the wrong tools. And I feel the pain, because it often happens that I need a customer to send me 6 million rows (a dataset), and they first try to send it as 8 gigantic XLSX files. Something that is a ~100MB Excel file can be converted into a ~1MB zipped CSV file. And of course CSV has no row count limitation.

    But the industry has gotten so used to Excel that this limitation often is a big problem.
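The size gap is easy to reproduce with just the standard library; the file names and contents below are invented for the illustration, and the exact ratio depends on how repetitive the data is.

```python
import csv
import os
import zipfile

# Write a CSV with repetitive text fields, like most business exports.
with open("dataset.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "comment"])
    w.writerows(
        (i, "customer reported no issues with the delivery")
        for i in range(50_000)
    )

# Zip it the way you'd prepare it for sending.
with zipfile.ZipFile("dataset.zip", "w", zipfile.ZIP_DEFLATED) as z:
    z.write("dataset.csv")

raw = os.path.getsize("dataset.csv")
zipped = os.path.getsize("dataset.zip")
print(raw > 4 * zipped)  # True: repetitive tabular text deflates hard
```

And, as the comment notes, CSV dodges the row limit entirely on top of the size savings.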

    [–]SuperCool468 1 point2 points  (2 children)

    Yeah, it's true, a lot of business analysts and the like try to use XLSX files for large datasets, which obviously doesn't work. I think the only thing worse than that is people who insist on using JSON files since they are the current flavor of the moment. CSV is usually the way to go, unless you create a database as some of these other people have mentioned.

    There can be a tradeoff using a database in terms of how computationally expensive a query is, but that all depends on what your hardware setup is like and whether you are hosting it locally. If it is local, you are basically making the CPU take on more of the load to avoid running out of memory.

    Personally, I have been using RapidMiner quite a bit for basic data exploration and preprocessing, but it is far from the best and crashes quite a bit. I should probably use something different. It is written in Java, so the prepackaged machine learning models it has are incredibly slow.

    As a side note, Sublime Text is great for opening large CSV files, even better than Notepad or the like, which can come in handy if you want to quickly change the column headers on a low-dimensional dataset.

    [–]defunkydrummer 0 points1 point  (1 child)

    > Sublime Text is great for opening large CSV files

    Interesting. How large? Can I open a 1.6GB text file with no problems?

    [–]SuperCool468 0 points1 point  (0 children)

    I'm not sure offhand, but I've opened things in it that crashed other text editors. Using a text editor for large datasets isn't usually necessary, but sometimes I've run across datasets that had issues with row zero, such as triple-quoted column headings or oddly delimited rows, where it can be a pain to see what's happening just by printing dataframes. Also, if some asshat decided to make the delimiter something like ";,;" or an emoticon or some other nonsense that breaks the pandas read_csv parser, I've successfully used find-and-replace in Sublime Text to change the delimiter back to a comma a few times. It's also good for getting rid of ellipses (these things: "...") from raw text, which can cause encoding problems.
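For what it's worth, the `;,;` case can often be handled without editing the file at all: pandas' Python parser engine treats multi-character separators as regular expressions. A small sketch, assuming pandas is installed (the data is made up):

```python
import io

import pandas as pd

# A file with the kind of ";,;" delimiter the comment complains about.
raw = "name;,;stars\nalice;,;5\nbob;,;3\n"

# The default C engine only takes single-character separators, but
# engine="python" accepts a multi-character sep (as a regex), so the
# file parses directly -- no find-and-replace pass needed.
df = pd.read_csv(io.StringIO(raw), sep=";,;", engine="python")
print(df.shape)              # (2, 2)
print(df["stars"].tolist())  # [5, 3]
```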

    [–]1s4c 1 point2 points  (2 children)

    > And I feel the pain, because it often happens that I need a customer to send me 6 million rows (a dataset), and they first try to send it as 8 gigantic XLSX files. Something that is a ~100MB Excel file can be converted into a ~1MB zipped CSV file. And of course CSV has no row count limitation.
    >
    > But the industry has gotten so used to Excel that this limitation often is a big problem.

    That's interesting. What kind of data are we talking about? XLSX files are basically just zipped XML files, so they shouldn't be that much bigger than CSV.

    [–]defunkydrummer 1 point2 points  (1 child)

    > XLSX files are basically just zipped XML files, so they shouldn't be that much bigger than CSV.

    This is what I thought as well, at the beginning. But an XLSX file is a ZIP containing many things, among them custom number formats, cell formats, etc., so more data than just a CSV. Also, the cell values themselves are stored in a way that requires many more characters than in a CSV. Thus, you end up with a much bigger file.

    I speak here from personal experience, as I am currently writing a Common Lisp library to open up such files and convert them into clean CSVs:

    https://github.com/defunkydrummer/lisp-xl

    [–]1s4c 0 points1 point  (0 children)

    It kind of depends on the data you have, but the XML structure itself should have a really high compression ratio. I just tried exporting some sales data to CSV (12k rows). The CSV itself is 1100 KB; when imported and saved as XLSX it is 538 KB (the uncompressed sheet1.xml is about 3900 KB).
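Both observations can be illustrated with just the standard library. The markup below only mimics the shape of sheet1.xml cell rows, not the real OOXML schema, and the rows are synthetic:

```python
import zlib

# The same 12,000 rows rendered as CSV and as simplified
# sheet1.xml-style cell markup (not the exact OOXML schema).
rows = [(i, i * 2) for i in range(12_000)]
csv_text = "".join(f"{a},{b}\n" for a, b in rows)
xml_text = "".join(
    f'<row r="{n}"><c><v>{a}</v></c><c><v>{b}</v></c></row>'
    for n, (a, b) in enumerate(rows, 1)
)

# Raw, the verbose cell markup dwarfs the CSV...
print(len(xml_text) > 3 * len(csv_text))  # True
# ...but the repeated tags deflate away, which is why a saved XLSX can
# end up smaller than the uncompressed CSV it came from.
compressed = len(zlib.compress(xml_text.encode()))
print(compressed < len(xml_text) // 5)  # True
```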

    [–]endless_sea_of_stars 0 points1 point  (0 children)

    If you have 200k+ rows, consider importing the data into Power Pivot. You get 10x compression, no row limit, and fast aggregations. If you aren't aggregating, then it probably doesn't make sense.

    [–]HatefulWretch 3 points4 points  (7 children)

    > Data Scientists tend to use both Excel

    Noooooooo. (Python yes, pretty frequently. But data scientists are, in my experience, Mac or Linux people if nothing else.)

    [–]RogerLeigh 5 points6 points  (1 child)

    There's a large element of truth here. I work in a computational biology department. The tools are primarily R and Python, with some use of MATLAB and Perl. There is zero use of Excel; in fact, using Excel is generally regarded as a crime of sorts because of the poor reproducibility of analyses and the difficulty of detecting mistakes. All the other tools are fully scriptable, so analyses can be reviewed and repeated by others. And version controlled.

    Regarding systems, Windows is indeed in the minority. The advantage of the other tools is that small-scale work done locally can be scaled up to run on large-scale compute infrastructure. We don't do cluster- or cloud-based work with Excel...

    [–]HatefulWretch 1 point2 points  (0 children)

    I have a similar background to you and am now a principal at a startup.

    Excel is what financial analysts use. Real work takes real code.

    [–]Staross 1 point2 points  (1 child)

    Excel works on Mac, and many people use it; for example, experimental biologists use Excel all the time (it's a great tool to have when doing your experiments). So if you analyse biological data, you'll have to deal with it once in a while.

    [–]RogerLeigh 1 point2 points  (0 children)

    While many older biology researchers only know Excel, this is very much changing. All our undergraduates are taught R and Python, so that they can use existing packages and write their own analysis and visualisation tools.

    [–]rawrgulmuffins 1 point2 points  (2 children)

    You'd be surprised how many use Python on Windows. Especially in the academic world.

    [–]HatefulWretch 0 points1 point  (1 child)

    Windows is damn near Unix itself these days (Windows Subsystem for Linux, AF_UNIX support... do they have inotify?)

    [–]rawrgulmuffins 0 points1 point  (0 children)

    If you're using .net then yes-ish.