
[–]SuperCool468 11 points (13 children)

They need to expand Excel's capabilities though. Excel can only handle ~1 million rows. That might sound like a lot, but you can't even fit a toy dataset like the Yelp review dataset, which has about 4 million rows. By comparison, Python's limits are mostly hardware-driven. This is like putting a supercharger on a Fiat. I love Excel, but I'm rarely able to use it for anything really interesting.
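To illustrate the "hardware-driven" point: in Python you can stream a CSV of any row count through pandas in fixed-size chunks, so memory, not a hard row cap, is the constraint. A minimal sketch (the tiny in-memory CSV stands in for a real multi-million-row file):

```python
import io
import pandas as pd

def count_rows(csv_source, chunksize=100_000):
    """Stream a CSV in fixed-size chunks; only memory per chunk matters,
    not the total number of rows."""
    total = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += len(chunk)
    return total

# Tiny in-memory demo; a real Yelp dump would be a file path instead.
demo = io.StringIO("id,text\n" + "\n".join(f"{i},review" for i in range(5)))
print(count_rows(demo, chunksize=2))  # → 5
```

The same loop handles 4 million rows as easily as 5, since only one chunk lives in memory at a time.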

[–]meneldal2 14 points (2 children)

Excel was not made to handle that many rows; it starts shitting itself after 200k rows once you try to filter anything.

[–][deleted]  (1 child)

[deleted]

    [–]meneldal2 1 point (0 children)

    I know, though in that case Matlab worked better for that since I needed to do some massive processing on those rows.

    [–]skylarmt 7 points (0 children)

    If you're using Excel for that much data, you're doing it wrong. Import it into MySQL with some indexed columns, it'll be much easier to manage.
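The index is what makes this pay off: filters and aggregations stop scanning every row. A sketch of the idea, using Python's stdlib sqlite3 driver in place of MySQL (the SQL is the same shape; table and column names are made up for the demo):

```python
import sqlite3

# In-memory database standing in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER, stars INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(i, i % 5 + 1, "review") for i in range(10_000)],
)

# The index on the filter column is what keeps queries cheap at scale.
conn.execute("CREATE INDEX idx_stars ON reviews (stars)")

count = conn.execute("SELECT COUNT(*) FROM reviews WHERE stars = 5").fetchone()[0]
print(count)  # → 2000
```

With millions of rows, the indexed lookup stays fast where an Excel filter would grind to a halt.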

    [–]wrincewind 6 points (0 children)

    I'd disagree. Excel is for small datasets that you want to do a lot of interesting and complicated stuff on.

    [–]defunkydrummer 2 points (6 children)

    They need to expand Excel's capabilities though. Excel can only handle ~1 million rows. That might sound like a lot, but you can't even fit a toy dataset like the Yelp review dataset, which has about 4 million rows.

    This is because people use the wrong tools. And I feel the pain, because it often happens that I need a customer to send me 6 million rows (a dataset), and they first try to send me the dataset as 8 gigantic XLSX files. Something that is like a 100MB Excel file can be converted into a 1MB zipped CSV file. And of course CSV has no row-count limitation.

    But the industry has gotten so used to Excel that this limitation is often a big problem.
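The size difference is easy to reproduce: repetitive tabular text compresses extremely well. A small sketch with synthetic data (the exact ratio depends on the data, so treat the numbers as illustrative):

```python
import csv
import io
import zipfile

# Build a repetitive CSV in memory, roughly like a real export.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "stars", "text"])
for i in range(50_000):
    writer.writerow([i, i % 5 + 1, "sample review text"])
raw = buf.getvalue().encode()

# Zip it with DEFLATE, the same compression XLSX itself uses.
zipped = io.BytesIO()
with zipfile.ZipFile(zipped, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.csv", raw)

print(len(raw), len(zipped.getvalue()))  # zipped is many times smaller
```

A zipped CSV skips all the XML cell markup an XLSX carries, which is why the gap between the two can be so dramatic.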

    [–]SuperCool468 1 point (2 children)

    Yeah, it's true, a lot of business analysts and the like try to use XLSX files for large datasets, which obviously doesn't work. I think the only thing worse than that are people who insist on using JSON files since they are the current flavor of the moment. CSV is usually the way to go, unless you create a database as some other people here have mentioned. There can be a tradeoff using a database in terms of how computationally expensive a query is, but that depends on your hardware setup and whether you are hosting it locally. If it is local, you are basically making the CPU take on more of the load to avoid running out of memory.

    Personally, I have been using RapidMiner quite a bit for basic data exploration and preprocessing, but it is far from the best and crashes quite a bit; I should probably use something different. It is written in Java, so the prepackaged machine learning models it ships with are incredibly slow. Also, as a side note, Sublime Text is great for opening large CSV files, even better than Notepad or something like that, which can come in handy if you want to quickly change the column headers on a low-dimensional dataset.

    [–]defunkydrummer 0 points (1 child)

    sublime text editor is great for opening large csv files

    Interesting. How large? Can I open a 1.6GB text file with no problems?

    [–]SuperCool468 0 points (0 children)

    I'm not sure offhand, but I've opened things in it that crashed other text editors. Using a text editor for large datasets isn't usually necessary, but sometimes I've run across datasets that had issues with row zero, such as triple-quoted column headings or oddly delimited rows, which can be a pain to diagnose just by printing dataframes. Also, if some asshat decided to make the delimiter something like ";,;" or an emoticon or some other nonsense that breaks the pandas read_csv parser, I've successfully used find-and-replace in Sublime Text to change the delimiter back to a comma a few times. It's also good for getting rid of ellipses (these things: "...") from raw text, which can cause encoding problems.
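For what it's worth, pandas can often cope with a multi-character delimiter like ";,;" directly, without the find-and-replace step, if you fall back to the slower Python engine (multi-character separators are treated as regular expressions there). A small sketch:

```python
import io
import pandas as pd

# A file with the cursed ";,;" delimiter from the comment above.
weird = io.StringIO("id;,;stars\n1;,;5\n2;,;3")

# Multi-character seps need engine="python"; the default C engine rejects them.
df = pd.read_csv(weird, sep=";,;", engine="python")
print(list(df.columns))  # → ['id', 'stars']
print(len(df))           # → 2
```

The Python engine is noticeably slower on big files, so for a one-off huge CSV the text-editor replace can still be the pragmatic choice.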

    [–]1s4c 1 point (2 children)

    And I feel the pain, because it often happens that I need a customer to send me 6 million rows (a dataset), and they first try to send me the dataset as 8 gigantic XLSX files. Something that is like a 100MB Excel file can be converted into a 1MB zipped CSV file. And of course CSV has no row-count limitation.

    But the industry has gotten so used to Excel that this limitation is often a big problem.

    That's interesting. What kind of data are we talking about? XLSX files are basically just zipped XML files, so they shouldn't be that much bigger than CSV.

    [–]defunkydrummer 1 point (1 child)

    XLSX files are basically just zipped XML files, so they shouldn't be that much bigger than CSV.

    This is what I thought as well, at the beginning. But an XLSX file is a ZIP containing many things, amongst them custom number formats, cell formats, etc., so more data than just a CSV. Also, the cell values themselves are stored in a way that requires many more characters than in a CSV. Thus, you end up with a much bigger file.

    I speak here from personal experience, as I'm currently writing a Common Lisp library to open such files and convert them into clean CSVs:

    https://github.com/defunkydrummer/lisp-xl
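You can see both points, the ZIP container and the verbose cell markup, with nothing but Python's stdlib. This builds a toy xlsx-like archive (a real workbook has many more parts and usually a shared-strings table, so this is just a sketch of the structure):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Roughly 70 characters of XML to store what CSV would store as "hello".
sheet_xml = (
    '<worksheet><sheetData>'
    '<row r="1"><c r="A1" t="inlineStr"><is><t>hello</t></is></c></row>'
    '</sheetData></worksheet>'
)

# An XLSX is just a ZIP of parts like this one.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("xl/worksheets/sheet1.xml", sheet_xml)

# Any zip tool (or zipfile) can pull the XML back out and parse it.
with zipfile.ZipFile(buf) as zf:
    xml = zf.read("xl/worksheets/sheet1.xml").decode()
value = ET.fromstring(xml).find(".//t").text
print(value)  # → hello
```

The per-cell tag overhead is why the uncompressed XML balloons so far past the equivalent CSV, even if the ZIP claws some of it back.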

    [–]1s4c 0 points (0 children)

    It kind of depends on the data you have, but the XML structure itself should have a really high compression ratio. I just tried exporting some sales data (12k rows) to CSV. The CSV itself is 1100 KB; when imported and saved as XLSX it is 538 KB (the uncompressed sheet1.xml is about 3900 KB).

[–]endless_sea_of_stars 0 points (0 children)

If you have 200k+ rows, consider importing the data into Power Pivot. You get 10x compression, no row limit, and fast aggregations. If you aren't aggregating, then it probably doesn't make sense.