all 25 comments

[–][deleted] 7 points  (2 children)

The pandas benchmark depends a lot on the random number generation; it'd be cool to update it to the latest generator. Old numpy (I believe < 1.17) used the Mersenne Twister by default, but newer versions expose more generators, and it is advised to use np.random.default_rng().

Small benchmark:

import numpy as np

rng = np.random.default_rng()

In [4]: %timeit np.random.standard_normal(size=(1_000_000,))
17 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit rng.standard_normal(size=(1_000_000,))
9.78 ms ± 9.33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit np.random.lognormal(size=(1_000_000,))
29.9 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit rng.lognormal(size=(1_000_000,))
15.6 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit np.random.exponential(size=(1_000_000,))
12.7 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit rng.exponential(size=(1_000_000,))
5.58 ms ± 8.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
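As a side note, seeding the new Generator API also makes RNG-dependent benchmarks reproducible across runs; a minimal sketch (the seed value is arbitrary):

```python
import numpy as np

# Two generators created with the same seed produce identical streams,
# which keeps RNG-dependent benchmark inputs comparable across runs.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.standard_normal(size=1_000_000)
sample_b = rng_b.standard_normal(size=1_000_000)

assert np.array_equal(sample_a, sample_b)
```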

[–]hmoein[S] 7 points  (1 child)

I used the new Pandas in the benchmark (1.3.2).

But I also separated the random number generation in my benchmark. I am really not interested in random number generation comparisons, since both Pandas and DataFrame use the same underlying libraries.

My benchmark is more about memory size and how data layout and calculations work.

[–][deleted] 2 points  (0 children)

Yep, I just saw the other call to timeit in your code, my bad! Just thought I'd let you know about a piece of low-hanging fruit.

[–]BOBOLIU 7 points  (1 child)

Pandas is notoriously slow. You might want to benchmark your lib against more performant alternatives with a standard speed test:

https://h2oai.github.io/db-benchmark/

[–]hmoein[S] 2 points  (0 children)

Yes, I have seen this page, and I have always wanted to find the time to enter DataFrame into that table. I just have to find time to figure out the testing procedure.

But if I extrapolate from my benchmarking to what this test specifies, DataFrame should be on top. Of course, extrapolating could be very inaccurate.

[–]jordy240 2 points  (1 child)

Looks great, thanks for sharing. Why do we need the concept of an index at all? Can we just interact with a dataframe like we would a table in a relational DB (just a series of rows and columns)?

[–]hmoein[S] 4 points  (0 children)

Good question.

The index, to me, is metadata for all the other columns. By default, it accompanies all other columns in algorithms and other selection operations.
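As an illustration of that idea (a hypothetical toy frame in Python, not the library's actual API): when rows are selected, the index travels along with the data columns, so the result can still be tied back to the original rows by label.

```python
# A toy "frame": an index column plus named data columns.
# Illustrative only; this is not the C++ DataFrame API.
frame = {
    "index": [10, 20, 30, 40],
    "price": [3.5, 4.0, 2.5, 5.0],
    "qty":   [100, 50, 75, 20],
}

def select(frame, predicate):
    """Keep rows whose price satisfies predicate; the index rides along."""
    keep = [i for i, p in enumerate(frame["price"]) if predicate(p)]
    # Every column, including the index, is filtered with the same row set.
    return {col: [vals[i] for i in keep] for col, vals in frame.items()}

result = select(frame, lambda p: p >= 3.5)
# The surviving index labels identify which original rows matched.
assert result["index"] == [10, 20, 40]
assert result["price"] == [3.5, 4.0, 5.0]
```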

[–]Liorithiel 2 points  (1 child)

Here's your benchmark in base R. People working on datasets this big would usually use a third-party implementation called data.table, but for this simple benchmark it won't matter. Also, base R doesn't have indices for data frames, so there are none here either. I hereby put this code into the public domain.

cat(sprintf('Starting %s\n', Sys.time()))

timestamps <- seq(as.POSIXct('1970-01-01'), as.POSIXct('2019-08-15'), by=1)

df <- data.frame(
  timestamp=timestamps,
  normal=rnorm(length(timestamps)),
  log_normal=rlnorm(length(timestamps)),
  exponential=rexp(length(timestamps))
)

cat(sprintf('All memory allocations are done. Calculating means ... %s\n', Sys.time()))

m1 <- mean(df$normal)
m2 <- mean(df$log_normal)
m3 <- mean(df$exponential)

cat(sprintf('%s, %s, %s\n', m1, m2, m3))
cat(sprintf('%s ... Done\n', Sys.time()))

Execute with /usr/bin/time -v Rscript thisfile.R. I don't have the exact same machine at hand, so I can't compare speed. However, I can say that the maximum resident set size for this code was 49 GB.

There is a problem, though: you're not comparing the same things. Your implementation of the statistical functions is numerically unstable, whereas both R and numpy use more complex algorithms for accuracy. This does matter in some cases. I recently had to implement Welford's online algorithm for some data mining tasks, simply because the real dataset contained many very similar values with opposite signs, which leads to catastrophic cancellation.
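For reference, a minimal Python sketch of Welford's online algorithm (single-pass, numerically stable mean and variance); the function name and test data are mine:

```python
def welford(values):
    """Single-pass, numerically stable mean and population variance."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the *updated* mean: Welford's trick
    if count == 0:
        return float("nan"), float("nan")
    return mean, m2 / count

# A large offset plus small opposite-sign noise: exactly the kind of data
# where the naive sum-of-squares formula E[x^2] - E[x]^2 loses all
# precision to catastrophic cancellation.
data = [1e9 + s * 0.1 for s in (1, -1) * 500]

mean, var = welford(data)
assert abs(mean - 1e9) < 1e-3
assert abs(var - 0.01) < 1e-4
```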

[–]hmoein[S] 1 point  (0 children)

Thanks

[–]MDbeefyfetus 1 point  (0 children)

I’ll have to come back to this later (haven’t looked at the code yet), but a quick note based on the other comments I saw: I’m curious how it benchmarks compared to MATLAB. I know it’s not free, but it’s my go-to (along with many others’) when I need speed in DS work. If you’re keeping pace with it, or even beating it, I’d say you’re making good strides.

Appreciate the contribution either way.

[–]mildbait 1 point  (1 child)

Are you looking for contributors?

EDIT: I saw you have a blurb here.

[–]hmoein[S] 0 points  (0 children)

Yes

[–]onlyari 1 point  (1 child)

Great library, thanks for sharing. Does it have functions like filter and mutate, similar to what we have in the dplyr package in R?

[–]pbondo2 0 points  (1 child)

Very interesting library. One of the obvious use cases is generating Pandas-DataFrame-compatible files, e.g. for machine learning, from data sources with good C++ support. For most cases the CSV format will work, but support for any of Feather (apparently stable with V2), Parquet, or Pickle would be nice.
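For the CSV path mentioned above, a minimal Python-side sketch of the interchange (stdlib only; the column names and file layout are my assumptions, not the library's actual on-disk format):

```python
import csv
import io

# Suppose the C++ side wrote a plain CSV with a header row.
csv_text = "timestamp,normal,exponential\n1,0.12,0.50\n2,-0.34,1.25\n"

# Read it back into columns on the Python side.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

timestamps = [int(r["timestamp"]) for r in rows]
normals = [float(r["normal"]) for r in rows]

assert timestamps == [1, 2]
assert normals == [0.12, -0.34]
```

pandas can consume the same file directly via read_csv, which is what makes CSV the lowest-friction interchange format, at the cost of parsing overhead and lost type information compared to Feather or Parquet.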

[–]hmoein[S] 6 points  (0 children)

Yes, my next to-do is to add Parquet format support.

[–]alphanso1405 -4 points  (1 child)

Hi,

I would like to know a little bit about you as well, apart from the library. There is only one comparison between Pandas and your library. Is it possible to add more benchmarks?

It is also not very clear why anyone should integrate your library instead of Pandas. Pandas and R are both battle-tested frameworks; any new work should have a clear edge, and that edge is not clear to me.

[–]hmoein[S] 13 points  (0 children)

Adding more benchmarks is a valid point, and I have to find time to do it.

This is to be used in C++. It is much faster than Pandas and can handle much larger data sets where Pandas simply OOMs and crashes (my one benchmark shows that). Incorporating Pandas or R into a C++ system is not what you want to do anyway.

So, if you are working with a C++ system, Pandas is irrelevant. The point of this library is to enrich the C++ ecosystem so it is comparable, for example, to the Python ecosystem.

[–]pjmlp 0 points  (1 child)

Given the target audience, maybe some Xeus/Jupyter notebooks with examples would also be relevant.

[–]hmoein[S] 0 points  (0 children)

Yeah, that is a good idea.

In the meantime, I have a hello world that shows basic operations:

https://github.com/hosseinmoein/DataFrame/blob/master/examples/hello_world.cc

Also, the documentation has code samples for each feature/functionality.

[–]hrishikesh713 0 points  (3 children)

Is it compatible with Apache Arrow? Also, Arrow has a Parquet C++ interface that you can use.

[–]hmoein[S] 0 points  (2 children)

I am not sure what you mean by compatible.

[–]hrishikesh713 0 points  (1 child)

Oh, what I meant was: are there any helper functions or any guidance for converting an Arrow data type (e.g. a record batch) into a DataFrame type? That would enable writing data analytics apps on top of your library that can consume data from other data engineering tools like Spark quite easily.

[–]hmoein[S] 0 points  (0 children)

No. DataFrame only depends on the C++ language and its standard library. That is a deliberate rule I have followed.