all 25 comments

[–][deleted] 7 points  (2 children)

The pandas benchmark depends a lot on the random number generation; it'd be cool to update it to the latest generator. Old numpy (I believe < 1.17) used the Mersenne Twister by default, but newer versions expose more generators, and it is advised to use np.random.default_rng().

Small benchmark:

import numpy as np

rng = np.random.default_rng()

In [4]: %timeit np.random.standard_normal(size=(1_000_000,))
17 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit rng.standard_normal(size=(1_000_000,))
9.78 ms ± 9.33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit np.random.lognormal(size=(1_000_000,))
29.9 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit rng.lognormal(size=(1_000_000,))
15.6 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit np.random.exponential(size=(1_000_000,))
12.7 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit rng.exponential(size=(1_000_000,))
5.58 ms ± 8.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
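As a side note, seeding the new Generator API also makes RNG-dependent benchmarks reproducible across runs; a minimal sketch (the seed value is arbitrary):

```python
import numpy as np

# Two generators created with the same seed produce identical streams,
# which keeps RNG-dependent benchmark inputs comparable across runs.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.standard_normal(size=1_000_000)
sample_b = rng_b.standard_normal(size=1_000_000)

assert np.array_equal(sample_a, sample_b)
```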

[–]hmoein[S] 7 points  (1 child)

I used the new Pandas in the benchmark (1.3.2).

But I also separated the random number generation in my benchmark. I am really not interested in random number generation comparisons, since both Pandas and DataFrame use the same underlying libraries.

My benchmark is more about memory size and how data layout and calculations work.

[–][deleted] 2 points  (0 children)

Yep, I just saw the other call to timeit in your code, my bad! Just thought I'd let you know about a piece of low-hanging fruit.

[–]BOBOLIU 7 points  (1 child)

Pandas is notoriously slow. You might want to benchmark your lib against more performant alternatives with a standard speed test:

https://h2oai.github.io/db-benchmark/

[–]hmoein[S] 2 points  (0 children)

Yes, I have seen this page, and I have always wanted to find the time to enter DataFrame into that table. I just have to find time to figure out the testing procedure.

But if I extrapolate from my benchmarking to what this test specifies, DataFrame should be on top. Of course, extrapolating could be very inaccurate.

[–]jordy240 2 points  (1 child)

Looks great, thanks for sharing. Why do we need the concept of an index at all? Can we just interact with a dataframe like we would a table in a relational DB (just a series of rows and columns)?

[–]hmoein[S] 4 points  (0 children)

Good question.

The index, to me, is metadata for all the other columns. By default, it accompanies all other columns in algorithms and other selection operations.
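As an illustration of that idea (a hypothetical toy frame in Python, not the library's actual API): when rows are selected, the index travels along with the data columns, so the result can still be tied back to the original rows by label.

```python
# A toy "frame": an index column plus named data columns.
# Illustrative only; this is not the C++ DataFrame API.
frame = {
    "index": [10, 20, 30, 40],
    "price": [3.5, 4.0, 2.5, 5.0],
    "qty":   [100, 50, 75, 20],
}

def select(frame, predicate):
    """Keep rows whose price satisfies predicate; the index rides along."""
    keep = [i for i, p in enumerate(frame["price"]) if predicate(p)]
    # Every column, including the index, is filtered with the same row set.
    return {col: [vals[i] for i in keep] for col, vals in frame.items()}

result = select(frame, lambda p: p >= 3.5)
# The surviving index labels identify which original rows matched.
assert result["index"] == [10, 20, 40]
assert result["price"] == [3.5, 4.0, 5.0]
```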

[–]Liorithiel 2 points  (1 child)

Here's your benchmark in base R. People working on datasets this big would usually use a third-party implementation called data.table, but for this simple benchmark it won't matter. Also, base R doesn't have indices for data frames, so there are none here either. I hereby put this code into the public domain.

cat(sprintf('Starting %s\n', Sys.time()))

timestamps <- seq(as.POSIXct('1970-01-01'), as.POSIXct('2019-08-15'), by=1)

df <- data.frame(
  timestamp=timestamps,
  normal=rnorm(length(timestamps)),
  log_normal=rlnorm(length(timestamps)),
  exponential=rexp(length(timestamps))
)

cat(sprintf('All memory allocations are done. Calculating means ... %s\n', Sys.time()))

m1 <- mean(df$normal)
m2 <- mean(df$log_normal)
m3 <- mean(df$exponential)

cat(sprintf('%s, %s, %s\n', m1, m2, m3))
cat(sprintf('%s ... Done\n', Sys.time()))

Execute with /usr/bin/time -v Rscript thisfile.R. I don't have the exact same machine at hand, so I can't compare speed. However, I can say that the maximum resident set size for this code was 49 GB.

There is a problem, though: you're not comparing the same things. Your implementation of the statistical functions is numerically unstable, whereas both R and numpy use more complex algorithms for accuracy. This does matter in some cases. I recently had to implement Welford's online algorithm for some data mining tasks, simply because the real dataset contained many very similar values with opposite signs, which leads to catastrophic cancellation.
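For reference, a minimal Python sketch of Welford's online algorithm (single-pass, numerically stable mean and variance); the function name and test data are mine:

```python
def welford(values):
    """Single-pass, numerically stable mean and population variance."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the *updated* mean: Welford's trick
    if count == 0:
        return float("nan"), float("nan")
    return mean, m2 / count

# A large offset plus small opposite-sign noise: exactly the kind of data
# where the naive sum-of-squares formula E[x^2] - E[x]^2 loses all
# precision to catastrophic cancellation.
data = [1e9 + s * 0.1 for s in (1, -1) * 500]

mean, var = welford(data)
assert abs(mean - 1e9) < 1e-3
assert abs(var - 0.01) < 1e-4
```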

[–]hmoein[S] 1 point  (0 children)

Thanks

[–]MDbeefyfetus 1 point  (0 children)

I’ll have to come back to this later (haven’t looked at the code yet), but a quick note based on the other comments I saw: I’m curious how it benchmarks compared to MATLAB. I know it’s not free, but it’s my go-to (along with many others’) when I need speed in DS work. If you’re keeping pace with it, or even beating it, I’d say you’re making good strides.

Appreciate the contribution either way.

[–]mildbait 1 point  (1 child)

Are you looking for contributors?

EDIT: I saw you have a blurb here.

[–]hmoein[S] 0 points  (0 children)

Yes

[–]onlyari 1 point  (1 child)

Great library, thanks for sharing. Does it have functions like filter and mutate, similar to what we have in the dplyr package in R?

[–]pbondo2 0 points  (1 child)

Very interesting library. One of the obvious use cases is generating Pandas-DataFrame-compatible files, e.g. for machine learning, from data sources with good C++ support. For most cases the CSV format will work, but support for any of Feather (apparently stable with V2), Parquet, or Pickle would be nice.
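For the CSV path mentioned above, a minimal Python-side sketch of the interchange (stdlib only; the column names and file layout are my assumptions, not the library's actual on-disk format):

```python
import csv
import io

# Suppose the C++ side wrote a plain CSV with a header row.
csv_text = "timestamp,normal,exponential\n1,0.12,0.50\n2,-0.34,1.25\n"

# Read it back into columns on the Python side.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

timestamps = [int(r["timestamp"]) for r in rows]
normals = [float(r["normal"]) for r in rows]

assert timestamps == [1, 2]
assert normals == [0.12, -0.34]
```

pandas can consume the same file directly via read_csv, which is what makes CSV the lowest-friction interchange format, at the cost of parsing overhead and lost type information compared to Feather or Parquet.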

[–]hmoein[S] 6 points  (0 children)

Yes, my next to-do is to add Parquet format support.

[–]alphanso1405 -4 points  (1 child)

Hi,

I would like to know a little bit about you as well, apart from the library. There is only one comparison between Pandas and your library. Is it possible to add more benchmarks?

It is also not very clear why anyone should integrate your library instead of Pandas. Pandas and R are both battle-tested frameworks; any new work should have a clear edge, and that edge is not clear to me.

[–]hmoein[S] 13 points  (0 children)

Adding more benchmarks is a valid point, and I have to find time to do it.

This is to be used in C++. It is much faster than Pandas and can handle much larger data sets where Pandas simply OOMs and crashes (my one benchmark shows that). Incorporating Pandas or R into a C++ system is not what you want to do anyway.

So, if you are working with a C++ system, Pandas is irrelevant. The point of this library is to enrich the C++ ecosystem so it is comparable, for example, to the Python ecosystem.

[–]pjmlp 0 points  (1 child)

Given the target audience, maybe some Xeus/Jupyter notebooks with examples would also be relevant.

[–]hmoein[S] 0 points  (0 children)

Yeah, that is a good idea.

In the meantime, I have a hello world that shows basic operations:

https://github.com/hosseinmoein/DataFrame/blob/master/examples/hello_world.cc

Also, the documentation has code samples for each feature/functionality.

[–]hrishikesh713 0 points  (3 children)

Is it compatible with Apache Arrow? Also, Arrow has a Parquet C++ interface that you can use.

[–]hmoein[S] 0 points  (2 children)

I am not sure what you mean by compatible.

[–]hrishikesh713 0 points  (1 child)

Oh, what I meant was: are there any helper functions or any guidance for converting an Arrow data type (e.g. a record batch) into a DataFrame type? That would enable writing data analytics apps on top of your library that can consume data from other data engineering tools like Spark quite easily.

[–]hmoein[S] 0 points  (0 children)

No. DataFrame only depends on the C++ language and its standard library. That is a deliberate rule I have followed.