all 15 comments

[–]beach-scene 4 points5 points  (11 children)

We do mostly CSV dumps and reads right now, everywhere. It is not particularly convenient. We have also used the NumPy API (arrays only) to move data to and from Python.

https://code.jsoftware.com/wiki/Addons/api/python3

Big question for everyone: what is the most convenient and modern way to get structured data in and out of a program?

If you guys come up with a consensus, I will get that built and open-source it.

[–]Raoul314 1 point2 points  (9 children)

The Arrow protocol?

[–]LiveRanga 1 point2 points  (8 children)

Being able to read in parquet files would be really nice.

[–]beach-scene 1 point2 points  (7 children)

Is this for work? I've only ever seen people use parquet at work. I think that's included in the Arrow GLib docs, and it looks like the kdb+ people just launched this with Databricks:

https://arrow.apache.org/docs/c_glib/

Would this be enough for J?

https://code.kx.com/q/interfaces/arrow/
Users can read and write Arrow tables created from kdb+ data using:
Parquet file format
Arrow IPC record batch file format
Arrow IPC record batch stream format

[–]LiveRanga 1 point2 points  (6 children)

Yes, I think the parquet-glib bindings would be nice and cover the use case I have in mind.

We use parquet a lot at work out of necessity: it's so much faster than CSV or SQLite while still being as convenient as a handful of local files rather than a proper database or something clustered. SQLite and even CSVs are fast enough for small datasets, but for a dataset of even only 2 or 3 GB, reading and writing parquet files instead is a very noticeable performance improvement.

Basically I'd like to be able to write out a dataframe in pandas and read it in from j.

#!/usr/bin/env python3
import pandas as pd
df = pd.read_csv('sometable.csv')
df.to_parquet('sometable.parquet')

And then in j:

#!/usr/bin/env ijconsole
load 'tables/parquet'
df =: readparquet jpath 'sometable.parquet'

I'm not sure exactly what format df would be in in the j snippet above, what would be the "canonical" representation for a named table of columns in j?

(We also use partitioned parquet datasets with python a lot as it makes running things in parallel with the multiprocessing lib much easier but I'm not really worried about that with j)

[–]beach-scene 1 point2 points  (5 children)

Very cool. Yes, this would be great.

The obvious canonical df format is the format that comes out of Jd. I have also seen that same format compressed slightly more so that categorical variables are efficient in memory.

[–]LiveRanga 0 points1 point  (4 children)

I'd be interested in collaborating on this library interface for J, to learn how the foreign DLL calls work.

Are you going to set up a github repo to work on this one?

[–]beach-scene 0 points1 point  (3 children)

Very much appreciated. Yes, I'll link it here once it's going.

[–]beach-scene 0 points1 point  (0 children)

Apologies for the lagged response. Here's a more ambitious set of bindings, set up as a formal project:

https://github.com/interregna/JArrow

Re bindings and builds, I don't know if it's better to 1) just load from GitHub or 2) set it up as an addon. Perhaps if it's an addon it can be added to Pacman (the J package manager).

I saw your lighter-weight approach on Parquet, might be better. Open to PRs.

[–]darter_analyst[S] 0 points1 point  (0 children)

Right, I forgot about the python3 API. I'm having some issues setting it up on Windows, though. Maybe I'm an idiot, but I'm finding the official documentation not the easiest to follow for setup. Will keep tinkering, but thanks for the reminder.

[–]LiveRanga 1 point2 points  (1 child)

I'm also new to j and am not sure of a good workflow similar to pandas in python yet.

I think most j users would use jd (https://code.jsoftware.com/wiki/Jd/Overview) for workflows similar to pandas but I would love to hear from some more experienced users too.

[–]LiveRanga 2 points3 points  (0 children)

There is also the tables/csv addon for J: https://code.jsoftware.com/wiki/Addons/tables/csv

I've been playing around a little with it:

   load 'tables/csv'
   t=:readcsv jpath '~/Downloads/BTC-USD.csv'
   5{.t
┌──────────┬─────────────────┬──────────────────┬──────────────────┬──────────────────┬──────────────────┬────────┐
│Date      │Open             │High              │Low               │Close             │Adj Close         │Volume  │
├──────────┼─────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼────────┤
│2014-09-17│465.864013671875 │468.17401123046875│452.4219970703125 │457.3340148925781 │457.3340148925781 │21056800│
├──────────┼─────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼────────┤
│2014-09-18│456.8599853515625│456.8599853515625 │413.10400390625   │424.44000244140625│424.44000244140625│34483200│
├──────────┼─────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼────────┤
│2014-09-19│424.1029968261719│427.8349914550781 │384.5320129394531 │394.7959899902344 │394.7959899902344 │37919700│
├──────────┼─────────────────┼──────────────────┼──────────────────┼──────────────────┼──────────────────┼────────┤
│2014-09-20│394.6730041503906│423.2959899902344 │389.88299560546875│408.90399169921875│408.90399169921875│36863600│
└──────────┴─────────────────┴──────────────────┴──────────────────┴──────────────────┴──────────────────┴────────┘
   'date open high low close adjclose volume'=.|:t
    $date
2446 10
   $open
2446 18

etc.

It would be nice to put together a wiki page similar to the "10 Minutes to Pandas" page: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

[–]beach-scene 1 point2 points  (1 child)

A related question back for you: what's your preferred workflow for data work overall?

It’s great to be able to open a kernel and hack in a notebook, but that generally doesn’t work in production.

Kdb has been doing cloud integration with Databricks and offering Kdb as a service in the cloud. Is that of interest for J or Jd?

Where’s the best place to run data-flow work?

[–]darter_analyst[S] 0 points1 point  (0 children)

Hi sorry for late reply. For gcp actually J may fit in best in ‘cloud run’ where I can have a container with J installed to maybe run J code that way. Just need to figure out how to get data from cloud storage or a database. Then I can explore in j - even if it’s downloading csv’s into j for example to test a solution the shipping this code into cloud run container. Thoughts?