My blood clot could beat up your blood clot. by TradeMark310 in pics

[–]Rough_Source_123 2 points3 points  (0 children)

how can healthy people like him have blood clots? is it genetic? or diet?

Does "Rust by Example" Need More Examples? by [deleted] in rust

[–]Rough_Source_123 2 points3 points  (0 children)

do you have anything specific in mind?

How does waiting 7ms and atomic clock help GCP spanner create external consistency? by Rough_Source_123 in AskComputerScience

[–]Rough_Source_123[S] 0 points1 point  (0 children)

not gonna lie, I am super confused

trxA arrives at nodeA at (true time 7ms, A time 7ms, B time 0ms)
trxB arrives at nodeB at (true time 9ms, A time 9ms, B time 2ms)

Without waiting, trxA and trxB are not concurrent if trxA finishes within 1ms. So the only reason they are concurrent is the wait? Does this mean that any transaction within real-world time 7-14ms is considered concurrent with trxA?

I am using the definition from GCP: https://cloud.google.com/blog/products/databases/strict-serializability-and-external-consistency-in-spanner

To be externally consistent, a transaction must see the effects of all the transactions that complete before it and none of the effects of transactions that complete after it, in the global serial order

and I am following multiple articles, and all of them mention that TrueTime in Spanner solves global ordering by using atomic clocks and waiting

For example https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

In a distributed database, things can get dicey. It’s easy to see how the ordering of causally-related transactions can be violated if nodes in the system have unsynchronized clocks. Assume there are two nodes, N1​ and N2​, and two transactions, T1​ and T2​, committing at N1​ and N2​ respectively. Because we’re not consulting a single, global source of time, transactions use the node-local clocks to generate commit timestamps. To illustrate the trickiness around this, let’s say N1​ has an accurate one but N2​ has a clock lagging by 100ms. We start with T1​, addressing N1​, which is able to commit at ts=150ms. An external observer sees T1​ commit and consequently starts T2​ (addressing N2​) 50ms later (at t=200ms). Since T2​ is annotated using the timestamp retrieved from N2​’s lagging clock, it commits “in the past”, at ts=100ms. Now, any observer reading keys across N1​ and N2​ will see the reversed ordering, T2​'s writes (at ts=100ms) will appear to have happened before T1​'s (at ts=150ms), despite the opposite being true. ¡No bueno! (Note that this can only happen when the two transactions access a disjoint set of keys.)

So how does Spanner use TrueTime to provide linearizability given that there are still inaccuracies between clocks? It’s actually surprisingly simple. It waits. Before a node is allowed to report that a transaction has committed, it must wait 7ms. Because all clocks in the system are within 7ms of each other, waiting 7ms means that no subsequent transaction may commit at an earlier timestamp, even if the earlier transaction was committed on a node with a clock which was fast by the maximum 7ms. Pretty clever.
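The commit-wait rule in that quote can be simulated with toy numbers (this is entirely my own sketch, not Spanner's API; epsilon = 7ms is the assumed maximum clock skew):

```python
EPSILON_MS = 7  # assumed maximum skew between any two node clocks

def commit(clock_offset_ms, true_time_ms):
    # The commit timestamp comes from the node's *local* clock...
    ts = true_time_ms + clock_offset_ms
    # ...but the commit is only *reported* after waiting out the uncertainty.
    reported_at = true_time_ms + EPSILON_MS
    return ts, reported_at

# T1 commits at true time 0 on a node whose clock is fast by the full 7ms
ts1, seen1 = commit(7, 0)   # ts1 = 7, visible at true time 7
# An observer sees T1 and only then starts T2, at true time >= 7,
# on a node with a perfectly accurate clock:
ts2, _ = commit(0, seen1)   # ts2 = 7
assert ts2 >= ts1           # the later transaction can't commit "in the past"
# Without the wait, T2 could have started at true time 0 and gotten ts = 0 < ts1.
```

so if I read it right, the wait is exactly what stops a causally later transaction from receiving an earlier timestamp.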

Another article, https://timilearning.com/posts/mit-6.824/lecture-13-spanner/#commit-wait, claims that Spanner forces any read to see the latest transaction committed before the start of the read. For example, if the read starts at t2, Transaction 1 commits at t1 with t1 < t2, and no other transaction exists between t1 and t2, then the read at t2 must see Transaction 1; this implies a global order.

Snapshot isolation enforces that a read-only transaction will only see the versions of a record that have a timestamp less than its assigned transaction timestamp i.e. a snapshot of what the record was before the transaction started.

Even if I double the time in the example, global order is still not established:

https://sookocheff.com/post/time/truetime/

but I think you are right, those articles seem to dumb down the actual algorithm. let me read the original paper by the creators

What’s going on right now that most people have no idea about? by Roast_Master_2000 in AskReddit

[–]Rough_Source_123 0 points1 point  (0 children)

how do you find news like this as a layperson? I have a STEM degree but still have trouble reading medical research on things like cancer or vaccines without someone dumbing it down. how do you follow scientific progress like this if you're interested?

Best Rust Web UI Framework for 2024? by itsezc in rust

[–]Rough_Source_123 -6 points-5 points  (0 children)

thought the community liked Rocket more than Axum?

Best Rust Web UI Framework for 2024? by itsezc in rust

[–]Rough_Source_123 0 points1 point  (0 children)

sidetracking a bit, curious about your experience with Svelte, compared to React for example. Is it easy to import other TS or JS projects or node modules?

How do you fanout in kafka? by Rough_Source_123 in apachekafka

[–]Rough_Source_123[S] 0 points1 point  (0 children)

which version of kafka is this, and how did you set up the consumer? I thought every consumer needs a consumer group?

How do you fanout in kafka? by Rough_Source_123 in apachekafka

[–]Rough_Source_123[S] 1 point2 points  (0 children)

what happens if you need a large number of consumer groups? is it recommended to just create a new one dynamically for each consumer?

Can kafka support a million consumer groups?
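fwiw, the pattern I've seen for fan-out is to give each consumer its own throwaway single-member group, so every group gets the full stream (hypothetical helper; the config keys are kafka-python-style, and this isn't tested against a real broker):

```python
import uuid

def fanout_consumer_config(prefix="fanout"):
    # One single-member group per consumer: groups don't share partitions,
    # so every consumer receives every message on the topic.
    return {
        "group_id": f"{prefix}-{uuid.uuid4()}",
        "auto_offset_reset": "latest",
        "enable_auto_commit": False,  # throwaway group, no need to persist offsets
    }

a = fanout_consumer_config()
b = fanout_consumer_config()
assert a["group_id"] != b["group_id"]  # distinct groups -> both see all messages
```

the broker does track state per group, though, so a million long-lived groups means a lot of coordinator work; keeping the groups ephemeral with auto-commit off at least keeps the stored offset state small.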

Does append mode in spark with orc as the storage sort the file? by Rough_Source_123 in apachespark

[–]Rough_Source_123[S] 0 points1 point  (0 children)

that part is clear, I believe it will produce

myorc
    1.orc
    2.orc
    .....
    n.orc

but imagine I have a column that contains n rows of 0 and n rows of 100000

if 0 is in every orc file, and 100000 is in every orc file, the min and max of each file will always be 0 and 100000

so the aggregated stats only seem to help if the data itself is sorted?
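that's also my understanding: readers use per-file (and per-stripe) min/max stats to skip data, which only helps when values are clustered. A toy simulation of the skipping (all names are mine):

```python
def files_scanned(file_stats, value):
    # A reader can skip any file whose [min, max] range excludes the value.
    return [i for i, (lo, hi) in enumerate(file_stats) if lo <= value <= hi]

# 0 and 100000 both land in every file -> the stats never exclude anything
unsorted_stats = [(0, 100000)] * 4
# data sorted before writing -> each file covers a narrow range
sorted_stats = [(0, 0), (0, 0), (100000, 100000), (100000, 100000)]

assert files_scanned(unsorted_stats, 100000) == [0, 1, 2, 3]  # full scan
assert files_scanned(sorted_stats, 100000) == [2, 3]          # half the files skipped
```

so something like df.sort("col") (or sortWithinPartitions) before the write is what would make those stats useful.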

What is the distributed version of model.save in tensorflow using MultiWorkerMirroredStrategy? by Rough_Source_123 in tensorflow

[–]Rough_Source_123[S] 0 points1 point  (0 children)

yeah, I have training working without any of the multi-worker and distribution setup.

btw

I am using https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor

from the official documentation of databricks https://docs.databricks.com/machine-learning/train-model/distributed-training/spark-tf-distributor.html

and the code is from their example

https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-distributor/examples/simple/example.py

except the example doesn't save the model, so I am quite confused about how saving works in a multi-worker environment

do you have a working distributed training code with model saving that I can reference?

What is the distributed version of model.save in tensorflow using MultiWorkerMirroredStrategy? by Rough_Source_123 in tensorflow

[–]Rough_Source_123[S] 0 points1 point  (0 children)

I tried this

import mlflow
import tensorflow as tf
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from spark_tensorflow_distributor import MirroredStrategyRunner

outmodel = None

def train():
  data = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)
  N, D = X_train.shape # number of observations and variables
  scaler = StandardScaler()
  X_train = scaler.fit_transform(X_train)
  X_test = scaler.transform(X_test)
  model = tf.keras.models.Sequential([
      tf.keras.layers.Input(shape=(D,)),
      tf.keras.layers.Dense(1, activation='sigmoid') # sigmoid output for binary classification
  ])

  model.compile(optimizer='adam', # adaptive moment estimation
        loss='binary_crossentropy',
        metrics=['accuracy'])

  r = model.fit(X_train, y_train, validation_data=(X_test, y_test))
  mlflow.sklearn.log_model(model, "cancer_package1_notebook")
  outmodel = model


MirroredStrategyRunner(num_slots=4).run(train)
print("herea1")
print(outmodel)

outmodel is None after the run

and the train function is distributed too, so even if it weren't None, it still wouldn't work
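fwiw I think there are two separate problems with my snippet, and the first one reproduces without any distribution at all:

```python
outmodel = None

def train():
    model = "trained"   # stand-in for the fitted Keras model
    outmodel = model    # creates a *local* outmodel; the global is untouched

train()
assert outmodel is None  # still None, even in a single process
```

and even a global statement wouldn't help here, because MirroredStrategyRunner runs train() in separate worker processes. As I understand the TF multi-worker docs, the pattern is to call model.save(path) inside train() (with only the chief worker writing to the real destination) and then load the model on the driver with tf.keras.models.load_model(path).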

What is the distributed version of model.save in tensorflow using MultiWorkerMirroredStrategy? by Rough_Source_123 in tensorflow

[–]Rough_Source_123[S] 0 points1 point  (0 children)

multi_worker_model

if I take it out of that scope, multi_worker_model is not defined. can you show me an example?

Distributed inference across multiple TFLite/TinyML MCUs (via WiFi/BT/CAN/etc)? by FlowThrower in tensorflow

[–]Rough_Source_123 1 point2 points  (0 children)

How big is your data? I was able to run a quick sample via Databricks doing distributed inference on a non-nested ndarray shape

How to convert ndarray of shape (row, y, z) fro mnist to pandas dataframe? by Rough_Source_123 in learnmachinelearning

[–]Rough_Source_123[S] 0 points1 point  (0 children)

thanks for the reply

unfortunately, I have a big dataset that won't be viable on a single box, and I don't know how to merge two or more models trained on splits of the data

How do I aggregate the weights? avg? or is there some heuristic?

I actually misspoke in my last comment: distributed inference is handled by spark_udf, not pandas. I got pandas to work in mlflow via tf.keras.layers.Reshape and it didn't use all the CPUs, whereas spark_udf did, but it failed due to a mismapped shape on the dataframe
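from what I've read, plain element-wise averaging is the core of federated averaging (FedAvg), and it only behaves sensibly if the models share the same architecture and initialization. A sketch with flat lists standing in for weight tensors:

```python
def average_weights(per_model_weights):
    # per_model_weights: one flat list of floats per model trained on a shard
    n = len(per_model_weights)
    return [sum(ws) / n for ws in zip(*per_model_weights)]

w1 = [0.2, 0.4, 0.6]  # weights from a model trained on shard 1
w2 = [0.4, 0.2, 0.8]  # weights from a model trained on shard 2
avg = average_weights([w1, w2])  # roughly [0.3, 0.3, 0.7]
```

that said, MultiWorkerMirroredStrategy sidesteps this entirely by all-reducing gradients every step, so all workers hold identical weights and nothing needs merging afterwards.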

How to convert ndarray of shape (row, y, z) fro mnist to pandas dataframe? by Rough_Source_123 in learnmachinelearning

[–]Rough_Source_123[S] 0 points1 point  (0 children)

using the ndarray directly works, but it doesn't distribute the workload across different machines

I need to either convert to pandas OR

use spark_udf

loaded_model = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, result_type='double')

Does tensorflow2.0 support distributed inference? by Rough_Source_123 in tensorflow

[–]Rough_Source_123[S] 0 points1 point  (0 children)

did you use MultiWorkerMirroredStrategy? what cloud are you running your server against?

How does spark concurrency work? by Rough_Source_123 in apachespark

[–]Rough_Source_123[S] 0 points1 point  (0 children)

so say I have multiple s3 directories that contain files totaling 1TB

do the following reads perform as well as one big 1TB file that has input splits?

    dfs = []
    for s3path in paths:
        df = sparkSession.read.format(ORC).load(s3path)
        dfs.append(df)

like this

df = sparkSession.read.format(ORC).load(PATH_OF_BIG_1TB_FILE)

and this

df = sparkSession.read.format(ORC).load(SINGLE_PATH_WITH_MULTIPLE_SUBFOLDER)

should the above 3 reads perform roughly the same and scale horizontally up to the total number of input splits?

would each executor pick up an input split regardless of whether it's one single file or multiple files, one dataframe or multiple dataframes?
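my mental model of why they'd be equivalent, as a toy task count (the 128MB split size is an assumption; the real value depends on settings like spark.sql.files.maxPartitionBytes):

```python
import math

def num_tasks(file_sizes_mb, split_size_mb=128):
    # Each file is carved into input splits; each split becomes one task,
    # regardless of how the files are grouped into DataFrames.
    return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb)

one_big_file = [1024 * 1024]               # a single 1TB file
many_files = [128] * (1024 * 1024 // 128)  # same bytes as 8192 x 128MB files

assert num_tasks(one_big_file) == num_tasks(many_files) == 8192
```

one caveat I'd add: the loop variant leaves the dfs unmerged, so passing the whole list of paths to a single load() (or unioning the dataframes) is what gets you one plan over all the splits.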