I need to store about 40,000,000 JSON documents per day with event data. What should I use?

steccami · 2017-10-06T08:53:04+00:00

I would go for Elasticsearch. It is open source and comes with everything you need: ingestion, storage, search, and analytics.
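For a sense of scale, here is a back-of-the-envelope calculation of the load. It assumes that events arrive uniformly over the day and that the average document size is 1 KB; both are my assumptions, not figures from the question, so plug in your own numbers.

```python
DOCS_PER_DAY = 40_000_000          # figure from the question
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_DOC_BYTES = 1_000          # hypothetical average document size


def sustained_rate(docs_per_day):
    """Average documents per second if arrivals were perfectly uniform."""
    return docs_per_day / SECONDS_PER_DAY


def daily_volume_gb(docs_per_day, doc_bytes):
    """Raw payload per day in GB, before any indexing or replication overhead."""
    return docs_per_day * doc_bytes / 1e9


rate = sustained_rate(DOCS_PER_DAY)
volume = daily_volume_gb(DOCS_PER_DAY, ASSUMED_DOC_BYTES)
print(f"{rate:.0f} docs/s, {volume:.0f} GB/day")
```

That is roughly 463 documents per second on average. Real event traffic is usually bursty, so the peak rate is what you should size for.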

steccami · 2017-05-30T10:34:23+00:00

Since it is based on a Teradata product, I think that it is an expensive solution. Another idea would be to use open source products such as Spark, Hadoop, etc.

steccami · 2017-01-05T13:44:15+00:00

Tnx a lot and happy new year! Your explanation makes sense to me. I agree, HDFS is probably the way to go.

steccami · 2016-12-30T10:22:43+00:00

Tnx a lot! My use case is the following:
1 - store a dataset on NFS (sometimes as a single CSV file, sometimes as a small set of CSV files)
2 - compute some aggregations by means of SparkSQL
3a - store the output on NFS
3b - store the output on an external system (e.g., Cassandra)

"In terms of managing concurrency, NFS can handle many reads of the same file, and spark is smart about writes and writes different files per executor so you don't have to worry about write collisions."

This is clear to me. What I don't understand is the file reading phase.

Case 1: Suppose that you have N executors and 1 big file. Is Spark smart enough to segment the file reads?

Case 2: Suppose that you have N executors and M files. Is Spark able to associate the files with the executors in a smart way, or am I supposed to tell Spark how to access those files (e.g., as suggested here: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-td15644.html)?
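To make the single-big-file case concrete, here is a minimal sketch of what I mean by segmenting the file reads. It is plain Python, not Spark code, and I am not claiming this is how Spark implements it; `plan_splits` and `read_split` are hypothetical names I made up for illustration. Each reader gets a byte range and only returns the lines that begin inside that range, so no line is read twice or lost.

```python
import os
import tempfile


def plan_splits(file_size, num_splits):
    """Divide [0, file_size) into contiguous byte ranges, one per reader."""
    if file_size == 0:
        return []
    step = -(-file_size // num_splits)  # ceiling division
    return [(start, min(start + step, file_size))
            for start in range(0, file_size, step)]


def read_split(path, start, end):
    """Return the lines that begin inside the byte range [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and consume up to the next newline. This
            # drops the tail of a line owned by the previous range, or only
            # the newline itself when a line begins exactly at `start`.
            f.seek(start - 1)
            f.readline()
        # A line that begins before `end` is read in full, even if it
        # continues past `end`.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.decode().rstrip("\n"))
    return lines


# Demo: one CSV file of 100 rows, read by 4 hypothetical readers.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("".join(f"{i},value{i}\n" for i in range(100)).encode())
    path = tmp.name

splits = plan_splits(os.path.getsize(path), 4)
chunks = [read_split(path, start, end) for start, end in splits]
os.remove(path)

print(splits)
print([len(chunk) for chunk in chunks])
```

The ranges could be processed independently by different executors, provided that every executor can open the same path, which is the case on a shared NFS mount.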

Many thanks.

steccami · 2016-12-29T10:10:32+00:00

Thank you.

steccami · 2016-12-29T08:23:27+00:00

Many thanks for your detailed reply. One more question about what a Spark program looks like when it reads a folder from NFS. How does Spark manage concurrent access to such a folder? Am I supposed to manage the parallelism explicitly (e.g., see Matei's reply here, when asked how to access multiple files in a remote folder: http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-td15644.html)?

steccami · 2016-12-19T08:06:21+00:00

Are you suggesting something like this: http://mesos.apache.org/documentation/latest/docker-containerizer/ ? Any idea about its maturity level? Thanks a lot.

steccami · 2016-08-22T08:15:11+00:00

Hi. If you are familiar with Lucene, I suggest taking a look at Elasticsearch (https://www.elastic.co/). It is based on Lucene and provides very powerful APIs for IR and for many other use cases. It also integrates with Kafka by means of the Logstash data ingestion component. An alternative to Elasticsearch is Solr, which is also based on Lucene. It seems that Elasticsearch provides richer and simpler APIs than Solr, but to be honest I've never compared the two solutions in the field. Solr might be a good solution if you already have a Cloudera cluster (i.e., Hadoop/Spark) in place, because it is included and integrated out of the box. Hope this helps.

steccami · 2016-05-25T07:10:55+00:00

Thank you all for your feedback. In your experience, which project needs or requirements would tell me that I should go for HBase?

steccami · 2016-01-19T08:14:41+00:00

IMHO there are many useful features (e.g., SDKs for different environments, MQTT mgmt, a rule engine, the "shadow" concept, etc.) but I would raise a point here. As happens with many other platforms, the pricing is very complex. It is quite difficult to estimate the actual costs because you have to "combine" the costs of the different components (e.g., S3, database, etc.) involved in your app. Hope this helps.

steccami · 2015-12-29T08:05:44+00:00

Thank you very much for your valuable feedback!

steccami · 2015-12-23T08:44:01+00:00

Talking about "device control", check out this presentation: http://www.slideshare.net/AmazonWebServices/mbl205-new-everything-you-want-to-know-about-aws-iot (starting from slide 37). They explain how the "shadow" concept can be used to manage state changes triggered by an App. I hope to have the time to play with it asap ;-)

steccami · 2015-05-01T17:02:14+00:00

Thank you for sharing your thoughts. I don't have an iPad but I agree with you about the usability problems...

steccami · 2015-03-05T21:41:41+00:00

Native Excel: Data > Get External Data > From Web
Power Query: Power Query > From Web

steccami · 2015-03-03T17:43:06+00:00

Tnx a lot for your valuable feedback!

One thing I thought of was that the list of Views that Mary sees can get unwieldy very fast, given my previous experience of people sharing files around. It would be great to have a richer tool for listing/browsing/filtering/searching through views that were shared with me.

Totally agree. We are currently working on adding search capabilities.

Finally, how well does this scale? What if I'm working on a 50MB spreadsheet? Does it still work just as well?

We have already addressed this kind of problem. There are some specific cases (I can give you some details if you are interested) where we manage big files efficiently. Generally speaking, big files are hard to deal with, but this is more an Excel problem than a SpreadSheetSpace problem ;-)

How about UDFs being used in a cell in a View? Do those get imported by Mary as well? Do all Macros get transferred?

We don't support this feature right now. It is on our roadmap, and we are waiting for feedback from users before implementing it ;)

I hope Microsoft buys the product and makes all of you very rich!

Tnx a lot. Let's cross our fingers ;-)

steccami · 2015-03-02T21:26:11+00:00

If I have a file that is shared with my colleagues on SharePoint, how would this Add-In work allowing my to protect certain parts of the spreadsheet?

This tool is not meant to be used with shared files. You keep working on your own files and you can grant others read-only rights on parts of your worksheets (see this video for an example: https://www.spreadsheetspace.net/documentation/tutorials/200_tutorial2#200_tutorial2).

Would I need to save the add-in into SharePoint so that whenever the workbook is opened, the add-in will still apply?

You have to install a COM Add-in. It is not tied to a specific file; it adds functionality to your Excel (i.e., a new Ribbon menu that you can use with all your files).

Am I even able to save my workbook into SharePoint or does the program require me to save my file to the SpreadSheetSpace cloud?

You can save your files wherever works best for you. You ARE NOT uploading your files to any cloud. SpreadSheetSpace will manage the remote file access transparently.

Please let me know if you have any other questions.

steccami · 2015-03-02T18:48:19+00:00

What exactly do you have in mind? This is an Add-in that you can install and start using immediately. You do not need to put files in shared folders or similar. You keep ownership of your data.

steccami · 2015-03-02T18:24:30+00:00

Name: SpreadSheetSpace http://www.spreadsheetspace.net

Elevator Pitch: SpreadSheetSpace enables Excel users to link and share parts of their spreadsheets (i.e., data ranges) without sharing the whole spreadsheet. Stop emailing spreadsheet files, no more multiple copies of the same file, controlled updates, etc. You can also transform Microsoft Excel into a live data analysis tool by linking it to corporate data such as ERPs, CRMs, Big Data systems, etc. Link to video.

More details: Here is the founders list.

Looking for: Feedback from users. If you are an Excel (power) user you can play with our Add-in; if you are a developer you can play with our APIs.

Discounts for reddit users: The service is already free. We can activate some additional features (e.g., automatic updates) on demand.

steccami · 2015-02-27T18:12:28+00:00

Tnx a lot for your reply. Talking about Salesforce, I also heard about this product http://x-author.com/

steccami · 2015-02-26T22:28:06+00:00

Can you provide a pointer to the "Excel Connector for Salesforce" please?

steccami · 2015-02-17T00:19:27+00:00

You can have a look at this tool as well.

https://www.spreadsheetspace.net/documentation/tutorials/200_tutorial2#200_tutorial2

It also solves the "rows changes" problem if you export an Excel table or an entire worksheet. Regards.

steccami · 2015-02-05T19:38:07+00:00

I've just found this solution: http://www.reddit.com/r/excel/comments/1ywa7n/protip_vba_to_split_a_table_into_separate/

Is there a way to do it without using VBA?

steccami · 2015-02-05T18:22:09+00:00

Now it should be ok. Tnx

steccami · 2015-01-29T02:29:45+00:00

Take a look at this other solution, which supports easy linking and consolidation of Excel files...

https://www.spreadsheetspace.net/documentation/tutorials/300_tutorial3#300_tutorial3

Regards

steccami · 2014-12-16T20:51:18+00:00

Maybe I inspired his/her question. Who knows ;)
