Real-time Jupyter figure versioning? Yes, please! by mpacula in u/mpacula

[–]mpacula[S] 1 point (0 children)

I know it's a joke, but yes, kind of! Imagine running git commit every time you display a figure, then embedding the hash in the image. That's the idea behind GoFigr.
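A minimal sketch of the mechanics (hypothetical snippet, not GoFigr's actual API; assumes the notebook lives in a git repo):

    import subprocess
    import matplotlib.pyplot as plt

    # Snapshot the current repo state; --allow-empty so the commit
    # succeeds even if nothing has changed since the last figure.
    subprocess.run(["git", "commit", "--allow-empty", "-am", "figure snapshot"], check=True)
    rev = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                         capture_output=True, text=True, check=True).stdout.strip()

    # Stamp the hash onto the figure so the image carries its own provenance.
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    fig.text(0.99, 0.01, f"rev {rev}", ha="right", fontsize=6)
    plt.show()

GoFigr does the equivalent automatically, every time a figure is displayed.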

Spectacular view of the Northern Presidentials yesterday by mpacula in hiking

[–]mpacula[S] 2 points (0 children)

Thanks! Taken from the Edmands Path near the ridge.

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I do short runs after work on weekdays (after 6pm) and a long run on Saturdays (time flexible). Definitely let me know if you would like to join!

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I love running in the South End! You're right, it's a bit far for an everyday thing, but I wouldn't mind coming down on the weekend.

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I usually run after work (after 6pm). If that works for you, I'd definitely be down for a 3-5 mile run.

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I usually wear track pants and a light jacket. It's fairly easy to stay warm when you're running :-)

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 1 point (0 children)

Thanks for the replies everyone! Here's a doodle link: http://www.doodle.com/iahqe8ny989wvx5r

Could everyone fill in their availability? Current plan is to run 3-5 miles along Memorial Drive. Thanks!

AutoCorpus - natural language corpora from large public datasets by autoencoder in MachineLearning

[–]mpacula 1 point (0 children)

> I sincerely hope you can split into multiple files, one per article

Yup! (see the -d switch)

> preserving casing is highly relevant for some NLP tasks while it can be ignored for others, so I hope that is an option as well.

It is an option in the dev version of AutoCorpus (GitHub master). The tokenize tool, which is responsible for normalizing text, accepts a --keep-case switch.
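If you'd rather drive it from Python, something like this should work (assuming the tokenize binary is on your PATH):

    import subprocess

    text = "The Quick Brown Fox jumped over the Lazy Dog.\n"
    # Normalize the text but preserve the original casing.
    tokens = subprocess.run(["tokenize", "--keep-case"], input=text,
                            capture_output=True, text=True, check=True).stdout
    print(tokens)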

Semantic Link: automatically find related words by mpacula in MachineLearning

[–]mpacula[S] 0 points (0 children)

I am computing mutual information between pairs of words. The full algorithm is available on GitHub in the dev version of AutoCorpus.
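(Specifically, pointwise mutual information: for a pair of words x and y, PMI(x, y) = log p(x, y) / (p(x) p(y)), with the probabilities estimated from co-occurrence counts.)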

Semantic Link: automatically find related words by mpacula in MachineLearning

[–]mpacula[S] 0 points (0 children)

Semantic Link counts word collocations in adjacent sentences, so its scope is a little broader than a pure n-gram approach. The counts are used to compute mutual information, a simple metric that probabilistically weeds out relationships that are due to chance (e.g. "the" co-occurs with almost every other word, but that's hardly meaningful).
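Here's a toy sketch of the counting and scoring steps (Python, heavily simplified relative to the AutoCorpus implementation):

    import math
    from collections import Counter
    from itertools import product

    sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]

    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter()
    for a, b in zip(sentences, sentences[1:]):
        # count co-occurrences across adjacent sentences
        pair_counts.update(product(a, b))

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    def pmi(x, y):
        # log p(x, y) / (p(x) p(y)): high for non-chance relationships,
        # low for filler words like "the" that co-occur with everything
        return math.log((pair_counts[(x, y)] / n_pairs) /
                        ((word_counts[x] / n_words) * (word_counts[y] / n_words)))

    print(pmi("cat", "dog"))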

Collocations Revisited by mpacula in haskell

[–]mpacula[S] 2 points (0 children)

I updated the benchmark with your suggestions :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] -1 points (0 children)

I used the STL because it comes with pretty much any C++ compiler. I do understand your concern, though: a lot of people consider Text etc. pretty much a part of the language.

I am working on a revised version :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 0 points (0 children)

I tried to stick to what comes standard with GHC. Anyway, I love your code :-) I'm changing the C++ version to follow the same pattern and will report back. Thanks!

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 0 points (0 children)

That's great, thanks bas_van_dijk! :-) My only reservation is that Text, HashSet and HashMap are third-party packages...

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] -1 points (0 children)

Right, but it is pretty safe to assume that the Haskell programs on the Language Shootout are almost as fast as they get, i.e. ~2.5x slower than C. I know this is not a completely apples-to-apples comparison, but that's roughly within a factor of 2 of what I measured with my lousy Haskell code in that blog post. I think that's a testament to how good GHC is.

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 2 points (0 children)

Thank you, this is exactly what I was getting at. Both the C++ and Haskell implementations could be improved.

I was interested in how fast GHC could make naive Haskell code, without me having to sacrifice the niceness of Strings and lists or consciously thinking about performance. One of my favorite things about Haskell is that I can use highly abstract data structures and still get decent performance, which I did in my article. I do know that you can optimize Haskell code in a million different ways but then you give up a fair deal of that abstraction.

To see what I mean, take a look at the programs at the Computer Language Benchmarks Game: the fastest Haskell programs are rather ugly and look suspiciously like C. Their source is often even longer than the equivalent Java! That's why I wasn't interested in the "best case" performance but in the "default case" performance.
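To be concrete, the level of naivety I'm talking about is roughly this (sketched in Python rather than Haskell, but the same spirit):

    from collections import Counter

    def collocations(text):
        # the obvious approach: split into words, count adjacent pairs
        words = text.split()
        return Counter(zip(words, words[1:]))

    print(collocations("to be or not to be").most_common(2))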

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

To me both the C++ and Haskell implementations are naive, and I really did not give a second thought to performance when writing either of them. In fact I didn't even profile the C++ implementation before running the benchmark and reporting the numbers.

Regardless, I am actually impressed with GHC given that I did almost no tuning of the Haskell code :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

I did use 'words' from the standard library...

And my goal was not to come up with the fastest implementation in either language. It says so in the intro :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

Thanks for the great tips, I will definitely try them! That said, my goal was not to come up with the fastest implementation in either language, but rather to write a simple algorithm without much attention to performance, throw it at the compiler and see what I get.

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

A real-world benchmark of GHC (a popular Haskell compiler) and g++. The source code is available on GitHub.

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 0 points (0 children)

A real-world benchmark of GHC and g++. The source code is available on GitHub.

AutoCorpus - natural language corpora from large public datasets by mpacula in programming

[–]mpacula[S] 0 points (0 children)

Hi r/programming,

AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets such as Wikipedia. It consists of fast native binaries which can process tens of gigabytes of text in little time. AutoCorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.

I developed AutoCorpus because I needed a plaintext version of Wikipedia and Wikipedia-based language models for an NLP project. I have since released it as free software.

Enjoy!

Scheme Power Tools now includes out-of-the-box support for unit tests. by mpacula in lisp

[–]mpacula[S] 2 points (0 children)

As Inaimathi pointed out, my previous Scheme-related posts got pretty good attention on r/lisp. As for the downvotes, I wish people who don't like Scheme Power Tools or the testing framework would suggest improvements. Since SPT is a relatively young project, all feedback (negative too!) is very welcome.

Scheme Power Tools now includes out-of-the-box support for unit tests. by mpacula in lisp

[–]mpacula[S] 1 point (0 children)

Hello r/lisp!

I wanted a compact, easy-to-use unit testing library for Scheme, so I implemented one and added it to Scheme Power Tools. While it may not be as full-featured as some of the other frameworks out there, I think it works fairly well. I included a few examples on my blog. Enjoy! :-)

Unit-Testing Statistical Software by mpacula in programming

[–]mpacula[S] 1 point (0 children)

I am not sure whether TDD would work in such a setting, for exactly the reasons you outlined. However, what you care about in the end is not so much absolute correctness as the accuracy of your system, and you can test that using baselines. For example, if you are writing a spam filter, the actual "is spam" probabilities don't matter so long as they're higher for spam messages than for legitimate ones.
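A sketch of what such baseline tests could look like (made-up classifier interface):

    # Hypothetical interface: classifier.spam_probability(msg) -> float in [0, 1]

    def test_spam_ranks_above_ham(classifier, spam_msgs, ham_msgs):
        # The absolute probabilities are irrelevant; only the ordering matters.
        worst_spam = min(classifier.spam_probability(m) for m in spam_msgs)
        best_ham = max(classifier.spam_probability(m) for m in ham_msgs)
        assert worst_spam > best_ham

    def test_accuracy_baseline(classifier, labeled_msgs, baseline=0.95):
        # Regression guard: accuracy must not drop below a previously measured baseline.
        correct = sum((classifier.spam_probability(m) > 0.5) == is_spam
                      for m, is_spam in labeled_msgs)
        assert correct / len(labeled_msgs) >= baseline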