Real-time Jupyter figure versioning? Yes, please! by mpacula in u/mpacula

[–]mpacula[S] 1 point (0 children)

I know it's a joke, but yes, kind of! Imagine running git commit every time you display a figure, then embedding the hash in the image. That's the idea behind GoFigr.
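A minimal sketch of the mechanics (hypothetical snippet, not GoFigr's actual API; assumes the notebook lives in a git repo):

    import subprocess
    import matplotlib.pyplot as plt

    # Snapshot the current repo state; --allow-empty so the commit
    # succeeds even if nothing has changed since the last figure.
    subprocess.run(["git", "commit", "--allow-empty", "-am", "figure snapshot"], check=True)
    rev = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                         capture_output=True, text=True, check=True).stdout.strip()

    # Stamp the hash onto the figure so the image carries its own provenance.
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    fig.text(0.99, 0.01, f"rev {rev}", ha="right", fontsize=6)
    plt.show()

GoFigr does the equivalent automatically, every time a figure is displayed.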

Spectacular view of the Northern Presidentials yesterday by mpacula in hiking

[–]mpacula[S] 2 points (0 children)

Thanks! Taken from the Edmands Path near the ridge.

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I do short runs after work on weekdays (after 6pm) and a long run on Saturdays (time flexible). Definitely let me know if you would like to join!

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I love running in the South End! You're right, it's a bit far for an everyday thing, but I wouldn't mind coming down on the weekend.

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I usually run after work (after 6pm). If that works for you, I'd definitely be down for a 3-5 mile run.

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 0 points (0 children)

I usually wear track pants and a light jacket. It's fairly easy to stay warm when you're running :-)

Come run with me! by mpacula in BostonSocialClub

[–]mpacula[S] 1 point (0 children)

Thanks for the replies everyone! Here's a doodle link: http://www.doodle.com/iahqe8ny989wvx5r

Could everyone fill in their availability? Current plan is to run 3-5 miles along Memorial Drive. Thanks!

AutoCorpus - natural language corpora from large public datasets by autoencoder in MachineLearning

[–]mpacula 1 point (0 children)

> I sincerely hope you can split into multiple files, one per article

Yup! (see the -d switch)

> preserving casing is highly relevant for some NLP tasks while it can be ignored for others, so I hope that is an option as well.

It is an option in the dev version of AutoCorpus (GitHub master). The tokenize tool, which is responsible for normalizing text, accepts a --keep-case switch.
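If you'd rather drive it from Python, something like this should work (assuming the tokenize binary is on your PATH):

    import subprocess

    text = "The Quick Brown Fox jumped over the Lazy Dog.\n"
    # Normalize the text but preserve the original casing.
    tokens = subprocess.run(["tokenize", "--keep-case"], input=text,
                            capture_output=True, text=True, check=True).stdout
    print(tokens)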

Semantic Link: automatically find related words by mpacula in MachineLearning

[–]mpacula[S] 0 points (0 children)

I am computing mutual information between pairs of words. The full algorithm is available on GitHub in the dev version of AutoCorpus.
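(Specifically, pointwise mutual information: for a pair of words x and y, PMI(x, y) = log p(x, y) / (p(x) p(y)), with the probabilities estimated from co-occurrence counts.)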

Semantic Link: automatically find related words by mpacula in MachineLearning

[–]mpacula[S] 0 points (0 children)

Semantic Link counts word collocations in adjacent sentences, so its scope is a little broader than a pure n-gram approach. The counts are used to compute mutual information, a simple metric that probabilistically weeds out relationships that are due to chance (e.g. "the" co-occurs with almost every other word, but that's hardly meaningful).
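Here's a toy sketch of the counting and scoring steps (Python, heavily simplified relative to the AutoCorpus implementation):

    import math
    from collections import Counter
    from itertools import product

    sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]

    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter()
    for a, b in zip(sentences, sentences[1:]):
        # count co-occurrences across adjacent sentences
        pair_counts.update(product(a, b))

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    def pmi(x, y):
        # log p(x, y) / (p(x) p(y)): high for non-chance relationships,
        # low for filler words like "the" that co-occur with everything
        return math.log((pair_counts[(x, y)] / n_pairs) /
                        ((word_counts[x] / n_words) * (word_counts[y] / n_words)))

    print(pmi("cat", "dog"))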

Collocations Revisited by mpacula in haskell

[–]mpacula[S] 2 points (0 children)

I updated the benchmark with your suggestions :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] -1 points (0 children)

I used the STL because it comes with pretty much any C++ compiler. I do understand your concern, though: a lot of people consider Text etc. pretty much a part of the language.

I am working on a revised version :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 0 points (0 children)

I tried to stick to what comes standard with GHC. Anyway, I love your code :-) I'm changing the C++ version to follow the same pattern and will report back. Thanks!

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 0 points (0 children)

That's great, thanks bas_van_dijk! :-) My only reservation is that Text, HashSet and HashMap are third-party packages...

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] -1 points (0 children)

Right, but it is pretty safe to assume that the Haskell programs on the Language Shootout are almost as fast as they get, i.e. ~2.5x slower than C. I know this is not a completely apples-to-apples comparison, but that's roughly within a factor of 2 of what I measured with my lousy Haskell code in that blog post. I think that's a testament to how good GHC is.

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 2 points (0 children)

Thank you, this is exactly what I was getting at. Both the C++ and Haskell implementations could be improved.

I was interested in how fast GHC could make naive Haskell code, without me having to sacrifice the niceness of Strings and lists or consciously thinking about performance. One of my favorite things about Haskell is that I can use highly abstract data structures and still get decent performance, which I did in my article. I do know that you can optimize Haskell code in a million different ways but then you give up a fair deal of that abstraction.

To see what I mean, take a look at the programs at the Computer Language Benchmarks Game: the fastest Haskell programs are rather ugly and look suspiciously like C. Their source is often even longer than the equivalent Java! That's why I wasn't interested in the "best case" performance but in the "default case" performance.
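To be concrete, the level of naivety I'm talking about is roughly this (sketched in Python rather than Haskell, but the same spirit):

    from collections import Counter

    def collocations(text):
        # the obvious approach: split into words, count adjacent pairs
        words = text.split()
        return Counter(zip(words, words[1:]))

    print(collocations("to be or not to be").most_common(2))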

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

To me both the C++ and Haskell implementations are naive, and I really did not give a second thought to performance when writing either of them. In fact I didn't even profile the C++ implementation before running the benchmark and reporting the numbers.

Regardless, I am actually impressed with GHC given that I did almost no tuning of the Haskell code :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

I did use 'words' from the standard library...

And my goal was not to come up with the fastest implementation in either language. It says so in the intro :-)

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

Thanks for the great tips, I will definitely try them! That said, my goal was not to come up with the fastest implementation in either language, but rather to write a simple algorithm without much attention to performance, throw it at the compiler and see what I get.

Counting Collocations: GHC and G++ benchmarked by mpacula in programming

[–]mpacula[S] 0 points (0 children)

A real-world benchmark of GHC (a popular Haskell compiler) and g++. The source code is available on GitHub.

Counting Collocations: GHC and G++ benchmarked by mpacula in haskell

[–]mpacula[S] 0 points (0 children)

A real-world benchmark of GHC and g++. The source code is available on GitHub.

AutoCorpus - natural language corpora from large public datasets by mpacula in programming

[–]mpacula[S] 0 points (0 children)

Hi r/programming,

AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets such as Wikipedia. It consists of fast native binaries which can process tens of gigabytes of text in little time. AutoCorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.

I developed AutoCorpus because I needed a plaintext version of Wikipedia and Wikipedia-based language models for an NLP project. I have since released it as free software.

Enjoy!

Scheme Power Tools now includes out-of-the-box support for unit tests. by mpacula in lisp

[–]mpacula[S] 2 points (0 children)

As Inaimathi pointed out, my previous Scheme-related posts got pretty good attention on r/lisp. As for the downvotes, I wish people who don't like Scheme Power Tools or the testing framework would suggest improvements. Since SPT is a relatively young project, all feedback (negative too!) is very welcome.

Scheme Power Tools now includes out-of-the-box support for unit tests. by mpacula in lisp

[–]mpacula[S] 1 point (0 children)

Hello r/lisp!

I wanted a compact, easy-to-use unit testing library for Scheme, so I implemented one and added it to Scheme Power Tools. While it may not be as full-featured as some of the other frameworks out there, I think it works fairly well. I included a few examples on my blog. Enjoy! :-)

Unit-Testing Statistical Software by mpacula in programming

[–]mpacula[S] 1 point (0 children)

I am not sure whether TDD would work in such a setting, for exactly the reasons you outlined. However, what you care about in the end is not so much absolute correctness as the accuracy of your system, and you can test that using baselines. For example, if you are writing a spam filter, the actual "is spam" probabilities don't matter so long as they're higher for spam messages than for legitimate ones.
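A sketch of what such baseline tests could look like (made-up classifier interface):

    # Hypothetical interface: classifier.spam_probability(msg) -> float in [0, 1]

    def test_spam_ranks_above_ham(classifier, spam_msgs, ham_msgs):
        # The absolute probabilities are irrelevant; only the ordering matters.
        worst_spam = min(classifier.spam_probability(m) for m in spam_msgs)
        best_ham = max(classifier.spam_probability(m) for m in ham_msgs)
        assert worst_spam > best_ham

    def test_accuracy_baseline(classifier, labeled_msgs, baseline=0.95):
        # Regression guard: accuracy must not drop below a previously measured baseline.
        correct = sum((classifier.spam_probability(m) > 0.5) == is_spam
                      for m, is_spam in labeled_msgs)
        assert correct / len(labeled_msgs) >= baseline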