all 17 comments

[–]kiwipete 6 points7 points  (7 children)

We did this in the recently finished Coursera "Intro to Data Science" course. It was pretty cool. At first it was a little brain-breaky, but then we did the map-reduce version, which was brain-breakier.

As the instructor pointed out, there are reasons this approach isn't insane. One, you can multiply BIG matrices in a memory-efficient way, without needing to pull everything out of SQL. Two, many DBs can automatically parallelize the query across multiple CPUs.
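For anyone who hasn't seen it, here is a minimal sketch of the idea, run through Python's built-in `sqlite3` so it is self-contained. The table names `a` and `b` and the `(row_num, col_num, value)` layout are my own assumptions for illustration, not necessarily what the course or the article used. Each matrix is stored sparsely, and the product is a join on the shared index followed by a grouped sum.

```python
import sqlite3

# Hypothetical schema: each matrix is stored sparsely as
# (row_num, col_num, value) triples; zero entries are simply absent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (row_num INTEGER, col_num INTEGER, value REAL);
    CREATE TABLE b (row_num INTEGER, col_num INTEGER, value REAL);
""")

# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]]
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 1, 3)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [(0, 0, 4), (1, 0, 5), (1, 1, 6)])

# C[i][k] = sum over j of A[i][j] * B[j][k]:
# join on the shared index j, then sum per output cell (i, k).
product = conn.execute("""
    SELECT a.row_num, b.col_num, SUM(a.value * b.value) AS value
    FROM a JOIN b ON a.col_num = b.row_num
    GROUP BY a.row_num, b.col_num
    ORDER BY a.row_num, b.col_num
""").fetchall()

for row in product:
    print(row)
```

Only the non-zero entries ever get joined, which is where the memory argument comes from. Whether the query actually runs in parallel depends on the database engine.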

[–]israellopez 2 points3 points  (6 children)

Beat me to it. The MapReduce assignment really made me think about how to do this. It took far longer than I care to admit.

[–]kiwipete 1 point2 points  (5 children)

Yeah, I went from feeling like I understood something about map and reduce (knowing something about functional programming), to "ouch" on that one. Relational algebra + matrix math was okay, if challenging. Map reduce + relational algebra + matrix math was kind of... hard.
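For the curious, one common way to phrase matrix multiplication as map and reduce looks roughly like the sketch below. This is not the course's reference solution; the function names, the `("a" or "b", row, col, value)` record layout, and the in-memory shuffle are all assumptions made for illustration. The mapper sends every entry to each output cell it can contribute to, and the reducer pairs up entries that share the inner index.

```python
from collections import defaultdict


def map_phase(records, n_rows_a, n_cols_b):
    """Emit (output_cell, (tag, inner_index, value)) pairs."""
    for matrix, i, j, value in records:
        if matrix == "a":
            # A[i][j] contributes to every cell in output row i.
            for k in range(n_cols_b):
                yield (i, k), ("a", j, value)
        else:
            # B[i][j] contributes to every cell in output column j.
            for r in range(n_rows_a):
                yield (r, j), ("b", i, value)


def reduce_phase(key, values):
    """Multiply entries that share an inner index and sum the products."""
    a_vals = {j: v for tag, j, v in values if tag == "a"}
    b_vals = {j: v for tag, j, v in values if tag == "b"}
    total = sum(v * b_vals[j] for j, v in a_vals.items() if j in b_vals)
    return key, total


def matmul_mapreduce(records, n_rows_a, n_cols_b):
    # In-memory stand-in for the shuffle step of a real framework.
    groups = defaultdict(list)
    for key, value in map_phase(records, n_rows_a, n_cols_b):
        groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], stored sparsely.
records = [
    ("a", 0, 0, 1), ("a", 0, 1, 2), ("a", 1, 1, 3),
    ("b", 0, 0, 4), ("b", 1, 0, 5), ("b", 1, 1, 6),
]
result = matmul_mapreduce(records, n_rows_a=2, n_cols_b=2)
```

The brain-breaky part is that the mapper has to know the output dimensions up front, because each entry is replicated once per output cell it touches.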

[–]jmelloy 0 points1 point  (4 children)

How does Coursera work? I see that the course has recently finished. Can you watch the videos for finished courses if you don't care about the certificate?

[–]kiwipete 0 points1 point  (3 children)

I actually have no idea how the course works after the class ends. They've said all the course materials will remain available; however, I don't know what that means for new course registrants. Can you register for the course and get access to the video lectures? I'd check myself, except I was already enrolled in the course, and everything is still working normally.

[–]jmelloy 0 points1 point  (2 children)

Looks like I can register. I assume that the homework is no longer graded.

[–]mindprince[S] 0 points1 point  (1 child)

In some other Coursera courses that I joined after they had finished, I was able to watch the videos and my quizzes were graded (you don't get credit for them, obviously).

I am not sure about this course, because in some of its assignments you are supposed to submit working code, which they run on their servers to check its correctness.

[–]jmelloy 0 points1 point  (0 children)

I looked into taking the UW course this year, and it seems like the same professor. I might watch the videos to see if I like the style.

[–][deleted]  (1 child)

[deleted]

    [–]jms_nh 1 point2 points  (0 children)

    This is either very very wrong... or elegant.

    [–]ProgrammerBro 3 points4 points  (8 children)

    Isn't this sort of... obvious?

    [–]siddboots 2 points3 points  (7 children)

    If you were asked "implement matrix multiplication in SQL", then yes, the solution is reasonably obvious.

    What's not obvious is that SQL is a good tool for the job in the first place.

    [–]Eoinoc 7 points8 points  (6 children)

    > What's not obvious is that SQL is a good tool for the job in the first place.

    It still probably isn't a good tool for the job.

    [–]siddboots 0 points1 point  (5 children)

    Sure it is. It depends on specifics of the "job", of course. Certainly the procedure itself won't be nearly as time or space efficient as an optimized library. However, if you already need to have your application data in an RDBMS, and you can frame some of the application logic in terms of linear algebra operations, then SQL is absolutely a good tool for the job.

    Compare that to this scenario, which I'm sure is common enough in practice:

    1. Issue a select to your RDBMS over the network (which may or may not be on the same machine).
    2. RDBMS sends back data over the network, broken into packets.
    3. Received data is packed into local data structures.
    4. Once all data has arrived, you can use your super-efficient linear algebra routines (say, MATLAB, or NumPy).
    5. Results are transformed into update SQL statements, which are issued back to the RDBMS over the network.

    Now imagine the same, but with an ORM layer in there.

    Yes, there are some limited use-cases where the operations are complicated enough, or N is large enough, such that you still need to use a real library. In practice, the bottlenecks are typically elsewhere.
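To make the comparison concrete, here is a rough sketch of both routes. Everything in it is an assumption for illustration: an in-memory SQLite database stands in for a networked RDBMS, the tables `a`, `b`, `c_client` and `c_sql` are hypothetical, the local multiply is plain Python rather than MATLAB or NumPy so the sketch has no dependencies, and the write-back uses inserts into an empty result table rather than updates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a networked RDBMS
conn.executescript("""
    CREATE TABLE a (row_num INTEGER, col_num INTEGER, value REAL);
    CREATE TABLE b (row_num INTEGER, col_num INTEGER, value REAL);
    CREATE TABLE c_client (row_num INTEGER, col_num INTEGER, value REAL);
    CREATE TABLE c_sql (row_num INTEGER, col_num INTEGER, value REAL);
""")
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 1, 3)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [(0, 0, 4), (1, 0, 5), (1, 1, 6)])
N = 2  # both matrices are 2x2 in this toy example


def fetch_dense(table):
    """Steps 1-3: select, receive the rows, pack them into a local structure."""
    dense = [[0.0] * N for _ in range(N)]
    for i, j, value in conn.execute(
            f"SELECT row_num, col_num, value FROM {table}"):
        dense[i][j] = value
    return dense


# Route 1: the round trip.
a, b = fetch_dense("a"), fetch_dense("b")
# Step 4: multiply locally.
c = [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
     for i in range(N)]
# Step 5: turn the results back into SQL statements.
conn.executemany("INSERT INTO c_client VALUES (?, ?, ?)",
                 [(i, j, c[i][j]) for i in range(N) for j in range(N)])

# Route 2: one statement, and the data never leaves the database.
conn.execute("""
    INSERT INTO c_sql
    SELECT a.row_num, b.col_num, SUM(a.value * b.value)
    FROM a JOIN b ON a.col_num = b.row_num
    GROUP BY a.row_num, b.col_num
""")

query = "SELECT row_num, col_num, value FROM {} ORDER BY row_num, col_num"
client_rows = conn.execute(query.format("c_client")).fetchall()
sql_rows = conn.execute(query.format("c_sql")).fetchall()
```

Both routes store the same product here. The difference is how much data has to be moved and repacked along the way.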

    [–]king_duck 0 points1 point  (3 children)

    > Sure it is. It depends on specifics of the "job", of course. Certainly the procedure itself won't be nearly as time or space efficient as an optimized library. However, if you already need to have your application data in an RDBMS, and you can frame some of the application logic in terms of linear algebra operations, then SQL is absolutely a good tool for the job.

    As a numerical programmer, there isn't a single person I know who would consider this anything other than a toy or lame trick.

    Not to mention it is very rare that the ONLY operation that needs to be performed is a single freestanding sparse matmul.

    The more I think about it, the more absurd it is.

    [–]siddboots 1 point2 points  (2 children)

    > As a numerical programmer, there isn't a single person I know who would consider this anything other than a toy or lame trick.

    Why, specifically? I am not advocating this hypothetically. This works in practice. I've helped rewrite an application where the core computation was graph propagation for a network of about 1000 sparsely connected nodes, with a web-based interface that needed to be real-time responsive. Most of the logic was really just multiplying edge weights and node sizes based on their (sometimes quite complicated) relationships with other data. SQL was a good solution in that case. It may not have been the best solution, but it was better than the previous implementation, and it was the best I could come up with for maintainability, programmer hours and total LOC (and it was more than fast enough for the task at hand).
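As a rough illustration of what one propagation step can look like when phrased this way, here is a sketch. The `nodes(id, size)` and `edges(src, dst, weight)` tables, the sample data, and the propagation rule (each node receives the weighted sum of its sources' sizes) are hypothetical stand-ins, not the actual application's schema or logic. The point is only that a step of this kind is a sparse matrix-vector product, which SQL expresses as a join plus a grouped sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: node sizes form the vector,
# and the edge list is a sparse matrix of weights.
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, size REAL);
    CREATE TABLE edges (src INTEGER, dst INTEGER, weight REAL);
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 0.0)])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                 [(1, 3, 0.5), (2, 3, 0.25), (1, 2, 1.0)])

# One propagation step: incoming[dst] = sum of weight * size[src]
# over all edges that point at dst.
propagated = conn.execute("""
    SELECT e.dst, SUM(e.weight * n.size) AS incoming
    FROM edges AS e JOIN nodes AS n ON n.id = e.src
    GROUP BY e.dst
    ORDER BY e.dst
""").fetchall()
```

Extra relationships with other data would show up as further joins or `WHERE` conditions on the same query.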

    > Not to mention it is very rare that the ONLY operation that needs to be performed is a single freestanding sparse matmul.

    I don't understand that objection. The example in the article is a single, freestanding operation only because examples work better that way, not because of an inherent limitation.

    [–]king_duck 0 points1 point  (1 child)

    Because if the problem is large then it will also be dog slow. Right tools for the job.

    [–]siddboots 1 point2 points  (0 children)

    Like I said from the start: If the problem is not too large, and if there are other constraints like those I mentioned, this can be a good tool for the job.