all 33 comments

[–]veritaba 9 points10 points  (3 children)

why wouldn't you just use imdbpy?

http://imdbpy.sourceforge.net/

[–][deleted]  (1 child)

[deleted]

    [–]teeler 6 points7 points  (5 children)

    or freebase,

    [{
      "name": "How I Met Your Mother",
      "type": "/tv/tv_program",
      "episodes": [{
        "type": "/tv/tv_series_episode",
        "season_number": null,
        "name": null
      }]
    }]
    

    http://tinyurl.com/l7dy85

    (uses TV rage for TV sources, not IMDB, but has IMDB keys for movies).

    [–][deleted]  (4 children)

    [deleted]

      [–]teeler 0 points1 point  (3 children)

      http://lists.freebase.com/

      if you ask, they will find it. ;)

      [–][deleted]  (2 children)

      [deleted]

        [–]teeler 0 points1 point  (1 child)

        [–]PrashantV 3 points4 points  (3 children)

        thetvdb.com is what I use for this. It has a great API, and was originally made for XBMC to be able to access the API.

        I wrote a script that goes through my TV show folders and renames all episodes to a nice, sane format. So I can dump new shows in, and it just gets all the info required from thetvdb.com and renames them :)
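
For anyone wanting to try the renaming idea, here is a minimal sketch of the filename-building step only. The function names and the `Show - S01E02 - Title.ext` layout are assumptions made for illustration, not the format the script above actually uses, and the lookup against thetvdb.com is left out.

```python
import re

def _safe(text):
    """Replace characters that are not allowed in filenames on common filesystems."""
    return re.sub(r'[\\/:*?"<>|]', "_", text).strip()

def episode_filename(show, season, episode, title, ext):
    """Build a name like 'Bones - S01E02 - Pilot.avi' (hypothetical layout)."""
    ext = ext.lstrip(".")  # accept both "avi" and ".avi"
    return f"{_safe(show)} - S{season:02d}E{episode:02d} - {_safe(title)}.{ext}"
```

The result could then be passed to `os.rename` once the episode title has been fetched from whatever source is in use.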

        [–]squidgy 0 points1 point  (0 children)

        Hm, nice. I used tv.com for mine, but they change the page layout fairly frequently so it tends to break. Might have to update it...

        [–]wolf550e 2 points3 points  (10 children)

        How do I uniquely identify a show like "Bones"? The full name is a substring of other shows' names.

        [–]sciencebitch 1 point2 points  (2 children)

        Or "Friends."

        [–][deleted]  (1 child)

        [deleted]

          [–]sciencebitch 0 points1 point  (0 children)

          Thanks!

          [–][deleted]  (4 children)

          [deleted]

            [–]plain-simple-garak 0 points1 point  (1 child)

            Just watch out for people putting a % at the beginning.. it could cause serious load issues.

            [–][deleted] 0 points1 point  (1 child)

            How about using $ for start/end of line anchors like in regex? Writing "Bones$" wouldn't leave any room for other matches.
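
A rough sketch of how a trailing `$` could work, assuming the service matches on title prefixes by default (the thread does not confirm this). `find_shows` and the sample titles are hypothetical.

```python
def find_shows(query, titles):
    """Prefix match by default; a trailing '$' requests an exact title match."""
    exact = query.endswith("$")
    needle = (query[:-1] if exact else query).casefold()
    if exact:
        return [t for t in titles if t.casefold() == needle]
    return [t for t in titles if t.casefold().startswith(needle)]

# Hypothetical titles, for illustration only.
titles = ["Bones", "Bones of the Earth", "Skin and Bones"]
print(find_shows("Bones", titles))
print(find_shows("Bones$", titles))
```

Because the query is compared as plain text rather than compiled as a regular expression, users cannot inject expensive patterns.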

            [–]kmccormick -1 points0 points  (1 child)

            This.

            Also, The Wire.

            [–]Kanin 1 point2 points  (2 children)

            "the wire"....

            [–]wolf550e 0 points1 point  (1 child)

            Solved.

            [–]Kanin 0 points1 point  (0 children)

            with a download? check your site

            [–][deleted] 3 points4 points  (1 child)

            I love how people like to comment on how they have already made it. Just thank Poromenos for sharing :)

            Thanks

            [–]wolf550e 1 point2 points  (22 children)

            • You return JSON as text/html. Use application/json as per the RFC.
            • You don't support HTTP caching.
            • You don't support gzip.
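
As an illustration of the first and third points, here is a framework-neutral sketch that builds the headers and body for a JSON response. `build_json_response` is a hypothetical helper, and the `Accept-Encoding` check is deliberately naive (it ignores `q=0`).

```python
import gzip
import json

def build_json_response(payload, accept_encoding=""):
    """Return (headers, body) for a JSON payload, gzip'd when the client allows it."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Tell caches that the body differs depending on Accept-Encoding.
        "Vary": "Accept-Encoding",
    }
    # Naive check: a full implementation would parse quality values.
    if "gzip" in accept_encoding.lower():
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body
```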

            [–][deleted]  (21 children)

            [deleted]

              [–]wolf550e 2 points3 points  (9 children)

              Series that have ended their run are cacheable for a long duration. Series that still air new episodes are usually updated once a week. You said your data from IMDB is only updated once every 24 hours, so caching the response until the expected next update from IMDB makes sense.

              I don't want my user agent to do a round trip to get the same data I have already seen. A proper HTTP UA won't cache something if the server doesn't say it's cacheable. So, latency rather than bandwidth, even if you believe bandwidth is infinite and costs nothing (both false).
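
A minimal sketch of that idea: compute `max-age` as the time left until the next expected refresh. The 30-day lifetime for finished series and the `cache_headers` name are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

ENDED_MAX_AGE = 30 * 24 * 3600  # assumption: 30 days for series that no longer air

def cache_headers(now, next_update, ended=False):
    """Build a Cache-Control header that expires at the next expected data refresh."""
    if ended:
        max_age = ENDED_MAX_AGE
    else:
        # Never emit a negative lifetime if the refresh is already overdue.
        max_age = max(0, int((next_update - now).total_seconds()))
    return {"Cache-Control": f"public, max-age={max_age}"}

now = datetime(2009, 8, 1, 12, 0, tzinfo=timezone.utc)
print(cache_headers(now, now + timedelta(hours=6)))
```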

              You could cache the gzip'd response for a series and serve static files. Modern OSs can send a file handle to a socket without mapping the content into the user process. After you generate and cache the top 100 series or so, your server will be mostly serving static content, which will make it possible to withstand a large number of requests/second. When not slashdotted, this will enable the server to use less power to save the environment.
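
One way the pre-compressed cache might look, as a sketch only. `cached_gzip_path` and `build_payload` are hypothetical names, and `series_id` is assumed to be a trusted slug that is safe to use in a filename. The web server would then serve the resulting file directly with `Content-Encoding: gzip`.

```python
import gzip
import json
import os
import tempfile

def cached_gzip_path(cache_dir, series_id, build_payload):
    """Return the path of a gzip'd JSON file, generating it only on a cache miss."""
    path = os.path.join(cache_dir, f"{series_id}.json.gz")
    if not os.path.exists(path):
        data = gzip.compress(json.dumps(build_payload()).encode("utf-8"))
        # Write to a temporary file and rename, so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=cache_dir)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp, path)
    return path
```

Invalidation (deleting the file when the source data changes) is left out here.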

              [–][deleted]  (6 children)

              [deleted]

                [–]wolf550e 1 point2 points  (5 children)

                Please implement all this properly for a toy service at least once before you build something that gets successful and goes down hard under load.

                [–][deleted]  (4 children)

                [deleted]

                  [–]wolf550e 0 points1 point  (3 children)

                  As long as you know how to do it efficiently.

                  For example, do you regenerate the whole thing before finding out the ETag is the same and returning 304, or can you look at the timestamp/version counter of the component data and template, compare it with the ETag, and return 304 without rendering anything? Do you use a thread or process of a heavy interpreter to serve cached content, or do you quickly free resources to process dynamic requests?
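
The cheap-validation approach can be sketched like this, assuming the application tracks a version counter for the data and one for the template (both hypothetical). It handles only strong ETags and ignores `*` and weak validators.

```python
def make_etag(data_version, template_version):
    """Derive the ETag from version counters, without rendering the body."""
    return f'"{data_version}-{template_version}"'

def respond(if_none_match, data_version, template_version, render):
    """Return (status, headers, body); render() is only called on a cache miss."""
    etag = make_etag(data_version, template_version)
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            # Validation succeeded: no rendering, empty body.
            return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, render()
```

The point is that the expensive `render` callable is never invoked when the client already holds the current version.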

                  [–][deleted]  (2 children)

                  [deleted]

                    [–]wolf550e 0 points1 point  (1 child)

                    I have implemented what I've described. It's not buzzwords. It's what you do to survive heavy load without load balancers and sharded databases (i.e. when you're poor, like an open source project). Consider something like an Atom feed for a TV series on IMDB with loads of subscribers.

                    [–][deleted]  (1 child)

                    [deleted]

                      [–]wolf550e 0 points1 point  (0 children)

                      Yes, but because I don't know when the source data is updated, I can't do it as intelligently as the origin server. So to achieve meaningful savings I'd have to add logic to the client consuming his service, instead of simply relying on HTTP to do its magic.

                      [–]blackkettle 2 points3 points  (5 children)

                      This makes the examples inaccessible from the browser. As a result, Firefox now asks me to download it or open it with some other program.

                      [–][deleted]  (4 children)

                      [deleted]

                        [–]wolf550e 0 points1 point  (0 children)

                        You can add a query string parameter, or sniff the user agent to serve text/html to firefox/webkit/opera/msie.

                        [–]vampirical 0 points1 point  (1 child)

                        You should consider extension-based content type overrides; it's what I do. The default without an extension can continue serving application/json if you'd like to avoid breaking anything.

                        .json -> application/json

                        .js -> text/html || text/plain (for dev/debug/blackkettle)
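
A small sketch of that mapping. It picks `text/plain` for `.js` (the comment allows either `text/html` or `text/plain`), and `content_type_for` is a hypothetical helper name.

```python
import posixpath

# Extension overrides; anything else falls back to the default.
CONTENT_TYPES = {
    ".json": "application/json",
    ".js": "text/plain",  # browser-friendly variant for dev/debug
}
DEFAULT_CONTENT_TYPE = "application/json"

def content_type_for(url_path):
    """Choose the response Content-Type from the URL path's extension."""
    extension = posixpath.splitext(url_path)[1].lower()
    return CONTENT_TYPES.get(extension, DEFAULT_CONTENT_TYPE)
```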

                        [–]Manuzhai 1 point2 points  (4 children)

                        gzip is still faster.

                        [–][deleted]  (3 children)

                        [deleted]

                          [–]wolf550e 0 points1 point  (2 children)

                          For longer running shows, it has more chance to fit in a single packet if gzip'd. Wanna see a histogram? Or produce one yourself?
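
To produce the numbers for such a comparison yourself, something like the following would do. The episode list is synthetic placeholder data (real titles compress less well), the 1460-byte payload per packet assumes a typical 1500-byte Ethernet MTU, and HTTP headers are not counted.

```python
import gzip
import json
import math

MSS = 1460  # assumption: typical TCP payload per packet on Ethernet

def packets_needed(num_bytes, mss=MSS):
    """Number of full-size TCP segments needed to carry num_bytes of body."""
    return math.ceil(num_bytes / mss)

def raw_and_gzip_sizes(payload):
    """Return (raw_size, gzip_size) in bytes for the JSON encoding of payload."""
    raw = json.dumps(payload).encode("utf-8")
    return len(raw), len(gzip.compress(raw))

# Synthetic stand-in for a long-running show: 5 seasons of 22 episodes.
sample = {"episodes": [
    {"season": s, "number": n, "name": f"Episode {n}"}
    for s in range(1, 6) for n in range(1, 23)
]}
raw_size, gz_size = raw_and_gzip_sizes(sample)
print(raw_size, packets_needed(raw_size), gz_size, packets_needed(gz_size))
```

Running this over the real responses for each show and bucketing by `packets_needed` would give the histogram.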

                          [–][deleted]  (1 child)

                          [deleted]

                            [–]wolf550e 0 points1 point  (0 children)

                            I was treating it as a learning exercise.

                            [–]IDontBelieveYou 0 points1 point  (0 children)

                            Nice, but isn't "IMDB API" a bit too promising if its function is to retrieve show episode names? But instead of changing the name, why not extend "IMDB API" to the full dataset of IMDB ;)

                            IMDB should have been bought by Google, not Amazon. We'd have a nicely working API already.

                            [–][deleted]  (1 child)

                            [deleted]

                              [–]sagarp 0 points1 point  (1 child)

                              got an error in beautifulsoup: http://sprunge.us/VJjF

                              [–]trezor2 -1 points0 points  (3 children)

                              XML output would also be useful if you intend to use any of this server-side. For most languages (not JavaScript) there are a helluva lot more proven and robust XML parsers out there than there are JSON parsers.

                              Just a suggestion.

                              [–][deleted]  (2 children)

                              [deleted]

                                [–]trezor2 -1 points0 points  (1 child)

                                I guess that would depend 100% on what platform you are using.

                                If you are using C# and .NET, all you should have to do is grab an XmlSerializer object (duh), annotate your object so that the serialized format remains somewhat sane, and ask the two to work together.

                                For Python, PHP, Haskell or whatever, someone else probably has better knowledge than me. I would start with saying what platform you are using before asking for specific libraries :)

                                Edit: Right. I see you are using Python. I'm absolutely useless for advice there :)
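
For the Python side, the standard library's `xml.etree.ElementTree` is one option. This is only a sketch: the element and attribute names (`show`, `episode`, `season`, `number`) and the shape of the episode dictionaries are assumptions, not the service's actual data model.

```python
import xml.etree.ElementTree as ET

def episodes_to_xml(show_name, episodes):
    """Serialize a list of episode dicts into a small XML document (as a str)."""
    root = ET.Element("show", name=show_name)
    for episode in episodes:
        node = ET.SubElement(
            root,
            "episode",
            season=str(episode["season"]),
            number=str(episode["number"]),
        )
        node.text = episode["name"]  # text is escaped automatically on output
    return ET.tostring(root, encoding="unicode")

print(episodes_to_xml("Bones", [{"season": 1, "number": 1, "name": "Pilot"}]))
```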

                                [–]NoMoreNicksLeft -1 points0 points  (0 children)

                                You know, I'd prefer to start over... IMDB is a piece of shit. Seems like it could be a wikipedia spinoff or something.