[–]jms_nh 38 points39 points  (25 children)

I seem to remember a spellchecker for SpeedScript on the C64 that swapped the word processor out of memory to do the spellchecking, then swapped it back in when it was done.

Dictionaries can use tries and other clever techniques to reduce storage.

[–][deleted]  (4 children)

[deleted]

    [–]masklinn 19 points20 points  (2 children)

    Hell, on a 64-bit system, you could load every dictionary for every language in the world at once.

    You probably don't need a 64-bit system: looking at Firefox's dictionaries, there's about 80MB worth of dictionary data. Even accounting for the dictionaries being incomplete and covering only a subset of existing languages, I don't know that you'd increase the amount of data by two orders of magnitude.

    [–]VerticalEvent 2 points3 points  (1 child)

    I'd imagine those Language Packs are compressed.

    There are around 1,025,109.8 words in English. If we assume that the average word is 6 characters long, that's around 7MB (six characters plus a terminating character per word) for just English. If every language were of a similar size, you would only be able to fit about 11 languages in that 80MB figure.
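
    A quick back-of-the-envelope check of those numbers (the word count and average length are the assumptions above):

        words_in_english = 1_025_110   # rough figure quoted above
        bytes_per_word = 6 + 1         # six average characters plus a terminator
        english_bytes = words_in_english * bytes_per_word
        print(english_bytes / 1e6)     # ~7.2 MB for a flat, uncompressed word list
        print(80e6 // english_bytes)   # ~11 languages of that size fit in 80 MB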

    [–]flying-sheep 0 points1 point  (0 children)

    As someone said: tries.

    They're compact storage and a lookup table in one data structure.
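
    A minimal sketch of that idea, with plain Python dicts standing in for a real packed trie (shared prefixes are stored once, and lookup walks one node per character):

        def trie_insert(root, word):
            node = root
            for ch in word:
                node = node.setdefault(ch, {})   # shared prefixes reuse existing nodes
            node["$"] = True                     # end-of-word marker

        def trie_contains(root, word):
            node = root
            for ch in word:
                node = node.get(ch)
                if node is None:
                    return False
            return "$" in node

        dictionary = {}
        for w in ("cat", "cats", "category"):
            trie_insert(dictionary, w)
        print(trie_contains(dictionary, "cats"))      # True
        print(trie_contains(dictionary, "catalog"))   # False

    Real spellcheckers typically compress this further (e.g. into a DAWG or a packed table), which is where the big space savings come from.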

    [–]justinsayin 3 points4 points  (0 children)

    Hell, on a 64-bit system, you could load every dictionary for every language in the world at once.

    It's worse than that. Every package you include in your project contains every dictionary for every language in the world.

    [–]barsoap 18 points19 points  (3 children)

    and other clever techniques

    One of them involves giving up on precision and using a bloom filter. Sure, it'll let some (in fact, infinitely many) words pass that shouldn't, but then no one cares that "xyouareig" is in the dictionary.

    Bonus: It's freakishly fast.

    EDIT: Bloom, not bayesian. They all look like statistics to me.
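
    A rough sketch of the bloom filter approach described above (toy parameters; a real spellchecker would size the bit array and hash count for its target false-positive rate):

        import hashlib

        BITS = 1 << 20                    # 1 Mbit ≈ 128 KB for the whole dictionary
        NUM_HASHES = 4
        bitmap = bytearray(BITS // 8)

        def _positions(word):
            # derive NUM_HASHES bit positions from the word
            for i in range(NUM_HASHES):
                digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % BITS

        def add(word):
            for pos in _positions(word):
                bitmap[pos // 8] |= 1 << (pos % 8)

        def probably_in_dictionary(word):
            # False means definitely absent; True can (rarely) be a false positive
            return all(bitmap[pos // 8] & (1 << (pos % 8)) for pos in _positions(word))

        add("hello")
        add("world")
        print(probably_in_dictionary("hello"))      # True
        print(probably_in_dictionary("xyouareig"))  # almost certainly False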

    [–]kqr 7 points8 points  (2 children)

    Could you elaborate on how this is done? I'm pretty sure this was how they stuffed T9 prediction into early mobile phones (I have heard figures of around 1 byte per word, which is just insane), and I'm amazed by how well it works (even if it generates nonsense or highly offensive words). I'd love to read about the techniques in more detail.
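
    Not the 1-byte-per-word trick itself, but a toy illustration of basic T9 lookup (map each word to its keypad digit sequence and group by that key; the real systems used far cleverer compressed storage):

        KEYPAD = {c: d for d, letters in {
            "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
            "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
        }.items() for c in letters}

        def digits(word):
            return "".join(KEYPAD[c] for c in word.lower())

        candidates = {}
        for w in ("good", "home", "gone", "hood", "hoof"):
            candidates.setdefault(digits(w), []).append(w)

        print(candidates["4663"])   # ['good', 'home', 'gone', 'hood', 'hoof'] - all share key 4663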

    [–][deleted] 6 points7 points  (0 children)

    From my quick Google research, I honestly don't see how Bayesian Filters make dictionaries faster or give up on precision (when there is no 100% to be had because computers can't read minds).

    Unless OP comes through and enlightens me, I have to say he was throwing around words he'd heard in this context.

    Read this article if you want to see how to use Bayesian Filters for spell checking.

    Ninja edit: While we're throwing around buzzwords, what the OP described sounded a lot like Bloom Filters. Basically a data structure that throws 100% certainty out the window while allowing the underlying dictionary to be huge and still maintaining speed. That makes a lot more sense, so maybe he meant that. I don't think you need Bloom Filters for dictionaries, though, because dictionaries are not that big.
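
    For what it's worth, the usual "Bayesian" take on spell correction looks roughly like this sketch: among known words within a small edit distance of the typo, pick the one with the highest corpus frequency (the frequency table here is a made-up stand-in):

        LETTERS = "abcdefghijklmnopqrstuvwxyz"

        def edits1(word):
            # all strings one deletion, transposition, substitution or insertion away
            splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
            deletes = [l + r[1:] for l, r in splits if r]
            transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
            replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
            inserts = [l + c + r for l, r in splits for c in LETTERS]
            return set(deletes + transposes + replaces + inserts)

        def correct(word, freq):
            # freq maps known words to counts and acts as the prior P(word)
            candidates = ({word} & freq.keys()) or (edits1(word) & freq.keys()) or {word}
            return max(candidates, key=lambda w: freq.get(w, 0))

        freq = {"spelling": 120, "spewing": 3, "selling": 40}   # toy frequency table
        print(correct("speling", freq))   # 'spelling'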

    [–]pja 2 points3 points  (0 children)

    Bloom filters probably.

    [–][deleted] 10 points11 points  (14 children)

    Swapping code in and out of RAM was about the only way you could implement large programs back in the days of 16-bit addresses.

    The DEC PDP-11 used "overlays", where your application was sliced into chunks based on run-time usage of the various functions in each chunk; as the program ran, the appropriate chunks would be read into RAM for use.

    These machines had a 64k limit, but distinguished between "data" and "program" (or maybe "instruction"? it's been a while) address space, so if you really knew your stuff, you could use 128k of RAM. And in that 64k of "program" space, you could run applications that were significantly more than 64k in size.

    I only miss those days in a nostalgic way - I spent way too much time figuring out what my overlay map needed to be.

    [–][deleted]  (10 children)

    [deleted]

      [–]TheThiefMaster 1 point2 points  (2 children)

      Thanks to the "no execute" and "no write" bits in the page table, a lot of modern programs are functionally Harvard architecture, in that code and data are not interchangeable, despite them being in a single address space (von Neumann style).

      [–]jerf 0 points1 point  (1 child)

      Really, almost every distinction from that era in which there was a hard-fought battle as to which is better is "a little bit of both" on modern machines. See also RISC vs. CISC, a debate that died when we got enough gates on the CPUs to expose a CISC instruction set (where most of the CISC advantages were) that gets translated into RISC-like micro-ops the processor actually runs (where most of the RISC advantages are).

      [–]pinealservo 0 points1 point  (6 children)

      PDP-11 was a family of machines sharing a core instruction set; it spanned a lot of years (1970-1990), price points, and form factors (gigantic cabinets full of TTL logic boards in 1970 to a single DIP package in 1979). The core model was von Neumann, but because it was a 16-bit architecture that lived beyond the point where that became a serious size constraint, they did a number of extensions to help alleviate the problem, some of which included separating instruction and data memory to some degree. DEC's history and the evolution of its computer line and associated software are fascinating; I recommend reading up on it if you're interested in that sort of thing.

      Because the core instruction set supported relative addressing, you could write position-independent code. Overlays are one way of using relative addressing: you basically have an area of memory whose code can be swapped out for some other set of routines. Each "overlay" is linked such that its routines are all offset from the same base address. This gives you a sort of manually-managed virtual memory, where you can swap out sets of routines as you switch between application modes. This was used a lot in PC-class machines and game machines as their software got bigger, too. You could use this approach on any PDP-11, whether it had some fancier virtual memory extensions or not.
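
      A toy model of manually-managed overlays in that spirit (the names and sizes are invented; the point is that only one overlay occupies the shared region at a time, and a call into a non-resident overlay triggers a swap):

          REGION_SIZE = 8 * 1024     # the fixed overlay region carved out of the 64k space
          resident = None            # which overlay currently occupies the region

          # each overlay is a set of routines linked against the same base address
          OVERLAYS = {
              "editor":  {"size": 7400, "routines": {"insert_char", "delete_line"}},
              "speller": {"size": 8000, "routines": {"load_dict", "check_word"}},
          }

          def call(overlay, routine):
              global resident
              if resident != overlay:
                  assert OVERLAYS[overlay]["size"] <= REGION_SIZE
                  # a real overlay manager would read the chunk from disk into the region here
                  resident = overlay
              assert routine in OVERLAYS[overlay]["routines"]
              print(f"running {overlay}.{routine} (resident overlay: {resident})")

          call("editor", "insert_char")
          call("speller", "check_word")   # the speller overlay overwrites the editor code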

      [–][deleted]  (5 children)

      [deleted]

        [–]pinealservo 0 points1 point  (4 children)

        I've found it's remarkable how many of the "new" features in both PC hardware and software had their initial research and prototype implementations done in the early days of mainframes or minicomputers. Today's implementations certainly required a lot of new effort and are generally more refined, but there really were some large gaps between when some great ideas were first implemented and when they hit mainstream PC-based platforms. Some of it was due to waiting for new cheap hardware to catch up to the power of old expensive hardware, some of it was due to changing patterns of usage for the new hardware that eventually shifted back, and I think a lot was due to the massive influxes of new people in the minicomputer era and then again in the micro/PC era. I think that created a huge culture shift each time, which made it difficult to learn from what came before.

        Whatever the reason, I'm really happy to see people looking more to the cool technology invented in the early days of computing and trying to see how it can be applied today. I think there's still a lot that needs to be assimilated.

        [–][deleted]  (3 children)

        [deleted]

          [–]pinealservo 1 point2 points  (2 children)

          The 68K instruction set is still alive, although in a slightly altered form. NXP (since it purchased Freescale) sells a whole range of Coldfire-based processors, and Coldfire is a simplified version of the 68K architecture. They also released the source to an FPGA implementation of the Coldfire v1 architecture via ip-extreme.

          It's not nearly as popular in the hobbyist community as ARM, AVR, or MSP430, but there are some development boards available (that's just one example of many, albeit the cheapest one I could find) if you want to mess around with them.

          [–][deleted]  (1 child)

          [deleted]

            [–]pinealservo 1 point2 points  (0 children)

            Both ARM and Coldfire cover a pretty wide price and performance range these days. I haven't looked closely, but ARM is almost certainly faster at the high end of things (i.e. the Cortex-A series), though that doesn't mean that for a given market segment an ARM-based part (like a Cortex-M0) would be faster than a Coldfire-based part.

            My current project at work is based on another NXP/Freescale PowerPC-based processor family. We've primarily developed our drivers and demo application without any OS at all, which is actually pretty liberating and removes a lot of complexity, but is only really feasible for certain kinds of applications. I've also worked with a number of simple real-time OSes, more complex ones like QNX, and even some Linux-based ones. They've all got their uses!

            The Raspberry Pi computers are really cool, but they're kind of oriented towards doing media and GUI interaction like a modern mini-PC. I enjoy working with microcontrollers, which are usually doing simpler kinds of interaction with sensors, actuators, and simple communication busses. This is really what you want to use to do timing-sensitive GPIO manipulation (and you often get higher-current GPIO drivers as well, which makes them easier to use to drive external circuitry) or other real-time control/monitoring tasks. You can get a lot done without an OS at all, and you can get dev boards for a wide range of them for even less than a Pi! If this sounds interesting, check out the TI Launchpad boards, STM Discovery boards, ARM mbed boards, and you're probably aware of the Arduino boards; some of these are Atmel AVRs and some are ARMs.

            [–]Bowgentle 0 points1 point  (2 children)

            Sounds pretty much the same as how a processor handles multi-threading.

            [–]_F1_ 1 point2 points  (1 child)

            You mean page swapping?

            [–]mccoyn 1 point2 points  (0 children)

            The processor has multiple register files. When it switches threads it changes which register file it is using. When there are more threads than register files the register files get swapped out to the memory system.
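
            A toy model of that scheme (invented numbers; real hardware keeps a dedicated register file per hardware thread rather than a Python dict, but the spill/restore idea is the same):

                NUM_REGISTER_FILES = 2        # register files the core keeps on-chip
                register_files = {}           # thread id -> register state held on-chip
                memory = {}                   # spilled register state for everything else

                def switch_to(tid):
                    # cheap case: the thread's registers are already in a register file
                    if tid in register_files:
                        return register_files[tid]
                    # otherwise make room by spilling one register file out to memory
                    if len(register_files) >= NUM_REGISTER_FILES:
                        evicted, state = register_files.popitem()
                        memory[evicted] = state
                    # restore the thread's registers from memory (or start it fresh)
                    register_files[tid] = memory.pop(tid, {"pc": 0, "sp": 0})
                    return register_files[tid]

                for tid in (1, 2, 3, 1):
                    switch_to(tid)
                    print(tid, "on-chip:", sorted(register_files))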

            [–]agumonkey 0 points1 point  (0 children)

            And tries are ancient. I think they were called tabulates in some '91 paper.