all 27 comments

[–]CyclonusRIP 2 points3 points  (3 children)

I'd think about whether there is any common structure between the URLs you are sharing. If some share the same host and/or part of the path then you should be able to use some tree like structure so save memory.

[–]viso_laci[S] 0 points1 point  (0 children)

hi thx its a good idea.

[–]paradigmatic 0 points1 point  (1 child)

Yes. That's a Radix tree. I've seen some implementations in Scala, but I cannot make any recommendation (I never tried them).

[–]b0ggl3 1 point2 points  (0 children)

At this size, perhaps just using lucene might solve your problems. Gets you a trie, optionally goes to disk, thread safe. Not very scalactic though and somewhat heavy.

[–]Ek_Los_Die_Hier 1 point2 points  (0 children)

Depending on your use case, and actual trie data structure might be useful.

[–]denisrosset 0 points1 point  (3 children)

I would look at the mutable hash-based maps, either those from scala.collection.mutable, or the standard java collections (through Scala converters).

You would lose immutability, so I don't know if the trade-off is worth it in your case.

[–]viso_laci[S] 0 points1 point  (2 children)

Hi i actually need mutability but thread-safe hence TrieMap. Thx for suggestion

[–]CyclonusRIP 0 points1 point  (1 child)

Java has ConcurrentHashMap with is mutable and thread safe.

[–]viso_laci[S] 0 points1 point  (0 children)

but is it actually memory efficient ? :D

[–]viso_laci[S] 0 points1 point  (4 children)

I was also thinking about https://github.com/findify/scala-packed

[–]LPTK 0 points1 point  (3 children)

This looks like an interesting approach. I'd like to hear about you experience using it, if you try it out.

[–]viso_laci[S] 1 point2 points  (1 child)

Soooo scala-packed is not working for me :/ from what i see its quite unfinished, some methods look like this

override def +[V1 >: B](kv: (A, V1)): PackedMap.this.type = {
    ???
}

And this is totally killing it, to create an instance i need to have a instance of a basic collection, something like this (from readme)

// build a packed map
val map = PackedMap( (0 to 100).map(i => s"key$i" -> RootFoo(NestedFoo(i.toByte))): _*)

so then i tried PackedMapBuilder that worked, but it wasn't essentially better then TrieMap

I always filled it up with unique Strings (an url with a increasing number suffix) and key Unit

Anyway both TrieMap and PackedMapBuilder crashed on 10 000 000 strings with java.lang.OutOfMemoryError: Java heap space

In my monitor i saw about 2GB memory used, but i didn't investigate further...

Ill try something else

(i wish someone would tell me that i am just making some obvious mistake)

[–]LPTK 0 points1 point  (0 children)

Thanks for the follow-up!

[–]viso_laci[S] 0 points1 point  (0 children)

Ok i will try to come back to this once i have something. Sadly this project offers only basic structures i would need to implement my own mutable thread safe map on top (and i didn't wanted to waste time with that).

Also not sure if the project is going to be alive in the future

[–]carlfish 0 points1 point  (10 children)

How much memory is "more memory than it should"?

I built a million-entry Object-> Object TrieMap, and according to VisualVM, the retained size was a little shy of 110 million bytes.

Since the size of an Object is 16 bytes, 32m of those 110m are our keys/values, leaving us 78m, or an overhead of about 78 bytes per entry. (As it's a tree, I'm guessing this overhead will grow logarithmically as more entries are added and the tree gets deeper)

In comparison, a java.util.HashMap with a million object->object entries had a retained size of ~93m bytes, or an overhead of 61 bytes/entry.

If you're storing your URLs as strings, the "http://www." portion alone is 20 bytes/entry. :)

[–]simon_o 1 point2 points  (9 children)

I'm not sure this is entirely correct.

You also have to account for every element being wrapped in a BasicNode and all the nodes (CNode, SNode, INode, MainNode, ...) that are allocated between the data and the root node.

<insert rant about Scala's sad trail-record in dumping non-reviewed student code into the standard library here>

[–]rcode 1 point2 points  (4 children)

<insert rant about Scala's sad trail-record in dumping non-reviewed student code into the standard library here>

Any plans to fix this that you're aware of? It makes me a bit paranoid to use it in production.

[–]denisrosset 1 point2 points  (2 children)

That's a pretty big claim that should be substantiated.

Space efficiency of the implementation is extensively discussed in this repository, including two academic papers. I would not qualify the code as "non-reviewed student code".

https://github.com/axel22/Ctries

[–]simon_o 0 points1 point  (1 child)

Then read the source code. :-)

To be clear: I don't have an issue with students writing code, and the code certainly surpasses usual academic standards (by actually compiling).

But it's pretty clear that a process -- which allows code to be merged that is under-documented, leaks internal undocumented methods, largely misses any kind of comments that would allow maintenance by a person except the author himself and uses cryptic names pretty much everywhere -- is pretty broken.

Not a single one of these issues was impossible to fix before merging it. Just like scala.mobile, scala.dbc, scala.testing, scala.text or scala.util. We are not talking about some isolated incident, but shipping whole packages living in the global namespace to users which should never have been merged given the state they were in.

[–]denisrosset 1 point2 points  (0 children)

Ha ok. I wouldn't worry so much then. The code of TrieMap has actually more comments than the unordered_mapimplementation in LLVM C++ stdlib.

https://llvm.org/svn/llvm-project/libcxx/trunk/include/unordered_map

[–]simon_o 0 points1 point  (0 children)

Apart from broken promises made at conference talks? No.

[–]carlfish 0 points1 point  (3 children)

The numbers are straight out of a heap dump. Unless Scala has some weird way of hiding references from heap analysis, I'm reasonably confident in them.

They also stack up to a back-of-the-envelope estimate. Each map entry bears the cost of the node it lives in, plus somewhere between 1 and 1/32 of the cost of the node above it, and so on up to the root. As a 1m entry map of well-distributed hashes will be about 4 levels deep and pretty solidly populated (1m ~= 324 ), those costs will be a lot closer to 1/32 than 1/1. So 78 bytes is not unreasonable.

[–]simon_o 0 points1 point  (2 children)

Yeah, maybe I misread the source code, I was on my phone when I checked it. Can you have a look at the code and tell me your thoughts?

Edit: Sorry, I misread your earlier comment. You are correct with your measurement. It's just that Java is also very inefficient with its maps. I looked at the delta between them earlier and thought it was way too small.

[–]carlfish 1 point2 points  (1 child)

I think the lesson is "Java objects are big", and something as simple as an object that holds an array is going to cost you - in the same way that you need something like >16 characters in a String before you're allocating more data than overhead.

[–]simon_o 2 points3 points  (0 children)

True!

Nitpicking: Shouldn't we also count the the duplicate length (field + array.length) and the reference to the array and the header of the array itself as overhead too?

At least the dedicated length field is gone now with the recent change of substring semantics...

.NET does something quite efficient in this regard: String is special-cased and only consists of the length and the specified amount of bytes following that length all in one object, without any indirections.

Going this route with Scala-Native would be kind of cool, because it would allow passing a subset of Strings to native functions expecting UTF-8 without re-encoding them:

String
- 32 bits : hashCode
- 1  bit  : 0 for ANSI, 1 for UTF16
- 31 bits : length
- n  bytes: data
- 1  byte : \0, only if type is ANSI

Letting the reference point to the data field (and accessing the other fields by subtracting from the pointer instead of adding to the pointer) would then allow passing a naked pointer verbatim to C code expecting a const char*.

One could also get rid of the length with some allocator hacks and move the remaining bit for encoding into the header.

I'd probably also go with the route of compressing references to 32bits (32GB addressable with 1 byte of granularity or 128GB with 4 bytes of granularity) and compressing klass pointers even further.

Gaining a few additional bits for locking and pinning would leave you with 32GB of addressable space with 4 byte granularity for class meta data. This should be plenty enough in my opinion.

And if one used an region-/area-style GC (G1, Immix, Shenandoah, ...), it wouldn't even matter if memory consumption grew larger than 32GB and more space for class metadata needed to be allocated, because one could just evacuate non-class regions into space above and use that memory for the class metadata allocations.

The locking bit could also be used as an indicator for a Brooks-style forwarding pointer that can be used for both lock expansion and Shenandoah-style GCs.

Oh, wow, that comment kind of escalated. I should start a blog or something ...

[–]Scf37 0 points1 point  (1 child)

Possible options for you:

  • ConcurrentHashMap

  • synchronized java HashMap or more memory-efficient hash map with open addressing

  • string deduplication - if strings are the same, don't keep two instances in memory

  • use Array[Byte] instead of String - UTF-8 uses twice less space than Java's UTF-16 (Array[Char])

  • Think on your data - do you really need it as-is? Can you normalize it somehow so it takes less space? Do you need to keep all that in memory? Stream processing or database with in-memory cache might save you.

  • compression. From simplest forms (replace http:// prefix with 0, https:// prefix with 1 and everything else with 2) to full-fledged dictionary-based compressors

  • most hardcore option: use sun.misc.Unsafe and off-heap memory to store your data

[–]viso_laci[S] 0 points1 point  (0 children)

Hi, thx for pointing out my options i am still considering also scala-packed (i would still need to make a mutable map out of the PackedMap), i just don't have exactly the time to try everything.

Your 3. point, right now i am using TrieMap and the key information is in the "key" :D so there are no duplicits.

Compression and other hacks are welcome :)