[ACCEPTED]-Too much data duplication in mongodb?-norm
Well, that is the trade-off with document stores. You can store in a normalized fashion like any standard RDBMS, and you should strive for normalization as much as possible. It's only where it's a performance hit that you should break normalization and flatten your data structures. The trade-off is read efficiency vs. update cost.
Mongo has really efficient indexes, which can make normalizing easier, like a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a junction table in a tabular data store. Index the event and user fields and it should be pretty quick, and it will help you normalize your data better.
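A minimal sketch of that relation collection, modeled here with plain Python dicts standing in for Mongo collections (the collection and field names are illustrative assumptions, not from the question):

```python
# Illustrative in-memory stand-ins for three Mongo collections:
# users, events, and a user_events relation collection (names assumed).
users = {
    "u1": {"_id": "u1", "name": "Alice"},
    "u2": {"_id": "u2", "name": "Bob"},
}
events = {
    "e1": {"_id": "e1", "title": "Signup"},
    "e2": {"_id": "e2", "title": "Login"},
}
# Each relation document holds just the two foreign keys; in Mongo you
# would index both fields so lookups in either direction stay fast.
user_events = [
    {"user_id": "u1", "event_id": "e1"},
    {"user_id": "u1", "event_id": "e2"},
    {"user_id": "u2", "event_id": "e1"},
]

def events_for_user(user_id):
    """Resolve a user's events through the relation collection --
    the app-side join you do instead of duplicating event documents."""
    return [events[r["event_id"]] for r in user_events
            if r["user_id"] == user_id]

print([e["title"] for e in events_for_user("u1")])  # → ['Signup', 'Login']
```

The same shape works with real collections: one query against the indexed relation collection, then a second fetch (or `$in` query) for the referenced documents.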
I like to plot the efficiency of flattening a structure vs. keeping it normalized in terms of the time it takes me to update a record's data vs. reading out what I need in a query. You can do it in terms of big-O notation, but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much work is required.
Basically what I do is first try to predict the probability of how many updates a record will have vs. how often it's read. Then I try to predict what the cost of an update is vs. a read, when the data is either normalized or flattened (or maybe some partial combination of the two... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I've plotted all the variables, if keeping it flat saves me a bunch, then I will keep it flat.
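That back-of-envelope exercise can be sketched in a few lines; all the rates and per-operation costs below are made-up illustrative assumptions, not measurements:

```python
# Back-of-envelope model of the flat-vs-normalized trade-off described
# above. All workload rates and per-operation costs are assumed numbers.
def total_cost(reads, writes, read_cost, write_cost):
    """Total work for a workload of `reads` reads and `writes` writes."""
    return reads * read_cost + writes * write_cost

# Assume a read-heavy workload: 1000 reads for every 10 writes.
reads, writes = 1000, 10

# Flattened: a read is one document fetch (cost 1), but each write must
# touch every duplicated copy (cost 20).
flat = total_cost(reads, writes, read_cost=1, write_cost=20)

# Normalized: a read joins several collections (cost 5), but a write
# touches a single document (cost 1).
normalized = total_cost(reads, writes, read_cost=5, write_cost=1)

print(flat, normalized)  # → 1200 5010 — flat wins for this workload
```

Flip the read/write ratio and the conclusion flips with it, which is exactly the gut check the paragraph above describes.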
A few tips:
- If you require lookups to be fast and atomic (perfectly up to date), you may want to favor flattening over normalization and take the hit on the update.
- If you require updates to be quick and the data to be accessible immediately, then favor normalization.
- If you require fast lookups but don't require perfectly up-to-date data, consider building your flattened views from the normalized data in batch jobs (possibly using map/reduce).
- If your queries need to be fast, updates are rare, and your updates do not need to be visible immediately or guaranteed durable 100% of the time (with transaction-level locking to confirm the write hit disk), you can consider writing your updates to a queue and processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later.)
- Profile different models. Build out a data query abstraction layer (like an ORM, in a way) in your code so you can refactor your data store structure later.
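That last tip, the abstraction layer, might look something like this minimal sketch: callers depend on a repository interface, so the backing store (flat vs. normalized, Mongo vs. anything else) can be swapped without touching application code. All names here are illustrative assumptions:

```python
# Sketch of a data-access abstraction layer. The rest of the app talks
# only to the interface, so the storage model can be refactored later.
class UserEventRepository:
    """Interface the application depends on (names are illustrative)."""
    def add_event(self, user_id, event):
        raise NotImplementedError
    def events_for(self, user_id):
        raise NotImplementedError

class InMemoryUserEventRepository(UserEventRepository):
    """Trivial in-memory backend; a Mongo-backed implementation would
    expose the same two methods and could store flat or normalized."""
    def __init__(self):
        self._events = {}
    def add_event(self, user_id, event):
        self._events.setdefault(user_id, []).append(event)
    def events_for(self, user_id):
        return list(self._events.get(user_id, []))

repo = InMemoryUserEventRepository()
repo.add_event("u1", {"title": "Signup"})
print(repo.events_for("u1"))  # → [{'title': 'Signup'}]
```

Because only the repository knows the schema, you can profile a flattened backend against a normalized one behind the same interface.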
There are a lot of other ideas you can employ. There are a lot of great blogs online that go into it, like highscalability.com, and make sure you understand the CAP theorem.
Also consider a caching layer, like Redis or memcached. I will put one of those products in front of my data layer. When I query Mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I invalidate any data in the cache that references what I'm updating. (Although you have to factor the time it takes to invalidate data, and to track which cached data is being updated, into your scaling considerations.) Someone once said, "The two hardest things in Computer Science are naming things and cache invalidation."
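A minimal cache-aside sketch of that pattern, with a plain dict standing in for Redis/memcached and another for the normalized Mongo store (all names and data are illustrative assumptions):

```python
# Cache-aside with invalidation, as described above. A dict stands in
# for the cache (Redis/memcached) and another for the normalized store.
cache = {}  # flattened representations, keyed by user id
db = {"u1": {"name": "Alice", "events": ["Signup"]}}  # source of truth

def read_user(user_id):
    """Serve from cache; on a miss, build the flat view from the db."""
    if user_id not in cache:
        doc = db[user_id]
        # Construct the flattened representation and cache it.
        cache[user_id] = {"name": doc["name"],
                          "event_count": len(doc["events"])}
    return cache[user_id]

def add_user_event(user_id, event):
    """Write to the source of truth, then evict the stale flat view."""
    db[user_id]["events"].append(event)
    cache.pop(user_id, None)  # invalidation -- the famously hard part

read_user("u1")                  # warms the cache
add_user_event("u1", "Login")    # evicts the now-stale entry
print(read_user("u1")["event_count"])  # → 2
```

The eviction in `add_user_event` is the bookkeeping the parenthetical warns about: every write path has to know which cached keys it makes stale.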
Hope that helps!
Try adding an IList of type UserEvent as a property on your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.