Column Stores and Hadoop
Switching gears a bit from the NoSQL to the Hadoop world ... here's a quick preview of some work we did on storage organization on Hadoop. We started this work to investigate how a columnar storage layer could be implemented for Hadoop and whether it would lead to any insights that weren't already known in the context of parallel DBMSs. It turned up some pretty interesting results.

First, we built an InputFormat/OutputFormat pair on Hadoop 0.21 that uses some of the new APIs for a pluggable BlockPlacementPolicy. We gave it a rather inventive name -- CIF and COF -- for ColumnInputFormat and ColumnOutputFormat :-) Instead of using a PAX-like layout as RCFile does, CIF lets you use true columnar storage, where each column is stored in a separate file. As one would expect, when you scan only a small number of columns from a much wider dataset, CIF eliminates the I/O for the unnecessary columns and improves your map-phase performanc...
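
To make the one-file-per-column idea concrete, here is a minimal sketch in plain Python. It is not CIF's actual on-disk format or API: the `.col` file naming, the one-value-per-line encoding, and the helper names `write_columnar` and `scan_columns` are all hypothetical, chosen only to show why a scan over a few columns touches fewer bytes than a scan over the full width.

```python
import os
import tempfile


def write_columnar(rows, columns, directory):
    """Write each column to its own file, one value per line (hypothetical layout)."""
    for idx, name in enumerate(columns):
        path = os.path.join(directory, name + ".col")
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(f"{row[idx]}\n")


def scan_columns(directory, wanted):
    """Open only the files for the wanted columns; return (records, bytes_read)."""
    bytes_read = 0
    per_column = []
    for name in wanted:
        path = os.path.join(directory, name + ".col")
        bytes_read += os.path.getsize(path)
        with open(path, encoding="utf-8") as f:
            per_column.append([line.rstrip("\n") for line in f])
    # Stitch the column values back into records by position.
    return list(zip(*per_column)), bytes_read


def demo():
    columns = ["id", "url", "body", "referrer"]
    # A "wide" dataset: the body column dominates the row size.
    rows = [(i, f"http://example.com/{i}", "x" * 1000, "none") for i in range(100)]
    with tempfile.TemporaryDirectory() as d:
        write_columnar(rows, columns, d)
        narrow, narrow_bytes = scan_columns(d, ["id", "url"])
        _, full_bytes = scan_columns(d, columns)
    return narrow, narrow_bytes, full_bytes


if __name__ == "__main__":
    narrow, narrow_bytes, full_bytes = demo()
    print(f"2-column scan read {narrow_bytes} bytes; full scan read {full_bytes} bytes")
```

In this toy layout the two-column scan never opens the `body` file at all, which is the same effect described above, only on a local filesystem rather than HDFS.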