In big data scenarios, data is often replicated to improve system response time, efficiency, and tolerance of network issues, and the resulting redundancy is later eliminated through deduplication. Data is also compressed for storage efficiency and for further abstraction or regression. The authors propose DCDedupe, a system for on-premise distributed storage that selectively compresses data and performs deduplication efficiently, guided by analytics, hardware acceleration, and design parameters (cost, efficiency, and effectiveness). DCDedupe is centered on (1) a quick decision mechanism that yields acceptable accuracy and (2) an algorithm for selecting, marshaling, and routing data chunks so that they are sent to the right nodes of the distributed storage system.
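The paper itself contains no code; the following minimal Python sketch is my own illustration of the general idea of a quick, sampling-based chunk classification followed by fingerprint-based routing, not the authors' actual algorithm. The parameters NUM_NODES, SAMPLE_SIZE, and ENTROPY_THRESHOLD are assumptions for the example.

```python
# Hypothetical sketch, not the authors' algorithm: estimate a chunk's entropy
# from a small byte sample to decide whether delta compression is worthwhile,
# then route the chunk to a node by hashing its fingerprint so that duplicate
# chunks consistently land on the same node.
import hashlib
import math
import random

NUM_NODES = 4            # assumed cluster size
SAMPLE_SIZE = 256        # assumed number of sampled bytes per chunk
ENTROPY_THRESHOLD = 6.5  # assumed bits/byte; above this, skip delta compression

def sampled_entropy(chunk: bytes) -> float:
    """Estimate Shannon entropy (bits per byte) from a random sample of the chunk."""
    if not chunk:
        return 0.0
    sample = random.sample(chunk, min(SAMPLE_SIZE, len(chunk)))
    counts = {}
    for byte in sample:
        counts[byte] = counts.get(byte, 0) + 1
    total = len(sample)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def classify(chunk: bytes) -> str:
    """Quick decision: low-entropy chunks are candidates for delta compression."""
    return "delta-compress" if sampled_entropy(chunk) < ENTROPY_THRESHOLD else "dedupe-only"

def route(chunk: bytes) -> int:
    """Pick a target node from the chunk fingerprint (consistent for duplicates)."""
    fingerprint = hashlib.sha1(chunk).digest()
    return int.from_bytes(fingerprint[:8], "big") % NUM_NODES

if __name__ == "__main__":
    chunk = b"example log line\n" * 200
    print(classify(chunk), "-> node", route(chunk))
```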
The paper is divided into sections: “Introduction,” “Related Work,” “Deduplication vs. Delta Compression,” “Design,” “Evaluation,” and “Conclusions.” Drawing on the conclusions of a case study, the design section describes DCDedupe's design principles and considerations, the choice of architecture and system, chunk classification methods, routing algorithms, delta compression levels, and the overall work and data flow. The evaluation section covers the experimental setup, storage efficiency results, sampling methods, and the memory usage of sampling records. In the last section, the authors conclude that DCDedupe improves decision-making accuracy in pre-processing and reduces storage space requirements by 30 percent, at the cost of some penalty in processing speed (between 15 and 22 percent). Further work is required on pre-processing methods, fault tolerance enhancements, and handling server overload.
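To make the “Deduplication vs. Delta Compression” distinction concrete, here is a small Python illustration of my own (not taken from the paper): exact duplicates can be removed by deduplication alone, while a near-duplicate chunk must either be stored in full or encoded as a small delta against an already-stored base chunk, approximated here with zlib's preset-dictionary support.

```python
# Illustrative only: contrast deduplication (exact match) with delta compression
# (near-duplicate encoded against a stored base chunk).
import random
import zlib

random.seed(0)
base = bytes(random.getrandbits(8) for _ in range(4000))  # previously stored chunk
similar = base[:3990] + bytes(10)                          # near-duplicate: last 10 bytes differ

# Deduplication removes only exact duplicates, so this chunk would be stored in full.
print("exact duplicate:", similar == base)

# Delta compression: encode the similar chunk against the base as a preset dictionary.
compressor = zlib.compressobj(zdict=base)
delta = compressor.compress(similar) + compressor.flush()
print("full size:", len(similar),
      "compressed alone:", len(zlib.compress(similar)),
      "delta size:", len(delta))

# Reconstruction needs the base chunk, which is one reason a system like DCDedupe
# must route related chunks to the right node.
decompressor = zlib.decompressobj(zdict=base)
assert decompressor.decompress(delta) + decompressor.flush() == similar
```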
The paper has 28 references and should interest readers working in the big data field.