The Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
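The block-and-replica layout the abstract alludes to can be sketched in a few lines. This is an illustrative model, not the HDFS API: the block size matches a common HDFS default, but the `split_into_blocks` and `assign_replicas` helpers, the round-robin policy, and the node names are assumptions for the sketch (real HDFS places replicas rack-aware via the NameNode).

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # common HDFS default block size (128 MB)

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    # Number of blocks a file occupies; the last block may be partial.
    return max(1, -(-file_size // block_size))  # ceiling division

def assign_replicas(num_blocks: int, datanodes: list[str], replication: int = 3):
    # Toy round-robin placement: each block gets `replication` distinct nodes,
    # so a single server loss never destroys all copies of a block.
    ring = itertools.cycle(datanodes)
    return [[next(ring) for _ in range(replication)] for _ in range(num_blocks)]

print(split_into_blocks(1024**3))  # a 1 GB file → 8 blocks
```

Spreading blocks this way is what lets both capacity and aggregate read bandwidth grow as servers are added.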
Ceph: A Scalable, High-Performance Distributed File System
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
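The key idea the abstract names, replacing allocation tables with a pseudo-random placement function, can be illustrated with rendezvous (highest-random-weight) hashing. This is a simplified stand-in, not CRUSH itself: `place` and `_weight` are hypothetical helpers, and real CRUSH additionally encodes the cluster-map hierarchy, device weights, and failure domains.

```python
import hashlib

def _weight(obj_id: str, osd: str) -> int:
    # Deterministic pseudo-random score for an (object, device) pair.
    digest = hashlib.sha256(f"{obj_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(obj_id: str, osds: list[str], replicas: int = 3) -> list[str]:
    # Any client can compute placement locally: pick the `replicas`
    # highest-scoring devices, with no allocation table to consult.
    return sorted(osds, key=lambda o: _weight(obj_id, o), reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(8)]
print(place("object-42", osds))  # three distinct OSDs, stable across calls
```

Because placement is a pure function of the object name and the device list, metadata servers never need to track where data lives.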
GPFS: A Shared-Disk File System for Large Computing Clusters
GPFS is IBM’s parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
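The distributed-locking idea the abstract highlights can be sketched as a toy token manager: a node that has been granted a token for a resource may keep acquiring it locally, with no further messages, until the token is revoked on behalf of a conflicting node. The class and method names are assumptions for illustration, not the GPFS implementation (which also supports byte-range tokens and token recovery after node failure).

```python
class TokenManager:
    """Toy sketch of token-based distributed locking: holding a token lets a
    node lock a resource repeatedly without contacting the manager again."""

    def __init__(self):
        self._holder: dict[str, str] = {}  # resource -> node holding its token

    def acquire(self, node: str, resource: str) -> bool:
        holder = self._holder.get(resource)
        if holder is None or holder == node:
            self._holder[resource] = node
            return True   # granted, or already cached by this node
        return False      # conflict: the current holder must be revoked first

    def revoke(self, resource: str) -> None:
        # In a real system this triggers flushing the holder's dirty state.
        self._holder.pop(resource, None)
```

The scalability win is that uncontended locks cost zero network round trips after the first grant; only conflicting access pays for coordination.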
Apache Flink® — Stateful Computations over Data Streams
Alluxio: A Virtual Distributed File System
The world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and the Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. Storing and processing these data has exposed tremendous challenges and opportunities.
Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework and grew to many different general and specialized systems, such as Apache Spark for general data processing; Apache Storm and Apache Samza for stream processing; Apache Mahout for machine learning; TensorFlow and Caffe for deep learning; and Presto and Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads, and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases, to realize different tradeoffs in cost, speed, and semantics.
This increasing complexity in the stack creates multi-fold challenges. Data is siloed in various storage systems, making it difficult for users and applications to find and access the data efficiently. For example, for system developers, it requires more work to integrate a new compute or storage component as a building block to work with the existing ecosystem. For data application developers, understanding and managing the correct way to access different data stores becomes more complex. For end users, accessing data from various and often remote data stores often results in performance penalties and semantics mismatches. For system admins, adding, removing, or upgrading an existing compute or data store, or migrating data from one store to another, can be arduous if the physical storage has been deeply coupled with all applications. …
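The virtual-file-system idea, one namespace mounted over many under-stores, can be sketched as prefix routing. Everything here (the class, the method names, the dict-backed stores, the mount paths) is illustrative, not the Alluxio API.

```python
class VirtualFS:
    """Sketch of a virtual distributed file system: applications see one
    namespace; each path prefix is routed to whichever store backs it."""

    def __init__(self):
        self._mounts: dict[str, dict] = {}  # mount prefix -> backing store

    def mount(self, prefix: str, store: dict) -> None:
        self._mounts[prefix] = store

    def _resolve(self, path: str):
        # Longest matching mount prefix wins, like nested mount points.
        prefix = max((p for p in self._mounts if path.startswith(p)), key=len)
        return self._mounts[prefix], path[len(prefix):]

    def read(self, path: str) -> bytes:
        store, key = self._resolve(path)
        return store[key]

fs = VirtualFS()
fs.mount("/s3/", {"logs/a": b"from-object-store"})
fs.mount("/hdfs/", {"data/b": b"from-hdfs"})
print(fs.read("/s3/logs/a"))  # → b'from-object-store'
```

With this indirection in place, an admin can swap or migrate a backing store by changing a mount, without touching the applications that read through the unified namespace.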
STRETCH: In-Memory Storage with Autoscaling for Cluster Computing
IoT-enabled devices and observational instruments continuously generate voluminous data. A large portion of these datasets is delivered with associated geospatial locations. The increased volumes of geospatial data, alongside emerging geospatial services, pose computational challenges for large-scale geospatial analytics. We have designed and implemented STRETCH, an in-memory distributed geospatial storage system that preserves spatial proximity and enables proactive autoscaling for frequently accessed data. STRETCH stores data with a delayed data dispersion scheme that incrementally adds data nodes to the storage system. STRETCH is built on Apache Spark and Ignite and interacts with them at runtime.
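Preserving spatial proximity when partitioning geospatial data is commonly done with a space-filling curve; a minimal Z-order (Morton code) sketch is shown below. The function names and the partitioning rule are assumptions for illustration, and the paper's actual scheme may differ.

```python
def z_order_key(lat: float, lon: float, bits: int = 8) -> int:
    # Quantize each coordinate, then interleave their bits (Z-order curve):
    # nearby points tend to share key prefixes, preserving spatial proximity.
    y = int((lat + 90) / 180 * ((1 << bits) - 1))
    x = int((lon + 180) / 360 * ((1 << bits) - 1))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def partition(lat: float, lon: float, num_nodes: int) -> int:
    # Range-partition the 16-bit Z-order key space across storage nodes,
    # so points that are close on the map usually land on the same node.
    return z_order_key(lat, lon) * num_nodes >> 16
```

Because nearby coordinates map to nearby keys, range queries over a small region touch few partitions instead of scattering across the whole cluster.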
The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM
Disk-oriented approaches to online storage are becoming increasingly problematic: they do not scale gracefully to meet the needs of large-scale Web applications, and improvements in disk capacity have far outstripped improvements in access latency and bandwidth. This paper argues for a new approach to datacenter storage called RAMCloud, where information is kept entirely in DRAM and large-scale systems are created by aggregating the main memories of thousands of commodity servers. We believe that RAMClouds can provide durable and available storage with 100-1000x the throughput of disk-based systems and 100-1000x lower access latency. The combination of low latency and large scale will enable a new breed of data-intensive applications.
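The scale-and-latency claim lends itself to a back-of-envelope check. All numbers below are assumptions for illustration (a commodity DRAM size per server, a roughly 10 ms disk seek, a roughly 10 µs networked DRAM access), not figures taken from the paper:

```python
# Aggregate capacity: thousands of commodity servers pooling their DRAM.
servers = 2000
dram_per_server_gb = 64                    # assumed commodity server
capacity_tb = servers * dram_per_server_gb / 1024
print(f"aggregate DRAM: {capacity_tb:.0f} TB")  # → aggregate DRAM: 125 TB

# Latency: a mechanical seek vs. a DRAM read reached over the network.
disk_seek_us = 10_000                      # ~10 ms
networked_dram_access_us = 10              # ~10 µs
print(f"latency ratio: ~{disk_seek_us // networked_dram_access_us}x")
```

Even with modest per-server memory, the pooled capacity reaches interesting scales, and the latency gap lands in the 100-1000x range the abstract claims.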