Document Type
Conference Paper
Publication Date
9-25-2020
Keywords
Hadoop Data Platform, Hadoop Distributed File System, NiFi, streaming data, unstructured data, distributed, parallel, and cluster computing, artificial intelligence, AI, databases, machine learning
Abstract
Large organizations are seeking new architectures and scalable platforms to handle data management challenges arising from explosive data growth on a scale rarely seen in the past. These challenges stem largely from the availability of high-velocity streaming data from various sources in multiple formats. This shift in the data paradigm has led to the emergence of new data analytics and management architectures. This paper focuses on storing high-volume, high-velocity, and high-variety data in its raw format in a data storage architecture called a data lake. First, we present our study of the limitations of traditional data warehouses in handling recent changes in data paradigms. We then discuss and compare different open source and commercial platforms that can be used to develop a data lake. Next, we describe our end-to-end data lake design and implementation approach using the Hadoop Distributed File System (HDFS) on the Hadoop Data Platform (HDP). Finally, we present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics that combines structured and unstructured data. This study can serve as a guide for individuals or organizations planning to implement a data lake solution for their use cases.
Faculty
Sheridan Research
Version
Pre-print
Copyright
© Ruoran Liu, Haruna Isah, & Farhana Zulkernine, 2020
Terms of Use
Terms of Use for Works posted in SOURCE.
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Original Publication Citation
Liu, R., Isah, H., and Zulkernine, F. (2020). Big Data Lake for Multilevel Streaming Analytics. arXiv. https://doi.org/10.48550/arXiv.2009.12415
SOURCE Citation
Liu, Ruoran; Isah, Haruna; and Zulkernine, Farhana, "A Big Data Lake for Multilevel Streaming Analytics" (2020). Publications and Scholarship. 5.
https://source.sheridancollege.ca/centres_publications/5
Included in
Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons, Data Storage Systems Commons