Computing Reviews

A Hadoop based platform for natural language processing of web pages and documents
Nesi P., Pantaleo G., Sanesi G. Journal of Visual Languages and Computing 31, Part B, 130-138, 2015. Type: Article
Date Reviewed: 03/10/16

The digital universe grows every year and is expected to reach 40 zettabytes by 2020. This growth has many causes, including the Internet of Things (IoT) and new users buying smartphones, tablets, and personal computers (PCs). This huge amount of data needs to be processed, but machines with a single central processing unit (CPU), or even a single multi-core machine, cannot handle it in a reasonable amount of time. To address this issue, big data technologies have been introduced that can manipulate, extract, and even learn from data at this scale. Big data management has allowed well-known companies (Google, Facebook, Instagram, Twitter, and so on) to emerge and lead the IT market. Many of these companies have adopted Apache Hadoop, an ecosystem designed to support distributed applications, large-scale data processing, and storage, with high scalability.

Natural language processing (NLP) is one of many fields that are advancing, but producing results is time consuming; this makes it a good candidate for Apache Hadoop. In this paper, the authors present a distributed system for crawling web documents and extracting keywords and phrases using the Apache Hadoop ecosystem. The architecture is built on Hadoop's master node and data nodes, the roles on which the Hadoop distributed file system (HDFS) relies. Within this architecture, the authors use the open-source GATE NLP platform, a general-purpose framework for text processing. The results presented show a significant time improvement, from 115 hours on a single non-Hadoop PC to around seven hours with an HDFS single-node PC.
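
To make the idea concrete, here is a minimal sketch of keyword counting split into a map step and a reduce step, the pattern that Hadoop distributes across nodes. It is not the authors' implementation: it runs in a single Python process, it does not use GATE or Hadoop, and the names `STOP_WORDS`, `map_keywords`, `reduce_counts`, `extract_keywords`, and `speedup` are hypothetical, chosen only for illustration. The last helper simply works out the speedup implied by the reported timings.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def map_keywords(document: str) -> Counter:
    """Map step: count candidate keywords in one document."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge the partial counts of two mappers."""
    return left + right


def extract_keywords(documents, top_n=3):
    """Run the map step per document, then fold the partial counts."""
    partial_counts = map(map_keywords, documents)
    total = reduce(reduce_counts, partial_counts, Counter())
    return total.most_common(top_n)


def speedup(baseline_hours: float, improved_hours: float) -> float:
    """Ratio of the baseline running time to the improved running time."""
    return baseline_hours / improved_hours
```

With the figures quoted above, `speedup(115, 7)` comes to roughly 16.4, that is, the Hadoop run is about 16 times faster than the single non-Hadoop PC.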

I found this paper valuable since it shows readers how to drastically reduce processing time for an NLP problem using Apache Hadoop and HDFS. I sincerely recommend it to those in the NLP field.

Reviewer:  Karim Hadjar Review #: CR144229 (1605-0337)
