Computing Reviews
A Hadoop based platform for natural language processing of web pages and documents
Nesi P., Pantaleo G., Sanesi G. Journal of Visual Languages and Computing 31, Part B: 130-138, 2015. Type: Article
Date Reviewed: Mar 10 2016

The digital universe grows every year and is projected to reach 40 zettabytes by 2020. This growth has many causes, including the Internet of Things (IoT) and new users buying smartphones, tablets, and personal computers (PCs). This huge amount of data needs to be processed, but machines with a single central processing unit (CPU), or even a single multi-core machine, cannot handle it in a reasonable amount of time. To address this issue, big data technologies have been introduced that can manipulate, extract, and even learn from data at this scale. Big data management has allowed well-known companies (Google, Facebook, Instagram, Twitter, and so on) to emerge and lead the IT market. In practice, many of these companies have adopted Apache Hadoop, an ecosystem designed to support distributed applications, large-scale data processing, and storage with high scalability.

Natural language processing (NLP) is one of many fields that are advancing, but producing results is time consuming, which makes it a good candidate for Apache Hadoop. In this paper, the authors present a distributed system for crawling web documents and extracting keywords and phrases using the Apache Hadoop ecosystem. Their distributed architecture is built on Hadoop's master nodes and data nodes, which are part of the Hadoop distributed file system (HDFS). Within this architecture, the authors use GATE, an open-source NLP platform that supports a wide range of text processing tasks. The results show a significant improvement in processing time, from 115 hours on a single non-Hadoop PC to around seven hours with an HDFS single-node PC.
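
To make the idea of distributed keyword extraction concrete, here is a minimal single-machine sketch of the map and reduce steps such a system might apply to each document. It is not the authors' pipeline and does not use Hadoop or GATE. The function names, the stopword list, and the tokenization rule are all assumptions made for illustration.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"})


def map_keywords(document: str) -> Counter:
    """Map step: count candidate keywords in a single document."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge the per-document counts."""
    return left + right


def top_keywords(documents, k=3):
    """Run map over all documents, reduce the results, and return the k most frequent terms."""
    total = reduce(reduce_counts, map(map_keywords, documents), Counter())
    return total.most_common(k)


if __name__ == "__main__":
    docs = [
        "Hadoop stores data in HDFS",
        "Hadoop processes data with MapReduce",
    ]
    print(top_keywords(docs, k=2))
```

In a Hadoop deployment, the map step would run in parallel on the nodes that hold each document block, and only the much smaller per-document counts would travel over the network to the reducers.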

I found this paper valuable because it shows readers how to drastically reduce processing time for an NLP problem using Apache Hadoop and HDFS. I sincerely recommend it to those in the NLP field.

Reviewer:  Karim Hadjar Review #: CR144229 (1605-0337)
Natural Language (H.5.2 ... )
 
 
Management (D.2.9 )
 
Other reviews under "Natural Language": Date
Designing effective speech interfaces
Weinschenk S., Barker D., John Wiley & Sons, Inc., New York, NY, 2000.  405, Type: Book (9780471375456)
Jun 1 2000
Spoken dialogue technology: enabling the conversational user interface
McTear M. ACM Computing Surveys 34(1): 90-169, 2002. Type: Article
Jul 26 2002
Limitations of concurrency in transaction processing
Franaszek P., Robinson J. ACM Transactions on Database Systems 10(1): 1-28, 1985. Type: Article
Jan 1 1986
