What are the features that Flume provides to handle log data?

What are Hadoop and Spark?

The explosion of “Big Data” is changing data analysis processes used to solve complex problems in scientific and biomedical research, health, education, etc.

Companies, users and devices generate large amounts of data, which have grown exponentially and whose analysis helps to achieve competitive advantages. Technology is needed to move such volumes of data, and methodologies and processes are needed to access and exploit the information.

In addition to the sheer volume of information, Big Data management must take into account the great variety of data sources (mobile devices, audio, video, GPS systems, countless digital sensors in industrial equipment, automobiles, electricity meters, wind vanes, anemometers, etc.) and the speed of response needed to obtain the right information at the right time.

This information is generated by people, directly and indirectly, on a continuous basis through the activities we perform many times a day with smartphones, online financial transactions and databases of population data, as well as through machine-to-machine (M2M) communication between computers.

Features of Apache Hadoop

That is why, as the articles progress, these concepts (or parts of them) will be put into practice through examples of increasing complexity, so don’t worry if at the beginning things don’t seem to make sense: by the end they will.

I will try to make this series of articles as practical as possible (although that is not always feasible, especially when dealing with topics that are complex to understand and where technology is involved… you know what I mean… hehe). So I promise that this will be the “heaviest” article of the series: it is very theoretical and, at some points, quite complicated.

It must be borne in mind that data, initially and by themselves, do not add value to decision making; they simply show the aspect they represent (for example, we know the value “7” is a whole number, but nothing more). Data are not able to explain why things happen. It is the analysis and interpretation of their values as a whole, together with knowledge of their context and use, that produces what is known as information: a set of processed data that has meaning and is useful for decision making. And, as is well known, nowadays information is power.

AWS Kinesis

There is a widely recognized talent gap. It can be difficult to find entry-level programmers with sufficient Java skills to be productive with MapReduce. That’s one reason why distribution vendors are rushing to put relational technology (SQL) on top of Hadoop: it is much easier to find programmers with SQL skills than with MapReduce skills. And Hadoop administration seems to be part art and part science, requiring basic knowledge of operating systems, hardware and Hadoop kernel settings.
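To illustrate the gap, here is a minimal sketch of the same question answered both ways: counting events per user in a log. The table and column names (user_events, user_id) and the tab-separated input format are illustrative assumptions, not something from the article.

```java
// In Hive-style SQL the whole job is a single statement:
//   SELECT user_id, COUNT(*) FROM user_events GROUP BY user_id;
//
// The equivalent hand-written MapReduce job already needs this much Java
// just for the map side (a reducer that sums the ones is still required).
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UserEventMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assume one event per line, with the user id in the first field.
        String[] fields = line.toString().split("\t");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            userId.set(fields[0]);
            context.write(userId, ONE); // the reducer sums the ones per user
        }
    }
}
```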

Data security. Another challenge centers on fragmented data security, although new tools and technologies are emerging. The Kerberos authentication protocol is an important step toward making Hadoop environments secure.
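As a minimal sketch of what Kerberos means for client code, the snippet below switches a Hadoop client from the default “simple” authentication to Kerberos and logs in from a keytab. The principal and keytab path are placeholder values, assumed only for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Authenticate with Kerberos instead of the default "simple"
        // (trust-the-username) mode.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in from a keytab so the process can authenticate unattended.
        // Principal and keytab path are illustrative placeholders.
        UserGroupInformation.loginUserFromKeytab(
                "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");

        System.out.println("Logged in as: "
                + UserGroupInformation.getCurrentUser());
    }
}
```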

Comprehensive data management and governance. Hadoop lacks easy-to-use, feature-rich tools for data management, data cleansing, governance and metadata. Tools for data quality assurance and data standardization are especially lacking.

Data Streaming

Hadoop is an Apache Software Foundation project built and used by a global community of contributors,[2] using the Java programming language. Yahoo! has been the largest contributor to the project,[3] and uses Hadoop extensively in its business.[4]

Hadoop basically consists of Hadoop Common, which provides access to the file systems supported by Hadoop. The Hadoop Common software package contains the .jar files and scripts needed to run Hadoop. The package also provides source code, documentation and a contribution section that includes projects from the Hadoop community.
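A minimal sketch of that file-system abstraction in use: the Hadoop Common FileSystem API listing a directory. The directory path /data/logs is an assumed example, and the code simply picks up whatever file system the client configuration points at (HDFS, the local file system, etc.).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath, so the
        // same code works against any file system supported by Hadoop.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        for (FileStatus status : fs.listStatus(new Path("/data/logs"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
    }
}
```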

A key piece of functionality is that, for effective job scheduling, each Hadoop-compatible file system should know and provide its location: the name of the rack (more precisely, of the network switch) where each worker node sits. Hadoop applications can use this information to run jobs on the node where the data is located and, failing that, on the same rack/switch, thus reducing backbone traffic. HDFS also uses this information when replicating data, trying to keep the different copies in different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur the data remains readable.[8]
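A minimal sketch of how that location awareness is exposed to clients: asking HDFS where the blocks of a file physically live, which is the same information the schedulers use to place tasks on or near the data. The file path is an assumed example, and the topology paths shown (e.g. /rack1/node07) depend entirely on how the cluster’s rack mapping is configured.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus file = fs.getFileStatus(new Path("/data/logs/events.log"));

        // One BlockLocation per HDFS block; each lists the hosts holding a
        // replica and their rack/host topology paths.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(file, 0, file.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d hosts=%s racks=%s%n",
                    block.getOffset(),
                    String.join(",", block.getHosts()),
                    String.join(",", block.getTopologyPaths()));
        }
    }
}
```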
