Thursday 29 August 2013

Why Flume.?

Flume is not limited to collect logs from distributed systems, but it is capable of performing other use cases such as

  • Collecting readings from array of sensors
  • Collecting impressions from custom apps for an ad network
  • Collecting readings from network devices in order to monitor their performance.
  • Flume is targeted to preserve the reliability, scalability, manageability and extensibility while it serves maximum number of clients with higher QoS.

Flume was awesome to me because it is very easy to extend. The source and sink architecture is flexible and provides simpler APIs.


What is FLUME?

Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them . The primary use case is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as HDFS.




Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.




The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.







Friday 9 August 2013

Recovery of deleted files in Hadoop

There may be incidents which we accidently delete necessary files from hadoop. Sometimes the entire file system may get deleted. For doing recovery process the below steps may help you.
For doing this recovery method  trash should be enabled in hdfs. Trash can be enabled by setting the property  fs.trash.interval greater than 0. By default the value is zero.  Its value is number of minutes after which the checkpoint gets deleted. If zero, the trash feature is disabled. We have to set this property in core-site.xml.
<property>
  <name>fs.trash.interval</name>
  <value>30</value>
  <description>Number of minutes after which the checkpoint
  gets deleted.
  If zero, the trash feature is disabled.
  </description>
</property>

There is one more property which is having relation with the above property calledfs.trash.checkpoint.interval. It is the number of minutes between trash checkpoints. This should be smaller or equal to  fs.trash.interval. Everytime the checkpointer runs, it creates a new checkpoint out of current and removes checkpoints created more than fs.trash.interval minutes ago.The default value of this property is zero.


<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>15</value>
  <description>Number of minutes between trash checkpoints.
  Should be smaller or equal to fs.trash.interval.
  Every time the checkpointer runs it creates a new checkpoint
  out of current and removes checkpoints created more than
  fs.trash.interval minutes ago.
  </description>
</property>

If the above properties are enabled in your cluster. Then the deleted files will be present in .Trash directory of hdfs. You have time to recover the files until the next checkpoint occurs. After the new checkpoint the deleted files will not be present in the .Trash. So recover before the new checkpoint. If this property is not enabled in your cluster,  you can enable this for future recovery.

Back Up Mechanism for Namenode

Namenode is the single point of failure in hadoop cluster. Because it stores the metadata of the entire hadoop system.
So extra care should be given in maintaining it. We use the best hardware for namenode machines.
Even if we use best hardware, complete protection cannot be guarenteed, because hardware issues can happen at anytime. So a backup for namenode is very necessary.
One of the methods is creating a simple backup storage by mounting the partition of another machine located in a different place to the namenode machine.
The back up machine should have the same hardware/software specifications as of namenode machine and installed with hadoop similar to namenode machine. But hadoop services are not started in that machine.
Incase of failure, we can start namenode in this backup machine and it runs like normal namenode. The only thing we need to do is assigning ipaddress/hostname of actual namenode to the backup namenode.
In the hdfs-site.xml, we are giving an additional value to dfs.name.dir property.
ie actual location, backup location.
<property>
<name>dfs.name.dir</name>
<value>/app/hadoop/name,/app/hadoop/backup</value>
<description>
Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
</description>
</property>

here /app/hadoop/name is the actual namenode storage location and /app/hadoop/backupis the location where the partition is mounted for storing the namenode backup.
In case of failure of the first namenode machine, the namenode data will be safe in the second machine(backup), so we can start the namenode in the second machine.
The second machine is placed in different location and is provided with a differnt power supply, so that the dependencies of both the machines will be different, thus making an efficient backup.