Wednesday, 4 March 2015

Spark on avro data

Found spark-avro integration.
https://github.com/databricks/spark-avro

It worked as expected. Thx

Tuesday, 3 March 2015

HBase offheap bucket cache


How to activate short-circuit local read on HDFS & HBase


Editing....
Below are the minimum configs required:

First, create the directory /var/lib/hadoop-hdfs/ on each of the DataNodes. Then add the properties below to hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
</configuration>

On a secure cluster, you would see dn_socket replaced with dn_1090.
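The directory from the first step can be created like this (a sketch: the hdfs:hadoop owner/group is an assumption for a typical install where the DataNode runs as the hdfs user):

```shell
# Create the directory that holds the DataNode's UNIX domain socket.
# It must exist on every DataNode and match dfs.domain.socket.path.
mkdir -p /var/lib/hadoop-hdfs
chmod 750 /var/lib/hadoop-hdfs
# chown hdfs:hadoop /var/lib/hadoop-hdfs   # assumed user/group; adjust to your install
```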


To enable on HBase:

just add the same property to hbase-site.xml:

<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
</configuration>


Sunday, 1 March 2015

Incremental backup in HBase

I was looking for a foolproof backup in HBase, similar to what we have in an RDBMS.
Currently HBase only has a full-table backup concept using Snapshots (e.g. snapshot 'mytable', 'mytable-snap' from the HBase shell). There is nothing called incremental backup yet. I found one good JIRA (https://issues.apache.org/jira/browse/HBASE-7912), contributed by IBM. Hope this is soon patched into Apache HBase.

Sunday, 25 May 2014

What is the purpose of swapping in virtual memory?

Swapping is exchanging data between the hard disk and the RAM.

The goal of the virtual memory technique is to make an application think that it has more memory than actually exists. The virtual memory manager (VMM) creates a file on the hard disk called a swap file. Basically, the swap file (also known as a paging file) allows the application to store any extra data that can’t be stored in the RAM – because the RAM has limited memory. Keep in mind that an application program can only use the data when it’s actually in the RAM. Data can be stored in the paging file on the hard disk, but it is not usable until that data is brought into the RAM. Together, the data being stored on the hard disk combined with the data being stored in the RAM comprise the entire data set needed by the application program.
So, the way virtual memory works is that whenever a piece of data needed by an application program cannot be found in the RAM, then the program knows that the data must be in the paging file on the hard disk.
But in order for the program to be able to access that data, it must transfer that data from the hard disk into the RAM. This also means that a piece of existing data in the RAM must be moved to the hard disk in order to make room for the data that it wants to bring in from the hard disk. So, you can think of this process as a trade in which an old piece of data is moved from the RAM to the hard disk in exchange for a ‘new’ piece of data to bring into the RAM from the hard disk. This trade is known as swapping or paging. Another term used for this is a ‘page fault’ – which occurs when an application program tries to access a piece of data that is not currently in the RAM, but is in the paging file on the hard disk. Remember that page faults are not desirable since they cause expensive accesses to the hard disk. Expensive in this context means that accessing the hard disk is slow and takes time.

The Purpose Of Swapping

So, we can say that the purpose of swapping, or paging, is to access data being stored in hard disk and to bring it into the RAM so that it can be used by the application program. Remember that swapping is only necessary when that data is not already in the RAM.

Excessive Swapping Causes Thrashing

Excessive use of swapping is called thrashing and is undesirable because it lowers overall system performance, mainly because hard drives are far slower than RAM.
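On Linux you can check how much a box is actually swapping straight from /proc, which is a quick way to spot the thrashing described above:

```shell
# How much swap space is configured and how much is still free.
grep -E '^(SwapTotal|SwapFree)' /proc/meminfo

# Cumulative pages swapped in/out since boot; numbers that grow
# rapidly between two runs indicate heavy swapping (thrashing).
grep -E '^(pswpin|pswpout)' /proc/vmstat
```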

Sunday, 21 April 2013

Hadoop: Map Reduce: Using command line arguments in Mapper Class


In MapReduce, command line arguments are passed to the driver class. However, if you want to use them in the Mapper class, it can be done as follows (the arguments are typically supplied at launch time, e.g. hadoop jar test.jar TestDriver value1 value2, where the jar and class names are just placeholders):
In the driver class, set the command line arguments' values in the Configuration object as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

public class TestDriver
{
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
    {
        Configuration conf = new Configuration();
        // Set the values BEFORE the Job is created from this conf,
        // so that they are shipped to every map task.
        conf.set("val1", args[0]);
        conf.set("val2", args[1]);
        // ... create the Job from conf, configure it, and submit ...
    }
}

Now, in the Mapper, you can get these values as follows:
   
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The output key/value types (Text, Text) are placeholders.
public class TestMapper extends Mapper<Object, Text, Text, Text>
{
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        // Read the values that the driver put into the job configuration.
        String val1 = context.getConfiguration().get("val1");
        String val2 = context.getConfiguration().get("val2");
    }
}



Saturday, 8 December 2012

mounting namenode dir on remote NFS Server

make a directory of your choice on the server: 
mkdir /var/hdfs/dfs 

export it in /etc/exports, giving the IP address of your NN: 
/var/hdfs/dfs 10.1.2.3(rw,no_subtree_check) 

refresh your exports: 
exportfs -a 

log into the NN and verify the export is visible: 
showmount -e nfsserver.example.com 

Export list for nfsserver: 
/var/hdfs/dfs * 

make the local mount point and mount it: 
mkdir -p /mnt/nfs/dfs 
mount nfsserver.example.com:/var/hdfs/dfs /mnt/nfs/dfs 

Now mention this dir, /mnt/nfs/dfs, in your hdfs-site.xml.
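A sketch of the corresponding hdfs-site.xml entry (the local directory /var/hdfs/name is an assumption; the property is dfs.name.dir on Hadoop 1.x and dfs.namenode.name.dir on 2.x, and listing a local dir and the NFS mount comma-separated makes the NameNode write its metadata to both):

<property>
  <name>dfs.name.dir</name>
  <value>/var/hdfs/name,/mnt/nfs/dfs</value>
</property>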