DonSmith
4 Posts

Pinned topic GPFS usage

‏2013-09-24T22:33:11Z |

Greetings,

I'm trying to determine if installing/using GPFS with BigInsights 2.1 is a useful and recommended practice. My cluster will only have 5 data nodes (to start) but has the potential to grow.

Is anyone else using GPFS? Does anyone have performance or management comparisons with HDFS?

Thanks,
Don

  • john_poelman
    6 Posts
    ACCEPTED ANSWER

    Re: GPFS usage

    ‏2013-09-24T22:55:24Z  

    I've done performance comparisons and have found that GPFS is roughly on par with HDFS.  There are cases where it is faster, cases where it is slower.

    Keep your cluster simple and use HDFS if you don't need the bells and whistles that come with GPFS.  If you need your distributed file system (DFS) to be POSIX-compliant and/or you need enterprise features like snapshots, then GPFS is the way to go.

    Here are a couple of tips if you decide to go with GPFS:

    1) If your cluster will primarily be used for map/reduce, then consider changing the Hadoop parameter "mapred.local.dir" to point to local storage instead of GPFS.  In many cases, especially on clusters running many small jobs, this will help performance.  Give mapred.local.dir multiple paths if possible.  It's OK for GPFS and a local ext4 file system used by mapred.local.dir to share the same devices.

    2) Avoid storing your data as many (tens of thousands or more) small files.  Instead, if possible, merge files so that each is in the tens of megabytes or larger.
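
    As a rough illustration of the two tips above, here is a minimal Python sketch. It is not from the original post: the function names, the directory paths, and the 64 MB default target are all hypothetical choices made for the example. The first helper renders a mapred.local.dir property with several comma-separated local paths, as it might appear in mapred-site.xml. The second concatenates small files into larger ones, which assumes the data is record-oriented (for example newline-terminated text) so that plain concatenation is safe.

```python
import os
import shutil
import tempfile


def local_dir_property(paths):
    """Render a mapred.local.dir property for mapred-site.xml.

    The value is a comma-separated list of directories. The paths passed in
    are hypothetical local (non-GPFS) mount points.
    """
    value = ",".join(paths)
    return (
        "<property>\n"
        "  <name>mapred.local.dir</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


def merge_small_files(src_paths, out_dir, target_bytes=64 * 1024 * 1024):
    """Concatenate small files into larger files of roughly target_bytes.

    Assumption: inputs can be joined byte-for-byte (e.g. text files whose
    last line ends with a newline). Returns the list of merged file paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    merged = []
    out = None
    written = 0
    for path in src_paths:
        if out is None:
            # Start a new output file once the previous one reached the target.
            name = os.path.join(out_dir, f"merged-{len(merged):05d}.dat")
            out = open(name, "wb")
            merged.append(name)
            written = 0
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)
        written += os.path.getsize(path)
        if written >= target_bytes:
            out.close()
            out = None
    if out is not None:
        out.close()
    return merged
```

    For example, local_dir_property(["/data1/mapred/local", "/data2/mapred/local"]) produces a property whose value lists both directories, and merge_small_files(list_of_inputs, "/tmp/merged") writes merged-00000.dat, merged-00001.dat, and so on. The merge runs on local files before the data is loaded into the DFS; whether that fits depends on how your data arrives.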

    John

    BigInsights performance
