Introduction to Big Data.
What is it?:
- Analyse all kinds of data: Large volumes, valuable but difficult to extract, time sensitive.
- Multiple devices - systems, mobile devices (phones, tablets, GPS), smart meters, RFID tags and more.
- Multiple feeds - website, social media, DBs
We need to make better use of the data we have: year on year, the amount of available data increases enormously. That data could be used to inform decisions and make better choices for businesses and people.
Big Data aims to deliver that data and analyse it with the potential customers in mind, merging data sources to give more intelligent results.
What is Hadoop?
Apache open source software framework for reliable, scalable, distributed computing on massive amounts of data. It hides the underlying system details and complexities from the users. Developed in Java and consisting of three sub-projects:
- MapReduce - Master/Slave architecture for job control and execution on multiple slaves
- Hadoop Distributed File System (HDFS) - Performs 3-way replication of data across the filesystem.
- Hadoop Common.
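The 3-way replication idea above can be sketched in a few lines. This is a simplified placement scheme for illustration only (plain round-robin, not HDFS's actual rack-aware policy):

```python
# Sketch: HDFS-style 3-way block replication.
# Assumption: simplified round-robin placement, not the real rack-aware policy.

def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes for a block, round-robin by block id."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for requested replication")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3", "node4"]
for block in range(3):
    # Each block lands on three distinct nodes, so losing one node
    # never loses the only copy of a block.
    print(block, place_replicas(block, nodes))
```

Because every block lives on three nodes, the cluster tolerates node failures without data loss, which is the basis of Hadoop's fault tolerance on the storage side.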
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java. A number of higher-level alternatives have emerged: Hive, Pig, and Jaql.
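The MapReduce model itself is simple: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. A minimal sketch in plain Python (no Hadoop involved), using the classic word-count example:

```python
# Sketch of the MapReduce programming model: map -> shuffle -> reduce.
# Plain Python simulation for illustration; Hadoop distributes these
# phases across the slave nodes.
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key."""
    return (key, sum(values))

lines = ["Big Data", "big data big insights"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 2, 'insights': 1}
```

In a real Hadoop job the master assigns map and reduce tasks to slaves and re-runs any task that fails; the programmer only supplies the two functions.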
Hive is a data warehouse infrastructure used with Hadoop; it has its own query language called HiveQL.
What Hadoop Open Source offers:
- Scalable - New nodes can be added on the fly
- Affordable, massively parallel
- Flexible - Hadoop is schema-less and can absorb any type of data
- Fault Tolerant - failed MapReduce tasks are automatically re-run
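The "schema-less" point above is often called schema-on-read: raw records are stored as-is and structure is applied only when reading. A small sketch, with hypothetical field names, mixing a JSON feed and a CSV feed in the same store:

```python
# Sketch of schema-on-read: store raw records unchanged, interpret them
# only at read time. Field names ("user", "clicks") are hypothetical.
import json
import csv
import io

raw_records = [
    '{"user": "alice", "clicks": 3}',  # JSON record from one feed
    "bob,7",                           # CSV record from another feed
]

def read_clicks(record):
    """Apply structure at read time, whatever the record's format."""
    if record.lstrip().startswith("{"):
        row = json.loads(record)
        return row["user"], int(row["clicks"])
    user, clicks = next(csv.reader(io.StringIO(record)))
    return user, int(clicks)

print([read_clicks(r) for r in raw_records])  # [('alice', 3), ('bob', 7)]
```

Nothing about the storage layer had to change to accept the second format; only the reader needed to understand it.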
IBM Innovation for Hadoop:
Adds performance, reliability, compression, analytic and productivity accelerators, web-based UIs, visualisation, and integration with your enterprise systems.
Offering Editions for Hadoop
- Standard Edition - Perform your own setup and design, allowing integration with your own hardware: BigSheets, Big SQL, dev tools, RDBMS and more.
- Quick Start Edition - a free-to-download text analytics offering for testing and familiarising yourself with Hadoop.
- Enterprise Edition - Quick start, Software bundle, HA, GPFS, Advanced Analytics, performance and scalability, security, and IBM integration
- PureData for Hadoop - HA, node redundancy; installation, admin and monitoring built in; application and industry accelerators, development tools; system ready-built, cabled and configured; integration with Netezza.
- Real-time interactive view of the cluster - Monitoring the appliance status, managing servers, job, nodes, hardware and applications.
- Discover and Analyse - Load and explore data using BigSheets and other applications, importing SQL tables from IBM Netezza or other databases.
- EasyArchive+ - Import and export to/from NPS table to/from HDFS, Journalling and archive management.
Why a Platform?:
The whole is greater than the sum of the parts: reduced deployment time and costs, out-of-the-box standards-based services, and the ability to start small - the flexible framework allows easy expansion based on your business/project needs.
- IBM Accelerator for Social Data Analytics - out of the box for cross-industry support, customer acquisition, optimisation and more.
- IBM Accelerator for Machine Analytics - out of the box for cross-industry support, operational monitoring, proactive maintenance and more.
- Native SQL access to data stored in BigInsights, with real JDBC/ODBC drivers.
- User authentication, authorisation (role-based), and a credentials store.
- Provides a landing area for data from other sources: analytics, social media, web feeds, databases, and more.
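The role-based authorisation mentioned above boils down to mapping users to roles and roles to permissions. A minimal sketch, with hypothetical role and permission names (not BigInsights' actual ones):

```python
# Sketch of role-based authorisation backed by a local credentials store.
# Role names and permissions below are hypothetical examples.
ROLES = {
    "admin":   {"read", "write", "manage_users"},
    "analyst": {"read"},
}
USERS = {"alice": "admin", "bob": "analyst"}  # local user -> role mapping

def is_authorised(user, action):
    """Allow an action only if the user's role grants it."""
    role = USERS.get(user)
    return role is not None and action in ROLES[role]

print(is_authorised("bob", "write"))    # False: analysts are read-only
print(is_authorised("alice", "write"))  # True
```

In a real deployment the USERS mapping would come from the local store or from LDAP, as described later in the management section.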
PureData Technical Overview
IBM PureSystems Family
- PureFlex: flexible infrastructure hardware
- PureApplication: integrated application platform
- PureData: integrated Data Platform
PureData for Hadoop is not a replacement for the data warehouse; it should complement it, being used to explore new data and untapped sources, visualise and gain new insights, and identify useful information.
Installation and on-site setup: the appliance is delivered complete and ready to go, with minimal setup required on-site and easy-to-apply updates!
|PureData System for Hadoop - Full Rack Specification||
|Drives|216 x 3TB|
|Memory per node|96GB|
HA - Bypassing the NameNode single point of failure (SPOF):
- Master nodes redundancy - Active/Standby using hot/warm HA Configuration.
- Heartbeating and STONITH capabilities
- DRBD block-level replication (no shared storage)
- Dual 10G network links
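The active/standby pattern above is driven by heartbeats: if the active master stops heartbeating within a timeout, the standby is promoted. A simplified sketch (timings are illustrative; a real HA stack would also fence the failed node via STONITH and replicate its state with DRBD):

```python
# Sketch of heartbeat-driven active/standby failover for the master node.
# Assumption: simplified single-process simulation; no fencing or replication.
import time

class MasterPair:
    def __init__(self, timeout=3.0):
        self.active = "master-A"
        self.standby = "master-B"
        self.last_heartbeat = time.monotonic()
        self.timeout = timeout

    def heartbeat(self):
        """Called periodically by the active master while it is healthy."""
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        """Promote the standby if the active has missed its heartbeats."""
        now = now if now is not None else time.monotonic()
        if now - self.last_heartbeat > self.timeout:
            self.active, self.standby = self.standby, self.active
            self.last_heartbeat = now
        return self.active

pair = MasterPair(timeout=3.0)
print(pair.check())                           # heartbeat fresh: master-A active
print(pair.check(pair.last_heartbeat + 5.0))  # timeout exceeded: master-B promoted
```

Fencing (STONITH) matters in practice because without it a "dead" master that is merely slow could come back and conflict with the newly promoted one.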
Centralised Management Interface:
- HW control tab
- Dashboard Tabs predefined for performance and monitoring of hardware.
- CLI shell administration (ihash), connecting via SSH to the nodes
- Role-based user management, with local or LDAP user control for the system.
Simple built-in tools allow users to migrate (import/export) data, with quick configuration and ready-made scripts.
Hadoop Data Discovery with BigSheets
What is BigSheets?:
A browser-based analytics tool for business users. It uses a web-based, spreadsheet-like interface with built-in readers, allowing users to combine and explore various types of data and results as needed.
What can you do with BigSheets?
- Filter and enrich content with built-in functions.
- Combine data in different workbooks.
- Visualise results through spreadsheets and charts.
- Export data into common formats.
No programming knowledge needed!
Typical scenarios range from data collection, web-crawlers, and other imported sources. Data storage is via the distributed filesystem (HDFS) and a web-based file browser. Data exploration is via the various tools, Workbooks and charts in BigSheets.
Users create Workbooks to work with the data like spreadsheets, interacting with it via built-in Readers that interpret the information using a known schema, such as web-crawler output and known DB layouts.
Processing allows users to build the workbooks, filtering and transforming the data as desired. The system compiles the front-end commands into executable work and runs it in a simulated environment against sample data. Users then run the workbook to compute results on the real data and explore the output.
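The workbook flow above - define filter/transform steps, try them on a sample, then run the same steps over the full data - can be sketched as a small pipeline. The data and step definitions here are hypothetical:

```python
# Sketch of the BigSheets workbook flow: build a filter/transform pipeline
# once, test it on sample rows, then run it unchanged on the full data.
# Rows and step definitions below are hypothetical examples.

def build_pipeline(filters, transform):
    """Compile filter and transform steps into one runnable pipeline."""
    def run(rows):
        return [transform(r) for r in rows if all(f(r) for f in filters)]
    return run

data = [
    {"site": "a.com", "hits": 120},
    {"site": "b.com", "hits": 8},
    {"site": "c.com", "hits": 55},
]

pipeline = build_pipeline(
    filters=[lambda r: r["hits"] >= 50],           # keep busy sites only
    transform=lambda r: (r["site"], r["hits"]),    # project two columns
)

print(pipeline(data[:2]))  # dry run on a sample: [('a.com', 120)]
print(pipeline(data))      # full run: [('a.com', 120), ('c.com', 55)]
```

The key point is that the pipeline definition is identical for the sample run and the full run; only the input size changes.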
BigSheets console URL: https://<server>:<port>/data/html/index.html
SQL for Big Data:
- SQL access to data in Hadoop is challenging: the data comes in many formats - CSV, JSON, Hive RCFile, HBase and more. Big SQL includes SQL '92+ support.
- While highly scalable, MapReduce is notoriously difficult to use; the Java API requires programming expertise, and the alternatives are unfamiliar languages (Pig).
- SQL support opens the data to a much wider audience
- Making data in BigInsights accessible to SQL capable tools.
- Big SQL: Familiar SQL for Unstructured Data.
- Big SQL inherits much of its terminology from Hive/HCatalog.
- Big SQL Engine uses real JDBC/ODBC drivers.
- Optimised to handle the queries better and more efficiently.
Big SQL Architecture:
- Runs on the Master node in Hadoop.
- Shares the catalogues with Hive via the Hive Metastore, so each can query the other's tables.
- SQL engine analyses the incoming queries.
- Separates portions of the query to execute at the server versus portions to execute on the cluster.
- Re-writes query if necessary for improved performance.
- Determines the appropriate storage handler for the data.
- Produces the execution plan for the queries.
- All this is done without the user's knowledge; there is no need for the user to decide how it is done.
- Big SQL has a multi-threaded architecture; more than one Big SQL instance can be set up.
Big SQL brings robust SQL support to the Hadoop ecosystem, with many supported SQL data types and SQL '92. Tool support includes JSqsh, the BigInsights Console, and existing tools like SQuirreL SQL and Eclipse.
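Because Big SQL speaks standard JDBC/ODBC, any SQL-capable tool can query it with ordinary ANSI SQL. As a stand-in sketch (sqlite3 replaces the Big SQL engine here, and the table/column names are hypothetical), this shows the kind of SQL '92-style query such a tool would issue:

```python
# Sketch: any SQL-capable client can query an engine through a standard
# driver. sqlite3 stands in for Big SQL; table/columns are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (site TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("a.com", 120), ("b.com", 8), ("c.com", 55)],
)

# A plain SQL-92-style query: projection, predicate, ordering.
rows = conn.execute(
    "SELECT site, hits FROM clicks WHERE hits >= 50 ORDER BY hits DESC"
).fetchall()
print(rows)  # [('a.com', 120), ('c.com', 55)]
```

The point is the interface, not the engine: swapping in a real JDBC/ODBC connection to Big SQL would leave the SQL itself essentially unchanged.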