Introduction

Welcome to the preparatory webinar of the Big Data Analytics School 
 
April 25, 2019
 

 LAMBDA - Learning, Applying, Multiplying Big Data Analytics 

 

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 809965
• Coordination and Support Action

Specific Objectives

 
OBJ 1: Strategic Partnership - Establishment and development of productive and fruitful long-term cooperation that continues after project completion
4 partners, 3 different countries (Serbia, Germany, UK)
OBJ 2: Boosting scientific excellence of the linked institutions and capacity building of the widening country and the region in Big Data Analytics and semantics
Different capacity building activities (Big Data Analytics Summer School)
OBJ 3: Spreading excellence and disseminating knowledge throughout the Western Balkan and South-East European countries
5 workshops at International conferences in the region
 
 
OBJ 4: Sustainability of research related to key societal challenges (sustainable transport, sustainable energy, security, societal wellbeing) and financial autonomy in the long run
7 brainstorming sessions on key societal challenges
ENTERPRISE INFORMATION SYSTEMS

Department Introduction 

 

 


Agenda

Overall information
Products and Services
Research Projects
 
 
 
 
Dr. Christoph Lange
Head of Department
Dr. Ing. Steffen Lohmann
Deputy Head of Dep’t
EII Business Unit Leader
Prof. Dr. Jens Lehmann
Lead Scientist
Professor of Smart Data Analytics
Dr. Simon Scerri
Dr. Damien Graux
M.Sc. Luigi Selmi
 
Dr. Christian Mader
Dr. Ioanna Lytra
Dr. Giulio Napolitano
Richa Sharma
Mohammad K. Nammous
 
Martina Wiegand
Smart Data Ecosystems
Data Integration
Business Developers
Secretary
Team Leaders
Industrial Information Interaction

1.2. Important Facts about EIS

 

~35 employees, including 1 professor, 8 postdocs, 13 PhD students, 7 software developers, 2 business developers, etc.

Industrial projects with customers since 2016
8 EU Research Projects
(1 as coordinator; extensive coordination experience)
9 National Research Projects
Strategic Networks:
Big Data Value Association, Industrial Data Space Association
Semantic Technologies & Linked Data
Enterprise Information Integration
Dr. Steffen Lohmann,
Deputy Head of Dep’t
Business Unit Leader
Dr. Christoph Lange,
Head of Department
Business Unit
Research Group

2. Products and Services

 

 

Vocabulary-based Data Integration
Federated Hybrid Search Engine
Enterprise Knowledge Graphs
AskNow - Question Answering
Enterprise Data Market
Semantic Big Data Architecture

2.1. VoCol - Vocabulary-based Data Integration

 

 


2.2. Federated Hybrid Search Engine

Chaotic Search
Structured & Simplified Search
 

2.3. AskNow - Question Answering System

 

 

 
 
 
Enterprise Data Market
Semantic Big Data Architecture
Enterprise Knowledge Graphs
2.4. Enterprise Solutions
 
1. Heterogeneous Data: Text - Media - RDB - RDF
2. Semantification: Metadata, Vocabularies, Integrations
3. Applications: Enterprise Solutions (Cortex, SANSA, Ontario, BDE-IP)

3. Research Projects (Public Funding)

8 EU and 9 national projects
Some highlights on the following slides, plus:
 
WDAqua (Question Answering ITN)
SLIPO (Scalable Linking and Integration of Big POI Data)
BETTER (Big-data Earth observation Technology and Tools Enhancing Research and development)
LiDaKrA (linked data based crime analysis)
LiMbo (linked mobility data)
Urban Data Space 
Inclusive OpenCourseWare 
HOBBIT (Big Linked Data Benchmarking)
Be-IoT, bIoTope
 and a few more…
EU H2020 project for the development of an integrated platform of Big Data management tools (2015–2017)
 
Semantic layer allows for the integration of heterogeneous data sources on-demand
 
CSA brought together big data users/producers from the 7 societal challenges
One pilot per challenge: health, food and agriculture, energy, transport, climate, social sciences, and security
 
Large-scale pilots for collaborative OpenCourseWare authoring, multiplatform delivery and Learning Analytics
EU ICT-20-2015 Project
17 partners from 8 countries
Cooperation between IAIS departments
Re-developing the SlideWiki platform
Integration with other learning platforms
Testing and evaluation in trials
Fraunhofer IAIS is coordinator
Creating an ecosystem to establish sovereignty over data
Funded by BMBF 2015–2020 and Fraunhofer (CIT Cluster)
12 Fraunhofer Institutes: AISEC, FIT, FKIE, FOKUS, IAIS, IAO, IESE, IML, IOSB, IPA, ISST, SIT
IDS Association: 75+ industry partners, 5 working groups, 18 use cases
IAIS has a leading role in:
developing the architecture
establishing a common information model
overseeing DIN standardization

THANK YOU

University of Bonn:
Ranked 48th worldwide by THES, 2nd in Germany
4,600 international students (12.7%) from 140 countries
900 doctoral students
Candidate for "University of Excellence" status in Germany in 2019
 
Computer Science department of University of Bonn:
 
Computer Science I - Theoretical Informatics and Formal Methods
Computer Science II - Computer Graphics and Computer Animation
Computer Science III - Databases and Information Systems, Programming Languages and Software Technologies, Artificial Intelligence, Enterprise Information Systems, Computer Vision, Intelligent Databases, Knowledge Discovery and Machine Learning
Computer Science IV - Security and Networked Systems
Computer Science V - Theoretical Computer Science
Computer Science VI - Autonomous Intelligent Systems, Humanoid Robots Lab, and Technical Computer Science

 

 

 

Smart Data Analytics

Research Group Overview

 

Sub-groups

Distributed Semantic Analytics
Semantic Question Answering
Structured Machine Learning
Deep Learning
Software Engineering for Data Science
Semantic Data Management
Knowledge Extraction and Validation
Founded in 2016
40 Members:
1 Professor
14 PostDocs / Seniors
25 PhD Students
Many master students
 
 

 

 

 

Lectures Presentation

(University of Bonn & Fraunhofer)

 

Lectures at BDA School

 
Lecture - 1 Introduction to Big Data Architecture
Advanced Big Data architectures
Design and architect scalable solutions
Cluster managers
Distributed file systems
Storage systems
Lecture - 2 Big Data Solutions in Practical Use-cases
Components in realizing system architectures
Real-world example of practical use-cases
 

Lectures at BDA School

 
Distributed Big Data Frameworks
Batch-only frameworks (Hadoop)
Stream-only frameworks (Storm, Samza)
Hybrid frameworks (Spark, Hive and Flink)
Distributed Big Data Libraries
Big Data frameworks use different APIs
Graph computations and graph processing
SparkSQL, GraphX and MLlib
Scalable algorithms in Spark using Scala
 

Lectures at BDA School

 
 
Distributed Semantic Analytics I
Scalable Semantic Analytics Stack (SANSA)
Knowledge graph processing
Distributed Semantic Analytics II
Setup, APIs, and the different layers of SANSA
Executing examples and creating programs that use the SANSA APIs
 

LAMBDA Bonn Key Contacts

Prof. Jens Lehmann
Dr. Damien Graux
 
Dr. Hajira Jabeen
Dr. Sahar Vahdati
 

Introduction

  • Data on the Web
  • Linking Data on the Web
 

5 Stars

 

Introduction


Course title: Semantic Web Technologies
Lecture no. 1
Teaching unit: Introduction to the Semantic Web
Topic: History of the Semantic Web
Prepared by: Dr Valentina Janev

 

History of the Semantic Web

  • 1989: The Web was "invented" by Tim Berners-Lee (among others), a physicist working at CERN (the European Organization for Nuclear Research)
  • 1994: The W3C was formed (chaired by Tim Berners-Lee)
    • As a forum for information, commerce, communication, and collective understanding, with the goal of leading the Web to its full potential
    • To develop common protocols that advance its evolution and ensure its interoperability
  • 2001: Berners-Lee, T., Hendler, J., Lassila, O., Semantic Web, Scientific American, May 2001.


Goals


  • Berners-Lee, T., Hendler, J., Lassila,
    Scientific American, May 2001
  • The Semantic Web is an extension of the existing Web in which information has a precisely defined meaning, enabling better cooperation between computers and people.
  • The goal of the Semantic Web is to create a Web of data with organized, shared semantics that is accessible, unambiguous, and easily usable by software agents.

Goals

  • To have information on the Web whose meaning is understandable to computers as well
  • To have trillions of small, specialized software agents that help perform tasks automatically using the information available on the Web
  • This opens new perspectives for knowledge acquisition and knowledge engineering



The W3C International Consortium


  • provides the basic information infrastructure, connecting people and organizations around the world
  • strives to establish interoperability and universal access to the Web by introducing open standards for core Web tools, independently of individual interests
  • operates on membership-fee revenue and has about 400 members, including the leading commercial companies in this field as well as numerous for-profit and non-profit organizations, universities, and individual experts
  • oversees and coordinates the Semantic Web project
  • The World Wide Web Consortium develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential.

Semantic Web Layered Cake (2008)

Standard technologies:

  • RDF – Resource Description Framework
  • RDFS – RDF Schema
  • OWL – Web Ontology Language
  • SPARQL – RDF Query Language
https://www.w3.org/2007/03/layerCake.svg 

Semantic Web Layered Cake (2008)


Unicode
URI

  • Unicode is a standard for representing characters
  • URI is a standard for identifying and locating resources

XML

  • XML and related standards are used for structuring data

RDF

  • RDF is a simple framework for representing metadata; it uses URIs to identify resources on the Web and a graph to describe the relations, i.e. the semantic links, between resources.
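The triple-and-graph idea behind RDF can be illustrated without any library. Below is a minimal Python sketch (hypothetical example data; the `match` helper is not a real RDF API) that stores (subject, predicate, object) triples and answers simple pattern queries, in the spirit of SPARQL basic graph patterns.

```python
# Illustrative sketch, not a real RDF library: triples are
# (subject, predicate, object) tuples, with URIs identifying resources.
EX = "http://example.org/"  # hypothetical namespace

triples = [
    (EX + "Bonn", EX + "locatedIn", EX + "Germany"),
    (EX + "UniBonn", EX + "locatedIn", EX + "Bonn"),
    (EX + "UniBonn", EX + "founded", "1818"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All "locatedIn" relations, regardless of subject and object:
located = match(triples, p=EX + "locatedIn")
```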

RDFS

  • RDF Schema is a semantic extension of RDF: a simple vocabulary, a tool for describing groups of interrelated resources and the nature of their relations

Ontologies

  • Ontologies are languages, formally defined systems of terms and/or concepts and relations, for expressing complex properties of resources and their links

Logic and proof

  • Logic and proof: an automated reasoning system on top of the ontological structure serves to derive new, implied links

Trust

  • This layer should contain reasoning elements sufficient to establish trust in the information obtained via the Web

Misconceptions about the Semantic Web

Misconception 1: The Semantic Web tries to impose meaning "from the top"

  • Answer 1: The only meaning that OWL and RDF(S) impose is the meaning of the connectives in the language, which users can employ to express their own meaning.

Misconception 2: The Semantic Web requires everyone to agree on a single predefined meaning for the terms they use.

  • Answer 2: The motto of the Semantic Web is not the enforcement of a single ontology. Its motto is rather: "let a thousand ontologies blossom". That is exactly why ontology mapping is a core topic in the Semantic Web community.

Source: Frank van Harmelen, Semantic Web research anno 2006: main streams, popular fallacies, current status and future challenges.

Misconceptions about the Semantic Web

Misconception 3: The Semantic Web will require users to understand the complicated details of formalized knowledge representation.

  • Answer 3: Indeed, certain parts of the underlying Semantic Web technology rely on complicated details of formalized knowledge representation. ...for most (current and future) users of the Semantic Web, such details will be completely "hidden", just as the complicated details of CSS and (X)HTML are hidden on the existing Web.



Source: Frank van Harmelen, Semantic Web research anno 2006: main streams, popular fallacies, current status and future challenges.

Misconceptions about the Semantic Web

Misconception 4: Users of the Semantic Web will have to manually annotate all existing Web pages.

  • Answer 4: … Semantic Web applications rely on massive automation to extract such semantic annotations from the sources themselves. …the data is often already available in (semi-)structured formats and already organized by database schemas, which can provide the required semantic interpretation.


Source: Frank van Harmelen, Semantic Web research anno 2006: main streams, popular fallacies, current status and future challenges.

Evolution of the Web towards the Semantic Web

  • Overview of the evolution of the Web, Nova Spivack, Radar Networks, 2008

Evolution of the Web towards the Semantic Web

"The Semantic Web spans four phases of Internet evolution:

  • Web 1.0, which is about connecting information
  • Web 2.0, which represents connecting people
  • Web 3.0, which is just beginning ... and represents connecting knowledge
  • Web 4.0, which will come later ... and represents connecting intelligence in a ubiquitous Web where people and things will be able to reason and communicate together."

Overview of the evolution of the Web, "Semantic Wave 2008", Mills Davis

Web 1.0

  • Static HTML pages instead of dynamic, user-generated content
    • Personal Web pages, company presentations
  • Use of framesets
  • Online guest books
  • URL - Uniform Resource Locator

Web 2.0 – Social Web

  • New social phenomena:
    • blogs, wikis, tagging, folksonomies
  • New user interfaces:
    • AJAX
  • New forms of data:
    • microformats, RSS, "mash-ups"
  • New applications/architectures:
    • Web services, SOA

Web 2.0


  • The term Web 2.0 was first used and defined as a concept by Tim O'Reilly and Dale Dougherty in 2004
  • Characteristics of Web 2.0:
    • enables dedicated services that meet the needs of a large number of small user groups,
    • harvesting of collective knowledge, which need not be explicitly stated; in many cases existing social relations and connections are used,
    • network effects, referring above all to the synergetic effects of the active participation of a large number of users in creating content,
    • databases built from user contributions

Web 3.0 – Semantic Web

  • Data represented using formal semantics:
    • RDF, OWL, SPARQL, RIF
  • Spontaneous integration of information
  • Semantic Web services, agents
  • Emphasis on open standards

Overview

  • Definitions
    • IAIS Definition
    • UOXF Definition
    • OntoText Definition
    • TIB Definition
 


Introduction to Big Data & Architectures

 

About us


Smart Data Analytics (SDA)

Prof. Dr. Jens Lehmann
Institute for Computer Science, University of Bonn
Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)
Institute for Applied Computer Science, Leipzig
Machine learning techniques ("analytics") for structured knowledge ("smart data")
Covering the full spectrum of research, including theoretical foundations, algorithms, prototypes, and industrial applications!
Founded in 2016
55 Members:
1 Professor
13 PostDocs
31 PhD Students
11 master students
Core topics:
Semantic Web
AI / ML
10+ awards acquired
3000+ citations / year
Collaboration with Fraunhofer IAIS
 

SDA Group Overview

Distributed Semantic Analytics
Aims to develop scalable analytics algorithms based on Apache Spark and Apache Flink for analysing large-scale RDF datasets
Semantic Question Answering
Makes use of Semantic Web technologies and AI for better and more advanced question answering & dialogue systems
Structured Machine Learning
Combines Semantic Web and supervised ML technologies in order to improve both the quality and quantity of available knowledge
Smart Services
Semantic services and their composition, with applications in IoT
Software Engineering for Data Science
Researches how data and software engineering methods can be aligned with data science
Semantic Data Management
Focuses on knowledge and data representation, integration, and management based on semantic technologies
 
 


Dr. Damien Graux

Research Interests:
Big Data, Data Mining
Machine Learning, Analytics
Semantic Web, Structured Machine Learning

University of Bonn

Founded in 1818 - 200th anniversary
38,000 students
Among the best German universities
7 Nobel Prize and 3 Fields Medal winners
THES CS 2018 Ranking: 81
6 Centers of Excellence
 

Computer Science Institute

A new Computer Science campus uniting three previously separate CS locations

Dr. Hajira Jabeen

Senior Researcher at University of Bonn, since 2016
Research Interests:
Big Data, Data Mining
Machine Learning, Analytics
Semantic Web, Structured Machine Learning

Projects — EU H2020

Big Data Europe, Big Data
Big Data Ocean, Big Data
HOBBIT, Big Data
SLIPO, Big Data
QROWD, Big Data
BETTER, Big Data
QualiChain, Blockchain

Software Projects

SANSA - Distributed Semantic Analytics Stack
AskNow - Question Answering Engine
DL-Learner - Supervised Machine Learning in RDF / OWL
LinkedGeoData - RDF version of OpenStreetMap
DBpedia - Wikipedia Extraction Framework
DeFacto - Fact Validation Framework
PyKEEN - A Python library for learning and evaluating knowledge graph embeddings
MINTE - Semantic Integration Approach

Distributed Semantic Analytics

Members
Hajira Jabeen
Damien Graux
Gezim Sejdiu
Heba Allah
Rajjat Dadwal
 
Claus Stadler
Patrick Westphal
Afshin Sadeghi
Mohammed N. Mami
Shimma Ibrahim

What is Big Data?


Big Data

Data that is extremely:
Large
Complex
Does not fit into the memory of a single machine
Traditional algorithms are inadequate
Requires distributed processing and analytics to uncover:
Patterns
Trends
Interactions

Big Data landscape (2012)

 

Big Data Ecosystem

File system: HDFS, NFS
Resource manager: Mesos, YARN
Coordination: ZooKeeper
Data acquisition: Apache Flume, Apache Sqoop
Data stores: MongoDB, Cassandra, HBase, Hive
Data processing
Frameworks: Hadoop MapReduce, Apache Spark, Apache Storm, Apache Flink
Tools: Apache Pig, Apache Hive
Libraries: SparkR, Apache Mahout, MLlib, etc.
Data integration
Message passing: Apache Kafka
Managing data heterogeneity: SemaGrow, Strabon
Operational frameworks
Monitoring: Apache Ambari

Cluster Basics

Host/Node = a computer
Cluster = two or more hosts connected by an internal high-speed network
A cluster can contain several thousand connected nodes
Master = a small number of hosts reserved to control the rest of the cluster
Worker = the non-master hosts

Big Data Architectures


Architectures

Lambda Architecture
Batch / Stream Processing
Kappa Architecture
A Simplification of Lambda Architecture (everything is a stream)
Service Oriented Architecture
Interaction of multiple services
 
 

Lambda Architecture

Mostly for batch processing
Key features:
Distributed file system for storage
Processing layer
Serving layer
Long-term storage (historical data)
 

Three layers

Batch Layer:
Large-scale, long-running analytics jobs
Speed Layer/Stream Processing Layer:
Fast stream-processing jobs
Serving Layer:
Allows interactive analytics combining the above two
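How the serving layer combines the other two can be sketched in a few lines of Python (hypothetical page-view data; the view names are illustrative): the batch view holds precomputed historical counts, the speed view holds counts from events the batch job has not yet processed, and a query merges both.

```python
# Minimal Lambda-architecture sketch with hypothetical data:
# the serving layer answers queries by merging both views.
batch_view = {"page_a": 100, "page_b": 40}   # from long-running batch jobs
speed_view = {"page_a": 3, "page_c": 1}      # from recent stream events

def query(key):
    # Serving layer: batch result plus the not-yet-batched delta.
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

When the next batch run completes, its output replaces the batch view and the corresponding speed-view entries are discarded.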
 

Lambda Architecture

 

Kappa Architecture

Everything is a stream
Distributed ordered event log
Stream processing platforms
Online Machine learning algorithms
https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa

Microservice Architecture

Not essentially a new style in itself
Emerged from:
Applications packaged as services
The availability of software containers
Container resource managers (Docker Swarm, Kubernetes)
Flexible 
Quick deployment of services
 

Microservice Architecture

Functions that run in response to various events
Scales well and does not require scaling configuration
e.g. AWS Lambda, OpenLambda

Distributed Kernels


Distributed Kernels

Minimally complete set of utilities
Distributed resource management
Abstraction of the data center/cluster
View as a single pool of resources
Simplifies execution of distributed systems at scale
Ensures
High availability
Fault tolerance
Optimal resource utilization

Distributed Kernels

Resource Managers
Apache Hadoop YARN
Resource manager and job scheduler in Hadoop
Apache Mesos
Open-source project to manage computer clusters
 
 

YARN (Yet Another Resource Negotiator)

ResourceManager
Master daemon
Communicates with the client
Tracks resources on the cluster
Orchestrates work by assigning tasks to NodeManagers
NodeManager
Worker daemon
Launches and tracks processes spawned on worker hosts
ApplicationMaster
Per-application daemon that negotiates resources from the ResourceManager

Apache Mesos

Distributed kernel
Decentralised management
Fault-tolerant cluster management
Provides resource isolation
Management across a cluster of agent (worker) nodes
The opposite of virtualization:
Joins multiple physical resources into a single virtual resource
Schedules CPU and memory resources across the cluster in the same way the Linux kernel schedules local resources
 

ZooKeeper

A service that enables the cluster to be:
Highly available
Scalable
Distributed
Assists in
Configuration 
Consensus 
Group membership
Leader election
Naming
Coordination 
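The leader-election recipe that ZooKeeper enables can be sketched in plain Python (no real ZooKeeper involved; all names here are illustrative): each participant creates an ephemeral sequential node, and the participant holding the lowest sequence number is the leader. When its session ends, its node disappears and the next-lowest takes over.

```python
# Sketch of ZooKeeper-style leader election (not the real ZooKeeper API):
# the lowest ephemeral sequence number wins.
import itertools

counter = itertools.count()          # stands in for ZooKeeper's sequence numbers
election = {}                        # participant name -> sequence number

def join(name):
    election[name] = next(counter)   # "create an ephemeral sequential znode"

def leader():
    return min(election, key=election.get)

def leave(name):                     # session ends -> ephemeral node vanishes
    del election[name]

join("a"); join("b"); join("c")
first = leader()        # "a" joined first, so it holds the lowest number
leave(first)            # the leader crashes; the next-lowest takes over
```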
 

Distributed File Systems


Distributed File Systems

NFS
Network File System
GFS
Google File System
HDFS
Hadoop Distributed File System
 
 
 

Hadoop

Open-source project of the Apache Software Foundation
Written in Java
Based on the Google File System (GFS) design
Optimized to handle massive quantities of data
Structured
Unstructured
Semi-structured
On commodity hardware
 
 

Hadoop, Why?

Processes multi-petabyte datasets
Reliability in distributed applications
Node failure
Failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
Provides a common infrastructure
Efficient 
Reliable 
 
 
 
 
 

Components

 
 
Hadoop Resource Manager - YARN
Hadoop Distributed File System - HDFS
MapReduce (The Computational Framework)

Hadoop Distributed File System

Very large distributed file system
10K nodes, 100 million files, 10 PB
Assumes commodity hardware
Uses replication to handle hardware failure
Detects and recovers from failures
Optimized for Batch Processing
Runs on heterogeneous OS
Minimum intervention
Scaling out
Fault tolerance
 

Hadoop Distributed File System

Single Namespace for entire cluster
Data Coherency
Write-once-read-many access model
Clients can only append to the existing files
Files are broken up into blocks
Typically 128 MB block size
Each block is replicated on multiple DataNodes
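The splitting and replication above can be sketched in Python (sizes shrunk from 128 MB to 128 bytes for illustration; the round-robin placement is a simplification of HDFS's rack-aware policy):

```python
# Sketch of HDFS-style storage: break a file into fixed-size blocks and
# assign each block to several DataNodes. Node names are hypothetical.
BLOCK_SIZE = 128          # HDFS default is 128 MB; bytes here for brevity
REPLICATION = 3

datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_index, replication=REPLICATION):
    # Naive round-robin placement; real HDFS is rack-aware.
    return [datanodes[(block_index + r) % len(datanodes)]
            for r in range(replication)]

blocks = split_into_blocks(b"x" * 300)       # 3 blocks: 128 + 128 + 44 bytes
placement = {i: place_replicas(i) for i in range(len(blocks))}
```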
 

NameNode 

Meta-data in Memory
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor
A Transaction Log
Records file creations, file deletions, etc.
 
 

DataNode

A Block Server
Stores data in the local file system
Stores meta-data of a block (e.g. CRC)
Serves data and meta-data to Clients
Block Report
Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
Forwards data to other specified DataNodes

Block Placement

Current Strategy
One replica on the local node
Second replica on a remote rack
Third replica on the same remote rack
Additional replicas are placed randomly
Clients read from nearest replica (Location awareness)
 
 
 

Hadoop Distributed File System

NameNode: A single point of failure
Multiple namenodes using Quorum Journal Manager (QJM)
Transaction Log stored in multiple directories
A directory on the local file system
A directory on a remote file system (NFS/CIFS)

Summary

Distributed Kernels
Apache Mesos
Resource Manager
Hadoop Yarn
File System
Hadoop Distributed File System

Next

Distributed Storage
Message Passing
Searching, Indexing
Visualization
Analytics
 

THANK YOU !

  Dr. Damien Graux Dr. Hajira Jabeen


Distributed Big Data Frameworks

A Panorama

 

 
Database Paradigms

Relational (RDBMS) — The SQL world…!
 
 

Database Paradigms

The ACID set of properties, i.e. atomicity, consistency, isolation, and durability

SQL is the canonical query language
 
MySQL, PostgreSQL, Oracle, …
 

Relational (a quick reminder)

Relational (RDBMS) — The SQL world…!
 
NoSQL
Key-Value stores
 
 

Database Paradigms

Paradigm → one key = one value
Without duplicates
Usually sorted
The key acts like a hash
The value is treated as an opaque binary object
 
Examples:
Amazon Dynamo
MemcacheDB
Apache Accumulo
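A minimal sketch of this paradigm in Python (the `KVStore` class is illustrative, not any real store's API): keys are unique and kept sorted, and values are treated as opaque bytes.

```python
# Sketch of a sorted key-value store: one key = one value, no duplicates,
# values are opaque bytes (in the spirit of Dynamo- or Accumulo-style stores).
import bisect

class KVStore:
    def __init__(self):
        self._keys = []          # kept sorted for range scans
        self._values = {}

    def put(self, key, value):
        if key not in self._values:          # no duplicate keys
            bisect.insort(self._keys, key)
        self._values[key] = value            # overwrite = one value per key

    def get(self, key):
        return self._values.get(key)

    def scan(self):
        return [(k, self._values[k]) for k in self._keys]

store = KVStore()
store.put("b", b"2"); store.put("a", b"1"); store.put("a", b"updated")
```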
 

Key-Value stores


Relational (RDBMS) — The SQL world…!
 
NoSQL
Key-Value stores
Document databases
 
 

Database Paradigms

Key-Value store, but the value is structured and understood by the DB.
 
Querying data is possible (by other means than just a key).
 
Examples:
Amazon SimpleDB
CouchDB
Riak
MongoDB
 

Document databases


Relational (RDBMS) — The SQL world…!
 
NoSQL
Key-Value stores
Document databases
Wide Column stores (e.g. BigTable and its variations)
 

Database Paradigms

Often referred to as BigTable systems

“Sparse, distributed, multi-dimensional sorted map”
 
Examples:
Google BigTable
Cassandra (Facebook)
HBase
 

Wide Column stores

Relational (RDBMS) — The SQL world…!
 
NoSQL
Key-Value stores
Document databases
Wide Column stores (e.g. BigTable and its variations)
Graph databases
 
Other ones…
 

Database Paradigms

Multi-relational graph

Puts the emphasis on the links between data items
 
Examples:
Neo4j
InfoGrid
Triplestores…
 

Graph databases

Database Paradigms (Visually)

 

Selected Storage Systems

Document database (NoSQL)
Scalability and flexibility
Querying and indexing
Stores data in
JSON-like documents
schema free database
Open-Source

MongoDB

RDBMS replacement for Web Applications
Semi-structured Content Management
Real-time Analytics & High-Speed Logging
Caching and Scalability

What is MongoDB great for?

Apache Hive

Apache Hive is a data warehouse
Developed by Facebook
On top of Apache Hadoop
Provides
Data summarization
Query
Analysis
Open-Source
Gives an SQL-like interface

Apache Hive - Limitations

Not designed for online transaction processing
Not suited for real-time queries
Not made for low-latency queries
Certain standard SQL functions do not exist
NOT IN
NOT LIKE
NOT EQUAL

Apache Cassandra

Originally developed for Facebook's inbox search feature
Millisecond read and write times
Designed for linear, incremental scalability on top of commodity hardware.
Open-Source since 2008

Cassandra - Strengths

Linear performance scaling when adding nodes
Peer-to-peer architecture instead of master-slave
Fault tolerant in case of node failure
High performance
Schema-free/schema-less
 

Cassandra - Limitations

Use cases where it is better to avoid Cassandra:
If too many joins are required
To store configuration data
Things slow down during compaction
Aggregation operator support is limited
Updates and deletes are possible, but Cassandra is not designed for them

Distributed Stream Processing

Apache Kafka

Distributed event streaming platform
Able to handle up to trillions of events a day
Initially conceived as a messaging queue
Open-sourced by LinkedIn in 2011
Useful for:
Stream processing
Website activity tracking
Metrics collection and monitoring
Log aggregation

Apache Kafka

Three key capabilities
Publish and Subscribe
Stream of records
Process
Streams of records as they occur
Store
Streams of records in fault tolerant manner
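The publish/subscribe-over-a-log idea can be sketched in Python (a single in-memory partition; names are illustrative, not Kafka's API): records are only ever appended, and each consumer group tracks its own read offset independently.

```python
# Sketch of Kafka's core abstraction: an append-only log of records,
# read independently by consumer groups via per-group offsets.
from collections import defaultdict

log = []                             # the ordered record stream
offsets = defaultdict(int)           # consumer group -> next offset to read

def publish(record):
    log.append(record)               # records are only ever appended

def poll(group, max_records=10):
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] += len(batch)     # each group advances independently
    return batch

publish("click:/home"); publish("click:/docs")
a = poll("analytics")       # reads both records
publish("click:/blog")
b = poll("analytics")       # only the record published since the last poll
m = poll("monitoring")      # a new group starts from the beginning
```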

Search, Indexing, visualization

ElasticSearch

Distributed and highly available search engine
Indexes are sharded (with replicas)
Reads/searches can be served by any replica shard
Multi-tenant
Support for more than one index
Various set of APIs
Document oriented
Reliable
Near real time search

ElasticSearch

Getting data into ElasticSearch:
Receives data in the form of JSON documents
Ingest data using Logstash
Connectors to other data stores
Stores the document and adds a searchable reference to it
All data is indexed by default
Every field has a dedicated inverted index
All inverted indexes can be used in a query
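The inverted index behind this can be sketched in Python (toy documents; a real ElasticSearch index additionally handles analysis, scoring, and sharding): each token maps to the set of documents containing it, and a multi-term query intersects those sets.

```python
# Sketch of an inverted index: token -> set of document ids.
from collections import defaultdict

documents = {
    1: "distributed search engine",
    2: "distributed stream processing",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    # AND-match: return documents containing every query token.
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = index[tokens[0]].copy()
    for token in tokens[1:]:
        result &= index[token]
    return result
```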

Visualization

Kibana

Browser-based analytics and search dashboard for Elasticsearch
Search, view, and interact with the data stored in Elasticsearch indices
Visualize data as:
Charts
Tables
Maps

Kibana

Displays changes in real time
Discover: explore data using selected index patterns
Visualize: create visualizations of data based on Elasticsearch queries or searches saved from Discover
Dashboard: a collection of visualizations and searches, stored together
 

Processing Frameworks

Analytic Frameworks

Batch-only
Apache Hadoop MapReduce
Stream and In-Memory Computing
Apache Spark
Apache Flink
 
A distributed framework to process vast amounts of data (multi-terabyte datasets)
Runs on clusters of commodity hardware
Reliable
Fault tolerant
A MapReduce job:
Divides the large input into independent chunks
Chunks are processed by map tasks in parallel
The sorted output is passed to the reduce tasks
Typically, both input and output are stored in a filesystem (HDFS)
 
 

Hadoop MapReduce

MapReduce

First popular data-flow model
In the MapReduce model, the user provides two functions (map and reduce)
map() must output key-value pairs
The input to reduce is partitioned by key across machines (shuffled)
reduce() outputs the aggregated values
 
 
 

Processing of Map tasks

Given a file divided into multiple parts (splits)
Each record (line) is processed by a map function that is:
written by the user,
takes an input key/value pair,
produces a set of intermediate key/value pairs,
e.g. (doc-id, doc-content)
The grouping by key draws an analogy to the SQL GROUP BY clause
Processing of Reduce Tasks

Given a set of (key, value) records produced by the map tasks
The intermediate values are combined into a list by key and given to a reducer
Each reducer then performs an aggregate function (e.g., average) computed over all rows with the same key
 
 
 
 
 
 
 
 

 Example: Word counting in class

"Consider the problem of counting the number of occurrences of each word in a large collection of documents"

Distribute: divide a collection of documents among the class
Map: each person independently calculates the counts of individual words in their documents
Shuffle: gather the words and counts
Reduce: sum up the counts from all the documents for all the words
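The classroom steps above map directly onto the MapReduce phases; here is a single-machine sketch in Python (a real framework distributes each phase across the cluster):

```python
# Word count expressed as explicit map / shuffle / reduce phases.
# The document collection is a toy example.
from collections import defaultdict

documents = ["big data is big", "data engineering"]

# Map: each document independently emits (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key (word)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the value list for each key
counts = {word: sum(values) for word, values in groups.items()}
```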

Drawbacks of MapReduce

Forces the data analysis workflow into a map and a reduce phase
You might also need:
Join
Filter
Sample
Complex workflows do not fit into map/reduce
MapReduce relies on reading data from disk
Performance bottleneck
Especially for iterative algorithms
e.g. machine learning

Requirement . .

A tool compatible with the existing environment
Without replacing the stack
Replaces MapReduce only
Generic
Provides a rich API with more functions
Reduces disk I/O
Faster
In-memory computations
Provides an interactive shell
 
 
 

Apache Flink

Distributed stream processing engine
Process bounded and unbounded streams
Generic deployment
Scalable
Trillions of events
Terabytes of states
Thousands of cores
In-Memory performance
 

Apache Flink APIs

DataStream API
For bounded or unbounded streams of data
DataSet API
For bounded data sets
Transformations (filter, map, join)
Table API
SQL-like expression language for relational stream and batch processing
 

Apache Flink Layered APIs

 

Apache Flink Libraries

Complex Event Processing (CEP)
Pattern detection from events
DataSet API
Map, Reduce, join, iterate
Gelly
Scalable Graph processing library

THANK YOU !

  Dr. Damien Graux Dr. Hajira Jabeen

Distributed Big Data Library

Apache Spark

Solution?
New framework: supports the same features as MapReduce and many more.
Capable of reusing the Hadoop ecosystem, e.g. HDFS, YARN, etc.
 
 
 
 
 

Shortcomings of MapReduce

Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Apache Spark

Introduction to Spark

Open source
Distributed
Scalable
In-memory
General-purpose
High level APIs
Java
Scala
Python
R
 
 
Libraries
MLlib
Spark SQL
GraphX
Streaming
Spark Stack: a unified analytics stack

Introduction to Spark

APIs & Libraries:
Spark SQL & DataFrames
Spark Streaming (real-time processing)
MLlib (machine learning)
GraphX (graph processing)
Core:
Spark Core Engine
Deploy:
Local (single JVM)
Cluster (Standalone, Mesos, YARN)
Containers (docker-compose)
Originally developed at UC Berkeley's AMPLab in 2009
Open-sourced in 2010
The Spark paper came out
Spark Streaming was incorporated in 2011
Transferred to the Apache Software Foundation in 2013
Spark has been a top-level Apache project since 2014
Spark 2.0 was released with major improvements in 2016
 

Brief History of Spark

Data Scientists:
Analyze and model data.
Data transformations and prototyping.
Statistics and Machine Learning.
 
Software Engineers:
Data processing systems.
API for distributed processing dataflow.
Reliable, high performance and easy to monitor platform.

Who uses Spark and why?

Spark: Overview

Spark provides:
Parallel distributed processing
Fault tolerance
On commodity hardware
Support for applications that reuse a working set of data across multiple parallel operations, e.g.:
Iterative machine learning
Interactive data analysis
A general-purpose programming interface:
Resilient distributed datasets (RDDs)
 

RDDs - Resilient Distributed Dataset

Users can:
Persist the RDDs in memory
Control partitioning
Manipulate them using a rich set of operations
Coarse-grained transformations:
map, filter, join, applied to many data items concurrently
Spark keeps the lineage of each RDD
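The lazy, lineage-based evaluation can be sketched in Python (the `SketchRDD` class is illustrative, not Spark's API): transformations only record what should happen, and `collect()` recomputes the result from the lineage chain, which is also how a lost partition would be rebuilt.

```python
# Sketch of the RDD idea: lazy, coarse-grained transformations that
# remember their lineage. Illustrative only; not Spark's actual API.
class SketchRDD:
    def __init__(self, data=None, parent=None, op=None):
        self._data, self._parent, self._op = data, parent, op

    def map(self, f):
        return SketchRDD(parent=self, op=("map", f))      # records, does not run

    def filter(self, pred):
        return SketchRDD(parent=self, op=("filter", pred))

    def collect(self):
        # Recompute from the lineage chain (nothing was materialized).
        if self._parent is None:
            return list(self._data)
        values = self._parent.collect()
        kind, f = self._op
        if kind == "map":
            return [f(x) for x in values]
        return [x for x in values if f(x)]

rdd = SketchRDD(data=[1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
result = rdd.collect()
```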