PV177 – DataScience Seminar
(ELK stack and Graph DBs)
Tomáš Rebok
Institute of Computer Science, Masaryk University
L04 (ELK and GraphDBs) - PV177/DataScience2
Data analysis with ELK framework
ELK Software Stack
(Diagram labels: Log, Logstash, Elasticsearch, Kibana)
ELK consists of three open source software products provided by
the company “Elastic” (formerly Elasticsearch)
E => Elasticsearch
(Highly scalable search index server)
L => Logstash
(Tool for the collection, enrichment, filtering and
forwarding of data, e.g. log data)
K => Kibana
(Tool for the exploration and visualization of data)
ELK Architecture
Open source software to collect, transform, filter and forward data
(e.g. log data) from input sources to output sources (e.g.
Elasticsearch)
Implemented in JRuby and runs on a JVM (Java Virtual Machine)
Simple message-based architecture
Extendable by plugins (e.g. input, output, filter plugins)
Logstash
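The input, filter and output stages described above can be pictured with a few lines of Python. This is only a toy model of the message-based architecture, not Logstash code; all function names (`read_input`, `drop_empty`, `add_length`, `run_pipeline`) are hypothetical and exist only in this sketch.

```python
from typing import Callable, Iterable, Optional

Event = dict
Filter = Callable[[Event], Optional[Event]]


def read_input(lines: Iterable[str]) -> Iterable[Event]:
    """Input stage: wrap every raw line in an event (a plain dict)."""
    for line in lines:
        yield {"message": line.rstrip("\n")}


def drop_empty(event: Event) -> Optional[Event]:
    """Filter stage: discard events whose message is empty."""
    return event if event["message"].strip() else None


def add_length(event: Event) -> Optional[Event]:
    """Filter stage: enrich the event with an additional field."""
    return {**event, "length": len(event["message"])}


def run_pipeline(lines, filters, outputs) -> None:
    """Pass each event through all filters, then hand it to every output."""
    for event in read_input(lines):
        for apply_filter in filters:
            event = apply_filter(event)
            if event is None:      # a filter dropped the event
                break
        else:
            for write in outputs:  # forward to multiple outputs
                write(event)


collected: list = []
run_pipeline(
    ["GET /index.html", "", "POST /login"],
    filters=[drop_empty, add_length],
    outputs=[collected.append],
)
```

Here `collected` plays the role of an output plugin; in a real deployment that would be, for example, the Elasticsearch output.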
Configuration
Multiple inputs of
different types
Forward to
multiple outputs
Conditionally
filter and
transform data;
some common
formats are
already known
Console output processing Apache log files
Configuration for parsing syslog messages
Input plugins receive messages directly
from TCP and UDP ports
Filter splits messages and adds fields
Console output processing Syslog messages
file -> for processing files
tcp, udp, unix -> reading directly from network sockets
http -> for processing HTTP POST requests
http_poller -> for polling HTTP services as input sources
imap -> accessing and processing imap mail
Different input plugins to access MOM („Message-Oriented Middleware“,
message queues)
Rabbitmq, stomp, …
Different plugins for accessing database systems
jdbc, elasticsearch, …
Plugins to read data from system log services and from command line
syslog, eventlog, pipe, exec
and more …
Input Plugins
The “Elastic Beats” framework forwards input from a set of “data
sources” to a Logstash instance for processing
Filebeat, Packetbeat, Winlogbeat, Metricbeat, Functionbeat, etc.
The “Beats plugin” can then be configured to consume messages from
“Elastic Beats”
Transfer can be secured with certificates and encrypted transmission
(authentication and confidentiality)
Elastic Beats framework + Beats plugin
stdout, pipe, exec -> show output on console, feed to a command
file -> store output in file
email -> send output as email
tcp, udp, websocket -> send output over network connections
http -> send output as HTTP request
Different plugins for sending output to database systems, index server or
cloud storage
elasticsearch, solr_http, mongodb, google_bigquery, google_cloud_storage, opentsdb
Different output plugins to send output to MOM (message queues)
Rabbitmq, stomp, …
Different output plugins for forwarding messages to metrics applications
graphite, graphtastic, ganglia, metriccatcher
Output plugins
The Logstash output plugin can write to multiple Elasticsearch nodes
It will distribute output objects to different nodes (“load balancing”)
A Logstash instance can also be part of an Elasticsearch cluster and
write data through the cluster protocol
Multiple node writes
grok -> parse and structure arbitrary text: best generic option to interpret text as
(semi-)structured objects
alternative: dissect (faster, but does not use regular expressions)
filter for parsing different data formats
csv, json, kv (key-value paired messages), xml, …
multiline -> collapse multiline messages to one logstash event
split -> split multiline messages into several logstash events
aggregate -> aggregate several separate message lines into one Logstash event
mutate -> perform mutations of fields (rename, remove, replace, modify)
dns -> lookup DNS entry for IP address
geoip -> find geolocation of IP address
and more
Filter plugins
Input: 55.3.244.1 GET /index.html 15824 0.043
grok filter
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
Then the output will contain fields like:
client: 55.3.244.1
method: GET
request: /index.html
bytes: 15824
duration: 0.043
grok usage example
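To see what the grok filter above does, the same extraction can be imitated with a named-group regular expression. The sub-patterns below are simplified stand-ins written for this sketch, not the pattern definitions shipped with Logstash, and `parse_line` is a hypothetical helper.

```python
import re

# Simplified stand-ins for the grok patterns; the real ones are more thorough.
GROK_LIKE = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "  # roughly %{IP:client}
    r"(?P<method>\w+) "                       # roughly %{WORD:method}
    r"(?P<request>\S+) "                      # roughly %{URIPATHPARAM:request}
    r"(?P<bytes>\d+) "                        # roughly %{NUMBER:bytes}
    r"(?P<duration>\d+(?:\.\d+)?)"            # roughly %{NUMBER:duration}
)


def parse_line(line: str) -> dict:
    """Return the named fields, or a failure tag if the line does not match."""
    match = GROK_LIKE.fullmatch(line)
    if match is None:
        # grok tags unparsable events in a similar way
        return {"tags": ["_grokparsefailure"]}
    return match.groupdict()


fields = parse_line("55.3.244.1 GET /index.html 15824 0.043")
```

All captured values are strings here, so numeric fields such as `bytes` would still need a type conversion before range queries or aggregations make sense.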
Scaling and high availability
Elasticsearch
Server environment for storing large-scale structured index
entries and querying them
Written in Java
Based on Apache Lucene
Uses Lucene for index creation and management
Document-oriented (structured) index entries which can (but need not) be associated
with a schema
Combines “full text”-oriented search options for text fields with more precise search
options for other types of fields, like date + time fields, geolocation fields, etc.
Near real-time search and analysis capabilities
Provides a RESTful API (JSON over HTTP)
Elasticsearch can run as one integrated application on multiple nodes of a
cluster
Indexes are stored in Lucene instances called “Shards” which can be
distributed over several nodes
Ability to subdivide your (large) index into multiple pieces
Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster
There are two types of “Shards”
Primary shards
Replica shards
Replicas of “Primary Shards” provide
Failure tolerance and therefore protect data
Make queries (searches) faster
Scalability of Elasticsearch
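The idea of splitting an index into primary shards and keeping a replica of each on another node can be sketched as follows. Conceptually, a document is routed to a shard by hashing a routing value (by default its id) modulo the number of primary shards; `zlib.crc32` is only a stand-in hash here, and `pick_shard`, `place_shards` and the node names are assumptions of this sketch.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a primary shard number.

    zlib.crc32 is only a stand-in hash chosen for this sketch.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(num_primary_shards: int, nodes: list) -> dict:
    """Put each primary on one node and its replica on a different node."""
    if len(nodes) < 2:
        raise ValueError("a replica needs a second node")
    return {
        shard: {
            "primary": nodes[shard % len(nodes)],
            "replica": nodes[(shard + 1) % len(nodes)],
        }
        for shard in range(num_primary_shards)
    }


layout = place_shards(3, ["node-a", "node-b"])
shard_of_doc = pick_shard("employee-1", 3)
```

Because a primary and its replica never share a node in this layout, losing one node still leaves a copy of every shard, and searches can be answered from either copy.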
Send JSON documents to server, e.g. use REST API
No schema necessary => Elasticsearch determines types of attributes
But it's possible to explicitly specify a schema, i.e. types for attributes
Like string, byte, short, integer, long, float, double, boolean, date
Analysis of text attributes for fulltext-oriented search
Word extraction, reduction of words to their base form (stemming)
Stop words
Support for multiple languages (including Czech, but not Slovak yet)
Automatically generates identifiers for index entries or allows
specifying them while indexing
Indexing data with Elasticsearch
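The text analysis steps listed above (word extraction, stemming, stop words) can be illustrated with a deliberately naive analyzer. It is a toy written for this sketch, not the analyzer Elasticsearch uses; the stop-word list and the suffix rules are assumptions.

```python
import re

# Tiny illustrative stop-word list (assumption of this sketch).
STOP_WORDS = {"a", "an", "and", "are", "the", "to", "of", "is", "in"}


def naive_stem(token: str) -> str:
    """Strip a few common English suffixes; real stemmers are far smarter."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list:
    """Lowercase, split into words, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = analyze("The servers are logging errors and warnings")
```

The resulting terms are what a full-text index would store, which is why a search for "error" can also find a document containing "errors".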
PUT request inserts the JSON payload into the index with name “megacorp” as object of type
“employee”
Schema for the type can be defined explicitly at index creation time or determined
automatically
Text field (e.g. “about”) will be analyzed if analyzers are configured for that field
Request URL specifies the identifier “1” for the index entry
Indexing data using the REST API
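A request of the kind described above could be assembled as in the following sketch, which only builds the request object and does not contact a server. The server address and the document's field values are assumptions made up for illustration; only the index name, type name, identifier and the "about" field come from the slide. The URL layout with a type segment follows the slide; recent Elasticsearch releases no longer use mapping types.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed address of a local test server

employee = {  # illustrative document; the values are made up
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
}


def build_index_request(index: str, doc_type: str, doc_id: int, document: dict) -> Request:
    """Build a PUT request that would store `document` under the given id."""
    return Request(
        url=f"{ES_URL}/{index}/{doc_type}/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request("megacorp", "employee", 1, employee)
# urllib.request.urlopen(request) would send it to a running server.
```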
GET /megacorp/employee/1
A “GET” REST API call with “/megacorp/employee/1” will retrieve
the entry with id 1 as JSON object
Retrieval of an index entry
GET /megacorp/employee/_search
A GET request with “_search” at the end of the URL performs a query
Search results are returned in the JSON response as the “hits” array
Further metadata specifies the count of search results (“total”) and max_score
Simple Query
GET /megacorp/employee/_search?q=last_name:Smith
Simple Query with search string
Query DSL is a JSON language for more complex queries
Will be sent as payload with the search request
Match clause has the same semantics as in simple query
More complex queries with Query DSL
Consists of a query part and a filter part
The query part matches all entries with last_name “smith” (2)
The filter then selects only the entries that fulfill the range filter (1)
“age”: {“gt” : 30 }
More complex queries with Query DSL
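A query with a match part and a range filter, as described above, can be written as a JSON payload like the one built below. The sketch uses the `bool` query with `must` and `filter` clauses, which is an assumption: the slide may show an older syntax. The sample documents and the `apply_locally` helper are made up to mimic the semantics without a server.

```python
import json


def smith_older_than(min_age: int) -> dict:
    """Query DSL body: match on last_name, then filter by an age range."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"last_name": "smith"}},
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }


def apply_locally(body: dict, docs: list) -> list:
    """Mimic the query semantics on a list of dicts (illustration only)."""
    clause = body["query"]["bool"]
    wanted = clause["must"]["match"]["last_name"]
    min_age = clause["filter"]["range"]["age"]["gt"]
    return [d for d in docs
            if d["last_name"].lower() == wanted and d["age"] > min_age]


body = smith_older_than(30)
payload = json.dumps(body, indent=2)  # what would be sent with the search request

docs = [  # made-up sample data
    {"last_name": "Smith", "age": 25},
    {"last_name": "Smith", "age": 32},
    {"last_name": "Fir", "age": 35},
]
hits = apply_locally(body, docs)
```

With this sample data the match part alone selects two entries and the range filter keeps one of them.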
Combined search on different attributes and different indices
Many possibilities for full-text search on attribute values
Exact, non-exact, proximity (phrases), partial match
Support well-known logical operators
(And / or, …)
Range queries (e.g. date ranges)
…
Control relevance and ranking of search results, sort them
Boost relevance while indexing
Boost or ignore relevance while querying
Different possibilities to sort search results otherwise
Some query possibilities
Web-based application for exploring and visualizing data
Modern Browser-based interface (HTML5 + JavaScript)
Ships with its own web server for easy setup
Seamless integration with Elasticsearch
Kibana
After installation first configure Kibana to access Elasticsearch
server(s)
Should be done by editing the Kibana config file
Then use web UI to configure indexes to use
Configure Kibana
Discover data
Create a visualization
Different types of visualizations
Combine visualizations to a Dashboard
Typical ELK use cases
Log data management and analysis
Monitor systems and/or applications and notify operators about critical
events
Collect and analyze other (mass) data
e.g. business data for business analytics
Energy management data or event data from smart grids
Environmental data
Use the ELK stack for search driven access to mass data in web-based
information systems
Some use cases of the ELK stack
Many different types of logs
Application logs
Operating system logs
Network traffic logs from routers, etc.
Different goals for analysis
Detect errors at runtime or while testing applications
Find and analyze security threats
Aggregate statistical data / metrics
Log data management and analysis
No centralization
Log data could be everywhere
on different servers and different places within the same server
Accessibility Problems
Logs can be difficult to find
Access to server / device is often difficult for analyst
High expertise for accessing logs on different platforms necessary
Logs can be big and therefore difficult to copy
SSH access and grep on logs doesn’t scale or reach
No Consistency
Structure of log entries is different for each app, system, or device
Specific knowledge is necessary for interpreting different log types
Variation in formats makes it challenging to search
Many different types of time formats
Problems of log data analysis
The ELK stack provides solutions
Logstash makes it possible to collect all log entries in a central place (e.g.
Elasticsearch)
End users don’t need to know where the log files are located
Big log files will be transferred continuously in smaller chunks
Log file entries can be transformed into harmonized event objects
Easy access for end users via Browser-based interfaces (e.g.
Kibana)
Elasticsearch / Kibana provide
advanced functionality for
analyzing and visualizing the log data
The ELK stack also provides good solutions for monitoring data
and alerting users
Logstash can check conditions on log file entries and even aggregated metrics
And conditionally send notification events to certain output plugins if monitoring criteria are
met
E.g. forward notification event to email output plugin for notifying user (e.g. operators) about the
condition
Forwarding notification event to a dedicated monitoring application
Elasticsearch in combination with Watcher (another product of Elastic)
Can instrument arbitrary Elasticsearch queries to produce alerts and notifications
These queries can be run at certain time intervals
When the watch condition is met, actions can be taken (sending an email or forwarding an event to
another system)
Monitoring
Logging and analyzing network traffic
https://operational.io/elk-stack-for-network-operations-reloaded/
How to Use ELK to Monitor Performance
http://logz.io/blog/elk-monitor-platform-performance/
How Blueliv Uses the Elastic Stack to Combat Cyber Threats
https://www.elastic.co/blog/how-blueliv-uses-the-elastic-stack-to-combat-cyber-threats
Centralized System and Docker Logging with ELK Stack
http://www.javacodegeeks.com/2015/05/centralized-system-and-docker-logging-with-elk-stack.html
Log analysis examples from the Internet
Summary
The ELK stack is easy to use and has many use cases
Log data management and analysis
Monitor systems and / or applications and notify operators about critical events
Collect and analyze other (mass) data
Providing access to big data in large scale web applications
Thereby solving many problems with these types of use cases
compared to “handmade”-solutions
Because of its service orientation and cluster readiness it fits
nicely into bigger service-oriented applications
ELK deployment made easy
Introducing CopAS
CopAS – Cops Analytic System
▪ fine-tuned production-ready framework running Elastic Platform
developed in collaboration with Police CR (PCR)
▪ Bro, Logstash, Elasticsearch, and Kibana
▪ graphical user interface (Neck)
▪ a set of pre-prepared dashboards and visualizations
▪ main emphasis on user-friendliness and ease of deployment & use
‒ employs Docker for easier deployment
‒ runs on all systems with Docker available (Windows, Linux, MacOS, …)
KIBANA vs. CopAS
CopAS container
CopAS – container management
copas ACTION [container name]
▪ a tool for CopAS container management
CopAS – example
Example:
▪ $ copas info
CopAS – old user environment (version 1.0)
CopAS – old user environment (version 2.0)
CopAS version 3.1
Main changes in workflow and GUI
New functionality
▪ large files analysis support
‒ limited only by available resources
▪ local files analysis support
▪ (g)zipped files support
▪ support for PCAPs and CSVs
▪ automated files import – CopAS WatchDog
‒ one can define monitored directories
▪ backup and restore of containers
‒ copas backup and copas import
‒ ability to move containers among different analytical systems
▪ extended Logstash configuration
▪ integrated Molo.ch analytic tool (PCAPs only)
CopAS – user environment (version 3.1)
CopAS development and future
CopAS – main development
▪ great work done by previous PV177/DataScience students
‒ K. Gutič and V. Lazárik
▪ CopAS v. 4.0 alpha – several improvements ongoing (O. Machala)
‒ modular design, unified GUI
CopAS – not only PCR tool
▪ PCR specifics are just pre-defined
visualizations, dashboards, searches, etc.
‒ without specific addons, it is a
generic ES-based data analytic tool
▪ assumes multiple input formats support
in Neck GUI
‒ (proposals for input formats welcomed)
CopAS availability
CopAS v. 3.1 installation (Linux OS)
▪ https://frakira.fi.muni.cz/~jeronimo/PV177/copas-install.tgz
CopAS v. 3.1 offline image
▪ 5.8 GB – not necessary, but easier to deploy
▪ https://frakira.fi.muni.cz/~jeronimo/PV177/copasimg-20200915.tgz
CopAS v. 4.0 alpha
▪ https://frakira.fi.muni.cz/~jeronimo/PV177/v4.0/copas-src.tar (2.2 GB)
Testing datasets:
▪ PCAPs: https://tcpreplay.appneta.com/wiki/captures.html
Graph databases
Object (Vertex, Node)
Link (Edge, Arc, Relationship)
What is a Graph?
• Formally, a graph is a collection of vertices and edges
• Less Formally Defined:
• A graph is a set of nodes, relationships, and properties
• A network of connected objects
INTRODUCTION TO THE GRAPH MODEL
(Figure label: name: bode miller)
Nodes
➢Nodes represent entities and complex types
➢Nodes can contain properties
➢Each node can have different properties
Think of nodes as documents that store properties in the form of
arbitrary key-value pairs.
INTRODUCTION TO THE GRAPH MODEL
(Figure label: Olympic_Address)
Relationships
➢Every relationship has a name and direction
➢Relationships can contain properties, which can further clarify the
relationship
➢Must have a start and end node
Relationships connect and structure nodes.
INTRODUCTION TO THE GRAPH MODEL
(Figure labels: name: bode miller; Address: 123 Fake Street; relationship Address with Type: Olympic)
Properties
➢Key value pairs used for nodes and relationships
➢Adds metadata to your nodes and relationships
➢Entity attributes
➢Relationship qualities
Allow you to add further semantics to entities and relationships.
(Figure: nodes Megan, Ross and Jack connected by three “knows” relationships; callouts: Node, Property, Relationship)
Basic Graph
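The node, relationship and property concepts introduced on the preceding slides map naturally onto two small data classes. This is a minimal sketch, not the storage model of any particular graph database; the class layout, the directions of the "knows" relationships and the `known_by` helper are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity holding arbitrary key-value properties."""
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed link with a start node, an end node and properties."""
    name: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


megan = Node({"name": "Megan"})
ross = Node({"name": "Ross"})
jack = Node({"name": "Jack"})

# Directions are assumed for illustration.
relationships = [
    Relationship("knows", megan, ross),
    Relationship("knows", megan, jack),
    Relationship("knows", ross, jack),
]


def known_by(person: Node) -> list:
    """Follow outgoing "knows" relationships from `person`."""
    return [r.end for r in relationships
            if r.name == "knows" and r.start is person]


megans_contacts = [n.properties["name"] for n in known_by(megan)]
```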
Different Kinds of Graphs
• Undirected Graph
• Directed Graph
• Pseudo Graph
• Multi Graph
• Hyper Graph
More Kinds of Graphs
• Weighted Graph
• Labeled Graph
• Property Graph
What is a Graph Database?
• A database with an explicit graph structure
• Each node knows its adjacent nodes
• As the number of nodes increases, the cost of a local step (or
hop) remains the same
• Plus an Index for lookups
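The points above (each node knows its adjacent nodes, a local hop costs the same regardless of graph size, plus an index for lookups) can be illustrated with a small in-memory structure. `TinyGraph` is a hypothetical class written for this sketch, not how any real engine stores data.

```python
from collections import defaultdict


class TinyGraph:
    def __init__(self):
        self.adjacent = defaultdict(list)  # node id -> directly linked node ids
        self.name_index = {}               # name -> node id (the lookup index)

    def add_node(self, node_id: int, name: str) -> None:
        self.name_index[name] = node_id
        self.adjacent[node_id]             # make sure the node has an entry

    def add_edge(self, a: int, b: int) -> None:
        self.adjacent[a].append(b)
        self.adjacent[b].append(a)

    def neighbours(self, node_id: int) -> list:
        # One hop only touches this node's own list, however big the graph is.
        return self.adjacent[node_id]


graph = TinyGraph()
for node_id, name in enumerate(["Megan", "Ross", "Jack"]):
    graph.add_node(node_id, name)
graph.add_edge(0, 1)
graph.add_edge(0, 2)

before = list(graph.neighbours(graph.name_index["Megan"]))

# Adding many unrelated nodes does not change the work needed for the hop.
for extra in range(3, 1003):
    graph.add_node(extra, f"user-{extra}")

after = list(graph.neighbours(graph.name_index["Megan"]))
```

The index is used once to find the starting node; every further step follows the adjacency list instead of searching the whole data set.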
Relational Databases
Relational To Graph Databases …
Graph Databases
(Figure: property graph)
Nodes: {Name: Ross, Age: 34}, {Name: Jack, Age: 7}, {Type: Activity, Name: Martial Arts}
Relationships: {Label: Knows, Since: 5/20/2006}, {Label: Knows, Since: 5/20/2008}, {Label: isMember, Since: 1/20/2014}, {Label: Member}, {Label: isMember, Since: 6/15/2013}, {Label: Member}
Another graph example
• Each entity table is represented by a label on nodes
• Each row in an entity table is a node
• Columns on those tables become node properties
• Join tables are transformed into relationships, columns on
those tables become relationship properties
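The mapping rules above can be tried out on plain Python data. The table contents, column names and helper functions below are made up for this sketch; only the rules themselves (table to label, row to node, columns to properties, join table to relationships) come from the slide.

```python
# Made-up relational data: one entity table and one join table.
person_rows = [
    {"id": 1, "name": "Bill"},
    {"id": 2, "name": "Ann"},
]
friendship_rows = [
    {"person_id": 1, "friend_id": 2, "since": 2015},
]


def rows_to_nodes(label: str, rows: list, key: str = "id") -> dict:
    """Each row becomes a node; the table name becomes its label."""
    return {
        row[key]: {
            "label": label,
            "properties": {k: v for k, v in row.items() if k != key},
        }
        for row in rows
    }


def join_rows_to_relationships(rel_type: str, rows: list,
                               start_col: str, end_col: str) -> list:
    """Each join-table row becomes a relationship; other columns become properties."""
    return [
        {
            "type": rel_type,
            "start": row[start_col],
            "end": row[end_col],
            "properties": {k: v for k, v in row.items()
                           if k not in (start_col, end_col)},
        }
        for row in rows
    ]


nodes = rows_to_nodes("Person", person_rows)
rels = join_rows_to_relationships("FRIENDS_WITH", friendship_rows,
                                  "person_id", "friend_id")
```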
GRAPH DB VS RELATIONAL DB
Pros:
✓Easy to query
✓Ability to connect disparate data easily
without needing a common data model
Cons:
▪Requires a different way to think about
data
▪No single graph query language
GRAPH DATABASES: PROS AND CONS
WHEN TO USE / NOT USE GRAPH DBS?
Graph DBs are great for:
▪ data, which are connected and/or where relationships matter
▪ data, which you want to query using various graph algorithms
but not ideal for:
▪ massive graph traversing (scanning all nodes), which they are not optimized for
‒ MATCH (n) WHERE n.name = 'Jenifer' RETURN n
‒ it will work, but the performance will not be very good
• but great for particular graph traversing like
MATCH (n:Person {name: 'Jenifer'})-[r:KNOWS]->(p:Person) RETURN p
Neo4j vs. RDBMS (book “Neo4j in Action”)
Example: in a social network, find all the friends of a user’s
friends. Even more so, for friends of friends of friends.
▪ 1,000,000 users, query for 1,000 users
▪ max. time 1 hour
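The friends-of-friends query used in this comparison is a level-by-level traversal. The sketch below shows the idea on a made-up friendship map; `friends_at_depth` and the sample names are assumptions, and the code says nothing about the timings reported in the book.

```python
def friends_at_depth(friends: dict, start: str, depth: int) -> set:
    """Return everyone whose shortest distance from `start` is exactly `depth` hops."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        # Expand one hop from the current frontier, skipping people already reached.
        frontier = {f for person in frontier
                    for f in friends.get(person, ())} - seen
        seen |= frontier
    return frontier


friends = {  # made-up sample data
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave", "erin"],
    "dave": ["bob", "carol", "frank"],
    "erin": ["carol"],
    "frank": ["dave"],
}

friends_of_friends = friends_at_depth(friends, "alice", 2)
third_level = friends_at_depth(friends, "alice", 3)
```

Each extra level only expands the current frontier, which is the access pattern a graph database serves directly from adjacency information, whereas a relational schema needs one more self-join per level.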
Popular Graph DB Engines
Cons:
➢ No native Windows installation
➢Docker could be used
Pros:
➢Open-source version available
➢Runs complex distributed queries
➢Scales out through sharded storage
➢Returns data natively in JSON, making it ideally
suited for web development
➢Written on top of GraphQL
Cons:
➢Requires more schema design up front
Pros:
➢ Multi model DB – both graph and document DB
➢ Easily add users/roles
➢ Supports multiple databases
Cons:
➢ Only one DB can be running on a single port
at a time
Pros:
➢ Open-source version available
➢ Steep learning curve, more user-friendly
➢ Runs on Windows natively - in either a
console or as a service
➢ Large and active user community
NEO4J – WHAT DOES IT PROVIDE?
✓Full ACID (atomicity, consistency, isolation, durability)
✓REST API
✓Property Graph
✓Lucene Full-Text Index
✓High Availability (with Enterprise Edition)
Node in Neo4j
Relationships in Neo4j
• Relationships between nodes are a key part of Neo4j
Twitter and relationships
Properties
• Both nodes and relationships can have properties
• Properties are key-value pairs where the key is a string
• Property values can be either a primitive or an
array of one primitive type
For example String, int and int[] values are valid for properties
Paths in Neo4j
• A path is one or more nodes with connecting relationships,
typically retrieved as a query or traversal result
Creating a small graph
Print the data
Remove the data
Graph DB example
SELECT
Me.PersonId AS MeId,
Me.Name,
FriendOfFriend.RelatedPersonId AS SuggestedFriendId,
FriendOfAFriend.Name
FROM
Person AS Me
INNER JOIN
PersonRelationship AS MyFriends
ON MyFriends.PersonId = Me.PersonId
INNER JOIN
PersonRelationship AS FriendOfFriend
ON MyFriends.RelatedPersonId = FriendOfFriend.PersonId
INNER JOIN
Person AS FriendOfAFriend
ON FriendOfFriend.RelatedPersonId = FriendOfAFriend.PersonId
LEFT JOIN
PersonRelationship AS FriendsWithMe
ON Me.PersonId = FriendsWithMe.PersonId
AND FriendOfFriend.RelatedPersonId = FriendsWithMe.RelatedPersonId
INNER JOIN
PersonDisease
ON PersonDisease.PersonId = FriendOfAFriend.PersonId
WHERE
FriendsWithMe.PersonId IS NULL
AND Me.PersonId <> FriendOfFriend.RelatedPersonId
AND Me.Name = 'Bill'
AND PersonDisease.DiseaseId = 1
FIND FRIENDS OF FRIENDS THAT HAVE TYPE 1 DIABETES – RDBMS
NEO4J MODEL
MATCH (user:Person {name:'Bill'})-[:FRIENDS_WITH*2..5]->(fof)-[:DIAGNOSED_WITH]->(disease)
RETURN fof
FIND FRIENDS OF FRIENDS THAT HAVE TYPE 1 DIABETES – GRAPHDB
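What the variable-length pattern `*2..5` does can be approximated in Python: follow FRIENDS_WITH edges hop by hop and, from the second to the fifth hop, keep the people who have a diagnosis. This is a simplified sketch (it does not reproduce Cypher's rule that a relationship is used at most once per path), and the sample data, the function name and the explicit disease argument are assumptions added to match the slide title.

```python
def fof_with_diagnosis(friends_with: dict, diagnosed_with: dict,
                       start: str, disease: str,
                       min_hops: int = 2, max_hops: int = 5) -> set:
    """People reachable in min_hops..max_hops steps who have `disease`."""
    frontier = {start}
    found = set()
    for hops in range(1, max_hops + 1):
        # Follow directed FRIENDS_WITH edges one more hop.
        frontier = {f for person in frontier
                    for f in friends_with.get(person, ())}
        if hops >= min_hops:
            found |= {p for p in frontier
                      if disease in diagnosed_with.get(p, ())}
    return found


friends_with = {  # made-up sample data
    "Bill": ["Ann"],
    "Ann": ["Carl", "Dana"],
    "Carl": ["Eve"],
}
diagnosed_with = {
    "Ann": ["Type 1 Diabetes"],
    "Carl": ["Type 1 Diabetes"],
    "Eve": ["Asthma"],
}

suggested = fof_with_diagnosis(friends_with, diagnosed_with,
                               "Bill", "Type 1 Diabetes")
```

Ann has the diagnosis but is a direct friend (one hop), so she is not returned; Carl is two hops away and is.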