PV177 – DataScience seminar (ELK stack and Graph DBs)
Tomáš Rebok
Institute of Computer Science (Ústav výpočetní techniky), MU
L04 (ELK and GraphDBs) - PV177/DataScience

Data analysis with ELK framework
[Figure labels: Logstash, Log, Kibana, Elasticsearch]

ELK Software Stack
▪ ELK consists of three open source software products provided by the company “Elastic” (formerly Elasticsearch)
  ‒ E => Elasticsearch (highly scalable search index server)
  ‒ L => Logstash (tool for the collection, enrichment, filtering and forwarding of data, e.g. log data)
  ‒ K => Kibana (tool for the exploration and visualization of data)

ELK Architecture
[Figures: architecture diagrams]

Logstash
▪ Open source software to collect, transform, filter and forward data (e.g. log data) from input sources to output destinations (e.g. Elasticsearch)
▪ Implemented in JRuby and runs on a JVM (Java Virtual Machine)
▪ Simple message-based architecture
▪ Extendable by plugins (e.g. input, output, filter plugins)

Configuration
▪ Multiple inputs of different types
▪ Forward to multiple outputs
▪ Conditionally filter and transform data; some common formats are already known

Console output when processing Apache log files

Configuration for parsing syslog messages
▪ Input plugins receive messages directly from TCP and UDP ports
▪ Filter splits messages and adds fields

Console output when processing syslog messages

Input Plugins
▪ file -> for processing files
▪ tcp, udp, unix -> reading directly from network sockets
▪ http -> for processing HTTP POST requests
▪ http_poller -> for polling HTTP services as input sources
▪ imap -> accessing and processing IMAP mail
▪ Different input plugins to access MOM (“Message-Oriented Middleware”, message queues)
  ‒ rabbitmq, stomp, …
▪ Different plugins for accessing database systems
  ‒ jdbc, elasticsearch, …
▪ Plugins to read data from system log services and from the command line
  ‒ syslog, eventlog, pipe, exec
▪ and more …

▪ The “Elastic Beats” framework allows forwarding input from a set of “data sources” to a Logstash instance for processing
  ‒ Filebeat, Packetbeat, Winlogbeat, Metricbeat, Functionbeat, etc.
Elastic Beats framework + Beats plugin
▪ The “Beats plugin” can then be configured to consume messages from “Elastic Beats”
▪ Transfer can be secured by a security certificate and encrypted transmission
  ‒ authentication and confidentiality

Output plugins
▪ stdout, pipe, exec -> show output on console, feed to a command
▪ file -> store output in file
▪ email -> send output as email
▪ tcp, udp, websocket -> send output over network connections
▪ http -> send output as HTTP request
▪ Different plugins for sending output to database systems, index servers or cloud storage
  ‒ elasticsearch, solr_http, mongodb, google_bigquery, google_cloud_storage, opentsdb
▪ Different output plugins to send output to MOM (message queues)
  ‒ rabbitmq, stomp, …
▪ Different output plugins for forwarding messages to metrics applications
  ‒ graphite, graphtastic, ganglia, metriccatcher

Multiple node writes
▪ The Logstash output plugin can write to multiple Elasticsearch nodes
▪ It will distribute output objects to different nodes (“load balancing”)
▪ A Logstash instance can also be part of an Elasticsearch cluster and write data through the cluster protocol

Filter plugins
▪ grok -> parse and structure arbitrary text: best generic option to interpret text as (semi-)structured objects
  ‒ alternative: dissect (faster, but does not use regular expressions)
▪ filters for parsing different data formats
  ‒ csv, json, kv (key-value paired messages), xml, …
▪ multiline -> collapse multiline messages into one Logstash event
▪ split -> split multiline messages into several Logstash events
▪ aggregate -> aggregate several separate message lines into one Logstash event
▪ mutate -> perform mutations of fields (rename, remove, replace, modify)
▪ dns -> look up DNS entry for IP address
▪ geoip -> find geolocation of IP address
▪ and more

grok usage example
Input:
  55.3.244.1 GET /index.html 15824 0.043
grok filter:
  filter {
    grok {
      match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
    }
  }
Then the output will contain fields like:
  client: 55.3.244.1
  method: GET
  request: /index.html
  bytes: 15824
  duration: 0.043

Scaling and high availability

Elasticsearch
▪ Server environment for storing large-scale structured index entries and querying them
▪ Written in Java
▪ Based on Apache Lucene
  ‒ Uses Lucene for index creation and management
▪ Document-oriented (structured) index entries which can (but need not) be associated with a schema
▪ Combines “full text”-oriented search options for text fields with more precise search options for other types of fields, like date + time fields, geolocation fields, etc.
▪ Near real-time search and analysis capabilities
▪ Provides a RESTful API as JSON over HTTP

Scalability of Elasticsearch
▪ Elasticsearch can run as one integrated application on multiple nodes of a cluster
▪ Indexes are stored in Lucene instances called “shards”, which can be distributed over several nodes
  ‒ Ability to subdivide your (large) index into multiple pieces
  ‒ Each shard is in itself a fully functional and independent "index" that can be hosted on any node in the cluster
▪ There are two types of shards
  ‒ Primary shards
  ‒ Replicas
▪ Replicas of primary shards
  ‒ provide failure tolerance and therefore protect data
  ‒ make queries (searches) faster

Indexing data with Elasticsearch
▪ Send JSON documents to the server, e.g. using the REST API
▪ No schema necessary => Elasticsearch determines types of attributes
▪ But it is possible to explicitly specify a schema, i.e. types for attributes
  ‒ such as string, byte, short, integer, long, float, double, boolean, date
▪ Analysis of text attributes for full-text-oriented search
  ‒ Word extraction, reduction of words to their base form (stemming)
  ‒ Stop words
  ‒ Support for multiple languages (including Czech, but not Slovak yet)
▪ Automatically generates identifiers for data sets or allows specifying them while indexing

Indexing data using the REST API
▪ PUT request inserts the JSON payload into the index named “megacorp” as an object of type “employee”
▪ Schema for the type can be explicitly defined (at the time of index creation) or automatically determined
▪ Text fields (e.g. “about”) will be analyzed if analyzers are configured for that field
▪ Request URL specifies the identifier “1” for the index entry

Retrieval of an index entry
GET /megacorp/employee/1
▪ A “GET” REST API call with “/megacorp/employee/1” will retrieve the entry with id 1 as a JSON object

Simple Query
GET /megacorp/employee/_search
▪ GET request with “_search” at the end of the URL performs a query
▪ Search results are returned in the JSON response as a “hits” array
▪ Further metadata specifies the count of search results (“total”) and max_score

Simple Query with search string
GET /megacorp/employee/_search?q=last_name:Smith

More complex queries with Query DSL
▪ Query DSL is a JSON language for more complex queries
▪ Will be sent as payload with the search request
▪ Match clause has the same semantics as in the simple query

▪ A more complex query consists of a query part and a filter part
▪ Query part matches all entries with last_name “smith” (2)
▪ Filter will then only select entries which fulfill the range filter (1)
  ‒ “age”: {“gt” : 30 }
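The query-plus-filter request can be sketched in Python as a plain dictionary that would be sent as the JSON payload of the `_search` call. This is a minimal sketch under stated assumptions: the slide's own JSON is not reproduced in the text, so the `bool` / `must` / `filter` layout is chosen here because recent Elasticsearch versions express "query part plus filter part" that way. The helper names `build_search_body` and `apply_locally`, and the three sample employees, are hypothetical and only illustrate the semantics; nothing is sent to a server.

```python
import json


def build_search_body(last_name, min_age):
    """Build a Query DSL body: a scoring match clause plus a non-scoring range filter."""
    return {
        "query": {
            "bool": {
                # Query part: full-text match on last_name
                "must": {"match": {"last_name": last_name}},
                # Filter part: keep only entries with age > min_age
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }


def apply_locally(body, docs):
    """Toy evaluator for exactly the body shape built above (not a real search engine).

    Assumption: a single-word match is approximated by case-insensitive equality.
    """
    bool_query = body["query"]["bool"]
    match_field, wanted = next(iter(bool_query["must"]["match"].items()))
    range_field, condition = next(iter(bool_query["filter"]["range"].items()))
    return [
        doc for doc in docs
        if str(doc[match_field]).lower() == str(wanted).lower()
        and doc[range_field] > condition["gt"]
    ]


# Hypothetical sample data, for illustration only
employees = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]

body = build_search_body("smith", 30)
print(json.dumps(body, indent=2))  # payload for GET /megacorp/employee/_search

hits = apply_locally(body, employees)
print([doc["first_name"] for doc in hits])
```

With this sample data the match clause alone selects two entries and the range filter narrows them to one, which mirrors the (2) and (1) counts mentioned for the slide's example.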
Some query possibilities
▪ Combined search on different attributes and different indices
▪ Many possibilities for full-text search on attribute values
  ‒ Exact, non-exact, proximity (phrases), partial match
▪ Support for well-known logical operators (AND / OR, …)
▪ Range queries (e.g. date ranges)
▪ …
▪ Control relevance and ranking of search results, sort them
  ‒ Boost relevance while indexing
  ‒ Boost or ignore relevance while querying
  ‒ Different possibilities to sort search results otherwise

Kibana
▪ Web-based application for exploring and visualizing data
▪ Modern browser-based interface (HTML5 + JavaScript)
▪ Ships with its own web server for easy setup
▪ Seamless integration with Elasticsearch

Configure Kibana
▪ After installation, first configure Kibana to access the Elasticsearch server(s)
  ‒ Should be done by editing the Kibana config file
▪ Then use the web UI to configure the indexes to use

Discover data

Create a visualization

Different types of visualizations

Combine visualizations into a Dashboard

Typical ELK use cases

Some use cases of the ELK stack
▪ Log data management and analysis
▪ Monitor systems and/or applications and notify operators about critical events
▪ Collect and analyze other (mass) data
  ‒ e.g. business data for business analytics
  ‒ Energy management data or event data from smart grids
  ‒ Environmental data
▪ Use the ELK stack for search-driven access to mass data in web-based information systems

▪ Many different types of logs
  ‒ Application logs
  ‒ Operating system logs
  ‒ Network traffic logs from routers, etc.
Log data management and analysis
▪ Different goals for analysis
  ‒ Detect errors at runtime or while testing applications
  ‒ Find and analyze security threats
  ‒ Aggregate statistical data / metrics

Problems of log data analysis
▪ No centralization
  ‒ Log data could be everywhere: on different servers and in different places within the same server
▪ Accessibility problems
  ‒ Logs can be difficult to find
  ‒ Access to server / device is often difficult for the analyst
  ‒ High expertise is necessary for accessing logs on different platforms
  ‒ Logs can be big and therefore difficult to copy
  ‒ SSH access and grep on logs doesn’t scale or reach
▪ No consistency
  ‒ Structure of log entries is different for each app, system, or device
  ‒ Specific knowledge is necessary for interpreting different log types
  ‒ Variation in formats makes it challenging to search
  ‒ Many different types of time formats

The ELK stack provides solutions
▪ Logstash allows collecting all log entries at a central place (e.g. Elasticsearch)
  ‒ End users don’t need to know where the log files are located
  ‒ Big log files will be transferred continuously in smaller chunks
▪ Log file entries can be transformed into harmonized event objects
▪ Easy access for end users via browser-based interfaces (e.g. Kibana)
▪ Elasticsearch / Kibana provide advanced functionality for analyzing and visualizing the log data

Monitoring
▪ The ELK stack also provides good solutions for monitoring data and alerting users
▪ Logstash can check conditions on log file entries and even aggregated metrics
  ‒ And conditionally send notification events to certain output plugins if monitoring criteria are met
  ‒ E.g. forward a notification event to the email output plugin to notify users (e.g. operators) about the condition
  ‒ Forward a notification event to a dedicated monitoring application
▪ Elasticsearch in combination with Watcher (another product of Elastic)
  ‒ Can instrument arbitrary Elasticsearch queries to produce alerts and notifications
  ‒ These queries can be run at certain time intervals
  ‒ When the watch condition is met, actions can be taken (sending an email or forwarding an event to another system)

Log analysis examples from the Internet
▪ Logging and analyzing network traffic
  ‒ https://operational.io/elk-stack-for-network-operations-reloaded/
▪ How to Use ELK to Monitor Performance
  ‒ http://logz.io/blog/elk-monitor-platform-performance/
▪ How Blueliv Uses the Elastic Stack to Combat Cyber Threats
  ‒ https://www.elastic.co/blog/how-blueliv-uses-the-elastic-stack-to-combat-cyber-threats
▪ Centralized System and Docker Logging with ELK Stack
  ‒ http://www.javacodegeeks.com/2015/05/centralized-system-and-docker-logging-with-elk-stack.html

Summary
▪ The ELK stack is easy to use and has many use cases
  ‒ Log data management and analysis
  ‒ Monitor systems and / or applications and notify operators about critical events
  ‒ Collect and analyze other (mass) data
  ‒ Providing access to big data in large scale web applications
▪ Thereby solving many problems with these types of use cases compared to “handmade” solutions
▪ Because of its service orientation and cluster readiness it fits nicely into bigger service-oriented applications

ELK deployment made easy

Introducing CopAS
CopAS – Cops Analytic System
▪ fine-tuned production-ready framework running Elastic Platform developed in collaboration with Police CR (PCR)
▪ Bro, Logstash, Elasticsearch, and Kibana
▪ graphical user interface (Neck)
▪ a set of pre-prepared dashboards and visualizations
▪ main emphasis on user-friendliness and ease of deployment &
use
  ‒ employs Docker for easier deployment
  ‒ runs on all systems with Docker available (Windows, Linux, MacOS, …)

KIBANA vs. CopAS

CopAS container

CopAS – container management
copas ACTION [container name]
▪ a tool for CopAS container management

CopAS – example
Example:
▪ $ copas info

CopAS – old user environment (version 1.0)

CopAS – old user environment (version 2.0)

CopAS version 3.1
Main changes in workflow and GUI
New functionality
▪ large files analysis support
  ‒ limited only by available resources
▪ local files analysis support
▪ (g)zipped files support
▪ support for PCAPs and CSVs
▪ automated files import – CopAS WatchDog
  ‒ one can define monitored directories
▪ backup and restore of containers
  ‒ copas backup and copas import
  ‒ ability to move containers among different analytical systems
▪ extended Logstash configuration
▪ integrated Molo.ch analytic tool (PCAPs only)

CopAS – user environment (version 3.1)

CopAS development and future
CopAS – main development
▪ great work done by previous PV177/DataScience students
  ‒ K. Gutič and V. Lazárik
▪ CopAS v. 4.0 alpha – several improvements ongoing (O. Machala)
  ‒ modular design, unified GUI
CopAS – not only PCR tool
▪ PCR specifics are just pre-defined visualizations, dashboards, searches, etc.
  ‒ without specific add-ons, it is a generic ES-based data analytic tool
▪ assumes multiple input formats support in Neck GUI
  ‒ (proposals for input formats welcomed)

CopAS availability
CopAS v. 3.1 installation (Linux OS)
▪ https://frakira.fi.muni.cz/~jeronimo/PV177/copas-install.tgz
CopAS v. 3.1 offline image
▪ 5.8 GB – not necessary, but easier to deploy
▪ https://frakira.fi.muni.cz/~jeronimo/PV177/copasimg-20200915.tgz
CopAS v. 4.0 alpha
▪ https://frakira.fi.muni.cz/~jeronimo/PV177/v4.0/copas-src.tar (2.2 GB)
Testing datasets:
▪ PCAPs: https://tcpreplay.appneta.com/wiki/captures.html

Graph databases

Graph
[Figure labels: Object (Vertex, Node); Link (Edge, Arc, Relationship)]
What is a Graph?
• Formally, a graph is a collection of vertices and edges
• Less formally defined:
  • A graph is a set of nodes, relationships, and properties
  • A network of connected objects

INTRODUCTION TO THE GRAPH MODEL
Nodes
[Figure: node with property name: bode miller]
➢ Nodes represent entities and complex types
➢ Nodes can contain properties
➢ Each node can have different properties
Think of nodes as documents that store properties in the form of arbitrary key-value pairs.

INTRODUCTION TO THE GRAPH MODEL
Relationships
[Figure label: Olympic_Address]
➢ Every relationship has a name and direction
➢ Relationships can contain properties, which can further clarify the relationship
➢ Must have a start and end node
Relationships connect and structure nodes.
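The node and relationship rules of the graph model can be written down as a small in-memory sketch in Python. It is only an illustration of the model, not how any particular graph database stores data. The class and method names (`Node`, `Relationship`, `PropertyGraph`, `add_node`, `relate`, `outgoing`) and the relationship name `HAS_ADDRESS` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(eq=False)  # identity semantics: two nodes with equal properties stay distinct
class Node:
    # Arbitrary key-value pairs; each node may carry different properties
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass(eq=False)
class Relationship:
    name: str    # every relationship has a name ...
    start: Node  # ... a direction, given by its start node ...
    end: Node    # ... and its end node
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.relationships: List[Relationship] = []

    def add_node(self, **properties: Any) -> Node:
        node = Node(dict(properties))
        self.nodes.append(node)
        return node

    def relate(self, start: Node, name: str, end: Node, **properties: Any) -> Relationship:
        rel = Relationship(name, start, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node: Node, name: Optional[str] = None) -> List[Relationship]:
        """Relationships that start at `node`, optionally restricted to one name."""
        return [
            rel for rel in self.relationships
            if rel.start is node and (name is None or rel.name == name)
        ]


graph = PropertyGraph()
athlete = graph.add_node(name="bode miller")
place = graph.add_node(street="123 Fake Street")
# HAS_ADDRESS is a hypothetical relationship name chosen for this sketch
graph.relate(athlete, "HAS_ADDRESS", place, address_type="Olympic")

for rel in graph.outgoing(athlete):
    print(rel.start.properties, f"-[{rel.name} {rel.properties}]->", rel.end.properties)
```

Storing the start and end node on the relationship itself is what makes the direction explicit; following a relationship "backwards" is then a matter of filtering on `end` instead of `start`.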
INTRODUCTION TO THE GRAPH MODEL
Properties
[Figure labels: name: bode miller; Address: 123 Fake Street; Address Type: Olympic]
➢ Key-value pairs used for nodes and relationships
➢ Add metadata to your nodes and relationships
  • Entity attributes
  • Relationship qualities
Allow you to add further semantics to entities and relationships.

Basic Graph
[Figure: nodes Megan, Ross and Jack linked by three “knows” relationships; callouts: Node, Property, Relationship]

Different Kinds of Graphs
• Undirected Graph
• Directed Graph
• Pseudo Graph
• Multi Graph
• Hyper Graph

More Kinds of Graphs
• Weighted Graph
• Labeled Graph
• Property Graph

What is a Graph Database?
• A database with an explicit graph structure
• Each node knows its adjacent nodes
• As the number of nodes increases, the cost of a local step (or hop) remains the same
• Plus an index for lookups

Relational Databases

Relational To Graph Databases …

Graph Databases

Another graph example
[Figure labels: Name: Ross, Age: 34; Name: Jack, Age: 7; Type: Activity, Name: Martial Arts; Label: Knows, Since: 5/20/2006; Label: Knows, Since: 5/20/2008; Label: isMember, Since: 1/20/2014; Label: Member; Label: isMember, Since: 6/15/2013; Label: Member]

• Each entity table is represented by a label on nodes
• Each row in an entity table is a node
• Columns on those tables become node properties
• Join tables are transformed into relationships, columns on those tables become relationship properties

GRAPH DATABASES: PROS AND CONS
GRAPH DB VS RELATIONAL DB
Pros:
✓ Easy to query
✓ Ability to connect disparate data easily without needing a common data model
Cons:
▪ Requires a different way to think about data
▪ No single graph query language

WHEN TO USE / NOT USE GRAPH DBS?
Graph DBs are great for:
▪ data which are connected and/or where relationships matter
▪ data which you want to query using various graph algorithms
but not ideal for:
▪ massive graph traversing (they are not optimized for it)
  ‒ MATCH (n) WHERE n.name='Jenifer' RETURN n
  ‒ it will work, but the performance will not be very good
  • but great for particular graph traversing like
    MATCH (n:Person {name: 'Jenifer'})-[r:KNOWS]->(p:Person) RETURN p

Neo4j vs. RDBMS (book “Neo4j in Action”)
Example: in a social network, find all the friends of a user’s friends. Even more so, for friends of friends of friends.
▪ 1,000,000 users, query for 1,000 users
▪ max. time 1 hour

Popular Graph DB Engines

Pros:
➢ Open-source version available
➢ Runs complex distributed queries
➢ Scales out through sharded storage
➢ Returns data natively in JSON, making it ideally suited for web development
➢ Written on top of GraphQL
Cons:
➢ No native Windows installation
  ➢ Docker could be used

Pros:
➢ Multi-model DB – both graph and document DB
➢ Easily add users/roles
➢ Supports multiple databases
Cons:
➢ Requires more schema design up front

Pros:
➢ Open-source version available
➢ Easy to learn, more user-friendly
➢ Runs on Windows natively - in either a console or as a service
➢ Large and active user community
Cons:
➢ Only one DB can be running on a single port at a time

NEO4J – WHAT DOES IT PROVIDE?
✓ Full ACID (atomicity, consistency, isolation, durability)
✓ REST API
✓ Property Graph
✓ Lucene Full-Text Index
✓ High Availability (with Enterprise Edition)

Node in Neo4j

Relationships in Neo4j
• Relationships between nodes are a key part of Neo4j

Twitter and relationships

Properties
• Both nodes and relationships can have properties
• Properties are key-value pairs where the key is a string
• Property values can be either a primitive or an array of one primitive type
  • For example String, int and int[] values are valid for properties

Paths in Neo4j
• A path is one or more nodes with connecting relationships, typically retrieved as a query or traversal result

Creating a small graph

Print the data

Remove the data

Graph DB example

FIND FRIENDS OF FRIENDS THAT HAVE TYPE 1 DIABETES – RDBMS
SELECT Me.PersonId AS MeId,
       Me.Name,
       FriendOfFriend.RelatedPersonId AS SuggestedFriendId,
       FriendOfAFriend.Name
FROM Person AS Me
INNER JOIN PersonRelationship AS MyFriends
        ON MyFriends.PersonId = Me.PersonId
INNER JOIN PersonRelationship AS FriendOfFriend
        ON MyFriends.RelatedPersonId = FriendOfFriend.PersonId
INNER JOIN Person AS FriendOfAFriend
        ON FriendOfFriend.RelatedPersonId = FriendOfAFriend.PersonId
LEFT JOIN PersonRelationship AS FriendsWithMe
        ON Me.PersonId = FriendsWithMe.PersonId
       AND FriendOfFriend.RelatedPersonId = FriendsWithMe.RelatedPersonId
INNER JOIN PersonDisease
        ON PersonDisease.PersonId = FriendOfAFriend.PersonId
WHERE FriendsWithMe.PersonId IS NULL
  AND Me.PersonId <> FriendOfFriend.RelatedPersonId
  AND Me.Name = 'Bill'
  AND PersonDisease.DiseaseId = 1

NEO4J MODEL

FIND FRIENDS OF FRIENDS THAT HAVE TYPE 1 DIABETES – GRAPHDB
MATCH (user:Person {name:'Bill'})-[:FRIENDS_WITH*2..5]->(fof)-[:DIAGNOSED_WITH]->(disease)
RETURN fof
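To make the variable-length pattern `[:FRIENDS_WITH*2..5]` more concrete, the traversal can be imitated in plain Python over adjacency lists. This is a hedged sketch and not Neo4j's implementation: it walks outgoing friendships for 2 to 5 hops without reusing a relationship within one path, and keeps the end nodes that have a diagnosis. The function name `diagnosed_friends_of_friends`, the optional `disease` argument and all sample people are assumptions for illustration.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple


def diagnosed_friends_of_friends(
    friends: Dict[str, List[str]],
    diagnoses: Dict[str, Set[str]],
    start: str,
    min_hops: int = 2,
    max_hops: int = 5,
    disease: Optional[str] = None,
) -> Set[str]:
    """Collect people reachable from `start` in min_hops..max_hops friendship hops
    who have a diagnosis (any diagnosis if `disease` is None).

    Like the pattern it imitates, it does not exclude the start node or people
    who are also direct friends.
    """
    found: Set[str] = set()

    def walk(person: str, hops: int, used: FrozenSet[Tuple[str, str]]) -> None:
        if hops >= min_hops:
            conditions = diagnoses.get(person, set())
            if conditions and (disease is None or disease in conditions):
                found.add(person)
        if hops == max_hops:
            return
        for friend in friends.get(person, []):
            edge = (person, friend)
            if edge not in used:  # a relationship is used at most once per path
                walk(friend, hops + 1, used | {edge})

    walk(start, 0, frozenset())
    return found


# Hypothetical sample data: directed FRIENDS_WITH and DIAGNOSED_WITH relationships
friends = {"Bill": ["Ann", "Dan"], "Ann": ["Carl", "Eve"]}
diagnoses = {"Carl": {"Type 1 diabetes"}, "Dan": {"Type 1 diabetes"}}

result = diagnosed_friends_of_friends(friends, diagnoses, "Bill")
print(result)
```

In this sample Dan has a diagnosis but is only one hop away from Bill, so the lower bound of two hops leaves him out, while Carl is reached through Ann. Passing `disease="Type 1 diabetes"` restricts the result to that diagnosis, which corresponds to the `DiseaseId = 1` condition in the relational version.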