General Environment Setup
Seminar 1 of NoSQL Databases (PA195)
Luděk Bártek, Vlastislav Dohnal
Faculty of Informatics, Masaryk University, Brno
Agenda
● Stratus – cloud computing platform
● Virtual machines
● Simple example in Hadoop Framework
● MapReduce and Spark (as a subset of Tutorial 1)
2
Seminar Organization
● 2-3 tasks solved in groups of 3-4 students
● groups may vary from seminar to seminar
● Each task will practice a NoSQL technology
● on a “real-life” example
● You will typically use a cluster to solve the task.
● Cluster will be formed by the machines of students in the
group.
● Teacher will check completing the task within the
seminar.
● You must succeed in all assigned tasks! 3
Faculty account vs IS account
● Faculty of Informatics has their own user accounts
● We will reference it as faculty credentials or FI credentials.
● See the general information of user logins
● Briefly, the login is generated automatically according to the faculty
relationship
● xlastname – internal students (FI branch),
● xučo – external students (from another faculty, ERASMUS, …).
● If you do not know your FI’s password, please use this IS app
● https://is.muni.cz/auth/system/heslo_fi
4
Stratus – cloud platform @FI
● Uses OpenNebula cloud and edge computing
platform
● Setup you access to Stratus.FI
⚫ For details see info on FI Technical Info
page
⚫ Log into to the stratus.fi.muni.cz
⚫ using FI credentials.
5
Stratus.fi.muni.cz
● Firstly, setup SSH keys:
● Generate an ssh key pair (if you do not have any yet):
musa$ ssh-keygen OR aisa$ ssh-keygen
■ by default, stored in $HOME/.ssh/id_rsa and
$HOME/.ssh/id_rsa.pub
■ In stratus’ menu, navigate to Settings > Auth tab
■ Edit Public SSH Key
■ Copy&paste the contents of $HOME/.ssh/id_rsa.pub
■ The private key is then used to log into VM as root
● Do not use root password setup please.
6
Creating the Hadoop Server
● On the left, select Templates and VMs
● Locate the template “PA195-hadoop-single”
● Select it and click “Instantiate”
● Go to the menu Instances > VMs to find your new VM.
● Wait for the ready state
● Log into the virtual server as root:
● $ ssh root@<VMs_IP>
● It uses the preconfigured SSH key set in the your user profile at stratus.
7
musa (local PC)
HDFS DFS (1)
● HDFS system monitoring & basic commands
$ hdfs dfs -help
● Documentation of HDFS DFS file system commands
● Get some data (complete Shakespeare's plays)
# su - hadoop
$ wget https://is.muni.cz/go/zp93wh -O shake.txt
$ hdfs dfs -put shake.txt
8
stratus (VM)
HDFS DFS (2)
• Other hdfs commands: file list, file removal, directory creation
• (you may not perform them)
# su - hadoop
$ hdfs dfs -ls
$ hdfs dfs -rm shake.txt
$ hdfs dfs -mkdir input
Check HDFS files in the web browser
http://<VM_ip>:9870/explorer.html#/user
9
musa (local PC)
stratus (VM)
MapReduce using Spark
• Spark is a multi-language engine for executing data
engineering, data science, and machine learning on
single-node machines or clusters.
• Installed in your VM: doc
Task: Calculate word frequency in a document, e.g.,
shake.txt
10
Spark: Simple Example
# su - hadoop
$ spark-shell --master yarn
scala> :help
scala> val file =
sc.textFile("hdfs:///user/hadoop/shake.txt")
scala> val counts = file
.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey(_ + _)
scala> counts.saveAsTextFile("spark-output")
scala> :quit
$ hdfs dfs -get spark-output/
11
stratus (VM)
Lessons Learned & Cleanup
What lessons did we take from the following?
● Basic work with the HDFS distributed file system
● Hadoop MapReduce using Spark
○ simple word count
Delete large files from both HDFS and the your home dir in
VM, and shutdown you Stratus VM, if not needed anymore.
# su - hadoop
$ hdfs dfs -rm -R wiki-input/
$ hdfs dfs -rm -R output
12
stratus (VM)