{ "cells": [ { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "# Tutorial 01 – Working with Data" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "In this tutorial we will look at how to represent, manipulate, store, and load data. We will use two main packages, `NumPy` and `pandas`. The actual representation of data is handled by `NumPy`, which uses n-dimensional array objects to store the data. Part of its code is written in C, C++, and Fortran to make it fast. `pandas` is a high-level abstraction that simplifies data manipulation such as selecting a specific subset of data, grouping, and basic plotting. Mastering just these two packages almost qualifies you as a data scientist.\n", "\n", "While neither of these packages does any machine learning itself, you can go quite a long way with them. `NumPy` alone is well suited to linear algebra such as matrix multiplication, and it is an essential requirement for other machine learning packages. `pandas` is great for exploring, storing, and loading data.\n", "\n", "Remember, nicely prepared data are the key to success in machine learning." ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "## NumPy" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "`NumPy` is typically imported with the alias `np` to save some characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "You can create a simple array directly from a list."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "alist = [1,2,3,4]\n", "array = np.array(alist)\n", "array" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Arrays and lists behave quite similarly in terms of indexing and slicing. Operations on arrays are quite different, though: arrays behave more like vectors (which they are, after all). Another key difference is that a `NumPy` array has a fixed size, so it cannot be easily extended with new elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "print(array[1])\n", "print(array[1::2])\n", "print(array * 2)\n", "print(array + 10)\n", "print(array + array)\n", "print(array * array)\n", "# array.append(10)  # this does not work: arrays have no append method" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Notice that the math operations are done element-wise. Multiplying `array` by itself results in a vector of squared elements. If you want a linear algebra operation such as the dot product, you need to call a method such as `dot`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "array.dot(array)" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Now it's time to move to higher dimensions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "matrix = np.array([\n", " [1,2,3],\n", " [4,5,6],\n", " [7,8,9]])\n", "matrix" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Multidimensional arrays can also be indexed in each dimension.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "matrix[1:, :2]" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "You can even use array of booleans to take elements on indices with value True." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "matrix[[True, False, True]]" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Often you will need to create larger arrays with some specific values and you would not want to manually fill all the numbers. Luckily, there are handful of methods at your disposal." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "print(np.zeros(10))\n", "print(np.full([2, 4], fill_value=[2, 1]))\n", "print(np.identity(10))" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Now, every array has its shape, i.e., dimensions. They are always represented as tuples and each dimension is called axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "print(array.shape)\n", "print(np.identity(10).shape)" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Note that when creating an array you can supply any wild shape you want the new array to have. That's right, you can make a tensor." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "np.ones(shape=(3, 5, 7))" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Every array can be reshaped into any shape you like as long as the number of elements is equal to the product of array dimensions." ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "