Safe Haskell: Safe

HW04

Description

Fourth assignment for IB016, semester spring 2019

Analysis of the Ministry of Finance's invoices

In this assignment, you'll get to parse and analyze an open dataset of invoices processed by the Ministry of Finance of the Czech Republic in 2018.

The high-level concept

This assignment tackles a real problem, with the quirks, compromises, and (sometimes) missing metadata that real data tend to have. Furthermore, you are free to run the analyses you are actually interested in, not those prescribed by us.

All that means that logic and common sense take priority over the details of the task specification. If you find it more useful to do something slightly differently, it may be allowed: ask in the discussion forum, describing what you need and why. Nevertheless, we put a lot of thought into designing the task reasonably, so in return, we ask you to think carefully before you ask for a deviation. Furthermore, we'll need to grade it afterward, and having the same interface (= data types) helps us a lot.

That being said, you can use this "real-world task" to practice combining your functional programming skills. The author's solution uses quite a few monadic operators, leverages monoids, and heavily uses functions as return values (ehm, custom structures full of functions, in fact).

Overall, use common sense, enjoy the possibilities of functional programming and learn about your homeland by examining the open data shared by the government.

Assignment overview

The assignment consists of these tasks:

  1. Locate the open data repository of the Ministry of Finance of the Czech Republic and download the CSV file with invoices paid in 2018 (to be more precise, invoices paid between 2018-01-01 and 2018-11-01).
  2. Determine the license/use conditions of the dataset and summarize them in one sentence in the submitted source code.
  3. Write a Parsec parser for the CSV file according to the specification below.
  4. Write reasonable pretty-printers for the data (example given below).
  5. Using the parsed data, perform an analysis to find out something interesting. Report at least five results. See the details below.

Parser details

Overall, the dataset contains more information than we are interested in. See the documentation of the datatypes below to know what to keep (everything else can be dropped at parse-time).

Write a parser that produces the structure given below reasonably efficiently. Linear time complexity is not needed (not even realistic, since invoices are stored in a map), but try to avoid intermediate data structures and unnecessary re-processing of the data. A bad example would be parsing the whole file into lines first, then reiterating to parse the lines into CSV fields, then passing the third time to convert dates, etc.
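The single-pass idea can be illustrated with a minimal sketch in which every field parser produces its final typed value directly, so no intermediate list of lines or raw strings is built. The three-column layout, the comma separator, the ISO date format and the names `Row`, `row` and `rows` are assumptions made for illustration only; the real file has more columns and may use a different separator, quoting and date format.

```haskell
import Data.Time.Calendar (Day, fromGregorian)
import Text.Parsec
import Text.Parsec.String (Parser)

-- | Hypothetical three-column record: invoice number, date, amount.
data Row = Row Int Day Double
  deriving (Eq, Show)

-- | An unsigned integer, converted while parsing.
number :: Parser Int
number = read <$> many1 digit

-- | A date written as YYYY-MM-DD (assumed format), built directly as a Day.
-- Note that fromGregorian clips out-of-range months and days.
date :: Parser Day
date = do
  y <- read <$> count 4 digit
  _ <- char '-'
  m <- read <$> count 2 digit
  _ <- char '-'
  d <- read <$> count 2 digit
  return (fromGregorian y m d)

-- | A non-negative amount with an optional fractional part.
amount :: Parser Double
amount = do
  whole <- many1 digit
  frac  <- option "0" (char '.' *> many1 digit)
  return (read (whole ++ "." ++ frac))

-- | One record; fields are converted as they are read.
row :: Parser Row
row = Row <$> number <* char ',' <*> date <* char ',' <*> amount

-- | The whole input: newline-terminated records, then end of input.
rows :: Parser [Row]
rows = row `endBy` newline <* eof
```

For example, `parse rows "" "1,2018-05-09,9130.00\n"` yields a single `Row` without the input ever being split into lines or string fields first.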

Your program has to work on the original, unmodified (!) file downloaded from the open data repository (this ensures replicability of your analyses and shows you have not manipulated the data). That being said, you can have "unparsed lines" after you process the file (there is a dedicated list for these, see the datatype definition below). This is allowed so that you need not bother with super-rare corner cases that you would, in reality, probably clean by hand. An example worth a thousand words: three suppliers in the dataset have newlines in their names. The ideal solution, of course, would be to allow these (and possibly other escaped characters), but since there are just three among over 7200 invoices, you need not bother. Please don't have more than a couple of such unparsed lines.
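One possible way to collect unparsed lines is sketched below: try the record parser on a line and, if it fails, keep the raw text of that line instead. The names `recordOrRaw` and `parseAll` are hypothetical, and the sketch assumes that every line (including the last one) ends with a newline and that a record never spans several lines, so a supplier name containing a newline would end up as several raw lines.

```haskell
import Data.Either (partitionEithers)
import Text.Parsec
import Text.Parsec.String (Parser)

-- | Try the record parser (which must not consume the line terminator).
-- If it fails, or does not reach the end of the line, backtrack and
-- return the raw line instead. Only the failing line is read twice.
recordOrRaw :: Parser a -> Parser (Either String a)
recordOrRaw record =
      try (Right <$> record <* newline)
  <|> (Left <$> manyTill anyChar newline)

-- | Parse the whole input into (unparsed lines, parsed records).
parseAll :: Parser a -> Parser ([String], [a])
parseAll record = partitionEithers <$> many (recordOrRaw record) <* eof
```

With a toy record parser such as `read <$> many1 digit :: Parser Int`, the input `"12\nabc\n7\n"` gives `(["abc"], [12, 7])`.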

As for the pretty-printers, use the format that seems the most understandable to you when doing your analyses. An example of mine is provided below.

Faktura č. 1888800244 (Z)
    Dodavatel:    Úřad pro zastupování státu ve věcech majetkových (IČO 69797111)
    Suma:         9130.00 Kč (vystaveno 2015-07-01, splatnost 2018-05-15)
    Zaplaceno:    9130.00 Kč (  přijato 2018-05-09, zaplaceno 2018-05-04)
    Z rozpočtu:   1300.00 Kč (Studená voda), 3300.00 Kč (Teplo), 4000.00 Kč (Elektrická energie), 530.00 Kč (Teplá voda)
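A small sketch of how the money amounts and the last line of the example above could be rendered with `Text.Printf` from base. The helper names `showMoney` and `showSubBudgets` are hypothetical, and the sub-budget names are assumed to have been looked up already.

```haskell
import Data.List (intercalate)
import Text.Printf (printf)

type Money = Double

-- | Render an amount with two decimal places and the currency suffix.
showMoney :: Money -> String
showMoney = printf "%.2f Kč"

-- | Render the "Z rozpočtu" line from (amount, sub-budget name) pairs.
showSubBudgets :: [(Money, String)] -> String
showSubBudgets parts =
  "    Z rozpočtu:   "
    ++ intercalate ", " [ showMoney m ++ " (" ++ name ++ ")" | (m, name) <- parts ]
```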

Analysis details

As for the analyses themselves, you are not constrained. Run what you find intriguing and report a summary in the source file you hand in via the IS. Note, however, that the source code producing your stats has to be included in that source file! I.e., stats reported in free text without the accompanying Haskell code will not be counted as valid (as you could have generated them using Excel on the same dataset).

Optionally, paste the summary of the findings (without the source code) to the discussion forum in the IS (even before the assignment deadline). Remember, one of the aims of the task is to get to know the economics of your ministry.

Inspiration for possible analyses:

  • Which invoices did the ministry pay to Masaryk University?
  • What ratio of invoices did the ministry pay overdue? (Remember that some invoices may have been delivered already overdue.)
  • What are the sums for individual sub-budgets?
  • To which supplier did the ministry pay the most?
  • How much did the ministry pay to the companies owned by the current prime minister?
  • And so on...
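Two of the ideas above (the ratio of invoices paid overdue and the sums for individual sub-budgets) can be sketched as follows. The `Invoice` record here is a cut-down stand-in containing only the documented fields the sketch needs, the plain `SubBudgetID` constructor is an assumption, and `paidLate` encodes just one possible reading of "paid overdue".

```haskell
import qualified Data.Map.Strict as Map
import Data.Time.Calendar (Day, fromGregorian)

newtype SubBudgetID = SubBudgetID Int
  deriving (Eq, Ord, Show)

type Money = Double

-- | Cut-down stand-in for the documented Invoice record.
data Invoice = Invoice
  { dateDelivered :: Day
  , dateDue       :: Day
  , datePaid      :: Day
  , subBudgets    :: [(Money, SubBudgetID)]
  }

-- | Paid after the due date although it was delivered by the due date.
paidLate :: Invoice -> Bool
paidLate i = datePaid i > dateDue i && dateDelivered i <= dateDue i

-- | Fraction of invoices paid late; 0 for an empty list.
overdueRatio :: [Invoice] -> Double
overdueRatio []   = 0
overdueRatio invs =
  fromIntegral (length (filter paidLate invs)) / fromIntegral (length invs)

-- | Total amount paid from each sub-budget.
subBudgetTotals :: [Invoice] -> Map.Map SubBudgetID Money
subBudgetTotals invs =
  Map.fromListWith (+) [ (b, m) | i <- invs, (m, b) <- subBudgets i ]
```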

Bonuses

During grading, you can get up to 3 bonus points for extra work (either other code features or better analysis). These are assigned subjectively but will probably apply if you do any of the following:

  • Perform larger or more complicated analyses of the dataset.
  • Perform analyses on the extended dataset (e.g., incorporating invoice data from other years).
  • Use some advanced concept from the other seminars in the solution (e.g., lenses for manipulating the data structures, monoids, appropriate language extensions, ...)
  • Note: If you decide to use lenses, feel free to rename record field names to start with an underscore. You may also consider using the package lens-datetime.

Modules and packages

You can use any module from the packages base and containers. For parsing, use the package parsec. For working with dates, use the package time. If you wish, you can also use Unicode syntax from unicode-prelude.

In case you feel the need to use some other package (especially in the analytical part), it's probably OK. However, double-check with the assignment author in the discussion forum first.

Documentation

data InvoiceData #

The high-level data structure for all parsed data.

Constructors

InvoiceData 

Fields

type Invoices = Map InvoiceID Invoice #

Invoices are stored in a map keyed by invoice ID ([CISLO]). Beware, the dataset contains multiple lines with the same invoice ID. As these differ only in the sub-budget payment, merge them together (keeping all the sub-budget information).
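The merging of duplicate lines can be sketched with `Map.insertWith`, which combines a newly inserted value with the one already stored under the same key. The `Invoice` record is a cut-down stand-in, the plain newtype constructors are assumptions, and `mergeInvoice` and `addLine` are hypothetical names.

```haskell
import qualified Data.Map.Strict as Map

newtype InvoiceID = InvoiceID Int
  deriving (Eq, Ord, Show)

newtype SubBudgetID = SubBudgetID Int
  deriving (Eq, Ord, Show)

type Money = Double

-- | Cut-down stand-in for the documented Invoice record.
data Invoice = Invoice
  { amountDue  :: Money
  , subBudgets :: [(Money, SubBudgetID)]
  } deriving (Eq, Show)

-- | Keep the metadata of the invoice already stored and append the
-- sub-budget payments of the newly parsed line.
mergeInvoice :: Invoice -> Invoice -> Invoice
mergeInvoice new old = old { subBudgets = subBudgets old ++ subBudgets new }

-- | Add one parsed line to the map, merging on a repeated invoice ID.
-- insertWith calls its function as (f newValue oldValue).
addLine :: InvoiceID -> Invoice -> Map.Map InvoiceID Invoice -> Map.Map InvoiceID Invoice
addLine = Map.insertWith mergeInvoice
```

Appending with `++` is linear in the number of sub-budgets already stored, which should be harmless here as long as a single invoice has only a handful of them.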

type Suppliers = Map ICO String #

Supplier names ([DODAVATEL]) are stored in a map keyed by their IČO ([ICO]). Suppliers without the ICO identification are not stored here.
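Since an invoice refers to its supplier either by name or by IČO, printing a supplier usually means resolving the IČO through this map. A minimal sketch, assuming a plain `ICO` constructor and a hypothetical helper `supplierName` with an arbitrary fallback text:

```haskell
import qualified Data.Map as Map

newtype ICO = ICO Int
  deriving (Eq, Ord, Show)

type Suppliers = Map.Map ICO String

-- | Resolve the supplier of an invoice to a printable name: a Left value
-- already carries the name, a Right value is looked up by its IČO.
supplierName :: Suppliers -> Either String ICO -> String
supplierName _         (Left name) = name
supplierName suppliers (Right ico) =
  Map.findWithDefault "(unknown supplier)" ico suppliers
```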

type Budgets = Map SubBudgetID String #

Sub-budget names ([NAZEVPOLOZKYROZPOCTU]) are stored in a map keyed by the budget ID ([POLOZKAROZPOCTU]).

newtype InvoiceID #

Invoice ID ([CISLO]) is internally an Int but is wrapped in a newtype to ensure type safety.

Constructors

InvoiceID 

Fields

Instances
Eq InvoiceID # 
Instance details

Defined in HW04

Methods

(==) :: InvoiceID -> InvoiceID -> Bool

(/=) :: InvoiceID -> InvoiceID -> Bool

Ord InvoiceID # 
Instance details

Defined in HW04

Methods

compare :: InvoiceID -> InvoiceID -> Ordering

(<) :: InvoiceID -> InvoiceID -> Bool

(<=) :: InvoiceID -> InvoiceID -> Bool

(>) :: InvoiceID -> InvoiceID -> Bool

(>=) :: InvoiceID -> InvoiceID -> Bool

max :: InvoiceID -> InvoiceID -> InvoiceID

min :: InvoiceID -> InvoiceID -> InvoiceID

Show InvoiceID # 
Instance details

Defined in HW04

Methods

showsPrec :: Int -> InvoiceID -> ShowS

show :: InvoiceID -> String

showList :: [InvoiceID] -> ShowS

newtype ICO #

Supplier IČO ([ICO]) is internally an Int but is wrapped in a newtype to ensure type safety.

Constructors

ICO 

Fields

Instances
Eq ICO # 
Instance details

Defined in HW04

Methods

(==) :: ICO -> ICO -> Bool

(/=) :: ICO -> ICO -> Bool

Ord ICO # 
Instance details

Defined in HW04

Methods

compare :: ICO -> ICO -> Ordering

(<) :: ICO -> ICO -> Bool

(<=) :: ICO -> ICO -> Bool

(>) :: ICO -> ICO -> Bool

(>=) :: ICO -> ICO -> Bool

max :: ICO -> ICO -> ICO

min :: ICO -> ICO -> ICO

Show ICO # 
Instance details

Defined in HW04

Methods

showsPrec :: Int -> ICO -> ShowS

show :: ICO -> String

showList :: [ICO] -> ShowS

newtype SubBudgetID #

Sub-budget ID ([POLOZKAROZPOCTU]) is internally an Int but is wrapped in a newtype to ensure type safety.

Constructors

SubBudgetID 

Fields

Instances
Eq SubBudgetID # 
Instance details

Defined in HW04

Methods

(==) :: SubBudgetID -> SubBudgetID -> Bool

(/=) :: SubBudgetID -> SubBudgetID -> Bool

Ord SubBudgetID # 
Instance details

Defined in HW04

Show SubBudgetID # 
Instance details

Defined in HW04

Methods

showsPrec :: Int -> SubBudgetID -> ShowS

show :: SubBudgetID -> String

showList :: [SubBudgetID] -> ShowS

data Invoice #

All invoice metadata. Supplier and sub-budget are identified by IDs only.

Constructors

Invoice 

Fields

  • supplier :: Either String ICO

    supplier IČO ([ICO]) if exists, supplier name ([DODAVATEL]) if not

  • dateIssued :: Day

    date the invoice was issued ([DATUMVYSTAVENI])

  • dateDelivered :: Day

    date the invoice was delivered ([DATUMPRIJETI])

  • dateDue :: Day

    date the invoice was due ([DATUMSPLATNOSTI])

  • datePaid :: Day

    date the invoice was paid ([DATUMUHRADY])

  • documentType :: DocumentType

    invoice type ([TYPDOKLADU])

  • amountDue :: Money

    amount due in CZK, VAT included ([CELKOVACASTKA])

  • amountPaid :: Money

    amount paid in CZK, VAT included ([CUHRADA])

  • subBudgets :: [(Money, SubBudgetID)]

    amounts from individual sub-budgets ([CASTKAZAPOLOZKUROZPOCTU], [POLOZKAROZPOCTU])

Instances
Show Invoice # 
Instance details

Defined in HW04

Methods

showsPrec :: Int -> Invoice -> ShowS

show :: Invoice -> String

showList :: [Invoice] -> ShowS

data DocumentType #

Type of the invoice as provided in the data ([TYPDOKLADU]). Unfortunately, I was unable to find out what precisely these mean :-|.

Constructors

F

maybe a common invoice?

W

maybe a cancelled invoice?

Z

maybe a regular invoice paid in advance?

Instances
Eq DocumentType # 
Instance details

Defined in HW04

Methods

(==) :: DocumentType -> DocumentType -> Bool

(/=) :: DocumentType -> DocumentType -> Bool

Show DocumentType # 
Instance details

Defined in HW04

Methods

showsPrec :: Int -> DocumentType -> ShowS

show :: DocumentType -> String

showList :: [DocumentType] -> ShowS

type Money = Double #

Money amounts are stored as simple Doubles.

main :: IO () #

Parse the file given in the first command-line argument. In case of a successful parse, pretty-print the parsed database. In case of a parse failure, print the error.
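The described behaviour can be sketched with `parseFromFile` from `Text.Parsec.String`. The parser `database` and the printer `prettyDatabase` are toy stand-ins (one number per line) for the real invoice parser and pretty-printer, and `processFile`, `runWith` and the usage message are hypothetical.

```haskell
import Text.Parsec (digit, endBy, eof, many1, newline)
import Text.Parsec.String (Parser, parseFromFile)

-- | Toy stand-in for the real database parser: one number per line.
database :: Parser [Int]
database = (read <$> many1 digit) `endBy` newline <* eof

-- | Toy stand-in for the real pretty-printer.
prettyDatabase :: [Int] -> String
prettyDatabase = unlines . map show

-- | Parse the given file; print the database on success, the error otherwise.
processFile :: FilePath -> IO ()
processFile path = do
  result <- parseFromFile database path
  case result of
    Left err -> print err
    Right db -> putStr (prettyDatabase db)

-- | Dispatch on the command-line arguments; only the first one is used.
runWith :: [String] -> IO ()
runWith (path : _) = processFile path
runWith []         = putStrLn "usage: hw04 FILE"
```

Under these assumptions, `main` would simply be `getArgs >>= runWith`, with `getArgs` imported from `System.Environment`.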