Safe Haskell: Safe
Fourth assignment for IB016, semester spring 2019
Analysis of the Ministry of Finance's invoices
In this assignment, you'll get to parse and analyze an open dataset of invoices processed by the Ministry of Finance of the Czech Republic in 2018.
The high-level concept
This assignment tackles a real problem, with the quirks, compromises, and (sometimes) missing metadata that real data tend to have. Furthermore, you are free to run the analyses you are actually interested in, not ones prescribed by us.
All that means logic and common sense take priority over the details of the task specification. If you find it more useful to do something slightly differently, it may be allowed: ask in the discussion forum, describing what you need and why. Nevertheless, we put considerable thought into designing the task reasonably, so in return we ask you to think carefully before asking for a deviation. Furthermore, we'll need to grade it afterward, and having the same interface (= data types) helps us a lot.
That being said, you can use this "real-world task" to practice combining your functional programming skills. The author's solution uses quite a few monadic operators, leverages monoids, and makes heavy use of functions as return values (ehm, custom structures full of functions, in fact).
Overall, use common sense, enjoy the possibilities of functional programming and learn about your homeland by examining the open data shared by the government.
Assignment overview
The assignment consists of these tasks:
- Locate the open data repository of the Ministry of Finance of the Czech Republic and download the CSV file with invoices paid in 2018 (to be more precise, invoices paid between 2018-01-01 and 2018-11-01).
- Determine the license/use conditions of the dataset and summarize them in one sentence in the submitted source code.
- Write a Parsec parser for the CSV file according to the specification below.
- Write reasonable pretty-printers for the data (example given below).
- Using the parsed data, perform an analysis to find out something interesting. Report at least five results. See the details below.
Parser details
Overall, the dataset contains more information than we are interested in. See the documentation of the datatypes below to know what to keep (everything else can be dropped at parse-time).
Write a parser that produces the structure given below reasonably efficiently. Linear time complexity is not needed (nor even realistic, since invoices are stored in a map), but try to avoid intermediate data structures and unnecessary re-processing of the data. A bad example would be parsing the whole file into lines first, then iterating again to parse the lines into CSV fields, then making a third pass to convert dates, etc.
Your program has to work on the original unmodified (!) file downloaded from the open data repository (this ensures replicability of your analyses and shows you have not manipulated the data). That being said, you can have "unparsed lines" after you process the file (there is a dedicated list for these, see the datatype definition below). This is allowed so that you need not bother with very rare corner cases that you would, in reality, probably clean by hand. An example is worth a thousand words: three suppliers in the dataset have newlines in their names. The ideal solution, of course, would be to allow these (and possibly other escaped characters), but since there are just three among more than 7200 invoices, you need not bother. Please don't have more than a couple of such unparsed lines.
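One way to collect unparsed lines in a single pass is to try the record parser first and fall back to consuming the raw line when it fails. The sketch below is only an illustration under loud assumptions: the real file's separator, quoting rules and columns are not specified here, so it assumes comma-separated fields with optional double quotes, and Row, field, row and rows are hypothetical names for a two-column toy record.

```haskell
import Data.Either (partitionEithers)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical toy record: (invoice number, supplier name).
type Row = (Int, String)

-- One field: double-quoted (no newline allowed inside) or bare.
-- ASSUMPTION: comma separator and double-quote quoting.
field :: Parser String
field = quoted <|> many (noneOf ",\r\n")
  where
    quoted = between (char '"') (char '"') (many (noneOf "\"\r\n"))

-- End of a record: a line break or the end of the input.
eol :: Parser ()
eol = (() <$ endOfLine) <|> eof

-- A well-formed row: number, comma, name, end of line.
row :: Parser Row
row = do
  n <- read <$> many1 digit
  _ <- char ','
  name <- field
  eol
  pure (n, name)

-- Try a proper row; on failure, backtrack and keep the raw line instead.
-- Returns (unparsed lines, parsed rows).
rows :: Parser ([String], [Row])
rows = partitionEithers <$> manyTill item eof
  where
    item = try (Right <$> row) <|> (Left <$> rawLine)
    rawLine = many (noneOf "\r\n") <* eol
```

Backtracking happens only on malformed rows, so well-formed input is still read once. A fuller solution would probably insert each row into the maps directly rather than building the intermediate list shown here.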
As for the pretty-printers, use the format that seems the most understandable to you when doing your analyses. An example of mine is provided below.
Faktura č. 1888800244 (Z)
  Dodavatel: Úřad pro zastupování státu ve věcech majetkových (IČO 69797111)
  Suma: 9130.00 Kč (vystaveno 2015-07-01, splatnost 2018-05-15)
  Zaplaceno: 9130.00 Kč ( přijato 2018-05-09, zaplaceno 2018-05-04)
  Z rozpočtu: 1300.00 Kč (Studená voda), 3300.00 Kč (Teplo), 4000.00 Kč (Elektrická energie), 530.00 Kč (Teplá voda)
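A pretty-printer in this style could be sketched as follows. This is not the author's code: MiniInvoice and its fields are a hypothetical cut-down record (the real Invoice has more fields and stores sub-budget IDs, not names), and the sketch assumes the sub-budget names have already been looked up.

```haskell
import Data.List (intercalate)
import Data.Time.Calendar (Day, fromGregorian)
import Text.Printf (printf)

type Money = Double

-- Hypothetical cut-down invoice, just enough to show the layout.
data MiniInvoice = MiniInvoice
  { miNumber  :: Int
  , miAmount  :: Money
  , miIssued  :: Day
  , miDue     :: Day
  , miBudgets :: [(Money, String)]  -- amount and resolved sub-budget name
  }

-- Two decimal places followed by the currency.
showMoney :: Money -> String
showMoney = printf "%.2f Kč"

-- One header line, then indented detail lines (Day shows as YYYY-MM-DD).
ppInvoice :: MiniInvoice -> String
ppInvoice inv = unlines
  [ "Faktura č. " ++ show (miNumber inv)
  , "  Suma: " ++ showMoney (miAmount inv)
      ++ " (vystaveno " ++ show (miIssued inv)
      ++ ", splatnost " ++ show (miDue inv) ++ ")"
  , "  Z rozpočtu: " ++ intercalate ", "
      [ showMoney m ++ " (" ++ name ++ ")" | (m, name) <- miBudgets inv ]
  ]

-- Sample value built from the figures in the example above.
example :: MiniInvoice
example = MiniInvoice 1888800244 9130
  (fromGregorian 2015 7 1) (fromGregorian 2018 5 15)
  [ (1300, "Studená voda"), (3300, "Teplo")
  , (4000, "Elektrická energie"), (530, "Teplá voda") ]
```

Printing with putStr (ppInvoice example) yields the header, amount and sub-budget lines in the same shape as the sample output.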
Analysis details
As for the analyses themselves, you are not constrained. Run what you find intriguing and report a summary in the source file submitted to the IS. Note, however, that the source code that produces your stats has to be included in the submitted source file! I.e., stats reported in free text without the accompanying Haskell code will not be counted as valid (as you could have generated them using Excel on the same dataset).
Optionally, paste the summary of the findings (without the source code) to the discussion forum in the IS (even before the assignment deadline). Remember, one of the aims of the task is to get to know the economics of your ministry.
Inspiration for possible analyses:
- Which invoices did the ministry pay to Masaryk University?
- What ratio of invoices did the ministry pay overdue? (Remember that some invoices may have been delivered already overdue.)
- What are the sums for individual sub-budgets?
- To which supplier did the ministry pay the most?
- How much did the ministry pay to the companies owned by the current prime minister?
- And so on...
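Two of the ideas above (the overdue ratio and the sub-budget sums) can be sketched over a reduced record. Inv below is a hypothetical subset of the Invoice type, and treating "delivered already overdue" as dateDelivered later than dateDue is an assumption about how to read the data, not part of the specification.

```haskell
import qualified Data.Map.Strict as Map
import Data.Time.Calendar (Day, fromGregorian)

type Money = Double
newtype SubBudgetID = SubBudgetID Int deriving (Eq, Ord, Show)

-- Hypothetical subset of the Invoice record.
data Inv = Inv
  { dateDelivered :: Day
  , dateDue       :: Day
  , datePaid      :: Day
  , subBudgets    :: [(Money, SubBudgetID)]
  }

-- Paid strictly after the due date.
paidLate :: Inv -> Bool
paidLate i = datePaid i > dateDue i

-- Share of late payments among invoices that arrived on or before the
-- due date (ASSUMPTION: the others could not have been paid on time).
overdueRatio :: [Inv] -> Double
overdueRatio invs
  | null fair = 0
  | otherwise = fromIntegral (length (filter paidLate fair))
              / fromIntegral (length fair)
  where
    fair = filter (\i -> dateDelivered i <= dateDue i) invs

-- Total amount per sub-budget across all invoices.
budgetTotals :: [Inv] -> Map.Map SubBudgetID Money
budgetTotals invs =
  Map.fromListWith (+) [ (b, m) | i <- invs, (m, b) <- subBudgets i ]

-- Made-up sample data: on time, late, and delivered already overdue.
samples :: [Inv]
samples =
  [ Inv (d 5 1) (d 5 15) (d 5 10) [(100, SubBudgetID 1), (50, SubBudgetID 2)]
  , Inv (d 5 1) (d 5 15) (d 5 20) [(25, SubBudgetID 1)]
  , Inv (d 6 1) (d 5 15) (d 6 5) []
  ]
  where d = fromGregorian 2018
```

With the real structure, the input list would come from Map.elems applied to the Invoices map.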
Bonuses
During grading, you can get up to 3 bonus points for extra work (either additional code features or better analysis). These are assigned subjectively but will probably apply if you do any of the following:
- Perform larger or more complicated analyses of the dataset.
- Perform analyses on the extended dataset (e.g., incorporating invoice data from other years).
- Use some advanced concept from the other seminars in the solution (e.g., lenses for manipulating the data structures, monoids, appropriate language extensions, ...)
- Note: If you decide to use lenses, feel free to rename record fields to start with an underscore. You may also consider using the package lens-datetime.
Modules and packages
You can use any module from the packages base and containers. For parsing, use the package parsec. For working with dates, use the package time. If you wish, you can also use Unicode syntax from unicode-prelude.
In case you feel the need to use some other package (especially in the analytical part), it's probably OK. However, double-check with the assignment author in the discussion forum first.
Synopsis
- data InvoiceData = InvoiceData {}
- type Invoices = Map InvoiceID Invoice
- type Suppliers = Map ICO String
- type Budgets = Map SubBudgetID String
- newtype InvoiceID = InvoiceID { unInvoiceID :: Int }
- newtype ICO = ICO { unICO :: Int }
- newtype SubBudgetID = SubBudgetID { unSubBudgetID :: Int }
- data Invoice = Invoice {
  - supplier :: Either String ICO
  - dateIssued :: Day
  - dateDelivered :: Day
  - dateDue :: Day
  - datePaid :: Day
  - documentType :: DocumentType
  - amountDue :: Money
  - amountPaid :: Money
  - subBudgets :: [(Money, SubBudgetID)]
  }
- data DocumentType
- type Money = Double
- main :: IO ()
Documentation
data InvoiceData #
The high-level data structure for all parsed data.
type Invoices = Map InvoiceID Invoice #
Invoices are stored in a map keyed by invoice ID ([CISLO]).
Beware, the dataset contains multiple lines with the same invoice ID. As these differ only in the sub-budget payment, merge them together (keeping all the sub-budget information).
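The merging of duplicate invoice IDs can be sketched with Map.insertWith. Inv, mergeInv and addRow are hypothetical names, and the record is cut down to one shared field plus the sub-budget list; the sketch assumes that rows with the same ID really do agree on everything else.

```haskell
import qualified Data.Map.Strict as Map

newtype InvoiceID = InvoiceID Int deriving (Eq, Ord, Show)
newtype SubBudgetID = SubBudgetID Int deriving (Eq, Ord, Show)
type Money = Double

-- Hypothetical cut-down invoice.
data Inv = Inv
  { amountDue  :: Money
  , subBudgets :: [(Money, SubBudgetID)]
  } deriving (Eq, Show)

-- insertWith calls this as (f newValue oldValue): keep the metadata of
-- the row seen first and append the sub-budgets of the new row.
mergeInv :: Inv -> Inv -> Inv
mergeInv new old = old { subBudgets = subBudgets old ++ subBudgets new }

-- Add one parsed row to the map, merging on a duplicate ID.
addRow :: Map.Map InvoiceID Inv -> (InvoiceID, Inv) -> Map.Map InvoiceID Inv
addRow m (k, inv) = Map.insertWith mergeInv k inv m
```

Folding addRow over the parsed rows (for example with foldl from Map.empty) gives one entry per invoice ID. The list append is fine for a handful of sub-budgets per invoice.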
type Suppliers = Map ICO String #
Supplier names ([DODAVATEL]) are stored in a map keyed by their IČO ([ICO]).
Suppliers without an IČO are not stored here.
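Resolving a supplier's display name from this map might look like the sketch below. It assumes that the Left branch of the supplier field (Either String ICO) carries the name of a supplier without an IČO; supplierName is a hypothetical helper, not part of the required interface.

```haskell
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)

newtype ICO = ICO { unICO :: Int } deriving (Eq, Ord, Show)
type Suppliers = Map.Map ICO String

-- Left: the name is stored directly (ASSUMPTION: supplier without IČO).
-- Right: look the name up, with a fallback for IDs missing from the map.
supplierName :: Suppliers -> Either String ICO -> String
supplierName _ (Left name) = name
supplierName sups (Right ico) =
  fromMaybe ("unknown IČO " ++ show (unICO ico)) (Map.lookup ico sups)
```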
type Budgets = Map SubBudgetID String #
Sub-budget names ([NAZEVPOLOZKYROZPOCTU]) are stored in a map keyed by the budget ID ([POLOZKAROZPOCTU]).
newtype InvoiceID #
Invoice ID ([CISLO]) is internally an Int but is wrapped in a newtype to ensure type safety.
newtype ICO #
Supplier IČO ([ICO]) is internally an Int but is wrapped in a newtype to ensure type safety.
newtype SubBudgetID #
Sub-budget ID ([POLOZKAROZPOCTU]) is internally an Int but is wrapped in a newtype to ensure type safety.
Instances
Eq SubBudgetID #
  Defined in HW04
  (==) :: SubBudgetID -> SubBudgetID -> Bool
  (/=) :: SubBudgetID -> SubBudgetID -> Bool
Ord SubBudgetID #
  Defined in HW04
  compare :: SubBudgetID -> SubBudgetID -> Ordering
  (<) :: SubBudgetID -> SubBudgetID -> Bool
  (<=) :: SubBudgetID -> SubBudgetID -> Bool
  (>) :: SubBudgetID -> SubBudgetID -> Bool
  (>=) :: SubBudgetID -> SubBudgetID -> Bool
  max :: SubBudgetID -> SubBudgetID -> SubBudgetID
  min :: SubBudgetID -> SubBudgetID -> SubBudgetID
Show SubBudgetID #
  Defined in HW04
  showsPrec :: Int -> SubBudgetID -> ShowS
  show :: SubBudgetID -> String
  showList :: [SubBudgetID] -> ShowS
data Invoice #
All invoice metadata. Supplier and sub-budget are identified by IDs only.
data DocumentType #
Type of the invoice as provided in the data ([TYPDOKLADU]).
Unfortunately, I was unable to find out what precisely these mean :-|.
Instances
Eq DocumentType #
  Defined in HW04
  (==) :: DocumentType -> DocumentType -> Bool
  (/=) :: DocumentType -> DocumentType -> Bool
Show DocumentType #
  Defined in HW04
  showsPrec :: Int -> DocumentType -> ShowS
  show :: DocumentType -> String
  showList :: [DocumentType] -> ShowS