{- |
Fourth assignment for IB016, semester spring 2019

== Analysis of the Ministry of Finance's invoices

In this assignment, you'll get to parse and analyze an open dataset of invoices
processed by the Ministry of Finance of the Czech Republic in 2018.

= The high-level concept

This assignment tries to tackle a real problem, with the quirks, compromises,
and (sometimes) lack of metadata that real data tend to have. Furthermore, you
are given freedom to run the analyses you're actually interested in, not those
prescribed by us.

All that means logic and common sense have priority over details of the task
specification. If you find something more useful to be done slightly
differently, it can be allowed -- ask in the discussion forum, describing what
you need and why. Nevertheless, we put a lot of thought into designing the task
reasonably, so in return, we ask you to think hard before you ask for
a deviation. Furthermore, we'll need to grade it afterward, and having the same
interface (= data types) helps us a lot.

That being said, you can use this "real-world task" to practice combining your
functional programming skills. The author's solution uses quite a few monadic
operators, leverages monoids and heavily uses functions as return values
(ehm, custom structures full of functions, in fact).

Overall, use common sense, enjoy the possibilities of functional programming
and learn about your homeland by examining the open data shared by the
government.

= Assignment overview

The assignment consists of these tasks:

1. Locate the open data repository of the Ministry of Finance of the Czech
   Republic and download the CSV file with invoices paid in 2018 (to be more
   precise, invoices paid between 2018-01-01 and 2018-11-01).
2. Determine the license/use conditions of the dataset and summarize them in
   one sentence in the submitted source code.
3. Write a Parsec parser for the CSV file according to the specification below.
4. Write reasonable pretty-printers for the data (example given below).
5. Using the parsed data, perform an analysis to find out something
   interesting. Report at least five results. See the details below.

= Parser details

Overall, the dataset contains more information than we are interested in. See
the documentation of the datatypes below to know what to keep (everything else
can be dropped at parse-time).

Write a parser that produces the structure given below reasonably efficiently.
Linear time complexity is not needed (not even realistic, since invoices are
stored in a map), but try to avoid intermediate data structures and unnecessary
re-processing of the data. A bad example would be parsing the whole file into
lines first, then iterating again to parse the lines into CSV fields, then
making a third pass to convert dates, etc.

Your program has to work on the original unmodified (!) file downloaded from
the open data repository (this ensures replicability of your analyses and shows
you have not manipulated the data). That being said, you can have
"unparsed lines" after you process the file (there is a dedicated list for
these, see the datatype definition below). This is allowed so that you need not
bother with super-rare corner cases that you would, in reality, probably clean
by hand.

An example worth a thousand words: Three suppliers in the dataset have newlines
in their names. The ideal solution, of course, would be to allow these (and
possibly other escaped characters), but since there are just three, you need
not bother with them among the more than 7200 invoices. Please don't have more
than a couple of such unparsed lines.

As for the pretty-printers, use the format that seems the most understandable
to you when doing your analyses. An example of mine is provided below.

@
Faktura č. 1888800244 (Z)
Dodavatel: Úřad pro zastupování státu ve věcech majetkových (IČO 69797111)
Suma: 9130.00 Kč (vystaveno 2015-07-01, splatnost 2018-05-15)
Zaplaceno: 9130.00 Kč ( přijato 2018-05-09, zaplaceno 2018-05-04)
Z rozpočtu: 1300.00 Kč (Studená voda), 3300.00 Kč (Teplo), 4000.00 Kč (Elektrická energie), 530.00 Kč (Teplá voda)
@

= Analysis details

As for the performed analyses, you are not constrained. Run what you find
intriguing and report a summary in the source file submitted to the IS. Note,
however, that the source code producing your stats has to be available in the
submitted source file! I.e., stats reported in the free text without the
accompanying Haskell code will not be counted as valid (as you could have
generated them using Excel on the same dataset).

Optionally, paste the summary of the findings (without the source code) to the
discussion forum in the IS (even before the assignment deadline). Remember, one
of the aims of the task is to get to know the economics of your ministry.

Inspiration for possible analyses:

* Which invoices did the ministry pay to Masaryk University?
* What ratio of invoices did the ministry pay overdue? (Remember that some
  invoices may have been delivered already overdue.)
* What are the sums for individual sub-budgets?
* To which supplier did the ministry pay the most?
* How much did the ministry pay to the companies owned by the current prime
  minister?
* And so on...

= Bonuses

During grading, you can get up to 3 bonus points for extra work (either
other code features or better analysis). These are assigned subjectively but
will probably apply if you do any of the following:

* Perform larger or more complicated analyses of the dataset.
* Perform analyses on the extended dataset (e.g., incorporating invoice data
  from other years).
* Use some advanced concept from the other seminars in the solution (e.g.,
  /lenses/ for manipulating the data structures, /monoids/, appropriate
  language extensions, ...)
    * Note: If you decide to use lenses, feel free to rename record field names
      to start with an underscore. You may also consider using the package .

= Modules and packages

You can use any module from packages and . For parsing, use the package .
For working with dates, use the package . If you wish, you can also use
Unicode syntax from .

In case you feel the need to use some other package (especially in the
analytical part), it's probably OK. However, double-check with the assignment
author in the discussion forum first.
-}

-- ----------------------------------------------------------------------------
-- Name:
-- UCO:
-- ----------------------------------------------------------------------------

module HW04 where

-- package containers
import qualified Data.Map.Strict as M
-- package parsec
import Text.Parsec
import Text.Parsec.String ( Parser, parseFromFile )
-- package time
import Data.Time.Calendar ( Day )

-- #### Data type declarations ####

-- | The high-level data structure for all parsed data.
data InvoiceData = InvoiceData
    { invoices  :: Invoices  -- ^ selected invoice data (see below)
    , suppliers :: Suppliers -- ^ information about common suppliers (those having IČO)
    , budgets   :: Budgets   -- ^ information about ministry sub-budgets
    , notParsed :: [String]  -- ^ list of lines that could not be successfully parsed
    }

-- | Invoices are stored in a map keyed by invoice ID (@[CISLO]@).
-- Beware, the dataset contains multiple lines with the same invoice ID.
-- As these differ only in the sub-budget payment, merge them together
-- (keeping all the sub-budget information).
type Invoices = M.Map InvoiceID Invoice

-- | Supplier names (@[DODAVATEL]@) are stored in a map keyed by their IČO (@[ICO]@).
-- Suppliers without an IČO are not stored here.
type Suppliers = M.Map ICO String

-- | Sub-budget names (@[NAZEVPOLOZKYROZPOCTU]@) are stored in a map keyed
-- by the budget ID (@[POLOZKAROZPOCTU]@).
type Budgets = M.Map SubBudgetID String

-- | Invoice ID (@[CISLO]@) is internally an @Int@ but is wrapped in a newtype
-- to ensure type safety.
newtype InvoiceID = InvoiceID { unInvoiceID :: Int }
                    deriving (Eq, Ord, Show)

-- | Supplier IČO (@[ICO]@) is internally an @Int@ but is wrapped in a newtype
-- to ensure type safety.
newtype ICO = ICO { unICO :: Int }
              deriving (Eq, Ord, Show)

-- | Sub-budget ID (@[POLOZKAROZPOCTU]@) is internally an @Int@ but is wrapped
-- in a newtype to ensure type safety.
newtype SubBudgetID = SubBudgetID { unSubBudgetID :: Int }
                      deriving (Eq, Ord, Show)

-- | All invoice metadata. Supplier and sub-budget are identified by IDs only.
data Invoice = Invoice
    { supplier      :: Either String ICO
      -- ^ supplier IČO (@[ICO]@) if it exists, supplier name (@[DODAVATEL]@) if not
    , dateIssued    :: Day
      -- ^ date the invoice was issued (@[DATUMVYSTAVENI]@)
    , dateDelivered :: Day
      -- ^ date the invoice was delivered (@[DATUMPRIJETI]@)
    , dateDue       :: Day
      -- ^ date the invoice was due (@[DATUMSPLATNOSTI]@)
    , datePaid      :: Day
      -- ^ date the invoice was paid (@[DATUMUHRADY]@)
    , documentType  :: DocumentType
      -- ^ invoice type (@[TYPDOKLADU]@)
    , amountDue     :: Money
      -- ^ amount due in CZK, VAT included (@[CELKOVACASTKA]@)
    , amountPaid    :: Money
      -- ^ amount paid in CZK, VAT included (@[CUHRADA]@)
    , subBudgets    :: [(Money,SubBudgetID)]
      -- ^ amounts from individual sub-budgets
      -- (@[CASTKAZAPOLOZKUROZPOCTU]@, @[POLOZKAROZPOCTU]@)
    } deriving Show

-- | Type of the invoice as provided in the data (@[TYPDOKLADU]@).
-- Unfortunately, I was unable to find out what precisely these mean :-|.
data DocumentType = F -- ^ maybe a common invoice?
                  | W -- ^ maybe a cancelled invoice?
                  | Z -- ^ maybe a regular invoice paid in advance?
                  deriving (Eq, Show)

-- | Money amounts are stored as simple @Double@s.
type Money = Double

-- #### Custom data types and class instances ####

-- TBA

-- #### Constants ####

-- TBA

-- #### Parsers ####

-- TBA

-- #### Pretty printers ####

-- TBA

-- #### Data analysis utility functions ####

-- TBA

-- | Parse the file given in the first command-line argument.
-- In case of successful parse, pretty-print the parsed database.
-- In case of parse failure, print the error.
main :: IO ()
main = undefined
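
-- #### Sketch: merging duplicate invoice lines ####

-- | A minimal sketch, not the author's solution, of one way to merge the
-- multiple dataset lines that share an invoice ID, as the documentation of
-- 'Invoices' asks. The names 'mergeInvoice' and 'addInvoice' are hypothetical
-- helpers introduced here for illustration only. The sketch assumes that the
-- duplicate lines really differ only in the sub-budget payment, so it keeps
-- the metadata of the invoice already stored and only extends its list of
-- sub-budget payments.
mergeInvoice :: Invoice -- ^ invoice built from the newly parsed line
             -> Invoice -- ^ invoice already stored under the same ID
             -> Invoice
mergeInvoice new old = old { subBudgets = subBudgets old ++ subBudgets new }

-- | Insert one parsed line into the map, merging on an ID collision.
-- 'M.insertWith' calls the combining function with the new value first and
-- the stored value second, which matches the argument order of 'mergeInvoice'.
-- Appending on the right keeps the sub-budget payments in file order; it is
-- linear in the length of the stored list, which should be acceptable as long
-- as an invoice has only a handful of sub-budget lines.
addInvoice :: InvoiceID -> Invoice -> Invoices -> Invoices
addInvoice = M.insertWith mergeInvoice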