{- | Fourth assignment for IB016, semester spring 2019 == Analysis of the Ministry of Finance's invoices In this assignment, you'll get to parse and analyze an open dataset of invoices processed by the Ministry of Finance of the Czech Republic in 2018. = The high-level concept This assignment tries to tackle a real problem. With the quirks, compromises, (sometimes) lack of metadata the real data tend to have. Furthermore, you are given freedom to run the analyses you're actually interested in, not those prescribed by us. All that means logic and common sense have priority over details of the task specification. If you find something more useful to be done slightly differently, it can be allowed -- ask in the discussion forum describing what you need and why. Nevertheless, we took a deep thought to design the task reasonably, so in return, we ask you to think deep before you ask for a diversion. Furthermore, we'll need to grade it afterward and having the same interface (=data types) helps us a lot. That being said, you can use this "real-world task" to practice combining your functional programming skills. The author's solution has quite some monadic operators, leverages monoids and heavily uses functions as return values (ehm, custom structures full of functions, in fact). Overall, use common sense, enjoy the possibilities of functional programming and learn about your homeland by examining the open data shared by the government. = Assignment overview The assignment consists of these tasks: 1. Locate the open data repository of the Ministry of Finance of the Czech Republic and download the CSV file with invoices paid in 2018 (to be more precise, invoices paid between 2018-01-01 and 2018-11-01). 2. Determine the license/use conditions of the dataset and summarize them in one sentence in the submitted source code. 3. Write a Parsec parser for the CSV file according to the specification below. 4. Write reasonable pretty-printers for the data (example given below). 5. Using the parsed data, perform an analysis to find out something interesting. Report at least five results. See the details below. = Parser details Overall, the dataset contains more information than we are interested in. See the documentation of the datatypes below to know what to keep (everything else can be dropped at parse-time). Write a parser that produces the structure given below reasonably efficiently. Linear time complexity is not needed (not even realistic, since invoices are stored in a map), but try to avoid intermediate data structures and unnecessary re-processing of the data. A bad example would be parsing the whole file into lines first, then reiterating to parse the lines into CSV fields, then passing the third time to convert dates, etc. Your program has to work on the original unmodified (!) file downloaded from the open data repository (this ensures replicability of your analyses and shows you have not manipulated the data). That being said, you can have "unparsed lines" after you process the file (there is a dedicated list for these, see the datatype definition below). This is allowed to not bother you with super-rare border cases that you would, in reality, probably clean by hand. Example for a thousand words: Three suppliers in the dataset have newlines in their names. The ideal solution, of course, would be to allow these (and possibly other escaped characters) but since it's just three, you need not to bother in the bulk of over 7200 invoices. Please don't have more than a couple of such unparsed lines. As for the pretty-printers, use the format that seems the most understandable to you when doing your analyses. An example of mine is provided below. @ Faktura č. 1888800244 (Z) Dodavatel: Úřad pro zastupování státu ve věcech majetkových (IČO 69797111) Suma: 9130.00 Kč (vystaveno 2015-07-01, splatnost 2018-05-15) Zaplaceno: 9130.00 Kč ( přijato 2018-05-09, zaplaceno 2018-05-04) Z rozpočtu: 1300.00 Kč (Studená voda), 3300.00 Kč (Teplo), 4000.00 Kč (Elektrická energie), 530.00 Kč (Teplá voda) @ = Analysis details As for the performed analyses, you are not constrained. Run what you find intriguing and report a summary in the source file handed in the IS. Note, however, that there has to be the source code to produce your stats available in the source file handed in! I.e., the stats reported in the free text without the accompanying Haskell code will not be counted valid (as you could have generated them using Excel on the same dataset). Optionally, paste the summary of the findings (without the source code) to the discussion forum in the IS (even before the assignment deadline). Remember, one of the aims of the task is to get to know the economics of your ministry. Inspiration for possible analyses: * Which invoices did the ministry pay to Masaryk University? * What ratio of invoices did the ministry pay overdue? (Remember that some invoices may have been delivered already overdue.) * What are the sums for individual sub-budgets? * To which supplier did the ministry pay the most? * How much did the ministry pay to the companies owned by the current prime minister? * And so on... = Bonuses During grading, you can get up to 3 points of bonuses for extra work (either other code features or better analysis). These are assigned subjectively but will probably apply if you do something of the following: * Perform larger or more complicated analyses of the dataset. * Perform analyses on the extended dataset (e.g., incorporating invoice data from other years). * Use some advanced concept from the other seminars in the solution (e.g., /lenses/ for manipulating the data structures, /monoids/, appropriate language extensions, ...) * Note: If you decide to use lenses, feel free to rename record field names to start with an underscore. You may also consider to use the package . = Modules and packages You can use any module from packages and . For parsing, use the package . For working with dates, use the package . If you wish so, you can also use Unicode syntax from . In case you feel the need to use some other package (especially in the analytical part), it's probably OK. However, double-check with the assignment author in the discussion forum first. -} -- ---------------------------------------------------------------------------- -- Name: Martin Ukrop (assignment creator) -- UCO: 374297 -- ---------------------------------------------------------------------------- {- ### Data source 1. Narodni katalog otevrenych dat https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A9-sady%2Fhttp---data.mfcr.cz-api-3-action-package_show-id-prehled-faktur-ministerstva-financi-cr 2. Otevrena data ministerstva finacii http://data.mfcr.cz/en/node/161 ### Licence * Jedná se o volné dílo. * Neobsahuje autorské dílo, originální databáze, zvláštní právo pořizovatele databáze ani osobní údaje * => Literally no retrictiond for use! :-) -} module HW04 where import Data.Semigroup ( Semigroup, (<>) ) import Data.List ( intercalate, sortBy ) import Data.Maybe ( fromMaybe ) import Control.Monad ( void ) import System.Environment ( getArgs ) import Text.Read ( readMaybe ) import Text.Printf ( printf ) import Data.Either ( fromRight ) -- package containers import qualified Data.Map.Strict as M -- package parsec import Text.Parsec import Text.Parsec.String ( Parser, parseFromFile ) -- package time import Data.Time.Calendar ( Day, fromGregorianValid ) -- #### Data type declarations #### -- | The high-level data structure for all parsed data. data InvoiceData = InvoiceData { invoices :: Invoices -- ^ selected invoice data (see below) , suppliers :: Suppliers -- ^ information about common suppliers (those having IČO) , budgets :: Budgets -- ^ information about ministry sub-budgets , notParsed :: [String] -- ^ list of lines that could not be successfully parsed } -- | Invoices are stored in a map keyed by invoice ID (@[CISLO]@). -- Beware, the dataset contains multiple lines with the same invoice ID. -- As these differ only in the sub-budget payment, merge them together -- (keeping all the sub-budget information). type Invoices = M.Map InvoiceID Invoice -- | Supplier names (@[DODAVATEL]@) are stored in a map keyed by their IČO (@[ICO]@). -- Suppliers without the ICO identification are not stored here. type Suppliers = M.Map ICO String -- | Sub-budget names (@[NAZEVPOLOZKYROZPOCTU]@) are stored in a map keyed -- by the budget ID (@[POLOZKAROZPOCTU]@). type Budgets = M.Map SubBudgetID String -- | Invoice ID (@[CISLO]@) is internaly an @Int@ but is wrapped in a newtype -- to ensure type safety. newtype InvoiceID = InvoiceID { unInvoiceID :: Int } deriving (Eq, Ord, Show) -- | Supplier IČO (@[ICO]@) is internaly an @Int@ but is wrapped in a newtype -- to ensure type safety. newtype ICO = ICO { unICO :: Int } deriving (Eq, Ord, Show) -- | Sub-budget ID (@[POLOZKAROZPOCTU]@) is internaly an @Int@ but is wrapped in a newtype -- to ensure type safety. newtype SubBudgetID = SubBudgetID { unSubBudgetID :: Int } deriving (Eq, Ord, Show) -- | All invoice metadata. Supplier and sub-budget are identified by IDs only. data Invoice = Invoice { supplier :: Either String ICO -- ^ supplier IČO (@[ICO]@) if exists, supplier name (@[DODAVATEL]@) if not , dateIssued :: Day -- ^ date the invoice was issued (@[DATUMVYSTAVENI]@) , dateDelivered :: Day -- ^ date the invoice was delivered (@[DATUMPRIJETI]@) , dateDue :: Day -- ^ date the invoice was due (@[DATUMSPLATNOSTI]@) , datePaid :: Day -- ^ date the invoice was paid (@[DATUMUHRADY]@) , documentType :: DocumentType -- ^ invoice type (@[TYPDOKLADU]@) , amountDue :: Money -- ^ amount due in CZK, VAT included (@[CELKOVACASTKA]@) , amountPaid :: Money -- ^ amount paid in CZK, VAT included (@[CUHRADA]@) , subBudgets :: [(Money,SubBudgetID)] -- ^ amounts from individual sub-budgets (@[CASTKAZAPOLOZKUROZPOCTU]@, @[POLOZKAROZPOCTU]@) } deriving Show -- | Type of the invoice as provided in the data (@[TYPDOKLADU]@). -- Unfortunately, I was unable to find out what precisely these mean :-|. data DocumentType = F -- ^ maybe a common invoice? | W -- ^ maybe a cancelled invoice? | Z -- ^ maybe a regular invoice paid in advance? deriving (Eq, Show) -- | Money amounts are stored as simple @Double@s. type Money = Double -- #### Custom data types and class instances #### instance Semigroup Invoice where new <> old = new { subBudgets = subBudgets new ++ subBudgets old } data ModifyInvoiceData = ModifyInvoiceData { modifyInvoices :: Invoices -> Invoices , modifySuppliers :: Suppliers -> Suppliers , modifyBudgets :: Budgets -> Budgets , modifyNotParsed :: [String] -> [String] } instance Semigroup ModifyInvoiceData where mid1 <> mid2 = ModifyInvoiceData { modifyInvoices = modifyInvoices mid1 . modifyInvoices mid2 , modifySuppliers = modifySuppliers mid1 . modifySuppliers mid2 , modifyBudgets = modifyBudgets mid1 . modifyBudgets mid2 , modifyNotParsed = modifyNotParsed mid1 . modifyNotParsed mid2 } -- mappend/<> definition repeated to work with GHC 8.2 and earlier instance Monoid ModifyInvoiceData where mempty = ModifyInvoiceData id id id id mappend mid1 mid2 = ModifyInvoiceData { modifyInvoices = modifyInvoices mid1 . modifyInvoices mid2 , modifySuppliers = modifySuppliers mid1 . modifySuppliers mid2 , modifyBudgets = modifyBudgets mid1 . modifyBudgets mid2 , modifyNotParsed = modifyNotParsed mid1 . modifyNotParsed mid2 } -- #### Parsers #### separatorChar :: Char separatorChar = ';' newlineChars :: [Char] newlineChars = "\n\r" invoiceDataParser :: Parser InvoiceData invoiceDataParser = do skipMany (noneOf "\n") *> newlineParser -- ignore header line fileData <- mconcat <$> many (invoiceParser <* newlineParser) -- parse invoices eof -- force end-of-file (make sure the whole file is processed) pure $ InvoiceData { invoices = modifyInvoices fileData M.empty, suppliers = modifySuppliers fileData M.empty, budgets = modifyBudgets fileData M.empty, notParsed = modifyNotParsed fileData [] } invoiceParser :: Parser ModifyInvoiceData invoiceParser = try invoiceFieldsParser -- try to parse an invoice <|> ( many (noneOf newlineChars) >>= \str -> pure (ModifyInvoiceData id id id (str:)) ) -- if it does not work, ignore the whole line where invoiceFieldsParser = do {-[ROZLISENI]-} _ <- valueParser *> separatorParser {-[CISLO]-} tmp_invoiceID <- InvoiceID . read <$> many digit <* separatorParser -- read is OK, since it's piped from 'many digit' {-[DODAVATEL]-} tmp_supplierName <- valueParser <* separatorParser {-[KODPARTNERA]-} _ <- valueParser <* separatorParser {-[ICO]-} tmp_supplierICO <- fmap ICO . readMaybe <$> valueParser <* separatorParser {-[CISLOFAKTURYDODAVATELE]-} _ <- valueParser <* separatorParser {-[TYPDOKLADU]-} tmp_documentType <- documentTypeParser <* separatorParser {-[EVIDENCEDPH]-} _ <- valueParser <* separatorParser {-[VARIABILNISYMBOL]-} _ <- valueParser <* separatorParser {-[CELKOVACASTKA]-} tmp_amountDue <- moneyParser <* separatorParser {-[CELKOVACASTKABEZDPH]-} _ <- valueParser <* separatorParser {-[CELKOVACASTKACIZIMENA]-} _ <- valueParser <* separatorParser {-[CELKOVACASTKAZADRZNE]-} _ <- valueParser <* separatorParser {-[MENA]-} _ <- valueParser <* separatorParser {-[DATUMVYSTAVENI]-} tmp_dateIssued <- dayParser <* separatorParser {-[DATUMPRIJETI]-} tmp_dateDelivered <- dayParser <* separatorParser {-[DATUMSPLATNOSTI]-} tmp_dateDue <- dayParser <* separatorParser {-[DATUMUHRADY]-} tmp_datePaid <- dayParser <* separatorParser {-[UCELPLATBY]-} _ <- valueParser <* separatorParser {-[UHRADA]-} tmp_amountPaid <- moneyParser <* separatorParser {-[UHRADACIZIMENA]-} _ <- valueParser <* separatorParser {-[POLOZKAROZPOCTU]-} tmp_subBudgetID <- fmap SubBudgetID . readMaybe <$> valueParser <* separatorParser {-[NAZEVPOLOZKYROZPOCTU]-} tmp_subbudgetName <- valueParser <* separatorParser {-[CASTKAZAPOLOZKUROZPOCTU]-} tmp_subBudgetAmount <- moneyParser <* separatorParser {-[KAPITOLA]-} _ <- valueParser <* separatorParser {-[NAZEVKAPITOLY]-} _ <- valueParser pure $ ModifyInvoiceData { modifyInvoices = M.insertWith (<>) tmp_invoiceID $ Invoice { supplier = case tmp_supplierICO of Nothing -> Left tmp_supplierName Just ico -> Right ico , dateIssued = tmp_dateIssued , dateDelivered = tmp_dateDelivered , dateDue = tmp_dateDue , datePaid = tmp_datePaid , documentType = tmp_documentType , amountDue = tmp_amountDue , amountPaid = tmp_amountPaid , subBudgets = maybe [] (\x -> [(tmp_subBudgetAmount, x)]) tmp_subBudgetID }, modifySuppliers = maybe id (flip M.insert tmp_supplierName) tmp_supplierICO, modifyBudgets = maybe id (flip M.insert tmp_subbudgetName) tmp_subBudgetID, modifyNotParsed = id } documentTypeParser :: Parser DocumentType documentTypeParser = (char 'F' >> pure F) <|> (char 'W' >> pure W) <|> (char 'Z' >> pure Z) "invoice type (F/W/Z)" moneyParser :: Parser Money moneyParser = fromMaybe 0 . readMaybe . filter (/= ' ') . map replaceComma <$> valueParser where replaceComma ',' = '.' replaceComma x = x separatorParser :: Parser () separatorParser = void $ char separatorChar newlineParser :: Parser () newlineParser = void . many1 $ oneOf newlineChars valueParser :: Parser String valueParser = many . noneOf $ separatorChar : newlineChars dayParser :: Parser Day dayParser = do day <- read <$> many digit <* char '.' month <- read <$> many digit <* char '.' year <- read <$> many digit maybe (fail $ "invalid date " ++ date year month day) pure $ fromGregorianValid year month day where date y m d = "(" ++ show y ++ "-" ++ show m ++ "-" ++ show d ++ ")" -- #### Pretty printers #### pprintInvoiceData :: Bool -> InvoiceData -> String pprintInvoiceData fullPrint idata = concat $ [ "\n## Faktury (", show . M.size $ invoices idata,")\n" , unlines . map (++"\n") . map (pprintInvoice (suppliers idata) (budgets idata)) . M.assocs $ invoices idata ] ++ if not fullPrint then [] else [ "\n## Dodavatelé (", show . M.size $ suppliers idata, ")\n" , unlines . map (pprintSupplier (suppliers idata)) . map Right . M.keys $ suppliers idata , "\n## Položky rozpočtu (", show . M.size $ budgets idata, ")\n" , unlines . map pprintBudget . M.assocs $ budgets idata , "\n## Nenačtené/nesparsované faktury (", show . length $ notParsed idata, ")\n" , unlines (notParsed idata) ] pprintInvoice :: Suppliers -> Budgets -> (InvoiceID, Invoice) -> String pprintInvoice sups budgs (invID, inv) = concat [ "Faktura č. ", show . unInvoiceID $ invID, " (", show $ documentType inv ,")\n" , "\tDodavatel: \t", pprintSupplier sups $ supplier inv, "\n" , "\tSuma: \t\t", pprintMoney $ amountDue inv, " (vystaveno ", show $ dateIssued inv ,", splatnost ", show $ dateDue inv, ")\n" , "\tZaplaceno: \t", pprintMoney $ amountPaid inv, " ( přijato ", show $ dateDelivered inv ,", zaplaceno ", show $ datePaid inv, ")\n" , "\tZ rozpočtu: \t", intercalate ", " . map budgetLine $ subBudgets inv ] where budgetLine :: (Money, SubBudgetID) -> String budgetLine (amount, budgID) = case M.lookup budgID budgs of Nothing -> pprintMoney amount ++ " (neznámá položka)" Just name -> pprintMoney amount ++ " (" ++ name ++ ")" pprintSupplier :: Suppliers -> Either String ICO -> String pprintSupplier _ (Left name) = name ++ " (dodavatel nemá IČO)" pprintSupplier sups (Right ico) = case M.lookup ico sups of Nothing -> "??? (IČO" ++ show (unICO ico) ++ ")" Just name -> name ++ " (IČO " ++ show (unICO ico) ++ ")" pprintBudget :: (SubBudgetID, String) -> String pprintBudget (budgetId, name) = name ++ " (ID " ++ show (unSubBudgetID budgetId) ++ ")" pprintMoney :: Money -> String pprintMoney = printf "%12.2f Kč" -- #### Data analysis utility functions #### fullData :: IO InvoiceData fullData = parseFile "data/faktury-mfcr-2018-11.csv" -- | * Which invoices did the ministry pay to Masaryk University? analysisMuniInvoices :: IO () analysisMuniInvoices = fullData >>= putStrLn . pprintInvoiceData False . muniInvoices where muniInvoices :: InvoiceData -> InvoiceData muniInvoices idata = idata { invoices = M.filter ((== muni) . supplier) (invoices idata) } muni = Right $ ICO 216224 -- | * What ratio of invoices did the ministry pay overdue? -- (Remember that some invoices may have been delivered already overdue.) analysisRatioPAidOverdue :: IO () analysisRatioPAidOverdue = do pureIvoiceData <- invoices <$> fullData let overdueInvoices = M.filter (\i -> datePaid i > dateDue i) pureIvoiceData let overdueInvoicesDeliveredOnTime = M.filter (\i -> dateDelivered i < dateDue i) overdueInvoices let numInvoices = M.size pureIvoiceData let numOverdueInvoices = M.size overdueInvoices let numOverdueInvoicesDeliveredOnTime = M.size overdueInvoicesDeliveredOnTime putStrLn $ concat [ "Invoices paid overdue: " , show numOverdueInvoices ++ " / " ++ show numInvoices , " (" ++ show (numOverdueInvoices * 100 `div` numInvoices) ++ "%)" , "\n" , "Invoices paid overdue (but delivered on time): " , show numOverdueInvoicesDeliveredOnTime ++ " / " ++ show numInvoices , " (" ++ show (numOverdueInvoicesDeliveredOnTime * 100 `div` numInvoices) ++ "%)" ] -- | * What are the sums for individual sub-budgets? type BudgetsSums = M.Map SubBudgetID (String, Money) analysisSubBudgets :: IO () analysisSubBudgets = do pureInvoicesBudgets <- M.map subBudgets . invoices <$> fullData pureBudgetNamesSums <- M.map (\name -> (name, 0)) . budgets <$> fullData let subBudgetSums = M.elems $ M.foldr addAmountsToSubBudgetsSums pureBudgetNamesSums pureInvoicesBudgets putStrLn . unlines . map formatBudgetSum $ sortBy (\(_,b1) (_,b2) -> compare b2 b1) subBudgetSums where formatBudgetSum (name, totalSum) = pprintMoney totalSum ++ "\t(" ++ name ++ ")" addAmountsToSubBudgetsSums :: [(Money,SubBudgetID)] -> BudgetsSums -> BudgetsSums addAmountsToSubBudgetsSums invoiceBudgetAmounts budgetSums = foldr folderFun budgetSums invoiceBudgetAmounts where folderFun :: (Money, SubBudgetID) -> BudgetsSums -> BudgetsSums folderFun (money,bId) budgetSums2 = M.insertWith insertFun bId ("", money) budgetSums2 insertFun :: (String, Money) -> (String, Money) -> (String, Money) insertFun (_, amount) (name, total) = (name, total + amount) -- | * To which supplier did the ministry pay the most? type SuppliersSums = M.Map ICO (String, Money) analysisSuppliers :: IO () analysisSuppliers = do pureInvoicesSuppliers <- M.map (\i -> (fromRight (ICO 0) $ supplier i, amountPaid i)) . invoices <$> fullData pureSupplierNamesSums <- M.map (\name -> (name, 0)) . suppliers <$> fullData let supplierSums = M.assocs $ M.foldr addAmountsToSuppliersSums pureSupplierNamesSums pureInvoicesSuppliers putStrLn . unlines . map formatSupplierSum $ sortBy (\(_,(_,b1)) (_,(_,b2)) -> compare b2 b1) supplierSums where formatSupplierSum (ico, (name, totalSum)) = pprintMoney totalSum ++ "\t" ++ name ++ " (IČO " ++ show (unICO ico) ++ ")" addAmountsToSuppliersSums :: (ICO, Money) -> SuppliersSums -> SuppliersSums addAmountsToSuppliersSums (ico, money) supplierSums = M.insertWith insertFun ico ("", money) supplierSums where insertFun :: (String, Money) -> (String, Money) -> (String, Money) insertFun (_, amount) (name, total) = (name, total + amount) -- | Parse file with invoice data. -- | In case of file/parsing error, terminate execution. parseFile :: String -> IO InvoiceData parseFile file = do result <- parseFromFile invoiceDataParser file case result of Left err -> error (show err) Right invoiceData -> pure invoiceData -- | Parse the file fiven in the first command-line argument. -- | In case of successful parse, pretty-print the parsed database. -- | In case of parse failure, print the error. main :: IO () main = do args <- getArgs if length args /= 1 then putStrLn "provide 1 file to parse as argument" else parseFile (head args) >>= putStrLn . pprintInvoiceData True