FILOZOFICKÁ FAKULTA
KATEDRA INFORMAČNÍCH STUDIÍ A KNIHOVNICTVÍ

Thesis Project Proposal

User-Centered Conceptual Models for Understanding Web Archive Collections

Master’s thesis

Bc. Illyria Brejchová
525117

Supervisor: PhDr. Michal Lorenz, Ph.D.

Czech title: Na uživatele orientované konceptuální modely pro pochopení dat z tematických sbírek webových archivů

1. Statement of the Research Problem

Since the rise of the World Wide Web in the 1990s, web-based works have become increasingly prevalent, often taking over the role of more traditional print resources. The informational, historical and cultural value of born-digital documents on the web is undeniable, yet these resources are highly ephemeral. Over the past 25 years, countless web archiving initiatives have been undertaken around the globe by various types of institutions and with varying scope. Extensive collections now exist, yet they are not without challenges regarding the long-term preservation of the digital objects, their metadata description, and usability.

Research has shown that current metadata practices are not sufficient relative to the needs of the designated community (Costea, 2018; Truman, 2016; Venlet et al., 2018). While institutional metadata handbooks exist, they vary significantly. A guideline for web archive metadata description published in 2018 seeks to bridge these institutional practices and takes a user-centered approach (Dooley & Bowers, 2018). However, its recommendations are not based on any domain-specific conceptual model, nor was conceptual modeling within its scope.
And while existing bibliographic and archival data models are designed to be extensible to any kind of resource, archived websites have many specific characteristics that should be taken into consideration, such as the relationships between the live and the archived website, between different versions of the same website, between different websites sharing the same URL over time, and between contributors, domain owners and copyright holders. While it is not feasible to describe archived web resources with such extreme granularity in practice, it is important to understand these relationships at least in theory. With solid conceptual modeling, it will be clearer what is being described. Web archive metadata specialists can then make better decisions about how to represent significant properties of the archived resources in metadata formats and information systems. Such a conceptual model would also benefit researchers seeking a deep understanding of the archived web resources they are studying.

2. Overview of the Literature, Research Context, and Relevant Projects

2.1 User Needs

Truman (2016) surveyed 23 web archives, examining their characteristics and the barriers researchers encounter when using them. The study identified 22 opportunities for researchers in four broad categories:

1) The largest category is communication and collaboration. Here the study emphasizes the need for closer cooperation between researchers and curators. Curators must better document and share their archival policy and explain how it and other technical parameters affect the archived collection. This documentation must be accurate, which the study found was not always the case. Researchers also require assistance from web archivists with data extraction, and curators need input from researchers regarding their methodological requirements for the data and metadata, so that the archived content can be analyzed in an academically rigorous way.
2) The study stresses the need for developing standardized and interoperable tools for processing big data from web archives and for iteratively expanding them to meet researchers’ needs. A need for better discoverability of web archived content, both within each institution and across web archives, was also identified.
3) Education and skill development are important for employees of web archives as well as for scholars who seek to study data in a new and unfamiliar format. There is also potential for educating web developers and working toward a web that is technically archivable.
4) There is a need to expand the capacity of web archives with respect to both staff and storage space.

Maria-Dorina Costea (2018) published a report on the usability of web archive content for research, based on a survey and interviews with Danish humanities researchers. Attention was paid to both the Internet Archive and the Danish web archive Netarkivet. Both users and potential users of web archived data were included, with the majority of respondents having no prior experience with web archives. Of the researchers who had experience with using archived web content in their work, most did so for qualitative research in history or for the analysis of political discourse online. Only one respondent worked on a quantitative study, but was unable to complete it due to technical problems. The researchers expressed a need for better search tools, such as full-text, image or audio search capabilities, the option to export data into a standard format, and an integrated workspace for academic work. They emphasized the need for more detailed documentation that would make the collection practices and tools more transparent. They would appreciate more metadata and would like the existing metadata to be more accurate. If part of a website could not be archived, or is displayed differently than it would have been on the live web, they need to know about it.
They would also like to be redirected to a better captured instance of the archived website in such situations. The researchers missed the ability to filter search results and would appreciate a more complete and easier to understand tutorial as well as explanations of domain-specific vocabulary.

OCLC published a literature review of user needs, conducted as part of its efforts to formulate recommendations for descriptive metadata for web archives. The review focuses largely on the needs of academic researchers. It found that potential users still lack an understanding of what web archives are and how to use them. Those who do use them express a strong need for provenance metadata and transparency regarding the acquisition of the archived websites. Users generally lack the technical knowledge necessary to interpret and use web archive data, resulting in the need for user-friendly tools and interfaces, better user support, and outreach. As for web archive practitioners, they need scalable descriptive practices, hybrid bibliographic and archival approaches to cataloguing, and interoperable metadata across systems (Venlet et al., 2018).

2.2 Conceptual Data Models

LRM (Library Reference Model) is the most recent high-level conceptual reference model of the bibliographic universe published by IFLA (International Federation of Library Associations and Institutions). It consolidates FRBR (Functional Requirements for Bibliographic Records), FRAD (Functional Requirements for Authority Data) and FRSAD (Functional Requirements for Subject Authority Data). LRM is an entity-relationship model and defines 11 entities at three hierarchical levels: Res, Work, Expression, Manifestation, Item, Agent, Person, Collective Agent, Nomen, Place and Time-span (Riva et al., 2017). While this model addresses many of the limitations of FRBR and should in theory be applicable for modeling the entirety of the bibliographic universe, which as of now includes web archived resources, it remains highly centered on traditionally published resources. The concept of publication in the context of online content is problematic; for this reason, conceptual models from the museum and archival traditions could be more applicable. The most significant of these is the extensible, object-oriented Conceptual Reference Model (CRM) of CIDOC, the International Committee for Documentation of the International Council of Museums. CRM is meant to function as semantic “glue” for disparate cultural heritage datasets by providing a formal structure for describing concepts and relationships. Attempts at bridging the bibliographic and archival traditions have already been made, starting with object-oriented FRBR (FRBRoo), the first version of which was published in 2006. Most recently, work on LRMoo is being conducted (LRMoo, 2021). The aim of these conceptual models is to enable semantic interoperability. A conceptual model of the web archiving universe of discourse should therefore be compatible with these models already in place, and so I will be drawing upon them significantly.

In 2018 the OCLC Research Library Partnership Web Archiving Metadata Working Group published detailed user-centered recommendations for descriptive metadata for web archives. It found that the application of standards for descriptive practice is highly inconsistent: a combination of bibliographic, archival and hybrid approaches is in use. Moreover, the descriptive guidelines that are in use do not usually take into consideration the unique characteristics of either live or archived web content. The most used standards are Resource Description and Access (RDA), Describing Archives: A Content Standard (DACS) and Dublin Core.
The working group proposes the use of 14 data elements (Collector, Contributor, Creator, Date, Description, Extent, Genre/Form, Language, Relation, Rights, Source of Description, Subject, Title, URL) and gives descriptions and examples for their use (Dooley & Bowers, 2018).

Some of the most recent direct research into implementing conceptual models for web archives derives from the internet art preservation community. Rhizome, a New York-based non-profit dedicated to the digital preservation, exhibition and scholarship of born-digital art, has pilot tested the implementation of PROV-DM for describing the provenance of internet artworks in ArtBase, its digital art archive. Rhizome realized that the metadata available to its users, especially that relating to provenance, was insufficient for contemporary art historians. PROV-DM was chosen because it best enables the representation of the lifecycle internet art undergoes, the various actors who contribute to its creation, its development over time and its preservation, as well as the relationships between variations of the artwork both on the live web and in the collections of cultural institutions. Just as importantly, it allows the representation of unknown data and the description of objects at varying levels of detail. In this case study, PROV-O was also used to express PROV-DM in RDF in a linked data knowledge management system. The metadata model tested is also compatible with other bibliographic semantic models such as CIDOC-CRM and FRBRoo used in other cultural heritage systems (Rossenova et al., 2019). Those had, however, been overwhelmingly developed for the description of primarily physical documents and are more narrowly conceived. On the other hand, metadata schemas developed for digital documents, such as PREMIS, do concentrate on describing processes but are too abstract to be useful in the case of internet art (Rossenova, 2021).

Bahry et al.
(2022) developed a conceptual data model for the Malaysia public figures web archive as part of the database design process for a new digital repository. The model was visualized in the form of an ER diagram and therefore concentrates on identifying and describing entities, attributes and relationships. Existing web archiving metadata standards were benchmarked, and the most suitable were mapped to the conceptual data model and included as data elements. The resulting data model is designed to be easily understandable to novice web archivists, researchers and student users alike and can be adapted into real metadata modelling for the web archive repository. This conceptual data model was created for a more specialized web archive than I have in mind for my thesis. Even so, it can serve as a good foundation to expand upon for the purposes of my own research.

Lieber et al. (2021) from the Royal Library of Belgium describe how the Europeana Data Model (EDM), in combination with PREMIS and PROV, can be used to represent social media collections in an interoperable way. They take a competency question-based approach to identify and express requirements for the subsequent ontology engineering. This involved gathering user stories to decide which metadata fields are most significant. Their solution involves creating a semantically described RDF knowledge graph. This allows for automated metadata description at the level of individual posts while also retaining context information from the harvest as a whole. There is also the possibility of adding or editing metadata manually. Specific standardized metadata records in MARC or EDM can then be generated from the knowledge graph implemented as a relational database with no added manual labor. Aggregated data from archived posts can also be generated at the collection level.
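The knowledge-graph idea can be illustrated with a minimal sketch. The Python below is not the implementation described by Lieber et al. and uses no real ontology: the identifiers (`harvest:2021-03`, `post:1`), the predicates (`partOf`, `language`, `harvestedOn`, `tool`) and the helper functions are all hypothetical. They are chosen only to show how post-level statements can sit alongside harvest-level context in one set of subject-predicate-object triples, and how a collection-level aggregate can be derived from them.

```python
from collections import Counter

# Hypothetical triples (subject, predicate, object).
# The vocabulary is illustrative only and not taken from EDM, PREMIS or PROV.
TRIPLES = [
    ("harvest:2021-03", "harvestedOn", "2021-03-15"),
    ("harvest:2021-03", "tool", "example-crawler"),
    ("post:1", "partOf", "harvest:2021-03"),
    ("post:1", "language", "nl"),
    ("post:2", "partOf", "harvest:2021-03"),
    ("post:2", "language", "fr"),
    ("post:3", "partOf", "harvest:2021-03"),
    ("post:3", "language", "nl"),
]


def objects(triples, subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe_post(triples, post):
    """Combine a post's own statements with the context of its harvest."""
    record = {p: o for s, p, o in triples if s == post}
    harvest = record.get("partOf")
    if harvest is not None:
        # Copy harvest-level statements as context, prefixed to avoid key clashes.
        for s, p, o in triples:
            if s == harvest:
                record[f"harvest.{p}"] = o
    return record


def languages_in_harvest(triples, harvest):
    """Aggregate post-level language statements at the collection level."""
    posts = [s for s, p, o in triples if p == "partOf" and o == harvest]
    return Counter(
        lang for post in posts for lang in objects(triples, post, "language")
    )
```

Calling `describe_post(TRIPLES, "post:1")` returns the post's own statements together with the harvest date and tool, and `languages_in_harvest(TRIPLES, "harvest:2021-03")` counts languages across the whole harvest. A production system would presumably rely on an RDF store and published vocabularies rather than Python tuples.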
A big benefit of this approach is that the metadata can be made publicly available even though the archived items cannot, due to copyright. While the paper emphasizes the application of the developed model rather than describing the developed ontology in detail, this work is still highly relevant to my own conceptual modeling ambitions.

2.3 Theses

Among recent Czech theses dealing with web archiving, the bachelor’s thesis of Ondřej Kadlec (2020) maps the technical workflow of the Czech web archive (Webarchiv) of the National Library of the Czech Republic and the tools it uses. Of interest is his discussion of the mapping of current Webarchiv processes to the OAIS reference model, an ISO standard stating the requirements an open archival information system should meet to secure the long-term preservation of digital data. He found that not all requirements of the OAIS model are fulfilled. Namely, the archival information packages lack semantic and interpretation metadata, and Webarchiv lacks an entity responsible for the long-term preservation and understandability of the archived data by the designated community. Jaroslav Kvasnica (2016) also researched the long-term preservation of digital resources in Webarchiv in depth in his master’s thesis, with an emphasis on OAIS and metadata. This thesis is relevant for its detailed description of how metadata is created and stored in Webarchiv. Lastly, I consider the master’s thesis of Jan Vokřál (2021) to be relevant for its use of conceptual modeling, though not of the web archiving universe of discourse but rather of fiction.

3. Motivation for and Impact of the Research

I worked part-time for a year as a curator at Webarchiv of the National Library of the Czech Republic and found the work to be very meaningful. Preserving Czech digital national heritage is important and a great responsibility, yet the character of the resources, their technical properties, and the needs of the users change rapidly.
Therefore, research and development in web archiving always lags behind the live web. As pessimistic as that sounds, I choose to view it as a challenge and an opportunity. Over the past two years I have found the web archiving community to be dedicated, adaptable and inspiring, and I hope to contribute to the efforts of this community of practice and research with my own findings.

During my work for Webarchiv, which also involved creating descriptive metadata for the archived websites, I felt a need for a better understanding of the designated community and user needs, as well as of the significant properties of the resources and the relationships between them. This need is felt by the wider web archiving community as well, as is clear from the literature review and from talking to employees of the web archive of the National Library of the Netherlands, where I am applying for an internship.

A good conceptual model will help web archivists and researchers alike better understand the contents of web archives. Well-defined entities will make it clearer what is being described by the available metadata and will clarify the relationships between resources. It will also assist web archive metadata specialists in making better decisions regarding which descriptors are significant relative to their designated community. Last but not least, it may serve as a solid foundation for designing data structures for specialized information systems dealing with web archived content.

4. Research Questions, Aims and Goals

The aim of my research is to create a high-level conceptual model describing the web archiving universe. The model shall be grounded in a deep understanding of user requirements, representative of current domain knowledge, and compatible with CRM and FRBRoo/LRMoo but implementation-agnostic. The goal of the research is to provide a shared framework for understanding and describing web archived resources.
This should benefit web archivists when managing and creating metadata, developers when designing and implementing web archive information systems, and academic users seeking a theoretical model for understanding archived web resources. My research question is therefore: To what extent can the web archiving universe of discourse be represented in existing conceptual models from related fields?

5. Research Design

Drobíková et al. (2018) describe the use of conceptual modeling within LIS. A conceptual model is an explicit representation of a domain conceptualization using a modeling language. For the purpose of this work, I will use the Unified Modeling Language (UML). UML is a widely used, standardized, general-purpose object-oriented modeling language (Object Management Group, 2017). It is suitable for this work since the conceptual model of the web archiving universe will be developed from existing object-oriented conceptual models in the realms of cultural heritage (CIDOC CRM) and librarianship (FRBRoo and LRMoo). The modeling language is used to visualize semantic relations between objects and entities. The model to be created is a high-level descriptive conceptual reference model of the web archiving universe of discourse.

Kučerová (2018) describes the process of creating a conceptual model and grounds it in Karl Popper’s philosophy of three worlds and of objective and subjective knowledge. The first phase of creating a conceptual model involves creating a subjective mental model of the system. This is something we do implicitly as we make sense of the world, though it can also be done purposefully with the intent of communicating the mental model. The second phase involves creating an objective model. First the subjective model is expressed using a language, and then it is manifested on a medium.

A systems analysis approach will be taken to the design of the conceptual model.
The web archiving universe of discourse will therefore be conceived of as a complex open system. Given that web archiving systems are already well established, a bottom-up approach will be taken, and the abstract conceptual model will be in part reverse engineered from an existing understanding of the system. First the system requirements will be defined based on user practices, and then the model will be created through an iterative process of decomposing the system (analysis) and abstracting it (synthesis). Through analysis of the system, relevant objects, classes and entities are identified, and through synthesis, hierarchical relations between the objects are identified (Kučerová, 2017).

To aid in the system analysis, selected stakeholders with a deep understanding of web archiving will be interviewed. During the interview, the method of card sorting will be employed to gain insight into their own implicit mental models of the web archiving system. I estimate needing to talk to at least six stakeholders, ideally three web archiving professionals and three users. The results will be analyzed, and the insights will be integrated into the final model. The comparative method will be used to compare the conceptual model of the web archiving universe to conceptual models in related fields and to identify whether there are concepts within web archiving that do not map to them.

6. Schedule

SUMMER 2022: I will be attending an ERASMUS+ internship at the National Library of the Netherlands, where I shall be creating a report with recommendations for the library concerning the management of metadata of web archived resources. While this is not part of my thesis research, it is thematically closely related. I hope to gain insight into current web archive metadata practices and concerns outside the Czech Republic. I will also have a chance to expand and apply my current domain knowledge with expert guidance.
FALL 2022: Write the theoretical background of the thesis: a brief description of the historical evolution of web archiving and selected web archives, and an overview of relevant existing data models, current metadata formats and practices, and user needs. Identify key stakeholders and interview them to gather knowledge about the system that may be missing from the literature review.
WINTER 2023: Prepare questions and concepts for the card sorting based on my current understanding of the system. Identify key stakeholders and interview them using the method of card sorting.
SPRING 2023: Conceptual modeling of the web archiving universe of discourse grounded in a system analysis of the web archiving domain and user needs. Documentation explaining the identified objects and the relationships between them. Comparison of the web archiving conceptual model to those in related fields.
SUMMER 2023: Mapping of the conceptual model to a selected thematic collection of web archived resources to demonstrate and discuss its expressive power and weaknesses.
FALL 2023: Final revisions.

7. Required Resources

My research will require access to the university’s academic electronic resources and to publicly accessible web archive catalogues. I will also need access to suitable modeling software. Thankfully, there are many freely available open-source options, such as Diagrams.net, as well as proprietary online tools with student licenses, like Vertabelo. Thanks to pre-existing professional connections, I am also able to consult web archive specialists at the National Library of the Czech Republic and at the National Library of the Netherlands about my work as needed.

8. References

Bekiari, C., Doerr, M., Le Boeuf, P., & Riva, P. (Eds.). (2021). LRMoo (formerly FRBRoo) object-oriented definition and mapping from IFLA LRM (version 0.7) (p. 62) [Draft]. https://www.cidoc-crm.org/frbroo/sites/default/files/LRMoo_V0.7%28draft%202021-06-29%29.pdf
Costea, M. (2018).
Report on the scholarly use of web archives. NetLab.
Dooley, J., & Bowers, K. (2018). Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group (p. 58). https://doi.org/10.25333/C3005C
Drobíková, B., Římanová, R., Souček, J., & Souček, M. (2018). Teoretická východiska informační vědy: Využití konceptuálního modelování v informační vědě (Vydání první). Univerzita Karlova, nakladatelství Karolinum.
Kadlec, O. (2020). Popis technického workflow Webarchivu NK ČR a užívaných nástrojů [Bakalářská práce, Masarykova univerzita, Filozofická fakulta]. https://is.muni.cz/th/n1dx9/BP.pdf?info=1
Kučerová, H. (2017). Organizace znalostí. Karolinum Press. https://public.ebookcentral.proquest.com/choice/publicfullrecord.aspx?p=5720124
Kučerová, H. (2018). Pojem modelu a pojmový model v informační vědě. 29(2), 5–32. https://knihovnarevue.nkp.cz/archiv/2018-2/recenzovane-prispevky/pojem-modelu-a-pojmovy-model-v-informacni-vede
Kvasnica, J. (2016). Dlouhodobé uchování webového obsahu [Magisterská diplomová práce, Univerzita Karlova, Filozofická fakulta]. https://dspace.cuni.cz/bitstream/handle/20.500.11956/82967/DPTX_2012_1_11210_0_345026_0_129352.pdf?sequence=1&isAllowed=y
Lieber, S., Van Assche, D., Chambers, S., Messens, F., Geeraert, F., Birkholz, J. M., & Dimou, A. (2021). BESOCIAL: A sustainable knowledge graph-based workflow for social media archiving. Further with Knowledge Graphs: Proceedings of the 17th International Conference on Semantic Systems, 53, 198–212. https://doi.org/10.3233/ssw210045
Object Management Group. (2017). OMG Unified Modeling Language (OMG UML) version 2.5.1 (p. 796). Object Management Group. https://www.omg.org/spec/UML/
Riva, P., Le Boeuf, P., & Žumer, M. (2017). IFLA Library Reference Model: A Conceptual Model for Bibliographic Information (p.
101). IFLA. https://repository.ifla.org/bitstream/123456789/40/1/ifla-lrm-august-2017_rev201712.pdf
Rossenova, L. (2021, June 3). Modeling Net Art Provenance: A New Approach To The Interface of Rhizome’s ArtBase Archive. YouTube. https://www.youtube.com/watch?v=6QUUM2tQxmQ&ab_channel=SAKIPSABANCIMÜZESİ
Rossenova, L., Espenschied, D., & de Wild, K. (2019). Provenance for Internet art: Using the W3C PROV data model. 16th International Conference on Digital Preservation, Amsterdam. https://ipres2019.org/static/proceedings/iPRES2019.pdf
Saiful Bahry, F. D., Amran, N., Putri, T. E., & Ramli, M. I. (2022). Database design of the Malaysia public figures web archive repository: A social and cultural heritage web collections. Collection and Curation. https://doi.org/10.1108/CC-09-2021-0025
Truman, G. (2016). Web Archiving Environmental Scan (Harvard Library Report). https://dash.harvard.edu/handle/1/25658314
Venlet, J., Farrell, K. S., Kim, T., O’Dell, A. J., & Dooley, J. (2018). Descriptive Metadata for Web Archiving: Literature Review of User Needs. https://doi.org/10.25333/C33P7Z
Vokřál, J. (2021). Informační hodnota beletrie: Konceptuální model pro popis beletrie v informačních systémech [Masarykova univerzita, Filozofická fakulta]. https://is.muni.cz/auth/th/nspln/Informacni_hodnota_beletrie.pdf

9. Profiling

I aim to profile my studies in Information and Data Management, which is also reflected in my thesis. Conceptual data models are key for designing, implementing and managing metadata in information systems in a way that is functional for the designated community of an information system. This is especially true for web archives, which preserve particularly complex digital objects.

10. List of Acronyms

CIDOC-CRM    Conceptual Reference Model of the International Committee for Documentation (CIDOC) of the International Council of Museums
DACS         Describing Archives: A Content Standard
EDM          Europeana Data Model
ER Diagram   Entity Relationship Diagram
FRAD         Functional Requirements for Authority Data
FRBR         Functional Requirements for Bibliographic Records
FRBRoo       Object-Oriented Functional Requirements for Bibliographic Records
FRSAD        Functional Requirements for Subject Authority Data
IFLA         International Federation of Library Associations and Institutions
ISO          International Organization for Standardization
LIS          Library and Information Science
LRM          Library Reference Model
LRMoo        Object-Oriented Library Reference Model
MARC         Machine-Readable Cataloging
OAIS         Open Archival Information System
OCLC         Online Computer Library Center
PREMIS       Preservation Metadata: Implementation Strategies
PROV-DM      Provenance Data Model
PROV-O       Provenance Ontology
RDA          Resource Description and Access
RDF          Resource Description Framework
UML          Unified Modeling Language

This thesis project has been approved by my thesis supervisor.