Ludovic Mayer – Database

E2020 – Soft skills II – Information Literacy
3. Databases
01.03.2022
Mgr. Ludovic Mayer
ludovic.mayer@recetox.muni.cz

I want to start working/writing, but …
- What resources should I use?
- Where can I find them?
- How can I find what interests me?
- What should I do if I don’t find what I need?
- How do I formulate a query that the database understands?

Content
- The lecture is divided into 2 parts:
  - 1st part: Theoretical – to understand how the information is stored and how to access it
  - 2nd part: Practical – you are at the helm and you find the information!

VPN
- MUNI VPN
- MUNI
- Eduroam
https://it.muni.cz/en/services/wireless-wi-fi-connection

Database
“a usually large collection of data organized especially for rapid search and retrieval (as by a computer)” (Merriam-Webster dictionary)
Databases exist for multiple types of information:
- Scientific articles and journals (WoS, Scopus, as seen in 2. Scientometry)
- Chemical and other information
- Specific properties and parameters

Database: From “paper” to today
- First chemical database
- First edition published in 1817; last edition: 8th edition, 1990s
Nowadays most of them are digitized and the information is available online

Chemical databases
- Multiple databases exist, based on different criteria:
  - chemical structures
  - literature
  - crystallographic
  - spectroscopic: infrared, absorbance, nuclear magnetic resonance, …
  - reactions
  - thermodynamic
  - others, …

Chemical databases
- Often based on chemical structure searches
- Chemical structures are easy for humans to read (visual reading, …)
- How do we explain them to a computer / software / app?
1,5-Dihydroxy-4,8-dinitro-9,10-dioxo-9,10-dihydroanthracene-2,6-disulfonic acid
Chemical structures and their representation
- To search for a structure, it must be given in a form the computer understands:
  a) Name (or similar unique identifier)
  b) Connectivity matrices (e.g. MDL molfile, PDB, CML, …) = computer languages related to chemistry, created specifically for computation
  c) Linear strings (e.g. SMILES/SMARTS, SLN, WLN, InChI) = simple computer language from which the computer can decipher the structure

Chemical name (or similar unique identifier)
- Each compound can have multiple names
- Common name (trivial, semi-trivial, systematic, business, …)
  ▪ Chlorpyrifos (insecticide, banned since 2020 in Europe)
- Proper chemical name (= IUPAC name; IUPAC = International Union of Pure and Applied Chemistry)
  ▪ O,O-diethyl O-(3,5,6-trichloro-2-pyridinyl) phosphorothioate
- Other names (sometimes the names of commercial products they are featured in)
  ▪ Chlorpyrifos-ethyl, Brodan, Bolton insecticide, Cobalt, …

Chemical name (or similar unique identifier)
- To remove possible errors and mistakes, a unique numerical identifier was created:
- CAS RN (Chemical Abstracts Service Registry Number)
- Assigned to every chemical substance
- Covers the 1800s to today; the registry accounts for more than 193 million compounds

Chemical name (or similar unique identifier)
- Chlorpyrifos: CAS RN = 2921-88-2
Metabolites and derivatives:
- Chlorpyrifos-methyl: CAS RN = 5598-13-0
- Chlorpyrifos-oxon: CAS RN = 5598-15-2

Connectivity Matrix
- Computer language
- Chemical Table File (CT File)
  - Family of text-based chemical file formats that describe molecules and chemical reactions
  - Numerous file formats exist
  - CT File is an open format
  - Lists each atom in a molecule, with the x-y-z coordinates of that atom, and the bonds among atoms
  - You just need to register on this website to access the documentation: https://discover.3ds.com/ctfile-documentation-request-form
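To make the idea of a connection table concrete, here is a minimal Python sketch of the two blocks described above: a list of atoms with x-y-z coordinates and a list of bonds between them, from which a connectivity matrix is built. This is not the actual CT File syntax; the class names `Atom` and `Bond`, the function `connectivity_matrix` and the coordinate values are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    element: str
    x: float
    y: float
    z: float


@dataclass
class Bond:
    first: int   # 1-based index of the first atom
    second: int  # 1-based index of the second atom
    order: int   # 1 = single, 2 = double, 3 = triple


def connectivity_matrix(atoms, bonds):
    """Return a symmetric matrix whose entry [i][j] is the bond order
    between atom i and atom j (0 = not bonded)."""
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for bond in bonds:
        i, j = bond.first - 1, bond.second - 1
        matrix[i][j] = bond.order
        matrix[j][i] = bond.order
    return matrix


# Hydrogen-suppressed acetone. The coordinates are made-up placeholders,
# not measured or computed geometry.
atoms = [
    Atom("C", 0.0, 0.0, 0.0),
    Atom("C", 1.3, 0.75, 0.0),
    Atom("O", 1.3, 2.0, 0.0),
    Atom("C", 2.6, 0.0, 0.0),
]
bonds = [Bond(1, 2, 1), Bond(2, 3, 2), Bond(2, 4, 1)]

matrix = connectivity_matrix(atoms, bonds)
for atom, row in zip(atoms, matrix):
    print(atom.element, row)
```

Each row shows which atoms a given atom is bonded to and with what bond order. A real connection-table file stores the same kind of information as plain text, which is why software can rebuild the structure from it.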
Connectivity Matrix
Molfile:
- An MDL Molfile is a file format
- Contains information about:
  - atoms
  - bonds (= connectivity)
  - charges
  - coordinates of a molecule
- Recognized by most cheminformatics software systems/applications

Connectivity Matrix
- The same exists for proteins: the PDB format

Connectivity Matrix without connectivity
X-Y-Z file
- No information about bonds (covalent, hydrogen, VdW, …), which allows greater flexibility
- A typical XYZ file specifies the molecule geometry as the number of atoms followed by their Cartesian coordinates:
  - First line = number of atoms
  - Second line = a comment
  - Third and following lines = atomic coordinates

Connectivity Matrix without connectivity
Pyridine
- Formula: C5H5N

Linear string: SMILES
Linear string: represents a structure as a linear string of characters
Simplified Molecular Input Line Entry System (SMILES)
- Chemical notation that allows the user to represent a chemical structure
- Easily read, understood and used by a computer
- Contains connectivity, but no 2D or 3D coordinates

Linear string: SMILES
- How does it work?
- All atoms are supported
- Upper-case for non-aromatic atoms, lower-case for aromatic atoms
- Bonds:
    -   single bond
    =   double bond
    #   triple bond
    *   aromatic bond (standard SMILES writes this as ":")
    .   disconnected structures

Linear string: SMILES
Simple chain molecules (hydrogen-suppressed = no need to write the hydrogens; the software infers them)

SMILES   Formula    Name
CC       CH3CH3     Ethane
C=C      CH2=CH2    Ethene
CBr      CH3Br      Bromomethane
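The hydrogen-suppression rule for the simple chain examples above can be sketched in a few lines of Python: each atom receives as many implicit hydrogens as it has valences left after its bonds are counted. This is a toy reader for unbranched chains only, not a real SMILES parser; the function names `chain_formula` and `format_formula` and the small valence table are assumptions made for this illustration.

```python
import re
from collections import Counter

# Usual valences for a few common elements (illustrative subset only).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[CNOFI]|[-=#]")


def format_formula(counts):
    """Write element counts as a formula: C first, then H, then the rest."""
    order = ["C", "H"] + sorted(e for e in counts if e not in ("C", "H"))
    return "".join(
        f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order if counts[e] > 0
    )


def chain_formula(smiles):
    """Molecular formula of an unbranched, uncharged chain SMILES,
    with the implicit hydrogens filled in."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this toy reader: {smiles!r}")
    atoms = []   # element symbol of each atom, in chain order
    used = []    # valences each atom already spends on bonds in the chain
    pending = 1  # order of the bond to the next atom (single by default)
    for token in tokens:
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        if atoms:
            used[-1] += pending   # bond from the previous atom ...
            used.append(pending)  # ... to this one
        else:
            used.append(0)
        atoms.append(token)
        pending = 1
    counts = Counter(atoms)
    counts["H"] = sum(VALENCE[a] - u for a, u in zip(atoms, used))
    return format_formula(counts)


for smiles in ("CC", "C=C", "CBr"):
    print(smiles, chain_formula(smiles))
```

Branches, rings, aromatic atoms and charges are rejected on purpose, so the sketch only demonstrates how the hydrogens left out of the string can be recovered.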
Linear string: SMILES
Branches (in parentheses = a branch is placed right after the atom it is connected to)

SMILES                    Formula             Name
CC(O)C                    CH3CH(OH)CH3        2-Propanol
CC(=O)C                   CH3COCH3            2-Propanone (acetone)
CC(CC)C (also CCC(C)C)    CH3CH(CH3)CH2CH3    2-Methylbutane (isopentane)

Linear string: SMILES
Rings (use a number to mark the atoms where a ring opens and closes)

SMILES            Formula           Name
C=1CCCCC1         C6H10             Cyclohexene
C1OC1CC           CH2(O)CHCH2CH3    Ethyloxirane
c1cc2ccccc2cc1    C10H8             Naphthalene

Note: C*1*C*C*C*C*C1 (every bond aromatic) describes benzene, also written c1ccccc1, not cyclohexene.

Linear string: SMILES
Charged atoms (the atom is followed by brackets that enclose its charge; the charge may be stated explicitly ({-1}) or not ({-}))

SMILES                         Name
CCC(=O)O{-1} or CCC(=O)O{-}    Ionised form of propanoic acid
c1ccccn{+1}1CC(=O)O            1-(Carboxymethyl)pyridinium

Note: the curly-brace form is a dialect; standard SMILES puts the charged atom in square brackets, e.g. CCC(=O)[O-] and c1cccc[n+]1CC(=O)O.

Linear string: InChI
InChI
- International Chemical Identifier
- Introduced by IUPAC as a standard in 2006
- Contains different layers:
  - general formula
  - hydrogens, charges
  - stereochemistry
  - isotopes, …

Representation of structure – Conclusions
- Chemical names – connectivity matrices – linear strings
- Multiple options exist; none of them is wrong, some are more popular than others

Questions?
- Contact me anytime via email: ludovic.mayer@recetox.muni.cz