From ARQMath 2020 to 2021
Topics in the scope
Dávid Lupták
Math Information Retrieval Research Group,
Faculty of Informatics, Masaryk University
https://mir.fi.muni.cz/
March 11, 2021
ARQMath Overview
Task 1: Answer Retrieval
Given a posted question as a query, search all answer posts and return relevant
answer posts.
D. Lupták ·From ARQMath 2020 to 2021 ·March 11, 2021 3 / 20
Task 2: Formula Retrieval
Given a question post with an identified formula as a query, search all question and
answer posts and return relevant formulas with their posts.
Topics (questions)
77 topics for Task 1
from various domains (real analysis, calculus, linear algebra, discrete mathematics, set
theory, number theory, etc.)
categorized as computation (26), concept (10), proof (41)
difficulty levels ranging from easy (32) through medium (21) to hard (24)
dependency on surrounding text (13), formulas (32) or both (32)
45 topics for Task 2
mathematical formulae selected from the topics from Task 1
criteria: complexity, elements, and text dependence
MIRMU Overview
Methods
Math Representations
In our MIR systems, we used the following math representations:
LaTeX
Presentation MathML
Content MathML
Symbol Layout Tree
M-Terms
Operator Tree
Prefix
Infix
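The Prefix and Infix representations are linearizations of an operator tree. A minimal sketch of the idea, assuming a toy `Node` class and a token format chosen only for illustration (not the exact tokens our systems emit):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a toy operator tree: an operator with children, or a leaf."""
    label: str
    children: List["Node"] = field(default_factory=list)


def to_prefix(node: Node) -> List[str]:
    """Emit the operator first, then its operands (pre-order traversal)."""
    tokens = [node.label]
    for child in node.children:
        tokens.extend(to_prefix(child))
    return tokens


def to_infix(node: Node) -> List[str]:
    """Emit binary operators between their operands, with parentheses."""
    if not node.children:
        return [node.label]
    if len(node.children) == 2:
        left, right = node.children
        return ["("] + to_infix(left) + [node.label] + to_infix(right) + [")"]
    # Fall back to function-call style for other arities.
    tokens = [node.label, "("]
    for i, child in enumerate(node.children):
        if i:
            tokens.append(",")
        tokens.extend(to_infix(child))
    return tokens + [")"]


# Hypothetical example: x^2 + y as an operator tree.
tree = Node("+", [Node("^", [Node("x"), Node("2")]), Node("y")])
prefix_tokens = to_prefix(tree)
infix_tokens = to_infix(tree)
```

Both token lists carry the same tree; only the order in which operators and operands appear differs.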
Corpora, Relevance Judgements, and Evaluation Measures
For training, we used two corpora:
1. arXMLiv (four different subsets) [2], and
2. Math StackExchange.
For validation, we used two sets of relevance judgements:
1. automatic (for parameter optimization and model selection), and
2. human-annotated (for performance estimation).
For evaluation, we used two measures:
1. Normalized Discounted Cumulative Gain Prime (nDCG’) [7], and
2. Spearman’s correlation coefficient (ρ).
For retrieval, we used a machine with 32 CPUs and 252 GiB RAM.
For training embeddings, we used an NVIDIA GeForce RTX 2080 Ti GPU with 11 GiB VRAM.
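nDCG’ differs from plain nDCG in that unjudged results are removed from the ranking before scoring. A minimal sketch, assuming graded judgements in a dictionary and a log2 rank discount; the function names are illustrative and this is not the official evaluation script:

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_prime(ranking: List[str], judgements: Dict[str, float]) -> float:
    """nDCG computed only over the judged documents in `ranking`."""
    # The "prime" step: drop every result that has no relevance judgement.
    judged = [doc for doc in ranking if doc in judgements]
    gains = [judgements[doc] for doc in judged]
    # Ideal ordering: all judged documents sorted by decreasing gain.
    ideal_dcg = dcg(sorted(judgements.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgements; "x" and "y" were never assessed.
judgements = {"a": 3.0, "b": 0.0, "c": 2.0}
with_unjudged = ndcg_prime(["x", "c", "y", "a", "b"], judgements)
without_unjudged = ndcg_prime(["c", "a", "b"], judgements)
```

The two scores are identical, so a system is not penalized for returning results that no assessor looked at.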
Math Indexer and Searcher (MIaS)
[Figure: MIaS architecture. Components shown: input document, document handler, canonicalization, math processing (ordering, tokenization, variables unification, constants unification, weighting), text and math indexing, Lucene index, input query, text and math searching, results.]
Historically the first MIR system deployed in a digital mathematical library [9].
Uses TF-IDF with M-Terms extracted from Content MathML (CMML) as the math representation.
Accuracy: nDCG’ 0.155, not significantly below the Tangent-S baseline.
Speed: avg. 1.24 s/topic, min. 0.1 s/topic, max. 7.27 s/topic.
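The unification steps in math processing let formulae match regardless of variable names or concrete constants. A rough sketch on a flat token list, assuming single letters are variables and that placeholders are spelled `id1`, `id2`, ... and `const`; the real system works on canonicalized MathML trees, so this only illustrates the idea:

```python
import re
from typing import Dict, List


def unify_variables(tokens: List[str]) -> List[str]:
    """Replace each distinct variable by id1, id2, ... in order of first use."""
    mapping: Dict[str, str] = {}
    unified = []
    for token in tokens:
        if re.fullmatch(r"[A-Za-z]", token):  # assumption: one letter = variable
            mapping.setdefault(token, f"id{len(mapping) + 1}")
            unified.append(mapping[token])
        else:
            unified.append(token)
    return unified


def unify_constants(tokens: List[str]) -> List[str]:
    """Replace every numeric constant by a single placeholder."""
    return ["const" if re.fullmatch(r"\d+(\.\d+)?", t) else t for t in tokens]


# a^2 + b^2 and x^2 + y^2 become the same token sequence.
first = unify_variables(["a", "^", "2", "+", "b", "^", "2"])
second = unify_variables(["x", "^", "2", "+", "y", "^", "2"])
generalized = unify_constants(["a", "^", "2", "+", "b", "^", "3"])
```

Indexing the original and the unified variants side by side, with lower weights for the unified ones, lets exact matches rank above structural matches.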
Soft Cosine Measure (SCM)
Uses joint fastText [1] word embeddings of text & math to measure relatedness.
Uses TF-IDF with the Prefix math representation and SCM [8, 4, 5] document similarity.
Uses automatic relevance judgements to optimize parameters of fastText and SCM.
Four different fastText models were trained:
1. Tiny (5 epochs, alternative submission)
2. Small (10 epochs, primary submission)
3. Medium (2 epochs on all corpora)
4. Large (10 epochs on all corpora)
Accuracy: nDCG’ 0.224 (small), not significantly below the Approach0 baseline.
Speed: avg. 58.46 s/topic, min. 30.52 s/topic, max. 502.84 s/topic.
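The soft cosine measure [8] generalizes cosine similarity with a term-similarity matrix S, so that two documents sharing no terms can still be related: SCM(x, y) = xᵀSy / (√(xᵀSx) · √(yᵀSy)). A minimal pure-Python sketch; the three-term vocabulary and the similarity value 0.8 are made up for illustration, and in practice S is derived from the word embeddings:

```python
import math
from typing import List

Matrix = List[List[float]]


def bilinear(x: List[float], s: Matrix, y: List[float]) -> float:
    """Compute x^T S y."""
    return sum(x[i] * s[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def soft_cosine(x: List[float], y: List[float], s: Matrix) -> float:
    """Soft cosine measure of two bag-of-words vectors under similarity matrix s."""
    denom = math.sqrt(bilinear(x, s, x)) * math.sqrt(bilinear(y, s, y))
    return bilinear(x, s, y) / denom if denom > 0 else 0.0


# Hypothetical vocabulary: ["+", "plus", "x"].
doc_a = [1.0, 0.0, 0.0]  # contains only the token "+"
doc_b = [0.0, 1.0, 0.0]  # contains only the word "plus"
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
related = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]

hard = soft_cosine(doc_a, doc_b, identity)  # ordinary cosine similarity
soft = soft_cosine(doc_a, doc_b, related)
```

With the identity matrix the measure reduces to the ordinary cosine similarity, which is zero here; with the related matrix the two documents score 0.8.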
Formula2Vec
Uses Doc2Vec DBOW [3] with the Prefix math representation and cosine document similarity.
Uses the optimal parameters of fastText and RedHat defaults for Doc2Vec.
Four different Doc2Vec models were trained:
1. Tiny (5 epochs on no_problem arXMLiv)
2. Small (10 epochs, alternative submission)
3. Medium (2 epochs on all corpora)
4. Large (10 epochs on all corpora)
Accuracy: nDCG’ 0.050 (small), on par with DPRL and zbMath systems.
Speed: avg. 3.23 s/topic, min. 3.14 s/topic, max. 7.87 s/topic.
CompuBERT
[Figure: CompuBERT training. A question Q ("Can anyone explain ..."), a relevant answer A_relevant ("Consider c² = a² + b² ...") and an irrelevant answer A_irrelevant ("Ask elsewhere ...") each pass through the same Sentence Transformer model (BERT_Base wordpiece embeddings followed by mean pooling), giving one embedding per post. Objective as labeled: minimize cos(Q, A_relevant) and maximize cos(Q, A_irrelevant), where cos presumably denotes the cosine distance.]
Uses Sentence-BERT (sBERT) [6] with the LaTeX math representation and the cosine similarity.
Uses our automatic relevance judgements to optimize the Triplet objective.
Stark difference in performance between automatic and human-annotated relevance judgements.
Accuracy: nDCG’ 0.009, not significantly better than zero.
Speed: avg. 3.43 s/topic, min. 3.2 s/topic, max. 3.67 s/topic.
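The Triplet objective pulls a question towards a relevant answer and pushes it away from an irrelevant one. A minimal sketch with plain Python lists in place of BERT outputs; the cosine distance and the margin of 0.5 are assumptions made for illustration, not the settings used in training:

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(token_embeddings: List[Vector]) -> List[float]:
    """Average wordpiece embeddings into one fixed-size post embedding."""
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]


def cosine_distance(u: Vector, v: Vector) -> float:
    """One minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm > 0 else 0.0)


def triplet_loss(q: Vector, a_rel: Vector, a_irrel: Vector, margin: float = 0.5) -> float:
    """Zero once the relevant answer is closer to q than the irrelevant one by `margin`."""
    return max(0.0, cosine_distance(q, a_rel) - cosine_distance(q, a_irrel) + margin)


# Hypothetical two-dimensional embeddings.
question = [1.0, 0.0]
relevant = mean_pool([[1.0, 0.0], [1.0, 0.2]])  # -> [1.0, 0.1]
irrelevant = [0.0, 1.0]

loss_good = triplet_loss(question, relevant, irrelevant)
loss_bad = triplet_loss(question, irrelevant, relevant)  # roles swapped
```

The loss is zero when the embeddings are already well separated and grows when the irrelevant answer sits closer to the question than the relevant one.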
Ensemble
Interleaves the result lists of primary submissions: MIaS, SCM, and CompuBERT.
Uses a parameter-free ensembling algorithm that only uses ranks, not scores.
Results are ranked by median rank, then by frequency, and then interleaved.
Tie-breaking: More than 40% of all results were arbitrarily interleaved.
Accuracy: nDCG’ 0.238, best of our systems, significantly better than all but SCM.
The ensemble of all non-baseline primary submissions (0.419) was the best in the competition.
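One possible reading of the rank-only ensembling described above, as a sketch: collect each result's ranks across the input lists, order by median rank, then by how many lists contain the result. Breaking the remaining ties by identifier is an assumption that stands in for the arbitrary interleaving and keeps the example deterministic:

```python
from statistics import median
from typing import Dict, List


def ensemble(result_lists: List[List[str]]) -> List[str]:
    """Merge ranked lists using ranks only, never scores."""
    ranks: Dict[str, List[int]] = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            ranks.setdefault(doc, []).append(rank)
    # Lower median rank first, then results found by more systems,
    # then an arbitrary but deterministic order (by identifier).
    return sorted(ranks, key=lambda doc: (median(ranks[doc]), -len(ranks[doc]), doc))


# Hypothetical result lists of three systems.
mias = ["a", "b", "c"]
scm = ["b", "a", "d"]
compubert = ["e", "a", "b"]
merged = ensemble([mias, scm, compubert])
```

Because only ranks enter the computation, the systems' scores never need to be normalized against each other.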
Tangent-L Overview
Methodology
Conversion – Transform each topic into a “bag” of formulae and keywords
Searching – Use Tangent-L to query the indexed corpus (MSE question-answer pairs)
Re-ranking – Re-order the best matches by considering additional metadata:
similarity
tags
votes
reputation
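A re-ranking step over these four signals could look like the sketch below. The `Candidate` fields, the normalization to [0, 1], and the weights are all hypothetical; the slides do not say how Tangent-L actually combines the metadata:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    post_id: str
    similarity: float   # retrieval score from the searching step
    tag_overlap: float  # share of the query's tags found on the candidate
    votes: float        # vote score, normalized to [0, 1]
    reputation: float   # author reputation, normalized to [0, 1]


# Hypothetical weights, chosen only to make the example concrete.
WEIGHTS = {"similarity": 0.7, "tag_overlap": 0.1, "votes": 0.1, "reputation": 0.1}


def rerank(candidates: List[Candidate]) -> List[Candidate]:
    """Re-order candidates by a weighted sum of similarity and metadata."""
    def score(candidate: Candidate) -> float:
        return sum(weight * getattr(candidate, name) for name, weight in WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)


first_hit = Candidate("A", similarity=0.80, tag_overlap=0.0, votes=0.1, reputation=0.1)
second_hit = Candidate("B", similarity=0.75, tag_overlap=1.0, votes=0.9, reputation=0.5)
reranked_ids = [c.post_id for c in rerank([first_hit, second_hit])]
```

Here the slightly less similar post B overtakes A because its tags match and it is better voted.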
Results
strong performance on topics that rely heavily on formulae
strong on Computation-type and Proof-type topics, but particularly weak on Concept-type topics
none of the Concept-type topics has a Formula-dependency
excels at all three levels of difficulty (Easy, Medium, and Hard) on topics relying on
formulae (Formula-dependency or Both-dependency)
Bibliography
[1] Piotr Bojanowski et al. “Enriching word vectors with subword information”. In:
Transactions of the Association for Computational Linguistics 5 (2017), pp. 135–146.
[2] Deyan Ginev. arXMLiv:08.2019 dataset, an HTML5 conversion of arXiv.org.
SIGMathLing – Special Interest Group on Math Linguistics. 2019. URL:
https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/.
[3] Quoc V. Le and Tomas Mikolov. “Distributed Representations of Sentences and
Documents”. In: CoRR abs/1405.4053 (2014). URL:
http://arxiv.org/abs/1405.4053.
[4] Vít Novotný. “Implementation Notes for the Soft Cosine Measure”. In:
Proceedings of the 27th ACM International Conference on Information and Knowledge
Management (CIKM ’18). Torino, Italy: Association for Computing Machinery, 2018,
pp. 1639–1642. ISBN: 978-1-4503-6014-2. DOI: 10.1145/3269206.3269317.
[5] Vít Novotný et al. Text classification with word embedding regularization and soft
similarity measure. 2020. arXiv: 2003.05019 [cs.IR]. URL:
https://arxiv.org/abs/2003.05019.
[6] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using
Siamese BERT-Networks”. In: Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for
Computational Linguistics, Nov. 2019, pp. 3982–3992. DOI:
10.18653/v1/D19-1410. URL:
https://www.aclweb.org/anthology/D19-1410.
[7] Tetsuya Sakai and Noriko Kando. “On information retrieval metrics designed for
evaluation with incomplete relevance assessments”. In: Information Retrieval 11.5
(2008), pp. 447–470.
[8] Grigori Sidorov et al. “Soft similarity and soft cosine measure: Similarity of features
in vector space model”. In: Computación y Sistemas 18.3 (2014), pp. 491–504.
[9] Krzysztof Wojciechowski et al. The EuDML Search and Browsing Service – Final.
Deliverable D5.3 of EU CIP-ICT-PSP project 250503 EuDML: The European Digital
Mathematics Library, revision 1.2. Feb. 2013. URL:
https://project.eudml.eu/sites/default/files/D5_3_v1.2.pdf.