With COVID-LEAP, I developed and evaluated several information retrieval (IR) strategies for locating and ranking relevant candidate paragraphs from a large corpus of academic articles. These included lexical models (e.g., BM25), dense models (e.g., msmarco-distilbert-base-v3), and re-ranking pipelines (e.g., a BM25 lexical first stage re-ranked by a dense c19gq-ance-msmarco-passage second stage), each tested with and without paper quality metrics applied.
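Below is a minimal sketch of the retrieve-and-rerank idea, assuming the `rank_bm25` and `sentence-transformers` libraries and a toy three-paragraph corpus. The thesis models (e.g., the fine-tuned c19gq-ance-msmarco-passage re-ranker) and the real COVID-19 corpus are not reproduced here; this is illustrative only.

```python
# Illustrative two-stage pipeline: BM25 lexical recall, then dense re-ranking.
# The corpus and model choice are toy stand-ins, not the thesis configuration.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "BM25 is a lexical ranking function based on term frequencies.",
    "Dense retrievers embed queries and passages into a shared vector space.",
    "COVID-19 vaccine trials reported efficacy above 90 percent.",
]
query = "vaccine trial efficacy"

# Stage 1: lexical first stage over whitespace-tokenised paragraphs.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: semantic re-ranking of the BM25 candidates with a dense bi-encoder.
model = SentenceTransformer("msmarco-distilbert-base-v3")
q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode([corpus[i] for i in top_k], convert_to_tensor=True)
reranked = util.cos_sim(q_emb, p_emb)[0].argsort(descending=True)

for rank, idx in enumerate(reranked.tolist(), start=1):
    print(rank, corpus[top_k[idx]])
```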
A summary of the extensive testing:
- Intrinsic evaluation with synthetic benchmarking enabled automated, iterative evaluation and optimisation (a minimal sketch of one such metric appears after this list). It showed that semantic models benefit from domain-specific and longer-passage fine-tuning, and that BM25 performs well. Re-ranking ensembles capitalise on a strong BM25 first stage, refining its lexical results with semantic matching for even better performance.
- Extrinsic evaluation by a medical expert with over 30 years' experience shows a consensus with the intrinsic testing, with ANCE with cross-encoding the strongest strategy. In contrast to the synthetic benchmarking, however, BM25 performs poorly, highlighting the importance of human-in-the-loop evaluation of deep-learning solutions.
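As a hypothetical sketch of the kind of automated intrinsic evaluation described above, the snippet below computes mean reciprocal rank (MRR) over synthetic query-to-relevant-paragraph pairs. How the synthetic pairs were generated, and the exact metrics used, are detailed in the thesis; the function and toy data here are assumptions for illustration.

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR over queries.

    rankings: one ranked list of paragraph ids per query.
    relevant: the gold (synthetic) paragraph id for each query.
    """
    total = 0.0
    for ranked_ids, gold in zip(rankings, relevant):
        if gold in ranked_ids:
            total += 1.0 / (ranked_ids.index(gold) + 1)
    return total / len(rankings)

# Toy usage: gold paragraph 7 is ranked 1st for the first query, 3rd for
# the second, so MRR = (1/1 + 1/3) / 2 ≈ 0.667.
print(mean_reciprocal_rank([[7, 2, 9], [4, 8, 7]], [7, 7]))
```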
Overall, LEAP outperforms PubMed Central. For much more detail, please see Chapter 6 of the thesis.