Day 14 of 100 days of large language models

😲 Discovery of the day! Millions of Wikipedia articles in several languages available as an open source dataset of passage embeddings.

🤩 An example of how this dataset could power apps is applying Retrieval Augmented Generation (RAG). Take a user query such as “How do black holes form?”, encode it, and retrieve passages from the Wikipedia dataset that are semantically similar. Add these passage texts to the user query to ground it and give the LLM some context. Helps the LLM answer more accurately and directly.

🏋️ Nils Reimers (creator of sentence transformers), you continue to do amazing things, thanks for this dataset!

📖 Read more on the Wikipedia embeddings archive here