[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"blog-article-en-how-semantic-search-works":3,"blog-related-43c0f013-6a6b-4286-8e74-bbb0bd8eaf93-en-busca":53},{"post":4,"html":52,"reading_time_minutes":14},{"id":5,"tenant_id":6,"author":7,"status":11,"published_at":12,"cover_image_url":13,"reading_time_minutes":14,"view_count":15,"like_count":16,"featured":17,"created_at":18,"updated_at":19,"translations":20,"tags":35},"43c0f013-6a6b-4286-8e74-bbb0bd8eaf93","3272d9cc-43d0-4e23-8d75-3db7f042b2b3",{"sub":8,"name":9,"email":10},"554c6643-8b1c-4484-b190-4d9c71d0c275","Rodolfo De Bonis","dev@rodolfodebonis.com.br","published","2026-05-16T05:52:44.530779Z","https://rodolfodebonis.com.br/api/cdn/portfolio/blog-covers/198d42d7-9779-4d52-85bf-37d93abd1233.png",13,202,2,true,"2026-05-16T05:52:44.463814Z","2026-05-18T05:31:00.131636Z",[21,28],{"id":22,"post_id":5,"tenant_id":6,"lang":23,"slug":24,"title":25,"excerpt":26,"content_md":27,"created_at":18,"updated_at":18},"4ee0c693-b676-4ea5-ba02-d600111df7b7","en","how-semantic-search-works","Semantic Search: teaching machines to understand intent, not just words","BM25 still works, but it has four serious blind spots. Embeddings turn text into geometry — and that's the foundation of everything that's hyped in AI today. First in a series on modern search.","A few years ago, if you asked me how search works in a serious system, I'd answer in three words: inverted index, BM25, done. That was the state of the art, that was what ran everywhere, and that was what I knew well enough to teach.\n \nToday, after putting semantic search into production on top of nearly a million documents, I'd change my answer. Not because BM25 got worse — quite the opposite, it's still the foundation of almost every search system in the world. I'd change it because BM25 alone is leaving a lot of value on the table. 
And what fills that gap is an idea that looks like magic, until you understand what's actually happening underneath.\n \nThis is the first post in a series on modern search. Here we cover **semantic search**: what it is, why it works, how to use it. The next post adds BM25 and vector search together as **hybrid search**. The third closes with **reranking**, the cherry that separates good search from excellent search. But first we need to understand the underlying problem.\n \n## The problem most people don't notice\n \nImagine you're building a movie catalog. Could be a Letterboxd, an IMDb, an internal app. A user shows up and types:\n \n> \"movie about a guy stuck in the same day\"\n \nYou know what they want. I know what they want. Anyone who's seen *Groundhog Day* knows what they want. The problem is that your database doesn't know.\n \nIf you're using what 99% of systems use (text search based on an inverted index), your database will take that query, look for the literal words (\"guy\", \"stuck\", \"same\", \"day\"), and return:\n \n- *Stuck on You* (it has \"stuck\" in the title)\n- *Day After Tomorrow* (it has \"day\" in the title)\n- Any documentary about prison life (because \"stuck\" shows up in summaries about confinement)\n \nThe result: the user doesn't find *Groundhog Day*, closes your app, opens Google, types exactly the same phrase. And Google finds it. Why? **Because Google isn't running `LIKE '%stuck%'`.**\n \nThe difference between \"search that works\" and \"search that frustrates the user\" doesn't live in which database you picked, or how many servers you threw at it. 
It lives in understanding that **words are not meanings**, and that there are tools to bridge that gap.\n \n## How the machine sees text: BM25 and friends\n \nBefore we talk about the pretty stuff, let's understand what's running almost everywhere today.\n \nWhen you send `\"comfortable white sneakers\"` to Elasticsearch (or OpenSearch, or Solr — all the same family), three things happen in sequence. First, **tokenization**: the string gets broken into individual words. Second, **stemming**: suffixes get cut to reduce morphological variations. \"comfortable\", \"white\", \"sneakers\" become \"comfort\", \"white\", \"sneak\". Third, that gets matched against the **inverted index**: a structure that, for each token, stores the list of documents where that token appears.\n \n```\nsneak    -> [12, 47, 89, 156, ...]\nwhite    -> [12, 102, 230, ...]\ncomfort  -> [47, 88, 102, ...]\n```\n \nDocument 12 shows up in two lists. Document 47 also. Document 88, in just one. The more lists a document appears in, the more likely it's relevant. But that's just the start.\n \nWhat ranks the results is **BM25** — Best Match 25, the twenty-fifth iteration of a family of algorithms that started in the 70s. It's the default in Elasticsearch, OpenSearch, and Solr. If you use text search anywhere, BM25 is what's scoring.\n \nThe formula looks intimidating, but it has only three ideas:\n \n**Term Frequency (TF)**: how many times the token appears in the document. More times, more relevant. But not linear — if it appears 50 times, it's not 50× better than appearing once. The formula saturates.\n \n**Inverse Document Frequency (IDF)**: rare terms are worth more. The word \"sneakers\" appears in 2% of your fashion catalog — high weight. The word \"the\" appears in 100% of documents — weight nearly zero. 
Makes sense: if you searched for \"white sneakers\", matching \"sneakers\" tells me a lot more about relevance than matching \"white\".\n \n**Length normalization**: a short document containing the term is more relevant than a long document containing the term, because the chance that it's actually about that thing is higher. Without this, a 5000-word technical manual would beat a short description just by volume.\n \nBM25 mixes these three things and produces a score. Higher score, higher in the results. It's elegant, it scales well, and it works reasonably in a closed domain. But it has four serious blind spots.\n \n## Where BM25 breaks\n \n**Synonyms.** User searches for \"cellphone\", catalog has \"smartphone\". Zero match. You can solve this with a manual synonym dictionary, but you're going to maintain that for English, Portuguese, regional slang, and every niche's jargon? Good luck.\n \n**Vocabulary.** User searches for \"lightweight clothes for hot weather\". Catalog has \"short-sleeve linen blouse\". Same intent, zero words in common. BM25 returns nothing.\n \n**Intent.** User searches for \"good movie to watch with my girlfriend\". What does that mean? BM25 will match on \"movie\", \"good\", and \"girlfriend\" and return random results.\n \n**Multilingual.** You indexed in English, the user searches in Spanish. Same content, different languages, BM25 has no way to know they're the same thing.\n \nThe problem, at the core, is the same: **BM25 looks at the surface of the text, not at the meaning.** It's a tool for token coincidence, not for understanding.\n \nThat's where the interesting part comes in.\n \n## Embeddings: text becomes geometry\n \nThe simplest and most powerful definition that exists:\n \n> **An embedding is a dense vector of N numbers that represents the meaning of a piece of text.**\n \nYou send the word \"pizza\" to the embedding model. 
It returns a list of — say — 1024 numbers between -1 and 1:\n \n```\n[0.21, -0.05, 0.78, 0.13, -0.42, ..., 0.09]\n```\n \nYou send \"lasagna\". It returns another list of 1024 numbers:\n \n```\n[0.19, -0.08, 0.81, 0.10, -0.40, ..., 0.07]\n```\n \nThe magic is that these two lists will be **nearly identical**. Not because the model saw \"pizza\" and \"lasagna\" together — though that helped during training — but because it learned that both live in the same semantic neighborhood: Italian food, main dish, pasta or dough base, dinner context.\n \nNow send \"Python\". The vector will be very different. Because Python is a programming language, it's tech, it's an entirely different context. The vector for \"Java\" will be similar to the one for \"Python\", because both are languages. Pizza and lasagna sit together in one corner, Python and Java sit together in another corner, cat and dog sit together in a third corner.\n \n**The distance between two vectors becomes a measure of semantic similarity.** That's the trick. You've converted text — which is symbolic, discrete, hard to compare — into geometry. And geometry, we know how to measure.\n \nIn practice, generating an embedding looks like this:\n \n```python\nfrom openai import OpenAI\n \nclient = OpenAI()\nresponse = client.embeddings.create(\n    input=\"pizza margherita\",\n    model=\"text-embedding-3-small\"\n)\nvector = response.data[0].embedding  # list of 1536 floats\n```\n \nText goes in, geometry comes out.\n \n## Algebra with meanings\n \nTo make this less abstract: because these vectors actually carry meaning, you can do math with them. Real math. The classic experiment, from Mikolov's 2013 paper:\n \n```\nvector(\"king\") - vector(\"man\") + vector(\"woman\") ≈ vector(\"queen\")\n```\n \nThe model learned, without anyone explicitly teaching it, that there's a masculinity-femininity axis in vector space. 
Another one:\n \n```\nvector(\"Paris\") - vector(\"France\") + vector(\"Italy\") ≈ vector(\"Rome\")\n```\n \nThe concept of \"capital of a country\" became a direction in space. You can subtract \"France\" to remove the \"specific country\" component, then add \"Italy\" to put it back. The result lands near the Italian capital.\n \nThis isn't pretty theoretical math. It's literally what's happening inside the model. That's why semantic search works: the model learned structure about the world, and you're doing geometry on top of that structure.\n \n## Where these numbers come from\n \nEmbedding models are neural networks trained on absurd amounts of text to learn these representations. The main families today:\n \n**Commercial APIs.** OpenAI (`text-embedding-3-small`, `text-embedding-3-large`), Cohere (`embed-v3`, strong on multilingual), Voyage AI. Expensive, but quality near the top of the leaderboard, no infra to maintain on your side.\n \n**State-of-the-art open-source.** BGE (from BAAI), E5 (from Microsoft), GTE (from Alibaba). You run it on your GPU, zero API cost, zero vendor lock-in. BGE-M3 and BGE-large multilingual compete well with OpenAI in many benchmarks.\n \n**The classic base.** Sentence-Transformers — the library that popularized all of this. Smaller, simpler models, great for prototyping.\n \n**How to choose?** Go to [MTEB](https://huggingface.co/spaces/mteb/leaderboard) — the Massive Text Embedding Benchmark, the public reference leaderboard. Pick a model in your cost and size range, and **test it on your domain**. A model that's good at English may be bad at Portuguese. A model that's good at short text may be bad at long documents. A model that's good at general domain may be bad at legal, medical, or technical vocabulary. Always measure.\n \n## How to compare two vectors\n \nYou have the query vector, you have the document vectors. How do you compare them? 
Three options, in the order you'll probably use them.\n \n**Cosine similarity** measures the angle between vectors, ignoring magnitude. It ranges from -1 to 1 (in practice, with text, it usually falls between 0 and 1). It's the default in the overwhelming majority of cases, because meaning lives in direction, not in the size of the vector.\n \n**Dot product** is the scalar product. It cares about direction *and* magnitude. If your vectors are normalized (and most modern models return normalized vectors), dot product gives exactly the same ranking as cosine and is cheaper to compute. Use it for optimization.\n \n**Euclidean distance** is the straight-line distance between two points. It works, but it's less common with text. It shows up more with image embeddings.\n \nRule of thumb: start with cosine, switch to dot product when optimizing.\n \n## kNN, ANN, and the scale problem\n \nHow do you find the documents most similar to the query? The conceptual algorithm is **k-Nearest Neighbors (kNN)**. Four steps: embed the query, measure the distance to every document in the catalog, sort, and return the top K.\n \nIt works perfectly on a thousand documents. On a million, computing the distance against every single one is O(n) per query, which is infeasible in real time.\n \nThe solution is **ANN (Approximate Nearest Neighbors)**. You give up a bit of precision to gain orders of magnitude in speed. Instead of comparing against everything, you compare against an intelligently chosen subset.\n \nThe most popular algorithm today is **HNSW (Hierarchical Navigable Small World)**. It works like a world map with multiple zoom levels: you start at the highest level, navigate quickly until you're close to the answer, then descend into more detailed levels and refine. From O(n) you drop to roughly O(log n). 
It's what Elasticsearch uses, what pgvector uses, what practically every vector database uses underneath.\n \nIn production, recall of 95-98% is easy to hit with a well-configured HNSW, and latency lands in the milliseconds.\n \n## Where to run this\n \nThe ecosystem has exploded in the last few years. I split it into two families.\n \n**Dedicated vector databases**: Pinecone (SaaS), Weaviate, Qdrant, Milvus, Chroma. They were born for this. They generally deliver better performance and more mature features for vector search. The downside is it's one more database to maintain.\n \n**Databases that added vector support**: Elasticsearch and OpenSearch (already had mature text search, got vector); PostgreSQL with the pgvector extension; Redis; MongoDB. The advantage here is reusing a stack you already have. The downside is that features and performance sometimes aren't as polished as in the dedicated ones.\n \nHow to decide? If you already have Elasticsearch in production, **start by adding vector search to it**. Don't switch databases over a new feature. If you're starting from zero and want simplicity, Qdrant and Weaviate are great. If you're Postgres-first and your volume is moderate, pgvector handles it. There's no single answer — there's tradeoff.\n \n## The movie example, now with semantics\n \nRemember the query from the beginning? \"movie about a guy stuck in the same day\". Before, with BM25, it returned *Stuck on You* and *Day After Tomorrow* because of the literal words. Now, with vector search:\n \n| Position | Movie | Score |\n|---|---|---|\n| 1 | *Groundhog Day* | 0.89 |\n| 2 | *Edge of Tomorrow* | 0.85 |\n| 3 | *Palm Springs* | 0.81 |\n \nNotice the important detail: the synopsis of *Groundhog Day* says \"a man relives the same day repeatedly\". The word \"stuck\" never appears. But the model understood that being in a time loop is a form of being stuck in time. 
*Palm Springs* is an indie film many people have never heard of — but the model knows it, because it trained on descriptions from the entire internet.\n \nSame query. Completely different results. **No synonym dictionary. No manual rules.** The model just did geometry.\n \n## Where vector search also breaks\n \nBefore you walk away from here with stars in your eyes — vector search has serious blind spots too.\n \n**Exact identifiers.** User searches for `SKU-A4729`. Vector search will return things semantically similar, which is not what they want. For SKUs, product codes, IDs, order numbers — you need exact match, not similarity.\n \n**Negations.** \"Shoe without laces\" might return shoes with laces, because the concept \"laces\" is strongly represented in the query vector. Modern models handle this better, but it remains fragile.\n \n**Very short or ambiguous queries.** \"java\" — is it the language, the island, or the coffee? BM25 also suffers, but vector search doesn't magically solve it.\n \n**Cost.** Generating an embedding per document costs. Vector storage costs (1024 dimensions × 4 bytes × N documents). Reindexing when you switch models costs. The latency of generating an embedding for every query also costs.\n \nWhen you stack the blind spots — exactness, short queries, cost — it becomes clear: **replacing BM25 with vector search is trading one set of problems for another**.\n \n## The answer is to combine, not replace\n \nThe good news is that these two worlds have complementary blind spots. BM25 is strong exactly where vector search is weak: exact match, SKUs, specific keywords. Vector search is strong exactly where BM25 is weak: meaning, intent, divergent vocabulary.\n \nWhen you combine them, what one misses the other catches.\n \nThat's what the next post is about: **hybrid search**. 
How to run BM25 and vector search in parallel, how to fuse the rankings without falling into the obvious trap of weighting score against score, and why this is the architecture that delivers the best results in production from day zero, with no manual weight tuning.\n \nUntil then, if you've never played with embeddings, **open a notebook**. Grab the OpenAI API (or run BGE locally), embed a few dozen sentences from your domain, compute cosine between them, and see what clusters together. It's the best way to internalize the idea: to see with your own eyes that the model actually understands.\n \n---\n \n*This is the first post in a series on modern search. Next: hybrid search with Reciprocal Rank Fusion. If you're applying this in some context, drop me a message — I'd love to know where.*",{"id":29,"post_id":5,"tenant_id":6,"lang":30,"slug":31,"title":32,"excerpt":33,"content_md":34,"created_at":18,"updated_at":18},"935d7ee0-2d3a-4018-8572-6b76ae1b805d","pt-BR","como-funciona-busca-semantica","Busca Semântica: como ensinar máquinas a entender intenção (não só palavras)","BM25 ainda vive, mas tem quatro buracos sérios. Embeddings transformam texto em geometria — e essa é a base de tudo que tem hype hoje em IA. Primeiro artigo de uma série sobre busca moderna.","Há uns anos, se você me perguntasse como funciona busca em um sistema sério, eu responderia em três palavras: índice invertido, BM25, fim. Era o estado da arte, era o que rodava em todo lugar, e era o que eu sabia o suficiente pra ensinar.\n \nHoje, depois de colocar busca semântica em produção em cima de quase um milhão de documentos, eu mudaria a resposta. Não porque BM25 ficou ruim — pelo contrário, ele continua sendo a base de quase todo sistema de busca no mundo. Mudaria porque BM25 sozinho está deixando muito valor na mesa. E o que preenche esse vazio é uma ideia que parece mágica, até você entender o que está acontecendo por baixo.\n\nEsse é o primeiro post de uma série sobre busca moderna. 
Aqui a gente trata só de **busca semântica** — o quê, o porquê, o como. No próximo, vamos somar BM25 com vetorial em **busca híbrida**. No terceiro, fechamos com **reranking**, a cereja que separa busca boa de busca excelente. Mas antes a gente precisa entender o problema base.\n \n## O problema que ninguém nota\n\nImagine que você está construindo um catálogo de filmes. Pode ser um Letterboxd, um IMDB, um app interno. Um usuário chega e digita:\n \n> \"filme sobre cara preso no mesmo dia\"\n \nVocê sabe o que ele quer. Eu sei o que ele quer. Qualquer pessoa que viu *Feitiço do Tempo* sabe o que ele quer. O problema é que seu banco de dados não sabe.\n \nSe você está usando o que 99% dos sistemas usam — busca textual baseada em índice invertido — o banco vai pegar essa query, procurar pelas palavras literais (\"cara\", \"preso\", \"mesmo\", \"dia\") e te entregar:\n\n- *O Especialista* (tem \"preso\" no roteiro)\n- *Os Detentos* (tem \"preso\" no título)\n- Qualquer documentário sobre o sistema penitenciário\nResultado: o usuário não encontra *Feitiço do Tempo*, fecha o app, abre o Google e digita exatamente a mesma frase. E o Google encontra. Por quê? **Porque o Google não está fazendo `LIKE '%preso%'`.**\n \nA diferença entre \"busca que funciona\" e \"busca que frustra o usuário\" não está no banco que você usa, nem na quantidade de servidores. Está em entender que **palavras não são significados**, e que existem ferramentas pra preencher esse abismo.\n \n## Como a máquina vê texto: BM25 e amigos\n \nAntes de falar do bonito, vamos entender o que está rodando hoje em quase todo lugar.\n \nQuando você manda `\"tenis branco confortavel\"` pro Elasticsearch (ou OpenSearch, ou Solr — todos da mesma família), três coisas acontecem em sequência. Primeiro, **tokenização**: a string é quebrada em palavras individuais. Segundo, **stemming**: sufixos são cortados pra reduzir variações morfológicas. 
\"tenis\", \"branco\", \"confortavel\" viram \"tenis\", \"branc\", \"confort\". Terceiro, isso bate contra o **índice invertido**: uma estrutura que, pra cada token, guarda a lista de documentos onde aquele token aparece.\n \n```\ntenis    -> [12, 47, 89, 156, ...]\nbranc    -> [12, 102, 230, ...]\nconfort  -> [47, 88, 102, ...]\n```\n \nDocumento 12 aparece em duas listas. Documento 47 também. Documento 88, em uma só. Quanto mais listas o documento aparece, mais provável ele ser relevante. Mas isso é só o começo.\n \nO que ordena os resultados é o **BM25** — Best Match 25, a vigésima quinta iteração de uma família de algoritmos que começou nos anos 70. É o default do Elasticsearch, do OpenSearch e do Solr. Se você usa busca textual em algum lugar, BM25 é quem está pontuando.\n \nA fórmula parece intimidante, mas tem três ideias só:\n \n**Term Frequency (TF)**: quantas vezes o token aparece no documento. Mais vezes, mais relevante. Mas não linear — se aparecer 50 vezes, não é 50× melhor que aparecer 1 vez. A fórmula satura.\n \n**Inverse Document Frequency (IDF)**: termos raros valem mais. A palavra \"tênis\" aparece em 2% do seu catálogo de moda — peso alto. A palavra \"de\" aparece em 100% dos documentos — peso quase zero. Faz sentido: se você procurou \"tênis branco\", acertar \"tênis\" me diz muito mais sobre relevância do que acertar \"branco\".\n \n**Normalização por tamanho**: documento curto que contém o termo é mais relevante que documento longo que contém o termo, porque a chance de ser exatamente sobre aquilo é maior. Sem isso, um manual técnico de 5000 palavras ganharia de uma descrição curta só por volume.\n \nBM25 mistura essas três coisas e gera um score. Documento com score maior vai pro topo. É elegante, escala bem, e funciona razoavelmente em domínio fechado. Mas tem quatro grandes pontos cegos.\n\n## Onde BM25 quebra\n \n**Sinônimos.** Usuário busca \"celular\", catálogo tem \"smartphone\". Zero match. 
Você pode até resolver com dicionário manual de sinônimos, mas vai manter isso pra português, inglês, gírias regionais e jargão de cada nicho? Boa sorte.\n \n**Vocabulário.** Usuário busca \"roupa fresquinha pra usar no calor\". Catálogo tem \"blusa de linho manga curta\". Mesma intenção, zero palavras em comum. BM25 não devolve nada.\n \n**Intenção.** Usuário busca \"filme bom pra ver com a namorada\". O que isso significa? BM25 vai bater em \"filme\", \"bom\" e \"namorada\" e te entregar resultado aleatório.\n \n**Multilíngue.** Você indexou em português, o usuário busca em inglês. Mesmo conteúdo, idiomas diferentes, BM25 não tem como saber que é a mesma coisa.\n \nO problema, no fundo, é o mesmo: **BM25 olha pra superfície do texto, não pro significado.** Ele é uma ferramenta de coincidência de tokens, não de compreensão.\n \nÉ aí que entra a parte interessante.\n \n## Embeddings: texto vira geometria\n \nA definição mais simples e mais poderosa que existe:\n \n> **Um embedding é um vetor denso de N números que representa o significado de um pedaço de texto.**\n \nVocê manda a palavra \"pizza\" pro modelo de embedding. Ele te devolve uma lista de — digamos — 1024 números entre -1 e 1:\n \n```\n[0.21, -0.05, 0.78, 0.13, -0.42, ..., 0.09]\n```\n \nManda \"lasanha\". Ele devolve outra lista de 1024 números:\n \n```\n[0.19, -0.08, 0.81, 0.10, -0.40, ..., 0.07]\n```\n \nA mágica é que essas duas listas vão ser **quase idênticas**. Não porque o modelo viu \"pizza\" e \"lasanha\" juntas — embora isso tenha ajudado no treinamento — mas porque ele aprendeu que ambas vivem no mesmo bairro semântico: comida italiana, prato principal, base de massa, contexto de jantar.\n \nAgora manda \"Python\". O vetor vai ser muito diferente. Porque Python é linguagem de programação, é tecnologia, é outro contexto inteiro. O vetor de \"Java\" vai ser parecido com o de \"Python\", porque ambos são linguagens. 
Pizza e lasanha ficam juntas num canto, Python e Java ficam juntos noutro canto, gato e cachorro ficam juntos num terceiro canto.\n \n**A distância entre dois vetores vira uma medida de similaridade semântica.** Esse é o pulo do gato. Você converteu texto — que é simbólico, discreto, difícil de comparar — em geometria. E geometria a gente sabe medir.\n \nNa prática, gerar um embedding parece com isso:\n \n```python\nfrom openai import OpenAI\n \nclient = OpenAI()\nresponse = client.embeddings.create(\n    input=\"pizza margherita\",\n    model=\"text-embedding-3-small\"\n)\nvector = response.data[0].embedding  # lista de 1536 floats\n```\n \nPronto. Texto entrou, geometria saiu.\n\n## Álgebra com significados\n \nPra deixar isso menos abstrato: como esses vetores carregam significado de verdade, você pode fazer conta com eles. Conta de verdade. O experimento clássico, do paper de Mikolov de 2013:\n \n```\nvetor(\"rei\") - vetor(\"homem\") + vetor(\"mulher\") ≈ vetor(\"rainha\")\n```\n \nO modelo aprendeu, sem ninguém ensinar explicitamente, que existe um eixo de masculinidade-feminilidade no espaço vetorial. Outro:\n \n```\nvetor(\"Paris\") - vetor(\"França\") + vetor(\"Itália\") ≈ vetor(\"Roma\")\n```\n\nCapital de país, como conceito, virou uma direção no espaço. Você pode subtrair \"França\" pra remover o componente \"país específico\", aí somar \"Itália\" pra colocar de volta. O resultado aterrissa perto da capital italiana.\n \nIsso não é matemática teórica bonita. É literalmente o que acontece dentro do modelo. Por isso busca semântica funciona: o modelo aprendeu estrutura do mundo, e você está fazendo geometria em cima dessa estrutura.\n \n## De onde vêm esses números\n \nModelos de embedding são redes neurais treinadas em quantidades absurdas de texto pra aprender essas representações. As famílias principais hoje:\n \n**APIs comerciais.** OpenAI (`text-embedding-3-small`, `text-embedding-3-large`), Cohere (`embed-v3`, forte em multilíngue), Voyage AI. 
Caros, mas qualidade no topo do leaderboard, sem você precisar manter infra.\n \n**Open-source de ponta.** BGE (da BAAI), E5 (da Microsoft), GTE (da Alibaba). Você roda na sua GPU, zero custo de API, zero vendor lock-in. BGE-M3 e BGE-large multilingual competem bem com OpenAI em vários benchmarks.\n \n**Base clássica.** Sentence-Transformers — a biblioteca que popularizou tudo isso. Modelos menores, mais simples, ótimos pra prototipar.\n \n**Como escolher?** Vai no [MTEB](https://huggingface.co/spaces/mteb/leaderboard) — o Massive Text Embedding Benchmark, leaderboard público de referência. Escolhe um modelo na sua faixa de custo e tamanho, e **testa no seu domínio**. Modelo bom em inglês pode ser ruim em PT-BR. Modelo bom em texto curto pode ser ruim em documento longo. Modelo bom em domínio geral pode ser ruim em vocabulário jurídico, médico, técnico. Sempre meça.\n \n## Como comparar dois vetores\n \nVocê tem o vetor da query, tem os vetores dos documentos. Como comparar? Três opções, em ordem do que você provavelmente vai usar.\n \n**Similaridade do cosseno** mede o ângulo entre os vetores, ignorando magnitude. Vai de -1 a 1 (na prática, em texto, fica entre 0 e 1). É o default na esmagadora maioria dos casos, porque o significado mora na direção, não no tamanho do vetor.\n \n**Dot product** é o produto escalar. Importa direção *e* magnitude. Se seus vetores são normalizados — e a maioria dos modelos modernos retorna normalizados — dot product é matematicamente equivalente ao cosseno e mais barato de computar. Use isso pra otimizar.\n \n**Distância euclidiana** é a reta entre dois pontos. Funciona, mas é menos comum em texto. Aparece mais em embeddings de imagem.\n \nRegra prática: comece com cosseno, troque pra dot product quando for otimizar.\n \n## kNN, ANN e o problema de escala\n \nComo achar os documentos mais parecidos com a query? O algoritmo conceitual é **k-Nearest Neighbors (kNN)**. 
Quatro passos: pega a query, gera o embedding, mede a distância pra cada documento do catálogo, ordena e retorna os top K.\n \nFunciona perfeitamente em mil documentos. Em um milhão, calcular distância contra cada um é O(n), inviável em tempo real.\n \nA solução é o **ANN — Approximate Nearest Neighbors**. Você abre mão de um pouco de precisão pra ganhar ordens de magnitude em velocidade. Ao invés de comparar com todos, compara com um subconjunto inteligentemente escolhido.\n \nO algoritmo mais popular hoje é o **HNSW — Hierarchical Navigable Small World**. Funciona como um mapa do mundo com vários níveis de zoom: você começa no nível mais alto, navega rapidamente até chegar perto da resposta, aí desce pra níveis mais detalhados e refina. De O(n) você cai pra O(log n). É o que o Elasticsearch usa, o que o pgvector usa, o que praticamente todos os bancos vetoriais usam por baixo.\n \nEm produção, recall de 95-98% é fácil de atingir com HNSW bem configurado, e a latência fica na casa dos milissegundos.\n \n## Onde rodar isso\n \nO ecossistema explodiu nos últimos anos. Eu divido em duas famílias.\n \n**Bancos vetoriais dedicados**: Pinecone (SaaS), Weaviate, Qdrant, Milvus, Chroma. Nasceram pra isso. Geralmente entregam melhor performance e features mais maduras pra busca vetorial. A desvantagem é que é mais um banco pra manter.\n \n**Bancos que ganharam suporte vetorial**: Elasticsearch e OpenSearch (já tinham busca textual madura, ganharam vetorial); PostgreSQL com a extensão pgvector; Redis; MongoDB. A vantagem aqui é reaproveitar stack que você já tem. A desvantagem é que features e performance às vezes não são tão polidas quanto nos dedicados.\n \nComo decidir? Se você já tem Elasticsearch rodando em produção, **comece adicionando vetorial nele**. Não troque de banco por causa de uma feature nova. Se está começando do zero e quer simplicidade, Qdrant e Weaviate são ótimos. Se é Postgres-first e volume moderado, pgvector resolve. 
Não tem resposta única — tem trade-off.\n \n## O exemplo dos filmes, agora com semântica\n \nLembra da query do começo? \"filme sobre cara preso no mesmo dia\". Antes, com BM25, ela retornava *O Especialista* e *Os Detentos* pelo \"preso\" literal. Agora, com busca vetorial:\n \n| Posição | Filme | Score |\n|---|---|---|\n| 1 | *Feitiço do Tempo* | 0.89 |\n| 2 | *Edge of Tomorrow* | 0.85 |\n| 3 | *Palm Springs* | 0.81 |\n \nOlha o detalhe importante: a sinopse de *Feitiço do Tempo* diz \"homem revive o mesmo dia repetidamente\". A palavra \"preso\" não aparece. Mas o modelo entendeu que loop temporal é uma forma de estar preso no tempo. *Palm Springs* é um filme indie que muita gente nunca viu — mas o modelo conhece, porque treinou em descrições da internet inteira.\n \nMesma query. Resultados completamente diferentes. **Sem dicionário de sinônimos. Sem regras manuais.** O modelo só fez geometria.\n \n## Onde a busca vetorial também quebra\n \nAntes que você saia daqui só com brilho no olho — vetorial também tem buracos sérios.\n \n**Identificadores exatos.** Usuário busca `SKU-A4729`. Busca vetorial vai retornar coisas semanticamente parecidas, que não é o que ele quer. Pra SKU, código de produto, ID, número de pedido — você precisa de match exato, não similaridade.\n \n**Negações.** \"Sapato sem cadarço\" pode te retornar sapato com cadarço, porque o conceito \"cadarço\" está fortemente representado no vetor da query. Modelos modernos lidam melhor com isso, mas continua frágil.\n \n**Queries muito curtas ou ambíguas.** \"java\" — é a linguagem, a ilha ou o café? BM25 também sofre, mas vetorial não resolve magicamente.\n \n**Custo.** Gerar embedding pra cada documento custa. Storage de vetor custa (1024 dimensões × 4 bytes × N documentos). Reindexar quando você muda de modelo custa. 
The latency of generating an embedding for every query also costs.\n \nWhen you stack the blind spots (exactness, short queries, cost), it becomes clear: **replacing BM25 with vector search is trading one set of problems for another**.\n \n## The answer is to combine, not replace\n \nThe good news is that these two worlds have complementary blind spots. BM25 is strong exactly where vector search is weak: exact match, SKUs, specific keywords. Vector search is strong exactly where BM25 is weak: meaning, intent, divergent vocabulary.\n \nWhen you combine them, what one misses the other catches.\n \nThat's what the next post is about: **hybrid search**. How to run BM25 and vector search in parallel, how to fuse the rankings without falling into the obvious trap of weighting score against score, and why this is the architecture that delivers the best results in production from day zero, with no manual weight tuning.\n \nUntil then, if you've never played with embeddings, **open a notebook**. Grab the OpenAI API (or run BGE locally), embed a few dozen sentences from your domain, compute cosine between them, and see what clusters together. It's the best way to internalize the idea: to see with your own eyes that the model actually understands.\n \n---\n \n*This is the first post in a series on modern search. Next: hybrid search with Reciprocal Rank Fusion. If you're applying this in some context, send me a message. 
I'd love to know what you're working on.*",[36,40,44,48],{"id":37,"tenant_id":6,"slug":38,"created_at":39},"3600c7f5-46a1-44c1-ab4d-14b54bc49300","busca","2026-05-16T05:37:43.100096Z",{"id":41,"tenant_id":6,"slug":42,"created_at":43},"7bfa9291-72ca-46cf-b0e0-4ff8d29a4f78","embeddings","2026-05-16T05:38:28.835806Z",{"id":45,"tenant_id":6,"slug":46,"created_at":47},"c0801447-701e-4cf6-8859-e4cf89f9d8a0","machine-learning","2026-05-16T05:38:07.501314Z",{"id":49,"tenant_id":6,"slug":50,"created_at":51},"43e39379-ec07-4aff-82a6-2946c56fa4f2","ml","2026-05-16T05:39:39.427214Z","\u003Cp>A few years ago, if you asked me how search works in a serious system, I’d answer in three words: inverted index, BM25, done. That was the state of the art, that was what ran everywhere, and that was what I knew well enough to teach.\u003C/p>\n\u003Cp>Today, after putting semantic search into production on top of nearly a million documents, I’d change my answer. Not because BM25 got worse; quite the opposite, it’s still the foundation of almost every search system in the world. I’d change it because BM25 alone is leaving a lot of value on the table. And what fills that gap is an idea that looks like magic, until you understand what’s actually happening underneath.\u003C/p>\n\u003Cp>This is the first post in a series on modern search. Here we cover \u003Cstrong>semantic search\u003C/strong>: what it is, why it works, how to use it. The next post combines BM25 and vector search into \u003Cstrong>hybrid search\u003C/strong>. The third closes with \u003Cstrong>reranking\u003C/strong>, the cherry on top that separates good search from excellent search. But first we need to understand the underlying problem.\u003C/p>\n\u003Ch2>The problem most people don’t notice\u003C/h2>\n\u003Cp>Imagine you’re building a movie catalog. Could be a Letterboxd, an IMDb, an internal app. 
A user shows up and types:\u003C/p>\n\u003Cblockquote>\n\u003Cp>“movie about a guy stuck in the same day”\u003C/p>\n\u003C/blockquote>\n\u003Cp>You know what they want. I know what they want. Anyone who’s seen \u003Cem>Groundhog Day\u003C/em> knows what they want. The problem is that your database doesn’t know.\u003C/p>\n\u003Cp>If you’re using what 99% of systems use (text search based on an inverted index), your database will take that query, look for the literal words (“guy”, “stuck”, “same”, “day”), and return:\u003C/p>\n\u003Cul>\n\u003Cli>\u003Cem>Stuck on You\u003C/em> (it has “stuck” in the title)\u003C/li>\n\u003Cli>\u003Cem>The Day After Tomorrow\u003C/em> (it has “day” in the title)\u003C/li>\n\u003Cli>Any documentary about prison life (because “stuck” shows up in summaries about confinement)\u003C/li>\n\u003C/ul>\n\u003Cp>The result: the user doesn’t find \u003Cem>Groundhog Day\u003C/em>, closes your app, opens Google, types exactly the same phrase. And Google finds it. Why? \u003Cstrong>Because Google isn’t running \u003Ccode>LIKE '%stuck%'\u003C/code>.\u003C/strong>\u003C/p>\n\u003Cp>The difference between “search that works” and “search that frustrates the user” doesn’t live in which database you picked, or how many servers you threw at it. It lives in understanding that \u003Cstrong>words are not meanings\u003C/strong>, and that there are tools to bridge that gap.\u003C/p>\n\u003Ch2>How the machine sees text: BM25 and friends\u003C/h2>\n\u003Cp>Before we talk about the pretty stuff, let’s understand what’s running almost everywhere today.\u003C/p>\n\u003Cp>When you send \u003Ccode>&quot;comfortable white sneakers&quot;\u003C/code> to Elasticsearch (or OpenSearch, or Solr, all built on Lucene), three things happen in sequence. First, \u003Cstrong>tokenization\u003C/strong>: the string gets broken into individual words. Second, \u003Cstrong>stemming\u003C/strong>: suffixes get cut to reduce morphological variations. 
“comfortable”, “white”, “sneakers” become “comfort”, “white”, “sneak”. Third, that gets matched against the \u003Cstrong>inverted index\u003C/strong>: a structure that, for each token, stores the list of documents where that token appears.\u003C/p>\n\u003Cpre class=\"shiki shiki-themes github-dark github-light\" style=\"--shiki-dark:#e1e4e8;--shiki-light:#24292e;--shiki-dark-bg:#24292e;--shiki-light-bg:#fff\" tabindex=\"0\">\u003Ccode>\u003Cspan class=\"line\">\u003Cspan>sneak    -> [12, 47, 89, 156, ...]\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan>white    -> [12, 102, 230, ...]\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan>comfort  -> [47, 88, 102, ...]\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan>\u003C/span>\u003C/span>\u003C/code>\u003C/pre>\n\u003Cp>Document 12 shows up in two lists. Document 47 also. Document 88, in just one. The more lists a document appears in, the more likely it’s relevant. But that’s just the start.\u003C/p>\n\u003Cp>What ranks the results is \u003Cstrong>BM25\u003C/strong> — Best Match 25, the twenty-fifth iteration of a family of algorithms that started in the 70s. It’s the default in Elasticsearch, OpenSearch, and Solr. If you use text search anywhere, BM25 is what’s scoring.\u003C/p>\n\u003Cp>The formula looks intimidating, but it has only three ideas:\u003C/p>\n\u003Cp>\u003Cstrong>Term Frequency (TF)\u003C/strong>: how many times the token appears in the document. More times, more relevant. But not linear — if it appears 50 times, it’s not 50× better than appearing once. The formula saturates.\u003C/p>\n\u003Cp>\u003Cstrong>Inverse Document Frequency (IDF)\u003C/strong>: rare terms are worth more. The word “sneakers” appears in 2% of your fashion catalog — high weight. The word “the” appears in 100% of documents — weight nearly zero. 
Makes sense: if you searched for “white sneakers”, matching “sneakers” tells me a lot more about relevance than matching “white”.\u003C/p>\n\u003Cp>\u003Cstrong>Length normalization\u003C/strong>: a short document containing the term is more relevant than a long document containing the term, because the chance that it’s actually about that thing is higher. Without this, a 5000-word technical manual would beat a short description just by volume.\u003C/p>\n\u003Cp>BM25 mixes these three things and produces a score. Higher score, higher in the results. It’s elegant, it scales well, and it works reasonably in a closed domain. But it has four serious blind spots.\u003C/p>\n\u003Ch2>Where BM25 breaks\u003C/h2>\n\u003Cp>\u003Cstrong>Synonyms.\u003C/strong> User searches for “cellphone”, catalog has “smartphone”. Zero match. You can solve this with a manual synonym dictionary, but you’re going to maintain that for English, Portuguese, regional slang, and every niche’s jargon? Good luck.\u003C/p>\n\u003Cp>\u003Cstrong>Vocabulary.\u003C/strong> User searches for “lightweight clothes for hot weather”. Catalog has “short-sleeve linen blouse”. Same intent, zero words in common. BM25 returns nothing.\u003C/p>\n\u003Cp>\u003Cstrong>Intent.\u003C/strong> User searches for “good movie to watch with my girlfriend”. What does that mean? BM25 will match on “movie”, “good”, and “girlfriend” and return random results.\u003C/p>\n\u003Cp>\u003Cstrong>Multilingual.\u003C/strong> You indexed in English, the user searches in Spanish. 
Same content, different languages, BM25 has no way to know they’re the same thing.\u003C/p>\n\u003Cp>The problem, at the core, is the same: \u003Cstrong>BM25 looks at the surface of the text, not at the meaning.\u003C/strong> It’s a tool for token coincidence, not for understanding.\u003C/p>\n\u003Cp>That’s where the interesting part comes in.\u003C/p>\n\u003Ch2>Embeddings: text becomes geometry\u003C/h2>\n\u003Cp>The simplest and most powerful definition that exists:\u003C/p>\n\u003Cblockquote>\n\u003Cp>\u003Cstrong>An embedding is a dense vector of N numbers that represents the meaning of a piece of text.\u003C/strong>\u003C/p>\n\u003C/blockquote>\n\u003Cp>You send the word “pizza” to the embedding model. It returns a list of — say — 1024 numbers between -1 and 1:\u003C/p>\n\u003Cpre class=\"shiki shiki-themes github-dark github-light\" style=\"--shiki-dark:#e1e4e8;--shiki-light:#24292e;--shiki-dark-bg:#24292e;--shiki-light-bg:#fff\" tabindex=\"0\">\u003Ccode>\u003Cspan class=\"line\">\u003Cspan>[0.21, -0.05, 0.78, 0.13, -0.42, ..., 0.09]\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan>\u003C/span>\u003C/span>\u003C/code>\u003C/pre>\n\u003Cp>You send “lasagna”. It returns another list of 1024 numbers:\u003C/p>\n\u003Cpre class=\"shiki shiki-themes github-dark github-light\" style=\"--shiki-dark:#e1e4e8;--shiki-light:#24292e;--shiki-dark-bg:#24292e;--shiki-light-bg:#fff\" tabindex=\"0\">\u003Ccode>\u003Cspan class=\"line\">\u003Cspan>[0.19, -0.08, 0.81, 0.10, -0.40, ..., 0.07]\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan>\u003C/span>\u003C/span>\u003C/code>\u003C/pre>\n\u003Cp>The magic is that these two lists will be \u003Cstrong>nearly identical\u003C/strong>. Not because the model saw “pizza” and “lasagna” together — though that helped during training — but because it learned that both live in the same semantic neighborhood: Italian food, main dish, pasta or dough base, dinner context.\u003C/p>\n\u003Cp>Now send “Python”. 
The vector will be very different. Because Python is a programming language, it’s tech, it’s an entirely different context. The vector for “Java” will be similar to the one for “Python”, because both are languages. Pizza and lasagna sit together in one corner, Python and Java sit together in another corner, cat and dog sit together in a third corner.\u003C/p>\n\u003Cp>\u003Cstrong>The distance between two vectors becomes a measure of semantic similarity.\u003C/strong> That’s the trick. You’ve converted text — which is symbolic, discrete, hard to compare — into geometry. And geometry, we know how to measure.\u003C/p>\n\u003Cp>In practice, generating an embedding looks like this:\u003C/p>\n\u003Cpre class=\"shiki shiki-themes github-dark github-light\" style=\"--shiki-dark:#e1e4e8;--shiki-light:#24292e;--shiki-dark-bg:#24292e;--shiki-light-bg:#fff\" tabindex=\"0\">\u003Ccode>\u003Cspan class=\"line\">\u003Cspan style=\"--shiki-dark:#F97583;--shiki-light:#D73A49\">from\u003C/span>\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\"> openai \u003C/span>\u003Cspan style=\"--shiki-dark:#F97583;--shiki-light:#D73A49\">import\u003C/span>\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\"> OpenAI\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\"> \u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\">client \u003C/span>\u003Cspan style=\"--shiki-dark:#F97583;--shiki-light:#D73A49\">=\u003C/span>\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\"> OpenAI()\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\">response \u003C/span>\u003Cspan style=\"--shiki-dark:#F97583;--shiki-light:#D73A49\">=\u003C/span>\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\"> client.embeddings.create(\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan 
style=\"--shiki-dark:#FFAB70;--shiki-light:#E36209\">    input\u003C/span>\u003Cspan style=\"--shiki-dark:#F97583;--shiki-light:#D73A49\">=\u003C/span>\u003Cspan style=\"--shiki-dark:#9ECBFF;--shiki-light:#032F62\">\"pizza margherita\"\u003C/span>\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\">,\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan style=\"--shiki-dark:#FFAB70;--shiki-light:#E36209\">    model\u003C/span>\u003Cspan style=\"--shiki-dark:#F97583;--shiki-light:#D73A49\">=\u003C/span>\u003Cspan style=\"--shiki-dark:#9ECBFF;--shiki-light:#032F62\">\"text-embedding-3-small\"\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\">)\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\">vector \u003C/span>\u003Cspan style=\"--shiki-dark:#F97583;--shiki-light:#D73A49\">=\u003C/span>\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\"> response.data[\u003C/span>\u003Cspan style=\"--shiki-dark:#79B8FF;--shiki-light:#005CC5\">0\u003C/span>\u003Cspan style=\"--shiki-dark:#E1E4E8;--shiki-light:#24292E\">].embedding  \u003C/span>\u003Cspan style=\"--shiki-dark:#6A737D;--shiki-light:#6A737D\"># list of 1536 floats\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003C/span>\u003C/code>\u003C/pre>\n\u003Cp>Text goes in, geometry comes out.\u003C/p>\n\u003Ch2>Algebra with meanings\u003C/h2>\n\u003Cp>To make this less abstract: because these vectors actually carry meaning, you can do math with them. Real math. 
The classic experiment, from Mikolov’s 2013 paper:\u003C/p>\n\u003Cpre class=\"shiki shiki-themes github-dark github-light\" style=\"--shiki-dark:#e1e4e8;--shiki-light:#24292e;--shiki-dark-bg:#24292e;--shiki-light-bg:#fff\" tabindex=\"0\">\u003Ccode>\u003Cspan class=\"line\">\u003Cspan>vector(\"king\") - vector(\"man\") + vector(\"woman\") ≈ vector(\"queen\")\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan>\u003C/span>\u003C/span>\u003C/code>\u003C/pre>\n\u003Cp>The model learned, without anyone explicitly teaching it, that there’s a masculinity-femininity axis in vector space. Another one:\u003C/p>\n\u003Cpre class=\"shiki shiki-themes github-dark github-light\" style=\"--shiki-dark:#e1e4e8;--shiki-light:#24292e;--shiki-dark-bg:#24292e;--shiki-light-bg:#fff\" tabindex=\"0\">\u003Ccode>\u003Cspan class=\"line\">\u003Cspan>vector(\"Paris\") - vector(\"France\") + vector(\"Italy\") ≈ vector(\"Rome\")\u003C/span>\u003C/span>\n\u003Cspan class=\"line\">\u003Cspan>\u003C/span>\u003C/span>\u003C/code>\u003C/pre>\n\u003Cp>The concept of “capital of a country” became a direction in space. You can subtract “France” to remove the “specific country” component, then add “Italy” to put it back. The result lands near the Italian capital.\u003C/p>\n\u003Cp>This isn’t pretty theoretical math. It’s literally what’s happening inside the model. That’s why semantic search works: the model learned structure about the world, and you’re doing geometry on top of that structure.\u003C/p>\n\u003Ch2>Where these numbers come from\u003C/h2>\n\u003Cp>Embedding models are neural networks trained on absurd amounts of text to learn these representations. The main families today:\u003C/p>\n\u003Cp>\u003Cstrong>Commercial APIs.\u003C/strong> OpenAI (\u003Ccode>text-embedding-3-small\u003C/code>, \u003Ccode>text-embedding-3-large\u003C/code>), Cohere (\u003Ccode>embed-v3\u003C/code>, strong on multilingual), Voyage AI. 
Expensive, but quality near the top of the leaderboard, no infra to maintain on your side.\u003C/p>\n\u003Cp>\u003Cstrong>State-of-the-art open-source.\u003C/strong> BGE (from BAAI), E5 (from Microsoft), GTE (from Alibaba). You run it on your GPU, zero API cost, zero vendor lock-in. BGE-M3 and BGE-large multilingual compete well with OpenAI in many benchmarks.\u003C/p>\n\u003Cp>\u003Cstrong>The classic base.\u003C/strong> Sentence-Transformers — the library that popularized all of this. Smaller, simpler models, great for prototyping.\u003C/p>\n\u003Cp>\u003Cstrong>How to choose?\u003C/strong> Go to \u003Ca href=\"https://huggingface.co/spaces/mteb/leaderboard\" target=\"_blank\" rel=\"noopener noreferrer\">MTEB\u003C/a> — the Massive Text Embedding Benchmark, the public reference leaderboard. Pick a model in your cost and size range, and \u003Cstrong>test it on your domain\u003C/strong>. A model that’s good at English may be bad at Portuguese. A model that’s good at short text may be bad at long documents. A model that’s good at general domain may be bad at legal, medical, or technical vocabulary. Always measure.\u003C/p>\n\u003Ch2>How to compare two vectors\u003C/h2>\n\u003Cp>You have the query vector, you have the document vectors. How do you compare them? Three options, in the order you’ll probably use them.\u003C/p>\n\u003Cp>\u003Cstrong>Cosine similarity\u003C/strong> measures the angle between vectors, ignoring magnitude. It ranges from -1 to 1 (in practice, with text, it falls between 0 and 1). It’s the default in the overwhelming majority of cases, because meaning lives in direction, not in the size of the vector.\u003C/p>\n\u003Cp>\u003Cstrong>Dot product\u003C/strong> is the scalar product. It cares about direction \u003Cem>and\u003C/em> magnitude. If your vectors are normalized — and most modern models return normalized vectors — dot product is mathematically equivalent to cosine and cheaper to compute. 
Use it for optimization.\u003C/p>\n\u003Cp>\u003Cstrong>Euclidean distance\u003C/strong> is the straight line between two points. It works, but it’s less common with text. It shows up more with image embeddings.\u003C/p>\n\u003Cp>Rule of thumb: start with cosine, switch to dot product when optimizing.\u003C/p>\n\u003Ch2>kNN, ANN, and the scale problem\u003C/h2>\n\u003Cp>How do you find the documents most similar to the query? The conceptual algorithm is \u003Cstrong>k-Nearest Neighbors (kNN)\u003C/strong>. Four steps: take the query, generate its embedding, measure the distance to every document in the catalog, then sort and return the top K.\u003C/p>\n\u003Cp>It works perfectly on a thousand documents. On a million, computing distance against every single one is O(n) per query, which gets too slow for real-time search.\u003C/p>\n\u003Cp>The solution is \u003Cstrong>ANN (Approximate Nearest Neighbors)\u003C/strong>. You give up a bit of precision to gain orders of magnitude in speed. Instead of comparing against everything, you compare against an intelligently chosen subset.\u003C/p>\n\u003Cp>The most popular algorithm today is \u003Cstrong>HNSW (Hierarchical Navigable Small World)\u003C/strong>. It works like a world map with multiple zoom levels: you start at the highest level, navigate quickly until you’re close to the answer, then descend into more detailed levels and refine. From O(n) you drop to roughly O(log n). It’s what Elasticsearch uses, what pgvector uses, what practically every vector database uses underneath.\u003C/p>\n\u003Cp>In production, recall of 95-98% is easy to hit with a well-configured HNSW, and latency lands in the milliseconds.\u003C/p>\n\u003Ch2>Where to run this\u003C/h2>\n\u003Cp>The ecosystem has exploded in the last few years. I split it into two families.\u003C/p>\n\u003Cp>\u003Cstrong>Dedicated vector databases\u003C/strong>: Pinecone (SaaS), Weaviate, Qdrant, Milvus, Chroma. They were born for this. 
They generally deliver better performance and more mature features for vector search. The downside is that it’s one more database to maintain.\u003C/p>\n\u003Cp>\u003Cstrong>Databases that added vector support\u003C/strong>: Elasticsearch and OpenSearch (already had mature text search, got vector); PostgreSQL with the pgvector extension; Redis; MongoDB. The advantage here is reusing a stack you already have. The downside is that features and performance sometimes aren’t as polished as in the dedicated ones.\u003C/p>\n\u003Cp>How to decide? If you already have Elasticsearch in production, \u003Cstrong>start by adding vector search to it\u003C/strong>. Don’t switch databases over a new feature. If you’re starting from zero and want simplicity, Qdrant and Weaviate are great. If you’re Postgres-first and your volume is moderate, pgvector handles it. There’s no single answer, only tradeoffs.\u003C/p>\n\u003Ch2>The movie example, now with semantics\u003C/h2>\n\u003Cp>Remember the query from the beginning? “movie about a guy stuck in the same day”. Before, with BM25, it returned \u003Cem>Stuck on You\u003C/em> and \u003Cem>The Day After Tomorrow\u003C/em> because of the literal words. Now, with vector search:\u003C/p>\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth>Position\u003C/th>\n\u003Cth>Movie\u003C/th>\n\u003Cth>Score\u003C/th>\n\u003C/tr>\n\u003C/thead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>1\u003C/td>\n\u003Ctd>\u003Cem>Groundhog Day\u003C/em>\u003C/td>\n\u003Ctd>0.89\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>2\u003C/td>\n\u003Ctd>\u003Cem>Edge of Tomorrow\u003C/em>\u003C/td>\n\u003Ctd>0.85\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>3\u003C/td>\n\u003Ctd>\u003Cem>Palm Springs\u003C/em>\u003C/td>\n\u003Ctd>0.81\u003C/td>\n\u003C/tr>\n\u003C/tbody>\n\u003C/table>\n\u003Cp>Notice the important detail: the synopsis of \u003Cem>Groundhog Day\u003C/em> says “a man relives the same day repeatedly”. The word “stuck” never appears. 
But the model understood that being in a time loop is a form of being stuck in time. \u003Cem>Palm Springs\u003C/em> is an indie film many people have never heard of — but the model knows it, because it trained on descriptions from the entire internet.\u003C/p>\n\u003Cp>Same query. Completely different results. \u003Cstrong>No synonym dictionary. No manual rules.\u003C/strong> The model just did geometry.\u003C/p>\n\u003Ch2>Where vector search also breaks\u003C/h2>\n\u003Cp>Before you walk away from here with stars in your eyes — vector search has serious blind spots too.\u003C/p>\n\u003Cp>\u003Cstrong>Exact identifiers.\u003C/strong> User searches for \u003Ccode>SKU-A4729\u003C/code>. Vector search will return things semantically similar, which is not what they want. For SKUs, product codes, IDs, order numbers — you need exact match, not similarity.\u003C/p>\n\u003Cp>\u003Cstrong>Negations.\u003C/strong> “Shoe without laces” might return shoes with laces, because the concept “laces” is strongly represented in the query vector. Modern models handle this better, but it remains fragile.\u003C/p>\n\u003Cp>\u003Cstrong>Very short or ambiguous queries.\u003C/strong> “java” — is it the language, the island, or the coffee? BM25 also suffers, but vector search doesn’t magically solve it.\u003C/p>\n\u003Cp>\u003Cstrong>Cost.\u003C/strong> Generating an embedding per document costs. Vector storage costs (1024 dimensions × 4 bytes × N documents). Reindexing when you switch models costs. The latency of generating an embedding for every query also costs.\u003C/p>\n\u003Cp>When you stack the blind spots — exactness, short queries, cost — it becomes clear: \u003Cstrong>replacing BM25 with vector search is trading one set of problems for another\u003C/strong>.\u003C/p>\n\u003Ch2>The answer is to combine, not replace\u003C/h2>\n\u003Cp>The good news is that these two worlds have complementary blind spots. 
BM25 is strong exactly where vector search is weak: exact match, SKUs, specific keywords. Vector search is strong exactly where BM25 is weak: meaning, intent, divergent vocabulary.\u003C/p>\n\u003Cp>When you combine them, what one misses the other catches.\u003C/p>\n\u003Cp>That’s what the next post is about: \u003Cstrong>hybrid search\u003C/strong>. How to run BM25 and vector search in parallel, how to fuse the rankings without falling into the obvious trap of weighting score against score, and why this is the architecture that delivers the best results in production from day zero, with no manual weight tuning.\u003C/p>\n\u003Cp>Until then, if you’ve never played with embeddings, \u003Cstrong>open a notebook\u003C/strong>. Grab the OpenAI API (or run BGE locally), embed a few dozen sentences from your domain, compute cosine between them, and see what clusters together. It’s the best way to internalize the idea: to see with your own eyes that the model actually understands.\u003C/p>\n\u003Chr>\n\u003Cp>\u003Cem>This is the first post in a series on modern search. Next: hybrid search with Reciprocal Rank Fusion. If you’re applying this in some context, drop me a message — I’d love to know where.\u003C/em>\u003C/p>\n",[54],{"id":5,"tenant_id":6,"author":55,"status":11,"published_at":12,"cover_image_url":13,"reading_time_minutes":14,"view_count":15,"like_count":16,"featured":17,"created_at":18,"updated_at":19,"translations":56,"tags":58},{"sub":8,"name":9,"email":10},[57],{"id":22,"post_id":5,"tenant_id":6,"lang":23,"slug":24,"title":25,"excerpt":26,"content_md":27,"created_at":18,"updated_at":18},[59,60,61,62],{"id":37,"tenant_id":6,"slug":38,"created_at":39},{"id":41,"tenant_id":6,"slug":42,"created_at":43},{"id":45,"tenant_id":6,"slug":46,"created_at":47},{"id":49,"tenant_id":6,"slug":50,"created_at":51}]