information-retrieval Questions

1

I know there is a good Compound Word Token Filter in Elasticsearch, but my problem is somewhat different. I am wondering how search engines like Google deal with open-form compound words like "post o...
Gallium asked 18/11, 2017 at 8:43

10

I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey? In short,...
Lacto asked 26/3, 2012 at 20:26

2

I'm currently in the process of building an information search system for my personal documents, and I've been reading about both Vector Databases (in regards to stuff like LangChain), and full-tex...

3

I want to count entities/categories in the wiki dump of a particular language, say English. The official documentation is very hard for a beginner to find and follow. What I have understood till now is th...
Catrinacatriona asked 22/7, 2020 at 13:33

1

Solved

I have been trying to use Chromadb version 0.4.8 and Langchain version 0.0.276 with SentenceTransformerEmbeddingFunction, as shown in the snippet below. from langchain.vectorstores import Chroma from ...
Tachylyte asked 30/8, 2023 at 3:41

5

The formula for IDF is log(N / df_t) instead of just N / df_t, where N = total documents in the collection and df_t = document frequency of term t. Log is said to be used because it “dampens” the...
Leucotomy asked 21/11, 2014 at 18:33
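
A minimal sketch of the dampening effect the question above refers to, comparing the raw ratio N / df_t with log(N / df_t). The collection size and the document frequencies are made-up illustrative values, not taken from the question, and the natural log is an arbitrary choice of base.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency, log(N / df_t), using the natural log."""
    return math.log(n_docs / doc_freq)

N = 1_000_000  # hypothetical collection size (assumption for illustration)

# Print the raw ratio next to its logarithm for a few document frequencies.
for df in (1, 100, 10_000, 1_000_000):
    raw = N / df
    print(f"df_t={df:>9}  N/df_t={raw:>11.1f}  log(N/df_t)={idf(N, df):6.2f}")
```

The raw ratio spans six orders of magnitude across these rows, while the logged value stays within a small range, so a very rare term does not swamp every other term in a score.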

0

I am using Langchain version 0.218 and was wondering if anyone has been able to filter a seeded vectorstore dynamically at runtime, such as when it is run by an Agent? My motive is to put this dynamic...

8

Solved

I am looking to download full Wikipedia text for my college project. Do I have to write my own spider to download this or is there a public dataset of Wikipedia available online? To just give you ...
Stabilizer asked 21/4, 2010 at 13:56

2

I am trying to implement dense vector based search in solr (currently using version 8.5.2). My requirement is to store a dense vector representation for each document in solr in a field called vec...
Hypogeum asked 19/10, 2021 at 7:02

5

Solved

I've built up a large database of banks in MongoDB. I can easily take this information and create indexes with it in whoosh. For example, I'd like to be able to match the bank names 'Eagle Bank &...
Concepcion asked 15/7, 2011 at 15:55

3

I fully understand the meaning and methods of word embedding (skip-gram, CBOW). I also know that Google has a word2vector API that produces a vector for a given word. But my problem is this:...

2

Solved

I read "How to incorporate multiple fields in QueryParser?" but I didn't get it. At the moment I have a very strange construction like: parser = New QueryParser("bodytext", analyzer) parser2 = New ...
Confessedly asked 5/1, 2010 at 9:30

2

Consider that there are 10 billion words that people have searched for in Google. Corresponding to each word, you have the sorted list of all document IDs. The list looks like this: [Word 1]->[doc_i...

6

Solved

I am confused by the following comment about TF-IDF and Cosine Similarity. I was reading up on both, and then on wiki under Cosine Similarity I found this sentence: "In case of information retrie...
Philender asked 6/6, 2011 at 17:36

2

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
Starchy asked 1/2, 2012 at 16:32

3

Solved

I am currently learning information retrieval and I am rather stuck on an example of recall and precision. A searcher uses a search engine to look for information. There are 10 documents on the f...
Lithosphere asked 28/1, 2014 at 18:00

2

Solved

Given a dental form as input, I need to find all the checkboxes present in the form using image processing. I have posted my current approach as an answer below. Is there any better approach to find the checkbo...
Climax asked 8/7, 2020 at 18:11

3

Solved

I am trying to calculate the total number of times a particular term occurs throughout an entire index (term collection frequency). I have attempted to do so through the use of term vectors, howeve...
Standin asked 18/1, 2017 at 4:22

3

Solved

Which n-gram implementation is fastest in Python? I've tried to profile NLTK's vs. Scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/): from nltk.util impor...
Tour asked 19/2, 2014 at 14:16
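
A short sketch of the zip-over-shifted-slices idiom for n-gram generation that the question above appears to be comparing against NLTK. Whether it matches the code in the linked post exactly is an assumption, and `zip_ngrams` is a hypothetical name chosen for illustration.

```python
def zip_ngrams(tokens, n):
    """Return all n-grams of `tokens` as tuples.

    Builds n views of the list, each shifted one position further,
    and zips them; zip stops at the shortest view, which yields
    exactly len(tokens) - n + 1 tuples.
    """
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox jumps".split()
bigrams = zip_ngrams(tokens, 2)
trigrams = zip_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Each slice copies the list, so this trades memory for simplicity; for a timing comparison, the standard library's `timeit` module can be wrapped around both implementations on the same input.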

3

I'm trying to compare the performance of different Solr queries. In order to get a fair test, I want to clear the cache between queries. How is this done? Of course, one can restart the server, I ...
Waggish asked 1/2, 2012 at 14:25

2

Solved

I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of...
Largeminded asked 13/5, 2015 at 23:26

2

Solved

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document me...
Abruzzi asked 4/6, 2013 at 21:08

3

Solved

I plan to store images on Amazon S3. How can I retrieve the file size, image height, and image width from Amazon S3?
Deandre asked 27/5, 2012 at 9:28

1

Solved

I'm interested in downloading raw text of a tiny subset, 10's of megs tops, of the AWS Common Crawl, as a corpus for information retrieval tests. The Common Crawl pages suggest I need an S3 accou...
Directrix asked 19/4, 2019 at 13:02

1

Solved

Mean average precision computed at k (for top-k elements in the answer), according to wiki, ml metrics at kaggle, and this answer: Confusion about (Mean) Average Precision should be computed as mea...
