information-retrieval Questions
1
I know there is a good Compound Word Token Filter in Elasticsearch, but my problem is somewhat different. I am wondering how search engines like Google deal with open-form compound words like "post o...
Gallium asked 18/11, 2017 at 8:43
10
I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey?
In short,...
Lacto asked 26/3, 2012 at 20:26
2
I'm currently building an information search system for my personal documents, and I've been reading about both vector databases (with regard to tools like LangChain) and full-tex...
Klinger asked 7/8, 2023 at 18:46
3
I want to count entities/categories in wiki dump of a particular language, say English. The official documentation is very tough to find/follow for a beginner. What I have understood till now is th...
Catrinacatriona asked 22/7, 2020 at 13:33
1
Solved
I have been trying to use
Chromadb version 0.4.8
Langchain version 0.0.276
with SentenceTransformerEmbeddingFunction as shown in the snippet below.
from langchain.vectorstores import Chroma
from ...
Tachylyte asked 30/8, 2023 at 3:41
5
The formula for IDF is log(N / df_t) instead of just N / df_t,
where N = total number of documents in the collection, and df_t = document frequency of term t.
Log is said to be used because it “dampens” the...
Leucotomy asked 21/11, 2014 at 18:33
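A minimal sketch of the dampening effect the question refers to, assuming the natural logarithm and a hypothetical collection of one million documents (the function name `idf` and the sample document frequencies are illustrative, not from the question). It prints the raw ratio N / df_t next to log(N / df_t) so the two scales can be compared:

```python
import math


def idf(total_docs: int, doc_freq: int) -> float:
    """Inverse document frequency log(N / df_t), using the natural log."""
    if doc_freq <= 0 or doc_freq > total_docs:
        raise ValueError("doc_freq must be in the range 1..total_docs")
    return math.log(total_docs / doc_freq)


if __name__ == "__main__":
    N = 1_000_000  # hypothetical collection size
    for df_t in (1, 100, 10_000, 1_000_000):
        raw_ratio = N / df_t
        print(f"df_t={df_t:>9}  N/df_t={raw_ratio:>11.1f}  idf={idf(N, df_t):.2f}")
```

The raw ratio spans six orders of magnitude across these terms, while the logged value stays within a small range, so a very rare term does not swamp every other term in a score.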
0
I am using Langchain version 0.218 and was wondering whether anyone has been able to filter a seeded vectorstore dynamically at runtime, for example when it is run by an Agent.
My motive is to put this dynamic...
Mouthwatering asked 30/6, 2023 at 7:0
8
Solved
I am looking to download full Wikipedia text for my college project. Do I have to write my own spider to download this or is there a public dataset of Wikipedia available online?
To just give you ...
Stabilizer asked 21/4, 2010 at 13:56
2
I am trying to implement dense-vector-based search in Solr (currently using version 8.5.2). My requirement is
to store a dense vector representation for each document in Solr in a field called vec...
Hypogeum asked 19/10, 2021 at 7:2
5
Solved
I've built up a large database of banks in MongoDB. I can easily take this information and create indexes with it in whoosh. For example I'd like to be able to match the bank names 'Eagle Bank &...
Concepcion asked 15/7, 2011 at 15:55
3
I fully understand the meaning and methods of word embedding (skip-gram, CBOW), and I know that Google has a word2vec API that takes a word and produces its vector.
But my problem is this:...
Kwasi asked 27/6, 2017 at 17:12
2
Solved
I read "How to incorporate multiple fields in QueryParser?" but I didn't get it.
At the moment I have a very strange construction like:
parser = New QueryParser("bodytext", analyzer)
parser2 = New ...
Confessedly asked 5/1, 2010 at 9:30
2
Consider there are 10 billion words that people have searched for in Google. Corresponding
to each word you have the sorted list of all document IDs. The list looks like this:
[Word 1]->[doc_i...
Ishtar asked 5/2, 2014 at 16:43
6
Solved
I am confused by the following comment about TF-IDF and Cosine Similarity.
I was reading up on both, and then on the wiki under Cosine Similarity I found this sentence: "In case of information retrie...
Philender asked 6/6, 2011 at 17:36
2
Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
Starchy asked 1/2, 2012 at 16:32
3
Solved
I am currently learning information retrieval and I am rather stuck on an example of recall and precision.
A searcher uses a search engine to look for information. There are 10 documents on the f...
Lithosphere asked 28/1, 2014 at 18:0
2
Solved
Given a dental form as input, I need to find all the checkboxes present in the form using image processing. I have answered with my current approach below. Is there any better approach to find the checkbo...
Climax asked 8/7, 2020 at 18:11
3
Solved
I am trying to calculate the total number of times a particular term occurs throughout an entire index (term collection frequency). I have attempted to do so through the use of term vectors, howeve...
Standin asked 18/1, 2017 at 4:22
3
Solved
Which ngram implementation is fastest in python?
I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):
from nltk.util impor...
Tour asked 19/2, 2014 at 14:16
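For readers unfamiliar with the zip-based approach mentioned in the question, here is a minimal sketch of the general technique. The function name `ngrams_zip` and the sample sentence are illustrative assumptions, and this is not necessarily the exact code from the linked post or from the question's truncated snippet. It builds n shifted views of the token list and zips them together:

```python
def ngrams_zip(tokens, n):
    """Return all n-grams of `tokens` as tuples, using shifted slices and zip."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped position by position;
    # zip stops at the shortest slice, which yields len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    words = "the quick brown fox".split()
    print(ngrams_zip(words, 2))
```

Each slice copies the list, so memory use grows with n; whether this is faster than a library implementation depends on the input size and Python version, and is best checked with `timeit` on real data.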
3
I'm trying to compare the performance of different Solr queries. In order to get a fair test, I want to clear the cache between queries.
How is this done? Of course, one can restart the server, I ...
Waggish asked 1/2, 2012 at 14:25
2
Solved
I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of...
Largeminded asked 13/5, 2015 at 23:26
2
Solved
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document me...
Abruzzi asked 4/6, 2013 at 21:8
3
Solved
I plan to store images on Amazon S3. How can I retrieve the following from Amazon S3:
file size
image height
image width
Deandre asked 27/5, 2012 at 9:28
1
Solved
I'm interested in downloading raw text of a tiny subset, 10's of megs tops, of the AWS Common Crawl, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 accou...
Directrix asked 19/4, 2019 at 13:2
1
Solved
Mean average precision computed at k (for top-k elements in the answer), according to wiki, ml metrics at kaggle, and this answer: Confusion about (Mean) Average Precision should be computed as mea...
Grieco asked 3/3, 2019 at 6:51