information-retrieval

1

How to deal with compound words in Elasticsearch

I know there is a good Compound Word Token Filter in Elasticsearch, but my problem is somewhat different. I am wondering how search engines like Google deal with open form compound words like "post o...

java elasticsearch search-engine information-retrieval

Gallium asked 18/11, 2017 at 8:43

10

How do I evaluate a text summarization tool? [closed]

I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey? In short,...

language-agnostic nlp information-retrieval evaluation

Lacto asked 26/3, 2012 at 20:26

2

What's the difference between a Vector Database and Full-Text Search?

I'm currently in the process of building an information search system for my personal documents, and I've been reading about both Vector Databases (in regards to stuff like LangChain), and full-tex...

javascript full-text-search information-retrieval vector-database

Klinger asked 7/8, 2023 at 18:46

3

How do I download and work with Wikipedia data dumps?

I want to count entities/categories in wiki dump of a particular language, say English. The official documentation is very tough to find/follow for a beginner. What I have understood till now is th...

wikipedia information-retrieval wikidata knowledge-graph

Catrinacatriona asked 22/7, 2020 at 13:33

1

Solved

Chromadb + Langchain with SentenceTransformerEmbeddingFunction throwing sqlite3 >= 3.35.0 error, despite sqlite3 3.43.0 being available

I have been trying to use Chromadb version 0.4.8 and Langchain version 0.0.276 with SentenceTransformerEmbeddingFunction, as shown in the snippet below. from langchain.vectorstores import Chroma from ...

sqlite information-retrieval langchain chromadb

Tachylyte asked 30/8, 2023 at 3:41

5

Why is log used when calculating term frequency weight and IDF, inverse document frequency?

The formula for IDF is log( N / df t ) instead of just N / df t. Where N = total documents in collection, and df t = document frequency of term t. Log is said to be used because it “dampens” the...

information-retrieval tf-idf

Leucotomy asked 21/11, 2014 at 18:33
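
As a rough illustration of the "dampening" mentioned in the question above, the sketch below compares the raw ratio N / df t with log( N / df t ). The collection size, the document frequencies, and the choice of base 10 are assumptions made for this example only; the excerpt does not specify a log base.

```python
import math


def idf(total_docs: int, doc_freq: int) -> float:
    """IDF as quoted in the question: log(N / df_t). Base 10 is an assumption."""
    return math.log10(total_docs / doc_freq)


# Hypothetical collection size and document frequencies, for illustration only.
N = 1_000_000
for df_t in (1, 100, 10_000, 1_000_000):
    raw_ratio = N / df_t
    print(f"df_t={df_t:>9}  N/df_t={raw_ratio:>9.0f}  idf={idf(N, df_t):.1f}")
```

The raw ratio spans six orders of magnitude across these terms, while the logged value stays within a small range, so a very rare term does not swamp every other term in the score.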

0

Langchain - Can't solve the dynamic filtering problem from vectorstore

I am using Langchain version 0.218, and was wondering if anyone has been able to filter a seeded vectorstore dynamically at runtime, for example when it is run by an Agent? My motive is to put this dynamic...

artificial-intelligence information-retrieval chaining large-language-model py-langchain

Mouthwatering asked 30/6, 2023 at 7:00

8

Solved

Wikipedia text download

I am looking to download full Wikipedia text for my college project. Do I have to write my own spider to download this or is there a public dataset of Wikipedia available online? To just give you ...

text wikipedia web-crawler information-retrieval

Stabilizer asked 21/4, 2010 at 13:56

2

Vector-based search in Solr

I am trying to implement dense vector based search in Solr (currently using version 8.5.2). My requirement is to store a dense vector representation for each document in Solr in a field called vec...

vector solr information-retrieval

Hypogeum asked 19/10, 2021 at 7:02

5

Solved

Fuzzy String Searching with Whoosh in Python

I've built up a large database of banks in MongoDB. I can easily take this information and create indexes with it in whoosh. For example I'd like to be able to match the bank names 'Eagle Bank &...

python information-retrieval fuzzy-search whoosh

Concepcion asked 15/7, 2011 at 15:55

3

How to combine word embedding vectors into one vector?

I know the meaning and methods of word embedding (skip-gram, CBOW) completely. I also know that Google has a word2vector API that can produce the vector for a given word. But my problem is this:...

nlp information-retrieval word2vec google-api-python-client word-embedding

Kwasi asked 27/6, 2017 at 17:12

2

Solved

How to specify two Fields in Lucene QueryParser?

I read "How to incorporate multiple fields in QueryParser?" but I didn't get it. At the moment I have a very strange construction like: parser = New QueryParser("bodytext", analyzer) parser2 = New ...

java parsing lucene lucene.net information-retrieval

Confessedly asked 5/1, 2010 at 9:30

2

Algorithm for search in inverted index

Consider there are 10 billion words that people have searched for on Google. Corresponding to each word you have the sorted list of all document IDs. The list looks like this: [Word 1]->[doc_i...

algorithm sorting set information-retrieval inverted-index

Ishtar asked 5/2, 2014 at 16:43

6

Solved

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity. I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of information retrie...

information-retrieval vsm cosine-similarity tf-idf

Philender asked 6/6, 2011 at 17:36

2

How to extract Highlighted Parts from PDF files

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.

pdf information-retrieval

Starchy asked 1/2, 2012 at 16:32

3

Solved

Understanding Recall and Precision

I am currently learning information retrieval and I am rather stuck on an example of recall and precision. A searcher uses a search engine to look for information. There are 10 documents on the f...

search-engine information-retrieval precision-recall

Lithosphere asked 28/1, 2014 at 18:00

2

Solved

Detect checkboxes from a form using OpenCV Python

Given a dental form as input, I need to find all the checkboxes present in the form using image processing. I have posted my current approach as an answer below. Is there any better approach to find the checkbo...

python image-processing information-retrieval opencv

Climax asked 8/7, 2020 at 18:11

3

Solved

Getting total term frequency throughout entire index (Elasticsearch)

I am trying to calculate the total number of times a particular term occurs throughout an entire index (term collection frequency). I have attempted to do so through the use of term vectors, howeve...

elasticsearch information-retrieval

Standin asked 18/1, 2017 at 4:22

3

Solved

Fast/optimized N-gram implementations in Python

Which n-gram implementation is fastest in Python? I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/): from nltk.util impor...

python nlp nltk information-retrieval n-gram

Tour asked 19/2, 2014 at 14:16

3

How to clear the cache in Solr?

I'm trying to compare the performance of different Solr queries. In order to get a fair test, I want to clear the cache between queries. How is this done? Of course, one can restart the server, I ...

caching solr lucene information-retrieval

Waggish asked 1/2, 2012 at 14:25

2

Solved

Is it possible to query Elasticsearch with a feature vector?

I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of...

elasticsearch information-retrieval feature-extraction

Largeminded asked 13/5, 2015 at 23:26

2

Solved

How to select stop words using tf-idf? (non-English corpus)

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document me...

information-retrieval text-mining stop-words tf-idf

Abruzzi asked 4/6, 2013 at 21:08

3

Solved

Get image height and width of image stored on Amazon S3

I plan to store images on Amazon S3. How can I retrieve the following from Amazon S3: file size, image height, image width?

image amazon-s3 information-retrieval

Deandre asked 27/5, 2012 at 9:28

1

Solved

Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading raw text of a tiny subset, 10's of megs tops, of the AWS Common Crawl, as a corpus for information retrieval tests. The Common Crawl pages suggest I need an S3 accou...

dataset information-retrieval corpus common-crawl

Directrix asked 19/4, 2019 at 13:02

1

Solved

MAP@k computation

Mean average precision computed at k (for top-k elements in the answer), according to wiki, ml metrics at kaggle, and this answer: Confusion about (Mean) Average Precision should be computed as mea...

python matlab information-retrieval precision-recall average-precision

Grieco asked 3/3, 2019 at 6:51
