Passing multiple sentences to BERT?

I have a dataset with paragraphs that I need to classify into two classes. These paragraphs are usually 3-5 sentences long. The overwhelming majority of them are less than 500 words long. I would like to make use of BERT to tackle this problem.

I am wondering how I should use BERT to generate vector representations of these paragraphs and especially, whether it is fine to just pass the whole paragraph into BERT?

There have been informative discussions of related problems here and here. Those discussions focus on how to use BERT to represent whole documents. In my case the paragraphs are not that long and could indeed be passed to BERT without exceeding its maximum length of 512 tokens. However, BERT was trained on sentences, and sentences are relatively self-contained units of meaning. I wonder whether feeding multiple sentences into BERT conflicts fundamentally with what the model was designed to do (although this appears to be done routinely).

Underbelly answered 17/11, 2020 at 18:50 Comment(0)

I think your question is based on a misconception. Even though the BERT paper uses the term sentence quite often, it is not referring to a linguistic sentence. The paper defines a sentence as

an arbitrary span of contiguous text, rather than an actual linguistic sentence.

It is therefore completely fine to pass whole paragraphs to BERT; this is precisely why the model can handle them.
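In practice, a paragraph is encoded exactly like any other span of text. A minimal sketch with the Hugging Face `transformers` library (assuming `bert-base-uncased` is available; mean-pooling the last hidden state into a single paragraph vector is one common choice, not the only one):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

paragraph = (
    "The model was trained overnight. Accuracy improved by three points. "
    "We then evaluated it on the held-out test set."
)

# truncation=True guards against paragraphs exceeding the 512-token limit
inputs = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one 768-dimensional paragraph vector
paragraph_vector = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(paragraph_vector.shape)  # torch.Size([768])
```

For classification, this vector (or the `[CLS]` token's hidden state) can be fed to a small classifier head; fine-tuning the whole model end to end usually works better than using frozen embeddings.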

Chemisorb answered 17/11, 2020 at 22:44 Comment(3)
Then how should we separate the sentences? With the [SEP] token? — Ridgley
@Minions, that's what's confusing me too. — Camm
Usually (!), the [SEP] token is used to give the model a hint that the text that follows has a different role than the text before. A common example is extractive question answering, where the input is [BOS] question [SEP] paragraph [EOS]. The question should be treated semantically differently than the paragraph. Entity extraction, on the other hand, is usually performed without the [SEP] token. — Chemisorb
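The segment behavior described in the comments can be checked directly: the tokenizer inserts [SEP] itself, so sentences within one paragraph need no manual separation, while a text pair (e.g. question plus paragraph) gets a [SEP] between the two segments. A sketch, again assuming `transformers` and `bert-base-uncased`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single segment: one [CLS] ... [SEP]; the sentences inside are not split
single = tokenizer("First sentence. Second sentence.")
single_tokens = tokenizer.convert_ids_to_tokens(single["input_ids"])
print(single_tokens)

# Two segments (question + paragraph): the tokenizer inserts [SEP] between them
pair = tokenizer("What improved?", "Accuracy improved by three points.")
pair_tokens = tokenizer.convert_ids_to_tokens(pair["input_ids"])
print(pair_tokens)
```

So for the paragraph-classification case in the question, the whole paragraph goes in as a single segment and no explicit [SEP] handling is required.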

© 2022 - 2024 — McMap. All rights reserved.