Understanding Bloodhound.tokenizers.obj.whitespace
All, I was trying to apply Twitter typeahead and Bloodhound to my project based on a working sample, but I can't understand the code below.

datumTokenizer: Bloodhound.tokenizers.obj.whitespace('songs'),
queryTokenizer: Bloodhound.tokenizers.whitespace,

The original code looks like below.

var songlist = new Bloodhound({
    datumTokenizer: Bloodhound.tokenizers.obj.whitespace('songs'),
    queryTokenizer: Bloodhound.tokenizers.whitespace,
    limit: 10,
    remote: '/api/demo/GetSongs?searchTterm=%QUERY'
});

The official documentation just says:

datumTokenizer – A function with the signature (datum) that transforms a datum into an array of string tokens. Required.

queryTokenizer – A function with the signature (query) that transforms a query into an array of string tokens. Required.

What does this mean? Could someone please tell me more about it so that I have a better understanding?

Cackle answered 28/10, 2015 at 2:16 Comment(1)
These really are under-documented. My impression is that when a user makes a query, say, "Dog cat", the whitespace queryTokenizer splits that on whitespace, resulting in an array like ["Dog", "cat"]. Then, when results arrive, the datumTokenizer splits those as well. So, if you have a result with a song name of "Dogs and cats rock out", that'll get split into an array as well. Finally, Bloodhound compares the two arrays, and if the entirety of the query array is in the datum array, it considers it a match. I'm about 80% sure on this. Splanchnic
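A rough, self-contained sketch of the tokenize-then-match behavior described in this comment (this is an assumption about Bloodhound's internals, not its actual code; the real library builds a prefix-based search index):

```javascript
// Hypothetical reimplementation of the idea above: tokenize both sides,
// then require every query token to match some datum token.
function tokenize(str) {
  return String(str).trim().toLowerCase().split(/\s+/);
}

// A datum matches when every query token is a prefix of some datum token.
function matches(query, datumText) {
  var datumTokens = tokenize(datumText);
  return tokenize(query).every(function (q) {
    return datumTokens.some(function (d) { return d.indexOf(q) === 0; });
  });
}

console.log(matches('Dog cat', 'Dogs and cats rock out'));  // true
console.log(matches('Dog bird', 'Dogs and cats rock out')); // false
```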
I found some helpful information here:

https://github.com/twitter/typeahead.js/blob/master/doc/migration/0.10.0.md#tokenization-methods-must-be-provided

The most common tokenization methods split a given string on whitespace or non-word characters. Bloodhound provides implementations for those methods out of the box:

  // returns ['one', 'two', 'twenty-five']
  Bloodhound.tokenizers.whitespace('  one two  twenty-five');

  // returns ['one', 'two', 'twenty', 'five']
  Bloodhound.tokenizers.nonword('  one two  twenty-five');
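For reference, the two built-in tokenizers amount to little more than a regex split. Here is a plain-JS approximation (an assumption based on the documented behavior above, not the library source):

```javascript
// Approximation of Bloodhound.tokenizers.whitespace: split on runs of
// whitespace, after trimming so leading/trailing spaces don't yield
// empty tokens.
function whitespaceTokenizer(str) {
  str = String(str).trim();
  return str ? str.split(/\s+/) : [];
}

// Approximation of Bloodhound.tokenizers.nonword: split on runs of
// non-word characters, so hyphens also act as separators.
function nonwordTokenizer(str) {
  str = String(str).trim();
  return str ? str.split(/\W+/) : [];
}

console.log(whitespaceTokenizer('  one two  twenty-five'));
// ['one', 'two', 'twenty-five']
console.log(nonwordTokenizer('  one two  twenty-five'));
// ['one', 'two', 'twenty', 'five']
```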

For query tokenization, you'll probably want to use one of the above methods. Datum tokenization is where you may want to do something a bit more advanced.

For datums, sometimes you want tokens to be derived from more than one property. For example, if you were building a search engine for GitHub repositories, it'd probably be wise to have tokens derived from the repo's name, owner, and primary language:

  var repos = [
    { name: 'example', owner: 'John Doe', language: 'JavaScript' },
    { name: 'another example', owner: 'Joe Doe', language: 'Scala' }
  ];

  function customTokenizer(datum) {
    var nameTokens = Bloodhound.tokenizers.whitespace(datum.name);
    var ownerTokens = Bloodhound.tokenizers.whitespace(datum.owner);
    var languageTokens = Bloodhound.tokenizers.whitespace(datum.language);
    
    return nameTokens.concat(ownerTokens).concat(languageTokens);
  }
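To see what that custom tokenizer actually returns, here is the same idea in self-contained form, with a simple stand-in for Bloodhound.tokenizers.whitespace (the stand-in is an assumption; the real tokenizer lives in the library):

```javascript
// Stand-in for Bloodhound.tokenizers.whitespace (assumed to be a
// trim-then-split on whitespace).
var whitespace = function (str) {
  str = String(str).trim();
  return str ? str.split(/\s+/) : [];
};

// Derive tokens from several properties and merge them into one array,
// as in the customTokenizer above.
function customTokenizer(datum) {
  return whitespace(datum.name)
    .concat(whitespace(datum.owner))
    .concat(whitespace(datum.language));
}

console.log(customTokenizer({
  name: 'example', owner: 'John Doe', language: 'JavaScript'
}));
// ['example', 'John', 'Doe', 'JavaScript']
```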

There may also be the scenario where you want datum tokenization to be performed on the backend. The best way to do that is to just add a property to your datums that contains those tokens. You can then provide a tokenizer that just returns the already existing tokens:

  var sports = [
    { value: 'football', tokens: ['football', 'pigskin'] },
    { value: 'basketball', tokens: ['basketball', 'bball'] }
  ];

  function customTokenizer(datum) { return datum.tokens; }

There are plenty of other ways you could go about tokenizing datums; it really just depends on what you are trying to accomplish.

It seems unfortunate that this information wasn't easier to find from the main documentation.

Spire answered 20/4, 2017 at 18:16 Comment(0)
It is the tokenizer used to split the data or the query into an array of words in order to perform search/matching. datumTokenizer refers to your data, and queryTokenizer refers to the query being made (usually the text typed into the input).

If your data is an array of objects (i.e. JSON), the datumTokenizer lets you specify which field(s) of your objects to search on. For example, if you want to search on the name and code fields, you can write something like Bloodhound.tokenizers.obj.whitespace(['name','code']) or provide a custom function.
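As a sketch of what obj.whitespace does with one or more field names (an approximation of the documented behavior, not the library source):

```javascript
// Approximation of Bloodhound.tokenizers.obj.whitespace: given one or
// more property names, return a datumTokenizer that whitespace-splits
// the value of each named property and merges the resulting tokens.
function objWhitespace(keys) {
  keys = Array.isArray(keys) ? keys : [keys];
  return function (datum) {
    return keys.reduce(function (tokens, key) {
      var value = String(datum[key] || '').trim();
      return value ? tokens.concat(value.split(/\s+/)) : tokens;
    }, []);
  };
}

// Hypothetical datum with name and code fields, as in the example above.
var tokenize = objWhitespace(['name', 'code']);
console.log(tokenize({ name: 'Blue Suede Shoes', code: 'BSS-1' }));
// ['Blue', 'Suede', 'Shoes', 'BSS-1']
```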

You can find more information at: https://github.com/twitter/typeahead.js/blob/master/doc/migration/0.10.0.md#tokenization-methods-must-be-provided

Parcel answered 13/4, 2016 at 14:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.