graphchem.preprocessing.Tokenizer

Bases: object

A simple tokenizer that assigns a unique integer to each token (word) in the input data. If the tokenizer is in training mode, it will add new tokens to the vocabulary. Otherwise, it will return the integer corresponding to 'unk' for unknown tokens.

Attributes

_data : dict
    A dictionary mapping each token to a unique integer. Initialized with {"unk": 1}.
num_classes : int
    The number of unique classes (tokens) in the vocabulary, including 'unk'.
train : bool
    A flag indicating whether the tokenizer is in training mode.
unknown : list
    A list to store tokens that were encountered during inference but are not in the vocabulary.

Source code in graphchem/preprocessing/features.py

class Tokenizer(object):
    """
    A simple tokenizer that assigns a unique integer to each token (word) in
    the input data. If the tokenizer is in training mode, it will add new
    tokens to the vocabulary. Otherwise, it will return the integer
    corresponding to 'unk' for unknown tokens.

    Attributes
    ----------
    _data : dict
        A dictionary mapping each token to a unique integer. Initialized with
        {"unk": 1}.
    num_classes : int
        The number of unique classes (tokens) in the vocabulary, including
        'unk'.
    train : bool
        A flag indicating whether the tokenizer is in training mode.
    unknown : list
        A list to store tokens that were encountered during inference but are
        not in the vocabulary.
    """

    def __init__(self):
        """
        Initialize the Tokenizer with default values.
        """
        self._data = {"unk": 1}
        self.num_classes = 1
        self.train = True
        self.unknown = []

    def __call__(self, item: str) -> int:
        """
        Tokenizes a given string by returning its corresponding integer from
        the vocabulary.

        Parameters
        ----------
        item : str
            The token (word) to be tokenized.

        Returns
        -------
        int
            The unique integer assigned to the token. If the token is not in
            the vocabulary and the tokenizer is in training mode, it will add
            the token and return its corresponding integer. Otherwise, it
            returns 1, which corresponds to 'unk'.
        """
        try:
            return self._data[item]
        except KeyError:
            if self.train:
                self.num_classes += 1
                self._data[item] = self.num_classes
                return self(item)
            else:
                self.unknown.append(item)
                return 1

    @property
    def vocab_size(self) -> int:
        """
        Returns the size of the vocabulary, which is the number of unique
        tokens plus one.

        Returns
        -------
        int
            The total number of classes (tokens) in the vocabulary plus one.
        """
        return self.num_classes + 1

`vocab_size` `property`

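Returns the size of the vocabulary, which is `num_classes` (the number of unique tokens, including 'unk') plus one. Because token integers start at 1, this equals the largest assigned integer plus one.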

Returns

int
    The total number of classes (tokens) in the vocabulary plus one.
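The source does not say why one is added. One plausible reading, offered here as an assumption, is that `vocab_size` is meant to size a lookup table indexed directly by token integer: since integers start at 1, the table needs one extra row so the largest integer stays in range. The sketch below uses a hypothetical vocabulary dictionary shaped like `_data` and a plain Python list standing in for such a table.

```python
# Hypothetical vocabulary, shaped like Tokenizer._data after seeing "C" and "O".
data = {"unk": 1, "C": 2, "O": 3}

num_classes = len(data)       # one class per entry, 'unk' included
vocab_size = num_classes + 1  # same arithmetic as the vocab_size property

# Assumed use: a table with one row per possible index 0..vocab_size - 1.
# Row 0 is never produced by the tokenizer, because integers start at 1.
table = [[0.0, 0.0] for _ in range(vocab_size)]

largest_id = max(data.values())
row = table[largest_id]  # in range only because of the extra row
```

With a table of only `num_classes` rows, indexing by the largest token integer would raise an `IndexError`.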

`__call__(item)`

Tokenizes a given string by returning its corresponding integer from the vocabulary.

Parameters

item : str
    The token (word) to be tokenized.

Returns

int
    The unique integer assigned to the token. If the token is not in the vocabulary and the tokenizer is in training mode, it will add the token and return its corresponding integer. Otherwise, it returns 1, which corresponds to 'unk'.

Source code in graphchem/preprocessing/features.py

def __call__(self, item: str) -> int:
    """
    Tokenizes a given string by returning its corresponding integer from
    the vocabulary.

    Parameters
    ----------
    item : str
        The token (word) to be tokenized.

    Returns
    -------
    int
        The unique integer assigned to the token. If the token is not in
        the vocabulary and the tokenizer is in training mode, it will add
        the token and return its corresponding integer. Otherwise, it
        returns 1, which corresponds to 'unk'.
    """
    try:
        return self._data[item]
    except KeyError:
        if self.train:
            self.num_classes += 1
            self._data[item] = self.num_classes
            return self(item)
        else:
            self.unknown.append(item)
            return 1
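The two modes described above can be seen in a short usage sketch. So that it runs without graphchem installed, the block carries a condensed copy of the class from the listing, with docstrings removed. The token strings are made-up examples, not taken from the library.

```python
class Tokenizer(object):
    # Condensed copy of the class shown in the source listing.
    def __init__(self):
        self._data = {"unk": 1}
        self.num_classes = 1
        self.train = True
        self.unknown = []

    def __call__(self, item: str) -> int:
        try:
            return self._data[item]
        except KeyError:
            if self.train:
                self.num_classes += 1
                self._data[item] = self.num_classes
                return self(item)
            else:
                self.unknown.append(item)
                return 1

    @property
    def vocab_size(self) -> int:
        return self.num_classes + 1


tokenizer = Tokenizer()

# Training mode (the default): unseen tokens are added, seen tokens are reused.
ids_train = [tokenizer(t) for t in ["C", "O", "C", "N"]]

# Inference mode: unseen tokens map to 1 ('unk') and are recorded in `unknown`.
tokenizer.train = False
ids_infer = [tokenizer(t) for t in ["C", "S", "unk"]]

print(ids_train)          # [2, 3, 2, 4]
print(ids_infer)          # [2, 1, 1]
print(tokenizer.unknown)  # ['S']
```

Note that the literal string "unk" is itself a vocabulary key, so passing it in returns 1 without adding anything to `unknown`.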

`__init__()`

Initialize the Tokenizer with default values.

Source code in graphchem/preprocessing/features.py

def __init__(self):
    """
    Initialize the Tokenizer with default values.
    """
    self._data = {"unk": 1}
    self.num_classes = 1
    self.train = True
    self.unknown = []
