
graphchem.preprocessing.Tokenizer

Bases: object

A simple tokenizer that assigns a unique integer to each token (word) in the input data. If the tokenizer is in training mode, it will add new tokens to the vocabulary. Otherwise, it will return the integer corresponding to 'unk' for unknown tokens.

Attributes

_data : dict
    A dictionary mapping each token to a unique integer. Initialized with {"unk": 1}.
num_classes : int
    The number of unique classes (tokens) in the vocabulary, including 'unk'.
train : bool
    A flag indicating whether the tokenizer is in training mode.
unknown : list
    A list to store tokens that were encountered during inference but are not in the vocabulary.
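The two modes described above can be sketched as follows. This is a minimal illustration rather than the library's documented usage: the `Tokenizer` class here is a condensed copy of the source listing (the `vocab_size` property is omitted) so that the snippet runs without graphchem installed, and the atom symbols are made-up example tokens.

```python
class Tokenizer:
    """Condensed stand-in for graphchem.preprocessing.Tokenizer (assumption:
    same lookup logic as the source listing, vocab_size omitted)."""

    def __init__(self):
        self._data = {"unk": 1}
        self.num_classes = 1
        self.train = True
        self.unknown = []

    def __call__(self, item: str) -> int:
        try:
            return self._data[item]
        except KeyError:
            if self.train:
                self.num_classes += 1
                self._data[item] = self.num_classes
                return self.num_classes
            self.unknown.append(item)
            return 1


tokenizer = Tokenizer()

# Training mode (the default): unseen tokens are added to the vocabulary.
train_ids = [tokenizer(token) for token in ["C", "N", "O", "C"]]
print(train_ids)  # [2, 3, 4, 2]

# Clearing the flag switches to inference mode: the vocabulary is frozen.
tokenizer.train = False
test_ids = [tokenizer(token) for token in ["C", "S", "O"]]
print(test_ids)           # [2, 1, 4]
print(tokenizer.unknown)  # ['S']
```

Note that the mode is controlled only by the `train` attribute; the class shown here has no separate method for switching modes.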

Source code in graphchem/preprocessing/features.py
class Tokenizer(object):
    """
    A simple tokenizer that assigns a unique integer to each token (word) in
    the input data. If the tokenizer is in training mode, it will add new
    tokens to the vocabulary. Otherwise, it will return the integer
    corresponding to 'unk' for unknown tokens.

    Attributes
    ----------
    _data : dict
        A dictionary mapping each token to a unique integer. Initialized with
        {"unk": 1}.
    num_classes : int
        The number of unique classes (tokens) in the vocabulary, including
        'unk'.
    train : bool
        A flag indicating whether the tokenizer is in training mode.
    unknown : list
        A list to store tokens that were encountered during inference but are
        not in the vocabulary.
    """

    def __init__(self):
        """
        Initialize the Tokenizer with default values.
        """
        self._data = {"unk": 1}
        self.num_classes = 1
        self.train = True
        self.unknown = []

    def __call__(self, item: str) -> int:
        """
        Tokenizes a given string by returning its corresponding integer from
        the vocabulary.

        Parameters
        ----------
        item : str
            The token (word) to be tokenized.

        Returns
        -------
        int
            The unique integer assigned to the token. If the token is not in
            the vocabulary and the tokenizer is in training mode, it will add
            the token and return its corresponding integer. Otherwise, it
            returns 1, which corresponds to 'unk'.
        """
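        try:
            # Known token: return its existing integer.
            return self._data[item]
        except KeyError:
            if self.train:
                # Training mode: assign the next unused integer to the new
                # token, then look it up again to return it.
                self.num_classes += 1
                self._data[item] = self.num_classes
                return self(item)
            else:
                # Inference mode: record the unseen token and fall back to
                # the integer reserved for 'unk'.
                self.unknown.append(item)
                return 1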

    @property
    def vocab_size(self) -> int:
        """
        Returns the size of the vocabulary, which is the number of unique
        tokens plus one.

        Returns
        -------
        int
            The total number of classes (tokens) in the vocabulary plus one.
        """
        return self.num_classes + 1

vocab_size property

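Returns the size of the vocabulary: the number of unique tokens (including 'unk') plus one. Token integers start at 1, so this is one more than the largest integer the tokenizer can return.

Returns

int
    The total number of classes (tokens) in the vocabulary plus one.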
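One place where the extra slot matters is a lookup table indexed directly by token integer: the table needs `vocab_size` rows so that the largest integer is a valid index. The sketch below uses a plain Python list as a hypothetical table and a condensed copy of the class so that it runs without graphchem installed. Whether graphchem reserves row 0 for a specific purpose (such as padding) is not stated here and is not assumed.

```python
class Tokenizer:
    """Condensed stand-in for graphchem.preprocessing.Tokenizer (assumption:
    same logic as the source listing)."""

    def __init__(self):
        self._data = {"unk": 1}
        self.num_classes = 1
        self.train = True
        self.unknown = []

    def __call__(self, item: str) -> int:
        try:
            return self._data[item]
        except KeyError:
            if self.train:
                self.num_classes += 1
                self._data[item] = self.num_classes
                return self.num_classes
            self.unknown.append(item)
            return 1

    @property
    def vocab_size(self) -> int:
        return self.num_classes + 1


tokenizer = Tokenizer()
for symbol in ["C", "N", "O"]:
    tokenizer(symbol)

print(tokenizer.num_classes)  # 4  ('unk' plus three symbols)
print(tokenizer.vocab_size)   # 5

# Hypothetical table with one 2-element row per index 0..num_classes.
table = [[0.0, 0.0] for _ in range(tokenizer.vocab_size)]

# The largest assigned integer is a valid index into the table.
largest_id = max(tokenizer._data.values())
row = table[largest_id]
print(largest_id == tokenizer.vocab_size - 1)  # True
```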

__call__(item)

Tokenizes a given string by returning its corresponding integer from the vocabulary.

Parameters

item : str
    The token (word) to be tokenized.

Returns

int
    The unique integer assigned to the token. If the token is not in the vocabulary and the tokenizer is in training mode, the token is added and its new integer is returned. If the token is not in the vocabulary and the tokenizer is not in training mode, the token is appended to unknown and 1, which corresponds to 'unk', is returned.
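The inference-mode branch can be illustrated with a short sketch. It uses a condensed copy of the class (the `vocab_size` property is omitted) so that it runs without graphchem installed; the token strings and the use of `collections.Counter` to summarize misses are illustrative choices, not part of the library.

```python
from collections import Counter


class Tokenizer:
    """Condensed stand-in for graphchem.preprocessing.Tokenizer (assumption:
    same lookup logic as the source listing, vocab_size omitted)."""

    def __init__(self):
        self._data = {"unk": 1}
        self.num_classes = 1
        self.train = True
        self.unknown = []

    def __call__(self, item: str) -> int:
        try:
            return self._data[item]
        except KeyError:
            if self.train:
                self.num_classes += 1
                self._data[item] = self.num_classes
                return self.num_classes
            self.unknown.append(item)
            return 1


tokenizer = Tokenizer()
for token in ["C", "N"]:
    tokenizer(token)
tokenizer.train = False  # freeze the vocabulary

ids = [tokenizer(token) for token in ["C", "Br", "Br", "Si"]]
print(ids)  # [2, 1, 1, 1]

# `unknown` is a list, so a repeated unseen token is recorded once per lookup.
print(tokenizer.unknown)  # ['Br', 'Br', 'Si']

# Summarize which unseen tokens occurred most often.
counts = Counter(tokenizer.unknown)
print(counts.most_common(1))  # [('Br', 2)]

# The vocabulary itself is unchanged by inference-mode lookups.
print(tokenizer.num_classes)  # 3
```

Because `unknown` grows with every miss, a long inference run over data with many unseen tokens will keep extending the list; clearing it between runs is left to the caller.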

Source code in graphchem/preprocessing/features.py
def __call__(self, item: str) -> int:
    """
    Tokenizes a given string by returning its corresponding integer from
    the vocabulary.

    Parameters
    ----------
    item : str
        The token (word) to be tokenized.

    Returns
    -------
    int
        The unique integer assigned to the token. If the token is not in
        the vocabulary and the tokenizer is in training mode, it will add
        the token and return its corresponding integer. Otherwise, it
        returns 1, which corresponds to 'unk'.
    """
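    try:
        # Known token: return its existing integer.
        return self._data[item]
    except KeyError:
        if self.train:
            # Training mode: assign the next unused integer to the new
            # token, then look it up again to return it.
            self.num_classes += 1
            self._data[item] = self.num_classes
            return self(item)
        else:
            # Inference mode: record the unseen token and fall back to
            # the integer reserved for 'unk'.
            self.unknown.append(item)
            return 1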

__init__()

Initialize the Tokenizer with default values.
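The default state can be inspected directly. The sketch below uses a condensed copy of the class (the `vocab_size` property is omitted) so that it runs without graphchem installed. It also shows one consequence of seeding the vocabulary with "unk": an input token that is literally the string "unk" resolves to the reserved entry rather than receiving its own integer.

```python
class Tokenizer:
    """Condensed stand-in for graphchem.preprocessing.Tokenizer (assumption:
    same lookup logic as the source listing, vocab_size omitted)."""

    def __init__(self):
        self._data = {"unk": 1}
        self.num_classes = 1
        self.train = True
        self.unknown = []

    def __call__(self, item: str) -> int:
        try:
            return self._data[item]
        except KeyError:
            if self.train:
                self.num_classes += 1
                self._data[item] = self.num_classes
                return self.num_classes
            self.unknown.append(item)
            return 1


tokenizer = Tokenizer()

# Fresh instance: only the reserved entry exists, and training mode is on.
print(tokenizer._data)        # {'unk': 1}
print(tokenizer.num_classes)  # 1
print(tokenizer.train)        # True
print(tokenizer.unknown)      # []

# The literal string "unk" hits the reserved entry, even in training mode.
unk_id = tokenizer("unk")
print(unk_id)                 # 1
print(tokenizer.num_classes)  # 1 (no new entry was created)

# The first genuinely new token receives the next integer.
first_id = tokenizer("C")
print(first_id)               # 2
```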

Source code in graphchem/preprocessing/features.py
def __init__(self):
    """
    Initialize the Tokenizer with default values.
    """
    self._data = {"unk": 1}
    self.num_classes = 1
    self.train = True
    self.unknown = []