Tokenization Using spaCy Library – GeeksforGeeks


Before moving to the explanation of tokenization, let's first discuss what spaCy is. spaCy is a library for NLP (Natural Language Processing). It is an object-oriented library used to pre-process text and sentences and to extract information from text using its modules and functions.

Tokenization is the process of splitting a text or a sentence into segments, which are called tokens. It is the first step of text preprocessing, and its output is used as input for subsequent processes such as text classification and lemmatization.
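To make the idea concrete, here is a minimal sketch of tokenization that uses only Python's standard re module. The function name naive_tokenize and the regular expression are illustrative assumptions; this is not spaCy's algorithm, which applies language-specific rules and exceptions.

```python
import re


def naive_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; spaCy's tokenizer is
    rule-based and considerably more sophisticated.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = naive_tokenize("Tokenization splits text into tokens.")
print(tokens)
```

Each word becomes one token and the final full stop becomes a token of its own, which is the same kind of segmentation the spaCy examples in this article produce.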

Process followed to convert text into tokens


Creating a blank language object gives a tokenizer and an empty pipeline. To add modules to the pipeline along with the tokenizer, we can use the following:


Intermediate steps for tokenization



Below is the implementation:


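import spacy

# Create a blank English pipeline: it contains only the tokenizer
nlp = spacy.blank("en")

doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

# Print each token on its own line
for token in doc:
    print(token)
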

GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.
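Modules can also be added to a blank pipeline by hand. The sketch below assumes spaCy v3, where nlp.add_pipe accepts the name of a built-in component. It adds the rule-based sentencizer, which is not used elsewhere in this article and is chosen here only because it needs no downloaded model.

```python
import spacy

# Start from a blank English pipeline (tokenizer only)
nlp = spacy.blank("en")
print(nlp.pipe_names)  # no components have been added yet

# Add the built-in rule-based sentence splitter (spaCy v3 string API)
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc = nlp("GeeksforGeeks is a portal. It is for geeks.")

# Sentence boundaries are now available on the Doc
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The tokenizer always runs first; every component added afterwards receives the tokenized Doc and attaches further annotations to it.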

We can also add functionality to tokens by adding other modules to the pipeline using spacy.load().


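# Load the small English model, which ships with a full pipeline
nlp = spacy.load("en_core_web_sm")

# List the names of the components in the pipeline
print(nlp.pipe_names)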

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Here is an example showing what other functionality becomes available when modules are added to the pipeline.


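import spacy

# Load a pipeline that includes a tagger and a lemmatizer
nlp = spacy.load("en_core_web_sm")

# Initialise the doc with a sentence
doc = nlp("If you want to be an excellent programmer, "
          "be consistent to practice daily on GFG.")

# Print each token with its part of speech and its lemma
for token in doc:
    print(token, " | ",
          spacy.explain(token.pos_),
          " | ", token.lemma_)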


If  |  subordinating conjunction  |  if
you  |  pronoun  |  you
want  |  verb  |  want
to  |  particle  |  to
be  |  auxiliary  |  be
an  |  determiner  |  an
excellent  |  adjective  |  excellent
programmer  |  noun  |  programmer
,  |  punctuation  |  ,
be  |  auxiliary  |  be
consistent  |  adjective  |  consistent
to  |  particle  |  to
practice  |  verb  |  practice
daily  |  adverb  |  daily
on  |  adposition  |  on
GFG  |  proper noun  |  GFG
.  |  punctuation  |  .

In the above example, we used part-of-speech (POS) tagging and lemmatization through the pipeline modules, which produced a POS tag and a lemma for every token (lemmatization is the process of reducing each token to its base form). This functionality was not available with the blank pipeline; it was added only when we loaded our NLP instance with "en_core_web_sm".
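The difference is easy to check on a blank pipeline. The following sketch assumes only that spaCy is installed (no model download is needed); the variable names blank_nlp and pos_tags are illustrative. It shows that, without a tagger, the pos_ attribute of every token is an empty string.

```python
import spacy

# A blank pipeline has a tokenizer but no tagger or lemmatizer
blank_nlp = spacy.blank("en")
doc = blank_nlp("Practice daily on GFG.")

# Collect the coarse POS tag of each token; none has been assigned
pos_tags = [token.pos_ for token in doc]

print(blank_nlp.pipe_names)
print(pos_tags)
```

Running the same sentence through a pipeline loaded with spacy.load("en_core_web_sm") would fill in these attributes, because that pipeline contains the tagger and lemmatizer components listed earlier.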

