Updated March 29, 2023

Definition of spaCy models

spaCy's trained pipelines are available as Python packages. This means that, like any other module, they are a component of our application. They are versioned and can be listed as a dependency in our requirements.txt. Trained pipelines can be installed with pip or downloaded manually. The pipelines are designed to be efficient in terms of speed and size, and they work best when the full pipeline is run.

Introduction to spaCy models

spaCy pipelines are efficient and highly configurable. For example, multiple components can share a common "token-to-vector" model, and the lemmatizer can easily be swapped out or disabled.
To work with spaCy models, we first need to create a code environment and add "spacy" to its package requirements, just like any other Python package.
Some spaCy resources, notably the trained models, are not bundled with the library, so an extra download step is required to use them. On shared DSS nodes, this can cause problems because users don't have write access to shared server locations.
Unlike in spaCy v2, where the tagger and parser were self-contained, several v3 components rely on earlier components in the pipeline.
As a result, disabling or reordering components can reduce the quality of the annotations or result in warnings and errors.
The tagger, morphologizer, and parser components listen to the tok2vec component.
If there is no morphologizer, the attribute ruler maps token.tag to token.pos. The attribute ruler also ensures that whitespace is tagged consistently, and it copies token.pos to token.tag if there is no tagger.
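The listener relationship described above can be pictured with a small plain-Python sketch. The `LISTENS_TO` table and the `affected_by_disabling` helper are hypothetical names written for illustration; they are not part of spaCy and only encode the dependencies named in the text.

```python
# Upstream components that each listener relies on, as described in the text.
# This table is a hand-written illustration, not something read from spaCy.
LISTENS_TO = {
    "tagger": {"tok2vec"},
    "morphologizer": {"tok2vec"},
    "parser": {"tok2vec"},
}


def affected_by_disabling(disabled):
    """List components that stay active but lose an upstream component."""
    disabled = set(disabled)
    return sorted(
        name
        for name, upstream in LISTENS_TO.items()
        if name not in disabled and upstream & disabled
    )


print(affected_by_disabling(["tok2vec"]))     # ['morphologizer', 'parser', 'tagger']
print(affected_by_disabling(["lemmatizer"]))  # []
```

In spaCy itself, components are switched off when a pipeline is loaded, for example through the `disable` argument of `spacy.load`.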
spaCy's pip-based distribution of models can help us solve the download problem on shared nodes, because a model can be installed into the code environment like any other package. Before using any spaCy model, we need to install spaCy itself.
The steps below show how to install spaCy and check that it is available.
In this step, we install the spaCy package by using the pip command. In the example below, spaCy is already installed on our system, so pip reports that the requirement is already satisfied and nothing more needs to be done.

pip install spacy

After installing the package, we open the Python shell by using the python3 command.

python3

Once inside the Python shell, we check that the spaCy package is installed on our system by importing it.

import spacy  # the importable module name is all lowercase
print(spacy)
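The same check can be scripted outside the interactive shell. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is an illustrative choice and not part of spaCy.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("spacy")
if version is None:
    print("spaCy is not installed; run: pip install spacy")
else:
    print(f"spaCy {version} is installed")
```

Because the helper returns None instead of raising, it can be reused in a setup script to decide whether the install step is needed.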

SpaCy models and languages

As noted above, spaCy's models are installed as Python packages, which means they are part of our program just like any other module. The requirements.txt file can be used to declare them and pin their versions.
Alternatively, we can use pip to install a model directly by pointing the pip command at the archive file's URL or at a local path. If we don't have a model's direct URL, we can go to its release page and copy the link from there.
The example below installs a spaCy model on a Linux-like system from an external URL. The model version in the URL must be compatible with the installed spaCy version; the 2.2.0 archive shown here was built for spaCy v2.

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
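The release URL above follows a regular pattern, so it can be assembled from a model name and version. The sketch below assumes that pattern holds for the release we want; `model_archive_url` and `requirement_line` are hypothetical helper names. The second helper produces a direct-reference entry of the form `name @ url`, which pip accepts in a requirements.txt file.

```python
def model_archive_url(name, version):
    """Build the release archive URL for a model, following the pattern above."""
    release = f"{name}-{version}"
    return (
        "https://github.com/explosion/spacy-models/releases/download/"
        f"{release}/{release}.tar.gz"
    )


def requirement_line(name, version):
    """Build a requirements.txt entry that pins the model to that archive."""
    return f"{name} @ {model_archive_url(name, version)}"


print(model_archive_url("en_core_web_sm", "2.2.0"))
print(requirement_line("en_core_web_sm", "2.2.0"))
```

It is worth confirming on the release page that the archive actually exists before adding the generated line to requirements.txt.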

The example below installs the model from a local archive file.

pip install en_core_web_sm-2.2.0.tar.gz

We can also download the data manually and save it to a directory of our own. From the most recent release, we can download the model archive directly in the browser, or we can use the archive URLs to build our own download script.
Once the download has finished, we can place the model package anywhere we choose on the local file system.
The example below imports an installed model as a module.
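A download script of the kind mentioned above could look like the following sketch, which uses only the standard library. The function names and the "models" directory are illustrative assumptions. The network call is left inside `download_archive` and is not run by the example itself.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def archive_target(url, directory):
    """Return the local path an archive URL would be saved to."""
    filename = PurePosixPath(urlparse(url).path).name
    return Path(directory) / filename


def download_archive(url, directory):
    """Download a model archive into `directory` and return its path."""
    target = archive_target(url, directory)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return target


url = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz"
)
# Shows where the file would be saved; call download_archive(url, "models") to fetch it.
print(archive_target(url, "models"))
```

The downloaded archive can then be installed with pip from its local path.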

Code –

import spacy
import en_core_web_sm

# Load the installed pipeline package and run it on a short text
py_exmp = en_core_web_sm.load()
py_doc = py_exmp("spaCy models")
print(py_doc)
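An installed model can also be loaded by its package name through `spacy.load`. The sketch below first checks that both spaCy and the model package are importable, so it returns None instead of failing when either is missing; `load_pipeline` is a hypothetical helper name.

```python
import importlib.util


def load_pipeline(name="en_core_web_sm"):
    """Load an installed pipeline by package name, or return None if unavailable."""
    for module in ("spacy", name):
        if importlib.util.find_spec(module) is None:
            print(f"{module} is not installed")
            return None
    import spacy  # imported lazily so the check above can run without spaCy

    return spacy.load(name)


nlp = load_pipeline()
if nlp is not None:
    print(nlp("spaCy models"))
```

Loading by name keeps the model choice in a string, which is convenient when the model is selected from configuration.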

We can also use a pipeline that we trained ourselves. To do so, we load it from its directory, which gives us a Language object. We can also wrap it as a Python package to make deployment easier.
The model version reflects the following.
Compatibility with spaCy
The major and minor versions of the model
A model version r.s.t translates as follows,

r – spaCy major version.
s – Model major version. Models with different major versions cannot be loaded by the same code.
t – Model minor version. The model structure is the same, but the parameter values are different. For instance, the model could have been trained on different data or for a different number of iterations.
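The three version fields described above can be separated with a few lines of Python. This is a minimal sketch that assumes a plain three-part numeric version such as "2.2.0"; the `ModelVersion` type and both helper names are illustrative.

```python
from typing import NamedTuple


class ModelVersion(NamedTuple):
    spacy_major: int  # first field: spaCy major version
    model_major: int  # second field: model major version
    model_minor: int  # third field: model minor version


def parse_model_version(text):
    """Split a three-part version string such as '2.2.0' into its fields."""
    first, second, third = (int(part) for part in text.split("."))
    return ModelVersion(first, second, third)


def loadable_by_same_code(a, b):
    """Two versions share loading code when spaCy major and model major match."""
    return (a.spacy_major, a.model_major) == (b.spacy_major, b.model_major)


old = parse_model_version("2.2.0")
new = parse_model_version("2.2.5")
print(old)                              # ModelVersion(spacy_major=2, model_major=2, model_minor=0)
print(loadable_by_same_code(old, new))  # True
```

Under this scheme, two versions that differ only in the last field have the same structure and differ in their trained parameter values.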

The table below lists the English spaCy models.

Model	Language	Type	Size	License
en_core_web_sm	English	syntax, entities, vocabulary	13 MB	MIT
en_core_web_md	English	vectors, syntax, entities, vocabulary	43 MB	MIT
en_core_web_lg	English	vectors, syntax, entities, vocabulary	741 MB	MIT
en_core_web_trf	English	syntax, entities, vocabulary	438 MB	MIT

spaCy supports the following languages. The table below shows each language and its code.

Language	Code	Language	Code
Chinese	zh	Danish	da
English	en	Dutch	nl
French	fr	German	de
Greek	el	Italian	it
Japanese	ja	Lithuanian	lt
Multi-language	xx	Norwegian Bokmål	nb
Polish	pl	Portuguese	pt
Romanian	ro	Spanish	es
Afrikaans	af	Albanian	sq
Arabic	ar	Armenian	hy
Basque	eu	Bengali	bn
Bulgarian	bg	Catalan	ca
Croatian	hr	Czech	cs
Estonian	et	Finnish	fi
Gujarati	gu	Hebrew	he
Hindi	hi	Hungarian	hu
Icelandic	is	Indonesian	id
Irish	ga	Kannada	kn
Korean	ko	Latvian	lv
Ligurian	lij	Luxembourgish	lb
Macedonian	mk	Malayalam	ml
Marathi	mr	Nepali	ne
Persian	fa	Russian	ru
Serbian	sr	Sinhala	si
Slovak	sk	Slovenian	sl
Swedish	sv	Tagalog	tl
Tamil	ta	Tatar	tt
Telugu	te	Thai	th
Turkish	tr	Ukrainian	uk
Urdu	ur	Vietnamese	vi
Yoruba	yo
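The language codes in the table also appear as the first part of a model package name. The sketch below assumes the `lang_type_genre_size` naming pattern that names such as en_core_web_sm follow; `describe_model` is a hypothetical helper, and `LANGUAGE_NAMES` holds only a few entries copied from the table above.

```python
# A few entries copied from the language table above.
LANGUAGE_NAMES = {
    "en": "English",
    "de": "German",
    "fr": "French",
    "xx": "Multi-language",
}


def describe_model(package_name):
    """Split a name such as 'en_core_web_sm' into its four parts."""
    lang, model_type, genre, size = package_name.split("_")
    return {
        "language": LANGUAGE_NAMES.get(lang, lang),
        "type": model_type,
        "genre": genre,
        "size": size,
    }


print(describe_model("en_core_web_sm"))
# {'language': 'English', 'type': 'core', 'genre': 'web', 'size': 'sm'}
```

Codes that are missing from the small lookup table are returned unchanged, so the helper still works for languages not listed in it.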

Conclusion

spaCy models are efficient and highly configurable. For example, multiple components can share a common "token-to-vector" model, and the lemmatizer can easily be swapped out or disabled. spaCy's trained pipelines are available as Python packages, which means that, like any other module, they are part of our program.