With the evolution of transfer learning in image processing, the field of Natural Language Processing has also gained a ubiquitous pre-trained model, one that powers multiple state-of-the-art transfer learning solutions for tasks such as text classification and named entity recognition.

And this pre-trained model is word embeddings. A word embedding is a vector representation of vocabulary, trained on the idea that the meaning of a word is carried by the company it keeps (to borrow Firth's phrasing).

[For further elaboration on the state of transfer learning in NLP, follow this link, which I went through myself.]

[Additionally, if you want to know more about word embeddings, follow this link.]

Moving forward, pre-trained models like GloVe, word2vec, and fastText are readily available and easy to load. In this tutorial, I am just gonna cover how to load the .txt file provided by GloVe into Python as a model (a dictionary) and how to get the vector representation of words. These vectors can then be used as input to other models. The usual way to load such a text file as a model is to read it line by line, separate the word from its vector, and insert the word as a key and the vector as the value into a dictionary. Well, this takes a long time to load. To optimize this time, we can split the data into two separate files: one for the vocabulary and one for the vectors. NumPy can save an array as a .npy (binary format) file, which is handy and fast to read. Hence, here we will split the .txt file into a .vocab file and a .npy (vector) file.

**Step 1:** Download the desired pre-trained embedding file.

Follow the link below to download the pre-trained word embeddings provided by GloVe. I have downloaded the 100-dimensional embeddings, which were derived from 2B tweets (27B tokens, 1.2M vocab). The vector length is 100 features.

**Step 2:** Now, load the text file into a word-embedding model in Python. Following is the code snippet. {% gist 2c7c4427248fe89188d36cf6241c53f5 %} Use it as: model = load_glove_model("path/to/txt/file/also/exclude/extension of filename.")
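In case the embedded gist does not render for you, here is a minimal sketch of what load_glove_model could look like; the function name follows the usage above, but the internals are my assumption:

```python
import numpy as np

def load_glove_model(path):
    """Read a GloVe .txt file into a dict mapping word -> NumPy vector.

    `path` excludes the .txt extension, matching the usage in the text.
    """
    model = {}
    with open(path + ".txt", "r", encoding="utf-8") as f:
        for line in f:
            # Each line is: word value_1 value_2 ... value_d
            parts = line.rstrip().split(" ")
            model[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return model
```

This is the straightforward line-by-line approach described above; it works, but it is the slow path we will optimize next.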

Alternative and Faster Way

**Step 1:** Once you have the text file, convert it into a .vocab file and a .npy file.

The .vocab file will hold all the words, while the .npy file will hold their respective vectors; later, the two will be mapped back together by line index. Now, let's see a snippet of code to convert the .txt file into the required files.

{% gist d9a80e4ed66a0e1648f48da13edc6406 %}

Use it as: convert_to_binary("path/to/txt/file/also/exclude/extension of filename.")

Taaadaa, your desired files are created.

**Step 2:** Load these files to get the model. {% gist 3fc9a979259df954c97a99a3ab4e822c %} Use it as: model = load_embeddings_binary("path/to/vocab and npy/file/also/exclude/extension of filename.") Here, both files share the same filename; only the extensions differ.
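Again, in case the gist does not render, a minimal sketch of what load_embeddings_binary could look like (internals are my assumption, mirroring the converter above):

```python
import numpy as np

def load_embeddings_binary(path):
    """Load <path>.vocab and <path>.npy back into a word -> vector dict."""
    with open(path + ".vocab", "r", encoding="utf-8") as f:
        words = f.read().splitlines()
    vectors = np.load(path + ".npy")
    # Word i in the vocab file corresponds to row i of the vector array.
    return {w: v for w, v in zip(words, vectors)}
```

Reading the binary .npy file avoids parsing millions of floats from text, which is where the speedup comes from.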

**Step 3:** Get the word vector representation. Now, when you pass a sentence, it will give back an array of dimension [number of words, dimension of vector]. {% gist 4cb676f0a97346dd6fa3a43ec94409f1 %} Example: get_w2v("This is demonstration", model)

You will get a NumPy array of dimension 3 × 100, 3 being the number of tokens and 100 the dimension of the feature vector.
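For completeness, here is a minimal sketch of what get_w2v could look like. The lowercase whitespace tokenization and the skipping of out-of-vocabulary words are my assumptions, not necessarily what the gist does:

```python
import numpy as np

def get_w2v(sentence, model):
    """Return a [num_tokens, dim] array of vectors for the words in
    `sentence`, looked up in the word -> vector dict `model`."""
    # Simple lowercase whitespace tokenization (assumption); words
    # missing from the vocabulary are skipped.
    tokens = sentence.lower().split()
    return np.array([model[t] for t in tokens if t in model])
```

With the 100-dimensional GloVe model, get_w2v("This is demonstration", model) yields a 3 × 100 array, as described above.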

This is it! Thank you! In the next session, I will explain how to load an embedding model using the popular NLP library Gensim.