A language dictionary is a practical way to check spelling errors in a document using NLP techniques in Python. How? Create a compare function with the recordlinkage module, or write your own function based on the Jaccard similarity equation (a quick sketch of the idea follows). A full treatment is a topic for another time.
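As a minimal illustration of the Jaccard idea, here is a hedged sketch comparing character sets; the function names and the 0.7 threshold are my own choices, not from the original post:

def jaccard_similarity(word_a, word_b):
    # Jaccard similarity of two words' character sets: |A & B| / |A | B|.
    a, b = set(word_a.lower()), set(word_b.lower())
    return len(a & b) / len(a | b)

def suggest(misspelled, dictionary, threshold=0.7):
    # Return dictionary words whose Jaccard score passes the threshold.
    return [w for w in dictionary if jaccard_similarity(misspelled, w) >= threshold]

print(suggest("helo", ["hello", "help", "world"]))  # ['hello']

A production spell checker would more likely compare character n-grams than raw character sets, but the shape of the comparison is the same.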
Please import:
%%time
%matplotlib inline
from matplotlib import pyplot as plt
import time
import re, random
import string
import sys, types, os
import numpy as np
import pandas as pd

from textblob import TextBlob, Word
from textblob.taggers import PatternTagger
from textblob.decorators import requires_nltk_corpus
from textblob.utils import tree2str, filter_insignificant
from textblob.base import BaseNPExtractor
from textblob.wordnet import VERB

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
from nltk import trigrams
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

stemming = PorterStemmer()
nltk.download('punkt')
nltk.download('wordnet')
stop = stopwords.words('english')

import spacy
from spacy import displacy
from spacy.language import Language
from spacy.pipeline import EntityRuler
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
import en_core_web_sm
nlp = en_core_web_sm.load()
ruler = EntityRuler(nlp)
nlp.add_pipe(ruler)

from translate import Translator
from autocorrect import Speller
spell = Speller(lang='en')
To Generate A Word List:
from nltk.corpus import words

word_list = words.words()

# Save the word list as a checkpoint. The list must be wrapped in a
# DataFrame before it can be written to CSV. Saving to your desktop is
# an easy way to checkpoint a document you have modified in Python.
pd.DataFrame(word_list, columns=['words']).to_csv(
    r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv', index=False, header=True)
Save the file on your desktop, then upload it into your Jupyter Notebook or JupyterLab and import the word list into your Python code.
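A minimal way to load it back, assuming the path used above; the later snippets expect a DataFrame named wordlist with a 'words' column:

import pandas as pd

# Load the saved word list back into the DataFrame the snippets below use.
wordlist = pd.read_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv')
print(wordlist['words'].head())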
Let us prepare the word list with some text processing: tokenization, lemmatization, positions, tags, dependencies, alpha flags, and stop words. Each attribute is illustrated on a single word right after the list of definitions below.
Word Tokens: the process of segmenting text into words, punctuation marks, and other units
Word Lemmatization: the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma or dictionary form
Word Position: the process of categorizing words into their parts of speech
Word Tag: the process of assigning fine-grained linguistic information to each word
Word Dependency: the process of assigning syntactic dependency labels that describe the relations between individual tokens, such as subject or object
Word Alpha: the process of identifying whether a word consists of alphabetic characters
Word Stop: the process of identifying stop words, for example (is, not, this)
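To make these attributes concrete, here is a small sketch using the nlp pipeline loaded above; the sample word is my own choice, and the exact output depends on your spaCy model version:

# Inspect the attributes defined above for one sample word.
doc = nlp("running")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_,
          token.dep_, token.is_alpha, token.is_stop)
# Expect something along the lines of: running run VERB VBG ROOT True False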
You can do this in spaCy:
%%time
tokens = []
lemma = []
pos = []
tag = []
dep = []
alpha = []
stop = []

for doc in nlp.pipe(wordlist['words'].astype('unicode').values,
                    batch_size=100, n_threads=4):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
        tag.append([n.tag_ for n in doc])
        dep.append([n.dep_ for n in doc])
        alpha.append([n.is_alpha for n in doc])
        stop.append([n.is_stop for n in doc])
    else:
        # We want the lists of parsed results to have the same number of
        # entries as the original DataFrame, so add blanks if the parse fails.
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
        tag.append(None)
        dep.append(None)
        alpha.append(None)
        stop.append(None)

wordlist['tokens'] = tokens
wordlist['lemma'] = lemma
wordlist['pos'] = pos
wordlist['tag'] = tag
wordlist['dep'] = dep
wordlist['alpha'] = alpha
wordlist['stop'] = stop
This takes about 1 minute and 40 seconds to complete on my machine. Note: if you use this code to analyze a full document, it will take longer.
What I like to do next is group the words by part of speech, each in its own column. (A combined helper that covers all four cases follows the snippets below.)
To Get Adjectives:
def get_adjectives(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("JJ")]

wordlist['adjectives'] = wordlist['words'].apply(get_adjectives)
To Get Verbs:
def get_verbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("VB")]

wordlist['verbs'] = wordlist['words'].apply(get_verbs)
To Get Adverbs:
def get_adverbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("RB")]

wordlist['adverb'] = wordlist['words'].apply(get_adverbs)
To Get Nouns:
def get_nouns(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("NN")]

wordlist['nouns'] = wordlist['words'].apply(get_nouns)
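The four functions above differ only in the tag prefix, so you could fold them into one helper; this is a sketch, and the name make_pos_getter is my own:

def make_pos_getter(prefix):
    # Build a getter that keeps words whose Penn Treebank tag starts with `prefix`.
    def get_words(text):
        blob = TextBlob(text)
        return [word for (word, tag) in blob.tags if tag.startswith(prefix)]
    return get_words

# Builds the same four columns as above from one helper.
for column, prefix in [('adjectives', 'JJ'), ('verbs', 'VB'),
                       ('adverb', 'RB'), ('nouns', 'NN')]:
    wordlist[column] = wordlist['words'].apply(make_pos_getter(prefix))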
Word Sentiment:
To understand whether a word is negative, positive, or neutral.
wordlist[['polarity', 'subjectivity']] = wordlist['words'].apply(
    lambda words: pd.Series(TextBlob(words).sentiment))
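As a rough sanity check: TextBlob's polarity runs from -1 (negative) to 1 (positive) and subjectivity from 0 (objective) to 1 (subjective). A quick hedged example, with sample words of my own:

# Polarity: -1 (negative) to 1 (positive); subjectivity: 0 to 1.
for word in ("good", "bad", "table"):
    print(word, TextBlob(word).sentiment)
# Expect roughly: good -> positive polarity, bad -> negative, table -> neutral (0.0)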
To Clean a Column of Words:
The code below joins each list value into a comma-separated string, which removes the brackets around the words. Replace 'xxxx' with the name of the column you want to clean.
wordlist['xxxx'] = wordlist['xxxx'].apply(lambda x: ",".join(x) if isinstance(x, list) else x)
Translate the List of Words
Put %%time on the first line of the cell, followed by the translation code; the %%time magic records how long the cell takes to run. On my system, translating the word list into Spanish took 4 hours, 5 minutes, and 25 seconds. I would use TextBlob as a translator. Google's translator is excellent when you want words translated correctly, but it does not always work, because you need to connect to a server and the server can be down.
To get language codes, go to Language Codes
%%time
# Uses the `translate` package imported above, which takes the language
# pair in the constructor and returns a plain string.
translator = Translator(from_lang="en", to_lang="es")
wordlist["spanishwords"] = wordlist["words"].map(lambda x: translator.translate(x))
You can create a set of translators in order to translate into multiple languages as a series of columns:
%%time
# One Translator per target language (translate package API); the dict
# maps each output column name to its language code.
targets = {
    "spanishwords": "es", "frenchwords": "fr", "italianwords": "it",
    "germanwords": "de", "hindiwords": "hi", "chinesewords": "zh",
    "japanesewords": "ja", "koreanwords": "ko", "tagalogwords": "tl",
    "vietnamesewords": "vi", "thaiwords": "th", "russianwords": "ru",
    "afrikaanswords": "af",
}
translators = {column: Translator(from_lang="en", to_lang=code)
               for column, code in targets.items()}
results = {column: [] for column in targets}

for doc in nlp.pipe(wordlist['words'].astype('unicode').values,
                    batch_size=100, n_threads=4):
    for column, translator in translators.items():
        # Translate each token, keeping one list per row as in the parse step.
        results[column].append([translator.translate(n.text) for n in doc])

for column in targets:
    wordlist[column] = results[column]
Depending on your system, this may take a while to run. On my computer, translating into all of these languages would take at least 52 hours. My suggestion: use the code above to translate just two languages first and add a timer; from that timing you can estimate how long the full run will take.
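A hedged sketch of that estimate, timing one language on a small sample; the sample size and the scaling arithmetic are my own illustration:

import time

# Time one language on a small sample, then scale up to estimate the full run.
sample = wordlist['words'].head(100)
translator = Translator(from_lang="en", to_lang="es")

start = time.time()
sample.map(translator.translate)
elapsed = time.time() - start

per_word = elapsed / len(sample)
estimate_hours = per_word * len(wordlist) * 13 / 3600  # 13 target languages
print(f"~{per_word:.2f}s per word, full run estimate: ~{estimate_hours:.1f} hours")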
At completion, save the file on your desktop.
wordlist.to_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslangnew.csv', index=False, header=True)