A language dictionary is a practical way to check spelling errors in a document using NLP techniques in Python. How? Create a compare function with the recordlinkage module, or write your own function based on the Jaccard similarity equation (a quick sketch of the idea follows). A full treatment is a topic for another time.
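As a minimal illustration of the Jaccard idea, here is a hedged sketch comparing character sets; the function names and the 0.7 threshold are my own choices, not from the original post:

def jaccard_similarity(word_a, word_b):
    # Jaccard similarity of two words' character sets: |A & B| / |A | B|.
    a, b = set(word_a.lower()), set(word_b.lower())
    return len(a & b) / len(a | b)

def suggest(misspelled, dictionary, threshold=0.7):
    # Return dictionary words whose Jaccard score passes the threshold.
    return [w for w in dictionary if jaccard_similarity(misspelled, w) >= threshold]

print(suggest("helo", ["hello", "help", "world"]))  # ['hello']

A production spell checker would more likely compare character n-grams than raw character sets, but the shape of the comparison is the same.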
Please import:
%%time
%matplotlib inline
from matplotlib import pyplot as plt
import time
import re, random
import string
import sys, types, os
import numpy as np
import pandas as pd

from textblob import TextBlob, Word
from textblob.taggers import PatternTagger
from textblob.decorators import requires_nltk_corpus
from textblob.utils import tree2str, filter_insignificant
from textblob.base import BaseNPExtractor
from textblob.wordnet import VERB

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
from nltk import trigrams
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

stemming = PorterStemmer()
nltk.download('punkt')
nltk.download('wordnet')
stop = stopwords.words('english')

import spacy
from spacy import displacy
from spacy.language import Language
from spacy.pipeline import EntityRuler
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token
import en_core_web_sm
nlp = en_core_web_sm.load()
ruler = EntityRuler(nlp)
nlp.add_pipe(ruler)

from translate import Translator
from autocorrect import Speller
spell = Speller(lang='en')
To Generate A Word List:
from nltk.corpus import words

word_list = words.words()

# Save the word list as a checkpoint. The list must be wrapped in a
# DataFrame before it can be written to CSV. Saving to your desktop is
# an easy way to checkpoint a document you have modified in Python.
pd.DataFrame(word_list, columns=['words']).to_csv(
    r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv', index=False, header=True)
Save the file on your desktop, then upload it into your Jupyter Notebook or JupyterLab and import the word list into your Python code.
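A minimal way to load it back, assuming the path used above; the later snippets expect a DataFrame named wordlist with a 'words' column:

import pandas as pd

# Load the saved word list back into the DataFrame the snippets below use.
wordlist = pd.read_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslang.csv')
print(wordlist['words'].head())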
Let us prepare the word list with some text processing: tokenization, lemmatization, positions, tags, dependencies, alpha flags, and stop words. Each attribute is illustrated on a single word right after the list of definitions below.
Word Tokens: the process of segmenting text into words, punctuation marks, and other units
Word Lemmatization: the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma or dictionary form
Word Position: the process of categorizing words into their parts of speech
Word Tag: the process of assigning fine-grained linguistic information to each word
Word Dependency: the process of assigning syntactic dependency labels that describe the relations between individual tokens, such as subject or object
Word Alpha: the process of identifying whether a word consists of alphabetic characters
Word Stop: the process of identifying stop words, for example (is, not, this)
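To make these attributes concrete, here is a small sketch using the nlp pipeline loaded above; the sample word is my own choice, and the exact output depends on your spaCy model version:

# Inspect the attributes defined above for one sample word.
doc = nlp("running")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_,
          token.dep_, token.is_alpha, token.is_stop)
# Expect something along the lines of: running run VERB VBG ROOT True False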
You can do this in spaCy:
%%time
tokens = []
lemma = []
pos = []
tag = []
dep = []
alpha = []
stop = []

for doc in nlp.pipe(wordlist['words'].astype('unicode').values,
                    batch_size=100, n_threads=4):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
        tag.append([n.tag_ for n in doc])
        dep.append([n.dep_ for n in doc])
        alpha.append([n.is_alpha for n in doc])
        stop.append([n.is_stop for n in doc])
    else:
        # We want the lists of parsed results to have the same number of
        # entries as the original DataFrame, so add blanks if the parse fails.
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
        tag.append(None)
        dep.append(None)
        alpha.append(None)
        stop.append(None)

wordlist['tokens'] = tokens
wordlist['lemma'] = lemma
wordlist['pos'] = pos
wordlist['tag'] = tag
wordlist['dep'] = dep
wordlist['alpha'] = alpha
wordlist['stop'] = stop
This takes about 1 minute and 40 seconds to complete on my machine. Note: if you use this code to analyze a full document, it will take longer.
What I like to do next is group the words by part of speech, each in its own column. (A combined helper that covers all four cases follows the snippets below.)
To Get Adjectives:
def get_adjectives(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("JJ")]

wordlist['adjectives'] = wordlist['words'].apply(get_adjectives)
To Get Verbs:
def get_verbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("VB")]

wordlist['verbs'] = wordlist['words'].apply(get_verbs)
To Get Adverbs:
def get_adverbs(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("RB")]

wordlist['adverb'] = wordlist['words'].apply(get_adverbs)
To Get Nouns:
def get_nouns(text):
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag.startswith("NN")]

wordlist['nouns'] = wordlist['words'].apply(get_nouns)
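The four functions above differ only in the tag prefix, so you could fold them into one helper; this is a sketch, and the name make_pos_getter is my own:

def make_pos_getter(prefix):
    # Build a getter that keeps words whose Penn Treebank tag starts with `prefix`.
    def get_words(text):
        blob = TextBlob(text)
        return [word for (word, tag) in blob.tags if tag.startswith(prefix)]
    return get_words

# Builds the same four columns as above from one helper.
for column, prefix in [('adjectives', 'JJ'), ('verbs', 'VB'),
                       ('adverb', 'RB'), ('nouns', 'NN')]:
    wordlist[column] = wordlist['words'].apply(make_pos_getter(prefix))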
Word Sentiment:
To understand whether a word is negative, positive, or neutral.
wordlist[['polarity', 'subjectivity']] = wordlist['words'].apply(
    lambda words: pd.Series(TextBlob(words).sentiment))
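As a rough sanity check: TextBlob's polarity runs from -1 (negative) to 1 (positive) and subjectivity from 0 (objective) to 1 (subjective). A quick hedged example, with sample words of my own:

# Polarity: -1 (negative) to 1 (positive); subjectivity: 0 to 1.
for word in ("good", "bad", "table"):
    print(word, TextBlob(word).sentiment)
# Expect roughly: good -> positive polarity, bad -> negative, table -> neutral (0.0)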
To Clean a Column of Words:
The code below joins each list value into a comma-separated string, which removes the brackets around the words. Replace 'xxxx' with the name of the column you want to clean.
wordlist['xxxx'] = wordlist['xxxx'].apply(lambda x: ",".join(x) if isinstance(x, list) else x)
Translate the List of Words
Put %%time on the first line of the cell, followed by the translation code; the %%time magic records how long the cell takes to run. On my system, translating the word list into Spanish took 4 hours, 5 minutes, and 25 seconds. I would use TextBlob as a translator. Google's translator is excellent when you want words translated correctly, but it does not always work, because you need to connect to a server and the server can be down.
To get language codes, go to Language Codes
%%time
# Uses the `translate` package imported above, which takes the language
# pair in the constructor and returns a plain string.
translator = Translator(from_lang="en", to_lang="es")
wordlist["spanishwords"] = wordlist["words"].map(lambda x: translator.translate(x))
You can create a set of translators in order to translate into multiple languages as a series of columns:
%%time
# One Translator per target language (translate package API); the dict
# maps each output column name to its language code.
targets = {
    "spanishwords": "es", "frenchwords": "fr", "italianwords": "it",
    "germanwords": "de", "hindiwords": "hi", "chinesewords": "zh",
    "japanesewords": "ja", "koreanwords": "ko", "tagalogwords": "tl",
    "vietnamesewords": "vi", "thaiwords": "th", "russianwords": "ru",
    "afrikaanswords": "af",
}
translators = {column: Translator(from_lang="en", to_lang=code)
               for column, code in targets.items()}
results = {column: [] for column in targets}

for doc in nlp.pipe(wordlist['words'].astype('unicode').values,
                    batch_size=100, n_threads=4):
    for column, translator in translators.items():
        # Translate each token, keeping one list per row as in the parse step.
        results[column].append([translator.translate(n.text) for n in doc])

for column in targets:
    wordlist[column] = results[column]
Depending on your system, this may take a while to run. On my computer, translating into all of these languages would take at least 52 hours. My suggestion: use the code above to translate just two languages first and add a timer; from that timing you can estimate how long the full run will take.
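A hedged sketch of that estimate, timing one language on a small sample; the sample size and the scaling arithmetic are my own illustration:

import time

# Time one language on a small sample, then scale up to estimate the full run.
sample = wordlist['words'].head(100)
translator = Translator(from_lang="en", to_lang="es")

start = time.time()
sample.map(translator.translate)
elapsed = time.time() - start

per_word = elapsed / len(sample)
estimate_hours = per_word * len(wordlist) * 13 / 3600  # 13 target languages
print(f"~{per_word:.2f}s per word, full run estimate: ~{estimate_hours:.1f} hours")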
At completion, save the file on your desktop.
wordlist.to_csv(r'C:\Users\XXXXXXXX\Desktop\dictwordslangnew.csv', index=False, header=True)