Working with NLTK for Textual Data Preprocessing on Colab

NLTK (Natural Language Toolkit) is a suite of Python libraries and programs for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum (see nltk.org).

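NLTK comes pre-installed on Colab, so developers only need to import the library whenever it is needed. Note that corpora such as the stopword lists are distributed as separate data packages and may need to be fetched once per session with nltk.download().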

Import stopwords and PorterStemmer from the NLTK library:

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

See the Colab example.

Load the sample text:

# load the remote data into a Pandas DataFrame
import pandas as pd
df = pd.read_csv('https://archive.org/download/crowdflower/text_emotion.csv', on_bad_lines='skip', encoding='latin-1')
df.head()