Blog 11: AI, Natural Language Processing (NLP)
Natural Language Processing (NLP)
NLP is a field of Artificial Intelligence (AI) that makes human language intelligible to machines. NLP combines the power of linguistics and computer science to study the rules and structure of language, allowing us to develop intelligent systems capable of understanding, analyzing, and extracting meaning from text and speech.
In this blog, I will demonstrate some NLP techniques in a Jupyter Notebook using Python.
NLTK
Python provides a large number of packages and libraries that let us apply different NLP techniques. NLTK provides easy-to-use interfaces for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, along with wrappers for industrial-strength NLP libraries.
import nltk
import pandas as pd
import numpy as np
import os
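As a quick taste of those interfaces, here is a minimal sketch of stemming and part-of-speech tagging. The sample words are my own illustration, and pos_tag needs the averaged_perceptron_tagger data downloaded once.
from nltk.stem import PorterStemmer
from nltk import pos_tag
# nltk.download('averaged_perceptron_tagger')  # needed once for pos_tag
stemmer = PorterStemmer()
stemmer.stem("running")   # reduces the word to its stem, 'run'
pos_tag(["NLTK", "makes", "language", "intelligible"])  # tags each token with a part of speech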
Reading a random Lab file from CIT480
I will read a text file containing a random lab from the CIT480 course.
with open('lab.txt', "r") as f:
    rand_lab= f.read().replace("\n","")
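One caveat: replacing "\n" with an empty string glues together words that sat on adjacent lines, which you will see in the token output below. A small variant that avoids this (shown with a hypothetical alternate variable so the outputs below stay as produced):
with open('lab.txt', "r") as f:
    # collapse all whitespace, including newlines, into single spaces
    rand_lab_spaced = " ".join(f.read().split())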
Tokenization
We will use the word_tokenize function to separate each word into a token/unit. This lets us study each word as a feature and understand the importance of each word with respect to the whole sentence.
from nltk.tokenize import word_tokenize
# nltk.download('punkt')  # word_tokenize needs this data on the first run
lab_tokenize= word_tokenize(rand_lab)
lab_tokenize[:10]
output: First 10 tokens
['LAB',
'INTRODUCTIONThis',
'lab',
'is',
'designed',
'to',
'introduce',
'you',
'to',
'creating']
Here, we can see that the words of each sentence have been tokenized, or separated into single units. (Note that 'INTRODUCTIONThis' appears as one token because removing the newlines merged words across line breaks.)
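Besides word-level tokens, nltk.tokenize also provides sent_tokenize for splitting text into sentences; it relies on the same punkt data as word_tokenize. A minimal sketch on a made-up string:
from nltk.tokenize import sent_tokenize
sample = "NLP is fun. It has many tools! Do you agree?"
sent_tokenize(sample)
# ['NLP is fun.', 'It has many tools!', 'Do you agree?']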
Frequency Distribution
We can import the FreqDist class from nltk to better understand the distribution of words in this file. By using several functions from this class, we can perform different analyses and draw some conclusions.
from nltk import FreqDist
# creating an instance
freq_dist= FreqDist()
# count each token, lowercased so 'The' and 'the' are counted together
for i in lab_tokenize:
    freq_dist[i.lower()] += 1
freq_dist.most_common(10)
output: Here are the top 10 most common words or characters used.
[('to', 29),
('the', 27),
('aws', 26),
('|', 16),
('you', 14),
('.', 14),
(':', 14),
('and', 11),
(',', 10),
('a', 9)]
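The top of this list is dominated by stopwords ('to', 'the', 'and') and punctuation, which say little about the lab itself. A common next step is to filter those out before counting; a minimal sketch, assuming the NLTK stopwords corpus has been downloaded:
from nltk.corpus import stopwords
# nltk.download('stopwords')  # needed once
stop_words = set(stopwords.words('english'))
# keep only alphabetic tokens that are not stopwords
content_dist = FreqDist(
    w.lower() for w in lab_tokenize
    if w.isalpha() and w.lower() not in stop_words
)
content_dist.most_common(10)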
Understanding N-grams: Bigrams and Trigrams
After tokenization, we can group the words of a sentence into different consecutive sequences.
Bigrams, trigrams, and n-grams
This concept reflects that words are basic, meaningful elements which can take on different meanings depending on the words around them in a sentence.
Example
In the following example I will write a sentence, tokenize it, and extract consecutive words that can convey different meanings.
my_string= "Lets eat, grandpa."
my_string_token= word_tokenize(my_string)
Now, applying the functions below, we can see how separating the words into different consecutive groupings gives different meanings.
bigrams:
# Bigrams
my_string_bigram= list(bigrams(my_string_token))
my_string_bigram
[('Lets', 'eat'), ('eat', ','), (',', 'grandpa'), ('grandpa', '.')]
trigrams:
# TRIGRAMS
my_string_trigram= list(trigrams(my_string_token))
my_string_trigram
[('Lets', 'eat', ','), ('eat', ',', 'grandpa'), (',', 'grandpa', '.')]
ngrams:
# N-grams, where n is the number of consecutive tokens (here n=1, i.e., unigrams)
my_string_ngram= list(ngrams(my_string_token,1))
my_string_ngram
[('Lets',), ('eat',), (',',), ('grandpa',), ('.',)]
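Putting frequency distributions and n-grams together, we can count which word pairs occur together most often in the lab file. A short sketch reusing the tokens from earlier:
# count consecutive lowercased word pairs from the lab file
bigram_dist = FreqDist(bigrams(w.lower() for w in lab_tokenize))
bigram_dist.most_common(5)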