Intro to NLP and Text Processing with TensorFlow¶
1. Intro to Natural Language Processing¶
We are surrounded by intelligent machines that can not only see the world, but also understand it and talk with us.
That is not an exaggeration. On a regular basis, some of us interact with virtual assistants such as Siri, Amazon Alexa, and Google Assistant, and there are thousands of chatbots that we interact with across many software applications and websites.
NLP, or Natural Language Processing, is an interdisciplinary field. It is a branch of computer science, machine learning, and computational linguistics that is concerned with giving computers the ability to understand text and human language.
Common tasks involved in NLP¶
Below are some of the common tasks that can be done with NLP.
- Text classification
- Sentiment analysis
- Text generation
- Machine translation
- Speech recognition
- Text to speech conversion
- Optical character recognition
Example Applications of NLP¶
- Spam detection
- Question answering
- Language to language translation (Machine translation)
- Grammatical error correction (like Grammarly)
One of the classical NLP tools is NLTK. The Natural Language Toolkit, or NLTK, provides many functionalities for working with text and is commonly used.
In this lab and the labs that follow, we won't use NLTK. We will use the text-processing functions in Keras and TensorFlow.
There is also a suite of TensorFlow libraries for text processing such as TensorFlow Text.
2. Intro to Text Processing with TensorFlow¶
2.1 Text encodings¶
Most machine learning models (including deep learning ones) cannot handle raw text data; it has to be converted into numbers. In essence, that is what text encoding means: converting text into a numeric representation.
There are 4 main text encoding techniques:
- Character encoding
- Word-based encoding
- One hot encoding
- Word embeddings
Let's talk about these techniques in detail.
Character Based Encoding¶
In this type of encoding, each character in a word is represented by a unique number.
One of the traditional character encoding schemes is ASCII (American Standard Code for Information Interchange). With ASCII, we can convert nearly any character to a number, and it is a well-established standard, but one of its disadvantages is that anagrams (words with the same letters in a different order) have the same set of encodings, and that can hurt the machine learning model.
Examples of anagrams include silent and listen, which contain exactly the same letters.
Character encoding is not widely used, and it is less efficient than the techniques introduced later.
The image below shows character encoding applied to anagrams.
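The anagram problem can be sketched in plain Python using the built-in `ord()` function, which returns a character's ASCII code point (a minimal sketch for illustration; we will use the Keras text utilities for real encoding later):

```python
# Encode each character of a word as its ASCII code point.
def ascii_encode(word):
    return [ord(ch) for ch in word]

print(ascii_encode("silent"))  # [115, 105, 108, 101, 110, 116]
print(ascii_encode("listen"))  # [108, 105, 115, 116, 101, 110]

# The two anagrams produce exactly the same codes, just in a
# different order, so a model sees nearly identical inputs for
# words with different meanings.
print(sorted(ascii_encode("silent")) == sorted(ascii_encode("listen")))  # True
```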
Words Based Encoding¶
In word-based encoding, instead of encoding each individual character, we represent each individual word with a number.
In most cases, word-based encoding works better than character encoding.
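The idea can be sketched with a small hand-rolled word index in plain Python (the Keras `Tokenizer` builds such an index for us later; the `build_word_index` helper and the sample sentences here are made up for illustration):

```python
# Build a word index: each unique word gets its own number.
def build_word_index(sentences):
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in index:
                index[word] = len(index) + 1  # start at 1; 0 is often reserved for padding
    return index

sentences = ["I love my dog", "I love my cat"]
word_index = build_word_index(sentences)
print(word_index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

# A sentence then becomes a sequence of word numbers.
encoded = [word_index[w] for w in "i love my cat".split()]
print(encoded)  # [1, 2, 3, 5]
```

Note that the anagram problem disappears: "silent" and "listen" would simply get two different numbers.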