[PDF] Blueprints for Text Analytics Using Python Free Download


Turning text into valuable information is essential for businesses looking to gain a competitive advantage. With recent improvements in natural language processing (NLP), users now have many options for solving complex challenges. But it’s not always clear which NLP tools or libraries would work for a business’s needs, or which techniques you should use and in what order.

This practical book provides data scientists and developers with blueprints for best practice solutions to common tasks in text analytics and natural language processing. Authors Jens Albrecht, Sidharth Ramachandran, and Christian Winkler provide real-world case studies and detailed code examples in Python to help you get started quickly.

  • Extract data from APIs and web pages
  • Prepare textual data for statistical analysis and machine learning
  • Use machine learning for classification, topic modeling, and summarization
  • Explain AI models and classification results
  • Explore and visualize semantic similarities with word embeddings
  • Identify customer sentiment in product reviews
  • Create a knowledge graph based on named entities and their relations


From the Publisher

From the Preface

Approach of the Book

This book is intended to support data scientists and developers so they can quickly enter the area of text analytics and natural language processing. Thus, we put the focus on developing practical solutions that can serve as blueprints in your daily business. A blueprint, in our definition, is a best-practice solution for a common problem. It is a template that you can easily copy and adapt for reuse. For these blueprints we use production-ready Python frameworks for data analysis, natural language processing, and machine learning. Nevertheless, we also introduce the underlying models and algorithms.
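To make the idea of a blueprint concrete, here is a minimal sketch of what such a reusable template might look like for word frequency analysis. It is not taken from the book: the function names, the tiny stop word list, and the sample texts are all assumptions made for illustration, and only the Python standard library is used.

```python
import re
from collections import Counter

# Hypothetical stop word list, kept deliberately small for illustration.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for"}


def tokenize(text):
    """Split text into lowercase word tokens using a regular expression."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def count_words(texts):
    """Count token frequencies across an iterable of documents."""
    counter = Counter()
    for text in texts:
        counter.update(remove_stop_words(tokenize(text)))
    return counter


if __name__ == "__main__":
    # Made-up sample documents, not a dataset from the book.
    reviews = [
        "The battery is great",
        "Great battery, poor screen",
    ]
    print(count_words(reviews).most_common(3))
```

Adapting a template like this to your own data would typically mean swapping in a different tokenizer pattern or stop word list while keeping the overall pipeline shape.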

We do not expect any previous knowledge in the field of natural language processing but provide you with the necessary background knowledge to get started quickly. In each chapter, we explain and discuss different solution approaches for the respective tasks with their potential strengths and weaknesses.

Thus, you will not only acquire the knowledge about how to solve a certain kind of problem but also get a set of ready-to-use blueprints that you can take and customize to your data and requirements.

Each of the 13 chapters includes a self-contained use case for a specific aspect of text analytics. Based on an example dataset, we develop and explain the blueprints step by step.

The choice of topics reflects the most common types of problems in our daily text analytics work. Typical tasks include data acquisition, statistical data exploration, and the use of supervised and unsupervised machine learning. The business questions range from content analysis (“What are people talking about?”) to automatic text categorization.


In this book you will learn how to solve text analytics problems efficiently with the Python ecosystem. We will explain all concepts specific to text analytics and machine learning in detail but assume that you already have basic knowledge of Python, including fundamental libraries like Pandas. You should also be familiar with Jupyter notebooks so that you can experiment with the code while reading the book. If not, check out the tutorials on learnpython.org, docs.python.org, or DataCamp.

Even though we explain the general ideas of the algorithms used, we won’t go too much into the details. You should be able to follow the examples and reuse the code without completely understanding the mathematics behind it. College-level knowledge of linear algebra and statistics is helpful, though.


O’Reilly’s mission is to change the world by sharing the knowledge of innovators. For over 40 years, we’ve inspired companies and individuals to do new things (and do them better) by providing the skills and understanding that are necessary for success.

At the heart of our business is a unique network of expert pioneers and practitioners who share their knowledge through the O’Reilly learning platform and our books—which have been heralded for decades as the definitive way to learn the technologies that are shaping the future. So individuals, teams, and organizations learn the tools, best practices, and emerging trends that will transform their industries.

Our customers are hungry to build the innovations that propel the world forward. And we help them do just that.

Table of Contents
Approach of the Book
Some Important Libraries to Know
Books to Read
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Chapter 1. Gaining Early Insights from Textual Data
What You’ll Learn and What We’ll Build
Exploratory Data Analysis
Introducing the Dataset
Blueprint: Getting an Overview of the Data with Pandas
Calculating Summary Statistics for Columns
Checking for Missing Data
Plotting Value Distributions
Comparing Value Distributions Across Categories
Visualizing Developments Over Time
Blueprint: Building a Simple Text Preprocessing Pipeline
Performing Tokenization with Regular Expressions
Treating Stop Words
Processing a Pipeline with One Line of Code
Blueprints for Word Frequency Analysis
Blueprint: Counting Words with a Counter
Blueprint: Creating a Frequency Diagram
Blueprint: Creating Word Clouds
Blueprint: Ranking with TF-IDF
Blueprint: Finding a Keyword-in-Context
Blueprint: Analyzing N-Grams
Blueprint: Comparing Frequencies Across Time Intervals and Categories
Creating Frequency Timelines
Creating Frequency Heatmaps
Closing Remarks
Chapter 2. Extracting Textual Insights with APIs
What You’ll Learn and What We’ll Build
Application Programming Interfaces
Blueprint: Extracting Data from an API Using the Requests Module
Rate Limiting
Blueprint: Extracting Twitter Data with Tweepy
Obtaining Credentials
Installing and Configuring Tweepy
Extracting Data from the Search API
Extracting Data from a User’s Timeline
Extracting Data from the Streaming API
Closing Remarks
Chapter 3. Scraping Websites and Extracting Data
What You’ll Learn and What We’ll Build
Scraping and Data Extraction
Introducing the Reuters News Archive
URL Generation
Blueprint: Downloading and Interpreting robots.txt
Blueprint: Finding URLs from sitemap.xml
Blueprint: Finding URLs from RSS
Downloading Data
Blueprint: Downloading HTML Pages with Python
Blueprint: Downloading HTML Pages with wget
Extracting Semistructured Data
Blueprint: Extracting Data with Regular Expressions
Blueprint: Using an HTML Parser for Extraction
Blueprint: Spidering
Introducing the Use Case
Error Handling and Production-Quality Software
Density-Based Text Extraction
Extracting Reuters Content with Readability
Summary Density-Based Text Extraction
All-in-One Approach
Blueprint: Scraping the Reuters Archive with Scrapy
Possible Problems with Scraping
Closing Remarks and Recommendation
Chapter 4. Preparing Textual Data for Statistics and Machine Learning
What You’ll Learn and What We’ll Build
A Data Preprocessing Pipeline
Introducing the Dataset: Reddit Self-Posts
Loading Data Into Pandas
Blueprint: Standardizing Attribute Names
Saving and Loading a DataFrame
Cleaning Text Data
Blueprint: Identify Noise with Regular Expressions
Blueprint: Removing Noise with Regular Expressions
Blueprint: Character Normalization with textacy
Blueprint: Pattern-Based Data Masking with textacy
Blueprint: Tokenization with Regular Expressions
Tokenization with NLTK
Recommendations for Tokenization
Linguistic Processing with spaCy
Instantiating a Pipeline
Processing Text
Blueprint: Customizing Tokenization
Blueprint: Working with Stop Words
Blueprint: Extracting Lemmas Based on Part of Speech
Blueprint: Extracting Noun Phrases
Blueprint: Extracting Named Entities
Feature Extraction on a Large Dataset
Blueprint: Creating One Function to Get It All
Blueprint: Using spaCy on a Large Dataset
Persisting the Result
A Note on Execution Time
There Is More
Language Detection
Token Normalization
Closing Remarks and Recommendations
Chapter 5. Feature Engineering and Syntactic Similarity
What You’ll Learn and What We’ll Build
A Toy Dataset for Experimentation
Blueprint: Building Your Own Vectorizer
Enumerating the Vocabulary
Vectorizing Documents
The Document-Term Matrix
The Similarity Matrix
Bag-of-Words Models
Blueprint: Using scikit-learn’s CountVectorizer
Blueprint: Calculating Similarities
TF-IDF Models
Optimized Document Vectors with TfidfTransformer
Introducing the ABC Dataset
Blueprint: Reducing Feature Dimensions
Blueprint: Improving Features by Making Them More Specific
Blueprint: Using Lemmas Instead of Words for Vectorizing Documents
Blueprint: Limit Word Types
Blueprint: Remove Most Common Words
Blueprint: Adding Context via N-Grams
Syntactic Similarity in the ABC Dataset
Blueprint: Finding Most Similar Headlines to a Made-up Headline
Blueprint: Finding the Two Most Similar Documents in a Large Corpus (Much More Difficult)
Blueprint: Finding Related Words
Tips for Long-Running Programs like Syntactic Similarity
Summary and Conclusion
Chapter 6. Text Classification Algorithms
What You’ll Learn and What We’ll Build
Introducing the Java Development Tools Bug Dataset
Blueprint: Building a Text Classification System
Step 1: Data Preparation
Step 2: Train-Test Split
Step 3: Training the Machine Learning Model
Step 4: Model Evaluation
Final Blueprint for Text Classification
Blueprint: Using Cross-Validation to Estimate Realistic Accuracy Metrics
Blueprint: Performing Hyperparameter Tuning with Grid Search
Blueprint Recap and Conclusion
Closing Remarks
Further Reading
Chapter 7. How to Explain a Text Classifier
What You’ll Learn and What We’ll Build
Blueprint: Determining Classification Confidence Using Prediction Probability
Blueprint: Measuring Feature Importance of Predictive Models
Blueprint: Using LIME to Explain the Classification Results
Blueprint: Using ELI5 to Explain the Classification Results
Blueprint: Using Anchor to Explain the Classification Results
Using the Distribution with Masked Words
Working with Real Words
Closing Remarks
Chapter 8. Unsupervised Methods: Topic Modeling and Clustering
What You’ll Learn and What We’ll Build
Our Dataset: UN General Debates
Checking Statistics of the Corpus
Nonnegative Matrix Factorization (NMF)
Blueprint: Creating a Topic Model Using NMF for Documents
Blueprint: Creating a Topic Model for Paragraphs Using NMF
Latent Semantic Analysis/Indexing
Blueprint: Creating a Topic Model for Paragraphs with SVD
Latent Dirichlet Allocation
Blueprint: Creating a Topic Model for Paragraphs with LDA
Blueprint: Visualizing LDA Results
Blueprint: Using Word Clouds to Display and Compare Topic Models
Blueprint: Calculating Topic Distribution of Documents and Time Evolution
Using Gensim for Topic Modeling
Blueprint: Preparing Data for Gensim
Blueprint: Performing Nonnegative Matrix Factorization with Gensim
Blueprint: Using LDA with Gensim
Blueprint: Calculating Coherence Scores
Blueprint: Finding the Optimal Number of Topics
Blueprint: Creating a Hierarchical Dirichlet Process with Gensim
Blueprint: Using Clustering to Uncover the Structure of Text Data
Further Ideas
Summary and Recommendation
Chapter 9. Text Summarization
What You’ll Learn and What We’ll Build
Text Summarization
Extractive Methods
Data Preprocessing
Blueprint: Summarizing Text Using Topic Representation
Identifying Important Words with TF-IDF Values
LSA Algorithm
Blueprint: Summarizing Text Using an Indicator Representation
Measuring the Performance of Text Summarization Methods
Blueprint: Summarizing Text Using Machine Learning
Step 1: Creating Target Labels
Step 2: Adding Features to Assist Model Prediction
Step 3: Build a Machine Learning Model
Closing Remarks
Further Reading
Chapter 10. Exploring Semantic Relationships with Word Embeddings
What You’ll Learn and What We’ll Build
The Case for Semantic Embeddings
Word Embeddings
Analogy Reasoning with Word Embeddings
Types of Embeddings
Blueprint: Using Similarity Queries on Pretrained Models
Loading a Pretrained Model
Similarity Queries
Blueprints for Training and Evaluating Your Own Embeddings
Data Preparation
Blueprint: Training Models with Gensim
Blueprint: Evaluating Different Models
Blueprints for Visualizing Embeddings
Blueprint: Applying Dimensionality Reduction
Blueprint: Using the TensorFlow Embedding Projector
Blueprint: Constructing a Similarity Tree
Closing Remarks
Further Reading
Chapter 11. Performing Sentiment Analysis on Text Data
What You’ll Learn and What We’ll Build
Sentiment Analysis
Introducing the Amazon Customer Reviews Dataset
Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches
Bing Liu Lexicon
Disadvantages of a Lexicon-Based Approach
Supervised Learning Approaches
Preparing Data for a Supervised Learning Approach
Blueprint: Vectorizing Text Data and Applying a Supervised Machine Learning Algorithm
Step 1: Data Preparation
Step 2: Train-Test Split
Step 3: Text Vectorization
Step 4: Training the Machine Learning Model
Pretrained Language Models Using Deep Learning
Deep Learning and Transfer Learning
Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model
Step 1: Loading Models and Tokenization
Step 2: Model Training
Step 3: Model Evaluation
Closing Remarks
Further Reading
Chapter 12. Building a Knowledge Graph
What You’ll Learn and What We’ll Build
Knowledge Graphs
Information Extraction
Introducing the Dataset
Named-Entity Recognition
Blueprint: Using Rule-Based Named-Entity Recognition
Blueprint: Normalizing Named Entities
Merging Entity Tokens
Coreference Resolution
Blueprint: Using spaCy’s Token Extensions
Blueprint: Performing Alias Resolution
Blueprint: Resolving Name Variations
Blueprint: Performing Anaphora Resolution with NeuralCoref
Name Normalization
Entity Linking
Blueprint: Creating a Co-Occurrence Graph
Extracting Co-Occurrences from a Document
Visualizing the Graph with Gephi
Relation Extraction
Blueprint: Extracting Relations Using Phrase Matching
Blueprint: Extracting Relations Using Dependency Trees
Creating the Knowledge Graph
Don’t Blindly Trust the Results
Closing Remarks
Further Reading
Chapter 13. Using Text Analytics in Production
What You’ll Learn and What We’ll Build
Blueprint: Using Conda to Create Reproducible Python Environments
Blueprint: Using Containers to Create Reproducible Environments
Blueprint: Creating a REST API for Your Text Analytics Model
Blueprint: Deploying and Scaling Your API Using a Cloud Provider
Blueprint: Automatically Versioning and Deploying Builds
Closing Remarks
Further Reading
About the Authors

Product information

Publisher: O’Reilly Media; 1st edition (December 29, 2020)
Paperback: 424 pages
Item Weight: 1.5 pounds
Dimensions: 7 x 1 x 9.2 inches


Download Blueprints for Text Analytics Using Python PDF Free:

You can download Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications as a PDF by clicking the link given below. If the PDF link is not responding, please let us know in the comments section and we will fix it soon.

Click Here to download

NOTE: We do not own the copyrights to these books. We share this material with our audience for educational purposes only, and we strongly encourage our visitors to purchase original books from the respective publishers. If you hold the copyright and want this content removed, or feel that we have violated your copyright, please contact us immediately by email: [email protected]
