Compare Vocabulary Differences Between Ranking Web Pages On SERP With Python – Search Engine Journal

September 3, 2022

Stay ahead of the competition with better content. Analyze context and differences between your page and top competitors in the SERPs.
Vocabulary size and vocabulary difference are concepts from semantics and mathematical, quantitative linguistics.
For example, Heaps’ law states that vocabulary size grows with the length of the article, but sublinearly: after a certain threshold, the same words keep reappearing without improving the vocabulary size.
Word2Vec uses Continuous Bag of Words (CBOW) and Skip-gram to learn locally, contextually relevant words and their distance to each other, while GloVe uses matrix factorization over co-occurrence counts within context windows.
Zipf’s law is a complementary theory to Heaps’ law. It states that a word’s frequency is inversely proportional to its rank: the most frequent word appears roughly twice as often as the second most frequent one.
There are other distributional semantics and linguistic theories in statistical natural language processing.
But “vocabulary comparison” is a fundamental methodology for search engines to understand “topicality differences,” “the main topic of the document,” or overall “expertise of the document.”
Paul Haahr of Google stated that it compares the “query vocabulary” to the “document vocabulary.”
David C. Taylor and his designs for context domains use word vectors in vector search to determine which document, and which document subsection, is most about which topic, so that a search engine can rank and rerank documents based on search query modifications.
Comparing vocabulary differences between ranking web pages on the search engine results page (SERP) helps SEO pros see what contexts, concurrent words, and word proximity they are skipping compared to their competitors.
It is helpful to see context differences in the documents.
In this guide, the Python programming language is used to search Google, retrieve the SERP items (snippets), crawl their content, tokenize it, and compare the documents’ vocabularies to each other.
To compare the vocabularies of ranking web documents with Python, this guide uses the googlesearch, advertools, NLTK, and pandas packages, along with the built-in collections and string modules.
The steps for comparing the vocabulary size and content between ranking web pages are listed below.
Import the necessary Python libraries and packages by using the “from” and “import” commands and methods.
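The original import block did not survive the page extraction; below is a minimal sketch of what it likely contained. The package names are my reading of the prose, and the network-dependent third-party imports are commented out so the snippet runs anywhere:

```python
# Third-party packages used in this tutorial (install first):
#   pip install googlesearch-python advertools nltk
# from googlesearch import search
# import advertools as adv
# import nltk
import string                      # punctuation constant for token cleaning
from collections import Counter    # word occurrence counting
import pandas as pd                # data frame construction and plotting
```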
Use the “nltk.download()” call only if you’re using NLTK for the first time. Download all the corpora, models, and packages. It will open a window as below.
Refresh the window from time to time; once everything is green, close the window so that the code running in your code editor stops and completes.
If you do not have some of the modules above, install them to your local machine with “pip install”. If you have a closed-environment project, use a virtual environment in Python.
To perform a Google search and retrieve the result URLs from the SERP items, use a for loop over the “search” object, which comes from the “googlesearch” package.
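The stripped search loop likely resembled the sketch below. The call signature is an assumption based on the googlesearch-python fork (parameter names differ between googlesearch forks), so the network call is commented out; the small dedup helper is my addition:

```python
# Network-dependent SERP collection (commented out; signature varies by fork):
# from googlesearch import search
# results = search("search engine optimization", num_results=20)

def collect_serp_urls(results):
    """Collect SERP URLs from an iterable of results, deduplicating
    while preserving the original rank order."""
    seen, urls = set(), []
    for url in results:
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```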
The explanation of the code block above is:
You can see the result below.
The ranking URLs for the query “search engine optimization” are given above.
The next step is parsing these URLs for further cleaning.
If the results include “video content,” a healthy text analysis isn’t possible unless the video has a long description or many comments, and even then, it is a different content type.
To clean the video content URLs, use the code block below.
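Since the original cleaning snippet was lost, here is a minimal stand-in that drops URLs hosted on the video platforms the article names. Matching the host against a list of platform names is my approach; the exact method in the original may differ:

```python
from urllib.parse import urlparse

# Video platforms named in the article; pages on these hosts are dropped.
VIDEO_HOSTS = ("youtube", "vimeo", "dailymotion", "sproutvideo", "d.tube", "wistia")

def clean_video_urls(urls):
    """Remove URLs whose host belongs to a known video platform."""
    cleaned = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if not any(name in host for name in VIDEO_HOSTS):
            cleaned.append(url)
    return cleaned
```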
Video platforms such as YouTube, Vimeo, Dailymotion, Sproutvideo, DTube, and Wistia are removed from the resulting URLs if they appear in the results.
You can use the same cleaning methodology for the websites that you think will dilute the efficiency of your analysis or break the results with their own content type.
For example, Pinterest or other visual-heavy websites might not be necessary to check the “vocabulary size” differences between competing documents.
Explanation of code block above:
You can see the result below.
Crawl the cleaned URLs to retrieve their content with advertools.
You can also use requests with a for loop and list-append methodology, but advertools is faster at crawling and creating the data frame with the resulting output.
With requests, you would have to manually retrieve and unite all the “p” and heading elements.
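The crawl step can be sketched as below. The advertools calls need network access, so they are commented out; the small helper for picking one page’s text out of the crawl output is my addition (the column name “body_text” comes from the article):

```python
import pandas as pd

# Crawl sketch (advertools writes jsonlines output; commented out
# because it needs network access):
# import advertools as adv
# adv.crawl(cleaned_urls, "serp_crawl.jl", follow_links=False)
# crawl_df = pd.read_json("serp_crawl.jl", lines=True)

def get_body_text(crawl_df, index):
    """Return the text of one page from the crawl output's body_text column."""
    return crawl_df["body_text"][index]
```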
Explanation of code block above:
You can see the result below.
You can see our result URLs and all their on-page SEO elements, including response headers, response sizes, and structured data information.
Tokenization of the content of the web pages requires choosing the “body_text” column of the advertools crawl output and using “word_tokenize” from NLTK.
The code line above retrieves the entire content of one of the result pages, as below.
To tokenize these sentences, use the code block below.
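The article’s tokenization uses NLTK’s “word_tokenize”, which requires the downloaded “punkt” models; a dependency-free stand-in based on a regular expression is sketched below for illustration:

```python
import re

# The article's version (requires nltk.download("punkt")):
#   from nltk.tokenize import word_tokenize
#   tokens = word_tokenize(body_text)

def simple_tokenize(text):
    """Dependency-free stand-in: split text into word-like tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = simple_tokenize("Search engine optimization improves a page's visibility.")
```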
We tokenized the content of the first document and checked how many words it had.
The first document we tokenized for the query “search engine optimization” has 11,211 words, and this number includes boilerplate content.
Remove the punctuation and the stop words, as below.
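A minimal sketch of the noise-removal step is below. The article uses NLTK’s English stop word list (which needs a download), so a tiny stand-in stop word set is used here for illustration:

```python
import string

# The article's version: from nltk.corpus import stopwords
#   stop_words = set(stopwords.words("english"))
# Tiny stand-in list for illustration only:
stop_words = {"the", "and", "is", "of", "a", "to", "in"}

def remove_noise(tokens):
    """Drop stop words (case-insensitive) and pure-punctuation tokens."""
    return [t for t in tokens
            if t.lower() not in stop_words and t not in string.punctuation]

cleaned = remove_noise(["SEO", "is", "the", "practice", ","])
```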
Explanation of code block above:
The new length of our tokenized word list is “5319”. It shows that nearly half of the vocabulary of the document consists of stop words or punctuation.
It might mean that only about 47% of the words are contextual, and the rest are functional.
To count the occurrences of the words from the corpus, use the “Counter” object from the “collections” module, as below.
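The counting step can be sketched as below; the sample token list is invented for illustration. The article then loads the pairs into a pandas DataFrame and sorts by the Counts column, which “most_common” already mirrors:

```python
from collections import Counter

# Sample tokens standing in for a cleaned document:
tokens = ["seo", "google", "seo", "search", "seo", "google"]
counts = Counter(tokens)
top = counts.most_common(3)   # most frequent words first
```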
An explanation of the code block is below.
You can see the result below.
We do not see any stop words in the results, but some interesting punctuation marks remain.
That happens because some websites use different characters for the same purposes, such as curly (smart) quotes, straight single quotes, and straight double quotes.
And the string module’s “punctuation” constant doesn’t include those.
Thus, to clean our data frame, we will use a custom lambda function as below.
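A minimal sketch of that cleanup is below. The exact smart-quote character set and the sample data frame are my assumptions; the filtering lambda mirrors what the article describes:

```python
import pandas as pd

# Smart-quote characters that string.punctuation does not cover:
SMART_PUNCT = {"’", "‘", "“", "”"}

counts_df = pd.DataFrame({"Words": ["SEO", "’", "“", "page"],
                          "Counts": [10, 4, 3, 2]})
# Keep only rows whose token is not a leftover smart-quote character:
counts_df = counts_df[counts_df["Words"].apply(lambda w: w not in SMART_PUNCT)]
```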
Explanation of code block:
You can see the result below.
We see the most used words in the “Search Engine Optimization” related ranking web document.
With pandas’ “plot” method, we can visualize it easily, as below.
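A sketch of the plot call is below, using a made-up word-count frame. The headless backend line is my addition so the snippet runs without a display; the “head(20)” slice matches the top-20 view described in the article:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import pandas as pd

counts_df = pd.DataFrame({"Words": ["SEO", "Google", "search"],
                          "Counts": [93, 71, 43]})
# Bar chart of the 20 most used words:
ax = counts_df.head(20).plot(kind="bar", x="Words", y="Counts",
                             title="Most used words (top 20)")
```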
Explanation of code block above:
Pandas DataFrame plotting is an extensive topic. If you want to use Plotly as the pandas visualization back-end, check the Visualization of Hot Topics for News SEO.
You can see the result below.
Now, we can choose our second URL to start our comparison of vocabulary size and occurrence of words.
To compare the previous SEO content to a competing web document, we will use SEJ’s SEO guide. You can see a compressed version of the steps followed until now for the second article.
We combined everything: tokenization, removal of stop words and punctuation, replacement of curly quotation marks, word counting, data frame construction, data frame sorting, and visualization.
Below, you can see the result.
The SEJ article is in the eighth ranking position.
Number eight means it is the eighth row of the crawl output data frame, which corresponds to the SEJ article for SEO. You can see the result below.
We see that the 20 most used words between the SEJ SEO article and other competing SEO articles differ.
The fundamental step to automating any SEO task with Python is wrapping all the steps and necessities under a certain Python function with different possibilities.
The function that you will see below has a conditional statement. If you pass a single article, it uses a single visualization call; for multiple ones, it creates sub-plots according to the sub-plot count.
To keep the article concise, I won’t add an explanation for those. Still, if you check the previous SEJ Python SEO tutorials I have written, you will recognize similar wrapper functions.
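Since the original function was stripped from the page, here is a sketch with the same name and the single-versus-subplot conditional the article describes. Unlike the original, which reads article content from the crawl data frame internally, this version takes the article texts as an explicit parameter, and it uses a simple regex tokenizer instead of NLTK:

```python
import re
from collections import Counter
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def tokenize_visualize(articles, texts):
    """Tokenize the chosen articles, count their words, and plot the
    top 20 per article: one plot for a single article, sub-plots for
    multiple articles."""
    counted = {i: Counter(re.findall(r"[A-Za-z0-9']+", texts[i]))
               for i in articles}
    if len(articles) == 1:
        fig, ax = plt.subplots()
        axes = [ax]
    else:
        fig, axes = plt.subplots(1, len(articles),
                                 figsize=(6 * len(articles), 4))
    for ax, i in zip(axes, articles):
        words, counts = zip(*counted[i].most_common(20))
        ax.bar(words, counts)
        ax.set_title(f"Article {i}")
    return fig
```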
Let’s use it.
tokenize_visualize(articles=[1, 8, 4])
We wanted to take the first, eighth, and fourth articles and visualize their top 20 words and their occurrences; you can see the result below.
Comparing the unique word count between the documents is quite easy, thanks to pandas. You can check the custom function below.
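A minimal sketch of such a comparison helper is below, assuming each document already has a Words/Counts data frame like the ones built earlier; the function name and return shape are my own:

```python
import pandas as pd

def unique_word_stats(counts_df):
    """Summarize one document's word-count frame: number of unique
    words and the total contextual (non-noise) word count."""
    return {"unique_words": int(counts_df["Words"].nunique()),
            "contextual_words": int(counts_df["Counts"].sum())}
```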
The result is below.
The bottom of each result block shows the number of unique values, which is the number of unique words in the document.

```
                  Words  Counts
16               Google      71
82                  SEO      66
186              search      43
228                site      28
274                page      27
...                 ...     ...
510   markup/structured       1
1                Recent       1
514             mistake       1
515              bottom       1
1024           LinkedIn       1

[1025 rows x 2 columns]

Number of unique words:
Words     1025
Counts      24
dtype: int64

Total contextual word count: 2399
Total word count: 4918
```

```
       Words  Counts
9        SEO      93
242   search      25
64     Guide      23
40   Content      17
13    Google      17
..       ...     ...
229   Action       1
228   Moving       1
227    Agile       1
226       32       1
465     news       1

[466 rows x 2 columns]

Number of unique words:
Words     466
Counts     16
dtype: int64

Total contextual word count: 1019
Total word count: 1601
```

```
          Words  Counts
166         SEO      86
160      search      76
32      content      46
368        page      40
327       links      39
...         ...     ...
695        idea       1
697      talked       1
698     earlier       1
699   Analyzing       1
1326   Security       1

[1327 rows x 2 columns]

Number of unique words:
Words     1327
Counts      31
dtype: int64

Total contextual word count: 3418
Total word count: 6728
```
There are 1025 unique words out of 2399 non-stopword and non-punctuation contextual words. The total word count is 4918.
The most used five words are “Google,” “SEO,” “search,” “site,” and “page” for “Wordstream.” You can see the others with the same numbers.
Auditing what distinctive words appear in competing documents helps you see where the document weighs more and how it creates a difference.
The methodology is simple: the “set” object type has a “difference” method to show the differing values between two sets.
To keep things concise, I won’t explain the function lines one by one, but basically, we take the unique words in multiple articles and compare them to each other.
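The core of that comparison can be sketched in a few lines; the function name is my own, but the set “difference” call is the one the article names:

```python
def vocabulary_difference(words_a, words_b):
    """Words that appear in document A but not in document B,
    via the set type's difference method."""
    return set(words_a).difference(words_b)
```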
You can see the result below.
Words that appear on TechTarget but not on Moz are below:
Use the custom function below to see how often these words are used in the specific document.
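Since the original custom function was lost, here is a stand-in sketch: it counts how often a document’s unique words occur in its token list, most frequent first. The name and signature are my assumptions:

```python
from collections import Counter

def difference_word_counts(tokens, unique_words, top_n=5):
    """Occurrences of the words unique to this document,
    ordered from most to least frequent."""
    counts = Counter(t for t in tokens if t in unique_words)
    return counts.most_common(top_n)
```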
The results are below.
The vocabulary difference between TechTarget and Moz for the “search engine optimization” query from TechTarget’s perspective is above. We can reverse it.
Change the order of the numbers and check it from the other perspective.
You can see that Wordstream has 868 unique words that do not appear on Boosmart, and the top five and tail five are given above with their occurrences.
The vocabulary difference audit can be improved further with “weighted frequency” by checking the query information and the query network.
But, for teaching purposes, this is already a heavy, detailed, and advanced Python, data science, and SEO-intensive tutorial.
See you in the next guides and tutorials.
Koray Tuğberk GÜBÜR is the CEO and Founder of Holistic SEO & Digital where he provides SEO Consultancy, Web Development, …