In this guest post, Michael Heydt, the author of the Python Web Scraping Cookbook, shows how to scrape a job listing from StackOverflow using Python.

It’s Easy to Scrape Stack Overflow

StackOverflow actually makes it quite easy to scrape data from their pages. In this article, you’ll be shown how to use content from a posting at https://stackoverflow.com/jobs/122517/spacex-enterprise-software-engineer-full-stack-spacex?so=p&sec=True&pg=1&offset=22&cl=Amazon%3b+.  You’ll note that StackOverflow job listing pages are very structured.  It’s probably because they’re created by programmers and for programmers. The page looks like the following:

All of this information is codified within the HTML of the page.  You can see for yourself by analyzing the page content.  But what’s great about StackOverflow is that it puts much of its page data in an embedded JSON object.  This is placed within a <script type=”application/ld+json”> HTML tag, so it’s really easy to find.  The following shows a truncated section of this tag (the description is truncated, but all the tags are shown):

This makes it very easy to get hold of the content, as you can simply retrieve the page, find this tag, and then convert this JSON into a Python object with the json library.  In addition to the actual job description, much of the “metadata” of the job posting is also included, such as skills, industries, benefits, and location information.  You don’t need to search the HTML for the information—just find this tag and load the JSON.  Please note that if you want to find specific items (such as Job Responsibilities), you’ll still need to parse the description.  The description contains the full HTML, so you’ll still need to deal with HTML tags while parsing it.

Start scraping a job listing from StackOverflow

It’s time to get the job description from this page; you simply have to retrieve the contents. Here’s a step by step walkthrough of the code:

  1. First, you need to read the file:
with open("spacex-job-listing.txt", "r") as file:
    content = file.read()
  1. Then, load the content into a BeautifulSoup object, and retrieve the <script type=”application/ld+json”> tag:
bs = BeautifulSoup(content, "lxml")
script_tag = bs.find("script", {"type": "application/ld+json"})
  1. Now that you have the tag, you can simply load its contents into a Python dictionary using the json library:  
job_listing_contents = json.loads(script_tag.contents[0])
print(job_listing_contents)

The output of this looks like the following (this is truncated for brevity):

{'@context': 'http://schema.org', '@type': 'JobPosting', 'title': 'SpaceX Enterprise Software Engineer, Full 
Stack', 'skills': ['c#', 'sql', 'javascript', 'asp.net', 'angularjs'], 'description': '<h2>About this 
job</h2>\r\n<p><span>Location options: <strong>Paid relocation</strong></span><br/>
<span>Job type: <strong>Permanent</strong></span><br/><span>Experience level: <strong>Mid-
Level, Senior</strong></span><br/><span>Role: <strong>Full Stack Developer</strong></span>
<br/><span>Industry: <strong>Aerospace, Information Technology, Web Development</strong>
  1. Now do some simple tasks with this without involving HTML parsing.  For example, you can retrieve the skills required for the job with just a few lines of code:
# print the skills
for skill in job_listing_contents["skills"]:
    print(skill)

The above produces the following output:

c#
sql
javascript
asp.net
angularjs

Reading and cleaning the description in the job listing

If you take a close look, you’ll see that the description of the job listing is still in HTML.  In order to extract valuable content out of this data, you’ll need to parse this HTML and perform several other steps:

  1. Start by creating a BeautifulSoup object from the description key of the description you loaded.  Print this to see what it looks like:
desc_bs = BeautifulSoup(job_listing_contents["description"], "lxml")
print(desc_bs)

 

<p><span>Location options: <strong>Paid relocation</strong></span><br/><span>Job type: <strong>Permanent</strong></span><br/><span>Experience level: <strong>Mid-Level, Senior</strong></span><br/><span>Role: <strong>Full Stack Developer</strong></span><br/><span>Industry: <strong>Aerospace, Information Technology, Web Development</strong></span><br/><span>Company size: <strong>1k-5k people</strong></span><br/><span>Company type: <strong>Private</strong></span><br/></p><br/><br/><h2>Technologies</h2> <p>c#, sql, javascript, asp.net, angularjs</p> <br/><br/><h2>Job description</h2> <p><strong>Full Stack Enterprise Software Engineer</strong></p>
<p>The EIS (Enterprise Information Systems) team writes the software that builds rockets and powers SpaceX. We are responsible for all of the software on the factory floor, the warehouses, the financial systems, the restaurant, and even the public home page. Elon has called us the "nervous system" of SpaceX because we connect all of the other teams at SpaceX to ensure that the entire rocket building process runs smoothly.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS. We are a fast-paced, highly iterative team that has to adapt quickly as our factory grows. We need people who are comfortable tackling new problems, innovating solutions, and interacting with every facet of the company on a daily basis. Creative, motivated, able to take responsibility and support the applications you create. Help us get rockets out the door faster!</li>
</ul>
<p><strong>Basic Qualifications:</strong></p>
<ul>
<li>Bachelor's degree in computer science, engineering, physics, mathematics, or similar technical discipline.</li>
<li>3+ years of experience developing across a full-stack:  Web server, relational database, and client-side (HTML/Javascript/CSS).</li>
</ul>
<p><strong>Preferred Skills and Experience:</strong></p>
<ul>
<li>Database - Understanding of SQL. Ability to write performant SQL. Ability to diagnose queries, and work with DBAs.</li>
<li>Server - Knowledge of how web servers operate on a low-level. Web protocols. Designing APIs. How to scale web sites. Increase performance and diagnose problems.</li>
<li>UI - Demonstrated ability creating rich web interfaces using a modern client side framework. Good judgment in UX/UI design.  Understands the finer points of HTML, CSS, and Javascript - know which tools to use when and why.</li>
<li>System architecture - Knowledge of how to structure a database, web site, and rich client side application from scratch.</li>
<li>Quality - Demonstrated usage of different testing patterns, continuous integration processes, build deployment systems. Continuous monitoring.</li>
<li>Current - Up to date with current trends, patterns, goings on in the world of web development as it changes rapidly. Strong knowledge of computer science fundamentals and applying them in the real-world.</li>
</ul> <br/><br/></body></html>
  1. You need to go through this and remove all the HTML tags so that you’re only left with the text of the description.  This is what you’ll then tokenize.  Fortunately, throwing out all the HTML tags is easy with BeautifulSoup:
just_text = desc_bs.find_all(text=True)
print(just_text)

 

['About this job', '\n', 'Location options: ', 'Paid relocation', 'Job type: ', 'Permanent', 'Experience level: ', 'Mid-Level, Senior', 'Role: ', 'Full Stack Developer', 'Industry: ', 'Aerospace, Information Technology, Web Development', 'Company size: ', '1k-5k people', 'Company type: ', 'Private', 'Technologies', ' ', 'c#, sql, javascript, asp.net, angularjs', ' ', 'Job description', ' ', 'Full Stack Enterprise\xa0Software Engineer', '\n', 'The EIS (Enterprise Information Systems) team writes the software that builds rockets and powers SpaceX. We are responsible for all of the software on the factory floor, the warehouses, the financial systems, the restaurant, and even the public home page. Elon has called us the "nervous system" of SpaceX because we connect all of the other teams at SpaceX to ensure that the entire rocket building process runs smoothly.', '\n', 'Responsibilities:', '\n', '\n', 'We are seeking developers with demonstrable experience in: ASP.NET, C#, SQL Server, and AngularJS. We are a fast-paced, highly iterative team that has to adapt quickly as our factory grows. We need people who are comfortable tackling new problems, innovating solutions, and interacting with every facet of the company on a daily basis. Creative, motivated, able to take responsibility and support the applications you create. Help us get rockets out the door faster!', '\n', '\n', 'Basic Qualifications:', '\n', '\n', "Bachelor's degree in computer science, engineering, physics, mathematics, or similar technical discipline.", '\n', '3+ years of experience developing across a full-stack:\xa0 Web server, relational database, and client-side (HTML/Javascript/CSS).', '\n', '\n', 'Preferred Skills and Experience:', '\n', '\n', 'Database - Understanding of SQL. Ability to write performant SQL. Ability to diagnose queries, and work with DBAs.', '\n', 'Server - Knowledge of how web servers operate on a low-level. Web protocols. Designing APIs. How to scale web sites. Increase performance and diagnose problems.', '\n', 'UI - Demonstrated ability creating rich web interfaces using a modern client side framework. Good judgment in UX/UI design.\xa0 Understands the finer points of HTML, CSS, and Javascript - know which tools to use when and why.', '\n', 'System architecture - Knowledge of how to structure a database, web site, and rich client side application from scratch.', '\n', 'Quality - Demonstrated usage of different testing patterns, continuous integration processes, build deployment systems. Continuous monitoring.', '\n', 'Current - Up to date with current trends, patterns, goings on in the world of web development as it changes rapidly. Strong knowledge of computer science fundamentals and applying them in the real-world.', '\n', ' ']

Superb; this is already broken down into what can be considered sentences!

  1. Join all these together, tokenize them, get rid of stop words, and then apply common tech job 2-grams:
joined = ' '.join(just_text)
tokens = word_tokenize(joined)

 

stop_list = stopwords.words('english')
with_no_stops = [word for word in tokens if word not in stop_list]
cleaned = remove_punctuation(two_grammed)
print(cleaned)

This yields the following output:

 ['job', 'Location', 'options', 'Paid relocation', 'Job', 'type', 'Permanent', 'Experience', 'level', 'Mid-Level', 'Senior', 'Role', 'Full-Stack', 'Developer', 'Industry', 'Aerospace', 'Information Technology', 'Web Development', 'Company', 'size', '1k-5k', 'people', 'Company', 'type', 'Private', 'Technologies', 'c#', 'sql', 'javascript', 'asp.net', 'angularjs', 'Job', 'description', 'Full-Stack', 'Enterprise Software', 'Engineer', 'EIS', 'Enterprise', 'Information', 'Systems', 'team', 'writes', 'software', 'builds', 'rockets', 'powers', 'SpaceX', 'responsible', 'software', 'factory', 'floor', 'warehouses', 'financial', 'systems', 'restaurant', 'even', 'public', 'home', 'page', 'Elon', 'called', 'us', 'nervous', 'system', 'SpaceX', 'connect', 'teams', 'SpaceX', 'ensure', 'entire', 'rocket', 'building', 'process', 'runs', 'smoothly', 'Responsibilities', 'seeking', 'developers', 'demonstrable experience', 'ASP.NET', 'C#', 'SQL Server', 'AngularJS', 'fast-paced', 'highly iterative', 'team', 'adapt quickly', 'factory', 'grows', 'need', 'people', 'comfortable', 'tackling', 'new', 'problems', 'innovating', 'solutions', 'interacting', 'every', 'facet', 'company', 'daily', 'basis', 'Creative', 'motivated', 'able', 'take', 'responsibility', 'support', 'applications', 'create', 'Help', 'us', 'get', 'rockets', 'door', 'faster', 'Basic', 'Qualifications', 'Bachelor', "'s", 'degree', 'computer science', 'engineering', 'physics', 'mathematics', 'similar', 'technical', 'discipline', '3+', 'years', 'experience', 'developing', 'across', 'full-stack', 'Web server', 'relational database', 'client-side', 'HTML/Javascript/CSS', 'Preferred', 'Skills', 'Experience', 'Database', 'Understanding', 'SQL', 'Ability', 'write', 'performant', 'SQL', 'Ability', 'diagnose', 'queries', 'work', 'DBAs', 'Server', 'Knowledge', 'web', 'servers', 'operate', 'low-level', 'Web', 'protocols', 'Designing', 'APIs', 'scale', 'web', 'sites', 'Increase', 'performance', 'diagnose', 'problems', 'UI', 'Demonstrated', 'ability', 'creating', 'rich', 'web', 'interfaces', 'using', 'modern', 'client-side', 'framework', 'Good', 'judgment', 'UX/UI', 'design', 'Understands', 'finer', 'points', 'HTML', 'CSS', 'Javascript', 'know', 'tools', 'use', 'System', 'architecture', 'Knowledge', 'structure', 'database', 'web', 'site', 'rich', 'client-side', 'application', 'scratch', 'Quality', 'Demonstrated', 'usage', 'different', 'testing', 'patterns', 'continuous integration', 'processes', 'build', 'deployment', 'systems', 'Continuous monitoring', 'Current', 'date', 'current trends', 'patterns', 'goings', 'world', 'web development', 'changes', 'rapidly', 'Strong', 'knowledge', 'computer science', 'fundamentals', 'applying', 'real-world']

You now have a very nice and refined set of keywords pulled out of the job listing.

If you enjoyed this article and want to learn more about several other web scraping concepts, you can always refer to this book, Python Web Scraping Cookbook. With over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS, the book is a must-have for Python programmers, web administrators, security professionals, and anyone interested in performing web analytics.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.