Scrape CENTCOM's data - part 2
22 Jan 2015
All right, now for part two of our quick tutorial on scraping CENTCOM.
Important note: I asked my pal Balto for advice in his field of expertise, Python. Long story short, he corrected a bunch of things in my script, from uncapitalised variables to indentation, as well as some more important things.
Anyway, moving on.
Preparing the variables
As we did before with BASE_URL, we're going to define the basic URLs that we'll use later.
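The original snippet isn't reproduced in this excerpt; a minimal sketch could look like the following. The exact paths are placeholders, not the real CENTCOM URLs:

```python
# Placeholder URLs: the exact paths are assumptions about the site's layout,
# carried over from part one. Adjust them to the real index page.
BASE_URL = "http://www.centcom.mil"
INDEX_URL = BASE_URL + "/en/news"  # hypothetical press-release index
```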
Getting the content
We don't need to change anything in our get_links() function, which grabs all the links from the index page.
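For reference, a sketch of what get_links() plausibly looks like, assuming requests and BeautifulSoup as in part one. The CSS selector and URL structure are guesses; inspect the real index page to adapt them:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.centcom.mil"  # assumed, from part one

def get_links(index_url):
    """Grab all press-release links from the index page (unchanged from part one)."""
    return parse_links(requests.get(index_url).text)

def parse_links(html):
    # The "div.news-item a" selector is an assumption about the page's DOM;
    # the links are assumed to be relative, hence the BASE_URL prefix.
    soup = BeautifulSoup(html, "html.parser")
    return [BASE_URL + a["href"] for a in soup.select("div.news-item a")]
```

Splitting the parsing into parse_links() keeps the HTML-to-list logic testable without hitting the network.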
However, we'll need a new function to scrape the press releases.
So, we call another soup() on our links, and then, as we did before for get_links(), we throw in some parameters, i.e. the DOM elements containing the body of the press releases.
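A sketch of that new function, again with requests and BeautifulSoup. The h1.title and div.body selectors are hypothetical stand-ins for whatever DOM elements actually hold the press-release content:

```python
import requests
from bs4 import BeautifulSoup

def get_content(url):
    """Scrape one press release: fetch the page, then pull out title and body."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return parse_release(soup)

def parse_release(soup):
    # Hypothetical selectors; replace them with the elements that actually
    # contain the title and body on the real press-release pages.
    title = soup.select_one("h1.title").get_text(strip=True)
    body = soup.select_one("div.body").get_text(strip=True)
    return {"title": title, "body": body}
```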
Calling all our stuff
Balto made some adjustments to the boilerplate used to call the functions, so let's just re-use it as is:
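The boilerplate itself isn't in this excerpt; a likely shape is the standard Python entry-point pattern, assuming the get_links() and INDEX_URL names from the earlier steps:

```python
import json

def main():
    # get_links() and INDEX_URL are assumed defined as in the earlier steps.
    links = get_links(INDEX_URL)
    with open("links.json", "w") as f:
        json.dump(links, f, indent=2)
```

Guarded with the usual `if __name__ == "__main__": main()` line at the bottom, so the functions can also be imported from another script.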
Then I propose we simplify: instead of storing the URLs in a JSON file, let's just use the variable containing these URLs to scrape the press releases directly. Like this:
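A sketch of that simplification, with the same hypothetical selectors as above. The list returned by get_links() is fed straight into the scraper, no intermediate links file needed:

```python
import json
import requests
from bs4 import BeautifulSoup

def scrape_all(links):
    """Scrape every press release in `links` and return them as a list of dicts."""
    releases = []
    for url in links:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # Hypothetical selectors; adapt them to the real press-release DOM.
        releases.append({
            "url": url,
            "title": soup.select_one("h1.title").get_text(strip=True),
            "body": soup.select_one("div.body").get_text(strip=True),
        })
    return releases

# Chained together, assuming get_links() and INDEX_URL from earlier:
#   with open("releases.json", "w") as f:
#       json.dump(scrape_all(get_links(INDEX_URL)), f, indent=2)
```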
Voila. Everything ends up in a (messy) JSON file.
Now, you'll have noticed that there are some HTML tags in there. That's actually good, because we can directly generate an HTML output by writing to a .html file instead of .json. Then, some styling. Because it's 2015, people.
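One possible way to do that, assuming the list-of-dicts structure from the scraping step. The template and CSS are placeholders, and since the scraped body already contains HTML tags, it is dropped into the page as-is:

```python
# Hypothetical: turn the scraped releases into one styled HTML page.
TEMPLATE = """<!DOCTYPE html>
<html><head><meta charset="utf-8">
<title>CENTCOM press releases</title>
<style>body {{ font-family: sans-serif; max-width: 40em; margin: auto; }}</style>
</head><body>{articles}</body></html>"""

def to_html(releases):
    """Render each release as an <article>; the body is assumed to be raw HTML."""
    articles = "\n".join(
        "<article><h2>%s</h2>%s</article>" % (r["title"], r["body"])
        for r in releases
    )
    return TEMPLATE.format(articles=articles)
```

Note the doubled braces in the CSS, so str.format() leaves them alone and only substitutes {articles}.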
And, as promised, the GitHub Gist.