Now that you have learned how Wget can be used to mirror or download specific files from websites like ActiveHistory.ca via the command line, it’s time to expand your web-scraping skills through a few more lessons that focus on other uses for Wget’s recursive retrieval function. The following tutorial provides three examples of how Wget can be used to download large collections of documents from archival websites with assistance from the Python programing language. It will teach you how to parse and generate a list of URLs using a simple Python script, and will also introduce you to a few of Wget’s other useful features. Similar functions to the ones demonstrated in this lesson can be achieved using curl, an open-source software capable of performing automated downloads from the command line. For this lesson, however, we will focus on Wget and building your Python skills.
Don’t take your data at face value. That is the key message of this tutorial which focuses on how scholars can diagnose and act upon the accuracy of data. In this lesson, you will learn the principles and practice of data cleaning, as well as how OpenRefine can be used to perform four essential tasks that will help you to clean your data:
1. Remove duplicate records
2. Separate multiple values contained in the same field
3. Analyse the distribution of values throughout a data set
4. Group together different representations of the same reality
These steps are illustrated with the help of a series of exercises based on a collection of metadata from the Powerhouse museum, demonstrating how (semi-)automated methods can help you correct the errors in your data.
Computer programs can become long, unwieldy and confusing without special mechanisms for managing complexity. This lesson will show you how to reuse parts of your code by writing Functions and break your programs into Modules, in order to keep everything concise and easier to debug. Being able to remove a single dysfunctional module can save time and effort.
Your list is now clean enough that you can begin analyzing its contents in meaningful ways. Counting the frequency of specific words in the list can provide illustrative data. Python has an easy way to count frequencies, but it requires the use of a new type of variable: the dictionary. Before you begin working with a dictionary, consider the processes used to calculate frequencies in a list.
In this lesson you will learn how to create vector layers based on scanned historical maps. In Intro to Google Maps and Google Earth you used vector layers and created attributes in Google Earth. We will be doing the same thing in this lesson, albeit at a more advanced level, using QGIS software.
In the lesson Up and Running with Omeka.net, you added items to your Omeka.net site and grouped them into collections. Now you are ready for the next step: taking your users on a guided tour through the items you have collected.
This lesson uses Python to create and view an HTML file. If you write programs that output HTML, you can use any browser to look at your results. This is especially convenient if your program is automatically creating hyperlinks or graphic entities like charts and diagrams.
Here you will learn how to create HTML files with Python scripts, and how to use Python to automatically open an HTML file in Firefox.
In this two-part lesson, we will build on what you’ve learned about Working with Webpages, learning how to remove the HTML markup from the webpage of Benjamin Bowsey’s 1780 criminal trial transcript. We will achieve this by using a variety of string operators, string methods and close reading skills. We introduce looping and branching so that programs can repeat tasks and test for certain conditions, making it possible to separate the content from the HTML tags. Finally, we convert content from a long string to a list of words that can later be sorted, indexed, and counted.
In this lesson, you will learn the Python commands needed to implement the second part of the algorithm begun in the From HTML to a List of Words (part 1). The first half of the algorithm gets the content of an HTML page and saves only the content that follows the tags.
In this lesson, you will learn how to georeference historical maps so that they may be added to a GIS as a raster layer. Georeferencing is required for anyone who wants to accurately digitize data found on a paper map, and since historians work mostly in the realm of paper, georeferencing is one of our most commonly used tools. The technique uses a series of control points to give a two-dimensional object like a paper map the real world coordinates it needs to align with the three-dimensional features of the earth in GIS software (in Intro to Google Maps and Google Earth we saw an ‘overlay’ which is a Google Earth shortcut version of georeferencing).
In this lesson you will first learn what topic modeling is and why you might want to employ it in your research. You will then learn how to install and work with the MALLET natural language processing toolkit to do so. MALLET involves modifying an environment variable (essentially, setting up a short-cut so that your computer always knows where to find the MALLET program) and working with the command line (ie, by typing in commands manually, rather than clicking on icons or menus). We will run the topic modeller on some example files, and look at the kinds of outputs that MALLET installed. This will give us a good idea of how it can be used on a corpus of texts to identify topics found in the documents without reading them individually.
In this lesson you will install QGIS software, download geospatial files like shapefiles and GeoTIFFs, and create a map out of a number of vector and raster layers. Quantum or QGIS is an open source alternative to the industry leader, ArcGIS from ESRI. QGIS is multiplatform, which means it runs on Windows, Macs, and Linux and it has many of the functions most commonly used by historians. ArcGIS is prohibitively expensive and only runs on Windows (though software can be purchased to allow it to run on Mac). However, many universities have site licenses, meaning students and employees have access to free copies of the software (try contacting your map librarian, computer services, or the geography department). QGIS is ideal for those without access to a free copy of Arc and it is also a good option for learning basic GIS skills and deciding if you want to install a copy of ArcGIS on your machine. Moreover, any work you do in QGIS can be exported to ArcGIS at a later date if you decide to upgrade. The authors tend to use both and are happy to run QGIS on Mac and Linux computers for basic tasks, but still return to ArcGIS for more advanced work. In many cases it is not lack of functions, but stability issues that bring us back to ArcGIS. For those who are learning Python with the Programming Historian, you will be glad to know that both QGIS and ArcGIS use Python as their main scripting language.
Google Maps and Google Earth provide an easy way to start creating digital maps. With a Google Account you can create and edit personal maps by clicking on My Places. In the new Google Maps interface, click on the gear menu [icon] at the upper right of the menu bar, and select My Places. The new (as of summer 2013) interface provides a new way of creating custom maps: Google Maps Engine Lite allows users to import and add data onto the map to visualize trends.
This tutorial assumes basic knowledge of HTML, CSS, and the Document Object Model. It also assumes some knowledge of Python. For a more basic introduction to Python, see Working with Text Files.
Like in Output Data as HTML File, this lesson takes the frequency pairs collected in Counting Frequencies and outputs them in HTML. This time the focus is on keywords in context (KWIC) which creates n-grams from the original document content – in this case a trial transcript from the Old Bailey Online. You can use your program to select a keyword and the computer will output all instances of that keyword, along with the words to the left and right of it, making it easy to see at a glance how the keyword is used.
Once the KWICs have been created, they are then wrapped in HTML and sent to the browser where they can be viewed. This reinforces what was learned in Output Data as HTML File, opting for a slightly different output.
At the end of this lesson, you will be able to extract all possible n-grams from the text. In the next lesson, you will be learn how to output all of the n-grams of a given keyword in a document downloaded from the Internet, and display them clearly in your browser window.
The list that we created in the From HTML to a List of Words (2) needs some normalizing before it can be used further. We are going to do this by applying additional string methods, as well as by using regular expressions. Once normalized, we will be able to more easily analyze our data.
This lesson takes the frequency pairs created in Counting Frequencies and outputs them to an HTML file.
Here you will learn how to output data as an HTML file using Python. You will also learn about string formatting. The final result is an HTML file that shows the keywords found in the original source in order of descending frequency, along with the number of times that each keyword appears.
This lesson builds on Keywords in Context (Using N-grams), where n-grams were extracted from a text. Here, you will learn how to output all of the n-grams of a given keyword in a document downloaded from the Internet, and display them clearly in your browser window.
Downloading a single record from a website is easy, but downloading many records at a time – an increasingly frequent need for a historian – is much more efficient using a programming language such as Python. In this lesson, we will write a program that will download a series of records from the Old Bailey Online using custom search criteria, and save them to a directory on our computer. This process involves interpreting and manipulating URL Query Strings. In this case, the tutorial will seek to download sources that contain references to people of African descent that were published in the Old Bailey Proceedings between 1700 and 1750.
This first lesson in our section on dealing with Online Sources is designed to get you and your computer set up to start programming. We will focus on installing the relevant software – all free and reputable – and finally we will help you to get your toes wet with some simple programming that provides immediate results.
In this opening module you will install the Python programming language, the Beautiful Soup HTML/XML parser, and a text editor. Screencaps provided here come from Komodo Edit, but you can use any text editor capable of working with Python. Here’s a list of other options: Python Editors. Once everything is installed, you will write your first programs, “Hello World” in Python and HTML.
This lesson shows how to use Python to transliterate automatically a list of words from a language with a non-Latin alphabet to a standardized format using the American Standard Code for Information Interchange (ASCII) characters. It builds on readers’ understanding of Python from the lessons “Viewing HTML Files,” “Working with Web Pages,” “From HTML to List of Words (part 1)” and “Intro to Beautiful Soup.” At the end of the lesson, we will use the transliteration dictionary to convert the names from a database of the Russian organization Memorial from Cyrillic into Latin characters. Although the example uses Cyrillic characters, the technique can be reproduced with other alphabets using Unicode.
In this exercise we will use advanced find-and-replace capabilities in a word processing application in order to make use of structure in a brief historical document that is essentially a table in the form of prose. Without using a general programming language, we will gain exposure to some aspects of computational thinking, especially pattern matching, that can be immediately helpful to working historians (and others) using word processors, and can form the basis for subsequent learning with more general programming environments.
Omeka is a free content management system that makes it easy to create websites that show off collections of items. As you will learn below, there are actually two versions of Omeka: Omeka.netand Omeka.org. In this lesson you willl be using the former.
Omeka is an ideal solution for historians who want to display collections of documents, archivists who want to organize artifacts into categories, and teachers who want students to learn about the choices involved in assembling historical collections. It is not difficult, but it is helpful to start off with some basic terms and concepts. In this lesson, you will sign up for an account at Omeka.net and start adding digital objects to your site.
When you are working with online sources, much of the time you will be using files that have been marked up with HTML (Hyper Text Markup Language). Your browser already knows how to interpret HTML, which is handy for human readers. Most browsers also let you see the HTML source code for any page that you visit. The two images below show a typical web page (from the Old Bailey Online) and the HTML source used to generate that page, which you can see with the Tools -> Web Developer -> Page Source command in Firefox.
This lesson introduces Uniform Resource Locators (URLs) and explains how to use Python to download and save the contents of a web page to your local hard drive.
In this lesson you will learn how to manipulate text files using Python. This includes opening, closing, reading from, and writing to .txt files.
The next few lessons will involve downloading a web page from the Internet and reorganizing the contents into useful chunks of information. You will be doing most of your work using Python code written and executed in Komodo Edit.