The Programming Historian - An open-access introduction to programming in Python (2010).pdf

The Programming Historian
The Programming Historian is an open-access introduction to programming in Python, aimed at working
historians (and other humanists) with little previous experience. There are two editions available here; the
second is currently under development. We are constantly adding new material, much of it driven by reader
request. We welcome questions, corrections and suggestions for improvement. At this point we are still
figuring out how best to allow community participation, while maintaining the coherence and direction of a
more monographic work. If you e-mail us at wturkel@uwo.ca, acrymbl@uwo.ca and/or amaceach@uwo.ca,
we are happy to respond to you personally and try to incorporate your comments. In the future we may come
up with something more elegant... but, hey, it's a work in progress.
• William J. Turkel, Adam Crymble and Alan MacEachern, The Programming Historian, 2nd ed.
NiCHE: Network in Canadian History & Environment (2009-).
• William J. Turkel and Alan MacEachern, The Programming Historian, 1st ed. NiCHE: Network in
Canadian History & Environment (2007-08).
Introductory lessons teach you how to
• install Zotero, the Python programming language and other useful tools
• read and write data files
• save web pages and automatically extract information from them
• count word frequencies
• remove stop words
• automatically refine searches
• make n-gram dictionaries
• create keyword-in-context (KWIC) displays
• make tag clouds, and
• harvest sets of hyperlinks
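To give a flavour of where a few of these lessons lead, here is a minimal sketch of counting word frequencies, removing stop words and building n-grams. It is not code taken from the lessons themselves. It assumes a current Python 3 interpreter, which may differ from the version the lessons use, and the sample text, the stop word list and every variable name are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical sample text and stop word list, chosen only for illustration.
text = "The cat sat on the mat. The dog sat on the log."
stop_words = {"the", "on"}

# Lowercase the text and pull out runs of letters as words.
words = re.findall(r"[a-z]+", text.lower())

# Remove stop words, then count how often each remaining word occurs.
content_words = [w for w in words if w not in stop_words]
frequencies = Counter(content_words)

# Build n-grams (here n = 3) as overlapping runs of adjacent words.
n = 3
ngrams = [words[i:i + n] for i in range(len(words) - n + 1)]

print(frequencies.most_common(1))  # the most frequent non-stop word and its count
print(ngrams[0])                   # the first three-word n-gram
```

A real source would need more careful cleaning than this (punctuation inside words, accented characters, HTML markup), which is the kind of detail the lessons take up step by step.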
Table of Contents
0. About this book
1. Do you need to learn how to program?
    Techniques that don't involve programming
    Why you might want to learn to program
    What kind of techniques you will learn
2. Getting started
    Install and set up software
        Linux instructions
        Mac instructions
        Windows instructions
    "Hello world" in Python
    Interacting with a Python shell
        Linux instructions
        Mac instructions
        Windows instructions
    "Hello world" in JavaScript
    Viewing HTML files
    "Hello World" in HTML
    "Hello World" in embedded JavaScript
    Back up your work
    Keep in touch with us
    Other resources
    Suggested readings
3. Working with files and web pages
    Making use of your ability to do close reading
    Sending information to text files
    Getting information from text files
    Splitting code into modules and functions
    About URLs
    Opening URLs with Python
    Saving a local copy of a web page
    Suggested Readings
4. From HTML to a list of words
    Getting rid of HTML formatting
    More about Python strings
    Looping
    Branching
    The stripTags routine
    Python lists
    Suggested Readings
5. Computing frequencies
    Useful measures of a text
    Cleaning up the list
    Our first use of regular expressions
    Python dictionaries
    Counting word frequencies
    From HTML to a dictionary of word-frequency pairs
    Removing stop words
    Putting it all together
    Suggested Readings
6. Wrapping output in HTML
    Putting new information where you can use it
    Python string formatting
    Creating HTML output
    Sending HTML output to Firefox
    Self-documenting data files
    Python comments
    Building an HTML wrapper
    Putting it all together
    Using word frequencies to refine a Google search
    Suggested Readings
7. Keyword in context (KWIC)
    N-grams
    From text to n-grams
    Making an n-gram dictionary
    Pretty printing a KWIC
    From HTML to KWIC
    Turning each KWIC into a Google search link
8. Tag clouds
    Visualizing term frequency
    Mapping one range onto another
    A little bit of CSS
    Functions to write HTML divs and spans
    Other dimensions for visualization
    Putting it all together
    Combining the tag cloud with KWIC
9. Harvesting links and downloading pages
    The idea of text mining
    Selecting a group of biographies
    Extracting hyperlinks with Beautiful Soup
    Scraping with regular expressions
    Working with accented characters
    Some helper functions
    Putting it all together
10. Indexing a document collection
    An overview
    Getting a list of filenames from a directory
    Normalizing the files
    Mapping an anonymous function over a list
    Replacing stopwords with a placeholder
    Zip and tuples
    Putting it all together
    Suggested Readings
Discussion of The Programming Historian, 1st ed.
    Do you need to learn how to program?
    Getting started
    Working with files and web pages
    From HTML to a list of words
    Computing frequencies
    Wrapping output in HTML
    Keyword in context (KWIC)
    Tag clouds
Peer Reviewers
0. About this book
This book is a tutorial-style introduction to programming for practicing historians. We assume that you're
starting out with no prior programming experience and only a basic understanding of computers. More
experience, of course, won't hurt. Once you know how to program, you will find it relatively easy to learn
new programming languages and techniques, and to apply what you know in unfamiliar situations. In order
to get you to that point we've adopted the following strategy.
• You should be able to put what you learn to work in your research immediately. We think that many
beginning programmers lose patience because they can't see why they're learning what they're
learning.
• Digital history requires working with sources on the web. This means that you're going to be spending
most of your research time working in a browser, so you should be able to put your programming
skills to work there.
• You will have to be somewhat polyglot. Individual programming languages can be beautiful objects
in their own right, and each embodies a different way of looking at the world. In order to become a
good programmer, you will eventually have to master the intricacies of one or more particular
languages. When you're first getting started, however, you need something more like a pidgin.
• Open source and open access are both good things. We're providing open access to this book. As we
develop it, we'll be searching for ways to best incorporate the peer review and continual improvement
that characterize open source projects. We also build our work on top of other open source projects,
particularly Python, Firefox, Zotero and the Simile tools.
We both do archival work, write monographs and journal articles, and teach undergraduate and graduate
courses in history. Our backgrounds are a bit different: although we're the same age, one of us has been
programming for about 30 years (WJT) whereas the other started on 1 January 2008 (AM). We share the
conviction, however, that digital history represents the future of our discipline.
To some extent, this book is an extended conversation about the degree to which future historians will need
to be able to program in order to do their jobs. We also hope, of course, that if you work through the book
you'll learn techniques that make you a better historian.
1. Do you need to learn how to program?
Techniques that don't involve programming
Do you need to be able to program? The short answer is "maybe not." You can certainly become more
effective at online research with a few simple techniques that don't require any programming.
• Citation management. Install Zotero and learn how to use it. Make sure to back up your Zotero
database regularly.
• Searching. Always use the advanced search interface when working with search engines. Learn
whatever specialized search syntax is available, and check periodically to see if features have
changed. You should know, for example, that Google lets you search for exact phrases or for words in
any order; that it lets you exclude words; that it can limit your search to a particular domain or help
you find the pages that link to a page you're interested in. You should also know that there are
separate Google searches for books, images, historic news articles, code and scholarly articles, among
many other things.
• Information Trapping. Think of a search as something that you do once. When you find what you're
looking for, you stop searching. You may bookmark a website, but you have to return to it explicitly
whenever you want to see if something has changed. There are some kinds of information that you
need to monitor on a more regular basis. In these cases, it makes more sense to subscribe to
regularly updated RSS feeds. See Tara Calishain's Information Trapping for more detail.
Why you might want to learn to program
We think that at least some historians really will need to learn how to program. Think of it like learning how
to cook. You may prefer fresh pasta to boxed macaroni and cheese, but if you don't want to be stuck eating
the latter, you have to learn to cook or pay someone else to do it for you. Learning how to program is like
learning to cook in another way: it can be a very gradual process. One day you're sitting there eating your
macaroni and cheese and you decide to liven it up with a bit of Tabasco, Dijon mustard or Worcestershire
sauce. Bingo! Soon you're putting grated cheddar in, too. You discover that the ingredients that you bought
for one dish can be remixed to make another. You begin to linger in the spice aisle at the grocery store.
People start buying you cookware. You get to the point where you're willing and able to experiment with
recipes. Although few people become master chefs, many learn to cook well enough to meet their own needs.
If you don't program, your research process will always be at the mercy of those who do.
At this point you might object that some of your primary sources are not in digital form and won't be for the
foreseeable future. We get this. We're not suggesting that historians no longer need to know how to use
material sources in real archives. What we're suggesting is that the rest of your scholarly life has already gone
digital. You communicate electronically using e-mail and mailing lists; you search library catalogs and
archival finding aids online; you submit drafts of monographs and articles electronically; you present
yourself to the world on one or more websites; you have to put up lecture notes or submit grades online; an
awful lot of the information that you need daily is already on the web. To use another food metaphor,
imagine that digital sources are like sugar (and who wouldn't like to think of them that way?). In medieval
Europe, sugar was a rare and expensive spice. Although some people might know how to use it in a dish,
most people didn't ever need to think about it. Fast forward to the late 19th century, when sugar made up a
relatively large proportion of many European diets. Not everyone needed to know how to make dessert, but it
was no longer a rare skill. In the 21st century, some forms of sugar (e.g., high-fructose corn syrup) have
become very difficult to avoid.
What kind of techniques you will learn
Many books about programming fall into one of two categories: (1) books about particular programming
languages, and (2) books about computer science that demonstrate abstract ideas using a particular
programming language. When you're first getting started, it's easy to lose patience with both of these kinds of
books. On the one hand, a systematic tour of the features of a given language and the style(s) of
programming that it supports can seem rather remote from the tasks that you'd like to accomplish. On the
other hand, you may find it hard to see how the abstractions of computer science are related to your specific
application. Once you know how to program, of course, both kinds of book are very useful. You can use
books about programming languages as references, or to transfer your knowledge of one language to another.
And you can use computer science books as a source of inspiration and deeper understanding.
Our goal is to introduce programming techniques that will be immediately useful in your work as a (digital)
historian. Although we will provide links to programming language reference books and computer science
texts as necessary, we won't be concerned with giving you a full tour of any particular programming
language or a systematic introduction to the algorithms and data structures of introductory computer science.
We're going to assume that you are connected to the web, and that there are a vast number of online primary
and secondary sources that are relevant to your research, if only you could find and make use of them. We
will start by developing techniques to find new textual sources, download batches of them, convert them
from one format to another, characterize them individually and cluster them automatically into useful groups.
Programming is for digital historians what sketching is for artists or architects: a mode of creative expression
and a means of exploration.
2. Getting started
Install and set up software
In order to work through the techniques in this book, you will need to download and install some freely
available software. As much as possible, we've tried to make everything compatible with Linux, Mac and
Windows PCs. We assume that the majority of our readers will probably be using Windows, so we've taken
the approach of getting a Windows XP version working first, then a Mac version and finally a Linux version.
We'd be happy to include instructions for specific platforms, especially if you want to send them to us. We've
also included peer feedback and commentary on the discussion page. If you run into trouble with our