One Encyclopedia Per Child - Wikipedia Analysis

Student name

Eric Astor

Student's member profile on PlanetSoC

Mentor's name

Ivan Krstic

Mentor's member profile on PlanetSoC

Anonymous

Description

As my mentor commented, my main failing in this application was linking to a non-technical resume for a technical job! If you decide to check out my resume, please keep that in mind!

Proposal

Name: Eric Astor
Email: eastor1@swarthmore.edu
IM: Aricle (AIM), epastor@comcast.net (MSN), EPA153 (Yahoo)

Project title: One Encyclopedia Per Child - Statistical Analysis of Wikipedia

Benefits to the One Laptop Per Child project

Putting even a partial reference to the world's information on every laptop designed by the OLPC project gives every user a reference book at their fingertips. Having this information available could contribute greatly to the project's ideal of helping children 'learn learning', making it far easier to reach all sorts of relevant collective knowledge. In particular, this analysis could help isolate the articles that are considered most basic or most important, helping human editors locate less obviously needed material that is still essential for a genuine understanding of the whole.

Synopsis

I will create a suite of programs that will process data from Wikipedia, gathering statistics about which articles matter. In particular, I will analyze Wikipedia's internal-link network, ranking pages by a link-based algorithm very similar to those used by various search engines (Google among them). Ideally, Google would grant an explicit license to use some variant of the PageRank algorithm itself for this restricted purpose; if not, there are alternatives available. If time allows, I also intend to gather and analyze reader-popularity data on a page-by-page basis for Wikipedia.
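As a rough illustration of the kind of link-based ranking described above, here is a minimal sketch of a generic PageRank-style power iteration in Python. It is not the algorithm this project would necessarily ship: the function name `rank_pages`, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are all assumptions made for the example.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Rank pages with a generic PageRank-style power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not tuned values.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly
        # over all pages (one common convention, assumed here).
        dangling = sum(rank[page] for page in pages if not links.get(page))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank

    return rank


if __name__ == "__main__":
    # Tiny made-up link network, for illustration only.
    example = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(rank_pages(example).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

In the toy network, the page with the most incoming links ends up ranked highest and the page nobody links to ends up lowest, which is the behavior the ranking is meant to capture at Wikipedia scale.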

How much time do you expect to have for this project?

I intend to put time roughly equivalent to a full-time job into this project, with flexible scheduling. This will be the focus of my summer, so that gives me about 10 weeks to spend.

Deliverables

I will create a command-line utility that generates ranking statistics for Wikipedia pages based on Wikipedia's internal link structure.

Project Details

I will create a program to generate ranking statistics for Wikipedia pages based on Wikipedia's internal link structure. This program should be able to run against a MySQL load of a Wikipedia database dump - most likely the page-to-page link records SQL dump.
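To make the input side concrete, here is a hedged sketch of how link records might be turned into a link graph once they have been read out of the database. The row shapes are assumptions about the dump, not something confirmed here: page rows are taken to be `(page_id, namespace, title)` and link rows `(source_page_id, target_namespace, target_title)`. The function name `build_link_graph` and the choice to keep only the main article namespace (assumed to be 0) are likewise illustrative.

```python
def build_link_graph(page_rows, link_rows, namespace=0):
    """Build {source_id: [target_id, ...]} from page and link records.

    page_rows: iterable of (page_id, namespace, title)              (assumed shape)
    link_rows: iterable of (source_id, target_namespace, target_title)  (assumed shape)
    Only pages and links inside the given namespace are kept.
    """
    # Map article titles to page ids so that link targets can be resolved.
    id_by_title = {
        title: page_id
        for page_id, ns, title in page_rows
        if ns == namespace
    }
    article_ids = set(id_by_title.values())

    # Every article gets an entry, even if it has no outgoing links.
    graph = {page_id: [] for page_id in article_ids}
    seen = set()

    for source_id, target_ns, target_title in link_rows:
        if target_ns != namespace or source_id not in article_ids:
            continue  # skip links from or to other namespaces
        target_id = id_by_title.get(target_title)
        if target_id is None:
            continue  # link to a page that does not exist in the dump
        if target_id == source_id:
            continue  # ignore self-links
        edge = (source_id, target_id)
        if edge not in seen:  # keep each link only once
            seen.add(edge)
            graph[source_id].append(target_id)

    return graph


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    pages = [(1, 0, "Physics"), (2, 0, "Mathematics"), (3, 1, "Physics")]
    links = [(1, 0, "Mathematics"), (2, 0, "Physics"), (2, 0, "Missing_page")]
    print(build_link_graph(pages, links))
```

Working on plain tuples keeps the sketch independent of any particular database driver; the rows could come from a MySQL cursor or from a parsed dump file.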

Ideally, I will also create a program to gather per-page reader-popularity data from Wikipedia. However, at this time I won't commit to this program, since I need to better understand how this data has been gathered in the past and how any relevant information is stored in the Wikipedia databases.

Project Schedule

I will first take a few days to obtain the relevant Wikipedia dumps, and to better understand the format of the information they contain. During this time, I will also investigate possible ranking algorithms and any issues accompanying them. Together, I estimate this to take 1 week.

Next, my priority is developing and testing an implementation of the ranking algorithm within a command-line utility. I estimate this could take 2 weeks, including documentation of code and use.

With any remaining time, I will develop a program to gather per-page reader-popularity data. If this is possible with the existing databases, I estimate this to take roughly 1 week.

The total estimate for the ranking algorithm comes to roughly 3 weeks. Since any project can realistically take 3 times as long as planned, or more, I am budgeting about 9 weeks for it, which still fits within the roughly 10 weeks I have available, given some hard work. Moreover, with some luck, I think the odds are good that I will complete the reader-popularity program as well.

Bio/Background

I am a freshman at Swarthmore College, with an intended Math/Physics major. I have used Wikipedia for years, contributing a few basic proofreading revisions, and find it an invaluable basic resource when exploring new topics. PageRank and other link-based ranking algorithms have interested me for a few years now, and I have read several papers on the subject.

I have coded professionally for the last 4 summers, last summer working for EnterpriseDB. I am self-taught in Java and C#, with some formal education in C/C++. I consider myself proficient in Java, C#, and in revising C/C++ code (with several patches submitted and accepted to the PostgreSQL project).

Have you applied (or plan to apply) for any other 2006 Summer of Code projects? If so, which ones?
I do plan to apply to other projects - in particular, the Mono Project, and possibly PostgreSQL.

Please list jobs, summer classes, and/or vacations that you'll need to work around:
I hope to take about a week's vacation late in the summer - however, nothing is determined yet, so my scheduling is flexible on that.

If interested, please see my resume at the attached link.

Sent you mail re this

Let me know what you think :)

jhscott – Fri, 2006-05-26 18:54