Category Archives: functional

Graphing YouTube Viewing Figures for the SICP Lecture Series with R

Yesterday at Forward our resident data analysis and stats guru Alex Farquhar gave an introductory lunchtime session on the R programming language. R is an interesting data analysis and statistical language and seems like a programmer friendly alternative to Excel for that type of work. In this post I describe how I used a combination of Ruby and R to graph the viewing figures for the Structure and Interpretation of Computer Programs Lecture Series on YouTube.

Extracting View Statistics from YouTube

I was quite keen to try out R on a problem before I forgot everything I had learned! Luckily I had this Ruby script that I knocked up recently lying around:

require 'rubygems'
require 'net/http'
require 'rexml/document'
require 'active_support'

url = 'http://gdata.youtube.com/feeds/api/videos?q=MIT+6.001+Structure+and+Interpretation+Lecture&max-results=50'

xml_data = Net::HTTP.get_response(URI.parse(url)).body

doc = REXML::Document.new(xml_data)
titles = ActiveSupport::OrderedHash.new

doc.elements.each("//entry") do |ele|
  title = ele.elements["media:group"].elements["media:title"].text
  view_count = ele.elements["yt:statistics"].attributes["viewCount"]
  titles[title] = view_count
end

keys = titles.keys.sort{ |a,b|  String.natcmp(a,b) }.select{ |title| title =~ /Lecture/}

csv_file = File.new("sicp.csv", "w+")
csv_file.puts("Title, Year, Views")

keys.each do |title|
   view_count = titles[title]
   csv_file.puts("#{title}, #{view_count}\n")
end

This script queries the YouTube API and fetches the viewing statistics for the Structure and Interpretation of Computer Programs Lecture Series. The script is pretty simple and consist of the following steps:

  1. Perform a http request against the YouTube API
  2. Extract the viewing statistics from the returned XML
  3. Sort the result (using natural sort) and filter out non-lecture videos
  4. Write the results out to a CSV file

This returns the following data:

Title, Year, Views
Lecture 1A | MIT 6.001 Structure and Interpretation, 1986, 51327
Lecture 1B | MIT 6.001 Structure and Interpretation, 1986, 12240
Lecture 2A | MIT 6.001 Structure and Interpretation, 1986, 8858
Lecture 2B | MIT 6.001 Structure and Interpretation, 1986, 5561
Lecture 3A | MIT 6.001 Structure and Interpretation, 1986, 4393
Lecture 3B | MIT 6.001 Structure and Interpretation, 1986, 2935
Lecture 4A | MIT 6.001 Structure and Interpretation, 1986, 3115
Lecture 4B | MIT 6.001 Structure and Interpretation, 1986, 2558
Lecture 5A | MIT 6.001 Structure and Interpretation, 1986, 2540
Lecture 5B | MIT 6.001 Structure and Interpretation, 1986, 2933
Lecture 6A | MIT 6.001 Structure and Interpretation, 1986, 2692
Lecture 6B | MIT 6.001 Structure and Interpretation, 1986, 6275
Lecture 7A | MIT 6.001 Structure and Interpretation, 1986, 2157
Lecture 7B | MIT 6.001 Structure and Interpretation, 1986, 1363
Lecture 8A | MIT 6.001 Structure and Interpretation, 1986, 2940
Lecture 8B | MIT 6.001 Structure and Interpretation, 1986, 1887
Lecture 9A | MIT 6.001 Structure and Interpretation, 1986, 3795
Lecture 9B | MIT 6.001 Structure and Interpretation, 1986, 3636
Lecture 10A | MIT 6.001 Structure and Interpretation, 1986, 2068
Lecture 10B | MIT 6.001 Structure and Interpretation, 1986, 2672

Plotting the YouTube Statistics with R

Now we have the data from YouTube, it is time to load the data into R and plot it. I used R Studio which offer a friendly easy to use environment to learn and work with R. Here is the R script to load the CSV data and plot it:

sicpdata = read.csv("sicp.csv")

plot(sicpdata$Views, type="b", main="YouTube views of the SICP lecture series", xlab="Lecture", ylab="Number of views")

As you can see the code looks quite similar to a traditional programming language such as JavaScript. The interesting bit is the sicpdata variable which is a Data Frame object. Data Frames act like tables or Excel workbooks and contain the rows and columns of data you are working with. Drawing the graph is simply a call to the plot function passing the data we want to plot. Here we plot sicpdata$Views which is the Views column from our dataset.

Next, I wanted to annotate some of the points on the graph to show the names of some of the lectures:

titles = c()
for (title in sicpdata$Title) ( titles = c(titles, strtrim(title, 11)))
identify(sicpdata$Views, labels=titles)

There isn’t enough room on the graph to show the full titles (Lecture 1A | MIT 6.001 Structure and Interpretation etc), so instead only the lecture numbers (Lecture 1A) are shown.The titles are copied and trimmed from the sicpdata$Title data frame into a vector (New vectors are created by using the c() function.). The identify function is then used to annotate the graph. At this point things got weird – the R Studio environment goes into interactive mode and you select the points you want to annotate.

Annotating point using RStudio

When you click on the points the corresponding title from the titles vector is drawn on the graph.

The Graph

After all that here is the finished graph:

SICP Lecture Series Views on YouTube
SICP Lecture Series Views on YouTube

Unsurprisingly the first video has the most views (50,000+). After that there is a steep decline until the number of views evens out at around 2-3,000 views per lecture. There is a weird spike half way through for Lecture 6B with 6,000 views. Finally, the last video has 2,672 views. This is more views than several of the other lectures in the series which suggest people were skipping to the end!

Refactoring to LINQ Part 2: Aggregate is Great

In part one, death to the foreach, I discussed some of the quick and easy refactorings you can use to convert foreach loops to LINQ. In this post I look at the Aggregate extension method that LINQ provides. This method is used less frequently then the other LINQ extension methods but is just as powerful. Often aggregate examples revolve around primitive types such as int and strings. Below I will show some more interesting object-based examples.

LINQ provides several aggregation extension methods: Aggregate, Average, Count, LongCount, Max, Min and Sum. The aggregation methods all take a list of objects and reduces that list to a single result. The Aggregate method is the most flexible and generic of the aggregation extension methods. Conceptually it helps to think of Aggregate as a generic building block and the others aggregation methods (Average, Max, etc.) as special cases of Aggregate. In functional programming languages, such as F#, Aggregate is usually named fold (or inject in Ruby). The SQL like name Aggregate leads developers to write off Aggregate as purely for numeric aggregation purposes. In fact, Aggregate can be used whenever we want to build a single object from a group of objects.

So how does Aggregate work?

Looking at the Aggregate method signature is a pretty scary experience:

public static TResult Aggregate(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func func)

The Aggregate methods takes a list of source objects, a seed value and an accumulator function which it processes as follows:

  • The accumulator function is called for each item in the list and returns a value
  • The first time the accumulator function is called the seed and the first item in the list are passed to it
  • The accumulator function is called again with the result of the first accumulator function call and the second item in the list as its parameters
  • This continues until all items in the list are processed
  • The result of the last call to the accumulator function is returned as the result of the entire Aggregate method

This can take some getting your head around. Here is an example:

var whiskeyNames = new [] {"Ardbeg 1998", "Glenmorangie","Talisker", "Cragganmore"};

var listOfWhiskies = whiskeyNames.Aggregate("Whiskies: ", (accumulated, next) =>
{
	Console.Out.WriteLine("(Adding [{0}] to the list [{1}])", next, accumulated);
	return accumulated + " " + next;
});

Console.Out.WriteLine(listOfWhiskies);

This outputs:

(Adding [Ardbeg 1998] to the list [Whiskies: ])
(Adding [Glenmorangie] to the list [Whiskies:  Ardbeg 1998])
(Adding [Talisker] to the list [Whiskies:  Ardbeg 1998 Glenmorangie])
(Adding [Cragganmore] to the list [Whiskies:  Ardbeg 1998 Glenmorangie Talisker])
Whiskies:  Ardbeg 1998 Glenmorangie Talisker Cragganmore

In this example a string listing several brands of whiskies is being built. The seed passed to the Aggregate method is the string “Whiskey: “. A lambda that adds the whiskey names together (and outputs to the console) is passed as the accumulator function. This accumulator function is called for each item in the list of whiskies. On the first execution of the accumulator function the seed, “Whiskey: “, is passed as the accumulated parameter and the first item in the list “Ardbeg 1998″ is passed as the next parameter. On the second execution of the accumulator function the return value from the accumulator function, “Whiskies: Ardbeg 1998″, is passed as the accumulated parameter and the second item in the list, “Glenmoragnie” is passed as the next parameter.

The seed parameter is optional. If it is omitted the first two items in the list will be passed to the function, as this example demonstrates:

listOfWhiskies = whiskeyNames.Aggregate((accumulated, next) =>
{
	Console.Out.WriteLine("(Adding [{0}] to the list [{1}])", next, accumulated);
	return accumulated + " " + next;
});

Console.Out.WriteLine(listOfWhiskies);

This outputs:

(Adding [Glenmorangie] to the list [Ardbeg 1998])
(Adding [Talisker] to the list [Ardbeg 1998 Glenmorangie])
(Adding [Cragganmore] to the list [Ardbeg 1998 Glenmorangie Talisker])
Ardbeg 1998 Glenmorangie Talisker Cragganmore

Finding the best item in a list

A common coding pattern using foreach is finding the “best” item in a list based on some criteria. This example reuses the whiskey class from the previous post:

public class Whiskey
{
	public Whiskey()
	{
		Ingredients = new List<Whiskey>();
	}

	public string Name { get; set; }
	public int Age { get; set; }
	public decimal Price { get; set; }
	public string Country { get; set; }

	public List<Whiskey> Ingredients {get; set;}

	public string IngredientsAsString
	{
		get
		{
			return String.Join(",", Ingredients.Select(x=> x.Name).ToArray());
		}
	}
}

Whiskey ardbeg = new Whiskey { Name = "Ardbeg 1998", Age = 12, Price = 49.95m, Country = "Scotland" };
Whiskey glenmorangie = new Whiskey { Name = "Glenmorangie", Age = 10, Price = 28.95m, Country = "Scotland" };
Whiskey talisker = new Whiskey { Name = "Talisker", Age = 18, Price = 57.95m, Country = "Scotland" };
Whiskey cragganmore = new Whiskey { Name = "Cragganmore", Age = 12, Price = 30.95m, Country = "Scotland" };
Whiskey redbreast = new Whiskey { Name = "Redbreast", Age = 12, Price = 27.95m, Country = "Ireland" };
Whiskey greenspot = new Whiskey { Name = "Green spot", Age = 8, Price = 44.48m, Country = "Ireland" };

List whiskies = new List { ardbeg, glenmorangie, talisker, cragganmore, redbreast, greenspot };

In the code snippet below we are searching for the most expensive whiskey in a list. This is done by storing the first whiskey of the list in a mostExpensiveWhiskey variable. Then the price of each subsequent whiskey object is compared to the mostExpensiveWhiskey object and if its price is higher that whiskey is stored in the mostExpensiveWhiskey variable.

Whiskey mostExpensiveWhiskey = null;
foreach (var challenger in whiskies)
{
    if (mostExpensiveWhiskey == null)
    {
        mostExpensiveWhiskey = challenger;
    }
    if (challenger.Price > mostExpensiveWhiskey.Price)
    {
        mostExpensiveWhiskey = challenger;
    }
}
Console.WriteLine("Most expensive is {0}", mostExpensiveWhiskey.Name);

This outputs:

Most expensive is Talisker

This can be refactored to some remarkably concise code by using the Aggregate method with a lambda expression and the ternary operator:

Whiskey mostExpensiveWhiskey = whiskies.Aggregate((champion, challenger) => challenger.Price > champion.Price ? challenger : champion);
Console.WriteLine("Most expensive is {0}", mostExpensiveWhiskey.Name);

Creating a new ‘aggregated’ object

In this example (which also uses the whiskey domain!) we will create a new object from several other objects. The majority of whiskies sold are blended whiskies, which are made by mixing together several single malt whiskies. In the code below we loop through a list of whiskies. Where the whiskey is Scottish we add it to our new blendedWhiskey object and update the price of the blendedWhiskey object accordingly.

var blendedWhiskey = new Whiskey() { Name="Tesco value whiskey", Age=3, Country="Scotland" };
foreach (var whiskey in whiskies)
{
    if (whiskey.Country != "Scotland")
    {
        continue;
    }

    blendedWhiskey.Ingredients.Add(whiskey);
    blendedWhiskey.Price = blendedWhiskey.Price + (whiskey.Price / 10);
};

Console.WriteLine("Blended Whiskey Name: {0}", blendedWhiskey.Name);
Console.WriteLine("Blended Whiskey Price: {0}", blendedWhiskey.Price);
Console.WriteLine("Blended Whiskey Ingredients: {0}", blendedWhiskey.IngredientsAsString);

This outputs:

Blended Whiskey Name: Tesco value whiskey
Blended Whiskey Price: 16.780
Blended Whiskey Ingredients: Ardbeg 1998,Glenmorangie,Talisker,Cragganmore

This can be refactored in a few steps. First we use the Where extension method to filter the list to only Scottish whiskies. Then we pass the object we are building as the seed to Aggregate method. The accumulator function then adds the single malt whiskies to the blended whiskey and updates the price.

var blendedWhiskey = whiskies.Where(x=> x.Country == "Scotland")
.Aggregate(new Whiskey() { Name="Tesco value whiskey", Age=3, Country="Scotland" },
	(newWhiskey, nextWhiskey) =>
	{
		newWhiskey.Ingredients.Add(nextWhiskey);
		newWhiskey.Price += (nextWhiskey.Price / 10);
		return newWhiskey;
	});

Console.WriteLine("Blended Whiskey Name: {0}", blendedWhiskey.Name);
Console.WriteLine("Blended Whiskey Price: {0}", blendedWhiskey.Price);
Console.WriteLine("Blended Whiskey Ingredients: {0}", blendedWhiskey.IngredientsAsString);

Unfortunately, the refactored code in this example is not as concise as the previous refactored examples. When the accumulator function consists of multiple statements it can be a bad idea to refactor the code to use Aggregate, as the code becomes less clear.

Summing up

I hope this post has demystified the Aggregate method for you. Learning to use Aggregate can help you to better understand functional programming ideas and for several frequent programming tasks it provides a compelling improvement to the traditional foreach statement. However, for some scenarios using foreach still makes more sense.

Refactoring to LINQ Part 1: Death to Foreach

Language integrated query (LINQ) was first introduced to the .NET framework 3 years ago in 2007. It has taken time for developers to become familiar with it and fully grasp its usefulness. When it was released LINQ was promoted as a way to integrate database and XML querying into .NET languages. Although LINQ makes heavy use of SQL terminology (SELECT/WHERE/AGGREGATE), the LINQ implementation is in fact heavily based on functional programming concepts. In the last year, I have become more familiar with functional programming ideas by exploring and using languages such as JavaScript, Ruby and F#. As a result, I now regularly use the LINQ extension methods to write C# in a more functional programming style.

The advantages and disadvantages of a functional programming style

What advantages does a functional programming style bring?

  • shorter more concise code
  • less temporary variables and less state
  • less errors
  • easier to understand, more declarative code

There are some disadvantages however:

  • debugging is harder
  • performance gotchas due to deferred execution/lazy evaluation
  • higher learning curve

At first functional style code can provoke a WTF? response. To realise the advantages of a functional techniques in a team environment all the developers on a team need to become comfortable with these functional approaches. Hopefully, this should become less of an issue over time as functional constructs are now built into most languages. Additionally, tools such as ReSharper now support automatic refactoring to LINQ expressions methods.

Thinking functional

LINQ introduced a more high level, declarative way to work with collections. Instead of manipulating individual items in a collection, operations are performed on the entire collection. Mark Needham refers to this as the transformational mindset in his talk on Functional C#. Using the transformational mindset we think in terms of the transformations that can be applied to a collection to produce a result. Several transformations can be chained together to replace a complex foreach loop made up of many imperative statements. Typically, a collection is passed through a pipeline of one or more LINQ extension methods such as Where, Select and Aggregate to produce the desired result.

I have now become deeply suspicious of foreach loops and regard them as a code smell. In most cases the foreach loop can be refactored into a shorter more meaningful functional version. Often this will reduce the amount of code by 70% to 80%. A key skill to do perform these refactorings is to recognize which type of looping construct map to which functional operation. Below are several examples of foreach statements refactored to LINQ extension methods.

Functional refactorings

In these examples we will manipulate a list of whiskies. Each whiskey has a name, an age, a price and a country of origin:

public class Whiskey
{
	public string Name { get; set; }
	public int Age { get; set; }
	public decimal Price { get; set; }
	public string Country { get; set; }
}

Whiskey ardbeg = new Whiskey { Name = "Ardbeg 1998", Age = 12, Price = 49.95m, Country = "Scotland" };
Whiskey glenmorangie = new Whiskey { Name = "Glenmorangie", Age = 10, Price = 28.95m, Country = "Scotland" };
Whiskey talisker = new Whiskey { Name = "Talisker", Age = 18, Price = 57.95m, Country = "Scotland" };
Whiskey cragganmore = new Whiskey { Name = "Cragganmore", Age = 12, Price = 30.95m, Country = "Scotland" };
Whiskey redbreast = new Whiskey { Name = "Redbreast", Age = 12, Price = 27.95m, Country = "Ireland" };
Whiskey greenspot = new Whiskey { Name = "Green spot", Age = 8, Price = 44.48m, Country = "Ireland" };

List whiskies = new List { ardbeg, glenmorangie, talisker, cragganmore, redbreast, greenspot };

Creating one list of objects from another

In our first example we create a list of whiskey names from the list of whiskies.

var whiskeyNames = new List ();
foreach (var whiskey in whiskies) {
    whiskeyNames.Add (whiskey.Name);
}

Console.WriteLine("Whiskey names: {0}", String.Join(", ", whiskeyNames.ToArray()));

This outputs:
Whiskey names: Ardbeg 1998, Glenmorangie, Talisker, Cragganmore, Redbreast, Green spot

Converting a list of objects to another list of objects is called mapping or projection in functional programming. The LINQ extension method to do this is Select.

var whiskeyNames = whiskies.Select(x=> x.Name).ToList();

Filtering a list

In our second example we want to get a list of “good value” whiskies that cost 30 pounds or less. To do this a new list is created, the list of whiskies is iterated over and any whiskey under 30 pounds is added to the new list:

var goodValueWhiskies = new List ();
foreach (var whiskey in whiskies) {
  if (whiskey.Price < 30m) {
    goodValueWhiskies.Add (whiskey);
  }
}
Console.WriteLine("Found {0} good value whiskeys", goodValueWhiskeies.Count);

This outputs:
Found 2 good value whiskeys

This operation is called filtering in functional programming. In LINQ we use the Where extension method to filter a list. A predicate is passed to Where to determine which items to keep:

	var goodValueWhiskies = whiskies.Where(x=> x.Price <= 30m).ToList();

Counting the number of items matching a condition in a list

In this example we count the number of 12-year-old whiskies in our list by incrementing a counter variable inside an if statement.

var howMany12YearOldWhiskies = 0;
foreach (var whiskey in whiskies) {
  if (whiskey.Age == 12) {
    howMany12YearOldWhiskies++;
  }
}

Console.WriteLine ("How many 12-year-old whiskies do we have {0}", howMany12YearOldWhiskies);

This outputs:
How many 12-year-old whiskies do we have 3

In this case two of the extension methods, Where and Count, can be combined to produce the desired result:

  var howMany12YearOldWhiskies = whiskies.Where(x=> x.Age == 12).Count();

This can be shortened further as Count optionally takes a predicate to filter the items:

  var howMany12YearOldWhiskies = whiskies.Count(x=> x.Age == 12);

Checking if some or all of the items in a list match a criteria

In this example we check if a condition is true for all items in a list. This involves creating a boolean variable initialized to true, looping through each item in the list and setting the variable to false if any item violates the condition.

var allAreScottish = true;
foreach (var whiskey in whiskies) {
  if (whiskey.Country != "Scotland") {
     allAreScottish = false;
     break;
  }
}

This outputs:
All are scottish? False
This can be replace with the All extension method, which conveys more clearly the intention of what the code is trying to do.

   var allAreScottish = whiskies.All(x=> x.Country == "Scotland");

All is one of three quantifier extension methods available in LINQ the other two being Any and Contains. For example, it is easy to find if there are any Irish whiskies in our list by using the Any extension method:

  var isThereIrishWhiskey = whiskies.Any(x=> x.Country == "Ireland");

Splitting up complex foreach statements

Often you will be faced with quite complex foreach statements that may initially seem hard to refactor. For example how would you refactor the following code to use LINQ extension methods?

var scottishWhiskiesCount = 0;
var scottishWhiskeyTotal = 0m;
foreach (var whiskey in whiskies) {
    if (whiskey.Country == "Scotland") {
        scottishWhiskiesCount++;
        scottishWhiskeyTotal += whiskey.Price;
    }
}

This code is doing three things: determining if a whiskey is Scottish, counting those whiskies and then totalling the price of those whiskies.
The first step is to split the single foreach statement into multiple foreach statements. This is usually a good idea anyway, as doing multiple things in a single loop can be difficult to understand.

List scottishWhiskies = new List ();
foreach (var whiskey in whiskies) {
    if (whiskey.Country == "Scotland") {
        scottishWhiskies.Add (whiskey);
    }
}
foreach (var whiskey in scottishWhiskies) {
    scottishWhiskiesCount++;
}

foreach (var whiskey in scottishWhiskies) {
    scottishWhiskeyTotal += whiskey.Price;
}

The code is then easily refactored to use LINQ extension methods:

var scottishWhiskies = whiskies.Where(x=> x.Country == "Scotland");
scottishWhiskiesCount = scottishWhiskies.Count();
scottishWhiskeyTotal = scottishWhiskies.Sum(x=> x.Price);

In my next post I will discuss some of the more advanced but seldom used LINQ extension methods: Aggregate and SelectMany.