Mathematics


Two concepts come up in most standard treatments of information retrieval: precision and recall. Precision measures whether your result set contains only results that are relevant to the query; recall measures whether your result set contains everything that is relevant to the query.

The formula for classical precision is:

precision = |relevant ∩ retrieved| / |retrieved|

However, I would argue that the classical notion of precision is flawed, in that it doesn't model anything we tend to care about. We are rarely interested in binary classification; instead we want a ranked classification of relevance.

When Google tells you that you have a million results, do you care? No, you skim the first few entries for what it is that you are looking for, unless you are particularly desperate for an answer. So really, you want a metric that models the actual behavior of a search engine user and that level of desperation.

There are two issues with classical precision:

  1. the denominator of precision goes to infinity as the result set increases in size
  2. each result is worth the same amount no matter where it appears in the list

The former ensures that a million answers drown out any value from the first screen; the latter ensures that it doesn't matter which results are on the first screen. A more accurate notion of precision suitable for modern search interfaces should model the prioritization of the results, and should allow for a long tail of crap if the stuff that people will actually look at is accurate overall.

So how do we model user behavior? We can replace the denominator with a partial sum of a geometric series in a probability p < 1, where p models the chance that a user will continue browsing to the next item in the list. Then you can scale the value of the nth summand in the numerator so that it is worth up to p^n. If you have a ranked training set, it is pretty easy to score precision in this fashion.

You retain all of the desirable properties of precision: it maxes out at 100%, and it decreases when you return irrelevant results. But now it also penalizes irrelevant results more heavily the earlier they appear in your result list.
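A minimal sketch of this discounted precision in Haskell, under a few assumptions that are mine rather than the post's: positions are numbered from 1, each result carries a graded relevance score in [0, 1] (best-ranked first), 0 < p < 1, and an empty result list scores 0. The name `discountedPrecision` is purely illustrative.

```haskell
-- | Geometrically discounted precision (illustrative sketch).
--
-- 'p' is the assumed probability that a user moves on to the next result,
-- with 0 < p < 1. Each entry of 'rels' is the relevance of the result at
-- that rank, assumed to lie in [0, 1], best-ranked result first.
discountedPrecision :: Double -> [Double] -> Double
discountedPrecision p rels
  | null rels = 0  -- convention chosen for this sketch only
  | otherwise = numerator / denominator
  where
    weights     = iterate (* p) p                  -- p^1, p^2, p^3, ...
    numerator   = sum (zipWith (*) weights rels)   -- value actually delivered
    denominator = sum (take (length rels) weights) -- best achievable value
```

With p = 0.5, the two-result lists `[1, 0]` and `[0, 1]` score roughly 0.67 and 0.33 respectively, whereas classical precision would give both 0.5.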

The result models user behavior when faced with a search engine more accurately than the classical binary precision metric. The parameter p models the desperation of the user and can vary to fit your problem domain. I personally like p = 50%, because it makes for nice numbers, but it should probably be chosen by sampling, informed by knowledge of the search domain.

You can of course embellish this model with a stair-step in the cost function at each page boundary, and so on; any monotone decreasing infinite series that sums to a finite number in the limit should do.
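One way to sketch that generalization is to pass the positional weights in as a list, so the geometric series becomes just one choice among many. Everything named here (`weightedPrecision`, `stairStep`, the `pagePenalty` factor) is hypothetical, and the particular stair-step shape is only one assumption about what a page-boundary penalty could look like.

```haskell
-- | Precision under an arbitrary positional weighting (illustrative sketch).
-- 'weights' is assumed to be a non-increasing list of positive numbers whose
-- sum converges; 'rels' holds relevance scores in [0, 1], best-ranked first.
weightedPrecision :: [Double] -> [Double] -> Double
weightedPrecision weights rels
  | total == 0 = 0  -- convention for empty input, chosen for this sketch
  | otherwise  = sum (zipWith (*) ws rels) / total
  where
    ws    = take (length rels) weights
    total = sum ws

-- | Stair-step weights: geometric decay p^n at position n, multiplied by an
-- extra 'pagePenalty' factor for every page boundary crossed so far.
-- Assumes pageSize > 0, 0 < p < 1 and 0 < pagePenalty <= 1.
stairStep :: Int -> Double -> Double -> [Double]
stairStep pageSize p pagePenalty =
  [ p ^ n * pagePenalty ^ ((n - 1) `div` pageSize) | n <- [1 :: Int ..] ]
```

For example, `take 4 (stairStep 2 0.5 0.5)` is `[0.5, 0.25, 0.0625, 0.03125]`: the third result sits on the second page, so it is discounted once more than plain geometric decay would suggest.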

A similar modification can of course be applied to recall.

I used this approach a couple of years ago to help tune a search engine to good effect. I went to refer someone to this post today and I realized I hadn't posted it in the almost two years since it was written, so here it is, warts and all.

If anyone is familiar with similar approaches in the literature, I'd be grateful for references!

I was asked to give two talks at the Boston Area Haskell User Group for this past Tuesday. The first was pitched at a more introductory level and the second was to go deeper into what I have been using monoids for lately.

The first talk covers an introduction to the mathematical notion of a monoid, introduces some of the features of my Haskell monoids library on Hackage, and starts to motivate the use of monoidal parallel/incremental parsing and the modification of compression algorithms to recycle monoidal results.

The second talk covers a way to generate a locally context-sensitive parallel/incremental parser by modifying Iteratees to enable them to drive a Parsec 3 lexer, and then wrapping that in a monoid based on error productions in the grammar, before recycling these techniques at a higher level to deal with parsing seemingly stateful structures, such as Haskell layout.

  1. Introduction To Monoids (PDF)
  2. Iteratees, Parsec and Monoids: A Parsing Trifecta (PDF)

Due to a late start, I was unable to give the second talk. However, I did give a quick run through to a few die-hards who stayed late and came to the Cambridge Brewing Company afterwards. As I promised some people that I would post the slides after the talk, here they are.

The current plan is to possibly give the second talk in full at either the September or October Boston Haskell User Group sessions, depending on scheduling and availability.

[ Iteratee.hs ]

Some people have requested my slides from the short talk I gave about monoids and monoidal parsing at Hac Phi. So, here they are.

There will be more to come at the next Boston Haskell User Group in August, where it looks like I'll be giving two short talks covering monoids. I may use the monoidal parsing engine from Kata as an example for the advanced talk if I have time and will start to cover parsing larger classes of grammars in general (regular languages, CFGs/TIGs, TAGs, PEGs, LALR, attribute-grammars, etc.)

About a year back I posted a field guide of recursion schemes on this blog and then lost it a few months later when I lost a couple of months of blog entries to a crash. I recently recovered the table of recursion schemes from the original post thanks to Google Reader's long memory and the help of Jeff Cutsinger.

The following recursion schemes can be found in category-extras, along with variations on the underlying themes, so this should work as a punch-list.

Folds

| Scheme | Code | Description |
|---|---|---|
| catamorphism | Cata | tears down a structure level by level |
| paramorphism*† | Para | tears down a structure with primitive recursion |
| zygomorphism*† | Zygo | tears down a structure with the aid of a helper function |
| histomorphism† | Histo | tears down a structure with the aid of the previous answers it has given |
| prepromorphism*† | Prepro | tears down a structure after repeatedly applying a natural transformation |

Unfolds

| Scheme | Code | Description |
|---|---|---|
| anamorphism† | Ana | builds up a structure level by level |
| apomorphism*† | Apo | builds up a structure opting to return a single level or an entire branch at each point |
| futumorphism† | Futu | builds up a structure multiple levels at a time |
| postpromorphism*† | Postpro | builds up a structure and repeatedly transforms it with a natural transformation |

Refolds

| Scheme | Code | Description |
|---|---|---|
| hylomorphism† | Hylo | builds up and tears down a virtual structure |
| chronomorphism† | Chrono | builds up a virtual structure with a futumorphism and tears it down with a histomorphism |
| synchromorphism | Synchro | a high-level transformation between data structures using a third data structure to queue intermediate results |
| exomorphism | Exo | a high-level transformation between data structures from a trialgebra to a bialgebra |
| metamorphism | Erwig | a hylomorphism expressed in terms of bialgebras |
| metamorphism | Gibbons | a fold followed by an unfold; change of representation |
| dynamorphism† | Dyna | builds up a virtual structure with an anamorphism and tears it down with a histomorphism |
| Elgot algebra | Elgot | builds up a structure and tears it down but may short-circuit the process during construction |
| Elgot coalgebra | Elgot | builds up a structure and tears it down but may short-circuit the process during deconstruction |

* This gives rise to a family of related recursion schemes, modeled in category-extras with distributive law combinators
† The scheme can be generalized to accept one or more F-distributive (co)monads.
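To make the three families concrete, here is a small self-contained sketch of the simplest fold, unfold and refold from the table. It is written from the standard definitions rather than copied from category-extras, so the exact names and signatures there may differ; `ListF`, `countdown`, `productAlg` and `factorial` are hypothetical helpers introduced only for the example.

```haskell
-- Fixed point of a functor, named after the Mu newtype used elsewhere on this page.
newtype Mu f = InF { outF :: f (Mu f) }

-- | Catamorphism: tears down a structure level by level.
cata :: Functor f => (f a -> a) -> Mu f -> a
cata alg = alg . fmap (cata alg) . outF

-- | Anamorphism: builds up a structure level by level.
ana :: Functor f => (a -> f a) -> a -> Mu f
ana coalg = InF . fmap (ana coalg) . coalg

-- | Hylomorphism: builds up and tears down a virtual structure,
-- without ever materializing the intermediate Mu f.
hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
hylo alg coalg = alg . fmap (hylo alg coalg) . coalg

-- Base functor for lists (hypothetical helper for this example).
data ListF e r = Nil | Cons e r

instance Functor (ListF e) where
  fmap _ Nil        = Nil
  fmap f (Cons e r) = Cons e (f r)

-- Coalgebra: unfold n into n, n-1, ..., 1 (assumes n >= 0).
countdown :: Int -> ListF Int Int
countdown 0 = Nil
countdown n = Cons n (n - 1)

-- Algebra: multiply the elements together.
productAlg :: ListF Int Int -> Int
productAlg Nil        = 1
productAlg (Cons e r) = e * r

-- Factorial as a refold: unfold the countdown, fold with the product.
factorial :: Int -> Int
factorial = hylo productAlg countdown
```

Here `factorial 5` agrees with `cata productAlg (ana countdown 5)`; the hylomorphism simply fuses the two passes.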

Grant B. asked me to post the derivation of the right and left Kan extension formulas used in previous Kan extension posts (1,2). For that we can turn to the definition of Kan extensions in terms of ends, but first we need to take a couple of steps back to find a way to represent (co)ends in Haskell.

(more...)

I think I may spend a post or two talking about Kan extensions.

They appear to be black magic to Haskell programmers, but as Saunders Mac Lane said in Categories for the Working Mathematician:

All concepts are Kan extensions.

So what is a Kan extension? They come in two forms: right and left Kan extensions.

First I'll talk about right Kan extensions, since Haskell programmers have a better intuition for them.

(more...)

 
> import Control.Arrow ((|||),(&&&),left)
> newtype Mu f = InF { outF :: f (Mu f) }
 

I want to talk about a novel recursion scheme that hasn't received a lot of attention from the Haskell community and its even more obscure dual -- which is necessarily more obscure because I believe this is the first time anyone has talked about it.

Jiri Adámek, Stefan Milius and Jiri Velebil have done a lot of work on Elgot algebras. Here I'd like to translate them into Haskell, dualize them, observe that the dual can encode primitive recursion, and provide some observations.

You can kind of think of an Elgot algebra as a hylomorphism that cheats.

 
> elgot :: Functor f => (f b -> b) -> (a -> Either b (f a)) -> a -> b
> elgot phi psi = h where h = (id ||| phi . fmap h) . psi
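A hedged usage sketch of the "cheating": `elgot` is restated so the block compiles on its own, and `digitSum` is a hypothetical example over the pair functor `(,) Int`, which has no base case, so only the `Left` short-circuit lets the recursion stop. The `coelgot` shown is one plausible way to write the dual, with `factorialCo` as a hypothetical illustration of primitive recursion; treat both as assumptions about where the full post goes rather than a quotation of it.

```haskell
import Control.Arrow ((|||), (&&&))

-- Restated from the post so this block is self-contained.
elgot :: Functor f => (f b -> b) -> (a -> Either b (f a)) -> a -> b
elgot phi psi = h where h = (id ||| phi . fmap h) . psi

-- Sum of decimal digits of a non-negative Int. The coalgebra answers
-- directly (Left) once a single digit remains; otherwise it peels off the
-- last digit and keeps unfolding.
digitSum :: Int -> Int
digitSum = elgot phi psi
  where
    phi (d, acc) = d + acc
    psi n
      | n < 10    = Left n
      | otherwise = Right (n `mod` 10, n `div` 10)

-- A plausible dual: the algebra also gets to see the original seed.
coelgot :: Functor f => ((a, f b) -> b) -> (a -> f a) -> a -> b
coelgot phi psi = h where h = phi . (id &&& fmap h . psi)

-- Primitive recursion via the dual: factorial over the Maybe functor,
-- where the algebra uses both the seed n and the recursive result
-- (assumes n >= 0).
factorialCo :: Int -> Int
factorialCo = coelgot phi psi
  where
    phi (_, Nothing) = 1
    phi (n, Just r)  = n * r
    psi 0 = Nothing
    psi n = Just (n - 1)
```

For instance, `digitSum 1234` unfolds to 4 + (3 + (2 + 1)), with the innermost 1 supplied by the short-circuit rather than by the algebra.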
 

(more...)

Ok, I decided to take a step back from my flawed approach in the last post and play with the idea of power series of functors from a different perspective.

I dusted off my copy of Herbert Wilf's generatingfunctionology and switched goals to try to see some well known recursive functors or species as formal power series. It appears that we can pick a few things out about the generating functions of polynomial functors.

As an example:

 
Maybe x = 1 + x
 

Ok. We're done. Thank you very much. I'll be here all week. Try the veal...

For a more serious example, the formal power series for the list [x] is just a geometric series:
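Reading the list type as the recursive equation List x = 1 + x * List x (either an empty list, or an element followed by a list), in the same style as the Maybe example, the series can be sketched as follows. This is the standard textbook manipulation of formal power series, and it is an assumption that the full post presents it the same way.

```latex
\begin{aligned}
\mathrm{List}(x) &= 1 + x \cdot \mathrm{List}(x) \\
\mathrm{List}(x) &= \frac{1}{1 - x} = \sum_{n \ge 0} x^n = 1 + x + x^2 + x^3 + \cdots
\end{aligned}
```

Each term x^n corresponds to the lists of length exactly n.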

(more...)

The post below will only compile on a version of GHC >= 6.9, since it uses type families.

(more...)