Cross Validation Module for Python

For one of my projects, I needed a simple way to generate data sets for cross validation. I didn’t want to spend a lot of time on programming, so not every component of the resulting code is generic enough to share. However, the following two Python classes might be useful if you are interested in statistical evaluation and analysis.

I admit that the included unit tests are rather basic because I relied on external tests for verification. Nevertheless, they are a good start if you decide to mess around and adapt the code to your needs. Feel free to notify me of any errors; suggestions are also welcome.

partitioner.py

This class creates v-fold partition sets from an input sequence. It can optionally randomize the input sequence.

The input sequence is split into (almost) equally sized partitions. When the length is not evenly divisible by the number of partitions, the ‘first’ partitions are made slightly larger so that all list elements fit into the partitions. For example, when the input sequence has length 11 and two partitions are requested, the algorithm returns a list of length 6 as the first partition and a list of length 5 as the second.
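To make the splitting rule concrete, here is a minimal sketch of how such a partitioning could be written. It is not the code from partitioner.py: the function name split_into_partitions and its parameters are made up for illustration, and I am assuming that the leftover elements are handed out one each to the first partitions, which matches the sizes in the examples on this page.

```python
import random


def split_into_partitions(sequence, num_partitions, randomize=False, seed=None):
    """Split `sequence` into `num_partitions` lists of (almost) equal size.

    Hypothetical helper for illustration only, not the Partitioner class.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    items = list(sequence)
    if randomize:
        # A dedicated Random instance keeps the shuffle reproducible via `seed`.
        random.Random(seed).shuffle(items)
    # Every partition gets `base` items; the first `extra` partitions get one more.
    base, extra = divmod(len(items), num_partitions)
    partitions = []
    start = 0
    for index in range(num_partitions):
        size = base + (1 if index < extra else 0)
        partitions.append(items[start:start + size])
        start += size
    return partitions


print([len(p) for p in split_into_partitions(range(11), 2)])  # [6, 5]
print([len(p) for p in split_into_partitions(range(14), 4)])  # [4, 4, 3, 3]
```

Slicing a shuffled copy keeps the input untouched, and the divmod call is all that is needed to decide which partitions take an extra element.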

Here is sample code to demonstrate basic usage of partitioner.py:

  from partitioner import Partitioner
  p = Partitioner(numPartitions=4, randomize=True)
  inputList = range(0,14)
  partitions_iterator = p.partition(inputList)
  # partitions_iterator yields [9, 8, 2, 5], [7, 6, 1, 11], [10, 3, 4], [12, 13, 0]

Here is a visualization of the code above:

[Figure: Example of Partitioner]

See the code of CrossValidationDataConstructor for more examples of how to use the Partitioner.

crossvalidationdataconstructor.py

This class (I couldn’t come up with a better name, sigh) constructs data sets for cross validation, and it makes use of the Partitioner described above. It accepts two input lists, positiveList and negativeList, splits each of them into a user-specified number of partitions, and pairs the positive and negative partitions index-wise, i.e. (positivePartitions[i], negativePartitions[i]). It returns an iterator over the resulting data sets. As the sample output below shows, each data set holds out one such pair for testing and combines the remaining partitions for training. The data sets can be used for various statistical evaluations, in particular cross validation.
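The construction described above can be sketched roughly as follows. Again, this is not the code from crossvalidationdataconstructor.py: the names chunk and build_data_sets are hypothetical, randomization is left out to keep things short, and the layout of each result, ((training positives, training negatives), (test positives, test negatives)), is simply my reading of the sample output on this page.

```python
def chunk(items, num_partitions):
    """Contiguous, (almost) equally sized partitions; earlier ones take the remainder."""
    base, extra = divmod(len(items), num_partitions)
    bounds = [0]
    for index in range(num_partitions):
        bounds.append(bounds[-1] + base + (1 if index < extra else 0))
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def build_data_sets(positives, negatives, num_partitions):
    """Yield one ((train_pos, train_neg), (test_pos, test_neg)) tuple per partition."""
    pos_parts = chunk(list(positives), num_partitions)
    neg_parts = chunk(list(negatives), num_partitions)
    for held_out in range(num_partitions):
        # Everything except the held-out partition is concatenated for training.
        train_pos = [x for i, part in enumerate(pos_parts) if i != held_out for x in part]
        train_neg = [x for i, part in enumerate(neg_parts) if i != held_out for x in part]
        yield (train_pos, train_neg), (pos_parts[held_out], neg_parts[held_out])


for training, testing in build_data_sets(range(1, 10), range(-9, 0), 3):
    print(training, testing)
```

Because the positive and negative lists are partitioned separately, every test set contains items of both kinds as long as each list has at least as many items as there are partitions.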

Here is sample code to demonstrate basic usage of crossvalidationdataconstructor.py:

  from crossvalidationdataconstructor import CrossValidationDataConstructor
  positiveInputList = range(1,10) # 1..9
  negativeInputList = range(-9,0) # -9..-1
  c = CrossValidationDataConstructor(positiveInputList, negativeInputList, numPartitions=3, randomize=True)
  dataSets_iterator = c.getDataSets()
  # dataSets_iterator yields three data sets:
  #   (([2, 3, 8, 6, 4, 7], [-3, -6, -4, -7, -2, -8]), ([9, 1, 5], [-5, -9, -1])),
  #   (([9, 1, 5, 6, 4, 7], [-5, -9, -1, -7, -2, -8]), ([2, 3, 8], [-3, -6, -4])),
  #   (([9, 1, 5, 2, 3, 8], [-5, -9, -1, -3, -6, -4]), ([6, 4, 7], [-7, -2, -8]))

Here is a visualization of the code above:

[Figure: Example of CrossValidationDataConstructor]

For each of the resulting “rows”, you can use the first two sequences (green and red) for training and the last two sequences (green and red) for testing. You can then combine the performance of each evaluation run (here: 3) into a final performance measurement.
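How you combine the runs is up to you. One common choice is to average a per-run score, and the sketch below does exactly that. Everything in it is an assumption made for illustration: the toy threshold "model", the accuracy measure, and the function names are not part of the two classes, and the data sets are simply the sample output from above typed in by hand.

```python
def train_threshold(train_pos, train_neg):
    """Toy 'model': the midpoint between the two class means (illustration only)."""
    mean_pos = sum(train_pos) / float(len(train_pos))
    mean_neg = sum(train_neg) / float(len(train_neg))
    return (mean_pos + mean_neg) / 2.0


def accuracy(threshold, test_pos, test_neg):
    """Fraction of test items that fall on the correct side of the threshold."""
    correct = sum(1 for x in test_pos if x > threshold)
    correct += sum(1 for x in test_neg if x <= threshold)
    return correct / float(len(test_pos) + len(test_neg))


def cross_validate(data_sets):
    """Train and test once per data set, then average the per-run scores."""
    scores = []
    for (train_pos, train_neg), (test_pos, test_neg) in data_sets:
        threshold = train_threshold(train_pos, train_neg)
        scores.append(accuracy(threshold, test_pos, test_neg))
    return sum(scores) / len(scores), scores


# The three data sets from the sample output above.
data_sets = [
    (([2, 3, 8, 6, 4, 7], [-3, -6, -4, -7, -2, -8]), ([9, 1, 5], [-5, -9, -1])),
    (([9, 1, 5, 6, 4, 7], [-5, -9, -1, -7, -2, -8]), ([2, 3, 8], [-3, -6, -4])),
    (([9, 1, 5, 2, 3, 8], [-5, -9, -1, -3, -6, -4]), ([6, 4, 7], [-7, -2, -8])),
]
mean_score, per_run = cross_validate(data_sets)
print(per_run, mean_score)
```

With real data you would replace train_threshold and accuracy with your own training and scoring code; only the loop and the final averaging step carry over.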

Download the code:

The code has been tested with Python 2.5 and 2.4.3.