Leveling Up: The Art of Transduction
This article was originally published in PHP|Architect magazine in the June 2016 issue. These articles are all copyright by David Stockton. You may be able to purchase the issue here.
Requirements:
- https://github.com/mtdowling/transducers.php
As software developers, we run across new problems all the time, often with aspects similar to problems we’ve encountered before. In these cases, we typically write similar solutions. Many times you’ve probably needed to process a bunch of data, whether an array, a stream, or a database query result set. Often, the go-to tool to solve these problems is the humble foreach loop. PHP provides other ways for certain cases that can be faster than foreach, yet I’d guess they are used only a fraction of the time they could be.
Processing via foreach
The common pattern I typically see is to get data from some source, usually as an array, loop over it via foreach, and either change it, output it, filter it, or some combination of these in order to massage the data into the shape you want. It’s relatively quick and often easy to understand, but there are better ways.
Suppose you’ve got a CSV data file that’s been uploaded and you need to import the values into the database, but the included date field is not in the format your database expects. It’s foreach to the rescue, right?
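A minimal sketch of that kind of loop, under some assumptions: the sample rows, the signup_date column, the month/day/year input format, and the $imported array (standing in for a real database insert) are all made up for illustration.

```php
<?php
// Hypothetical rows, as they might look after parsing an uploaded CSV file.
$rows = [
    ['name' => 'Ann', 'signup_date' => '06/01/2016'],
    ['name' => 'Bob', 'signup_date' => '12/25/2015'],
];

$imported = [];
foreach ($rows as $row) {
    // Convert month/day/year into the year-month-day form most databases expect.
    $date = DateTimeImmutable::createFromFormat('m/d/Y', $row['signup_date']);
    $row['signup_date'] = $date->format('Y-m-d');

    // Stand-in for the real database insert.
    $imported[] = $row;
}
```

Each row is fixed and "written" in the same pass, so the data is only walked once.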
More often than not, I see the above pattern, but the whole array is manipulated before anything is written to the database or to the output file. This leads to slow-downs due to processing the array multiple times as well as a memory usage that is higher than what is really needed.
PHP’s Array Functions
Now, the foreach solution is fine. It works, and it can be relatively quick, assuming you’re not iterating over the whole set of data to fix it and then going back over it again to load it into the database. The problem is that every part of it is ‘user land’ code. PHP has some built-in functions that put the looping part in native code (read: written in C), leaving only the processing part to be provided by the programmer. That means the above bit becomes something like this:
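A rough sketch of the array_walk version, assuming the same kind of data: the rows, the signup_date column, and the $imported array (a stand-in for the database insert) are hypothetical.

```php
<?php
// Hypothetical rows, as they might look after parsing an uploaded CSV file.
$rows = [
    ['name' => 'Ann', 'signup_date' => '06/01/2016'],
    ['name' => 'Bob', 'signup_date' => '12/25/2015'],
];

$imported = [];

// array_walk calls the callback once per element, passing the value and its key.
array_walk($rows, function (array $row, $key) use (&$imported) {
    $date = DateTimeImmutable::createFromFormat('m/d/Y', $row['signup_date']);
    $row['signup_date'] = $date->format('Y-m-d');

    // Stand-in for the real database insert.
    $imported[] = $row;
});
```

The loop itself now lives inside PHP; only the per-row work is written in user land.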
The array_walk function takes a provided function and applies it to each member of an array, so it’s great when you’re able to manipulate and process in one shot.
There are a bunch of other similar built-in array functions for different situations, each of which can do its particular job faster than a hand-written foreach loop, although how much faster depends on the PHP version and on the callback.
There’s array_filter, which is great for looping over an array and deciding whether to keep each value or discard it. It takes a callback that returns true to keep the row or false to discard it. Suppose we’ve got an array of objects that can indicate if they are active or not, and we want to filter down to just the “active” objects.
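Here is a small sketch of that filter. The Account class is hypothetical; the only thing assumed about the objects is that they expose an isActive() method.

```php
<?php
// Hypothetical class used only for this example.
class Account
{
    private $active;

    public function __construct($active)
    {
        $this->active = $active;
    }

    public function isActive()
    {
        return $this->active;
    }
}

$objects = [new Account(true), new Account(false), new Account(true)];

// Keep only the objects whose isActive() returns a truthy value.
$activeObjects = array_filter($objects, function ($object) {
    return $object->isActive();
});
```

Note that array_filter preserves the original keys, so wrap the result in array_values() if you need a freshly indexed list.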
At the end of the filter routine, the $activeObjects array will contain only the objects which returned a “truthy” value from the isActive() method.
Next up is array_map. It transforms each row in the array according to the callback function. Here’s an example of using array_map to take an array of numbers and turn it into an array of the squares of those numbers.
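A sketch of the squaring example. The input array is an assumption, chosen so that its squares are 1, 9, 25, 64 and 4.

```php
<?php
// Assumed input; any list of numbers works the same way.
$numbers = [1, 3, 5, 8, 2];

// array_map takes the callback first and the array second.
$squares = array_map(function ($n) {
    return $n * $n;
}, $numbers);
```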
The result of this function will be that the $squares array will contain [1, 9, 25, 64, 4]. There’s one final PHP array function I want to talk about before jumping into the meat of the article.
The array_reduce function is used to move through an array and end up with a single value at the end. Examples include iterating over a shopping cart full of items to determine a total price, concatenating strings together, or any of a number of other common operations. Let’s take a look at an example of totaling up shopping cart items:
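A minimal sketch of the cart total. The CartItem class and the prices are hypothetical; only the getPrice() method, the $sum and $cartItem parameters, the 0.00 initial value and the $total result follow the description in the text.

```php
<?php
// Hypothetical cart item class used only for this example.
class CartItem
{
    private $price;

    public function __construct($price)
    {
        $this->price = $price;
    }

    public function getPrice()
    {
        return $this->price;
    }
}

$cart = [new CartItem(10.00), new CartItem(5.50), new CartItem(4.25)];

// The callback receives the running total and the current item,
// and returns the new running total.
$total = array_reduce($cart, function ($sum, $cartItem) {
    return $sum + $cartItem->getPrice();
}, 0.00);
```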
Since this one looks a bit different, I’ll take a minute to explain. As with all previous examples, the first argument is the array we’re iterating over. The callback is a bit different though. Its first argument, $sum, is a “carry over” value. Each time the callback is executed, its return value becomes the $sum value for the next time it is executed. You can almost think of the body of the callback as being $sum += $cartItem->getPrice(); which is probably exactly what you’d write if you wanted to use a foreach loop to total up the values. The final argument, 0.00, is used as the initial value of $sum. Once the function is done running, the $total value will contain the sum of all the items in the cart.
So dealing with all data as arrays in PHP is certainly possible, but it’s often not the best way. Data in arrays means that everything needs to be kept in memory and there’s a limit to that. If you need to load large files of data, loading everything into memory first may take too long, depending on the file size, and your system may not be able to handle it due to memory limits. You may also have a need for a series of data which doesn’t have an end. Think of a cycle that goes through the days of the week.
If you can process the data a row at a time, it means that you’ll be able to essentially process files as large as you want. Your code will have a known and predictable memory footprint even though the files or data you’re working with may be extremely large.
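To make the "series of data which doesn’t have an end" idea concrete, here is a small sketch using a generator. The daysOfWeek() function name is made up for this example.

```php
<?php
// Hypothetical generator: cycles through the days of the week forever.
function daysOfWeek()
{
    $days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'];
    while (true) {
        foreach ($days as $day) {
            yield $day;
        }
    }
}

// The sequence never ends, so the consumer decides when to stop.
$firstTen = [];
foreach (daysOfWeek() as $day) {
    $firstTen[] = $day;
    if (count($firstTen) === 10) {
        break;
    }
}
```

An endless sequence like this could never be built as an array first, yet the generator only ever holds one value at a time.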
Introducing Transducers
The name “Transducers” is a portmanteau of “transform” and “reducers”. In short, they are functions that take in a reducing function and return a different reducing function. Transducers as a concept come from Rich Hickey, the creator of the Clojure programming language.
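The "take a reducing function, return a different reducing function" idea can be sketched in plain PHP. This is a conceptual illustration only, not how the Transducers.php package is implemented, and the $append, $mapping and $double names are invented for the example.

```php
<?php
// A reducing function: (accumulator, input) -> new accumulator.
$append = function (array $acc, $x) {
    $acc[] = $x;
    return $acc;
};

// A hand-rolled "mapping" transducer: given a reducing function,
// it returns a new reducing function that transforms each input first.
$mapping = function (callable $f) {
    return function (callable $reducer) use ($f) {
        return function ($acc, $x) use ($f, $reducer) {
            return $reducer($acc, $f($x));
        };
    };
};

$double = function ($x) {
    return $x * 2;
};

// Wrap the plain appending reducer so that it doubles values on the way in.
$doublingAppend = $mapping($double)($append);

$result = array_reduce([1, 2, 3], $doublingAppend, []);
```

The wrapped reducer knows nothing about where the data comes from or where it ends up, which is what makes transducers easy to reuse and compose.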
In functional programming, which I’m not going to go too deep into here, there are three major functions or concepts: map, filter and reduce. You can accomplish quite a lot with these. Let’s take a look at how each works. Instead of working exclusively with arrays, though, like array_map does, we want a map function that can work with any kind of iterable. This means not only arrays, but also generators and iterators. Map is going to take a function defining the transformation we want and a data source.
Filter will take a function that gives back a true or false value, with true indicating the value should be retained and false meaning it should be discarded.
The reduce function will take a function defining how to take two values and end up with a single value, along with a data source and a potential initial value.
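As a rough sketch of what iterable-friendly versions of these three functions could look like, here they are written with generators. The helper names (lazyMap, lazyFilter, reduceIterable) are hypothetical and not part of any library; the iterable type hint assumes PHP 7.1 or newer.

```php
<?php
// Hypothetical helper names; none of these come from a library.
function lazyMap(callable $f, iterable $source)
{
    foreach ($source as $value) {
        yield $f($value);
    }
}

function lazyFilter(callable $predicate, iterable $source)
{
    foreach ($source as $value) {
        if ($predicate($value)) {
            yield $value;
        }
    }
}

function reduceIterable(callable $f, iterable $source, $initial = null)
{
    $carry = $initial;
    foreach ($source as $value) {
        $carry = $f($carry, $value);
    }
    return $carry;
}

// A generator as the data source: the numbers are never held in an array.
$numbers = (function () {
    for ($i = 1; $i <= 5; $i++) {
        yield $i;
    }
})();

$squares = lazyMap(function ($n) { return $n * $n; }, $numbers);
$oddSquares = lazyFilter(function ($n) { return $n % 2 === 1; }, $squares);
$sum = reduceIterable(function ($carry, $n) { return $carry + $n; }, $oddSquares, 0);
```

Nothing is computed until reduceIterable starts pulling values, and each number flows through the map and filter steps one at a time.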
Transducers allow us to combine or compose these functions to make data processing simple and understandable.
Transducers.php
Shortly after transducers in Clojure were introduced, Michael Dowling (creator of Guzzle) started a PHP implementation of transducers. You can find it at https://github.com/mtdowling/transducers.php. You can build chains of functionality that work on iterators, generators, arrays, and Traversables. You can evaluate the results of these chains either eagerly or lazily. Eager evaluation means that all the calculations will be done before you start working with the results. Lazy evaluation allows you to stream data through the transducers and keep memory requirements down.
To follow along, you can install the package with the following composer command:
composer require mtdowling/transducers
One of the ways I’ve used transducers is to transform an incoming file from a customer into a file that our current file loading system could handle. Typically, we import CSV (comma-separated values) files, but they provided a TSV (tab-separated values) file of people data. Additionally, some of the values provided needed some work. The file had name fields all lower-cased, the date field was in the wrong format, and, for reasons completely made up for this article, we needed to know how many days until each person’s birthday, or how long since their birthday if it had already passed.
The file also had a header which we didn’t need. Let’s take a look at how we can build this out:
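A sketch of how such a composed transformer might be laid out, assuming the mtdowling/transducers package is installed through Composer and exposes its functions under the Transducers namespace. The stage variables are pass-through placeholders here so the snippet runs on its own, and the $convertToArray name is an assumption; the real callables are built up step by step in the text.

```php
<?php
require 'vendor/autoload.php'; // assumes `composer require mtdowling/transducers` has been run

use Transducers as t;

// Placeholder stages so this sketch is runnable by itself.
$passThrough = function ($row) {
    return $row;
};
$convertToArray = $fixNames = $convertToDate = $addDaysFromBirthday = $fixDateFormat = $passThrough;

$transformer = t\comp(
    t\drop(1),                    // discard the header row
    t\map($convertToArray),       // TSV line -> associative array
    t\map($fixNames),             // capitalize the first and last names
    t\map($convertToDate),        // 'dob' string -> \DateTimeImmutable
    t\map($addDaysFromBirthday),  // add the time_until_bday field
    t\map($fixDateFormat)         // 'dob' object -> formatted date string
);

// With pass-through stages, only the header drop has a visible effect.
$rows = t\to_array(['header', 'row 1', 'row 2'], $transformer);
```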
We’ve got a lot of things happening here that we need to talk about. First of all, we’re using t\comp, which creates a function by composing other functions together. The functions will be executed in the order you specify. The importance of the order will become apparent soon, if it isn’t already.
The first function is one provided by the Transducers package. It will throw away the first row it encounters and let every following row through. This function will serve to drop the header row from the input.
Next, we need to understand that each incoming bit of data that will pass through is starting as a string with tab separated values. We need to turn this into something we can more easily manage, like an array.
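A minimal sketch of such a conversion function. The $convertToArray and $fields names are assumptions; the field list matches the id, first, last and dob columns of the sample row.

```php
<?php
// Column names for the incoming TSV data.
$fields = ['id', 'first', 'last', 'dob'];

// Hypothetical name: turns one tab-separated line into an associative array.
$convertToArray = function ($line) use ($fields) {
    return array_combine($fields, explode("\t", $line));
};

// A line read from a file keeps its trailing newline, which ends up in the last field.
$row = $convertToArray("42\tdavid\tstockton\t1/1/1999\n");
```

Every value comes out of explode() as a string, and the trailing newline is deliberately left alone here so that later steps can deal with it.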
This function explodes the row using tabs, and then combines it with another array representing the field names - what the actual columns represent. In other words, a row that comes in looking like this:
42 david stockton 1/1/1999
will be transformed into an array like this:
[
    'id' => '42',
    'first' => 'david',
    'last' => 'stockton',
    'dob' => '1/1/1999'
]
This transformed array is going to be passed through the next part of our composed function, t\map($fixNames). For $fixNames, I actually want to create a function that creates functions and use that twice to make a new composed function that will fix the capitalization on both the first and last field. As a side note, yes, I realize the futility of trying to properly fix name capitalization programmatically, but please bear with me for the example.
We could easily build a function that would capitalize a hardcoded field value:
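For instance, a hardcoded version for the first name might look like the following sketch. The $fixFirstName name is hypothetical.

```php
<?php
// Hypothetical name: capitalizes the hardcoded 'first' field of a row.
$fixFirstName = function (array $row) {
    $row['first'] = ucfirst($row['first']);
    return $row;
};

$row = $fixFirstName(['first' => 'david', 'last' => 'stockton']);
```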
The function to fix last name would be nearly identical. We need to have a function that takes in a single value, $row, but where the name of the field can actually be provided as a variable. Since our map function needs a function that takes in a row, we need a different way to provide the name of the field we want to capitalize. This is where making a function that returns a function comes in.
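A sketch of that function-returning function. The $ucField name is an assumption made for this example.

```php
<?php
// Hypothetical factory: given a field name, returns a function that
// capitalizes the first letter of that field in a row.
$ucField = function ($field) {
    return function (array $row) use ($field) {
        $row[$field] = ucfirst($row[$field]);
        return $row;
    };
};

$fixFirst = $ucField('first');
$fixLast = $ucField('last');

$row = $fixLast($fixFirst(['first' => 'david', 'last' => 'stockton']));
```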
This code takes in a field name in $field and, through the use of a closure, closes around that field and returns a new function which will accept a row of data and upper-case the first letter of whatever field we passed in. The use bit is where our function is able to essentially reach out into its parent’s scope (the outside function) and use the value of $field that was passed in. It’s a little bit like using PHP’s global, except way more cool and significantly less disgusting. Now we can make two new functions with this and compose them together:
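A sketch of that composition, assuming the mtdowling/transducers package is installed through Composer and that t\comp accepts ordinary single-argument functions. The $ucField factory name is hypothetical and is repeated here so the snippet stands alone.

```php
<?php
require 'vendor/autoload.php'; // assumes `composer require mtdowling/transducers` has been run

use Transducers as t;

// Hypothetical factory: returns a function that capitalizes one field of a row.
$ucField = function ($field) {
    return function (array $row) use ($field) {
        $row[$field] = ucfirst($row[$field]);
        return $row;
    };
};

// One composed function that fixes both name fields.
$fixNames = t\comp($ucField('first'), $ucField('last'));

$row = $fixNames(['first' => 'david', 'last' => 'stockton']);
```

The two functions touch different fields, so the order in which they are applied does not matter here.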
We’re using t\comp again to create a new composed function that will work on both the first and the last name fields. This is assigned to the variable $fixNames. We could have just added the two calls to t\map directly to our transform, but I wanted to demonstrate that you could compose these functions from other composed functions, making the final code more readable and understandable while leaving out a lot of clutter and repeated code. You could follow the same pattern with more complex operations as well.
Next up, we’re running another map function in order to convert the provided month/day/year format into a more standard year-month-day format. For right now, we’re just going to replace the string date representation with a \DateTimeImmutable object since we’ll need it in a moment. Here’s the $convertToDate function:
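A minimal sketch of $convertToDate. The 'm/d/Y' format string is an assumption based on the month/day/year sample data.

```php
<?php
$convertToDate = function (array $row) {
    // trim() strips the trailing newline left on the last field of each line.
    // createFromFormat() returns false when the string cannot be parsed.
    $row['dob'] = DateTimeImmutable::createFromFormat('m/d/Y', trim($row['dob']));
    return $row;
};

$row = $convertToDate(['first' => 'David', 'dob' => "1/1/1999\n"]);
```

Because the format string says nothing about the time of day, createFromFormat() fills in the current time for those fields.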
This one is also pretty straightforward. By using DateTimeImmutable’s createFromFormat function, we can convert the string into an object that makes it easy to do date/time math. I threw in the trim call because in my sample data, when the row is converted to an array, the date field (the last in the row) retains a newline character. This was causing the createFromFormat function to fail parsing the date. We don’t want that, and trim happily removes it. As before, we’re simply placing our transformed value back into the row that was passed in, remembering to return it. If we don’t return a value from a map function, the value sent to the next function will be null, which is probably not what we want.
At this point, we’ve got our original data, but it’s been transformed from TSV to an array, the first and last name fields have had their first letters capitalized, and the ‘dob’ field has been converted to an actual \DateTimeImmutable object. All that’s left is to determine how many days from or until the person’s birthday, attach that data to the row, and then turn the object back into a string.
Since we’ve kept the ‘dob’ field as an object, we don’t need to read or parse the value again; we can just use it. This means the $addDaysFromBirthday function looks like this:
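A sketch of how $addDaysFromBirthday could be written. Two things are assumptions: the exact format strings, and the fixed $now date, which is used here only so the example gives the same result on every run (for real use it would be new DateTimeImmutable()).

```php
<?php
// Fixed date for a reproducible example; use new DateTimeImmutable() for "right now".
$now = new DateTimeImmutable('2016-05-16');
$thisYear = (int) $now->format('Y');

$addDaysFromBirthday = function (array $row) use ($now, $thisYear) {
    $dob = $row['dob'];

    // Same month and day as the date of birth, but in the current year.
    $birthday = $dob->setDate($thisYear, (int) $dob->format('n'), (int) $dob->format('j'));

    $timeUntilBirthday = $now->diff($birthday);

    // invert is 1 when $birthday falls before $now.
    $row['time_until_bday'] = $timeUntilBirthday->invert
        ? $timeUntilBirthday->format('%m months, %d days ago')
        : $timeUntilBirthday->format('%m months, %d days');

    return $row;
};

$row = $addDaysFromBirthday(['dob' => new DateTimeImmutable('1976-08-03')]);
echo $row['time_until_bday'], "\n";
```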
There’s a bit more going on here, but none of it is terribly complicated. The first line creates a \DateTimeImmutable object representing right now. We use that object to extract the current year on the next line. Since we want the same value for all of our calculations, there’s no point in creating and recreating the $now object within the map function, so I’m creating it outside the function and then using a closure to bring it into the function.
The $addDaysFromBirthday function will have access to both the $now and the $thisYear values, and they’ll only need to be calculated once.
Next, we’re extracting the ‘dob’ field from the provided $row data. We determine the value of the person’s birthday by replacing the year they were born with the current year and creating the $birthday object. Next, the $timeUntilBirthday object is created by determining the difference between now and the birthday. If the invert flag has been set, then we know that for this year, that person’s birthday is in the past. Otherwise, it’s still coming up. The format commands will create a nice description of how long until the person’s birthday, or how long since they celebrated. This new data is placed into a new field called time_until_bday and the whole row is returned.
The data row now includes this string. Finally, we need to convert the \DateTimeImmutable object back into a date string that’s in the format we need. The $fixDateFormat function is pretty straightforward, then:
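A minimal sketch of $fixDateFormat, assuming the target format is year-month-day ('Y-m-d').

```php
<?php
$fixDateFormat = function (array $row) {
    // Swap the date object for its formatted string representation.
    $row['dob'] = $row['dob']->format('Y-m-d');
    return $row;
};

$row = $fixDateFormat(['first' => 'David', 'dob' => new DateTimeImmutable('1999-01-01')]);
```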
We simply replace the value in the dob field (the object) with a formatted string representation of that date. Then we return the row.
All these functions are composed together into one that will do all the work we need to convert from an incoming TSV file to an array, fixing name fields, properly formatting date fields and adding new data about the number of months and days until or since a person’s birthday. But so far, we’ve not actually processed any data.
The transducer signature we need to look at will take the composed function and a data source. Since I’d like to provide the data from the file via a generator to keep memory usage down, and process the outgoing file in the same way, I want the result to be an iterator. This means the transformation pipeline we created will be lazily evaluated – that is, each row from the file will be processed individually when we iterate over it.
First, we’ll need a way to read the data and provide it via a generator:
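One way to sketch such a reader is a generator function wrapped around fgets. The readLines() name and the temporary sample file are assumptions made so the snippet can run on its own; a real import would open the uploaded TSV instead.

```php
<?php
// Hypothetical helper: yields one line per iteration from an open file handle.
function readLines($handle)
{
    while (($line = fgets($handle)) !== false) {
        yield $line;
    }
}

// Throwaway sample file standing in for the customer's upload.
$path = tempnam(sys_get_temp_dir(), 'tsv');
file_put_contents($path, "id\tfirst\tlast\tdob\n42\tdavid\tstockton\t1/1/1999\n");

$handle = fopen($path, 'r');
$lines = [];
foreach (readLines($handle) as $line) {
    $lines[] = $line; // only one line is read from disk per iteration
}
fclose($handle);
unlink($path);
```

The lines are collected into an array here only to show what comes out; in the real pipeline each line would be handed straight to the transformer.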
This code opens the TSV file in read mode with fopen, and uses that file handle to create a generator that will yield a single line of that file each time it is iterated. Once the file runs out of data, the generator will stop.
All that is left is to iterate over our transforming composite function with the data and do something with it.
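A sketch of that final loop, assuming the mtdowling/transducers package is installed through Composer. To keep the snippet short and runnable, the transformer is an abridged stand-in (drop the header, then split each line into fields), and an inline generator with made-up lines takes the place of the file reader.

```php
<?php
require 'vendor/autoload.php'; // assumes `composer require mtdowling/transducers` has been run

use Transducers as t;

// Abridged stand-in for the full transformer described in the text.
$transformer = t\comp(
    t\drop(1),
    t\map(function ($line) {
        return array_combine(['id', 'first', 'last', 'dob'], explode("\t", trim($line)));
    })
);

// Hypothetical generator standing in for the file reader.
$lines = (function () {
    yield "id\tfirst\tlast\tdob\n";
    yield "42\tdavid\tstockton\t1/1/1999\n";
})();

// to_iter takes the data source first and the transformer second;
// each row is transformed only when the loop asks for it.
$output = [];
foreach (t\to_iter($lines, $transformer) as $row) {
    $output[] = "{$row['first']} {$row['last']} - {$row['dob']}";
}

echo implode("\n", $output), "\n";
```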
As you can see, the t\to_iter function receives the generator as the data source, and the transforming function $transformer. In this example we are just echoing out the data. The output will look something like this:
[0] Aurelia Hegmann - 1976-08-03 (2 months, 18 days)
[1] Ena Metz - 1979-05-12 (0 months, 4 days ago)
[2] Johathan Terry - 1977-07-21 (2 months, 5 days)
[3] Roderick Hickle - 2008-06-05 (0 months, 20 days)
[4] Tyrel Auer - 2011-11-03 (5 months, 18 days)
This runs quickly, and the memory usage is minimal. Depending on your needs, there are other options for combining your transformation function and your data. If you wanted the result to keep the same type as the input instead of being turned into an iterator like we did above, you could replace t\to_iter with t\xform. The eagerly evaluated functions include t\transduce(), t\into(), t\to_array(), t\to_assoc() and t\to_string().
Additionally, there’s a t\to_fn() function that will give back a function that can be used with PHP’s array_reduce function. The package also provides a number of built-in reducing functions, and it’s possible to build your own transducers as well.
Included Transducer Functions
The Transducers.php package provides quite a few transducing functions. I’ll briefly touch on the other included functions:
- map(callable $f) - Applies the map function $f to each value in a collection
- filter(callable $predicate) - Filters (removes) values from a collection that do not satisfy the predicate
- remove(callable $predicate) - Removes values that do satisfy the $predicate function
- cat() - Concatenates items from nested lists
- mapcat(callable $map) - Applies a map function to the collection and concatenates them to one less nesting level
- flatten() - Takes nested combination of sequential items and returns a single flattened sequence
- partition($size) - Takes the source and splits into arrays of the specified $size. If there are not enough to evenly split, the final array will have the remainder.
- partition_by(callable $predicate) - Splits the inputs into arrays each time the callable $predicate changes to a different value.
- take($n) - Takes $n values from the collection
- take_while(callable $predicate) - Takes from the collection while the $predicate function returns true
- take_nth($n) - Takes every nth value from a sequence of values
- drop($n) - Drops $n items from the start of a sequence
- drop_while(callable $predicate) - Drops values from the sequence as long as the $predicate function returns true
- replace(array $map) - Replaces values from the sequence according to the map.
- keep(callable $f) - Keeps items where the $f function doesn’t return null
- keep_indexed(callable $f) - Returns the non-null results of calling $f($index, $value)
- dedupe() - Given an ordered sequence, it will remove values that are the same as the previous value
- interpose($separator) - Adds the separator between each item in a sequence
- tap(callable $interceptor) - Tap will “tap into” the chain, to do something with the intermediate result. It doesn’t change the sequence.
- compact() - Trims out all “falsey” values from the sequence
- words() - Splits the input into words
- lines() - Splits the input by lines
Conclusion
Transducers and thinking about functional programming allow us to process streams of data in a way that’s easier to understand than a large foreach. Composing map, reduce and filter functions allows for powerful transformation and processing chains that are easy to understand, simple to test and extremely useful. At this point I’ve used this package in a couple of different projects that are in production. It reminds me a bit of how middleware works but on a more generic, less specific to web requests level. I highly recommend taking a look at transducers and playing around with functional programming. For me, it was a lot of fun. See you next month.
David Stockton is a husband, father and Software Engineer and builds software in Colorado, leading a few teams of software developers. He's a conference speaker and an active proponent of TDD, APIs and elegant PHP. He's on twitter as @dstockto, YouTube at youtube.com/dstockto, and can be reached by email at levelingup@davidstockton.com.
2016-06-01 18:00 -0600