Like any large entity that’s been around for many, many decades, the MTA has its fair share of legacy IT systems. With the transition from MetroCards to OMNY well underway, the system that keeps track of MetroCard swipes is one such system.
While this system has kept MetroCards working throughout the bus and subway networks for quite a few years now, it was not built with modern analytics needs in mind. So getting granular MetroCard data onto our data lake was an important project in our ongoing efforts to democratize access to data both inside and outside the MTA.
Setting up a process to start ingesting detailed ridership data into our new data lake platform was straightforward enough, and that data has been publicly available since May 2023. But backfilling historic MetroCard data proved to be a bit of an adventure.
Goal: Minimize historical backfill burden on legacy mainframe
The relevant piece of the MetroCard system runs on an IBM mainframe, and our mainframe team does a good job of keeping that up and running day-to-day. It’s still a legacy system, though, so building out a seemingly simple solution like writing and executing a script to iteratively feed historic files to the pipeline we use for daily exports is not as easy as it sounds. Between our quite busy mainframe colleagues’ bandwidth constraints and the technical challenges of having to work with the computational resources of yesteryear, we determined that the best way to get this project done in a timely fashion would be to rework our process to just get raw data off the mainframe and handle all processing elsewhere.
Recreating the SAS code the mainframe uses to reformat data didn’t prove too difficult, but it turned out that we had to tackle a more fundamental problem before we could even start.
Trouble with raw data
First, a little background. One of the things the SAS script does is convert fields stored using an old encoding called packed decimal to a more readable format. The idea behind packed decimal encoding is that individual digits are represented using four bits, or only half a byte, rather than using an entire byte for a digit. So for instance, the two digits 81 could be encoded in a single byte as 10000001. (1000 is 8 in binary and 0001 is 1 in binary—put the two pieces together and you get 1000 0001 → 10000001.)
Apparently, people in the early days of programming loved a good pun: the technical term for these four-bit pieces is “nibble” (as in a small bite/small byte).
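To make the mechanics concrete, here is a minimal sketch of unpacking a packed decimal field in Python; the function name is made up for illustration, and the handling of the trailing F nibble follows the simplified description above rather than the mainframe’s actual record layout.

```python
def unpack_packed_decimal(field: bytes) -> str:
    """Split each byte into its two four-bit nibbles and read them as digits."""
    nibbles = []
    for byte in field:
        nibbles.append(byte >> 4)    # high nibble
        nibbles.append(byte & 0x0F)  # low nibble
    # Drop the trailing F nibble and keep the decimal digits.
    return "".join(str(n) for n in nibbles[:-1])

# The five bytes 02 02 40 10 1F unpack to the digits 020240101,
# i.e. the 2024-01-01 example above with its leading 0 (and the trailing F removed).
print(unpack_packed_decimal(bytes.fromhex("020240101F")))  # 020240101
```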
Anyway, our mainframe colleagues sent us a sample file, and after breaking out the dreaded hex editor to see what this file looked like on the byte level, we ran into trouble just three bytes in. We were expecting the first field to be a date, and to look something like 020240101F (0 and F denote the start and end of a record, so 020240101F would be what the record for 2024-01-01 would look like). Instead we saw this: 0201D82F0F.
Two breakthroughs
After banging our heads against the wall just enough (or maybe a little too much), we finally had our first big breakthrough when we asked our mainframe colleagues to pull up the file they had sent us directly on the mainframe system itself. When they did, we noticed that the troublesome third and fourth bytes showed up as 80 and 61 instead of D8 and 2F, giving 020180610F instead of 0201D82F0F, and revealing the perfectly reasonable date 2018-06-10 in place of the mysterious date 2018-D8-2F.
This was a big step. Now instead of trying to solve the nebulous problem of what the date 2018-D8-2F represented, we could deal with the much more concrete issue of figuring out how and why the file we received was getting garbled.
There were several places where bytes were getting reencoded, so we tried to draw out the patterns we were seeing. Pretty soon our notepads began to look like the tangled mess of lines, symbols, and numbers that an overly zealous conspiracy theorist might come up with.
What finally clued us in was a byte where 81 was getting reencoded as 61.
The mainframe uses an encoding scheme called EBCDIC that’s pretty specific to IBM mainframes (EBCDIC may also be the only encoding scheme whose Wikipedia page has a “Criticism and humor” section). In EBCDIC, the character “a” is encoded as 81, whereas essentially any other modern encoding scheme would encode “a” as 61.
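Python ships codecs for several EBCDIC code pages, so the difference is easy to see for yourself; the snippet below assumes code page 037, one common EBCDIC variant, which is not necessarily the exact code page our mainframe uses.

```python
# "a" under EBCDIC (code page 037) versus ASCII
print("a".encode("cp037").hex())  # 81
print("a".encode("ascii").hex())  # 61
```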
This led us to the root cause of the issue we were seeing. The software the mainframe uses to export data was automatically reencoding EBCDIC to a more widely used standard to ensure the exported data could be read by other systems. For most use cases this would be great. However, since dealing with packed decimal meant we needed to see the raw bytes, any reencoding at all was going to cause trouble.
Final steps and lessons learned
Continuing in the spirit of avoiding any changes to mainframe code, rather than tweaking the mainframe’s export script to skip this reencoding, we opted to resolve the issue by writing a Python script that reverses the conversion and recovers the original bytes. The mainframe team was then able to send us all the raw files fairly quickly using the default reencoding, and from there it was just a matter of processing each file on our platform.
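As a rough illustration of the kind of reversal involved (not our actual script), the sketch below assumes the export software converted EBCDIC code page 037 to Latin-1; the real code pages on our systems may differ.

```python
def recover_ebcdic(reencoded: bytes) -> bytes:
    """Undo an assumed EBCDIC (cp037) -> Latin-1 conversion to recover the original bytes."""
    return reencoded.decode("latin-1").encode("cp037")

# The garbled bytes D8 2F map back to 80 61, turning the mysterious
# 0201D82F0F from the sample file back into 020180610F (2018-06-10).
print(recover_ebcdic(bytes.fromhex("0201D82F0F")).hex().upper())  # 020180610F
```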
In addition to being quite a fun detective adventure, this project taught us some important lessons about working with legacy systems. Perhaps the most critical one is not to be intimidated by unfamiliar terminology. While it can be a bit painful, taking at least a few hours to develop some baseline familiarity with the lingo of these systems goes a long way towards being able to collaborate effectively with the teams that maintain them.
With this project now in the books, we’re excited to share the following data sets on Open Data:
- Subway Hourly Ridership back to 2020
- Subway Origin-Destination Ridership Estimates for 2020, 2021, and 2022
Keep an eye out for additional historic ridership data releases in the near future!
About the author
Eli Schultz is a data engineer with the Data & Analytics team.