Abstract
Storing large files in an Oracle database can be expensive and cumbersome. The expense comes from the size of the online disk storage and the effort of running backups with archive logs. Oracle Cloud Infrastructure introduced a REST API for large object storage (multimedia files, etc.). This feature has been available with Amazon Web Services (AWS) for years. I wrote and published an API for AWS S3 storage using PL/SQL in 2016 and 2017. With AWS S3, a development team can store binary large objects (BLOBs / multimedia files) reliably, securely, and cost-effectively. In this article, I’ll explore the advantages and the updates I have made to the AWS S3 package.
Operational Challenge
When I went to play with AWS S3 in 2016, I discovered that the original work by Morten Braten and Jason Straub predated AWS’ version 4 security protocol. I no longer remember why I spent weeks working through my AWS4_S3_PKG, which I published here: https://github.com/cmoore-sp/plsql-aws-s3. I am glad I did the work then.
Storm Petrel LLC earned a contract with the Government of Puerto Rico to support recovery efforts following the twin Hurricanes Irma and Maria that devastated the island during the autumn of 2017. By early 2018, we had 2,000 users on our Oracle APEX grants management application, Tempest-GEMS. Users entered financial data, then uploaded supporting documentation (invoices, contracts) as PDFs to our application. We stored these documents as BLOBs in an Oracle table. We struggled to keep up with the demand for storage space. We also had an API that permitted specialty projects to upload documents. A few times, we experienced 20,000 new PDFs in a single evening’s upload.
Data costs soared and our document table’s performance suffered. When a document was uploaded, we stored it in the table. Within 15 minutes, Oracle duplicated it into the transaction log. Each evening, we ran full backups. Each weekend, we imaged our servers. One copy became two, two became four. If your archive/transaction log fills its allocated disk space, Oracle stops operating. We spent weeks chasing the exponential growth of our data storage.
Online disk space at AWS gets expensive when you get to very large sizes.
We created a support problem. We created a problem with escalating costs. We created a performance problem.
We decided to push our documents to external, long-term storage. Oracle Cloud Infrastructure’s product did not exist at the time. Today, it would serve nicely as a solution. In February and March of 2018, we rapidly implemented a process to store documents at AWS S3.
Storm Petrel’s Workflow
Our workflow took a very conservative approach. Our team supported the people of Puerto Rico through the early stages of a long-term financial recovery process. We had to avoid both failures and performance issues. People all over the islands were using mobile phones to push documents up to our application. We observed that pushing through Oracle directly to AWS S3 slowed the upload process – there were two hops between the user on a mobile network and the affirmative acknowledgement from AWS S3.
Storm Petrel opted to continue uploading the documents into a BLOB column within our documents table. We added several columns to this table:
- S3 Bucket
- S3 Prefix (file “path”)
- S3 Filename
- S3 eTag
- Document Filesize
- S3 Version ID
- S3 Last Modified Date
- S3 Filesize
- S3 Transferred
- S3 Confirmed
- Status
These columns may appear redundant, but they provided us with the incremental data we needed to be confident in our process.
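As a rough illustration, the extra columns might be added along these lines. This is a sketch only: the column names and data types are my placeholders, not our exact production DDL.

```sql
-- Illustrative only: column names and types are assumptions, not the production schema.
alter table documents add (
  s3_bucket         varchar2(255),
  s3_prefix         varchar2(1000),  -- the file "path" within the bucket
  s3_filename       varchar2(255),
  s3_etag           varchar2(100),
  doc_filesize      number,          -- size of the document as stored locally
  s3_version_id     varchar2(255),
  s3_last_modified  date,
  s3_filesize       number,          -- size reported back by S3
  s3_transferred    varchar2(1) default 'N',
  s3_confirmed      varchar2(1) default 'N',
  status            varchar2(10)     -- e.g. 'ONLY' (table only) or 'BOTH' (table and S3)
);
```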
- A document/multimedia file is uploaded to the table
- The status is set to identify that the document remains in the table “only”
- A DBMS scheduler job pushes the documents to S3
- The status is set to identify that the document is in both the table and S3 (“both”)
- We capture the S3 information such as eTag, version ID, last modified date, and filesize
- A separate DBMS scheduler job confirms the S3 status; it runs hourly
- Weekly, on Sundays, we remove the BLOBs from the Oracle table that we have confirmed are present at S3 (the scheduler jobs are sketched after the next paragraph)
We could have been more aggressive about removing BLOBs from our table, executing the job daily. We would have saved space, and thus storage costs, but we took a balanced approach between cost and reliability. We run a full system image backup on Saturday nights. By deleting the BLOBs on Sunday, we gave ourselves a recovery path.
We executed this process on the fly while in production with thousands of simultaneous users.
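The three scheduler jobs behind this workflow could be created along the lines below. This is a minimal sketch: the job names, procedure names, and intervals are illustrative placeholders, not our production definitions.

```sql
begin
  -- Push anything still marked as table-"only" up to S3 (runs frequently).
  dbms_scheduler.create_job(
    job_name        => 'PUSH_DOCS_TO_S3',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'doc_s3_pkg.push_documents',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=15',
    enabled         => true);

  -- Confirm the S3 copies hourly by re-reading the object headers.
  dbms_scheduler.create_job(
    job_name        => 'CONFIRM_S3_DOCS',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'doc_s3_pkg.confirm_documents',
    repeat_interval => 'FREQ=HOURLY',
    enabled         => true);

  -- On Sundays, remove BLOBs that have been confirmed as present at S3.
  dbms_scheduler.create_job(
    job_name        => 'PURGE_CONFIRMED_BLOBS',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'doc_s3_pkg.purge_confirmed_blobs',
    repeat_interval => 'FREQ=WEEKLY;BYDAY=SUN',
    enabled         => true);
end;
/
```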
By the end of 2018, we stored over 400,000 PDF documents. Those documents required 1.8 TB of storage at AWS S3. These were the financial records of over $5 billion in U.S. grant moneys provided to the people and organizations of Puerto Rico – all of whom are U.S. citizens.
Storm Petrel did not win the long-term contract with Puerto Rico. That contract went to a much larger multi-national firm based in Canada. In December of 2018, we were tasked with providing the new contractor with the documents. AWS S3 simplified the process of handing over the documents. The documents were isolated from our servers and databases, so the hand-off posed very little risk to our systems.
On-Going Document Storage
Storm Petrel continues to support states and other organizations managing data related to disasters, including the COVID-19 pandemic. The document storage process we implemented for Puerto Rico remains in place 2½ years later. The download process in Oracle APEX pulls from AWS S3 (or the Oracle table) seamlessly and rapidly. The user is never aware of the location, and the download speed is fantastic (given your last-mile connection to Ye Olde Internet).
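A simplified sketch of that download branch is below. Only Put_Object and Object_Head are named elsewhere in this article, so the get_object-style call, the parameter names, and the document columns here are assumptions for illustration only.

```sql
-- Sketch only: aws4_s3_pkg.get_object and the column names are assumed for illustration.
create or replace procedure download_document (p_doc_id in number)
is
  l_doc  documents%rowtype;
  l_blob blob;
begin
  select * into l_doc from documents where doc_id = p_doc_id;

  if l_doc.status = 'BOTH' and l_doc.blob_content is null then
    -- The local BLOB has already been purged; fetch the copy from S3.
    l_blob := aws4_s3_pkg.get_object(
                p_bucket_name => l_doc.s3_bucket,
                p_key         => l_doc.s3_prefix || l_doc.s3_filename);
  else
    -- Still in the table; serve it directly.
    l_blob := l_doc.blob_content;
  end if;

  -- Hand the BLOB to the browser; the user never sees where it came from.
  sys.owa_util.mime_header(l_doc.mime_type, false);
  sys.htp.p('Content-Length: ' || dbms_lob.getlength(l_blob));
  sys.htp.p('Content-Disposition: attachment; filename="' || l_doc.filename || '"');
  sys.owa_util.http_header_close;
  sys.wpg_docload.download_file(l_blob);
end download_document;
/
```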
Challenges with Spanish
In the early days of pushing documents to AWS S3, we encountered challenges with filenames written in Spanish. Several of the characters commonly found on Spanish-language keyboards do not encode well for HTTP. We quickly learned to use the document table’s primary key as the filename for S3. The filenames are entirely and totally ugly on AWS S3, given that they are all strings of numbers.
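The fix amounts to little more than a single statement; the column names below are illustrative placeholders.

```sql
-- Use the table's numeric primary key as the S3 filename, side-stepping characters
-- (ñ, á, ü, ...) that do not encode cleanly for HTTP. The original, human-readable
-- filename stays in the documents table for display.
update documents
   set s3_filename = to_char(doc_id)   -- e.g. '1234567' rather than 'Factura año 2018.pdf'
 where s3_filename is null;
```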
PL/SQL AWS S3 Package Updated
Storm Petrel has written several application programming interfaces (APIs) in recent years. I revisited my own work from 2016/2017. Some of that code includes work that dates to 2011. I strove to consolidate the code more aggressively. The package has lost about 400 lines of code with no loss of performance.
AWS Workflow
Following the workflow above, these are the procedures/functions that get called (a condensed sketch of both jobs follows the list):
- A DBMS scheduler job pushes the documents to S3
- It calls a procedure that loops through the document table for any document not yet pushed to S3
- That procedure calls AWS4_S3_PKG.Put_Object, copying the BLOB from the Oracle table to a designated AWS S3 bucket. The “key”, to AWS, is the file path.
- A separate DBMS scheduler job confirms the S3 status. It runs hourly.
- It calls procedure AWS4_S3_PKG.Object_Head, which retrieves the HTTP “head” or header data from AWS. This biographical information provides us with the eTag, size, date, and last-modified values in the headers returned from AWS (apex_web_service.g_headers).
- We store the AWS object header data in our documents table after verifying the match.
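Here is a condensed sketch of both jobs. Put_Object and Object_Head are the package calls described above, but the parameter names, wrapper procedures, and table columns are assumptions for illustration, not the production code.

```sql
-- Sketch only: parameter names, cursors, and columns are assumed for illustration.
create or replace procedure push_documents is
begin
  for r in (select doc_id, s3_bucket, s3_prefix, s3_filename, blob_content, mime_type
              from documents
             where status = 'ONLY')                       -- not yet pushed to S3
  loop
    aws4_s3_pkg.put_object(
      p_bucket_name  => r.s3_bucket,
      p_key          => r.s3_prefix || r.s3_filename,     -- the "key" is the file path
      p_object       => r.blob_content,
      p_content_type => r.mime_type);

    update documents
       set s3_transferred = 'Y',
           status         = 'BOTH'
     where doc_id = r.doc_id;
    commit;
  end loop;
end push_documents;
/

create or replace procedure confirm_documents is
  l_etag documents.s3_etag%type;
  l_size documents.s3_filesize%type;
begin
  for r in (select doc_id, s3_bucket, s3_prefix, s3_filename
              from documents
             where s3_transferred = 'Y'
               and s3_confirmed   = 'N')
  loop
    -- Object_Head issues an HTTP HEAD request; the answers land in
    -- apex_web_service.g_headers.
    aws4_s3_pkg.object_head(
      p_bucket_name => r.s3_bucket,
      p_key         => r.s3_prefix || r.s3_filename);

    l_etag := null;
    l_size := null;
    for i in 1 .. apex_web_service.g_headers.count loop
      if lower(apex_web_service.g_headers(i).name) = 'etag' then
        l_etag := apex_web_service.g_headers(i).value;
      elsif lower(apex_web_service.g_headers(i).name) = 'content-length' then
        l_size := to_number(apex_web_service.g_headers(i).value);
      end if;
    end loop;

    update documents
       set s3_etag      = l_etag,
           s3_filesize  = l_size,
           s3_confirmed = 'Y'
     where doc_id = r.doc_id;
    commit;
  end loop;
end confirm_documents;
/
```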
We do not provide a means of deleting BLOBs, nor the corresponding files on S3. We mark rows as “archived”. All of our queries, reports, and procedures disregard data identified as archived (they filter on archived = ‘N’). We had engaged in this practice long before arriving in Puerto Rico. Given that we work with financial funds related to the U.S. government, we tag data changes with date/time stamps, the username, and the IP address of the change.
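A minimal sketch of that kind of stamping, assuming hypothetical updated_on, updated_by, and updated_ip columns:

```sql
-- Illustrative audit stamping; the column names are assumptions.
create or replace trigger documents_audit_biu
  before insert or update on documents
  for each row
begin
  :new.updated_on := systimestamp;
  :new.updated_by := coalesce(sys_context('APEX$SESSION', 'APP_USER'),
                              sys_context('USERENV', 'SESSION_USER'));
  -- IP address as seen by the database; behind ORDS or a proxy you may prefer
  -- the X-Forwarded-For header instead.
  :new.updated_ip := sys_context('USERENV', 'IP_ADDRESS');
end;
/
```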
Of course, we did encounter a situation in 2018 where the government informed us that we’d lost 5,000 documents. They magically disappeared on a Saturday morning. Within an hour, we provided proof that the documents existed but had been marked “archived” by a specific user. We provided the date, time, and IP address of the actions – actions that could only have been executed with deliberate intent.
GitHub
I did nothing to maintain the PL/SQL code for the AWS S3 interface at my GitHub site over the last few years. I know we ran into issues with bugs and such. The most common error is that our Oracle code and AWS disagreed about the time; often it was the time zone offset from UTC.
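For anyone chasing the same error: the request timestamp that AWS validates (the x-amz-date value used by the Signature Version 4 protocol) must be expressed in UTC, in ISO 8601 basic format. A minimal way to build it in PL/SQL, regardless of the database server’s time zone:

```sql
-- Build the SigV4-style timestamp in UTC; deriving it from SYSDATE on a server
-- whose time zone is not UTC is the classic way to have AWS reject the request.
declare
  l_amz_date varchar2(20);
begin
  l_amz_date := to_char(sys_extract_utc(systimestamp), 'YYYYMMDD"T"HH24MISS"Z"');
  dbms_output.put_line(l_amz_date);   -- e.g. 20180315T142233Z
end;
/
```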
I am not going to reconcile the old code with the new. Instead, I am posting the code we have run in production. It is backwards compatible.
Please take from it as you will and enjoy.
The GitHub link is: https://github.com/cmoore-sp/plsql-aws-s3
Hey, of course, confidential stuff has been replaced with ‘xxxxx’. Do substitute your own API keys and such where needed.