Dataset and Schemaless Relational Databases

tldr: use relational databases the way they were intended to be used.

I’ve been using dataset on and off when tinkering with Postgres and SQLite databases. It’s not a bad library if you are strictly pulling rows from tables. It allows for quick testing and prototyping.

Recently, I used it to build a Reddit bot that compares perceptual hashes of images to previous submissions to detect duplicate posts.

This project started out as a proof-of-concept to see how viable it would be to try to quickly and accurately discern the similarity between two images.

It was surprisingly easy. dataset was used to stick threads, image URLs and the corresponding p-hashes of images in a database. For this purpose, it was quick and simple to use.

On top of that, I built a small bot that posts to Reddit to see how it would be received by different communities. dataset continued to be used to quickly and easily store data.

Beyond the tinkering phase, dataset combines the disadvantages of squeezing NoSQL databases into use-cases they weren’t meant for with problems that only arise when you try to abstract away an RDBMS behind a schema-less interface.

One cannot escape the fact that relational databases have a schema. dataset simply generates a schema and hides the details from the user.

That’s cute and fun, but the magic and convenience have unseen costs.

Schemas can describe tables and their column types. Column types allow for efficient storage and can indicate to clients the types of data will be accepted or returned. Naturally, if a column of a row in a table is of the type int its type should be reflected in the database as an Integer type. Brilliantly, dataset generates a schema with column types that reflect the data that is inserted into the database.

What happens when a column’s data does not match the column type originally designated by dataset?

The resultant behavior might as well be “undefined”. dataset will throw it’s hands up and change the column-type to ‘Text’ and be done with it. With that, all columns in rows that were previously inserted have now changed types. Any new data will be coerced to fit into the column’s new type. Client code that interacts with the database must reflect the new column-type lest bad things happen.

All of the above happens silently. No exception is thrown. I believe this is intentional.

The only way to know if there is an error is to check input before inserting it into the database and after fetching it.

Therefore, you will have to write the schema’s constraints in code and check data at runtime. In essence, you will rewrite some of the functionality the RDBMS provides just to run around the RDBMS.

The other option is to ditch the schema-less abstraction the second you’re done tinkering and to use a real database solution if you see your project progressing past the ‘I wonder if this would work’ stage.

As an aside, I’d occasionally have to look behind dataset’s abstraction over SQLAlchemy and even dip into SQL to query the database. That’s not fun.

Comments