Channel: Planet Python

Peter Bengtsson: How to JSON schema validate 10x (or 100x) faster in Python

This is perhaps insanely obvious, but it was a measurement I had to do, and it might help you too if you use python-jsonschema a lot.

I have this project which has a migration script that needs to transfer about 1M records from one PostgreSQL database, transform them a bit, validate them, and store them in another PostgreSQL database. The validation step was done like this:

    from jsonschema import validate

    ...

    with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
        # safe_load, since newer PyYAML rejects yaml.load() without a Loader
        SCHEMA = yaml.safe_load(f)["schema"]

    ...

    class Build(models.Model):
        ...

        @classmethod
        def validate_build(cls, build):
            validate(build, SCHEMA)

That works fine when you have a slow trickle of these coming in, many seconds or minutes apart. But when you have to do about 1M of them, the overhead starts to really matter. Granted, in this context it's just a migration, which is hopefully only done once, but it helps if it doesn't take too long, since that makes it easier to avoid downtime.

What about python-fastjsonschema?

The name python-fastjsonschema sounds very appealing, but I'm not sure how mature it is or what the subtle differences are between it and the more established python-jsonschema, which I was already using.

There are two ways of using it. Either...

    fastjsonschema.validate(schema, data)

...or...

    validator = fastjsonschema.compile(schema)
    validator(data)

That got me thinking: why don't I just do that with regular python-jsonschema?
All you need to do is crack open the validate function, and you can then reuse one instance for multiple pieces of data:

    from jsonschema.validators import validator_for

    klass = validator_for(schema)
    klass.check_schema(schema)  # optional
    instance = klass(schema)
    instance.validate(data)
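A minimal, self-contained sketch of the same idea, assuming python-jsonschema is installed. The schema and the three records are made up for illustration and are not from the project. At the time of writing, `jsonschema.validate()` looks up the validator class, checks the schema, and builds a new instance on every call, so the sketch does that setup once and then confirms that both approaches agree on which records are valid.

```python
from jsonschema import ValidationError, validate
from jsonschema.validators import validator_for

# Hypothetical schema and records, made up for this example.
SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    },
    "required": ["id", "name"],
}
RECORDS = [
    {"id": 1, "name": "alpha"},
    {"id": "2", "name": "beta"},  # invalid: id is a string
    {"id": 3},  # invalid: name is missing
]

# One-time setup: pick the validator class for the schema's draft,
# check the schema itself, and build a single instance.
validator_class = validator_for(SCHEMA)
validator_class.check_schema(SCHEMA)
validator = validator_class(SCHEMA)


def is_valid_default(record):
    """Validate with jsonschema.validate(), which repeats the setup per call."""
    try:
        validate(record, SCHEMA)
    except ValidationError:
        return False
    return True


def is_valid_reused(record):
    """Validate with the prebuilt instance, so only the record check runs."""
    try:
        validator.validate(record)
    except ValidationError:
        return False
    return True


default_results = [is_valid_default(record) for record in RECORDS]
reused_results = [is_valid_reused(record) for record in RECORDS]
print(default_results, reused_results)
```

Both lists should be identical, since the reused instance applies the same schema; only the per-call setup is skipped.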

I rewrote my project's code to this:

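    from jsonschema.validators import validator_for

    ...

    with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
        # safe_load, since newer PyYAML rejects yaml.load() without a Loader
        SCHEMA = yaml.safe_load(f)["schema"]

    # Build the validator once, at import time, and reuse it for every record.
    _validator_class = validator_for(SCHEMA)
    _validator_class.check_schema(SCHEMA)
    validator = _validator_class(SCHEMA)

    ...

    class Build(models.Model):
        ...

        @classmethod
        def validate_build(cls, build):
            validator.validate(build)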

How do they compare, performance-wise?

Let this simple benchmark code speak for itself:

    from buildhub.main.models import Build, SCHEMA

    import fastjsonschema
    from jsonschema import validate, ValidationError
    from jsonschema.validators import validator_for


    def f1(qs):
        for build in qs:
            validate(build.build, SCHEMA)


    def f2(qs):
        validator = validator_for(SCHEMA)
        for build in qs:
            validate(build.build, SCHEMA, cls=validator)


    def f3(qs):
        cls = validator_for(SCHEMA)
        cls.check_schema(SCHEMA)
        instance = cls(SCHEMA)
        for build in qs:
            instance.validate(build.build)


    def f4(qs):
        for build in qs:
            fastjsonschema.validate(SCHEMA, build.build)


    def f5(qs):
        validator = fastjsonschema.compile(SCHEMA)
        for build in qs:
            validator(build.build)


    # Reporting

    import time
    import statistics
    import random

    functions = f1, f2, f3, f4, f5
    times = {f.__name__: [] for f in functions}

    for _ in range(3):
        qs = list(Build.objects.all().order_by("?")[:1000])
        for func in functions:
            t0 = time.time()
            func(qs)
            t1 = time.time()
            times[func.__name__].append((t1 - t0) * 1000)


    def f(ms):
        return f"{ms:.1f}ms"


    for name, numbers in times.items():
        print("FUNCTION:", name, "Used", len(numbers), "times")
        print("\tBEST  ", f(min(numbers)))
        print("\tMEDIAN", f(statistics.median(numbers)))
        print("\tMEAN  ", f(statistics.mean(numbers)))
        print("\tSTDEV ", f(statistics.stdev(numbers)))

In other words, in each of 3 rounds, every alternative implementation validates the same 1,000 randomly selected records, each a JSON blob (technically a Python dict) of around 1KB in size.
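The benchmark above depends on the project's Django models, so it can't be run elsewhere as is. Here is a rough standalone sketch of the same timing and reporting pattern using only the standard library. The `benchmark` and `summarize` helpers and the two stand-in functions are hypothetical names for illustration. Unlike the original, it reuses one list of records across rounds instead of sampling a fresh 1,000 each time, and it uses `time.perf_counter()` rather than `time.time()`.

```python
import statistics
import time


def benchmark(functions, records, rounds=3):
    """Time each function over the same records, once per round.

    Returns {function name: [elapsed milliseconds, one entry per round]}.
    """
    times = {func.__name__: [] for func in functions}
    for _ in range(rounds):
        for func in functions:
            t0 = time.perf_counter()  # monotonic, high-resolution clock
            func(records)
            t1 = time.perf_counter()
            times[func.__name__].append((t1 - t0) * 1000)
    return times


def summarize(numbers):
    """Reduce one function's timings to the four figures used in the report."""
    return {
        "best": min(numbers),
        "median": statistics.median(numbers),
        "mean": statistics.mean(numbers),
        "stdev": statistics.stdev(numbers),  # needs at least two rounds
    }


# Hypothetical stand-ins for f1..f5: any callable that takes a list of records.
def check_with_isinstance(records):
    for record in records:
        isinstance(record, dict)


def check_with_type(records):
    for record in records:
        type(record) is dict


records = [{"id": n} for n in range(1000)]
results = benchmark([check_with_isinstance, check_with_type], records)
for name, numbers in results.items():
    summary = summarize(numbers)
    print(name, {key: f"{value:.3f}ms" for key, value in summary.items()})
```

To compare real validators, swap the stand-ins for functions shaped like `f1` to `f5` and pass in your own list of records.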

The results:

FUNCTION: f1 Used 3 times
    BEST   1247.9ms
    MEDIAN 1309.0ms
    MEAN   1330.0ms
    STDEV  94.5ms
FUNCTION: f2 Used 3 times
    BEST   1266.3ms
    MEDIAN 1267.5ms
    MEAN   1301.1ms
    STDEV  59.2ms
FUNCTION: f3 Used 3 times
    BEST   125.5ms
    MEDIAN 131.1ms
    MEAN   133.9ms
    STDEV  10.1ms
FUNCTION: f4 Used 3 times
    BEST   2032.3ms
    MEDIAN 2033.4ms
    MEAN   2143.9ms
    STDEV  192.3ms
FUNCTION: f5 Used 3 times
    BEST   16.7ms
    MEDIAN 17.1ms
    MEAN   21.0ms
    STDEV  7.1ms

In short, if you use python-jsonschema and create a reusable instance, it's 10 times faster than the "default way". And if you do the same with python-fastjsonschema, it's close to 100 times faster (about 75x going by the medians above).

By the way, in version f5 it validated 1,000 1KB records in 16.7ms. That's insanely fast!
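To put those figures in terms of the migration, here is a small back-of-the-envelope calculation. The timings are copied from the results above. The projection to 1M records assumes validation time scales linearly with the number of records, which the benchmark does not verify.

```python
# Figures copied from the results above, in milliseconds per 1,000 records.
MEDIAN_MS = {"f1": 1309.0, "f2": 1267.5, "f3": 131.1, "f4": 2033.4, "f5": 17.1}
BEST_MS = {"f1": 1247.9, "f3": 125.5, "f5": 16.7}

BATCH_SIZE = 1_000
TOTAL_RECORDS = 1_000_000  # the "about 1M records" of the migration

# Speedup of each variant relative to the default jsonschema.validate() call.
speedup = {name: MEDIAN_MS["f1"] / ms for name, ms in MEDIAN_MS.items()}

# Projected validation time for the whole migration (assumption: linear scaling).
projected_seconds = {
    name: ms / 1000 * (TOTAL_RECORDS / BATCH_SIZE) for name, ms in BEST_MS.items()
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x relative to f1")
for name, seconds in projected_seconds.items():
    print(f"{name}: about {seconds / 60:.1f} minutes for {TOTAL_RECORDS:,} records")
```

Under that assumption, the validation step alone would drop from roughly 21 minutes with the default call to about 2 minutes with a reused python-jsonschema instance, and to under 20 seconds with a compiled python-fastjsonschema validator.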

