{"id":838531,"date":"2022-04-22T11:49:40","date_gmt":"2022-04-22T18:49:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=838531"},"modified":"2022-04-25T09:12:44","modified_gmt":"2022-04-25T16:12:44","slug":"getting-deterministic-results-from-sparks-randomsplit-function","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/getting-deterministic-results-from-sparks-randomsplit-function\/","title":{"rendered":"Getting Deterministic Results from Spark’s randomSplit Function: A Deep Dive"},"content":{"rendered":"\n
Authors: <\/h6>\n\n\n\n

<\/p>\n\n\n\n

Tommy Guy<\/a> and Kidus Asfaw<\/p>\n\n\n\n

We noticed an odd case of nondeterminism in Spark\u2019s randomSplit function, which is often used to generate test\/train data splits for machine learning training scripts. Other posts, notably\u00a0this one<\/span><\/a>,\u00a0diagnose the problem, but a few details still need to be spelled out. We also want to suggest an alternative to randomSplit that guarantees determinism.<\/p>\n\n\n\n

The Problem<\/h4>\n\n\n\n

If you want to split a data set 80\/20 in Spark, you call df.randomSplit([0.80, 0.20], seed), where seed is an integer used to seed the random number generator. Fixing the seed is a common way to force determinism. But in this case, it doesn\u2019t work! In some cases (we\u2019ll identify exactly which cases below), randomSplit will:<\/p>\n\n\n\n