{"id":838531,"date":"2022-04-22T11:49:40","date_gmt":"2022-04-22T18:49:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=838531"},"modified":"2022-04-25T09:12:44","modified_gmt":"2022-04-25T16:12:44","slug":"getting-deterministic-results-from-sparks-randomsplit-function","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/getting-deterministic-results-from-sparks-randomsplit-function\/","title":{"rendered":"Getting Deterministic Results from Spark&#8217;s randomSplit Function: A Deep Dive"},"content":{"rendered":"\n<h6 id=\"authors\">Authors: <\/h6>\n\n\n\n<p><\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/riguy\/\" target=\"_blank\" rel=\"noreferrer noopener\">Tommy Guy<\/a> and Kidus Asfaw<\/p>\n\n\n\n<p>We noticed an odd case of nondeterminism in Spark\u2019s randomSplit function, which is often used to generate test\/train data splits for Machine Learning training scripts. There are other posts, notably\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/medium.com\/udemy-engineering\/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc\">this one<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u00a0that diagnose the problem, but there are a few details to spell out. We also want to suggest an alternative to randomSplit that will guarantee determinism.<\/p>\n\n\n\n<h4 id=\"the-problem\">The Problem<\/h4>\n\n\n\n<p>If you want to split a data set 80\/20 in Spark, you call&nbsp;df.randomSplit([0.80, 0.20], seed)&nbsp;where seed is some integer used to reseed the random number generator. Reseeding a generator is a common way to force determinism. But in this case, it doesn\u2019t work! In some cases (we\u2019ll identify exactly which cases below), randomSplit will:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Leave some rows out of either split<\/li><li>Duplicate other rows into both splits<\/li><li>On two separate runs on the same data with the same seed, assign data differently.<\/li><\/ul>\n\n\n\n<p>This feels like a bit of a bait and switch. I feel like any function that accepts a seed is advertising that it should be deterministic: otherwise why bother with the seed at all?<\/p>\n\n\n\n<p>Luckily, there is a way to force randomSplit to be deterministic, and it\u2019s listed in&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/medium.com\/udemy-engineering\/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc\" target=\"_blank\" rel=\"noopener noreferrer\">several<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-12590\" target=\"_blank\" rel=\"noopener noreferrer\">places<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/kb\/data\/random-split-behavior\" target=\"_blank\" rel=\"noopener noreferrer\">online<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The trick is to cache the dataframe before invoking randomSplit. This seems straightforward, but it relies on a solid understanding of Spark internals to gain an intuition on when you should be careful. Ultimately, Spark tries hard to force determinism (and more recent Spark versions are even better at this) but they can\u2019t provide 100% assurance that randomSplit will work deterministically. Below, I\u2019m going to suggest a&nbsp;<strong>different way to randomly partition<\/strong>&nbsp;that will be deterministic no matter what.<\/p>\n\n\n\n<h4 id=\"pseudorandomization-a-reminder\">Pseudorandomization: A Reminder<\/h4>\n\n\n\n<p>Just as a quick reminder, the way computers produce &#8220;random&#8221; numbers is actually pseudorandom: they start with some number then iterate in a complicated but deterministic way to produce a stream of numbers that are uncorrelated with each other. In the example below, we assign random numbers to some names, and we show that we can do this repeatably<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image1_ImportNumpyAsNP-6262ed9b1145b.jpg\" alt=\"Import numpy as np\" class=\"wp-image-838546\" width=\"888\" height=\"399\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image1_ImportNumpyAsNP-6262ed9b1145b.jpg 888w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image1_ImportNumpyAsNP-6262ed9b1145b-300x135.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image1_ImportNumpyAsNP-6262ed9b1145b-768x345.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image1_ImportNumpyAsNP-6262ed9b1145b-240x108.jpg 240w\" sizes=\"auto, (max-width: 888px) 100vw, 888px\" \/><figcaption>If we shuffle the names, we get different results even if we keep the seed.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"893\" height=\"208\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image2_ShuffleNames-.jpg\" alt=\"shuffled_names=list(names)\" class=\"wp-image-838549\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image2_ShuffleNames-.jpg 893w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image2_ShuffleNames--300x70.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image2_ShuffleNames--768x179.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image2_ShuffleNames--240x56.jpg 240w\" sizes=\"auto, (max-width: 893px) 100vw, 893px\" \/><figcaption>Note that the numbers are the same, but they apply to different names.<\/figcaption><\/figure>\n\n\n\n<p>So, the way to make a deterministic algorithm with a random number generator is to:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\"><li>Set the seed the same way.<\/li><li>Invoke the random number generator the exact same number or times and use the sequence in the exact same way.<\/li><\/ol>\n\n\n\n<h4 id=\"another-reminder-spark-dataframe-definition-vs-execution\">Another Reminder: Spark DataFrame definition vs execution<\/h4>\n\n\n\n<p>Spark makes a distinction between&nbsp;<em>defining<\/em>&nbsp;what to do and&nbsp;<em>executing<\/em>&nbsp;the defined compute. Some expressions on DataFrames are transformations that convert one DataFrame to a new DataFrame while others are actions that execute a sequence of transformations. There are many sources talking about this distinction online, but the&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www2.eecs.berkeley.edu\/Pubs\/TechRpts\/2011\/EECS-2011-82.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">original paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;on Spark is still a really great intro. (Aside: the paper talks about Resilient Distributed Datasets, which are a foundational element that DataFrames use).<\/p>\n\n\n\n<p>If you\u2019ve worked in Spark long at all, you\u2019ve seen this phenomenon. I can execute the following commands in a REPL and they succeed almost immediately no matter how big the data really is:<\/p>\n\n\n\n<p><sup>df = spark.read.parquet(&#8220;\/some\/parquet\/file\/pattern*.parquet&#8221;)<\/sup><\/p>\n\n\n\n<p><sub>df = df.filter(df[&#8216;amount&#8217;] > 4000).filter(df[&#8216;month&#8217;] != &#8216;jan&#8217;).show()<\/sub><\/p>\n\n\n\n<p><sub>df2 = spark.read.parquet(&#8220;\/someother\/parquet\/file\/pattern*.parquet&#8221;)<\/sub><\/p>\n\n\n\n<p><sub>df3 = df.join(df2)<\/sub><\/p>\n\n\n\n<p>That\u2019s because all I\u2019ve done so far is define a set of computations. You can see the plan by trying<\/p>\n\n\n\n<p><sub>df3.explain()<\/sub><\/p>\n\n\n\n<p>But when we execute something like&nbsp;df3.count(), we issue an action. The set of transformations that create&nbsp;df3&nbsp;execute on Spark workers, and it can take much longer to execute the statement because it blocks on the actual Spark action finishing.<\/p>\n\n\n\n<p>In a normal python script, if you trace the program on a white board, you can basically track the system state line by line. But in a pyspark script, it\u2019s much harder to trace when the &#8220;real&#8221; work (the actions) take place, or even when and how often they take place.<\/p>\n\n\n\n<h4 id=\"randomsplit0-8-0-2-seed-creates-two-dataframes-and-each-results-in-an-action\">randomSplit([0.8, 0.2], seed) creates two DataFrames, and each results in an action<\/h4>\n\n\n\n<p>Ok, so now it\u2019s time to look at the randomSplit function. The&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/apache\/spark\/blob\/master\/sql\/core\/src\/main\/scala\/org\/apache\/spark\/sql\/Dataset.scala#L2316\">actual code<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;is below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1017\" height=\"746\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image3_RandomSplit-6262eebd49367.jpg\" alt=\"randomSplit\" class=\"wp-image-838552\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image3_RandomSplit-6262eebd49367.jpg 1017w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image3_RandomSplit-6262eebd49367-300x220.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image3_RandomSplit-6262eebd49367-768x563.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image3_RandomSplit-6262eebd49367-80x60.jpg 80w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image3_RandomSplit-6262eebd49367-240x176.jpg 240w\" sizes=\"auto, (max-width: 1017px) 100vw, 1017px\" \/><\/figure>\n\n\n\n<p>This is what it does:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Sort the data within each partition. This ensures that within a Spark partition, the random number generator in Sample will execute the same number of times and will use the random numbers in the same exact way.<\/li><li>Normalize the weights.<\/li><li>Issue a series of calls to Sample with different sliding windows and with the same seed. Those calls are totally independent, and each call returns a Dataframe.<\/li><li>Return a list of DataFrames: one per sample partition.<\/li><\/ul>\n\n\n\n<p>Sample is a transformation: it adds to the DAG of transformations but doesn\u2019t result in an action. In our example of an 80\/20 split, the first call to Sample will use a random generator to assign a value between 0 and 1 to every row, and it will keep rows where the random value is&nbsp;<0.8. The second call will assign new random values to every row and keep rows where the random value is&nbsp;>0.8. This works if and only if the random reassignment is exactly the same in both calls to Sample.<\/p>\n\n\n\n<p>Each of the 2 DataFrames (one with 80% of data and one with 20%) corresponds to a set of transformations. They share the set of steps up to the sample transformation, but those shared steps will execute independently for each random split. This could extend all the way back to data reading, so data would literally be read from disk independently for the 80% sample and the 20% sample. Any other work that happens in the DataFrame before Sample will also run twice.<\/p>\n\n\n\n<p>This all works just fine assuming&nbsp;<em>every step in the upstream DAG deterministically maps data to partitions<\/em>! If everything is deterministic upstream, then all data maps to the same partition every time the script runs, and that data is sorted the same way in randomSplit every time, and the random numbers generated use the same seed and used on the same data row every time. But if something upstream changes the mapping of data to partitions then some rows will end up on different partitions in the execution for the 80% sample than they end up in the 20% sample. To summarize:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>If a non-deterministic process maps data to partitions, then the non-deterministic process could run independently per partition.<\/li><li>If the independent, non-deterministic transformation changes something that Spark uses to partition data, then some rows may map to partitions differently in each DAG execution.<\/li><li>That data is assigned different random numbers in the 80% sample and 20% sample because the random numbers in Sample are used differently in the two samples. In fact, likely nearly all data gets different random numbers because any change to partitioning impacts data that is sorted.<\/li><\/ul>\n\n\n\n<p>What could cause the DataFrame input to randomSplit to be non-deterministic? Here are a few samples:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Changing data. If your data changes between reads, the two frames could start with different data. This could happen if you are, say, reading from a stream with frequent appends. The extra rows from the second action would end up somewhere.<\/li><li>Some UDFs (User Defined Functions) can be nondeterministic. A classic example would be a function that generates a UUID for every row, especially if you later use that field as a primary key.<\/li><\/ul>\n\n\n\n<p>There used to be a much more nefarious problem in&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/spark.apache.org\/docs\/latest\/rdd-programming-guide.html#shuffle-operations\">Shuffle<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;when used in&nbsp;df.partition(int). Spark did a round robin partitioning, which meant rows were distributed across partitions in a way that depended on the order of data in the original partition. By now, you should see a problem with that approach! In fact, someone filed a&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-23207\" target=\"_blank\" rel=\"noopener noreferrer\">bug<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;pointing out the same sort of nondeterministic behavior we saw in randomSplit, and it was fixed. The&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/apache\/spark\/blob\/branch-2.4\/sql\/core\/src\/main\/scala\/org\/apache\/spark\/sql\/execution\/exchange\/ShuffleExchangeExec.scala#L265\" target=\"_blank\" rel=\"noopener noreferrer\">source<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;for round robin shuffling now explicitly sorts to ensure rows are handled in a deterministic order.<\/p>\n\n\n\n<h4 id=\"a-few-workarounds\">A Few Workarounds<\/h4>\n\n\n\n<p>There are really two options, and they are&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/kb\/data\/random-split-behavior\" target=\"_blank\" rel=\"noopener noreferrer\">documented elsewhere<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;in more detail. They boil down to:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\"><li>Force the upstream DAG to only run once. This is what cache does: it persists the DataFrame to memory or disk. Subsequent reads hit the cache, so&nbsp;<code>someNonDeterministicDataFrame.cache().randomSplit<\/code>&nbsp;forces the DAG creating&nbsp;<code>someNonDeterministicDataFrame<\/code>&nbsp;to run once, saves results in cache, then forces all samples in randomSplit to read from the cache. The cache is deterministic by definition: it\u2019s a fixed data set.<\/li><li>Do something that deterministically forces data to partitions. Do this after the nondeterministic transformation, and be careful not to partition on something that is nondeterministic (like a guid you build in a UDF)!<\/li><\/ol>\n\n\n\n<p>Both workaround options require that you&nbsp;<strong>think globally<\/strong>&nbsp;to&nbsp;<strong>act locally<\/strong>. That breaks the encapsulation that is at the core of software engineering! You are left to either understand every step upstream in the DAG (likely by using explain function) and hoping that doesn\u2019t change or adding potentially expensive extra computation to guard against changes. Both of these options effectively require global knowledge and global change knowledge! For example, my team at Microsoft intentionally separates the problem of reading data from disk and producing DataFrames from the actual Machine Learning training and inference steps. We don\u2019t want you to think globally!<\/p>\n\n\n\n<h4 id=\"an-alternative-fix-deterministic-by-design-shuffle\">An Alternative Fix: Deterministic by Design Shuffle<\/h4>\n\n\n\n<p>randomSplit&nbsp;relies on&nbsp;<em>DataFrame structure<\/em>&nbsp;to produce deterministic results: consistent data-to-partition mapping and consistent ordering within partition (enforced in the method). Another approach is to deterministically use the&nbsp;<em>data values<\/em>&nbsp;to map to partitions. This is an approach that is commonly used in AB test initialization (I described it&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/guyrt\/guyrt.github.com\/blob\/master\/notebooks\/RandomizingUsers.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) that has a few interesting properties:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>The same input always maps to the same sample.<\/li><li>You can use a subset of columns to consistently hash all data that matches on the subset to same sample. For instance, you could map all data from a userId to the same random split.<\/li><li>The algorithm is stateless: this is important for scale in AB testing but for our purposes it makes implementation easier.<\/li><\/ul>\n\n\n\n<p>The basic idea for a row is:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\"><li>Concatenate any columns you want to sample on into a new column.<\/li><li>Add a salt to the new column (we\u2019ll use seed), which allows us to produce different partitions at different times.<\/li><li>Hash the column.<\/li><li>Compute the modulus of the hash using some large modulus number (say 1000) [0]<\/li><li>Pick a set of modulus outputs for each split. For an 80\/20 split, moduli 0-799 is the 80% split and 800-999 is the 20% split.<\/li><\/ol>\n\n\n\n<p>In pyspark:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"623\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image4_PySpark-6262e8841379c-1024x623.jpg\" alt=\"pyspark.sql.functions\" class=\"wp-image-838525\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image4_PySpark-6262e8841379c-1024x623.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image4_PySpark-6262e8841379c-300x183.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image4_PySpark-6262e8841379c-768x468.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image4_PySpark-6262e8841379c-240x146.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/04\/Image4_PySpark-6262e8841379c.jpg 1107w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Tommy Guy and Kidus Asfaw We noticed an odd case of nondeterminism in Spark\u2019s randomSplit function, which is often used to generate test\/train data splits for Machine Learning training scripts. There are other posts, notably\u00a0this one (opens in new tab)\u00a0that diagnose the problem, but there are a few details to spell out. We also want [&hellip;]<\/p>\n","protected":false},"author":41161,"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":804652,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-838531","msr-blog-post","type-msr-blog-post","status-publish","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":804652,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/838531","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/41161"}],"version-history":[{"count":8,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/838531\/revisions"}],"predecessor-version":[{"id":839119,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/838531\/revisions\/839119"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=838531"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=838531"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=838531"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=838531"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}