{"id":667005,"date":"2020-07-20T09:41:57","date_gmt":"2020-07-20T16:41:57","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=667005"},"modified":"2022-04-19T15:09:14","modified_gmt":"2022-04-19T22:09:14","slug":"prose-pattern-inspector","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-pattern-inspector\/","title":{"rendered":"PROSE – Pattern Inspector"},"content":{"rendered":"

Have you ever written a script to perform a string transformation and had it either crash or silently produce wrong results because the input data was in unexpected formats? Or do you want to figure out how many different cases you need to handle in your standardization procedure? Matching.Text<\/strong> to the rescue!<\/p>\n

Matching.Text<\/strong> automatically identifies different formats and patterns in string data. Given a set of input strings, Matching.Text<\/strong> produces a small number of disjoint regular expressions such that they together match all the input strings, except possibly a small fraction of outliers. Additional documentation and usage can be found here<\/a>.<\/p>\n
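To make the two properties concrete (the patterns are disjoint, and together they cover everything except a few outliers), here is a minimal Python sketch. The patterns and the `summarize` helper are hypothetical and hand-written for illustration; they are not output of Matching.Text, and whole-string matching is an assumption.

```python
import re
from collections import Counter

# Hypothetical, hand-written patterns (not produced by Matching.Text).
PATTERNS = {
    "Digits": re.compile(r"[0-9]+"),
    "Word": re.compile(r"[A-Z][a-z]+"),
}

def summarize(strings):
    """Return the fraction of strings matched by each pattern, plus outliers."""
    counts = Counter()
    for s in strings:
        # Whole-string matching is assumed here.
        hits = [name for name, rx in PATTERNS.items() if rx.fullmatch(s)]
        # Disjointness: no string may be claimed by more than one pattern.
        assert len(hits) <= 1, f"{s!r} matches several patterns: {hits}"
        counts[hits[0] if hits else "Outliers"] += 1
    return {name: n / len(strings) for name, n in counts.items()}

fractions = summarize(["1892", "1058", "Unknown", "n/a"])
```

Here `fractions` plays the role of the frequency column in a pattern report: each input lands in exactly one bucket, and anything no pattern claims is counted as an outlier.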

Scenario<\/h2>\n

Consider the list of names below, from which you want to extract last names.<\/p>\n\n\n\n\n\n\n\n\n\n\n
Full Name<\/th>\n<\/tr>\n<\/thead>\n
Laia Sanchis<\/td>\n<\/tr>\n
Gwilym Jones<\/td>\n<\/tr>\n
Cai Huws<\/td>\n<\/tr>\n
Tomi Elis<\/td>\n<\/tr>\n
Geraint Llwyd<\/td>\n<\/tr>\n
\u2026<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
<\/div>\n

A simple-looking task, if ever there was one \u2013 the Python function below is a good attempt.<\/p>\n

\r\ndef extract_last_name(name):\r\n    return name[name.find(' ')+1:]<\/pre>\n

However, while the first 10 names look standard, running Matching.Text<\/strong> provides more insight into the different formats and also identifies outliers that do not fall into any of them.<\/p>\n\n\n\n\n\n\n\n\n\n
Pattern Name<\/th>\nRegex Pattern<\/th>\nFrequency<\/th>\nExamples<\/th>\n<\/tr>\n<\/thead>\n
Word_Word<\/td>\n[A-Z][a-z]+ [A-Z][a-z]+<\/code><\/td>\n0.84<\/td>\n\u201cLaia Sanchis\u201d, \u201cGwilym Jones\u201d<\/td>\n<\/tr>\n
Word_Word_Hyphen_Word<\/td>\n[A-Z][a-z]+ [A-Z][a-z]+-[A-Z][a-z]+<\/code><\/td>\n0.06<\/td>\n\u201cTulga Bat-Erdene\u201d, \u201cDabir Al-Zuhairi\u201d<\/td>\n<\/tr>\n
Word_Word_Word<\/td>\n[A-Z][a-z]+ [A-Z][a-z]+ [A-Z][a-z]+<\/code><\/td>\n0.06<\/td>\n\u201cYue Ying Jen\u201d, \u201cRolf Van Eeuwijk\u201d<\/td>\n<\/tr>\n
Word<\/td>\n[A-Z][a-z]+<\/code><\/td>\n0.04<\/td>\n\u201cDanlami\u201d, \u201cIsioma\u201d<\/td>\n<\/tr>\n
Outliers<\/td>\n<\/td>\n<0.01<\/td>\n\u201cUNKNOWN\u201d, \u201cNULL\u201d<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
<\/div>\n
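Running the one-line function on one example from each row of the table shows where it goes wrong. This is only an illustrative check; which part of each name counts as the "last name" is a judgment the table itself does not make.

```python
def extract_last_name(name):
    # Same naive approach as above: everything after the first space.
    return name[name.find(' ') + 1:]

samples = ["Laia Sanchis", "Tulga Bat-Erdene", "Yue Ying Jen", "Danlami", "NULL"]
results = {name: extract_last_name(name) for name in samples}

for name, last in results.items():
    print(f"{name!r} -> {last!r}")
```

For single-word inputs, `find` returns -1, so the slice starts at index 0 and the whole string comes back unchanged: a mononym such as "Danlami" and a placeholder such as "NULL" are both returned as if they were last names, with no error raised. For "Yue Ying Jen" the function returns "Ying Jen"; whether that is right depends on the naming convention.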

Given this new insight, it is clear that extract_last_name<\/code> may not always return the right answer, and you may want to handle the last-name extraction task quite differently. To make writing the procedure easier, Matching.Text<\/strong> can also generate a switch-case-like template that matches against the different patterns.<\/p>\n

\r\nimport re\r\n\r\nregex_word_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+')\r\nregex_word_word_hyphen_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+-[A-Z][a-z]+')\r\nregex_word_word_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+ [A-Z][a-z]+')\r\nregex_word = re.compile(r'[A-Z][a-z]+')\r\n\r\ndef extract_last_name(name):\r\n  # fullmatch, so a longer name is not claimed by a pattern that only matches its prefix\r\n  if regex_word_word.fullmatch(name):\r\n    return \"TitleWord & TitleWord\" # Modify\r\n  elif regex_word_word_hyphen_word.fullmatch(name):\r\n    return \"TitleWord & TitleWord & Const[-] & TitleWord\" # Modify\r\n  elif regex_word_word_word.fullmatch(name):\r\n    return \"TitleWord & TitleWord & TitleWord\" # Modify\r\n  elif regex_word.fullmatch(name):\r\n    return \"TitleWord\" # Modify\r\n  else:\r\n    return \"Others\" # Modify\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"

Matching.Text automatically identifies different formats and patterns in string data.<\/p>\n","protected":false},"featured_media":674232,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[13556,13554,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-667005","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Usage","content":"The Matching.Text<\/code> API is accessed through the Matching.Text.Session<\/code> class. The input strings are added using Session.Inputs.Add()<\/code>. Once the inputs are added, calling Session.LearnPatterns()<\/code> returns a list of PatternInfo<\/code> objects that describe each pattern.\r\n\r\nEach PatternInfo<\/code> object has either:\r\n

    \r\n \t
  1. The IsNull<\/code> field set to true, indicating that the pattern matches only null<\/code> strings, or<\/li>\r\n \t
  2. The IsNull<\/code> field set to false, in which case the strings that match the pattern are those that match the regular expression in the Regex<\/code> field and do not match any of the regular expressions in the RegexesToExclude<\/code> field.<\/li>\r\n<\/ol>\r\nThe other fields give the frequency of the pattern (MatchingFraction<\/code>), a description in a PROSE-specific format (Description<\/code>), and a few examples of the input strings matched by the pattern (Examples<\/code>).\r\n
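The matching rule described above can be sketched in Python. The `PatternInfoSketch` class below is a hypothetical stand-in that only mirrors the field names from the description; it is not the PROSE API, and it assumes the regular expressions are applied to the whole string.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PatternInfoSketch:
    """Hypothetical stand-in mirroring the fields described above."""
    is_null: bool = False
    regex: Optional[str] = None
    regexes_to_exclude: List[str] = field(default_factory=list)

    def matches(self, s: Optional[str]) -> bool:
        if self.is_null:
            # Case 1: the pattern matches only null strings.
            return s is None
        if s is None or self.regex is None:
            return False
        # Case 2: must match Regex and none of RegexesToExclude.
        if not re.fullmatch(self.regex, s):
            return False
        return not any(re.fullmatch(rx, s) for rx in self.regexes_to_exclude)

word = PatternInfoSketch(regex=r"[A-Za-z]+", regexes_to_exclude=[r"Unknown"])
null_pattern = PatternInfoSketch(is_null=True)
```

With these two sketch patterns, `word.matches("July")` is true, `word.matches("Unknown")` is false because the exclusion applies, and only `null_pattern` accepts `None`.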

    Basic usage<\/h2>\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nusing System.Collections.Generic;\r\nusing Microsoft.ProgramSynthesis.Matching.Text;\r\n\r\nSession session = new Session();\r\n\r\nIEnumerable<string> inputs = new[] {\r\n \"21-Feb-73\",\r\n \"2 January 1920a \",\r\n \"4 July 1767 \",\r\n \"1892\",\r\n \"11 August 1897 \",\r\n \"11 November 1889 \",\r\n \"9-Jul-01\",\r\n \"17-Sep-08\",\r\n \"10-May-35\",\r\n \"7-Jun-52\",\r\n \"24 July 1802 \",\r\n \"25 April 1873 \",\r\n \"24 August 1850 \",\r\n \"Unknown \",\r\n \"1058\",\r\n \"8 August 1876 \",\r\n \"26 July 1165 \",\r\n \"28 December 1843 \",\r\n \"22-Jul-46\",\r\n \"17 January 1871 \",\r\n \"17-Apr-38\",\r\n \"28 February 1812 \",\r\n \"1903\",\r\n \"1915\", \r\n \"1854\",\r\n \"9 May 1828 \",\r\n \"28-Jul-32\",\r\n \"25-Feb-16\",\r\n \"19-Feb-40\",\r\n \"10-Oct-50\",\r\n \"5 November 1880 \",\r\n \"1928\",\r\n \"13-Feb-03\",\r\n \"8-Oct-43\",\r\n \"1445\",\r\n \"8 July 1859 \",\r\n \"25-Apr-27\",\r\n \"25 November 1562 \",\r\n \"2-Apr-10\", };\r\n \r\n session.Inputs.Add(inputs);\r\n IReadOnlyList<PatternInfo> patterns = session.LearnPatterns(); \/\/ Five patterns are returned corresponding to the formats \"dd-MMM-yy\", \"dd MMMM yyyy \", \"yyyy\", \"Unknown\", and \"2 January 1920a \".[\/code]\r\n"}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Arjun Radhakrishna","user_id":39405,"people_section":"Section name 
0","alias":"arradha"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667005"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":11,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667005\/revisions"}],"predecessor-version":[{"id":732766,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667005\/revisions\/732766"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/674232"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=667005"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=667005"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=667005"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=667005"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=667005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}