Detecting, Managing, and Diagnosing Failures with FUSE
John Dunagan, Juhan Lee (MSN), Alec Wolman
WIP

Goals & Target Environment
- Improve the ability of large internet portals to gain insight into failures
- Non-goals:
  - masking failures
  - using machine learning to infer abnormal behavior

MSN Background
- Messenger, www.msn.com, Hotmail, Search, many other "properties"
- Large (> 100 million users)
- Sources of complexity:
  - multiple data centers
  - large # of machines
  - complex internal network topology
  - diversity of applications and software infrastructure

The Plan
- Detecting, managing, and diagnosing failures
- Review MSN's current approaches
- Describe our solution at a high level

Detecting Failures
- Monitor system availability with heartbeats
- Monitor application availability and quality of service using synthetic requests
- Customer complaints (telephone, email)
- Problems:
  - These approaches provide limited coverage: it is harder to catch failures that don't affect every request
  - Data on detected failures often lacks the detail needed to suggest a remedy: which front end is flaky? which app component caused the end-user failure?

Managing Failures
- Definition: the ability to prioritize failures
  - Detect component service degradation
  - Characterize app stability
  - Capacity planning: when server "x" fails, what is the impact of this failure?
  - Better use of ops and engineering resources
- Current approach: no systematic attempt to provide this functionality

Our solution (in 2 steps)
- Detecting and managing failures
- Step 1: Instrument applications to track user requests across the "service chain" (see the sketch after this slide)
  - Each request is tagged with a unique id
  - The service chain is composed on the fly with help from the app instrumentation
  - For each request: collect per-hop performance information and per-request failure status
  - Centralized data collection
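To make Step 1 concrete, here is a minimal sketch of per-request tracking in C#, the talk's implementation language. Everything here is an assumption for illustration: the talk names no types, and only the ideas of a unique request id and per-hop performance data come from the slide above.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    // Hypothetical per-request context passed along the service chain.
    public class RequestContext
    {
        public Guid RequestId { get; } = Guid.NewGuid();   // unique id per request
        public List<HopRecord> Hops { get; } = new List<HopRecord>();
    }

    // One record per hop: which machine handled the request, how long it
    // took, and whether this hop reported success.
    public class HopRecord
    {
        public string Machine;
        public long ElapsedMs;
        public bool Succeeded;
    }

    public static class RequestTracker
    {
        // Wraps one hop of the service chain: times the work, records the
        // outcome, and leaves the context ready to forward to the next hop.
        public static void TrackHop(RequestContext ctx, Action work)
        {
            var sw = Stopwatch.StartNew();
            var hop = new HopRecord { Machine = Environment.MachineName };
            try
            {
                work();
                hop.Succeeded = true;
            }
            finally
            {
                hop.ElapsedMs = sw.ElapsedMilliseconds;
                ctx.Hops.Add(hop);   // shipped to centralized collection later
            }
        }
    }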
What kinds of failures?
- We can handle:
  - machine failures
  - network connectivity problems
- Most:
  - misconfiguration
  - application bugs
- But not all:
  - application errors where the app itself doesn't detect that there is a problem

Diagnosing Failures
- Assigning responsibility to a specific hw or sw component requires:
  - insight into the internals of a component
  - cross-component interactions
- Current approach: instrument applications with app-specific log messages
- Problems:
  - high request rates => log rollover
  - perceived overhead => detailed logging enabled during testing, disabled in production
(Speaker notes: Even with the ability to "manage" failures, there are times when even more detail is needed.)

FUSE Background
- FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
- Lack of a positive ack => failure

Step 2: Conditional Logging
- Implement "conditional logging" to significantly reduce the overhead of collecting detailed logs across the different machines in the service chain (see the sketch after this slide)
- Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
- While a request's fate is undecided, detailed log messages are stored in main memory, so the common-case overhead of logging is vastly reduced
- Once the fate of the service chain is decided, we discard app logs for successful requests and save the logs for failures
- The quantity of data generated is manageable when most requests are successful
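As a rough illustration of conditional logging, the following C# sketch buffers detailed messages in memory and only writes them out when the request is known to have failed. The class and method names are hypothetical; only the buffer-then-discard-or-save behavior comes from the slide above.

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Hypothetical conditional logger: one instance per request.
    public class ConditionalLog
    {
        private readonly List<string> buffer = new List<string>();
        private readonly Guid requestId;

        public ConditionalLog(Guid requestId) { this.requestId = requestId; }

        // Cheap in the common case: an in-memory append, no disk I/O.
        public void Log(string message)
        {
            buffer.Add(string.Format("{0:o} {1} {2}", DateTime.UtcNow, requestId, message));
        }

        // Called once FUSE has decided the fate of the service chain.
        public void Resolve(bool failed, string logPath)
        {
            if (failed)
                File.AppendAllLines(logPath, buffer); // save detailed logs for failures
            buffer.Clear();                           // discard logs for successes
        }
    }

Because most requests succeed, most buffers are simply cleared, so the volume of log data actually written is dominated by failures.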
Example
[Diagram: a client and three servers (Server1, Server2, Server3) connected in a service chain]
Benefits:
- FUSE allows monitoring of real transactions: all transactions, or a sampled subset to control overhead
- When a request fails, FUSE provides an audit trail:
  - How far did it get?
  - How long did each step take?
  - Any additional application-specific context
- FUSE can be deployed incrementally

Issues
- Overload policy: need to handle bursts of failures without inducing more failures
- How much effort to make apps FUSE-enabled? Are the right components FUSE-enabled?
- Identifying and filtering false positives
- Tracking request flow is non-trivial with network load balancers

Status
- We've implemented FUSE for MSN, integrated with the ASP.NET rendering engine
- Testing in progress
- Roll-out at the end of summer

Backups

FUSE is Easy to Integrate
Example current code on the front end:

    ReceiveRequestFromClient(...) {
        ...
        SendRequestToBackEnd(...);
    }

Example code on the front end using FUSE (a compilable sketch follows this slide):

    ReceiveRequestFromClient(..., FUSEinfo f) {
        // default value of f = null
        if (f != null)
            JoinFUSEGroup(f);
        ...
        SendRequestToBackEnd(..., f);
    }

The current implementation is in C# and consists of 2400 LOC.
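Expanding the slide's pseudocode into compilable form, here is one plausible shape for the integration. It is a sketch, not the actual MSN code: FUSEinfo and JoinFUSEGroup are the names used in the talk, while Request, the handler signature, and the stubbed helpers are assumptions.

    // A hypothetical, compilable rendering of the slide's pseudocode.
    public class FrontEnd
    {
        // f defaults to null for callers that are not yet FUSE-enabled,
        // which is what lets FUSE be deployed incrementally.
        public void ReceiveRequestFromClient(Request req, FUSEinfo f = null)
        {
            if (f != null)
                JoinFUSEGroup(f);          // join the group for this request

            // ... existing request-handling logic, unchanged ...

            SendRequestToBackEnd(req, f);  // propagate f to the next hop
        }

        private void JoinFUSEGroup(FUSEinfo f) { /* provided by the FUSE library */ }
        private void SendRequestToBackEnd(Request req, FUSEinfo f) { /* RPC to the back end */ }
    }

    public class Request { }
    public class FUSEinfo { }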