SQL Sentry Plan Explorer v2.7 : Better Anonymization

SQL Sentry Plan Explorer v2.7 : Better Anonymization

Plan Explorer : AnonymousOn April 1st (and with no joke intended), we released a new build of SQL Sentry Plan Explorer: v2.7 (8.4.33.0). The change list lists all of the fixes and new features, and while many are self-explanatory, I thought I would elaborate on the enhancements we have made to our plan anonymization feature.

In the change list, you will see this item:

3305     Anonymize Command Text and Text Data

(Note that this functionality was also enhanced in our core product suite.)

Many customers will recognize that Plan Explorer already had the capability to anonymize plans. I don't want to trivialize the development effort that went into the first iteration of this feature, but anonymizing the names exposed in the operators in the graphical plan – which had to conform to a very strict XML schema definition in ShowPlanXML.xsd – was relatively easy when compared to anonymizing a batch of Transact-SQL, which could literally contain just about anything. Because of the high number of edge and corner cases, we only anonymized the plan operators, and just stripped the statement(s) altogether – the risk of being wrong in even one of these cases was too high.

Using an intentionally simplistic example, here is how this feature worked initially. We have a query against SalesOrderHeader that, when anonymized, becomes simple Statement1, but in the plan diagram (and all the grids and tooltips) the individual objects, indexes and columns all have their names replaced with generics (click to enlarge):

Anonymization v1 (click to enlarge)

This was occasionally a problem when users would post their problematic query plans to answers.SQLPerformance.com; while Plan Explorer helps make some plan issues obvious regardless of the query, others are caused by the specific syntax used in the T-SQL statement. Without the query, those problems are impossible to diagnose, and inevitably Paul White (@SQL_Kiwi), myself and others would ask for a non-anonymized plan, or at least some hint about the shape of the query text.

In Plan Explorer v2.7, the SQL statements are parsed and anonymized as well. While that doesn't help a reader to translate Schema1.Object1 into Sales.SalesOrderHeader, but that is partially the point. The important piece of functionality that has been added is that you can match a table, index, or column in the plan to its reference in the actual query text. Here the same query is anonymized, and you can see that the query text is intact, with the entity names replaced with the same generics as the plan diagram (click to enlarge):

Anonymization v2 (click to enlarge)

Now again, I am certainly not trying to over-simplify what is going on here – statement anonymization is a very difficult technical problem. We borrowed a huge deal of this technology from the investments we made improving our statement normalization functionality in our core Power Suite in v8:

Normalization Engine

I've prodded Brooke Philpott (@MacroMullet) to elaborate further on these enhancements in a post of his own. Please leave a comment below if you'd like to help urge him on!

Limitations

Even with these improvements to Plan Explorer, there are still some known issues with anonymization that you should keep in mind:

  • Statement truncation – the normalization routine relies on logically parsing a valid statement text according to the showplan XML. If an individual statement is greater than 4000 characters, it will be truncated, which could result in invalid or – potentially worse – valid, but incorrect, SQL. In this case we must simply revert to the previous behavior, stripping the entire statement and replacing it with a generic like Statement1.
  • Comments are stripped – since we can't be sure if sensitive table, column or other information is embedded in inline comments, and since comments are by definition impossible to parse meaningfully, we remove those as part of the anonymization process.
  • Parameter values are stripped – similarly, we remove parameter values and replace them with ? in the command text and ValueN on the parameters tab. You won't be able to tell what the parameter values are, but you will still be able to tell when there are differences between compiled and runtime values (e.g. to identify potential parameter sniffing issues).
  • Object / schema names are simple replacements – If you have two different tables with the same name but in different schemas (or the same schema in different databases), both will be called ObjectN. There are some places where – after anonymizing – you won't be able to tell them apart, for example in the plan diagram, where database and schema are not shown (they are present in the tooltips and grids, though). This should only affect people who, for example, regularly join tables with the same name but from different schemas.
  • Constraint names are skipped – If you have proprietary information embedded in constraint names, be aware that currently this information remains exposed even after obfuscation, but it has been logged as a future enhancement.

As a simple demonstration of a couple of the above limitations, create the following objects in tempdb:

CREATE SCHEMA schema_a;
GO
CREATE SCHEMA schema_b;
GO
CREATE TABLE schema_a.foo(id INT);
CREATE TABLE schema_b.foo(id INT);
GO
CREATE PROCEDURE dbo.p1
  @id INT
AS
BEGIN
  SELECT a.id, b.id
    FROM schema_a.foo AS a
    INNER JOIN schema_b.foo AS b
    ON a.id = b.id
    WHERE a.id = @id;
END
GO

Now, generate an actual execution plan from within Plan Explorer for the following statement:

EXEC tempdb.dbo.p1 @id = 5;

Then run it again, but this time change the @id parameter value to 12. You will get this plan (sorry, to try and fit as much onto one screenshot, I arranged my tabs in a very impractical way):

Original plan

Note that the plan diagram shows that both objects are named foo, which is correct, just a bit incomplete (Management Studio's graphical plans omit schema name as well). You can also see that the compiled and runtime parameters are clearly different.

Now, if you click the Anonymize button on the toolbar, the plan will be converted to the following:

Anonymized plan

You can see how the names have all been obfuscated, and how the parameter values have been changed to ValueN. Parameter names have been changed to VariableN as well. And while the graphical plan still shows no obvious visual distinction between the two schemas in use, you can see the differences exposed in the Statement and Text Data queries, and you can also see these differences in several of the grids as well as the tooltips over the plan operators:

Schema names are present in the grids and tooltips

And finally, for PRO users, while the full call stack is no longer shown in the Statements grid, if you move to the Command Text tab, you'll see an anonymized version of the statement that was originally called:

EXEC [Schema3].[Object5] Variable1 = ?;

Even with these limitations in anonymized queries, the ability to correlate a plan with the actual query shape that created it is significant (especially for those helping you optimize0, and they still achieve their primary purpose: hiding sensitive database, object, index, and column names from the public eye.

For more information about SQL Sentry Plan Explorer, see the web site, the list of blog posts at the bottom of our "latest builds" post, and this post, with many links to Plan Explorer resources.

Comments ( 2 )

    • IL says:

      Full Schema, Object, Column and Variable notation is counted towards 4000 nchar statement limit. Might shortening these to Sch, Obj or Tab, Col, Var or 1-char prefix could help to squash more query text into the limit?

    • Aaron Bertrand says:

      Hi IL, the challenge we have is typically that the statement (which comes from the XML) has already been truncated before we ever touch it, so even in a plan that hasn't been anonymized, you will not be able to see the full statement. We still have to go through the XML rather than simply parsing the text in the command text window, so if a statement is incomplete there, we don't even try. In any case, I can't imagine there are many scenarios where a statement is only over the limit because of entity names – I suspect that in most cases our Object1 / Index1 replacements are actually shorter than the originals.

    Leave A Comment

    Your email address will not be published.

    This site uses Akismet to reduce spam. Learn how your comment data is processed.