I just finished the SQLPASS pre-conference from Brian Knight – Building a Microsoft Data Warehousing Platform, and I’m so pumped to do more BI! Reporting, Data Warehousing, Data Mining – all fun stuff.
In a gist, this is what we learned today (and it’s a lot of acronyms):
– SSIS, ETL
– SSAS
– Cubes, Data Warehouse
– Data Mining
– MDX and DMX (on a very very high level)
It was a lot of details and demos jammed in one day, but I won’t have it any other way, and overall the session was just great! Loved it!
Here are some bits and pieces of information / tips / useful sites / resources (probably mostly for my benefit, so I remember. But if you find it useful, then all the better):
- 65% of data warehousing projects fail, and one of the reasons for this is lack of communication. Business users want something, but devs deliver something else.
- Waterfall model for DW projects – hard to react. Agile is better – and Brian recommends 1 month iterations (ie complete deliverable cycle per month)
- Why data warehouse? Consolidate data, aggregates, reporting, analytics, archived data (ex if you need to keep data for years and years)
- Business Key = Alternate Key = PK from source
- Surrogate Keys – insulates data warehouse from source changes
- In one of Pragmatic Works’ projects, they lowered SSIS loading time from 4.5 hours to 3 mins – all about choosing “better” way. In SSIS most components are in memory and are synchronous.
- Most SSIS components are synchronous – good for performance
- Some are asynchronous and partially blocking, not so good – for example UNION ALL and MERGE – slight overhead
- Some are asynchronous and fully blocking – not good at all – for example sort or aggregate transforms – but sometimes there are reasons to use these
- In BIDS, devenv.exe -NOSPLASH to prevent splash screen
- In SSIS – Protection Levels for Packages – EncryptSensitiveWithUserKey generally a bad idea because it associates the package with your profile; if you transfer package, might say package is corrupt. Best to Encrypt All With Password; ServerStorage relies on msdb
- Different types of dimensions
- Type 0 – Changing Attribute – updates; no history kept
- Type 1 – Fixed Attribute – cannot change under any circumstance; ignores change
- Type 2 – helps track changes; biggest disadvantage is additional storage; it creates a new record per update
- Good practice – smaller packages; easier to change and maintain
- Slowly Changing Dimension Wizard (SCD) – data types need to be lined up, need to be very careful; easy to break; also when you need to customize later on and relaunch the wizard, previous customizations will be dropped. You should keep documentation on how to recreate your customizations
- Another problem with SCD – similar to a cursor, goes through row by row; alternative way is to create staging table first. Use T-SQL for what it’s good at, SSIS for what it’s good at. If you can do sorts in T-SQL, do the sorts in T-SQL
- Inferred members – data that not in source, but is in Fact; for example, if it takes 2 weeks to add a salesperson in HR, but salesperson already has sales from Day 1
- Better load – “Fast Load”; also configure “Maximum Insert Commit Size”
- Every member should has an unknown member
- ALWAYS alias your columns
- Alternatives to detecting updates – instead of specifying all columns in a multi OR expression, use CHECKSUM or HASHBYTES. Problem with CHECKSUM is it is not guaranteed to be unique. Ex.
HASHBYTES('sha', ISNULL(col1, '') + ISNULL(col2, ''))
So in your expression column can just have something similar to: (DT_WSTR, 64)yourhashcol_src != (DT_WSTR,64)yourhashcol_dw
- SQL Cache Transform – creates cache file, can be reused multiple times; 80% faster in SQL Server 2008; SQL Server 2005 was bad with threading
- Terminologies – data warehouse (SSAS 2008)
- Measure Group = Fact Table
- Attributes = Columns in a Dimension
- Member = Row in an Attribute
- Hierarchies are shortcuts for end users; you can ultimately hide actual attributes (AttributeHierarchyVisible = True), and just expose the hierarchies
- Always reconnect to cube after dimension or attribute changes so changes are apparent
- Set up attribute relationships ex:
Date SK -- Date -- Month -- Quarter -- Year
Note that uniqueness of quarter is really Year + Quarter.
- Set up constraints/uniqueness through the “Key Column”. Set up what’s being displayed to user using “Name Column”. You will probably need to explore your dimensions and see which attributes need this.
- For any time dimensions, add “Time Intelligence”. As a side effect, this automatically adds additional filters like “Yesterday”, “This Week”, “Last Week” etc in Excel
- “Mushroom Cloud” – Invalid queries, but you are getting invalid results. For example if you drag Product then EmployeeCount. This gives you the EmployeeCount value for Product.
- In Cube Editor, MeasureGroup, set IgnoreUnrelatedDimension = False helps prevent “mushroom clouds”; result will display blank instead of giving you invalid value
- Discretization Method – allows you to create dynamic buckets of data
- Automatic
- EqualAddress – your just specify number of groups
- Clusters – groups based on your data
- HideMemberIf (trying to remember why this option was highlighted in the session)
- MDX is language for querying the cube/SSAS; used for creating KPIs, calculations
- Tuple in MDX points to a cell; you can work with six (6) different axes. Good book is MS Press MDX Step by Step
- When you want to show data, put in SELECT; if you don’t but you want to filter by it, put in WHERE
- Parenthesis in an MDX helps you represent a tuple
- SSRS and Excel can generate some MDX
- Calculations < Calculated Member does not store anything in the cube, just the metadata; actual calculations done on the fly
- PRO: Only formula is stored in the cube; don’t need extra storage
- CON: Only formula is stored in the cube; performance will be worse
- Usage Based Optimization : Go to SSMS < Connect to SSAS < Set CreateQueryLogTable = true; you can also choose to “Review Aggregation Usage”
- How to lock data in your cube : Go to your dimension then go to Role, and add roles/users
- Option: Enable Visual Totals – like an implicit WHERE; Off by default, which means users will still see all data even if data is locked down; for example, if user can only see US data, his/her totals should be just for US, but s/he will see US Totals, then a different Grand Total. Enable this to lock down these totals to only the data they can see
- Deployment wizard in SSAS – creates .asdatabase file that contains metadata required to deploy the cube
- You dont need to always process the whole cube; you can do incremental processing
- Partitioning in SSAS – in about 50K records, you will start to feel the slowness
- SQL Server Standard allows 3 partitions; SQL Server Enterprise allows any number
- New feature in SQL Server 2008 Enterprise – Proactive Caching
- In SQL Server 2008, you can use profiler on SSAS to deconstruct MDX queries. In Excel, you can also right click on some columns then check MDX
- There is an option to do offline OLAP : Excel < Options < OLAP Tools < Offline OLAP. You can get to your data, but actions and drillthrough are off
- Double hop issue in SSRS and SSAS when you build reports on top of the cube : check Kerberos, SPN
- You don’t need a cube to do data mining! You can also allow your users to do predictions based on flat tables or views.
- Sample data mining scenario: used to auto approve insurance claims; can be used for fraud detection; can be integrated with SSIS in conditional tasks
- For data mining structures, Decision Trees is the best way to start; as far as alogrithms go, you can even create your own, or purchase 3rd party algorithms
- Users can get/do data mining through Excel
- Users can do drillthroughs with data mining
Useful Resources / Sites
- pragmaticworks.com
- sqlshare.com
- ssas-info.com
- dtsxchange.com
- sqlserverdatamining.com
- Marco Russo’s Expert Cube Development with Microsoft SQL Server 2008 Analysis Services
- Brian Knight’s 24-Hour Trainer: Microsoft SQL Server 2008 Integration Services (Wrox Programmer to Programmer)
- MS Press Microsoft SQL Server 2008 MDX Step by Step (Step By Step (Microsoft))
Quote “HideMemberIf (trying to remember why this option was highlighted in the session)”
Probably it is highlighted because it is not recommended to change this property to anything else, but default value. Really affects performance on bigger dimensions. For example: forum thread about this here.
[…] still remember the first time I attended PASS (started with a pre-con from Brian Knight no less […]