Maarten Balliauw {blog}

ASP.NET, ASP.NET MVC, Azure, PHP, OpenXML, VSTS, ...

About the author

Maarten Balliauw is an MVP ASP.NET and is currently employed as .NET Software Engineer at RealDolmen. His interests are mainly web applications developed in ASP.NET (C#) or PHP.

I'm a speaker at TechDays Belgium and TechDays Finland

    Disclaimer

    The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

    © Copyright Maarten Balliauw 2010

    Preview Word files (docx) in HTML using ASP.NET, OpenXML and LINQ to XML

    Since an image (or even an example) tells more than any text will ever do, here's what I've created in the past few evening hours:

    image

    Live examples:

    Want the source code? Download it here.

    Want to know how?

    If you want to know how I did this, let me first tell you why I created it. After searching Google for something similar, I found a SharePoint blogger who did the same using a SharePoint XSL transformation document called DocX2Html.xsl. Great, but this document cannot be distributed without a SharePoint license. The only option for me was to build something similar myself.

    ASP.NET handlers

    The main idea of this project was to be able to type in a URL ending in ".docx", which would then render a preview of the underlying Word document. Luckily, ASP.NET provides a mechanism for exactly this: HttpHandlers. An HttpHandler is a class which the ASP.NET runtime calls to process an incoming request for a specific extension. So let's trick ASP.NET into believing ".docx" is an extension which should be handled by a custom class...

    Creating a custom handler

    A custom handler can be created quite easily. Just create a new class, and make it implement the IHttpHandler interface:

    using System.Web;

    /// <summary>
    /// Word document HTTP handler
    /// </summary>
    public class WordDocumentHandler : IHttpHandler
    {
        #region IHttpHandler Members

        /// <summary>
        /// Is the handler reusable?
        /// </summary>
        public bool IsReusable
        {
            get { return true; }
        }

        /// <summary>
        /// Process request
        /// </summary>
        /// <param name="context">Current http context</param>
        public void ProcessRequest(HttpContext context)
        {
            // Todo...
            context.Response.Write("Hello world!");
        }

        #endregion
    }

    Registering a custom handler

    For ASP.NET to recognise our newly created handler, we must register it in Web.config:

    image
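    For reference, a registration for the classic ASP.NET pipeline could look roughly like the fragment below. This is a sketch, not a copy of the original configuration: the namespace and assembly name (WordVisualizer) are assumptions, so replace them with the ones from your own project.

```xml
<configuration>
  <system.web>
    <httpHandlers>
      <!-- Route every request ending in .docx to the custom handler. -->
      <!-- "WordVisualizer.WordDocumentHandler, WordVisualizer" is an assumed
           type and assembly name; use your own. -->
      <add verb="*" path="*.docx"
           type="WordVisualizer.WordDocumentHandler, WordVisualizer" />
    </httpHandlers>
  </system.web>
</configuration>
```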

    Now if you are using IIS6, you should also register this extension to be handled by the .NET runtime:

    image

    In the application configuration, add the extension ".docx" and make it point to the following executable: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\aspnet_isapi.dll

    This should be it. Fire up your browser, browse to your web site and type anything.docx. You should see "Hello world!" appearing in a nice, white page.

    OpenXML

    As you may already know, Word 2007 files are OpenXML packages containing WordprocessingML markup. A .docx file can be opened using the System.IO.Packaging.Package class (which is available after adding a project reference to WindowsBase.dll).

    The Package class is created for accessing any OpenXML package. This includes all Office 2007 file formats, but also custom OpenXML formats which you can implement for yourself. Unfortunately, if you want to use Package to access an Office 2007 file, you'll have to implement a lot of utility functions to get the right parts from the OpenXML container.
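    To give an idea of the plumbing involved, here is a minimal sketch of one such utility function. It locates the main document part through its package relationship and loads it for LINQ to XML. The class and method names (PackageDemo, LoadMainDocument) are made up for this illustration and are not taken from the downloadable source. It assumes a reference to WindowsBase.dll.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, for illustration only
public static class PackageDemo
{
    // Relationship type that points from the package root to the main document part
    private const string OfficeDocumentRelationship =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    public static XDocument LoadMainDocument(string path)
    {
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            // Find the relationship to the main document part (usually /word/document.xml)
            PackageRelationship relationship = package
                .GetRelationshipsByType(OfficeDocumentRelationship)
                .First();

            // Package-level relationship targets are resolved against the package root
            Uri partUri = PackUriHelper.ResolvePartUri(
                new Uri("/", UriKind.Relative), relationship.TargetUri);
            PackagePart part = package.GetPart(partUri);

            // Read the part's XML into an XDocument
            using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
            using (XmlReader reader = XmlReader.Create(stream))
            {
                return XDocument.Load(reader);
            }
        }
    }
}
```

    Every other part (styles, numbering, images) needs similar lookup code, which is the kind of work the SDK takes off your hands.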

    Luckily, Microsoft released an OpenXML SDK (CTP), which I also used in order to create this Word preview handler.

    LINQ to XML

    As you know, the latest .NET 3.5 release brought us something new & extremely handy: LINQ (Language Integrated Query). On Doug's blog, I read about Eric White's attempts to use LINQ to XML on OpenXML.

    LINQ to OpenXML

    For implementing my handler, I basically used code similar to Eric's to run queries on a Word document's contents. Here's an example which fetches all paragraphs in a Word document:

    using (WordprocessingDocument document = WordprocessingDocument.Open("test.docx", false))
    {
        // Register namespace
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // Element shortcuts
        XName w_r = w + "r";
        XName w_ins = w + "ins";
        XName w_hyperlink = w + "hyperlink";

        // Load document's MainDocumentPart (document.xml) in XDocument
        XDocument xDoc = XDocument.Load(
            XmlReader.Create(
                new StreamReader(document.MainDocumentPart.GetStream())
            )
        );

        // Fetch paragraphs
        var paragraphs = from l_paragraph in xDoc
                        .Root
                        .Element(w + "body")
                        .Descendants(w + "p")
             select new
             {
                 TextRuns = l_paragraph.Elements().Where(z => z.Name == w_r || z.Name == w_ins || z.Name == w_hyperlink)
             };

        // Write paragraphs
        foreach (var paragraph in paragraphs)
        {
            // Fetch runs
            var runs = from l_run in paragraph.TextRuns
                       select new
                       {
                           Text = l_run.Descendants(w + "t").StringConcatenate(element => (string)element)
                       };

            // Write runs
            foreach (var run in runs)
            {
                // Use run.Text to fetch a text string
                Console.Write(run.Text);
            }
        }
    }

    If you try to compile this code as-is, you will get a compilation error. That is because StringConcatenate is not a framework method but an extension method that still has to be defined.

    Extension methods

    In the above example, I used an extension method named StringConcatenate. An extension method is, as the name implies, an "extension" to a known class: a static method that can be called as if it were an instance method of that class. The following class defines such an extension for all IEnumerable<T> instances:

    public static class IEnumerableExtensions
    {
        /// <summary>
        /// Concatenate strings
        /// </summary>
        /// <typeparam name="T">Type</typeparam>
        /// <param name="source">Source</param>
        /// <param name="func">Function delegate</param>
        /// <returns>Concatenated string</returns>
        public static string StringConcatenate<T>(this IEnumerable<T> source, Func<T, string> func)
        {
            StringBuilder sb = new StringBuilder();
            foreach (T item in source)
                sb.Append(func(item));
            return sb.ToString();
        }
    }
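    As a quick check of how the extension method reads at the call site, here is a minimal sketch. The StringConcatenateDemo class and its hand-built sample run are made up for this illustration. The extension method itself is repeated only so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class IEnumerableExtensions
{
    public static string StringConcatenate<T>(this IEnumerable<T> source, Func<T, string> func)
    {
        StringBuilder sb = new StringBuilder();
        foreach (T item in source)
            sb.Append(func(item));
        return sb.ToString();
    }
}

// Hypothetical demo class, not part of the original project
public static class StringConcatenateDemo
{
    public static string RunText()
    {
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // A hand-built run (w:r) holding two text elements (w:t)
        XElement run = new XElement(w + "r",
            new XElement(w + "t", "Hello, "),
            new XElement(w + "t", "world!"));

        // Called as if it were an instance method of IEnumerable<XElement>
        return run.Descendants(w + "t").StringConcatenate(element => (string)element);
    }
}
```

    Calling StringConcatenateDemo.RunText() returns "Hello, world!". The compiler rewrites the call into a plain static call to IEnumerableExtensions.StringConcatenate.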

    Lambda expressions

    Another thing you may have noticed in my example code, is a lambda expression:

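    z => z.Name == w_r || z.Name == w_ins || z.Name == w_hyperlink

    A lambda expression is essentially a compact anonymous function. This one is passed to the Where method, which calls it once for every child element of the paragraph and keeps the elements for which it returns true. Lambda expressions can take zero or more parameters and can return any type (or nothing at all); this particular one takes a single parameter and returns true/false. In this case, z is an XElement, and the result depends on its Name property. The call to StringConcatenate uses a second lambda, element => (string)element, which returns a string instead.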
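    To make the different shapes of lambda expressions concrete, here is a small sketch. The LambdaDemo class and its member names are made up for this illustration and are not part of the handler's source.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical demo class, for illustration only
public static class LambdaDemo
{
    // One parameter, returns bool: usable as a filter for Where
    public static readonly Func<XElement, bool> IsRun = z => z.Name.LocalName == "r";

    // Two parameters, returns int
    public static readonly Func<int, int, int> Add = (a, b) => a + b;

    // No parameters, returns string
    public static readonly Func<string> Greeting = () => "Hello world!";

    public static int CountRuns(XElement paragraph)
    {
        // Where calls the lambda once per child element
        // and keeps the elements for which it returns true
        return paragraph.Elements().Where(IsRun).Count();
    }
}
```

    For a paragraph element with the children r, ins and r, LambdaDemo.CountRuns returns 2.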

    Wrapping things up...

    If you read this whole blog post, you may have noticed that I made extensive use of the new language features in C# 3.0 (which shipped with .NET 3.5). I combined these with OpenXML and ASP.NET to create a useful Word document preview handler. If you want the full source code, download it here.


    Categories: ASP.NET | C# | General | ICT | LINQ | Office 2007 | OpenXML | Software | XML

    Comments

    Sam Canada |

    Friday, February 29, 2008 10:14 PM

    Great Post! It works very well with my project.

    Just one little problem though. I have a Word document that contains two paragraphs and an image. The image is placed between the two paragraphs. In the WordVisualizer display, the text of the second paragraph is shown to the right of the image instead of underneath it.

    Is there any way I can fix this?

    maartenba |

    Tuesday, March 04, 2008 9:01 AM

    To make it easy for myself, images are always "left" aligned in this example. You can however find the image part in the docx file and parse all "location" information to place it on the preview correctly.

    snautz Latvia |

    Friday, March 21, 2008 8:08 PM

    There is a problem: no images are shown when the docx is created in Word 2003 with the Office 2007 compatibility pack.

    mikey Australia |

    Wednesday, April 16, 2008 6:03 AM

    Great post.

    I have a question though.. I'm trying to implement bullet and numbered lists. With LINQ, how do I access the numbering.xml document in order to get the information I need to make these lists?

    Cheers,
    Mikey

    dilip India |

    Saturday, May 03, 2008 11:41 AM

    How can I fetch an image from a Word document using C#? I want to fetch the image and store it in a database.

    maartenba |

    Monday, May 05, 2008 8:33 AM

    @Mikey: I think the code shows quite clearly how to access a specific OpenXML part? With LINQ, you'll have to experiment a little on how to loop over all elements in numbering.xml.

    @Dilip: have a look at the code, the handler clearly shows how to retrieve an image part from an OpenXML package.

    mikey Australia |

    Monday, May 05, 2008 9:10 AM

    Yep, you did.
    It was my mistake, I thought that I could somehow browse through the numbering without having to open it separately and iterate through it.
    It's OK though, I opened it separately and it's working now.

    Cheers,
    Mikey

    Pankaj Sharma United States |

    Monday, June 09, 2008 10:35 PM

    Excellent Posting!
    All the best wishes for your future!
    God Bless you.

    Regards,
    Pankaj

    Le Xuan Manh Vietnam |

    Monday, July 07, 2008 6:20 AM

    Great Post!

    But when I insert an Equation object, it is not shown.
    Any idea why?

    Cheers,
    LXManh.

    maartenba Belgium |

    Monday, July 07, 2008 7:50 AM

    Those are actually not converted. Only bitmap/jpeg/gif/png/... images are shown.

    Guy Ellis United States |

    Thursday, September 11, 2008 9:23 PM

    To get this to compile I had to replace paragraph.Runs with paragraph.TextRuns - might be because I'm using .net 3.5 sp1? I'm guessing that Runs has been replaced with TextRuns? I haven't tried to run this yet but thought that I'd comment before I forgot.

    mahmoud Egypt |

    Wednesday, September 17, 2008 11:32 AM

    thanks it was very helpful for me

    Augustin |

    Thursday, November 13, 2008 5:43 PM

    Good job maartenba

    Ramakrishna United States |

    Wednesday, February 18, 2009 2:46 PM

    Hi Mikey,
    The handler seems to be an excellent piece of work. But one thing is missing here: fetching the tables from a Word document and displaying them. Their content seems to be fetched, but the table structure is not generated in your code. Can you give me a LINQ query to fetch the tables and display them? That would be a great help for me.
    Thanks,
    Ramakrishna
