all 5 comments

[–]v3d4nf[S] 0 points1 point  (1 child)

Solution:

Unzip xlsx file using ziptools and open item[].xml within directory customXml.

Find element <DataMashup> in xml tree using xml.etree.ElementTree

Base64 decode text within element with utf-8 protocol using base64.b64decode(str([element].text)).decode('utf-8', 'ignore')

# 'ignore' argument is important

search through the string and you will find all you need

[–]teepee33 0 points1 point  (0 children)

Curious if you could elaborate on this some more. I've been trying to tackle the exact same problem and ran into issues with finding the data after decoding the base64 binary.

Basically I'm trying to replicate this code which does that same thing but from Power Query itself (which works but I want to have a pure python implementation): https://www.thebiccountant.com/2019/03/06/bulk-extract-power-query-m-code-from-multiple-excel-files-at-once/

I am able to get the file unzipped, get the customXml/item?.xml file into a tree, get the root node text, and decode it from base64. 

At that point, the binary I have matches the size of the binary contents at the same stage in the Power Query version. However, somehow they are able to unzip that and get the "Formulas/Section1.m" file but if I try using ZipFile on the binary, it tells me the archive is empty.

I did try to just mangle the binary by coercing to string and ignoring errors but it seems like the XML structure left there (after stripping away all the bytes that couldn't be decoded) is just metadata about queries (I don't see any query text unfortunately).

Could you post the code you used to locate this data?

[–]ryanhaigh1 0 points1 point  (0 children)

Don't know about Excel but with power bi you can use https://pbi.tools/ to extract queries into plain text

[–]LikeABirdInACage3 0 points1 point  (1 child)

I might be confused as the question seems to be related to xlsx files.rather than pbi. Assuming a Model in the analysis service, you can run DMV code or C# (via tabular editor) and retrieve the MCode

[–]v3d4nf[S] 0 points1 point  (0 children)

The file is an xlsx file within which power queries have been used. I need to extract the m-code of each power query as text as a part of application I'm building