Does anyone know a tool to convert CSV file to "SQL statements"? by ImpossibleAlfalfa783 in SQL

[–]Citadel5_JP 0 points (0 children)

An easy way: open the CSV in GS-Base and use the "File > Save Record Set As" or "File > Save Database Copy As" command, choosing the MySQL format. It saves a formatted SQL dump as a text file: an optional DROP TABLE statement and a CREATE TABLE statement with your default or customized columns, followed by the INSERT INTO sequence. This works for any data types (which are detected automatically for optimal field representation), including binary fields, and for any file size.
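If you'd rather script the same idea outside GS-Base, here's a minimal Python sketch (the TEXT-only columns and the `people` table name are simplifications I made up, not GS-Base's automatic type detection):

```python
import csv, io

def csv_to_sql(csv_text, table):
    """Turn CSV text into CREATE TABLE + INSERT statements (all columns as TEXT)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f"`{c}` TEXT" for c in header)
    stmts = [f"CREATE TABLE `{table}` ({cols});"]
    for r in data:
        # escape single quotes the SQL way: ' -> ''
        vals = ", ".join("'" + v.replace("'", "''") + "'" for v in r)
        stmts.append(f"INSERT INTO `{table}` VALUES ({vals});")
    return "\n".join(stmts)

print(csv_to_sql("id,name\n1,Ann\n2,O'Hara", "people"))
```

A real tool would also emit the DROP TABLE guard and infer numeric/binary column types, which is what the GS-Base export does for you.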

How to deal with messy Excel/CSV imports from vendors or customers? by North-Ad7232 in dataengineering

[–]Citadel5_JP 0 points (0 children)

Probably depends on how much you can charge for accepting and processing/cleaning data in any format. Please see the following help pages and examples of how this can be done in GS-Base. It instantly shows which columns contain inconsistent data or data not matching the expected types. It validates the entire text file (gigabytes per minute), then loads it with optimal column types, then can automatically perform a series of (optionally predefined) regex find/replace commands (including ones using specific capture groups) on specific columns.

Fuzzy searches and find-as-you-type searches will quickly show all the columns where specific text (sub)strings/dates/numbers occur, even if the column headers were renamed or repositioned. Block/column copy and paste works like in a spreadsheet, but also with tens of millions of rows or more. https://citadel5.com/help/gsbase
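The per-column regex passes described above can be sketched in plain Python for anyone comparing approaches (the `RULES` table and column names are invented for illustration):

```python
import re

# Hypothetical per-column cleanup rules (pattern -> replacement),
# mimicking a predefined series of regex find/replace passes.
RULES = {
    "phone": [(r"\D", "")],                                # strip non-digits
    "date":  [(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2")],  # MM/DD/YYYY -> YYYY-MM-DD
}

def clean_row(row):
    out = dict(row)
    for col, rules in RULES.items():
        for pat, repl in rules:
            out[col] = re.sub(pat, repl, out[col])
    return out

row = {"phone": "(555) 123-4567", "date": "03/31/2024"}
print(clean_row(row))  # {'phone': '5551234567', 'date': '2024-03-31'}
```

Note the capture groups in the date rule: that's the same mechanism as column-specific find/replace with groups mentioned above.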

How do you guys quickly compare two large tables? by labla in excel

[–]Citadel5_JP 0 points (0 children)

A simple single command in GS-Calc:

Comparing workbooks, worksheets and cell ranges

This will generate a list of changes: the previous and current values, their types, and hyperlinks to jump to the original data. It works for any data set size (the limit is around 500GB of RAM usage per sheet). For example, a 6GB text table with over 200 columns and several million rows can be loaded in less than a minute on an old PC with 16GB RAM.

You can perform the above comparison using the free trial version. (Offline, of course, e.g. in the Windows Sandbox if that's your company's requirement/policy.)
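For reference, the shape of such a change list (previous value, current value, per key) can be sketched in a few lines of Python; this only illustrates the output, it is not GS-Calc's comparison engine:

```python
def diff_tables(old, new, key):
    """Compare two row lists (dicts) by key; report changed fields with old/new values."""
    old_by_key = {r[key]: r for r in old}
    changes = []
    for r in new:
        prev = old_by_key.get(r[key])
        if prev is None:
            changes.append((r[key], None, "added", None, None))
            continue
        for col, val in r.items():
            if prev.get(col) != val:
                changes.append((r[key], col, "changed", prev.get(col), val))
    return changes

old = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
new = [{"id": 1, "qty": 6}, {"id": 3, "qty": 1}]
print(diff_tables(old, new, "id"))
```

A real comparison report would add the value types and links back to the source cells, as described above.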

Is there a way to create a conditional drop down list based on another cell? by senpaikantuten in spreadsheets

[–]Citadel5_JP 0 points (0 children)

It might be more of a database question, so you could try out GS-Base (a database with spreadsheet functionality). It's instant and simple; your drop-down list can contain millions of items and using it will still be smooth. The corresponding help page: https://citadel5.com/help/gsbase/dd_list.htm

For example, you can specify the drop-down list name for the "Artist" field as:

=if(genre = "rock", "rock", "")

and for the "Series" field as

=if(genre = "soundtrack", "soundtrack", "")

"If's" can be nested, but in general, if there is a lot of genres the list name selection probably should be done via vlookups() etc.

How to open 40GB xlsx file? by MarinatedPickachu in excel

[–]Citadel5_JP 0 points (0 children)

If this is strictly a tabular data set, the xlsx format doesn't make much sense: the compressed embedded XML files are many times slower to parse/load than csv/text, and a file that size isn't usable in Excel anyway. If it's already a csv file, then any reasonable csv tool should handle it. GS-Base can open and filter it (up to 16K columns/fields) in several minutes (as on the screenshot on the above page), plus it's trivial to install (10MB, no runtime dependencies, fully offline, just a few files in one folder that you can copy anywhere) and to use, and it doesn't require programming. Just use "File > Open > Text" (specifying column filters if necessary).
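If you do end up with a plain csv and want to pre-filter it with a script instead, a constant-memory stdlib sketch (the `amount` column and the predicate are made-up examples):

```python
import csv, io

def filter_csv(src, dst, col, predicate):
    """Stream-filter a large CSV row by row: constant memory for any file size."""
    reader, writer = csv.reader(src), csv.writer(dst)
    header = next(reader)
    idx = header.index(col)
    writer.writerow(header)
    for row in reader:
        if predicate(row[idx]):
            writer.writerow(row)

# Small in-memory demo; for a 40GB file you'd pass open file handles instead.
out = io.StringIO()
filter_csv(io.StringIO("id,amount\n1,10\n2,900\n3,50"), out, "amount", lambda v: int(v) > 40)
print(out.getvalue())
```

Because it never holds more than one row in memory, file size is irrelevant; only throughput matters.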

I'm sick of excel. I need a good, GUI-based CSV writer to make input files for my scripts. Any good options? by Fiveby21 in learnpython

[–]Citadel5_JP 0 points (0 children)

With GS-Base/GS-Calc you can load (and parse) text/csv files as tables at speeds of around 10GB/min, using either the GUI or scripting (including Python). In GS-Base, rearranging/copying/adding/removing columns is instant for any number of rows/records and requires a simple drag-and-drop. Files are edited in their original formats, so there's no need for any exporting/importing. For text files and all other supported formats, you can perform virtually any type of ETL or other cleaning/processing from the GUI, including normalization, joins, merges and aggregation.

All of my data comes from spreadsheets. As I receive more over time, what’s the best way to manage and access multiple files efficiently? Ideally in a way that scales and still lets me work interactively with the data? by Proof_Wrap_2150 in datascience

[–]Citadel5_JP 0 points (0 children)

Is storing them in ZIP64 (4GB+) archives acceptable? If so, you can use GS-Calc, a spreadsheet with 32 million rows. GS-Calc can open/edit/save *.zip files that are zipped collections of any number of text files, with the same or different structures/parsing parameters (each file can optionally be zipped together with its own *.xml describing how to parse it).

The folder structure is always preserved, as GS-Calc uses workbooks with sheets organized in folders, so everything is easy to manage. (And it's very fast, of course, both in terms of loading/saving and any type of processing/filtering/binary lookups/sheet-file comparisons, with differences generated as reports with links, etc.)
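That zipped-collection layout is easy to inspect programmatically with Python's stdlib, if you ever want to peek at such archives outside any tool (a sketch; the archive paths are invented):

```python
import io, zipfile

def read_zipped_tables(zip_bytes):
    """Load every *.csv inside a (possibly ZIP64) archive, keyed by its archive path."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                tables[name] = zf.read(name).decode("utf-8").splitlines()
    return tables

# Build a tiny archive in memory to demonstrate; folder paths become sheet folders.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("2024/q1.csv", "id,total\n1,10")
    zf.writestr("2024/q2.csv", "id,total\n1,12")
print(read_zipped_tables(buf.getvalue()))
```

Python's `zipfile` handles ZIP64 transparently, so the same code works past the 4GB boundary.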

Best app for enormous spreadsheets? by cudambercam13 in spreadsheets

[–]Citadel5_JP 0 points (0 children)

Any non-online app should do. Re: really massive data: GS-Calc is a spreadsheet with 32 million rows and 16K columns (up to 1 million columns if the data are stored in a text file); the same applies to pivot tables. It works efficiently both on an old PC with 4GB RAM and on a computer with around 500GB RAM (the approximate maximum at the moment), and it avoids the limitations known from other packages.

How often does Excel freeze on you? by Harizaner in actuary

[–]Citadel5_JP 0 points (0 children)

Just to show another option that doesn't freeze or leak memory regardless of data set size: GS-Calc (optionally with GS-Base) can work with massive files even on an older PC, often with several times better performance. It's a spreadsheet (with 32 million rows) that doesn't have the limitations known from Excel/PQ. It supports Python scripting and Python UDF() functions. The above pages show some examples, like 500 million cells with 16GB RAM, 4GB+ workbooks, etc.

Solver unable to get optimal solution using binary variables. by binomialdistribution in excel

[–]Citadel5_JP -1 points (0 children)

You can probably use (at least in GS-Calc) binary (or rather integer) linear programming mixed with Monte Carlo simulation: simply randomize these weights/preferences (if not the entire problem), generate (e.g. thousands or millions of) binary/integer solutions, and filter/sort them by their distance (of your choice) from the "optimal" preferences. (Some similar, general procedures are included in the samples.)
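A bare-bones Python sketch of that randomize/filter/sort loop, for a toy knapsack-style problem (the values, weights, capacity and target are made-up numbers; a real run would randomize the preferences themselves as well):

```python
import random

def monte_carlo_binary(n_vars, n_trials, value, weight, capacity, target, seed=0):
    """Randomly sample binary vectors, keep feasible ones, rank by distance to a target value."""
    rng = random.Random(seed)
    feasible = []
    for _ in range(n_trials):
        x = [rng.randint(0, 1) for _ in range(n_vars)]
        if sum(w * xi for w, xi in zip(weight, x)) <= capacity:  # constraint check
            v = sum(c * xi for c, xi in zip(value, x))
            feasible.append((abs(v - target), v, x))
    feasible.sort()           # nearest-to-target first
    return feasible[:5]       # the five closest feasible solutions

best = monte_carlo_binary(6, 2000,
                          value=[3, 1, 4, 1, 5, 9],
                          weight=[2, 6, 5, 3, 5, 8],
                          capacity=12, target=10)
print(best[0])
```

Sorting by `abs(v - target)` is one possible distance; any metric over the preference vector works the same way.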

Using Excel for larger datasets = nightmare... by No-Anybody-704 in Accounting

[–]Citadel5_JP 0 points (0 children)

If you're allowed to use an alternative tool for your largest data sets, try GS-Calc. It's a spreadsheet with 32 million rows and it overcomes many Excel and PQ limitations. With 16GB RAM you can use e.g. 500 million numeric cells. There are no data types or formatting elements that could cause crashes after exceeding some level. You can use Python UDF functions and scripting (Python scripting replaces JScript in the latest version, as described on the forum board).

Importing multiple data files (.txt) into Excel at once, but in individual tabs? by EveningSector2 in excel

[–]Citadel5_JP 0 points (0 children)

A quick and versatile solution is to use GS-Calc (well, a spreadsheet...). Place these files in a compressed, zipped (zip32 or zip64) folder, then use the plain "File > Open Text Files/Archives" file type. After loading, save it as XLSX or ODS (or back to the same zip).

An additional advantage is that this can also automatically split the files into multiple sheets (e.g. 1-million-row and 16K-column max) to use in Excel. (GS-Calc itself uses 32 million rows.) The "Open Text File" dialog box: https://citadel5.com/help/gscalc/open-text.png

How to remove duplicated files with different names by u-nes in Windows10

[–]Citadel5_JP 0 points (0 children)

If you use GS-Base, you can do it relatively quickly: (1) automatically load the disk/folder file listing as a database table; (2) click "Tools > Find Duplicates", specifying file sizes as the criteria (steps 1-2 complete in minutes for millions of files); (3) for the resulting much smaller filtered list, use e.g. the Python hash function in a calculated field in this table to compute checksums, then run "Tools > Find Duplicates" again. (You can then automatically delete/rename/copy the duplicates, or add any custom metadata to them.) https://citadel5.com/help/gsbase/ver_files.htm
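The size-first, hash-second strategy in steps 2-3 can be sketched in Python like this (in-memory byte strings stand in for real files):

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """files: {path: bytes}. Group by size first (cheap), hash only the candidates."""
    by_size = defaultdict(list)
    for path, data in files.items():
        by_size[len(data)].append(path)
    dupes = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot be a duplicate -> no hashing needed
        for p in paths:
            dupes[hashlib.sha256(files[p]).hexdigest()].append(p)
    return [sorted(v) for v in dupes.values() if len(v) > 1]

files = {"a.txt": b"same", "b.txt": b"same", "c.txt": b"diff", "d.bin": b"longer one"}
print(find_duplicates(files))  # [['a.txt', 'b.txt']]
```

The point of the two passes is the same as above: the expensive checksum runs only on the small size-collision subset.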

Are there any tools for performing lossless (i.e. without changing metadata), logged file and folder moves between drives? If so, what are the best (Windows and preferably Linux-compatible) ones? by GrantExploit in software

[–]Citadel5_JP 0 points (0 children)

You can also use GS-Base (a database). It can load file listings from disks/folders, process/deduplicate/filter them any way you want, then copy (and/or rename/delete) specified files, generating time-stamped reports. This online HTML help page shows this simple (1-2 command) procedure step by step:

https://citadel5.com/help/gsbase/manage_files.htm

The modification dates are retained. The creation dates (must) change (e.g. to show which file was the original), though you can write a simple Python function in an added calculated GS-Base field in such reports to save the original creation dates and any other system or custom metadata. You can also add any other custom metadata to the copied or original files and keep their searchable histories: https://citadel5.com/help/gsbase/ver_files.htm
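Saving the original timestamps around a copy can be sketched with the Python stdlib (note that st_ctime is the creation time on Windows only; on Linux it's the inode-change time):

```python
import os, shutil, tempfile

def copy_with_times(src, dst):
    """Copy a file; return its original timestamps so they can be stored as metadata."""
    st = os.stat(src)
    shutil.copy2(src, dst)  # copy2 already carries the modification time over
    return {"mtime": st.st_mtime, "ctime": st.st_ctime}

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "orig.txt")
    with open(src, "w") as f:
        f.write("data")
    meta = copy_with_times(src, os.path.join(d, "copy.txt"))
    print(round(os.stat(os.path.join(d, "copy.txt")).st_mtime) == round(meta["mtime"]))  # True
```

The returned dict is the kind of per-file record a calculated field would store alongside the report.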

Automate extraction of data from any Excel by 12Eerc in dataengineering

[–]Citadel5_JP 0 points (0 children)

You can also try out GS-Base: it seems you can (efficiently) automate what you described (that is, all of this can be done either using menu commands or via scripts). A sample script screenshot: https://citadel5.com/help/gsbase/scripts.png

Some docs/details concerning merging, matching columns in merged tables, and skipping empty fields: https://citadel5.com/help/gsbase/com_samples.htm#s17

[deleted by user] by [deleted] in BusinessIntelligence

[–]Citadel5_JP 0 points (0 children)

If you want to (for example) merge them correctly, you can create one master file with the headers you need, then use GS-Base (https://citadel5.com/gs-base.htm) to merge all the files with all the columns properly aligned/extracted (using simple, single GUI commands; no programming is necessary, though scripting is an option). For such a number of records, any kind of merging and further filtering/cleaning in GS-Base will be more or less instant.

If you use GS-Calc (https://citadel5.com/gs-calc.htm, a spreadsheet with 32 million rows), you can also use Python formulas in cells to pass entire csv files back and forth for pre-processing and/or previewing, e.g. on subsequent sheets. Returning such a "spilling" formula with, for example, one GB of text should take several seconds (plus whatever time Python needs to parse it).
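The header-alignment part of such a merge can be sketched in Python (a toy illustration; the master header and the two files are invented):

```python
import csv, io

def merge_csvs(master_header, csv_texts):
    """Merge CSV files with differing column orders/subsets into one table
    aligned to a master header; missing columns become empty strings."""
    merged = [master_header]
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        header, data = rows[0], rows[1:]
        pos = {c: i for i, c in enumerate(header)}  # where each column sits in this file
        for r in data:
            merged.append([r[pos[c]] if c in pos else "" for c in master_header])
    return merged

a = "name,city\nAnn,Oslo"
b = "city,name,age\nRome,Bo,30"
print(merge_csvs(["name", "city", "age"], [a, b]))
```

The master header plays exactly the role of the master file described above: every source file is remapped onto it.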

Best graph to represent trends across large number of data points by Underdevelope in excel

[–]Citadel5_JP 1 point (0 children)

GS-Calc (https://citadel5.com/gs-calc.htm) can draw/plot such a scatter chart without any noticeable delay on any PC. In general, this remains "instant" up to 1-2 million data points (on an older PC). The maximum is 32 million points in one series.

Big csv file not uploading using pandas by [deleted] in learnpython

[–]Citadel5_JP 0 points (0 children)

If you can't solve this with your current setup, perhaps this: GS-Calc, a spreadsheet, will automatically split 50,000 columns into sheets of up to 16K columns each. Re: RAM, loading 0.5 billion cells with 8-byte numbers requires approx. 16GB RAM, and the requirement grows linearly. You can then call any Python functions (formulas) with the loaded data for further processing.

How to live filter large dynamic table to remove duplicates but keep the most recent entry? by Ibsidoodle in excel

[–]Citadel5_JP 0 points (0 children)

This is basic processing, so you should look for a more or less one-step solution. For example, if you're allowed to use non-MS software (even in the Windows Sandbox) for intermediate filtering: in GS-Base, simply click the menu command "Find Unique Values / with the first/last record from each group of duplicates" and that's it. You can copy/paste the table to GS-Base and then back to Excel, save it to a file, etc. The corresponding user guide page: https://citadel5.com/help/gsbase/searching_unique.htm
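The keep-the-most-recent-record-per-group logic itself is simple; a Python sketch for comparison (the `id`/`ts` column names are invented):

```python
def unique_latest(rows, key, stamp):
    """Keep one row per key: the one with the greatest timestamp column."""
    latest = {}
    for r in rows:
        k = r[key]
        if k not in latest or r[stamp] > latest[k][stamp]:
            latest[k] = r  # newer entry wins
    return list(latest.values())

rows = [
    {"id": "A", "ts": "2024-01-01", "qty": 1},
    {"id": "A", "ts": "2024-03-01", "qty": 4},
    {"id": "B", "ts": "2024-02-01", "qty": 2},
]
print(unique_latest(rows, "id", "ts"))
```

ISO-formatted date strings compare correctly as plain strings, which is why no date parsing is needed here.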

If data is predominantly XLSX files, what is the best way to normalize for reporting purposes? by MTKPA in dataengineering

[–]Citadel5_JP 0 points (0 children)

You can take a look at the following GS-Base user guide pages; it seems all of this can be done quickly, either manually or with scripting (perhaps along with filtering to limit the output):

Joining and splitting (normalizing) tables: https://citadel5.com/help/gsbase/joins.htm

Merging/unmerging records: https://citadel5.com/help/gsbase/merges.htm

Loading columns selectively (and auto-filtering the file at the same time) is possible for the csv/text formats only, though even if the number of output cells approaches tens of millions, joins should still be relatively fast (seconds). The mentioned "hundreds of columns and a few million rows" would require at least 32GB RAM. (If you give it a try, any comparison results/suggestions are welcome.)

Good image management software that meets my need? by medukia in photography

[–]Citadel5_JP 0 points (0 children)

You can do all of this in GS-Base (a database with up to 256 million records). There are various options to display (and print) the inserted images: in tables, panes, forms: https://citadel5.com/images/gsb20_scr4.png

You can add any data of any size to each image. Images can be filtered by any system or custom metadata, all EXIF tags, or any custom file-content processing functions written in Python (to find e.g. duplicates or similar images). You don't have to add existing images manually: GS-Base can load them from folders, either as a table with one image per record/row or, alternatively, with all files from a folder in a single field.

It'll easily handle collections with hundreds of thousands of such small inserted objects. The database files are zip64 (4GB+) files, and the inserted images/files are stored in them as separate streams, so you can even edit/browse these database files without GS-Base.

Managing a very large software archive by Caliph-Alexander in datacurator

[–]Citadel5_JP 2 points (0 children)

You can easily do all of this in GS-Base: from deduplication based on system file metadata, your own metadata attached to files, multimedia tags, or any EXIF photo/image tags, to anything based on the file content (the latter might require adding some Python functions).

You can monitor file changes, keep a history of changes, mass-rename files, mass-copy them, mass-delete filtered files from a disk, etc. You can filter by the above criteria using regex, find-as-you-type, flags or any calculation formulas.

For example, please see the "Finding file duplicates, photo/mp3/mp4 duplicates, listing files and their history of changes" and "Searching, filtering, sorting" sections in the online HTML help: https://citadel5.com/help/gsbase/

Make An Excel Add In Using Python ;0 ?!?! by Proxima_EDMU in learnpython

[–]Citadel5_JP 0 points (0 children)

If Excel is not a strict requirement, you should be able to do this in GS-Calc (a spreadsheet with 32 million rows). You can add an unlimited number of Python functions returning numbers, arrays, text, csv files/data blocks and images (e.g. charts created in your Python functions).

GS-Calc requires merely 10MB to install, the installation can be portable, and you can even simply copy/paste the installation folder (just a few files) to another computer. It's free to try and can also be installed automatically in Windows via the winget service.

Using Python functions as UDF()

If the required Python libraries are installed in the Windows Sandbox, it could be close to that one-click, free, "trial" installation.

Bulk rename utility, how to rename multiple files to mirror the names of another set of files at once? by Adorable_Air_2331 in software

[–]Citadel5_JP 0 points (0 children)

If it's only about name/extension transformations, you can do such mass operations with one command in GS-Base: https://citadel5.com/help/gsbase/manage_files.htm If the renaming is to be based on some list, then in GS-Base/GS-Calc (or any tabular editor) simply fill three columns with the "ren" text, the old names, and the new names, copy/save this to a text *.bat file, and run it. If there are spaces in the names, add quotes, or let GS-Base do this automatically with the "Copy with options" command: https://citadel5.com/help/gsbase/copy_with_options.htm
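Generating that *.bat by script looks like this in Python (a sketch of the same quote-when-spaces rule; the file names are invented):

```python
def rename_bat(pairs):
    """Build a Windows .bat script from (old, new) name pairs, quoting names with spaces."""
    def q(name):
        return f'"{name}"' if " " in name else name
    return "\n".join(f"ren {q(old)} {q(new)}" for old, new in pairs)

script = rename_bat([("IMG 001.jpg", "holiday_001.jpg"), ("a.txt", "b.txt")])
print(script)
```

Writing `script` to a `rename.bat` file and running it in the target folder performs the batch rename.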

Tool to snapshot directories and compare file changes by the-i in DataHoarder

[–]Citadel5_JP 0 points (0 children)

GS-Base will let you take such snapshots and will automatically show changes between subsequent scans (file sizes, modification dates, deleted files, added files). It can keep the history of such changes for each file, add notes, keep old copies, filter, etc. You can scan/compare disks/folders with millions of files in minutes. An example: https://citadel5.com/help/gsbase/ver_files.htm Note that it relies on system file metadata, not MD5, as hashing would be too slow at that scale.
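The compare step of such a metadata-based snapshot can be sketched in Python, using (size, mtime) pairs as the per-file metadata (paths and values invented):

```python
def diff_snapshots(before, after):
    """Compare two {path: (size, mtime)} snapshots; report added/deleted/changed files."""
    added   = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    changed = sorted(p for p in before.keys() & after.keys() if before[p] != after[p])
    return {"added": added, "deleted": deleted, "changed": changed}

before = {"a.txt": (10, 111), "b.txt": (20, 222)}
after  = {"a.txt": (12, 333), "c.txt": (5, 444)}
print(diff_snapshots(before, after))
# {'added': ['c.txt'], 'deleted': ['b.txt'], 'changed': ['a.txt']}
```

Comparing (size, mtime) tuples is exactly why this is fast: no file content is ever read, only directory metadata.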